About train data of clf.py #53

Open
svjack opened this issue Jan 28, 2021 · 7 comments

@svjack

svjack commented Jan 28, 2021

Hi,
I reviewed the code and have two questions.

The first is about the wikidata.csv used to train the classifier in clf.py. The samples are mostly English, but there is also a small amount of data from other languages, such as Japanese, and some samples are upper-cased. This data trains a classifier over the question's meaning format for SQL, yet the TensorFlow Hub model is trained on English only, and the question input always seems to be English. Why do you include some multilingual samples and upper-case transformations? It looks as if you want the classifier to handle multilingual input as well, and to cope with upper case as SQL-style input. If that is the goal, why not use a multilingual embedding from TF Hub, extend the samples with some NMT translations, and apply case transformations as augmentation methods?
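For example, something along these lines. This is only a rough sketch of my suggestion, not your actual clf.py; the TF Hub model here is the multilingual Universal Sentence Encoder, and the augmentation helper is hypothetical:

```python
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the ops the model needs

# Multilingual Universal Sentence Encoder instead of the English-only model.
embed = hub.load(
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

def augment_cases(questions):
    """Add lower- and upper-cased variants of every training question."""
    out = []
    for q in questions:
        for variant in (q, q.lower(), q.upper()):
            if variant not in out:
                out.append(variant)
    return out

samples = augment_cases(["How many cities have a population above 1 million?"])
embeddings = embed(samples)  # one 512-d vector per question
```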

The second question: since the project is built mainly on pre-trained models and can be used without training, the lexical parsing of the input question to identify intent relies mostly on custom (pre-assigned) keywords defined in the adapt methods of the ColumnType subclasses (such as Number and Date), so the project mainly targets questions with a simple lexicon. If I want to use it with questions in other languages (such as Chinese or Japanese), it seems I could use a simple NMT model to translate them into English and then use your model, without replacing the keywords defined in those adapt methods (because the question lexicon is simple, the translated question should be well formed).

As we all know, the schema or column names defined in a database table or pandas DataFrame are usually in English, while the table content may be in other languages. Here I have to make a choice about translating that content: if I also translate the content into English, this seems to work; if I don't, the QA function defined in your nlp.py would need a multilingual SQuAD transformer (some RoBERTa model).
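The translate-then-answer idea would look roughly like this. It is only a sketch: run_english_tableqa is a hypothetical stand-in for your existing English-only pipeline, and I assume EasyNMT with the opus-mt models:

```python
from easynmt import EasyNMT

nmt = EasyNMT("opus-mt")

def multilingual_query(question, table, run_english_tableqa):
    # Translate the (lexically simple) question into English first...
    question_en = nmt.translate(question, target_lang="en")
    # ...then reuse the English-only pipeline unchanged.
    return run_english_tableqa(question_en, table)
```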

All I want to do is adapt this project from English-only tableQA to multilingual tableQA. Since the input questions and table data are lexically simple, will this feature be supported in the project in the future?

@abhijithneilabraham
Owner

Multilingual support seems like a good suggestion. What you need is for the QA model and the clf.py classifier to be available for multiple languages. Also, the tokenization, lemmatisation, etc. are done mainly for English, so everything there would probably need to change too. My best guess is that we can adapt to languages like German and French, which follow a structure similar to English and could follow similar rules for lemmatisation, tokenisation, etc.
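For the tokenisation/lemmatisation part, switching pipelines per language could look like the sketch below. This assumes spaCy; our actual preprocessing code may differ:

```python
import spacy

# Language-specific spaCy pipelines for English-like languages.
PIPELINES = {
    "english": "en_core_web_sm",
    "german": "de_core_news_sm",
    "french": "fr_core_news_sm",
}

def lemmatize(question, lang="english"):
    nlp = spacy.load(PIPELINES[lang])
    return [token.lemma_ for token in nlp(question)]
```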

@svjack
Author

svjack commented Jan 28, 2021

> Multilingual support seems like a good suggestion. What you need is for the QA model and the clf.py classifier to be available for multiple languages. Also, the tokenization, lemmatisation, etc. are done mainly for English, so everything there would probably need to change too. My best guess is that we can adapt to languages like German and French, which follow a structure similar to English and could follow similar rules for lemmatisation, tokenisation, etc.

I tried EasyNMT
https://github.com/UKPLab/EasyNMT
with the opus-mt models to translate your wikidata.csv into other languages, and the translations look well formed on inspection.
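Roughly what I did, assuming the text column in wikidata.csv is named "question" (the real column name may differ):

```python
import pandas as pd
from easynmt import EasyNMT

nmt = EasyNMT("opus-mt")
df = pd.read_csv("wikidata.csv")

# Batch-translate the English questions into a target language, e.g. Chinese.
df["question_zh"] = nmt.translate(df["question"].tolist(),
                                  source_lang="en", target_lang="zh")
df.to_csv("wikidata_zh.csv", index=False)
```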

@abhijithneilabraham
Owner

That's okay for wikidata.csv, but what about the QA models?

@svjack
Author

svjack commented Jan 28, 2021

> That's okay for wikidata.csv, but what about the QA models?

https://huggingface.co/deepset/xlm-roberta-large-squad2
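For example, via the transformers pipeline. This is just a sketch; in the real integration the context would be the serialized table rows from nlp.py:

```python
from transformers import pipeline

# Multilingual extractive QA model trained on SQuAD2.
qa = pipeline("question-answering",
              model="deepset/xlm-roberta-large-squad2")

# Question: "Which city has the largest population?" -- context in Chinese.
result = qa(question="哪个城市的人口最多？",
            context="北京的人口约两千二百万，上海约两千五百万。")
print(result["answer"])
```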

@abhijithneilabraham
Owner

Interesting. Let's give this a shot. Would you be able to contribute while we're at this task? It would also be good to have some fluency in the respective language.

@svjack
Author

svjack commented Jan 28, 2021

I will try to adapt it to Chinese first.

@abhijithneilabraham
Owner

Cool. Feel free to ask questions as you go.
Also, the multilingual support should be optional, as you already know: something like agent.query_db(lang="chinese").
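A rough sketch of that optional-language API; query_db, the language mapping, and the internal English path here are hypothetical, not the current interface:

```python
from easynmt import EasyNMT

LANG_CODES = {"chinese": "zh", "japanese": "ja"}  # assumed mapping
nmt = EasyNMT("opus-mt")

class Agent:
    def query_db(self, question, lang="english"):
        if lang != "english":
            # Translate the question to English, then reuse the existing path.
            question = nmt.translate(question,
                                     source_lang=LANG_CODES[lang],
                                     target_lang="en")
        return self._query_db_english(question)  # hypothetical internal call
```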
