About train data of clf.py #53

Open
svjack opened this issue Jan 28, 2021 · 7 comments

@svjack

svjack commented Jan 28, 2021

Hi,
I reviewed the code and have two questions.

The first is about the wikidata.csv used to train the classifier in clf.py. The samples are mostly English, but there is also a small amount of data from other languages, such as Japanese, and some samples are upper-cased. This data trains a classifier over the question's meaning format for SQL, yet the TensorFlow Hub model is trained on English only, and the question input always seems to be English. Why do you include some multilingual samples and upper-case transformations? It looks as if you want the classifier to handle multilingual input as well, and to cope with upper case as SQL-style input. If that is the goal, why not use a multilingual embedding from TF Hub, extend the samples with some NMT translations, and apply case transformations as augmentation methods?
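For example, something along these lines. This is only a rough sketch of my suggestion, not your actual clf.py; the TF Hub model here is the multilingual Universal Sentence Encoder, and the augmentation helper is hypothetical:

```python
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the ops the model needs

# Multilingual Universal Sentence Encoder instead of the English-only model.
embed = hub.load(
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

def augment_cases(questions):
    """Add lower- and upper-cased variants of every training question."""
    out = []
    for q in questions:
        for variant in (q, q.lower(), q.upper()):
            if variant not in out:
                out.append(variant)
    return out

samples = augment_cases(["How many cities have a population above 1 million?"])
embeddings = embed(samples)  # one 512-d vector per question
```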

The second question: since the project is built mainly on pre-trained models and can be used without training, the lexical parsing of the input question to identify intent relies mostly on custom (pre-assigned) keywords defined in the adapt methods of the ColumnType subclasses (such as Number and Date), so the project mainly targets questions with a simple lexicon. If I want to use it with questions in other languages (such as Chinese or Japanese), it seems I could use a simple NMT model to translate them into English and then use your model, without replacing the keywords defined in those adapt methods (because the question lexicon is simple, the translated question should be well formed).

As we all know, the schema or column names defined in a database table or pandas DataFrame are usually in English, while the table content may be in other languages. Here I have to make a choice about translating that content: if I also translate the content into English, this seems to work; if I don't, the QA function defined in your nlp.py would need a multilingual SQuAD transformer (some RoBERTa model).
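The translate-then-answer idea would look roughly like this. It is only a sketch: run_english_tableqa is a hypothetical stand-in for your existing English-only pipeline, and I assume EasyNMT with the opus-mt models:

```python
from easynmt import EasyNMT

nmt = EasyNMT("opus-mt")

def multilingual_query(question, table, run_english_tableqa):
    # Translate the (lexically simple) question into English first...
    question_en = nmt.translate(question, target_lang="en")
    # ...then reuse the English-only pipeline unchanged.
    return run_english_tableqa(question_en, table)
```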

All I want to do is adapt this project from English-only tableQA to multilingual tableQA. Since the input questions and table data are lexically simple, will this feature be supported in the project in the future?

@abhijithneilabraham
Owner

Multilingual support seems like a good suggestion. What you need is for the QA model and the clf.py classifier to be available for multiple languages. Also, the tokenization, lemmatisation, etc. are done mainly for English, so everything there would probably need to change too. My best guess is that we can adapt to languages like German and French, which follow a structure similar to English and could follow similar rules for lemmatisation, tokenisation, etc.
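For the tokenisation/lemmatisation part, switching pipelines per language could look like the sketch below. This assumes spaCy; our actual preprocessing code may differ:

```python
import spacy

# Language-specific spaCy pipelines for English-like languages.
PIPELINES = {
    "english": "en_core_web_sm",
    "german": "de_core_news_sm",
    "french": "fr_core_news_sm",
}

def lemmatize(question, lang="english"):
    nlp = spacy.load(PIPELINES[lang])
    return [token.lemma_ for token in nlp(question)]
```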

@svjack
Author

svjack commented Jan 28, 2021

> Multilingual support seems like a good suggestion. What you need is for the QA model and the clf.py classifier to be available for multiple languages. Also, the tokenization, lemmatisation, etc. are done mainly for English, so everything there would probably need to change too. My best guess is that we can adapt to languages like German and French, which follow a structure similar to English and could follow similar rules for lemmatisation, tokenisation, etc.

I tried EasyNMT
https://github.com/UKPLab/EasyNMT
with the opus-mt models to translate your wikidata.csv into other languages, and the translations look well formed on inspection.
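Roughly what I did, assuming the text column in wikidata.csv is named "question" (the real column name may differ):

```python
import pandas as pd
from easynmt import EasyNMT

nmt = EasyNMT("opus-mt")
df = pd.read_csv("wikidata.csv")

# Batch-translate the English questions into a target language, e.g. Chinese.
df["question_zh"] = nmt.translate(df["question"].tolist(),
                                  source_lang="en", target_lang="zh")
df.to_csv("wikidata_zh.csv", index=False)
```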

@abhijithneilabraham
Owner

That's okay for wikidata.csv, but what about the QA models?

@svjack
Author

svjack commented Jan 28, 2021

> That's okay for wikidata.csv, but what about the QA models?

https://huggingface.co/deepset/xlm-roberta-large-squad2
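For example, via the transformers pipeline. This is just a sketch; in the real integration the context would be the serialized table rows from nlp.py:

```python
from transformers import pipeline

# Multilingual extractive QA model trained on SQuAD2.
qa = pipeline("question-answering",
              model="deepset/xlm-roberta-large-squad2")

# Question: "Which city has the largest population?" -- context in Chinese.
result = qa(question="哪个城市的人口最多？",
            context="北京的人口约两千二百万，上海约两千五百万。")
print(result["answer"])
```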

@abhijithneilabraham
Owner

Interesting. Let's give this a shot. Would you be able to contribute while we're at this task? It would also be good to have some fluency in the respective language.

@svjack
Author

svjack commented Jan 28, 2021

I will try to adapt it to Chinese first.

@abhijithneilabraham
Owner

Cool. Feel free to ask questions as you go.
Also, the multilingual support should be optional, as you already know: something like agent.query_db(lang="chinese").
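A rough sketch of that optional-language API; query_db, the language mapping, and the internal English path here are hypothetical, not the current interface:

```python
from easynmt import EasyNMT

LANG_CODES = {"chinese": "zh", "japanese": "ja"}  # assumed mapping
nmt = EasyNMT("opus-mt")

class Agent:
    def query_db(self, question, lang="english"):
        if lang != "english":
            # Translate the question to English, then reuse the existing path.
            question = nmt.translate(question,
                                     source_lang=LANG_CODES[lang],
                                     target_lang="en")
        return self._query_db_english(question)  # hypothetical internal call
```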
