Fine-tune crosslingual model for language detection #30

artitw · 2022-01-01T19:28:11Z

Two approaches to try:

1. Use crosslingual embeddings as input to an MLP or tree-based model in transfer learning fashion
2. Fine-tune the crosslingual translator with a softmax output

Mofetoluwa · 2022-01-03T21:59:38Z

I'd get started on this

artitw · 2022-01-03T22:05:07Z

Awesome, I've assigned you to this project. Let's keep track of progress here.

Mofetoluwa · 2022-01-03T22:34:28Z

Alright, sure :)

Mofetoluwa · 2022-01-11T20:42:26Z

Hi Artit @artitw

Please I need your help, I’m facing some roadblocks.

I decided to start with the second approach you suggested, which is fine-tuning the cross-lingual translator with a softmax output. My thought process for this:

1. Get a language detection dataset. I decided to work with wili_2018 (https://arxiv.org/pdf/1801.07779v1.pdf, https://zenodo.org/record/841984#.Yd1XWRPMK3I). What do you think about it, and do you have any other datasets in mind?
2. Looking at the Fitter class in the code, I'm not exactly sure how I can fine-tune the translator to train on the identification dataset. Since it requires both the source and target languages, I was thinking of writing another module to do the fine-tuning, but I was not sure whether that was necessary. Do you have any ideas on how I can apply the softmax output to the already existing translator?

Also, does this sound like the right track for approach number 2?

artitw · 2022-01-12T01:51:02Z

Very much appreciate the updates on this. The dataset you cite looks appropriate; I suggest filtering for the languages which the pretrained model supports for tokenization.
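The filtering step suggested above could look roughly like the sketch below. The set of supported codes is a hypothetical placeholder, not the pretrained model's actual list, and the dataset's own labels may need mapping to the model's codes before this comparison works.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical subset of language codes; the real list should come from
# the pretrained model's own documentation.
SUPPORTED_LANGUAGES: Set[str] = {"en", "fr", "ja", "zh", "hy"}


def filter_supported(
    samples: Iterable[Tuple[str, str]],
    supported: Set[str] = SUPPORTED_LANGUAGES,
) -> List[Tuple[str, str]]:
    """Keep only (text, language_code) pairs whose label is supported."""
    return [(text, lang) for text, lang in samples if lang in supported]


samples = [
    ("hello world", "en"),
    ("bonjour le monde", "fr"),
    ("saluton mondo", "eo"),  # dropped: "eo" is not in the example set
]
filtered = filter_supported(samples)
print(filtered)
```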

Your ideas on the second approach seem fine so far. Yes, you are correct that you would have to write another module to finetune with a softmax output. I expect this second approach to be more challenging for this reason. If it helps, consider taking the first approach to get things working and then come back to the second approach to get better performance.
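For reference, a softmax output is a linear layer over the model's representation followed by a softmax across the language labels. The sketch below shows only that final step on a toy vector; the weights, labels, and function names are made up for illustration, and it does not show the fine-tuning loop or any text2text internals.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    embedding: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    labels: Sequence[str],
) -> Tuple[str, List[float]]:
    """Apply a linear layer, then softmax, and return the most likely label."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Toy values: one weight row per label, hidden size 2.
labels = ["en", "fr", "ja"]
weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]
label, probs = classify([1.0, 0.0], weights, bias, labels)
print(label, probs)
```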

Mofetoluwa · 2022-01-12T11:15:50Z

Alright then, I'll get started with approach 1

Mofetoluwa · 2022-02-07T19:29:26Z

Hi Art @artitw

Here is the link to the notebook: https://colab.research.google.com/drive/1VxRRURRAaXBZFsYsXC5hSTkSc-4TGdOj?usp=sharing

Based on the last discussion:

1. I decided to work with a batch size of 5, but the session took too long and would stop while going through the dataset.
2. So I decided to divide the dataset, create embeddings for each division and save them to a file.
3. I'm still doing that, as there is still some crashing. But there are currently embeddings for 2200 samples, which were used for training.
4. The models trained are very simple since the size of the data used is small.
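One way to make the divide-and-save workflow above survive crashes is to write one file per division and skip divisions that already exist on restart. This is only a sketch: `fake_embed` is a stand-in for the real embedding call, and the file naming and JSON format are assumptions rather than what the notebook does.

```python
import json
import os
import tempfile
from typing import Callable, List, Sequence


def embed_in_chunks(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    out_dir: str,
    chunk_size: int = 100,
    batch_size: int = 5,
) -> List[str]:
    """Embed texts chunk by chunk, writing one JSON file per chunk.

    Chunks whose file already exists are skipped, so a crashed session
    can be restarted without recomputing finished work.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(texts), chunk_size):
        path = os.path.join(out_dir, f"embeddings_{start:06d}.json")
        paths.append(path)
        if os.path.exists(path):
            continue  # already computed in an earlier session
        chunk = list(texts[start:start + chunk_size])
        vectors: List[List[float]] = []
        for i in range(0, len(chunk), batch_size):
            vectors.extend(embed_fn(chunk[i:i + batch_size]))
        # Write under a temporary name first so a crash never leaves a
        # half-written file that looks like a finished chunk.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(vectors, f)
        os.replace(tmp_path, path)
    return paths


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in for the real embedding call: [length, vowel count]."""
    return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in batch]


texts = [f"sample text {i}" for i in range(12)]
out_dir = tempfile.mkdtemp()
chunk_paths = embed_in_chunks(texts, fake_embed, out_dir, chunk_size=5, batch_size=2)
print(len(chunk_paths), "chunk files written")
```

Calling `embed_in_chunks` again with the same `out_dir` only computes the divisions that are still missing.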

I would add more embeddings and retrain the models to see how performance improves.

What do you think about it? Thank you :)

artitw · 2022-02-12T02:40:08Z

Hi @Mofetoluwa

Thanks for all the work and the summary. It looks like the MLP model is best performing. Would you be able to add it to the repo? It would be great if language classification is available for use while we continue improving it with finetuning and other methods.

Mofetoluwa · 2022-02-13T11:23:43Z

Hi @artitw

Alright then :)

So just to clarify, how would we want to use the model on the repo? Aside from pushing the saved model, are we also creating a module for it, e.g. identification.py, like the other functionalities?

artitw · 2022-02-13T20:20:23Z

It would be awesome to create an Identifier class so that we can do something like t2t.Handler(["test text"]).identify(). Could we give that a try?

Mofetoluwa · 2022-02-14T16:54:13Z

Alright, I'll add the model to the repo first. In which of the folders should I put it?

artitw · 2022-02-15T02:06:35Z

Can we store the model somewhere like Google Drive and only download it when the Identifier is used? This approach would follow the existing convention of keeping the core library lightweight.
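The download-on-first-use idea can be sketched as a class that defers fetching until `identify` is first called. Everything here is hypothetical: the `fetch` callable, the file name, and the placeholder prediction are illustrative only and are not the Identifier that ended up in the repo.

```python
import os
import tempfile
from typing import Callable, List, Optional


class Identifier:
    """Hypothetical sketch: model weights are fetched only on first use."""

    def __init__(self, fetch: Callable[[str], None], cache_dir: str):
        self._fetch = fetch  # callable that saves the model to a given path
        self._model_path = os.path.join(cache_dir, "identifier_model.bin")
        self._model: Optional[bytes] = None

    def _ensure_model(self) -> bytes:
        """Download (if needed) and load the model exactly once."""
        if self._model is None:
            if not os.path.exists(self._model_path):
                self._fetch(self._model_path)
            with open(self._model_path, "rb") as f:
                self._model = f.read()
        return self._model

    def identify(self, texts: List[str]) -> List[str]:
        self._ensure_model()
        # Placeholder prediction ("und" = undetermined); a real version
        # would embed the texts and run the trained classifier here.
        return ["und" for _ in texts]


calls: List[str] = []


def fake_fetch(destination: str) -> None:
    """Stand-in for a real download; writes placeholder bytes."""
    calls.append(destination)
    with open(destination, "wb") as f:
        f.write(b"placeholder weights")


identifier = Identifier(fake_fetch, cache_dir=tempfile.mkdtemp())
downloads_before_use = len(calls)
identifier.identify(["test text"])
identifier.identify(["more text"])
print(downloads_before_use, len(calls))
```

Importing the library and constructing the class cost nothing; the download happens once, and later calls reuse the cached file.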

Mofetoluwa · 2022-02-15T10:17:01Z

Alright then :)

Mofetoluwa · 2022-02-27T16:14:18Z

Hi @artitw

My sincere apologies that the updates are only coming in now; I wanted to get some work done on the Identifier class before sending them.

1. A pull request has been made for the code, so you can have a look at it and let me know your thoughts.
2. More embeddings were created (~6900) and used to train the MLP model, so there has been some improvement, as seen in the notebook: https://colab.research.google.com/drive/1Cq1lnDJMI2-ZZxm1VmCWzvLL78E2UmVy?usp=sharing
3. One limitation of the current model is that it doesn't correctly predict the language for some short text sequences (fewer than 10 tokens), although it correctly identifies the language for longer sequences. Hopefully this can be resolved using the second approach, and I'd also appreciate your thoughts on this problem.

Thank you :)

artitw · 2022-02-27T23:47:25Z

@Mofetoluwa thanks for the updates and the pull request. I added some comments there. With regard to the third point you raise, when I tested the model, it returned "hy" for "hello" and "ja" for "你好!". Is this consistent with your testing as well?

Mofetoluwa · 2022-02-28T09:48:58Z

Yeah, it's a problem I noticed with most languages.

I believe approach 2 would resolve this? Another thing could be to generate shorter texts for this approach. What do you think?

artitw · 2022-03-06T01:00:29Z

I think training with shorter texts and approach 2 would address the issue. Another approach is to use 2D embeddings. Currently we are using 1D embeddings, which are calculated by averaging the last layer outputs, but we can use the last layer outputs directly as 2D embeddings.

I also just realized from adding the 2D embeddings option in the latest release that the last layer averaging could be improved by removing the paddings from the calculations. In other words, I think it might be helpful to re-train the MLP identification model on the latest release.
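To illustrate why padding matters when averaging, here is a small sketch of mean pooling with and without a padding mask. It uses plain lists and an assumed 0/1 attention mask; it is not the library's implementation.

```python
from typing import List


def mean_pool(
    token_vectors: List[List[float]],
    attention_mask: List[int],
    ignore_padding: bool = True,
) -> List[float]:
    """Average per-token vectors (the 2D embedding) into one 1D vector.

    With ignore_padding=True, positions whose mask is 0 are left out of
    both the sum and the count, so padding cannot dilute the average.
    """
    hidden = len(token_vectors[0])
    total = [0.0] * hidden
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if ignore_padding and not keep:
            continue
        count += 1
        for j, value in enumerate(vector):
            total[j] += value
    if count == 0:
        return total  # nothing to average; return zeros
    return [value / count for value in total]


# Two real tokens followed by two padding positions (hidden size 2).
last_layer = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]

naive = mean_pool(last_layer, mask, ignore_padding=False)
masked = mean_pool(last_layer, mask, ignore_padding=True)
print(naive)   # [1.0, 1.5]
print(masked)  # [2.0, 3.0]
```

The naive average shrinks toward zero as more padding is added, which hits short texts hardest when they are padded to the length of a longer batch.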

Mofetoluwa · 2022-03-08T12:24:42Z

Hi @artitw

Oh, that sounds great. So how can the 2D embeddings be obtained? Is it still by using the vectorize() function?

artitw · 2022-03-08T14:02:15Z

@Mofetoluwa yes, we can do vectorize(output_dimension=2) as specified in the latest version.

Also note that the default 1D output should be improved now compared to the version you used most recently.

Mofetoluwa · 2022-03-08T14:44:58Z

@artitw Oh alright. So should we do a comparison of both?

Also, adding shorter texts did not really improve the performance of the model. The F1 score and accuracy dropped to about 0.66.
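For anyone reproducing these numbers, accuracy and F1 can be computed as in the sketch below. The thread does not say which F1 averaging was used, so macro averaging here is an assumption, and the labels are toy data.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1, with F1 = 2TP / (2TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["en", "en", "fr", "fr"]
y_pred = ["en", "fr", "fr", "fr"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(acc, round(f1, 4))
```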

artitw · 2022-03-08T15:47:02Z

Yes, a comparison of both would be useful. Thanks so much for checking the shorter texts. It will help to confirm the fix for the way 1D embeddings are calculated.

artitw · 2022-05-15T19:16:41Z

Hi Mofe,

1. Are we sampling the data so that each class is balanced when training?
2. Could we update the README so that users have some documentation for using the Identifier?

Mofetoluwa · 2022-05-16T08:51:00Z

Hi Art,

1. Yes, the dataset is balanced: there are 96 languages and 100 samples for each, so we used 9600 samples to train the current model. I believe the original dataset has one or two of the 4 languages I didn't find, but they may be listed under another name/code. So I'll find that out.
2. Alright, I'll do that shortly...
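Balanced sampling of the kind described (a fixed number of samples per language) can be sketched as follows. Skipping under-represented classes is a choice made for this example; the function name and toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def balanced_sample(
    samples: Sequence[Tuple[str, str]],
    per_class: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Draw exactly `per_class` (text, label) pairs for every label.

    Labels with fewer than `per_class` examples are skipped so that the
    result stays perfectly balanced.
    """
    by_label: Dict[str, List[str]] = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    selected: List[Tuple[str, str]] = []
    for label in sorted(by_label):
        texts = by_label[label]
        if len(texts) < per_class:
            continue
        selected.extend((text, label) for text in rng.sample(texts, per_class))
    rng.shuffle(selected)
    return selected


data = (
    [(f"english text {i}", "en") for i in range(5)]
    + [(f"texte français {i}", "fr") for i in range(3)]
    + [("日本語のテキスト", "ja")]  # only one example, so "ja" is skipped
)
subset = balanced_sample(data, per_class=2)
label_counts = Counter(label for _, label in subset)
print(label_counts)
```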

artitw · 2022-05-30T18:39:51Z

Could we also add the Identifier in the README's class diagram?

artitw · 2022-09-17T20:10:04Z

@Mofetoluwa, what do you think about using the TFIDF embeddings to perform the language prediction? I think that might be better than the neural embeddings currently used, as it won't have the length dependency.
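A rough sketch of what TF-IDF features for language identification might look like, using character bigrams, a smoothed IDF, and L2 normalization. The class and function names are made up for illustration, and this is not how the library computes its TF-IDF embeddings. Every text maps to a vector of the same size regardless of its length.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Character n-grams of the lowercased text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharTfidf:
    """Minimal character n-gram TF-IDF with a fixed vocabulary."""

    def __init__(self, n: int = 2):
        self.n = n
        self.vocab: Dict[str, int] = {}
        self.idf: Dict[str, float] = {}

    def fit(self, texts: Sequence[str]) -> "CharTfidf":
        doc_freq: Counter = Counter()
        for text in texts:
            doc_freq.update(set(char_ngrams(text, self.n)))
        self.vocab = {gram: i for i, gram in enumerate(sorted(doc_freq))}
        n_docs = len(texts)
        # Smoothed IDF so that no term gets a zero weight.
        self.idf = {
            gram: math.log((1 + n_docs) / (1 + df)) + 1.0
            for gram, df in doc_freq.items()
        }
        return self

    def transform(self, text: str) -> List[float]:
        """Return an L2-normalized vector with one slot per vocabulary n-gram."""
        counts = Counter(g for g in char_ngrams(text, self.n) if g in self.vocab)
        vector = [0.0] * len(self.vocab)
        for gram, count in counts.items():
            vector[self.vocab[gram]] = count * self.idf[gram]
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector


vectorizer = CharTfidf().fit(["hello world", "bonjour le monde", "hola mundo"])
short_vec = vectorizer.transform("hello")
long_vec = vectorizer.transform("hello world, this is a considerably longer sentence")
print(len(short_vec), len(long_vec))
```

Unseen n-grams are simply ignored, so the output dimension depends only on the fitted vocabulary.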

Mofetoluwa · 2022-09-26T08:48:10Z

Hi Art @artitw sure that should work actually... I'll try it out and let you know how it goes. I hope you're doing great :)

artitw · 2022-10-02T01:06:41Z

Great, thanks so much, Mofe. Really looking forward to it.

artitw · 2022-11-19T18:51:24Z

Hi Mofe, in the latest release I fixed an issue with TFIDF embeddings so that they now output a consistent embedding size. Hope this helps

Mofetoluwa · 2022-11-24T19:38:27Z

Hi Art,

Alright that's cool :)... I'll work with it and let you know how it goes soon

artitw added this to To Do in Research Science Jan 1, 2022

artitw assigned Mofetoluwa Jan 3, 2022

artitw moved this from To Do to In Progress in Research Science Feb 20, 2022
