Fine-tune crosslingual model for language detection #30

Open
artitw opened this issue Jan 1, 2022 · 29 comments

@artitw
Owner

artitw commented Jan 1, 2022

Two approaches to try:

  1. Use crosslingual embeddings as input to MLP or tree-based model in transfer learning fashion
  2. Fine-tune crosslingual translator with softmax output
@artitw artitw added this to To Do in Research Science Jan 1, 2022
@Mofetoluwa
Contributor

I'll get started on this

@artitw
Owner Author

artitw commented Jan 3, 2022

Awesome, I've assigned you to this project. Let's keep track of progress here.

@Mofetoluwa
Contributor

Alright, sure :)

@Mofetoluwa
Contributor

Hi Artit @artitw

Please, I need your help; I’m facing some roadblocks.

I decided to start with the second approach you suggested, which is fine-tuning the cross-lingual translator with a softmax output. My thought process for this:

  1. Get a language detection dataset. I decided to work with wili_2018 (https://arxiv.org/pdf/1801.07779v1.pdf, https://zenodo.org/record/841984#.Yd1XWRPMK3I). What do you think about it, and do you have any other datasets in mind?
  2. Looking at the Fitter class in the code, I'm not exactly sure how I can fine-tune the translator to train on the identification dataset. Since it requires both the source and target languages, I was thinking of writing another module to do the fine-tuning, but was not sure whether that was necessary. Do you have any ideas on how I can apply the softmax output to the already existing translator?

Also, does this sound like the right track for approach number 2?

@artitw
Owner Author

artitw commented Jan 12, 2022

Very much appreciate the updates on this. The dataset you cite looks appropriate; I suggest filtering for the languages which the pretrained model supports for tokenization.

Your ideas on the second approach seem fine so far. Yes, you are correct that you would have to write another module to finetune with a softmax output. I expect this second approach to be more challenging for this reason. If it helps, consider taking the first approach to get things working and then come back to the second approach to get better performance.
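
A minimal sketch of the filtering step suggested above, assuming the dataset is available as (text, label) pairs and the set of language codes the pretrained tokenizer supports is known as a plain set. The function name `filter_supported`, the `aliases` mapping, and the toy codes are hypothetical and not part of the library; the mapping is only there because a dataset and a model may label the same language with different codes.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def filter_supported(
    samples: Iterable[Tuple[str, str]],
    supported: Set[str],
    aliases: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, str]]:
    """Keep only samples whose (optionally remapped) label is supported."""
    aliases = aliases or {}
    kept = []
    for text, label in samples:
        # Map the dataset's code to the model's code when a mapping is known.
        code = aliases.get(label, label)
        if code in supported:
            kept.append((text, code))
    return kept


# Toy data; the codes below are illustrative only.
samples = [("hello world", "eng"), ("bonjour le monde", "fra"), ("xyz", "zzz")]
supported = {"en", "fr"}
aliases = {"eng": "en", "fra": "fr"}
filtered = filter_supported(samples, supported, aliases)
print(filtered)
```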

@Mofetoluwa
Contributor

Alright then, I'll get started with approach 1

@Mofetoluwa
Contributor

Hi Art @artitw

Here is the link to the notebook: https://colab.research.google.com/drive/1VxRRURRAaXBZFsYsXC5hSTkSc-4TGdOj?usp=sharing

Based on the last discussion:

  • I decided to work with a batch size of 5, but the session took too long and would stop while going through the dataset.
  • So I decided to divide the dataset, create embeddings for each division and save it to a file.
  • I'm still doing that, as there is still some crashing. But there are currently embeddings for 2200 samples, which were used for training.
  • The models trained are very simple since the size of the data used is small.

I will add more embeddings and retrain the models to see how performance improves.

What do you think about it? Thank you :)
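
A rough sketch of the "divide the dataset and save each division" workflow described above, written so that a crashed session can resume without recomputing finished chunks. It assumes some function that turns a batch of texts into vectors; `fake_embed` below is a stand-in for that call, and `embed_in_chunks`, `load_chunks`, and the file naming are hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, List, Sequence


def embed_in_chunks(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    out_dir: str,
    chunk_size: int = 100,
) -> List[str]:
    """Embed texts chunk by chunk, writing one JSON file per chunk."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(texts), chunk_size):
        path = os.path.join(out_dir, f"embeddings_{start:08d}.json")
        paths.append(path)
        if os.path.exists(path):
            continue  # chunk was finished in an earlier session
        vectors = embed_fn(texts[start:start + chunk_size])
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump([[float(x) for x in v] for v in vectors], f)
        # Rename only after a complete write, so a crash leaves no partial file.
        os.replace(tmp_path, path)
    return paths


def load_chunks(paths: Sequence[str]) -> List[List[float]]:
    """Concatenate the saved chunks back into one list of vectors."""
    vectors = []
    for path in paths:
        with open(path) as f:
            vectors.extend(json.load(f))
    return vectors


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    # Stand-in for the real embedding call: (length, number of spaces).
    return [[float(len(t)), float(t.count(" "))] for t in batch]


texts = ["a b", "hello", "one two three"]
with tempfile.TemporaryDirectory() as out_dir:
    paths = embed_in_chunks(texts, fake_embed, out_dir, chunk_size=2)
    vectors = load_chunks(paths)
print(vectors)
```

Because the file names depend on the chunk offsets, the chunk size should stay the same between sessions that share an output directory.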

@artitw
Owner Author

artitw commented Feb 12, 2022

Hi @Mofetoluwa

Thanks for all the work and the summary. It looks like the MLP model is best performing. Would you be able to add it to the repo? It would be great if language classification is available for use while we continue improving it with finetuning and other methods.

@Mofetoluwa
Contributor

Hi @artitw

Alright then :)

So just to clarify, how would we want to use the model on the repo? Aside from pushing the saved model, are we also creating a module for it to be called, e.g. identification.py, like the other functionalities?

@artitw
Owner Author

artitw commented Feb 13, 2022

It would be awesome to create an Identifier class so that we can do something like t2t.Handler(["test text"]).identify(). Could we give that a try?

@Mofetoluwa
Contributor

Alright, I'll add the model to the repo first. In which of the folders should I put it?

@artitw
Owner Author

artitw commented Feb 15, 2022

Can we store the model somewhere like Google Drive and only download it when the Identifier is used? This approach would follow the existing convention to keep the core library lightweight.
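
One way to read the "download only when used" idea is a class that defers fetching the model file until the first prediction and then caches it. The sketch below is an assumption about how that could look, not the library's Identifier: `LazyIdentifier`, the JSON model format, the linear scoring, and the example URL are all hypothetical, and the download function is injectable so the example runs without network access.

```python
import json
import os
import tempfile
import urllib.request
from typing import Callable, List, Sequence


class LazyIdentifier:
    """Hypothetical sketch: fetch the model file only on first use."""

    def __init__(
        self,
        model_url: str,
        cache_path: str,
        fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
    ) -> None:
        self.model_url = model_url
        self.cache_path = cache_path
        self._fetch = fetch
        self._model = None
        self.downloads = 0  # counts actual downloads, for illustration

    def _load(self) -> dict:
        if self._model is None:
            if not os.path.exists(self.cache_path):
                self._fetch(self.model_url, self.cache_path)
                self.downloads += 1
            with open(self.cache_path) as f:
                self._model = json.load(f)
        return self._model

    def identify(self, vectors: Sequence[Sequence[float]]) -> List[str]:
        """Return the label whose weight row scores highest for each vector."""
        model = self._load()
        results = []
        for vec in vectors:
            scores = [sum(w * x for w, x in zip(row, vec)) for row in model["weights"]]
            results.append(model["labels"][scores.index(max(scores))])
        return results


def fake_fetch(url: str, path: str) -> None:
    # Stand-in for a real download: writes a tiny two-class linear model.
    with open(path, "w") as f:
        json.dump({"labels": ["en", "fr"], "weights": [[1.0, 0.0], [0.0, 1.0]]}, f)


with tempfile.TemporaryDirectory() as cache_dir:
    identifier = LazyIdentifier(
        "https://example.invalid/model.json",  # placeholder URL
        os.path.join(cache_dir, "model.json"),
        fetch=fake_fetch,
    )
    cached_before = os.path.exists(identifier.cache_path)
    first = identifier.identify([[0.9, 0.1], [0.2, 0.8]])
    second = identifier.identify([[0.1, 0.7]])
    downloads = identifier.downloads
print(first, second, downloads)
```

Constructing the object stays cheap; only the first `identify` call pays for the download, and later calls reuse the in-memory model.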

@Mofetoluwa
Contributor

Alright then :)

@artitw artitw moved this from To Do to In Progress in Research Science Feb 20, 2022
@Mofetoluwa
Contributor

Hi @artitw

My sincere apologies that the updates are only coming in now; I wanted to have done some work on the Identifier class before sending them.

  • A pull request has been made for the code, so you can have a look at it and let me know your thoughts.
  • More embeddings were created (~ 6900) and used to train the MLP model, so there has been some improvement as seen in the notebook: https://colab.research.google.com/drive/1Cq1lnDJMI2-ZZxm1VmCWzvLL78E2UmVy?usp=sharing
  • One limitation of the current model is that it doesn't correctly predict the language for some short text sequences (less than 10 tokens), although it correctly identifies the language for longer sequences. Hopefully, this can be resolved using the second approach, and I'd appreciate your thoughts on this problem too.

Thank you :)

@artitw
Owner Author

artitw commented Feb 27, 2022

@Mofetoluwa thanks for the updates and the pull request. I added some comments there. With regard to the third point you raise, when I tested the model, it returned "hy" for "hello" and "ja" for "你好!". Is this consistent with your testing as well?

@Mofetoluwa
Contributor

Yeah, it's a problem I noticed with most languages.

I believe approach 2 would resolve this? Another thing could be to generate shorter texts for this approach. What do you think?

@artitw
Owner Author

artitw commented Mar 6, 2022

I think training with shorter texts and approach 2 would address the issue. Another approach is to use 2D embeddings. Currently we are using 1D embeddings, which are calculated by averaging the last layer outputs, but we can use the last layer outputs directly as 2D embeddings.

I also just realized from adding the 2D embeddings option in the latest release that the last layer averaging could be improved by excluding the padding positions from the calculation. In other words, I think it might be helpful to re-train the MLP identification model on the latest release.
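
To illustrate the padding point, here is a small sketch of averaging token vectors with and without the padded positions. It is not the library's implementation; `masked_mean` and the toy numbers are assumptions, and the mask follows the common convention of 1 for real tokens and 0 for padding.

```python
from typing import List, Sequence


def masked_mean(
    token_vectors: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average token vectors over positions where the mask is non-zero."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must align")
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        return total  # nothing but padding: return the zero vector
    return [t / count for t in total]


# Two real tokens followed by two padded positions.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]

naive = [sum(col) / len(tokens) for col in zip(*tokens)]  # padding included
pooled = masked_mean(tokens, mask)                        # padding excluded
print(naive, pooled)
```

In this toy case the naive average is pulled toward the padding vectors, and the effect grows as the text gets shorter relative to the padded length, which may be one reason short inputs were harder to classify.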

@Mofetoluwa
Contributor

Hi @artitw

Oh that sounds great. So how can the 2D embeddings be obtained? Is it still by using the vectorize() function?

@artitw
Owner Author

artitw commented Mar 8, 2022

@Mofetoluwa yes, we can do vectorize(output_dimension=2) as specified in the latest version.

Also note that the default 1D output should be improved now compared to the version you used most recently.

@Mofetoluwa
Contributor

@artitw Oh alright. So should we do a comparison of both?

Then also... adding shorter texts did not really improve the performance of the model. The F1 score and accuracy dropped to about 0.66.

@artitw
Owner Author

artitw commented Mar 8, 2022

Yes, a comparison of both would be useful. Thanks so much for checking the shorter texts. It will help to confirm the fix for the way 1D embeddings are calculated.

@artitw
Owner Author

artitw commented May 15, 2022

Hi Mofe,

  1. Are we sampling the data so that each class is balanced when training?
  2. Could we update the README so that users could have some documentation to use the Identifier?

@Mofetoluwa
Contributor

Hi Art,

  1. Yes, the dataset is balanced: there are 96 languages and 100 samples for each, so we used 9600 samples to train the current model. I believe the original dataset should have one or two of the 4 languages I didn't find, but they may be represented under another name/code. So I'll find that out.

  2. Alright I'll do that shortly...
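
A small sketch of class-balanced sampling as discussed in this exchange, assuming the data is a list of (text, label) pairs. The name `balanced_sample` and the choice to skip under-represented classes are assumptions; with 96 languages and `per_class=100` this would yield the 9600 samples mentioned above.

```python
import random
from collections import Counter, defaultdict
from typing import Iterable, List, Tuple


def balanced_sample(
    samples: Iterable[Tuple[str, str]],
    per_class: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Draw exactly per_class samples for every sufficiently large class."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = []
    for label in sorted(by_label):
        texts = by_label[label]
        if len(texts) < per_class:
            continue  # skip classes that cannot fill the quota
        subset.extend((text, label) for text in rng.sample(texts, per_class))
    rng.shuffle(subset)
    return subset


samples = [
    ("a1", "en"), ("a2", "en"), ("a3", "en"),
    ("b1", "fr"), ("b2", "fr"),
    ("c1", "de"),
]
subset = balanced_sample(samples, per_class=2)
counts = Counter(label for _, label in subset)
print(dict(counts))
```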

@artitw
Owner Author

artitw commented May 30, 2022

Could we also add the Identifier in the README's class diagram?

@artitw
Owner Author

artitw commented Sep 17, 2022

@Mofetoluwa, what do you think about using the TFIDF embeddings to perform the language prediction? I think that might be better than the neural embeddings currently used, as it won't have the length dependency.
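
For illustration only, the sketch below shows one self-contained way TF-IDF features could drive language prediction: character bigram TF-IDF vectors compared against per-language centroids by cosine similarity. This is a swapped-in stand-in, not the library's TFIDF embeddings or the model discussed in this thread; every function name and the toy training set are assumptions.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def char_ngrams(text: str, n: int = 2) -> List[str]:
    padded = f" {text.lower()} "  # spaces mark word boundaries
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(docs: Sequence[str], n: int = 2) -> Dict[str, float]:
    """Smoothed inverse document frequency for every n-gram seen in docs."""
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc, n)))
    total = len(docs)
    return {g: math.log((1 + total) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf(text: str, idf: Dict[str, float], n: int = 2) -> Dict[str, float]:
    """L2-normalized sparse TF-IDF vector; unseen n-grams are ignored."""
    counts = Counter(g for g in char_ngrams(text, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else {}


def fit_centroids(samples: Sequence[Tuple[str, str]], idf: Dict[str, float]):
    """Sum the vectors of each language, then normalize to unit length."""
    sums = defaultdict(lambda: defaultdict(float))
    for text, label in samples:
        for g, v in tfidf(text, idf).items():
            sums[label][g] += v
    centroids = {}
    for label, vec in sums.items():
        norm = math.sqrt(sum(v * v for v in vec.values()))
        centroids[label] = {g: v / norm for g, v in vec.items()}
    return centroids


def predict(text: str, idf: Dict[str, float], centroids) -> str:
    vec = tfidf(text, idf)
    # Both sides are unit length, so the dot product is the cosine similarity.
    return max(
        centroids,
        key=lambda label: sum(v * centroids[label].get(g, 0.0) for g, v in vec.items()),
    )


train = [
    ("the cat sat on the mat", "en"),
    ("this is the house", "en"),
    ("le chat est sur le tapis", "fr"),
    ("ceci est la maison", "fr"),
]
idf = fit_idf([text for text, _ in train])
centroids = fit_centroids(train, idf)
print(predict("the dog", idf, centroids), predict("le chat", idf, centroids))
```

Normalizing each vector removes the direct effect of text length on vector magnitude, though very short inputs still carry few n-grams to match on.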

@Mofetoluwa
Contributor

Hi Art @artitw sure that should work actually... I'll try it out and let you know how it goes. I hope you're doing great :)

@artitw
Owner Author

artitw commented Oct 2, 2022

Great, thanks so much Mofe. Really looking forward to it.

@artitw
Owner Author

artitw commented Nov 19, 2022

Hi Mofe, in the latest release I fixed an issue with TFIDF embeddings so that they now output a consistent embedding size. Hope this helps

@Mofetoluwa
Contributor

Hi Art,

Alright that's cool :)... I'll work with it and let you know how it goes soon
