
AbinashBishoyi/Bangla2Vec

 
 


Bangla2Vec

Language Modelling and Classification in the Bengali Language

Announcement: I will be giving a talk about this work at IEM, Kolkata this Saturday. The event link is here. Hope to see you there!

Bangla2Vec is an open-source project for modelling the Bengali language. The released models can be used for tasks such as classification and translation. All data and models are open source, so you can train your own model or use the pretrained models for your own tasks.

Releases

  • Trained a skip-gram model on a news dataset: Training Script | Results | Model
  • Trained a skip-gram model on a Wikipedia dataset: Training Script | Results | Model
  • Visualise word embeddings: Script | Create a directory named vis, run the script, then start TensorBoard with tensorboard --logdir=vis
  • Scripts to scrape data from Bengali news websites: GitHub Repo
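For readers new to skip-gram: the model learns embeddings by predicting the words around each centre word. The snippet below is a minimal sketch of how (centre, context) training pairs can be generated from a tokenised sentence. It is not the training script linked above; the function name `skipgram_pairs`, the window size, and the romanised example sentence are all assumptions made for illustration.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (centre, context) pairs for every token within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs


# Hypothetical tokenised sentence, romanised for readability.
sentence = ["ami", "mishti", "khete", "bhalobashi"]
pairs = skipgram_pairs(sentence, window=1)
for centre, context in pairs:
    print(centre, "->", context)
```

A larger window gives each word more context pairs, at the cost of mixing in more distant and often less related words.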

Results

Words most similar to the word chele (boy)

Father + Girl - Boy = Mother

Odd one out

Bengalis Love Sweets!
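The three kinds of query shown in these results (nearest neighbours, analogies, and odd one out) all reduce to cosine similarity between word vectors. Here is a minimal sketch using hand-made 4-dimensional toy vectors. The vectors, the romanised vocabulary, and the helper names are hypothetical and chosen only so the arithmetic is easy to follow; the real models have learned, higher-dimensional embeddings.

```python
import math
from typing import Dict, Iterable, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors: Dict[str, Vector], query: Vector,
                 exclude: Iterable[str] = ()) -> str:
    """Word whose vector has the highest cosine similarity to `query`."""
    skip = set(exclude)
    candidates = [w for w in vectors if w not in skip]
    return max(candidates, key=lambda w: cosine(vectors[w], query))


def analogy(vectors: Dict[str, Vector], a: str, b: str, c: str) -> str:
    """Solve a + b - c = ? by nearest neighbour, ignoring the input words."""
    query = [x + y - z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    return most_similar(vectors, query, exclude=(a, b, c))


def odd_one_out(vectors: Dict[str, Vector], words: List[str]) -> str:
    """Word least similar to the mean vector of the group."""
    mean = [sum(col) / len(words) for col in zip(*(vectors[w] for w in words))]
    return min(words, key=lambda w: cosine(vectors[w], mean))


# Toy vectors (made up): dimensions are [gender, adult, person, food].
toy = {
    "baba":   [1.0, 1.0, 2.0, 0.0],   # father
    "ma":     [-1.0, 1.0, 2.0, 0.0],  # mother
    "chele":  [1.0, 0.0, 2.0, 0.0],   # boy
    "meye":   [-1.0, 0.0, 2.0, 0.0],  # girl
    "mishti": [0.0, 0.0, 0.0, 2.0],   # sweet
}

print(analogy(toy, "baba", "meye", "chele"))             # father + girl - boy
print(odd_one_out(toy, ["baba", "ma", "chele", "mishti"]))
print(most_similar(toy, toy["chele"], exclude=("chele",)))
```

Libraries that load trained word2vec models typically expose equivalent queries directly, so this code is only meant to show what those queries compute.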

Data

Data was scraped from multiple online Bengali news websites.

Data was also collected from a Wikipedia dump.

You can view the data in the data folder.

Examples

  • Classification: A news classifier was built on the trained Bangla2Vec models. It classifies news into 5 categories from the headlines alone. The best model achieved a test F1 score of 0.76 after training on just 40k news headlines.
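For context on the metric: F1 is the harmonic mean of precision and recall, and with several classes the per-class scores have to be combined somehow. The sketch below computes a macro-averaged F1 (an unweighted mean over classes). Which averaging the reported 0.76 uses is not stated here, so macro averaging is an assumption, and the category names and predictions are made up for illustration.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for four headlines (category names are illustrative).
y_true = ["sports", "sports", "politics", "politics"]
y_pred = ["sports", "politics", "politics", "politics"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

Macro averaging treats all categories equally, so a weak score on a rare category pulls the result down as much as a weak score on a common one.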

Similar Projects

This project is a sister project of other projects working on IndicNLP. They include:

To find resources for getting started with IndicNLP, or to learn more about it, see our Awesome List of resources.

Future Work

  • Build a word2vec model
  • Visualise the trained embeddings
  • Build a ULMFiT model
  • Get translation data
