Bangla2Vec

Language Modelling and Classification in the Bengali Language

Announcement: I will be giving a talk about this work at IEM, Kolkata this Saturday. The event link is here. Hope to see you there!

Bangla2Vec is an open-source project for modelling the Bengali language. The models released here can be used for tasks such as classification and translation. All the data and models are open-sourced, so you can train your own model or use the pretrained models for your own tasks.

Releases

Trained a skip-gram model on a news dataset: Training Script | Results | Model
Trained a skip-gram model on a Wikipedia dataset: Training Script | Results | Model
Visualise word embeddings: Script | Create a directory named vis, run the script, and then start TensorBoard using tensorboard --logdir=vis
Scripts to scrape data from Bengali news websites: GitHub Repo
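
For readers unfamiliar with skip-gram, the sketch below shows how the (center, context) training pairs that such a model learns from can be generated. It is an illustration only, not the training script from this repository: the function name `skipgram_pairs`, the `window` parameter, and the romanised example sentence are all assumptions made for the example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical tokenised sentence, romanised for readability.
sentence = ["ami", "mishti", "khete", "bhalobashi"]
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A skip-gram model is then trained to predict the context word from the center word, and the learned input weights become the word embeddings.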

Results

Words most similar to the word chele (boy)

Father + Girl - Boy = Mother
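
The analogy above is computed with vector arithmetic followed by a nearest-neighbour search. Here is a minimal sketch of that idea, assuming tiny hand-made 3-dimensional vectors instead of the released embeddings. The romanised words, the vector values, and the helper names `cosine` and `analogy` are hypothetical. Unlike some libraries, this version does not normalise vectors before summing.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# Toy vectors (assumed values): axis 0 = adult/child, axis 1 = male/female, axis 2 = food.
toy_vectors = {
    "baba": [1.0, 1.0, 0.0],     # father
    "ma": [1.0, -1.0, 0.0],      # mother
    "chele": [-1.0, 1.0, 0.0],   # boy
    "meye": [-1.0, -1.0, 0.0],   # girl
    "mishti": [0.0, 0.0, 1.0],   # sweets
}

result = analogy(toy_vectors, positive=["baba", "meye"], negative=["chele"])
print(result)
```

With real embeddings the nearest neighbour is rarely an exact match, so the top few candidates are usually inspected rather than only the first.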

Odd one out
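
One common way to pick the odd one out is to average the unit-normalised vectors of the group and return the word least similar to that average. The sketch below assumes this approach. The source does not say how the result shown here was produced, and the 2-dimensional vectors and the names `unit` and `odd_one_out` are made up for illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(units[words[0]])
    mean = [sum(units[w][d] for w in words) / len(words) for d in range(dims)]
    # The dot product with the mean ranks words by how well they fit the group.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Toy vectors (assumed values): axis 0 = "sweet-like", axis 1 = "sport-like".
toy_vectors = {
    "rosogolla": [0.9, 0.1],
    "sandesh": [0.8, 0.2],
    "mishti": [1.0, 0.0],
    "cricket": [0.0, 1.0],
}

odd = odd_one_out(toy_vectors, ["rosogolla", "sandesh", "mishti", "cricket"])
print(odd)
```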

Bengalis Love Sweets!

Data

Data was scraped from multiple online Bengali news websites.

Data was also collected from a Wikipedia dump.

You can view the data in the data folder.

Examples

Classification: Using the trained Bangla2Vec models, a news classifier was built. It classifies news into 5 categories based on the headlines. The best model achieved a test F1 score of 0.76 after training on just 40k news headlines.
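
For context on the metric, the sketch below computes a macro-averaged F1 score for a multi-class classifier. The source does not state which averaging was used for the 0.76 figure, so macro averaging is an assumption here, and the category names and labels are invented for the example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical gold labels and predictions for four headlines.
y_true = ["sports", "sports", "politics", "politics"]
y_pred = ["sports", "politics", "politics", "politics"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

Macro averaging weights every category equally, which matters when some news categories have far fewer headlines than others.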

Similar Projects

This project is a sister project of other projects working on IndicNLP. They include:

To find resources for getting started with IndicNLP, or to learn more about it, see our Awesome List of resources.

Future Work

Build a word2vec model
Visualise the trained embeddings
Build a UlmFit model
Get translation data

AbinashBishoyi/Bangla2Vec