Why word vectors in Indic NLP?

In general, word vector representations are a very important part of most NLP algorithms. Although we could feed a language model direct (e.g., one-hot) word representations, it is far more useful if the model can somehow figure out the relationships between different words. The underlying principle of word vectors is captured by a famous quote from John Rupert Firth: "You shall know a word by the company it keeps."

Using word vectors rewards us with two primary advantages.

  • The model can learn quickly from a smaller amount of training data, and therefore generalize better and make predictions on unseen data.
  • We can perform transfer learning with far more flexibility than with one-hot-encoding-based models (see the sketch after this list).
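The contrast between the two representations is easy to see numerically. Below is a minimal sketch (not this project's code) comparing one-hot vectors with dense word vectors; the tiny vocabulary and the 3-dimensional vector values are invented purely for illustration, since real values would come from training on a corpus.

```python
# Toy comparison of one-hot vs. dense word vectors.
# Vocabulary and vector values are hypothetical, for illustration only.
import numpy as np

vocab = ["king", "queen", "apple"]

# One-hot: every word is orthogonal to every other word,
# so no relationship between words can be expressed.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Dense vectors: after training, related words end up close together.
dense = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.82, 0.12]),
    "apple": np.array([0.10, 0.05, 0.90]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(one_hot["king"], one_hot["queen"]))  # 0.0  -- no relationship
print(cosine(dense["king"], dense["queen"]))      # ~1.0 -- similar words
print(cosine(dense["king"], dense["apple"]))      # low  -- unrelated words
```

Because similar words share similar vectors, a model trained with dense vectors can transfer what it learned about one word to its neighbors, which is exactly why it needs less data and transfers better.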

What does a_മ്മ do for this?

a_മ്മ aims to collect raw text data from websites, articles, and newspapers (with the help of OCR). The collected data should be diverse and properly categorized. Content creators are the ones who can contribute most here, and we request their contributions for the greater good. Eventually, as an after-effect, a_മ്മ is expected to attract contributions from researchers and developers in the form of models and algorithms.

Footnote

To developers and researchers: until this project matures, use Facebook fastText to contribute to other a_മ്മ repositories as a_മ്മ after-effects.
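As a starting point, here is a minimal sketch of training Malayalam word vectors with fastText's official Python bindings (`pip install fasttext`). The corpus file name is a hypothetical placeholder: a plain-text file with one sentence per line, collected as described above.

```python
# Minimal sketch: unsupervised fastText training on a raw Malayalam corpus.
# "malayalam_corpus.txt" is a hypothetical placeholder file name.
import fasttext

# Skip-gram with character n-grams tends to suit morphologically rich
# languages like Malayalam, since subword units capture agglutination.
model = fasttext.train_unsupervised(
    "malayalam_corpus.txt",
    model="skipgram",
    dim=100,   # size of each word vector
    minn=3,    # shortest character n-gram
    maxn=6,    # longest character n-gram
)

model.save_model("malayalam_vectors.bin")

# Query the trained vectors.
vector = model.get_word_vector("അമ്മ")            # dense vector for a word
neighbors = model.get_nearest_neighbors("അമ്മ")   # similar words from corpus
```

The resulting `.bin` model can then be loaded in downstream a_മ്മ projects with `fasttext.load_model("malayalam_vectors.bin")`.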