In this project I try to find the distribution that best describes most linguistic features in social media at different levels of analysis: county, user, and message. To do so, I work in two regimes:
- Unsupervised
- Supervised
In the "unsupervised" section, I use statistical tests to find the candidate distribution that fits each feature's empirical distribution with the highest confidence.
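
As a rough illustration of the unsupervised idea, the sketch below fits a few candidate distributions to one feature by maximum likelihood and ranks them by their Kolmogorov-Smirnov statistic (smaller means a closer fit). The choice of the KS statistic, the three candidates, and all function names are assumptions made for this example; the project does not specify which tests or distributions are used. The data is synthetic.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mean_and_std(xs):
    """Maximum-likelihood mean and standard deviation."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)


def fit_normal(xs):
    mu, sigma = mean_and_std(xs)
    return lambda x: normal_cdf(x, mu, sigma)


def fit_lognormal(xs):
    # Assumes every value is strictly positive.
    mu, sigma = mean_and_std([math.log(x) for x in xs])
    return lambda x: normal_cdf(math.log(x), mu, sigma) if x > 0 else 0.0


def fit_exponential(xs):
    # Assumes non-negative values; the MLE rate is 1 / mean.
    rate = len(xs) / sum(xs)
    return lambda x: 1.0 - math.exp(-rate * x) if x >= 0 else 0.0


def ks_statistic(xs, cdf):
    """Largest gap between the empirical CDF of xs and a fitted CDF."""
    ordered = sorted(xs)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Hypothetical candidate set, chosen only for illustration.
CANDIDATES = {
    "normal": fit_normal,
    "lognormal": fit_lognormal,
    "exponential": fit_exponential,
}


def best_fit(xs, candidates):
    """Return the candidate name with the smallest KS statistic, plus all scores."""
    scores = {name: ks_statistic(xs, fit(xs)) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for one linguistic feature.
    feature = [rng.lognormvariate(0.0, 0.5) for _ in range(2000)]
    name, scores = best_fit(feature, CANDIDATES)
    print(name, scores)
```

Because the parameters are estimated from the same sample, standard KS p-values would be too optimistic, so this sketch only compares the statistics across candidates rather than reporting a significance level.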
In the "supervised" section, I use different distributions as the class-conditional feature distribution in a Naive Bayes classifier to predict a label such as gender or sentiment, and compare which distribution gives the best accuracy.
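
The supervised comparison can be sketched with a small Naive Bayes classifier whose per-feature distribution is pluggable. This is a minimal example under stated assumptions, not the project's implementation: the `NaiveBayes` class, the two density fitters (Gaussian and Poisson), the labels `"a"` and `"b"`, and the synthetic count data are all hypothetical and chosen only to show how accuracy can be compared across distributional assumptions.

```python
import math
import random
from collections import defaultdict


def gaussian_fit(xs):
    """Fit a Gaussian by maximum likelihood; return its log-density."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs) + 1e-9  # avoid zero variance
    return lambda x: -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def poisson_fit(xs):
    """Fit a Poisson by maximum likelihood; return its log-probability."""
    lam = sum(xs) / len(xs) + 1e-9  # avoid log(0)
    return lambda x: x * math.log(lam) - lam - math.lgamma(x + 1)


class NaiveBayes:
    """Naive Bayes with an interchangeable per-feature distribution."""

    def __init__(self, fit_density):
        self.fit_density = fit_density

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        n = len(y)
        self.log_prior = {c: math.log(len(rows) / n) for c, rows in by_class.items()}
        # One fitted log-density per (class, feature) pair.
        self.log_pdfs = {
            c: [self.fit_density(list(col)) for col in zip(*rows)]
            for c, rows in by_class.items()
        }
        return self

    def _score(self, c, row):
        return self.log_prior[c] + sum(f(x) for f, x in zip(self.log_pdfs[c], row))

    def predict(self, X):
        return [max(self.log_prior, key=lambda c: self._score(c, row)) for row in X]


def accuracy(model, X, y):
    preds = model.predict(X)
    return sum(p == t for p, t in zip(preds, y)) / len(y)


def sample_poisson(rng, lam):
    """Knuth's method for drawing a Poisson count (fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


# Hypothetical per-class rates for two count features (e.g. word counts).
RATES = {"a": (2.0, 8.0), "b": (6.0, 3.0)}


def make_counts(rng, n_per_class):
    X, y = [], []
    for label, rates in RATES.items():
        for _ in range(n_per_class):
            X.append([sample_poisson(rng, lam) for lam in rates])
            y.append(label)
    return X, y


if __name__ == "__main__":
    rng = random.Random(0)
    X_train, y_train = make_counts(rng, 300)
    X_test, y_test = make_counts(rng, 200)
    for name, fit in [("gaussian", gaussian_fit), ("poisson", poisson_fit)]:
        model = NaiveBayes(fit).fit(X_train, y_train)
        print(name, accuracy(model, X_test, y_test))
```

Keeping the density fitter as a constructor argument means each candidate distribution is evaluated with identical training and prediction code, so any difference in accuracy comes from the distributional assumption alone.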