Human beings generally communicate information, facts, and sentiments through the spoken or written word. We do this efficiently and from an early age. Humans are also very good at processing these communications: by reading a document or hearing a speech, we can extract the author's information, facts, and sentiments quickly and accurately. This is not an easy task for a machine, however. Researchers have worked on these problems for a long time and have achieved remarkable success over the years. This branch of research is referred to as Natural Language Processing. In this project, I focus on one particular natural language processing task: sentiment analysis of written text. The project explores three approaches to representing written text as fixed-length feature vectors and uses classification algorithms to predict the sentiment the text expresses. Using a standard dataset of movie reviews, the three approaches are compared experimentally, and their limitations are noted to help identify future work.
- IMDB movie review dataset link
Feature Representation Algorithms
- Bag-of-words model (baseline)
- Distributed representation of words (Word2vec)
- Distributed representation of paragraph (Doc2vec)
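
To illustrate what a fixed-length feature vector looks like for the baseline, here is a minimal bag-of-words sketch using only the standard library. It is not the code in bow.py; the function names (`tokenize`, `build_vocabulary`, `bag_of_words`), the regex tokenizer, and the two toy reviews are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, then keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    # Assign each distinct token a fixed column index (sorted for reproducibility).
    tokens = set()
    for doc in documents:
        tokens.update(tokenize(doc))
    return {tok: i for i, tok in enumerate(sorted(tokens))}


def bag_of_words(text, vocabulary):
    # Count how often each vocabulary token occurs; unknown tokens are ignored.
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Toy reviews, purely illustrative (not taken from the IMDB dataset).
reviews = ["A great, great movie", "A dull movie"]
vocab = build_vocabulary(reviews)
vectors = [bag_of_words(r, vocab) for r in reviews]
```

Every review maps to a vector whose length equals the vocabulary size, regardless of how long the review is, which is what lets a standard classifier consume it.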
Classification Algorithms
- K Nearest Neighbour
- Random Forest
- Neural Network
- Support Vector Machine
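
As a rough illustration of how a classifier consumes fixed-length feature vectors, the following is a minimal K Nearest Neighbour sketch in plain Python. It is not the project's implementation; the helper names (`euclidean`, `knn_predict`), the choice of Euclidean distance, and the toy vectors and labels are assumptions for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    # Straight-line distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    # Rank training examples by distance to the query and keep the k closest.
    neighbours = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    # Predict the majority label among those neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy feature vectors and sentiment labels, purely illustrative.
train_vectors = [[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]]
train_labels = ["pos", "pos", "neg", "neg"]

prediction = knn_predict(train_vectors, train_labels, [1, 0, 1], k=3)
```

The same interface (a list of vectors plus a list of labels) would apply whichever feature representation produced the vectors.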
Language and Libraries
- Python 2.7
- After installing the necessary libraries, run
python bow.py for the bag-of-words model.
python word2vec.py for the Word2vec model.
python doc2vec.py for the Doc2vec model.
- All the references are mentioned in projectPaper.