# Introduction
This post examines how effective bigram features are for sentiment analysis, compared with unigram features alone.

# Environment set-up and data preparation
Let's start by setting up the environment.
To keep a clean installation that does not interfere with my existing Python packages, I created a virtual environment named sentimentVenv. The Python version is 3.6.

```console
virtualenv sentimentVenv --python=python3.6
```

Now, activate the environment.

```console
source sentimentVenv/bin/activate
```

Inside this environment, we'll need to install these libraries:
* scikit-learn
* scipy
* jupyter
* pandas

```console
pip install scikit-learn
pip install scipy
pip install jupyter
pip install pandas
```

The environment should now be ready.
The dataset can be downloaded from this link. It includes 50,000 text files, each containing one movie review. The files are stored in pos and neg directories according to their sentiment.
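If you work from the raw text files rather than a ready-made table, a loader along these lines could collect them into (review, sentiment) pairs. This is only a sketch: the function name `read_reviews`, the `*.txt` extension, and the exact `pos`/`neg` folder layout are assumptions for illustration, and the demo builds a tiny temporary directory instead of touching the real data.

```python
import tempfile
from pathlib import Path


def read_reviews(root):
    """Collect (review_text, sentiment) pairs from root/pos and root/neg."""
    rows = []
    for folder, sentiment in (('pos', 'positive'), ('neg', 'negative')):
        # sorted() keeps the file order stable across runs
        for file_path in sorted(Path(root).joinpath(folder).glob('*.txt')):
            rows.append((file_path.read_text(encoding='utf-8'), sentiment))
    return rows


# Demo on a throwaway directory with one review per class.
with tempfile.TemporaryDirectory() as tmp:
    for folder, text in (('pos', 'Loved it.'), ('neg', 'Dull and slow.')):
        Path(tmp, folder).mkdir()
        Path(tmp, folder, '0.txt').write_text(text, encoding='utf-8')
    rows = read_reviews(tmp)

print(rows)
```

The resulting list can be passed to `pd.DataFrame(rows, columns=['review', 'sentiment'])` to get the two-column layout used in the rest of this post.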

Let's load the Python libraries and have a look at the dataset.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import SGDClassifier

Let's define a function that loads the dataset and extracts the two columns we need:
* The sentiment: a binary label (positive or negative)
* The text of the movie review: string

In [68]:
import urllib.request
import pandas as pd


def download(url):
    path, _ = urllib.request.urlretrieve(url)
    return path


def load_data(path):
    dataset = pd.read_csv(path)
    return dataset

url = 'https://github.com/dipanjanS/text-analytics-with-python/blob/master/Chapter-7/movie_reviews.csv?raw=true'
path = download(url)
dataset = load_data(path)
print(dataset.head())

# prepare training and testing dataset
train_data, test_data = dataset[:25000], dataset[25000:]
X_train = train_data['review']
y_train = train_data['sentiment']
X_test = test_data['review']
y_test = test_data['sentiment']

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
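Because the split simply slices the first 25,000 rows, it is worth checking that both halves contain a similar mix of labels. A rough sketch of such a check is below; `label_shares` is a hypothetical helper, and the toy lists stand in for `y_train` and `y_test`.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the total as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Toy labels standing in for y_train and y_test (illustrative only).
toy_train = ['positive', 'negative', 'positive', 'negative']
toy_test = ['positive', 'positive', 'negative', 'positive']

train_shares = label_shares(toy_train)
test_shares = label_shares(toy_test)
print(train_shares)
print(test_shares)
```

On the real data you would call `label_shares(y_train)` and `label_shares(y_test)`. If the shares differ a lot, a shuffled or stratified split is probably a better choice than slicing.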


# Building a sentiment classifier: unigram features
Let's now get to the sentiment classification part. 
To classify text, we first have to turn it into vectors. In scikit-learn this is straightforward: we only have to pass the dataset to CountVectorizer, which tokenizes the text and converts it into a term-frequency matrix. As a refinement, we can also compute a weight for each word that reflects its importance. One such weight is the tf-idf score.
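To make the two steps concrete, here is a small sketch on two made-up sentences. It uses the default settings of `CountVectorizer` and `TfidfTransformer`, not the exact options of the pipeline in this post, so treat it as an illustration of the idea only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two toy documents that differ in a single word.
docs = ['the movie was great', 'the movie was terrible']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)            # sparse matrix, one row per document
tfidf = TfidfTransformer().fit_transform(counts)   # re-weighted, rows L2-normalised

vocab = vectorizer.vocabulary_                     # maps each token to its column index
weights = tfidf.toarray()

print(sorted(vocab))
print('great:', weights[0, vocab['great']])
print('movie:', weights[0, vocab['movie']])
```

In the first document, "great" ends up with a larger weight than "movie": "movie" occurs in both documents, so its inverse document frequency is lower and it says less about which document we are looking at.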

Let's start by defining a pipeline that builds the tf-idf matrix and passes it on to a linear classifier.

In [69]:
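def build_pipeline():
    text_clf = Pipeline([
        # Tokenize and record which unigrams occur (binary=True: presence, not counts)
        ('vect', CountVectorizer(min_df=1, stop_words='english', binary=True)),
        # Re-weight the term matrix by inverse document frequency
        ('tfidf', TfidfTransformer()),
        # Linear classifier trained with stochastic gradient descent
        ('clf', SGDClassifier(l1_ratio=0, n_jobs=-1)),
    ])
    return text_clf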

We should now be ready to feed these vectors into a classifier. 

In [70]:
text_clf = build_pipeline()
text_clf = text_clf.fit(X_train, y_train)

Now that the model is trained, let's evaluate it on the test set:

In [71]:
y_pred = text_clf.predict(X_test)
print('Accuracy: {}'.format(accuracy_score(y_test, y_pred)))
print(classification_report(y_test, y_pred))

Accuracy: 0.88936
             precision    recall  f1-score   support

   negative       0.90      0.87      0.89     12474
   positive       0.88      0.91      0.89     12526

avg / total       0.89      0.89      0.89     25000



That is roughly 88.9% accuracy, which is not bad. Tuning more parameters might yield a higher score.
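One way to do that tuning is a grid search over the pipeline's parameters. The sketch below is a hedged example on a handful of invented reviews: the parameter values are arbitrary starting points rather than recommendations, and `random_state=0` is added only to make the toy run repeatable.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy data, only to keep the example self-contained.
toy_reviews = [
    'a wonderful touching story',
    'great acting and a brilliant script',
    'loved every minute of it',
    'an excellent and moving film',
    'boring plot and terrible acting',
    'a dull waste of time',
    'awful script, I hated it',
    'bad pacing and poor direction',
]
toy_labels = ['positive'] * 4 + ['negative'] * 4

pipeline = Pipeline([
    ('vect', CountVectorizer(binary=True)),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(random_state=0)),
])

# Keys follow the "<step name>__<parameter>" convention of scikit-learn pipelines.
param_grid = {
    'tfidf__use_idf': [True, False],
    'clf__alpha': [1e-5, 1e-4, 1e-3],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring='accuracy')
search.fit(toy_reviews, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data you would pass `X_train` and `y_train` to `search.fit` and use more folds. Scores from eight toy sentences say nothing about the actual task.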

# Building a sentiment classifier: bigram features
Next, let's add bigram features. Setting ngram_range=(1, 2) makes the vectorizer extract pairs of adjacent words in addition to single words.

In [62]:
def build_pipeline():
    text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1, stop_words='english', binary=True)),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(l1_ratio=0, n_jobs=-1)),
                         ])
    return text_clf
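To see what `ngram_range=(1, 2)` changes, the sketch below runs the vectorizer's analyzer on a short phrase. It deliberately leaves out `stop_words='english'` so the effect of the n-gram setting is visible on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same token pattern as in the pipeline, with and without bigrams.
unigram_analyzer = CountVectorizer(
    ngram_range=(1, 1), token_pattern=r'\b\w+\b').build_analyzer()
bigram_analyzer = CountVectorizer(
    ngram_range=(1, 2), token_pattern=r'\b\w+\b').build_analyzer()

text = 'Not good'
unigrams = unigram_analyzer(text)
bigrams = bigram_analyzer(text)

print(unigrams)  # ['not', 'good']
print(bigrams)   # ['not', 'good', 'not good']
```

The bigram "not good" carries the negation that the two separate words lose, which is the intuition behind adding bigrams. The pipeline in this post also sets `stop_words='english'`. As far as I can tell, scikit-learn removes stop words before forming n-grams, so the features it actually sees can differ from this toy output.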

We should now be ready to feed these vectors into a classifier.

In [63]:
text_clf = build_pipeline()
text_clf = text_clf.fit(X_train, y_train)

Now that the model is trained, let's evaluate it on the test set:

In [64]:
y_pred = text_clf.predict(X_test)
print('Accuracy: {}'.format(accuracy_score(y_test, y_pred)))
print(classification_report(y_test, y_pred))

Accuracy: 0.89672
             precision    recall  f1-score   support

   negative       0.91      0.88      0.89     12474
   positive       0.88      0.91      0.90     12526

avg / total       0.90      0.90      0.90     25000



That is roughly 89.7% accuracy, a small improvement over the unigram model. Tuning more parameters might yield a higher score.
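To compare the two settings side by side without redefining the pipeline function, the n-gram range can be made a parameter. The following is a sketch on invented data: `build_ngram_pipeline` and `compare_ngram_ranges` are hypothetical helper names, and `random_state=0` is an addition for repeatability that the original pipelines do not set.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline


def build_ngram_pipeline(ngram_range=(1, 1)):
    """Build the tf-idf + SGD pipeline for the given n-gram range."""
    return Pipeline([
        ('vect', CountVectorizer(ngram_range=ngram_range, token_pattern=r'\b\w+\b',
                                 min_df=1, stop_words='english', binary=True)),
        ('tfidf', TfidfTransformer()),
        ('clf', SGDClassifier(random_state=0)),
    ])


def compare_ngram_ranges(X_train, y_train, X_test, y_test,
                         ranges=((1, 1), (1, 2))):
    """Train one model per n-gram range and return its test accuracy."""
    scores = {}
    for ngram_range in ranges:
        model = build_ngram_pipeline(ngram_range).fit(X_train, y_train)
        scores[ngram_range] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Invented toy data, only to show the call pattern.
toy_train = ['a wonderful touching story', 'great acting and a brilliant script',
             'boring plot and terrible acting', 'a dull waste of time']
toy_train_labels = ['positive', 'positive', 'negative', 'negative']
toy_test = ['a brilliant and moving story', 'terrible and dull']
toy_test_labels = ['positive', 'negative']

scores = compare_ngram_ranges(toy_train, toy_train_labels, toy_test, toy_test_labels)
print(scores)
```

With the real data, `compare_ngram_ranges(X_train, y_train, X_test, y_test)` would return one accuracy per setting. Since the original runs do not fix a random seed, small differences between runs are to be expected.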

# Conclusion
In this post we explored different features for sentiment analysis: we built sentiment classifiers using unigram and bigram features.
The classifier using unigram features reached about 88.9% accuracy. This is not bad.

To improve on this, we added bigram features. The resulting classifier reached about 89.7% accuracy, which is higher than the classifier based on unigrams alone.

I hope this tutorial was a good introduction to sentiment analysis.