# Sentiment Analysis

We will now see how to classify documents based on their sentiment.

## IMDb movie review dataset

The dataset consists of 50,000 polar movie reviews labeled as positive or negative. We will now load the dataset into a pandas **DataFrame**:

In [3]:
import pyprind
import pandas as pd
import os

basepath = "/home/alanmarazzi/Scaricati/aclImdb"

# Create progress bar for data loading
pbar = pyprind.ProgBar(50000)
labels = {'pos': 1, 'neg': 0}
rows = []

for subset in ('test', 'train'):
    for label in ('pos', 'neg'):
        path = os.path.join(basepath, subset, label)
        for filename in os.listdir(path):
            with open(os.path.join(path, filename), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            # Collect rows in a list: DataFrame.append was removed in pandas 2.0
            rows.append([txt, labels[label]])
            pbar.update()

# Build the DataFrame once, which is also faster than appending row by row
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:02:20


After loading the dataset we have to shuffle it: the reviews were read in class order, so the labels are sorted, and we don't want that ordering when we split the set into training and test data.

In [1]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('./movie_data.csv', index=False)

After shuffling, we saved the dataset as a CSV file, so it is easier to work with later. We can now read it back:

In [2]:
import pandas as pd

df = pd.read_csv('./movie_data.csv')
df.head(3)

| | review | sentiment |
|---|---|---|
| 0 | `<br /><br />There is something about seeing a ...` | 1 |
| 1 | `I fail to understand why anyone would allow a ...` | 0 |
| 2 | `Disney has yet to meet a movie it couldn't mak...` | 0 |


## Bag-of-words model

With the **bag-of-words** model we can represent text as numerical feature vectors. We first create a vocabulary of unique **tokens** (for example, words) from the entire set of documents, and then construct a feature vector for each document that contains the counts of how often each word appears in that document.
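
To make the two steps concrete, here is a minimal sketch in plain Python, without any library support. The three sentences, the `tokenize` helper and the alphabetical ordering of the vocabulary are assumptions made for illustration only.

```python
import re
from collections import Counter

# Toy corpus, assumed here purely for illustration
docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

# Vocabulary: every unique token mapped to a column index (alphabetical order)
vocabulary = {token: index for index, token in
              enumerate(sorted({t for doc in docs for t in tokenize(doc)}))}

def to_vector(text):
    # Count each token and place the count in that token's column
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]

vectors = [to_vector(doc) for doc in docs]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one entry per vocabulary word, so most entries are zero for short documents. This is why such vectors are usually stored in a sparse format.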

### Transforming words into feature vectors

To build a bag-of-words model based on the word counts in the respective documents, we can use the [**CountVectorizer**](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class in scikit-learn. This class takes an array of text data and constructs the model for us:

In [3]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'
])
bag = count.fit_transform(docs)

With just this we have constructed the **vocabulary** and the **sparse feature vectors**. Let's print the vocabulary to see what it looks like:

In [4]:
print(count.vocabulary_)

{'the': 5, 'sun': 3, 'is': 1, 'shining': 2, 'weather': 6, 'sweet': 4, 'and': 0}


The vocabulary is a dictionary that maps every unique word in the documents to an integer index. Next, let's print the feature vectors we just created:

In [7]:
print(bag.toarray())

[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]


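Each row of the array above is one document, and each column corresponds to a word index from the vocabulary. For example, the word *the* has index 5 and appears twice in the third document. These counts are called **raw term frequencies**.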

> Note that this is a **1-gram** (unigram) model, in which every token is a single word. To work with **n-grams**, we can use the *ngram_range* parameter of **CountVectorizer**.
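
As a rough illustration of what an n-gram is, the sketch below builds unigrams and bigrams from a token list by hand. The `ngrams` helper is hypothetical and is not part of scikit-learn; it only shows the idea of sliding a window over the tokens.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'the sun is shining'.split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)  # ['the', 'sun', 'is', 'shining']
print(bigrams)   # ['the sun', 'sun is', 'is shining']
```

In **CountVectorizer**, `ngram_range=(2, 2)` asks for bigrams only, while `ngram_range=(1, 2)` asks for both unigrams and bigrams.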

### Word relevancy via tf-idf

With **term frequency-inverse document frequency (tf-idf)** we can downweight words that occur frequently across many documents (such as *and*, *or*, *if*), which usually carry little discriminating information. The tf-idf is defined as the product of the term frequency and the inverse document frequency:

$$
tfidf(t,d)=tf(t,d)\times idf(t,d)
$$

Here $tf(t,d)$ is the term frequency that we introduced above, and $idf(t,d)$ can be calculated as:

$$
idf(t,d)=\log \frac{n_d}{1+df(d,t)}
$$

$n_d$ is the total number of documents, and $df(d,t)$ is the number of documents $d$ that contain the term $t$. Adding 1 to the denominator is optional: it gives a non-zero value to terms that occur in all training samples, and it avoids a division by zero for terms that occur in none. The logarithm ensures that low document frequencies are not given too much weight.
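
The two formulas can be evaluated by hand. The sketch below is a direct transcription of them, assuming the natural logarithm (the base is not fixed by the definition above) and using made-up counts for a rare and a common term in a collection of three documents.

```python
import math

def idf(n_docs, doc_freq):
    # idf(t, d) = log(n_d / (1 + df(d, t))); the natural log is an assumption
    return math.log(n_docs / (1 + doc_freq))

def tfidf(term_freq, n_docs, doc_freq):
    # tf-idf(t, d) = tf(t, d) * idf(t, d)
    return term_freq * idf(n_docs, doc_freq)

n_docs = 3
# A term that occurs once and appears in only one of the three documents
rare = tfidf(term_freq=1, n_docs=n_docs, doc_freq=1)
# A term that occurs twice in a document but appears in all three documents
common = tfidf(term_freq=2, n_docs=n_docs, doc_freq=3)
print(round(rare, 2), round(common, 2))
```

Even though the common term has the higher raw count, it ends up with the lower weight. With this exact formula the weight of a term that occurs in every document is slightly negative.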

In scikit-learn we have the [**TfidfTransformer**](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html), which takes the raw term frequencies from **CountVectorizer** as input and transforms them into tf-idfs:

In [5]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]


scikit-learn doesn't use exactly the formula we described earlier: it computes the idf slightly differently and, by default, applies **l2** normalization to each document vector, since normalization is a best practice when dealing with tf-idf.
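
To see where the numbers in the first row come from, we can redo the calculation by hand. The sketch below assumes the default settings of **TfidfTransformer** as we understand them: a smoothed idf of $\ln\frac{1+n_d}{1+df(d,t)}+1$, followed by l2 normalization. The counts and document frequencies are copied from the three example sentences.

```python
import math

n_docs = 3
# Raw counts for 'The sun is shining' in vocabulary order:
# and, is, shining, sun, sweet, the, weather
counts = [0, 1, 1, 1, 0, 1, 0]
# Number of documents that contain each term
doc_freqs = [1, 3, 2, 2, 2, 3, 2]

# Smoothed idf (assumed default): ln((1 + n_d) / (1 + df)) + 1
idfs = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freqs]
raw = [tf * w for tf, w in zip(counts, idfs)]

# l2 normalization: divide by the Euclidean length of the vector
length = math.sqrt(sum(v * v for v in raw))
normalized = [round(v / length, 2) for v in raw]
print(normalized)
```

The result matches the first row printed above, which suggests that the difference from the textbook formula comes from the smoothing term and the normalization step.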

### Cleaning text data

The simple example we saw earlier didn't require any cleaning, but usually we have to strip all unwanted characters from the data before modeling. To see why this is important, let's print the first 51 characters of the first review in the movie dataset:

In [6]:
df.loc[0, 'review'][:51]

'<br /><br />There is something about seeing a movie'

We want to remove all HTML markup and punctuation, except for emoticons, which are useful for sentiment analysis:

In [7]:
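import re

def preprocessor(text):
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace non-word characters with spaces, then append
    # the emoticons (without the '-' nose), separated by a space
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', '')
    return text.strip()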
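
To see what each regular expression does in isolation, the following sketch applies them one at a time to a made-up string. The sample text is an assumption chosen only to contain a tag, some punctuation and a few emoticons.

```python
import re

sample = '</a>This :) is :( a test :-)!'

# Step 1: drop anything that looks like an HTML tag
no_html = re.sub(r'<[^>]*>', '', sample)
# Step 2: find emoticons: eyes (: ; =), optional nose (-), mouth ( ) ( D P )
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)
# Step 3: lowercase and collapse every run of non-word characters to a space
words = re.sub(r'[\W]+', ' ', no_html.lower())

print(repr(no_html))
print(emoticons)
print(repr(words))
```

The emoticons have to be collected before step 3, because they consist only of non-word characters and would otherwise be lost.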

Let's check that our **preprocessor** works correctly, and then apply it to all the reviews in the DataFrame:

In [8]:
preprocessor(df.loc[0, 'review'][:51])

'there is something about seeing a movie'

In [9]:
df['review'] = df['review'].apply(preprocessor)

### Processing documents into tokens

Now we have to **tokenize** the documents. One technique is to split them into individual words at the whitespace characters:

In [10]:
def tokenizer(text):
    return text.split()

tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

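Another useful preprocessing technique is **word stemming**, which transforms a word into its root form so that related words map to the same stem. The stem is not always a real word, as *thu* (from *thus*) in the output below shows. We will use [**nltk**](http://www.nltk.org/) to perform [**Porter stemming**](http://www.nltk.org/api/nltk.stem.html?highlight=porter#module-nltk.stem.porter):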

In [11]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

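Finally, we have to remove **stop-words**: words such as *and*, *is* and *the* that are so common that they carry little useful information for classification. NLTK provides a list of English stop-words, which has to be downloaded once with `nltk.download('stopwords')`: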

In [12]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('runners like running and thus they run')[-10:] if w not in stop]

['runner', 'like', 'run', 'thu', 'run']
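
The filtering step itself does not depend on NLTK. As a minimal sketch, the example below uses a tiny hand-picked stop-word set (an assumption for illustration; NLTK's English list is much longer) and a hypothetical `remove_stop_words` helper.

```python
# A tiny hand-picked stop-word set, assumed for illustration only;
# NLTK's English list is much longer
stop_words = {'and', 'they', 'the', 'is', 'a'}

def remove_stop_words(tokens, stop_words):
    # Keep only the tokens that are not in the stop-word set
    return [token for token in tokens if token not in stop_words]

tokens = ['runner', 'like', 'run', 'and', 'thu', 'they', 'run']
filtered = remove_stop_words(tokens, stop_words)
print(filtered)
```

Storing the stop-words in a set rather than a list keeps each membership test fast, which can matter when filtering tens of thousands of reviews.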

## Logistic regression for document classification

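We now train a logistic regression model to classify the reviews as positive or negative. First we split the DataFrame into a training set and a test set:

In [28]:
# .loc slices include the end label, so stop at 9999 and 19999
# to keep the two sets from sharing row 10000
x_train = df.loc[:9999, 'review'].values
y_train = df.loc[:9999, 'sentiment'].values
x_test = df.loc[10000:19999, 'review'].values
y_test = df.loc[10000:19999, 'sentiment'].values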

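Next we use a grid search with 5-fold cross-validation to look for good settings for both the vectorizer and the classifier:

In [29]: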
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

In [30]:
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     # liblinear supports both the l1 and l2 penalties in the grid
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [31]:
gs_lr_tfidf.fit(x_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  5.5min


KeyboardInterrupt: 
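
The log line "Fitting 5 folds for each of 48 candidates, totalling 240 fits" follows directly from the size of the parameter grid, which also explains why the search is slow. The sketch below recounts it in plain Python. The `grid_sizes` dictionaries simply record how many options each parameter has in the two grids above, and `count_candidates` is a hypothetical helper.

```python
from itertools import product

# Number of options per parameter in each of the two grids
grid_sizes = [
    {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2, 'penalty': 2, 'C': 3},
    {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2, 'use_idf': 1,
     'norm': 1, 'penalty': 2, 'C': 3},
]

def count_candidates(sizes):
    # Enumerate the Cartesian product of the option indices and count it
    return sum(1 for _ in product(*(range(n) for n in sizes.values())))

candidates = sum(count_candidates(sizes) for sizes in grid_sizes)
folds = 5
print(candidates, candidates * folds)
```

Every extra option multiplies the number of fits, so trimming the grid (for example, fixing the tokenizer) is the most direct way to shorten the run.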