# Sentiment Analysis

We will now see how to classify documents based on their sentiment.

## IMDb movie review dataset

The dataset consists of 50,000 polar movie reviews, each labeled as positive or negative. We start by loading the dataset into a **DataFrame**.

In [3]:
import pyprind
import pandas as pd
import os

basepath = "/home/alanmarazzi/Scaricati/aclImdb"

# Create progress bar for data loading
pbar = pyprind.ProgBar(50000)
labels = {'pos': 1, 'neg': 0}
rows = []

for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            # Collect rows in a list: DataFrame.append was removed in pandas 2.0
            rows.append([txt, labels[l]])
            pbar.update()

df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:02:20


After loading the dataset we have to shuffle it. The class labels are sorted, which we don't want because we will later split the set into training and test portions.

In [1]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('./movie_data.csv', index=False)


After shuffling we saved the dataset as a CSV file, so it is easier to reload in later sessions.

In [1]:
import pandas as pd

df = pd.read_csv('./movie_data.csv')
df.head(3)

| | review | sentiment |
|---|---|---|
| 0 | `<br /><br />There is something about seeing a ...` | 1 |
| 1 | `I fail to understand why anyone would allow a ...` | 0 |
| 2 | `Disney has yet to meet a movie it couldn't mak...` | 0 |


## Bag-of-words model

With **bag-of-words** we can represent text as numerical feature vectors. First we create a vocabulary of unique **tokens** from the entire set of documents. Then we construct a feature vector for each document that contains the counts of how often each word appears in that document.
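
The two steps can be sketched with the standard library alone. This is only an illustration on two made-up sentences, assuming lowercase whitespace tokenization; the names `vocabulary` and `vectors` are chosen for this sketch.

```python
from collections import Counter

docs = ['the cat sat', 'the cat saw the dog']

# Step 1: tokenize and build the vocabulary (unique token -> column index).
tokenized = [doc.lower().split() for doc in docs]
unique_tokens = sorted({tok for doc in tokenized for tok in doc})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}

# Step 2: one count vector per document, with one entry per vocabulary word.
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts[tok] for tok in unique_tokens])

print(vocabulary)  # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Most entries of such vectors are zero once the vocabulary grows, which is why sparse matrices are normally used to store them.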

### Transforming words into feature vectors

To build a bag-of-words model based on the word counts in the respective documents, we can use the [**CountVectorizer**](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class in scikit-learn. This class takes an array of text data and constructs the model for us.

In [2]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'
])
bag = count.fit_transform(docs)

With just this call we constructed the **vocabulary** and the **sparse feature vectors**. Let's print the vocabulary to see what it looks like.

In [4]:
print(count.vocabulary_)

{'the': 5, 'sun': 3, 'is': 1, 'shining': 2, 'weather': 6, 'sweet': 4, 'and': 0}


The vocabulary is a dictionary that maps each unique word to an integer index. Next, let's print the feature vectors we just created.

In [7]:
print(bag.toarray())

[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]


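The array above has one row per document and one column per vocabulary entry. Each value is the number of times the word with that index occurs in the document; for example, the word *is* (index 1) occurs twice in the third document. These counts are called **raw term frequencies**.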

> Note that this is a **1-gram** (unigram) model, in which every token is a single word. To work with **n-grams** of other sizes, we can set the *ngram_range* parameter of **CountVectorizer**.
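
To make the idea of n-grams concrete, here is a small pure-Python sketch. The helper `word_ngrams` is hypothetical and written only for this illustration; its `(n_min, n_max)` arguments mirror the `(min_n, max_n)` tuple that *ngram_range* expects.

```python
def word_ngrams(tokens, n_min, n_max):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

tokens = 'the sun is shining'.split()
unigrams = word_ngrams(tokens, 1, 1)
bigrams = word_ngrams(tokens, 2, 2)

print(unigrams)  # ['the', 'sun', 'is', 'shining']
print(bigrams)   # ['the sun', 'sun is', 'is shining']
```

A range such as `(1, 2)` yields both lists combined, so the vocabulary grows quickly as the upper bound increases.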

### Word relevancy via tf-idf

With **term frequency-inverse document frequency (tf-idf)** we can downweight words that occur frequently across many documents (such as *and*, *or*, *if*) and therefore carry little discriminatory information. The tf-idf is defined as the product of the term frequency and the inverse document frequency.

$$
tfidf(t,d)=tf(t,d)\times idf(t,d)
$$

Here $tf(t,d)$ is the term frequency that we introduced above, $idf(t,d)$ can be calculated as:

$$
idf(t,d)=\log \frac{n_d}{1+df(d,t)}
$$

$n_d$ is the total number of documents, and $df(d,t)$ is the number of documents $d$ that contain the term $t$. Adding 1 to the denominator is optional; it assigns a non-zero value to terms that occur in all training samples. The $\log$ ensures that low document frequencies are not given too much weight.
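
As a worked example, the formula can be evaluated for the three example sentences used above. This sketch assumes the natural logarithm (the formula does not fix a base) and takes the document frequencies from the count matrix printed earlier.

```python
import math

n_d = 3  # total number of documents

# Document frequency of each term: in how many of the 3 documents it occurs.
df = {'and': 1, 'is': 3, 'shining': 2, 'sun': 2, 'sweet': 2, 'the': 3, 'weather': 2}

# idf(t, d) = log(n_d / (1 + df(d, t))), natural log assumed.
idf = {term: math.log(n_d / (1 + count)) for term, count in df.items()}

for term, value in idf.items():
    print('%-8s %.3f' % (term, value))
```

The rare word *and* gets the largest weight, while *is* and *the*, which occur in every document, get the smallest one (a non-zero, negative value under this formula).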

In scikit-learn we have the [**TfidfTransformer**](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html), which takes the raw term frequencies from **CountVectorizer** as input and transforms them into tf-idfs.

In [5]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]


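scikit-learn does not use exactly the formula we described earlier. With its default settings, **TfidfTransformer** computes a smoothed inverse document frequency, $idf(t,d)=\log\frac{1+n_d}{1+df(d,t)}+1$ (natural logarithm), multiplies it by the raw term frequency, and then applies **l2** normalization so that every document vector has unit length. Normalization is a common practice when working with tf-idf.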
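
The printed values can be reproduced by hand with NumPy. The sketch below assumes the default settings of **TfidfTransformer** (smoothed idf, l2 normalization) and hard-codes the count matrix shown earlier.

```python
import numpy as np

# Raw term frequencies of the three example sentences (rows = documents).
counts = np.array([[0, 1, 1, 1, 0, 1, 0],
                   [0, 1, 0, 0, 1, 1, 1],
                   [1, 2, 1, 1, 1, 2, 1]], dtype=float)

n_d = counts.shape[0]
df = (counts > 0).sum(axis=0)            # document frequency per term

idf = np.log((1 + n_d) / (1 + df)) + 1   # smoothed idf
tfidf = counts * idf                     # tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # l2-normalize each row

print(np.round(tfidf, 2))
```

Rounded to two decimals, the result matches the output of the transformer above.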

### Cleaning text data

The simple example we saw earlier didn't require any cleaning, but usually we have to strip all unwanted characters from the data before modeling. To see why this is important, let's print the first characters of the first review in the movie dataset.

In [6]:
df.loc[0, 'review'][:51]

'<br /><br />There is something about seeing a movie'

We want to remove all HTML markup and punctuation, except for emoticons.

In [3]:
import re

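def preprocessor(text):
    # Raw strings keep the backslashes in the patterns from being read as escapes
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Replace runs of non-word characters with one space, then append the
    # emoticons (without the '-' nose), separated from the text by a space
    text = re.sub(r'[\W]+', ' ', text.lower()).strip()
    return ' '.join([text] + [e.replace('-', '') for e in emoticons])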

Let's check that our **preprocessor** works correctly, and then apply it to all the reviews in the DataFrame.

In [8]:
preprocessor(df.loc[0, 'review'][:51])

'there is something about seeing a movie'

In [4]:
df['review'] = df['review'].apply(preprocessor)
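
The example review contains no emoticons, so here is a second, self-contained check on a made-up string. The function `clean_text` is a stand-in written for this sketch; it follows the same three steps (strip markup, collect emoticons, drop the remaining punctuation).

```python
import re

def clean_text(text):
    """Strip HTML tags and punctuation, keep emoticons at the end."""
    text = re.sub(r'<[^>]*>', '', text)                      # remove markup
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()).strip()       # drop punctuation
    # Emoticons are normalized by removing the '-' nose, e.g. ':-)' -> ':)'.
    return ' '.join([text] + [e.replace('-', '') for e in emoticons])

cleaned = clean_text('</a>This :) is :( a test :-)!')
print(cleaned)  # this is a test :) :( :)
```

Moving the emoticons to the end changes the word order, which does not matter for a bag-of-words model because it only counts occurrences.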

### Processing documents into tokens

Now we have to **tokenize** the documents. One simple technique is to split each document into individual words at its whitespace characters.

In [10]:
def tokenizer(text):
    return text.split()

tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

Another useful preprocessing technique is **word stemming**, which reduces a word to its root form so that related words map to the same stem. We will use [**nltk**](http://www.nltk.org/) to perform [**Porter stemming**](http://www.nltk.org/api/nltk.stem.html?highlight=porter#module-nltk.stem.porter).

In [5]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

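Finally, we have to remove **stop-words**, that is, very common words such as *and* or *they* that carry little information for telling documents apart.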

In [6]:
from nltk.corpus import stopwords

# Requires a one-time download of the corpus: nltk.download('stopwords')
stop = stopwords.words('english')
[w for w in tokenizer_porter('runners like running and thus they run') if w not in stop]

['runner', 'like', 'run', 'thu', 'run']

## Logistic regression for document classification

In [7]:
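# Positional slices keep the two sets disjoint: label-based .loc slices include
# the end label, which would put row 10000 in both the training and the test set
x_train = df['review'].iloc[:10000].values
y_train = df['sentiment'].iloc[:10000].values
x_test = df['review'].iloc[10000:20000].values
y_test = df['sentiment'].iloc[10000:20000].values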

In [8]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

# liblinear supports both the 'l1' and the 'l2' penalty used in the grid
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0,
                                                solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [12]:
gs_lr_tfidf.fit(x_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 22.8min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 99.3min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 128.5min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', '...se_idf': [False], 'vect__norm': [None], 'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
    

We replaced the **CountVectorizer** and **TfidfTransformer** with [**TfidfVectorizer**](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which combines the two. Our *param_grid* consisted of two parameter dictionaries: in the first one we used the default tf-idf settings of **TfidfVectorizer**, and in the second one we set *use_idf=False* and *norm=None* to train a model based on raw term frequencies.

For the logistic regression classifier we trained models using *L2* and *L1* regularization and compared different regularization strengths through the inverse regularization parameter *C*.
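
The number of models that the grid search has to fit follows directly from the grid. The sketch below recounts it with plain Python; the strings `'stop'`, `'tokenizer'` and `'tokenizer_porter'` are placeholders for the actual objects, since only the number of options per parameter matters here.

```python
from math import prod

grid_tfidf = {'vect__ngram_range': [(1, 1)],
              'vect__stop_words': ['stop', None],
              'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
              'clf__penalty': ['l1', 'l2'],
              'clf__C': [1.0, 10.0, 100.0]}

# The second dictionary adds two single-valued settings for raw term frequencies.
grid_raw_tf = dict(grid_tfidf, vect__use_idf=[False], vect__norm=[None])

# Candidates per dictionary = product of the number of options per parameter.
n_candidates = sum(prod(len(options) for options in grid.values())
                   for grid in (grid_tfidf, grid_raw_tf))
n_fits = n_candidates * 5  # 5-fold cross-validation

print(n_candidates, n_fits)  # 48 240
```

This agrees with the "48 candidates, totalling 240 fits" reported in the log above, and explains why the search takes so long: every extra option in one parameter multiplies the total.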

Now we can check which parameter set performed best.

In [13]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7fb038a65730>} 


We obtained the best grid search results using the regular tokenizer without Porter stemming, without stop-word removal, and with tf-idf features, in combination with a logistic regression classifier that uses L2 regularization with the inverse regularization strength *C=10.0*.

Now let's print the accuracy of the best model, check the test accuracy, and persist the model to disk to avoid retraining.

In [14]:
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

CV Accuracy: 0.882


In [15]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(x_test, y_test))

Test Accuracy: 0.880


In [16]:
# joblib is a standalone package; sklearn.externals.joblib was removed from scikit-learn
import joblib

joblib.dump(clf, 'lr_movies', compress=1)

['lr_movies']

## Working with bigger data - online algorithms and out-of-core learning

It is common to work with datasets that are much larger than the previous one. We will now apply **out-of-core learning** to avoid memory issues.

We saw **stochastic gradient descent** earlier. Now we will use the [**partial_fit**](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.partial_fit) method of the **SGDClassifier** to stream documents directly from disk and train a logistic regression model on small minibatches of documents.

First, we define a **tokenizer** function that cleans the raw text from the movie_data.csv file we constructed previously and removes stop-words.

In [1]:
import numpy as np
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')

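def tokenizer(text):
    # Raw strings keep the backslashes in the patterns from being read as escapes
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # The explicit join keeps a space between the last word and the first emoticon
    text = re.sub(r'[\W]+', ' ', text.lower()).strip()
    text = ' '.join([text] + [e.replace('-', '') for e in emoticons])
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized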


Next we define a generator function **stream_docs** that reads in and yields one document at a time.

In [2]:
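def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv_file:
        next(csv_file)  # skip header
        for line in csv_file:
            # Each line ends with ',<label>\n': the label is the second-to-last
            # character and the review text is everything before the comma
            text, label = line[:-3], int(line[-2])
            yield text, label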

Let's verify that the **stream_docs** function works correctly by reading the first document from the dataset which should return a tuple consisting of the review text and the class label

In [3]:
next(stream_docs('./movie_data.csv'))

('"<br /><br />There is something about seeing a movie in a good, old-fashioned movie house that adds enormous appeal to every picture. I, fortunately enough, was able to see at Film Forum in New York City a pair of Ernst Lubitsch comedies during their three week tribute to the legendary director. The double feature I attended was a screening of Lubitsch\'s 1938 comedy Bluebeard\'s Eighth Wife and the pre-Code classic Design for Living, neither of which I had seen before. Everything I read of Design for Living praised the film, but I could not find a good review anywhere for Bluebeard\'s Eighth Wife. Leonard Maltin disliked it.VideoHound, too, gave the comedy a low rating.its IMDB score was not complimentary.and Pauline Kael (not a great surprise) blasted the film in her scathing review. So, when I went into the city that day I was expecting to enjoy Bluebeard\'s Eighth Wife only slightly and love Design for Living completely. Bluebeard\'s Eighth Wife (which was showing first) began, a
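
Slicing each line by position works for this file because every record sits on one line and ends with a single-digit label. As an alternative sketch, the standard `csv` module can do the parsing; `stream_docs_csv` and the toy file below are made up for this illustration and are not part of the original notebook.

```python
import csv
import os
import tempfile

def stream_docs_csv(path):
    """Yield (text, label) pairs from a two-column CSV file, one row at a time."""
    with open(path, 'r', encoding='utf-8', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for text, label in reader:
            yield text, int(label)

# Toy data with a comma, quotes and a line break inside the review text.
rows = [('Great film, "really" great', 1), ('Dull.\nVery dull.', 0)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_reviews.csv')
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['review', 'sentiment'])
        writer.writerows(rows)
    docs = list(stream_docs_csv(path))

print(docs == rows)  # True
```

Unlike the slicing approach, the reader also removes the surrounding CSV quotes, so the text comes back without the leading `"` visible in the output above.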

Now let's define a function **get_minibatch** that takes a document stream from the **stream_docs** function and returns the number of documents specified by the *size* parameter.

In [4]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
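
To see how the function behaves when the stream runs out, we can feed it a small in-memory stream. The five toy documents are made up for this sketch, and the function is repeated so that the example is self-contained.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# A toy stream of 5 (text, label) pairs instead of the CSV file.
toy_stream = iter([('doc %d' % i, i % 2) for i in range(5)])

first = get_minibatch(toy_stream, size=2)
second = get_minibatch(toy_stream, size=2)
third = get_minibatch(toy_stream, size=2)

print(first)   # (['doc 0', 'doc 1'], [0, 1])
print(second)  # (['doc 2', 'doc 3'], [0, 1])
print(third)   # (None, None)
```

Note that the last call returns `(None, None)` although one document was still available: an incomplete final batch is discarded. This is harmless when the number of documents is a multiple of the batch size.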

Unfortunately we can't use the **CountVectorizer** for out-of-core learning since it needs the whole vocabulary in memory, and the **TfidfVectorizer** needs to keep all feature vectors of the training set in memory to compute the inverse document frequencies. What we can use instead is the [**HashingVectorizer**](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html), which is data-independent and makes use of the hashing trick.

In [5]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

In [6]:
vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)

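# 'log_loss' is the logistic regression loss (named 'log' before scikit-learn 1.1).
# The old n_iter argument was removed; each partial_fit call makes one pass
# over the minibatch it receives.
clf = SGDClassifier(loss='log_loss',
                    random_state=1)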

doc_stream = stream_docs('./movie_data.csv')

We initialized **HashingVectorizer** with our tokenizer function and set the number of features to $2^{21}$. Afterwards we initialized the **SGDClassifier** with the logistic loss (*loss='log_loss'*, named *'log'* in scikit-learn releases before 1.1), which makes it a logistic regression model. Notice that a large number of features in the **HashingVectorizer** reduces the chance of hash collisions, but at the same time increases the number of coefficients in our logistic regression.
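
The hashing trick itself fits in a few lines. The sketch below is an illustration only: it uses MD5 from the standard library as a stand-in hash, whereas scikit-learn uses a different hash function (MurmurHash3), so the indices will not match those of **HashingVectorizer**. The function name `hashed_counts` is hypothetical.

```python
import hashlib
from collections import Counter

def hashed_counts(tokens, n_features):
    """Map each token to a column index via a hash; no vocabulary is stored."""
    counts = Counter()
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).digest()
        index = int.from_bytes(digest[:8], 'big') % n_features
        counts[index] += 1  # colliding tokens share one column
    return counts

tokens = 'the sun is shining and the weather is sweet'.split()

wide = hashed_counts(tokens, n_features=2**21)   # collisions are unlikely
narrow = hashed_counts(tokens, n_features=4)     # collisions are guaranteed

print(len(wide), len(narrow))
```

With only 4 columns the 7 distinct words must share columns, which is the trade-off described above: fewer features mean fewer coefficients but more collisions. Because the index depends only on the token, a new document can be vectorized without having seen the rest of the data.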

It's time to start the out-of-core learning!

In [8]:
import pyprind

pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])

for _ in range(45):
    x_train, y_train = get_minibatch(doc_stream, size = 1000)
    if not x_train:
        break
    x_train = vect.transform(x_train)
    clf.partial_fit(x_train, y_train, classes=classes)
    pbar.update()

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:27


We iterated over 45 minibatches of documents, where each minibatch consists of 1000 documents. Now we will use the remaining 5000 documents to evaluate the performance of our model.

In [9]:
x_test, y_test = get_minibatch(doc_stream, size=5000)
x_test = vect.transform(x_test)
print('Accuracy: %.3f' % clf.score(x_test, y_test))

Accuracy: 0.869


We reached 87% accuracy, slightly below the 88% test accuracy of the grid-searched model, but training took less than a minute instead of about two hours.

Finally, we can update the model with the test documents as well, so that it has seen all the available data.

In [10]:
clf = clf.partial_fit(x_test, y_test)

## Persisting the model (cross-reference with Ch 9)

To persist a model we can use [**pickle**](https://docs.python.org/3/library/pickle.html), which serializes Python object structures to disk and de-serializes them again.

In [11]:
import pickle
import os

dest = os.path.join('movieclassifier', 'pkl_objects')

if not os.path.exists(dest):
    os.makedirs(dest)

# Context managers make sure the files are closed after writing
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
    pickle.dump(stop, f, protocol=4)
with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=4)
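
Loading the objects back works the same way in reverse. The round trip below is a minimal sketch that uses a short stand-in list instead of the real stop-word list and a temporary directory instead of `movieclassifier/pkl_objects`.

```python
import os
import pickle
import tempfile

stop_words = ['a', 'an', 'the']  # stand-in for the real stop-word list

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'stopwords.pkl')

    # Serialize the object to disk ...
    with open(path, 'wb') as f:
        pickle.dump(stop_words, f, protocol=4)

    # ... and de-serialize it again, e.g. in a new Python session.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == stop_words)  # True
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files that come from a trusted source.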