# Logistic Regression for Sentiment Analysis

Adapted from http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/outofcore_modelpersistence.ipynb

<br>
<br>

## The IMDb Movie Review Dataset

In this section, we will train a simple logistic regression model to classify movie reviews from the 50k IMDb review dataset collected by Maas et al.

> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews drawn from the original "train" and "test" subdirectories: 25,000 positive and 25,000 negative. The class labels are binary (1 = positive, 0 = negative).
For simplicity, I assembled the reviews in a single CSV file.


## Data Exploration

In [None]:
import pandas as pd
# if you want to download the original file:
#df = pd.read_csv('https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/50k_imdb_movie_reviews.csv')
# otherwise load local file
df = pd.read_csv('shuffled_movie_data.csv')
# strip HTML tags and emoticons, then collapse runs of non-word characters into one space
tmp = df['review'].str.replace(r'<[^>]*>|(?::|;|=)(?:-)?(?:\)|\(|D|P)', '', regex=True).str.replace(r'\W+', ' ', regex=True)
# keep the exclamation marks (removed by the step above) as a prefix, then append the lower-cased text
df['review'] = df['review'].str.findall(r'[!¡]').str.join(' ') + " " + tmp.str.lower()
df.tail(n=5)
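
To see what this cleaning step does to a single review, here is a minimal sketch that mirrors the pandas expressions with the standard `re` module. The helper name `clean_review` and the sample sentence are made up for illustration, and the exclamation-mark class is written here as `[!¡]`.

```python
import re

# Patterns mirroring the cell above, written as raw strings.
TAG_OR_EMOTICON = re.compile(r'<[^>]*>|(?::|;|=)(?:-)?(?:\)|\(|D|P)')
NON_WORD = re.compile(r'\W+')

def clean_review(text):
    """Apply the notebook's cleaning steps to one string (illustrative helper)."""
    marks = ' '.join(re.findall(r'[!¡]', text))   # remember the exclamation marks
    body = TAG_OR_EMOTICON.sub('', text)          # drop HTML tags and emoticons
    body = NON_WORD.sub(' ', body).lower()        # punctuation runs become one space
    return marks + ' ' + body

sample = "<br />This movie was GREAT!! :-) Loved it."
cleaned = clean_review(sample)
print(repr(cleaned))  # '! ! this movie was great loved it '
```

The exclamation marks end up as a prefix because the non-word substitution would otherwise erase them, and a later feature depends on their presence.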

In [None]:
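positive = pd.read_csv('positive-words.txt')
# regex=True makes pandas interpret the patterns as regular expressions
positive['words'] = positive['words'].str.replace(r'<[^>]*>|(?::|;|=)(?:-)?(?:\)|\(|D|P)', '', regex=True).str.replace(r'\W+', ' ', regex=True)
positive.describe()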

In [None]:
negative = pd.read_csv('negative-words.txt', encoding="ISO-8859-1")
# regex=True makes pandas interpret the patterns as regular expressions
negative['words'] = negative['words'].str.replace(r'<[^>]*>|(?::|;|=)(?:-)?(?:\)|\(|D|P)', '', regex=True).str.replace(r'\W+', ' ', regex=True)
negative.describe()

Let us shuffle the rows so that the class labels appear in random order.

In [None]:
import numpy as np
## run this cell only if you have downloaded the original file:
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df[['review', 'sentiment']].to_csv('shuffled_movie_data.csv', index=False)
df.head()

<br>
<br>

## Data Preprocessing

First, we define a generator that returns the document body and the corresponding class label:

In [None]:
def stream_docs(path):
    with open(path, 'r', encoding="utf8") as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
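
A quick way to check the generator is to run it on a tiny file with the same `review,sentiment` layout. The sketch below repeats the function so that it runs on its own; the two reviews are invented, and it assumes every line ends with `,<label>` followed by a newline.

```python
import os
import tempfile

def stream_docs(path):
    with open(path, 'r', encoding='utf8') as csv_file:
        next(csv_file)  # skip header
        for line in csv_file:
            # each line ends with ",<label>\n": cut those 3 characters off the text
            text, label = line[:-3], int(line[-2])
            yield text, label

# Write a tiny, made-up file in the same layout as shuffled_movie_data.csv.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8') as tmp:
    tmp.write('review,sentiment\n')
    tmp.write('a wonderful little film,1\n')
    tmp.write('dull and far too long,0\n')
    toy_path = tmp.name

first = next(stream_docs(toy_path))
all_docs = list(stream_docs(toy_path))
os.remove(toy_path)
print(first)  # ('a wonderful little film', 1)
```

Note that the slicing relies on the trailing newline, so a final line without one would be parsed incorrectly.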

After confirming that our `stream_docs` function works, we will now implement a `get_minibatch` function to fetch a specified number (`size`) of documents:

In [None]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y
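
As written, `get_minibatch` lets a `StopIteration` escape if the stream runs out before `size` documents have been read. One possible way to handle this is sketched below; `get_minibatch_safe` is a hypothetical variant, not part of the notebook, and it simply returns a shorter batch.

```python
def get_minibatch_safe(doc_stream, size):
    """Hypothetical variant of get_minibatch that tolerates a short stream."""
    docs, y = [], []
    for _ in range(size):
        try:
            text, label = next(doc_stream)
        except StopIteration:
            break  # stream exhausted: return what we have so far
        docs.append(text)
        y.append(label)
    return docs, y

# A three-item toy stream, read in batches of two.
toy_stream = iter([('good', 1), ('bad', 0), ('fine', 1)])
batch_1 = get_minibatch_safe(toy_stream, size=2)
batch_2 = get_minibatch_safe(toy_stream, size=2)
print(batch_1)  # (['good', 'bad'], [1, 0])
print(batch_2)  # (['fine'], [1])
```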

Next, we will make use of the "hashing trick" through scikit-learn's [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) to create a bag-of-words model of our documents. Details of the bag-of-words model for document classification can be found at [Naive Bayes and Text Classification I - Introduction and Theory](http://arxiv.org/abs/1410.5329).
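
The idea behind the hashing trick can be sketched in a few lines of plain Python. This is a toy illustration only, not scikit-learn's implementation: it uses MD5 as a stable hash and a deliberately small `n_features`, both chosen here for readability.

```python
import hashlib

def hashed_bag_of_words(text, n_features=16):
    """Toy hashing trick: map each token to a column via a stable hash."""
    counts = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode('utf8')).hexdigest()
        index = int(digest, 16) % n_features   # column assigned to this token
        counts[index] += 1
    return counts

vec_a = hashed_bag_of_words('a great great movie')
vec_b = hashed_bag_of_words('great movie a great')
print(vec_a == vec_b)  # True: word order does not matter in a bag of words
```

No vocabulary has to be stored, which is what makes the approach suitable for out-of-core learning; the price is that different tokens can collide in the same column.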

## Deriving New Features

In [None]:
# from sklearn.feature_extraction.text import HashingVectorizer
# vect = HashingVectorizer(decode_error='ignore', 
#                          n_features=2**21,
#                          preprocessor=None, 
#                          tokenizer=tokenizer)

# Exercise 1: define new features according to https://web.stanford.edu/~jurafsky/slp3/5.pdf

def getFeatures(X_test, positive, negative):
    '''
        x1 = 0 # count positive words
        x2 = 0 # count negative words
        x3 = 0 # 'no' in doc
        x4 = 0 # count pronouns
        x5 = 0 # '!' in doc
        x6 = 0 # log(count of words)
        x7 = 1 # only used to obtain the bias
    '''
    
    df = pd.DataFrame(data={'review': X_test,
                            'x1': np.zeros(len(X_test), dtype=int),
                            'x2': np.zeros(len(X_test), dtype=int),
                            'x3': np.zeros(len(X_test), dtype=int),
                            'x4': np.zeros(len(X_test), dtype=int),
                            'x5': np.zeros(len(X_test), dtype=int),
                            'x6': np.zeros(len(X_test), dtype=int),
                            'x7': np.ones(len(X_test), dtype=int)})
    
    for i in range(len(positive)):
        counter = df['review'].str.count(positive['words'][i])
        df['x1'] += counter

    for i in range(len(negative)):
        counter = df['review'].str.count(negative['words'][i])
        df['x2'] += counter

    df['x3'] = df['review'].str.contains('no ', case=False).astype(int) + \
                df['review'].str.contains('not ', case=False).astype(int) + \
                df['review'].str.contains('don ', case=False).astype(int) + \
                df['review'].str.contains('dont ', case=False).astype(int) + \
                df['review'].str.contains('doesn ', case=False).astype(int) + \
                df['review'].str.contains('doesnt ', case=False).astype(int) + \
                df['review'].str.contains('doesnot ', case=False).astype(int)
    
    df['x4'] = df['review'].str.count('i ') + \
                df['review'].str.count(' me')  + \
                df['review'].str.count('my ')  + \
                df['review'].str.count(' mine') + \
                df['review'].str.count('we ') + \
                df['review'].str.count(' us')  + \
                df['review'].str.count('our')  + \
                df['review'].str.count('ours')  + \
                df['review'].str.count('you ')  + \
                df['review'].str.count('your ')  + \
                df['review'].str.count('yours ')  + \
                df['review'].str.count('myself')  + \
                df['review'].str.count('ourselves')  + \
                df['review'].str.count('yourselves')  + \
                df['review'].str.count(' u ')

    df['x5'] = df['review'].str.contains('!', case=False).astype(int)
#     df['x5'] = df['review'].str.count('!') + df['review'].str.count('¡')
    df['x6'] = np.log(df['review'].str.count(' '))
    
    df = df.drop(['review'], axis=1)
    
    return df
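
To make the seven features concrete, here is a small worked example for a single review. It is a sketch under simplifying assumptions: the lexicons are tiny made-up sets, and words are matched as whole tokens rather than as substrings, so the counts can differ from what `getFeatures` produces.

```python
import math
import re

# Tiny made-up lexicons; the notebook loads much larger word lists.
POSITIVE = {'great', 'enjoyable', 'love'}
NEGATIVE = {'boring', 'awful', 'hate'}
PRONOUNS = {'i', 'me', 'my', 'mine', 'we', 'us', 'our', 'ours',
            'you', 'your', 'yours'}

def toy_features(review):
    tokens = re.findall(r"[a-z']+", review.lower())
    x1 = sum(t in POSITIVE for t in tokens)   # count of positive words
    x2 = sum(t in NEGATIVE for t in tokens)   # count of negative words
    x3 = int('no' in tokens)                  # 1 if "no" occurs
    x4 = sum(t in PRONOUNS for t in tokens)   # first/second person pronouns
    x5 = int('!' in review)                   # 1 if "!" occurs
    x6 = math.log(len(tokens))                # log of the word count
    return [x1, x2, x3, x4, x5, x6, 1]        # trailing 1 is the bias input

features = toy_features("I love my popcorn, but no, the plot was boring!")
print(features[:5])  # [1, 1, 1, 2, 1]
```

The review has ten tokens, so `x6` is `log(10)`, roughly 2.30.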


## Implementing a Logistic Regression Classifier with Regularization

In [None]:
#from sklearn.linear_model import SGDClassifier
#clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
# doc_stream = stream_docs(path='shuffled_movie_data.csv')

# Exercise 2: implement a Logistic Regression classifier, using regularization, according to https://web.stanford.edu/~jurafsky/slp3/5.pdf

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

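def stocashticGradientDescent(X, Y, theta, alpha=0.5, lamda=0.2, iterations=10):
    # Despite the name, each iteration is one gradient step over the whole batch passed in.
    m = len(Y)
    X_i = X.values
    Y_i = np.asarray(Y)

    for it in range(iterations):
        prediction = sigmoid(np.dot(X_i, theta))
        # gradient of the summed cross-entropy loss with respect to theta
        g = np.dot(prediction - Y_i, X_i)
        # L2 regularization shrinks theta, so its term is subtracted as well
        theta = theta - alpha * ((1 / m) * g + lamda * theta)

    return theta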
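
The effect of a regularized update can be checked on toy data without NumPy. The sketch below assumes an L2 penalty of the form `(lam / 2) * ||theta||^2` added to the mean cross-entropy loss, so each step subtracts `alpha * (gradient + lam * theta)`; the data, the step size, and the helper names are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(X, y, theta, lam):
    """Mean cross-entropy plus an L2 penalty (lam / 2) * ||theta||^2."""
    ce = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(w * v for w, v in zip(theta, xi)))
        ce -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return ce / len(y) + 0.5 * lam * sum(w * w for w in theta)

def gradient_step(X, y, theta, alpha, lam):
    """One batch gradient-descent step on the regularized loss."""
    m = len(y)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = sigmoid(sum(w * v for w, v in zip(theta, xi))) - yi
        for j, v in enumerate(xi):
            grad[j] += err * v / m
    # the penalty contributes lam * w to the gradient of each weight
    return [w - alpha * (g + lam * w) for w, g in zip(theta, grad)]

# Toy data: one feature plus a constant bias column (values are made up).
X = [[2.0, 1.0], [1.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]]
y = [1, 1, 0, 0]
theta = [0.0, 0.0]
loss_before = loss(X, y, theta, lam=0.01)
for _ in range(50):
    theta = gradient_step(X, y, theta, alpha=0.5, lam=0.01)
loss_after = loss(X, y, theta, lam=0.01)
print(loss_after < loss_before)  # True
```

Like the notebook's `theta`, this toy version also penalizes the bias weight; leaving the bias unregularized is a common alternative.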


We now train on 45,000 samples:

In [None]:
#import pyprind
#pbar = pyprind.ProgBar(45)

# classes = np.array([0, 1])

# for _ in range(45):
#     X_train, Y_train = get_minibatch(doc_stream, size=1000)
#     X_train = vect.transform(X_train)
#     clf.partial_fit(X_train, Y_train, classes=classes)
#     #pbar.update()

doc_stream = stream_docs(path='shuffled_movie_data.csv')
theta = np.zeros(7)  # one float weight per feature, including the bias

%xmode plain
%pdb on
for i in range(45):
    X_train, Y_train = get_minibatch(doc_stream, size=1000)
    X_train = getFeatures(X_train, positive, negative)
    theta = stocashticGradientDescent(X_train, Y_train, theta, alpha=0.15, lamda=0.2, iterations=1)
    

We now test the classifier on the remaining 5,000 samples:

In [None]:
X_test, Y_test = get_minibatch(doc_stream, size=5000)
X_test = getFeatures(X_test, positive, negative)
Y_pred = sigmoid(np.dot(X_test,theta))

from sklearn.metrics import confusion_matrix, accuracy_score
accuracy_score(Y_test, (Y_pred>0.5).astype(int))
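
Accuracy alone hides which kind of error the classifier makes. As a minimal sketch, the confusion-matrix counts can be computed by hand from thresholded probabilities; the labels and probabilities below are invented, and `confusion_counts` is an illustrative helper rather than a scikit-learn function.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tn, fp, fn, tp) after thresholding predicted probabilities."""
    tn = fp = fn = tp = 0
    for truth, prob in zip(y_true, y_prob):
        pred = int(prob > threshold)
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]   # made-up probabilities
tn, fp, fn, tp = confusion_counts(y_true, y_prob)
accuracy = (tp + tn) / len(y_true)
print(tn, fp, fn, tp)  # 2 1 1 2
```

Here four of the six predictions are correct, with one false positive and one false negative.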

I think that the predictive performance, an accuracy of ~87%, is quite "reasonable" given that we "only" used the default parameters and didn't do any hyperparameter optimization. 

After estimating the model performance, let us use those last 5,000 test samples to update our model.

In [None]:
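from sklearn.linear_model import SGDClassifier
# loss='log_loss' and max_iter replace the older loss='log' and n_iter (assumes scikit-learn >= 1.1)
clf = SGDClassifier(loss='log_loss', random_state=1, max_iter=1)
# the first call to partial_fit must be given the full set of class labels
clf = clf.partial_fit(X_test, Y_test, classes=np.array([0, 1]))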

<br>
<br>

# Model Persistence

In the previous section, we successfully trained a model to predict the sentiment of a movie review. Unfortunately, if we closed this IPython notebook at this point, we would have to repeat the whole learning process every time we wanted to make a prediction on "new data."

So, to reuse this model, we could use the [`pickle`](https://docs.python.org/3.5/library/pickle.html) module to "serialize a Python object structure". Or even better, we could use the [`joblib`](https://pypi.python.org/pypi/joblib) library, which handles large NumPy arrays more efficiently.

To install:
conda install -c anaconda joblib
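
Before turning to `joblib`, the basic save-and-restore round trip can be sketched with the standard `pickle` module. The dictionary below is a made-up stand-in for a trained model, and the file is written to a temporary directory chosen for this example.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: a plain dict of made-up weights.
model = {'theta': [0.4, -1.2, 0.05], 'threshold': 0.5}

path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)        # serialize to disk

with open(path, 'rb') as f:
    restored = pickle.load(f)    # rebuild an equal, independent object

print(restored == model)  # True
```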

In [None]:
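from sklearn.feature_extraction.text import HashingVectorizer
# `tokenizer` is the function written to tokenizer.py further below; import it before running this cell
vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)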

In [None]:
import joblib
import os
if not os.path.exists('./pkl_objects'):
    os.mkdir('./pkl_objects')
    
joblib.dump(vect, './vectorizer.pkl')
joblib.dump(clf, './clf.pkl')

Using the code above, we "pickled" the `HashingVectorizer` and the `SGDClassifier` so that we can reuse those objects later. However, `pickle` and `joblib` have a known issue with pickling objects or functions defined in a `__main__` block: unpickling them later raises `AttributeError: Can't get attribute [x] on <module '__main__'>`. Thus, to pickle the `tokenizer` function, we write it to a file and import it so that it gets the "right" namespace.
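
The reason the namespace matters is that `pickle` stores a function by reference, that is, by its module and name, not by its code. A small sketch with a standard-library function (`json.dumps`, chosen only as a convenient example) makes this visible:

```python
import json
import pickle

payload = pickle.dumps(json.dumps)   # pickle a module-level function
restored = pickle.loads(payload)     # looks the function up again by name

# The payload holds the module and function names, not the function body.
print(b'json' in payload, b'dumps' in payload)  # True True
print(restored is json.dumps)                   # True
```

Unpickling therefore works only if the same module and name can be imported at load time, which is not the case for a function that lived in the `__main__` of an earlier notebook session.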

In [None]:
%%writefile tokenizer.py
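from nltk.stem.porter import PorterStemmer
import re
from nltk.corpus import stopwords

# requires the NLTK stopword corpus, e.g. via nltk.download('stopwords')
stop = stopwords.words('english')
porter = PorterStemmer()

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # replace non-word characters with spaces, then append the emoticons without their noses
    text = re.sub(r'\W+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    return tokenized  # return the stemmed tokens, not the unstemmed list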

In [None]:
from tokenizer import tokenizer
joblib.dump(tokenizer, './tokenizer.pkl')

Now, let us restart this IPython notebook and check whether we can load our serialized objects:

In [None]:
import joblib
tokenizer = joblib.load('./tokenizer.pkl')
vect = joblib.load('./vectorizer.pkl')
clf = joblib.load('./clf.pkl')

After loading the `tokenizer`, the `HashingVectorizer`, and the trained logistic regression model, we can use them to make predictions on new data. This is useful, for example, if we want to embed our classifier into a web application, a topic for another IPython notebook.

In [None]:
example = ['I did not like this movie']
X = vect.transform(example)
clf.predict(X)

In [None]:
example = ['I loved this movie']
X = vect.transform(example)
clf.predict(X)