<div style = "text-align:right"> Student: Antoine Moulin </div>

# SD-TSIA 214
# Computer Lab: Sentiment analysis in textual movie reviews

<b>Lab's authors:</b> Chloé Clavel, Laurence Likforman, Emile Chapuis, Hamid Jalalzai <br/>
<b>Date:</b> June 5, 2019

In [1]:
import os.path as op
import numpy as np
import re, string

In [2]:
# Load data
print("Loading dataset")

from glob import glob
filenames_neg = sorted(glob(op.join('./', 'data', 'imdb1', 'neg', '*.txt')))
filenames_pos = sorted(glob(op.join('./', 'data', 'imdb1', 'pos', '*.txt')))

texts_neg = [open(f).read() for f in filenames_neg]
texts_pos = [open(f).read() for f in filenames_pos]
texts = texts_neg + texts_pos
y = np.ones(len(texts), dtype=int)  # the np.int alias was removed in NumPy 1.24
y[:len(texts_neg)] = 0

print("%d documents" % len(texts))

Loading dataset
2000 documents


## Implementation of the classifier

### Question 1

<b>Complete the <tt>count_words</tt> function that will count the number of occurrences of each distinct word in a list of <tt>string</tt> and return <tt>vocabulary</tt> (the Python dictionary) and <tt>counts</tt>. Do not forget to delete the punctuation. Give the vocabulary size.</b>

In [3]:
def count_words(texts):
    """Vectorize text: return count of each word in the text snippets

    Parameters
    ----------
    texts: list of str
        The texts

    Returns
    -------
    vocabulary: dict
        A dictionary that points to an index in counts for each word.
    counts: ndarray, shape (n_samples, n_features)
        The counts of each word in each text.
        n_samples == number of documents.
        n_features == number of words in vocabulary.
    """
    
    # initialization
    table = str.maketrans('', '', string.punctuation) # to remove punctuation
    words = set()
    vocabulary = dict()
    
    # build the dictionary
    cleaned_texts = []
    index = 0
    
    for text in texts:
        text = text.translate(table).lower().replace('\n', '').split() # removes punctuation 
        cleaned_texts.append(text) # just to be faster when building counts (avoid a second pre-processing)
        for word in text:
            if word not in words: # to make sure every word is added only once to the dictionary
                vocabulary[word] = index
                index += 1
                words.add(word)
                
    # build counts
    n_features = len(words)
    counts = np.zeros((len(texts), n_features))
    
    for k in range(len(cleaned_texts)):
        text = cleaned_texts[k]
        for word in list(set(text)):
            counts[k, vocabulary[word]] = text.count(word)
    
    return vocabulary, counts

In [4]:
vocabulary, counts = count_words(texts)

In [5]:
print('The size of the vocabulary is: {}'.format(counts.shape[1]))

The size of the vocabulary is: 47567
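
The counting step can also be written with <tt>collections.Counter</tt>, which counts each document in a single pass instead of calling <tt>text.count</tt> once per distinct word. The following is only a sketch on toy sentences: the name <tt>count_words_counter</tt> is hypothetical, and it relies on <tt>split()</tt> alone to handle line breaks, so its vocabulary may differ slightly from the one built above.

```python
import string
from collections import Counter

import numpy as np


def count_words_counter(texts):
    """Toy variant of count_words based on collections.Counter (hypothetical helper)."""
    table = str.maketrans('', '', string.punctuation)
    # strip punctuation, lowercase, and split on any whitespace
    tokenized = [text.translate(table).lower().split() for text in texts]

    # each new word gets the next free column index
    vocabulary = {}
    for tokens in tokenized:
        for word in tokens:
            vocabulary.setdefault(word, len(vocabulary))

    # one Counter per document gives all its word counts in a single pass
    counts = np.zeros((len(texts), len(vocabulary)))
    for k, tokens in enumerate(tokenized):
        for word, n in Counter(tokens).items():
            counts[k, vocabulary[word]] = n

    return vocabulary, counts


toy_vocabulary, toy_counts = count_words_counter(["A good, good film.", "Not a good film!"])
print(toy_vocabulary)
print(toy_counts)
```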


### Question 2

<b>Explain how positive and negative classes have been assigned to movie reviews (see <tt>poldata.README.2.0</tt> file).</b>

As explained in the file, <b>the assignment of the class is based on the rating that appears in the review</b>. The rating may take different forms: five-star system, four-star system, letter grade system. To classify a review as a positive or a negative example, the authors use ad hoc rules such as:

"With a five-star system, three-and-a-half stars and up are considered positive, two stars and below are considered negative."

The quoted rule does not say how reviews rated between two stars and three-and-a-half stars are handled. Presumably such reviews are left out, in order to limit errors in the ground truth labels.


Half stars are difficult to recognize so they may lose a half star occasionally, but this was a minor issue.
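
As an illustration, the five-star rule quoted above can be written as a small function. This is a sketch, not the script used to build the data set: the name <tt>label_from_five_star_rating</tt> is hypothetical, and returning <tt>None</tt> for in-between ratings is an assumption, since the quoted rule does not cover them.

```python
def label_from_five_star_rating(stars):
    """Return 1 (positive), 0 (negative) or None (rating not covered by the quoted rule)."""
    if stars >= 3.5:   # "three-and-a-half stars and up are considered positive"
        return 1
    if stars <= 2:     # "two stars and below are considered negative"
        return 0
    return None        # assumption: in-between ratings are left unlabeled


labels = [label_from_five_star_rating(s) for s in (4.0, 3.5, 3.0, 2.0, 0.5)]
print(labels)
```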

### Question 3

<b>Complete the <tt>NB</tt> class to implement the Naive Bayes classifier by relying on the pseudo-code of Figure 1 and its documentation below:

- The vocabulary $V$ corresponds to the set of different words composing a set of documents (<tt>vocabulary</tt> in <tt>count_words</tt>).
- $\mathbb{C}$ corresponds to all classes and $\mathbb{D}$ to the set of documents.
- The function <tt>countTokensOfTerm(text, t)</tt> represents the number of occurrences of a word <tt>t</tt> in a set of texts <tt>texts</tt> (calculation done in <tt>count_words</tt>).
- The smoothing step called Laplace smoothing (+1 line 10) allows the attribution of non-zero probability to words that would not occur in the learning set.
- The function <tt>ExtractTokensFromDoc(V, d)</tt> retrieves the list of associated words (including the duplicates) to document <tt>d</tt>.</b>

The Naive Bayes classifier uses the maximum a posteriori as a decision rule. Given an observation $o$, it predicts the class $\widehat{c}$ that maximizes the probability a posteriori, i.e.

$$ \widehat{c} = \underset{c \in \mathbb{C}}{\arg \max} \, \mathbb{P} \left( c | o \right) $$

Using Bayes' rule and the fact that $\mathbb{P} \left( o \right)$ does not depend on $c$,

$$ \widehat{c} = \underset{c \in \mathbb{C}}{\arg \max} \, \frac{\mathbb{P} \left( o | c \right) \mathbb{P} \left( c \right)}{\mathbb{P} \left(o \right)} = \underset{c \in \mathbb{C}}{\arg \max} \, \mathbb{P} \left( o | c \right) \mathbb{P} \left( c \right)$$

When implementing a Naive Bayes classifier, the goal is thus to compute these last two terms, $\mathbb{P} \left( o | c \right)$ and $\mathbb{P} \left( c \right)$.
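
To make the link with the code explicit, here is a sketch of the remaining steps. The notation is introduced for illustration only: a document $o$ is seen as a sequence of tokens $t_1, \dots, t_n$, and $T_{ct}$ denotes the number of occurrences of term $t$ in the training documents of class $c$. The "naive" assumption is that tokens are conditionally independent given the class, so that

$$ \mathbb{P} \left( o | c \right) = \prod_{i=1}^{n} \mathbb{P} \left( t_i | c \right), \qquad \mathbb{P} \left( t | c \right) = \frac{T_{ct} + 1}{\sum_{t' \in V} \left( T_{ct'} + 1 \right)} $$

where the second expression is the Laplace-smoothed estimate. Taking logarithms turns the product into a sum and avoids numerical underflow:

$$ \widehat{c} = \underset{c \in \mathbb{C}}{\arg \max} \, \left[ \log \mathbb{P} \left( c \right) + \sum_{i=1}^{n} \log \mathbb{P} \left( t_i | c \right) \right] $$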

In [6]:
from sklearn.base import BaseEstimator, ClassifierMixin

class NB(BaseEstimator, ClassifierMixin):
    def __init__(self):
        pass


    def fit(self, X, y):
        """Fit our model to the data set (X, y)

        Parameters
        ----------
        X: ndarray, shape (n_samples, vocabulary_size)
            The vectorized training set.
            n_samples == number of documents.
            vocabulary_size == number of words in vocabulary.
            
        y: ndarray, shape (n_samples,)
            The class of each document (0: negative, 1: positive).
            n_samples == number of documents.

        Returns
        -------
        self: NB
            The Naive Bayes model trained on (X, y).
        """

        # initialization
        N, vocab_size = X.shape # N: number of documents, vocab_size: number of words in vocabulary

        self.classes = list(set(y))
        self.nb_classes = len(self.classes)

        self.prior = np.zeros(self.nb_classes) # distribution of the classes
        self.cond_prob = np.zeros((vocab_size, self.nb_classes)) # conditional probabilities

        for c in self.classes:
            # prior
            N_c = sum(y == c)
            self.prior[c] = N_c / N

            # T_c, no need to concatenate texts from class c
            T_c = np.zeros(vocab_size)
            for t in range(vocab_size):
                T_c[t] = sum(X[(y == c), t]) # count tokens of term t in texts with class c
            
            # conditional probability
            normalize = sum(T_c + 1)
            for t in range(vocab_size):
                self.cond_prob[t][c] = (T_c[t] + 1) / normalize # Laplace smoothing 
        
        return self


    def predict(self, X):
        """Predict the classes of documents in X
        
        Parameters
        ----------
        X: ndarray, shape (n_samples, vocabulary_size)
            The vectorized documents to classify.
            n_samples == number of documents.
            vocabulary_size == number of words in vocabulary.
        
        Returns
        -------
        predictions: ndarray, shape (n_samples)
            The predicted classes.
            n_samples == number of documents.
        """
        
        n_samples = X.shape[0]
        predictions = np.zeros(n_samples, dtype=int)
        
        for k in range(n_samples):
            W = np.argwhere(X[k] != 0).flatten().tolist() # extract tokens from document
        
            scores = np.zeros(self.nb_classes)
            for c in self.classes:
                scores[c] = np.log(self.prior[c])
                for t in W:
                    scores[c] += X[k, t]*np.log(self.cond_prob[t][c])
        
            predictions[k] = np.argmax(scores)
        
        return predictions


    def score(self, X, y):
        return np.mean(self.predict(X) == y)

In [7]:
# Count words in text
vocabulary, X = count_words(texts)

In [8]:
# Try to fit, predict and score
nb = NB()
nb.fit(X[::2], y[::2])
print('The score is: {}'.format(nb.score(X[1::2], y[1::2])))

The score is: 0.818
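
The explicit loops over terms and documents in <tt>fit</tt> and <tt>predict</tt> can be replaced by array operations. The sketch below shows the same computation (Laplace-smoothed conditional probabilities, then log scores) on a toy count matrix. The helper names <tt>fit_nb_vectorized</tt> and <tt>predict_nb_vectorized</tt> are hypothetical and are not part of the <tt>NB</tt> class above.

```python
import numpy as np


def fit_nb_vectorized(X, y):
    """Return the classes, log-priors and Laplace-smoothed log-conditional probabilities."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # T[c, t]: number of occurrences of term t in the documents of class c
    T = np.vstack([X[y == c].sum(axis=0) for c in classes])
    smoothed = T + 1.0  # Laplace smoothing
    log_cond = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
    return classes, log_prior, log_cond


def predict_nb_vectorized(X, classes, log_prior, log_cond):
    """Score every document against every class with one matrix product."""
    scores = X @ log_cond.T + log_prior  # shape (n_samples, n_classes)
    return classes[np.argmax(scores, axis=1)]


# toy data: 4 documents, 3 terms, two classes
X_toy = np.array([[2, 0, 1], [1, 0, 0], [0, 3, 0], [0, 1, 1]], dtype=float)
y_toy = np.array([0, 0, 1, 1])

classes, log_prior, log_cond = fit_nb_vectorized(X_toy, y_toy)
pred = predict_nb_vectorized(X_toy, classes, log_prior, log_cond)
print(pred)
```

Terms with a zero count contribute nothing to the matrix product, which plays the same role as restricting the inner loop to the non-zero entries of each document.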


### Question 4

<b>Evaluate the performance of your classifier in cross-validation 5-folds.</b>

In [9]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(nb, X, y, cv=5)

print('CV score of Naive Bayes: {} (+/- {})'.format(round(cv_scores.mean(), 4), 
                                                    round(cv_scores.std() * 2, 4)))

CV score of Naive Bayes: 0.816 (+/- 0.0246)
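
When <tt>cv</tt> is an integer and the estimator is a classifier, <tt>cross_val_score</tt> uses stratified folds, which matters here because <tt>y</tt> is sorted (all negative reviews first, then all positive ones). The sketch below illustrates the idea of stratification with a simplified round-robin assignment. The helper <tt>stratified_folds</tt> is hypothetical and does not reproduce the exact fold assignment of scikit-learn.

```python
import numpy as np


def stratified_folds(y, n_splits=5):
    """Yield (train_idx, test_idx) pairs that keep the class proportions in each fold."""
    y = np.asarray(y)
    fold_of = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        # deal the documents of class c round-robin into the folds
        fold_of[idx] = np.arange(len(idx)) % n_splits
    for k in range(n_splits):
        yield np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k)


# toy labels sorted by class, like y in this notebook
y_toy = np.array([0] * 10 + [1] * 10)
folds = list(stratified_folds(y_toy, n_splits=5))

for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx), y_toy[test_idx].mean())
```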


### Question 5

<b>Change the <tt>count_words</tt> function to ignore the “stop words” in the file <tt>data/english.stop</tt>. Are the performances improved?</b>

In [10]:
def count_words(texts, stopwords_path = './data/english.stop'):
    """Vectorize text: return count of each word in the text snippets

    Parameters
    ----------
    texts: list of str
        The texts.
        
    stopwords_path: str
        Path to the list of stopwords.

    Returns
    -------
    vocabulary: dict
        A dictionary that points to an index in counts for each word.
        
    counts: ndarray, shape (n_samples, n_features)
        The counts of each word in each text.
        n_samples == number of documents.
        n_features == number of words in vocabulary.
    """
    
    # initialization
    # a set gives constant-time membership tests when filtering the words
    stopwords = set(open(stopwords_path).read().replace('\n', ' ').replace('\'', '').split())
    table = str.maketrans('', '', string.punctuation)
    words = set()
    vocabulary = dict()
    
    # build the dictionary
    cleaned_texts = []
    index = 0
    
    for text in texts:
        text = text.translate(table).lower().replace('\n', '').split() # removes punctuation
        text = [w for w in text if w not in stopwords] # removes stopwords
        cleaned_texts.append(text) # just to be faster when building counts (avoid a second pre-processing)
        for word in text:
            if word not in words: # to make sure every word is added only once in the dictionary
                vocabulary[word] = index
                index += 1
                words.add(word)
                
    # build counts
    n_features = len(words)
    counts = np.zeros((len(cleaned_texts), n_features))
    
    for k in range(len(cleaned_texts)):
        text = cleaned_texts[k]
        for word in list(set(text)):
            counts[k, vocabulary[word]] = text.count(word)
    
    return vocabulary, counts

In [11]:
# Count words in text
vocabulary, X = count_words(texts)

In [12]:
# Try to fit, predict and score
nb = NB()
nb.fit(X[::2], y[::2])
print('The score is: {}'.format(nb.score(X[1::2], y[1::2])))

The score is: 0.797


In [13]:
cv_scores = cross_val_score(nb, X, y, cv=5)

print('CV score of Naive Bayes without the stopwords: {} (+/- {})'.format(round(cv_scores.mean(), 4), 
                                                                          round(cv_scores.std() * 2, 4)))

CV score of Naive Bayes without the stopwords: 0.814 (+/- 0.0236)


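Removing the stopwords does not improve the performance: the 5-fold CV accuracy is essentially unchanged ($0.814$ versus $0.816$ with the stopwords), and the spread is slightly smaller ($0.0236$ versus $0.0246$). When the model is trained with half of the data set (two cells above), the score is lower than before ($0.797$ versus $0.818$).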

## Scikit-learn use

In [14]:
def remove_stopwords(texts, stopwords_path='./data/english.stop'):
    """Remove the stopwords and the punctuation in a list of texts
    
    Parameters
    ----------
    texts: list of str
        The texts.
        
    stopwords_path: str
        Path to the list of stopwords.
        
    Returns
    -------
    cleaned_texts: list of str
        The texts without the stopwords and the punctuation.
    """
    
    # a set gives constant-time membership tests when filtering the words
    stopwords = set(open(stopwords_path).read().replace('\n', ' ').replace('\'', '').split())
    table = str.maketrans('', '', string.punctuation)
    
    cleaned_texts = []
    
    for text in texts:
        text = text.translate(table).lower().replace('\n', '').split() # removes punctuation
        text = [w for w in text if w not in stopwords] # removes stopwords
        
        cleaned_texts.append(' '.join(text))
        
    return cleaned_texts

### Question 1

<b>Compare your implementation with <tt>scikit-learn</tt>.</b>

Here, we keep the stopwords: a few tests showed that the performance is better when they are kept, at least when we allow n-grams or when we use character substrings.
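
The vectorizers used in the next cell build their features from word n-grams (<tt>ngram_range</tt>) or from character substrings (<tt>analyzer='char'</tt>). The sketch below only illustrates what these features look like on a toy sentence. The helpers <tt>word_ngrams</tt> and <tt>char_ngrams</tt> are hypothetical and do not reproduce the tokenization of <tt>CountVectorizer</tt>, which by default ignores one-character tokens.

```python
def word_ngrams(tokens, n_min, n_max):
    """All sequences of n_min to n_max consecutive words."""
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_min, n_max):
    """All substrings of n_min to n_max consecutive characters."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


tokens = "not a good movie".split()
unigrams_and_bigrams = word_ngrams(tokens, 1, 2)
substrings = char_ngrams("bad", 1, 3)

print(unigrams_and_bigrams)
print(substrings)
```

Bigrams such as "not good" keep some of the local context (negation, for instance) that single words lose.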

In [15]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Vectorizers for the pre-processing
vect = CountVectorizer()
vect_bigrams = CountVectorizer(ngram_range=(1, 2))
vect_35grams = CountVectorizer(ngram_range=(3, 5))

vect_13substrings = CountVectorizer(analyzer='char', ngram_range=(1, 3))
vect_36substrings = CountVectorizer(analyzer='char', ngram_range=(3, 6))


# The model used
nb_sklearn = MultinomialNB()


# The pipelines
nb_bigrams = Pipeline(steps=[('bigrams', vect_bigrams), ('naive Bayes', nb_sklearn)])
nb_35grams = Pipeline(steps=[('3-5 grams', vect_35grams), ('naive Bayes', nb_sklearn)])

nb_13substrings = Pipeline(steps=[('substrings', vect_13substrings), ('naive Bayes', nb_sklearn)])
nb_36substrings = Pipeline(steps=[('substrings', vect_36substrings), ('naive Bayes', nb_sklearn)])

In [16]:
cv_scores_bigrams = cross_val_score(nb_bigrams, texts, y, cv=5)
cv_scores_35grams = cross_val_score(nb_35grams, texts, y, cv=5)

cv_scores_13substrings = cross_val_score(nb_13substrings, texts, y, cv=5)
cv_scores_36substrings = cross_val_score(nb_36substrings, texts, y, cv=5)

print('CV score of Naive Bayes with bigrams: {} (+/- {})'.format(round(cv_scores_bigrams.mean(), 4),
                                                                 round(cv_scores_bigrams.std() * 2, 4)))

print('CV score of Naive Bayes with 3-5 grams: {} (+/- {})'.format(round(cv_scores_35grams.mean(), 4),
                                                                   round(cv_scores_35grams.std() * 2, 4)))

print('CV score of Naive Bayes with substrings (length: 1-3): {} (+/- {})'.format(round(cv_scores_13substrings.mean(), 4),
                                                                                  round(cv_scores_13substrings.std() * 2, 4)))

print('CV score of Naive Bayes with substrings (length: 3-6): {} (+/- {})'.format(round(cv_scores_36substrings.mean(), 4),
                                                                                  round(cv_scores_36substrings.std() * 2, 4)))

CV score of Naive Bayes with bigrams: 0.8305 (+/- 0.0185)
CV score of Naive Bayes with 3-5 grams: 0.8135 (+/- 0.0384)
CV score of Naive Bayes with substrings (length: 1-3): 0.7475 (+/- 0.0283)
CV score of Naive Bayes with substrings (length: 3-6): 0.8215 (+/- 0.0304)


In question $4$ of the first part, our implementation reached a CV accuracy of $0.816$. Here, the model with bigrams ($0.8305$) and the one with substrings of length 3 to 6 ($0.8215$) do better, while 3-5 grams ($0.8135$) and substrings of length 1 to 3 ($0.7475$) do not. The bigram model also has the smallest spread ($0.0185$ instead of $0.0246$).

### Question 2

<b>Test another classification method from <tt>scikit-learn</tt> (e.g. <tt>LinearSVC</tt>, <tt>LogisticRegression</tt>).</b>

As shown in the previous question, allowing bigrams improves the performance, so we do the same with logistic regression.

In [17]:
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

In [18]:
# # This cell takes more than 10 minutes to execute, with results similar to the ones obtained with logistic regression
# # so it may be better not to execute it


# p_grid_lsvm = {'C': [1e-3, 1e-2, 1e-1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1e1]}

# inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
# outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# # nested CV with parameter optimization
# svc = GridSearchCV(estimator=LinearSVC(), param_grid=p_grid_lsvm, cv=inner_cv)
# nested_score = cross_val_score(svc, vect.fit_transform(texts), y, cv=outer_cv,scoring="accuracy")

# print('Nested CV score of the SVC: {} (+/- {})'.format(round(nested_score.mean(), 4),
#                                                        round(nested_score.std() * 2, 4)))

# # nested CV with parameter optimization
# nested_score_bigrams = cross_val_score(svc, vect_bigrams.fit_transform(texts), y, cv=outer_cv,scoring="accuracy")

# print('Nested CV score of the SVC (with bigrams): {} (+/- {})'.format(round(nested_score_bigrams.mean(), 4),
#                                                                       round(nested_score_bigrams.std() * 2, 4)))

In [19]:
# Logistic regression
lr = Pipeline([('vect', vect), ('logistic_regression', LogisticRegression())])
lr.set_params(logistic_regression__solver='liblinear')

cv_scores_lr = cross_val_score(lr, texts, y, cv=5)

print('CV score of the logistic regression: {} (+/- {})'.format(round(cv_scores_lr.mean(), 4), 
                                                                round(cv_scores_lr.std() * 2, 4)))


# Logistic regression with bigrams
lr_bigrams = Pipeline([('bigrams', vect_bigrams), ('logistic_regression', LogisticRegression())])
lr_bigrams.set_params(logistic_regression__solver='liblinear')

cv_scores_lr_bigrams = cross_val_score(lr_bigrams, texts, y, cv=5)

print('CV score of the logistic regression (with bigrams): {} (+/- {})'.format(round(cv_scores_lr_bigrams.mean(), 4), 
                                                                               round(cv_scores_lr_bigrams.std() * 2, 4)))

CV score of the logistic regression: 0.8415 (+/- 0.0336)
CV score of the logistic regression (with bigrams): 0.8525 (+/- 0.033)


Both results are better than what we had before, and the logistic regression with bigrams gives the best result.

### Question 3

<b>Use the NLTK library to perform stemming. You will use the class <tt>SnowballStemmer</tt>.</b>

Now, let us apply stemming, i.e. we keep only the stem of each word. This has the advantage of giving the same representation to a word and its different forms, such as the plural.

In [20]:
from nltk import SnowballStemmer

def stem_texts(texts):
    """Apply stemming on a list of texts
    
    Parameters
    ----------
    texts: list of str
        The texts
        
    Returns
    -------
    stemmed_texts: list of str
        The stemmed texts
    """
    
    stemmer = SnowballStemmer(language='english')
    table = str.maketrans('', '', string.punctuation)

    stemmed_texts = []

    for text in texts:
        text = text.translate(table).lower().replace('\n', '').split()
        text = [stemmer.stem(w) for w in text]
        stemmed_texts.append(' '.join(text))
        
    return stemmed_texts

stemmed_texts = stem_texts(texts)

In [21]:
cv_scores_stem_lr = cross_val_score(lr_bigrams, stemmed_texts, y, cv=5)

print('CV score of the logistic regression (bigrams, stemming): {} (+/- {})'.format(round(cv_scores_stem_lr.mean(), 4), 
                                                                                    round(cv_scores_stem_lr.std() * 2, 4)))

CV score of the logistic regression (bigrams, stemming): 0.853 (+/- 0.0287)


The score obtained ($0.853$) is marginally better than with bigrams only ($0.8525$). The difference is much smaller than the spread across folds, so it suggests at most a slight positive impact of stemming on the performance, which would make sense.

### Question 4

<b>Filter words by grammatical category (POS: Part Of Speech) and keep only nouns, verbs, adverbs and adjectives for classification.</b>

In [22]:
from nltk import pos_tag

POS_kept = ['NOUN', 'VERB', 'ADV', 'ADJ']

def pos_tagging_texts(texts, keep=POS_kept):
    """Apply POS tagging selection to a list of texts
    
    Parameters
    ----------
    texts: list of str
        The texts
    keep: list of str
        List of the grammatical categories to keep

    Returns
    -------
    texts_pos: list of str
        The texts in which we only keep some grammatical categories
    """

    table = str.maketrans('', '', string.punctuation)

    texts_pos = []

    for text in texts:
        text = text.translate(table).lower().replace('\n', '').split()
        text_pos = ''

        for word, pos in pos_tag(text, tagset='universal'):
            if pos in keep:  # use the function argument rather than the global list
                text_pos += word + ' '

        texts_pos.append(text_pos)
        
    return texts_pos

texts_pos = pos_tagging_texts(texts)

In [23]:
cv_scores_lr_pos_tag = cross_val_score(lr_bigrams, texts_pos, y, cv=5)

print('CV score of the logistic regression (bigrams, pos tagging): {} (+/- {})'.format(round(cv_scores_lr_pos_tag.mean(), 4), 
                                                                                    round(cv_scores_lr_pos_tag.std() * 2, 4)))

CV score of the logistic regression (bigrams, pos tagging): 0.8535 (+/- 0.0299)


The POS filtering is marginally better than the stemming ($0.8535$ versus $0.853$). It is the best result we have obtained so far.