# Text Mining: Sentiment analysis

Author: Badr GHAZLANE

In [1]:
import glob 
import os.path as op 
import numpy as np
import re
from sklearn.base import BaseEstimator, ClassifierMixin

## Data Importation
The data is stored in two folders: `neg` (negative reviews) and `pos` (positive reviews). `texts` contains both, negatives first, and `y` holds the labels (0 for negative, 1 for positive).

In [2]:
# Load data 
print("Loading dataset")

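from glob import glob
filenames_neg = sorted(glob(op.join('data', 'imdb1', 'neg', '*.txt')))
filenames_pos = sorted(glob(op.join('data', 'imdb1', 'pos', '*.txt')))

def read_text(filename):
    # Use a context manager so each file is closed after reading
    with open(filename) as f:
        return f.read()

texts_neg = [read_text(f) for f in filenames_neg]
texts_pos = [read_text(f) for f in filenames_pos]
texts = texts_neg + texts_pos
# Labels: 0 = negative, 1 = positive (np.int was removed from NumPy; use the builtin int)
y = np.ones(len(texts), dtype=int)
y[:len(texts_neg)] = 0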

print("%d documents" % len(texts))

Loading dataset
2000 documents


Then, the stop words (very common words that carry little meaning on their own) are imported.

In [3]:
files_stop = 'data/english.stop.txt'
stop_words = open(files_stop).read().split()

## Part1:  From Scratch classifier

In this part, we implement a Naive Bayes classifier from scratch. The formulas used in the `NB` class come from Naive Bayes theory. This chapter by **Daniel Jurafsky & James H. Martin** (**Stanford**) explains sentiment analysis with Naive Bayes very well:
https://web.stanford.edu/~jurafsky/slp3/6.pdf

*In this work, we assume that the models never see new data: the vocabulary is built from all the data we have, and the models are then tested on part of that data. Hence, the test data contains no unknown words. To feed the model new sentences, we would need a small piece of code that removes the words missing from the training dictionary.*
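As a rough illustration of that missing piece, the sketch below counts the words of a new sentence against a fixed vocabulary and skips any word the vocabulary does not contain. The helper name `vectorize_with_vocabulary` and the three-word `toy_vocabulary` are hypothetical and not part of the notebook; the tokenization mirrors the regular expression used in this notebook.

```python
import re

import numpy as np


def vectorize_with_vocabulary(text, vocabulary):
    """Count the words of `text` using a fixed vocabulary (hypothetical helper).

    Words that are absent from `vocabulary` are silently ignored.
    """
    counts = np.zeros(len(vocabulary))
    # Same tokenization as in this notebook: non-word characters become spaces
    for word in re.sub(r'\W', ' ', text).split():
        index = vocabulary.get(word)
        if index is not None:  # unseen words are skipped
            counts[index] += 1
    return counts


# Toy vocabulary, for illustration only
toy_vocabulary = {'good': 0, 'bad': 1, 'movie': 2}
vector = vectorize_with_vocabulary('a good , good movie with a twist', toy_vocabulary)
print(vector)  # -> [2. 0. 1.]
```

The resulting vector has the same length as the training vocabulary, so it can be passed to a model fitted on the training counts.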

### 1 . The `count_words` function

In [4]:
def count_words(texts, stopwords):
    """Vectorize text: return the count of each word in the text snippets.

    Parameters
    ----------
    texts : list of str
        The texts.
    stopwords : iterable of str
        Words that are never counted.

    Returns
    -------
    vocabulary : dict
        A dictionary that points to an index in counts for each word.
    counts : ndarray, shape (n_samples, n_features)
        The counts of each word in each text.
        n_samples == number of documents.
        n_features == number of words in vocabulary.
    """
    stopwords = set(stopwords)  # set membership tests are much faster than list ones
    n_samples = len(texts)
    words = set()
    for text in texts:
        # Replace every non-word character (punctuation) with a space, then split
        words.update(re.sub(r'\W', ' ', text).split())

    # Note: stop words stay in the vocabulary; they are only skipped when counting
    # below, so their columns remain all zeros. Use `words -= stopwords` here to
    # drop those columns (this reduces n_features).
    n_features = len(words)
    vocabulary = dict(zip(words, range(n_features)))
    counts = np.zeros((n_samples, n_features))

    for i in range(n_samples):
        wordlist = re.sub(r'\W', ' ', texts[i]).split()  # remove punctuation
        for mot in wordlist:
            if mot not in stopwords:
                counts[i, vocabulary[mot]] += 1
    return vocabulary, counts

Note that `count_words` returns the vocabulary dictionary and a 2-D array of shape (2000, 39696): rows are documents and columns are words. **Each cell contains the number of occurrences of word_j in document_i.**

In [5]:
vocabularys, Xs = count_words(texts, stop_words)

In [6]:
Xs.shape

(2000, 39696)

### 2 . How a review was determined to be positive or negative

The original html files do not have consistent formats: a review may not have the author's rating with it, and when it does, the rating can appear at different places in the file in different forms. We only recognize some of the more explicit ratings, which are extracted via a set of ad-hoc rules. In essence, a file's classification is determined based on the first rating we were able to identify.

- In order to obtain more accurate rating decisions, the maximum rating must be specified explicitly, both for numerical ratings and star ratings. (`"8/10"`, `"four out of five"`, and `"OUT OF ****: ***"` are examples of rating indications we recognize.)
- With a five-star system (or compatible number system): three-and-a-half stars and up are considered positive, two stars and below are considered negative.
- With a four-star system (or compatible number system): three stars and up are considered positive, one-and-a-half stars and below are considered negative.
- With a letter grade system: B or above is considered positive, C- or below is considered negative.

We attempted to recognize half stars, but they are specified in an especially free way, which makes them difficult to recognize. Hence, we may lose a half star very occasionally; but this only results in 2.5 stars in a five-star system being categorized as negative, which is still reasonable.

### 3 . The NB class

The small `NB` class makes it easy to fit, predict and score.

- **FIT**: It learns the word distribution of each class (Neg or Pos) from word frequencies, with add-one (Laplace) smoothing so that the estimated probability of a word is never 0. The comments in the code below give the corresponding equation numbers in the Stanford document.

- **PREDICT**: For each document, given as a vector containing the number of occurrences of each word of the dictionary, we compute the log-posterior of each class and keep the class with the highest value.

- **SCORE**: The score is the fraction of correctly classified examples.
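For reference, the three steps above correspond to the usual multinomial Naive Bayes estimates with add-one smoothing. The notation is chosen here for illustration and may differ from the Stanford chapter: $N_c$ is the number of training documents of class $c$, $N_{doc}$ the total number of documents, $\mathrm{count}(w, c)$ the number of occurrences of word $w$ in the documents of class $c$, $V$ the vocabulary, and $x_w$ the count of $w$ in the document to classify.

$$
\hat{P}(c) = \frac{N_c}{N_{doc}}, \qquad
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
$$

$$
\hat{c} = \arg\max_{c} \left[ \log \hat{P}(c) + \sum_{w \in V} x_w \, \log \hat{P}(w \mid c) \right]
$$

Working with logarithms turns the product of many small probabilities into a sum, which avoids numerical underflow on long documents.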

In [7]:
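class NB(BaseEstimator, ClassifierMixin):
    def __init__(self):
        self.condproba = None
        self.prior = None

    def fit(self, X, y):
        # Labels are assumed to be the integers 0, ..., n_classes - 1
        n_classes = len(set(y))
        self.prior = np.zeros(n_classes)
        self.condproba = np.zeros((X.shape[1], n_classes))

        for i in range(n_classes):
            X_i = X[y == i]  # documents of class i
            self.prior[i] = X_i.shape[0] / X.shape[0]  # Equation (6.11)
            # Add-one (Laplace) smoothing: (word count + 1) / (total count + vocabulary size)
            self.condproba[:, i] = (np.sum(X_i, axis=0) + 1) / (np.sum(X_i) + X_i.shape[1])  # Equation (6.14)

        return self

    def predict(self, X):
        # The log-prior is added once per document, outside the dot product,
        # so that it is not multiplied by the document length
        log_posterior = np.dot(X, np.log(self.condproba)) + np.log(self.prior)
        return np.argmax(log_posterior, axis=1)  # Equation (6.10)

    def score(self, X, y):
        return np.mean(self.predict(X) == y)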
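To make the arithmetic concrete, here is a small worked sketch of the same computations on a toy count matrix. The data (four documents over the vocabulary `['good', 'bad', 'movie']`) is made up for illustration, and the code repeats the formulas with plain NumPy instead of calling the class, so that it runs on its own.

```python
import numpy as np

# Hypothetical toy counts: rows are documents, columns are ['good', 'bad', 'movie']
X_toy = np.array([[2, 0, 1],
                  [1, 0, 1],
                  [0, 2, 1],
                  [0, 1, 1]], dtype=float)
y_toy = np.array([1, 1, 0, 0])  # 1 = positive, 0 = negative

n_classes = 2
log_prior = np.zeros(n_classes)
log_likelihood = np.zeros((X_toy.shape[1], n_classes))
for c in range(n_classes):
    X_c = X_toy[y_toy == c]
    # Prior: share of training documents in class c
    log_prior[c] = np.log(X_c.shape[0] / X_toy.shape[0])
    # Add-one smoothing: (count + 1) / (total count in class + vocabulary size)
    log_likelihood[:, c] = np.log((X_c.sum(axis=0) + 1) / (X_c.sum() + X_toy.shape[1]))

# Smoothed probabilities for the positive class: good 4/8, bad 1/8, movie 3/8
print(np.exp(log_likelihood[:, 1]))

# Classify the document "good movie"
new_doc = np.array([[1, 0, 1]], dtype=float)
log_posterior = new_doc @ log_likelihood + log_prior
predicted = np.argmax(log_posterior, axis=1)
print(predicted)  # -> [1]
```

Even though "bad" never appears in a positive document, smoothing gives it a probability of 1/8 instead of 0, so a single unexpected word cannot rule a class out.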
    

In [8]:
%%time
# Count words in text
vocabulary, X = count_words(texts, stop_words)

CPU times: user 9.46 s, sys: 219 ms, total: 9.68 s
Wall time: 9.69 s


Let's use 5-fold cross-validation to estimate the accuracy of our class:

In [9]:
%%time

from sklearn.model_selection import cross_val_score
nb = NB()
scores = cross_val_score(nb, X, y, cv=5)
print('Max score:', max(scores))
print('Min score:', min(scores))
print('Mean:', scores.mean())
print('                     ')

Max score: 0.8325
Min score: 0.7725
Mean: 0.8055
                     
CPU times: user 4.96 s, sys: 4.84 s, total: 9.8 s
Wall time: 10 s


## Part2:  SKLEARN NB classifier

I will compare the from scratch NB implementation with the sklearn one. 
To simplify the process, I will use a pipeline. 

### 1 . Comparison with SKLEARN

In [10]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

In [11]:
etapes = [('vectorizer',CountVectorizer(analyzer='word')), 
          ('Bayes', MultinomialNB())]
pipe = Pipeline(etapes)
pipe.fit(texts[::2], y[::2])
ypred = pipe.predict(texts[1::2])
print('The SKLEARN implementation of NB gives a score of %.4f.' % np.mean(y[1::2] == ypred))
print('Our implementation gives us a score of %.4f' % scores.mean())

The SKLEARN implementation of NB gives a score of 0.8130.
Our implementation gives us a score of 0.8055
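Note that the two numbers above come from different protocols: the sklearn pipeline is scored on a single even/odd split, while the from-scratch class is scored by 5-fold cross-validation. A possible way to put both on the same footing is to cross-validate the pipeline as well. The sketch below shows the idea on a tiny made-up corpus (`toy_texts` and `toy_labels` are hypothetical), with `cv=2` only because the toy corpus is so small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only
toy_texts = ["a good movie", "a great film", "good acting and a great plot", "really good",
             "a bad movie", "a terrible film", "bad acting and a boring plot", "really bad"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([('vectorizer', CountVectorizer(analyzer='word')),
                     ('bayes', MultinomialNB())])

# The vectorizer is refitted inside each fold, so the vocabulary
# only comes from the training part of that fold
cv_scores = cross_val_score(toy_pipe, toy_texts, toy_labels, cv=2)
print('Mean cross-validated accuracy: %.4f' % cv_scores.mean())
```

On the real data, the equivalent call would be `cross_val_score(pipe, texts, y, cv=5)`, which gives a mean score directly comparable to `scores.mean()`.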


### 2 . Let's try another algorithm

- SVM algorithm

In [12]:
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

In [13]:
etapes_SVM = [('vectorizer',CountVectorizer(analyzer='word')), 
          ('SVM', SVC(C= 30, kernel='rbf', gamma=0.0001))]
pipe_SVM = Pipeline(etapes_SVM)
pipe_SVM.fit(texts[::2], y[::2])
ypred_SVM = pipe_SVM.predict(texts[1::2])
print('The SKLEARN implementation of SVM gives a score of %.4f.' % np.mean(y[1::2] == ypred_SVM))
print('Our implementation gives us a score of %.4f' % scores.mean())

The SKLEARN implementation of SVM gives a score of 0.8130.
Our implementation gives us a score of 0.8055


- Logistic Regression

In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
etapes_logit = [('vectorizer',CountVectorizer(analyzer='word')), 
          ('Logistic regression', LogisticRegression(C=1, fit_intercept= True) )]
pipe_logit = Pipeline(etapes_logit)
pipe_logit.fit(texts[::2], y[::2])
ypred_logit = pipe_logit.predict(texts[1::2])
print('The SKLEARN implementation of Logistic Regression gives a score of %.4f.' % np.mean(y[1::2] == ypred_logit))
print('Our implementation gives us a score of %.4f' % scores.mean())

The SKLEARN implementation of Logistic Regression gives a score of 0.8310.
Our implementation gives us a score of 0.8055


### 3 . Stemming: NLTK

Stemmers remove morphological affixes from words, leaving only the word stem.
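As a quick illustration, the sketch below applies NLTK's English Snowball stemmer to a few hand-picked words. The word list is arbitrary. Inflected forms of the same word generally map to one stem, so they end up sharing a single column in the count matrix.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Arbitrary example words; inflected forms usually collapse to the same stem
words = ["running", "cats", "acted", "acting"]
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```

Stems are not always dictionary words, which is fine here because they only serve as feature names.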

In [16]:
from nltk import SnowballStemmer
from nltk.tokenize import word_tokenize

In [17]:
stemmer = SnowballStemmer("english")

In [18]:
def stemmWords(doc):
    return [stemmer.stem(word) for word in word_tokenize(doc)]

stemmed_texts = [' '.join(stemmWords(text)) for text in texts ]

Let's try the stemming transformation on our best algorithm: 

In [19]:
%%time

pipe_logit.fit(stemmed_texts[::2], y[::2])
ypred_logit = pipe_logit.predict(stemmed_texts[1::2])
print('The SKLEARN implementation of Logistic Regression with stemming gives a score of %.4f.' % np.mean(y[1::2] == ypred_logit))
print('                                ')

The SKLEARN implementation of Logistic Regression with stemming gives a score of 0.8170.
                                
CPU times: user 1.36 s, sys: 19.6 ms, total: 1.38 s
Wall time: 1.38 s


### 4 . Grammatical analysis using POS

- In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

In [20]:
import nltk
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('tagsets')


Based on the NLTK tagset documentation, we keep only the word categories most likely to carry meaning: nouns (NN, NNS, ...), adverbs (RB, ...), verbs (VB, VBZ, ...) and adjectives (JJ, JJR, ...). This removes noise from our word distribution.
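The filtering idea can be sketched without running the tagger, on a sentence whose tags were written by hand for illustration (`tagged` below is hypothetical, in the `(word, tag)` format returned by `pos_tag`). Matching on tag prefixes is a shortcut chosen for this sketch: in the Penn Treebank tagset, the prefixes `NN`, `RB`, `VB` and `JJ` cover all noun, adverb, verb and adjective tags.

```python
# Tag prefixes to keep: nouns, adverbs, verbs, adjectives (Penn Treebank tagset)
KEPT_PREFIXES = ('NN', 'RB', 'VB', 'JJ')


def keep_content_words(tagged_tokens):
    """Keep the words whose POS tag starts with one of the kept prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(KEPT_PREFIXES)]


# Hand-tagged example sentence (hypothetical, not produced by the tagger)
tagged = [('the', 'DT'), ('movie', 'NN'), ('was', 'VBD'),
          ('surprisingly', 'RB'), ('good', 'JJ'), ('.', '.')]
kept = keep_content_words(tagged)
print(' '.join(kept))  # -> movie was surprisingly good
```

The determiner and the punctuation are dropped, while the words that describe the film and the opinion about it are kept.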

In [21]:
# ALL NLTK TAGS
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [22]:
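def keep_groups(texts):
    # Penn Treebank tags to keep: nouns, adverbs, verbs and adjectives
    # ('JJR' is the comparative-adjective tag; 'JR' is not a Penn Treebank tag)
    auth_lexic = {'NN', 'NNS', 'NNP', 'NNPS', 'RB', 'RBR', 'RBS',
                  'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'JJ', 'JJR', 'JJS'}

    filtered_word = []

    for text in texts:
        words = word_tokenize(text)
        # Keep only the words whose tag is in the allowed set
        w = [word for word, pos in pos_tag(words) if pos in auth_lexic]
        filtered_word.append(' '.join(w))
    return filtered_word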
    

In [23]:
%%time
filtered = keep_groups(texts)

CPU times: user 1min 33s, sys: 314 ms, total: 1min 33s
Wall time: 1min 34s


Let's try the effect of POS filtering on our best algorithm. **Keeping only nouns, adverbs, verbs and adjectives slightly increases the accuracy** (0.8370, against 0.8310 without filtering on the same split).

In [24]:
%%time

pipe_logit.fit(filtered[::2], y[::2])
ypred_logit = pipe_logit.predict(filtered[1::2])
print('The SKLEARN implementation of Logistic Regression with POS gives a score of %.4f.' % np.mean(y[1::2] == ypred_logit))
print('                                ')

The SKLEARN implementation of Logistic Regression with POS gives a score of 0.8370.
                                
CPU times: user 955 ms, sys: 28.7 ms, total: 984 ms
Wall time: 985 ms


##  Conclusion



- Finally, the **best data transformation is POS filtering** (0.8370 against 0.8310 on the raw text with logistic regression). POS tagging labels each word with its part of speech according to a tagset, which lets us keep only the most informative word categories and improves the accuracy of our model. **Stemming is also an efficient technique**, even though it did not improve the accuracy here (0.8170 against 0.8310).

- However, the accuracy is clearly lower than that of a human reader. (In image recognition, by contrast, some models such as CNNs have been reported to match or exceed human accuracy on specific benchmarks.) The models we studied ignore the interactions between words and the structure of a sentence: the only information they use is how often each word appears in a document. Given that, the results are quite satisfying, because **the models considered are extremely simple.**

- Models such as **word2vec or Seq2Seq (RNN)** are more complex: they take the position of a word in a sentence into account, or represent the meaning of a word as a point in a vector space. It would be interesting to measure how much these models improve the accuracy.
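The limitation mentioned above, that only word frequencies are used, can be seen on a two-sentence example. In the sketch below (the sentences are made up), two reviews with opposite meanings contain the same words, so a bag-of-words vectorizer gives them exactly the same representation and no classifier trained on these counts can tell them apart.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences with opposite meanings but the same words
sentences = ["good , not bad", "bad , not good"]

vectorizer = CountVectorizer(analyzer='word')
counts = vectorizer.fit_transform(sentences).toarray()

# Word order is lost: both rows of the count matrix are identical
same_representation = np.array_equal(counts[0], counts[1])
print(same_representation)  # -> True
```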

