# Applying Machine Learning To Sentiment Analysis
Notebook adapted from Sebastian Raschka's code examples from Python Machine Learning

### Overview
In this notebook we will explore sentiment analysis from two different approaches: as a supervised classification task, and as an unsupervised NLP analysis task.

In [12]:
import pandas as pd
import os

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
#from sklearn.feature_extraction.text import TfidfTransformer

import nltk
from nltk.stem.porter import PorterStemmer

# Toy Example Introducing the Bag-of-Words Model

The bag-of-words model is a representation of documents as vectors in a high dimensional vector space based on the training collection vocabulary. It's a way of representing text data when modeling text with machine learning algorithms.

The bag-of-words model is simple to understand and implement and has seen great success in problems such as language modeling and document classification.

It basically assumes an encoding whereby each word in the training dataset vocabulary is considered a feature. 

This means that when handling reviews from the test set we must also consider what happens to words that were not seen in the training set. Two ways to deal with unknown terms observed at test time are to have an unknown token that absorbs all cases of new vocabulary, or more commonly to simply ignore those words in the test collection.
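
As a minimal sketch of the second option, assuming scikit-learn's `CountVectorizer` with its default settings: terms that were not seen during `fit` are simply dropped at `transform` time. The sentences and variable names below are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on a tiny "training" collection (illustrative data).
train_docs = ['the sun is shining', 'the weather is sweet']
vec = CountVectorizer()
vec.fit(train_docs)

# 'moon' and 'bright' were never seen during fit, so they are ignored;
# only 'the' and 'is' contribute to the test vector.
test_vector = vec.transform(['the moon is bright']).toarray()[0]
known_terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(dict(zip(known_terms, test_vector)))
```

The alternative, an explicit unknown token, would need an extra preprocessing step that replaces out-of-vocabulary words before vectorising.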

## Transforming documents into feature vectors

By calling the fit_transform method on CountVectorizer, we construct the vocabulary of the bag-of-words model and transform the following three sentences into sparse feature vectors:
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two


In [5]:
count = CountVectorizer(ngram_range=(1,2)) #from sklearn.feature_extraction.text import CountVectorizer
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

Now let us print the contents of the vocabulary to get a better understanding of the underlying concepts:

In [6]:
print(count.vocabulary_) # vocabulary_ attribute of CountVectorizer() shows a mapping of terms to feature indices.

{'the': 15, 'sun': 11, 'is': 2, 'shining': 9, 'the sun': 16, 'sun is': 12, 'is shining': 3, 'weather': 19, 'sweet': 13, 'the weather': 17, 'weather is': 20, 'is sweet': 4, 'and': 0, 'one': 6, 'two': 18, 'shining the': 10, 'sweet and': 14, 'and one': 1, 'one and': 7, 'one is': 8, 'is two': 5}


As we can see from executing the preceding command, the vocabulary is stored in a Python dictionary that maps each unique term (here unigrams and bigrams) to an integer index. Next let us print the feature vectors that we just created:

Each index position in the feature vectors shown here corresponds to the integer value stored for a term in the CountVectorizer vocabulary. For example, the first feature at index position 0 represents the count of the word "and", which only occurs in the last document, and the word "is" at index position 2 (the 3rd feature in the document vectors) occurs in all three sentences. Those values in the feature vectors are also called the raw term frequencies: *tf(t, d)*, the number of times a term *t* occurs in a document *d*.

We can print this as follows:

In [7]:
print(bag.toarray())

[[0 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 1 0 0 0 0]
 [0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1]
 [2 2 3 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1]]
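
The raw array is hard to read without the terms next to it. The following sketch, which assumes pandas is available, orders the vocabulary by index and uses it as column labels; the names `terms` and `counts` are our own.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two'])

count = CountVectorizer(ngram_range=(1, 2))
bag = count.fit_transform(docs)

# Order the terms by their column index so they line up with the array.
terms = sorted(count.vocabulary_, key=count.vocabulary_.get)
counts = pd.DataFrame(bag.toarray(), columns=terms,
                      index=['doc1', 'doc2', 'doc3'])

# Show a few columns: 'and' (index 0), 'is' (index 2), 'the' (index 15).
print(counts[['and', 'is', 'the']])
```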


# Exercise
Notice that with its default settings the CountVectorizer creates frequency counts for unigrams, i.e. single words.
To preserve some of the local ordering information we can also ask it to extract 2-grams of words in addition to the 1-grams (individual words). The cell above already does this; to switch between the two settings, change the `ngram_range` argument as follows:

count = CountVectorizer(ngram_range=(1,2)) 


## Assessing word relevancy via term frequency-inverse document frequency

In [8]:
np.set_printoptions(precision=2) # These options determine the way floating point numbers are displayed.

When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called term frequency-inverse document frequency (tf-idf) that can be used to downweight those frequently occurring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

Here the tf(t, d) is the term frequency that we introduced in the previous section,
and the inverse document frequency *idf(t, d)* can be calculated as:

$$\text{idf}(t,d) = \text{log}\frac{N}{\text{df}(d, t)},$$

where $N$ is the total number of documents, and *df(d, t)* is the number of documents *d* that contain the term *t*. Note that some formulations add the constant 1 to the denominator, i.e. $1 + \text{df}(d, t)$; this is optional and serves the purpose of assigning a non-zero value to terms that occur in all training samples. The log is used to ensure that low document frequencies are not given too much weight.
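
As a small worked example of the textbook idf, the sketch below uses the document frequencies of three terms from the toy sentences above. The natural logarithm is an assumption here, since the equation does not fix the base.

```python
import numpy as np

n_docs = 3                                   # N: total number of documents
df_counts = {'is': 3, 'sun': 2, 'two': 1}    # df(d, t) for three terms

# Textbook idf = log(N / df); the natural log is assumed.
idf_values = {term: np.log(n_docs / df_t) for term, df_t in df_counts.items()}

for term, idf in idf_values.items():
    print('%-4s df=%d idf=%.2f' % (term, df_counts[term], idf))
```

A term that occurs in every document ("is") gets an idf of zero, while the rarest term ("two") gets the largest value.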

Scikit-learn implements yet another transformer, the `TfidfTransformer`, that takes the raw term frequencies from `CountVectorizer` as input and transforms them into tf-idfs:

In [9]:
tfidf = TfidfTransformer(use_idf=True, 
                         norm='l2', 
                         smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs))
      .toarray())

[[0.   0.   0.31 0.4  0.   0.   0.   0.   0.   0.4  0.   0.4  0.4  0.
  0.   0.31 0.4  0.   0.   0.   0.  ]
 [0.   0.   0.31 0.   0.4  0.   0.   0.   0.   0.   0.   0.   0.   0.4
  0.   0.31 0.   0.4  0.   0.4  0.4 ]
 [0.38 0.38 0.33 0.14 0.14 0.19 0.38 0.19 0.19 0.14 0.19 0.14 0.14 0.14
  0.19 0.22 0.14 0.14 0.19 0.14 0.14]]


As we saw in the previous subsection, the word "is" has the largest term frequency in the 3rd document (it occurs three times).

However, after transforming the same feature vector into tf-idfs, we see that the word "is" is now associated with a relatively small tf-idf (0.33) in document 3, since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.

Note how "one" (0.38) now outweighs "is" in document 3 even though it occurs only twice there, and how "shining" (0.4) outweighs "is" (0.31) in document 1 even though both occur once (compare the previous representation, which used raw term frequencies only).

However, if we'd manually calculated the tf-idfs of the individual terms in our feature vectors, we'd have noticed that the `TfidfTransformer` calculates the tf-idfs slightly differently compared to the standard textbook equations that we covered in the lecture. The equations for the tf-idf and idf that are implemented in scikit-learn are:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

$$\text{idf}(t,d) = \log\frac{1 + N}{1 + \text{df}(d, t)}$$

Here, by setting smooth_idf=True, we achieve the extra addition of one to the document frequencies (and to the document count $N$), as if an extra document was seen containing every term in the collection exactly once. This prevents zero divisions.
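
To check these equations, the following sketch compares a manual calculation with the `idf_` attribute of a fitted `TfidfTransformer`. Note that `idf_` already includes the "+1" from the tf-idf equation. Unigrams are used to keep the example short, and names such as `manual_idf` are our own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

count = CountVectorizer()          # unigrams only, to keep things short
tf = count.fit_transform(docs)

tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
tfidf.fit(tf)

# Document frequency: number of documents in which each term occurs.
n_docs = tf.shape[0]
df_t = (tf.toarray() > 0).sum(axis=0)

# Smoothed idf plus one, following the equations above.
manual_idf = np.log((1 + n_docs) / (1 + df_t)) + 1

print(np.allclose(manual_idf, tfidf.idf_))
```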



While it is also more typical to normalize the raw term frequencies before calculating the tf-idfs, the `TfidfTransformer` normalizes the tf-idfs directly.

By default (`norm='l2'`), scikit-learn's TfidfTransformer applies the L2-normalization, which returns a vector of length 1 by dividing an un-normalized feature vector *v* by its L2-norm:

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$

To make sure that we understand how TfidfTransformer works, let us walk
through an example and calculate the tf-idf of the word "is" in the 3rd document.

The word "is" has a term frequency of 3 (tf = 3) in document 3, and the document frequency of this term is 3 since the term "is" occurs in all three documents (df = 3). Thus, we can calculate the idf as follows:

$$\text{idf}(\text{"is"}, d_3) = \log \frac{1+3}{1+3} = 0$$

Now in order to calculate the tf-idf, we simply need to add 1 to the inverse document frequency and multiply it by the term frequency:

$$\text{tf-idf}(\text{"is"}, d_3)= 3 \times (0+1) = 3$$

In [10]:
tf_is = 3 # suppose term "is" has a frequency of 3
n_docs = 3
idf_is = np.log((n_docs+1) / (3+1))
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf of term "is" = %.2f' % tfidf_is)

tf-idf of term "is" = 3.00


If we repeated these calculations for all terms in the 3rd document using unigrams only (i.e. the default `ngram_range=(1,1)`), we'd obtain the following tf-idf vector: [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]. However, we notice that the values in this feature vector are different from the values that `TfidfTransformer` returns by default. The step that we are missing in this tf-idf calculation is the L2-normalization, which can be applied as follows:

$$\text{tf-idf}_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]}{\sqrt{3.39^2+ 3.0^2+ 3.39^2+ 1.29^2+ 1.29^2+ 1.29^2+ 2.0^2+ 1.69^2+ 1.29^2}}$$

$$=[0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$

$$\Rightarrow \text{tf-idf}_{norm}(\text{"is"}, d_3) = 0.45$$

With the unigram-plus-bigram vocabulary used in the cells above, the 3rd document has 21 features, so the norm is larger and the normalized tf-idf of "is" comes out as 0.33 instead. Let's recalculate the tf-idf values for that vocabulary by switching off the norm:

In [11]:
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf 

array([3.39, 3.39, 3.  , 1.29, 1.29, 1.69, 3.39, 1.69, 1.69, 1.29, 1.69,
       1.29, 1.29, 1.29, 1.69, 2.  , 1.29, 1.29, 1.69, 1.29, 1.29])

In [12]:
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf

array([0.38, 0.38, 0.33, 0.14, 0.14, 0.19, 0.38, 0.19, 0.19, 0.14, 0.19,
       0.14, 0.14, 0.14, 0.19, 0.22, 0.14, 0.14, 0.19, 0.14, 0.14])

# TfidfVectorizer
`TfidfVectorizer` is equivalent to `CountVectorizer` followed by `TfidfTransformer`.
It converts a collection of raw documents to a matrix of TF-IDF features.
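
The equivalence can be checked directly. This is a minimal sketch using default settings on the toy sentences from earlier; `two_step` and `one_step` are our own names.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

# Two-step route: raw counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with the same (default) settings.
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```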


In [17]:
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
 ]

vectorizer = TfidfVectorizer(ngram_range=(1,3), min_df=2)
X = vectorizer.fit_transform(corpus)
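# get_feature_names() was removed in scikit-learn 1.2;
# on recent versions use vectorizer.get_feature_names_out() instead
print(vectorizer.get_feature_names())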

print(X.shape)

['document', 'first', 'first document', 'is', 'is the', 'the', 'the first', 'the first document', 'this', 'this is', 'this is the']
(4, 11)


The L2-normalized values that we calculated by hand above match the results returned by scikit-learn's `TfidfTransformer`. Since we now understand how tf-idfs are calculated, let us proceed to the next sections and apply those concepts to the movie review dataset.

# Exercise 
- Again try out ngram_range=(1, 3) to generate n-grams beyond unigrams with the TfidfVectorizer. Note that ngram_range takes a (min_n, max_n) tuple, so (1, 3) covers unigrams, bigrams and trigrams.
- max_df and min_df are parameters that can be set with TfidfVectorizer; the former specifies the maximum document frequency above which frequently occurring words are dropped, and the latter the minimum number of documents a word must occur in before it is considered. Try setting these and explore the different outputs.
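
As a starting point for this exercise, here is a hedged sketch that loops over a few settings and reports the vocabulary size for each. The particular combinations are arbitrary choices for illustration, not recommended values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Each setting is (ngram_range, min_df, max_df); arbitrary illustrative values.
settings = [((1, 1), 1, 1.0),
            ((1, 3), 1, 1.0),
            ((1, 3), 2, 1.0),
            ((1, 3), 2, 0.75)]

vocab_sizes = {}
for ngram_range, min_df, max_df in settings:
    vec = TfidfVectorizer(ngram_range=ngram_range, min_df=min_df, max_df=max_df)
    vec.fit(corpus)
    vocab_sizes[(ngram_range, min_df, max_df)] = len(vec.vocabulary_)
    print(ngram_range, min_df, max_df, '->', len(vec.vocabulary_), 'terms')
```

Raising min_df or lowering max_df can only shrink the vocabulary, whereas widening ngram_range can only grow it.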

# Preparing the IMDb movie review data for text processing 

## Obtaining the IMDb movie review dataset

The IMDB movie review set can be downloaded from [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/).
We have already done this and extracted the csv file. 

## Preprocessing the movie dataset into a more convenient format

In [13]:
df = pd.read_csv('movie_data_cat.csv', encoding='utf-8')
df.head(10)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",pos
1,OK... so... I really like Kris Kristofferson a...,neg
2,"***SPOILER*** Do not read this, if you think a...",neg
3,hi for all the people who have seen this wonde...,pos
4,"I recently bought the DVD, forgetting just how...",neg
5,Leave it to Braik to put on a good show. Final...,pos
6,Nathan Detroit (Frank Sinatra) is the manager ...,pos
7,"To understand ""Crash Course"" in the right cont...",pos
8,I've been impressed with Chavez's stance again...,pos
9,This movie is directed by Renny Harlin the fin...,pos


In [19]:
df.shape
df.columns

Index(['review', 'sentiment'], dtype='object')

Since the sentiment column happens to be categorical we can map the "neg" and "pos" classes to the integers 0 and 1.

In [14]:
class_mapping = {label:idx for idx,label in enumerate(np.unique(df['sentiment']))}

print(class_mapping)

#use the mapping dictionary to transform the class labels into integers

df['sentiment'] = df['sentiment'].map(class_mapping)
df.head(10)

{'neg': 0, 'pos': 1}


Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0
5,Leave it to Braik to put on a good show. Final...,1
6,Nathan Detroit (Frank Sinatra) is the manager ...,1
7,"To understand ""Crash Course"" in the right cont...",1
8,I've been impressed with Chavez's stance again...,1
9,This movie is directed by Renny Harlin the fin...,1


Now that the class column is as we need it for a classifier we next look at how to clean up the review text content. 

## Cleaning text data with Regular Expressions
Execute the code below to view one of the reviews (index 5635).
You will notice that the text needs to be cleaned up, e.g. due to HTML markup, punctuation and other non-letter characters.

In [21]:
df.loc[5635, 'review']#[-50:]

'I really thought this would be a good movie, boy...was I mistaken! For a quick summery: B grade acting C grade special effects D grade for the overall movie. Don\'t get me wrong, the story was pretty good and not kiddish so an adult too ride along with it, the "hero" is good looking so most women will like it :-), not a total chick flick as it contains some fight scenes and some blood<br /><br />but the way it is shot... horrible <br /><br />the special effects->would be better suited for TV->on a kids show <br /><br />and lastly...send some of the actors back to acting school if they ever attended a class there.<br /><br />Trust me there are much better ways to waste 2 hours.<br /><br />You have been warned.'

We can use Python's regular expression library to clean up some of this data.
For details on the re library go to: https://docs.python.org/2/library/re.html

In [15]:
# import regular expressions to clean up the text
import re

def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # remove all html markup
    # find all the emoticons, e.g. :) ;-( =D
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)

    # lowercase the text and replace runs of non-word chars '[\W]+' with a space,
    # then append the emoticons to the end, removing the nose char '-' for consistency
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

In [24]:
preprocessor(df.loc[5635, 'review'])#[-50:]

'i really thought this would be a good movie boy was i mistaken for a quick summery b grade acting c grade special effects d grade for the overall movie don t get me wrong the story was pretty good and not kiddish so an adult too ride along with it the hero is good looking so most women will like it not a total chick flick as it contains some fight scenes and some bloodbut the way it is shot horrible the special effects would be better suited for tv on a kids show and lastly send some of the actors back to acting school if they ever attended a class there trust me there are much better ways to waste 2 hours you have been warned :)'

## Apply the clean data preprocessor to the text

In [25]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'
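
To see what each regular expression contributes, the following sketch runs the three steps of the preprocessor separately on the same test string. The intermediate variable names are our own.

```python
import re

sample = '</a>This :) is :( a test :-)!'

# Step 1: strip anything that looks like an HTML tag.
no_html = re.sub(r'<[^>]*>', '', sample)

# Step 2: collect emoticons: eyes (: ; =), optional nose (-), mouth () ( D P).
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)

# Step 3: lowercase and replace runs of non-word characters with a space.
words = re.sub(r'[\W]+', ' ', no_html.lower())

print(repr(no_html))
print(emoticons)
print(repr(words))
```

Joining `words` with the nose-less emoticons gives the string returned by the preprocessor above.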

In [26]:
# apply the preprocessor to the entire dataframe (i.e. column review)
df['review'] = df['review'].apply(preprocessor)

## Tokenise - break text into tokens

In [16]:
def tokenizer(text):
    return text.split()


In [28]:
print(tokenizer("Tokenise this sentence into its individual words"))

['Tokenise', 'this', 'sentence', 'into', 'its', 'individual', 'words']


## Stopwords - Removing stopwords from text
We need to download the stopwords list from nltk.
You can do that as follows:


In [17]:
from nltk.corpus import stopwords 

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/craigpirie/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Create a function that accepts a piece of tokenised text and returns it without the stopwords.

In [18]:
stop = set(stopwords.words('english'))
def stop_removal(text):
    return [w for w in text if w not in stop]

In [31]:
text = "This is a sample sentence, demonstrating the removal of stop words."
stopped_text = stop_removal(text.split())
print(stopped_text) 

['This', 'sample', 'sentence,', 'demonstrating', 'removal', 'stop', 'words.']
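
Notice that "This" survived because the stop list is lowercase, and that punctuation stayed attached to "sentence," and "words.". The sketch below normalises the text before filtering. It uses a small hand-picked stop list (an assumption, so the example runs without downloading NLTK data) and a hypothetical helper name, `stop_removal_normalised`.

```python
import re

# A small hand-picked stop list, used here only so the example runs without
# NLTK data; in the notebook, `stop` comes from NLTK instead.
toy_stop = {'this', 'is', 'a', 'the', 'of'}

def stop_removal_normalised(text, stop_words):
    # Lowercase and keep only word characters before filtering.
    tokens = re.sub(r'[\W]+', ' ', text.lower()).split()
    return [w for w in tokens if w not in stop_words]

text = "This is a sample sentence, demonstrating the removal of stop words."
print(stop_removal_normalised(text, toy_stop))
```

In this notebook the same effect is achieved by running the preprocessor on the reviews before tokenising them.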


## Stemming - Processing tokens into their root form
For this purpose we will explore two different stemmers and select one.

In [19]:
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

#See which languages are supported.
print(" ".join(SnowballStemmer.languages))

arabic danish dutch english finnish french german hungarian italian norwegian porter portuguese romanian russian spanish swedish


In [22]:
#get the english stemmer
stemmer = SnowballStemmer("english")

#stem a word
print(stemmer.stem("jogging"))

jog


In [35]:
#Decide not to stem stopwords with ignore_stopwords
stemmer2 = SnowballStemmer("english", ignore_stopwords=True)

#compare the two versions of the stemmer
print(stemmer.stem("having"))

print(stemmer2.stem("having"))

have
having


In [36]:
#The 'english' stemmer is better than the original 'porter' stemmer.
print(SnowballStemmer("english").stem("generously"))

print(SnowballStemmer("porter").stem("generously"))

generous
gener


# Tokenise + Stemming 
Let's create a method to stem each word / token contained in the piece of text.
Note how the text is first tokenised before stemming.

In [20]:
def tokenizer_stemmer(text):
    return [stemmer.stem(word) for word in tokenizer(text)]

In [38]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [39]:
tokenizer_stemmer('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thus', 'they', 'run']

You can clearly see from the code above the effect of the stemmer on the tokens

In [23]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_stemmer('A runner likes running and runs a lot')[-8:]
if w.lower() not in stop]


['runner', 'like', 'run', 'run', 'lot']

# Training a model for sentiment classification

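The reviews have already been stripped of HTML and punctuation, which speeds up the grid search later. We now split the data into training and test sets:

In [24]:
# .loc slicing is inclusive at both ends, so stop at 24999 to keep
# row 25000 (the first test row) out of the training set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values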

### smaller sample
X_train = df.loc[:2500, 'review'].values
y_train = df.loc[:2500, 'sentiment'].values

In [25]:
param_grid0 = [{'vect__ngram_range': [(1, 1)], #can also extract 2-grams of words in addition to the 1-grams (individual words)
               'vect__stop_words': [stop, None], # use the stop dictionary of stopwords or not
               'vect__tokenizer': [tokenizer_stemmer]}, # use a tokeniser and the stemmer 
               ]

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)


mnb_tfidf = Pipeline([('vect', tfidf),
                     ('clf',  SVC(gamma='auto'))])


                   
gs_mnb_tfidf = GridSearchCV(mnb_tfidf, param_grid0,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=1) 

**Important Note about the running time**

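Executing the following code cell **may take up to 30-60 min** on the full training set, depending on your machine and on the size of the parameter grid. The 240 model fits quoted in the original example (`2*2*2*3*5 + 2*2*2*3*5`) refer to a larger parameter grid; the reduced `param_grid0` defined above has only 2 candidates, so with 5-fold cross-validation there are `2*5 = 10` models to fit.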

If you do not wish to wait so long, you could reduce the size of the dataset by decreasing the number of training samples, for example, as follows:

    X_train = df.loc[:2500, 'review'].values
    y_train = df.loc[:2500, 'sentiment'].values
    
However, note that decreasing the training set size to such a small number will likely result in poorly performing models. Alternatively, you can delete parameters from the grid above to reduce the number of models to fit -- for example, by using the following:

    param_grid = [{'vect__ngram_range': [(1, 1)],
                   'vect__stop_words': [stop, None],
                   'vect__tokenizer': [tokenizer]},
                  ]
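
Before starting a long search it can help to count the fits first. This is a sketch using scikit-learn's `ParameterGrid`; the `stop`, `tokenizer` and `tokenizer_stemmer` objects below are placeholders standing in for the notebook's own, since only the number of options matters here.

```python
from sklearn.model_selection import ParameterGrid

# Placeholders for the notebook's objects (assumptions for illustration).
stop = ['a', 'the']
tokenizer = str.split
tokenizer_stemmer = str.split

param_grid0 = [{'vect__ngram_range': [(1, 1)],
                'vect__stop_words': [stop, None],
                'vect__tokenizer': [tokenizer_stemmer]}]

n_folds = 5
n_candidates = len(ParameterGrid(param_grid0))
print('%d candidates x %d folds = %d fits'
      % (n_candidates, n_folds, n_candidates * n_folds))
```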

In [27]:
gs_mnb_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
  'stop_words.' % sorted(inconsistent))
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  2.1min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=False,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                         

In [28]:
print('Best parameter set: %s ' % gs_mnb_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_mnb_tfidf.best_score_)

Best parameter set: {'vect__ngram_range': (1, 1), 'vect__stop_words': ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'som

In [29]:
clf = gs_mnb_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.499


####  GridSearchCV versus cross_val_score
    
Please note that `gs_mnb_tfidf.best_score_` is the average k-fold cross-validation score. I.e., if we have a `GridSearchCV` object with 5-fold cross-validation (like the one above), the `best_score_` attribute returns the average score over the 5-folds of the best model. To illustrate this with an example:
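
A minimal sketch of this relationship follows. It uses the Iris dataset and logistic regression purely as small stand-ins (assumptions for illustration, not part of this notebook's pipeline) and a grid with a single candidate so that `best_score_` refers to exactly one model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Iris is only a small stand-in dataset for this illustration.
X, y = load_iris(return_X_y=True)

# A grid with a single candidate, so best_score_ refers to exactly one model.
gs = GridSearchCV(LogisticRegression(max_iter=1000),
                  param_grid={'C': [1.0]},
                  scoring='accuracy',
                  cv=5)
gs.fit(X, y)

# The same model evaluated on the same (unshuffled, stratified) 5 folds.
scores = cross_val_score(LogisticRegression(C=1.0, max_iter=1000),
                         X, y, scoring='accuracy', cv=5)

print('GridSearchCV best_score_: %.3f' % gs.best_score_)
print('cross_val_score mean:     %.3f' % scores.mean())
```

The two numbers agree because both average the accuracy over the same five validation folds.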

# Exercise
In the previous code you saw that using the stopword list (i.e. `stop`) was better than None (no stopword removal). You can now explore other alternatives. Change the param_grid0 values to explore the following:
- Does using `tokenizer_stemmer` result in better accuracy compared to just the `tokenizer` (with no stemmer)?
- What happens if you consider both unigrams and bigrams? Is there a benefit to analysing bigrams?
- How can you change the classifier step 'clf' to a different learner such as the MLPClassifier or any of the learners you have tried in the previous lab?

# Unsupervised Sentiment Analysis
In the previous code we treated sentiment analysis as a standard text classification problem.
In this section we will adopt a different approach, whereby we will make use of a sentiment dictionary (SentiWordNet)
to score every word in a piece of text as either positive or negative.
Instead of fitting a machine learning classifier, we will aggregate these scores to establish the overall polarity of the piece of text.
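
The idea can be sketched without any external resources. The lexicon below is hand-made: its words and scores are invented for illustration and are NOT taken from SentiWordNet, and `lexicon_polarity` is a hypothetical helper.

```python
# A tiny hand-made lexicon of (positive, negative) scores per word.
# The words and numbers are invented and are NOT taken from SentiWordNet.
toy_lexicon = {
    'great': (0.75, 0.0),
    'good': (0.625, 0.0),
    'boring': (0.0, 0.5),
    'awful': (0.0, 0.875),
}

def lexicon_polarity(text, lexicon):
    """Return 1 (positive) or 0 (negative) from summed word scores."""
    pos_total = neg_total = 0.0
    for word in text.lower().split():
        pos, neg = lexicon.get(word, (0.0, 0.0))  # unknown words score zero
        pos_total += pos
        neg_total += neg
    return 1 if pos_total - neg_total >= 0 else 0

print(lexicon_polarity('a great film with good acting', toy_lexicon))
print(lexicon_polarity('awful plot and boring dialogue', toy_lexicon))
```

No training labels are needed, which is what makes the approach unsupervised. The quality of the predictions depends entirely on the lexicon.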

In [47]:
# let's download the sentiwordnet lexicon / dictionary from NLTK
# we also need the original wordnet lexicon
nltk.download('sentiwordnet')
nltk.download('wordnet')

[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /Users/craigpirie/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/craigpirie/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

## Using the sentiwordnet Lexicon

In [56]:
from nltk.corpus import sentiwordnet as swn

awesome = list(swn.senti_synsets('awesome', 'a'))[0]
print('Positive Polarity Score:', awesome.pos_score())
print('Negative Polarity Score:', awesome.neg_score())
print('Objective Score:', awesome.obj_score())

Positive Polarity Score: 0.875
Negative Polarity Score: 0.125
Objective Score: 0.0


## Exercise
Using the cell above, try SentiWordNet to explore the pos and neg scores for different words.
Note that 'a' refers to adjectives. If you are looking at a noun then this should be set to 'n', and similarly 'v' for verbs and 'r' for adverbs.

# POS Tagging - Setup NLTK for pos tagging

In [64]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/craigpirie/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/craigpirie/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

Unlike the previous tokenizer method we had created with the simple split function, we now need to both tokenise and associate the relevant part-of-speech (POS) tag with each word. To do this we will make use of nltk.pos_tag as shown in the example below:

In [41]:
tagged_text = nltk.pos_tag(tokenizer('The cat sat on the mat.'))
tagged_text


[('The', 'DT'),
 ('cat', 'NN'),
 ('sat', 'VBD'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('mat.', 'NN')]

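Note that `nltk.pos_tag` returns a list of (token, tag) pairs, and that the simple split left the full stop attached to 'mat.'.
This list structure is convenient, as we will see in the analyser below, which tokenises with word_tokenize from the nltk library instead.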
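
SentiWordNet is queried with WordNet's own part-of-speech letters ('n', 'v', 'a', 'r') rather than Penn Treebank tags such as 'NN' or 'VBD'. The mapping can be sketched as a small helper; `penn_to_wordnet` is a hypothetical name, and the tagged tokens are hard-coded so that the example runs without NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None."""
    if tag.startswith('NN'):
        return 'n'   # noun
    if tag.startswith('VB'):
        return 'v'   # verb
    if tag.startswith('JJ'):
        return 'a'   # adjective
    if tag.startswith('RB'):
        return 'r'   # adverb
    return None      # e.g. determiners and prepositions are skipped

# Hard-coded (token, tag) pairs in the format returned by nltk.pos_tag.
tagged_text = [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
               ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged_text])
```

Tokens that map to None have no senti-synset lookup and therefore do not contribute to the sentiment score.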

## Unsupervised Sentiment Analyser

In [65]:
from nltk import word_tokenize

def unsupervised_sentiment_analyzer(review, verbose=False):

    # tokenize and POS tag text tokens
    tagged_text = nltk.pos_tag(word_tokenize(review))
    pos_score = neg_score = token_count = obj_score = 0
    # get wordnet synsets based on POS tags
    # get sentiment scores if synsets are found
    for word, tag in tagged_text:
        ss_set = None
        if 'NN' in tag and list(swn.senti_synsets(word, 'n')): #noun
            ss_set = list(swn.senti_synsets(word, 'n'))[0]
        elif 'VB' in tag and list(swn.senti_synsets(word, 'v')): #verb
            ss_set = list(swn.senti_synsets(word, 'v'))[0]
        elif 'JJ' in tag and list(swn.senti_synsets(word, 'a')): #adjective
            ss_set = list(swn.senti_synsets(word, 'a'))[0]
        elif 'RB' in tag and list(swn.senti_synsets(word, 'r')): #adverb
            ss_set = list(swn.senti_synsets(word, 'r'))[0]
        # if senti-synset is found        
        if ss_set:
            # add scores for all found synsets
            pos_score += ss_set.pos_score()
            neg_score += ss_set.neg_score()
            obj_score += ss_set.obj_score()
            token_count += 1
    
    # aggregate final scores
    final_score = pos_score - neg_score
    # avoid a ZeroDivisionError when no token had a senti-synset
    token_count = max(token_count, 1)
    norm_final_score = round(float(final_score) / token_count, 2)
    final_sentiment = 'positive' if norm_final_score >= 0 else 'negative'
    if verbose:
        norm_obj_score = round(float(obj_score) / token_count, 2)
        norm_pos_score = round(float(pos_score) / token_count, 2)
        norm_neg_score = round(float(neg_score) / token_count, 2)
        # to display results in a nice table
        sentiment_frame = pd.DataFrame([[final_sentiment, norm_obj_score, norm_pos_score, 
                                         norm_neg_score, norm_final_score]],
                                       columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'], 
                                                             ['Predicted Sentiment', 'Objectivity',
                                                              'Positive', 'Negative', 'Overall']], 
                                                             # 'codes' replaces the 'labels' argument removed in pandas 1.0
                                                             codes=[[0,0,0,0,0],[0,1,2,3,4]]))
        print(sentiment_frame)
    return 1 if (final_sentiment == 'positive') else 0
    #return final_sentiment

## Predict Sentiment for some example text

In [69]:
unsupervised_sentiment_analyzer("stupid movie, acting average or worse... screenplay - no sense at all... SKIP IT!", verbose=True)
unsupervised_sentiment_analyzer("great movie, loved it to bits, good action and cool actors", verbose=True)

     SENTIMENT STATS:                                      
  Predicted Sentiment Objectivity Positive Negative Overall
0            negative        0.79     0.04     0.17   -0.12
     SENTIMENT STATS:                                      
  Predicted Sentiment Objectivity Positive Negative Overall
0            positive        0.78     0.19     0.03    0.16




1

# Exercise
The two calls above predict the negative class (0) for the first piece of text and the positive class (1) for the second; only the return value of the last call is displayed.
Modify the first piece of text so that the sentiment analyser predicts the positive class.

## Predict Sentiment for a selected movie review
You can use the code below to explore the prediction given a specific id of a review.
In the example we have used '5635'. Try out different review ids.

In [71]:
review_id = 5635  # change this id to explore other reviews
print(df.loc[review_id, 'review'])
pred = unsupervised_sentiment_analyzer(df.loc[review_id, 'review'], verbose=True)
print('predicted:', pred)

i really thought this would be a good movie boy was i mistaken for a quick summery b grade acting c grade special effects d grade for the overall movie don t get me wrong the story was pretty good and not kiddish so an adult too ride along with it the hero is good looking so most women will like it not a total chick flick as it contains some fight scenes and some bloodbut the way it is shot horrible the special effects would be better suited for tv on a kids show and lastly send some of the actors back to acting school if they ever attended a class there trust me there are much better ways to waste 2 hours you have been warned :)
     SENTIMENT STATS:                                      
  Predicted Sentiment Objectivity Positive Negative Overall
0            positive        0.81     0.12     0.07    0.05
predicted: 1




# Evaluate Sentiment Prediction on the Movie Review Test Dataset


In [72]:
# Choose the test sample.
# The full test set (X_test, y_test) has 25000 reviews, which may take
# considerable time on a standard machine, so we evaluate on a sample
reviews = df.loc[25000:26000, 'review'].values
sentiments = df.loc[25000:26000, 'sentiment'].values

count = 0
correct = 0

msg = "predicting for %d reviews" % len(reviews)
print(msg)
print('this will take a moment ...')
for test_X, test_y in zip(reviews, sentiments):#zip(df['review'], df['sentiment']):
    count+=1
    #print('REVIEW:', review)
    #print('Actual Sentiment:', test_y)
    pred = unsupervised_sentiment_analyzer(test_X, verbose=False) 
    #print('predicted Sentiment:', pred)
    #print('-'*60)
    if (pred==test_y):
        correct+=1

accuracy = round(100 * correct / count, 1)  # accuracy as a percentage
print(accuracy)
print(correct, count)


predicting for 1001 reviews
this will take a moment ...
59.6
597 1001
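
The manual counting loop can also be replaced by scikit-learn's metric functions once the predictions are collected in a list. The sketch below uses made-up labels and predictions purely for illustration. In the notebook, `y_pred` would hold the analyser's outputs and `y_true` the `sentiments` array.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print('Accuracy: %.3f' % acc)
print(cm)  # rows are true classes, columns are predicted classes
```

The confusion matrix shows whether the errors are concentrated in one class, which a single accuracy figure hides.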


# Exercise
- Which approach results in better accuracy: supervised or unsupervised sentiment analysis?
- What do you think might be the pros and cons of the two approaches to sentiment analysis, i.e. using a text classification (supervised) approach versus using a precompiled dictionary of sentiment knowledge to analyse text in an unsupervised setting?