# Lab2.4 Sentiment Analysis with Scikit-Learn

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

The focus of this notebook is on performing sentiment analysis using the scikit-learn package. Material from [this notebook](http://www.pitt.edu/~naraehan/presentation/Movie+Reviews+sentiment+analysis+with+Scikit-Learn.html) was used.

**At the end of this notebook, you will be able to**:
* load the training data, i.e., the airline tweets
* inspect the training data
* extract features from the training data
* train and evaluate a Naive Bayes classifier (*MultinomialNB*)
* apply the classifier to fake movie reviews

**If you want to learn more, you might find the following link useful:**
* [documentation on dataset loading](http://scikit-learn.org/stable/datasets/)

To train a machine learning system we need a number of packages. The most important ones are *sklearn* and *numpy*, which we use to manipulate our data and call machine learning functions. Since we are dealing with texts, we also need some specific packages from *sklearn* that operate on texts to get words as features.

In [2]:
import pathlib
import sklearn
import numpy
import nltk
from nltk.corpus import stopwords
from collections import Counter
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

## Loading the dataset
We are first going to load and inspect the **airlinetweets** dataset (which is part of the zip file you downloaded from GitHub). We are going to use the function **load_files** from sklearn.
Let's first inspect what the help message of the function **load_files** states.

In [3]:
help(load_files)

Help on function load_files in module sklearn.datasets.base:

load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
    Load text files with categories as subfolder names.
    
    Individual samples are assumed to be files stored a two levels folder
    structure such as the following:
    
        container_folder/
            category_1_folder/
                file_1.txt
                file_2.txt
                ...
                file_42.txt
            category_2_folder/
                file_43.txt
                file_44.txt
                ...
    
    The folder names are used as supervised signal label names. The individual
    file names are not important.
    
    This function does not try to extract features into a numpy array or scipy
    sparse matrix. In addition, if load_content is false it does not try to
    load the files in memory.
    
    To use text files in a scikit

Ok, so the function requires the following structure in order for it to work:
* container_folder/
    * category_1_folder/ (e.g., 'pos')
        * file_1.txt
        * file_2.txt
        * ...
        * file_42.txt
    * category_2_folder/ (e.g., 'neg')
        * file_43.txt
        * file_44.txt
        * ...
        
Remember from the previous labs that some of the NLTK data is precisely structured accordingly, e.g. nltk_data/corpora/movie_reviews.

Let's check whether our **airlinetweets** corpus has this structure.

In [4]:
cwd = pathlib.Path.cwd()
airline_tweets_folder = cwd.joinpath('airlinetweets')
print('path:', airline_tweets_folder)
print('this will print True if the folder exists:', 
      airline_tweets_folder.exists())

path: /Users/piek/Desktop/CBS2020/text-mining-ba-changed/lab_sessions/lab2/airlinetweets
this will print True if the folder exists: True


In [5]:
str(airline_tweets_folder)

'/Users/piek/Desktop/CBS2020/text-mining-ba-changed/lab_sessions/lab2/airlinetweets'

Inspect whether the corpus has the required structure.

....

....
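One way to carry out this inspection is to list the subfolders and count the files in each. The following is a minimal sketch, not part of the original notebook: the helper name `count_files_per_category` is hypothetical, and it is demonstrated on a throwaway temporary folder so that it runs anywhere. In the lab you would pass `airline_tweets_folder` instead.

```python
import pathlib
import tempfile


def count_files_per_category(container_folder):
    """Return {subfolder name: number of .txt files} for a load_files-style corpus."""
    container = pathlib.Path(container_folder)
    counts = {}
    for category_folder in sorted(container.iterdir()):
        if category_folder.is_dir():
            counts[category_folder.name] = len(list(category_folder.glob('*.txt')))
    return counts


# Build a tiny stand-in corpus with the layout that load_files expects.
with tempfile.TemporaryDirectory() as tmp:
    for category, n_files in [('negative', 2), ('positive', 1)]:
        folder = pathlib.Path(tmp, category)
        folder.mkdir()
        for i in range(n_files):
            folder.joinpath('file_%d.txt' % i).write_text('example tweet', encoding='utf-8')
    demo_counts = count_files_per_category(tmp)

print(demo_counts)
```

If every subfolder is a category name and contains only text files, the folder has the structure that **load_files** requires.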

Hopefully, it does! Let's now load it using the function.

In [6]:
# loading all files as training data.
airline_tweets_train = load_files(str(airline_tweets_folder))

## Inspecting dataset

How many files do we have?

In [7]:
len(airline_tweets_train.data)

4755

In [8]:
# target names ("classes") are automatically generated from subfolder names
airline_tweets_train.target_names

['negative', 'neutral', 'positive']

If you do not agree with these labels, you could change the names of the subdirectories. If you do not agree with the distinctions, you could add other folders with other category names and move files to these folders.

How many do we have for each category?

In [10]:
freqs = Counter(airline_tweets_train.target)
for category, frequency in freqs.items():
    print(airline_tweets_train.target_names[category], frequency)

neutral 1515
positive 1490
negative 1750


In [14]:
# Let's inspect the first file
airline_tweets_train.data[0]

b'@AmericanAir Why is your cover photo of TWA? Just wondering.'
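The `b'...'` prefix shows that the tweet was loaded as raw bytes, which is what `load_files` returns with its default `encoding=None` (see the signature in the help message). A small sketch of turning such bytes into text, assuming the files are UTF-8 encoded (an assumption; the notebook does not state the encoding):

```python
raw_tweet = b'@AmericanAir Why is your cover photo of TWA? Just wondering.'

# decode the bytes into a regular Python string
text_tweet = raw_tweet.decode('utf-8')

print(type(raw_tweet).__name__, '->', type(text_tweet).__name__)
print(text_tweet)
```

Under the same assumption, passing `encoding='utf-8'` to `load_files` would decode all files at load time instead.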

In [15]:
# first file is in "neutral" folder
airline_tweets_train.filenames[0]

'/Users/piek/Desktop/CBS2020/text-mining-ba-changed/lab_sessions/lab2/airlinetweets/neutral/AL_570069345818161152.txt'

In [16]:
# first file is a neutral tweet and is mapped to index 1 in target_names
airline_tweets_train.target[0]

1

We can find out what the index means by using it to look up the label in **target_names**:

In [17]:
airline_tweets_train.target_names[1]

'neutral'

## Extracting features from training data (see notebook Lab2.3 Feature representation.ipynb for more information)
Note: you might get a warning when you run the following cell. You do NOT have to resolve the warning.

In [18]:
# initialize the vectorizer; it is fitted to the airline tweets further below

airline_vec = CountVectorizer(min_df=2, # tokens that appear in fewer than 2 documents are ignored
                             tokenizer=nltk.word_tokenize, # we use the nltk tokenizer
                             stop_words=stopwords.words('english')) # stopwords are removed



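We have now created the vectorizer *airline_vec*. Once it has been fitted to the data (the *fit_transform* call further down), it holds a vector representation of the complete vocabulary of the full data set. Every position in this vector represents a unique word token.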

In [24]:
# length of the total vector
print(len(airline_vec.vocabulary_))

2902


In [25]:
# 'plane' is found in the corpus, mapped to index 1948
airline_vec.vocabulary_.get('plane')

1948

In order to represent each document in terms of this vector, we use the *fit_transform* function to generate a matrix in which each row is a document and each column holds the score of one vocabulary word in that document.

In [26]:
airline_counts = airline_vec.fit_transform(airline_tweets_train.data)

We can now inspect the dimensions of our feature array by getting the shape: the rows (documents) and columns (the word vector length).

In [37]:
# large dimensions! 4,755 documents, 2902 unique terms. 
airline_counts.shape

(4755, 2902)
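To make the role of `min_df` and the meaning of the shape concrete, here is a small sketch on three made-up sentences. It uses the default tokenizer of `CountVectorizer` rather than the NLTK tokenizer, so that it runs without any NLTK data; the `toy_` names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ['the flight was late',
            'the flight was great',
            'great crew']

# min_df=2 keeps only terms that occur in at least two documents
toy_vec = CountVectorizer(min_df=2)
toy_counts = toy_vec.fit_transform(toy_docs)

# terms that were kept, in column order
kept_terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(kept_terms)
print(toy_counts.shape)      # (number of documents, number of kept terms)
print(toy_counts.toarray())
```

The words 'late' and 'crew' each occur in only one document, so they get no column; the third document is therefore represented by its count for 'great' alone.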

We can convert the matrix to an array and get the first element and look at the vector values for slots 100 till 200:

In [41]:
print(airline_counts.toarray()[0][100:200])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


We can see that most values are zeros and just a few have the value 1. This is what we call a sparse vector.
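How sparse a matrix is can be quantified as the fraction of non-zero entries. A minimal sketch on a small hand-made matrix (the numbers are invented for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 1, 0, 0],
                  [0, 2, 0, 0, 0],
                  [0, 0, 0, 0, 0]])
sparse = csr_matrix(dense)

# nnz is the number of stored (non-zero) entries
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(sparse.nnz, density)
```

The same `.nnz` attribute and `.shape` should be available on `airline_counts`, since `fit_transform` returns a SciPy sparse matrix.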

As we have seen in the previous notebook, we can also transform the counts into information value scores using the *TfidfTransformer* class.

In [42]:
# Convert raw frequency counts into TF-IDF values
tfidf_transformer = TfidfTransformer()
airline_tfidf = tfidf_transformer.fit_transform(airline_counts)

The shape remains the same, but the values are now scores between zero and one.

In [43]:
# Same dimensions, now with tf-idf values instead of raw frequency counts
print(airline_tfidf.shape)
print(airline_tfidf.toarray()[0][100:200])

(4755, 2902)
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.19103974 0.07301937 0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.        ]
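To see where these scores come from, the sketch below recomputes tf-idf by hand for a tiny invented count matrix and compares the result with `TfidfTransformer`. It assumes the default settings of scikit-learn (`smooth_idf=True`, `norm='l2'`), under which the weight of a term is its count times `ln((1 + n) / (1 + df)) + 1`, after which each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# rows are documents, columns are terms
toy_counts = np.array([[2, 1, 0],
                       [1, 0, 1],
                       [0, 0, 3]])

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)           # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
weighted = toy_counts * idf                       # term count times idf
# scale every row to length 1 (L2 normalisation)
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Because every row is scaled to unit length and counts are never negative, all values end up between zero and one.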

## Training and testing a Naive Bayes classifier

We can now use the above data representation as training data to build a classifier. The sklearn package has already associated each row (a document) in our data representation with a label by taking the name of the data subfolder.

In [54]:
airline_tweets_train.target_names[2]

'positive'

We are going to use a simple Naive Bayes classifier to train a model. We use the multinomial variant, which is designed for count-like word features and handles our three labels (negative, neutral, positive) directly.

In [55]:
# Now ready to build a classifier. 
# We will use Multinomial Naive Bayes as our model
from sklearn.naive_bayes import MultinomialNB

It is easy for a machine learning package to read the above vector representations and associate them with any type of label. However, we also want to test the classifier. For that purpose, we need to exclude part of the data from the training set.

To train the classifier, we will first split the data into train and test.

In [56]:
# Split data into training and test sets
# from sklearn.cross_validation import train_test_split  # deprecated in 0.18
from sklearn.model_selection import train_test_split

We choose 80% training and 20% test. 

In [57]:
docs_train, docs_test, y_train, y_test = train_test_split(
    airline_tfidf, # the tf-idf matrix
    airline_tweets_train.target, # the labels for each tweet 
    test_size = 0.20 # we use 80% for training and 20% for testing
    ) 

One instance looks like this:

In [58]:
docs_train[55].toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])

Its label is:

In [59]:
y_train[55]

1

which we know is then:

In [60]:
airline_tweets_train.target_names[y_train[55]]

'neutral'

In [61]:
# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB().fit(docs_train, y_train)

In [62]:
# Predict the labels of the test set; recall is computed further below
y_pred = clf.predict(docs_test)

In [63]:
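# Caution: data[0] and target[0] belong to the first tweet of the full dataset,
# while y_pred[0] is the prediction for the first row of docs_test.
# After the random split these are generally different tweets.
print('one tweet review:', airline_tweets_train.data[0])
print('gold label:', airline_tweets_train.target[0])
print('classifier predicted:', y_pred[0])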

one tweet review: b'@AmericanAir Why is your cover photo of TWA? Just wondering.'
gold label: 1
classifier predicted: 1
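Note that `train_test_split` shuffles the rows, so row 0 of the test set is usually not tweet 0 of the full dataset. One possible way to keep track of which tweet ended up where is to split an array of row indices together with the features and labels. The sketch below shows this on invented stand-in data; all variable names are illustrative, and `random_state=0` is only there to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_texts = ['tweet %d' % i for i in range(10)]
toy_features = np.arange(20).reshape(10, 2)   # stand-in for the tf-idf matrix
toy_labels = np.array([0, 1] * 5)
toy_indices = np.arange(len(toy_texts))

# Splitting the indices along with the features records which row is which.
feat_train, feat_test, gold_train, gold_test, idx_train, idx_test = train_test_split(
    toy_features, toy_labels, toy_indices,
    test_size=0.20,
    random_state=0)   # fixed seed so that the split is reproducible

# idx_test[0] is the position in the full dataset of the first test row
first_test_text = toy_texts[idx_test[0]]
print(first_test_text, 'has gold label', gold_test[0])
```

With the indices at hand, the text of a test tweet, its gold label and its prediction can be lined up correctly.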


See notebook **Lab2-Evaluation** for information on the evaluation. The micro recall of our classifier is:

In [64]:
sklearn.metrics.recall_score(y_true=y_test,
                             y_pred=y_pred,
                             average='micro')

0.8138801261829653
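The difference between micro and macro averaging is easiest to see on a small invented example. Below, 7 of the 10 predictions are correct, and the per-class recalls are 3/4, 1/2 and 3/4. Micro recall pools all decisions (7/10), whereas macro recall averages the three per-class values (2/3). For single-label classification such as ours, micro recall equals accuracy.

```python
from sklearn.metrics import accuracy_score, recall_score

toy_gold = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
toy_pred = [0, 0, 0, 1, 1, 2, 2, 2, 2, 0]

micro = recall_score(toy_gold, toy_pred, average='micro')   # pooled over all tweets
macro = recall_score(toy_gold, toy_pred, average='macro')   # mean of per-class recall
accuracy = accuracy_score(toy_gold, toy_pred)
print(micro, macro, accuracy)
```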

We can also inspect the most important features per category.

In [66]:
def important_features_per_class(vectorizer,classifier,n=10): #n is the number of top features
    class_labels = classifier.classes_
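    # note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
    feature_names = vectorizer.get_feature_names()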
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names),reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names),reverse=True)[:n]
    topn_class3 = sorted(zip(classifier.feature_count_[2], feature_names),reverse=True)[:n]
    print("Important words in negative documents")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print("-----------------------------------------")
    print("Important words in neutral documents")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat) 
    print("-----------------------------------------")
    print("Important words in positive documents")
    for coef, feat in topn_class3:
        print(class_labels[2], coef, feat) 

# example of how to call from notebook:
important_features_per_class(airline_vec, clf)

Important words in negative documents
0 158.5026183625101 united
0 114.63901478498562 .
0 99.34506196626158 ``
0 97.14276347218933 @
0 55.20252957511612 flight
0 50.797049679143626 ?
0 44.82498343550132 #
0 40.96792229453236 !
0 40.91549370524313 n't
0 28.969699883212485 ''
-----------------------------------------
Important words in neutral documents
1 107.17306547959436 @
1 85.63988700454907 ?
1 64.70572752146752 jetblue
1 60.073937553971604 .
1 59.57862047110026 southwestair
1 57.76483083081609 ``
1 52.9830504708107 :
1 52.683581162458346 americanair
1 41.466075480898525 usairways
1 38.066061269751444 http
-----------------------------------------
Important words in positive documents
2 166.3336107680331 !
2 104.97745601590876 @
2 88.59014116105949 .
2 84.07913775756035 thanks
2 79.69467663605144 thank
2 67.67368174634025 southwestair
2 65.37572678433506 jetblue
2 63.03115865898889 ``
2 52.91319925303695 #
2 49.67742952791253 americanair
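The function above hard-codes three classes. As a hedged alternative, the sketch below loops over whatever classes the classifier has seen and returns the top features as a dictionary. The helper name `top_features_per_class` and the toy data are invented for illustration; the feature names are read from `vocabulary_`, which avoids depending on a particular scikit-learn version.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def top_features_per_class(feature_names, classifier, class_names, n=10):
    """Map each class name to its n features with the highest feature_count_."""
    top = {}
    for class_index, counts in zip(classifier.classes_, classifier.feature_count_):
        best_columns = np.argsort(counts)[::-1][:n]   # highest counts first
        top[class_names[class_index]] = [feature_names[i] for i in best_columns]
    return top


toy_docs = ['thanks thanks great', 'thanks crew', 'delay delay lost', 'delay luggage']
toy_labels = [1, 1, 0, 0]
toy_class_names = ['negative', 'positive']

toy_vec = CountVectorizer()
toy_clf = MultinomialNB().fit(toy_vec.fit_transform(toy_docs), toy_labels)

# column order follows the vocabulary indices
toy_feature_names = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
toy_top = top_features_per_class(toy_feature_names, toy_clf, toy_class_names, n=1)
print(toy_top)
```

As the real output above shows, punctuation and airline names dominate such lists, so the highest counts are not necessarily the most sentiment-bearing words.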


## Applying classifier on our own data
Now we can apply our classifier to new data.
In the example below, the new data are movie reviews. In the exercise, you will use tweets that you have selected yourself.

In [67]:
# very short and fake movie reviews
reviews_new = ['This movie was excellent', 
               'Absolute joy ride', 
               'Steven Seagal was terrible', 
               'Steven Seagal shined through.', 
               'This was certainly a movie', 
               'Two thumbs up', 
               'I fell asleep halfway through', 
               "We can't wait for the sequel!!", 
               'I cannot recommend this highly enough', 
               'instant classic.', 
               'Steven Seagal was amazing.']
len(reviews_new)

11

In [68]:
# We re-use airline_vec to transform it in the same way as the training data
new_counts = airline_vec.transform(reviews_new)
new_counts.shape

(11, 2902)

Note that words in our movie reviews that are NOT in the training data will not be represented, as there are no slots for them in the vectors from the training data.
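This effect can be checked on a tiny invented example, using the default tokenizer of `CountVectorizer`: a word that was not seen during fitting leaves no trace in the transformed vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vec = CountVectorizer()
toy_vec.fit(['good flight', 'bad flight'])   # vocabulary: bad, flight, good

seen = toy_vec.transform(['good movie'])          # 'movie' was never seen during fit
unseen = toy_vec.transform(['excellent movie'])   # neither word was seen during fit

print(seen.toarray())
print(unseen.toarray())
```

A review made up entirely of unseen words becomes an all-zero vector, so the classifier can only fall back on what it learned about the class distribution.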

In [69]:
# we compute tf idf values
reviews_new_tfidf = tfidf_transformer.transform(new_counts)

In [70]:
reviews_new_tfidf.shape

(11, 2902)

In [71]:
# have classifier make a prediction
pred = clf.predict(reviews_new_tfidf)

In [72]:
# print out the results
for review, predicted_label in zip(reviews_new, pred):
    
    print('%s => %s' % (review, 
                        airline_tweets_train.target_names[predicted_label]))

This movie was excellent => positive
Absolute joy ride => positive
Steven Seagal was terrible => negative
Steven Seagal shined through. => negative
This was certainly a movie => negative
Two thumbs up => negative
I fell asleep halfway through => neutral
We can't wait for the sequel!! => negative
I cannot recommend this highly enough => negative
instant classic. => negative
Steven Seagal was amazing. => positive


## End of this notebook