# Text Classification Exercise: Movie Reviews

## Introduction

This exercise uses the data from Kaggle's [IMDB Movie reviews](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) competition.

**Description of the data:**

- **`labeledTrainData.tsv.zip`** contains the dataset.
- Each observation in this dataset is a review of a movie by a user.
- The **sentiment** column is the sentiment of the review (1 -> positive and 0 -> negative).
- The **review** column is the text of the review.

**Goal:** Predict the sentiment of the review using the review text.

## Task 1

Read **`labeledTrainData.tsv.zip`** into a pandas DataFrame and examine it. Note that pandas can read a tsv/csv file directly from a zip archive, provided the archive contains a single data file.

## Task 2

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review** as the feature and the **sentiment** as the response.

- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.
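
To see why the hint matters, here is a minimal sketch on a made-up three-row DataFrame (the toy reviews and variable names are illustrative, not part of the exercise data). CountVectorizer iterates over whatever it is given, and iterating a DataFrame yields its column labels, so a one-column DataFrame ends up being treated as a single document containing the word "review".

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, assumed for illustration only
df = pd.DataFrame({
    "review": ["good film", "bad film", "good plot"],
    "sentiment": [1, 0, 1],
})

X_series = df["review"]    # pandas Series: one string per document
X_frame = df[["review"]]   # one-column DataFrame: iterates over column labels

dtm_series = CountVectorizer().fit_transform(X_series)
dtm_frame = CountVectorizer().fit_transform(X_frame)

print(dtm_series.shape)  # (3, 4): three documents, vocabulary bad/film/good/plot
print(dtm_frame.shape)   # (1, 1): a single "document" made of the column name
```

No error is raised in the DataFrame case, which is why the mistake is easy to miss until the shapes are checked.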

## Task 3

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

## Task 4

Use multinomial Naive Bayes to **predict the sentiment** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

## Task 5

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

## Task 6

Use different **tuning parameters** (e.g. max_df, min_df, max_features) to build models and check the test accuracy.

**Hint:** You can write a function that accepts a vectorizer as a parameter and:

- creates DTMs for the training and test data,
- trains a model (SVM),
- calculates the testing accuracy and prints it.

Call this function with vectorizer objects created using different tuning parameters. Use the TF-IDF vectorizer for this task.
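
The hint can be sketched roughly as follows. This is one possible shape for such a function, not the notebook's own solution: the helper name `evaluate_vectorizer`, the toy reviews, and the choice of `LinearSVC` as the SVM are all assumptions made for illustration. On the real data you would pass the train/test splits instead of the toy lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

def evaluate_vectorizer(vectorizer, X_train, X_test, y_train, y_test):
    """Hypothetical helper: build DTMs, train an SVM, print and return test accuracy."""
    X_train_dtm = vectorizer.fit_transform(X_train)  # learn vocabulary on training text only
    X_test_dtm = vectorizer.transform(X_test)        # reuse that vocabulary for the test text
    clf = LinearSVC()                                # linear SVM, assumed here for speed
    clf.fit(X_train_dtm, y_train)
    accuracy = accuracy_score(y_test, clf.predict(X_test_dtm))
    print(f"Features: {X_train_dtm.shape[1]}  Test accuracy: {accuracy:.4f}")
    return accuracy

# Toy data, assumed for illustration only
train_texts = [
    "a great and moving film", "great acting and a great story",
    "I loved this wonderful film", "wonderful story and lovely acting",
    "a terrible and boring film", "boring story and terrible acting",
    "I hated this awful film", "awful plot and dull acting",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["a wonderful and great story", "dull and boring plot"]
test_labels = [1, 0]

vectorizers = [
    TfidfVectorizer(),
    TfidfVectorizer(min_df=2),
    TfidfVectorizer(max_df=0.8),
    TfidfVectorizer(max_features=5),
]
results = [
    evaluate_vectorizer(v, train_texts, test_texts, train_labels, test_labels)
    for v in vectorizers
]
```

Keeping the vectorizer as the only thing that varies makes the accuracies directly comparable across parameter settings.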

In [2]:
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [3]:
import pandas as pd
import numpy as np

In [4]:
# read the file into pandas from Google Drive; change the path as needed
sms_df = pd.read_table('/gdrive/My Drive/Statistical NLP AIML/labeledTrainData.tsv.zip')

In [5]:
sms_df.shape

(25000, 3)

In [6]:
sms_df.head()

       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1   \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...


In [7]:
from sklearn.model_selection import train_test_split

In [8]:
# split X and y into training and testing sets
sms_train, sms_test, y_train, y_test = train_test_split(sms_df.review, sms_df.sentiment, random_state=2)

In [9]:
#Training data
print(sms_train.shape)
print(y_train.shape)

(18750,)
(18750,)


In [10]:
#Test Data
print(sms_test.shape)
print(y_test.shape)

(6250,)
(6250,)


### 3. Tokenization & Vectorization

Use **CountVectorizer** to turn the review text into numeric features.

In [11]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer()

In [12]:
#Fit CountVectorizer on the training reviews
cvect.fit(sms_train)

#Check the vocabulary size
len(cvect.vocabulary_)

66672

In [13]:
#Inspect the vocabulary (token -> column index in the DTM)
cvect.vocabulary_

{'japanese': 31283,
 'tomo': 59974,
 'akiyama': 2138,
 'keko': 32409,
 'mask': 36734,
 '1993': 368,
 'is': 30933,
 'extremely': 20913,
 'enjoyable': 19581,
 'trash': 60543,
 'film': 21947,
 'and': 2876,
 'so': 54695,
 'fun': 23551,
 'to': 59841,
 'watch': 64404,
 'there': 59240,
 'are': 3638,
 'also': 2477,
 'some': 54907,
 'sequels': 52492,
 'but': 8741,
 'haven': 26904,
 'seen': 52232,
 'them': 59186,
 'since': 53778,
 'these': 59266,
 'films': 21985,
 'hyper': 28826,
 'rare': 47705,
 'kind': 32692,
 'of': 41567,
 're': 47883,
 'releases': 48714,
 'day': 14865,
 'would': 65697,
 'be': 5597,
 'nice': 40572,
 'think': 59321,
 'many': 36384,
 'lovers': 35270,
 'like': 34551,
 'the': 59155,
 'tongue': 59993,
 'in': 29482,
 'cheek': 10499,
 'story': 56536,
 'about': 1094,
 'one': 41793,
 'strict': 56717,
 'school': 51715,
 'which': 64860,
 'teachers': 58623,
 'that': 59144,
 'it': 31020,
 'okay': 41676,
 'torture': 60143,
 'students': 56849,
 'order': 42032,
 'attain': 4301,
 'discipline'

Build Document-term Matrix (DTM)

In [14]:
#Convert the training reviews into count vectors
X_train_ct = cvect.transform(sms_train)

In [15]:
#Size of Document Term Matrix
X_train_ct.shape

(18750, 66672)

In [16]:
#Let's check the first record
X_train_ct[0]

<1x66672 sparse matrix of type '<class 'numpy.int64'>'
	with 280 stored elements in Compressed Sparse Row format>

In [17]:
#Print the stored entries: (row, column index) followed by the count
print(X_train_ct[0])

  (0, 33)	1
  (0, 368)	1
  (0, 1094)	4
  (0, 1165)	1
  (0, 1267)	1
  (0, 1431)	1
  (0, 1510)	1
  (0, 1856)	1
  (0, 2052)	1
  (0, 2138)	1
  (0, 2342)	3
  (0, 2392)	2
  (0, 2394)	1
  (0, 2477)	6
  (0, 2541)	1
  (0, 2727)	1
  (0, 2876)	19
  (0, 3270)	1
  (0, 3638)	12
  (0, 3937)	6
  (0, 4200)	3
  (0, 4301)	1
  (0, 4336)	1
  (0, 4829)	1
  (0, 5272)	1
  :	:
  (0, 60543)	5
  (0, 62227)	1
  (0, 62869)	1
  (0, 62968)	3
  (0, 63208)	2
  (0, 63422)	2
  (0, 63515)	1
  (0, 63580)	1
  (0, 64357)	1
  (0, 64404)	1
  (0, 64539)	2
  (0, 64678)	1
  (0, 64800)	2
  (0, 64805)	1
  (0, 64860)	5
  (0, 64870)	1
  (0, 64974)	4
  (0, 65031)	1
  (0, 65137)	1
  (0, 65361)	4
  (0, 65408)	1
  (0, 65697)	3
  (0, 65700)	1
  (0, 66124)	3
  (0, 66231)	4


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
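
As a rough illustration of the point made in the quote, the sketch below measures how full a document-term matrix is. The toy documents are made up for the example; the last two lines reuse the figures printed earlier for the first training review (280 stored elements out of 66672 columns).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, assumed for illustration only
docs = [
    "the film was great",
    "the plot was dull",
    "great acting and a great story",
    "a dull and boring film",
]
dtm = CountVectorizer().fit_transform(docs)

# nnz is the number of explicitly stored (non-zero) entries in the sparse matrix
n_rows, n_cols = dtm.shape
density = dtm.nnz / (n_rows * n_cols)
print(f"toy DTM: {n_rows} x {n_cols}, density {density:.2%}")

# Figures from the notebook output: 280 stored elements in a 1 x 66672 row
first_row_density = 280 / 66672
print(f"first training review: {first_row_density:.2%} non-zero")  # 0.42% non-zero
```

A toy corpus is far denser than a real one because its vocabulary is tiny; the 0.42% figure for the real review is in line with the "more than 99% zeros" estimate quoted above.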

Convert the test reviews into numerical features as well

In [18]:
X_test_ct = cvect.transform(sms_test)

In [19]:
X_test_ct.shape

(6250, 66672)
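
The test matrix has the same 66672 columns as the training matrix because `transform` reuses the vocabulary learned during `fit`. A small sketch on made-up sentences (the documents and variable names are illustrative) shows that a word never seen in training is simply dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, assumed for illustration only
train_docs = ["good film", "bad film"]
test_docs = ["good unseen film"]

vect = CountVectorizer()
train_dtm = vect.fit_transform(train_docs)  # vocabulary is learned here
test_dtm = vect.transform(test_docs)        # vocabulary is only looked up here

print(sorted(vect.vocabulary_))  # ['bad', 'film', 'good']
print(test_dtm.toarray())        # [[0 1 1]]  ("unseen" has no column)
```

Calling `fit` again on the test data would build a different vocabulary and make the two matrices incompatible with the same model.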

### 4. Building a Sentiment Classifier

Let's first try a support vector classifier (SVC) with the default parameters

In [20]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

In [20]:
# instantiate the model (with the default parameters)
model = SVC()
# fit the model with data (occurs in-place)
model.fit(X_train_ct, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Evaluation on Test Dataset

In [21]:
from sklearn import metrics

In [22]:
#Calculate accuracy on Test Dataset
metrics.accuracy_score(y_test, model.predict(X_test_ct))

0.8552

In [23]:
model.score(X_test_ct,y_test)

0.8552

In [24]:
from sklearn.naive_bayes import MultinomialNB

In [25]:
#Use a Sklearn Pipeline to run CountVectorizer and MultinomialNB together
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import cross_val_score
#Pipeline steps are passed as a list of (name, estimator) tuples
pipe = Pipeline([
    ("cv",CountVectorizer(stop_words='english',ngram_range=(1, 2),max_df=0.5,min_df=100)),
    ("nb",MultinomialNB())
])
pipe.fit(sms_train,y_train)
print("Training Accuracy")
print(pipe.score(sms_train,y_train))
print("Testing Accuracy")
print(pipe.score(sms_test,y_test))
predicted = pipe.predict(sms_test)
print(confusion_matrix(y_test,predicted))
print(classification_report(y_test,predicted))
#scoresdt = cross_val_score(pipe,sms_train,y_train,cv=10)
#print(scoresdt)
#print("Average Cross Validation Accuracy")
#print(np.mean(scoresdt))

Training Accuracy
0.8554133333333334
Testing Accuracy
0.84368
[[2679  508]
 [ 469 2594]]
              precision    recall  f1-score   support

           0       0.85      0.84      0.85      3187
           1       0.84      0.85      0.84      3063

    accuracy                           0.84      6250
   macro avg       0.84      0.84      0.84      6250
weighted avg       0.84      0.84      0.84      6250
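
The numbers in the report can be recovered from the confusion matrix alone, and the same matrix also gives the null accuracy asked for in Task 5. The sketch below hard-codes the matrix printed above; it relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true 0/1, columns: predicted 0/1)
cm = np.array([[2679, 508],
               [469, 2594]])

total = cm.sum()
accuracy = np.trace(cm) / total             # correct predictions lie on the diagonal
class_counts = cm.sum(axis=1)               # true class sizes: [3187, 3063]
null_accuracy = class_counts.max() / total  # always predict the most frequent class

precision_pos = cm[1, 1] / cm[:, 1].sum()   # of predicted positives, how many are right
recall_pos = cm[1, 1] / cm[1, :].sum()      # of true positives, how many are found

print(f"accuracy      {accuracy:.5f}")       # 0.84368
print(f"null accuracy {null_accuracy:.5f}")  # 0.50992
print(f"precision(1)  {precision_pos:.2f}")  # 0.84
print(f"recall(1)     {recall_pos:.2f}")     # 0.85
```

Since the two classes are almost balanced, the null accuracy is close to 0.5, so the model's 0.84 is a substantial improvement over always guessing the majority class.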



### 7. Building a Deep Learning Model

In [21]:
import tensorflow as tf

We will use the CountVectorizer features in this case. These can be replaced by TF-IDF features.

In [22]:
#Start building a Keras Sequential Model
tf.keras.backend.clear_session()
model = tf.keras.Sequential()

In [24]:
#Add hidden layers
model.add(tf.keras.layers.Dense(100, activation='relu', input_shape=(len(cvect.vocabulary_),)))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(50, activation='relu'))
model.add(tf.keras.layers.Dropout(0.4))

#Add Output layer
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

In [29]:
#Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

In [30]:
#X_train_ct_array = X_train_ct.toarray()
#X_test_ct_array = X_test_ct.toarray()

In [26]:
#model.fit(X_train_ct.toarray(), y_train,
#           validation_data=(X_test_ct.toarray(), y_test), 
#           epochs=10, batch_size=12)
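
The notebook does not say why the `fit` call is commented out, but one likely obstacle is memory: calling `.toarray()` on the full training matrix materialises every zero. The sketch below estimates that cost from the shapes and dtype printed earlier, and shows one possible workaround that densifies only one batch at a time. The helper `dense_batches` and the random demo matrix are hypothetical, introduced here for illustration.

```python
import numpy as np
from scipy import sparse

# Dense size of the training DTM reported above: 18750 x 66672 int64 values (8 bytes each)
dense_bytes = 18750 * 66672 * 8
print(f"{dense_bytes / 1e9:.1f} GB")  # 10.0 GB

def dense_batches(X_sparse, y, batch_size):
    """Hypothetical helper: yield (dense_features, labels) one batch at a time."""
    y = np.asarray(y)
    for start in range(0, X_sparse.shape[0], batch_size):
        stop = start + batch_size
        # Only this slice of rows is converted to a dense array
        yield X_sparse[start:stop].toarray(), y[start:stop]

# Small random sparse matrix standing in for X_train_ct (assumed demo data)
X_demo = sparse.random(10, 50, density=0.1, format="csr", random_state=0)
y_demo = np.arange(10) % 2

batches = list(dense_batches(X_demo, y_demo, batch_size=4))
print([xb.shape for xb, _ in batches])  # [(4, 50), (4, 50), (2, 50)]
```

Each batch could then be handed to the model in turn, so peak memory depends on the batch size rather than on the size of the whole corpus. Limiting the vocabulary with `max_features` when building the vectorizer is another way to shrink the input layer.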