# Text Classification Project

## Perform imports and load the dataset

In [83]:
import numpy as np
import pandas as pd
df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\t')
df.head(10)

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...
5,neg,"to put it bluntly , ed wood would have been pr..."
6,neg,"synopsis : melissa , a mentally-disturbed woma..."
7,neg,tim robbins and martin lawernce team up in thi...
8,neg,"in "" gia "" , angelina jolie plays the titular ..."
9,neg,"in 1990 , the surprise success an unheralded l..."


In [2]:
# The dataset contains the text of 2000 movie reviews. 1000 are positive, 1000 are negative, and the text has 
# been preprocessed as a tab-delimited file.

In [3]:
len(df)

2000

In [13]:
df['label'].value_counts()  # Executed as In [13], after the cleanup further down, so the counts reflect the cleaned data

neg    969
pos    969
Name: label, dtype: int64

#### Let's take a look at one of the reviews

In [4]:
from IPython.display import Markdown, display

display(Markdown('> '+df['review'][9])) # The leading '> ' renders the review as a Markdown blockquote.

> in 1990 , the surprise success an unheralded little movie called ghost instantly rescued the moribund careers of its trio of above-the-title stars , patrick swayze , demi moore , and whoopi goldberg . 
eight years later , moore and goldberg's careers aren't exactly thriving , but they have had their share of screen successes since ; the same can't be said of swayze , who has just added yet another turkey to his resume with the aptly named black dog . 
forget the mortal kombat movies--this trucksploitation flick is the closest the movies has come to video games . 
good truck driver jack crews ( swayze ) must drive a cargo of illegal firearms from atlanta to new jersey . 
along the way , jack and his crew of three run into a number of obstacles--such as a highway weigh station , evil truckers , and deadly uzi-firing motorcyclists . 
every so often , like at the end of a video game " level " or " stage , " the main baddie pops up : red ( meat loaf , fresh from the triumph of spice world ) , who wants to steal the cache of guns . 
just in case you forget his name or have trouble keeping track of who's driving what , all of red's vehicles , be it a pickup or a big rig , are painted--you guessed it--red . 
i could go into more of the plot specifics ( such as jack's dream of having a nice home with his family , the past trauma that sent him to prison and cost him his trucking license , the fbi/atf crew tracking the cargo ) , but they are of little importance . 
all that matters to director kevin hooks and writers william mickelberry and dan vining are the obstacles jack confronts in his drive from point a to point b . but they fail at even this modest goal , for none of the highway chaos , as credibly staged as it is , is terribly interesting , let alone exciting . 
once you've seen a couple of trucks bang against each other or a big rig explode the first time , you've seen it every time . 
as dreary as black dog is as an entertainment , the saddest part about the film has nothing to do with what shows up onscreen ; it's that swayze has to reduce himself to such work . 
while far from the best of actors , he is certainly not horrible , and he is a charismatic presence . 
i don't know if it's his judgment or the dearth of quality job offers that leads him to involve himself with bombs such as black dog . 
regardless , if he continues on this career track , could a tv series be far behind ? 


#### Let's check for missing values

In [5]:
df.isnull().sum()

label      0
review    35
dtype: int64

In [6]:
# We can see that no labels are missing, but 35 reviews are, so some rows have no review text at all.
# We can remove these NaN reviews.

In [7]:
df.dropna(inplace=True)

In [8]:
df.isnull().sum() # Now we don't have any missing values

label     0
review    0
dtype: int64

In [9]:
# We can also remove reviews that consist only of whitespace, using the str.isspace() method.

In [10]:
blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if isinstance(rv, str):      # skip NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
        
print(len(blanks), 'blanks: ', blanks)

27 blanks:  [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]


In [11]:
# The .itertuples() pandas method provides access to every field.
# For brevity, we assign the names i, lb and rv to the index, label and review columns.
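
The loop above can also be written without explicit iteration. The following is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are assumptions for illustration, not part of the dataset). Unlike `isspace()`, comparing the stripped text against `''` also catches truly empty strings, which may or may not matter for a given file.

```python
import pandas as pd

# Toy frame standing in for the reviews; the values are made up for illustration.
toy = pd.DataFrame({
    'label': ['neg', 'pos', 'neg', 'pos'],
    'review': ['a dull film', '   ', None, 'great fun'],
})

# A review is blank if it is missing or if nothing is left after stripping whitespace.
is_blank = toy['review'].isna() | (toy['review'].str.strip() == '')

# Keep only the rows that have real text.
cleaned = toy[~is_blank]
print(cleaned.index.tolist())
```

On the real data, the same mask could replace both the `dropna` call and the `blanks` loop in one step.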

In [12]:
df.drop(blanks, inplace=True)

len(df)

1938

In [14]:
df['label'].value_counts()

neg    969
pos    969
Name: label, dtype: int64

## Let's split the data into train & test sets

In [18]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
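
As a quick sanity check on the split above, the arithmetic below sketches how many reviews land in each set. It assumes that `train_test_split` rounds the test share up to a whole number of rows, which is consistent with the support of 640 shown in the classification reports later on.

```python
import math

n_samples = 1938   # rows left after dropping missing and blank reviews
test_size = 0.33   # fraction passed to train_test_split

# Assumption: the test count is the fraction of the data, rounded up.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```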

## Let's build pipelines to vectorize the data, then train and fit a model

In [20]:
# Now that we have sets to train and test, we'll develop a selection of pipelines, each with a different model

In [21]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Naïve Bayes:
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB()),
])

# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])
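
To make clear what a pipeline does, here is a small sketch comparing a pipeline with the same two steps written out by hand. The miniature corpus and all variable names are made up for illustration. The point is only that the pipeline fits the vectorizer on the training text and reuses that fitted vocabulary at prediction time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up miniature corpus, for illustration only.
train_docs = ['great fun film', 'wonderful acting',
              'dull boring plot', 'terrible boring film']
train_labels = ['pos', 'pos', 'neg', 'neg']
new_docs = ['boring acting', 'great plot']

# Pipeline version: one object handles vectorizing and classifying.
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(new_docs)

# Manual version: the same two steps, spelled out.
vec = TfidfVectorizer()
train_tfidf = vec.fit_transform(train_docs)        # learn vocabulary and IDF weights on training text only
clf = MultinomialNB().fit(train_tfidf, train_labels)
manual_pred = clf.predict(vec.transform(new_docs))  # reuse the fitted vocabulary on new text

# Both routes should give the same predictions.
print(list(pipe_pred) == list(manual_pred))
```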

## Feed the training data through the first pipeline

#### For naïve Bayes

In [22]:
text_clf_nb.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

### Predictions

In [23]:
predictions = text_clf_nb.predict(X_test) # Form a prediction set

### Let's analyze the results

In [25]:
# Report the confusion matrix

In [26]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[287  21]
 [130 202]]


In [27]:
# Print a classification report

In [28]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.69      0.93      0.79       308
         pos       0.91      0.61      0.73       332

    accuracy                           0.76       640
   macro avg       0.80      0.77      0.76       640
weighted avg       0.80      0.76      0.76       640



In [30]:
# Print the overall accuracy

In [32]:
print(metrics.accuracy_score(y_test,predictions))

0.7640625


Naïve Bayes gave us better-than-chance results: 76.4% accuracy at classifying reviews as positive or negative based on text alone, against roughly 50% for guessing on these balanced classes.
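
The figures in the classification report can be recomputed from the confusion matrix alone. The sketch below does that by hand, assuming rows are true labels and columns are predicted labels, in the order `neg`, `pos`.

```python
# Naïve Bayes confusion matrix from above: rows = true label, columns = predicted label.
cm = [[287, 21],
      [130, 202]]
labels = ['neg', 'pos']

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(2)) / total

report = {}
for i, name in enumerate(labels):
    predicted_as = sum(cm[r][i] for r in range(2))  # column sum: everything predicted as this class
    actually = sum(cm[i])                           # row sum: everything truly in this class
    precision = cm[i][i] / predicted_as
    recall = cm[i][i] / actually
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (round(precision, 2), round(recall, 2), round(f1, 2))

print(report)
print(accuracy)
```

Read this way, the weak spot is the recall for `pos`: 130 of the 332 positive reviews were labelled negative.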

#### For SVC

In [34]:
text_clf_lsvc.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
               

### Predictions

In [35]:
predictions = text_clf_lsvc.predict(X_test) # Form a prediction set

### Let's analyze the results 

In [36]:
# Report the confusion matrix

In [37]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[259  49]
 [ 49 283]]


In [39]:
# Print a classification report

In [38]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.84      0.84      0.84       308
         pos       0.85      0.85      0.85       332

    accuracy                           0.85       640
   macro avg       0.85      0.85      0.85       640
weighted avg       0.85      0.85      0.85       640



In [40]:
# Print the overall accuracy

In [41]:
print(metrics.accuracy_score(y_test,predictions))

0.846875


Based on text alone we correctly classified reviews as positive or negative 84.7% of the time. 

By default, CountVectorizer and TfidfVectorizer do not filter stopwords. However, they offer some optional settings, including passing in your own stopword list.

To try to improve the accuracy score, we'll pass in our own stopword list, so that we know exactly what's being filtered.
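
To illustrate what stopword filtering does to a single document, here is a rough standard-library approximation of the vectorizer's tokenizing step. The `tokenize` helper is hypothetical. It reuses the `token_pattern` shown in the pipeline output above and is a sketch, not scikit-learn's actual implementation.

```python
import re

# Default token_pattern from the pipeline repr: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc, stop_words=()):
    """Lowercase, split into tokens of 2+ word characters, then drop stopwords."""
    stop_set = set(stop_words)
    tokens = TOKEN_PATTERN.findall(doc.lower())
    return [tok for tok in tokens if tok not in stop_set]

sample = "I saw a movie and it was great"
print(tokenize(sample))
print(tokenize(sample, stop_words=['a', 'i', 'and', 'it', 'was']))
```

Under this pattern, one-letter words such as `a` and `i` are dropped before the stopword list is consulted, so listing them has no effect.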

In [42]:
# Scikit-learn's built-in list contains 318 stopwords:

In [49]:
from sklearn.feature_extraction import text
print(text.ENGLISH_STOP_WORDS)

frozenset({'hundred', 'too', 'will', 'two', 'forty', 'us', 'whereby', 'other', 'out', 'so', 'only', 'being', 'sixty', 'who', 'me', 'name', 'three', 'part', 'these', 'up', 'ours', 'nine', 'can', 'or', 'are', 'namely', 'cannot', 'which', 'besides', 'if', 'thru', 'become', 'whether', 'see', 'itself', 'still', 'hereafter', 'there', 'my', 'ourselves', 'hasnt', 'something', 'further', 'whom', 'yourself', 'but', 'moreover', 'system', 'de', 'without', 'yourselves', 'call', 'fire', 'etc', 'hereupon', 'well', 'formerly', 'except', 'have', 'interest', 'elsewhere', 'many', 'few', 'sometimes', 'is', 'side', 'co', 'describe', 'all', 'among', 'bottom', 'during', 'nor', 'cant', 'anyhow', 'amongst', 'very', 'upon', 'around', 'whoever', 'whence', 'empty', 'into', 'here', 'thereby', 'its', 'eight', 'sincere', 'several', 'yours', 'an', 'take', 'full', 'done', 'their', 'same', 'latter', 'afterwards', 'been', 'before', 'always', 'move', 'nobody', 'would', 'those', 'hereby', 'last', 'somehow', 'meanwhile', '

### Let's trim the list to just 60 words 

In [50]:
stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can',
             'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his',
             'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or',
             'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this',
             'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']

In [51]:
len(stopwords)

60

### Add stopwords to the linear SVC pipeline

In [57]:
text_clf_lsvc2 = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                     ('clf', LinearSVC()),
])
text_clf_lsvc2.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=['a', 'about', 'an', 'and', 'are',
                                             'as', 'at', 'be', 'been', 'but',...
                                             'how', 'i', 'if', 'in', 'into',
                                             'is', ...],
                                 strip_accents=None, sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer

In [58]:
predictions = text_clf_lsvc2.predict(X_test)
print(metrics.confusion_matrix(y_test,predictions))

[[256  52]
 [ 48 284]]


In [59]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.84      0.83      0.84       308
         pos       0.85      0.86      0.85       332

    accuracy                           0.84       640
   macro avg       0.84      0.84      0.84       640
weighted avg       0.84      0.84      0.84       640



In [60]:
print(metrics.accuracy_score(y_test,predictions))

0.84375


#### Our score didn't change that much. We went from 84.7% without filtering stopwords to 84.4% after adding a stopword filter to our pipeline.
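
To put that difference in concrete terms, the sketch below counts correct predictions in the two LinearSVC confusion matrices shown above. The gap comes down to two test reviews out of 640, which is small enough that it could plausibly change with a different train/test split.

```python
cm_without = [[259, 49], [49, 283]]   # LinearSVC, no stopword filter
cm_with = [[256, 52], [48, 284]]      # LinearSVC, 60-word stopword list

def correct(cm):
    """Count correct predictions: the diagonal of a 2x2 confusion matrix."""
    return cm[0][0] + cm[1][1]

total = sum(sum(row) for row in cm_without)

print(correct(cm_without), correct(cm_with), total)
print(correct(cm_without) / total, correct(cm_with) / total)
```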

### Let's write our own review, and see how accurately our model assigns a "positive" or "negative" label to it.

In [76]:
myreview = "A movie I really wanted to love was terrible. \
I'm sure the producers had the best intentions, but the execution was lacking."
myreview1 = "They made moovies nowadays just to earn money"
myreview2 = "They moovies were really nice and exiting"

In [77]:
print(text_clf_lsvc2.predict([myreview]))

['neg']


In [78]:
print(text_clf_nb.predict([myreview]))

['neg']


In [79]:
print(text_clf_lsvc2.predict([myreview1]))

['neg']


In [80]:
print(text_clf_nb.predict([myreview]))  # Note: this scores myreview again, not myreview1

['neg']


In [81]:
print(text_clf_lsvc2.predict([myreview2]))

['pos']


In [82]:
print(text_clf_nb.predict([myreview2]))

['pos']
