___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Text Classification Assessment
This assessment closely follows the Text Classification Project we just completed, and the dataset is very similar.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist only of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [28]:
import pandas as pd
import numpy as np

df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\t')
len(df)

6000

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 2 columns):
label     6000 non-null object
review    5980 non-null object
dtypes: object(2)
memory usage: 93.8+ KB


In [23]:
df.head()

|   | label | review |
|---|-------|--------|
| 0 | pos | I loved this movie and will watch it again. Or... |
| 1 | pos | A warm, touching movie that has a fantasy-like... |
| 2 | pos | I was not expecting the powerful filmmaking ex... |
| 3 | neg | This so-called "documentary" tries to tell tha... |
| 4 | pos | This show has been my escape from reality for ... |


### Task #2: Check for missing values:

In [24]:
# Check for NaN values:
df.isnull().sum()

label      0
review    20
dtype: int64

In [25]:
# Check for whitespace strings (it's OK if there aren't any!)
blanks = []

for idx, lbl, rvw in df.itertuples():
    # isinstance() skips NaN values, which are floats rather than strings
    if isinstance(rvw, str) and rvw.isspace():
        blanks.append(idx)

blanks

[]
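
The same check can also be written with pandas' vectorized string methods instead of an explicit loop. The sketch below runs on a small made-up DataFrame (the rows are illustrative and not taken from moviereviews2.tsv) so that it has at least one whitespace-only review to find.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows, not from the real dataset)
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film', '   ', np.nan, 'Dull'],
})

# .str.isspace() may yield a missing value for NaN rows;
# comparing with True turns anything that is not True into False
is_blank = toy['review'].str.isspace().eq(True)

# Index labels of the whitespace-only reviews
blank_idx = toy[is_blank].index.tolist()
print(blank_idx)
```

On the real data, the same mask could be applied to `df['review']`, and `df.drop(blank_idx, inplace=True)` would remove any rows it finds.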

### Task #3: Remove NaN values:

In [29]:
df.dropna(inplace=True)
len(df)

5980

### Task #4: Take a quick look at the `label` column:

In [30]:
df['label'].value_counts()

pos    2990
neg    2990
Name: label, dtype: int64

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`.

In [33]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
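
As a quick sanity check on what `test_size=0.33` means for 5980 rows, the arithmetic below estimates the size of each split. It assumes the test share is rounded up to a whole number of samples and that the remainder goes to training. That assumption is consistent with the 1974 test samples shown as support in the classification report further down.

```python
import math

n_samples = 5980   # rows left after dropping the NaN reviews
test_size = 0.33   # fraction requested in train_test_split

# Assumption: the test share is rounded up, the remainder is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

You can confirm the numbers on your own split with `len(X_train)` and `len(X_test)`.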


### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [42]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

# Linear SVC pipeline (the model used in the solution notebook)
text_clf_lsvc = Pipeline([('tfidf_vect', TfidfVectorizer()), ('clf', LinearSVC())])

# Multinomial Naive Bayes pipeline, built as a second model for comparison
text_clf_mnb = Pipeline([('tfidf_vect', TfidfVectorizer()), ('clf', MultinomialNB())])

In [44]:
text_clf_lsvc.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf_vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [45]:
text_clf_mnb.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf_vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=...rue,
        vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
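
Because the pipeline bundles the vectorizer with the classifier, raw strings can be passed straight to `predict` without a separate transform step. The sketch below illustrates this on a tiny made-up corpus. The reviews, labels and the `toy_` names are hypothetical, and a model trained on six sentences says nothing about the accuracy of the real one.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up corpus (hypothetical, for illustration only)
toy_reviews = [
    'a wonderful and touching film',
    'great acting and a wonderful story',
    'loved every minute of this great movie',
    'a boring and terrible film',
    'terrible acting and a dull story',
    'hated every minute of this boring movie',
]
toy_labels = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']

# Same two-step structure as the pipelines above
toy_clf = Pipeline([('tfidf_vect', TfidfVectorizer()), ('clf', LinearSVC())])
toy_clf.fit(toy_reviews, toy_labels)

# Raw text goes in; the pipeline vectorizes it before classifying
new_reviews = ['what a wonderful movie', 'dull and boring']
toy_predictions = toy_clf.predict(new_reviews)
print(toy_predictions)
```

The pipelines fitted on the real training data can be called the same way, for example `text_clf_lsvc.predict(['some new review text'])`.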

### Task #7: Run predictions and analyze the results

In [48]:
# Form a prediction set
predictions_lsvm = text_clf_lsvc.predict(X_test)
predictions_mnb = text_clf_mnb.predict(X_test)

In [51]:
# Report the confusion matrix
from sklearn import metrics

print('Linear Support Vector Machine')
print(metrics.confusion_matrix(y_test, predictions_lsvm))
print('\n')
print('Multinomial Naive Bayes')
print(metrics.confusion_matrix(y_test, predictions_mnb))

Linear Support Vector Machine
[[900  91]
 [ 63 920]]


Multinomial Naive Bayes
[[940  51]
 [136 847]]


In [52]:
# Print a classification report
print('Linear Support Vector Machine')
print(metrics.classification_report(y_test, predictions_lsvm))
print('\n')
print('Multinomial Naive Bayes')
print(metrics.classification_report(y_test, predictions_mnb))

Linear Support Vector Machine
              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

   micro avg       0.92      0.92      0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974



Multinomial Naive Bayes
              precision    recall  f1-score   support

         neg       0.87      0.95      0.91       991
         pos       0.94      0.86      0.90       983

   micro avg       0.91      0.91      0.91      1974
   macro avg       0.91      0.91      0.91      1974
weighted avg       0.91      0.91      0.91      1974



In [53]:
# Print the overall accuracy
print('Linear Support Vector Machine')
print(metrics.accuracy_score(y_test, predictions_lsvm))
print('\n')
print('Multinomial Naive Bayes')
print(metrics.accuracy_score(y_test, predictions_mnb))

Linear Support Vector Machine
0.9219858156028369


Multinomial Naive Bayes
0.9052684903748733
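
The confusion matrix, classification report and accuracy score above are three views of the same counts. As a worked check, the sketch below recomputes the Linear SVC figures by hand from its confusion matrix, where rows are true labels and columns are predicted labels, both in the order `neg`, `pos`. The variable names are illustrative.

```python
# Linear SVC confusion matrix from above: rows = true, columns = predicted, order [neg, pos]
cm = [[900, 91],
      [63, 920]]

total = sum(sum(row) for row in cm)

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision divides by the column total (everything predicted as that class)
precision_neg = cm[0][0] / (cm[0][0] + cm[1][0])
precision_pos = cm[1][1] / (cm[1][1] + cm[0][1])

# Recall divides by the row total (everything that truly is that class)
recall_neg = cm[0][0] / (cm[0][0] + cm[0][1])
recall_pos = cm[1][1] / (cm[1][1] + cm[1][0])

print(accuracy)
print(round(precision_neg, 2), round(recall_neg, 2))
print(round(precision_pos, 2), round(recall_pos, 2))
```

The rounded values agree with the `neg` and `pos` rows of the Linear SVC classification report, and the accuracy is 1820 correct predictions out of 1974. Substituting the Multinomial Naive Bayes matrix gives 1787 out of 1974.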


## Great job!