___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Text Classification Assessment
This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist only of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

import warnings
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\t')

### Task #2: Check for missing values:

In [4]:
# Check size of df

df.shape

(6000, 2)

In [5]:
# Preview contents of df

df.head()

| | label | review |
|---|---|---|
| 0 | pos | I loved this movie and will watch it again. Or... |
| 1 | pos | A warm, touching movie that has a fantasy-like... |
| 2 | pos | I was not expecting the powerful filmmaking ex... |
| 3 | neg | This so-called "documentary" tries to tell tha... |
| 4 | pos | This show has been my escape from reality for ... |


In [6]:
# set_axis returns a new DataFrame; the `inplace` argument was removed in pandas 2.0
pd.concat(
    [df.label.value_counts(), df.label.value_counts(normalize=True)], axis=1
    ).set_axis(['label', 'percent'], axis=1)

| | label | percent |
|---|---|---|
| neg | 3000 | 0.5 |
| pos | 3000 | 0.5 |


In [7]:
# Check for NaN values:

df.isnull().sum()

label      0
review    20
dtype: int64

In [8]:
# Check for whitespace strings (it's OK if there aren't any!):

blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if isinstance(rv, str):      # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
        
print(len(blanks), 'blanks: ', blanks)

0 blanks:  []
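The loop above can also be written as a single boolean mask. The following is a minimal sketch on a small, made-up DataFrame (the `toy` data and the names `is_blank` and `blank_idx` are assumptions for illustration, not part of the assessment dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the reviews DataFrame (hypothetical data, not from moviereviews2.tsv)
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film', '   ', np.nan, 'Dull plot'],
})

# True only for rows whose review is a string made up entirely of whitespace;
# NaN values are not strings, so they are skipped just as in the loop version
is_blank = toy['review'].apply(lambda rv: isinstance(rv, str) and rv.isspace())

# Collect the index labels of the blank rows
blank_idx = toy.index[is_blank].tolist()
print(len(blank_idx), 'blanks: ', blank_idx)
```

If any blanks were found, `toy.drop(blank_idx)` would remove them in the same way `dropna` removes the `NaN` rows.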


### Task #3: Remove NaN values:

In [9]:
df.dropna(inplace=True)

len(df)

5980

### Task #4: Take a quick look at the `label` column:

In [10]:
# set_axis returns a new DataFrame; the `inplace` argument was removed in pandas 2.0
pd.concat(
    [df.label.value_counts(), df.label.value_counts(normalize=True)], axis=1
    ).set_axis(['label', 'percent'], axis=1)

| | label | percent |
|---|---|---|
| neg | 2990 | 0.5 |
| pos | 2990 | 0.5 |


<font color=green>**We need our model to predict more than this baseline of 50%.**</font>
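One way to make this baseline concrete is scikit-learn's `DummyClassifier`, which ignores the features and always predicts the most frequent class. The sketch below uses made-up, perfectly balanced labels (`X_toy` and `y_toy` are assumptions for illustration), so it should land on the same 50% figure:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Balanced toy labels mirroring the 50/50 split reported above (hypothetical data)
y_toy = np.array(['neg'] * 50 + ['pos'] * 50)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features; DummyClassifier ignores them

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(f'Baseline accuracy: {baseline_acc}')
```

Any real model is only useful to the extent that it beats this number.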


### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`.

In [11]:
X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
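As an optional variation, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits. The solution notebook does not use it; the sketch below only illustrates the setting on made-up balanced labels (`X_toy` and `y_toy` are assumptions), and shows that a fixed `random_state` makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy balanced labels (hypothetical data, not the movie reviews)
y_toy = np.array(['neg'] * 50 + ['pos'] * 50)
X_toy = np.arange(len(y_toy))

# stratify=y_toy keeps the neg/pos ratio (almost) identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy)

# The same random_state reproduces exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy)

n_neg = int((y_te == 'neg').sum())
n_pos = int((y_te == 'pos').sum())
print(f'test set: {n_neg} neg, {n_pos} pos; repeatable: {np.array_equal(X_te, X_te2)}')
```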

### Task #6: Build a pipeline to vectorize the data, then fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [12]:
# Create Pipeline for Naive Bayes
text_clf_nb = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

# Naive Bayes fit

text_clf_nb.fit(X_train, y_train)

# Create Pipeline for Linear SVC

text_clf_lsvc = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

# SVC fit

text_clf_lsvc.fit(X_train, y_train)

# Create Pipeline for Decision Tree

text_clf_dt = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', DecisionTreeClassifier(max_depth=5)),
])

# DT fit

text_clf_dt.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...      min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))])
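Because each pipeline bundles the vectorizer with the classifier, a fitted pipeline can be handed raw strings directly. The following is a minimal sketch on a handful of made-up reviews (`toy_reviews`, `toy_labels`, and `new_text` are assumptions for illustration; with so little data the predictions themselves are not meaningful):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny hypothetical training set, only to show the mechanics
toy_reviews = [
    'a wonderful, moving film', 'great acting and a great story',
    'boring and far too long', 'terrible script, awful acting',
]
toy_labels = ['pos', 'pos', 'neg', 'neg']

toy_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),  # raw text -> TF-IDF features
    ('clf', LinearSVC()),          # features -> pos/neg label
])
toy_clf.fit(toy_reviews, toy_labels)

# predict() takes raw text; the pipeline applies the fitted vectorizer first
new_text = ['what a great film', 'awful and boring']
toy_preds = toy_clf.predict(new_text)
print(list(zip(new_text, toy_preds)))

# Individual steps remain accessible by name
vocab = toy_clf.named_steps['tfidf'].vocabulary_
print(len(vocab), 'terms in the fitted vocabulary')
```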

### Task #7: Run predictions and analyze the results

In [13]:
# Form a prediction set

# Naive Bayes
predictions_nb = text_clf_nb.predict(X_test)

# Report the confusion matrix
print(confusion_matrix(y_test,predictions_nb))

# Print a classification report
print(classification_report(y_test,predictions_nb))

# Print the overall accuracy
print(f'Accuracy Score for Naive Bayes: {accuracy_score(y_test,predictions_nb)}.')

[[940  51]
 [136 847]]
              precision    recall  f1-score   support

         neg       0.87      0.95      0.91       991
         pos       0.94      0.86      0.90       983

   micro avg       0.91      0.91      0.91      1974
   macro avg       0.91      0.91      0.91      1974
weighted avg       0.91      0.91      0.91      1974

Accuracy Score for Naive Bayes: 0.9052684903748733.


In [37]:
# SVC
predictions_lsvc = text_clf_lsvc.predict(X_test)


# Report the confusion matrix
print(confusion_matrix(y_test,predictions_lsvc))

# Print a classification report
print(classification_report(y_test,predictions_lsvc))

# Print the overall accuracy
print(f'Accuracy Score for SVC: {accuracy_score(y_test,predictions_lsvc)}.')

[[900  91]
 [ 63 920]]
              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974

Accuracy Score for SVC: 0.9219858156028369.


In [41]:
# DT
predictions_dt = text_clf_dt.predict(X_test)

# Report the confusion matrix
print(confusion_matrix(y_test,predictions_dt))

# Print a classification report
print(classification_report(y_test,predictions_dt))

# Print the overall accuracy
print(f'Accuracy Score for Decision Tree: {accuracy_score(y_test,predictions_dt)}')

[[669 322]
 [184 799]]
              precision    recall  f1-score   support

         neg       0.78      0.68      0.73       991
         pos       0.71      0.81      0.76       983

    accuracy                           0.74      1974
   macro avg       0.75      0.74      0.74      1974
weighted avg       0.75      0.74      0.74      1974

Accuracy Score for Decision Tree: 0.7436676798378926
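
The accuracy scores can be cross-checked against the confusion matrices: accuracy is the sum of the diagonal (correct predictions) divided by the total number of test reviews. A minimal sketch, with the matrices copied by hand from the outputs above (the `matrices` and `accuracies` names are assumptions for illustration):

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: true neg/pos, columns: predicted neg/pos)
matrices = {
    'Naive Bayes':   np.array([[940,  51], [136, 847]]),
    'Linear SVC':    np.array([[900,  91], [ 63, 920]]),
    'Decision Tree': np.array([[669, 322], [184, 799]]),
}

# Accuracy = correct predictions (diagonal) / all predictions
accuracies = {name: np.trace(cm) / cm.sum() for name, cm in matrices.items()}

# Print the models from most to least accurate
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name:<14}{acc:.4f}')
```

On this split Linear SVC comes out ahead of Naive Bayes, with the depth-limited decision tree well behind both.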


## Great job!