# Text Classification Assessment
This assessment closely follows the Text Classification Project we just completed, and it uses a very similar dataset.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews: 3000 positive and 3000 negative. The text has been preprocessed and stored as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist only of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [None]:
import numpy as np
import pandas as pd

# The file is tab-delimited, so pass sep='\t'
data = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\t')

# Preview the first five rows
data.head()
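The reason Task #2 below needs two separate checks can be seen from how `read_csv` treats blank fields. The sketch below parses a small, made-up tab-delimited string (the three rows are illustrative, not taken from the dataset): an empty field is read as `NaN`, while a field containing only spaces survives as an ordinary string.

```python
import io

import pandas as pd

# Made-up rows standing in for moviereviews2.tsv (not real data).
raw = "label\treview\npos\tA great film.\nneg\t\npos\t   \n"

toy = pd.read_csv(io.StringIO(raw), sep='\t')

# The empty field becomes NaN; the spaces-only field stays a string.
nan_count = int(toy['review'].isnull().sum())
blank_count = sum(isinstance(r, str) and r.isspace() for r in toy['review'])

print(nan_count, blank_count)
```

So `isnull()` alone would miss the whitespace-only reviews, and a string test alone would miss the `NaN` rows.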







### Task #2: Check for missing values:

In [None]:
# Check for NaN values:
data.isnull().sum()

In [None]:
# Check for whitespace strings (it's OK if there aren't any!):
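white_spaces = []

# Each row unpacks to (index, label, review)
for i, lb, rw in data.itertuples():
    # NaN reviews are floats, so only test real strings
    if isinstance(rw, str) and rw.isspace():
        white_spaces.append(i)

len(white_spaces)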
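The same check can also be written without an explicit loop. This is a minimal sketch, assuming the review text lives in a column named `review` as above; the toy rows are made up for illustration. Note that `str.strip().eq('')` flags empty strings as well as whitespace-only ones, which is slightly broader than `isspace()`.

```python
import numpy as np
import pandas as pd

# Made-up rows: index 1 is whitespace-only, index 2 is missing.
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['A great film.', '   ', np.nan, 'Dull and far too long.'],
})

# Strip each review and compare with the empty string; NaN compares as False.
blank_mask = toy['review'].str.strip().eq('')
blank_index = toy.index[blank_mask].tolist()

print(blank_index)
```

The resulting index list can be passed to `drop` in the same way as the list built by the loop.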








### Task #3: Remove NaN values:

In [None]:
# Drop the whitespace-only reviews found in Task #2, then any rows containing NaN
data.drop(white_spaces, inplace=True)
data.dropna(inplace=True)

### Task #4: Take a quick look at the `label` column:

In [None]:
data['label'].value_counts()

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`.

In [None]:
from sklearn.model_selection import train_test_split

X = data['review']
y = data['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Confirm the sizes of the resulting train and test sets
print(X_train.shape, " ", y_train.shape)
print(X_test.shape, " ", y_test.shape)
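With a balanced dataset, a plain random split is usually close to balanced too, but that is not guaranteed. If you want both sets to keep the same `pos`/`neg` ratio, `train_test_split` accepts a `stratify` argument. A small sketch on made-up labels (an optional variation, not the setting used in the solution notebook):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Twelve made-up reviews, six per class.
X = [f"review {i}" for i in range(12)]
y = ['pos', 'neg'] * 6

# stratify=y keeps the class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print(Counter(y_train), Counter(y_test))
```

Adding `stratify` changes which rows land in each set, so results will no longer match the solution notebook exactly.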




### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

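# Chain TF-IDF vectorization and a linear SVM into one estimator
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

# Fit the whole pipeline on the raw training text
text_clf.fit(X_train, y_train)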
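Because the pipeline bundles the vectorizer with the classifier, a fitted pipeline can be given raw strings directly; there is no need to transform new text by hand. The sketch below fits the same two steps on a tiny made-up corpus (six invented sentences, far too few for a real model) purely to show the call pattern. The name `demo_clf` and the `random_state=0` setting are assumptions added for illustration and repeatability.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented training sentences; the real assessment uses X_train / y_train.
train_text = [
    "a wonderful and brilliant film",
    "great acting and a great story",
    "I loved this wonderful movie",
    "a dull and boring film",
    "terrible acting and an awful story",
    "I hated this boring movie",
]
train_labels = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']

demo_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC(random_state=0)),
])
demo_clf.fit(train_text, train_labels)

# Raw strings go straight in; the pipeline vectorizes them before predicting.
new_reviews = ["wonderful great brilliant", "boring awful terrible"]
demo_predictions = demo_clf.predict(new_reviews)

print(list(demo_predictions))
```

The same call works on the real model, for example `text_clf.predict(["some new review text"])`.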


### Task #7: Run predictions and analyze the results

In [None]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [None]:
# Report the confusion matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, predictions)
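The raw array is easier to read once you know its layout: in scikit-learn, rows are the true classes and columns are the predicted classes, ordered by sorted label (here `neg`, then `pos`). A short sketch with made-up labels that wraps the matrix in a labeled DataFrame; the row and column captions are illustrative:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, three reviews per class.
y_true = ['neg', 'neg', 'neg', 'pos', 'pos', 'pos']
y_pred = ['neg', 'neg', 'pos', 'pos', 'pos', 'neg']

# Fix the label order explicitly so rows and columns are unambiguous.
labels = ['neg', 'pos']
cm = confusion_matrix(y_true, y_pred, labels=labels)

cm_df = pd.DataFrame(
    cm,
    index=[f"true {name}" for name in labels],
    columns=[f"predicted {name}" for name in labels],
)
print(cm_df)
```

The diagonal holds the correct predictions; the off-diagonal cells are the two kinds of mistakes.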


In [None]:
# Print a classification report
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))
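Every number in the classification report can be derived from the confusion matrix. As a worked example with made-up counts (not results from this dataset), here is how precision, recall, and F1 for the `pos` class are computed:

```python
# Made-up counts for the 'pos' class.
true_pos = 40   # pos reviews predicted pos
false_pos = 10  # neg reviews predicted pos
false_neg = 20  # pos reviews predicted neg

# Precision: of everything predicted pos, how much really was pos?
precision = true_pos / (true_pos + false_pos)

# Recall: of all real pos reviews, how many did the model find?
recall = true_pos / (true_pos + false_neg)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```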

In [None]:
# Print the overall accuracy as a percentage
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, predictions) * 100)
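`accuracy_score` returns a fraction between 0 and 1, which is why the cell above multiplies by 100. It is simply the share of predictions that match the true label, as this sketch on made-up labels shows:

```python
from sklearn.metrics import accuracy_score

# Made-up labels: four of the six predictions are correct.
y_true = ['neg', 'neg', 'neg', 'pos', 'pos', 'pos']
y_pred = ['neg', 'neg', 'pos', 'pos', 'pos', 'neg']

# Count matches by hand and compare with scikit-learn's result.
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(f"{manual_accuracy * 100:.1f}%")
```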
 