___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Text Classification Assessment
This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist only of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [2]:
import numpy as np
import pandas as pd
df = pd.read_csv("./moviereviews2.tsv",sep='\t')

### Task #2: Check for missing values:

In [3]:
# Check for NaN values:
print("# missing values: \n", df.isnull().sum())
# Drop the NaN rows now so the whitespace check below only sees strings
df.dropna(inplace=True)

In [4]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks = []  # index labels of whitespace-only reviews
for i,lb,rv in df.itertuples():  # (index, label, review)
    if rv.isspace():
        blanks.append(i)

### Task #3: Remove NaN values:

In [None]:
# Drop the whitespace-only rows by index label (NaN rows were already dropped above)
df.drop(blanks,inplace=True)
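
The loop-based check above can also be written with pandas' vectorized string methods. Here is a minimal sketch on a made-up four-row DataFrame (the `toy` frame and its contents are hypothetical, chosen only to mirror the `label`/`review` layout of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one NaN review and one whitespace-only review
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["Loved it", np.nan, "   ", "Dull and slow"],
})

# Drop missing values first so the .str methods only see real strings
toy = toy.dropna()

# Boolean mask: True where the review is made up solely of whitespace
blank_mask = toy["review"].str.isspace()

# Keep only the rows whose review is not blank
toy = toy[~blank_mask]

print(toy)
```

Only the two rows with real review text should survive. Note that `isspace()` is `False` for an empty string, so truly empty reviews would need a separate check.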

### Task #4: Take a quick look at the `label` column:

In [5]:
df['label'].value_counts()

neg    2990
pos    2990
Name: label, dtype: int64

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`

In [6]:
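# Perform the train-test split
from sklearn.model_selection import train_test_split
X = df['review']
y = df['label']
# Note: test_size=0.3 is used here rather than the suggested 0.33,
# so the numbers will differ slightly from the solution notebook
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)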
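
To get a feel for what `test_size` and `random_state` do, here is a small sketch on ten made-up samples (the `demo_*` names and data are hypothetical). It also passes `stratify=`, an optional argument that is not used in this notebook, to keep the class proportions as even as possible across the two sets:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten "reviews", five per class
demo_X = [f"review {i}" for i in range(10)]
demo_y = ["pos"] * 5 + ["neg"] * 5

# test_size=0.3 reserves 30% of the rows for testing;
# random_state fixes the shuffle so the split is repeatable;
# stratify=demo_y (optional) preserves the pos/neg balance as closely as possible
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=42, stratify=demo_y
)

print(len(demo_X_train), len(demo_X_test))
```

With ten samples and `test_size=0.3`, three rows go to the test set and seven to the training set.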


### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [7]:
#create a data pipeline to train and fit the model
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC #Linear support vector classifier

#pipeline input is a list of tuples specifying components
text_clf = Pipeline([('tfidf',TfidfVectorizer()),
                    ('clf',LinearSVC())])
text_clf.fit(X_train,y_train)


Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])
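
Because the vectorizer is the first step of the pipeline, the fitted object accepts raw strings directly. A minimal sketch on four made-up training sentences (the `toy_*` names and texts are hypothetical, and a model trained on so little data is only for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy training data
toy_texts = ["great film, loved it", "wonderful acting",
             "terrible plot", "boring and awful"]
toy_labels = ["pos", "pos", "neg", "neg"]

# Same two-step structure as above: TF-IDF features, then a linear SVM
toy_clf = Pipeline([("tfidf", TfidfVectorizer()),
                    ("clf", LinearSVC())])
toy_clf.fit(toy_texts, toy_labels)

# predict() takes raw text; the pipeline vectorizes it with the fitted vocabulary
toy_preds = toy_clf.predict(["loved the acting", "awful and boring"])
print(toy_preds)
```

Each fitted step can be inspected through `named_steps`, for example `toy_clf.named_steps["tfidf"].vocabulary_`.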

### Task #7: Run predictions and analyze the results

In [8]:
# Form a prediction set
predictions = text_clf.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [9]:
# Report the confusion matrix
print(confusion_matrix(y_test,predictions))

[[821  78]
 [ 58 837]]


In [10]:
# Print a classification report
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       899
         pos       0.91      0.94      0.92       895

   micro avg       0.92      0.92      0.92      1794
   macro avg       0.92      0.92      0.92      1794
weighted avg       0.92      0.92      0.92      1794



In [11]:
# Print the overall accuracy
print(accuracy_score(y_test,predictions))

0.9241917502787068
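
The confusion matrix, classification report and accuracy score above are all views of the same four counts. As a rough sketch in plain Python, with the matrix values copied by hand from the output above and assuming scikit-learn's default layout (rows are true labels, columns are predicted labels, ordered `neg`, `pos`), the reported figures can be re-derived like this:

```python
# Confusion matrix copied from the output above
# rows = true label, columns = predicted label, order = ["neg", "pos"]
cm = [[821, 78],
      [58, 837]]
class_names = ["neg", "pos"]

total = sum(sum(row) for row in cm)
# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = (cm[0][0] + cm[1][1]) / total

metrics = {}
for i, name in enumerate(class_names):
    correct = cm[i][i]
    predicted_as_class = cm[0][i] + cm[1][i]  # column sum
    actually_class = sum(cm[i])               # row sum
    metrics[name] = {
        "precision": round(correct / predicted_as_class, 2),
        "recall": round(correct / actually_class, 2),
    }

print(metrics)
print(round(accuracy, 4))
```

The rounded values agree with the report: precision/recall of 0.93/0.91 for `neg` and 0.91/0.94 for `pos`, with an accuracy of about 0.9242 over 1794 test reviews.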


## Great job!