# Text Classification

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. Labels are given as `pos` and `neg`.

Of these, 20 reviews either contain `NaN` data or consist entirely of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Import the libraries and load the dataset into a pandas DataFrame

In [1]:
import pandas as pd
import numpy as np
df=pd.read_csv('moviereviews2.tsv',sep='\t')
df.head()

Unnamed: 0,label,review
0,pos,I loved this movie and will watch it again. Or...
1,pos,"A warm, touching movie that has a fantasy-like..."
2,pos,I was not expecting the powerful filmmaking ex...
3,neg,"This so-called ""documentary"" tries to tell tha..."
4,pos,This show has been my escape from reality for ...


### Check for missing values:

In [2]:
# Check for NaN values:
df.isnull().sum()

label      0
review    20
dtype: int64

In [3]:
# Check for whitespace strings.
# NaN rows are dropped first: a missing review is a float, which has no .isspace() method.
df.dropna(inplace=True)
blanks=[i for i,lbl,rev in df.itertuples() if rev.isspace()]
blanks

[]
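The same check can also be written without an explicit loop. The sketch below is a hedged alternative, not the notebook's own method: it runs on a small made-up frame (`toy` and its four rows are assumptions for illustration) and flags missing, empty, and whitespace-only reviews in a single pass.

```python
import pandas as pd

# Hypothetical toy data standing in for the real reviews
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film.', '   ', None, '\t\n'],
})

# fillna('') turns missing reviews into empty strings, so one comparison
# catches missing, empty, and whitespace-only entries alike
is_blank = toy['review'].fillna('').str.strip() == ''
blank_idx = toy.index[is_blank].tolist()
print(blank_idx)    # [1, 2, 3]

# Keep only the usable rows
clean = toy[~is_blank]
print(clean.shape)  # (1, 2)
```

Applied to the real `df`, an empty `blank_idx` would agree with the empty `blanks` list above.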

### Remove NaN values:

In [4]:
# The NaN rows were already dropped in the previous cell, so this call is a no-op
df.dropna(inplace=True)

### Quick look at the `label` column:

In [5]:
print(df['label'].value_counts())
print(df.shape)

pos    2990
neg    2990
Name: label, dtype: int64
(5980, 2)


### Split the data into train & test sets:

In [6]:
from sklearn.model_selection import train_test_split
X=df['review']
y=df['label']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=42)
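As a quick sanity check on the split sizes, the arithmetic can be done by hand. The sketch assumes that scikit-learn rounds a fractional `test_size` up to a whole number of rows. That assumption is consistent with the total support of 1974 in the classification report further down.

```python
import math

n_samples = 5980   # rows left after dropping the 20 NaN reviews
test_size = 0.33   # fraction passed to train_test_split

# Assumption: a fractional test_size is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)  # 1974 4006
```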

### Build a pipeline to vectorize the data, then fit a `LinearSVC` model

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

clf=Pipeline([('tfidf',TfidfVectorizer(stop_words='english')),('clf',LinearSVC())])

clf.fit(X_train,y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer(stop_words='english')),
                ('clf', LinearSVC())])

### Run predictions and analyze the results

In [8]:
# Form a prediction set
predictions=clf.predict(X_test)

In [9]:
# Report the confusion matrix
from sklearn import metrics

metrics.confusion_matrix(y_test,predictions)


array([[895,  96],
       [ 45, 938]], dtype=int64)

In [10]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.95      0.90      0.93       991
         pos       0.91      0.95      0.93       983

    accuracy                           0.93      1974
   macro avg       0.93      0.93      0.93      1974
weighted avg       0.93      0.93      0.93      1974



In [11]:
# Print the overall accuracy
metrics.accuracy_score(y_test,predictions)

0.9285714285714286
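The figures in the classification report and the overall accuracy can be reproduced by hand from the confusion matrix. In scikit-learn's layout the rows are true labels and the columns are predicted labels, with labels in sorted order (`neg`, then `pos`). The following is a minimal plain-Python sketch of that arithmetic; the names `cm` and `report` are illustrative only.

```python
# Confusion matrix from the cell above: rows are true labels,
# columns are predicted labels, both in the order (neg, pos)
cm = [[895, 96],
      [45, 938]]

labels = ['neg', 'pos']
total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total  # correct predictions / all predictions

report = {}
for k, name in enumerate(labels):
    true_hits = cm[k][k]
    predicted_as_k = cm[0][k] + cm[1][k]  # column sum
    actually_k = sum(cm[k])               # row sum (the support)
    precision = true_hits / predicted_as_k
    recall = true_hits / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (round(precision, 2), round(recall, 2), round(f1, 2))

print(report)              # {'neg': (0.95, 0.9, 0.93), 'pos': (0.91, 0.95, 0.93)}
print(round(accuracy, 4))  # 0.9286
```

Read this way, the 96 in the top row counts negative reviews that were predicted positive, and the 45 in the bottom row counts positive reviews that were predicted negative.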

### Now that our model is ready, we can use it to label movie reviews as positive or negative

In [23]:
my_neg_review='The movie was such a waste of time.'
clf.predict([my_neg_review])

array(['neg'], dtype=object)

In [24]:
my_pos_review='The movie was amazing.'
clf.predict([my_pos_review])

array(['pos'], dtype=object)
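Because the pipeline accepts a list of raw strings, several reviews can be labelled in one call. `decision_function` additionally returns a signed distance from the decision boundary, which can serve as a rough confidence. The sketch below is self-contained, so it fits a stand-in pipeline on four made-up sentences rather than on the real training set. The sentences and the names `toy_clf` and `new_reviews` are illustrative assumptions, and its scores say nothing about the model trained above.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up training sentences, only so that the example runs on its own
toy_texts = ['I loved this wonderful film',
             'A brilliant and touching story',
             'Terrible plot and awful acting',
             'A boring waste of time']
toy_labels = ['pos', 'pos', 'neg', 'neg']

toy_clf = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                    ('clf', LinearSVC())])
toy_clf.fit(toy_texts, toy_labels)

# Label several reviews at once and fetch their signed scores
new_reviews = ['An awful and boring film', 'A wonderful story']
preds = toy_clf.predict(new_reviews)
scores = toy_clf.decision_function(new_reviews)

# A positive score favours classes_[1] ('pos'), a negative one classes_[0] ('neg')
for text, label, score in zip(new_reviews, preds, scores):
    print(f'{label}  {score:+.2f}  {text}')
```

With the real `clf`, the same two calls work unchanged on any list of review strings.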