___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Text Classification Assessment
This assessment closely follows the Text Classification Project we just completed, and it uses a very similar dataset.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist only of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load the tab-delimited file using the relative path given in the task
df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\t')
df.head()

,label,review
0,pos,I loved this movie and will watch it again. Or...
1,pos,"A warm, touching movie that has a fantasy-like..."
2,pos,I was not expecting the powerful filmmaking ex...
3,neg,"This so-called ""documentary"" tries to tell tha..."
4,pos,This show has been my escape from reality for ...


### Task #2: Check for missing values:

In [2]:
# Check for NaN values:
df.isnull().sum()

label      0
review    20
dtype: int64

In [3]:
my_st = 'hello'

my_st.isspace()

False

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   6000 non-null   object
 1   review  5980 non-null   object
dtypes: object(2)
memory usage: 93.9+ KB


In [5]:
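# Check for whitespace strings (it's OK if there aren't any!):

# Collect the index of every review that is a whitespace-only string
blanks = []

# Each tuple is (index, label, review text)
for i, lb, rv in df.itertuples():
    # NaN reviews are floats, so only test real strings
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)

blanks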

[]
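
The loop above can also be written without explicit iteration. The following is a minimal sketch on a made-up toy frame (`toy`, `is_blank`, and `blank_idx` are illustrative names, not part of the assessment), assuming the `review` column holds strings and `NaN` values as it does here.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the reviews data (values are made up)
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["Great film", "   ", np.nan, "\t\n"],
})

# .str.isspace() is True for whitespace-only strings; comparing with True
# turns any missing result into False so NaN rows are not flagged as blank
is_blank = toy["review"].str.isspace().eq(True)

# Index labels of the whitespace-only rows
blank_idx = toy.index[is_blank].tolist()
print(blank_idx)
```

On the toy frame this flags rows 1 and 3 and leaves the `NaN` row for `dropna()` to handle. An empty list, as in the output above, means there is nothing to drop beyond the `NaN` rows.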

### Task #3: Remove NaN values:

In [6]:
df.dropna(inplace=True)

df.isnull().sum()

label     0
review    0
dtype: int64

### Task #4: Take a quick look at the `label` column:

In [7]:
df['label'].value_counts()

neg    2990
pos    2990
Name: label, dtype: int64

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`

In [8]:
X = df['review']
y = df['label']

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.22, random_state=42)
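
Note that the cell above uses `test_size=0.22` rather than the suggested `0.33`, so the results below will not match the solution notebook exactly. As a rough sketch of what each setting means for the 5980 remaining rows, and assuming the test fraction is rounded up to a whole number of samples (which is consistent with the support of 1316 reported in the classification report), the split sizes work out as follows.

```python
import math

n_samples = 5980  # rows left after dropping the 20 NaN reviews

split_sizes = {}
for test_size in (0.22, 0.33):
    # Assumption: the test set size is the fraction rounded up
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    split_sizes[test_size] = (n_train, n_test)
    print(f"test_size={test_size}: train={n_train}, test={n_test}")
```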

### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [9]:
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])


text_clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
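
To see what the two pipeline steps hold after fitting, here is a small sketch on four made-up reviews (the `toy_*` names and the texts are illustrative only). It uses the same `TfidfVectorizer` and `LinearSVC` steps and looks them up by the names given in the pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus, only to illustrate the mechanics
toy_reviews = [
    "loved it, wonderful film",
    "terrible, boring plot",
    "wonderful acting",
    "boring and terrible",
]
toy_labels = ["pos", "neg", "pos", "neg"]

toy_clf = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
toy_clf.fit(toy_reviews, toy_labels)

# Each step is reachable by name once the pipeline is fitted
vocab_size = len(toy_clf.named_steps["tfidf"].vocabulary_)
coef_shape = toy_clf.named_steps["clf"].coef_.shape
print(vocab_size, coef_shape)
```

For a two-class problem the classifier learns one weight per vocabulary term, so the second dimension of `coef_` equals the vocabulary size. The same lookups work on `text_clf` from the cell above.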

### Task #7: Run predictions and analyze the results

In [10]:
# Form a prediction set
pred = text_clf.predict(X_test)

In [11]:
# Report the confusion matrix

print(confusion_matrix(y_test,pred))

[[592  61]
 [ 40 623]]


In [12]:
# Print a classification report

print(classification_report(y_test,pred))

              precision    recall  f1-score   support

         neg       0.94      0.91      0.92       653
         pos       0.91      0.94      0.93       663

    accuracy                           0.92      1316
   macro avg       0.92      0.92      0.92      1316
weighted avg       0.92      0.92      0.92      1316



In [13]:
# Print the overall accuracy

print(accuracy_score(y_test,pred))

0.9232522796352584
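
The numbers in the classification report and the accuracy score can be recovered from the confusion matrix by hand. The sketch below assumes the usual scikit-learn layout (rows are true labels, columns are predictions, in sorted label order `neg`, `pos`) and treats `pos` as the positive class. The variable names are illustrative.

```python
# Confusion matrix as printed above
cm = [[592, 61],
      [40, 623]]

tn, fp = cm[0]  # true neg: predicted neg, predicted pos
fn, tp = cm[1]  # true pos: predicted neg, predicted pos

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

precision_pos = tp / (tp + fp)  # of all predicted pos, how many were pos
recall_pos = tp / (tp + fn)     # of all true pos, how many were found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy      = {accuracy:.4f}")
print(f"neg precision = {precision_neg:.2f}, recall = {recall_neg:.2f}")
print(f"pos precision = {precision_pos:.2f}, recall = {recall_pos:.2f}")
```

The row sums, 653 and 663, are the `support` values in the report, and the diagonal divided by the total gives the overall accuracy.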


## Great job!

<br>
_____________________________________________________________________________________________________________________________

In [15]:
import pickle
with open('sms_classification_nlp','rb') as f:
    msg = pickle.load(f)
    
msg.predict(["Hello, How are you jake?"])

array(['ham'], dtype=object)

In [16]:
msg.predict(["Chase Bank: Your account balance has dropped below $25. Sign in to view your accounts. \n https://xyz.ly/354dfd"])

array(['ham'], dtype=object)

In [17]:
msg.predict(["PCC Alert! Due to snow and ice across the metro area, all PCC capuses and centers are closed."])

array(['ham'], dtype=object)

In [18]:
msg.predict(["Dear Walmart shopper, your purchase last month won a %1000 Walmart Gift Card, go to www.xyz.com within 24 hours to claim. (NO2cancel)"])

array(['spam'], dtype=object)