<div class="alert alert-block alert-success">
<font color=blue>

## Text Classification Steps

#### 1. Data Collection
- Import and load the dataset

#### 2. Data Exploration & Visualization
- Check missing values
- Draw visualizations

#### 3. Pre-processing
- Detect and remove null records.
- Detect and remove empty strings.
- Stopword removal, Stemming, Lemmatization.
- Split data into train/test

#### 4. Feature Engineering
- Vectorization (TF-IDF)

#### 5. Model Building
- Import - Import the Model.
- Instantiate - Create instance of the Model.
- Fit - Fit the model with training data.
- Predict - Generate predictions on the test data.

#### 6. Model Evaluation
- Confusion Matrix
- Classification Report
- Accuracy Score

#### 7. Model Prediction with new Data
- Feed new data to the model and check prediction.

</font>
</div>

<div class="alert alert-block alert-success">
<font color=blue>

### 0. Objective

</font>
</div>

- The objective is to build a model that predicts the sentiment of any input text.
- The model is trained on the "Movie Review" dataset.
- The Movie Review dataset contains the review text and a labeled sentiment (`neg` = negative, `pos` = positive).

<div class="alert alert-block alert-success">
<font color=blue>

### 1. Data Collection

</font>
</div>

##### 1.1 Import and load the dataset

In [117]:
import pandas as pd

df = pd.read_csv('~/data/moviereviews.tsv', sep='\t')
df.head()

|   | label | review |
|---|-------|--------|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


<div class="alert alert-block alert-success">
<font color=blue>

### 2. Data Exploration & Visualization

</font>
</div>

##### 2.1 Check missing values

In [118]:
df.shape

(2000, 2)

In [119]:
# Check for NaN values:
df.isnull().sum()

label      0
review    35
dtype: int64

In [120]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if isinstance(rv, str):      # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
        
print(len(blanks), 'blanks: ', blanks)

27 blanks:  [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]
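The same whitespace check can also be written without an explicit loop. The following is a minimal sketch on a hypothetical toy frame (`toy` and its contents are made up for illustration, not taken from the dataset); it assumes missing reviews should not count as blanks, so they are filled with an empty string first.

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's df.
toy = pd.DataFrame({
    "label": ["neg", "pos", "neg", "pos"],
    "review": ["dull plot", "   ", None, "great cast"],
})

# Fill missing reviews with "" ("".isspace() is False), then flag
# rows whose review consists only of whitespace.
is_blank = toy["review"].fillna("").str.isspace()

# Collect the index labels of the whitespace-only rows.
blank_idx = toy.index[is_blank].tolist()
print(len(blank_idx), "blanks:", blank_idx)
```

On the real data, the resulting index list could be passed to `df.drop(...)` in the same way as the `blanks` list built by the loop.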


In [121]:
# Check target variable
df['label'].value_counts()

pos    1000
neg    1000
Name: label, dtype: int64

##### 2.2 Visualization

< NA / TBD >

<div class="alert alert-block alert-success">
<font color=blue>

### 3. Pre-processing

</font>
</div>

##### 3.1 Detect and remove null records.

In [122]:
df.dropna(inplace=True)

len(df)

1965

##### 3.2 Detect and remove empty strings.

In [123]:
blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if isinstance(rv, str):      # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
        
print(len(blanks), 'blanks: ', blanks)

27 blanks:  [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]


In [124]:
df.drop(blanks, inplace=True)

len(df)

1938

In [125]:
df['label'].value_counts()

pos    969
neg    969
Name: label, dtype: int64

##### 3.3 Stopword removal, Stemming, Lemmatization.

< NA / TBD >
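This step is not implemented in this notebook. As one possible starting point for stopword removal only, scikit-learn's `TfidfVectorizer` accepts `stop_words="english"`, which drops the words in its built-in English list during tokenization. The sketch below uses two made-up sentences and is an assumption about how the step might be done, not part of the results reported later; stemming and lemmatization would need an additional library and are not shown.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical example sentences, not taken from the dataset.
docs = ["this is a great movie", "the plot was terrible"]

# Default vectorizer: keeps every token of two or more characters.
plain = TfidfVectorizer().fit(docs)

# Same vectorizer, but with scikit-learn's built-in English stop word list.
filtered = TfidfVectorizer(stop_words="english").fit(docs)

print(sorted(plain.vocabulary_))     # vocabulary without stopword removal
print(sorted(filtered.vocabulary_))  # vocabulary after stopword removal
```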

##### 3.4 Split data into train/test

In [126]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
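The split above is purely random, so the class proportions in the train and test sets can drift slightly from the 50/50 balance of the full data. An optional variation, not used for the results in this notebook, is to pass `stratify=y` so both sets keep the original class ratio. A minimal sketch on a hypothetical 12-row frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 6 "pos" and 6 "neg" rows.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(12)],
    "label": ["pos", "neg"] * 6,
})

# stratify= keeps the label proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["review"], toy["label"],
    test_size=0.33, random_state=42, stratify=toy["label"],
)

print(y_tr.value_counts().to_dict())  # class counts in the training split
print(y_te.value_counts().to_dict())  # class counts in the test split
```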

<div class="alert alert-block alert-success">
<font color=blue>

### 4. Feature Engineering

</font>
</div>

#### 4.1 Vectorization (TF-IDF)

In [129]:
# This step is just to show how TF-IDF works.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train) # remember to use the original X_train set
X_train_tfidf.shape

(1298, 33307)

The result is a sparse matrix with 1298 rows (one per training review) and 33307 columns (one per vocabulary term).

**Note -** This step only illustrates how TF-IDF works. Because the vectorizer is already included in the Pipeline below (step 5.2), it can be skipped.
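To make the shape of the TF-IDF matrix easier to interpret, here is a small sketch on a hypothetical three-document corpus. It assumes the default `TfidfVectorizer` settings shown in this notebook (lowercasing, tokens of at least two characters, L2-normalized rows).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, not taken from the dataset.
docs = ["a great great movie", "a terrible movie", "great acting"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse matrix: one row per document

# The default token pattern ignores one-character tokens such as "a",
# so the vocabulary has four terms.
print(X.shape)                 # (3, 4)
print(sorted(vec.vocabulary_)) # ['acting', 'great', 'movie', 'terrible']

# With norm='l2' (the default), every row has unit Euclidean length.
row_norms = np.sqrt(X.multiply(X).sum(axis=1))
print(np.allclose(row_norms, 1.0))  # True
```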

<div class="alert alert-block alert-success">
<font color=blue>

### 5. Model Building

</font>
</div>

We will compare two algorithms:
- ***Naive Bayes (MultinomialNB)***
- ***Support Vector Machine (LinearSVC)***

##### 5.1 Import - Import the Model.

In [130]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

##### 5.2 Instantiate - Create instance of the Model.

Create Text Classifiers using scikit-learn pipeline.

In [107]:
# Naïve Bayes:
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                        ('clf', MultinomialNB()),
])

# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),
                          ('clf', LinearSVC()),
])
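A pipeline chains the vectorizer and the classifier so that raw text can be passed directly to `fit` and `predict`. The sketch below, on hypothetical toy sentences, illustrates that a pipeline gives the same predictions as running the two steps by hand; the variable names are made up for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not the movie review dataset.
train_texts = ["loved it", "great film", "hated it", "awful film"]
train_labels = ["pos", "pos", "neg", "neg"]
new_texts = ["great great film", "awful awful film"]

# Pipeline: vectorize and classify in one object.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

# Manual equivalent: fit the vectorizer on the training text only,
# then reuse it (transform, not fit_transform) on the new text.
vec = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(train_texts), train_labels)
manual_pred = clf.predict(vec.transform(new_texts))

print(list(pipe_pred) == list(manual_pred))  # True
```

The practical benefit is that the pipeline cannot accidentally refit the vectorizer on test data, since `predict` only ever calls `transform` on the fitted steps.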

##### 5.3 Fit - Fit the model with training data.

**Naïve Bayes:**

In [108]:
text_clf_nb.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

**LinearSVC:**

In [132]:
text_clf_lsvc.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
               

##### 5.4 Predict - Generate predictions on the test data.

**Naïve Bayes:**

In [133]:
# Form a prediction set
predictions_nb = text_clf_nb.predict(X_test)

**LinearSVC:**

In [137]:
# Form a prediction set
predictions_svc = text_clf_lsvc.predict(X_test)

<div class="alert alert-block alert-success">
<font color=blue>

### 6. Model Evaluation

</font>
</div>

##### 6.1 Confusion Matrix

**Naïve Bayes:**

In [138]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions_nb))

[[287  21]
 [130 202]]


**LinearSVC:**

In [139]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions_svc))

[[259  49]
 [ 49 283]]
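In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels, in sorted label order (`neg`, then `pos`). As a hand check, the sketch below recomputes per-class precision, recall, and overall accuracy from the two matrices printed above; `summarize` is a helper name introduced here for illustration.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: true, cols: predicted).
cm_nb = np.array([[287, 21], [130, 202]])
cm_svc = np.array([[259, 49], [49, 283]])

def summarize(cm):
    """Return (precision, recall, accuracy) from a confusion matrix."""
    correct = np.diag(cm)
    precision = correct / cm.sum(axis=0)  # correct / all predicted as that class
    recall = correct / cm.sum(axis=1)     # correct / all truly in that class
    accuracy = correct.sum() / cm.sum()
    return precision, recall, accuracy

for name, cm in [("Naive Bayes", cm_nb), ("LinearSVC", cm_svc)]:
    precision, recall, accuracy = summarize(cm)
    print(name, precision.round(2), recall.round(2), round(accuracy, 7))
```

For Naive Bayes this gives a `pos` recall of 202/332, about 0.61: the model labels 130 of the 332 positive test reviews as negative.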


##### 6.2 Classification Report

**Naïve Bayes:**

In [140]:
# Print a classification report
print(metrics.classification_report(y_test,predictions_nb))

              precision    recall  f1-score   support

         neg       0.69      0.93      0.79       308
         pos       0.91      0.61      0.73       332

    accuracy                           0.76       640
   macro avg       0.80      0.77      0.76       640
weighted avg       0.80      0.76      0.76       640



**LinearSVC:**

In [141]:
# Print a classification report
print(metrics.classification_report(y_test,predictions_svc))

              precision    recall  f1-score   support

         neg       0.84      0.84      0.84       308
         pos       0.85      0.85      0.85       332

    accuracy                           0.85       640
   macro avg       0.85      0.85      0.85       640
weighted avg       0.85      0.85      0.85       640



##### 6.3 Accuracy Score

**Naïve Bayes:**

In [142]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions_nb))

0.7640625


**LinearSVC:**

In [143]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions_svc))

0.846875


LinearSVC has higher test accuracy (about 85%) than Naive Bayes (about 76%). Naive Bayes in particular misses many positive reviews (recall of 0.61 for `pos`).

<div class="alert alert-block alert-success">
<font color=blue>

### 7. Model Prediction with new Data

</font>
</div>

- Feed new data to the model and check prediction.

In [144]:
myreview = "A movie I really wanted to love was terrible. \
I'm sure the producers had the best intentions, but the execution was lacking."

In [145]:
print(text_clf_nb.predict([myreview]))

['neg']


In [146]:
print(text_clf_lsvc.predict([myreview]))

['neg']
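Both models return only a label here. To see how confident each one is, `MultinomialNB` exposes `predict_proba` and `LinearSVC` exposes `decision_function`, and both can be called through the pipeline. The sketch below trains two small pipelines on hypothetical toy sentences (so it runs on its own); with the notebook's fitted `text_clf_nb` and `text_clf_lsvc`, the same two calls should apply.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data, not the movie review dataset.
texts = ["loved it", "great film", "wonderful acting",
         "hated it", "awful film", "terrible acting"]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

toy_nb = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
toy_svc = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
toy_nb.fit(texts, labels)
toy_svc.fit(texts, labels)

sample = ["a terrible film with awful acting"]

# Class probabilities; columns follow the order of toy_nb.classes_.
proba = toy_nb.predict_proba(sample)
print(toy_nb.classes_, proba)

# Signed distance to the separating hyperplane; a positive value
# means the second class in toy_svc.classes_ is predicted.
margin = toy_svc.decision_function(sample)
print(toy_svc.classes_, margin)
```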


In [None]:
myreview = "Last chance to save 20% Ends Tonight - West Keynote Speakers Announced."