# IMDB Film Reviews: Sentiment Analysis

## Vectorization

##### Now that our data has been processed, it is ready to be vectorized.

__Vectorization__ is a technique used to convert text data into a numerical format that ML algorithms can understand. 
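
To make the idea concrete, here is a minimal sketch of TF-IDF vectorization on a tiny corpus. The documents and the names `train_docs`, `test_docs`, and `toy_vectorizer` are hypothetical and used only for illustration; they are not part of the IMDb data. The sketch also shows why the vocabulary should be learned from the training text only: words that appear only in the test text are simply ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not taken from the IMDb reviews)
train_docs = ["great film great cast", "bad film", "great plot"]
test_docs = ["bad plot twist"]  # "twist" never appears in train_docs

toy_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights, then encodes the documents
train_matrix = toy_vectorizer.fit_transform(train_docs)
# transform reuses the learned vocabulary; unseen words such as "twist" are dropped
test_matrix = toy_vectorizer.transform(test_docs)

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(train_matrix.shape, test_matrix.shape)
```

Each row is one document and each column is one vocabulary word, which is why the train and test matrices always share the same number of columns.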


In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split 

In [2]:
df = pd.read_csv('processed_IMDb_reviews.csv')
df.head()

Unnamed: 0,review,sentiment
0,one review mention watch oz episod youll hook ...,1
1,wonder littl product film techniqu unassum old...,1
2,thought wonder way spend time hot summer weeke...,1
3,basic there famili littl boy jake think there ...,0
4,petter mattei love time money visual stun film...,1


In [3]:
# Split into train and test sets
x = df['review']    # features: preprocessed review text
y = df['sentiment'] # labels: 0/1 sentiment
# The training split is used for training and validation (hyperparameter tuning using k-fold);
# the test split is held out for final evaluation.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.20, random_state=300)


In [4]:
print(f'X train shape: {x_train.shape}')
print(f'Y train shape: {y_train.shape}')
print(f'X test shape: {x_test.shape}')
print(f'Y test shape: {y_test.shape}')

X train shape: (39665,)
Y train shape: (39665,)
X test shape: (9917,)
Y test shape: (9917,)


In [5]:
vectorizer = TfidfVectorizer()
# fit the TF-IDF model on the training data only
# and transform the training data
x_train_vect = vectorizer.fit_transform(x_train)
# transform testing data
x_test_vect = vectorizer.transform(x_test)

In [6]:
print(f'X train vectorized shape:   {x_train_vect.get_shape()}')
print(f'X test vectorized shape:    {x_test_vect.get_shape()}')

X train vectorized shape:   (39665, 150335)
X test vectorized shape:    (9917, 150335)


### Let's do some initial modeling.
These models are only meant to give us a feel for the data (no hyperparameter tuning) and will be discarded, so we reuse the train/test split from above.
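
For contrast with these untuned baselines, the following is a rough sketch of what k-fold hyperparameter tuning on the training split could look like. It is an assumption about a possible later setup, not the method used in this notebook: the toy reviews, the labels, the `svc__C` grid, and `cv=3` are all hypothetical. Placing the vectorizer inside a `Pipeline` means it is refit on each training fold, so no vocabulary or IDF statistics leak in from the validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for x_train / y_train
toy_reviews = [
    "love great film", "great cast love plot", "wonder film great",
    "bad film hate", "hate plot bad cast", "terribl film bad",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Vectorizer and classifier are tuned and refit together on every fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", LinearSVC(random_state=1)),
])

# Illustrative grid over the regularization strength C (values are assumptions)
search = GridSearchCV(pipeline, param_grid={"svc__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(toy_reviews, toy_labels)
print(search.best_params_)
```

With the real data, `toy_reviews` and `toy_labels` would be replaced by `x_train` and `y_train`, and the test split would remain untouched until the final evaluation.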

In [7]:
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [8]:
svc = LinearSVC(random_state=1)
svc.fit(x_train_vect, y_train)
y_pred_svc = svc.predict(x_test_vect)
print(classification_report(y_test, y_pred_svc))

              precision    recall  f1-score   support

           0       0.89      0.87      0.88      4904
           1       0.88      0.90      0.88      5013

    accuracy                           0.88      9917
   macro avg       0.88      0.88      0.88      9917
weighted avg       0.88      0.88      0.88      9917



Support vector models typically require their features to be standardized. However, after doing some research, we found that this might not be necessary when the features have already been transformed using TF-IDF.

We nevertheless tried standardization to see whether it provided any benefit.
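
One reason TF-IDF features may not need standardizing is that `TfidfVectorizer` L2-normalizes every row by default (`norm='l2'`), so all documents already sit on a comparable scale. The sketch below checks this on a hypothetical toy corpus and also shows why `with_mean=False` is needed for sparse input: centering would destroy sparsity, so scikit-learn refuses to do it. The corpus and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy corpus (not taken from the IMDb reviews)
docs = ["great film great cast", "bad film", "great plot"]
tfidf = TfidfVectorizer().fit_transform(docs)  # sparse matrix, rows L2-normalized

# Euclidean length of each document vector before scaling
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

# Centering a sparse matrix is rejected by StandardScaler (with_mean=True is the default)
try:
    StandardScaler().fit(tfidf)
    centering_ok = True
except ValueError:
    centering_ok = False

# Scaling each column by its standard deviation works, but breaks the unit-length rows
scaled = StandardScaler(with_mean=False).fit_transform(tfidf)
scaled_norms = np.sqrt(np.asarray(scaled.multiply(scaled).sum(axis=1)).ravel())

print(row_norms)
print(centering_ok)
print(scaled_norms)
```

Dividing each column by its standard deviation inflates rare terms, which is one plausible way standardization could hurt a linear model on TF-IDF features.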

In [9]:
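scaler = StandardScaler(with_mean=False)  # with_mean=False keeps the sparse matrix sparse
scaler.fit(x_train_vect)                  # learn per-feature scale from the training data only
x_train_vect_SCALED = scaler.transform(x_train_vect)
x_test_vect_SCALED = scaler.transform(x_test_vect)

svc_scaled = LinearSVC(random_state=1)
svc_scaled.fit(x_train_vect_SCALED, y_train)
# Predict with the model trained on the scaled features.
# NOTE: the report recorded below was generated with svc.predict(...), i.e. the
# model trained on unscaled features; rerun this cell to refresh it.
y_pred_svc_scaled = svc_scaled.predict(x_test_vect_SCALED)
print(classification_report(y_test, y_pred_svc_scaled))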

              precision    recall  f1-score   support

           0       0.83      0.83      0.83      4904
           1       0.84      0.83      0.83      5013

    accuracy                           0.83      9917
   macro avg       0.83      0.83      0.83      9917
weighted avg       0.83      0.83      0.83      9917



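Ultimately, the report above is worse than the unscaled baseline (0.83 vs. 0.88 accuracy). Note, however, that those predictions came from `svc`, the model trained on unscaled features, rather than from `svc_scaled`, so the comparison is inconclusive until the cell is rerun with `svc_scaled.predict`.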

In [10]:
lr = LogisticRegression(max_iter=1000, solver='saga')
lr.fit(x_train_vect, y_train)
y_pred_lr = lr.predict(x_test_vect)
print(classification_report(y_test, y_pred_lr))

              precision    recall  f1-score   support

           0       0.89      0.88      0.88      4904
           1       0.88      0.89      0.89      5013

    accuracy                           0.89      9917
   macro avg       0.89      0.89      0.89      9917
weighted avg       0.89      0.89      0.89      9917

