# Import Dataset
- Let's import the preprocessed movie review dataset.

In [1]:
import pandas as pd

DATA_PATH = "Preprocessed/"     # Directory path where data is located
TRAIN_CLEAN_DATA = "clean_train_df.csv"

train_data = pd.read_csv(DATA_PATH + TRAIN_CLEAN_DATA)
reviews = list(train_data.review)  # preprocessed reviews
sentiments = list(train_data.sentiment) # corresponding sentiment label

# Vectorization

- Before it can be fed to a model, our text data has to be transformed into numeric vectors.
- We are going to use three different vectorization techniques: **"TF-IDF Vectorization"**, **"Count Vectorization"** and **"word2vec Vectorization"**.
- Let's define each vectorizer and compare their performance.

## TF-IDF

- TF-IDF stands for Term Frequency - Inverse Document Frequency.
- A term's TF-IDF value is large when the term appears frequently in one document but in few other documents. Commonly used words such as pronouns, which occur in almost every document, therefore get small values.
- We will use scikit-learn's `TfidfVectorizer` class.
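
As a rough illustration of the idea, the sketch below computes TF-IDF weights by hand on a made-up three-sentence corpus. It assumes the smoothed idf form `ln((1 + n) / (1 + df)) + 1`, which scikit-learn documents as its default, and it leaves out the L2 normalization that `TfidfVectorizer` also applies, so the numbers will not match the library output exactly. The corpus and the helper name `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the review dataset)
docs = [
    "it was a great movie",
    "it was a boring movie",
    "it was great fun",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return {term: tf * idf} for one document, using a smoothed idf."""
    tf = Counter(tokens)
    return {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}

weights = tfidf(tokenized[1])
print(round(weights["it"], 3))      # "it" appears in every document
print(round(weights["boring"], 3))  # "boring" appears in this document only
```

The word that occurs in every document ends up with a smaller weight than the word that is specific to one review, which is the behaviour described above.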

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define vectorizer instance
# min_df = 1 keeps every term that occurs at least once (same effect as 0)
vectorizer = TfidfVectorizer(min_df = 1, analyzer = "word", sublinear_tf = True,
                            ngram_range = (1,2), max_features = 500)

# Transform movie reviews
X_TFIDF = vectorizer.fit_transform(reviews)

Explanations for the arguments used above in `TfidfVectorizer`:

- `min_df`: ignore terms whose **document frequency** is strictly lower than the given threshold
- `analyzer`: whether the features should be made of word or character n-grams ("word" / "char" / "char_wb")
- `sublinear_tf`: apply sublinear tf (term frequency) scaling, i.e. replace tf with 1 + log(tf). This dampens the effect of terms with very high counts
- `ngram_range`: the lower and upper boundary of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n are used
- `max_features`: keep only the top `max_features` terms, ordered by term frequency across the corpus
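
To make `ngram_range` and `sublinear_tf` more concrete, here is a small sketch in plain Python. The helpers `word_ngrams` and `sublinear` are hypothetical names written for this illustration; they are not how scikit-learn implements these options internally.

```python
import math

def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def sublinear(tf):
    """Sublinear tf scaling: 1 + log(tf) for tf > 0, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

tokens = "not a good movie".split()
features = word_ngrams(tokens, 1, 2)   # unigrams and bigrams, like ngram_range = (1,2)
print(features)
print(sublinear(1), sublinear(100))    # a count of 100 is scaled down to about 5.6
```

With bigrams included, "not a" and "good movie" become features of their own, so some word order information survives vectorization.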

---

For additional information, check out the [official website](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) or [this site](https://chan-lab.tistory.com/27) (in Korean).

## Count Vectorizer

- Count vectorizer extracts features from text data by counting how many times each word occurs in a document.
- Although this method is simple and easy to implement, it can be less practical because frequently used but less meaningful words such as pronouns receive large values.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word", max_features=500)
X_countvect = vectorizer.fit_transform(reviews)

The hyperparameters are very similar to those of TF-IDF:

- `analyzer`: unit of analysis (words or characters)
- `max_features`: maximum number of vocabulary terms to keep, chosen by frequency across the corpus
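
The following sketch shows the same idea in plain Python: build a vocabulary limited to the most frequent terms, then turn each document into a vector of raw counts. The function names and the tiny corpus are assumptions made for illustration only.

```python
from collections import Counter

def fit_vocabulary(token_lists, max_features):
    """Keep the max_features terms with the highest total count in the corpus."""
    totals = Counter(t for tokens in token_lists for t in tokens)
    top = [term for term, _ in totals.most_common(max_features)]
    return {term: idx for idx, term in enumerate(sorted(top))}

def count_vector(tokens, vocabulary):
    """Map one document to a list of raw term counts."""
    vec = [0] * len(vocabulary)
    for t in tokens:
        if t in vocabulary:
            vec[vocabulary[t]] += 1
    return vec

# Hypothetical mini corpus
corpus = ["good good movie", "bad movie", "good plot bad acting"]
token_lists = [doc.split() for doc in corpus]

vocab = fit_vocabulary(token_lists, max_features=3)
X = [count_vector(tokens, vocab) for tokens in token_lists]
print(vocab)
print(X)
```

Words outside the kept vocabulary ("plot", "acting") are simply dropped, which is what limiting `max_features` implies.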

## word2vec

- word2vec is a prediction-based word representation method.
- It comes in two well-known variants, **CBOW (Continuous Bag of Words)** and **Skip-Gram**:
    - CBOW: predicts a specific word from its nearby words
    - Skip-Gram: predicts the nearby words from a specific word
- word2vec models are known to be **able to capture complex features** of human language, including relationships among words.
- Skip-Gram is generally reported to perform better than CBOW.
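
The difference between the two variants is easiest to see in the training examples they generate. The sketch below builds those examples for one short sentence with a window of 1. It is a simplified illustration: the helper `context_windows` is hypothetical, and real implementations add further details (such as negative sampling) that are omitted here.

```python
def context_windows(tokens, window):
    """Yield (center, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = "the movie was great".split()

# Skip-gram: one (input = center word, target = context word) pair per neighbour
skipgram_pairs = [(c, w) for c, ctx in context_windows(tokens, 1) for w in ctx]

# CBOW: one (input = all context words, target = center word) example per position
cbow_examples = [(ctx, c) for c, ctx in context_windows(tokens, 1)]

print(skipgram_pairs)
print(cbow_examples)
```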

---

- Unlike TF-IDF, the word2vec vectorizer takes a **list of tokenized sentences** (each a list of words) as input.
- Each word is transformed into an n-dimensional vector, where n is set by the `size` parameter (renamed `vector_size` in gensim 4.0 and later).
- Thus, if a review consists of m words, the output is an (m x n) matrix.

---

https://wikidocs.net/22660

In [4]:
# Define input data
sentences = [review.split() for review in reviews]

In [5]:
# Hyperparameters of word2vec model
num_features = 500    # dimension of each embedded vector
min_word_count = 35   # words that occur fewer times than this are ignored
num_workers = 8       # number of worker threads
context = 10          # context window size (maximum distance between the current and predicted word)
downsampling = 1e-3   # downsampling threshold for frequent words, to speed up training. 0.001 is commonly used.
sg = 1                # 0 for CBOW, 1 for Skip-gram

# Log training progress
import logging
logging.basicConfig(format = "%(asctime)s : %(levelname)s : %(message)s", level = logging.INFO)

In [6]:
# Define and train word2vec Skip-gram model
from gensim.models import word2vec

print("Training Model...")
model = word2vec.Word2Vec(sentences,
                         workers = num_workers,
                         size = num_features,       # called vector_size in gensim 4.0 and later
                         min_count = min_word_count,
                         window = context,
                         sample = downsampling,
                         sg = sg)

2021-01-09 15:45:55,784 : INFO : collecting all words and their counts
2021-01-09 15:45:55,785 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types


Training Model...


2021-01-09 15:45:56,000 : INFO : PROGRESS: at sentence #10000, processed 1205223 words, keeping 51374 word types
2021-01-09 15:45:56,202 : INFO : PROGRESS: at sentence #20000, processed 2396605 words, keeping 67660 word types
2021-01-09 15:45:56,310 : INFO : collected 74065 word types from a corpus of 2988089 raw words and 25000 sentences
2021-01-09 15:45:56,311 : INFO : Loading a fresh vocabulary
2021-01-09 15:45:56,360 : INFO : effective_min_count=35 retains 8973 unique words (12% of original 74065, drops 65092)
2021-01-09 15:45:56,360 : INFO : effective_min_count=35 leaves 2657397 word corpus (88% of original 2988089, drops 330692)
2021-01-09 15:45:56,385 : INFO : deleting the raw counts dictionary of 74065 items
2021-01-09 15:45:56,388 : INFO : sample=0.001 downsamples 28 most-common words
2021-01-09 15:45:56,388 : INFO : downsampling leaves estimated 2526274 word corpus (95.1% of prior 2657397)
2021-01-09 15:45:56,404 : INFO : estimated required memory for 8973 words and 500 dimen

2021-01-09 15:46:41,334 : INFO : EPOCH 4 - PROGRESS: at 62.64% examples, 206763 words/s, in_qsize 15, out_qsize 0
2021-01-09 15:46:42,491 : INFO : EPOCH 4 - PROGRESS: at 71.62% examples, 205336 words/s, in_qsize 15, out_qsize 0
2021-01-09 15:46:43,520 : INFO : EPOCH 4 - PROGRESS: at 80.79% examples, 206853 words/s, in_qsize 15, out_qsize 0
2021-01-09 15:46:44,530 : INFO : EPOCH 4 - PROGRESS: at 88.98% examples, 206941 words/s, in_qsize 14, out_qsize 1
2021-01-09 15:46:45,526 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-01-09 15:46:45,545 : INFO : EPOCH 4 - PROGRESS: at 98.12% examples, 208209 words/s, in_qsize 6, out_qsize 1
2021-01-09 15:46:45,549 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-01-09 15:46:45,630 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-01-09 15:46:45,682 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-01-09 15:46:45,725 : INFO : worker thread finished; awaiting f

- Trained word2vec models can be saved and reused later.
- It is good practice to include the hyperparameters in the model's name.
- Once the model is saved, it can be loaded again with the `Word2Vec.load()` method.
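
One way to keep the file name in sync with the hyperparameters is to build it from the values themselves. The helper `make_model_name` below is a hypothetical convenience function, not part of gensim. It reproduces the naming pattern used in this notebook.

```python
def make_model_name(num_features, min_word_count, context, sg):
    """Encode the main word2vec hyperparameters in a file name."""
    algorithm = "Skipgram" if sg == 1 else "CBOW"
    return f"{num_features}features_{min_word_count}minwords_{context}context_{algorithm}"

model_name = make_model_name(num_features=500, min_word_count=35, context=10, sg=1)
print(model_name)
```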

In [7]:
# Save the trained model
model_name = "500features_35minwords_10context_Skipgram"
model.save(model_name)

2021-01-09 15:46:59,117 : INFO : saving Word2Vec object under 500features_35minwords_10context_Skipgram, separately None
2021-01-09 15:46:59,119 : INFO : not storing attribute vectors_norm
2021-01-09 15:46:59,119 : INFO : not storing attribute cum_table
2021-01-09 15:46:59,469 : INFO : saved 500features_35minwords_10context_Skipgram


- Since reviews contain different numbers of words, their (m x n) outputs have different shapes, so we need a fixed-size representation.
- A simple way to get one is to use the average of a review's word vectors as its representative (**feature vector**).
- We will write a function to calculate this feature vector.

In [8]:
# Define function to calculate feature vectors
import numpy as np

def get_features(words, model, num_features):
    """
    Function to calculate the mean of all embedded word vectors in a review.
    The mean vector is used to represent a single review. (Feature Vector)

    words: a single review as a list of word tokens
    model: trained word2vec model
    num_features: dimension of embedded vectors (Same as num_features in word2vec model definition)
    """

    # Initialize output vector
    feature_vector = np.zeros((num_features), dtype=np.float32)
    num_words = 0   # total count of valid words inside a review

    # Vocabulary of the trained model (index2word is called index_to_key in gensim 4.0 and later)
    index2word_set = set(model.wv.index2word)

    # Sum the vectors of all in-vocabulary words
    for w in words:
        if w in index2word_set:
            num_words += 1
            feature_vector = np.add(feature_vector, model.wv[w])

    # Divide by the word count; a review with no known words keeps the zero vector
    if num_words > 0:
        feature_vector = np.divide(feature_vector, num_words)

    return feature_vector
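
To make the averaging concrete, here is a toy version that uses a plain dict in place of `model.wv` and plain lists in place of NumPy arrays. The embeddings and the name `mean_vector` are made up for illustration. The sketch also shows one possible way to handle a review with no in-vocabulary words, which would otherwise mean dividing by zero.

```python
def mean_vector(words, embeddings, num_features):
    """Average the vectors of in-vocabulary words; return zeros if none are known."""
    total = [0.0] * num_features
    count = 0
    for w in words:
        if w in embeddings:
            count += 1
            total = [t + v for t, v in zip(total, embeddings[w])]
    return [t / count for t in total] if count else total

# Hypothetical 2-dimensional embeddings standing in for model.wv
toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

vec = mean_vector(["good", "movie", "unseenword"], toy_embeddings, 2)
empty = mean_vector(["unseenword"], toy_embeddings, 2)
print(vec, empty)
```

Whatever the length of the review, the result always has `num_features` entries, which is what makes it usable as a fixed-size model input.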

- As a final step, let's define a function to get averaged feature vectors for the entire dataset.

In [9]:
def get_dataset(reviews, model, num_features):
    """
    Function to get feature vectors for all the reviews.
    Returns a 2-D array of shape (number of reviews, num_features).
    
    reviews: entire dataset as a list of tokenized reviews
    model: trained word2vec model
    num_features: dimension of embedded vectors (Same as num_features in word2vec model definition)
    """
    
    dataset = [get_features(review, model, num_features) for review in reviews]
    
    reviewsFeatureVecs = np.stack(dataset)
    
    return reviewsFeatureVecs

In [10]:
# Input features to be used for fitting the models
X_word2vec = get_dataset(sentences, model, num_features)


# Define and train model

- We are going to try several models and compare their performance.
    - Logistic Regression
    - Random Forest
    - XGBoost
    - RNN
    - CNN
- We use k-fold cross validation to evaluate each model's performance.
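
As a reminder of what k-fold cross validation does, the sketch below splits sample indices into k folds and uses each fold once as the test set. It is a simplified, unshuffled version with hypothetical helper names. For classifiers, scikit-learn's `cross_validate` uses a stratified variant that also keeps class proportions similar across folds.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_test_splits(n_samples, k):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = k_fold_indices(n_samples, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(train_test_splits(n_samples=10, k=5))
for train, test in splits:
    print("train:", train, "test:", test)
```

The reported score is then the mean of the k per-fold scores, which is what the `np.mean(...)` calls in the cells below compute.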

# Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Define model
lgs = LogisticRegression(class_weight="balanced")

# Train labels
y = sentiments

- `class_weight`: in "balanced" mode, each class is weighted inversely proportional to its frequency, so that no label dominates training
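
The "balanced" weights follow the formula `n_samples / (n_classes * class_count)` given in the scikit-learn documentation. A minimal sketch with made-up labels (the helper name `balanced_class_weights` is hypothetical):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical imbalanced labels: six negative reviews, two positive
weights = balanced_class_weights([0, 0, 0, 0, 0, 0, 1, 1])
print(weights)

# With perfectly balanced labels every class gets weight 1.0
even = balanced_class_weights([0, 1, 0, 1])
print(even)
```

If the two sentiment classes in the training data are equally frequent, this option has no effect, as the second call shows.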

## Using TF-IDF inputs

In [12]:
from sklearn.model_selection import cross_validate

# Evaluate model by 5 fold cross validation
scoring = ['f1','precision','recall', "accuracy"]    # metrics to be evaluated
lgs_scores = cross_validate(lgs, X_TFIDF, y, scoring=scoring, cv=5, return_train_score=False)

print("Mean F1 Score: {:.3f}".format(np.mean(lgs_scores["test_f1"])))
print("Mean Precision Score: {:.3f}".format(np.mean(lgs_scores["test_precision"])))
print("Mean Recall Score: {:.3f}".format(np.mean(lgs_scores["test_recall"])))
print("Mean Accuracy Score: {:.3f}".format(np.mean(lgs_scores["test_accuracy"])))

Mean F1 Score: 0.838
Mean Precision Score: 0.828
Mean Recall Score: 0.849
Mean Accuracy Score: 0.836
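
The four metrics reported by `cross_validate` can all be derived from the counts of true and false positives and negatives. A minimal sketch with made-up labels (the helper `binary_metrics` is hypothetical and not part of scikit-learn):

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels and predictions for eight reviews
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```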


## Using Count Vectorizer inputs

In [13]:
# Evaluate model by 5 fold cross validation
scoring = ['f1','precision','recall', "accuracy"]    # metrics to be evaluated
lgs_scores = cross_validate(lgs, X_countvect, y, scoring=scoring, cv=5, return_train_score=False)

print("Mean F1 Score: {:.3f}".format(np.mean(lgs_scores["test_f1"])))
print("Mean Precision Score: {:.3f}".format(np.mean(lgs_scores["test_precision"])))
print("Mean Recall Score: {:.3f}".format(np.mean(lgs_scores["test_recall"])))
print("Mean Accuracy Score: {:.3f}".format(np.mean(lgs_scores["test_accuracy"])))

Mean F1 Score: 0.837
Mean Precision Score: 0.825
Mean Recall Score: 0.849
Mean Accuracy Score: 0.835


## Using word2vec inputs

In [14]:
# Evaluate model by 5 fold cross validation
scoring = ['f1','precision','recall', "accuracy"]    # metrics to be evaluated
lgs_scores = cross_validate(lgs, X_word2vec, y, scoring=scoring, cv=5, return_train_score=False)

print("Mean F1 Score: {:.3f}".format(np.mean(lgs_scores["test_f1"])))
print("Mean Precision Score: {:.3f}".format(np.mean(lgs_scores["test_precision"])))
print("Mean Recall Score: {:.3f}".format(np.mean(lgs_scores["test_recall"])))
print("Mean Accuracy Score: {:.3f}".format(np.mean(lgs_scores["test_accuracy"])))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Mean F1 Score: 0.880
Mean Precision Score: 0.876
Mean Recall Score: 0.884
Mean Accuracy Score: 0.880



- With logistic regression, the word2vec features perform best, reaching about 88% accuracy.
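
The convergence warnings printed above suggest raising `max_iter` or scaling the data. One possible way to do both is sketched below on synthetic data, which stands in for the averaged word2vec features. The value `max_iter=1000` and the use of `StandardScaler` are assumptions, and whether they change the scores reported in this notebook has not been checked here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the averaged word2vec features (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Scaling lives inside the pipeline, so each CV fold is scaled using its own training part
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

demo_scores = cross_validate(clf, X_demo, y_demo, scoring=["accuracy", "f1"], cv=5)
mean_accuracy = demo_scores["test_accuracy"].mean()
print("Mean Accuracy Score: {:.3f}".format(mean_accuracy))
```

Putting the scaler in a pipeline rather than scaling the whole dataset up front avoids leaking information from the test folds into training.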

# Random Forest

- This time, we use a random forest model, which is a somewhat more complex model than logistic regression.

In [15]:
from sklearn.ensemble import RandomForestClassifier

# Random forest classifier with no hyperparameters tuned
rf = RandomForestClassifier()

## Using TF-IDF inputs

In [16]:
# Evaluate model by 3 fold cross validation
scoring = ['f1','precision','recall', "accuracy"]    # metrics to be evaluated
rf_scores = cross_validate(rf, X_TFIDF, y, scoring=scoring, cv=3, return_train_score=False, 
                           verbose=1, n_jobs=-1)

print("Mean F1 Score: {:.3f}".format(np.mean(rf_scores["test_f1"])))
print("Mean Precision Score: {:.3f}".format(np.mean(rf_scores["test_precision"])))
print("Mean Recall Score: {:.3f}".format(np.mean(rf_scores["test_recall"])))
print("Mean Accuracy Score: {:.3f}".format(np.mean(rf_scores["test_accuracy"])))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Mean F1 Score: 0.814
Mean Precision Score: 0.807
Mean Recall Score: 0.821
Mean Accuracy Score: 0.812


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   28.3s finished


## Using Count Vectorizer inputs

In [17]:
# Evaluate model by 3 fold cross validation
scoring = ['f1','precision','recall', "accuracy"]    # metrics to be evaluated
rf_scores = cross_validate(rf, X_countvect, y, scoring=scoring, cv=3, return_train_score=False,
                          verbose=1, n_jobs=-1)

print("Mean F1 Score: {:.3f}".format(np.mean(rf_scores["test_f1"])))
print("Mean Precision Score: {:.3f}".format(np.mean(rf_scores["test_precision"])))
print("Mean Recall Score: {:.3f}".format(np.mean(rf_scores["test_recall"])))
print("Mean Accuracy Score: {:.3f}".format(np.mean(rf_scores["test_accuracy"])))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Mean F1 Score: 0.808
Mean Precision Score: 0.803
Mean Recall Score: 0.812
Mean Accuracy Score: 0.806


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   28.4s finished


## Using word2vec inputs

In [18]:
# Evaluate model by 3 fold cross validation
scoring = ['f1','precision','recall', "accuracy"]    # metrics to be evaluated
rf_scores = cross_validate(rf, X_word2vec, y, scoring=scoring, cv=3, return_train_score=False,
                          verbose=1, n_jobs=1)

print("Mean F1 Score: {:.3f}".format(np.mean(rf_scores["test_f1"])))
print("Mean Precision Score: {:.3f}".format(np.mean(rf_scores["test_precision"])))
print("Mean Recall Score: {:.3f}".format(np.mean(rf_scores["test_recall"])))
print("Mean Accuracy Score: {:.3f}".format(np.mean(rf_scores["test_accuracy"])))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Mean F1 Score: 0.851
Mean Precision Score: 0.831
Mean Recall Score: 0.873
Mean Accuracy Score: 0.848


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.3min finished


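- For all three input representations, random forest performed worse than logistic regression (note that logistic regression was evaluated with 5 folds and random forest with 3).
- This implies that using a more advanced model doesn't necessarily guarantee better performance.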

# XGBoost

In [19]:
from xgboost import XGBClassifier

# XGBoost Classifier without hyperparameters tuned
XGB = XGBClassifier(random_state=0)

2021-01-09 15:49:42,741 : INFO : NumExpr defaulting to 8 threads.


## Using TF-IDF inputs

In [20]:
# Evaluate model by 3 fold cross validation
scoring = ['f1','precision','recall', "accuracy"]    # metrics to be evaluated
xgb_scores = cross_validate(XGB, X_TFIDF, y, scoring=scoring, cv=3, return_train_score=False, 
                            verbose=1, n_jobs=-1)

print("Mean F1 Score: {:.3f}".format(np.mean(xgb_scores["test_f1"])))
print("Mean Precision Score: {:.3f}".format(np.mean(xgb_scores["test_precision"])))
print("Mean Recall Score: {:.3f}".format(np.mean(xgb_scores["test_recall"])))
print("Mean Accuracy Score: {:.3f}".format(np.mean(xgb_scores["test_accuracy"])))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Mean F1 Score: 0.818
Mean Precision Score: 0.803
Mean Recall Score: 0.833
Mean Accuracy Score: 0.814


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   16.3s finished


## Using Count Vectorizer inputs

In [21]:
# Evaluate model by 3 fold cross validation
scoring = ['f1','precision','recall', "accuracy"]    # metrics to be evaluated
xgb_scores = cross_validate(XGB, X_countvect, y, scoring=scoring, cv=3, return_train_score=False, 
                            verbose=1, n_jobs=-1)

print("Mean F1 Score: {:.3f}".format(np.mean(xgb_scores["test_f1"])))
print("Mean Precision Score: {:.3f}".format(np.mean(xgb_scores["test_precision"])))
print("Mean Recall Score: {:.3f}".format(np.mean(xgb_scores["test_recall"])))
print("Mean Accuracy Score: {:.3f}".format(np.mean(xgb_scores["test_accuracy"])))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Mean F1 Score: 0.823
Mean Precision Score: 0.807
Mean Recall Score: 0.839
Mean Accuracy Score: 0.819


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    7.8s finished


## Using word2vec inputs

In [22]:
# Evaluate model by 3 fold cross validation
scoring = ['f1','precision','recall', "accuracy"]    # metrics to be evaluated
xgb_scores = cross_validate(XGB, X_word2vec, y, scoring=scoring, cv=3, return_train_score=False, 
                            verbose=1, n_jobs=-1)

print("Mean F1 Score: {:.3f}".format(np.mean(xgb_scores["test_f1"])))
print("Mean Precision Score: {:.3f}".format(np.mean(xgb_scores["test_precision"])))
print("Mean Recall Score: {:.3f}".format(np.mean(xgb_scores["test_recall"])))
print("Mean Accuracy Score: {:.3f}".format(np.mean(xgb_scores["test_accuracy"])))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Mean F1 Score: 0.868
Mean Precision Score: 0.862
Mean Recall Score: 0.874
Mean Accuracy Score: 0.867


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  1.6min finished


# Conclusion

- Across the models above, validation accuracy did not improve beyond 88%.
- This seems to be the best we can do with our preprocessed data, so to increase accuracy we would have to go back to the preprocessing steps and try to extract more informative features from the words.
- However, since deep learning models might perform better than these classical machine learning models, we will next try fitting our data with an RNN and a CNN.
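
For reference, the mean accuracies reported in the sections above can be collected in one place and ranked. The numbers are copied from the cell outputs of this notebook. Keep in mind that logistic regression was evaluated with 5-fold cross validation and the other two models with 3-fold, so the comparison is only approximate.

```python
# Mean cross-validated accuracy as reported in the sections above
accuracy = {
    ("Logistic Regression", "TF-IDF"): 0.836,
    ("Logistic Regression", "Count"): 0.835,
    ("Logistic Regression", "word2vec"): 0.880,
    ("Random Forest", "TF-IDF"): 0.812,
    ("Random Forest", "Count"): 0.806,
    ("Random Forest", "word2vec"): 0.848,
    ("XGBoost", "TF-IDF"): 0.814,
    ("XGBoost", "Count"): 0.819,
    ("XGBoost", "word2vec"): 0.867,
}

# Best (model, features) combination by accuracy
best_combo = max(accuracy, key=accuracy.get)

# Print all combinations from best to worst
for (model_name, features), acc in sorted(accuracy.items(), key=lambda kv: -kv[1]):
    print(f"{model_name:<20} {features:<9} {acc:.3f}")
```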