# Sentiment Analysis of IMDB Movie Reviews


## The Dataset and The Problem to Solve

In this notebook, we explore a Kaggle dataset of 50,000 IMDB movie reviews. It has two columns: "review", which holds the text of the review, and "sentiment", which labels the review as positive or negative.

Objective: Determine which of several machine learning models best predicts the sentiment (positive or negative) of a movie review from its text.

In [None]:
import pandas as pd 


## 1. Data Collection

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset size: ~60MB (decompressed)

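Save the CSV file as "IMDB_Dataset.csv" inside a "dataset" folder (rename the download if necessary), because the next cell reads it from "dataset/IMDB_Dataset.csv" relative to the notebook's working directory.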


In [None]:
df_review = pd.read_csv("dataset/IMDB_Dataset.csv")
df_review

## 2. Data preprocessing

The preprocessing steps that we will use are:
- Use a fraction of the dataset for faster iteration during initial experiments
- Split the data into training and test sets
- Vectorize the data (natural-language reviews -> numerical vectors)

### Data fractioning


In [None]:
# Keep the first 5,000 reviews of each class to get a balanced 10,000-row subset.
df_positive = df_review[df_review['sentiment'] == 'positive'][:5000]
df_negative = df_review[df_review['sentiment'] == 'negative'][:5000]

df_review_small = pd.concat([df_positive, df_negative])
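
Taking the first 5,000 rows of each class is deterministic but depends on the row order of the file. As a possible alternative, the sketch below draws a random balanced sample with `groupby(...).sample(...)`. It runs on a small made-up frame (`df_toy`, a stand-in for `df_review`), and `n_per_class` is an illustrative name that does not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for df_review (hypothetical data, same two columns).
df_toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": ["positive", "negative"] * 10,
})

n_per_class = 3  # illustrative; the notebook itself uses 5000

df_small = (
    df_toy.groupby("sentiment")
    .sample(n=n_per_class, random_state=42)   # same number of rows per class
    .sample(frac=1, random_state=42)          # shuffle so the classes are interleaved
    .reset_index(drop=True)
)

print(df_small["sentiment"].value_counts())
```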


### Splitting into train and test set


In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df_review_small, test_size=0.2, random_state=42)

In [None]:
train_x, train_y = train['review'], train['sentiment']
test_x, test_y = test['review'], test['sentiment']

In [None]:
train_y.value_counts()
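
`train_y.value_counts()` shows how balanced the training labels are after a purely random split. If an exact class balance is wanted in both sets, `train_test_split` accepts a `stratify` argument. A minimal sketch on made-up data (`df_toy`, `train_toy`, and `test_toy` are hypothetical names):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced dataset: 50 positive and 50 negative rows.
df_toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(100)],
    "sentiment": ["positive"] * 50 + ["negative"] * 50,
})

# stratify keeps the class proportions the same in the train and test sets.
train_toy, test_toy = train_test_split(
    df_toy,
    test_size=0.2,
    random_state=42,
    stratify=df_toy["sentiment"],
)

print(train_toy["sentiment"].value_counts())
print(test_toy["sentiment"].value_counts())
```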


### Vectorization (Bag of words)


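To analyze the movie reviews, we must convert the text into numerical vectors, because machine learning models require numerical input. We use a Bag of Words (BOW) representation, which keeps word frequencies and disregards word order, weighted with TF-IDF so that words appearing in many reviews count for less. English stop words are removed during vectorization.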

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
train_x_vector = tfidf.fit_transform(train_x)
# only transform the test set: the vocabulary and IDF weights come from the training set
test_x_vector = tfidf.transform(test_x)

In [None]:
pd.DataFrame.sparse.from_spmatrix(train_x_vector,
                                  index=train_x.index,
                                  columns=tfidf.get_feature_names_out())

## 3-4. Model Selection & Training

Because the dataset includes labels, this is a supervised learning problem. We will test three classification algorithms to see which performs best on our dataset.

The algorithms that will be tested are:

- Decision Tree
- Gaussian Naive Bayes
- Logistic Regression


### Decision Tree


In [None]:
from sklearn.tree import DecisionTreeClassifier

dec_tree = DecisionTreeClassifier(
    random_state=42,
    max_depth=3
)
dec_tree.fit(train_x_vector, train_y)


### Naive Bayes


In [None]:
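from sklearn.naive_bayes import GaussianNB

# GaussianNB does not accept sparse input, so the TF-IDF matrix is densified first.
# This can use a lot of memory when the vocabulary is large.
gnb = GaussianNB()
gnb.fit(train_x_vector.toarray(), train_y)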
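
As an aside, scikit-learn also provides `MultinomialNB`, a different Naive Bayes variant that accepts sparse matrices directly and so avoids the `.toarray()` step. The sketch below fits both variants on a made-up corpus. It is not a claim about which one scores better on the IMDB data, and all names in it (`texts`, `labels`, `gnb_toy`, `mnb_toy`) are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Made-up training data for illustration only.
texts = ["great movie", "loved the acting", "wonderful film",
         "terrible plot", "boring film", "awful acting"]
labels = ["positive", "positive", "positive",
          "negative", "negative", "negative"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)          # sparse TF-IDF matrix

gnb_toy = GaussianNB().fit(X.toarray(), labels)   # needs a dense array
mnb_toy = MultinomialNB().fit(X, labels)          # works on the sparse matrix

new = vec.transform(["boring plot"])
print(gnb_toy.predict(new.toarray()), mnb_toy.predict(new))
```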


### Logistic Regression


In [None]:
from sklearn.linear_model import LogisticRegression

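# max_iter=5 is very low, so the solver will likely stop before converging
# and emit a ConvergenceWarning; section 6 revisits this hyperparameter.
log_reg = LogisticRegression(
    random_state=42,
    max_iter=5
)
log_reg.fit(train_x_vector, train_y)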


## 5. Model Evaluation


### Mean Accuracy

In [None]:
print("Model 1 (Decision tree) accuracy:", dec_tree.score(test_x_vector, test_y))
print("Model 2 (GaussianNB) accuracy:", gnb.score(test_x_vector.toarray(), test_y))
print("Model 3 (Logistic Regression) accuracy:", log_reg.score(test_x_vector, test_y))

## 6. Iteration
Logistic Regression achieved respectable accuracy, so the next step is to iterate over the whole workflow (steps 1-5) to refine the final model and, ideally, improve its performance. Here are several targeted strategies for each phase:

### 1. **Data Collection:**
   - **Expand the Dataset:** Explore additional datasets or scrape movie review websites to diversify the training data.
   - **Augment the Data:** Implement techniques like synonym replacement or back-translation to artificially increase the dataset size.
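
Synonym replacement can be sketched in a few lines. The example below assumes a tiny hand-written synonym table (`SYNONYMS`) and a helper named `synonym_replace`; both are hypothetical and only illustrate the idea. A real setup would more likely draw synonyms from a thesaurus resource.

```python
import random

# Hypothetical, hand-written synonym table for illustration only.
SYNONYMS = {
    "boring": ["dull", "tedious"],
    "great": ["excellent", "superb"],
    "movie": ["film"],
}

def synonym_replace(text, rng, p=0.5):
    """Replace each word that has a known synonym with probability p."""
    out = []
    for word in text.split():
        choices = SYNONYMS.get(word.lower())
        if choices and rng.random() < p:
            out.append(rng.choice(choices))
        else:
            out.append(word)
    return " ".join(out)

rng = random.Random(42)  # seeded so the augmentation is reproducible
augmented = synonym_replace("a boring movie with great acting", rng, p=1.0)
print(augmented)
```

The augmented review keeps the original label, so each call yields one extra training example per source review.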

### 2. **Data Preprocessing:**
   - **More Training Data:** Train on more of the 50,000 reviews than the 10,000-review sample used above to improve model generalization and performance.
   - **Advanced Vectorization:** Experiment with Word2Vec or GloVe for word embeddings that capture more nuanced semantic relationships than TF-IDF.
   - **Lemmatization Over Stemming:** Apply lemmatization to reduce words to their base or dictionary form, preserving the semantic meaning of the text.

### 3. **Model Selection:**
   - **Advanced Algorithms:** Explore more sophisticated algorithms like Support Vector Machines (SVM) or XGBoost for better predictive performance.
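
As one way to try an SVM on TF-IDF features, scikit-learn's `LinearSVC` accepts the sparse matrix directly. The sketch below uses a made-up corpus and hypothetical names (`texts`, `labels`, `svm_toy`), and makes no claim about how an SVM would score on the IMDB data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up training data for illustration only.
texts = ["great movie", "loved the acting", "wonderful film",
         "terrible plot", "boring film", "awful acting"]
labels = ["positive", "positive", "positive",
          "negative", "negative", "negative"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

svm_toy = LinearSVC(random_state=42)   # linear-kernel SVM
svm_toy.fit(X, labels)

pred = svm_toy.predict(vec.transform(["a wonderful movie"]))
print(pred)
```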

### 4. **Model Training:**
   - **Hyperparameter Optimization:** Experiment with different hyperparameter values for each model (max_depth in Decision Tree, max_iter in Logistic Regression) to see how they affect performance.
   - **Cross-Validation:** Implement k-fold cross-validation to ensure the model's robustness and generalizability across different subsets of the dataset.
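
Hyperparameter search and k-fold cross-validation can be combined with `GridSearchCV`. The sketch below wraps the vectorizer and classifier in a `Pipeline` so that TF-IDF is refit on each training fold. The corpus, the parameter grid, and the names `pipe` and `search` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up data: 6 positive and 6 negative reviews.
texts = ["great movie", "loved the acting", "wonderful film",
         "excellent plot", "superb acting", "great film",
         "terrible plot", "boring film", "awful acting",
         "dull movie", "terrible acting", "boring plot"]
labels = ["positive"] * 6 + ["negative"] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=42)),
])

# Illustrative grid; step-name prefixes route each value to the right step.
param_grid = {
    "clf__max_iter": [50, 200],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=3 runs stratified 3-fold cross-validation for every parameter combination.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```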

### 5. **Model Evaluation:**
   - **Confusion Matrix:** Beyond accuracy, inspect the confusion matrix to understand the model's performance across different classes (positive vs. negative reviews).
   - **ROC and AUC:** Evaluate the model using the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) for a comprehensive performance metric.
   - **Precision-Recall Trade-off:** Analyze the precision-recall curve, especially useful in the context of imbalanced datasets, to find an optimal balance for the classification threshold.
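
The metrics listed above are all available in `sklearn.metrics`. Here is a minimal sketch on a made-up train/test split. The data and the names `clf`, `cm`, and `auc` are hypothetical, and the numbers it prints say nothing about the IMDB models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Made-up data for illustration only.
train_texts = ["great movie", "loved it", "wonderful acting",
               "terrible plot", "boring film", "awful acting"]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]
test_texts = ["great acting", "boring plot", "loved the film", "awful movie"]
test_labels = ["positive", "negative", "positive", "negative"]

vec = TfidfVectorizer()
clf = LogisticRegression(random_state=42)
clf.fit(vec.fit_transform(train_texts), train_labels)

X_test = vec.transform(test_texts)
pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes, in the order of `order`.
order = ["negative", "positive"]
cm = confusion_matrix(test_labels, pred, labels=order)
print(cm)

# Per-class precision, recall, and F1.
print(classification_report(test_labels, pred, labels=order, zero_division=0))

# ROC AUC is computed from the predicted probability of the positive class.
pos_idx = list(clf.classes_).index("positive")
scores = clf.predict_proba(X_test)[:, pos_idx]
y_true = [1 if y == "positive" else 0 for y in test_labels]
auc = roc_auc_score(y_true, scores)
print(auc)
```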



## Manual check

In [None]:
review = """
Last night I saw the movie Road House and I did not like it. 
The acting was fine, the fighting scenes were badass 
but overall a predictable and kinda boring plot.
"""


print(log_reg.predict(tfidf.transform([review])))