# Predicting IMDB Ratings Based on Movie Reviews

Authors: Yuanzhe Marco Ma, Arash Shamseddini, Kaicheng Tan, Zhenrui Yu

## Table of Contents

- [Project Goal](#1)
- [Data Retrieval](#2)
- [Exploratory Data Analysis (EDA)](#3)
- [Preprocessing](#4)
- [ML Model Fitting](#5)
- [Predicting](#6)
- [Criticism](#7)

## I. Project Goal <a class="anchor" id="1"></a>

In this project, we examine the relationship between movie reviews and their IMDB scores (ranging from 0 to 10). Positive reviews tend to accompany high IMDB scores, while negative reviews tend to accompany low ones. While it is easy for a human to read a review and guess its score, we wonder whether a machine can do the same. Furthermore, we would like to automate this process so that, given a set of movie reviews, we can predict their corresponding IMDB scores easily. <br>

In this project, we use Support Vector Machines as our machine learning model. 

## II. Where Do We Get the Data? <a class="anchor" id="2"></a>

We obtained our data from an open-source GitHub repository: <br>
> https://github.com/nproellochs/SentimentDictionaries/blob/master/Dataset_IMDB.csv <br>

The repository was originally used for sentiment analysis related to movie reviews. Here we are using the `Dataset_IMDB.csv` as our main data source. 

#### Automation

To automate data retrieval, we have written a script to obtain the dataset with Python. The script can be accessed [here](../src/download_data.py). 
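
As a rough idea of what such a script can look like, here is a minimal sketch using only the standard library. It is not the contents of `download_data.py`: the helper names `to_raw_url` and `download` are hypothetical, and the sketch assumes the usual GitHub convention that a `blob` page URL maps to a raw file URL on `raw.githubusercontent.com`.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Dataset page quoted in this report.
BLOB_URL = (
    "https://github.com/nproellochs/SentimentDictionaries/"
    "blob/master/Dataset_IMDB.csv"
)


def to_raw_url(blob_url: str) -> str:
    """Turn a GitHub 'blob' page URL into the matching raw-file URL."""
    return blob_url.replace(
        "https://github.com/", "https://raw.githubusercontent.com/", 1
    ).replace("/blob/", "/", 1)


def download(blob_url: str, out_path: str) -> Path:
    """Download the raw file behind blob_url and save it to out_path."""
    target = Path(out_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(to_raw_url(blob_url), str(target))
    return target
```

Calling `download(BLOB_URL, "data/raw.csv")` would then save the CSV locally; the output path here is only an example.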

## III. What does the data look like? <a class="anchor" id="3"></a>

Let's look into the dataset by performing some **Exploratory Data Analysis (EDA)**. 

#### 1. Data Columns:
| Column Name | Column Type | Description                             |
|-------------|-------------|-----------------------------------------|
| Id          | Numeric     | Unique ID assigned to each observation. |
| Text        | Free Text   | Body of the review content.             |
| Author      | Categorical | Author's name of the review             |
| Rating      | Numeric     | Ratings given along with the review (normalized)   |

For this project, we look primarily into the `Text` and `Rating` columns.  



We realized that `Author` may be strongly related to ratings, but since we are building a generalized model for reviews from any audience, we decided to discard it for this analysis. 

Therefore, we will **drop** both the `Author` and `Id` columns.
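
In pandas, this step could look roughly like the sketch below. The toy rows are made up for illustration and only mirror the column names from the table above; they are not taken from the dataset.

```python
import pandas as pd

# Toy rows standing in for Dataset_IMDB.csv (values are made up).
reviews = pd.DataFrame(
    {
        "Id": [1, 2],
        "Text": ["a moving , well-acted drama", "dull and far too long"],
        "Author": ["Critic A", "Critic B"],
        "Rating": [0.8, 0.3],
    }
)

# Keep only the input text and the target.
reviews = reviews.drop(columns=["Id", "Author"])
```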

#### 2. The `Text` feature
The `Text` feature contains all the movie reviews. This will be our primary input feature. <br>
Below are the top 10 most frequent words in the reviews:

| Word   | Frequency | Rank |
|--------|-----------|------|
|the     |   172557  |  1   |
|of      |   78038   |  2   |
|and     |   76392   |  3   |
|to      |   74239   |  4   |
|is      |   57547   |  5   |
|in      |   49646   |  6   |
|that    |   33476   |  7   |
|it      |   33061   |  8   |
|as      |   27536   |  9   |
|with    |   26852   |  10  |

As we can see, the most frequent words are generic function words such as articles, prepositions and pronouns, which carry little information for our model to learn from. We want to avoid overfitting to these words as we train our model. 
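
A frequency table like the one above can be produced with `collections.Counter`. The sketch below is illustrative only: the whitespace tokenization, the tiny `STOP_WORDS` set, and the sample sentences are assumptions, not the exact procedure used for this report. It also shows how filtering out stop words changes the ranking.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "of", "and", "to", "is", "in", "that", "it", "as", "with", "a"}


def top_words(texts, n=10, stop_words=frozenset()):
    """Return the n most common lowercase, whitespace-separated tokens."""
    counts = Counter(
        token
        for text in texts
        for token in text.lower().split()
        if token not in stop_words
    )
    return counts.most_common(n)


# Made-up sentences, only to show the call.
sample = [
    "The plot of the film is thin",
    "the acting is superb and the score is superb",
]
raw_top = top_words(sample, n=2)
filtered_top = top_words(sample, n=2, stop_words=STOP_WORDS)
```

On this sample, the unfiltered ranking is led by `the` and `is`, while the filtered ranking is led by `superb`.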

#### 3. The `Rating` Target
`Rating` will be our target. Let's look at the distribution of `Rating`. 

![ratings_histo](../results/histogram_rating_distribution.svg)

The ratings seem roughly normally distributed, with a slight left skew. Most of the ratings cluster between 0.5 and 0.8. 

#### 4. Correlation between `Text` length and `Rating`
We suspect that people who feel strongly about a movie tend to write longer reviews to express their feelings. This could be true for very negative reviews as well as very positive ones. <br> <br>
A plot of `Text` length against `Rating` is presented below. 

![textlength_vs_rating](../results/histogram_rating_vs_text_length.svg)

There doesn't seem to be a strong correlation between review length and rating. However, it is notable that for the most positive ratings (from 0.7 to 1.0), the reviews tend to be longer. 
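
To put a number on this visual impression, one option is the Pearson correlation coefficient between word count and rating. The sketch below is an illustration rather than part of our analysis; the `pearson` helper and the toy data are assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up reviews and ratings, only to show the call.
texts = [
    "short one",
    "a somewhat longer review here",
    "this review goes on for quite a few more words",
]
ratings = [0.4, 0.6, 0.9]

# Word count per review, using simple whitespace splitting.
lengths = [len(text.split()) for text in texts]
r = pearson(lengths, ratings)
```

A value of `r` near 0 would support the "no strong correlation" reading, while a value near 1 or -1 would contradict it.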

## IV. Preparing our Data <a class="anchor" id="4"></a>

In addition to the `Text` feature, we derived two further features that could improve our machine learning model. 

#### 1. `n_words`
As mentioned above, we suspect some correlation between review lengths and ratings. Therefore we created an `n_words` feature, which counts the number of words in each review.  
#### 2. `sentiment`
We used the [**NLTK**](#c2) package to extract the sentiment of each review. This `sentiment` feature has four ordinal categories: ['neg', 'compound', 'neu', 'pos']. 
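
The four category names match the keys returned by `polarity_scores` in NLTK's VADER `SentimentIntensityAnalyzer`, where `neg`, `neu` and `pos` are proportions and `compound` is an overall score between -1 and 1. The sketch below assumes, without confirmation from the project code, that the category is the key with the highest score. The helper `dominant_sentiment` is hypothetical and the score values are made up.

```python
def dominant_sentiment(scores: dict) -> str:
    """Return the key with the highest score (an assumed labeling rule)."""
    return max(scores, key=scores.get)


# Illustrative score dicts shaped like VADER output (numbers are made up).
long_positive_review = {"neg": 0.06, "neu": 0.78, "pos": 0.16, "compound": 0.99}
long_negative_review = {"neg": 0.15, "neu": 0.77, "pos": 0.08, "compound": -0.98}

labels = [
    dominant_sentiment(scores)
    for scores in (long_positive_review, long_negative_review)
]
```

Under this assumed rule, the two examples are labeled `compound` and `neu`. This may be worth keeping in mind when checking which categories actually occur in the data.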

## V. Fitting the Model <a class="anchor" id="5"></a>

Now that our training data is ready, we fit a regression model using a Support Vector Machine. <br>
Here we used [sklearn](#c3)'s `SVR` estimator.

We tuned our model with hyperparameter optimization. Specifically, we tuned the `gamma` hyperparameter of SVR using sklearn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). The `gamma` hyperparameter controls the complexity of our model: with an RBF kernel (SVR's default), larger values let the model fit the training data more closely. We want to pick the `gamma` that predicts well while avoiding over-fitting. 
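
The search can be sketched as follows. This is a minimal illustration, not our actual training script: the toy reviews, the bag-of-words `CountVectorizer` step, the candidate `gamma` values, and `cv=2` are all assumptions chosen to keep the example small and runnable.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

# Made-up reviews and normalized ratings, only to make the sketch runnable.
texts = [
    "great film", "boring plot", "superb acting", "terrible acting",
    "wonderful story", "dull and boring", "a great story", "a terrible film",
]
ratings = np.array([0.9, 0.2, 0.8, 0.1, 0.9, 0.2, 0.8, 0.1])

# Bag-of-words features followed by SVR (the text features are an assumption).
pipeline = make_pipeline(CountVectorizer(), SVR())

# Candidate gamma values are illustrative, not the grid used in this project.
param_grid = {"svr__gamma": [0.0001, 0.001, 0.01, 0.1]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="r2")
search.fit(texts, ratings)

best_gamma = search.best_params_["svr__gamma"]
prediction = search.predict(["great acting"])
```

After fitting, `search.cv_results_` holds `mean_fit_time`, `mean_score_time` and `mean_test_score` for each candidate, which correspond to the kind of summary columns reported for the best model.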

Below is our tuned, best performing model based on cross-validation score.

| Model Name | Hyperparameter `gamma` | Mean Fit Time | Mean Scoring Time | Mean CV Score |
|------------|------------------------|---------------|-------------------|---------------|
| SVR        | 0.0007                 | 43.80s        | 11.40s            | 0.4700        |

For a more detailed GridSearchCV result, see [this file](../results/hyper_param_search_result.csv).

## VI. Predicting with Our Model <a class="anchor" id="6"></a>

After finalizing our model, we tested it on our test set. 

#### 1. $R^2$, __RMSE__ Scores

$R^2$ and RMSE (Root-mean-squared-error) are two common metrics for evaluating a regression model's accuracy. <br> 

- The $R^2$ score on the test set is **0.5037**, which is comparable to, and slightly better than, our mean cross-validation score (0.4700).
- The RMSE is 1.2861. This seems large, because it means our predictions are typically off by more than 1.2 points. 
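
For reference, both metrics can be computed by hand as in the sketch below. The numbers are made up to keep the arithmetic easy to follow and are unrelated to our results. In practice, `sklearn.metrics` provides `r2_score` and `mean_squared_error` for the same purpose.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root-mean-squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up true and predicted ratings.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 4.0, 5.0, 8.0]

example_rmse = rmse(y_true, y_pred)  # sqrt(2 / 4), about 0.707
example_r2 = r2(y_true, y_pred)      # 1 - 2 / 20 = 0.9
```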

#### 2. Prediction vs. True Ratings
We have also created a scatterplot to compare our predicted ratings vs. the true ratings. 

    from IPython.display import HTML

    # Embed the saved interactive scatterplot in the notebook output.
    HTML(filename='../results/true_vs_predict.html')

> There is an obvious difference between the predicted ratings and the true ratings. In the true ratings, people tend to give round-number ratings, e.g. 0.3 instead of 0.3247. Our model did not capture that. 

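Despite that, most of the points cluster loosely around the identity line, so the predictions show no obvious systematic bias. Together with a test score close to the validation score, this suggests that our model is not badly under-fitting or over-fitting. 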

## VII. Criticism and Improvements <a class="anchor" id="7"></a>

Now that we are done with our prediction and analysis, we can examine the quality of our work. 
There are a few areas for improvement. 

- We discarded the `Author` and `Id` columns at the beginning. In fact, these columns may be influential features, especially `Author`. One major flaw of our dataset is that all reviews come from four critics. This makes it difficult to generalize to a broader audience. It would be better if we could obtain reviews from general audiences. 

- The `sentiment` feature generated with the NLTK package contained only `neu` and `compound` in our dataset. This is confusing and we have yet to understand this behavior. We included the feature regardless because it may still provide useful information. However, this is definitely something we need to investigate further. 

- As shown in the prediction vs. true rating plot above, some of our predictions went beyond the rating limit of 10. We could have bounded our predictions to the valid rating range.

- As mentioned above, our model did not capture the pattern where humans tend to give whole-number scores. In the future, we could put more emphasis on predicting whole-number scores, so that our model more closely resembles human behavior. 
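
The last two points could be handled with a small post-processing step on the raw predictions. The sketch below is one possible approach, not something we have implemented or evaluated. The helper `postprocess` is hypothetical, and the default bounds and step assume a 0 to 10 scale with whole-number ratings.

```python
def postprocess(pred, low=0.0, high=10.0, step=1.0):
    """Clip a prediction to [low, high], then snap it to the nearest multiple of step.

    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    clipped = min(max(pred, low), high)
    return round(clipped / step) * step


# Made-up raw predictions, including values outside the valid range.
raw_predictions = [10.7, -0.4, 6.3]
cleaned = [postprocess(p) for p in raw_predictions]
```

Here the out-of-range values 10.7 and -0.4 become 10.0 and 0.0, and 6.3 is snapped to 6.0. Whether snapping improves or hurts RMSE would need to be checked on the validation set.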

## References

1. Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. (http://archive.ics.uci.edu/ml) <a class="anchor" id="c1"></a>
2. Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python.  O'Reilly Media Inc. (https://www.nltk.org/) <a class="anchor" id="c2"></a>
3. Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011. (https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html) <a class="anchor" id="c3"></a>