# Predicting IMDB Ratings Based on Movie Reviews

Authors: Yuanzhe Marco Ma, Arash Shamseddini, Kaicheng Tan, Zhenrui Yu

## Table of Contents

- [Project Goal](#0)
- [Data Retrieval](#1)
- [Exploratory Data Analysis (EDA)](#2)
- [Preprocessing](#3)
- [Machine Learning Model](#4)
- [Prediction with Model](#5)

## Project Goal

In this project, we examine the relationship between movie reviews and their IMDB scores (ranging from 0 to 5 stars). Positive reviews are often associated with high IMDB scores, while negative reviews suggest the opposite. While it is easy for a human to read a review and guess its score, we wonder whether a machine can do the same. Furthermore, we would like to automate this process so that, given a set of movie reviews, we can easily predict their corresponding IMDB scores. <br>

In this project, we attempt to use Support Vector Machines as our machine learning model, aided by other statistical analysis tools such as ___ and ___. 

## Where Do We Get the Data?

We obtained our data from an open-source GitHub repository: <br> 
> https://github.com/nproellochs/SentimentDictionaries/blob/master/Dataset_IMDB.csv <br>

The repository was originally used for sentiment analysis related to movie reviews. Here we are using the `Dataset_IMDB.csv` as our main data source. 

### Data Retrieval

To automate our analysis, we have written a Python script that retrieves the dataset. The script can be accessed [here](https://github.com/UBC-MDS/group_26/blob/main/src/download_data.py). 
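As a rough illustration of what such a retrieval step can look like, here is a minimal sketch using only the Python standard library. It is not the project's `download_data.py`; the function names `to_raw_url` and `download_data`, the output path in the usage comment, and the assumption that a GitHub "blob" URL maps onto `raw.githubusercontent.com` in this way are all ours.

```python
import urllib.request
from pathlib import Path

# Dataset page quoted in the section above.
BLOB_URL = (
    "https://github.com/nproellochs/SentimentDictionaries"
    "/blob/master/Dataset_IMDB.csv"
)


def to_raw_url(blob_url: str) -> str:
    """Turn a GitHub 'blob' page URL into a raw-file URL (assumed URL layout)."""
    return (
        blob_url
        .replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
        .replace("/blob/", "/", 1)
    )


def download_data(url: str, out_path: str) -> Path:
    """Fetch the bytes at `url` and write them to `out_path`."""
    out = Path(out_path)
    # Create the target folder if it does not exist yet.
    out.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        out.write_bytes(response.read())
    return out


# Example call (hypothetical output path, requires network access):
# download_data(to_raw_url(BLOB_URL), "data/raw/Dataset_IMDB.csv")
```

Keeping the URL conversion separate from the download makes it possible to check the target URL without touching the network.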

## What does the data look like?

Let's look into the dataset by performing some Exploratory Data Analysis (EDA). 

#### 1. Data Columns:
| Column Name | Column Type | Description                             |
|-------------|-------------|-----------------------------------------|
| Id          | Numeric     | Unique ID assigned to each observation. |
| Text        | Free Text   | Body of the review content.             |
| Author      | Categorical | Name of the review's author.            |
| Rating      | Numeric     | Rating given along with the review.     |

For this project, we look primarily into the `Text` and `Rating` columns. <br> Therefore, we will **drop** the `Author` and `Id` columns. 
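A minimal sketch of this column selection, using the standard-library `csv` module, is shown here. The sample rows are invented toy data, the helper name `keep_text_and_rating` is hypothetical, and we assume a comma-delimited file with the four headers listed in the table. With pandas, the equivalent step would be `df.drop(columns=["Id", "Author"])`.

```python
import csv
import io


def keep_text_and_rating(csv_file):
    """Read a CSV file object and keep only the Text and Rating columns."""
    reader = csv.DictReader(csv_file)
    return [
        {"Text": row["Text"], "Rating": float(row["Rating"])}
        for row in reader
    ]


# Toy data standing in for Dataset_IMDB.csv (not real reviews).
sample = io.StringIO(
    "Id,Text,Author,Rating\n"
    "1,A moving and well acted film.,Reviewer A,0.9\n"
    "2,Dull and far too long.,Reviewer B,0.2\n"
)

rows = keep_text_and_rating(sample)
for row in rows:
    print(row)
```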

#### 2. The `Text` feature
The `Text` feature will be our primary input feature. Here is the distribution of review lengths. 

[Figure: distribution of review text lengths]
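One possible way to compute such a length distribution is sketched here. We assume that "length" means the number of whitespace-separated words and that a bin width of 100 words is adequate; both choices, the function names, and the toy reviews are ours, not the original analysis.

```python
from collections import Counter


def review_lengths(texts):
    """Number of whitespace-separated words in each review."""
    return [len(text.split()) for text in texts]


def length_histogram(lengths, bin_width=100):
    """Count reviews per length bin; keys are the lower edge of each bin."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))


# Toy reviews (invented) with 2, 150 and 120 words.
toy_texts = ["good film", "bad " * 150, "fine " * 120]

lengths = review_lengths(toy_texts)
histogram = length_histogram(lengths)
print(histogram)
```

The resulting dictionary can be passed to any plotting library as bar heights keyed by bin start.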

We also looked into the top 20 most frequent words in the reviews. They are shown below: 

| Word   | Frequency | Rank |
|--------|-----------|------|
|the     |   172557  |  1   |
|of      |   78038   |  2   |
|and     |   76392   |  3   |
|to      |   74239   |  4   |
|is      |   57547   |  5   |
|in      |   49646   |  6   |
|that    |   33476   |  7   |
|it      |   33061   |  8   |
|as      |   27536   |  9   |
|with    |   26852   |  10  |
|he      |   26320   |  11  |
|for     |   25704   |  12  |
|his     |   25454   |  13  |
|on      |   18487   |  14  |
|film    |   17220   |  15  |
|by      |   16890   |  16  |
|this    |   16778   |  17  |
|but     |   16759   |  18  |
|her     |   15690   |  19  |
|are     |   15084   |  20  |

As we can see, the most frequent words are generic words such as prepositions and pronouns, which carry little information about a review's rating. We want to avoid overfitting to these words when we train our model. 
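The following sketch shows one way to count word frequencies and to exclude generic words before ranking. The tokenization rule (lowercase runs of letters and apostrophes), the function name `top_words`, and the toy reviews are assumptions for illustration. The stop-word set simply reuses the generic words from the table above and is not a complete list.

```python
import re
from collections import Counter

# Generic words taken from the frequency table above (illustrative only).
STOP_WORDS = frozenset({
    "the", "of", "and", "to", "is", "in", "that", "it", "as", "with",
    "he", "for", "his", "on", "by", "this", "but", "her", "are",
})


def top_words(texts, n=20, stop_words=frozenset()):
    """Return the n most frequent tokens, skipping any in stop_words."""
    counts = Counter()
    for text in texts:
        # Assumed tokenization: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


# Toy reviews (invented).
toy_texts = [
    "The film is the best of the year",
    "The plot of this film is thin",
]

print(top_words(toy_texts, n=3))
print(top_words(toy_texts, n=3, stop_words=STOP_WORDS))
```

Comparing the two rankings makes it easy to see how much of the raw top list is taken up by generic words.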

#### 3. The `Rating` Class
`Rating` will be our target class. Let's look at the distribution of `Rating`. 

[Figure: distribution of `Rating`]
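A rating distribution of this kind could be tabulated as sketched here. We assume ratings are stored on a 0 to 1 scale and use ten equal-width bins; the bin count, the function names, and the toy ratings are illustrative choices, not values from the dataset.

```python
from collections import Counter


def rating_bin(rating, n_bins=10):
    """Map a rating in [0, 1] to a bin index; 1.0 falls into the last bin."""
    return min(int(rating * n_bins), n_bins - 1)


def rating_distribution(ratings, n_bins=10):
    """Number of ratings per bin, ordered from the lowest to the highest bin."""
    counts = Counter(rating_bin(r, n_bins) for r in ratings)
    return [counts.get(i, 0) for i in range(n_bins)]


# Toy ratings (invented), assumed to lie between 0 and 1.
toy_ratings = [0.05, 0.5, 0.55, 0.95, 1.0]

distribution = rating_distribution(toy_ratings)
print(distribution)
```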

#### 4. Correlation between `Text` length and `Rating`
We suspect that people more passionate about certain movies tend to write longer reviews to express feelings. This could also be true for very negative reviews. <br> <br>
A bar plot of `Text` length vs `Rating` is presented below. 

[Figure: review length vs. rating]

There doesn't seem to be a strong correlation between review length and rating. However, it is notable that reviews with the most positive ratings (0.9-1.0) are the longest. 
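A check of this kind can be sketched as a mean review length per rating bin plus a Pearson correlation coefficient. As before, we assume ratings on a 0 to 1 scale, word counts as the length measure, and ten bins; the function names and toy data are ours and say nothing about the actual dataset.

```python
from collections import defaultdict
from math import sqrt


def mean_length_by_rating_bin(texts, ratings, n_bins=10):
    """Average word count of the reviews falling into each rating bin."""
    groups = defaultdict(list)
    for text, rating in zip(texts, ratings):
        bin_index = min(int(rating * n_bins), n_bins - 1)
        groups[bin_index].append(len(text.split()))
    return {b: sum(v) / len(v) for b, v in sorted(groups.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy data (invented): longer reviews paired with higher ratings.
toy_texts = ["a b", "a b c d", "a b c d e f"]
toy_ratings = [0.1, 0.5, 0.95]

means = mean_length_by_rating_bin(toy_texts, toy_ratings)
r = pearson([len(t.split()) for t in toy_texts], toy_ratings)
print(means, r)
```

A coefficient near zero would support the "no strong correlation" reading, while the per-bin means show whether any single bin, such as the top one, stands out.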

## A Little bit of Pre-processing

## Fitting the Model

## Prediction Using our Model

## Criticism and Improvements