# Exploring the Relationship Between the Meta Scores of Films and the Reviews of the General Audience


## Introduction:
IMDb is an online database for films, television series, podcasts, and other media. This project works with movie data from IMDb that contains data on the cast, production crew, plot summary, and scores.$^1$ This data set presents two kinds of quantitative ratings: IMDb scores, which are averages of user-generated ratings out of 10,$^2$ and Meta Scores, which are weighted averages of critic-generated reviews out of 100.$^3$

Though each professional critic has their own goals and writing style, there are many components that critics generally consider when reviewing films. Critics tend to watch films with little to no prior information, to avoid preconceived judgment and start from an unbiased position. They will often re-watch a film to ensure they catch all of its details. When compiling a review, a critic will often focus on certain elements: Plot, Theme, Acting/Characterization, Direction, Music, Cinematography, Production Design, Special Effects, Editing, and Dialogue.$^4$

Unlike Meta Scores, IMDb scores are computed as unweighted averages of general audience reviews, with an average of about 7. The reviewers behind the IMDb scores do not all judge films against a specific list of elements. External factors, such as the opinions of others and the marketing of a film, can influence these reviews by creating a preconceived judgment before viewers form their own opinions. General audience members also tend to watch films with some level of distraction and generally do not watch in order to critique the same elements as critics. As a result, the reviews behind a film's IMDb score can be biased by the feelings of the individual reviewer and by external factors.

The main difference between IMDb scores and Meta Scores is that IMDb scores are computed from potentially biased reviews of the general audience, whereas Meta Scores are computed from less biased critic reviews that evaluate films based on a standardized set of elements. Unbiased opinions are typically the best representation of a film's quality; however, they might not represent how entertaining the film is for a general audience, because critics and audiences focus on different aspects.$^5$ Though each individual review behind the IMDb scores is subject to the bias of the reviewer, for this project it is assumed that the pool of reviews used to formulate the IMDb score is a good representation of the general audience.

The goal of this project is to answer the following question: "How accurate are the Meta Scores of films in predicting the reception of the general audience?"

The data set is from Kaggle and includes data from the top 1000 movies based on the IMDb score. The data set includes the following columns:

* Series_Title: the name of the film
* Released_Year: year it was released
* Certificate: certificate earned by the movie
* Runtime: total runtime in minutes
* Genre: list of genres the film falls into
* IMDB_Rating: the average IMDB rating given by IMDB user reviews out of 10
* Overview: plot summary
* Meta_score: meta score of the movie determined by movie critics out of 100
* Director: name of the director of the movie
* Star1, Star2, Star3, Star4: names of the stars of the movie in order of significance
* No_of_votes: number of reviews on IMDB
* Gross: how much money the movie earned

## Methods:
#### 1. Import Libraries
The tidyverse, dplyr, and tidymodels libraries were imported for data manipulation, visualization, and modeling.

In [None]:
library(tidyverse)
library(dplyr)
library(tidymodels)
options(repr.matrix.max.rows = 8)

#### 2. Read data from repository
The data was downloaded from Kaggle$^1$ then added to this project's repository. The following cell reads the data from the repository as a CSV file and filters for films released after 1970, as these films will represent modern conditions better.

In [None]:
url <- "https://github.com/anh-dong/dsci-100-2023w1-group-33/blob/main/data/imdb_top_1000.csv?raw=true"
movies <- read_csv(url) |>
    filter(Released_Year > 1970) |>
    select(IMDB_Rating, Meta_score)
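The year filter above compares `Released_Year` with a number, which behaves as intended only if the column is numeric. The following is a minimal sketch of a defensive version, assuming (this is not confirmed by the text above) that the column could be parsed as text, for example because of a stray non-year entry. The toy data frame and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: the last Released_Year entry is not a year.
toy <- data.frame(
  Series_Title  = c("Film A", "Film B", "Film C"),
  Released_Year = c("1965", "1994", "PG")
)

# Coerce to numeric before filtering. The non-year entry becomes NA
# (the coercion warning is suppressed here) and filter() drops it.
toy_modern <- toy |>
  mutate(Released_Year = suppressWarnings(as.numeric(Released_Year))) |>
  filter(Released_Year > 1970)

toy_modern
```

If the column is already numeric, the coercion changes nothing, so the extra step is harmless.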

#### 3. Create Training Data
The raw data is then split into a training and testing set of 75% and 25% of the raw data respectively.

In [None]:
set.seed(1234)

movies_split <- initial_split(movies, prop = 0.75, strata = Meta_score)
movies_training <- training(movies_split)
movies_testing <- testing(movies_split)
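As a quick sanity check on a split like the one above, the sketch below confirms that every row lands in exactly one of the two sets and that the training share is close to 75%. It runs on simulated data; `toy_movies` and its values are hypothetical stand-ins for the real table.

```r
library(rsample)

set.seed(1234)

# Hypothetical toy data standing in for the movies table.
toy_movies <- data.frame(
  Meta_score  = round(runif(200, min = 30, max = 100)),
  IMDB_Rating = round(runif(200, min = 7.6, max = 9.3), 1)
)

toy_split    <- initial_split(toy_movies, prop = 0.75, strata = Meta_score)
toy_training <- training(toy_split)
toy_testing  <- testing(toy_split)

# Every row lands in exactly one of the two sets.
nrow(toy_training) + nrow(toy_testing) == nrow(toy_movies)

# Share of rows in the training set: close to 0.75, though
# stratification can shift it slightly.
nrow(toy_training) / nrow(toy_movies)
```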

#### 4. Summary Table
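The training data was summarized by standardizing Meta_score and IMDB_Rating (centering on the mean and dividing by the standard deviation), rounding each standardized value to the nearest whole number, and counting how many films fall at each value.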

In [None]:
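# Standardize each score (z-score) and round to the nearest whole number
movies_summary <- movies_training |>
    mutate(Meta_score = round(as.numeric(scale(Meta_score))),
           IMDB_Rating = round(as.numeric(scale(IMDB_Rating))))

# Count films at each rounded standardized Meta_score
movies_table_Meta_scores <- movies_summary |>
    group_by(Meta_score) |>
    summarize(Meta_score_count = n()) |>
    rename(normalized_value = Meta_score)

# Count films at each rounded standardized IMDB_Rating
movies_table_IMDB_Rating <- movies_summary |>
    group_by(IMDB_Rating) |>
    summarize(IMDB_Rating_count = n()) |>
    rename(normalized_value = IMDB_Rating)

# Join the two count tables on the shared normalized value;
# values with no matching IMDB_Rating count are set to 0
joined <- left_join(movies_table_Meta_scores, movies_table_IMDB_Rating, by = "normalized_value") |>
    mutate(IMDB_Rating_count = replace(IMDB_Rating_count, is.na(IMDB_Rating_count), 0))
joined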

##### Table 1:
Counts of normalized values for Meta_score and IMDB_Rating for the training data set.
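For reference, the normalization used in this step is the standard z-score that R's `scale()` computes by default. The symbols $x_i$, $\bar{x}$, $s$, and $n$ are notation introduced for this note: $x_i$ is one film's score, and $\bar{x}$ and $s$ are the mean and sample standard deviation of the column over its $n$ non-missing values.

$$z_i = \frac{x_i - \bar{x}}{s}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

After rounding, a normalized value of 1 therefore groups films whose score lies roughly 0.5 to 1.5 standard deviations above the column mean.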

#### 5. Summary Visuals
NA values were first removed, then the distribution of each normalized score was visualized with a histogram.

In [None]:
movies_summary <- na.omit(movies_summary)

Meta_score_hist <- movies_summary |>
    ggplot(aes(Meta_score)) +
        geom_histogram(binwidth = 1) +
        xlab("Normalized Meta Score") +
        ylab("Count") +
        ggtitle("Distribution of Normalized Meta Scores")
Meta_score_hist

##### Figure 1:
Histogram of normalized Meta Scores using geom_histogram with a binwidth of 1.

In [None]:


IMDB_Review_hist <- movies_summary |>
    ggplot(aes(IMDB_Rating)) +
        geom_histogram(binwidth = 1) +
        xlab("Normalized IMDB Rating") +
        ylab("Count") +
        ggtitle("Distribution of Normalized IMDB Ratings")
        
IMDB_Review_hist

##### Figure 2: 
Histogram of normalized IMDB Ratings with a binwidth of 1.

#### 6. Linear Regression
To further clean the data, all rows with NA values in either the IMDB_Rating or the Meta_score column were removed. To determine the relationship between audience perception and the Meta Score of a film, a linear regression was then performed using Meta_score as the predictor and IMDB_Rating as the response variable. The training data was fit to this workflow to estimate the y-intercept and slope of the regression line.

In [None]:
movies_training <- na.omit(movies_training)

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# Response on the left (IMDB_Rating), predictor on the right (Meta_score)
lm_recipe <- recipe(IMDB_Rating ~ Meta_score, data = movies_training)

lm_fit <- workflow() |>
  add_recipe(lm_recipe) |>
  add_model(lm_spec) |>
  fit(data = movies_training)
lm_fit
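The `"lm"` engine hands the fitting to base R's `lm()`, so the estimated intercept and slope can be read as an ordinary least squares line. The sketch below shows how such coefficients turn a Meta Score into a predicted IMDB rating. It uses simulated data; `toy_training`, the simulated coefficients, and the Meta Score of 80 are all hypothetical and do not reflect the real results.

```r
set.seed(1234)

# Hypothetical toy data; the real analysis uses movies_training.
toy_training <- data.frame(Meta_score = runif(100, min = 30, max = 100))
toy_training$IMDB_Rating <- 7 + 0.01 * toy_training$Meta_score +
  rnorm(100, sd = 0.3)

# Ordinary least squares fit of IMDB_Rating on Meta_score.
toy_lm <- lm(IMDB_Rating ~ Meta_score, data = toy_training)

intercept <- coef(toy_lm)[["(Intercept)"]]
slope     <- coef(toy_lm)[["Meta_score"]]

# Predicted rating for a hypothetical film with a Meta Score of 80.
predicted_80 <- intercept + slope * 80
predicted_80
```

The slope is the expected change in IMDB rating per one-point change in Meta Score, which is why it is small in absolute terms when one scale runs to 100 and the other to 10.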

#### 7. Visualization of Training Data and Line of Best Fit
The linear regression line was then plotted against the training data to visualize its goodness of fit.

In [None]:
movies_preds <- lm_fit |>
   predict(movies_training) |>
   bind_cols(movies_training)

lm_predictions <- movies_preds |>
    ggplot(aes(x = Meta_score, y = IMDB_Rating)) +
        geom_point(alpha = 0.4) +
        geom_line(
            mapping = aes(x = Meta_score, y = .pred), 
            color = "blue") +
        ggtitle("Linear Regression of IMDB Rating vs. Meta Score") +
        xlab("Meta Score")+
        ylab("IMDB Rating")
lm_predictions

##### Figure 3: 
Scatter plot of IMDB Rating vs Meta Score for the training data and line of best fit. The IMDB Rating is a value out of 10 representing average user reviews, while the Meta Score is a value out of 100 representing a weighted average of critic scores. The line of best fit in blue was calculated using linear regression fit against the data.

#### 8. Root Mean Squared Error (RMSE)
RMSE was calculated by binding our predicted values to our training data set and calculating metrics with IMDB_Rating as the truth value.

In [None]:
lm_training_results <- lm_fit |>
         predict(movies_training) |>
         bind_cols(movies_training) |>
         metrics(truth = IMDB_Rating, estimate = .pred)

lm_rmse <- lm_training_results |>
          filter(.metric == "rmse") |>
          select(.estimate) |>
          pull()

lm_rmse
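For reference, the `rmse` row reported by `metrics()` corresponds to the usual root mean squared error. In the notation introduced here, $y_i$ is the observed IMDB_Rating of film $i$, $\hat{y}_i$ is its predicted value (`.pred`), and $n$ is the number of rows being evaluated.

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

The result is in the same units as the response, so here it can be read directly as IMDb rating points.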

#### 9. Testing
The model was then used to predict IMDB_Rating for the testing data set, and the observed testing values were plotted against the line of best fit calculated above.

In [None]:
movies_testing <- na.omit(movies_testing)

test_preds <-  lm_fit |>
   predict(movies_testing) |>
   bind_cols(movies_testing)

lm_predictions_test <- test_preds |>
    ggplot(aes(x = Meta_score, y = IMDB_Rating)) +
        geom_point(alpha = 0.4) +
        geom_line(
            mapping = aes(x = Meta_score, y = .pred),
            color = "blue") +
        xlab("Meta Score") +
        ylab("IMDB Rating") +
        ggtitle("IMDB Rating vs Meta Score for the Test Set") +
        theme(text = element_text(size = 15))

lm_predictions_test

##### Figure 4: 
Scatter plot of IMDB Rating vs Meta Score for the testing data and line of best fit. The IMDB Rating is a value out of 10 representing average user reviews, while the Meta Score is a value out of 100 representing a weighted average of critic scores. The line of best fit in blue was calculated using linear regression fit against the training data.

#### 10. Root Mean Squared Prediction Error (RMSPE)
The RMSPE was calculated by using our model to predict IMDB_Rating for the testing data set. This metric estimates the typical distance, in IMDb rating points, between the observed ratings in the testing data and the values predicted by our line of best fit.

In [None]:
lm_test_results <- lm_fit |>
         predict(movies_testing) |>
         bind_cols(movies_testing) |>
         metrics(truth = IMDB_Rating, estimate = .pred)

lm_rmspe <- lm_test_results |>
          filter(.metric == "rmse") |>
          select(.estimate) |>
          pull()

lm_rmspe
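The difference between RMSE and RMSPE lies only in which rows are evaluated: the same formula is applied to the training rows for the former and to the held-out testing rows for the latter. The sketch below computes both by hand on simulated data. The toy data, the helper `rmse()`, and the simple random split are hypothetical stand-ins and do not reproduce the results of this analysis.

```r
set.seed(1234)

# Hypothetical toy data; the real analysis uses the movies data.
toy <- data.frame(Meta_score = runif(200, min = 30, max = 100))
toy$IMDB_Rating <- 7 + 0.01 * toy$Meta_score + rnorm(200, sd = 0.3)

# Simple random 75/25 split (unstratified, unlike initial_split above).
train_rows   <- sample(nrow(toy), size = 150)
toy_training <- toy[train_rows, ]
toy_testing  <- toy[-train_rows, ]

# Fit on the training rows only.
toy_lm <- lm(IMDB_Rating ~ Meta_score, data = toy_training)

# Root mean squared difference between observed and predicted values.
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

toy_rmse  <- rmse(toy_training$IMDB_Rating, predict(toy_lm, toy_training))
toy_rmspe <- rmse(toy_testing$IMDB_Rating, predict(toy_lm, toy_testing))

c(RMSE = toy_rmse, RMSPE = toy_rmspe)
```

An RMSPE much larger than the RMSE would be a sign that the model fits the training rows better than it predicts new ones.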

## Expected Outcomes and Significance:
It was expected that Meta Scores and IMDb ratings would have a weak positive relationship. Critics try to give audiences an accurate expectation of a film through their ratings, so the two rating systems should correlate positively with each other. However, critics and general audiences often look for different things in a film, which produces more variability between the two ratings and potentially a weaker relationship.$^5$ Understanding this relationship allows audiences to judge how a Meta Score should factor into their decision to see a movie before any general audience ratings are available.

Our linear regression produced a typical prediction error of about 0.3 rating points on both the training data and the testing data (estimated by the RMSE and RMSPE, respectively). Both values are low given that IMDb ratings are out of 10 and average around 7, which corresponds to an error of roughly 4% on average. Because the errors are similar for the training and testing data, the model does not appear to be overfit. It could plausibly generalize to a broader set of movies, such as those not in the top 1000 on IMDb, although this would need to be checked on such films because every film used here comes from the top 1000.
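The percentage quoted above follows from dividing the error by the average rating. Using the rounded figures from the text (an error of about 0.3 and an average rating of about 7):

$$\frac{\text{RMSPE}}{\text{average IMDb rating}} \approx \frac{0.3}{7} \approx 0.043 \approx 4\%$$

This is only a rough relative error, since both inputs are rounded.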

IMDb is useful for seeing what general audiences think of a movie. If you don't care what the critics say and want to see what people like yourself thought of a film, then you should use IMDb. We must, however, be aware that fans can skew the vote with 10-star or 1-star ratings, which can make the score vary significantly. The next time you are trying to decide whether a film is worth watching, our linear model can be used to estimate potential audience perception from a Meta Score, but there are other qualitative factors that should also influence your decision to see a movie.

## Resources:

1. Shankhdhar, H. (2021, February 1). IMDB movies dataset. Kaggle. https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows 

2. Türker, B. (2021, August 17). EDA project - what are the factors affecting IMDB ratings? Medium. https://medium.com/i%CC%87stanbuldatascienceacademy/eda-project-what-are-the-factors-affecting-imdb-ratings-e91f41396c89

3. How do you compute metascores? – metacritic support. Metacritic Support. (2023, October). https://metacritichelp.zendesk.com/hc/en-us/articles/14478499933079-How-do-you-compute-METASCORES- 

4. Gruber, M. (2022, April 20). How critics produce a film analysis. Ready Steady Cut. https://readysteadycut.com/2022/04/20/how-critics-produce-a-film-analysis/ 

5. Collazo, M. (2014, April 30). How movie critics and moviegoers view films differently. The Artifice. https://the-artifice.com/movie-critics-and-moviegoers-view-films-differently/ 