# Exploring the Relationship Between the Meta Scores of Films and General Audience Reviews for Films Released After 1970


## Introduction:

IMDB is an online database of information related to films, television series, podcasts, and other forms of entertainment content. For our project, we will work specifically with movie (film) data from IMDB. The database covers several aspects of each title, including cast, production crew, plot summary, and scores. IMDB presents two kinds of scores: IMDB scores, which are user-generated (an average of ratings submitted on the IMDB website by non-critic viewers), and Meta Scores, which are critic-generated (an average of reviews from critics, professionals who analyze films).

For our project we will try to answer the following question:
"How accurate are the Meta Scores of films in representing the general interest of the viewers for films after 1970?"

The dataset is from Kaggle and contains the top 1000 movies in the IMDB online database, ranked by IMDB score. It includes the movie name (Series_Title), year of release (Released_Year), certificate earned by the movie (Certificate), total runtime (Runtime), genre (Genre), the IMDB score of the movie (IMDB_Rating), plot summary (Overview), the Meta score of the movie (Meta_score), the director (Director), stars of the movie (Star1, Star2, Star3, Star4), number of votes on IMDB (No_of_Votes), and the amount of money earned by the movie (Gross). In summary, this project will determine whether Meta scores are an accurate predictor of audience reception of a movie, using Meta_score for the critic scores and IMDB_Rating for audience reception.


Dataset Origin: https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

In [None]:
### Initialize.
library(tidyverse)  # attaches dplyr, stringr, readr, tidyr, and ggplot2
library(repr)
library(rvest)
options(repr.matrix.max.rows = 8)  # limit how many rows are shown when a table is printed

In [None]:
### Reading from GitHub
url <- "https://github.com/anh-dong/dsci-100-2023w1-group-33/blob/main/data/imdb_top_1000.csv?raw=true"
movies_raw <- read_csv(url)

In [None]:
### Table Summary
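movies <- movies_raw |>
    select(Meta_score, IMDB_Rating, Released_Year) |>
    # coerce the year to a number in case it was read as text; unparseable entries become NA
    mutate(Released_Year = suppressWarnings(as.numeric(Released_Year))) |>
    filter(Released_Year > 1970) |>
    # scale() returns a one-column matrix, so flatten it before rounding to whole standard deviations
    mutate(Meta_score = round(as.numeric(scale(Meta_score)), 0),
           IMDB_Rating = round(as.numeric(scale(IMDB_Rating)), 0))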

movies_table_Meta_scores <- movies |>
    group_by(Meta_score) |>
    summarize(Meta_score_count = n()) |>
    rename(normalized_value = Meta_score)


movies_table_IMDB_Rating <- movies |>
    group_by(IMDB_Rating) |>
    summarize(IMDB_Rating_count = n()) |>
    rename(normalized_value = IMDB_Rating)

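# full_join keeps values that appear in only one of the two tables; missing counts become 0
joined <- full_join(movies_table_Meta_scores, movies_table_IMDB_Rating, by = "normalized_value") |>
    mutate(Meta_score_count = replace(Meta_score_count, is.na(Meta_score_count), 0),
           IMDB_Rating_count = replace(IMDB_Rating_count, is.na(IMDB_Rating_count), 0))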
joined

In [None]:
### Summary Visual
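movies <- movies_raw |>
    select(Meta_score, IMDB_Rating, Released_Year) |>
    # coerce the year to a number in case it was read as text; unparseable entries become NA
    mutate(Released_Year = suppressWarnings(as.numeric(Released_Year))) |>
    filter(Released_Year > 1970) |>
    # scale() returns a one-column matrix, so flatten it to a plain numeric vector for plotting
    mutate(Meta_score = as.numeric(scale(Meta_score)), IMDB_Rating = as.numeric(scale(IMDB_Rating)))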

Meta_score_hist <- movies |>
    ggplot(aes(Meta_score)) +
        geom_histogram(binwidth = 0.5) +
        xlab("Normalized Meta Score") +
        ylab("Count") +
        ggtitle("Distribution of Meta Scores")
Meta_score_hist

IMDB_Review_hist <- movies |>
    ggplot(aes(IMDB_Rating)) +
        geom_histogram(binwidth = 0.5) +
        xlab("Normalized IMDB Rating") +
        ylab("Count") +
        ggtitle("Distribution of IMDB Ratings")
IMDB_Review_hist

## Methods:

In the first step, we used the select and filter functions to keep only the data we needed. We kept the Meta_score, IMDB_Rating, and Released_Year columns, dropping the rest (such as Poster_Link and Overview), and filtered the movies to those released after 1970.

In the second step, we extracted the Meta scores and IMDB ratings using select, normalized them so that the two scales are comparable, and plotted each as a histogram to visualize the distribution of the ratings.
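The normalization in the cells above relies on base R's scale(), which by default centres a vector on its mean and divides by its sample standard deviation. As a minimal sketch on made-up values (the vector `toy_scores` is illustrative and not taken from the dataset), the same z-scores can be reproduced by hand:

```r
# Illustrative values only; not taken from the IMDB dataset.
toy_scores <- c(60, 70, 80, 90, 100)

# scale() returns a one-column matrix, so as.numeric() flattens it to a plain vector.
z_scale <- as.numeric(scale(toy_scores))

# The same z-scores computed by hand: (x - mean) / sample standard deviation.
z_manual <- (toy_scores - mean(toy_scores)) / sd(toy_scores)

# Check that both approaches agree up to floating-point tolerance.
all.equal(z_scale, z_manual)
```

After this transformation a value of 1 means "one standard deviation above the mean" for either score, which is what allows the 0-100 Meta Score and the 0-10 IMDB rating to be compared on one axis.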

For further observation, we extracted the columns for Gross, Runtime, and IMDB scores and plotted each against the year of release as three scatter plots. We also plotted Meta Score against each of these three variables.

These graphs help us see the relationship between each pair of variables directly and thus form hypotheses for further research.
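The scatter plots described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the tibble `toy_movies` holds made-up rows that reuse the dataset's column names, and it assumes Runtime and Gross are stored as text (for example with a "min" suffix or thousands separators), which is why parse_number() is applied first. With the real data, `toy_movies` would be replaced by the filtered movie table.

```r
library(tidyverse)

# Hypothetical toy rows that mimic the dataset's column names; all values are made up.
toy_movies <- tibble(
    Released_Year = c(1975, 1985, 1995, 2005, 2015, 2019),
    Runtime = c("120 min", "100 min", "140 min", "180 min", "150 min", "130 min"),
    Gross = c("1,000,000", "20,000,000", "5,000,000", "300,000,000", "150,000,000", "40,000,000"),
    IMDB_Rating = c(8.0, 7.8, 8.5, 8.2, 7.9, 8.1),
    Meta_score = c(85, 70, 80, 90, 75, 95)
)

# Assumption: Runtime and Gross arrive as text, so convert them to numbers,
# then stack the three variables into one column so they can share a faceted plot.
toy_long <- toy_movies |>
    mutate(Runtime = parse_number(Runtime), Gross = parse_number(Gross)) |>
    pivot_longer(c(Gross, Runtime, IMDB_Rating), names_to = "variable", values_to = "value")

# Each variable against the year of release, one panel per variable.
year_scatter <- toy_long |>
    ggplot(aes(x = Released_Year, y = value)) +
        geom_point(alpha = 0.6) +
        facet_wrap(vars(variable), scales = "free_y") +
        xlab("Year of Release") +
        ylab("Value") +
        ggtitle("Gross, IMDB Rating, and Runtime by Year")
year_scatter

# The same three variables against the Meta score.
meta_scatter <- toy_long |>
    ggplot(aes(x = Meta_score, y = value)) +
        geom_point(alpha = 0.6) +
        facet_wrap(vars(variable), scales = "free_y") +
        xlab("Meta Score") +
        ylab("Value") +
        ggtitle("Gross, IMDB Rating, and Runtime by Meta Score")
meta_scatter
```

Faceting with free y-axes keeps the three variables in one figure even though their units differ widely; separate plots per variable would work equally well.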

## Expected Outcomes and Significance:

Our prediction is that the Meta Scores and IMDB ratings will be weakly positively correlated. We expect a positive relationship because critics generally try to give audiences an accurate expectation of a film through their ratings. However, the critics producing the Meta Scores would likely focus on finer details, such as creative decisions, when determining a rating, while general audiences' reviews would lean more towards the overall content and how entertaining the film was, which suggests that the relationship between the two variables will not be strong.
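One way this prediction could be quantified is with a Pearson correlation coefficient between the two scores. The sketch below is an assumption about a possible analysis, not the method the project commits to, and the data frame `toy` contains made-up values rather than rows from the dataset.

```r
# Hypothetical paired scores for illustration; not taken from the dataset.
toy <- data.frame(
    Meta_score = c(65, 72, 80, 85, 90, NA, 77),
    IMDB_Rating = c(7.9, 7.7, 8.1, 7.8, 8.4, 8.0, 8.2)
)

# Pearson correlation; use = "complete.obs" drops rows where either score is missing.
r <- cor(toy$Meta_score, toy$IMDB_Rating, use = "complete.obs")
r

# cor.test() additionally reports a confidence interval and a p-value for the correlation.
result <- cor.test(toy$Meta_score, toy$IMDB_Rating)
result
```

A coefficient close to 0 would indicate little linear relationship, while values approaching 1 would indicate that critics and audiences rank films similarly; the prediction above corresponds to a positive value well below 1.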