# Horror Movies Key EDAs

## Introduction

We want to explore the relationship between ratings and revenue for horror movies, based on the dataset we collected. Specifically, we want to answer the following question: do horror movies with a higher rating also have higher revenue than those with a lower rating? In this context, we consider a movie highly rated if its rating is above the median rating of all movies in the dataset, and poorly rated if its rating is below the median.
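As a rough illustration of the median split described above, the following Python sketch labels movies as `high` or `low`. The movie titles and ratings are made up, and `rating_group` is a hypothetical helper; how ties at the median are handled is an assumption, since the definition above only covers ratings strictly above or below the median.

```python
from statistics import median

# Hypothetical (title, vote_average) pairs; not taken from the dataset.
movies = [("A", 4.1), ("B", 5.0), ("C", 6.3), ("D", 7.2), ("E", 5.9)]

# Median rating across all movies in this toy sample.
cutoff = median(rating for _, rating in movies)


def rating_group(rating, cutoff):
    """Return 'high' above the median, 'low' below it, None on a tie."""
    if rating > cutoff:
        return "high"
    if rating < cutoff:
        return "low"
    return None  # assumption: a movie exactly at the median joins neither group


groups = {title: rating_group(rating, cutoff) for title, rating in movies}
print(cutoff, groups)
```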

## Dataset Introduction

The original author created this dataset to explore horror movies dating back to the 1950s. The data was extracted from The Movie Database via the TMDB API using the R package `httr`. The dataset contains about 35K movie records and 18 columns; our study focuses on the `vote_average` and `revenue` columns.

When pre-processing the data, we decided to drop columns including: 

- original_title
- overview
- tagline
- poster_path
- status
- backdrop_path
- collection
- collection_name

In addition, we dropped observations with a very low `vote_count`. Several movies in the dataset have only 1 or 2 votes in total and a `vote_average` of 10.0. We therefore filtered out observations with `vote_count` < 10 to keep the outcome of the analysis meaningful.
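The vote-count filter can be sketched as follows. This is a minimal Python illustration on made-up records, not the project's actual pre-processing script; `MIN_VOTES` is a hypothetical name for the threshold of 10 stated above.

```python
MIN_VOTES = 10  # threshold from the text: drop rows with vote_count < 10

# Made-up records for illustration only.
records = [
    {"title": "A", "vote_count": 1, "vote_average": 10.0},
    {"title": "B", "vote_count": 250, "vote_average": 6.4},
    {"title": "C", "vote_count": 10, "vote_average": 5.1},
]

# Keep only movies with enough votes for the average to be informative.
kept = [row for row in records if row["vote_count"] >= MIN_VOTES]
print([row["title"] for row in kept])
```

Movie "A" illustrates the problem case: a single vote produces a perfect 10.0 average that says little about the movie.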

## EDA Analysis Result

We first look at the distributions of `vote_average` and `revenue`.

<img src="https://raw.githubusercontent.com/UBC-MDS/horror_movies/main/image/vote_avg_density.png"  width="300" height="100"> <img src="https://raw.githubusercontent.com/UBC-MDS/horror_movies/main/image/revenue_density.png"  width="300" height="100">



Next, we explore the pairwise correlation between several important variables, including:

- budget
- runtime
- revenue
- vote_average

<img src="https://raw.githubusercontent.com/UBC-MDS/horror_movies/main/image/attribute_pairs.png"  width="600" height="100">
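For readers who want numbers alongside the pairwise plot, one option is to compute a correlation coefficient for every pair of variables. The sketch below uses Pearson correlation on made-up values; the choice of Pearson is an assumption (the report does not say which correlation measure, if any, was computed), and `pearson` is a hypothetical helper.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up values for the four variables; not taken from the dataset.
data = {
    "budget": [1.0, 2.0, 3.0, 4.0],
    "runtime": [90.0, 95.0, 100.0, 105.0],
    "revenue": [2.0, 1.0, 4.0, 3.0],
    "vote_average": [5.0, 6.0, 5.5, 7.0],
}

# One coefficient per unordered pair of variables.
pairs = {(a, b): pearson(data[a], data[b]) for a, b in combinations(data, 2)}
for (a, b), r in pairs.items():
    print(f"{a} vs {b}: {r:.3f}")
```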

To target our research question, we also plot the relationship between `vote_average` and `revenue`:

<img src="https://raw.githubusercontent.com/UBC-MDS/horror_movies/main/image/horror_scatter_vote.png" width="300" height="100">

## Hypothesis Testing Result

We set up a one-tailed hypothesis test to answer the following question:

"Do horror movies with a higher rating also have higher revenue than those with a lower rating?"

Mathematically, we set up the test as follows:

- `test statistic`: $Median_{movies\_high} - Median_{movies\_low}$ 
- `Null Hypothesis`: $H_0$: $Median_{movies\_high} - Median_{movies\_low}$ = 0
- `Alternative Hypothesis`: $H_1$: $Median_{movies\_high} - Median_{movies\_low}$ > 0

For this test, we use $\alpha = 0.05$ as our significance level.
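One common way to carry out such a test is a permutation test: shuffle the group labels many times, recompute the difference in medians each time, and report the share of shuffles that is at least as large as the observed difference. The Python sketch below assumes that approach, which this report does not state explicitly; the revenue values, the number of repetitions, and the names `median_diff` and `permutation_p_value` are all hypothetical.

```python
import random
from statistics import median


def median_diff(high, low):
    """Test statistic: median of the high group minus median of the low group."""
    return median(high) - median(low)


def permutation_p_value(high, low, reps=2000, seed=0):
    """One-tailed permutation p-value for H1: median(high) - median(low) > 0."""
    rng = random.Random(seed)
    observed = median_diff(high, low)
    pooled = list(high) + list(low)
    n_high = len(high)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # reassign group labels at random
        if median_diff(pooled[:n_high], pooled[n_high:]) >= observed:
            hits += 1
    return observed, hits / reps


# Made-up revenues (in millions) for illustration only.
high = [12.0, 15.0, 9.0, 20.0, 14.0, 18.0]
low = [3.0, 5.0, 2.0, 7.0, 4.0, 6.0]

observed, p_value = permutation_p_value(high, low)
reject_null = p_value < 0.05  # compare against alpha = 0.05
print(observed, p_value, reject_null)
```

Using the median rather than the mean as the test statistic makes the comparison less sensitive to a few extreme revenues.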

The null distribution and the test results are shown below:

<img src="https://raw.githubusercontent.com/UBC-MDS/horror_movies/main/results/revenue_null_distribution.png" width="400" height="100">


|observed_test_stat|p_value|null_dist_q95|null_dist_mean|null_dist_se|
|------------------|-------|-------------|--------------|------------|
|9007166.5         |1.5e-4 |2429231.60   |8042.14       |1487583.72  |


From the result above, the `p_value` (1.5e-4) is much smaller than the significance level of 0.05, so we have sufficient evidence to reject the null hypothesis. The observed test statistic shows that, in our sample, the median revenue of higher rated horror movies is about 9.01 million dollars higher than the median revenue of lower rated horror movies, which is well above the 95th percentile of the null distribution (about 2.43 million).


Because the null distribution has a large standard error, reflecting the wide variation in revenue, we also look at the distribution of revenue within the two groups.


<img src="https://raw.githubusercontent.com/UBC-MDS/horror_movies/main/results/revenue_violin_by_rating.png" width="400" height="100">

|rating_group|sample_median|sample_sd  |sample_size|
|------------|-------------|-----------|-----------|
|high        |12717468     |79135511.98|634        |
|low         |3710301.5    |42174996.85|692        |

The sample sizes of the two groups are roughly balanced (634 vs. 692), and both groups show large within-group variation. The sample median revenue of the highly rated group is much larger than that of the poorly rated group.
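As a quick consistency check, the two sample medians in this table should reproduce the observed test statistic reported earlier. The short Python sketch below uses only the reported figures; the variable names are illustrative.

```python
# Values copied from the tables in this report.
median_high = 12717468
median_low = 3710301.5
null_q95 = 2429231.6  # 95th percentile of the null distribution

# Difference in sample medians, i.e. the observed test statistic.
observed = median_high - median_low

# The observed difference lies beyond the 95th percentile of the null.
exceeds_q95 = observed > null_q95
print(observed, exceeds_q95)
```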

## Critique

The data analysis behind the hypothesis test still needs more work. In addition, we intentionally excluded other factors that might affect `revenue` (for example, those included in the pairwise attribute plots). We may want to analyze their correlation with `vote_average` and `revenue` to reach a more meaningful conclusion.

## Citation

- [Data Source for Horror Movies](https://github.com/rfordatascience/tidytuesday)
- [tidytuesday Introduction](https://github.com/rfordatascience/tidytuesday/tree/master/data/2022/2022-11-01)
- [Sample Project (Breast Cancer Predictor)](https://github.com/ttimbers/breast_cancer_predictor)
- [EDA Analysis Reference from DSCI 531](https://pages.github.ubc.ca/mds-2022-23/DSCI_531_viz-1_students/lectures/4-eda.html)
- [ggplot Doc](https://ggplot2.tidyverse.org/)