# Investigating Fandango Movie Ratings

Back in October 2015, a data journalist from [FiveThirtyEight](https://fivethirtyeight.com) named Walt Hickey published [this analysis](https://fivethirtyeight.com/features/fandango-movies-ratings/) suggesting that movie ratings on [Fandango](https://www.fandango.com) were inflated. He found that the actual rating stored in the page's HTML was almost always rounded up to the nearest half-star before being displayed to the user, and was sometimes rounded up by a full star.

After Hickey's analysis was published, Fandango claimed that the biased rounding was caused by a bug in their system, but we can no longer verify this directly, since the actual rating value no longer appears in the page's HTML.
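
To make the rounding behavior Hickey described concrete, here is a minimal sketch comparing "always round up to the next half-star" with ordinary rounding to the nearest half-star. The function names and sample ratings are hypothetical illustrations, not Fandango's actual code or data.

```python
import math

def round_up_to_half(rating):
    """Round up to the next half-star (the biased behavior described above)."""
    return math.ceil(rating * 2) / 2

def round_to_nearest_half(rating):
    """Round to the nearest half-star, with ties rounded up."""
    return math.floor(rating * 2 + 0.5) / 2

# Hypothetical underlying ratings, chosen only to show the difference
for rating in (4.1, 4.5, 4.6):
    print(rating, round_up_to_half(rating), round_to_nearest_half(rating))
```

For a rating of 4.1, rounding up displays 4.5 stars, while rounding to the nearest half-star would display 4.0.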

In this project, we'll analyze more recent movie ratings to determine whether Fandango has changed its rating system. We can do this by analyzing Walt Hickey's original data set on GitHub [here](https://github.com/fivethirtyeight/data/tree/master/fandango), and comparing it to ratings data for movies released in 2016 and 2017 [here](https://github.com/mircealex/Movie_ratings_2016_17).

### Taking a First Look at the Data


We'll begin by reading in the data sets and looking at their structure.

In [7]:
import pandas as pd

pd.options.display.max_columns = 100 # To be able to view all columns within this notebook

before = pd.read_csv('fandango_score_comparison.csv')
after = pd.read_csv('movie_ratings_16_17.csv')

before.head(3)

Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,Metacritic_norm,Metacritic_user_nom,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,3.3,3.55,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
1,Cinderella (2015),85,80,67,7.5,7.1,5.0,4.5,4.25,4.0,3.35,3.75,3.55,4.5,4.0,3.5,4.0,3.5,249,65709,12640,0.5
2,Ant-Man (2015),80,90,64,8.1,7.8,5.0,4.5,4.0,4.5,3.2,4.05,3.9,4.0,4.5,3.0,4.0,4.0,627,103660,12055,0.5


In [9]:
after.head(3)

Unnamed: 0,movie,year,metascore,imdb,tmeter,audience,fandango,n_metascore,n_imdb,n_tmeter,n_audience,nr_metascore,nr_imdb,nr_tmeter,nr_audience
0,10 Cloverfield Lane,2016,76,7.2,90,79,3.5,3.8,3.6,4.5,3.95,4.0,3.5,4.5,4.0
1,13 Hours,2016,48,7.3,50,83,4.5,2.4,3.65,2.5,4.15,2.5,3.5,2.5,4.0
2,A Cure for Wellness,2016,47,6.6,40,47,3.0,2.35,3.3,2.0,2.35,2.5,3.5,2.0,2.5


Next, we'll isolate the columns relevant to our analysis so the data is easier to work with. We'll also make explicit copies with `.copy()` so we can avoid any `SettingWithCopyWarning` issues later on.

In [16]:
fandango_before = before[['FILM', 'Fandango_Stars', 'Fandango_Ratingvalue', 'Fandango_votes', 'Fandango_Difference']].copy()
fandango_after = after[['movie', 'year', 'fandango']].copy()

fandango_before.head(3)

Unnamed: 0,FILM,Fandango_Stars,Fandango_Ratingvalue,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),5.0,4.5,14846,0.5
1,Cinderella (2015),5.0,4.5,12640,0.5
2,Ant-Man (2015),5.0,4.5,12055,0.5


In [14]:
fandango_after.head(3)

Unnamed: 0,movie,year,fandango
0,10 Cloverfield Lane,2016,3.5
1,13 Hours,2016,4.5
2,A Cure for Wellness,2016,3.0
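
The `.copy()` calls above matter because we add a column to one of these subsets later. The following is a small sketch on a made-up toy frame (the titles and values are hypothetical) showing that a copied subset can be modified without touching the original. Whether pandas would emit `SettingWithCopyWarning` without `.copy()` depends on the pandas version, so treat this as an illustration of the safe pattern rather than a reproduction of the warning.

```python
import pandas as pd

# Toy frame standing in for `before`; the values are made up for illustration
toy = pd.DataFrame({
    'FILM': ['Movie A (2015)', 'Movie B (2014)'],
    'Fandango_Stars': [4.5, 4.0],
    'Other': [1, 2],
})

# .copy() makes the subset an independent DataFrame, so adding a column
# later changes only the subset and never the original frame
subset = toy[['FILM', 'Fandango_Stars']].copy()
subset['Year'] = subset['FILM'].str[-5:-1]

print(subset)
print(toy.columns.tolist())  # the original frame has no 'Year' column
```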


### Changing the Goal of Our Analysis

After reviewing the data sets on GitHub and reading Hickey's analysis, we find that the sampling processes for the data are not random, so the samples we have access to are unlikely to be representative of the population we're interested in. We'll have to come up with a workaround that won't be perfect, but should be good enough for our purposes.

Instead of collecting new data, we'll alter the goal of our analysis a bit. Our new goal will be to find out whether there are any differences between Fandango's ratings for popular movies released in 2015 and Fandango's ratings for popular movies released in 2016. We'll use Hickey's benchmark and consider a movie "popular" if it has 30 or more fan ratings on Fandango's website. However, sometime after 2018 Fandango completely changed the way it rates movies on its site, so we'll base this analysis on information gathered before that change took place.

### Isolating the Samples We Need

One of the sampling criteria in our 2016 data is `movie popularity`, but the data set doesn't include the number of fan ratings on Fandango. We'll check how representative this data is by randomly sampling 10 movies from the data set and then manually checking the number of fan ratings for those movies on Fandango's website, to see whether at least 80% of them have 30 fan ratings or more.

In [17]:
fandango_after.sample(10, random_state = 1)

Unnamed: 0,movie,year,fandango
108,Mechanic: Resurrection,2016,4.0
206,Warcraft,2016,4.0
106,Max Steel,2016,3.5
107,Me Before You,2016,4.5
51,Fantastic Beasts and Where to Find Them,2016,4.5
33,Cell,2016,3.0
59,Genius,2016,3.5
152,Sully,2016,4.5
4,A Hologram for the King,2016,3.0
31,Captain America: Civil War,2016,4.5


As of April 2018, these were the numbers of fan ratings for each movie:

| Movie | Fan Ratings |
| :--- | :---: |
| Mechanic: Resurrection | 2247 |
| Warcraft | 7271 |
| Max Steel | 493 |
| Me Before You | 5263 |
| Fantastic Beasts and Where to Find Them | 13400 |
| Cell | 17 |
| Genius | 127 |
| Sully | 11877 |
| A Hologram for the King | 500 |
| Captain America: Civil War | 35057 |

We can quickly see that 9 of the 10 movies in our sample (90%) have 30 or more fan ratings and would be considered popular, which clears our 80% threshold. This gives us enough confidence to move forward with this data.
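
The same check can be written out in code. This is a small sketch that hard-codes the manually collected counts from the table above; the variable names are our own and are not part of either data set.

```python
# Fan-rating counts collected manually (April 2018) for the sampled movies
fan_ratings = {
    'Mechanic: Resurrection': 2247,
    'Warcraft': 7271,
    'Max Steel': 493,
    'Me Before You': 5263,
    'Fantastic Beasts and Where to Find Them': 13400,
    'Cell': 17,
    'Genius': 127,
    'Sully': 11877,
    'A Hologram for the King': 500,
    'Captain America: Civil War': 35057,
}

POPULAR_THRESHOLD = 30  # Hickey's benchmark for a "popular" movie

# Movies that meet the benchmark, and their share of the sample
popular = [movie for movie, count in fan_ratings.items()
           if count >= POPULAR_THRESHOLD]
share_popular = len(popular) / len(fan_ratings)

print(f'{len(popular)} of {len(fan_ratings)} sampled movies are popular '
      f'({share_popular:.0%})')
```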

We'll double-check the other data set to make sure that it contains only movies with at least 30 fan ratings, as stated in the data's documentation.

In [18]:
sum(fandango_before['Fandango_votes'] < 30)

0

Now we'll want to isolate only the movies released in 2015 and 2016. For the first data set, we'll have to extract the release year from the `FILM` column.

In [20]:
fandango_before['Year'] = fandango_before['FILM'].str[-5:-1] # Extracting the release year

fandango_before.head(3)

Unnamed: 0,FILM,Fandango_Stars,Fandango_Ratingvalue,Fandango_votes,Fandango_Difference,Year
0,Avengers: Age of Ultron (2015),5.0,4.5,14846,0.5,2015
1,Cinderella (2015),5.0,4.5,12640,0.5,2015
2,Ant-Man (2015),5.0,4.5,12055,0.5,2015
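
The slice `.str[-5:-1]` relies on every title ending in exactly `(YYYY)`. As a hedged alternative, a regular expression can pull out the four digits inside the final parentheses and tolerate trailing whitespace. This is a sketch on a few sample titles, not a change to the analysis above; the padded title is a hypothetical example.

```python
import pandas as pd

films = pd.Series([
    'Avengers: Age of Ultron (2015)',
    'Cinderella (2015)',
    'Some Title (2014) ',  # hypothetical title with a trailing space
])

# Capture the four digits inside the last pair of parentheses
years_regex = films.str.extract(r'\((\d{4})\)\s*$', expand=False)

# The positional slice used above, for comparison
years_slice = films.str[-5:-1]

print(pd.DataFrame({'regex': years_regex, 'slice': years_slice}))
```

On the clean titles both approaches agree. On the padded title the slice picks up the closing parenthesis, while the regular expression still returns the year.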


Now we can examine the frequency distribution of our new `Year` column and isolate the movies from 2015.

In [22]:
fandango_before['Year'].value_counts()

2015    129
2014     17
Name: Year, dtype: int64
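
The counts above can also be read as proportions. This is a small sketch that rebuilds a `Year` column from the printed counts (129 and 17) instead of reading the CSV file, so it runs on its own.

```python
import pandas as pd

# Rebuild the Year values from the counts shown above
years = pd.Series(['2015'] * 129 + ['2014'] * 17, name='Year')

# normalize=True returns each year's share of the rows instead of raw counts
shares = years.value_counts(normalize=True)

print((shares * 100).round(1))
```

Most of the movies in Hickey's data set were released in 2015, so filtering on that year keeps the bulk of the sample.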

In [None]:
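fandango_2015 = fandango_before[fandango_before['Year'] == '2015'].copy() # Keeping only the 2015 releases; Year holds strings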