# Investigating Fandango Movie Ratings

Both distributions above are strongly left skewed, suggesting that movie ratings on Fandango are generally high or very high. We can see there's no rating under 2 stars in the sample Hickey analyzed. The distribution of displayed ratings is clearly shifted to the right compared to the actual rating distribution, suggesting strongly that Fandango inflates the ratings under the hood.

Fandango's officials replied that the biased rounding off was caused by a bug in their system rather than being intentional, and they promised to fix the bug as soon as possible. Presumably, this has already happened, although we can't tell for sure since the actual rating value doesn't seem to be displayed anymore in the pages' HTML.

In this project, we'll analyze more recent movie ratings data to determine whether there has been any change in Fandango's rating system after Hickey's analysis.

## Understanding the Data

One of the best ways to figure out whether there has been any change in Fandango's rating system after Hickey's analysis is to compare the system's characteristics before and after the analysis.

- Walt Hickey made the data he analyzed publicly available on [GitHub](https://github.com/fivethirtyeight/data/tree/master/fandango). We'll use the data he collected to analyze the characteristics of Fandango's rating system prior to his analysis.
- One of Dataquest's team members collected movie ratings data for movies released in 2016 and 2017. The data is publicly available on [GitHub](https://github.com/mircealex/Movie_ratings_2016_17) and we'll use it to analyze the rating system's characteristics after Hickey's analysis.


In [1]:
import pandas as pd
fandango_score_comparison = pd.read_csv('fandango_score_comparison.csv')
movie_ratings_16_17 = pd.read_csv('movie_ratings_16_17.csv')

In [2]:
fandango_previous = fandango_score_comparison[['FILM', 'Fandango_Stars', 'Fandango_Ratingvalue', 'Fandango_votes', 'Fandango_Difference']].copy()
fandango_after = movie_ratings_16_17[['movie', 'year', 'fandango']].copy()

In [3]:
merge = pd.merge(fandango_previous, fandango_after, left_on='FILM', right_on='movie')
merge

Unnamed: 0,FILM,Fandango_Stars,Fandango_Ratingvalue,Fandango_votes,Fandango_Difference,movie,year,fandango


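The merge returns an empty DataFrame: no `FILM` value in Hickey's data set exactly matches a `movie` value in the 2016-2017 data set, so we can't compare ratings for the same movies across the two data sets.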
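An empty inner merge can mean either that the two samples share no movies or that the title strings are simply formatted differently. The sketch below shows one way to tell these cases apart on made-up data. It assumes the `movie` column holds bare titles without a year suffix, which the output above does not confirm; the frames `previous` and `after` and the helper column `title` are hypothetical stand-ins, not part of the project.

```python
import pandas as pd

# Toy stand-ins for the two data sets; all titles are invented.
previous = pd.DataFrame({'FILM': ['Film A (2015)', 'Film B (2014)']})
after = pd.DataFrame({'movie': ['Film A', 'Film C'], 'year': [2016, 2016]})

# Strip a trailing " (YYYY)" so both title columns share one format.
previous['title'] = previous['FILM'].str.replace(r'\s*\(\d{4}\)$', '', regex=True)

# An outer merge with indicator=True adds a `_merge` column that labels
# each row as 'both', 'left_only', or 'right_only'.
overlap = pd.merge(previous, after, left_on='title', right_on='movie',
                   how='outer', indicator=True)

counts = overlap['_merge'].astype(str).value_counts().to_dict()
print(counts)
```

If the `both` count stays at zero even after normalizing the titles, the samples genuinely do not overlap.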

This is likely a consequence of how each data set was sampled.

For Hickey's Fandango sample:
- Each movie has at least 30 fan reviews on Fandango.
- Each movie has a rating or score on Rotten Tomatoes, RT User, Metacritic, Metacritic User, and IMDb.
- The Fandango data was pulled on Aug. 24, 2015.

For the 2016-2017 movie ratings sample:
- It contains the most popular movies (those with a significant number of votes).
- Most of the movies were released in 2016.

Given these selection criteria, neither sample is likely to be representative of our population of interest.

## Changing the Goal of our Analysis

At this point, we can either collect new data or change the goal of our analysis. We choose the latter and place some limitations on our initial goal.

Instead of trying to determine whether there has been any change in Fandango's rating system after Hickey's analysis, our new goal is to determine whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. This new goal should also be a fairly good proxy for our initial goal.

## Isolating the Samples We Need

With the new goal, we now have two populations that we want to describe and compare with each other:

- All of Fandango's ratings for popular movies released in 2015.
- All of Fandango's ratings for popular movies released in 2016.

The term "popular" is vague and we need to define it with precision before continuing. We'll use Hickey's benchmark of 30 fan ratings and consider a movie as "popular" only if it has 30 fan ratings or more on Fandango's website.

Let's first check whether every movie in Hickey's data set meets this threshold.

In [4]:
# Count the movies with fewer than 30 fan ratings
(fandango_previous['Fandango_votes'] < 30).sum()

0

No movie has fewer than 30 fan ratings, so every movie in this data set counts as popular under our definition.
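The threshold is inclusive: a movie with exactly 30 fan ratings counts as popular. A minimal sketch on made-up data makes the boundary explicit. The frame `toy` and the constant `POPULAR_THRESHOLD` are illustrative names, not part of the project.

```python
import pandas as pd

# Toy stand-in for fandango_previous; titles and vote counts are made up.
toy = pd.DataFrame({
    'FILM': ['Film A (2015)', 'Film B (2015)', 'Film C (2014)', 'Film D (2015)'],
    'Fandango_votes': [30, 1250, 12, 476],
})

POPULAR_THRESHOLD = 30  # Hickey's benchmark of 30 fan ratings, per the text

# Boolean mask: True where the movie has 30 fan ratings or more.
is_popular = toy['Fandango_votes'] >= POPULAR_THRESHOLD

n_below = int((~is_popular).sum())   # movies that miss the threshold
popular = toy[is_popular].copy()     # movies that meet it

print(n_below)
print(popular['FILM'].tolist())
```

Keeping the mask as a named variable lets the same condition drive both the count and the filter.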

Next, we'll isolate the movies released in 2015.
For `fandango_previous`, we need to extract the year from the `FILM` column, whose values follow the pattern `Film name (year)`.

In [5]:
# Capture the four-digit year in parentheses, e.g. "Film name (2015)" -> 2015
pattern = r'\((\d{4})\)'
fandango_previous['Year'] = fandango_previous['FILM'].str.extract(pattern, expand=False).astype(int)
fandango_previous['Year'].value_counts()

2015    129
2014     17
Name: Year, dtype: int64
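As a hedged illustration of why the choice of pattern matters when extracting a year, the sketch below compares a greedy pattern with a four-digit pattern on made-up titles, one of which contains an extra parenthetical. The Series `titles` is invented for the example, and nothing here implies that such titles occur in Hickey's data.

```python
import pandas as pd

# Made-up titles in the "Film name (year)" format;
# the second one carries an extra parenthetical before the year.
titles = pd.Series(['Film A (2015)', 'Film B (Extended Cut) (2014)'])

# Greedy: captures everything between the first "(" and the last ")".
greedy = titles.str.extract(r'\((.*)\)', expand=False)

# Strict: captures only four digits enclosed in parentheses.
strict = titles.str.extract(r'\((\d{4})\)', expand=False).astype(int)

print(greedy.tolist())
print(strict.tolist())
```

On the second title the greedy capture is not a valid integer, so a later `.astype(int)` would raise an error, while the four-digit pattern still returns the year.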

In [6]:
fandango_2015 = fandango_previous[fandango_previous['Year'] == 2015].copy()
fandango_2015['Year'].value_counts()

2015    129
Name: Year, dtype: int64

Now we'll isolate the movies released in 2016 in the other data set.

In [7]:
fandango_after['year'].value_counts()

2016    191
2017     23
Name: year, dtype: int64

In [8]:
fandango_2016 = fandango_after[fandango_after['year'] == 2016].copy()
fandango_2016['year'].value_counts()

2016    191
Name: year, dtype: int64