**Important note**! Before you turn in this lab notebook, make sure everything runs as expected:

- First, restart the kernel -- in the menubar, select Kernel → Restart.
- Then run all cells -- in the menubar, select Cell → Run All.

Make sure you fill in any place that says "YOUR CODE HERE" or "YOUR ANSWER HERE."

# Problem 3: Movie Revenue Analysis

In this problem you will get your hands dirty with a (fairly clean) dataset. It contains information on about 5000 Hollywood movies. We will try to find out how movie revenue is related to budgets, ratings, and genres.

This dataset is sourced from https://www.kaggle.com/makray/tmdb-5000-movies/data.

The original source for the data is The Movie Database, https://www.themoviedb.org.

Let's start by inspecting the dataset.

In [None]:
from cse6040utils import download_all
datasets = {'tmdb_5000_movies.csv': '64346a71897b5741d553d34b86088603'}
datapaths = download_all(datasets, local_suffix="tmdb/", url_suffix="tmdb/")

In [None]:
import pandas as pd
from IPython.display import display
import ast

# Import the dataset
data = pd.read_csv(datapaths["tmdb_5000_movies.csv"])

# Display the data
display(data.head())

Here are the available variables:

In [None]:
list(data.columns)

That's a lot of variables! How many have missing values?

**Exercise 0** (1 point). Write a function,

```python
    def find_missing_vals(df, colname):
        ...
```

which should return the number of missing values given a dataframe `df` and column name `colname`.

For example, observe that the row at offset 15 has a `NaN` in the `homepage` field:

In [None]:
data.iloc[15]

Therefore, a call to `find_missing_vals(data, 'homepage')` should include this row in its returned count.
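
As a small illustration of the idea (not the graded solution), the sketch below counts missing entries in a made-up frame. The frame `toy` and the helper name `count_missing` are assumptions introduced here for illustration; `isna()` is the standard pandas method that flags `NaN` and `None` entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; not part of the TMDB dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "homepage": ["http://example.com", np.nan, None],
})

def count_missing(df, colname):
    """Return how many entries of df[colname] are missing (NaN or None)."""
    # isna() yields a boolean Series; summing it counts the True entries.
    return int(df[colname].isna().sum())

print(count_missing(toy, "homepage"))  # 2
print(count_missing(toy, "title"))     # 0
```

Summing a boolean mask is one common way to count; other approaches (for example, comparing `len(df)` against `df[colname].count()`) should give the same number.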

In [None]:
def find_missing_vals(df, colname):
    ###
    ### YOUR CODE HERE
    ###


In [None]:
# Test Cell: Exercise 0

col_null = {'budget': 0,
 'genres': 0,
 'homepage': 3091,
 'id': 0,
 'keywords': 0,
 'original_language': 0,
 'original_title': 0,
 'overview': 3,
 'popularity': 0,
 'production_companies': 0,
 'production_countries': 0,
 'release_date': 1,
 'revenue': 0,
 'runtime': 2,
 'spoken_languages': 0,
 'status': 0,
 'tagline': 844,
 'title': 0,
 'vote_average': 0,
 'vote_count': 0}
for col in data.columns:
    assert find_missing_vals(data, col) == col_null[col], "Looks like you don't have the right count for at least one of the columns"
    
print("\n(Passed!)")

How many missing values do the columns have?

In [None]:
for col in data.columns:
    if find_missing_vals(data, col):
        print("{} has {} missing values out of {}".format(col,find_missing_vals(data,col),len(data)))

It looks like there are no missing values except in these 5 columns. Let's plot histograms of the budgets, revenues, and vote counts.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

plt.hist(x = data['budget'],bins = 50)
plt.ylabel('Number of movies')
plt.xlabel('Budget');

In [None]:
plt.hist(x = data['revenue'],bins = 50)
plt.ylabel('Number of movies')
plt.xlabel('Revenue');

In [None]:
plt.hist(x = data['vote_count'],bins = 50)
plt.ylabel('Number of movies')
plt.xlabel('Number of votes ');

Observe the following:

* There is a huge spike near zero in each histogram. Many of the budget and revenue values are likely zero. In the industry from which these data are gathered, budget and revenue values below $100,000 don't make much sense.
* We should also require a minimum vote count before treating the vote average as an effective measure of a movie's quality. Let's filter the data to keep only rows that have "good" budget, revenue, and user ratings data.

**Exercise 1** (2 points): Write some code to create a new pandas dataframe **`filtered_data`** that satisfies the following:

0. Keep only the columns of interest for our analysis, i.e., `id`, `budget`, `revenue`, `vote_average`, `vote_count`, `genres`, `original_title`, `popularity`
1. Keep rows with `budget` > 100,000
2. Keep rows with `revenue` > 100,000
3. Keep movies with number of votes (`vote_count`) > 20
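
The combination of column selection and row conditions can be sketched with a boolean mask. The example below runs on a made-up frame (`toy`, its values, and the extra `status` column are assumptions for illustration only), so it shows the pattern rather than the answer for the real data.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "budget": [50_000, 2_000_000, 5_000_000, 300_000],
    "revenue": [900_000, 80_000, 12_000_000, 450_000],
    "vote_count": [150, 40, 500, 10],
    "status": ["Released"] * 4,
})

keep_cols = ["id", "budget", "revenue", "vote_count"]

# Each comparison needs its own parentheses: `&` binds tighter than `>`.
mask = (
    (toy["budget"] > 100_000)
    & (toy["revenue"] > 100_000)
    & (toy["vote_count"] > 20)
)

# Select rows with the mask and columns with the list in a single step.
toy_filtered = toy.loc[mask, keep_cols]
print(toy_filtered)
```

Only the row with `id` 3 satisfies all three conditions, and the `status` column is dropped because it is not listed in `keep_cols`.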


In [None]:
filtered_data = pd.DataFrame()

###
### YOUR CODE HERE
###

# Display the data and count the number of movies remaining
print("Rows remaining: {}".format(len(filtered_data)))
display(filtered_data.head())

In [None]:
# Test cell: Exercise 1

columns = ['id','original_title','genres','budget','revenue','vote_average','vote_count','popularity']
for col in columns:
    assert col in filtered_data.columns, "You're missing a column"

assert len(filtered_data) == 3065, "Hmm, your filtered data doesn't have the correct number of rows"

assert min(filtered_data.budget) > 100000, "Hmm, you have some budget values less than required"
assert min(filtered_data.revenue) > 100000, "Uh-oh, you have some revenue values less than required"
assert min(filtered_data.vote_count) > 20, "Some vote_count values are less than required"


print("\n(Passed!)")

Let's look at a pairwise plot of all the numerical variables to see whether there are any obvious relationships.

In [None]:
import seaborn as sns
sns.pairplot(filtered_data[['revenue','budget','popularity', 'vote_average','vote_count']])

It appears that revenue is correlated with budget, popularity and vote count. Let's back this visual analysis with correlation coefficients.

**Exercise 2** (1 point). Write a function,

```python
    def corr_coeff(col1, col2):
        ...
```

which takes two **Pandas Series objects** (`col1` and `col2`) as an input and returns their [(Pearson) correlation coefficient](https://en.wikipedia.org/wiki/Correlation_coefficient).
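
For reference, the Pearson coefficient of two samples is the sum of products of their deviations from the mean, divided by the square root of the product of their summed squared deviations. The sketch below computes it by hand on two made-up Series and compares the result with the built-in `Series.corr` and `numpy.corrcoef`. The names `x`, `y`, and `pearson_r` are assumptions introduced for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson_r(a, b):
    """Pearson correlation computed directly from its definition."""
    da = a - a.mean()  # deviations of a from its mean
    db = b - b.mean()  # deviations of b from its mean
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

r_manual = pearson_r(x, y)
r_pandas = x.corr(y)               # pandas defaults to method='pearson'
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 matrix

print(r_manual, r_pandas, r_numpy)
```

All three values should agree up to floating-point rounding, so any of these routes is a reasonable starting point.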

In [None]:
def corr_coeff(col1, col2):
    ###
    ### YOUR CODE HERE
    ###


Let's check the correlation coefficients between revenue and the other variables.

In [None]:
# Test Cell: Exercise 2
import numpy.testing as npt

npt.assert_almost_equal(corr_coeff(filtered_data.revenue, filtered_data.vote_count), 0.751209931882, decimal=5)
npt.assert_almost_equal(corr_coeff(filtered_data.revenue, filtered_data.budget), 0.699955328476, decimal=5)
npt.assert_almost_equal(corr_coeff(filtered_data.revenue, filtered_data.popularity), 0.593541205556, decimal=5)
npt.assert_almost_equal(corr_coeff(filtered_data.revenue, filtered_data.vote_average), 0.181083687401, decimal=5)

print("\n(Passed!)")

In [None]:
for col in ['vote_count','budget','popularity','vote_average']:
    print("correlation coefficient for revenue and {} = {}".format(col,
                                                                    corr_coeff(filtered_data['revenue'],
                                                                               filtered_data[col])))

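This confirms our speculation that `budget`, `popularity`, and `vote_count` are strongly correlated with revenue. By contrast, `vote_average` is only weakly correlated with it (a coefficient of roughly 0.18).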

What about genre -- is it also a driver for movie revenues? And are some genres more popular than others? Let's look at the `genres` column for one specific movie:

In [None]:
filtered_data['genres'][0]

It looks like a movie can have multiple genres: each entry in the `genres` column holds a list of dictionaries, with each dictionary having a genre ID and name. In the example above, the corresponding movie has 4 genres, namely, _Action_, _Adventure_, _Fantasy_, and _Science Fiction_. Let's clean this up to find the average revenue made by a movie in each genre.
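
To make the structure concrete, the sketch below parses a string shaped like one `genres` entry and pulls out the genre names. The string `raw` is a shortened, hand-typed stand-in for a real entry (an assumption for illustration); `ast.literal_eval` is the standard-library function that safely evaluates a string containing a Python literal.

```python
import ast

# A hand-typed string shaped like one entry of the `genres` column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# Convert the string into an actual list of dictionaries.
parsed = ast.literal_eval(raw)

# Collect just the genre names.
names = [g["name"] for g in parsed]

print(type(raw).__name__, type(parsed).__name__)  # str list
print(names)                                      # ['Action', 'Adventure']
```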

**Instructions for Exercises 3 & 4** (6 points). You need to write some code to create a dataframe named **`average_revenue_by_genre`** from `filtered_data`. The dataframe should have the following columns:

- `'genre'`: a unique identifier in the dataframe
- `'average_revenue'`: the average revenue for a genre (see below for instructions on how to calculate this value)
- `'movie_count'`: the number of movies that list this genre as one of its genres

Here is an example of how to calculate the average revenue by genre.

- If a movie has multiple genres, split the revenue equally to each assigned genre.

For instance, consider the first row of the table below, _Avatar_, which has a total revenue of $2,787,965,087. Since it is associated with 4 genres, each genre gets a 1/4 share of the revenue: $2,787,965,087/4 = $696,991,271.75.

- So, consider this input:

|original_title|genres|revenue|
|--------------|------|-------|
|Avatar|[{"id": 28, "name": "Action"}, {"id": 12, "nam...|2787965087|
|Spectre|[{"id": 28, "name": "Action"}, {"id": 12, "nam...|880674609|

'Avatar'  = {'genre': ['Action', 'Adventure', 'Fantasy', 'Science Fiction'],  'revenue' : 2787965087 } and 
'Spectre' = {'genre': ['Action', 'Adventure', 'Crime'], 'revenue' : 880674609}

Therefore, here is a sample output that you should get.

|genre|average_revenue|movie_count|
|-----|---------------|-----------|
|Action|495274737.375|2|
|Adventure|495274737.375|2|
|Fantasy|696991271.75|1|
|Science Fiction|696991271.75|1|
|Crime|293558203|1|

- The average_revenue for Action = `mean(2787965087/4, 880674609/3)`
- The average_revenue for Adventure = `mean(2787965087/4, 880674609/3)`
- The average_revenue for Fantasy = `mean(2787965087/4)`
- The average_revenue for Science Fiction = `mean(2787965087/4)`
- The average_revenue for Crime = `mean(880674609/3)`

*Hints*:
1. The type of entries in genres in filtered data is currently `'str'`. It will be easier to first convert the entries to a list of dictionaries. (Try searching for [`ast.literal_eval`](https://docs.python.org/3/library/ast.html).)
2. You can use default dictionaries to add the revenue contribution from each movie to each genre.
3. You can use default dictionaries to count the number of movies in each genre.
4. For each genre, using results from 2 and 3, $$average\_revenue = \frac{(total\_revenue)}{(movie\_count)}$$ 
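
The arithmetic in the sample output can be checked with a short sketch that follows hints 2 through 4 on just the two sample movies. The dictionary `movies` mirrors the sample input with genres already parsed into lists; the variable names `total`, `count`, and `average` are assumptions introduced here, and this is an illustration on two movies rather than a solution for the full dataset.

```python
from collections import defaultdict

# The two movies from the sample input, with genres already parsed.
movies = {
    "Avatar": {"genre": ["Action", "Adventure", "Fantasy", "Science Fiction"],
               "revenue": 2787965087},
    "Spectre": {"genre": ["Action", "Adventure", "Crime"],
                "revenue": 880674609},
}

total = defaultdict(float)  # summed revenue shares per genre
count = defaultdict(int)    # number of movies per genre

for info in movies.values():
    # Split the movie's revenue equally among its genres.
    share = info["revenue"] / len(info["genre"])
    for g in info["genre"]:
        total[g] += share
        count[g] += 1

# Average revenue = total revenue share / movie count, per genre.
average = {g: total[g] / count[g] for g in total}

for g in average:
    print(g, average[g], count[g])
```

The printed values match the sample table: for example, Action averages 495274737.375 over 2 movies, and Crime averages 293558203.0 over 1 movie.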

To help you solve this problem, we've broken it up into two parts.

**Exercise 3** (3 points) Let's consider the first part of this problem. Create two dictionaries, **`revenue_by_genre`** and **`movie_count_by_genre`**, that contain a genre's total revenue and genre's movie count, respectively.

- **`revenue_by_genre`**: the key is a genre's name, value is the genre's total revenue.
- **`movie_count_by_genre`**: the key is the genre's name and value is the number of movies associated with the genre.

In [None]:
ast.literal_eval(filtered_data.loc[2,"genres"])

In [None]:
from collections import defaultdict # Hint
import ast # Hint

revenue_by_genre = defaultdict(float)
movie_count_by_genre = defaultdict(int)
###
### YOUR CODE HERE
###

print(revenue_by_genre)
print()
print(movie_count_by_genre)

In [None]:
## Test cell: Exercise 3
assert isinstance(revenue_by_genre, dict), "type of revenue_by_genre is not dict"
assert isinstance(movie_count_by_genre, dict), "type of movie_count_by_genre is not dict"

all_revs__ = sum(revenue_by_genre.values())
all_revs_true__ = 390406088153.0
rel_delta_all_revs__ = (all_revs__ - all_revs_true__) / all_revs_true__
assert abs(rel_delta_all_revs__) <= 1e-7, \
       "Your total sum of revenue: {} does not match the instructor's: {}".format(all_revs__, all_revs_true__)

all_movies__ = sum(movie_count_by_genre.values())
assert all_movies__ == 8188, "Your total sum of movie count, {}, does not match the instructor's sum, {}.".format(all_movies__, 8188)

assert len(revenue_by_genre) == len(movie_count_by_genre) == 18
genres = ['Mystery', 'Romance', 'History', 'Family', 'Science Fiction', 
          'Horror', 'Crime', 'Drama', 'Fantasy', 'Animation', 'Music', 'Adventure',
          'Action', 'Comedy', 'Documentary', 'War', 'Thriller', 'Western']

for gen in genres:
    assert gen in revenue_by_genre.keys(), "{} is not in your revenue_by_genre dictionary".format(gen)
    assert gen in movie_count_by_genre.keys(), "{} is not in your movie_count_by_genre dictionary".format(gen)
    
sample_genres = {'Documentary': [525228204.0, 27],
                 'Animation': [16092739561.0, 181],
                 'Western': [1448994102.0, 54],
                 'Mystery': [8111172141.0, 254]}

for gen in sample_genres:
    rev__ = revenue_by_genre[gen]
    rev_true__ = sample_genres[gen][0]
    rel_delta__ = (rev__ - rev_true__) / rev_true__
    assert abs(rel_delta__) <= 1e-7, "revenue for {} should be {} but you have {}".format(gen, rev_true__, rev__)
    assert movie_count_by_genre[gen] == sample_genres[gen][1], "movie count for {} should be {} but you have {}".format(gen, sample_genres[gen][1], movie_count_by_genre[gen])

print("\n(Passed!)")

**Exercise 4** (3 points): Write some code to create a dataframe **`average_revenue_by_genre`** from `filtered_data`. The dataframe should include the following columns:

- `'genre'`: a unique identifier in the dataframe.
- `'average_revenue'` : the average revenue for a genre.
- `'movie_count'`: the number of movies that list this genre as one of its genres.

> *Hint: You can use the dictionaries created in Exercise 3 as a starting point!*
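
One possible way to turn two dictionaries keyed by genre into a dataframe with the required columns is sketched below. The dictionaries `toy_revenue` and `toy_count` and their numbers are made up for illustration; they stand in for the dictionaries from Exercise 3.

```python
import pandas as pd

# Hypothetical totals and counts for two genres (made-up numbers).
toy_revenue = {"Drama": 300.0, "Comedy": 100.0}
toy_count = {"Drama": 3, "Comedy": 2}

# Build one column per required field, iterating over the same keys
# each time so that the rows stay aligned.
toy_df = pd.DataFrame({
    "genre": list(toy_revenue),
    "average_revenue": [toy_revenue[g] / toy_count[g] for g in toy_revenue],
    "movie_count": [toy_count[g] for g in toy_revenue],
})

print(toy_df)
```

Here Drama gets an average of 100.0 (300.0 over 3 movies) and Comedy gets 50.0 (100.0 over 2 movies).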

In [None]:
###
### YOUR CODE HERE
###

# print your solution
display(average_revenue_by_genre)

In [None]:
## Test cell : Exercise 4

assert isinstance(average_revenue_by_genre, pd.DataFrame)
assert len(average_revenue_by_genre) == len(revenue_by_genre)
cols = ['genre', 'average_revenue', 'movie_count']
for c in cols:
    assert c in average_revenue_by_genre.columns

test = average_revenue_by_genre.set_index('genre')
for sample in sample_genres:
    a__ = test.loc[sample, 'average_revenue']
    b__ = sample_genres[sample][0] / sample_genres[sample][1]
    assert abs((a__ - b__) / a__) <= 1e-7

assert sum(average_revenue_by_genre['movie_count']) == 8188, "Your total sum of movie count: {} does not match the instructor's sum of movie count: {}".format(sum(movie_count_by_genre.values()), 8188)
assert np.isclose(sum(average_revenue_by_genre['movie_count']*average_revenue_by_genre['average_revenue']),
                  390406088153.0), "Your total sum of revenue: {} does not match the instructor's sum of revenue: {}".format(sum(revenue_by_genre.values()), 390406088153.0)

print("\n(Passed!)")

Let's make one last observation, looking specifically at `average_revenue` by genre.

In [None]:
import seaborn as sns
sns.barplot(x="average_revenue", y="genre", data=average_revenue_by_genre.sort_values(['average_revenue'],ascending=False))

Genre is indeed associated with revenue. While adventure and action movies have high revenues, documentaries and history movies have lower revenue. What other exploratory analysis can you think of using this dataset?


**Fin!** That's the end of this problem. Don't forget to restart and run this notebook from the beginning to verify that it works top-to-bottom before submitting. You can then move on to the next problem.