# Data Visualization

## Assignment 3: Visualizing Distributions

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are the links to 2 videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA), and
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).

<div class="alert alert-info" style="color:black">
    
### Assignment Learning Goals:

By the end of the module, students are expected to:

- Select an appropriate distribution plot for the data.
- Create density plots to compare a few distributions.
- Create boxplots to compare many distributions.


This assignment covers [Module 3](https://viz-learn.mds.ubc.ca/en/module3) of the online course. You should complete this module before attempting this assignment.
 
</div>

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Replace the `None` and the `raise NotImplementedError # No Answer - remove if you provide an answer` line with your completed code and answers, then run the cell.

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 2 points.

In [None]:
# Import libraries needed for this assignment
from hashlib import sha1
import altair as alt
import pandas as pd
import numpy as np
import test_assignment3 as t

# 0. Motivation

Welcome back to another episode of "Learning how to do data visualizations"! Tonight's topic is whether or not we should start a new business venture. We've seen the number of online streaming services increase over the years, and since it's such a lucrative business, why not try to capitalize on it?

Before putting everything on the line financially, we need to understand the movie market better so that we can get a competitive edge over existing services. How can we outshine our competitors?

We need an aspiring data wrangler and future **VIZARD** to help understand the current market. 

This is where you come in!



<img src='img/vizard.png' width=40%>




<div>Icons made by <a href="https://www.freepik.com" title="Freepik">Freepik</a> from <a href="https://www.flaticon.com/" title="Flaticon">www.flaticon.com</a></div>

Which companies produce what kind of movies,
and which receive good scores? What films are currently being offered in the streaming market? 
It's our job to attempt to answer these questions so that we know what type of market we are entering and can pick only specific films for our platform (less scrolling - am I right?). How has no one thought of this before?

We've even come up with a unique name for this service: Betterflix™.

<img src='img/betterflix.png' width=40%>

Ok, let's get started!

# 1. Opening Scene - Non-Visual Exploratory Data Analysis

When we begin EDA, it can be good to look at some textual summaries, just to get an idea of what to plot. We've done this in the previous course, so it's our instinct to want to do it here as well. 

**Question 1.1** 
    <br> {points: 1}

Read in the data `movies.json`. This data is stored in a `.json` file and not a `csv` which means you are going to need a different pandas tool than `pd.read_csv()`. Take a look at the documentation for [`.read_json()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html) if you need any more help. 

*Assign your data to a variable named `movies_df`.*
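
As a minimal sketch of how `pd.read_json()` behaves, the example below parses a small, made-up JSON string rather than `movies.json`. The column names and values are illustrative assumptions only. Note that a JSON array inside a record is kept as a Python list in a single cell.

```python
from io import StringIO

import pandas as pd

# Hypothetical records; the real file has different columns and many more rows.
toy_json = """
[
  {"title": "Film A", "runtime": 95, "genres": ["Drama", "Comedy"]},
  {"title": "Film B", "runtime": 120, "genres": ["Horror"]}
]
"""

# read_json accepts a path or a file-like object; StringIO stands in for a file here.
toy_df = pd.read_json(StringIO(toy_json))
print(toy_df)
```

For a file on disk you would pass the file path instead of the `StringIO` object.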

In [None]:
movies_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
movies_df

In [None]:
t.test_1_1(movies_df)


Display information about all columns, data types, and number of NaN values in the cell below and answer the following questions:


*Hint: How do you get **info**-rmation from a dataframe?*
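
To show what this kind of summary looks like, here is a small sketch on a made-up dataframe (the columns are assumptions, not the assignment data). `.info()` prints its report directly, and `.isna().sum()` is one complementary way to count missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing budget value.
toy_df = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "budget": [1_000_000, np.nan, 250_000],
})

# Prints one line per column with its non-null count and dtype.
toy_df.info()

# A complementary view: the number of missing values in each column.
null_counts = toy_df.isna().sum()
print(null_counts)
```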

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

**Question 1.2** 
    <br> {points: 1}
    
How many columns are there of Dtype `object`? 

*Save the answer in the cell below to an object of type `int` called `answer1_2`.*

In [None]:
answer1_2 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_2

In [None]:
t.test_1_2(answer1_2)

**Question 1.3** 
    <br> {points: 2}

How many columns have null values?


*Save the answer in the cell below to an object of type `int` called `answer1_3`.*

In [None]:
answer1_3 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_3

In [None]:

# check that the variable exists
assert 'answer1_3' in globals(), "Please make sure that your solution is named 'answer1_3'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 1.4** 
<br> {points: 1}

Let's obtain some summary statistics to get a bit of an idea of our data. 

Use the appropriate function and make sure to pass `include='all'` so that statistics are reported for every column, not only the numeric ones.

*Save the dataframe in the cell below in an object named `movie_stats`.*
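
The effect of `include='all'` can be seen in the following sketch, which uses a made-up dataframe (hypothetical columns) and compares the default summary with the full one.

```python
import pandas as pd

# Hypothetical data with one text column and one numeric column.
toy_df = pd.DataFrame({
    "title": ["Film A", "Film B", "Film A"],
    "runtime": [95, 120, 95],
})

# Default: only numeric columns are summarized.
numeric_stats = toy_df.describe()

# include="all": text columns appear too, with rows such as 'unique', 'top' and 'freq'.
all_stats = toy_df.describe(include="all")
print(all_stats)
```

Cells that do not apply to a column (for example, the mean of a text column) are filled with `NaN`.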

In [None]:
movie_stats = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
movie_stats

In [None]:
t.test_1_4(movie_stats)

**Question 1.5** 
<br> {points: 1}

How many unique titles are in the database?

*Save the answer in the cell below to an object of type `int` named `answer1_5`.*
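
One possible way to count distinct values is sketched below on a made-up Series. The values are assumptions for illustration, and other approaches (such as reading the `unique` row of a summary table) work too.

```python
import pandas as pd

# Hypothetical titles, one of which is repeated.
titles = pd.Series(["Film A", "Film B", "Film A", "Film C"])

# nunique() counts distinct values (missing values are ignored by default).
n_distinct = int(titles.nunique())
print(n_distinct)
```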

In [None]:
answer1_5 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_5

In [None]:
t.test_1_5(answer1_5)

**Question 1.6** 
<br> {points: 1}

What was the highest budget from the dataset?

*Save the answer in the cell below to an object of type `int` or `float` named `answer1_6`.*

In [None]:
answer1_6 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_6

In [None]:
t.test_1_6(answer1_6)

**Question 1.7** 
<br> {points: 1}

How many drama films are in the dataset?

*Save the answer in the cell below to an object of type `int` named `answer1_7`.*

In [None]:
answer1_7 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_7

In [None]:
t.test_1_7(answer1_7)

**Question 1.8** 
<br> {points: 1}

To 2 decimal places, what was the average runtime of the films in the database? 

*Save the answer in the cell below to an object of type `float` named `answer1_8`.*

In [None]:
answer1_8 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_8

In [None]:
t.test_1_8(answer1_8)

# 2.  Let's get Visual!

Alright, let's start making some plots that will help us answer some questions regarding our data. 

Betterflix™ wants to select films that are "better than the rest". Revenue is a good indicator of how successful and well-received a film is. It would be interesting to see whether films made by collaborating studios generate higher revenue. If they do, Betterflix™ could simply select films for its streaming service that were produced by multiple studios. 

**Question 2.1** <br> {points: 1}  

Before we dive into answering our question, let's examine the studios and studios that collaborate from the data. Create a bar chart that shows the number of films each studio produced. 

Think about how the plot should be rotated. It's a good idea to sort your plot as well in an appropriate order. Give the plot an appropriate title, axis labels and set the height and width to 800 and 600 respectively.

*Save the plot in an object named `studio_plot`.*

In [None]:
studio_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
studio_plot

In [None]:
t.test_2_1(studio_plot)

**Question 2.2** <br> {points: 1}  

Which 2 studios produce the most films together? (Take a look at the bar graph in 2.1.)


*To answer the question, add the names of the studios as elements of type `str` to a list and assign it to a variable named `answer2_2`. For example, if you believe that Pixar Animation Studios and Walt Disney Pictures produce the greatest number of films together, then your answer would look like this:*

`answer2_2 = ["Pixar Animation Studios", "Walt Disney Pictures"]`

In [None]:
answer2_2 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer2_2

In [None]:
t.test_2_2(answer2_2)

**Question 2.3** <br> {points: 2}  

We want to examine if films that have been produced in collaboration have higher revenue so we need to distinguish between these films! 

Add a column to the dataframe `movies_df` named `collaboration` that is `True` if the list in the `studios` column contains more than one studio (that is, its length is greater than 1) and `False` otherwise. 

*Save the new dataframe in an object named `collab_df`.*
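
A minimal sketch of measuring the length of list-valued cells is shown below. The dataframe, the studio names, and the `has_many` column name are all made up for illustration.

```python
import pandas as pd

# Hypothetical data: each cell in 'studios' holds a list of studio names.
toy_df = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "studios": [["Studio X"], ["Studio X", "Studio Y"], ["Studio Z"]],
})

# .str.len() also works on list-valued cells and returns the number of elements.
n_studios = toy_df["studios"].str.len()

# assign() returns a new dataframe and leaves toy_df unchanged.
flagged_df = toy_df.assign(has_many=n_studios > 1)
print(flagged_df)
```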

In [None]:
collab_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
collab_df.head()

In [None]:
t.test_2_3(collab_df)

**Question 2.4** <br> {points: 2}  

How many films in our dataset have been collaborations? 

Plot the counts of the `collaboration` column that we made in **Question 2.3** from the `collab_df` dataframe. Give the plot a title and a width and height of 300.

*Save the plot in an object named `studio_dist`.*

In [None]:
studio_dist = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
studio_dist

In [None]:
t.test_2_4(studio_dist)

**Question 2.5** <br> {points: 2}

Which of the following statements is correct? 



A) In our dataset, films are more often made by multiple studios.

B) In our dataset, films are more often made by a single studio.

C) In our dataset, films are made in collaboration just as often as they are made by a single studio. 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer2_5`.*

In [None]:
answer2_5 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer2_5

In [None]:

# check that the variable exists
assert 'answer2_5' in globals(), "Please make sure that your solution is named 'answer2_5'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 2.6** <br> {points: 2}  

We want to find out if films made in collaboration produce more revenue, but first, it's a good idea to see how the revenue values are distributed in our data. 

What kind of films are included in our dataset? Do we only have big blockbuster films at our disposal or are there also smaller indie films in our dataset? 

Take a look at the `revenue` column from the `collab_df` dataframe.
Create a single histogram of the column and set the number of bins to `50`. Don't forget to give the plot a title and proper axis labels.

*Save the plot in an object named `revenue_plot`.*

In [None]:
revenue_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
revenue_plot

In [None]:
t.test_2_6(revenue_plot)

**Question 2.7** <br> {points: 1}

What can we tell from the graph above? 


A) Revenue is uniformly distributed among blockbuster films and lower revenue ones. 

B) Revenue is normally distributed (bell-shaped curve). 
 
C) There appear to be a few films generating much higher revenue.

D) There appear to be a few films generating much lower revenue.


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer2_7`.*

In [None]:
answer2_7 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer2_7

In [None]:
t.test_2_7(answer2_7)

**Question 2.8** <br> {points: 1}  

Alright, let's start answering our big picture question now - Do films made in collaboration generally generate more revenue? 

We can attempt to answer this question by taking the plot `revenue_plot` from **Question 2.6** and faceting it by the column `collaboration`. 

*Save the plot in an object named `revenue_plot_facet`.*

In [None]:
revenue_plot_facet = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
revenue_plot_facet

In [None]:
t.test_2_8(revenue_plot_facet)

**Question 2.9** <br> {points: 1}  

Ok, so that last plot didn't seem to help us that much. The two distributions are quite difficult to compare because they share a y-axis. We should fix that by giving each faceted plot its own independent y-axis.

Take the plot `revenue_plot_facet` from **Question 2.8** and adjust the axis scaling so that it's done independently for each facet. 

*Save the plot in an object named `revenue_plot_facet2`.*

In [None]:
revenue_plot_facet2 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
revenue_plot_facet2

In [None]:
t.test_2_9(revenue_plot_facet2)

**Question 2.10** <br> {points: 1}  

Finally, we can now advise Betterflix™. What would you inform Betterflix™ regarding films that work in collaboration? <br>
Which of the following is ***most*** appropriate?

A) It's clear that films produced in collaboration have more success at the box office (higher revenue). 

B) It's clear that films produced in collaboration have less success at the box office (lower revenue). 

C) It appears there is no difference in box-office success between films that were collaborations and films that were not.

D) It's not especially clear if collaboration has an impact on the success at the box office particularly since there are not that many movies produced by multiple studios. We would likely have to follow up with formal statistical testing to determine if there is a difference.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer2_10`.*

In [None]:
answer2_10 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer2_10

In [None]:
t.test_2_10(answer2_10)

# 3. Exploring Genre and Studio Categories

Are certain studios or movie genres producing more successful films at the box office? If Betterflix™ was hosting films from particularly popular genres and studios that might help their competitive edge! Well, Disney+ is doing it! 

This could be valuable information for Betterflix™ so let’s get to it!


Let's start with the movie genres.

If you look at the data frame, you can see that each film can have multiple genres in a list.

This is a little problematic and means we must decide if we are counting the film once per genre or once overall. Since we have no information on what the main genre is for each movie, it's a good idea to produce one row per genre listed in the `genres` column.

In [None]:
movies_df.head()

**Question 3.1** 
<br> {points: 1}

Replicate each row once per genre in the `genres` column. There is a pandas method called [`.explode()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.explode.html) that can help us produce multiple rows, one for each genre in the genres list. 

Exploding multiple columns in a dataframe at a time runs the risk of duplicated rows if not mapped correctly, which could affect your analysis. Note that for these assignments we will be using `.explode()` with a single column at a time to avoid this. 

Create a new dataframe that will create rows for every existing genre in the `genres` column. Look at the documentation provided if you need any guidance. 


*Save this new dataframe in an object named `movie_genres_df`.*
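
To see what `.explode()` does, consider this small sketch on a made-up dataframe (hypothetical titles and genres). Each element of a list becomes its own row, and the other columns are repeated.

```python
import pandas as pd

# Hypothetical data: 'genres' holds a list per film.
toy_df = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "genres": [["Drama", "Comedy"], ["Horror"]],
})

# One row per list element; the original index values are repeated.
toy_exploded = toy_df.explode("genres")
print(toy_exploded)
```

Here "Film A" appears twice in the result, once for each of its genres.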

In [None]:
movie_genres_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
movie_genres_df.head()

In [None]:
t.test_3_1(movie_genres_df)

**Question 3.2** 
<br> {points: 1}

What are all the unique movie genres?

How many different movie genres are there? 

*Save all the unique movie genres in a list named `movie_genres`. Save the total number of genres in an object named `genres_tot` of type `int`.*

In [None]:
movie_genres = None
genres_tot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
print(movie_genres)
print(genres_tot)

In [None]:
t.test_3_2(movie_genres)

**Question 3.3** 
<br> {points: 2}

Which movie genres have the highest **median** gross revenue? Sort the genres by ascending median gross revenue. 

*Hint: You will need to group by genre, find the median of the revenue, sort the resulting values, and convert the index to a list (`.sort_values().index.to_list()` may be helpful here).*

*Assign the sorted list to an object named `genres_by_revenue`.* 
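
The chain of steps in the hint can be sketched on made-up data as follows. The dataframe and the `toy_order` name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data, already one row per (film, genre) pair.
toy_df = pd.DataFrame({
    "genres": ["Drama", "Drama", "Comedy", "Comedy", "Horror"],
    "revenue": [10, 30, 5, 7, 50],
})

# Median revenue per genre: a Series indexed by genre name.
toy_medians = toy_df.groupby("genres")["revenue"].median()

# Sort the medians in ascending order, then keep only the genre labels.
toy_order = toy_medians.sort_values().index.to_list()
print(toy_order)
```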


In [None]:
genres_by_revenue = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
genres_by_revenue

In [None]:
t.test_3_3(genres_by_revenue)

**Question 3.4** 
<br> {points: 1}

Use the new dataframe `movie_genres_df` that was made in **Question 3.1** and create multiple boxplots inside a single figure. Map the genres on the y-axis and gross revenue on the x-axis.

Remember that boxplots can only be sorted by passing an explicit list of the category values, in this case the genre names, in the desired order. Sort the genres by using the list from `genres_by_revenue` from **Question 3.3**, so that the median lines of the box plot are nicely sorted. Don't forget to give the plot a title.


*Save the plot in an object named `rev_boxplot`.*

In [None]:
rev_boxplot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
rev_boxplot

In [None]:
t.test_3_4(rev_boxplot)

**Question 3.5**  
<br> {points: 2}

One of the great things about boxplots is that they allow us to see some of the outliers that occur in each category. The high revenue outliers of each category are definitely films that Betterflix™ should host.

Let's find out what the revenue amount is for films at the 99.9th percentile of each genre. 

This means we need to look at what the 0.999 quantile is for `revenue` within each value of `genres`. 

You can solve this using a combination of `.groupby()`, column selection, and the [`.quantile()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html) method. 

*Save the revenue value for each genre in a Pandas Series assigned to an object named `quantile_999`.*
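
A minimal sketch of a per-group quantile, using made-up data, might look like this. With pandas' default linear interpolation, the 0.999 quantile of a small group falls just below its maximum.

```python
import pandas as pd

# Hypothetical data, one row per (film, genre) pair.
toy_df = pd.DataFrame({
    "genres": ["Drama", "Drama", "Horror"],
    "revenue": [10, 30, 50],
})

# Group by genre, select the revenue column, then take the 0.999 quantile.
toy_quantiles = toy_df.groupby("genres")["revenue"].quantile(0.999)
print(toy_quantiles)
```

The result is a pandas Series indexed by genre, so an individual value can be looked up with, for example, `toy_quantiles["Drama"]`.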

In [None]:
quantile_999 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
quantile_999

In [None]:
t.test_3_5(quantile_999)

**Question 3.6**  
<br> {points: 2}

Let's see which film the outlier in the animation box plot could be. We definitely want that one for Betterflix™! Does anyone have any guesses? 

There are several ways you can find outliers in a dataframe, but for simplicity, you can filter `movie_genres_df` for the animation films that have a revenue value greater than the value for the animation genre in the `quantile_999` object. 


*Save the title as a string in an object named `animation_outlier`.*
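
The filtering idea can be sketched on made-up data as below. The titles, revenues, and variable names are assumptions, and combining two boolean conditions with `&` is only one of several ways to express the filter.

```python
import pandas as pd

# Hypothetical data, one row per (film, genre) pair.
toy_df = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "genres": ["Animation", "Animation", "Animation", "Drama"],
    "revenue": [10, 20, 900, 5000],
})

# Per-genre thresholds, indexed by genre name.
toy_thresholds = toy_df.groupby("genres")["revenue"].quantile(0.999)

# Keep rows that are in the genre of interest AND above that genre's threshold.
is_animation = toy_df["genres"] == "Animation"
is_above = toy_df["revenue"] > toy_thresholds["Animation"]
toy_outliers = toy_df[is_animation & is_above]

# Extract the title of the first matching row as a plain string.
toy_title = toy_outliers["title"].iloc[0]
print(toy_title)
```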

In [None]:
animation_outlier = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
animation_outlier

In [None]:
# check that the variable exists
assert 'animation_outlier' in globals(), "Please make sure that your solution is named 'animation_outlier'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 3.7**  
<br> {points: 1}

What about the far-right outlier in the Science Fiction genre? Which film could that be?

Filter the `movie_genres_df` for the Science Fiction films that have a revenue value greater than the respective value in the `quantile_999` object. 


*Save the title as a string in an object named `sf_outlier`.*

In [None]:
sf_outlier = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
sf_outlier

In [None]:
t.test_3_7(sf_outlier)

Did you expect those films to be the outliers in their respective genres? 

**Question 3.8**  
<br> {points: 1}

Remember that boxplots hide how many observations there are in each group, so you would need to check this manually. Before coding this answer, glance at your plot above again and reflect for yourself: which do you think are the biggest and smallest categories, and how many observations do there appear to be in each? Can you tell at all?

Now let's see how close your guess is to the true answer. Make a barplot with the counts on the x-axis and the genres on the y-axis sorted by count and with the biggest bar closest to the x-axis. Don't forget to give the plot a title and appropriate axis labels.



*Save the plot in an object named `genre_bar`.*

In [None]:
genre_bar = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
genre_bar

In [None]:
t.test_3_8(genre_bar)

**Question 3.9**  
<br> {points: 1}

Now let's take a look and see if any particular studios are producing more blockbuster revenue-driving films. Then Betterflix™ can host more films coming from a particular studio. 

Similar to what you did in **Question 3.1**, create multiple rows for each of the studios from the `studios` column from the `movies_df` dataframe. 

*Save this new dataframe in an object named `movie_studios_df`.*

In [None]:
movie_studios_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
movie_studios_df.head()

In [None]:
t.test_3_9(movie_studios_df)

**Question 3.10**
<br> {points: 1}

Since we can't tell from boxplots how many observations are in each group, let's see how many films each studio produced before we plot the revenue for each studio.

Make a barplot that maps the counts on the x-axis and the studio on the y-axis sorted by count, with the biggest bar closest to the x-axis. Don't forget to give the plot a title.


*Save the plot in an object named `studio_bar`.*

In [None]:
studio_bar = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
studio_bar

In [None]:
t.test_3_10(studio_bar)

**Question 3.11** 
<br> {points: 2}

Which studios have the highest **median** revenue? 
Sort the studios in ascending order. 

*Hint: You will need to group by studio, find the median of the revenue, sort the resulting values, and convert the index to a list (`.sort_values().index.to_list()` may be helpful here).*

*Assign the sorted list to an object named `studios_by_revenue`.* 


In [None]:
studios_by_revenue = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
studios_by_revenue

In [None]:
t.test_3_11(studios_by_revenue)

**Question 3.12** 
<br> {points: 1}

Using the new dataframe `movie_studios_df` that was made in **Question 3.9**, map the studios on the y-axis and revenue on the x-axis and create a boxplot for each of the studios. 

Sort the studios using the list from `studios_by_revenue` from **Question 3.11**, so that the median lines of the box plot are nicely sorted. Don't forget to give the plot a title.


*Save the plot in an object named `rev_boxplot`.*

In [None]:
rev_boxplot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
rev_boxplot

In [None]:
t.test_3_12(rev_boxplot)

**Question 3.13** 
<br> {points: 1}

Which studio has the highest minimum revenue? These films are an excellent choice to add to Betterflix™ since films tend to "flop" less from this studio. 


*Save the studio name as a string in an object named `answer3_13`.*

In [None]:
answer3_13 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_13

In [None]:
t.test_3_13(answer3_13)

**Question 3.14** 
<br> {points: 1}

Which studio produces films that are the most unpredictable in revenue (has the greatest range in revenue)?

*Save the studio name as a string in an object named `answer3_14`.*

In [None]:
answer3_14 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_14

In [None]:
t.test_3_14(answer3_14)

**Question 3.15** 
<br> {points: 1}

Which studio has the highest median? 

*Save the studio name as a string in an object named `answer3_15`.*

In [None]:
answer3_15 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_15

In [None]:
t.test_3_15(answer3_15)

It's important to note that the data is not specifying the production year of each film. This means that some variables may not be completely telling since we are not accounting for inflation. 

One example of not having all the information is that New Line Cinema no longer produces films and [was acquired in 2008](https://en.wikipedia.org/wiki/New_Line_Cinema) by Warner Bros. 

This is one of the reasons why it's so important that you understand your data so that you can produce accurate and meaningful insights. 

# 4. Exploring Voting Average

Up until this point, we have been looking at the revenue generated from each film as a way to measure how well a film was received. In this data, we also have a variable named `vote_average` which measures how viewers rated the film collectively. Betterflix™ wants to know how people are receiving certain films. That's how we can make sure that we have a competitive edge and only host films people will enjoy!


The first thing we'll do is categorize the top 10% revenue-producing films in the dataframe as `Blockbuster` in a new `revenue_size` column and assign the result to a new dataframe named `blockbuster_df`. 

In this section, we will make a few different plot types for the same `vote_average` column, and at the end we are going to observe which plot does the best job of answering our questions **"Do films in the top 10% of revenue receive higher voting averages?"** and **"Are there some hidden gems that make less revenue, but are very well received?"**.  

Let's start by adding a new column in our dataframe that categorizes each film as a top 10% revenue producer or not. 

In [None]:
# Label films at or above the 90th percentile of revenue as 'Blockbuster'
blockbuster_df = movies_df.assign(revenue_size=np.where(movies_df['revenue'] >= movies_df['revenue'].quantile(0.90), 'Blockbuster', 'Not top 10%'))
blockbuster_df.head(10)

[There is a bug that occurs when you plot histograms containing different groups](https://github.com/vega/vega-lite/issues/6991) which can affect the placement of each group's bars (whether a certain group has its bars in front of another group's bars is not consistent). 

For this reason, we are going to order this dataframe to avoid this error. 

In [None]:
# Descending sort puts the 'Not top 10%' rows before the 'Blockbuster' rows
blockbuster_df = blockbuster_df.sort_values('revenue_size', ascending=False)
blockbuster_df.head()

**Question 4.1** <br> {points: 2}  

First, let's look at all the observations and their `vote_average` from the data using a rug plot. We also want to map the colour channel to the film's `revenue_size`. It's also a good idea to set the opacity to 0.5. Make sure to assign a proper title and axis labels. 


*Save the plot in an object named `rug_plot`.*

In [None]:
rug_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

rug_plot

In [None]:
t.test_4_1(rug_plot)

**Question 4.2** <br> {points: 2}  

Following the rug plot, make a stacked histogram that bins the voting average values on the x-axis and counts the number of films on the y-axis. 
Make sure to assign a colour channel to `revenue_size` and set `maxbins=30`. Don't forget to give the plot a title and proper axis labels.


*Save the plot in an object named `histogram_plot`.*

In [None]:
histogram_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

histogram_plot

In [None]:
t.test_4_2(histogram_plot)

**Question 4.3** <br> {points: 1}  

Let's next take `histogram_plot` from **Question 4.2** and layer the bars instead of stacking them. Set the opacity to 0.5. 
Make sure that this plot has an appropriate title and proper axis labels.


*Save the plot in an object named `layered_histogram_plot`.*

In [None]:
layered_histogram_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

layered_histogram_plot

In [None]:
t.test_4_3(layered_histogram_plot)

**Question 4.4** <br> {points: 2} 


Let's next make a density plot visualizing the film voting average column just like the previous question (you knew this one was coming). Take the `blockbuster_df` data and make a density plot using `vote_average`. Group by `revenue_size` and make sure to give the name `density_vals` to the KDE values. Give the plot an opacity of 0.5 and make sure to encode `density_vals:Q` on the y-axis and map `revenue_size` to the colour channel.  
Don't forget to give it an appropriate title and axis labels.

*Save the plot in an object named `density_plot`.*


In [None]:
density_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
density_plot

In [None]:
t.test_4_4(density_plot)

**Question 4.5** <br> {points: 2} 

Finally, for the last visualization, take the plot `density_plot` from **Question 4.4** and facet by `revenue_size` in a single column.

*Save the plot in an object named `faceted_density_plot`.*

In [None]:
faceted_density_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
faceted_density_plot

In [None]:
t.test_4_5(faceted_density_plot)

**Question 4.6** <br> {points: 2} 

Which plot best answers our original question **Do films generating revenue in the top 10% receive higher voting averages?** 

A) `rug_plot` from **Question 4.1** 

B) `histogram_plot` from **Question 4.2** 
 
C) `layered_histogram_plot` from **Question 4.3** 

D) `density_plot` from **Question 4.4** 

E) `faceted_density_plot` from **Question 4.5** 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer4_6`.*

In [None]:
answer4_6 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer4_6

In [None]:
# check that the variable exists
assert 'answer4_6' in globals(), "Please make sure that your solution is named 'answer4_6'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 4.7** <br> {points: 1} 

How would you answer this question after seeing all the plots above - **Are there some hidden gems that make less revenue, but are very well received?**
Choose the best, most complete answer from the options below. 


A) Yes, we can tell that the films with the highest average rating in the dataset are from non-blockbuster films.  

B) Yes, there are some hidden gems but overall the highest-rated films by voting average tend to be blockbuster films. 

C) No, the only top-rated films are blockbuster films. 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer4_7`.*

In [None]:
answer4_7 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer4_7

In [None]:
t.test_4_7(answer4_7)

Excellent work!! We're proud of your efforts to get Betterflix™ off the ground. This first report will really help us with our pitch for acquiring funding!

Don't worry! We are not quite ready to launch and market to customers. In the next assignment we will continue our research.

## Before Submitting 

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel, clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  

## Attributions

- MDS DSCI 531: Data Visualization I - [MDS's GitHub website](https://github.com/UBC-MDS/DSCI_531_viz-1) 
