# Data Visualization

## Assignment 4: Visualizing Distributions and Exploratory Data Analysis

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are the links to 2 videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA) and,       
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).

<div class="alert alert-info" style="color:black">
    
### Assignment Learning Goals:

By the end of the module, students are expected to:

- Create heatmaps to visualize 2D distributions
- Visualize correlations and counts of categorical variables.
- Use repeated plot grids to investigate multiple data frame columns in the same plot.

This assignment covers [Module 4](https://viz-learn.mds.ubc.ca/en/module4) of the online course. You should complete this module before attempting this assignment.
 
   
</div>

**Hint:** If you have trouble passing tests for repeat plots, note that some tests look for one way of making a repeat plot and others look for another. Try alternating between `alt.repeat()` in your encodings and `.repeat()` at the end of the chart, with and without arguments in `.repeat()`. There are many ways to make plots that look like repeat plots; the tests are looking for specific ones. Post to Piazza if you get stuck passing a test.

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Substitute the `None` and the `raise NotImplementedError # No Answer - remove if you provide an answer` with your completed code and answers then proceed to run the cell!

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 2 points.

In [None]:
# Import libraries needed for this assignment

from hashlib import sha1
import altair as alt
import pandas as pd
import test_assignment4 as t
# Handle large data sets without embedding them in the notebook
# alt.data_transformers.enable('data_server');

## 0. ACT II 

Welcome back dearest **VIZARD**! 


<img src='img/vizard.png' width=40%>


<div>Icon made by <a href="https://www.freepik.com" title="Freepik">Freepik</a> from <a href="https://www.flaticon.com/" title="Flaticon">www.flaticon.com</a></div>


In the last assignment, we left you on a bit of a cliff hanger! We started some analysis so that we could begin our new venture into the land of online streaming. We wanted to get an insight into the different movies available for our streaming platform and get a better understanding of the market before we launched our service Betterflix™.

<img src='img/betterflix.png' width=40%>


Now that we've developed the tools in our toolbox, we are going to continue on our venture to really solidify our knowledge of what we are getting into. 

**Lights, camera, action!** 🔦🎥 🎬

## 1. Prologue

Let's dive in! 

Since we want to put "good" movies on our streaming platform, we should take a look at the distribution of how the viewers have rated the films! This will tell us the range of ratings that viewers give to movies, as well as what are the most common ratings - this will allow us to see what ratings are exceptionally high, so we can add these movies to our streaming platform. 

The `vote_average` column gives the average score that reviewers awarded the film. Voters rate each film from 1 (the lowest score) to 5 (the highest).

**Question 1.1** 
    <br> {points: 1}

Read in the data `lab2-movies.json`. When thinking about what function to use to read this in, note that this is a `.json` file. 

*Name the dataframe object you create `movies_df`*
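As a hedged illustration of reading JSON with pandas, the sketch below parses a small made-up JSON string. The records-style layout and the `title` and `runtime` fields are assumptions for the example; the actual structure of `lab2-movies.json` may differ.

```python
import io

import pandas as pd

# A made-up JSON document in "records" layout (a list of objects).
json_text = '[{"title": "A", "runtime": 90}, {"title": "B", "runtime": 120}]'

# pd.read_json accepts a file path or a file-like object;
# StringIO stands in for a file on disk here.
toy_df = pd.read_json(io.StringIO(json_text))
print(toy_df)
```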

In [None]:
movies_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
movies_df

In [None]:
t.test_1_1(movies_df)

**Question 1.2** <br> {points: 2}  

Now that we have the data in Python, let's examine the distribution of the `vote_average` column by making a histogram of it. Set `maxbins` to 40 and don't forget to give the plot an appropriate title and axis labels.

*Save the plot in an object named `vote_histogram_plot`.*

In [None]:
vote_histogram_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
vote_histogram_plot

In [None]:
t.test_1_2(vote_histogram_plot)

In [None]:
t.test_titles(vote_histogram_plot)

**Question 1.3** <br> {points: 2}  

Remember that the `vote_average` column holds the average rating given by voters, who can score movies from 1 to 5. We want to make sure we are adding "good" movies to our platform.

If we wanted to add movies to our streaming site that only had a rating greater than 4, how many movies would we be putting on our site? 

A) Less than one hundred

B) Hundreds

C) Thousands 

D) Hundreds of thousands

To answer the question, assign the letter associated with the correct answer to a variable in the code cell below.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer1_3`.*

In [None]:
answer1_3 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_3

In [None]:
# check that the variable exists
assert 'answer1_3' in globals(
), "Please make sure that your solution is named 'answer1_3'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 1.4** <br> {points: 1}  

Would you be able to easily answer the above question if the visualization was a density plot? 

A) Yes

B) No

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer1_4`.*

In [None]:
answer1_4 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_4

In [None]:
t.test_1_4(answer1_4)

**Question 1.5** <br> {points: 1}  

Having this one plot is great, but it might help our analysis to see the distribution of the number of votes each film got, as well as the distributions of the other numeric columns in our data.

First, you need to extract the column names for all numeric columns except `id` and save them to a variable as a list. 

*Hint: [`.select_dtypes()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html) method can help here, or you can do it manually.*

*Save this in a list named `numeric_cols`.*
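For reference, here is a small sketch of how `.select_dtypes()` behaves on a made-up data frame. The columns below are invented for the illustration and are not the columns of the movies data.

```python
import pandas as pd

# Toy data frame mixing numeric and text columns (hypothetical names).
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "runtime": [90, 120, 101],
    "rating": [3.5, 4.1, 2.8],
})

# Keep only numeric columns, then turn the column index into a plain list.
all_numeric = toy_df.select_dtypes(include="number").columns.tolist()

# Drop an unwanted column name with a list comprehension.
without_id = [col for col in all_numeric if col != "id"]
print(without_id)
```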


In [None]:
numeric_cols = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
numeric_cols

In [None]:
t.test_1_5(numeric_cols)

**Question 1.6** <br> {points: 2}  

Use the Altair approach to repeating charts, `.repeat()`, to create a histogram for each of these numeric columns in a grid with 2 columns and 3 rows. Give these charts a height of 150 and a width of 250.

Make sure to set maxbins to 40.

It is in the `.repeat()` function where you should specify an overall title.

*Save the plot in an object named `numeric_hist_plots`.*

In [None]:
numeric_hist_plots = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
numeric_hist_plots

In [None]:
t.test_1_6(numeric_hist_plots)

In [None]:
t.test_main_title(numeric_hist_plots)

**Question 1.7** <br> {points: 1}  

Does the `vote_count` column have the same distribution shape as the `vote_average` column? 


*Answer as either "Yes" or "No" as a string in an object named `answer1_7`*. 

In [None]:
answer1_7 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_7

In [None]:
t.test_1_7(answer1_7)

Something to ponder: Would you expect `vote_count` to have a bell-shaped distribution? In other words, would it make sense for there to be fewer films with low vote counts than films with average vote counts? 

**Question 1.8** <br> {points: 1}  

When a bell-shaped distribution has a long left tail, we say it is "skewed to the left", or a **negatively-skewed distribution**. When we see this, it indicates that the data have a few very small values. One consequence (which can impact decisions we make in future machine learning or statistical analysis) is that these values drive the mean downward but do not greatly affect the median.

When a bell-shaped distribution has a long right tail, we say the opposite: it is "skewed to the right", or a **positively-skewed distribution**. When we see this, it indicates that the data have a few very large values. One consequence is that these values drive the mean upward but still do not greatly affect the median.
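The effect of a tail on the mean and the median can be checked with a few made-up numbers. This is only a toy sketch using Python's standard library; the values are invented and have nothing to do with the movies data.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]
right_skewed = symmetric + [50]   # one very large value: long right tail
left_skewed = [-50] + symmetric   # one very small value: long left tail

for name, values in [
    ("symmetric", symmetric),
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
]:
    # The extreme value moves the mean a lot and the median only a little.
    print(f"{name:>12}: mean={mean(values):.2f}, median={median(values):.2f}")
```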

Which is the case for the `runtime` distribution?

A) Normally distributed (bell-shaped)  

B) Normally distributed with a negatively-skewed distribution

C) Normally distributed with a positively-skewed distribution

D) Uniformly distributed (the same count values for all x-values)

To answer the question, assign the letter associated with the correct answer to a variable in the code cell below.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer1_8`.*


In [None]:
answer1_8 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_8

In [None]:
t.test_1_8(answer1_8)

Seeing the distributions of each of the numeric columns on their own can help Betterflix get an idea of the movies that are available to be added to our platform, but it still leaves some questions unanswered, like "What if a movie has a lot of revenue but not good ratings?"

Combining information could allow us to find movies that are high in both!

## 2. Pairwise Numerical Columns

We've only been exploring a single numeric column at a time up until now. This time, we are going to visualize 2 numeric columns in a single plot!

**Question 2.1** <br> {points: 2}  

Let's create a scatterplot matrix (SPLOM) for all numerical columns except id. This is going to be a time-intensive plot to create so we are going to create this in separate steps to help you get the idea. 

First, let's make a scatter plot that visualizes just the `vote_average` column on the y-axis with all the numeric columns from the list `numeric_cols` that you made in **Question 1.5** on the x-axis. 

You should display these plots as a single row and display them with a height and width of 120 so they can be seen without scrolling. 

*Save the plot in an object named `num_row_plot`.*

In [None]:
num_row_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
num_row_plot

In [None]:
t.test_2_1(num_row_plot)

**Question 2.2** <br> {points: 2}  

Ok so we have `vote_average` being compared with every other possible numeric column but it will be useful to see all numeric columns compared to each other and hence make that scatterplot matrix (SPLOM) we discussed. 

We can do that now: instead of mapping `vote_average` to the y-axis, we repeat the y-axis as we did for the x-axis in **Question 2.1**. 

You should display these plots with a height and width of 80 so they can be seen without scrolling. Each numeric column should appear once as a row and once as a column of a 5 x 5 grid of plots.


*Save the plot in an object named `num_matrix_plot`.*

In [None]:
num_matrix_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
num_matrix_plot

In [None]:
t.test_2_2(num_matrix_plot)

**Question 2.3** <br> {points: 2}  

This plot seems to do the job of showing us a scatter plot for all the possible pairings of the numeric columns, but it could be a lot clearer. Let's remake and clean up this plot by making a new plot from scratch and doing the following: 

- Set the opacity to 0.5.
- Set the size of the points to 10.
- Remove the necessity of displaying the zero value for both the x and y axes unless the data actually starts at zero. This will help get a better idea of the shape that's being made. You can use `scale=alt.Scale(zero=False)` for this.
- Set each subplot's height and width to 120. 
- Finally, configure the axes ([`configure_axis`](https://altair-viz.github.io/user_guide/generated/toplevel/altair.Chart.html#altair.Chart.configure_axis)) to remove the axes labels (this can be done using the argument `labels` in the `configure_axis` method). Plots don't have to look perfect during EDA; the point here is to pick up trends, not necessarily to read all labels.

*Save the plot in an object named `clean_matrix_plot`.*

In [None]:
clean_matrix_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
clean_matrix_plot

In [None]:
str(clean_matrix_plot.spec.encoding.x.axis)

In [None]:
assert not str(clean_matrix_plot.spec.encoding.x.axis) == 'Axis({\n  labels: False\n})', "Instead of removing the axes labels in the x channel, use the method 'configure_axis()' as specified in the question."
assert not str(clean_matrix_plot.spec.encoding.y.axis) == 'Axis({\n  labels: False\n})', "Instead of removing the axes labels in the y channel, use the method 'configure_axis()' as specified in the question."

In [None]:
t.test_2_3(clean_matrix_plot)

Do you notice how the diagonal plots have a perfect positive relationship? That should happen since each one plots the same variable on both axes!

**Question 2.4** <br> {points: 2}  

Are there any columns that appear to have a positive relationship with the `vote_average` column? 


To answer the question, assign the column names as strings in a list assigned to a variable named `answer2_4`. If there are no columns that are displaying a slightly positive relationship, then submit an empty list (`[]`). 

Example: 

`answer2_4 = ['runtime', 'budget', 'revenue']` 

Hint: List the best supported column(s) that correlate with vote_average.

In [None]:
answer2_4 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer2_4

In [None]:
# check that the variable exists
assert 'answer2_4' in globals(
), "Please make sure that your solution is named 'answer2_4'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.


Another way of viewing the pairwise relationships between numerical columns is to assess their ***correlation coefficients***.

A correlation coefficient measures the strength of the linear relationship between two different variables. It can take on values from -1 to 1. A positive value means that when one variable increases, the other variable tends to increase as well. A negative value means that when one variable increases, the other variable tends to decrease. If the correlation coefficient is zero, then there is no linear relationship between the two variables. 

We've calculated the correlation coefficients for you and saved them in the dataframe `corr_df`. 
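To see what a long-format correlation table looks like, here is a hedged sketch on made-up data that follows the same `corr` → `stack` → `reset_index` pattern. The columns `a`, `b`, and `c` are invented for the illustration.

```python
import pandas as pd

# Toy data: b rises with a, c falls as a rises.
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
})

# Wide format: a 3 x 3 matrix of Pearson correlation coefficients.
wide = toy_df.corr("pearson")

# Long format: one row per pair of columns. Because the matrix index and
# columns are unnamed, pandas labels the two key columns level_0 and level_1.
long = wide.stack().reset_index(name="correlation")
print(long)
```

The long format has one row per column pair, which is the shape Altair expects when each pair should become one mark in a plot.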



In [None]:
corr_df = movies_df[numeric_cols].corr('pearson').stack().reset_index(name='correlation')
corr_df.head()

**Question 2.5** <br> {points: 1}  

Use the dataframe above and the scaffolding code below to plot the calculated correlation coefficients between each of the numeric variables.

Fill in the blanks to create the plot.

In [None]:
# cc_plot = alt.Chart(...).mark_circle()....(
#     alt.X(...),
#     alt.Y(...),
#     alt.Size('correlation'),
#     alt.Color('correlation')
# ).....(...='Correlation coefficient between numeric variables')

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

cc_plot

In [None]:
t.test_2_5(cc_plot)

**Question 2.6** <br> {points: 1}  

Which two **differing** columns appear to have the highest positive linear correlation?

To answer the question, assign the column names as strings in a list assigned to a variable named `answer2_6`. 

Example: 

`answer2_6 = ['runtime', 'budget']` 

In [None]:
answer2_6 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer2_6

In [None]:
t.test_2_6(answer2_6)

**Question 2.7** <br> {points: 1}  

2D histograms can also be a good alternative when your scatter plots become oversaturated. Much as you created the scatterplot matrix (SPLOM) in **Question 2.3**, you can use similar code to make a “heatmap-like” 2D histogram instead. 

Make sure that you:

- Use the `movies_df` dataframe. 
- Set the `maxbins` parameter to 25 for both axes.
- Remove the necessity of displaying the zero value for both the x and y axes unless the data actually starts at zero. This will help get a better idea of the shape that's being made. You can use `scale=alt.Scale(zero=False)` for this.
- Set each subplot's height and width to 120. 
- Finally, configure the axes ([`configure_axis`](https://altair-viz.github.io/user_guide/generated/toplevel/altair.Chart.html#altair.Chart.configure_axis)) to remove the axes labels.


*Save the plot in an object named `hist_matrix_plot`.*

In [None]:
hist_matrix_plot = None
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
hist_matrix_plot

In [None]:
hist_matrix_plot.spec.encoding.color.aggregate

In [None]:
t.test_2_7(hist_matrix_plot)

In the last assignment, we used revenue as a way to identify potential movies that we would want to add to our platform. After looking at the above plots, we've identified a small correlation between `revenue` and `vote_average`, so it made sense to do so. We've also learned from looking at the `budget` and `vote_average` relationship that just because a movie has a large budget doesn't mean it will be a highly rated movie! For Betterflix, that means we can't plan to add a film before it's been released. We need to wait for the reviews or revenue before deciding whether a film should be added to our roster.

## 3. EDA of Numerical Columns Conditioned on a Categorical Column

In the last lab, we looked at how gross revenue was distributed across movie genres and how budget was distributed across production studios. However, it may be useful for our analysis to look at how the distributions of all the numeric columns vary by movie genre. We will then separately analyze how the numeric columns vary by studio. 


**Question 3.1** <br> {points: 1}  

Let's start with the movie genres.

We observed last time in the dataframe that each film has multiple genre categories in a list in the `genres` column. 

We used the [`.explode()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.explode.html) pandas method for this which then created additional rows, 1 for each of the genres in the list. 

Recap: Exploding multiple columns in a dataframe at a time runs the risk of duplicated rows if not mapped correctly, which could affect your analysis. Note that for these assignments we will be using `.explode()` with a single column at a time to avoid this. 
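The row-multiplying behavior of `.explode()` is easy to see on a made-up data frame. The film titles, genres, and studio names below are invented for the illustration.

```python
import pandas as pd

# Two toy films; each list-valued cell holds several categories.
toy_df = pd.DataFrame({
    "title": ["A", "B"],
    "genres": [["Action", "Comedy"], ["Drama"]],
    "studios": [["S1", "S2"], ["S1"]],
})

# Exploding one column gives one row per list element (2 + 1 = 3 rows).
by_genre = toy_df.explode("genres")

# Exploding a second column afterwards gives every genre/studio combination
# per film: film A has 2 genres x 2 studios = 4 rows, film B has 1.
by_genre_and_studio = by_genre.explode("studios")

print(len(toy_df), len(by_genre), len(by_genre_and_studio))
```

Because each film's numeric values are copied onto every exploded row, a film with many categories counts several times in any later summary, which is worth keeping in mind when reading the plots.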

Just like last time, create a new dataframe that will create rows for every existing genre in the `genres` column using the `.explode()` method.


*Save this new dataframe in an object named `movie_genres_df`.*

In [None]:
movie_genres_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
movie_genres_df.head()

In [None]:
t.test_3_1(movie_genres_df)

**Question 3.2** <br> {points: 2}  

Now we can make our multiple plots. 

Use the same strategy of repeating plots that we used earlier in this assignment to create a boxplot for all numerical values from the `movie_genres_df` dataframe. Map the genre on the y-axis and each of the numeric columns on the x-axis. Use the numeric columns from the previous exercises. Give the y-axis an appropriate label and remove the necessity of displaying the zero value for the x-axes. Display the plots in 2 columns and with a height and width of 200 and 360 respectively so that you do not need to scroll to view all the plots in their entirety. 

*Save the plot in an object named `genre_boxplots`.*

In [None]:
genre_boxplots = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
genre_boxplots

In [None]:
t.test_3_2(genre_boxplots)

Now that you have this visualization, analyze it and reflect over what you see by answering the following questions. 

**Question 3.3** <br> {points: 1}  

Which movie genre has the highest `vote_average` median?


*Save your movie genre as a string in an object named `answer3_3`*. 

In [None]:
answer3_3 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_3

In [None]:
t.test_3_3(answer3_3)

**Question 3.4** <br> {points: 2}  

Which movie genre shows the fewest revenue outliers in its boxplot?


*Save your movie genre as a string in an object named `answer3_4`*. 

In [None]:
answer3_4 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_4

In [None]:
# check that the variable exists
assert 'answer3_4' in globals(
), "Please make sure that your solution is named 'answer3_4'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.


**Question 3.5** <br> {points: 1}  

Which movie genre has the highest budget median in its boxplot?


*Save your movie genre as a string in an object named `answer3_5`*. 

In [None]:
answer3_5 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_5

In [None]:
t.test_3_5(answer3_5)

**Question 3.6** <br> {points: 1}  

Which column contains the most outliers for all movie genres?


*Save your column name as a string in an object named `answer3_6`*. 

In [None]:
answer3_6 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_6

In [None]:
t.test_3_6(answer3_6)

**Question 3.7** <br> {points: 1}  

Let's now address the `studios` column. 

In the previous assignment, we visualized the `studios` column with the `budget` column only. I think we can agree that observing all the numeric columns by studio could be insightful. 

Just like in **Question 3.1**, create a new dataframe that has a row for every studio in the `studios` column using the `.explode()` method.


*Save this new dataframe in an object named `movie_studios_df`.*

In [None]:
movie_studios_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
movie_studios_df.head()

In [None]:
t.test_3_7(movie_studios_df)

**Question 3.8** <br> {points: 1}  


Now see how the numerical columns vary with different production studios. We did this in the previous assignment but only looked at the `budget` column. 

In a similar plot to that of **Question 3.2**, create a boxplot for all numerical values from the `movie_studios_df` dataframe. Map the studio on the y-axis and each of the numeric columns on the x-axis. Give the y-axis an appropriate label and remove the necessity of displaying the zero value for the x-axes. Display the plots in 2 columns and with a height and width of 200 and 320 respectively so that you do not need to scroll to view all the plots in their entirety. 

*Save the plot in an object named `rev_boxplot`.*

In [None]:
rev_boxplot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
rev_boxplot

In [None]:
t.test_3_8(rev_boxplot)

**Question 3.9** <br> {points: 1}  

Which studio on average, seems to produce the lowest-rated films?


*Save your studio as a string in an object named `answer3_9`*. 

In [None]:
answer3_9 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_9

In [None]:
t.test_3_9(answer3_9)

**Question 3.10** <br> {points: 1}  

Which studio has the largest range of gross revenue?


*Save your studio as a string in an object named `answer3_10`*. 

In [None]:
answer3_10 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_10

In [None]:
t.test_3_10(answer3_10)

**Question 3.11** <br> {points: 1}  

Which studio has the most consistent runtime with the films it produces?


*Save your studio as a string in an object named `answer3_11`*. 

In [None]:
answer3_11 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_11

In [None]:
t.test_3_11(answer3_11)

## 4. EDA of Categorical Columns

So far, we have examined the numeric columns in our data, but that's not all the data we have! We don't know much about the relationships between pairs of categorical columns, or between categorical and numeric columns. We generally want to examine all areas of our data during exploratory data analysis, since finding a small insight such as "Animation movies tend to be rated higher on average" can be very valuable for Betterflix! We could anticipate adding more of those films even before they have been released! 

Since we are a new business, our cash flow and funding are limited. We can only afford to purchase from a few studios, since we purchase the rights to films as a package deal (this helps us spend less!). 

We want to select films from studios that carry an assortment from all genres to help please everyone.  Asking a question such as *Which studios would you want to set up a deal with due to them offering the largest variety of film genres on a count basis?* would help us with this goal. 

As the final step of our EDA, and before you hand your analysis over to the investors, let’s explore which studios have the greatest assortment of movie genres.

**Question 4.1** <br> {points: 1}  

We are now going to produce a dataframe that has a row for each studio and genre combination of every film. We need both since we are counting the combinations. This can be done using the `.explode()` method again.

Explode the `genres` column from the `movie_studios_df` object.

*Save this new dataframe in an object named `genres_studios_df`.*

In [None]:
genres_studios_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
genres_studios_df.shape

In [None]:
t.test_4_1(genres_studios_df)

**Question 4.2** <br> {points: 2}  


Using `mark_rect()` to make a sort of "heat map", map the x and y axis to the genres and studios, respectively. Map the count to the color channel. Again, give it an appropriate title and axis labels.

*Save the plot in an object named `genres_studio_heatmap`.*

In [None]:
genres_studio_heatmap = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
genres_studio_heatmap

In [None]:
t.test_4_2(genres_studio_heatmap)

In [None]:
t.test_titles(genres_studio_heatmap)

**Question 4.3** <br> {points: 2}  

Make a similar plot to the one above, mapping the x and y-axis with the genres and studios, respectively. This time, use a square mark and map the count to color and size.  Again, give it an appropriate title and axis labels.  

*Save the plot in an object named `genres_studio_plot`.*

In [None]:
genres_studio_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
genres_studio_plot

In [None]:
t.test_4_3(genres_studio_plot)

In [None]:
t.test_titles(genres_studio_plot)

**Question 4.4** <br> {points: 2}  

Which of the 2 plots made in **Question 4.2** and **Question 4.3** do you think best helps answer our question *"Which studios have the greatest assortment of movie genres?"*


A) `genres_studio_heatmap` from **Question 4.2** 

B) `genres_studio_plot` from  **Question 4.3**

To answer the question, assign the letter associated with the correct answer to a variable in the code cell below.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer4_4`.*

In [None]:
answer4_4 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer4_4

In [None]:
# check that the variable exists
assert 'answer4_4' in globals(
), "Please make sure that your solution is named 'answer4_4'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.


**Question 4.5** <br> {points: 1}  


Which genres does Marvel Studios produce? 

Add all the genres as string elements in a list named `answer4_5`. For example, if Marvel Studios produces 2 genres, 'Action' and 'Animation', your solution will look like this:

`answer4_5 = ["Action", "Animation"]`

In [None]:
answer4_5 = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer4_5

In [None]:
t.test_4_5(answer4_5)

**Question 4.6** <br> {points: 1}  

How many studios produce war films?

*Answer in the cell below to an object of type `int` called `answer4_6`.*

In [None]:
answer4_6 = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer4_6

In [None]:
t.test_4_6(answer4_6)

**Question 4.7** <br> {points: 1}  

What genre does the Canal+ studio produce the most?

*Answer in the cell below as a string in an object called `answer4_7`.*

In [None]:
answer4_7 = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer4_7

In [None]:
t.test_4_7(answer4_7)

**Question 4.8** <br> {points: 1}  

Who produces the largest count of action films?

*Answer in the cell below as a string in an object called `answer4_8`.*

In [None]:
answer4_8 = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer4_8

In [None]:
t.test_4_8(answer4_8)

**Question 4.9** <br> {points: 1}  

Let's return to our question - *Which 2 studios would you want to set up a deal with due to them offering the largest variety of film genres on a count basis?*

*Answer the studio(s) as string(s) in a list and name the object `answer4_9`.*

In [None]:
answer4_9 = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer4_9

In [None]:
t.test_4_9(answer4_9)

The plots above are great for comparing absolute counts, but that means that studios with a smaller production volume get drowned out. Let's instead visualize the proportion within each studio for each genre.

Below we have kindly calculated the proportion for each studio-genre pair, so that for each studio, the genre proportions add up to 1 (you're welcome!). 
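To see what this kind of within-group normalization does, here is a small sketch on made-up data using the same `groupby` and `value_counts(normalize=True)` pattern. The studio and genre values are invented for the illustration.

```python
import pandas as pd

# Toy data: studio S1 has three film-genre rows, studio S2 has one.
toy_df = pd.DataFrame({
    "studios": ["S1", "S1", "S1", "S2"],
    "genres": ["Action", "Action", "Drama", "Drama"],
})

# Count genres within each studio and divide by that studio's total.
toy_normalized = (
    toy_df.groupby("studios")["genres"]
    .value_counts(normalize=True)
    .reset_index(name="proportion")
)
print(toy_normalized)

# Within each studio, the proportions sum to 1.
per_studio_totals = toy_normalized.groupby("studios")["proportion"].sum()
```

In absolute counts S2 looks small next to S1, but after normalizing, its single genre accounts for its whole output.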

In [None]:
normalized_df = genres_studios_df.groupby('studios')['genres'].value_counts(normalize=True).reset_index(name='proportion')
normalized_df

**Question 4.10** <br> {points: 2}  

Take the `normalized_df` dataframe that we made above and visualize the proportions now using `mark_circle` (for a change). Make a plot mapping `genres` on the x-axis and `studios` on the y-axis and this time assigning `proportion` to both the color and size channel. Make sure you continue to give the plot an appropriate title and axis labels.

*Save the plot in an object named `normalized_plot`.*

In [None]:
normalized_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
normalized_plot

In [None]:
t.test_4_10(normalized_plot)

In [None]:
t.test_titles(normalized_plot)

**Question 4.11** <br> {points: 1}  

After looking at your answer to **Question 4.10**, is there an additional studio, beyond the ones from **Question 4.9**, that may need to be considered when making business deals for Betterflix? We are looking for a studio that can contribute the greatest variety of genres. 

*Answer the ADDITIONAL studio as a string in a list and name the object `answer4_11`.*

In [None]:
answer4_11 = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer4_11

In [None]:
t.test_4_11(answer4_11)

Another useful insight from the normalized plot above is that Marvel Studios produces mostly Action and Science Fiction films and very few Comedy ones. Before, using absolute numbers, it appeared that Marvel Studios produced approximately equal numbers of Action, Comedy and Science Fiction films. 

Dearest Vizard, you've done it! We're proud of your efforts to get Betterflix off the ground. 

We've now got a good idea of which studios we should approach with offers for Betterflix, and we know not to judge a film's likely success by its budget! This information is crucial for the success of the company.

This is excellent work.
Now that we know what you are capable of we will be sure to contact you when we launch our next project: Bestify.

## Before Submitting 

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel, clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  

## Attributions
- Gapminder dataset processed and uploaded by Joel Ostblom - [UofTCoders/workshops-dc-py](https://github.com/UofTCoders/workshops-dc-py)

- Original Gapminder data - [The Gapminder Foundation](https://www.gapminder.org/)


- MDS DSCI 531: Data Visualization I - [MDS's GitHub website](https://github.com/UBC-MDS/DSCI_531_viz-1) 
