# Homework 2: DataFrames, Data Visualization, and Functions

## Due Tuesday, April 25 at 11:59PM

Welcome to Homework 2! This week, we will cover DataFrame manipulations, making visualizations, and defining functions. You can find additional help on these topics in [BPD 6, 9-12](https://notes.dsc10.com/01-getting_started/functions-defining.html) in the `babypandas` notes and [CIT 7-7.3](https://inferentialthinking.com/chapters/07/Visualization.html) in the textbook.

### Instructions

Remember to start early and submit often. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

**Important**: For homeworks, the `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (see the schedule on the [Calendar](https://dsc10.com/calendar)) or Ed. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

**Please do not use for-loops for any questions in this homework.** If you don't know what a for-loop is, don't worry – we haven't covered them yet. But if you do know what they are and are wondering why it's not OK to use them, it is because loops in Python are slow, and looping over arrays and DataFrames should usually be avoided.

In [None]:
# Please don't change this cell, but do make sure to run it
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (10, 5)

import otter
grader = otter.Notebook()

import warnings
warnings.simplefilter('ignore')

## 1. Are You Scared Yet? Analyzing Horror Movies 🎃😱

<center><img src="./images/hocus_pocus.jpg" width = 400/></center>

Spooky season may have passed a few months ago, but it's never too late to watch horror movies! We've provided a file, `data/horror_movies.csv`, that contains information about 464 movies. For each movie, we have 10 pieces of information (see the data description below) that we'll use to generate some insights about the state of horror movies in recent years. 

| Column      | Description |
| ----------- | ----------- |
| `'Title'`      | Title of the movie       |
| `'Country'`   | Country the movie was originally released in        |
| `'Maturity Rating'` | A rating given to the movie by the Motion Picture Association |
| `'Review Rating'` | The IMDb rating of the film, representing how good it was | 
| `'Language'` | The language the movie is in | 
| `'Filming Locations'` | The location in which the movie was filmed |
| `'Budget'` | The total amount spent on the movie |
| `'Release Month'` | The month the movie was released |
| `'Release Day'` | The day the movie was released |
| `'Run Time'` | The length of the film in minutes |

Run the cell below to read the file containing all of the horror movies into a DataFrame called `horror`.

In [None]:
horror = bpd.read_csv('data/horror_movies.csv') 
horror

**Question 1.1.** Examine the columns available in `horror` and consider which would be the best choice of index for this DataFrame. Change the `horror` DataFrame so that it's indexed by the values in this column instead of the default index.

In [None]:
horror = ...
horror

In [None]:
grader.check("q1_1")

_Note:_ If you were to run the cell where you set the index of `horror` again, you'd see an error message. Stop and think about _why_ you'd run into an error. Once you've thought about it, click the thinking emoji below to see the reason for the error.

<br>

<details>
    <summary>Why would there be an error? 🤔</summary>
    There would be an error since you'd be trying to set the index of <code>horror</code> to a column that no longer exists in <code>horror</code> – the column wouldn't exist because it was converted to the index the first time you ran the cell (and the index is not a column)!
</details>

If you actually ran the cell twice and got an error message, don't worry. To get rid of it, re-run the cell above Question 1.1 where you defined the `horror` DataFrame, then run the cell in Question 1.1 just once, and you'll be good to go.

When you submit your work for autograding, the entire notebook will be run from start to finish. Each cell will run only once, so it's no problem if your code errors on the second run. In this case, it means you're doing something right!

**Question 1.2.** *Dream Nightmare*, released in 2016, is one of the three lowest-budget movies in our dataset. What is the budget of `'Dream Nightmare (2016)'`, and what is its `'Review Rating'`? Assign your answers to variables `DN_budget` and `DN_rating`, respectively.

In [None]:
DN_budget = ...
DN_rating = ...
print("The budget for Dream Nightmare was", DN_budget, "and the rating was", DN_rating)

In [None]:
grader.check("q1_2")

**Question 1.3.** Assign `lowest_rated_movie` to the name of the movie with the lowest `'Review Rating'` (including the year in parentheses), and assign `lowest_rating` to the `'Review Rating'` of that movie.

In [None]:
lowest_rated_movie = ...
lowest_rating = ...
print("The lowest-rated movie is", lowest_rated_movie, "with a rating of", lowest_rating)

In [None]:
grader.check("q1_3")

**Question 1.4.** That's a very low rating, but how does that compare to the other movies included in the dataset? First, plot a density histogram that shows the distribution of `'Review Rating'`. Then compute the difference between the lowest rating and the **median** movie rating, and assign the result to the variable `below_med`.

When plotting your histogram, remember to set `density=True` and `ec='w'`. You don't have to set the `bins` argument.

In [None]:
# Create your histogram here.

# Then calculate below_med.
below_med = ...
below_med

In [None]:
grader.check("q1_4")

**Question 1.5.** How many movies in our dataset were released in April and have been given a `'Maturity Rating'`? Note that movies without a `'Maturity Rating'` appear as either `'NOT RATED'` or `'UNRATED'`. Assign the number of such movies to the variable `apr_rated_count`.

In [None]:
apr_rated_count = ...
apr_rated_count

In [None]:
grader.check("q1_5")

**Question 1.6.** Which movie titles contain the word `'night'`, with any capitalization? Create an *array* called `night_movies` containing the titles of all such movies, capitalized exactly as they appear in the DataFrame. 

*Hints:*
- To convert a Series into an array, call the function `np.array` on the Series.
- The movie names are all strings, so they may have inconsistencies in how they're capitalized. We want to count movie titles with the words `'Night'`, `'night'`, and even `'NiGHt'`. If we want to account for variations in capitalization, what operation should we call on the movie names **first**? (You may end up using `.str` twice! [This video](https://www.youtube.com/watch?v=TCcEhVA6Euw&list=PLDNbnocpJUhbczUw2Rw6bqreEECMvZ8gN&index=2) and [this section of Lecture 6](https://dsc10.com/resources/lectures/lec06/lec06.html#How-do-we-include-songs-with-other-artists,-as-well?) may help.)

In [None]:
night_movies = ...
night_movies

In [None]:
grader.check("q1_6")

**Question 1.7.** What proportion of movies in our dataset were originally released in each country? Create a DataFrame indexed by `'Country'` with one column called `'Proportion'` containing the proportion of movies in the dataset that were released in that country. Order the rows in descending order of `'Proportion'` and assign this DataFrame to `country_proportions`.

*Hints:*
- Proportions can be easily calculated from counts.
- If you pass in a **list** of column names to `.get()`, the result will be a DataFrame containing only the columns specified in the list.

In [None]:
country_proportions = ...
country_proportions

In [None]:
grader.check("q1_7")

**Question 1.8.** Create a horizontal bar chart that displays the median `'Review Rating'` for each country. Sort the bars so that the country with the lowest median appears at the very top, and the country with the highest median appears at the very bottom.

_*Hint*_: To get the bar chart to display nicely, try adding the keyword argument `figsize=(10, 10)`.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_8
manual: True
-->

In [None]:
# Create your bar chart here.
...

<!-- END QUESTION -->



## 2. Shine Bright Like a Diamond 💎

In this section, we're going to be working alongside Jack the Jeweler to learn more about the diamond business! The data in `data/diamonds.csv` contains details about 10,000 diamonds, including the quality, dimensions, and price. The information about these diamonds can help Jack get a better understanding of what is valued most in the diamond business. The columns are described below:

| Column      | Description |
| ----------- | ----------- |
| `'carat'`      | The number of carats of the diamond       |
| `'cut'`   | The cut quality of the diamond   |
| `'color'` | The color of the diamond |
| `'clarity'` | The clarity of the diamond | 
| `'price'` | The diamond's price |
| `'x'` | The length of the diamond, in mm |
| `'y'` | The width of the diamond, in mm |
| `'z'` | The depth of the diamond, in mm |

Run the cell below to read in the data.

In [None]:
diamonds = bpd.read_csv('data/diamonds.csv')
diamonds

**Question 2.1.** One of the first things that Jack learned about when entering the diamond business was the *depth percentage* measurement. The depth percentage of a diamond is the ratio of the depth to the mean of the width and length, times 100:

$$\text{depth percentage} =\dfrac{\text{depth (in mm)}}{\text{mean of width and length (in mm)}} \cdot 100$$
    
Assign to the variable `depth_percentage` a Series with the depth percentage of each diamond in `diamonds`. Then, add a column named `'depth_percentage'` containing this Series to the `diamonds` DataFrame.
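
To see how the formula works for a single diamond, here is a small worked example in plain Python. The three measurements are made-up values chosen for illustration, not taken from `diamonds`, and the variable names are hypothetical.

```python
# Hypothetical measurements for a single diamond, in mm.
length, width, depth = 3.95, 3.98, 2.43

# The denominator of the formula: the mean of the width and length.
mean_of_width_and_length = (width + length) / 2

# Ratio of depth to that mean, times 100.
depth_pct = depth / mean_of_width_and_length * 100

print(round(depth_pct, 1))
```

This only handles one diamond at a time. In the question itself, the same arithmetic is carried out on whole columns at once.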

In [None]:
depth_percentage = ...
diamonds = ...
diamonds

In [None]:
grader.check("q2_1")

Depth percentage is important to jewelers because it determines how light refracts through the stone, which in turn affects the visual appearance of the diamond. Diamonds that are too shallow have grey rings (called "fish eyes" 🐟 ), and diamonds that are too deep have dark spots in the middle (called "nail heads" 🔨). The ideal depth percentage for a diamond is between 54 and 66 percent, inclusive. These diamonds really sparkle! ✨

<center><img src=images/depth_percentage.jpg width=500>
(<a href="https://www.ori-diamonds.com/blog/diamond-depth">source</a>)</center>

**Question 2.2.** Jack is curious as to how common ideally proportioned diamonds actually are. Calculate the proportion of diamonds that have an ideal depth percentage (between 54 and 66 percent, inclusive) and assign the result to the variable `ideal_prop`.

In [None]:
ideal_prop = ...
ideal_prop

In [None]:
grader.check("q2_2")

**Question 2.3.** Jack has been taught that the depth percentage affects the visual appearance of a diamond, which he suspects also affects the price. Create a scatter plot showing how the price of a diamond varies with its depth percentage.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_3
manual: True
-->

In [None]:
# Create your scatter plot here.
...

<!-- END QUESTION -->



Hmm... perhaps there's a bit more to diamond pricing than just depth percentage! Jack explains that there are four common measures of the quality of a diamond, sometimes called the 4 C's: `'carat'`, `'cut'`, `'clarity'`, and `'color'`.

1) The `'carat'` simply measures the weight of the diamond. 

2) The `'cut'` is related to the depth proportion, and is considered either Ideal, Premium, Very Good, Good, or Fair (in descending order of quality).

3) The `'clarity'` describes whether the diamond has any flaws. In descending order of quality, the values of `'clarity'` in our dataset are `'IF'` (which stands for "internally flawless"), `'VVS1'`, `'VVS2'`, `'VS1'`, `'VS2'`, `'SI1'`, `'SI2'`, and `'I1'`.

<br>

<center><img src=images/clarity.jpg width=400>
(<a href="https://www.petragems.com/education/diamond-clarity/">source</a>)</center>

4) The `'color'` of each diamond in our dataset is described by a letter between `'D'` and `'J'`, where `'D'` represents a diamond with no color, and `'J'` represents a diamond with some color to it. Diamonds with less color are considered higher quality.

<center><img src=images/color.jpg width=200>
(<a href="https://bashertjewelry.com/pages/diamonds-color-grading">source</a>)</center>

The `'carat'` column of `diamonds` contains numerical data, but the columns for the other 3 C's contain ordered categorical data. Since the data has an order to it, we can convert the values in those columns into numerical values, to make for easier comparisons. For example, if we assign all the values of `'J'` in the `'color'` column to 1, all the values of `'I'` in the `'color'` column to 2, etc., we'll more easily be able to search for diamonds where the color is better than an `'F'` (we could search for color values greater than 5).

For each of `'cut'`, `'clarity'`, and `'color'`, let's translate the data from categorical values to numerical values. For all three of these quality measures, we'll use the number 1 to represent the lowest quality category, and we'll count up from there by one for each category. For example, the numbers for `'clarity'` will range from 1 (for `'I1'`-rated diamonds) to 8 (for internally flawless, or `'IF'`-rated diamonds).


One way to do this conversion is to use a Python [dictionary](https://www.tutorialspoint.com/python/python_dictionary.htm).  A dictionary is a simple way to map a unique key to a value.  For example, the dictionary below maps course codes to course names.

In [None]:
dsc_courses = {
    # key: value
    'DSC 10': 'Principles of Data Science',
    'DSC 20': 'Programming and Basic Data Structures for Data Science',
    'DSC 30': 'Data Structures and Algorithms for Data Science',
    'DSC 40A': 'Theoretical Foundations of Data Science I',
    'DSC 40B': 'Theoretical Foundations of Data Science II',
    'DSC 80': 'The Practice and Application of Data Science'
}

We can access the value corresponding to each key using bracket notation.

In [None]:
dsc30_name = dsc_courses['DSC 30']
dsc30_name

Here, `'DSC 30'` is the key and `'Data Structures and Algorithms for Data Science'` is the value.
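
A dictionary lookup can also be wrapped inside a function. The sketch below uses a made-up mapping from T-shirt sizes to numbers; the names `size_nums` and `size_numerical` are hypothetical and aren't part of the homework data.

```python
# Hypothetical mapping from T-shirt sizes to numbers (not part of the homework data).
size_nums = {'S': 1, 'M': 2, 'L': 3}

def size_numerical(size):
    """Return the number that the dictionary associates with a size string."""
    return size_nums[size]

medium = size_numerical('M')   # looks up the key 'M'
print(medium)

# Looking up a key that is not in the dictionary raises a KeyError.
try:
    size_numerical('XL')
except KeyError as err:
    print('Missing key:', err)
```

Bracket notation only works for keys that exist in the dictionary, so it's worth making sure every category you expect to see appears as a key.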

Dictionaries are helpful for converting categorical values to numerical values. For example, below is a dictionary containing each category in `'clarity'` as keys and the numbers 1-8 as values.

In [None]:
clarity_nums = {
    'IF': 8,
    'VVS1': 7,
    'VVS2': 6,
    'VS1': 5,
    'VS2': 4,
    'SI1': 3,
    'SI2': 2,
    'I1': 1
}

**Question 2.4.** Create three functions, called `cut_numerical`, `clarity_numerical` and `color_numerical`, where each function takes in a string value describing the categorical quality for the `'cut'`, `'clarity'`, or `'color'`, respectively, and outputs the corresponding numerical value, as described above.

*Hint*: When implementing `clarity_numerical`, you can use the dictionary `clarity_nums` defined above; if you do so, your implementation of `clarity_numerical` should only take one line of code. When implementing the other two functions, you may want to define your own dictionaries. There is a way to implement these functions that doesn't involve dictionaries, but you'll find that the dictionary approach is much more concise.

In [None]:
def cut_numerical(cut):
    ...
    
def clarity_numerical(clarity):
    ...

def color_numerical(color):
    ...

In [None]:
grader.check("q2_4")

**Question 2.5.** Now, replace the categorical values in the `'cut'`, `'clarity'`, and `'color'` columns of `diamonds` with their numerical equivalents.

_Hint_: You can use the `.assign` method to replace values in a column, without having to create additional columns.

In [None]:
diamonds = ...
diamonds

In [None]:
grader.check("q2_5")

**Question 2.6.** One of Jack's customers comes into Jack's store asking for a diamond whose `'cut'` is `'Premium'` or better, and whose `'color'` is `'F'` or better. The customer only has $500 to spend on a diamond. Assign `customer_choices` to a DataFrame of all the diamonds in `diamonds` that fit the customer's criteria and budget.

In [None]:
customer_choices = ...
customer_choices

In [None]:
grader.check("q2_6")

**Question 2.7.** Jack wants you to find out which of the 4 C's is most closely connected to the price of a diamond. Assign an integer from 1 to 4 representing your answer to Jack's question to the variable `best_price_indicator`.

1. `'carat'`
2. `'cut'`
3. `'clarity'`
4. `'color'`

*Hint*: Use scatter plots to see the relationship of each variable with `'price'`.

In [None]:
best_price_indicator = ...

In [None]:
grader.check("q2_7")

**Question 2.8.** Jack asks you to show him the median price of a diamond as the length of the diamond (in mm) increases. Since Jack is more of a visuals type of person, he wants you to show him this trend in a graph. Create a plot that shows the trend of the median price of a diamond as the length of the diamond increases.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_8
manual: True
-->

In [None]:
# Create your plot here.
...

<!-- END QUESTION -->



## 3. Game On! 🎮

Here, we'll be working with a dataset taken from [Kaggle](https://www.kaggle.com/datasets/rush4ratio/video-game-sales-with-ratings) that contains sales information for thousands of video games, including some released this year. In the cell below, we load the dataset in as a DataFrame named `video_games`. Take some time to understand what each column of `video_games` represents, as we haven't provided you with a description of each column.

In [None]:
# Run this cell to load the dataset.
video_games = bpd.read_csv('data/video_game_ratings.csv')
video_games

**Question 3.1.** If you look at the `'votes'` column in the DataFrame, you'll notice that there are commas in some of the numbers. For example, in the second row of the DataFrame, the value in the `'votes'` column is `'36,441'`. These commas indicate that the `'votes'` column contains strings, not integers, since Python never displays integers with commas.
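
To see why the commas matter, here is a quick check you could run. It's only an illustration of the problem, using the example value quoted above; the variable names are hypothetical.

```python
votes_string = '36,441'   # the example value quoted above

try:
    int(votes_string)
    conversion_failed = False
except ValueError:
    # int() does not accept the comma as a thousands separator.
    conversion_failed = True

print(conversion_failed)
print(type(votes_string))
```

Since `int` can't handle the comma on its own, the string needs to be cleaned up before it is converted.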

Write a function `convert_votes_to_int` that takes in a string `v` as input and outputs `v` as an integer, after removing any commas. 

Then, use your function to update the `'votes'` column in the `video_games` DataFrame so that it contains integers rather than strings. Make sure to "save" your changes in the `video_games` DataFrame!

In [None]:
def convert_votes_to_int(v):
    ...

In [None]:
video_games = ...
video_games

In [None]:
grader.check("q3_1")

**Question 3.2.** You are curious as to whether there is a relationship between the number of votes a game receives and the rating of the game. Create an appropriate plot that shows the relationship between these two variables.

In [None]:
# Create your plot here.
...

Now, use the plot you made to answer the following question: 

> What type of ratings do video games with a higher number of votes tend to have?

Assign an integer from 1 to 3 representing your answer to the variable `q3_2`.

1. Video games with a higher number of votes tend to have higher ratings.
2. Video games with a higher number of votes tend to have lower ratings.
3. There is no association between number of votes and rating.

In [None]:
q3_2 = ...
q3_2

In [None]:
grader.check("q3_2")

**Question 3.3.** Assign `most_common_genres` to a DataFrame that contains the ten most common genres of video games, in descending order. The DataFrame should be indexed by `'genre'` and have only one column, `'count'`, which is the number of video games in that genre.

*Note:* For this question, we will treat each video game as having only one genre. For example, `'Action, Adventure, Drama'` is considered to be its own genre.

In [None]:
most_common_genres = ...
most_common_genres

In [None]:
grader.check("q3_3")

**Question 3.4.** Using the `most_common_genres` DataFrame you created in Question 3.3, create a horizontal bar chart that shows the distribution of video games into these ten genres. Make sure your plot has the most common genre as the top-most bar in the bar chart.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3_4
manual: true
-->

In [None]:
# Create your plot here.
...

<!-- END QUESTION -->



**Question 3.5.** Assign the variable `third_lowest` to the genre of video games with the third lowest average rating (among all genres, not just the ones you looked at in Questions 3.3 and 3.4).

Do not manually type out your answer. Use `babypandas` methods to produce the answer.

*Note:* Again, we will consider a video game with multiple genres to have only one genre. For example, `'Action, Adventure, Drama'` is considered to be its own genre.

In [None]:
third_lowest = ...
third_lowest

In [None]:
grader.check("q3_5")

**Question 3.6.** Create a histogram showing the distribution of video game ratings in the `video_games` DataFrame.

Remember to set `density=True` since we always use density histograms and `ec='w'` to make the separation of the bars more clear. You don't have to set the `bins` argument.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3_6
manual: true
-->

In [None]:
# Create your plot here.
...

<!-- END QUESTION -->

**Question 3.7.** There is one director who has directed exactly 27 video games **alone** (with no co-directors).

Below, assign `director_of_27` to the name of this director. Do not manually type out the director's name. Instead, use `babypandas` methods to extract the name.

<!--
BEGIN QUESTION
name: q3_7
-->

In [None]:
director_of_27 = ...
director_of_27

In [None]:
grader.check("q3_7")

## 4. Let's TED Talk 💡🎤

TED Talks (Technology, Entertainment, and Design) are short, powerful presentations that cover a wide range of topics, delivered by experts in their fields. In 2019, a few UCSD alumni and students were even featured speakers at TEDxUCSD. Over the years, TED Talks have become an important platform for the sharing and culmination of ideas, and are viewed by millions daily. There will even be another TEDxUCSD event in May 2023!

<img src="./images/TED-UCSD.PNG" width=350/>

We have a dataset of TED Talks on YouTube from [Kaggle](https://www.kaggle.com/datasets/purnasaikirank/ted-talks-youtube?resource=download). First, we'll read in the data from a CSV. There is no good index, so we will leave it unset.

In [None]:
ted_data = bpd.read_csv('data/ted_main.csv')
ted_data

**Question 4.1.** You'll notice that the values in the `'published_date'` column are really large integers. It turns out they're stored as Unix timestamps, which measure the number of seconds that have elapsed since midnight on January 1st, 1970. So, January 1st, 1970, at 00:00:00 UTC is time 0. (If you're curious, [this site](https://www.unixtimestamp.com) shows a live "Unix timestamp clock.")
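
If you'd like to convince yourself of how Unix timestamps count, the sketch below uses Python's built-in `datetime` module to display two small timestamps as dates in UTC. It's just meant to illustrate the idea of "seconds since 1970", and the variable names are hypothetical.

```python
from datetime import datetime, timezone

# Timestamp 0 is the start of the Unix epoch.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)
print(epoch)

# One day has 24 * 60 * 60 = 86400 seconds, so this is exactly one day later.
seconds_per_day = 24 * 60 * 60
next_day = datetime.fromtimestamp(seconds_per_day, tz=timezone.utc)
print(next_day)
```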

Define a function `timestamp_to_year` that takes a Unix timestamp as input, like the values listed in the DataFrame above, and returns the corresponding year as an integer.

*Note*: Don't worry about leap years or leap seconds here.

In [None]:
def timestamp_to_year(timestamp):
    ...

In [None]:
grader.check("q4_1")

**Question 4.2.** Use your `timestamp_to_year` function and the `.apply` method to convert all of the timestamps in the `'published_date'` column of `ted_data` into their correct year. Do this without creating an additional column or reordering the existing columns. Assign the resulting DataFrame to the variable name `ted`.

In [None]:
ted = ...
ted

In [None]:
grader.check("q4_2")

🚨 **Important**: For the rest of the questions in this section, use the DataFrame `ted` instead of `ted_data`.

**Question 4.3.** 
Define a function named `clean_title` that takes as input a string from the `'title'` column of `ted` and returns the title of the TED talk, without the speaker's name included. Example behavior is shown below.

```py
>>> clean_title('Ken Robinson: Do schools kill creativity?')
'Do schools kill creativity?'
```

Once you have created the function, use the `.apply` method to apply the function on all elements of the `'title'` column in `ted`. Do not create a new column or a new DataFrame.


_*Hint*_: The string method [`.split`](https://docs.python.org/3/library/stdtypes.html#str.split) will be helpful.
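
As a quick refresher on what `.split` returns, here is a small sketch on a made-up string (not a TED title). The optional second argument, which limits the number of splits, is part of Python's built-in `str.split`.

```python
colors = 'red,green,blue'   # hypothetical example string

parts = colors.split(',')               # split at every comma
print(parts)

first_and_rest = colors.split(',', 1)   # split at most once
print(first_and_rest)
```

In both cases the result is a list of strings, so you can pull out individual pieces with bracket notation.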

In [None]:
def clean_title(title):
    ...
    
ted = ...
    
# Test cases for your own reference. Feel free to test out more!
print(clean_title('Ken Robinson: Do schools kill creativity?'))  # Should print 'Do schools kill creativity?'
print(clean_title("Hans Rosling: The best stats you've ever seen")) # Should print 'The best stats you've ever seen'

In [None]:
grader.check("q4_3")

**Question 4.4.** We'll say a talk's title is a question if the character `'?'` appears anywhere in the title. Add a column to `ted` named `'is_question'` that contains the value `True` for talks whose titles are questions and `False` for talks whose titles aren't questions. Save the resulting DataFrame as `ted_with_question`; **don't** modify the current `ted` DataFrame, otherwise you may start to fail some test cases you're currently passing.

*Hint*: If you try to check whether a title contains `'?'` using the same method you used in Question 1.6, you'll run into an error. Instead of using just `'?'`, you'll need to use `'\?'`. When using the Series method from Question 1.6, the `'?'` is interpreted as a special character; by using `'\?'` as the input to that Series method, we're telling Python to find all titles that contain a literal question mark. (If you're curious, the technical term for this is "escaping" the `'?'` character.)
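
If you're curious what "special character" means here, the sketch below shows the same behavior using Python's built-in `re` module on a single string. It assumes that the Series method treats its input as a regular expression, which is the default behavior of the corresponding `pandas` method; the variable names are hypothetical.

```python
import re

title = 'Do schools kill creativity?'   # title from the example in Question 4.3

# An unescaped '?' is a regex quantifier with nothing to repeat, so the pattern is invalid.
try:
    re.search('?', title)
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Escaping it makes the pattern match a literal question mark.
escaped_match = re.search(r'\?', title)

print(unescaped_ok)
print(escaped_match is not None)
```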

In [None]:
...

In [None]:
grader.check("q4_4")

**Question 4.5.** Using the `'is_question'` column that you created in Question 4.4, calculate the mean number of views for titles that are questions and for titles that are not questions. Store the result for titles with a question in a variable called `mean_views_with_question`, and the result for titles without a question in a variable called `mean_views_without_question`.

In [None]:
mean_views_with_question = ...
mean_views_without_question = ...

print('Average views of talks with questions in the title:', mean_views_with_question)
print('Average views of talks without questions in the title:', mean_views_without_question)

In [None]:
grader.check("q4_5")

**Question 4.6.** Create a horizontal bar chart that displays the mean views for each of the **top 20 TED Talk events** in the dataset. Sort the bars so that the event with the highest mean views appears at the very top, and the event with the lowest mean views appears at the very bottom.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q4_6
manual: true

-->

In [None]:
# Create your plot here.
...

<!-- END QUESTION -->



So far, we haven't used the `'ratings'` column in `ted_with_question` at all for our analysis. The values in the `'ratings'` column appear to be formatted strangely:

In [None]:
first_rating_string = ted_with_question.get('ratings').iloc[0]
first_rating_string

In [None]:
type(first_rating_string)

If we look closely, we see that each value in the `'ratings'` column looks like a list of dictionaries! While these values _look_ like lists, they are actually strings.

Conveniently, it turns out there's a function built into Python called `eval` that takes in a string that contains a Python expression and evaluates that expression. We use it below.

In [None]:
eval("np.array([1, 2, 3]) + np.array([4, 5, 6])")

In [None]:
eval("ted.shape[0]")

In the two examples above, `eval` seemed to make things more complicated, not less complicated. However, `eval` can help turn the values in the `'ratings'` column, which are strings (of lists, of dictionaries), into actual lists.

For example:

In [None]:
first_rating_list = eval(first_rating_string)
first_rating_list

In [None]:
type(first_rating_list)

Now it's a bit more clear as to how these lists are formatted. Each individual dictionary corresponds to a different tag that a video received, e.g. `'Funny'` or `'Persuasive'`. The associated `'count'` values represent the number of votes, or ratings, that video received for the corresponding tag. For instance, the first talk in the dataset received 10704 votes for the `'Persuasive'` tag.

Below, we've defined a function that takes in a single value from the `'ratings'` column and returns a single dictionary (not a list of dictionaries) corresponding to the most common tag for that video. You don't need to understand how the function works.

In [None]:
def most_common_tag_dict(rating_str):
    rating_list = eval(rating_str)
    rating_list_sorted = sorted(rating_list, key=lambda x: x['count'])
    return rating_list_sorted[-1]
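
If you are curious about the sorting step, here is a minimal sketch of the same idea on a made-up list of dictionaries. The tag names and counts are hypothetical, as is the assumption that each dictionary has a `'name'` key.

```python
# Hypothetical list of dictionaries, shaped like one parsed 'ratings' value.
tags = [
    {'name': 'Funny', 'count': 12},
    {'name': 'Inspiring', 'count': 40},
    {'name': 'Persuasive', 'count': 7},
]

# Sort in ascending order of 'count'; the last element has the largest count.
tags_sorted = sorted(tags, key=lambda tag: tag['count'])
largest = tags_sorted[-1]
print(largest)
```

The `key` argument tells `sorted` which value to compare for each element, and the index `-1` picks out the last element of the sorted list.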

For example, in the string below, the tag with the most votes is `'Inspiring'`:

In [None]:
first_rating_string

And so:

In [None]:
most_common_tag_dict(first_rating_string)

**Question 4.7.** Complete the implementation of the function `most_common_tag_name`, which takes in a value from the `'ratings'` column of `ted_with_question` and returns the name of the most common tag as a string. For instance, `most_common_tag_name(first_rating_string)` should return `'Inspiring'`.

Then, assign `ted_final` to a DataFrame with all of the same columns as `ted_with_question`, with an additional column named `'most_common_tag'` containing the most common tag name for each talk.

_*Hint*_: Most of the work has already been done for you – you should use `most_common_tag_dict` in your implementation of `most_common_tag_name`.

In [None]:
def most_common_tag_name(rating_string):
    ...
    
ted_final = ...
ted_final

In [None]:
grader.check("q4_7")

**Question 4.8.** Finally, create a plot that depicts the distribution of the `'most_common_tag'` column.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q4_8
manual: true
-->

<!-- END QUESTION -->



Are you inspired? What would you give a talk about at TEDxUCSD if given a chance?

If you're curious, in the cell below, create a plot that depicts the average number of views each `'most_common_tag'` received. Which `'most_common_tag'` is the most popular, on average? 😱

<div class="alert alert-block alert-danger">
    In this question, we used the Python <code>eval</code> function out of necessity. In general, it's a good idea to <b>avoid</b> the <code>eval</code> function. This is because it's possible to call it on an input that looks safe, but contains malicious code. If you're really curious, watch starting around 9 minutes in <a href="https://podcast.ucsd.edu/watch/wi23/dsc80_a00/15">this video</a> – you can see an example from another data science course where we call the <code>eval</code> function and lose all of our files!
</div>
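
For reference, Python's standard library includes `ast.literal_eval`, which only accepts literals such as strings, numbers, lists, and dictionaries, and refuses to run anything else. The sketch below is one possible alternative, not what this homework uses; the rating string is a made-up value shaped like the ones in the `'ratings'` column.

```python
import ast

# Hypothetical string shaped like a value in the 'ratings' column.
rating_str = "[{'name': 'Funny', 'count': 10}, {'name': 'Inspiring', 'count': 25}]"

rating_list = ast.literal_eval(rating_str)
print(type(rating_list), len(rating_list))

# A string containing a function call is rejected instead of being executed.
try:
    ast.literal_eval("print('this would run under eval')")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```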

## 5. Final Stretch 🧘‍♀️

Suppose we have a DataFrame called `data` with two numerical columns, `'x'` and `'y'`. Consider the following scatter plot, which was generated by calling `data.plot(kind='scatter', x='x', y='y')`:

<img src="images/q4_scatter_plot.png" width=400/>

Now consider these two histograms:

<center>
    <table><tr>
        <td><center><b>Histogram A</b><br> <img src="images/q4_histogram_one.png" width=400></center> </td>
        <td><center><b>Histogram B</b><br> <img src="images/q4_histogram_two.png" width=400></center> </td>
    </tr></table>
</center>

**Question 5.1.** Which of the following lines of code generated **Histogram B**? Assign either `1`, `2`, `3`, or `4` to `which_code`.
 1. `data.plot(kind='hist', density=False, y='x')`
 2. `data.plot(kind='hist', density=False, y='y')` 
 3. `data.plot(kind='hist', density=True, y='x')`
 4. `data.plot(kind='hist', density=True, y='y')`

In [None]:
which_code = ...

In [None]:
grader.check("q5_1")

**Question 5.2.** Suppose we run this block of code:

```py
new_data = bpd.DataFrame().assign(
    x = data.get('x') / 3,
    y = data.get('y')
)
```
    
We then run 

`new_data.plot(kind='hist', density=True, y='x')`.

How will this histogram look compared to the histogram created by 

`data.plot(kind='hist', density=True, y='x')`, 

assuming both histograms are drawn on the same axes? Assign `histogram_difference` to either 1, 2, 3, or 4, corresponding to your choice.

1. The `new_data` histogram will be narrower and taller than the `data` histogram.
2. The `new_data` histogram will be narrower and shorter than the `data` histogram.
3. The `new_data` histogram will be wider and taller than the `data` histogram.
4. The `new_data` histogram will be wider and shorter than the `data` histogram.

_*Hint*_: Look at the end of [Lecture 7](https://dsc10.com/resources/lectures/lec07/lec07.html#Plotting-overlaid-histograms) for an example of two histograms drawn on the same axes.

In [None]:
histogram_difference = ...

In [None]:
grader.check("q5_2")

**Question 5.3.** Below, we show Histogram A again.

<img src="./images/q4_histogram_one.png" width=400/>

What **percent** of values in Histogram A are between -2 (inclusive) and 0 (exclusive)? While we cannot answer this question exactly since we do not know where the bins start and end, we can still approximate the answer. Assign the variable `percent_between` to a number 1 through 5, corresponding to the closest answer.

1. 22% 
2. 27% 
3. 34%
4. 41%
5. 48%

In [None]:
percent_between = ...

In [None]:
grader.check("q5_3")

## Finish Line 

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.
5. Stick around while the Gradescope autograder grades your work. Make sure you see that all tests have passed on Gradescope.
6. Check that you have a confirmation email from Gradescope and save it as proof of your submission. 

With homeworks, unlike with labs, the grade you see on Gradescope is **not your final score**. We will run correctness tests after the assignment's due date has passed.

In [None]:
grader.check_all()