In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab05.ipynb")

<img src="data6.png" style="width: 15%; float: right; padding: 1%; margin-right: 2%;"/>

# Lab 5 – Advanced Table Methods

## Data 6, Fall 2024

Today we will explore some more complex table methods! The `apply` and `join` methods both let us perform new kinds of queries on our familiar tables. The key takeaway from this lab is understanding not only *how* each method works, but also *why* and *when* to use it; by the end, we will be able to query tables in some pretty cool ways!

These new methods allow us to perform operations we could not before. As such, it is becoming more and more important to remember where each method fits on our Data Science toolbelt. We should think of each new method as a **tool that serves a specific purpose**. Your job as a data scientist is not only to understand what each tool does, but also to recognize when each tool is applicable in new situations!

In [None]:
from datascience import *
import numpy as np

import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline

import warnings
warnings.simplefilter('ignore')

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Part 1: Data Context Exploration

When working with data, it is always important to consider not only <strong>the impacts that the representation of the data can have in the real world</strong>, but also <strong>the effects that the conclusions from your data analysis and visualizations can have!</strong>

### A Look Into the `Gender` Column

Today's data comes from UC Riverside. The [source of the original data](https://ir.ucr.edu/stats/enrollment/demographic) includes a dataset that displays admissions data, broken down by college. An interesting question to consider is why UC Riverside would make this breakdown available alongside the original dataset for the whole university.

Let's load the dataset into a Table object below called `ucr_college`.

In [None]:
ucr_college = Table.read_table("riverside_by_college.csv")
ucr_college.show(5)

As we can see from the `"Year"` column, this dataset also contains slightly more recent information, including data from 2023. Let's take a look at the way the data is represented in the `"Gender"` column for this dataset:

In [None]:
np.unique(ucr_college.column("Gender"))

<!-- BEGIN QUESTION -->

### Question 1.1 (Discussion)
It looks like this dataset contains the additional values of `"Nonbinary"` and `"Unknown"` for `"Gender"`. Take a look at the "Details about the data" section [at the bottom of the website.](https://ir.ucr.edu/stats/enrollment/demographic) How was this particular variable measured? How might this impact our comparison between the conclusions we come to with this data vs. the data we previously worked with?

By either looking at the details of the data or the ["Data Dictionary and Reporting Methodology"](https://ir.ucr.edu/sites/default/files/2019-07/IR_Data_Dictionary_1.pdf), find one other interesting way that UCR has chosen to measure their variables OR an interesting way UCR has chosen to capture concepts using certain variables and list it below.

*Hint:* Think about what reducing the number of categories would mean.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

Doing this type of extra basic background research on the source of our data can be incredibly helpful in our understanding of our data!

---
### Exploration of Aggregation by College

> "STEM Majors include those in the Bourns College of Engineering (BCOE) and the College of Natural and Agricultural Sciences (CNAS)."

Using this information, we can use the `where` method to create two different tables: one for STEM majors, and one for non-STEM majors. In the cell below, we generate the list of distinct college names at UC Riverside. As we see in the statement above, we can classify all non-STEM colleges as those in the list below, aside from the College of Engineering and the College of Natural and Agricultural Sciences.

In [None]:
np.unique(ucr_college.column("College"))

### Question 1.2
In the two cells below, use the `where` method to create two different tables: one for STEM majors (`ucr_stem`), and one for non-STEM majors (`ucr_nonstem`), based on which college they belong to.

In [None]:
ucr_stem = ...
ucr_stem.show(3)

In [None]:
ucr_nonstem = ...
ucr_nonstem.show(3)

In [None]:
grader.check("q1_2")

Now, if we wanted to do a rough tabular comparison of the STEM vs. non-STEM majors, we can group by `"Gender"` and compare the numbers.

For now, don't worry too much about understanding the syntax for the `group` method: we'll cover it in the next lab! The code in the two cells below simply aggregates the data by calculating the sum of the `"Fall Headcount"` column for each gender category, then sorts the categories from largest to smallest total.
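If it helps to picture what "summing a column for each category" means, here is a minimal plain-Python sketch of the idea. It does not use the `datascience` library, and the rows are made-up placeholder values (not taken from the UCR dataset); it only mirrors the group-then-sort steps described above.

```python
from collections import defaultdict

# Hypothetical rows of (category, headcount); these are NOT real UCR numbers.
rows = [
    ("A", 10),
    ("B", 4),
    ("A", 5),
    ("C", 7),
    ("B", 1),
]

# Add up the headcounts for each category, one row at a time.
totals = defaultdict(int)
for category, headcount in rows:
    totals[category] += headcount

# Order the categories from largest total to smallest,
# similar in spirit to sort(..., descending=True).
ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Each category ends up with exactly one entry, no matter how many rows it had in the original data.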

In [None]:
ucr_stem_grouped = ucr_stem.group("Gender", np.sum).select("Gender", "Fall Headcount sum").sort("Fall Headcount sum", descending=True)
ucr_stem_grouped 

In [None]:
ucr_nonstem_grouped = ucr_nonstem.group("Gender", np.sum).select("Gender", "Fall Headcount sum").sort("Fall Headcount sum", descending=True)
ucr_nonstem_grouped 

<!-- BEGIN QUESTION -->

### Question 1.3 (Discussion)
From the above two tables, we can see an interesting discrepancy that we see quite often with this type of education-related data. What do you see when we compare these two tables?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 1.4
Below, create a joined version of the headcounts split by gender, with one column containing the fall headcounts for STEM colleges and another containing the fall headcounts for non-STEM colleges.

To differentiate between STEM and non-STEM majors, label the STEM headcount "Fall STEM Headcount" and non-STEM headcount "Non-STEM Fall Headcount". You may find the `relabeled` table method useful here.

In [None]:
ucr_joined_gender = ...
ucr_joined_gender

<!-- END QUESTION -->

We have previously looked into some aspects of the students' race/ethnicities, but now we can dig a little deeper by performing the same process of grouping our data and comparing the distribution between the STEM and non-STEM colleges. We start by creating the tables split by STEM and non-STEM, and then build the joined version with the headcounts of both categories.

In [None]:
ucr_stem_grouped_re = ucr_stem.group("IPEDS Race/Ethnicity", np.sum).select("IPEDS Race/Ethnicity", "Fall Headcount sum").sort("Fall Headcount sum", descending=True)
ucr_stem_grouped_re 

In [None]:
ucr_nonstem_grouped_re = ucr_nonstem.group("IPEDS Race/Ethnicity", np.sum).select("IPEDS Race/Ethnicity", "Fall Headcount sum").sort("Fall Headcount sum", descending=True)
ucr_nonstem_grouped_re 

<!-- BEGIN QUESTION -->

### Question 1.5
Below, create a joined version of the headcounts split by IPEDS Race/Ethnicity, with one column containing the fall headcounts for STEM colleges and another containing the fall headcounts for non-STEM colleges.

In [None]:
ucr_joined_re = ...
ucr_joined_re.show(5)

<!-- END QUESTION -->

Going back to an idea we posed when starting to look into the `"Gender"` column: while this type of exploration is very important and can also be informative, it is also important to remember that it is not always enough to simply look at a trend like this and state it. Ultimately, it is typically very hard to encode data about humans into numbers and categories, **because in doing so, we lose information and context about the individual we are looking at.** In your analysis in the future, try to strike a balance of looking at interesting trends in the data and considering the original context of the data you're working with.

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Part 2: Table Methods with Movies

In this section, we'll be working with a `movies` data set that contains information about various American films over time. It contains the following columns:
1. `"Film"`: The name of the movie
2. `"Genre"`: The genre of the movie
3. `"Year"`: The year the movie was released
4. `"Lead Studio"`: The primary movie studio responsible for producing the movie
5. `"Audience score %"`: The score, out of 100%, given to the movie by viewers
6. `"Rotten Tomatoes %"`: The score, out of 100%, given to the movie by the website [Rotten Tomatoes](https://www.rottentomatoes.com/)
7. `"Worldwide Gross (Millions)"`: The total gross revenue, in millions of dollars, that the movie made
8. `"Quality"`: Descriptive ranking of the movie based on audience score

Let's load the dataset into a new table, `movies`, to get practice with the more advanced table methods.

In [None]:
movies = Table.read_table("movies.csv")
movies.show(5)

## The [apply](http://data8.org/datascience/_autosummary/datascience.tables.Table.apply.html#datascience.tables.Table.apply) method

The `apply` method allows us to map a function's behavior onto an entire column of a table. We can use built-in Python functions (like `max`) or we can define our own functions and then *apply* those functions to the columns of a table.

The `apply` method takes at least 2 arguments. The first is a function, and the rest are as many column labels as that function needs. The number of columns you specify depends on the number of arguments the function takes. For example, if the function you provide needs two inputs, you need to list two columns for it to work on.

`apply` returns a NumPy array of the transformed values. We can ask questions like "How can I categorize the items in this column?" (like converting grade percentages into letter grades from lecture). We can also make modifications to a table, like rounding all the values in a column to a certain accuracy.
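To make the "one function call per row" idea concrete, here is a rough plain-Python sketch. The helper `apply_sketch`, the grade cutoffs, and the toy columns are all hypothetical and exist only for illustration; this is not how the `datascience` library implements `apply`, and unlike the real method it returns a list rather than a NumPy array.

```python
def letter_grade(percent):
    """Turn a percentage into a letter grade (hypothetical cutoffs)."""
    if percent >= 90:
        return "A"
    elif percent >= 80:
        return "B"
    return "C or below"


def total_cost(quantity, unit_price):
    """Multiply a quantity by a price per item."""
    return quantity * unit_price


def apply_sketch(columns, func, *labels):
    """Call func once per row, passing that row's values from the named columns."""
    selected = [columns[label] for label in labels]
    return [func(*row_values) for row_values in zip(*selected)]


# A toy "table": each key is a column label, each list is a column.
toy = {
    "Score": [95, 82, 67],
    "Quantity": [2, 1, 3],
    "Unit Price": [5, 10, 4],
}

# A one-argument function needs one column label...
grades = apply_sketch(toy, letter_grade, "Score")
# ...and a two-argument function needs two.
costs = apply_sketch(toy, total_cost, "Quantity", "Unit Price")
print(grades)
print(costs)
```

Notice that the order of the column labels matters: the first label feeds the function's first argument, the second label feeds its second argument, and so on.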

Let's use `apply` to take the average of the two score percentages, `"Audience score %"` and `"Rotten Tomatoes %"`.

<!-- BEGIN QUESTION -->

### Question 2.1

In [None]:
def average_score(audience_score, rt_score):
    """Computes the average score between the audience score and Rotten Tomatoes score
    Inputs: 
    audience_score: an integer representing the audience score
    rt_score: an integer representing the Rotten Tomatoes score

    Returns:
    A number representing the average of the two scores
    """
    return ...

In [None]:
average_scores = ...
average_scores

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.2

Now, let's add a new column called `"Average score %"` and populate it with the information we just assigned to `average_scores`. We'll re-assign this new table to `movies`.

In [None]:
movies = ...
movies.show(5)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.3

Explain what the error is with the code below:

In [None]:
movies.apply(average_score, "Genre", "Lead Studio")

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.4
Fill in the function `convert_to_dollars` which converts a dollar amount from *millions of dollars* to *dollars*. Then, use the `apply` method to convert all values from the `"Worldwide Gross (Millions)"` column into dollars. Finally, create a new column in the `movies` table called `"Worldwide Gross"` using the array resulting from your call to `apply`.

*Note*: The code for this question requires several steps. Feel free to create new cells to experiment!


In [None]:
def convert_to_dollars(dollar_millions):
    """Converts a dollar amount from millions of dollars to dollars
    Inputs:
    dollar_millions: an integer representing the number of dollars in millions

    Returns:
    an integer representing the number of dollars
    """
    ...

dollars = ...
movies = ...
movies

<!-- END QUESTION -->

## The [join](http://data8.org/datascience/_autosummary/datascience.tables.Table.join.html#datascience.tables.Table.join) method

The last method we will talk about in this lab is the `join` method. This method allows us to combine two different tables together!

The `join` method takes in 2 mandatory arguments and 1 optional argument:

| **Argument** | **Description** |
| --- | --- |
| `column_label` | a column to use to join |
| `other` | another table |
| *Optional:* `other_label` | `other`'s label to join on (if not the same as `column_label`) |

If `other` shares a column label with the table you are calling `join` on, and that shared label is the one you want to join on, then you do not need the optional argument. If you want to join on a column that is labeled differently in `other`, or if the two tables have no column label in common, use the optional `other_label`.
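As a mental model for how these three arguments interact, here is a simplified plain-Python sketch. The function `join_sketch` and the `books` and `writers` tables are hypothetical stand-ins (tables are represented as lists of dictionaries); the real `join` method does more work than this, such as handling clashing column labels, so treat it as an illustration rather than the library's implementation.

```python
def join_sketch(left, column_label, right, other_label=None):
    """Pair every row of left with every matching row of right."""
    # If no other_label is given, assume both tables use the same label.
    if other_label is None:
        other_label = column_label

    joined = []
    for left_row in left:
        for right_row in right:
            if left_row[column_label] == right_row[other_label]:
                # Keep right's remaining columns; its join column would just
                # repeat the value already stored under column_label.
                extra = {k: v for k, v in right_row.items() if k != other_label}
                joined.append({**left_row, **extra})
    return joined


books = [
    {"Title": "Dune", "Author": "Herbert"},
    {"Title": "Emma", "Author": "Austen"},
]
writers = [
    {"Writer": "Austen", "Born": 1775},
    {"Writer": "Herbert", "Born": 1920},
]

# The matching values live under different labels, so other_label is needed.
result = join_sketch(books, "Author", writers, "Writer")
for row in result:
    print(row)
```

Calling `join_sketch(books, "Author", writers)` without the third argument would fail in this sketch, because `writers` has no `"Author"` label to look up.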

The way join works takes some getting used to, so let's look at some examples of `join` at work!

We have the `dogs` and `owners` tables below. Take a moment to look at them before we move on, so you understand what data they contain:

In [None]:
dogs = Table().with_columns(
    "Name", np.array(["Shefran", "Hero", "Fluffy", "Doge"]),
    "Breed", np.array(["Bichon Frise", "Shih-poo", "Corgi", "Coin"]),
    "Owner", np.array(["Su Min", "Elizabeth", "Isabella", "Elizabeth"]),
)
dogs

In [None]:
owners = Table().with_columns(
    "Owner", np.array(["Su Min", "Kenneth", "Isabella", "Elizabeth"]),
    "Owner Age", np.array([6, 17, 3, 18])
)
owners

As you can see, we have a column in common: `Owner`. Let's join these two tables together so that we can have all the doggy data in one place!

In [None]:
doggy_data = dogs.join("Owner", owners)
doggy_data

This table now has all of our information in one place, which makes using it easier!

Now let's take a look at a more common example, where the column labels being named differently can cause a problem. We will use the exact same `dogs` and `owners` tables, but we will change a column label on `owners`:

In [None]:
owners_new_label = owners.relabeled("Owner", "Name")

display(dogs, owners_new_label)

Now if we try to use the `join` method like we did last time, we run into an issue...

In [None]:
doggy_data = dogs.join("Owner", owners_new_label)
doggy_data

Because the `owners_new_label` table does not have a column labeled `Owner`, we may try to use the one column label the two tables do have in common: `Name`...

In [None]:
doggy_data = dogs.join("Name", owners_new_label)
doggy_data

...but this doesn't appear to work either. This is not an error; there is simply no table output by this `join` call!

This `join` call does not do what we want, because `Name` in `dogs` is the **dog's** name, but `Name` in `owners_new_label` is the **owner's** name! No dog shares a name with any owner, so there are no matching rows to output!

Instead, we have to make sure we join on the `Owner` column from `dogs` and the `Name` column from `owners`! We can do this using the third *optional* argument in the `join` method:

In [None]:
doggy_data = dogs.join("Owner", owners_new_label, "Name")
doggy_data

In this table, the `Name` column now refers to the name of the dog, and the `Owner` column corresponds to the name of the owner!

### Question 2.5 (A Slightly Different `Join`)
The `join` method can also change the number of rows in its output. If there are multiple rows in one table that match with one row in the other, the `join` method will include rows for each of these matches in the output. Also, if there is a row in a table with no match in the other, there will be no row in the output that represents this row. Let's implement both these situations in practice and see how they work:
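One way to reason about the two rules above is to count matches by hand. The plain-Python sketch below does this for two made-up lists of join values (hypothetical placeholders, not the dog tables): every row on the left contributes one output row per matching row on the right, and contributes nothing if it has no match.

```python
from collections import Counter

# Hypothetical join values: one entry per row in each table.
left_keys = ["x", "y", "y", "z"]
right_keys = ["y", "y", "x", "w"]

# How many rows on the right carry each value?
right_counts = Counter(right_keys)

# "x" matches once, each "y" matches twice, and "z" has no match.
# The "w" on the right never appears on the left, so it is dropped too.
expected_rows = sum(right_counts[key] for key in left_keys)
print(expected_rows)
```

A `Counter` reports zero for values it has never seen, which is exactly the "no match, no row" behavior we want to count.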

In [None]:
# Just run this cell
# This new dogs table has two extra dogs
more_dogs = dogs.with_rows([["Clifford", "Big Red", "Sandra"], ["Doug", "Golden Retriever", "Russell"]])
more_dogs

**Task**: Before we execute the join between these tables, we should be able to calculate how many rows there should be in the output. Assign the variable `more_dog_owner_rows` to the number of rows that should result from joining `more_dogs` with `owners_new_label`. Run the cell below to see both tables again for clarity:

In [None]:
display(more_dogs, owners_new_label)

In [None]:
more_dog_owner_rows = ...
more_dog_owner_rows

In [None]:
grader.check("q2_5")

### Confirm your answer

Remember, each row in the `more_dogs` table only gets a row in the output if it matches a row in `owners_new_label`.

In [None]:
# If an owner in the table has 2 dogs, both dogs should appear in the output
# If a dog has no owner in the owners table, the dog does not appear in the output
complex_doggy_data = more_dogs.join("Owner", owners_new_label, "Name")
complex_doggy_data

In [None]:
# Total of 4 rows
complex_doggy_data.num_rows

**Finally, for reference, here is the link to the Data 6 Python Reference (our Python cheat-sheet) so you can review some of the methods we've used for tables in this lab!**

[Python Reference](http://data6.org/fa24/reference)

## Done! 😇

Good luck with the homework!

## Pets of Data 6
Kaalu is living it up! Good luck on HW2!

<img src="kaalu.jpg" width="40%" alt="Black dog chilling on the floor"/>

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)