In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab04.ipynb")

In [1]:
# Run this cell to load all dependencies 
exec(open("./utils.py").read())

# Lab 4 – Tables

## Data 6, Summer 2022

In this lab, we will be talking all about *Tables*. We use tables to store all sorts of data, from sports statistics to population information. If there's data you have ever been curious about, it is very likely that the Internet has a table somewhere with that data!

Tables are integral to the foundation of Data Science, and we will go over how to **query** a table. **Querying** a table means asking it for information. Some examples of common queries (in English, not code):

- How many data points are there?
- Which data points have a specific characteristic?
- What is the attribute of a specific data point?
- And many more!

There are so many ways we can use tables to get information we need, and there are several existing libraries in Python that we can use to do this! In this course, we will be using the `datascience` library, and if you take Data Science classes beyond this one, you may learn many more!


### Loading a Table

Recall in Lab 2, we introduced the `Table.read_table` method, which takes a *file path* and constructs a `Table` with the information from that file. Let's see how this works using the file `"data/football.csv"`, which contains information about the Cal football team:

*Note*: If you want to check where the `football.csv` file is, you can look in your DataHub directory by clicking `File` > `Open..` in the top left. 

In [2]:
cal = Table.read_table("data/football.csv")
cal.show(5)

### Excluding columns: `drop`

We now have information about Cal Football's seasons for as long as statistics have been kept. Because this file was pulled from the internet, it may contain data that we are not interested in, like the columns with a bunch of `nan` values (`nan` means "Not a Number", and it is commonly used to indicate that a value is missing).
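As a small aside on how `nan` behaves, the sketch below uses only Python's built-in `math` module and a made-up value (it does not touch the `cal` table). A `nan` is not equal to anything, including itself, so a check like `math.isnan` is the reliable way to detect one.

```python
import math

missing = float("nan")   # a stand-in for an empty table entry

print(missing == missing)     # False: nan never equals anything, even itself
print(math.isnan(missing))    # True: the reliable check for nan
print(math.isnan(3.0))        # False: an ordinary number
```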

**Caution**: It is not a good idea to blindly drop all columns with several NaN values from a table. Think back to what you saw with the missing values in Lab 2. What information would have been lost if we just dropped all missing values?

However, for the sake of this exercise, we'll do so. We can use the `drop` method to remove columns like this from the table. Let's drop the `Notes` column:

In [3]:
cal_no_notes = cal.drop("Notes")
cal_no_notes.show(5)

Let's also drop the `AP Pre`, `AP High`, `AP Post`, `SRS` and `SOS` columns from the table. These are statistics specific to college football, and they are not important for what we're doing. `drop` can take in as many columns as you need, and it will drop them all from the table.

In [4]:
cal_improved_columns = cal_no_notes.drop("AP Pre", "AP High", "AP Post", "SRS", "SOS")
cal_improved_columns.show(5)

**Question 1.1 (Number of Years):** Since each row within the `cal_improved_columns` table corresponds to a season, we can see how many years this table includes by determining the number of rows it has. Assign the variable `cal_rows` to the number of rows in `cal_improved_columns`. 

You should not write an integer, but instead use one of the table attributes we have talked about so far to **calculate** the number of rows.

Hint: Stuck? Remember, you can reference all of the Table tools in `datascience` by looking at the Data 6 Python Reference sheet [here](http://data6.org/su22/reference). 

<!--
BEGIN QUESTION
name: q1_1
points: 0
-->

In [5]:
cal_rows = ...
cal_rows

In [None]:
grader.check("q1_1")

Using this value, we can calculate the first year in `cal_improved_columns` without looking at it! The `cal` table covers seasons up through 2020, so subtracting the number of rows from 2020 gives us the first year *not* in the table. Thus, we add one to the result to get the first year *in* the table.

Run the following cell to see this in action:

In [8]:
first_year = 2020 - cal_rows + 1
print(f"The first year in this table is: {first_year}")
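To see why the `+ 1` is needed, here is a small check in plain Python. The list of five years is made up for illustration (it is not taken from `football.csv`), and the sketch assumes, as the lab does, one row per consecutive year.

```python
# Hypothetical example: a table with one row per year, ending in 2020.
years = [2016, 2017, 2018, 2019, 2020]
num_rows = len(years)   # 5 rows

last_year = 2020
first_year_not_in_table = last_year - num_rows      # one year too early
first_year_in_table = last_year - num_rows + 1      # adding 1 corrects it

print(first_year_not_in_table, first_year_in_table)
print(first_year_in_table == years[0])
```

Subtracting alone lands one year before the table starts, which is exactly the off-by-one the extra `+ 1` fixes.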

### Querying

Let's try querying our new table using the `column` method to determine which conferences Cal has played in during its history. This information is contained within the `"Conf"` column of the `cal_improved_columns` table.

In [9]:
conference_list = cal_improved_columns.column("Conf")
conference_list

As you can see, this list is long and repetitive, but we can use the `np.unique` function to list each conference only once, in sorted order:

In [10]:
np.unique(conference_list)
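One detail worth knowing is that `np.unique` returns its result sorted, not in order of first appearance. The sketch below uses a made-up array of conference labels (hypothetical values, deliberately out of alphabetical order) to show this.

```python
import numpy as np

# Hypothetical conference labels, deliberately out of alphabetical order.
conferences = np.array(["PCC", "Ind", "PCC", "AAWU", "Ind", "PCC"])

unique_conferences = np.unique(conferences)

print(unique_conferences)        # each label once, sorted alphabetically
print(len(unique_conferences))   # number of distinct labels
```

Wrapping the result in `len` is a quick way to count how many distinct values a column holds.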

### Picking columns: `select`

It appears that there are also several other columns that we are not very interested in. Instead of dropping several columns, we can use the `select` method to grab only the columns we want. 

**Question 1.2:** In this case, we only want to keep the `"Year"`, `"W"`, `"L"`, `"T"`, and `"Pct"` columns. Fill in the following code so that the `football` table has only the relevant columns.

<!--
BEGIN QUESTION
name: q1_2
points: 0
-->

In [11]:
football = ...
football

In [None]:
grader.check("q1_2")

In [14]:
# Note that our cal_improved_columns table is still intact after this:
cal_improved_columns

### Changing column labels: `relabeled`

We can rename column labels using the `relabeled` method. With this function, you are able to:
1. Relabel a *single column*
2. Relabel *several columns* at once

To change the names of multiple columns, we pass in an array of the old names and an array of the new names as the 2 inputs to `relabeled`.

*Note*: You may see another method called `relabel` in the `datascience` documentation. Please avoid using it, as it can change your data when you may not want it to.

**Question 1.3:** Some of the columns in the `football` table have labels that may not be best for what they store. Let's change the column labels to the following:

- `"W"` should be changed to `"Wins"`
- `"L"` should be changed to `"Losses"`
- `"T"` should be changed to `"Ties"`
- `"Pct"` should be changed to `"Winning Percentage"`

*Hint*: We've provided skeleton code for you to use.

<!--
BEGIN QUESTION
name: q1_3
points: 0
-->

In [15]:
old_names = ...
new_names = ...

football_relabeled = football.relabeled(..., ...)

football_relabeled.show(5)

In [None]:
grader.check("q1_3")

### Asking Questions

Now that we have the table we want, let's try to write some code that tells us some information about Cal Football's wins. Let's write three queries that can help us answer these three questions. The code to answer the first question has been provided, but it is your job to write code that will answer questions two and three.

1. What is the most wins Cal has ever had in one season?
2. How many total games has Cal lost?
3. What is the average number of games Cal wins each year?

*Remember, you do not need to calculate the answers to these questions by hand. You should be writing queries so that Python does all the calculation for you.*

**Question 2.1**: What is the most wins Cal has ever had in one season?

<!--
BEGIN QUESTION
name: q2_1
points: 0
-->

In [18]:
most_wins_ever = np.max(football_relabeled.column("Wins"))
most_wins_ever

Let's break down this line of code and see what it does. First, we ask for the `Wins` column of `football_relabeled`, which gives us access to the win total from every season. 

In [19]:
football_relabeled.column("Wins")

We then use the `np.max` function to find the maximum value in this array, which ultimately tells us the most wins Cal Football has ever had in any one season.

Let's use similar queries to answer the other 2 questions:

**Question 2.2 (Losses):** Use a NumPy function, the `football_relabeled` table, and a table method to answer the following question:

>How many total games has Cal lost?

Assign the value to the variable `games_lost_alltime`.

<!--
BEGIN QUESTION
name: q2_2
points: 0
-->

In [20]:
games_lost_alltime = ...
games_lost_alltime

In [None]:
grader.check("q2_2")

**Question 2.3 (Wins)**: Similar to above, let's answer the third question using a combination of a function, table, and table method:

>What is the average number of games Cal wins each year?

Assign your answer to the variable `average_wins`.

<!--
BEGIN QUESTION
name: q2_3
points: 0
-->

In [23]:
average_wins = ...
average_wins

In [None]:
grader.check("q2_3")

### Interpreting Our Data

What does winning 5.52 games even mean?! Well, this means you can (roughly) expect Cal to win 5-6 games a year. 

While this is not a perfect statistic (some seasons are longer than others, football is a completely different game than it was a long time ago, etc.), in a 12-13 game season, do you think this is a good number of wins? The answer to this question is not concrete, and even with data to back up either side, neither answer seems more right than the other.

**Important**: Data science is not only being able to *compute* the answers to questions, but also forming thoughtful questions in response to your findings.

### Sorting a column: `sort`

We will now introduce a new table method: `sort`. The `sort` method returns a copy of the table with its rows ordered by the values in a chosen column, in either **decreasing** (`descending=True`) or **increasing** (`descending=False`) order.

Let's say we want to ask the question: **What is Cal's best season ever?** There are many ways to answer the question, but you may argue that a season with the most wins or the fewest losses could be considered the best:

In [26]:
# We can sort in descending order
football_relabeled.sort("Wins", descending=True)

In [27]:
# Or we can sort in ascending order
football_relabeled.sort("Losses", descending=False)

As you can see, queries about the most wins and the fewest losses can both answer the question **What is Cal's best season ever?** in different ways. Note that the same seasons do not necessarily show up in the top of each queried table.

**Question 2.4**: Yet another way to answer this question about Cal's best seasons ever is to sort by winning percentage. Assign the variable `best_win_pct_year` to the year corresponding to the season with the **highest winning percentage**.

To do so, we want to assign `seasons_sorted` to the result of a table query sorting the `football_relabeled` table by winning percentage in **descending** order. 

*Note*: We want descending order because we want the best seasons **at the top of the table**.

<!--
BEGIN QUESTION
name: q2_4
points: 0
-->

In [28]:
seasons_sorted = ...
best_win_pct_year = ...
best_win_pct_year

In [None]:
grader.check("q2_4")

As you can see, many of Cal Football's best seasons are quite far in the past. Only a few modern seasons even show up in any of these queries 😢

## Row selection: `where` and the `are` Predicates

The last table method we will talk about is the `where` method. The `where` method keeps all rows that satisfy a particular boolean condition. It takes in a column label and an `are` predicate, which can be built using the `are` library. The table below lists the most important `are` predicates, but there are many more if you would like to investigate: [Explore the 'are' library here.](http://data8.org/datascience/predicates.html)

| Method | Input Type | Method Description |
| --- | --- | --- |
| `are.equal_to(n)` | number | Is the value from the column equal to `n`? |
| `are.above(n)` | number | Is the value from the column above `n`? |
| `are.above_or_equal_to(n)` | number | Is the value from the column above or equal to `n`? |
| `are.below(n)` | number | Is the value from the column below `n`? |
| `are.below_or_equal_to(n)` | number | Is the value from the column below or equal to `n`? |
| `are.containing(s)` | string | Is `s` contained in the string value from the given column? |
| `are.contained_in(s)` | string | Is the string value from the given column contained in `s`? |

Adding a `not_` in front of all of these methods makes each method do the opposite of what it does (ex: `are.not_equal_to(n)`).

*Note*: As we've seen in lecture, we can achieve an **exact match** without explicitly using an `are` predicate. That is, `where("col", are.equal_to("something"))` is identical to `where("col", "something")`; the latter is shorthand for the former.

For example, if we only wanted to see the Cal Football seasons where Cal had a tie, we could use the `where()` method combined with an `are` method:

In [33]:
football_relabeled.where("Ties", are.above(0))

For the 2021 season, Cal will play 12 games. If we want to see Cal's worst seasons, the ones where they lost more than 6 games, we can use a similar query:

In [34]:
football_relabeled.where("Losses", are.above(6))

Again you can see that Cal Football (especially recently) has had some rough seasons 😢

**Question 2.5 (Bowl Eligibility)**: In college football, a team advances to the post-season (to play "bowl games") if it has a winning/non-losing record. In other words, a team must have a winning percentage of at least 0.500 to become eligible to play in a bowl game.

Assign the variable `bowl_eligible` to a float that describes the proportion of times in which Cal was eligible to play in college bowls throughout its history, based on their winning percentage.   

*Hint:* If you're stuck, feel free to add additional variables *before* you assign the float to `bowl_eligible`. It's often easier to break down these problems into multiple steps to make sure you're properly calculating each step and performing them in the right order. 


<!--
BEGIN QUESTION
name: q2_5
points: 0
-->

In [35]:
bowl_eligible = ...
bowl_eligible


In [None]:
grader.check("q2_5")

For reference, here is the link to the Data 6 Python Reference (our Python cheat-sheet) so you can review some of the methods we've used for tables!

[Python Reference](http://data6.org/su22/reference)

## Done! 😇

That's it! There's nowhere for you to submit this, as labs are not assignments. However, please ask any questions you have with this notebook in lab or on Ed.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)