In [None]:
# Set up packages for lecture. Don't worry about understanding this code,
# but make sure to run it if you're following along.
import numpy as np
import babypandas as bpd

import matplotlib.pyplot as plt
plt.style.use('ggplot')

np.set_printoptions(threshold=20, precision=2, suppress=True)
import pandas as pd
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

from IPython.display import display, IFrame, YouTubeVideo

def show_grouping_animation():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vTgVlFngQcLMYHP-z1vq5lVXjsBgcHebc-3TX7SW6L_gjX6TD1gsflvVDQUpWiDdeEPqJASenUIfBVd/embed?start=false&loop=false&delayms=60000&rm=minimal"
    width = 960
    height = 509
    display(IFrame(src, width, height))

# Lecture 5 – Querying and Grouping

## DSC 10, Fall 2023

### Announcements

- Lab 1 is due on **Thursday at 11:59PM**.
- Homework 1 is due on **Saturday at 11:59PM**.
    - Do Lab 1 before Homework 1.
    - [Avoid submission errors](https://dsc10.com/syllabus/#submission-errors). 
- Quiz 1 will be this **Wednesday in discussion section**.
   - It'll be a 20-minute paper-based quiz administered in the second half of discussion. We'll cover practice problems and ask questions in the first half.
   - It covers Lectures 1 through 4, or [BPD 1-9](https://notes.dsc10.com/front.html) in the `babypandas` notes. Review both of these materials to study.
   - It will consist of short answer and multiple choice questions.
   - No aids are allowed (no notes, no calculators, no computers).

### Agenda

- Querying.
- Querying with multiple conditions.
- Grouping.
- After class: challenge problems.

#### Don't forget about these resources!

- [DSC 10 Reference Sheet 📌](https://drive.google.com/file/d/1ky0Np67HS2O4LO913P-ing97SJG0j27n/view). 
- [`babypandas` notes](https://notes.dsc10.com).
- [`babypandas` documentation](https://babypandas.readthedocs.io/en/latest/index.html).
- [The Resources tab of the course website](https://dsc10.com/resources/).

### You belong here! 🤝

- We're moving _very_ quickly in this class.
- This may be the first time you're ever writing code, and you may question whether or not you belong in this class, or if data science is for you.
- We promise, no matter what your prior experience is, **the answer is yes, you belong!**
    - Watch: [🎥 Developing a Growth Mindset with Carol Dweck](https://www.youtube.com/watch?v=hiiEeMN7vbQ).
- Please come to office hours (see the schedule [here](https://dsc10.com/calendar)) and post on Ed for help – we're here to make sure you succeed in this course.

### The data: US states 🗽

We'll continue working with the same data from last time.

In [None]:
states = bpd.read_csv('data/states.csv')
states = states.assign(Density=states.get('Population') / states.get('Land Area'))
states

## Example 4: What is the population density of Pennsylvania?

**Key concept**: Accessing using row labels.

### Population density of Pennsylvania

We know how to get the `'Density'` of all states. How do we find the one that corresponds to Pennsylvania?

In [None]:
states

In [None]:
# Which one is Pennsylvania?
states.get('Density')

### Utilizing the index

- When we load in a DataFrame from a CSV, columns have meaningful names, but rows do not.

In [None]:
bpd.read_csv('data/states.csv')

- The row labels (or the *index*) are how we refer to specific rows. Instead of using numbers, let's refer to these rows by the names of the states they correspond to.

- This way, we can easily identify, for example, which row corresponds to Pennsylvania.

### Setting the index

- To change the index, use `.set_index(column_name)`.
- Row labels should be unique identifiers.
    - Each row should have a different, descriptive name that corresponds to the contents of that row's data.

In [None]:
states

In [None]:
states.set_index('State')

- Now there is one fewer column. When you set the index, a column becomes the index, and the old index disappears.

- 🚨 Like most DataFrame methods, `.set_index` returns a new DataFrame; it does not modify the original DataFrame.

In [None]:
states

In [None]:
states = states.set_index('State')
states

In [None]:
# Which one is Pennsylvania? The one whose row label is "Pennsylvania"!
states.get('Density')

### Accessing using the row label

To pull out one particular entry of a DataFrame corresponding to a row and column with certain labels:
1. Use `.get(column_name)` to extract the entire column as a Series.
2. Use `.loc[]` to access the element of a Series with a particular row label.

In this class, we'll always first access a column, then a row (but row, then column is also possible).

In [None]:
states.get('Density')

In [None]:
states.get('Density').loc['Pennsylvania']

### Summary: Accessing elements of a DataFrame

- First, `.get` the appropriate column as a Series.
- Then, use one of two ways to access an element of a Series:
    - `.iloc[]` uses the integer position.
    - `.loc[]` uses the row label.
    - Each is best for different scenarios.

In [None]:
states.get('Density')

In [None]:
states.get('Density').iloc[2]

In [None]:
states.get('Density').loc['Arizona']
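
To see the two accessors side by side outside of the lecture's data, here is a minimal sketch. It assumes plain `pandas` (the library that `babypandas` is built on) and a tiny Series whose values are made up for illustration; they are not the real densities from `states.csv`.

```python
import pandas as pd

# Hypothetical densities, labeled by state name (made-up values for illustration).
density = pd.Series([10.0, 20.0, 30.0], index=['Alabama', 'Alaska', 'Arizona'])

# .iloc[] looks up an element by its integer position (counting from 0).
by_position = density.iloc[2]

# .loc[] looks up an element by its row label.
by_label = density.loc['Arizona']

print(by_position, by_label)
```

Both lookups reach the same element here, because `'Arizona'` happens to sit at position 2 in this toy Series.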

### Note

- Sometimes the integer position and row label are the same.
- This happens by default with `bpd.read_csv`.

In [None]:
bpd.read_csv('data/states.csv')

In [None]:
bpd.read_csv('data/states.csv').get('Capital City').loc[35]

In [None]:
bpd.read_csv('data/states.csv').get('Capital City').iloc[35]

## Example 5: Which states are in the West?

**Key concept**: Querying.

### The problem

We want to create a DataFrame consisting of only the states whose `'Region'` is `'West'`. How do we do that?

### The solution

In [None]:
# This DataFrame only contains rows where the 'Region' is 'West'!
only_west = states[states.get('Region') == 'West']
only_west

🤯 What just happened?

### Aside: Booleans

- When we compare two values, the result is either `True` or `False`.
    - Notice, these words are **not** in quotes.
- `bool` is a data type in Python, just like `int`, `float`, and `str`. 
    - It stands for "Boolean", named after George Boole, a 19th-century mathematician.
- There are only two possible Boolean values: `True` or `False`.
    - Yes or no.
    - On or off.
    - 1 or 0.

In [None]:
5 == 6

In [None]:
type(5 == 6)

In [None]:
9 + 10 < 21

### Comparison operators

There are several types of comparisons we can make.

|symbol|meaning|
|--------|--------|
|`==` |equal to |
|`!=` |not equal to |
|`<`|less than|
|`<=`|less than or equal to|
|`>`|greater than|
|`>=`|greater than or equal to|

When comparing an entire Series to a single value, the result is a Series of `bool`s (via broadcasting).

In [None]:
states

In [None]:
states.get('Region') == 'West'

### What is a query? 🤔

- A *query* is code that extracts rows from a DataFrame for which certain condition(s) are true.
- We use queries to *filter* DataFrames to contain only the rows that satisfy given conditions.

### How do we query a DataFrame?

To select only certain rows of `states`:

1. Make a sequence (list/array/Series) of `True`s (keep) and `False`s (toss), usually by making a comparison.
2. Then pass it into `states[sequence_goes_here]`.

In [None]:
states[states.get('Region') == 'West']
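
The two steps can also be run separately, which makes it easier to see what the brackets receive. The sketch below assumes plain `pandas` and a small made-up table named `toy`; it is not the lecture's `states` DataFrame.

```python
import pandas as pd

# A tiny made-up table (hypothetical values, not from states.csv).
toy = pd.DataFrame(
    {
        'Region': ['West', 'South', 'West', 'Midwest'],
        'Population': [5, 7, 3, 4],
    },
    index=['A', 'B', 'C', 'D'],
)

# Step 1: comparing a column to a single value gives a Series of bools.
mask = toy.get('Region') == 'West'

# Step 2: passing that Series in brackets keeps only the rows marked True.
only_west_toy = toy[mask]

print(mask.tolist())
print(only_west_toy.index.tolist())
```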

### What if the condition isn't satisfied?

In [None]:
states[states.get('Region') == 'Pacific Northwest']

## Example 6: What proportion of US states are Republican?

**Key concept**: Shape of a DataFrame. 

##### Strategy
1. Query to extract a DataFrame of just the states where the `'Party'` is `'Republican'`.
2. Count the number of such states.
3. Divide by the total number of states.

In [None]:
only_rep = states[states.get('Party') == 'Republican']
only_rep

### Shape of a DataFrame

- `.shape` returns the number of rows and columns in a given DataFrame.
    - `.shape` is not a method, so we **don't use parentheses**.
    - `.shape` is an *attribute*, as it describes the DataFrame.
- Access each with `[]`: 
    - `.shape[0]` for rows.
    - `.shape[1]` for columns.

In [None]:
only_rep.shape

In [None]:
# Number of rows.
only_rep.shape[0]

In [None]:
# Number of columns.
only_rep.shape[1]

In [None]:
# What proportion of US states are Republican?
only_rep.shape[0] / states.shape[0]

## Example 7: Which Midwestern state has the most land area?

**Key concepts**: Working with the index. Combining multiple steps.

##### Strategy
1. Query to extract a DataFrame of just the states in the `'Midwest'`.
2. Sort by `'Land Area'` in descending order.
3. Extract the first element from the index.

In [None]:
midwest = states[states.get('Region') == 'Midwest']
midwest

In [None]:
midwest_sorted = midwest.sort_values(by='Land Area', ascending=False)
midwest_sorted

- The answer is Kansas, but how do we get it in code?

In [None]:
# This causes an error, because 'State' is the index, not a column.
midwest_sorted.get('State').iloc[0]

### Working with the index

- We can't use `.get` because `.get` is only for columns, and there is no column called `'State'`. 
    - Instead, `'State'` is the index of the DataFrame. 
- To extract the index of a DataFrame, use `.index`.
    - Like `.shape`, this is an attribute of the DataFrame, not a method. Don't use parentheses.  
- Access particular elements in the index with `[]`.

In [None]:
midwest_sorted.index

In [None]:
midwest_sorted.index[0]

### Combining multiple steps

- It is not necessary to define the intermediate variables `midwest` and `midwest_sorted`. We can do everything in one line of code.

- When solving a multi-step problem, develop your solution incrementally. Write one piece of code at a time and run it.

In [None]:
# Final solution, which you should build up one step at a time.
states[states.get('Region') == 'Midwest'].sort_values(by='Land Area', ascending=False).index[0]

- If a line of code gets too long, enclose it in parentheses to split it over multiple lines.

In [None]:
# You can space your code out like this if needed.
(
    states[states.get('Region') == 'Midwest']
    .sort_values(by='Land Area', ascending=False)
    .index[0]
)

### Concept Check ✅ – Answer at [cc.dsc10.com](http://cc.dsc10.com) 

Which expression below evaluates to **the total population of the `'West'`**?

A. `states[states.get('Region') == 'West'].get('Population').sum()`

B. `states.get('Population').sum()[states.get('Region') == 'West']`

C. `states['West'].get('Population').sum()`
   
D. More than one of the above.

In [None]:
...

## Example 8: What are the top three most-populated Republican states in the South?

**Key concepts**: Queries with multiple conditions. Selecting rows by position.

### Multiple conditions

- To write a query with multiple conditions, use `&` for "and" and `|` for "or".
    - `&`: All conditions must be true.
    - `|`: At least one condition must be true.
- **You must use `(`parentheses`)` around each condition!**
- 🚨 Don't use the Python keywords `and` and `or` here! They do not behave as you'd want.
    - See [BPD 10.3](https://notes.dsc10.com/02-data_sets/querying.html#multiple-conditions) for an explanation.

In [None]:
states[(states.get('Party') == 'Republican') & (states.get('Region') == 'South')]

In [None]:
# You can also add line breaks within brackets.
states[(states.get('Party') == 'Republican') & 
       (states.get('Region') == 'South')]

### The `&` and `|` operators work element-wise!

In [None]:
(states.get('Party') == 'Republican')

In [None]:
(states.get('Region') == 'South')

In [None]:
(states.get('Party') == 'Republican') & (states.get('Region') == 'South')
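
As a rough illustration of why `&` and `|` are needed, the sketch below assumes plain `pandas` and two short, made-up boolean Series. The names `is_republican` and `is_south` are hypothetical stand-ins for the two comparisons above.

```python
import pandas as pd

# Made-up boolean Series standing in for the two conditions.
is_republican = pd.Series([True, True, False, False])
is_south = pd.Series([True, False, True, False])

# & and | combine the Series one pair of elements at a time.
both = is_republican & is_south
either = is_republican | is_south

# The keyword `and` tries to treat each whole Series as a single True/False,
# which pandas refuses to do, so it raises a ValueError.
try:
    is_republican and is_south
    and_failed = False
except ValueError:
    and_failed = True

print(both.tolist(), either.tolist(), and_failed)
```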

### Original Question: What are the top three most-populated Republican states in the South?

In [None]:
(
    states[(states.get('Party') == 'Republican') & 
       (states.get('Region') == 'South')]
    .sort_values(by='Population', ascending=False)
)

How do we extract the first three rows of this DataFrame?

### Using `.take` to select rows by position

- Querying allows us to select rows that satisfy a certain _condition_.
- We can also select rows in specific _positions_ with `.take(sequence_of_integer_positions)`. This keeps only the rows whose positions are in the specified sequence (list/array).
    - This is analogous to using `.iloc[]` on a Series.
    - It's rare to need to select rows by integer position. Querying is **far** more useful.

In [None]:
(
    states[(states.get('Party') == 'Republican') & 
       (states.get('Region')=='South')]
    .sort_values(by='Population', ascending=False)
    .take([0, 1, 2])
)

- `.take(np.arange(3))` could equivalently be used in place of `.take([0, 1, 2])`.
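
A quick way to convince yourself that the two forms agree is sketched below. It assumes plain `pandas` and a small made-up table that is already sorted in descending order.

```python
import numpy as np
import pandas as pd

# A tiny made-up table, already sorted by 'Population' in descending order.
toy = pd.DataFrame({'Population': [9, 7, 5, 3, 1]}, index=['A', 'B', 'C', 'D', 'E'])

# Select the rows at positions 0, 1, and 2 in two equivalent ways.
first_three_list = toy.take([0, 1, 2])
first_three_arange = toy.take(np.arange(3))

# .equals checks that the two results have the same labels and values.
same = first_three_list.equals(first_three_arange)
print(first_three_list.index.tolist(), same)
```

`np.arange(3)` becomes more convenient than a handwritten list as the number of rows grows, for example `np.arange(10)` for the top ten.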

### Extra Practice

Write code to answer each question below. 

1. What is the capital city of the state in the `'West'` with the largest land area?
1. How many states in the `'Northeast'` have more land area than an average US state?
1. What is the total population of the `'Midwest'`, `'South'`, and `'Northeast'`?

<details>
    <summary>✅ Click <b>here</b> to see the answers <b>after</b> you've attempted the problems on your own.</summary>

1. What is the capital city of the state in the West with the largest land area?

<pre>
states[states.get('Region') == 'West'].sort_values(by='Land Area', ascending=False).get('Capital City').iloc[0]
</pre>

2. How many states in the Northeast have more land area than an average US state?

<pre>
states[(states.get('Region') == 'Northeast') & 
       (states.get('Land Area') > states.get('Land Area').mean())].shape[0]
</pre>
     
3. What is the total population of the Midwest, South, and Northeast?

<pre>
states[(states.get('Region') == 'Midwest') | 
       (states.get('Region') == 'South') | 
       (states.get('Region') == 'Northeast')].get('Population').sum()
</pre>
&nbsp;&nbsp;&nbsp;&nbsp; Alternate solution to 3:

<pre>
states.get('Population').sum() - states[states.get('Region') == 'West'].get('Population').sum()
</pre>
        
</details>

In [None]:
...

## Example 9: Which region is most populated?

**Key concept**: Grouping by one column.

### Organizing states by region

We can find the total population of any one region using the tools we already have.

In [None]:
states[states.get('Region') == 'West'].get('Population').sum()

In [None]:
states[states.get('Region') == 'Midwest'].get('Population').sum()

But can we find the total population of **every** region all at the same time, without writing very similar code multiple times? Yes, there is a better way!

### A new method: `.groupby`

Observe what happens when we use the `.groupby` method on `states` with the argument `'Region'`.

In [None]:
states.groupby('Region').sum()

These populations (for the `'West'` and `'Midwest'`) match the ones we found on the previous slide, except now we get the populations for all regions at the same time. What just happened? 🤯

### An illustrative example: Pets 🐱 🐶🐹

Consider the DataFrame `pets`, shown below.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Species</th>
      <th>Color</th>
      <th>Weight</th>
      <th>Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>dog</td>
      <td>black</td>
      <td>40</td>
      <td>5.0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>cat</td>
      <td>golden</td>
      <td>15</td>
      <td>8.0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>cat</td>
      <td>black</td>
      <td>20</td>
      <td>9.0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>dog</td>
      <td>white</td>
      <td>80</td>
      <td>2.0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>dog</td>
      <td>golden</td>
      <td>25</td>
      <td>0.5</td>
    </tr>
    <tr>
      <th>5</th>
      <td>hamster</td>
      <td>golden</td>
      <td>1</td>
      <td>3.0</td>
    </tr>
  </tbody>
</table>

Let's see what happens under the hood when we run `pets.groupby('Species').mean()`.


In [None]:
show_grouping_animation()

### Let's try it out!

In [None]:
pets = bpd.DataFrame().assign(
    Species=['dog', 'cat', 'cat', 'dog', 'dog', 'hamster'],
    Color=['black', 'golden', 'black', 'white', 'golden', 'golden'],
    Weight=[40, 15, 20, 80, 25, 1],
    Age=[5, 8, 9, 2, 0.5, 3]
)
pets

In [None]:
pets.groupby('Species').mean()

It takes several steps to go from the original `pets` DataFrame to this grouped DataFrame, but we don't get to see any of Python's inner workings, just the final output.
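
To make those hidden steps concrete, here is a rough sketch in plain Python, with no DataFrames at all. It is only an illustration of the split, aggregate, and combine idea, not how `babypandas` is actually implemented. The `'Color'` column is left out, since taking the mean of strings isn't meaningful.

```python
# The rows of `pets`, written as (Species, Weight, Age) tuples.
rows = [
    ('dog', 40, 5.0),
    ('cat', 15, 8.0),
    ('cat', 20, 9.0),
    ('dog', 80, 2.0),
    ('dog', 25, 0.5),
    ('hamster', 1, 3.0),
]

# Step 1: split the rows into groups, one group per unique Species.
groups = {}
for species, weight, age in rows:
    groups.setdefault(species, []).append((weight, age))

# Step 2: aggregate within each group, one column at a time.
# Step 3: combine the results, with group labels in ascending order.
means = {}
for species in sorted(groups):
    weights = [w for w, a in groups[species]]
    ages = [a for w, a in groups[species]]
    means[species] = (sum(weights) / len(weights), sum(ages) / len(ages))

for species, (mean_weight, mean_age) in means.items():
    print(species, mean_weight, mean_age)
```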

### Back to states: which region is most populated?

In [None]:
states

In [None]:
states.groupby('Region').sum()

In [None]:
# Note the use of .index – remember, the index isn't a column!
(
    states
    .groupby('Region')
    .sum()
    .sort_values(by='Population', ascending=False)
    .index[0]
)

### Using `.groupby` in general

In short, `.groupby` aggregates (collects) all rows with the same value in a specified column (e.g. `'Region'`) into a single row in the resulting DataFrame, using an aggregation method (e.g. `.sum()`) to combine values from different rows with the same value in the specified column.

To use `.groupby`:

1. **Choose a column to group by**.
    - `.groupby(column_name)` will gather rows which have the same value in the specified column (`column_name`).
    - In the resulting DataFrame, there will be one row for every unique value in that column.

2. **Choose an aggregation method**.
    - The aggregation method will be applied **within** each group.
    - The aggregation method is applied individually to each column.
        - If it doesn't make sense to use the aggregation method on a column, the column is dropped from the output.
    - Common aggregation methods include `.count()`, `.sum()`, `.mean()`, `.median()`, `.max()`, and `.min()`.

### Observations on grouping

1. After grouping, the index changes. The new row labels are the *group labels* (i.e., the unique values in the column that we grouped on), sorted in ascending order.

In [None]:
states

In [None]:
states.groupby('Region').sum()

2. The aggregation method is applied separately to each column. If it does not make sense to apply the aggregation method to a certain column, the column will disappear. 🐇🎩  


3. Since the aggregation method is applied to each column **separately**, the rows of the resulting DataFrame need to be interpreted with care.

In [None]:
states.groupby('Region').max()

In [None]:
12812508 / 81759 == 288.77

4. The column names don't make sense after grouping with the `.count()` aggregation method.

In [None]:
states.groupby('Region').count()
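
Observation 3 above can be checked on the small `pets` table from earlier. The sketch below assumes plain `pandas` and keeps only the numeric columns; it looks for a dog that has both the maximum weight and the maximum age reported for dogs.

```python
import pandas as pd

# The numeric columns of `pets`, plus the column we group by.
pets = pd.DataFrame({
    'Species': ['dog', 'cat', 'cat', 'dog', 'dog', 'hamster'],
    'Weight': [40, 15, 20, 80, 25, 1],
    'Age': [5, 8, 9, 2, 0.5, 3],
})

# .max() is computed for 'Weight' and 'Age' independently within each group.
max_by_species = pets.groupby('Species').max()
dog_max_weight = max_by_species.get('Weight').loc['dog']
dog_max_age = max_by_species.get('Age').loc['dog']

# Is there an actual dog with both of those values?
dogs = pets[pets.get('Species') == 'dog']
matching = dogs[(dogs.get('Weight') == dog_max_weight) &
                (dogs.get('Age') == dog_max_age)]

print(dog_max_weight, dog_max_age, matching.shape[0])
```

No such dog exists: the heaviest dog and the oldest dog are different rows, so the `'dog'` row of the grouped result does not describe any single pet.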

### Dropping, renaming, and reordering columns

Consider dropping unneeded columns and renaming columns as follows:
1. Use `.assign` to create a new column containing the same values as the old column(s).
2. Use `.drop(columns=list_of_column_labels)` to drop the old column(s). 
    - Alternatively, use `.get(list_of_column_labels)` to keep only the columns in the given list. The columns will appear in the order you specify, so this is also useful for reordering columns!

In [None]:
states_by_region = states.groupby('Region').count()
states_by_region = states_by_region.assign(
                    States=states_by_region.get('Capital City')
                    ).get(['States'])
states_by_region

## Challenge problems: IMDb dataset 🎞️

<center>
<img width=40% src="images/imdb.png"/>
</center>

### Extra practice

We won't cover this section in class. Instead, it's here for you to practice with some harder examples.

The video below walks through the solutions (it's also linked [here](https://youtu.be/xg7rnjWnZ48)). You can also see the solutions by clicking the "✅ Click <b>here</b> to see the answer." button below each question.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('xg7rnjWnZ48')

Before watching the video or looking at the solutions, **make sure to try these problems on your own** – they're great prep for homeworks, projects, and exams! Feel free to ask about them in office hours or on Ed.

In [None]:
imdb = bpd.read_csv('data/imdb.csv').set_index('Title').sort_values(by='Rating')
imdb

### Question: How many movies appear from each decade?

In [None]:
imdb.groupby('Decade').count()

In [None]:
# We'll learn how to make plots like this in the next lecture!
imdb.groupby('Decade').count().plot(y='Year');

### Question: What was the highest rated movie of the 1990s?

Let's try to do this two different ways.

#### Without grouping

In [None]:
imdb[imdb.get('Decade') == 1990].sort_values(by='Rating', ascending=False).index[0]

_Note:_ The command to extract the index of a DataFrame is `.index` – no parentheses! This is different from the way we extract columns, with `.get()`, because the index is not a column.

#### With grouping

In [None]:
imdb.reset_index().groupby('Decade').max()

- It turns out that this method **does not** yield the correct answer. 
- When we use an aggregation method (e.g. `.max()`), aggregation is done to each column individually. 
- While it's true that the highest rated movie from the 1990s has a rating of 9.2, that movie is **not** Unforgiven – instead, Unforgiven is the movie that's the latest in the alphabet among all movies from the 1990s.
- Taking the `max` is not helpful here.

### Question: How many years have more than 3 movies rated above 8.5?

<details>
    <summary>✅ Click <b>here</b> to see the answer.</summary>

<pre>
good_movies_per_year = imdb[imdb.get('Rating') > 8.5].groupby('Year').count()
good_movies_per_year[good_movies_per_year.get('Votes') > 3].shape[0]    
</pre>
    
As mentioned below, you can also use:
    
<pre>
(good_movies_per_year.get('Votes') > 3).sum() 
</pre>
    
</details>

#### Aside: Using `.sum()` on a boolean array

- Summing a boolean array gives a count of the number of `True` elements because Python treats `True` as 1 and `False` as 0. 
- Can you use that fact here?
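
Here is a minimal sketch of that fact, assuming `numpy` and a handful of made-up ratings (they are not taken from `imdb.csv`).

```python
import numpy as np

# Made-up ratings, for illustration only.
ratings = np.array([9.2, 8.4, 8.6, 8.5, 8.9])

# Comparing an array to a number gives an array of bools.
above = ratings > 8.5

# Summing the bools counts the True values, since True acts as 1 and False as 0.
count_above = above.sum()

print(above.tolist(), count_above)
```

Note that the rating of exactly 8.5 is not counted, because `>` is a strict comparison.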

### Question: Out of the years with more than 3 movies, which had the highest average rating?

<details>
    <summary>✅ Click <b>here</b> to see the answer.</summary>

<pre>
more_than_3_ix = imdb.groupby('Year').count().get('Votes') > 3
imdb.groupby('Year').mean()[more_than_3_ix].sort_values(by='Rating').index[-1]
 
</pre>
    
</details>

### Question: Which year had the longest movie titles, on average?

**Hint:** Use `.str.len()` on the column or index that contains the names of the movies.

<details>
    <summary>✅ Click <b>here</b> to see the answer.</summary>

<pre>
(
    imdb.assign(title_length=imdb.index.str.len())
    .groupby('Year').mean()
    .sort_values(by='title_length')
    .index[-1]
)
</pre>
    
The year is 1964 – take a look at the movies from 1964 by querying!
    
</details>

### Question: What is the average rating of movies from years that had at least 3 movies in the Top 250?

<details>
    <summary>✅ Click <b>here</b> to see the answer.</summary>

<pre>
# A Series of Trues and Falses; True when there were at least 3 movies on the list from that year
at_least_3_ix = imdb.groupby('Year').count().get('Votes') >= 3

# The sum of the ratings of movies from years that had at least 3 movies on the list
total_rating = imdb.groupby('Year').sum()[at_least_3_ix].get('Rating').sum()

# The total number of movies from years that had at least 3 movies on the list
count = imdb.groupby('Year').count()[at_least_3_ix].get('Rating').sum()

# The correct answer
average_rating = total_rating / count

# Close, but incorrect: 
# Doesn't account for the fact that different years have different numbers of movies on the list
close_but_wrong = imdb.groupby('Year').mean()[at_least_3_ix].get('Rating').mean()
</pre>
        
</details>
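
The reason an average of yearly averages can differ from the overall average is easiest to see with small numbers. The sketch below uses plain Python and entirely made-up ratings for two hypothetical years.

```python
# Made-up ratings for two hypothetical years with different numbers of movies.
ratings_by_year = {
    2000: [9.0, 9.0, 9.0, 9.0],                       # 4 movies
    2001: [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0],   # 8 movies
}

# Overall average: every movie counts equally.
all_ratings = [r for ratings in ratings_by_year.values() for r in ratings]
overall_mean = sum(all_ratings) / len(all_ratings)

# Average of yearly averages: every year counts equally, however many movies it has.
yearly_means = [sum(ratings) / len(ratings) for ratings in ratings_by_year.values()]
mean_of_means = sum(yearly_means) / len(yearly_means)

print(overall_mean, mean_of_means)
```

The overall average is 100 / 12, or about 8.33, while the average of the two yearly averages is 8.5. The second number gives the four movies from the first year as much weight as the eight movies from the second.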

## Summary, next time

### Summary

- We can write queries that involve multiple conditions, as long as we:
    - Put parentheses around all conditions.
    - Separate conditions using `&` if you require all to be true, or `|` if you require at least one to be true.
- The method call `df.groupby(column_name).agg_method()` **aggregates** all rows with the same value for `column_name` into a single row in the resulting DataFrame, using `agg_method()` to combine values.
    - Common aggregation methods include `.count()`, `.sum()`, `.mean()`, `.median()`, `.max()`, and `.min()`.

### Next time

A picture is worth a thousand words – it's time to visualize!