In [None]:
# Set up packages for lecture. Don't worry about understanding this code,
# but make sure to run it if you're following along.
import numpy as np
import babypandas as bpd

import matplotlib.pyplot as plt
plt.style.use('ggplot')

np.set_printoptions(threshold=20, precision=2, suppress=True)
import pandas as pd
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Lecture 5 – More Querying and GroupBy

## DSC 10, Spring 2023

### Announcements

- Lab 1 is due on **Saturday at 11:59PM**.
- Homework 1 is due on **Tuesday at 11:59PM**.
    - Do Lab 1 before Homework 1.
    - [Avoid submission errors](https://dsc10.com/syllabus/#submission-errors). 
- Discussion 2 is today, and we'll be covering old exam problems on this week's material. 
    - You must be present when attendance is taken in discussion to get credit, even if you have a conflicting class.

### Agenda

- Recap: Queries.
- Queries with multiple conditions.
- GroupBy.
- Extra practice, including challenge problems.

#### Don't forget about these resources!

- [DSC 10 Reference Sheet 📌](https://drive.google.com/file/d/1ky0Np67HS2O4LO913P-ing97SJG0j27n/view). 
- [`babypandas` notes](https://notes.dsc10.com).
- [`babypandas` documentation](https://babypandas.readthedocs.io/en/latest/index.html).
- [The Resources tab of the course website](https://dsc10.com/resources/).

### You belong here! 🫂

- We're moving _very_ quickly in this class.
- This may be the first time you're ever writing code, and you may question whether or not you belong in this class, or if data science is for you.
- We promise, no matter what your prior experience is, **the answer is yes, you belong!**
    - Watch: [🎥 Developing a Growth Mindset with Carol Dweck](https://www.youtube.com/watch?v=hiiEeMN7vbQ).
- Please come to office hours (see the schedule [here](https://dsc10.com/calendar)) and post on Ed for help – we're here to make sure you succeed in this course.

### About the Data: Get It Done service requests 👷
<center>
<img height=75% src="images/get-it-done.jpg" width=500/>
</center>

Recall, the `requests` DataFrame contains a summary of all service requests so far this year, broken down by neighborhood and service.

In [None]:
requests = bpd.read_csv('data/get-it-done-requests.csv')
requests = requests.assign(total=requests.get('closed') + requests.get('open'))
requests

## Recap: Queries

### What is a query? 🤔

- A "query" is code that extracts rows from a DataFrame for which certain condition(s) are true.
- We often use queries to _filter_ DataFrames so that they only contain the rows that satisfy the conditions stated in our questions.

### Comparison operators

There are several types of comparisons we can make.

|symbol|meaning|
|--------|--------|
|`==` |equal to |
|`!=` |not equal to |
|`<`|less than|
|`<=`|less than or equal to|
|`>`|greater than|
|`>=`|greater than or equal to|

In [None]:
5 == 6

In [None]:
type(5 == 6)

In [None]:
9 + 10 < 21

In [None]:
'zebra' == 'zeb' + 'ra'

### How do we query a DataFrame?

To select only certain rows of `requests`:

1. Make a sequence (list/array/Series) of `True`s (keep) and `False`s (toss), usually by making a comparison.
2. Then pass it into `requests[sequence_goes_here]`.

In [None]:
requests

In [None]:
# A Boolean Series.
requests.get('closed') > 5

In [None]:
# A query.
requests[requests.get('closed') > 5]
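To see the two steps in isolation, here is a minimal sketch on a small made-up table. It uses plain `pandas` rather than `babypandas`, on the assumption that `.get` and bracket-based querying behave the same way in both; the `toy` DataFrame and its values are hypothetical, not taken from the real data.

```python
import pandas as pd

# Hypothetical toy data with the same column names as `requests`.
toy = pd.DataFrame({
    'neighborhood': ['Downtown', 'Downtown', 'Uptown', 'University'],
    'service': ['Pothole', 'Graffiti', 'Pothole', 'Pothole'],
    'closed': [10, 3, 7, 2],
    'open': [1, 0, 4, 6],
})
toy = toy.assign(total=toy.get('closed') + toy.get('open'))

# Step 1: make a Boolean Series, with one True/False per row.
mask = toy.get('closed') > 5

# Step 2: pass it into the brackets to keep only the True rows.
kept = toy[mask]
print(mask.tolist())
print(kept)
```

Only the rows where `mask` is `True` survive, and they keep their original index labels.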

## Example 5: Which neighborhood has the most `'Pothole'` requests? 🕳

**Key concept**: Querying.

### Strategy

1. Query to extract a DataFrame of just the `'Pothole'` requests.
2. Sort by `'total'` in descending order.
3. Extract the first element from the `'neighborhood'` column.

In [None]:
# This DataFrame only contains rows where the 'service' is 'Pothole'!
only_potholes = requests[requests.get('service') == 'Pothole']
only_potholes

In [None]:
# You can space your code out like this if needed.
(
    only_potholes
    .sort_values('total', ascending=False)
    .get('neighborhood')
    .iloc[0]
)

### What if the condition isn't satisfied?

In [None]:
requests[requests.get('service') == 'Car Maintenance']

### Concept Check ✅ – Answer at [cc.dsc10.com](http://cc.dsc10.com) 

Which expression below evaluates to **the total number of service requests in the `'Downtown'` neighborhood**?

A. `requests[requests.get('neighborhood') == 'Downtown'].get('total').sum()`

B. `requests.get('total').sum()[requests.get('neighborhood') == 'Downtown']`

C. `requests['Downtown'].get('total').sum()`
   
D. More than one of the above.

In [None]:
...

### Activity 🚘

**Question**: What is the most commonly requested service in the `'University'` neighborhood (near UCSD)?

Write one line of code that evaluates to the answer.

In [None]:
...

## Example 6: How many service requests were for `'Pothole'` or `'Dead Animal'`?

**Key concept**: Queries with multiple conditions.

### Multiple conditions

- To write a query with multiple conditions, use `&` for "and" and `|` for "or".
    - `&`: All conditions must be true.
    - `|`: At least one condition must be true.
- **You must use `(`parentheses`)` around each condition!**
- 🚨 Don't use the Python keywords `and` and `or` here! They do not behave as you'd want.
    - See [BPD 10.3](https://notes.dsc10.com/02-data_sets/querying.html#multiple-conditions) for an explanation.
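As a quick illustration of these rules, here is a sketch on a small hypothetical table (plain `pandas`, assumed to behave like `babypandas` here; the `toy` data is made up). The parentheses matter because Python's `&` and `|` operators bind more tightly than `==`.

```python
import pandas as pd

# Hypothetical toy data; not the real `requests` DataFrame.
toy = pd.DataFrame({
    'neighborhood': ['Downtown', 'Downtown', 'Uptown', 'Uptown', 'University'],
    'service': ['Pothole', 'Graffiti', 'Pothole', 'Graffiti', 'Pothole'],
    'total': [11, 3, 11, 5, 8],
})

is_pothole = toy.get('service') == 'Pothole'
is_downtown = toy.get('neighborhood') == 'Downtown'

# "and": rows that are Pothole requests AND in Downtown.
both = toy[(toy.get('service') == 'Pothole') & (toy.get('neighborhood') == 'Downtown')]

# "or": rows that are Pothole requests OR in Downtown (or both).
either = toy[is_pothole | is_downtown]

print(both.shape[0], either.shape[0])
```

Storing each condition in its own variable, as with `is_pothole` and `is_downtown`, is one way to keep long queries readable.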

In [None]:
requests[(requests.get('service') == 'Pothole') | (requests.get('service') == 'Dead Animal')]

In [None]:
# You can add line breaks within brackets or parentheses.
requests[(requests.get('service') == 'Pothole') | 
         (requests.get('service') == 'Dead Animal')]

### The `&` and `|` operators work element-wise!

In [None]:
(requests.get('service') == 'Pothole')

In [None]:
(requests.get('service') == 'Dead Animal')

In [None]:
(requests.get('service') == 'Pothole') | (requests.get('service') == 'Dead Animal')

### Original Question: How many service requests were for `'Pothole'` or `'Dead Animal'`?

In [None]:
requests[(requests.get('service') == 'Pothole') | 
         (requests.get('service') == 'Dead Animal')].get('total').sum()

### Concept Check ✅ – Answer at [cc.dsc10.com](http://cc.dsc10.com) 

Each of the following questions can be answered by querying the `requests` DataFrame.

1. Which neighborhood had the most `'Street Flooded'` requests?
2. In the `'Kearny Mesa'` neighborhood, how many different types of services have open requests?
3. How many requests have been closed in the `'La Jolla'` neighborhood?

How many of the questions above **require** the query to have **multiple conditions**?

A. 0 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
B. 1 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
C. 2 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
D. 3

**Bonus**: Try to write code to answer each question.

In [None]:
...

### Aside: Using `.take` to select rows by position

- Querying allows us to select rows that satisfy a certain _condition_.
- We can also select rows in specific _positions_ with `.take([list_of_integer_positions])`. This keeps only the rows whose positions are in the specified list.
    - This is analogous to using `.iloc[]` on a Series.
    - It's rare to need to select rows by integer position. Querying is **far** more useful.
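The analogy between `.take` on a DataFrame and `.iloc[]` on a Series can be sketched as follows, using plain `pandas` (assumed to match `babypandas` here) and a hypothetical `toy` table.

```python
import pandas as pd

# Hypothetical toy data with six rows, at positions 0 through 5.
toy = pd.DataFrame({
    'service': ['Pothole', 'Graffiti', 'Dead Animal',
                'Street Flooded', 'Weed Cleanup', 'Parking'],
    'total': [11, 3, 4, 2, 7, 9],
})

# Keep the rows at positions 1, 3, and 5 of the DataFrame.
some_rows = toy.take([1, 3, 5])

# The same positions, but from a single column (a Series).
some_services = toy.get('service').iloc[[1, 3, 5]]

print(some_rows)
print(some_services.tolist())
```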

In [None]:
requests

In [None]:
requests.take([1, 3, 5])

In [None]:
requests.get('service').iloc[[1, 3, 5]]

In [None]:
requests.take(np.arange(5))

## Example 7: Which neighborhood had the most requests?

**Key concept**: Grouping by one column.

### Organizing requests by neighborhood

We can find the total number of Get It Done requests in any one neighborhood using the tools we already have.

In [None]:
requests[requests.get('neighborhood') == 'Black Mountain Ranch'].get('total').sum()

In [None]:
requests[requests.get('neighborhood') == 'Uptown'].get('total').sum()

If we wanted to find the total number of requests in **every** neighborhood, this would be quite inconvenient... there has to be a better way!

### A new method: `.groupby`

Observe what happens when we use the `.groupby` method on `requests` with the argument `'neighborhood'`.

In [None]:
requests.groupby('neighborhood').sum()

Note that the `'total'` counts for Black Mountain Ranch and Uptown are the same as we saw on the previous slide. What just happened? 🤯

### An illustrative example: Pets 🐱 🐶🐹

Consider the DataFrame `pets`, shown below.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Species</th>
      <th>Color</th>
      <th>Weight</th>
      <th>Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>dog</td>
      <td>black</td>
      <td>40</td>
      <td>5.0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>cat</td>
      <td>golden</td>
      <td>15</td>
      <td>8.0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>cat</td>
      <td>black</td>
      <td>20</td>
      <td>9.0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>dog</td>
      <td>white</td>
      <td>80</td>
      <td>2.0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>dog</td>
      <td>golden</td>
      <td>25</td>
      <td>0.5</td>
    </tr>
    <tr>
      <th>5</th>
      <td>hamster</td>
      <td>golden</td>
      <td>1</td>
      <td>3.0</td>
    </tr>
  </tbody>
</table>

When we run `pets.groupby('Species').mean()`, `babypandas` does three things under the hood.


#### Step 1: Split

First, it **splits** the rows of `pets` into "groups" according to their values in the `'Species'` column.

<center>🐶</center>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Species</th>
      <th>Color</th>
      <th>Weight</th>
      <th>Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>dog</td>
      <td>black</td>
      <td>40</td>
      <td>5.0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>dog</td>
      <td>white</td>
      <td>80</td>
      <td>2.0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>dog</td>
      <td>golden</td>
      <td>25</td>
      <td>0.5</td>
    </tr>
  </tbody>
</table>

<br>

<center>🐱</center>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Species</th>
      <th>Color</th>
      <th>Weight</th>
      <th>Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>cat</td>
      <td>golden</td>
      <td>15</td>
      <td>8.0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>cat</td>
      <td>black</td>
      <td>20</td>
      <td>9.0</td>
    </tr>
  </tbody>
</table>

<br>

<center>🐹</center>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Species</th>
      <th>Color</th>
      <th>Weight</th>
      <th>Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>5</th>
      <td>hamster</td>
      <td>golden</td>
      <td>1</td>
      <td>3.0</td>
    </tr>
  </tbody>
</table>

#### Step 2: Aggregate

Then, it **aggregates** the rows with the same value of `'Species'` by taking the `mean` of all numerical columns.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Weight</th>
      <th>Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>dog</th>
      <td>48.33</td>
      <td>2.5</td>
    </tr>
  </tbody>
</table>

<br>

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Weight</th>
      <th>Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>cat</th>
      <td>17.5</td>
      <td>8.5</td>
    </tr>
  </tbody>
</table>

<br>

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Weight</th>
      <th>Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>hamster</th>
      <td>1.0</td>
      <td>3.0</td>
    </tr>
  </tbody>
</table>
        

#### Step 3: Combine

Finally, it **combines** these means into a new DataFrame that is indexed by `'Species'` and sorted by `'Species'` in ascending order.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Weight</th>
      <th>Age</th>
    </tr>
    <tr>
      <th>Species</th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>cat</th>
      <td>17.50</td>
      <td>8.5</td>
    </tr>
    <tr>
      <th>dog</th>
      <td>48.33</td>
      <td>2.5</td>
    </tr>
    <tr>
      <th>hamster</th>
      <td>1.00</td>
      <td>3.0</td>
    </tr>
  </tbody>
</table>
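The three steps above can be sketched in plain Python, without any DataFrames. This is only an illustration of the idea, not how `babypandas` is actually implemented; the names `pets_rows`, `groups`, and `aggregated` are made up for this sketch.

```python
# The same six pets as in the table above, as (species, weight, age) tuples.
pets_rows = [
    ('dog', 40, 5.0), ('cat', 15, 8.0), ('cat', 20, 9.0),
    ('dog', 80, 2.0), ('dog', 25, 0.5), ('hamster', 1, 3.0),
]

# Step 1: split the rows into groups, keyed by species.
groups = {}
for species, weight, age in pets_rows:
    groups.setdefault(species, []).append((weight, age))

# Step 2: aggregate each group by taking the mean of each numeric column.
def mean(values):
    return sum(values) / len(values)

aggregated = {
    species: (mean([w for w, _ in rows]), mean([a for _, a in rows]))
    for species, rows in groups.items()
}

# Step 3: combine into a single result, sorted by group label.
result = sorted(aggregated.items())
for species, (avg_weight, avg_age) in result:
    print(species, round(avg_weight, 2), avg_age)
```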

### Let's try it out!

In [None]:
pets = bpd.DataFrame().assign(
    Species=['dog', 'cat', 'cat', 'dog', 'dog', 'hamster'],
    Color=['black', 'golden', 'black', 'white', 'golden', 'golden'],
    Weight=[40, 15, 20, 80, 25, 1],
    Age=[5, 8, 9, 2, 0.5, 3]
)
pets

In [None]:
pets.groupby('Species').mean()

### Back to Get It Done service requests 👷

In [None]:
requests

In [None]:
requests.groupby('neighborhood').sum()

Our original goal was to find the neighborhood with the most total requests, so after grouping, we need to sort:

In [None]:
# Note the use of .index – remember, the index isn't a column!
(
    requests
    .groupby('neighborhood')
    .sum()
    .sort_values(by='total', ascending=False)
    .index[0]
)

### Using `.groupby` in general

In short, `.groupby` aggregates all rows with the same value in a specified column (e.g. `'neighborhood'`) into a single row in the resulting DataFrame, using an aggregation method (e.g. `.sum()`) to combine values.

1. **Choose a column to group by**.
    - `.groupby(column_name)` will gather rows which have the same value in the specified column (`column_name`).
    - On the previous slide, we grouped by `'neighborhood'`.
    - In the resulting DataFrame, there was one row for every unique value of `'neighborhood'`.

2. **Choose an aggregation method**.
    - The aggregation method will be applied **within** each group.
    - On the previous slide, we applied the `.sum()` method to every `'neighborhood'`.
    - The aggregation method is applied individually to each column (e.g. the sums were computed separately for `'closed'`, `'open'`, and `'total'`). 
        - If it doesn't make sense to use the aggregation method on a column, the column is dropped from the output – we'll look at this in more detail shortly.
    - Common aggregation methods include `.count()`, `.sum()`, `.mean()`, `.median()`, `.max()`, and `.min()`.

### Observation #1

- The index has changed to neighborhood names.
- In general, the new row labels are the *group labels* (i.e., the unique values in the column that we grouped on), sorted in ascending order.

In [None]:
requests

In [None]:
requests.groupby('neighborhood').sum()

### Observation #2

The `'service'` column has disappeared. Why?

In [None]:
requests

In [None]:
requests.groupby('neighborhood').sum()

### Disappearing columns ✨🐇🎩  

- The aggregation method – `.sum()`, in this case – is applied to each column.
- If it doesn't make sense to apply it to a particular column, that column will disappear.
- For instance, we _can't_ sum strings, like in the `'service'` column.
- However, we _can_ compute the max of several strings. How?

In [None]:
# Can you guess how the max is determined?
requests.groupby('neighborhood').max() 

### Observation #3

- The aggregation method is applied to each column **separately**.
- The rows of the resulting DataFrame need to be interpreted with care.

In [None]:
requests.groupby('neighborhood').max()

Why isn't the `'total'` column equal to the sum of the `'closed'` and `'open'` columns, as it originally was?

In [None]:
# Why don't these numbers match those in the grouped DataFrame?
requests[(requests.get('neighborhood') == 'Balboa Park') & (requests.get('service') == 'Weed Cleanup')]
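A small hypothetical example may make the mismatch easier to see. The sketch below uses plain `pandas` (assumed to behave like `babypandas` for this purpose) and a made-up neighborhood `'A'` with two services.

```python
import pandas as pd

# Hypothetical toy data: one neighborhood, two services.
toy = pd.DataFrame({
    'neighborhood': ['A', 'A'],
    'service': ['Pothole', 'Weed Cleanup'],
    'closed': [10, 2],
    'open': [1, 8],
})
toy = toy.assign(total=toy.get('closed') + toy.get('open'))

# Each column's max is found separately within the group.
grouped = toy.groupby('neighborhood').max()
print(grouped)
```

In the output, the `'closed'` value (10) comes from the `'Pothole'` row, the `'open'` value (8) comes from the `'Weed Cleanup'` row, and the `'total'` value (11) comes from the `'Pothole'` row. The resulting row mixes values from different original rows, so it does not describe any single service.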

### Example: Number of different services

How do we find the number of different services requested in each neighborhood?

As always when using `groupby`, there are two steps:

1. Choose a column to group by.
    - Here, `'neighborhood'` seems like a good choice.

2. Choose an aggregation method.
    - Common aggregation methods include `.count()`, `.sum()`, `.mean()`, `.median()`, `.max()`, and `.min()`.

In [None]:
# How many different services were requested in the 'University' neighborhood?
requests[requests.get('neighborhood') == 'University']

In [None]:
# How do we find this result for every neighborhood?

### Observation #4

The column names of the output of `.groupby` don't make sense when using the `.count()` aggregation method.

In [None]:
num_diff_services = requests.groupby('neighborhood').count()
num_diff_services

Consider dropping unneeded columns and renaming columns as follows:
1. Use `.assign` to create a new column containing the same values as the old column(s).
2. Use `.drop(columns=list_of_column_labels)` to drop the old column(s). Alternatively, use `.get(list_of_column_labels)` to keep only the columns in the given list. The columns will appear in the order you specify, so this is also useful for reordering columns!

In [None]:
num_diff_services = (
    num_diff_services
    .assign(count_of_services=num_diff_services.get('open'))
    .drop(columns=['service', 'closed', 'open', 'total'])
)
num_diff_services
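The alternative mentioned above, keeping columns with `.get(list_of_column_labels)` instead of dropping the rest, might look like the sketch below. It uses plain `pandas` on a hypothetical `toy` table, assuming `.get` with a list behaves the same way as in `babypandas`.

```python
import pandas as pd

# Hypothetical toy data; not the real `requests` DataFrame.
toy = pd.DataFrame({
    'neighborhood': ['Downtown', 'Downtown', 'Uptown'],
    'service': ['Pothole', 'Graffiti', 'Pothole'],
    'closed': [10, 3, 7],
    'open': [1, 0, 4],
})

counts = toy.groupby('neighborhood').count()

# Copy one of the count columns under a clearer name, then keep only that column.
renamed = (
    counts
    .assign(count_of_services=counts.get('open'))
    .get(['count_of_services'])
)
print(renamed)
```

Passing a list to `.get` returns a DataFrame, while passing a single label returns a Series.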

## More practice: IMDb dataset 🎞️

<center>
<img width=40% src="images/imdb.png"/>
</center>

### Challenge problems!

We won't cover this section in class. Instead, it's here for you to practice with some harder examples.

The video below walks through the solutions (it's also linked [here](https://youtu.be/xg7rnjWnZ48)). You can also see the solutions by clicking the "✅ Click <b>here</b> to see the answer." button below each question.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('xg7rnjWnZ48')

Before watching the video or looking at the solutions, **make sure to try these problems on your own** – they're great prep for homeworks, projects, and exams! Feel free to ask about them in office hours or on Ed.

In [None]:
imdb = bpd.read_csv('data/imdb.csv').set_index('Title').sort_values(by='Rating')
imdb

### Question: How many movies appear from each decade?

In [None]:
imdb.groupby('Decade').count()

In [None]:
# We'll learn how to make plots like this in the next lecture!
imdb.groupby('Decade').count().plot(y='Year');

### Question: What was the highest rated movie of the 1990s?

Let's try to do this two different ways.

#### Without grouping

In [None]:
imdb[imdb.get('Decade') == 1990].sort_values('Rating', ascending=False).index[0]

_Note:_ The command to extract the index of a DataFrame is `.index`, with no parentheses! This is different from the way we extract columns, with `.get()`, because the index is not a column.

#### With grouping

In [None]:
imdb.reset_index().groupby('Decade').max()

- It turns out that this method **does not** yield the correct answer. 
- When we use an aggregation method (e.g. `.max()`), aggregation is done to each column individually. 
- While it's true that the highest rated movie from the 1990s has a rating of 9.2, that movie is **not** Unforgiven – instead, Unforgiven is the movie that's the latest in the alphabet among all movies from the 1990s.
- Taking the `max` is not helpful here.

### Question: How many years have more than 3 movies rated above 8.5?

<details>
    <summary>✅ Click <b>here</b> to see the answer.</summary>

<pre>
good_movies_per_year = imdb[imdb.get('Rating') > 8.5].groupby('Year').count()
good_movies_per_year[good_movies_per_year.get('Votes') > 3].shape[0]    
</pre>
    
As mentioned below, you can also use:
    
<pre>
(good_movies_per_year.get('Votes') > 3).sum() 
</pre>
    
</details>

#### Aside: Using `.sum()` on a boolean array

- Summing a boolean array gives a count of the number of `True` elements because Python treats `True` as 1 and `False` as 0. 
- Can you use that fact here?
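Here is a minimal sketch of that fact on a made-up NumPy array. The values in `counts` are hypothetical and are not taken from the IMDb data.

```python
import numpy as np

# Hypothetical counts, e.g. the number of movies in each of five years.
counts = np.array([5, 2, 4, 1, 7])

# Comparing an array to a number gives a Boolean array.
more_than_3 = counts > 3

# Each True counts as 1 and each False as 0, so the sum is the number of Trues.
num_true = more_than_3.sum()
print(more_than_3, num_true)
```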

### Question: Out of the years with more than 3 movies, which had the highest average rating?

<details>
    <summary>✅ Click <b>here</b> to see the answer.</summary>

<pre>
more_than_3_ix = imdb.groupby('Year').count().get('Votes') > 3
imdb.groupby('Year').mean()[more_than_3_ix].sort_values(by='Rating').index[-1]
 
</pre>
    
</details>

### Question: Which year had the longest movie titles, on average?

**Hint:** Use `.str.len()` on the column or index that contains the names of the movies.

<details>
    <summary>✅ Click <b>here</b> to see the answer.</summary>

<pre>
(
    imdb.assign(title_length=imdb.index.str.len())
    .groupby('Year').mean()
    .sort_values(by='title_length')
    .index[-1]
)
</pre>
    
The year is 1964 – take a look at the movies from 1964 by querying!
    
</details>

### Question: What is the average rating of movies from years that had at least 3 movies in the Top 250?

<details>
    <summary>✅ Click <b>here</b> to see the answer.</summary>

<pre>
# A Series of Trues and Falses; True when there were at least 3 movies on the list from that year
at_least_3_ix = imdb.groupby('Year').count().get('Votes') >= 3

# The sum of the ratings of movies from years that had at least 3 movies on the list
total_rating = imdb.groupby('Year').sum()[at_least_3_ix].get('Rating').sum()

# The total number of movies from years that had at least 3 movies on the list
count = imdb.groupby('Year').count()[at_least_3_ix].get('Rating').sum()

# The correct answer
average_rating = total_rating / count

# Close, but incorrect: 
# Doesn't account for the fact that different years have different numbers of movies on the list
close_but_wrong = imdb.groupby('Year').mean()[at_least_3_ix].get('Rating').mean()
</pre>
        
</details>
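To see why averaging the yearly averages can give a different number than averaging over all movies, consider this small sketch in plain Python. The years and ratings are made up for illustration.

```python
# Hypothetical ratings: six movies from one year, four from another.
ratings_by_year = {
    2000: [8.0, 8.0, 8.0, 8.0, 8.0, 8.0],
    2001: [9.0, 9.0, 9.0, 9.0],
}

# Average over all movies: total rating divided by total number of movies.
total_rating = sum(sum(ratings) for ratings in ratings_by_year.values())
count = sum(len(ratings) for ratings in ratings_by_year.values())
average_rating = total_rating / count

# Average of the yearly averages: treats each year equally, regardless of size.
yearly_means = [sum(ratings) / len(ratings) for ratings in ratings_by_year.values()]
mean_of_means = sum(yearly_means) / len(yearly_means)

print(average_rating, mean_of_means)
```

The two results differ because the year with more movies should count for more in the overall average.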

## Summary, next time

### Summary

- We can write queries that involve multiple conditions, as long as we:
    - Put parentheses around each condition.
    - Separate conditions using `&` if we require all to be true, or `|` if we require at least one to be true.
- The method call `df.groupby(column_name).agg_method()` **aggregates** all rows with the same value for `column_name` into a single row in the resulting DataFrame, using `agg_method()` to combine values.
    - Common aggregation methods include `.count()`, `.sum()`, `.mean()`, `.median()`, `.max()`, and `.min()`.

### Next time

A picture is worth 1000 words – it's time to visualize!