In [None]:
# Run this cell if you're following along – it just helps make the lectures appear prettier.
import pandas as pd
import numpy as np

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Lecture 4 – DataFrames: Accessing, Sorting, and Querying

## DSC 10, Spring 2023

### Announcements

- Lab 0 is due **tomorrow at 11:59PM**.
    - Submit early and avoid [Submission Errors](https://dsc10.com/syllabus/#submission-errors).
    - You can turn it in up to two days late using slip days – see the [Syllabus](https://dsc10.com/syllabus/#deadlines-and-slip-days) for more information. You never need to ask to use a slip day.
    - Please fill out the [Welcome Survey](https://docs.google.com/forms/d/e/1FAIpQLSfP_7dzEgsXgKcrV6zcafpJgepABS_WLXch_9iXHzTtJevTqw/viewform) as well!
- Lab 1 is the next assignment, and it's due on **Saturday at 11:59PM**.
- Come to office hours (see the schedule [here](https://dsc10.com/calendar)) and post on Ed for help!
- Watch [this video](https://www.youtube.com/watch?v=w_witptT6Ts), which walks through the activity from the end of Friday's lecture.
- You must be present when attendance is taken in discussion to get credit, even if you have a conflicting class.

### Agenda

Today, we'll use a real dataset and lots of motivating questions to illustrate key DataFrame manipulation techniques.

#### Note:

- Remember to check the [Resources tab of the course website](https://dsc10.com/resources/) for programming resources.
- Some key links moving forward:
    - [DSC 10 Reference Sheet](https://drive.google.com/file/d/1ky0Np67HS2O4LO913P-ing97SJG0j27n/view).
    - [`babypandas` notes](https://notes.dsc10.com).
    - [`babypandas` documentation](https://babypandas.readthedocs.io/en/latest/index.html).

## DataFrames

### `pandas`

- `pandas` is a Python package that allows us to work with **tabular** data – that is, data in the form of a table that we might otherwise work with as a spreadsheet (in Excel or Google Sheets).
- `pandas` is **the** tool for doing data science in Python.

<center>
<img src='data/pandas.png' width=500>
</center>

### But `pandas` is not so cute...

<center>
<img height=100% src="data/angrypanda.jpg"/>
</center>

### Enter `babypandas`!

- We at UCSD have created a smaller, nicer version of `pandas` called `babypandas`.
- It keeps the important stuff and has much better error messages.
- It's easier to learn, but is still valid `pandas` code. **You are learning `pandas`!**

<center>
<img height=75% src="data/babypanda.jpg" width=500/>
</center>

### DataFrames in `babypandas` 🐼

- Tables in `babypandas` (and `pandas`) are called "DataFrames."
- To use DataFrames, we'll need to import `babypandas`. (We'll need `numpy` as well.)

In [None]:
import babypandas as bpd
import numpy as np

### About the Data: Get It Done 👷

- We'll usually work with data stored in the CSV format. CSV stands for "comma-separated values."
- The file `data/get-it-done-apr-08.csv` contains service requests made on April 8th, 2023 through the [Get It Done](https://www.sandiego.gov/get-it-done) program. 
- Get It Done allows the general public to report non-emergency problems to the City of San Diego through a mobile app, website, or phone call.

<center>
<img height=75% src="data/get-it-done.jpg" width=500/>
</center>

### Reading data from a file 📖

We can read in a CSV using `bpd.read_csv(...)`. Give it the path to a file relative to your notebook; if the file is in the same folder as your notebook, this is just the name of the file.

In [None]:
apr_08 = bpd.read_csv('data/get-it-done-apr-08.csv')
apr_08

### Structure of a DataFrame

- DataFrames have *columns* and *rows*.
    - Think of each column as an array. Columns contain data of the same type.
- Each column has a label, e.g. `'neighborhood'` and `'status'`.
    - A column's label is its name.
    - Column labels are stored as strings.
- Each row has a label too.
    - Together, the row labels are called the _index_. The index is **not** a column!
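
As a minimal sketch of this structure, the toy table below uses made-up values, and plain `pandas` stands in for `babypandas` (an assumption, so that the example is self-contained). The name `toy` is hypothetical.

```python
import pandas as pd  # plain pandas stands in for babypandas (assumption)

# Hypothetical toy table; the values are made up for illustration.
toy = pd.DataFrame({
    'neighborhood': ['Downtown', 'University', 'Downtown'],
    'status': ['open', 'closed', 'closed'],
})

# Column labels are strings.
column_labels = list(toy.columns)

# Row labels make up the index; by default they are 0, 1, 2, ...
row_labels = list(toy.index)

print(column_labels)  # ['neighborhood', 'status']
print(row_labels)     # [0, 1, 2]
```

Note that the index is listed separately from the columns: it labels the rows but is not counted among them.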
    

In [None]:
# This DataFrame has 1089 rows and 7 columns.
apr_08

### Setting a new index

- We can set a better index using `.set_index(column_name)`.
- Row labels should be unique identifiers.
    - Row labels are row names; ideally, each row has a different, descriptive name.

In [None]:
apr_08.set_index('service_request_id')

🚨 Like most DataFrame methods, `.set_index` returns a new DataFrame; it does not modify the original DataFrame.

In [None]:
apr_08

In [None]:
apr_08 = apr_08.set_index('service_request_id')
apr_08

### Shape of a DataFrame

- `.shape` gives the number of rows and columns in a given DataFrame, as a pair `(rows, columns)`.
- Access each with `[]`: 
    - `.shape[0]` for rows.
    - `.shape[1]` for columns.
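
Here is a small sketch of how `.shape` interacts with the index, using a made-up three-row table and plain `pandas` in place of `babypandas` (an assumption). The ID values are hypothetical.

```python
import pandas as pd  # plain pandas stands in for babypandas (assumption)

# Hypothetical toy table with made-up request IDs.
toy = pd.DataFrame({
    'service_request_id': [101, 102, 103],
    'neighborhood': ['Downtown', 'University', 'Downtown'],
    'status': ['open', 'closed', 'closed'],
})

before = toy.shape                             # (3, 3)
indexed = toy.set_index('service_request_id')  # returns a new DataFrame
after = indexed.shape                          # (3, 2)

n_rows = indexed.shape[0]  # 3
n_cols = indexed.shape[1]  # 2: the index is no longer counted as a column
```

The original `toy` still has three columns afterwards, since `.set_index` does not modify it.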

In [None]:
# There were 7 columns before, but one of them became the index, and the index is not a column!
apr_08.shape

In [None]:
# Number of rows.
apr_08.shape[0]

In [None]:
# Number of columns.
apr_08.shape[1]

### Annual summary of Get It Done requests

- The file `data/get-it-done-requests.csv` contains a summary of all Get It Done requests submitted so far in 2023.
- This dataset describes the types of problems being reported in each neighborhood and the number of service requests that are resolved (`'closed'`) versus unresolved (`'open'`).

In [None]:
requests = bpd.read_csv('data/get-it-done-requests.csv')
requests

## Example 1: Total requests

**Key concepts**: Accessing columns, performing operations with them, and adding new columns.

### Finding total requests

**Question**: How many **total** service requests of each type in each neighborhood have been made this year?

In [None]:
requests

- We have, separately, the number of closed service requests and open service requests of each type in each neighborhood.

- Steps:
    - Get the column of closed requests.
    - Get the column of open requests.
    - Add these columns element-wise.
    - Add a new column to the DataFrame with these totals.

#### Step 1 – Getting the column of closed requests

- We can get a column from a DataFrame using `.get(column_name)`.
- 🚨 Column names are case sensitive!
- Column names are strings, so we need to use quotes.
- The result looks like a 1-column DataFrame, but is actually a *Series*.

In [None]:
requests

In [None]:
requests.get('closed')

### Digression: Series

- A *Series* is like an array, but with an index.
- In particular, Series support arithmetic, just like arrays.
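
To make the comparison with arrays concrete, here is a minimal sketch with made-up counts. Plain `pandas` stands in for `babypandas` (an assumption), and the neighborhood names serve as the index.

```python
import numpy as np
import pandas as pd  # plain pandas stands in for babypandas (assumption)

# Hypothetical toy data; the counts are made up.
toy = pd.DataFrame({
    'neighborhood': ['Downtown', 'University', 'La Jolla'],
    'closed': [40, 15, 8],
}).set_index('neighborhood')

closed = toy.get('closed')    # a Series, not a DataFrame
doubled = closed * 2          # arithmetic is element-wise, as with arrays
as_array = np.array(closed)   # the same values, without the index

print(doubled.loc['Downtown'])  # 80
```

The result of the arithmetic keeps the same index as the original Series, which is the main difference from working with a bare array.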

In [None]:
requests.get('closed')

In [None]:
type(requests.get('closed'))

#### Steps 2 and 3 – Getting the column of open requests and calculating the total

In [None]:
requests.get('open')

- Just like with arrays, we can perform arithmetic operations with two Series, as long as they have the same length and same index. 
- Operations happen element-wise, and the result is also a Series.
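
A minimal sketch of element-wise addition, assuming a toy table with made-up counts and plain `pandas` in place of `babypandas`:

```python
import pandas as pd  # plain pandas stands in for babypandas (assumption)

# Hypothetical toy data; the counts are made up.
toy = pd.DataFrame({
    'closed': [40, 15, 8],
    'open': [10, 5, 2],
})

# Both Series share the index 0, 1, 2, so entries are paired up row by row.
totals = toy.get('closed') + toy.get('open')

print(totals.tolist())  # [50, 20, 10]
```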

In [None]:
requests.get('closed') + requests.get('open')

#### Step 4 – Adding the totals to the DataFrame as a new column

- Use `.assign(name_of_column=data_in_series)` to add a Series (or array, or list) to a DataFrame as a new column.
- 🚨 Don't put quotes around `name_of_column`.
- This creates a new DataFrame, which we must save to a variable if we want to keep using it.
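
The sketch below illustrates that `.assign` leaves the original DataFrame alone. It assumes a toy table with made-up counts and uses plain `pandas` in place of `babypandas`.

```python
import pandas as pd  # plain pandas stands in for babypandas (assumption)

# Hypothetical toy data; the counts are made up.
toy = pd.DataFrame({
    'closed': [40, 15, 8],
    'open': [10, 5, 2],
})

# The column name is written as a keyword, without quotes.
with_total = toy.assign(total=toy.get('closed') + toy.get('open'))

print(list(with_total.columns))  # ['closed', 'open', 'total']
print(list(toy.columns))         # ['closed', 'open']
```

To keep the new column, the result has to be saved, for example by reassigning it to the same variable name.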

In [None]:
requests.assign(
    total=requests.get('closed') + requests.get('open')
)

In [None]:
requests

In [None]:
requests = requests.assign(
    total=requests.get('closed') + requests.get('open')
)
requests

## Example 2: Analyzing requests

**Key concept**: Computing statistics of columns using Series methods.

### Questions

- What is the largest number of service requests for any one service in any one neighborhood? 
- What is a typical number of service requests for any one service in any one neighborhood?

Series, like arrays, have helpful methods, including `.min()`, `.max()`, and `.mean()`.
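
As a quick sketch of these methods on numbers small enough to check by hand (the values are made up, and plain `pandas` stands in for `babypandas`):

```python
import pandas as pd  # plain pandas stands in for babypandas (assumption)

# Hypothetical toy totals; the values are made up.
toy = pd.DataFrame({'total': [50, 20, 10, 40]})
totals = toy.get('total')

largest = totals.max()     # 50
smallest = totals.min()    # 10
average = totals.mean()    # (50 + 20 + 10 + 40) / 4 = 30.0
middle = totals.median()   # sorted: 10, 20, 40, 50, so (20 + 40) / 2 = 30.0
```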

In [None]:
requests.get('total').max()

What is it that people are reporting so frequently, and where? We'll see how to find out shortly!

Other statistics:

In [None]:
requests.get('total').mean()

In [None]:
requests.get('total').median()

In [None]:
requests.get('open').mean()

In [None]:
requests.get('open').median()

In [None]:
# Lots of information at once!
requests.get('total').describe()

## Example 3: *What and where* is the most frequently requested service?

**Key concepts**: Sorting. Accessing using integer positions.

#### Step 1 – Sorting the DataFrame

- Use the `.sort_values(by=column_name)` method to sort.
    - The `by=` can be omitted, but helps with readability.
- Like most DataFrame methods, this returns a new DataFrame.

In [None]:
requests.sort_values(by='total')

This sorts, but in ascending order (small to large). The opposite would be nice!

#### Step 1 – Sorting the DataFrame in *descending* order

- Use `.sort_values(by=column_name, ascending=False)` to sort in *descending* order.
- `ascending` is an optional argument. If omitted, it will be set to `True` by default.
    - This is an example of a *keyword argument*, or a *named argument*.
    - If we want to specify the sorting order, we **must** use the keyword `ascending=`.
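
A small sketch of both sort orders, assuming a toy table with made-up totals and plain `pandas` in place of `babypandas`:

```python
import pandas as pd  # plain pandas stands in for babypandas (assumption)

# Hypothetical toy data; the totals are made up.
toy = pd.DataFrame({
    'neighborhood': ['University', 'Downtown', 'La Jolla'],
    'total': [20, 50, 10],
})

# Default: ascending=True, so small to large.
smallest_first = toy.sort_values(by='total')

# Keyword argument ascending=False flips the order.
largest_first = toy.sort_values(by='total', ascending=False)

print(smallest_first.get('total').tolist())  # [10, 20, 50]
print(largest_first.get('total').tolist())   # [50, 20, 10]
```

Neither call changes `toy` itself; each returns a new, sorted DataFrame.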

In [None]:
ordered_requests = requests.sort_values(by='total', ascending=False)
ordered_requests

In [None]:
# We must specify the role of False by using ascending=;
# otherwise, Python does not know how to interpret it.
# As written, this line intentionally raises a SyntaxError.
requests.sort_values(by='total', False)

#### Step 2 – Extracting the neighborhood and service

- We saw that the most reported issue is `'Encampment'` in `'Downtown'`, but how do we extract that information using code?
- First, grab an entire column as a Series.
- Navigate to a particular entry of the Series using `.iloc[integer_position]`.
    - `iloc` stands for "integer location."
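
Putting sorting and `.iloc` together, here is a minimal sketch on a toy table. The totals are made up and plain `pandas` stands in for `babypandas` (an assumption).

```python
import pandas as pd  # plain pandas stands in for babypandas (assumption)

# Hypothetical toy data; the totals are made up.
toy = pd.DataFrame({
    'neighborhood': ['University', 'Downtown', 'La Jolla'],
    'service': ['Pothole', 'Encampment', 'Graffiti'],
    'total': [20, 50, 10],
})

ordered = toy.sort_values(by='total', ascending=False)

# Integer position 0 is the first row after sorting, i.e. the largest total.
top_neighborhood = ordered.get('neighborhood').iloc[0]
top_service = ordered.get('service').iloc[0]

print(top_neighborhood, top_service)  # Downtown Encampment
```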

In [None]:
ordered_requests

In [None]:
ordered_requests.get('neighborhood')

In [None]:
ordered_requests.get('neighborhood').iloc[0]

In [None]:
ordered_requests.get('service').iloc[0]

## Example 4: Status of a request

**Key concept**: Accessing using row labels.

### Status of a request

- On April 8th, you submitted service request **4183848**. Has the issue been resolved? 

- This cannot be answered from the annual summary data, but must be answered from the detailed data about April 8th.

In [None]:
apr_08

Your service request is buried in the middle of the DataFrame. Only the first few rows and last few rows are shown, so you can't tell just by looking at the DataFrame.

### Accessing using the row label

To pull out one particular entry of a DataFrame corresponding to a row and column with certain labels:
1. Use `.get(column_name)` to extract the entire column as a Series.
2. Use `.loc[]` to access the element of a Series with a particular row label.

In this class, we'll always first access a column, then a row (but row, then column is also possible).
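
The two steps can be sketched on a toy table as follows. The request IDs and statuses are made up, and plain `pandas` stands in for `babypandas` (an assumption).

```python
import pandas as pd  # plain pandas stands in for babypandas (assumption)

# Hypothetical toy data with made-up request IDs as the index.
toy = pd.DataFrame({
    'service_request_id': [101, 102, 103],
    'status': ['open', 'closed', 'open'],
}).set_index('service_request_id')

# Step 1: column first. Step 2: then the row label.
status_of_102 = toy.get('status').loc[102]

print(status_of_102)  # closed
```

Here `102` is a row label, not a position: the row labeled `102` happens to be second, but `.loc` does not rely on that.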

In [None]:
apr_08.get('status')

In [None]:
apr_08.get('status').loc[4183848]

### Activity 🚔

Oh no, your service request **4183848** has still not been resolved! What was the problem again?

Write one line of code that evaluates to the full description of the problem, as you described it in your service request.

In [None]:
...

### Summary: Accessing elements in a Series

- There are two ways to get an element of a Series:
    - `.loc[]` uses the row label.
    - `.iloc[]` uses the integer position.
- `.loc[]` is usually more convenient, but each one is better suited to certain scenarios.

### Note

- Sometimes the integer position and row label are the same.
- This happens by default with `bpd.read_csv`.
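
The sketch below shows when the two agree and when they stop agreeing. It assumes a toy table with made-up totals and uses plain `pandas` in place of `babypandas`.

```python
import pandas as pd  # plain pandas stands in for babypandas (assumption)

# Hypothetical toy data; with no index set, the row labels are 0, 1, 2.
toy = pd.DataFrame({'total': [20, 50, 10]})
totals = toy.get('total')

# With the default index, label 1 and position 1 refer to the same entry.
same = totals.loc[1] == totals.iloc[1]  # True

# After sorting, rows keep their labels but change positions.
ordered = toy.sort_values(by='total', ascending=False).get('total')
by_position = ordered.iloc[0]  # 50: the first row after sorting
by_label = ordered.loc[0]      # 20: the row whose label is 0
```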

In [None]:
bpd.read_csv('data/get-it-done-apr-08.csv')

In [None]:
bpd.read_csv('data/get-it-done-apr-08.csv').get('public_description').loc[561]

In [None]:
bpd.read_csv('data/get-it-done-apr-08.csv').get('public_description').iloc[561]

## Reflection

### Questions we can answer right now...

- What is the largest number of open requests of one type in one neighborhood?
    - `requests.get('open').max()`.

- How many requests were made on April 8th?
    - `apr_08.shape[0]`.

- What is the description of the latest request made on April 8th?
    - `apr_08.sort_values(by='date_requested', ascending=False).get('public_description').iloc[0]`.

Moving forward, let's just focus on the `requests` DataFrame. As a reminder, here's what it looks like:

In [None]:
requests

### Questions we can't yet answer...
- Which neighborhood has the most `'Pothole'` requests?
- What is the most commonly requested service in the `'University'` neighborhood (near UCSD)?
- In the `'Downtown'` neighborhood, how many open service requests are there?

The common thread between these questions is that they all involve only a **subset** of the rows in our DataFrame.

## Example 5: Which neighborhood has the most `'Pothole'` requests? 🕳

**Key concept**: Querying.

### Selecting rows

- We could determine the neighborhood with the most `'Pothole'` requests if we had a DataFrame consisting of only this type of request.
    - We would sort by the `'total'` column in descending order, then extract the neighborhood name in the first row.
- How do we get that DataFrame?

### The solution

In [None]:
# This DataFrame only contains rows where the 'service' is 'Pothole'!
only_potholes = requests[requests.get('service') == 'Pothole']
only_potholes

Now that we have a DataFrame with only `'Pothole'` requests, we can sort by `'total'` in descending order and extract the `'neighborhood'` in the first row, just like we planned:

In [None]:
only_potholes.sort_values('total', ascending=False).get('neighborhood').iloc[0]

🤯 What just happened?

### Aside: Booleans

- When we compare two values, the result is either `True` or `False`.
    - Notice, these words are **not** in quotes.
- `bool` is a data type in Python, just like `int`, `float`, and `str`. 
    - It stands for "Boolean", named after George Boole, a 19th-century mathematician.
- There are only two possible Boolean values: `True` or `False`.
    - Yes or no.
    - On or off.
    - 1 or 0.

In [None]:
5 == 6

In [None]:
type(5 == 6)

In [None]:
9 + 10 < 21

In [None]:
'zebra' == 'zeb' + 'ra'

### Comparison operators

There are several types of comparisons we can make.

|symbol|meaning|
|--------|--------|
|`==` |equal to |
|`!=` |not equal to |
|`<`|less than|
|`<=`|less than or equal to|
|`>`|greater than|
|`>=`|greater than or equal to|

When comparing an entire Series to a value, the result is a Series of `bool`s.
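
As a small sketch of such a comparison, assuming a toy table with made-up rows and plain `pandas` in place of `babypandas`:

```python
import pandas as pd  # plain pandas stands in for babypandas (assumption)

# Hypothetical toy data; the rows are made up.
toy = pd.DataFrame({
    'service': ['Pothole', 'Encampment', 'Pothole'],
    'open': [120, 300, 40],
})

# Each entry is compared to the value, one at a time.
is_pothole = toy.get('service') == 'Pothole'
many_open = toy.get('open') > 100

print(is_pothole.tolist())  # [True, False, True]
print(many_open.tolist())   # [True, True, False]
```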

In [None]:
requests

In [None]:
requests.get('service') == 'Pothole'

### What is a query? 🤔

- A "query" is code that extracts rows from a DataFrame for which certain condition(s) are true.
- We often use queries to _filter_ DataFrames so that they only contain the rows that satisfy the conditions stated in our questions.

### How do we query a DataFrame?

To select only certain rows of `requests`:

1. Make a sequence (list/array/Series) of `True`s (keep) and `False`s (toss), usually by making a comparison.
2. Then pass it into `requests[sequence_goes_here]`.
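
The two steps above can be sketched on a toy table as follows. The rows are made up and plain `pandas` stands in for `babypandas` (an assumption).

```python
import pandas as pd  # plain pandas stands in for babypandas (assumption)

# Hypothetical toy data; the rows are made up.
toy = pd.DataFrame({
    'neighborhood': ['University', 'Downtown', 'La Jolla'],
    'service': ['Pothole', 'Encampment', 'Pothole'],
    'open': [120, 300, 40],
})

# Step 1: build the sequence of Trues (keep) and Falses (toss).
keep = toy.get('service') == 'Pothole'

# Step 2: pass it into the square brackets.
only_potholes = toy[keep]

# A hand-written list of Booleans selects the same rows.
same_rows = toy[[True, False, True]]

print(only_potholes.get('neighborhood').tolist())  # ['University', 'La Jolla']
```

In practice the sequence almost always comes from a comparison, as in Step 1, rather than being written out by hand.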

In [None]:
requests[requests.get('service') == 'Pothole']

### Another query

This time, we'll find the neighborhoods and services with over 100 open requests.

In [None]:
requests

In [None]:
requests.get('open') > 100

In [None]:
requests[requests.get('open') > 100]

## Summary

### Summary

- We learned many DataFrame methods and techniques. **Don't feel the need to memorize them all right away.**
- Instead, refer to this lecture, [the DSC 10 reference sheet](https://drive.google.com/file/d/1ky0Np67HS2O4LO913P-ing97SJG0j27n/view), [the `babypandas` notes](https://notes.dsc10.com/front.html), and [the `babypandas` documentation](https://babypandas.readthedocs.io/en/latest/index.html) when working on assignments.
- Over time, these techniques will become more and more familiar. Lab 1 will walk you through many of them.
- **Practice!** Frame your own questions using this dataset and try to answer them.

### Next time

- We'll start by reviewing queries, and talk about how to write queries with multiple conditions.
- We'll answer more complicated questions, which will lead us to a new core DataFrame method, `.groupby`, for organizing rows of a DataFrame with the same value in a particular column. 
- For example, we might want to organize the data by neighborhood, collecting all the different service requests for each neighborhood.