In [None]:
# Run this cell to set up packages for lecture.
from lec04_imports import *

# Lecture 4 – DataFrames

## DSC 10, Winter 2025

### Agenda

- DataFrames.

#### Note:

- Remember to check the [resources tab of the course website](https://dsc10.com/resources/) for programming resources.
- Some key links moving forward:
    - [DSC 10 reference sheet](https://dsc-courses.github.io/bpd-reference/docs/documentation/intro/).
    - [`babypandas` notes](https://notes.dsc10.com).

## DataFrames

### `pandas`

- `pandas` is a Python package that allows us to work with **tabular** data – that is, data in the form of a table that we might otherwise work with as a spreadsheet (in Excel or Google Sheets).
- `pandas` is **the** tool for doing data science in Python.

<center>
<img src='images/pandas.png' width=400>
</center>

### But `pandas` is not so cute...

<center>
<img height=100% src="images/angrypanda.jpg"/>
</center>

### Enter `babypandas`!

- We at UCSD have created a smaller, nicer version of `pandas` called `babypandas`.
- It keeps the important stuff and has much better error messages.
- It's easier to learn, but is still valid `pandas` code. **You are learning `pandas`!**
    - Think of it like learning how to build LEGOs with many, but not all, of the possible LEGO blocks. You're still learning how to build LEGOs, and you can still build cool things!

<center>
<img height=75% src="images/babypanda.jpg" width=400/>
</center>

### DataFrames in `babypandas` 🐼

- Tables in `babypandas` (and `pandas`) are called "DataFrames."
- To use DataFrames, we'll need to import `babypandas`. 

In [None]:
import babypandas as bpd

### Reading data from a file 📖

- We'll usually work with data stored in the CSV format. CSV stands for "comma-separated values."

- We can read in a CSV using `bpd.read_csv(...)`. Replace the `...` with a path to the CSV file relative to your notebook; if the file is in the same folder as your notebook, this is just the name of the file.
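
As a rough illustration of what reading a CSV produces, the sketch below parses a tiny, made-up CSV held in memory. It uses plain `pandas` and `io.StringIO` rather than `babypandas` and a real file; both are assumptions made so the example runs without the course data. The state and city names are hypothetical.

```python
import io
import pandas as pd  # plain pandas, assumed here in place of babypandas

# A tiny, made-up CSV: the first line holds the column labels,
# and each later line is one row with comma-separated values.
csv_text = """State,Capital City
Alpha,Acity
Beta,Bcity
"""

# read_csv accepts a file path or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```

With a real file, the `io.StringIO(...)` argument would be replaced by a path such as `'data/states.csv'`.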

In [None]:
# Our CSV file is not stored in the same folder as our notebook,
# but within a folder called data.
states = bpd.read_csv('data/states.csv')
states

### About the data 🗽

Most of the data is self-explanatory, but there are a few things to note:

- `'Population'` figures come from the 2020 census.

- `'Land Area'` is measured in square miles.

- The `'Region'` column places each state in one of four regions, as determined by the US Census Bureau.

<center>
<img src='images/regions.png' width=600>
</center>

- The `'Party'` column classifies each state as `'Democratic'` or `'Republican'` based on a political science measurement called the Cook Partisan Voter Index. 


<center>
<img src='images/party.png' width=600>
(<a href="https://www.cookpolitical.com/cook-pvi/2022-partisan-voting-index/state-map-and-list">source</a>)
</center>

### Structure of a DataFrame

- DataFrames have *columns* and *rows*.
    - Think of each column as an array. Columns contain data of the same type.
- Each column has a label, e.g. `'Capital City'` and `'Land Area'`.
    - Column labels are stored as strings.
- Each row has a label too – these are shown in bold at the start of the row.
    - Right now, the row labels are 0, 1, 2, and so on.
    - Together, the row labels are called the _index_. The index is **not** a column!
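
To make the pieces of this structure concrete, here is a minimal sketch on a made-up three-row table. It uses plain `pandas` (an assumption, so that it runs without the course files), and the names and numbers are hypothetical.

```python
import pandas as pd  # plain pandas, assumed here in place of babypandas

# Hypothetical data: two columns, three rows.
toy = pd.DataFrame({
    'State': ['Alpha', 'Beta', 'Gamma'],
    'Land Area': [10, 20, 40],
})

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column labels, stored as strings
print(list(toy.index))    # row labels (the index): 0, 1, 2 by default
```

Note that the index does not appear among the columns.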
    

In [None]:
# This DataFrame has 50 rows and 6 columns.
states

## Example 1: Population density

**Key concepts**: Accessing columns, performing calculations with them, and adding new columns.

### Finding population density

**Question**: What is the population density of each state, in people per square mile?

In [None]:
states

- We have, separately, the population and land area of each state.

- Steps:
    - Get the `'Population'` column.
    - Get the `'Land Area'` column.
    - Divide these columns element-wise.
    - Add a new column to the DataFrame with these results.

#### Step 1 – Getting the `'Population'` column

- We can get a column from a DataFrame using `.get(column_name)`.
- 🚨 Column names are case sensitive!
- Column names are strings, so we need to use quotes.
- The result looks like a 1-column DataFrame, but is actually a *Series*.
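
A small sketch of these points, using plain `pandas` on made-up data (both assumptions for illustration). One caveat: in plain `pandas`, asking `.get` for a name that does not exist returns `None` rather than raising an error; `babypandas` may behave differently.

```python
import pandas as pd  # plain pandas, assumed here in place of babypandas

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    'State': ['Alpha', 'Beta', 'Gamma'],
    'Population': [1000, 4000, 2000],
})

population = toy.get('Population')
print(type(population))  # a Series, not a DataFrame

# Column names are case sensitive: 'population' is not 'Population'.
# Plain pandas returns None for a missing column name.
print(toy.get('population'))
```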

In [None]:
states

In [None]:
states.get('Population')

### Digression: Series

- A *Series* is like an array, but with an index.
- In particular, Series support arithmetic, just like arrays.
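
For instance, dividing a Series by a number applies the division to every element and keeps the index. This is a minimal sketch with made-up values, using plain `pandas` as an assumption.

```python
import pandas as pd  # plain pandas, assumed here in place of babypandas

# Hypothetical populations.
population = pd.Series([1000, 4000, 2000])

# Arithmetic with a single number is applied to each element.
in_thousands = population / 1000
print(in_thousands)
```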

In [None]:
states.get('Population')

In [None]:
type(states.get('Population'))

#### Steps 2 and 3 – Getting the `'Land Area'` column and dividing element-wise

In [None]:
states.get('Land Area')

- Just like with arrays, we can perform arithmetic operations with two Series, as long as they have the same length and same index. 
- Operations happen element-wise (by matching up corresponding index values), and the result is also a Series.
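
The sketch below illustrates what "matching up corresponding index values" means. It uses plain `pandas` and two made-up Series whose labels are the same but listed in a different order; that setup is an assumption chosen to make the matching visible, and `babypandas` may be stricter about it.

```python
import pandas as pd  # plain pandas, assumed here in place of babypandas

# Hypothetical values. Both Series have the labels 'Alpha' and 'Beta',
# but land_area lists them in the opposite order.
population = pd.Series([1000, 4000], index=['Alpha', 'Beta'])
land_area = pd.Series([20, 10], index=['Beta', 'Alpha'])

# Division pairs values by label, not by position:
# Alpha is 1000 / 10 and Beta is 4000 / 20.
density = population / land_area
print(density)
```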

In [None]:
states.get('Population') / states.get('Land Area')

#### Step 4 – Adding the densities to the DataFrame as a new column

- Use `.assign(name_of_column=data_in_series)` to add a Series (or array, or list) to a DataFrame as a new column.
- 🚨 Don't put quotes around `name_of_column`.
- This creates a new DataFrame, which we must save to a variable if we want to keep using it.
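
A minimal sketch of the last point, on made-up data with plain `pandas` (both assumptions): the original DataFrame is unchanged after `.assign`, and only the returned DataFrame has the new column.

```python
import pandas as pd  # plain pandas, assumed here in place of babypandas

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    'State': ['Alpha', 'Beta'],
    'Population': [1000, 4000],
    'Land Area': [10, 20],
})

# .assign returns a new DataFrame; the keyword name becomes the column label.
with_density = toy.assign(
    Density=toy.get('Population') / toy.get('Land Area')
)

print('Density' in toy.columns)           # False: toy itself is unchanged
print('Density' in with_density.columns)  # True
```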

In [None]:
states.assign(
    Density=states.get('Population') / states.get('Land Area')
)

In [None]:
states

In [None]:
states = states.assign(
    Density=states.get('Population') / states.get('Land Area')
)
states

## Example 2: Exploring population density

**Key concept**: Computing statistics of columns using Series methods.

### Questions

- What is the highest population density of any one state? 
- What is the average population density across all states?

Series, like arrays, have helpful methods, including `.min()`, `.max()`, and `.mean()`.
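
As a quick check of what these methods return, here is a sketch on four made-up densities, using plain `pandas` as an assumption. The values are small enough to verify by hand.

```python
import pandas as pd  # plain pandas, assumed here in place of babypandas

# Hypothetical densities, in people per square mile.
density = pd.Series([100.0, 200.0, 50.0, 250.0])

print(density.max())     # 250.0
print(density.min())     # 50.0
print(density.mean())    # (100 + 200 + 50 + 250) / 4 = 150.0
print(density.median())  # middle of 50, 100, 200, 250 = 150.0
```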

In [None]:
states.get('Density').max()

What state does this correspond to? We'll see how to find out shortly!

Other statistics:

In [None]:
states.get('Density').min()

In [None]:
states.get('Density').mean()

In [None]:
states.get('Density').median()

In [None]:
# Lots of information at once!
states.get('Density').describe()

## Example 3: *Which* state has the highest population density?

**Key concepts**: Sorting. Accessing using integer positions.

#### Step 1 – Sorting the DataFrame

- Use the `.sort_values(by=column_name)` method to sort.
    - The `by=` can be omitted, but helps with readability.
- Like most DataFrame methods, this returns a new DataFrame.

In [None]:
states.sort_values(by='Density')

This sorts, but in ascending order (small to large). The opposite would be nice!

#### Step 1 – Sorting the DataFrame in *descending* order

- Use `.sort_values(by=column_name, ascending=False)` to sort in *descending* order.
- `ascending` is an optional argument. If omitted, it will be set to `True` by default.
    - This is an example of a *keyword argument*, or a *named argument*.
    - If we want to specify the sorting order, we **must** use the keyword `ascending=`.
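
The following sketch compares the two sort orders on a made-up table, using plain `pandas` as an assumption. It also confirms that sorting returns a new DataFrame and leaves the original order alone.

```python
import pandas as pd  # plain pandas, assumed here in place of babypandas

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    'State': ['Alpha', 'Beta', 'Gamma'],
    'Density': [100.0, 200.0, 50.0],
})

# Default: ascending order (small to large).
ascending = toy.sort_values(by='Density')

# Descending order requires the keyword argument ascending=False.
descending = toy.sort_values(by='Density', ascending=False)

print(list(ascending.get('State')))   # ['Gamma', 'Alpha', 'Beta']
print(list(descending.get('State')))  # ['Beta', 'Alpha', 'Gamma']
print(list(toy.get('State')))         # original order is unchanged
```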

In [None]:
ordered_states = states.sort_values(by='Density', ascending=False)
ordered_states

In [None]:
# We must specify the role of False by using ascending=,
# otherwise Python does not know how to interpret this.
# This cell intentionally raises a SyntaxError.
states.sort_values(by='Density', False)

#### Step 2 – Extracting the state name

- We saw that the most densely populated state is New Jersey, but how do we extract that information using code?
- First, grab an entire column as a Series.
- Navigate to a particular entry of the Series using `.iloc[integer_position]`.
    - `iloc` stands for "integer location"; positions count the rows from the top, starting at 0.

In [None]:
ordered_states

In [None]:
ordered_states.get('State')

In [None]:
# We want the first entry of the Series, which is at "integer location" 0.
ordered_states.get('State').iloc[0]

- The row label that goes with New Jersey is 29, because our original data was alphabetized by state and New Jersey is the 30th state alphabetically. But we **don't use the row label** when accessing with `iloc`; we use the integer position counting from the top.

- If we try to use the row label (29) with `iloc`, we get the state with the 30th highest population density, which is **not** New Jersey.
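
The same distinction can be seen on a small scale. In this sketch (plain `pandas` and made-up data, both assumptions), sorting moves the rows but each row keeps its original label, so positions and labels no longer agree.

```python
import pandas as pd  # plain pandas, assumed here in place of babypandas

# Hypothetical data; the default row labels are 0, 1, 2.
toy = pd.DataFrame({
    'State': ['Alpha', 'Beta', 'Gamma'],
    'Density': [100.0, 200.0, 50.0],
})

ordered = toy.sort_values(by='Density', ascending=False)

# Row labels travel with their rows when sorting.
print(list(ordered.index))           # [1, 0, 2]

# .iloc counts positions from the top and ignores the labels.
print(ordered.get('State').iloc[0])  # 'Beta', the densest
print(ordered.get('State').iloc[1])  # 'Alpha', even though label 1 belongs to Beta
```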

In [None]:
ordered_states.get('State').iloc[29]

## Example 4: What is the population density of Pennsylvania?

**Key concepts**: Setting the index. Accessing using row labels.

### Population density of Pennsylvania

We know how to get the `'Density'` of all states. How do we find the one that corresponds to Pennsylvania?

In [None]:
states

In [None]:
# Which one is Pennsylvania?
states.get('Density')

### Utilizing the index

- When we load in a DataFrame from a CSV, columns have meaningful names, but rows do not.

In [None]:
bpd.read_csv('data/states.csv')

- The row labels (or the *index*) are how we refer to specific rows. Instead of using numbers, let's refer to these rows by the names of the states they correspond to.

- This way, we can easily identify, for example, which row corresponds to Pennsylvania.

### Setting the index

- To change the index, use `.set_index(column_name)`.
- Row labels should be unique identifiers.
    - Each row should have a different, descriptive name that corresponds to the contents of that row's data.

In [None]:
states

In [None]:
states.set_index('State')

- Now there is one fewer column. When we set the index, the chosen column becomes the index, and the old index is discarded.

- 🚨 Like most DataFrame methods, `.set_index` returns a new DataFrame; it does not modify the original DataFrame.
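
Both points can be checked on a made-up table. This sketch assumes plain `pandas` in place of `babypandas`.

```python
import pandas as pd  # plain pandas, assumed here in place of babypandas

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    'State': ['Alpha', 'Beta', 'Gamma'],
    'Density': [100.0, 200.0, 50.0],
})

by_state = toy.set_index('State')

print(list(by_state.columns))  # ['Density']: 'State' is no longer a column
print(list(by_state.index))    # ['Alpha', 'Beta', 'Gamma']
print(list(toy.columns))       # ['State', 'Density']: toy is unchanged
```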

In [None]:
states

In [None]:
states = states.set_index('State')
states

In [None]:
# Which one is Pennsylvania? The one whose row label is "Pennsylvania"!
states.get('Density')

### Accessing using the row label

To pull out one particular entry of a DataFrame corresponding to a row and column with certain labels:
1. Use `.get(column_name)` to extract the entire column as a Series.
2. Use `.loc[]` to access the element of a Series with a particular row label.

In this class, we'll always first access a column, then a row (but row, then column is also possible).
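
The two steps, column first and then row label, might look like this on a made-up table (plain `pandas` is assumed, and the names and values are hypothetical):

```python
import pandas as pd  # plain pandas, assumed here in place of babypandas

# Hypothetical data, indexed by state name.
toy = pd.DataFrame({
    'State': ['Alpha', 'Beta', 'Gamma'],
    'Density': [100.0, 200.0, 50.0],
}).set_index('State')

# Step 1: extract the column as a Series.
density = toy.get('Density')

# Step 2: look up one element by its row label.
print(density.loc['Beta'])  # 200.0
```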

In [None]:
states.get('Density')

In [None]:
states.get('Density').loc['Pennsylvania']

### Summary: Accessing elements of a DataFrame

- First, `.get` the appropriate column as a Series.
- Then, use one of two ways to access an element of a Series:
    - `.iloc[]` uses the integer position.
    - `.loc[]` uses the row label.
    - Each is best for different scenarios.

In [None]:
states.get('Density')

In [None]:
states.get('Density').iloc[4]

In [None]:
states.get('Density').loc['California']

### Note

- Sometimes the integer position and row label are the same.
- This happens by default with `bpd.read_csv`.
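
A minimal sketch of this coincidence, using plain `pandas` and made-up city names (both assumptions): with the default index, row labels are 0, 1, 2, and so on, so `.loc` and `.iloc` return the same entry for the same number.

```python
import pandas as pd  # plain pandas, assumed here in place of babypandas

# Hypothetical data with the default index 0, 1, 2.
toy = pd.DataFrame({'Capital City': ['Acity', 'Bcity', 'Ccity']})
capitals = toy.get('Capital City')

print(capitals.loc[2])   # row label 2
print(capitals.iloc[2])  # integer position 2, the same row here
```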

In [None]:
bpd.read_csv('data/states.csv')

In [None]:
bpd.read_csv('data/states.csv').get('Capital City').loc[35]

In [None]:
bpd.read_csv('data/states.csv').get('Capital City').iloc[35]

## Summary, next time

### Summary

- We learned many DataFrame methods and techniques. **Don't feel the need to memorize them all right away.**
- Instead, refer to this lecture, [the `babypandas` notes](https://notes.dsc10.com/front.html), and [the DSC 10 reference sheet](https://dsc-courses.github.io/bpd-reference/docs/documentation/intro/) when working on assignments.
- Over time, these techniques will become more and more familiar. Lab 1 will walk you through many of them.
- **Practice!** Frame your own questions using this dataset and try to answer them. 

### Next time

We'll frame more questions and learn more DataFrame manipulation techniques to answer them. In particular, we'll learn about querying and grouping. 