In [None]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_rows', 9)

# Lecture 2 – DataFrame Fundamentals

## DSC 80, Spring 2023

### Announcements 📣

- Lab 1 is released and is due **next Monday, April 10th at 11:59PM!**
    - See the [Tech Support](https://dsc80.com/tech_support/#replicating-the-gradescope-environment) page for instructions and watch [this video 🎥](https://www.youtube.com/watch?v=PPKXJqu2XmY) for tips on how to set up your environment and work on assignments.
- You may use a slip day, in which case the due date will be April 11th.
- Discussion starts **today**.
- Lecture recordings are available [here](https://podcast.ucsd.edu/watch/sp23/dsc80_a00).
- Make sure to fill out the [Welcome Survey](https://docs.google.com/forms/d/e/1FAIpQLSfVuXEIVVs6AygLYkpS5KaQOTv3C7AObGdi2FD_LlmlJUOvdg/viewform).

### Agenda

- Introduction to `pandas`.
- Selecting columns.
    - `get` vs. `[]`.
    - Useful Series methods.
- Selecting rows (and columns).
    - `loc` and `iloc`.
    - Querying.
    
Remember, we are not going to cover every single detail! The [`pandas` documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) will be your friend.

## Introduction to `pandas` 🐼

### Baby pandas

- a subset of pandas that is beginner friendly.
<center><img src='imgs/babypanda.jpg' width=45%></center>

### pandas

- everything that you learned in babypandas will carry over.
<center><img src='imgs/angrypanda.jpg' width=60%></center>

### `pandas`

<center><img src='imgs/pandas.png' width=200></center>

- `pandas` is **the** Python library for tabular data manipulation.
- Before `pandas` was developed, the standard data science workflow involved using multiple languages (Python, R, Java) in a single project.
- Wes McKinney, the original developer of `pandas`, wanted a library which would allow everything to be done in Python.
    - Python is faster to develop in than Java, and is more general-purpose than R.

### `pandas` data structures

There are three key data structures at the core of `pandas`:
- DataFrame: 2-dimensional table.
- Series: 1-dimensional array-like object, typically representing a column or row.
- Index: sequence of column or row labels.
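
As a minimal sketch of how the three structures relate, the cell below builds a tiny DataFrame by hand. The table `toy` and its values are made up for illustration and are not part of the lecture dataset.

```python
import pandas as pd

# Toy table; the names and years are made up for illustration.
toy = pd.DataFrame({
    'Name': ['Alpha', 'Beta', 'Gamma'],
    'Founded': [1950, 1900, 2000],
})

column = toy['Founded']     # one column, as a Series
col_labels = toy.columns    # column labels, as an Index
row_labels = toy.index      # row labels, as an Index (0, 1, 2 by default)

print(type(toy).__name__)     # DataFrame
print(type(column).__name__)  # Series
print(list(col_labels))       # ['Name', 'Founded']
print(list(row_labels))       # [0, 1, 2]
```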

<center><img src='imgs/example-df.png' width=600></center>

### Importing `pandas` and related libraries

We've already run this at the top of the notebook, so we won't repeat it here. But `pandas` is almost always imported in conjunction with `numpy`:

<br>

```py
import pandas as pd
import numpy as np
```

### Example: Universities in California 📚

To refresh our memory on the basics of `pandas`, let's work with a dataset that contains the name, location, enrollment, and founding date of most UCs and CSUs.

- We use `pd.read_csv` to load a DataFrame from file.
- Aside: `os.path.join('x', 'y.csv')` evaluates to `'x/y.csv'` on Unix machines and `'x\y.csv'` on Windows.
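
The sketch below shows both ideas without needing a file on disk. The CSV text is made up, and `io.StringIO` stands in for a real file only so that the example is self-contained.

```python
import io
import os

import pandas as pd

# Build a path using the separator of the current operating system.
path = os.path.join('x', 'y.csv')
print(path == 'x' + os.sep + 'y.csv')  # True on any OS

# pd.read_csv accepts a path or a file-like object; this CSV text is made up.
csv_text = "Name,Founded\nAlpha,1950\nBeta,1900\n"
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)  # (2, 2)
```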

In [None]:
schools_path = os.path.join('data', 'california_universities.csv')
schools = pd.read_csv(schools_path)
schools

### Exploring a new DataFrame

To extract the first or last few rows of a DataFrame, use the `head` or `tail` methods.

In [None]:
schools.head()

In [None]:
schools.tail(2)

The `shape` attribute returns the DataFrame's number of rows and columns.

In [None]:
schools.shape

### The anatomy of a DataFrame

Each row and column of a DataFrame is a Series. 
- Note that Series look like arrays, but contain an index.
- The column labels and row labels are each stored as `Index` types.
- You can think of a DataFrame as a dictionary that maps column names to Series.

In [None]:
schools

In [None]:
# Index is 0, 1, 2, ..., 31
schools['City']

In [None]:
# The default index of a DataFrame is 0, 1, 2, 3, ...
schools.index

In [None]:
# Index is 'Name', 'City', 'County', ...
schools.iloc[-5]

In [None]:
schools.columns

### Sorting

The order of the rows in `schools` does not seem to be meaningful right now. To sort by a column, use the `sort_values` method. Like most DataFrame and Series methods, `sort_values` returns a new DataFrame, and doesn't modify the original.
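
To see that `sort_values` leaves the original untouched, here is a small sketch on a made-up table (`toy` and its values are for illustration only):

```python
import pandas as pd

# Toy table; the values are made up for illustration.
toy = pd.DataFrame({'Name': ['Alpha', 'Beta', 'Gamma'],
                    'Founded': [1950, 1900, 2000]})

by_year = toy.sort_values('Founded')

print(by_year['Name'].tolist())  # ['Beta', 'Alpha', 'Gamma']
print(toy['Name'].tolist())      # ['Alpha', 'Beta', 'Gamma'], original order kept
print(by_year.index.tolist())    # [1, 0, 2], row labels travel with their rows
```

To keep the sorted version, assign the result to a name, as `by_year` does here.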

In [None]:
schools

In [None]:
schools.sort_values('Founded')

In [None]:
schools.sort_values('Name', ascending=False)

In [None]:
# Why isn't this sorting correctly?
schools.sort_values('Enrollment')

### Setting the index

Think of each row's index as its **unique identifier** or **name**. Often, we like to set the index of a DataFrame to a unique identifier if we have one available. We can do so with the `set_index` method.
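
As a small sketch on made-up data (the table `toy` is hypothetical), note that `set_index` returns a new DataFrame in which the chosen column has moved out of the columns and into the index:

```python
import pandas as pd

# Toy table; the values are made up for illustration.
toy = pd.DataFrame({'Name': ['Alpha', 'Beta'],
                    'City': ['X', 'Y'],
                    'Founded': [1950, 1900]})

indexed = toy.set_index('Name')

print(toy.shape)                  # (2, 3), the original is unchanged
print(indexed.shape)              # (2, 2), 'Name' is no longer a column
print(indexed.index.tolist())     # ['Alpha', 'Beta']
print('Name' in indexed.columns)  # False
print(indexed.index.is_unique)    # True, a good sign for an identifier
```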

In [None]:
# By reassigning schools, our changes will persist.
schools = schools.set_index('Name')
schools

In [None]:
# Only 4 columns now!
schools.shape

## Selecting columns

### Selecting columns in `babypandas` 👶🐼

- In `babypandas`, you selected columns using the `.get` method.
- `.get` also works in `pandas`, but it is not **idiomatic** – people don't usually use it.
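
One practical difference, sketched on a made-up one-column table: `.get` silently returns `None` for a column that doesn't exist, while `[]` raises a `KeyError`.

```python
import pandas as pd

# Toy table; the values are made up for illustration.
toy = pd.DataFrame({'County': ['San Diego', 'Yolo']})

# .get returns None when the column is missing.
missing = toy.get('State')
print(missing is None)  # True

# [] raises a KeyError instead, which surfaces typos right away.
raised = False
try:
    toy['State']
except KeyError:
    raised = True
print(raised)  # True
```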

In [None]:
schools

In [None]:
schools.get('County')

In [None]:
# This returns None instead of raising an error, but sometimes we'd like an error.
schools.get('State')

### Selecting columns with `[]`

- The standard way to select a column in `pandas` is by using the `[]` operator.
    - Think of a DataFrame as a dictionary of arrays!
* Specifying a column name returns the column as a Series.
* Specifying a list of column names returns a DataFrame.
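
The difference shows up in the shape of the result. A minimal sketch on a made-up table:

```python
import pandas as pd

# Toy table; the values are made up for illustration.
toy = pd.DataFrame({'City': ['X', 'Y'], 'Founded': [1950, 1900]})

as_series = toy['Founded']   # a column name gives a Series
as_frame = toy[['Founded']]  # a list of names gives a DataFrame, even with one name

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```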

In [None]:
schools

In [None]:
# Returns a Series.
schools['City']

In [None]:
# Returns a DataFrame.
schools[['Founded', 'County']]

In [None]:
# 🤔
schools[['Founded']]

In [None]:
# Names are stored in the index, which is not a column, so this raises a KeyError!
schools['Name']

In [None]:
schools.index

### Useful Series methods

There are a variety of useful methods that work on Series. You can see the entire list [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.html). As we'll see next lecture, many of these methods work on DataFrames directly, too – how?
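
A few of these methods, sketched on a small made-up Series (the values below are for illustration and are not the lecture data):

```python
import pandas as pd

# Toy Series; the values are made up for illustration.
counties = pd.Series(['San Diego', 'Yolo', 'San Diego', 'Kern'])

print(counties.nunique())                        # 3
print(counties.value_counts().loc['San Diego'])  # 2
print(counties.mode().tolist())                  # ['San Diego']
```

Note that `mode` returns a Series rather than a single value, since several values can be tied for most common.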

In [None]:
schools

In [None]:
# What's the most common county?
schools['County'].mode()

In [None]:
# How many unique counties are represented?
schools['County'].nunique()

In [None]:
# What's the distribution of counties?
schools['County'].value_counts()

In [None]:
# What's the mean of the 'Founded' column?
schools['Founded'].mean()

In [None]:
# Tell me more about the 'Founded' column.
schools['Founded'].describe()

In [None]:
# Sort the 'Founded' column. Note that here we're using sort_values on a Series, not a DataFrame!
schools['Founded'].sort_values()

## Selecting rows (and columns)

### Using `loc` to select rows using row labels

If `df` is a DataFrame, then:
* `df.loc[idx]` returns the row whose index label is `idx`, as a Series.
* `df.loc[idx_list]` returns a DataFrame containing the rows whose index labels are in `idx_list`.
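
A minimal sketch of both forms, using a made-up table whose row labels are strings:

```python
import pandas as pd

# Toy table; the labels and values are made up for illustration.
toy = pd.DataFrame({'City': ['X', 'Y', 'Z'],
                    'Founded': [1950, 1900, 2000]},
                   index=['Alpha', 'Beta', 'Gamma'])

one_row = toy.loc['Beta']                # a single label gives a Series
two_rows = toy.loc[['Gamma', 'Alpha']]   # a list of labels gives a DataFrame

print(type(one_row).__name__)   # Series
print(one_row.index.tolist())   # ['City', 'Founded']
print(one_row['Founded'])       # 1900
print(two_rows.index.tolist())  # ['Gamma', 'Alpha'], in the order requested
```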

In [None]:
schools

In [None]:
schools.loc['University of California, San Diego']

In [None]:
schools.loc[['University of California, San Diego', 'San Diego State University']]

### Boolean indexing

* The `loc` operator also supports Boolean sequences (lists, arrays, Series) as input. 
* The length of the sequence must be the same as the number of rows in the DataFrame. 
* The result is a filtered DataFrame, containing only the rows in which the sequence contained `True`.
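
As a sketch, with a hand-written Boolean list on a made-up three-row table:

```python
import pandas as pd

# Toy table; the labels and values are made up for illustration.
toy = pd.DataFrame({'Founded': [1950, 1900, 2000]},
                   index=['Alpha', 'Beta', 'Gamma'])

keep = [True, False, True]  # one Boolean per row
filtered = toy.loc[keep]

print(filtered.index.tolist())  # ['Alpha', 'Gamma']
```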

In [None]:
schools

In [None]:
# One random Boolean per row; the length must match the number of rows.
random_bools = np.random.choice([True, False], schools.shape[0])
random_bools

In [None]:
schools.loc[random_bools]

### Querying

- Querying is the act of selecting rows in a DataFrame that satisfy certain condition(s).
- Comparisons with arrays (Series) result in Boolean arrays (Series).
- We can use comparisons along with the `loc` operator to **query** a DataFrame.
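
The sketch below walks through these steps on a made-up table and also combines two conditions with `&`. When conditions are written inline, each comparison needs its own parentheses, because `&` binds more tightly than `>` and `==` in Python.

```python
import pandas as pd

# Toy table; the labels and values are made up for illustration.
toy = pd.DataFrame({'City': ['X', 'X', 'Y'],
                    'Founded': [1950, 1900, 2000]},
                   index=['Alpha', 'Beta', 'Gamma'])

# A comparison with a Series gives a Boolean Series.
is_recent = toy['Founded'] > 1940
print(is_recent.tolist())  # [True, False, True]

# Combine conditions element-wise with &, then pass the result to loc.
in_x = toy['City'] == 'X'
recent_in_x = toy.loc[is_recent & in_x]
print(recent_in_x.index.tolist())  # ['Alpha']

# The same query written inline, with parentheses around each comparison.
same = toy.loc[(toy['Founded'] > 1940) & (toy['City'] == 'X')]
print(same.shape[0])  # 1
```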


In [None]:
schools

In [None]:
schools['Founded'] > 1998

In [None]:
schools.loc[schools['Founded'] > 1998]

In [None]:
schools.loc[schools.index.str.contains('University of California')]

In [None]:
# Using loc is not strictly necessary when indexing with Boolean sequences.
schools[schools.index.str.contains('University of California')]

Note that because we set the index to `'Name'` earlier, we can select rows based on school names without having to query.

In [None]:
schools

In [None]:
# Series!
schools.loc['University of California, San Diego']

If `'Name'` were instead a column, then we'd need to query to access information about a particular school.

In [None]:
schools_reset = schools.reset_index()
schools_reset

In [None]:
# DataFrame!
schools_reset[schools_reset['Name'] == 'University of California, San Diego']

### Discussion Question

Write an expression that evaluates to the number of UC schools founded after 1950.

In [None]:
schools.loc[(schools.index.str.contains('University of California')) & (schools['Founded']>1950)].shape[0]

### Selecting columns and rows simultaneously

So far, we've used `[]` to select columns and `loc` to select rows.

For instance, to find the cities for all schools in San Diego county:

In [None]:
schools

In [None]:
schools.loc[schools['County'] == 'San Diego']['City']

### Selecting columns and rows simultaneously

`loc` can also be used to select both rows and columns. The general pattern is:

```py
df.loc[<row selector>, <column selector>]
```

Examples:
- `df.loc[idx_list, col_list]` returns a DataFrame containing the rows in `idx_list` and columns in `col_list`.
- `df.loc[bool_arr, col_list]` returns a DataFrame containing the rows for which `bool_arr` is `True` and columns in `col_list`.
- Both the row and column selectors can be **slices**, which use `:` syntax (e.g. `'City': 'Enrollment'`).

There are many more ways to use `loc`; see the [`pandas` documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) for details.
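
A short sketch of these patterns on a made-up table follows (the values are hypothetical; only the column names mirror the lecture data). One detail worth knowing: label-based slices in `loc` include **both** endpoints.

```python
import pandas as pd

# Toy table; the labels and values are made up for illustration.
toy = pd.DataFrame({'City': ['X', 'X', 'Y'],
                    'County': ['P', 'P', 'Q'],
                    'Enrollment': [100, 200, 300],
                    'Founded': [1950, 1900, 2000]},
                   index=['Alpha', 'Beta', 'Gamma'])

# Boolean row selector, column slice (both endpoints included).
recent = toy.loc[toy['Founded'] > 1940, 'County':'Enrollment']
print(recent.index.tolist())    # ['Alpha', 'Gamma']
print(recent.columns.tolist())  # ['County', 'Enrollment']

# Row slice by label, list of columns.
first_two = toy.loc['Alpha':'Beta', ['City']]
print(first_two.shape)  # (2, 1)

# A single row label and a single column label give one value.
print(toy.loc['Gamma', 'Founded'])  # 2000
```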

In [None]:
schools

In [None]:
# Find the city and enrollment for all schools in San Diego county.
schools.loc[schools['County'] == 'San Diego', ['City', 'Enrollment']]

In [None]:
# Find the county, enrollment, and year founded for all schools founded after 1950.
schools.loc[schools['Founded'] > 1950, 'County':]

### Don't forget `iloc`!

- `iloc` stands for "integer location".
- `iloc` is like `loc`, but it selects rows and columns based on integer positions only.
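
A minimal sketch contrasting the two on a made-up table: `iloc` slices behave like ordinary Python slices and exclude the stop position, whereas `loc` label slices include both endpoints.

```python
import pandas as pd

# Toy table; the labels and values are made up for illustration.
toy = pd.DataFrame({'City': ['X', 'X', 'Y'],
                    'County': ['P', 'P', 'Q'],
                    'Enrollment': [100, 200, 300],
                    'Founded': [1950, 1900, 2000]},
                   index=['Alpha', 'Beta', 'Gamma'])

# Rows at positions 0 and 1; every column except the last.
by_position = toy.iloc[0:2, :-1]
print(by_position.index.tolist())    # ['Alpha', 'Beta']
print(by_position.columns.tolist())  # ['City', 'County', 'Enrollment']

# The equivalent row selection by label includes the 'Beta' endpoint.
by_label = toy.loc['Alpha':'Beta']
print(by_label.index.tolist())  # ['Alpha', 'Beta']

# Negative positions count from the end.
print(toy.iloc[-1].name)  # Gamma
```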

In [None]:
schools

In [None]:
schools.iloc[3:7, :-1]

`iloc` is often most useful when we sort first. For instance, to find the enrollment of the youngest school in the dataset:

In [None]:
schools.sort_values('Founded', ascending=False)['Enrollment'].iloc[0]

In [None]:
# Finding the name involves sorting, but not iloc.
schools.sort_values('Founded', ascending=False).index[0]

### More Practice

Consider the DataFrame below.

In [None]:
jack = pd.DataFrame({1: ['fee', 'fi'], 
                     '1': ['fo', 'fum']})
jack

For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We may not be able to cover these all in class; if so, make sure to try them on your own.

In [None]:
# jack[1]

In [None]:
# jack[[1]]

In [None]:
# jack['1']

In [None]:
# jack[[1, 1]]

In [None]:
# jack.loc[1]

In [None]:
# jack.loc[jack[1] == 'fo']

In [None]:
# jack[1, ['1', 1]]

In [None]:
# jack.loc[1,1]

## Summary, next time

### Summary

- `pandas` is **the** library for tabular data manipulation in Python.
- There are three key data structures in `pandas`: DataFrame, Series, and Index.
- Refer to the lecture notebook and the [`pandas` documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) for tips.

### Next time

- How `pandas` and `numpy` work together (and when they disagree).
- Performance implications.