## Dataframes: a Collection of Series

Here's the dataframe we've been working with. Note that it is of type `DataFrame`.

In [2]:
import pandas as pd
from IPython.display import display

df = pd.read_csv("state_data.csv")
display(df)
display(type(df))

Unnamed: 0,State,Year,Total Population,Median Household Income
0,Alabama,2005,4442558,36879
1,Alabama,2006,4599030,38783
2,Alabama,2007,4627851,40554
3,Alabama,2008,4661900,42666
4,Alabama,2009,4708708,40489
...,...,...,...,...
931,Wyoming,2018,577737,61584
932,Wyoming,2019,578759,65003
933,Wyoming,2021,578803,65204
934,Wyoming,2022,581381,70042


pandas.core.frame.DataFrame

It is common to think of a DataFrame as "a rectangle of data." A more accurate description is "a collection of columns, each of the same length." Note that each column has a name.

Working with a single column of data is very common. You do it like this: `df[<column name>]`.

Each column has a single data type (for example text, number, or boolean), which Pandas reports as the column's `dtype`. The formal type of a column is `Series`.


In [3]:
display(df["State"])
display(type(df["State"]))

0      Alabama
1      Alabama
2      Alabama
3      Alabama
4      Alabama
        ...   
931    Wyoming
932    Wyoming
933    Wyoming
934    Wyoming
935    Wyoming
Name: State, Length: 936, dtype: object

pandas.core.series.Series
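To see the "collection of columns" idea in isolation, here is a minimal sketch using a tiny hand-made dataframe. The name `toy` and all of its values are made up for illustration and do not come from `state_data.csv`.

```python
import pandas as pd

# Made-up toy data (hypothetical, not taken from state_data.csv)
toy = pd.DataFrame({
    "State": ["Northland", "Northland", "Southland"],
    "Year": [2005, 2006, 2005],
    "Total Population": [500_000, 510_000, 2_000_000],
})

# Every column is a Series, and all columns have the same length
for name in toy.columns:
    column = toy[name]
    print(name, type(column).__name__, len(column), column.dtype)
```

Each line printed shows a column name, the class `Series`, the shared length, and that column's `dtype`.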

The `Series` class has a lot of methods. For example, the `unique()` method will return the unique values in the series:

In [4]:
df["State"].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Puerto Rico', 'Rhode Island', 'South Carolina', 'South Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)
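Notice that the result above is an `array`, not a `Series`. A small sketch on a made-up numeric Series (the values are arbitrary) shows the same behavior, and also that the values come back in the order they first appear rather than sorted:

```python
import numpy as np
import pandas as pd

# Arbitrary made-up numbers with some repeats
numbers = pd.Series([3, 1, 3, 2, 1])

values = numbers.unique()

print(values)
print(isinstance(values, np.ndarray))  # a NumPy array, not a Series
print(len(values))                     # number of distinct values
```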

# Exercise 2.1: Series Methods

1. Calculate the unique values in the `Year` column. Does anything surprise you?
2. Use the `min()` and `max()` methods to calculate the min and max population in the dataset. Does anything surprise you?
3. Ask an LLM for a suggestion of another method to experiment with, and then try it. Try a prompt like "I just started working with Pandas Series. I've used the unique, min and max methods. Please suggest another method for me to try."

In [5]:
# Put your solution here

# Filtering a Dataframe (Boolean Indexing)

The most common operation we perform on a dataframe is: "Show me the rows where some condition is True". 

An example is "Show me the rows where the values of the 'Total Population' column are less than 1 million". 

The technique for doing this in Pandas is called "Boolean Indexing". It is a 2-step process: 
1. First we create a Series that contains Boolean (True/False) values. These values correspond to the rows of the dataframe. 
2. Then we feed those values to the original dataframe using bracket `[]` notation. 

If the first element in the Series is True, then the first row is returned. And so on. 
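As a rough sketch of the two steps, here they are on a tiny made-up dataframe. The names `toy` and `keep` and all the values are hypothetical, and the Boolean Series is typed out by hand so that each step is visible on its own.

```python
import pandas as pd

# Made-up toy data (hypothetical, not taken from state_data.csv)
toy = pd.DataFrame({
    "State": ["Northland", "Southland", "Eastland"],
    "Total Population": [500_000, 2_000_000, 750_000],
})

# Step 1: a Boolean Series with one True/False value per row
keep = pd.Series([True, False, True])

# Step 2: feed it to the dataframe using bracket notation
result = toy[keep]
print(result)
```

Only the first and third rows are returned, because those are the positions where `keep` is `True`.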

In [6]:
df

Unnamed: 0,State,Year,Total Population,Median Household Income
0,Alabama,2005,4442558,36879
1,Alabama,2006,4599030,38783
2,Alabama,2007,4627851,40554
3,Alabama,2008,4661900,42666
4,Alabama,2009,4708708,40489
...,...,...,...,...
931,Wyoming,2018,577737,61584
932,Wyoming,2019,578759,65003
933,Wyoming,2021,578803,65204
934,Wyoming,2022,581381,70042


# Creating the Boolean Series (Vectorization)

Python makes it easy to test whether an individual value is less than 1 million:

```py
0 < 1000000
```

In Pandas, we use the same syntax to test whether each value in a Series is less than 1 million:

In [16]:
df["Total Population"] < 1000000

0      False
1      False
2      False
3      False
4      False
       ...  
931     True
932     True
933     True
934     True
935     True
Name: Total Population, Length: 936, dtype: bool

In [14]:
# Display the column / series
display(df["Total Population"])

# The test is "vectorized" over the Series
df["Total Population"] < 1000000

0      4442558
1      4599030
2      4627851
3      4661900
4      4708708
        ...   
931     577737
932     578759
933     578803
934     581381
935     584057
Name: Total Population, Length: 936, dtype: int64

0      False
1      False
2      False
3      False
4      False
       ...  
931     True
932     True
933     True
934     True
935     True
Name: Total Population, Length: 936, dtype: bool

We say that the logical test (`< 1000000`) is *vectorized* over the Series. Note that the comparison returns a boolean Series the same length as the original DataFrame.
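One way to picture vectorization is to compare it with testing each value in a loop. The sketch below uses a made-up Series (the values are arbitrary) and checks that both approaches agree:

```python
import pandas as pd

# Made-up population values (hypothetical)
population = pd.Series([500_000, 2_000_000, 750_000], name="Total Population")

# Testing one value at a time with an explicit loop
looped = [value < 1_000_000 for value in population]

# The vectorized test: the same comparison applied to the whole Series
vectorized = population < 1_000_000

print(looped == vectorized.tolist())       # both approaches agree
print(len(vectorized) == len(population))  # one result per original value
```

The vectorized form is shorter, and it returns a Series that can be used directly for Boolean Indexing.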

# Boolean Indexing / Filtering

If a Boolean Series is put inside brackets after a dataframe, Pandas will return only the rows where the Series has the value `True`. This is called *Boolean Indexing*. 

For example, this code will return the rows which have a population less than 1 million:

In [11]:
df[df["Total Population"] < 1000000]

Unnamed: 0,State,Year,Total Population,Median Household Income
18,Alaska,2005,641724,56234
19,Alaska,2006,670053,59393
20,Alaska,2007,683478,64333
21,Alaska,2008,686293,68460
22,Alaska,2009,698473,66953
...,...,...,...,...
931,Wyoming,2018,577737,61584
932,Wyoming,2019,578759,65003
933,Wyoming,2021,578803,65204
934,Wyoming,2022,581381,70042


# Syntax

Boolean Indexing in Pandas can sometimes be confusing to read:

```py
df[df['Total Population'] < 1000000]
```

It's confusing because the bracket operator `[]` is used twice on one line, and it means something different each time: 
  * The inner bracket is being used to extract a column as a Series
  * The outer bracket is being used as Boolean Indexing for the dataframe.

To make things easier to read, the above code is sometimes broken into two lines:

In [12]:
mask = df["Total Population"] < 1000000
df[mask]

Unnamed: 0,State,Year,Total Population,Median Household Income
18,Alaska,2005,641724,56234
19,Alaska,2006,670053,59393
20,Alaska,2007,683478,64333
21,Alaska,2008,686293,68460
22,Alaska,2009,698473,66953
...,...,...,...,...
931,Wyoming,2018,577737,61584
932,Wyoming,2019,578759,65003
933,Wyoming,2021,578803,65204
934,Wyoming,2022,581381,70042


When feeding a variable to a dataframe for Boolean Indexing, it is common to name the variable `mask`.
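If you want to convince yourself that the one-line form and the `mask` form select the same rows, a quick check along these lines may help. The dataframe `toy` and its values are made up for illustration.

```python
import pandas as pd

# Made-up toy data (hypothetical, not taken from state_data.csv)
toy = pd.DataFrame({
    "State": ["Northland", "Southland", "Eastland"],
    "Total Population": [500_000, 2_000_000, 750_000],
})

# One line: inner brackets extract a column, outer brackets filter rows
one_line = toy[toy["Total Population"] < 1_000_000]

# Two lines: name the Boolean Series first, then use it to filter
mask = toy["Total Population"] < 1_000_000
two_lines = toy[mask]

# DataFrame.equals reports whether the two results match
print(one_line.equals(two_lines))
```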

# Exercise 2.2: Boolean Indexing

1. Create a variable called `mask` that marks which rows in `df` are for California. Use it to subset the dataframe.
2. Update your app to filter the dataframe to show rows in the state the user selected.