## Dataframes: a Collection of Series

Here's the data frame we've been working with. Note that it is of type `DataFrame`.

In [1]:
import pandas as pd
from IPython.display import display

df = pd.read_csv('state_data.csv')
display(df)
display(type(df))

Unnamed: 0,State,Year,Total Population,Median Household Income
0,Alabama,2005,4442558,36879
1,Alabama,2006,4599030,38783
2,Alabama,2007,4627851,40554
3,Alabama,2008,4661900,42666
4,Alabama,2009,4708708,40489
...,...,...,...,...
931,Wyoming,2018,577737,61584
932,Wyoming,2019,578759,65003
933,Wyoming,2021,578803,65204
934,Wyoming,2022,581381,70042


pandas.core.frame.DataFrame

It is common to think of a DataFrame as "a rectangle of data." A more accurate description is "A collection of columns, each of the same length." 
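As a minimal sketch of the "collection of columns" idea, the cell below builds a tiny DataFrame from three equal-length columns and then shows that pandas refuses columns of different lengths. The name `small_df` is made up for illustration; the values are a few rows copied from the table above.

```python
import pandas as pd

# Three columns, each with exactly three entries.
small_df = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Wyoming"],
    "Year": [2005, 2006, 2018],
    "Total Population": [4442558, 4599030, 577737],
})
print(small_df.shape)  # (number of rows, number of columns)

# Columns of different lengths cannot form a DataFrame.
mismatch_rejected = False
try:
    pd.DataFrame({"State": ["Alabama", "Wyoming"], "Year": [2005]})
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```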

Working with a single column of data is very common. You do it like this: `df[<column name>]`.

The data in each column is normally of a single type (text, number, boolean), which pandas reports as the column's `dtype`. The formal type of a column is `Series`.


In [2]:
display(df['State'])
display(type(df['State']))

0      Alabama
1      Alabama
2      Alabama
3      Alabama
4      Alabama
        ...   
931    Wyoming
932    Wyoming
933    Wyoming
934    Wyoming
935    Wyoming
Name: State, Length: 936, dtype: object

pandas.core.series.Series
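To see the same thing without the CSV file, here is a small sketch on a hypothetical `small_df` built from a few rows shown above. It pulls out one column, confirms it is a `Series`, and prints the type pandas assigned to each column. The exact label printed for text columns can vary by pandas version (the output above shows `object`).

```python
import pandas as pd

small_df = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Wyoming"],
    "Year": [2005, 2005, 2018],
    "Total Population": [4442558, 641724, 577737],
})

# Selecting one column by name gives back a Series.
year = small_df["Year"]
print(type(year).__name__)

# Each column carries a single dtype; .dtypes lists them all.
print(year.dtype)
print(small_df.dtypes)
```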

The `Series` class has a lot of methods. For example, the `unique()` method will return the unique values in the series:

In [3]:
df['State'].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Puerto Rico', 'Rhode Island', 'South Carolina', 'South Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)
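If you want to check how `unique()` treats repeated values, a toy Series is easier to read than the full dataset. This is only a sketch; the Series `states` is made up for illustration. Each value appears once in the result, in the order it first appears.

```python
import pandas as pd

states = pd.Series(["Alabama", "Alabama", "Alaska", "Wyoming", "Alaska"], name="State")

# unique() drops repeats and keeps first-appearance order.
unique_states = list(states.unique())
print(unique_states)       # ['Alabama', 'Alaska', 'Wyoming']
print(len(unique_states))  # 3
```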

# Exercise 2.1: Series Methods

1. Calculate the unique values in the `Year` column. Does anything surprise you?
2. Use the `min()` and `max()` methods to calculate the min and max population in the dataset. Does anything surprise you?
3. Ask an LLM for a suggestion of another method to experiment with, and then try it. Try a prompt like "I just started working with Pandas Series. I've used the unique, min and max methods. Please suggest another method for me to try."

In [8]:
# Put your solution here

In [6]:
df['Total Population'].max()

np.int64(39557045)

# Filtering a Dataframe

The most common operation we perform on a dataframe is: "Show me the rows where <condition> is true". 

An example is "Show me the rows where the value of the `State` column is `Alabama`". 

The syntax for doing this with Pandas is tricky. This workbook walks you through it, in the hope that it makes the syntax easier to remember.

# Testing for Equality

Our goal is to get Pandas to show us the rows in `df` for just Alabama.

Our first step is to apply `==` to the Series that contains the state:

In [10]:
df['State'] == 'Alabama'

0       True
1       True
2       True
3       True
4       True
       ...  
931    False
932    False
933    False
934    False
935    False
Name: State, Length: 936, dtype: bool

Note that the `dtype` above is `bool`. This means that the values are True or False.
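The comparison is applied to every element of the Series, one at a time. A short sketch on a made-up Series (`states` is a hypothetical name) makes that easier to see than the 936-row output above:

```python
import pandas as pd

states = pd.Series(["Alabama", "Alaska", "Alabama", "Wyoming"], name="State")

# == compares each element with 'Alabama' and returns one True/False per element.
is_alabama = states == "Alabama"
print(is_alabama.tolist())  # [True, False, True, False]

# True counts as 1 when summed, so this is the number of matching elements.
match_count = int(is_alabama.sum())
print(match_count)  # 2
```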

# Boolean Indexing / Filtering

If you put a Boolean Series inside the `[]`, Pandas will return just the rows where that Series is `True`.

In [19]:
condition = df['State'] == 'Alabama'
df[condition]

Unnamed: 0,State,Year,Total Population,Median Household Income
0,Alabama,2005,4442558,36879
1,Alabama,2006,4599030,38783
2,Alabama,2007,4627851,40554
3,Alabama,2008,4661900,42666
4,Alabama,2009,4708708,40489
5,Alabama,2010,4785298,40474
6,Alabama,2011,4802740,41415
7,Alabama,2012,4822023,41574
8,Alabama,2013,4833722,42849
9,Alabama,2014,4849377,42830
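The two steps (build a Boolean Series, then put it inside `[]`) can be tried on a small frame that does not need the CSV file. This is a sketch under that assumption; `small_df` is a made-up name and its rows are copied from outputs shown in this workbook. Note that the rows that survive keep their original index labels.

```python
import pandas as pd

small_df = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Alabama", "Wyoming"],
    "Year": [2005, 2005, 2006, 2018],
    "Total Population": [4442558, 641724, 4599030, 577737],
})

# Step 1: one True/False value per row.
condition = small_df["State"] == "Alabama"

# Step 2: keep only the rows where the condition is True.
alabama_rows = small_df[condition]
print(alabama_rows)

# The kept rows still carry their original index labels.
print(alabama_rows.index.tolist())  # [0, 2]
```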


# Exercises: Filtering a Dataframe

1. Show just the rows where the state is California
2. Show just the rows where the population is less than 1 million

In [20]:
df[df['Total Population'] < 1000000]

Unnamed: 0,State,Year,Total Population,Median Household Income
18,Alaska,2005,641724,56234
19,Alaska,2006,670053,59393
20,Alaska,2007,683478,64333
21,Alaska,2008,686293,68460
22,Alaska,2009,698473,66953
...,...,...,...,...
931,Wyoming,2018,577737,61584
932,Wyoming,2019,578759,65003
933,Wyoming,2021,578803,65204
934,Wyoming,2022,581381,70042


# Exercises: Update your app

Update your app to show just the rows where the state is California.


# Exercises for Series

1. `.unique()` for states
2. `.min()` and `.max()` for years, income, and population