# Week 3 Lab — Importing, Exploring, and Subsetting DataFrames

<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/labs/02_wk2_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Estimated time:** 60–75 minutes  
**Mode:** Group-first, then Q&A, then group challenges

In this lab, you will:
- Import tabular data into a Pandas DataFrame
- Explore the structure and attributes of your dataset
- Apply subsetting techniques to filter rows and columns
- Build flexible filtering logic to answer different questions

This session builds on Tuesday's class and the assigned reading lessons, which you should have completed before today's lab. You'll work in a small group (2–4 students) and complete the lab together.

We will use the **[New York Times College COVID-19](https://github.com/nytimes/covid-19-data/tree/master/colleges)** dataset for our analysis. Because the lab relies on the Pandas library, make sure it is installed in your Python environment. If you're using Google Colab, Pandas is typically pre-installed.


In [1]:
import pandas as pd

## ⏱️ Plan
- **0–30 min (Group Work):** Import + explore + basic subsetting  
- **30–40 min (Class Q&A):** Surface blockers + clarify concepts  
- **40–70 min (Group Challenges):** Three analysis challenges (10 min each)  
- **70–75 min (Wrap):** Shareouts + next steps


## Part A — Group Work (first 30 minutes)

### A1. Import the data
Dataset URL:  
```
https://raw.githubusercontent.com/nytimes/covid-19-data/refs/heads/master/colleges/colleges.csv
```
**Task:** Load into a DataFrame named `college_df` and preview it.


In [None]:
import pandas as pd

data_url = "https://raw.githubusercontent.com/nytimes/covid-19-data/refs/heads/master/colleges/colleges.csv"
college_df = pd.read_csv(data_url)
college_df.head()

Unnamed: 0,date,state,county,city,ipeds_id,college,cases,cases_2021,notes
0,2021-05-26,Alabama,Madison,Huntsville,100654,Alabama A&M University,41,,
1,2021-05-26,Alabama,Montgomery,Montgomery,100724,Alabama State University,2,,
2,2021-05-26,Alabama,Limestone,Athens,100812,Athens State University,45,10.0,
3,2021-05-26,Alabama,Lee,Auburn,100858,Auburn University,2742,567.0,
4,2021-05-26,Alabama,Montgomery,Montgomery,100830,Auburn University at Montgomery,220,80.0,


### A2. Get your bearings
**Tasks:**
1. List the **columns**
2. Show the **first 20 rows**
3. Show the **last 20 rows**
4. Report the **shape** (rows × columns)
5. Show the **dtypes** and the **missing values** per column


In [4]:
# 1) Columns
college_df.columns

Index(['date', 'state', 'county', 'city', 'ipeds_id', 'college', 'cases',
       'cases_2021', 'notes'],
      dtype='object')

In [5]:
# 2) First 20 rows
college_df.head(20)

Unnamed: 0,date,state,county,city,ipeds_id,college,cases,cases_2021,notes
0,2021-05-26,Alabama,Madison,Huntsville,100654,Alabama A&M University,41,,
1,2021-05-26,Alabama,Montgomery,Montgomery,100724,Alabama State University,2,,
2,2021-05-26,Alabama,Limestone,Athens,100812,Athens State University,45,10.0,
3,2021-05-26,Alabama,Lee,Auburn,100858,Auburn University,2742,567.0,
4,2021-05-26,Alabama,Montgomery,Montgomery,100830,Auburn University at Montgomery,220,80.0,
5,2021-05-26,Alabama,Walker,Jasper,102429,Bevill State Community College,4,,
6,2021-05-26,Alabama,Jefferson,Birmingham,100937,Birmingham-Southern College,263,49.0,
7,2021-05-26,Alabama,Limestone,Tanner,101514,Calhoun Community College,137,53.0,
8,2021-05-26,Alabama,Tallapoosa,Alexander City,100760,Central Alabama Community College,49,10.0,
9,2021-05-26,Alabama,Coffee,Enterprise,101143,Enterprise State Community College,76,35.0,


In [None]:
# 3) Last 20 rows
college_df.tail(20)

Unnamed: 0,date,state,county,city,ipeds_id,college,cases,cases_2021,notes
1928,2021-05-26,Wisconsin,Brown,De Pere,239716,Saint Norbert College,295,59.0,
1929,2021-05-26,Wisconsin,Eau Claire,Eau Claire,240268,University of Wisconsin-Eau Claire,994,78.0,
1930,2021-05-26,Wisconsin,Brown,Green Bay,240277,University of Wisconsin-Green Bay,536,172.0,
1931,2021-05-26,Wisconsin,La Crosse,La Crosse,240329,University of Wisconsin-La Crosse,562,48.0,
1932,2021-05-26,Wisconsin,Dane,Madison,240444,University of Wisconsin-Madison,7708,2298.0,
1933,2021-05-26,Wisconsin,Milwaukee,Milwaukee,240453,University of Wisconsin-Milwaukee,1632,231.0,
1934,2021-05-26,Wisconsin,Winnebago,Oshkosh,240365,University of Wisconsin-Oshkosh,1457,181.0,
1935,2021-05-26,Wisconsin,Kenosha,Somers,240374,University of Wisconsin-Parkside,342,75.0,
1936,2021-05-26,Wisconsin,Grant,Platteville,240462,University of Wisconsin-Platteville,691,144.0,
1937,2021-05-26,Wisconsin,Pierce,River Falls,240471,University of Wisconsin-River Falls,520,85.0,


In [7]:
# 4) Dimensions (rows, columns)
college_df.shape

(1948, 9)

In [None]:
# 5) Data types & missing values (Null vs. Non-Null) for each column
college_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1948 entries, 0 to 1947
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1948 non-null   object 
 1   state       1948 non-null   object 
 2   county      1946 non-null   object 
 3   city        1948 non-null   object 
 4   ipeds_id    1948 non-null   object 
 5   college     1948 non-null   object 
 6   cases       1948 non-null   int64  
 7   cases_2021  1611 non-null   float64
 8   notes       36 non-null     object 
dtypes: float64(1), int64(1), object(7)
memory usage: 137.1+ KB
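
`.info()` reports non-null counts, so the number of missing values has to be worked out by subtraction (for example, 1948 − 1611 = 337 missing `cases_2021` values in the output above). As a minimal sketch, `isna().sum()` gives the missing count per column directly. The small `toy_df` below is hypothetical stand-in data so the cell runs without a download; the same call should work on `college_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for college_df (not the real dataset)
toy_df = pd.DataFrame({
    "college": ["A", "B", "C", "D"],
    "cases": [41, 2, 45, 2742],
    "cases_2021": [np.nan, np.nan, 10.0, 567.0],
    "notes": [None, None, None, "example note"],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```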


### A3. Focus the frame

The dataset has nine columns, but for this exercise we'll focus on only four:

* `state`
* `city`
* `college`
* `cases`

Let's select just these columns and preview the first 10 rows.


In [10]:
college_cases_df = college_df[["state", "city", "college", "cases"]]
college_cases_df.head(10)

Unnamed: 0,state,city,college,cases
0,Alabama,Huntsville,Alabama A&M University,41
1,Alabama,Montgomery,Alabama State University,2
2,Alabama,Athens,Athens State University,45
3,Alabama,Auburn,Auburn University,2742
4,Alabama,Montgomery,Auburn University at Montgomery,220
5,Alabama,Jasper,Bevill State Community College,4
6,Alabama,Birmingham,Birmingham-Southern College,263
7,Alabama,Tanner,Calhoun Community College,137
8,Alabama,Alexander City,Central Alabama Community College,49
9,Alabama,Enterprise,Enterprise State Community College,76
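
The double brackets in the selection above matter: the inner pair is a Python list of column names. The sketch below, using a hypothetical two-row `toy_df`, illustrates how single brackets return a Series while a list of names returns a DataFrame.

```python
import pandas as pd

# Hypothetical two-row stand-in for college_df
toy_df = pd.DataFrame({
    "state": ["Alabama", "Ohio"],
    "college": ["Auburn University", "University of Cincinnati"],
    "cases": [2742, 3288],
})

cases_series = toy_df["cases"]         # single brackets: one column as a Series
cases_frame = toy_df[["cases"]]        # list with one name: a one-column DataFrame
two_cols = toy_df[["state", "cases"]]  # list with several names: a DataFrame

print(type(cases_series).__name__)  # Series
print(type(cases_frame).__name__)   # DataFrame
print(two_cols.shape)               # (2, 2)
```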


### A4. Quick subsetting reps


- Filter to **Ohio** rows with **cases > 100**
- Count how many such rows

To keep only the rows where `state` is "Ohio", first build a Boolean mask (one `True`/`False` value per row), then pass that mask to `.loc`:


In [12]:
ohio_colleges = college_cases_df["state"] == "Ohio"
college_cases_df.loc[ohio_colleges]

Unnamed: 0,state,city,college,cases
1320,Ohio,Dayton,Air Force Institute of Technology-Graduate Sch...,0
1321,Ohio,Ashland,Ashland University,354
1322,Ohio,Berea,Baldwin Wallace University,362
1323,Ohio,Bluffton,Bluffton University,4
1324,Ohio,Bowling Green,Bowling Green State University,1477
...,...,...,...,...
1386,Ohio,Springfield,Wittenberg University,287
1387,Ohio,Dayton,Wright State University,246
1388,Ohio,Celina,Wright State University-Lake Campus,16
1389,Ohio,Cincinnati,Xavier University,46


Let's see how many cases we had across all Ohio colleges:

In [18]:
ohio_college_cases = college_cases_df.loc[ohio_colleges]
print(f"Total Ohio cases: {ohio_college_cases['cases'].sum()}")

Total Ohio cases: 31877


To keep only the rows where `state` is "Ohio" **and** `cases` is greater than 100, build a second Boolean mask and combine the two with the `&` operator:

In [19]:
over_100 = college_cases_df["cases"] > 100
ohio_over_100 = college_cases_df.loc[ohio_colleges & over_100]
ohio_over_100

Unnamed: 0,state,city,college,cases
1321,Ohio,Ashland,Ashland University,354
1322,Ohio,Berea,Baldwin Wallace University,362
1324,Ohio,Bowling Green,Bowling Green State University,1477
1327,Ohio,Cleveland,Case Western Reserve University,484
1331,Ohio,Cleveland,Cleveland State University,240
1333,Ohio,Cleveland,Cuyahoga Community College District,144
1335,Ohio,Granville,Denison University,121
1336,Ohio,Steubenville,Franciscan University of Steubenville,411
1337,Ohio,Tiffin,Heidelberg University,183
1339,Ohio,University Heights,John Carroll University,386


In [20]:
# How many rows meet the criteria?
ohio_over_100.shape[0]

38
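
The masks in this section were stored in variables before being combined. If you write the conditions inline instead, each comparison needs its own parentheses, because `&` binds more tightly than `==` and `>` in Python. The sketch below uses a hypothetical `toy_df` and also shows a second way to count matching rows: summing the mask, since each `True` counts as 1.

```python
import pandas as pd

# Hypothetical stand-in data
toy_df = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Kentucky", "Ohio"],
    "cases": [354, 4, 500, 1477],
})

# Inline conditions: wrap each comparison in parentheses before combining with &
mask = (toy_df["state"] == "Ohio") & (toy_df["cases"] > 100)

n_rows_shape = toy_df.loc[mask].shape[0]  # count rows of the filtered frame
n_rows_sum = int(mask.sum())              # count True values in the mask

print(n_rows_shape, n_rows_sum)
```

Both counts agree; summing the mask skips building the filtered DataFrame when only the count is needed.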

### A5. A specific school

Now let's see how we can filter the DataFrame to focus on a specific college, such as the University of Cincinnati (UC).

- Filter rows where `college == "University of Cincinnati"`
- How many cases were reported at UC?


In [21]:
uc_df = college_cases_df[college_cases_df["college"] == "University of Cincinnati"]
uc_df

Unnamed: 0,state,city,college,cases
1379,Ohio,Cincinnati,University of Cincinnati,3288


If you finish before the 30 minutes are up, adjust the code above to answer the following:

* How many cases were reported at a different college, such as "Ohio State University" or "Miami University"?
* How many cases were reported at the University of Dayton?
* How many cases were reported in Ohio compared to Kentucky?

## Class Q&A (10 minutes)

Prompts:
- Any concerns with reading the CSV?
- `.head()`, `.info()`, `.describe()` — when do you use each?
- Series vs DataFrame — what clicked? what’s fuzzy?
- `.loc` vs `[]` — when prefer which?
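
As a starting point for the `.loc` vs `[]` discussion, here is a minimal sketch on a hypothetical `toy_df`. With a Boolean mask, both forms return the same rows; `.loc` can additionally select columns in the same step.

```python
import pandas as pd

# Hypothetical stand-in data
toy_df = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Kentucky"],
    "college": ["X", "Y", "Z"],
    "cases": [354, 4, 500],
})
is_ohio = toy_df["state"] == "Ohio"

rows_brackets = toy_df[is_ohio]      # [] with a Boolean mask filters rows
rows_loc = toy_df.loc[is_ohio]       # .loc with the same mask gives the same rows

# .loc also accepts a column selection after the comma
rows_and_cols = toy_df.loc[is_ohio, ["college", "cases"]]

print(rows_brackets.equals(rows_loc))
print(rows_and_cols)
```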


## Part B — Group Challenges (10 min each)

For the next several challenges...

* You will not be given starter code to work with; rather, you need to start from a blank cell.
* **DO NOT USE AI** to generate code for you.  This is a group exercise, and you should be writing the code together.
* Work with your group to write the code.
* Feel free to ask questions or seek help from the instructor.
* We'll stop and walk through each challenge together after the 10-minute time block is up.



### Challenge 1 — Tri-State Totals

**Question:** What was the **total number of COVID cases** across all **Ohio, Indiana, and Kentucky** colleges?


In [None]:
# Your turn: write code here to compute the total cases for OH, IN, KY



### Challenge 2 — Community Colleges vs Others
**Question:** What is the **average number of cases** across all **community colleges**?  
Use name-based detection (e.g., `college_df["college"].str.contains("Community", case=False, na=False)`).  
**Then compare**: Is this average **higher or lower** than the average for **non-community** schools?
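
To see what the `case` and `na` arguments in the hint do, here is a small sketch on a hypothetical Series of names (not the lab data). It only illustrates the string-matching call; applying it to the dataset is left to your group.

```python
import pandas as pd

# Hypothetical names, including a lowercase match and a missing value
names = pd.Series([
    "Calhoun Community College",
    "community tech",
    "Auburn University",
    None,
])

# case=False ignores capitalization; na=False treats missing names as non-matches
is_cc = names.str.contains("Community", case=False, na=False)
print(is_cc.tolist())  # [True, True, False, False]
```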


In [None]:
# Your turn: write code here to compute the average cases for community colleges

# Then compute the average for non-community colleges and compare

# Then use a comparison operator to see if the average cases for community colleges is greater than non-community colleges


### Challenge 3 — Community Colleges in OH/IN/KY
**Question:** Among **community colleges** in **OH, IN, KY**, what is the **average number of cases**?  
How does this compare to **all community colleges nationwide**?


In [None]:
# Your turn: write code here to compare the average cases for community colleges in OH/IN/KY
# vs the national community college average



### Challenge 4 — Top 5 Colleges

**Question:** What are the **top 5 colleges** with the **highest number of cases**?  Hint: there are two methods you can use to get this answer.

Once you have that, see if you can find the **5 colleges** with the **lowest number of cases**. Is this a fair assessment of the situation? Why or why not? How many colleges had zero cases?

In [28]:
# Your turn: write code here to compute the top 5 colleges with the highest number of cases
# Then compute the 5 colleges with the lowest number of cases


## (Optional) Brainstorm — What else is interesting?
Write **3 questions** you’d like to explore with this dataset in future weeks.

Examples to spark ideas (don’t answer now):
- Top 5 states by average cases per college?
- Trends by institution type (university vs community college)?
- Largest city-level clusters?


---
**Save your work** and be ready to share your group’s approach and findings.
