# Week 3 Lab — Importing, Exploring, and Subsetting DataFrames

<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/labs/03_wk3_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Estimated time:** 60–75 minutes  
**Mode:** Group-first, then Q&A, then group challenges

In this lab, you will:
- Import tabular data into a Pandas DataFrame
- Explore the structure and attributes of your dataset
- Apply subsetting techniques to filter rows and columns
- Build flexible filtering logic to answer different questions

This session builds on Tuesday’s class and the reading lessons assigned before today's lab. You’ll work in a small group (2–4 students) and complete the lab together.

We will use the **[New York Times College COVID-19](https://github.com/nytimes/covid-19-data/tree/master/colleges)** dataset for our analysis. Because the lab relies on the Pandas library, make sure it is installed in your Python environment. If you're using Google Colab, Pandas is typically pre-installed.


In [None]:
import pandas as pd

## ⏱️ Plan
- **0–30 min (Group Work):** Import + explore + basic subsetting  
- **30–40 min (Class Q&A):** Surface blockers + clarify concepts  
- **40–70 min (Group Challenges):** Analysis challenges (10 min each; work through as many as time allows)  
- **70–75 min (Wrap):** Shareouts + next steps


## Part A — Group Work (first 30 minutes)

### A1. Import the data
Dataset URL:  
```
https://raw.githubusercontent.com/nytimes/covid-19-data/refs/heads/master/colleges/colleges.csv
```
**Task:** Load into a DataFrame named `college_df` and preview it.


In [None]:
import pandas as pd

data_url = "https://raw.githubusercontent.com/nytimes/covid-19-data/refs/heads/master/colleges/colleges.csv"
college_df = pd.read_csv(data_url)
college_df.head()
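If the URL is slow or unreachable, the same `pd.read_csv` call can be practiced on an in-memory CSV. This is a minimal sketch: the three rows below are made up for illustration and are not taken from the NYT file.

```python
import io
import pandas as pd

# Hypothetical sample rows, invented for illustration (not real NYT data)
sample_csv = io.StringIO(
    "state,city,college,cases\n"
    "Ohio,Cincinnati,Example University,120\n"
    "Ohio,Dayton,Sample College,45\n"
    "Kentucky,Lexington,Demo Community College,\n"
)

# read_csv accepts a URL, a file path, or a file-like object such as StringIO
sample_df = pd.read_csv(sample_csv)
print(sample_df.head())
```

The empty `cases` field in the last row is read as a missing value (`NaN`), which is the kind of thing the next section asks you to look for.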

### A2. Get your bearings
**Tasks:**
1. List the **columns**
2. Show the **first 20 rows**
3. Show the **last 20 rows**
4. Report the **shape** (rows × columns)
5. Show the **dtypes** and **non-null counts** per column (to spot missing values)


In [None]:
# 1) Columns
college_df.columns

In [None]:
# 2) First 20 rows
college_df.head(20)

In [None]:
# 3) Last 20 rows
college_df.tail(20)

In [None]:
# 4) Dimensions (rows, columns)
college_df.shape

In [None]:
# 5) Data types & missing values (Null vs. Non-Null) for each column
college_df.info()
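`.info()` reports non-null counts, so the number of missing values has to be worked out by subtracting from the row count. One way to count them directly is `isna().sum()`. The sketch below uses a small hypothetical DataFrame (`toy_df` and its values are invented for illustration); the same pattern should apply to `college_df`.

```python
import pandas as pd

# Hypothetical toy data with two missing values (not real NYT data)
toy_df = pd.DataFrame({
    "college": ["A University", "B College", "C Community College"],
    "city": ["Cincinnati", None, "Dayton"],
    "cases": [120, 45, None],
})

# isna() marks each missing cell as True; sum() then counts the Trues per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```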

### A3. Focus the frame

This dataset includes many columns, but for this exercise, we're going to focus only on:

* `state`
* `city`
* `college`
* `cases`

Let's select just these columns and preview the first 10 rows.


In [None]:
college_cases_df = college_df[["state", "city", "college", "cases"]]
college_cases_df.head(10)
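Note the double brackets in the selection above: the inner brackets are a Python list of column names. A short sketch on hypothetical toy data (names and values invented for illustration) shows how the bracket style changes the type of the result:

```python
import pandas as pd

# Hypothetical toy data (not real NYT data)
toy_df = pd.DataFrame({
    "state": ["Ohio", "Kentucky"],
    "college": ["A University", "B College"],
    "cases": [120, 45],
})

one_col_series = toy_df["cases"]            # single name -> Series
one_col_frame = toy_df[["cases"]]           # list with one name -> DataFrame
two_col_frame = toy_df[["state", "cases"]]  # list of names -> DataFrame

print(type(one_col_series).__name__)  # Series
print(type(one_col_frame).__name__)   # DataFrame
print(two_col_frame.shape)            # (2, 2)
```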

### A4. Quick subsetting reps


- Filter to **Ohio** rows with **cases > 100**
- Count how many such rows

Let's say we wanted to filter the DataFrame to only include rows where the `state` is "Ohio".  Recall we can use the following code snippet:


In [None]:
ohio_colleges = college_cases_df["state"] == "Ohio"
college_cases_df.loc[ohio_colleges]

Let's see how many cases we had across all Ohio colleges:

In [None]:
ohio_college_cases = college_cases_df.loc[ohio_colleges]
print(f"Total Ohio cases: {ohio_college_cases['cases'].sum()}")

Now let's say we wanted to filter the DataFrame to only include rows where the `state` is "Ohio" and the `cases` are greater than 100. We can combine these conditions using the `&` operator:

In [None]:
over_100 = college_cases_df["cases"] > 100
ohio_over_100 = college_cases_df.loc[ohio_colleges & over_100]
ohio_over_100

In [None]:
# How many rows meet the criteria?
ohio_over_100.shape[0]
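The cells above store each condition in a named mask before combining them. When the conditions are written inline instead, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>`. A minimal sketch on hypothetical toy data (values invented for illustration):

```python
import pandas as pd

# Hypothetical toy data (not real NYT data)
toy_df = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Kentucky", "Kentucky"],
    "cases": [150, 20, 300, 5],
})

# & means "and": both conditions must hold for a row to be kept
ohio_and_over_100 = toy_df.loc[(toy_df["state"] == "Ohio") & (toy_df["cases"] > 100)]

# | means "or": a row is kept if either condition holds
ohio_or_over_100 = toy_df.loc[(toy_df["state"] == "Ohio") | (toy_df["cases"] > 100)]

print(len(ohio_and_over_100))  # 1
print(len(ohio_or_over_100))   # 3
```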

### A5. A specific school

Now let's see how we can filter the DataFrame to focus on a specific college, such as the University of Cincinnati (UC).

- Filter rows where `college == "University of Cincinnati"`
- How many cases were reported at UC?


In [None]:
uc_df = college_cases_df[college_cases_df["college"] == "University of Cincinnati"]
uc_df
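Displaying the filtered frame shows the matching row(s). To pull out a single number, one option is to select the `cases` column in the same `.loc` call and sum it. The sketch below uses hypothetical toy data (the college names and counts are invented), so it does not reveal the real answer:

```python
import pandas as pd

# Hypothetical toy data (not real NYT data)
toy_df = pd.DataFrame({
    "college": ["A University", "B College", "A University"],
    "cases": [120, 45, 30],
})

# Boolean mask for the school of interest
school_mask = toy_df["college"] == "A University"

# .loc[rows, column] returns a Series of cases; sum() collapses it to one number
school_total = toy_df.loc[school_mask, "cases"].sum()
print(school_total)  # 150
```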

If you finish before the 30 minutes are up, adjust the code above to answer these questions:

* Filter for a different college, such as "Ohio State University" or "Miami University". How many cases were reported at that college?
* What about the University of Dayton? How many cases were reported there?
* How many cases were reported in Ohio compared to Kentucky?

## Class Q&A (10 minutes)

Prompts:
- Any concerns with reading the CSV?
- `.head()`, `.info()`, `.describe()` — when do you use each?
- Series vs DataFrame — what clicked? what’s fuzzy?
- `.loc` vs `[]` — when prefer which?
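To anchor the discussion, here is a small sketch on hypothetical toy data (names and values invented for illustration). It contrasts `.describe()` with row filtering via `[]` and via `.loc`; it is one way to frame the comparison, not the only one.

```python
import pandas as pd

# Hypothetical toy data (not real NYT data)
toy_df = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Kentucky"],
    "college": ["A University", "B College", "C College"],
    "cases": [120, 45, 300],
})

# describe() returns summary statistics; by default only numeric columns are included
summary = toy_df.describe()
print(summary.loc["mean", "cases"])  # 155.0

is_ohio = toy_df["state"] == "Ohio"

# [] with a boolean mask filters rows and keeps every column
rows_only = toy_df[is_ohio]

# .loc filters rows and selects columns in a single step
rows_and_cols = toy_df.loc[is_ohio, ["college", "cases"]]
print(rows_and_cols)
```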


## Part B — Group Challenges (10 min each)

For the next several challenges...

* You will not be given starter code to work with; rather, you need to start from a blank cell.
* **DO NOT USE AI** to generate code for you.  This is a group exercise, and you should be writing the code together.
* Work with your group to write the code.
* Feel free to ask questions or seek help from the instructor.
* We'll stop and walk through each challenge together after each 10-minute time block is up.



### Challenge 1 — Tri-State Totals

**Question:** What was the **total number of COVID cases** across all **Ohio, Indiana, and Kentucky** colleges?


In [None]:
# Your turn: write code here to compute the total cases for OH, IN, KY



### Challenge 2 — Community Colleges vs Others
**Question:** What is the **average number of cases** across all **community colleges**?  
Use name-based detection (e.g., `college.str.contains("Community", case=False, na=False)`).  
**Then compare**: Is this average **higher or lower** than the average for **non‑community** schools?
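To see what the `case=False` and `na=False` arguments in the hint do, here is a sketch on a hypothetical Series of names (invented for illustration). It shows only how the mask is built; computing the averages is still up to your group.

```python
import pandas as pd

# Hypothetical college names, including a lowercase name and a missing value
names = pd.Series([
    "Alpha Community College",
    "Beta University",
    "gamma community college",
    None,
])

# case=False ignores capitalization; na=False treats missing names as "no match"
is_community = names.str.contains("Community", case=False, na=False)
print(is_community.tolist())  # [True, False, True, False]

# ~ flips the mask, which gives the non-matching rows
print((~is_community).tolist())  # [False, True, False, True]
```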


In [None]:
# Your turn: write code here to compute the average cases for community colleges

# Then compute the average for non-community colleges and compare

# Then use a comparison operator to see if the average cases for community colleges is greater than non-community colleges


### Challenge 3 — Community Colleges in OH/IN/KY
**Question:** Among **community colleges** in **OH, IN, KY**, what is the **average number of cases**?  
How does this compare to **all community colleges nationwide**?


In [None]:
# Your turn: write code here to compare the average cases for community colleges in OH/IN/KY
# vs the national community college average



### Challenge 4 — Top 5 Colleges

**Question:** What are the **top 5 colleges** with the **highest number of cases**?  Hint: there are two methods you can use to get this answer.

Once you have that, find the **5 colleges** with the **lowest number of cases**. Is this a fair assessment of the situation? Why or why not? How many colleges had zero cases?

In [None]:
# Your turn: write code here to compute the top 5 colleges with the highest number of cases
# Then compute the 5 colleges with the lowest number of cases


### Challenge 5 — Top 5 Ohio Colleges

**Question:** What are the **top 5 Ohio colleges** with the **highest number of cases**? What about the **5 Ohio colleges** with the **lowest number of cases**? How many Ohio colleges had zero cases?

In [None]:
# Your turn: write code here to compute the top 5 Ohio colleges with the highest and lowest number of cases



## (Optional) Brainstorm — What else is interesting?
Write **3 questions** you’d like to explore with this dataset in future weeks.

Examples to spark ideas (don’t answer now):
- Top 5 states by average cases per college?
- Trends by institution type (university vs community college)?
- Largest city-level clusters?


---
**Save your work** and be ready to share your group’s approach and findings.
