# Week 3 Lab — Become a Data Detective 🔍

<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/labs/03_wk3_lab_update.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🎯 Learning Objectives
By the end of this lab, you will be able to:
- Import tabular data into a Pandas DataFrame using multiple methods
- Systematically explore dataset structure using data detective techniques
- Distinguish between DataFrames and Series and know when to use each
- Apply professional subsetting techniques to filter and select data

## 📚 This Lab Reinforces
- **Chapter 7**: Importing Data (file paths, data types, inspection methods)
- **Chapter 8**: DataFrames (Series, indexing, attributes vs methods)
- **Chapter 9**: Subsetting (filtering, selection, `.loc[]` best practices)

## 🕐 Estimated Time & Structure
**Total Time:** 75 minutes  
**Mode:** Group-first, then Q&A, then group challenges

- **[0–40 min]** Part A: Guided Data Detective Training
- **[40–50 min]** Class Q&A: Surface blockers + clarify concepts
- **[50–75 min]** Part B: Independent Group Challenges

You are encouraged to work in small groups of **2–4 students** and complete the lab together.

## 💡 Why This Matters
Every data scientist starts a new project the same way: importing unknown data and systematically exploring it to understand what stories it might tell. Today you'll develop the professional habits that data detectives use to quickly assess, understand, and subset any dataset—skills that will serve you in every analysis you do.

We will use the **[New York Times College COVID-19](https://github.com/nytimes/covid-19-data/tree/master/colleges)** dataset for our analysis.

## Setup
Since we will rely on the Pandas library, make sure you have it installed in your Python environment. If you're using Google Colab, Pandas is typically pre-installed.

In [None]:
import pandas as pd

## Part A — Guided Data Detective Training (40 minutes)

### A1. Import the Dataset (5 minutes)

**The Data Detective's First Step:** Every investigation starts with getting the data into your workspace.

Dataset URL:  
```
https://raw.githubusercontent.com/nytimes/covid-19-data/refs/heads/master/colleges/colleges.csv
```

**📋 Your Task:** Load into a DataFrame named `college_df` and take your first look.

In [None]:
# Import the dataset
data_url = "https://raw.githubusercontent.com/nytimes/covid-19-data/refs/heads/master/colleges/colleges.csv"
college_df = pd.read_csv(data_url)
college_df.head()

### A2. Systematic Data Investigation (10 minutes)

**The Data Detective's Toolkit:** Before analyzing any dataset, professional data scientists ask the same systematic questions. Let's learn the essential detective methods.

**📋 Detective Questions to Answer:**
1. **"How big is this dataset?"** → Use `.shape`
2. **"What variables do I have?"** → Use `.columns`
3. **"What does the data look like?"** → Use `.head()` and `.tail()`
4. **"What types of data am I working with?"** → Use `.dtypes` and `.info()`
5. **"Any missing data issues?"** → Check `.info()` output
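
**Going one step further on Question 5:** `.info()` shows non-null counts that you compare against the row count by eye. A complementary check is `.isna().sum()`, which counts missing values per column directly. The sketch below runs on a small made-up DataFrame; `toy_df` and its values are illustrative assumptions, not part of the lab dataset.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the NYT dataset)
toy_df = pd.DataFrame({
    "college": ["A College", "B University", "C Institute"],
    "cases": [10, None, 25],
    "city": ["Springfield", None, None],
})

# .isna() marks each missing cell as True; .sum() counts the Trues per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```

The same call should work on `college_df` once it is loaded.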

In [None]:
# Detective Question 1: Dataset dimensions
print(f"Dataset shape: {college_df.shape}")
print(f"This dataset has {college_df.shape[0]} rows and {college_df.shape[1]} columns")

In [None]:
# Detective Question 2: What variables are available?
college_df.columns

In [None]:
# Detective Question 3: First look at the data
college_df.head()

In [None]:
# Detective Question 3 continued: Last few rows
college_df.tail()

In [None]:
# Detective Questions 4 & 5: Data types and missing values
college_df.info()

### A2.5. DataFrame vs Series Mastery (5 minutes)

**Essential Concept:** Understanding the difference between DataFrames and Series is crucial for effective data analysis.

**🧪 Experiment:** Run each code block below and observe the differences.

In [None]:
# Single brackets = Series
cases_series = college_df['cases']
print(f"Type: {type(cases_series)}")
print(f"Shape: {cases_series.shape}")
print("\nFirst few values:")
cases_series.head()

In [None]:
# Double brackets = DataFrame
cases_dataframe = college_df[['cases']]
print(f"Type: {type(cases_dataframe)}")
print(f"Shape: {cases_dataframe.shape}")
print("\nFirst few values:")
cases_dataframe.head()

In [None]:
# Multiple columns = DataFrame
multiple_cols = college_df[['state', 'college', 'cases']]
print(f"Type: {type(multiple_cols)}")
print(f"Shape: {multiple_cols.shape}")
multiple_cols.head()

**🔑 Key Insight:** Use `[[]]` when you want to keep working with DataFrame methods!

### A2.6. Attributes vs Methods (3 minutes)

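**Detective Tool Classification:** Knowing when parentheses are required prevents a common class of bugs. Attributes store information about an object, while methods must be called with parentheses before they do anything.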

**📋 Quick Classification Exercise:**

In [None]:
# Attributes (no parentheses needed) - these are "ID card" information
print("ATTRIBUTES (basic properties):")
print(f"Shape: {college_df.shape}")
print(f"Columns: {list(college_df.columns)}")
print(f"Data types: {college_df.dtypes.to_dict()}")

In [None]:
# Methods (parentheses required) - these DO something
print("METHODS (actions that return results):")
print("\nFirst 3 rows:")
print(college_df.head(3))
print("\nSummary info:")
college_df.info()

**💡 Memory Trick:** Methods = Actions = Parentheses!
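
To see why the parentheses matter, here is a minimal sketch of what happens when they are left off a method. It uses a made-up `toy_df` (an assumption for illustration) so it runs without the lab dataset.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({"cases": [5, 12, 7]})

# Attribute: no parentheses, returns stored information
print(toy_df.shape)

# Method without parentheses: you get the method object itself, not a result
print(callable(toy_df.head))

# Method with parentheses: the action actually runs
first_two = toy_df.head(2)
print(first_two)
```

If a cell prints something like `<bound method ...>` instead of data, missing parentheses are the likely cause.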

### A3. Focus the Investigation (5 minutes)

**Data Detective Principle:** Real investigations focus on relevant variables. This dataset includes many columns, but for this exercise, we're going to focus only on:

* `state`
* `city`
* `college`
* `cases`

**📋 Your Task:** Select just these columns and preview the focused dataset.

In [None]:
# Create a focused dataset with key variables
college_cases_df = college_df[['state', 'city', 'college', 'cases']]
print(f"Focused dataset shape: {college_cases_df.shape}")
college_cases_df.head(10)

### A3.5. Index Investigation (5 minutes)

**Professional Practice:** Understanding and setting meaningful indexes makes data access faster and code more readable.

**🔍 Index Detective Work:**

In [None]:
# What's our current index?
print("Current index:")
print(college_cases_df.index)
print("\nIndex type:", type(college_cases_df.index))

In [None]:
# Would a different column make a better index?
# Let's check if college names are unique (good for indexing)
total_colleges = len(college_cases_df)
unique_colleges = college_cases_df['college'].nunique()
print(f"Total rows: {total_colleges}")
print(f"Unique colleges: {unique_colleges}")
print(f"Are college names unique? {total_colleges == unique_colleges}")

In [None]:
# College names aren't unique, so a lookup by name can return more than one row.
# For demonstration, create a new DataFrame with college names as the index
college_indexed = college_cases_df.set_index('college')
print("Sample of college-indexed data:")
college_indexed.head()

# Try looking up a specific college
# college_indexed.loc['University of Cincinnati']
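
Because the index built above is not unique, a `.loc[]` lookup can return different shapes depending on the label. The sketch below illustrates this on made-up data; the college names and case counts are assumptions chosen only to create one duplicated label.

```python
import pandas as pd

# Hypothetical toy data: "Union College" appears twice on purpose
toy_df = pd.DataFrame({
    "state": ["New York", "Kentucky", "Ohio"],
    "college": ["Union College", "Union College", "Oberlin College"],
    "cases": [30, 12, 8],
})
toy_indexed = toy_df.set_index("college")

# A label that appears once returns a single row as a Series
unique_lookup = toy_indexed.loc["Oberlin College"]
print(type(unique_lookup).__name__)

# A duplicated label returns every matching row as a DataFrame
duplicate_lookup = toy_indexed.loc["Union College"]
print(type(duplicate_lookup).__name__, duplicate_lookup.shape)

# .index.is_unique is a quick check before relying on label lookups
print(toy_indexed.index.is_unique)
```

Checking `.index.is_unique` first is one way to avoid being surprised by the return type.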

### A4. Professional Subsetting Practice (7 minutes)

**Data Detective Skills:** Learn to filter and subset data with `.loc[]`, the explicit indexer this lab uses as its standard for filtering.

**📋 Step-by-Step Filtering Training:**

In [None]:
# Step 1: Create a condition (True/False for each row)
ohio_condition = college_cases_df['state'] == 'Ohio'
print("First 10 values of our condition:")
print(ohio_condition.head(10))
print(f"\nNumber of Ohio colleges: {ohio_condition.sum()}")

In [None]:
# Step 2: Apply the filter using .loc[]
ohio_colleges = college_cases_df.loc[ohio_condition]
print(f"Ohio colleges dataset shape: {ohio_colleges.shape}")
ohio_colleges.head()

In [None]:
# Step 3: Analyze the filtered data
total_ohio_cases = ohio_colleges['cases'].sum()
print(f"Total COVID cases across all Ohio colleges: {total_ohio_cases:,}")

In [None]:
# Advanced filtering: Combine multiple conditions
high_case_condition = college_cases_df['cases'] > 100
ohio_high_cases = college_cases_df.loc[ohio_condition & high_case_condition]

print(f"Ohio colleges with >100 cases: {ohio_high_cases.shape[0]}")
ohio_high_cases.head()

In [None]:
# Professional technique: Look up a specific school
uc_condition = college_cases_df['college'] == 'University of Cincinnati'
uc_data = college_cases_df.loc[uc_condition]
print("University of Cincinnati COVID data:")
uc_data

## Class Q&A (10 minutes)

**Discussion prompts:**
- Any concerns with importing data from URLs vs. file paths?
- DataFrame vs Series distinction - what clicked? What's still confusing?
- Attributes vs Methods - when do you use parentheses?
- `.loc[]` vs `[]` for filtering - which feels more professional and why?
- Index concepts - when would you set a custom index?

**Common blockers and clarifications:**
- Forgetting parentheses on methods
- Confusing `[]` vs `[[]]` for column selection
- Using `and`/`or` instead of `&`/`|` in pandas conditions

## Part B — Independent Group Challenges (25 minutes)

For the next several challenges...

* You will not be given starter code to work with; rather, you need to start from a blank cell.
* **DO NOT USE AI** to generate code for you. This is a group exercise, and you should be writing the code together.
* Work with your group to write the code.
* Feel free to ask questions or seek help from the instructor.
* We'll stop and walk through each challenge together after each 5-minute time block.

**Professional Practice:** Use `.loc[]` for all filtering operations in these challenges.

### Challenge 1 — DataFrame vs Series Mastery (5 minutes)

**Question:** Extract the `cases` column from `college_cases_df` in **two different ways**:
1. As a **Series** and calculate the total cases using `.sum()`
2. As a **DataFrame** and calculate the total cases using `.sum()`

**Follow-up:** What's the difference in the output? When might you prefer each approach?

In [None]:
# Your turn: extract cases as Series and DataFrame, calculate totals


### Challenge 2 — Professional Index Usage (5 minutes)

**Question:** Create a version of `college_cases_df` with `state` as the index. Then use `.loc[]` to:
1. Look up all colleges in **Ohio**
2. Look up all colleges in **California** 
3. Compare the total cases between these two states

**Hint:** When state is the index, you can use `.loc['Ohio']` for lookups.

In [None]:
# Your turn: set state as index and practice lookups


### Challenge 3 — Complex Filtering with .loc[] (5 minutes)

**Question:** Using `.loc[]`, find colleges that meet **ALL** of these criteria:
1. Located in **Ohio, Indiana, or Kentucky**
2. Have **more than 50 cases**
3. Return only the `college` and `cases` columns

**Hint:** Use the `.isin()` method for multiple state matching and `&` for combining conditions.

In [None]:
# Your turn: complex multi-condition filtering with .loc[]


### Challenge 4 — Advanced Data Detective Work (5 minutes)

**Question:** Find **community colleges** using name-based detection and compare them to **all other colleges**:

1. Use `.str.contains("Community", case=False, na=False)` to identify community colleges
2. Calculate the **average cases** for community colleges
3. Calculate the **average cases** for non-community colleges
4. Which type has higher average cases?

**Hint:** Use `~` (tilde) to negate a condition for "not community colleges".

In [None]:
# Your turn: community college analysis


### Challenge 5 — Professional Ranking Analysis (5 minutes)

**Question:** Use professional pandas methods to find:

1. **Top 5 colleges** with the highest cases (use `.nlargest()`)
2. **Top 5 Ohio colleges** with the highest cases
3. **Bottom 5 colleges** with the lowest cases (use `.nsmallest()`)

**Discussion:** Are the lowest-case results meaningful? What might explain very low case counts?

In [None]:
# Your turn: ranking analysis with .nlargest() and .nsmallest()


## 🎯 (Optional) Extension Activities

If you finish early or want additional practice:

### Extension 1: File Path Practice
If you have the dataset downloaded locally, practice importing it using relative paths instead of the URL.
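
A minimal sketch of the relative-path approach follows. The folder name `data` and file name `toy_colleges.csv` are assumptions for illustration; the code writes a tiny stand-in file first so it runs anywhere, and you would point `csv_path` at your own download instead.

```python
from pathlib import Path

import pandas as pd

# Assumed location: data/toy_colleges.csv, relative to the working directory
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
csv_path = data_dir / "toy_colleges.csv"

# Write a tiny stand-in file so the example is self-contained
pd.DataFrame({"college": ["A College"], "cases": [10]}).to_csv(csv_path, index=False)

# Relative paths are resolved from the current working directory
print(Path.cwd())
local_df = pd.read_csv(csv_path)
print(local_df.shape)
```

If `read_csv` raises `FileNotFoundError`, compare the printed working directory with where the file actually lives.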

### Extension 2: Index Optimization
Experiment with different columns as indexes. Which ones make sense for the types of lookups you're doing?

### Extension 3: Brainstorm — What else is interesting?
Write **3 questions** you'd like to explore with this dataset in future weeks.

Examples to spark ideas:
- Which states have the highest average cases per college?
- How do case numbers vary by institution type?
- Are there geographic patterns in the data?

## 🎓 Lab Wrap-Up & Reflection

### ✅ What You Accomplished
In this lab, you practiced:
- **Systematic data investigation** using the data detective methodology
- **DataFrame vs Series distinction** and when to use each
- **Professional subsetting** with `.loc[]` for filtering and selection
- **Index manipulation** for more efficient data access
- **Attributes vs methods** for clean, professional-looking code

### 🤔 Reflection Questions
Take 2-3 minutes to consider:
- Which data detective technique will you use first on every new dataset?
- When would you choose a Series vs DataFrame for your analysis?
- How does using `.loc[]` make your code more professional?

### 🔗 Connection to Course Goals
These foundational skills in systematic data exploration and professional subsetting techniques will be essential for every analysis you do in this course and your career. You now have the tools to confidently approach any dataset and extract the insights that matter.

### 📋 Next Steps
- **Before next Tuesday:** Read Chapters 7-9 to reinforce today's concepts
- **Next Tuesday:** Data manipulation and aggregation techniques
- **Additional Practice:** Try applying these detective techniques to other datasets

---
**💾 Save your work** and be ready to share your approach and findings. Your systematic approach to data exploration will serve you well in every future analysis!

## 🚨 Troubleshooting & Common Issues

**Issue 1:** `KeyError` when accessing columns
- **Solution:** Check exact column names using `df.columns` - they're case-sensitive!
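
A small sketch of how this error shows up and one way to track it down. The column names below, including the capital letter and the trailing space, are made up to trigger the problem.

```python
import pandas as pd

# Hypothetical toy data: note the capital "S" and the trailing space in "cases "
toy_df = pd.DataFrame({"State": ["Ohio"], "cases ": [120]})

try:
    toy_df["state"]
except KeyError as err:
    print(f"KeyError: {err}")

# Printing the names as a list exposes stray capitals and spaces
print(list(toy_df.columns))

# One possible cleanup: strip whitespace and lowercase every column name
toy_df.columns = toy_df.columns.str.strip().str.lower()
print(toy_df["state"])
```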

**Issue 2:** Forgetting parentheses on methods
- **Solution:** Remember: Methods = Actions = Parentheses. `df.head()` not `df.head`

**Issue 3:** Using `and`/`or` instead of `&`/`|` in pandas conditions
- **Solution:** Use `&` (and) and `|` (or) to combine pandas conditions, and wrap each individual condition in its own parentheses
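
The sketch below reproduces the error and shows the fix on a made-up `toy_df` (an assumption for illustration, not the lab dataset).

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Indiana"],
    "cases": [150, 40, 300],
})

# `and` asks for a single True/False from a whole Series, which pandas refuses
try:
    toy_df.loc[(toy_df["state"] == "Ohio") and (toy_df["cases"] > 100)]
except ValueError as err:
    print(f"ValueError: {err}")

# `&` compares row by row; the parentheses are needed because
# `&` binds more tightly than `==` and `>`
ohio_high = toy_df.loc[(toy_df["state"] == "Ohio") & (toy_df["cases"] > 100)]
print(ohio_high)
```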

**Issue 4:** Confusing Series vs DataFrame outputs
- **Solution:** Use `type()` to check what you're working with. Use `[[]]` to keep DataFrames.
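
A short sketch of how this confusion tends to surface: code written for a DataFrame fails when it receives a Series. The `toy_df` below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({"state": ["Ohio", "Indiana"], "cases": [120, 45]})

selection = toy_df["cases"]  # single brackets -> Series
print(type(selection).__name__)

# A Series has no .columns attribute, so DataFrame-only code fails on it
try:
    selection.columns
except AttributeError as err:
    print(f"AttributeError: {err}")

# Double brackets keep the result as a DataFrame
selection = toy_df[["cases"]]
print(type(selection).__name__, list(selection.columns))
```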

**General Debugging Tips:**
- Use `.shape` to verify your filtering worked as expected
- Use `.head()` to preview results before running complex operations
- Break complex conditions into separate variables for easier debugging