<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/example-notebooks/wk3_data_detective.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 3: Become a Data Detective 🔍

## Today's Mission:
- Learn to import real-world datasets
- Master DataFrame investigation techniques  
- Discover the power of data subsetting

**Follow along with the slides and complete the challenges below!**

## Getting Data Into Python

Let's start by importing the pandas library and loading our first dataset!

In [None]:
# Step 1: Import pandas and load the Ames housing data
import pandas as pd

# Load the real estate data
ames = pd.read_csv("https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/ames_raw.csv")

# What type of object did we create?
print(f"Type of object: {type(ames)}")

In [None]:
# Take a quick look at our DataFrame
ames

## Getting to Know Your Data 🧐

Think of these as your **"data detective questions"**:

In [None]:
# 🔍 "How big is this dataset?"
ames.shape

In [None]:
# 👀 "What does the data look like?"
ames.head()

In [None]:
# 📊 "Which columns have missing values, and what type is each one?"
ames.info()

In [None]:
# 📋 "What are my column names?"
ames.columns

In [None]:
# 🎯 "What data types am I working with?"
ames.dtypes

In [None]:
# 📈 "What are the basic statistics?"
ames.describe()
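
By default, `.describe()` summarizes only the numeric columns, so text columns are left out of the result. Here is a minimal sketch on a made-up DataFrame (the `homes` data and its column names are hypothetical, not part of the Ames dataset) showing how `include="all"` brings the text columns back in:

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the Ames dataset)
homes = pd.DataFrame({
    "price": [150000, 220000, 310000],
    "style": ["ranch", "colonial", "ranch"],
})

# Default behavior: only numeric columns are summarized
numeric_summary = homes.describe()
print(numeric_summary.columns.tolist())

# include="all" adds text columns (count, unique, top, freq)
full_summary = homes.describe(include="all")
print(full_summary.loc[["unique", "top"], "style"])
```

Statistics that don't apply to a column (for example, the mean of a text column) show up as `NaN` in the combined summary.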

## Understanding Series vs DataFrames

Single brackets extract one column as a Series; double brackets return a DataFrame, even when you ask for just one column. Let's see the difference:

In [None]:
# Extract one column - this returns a Series
prices = ames['SalePrice']
print(f"Type: {type(prices)}")
prices.head()

In [None]:
# Single brackets = Series
print(f"Single brackets: {type(ames['SalePrice'])}")

# Double brackets = DataFrame  
print(f"Double brackets: {type(ames[['SalePrice']])}")

In [None]:
# Multiple columns = DataFrame
ames[['SalePrice', 'Year Built']].head(3)
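
One quick way to see the difference is to compare shapes: a Series is one-dimensional, while a DataFrame always has rows and columns. The sketch below uses a small made-up DataFrame (`homes` and its columns are hypothetical) so it runs without downloading anything:

```python
import pandas as pd

# Hypothetical toy data for illustration only
homes = pd.DataFrame({
    "price": [150000, 220000, 310000],
    "year_built": [1995, 2003, 2010],
})

one_col_series = homes["price"]     # single brackets -> Series
one_col_frame = homes[["price"]]    # double brackets -> DataFrame

print(type(one_col_series).__name__, one_col_series.shape)  # one dimension
print(type(one_col_frame).__name__, one_col_frame.shape)    # rows and columns
```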

## Data Mystery #1 🕵️

**Challenge:** Using our Ames dataset, can you find:

1. How many houses are in our dataset?
2. What's the highest sale price?
3. What's the average year built?

**Detective Tools:** `.shape`, `.max()`, `.mean()`

In [None]:
# Mystery 1: How many houses are in our dataset?
# Your code here:



In [None]:
# Mystery 2: What's the highest sale price?
# Your code here:



In [None]:
# Mystery 3: What's the average year built?
# Hint: Remember the column name has a space!
# Your code here:



## Selecting Columns: Pick Your Variables

Let's practice selecting specific columns from our dataset:

In [None]:
# Method 1: Select multiple columns
house_basics = ames[['SalePrice', 'Year Built']]
house_basics.head(3)

In [None]:
# Method 2: For just one column (as a Series)
just_prices = ames['SalePrice']
print(f"Type: {type(just_prices)}")
print(f"First few values: {just_prices.head(3).tolist()}")

## Filtering Rows: The Magic of Conditions

The process: ask a yes/no question about each row, then keep only the rows where the answer is yes!

In [None]:
# Step 1: Create a condition (True/False for each row)
# Which houses sold for more than $200,000?
expensive_houses = ames['SalePrice'] > 200000
expensive_houses

In [None]:
# Step 2: Use the condition to filter
# Keep only the True rows
filtered_ames = ames[expensive_houses]
print(f"Original dataset: {ames.shape[0]} houses")
print(f"Expensive houses: {filtered_ames.shape[0]} houses")
filtered_ames.head()
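
Because a condition is just a column of `True`/`False` values, you can also count or average it directly: `True` counts as 1 and `False` as 0. A minimal sketch on made-up data (the `homes` DataFrame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data for illustration only
homes = pd.DataFrame({"price": [150000, 220000, 310000, 180000]})

# Step 1: the yes/no question for each row
is_expensive = homes["price"] > 200000

# Summing the condition counts the True rows
n_expensive = is_expensive.sum()

# Averaging it gives the share of True rows
share_expensive = is_expensive.mean()

# Step 2: the filter keeps exactly those True rows
filtered = homes[is_expensive]

print(n_expensive, share_expensive, len(filtered))
```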

## Building More Complex Filters

Let's combine multiple conditions using `&` (AND) and `|` (OR):

In [None]:
# Step 1: Define our conditions
expensive = ames['SalePrice'] > 200000
recent = ames['Year Built'] > 2000

# Step 2: Combine with & (AND)
expensive_and_recent = expensive & recent

# Step 3: Filter the data
result = ames[expensive_and_recent]
print(f"Expensive AND recent houses: {result.shape[0]}")
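
The cell above uses `&`; the `|` (OR) operator works the same way. The sketch below, on a made-up DataFrame (`homes` and its columns are hypothetical), shows OR and also what happens when you write the conditions inline: each comparison needs its own parentheses, because `&` and `|` are evaluated before `>`.

```python
import pandas as pd

# Hypothetical toy data for illustration only
homes = pd.DataFrame({
    "price": [150000, 220000, 310000, 180000],
    "year_built": [1995, 2003, 2010, 2005],
})

expensive = homes["price"] > 200000
recent = homes["year_built"] > 2000

# OR: keep rows where at least one condition is True
either = homes[expensive | recent]

# Inline version of AND: wrap each comparison in parentheses
both_inline = homes[(homes["price"] > 200000) & (homes["year_built"] > 2000)]

print(f"Expensive OR recent: {either.shape[0]}")
print(f"Expensive AND recent: {both_inline.shape[0]}")
```

Python's `and` / `or` keywords don't work here; they expect a single `True`/`False`, not a whole column of them.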

## The Powerful .loc Accessor

Professional tip: `.loc[]` lets you filter rows and select columns in one clean, readable step.

In [None]:
# Basic filtering with .loc
recent_houses = ames.loc[ames['Year Built'] > 2000]
print(f"Shape: {recent_houses.shape}")

In [None]:
# Select columns AND filter rows with .loc
rows = ames['Year Built'] > 2000
cols = ['SalePrice', 'Year Built']

recent_prices = ames.loc[rows, cols]
recent_prices.head()
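
The Series-vs-DataFrame rule applies inside `.loc[]` too: a single column name gives you a Series, which is handy when you want to follow up with `.mean()` or `.max()`. A minimal sketch on made-up data (the `homes` DataFrame and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical toy data for illustration only
homes = pd.DataFrame({
    "price": [150000, 220000, 310000, 180000],
    "year_built": [1995, 2003, 2010, 2005],
})

recent = homes["year_built"] > 2000

# A single column name -> Series
recent_price_series = homes.loc[recent, "price"]

# A list of column names -> DataFrame (even with one name in the list)
recent_price_frame = homes.loc[recent, ["price"]]

# A Series can be summarized directly
avg_recent_price = recent_price_series.mean()
print(f"Average price of recent homes: {avg_recent_price:,.0f}")
```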

## Data Mystery #2 🕵️‍♀️

**Your Challenge:** Be the real estate detective!

Using the Ames dataset, find:

1. **How many houses** were built in 2005 or later?
2. **What's the average price** of houses with more than 2000 sq ft living area? (use `Gr Liv Area` column)
3. **Challenge:** Find houses that are both expensive (>$300k) AND large (>2500 sq ft)

**Hint:** Use `.loc[]` for clean solutions

In [None]:
# Challenge 1: How many houses were built in 2005 or later?
# Your code here:



In [None]:
# Challenge 2: What's the average price of houses with more than 2000 sq ft living area?
# Your code here:



In [None]:
# Challenge 3: Find houses that are both expensive (>$300k) AND large (>2500 sq ft)
# Your code here:



## Real-World Detective Work Challenge!

Now let's work with the planes dataset and solve the same challenge you tried in the spreadsheet!

**Your Task:** Find the **average number of seats** on aircraft manufactured by **Embraer** in **2004 or later**

In [None]:
# Load the planes dataset
planes = pd.read_csv("https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/planes.csv")

# Take a quick look at the data structure
planes.head()

In [None]:
# Now it's your turn! Write Python code to:
# 1. Filter for Embraer aircraft built in 2004 or later
# 2. Calculate the average number of seats

# Your code here:



## 🎉 Congratulations, Data Detective!

You've successfully learned how to:
- Import datasets with pandas
- Inspect data with detective questions
- Distinguish between DataFrames and Series
- Select specific columns
- Filter rows with conditions
- Combine filtering and selection with `.loc[]`

## 🧾 Detective's Quick Reference

| Detective Task | Pandas Code | Returns |
|----------------|-------------|---------|
| **Inspect data size** | `df.shape` | Tuple (rows, cols) |
| **Preview data** | `df.head()` | First 5 rows |
| **Get column info** | `df.info()` | Prints data types & non-null counts (returns `None`) |
| **Select one column** | `df['col']` | Series |
| **Select multiple columns** | `df[['col1', 'col2']]` | DataFrame |
| **Filter rows** | `df[df['col'] > value]` | DataFrame |
| **Filter + Select** | `df.loc[condition, ['col1', 'col2']]` | DataFrame |

### Golden Rule: 
Always start with `.head()`, `.info()`, and `.shape` to understand your data!

### Remember:
Every dataset tells a story. Now you have the tools to read it! 🔍