# Session 18: Practice - Data Cleaning and Merging

In this practice session, you'll work with realistic datasets that have common data quality issues. You'll apply your skills in combining DataFrames, handling missing values, fixing data types, and dealing with outliers.

## Instructions

- Complete each exercise in the provided code cells
- Run your code to verify it works
- Some exercises build on previous ones, so complete them in order

In [None]:
# Setup: Run this cell first!
import pandas as pd
import numpy as np

---
## Part 1: Concatenation Exercises

### Exercise 1: Stack Monthly Data

You have sales data from two months. Concatenate them into a single DataFrame with a new sequential index.
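
As a hint, here is a minimal sketch of the same idea on hypothetical toy frames (`week1` and `week2` are made up for illustration and are not the exercise data). It shows how the index differs with and without `ignore_index=True`.

```python
import pandas as pd

# Hypothetical toy frames, unrelated to the exercise data
week1 = pd.DataFrame({'item': ['x', 'y'], 'qty': [1, 2]})
week2 = pd.DataFrame({'item': ['x', 'z'], 'qty': [3, 4]})

# Without ignore_index, each frame keeps its own labels, so they repeat
stacked_raw = pd.concat([week1, week2])

# With ignore_index=True, the result gets a fresh 0..n-1 index
stacked = pd.concat([week1, week2], ignore_index=True)

print(stacked_raw.index.tolist())
print(stacked.index.tolist())
```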

In [None]:
# Data provided
jan_data = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'units': [100, 150, 75],
    'revenue': [5000, 6000, 3000]
})

feb_data = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'units': [120, 140, 90],
    'revenue': [6000, 5600, 3600]
})

# Your code: Concatenate with ignore_index=True


---
### Exercise 2: Concatenate with Month Labels

Using the same data, concatenate with keys to identify which month each row belongs to.
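
As a hint, the sketch below uses hypothetical toy frames (not the exercise data) to show what `keys` does: it adds an outer index level, so each block can be selected by its label.

```python
import pandas as pd

# Hypothetical toy frames, unrelated to the exercise data
week1 = pd.DataFrame({'item': ['x', 'y'], 'qty': [1, 2]})
week2 = pd.DataFrame({'item': ['x', 'z'], 'qty': [3, 4]})

# keys adds an outer index level that labels the source of each row
labelled = pd.concat([week1, week2], keys=['w1', 'w2'])

# Select every row that came from the second frame
print(labelled.loc['w2'])
```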

In [None]:
# Your code: Concatenate with keys=['January', 'February']


---
### Exercise 3: Handle Mismatched Columns

Concatenate these two DataFrames. First show the result keeping all columns, then show keeping only common columns.
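
A small sketch of the two behaviours, assuming hypothetical frames `left` and `right` that share only the column `k`. The default `join='outer'` keeps the union of columns and fills the gaps with `NaN`, while `join='inner'` keeps only the columns present in every frame.

```python
import pandas as pd

# Hypothetical toy frames that share only the column 'k'
left = pd.DataFrame({'k': [1, 2], 'a': [10, 20]})
right = pd.DataFrame({'k': [3], 'b': [30]})

# Default join='outer': union of columns, gaps become NaN
all_cols = pd.concat([left, right], ignore_index=True)

# join='inner': only the columns found in every frame
common = pd.concat([left, right], join='inner', ignore_index=True)

print(all_cols)
print(common)
```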

In [None]:
# Data provided
store_a = pd.DataFrame({
    'product': ['Laptop', 'Phone'],
    'price': [1000, 600],
    'stock': [50, 100]
})

store_b = pd.DataFrame({
    'product': ['Laptop', 'Tablet'],
    'price': [1100, 400],
    'discount': [100, 50]
})

# Your code: Concatenate keeping all columns


In [None]:
# Your code: Concatenate keeping only common columns


---
## Part 2: Merge Exercises

### Exercise 4: Basic Merge

Merge the orders and products tables to get the product name for each order.

In [None]:
# Data provided
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'product_id': [101, 102, 101, 103, 102],
    'quantity': [2, 1, 3, 1, 2],
    'customer': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
})

products = pd.DataFrame({
    'product_id': [101, 102, 103, 104],
    'name': ['Laptop', 'Phone', 'Tablet', 'Watch'],
    'price': [1000, 600, 400, 200]
})

# Your code: Merge on product_id


---
### Exercise 5: Left Join

Merge employees with their department information. Use a LEFT join to keep all employees, even those without a department assigned.
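
As a hint, here is a sketch of a left join on hypothetical `people` and `teams` tables (not the exercise data). The optional `indicator=True` argument adds a `_merge` column, which makes unmatched rows easy to spot.

```python
import pandas as pd

# Hypothetical toy tables; team_id 9 has no match in teams
people = pd.DataFrame({'person': ['P1', 'P2', 'P3'], 'team_id': [1, 2, 9]})
teams = pd.DataFrame({'team_id': [1, 2], 'team': ['Red', 'Blue']})

# how='left' keeps every row of people; unmatched rows get NaN for team
joined = people.merge(teams, on='team_id', how='left', indicator=True)

print(joined)
```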

In [None]:
# Data provided
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'dept_id': [10, 20, 10, None, 30]
})

departments = pd.DataFrame({
    'dept_id': [10, 20],
    'dept_name': ['Engineering', 'Marketing']
})

# Your code: Left join


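---
### Exercise 6: Merge on Multiple Columns

Merge the sales and targets tables on both `year` and `quarter`, then add a column showing whether the target was met.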

In [None]:
# Data provided
sales = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [100000, 120000, 110000, 130000]
})

targets = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'target': [95000, 115000, 105000, 125000]
})

# Your code: Merge on year AND quarter, then add a column showing if target was met


---
### Exercise 7: Different Column Names

Merge these DataFrames where the key column has different names.
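
A minimal sketch on hypothetical `authors` and `books` tables (illustrative names, not the exercise data). Note that both key columns survive the merge, so one of them is usually dropped afterwards.

```python
import pandas as pd

# Hypothetical toy tables whose key columns have different names
authors = pd.DataFrame({'author_id': [1, 2], 'author': ['Ann', 'Ben']})
books = pd.DataFrame({'book': ['B1', 'B2', 'B3'], 'writer_id': [2, 1, 2]})

# left_on names the key in the left frame, right_on the key in the right frame
merged = books.merge(authors, left_on='writer_id', right_on='author_id')

# Both key columns are kept, so drop the redundant one
merged = merged.drop(columns='author_id')

print(merged)
```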

In [None]:
# Data provided
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

purchases = pd.DataFrame({
    'purchase_id': [101, 102, 103, 104],
    'cust_id': [1, 2, 1, 3],
    'amount': [150, 200, 75, 300]
})

# Your code: Merge using left_on and right_on


---
## Part 3: Data Type Exercises

### Exercise 8: Fix Data Types

Fix the data types in this DataFrame. The `id` should be int, `price` should be float, and `in_stock` should be bool.
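
As a hint, the sketch below converts a hypothetical `raw` frame (not the exercise data). One pitfall worth knowing: calling `astype(bool)` on strings such as `'no'` typically yields `True`, because any non-empty string is truthy, so an explicit mapping is safer.

```python
import pandas as pd

# Hypothetical toy frame where every column was read in as text
raw = pd.DataFrame({
    'code': ['7', '8'],
    'weight': ['1.5', '2.25'],
    'active': ['yes', 'no'],
})

raw['code'] = raw['code'].astype(int)
raw['weight'] = raw['weight'].astype(float)

# Map the text labels explicitly instead of relying on astype(bool)
raw['active'] = raw['active'].map({'yes': True, 'no': False})

print(raw.dtypes)
```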

In [None]:
# Data provided
inventory = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'product': ['A', 'B', 'C', 'D'],
    'price': ['10.99', '20.50', '15.00', '8.75'],
    'in_stock': ['yes', 'no', 'yes', 'yes']
})

print("Original types:")
print(inventory.dtypes)

# Your code: Convert to correct types


---
### Exercise 9: Convert to Category

Convert the `region` column to the `category` type and compare memory usage before and after the conversion.

In [None]:
# Data provided
np.random.seed(42)
large_df = pd.DataFrame({
    'order_id': range(10000),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 10000)
})

# Your code: Print memory usage before, convert to category, print memory usage after


---
## Part 4: Missing Values Exercises

### Exercise 10: Detect Missing Values

Find out which columns have missing values and how many.
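
One possible approach, sketched on a hypothetical toy frame: `isna().sum()` counts missing values per column, and `isna().mean()` gives the fraction missing, since `True` counts as 1.

```python
import pandas as pd

# Hypothetical toy frame with a few gaps
df = pd.DataFrame({'a': [1, None, 3, None], 'b': ['x', 'y', None, 'z']})

missing_count = df.isna().sum()       # number of missing values per column
missing_pct = df.isna().mean() * 100  # share of missing values, in percent

summary = pd.DataFrame({'missing': missing_count, 'percent': missing_pct})
print(summary)
```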

In [None]:
# Data provided
customer_data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve', 'Frank', None, 'Henry'],
    'age': [25, None, 35, 28, None, 42, 31, None],
    'city': ['Madrid', 'Barcelona', 'Madrid', None, 'Valencia', None, 'Seville', 'Madrid'],
    'purchases': [5, 3, None, 8, 2, 6, None, 4]
})

# Your code: Show missing values count per column and percentage


---
### Exercise 11: Drop Missing Values

1. Drop rows where `name` is missing
2. From the result, drop rows where ALL of `age`, `city`, and `purchases` are missing
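
The two steps rely on the `subset` and `how` arguments of `dropna()`. Here is a sketch on a hypothetical toy frame (column names are made up for illustration).

```python
import pandas as pd

# Hypothetical toy frame, unrelated to the exercise data
df = pd.DataFrame({
    'label': ['p', None, 'q', 'r'],
    'x': [1.0, 2.0, None, None],
    'y': [None, 5.0, None, 6.0],
})

# Drop rows where one specific column is missing
has_label = df.dropna(subset=['label'])

# Drop rows only when ALL of the listed columns are missing
kept = has_label.dropna(subset=['x', 'y'], how='all')

print(kept)
```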

In [None]:
# Your code: Step 1 - Drop rows where name is missing


In [None]:
# Your code: Step 2 - Drop rows where ALL of age, city, purchases are missing


---
### Exercise 12: Fill Missing Values

Fill missing values appropriately:
- `age`: Fill with median
- `city`: Fill with 'Unknown'
- `purchases`: Fill with 0
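
As a hint, the same three strategies on a hypothetical toy frame. Assigning the result back to the column avoids relying on `inplace=True`.

```python
import pandas as pd

# Hypothetical toy frame, unrelated to the exercise data
df = pd.DataFrame({
    'score': [10.0, None, 30.0, 100.0],
    'group': ['a', None, 'b', 'a'],
    'visits': [1.0, None, 3.0, None],
})

# Numeric column: the median is less sensitive to extreme values than the mean
df['score'] = df['score'].fillna(df['score'].median())

# Text column: use an explicit placeholder
df['group'] = df['group'].fillna('missing')

# Count column: treat a missing count as zero
df['visits'] = df['visits'].fillna(0)

print(df)
```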

In [None]:
# Use the original customer_data
filled_data = customer_data.copy()

# Your code: Fill missing values


---
### Exercise 13: Forward Fill Time Series

Use forward fill to handle missing values in this time series data.
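
A short sketch on hypothetical `readings` data. Forward fill copies the last known value forward, so a gap at the very start of the series stays missing because there is nothing to carry over.

```python
import pandas as pd

# Hypothetical toy series with a leading gap and an interior gap
readings = pd.DataFrame({
    'day': pd.date_range('2023-06-01', periods=5),
    'level': [None, 2.0, None, None, 5.0],
})

# ffill() propagates the last valid observation forward
readings['level_filled'] = readings['level'].ffill()

print(readings)
```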

In [None]:
# Data provided
stock_prices = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'price': [100, 102, None, None, 105, 103, None, 108, None, 110]
})

print("Original:")
print(stock_prices)

# Your code: Forward fill the price column


---
## Part 5: Outlier Exercises

### Exercise 14: Detect Outliers with IQR

Find outliers in the salary data using the IQR method.
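
As a reminder, the IQR rule flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The sketch below applies it to a hypothetical toy series, not to the salary data.

```python
import pandas as pd

# Hypothetical toy series with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 50])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values that fall outside the bounds
outliers = values[(values < lower) | (values > upper)]

print(lower, upper)
print(outliers)
```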

In [None]:
# Data provided
np.random.seed(42)
salaries = pd.DataFrame({
    'employee_id': range(1, 51),
    'salary': np.concatenate([
        np.random.normal(60000, 10000, 45),
        [150000, 20000, 180000, 15000, 200000]  # Outliers
    ])
})

# Your code: Calculate IQR, define bounds, identify outliers


---
### Exercise 15: Cap Outliers

Instead of removing outliers, cap them at the bounds using `clip()`.
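
A sketch of capping on a hypothetical toy series. Unlike filtering, `clip()` keeps every row and only replaces values beyond the bounds with the bound itself.

```python
import pandas as pd

# Hypothetical toy series with one obvious outlier
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 50.0])

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond the bounds are replaced by the bounds; the length is unchanged
capped = values.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(capped.tolist())
```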

In [None]:
# Your code: Cap outliers at the IQR bounds


---
## Part 6: Comprehensive Exercises

### Exercise 16: Complete Data Pipeline

You have customer data from two regions that needs to be combined and cleaned.

Steps:
1. Concatenate the two regional datasets
2. Fix data types (age should be numeric)
3. Handle missing values in spending (fill with median)
4. Detect and cap outliers in spending

In [None]:
# Data provided
region_north = pd.DataFrame({
    'customer_id': ['N001', 'N002', 'N003', 'N004'],
    'age': ['25', '30', 'unknown', '45'],
    'spending': [500, None, 750, 15000]  # 15000 is outlier
})

region_south = pd.DataFrame({
    'customer_id': ['S001', 'S002', 'S003'],
    'age': ['35', '28', '40'],
    'spending': [600, 800, None]
})

# Your code: Complete the data pipeline
# Step 1: Concatenate


In [None]:
# Step 2: Fix age data type


In [None]:
# Step 3: Fill missing spending with median


In [None]:
# Step 4: Cap outliers in spending


---
### Exercise 17: Merge and Analyze

Merge the orders with customer and product information, then:
1. Calculate the total amount for each order
2. Find total spending per customer
3. Find the most popular product category
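
The overall pattern, sketched on hypothetical `sales`, `buyers`, and `items` tables (all names are made up and differ from the exercise data): chain two merges, derive a total column, then aggregate.

```python
import pandas as pd

# Hypothetical toy tables, unrelated to the exercise data
sales = pd.DataFrame({
    'sale_id': [1, 2, 3],
    'buyer_id': [1, 2, 1],
    'item_id': [10, 10, 20],
    'qty': [1, 2, 3],
})
buyers = pd.DataFrame({'buyer_id': [1, 2], 'buyer': ['Ann', 'Ben']})
items = pd.DataFrame({
    'item_id': [10, 20],
    'kind': ['book', 'pen'],
    'unit_price': [5, 2],
})

# Chain the merges: first attach buyers, then attach items
full = sales.merge(buyers, on='buyer_id').merge(items, on='item_id')

# Row-level total, then aggregates
full['total'] = full['qty'] * full['unit_price']
per_buyer = full.groupby('buyer')['total'].sum()
top_kind = full['kind'].value_counts().idxmax()

print(per_buyer)
print(top_kind)
```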

In [None]:
# Data provided
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'customer_id': [101, 102, 101, 103, 102, 101, 104, 103],
    'product_id': [1, 2, 3, 1, 2, 4, 3, 4],
    'quantity': [2, 1, 3, 1, 2, 1, 2, 1]
})

customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana']
})

products = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'product_name': ['Laptop', 'Phone', 'Tablet', 'Watch'],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Accessories'],
    'price': [1000, 600, 400, 200]
})

# Your code: Merge all tables


In [None]:
# Your code: Calculate total amount per order and add as a column


In [None]:
# Your code: Total spending per customer


In [None]:
# Your code: Most popular category (by number of orders)


---
### Bonus Exercise 1: Handle Complex Missing Data

This dataset has various representations of missing values. Clean them all.
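
Two tools that may help, sketched on a hypothetical `form` frame: `replace()` for known sentinel values, and `pd.to_numeric(..., errors='coerce')` for text columns where anything non-numeric should become `NaN`. The sentinel values used here are assumptions for the toy data only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; -9 and 0 are assumed to be "missing" codes for height
form = pd.DataFrame({
    'height': [170, -9, 165, 0],
    'rating': ['5', 'n/a', '', '3'],
})

# Replace known sentinel codes with NaN
form['height'] = form['height'].replace([-9, 0], np.nan)

# Coerce text to numbers; anything unparseable becomes NaN
form['rating'] = pd.to_numeric(form['rating'], errors='coerce')

print(form)
```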

In [None]:
# Data provided
messy_survey = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'age': [25, -1, 30, 999, 28, None],  # -1 and 999 are invalid
    'income': [50000, 'N/A', 65000, 'refused', None, 55000],
    'satisfaction': [4, 3, '', 5, 'na', 4]  # '' and 'na' are missing
})

# Your code: Replace all missing value representations with NaN


---
### Bonus Exercise 2: Multi-Table Analysis

Combine data from three monthly files and analyze total revenue by product.

In [None]:
# Data provided
jan = pd.DataFrame({'product': ['A', 'B', 'C'], 'revenue': [1000, 2000, 1500]})
feb = pd.DataFrame({'product': ['A', 'B', 'C'], 'revenue': [1200, 1800, 1600]})
mar = pd.DataFrame({'product': ['A', 'B', 'D'], 'revenue': [1100, 2200, 800]})  # Note: D instead of C

# Your code: Concatenate with month labels, then find total revenue by product across all months


---
### Bonus Exercise 3: Data Quality Report

Create a data quality report for a dataset that shows:
- Missing value count and percentage per column
- Data type of each column
- Number of unique values per column
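
One possible layout for such a report, sketched on a hypothetical toy frame: build one Series per metric, all indexed by column name, and combine them into a DataFrame.

```python
import pandas as pd

# Hypothetical toy frame, unrelated to the exercise data
df = pd.DataFrame({'a': [1, 2, 2, None], 'b': ['x', None, 'x', 'y']})

# Each metric is a Series indexed by column name, so they align automatically
report = pd.DataFrame({
    'dtype': df.dtypes.astype(str),
    'missing': df.isna().sum(),
    'missing_pct': (df.isna().mean() * 100).round(1),
    'unique': df.nunique(),  # nunique() ignores NaN by default
})

print(report)
```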

In [None]:
# Data provided
np.random.seed(42)  # make the randomly generated columns reproducible
dataset = pd.DataFrame({
    'id': range(1, 101),
    'name': ['Customer_' + str(i) if i % 10 != 0 else None for i in range(1, 101)],
    'category': np.random.choice(['A', 'B', 'C', None], 100),
    'value': np.random.choice([100, 200, 300, None, None], 100),
    'date': pd.date_range('2024-01-01', periods=100)
})

# Your code: Create a quality report DataFrame


---
## Summary

Excellent work! In this practice session, you applied:

- **Concatenation**: `pd.concat()` for stacking DataFrames, using `ignore_index`, `keys`, and `join`
- **Merging**: `pd.merge()` with inner and left joins, multiple key columns, and differently named keys
- **Data Types**: `astype()` and type conversion strategies
- **Missing Values**: Detection with `isnull()`, removal with `dropna()`, filling with `fillna()`
- **Outliers**: IQR method detection and capping with `clip()`

### Key Takeaways

1. Always check data quality before analysis
2. Document your cleaning decisions (what you removed, filled, or transformed)
3. Choose the right merge type based on your analysis needs
4. Consider the impact of missing value handling on your results
5. Outlier treatment depends on context - sometimes they're valid data points

### Congratulations!

You've completed the Pandas fundamentals block. You now have the skills to:
- Load and save data from various formats
- Explore and understand datasets
- Filter, select, and aggregate data
- Combine data from multiple sources
- Clean and prepare data for analysis

These are the foundational skills for any data analysis project!