# Day 4: Data Wrangling and Advanced Pandas

**Prepared By:** Dr. Kenechi Omeke  
**Date:** November 2024  

---

## Aim
Teach advanced data manipulation techniques for complex datasets.

## Intended Learning Outcomes
- Merge, join, and concatenate DataFrames.
- Perform group operations and aggregations.
- Handle time-series data effectively.

## Topics Covered
- Merging and Joining DataFrames (with code examples)
- GroupBy Operations and Aggregations (with code examples)
- Time-Series Data Analysis (with code examples)
- Mini-Project: Advanced Data Wrangling

---

# 1. Merging and Joining DataFrames
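Real-world analysis often requires combining datasets that share a key column. `pd.merge` supports four join types that differ in which keys they keep: inner (keys present in both frames), outer (keys present in either frame), left (all keys from the first frame), and right (all keys from the second frame). Let's practice each one.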

In [None]:
import pandas as pd
# Example DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Score': [85, 90, 95]})
# Inner join: keeps only the IDs present in both frames (2 and 3)
inner = pd.merge(df1, df2, on='ID', how='inner')
print('Inner join:')
print(inner)
# Outer join: keeps every ID from either frame; missing values become NaN
outer = pd.merge(df1, df2, on='ID', how='outer')
print('\nOuter join:')
print(outer)

In [None]:
# Left join keeps every row of df1; right join keeps every row of df2
left = pd.merge(df1, df2, on='ID', how='left')
right = pd.merge(df1, df2, on='ID', how='right')
print('Left join:')
print(left)
print('\nRight join:')
print(right)
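
The learning outcomes also mention joining and concatenating, which the cells above do not show. The following is a minimal sketch of `DataFrame.join`, `pd.concat`, and the `indicator` option of `pd.merge`. It reuses `df1` and `df2` from above; `df3` and its names are hypothetical and exist only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Score': [85, 90, 95]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
tagged = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(tagged)

# DataFrame.join aligns on the index, so make ID the index of both frames first
joined = df1.set_index('ID').join(df2.set_index('ID'), how='inner')
print(joined)

# pd.concat stacks frames row-wise; df3 is a hypothetical extra batch of rows
df3 = pd.DataFrame({'ID': [5, 6], 'Name': ['Dana', 'Eli']})
stacked = pd.concat([df1, df3], ignore_index=True)
print(stacked)
```

The `_merge` column is a quick way to audit an outer join, because it shows which rows failed to find a match on the other side.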

## Exercise: Merging Practice
1. Create two DataFrames with a common column and merge them using all four join types.
2. What happens if there are duplicate keys?
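
One way to start exploring question 2 is sketched below. The frames `left_dup` and `right_dup` are hypothetical examples in which ID 2 appears twice on each side; the `validate` argument of `pd.merge` is one option for catching unexpected duplicates.

```python
import pandas as pd

# Hypothetical data: ID 2 is duplicated in both frames
left_dup = pd.DataFrame({'ID': [1, 2, 2], 'Name': ['Alice', 'Bob', 'Bobby']})
right_dup = pd.DataFrame({'ID': [2, 2, 3], 'Score': [85, 88, 90]})

# Every left row with ID 2 pairs with every right row with ID 2: 2 x 2 = 4 rows
dup_merge = pd.merge(left_dup, right_dup, on='ID', how='inner')
print(dup_merge)

# validate raises MergeError when the keys break the expected relationship
try:
    pd.merge(left_dup, right_dup, on='ID', validate='one_to_one')
    validated = True
except pd.errors.MergeError as exc:
    validated = False
    print('Merge rejected:', exc)
```

Try changing `validate` to `'many_to_many'` and compare the row counts with the inputs.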

# 2. GroupBy Operations and Aggregations
GroupBy follows a split-apply-combine pattern: split the data into groups by key, apply a function to each group, and combine the results into a single object.

In [None]:
# Example: grouping and aggregating
data = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'], 'Values': [10, 20, 30, 40]})
# Mean of each remaining column within each category
grouped = data.groupby('Category').mean()
print(grouped)
# Several aggregations at once; the result has two-level column labels
gg = data.groupby('Category').agg({'Values': ['sum', 'mean']})
print(gg)
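
As a complement to the dictionary form above, the sketch below uses named aggregation, which produces flat column names, and `transform`, which returns a result aligned with the original rows. The output column names (`total`, `average`, `largest`, `share_of_category`) are illustrative choices, not part of the original example.

```python
import pandas as pd

data = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'], 'Values': [10, 20, 30, 40]})

# Named aggregation: new_column=(source_column, function)
summary = data.groupby('Category').agg(
    total=('Values', 'sum'),
    average=('Values', 'mean'),
    largest=('Values', 'max'),
)
print(summary)

# transform broadcasts each group's sum back to that group's rows
group_totals = data.groupby('Category')['Values'].transform('sum')
data['share_of_category'] = data['Values'] / group_totals
print(data)
```

Use `agg` when you want one row per group, and `transform` when you want a per-group statistic attached to every original row.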

## Exercise: GroupBy Practice
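1. Load the Titanic dataset with seaborn (`import seaborn as sns`, then `sns.load_dataset('titanic')`). The first call downloads the data, so it needs an internet connection.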
2. Group by `pclass` and calculate the average age and fare.
3. Find the total number of survivors by gender (the `sex` column; survival is recorded in `survived`).

# 3. Time-Series Data Analysis
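Time-series data is ordered by time, and in pandas it is usually indexed by a `DatetimeIndex`. Let's build a monthly series, plot it, and split it into trend, seasonal, and residual components.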

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
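# Create example time-series data: 24 monthly observations
# 'MS' (month start) is used because the 'M' alias is deprecated in pandas 2.2+
np.random.seed(0)  # fixed seed so the example is reproducible
dates = pd.date_range(start='2023-01-01', periods=24, freq='MS')
values = np.random.randint(100, 200, size=24)
ts = pd.Series(values, index=dates)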
ts.plot(title='Monthly Sales')
plt.show()
# Decompose into trend, seasonal, and residual components
# period=12 needs at least two full cycles, i.e. 24 observations
result = seasonal_decompose(ts, model='additive', period=12)
result.plot()
plt.show()
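
Beyond decomposition, everyday time-series handling in pandas often relies on resampling, rolling windows, and date-based slicing. The sketch below illustrates these on a series shaped like the one above; the seed and the quarterly and 3-month windows are arbitrary choices made for the example.

```python
import numpy as np
import pandas as pd

# Same shape as the example above: 24 monthly observations of random data
rng = np.random.default_rng(seed=0)
dates = pd.date_range(start='2023-01-01', periods=24, freq='MS')
ts = pd.Series(rng.integers(100, 200, size=24), index=dates)

# Downsample: monthly values summed into quarterly totals
quarterly = ts.resample('QS').sum()
print(quarterly)

# 3-month rolling mean; the first two entries are NaN (window not yet full)
rolling = ts.rolling(window=3).mean()
print(rolling.head())

# Partial-string indexing selects every observation in a given year
year_2023 = ts.loc['2023']
print(year_2023)
```

A rolling mean is a simple way to smooth short-term noise before looking for a trend.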

## Exercise: Time-Series Practice
1. Create a time series of daily temperatures for one month (random data).
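2. Plot the series and decompose it using `seasonal_decompose`. For daily data a weekly cycle (`period=7`) is a reasonable choice, and one month of data covers the two full cycles the function requires.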

# 4. Mini-Project: Advanced Data Wrangling
1. Download or load two related datasets (e.g., sales and customers).
2. Merge them and perform groupby analysis (e.g., total sales by region).
3. If you have a time column, plot sales over time and decompose the series into trend, seasonal, and residual components.
4. Summarize your findings in markdown.
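
The steps above can be sketched as follows. Everything in this example is a hypothetical stand-in: the `customers` and `sales` frames, their column names, and the values are invented so the code runs without a download. Replace them with your own datasets.

```python
import pandas as pd

# Step 1: hypothetical stand-ins for two related datasets
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'region': ['North', 'South', 'North'],
})
sales = pd.DataFrame({
    'customer_id': [1, 2, 2, 3, 1],
    'date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-11',
                            '2024-02-15', '2024-03-02']),
    'amount': [120.0, 80.0, 60.0, 200.0, 40.0],
})

# Step 2: attach the region to each sale, then total sales by region
# validate checks that each customer_id appears only once in customers
merged = sales.merge(customers, on='customer_id', how='left',
                     validate='many_to_one')
sales_by_region = merged.groupby('region')['amount'].sum()
print(sales_by_region)

# Step 3: monthly sales totals, ready for plotting with monthly.plot()
monthly = merged.set_index('date')['amount'].resample('MS').sum()
print(monthly)
```

Three months of data is far too short for `seasonal_decompose`, which needs at least two full cycles of the chosen period, so use a longer real dataset for the decomposition step.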

## Reflection & Next Steps
- Which merging or groupby operation did you find most useful?
- Try wrangling a dataset from your own field.
- Explore more in the Pandas and statsmodels documentation.

---

## References
- [Pandas Documentation](https://pandas.pydata.org/)
- [Statsmodels Documentation](https://www.statsmodels.org/)