<a href="https://colab.research.google.com/github/c-marq/CAP3321C-Data-Wrangling/blob/main/exercises/chapter-03/3_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 3-1: Create Data Visualizations with Pandas

**CAP3321C - Data Wrangling**

---

## Overview

In this exercise, you'll create various data visualizations using the mortality dataset. You'll practice creating line plots, area plots, bar plots, histograms, density plots, pie charts, and subplots using Pandas plotting methods and method chaining.

**Instructions:**
1. Run the setup cells to load the data
2. Complete each task by writing code in the provided cells
3. Some tasks are pre-filled - just run them and observe
4. Tasks marked with **YOUR CODE** require you to write the code
5. Use **method chaining** instead of creating new DataFrames

**Group Members:**
- Name 1:
- Name 2:
- Name 3:
- Name 4:

---

## Setup: Load the Data and Import Libraries

Run these cells to load the data. Do not modify this section.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Download the data files from GitHub
!wget -q https://raw.githubusercontent.com/c-marq/CAP3321C-Data-Wrangling/main/data/mortality_prepped.pkl
!wget -q https://raw.githubusercontent.com/c-marq/CAP3321C-Data-Wrangling/main/data/mortality_wide.pkl
print("Data files downloaded successfully!")

In [None]:
# Load the long format data
mortality_data = pd.read_pickle('mortality_prepped.pkl')
print("Long format data shape:", mortality_data.shape)
mortality_data.head()

In [None]:
# Load the wide format data
mortality_wide = pd.read_pickle('mortality_wide.pkl')
print("Wide format data shape:", mortality_wide.shape)
mortality_wide.head()

---

## Part 1: Line Plots and Area Plots (Wide Data)

Use the `mortality_wide` DataFrame for Tasks 3-4.

### Task 3: Create a Line Plot for One Age Group (YOUR CODE)

Use the wide data to create a line plot for just the data in the **15-19 age group**. Include an appropriate title on the plot and remove the legend.

**Hint:** Use method chaining with `.set_index()` and `.plot()`

**Example syntax:**
```python
df.set_index('Year')['column_name'].plot(
    title='Your Title',
    legend=False
)
```

**Expected output:** A line plot showing death rates for 15-19 year olds from 1900 to 2000, with a title and no legend
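
As a rough illustration of the chaining pattern in the hint, the sketch below uses a small made-up DataFrame. The `demo` name, its column names, and its values are assumptions for illustration only, not the mortality data. Selecting one column after `.set_index()` returns a Series, so the plot contains a single line.

```python
import pandas as pd

# Hypothetical wide-format data: one row per year, one column per group
demo = pd.DataFrame({
    'Year': [2000, 2001, 2002, 2003],
    'GroupA': [10.0, 9.5, 9.1, 8.8],
    'GroupB': [20.0, 19.0, 18.5, 18.0],
})

# set_index puts Year on the x-axis; selecting one column yields a Series
series = demo.set_index('Year')['GroupA']

# Series.plot() draws a line plot by default and returns a Matplotlib Axes
ax = series.plot(title='GroupA Rate by Year', legend=False)
```

Because the whole expression is one chain, no intermediate DataFrame needs to be stored before plotting.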

In [None]:
# YOUR CODE HERE - create line plot for 15-19 age group


### Task 4: Create an Area Plot for All Age Groups (YOUR CODE)

Use the wide data to create an area plot for **all age groups**, and reverse the order of the items in the legend to see how that looks.

**Hint:**
- Use `.plot.area()` for area plots
- To reverse the legend, after plotting run:
```python
handles, labels = plt.gca().get_legend_handles_labels()
plt.legend(reversed(handles), reversed(labels))
```

**Example syntax:**
```python
df.set_index('Year').plot.area(title='Your Title')
```

**Expected output:** A stacked area plot showing all age groups with reversed legend order
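
The legend-reversal step can be sketched on toy data as follows. The `demo` DataFrame and its group names are made up for illustration. This version calls the methods on the Axes object returned by `.plot.area()` rather than on `plt.gca()`, which should be equivalent when only one plot is active.

```python
import pandas as pd

# Hypothetical wide-format data with three groups
demo = pd.DataFrame({
    'Year': [2000, 2001, 2002],
    'GroupA': [10.0, 9.5, 9.1],
    'GroupB': [20.0, 19.0, 18.5],
    'GroupC': [5.0, 4.8, 4.6],
})

# Columns are stacked bottom-to-top in column order: GroupA, GroupB, GroupC
ax = demo.set_index('Year').plot.area(title='Stacked Rates by Year')

# Fetch the legend entries in plotting order, then rebuild the legend reversed
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1])

# Read the legend text back to confirm the new order
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```

Reversing the legend makes its top-to-bottom order match the top-to-bottom order of the stacked areas.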

In [None]:
# YOUR CODE HERE - create area plot for all age groups with reversed legend


---

## Part 2: Bar Plots (Wide and Long Data)

Tasks 5-7 focus on creating and customizing bar plots.

### Task 5: Create a Bar Plot for Year 1900 (YOUR CODE)

Use the wide data to create a bar plot for all age groups that shows the mortality rates for just the **year 1900**. Note the values on the y-axis.

**Hint:**
- First filter for Year == 1900 using `.query()`
- Set the index and select the age group columns
- Use `.plot.bar()`

**Example syntax:**
```python
df.query('Year == 1900').set_index('Year').plot.bar()
```

**Expected output:** A bar plot showing mortality rates for all age groups in 1900
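
To see why this chain produces one bar per column, here is a small sketch on made-up data (the `demo` DataFrame and its values are illustrative assumptions). Filtering to one year leaves a single row, and `.plot.bar()` draws one bar for each remaining column.

```python
import pandas as pd

# Hypothetical wide-format data: two years, three groups
demo = pd.DataFrame({
    'Year': [2000, 2001],
    'GroupA': [10.0, 9.5],
    'GroupB': [20.0, 19.0],
    'GroupC': [5.0, 4.8],
})

# Keep only one year, then move Year into the index so it is not plotted
one_year = demo.query('Year == 2001').set_index('Year')
print(one_year)

# One row and three columns gives three bars, one per column
ax = one_year.plot.bar(title='Rates in 2001')
```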

In [None]:
# YOUR CODE HERE - create bar plot for year 1900


### Task 6: Create a Bar Plot for Year 2000 (YOUR CODE)

Change the bar plot you created in Task 5 to show the mortality rates for the **year 2000**. Note how the values on the y-axis change. Then, add an appropriate title to the plot and remove the label for the x-axis.

**Hint:**
- Use `xlabel=''` to remove the x-axis label
- Compare the y-axis scale to Task 5

**Example syntax:**
```python
df.query('Year == 2000').set_index('Year').plot.bar(
    title='Your Title',
    xlabel=''
)
```

**Expected output:** A bar plot for year 2000 with title and no x-axis label

In [None]:
# YOUR CODE HERE - create bar plot for year 2000 with title, no x-label


### Task 7: Create a Bar Plot Using Long Data with Pivot (YOUR CODE)

Use the **long data** (`mortality_data`) to create a bar plot like the one in Task 6. To do that, you'll need to chain the `pivot()` method to the `query()` method. Compare the two bar charts, and then make improvements so the plot that uses the long data is easier to read.

**Hint:**
- Use `.pivot(index='col1', columns='col2', values='col3')` to reshape
- Chain methods: query → pivot → plot
- Consider using `rot=0` to make x-axis labels horizontal
- Consider using `figsize=(10, 6)` for better readability

**Example syntax:**
```python
df.query('Year == 2000').pivot(
    index='Year',
    columns='AgeGroup',
    values='DeathRate'
).plot.bar(title='Title', rot=0, figsize=(10, 6))
```

**Expected output:** A bar plot created from long data that matches Task 6 but with improved readability
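
The reshaping step is easier to follow on a tiny example. The sketch below uses a made-up long-format DataFrame (`long_demo`, with hypothetical columns `Group` and `Rate`) to show that `query` followed by `pivot` yields the same one-row wide layout used in the earlier bar plots.

```python
import pandas as pd

# Hypothetical long-format data: one row per (year, group) pair
long_demo = pd.DataFrame({
    'Year': [2000, 2000, 2001, 2001],
    'Group': ['A', 'B', 'A', 'B'],
    'Rate': [10.0, 20.0, 9.5, 19.0],
})

# Filter to one year, then spread the groups out into columns
wide_2001 = long_demo.query('Year == 2001').pivot(
    index='Year',
    columns='Group',
    values='Rate'
)
print(wide_2001)

# The pivoted result plots exactly like wide data
ax = wide_2001.plot.bar(title='Rates in 2001', rot=0, xlabel='')
```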

In [None]:
# YOUR CODE HERE - create bar plot using long data with pivot


---

## Part 3: Histograms and Density Plots (Long Data)

Use the `mortality_data` DataFrame for Tasks 8-9.

### Task 8: Create a Histogram (YOUR CODE)

Use the long data to create a histogram that shows the frequency of the death rates in the **default number of bins**. Then, change the number of bins to **15** to see how this changes the histogram. Does this make it easier to determine the frequency at various data points?

**Hint:**
- Use `.plot.hist()` for histograms
- Use the `bins` parameter to change the number of bins

**Example syntax:**
```python
df['DeathRate'].plot.hist(bins=15, title='Title')
```

**Expected output:** Two histograms - one with default bins, one with 15 bins
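
To compare the effect of `bins` outside a notebook, where both plots would otherwise land on the same axes, one option is to draw them side by side. The sketch below uses a made-up Series named `values`, and counting the bars is just a quick way to confirm how many bins were drawn.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical values, skewed so that the bin count matters
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 9], name='Rate')

# One figure with two axes so the histograms do not overlap
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# pandas uses 10 bins when the bins parameter is omitted
values.plot.hist(ax=ax_left, title='Default bins')
values.plot.hist(ax=ax_right, bins=15, title='15 bins')

# Each bin is drawn as one rectangle, including empty bins
print(len(ax_left.patches), len(ax_right.patches))
```

More bins show finer detail but can leave gaps where no values fall, so the best count depends on the data.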

In [None]:
# YOUR CODE HERE - create histogram with default bins


In [None]:
# YOUR CODE HERE - create histogram with 15 bins


### Task 9: Create a Density Plot (YOUR CODE)

Use the long data to create a density plot that shows the distribution of death rates in the **year 1900**. Include a title and grids in the plot to make the data easier to read.

**Hint:**
- First filter for Year == 1900 using `.query()`
- Use `.plot.kde()` or `.plot.density()` for density plots
- Use `grid=True` to add grid lines

**Example syntax:**
```python
df.query('Year == 1900')['DeathRate'].plot.kde(
    title='Title',
    grid=True
)
```

**Expected output:** A density plot for 1900 death rates with title and grid lines

In [None]:
# YOUR CODE HERE - create density plot for year 1900


---

## Part 4: Pie Charts (Long Data)

Use the `mortality_data` DataFrame for Task 10.

### Task 10: Create a Pie Plot (YOUR CODE)

Use the long data to create a pie plot that shows the sum of the death rates for the years **1900, 1925, 1950, 1975, and 2000**.

**Hint:**
- Filter for specific years using `.query()` with `in` operator
- Group by Year and sum the death rates
- Use `.plot.pie()` for pie charts
- Consider using `autopct='%1.1f%%'` to show percentages
- Consider using `ylabel=''` to remove the y-axis label

**Example syntax:**
```python
df.query('Year in [1900, 1925, 1950, 1975, 2000]').groupby('Year')['DeathRate'].sum().plot.pie(
    title='Title',
    autopct='%1.1f%%',
    ylabel=''
)
```

**Expected output:** A pie chart showing death rate proportions for the 5 selected years
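
The filter-group-sum chain can be checked on toy data before plotting. The sketch below uses a made-up long-format DataFrame (`long_demo`, with hypothetical columns `Group` and `Rate`) and computes the per-year totals that become the pie slices.

```python
import pandas as pd

# Hypothetical long-format data: three years, two groups each
long_demo = pd.DataFrame({
    'Year': [1990, 1990, 1995, 1995, 2000, 2000],
    'Group': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Rate': [30.0, 10.0, 20.0, 10.0, 25.0, 5.0],
})

# Keep two of the three years, then total the rates within each year
totals = (
    long_demo
    .query('Year in [1990, 2000]')
    .groupby('Year')['Rate']
    .sum()
)
print(totals)

# Each year's total becomes one slice; autopct labels the percentages
ax = totals.plot.pie(title='Share of Summed Rates', autopct='%1.1f%%', ylabel='')
```

Inspecting `totals` first makes it clear that the pie shows each year's share of the combined sum, not the rates themselves.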

In [None]:
# YOUR CODE HERE - create pie plot for selected years


---

## Part 5: Subplots (Long Data)

Use the `mortality_data` DataFrame for Task 11.

### Task 11: Create Subplots (PRE-FILLED)

Create a plot with four subplots in two rows and two columns. The subplots should be horizontal bar charts that show the child mortality rates for each age group for the years **1900, 1925, 1975, and 2000**. Format the subplots so they're easy to read.

This task is more complex, so it's completed for you. **Study the code carefully** to understand how subplots work - you may see this on assessments!

In [None]:
# PRE-FILLED: Create 2x2 subplots with horizontal bar charts

# Define the years we want to visualize
years = [1900, 1925, 1975, 2000]

# Create a 2x2 subplot figure
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Flatten axes array for easier iteration
axes = axes.flatten()

# Create a horizontal bar chart for each year
for ax, year in zip(axes, years):
    # Filter data for this year and create the plot
    (
        mortality_data
        .query(f'Year == {year}')
        .set_index('AgeGroup')['DeathRate']
        .plot.barh(
            ax=ax,
            title=f'Child Mortality Rates - {year}',
            xlabel='Deaths per 100K',
            color='steelblue'
        )
    )
    ax.set_ylabel('')  # Remove y-axis label for cleaner look

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()

---

## Bonus Challenge (Optional)

If your group finishes early, try this challenge!

### Bonus: Compare 1900 and 2000

Create a side-by-side comparison showing how mortality rates changed between 1900 and 2000 for all age groups. Use any visualization type you think best tells the story.

Consider:
- What changed the most?
- What visualization makes this comparison clearest?
- Can you calculate the percent decrease for each age group?
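
For the percent decrease question, one possible approach is sketched below on invented numbers. The group names and rates are made up and do not come from the mortality data. The calculation is (old - new) / old * 100, applied to every column at once.

```python
import pandas as pd

# Made-up rates for two years; NOT the real mortality values
demo = pd.DataFrame({
    'Year': [1900, 2000],
    'GroupA': [200.0, 50.0],
    'GroupB': [80.0, 20.0],
    'GroupC': [100.0, 40.0],
}).set_index('Year')

# Subtracting the two rows aligns on column names, so each group is handled
pct_decrease = (demo.loc[1900] - demo.loc[2000]) / demo.loc[1900] * 100
print(pct_decrease.round(1))

# A horizontal bar chart is one way to compare the decreases
ax = pct_decrease.plot.barh(title='Percent Decrease, 1900 to 2000')
```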

In [None]:
# BONUS: Your creative visualization here


---

## Summary

In this exercise, you practiced creating data visualizations:

**Tasks you completed:**
- Task 3: Line plot for single column with `.plot()` and formatting
- Task 4: Area plot for multiple columns with `.plot.area()`
- Task 5: Bar plot from wide data with `.plot.bar()`
- Task 6: Customizing bar plots with titles and axis labels
- Task 7: Bar plot using `.pivot()` with long data
- Task 8: Histograms with different numbers of bins using `.plot.hist()`
- Task 9: Density plots with `.plot.kde()`
- Task 10: Pie charts with `.plot.pie()`

**Tasks that were pre-filled:**
- Task 11: Subplots with `plt.subplots()` - study this for future reference!

**Key Takeaways:**
- Use method chaining instead of creating new DataFrames
- Different plot types are suited for different data stories
- Formatting (titles, labels, grids) makes visualizations more readable
- Wide vs long data formats affect how you structure your plotting code

---

**Submission:** Save this notebook and submit to Canvas before the end of class.