<a href="https://colab.research.google.com/github/c-marq/CAP3321C-Data-Wrangling/blob/main/exercises/chapter-02/exercise_2_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 2-2: Write Your Own Code for the Mortality Notebook

**CAP3321C - Data Wrangling**

---

## Overview

In this exercise, you'll write and test your own code for a Mortality dataset that's slightly different from the one in Exercise 2-1. You can use the book or the demo notebook as a guide.

**Instructions:**
1. Run the setup cells to load the data
2. Complete each task by writing code in the provided cells
3. Some tasks are pre-filled - just run them and observe
4. Tasks marked with **YOUR CODE** require you to write the code

**Group Members:**
- Name 1:
- Name 2:
- Name 3:
- Name 4:

---

## Setup: Get the Long and Wide DataFrames

Run these cells to load the data. Do not modify this section.

In [None]:
import pandas as pd

In [None]:
# Load the long format data
mortality_data = pd.read_pickle('mortality_data.pkl')
mortality_data.head()

In [None]:
# Load the wide format data
mortality_wide = pd.read_pickle('mortality_wide.pkl')
mortality_wide.head()

---

## Part 1: Work with the Long DataFrame

Use the `mortality_data` DataFrame for Tasks 5-12.

### Task 5: Display the First Five Rows (YOUR CODE)

Display the first five rows of the DataFrame.

**Hint:** Use the `.head()` method

**Expected output:** A table showing 5 rows with columns: Year, AgeGroup, DeathRate

In [None]:
# YOUR CODE HERE


### Task 6: Rename the DeathRate Column (YOUR CODE)

Change the name of the "DeathRate" column to "Deaths/100K" since that's a more accurate description of the data in that column.

**Hint:** Use the `.rename()` method with the `columns` parameter

**Example syntax:**
```python
df.rename(columns={'old_name': 'new_name'}, inplace=True)
```

**Expected output:** When you display the DataFrame, the column should now be named "Deaths/100K"
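
As a side note on the syntax above, here is a minimal sketch of how `.rename()` behaves with and without `inplace=True`. The `toy` DataFrame and its column names are made up for illustration and are not part of the exercise data.

```python
import pandas as pd

# Hypothetical toy data; these column names are illustrative only
toy = pd.DataFrame({'Year': [1900, 1901], 'Rate': [10.5, 9.8]})

# Without inplace=True, rename() returns a new DataFrame
# and leaves the original unchanged
renamed = toy.rename(columns={'Rate': 'RatePer100K'})

print(list(toy.columns))      # original column names
print(list(renamed.columns))  # renamed column names
```

If you skip `inplace=True`, remember to assign the result back to a variable, or the rename will appear to have no effect.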

In [None]:
# YOUR CODE HERE - rename the column


In [None]:
# Verify your rename worked
mortality_data.head()

### Task 7: Access Specific Columns (YOUR CODE)

Access and display the first five rows of ONLY the Year and MeanCentered columns.

**Hint:** Use double brackets to select multiple columns: `df[['col1', 'col2']]`

**Note:** If you don't have a MeanCentered column, first create it by running:
```python
mortality_data['MeanCentered'] = mortality_data['Deaths/100K'] - mortality_data['Deaths/100K'].mean()
```

**Expected output:** A table showing 5 rows with only Year and MeanCentered columns
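
To see what mean centering does, the sketch below applies the same subtraction to a small made-up column. The `toy` DataFrame and the `Value` and `Centered` names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not the mortality dataset
toy = pd.DataFrame({'Value': [2.0, 4.0, 6.0]})

# Subtract the column mean from every value in the column
toy['Centered'] = toy['Value'] - toy['Value'].mean()

print(toy)
# Centered values are distances from the mean, so they add up to zero
print(toy['Centered'].sum())
```

Negative values fall below the column's average and positive values fall above it.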

In [None]:
# Create the MeanCentered column (requires the Task 6 rename to Deaths/100K)
mortality_data['MeanCentered'] = mortality_data['Deaths/100K'] - mortality_data['Deaths/100K'].mean()

In [None]:
# YOUR CODE HERE - select Year and MeanCentered columns, show first 5 rows


### Task 8: Access Rows by Year Range (PRE-FILLED)

Access and display the last six rows of data from 1915 through 1920.

This task is completed for you. Run the cell and observe how query() and tail() work together.

In [None]:
# PRE-FILLED: Access rows from 1915-1920 and show last 6
mortality_data.query('Year >= 1915 and Year <= 1920').tail(6)

### Task 9: Filter Rows and Select Columns (YOUR CODE)

Access and display the Year and Deaths/100K columns for the age group "01-04 Years".

**Hint:** Combine `.query()` to filter rows with column selection `[['col1', 'col2']]`

**Example syntax:**
```python
df.query('column == "value"')[['col1', 'col2']]
```

**Expected output:** A table showing Year and Deaths/100K for only the 01-04 Years age group
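
The pattern in the example syntax can be sketched on made-up data as follows. The `toy` DataFrame, its columns, and the `"Under 5"` value are hypothetical. Note that a string value inside `.query()` needs its own quotes, different from the quotes around the whole expression.

```python
import pandas as pd

# Hypothetical toy data, not the mortality dataset
toy = pd.DataFrame({
    'Year': [2000, 2000, 2001, 2001],
    'Group': ['Under 5', 'Over 5', 'Under 5', 'Over 5'],
    'Value': [1.0, 2.0, 3.0, 4.0],
})

# Filter rows first, then keep only two columns.
# The string value uses double quotes inside the single-quoted expression.
subset = toy.query('Group == "Under 5"')[['Year', 'Value']]

print(subset)
```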

In [None]:
# YOUR CODE HERE - filter for age group "01-04 Years" and select Year and Deaths/100K columns


### Task 10: Sort the Data (YOUR CODE)

Sort the DataFrame by the Deaths/100K column in descending order and display the results. Then, modify the cell so it displays the first three and last three rows of the results.

**Hint:**
- Use `.sort_values('column', ascending=False)` for descending sort
- To show first and last 3 rows, you can use `.head(3)` and `.tail(3)` separately, or use `.iloc[]` with a list of indices

**Expected output:** First show all sorted data, then modify to show only first 3 and last 3 rows
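
The hint mentions showing the first and last rows separately. As an alternative not required by this exercise, `pd.concat()` can stack both slices into one table. The sketch below uses a made-up `toy` DataFrame with hypothetical `Item` and `Score` columns.

```python
import pandas as pd

# Hypothetical toy data, not the mortality dataset
toy = pd.DataFrame({
    'Item': list('abcdefgh'),
    'Score': [5, 3, 8, 1, 9, 2, 7, 4],
})

# Sort from highest to lowest score
ranked = toy.sort_values('Score', ascending=False)

# Stack the top three and bottom three rows into a single DataFrame
ends = pd.concat([ranked.head(3), ranked.tail(3)])

print(ends)
```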

In [None]:
# YOUR CODE HERE - sort by Deaths/100K descending


In [None]:
# YOUR CODE HERE - show first 3 rows of sorted data


In [None]:
# YOUR CODE HERE - show last 3 rows of sorted data


### Task 11: Calculate the Median (PRE-FILLED)

Calculate the median of all of the values in the Deaths/100K column.

This task is completed for you. Run the cell and observe.

In [None]:
# PRE-FILLED: Calculate median of Deaths/100K
mortality_data['Deaths/100K'].median()

### Task 12: Group and Aggregate (YOUR CODE)

Group the data by year, and calculate the sum of the Deaths/100K column.

**Hint:** Use `.groupby('column')['column_to_sum'].sum()`

**Example syntax:**
```python
df.groupby('category_column')['numeric_column'].sum()
```

**Expected output:** A Series showing the sum of Deaths/100K for each year
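
For a sense of what the result looks like, here is a small sketch of the same pattern on made-up data. The `toy` DataFrame and its `Region` and `Sales` columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the mortality dataset
toy = pd.DataFrame({
    'Region': ['N', 'N', 'S', 'S'],
    'Sales': [1, 2, 3, 4],
})

# One total per group; the group labels become the index of the Series
totals = toy.groupby('Region')['Sales'].sum()

print(totals)
print(type(totals))
```

Selecting a single column before calling `.sum()` is what makes the result a Series rather than a DataFrame.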

In [None]:
# YOUR CODE HERE - group by Year and sum Deaths/100K


---

## Part 2: Work with the Wide DataFrame

Use the `mortality_wide` DataFrame for Tasks 13-21.

### Task 13: Display the First Five Rows (PRE-FILLED)

Display the first five rows of the DataFrame.

In [None]:
# PRE-FILLED: Display first 5 rows of wide DataFrame
mortality_wide.head()

### Task 14: Display Index Information (PRE-FILLED)

Display the index information for the DataFrame.

In [None]:
# PRE-FILLED: Display index information
mortality_wide.index

### Task 15: Use describe() with and without .T (YOUR CODE)

Use the describe() method to display statistical information for the numeric columns in the DataFrame. Start by coding this statement without the T property, then add the T property to see how the display changes.

**Hint:**
- First try: `df.describe()`
- Then try: `df.describe().T`

**Expected output:** Two different views of the same statistics - notice how .T transposes the output
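
The effect of `.T` is easiest to see by comparing shapes. The sketch below uses a made-up `toy` DataFrame with hypothetical columns `A` and `B`.

```python
import pandas as pd

# Hypothetical toy data, not the mortality dataset
toy = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})

summary = toy.describe()   # statistics as rows, columns as columns
flipped = summary.T        # columns as rows, statistics as columns

print(summary.shape)
print(flipped.shape)
print(flipped)
```

The transposed view is often easier to read when a DataFrame has many columns, since each column gets its own row.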

In [None]:
# YOUR CODE HERE - describe without .T


In [None]:
# YOUR CODE HERE - describe with .T


### Task 16: Access Specific Columns (PRE-FILLED)

Access and display just the Year and 01-04 Years columns.

In [None]:
# PRE-FILLED: Access Year and 01-04 Years columns
mortality_wide[['Year', '01-04 Years']].head()

### Task 17: Access Rows by Year (PRE-FILLED)

Access and display just the rows for the years from 1915 through 1920.

In [None]:
# PRE-FILLED: Access rows for 1915-1920
mortality_wide.query('Year >= 1915 and Year <= 1920')

### Task 18: Combine Row Filter and Column Selection (YOUR CODE)

Combine Tasks 16 and 17 into a single cell that accesses and displays the Year and 01-04 Years columns for the years from 1915 through 1920.

**Hint:** Chain the query() method with column selection

**Example syntax:**
```python
df.query('condition')[['col1', 'col2']]
```

**Expected output:** A table showing only Year and 01-04 Years columns for years 1915-1920

In [None]:
# YOUR CODE HERE - combine filter and column selection


### Task 19: Multi-Aggregation by Year (YOUR CODE)

Aggregate the data for all numeric columns in each year, and display the mean, median, and sum for those columns.

**Hint:** Use `.groupby()` with `.agg()` and pass a list of aggregation functions

**Example syntax:**
```python
df.groupby('column').agg(['func1', 'func2', 'func3'])
```

**Expected output:** A table showing mean, median, and sum for each numeric column, grouped by Year
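
Passing a list to `.agg()` produces two-level column labels, which can be surprising the first time. The sketch below shows this on a made-up `toy` DataFrame; the columns `A` and `B` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the mortality dataset
toy = pd.DataFrame({
    'Year': [2000, 2000, 2001, 2001],
    'A': [1.0, 3.0, 5.0, 7.0],
    'B': [10.0, 20.0, 30.0, 40.0],
})

# Each numeric column gets one sub-column per aggregation function
stats = toy.groupby('Year').agg(['mean', 'median', 'sum'])

print(stats)
# A single value is addressed with a (column, function) pair
print(stats.loc[2000, ('A', 'mean')])
```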

In [None]:
# YOUR CODE HERE - group by Year and calculate mean, median, sum


### Task 20: Add a Calculated Column (YOUR CODE)

Add a new column to the DataFrame named TotalDeaths. The value of this column should be the sum of the values in each of the age group columns (01-04 Years, 05-09 Years, 10-14 Years, 15-19 Years). Display the DataFrame with the new column.

**Hint:** You can add columns together directly:
```python
df['NewColumn'] = df['col1'] + df['col2'] + df['col3'] + df['col4']
```

**Expected output:** The mortality_wide DataFrame with a new TotalDeaths column
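
Besides adding columns with `+` as in the hint, a row-wise `.sum(axis=1)` over a column selection gives the same totals when no values are missing. This is an alternative sketch on made-up data, not the required approach; the `toy` DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the mortality dataset
toy = pd.DataFrame({
    'A': [1.0, 2.0],
    'B': [10.0, 20.0],
    'C': [100.0, 200.0],
})

# Approach from the hint: add the columns directly
toy['TotalPlus'] = toy['A'] + toy['B'] + toy['C']

# Alternative: sum across the selected columns for each row
toy['TotalSum'] = toy[['A', 'B', 'C']].sum(axis=1)

print(toy)
```

The two approaches differ when data is missing: `+` returns NaN if any operand is NaN, while `.sum()` skips NaN values by default.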

In [None]:
# YOUR CODE HERE - create TotalDeaths column


In [None]:
# Display the DataFrame with the new column
mortality_wide.head()

### Task 21: Create a Line Plot (PRE-FILLED)

Create a line plot that shows the total death rates by year.

This task is completed for you. Run the cell to see the visualization.

In [None]:
# PRE-FILLED: Create line plot of TotalDeaths by Year
mortality_wide.set_index('Year')['TotalDeaths'].plot(
    title='Total Death Rates by Year',
    ylabel='Deaths per 100K',
    xlabel='Year'
)

---

## Summary

In this exercise, you practiced:

**Tasks you wrote yourself:**
- Task 5: Displaying data with `.head()`
- Task 6: Renaming columns with `.rename()`
- Task 7: Selecting multiple columns
- Task 9: Filtering rows and selecting columns
- Task 10: Sorting data with `.sort_values()`
- Task 12: Grouping and aggregating with `.groupby().sum()`
- Task 15: Exploring data with `.describe()` and `.T`
- Task 18: Combining filters and column selection
- Task 19: Multi-aggregation with `.agg()`
- Task 20: Creating calculated columns

**Tasks that were pre-filled:**
- Task 8: Filtering by year range
- Task 11: Calculating median
- Task 13: Displaying rows
- Task 14: Displaying index
- Task 16: Accessing columns
- Task 17: Filtering rows
- Task 21: Creating plots

---

**Submission:** Save this notebook and submit to Canvas before the deadline.