<a href="https://colab.research.google.com/github/c-marq/CAP3321C-Data-Wrangling/blob/main/exercises/chapter-02/exercise_2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 2-1: Review the Mortality Notebook

**CAP3321C - Data Wrangling**

---

## Overview

In this exercise, you'll run the cells in the Mortality Notebook, which includes all the examples in Chapter 2 plus a few variations. As you run each cell, make sure you understand what it does. To check your understanding, try changing some of the parameters and see how the results change.

**Instructions:**
1. Run each cell in order
2. Observe the output
3. Answer the reflection questions in the markdown cells provided

**Group Members:**
- Name 1:
- Name 2:
- Name 3:
- Name 4:

---

## Part 1: Get the Data

**Steps 1-4:** Import Pandas, load data from the CDC website, and save/restore the DataFrame.

In [None]:
import pandas as pd

### Read a CSV file from a website into a DataFrame

In [None]:
mortality_url = "https://data.cdc.gov/api/views/v6ab-adf5/rows.csv?accessType=DOWNLOAD"
mortality_data = pd.read_csv(mortality_url)

### Save and restore the DataFrame

In [None]:
mortality_data.to_pickle('mortality_data.pkl')

In [None]:
mortality_data = pd.read_pickle('mortality_data.pkl')
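
As a small offline illustration of the save/restore step, the sketch below round-trips a toy DataFrame through a pickle file. The toy values, the `toy` and `restored` names, and the temporary file path are assumptions made up for this example; they are not part of the CDC dataset.

```python
import os
import tempfile

import pandas as pd

# Toy data invented for illustration (not the CDC values)
toy = pd.DataFrame({
    "Year": [1900, 1901],
    "Age Group": ["1-4 Years", "1-4 Years"],
    "Death Rate": [10.5, 9.75],
})

# Write to a temporary folder so no file is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# Compare the restored DataFrame with the original
print(restored.equals(toy))
print(restored.dtypes)
```

You can run this without an internet connection, which makes it a convenient place to experiment before answering the question below.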

### 🤔 Reflection: Getting Data

**Q1:** Why might you want to save a DataFrame to a pickle file instead of reading from the URL every time?

*Your answer:*


---

## Part 2: Examine and Clean the Data

**Steps 5-10:** Display the data using various techniques, examine attributes, and clean column names.

### Display the data

In [None]:
mortality_data

In [None]:
mortality_data.head()

In [None]:
mortality_data.tail(3)

In [None]:
display(mortality_data)

In [None]:
with pd.option_context(
    'display.max_rows', 5,
    'display.max_columns', None):
    display(mortality_data)
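
The `with pd.option_context(...)` cell above changes display options only while the indented block runs. The following sketch, which uses only `pd.get_option()` and made-up variable names, is one way to check that behavior for yourself.

```python
import pandas as pd

# Remember the current setting before the with block
default_rows = pd.get_option("display.max_rows")

with pd.option_context("display.max_rows", 5, "display.max_columns", None):
    # Inside the block, the temporary settings are in effect
    inside_rows = pd.get_option("display.max_rows")
    inside_cols = pd.get_option("display.max_columns")

# After the block, the earlier setting is back
after_rows = pd.get_option("display.max_rows")

print("before:", default_rows)
print("inside:", inside_rows, inside_cols)
print("after: ", after_rows)
```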

### Display the DataFrame attributes

In [None]:
mortality_data.values

In [None]:
print("Index:  ", mortality_data.index)
print("Columns:", mortality_data.columns)
print("Size:   ", mortality_data.size)
print("Shape:  ", mortality_data.shape)

### 🤔 Reflection: DataFrame Attributes

**Q2:** Looking at the output above, what is the difference between `size` and `shape`?

*Your answer:*


### Use the columns attribute to change the column names

In [None]:
mortality_data.columns = mortality_data.columns.str.replace(" ", "")

In [None]:
print(mortality_data.columns)

In [None]:
mortality_data.head()

### 🤔 Reflection: Column Names

**Q3:** Why did we remove the spaces from the column names? What problems might spaces cause?

*Your answer:*


### Use the info(), nunique(), and describe() methods

In [None]:
mortality_data.info()

In [None]:
mortality_data.info(memory_usage='deep')

In [None]:
mortality_data.nunique()

In [None]:
mortality_data.describe()

In [None]:
mortality_data.describe().T

### 🤔 Reflection: Data Exploration Methods

**Q4:** What does the `.T` attribute do to the describe() output? When might this be useful?

*Your answer:*

**Q5:** Looking at the `nunique()` output, how many different age groups are in the dataset?

*Your answer:*


### Save and restore the cleaned DataFrame

In [None]:
mortality_data.to_pickle('mortality_cleaned.pkl')

In [None]:
mortality_data = pd.read_pickle('mortality_cleaned.pkl')
mortality_data.head()

---

## Part 3: Access the Data

**Steps 11-14:** Access columns, rows, and subsets using different methods.

### How to access columns

In [None]:
mortality_data.DeathRate.head(2)

In [None]:
type(mortality_data.DeathRate)

In [None]:
mortality_data['DeathRate'].head(2)

In [None]:
type(mortality_data['DeathRate'])

In [None]:
mortality_data[['Year','DeathRate']].head(2)

In [None]:
type(mortality_data[['Year','DeathRate']])

### 🤔 Reflection: Accessing Columns

**Q6:** What is the difference in the output when you use single brackets `['DeathRate']` versus double brackets `[['DeathRate']]`? (Hint: Look at the `type()` output)

*Your answer:*


### How to access rows

In [None]:
mortality_data.query('Year==1900')

In [None]:
mortality_data.query('Year == 2000 and AgeGroup != "1-4 Years"')

In [None]:
mortality_data.query('Year == 1900 or Year == 2000').head()

In [None]:
# use backticks if a column name contains spaces
# mortality_data.query('Year == 2000 and `Age Group` != "1-4 Years"')
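
Because the column names in `mortality_data` no longer contain spaces, the commented-out line above can't be run as is. The sketch below shows the same backtick syntax on a toy DataFrame whose column names still have spaces. The toy values and the `toy` and `result` names are invented for illustration.

```python
import pandas as pd

# Toy data with spaces in the column names (values are invented)
toy = pd.DataFrame({
    "Year": [2000, 2000, 1900],
    "Age Group": ["1-4 Years", "5-9 Years", "1-4 Years"],
    "Death Rate": [32.4, 16.4, 1983.8],
})

# Backticks let query() refer to a column name that contains a space
result = toy.query('Year == 2000 and `Age Group` != "1-4 Years"')
print(result)
```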

### How to access a subset of rows and columns

In [None]:
mortality_data.query('Year == 1900').DeathRate.head()

In [None]:
mortality_data.query('Year == 1900')['DeathRate'].head()

In [None]:
mortality_data.query('Year == 1900')[['DeathRate']].head()

In [None]:
mortality_data.query('Year == 1900')[['AgeGroup','DeathRate']].head()

### How to access rows and columns with the loc[] accessor

In [None]:
mortality_data.loc[0]

In [None]:
mortality_data.loc[[0]]

In [None]:
mortality_data.loc[0:2]

In [None]:
mortality_data.loc[[0,2,4]]

In [None]:
mortality_data.loc[:,'Year']

In [None]:
mortality_data.loc[:,['Year','DeathRate']]

In [None]:
mortality_data.loc[:,'Year':'DeathRate']

In [None]:
mortality_data.loc[0:2, 'Year':'DeathRate']

In [None]:
mortality_data.loc[[0,2,4], ['Year','DeathRate']]

### How to access rows and columns with the iloc[] accessor

In [None]:
mortality_data.iloc[0]

In [None]:
mortality_data.iloc[[0]]

In [None]:
mortality_data.iloc[0:2]

In [None]:
mortality_data.iloc[[0,2,4]]

In [None]:
mortality_data.iloc[:,0]

In [None]:
mortality_data.iloc[:,[0,2]]

In [None]:
mortality_data.iloc[:,0:2]

In [None]:
mortality_data.iloc[0:2, 0:2]

In [None]:
mortality_data.iloc[[0,2,4], [0,2]]
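
In `mortality_data`, the index labels happen to be the same integers as the row positions, which can make `loc[]` and `iloc[]` look alike. As a hedged experiment, the sketch below uses a toy DataFrame with letters as index labels so that labels and positions are easy to tell apart. The data and variable names are made up for this example.

```python
import pandas as pd

# Toy data with string labels in the index (values are invented)
toy = pd.DataFrame(
    {"Year": [1900, 1901, 1902, 1903], "DeathRate": [4.0, 3.0, 2.0, 1.0]},
    index=["a", "b", "c", "d"],
)

by_label = toy.loc["a":"c"]    # slice by index labels
by_position = toy.iloc[0:2]    # slice by integer positions

print(by_label)
print(by_position)
```

Compare the number of rows in each result before you answer the questions below.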

### 🤔 Reflection: loc[] vs iloc[]

**Q7:** What is the key difference between `loc[]` and `iloc[]`? When would you use each?

*Your answer:*

**Q8:** Notice that `loc[0:2]` returns 3 rows, but `iloc[0:2]` returns 2 rows. Why is this?

*Your answer:*


---

## Part 4: Prepare the Data

**Steps 15-18:** Sort data, apply statistics, perform calculations, and modify string data.

### Sort the data

In [None]:
mortality_data.sort_values('DeathRate').head()

In [None]:
mortality_data.sort_values('DeathRate', ascending=False).head()

In [None]:
mortality_data.sort_values(['AgeGroup','Year']).head(10)

### 🤔 Reflection: Sorting

**Q9:** Looking at the first `sort_values()` output, which age group and year had the lowest death rate?

*Your answer:*


### Apply statistical methods

In [None]:
mortality_data.DeathRate.mean()

In [None]:
mortality_data['DeathRate'].mean()

In [None]:
mortality_data[['Year','DeathRate']].mean()

In [None]:
mortality_data[['Year','DeathRate']].median()

In [None]:
mortality_data[['Year','DeathRate']].mode()

### Use Python for column arithmetic

In [None]:
mortality_data['MeanCentered'] = mortality_data.DeathRate - mortality_data.DeathRate.mean()

In [None]:
mortality_data.head()

In [None]:
mortality_data['DeathRate'] = mortality_data.DeathRate / 100000

In [None]:
mortality_data.head()

### 🤔 Reflection: Column Arithmetic

**Q10:** What does "mean centered" mean? Why might this be useful in data analysis?

*Your answer:*


### Modify the string data in a column

In [None]:
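# assign the result back to the column: calling replace(..., inplace=True) on
# mortality_data.AgeGroup may not update the DataFrame in newer versions of Pandas
mortality_data['AgeGroup'] = mortality_data.AgeGroup.replace(
    {'1-4 Years':'01-04 Years','5-9 Years':'05-09 Years'})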

In [None]:
mortality_data

### Save and restore the prepared DataFrame

In [None]:
mortality_data.to_pickle('mortality_prepped.pkl')

In [None]:
mortality_data = pd.read_pickle('mortality_prepped.pkl')
mortality_data.head()

---

## Part 5: Shape the Data

**Steps 19-23:** Set indexes, pivot data, and melt data.

### Set and use an index

In [None]:
mortality_data.head(2)

In [None]:
mortality_data = mortality_data.set_index('Year')
mortality_data.head(2)

In [None]:
mortality_data.reset_index(inplace=True)

In [None]:
mortality_data = mortality_data.set_index(
    ['Year','AgeGroup'],verify_integrity=True)
mortality_data.head(2)

In [None]:
mortality_data.reset_index(inplace=True)

In [None]:
mortality_data.head(2)
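
The `verify_integrity=True` argument used above asks Pandas to check that the new index has no duplicate values. The sketch below tries that check on a toy DataFrame, first with `Year` alone and then with `Year` and `AgeGroup` together. The toy values and the `year_is_unique` flag are assumptions made up for this example.

```python
import pandas as pd

# Toy data: the year 1900 appears twice (values are invented)
toy = pd.DataFrame({
    "Year": [1900, 1900, 1901],
    "AgeGroup": ["01-04 Years", "05-09 Years", "01-04 Years"],
    "DeathRate": [3.0, 2.0, 1.0],
})

# Year alone contains duplicates, so the integrity check raises an error
try:
    toy.set_index("Year", verify_integrity=True)
    year_is_unique = True
except ValueError as err:
    year_is_unique = False
    print("Rejected:", err)

# Each Year/AgeGroup pair occurs once, so this index passes the check
indexed = toy.set_index(["Year", "AgeGroup"], verify_integrity=True)
print(indexed)
```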

### 🤔 Reflection: Indexes

**Q11:** Why must the index be reset before a new one can be set?

*Your answer:*


### Pivot the data

In [None]:
mortality_wide = mortality_data.pivot(
    index="Year",columns="AgeGroup")
mortality_wide.head(3)

In [None]:
mortality_wide = mortality_data.pivot(
    index="Year",columns="AgeGroup",values="DeathRate")
mortality_wide.head(3)

### 🤔 Reflection: Pivoting

**Q12:** What is the difference between the two pivot outputs above? What does the `values` parameter do?

*Your answer:*


### Melt the data

In [None]:
mortality_wide.to_excel('mortality_wide.xlsx')

In [None]:
mortality_wide = pd.read_excel('mortality_wide.xlsx')
mortality_wide.head(4)

In [None]:
mortality_long = mortality_wide.melt(
    id_vars = 'Year',
    value_vars=['01-04 Years','05-09 Years'],
    var_name ='AgeGroup',
    value_name='DeathRate')
display(mortality_long.head(4))
with pd.option_context('display.max_rows', 4):
    display(mortality_long)
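
To see pivoting and melting side by side without the Excel round trip, here is a minimal sketch on a toy DataFrame. It uses `reset_index()` in place of saving to and reading from Excel, which is an assumption of this example rather than the notebook's own step. The toy values and variable names are invented.

```python
import pandas as pd

# Toy long-format data (values are invented)
long_df = pd.DataFrame({
    "Year": [1900, 1900, 1901, 1901],
    "AgeGroup": ["01-04 Years", "05-09 Years", "01-04 Years", "05-09 Years"],
    "DeathRate": [4.0, 2.0, 3.0, 1.0],
})

# Long to wide: one row per year, one column per age group
wide_df = long_df.pivot(index="Year", columns="AgeGroup", values="DeathRate")

# Wide to long: turn Year back into a column, then melt the age group columns
back_to_long = wide_df.reset_index().melt(
    id_vars="Year",
    value_vars=["01-04 Years", "05-09 Years"],
    var_name="AgeGroup",
    value_name="DeathRate",
)

print(wide_df)
print(back_to_long)
```

The melted rows may come back in a different order than the original, so sort them before comparing the two long DataFrames.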

### 🤔 Reflection: Pivot vs Melt

**Q13:** Pivot and melt are opposites. In your own words, what does pivot do? What does melt do?

*Your answer:*


### Save and restore the wide DataFrame

In [None]:
mortality_wide.to_pickle('mortality_wide.pkl')

In [None]:
mortality_wide = pd.read_pickle('mortality_wide.pkl')
mortality_wide.head()

---

## Part 6: Analyze the Data

**Steps 24-25:** Group and aggregate the data.

### Group the data

In [None]:
mortality_data.head()

In [None]:
mortality_data.groupby('AgeGroup').mean()

In [None]:
mortality_data.groupby('Year')['DeathRate'].median().head(4)

In [None]:
mortality_data.groupby(['Year','AgeGroup']).count().head()

In [None]:
mortality_data.groupby('AgeGroup')['DeathRate'].describe()

### Aggregate the data

In [None]:
mortality_data.groupby('AgeGroup').agg(['mean','median'])

In [None]:
mortality_data.groupby('AgeGroup')['DeathRate'] \
    .agg(['mean','median','std','nunique'])

In [None]:
mortality_data.groupby('Year')['DeathRate'] \
    .agg(['mean','median','std','min','max','var','nunique'])
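
If you want to experiment with grouping on something smaller than the full dataset, the sketch below applies one statistical method and then a list of methods to a toy DataFrame. The toy values and the `grouped`, `single`, and `several` names are made up for this example.

```python
import pandas as pd

# Toy data with two age groups (values are invented)
toy = pd.DataFrame({
    "AgeGroup": ["01-04 Years", "01-04 Years", "05-09 Years", "05-09 Years"],
    "DeathRate": [4.0, 2.0, 3.0, 1.0],
})

grouped = toy.groupby("AgeGroup")["DeathRate"]

single = grouped.mean()                            # one method
several = grouped.agg(["mean", "median", "max"])   # a list of methods

print(type(single).__name__)
print(type(several).__name__)
print(several)
```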

### 🤔 Reflection: Grouping and Aggregation

**Q14:** What is the difference between using `.mean()` directly on a groupby versus using `.agg(['mean'])`?

*Your answer:*

**Q15:** Looking at the grouped data by AgeGroup, which age group has the highest mean death rate?

*Your answer:*


---

## Part 7: Visualize the Data

**Step 25:** Create basic visualizations from the grouped data.

In [None]:
mortality_data.pivot(index='Year',columns='AgeGroup')['DeathRate'].plot()

In [None]:
mortality_data.groupby('AgeGroup')['DeathRate'] \
    .agg(['mean','median','std']).plot.barh()

### 🤔 Reflection: Visualization

**Q16:** Looking at the line plot, what overall trend do you see in death rates over time? Why do you think this is?

*Your answer:*


---

## Summary

In this exercise, you explored the key Pandas operations for data analysis:

1. **Getting data** - Reading from URLs and files, saving/restoring with pickle
2. **Examining data** - head(), tail(), info(), describe(), nunique()
3. **Accessing data** - Column selection, row filtering, loc[], iloc[]
4. **Preparing data** - Sorting, statistics, column arithmetic, string manipulation
5. **Shaping data** - Setting indexes, pivoting, melting
6. **Analyzing data** - Grouping and aggregation
7. **Visualizing data** - Basic plots from DataFrames

---

### 🤔 Final Reflection

**Q17:** Which Pandas operation or concept from this exercise do you think will be most useful in your future data work? Why?

*Your answer:*

**Q18:** What concept from this exercise would you like to learn more about?

*Your answer:*
