# 📊 **SI 618 - Homework #3: Data Visualization**

**Version:** 2024.09.19.3.CT

---

Welcome to **Homework #3**! In this assignment, you will be exploring data visualization techniques using `Pandas`, `Seaborn`, and `Matplotlib`. Let’s dive into the world of data and transform insights into impactful visuals! 🎨📈

---

### 📋 **Homework Assignment Overview:**

In this homework assignment, we will be exploring 5-year estimates for **Occupation for the Civilian Employed Population (16 Years and Older)** from the **American Community Survey (ACS)** for the State of Michigan, covering the period from **2010 to 2016**.

The teaching team has partially cleaned the data, but you'll need to perform some additional data cleaning and wrangling to prepare it for 📊 **visualization**.

---

### 🧠 **Key Learning Concepts:**

As you work on this assignment, you will gain knowledge and skills in the following areas:

- **"Visualizations with `Seaborn` and `Matplotlib`"**  
  You will use the `Pandas`, `Seaborn`, and `Matplotlib` libraries to create 📊 **visualizations** of the data.

- **"Learning about the `American Community Survey`"**  
  You will learn about the **American Community Survey (ACS)** and its role in collecting comprehensive data about the U.S. population.

- **"Understanding U.S. Census Data Standardization"**  
  You will explore how the **U.S. Census Bureau** standardizes its data collection processes to ensure consistency and reliability across datasets.

---

### 🧹 **What You Need to Do:**

1. Perform any necessary data cleaning and wrangling to prepare the dataset for analysis.
2. Create meaningful visualizations using `Seaborn` and `Matplotlib`.
3. Gain **insights** into the **American Community Survey** and **U.S. Census Data Standardization**.

---

Good luck, and don't forget to leverage the power of data visualization to tell a compelling story! 🎯

### 🏛️ **American Community Survey (ACS):**

The **American Community Survey (ACS)** is an annual survey conducted by the **U.S. Census Bureau**. It collects detailed data on the social, economic, housing, and demographic characteristics of the U.S. population. The survey provides estimates for various characteristics across different geographic levels, including:

- 🌎 **National level**
- 🗺️ **State level**
- 🏘️ **Local level**

The ACS plays a crucial role in helping policymakers, researchers, and public officials understand the needs of communities, allocate resources, and make informed decisions based on comprehensive data.

---

For more detailed information about the ACS, visit the [official ACS page](https://www.census.gov/programs-surveys/acs).

### 🔍 **Census Data Standardization:**

- The U.S. Census Bureau collects data using standardized techniques and predefined geographic "areas" to ensure consistency and comparability across different datasets. The geographic levels are defined as follows:
  - **State**: A primary division of the United States.
  - **County**: A sub-division of a state.  
  - **Tract**: A small, relatively permanent statistical subdivision of a county.
  - **Block Group**: A subdivision of a tract, consisting of a group of blocks.
  - **Block**: The smallest geographic unit used by the Census Bureau.

  For full reference, please review the following link: [Census Geographic Boundaries](https://learn.arcgis.com/en/related-concepts/united-states-census-geography.htm)


For a visual reference, please review the following image:

![image.png](attachment:image.png)

### 🚀 **Your Challenge:**

A major challenge for you in this assignment is to **devise a plan** for tackling each question. While the questions outline the overall goals—and some hints are provided—you'll need to think critically about the following:

- **What data do you need** to answer each question?
- **What preprocessing steps** are required to clean and prepare the data for analysis?

You may need to perform additional data wrangling to get the data into the right form for 📊 **visualization**. Keep in mind that a well-thought-out approach will help you uncover insights and create effective visual representations. 🔍

### 📝 **Grading Rubric:**

Each question is worth the same number of points:

- 🎯 **90-100%**: Correct and complete answer. Code follows [PEP 8](https://www.python.org/dev/peps/pep-0008/) guidelines, includes a well-written interpretation in Markdown, and contains no spelling or grammar errors. Minor issues like formatting or missing names may slightly reduce the score.
- 🛠️ **75-85%**: Mostly correct, with two or fewer noticeable errors or omissions. Minor stylistic flaws in code or interpretation.
- ⚠️ **50-70%**: Significant omissions or errors. Noticeable deviations from PEP 8 or moderate spelling/grammar issues.
- 🚧 **25-45%**: Incomplete or incorrect answer with substantial parts missing.
- 🛑 **0 points**: Question not attempted.

**NOTE**: You are only permitted to use the pandas, Seaborn, and matplotlib libraries for creating 📊 visualizations in this assignment. You may use other libraries for other purposes, but you may not use them to create 📊 visualizations.

<hr>

### 🛠️ **Let’s Start with Our Usual Imports:**

First, let’s import the necessary libraries. You may need to import additional modules depending on how you choose to tackle each question.

##### Additional imports might be needed for specific tasks

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import os

### 📂 **Loading the Data:**

Let’s start by reading in the data from the provided CSV files.

**Note:** The GSIs have provided the code block below to help streamline the data loading process. You can use this as a starting point for your data loading.

In [25]:
# The teaching team has provided some starter code to load the datasets
# Our Expectation is that you follow along and understand the code

def load_data_from_dir(my_path):
    """Load all CSV files from a directory into individual DataFrames.

    PARAMS:
        my_path: str - The path to the directory containing the CSV files

    RETURNS:
        tuple - A tuple containing the DataFrames for each CSV file,
                ordered by sorted filename
    """

    # Create an empty list to store the data frames
    dataframes = []

    # Use list comprehension to get the filenames of all the CSVs in the directory
    filenames = sorted([f for f in os.listdir(my_path) if f.endswith('.csv')])

    # Loop through the filenames and read each CSV into a DataFrame
    for filename in filenames:
        df = pd.read_csv(os.path.join(my_path, filename))
        dataframes.append(df)

    # Return the dataframes as a tuple
    return tuple(dataframes)


# Load the data
# variable assignment known as tuple unpacking
acs_2010, acs_2011, acs_2012, acs_2013, acs_2014, acs_2015, acs_2016 = \
    load_data_from_dir("Occupation_Data")

# You may need to change the my_path argument to the directory where your CSV files are located ^^^


In [26]:
# Create a list of dataframes for easier manipulation
dataframes = [acs_2010, acs_2011, acs_2012, acs_2013, acs_2014, acs_2015, acs_2016]

# Loop through the data frames and print the dims of each

for i, df in enumerate(dataframes):
    print(f"acs_{2010 + i}: {df.shape}")


acs_2010: (2813, 8)
acs_2011: (2813, 8)
acs_2012: (2813, 8)
acs_2013: (2813, 8)
acs_2014: (2813, 8)
acs_2015: (2813, 8)
acs_2016: (2813, 8)


## 🟢 **Q0:**


Please show your understanding of the code above by filling in the blanks below:

- How many CSV files from the American Community Survey were provided? ______
    - **Note: specifically referencing the number of American Community Survey files for Occupation not including the Income CSV**
- Each file is loaded from my _______ directory.
- The data is returned as a _______.

## 🟢 **Q1: Data Preparation for Visualization**

The data format is as follows:

- 📅 **7 DataFrames**: One for each year from 2010 to 2016.
- 📊 **Same Structure**: Each DataFrame contains the same columns but represents different years.
- 🏙️ **Census Tracts**: Each row (observation) represents a census tract in this version of the American Community Survey.

---

### 🔑 **Features (Columns) of the Datasets**:

- 🗺️ **Geography**: The unique identifier for the census tract.
- 🆔 **Geographic Area Name**: The name of the census tract, including its county and state.
- 👥 **Total**: The total number of employed civilians aged 16 and older in the census tract.
- 💼 **Management, Business, Science, and Arts Occupations**: Percentage of employed civilians in this category.
- 🛠️ **Service Occupations**: Percentage of employed civilians in this category.
- 🏢 **Sales and Office Occupations**: Percentage of employed civilians in this category.
- 🏗️ **Natural Resources, Construction, and Maintenance Occupations**: Percentage of employed civilians in this category.
- 🚚 **Production, Transportation, and Material Moving Occupations**: Percentage of employed civilians in this category.

The data in each column (excluding `Geography`, `Geographic Area Name`, and `Total`) is a percentage on a 0-100 scale rather than a proportion (e.g., a value of 24.6 means 24.6%, or 0.246 as a proportion). We suggest that you use these percentages and the `Total` column to calculate the number of employed civilians aged 16 and older in each occupation category for each census tract.

### 🟢 **Steps to Prepare the Data for 📊 Visualization:**

1. **Add a `year` column**:  
   Take the data from each year and add a new column called `year` to each dataframe that contains the year of the data.
   
2. **Combine the dataframes**:  
   Row bind the dataframes together into a single dataframe and call it `combined_df`.  

3. **Calculate employment numbers**:  
   Calculate the number of employed civilians in each occupation category for each census tract by dividing the percentage by 100 and multiplying the result by the `Total` column.

---

#### 💡 **Hints:**

- You can use the `pd.concat()` function to row bind the dataframes together.  
- Check out this helpful Stack Overflow post for an explanation of what **Row Bind** means:  
  [Row Bind](https://stackoverflow.com/questions/14988480/pandas-version-of-rbind)
- Inspect the `dtypes` of the columns in the combined dataframe to ensure you're working with appropriate data types for 📊 visualizations and calculations.
- The dataframe column names are quite long. You might want to **rename** them for easier access. Think back to the Pandas lecture on renaming columns using a dictionary.
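The three steps above can be sketched on two tiny made-up frames. This is a minimal illustration, not the expected solution: the column names `Total` and `Service Occupations` come from the dataset description, while the values and the short names `service_pct` and `service_count` are assumptions chosen for the example.

```python
import pandas as pd

# Toy stand-ins for two yearly ACS frames (all values are made up).
toy_2010 = pd.DataFrame({"Total": [1000, 2000], "Service Occupations": [20.0, 25.0]})
toy_2011 = pd.DataFrame({"Total": [1100, 1900], "Service Occupations": [10.0, 50.0]})

frames = []
for year, frame in zip([2010, 2011], [toy_2010, toy_2011]):
    # assign() returns a copy, so the original frames are left untouched.
    frames.append(frame.assign(year=year))

# Row bind the yearly frames; ignore_index gives a fresh 0..n-1 index.
toy_combined = pd.concat(frames, ignore_index=True)

# Rename a long column with a dictionary (the short name is hypothetical).
toy_combined = toy_combined.rename(columns={"Service Occupations": "service_pct"})

# Percentages are on a 0-100 scale, so divide by 100 before multiplying.
toy_combined["service_count"] = (
    toy_combined["Total"] * toy_combined["service_pct"] / 100
)

print(toy_combined.dtypes)
```

Printing `dtypes` at the end is a quick way to confirm that the percentage columns were read as numbers rather than strings before doing arithmetic on them.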

In [None]:
# Answer Here

### 🟢 **Q2: Filtering and Visualization**

Using the newly created `combined_df` DataFrame, filter the data to include only the following counties: `Wayne`, `Oakland`, `Macomb`, and `Washtenaw`.

---

#### 🔄 **Process**: 

1. Filter the data for the selected counties.
2. Create a 📊 **visualization** that shows the **total number of employed civilians** for the selected counties by year.
   - The **x-axis** should represent the year.
   - The **y-axis** should represent the total number of employed civilians.

---

#### 💡 **Hints**:

- You will need to group the data by year and **sum the total number of employed civilians** for the selected counties.
- Since the data consists of **census tracts** and not individual counties, you need to **aggregate all census tracts** belonging to the selected counties.
- The `Geographic Area Name` column contains a **delimiter (",")** that can be used to split the column into components. Use the `str.split()` function to help with filtering.
- You can use the `groupby()` function to **aggregate the data by year**.
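The hints above can be sketched on a toy frame. The name format `"Census Tract 1, Wayne County, Michigan"` is an assumption about how `Geographic Area Name` is written, and the `county` column is a hypothetical helper; inspect a few real values before relying on either.

```python
import pandas as pd

# Toy rows; the name format is assumed, and the numbers are made up.
toy = pd.DataFrame({
    "Geographic Area Name": [
        "Census Tract 1, Wayne County, Michigan",
        "Census Tract 2, Wayne County, Michigan",
        "Census Tract 3, Kent County, Michigan",
        "Census Tract 1, Wayne County, Michigan",
    ],
    "Total": [100, 200, 300, 150],
    "year": [2010, 2010, 2010, 2011],
})

# expand=True spreads the comma-separated pieces into columns 0, 1, 2.
parts = toy["Geographic Area Name"].str.split(",", expand=True)

# Piece 1 looks like " Wayne County": strip whitespace, drop the suffix.
toy["county"] = parts[1].str.strip().str.replace(" County", "", regex=False)

# Keep one county, then sum its tracts within each year.
wayne_only = toy[toy["county"] == "Wayne"]
totals_by_year = wayne_only.groupby("year")["Total"].sum()
print(totals_by_year)
```

The resulting Series has one value per year, which is the shape a line or bar plot with year on the x-axis needs.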

In [None]:
# Answer Here

### 🟢 **Q3: Occupation Category Visualization for 2016**

Create a 📊 **visualization** that shows the number of employed civilians in each occupation category for the most recent year, **2016**, in the following counties:  
`St. Clair`, `Macomb`, `Oakland`, `Lapeer`.

---

#### 🔄 **Process**:

1. Filter the data for the selected counties.
2. Limit the data to the **year 2016** (using logical indexing).
3. Create a 📊 **visualization** that shows the number of employed civilians in each occupation category for these counties.

---

#### 💡 **Hint**:

- Use **logical indexing** to limit the data to **2016**.
- Filter for the counties of interest using `.isin()`. You can find the documentation for `.isin()` [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html).
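As a small sketch of how the two hints combine, the example below builds one boolean mask from a year condition and an `.isin()` condition. The `county` and `service_count` columns are hypothetical names for columns you would have derived in earlier questions, and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "county": ["Macomb", "Lapeer", "Kent", "Macomb"],
    "year": [2016, 2016, 2016, 2015],
    "service_count": [120.0, 40.0, 75.0, 110.0],
})

counties = ["St. Clair", "Macomb", "Oakland", "Lapeer"]

# Each condition yields a boolean Series; & combines them element-wise.
# The parentheses are needed because & binds more tightly than ==.
mask = (toy["year"] == 2016) & toy["county"].isin(counties)

# Indexing with the mask keeps only the rows where it is True.
subset = toy[mask]
print(subset)
```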

In [None]:
# Answer Here

### 🟢 **Q4: Distribution of Employed Civilians (2013)**

For all counties in the dataset for the year **2013**, create a 📊 **visualization** that shows the distribution of the total number of employed civilians across all counties.

---

#### 🔄 **Steps**:

1. **Create a density plot** to visualize the distribution of the total number of employed civilians.
2. After examining the density plot, you’ll notice some census tracts have very high numbers of employed civilians.  
   - Create a **second density plot** that **excludes census tracts** with more than **1,250 employed civilians**.

---

#### ⚠️ **Note**:

- You may notice **values below 0** in the density plot, which is likely due to the **noise** introduced by the American Community Survey.  
- Here is a link to learn more about this: [Census Noise in Data](https://www2.census.gov/programs-surveys/cbp/technical-documentation/methodology/noise-data.pdf).
- You don't need to filter out these values, but make a comment on whether you should or should not include them in your visualization.

#### 💡 **Hint**: 

- Use the `sns.kdeplot()` function from the Seaborn library to create the density plots.
- For reference, here is a formal definition of a **density plot**: a density plot is a graph that shows the distribution of a continuous numeric variable as a smooth curve. It is often used as an alternative to a histogram.

In [None]:
# Answer Here

### 🟢 **Q5: Census Tracts by County (2016)**

Write code to find the **number of census tracts by county** for the year **2016**. Once you have the counts, choose **5 counties** to visualize that are in **Southeast Michigan**.

---

#### 🔄 **Process**:

1. Find the **number of census tracts** by county for 2016.
2. Choose **5 counties** from the list below:
   - `Wayne`, `Oakland`, `Macomb`, `Washtenaw`, `Livingston`, `St. Clair`, `Lapeer`, `Monroe`, `Genesee`.
3. Create a 📊 **visualization** to show the number of census tracts for the selected counties. In your visualization, sort the selected counties alphabetically.


---

#### 🗺️ **Reference**:

For reference, here’s a map of the State of Michigan:

![image.png](attachment:image.png)

#### 💡 **Hint**: You can use a bar plot to visualize the number of census tracts by county
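Counting rows per group and sorting the result alphabetically can be sketched like this. It assumes each row is one census tract and that a `county` column (a hypothetical name) has already been derived; the rows are made up.

```python
import pandas as pd

# One row per toy tract.
toy = pd.DataFrame({
    "county": ["Wayne", "Oakland", "Wayne", "Macomb", "Wayne", "Oakland"],
    "year": [2016] * 6,
})
chosen = ["Wayne", "Oakland", "Macomb"]

tract_counts = (
    toy[(toy["year"] == 2016) & toy["county"].isin(chosen)]
    .groupby("county")
    .size()          # number of rows (tracts) in each county
    .sort_index()    # alphabetical order by county name
)

# pandas plotting draws with Matplotlib and returns the Axes.
ax = tract_counts.plot(kind="bar")
ax.set_xlabel("County")
ax.set_ylabel("Number of census tracts")
```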

In [None]:
# Answer Here

### 🟢 **Q6: Histogram of Employed Civilians (2016)**

Create a **histogram** that shows the distribution of the total number of employed civilians across all counties for the year **2016**.

---

#### 🔄 **Steps**:

1. Create a **histogram** to visualize the distribution of employed civilians.
2. Experiment with the `bins` argument:
   - Produce **two histograms** with **different bin sizes** to see how it affects the look of the distribution.
3. **Comment** on why you chose one bin size over the other.

---

This task will help you understand how bin size impacts the visualization of distribution! 🎨📊

In [None]:
# Answer Here

### 🟢 **Q7: Occupational Category Visualization (2012)**

Create a 📊 **visualization** that shows the **average number of employed civilians** in each occupational category by county for the year **2012**.

---

#### 🔄 **Steps**:

1. Show the values as a **percentage** of the total number of employed civilians in each county.
   - Create a **100% stacked bar plot** to represent this data.
2. Only include the following counties from **Southeast Michigan**:
   - `Wayne`, `Oakland`, `Macomb`, `Washtenaw`, `Livingston`, `St. Clair`, `Lapeer`, `Monroe`, `Genesee`.

---

#### 💡 **Hints**:

- Use the `groupby()` function to aggregate data by county.
- To visualize this effectively, consider using a **100% stacked bar plot**. For a reference, check out this [link](https://www.statisticshowto.com/stacked-bar-chart/).
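A 100% stacked bar plot needs each county's row to sum to 100. One way to get there, sketched on made-up counts, is to aggregate by county and divide each row by its own total. The column names `county`, `service_count`, and `sales_count` are hypothetical, and sums are used here only for simplicity.

```python
import pandas as pd

# Toy tract-level counts for two occupation categories.
toy = pd.DataFrame({
    "county": ["Wayne", "Wayne", "Oakland"],
    "service_count": [100.0, 300.0, 50.0],
    "sales_count": [200.0, 400.0, 150.0],
})

# One row per county, one column per category.
county_totals = toy.groupby("county")[["service_count", "sales_count"]].sum()

# Divide every row by its own total (axis=0 aligns on the county index).
shares = county_totals.div(county_totals.sum(axis=1), axis=0) * 100

# stacked=True piles the categories on top of each other within each bar.
ax = shares.plot(kind="bar", stacked=True)
ax.set_ylabel("Share of employed civilians (%)")
```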

---

This question will help you practice data aggregation and visualize percentages in a clear, impactful way! 🎨📊

In [None]:
# Answer Here


### 🟢 **Q8: High vs. Low Earning Census Tracts (2015)**

Create a 📊 **visualization** that shows the distribution of **High Earning Census Tracts** (Median Income > 35,000) and **Low Earning Census Tracts** (Median Income < 35,000) for the year **2015**.

---

#### 🔄 **Steps**:

1. **First Visualization**: 
   - Show the distribution for **all counties** in the dataset.
   
2. **Second Visualization**: 
   - Repeat the process, but only for the following counties:
     - `Wayne`, `Oakland`, `Macomb`, `Washtenaw`.

---

#### 💡 **Hint**:

- You’ll need to **join the median income dataset for 2015** with the employment dataset for 2015.
- Recall our previous in-class exercise on joining datasets. Consider:
   - **What type of join** is appropriate for this task?
   - **What column** should you join on?
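The mechanics of joining and labeling can be sketched on two toy tables. Everything here is an assumption for illustration: the income column name `median_income`, the tract identifiers, and the choice of an inner join, which is only one of the options the questions above ask you to weigh.

```python
import numpy as np
import pandas as pd

# Toy employment and income tables sharing a tract identifier.
toy_jobs = pd.DataFrame({
    "Geography": ["T1", "T2", "T3"],
    "Total": [900, 1200, 700],
})
toy_income = pd.DataFrame({
    "Geography": ["T1", "T2", "T4"],
    "median_income": [28000, 52000, 41000],
})

# how="inner" keeps only tracts present in both tables (T1 and T2 here).
merged = toy_jobs.merge(toy_income, on="Geography", how="inner")

# Label each tract; in this sketch a value of exactly 35,000 falls into "Low".
merged["earning_group"] = np.where(merged["median_income"] > 35000, "High", "Low")
print(merged)
```

Changing `how` to `"left"`, `"right"`, or `"outer"` and comparing row counts is a quick way to see which tracts each join type keeps or drops.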

---

This task will help you practice dataset merging and visualize the economic distribution in different regions. 📊💡

In [None]:
# Answer Here

### 🟢 IMPORTANT: Final Checks

Before you submit your homework, please ensure that:
- Your complete homework **runs without errors** from top to bottom.  
  💡 **Tip**: Use the **Run All** feature to quickly check this.
  
---

## 📤 Submission Instructions

Submit your completed assignment in both of the following formats:
1. **.IPYNB** (Jupyter Notebook format)
2. **.HTML** (Webpage format)

Upload your files to **Canvas** before the deadline.