# CA100 Demographic Pyramid Visualization
This project generates visualizations for [California 100](https://california100.org/)'s regional reports. It creates 9 regional demographic pyramid visualizations from the most recent year of US Census ACS 5-year (ACS5) data.

## Demographic Pyramid
A [population pyramid](https://en.wikipedia.org/wiki/Population_pyramid) is a common socioeconomic data visualization used to understand the age and gender distribution of a geographic region.

For this project, we will generate a more detailed **demographic pyramid** that also displays the distribution of ethnicity within each age group. This was inspired by [this project](https://medium.com/@databayou/why-i-made-race-and-ethnicity-population-pyramids-e41b486e3806).

The project is written as an IPython (Jupyter) notebook, so the documentation sits alongside the code and explains each step as it is performed.

### Pre-Requisites
1. Install [Python](https://www.python.org/downloads/)
2. Install [Jupyter](https://www.codecademy.com/article/how-to-use-jupyter-notebooks)
3. Open terminal on [Mac](https://support.apple.com/guide/terminal/open-or-quit-terminal-apd5265185d-f365-44cb-8b09-71a064a42125/mac) or [PC](https://www.wikihow.com/Open-Terminal-in-Windows)

### How to run this project
The notebook is divided into "cells", each containing a separate piece of code.

1. Open a terminal and run `jupyter notebook`.
2. A tab should open in your browser at the URL "localhost:8888". If the tab doesn't open, type "localhost:8888" into the address bar manually.
3. Run the entire notebook by pressing the "Play" button.
4. You can also run an individual cell by selecting it and pressing "Shift + Enter".
5. **Run each cell in order**. Later cells depend on earlier cells, so although IPython lets you run later cells first, doing so may result in errors.

If there are a lot of errors, click the icon with the arrow looped back to the right to reset the notebook, then run the cells again from the top.


In [3]:
# Install packages
# Run this only the first time
# NOTE: itertools ships with Python, so it does not need to be installed
# The leading "%" runs pip as a notebook magic command instead of as Python code
%pip install pandas cenpy census us matplotlib

In [2]:
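# Import packages
import itertools
import os
# ThreadPoolExecutor is used in the download step to request several counties at once
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
from census import Census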

## Data Set-up
We create our data tables and structures, so we can work with the data from the census.

### Defining Regions
Each county is assigned to a region defined by CA100. This is done through a CSV file in this folder called "region_county_mapping.csv".

If you wish to change the region a county belongs to, update and save "region_county_mapping.csv" and re-run this section so the code picks up the reassignment.
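
After editing the mapping, a quick sanity check can catch mistakes before the rest of the notebook runs. This matters because the import cell calls `dropna(axis=1)`, which drops any column containing a blank value, so a single county with an empty region would remove the whole `Region` column. The sketch below is only an illustration: it uses a three-row inline sample instead of the real file, and `check_mapping` is a hypothetical helper, not part of this notebook.

```python
import io
import pandas as pd

# Inline sample standing in for region_county_mapping.csv (three rows only)
SAMPLE_CSV = """County,Region
Alameda,Bay Area
Alpine,Sierras
Butte,Far North
"""

def check_mapping(df):
    """Return a list of problems found in a county-to-region mapping (hypothetical helper)."""
    problems = []
    missing = df[df["Region"].isna()]["County"].tolist()
    if missing:
        problems.append("Counties without a region: " + ", ".join(missing))
    dupes = df[df["County"].duplicated()]["County"].tolist()
    if dupes:
        problems.append("Counties listed more than once: " + ", ".join(dupes))
    return problems

sample_df = pd.read_csv(io.StringIO(SAMPLE_CSV))
problems = check_mapping(sample_df)

# Group counties by region to review the assignments at a glance
counties_per_region = sample_df.groupby("Region")["County"].apply(list).to_dict()
print(problems)
print(counties_per_region)
```

To try it on the real file, replace `io.StringIO(SAMPLE_CSV)` with the path to "region_county_mapping.csv".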

In [3]:
# Connect census package with census API key
# NOTE: To use the Census API, you need to sign up at this site https://api.census.gov/data/key_signup.html
# You will receive your API key by email. Paste it below in place of the placeholder
# (avoid sharing a notebook that contains your real key)
c = Census("YOUR_API_KEY_HERE")

In [4]:
# Import table that associates county with CA100 defined regions
REGIONAL_MAPPING_FILENAME = 'region_county_mapping.csv'
REGIONAL_MAPPING_FILEPATH = os.path.abspath(REGIONAL_MAPPING_FILENAME)
county_codes_df = pd.read_csv(REGIONAL_MAPPING_FILEPATH)
county_codes_df = county_codes_df.dropna(axis=1)
county_codes_df['FIPS Code'] = county_codes_df['FIPS Code'].astype(str)
county_codes_df['FIPS Code'] = county_codes_df['FIPS Code'].str[1:]
county_codes_df

Unnamed: 0,County,Region,FIPS Code
0,Alameda,Bay Area,1
1,Alpine,Sierras,3
2,Amador,Sierras,5
3,Butte,Far North,7
4,Calaveras,Sierras,9
5,Colusa,Far North,11
6,Contra Costa,Bay Area,13
7,Del Norte,Far North,15
8,El Dorado,Sacramento Metro,17
9,Fresno,San Joaquin Valley,19


In [5]:
county_codes = county_codes_df.copy().set_index('County').to_dict('index')
county_codes

{'Alameda': {'Region': 'Bay Area', 'FIPS Code': '001'},
 'Alpine': {'Region': 'Sierras', 'FIPS Code': '003'},
 'Amador': {'Region': 'Sierras', 'FIPS Code': '005'},
 'Butte': {'Region': 'Far North', 'FIPS Code': '007'},
 'Calaveras': {'Region': 'Sierras', 'FIPS Code': '009'},
 'Colusa': {'Region': 'Far North', 'FIPS Code': '011'},
 'Contra Costa': {'Region': 'Bay Area', 'FIPS Code': '013'},
 'Del Norte': {'Region': 'Far North', 'FIPS Code': '015'},
 'El Dorado': {'Region': 'Sacramento Metro', 'FIPS Code': '017'},
 'Fresno': {'Region': 'San Joaquin Valley', 'FIPS Code': '019'},
 'Glenn': {'Region': 'Far North', 'FIPS Code': '021'},
 'Humboldt': {'Region': 'Far North', 'FIPS Code': '023'},
 'Imperial': {'Region': 'Southern Border', 'FIPS Code': '025'},
 'Inyo': {'Region': 'Sierras', 'FIPS Code': '027'},
 'Kern': {'Region': 'San Joaquin Valley', 'FIPS Code': '029'},
 'Kings': {'Region': 'San Joaquin Valley', 'FIPS Code': '031'},
 'Lake': {'Region': 'Far North', 'FIPS Code': '033'},
 'Lasse

In [6]:
# Data scope table
# Set-up what census variables we want to pull
CA_FIPS = '06'
BASE_CODE = 'B01001'
MAX_AGE_CAT = 31
ETH = [ 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
ETH_CODES = [ BASE_CODE + eth_code for eth_code in ETH]
ETH_AGE_CODES = [
    [ eth_code + '_' + str(age_cat).zfill(3) + 'E' for age_cat in range(1,MAX_AGE_CAT+1)]
    for eth_code in ETH_CODES
]
ETH_AGE_CODES = list(itertools.chain.from_iterable(ETH_AGE_CODES))
COUNTIES = list(county_codes.keys())

In [7]:
# Set-up empty table to put data from Census API into
# This creates a table that has number of people per ethnicity, age group, and county with additional labeling of region
# This is based on the regional definition csv and the census variables we want to download
county_census_df = county_codes_df.copy()
county_census_df = county_census_df.reindex(columns=['County', 'Region', 'FIPS Code'] + ETH_AGE_CODES)
county_census_df

Unnamed: 0,County,Region,FIPS Code,B01001A_001E,B01001A_002E,B01001A_003E,B01001A_004E,B01001A_005E,B01001A_006E,B01001A_007E,...,B01001I_022E,B01001I_023E,B01001I_024E,B01001I_025E,B01001I_026E,B01001I_027E,B01001I_028E,B01001I_029E,B01001I_030E,B01001I_031E
0,Alameda,Bay Area,1,,,,,,,,...,,,,,,,,,,
1,Alpine,Sierras,3,,,,,,,,...,,,,,,,,,,
2,Amador,Sierras,5,,,,,,,,...,,,,,,,,,,
3,Butte,Far North,7,,,,,,,,...,,,,,,,,,,
4,Calaveras,Sierras,9,,,,,,,,...,,,,,,,,,,
5,Colusa,Far North,11,,,,,,,,...,,,,,,,,,,
6,Contra Costa,Bay Area,13,,,,,,,,...,,,,,,,,,,
7,Del Norte,Far North,15,,,,,,,,...,,,,,,,,,,
8,El Dorado,Sacramento Metro,17,,,,,,,,...,,,,,,,,,,
9,Fresno,San Joaquin Valley,19,,,,,,,,...,,,,,,,,,,
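
The column names in this table follow the pattern built in the data scope cell: the base table `B01001`, one ethnicity letter (`A` to `I`), an underscore, a three-digit line number, and an `E` suffix marking an estimate. As a minimal sketch of that naming scheme, the helpers below build and parse such names. `parse_census_code` and `build_census_code` are hypothetical names introduced here for illustration and are not used elsewhere in the notebook.

```python
import re

# Pattern for the variable names used in this notebook:
# table B01001, ethnicity letter A-I, "_", three-digit line number, "E" for estimate
CODE_PATTERN = re.compile(r"^(B01001)([A-I])_(\d{3})E$")

def parse_census_code(code):
    """Split a variable name such as 'B01001I_022E' into its parts (hypothetical helper)."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError("unexpected census variable name: " + code)
    table, eth_letter, line_number = match.groups()
    return {"table": table, "eth": eth_letter, "line": int(line_number)}

def build_census_code(eth_letter, line_number, table="B01001"):
    """Build a variable name the same way the data scope cell does (hypothetical helper)."""
    return table + eth_letter + "_" + str(line_number).zfill(3) + "E"

print(parse_census_code("B01001I_022E"))
print(build_census_code("A", 1))
```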


## Download Census Data [Optional After First Run]
We are getting the latest data for each county, ethnicity, and age group.

This is using the latest [ACS 5 dataset](https://www.census.gov/data/developers/data-sets/acs-5year.html). Here is some [documentation](https://pypi.org/project/census/) for the Python census package.

Script last run: 12/15/2022, using ACS 5 2020 dataset.

**NOTE**
This should be done only once or as needed, since the data is saved in "county_eth_age_breakdown.csv" after this has been pulled.
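
One way to make "only once or as needed" automatic is to check for the saved file before downloading. The sketch below shows that caching pattern under stated assumptions: `load_or_download` is a hypothetical helper, and `fake_download` is a stand-in for the real download loop so the example runs without an API key or network access.

```python
import os
import tempfile
import pandas as pd

def load_or_download(csv_path, download_fn):
    """Load cached data if csv_path exists; otherwise call download_fn and save the result."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path, index_col=0)
    df = download_fn()
    df.to_csv(csv_path)
    return df

calls = []

def fake_download():
    # Placeholder for the real Census API loop; records that it was called
    calls.append(1)
    return pd.DataFrame({"County": ["Alameda"], "B01001A_001E": [631037.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "county_eth_age_breakdown.csv")
    first = load_or_download(path, fake_download)   # downloads and saves
    second = load_or_download(path, fake_download)  # reads the saved file instead

print("Download calls:", len(calls))
```

Deleting the CSV file forces a fresh download the next time the helper runs, which is useful when a newer ACS release becomes available.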

In [23]:
codes = ETH_AGE_CODES

for i, eth_age_code in enumerate(codes):
    eth_age_data = []
    threads = []
    print("Started " + eth_age_code)
    with ThreadPoolExecutor(max_workers=5) as executor:
        for county in COUNTIES:
            threads.append(executor.submit(c.acs5.state_county,
                                           eth_age_code,
                                           CA_FIPS,
                                           county_codes[county]['FIPS Code']))
        for task in threads:
            eth_age_data.append(task.result()[0][eth_age_code])
    # Show progress as a true percentage (the fraction is scaled by 100 by the "%" format)
    print(eth_age_code + " done. " + "{percent:.1%}".format(percent=(i+1)/len(codes)) + " done.")
    county_census_df[eth_age_code] = eth_age_data
    county_census_df.to_csv('county_eth_age_breakdown.csv')

county_census_df

Started B01001A_001E
B01001A_001E done. 0.00358% done.
Started B01001A_002E
B01001A_002E done. 0.00717% done.
Started B01001A_003E
B01001A_003E done. 0.0108% done.
Started B01001A_004E
B01001A_004E done. 0.0143% done.
Started B01001A_005E
B01001A_005E done. 0.0179% done.
Started B01001A_006E
B01001A_006E done. 0.0215% done.
Started B01001A_007E
B01001A_007E done. 0.0251% done.
Started B01001A_008E
B01001A_008E done. 0.0287% done.
Started B01001A_009E
B01001A_009E done. 0.0323% done.
Started B01001A_010E
B01001A_010E done. 0.0358% done.
Started B01001A_011E
B01001A_011E done. 0.0394% done.
Started B01001A_012E
B01001A_012E done. 0.043% done.
Started B01001A_013E
B01001A_013E done. 0.0466% done.
Started B01001A_014E
B01001A_014E done. 0.0502% done.
Started B01001A_015E
B01001A_015E done. 0.0538% done.
Started B01001A_016E
B01001A_016E done. 0.0573% done.
Started B01001A_017E
B01001A_017E done. 0.0609% done.
Started B01001A_018E
B01001A_018E done. 0.0645% done.
Started B01001A_019E
B01001

Unnamed: 0,County,Region,FIPS Code,B01001A_001E,B01001A_002E,B01001A_003E,B01001A_004E,B01001A_005E,B01001A_006E,B01001A_007E,...,B01001I_022E,B01001I_023E,B01001I_024E,B01001I_025E,B01001I_026E,B01001I_027E,B01001I_028E,B01001I_029E,B01001I_030E,B01001I_031E
0,Alameda,Bay Area,1,631037.0,314747.0,14893.0,14588.0,14790.0,8798.0,6325.0,...,5927.0,14760.0,15965.0,14856.0,27245.0,22089.0,15565.0,8302.0,3965.0,2230.0
1,Alpine,Sierras,3,663.0,360.0,13.0,0.0,17.0,14.0,0.0,...,4.0,3.0,0.0,2.0,53.0,0.0,6.0,6.0,0.0,0.0
2,Amador,Sierras,5,33040.0,17534.0,674.0,704.0,722.0,421.0,311.0,...,99.0,128.0,125.0,126.0,371.0,138.0,234.0,201.0,11.0,53.0
3,Butte,Far North,7,178568.0,87260.0,4604.0,4930.0,4421.0,2835.0,2807.0,...,874.0,2671.0,1575.0,1195.0,2212.0,1743.0,1350.0,823.0,439.0,99.0
4,Calaveras,Sierras,9,40058.0,20110.0,778.0,1196.0,619.0,700.0,415.0,...,140.0,257.0,49.0,79.0,273.0,452.0,415.0,228.0,61.0,127.0
5,Colusa,Far North,11,17012.0,8477.0,669.0,766.0,644.0,333.0,174.0,...,208.0,502.0,465.0,446.0,817.0,703.0,482.0,206.0,90.0,151.0
6,Contra Costa,Bay Area,13,608789.0,300282.0,14340.0,16058.0,17361.0,11227.0,6270.0,...,4652.0,11418.0,11150.0,10859.0,21868.0,18149.0,12858.0,6892.0,3465.0,1580.0
7,Del Norte,Far North,15,20002.0,10844.0,484.0,588.0,398.0,337.0,121.0,...,120.0,92.0,146.0,29.0,324.0,227.0,202.0,91.0,10.0,0.0
8,El Dorado,Sacramento Metro,17,164552.0,82376.0,3649.0,4197.0,5077.0,3067.0,1619.0,...,381.0,908.0,754.0,741.0,1692.0,1540.0,1318.0,721.0,316.0,92.0
9,Fresno,San Joaquin Valley,19,597574.0,295324.0,21719.0,22495.0,24702.0,13036.0,7646.0,...,8537.0,21503.0,21647.0,18898.0,34022.0,27289.0,20644.0,12123.0,5510.0,2606.0


## Data Visualization - Creating the Demographic Pyramid
We process the ACS data here. If you have already downloaded the ACS data in the earlier step, you will have a "county_eth_age_breakdown.csv" file, which contains all the data required to generate the visualization.

In that case, you don't need to run the download step again before running this section of code.

In [1]:
# Install required visualization library
# The leading "%" runs pip as a notebook magic command instead of as Python code
%pip install seaborn

In [8]:
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
import seaborn as sns
sns.set()
%matplotlib notebook

In [24]:
# Set County as the index
county_census_df = pd.read_csv('county_eth_age_breakdown.csv')
county_census_df.set_index('County', inplace=True)
county_census_df = county_census_df.drop(county_census_df.columns[[0]], axis=1)
county_census_df

County,Region,FIPS Code,B01001A_001E,B01001A_002E,B01001A_003E,B01001A_004E,B01001A_005E,B01001A_006E,B01001A_007E,B01001A_008E,...,B01001I_022E,B01001I_023E,B01001I_024E,B01001I_025E,B01001I_026E,B01001I_027E,B01001I_028E,B01001I_029E,B01001I_030E,B01001I_031E
Alameda,Bay Area,1,631037.0,314747.0,14893.0,14588.0,14790.0,8798.0,6325.0,16637.0,...,5927.0,14760.0,15965.0,14856.0,27245.0,22089.0,15565.0,8302.0,3965.0,2230.0
Alpine,Sierras,3,663.0,360.0,13.0,0.0,17.0,14.0,0.0,0.0,...,4.0,3.0,0.0,2.0,53.0,0.0,6.0,6.0,0.0,0.0
Amador,Sierras,5,33040.0,17534.0,674.0,704.0,722.0,421.0,311.0,656.0,...,99.0,128.0,125.0,126.0,371.0,138.0,234.0,201.0,11.0,53.0
Butte,Far North,7,178568.0,87260.0,4604.0,4930.0,4421.0,2835.0,2807.0,8928.0,...,874.0,2671.0,1575.0,1195.0,2212.0,1743.0,1350.0,823.0,439.0,99.0
Calaveras,Sierras,9,40058.0,20110.0,778.0,1196.0,619.0,700.0,415.0,919.0,...,140.0,257.0,49.0,79.0,273.0,452.0,415.0,228.0,61.0,127.0
Colusa,Far North,11,17012.0,8477.0,669.0,766.0,644.0,333.0,174.0,506.0,...,208.0,502.0,465.0,446.0,817.0,703.0,482.0,206.0,90.0,151.0
Contra Costa,Bay Area,13,608789.0,300282.0,14340.0,16058.0,17361.0,11227.0,6270.0,15223.0,...,4652.0,11418.0,11150.0,10859.0,21868.0,18149.0,12858.0,6892.0,3465.0,1580.0
Del Norte,Far North,15,20002.0,10844.0,484.0,588.0,398.0,337.0,121.0,527.0,...,120.0,92.0,146.0,29.0,324.0,227.0,202.0,91.0,10.0,0.0
El Dorado,Sacramento Metro,17,164552.0,82376.0,3649.0,4197.0,5077.0,3067.0,1619.0,4022.0,...,381.0,908.0,754.0,741.0,1692.0,1540.0,1318.0,721.0,316.0,92.0
Fresno,San Joaquin Valley,19,597574.0,295324.0,21719.0,22495.0,24702.0,13036.0,7646.0,20101.0,...,8537.0,21503.0,21647.0,18898.0,34022.0,27289.0,20644.0,12123.0,5510.0,2606.0


### Adjusting Age Groups
To visualize a demographic or age pyramid, all of our age groups need to cover the same interval of **5 years** (e.g. Under 5 years, 5 - 9 years, 10 - 14 years, etc.).

However, the census has age groups that span 2-3 years ([15 - 17 years](https://api.census.gov/data/2021/acs/acs1/variables/B01001I_006E.json), and [18 - 19 years](https://api.census.gov/data/2021/acs/acs1/variables/B01001I_007E.json)) and 10 years ([35 - 44 years](https://api.census.gov/data/2021/acs/acs1/variables/B01001I_011E.json), and [45 - 54 years](https://api.census.gov/data/2021/acs/acs1/variables/B01001I_012E.json)).

The cells below merge the short age groups and split the 10-year age groups in half, so that every bar in our demographic pyramids covers an interval of 5 years. Splitting in half assumes the population is spread evenly across each 10-year group.
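
Before the real columns are processed, the idea can be seen on a toy table. This is a minimal sketch with made-up numbers and readable column names in place of census codes. It mirrors the approach used in this notebook: add the two short brackets together, and divide each 10-year bracket evenly in two.

```python
import pandas as pd

# Toy counts for one county (made-up numbers, not census data)
toy = pd.DataFrame(
    {"15 to 17": [30], "18 to 19": [20], "35 to 44": [100]},
    index=["Example County"],
)

adjusted = pd.DataFrame(index=toy.index)

# Merge: the two short brackets add up to one 5-year bracket
adjusted["15 to 19"] = toy["15 to 17"] + toy["18 to 19"]

# Split: a 10-year bracket is divided evenly into two 5-year brackets,
# which assumes people are spread uniformly across the ten years
adjusted["35 to 39"] = toy["35 to 44"] / 2
adjusted["40 to 44"] = toy["35 to 44"] / 2

print(adjusted)
```

The total population is unchanged by either operation, which is a useful check after adjusting the real data.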

In [25]:
# Identify age groups that are only 2-3 years long (15 to 17 and 18 to 19) so they can be merged into one 5-year group (15 to 19)
# Combine 006, 007 (male) and 021, 022 (female) cols
ETH_AGE_CODES_MERGED = [
    [ eth_code + '_' + str(age_cat).zfill(3) + 'E' for age_cat in [6,7,21,22]]
    for eth_code in ETH_CODES
]
ETH_AGE_CODES_MERGED = list(itertools.chain.from_iterable(ETH_AGE_CODES_MERGED))
ETH_AGE_CODES_MERGED_ZIPPED = list(zip(ETH_AGE_CODES_MERGED[0::2], ETH_AGE_CODES_MERGED[1::2]))
merged_demo_cols = pd.DataFrame()
for merged_codes in ETH_AGE_CODES_MERGED_ZIPPED:
    merged_codes_list = list(merged_codes)
    merged_col_name = merged_codes[0]+'_merged'
    merged_demo_cols[merged_col_name] = county_census_df[merged_codes_list].sum(axis=1)
# Identify age groups that are 10 years long instead of 5 years
# Split 011, 012, 013, 014, 015 (male) and 026, 027, 028, 029, 030 (female) cols into halves a and b
SPLIT_COL_NAMES = [11,12,13,14,15,26,27,28,29,30]
ETH_AGE_CODES_SPLIT = [
    [ eth_code + '_' + str(age_cat).zfill(3) + 'E' for age_cat in SPLIT_COL_NAMES]
    for eth_code in ETH_CODES
]
ETH_AGE_CODES_SPLIT = list(itertools.chain.from_iterable(ETH_AGE_CODES_SPLIT))
split_demo_cols = pd.DataFrame()
for split_code in ETH_AGE_CODES_SPLIT:
    split_demo_cols[split_code+"_a"] = county_census_df[split_code]/2
    split_demo_cols[split_code+"_b"] = county_census_df[split_code]/2

In [26]:
# Perform the actual merging and splitting of age brackets
print("Number of columns: ",len(county_census_df.columns))
county_census_viz_df = county_census_df.copy()
county_census_viz_df = county_census_viz_df.drop(columns=ETH_AGE_CODES_MERGED)
county_census_viz_df = county_census_viz_df.drop(columns=ETH_AGE_CODES_SPLIT)
print("Number of columns: ",len(county_census_viz_df.columns))
county_census_viz_df = pd.concat([county_census_viz_df, merged_demo_cols, split_demo_cols], axis='columns')
list(county_census_viz_df.columns)

Number of columns:  281
Number of columns:  155


['Region',
 'FIPS Code',
 'B01001A_001E',
 'B01001A_002E',
 'B01001A_003E',
 'B01001A_004E',
 'B01001A_005E',
 'B01001A_008E',
 'B01001A_009E',
 'B01001A_010E',
 'B01001A_016E',
 'B01001A_017E',
 'B01001A_018E',
 'B01001A_019E',
 'B01001A_020E',
 'B01001A_023E',
 'B01001A_024E',
 'B01001A_025E',
 'B01001A_031E',
 'B01001B_001E',
 'B01001B_002E',
 'B01001B_003E',
 'B01001B_004E',
 'B01001B_005E',
 'B01001B_008E',
 'B01001B_009E',
 'B01001B_010E',
 'B01001B_016E',
 'B01001B_017E',
 'B01001B_018E',
 'B01001B_019E',
 'B01001B_020E',
 'B01001B_023E',
 'B01001B_024E',
 'B01001B_025E',
 'B01001B_031E',
 'B01001C_001E',
 'B01001C_002E',
 'B01001C_003E',
 'B01001C_004E',
 'B01001C_005E',
 'B01001C_008E',
 'B01001C_009E',
 'B01001C_010E',
 'B01001C_016E',
 'B01001C_017E',
 'B01001C_018E',
 'B01001C_019E',
 'B01001C_020E',
 'B01001C_023E',
 'B01001C_024E',
 'B01001C_025E',
 'B01001C_031E',
 'B01001D_001E',
 'B01001D_002E',
 'B01001D_003E',
 'B01001D_004E',
 'B01001D_005E',
 'B01001D_008E',
 'B010

### Creating Labels and Colors
This section creates the labels that will appear on the demographic pyramid.

#### Changing Ethnicity Colors
The demographic pyramid represents each ethnicity with a different color. These colors are defined in the next cell, so to change a color follow these steps:

1. In the next cell of code (In [27]), find the line that defines `ETH_COLOR_CODES_DICT`
2. Each ethnicity is represented by a letter defined by the US Census, and each letter is assigned a [color as a hexadecimal number](https://htmlcolorcodes.com/)
3. Change the hexadecimal value for a given letter to change that ethnicity's color on the demographic pyramid
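
A mistyped hexadecimal value only fails once the chart is drawn, so it can help to check new colors first. The sketch below uses matplotlib's `is_color_like` and `to_rgb` helpers on a small sample dictionary. The `candidate_colors` dictionary and its deliberately bad entry are illustrative and not part of this notebook.

```python
from matplotlib.colors import is_color_like, to_rgb

# Sample color assignments; the entry for "C" is intentionally invalid
candidate_colors = {"A": "#1abc9c", "B": "#f1c40f", "C": "not-a-color"}

# Collect every entry that matplotlib would not accept as a color
invalid = {
    letter: value
    for letter, value in candidate_colors.items()
    if not is_color_like(value)
}

# Convert a valid hex value to red, green, blue fractions between 0 and 1
rgb = to_rgb("#1abc9c")

print("Invalid colors:", invalid)
print("RGB:", rgb)
```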

In [27]:
# Identify columns by gender
MIN_MALE_INDEX = 3
MAX_MALE_INDEX = 16
MIN_FEMALE_INDEX = 18
MAX_FEMALE_INDEX = 31

# Identify columns by ethnicity
ETH_CODES_DICT = {
    'A': 'White',
    'B': 'Black or African American',
    'C': 'American Indian and Alaska Native',
    'D': 'Asian',
    'E': 'Native Hawaiian and Other Pacific Islander',
    'F': 'Other Race',
    'G': 'Two or More Races',
    'H': 'White, Not Hispanic or Latino',
    'I': 'Hispanic or Latino',
}

# Each ethnicity has a color. Change this dictionary if you want to change the color.
ETH_COLOR_CODES_DICT = {
    'A': '#1abc9c',
    'B': '#f1c40f',
    'C': '#95a5a6',
    'D': '#e67e22',
    'E': '#3498db',
    'F': '#e74c3c',
    'G': '#fd79a8',
    'H': '#34495e',
    'I': '#2ecc71',
}

# Identify columns by age brackets
AGE_RANGE_DICT = {
    '003': 'Under 5 Years',
    '004': '5 to 9 Years',
    '005': '10 to 14 Years',
    '006': '15 to 17 Years',
    '007': '18 to 19 Years',
    '008': '20 to 24 Years',
    '009': '25 to 29 Years',
    '010': '30 to 34 Years',
    '011': '35 to 44 Years',
    '012': '45 to 54 Years',
    '013': '55 to 64 Years',
    '014': '65 to 74 Years',
    '015': '75 to 84 Years',
    '016': '85 Years and Over',
    '018': 'Under 5 Years',
    '019': '5 to 9 Years',
    '020': '10 to 14 Years',
    '021': '15 to 17 Years',
    '022': '18 to 19 Years',
    '023': '20 to 24 Years',
    '024': '25 to 29 Years',
    '025': '30 to 34 Years',
    '026': '35 to 44 Years',
    '027': '45 to 54 Years',
    '028': '55 to 64 Years',
    '029': '65 to 74 Years',
    '030': '75 to 84 Years',
    '031': '85 Years and Over',
}

AGE_RANGE_TO_INDEX_DICT = {
    'Under 5 Years': 0,
    '5 to 9 Years': 1,
    '10 to 14 Years': 2,
    '15 to 17 Years': 3,
    '18 to 19 Years': 3,
    '20 to 24 Years': 4,
    '25 to 29 Years': 5,
    '30 to 34 Years': 6,
    '35 to 44 Years': 7,
    '45 to 54 Years': 9,
    '55 to 64 Years': 11,
    '65 to 74 Years': 13,
    '75 to 84 Years': 15,
    '85 Years and Over': 17,
}
AGE_RANGES = [
    'Under 5 Years',
    '5 to 9 Years',
    '10 to 14 Years',
    '15 to 19 Years',
    '20 to 24 Years',
    '25 to 29 Years',
    '30 to 34 Years',
    '35 to 39 Years',
    '40 to 44 Years',
    '45 to 49 Years',
    '50 to 54 Years',
    '55 to 59 Years',
    '60 to 64 Years',
    '65 to 69 Years',
    '70 to 74 Years',
    '75 to 79 Years',
    '80 to 84 Years',
    '85 Years and Over',
]

MERGED_AGES = ['021', '022', '006', '007']
SPLIT_AGES = ['011', '012', '013', '014', '015', '026', '027', '028', '029', '030']

# Replaces index in string
def replacer(s, newstring, index, nofail=False):
    # raise an error if index is outside of the string
    if not nofail and index not in range(len(s)):
        raise ValueError("index outside given string")

    # if not erroring, but the index is still not in the correct range..
    if index < 0:  # add it to the beginning
        return newstring + s
    if index > len(s):  # add it to the end
        return s + newstring

    # insert the new string between "slices" of the original
    return s[:index] + newstring + s[index + 1:]

# Helper function to get left columns based on census code
def get_left_cols(code):
    eth_code = code[6:7]
    gender_age = code[8:11]

    if eth_code == 'H':
        return 0

    # Eth Order: H, I, B, C, D, E, F, G
    eth_order = ['H', 'I', 'B', 'C', 'D', 'E', 'F', 'G']

    eth_order = eth_order[:eth_order.index(eth_code)]
    eth_offset_cols = [replacer(code, eth_idx, 6)[:12] for eth_idx in eth_order]

    if gender_age in ['021', '022', '006', '007']:
        complementary_age = ''
        if gender_age == '021':
            complementary_age = '022'
        elif gender_age == '022':
            complementary_age = '021'
        elif gender_age == '006':
            complementary_age = '007'
        elif gender_age == '007':
            complementary_age = '006'
        complementary_code = replacer(replacer(code, complementary_age, 8), 'E', 11)
        additional_eth_offset_cols = [replacer(complementary_code, eth_idx, 6)[:12] for eth_idx in eth_order]
        eth_offset_cols = additional_eth_offset_cols + eth_offset_cols
    return eth_offset_cols

# Helper function to get left offset based on census code (return int)
def get_left_offset(code, region):
    gender_age = code[8:11]
    # Get columns to the left
    eth_offset_cols = get_left_cols(code)
    if eth_offset_cols == 0:
        return 0

    # Sum values to get left offset
    regional_df = county_census_df[county_census_df['Region'] == region]

    if set(eth_offset_cols).issubset(regional_df.columns.tolist()):
        if gender_age in ['011', '012', '013', '014', '015', '026', '027', '028', '029', '030']:
            return regional_df.loc[:,eth_offset_cols].sum().sum()/2
        return regional_df.loc[:,eth_offset_cols].sum().sum()
    return eth_offset_cols

# Helper function to translate census code to attributes (gender, age, ethnicity)
def get_demo_col(code, region):
    gender_age = code[8:11]
    eth_code = code[6:7]
    suffix = code[13:] if len(code) > 12 else ''
    suffix_val = 1 if suffix == 'b' else 0

    gender = 'Male' if ((int(gender_age) >= MIN_MALE_INDEX) & (int(gender_age) <= MAX_MALE_INDEX)) else 'Female'
    gender_idx = 0 if gender == 'Male' else 1
    age = AGE_RANGE_DICT[gender_age]
    age_idx = AGE_RANGE_TO_INDEX_DICT[age] + suffix_val
    eth = ETH_CODES_DICT[eth_code]
    eth_left_len = get_left_offset(code, region)

    return {
        'label': {'gender': gender, 'age': age, 'eth': eth},
        'chart': (gender_idx, age_idx, eth_left_len)
    }
census_codes_list = county_census_viz_df.columns[2:].tolist()

In [28]:
print("Len of census codes list: ", len(census_codes_list))
# Remove codes we don't want to look at
# Age: 001, 002, 017
# Eth: A
codes_to_remove = []
for code in census_codes_list:
    gender_age = code[8:11]
    eth_code = code[6:7]
    if (eth_code == 'A') | (gender_age == '001') | (gender_age == '017') | (gender_age == '002'):
        print(gender_age, eth_code)
        codes_to_remove.append(code)

for remove_code in codes_to_remove:
    census_codes_list.remove(remove_code)

print("Len of census codes list: ", len(census_codes_list))

Len of census codes list:  351
001 A
002 A
003 A
004 A
005 A
008 A
009 A
010 A
016 A
017 A
018 A
019 A
020 A
023 A
024 A
025 A
031 A
001 B
002 B
017 B
001 C
002 C
017 C
001 D
002 D
017 D
001 E
002 E
017 E
001 F
002 F
017 F
001 G
002 G
017 G
001 H
002 H
017 H
001 I
002 I
017 I
006 A
021 A
011 A
011 A
012 A
012 A
013 A
013 A
014 A
014 A
015 A
015 A
026 A
026 A
027 A
027 A
028 A
028 A
029 A
029 A
030 A
030 A
Len of census codes list:  288


In [29]:
# View an updated version of our table
county_census_viz_df[census_codes_list]

County,B01001B_003E,B01001B_004E,B01001B_005E,B01001B_008E,B01001B_009E,B01001B_010E,B01001B_016E,B01001B_018E,B01001B_019E,B01001B_020E,...,B01001I_026E_a,B01001I_026E_b,B01001I_027E_a,B01001I_027E_b,B01001I_028E_a,B01001I_028E_b,B01001I_029E_a,B01001I_029E_b,B01001I_030E_a,B01001I_030E_b
Alameda,3870.0,4218.0,4725.0,5654.0,7375.0,6607.0,1082.0,4029.0,4324.0,4988.0,...,13622.5,13622.5,11044.5,11044.5,7782.5,7782.5,4151.0,4151.0,1982.5,1982.5
Alpine,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,26.5,26.5,0.0,0.0,3.0,3.0,3.0,3.0,0.0,0.0
Amador,0.0,0.0,0.0,66.0,114.0,66.0,0.0,0.0,0.0,0.0,...,185.5,185.5,69.0,69.0,117.0,117.0,100.5,100.5,5.5,5.5
Butte,85.0,44.0,81.0,258.0,236.0,38.0,36.0,64.0,171.0,135.0,...,1106.0,1106.0,871.5,871.5,675.0,675.0,411.5,411.5,219.5,219.5
Calaveras,0.0,0.0,0.0,7.0,7.0,6.0,99.0,0.0,0.0,0.0,...,136.5,136.5,226.0,226.0,207.5,207.5,114.0,114.0,30.5,30.5
Colusa,0.0,0.0,0.0,14.0,66.0,11.0,46.0,16.0,0.0,0.0,...,408.5,408.5,351.5,351.5,241.0,241.0,103.0,103.0,45.0,45.0
Contra Costa,2419.0,2811.0,3464.0,3570.0,3940.0,2949.0,394.0,2243.0,2654.0,3248.0,...,10934.0,10934.0,9074.5,9074.5,6429.0,6429.0,3446.0,3446.0,1732.5,1732.5
Del Norte,22.0,0.0,19.0,89.0,93.0,47.0,0.0,0.0,0.0,18.0,...,162.0,162.0,113.5,113.5,101.0,101.0,45.5,45.5,5.0,5.0
El Dorado,5.0,23.0,50.0,102.0,45.0,99.0,4.0,0.0,11.0,18.0,...,846.0,846.0,770.0,770.0,659.0,659.0,360.5,360.5,158.0,158.0
Fresno,1809.0,1867.0,1903.0,1997.0,1917.0,1594.0,195.0,1841.0,1856.0,1641.0,...,17011.0,17011.0,13644.5,13644.5,10322.0,10322.0,6061.5,6061.5,2755.0,2755.0


### Pyramid Visualization
We can finally generate our demographic pyramid! This is where we configure the labels, axes, legends, and everything else visible on the chart.

One PNG image per region, named `mpl-bidirectional-<region>.png`, will be generated in the same folder as this notebook, so make a note of where this notebook is located on your computer.
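
Each bar in the pyramid is a stack of ethnicity segments, and the stacking relies on the `left` argument of `barh`: every segment starts where the previous ones ended. The sketch below shows that mechanism on made-up numbers with two generic groups. It keeps a running total instead of looking offsets up from a table as the notebook's helper functions do, so treat it as an illustration of the idea, not the notebook's implementation.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up counts for two groups across three age brackets
age_labels = ["Under 5 Years", "5 to 9 Years", "10 to 14 Years"]
group_counts = {"Group 1": [40, 50, 45], "Group 2": [10, 20, 15]}

fig, ax = plt.subplots()
left = [0] * len(age_labels)  # running total already drawn in each row
for group, counts in group_counts.items():
    # Each segment starts at the running total for its row
    ax.barh(range(len(age_labels)), counts, left=left, label=group)
    left = [so_far + count for so_far, count in zip(left, counts)]

ax.set(yticks=range(len(age_labels)), yticklabels=age_labels)
ax.legend()
print(left)
```

Remove the `matplotlib.use("Agg")` line when pasting this into the notebook, since the notebook already selects its own backend.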

#### How to Make Adjustments
If you want to make any adjustments:
1. Look at the comments in the code, marked by a "#" symbol, and the code below each comment
2. Search for what you'd like to do, prefixed with "matplotlib"
  - Example: Search "matplotlib remove x-axis tick marks"
3. Making these edits is a trial-and-error process; for the most part there's no single defined way. Be patient: copying, pasting, and experimenting are highly encouraged!
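
As an illustration of where such a search might lead, the example query above can be answered with `tick_params`. This is a minimal sketch on a throwaway chart, not on the pyramid itself, and it is one of several ways to achieve the same result.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.barh([0, 1], [3, 5])

# Hide the tick marks and the tick labels along the bottom x-axis
ax.tick_params(axis="x", bottom=False, labelbottom=False)

first_tick = ax.xaxis.get_major_ticks()[0]
print(first_tick.tick1line.get_visible(), first_tick.label1.get_visible())
```

In the pyramid function, the same call would be applied to `axes[0]` or `axes[1]` before the figure is saved.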

#### NOTE: Sierras Pyramid is Broken
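In the run recorded in this notebook, the male side of the Sierras pyramid did not invert like the other pyramids. The likely cause is the step that matches the two x-axis lengths: Sierras is the only region whose female axis is longer than its male axis (printed limits `(8482.95, 0.0)` and `(0.0, 8962.275)`), and resetting the male limits in ascending order undoes `invert_xaxis()`. Passing the male limits in descending order (largest value first) keeps the axis inverted. If a pyramid still comes out unflipped, the manual workaround is to flip the male chart in Photoshop. If you find a better fix in code, please update this code!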
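
One possible explanation, offered as a sketch and not a confirmed diagnosis, is that calling `set_xlim` with ascending limits on an axis that was already inverted flips it back. The toy chart below reproduces that behaviour with made-up values where the right side is wider than the left, and shows that descending limits keep the left axis inverted. The variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

fig, (left_ax, right_ax) = plt.subplots(ncols=2, sharey=True)
left_ax.barh([0, 1], [80, 60])
right_ax.barh([0, 1], [90, 70])  # right side is wider than the left side

left_ax.invert_xaxis()
inverted_before = left_ax.xaxis_inverted()

# Longest of the two axes (the inverted axis reports its maximum first)
shared_max = max(left_ax.get_xlim()[0], right_ax.get_xlim()[1])

# Ascending limits flip the left axis back to a normal orientation
left_ax.set_xlim([0, shared_max])
inverted_after_ascending = left_ax.xaxis_inverted()

# Descending limits keep the left axis inverted
left_ax.set_xlim([shared_max, 0])
right_ax.set_xlim([0, shared_max])
inverted_after_descending = left_ax.xaxis_inverted()

print(inverted_before, inverted_after_ascending, inverted_after_descending)
```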

In [30]:
# Generate Regional Age Pyramid
def regional_age_pyramid(region):
    fig, axes = plt.subplots(figsize=(12,6.5), facecolor='#eaeaf2', ncols=2, sharey=True)
    for census_code in census_codes_list:
        demo_details = get_demo_col(census_code, region)
        gender_index, age_index, eth_left_off = demo_details['chart']
        eth_code = census_code[6:7]
        region_df = county_census_viz_df[county_census_df['Region'] == region]

        # Get region code value
        region_code = census_code[:12]
        region_age = region_code[8:11]
        if region_age in MERGED_AGES:
            region_code = census_code[:12] + '_merged'
        if region_age in SPLIT_AGES:
            region_code = census_code[:12] + '_a'
        region_code_val = region_df[region_code].sum()
        axes[gender_index].barh(age_index,
                                region_code_val,
                                left=eth_left_off,
                                color=ETH_COLOR_CODES_DICT[eth_code],
                                label=ETH_CODES_DICT[eth_code]
                                )

    # Set Titles
    fig.suptitle(region + ' Demographic Pyramid')
    axes[0].set_title('Male', fontsize=18, pad=15, zorder=10)
    axes[1].set_title('Female', fontsize=18, pad=15, zorder=10)

    # Editing ticks
    axes[0].set(yticks=list(range(0,len(AGE_RANGES))), yticklabels=AGE_RANGES)
    axes[0].yaxis.tick_left()
    axes[0].tick_params(axis='x', rotation=30)
    axes[1].tick_params(axis='x', rotation=-30)

    # Flip men to be the left chart
    axes[0].invert_xaxis()

    # Adjust subplots
    plt.subplots_adjust(wspace=0, top=0.85, bottom=0.1, left=0.18, right=0.95)

    # Set men and women axis to be the longer length of the two
    # EX: if men is 50,000 and women is 30,000, choose 50,000 as the x lim
    print(axes[0].get_xlim(), axes[1].get_xlim())
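    if axes[0].get_xlim()[0] > axes[1].get_xlim()[1]:
        axes[1].set_xlim([0, axes[0].get_xlim()[0]])
    else:
        # The male axis is inverted, so its limits must run from the largest value down to 0.
        # Passing [0, max] here would flip the male chart back to a normal axis.
        axes[0].set_xlim([axes[1].get_xlim()[1], 0])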

    # Remove zero from one of the axis
    xticks = axes[1].xaxis.get_major_ticks()
    xticks[0].label1.set_visible(False)

    # Legend
    # axes[0].legend()
    handles, labels = axes[0].get_legend_handles_labels()
    n = 7
    unique_handles = handles[n - 1::n][:8]
    unique_labels = labels[n-1::n][:8]
    axes[1].legend(unique_handles, unique_labels,
               loc='center left', bbox_to_anchor=(1, 0.5)
               )
    box1 = axes[0].get_position()
    box2 = axes[1].get_position()
    axes[0].set_position([box1.x0, box1.y0, box1.width * 0.65, box1.height])
    axes[1].set_position([box1.width * 1.12, box2.y0, box2.width * 0.65, box2.height])

    filename = 'mpl-bidirectional-' + region
    plt.savefig(filename+'.png', facecolor='#eaeaf2')

regions = county_census_viz_df['Region'].unique().tolist()
for region in regions:
    regional_age_pyramid(region)

(384943.65, 0.0) (0.0, 357097.65)
(8482.95, 0.0) (0.0, 8962.275)
(43670.55, 0.0) (0.0, 42379.575)
(110490.45, 0.0) (0.0, 106491.0)
(239272.95, 0.0) (0.0, 225811.95)
(189808.5, 0.0) (0.0, 168321.3)
(738277.05, 0.0) (0.0, 702513.0)
(127725.15, 0.0) (0.0, 119096.25)
(247705.5, 0.0) (0.0, 233937.9)
