# Incarcerated California Population 1991-2015

Incarceration data has typically been collected at the state level, but state-level data is insufficient to understand local trends in incarceration. The individuals integral to the incarceration pipeline (e.g. officers, judges, and lawyers) all act at the county level rather than the state level. Therefore, it is important to investigate incarceration rates at the sub-state level (i.e. by geographic area, urbanicity, and county).

In this assignment we will investigate incarcerations in the state of California between 1991 and 2015 at the sub-state level. 


## Dataset: 

The dataset contains county-level incarceration and population statistics (11400 samples) for the state of California between 1991-2015. Every county has been categorized according to urbanicity (i.e. urban, mid-small, suburban, and rural) and geographic area (i.e. Bay Area, Central Coast, Southern California Coastal, etc.). The total number of individuals aged 15-64 and the number of incarcerated individuals in each population category (i.e. race, gender, Total) have been recorded.

The dataset was obtained from a member of the [R for Data Science community](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-01-22), who tidied data published by the [Vera Institute of Justice](https://www.vera.org/) on [GitHub](https://github.com/vera-institute/incarceration-trends). The original dataset was aggregated from county-level jail data (1970-2015) and prison data (1983-2015) across the United States. An explanation of the dataset curation can be found under the subsection Incarceration Trends Data [here](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-01-22).

A subset of the original dataset corresponding to the state of California was selected to simplify the data quality check and downstream analyses. The dataset used in this assignment is available at [my GitHub account](https://github.com/babrahamson/CSE_512).

## Analysis Questions:

We will begin our analysis with an investigation into the quality of our dataset (Q1). Counties with insufficient data for the subsequent analysis will be dropped from the dataset. This will be followed by an exploration of annual population (incarcerated and total) trends (Q2-3). We will conclude our analysis with an investigation into the relationship between race and incarceration (Q4-5).

1.	Are there any gaps in the dataset?
2.	Is there a relationship between the annual incarcerated and total population?
3.	Is there a relationship between urbanicity and the annual incarcerated and total population?
4.	How does race correlate with the incarcerated and aged 15-64 population?
5.	Are there trends between the incarceration of different racial groups?

In [261]:
import pandas as pd
import altair as alt
import numpy as np

In [262]:
# Loading US Map
from vega_datasets import data

In [263]:
# Creating Look-up Dictionary
area = {'Alameda': 'Bay Area', 'Amador': 'Mountain South', 
        'Butte': 'Central Valley North', 'Calaveras': 'Mountain South',
        'Colusa':'Central Valley North','Contra Costa':'Bay Area',
        'Del Norte':'Costal North', 'El Dorado': 'Mountain South',
        'Fresno':'Central Valley South', 'Glenn':'Central Valley North',
        'Humboldt':'Costal North','Imperial':'Southern California Inland',
        'Inyo':'Mountain South', 'Kern':'Central Valley South',
        'Kings':'Central Valley South', 'Lake':'Costal North', 
        'Lassen':'Mountain North','Los Angeles':'Southern California Coastal', 
        'Madera':'Central Valley South', 'Marin':'Bay Area',
        'Mariposa':'Central Valley South', 'Mendocino':'Costal North',
        'Merced':'Central Valley South', 'Modoc':'Mountain North',
        'Mono':'Mountain South', 'Monterey':'Central Coast',
        'Napa':'Bay Area', 'Nevada':'Mountain North', 
        'Orange':'Southern California Coastal', 'Placer':'Greater Sacramento', 
        'Plumas':'Mountain North', 'Riverside':'Southern California Inland',
        'Sacramento':'Greater Sacramento', 'San Benito':'Central Coast',
        'San Bernardino':'Southern California Inland','San Diego':'Southern California Coastal',
        'San Francisco':'Bay Area', 'San Joaquin':'Central Valley South',
        'San Luis Obispo':'Central Coast', 'San Mateo':'Bay Area',
        'Santa Barbara':'Central Coast', 'Santa Clara':'Bay Area',
        'Santa Cruz':'Central Coast', 'Shasta':'Central Valley North', 
        'Sierra':'Mountain North', 'Siskiyou':'Mountain North',
        'Solano':'Greater Sacramento', 'Sonoma':'Bay Area',
        'Stanislaus':'Central Valley South', 'Sutter':'Central Valley North',
        'Tehama':'Central Valley North', 'Trinity':'Mountain North', 'Tulare':'Central Valley South',
        'Tuolumne':'Mountain South','Ventura':'Southern California Coastal',
        'Yolo':'Greater Sacramento', 'Yuba':'Central Valley North'}

## Question 1: Are there any gaps in the dataset? 

We will begin our exploration by investigating the quality of the dataset.

In [264]:
# Loading California Prison and total Population Dataset
df = pd.read_csv('California_Prison_Population_v4.csv')
df['area'] = df['county_name'].map(area)
df.head(10)

| | year | county_ID | county_name | urbanicity | region | pop_category | population | prison_population | area |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1991 | 6001 | Alameda | urban | West | Total | 896633.0 | 3804.0 | Bay Area |
| 1 | 1992 | 6001 | Alameda | urban | West | Total | 905440.0 | 3863.0 | Bay Area |
| 2 | 1993 | 6001 | Alameda | urban | West | Total | 905948.0 | 3987.0 | Bay Area |
| 3 | 1994 | 6001 | Alameda | urban | West | Total | 904792.0 | 4236.0 | Bay Area |
| 4 | 1995 | 6001 | Alameda | urban | West | Total | 906566.0 | 4325.0 | Bay Area |
| 5 | 1996 | 6001 | Alameda | urban | West | Total | 914648.0 | 4713.0 | Bay Area |
| 6 | 1997 | 6001 | Alameda | urban | West | Total | 935240.0 | 4873.0 | Bay Area |
| 7 | 1998 | 6001 | Alameda | urban | West | Total | 960308.0 | 5046.0 | Bay Area |
| 8 | 1999 | 6001 | Alameda | urban | West | Total | 980847.0 | 5144.0 | Bay Area |
| 9 | 2000 | 6001 | Alameda | urban | West | Total | 1001712.0 | 5401.0 | Bay Area |


### Is it true that missing or null values are present in the dataset?

In [265]:
df.isnull().values.any()

True

### Which columns contain missing or null values?

In [266]:
df.isnull().sum()

year                    0
county_ID               0
county_name             0
urbanicity              0
region                  0
pop_category            0
population           1425
prison_population     658
area                    0
dtype: int64

Missing or null values are present in two columns: the aged 15-64 population (1425 values) and the incarcerated population (658 values).
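The column totals above do not say which population categories the gaps fall in. A minimal sketch of one way to break the counts down by category is shown here. The `toy` frame and its values are hypothetical stand-ins for `df`, and this is an alternative to the loop-based summary in the next cell rather than the method the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table (hypothetical values, not from the dataset)
toy = pd.DataFrame({
    'pop_category':      ['Total', 'Total', 'Asian', 'Asian', 'Other', 'Other'],
    'population':        [1000.0, 1100.0, 100.0, 110.0, np.nan, np.nan],
    'prison_population': [10.0, 11.0, np.nan, 1.0, 2.0, 3.0],
})

# Flag missing cells, then count the flags within each population category
null_counts = (toy[['population', 'prison_population']]
               .isnull()
               .groupby(toy['pop_category'])
               .sum())

# Convert the counts to a percentage of the rows in each category
rows_per_category = toy.groupby('pop_category').size()
null_percent = null_counts.div(rows_per_category, axis=0) * 100

print(null_percent)
```

Grouping the boolean mask directly keeps the category labels attached to the counts, so the result does not depend on the order in which categories are listed.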

In [267]:
# Determining Null Elements in Population and Prison Population Data

# Creating Null Data Frame
null_columns = df.columns[df.isnull().any()]
headers = ['year','county_name','area','urbanicity','pop_category','population','prison_population']
null_df = df[df.isnull().any(axis=1)][headers]

# Summing over the Null Data Frame
col_name    = ['population','prison_population']
cat         = ['Total','Female','Male','Asian','Black','Latino','Native American','Other']
# Build the summary from `cat` so each row lines up with the counts computed below
null_sum_df = pd.DataFrame(cat,columns=['pop_category'])
sums        = np.zeros(len(cat))
full        = np.ones(len(cat))*100
cat_size    = df.shape[0]/len(cat)   # number of rows per population category

for name in col_name:
    for j, category in enumerate(cat):
        sums[j] = null_df[name][null_df['pop_category']==category].isnull().sum()
    # Store a copy so the next pass through the loop cannot overwrite this column
    null_sum_df['null_' + name] = sums.copy()
    null_sum_df['null_' + name + '_prop'] = sums/cat_size*100

null_sum_df['full_percent']=full

In [268]:
# Plotting the percentage of missing data

# 
bar_1 = alt.Chart(null_sum_df).mark_bar().encode(
    alt.Y('pop_category:N',
         axis=alt.Axis(title='Category')),
    alt.X('null_population_prop:Q',
        scale = alt.Scale(domain=[0,100]),
        axis=alt.Axis(title='Percentage [%]')),
    alt.Color('pop_category:N'),
).properties(
    width=400,
    height=100,
    title='Missing Aged 15-64 Population Data'
)

bar_2 = alt.Chart(null_sum_df).mark_bar().encode(
    alt.Y('pop_category:N',
         axis=alt.Axis(title='Category')),
    alt.X('null_prison_population_prop:Q',
        scale = alt.Scale(domain=[0,100]),
        axis=alt.Axis(title='Percentage [%]')),
    alt.Color('pop_category:N'),
).properties(
    width=400,
    height=100,
    title='Missing Incarcerated Population Data'
)

### Which categories are missing data in the aged 15-64 population and incarcerated population datasets?

In [269]:
bar_1&bar_2

We see that the aged 15-64 dataset contains no data for the 'Other' racial category. In addition, the incarcerated dataset has gaps (each below roughly 10%) in every racial and gender category.

Due to the gaps in the dataset, we are unable to calculate the number of 'White' individuals in the aged 15-64 population.

### Which counties are missing racial data?

Subsequent analysis will investigate trends in the incarceration of racial groups. Counties containing gaps in racial data will be dropped to ensure the insufficient data does not interfere with the data analysis.

We will start this analysis by looking at the percentage of annual incarceration data missing from each racial category per county.

In [270]:
header = ['year','county_name','urbanicity','pop_category','prison_population']
null_df_drop = null_df[null_df[header].isnull().any(axis=1)]

# Dropping 'Female' and 'Male' from dataset
for cat in ['Female','Male']:
    drop_index   = null_df_drop[null_df_drop['pop_category']==cat].index.values
    null_df_drop = null_df_drop.drop(drop_index)

# Creating Data Frame for missing data
counties = null_df_drop['county_name'].unique()
unique_null_df = pd.DataFrame(counties,columns=['county_name'])
unique_null_df['area'] = unique_null_df['county_name'].map(area)

# Determining missing racial data per county
race  = ['Asian','Black','Latino','Native American']
sums = np.zeros(len(counties))

for i in range(len(race)):
    for j in range(len(counties)):
        sums[j] = null_df_drop['prison_population'][(null_df_drop['pop_category']==race[i])&
                                  (null_df_drop['county_name']==counties[j])].isnull().sum()
    unique_null_df[race[i]] = sums/25*100

In [271]:
unique_null_df

| | county_name | area | Asian | Black | Latino | Native American |
|---|---|---|---|---|---|---|
| 0 | Butte | Central Valley North | 24.0 | 0.0 | 0.0 | 0.0 |
| 1 | Glenn | Central Valley North | 28.0 | 52.0 | 0.0 | 24.0 |
| 2 | Humboldt | Costal North | 60.0 | 0.0 | 0.0 | 0.0 |
| 3 | Kings | Central Valley South | 52.0 | 0.0 | 0.0 | 0.0 |
| 4 | Marin | Bay Area | 88.0 | 0.0 | 0.0 | 32.0 |
| 5 | Napa | Bay Area | 32.0 | 0.0 | 0.0 | 56.0 |
| 6 | Plumas | Mountain North | 8.0 | 52.0 | 48.0 | 80.0 |
| 7 | San Benito | Central Coast | 100.0 | 88.0 | 0.0 | 24.0 |
| 8 | Santa Barbara | Central Coast | 36.0 | 0.0 | 0.0 | 0.0 |
| 9 | Shasta | Central Valley North | 28.0 | 0.0 | 0.0 | 0.0 |


We see that 20 counties are missing racial data to varying degrees, but it is difficult to understand or compare this data. A heatmap of this dataset is presented below, which makes it easier to visualize the amount of data missing for each county.
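The per-county percentages tabulated above can also be computed without explicit loops. The sketch below is one possible approach, assuming a long-format table with the same column names as `df`. The `toy` frame, its county names, and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy long-format data: two hypothetical counties, two years, two racial categories
toy = pd.DataFrame({
    'county_name':       ['A'] * 4 + ['B'] * 4,
    'year':              [1991, 1992, 1991, 1992] * 2,
    'pop_category':      ['Asian', 'Asian', 'Black', 'Black'] * 2,
    'prison_population': [np.nan, 5.0, 7.0, 8.0, 1.0, 2.0, 3.0, 4.0],
})

# The mean of a boolean "is missing" flag is the fraction of missing years,
# so multiplying by 100 gives the percentage of missing annual values
missing_pct = (toy['prison_population'].isnull()
               .groupby([toy['county_name'], toy['pop_category']])
               .mean()
               .unstack() * 100)

# Counties with at least one gap in any racial category
counties_with_gaps = missing_pct.index[(missing_pct > 0).any(axis=1)].tolist()

print(missing_pct)
print(counties_with_gaps)
```

Because the percentage is a group mean, it adapts to the number of years actually present per county instead of relying on a hard-coded 25.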

In [272]:
melt_vars  =['Asian','Black','Latino','Native American']
other_vars = ['county_name','area']
melt_null_df = pd.melt(unique_null_df, id_vars=other_vars, value_vars=melt_vars,
        var_name='race', value_name='num_null')

Heat_1 = alt.Chart(melt_null_df).mark_rect().encode(
    alt.Y('race:N',
         axis=alt.Axis(title='Race')),
    alt.X('county_name:N',
         axis=alt.Axis(title='County')),
    alt.Color('num_null:Q', 
              scale=alt.Scale(scheme='viridis'),
             bin=alt.Bin(maxbins=25),
             legend=alt.Legend(title="Percentage of Missing Data",orient='top')),
).properties(
    width=600,
    height=112.5,
)

In [273]:
Heat_1

Among the 20 counties with gaps, the majority (65%) are missing data for the incarcerated Asian and Native American populations, while only 30% and 20% are missing data for the incarcerated Black and Latino populations, respectively.

While analyzing the incarceration of racial groups, the 20 counties missing a portion of racial data will be excluded from the analysis.  

### Conclusions:

1. No gaps were present in the 'Total' aged 15-64 and incarcerated datasets
2. All values were absent from the 'Other' racial group in the aged 15-64 dataset
3. 20 counties contained gaps in the incarceration by race data
4. Due to gaps in the race categories, the 'White' population could not be calculated.

## Question 2: Is there a relationship between the annual incarcerated and total population?

We will begin our analysis by determining the annual incarcerated and aged 15-64 population for the state of California.

In [274]:
# Adding Proportion Category to data frame (incarcerated individuals per 100 residents aged 15-64)
df['proportion'] = df['prison_population']/df['population']*100

# Creating data frame for county-level prison population
pp_county_total = df.loc[df['pop_category'] == 'Total']

In [275]:
# Creating data frame for yearly prison population
pp_year_total = pp_county_total.groupby('year',as_index=False)[['population','prison_population']].sum()
pp_year_total['proportion'] = pp_year_total['prison_population']/pp_year_total['population']*100

In [276]:
# Plotting total population (aged 15-64) and prison population by year
Prison_pop = alt.Chart(pp_year_total).mark_line().encode(
    alt.X('year:O'),
    alt.Y('prison_population:Q',
        axis=alt.Axis(title='Prison Population')),
)

Population = alt.Chart(pp_year_total).mark_line().encode(
    alt.X('year:O'),
    alt.Y('population:Q',
        axis=alt.Axis(title='Population'),
        scale = alt.Scale(domain=[20000000,27000000])),
    color=alt.value('black'),
)

In [277]:
Population & Prison_pop

We see that the population of California increases continually between 1991 and 2015, but the prison population does not follow the same trend. Instead, we observe an increase in the prison population between 1991 and 1998, followed by a plateau between 1999 and 2002. This plateau coincides with the governorship of Gray Davis. The California prison population increases again between 2003 and 2006 and plateaus again between 2007 and 2010. This entire timeframe corresponds to the governorship of Arnold Schwarzenegger. The California prison population declines between 2011 and 2015 under the governorship of Jerry Brown.

It is difficult to draw any conclusions about the California prison population from raw counts, because there is no direct correlation between the total incarcerated population and the aged 15-64 population. Therefore, the incarcerated population was normalized by the aged 15-64 population.
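As a small worked instance of this normalization, the sketch below applies the same formula to two rows taken from the `df.head(10)` output shown earlier (Alameda County, 'Total' category). The `sample` frame is only an illustration; the notebook itself applies the formula to the statewide annual sums.

```python
import pandas as pd

# Two rows copied from the df.head(10) output (Alameda County, 'Total' category)
sample = pd.DataFrame({
    'year':              [1991, 2000],
    'population':        [896633.0, 1001712.0],
    'prison_population': [3804.0, 5401.0],
})

# Incarcerated individuals per 100 residents aged 15-64
sample['proportion'] = sample['prison_population'] / sample['population'] * 100

print(sample[['year', 'proportion']].round(3))
```

Expressing the prison population as a percentage makes years with different population sizes directly comparable.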

In [278]:
# Plotting the normalized prison population (percentage of the aged 15-64 population) by year
Normalized_Prison_Pop = alt.Chart(pp_year_total).mark_bar().encode(
    alt.X('year:O'),
    alt.Y('proportion:Q',
        axis=alt.Axis(title='Percentage of Population Incarcerated [%]'))
).properties(title='Normalized Annual Incarcerated Prison Population')
# NOTES: change year format, change line color, add title

# Make chart with percent change year to year

In [279]:
Normalized_Prison_Pop

After normalizing the data, we observe the same general trend that we saw in the total incarcerated population data.

## Question 3: Is there a relationship between urbanicity and the annual incarcerated population?

We will begin our analysis by determining the annual incarcerated and aged 15-64 population for each urbanicity category in the state of California.

In [280]:
# Creating data frame for yearly prison population by urbanicity
pp_year_urban = pp_county_total.groupby(['year','urbanicity'],as_index=False)[['population','prison_population']].sum()
pp_year_urban['proportion'] = pp_year_urban['prison_population']/pp_year_urban['population']*100

In [281]:
# Urbanicity Line and Scatter Plot  [Add tool tip for year]
source = pp_year_urban

urbanicity_line_1 = alt.Chart(source).mark_line().encode(
    alt.X('year:O',
        axis=alt.Axis(title='Year')),
    alt.Y('prison_population:Q',
        axis=alt.Axis(title='Prison Population')),
    alt.Color('urbanicity:N'),
)

urbanicity_scatter_1 = alt.Chart(source).mark_point(filled=True).encode(
    alt.X('year:O',
        axis=alt.Axis(title='Year')),
    alt.Y('prison_population:Q',
        axis=alt.Axis(title='Prison Population')),
    alt.Color('urbanicity:N'),
).properties(title='Annual Prison Population')
# NOTES: change year format, change line color, add title

In [282]:
urbanicity_line_1 + urbanicity_scatter_1

As expected, the incarcerated population tracks urbanicity, with urban areas having the largest prison population and rural areas having the smallest. It should be noted that the trend seen in the statewide annual incarcerated population is reflected in each urbanicity category.

Since the population differs significantly between urbanicity categories, the data must be normalized before the incarcerated populations can be compared.
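When normalizing a group of counties, the order of operations matters. The sketch below contrasts the pooled rate (sum the counts, then divide, which is what the `groupby(...).sum()` cell above does) with the average of county-level rates. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Two hypothetical counties in the same urbanicity category and year
toy = pd.DataFrame({
    'urbanicity':        ['rural', 'rural'],
    'population':        [10000.0, 90000.0],
    'prison_population': [100.0, 450.0],
})

# Pooled rate: sum both columns first, then divide
pooled = toy.groupby('urbanicity')[['population', 'prison_population']].sum()
pooled_rate = pooled['prison_population'] / pooled['population'] * 100

# Alternative: average the county-level rates, which weights every county equally
county_rate = toy['prison_population'] / toy['population'] * 100
mean_of_rates = county_rate.groupby(toy['urbanicity']).mean()

print(pooled_rate['rural'], mean_of_rates['rural'])
```

The pooled rate weights each county by its population, so a small county with an unusual rate moves it less than it would move the simple average.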

In [283]:
source = pp_year_urban

urbanicity_line_1_normal = alt.Chart(source).mark_line().encode(
    alt.X('year:O',
        axis=alt.Axis(title='Year')),
    alt.Y('proportion:Q',
        axis=alt.Axis(title='Percentage of Population Incarcerated [%]')),
    alt.Color('urbanicity:N'),
)

urbanicity_point_1_normal = alt.Chart(source).mark_point(filled=True).encode(
    alt.X('year:O',
        axis=alt.Axis(title='Year')),
    alt.Y('proportion:Q',
        axis=alt.Axis(title='Percentage of Population Incarcerated [%]')),
    alt.Color('urbanicity:N'),
).properties(title='Normalized Incarcerated Population')

In [284]:
urbanicity_line_1_normal + urbanicity_point_1_normal

After normalization, we observe overlap between the urban, small/mid-sized, and rural categories. Many socio-economic factors could potentially explain the lower percentage of the suburban population that is incarcerated. It should be noted that every normalized urbanicity trend follows the normalized trend for the total population.

In [285]:
# Urbanicity Box Plot [Add average & quartile plot]  [Add tool tip for year]
source = pp_year_urban

urbanicity_box = alt.Chart(source).mark_boxplot().encode(
    alt.X('proportion:Q',
        axis=alt.Axis(title='Percentage of Population Incarcerated [%]')),
    alt.Y('urbanicity:N'),
    alt.Color('urbanicity:N'),
).properties(title='Normalized Incarcerated Population')

In [286]:
urbanicity_box

By using a box plot, we are able to observe that the median normalized incarcerated population for suburban areas is substantially lower than for rural, small/mid-sized, and urban areas. We are also able to observe that small/mid-sized counties have the largest median normalized incarcerated population.
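The centre line of a box plot marks the median rather than the mean, so the two statistics are worth checking side by side. A minimal sketch, assuming a frame shaped like `pp_year_urban`; the `toy` values below are hypothetical.

```python
import pandas as pd

# Hypothetical normalized values for two urbanicity categories
toy = pd.DataFrame({
    'urbanicity': ['suburban'] * 3 + ['rural'] * 3,
    'proportion': [0.2, 0.3, 0.4, 0.5, 0.6, 1.0],
})

# Mean and median per category; a skewed group shows a gap between the two
stats = toy.groupby('urbanicity')['proportion'].agg(['mean', 'median'])

print(stats)
```

In this toy data the rural mean is pulled above the rural median by a single large value, which is the kind of difference a box plot alone would not reveal.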

## Question 4: How does race correlate with the annual incarcerated and total population?

We will begin our analysis by performing the following data manipulations:
* Removing counties with gaps in incarcerated racial data
* Determining the proportion of each race (i.e. Asian, Black, Latino, and Native American) in the aged 15-64 and incarcerated population each year for the state of California

In [287]:
# Creating data frame for race by excluding counties with insufficient race data

# Variables for data clean-up
bad_race_data = counties
cat_1 = ['year','pop_category']
cat_2 = ['year','county_name','area','urbanicity','pop_category']

# Starting again from the full imported data frame (all population categories)
pp_county_total = df

# Removing counties with gaps in the racial data
pp_county_total = pp_county_total.loc[~pp_county_total['county_name'].isin(bad_race_data)]

# Grouping cleaned up data by year and population category
pp_year_race_1 = pp_county_total.groupby(cat_1,as_index=False)[['population','prison_population']].sum()
pp_year_race_2 = pp_county_total.groupby(cat_2,as_index=False)[['population','prison_population']].sum()

In [288]:
def race_data_convert(data_frame,categories):
    # Population categories handled by this function
    sub_vars = ['Asian','Black','Female','Latino','Male','Native American','Other']
    val_vars = ['Total'] + sub_vars

    # Unmelting Data Frame (one column per population category)
    Population = data_frame.set_index(categories)['population'].unstack().reset_index()
    Prison_Pop = data_frame.set_index(categories)['prison_population'].unstack().reset_index()

    # Identifier columns shared by every frame
    other_vars = Population.columns.difference(val_vars)

    # Independent copies, so that editing one frame cannot overwrite another
    Population_2  = Population.copy()
    Prison_Pop_2  = Prison_Pop.copy()
    Race_Pop_Prop = Population.copy()
    Race_Pri_Prop = Prison_Pop.copy()

    # Calculating Proportion of each category in the Population and in Prison
    for race in val_vars:
        Race_Pop_Prop[race] = Population[race]/Population['Total']*100
        Race_Pri_Prop[race] = Prison_Pop[race]/Prison_Pop['Total']*100

    # Creating Tidy Data Frames
    Population_2 = pd.melt(Population_2, id_vars=other_vars, value_vars=val_vars,var_name='pop_category', value_name='population')
    Prison_Pop_2 = pd.melt(Prison_Pop_2, id_vars=other_vars, value_vars=val_vars,var_name='pop_category', value_name='prison_pop')
    Race_Pop_Prop = pd.melt(Race_Pop_Prop, id_vars=other_vars, value_vars=val_vars,var_name='pop_category', value_name='pop_race_prop')
    Race_Pri_Prop = pd.melt(Race_Pri_Prop, id_vars=other_vars, value_vars=val_vars,var_name='pop_category', value_name='pris_race_prop')

    # Merging Tidy Data (rows line up because every melt uses the same id_vars and value_vars)
    df_race = Population_2
    df_race['prison_pop']=Prison_Pop_2['prison_pop']
    df_race['pop_race_prop']=Race_Pop_Prop['pop_race_prop']
    df_race['pris_race_prop']=Race_Pri_Prop['pris_race_prop']

    return df_race

In [289]:
# Calculating proportion of Population
pp_year_race_mod_1 = race_data_convert(pp_year_race_1,cat_1)
pp_year_race_mod_2 = race_data_convert(pp_year_race_2,cat_2)

# Removing Total category from data frame
pp_year_race_mod_1 = pp_year_race_mod_1.loc[pp_year_race_mod_1 ['pop_category'] != 'Total']
pp_year_race_mod_2 = pp_year_race_mod_2.loc[pp_year_race_mod_2 ['pop_category'] != 'Total']

# Setting new variables for manipulation
pp_year_race_ed_1   = pp_year_race_mod_1
pp_year_race_ed_2   = pp_year_race_mod_2
pp_year_gender_ed_1 = pp_year_race_mod_1
pp_year_gender_ed_2 = pp_year_race_mod_2


for gender in ['Male','Female']:
    pp_year_race_ed_1 = pp_year_race_ed_1.loc[pp_year_race_ed_1 ['pop_category'] != gender]
for gender in ['Male','Female']:
    pp_year_race_ed_2 = pp_year_race_ed_2.loc[pp_year_race_ed_2 ['pop_category'] != gender]

In [290]:
# Race Scatter Plot [Yearly Data]
source = pp_year_race_ed_1.loc[pp_year_race_ed_1 ['pop_category'] != 'Other']

race_scatter_1 = alt.Chart(source).mark_point(filled=True, opacity=0.5).encode(
    alt.X('pop_race_prop:Q'),
    alt.Y('pris_race_prop:Q'),
    alt.Color('pop_category:N'),
)

line_df = pd.DataFrame({
    'x': [0,45],
    'y': [0,45],
})

race_scatter_line = alt.Chart(line_df).mark_line(strokeDash=[5,5]).encode(
    alt.X('x:Q',
        scale = alt.Scale(domain=[0,45]),
        axis=alt.Axis(title='Aged 15-64 Population Percentage [%]')),
    alt.Y('y:Q',
        scale = alt.Scale(domain=[0,45]),
        axis=alt.Axis(title='Prison Percentage  [%]')),
    color=alt.value('black'),
).properties(title='Representation of Race in the Incarcerated Population')

In [291]:
race_scatter_line + race_scatter_1

In the state of California, the Black population is overrepresented in the incarcerated population, the Latino population is slightly overrepresented, the Asian population is underrepresented, and the Native American population is represented roughly in proportion to its share of the aged 15-64 population. The dashed black line marks equal representation between the prison and aged 15-64 populations.

This scatterplot is informative, but the statewide annual values provide too few points to reveal underlying trends in the dataset. Therefore, another scatterplot was generated from the annual county-level data included in the dataset.
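Distance from the dashed line can be summarized as a single number: the ratio of a group's share of the prison population to its share of the aged 15-64 population. The sketch below illustrates the idea. The `representation_ratio` column is a hypothetical addition that the notebook does not compute, and the group names and values are made up.

```python
import pandas as pd

# Hypothetical shares (percent) for illustration; not values from the dataset
toy = pd.DataFrame({
    'pop_category':   ['Group A', 'Group B'],
    'pop_race_prop':  [10.0, 20.0],   # share of the aged 15-64 population
    'pris_race_prop': [25.0, 10.0],   # share of the incarcerated population
})

# Ratio above 1: overrepresented in prison (point lies above the dashed line)
# Ratio below 1: underrepresented (point lies below the dashed line)
toy['representation_ratio'] = toy['pris_race_prop'] / toy['pop_race_prop']

print(toy)
```

A ratio keeps the comparison meaningful for small groups, where the absolute gap between the two percentages is necessarily small.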

In [292]:
# Race Scatter Plot [County Data]
source = pp_year_race_ed_2.loc[pp_year_race_ed_2 ['pop_category'] != 'Other']

race_scatter_2 = alt.Chart(source).mark_point(filled=True, size=15, opacity=0.5).encode(
    alt.X('pop_race_prop:Q',
        scale = alt.Scale(domain=[0,80]),
        axis=alt.Axis(title='Population Percentage [%]')),
    alt.Y('pris_race_prop:Q',
        scale = alt.Scale(domain=[0,80]),
        axis=alt.Axis(title='Prison Percentage [%]')),
    alt.Color('pop_category:N'),
)

line_df = pd.DataFrame({
    'x': [0,80],
    'y': [0,80],
})

race_scatter_line_1 = alt.Chart(line_df).mark_line(strokeDash=[5,5]).encode(
    alt.X('x:Q',
        scale = alt.Scale(domain=[0,80]),
        axis=alt.Axis(title='Aged 15-64 Population Percentage [%]')),
    alt.Y('y:Q',
        scale = alt.Scale(domain=[0,80]),
        axis=alt.Axis(title='Prison Percentage  [%]')),
    color=alt.value('black'),
).properties(title='Representation of Race in the Incarcerated Population')

In [293]:
race_scatter_line_1 + race_scatter_2

Similar trends are observed when plotting the entire dataset. We continue to observe an overrepresentation of the Black population and an underrepresentation of the Asian population in the incarcerated population. Although there is significantly more variability, the Latino population appears to be represented in prison roughly in proportion to its population share. The most significant deviation is in the Native American population. As the fraction of Native Americans in the aged 15-64 population increases, the Native American population becomes more overrepresented in the incarcerated population.
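One way to check a claim such as "overrepresentation grows with population share" is to fit a straight line to prison share against population share and compare its slope with 1, the slope of the equal-representation line. This is only a sketch under that assumption, with made-up numbers; the notebook itself reads the trend from the scatterplot.

```python
import numpy as np

# Hypothetical shares (percent) for one group across four counties
pop_share    = np.array([1.0, 2.0, 5.0, 10.0])
prison_share = np.array([1.0, 3.0, 9.0, 21.0])

# Least-squares line: prison_share is approximately slope * pop_share + intercept
slope, intercept = np.polyfit(pop_share, prison_share, 1)

# A slope above 1 means prison share rises faster than population share
print(slope > 1)
```

A straight-line fit is a coarse summary, so it should be read alongside the scatterplot rather than in place of it.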

In [294]:
# Define Facet
def race_facet(data,cat):
    # Plotting Scatter Plot Matrix
    race_facet_fig = alt.layer(
    
    alt.Chart(data).mark_point(filled=True, size=15, opacity=0.5).encode(
        alt.X('pop_race_prop:Q',
            axis=alt.Axis(title='Aged 15-64 Population [%]')),
        alt.Y('pris_race_prop:Q',
            axis=alt.Axis(title='Prison Percentage [%]')),
        alt.Color(cat[1]+':N'))
    
    ).properties(
      width=150,
      height=150
    ).facet(
    alt.Facet('pop_category:N',title='Proportion of Population in Prison'),
    )

    return race_facet_fig

In [295]:
# Faceted Plots
data = pp_year_race_ed_2.loc[pp_year_race_ed_2 ['pop_category'] != 'Other']
scm_cat_1 = ['year','area','county_name','pop_category']
scm_cat_2 = ['year','urbanicity','county_name','pop_category']

race_facet_fig_1 = race_facet(data,scm_cat_1)
race_facet_fig_2 = race_facet(data,scm_cat_2)

Next, we wanted to see how the incarceration of specific races corresponds to the urbanicity category.

In [296]:
race_facet_fig_2

Although the Black population makes up between 0-20% of the population in all urbanicity categories, the Black population is most overrepresented in the incarcerated population in urban areas, followed by suburban and small/mid-sized. It appears that the rural Black population has an equal representation in the incarcerated population. An inverse trend can be observed for the Native American population. The Native American population is overrepresented in the incarcerated population in rural areas. 

## Question 5: Are there trends between the incarceration of different racial groups?

In our final analysis, we will investigate the trends in incarceration between different racial groups. We will begin our analysis with the racial prison percentage dataset we previously examined. We will look for trends in racial incarceration using a scatterplot matrix. In the matrix, each axis represents the percentage of the prison population corresponding to a specific race.

In [297]:
# Define Scatter Plot Matrix function
def SPM(data,cat):
    # Manipulation of Data
    source = data
    categories = cat #['year','area','county_name','pop_category']
    source = source.set_index(categories)['pris_race_prop'].unstack().reset_index()
    source = source.drop(columns='Other')

    # Plotting Scatter Plot Matrix
    scatter_matrix = alt.Chart().mark_point(filled=True, size=15, opacity=0.5).encode(
        alt.X(alt.repeat('column'), type='quantitative',
              scale = alt.Scale(domain=[0,80])),
        alt.Y(alt.repeat('row'), type='quantitative',
              scale = alt.Scale(domain=[0,80])),
        alt.Color(cat[1]+':N')
    ).properties(
      width=150,
      height=150
    ).repeat(
      data=source,
      row=['Asian', 'Black', 'Latino','Native American'],
      column=['Native American', 'Latino', 'Black','Asian']
    )

    return scatter_matrix

In [298]:
# Scatter Plot Matrices
data = pp_year_race_ed_2
scm_cat_1 = ['year','area','county_name','pop_category']
scm_cat_2 = ['year','urbanicity','county_name','pop_category']

scm_fig_1 = SPM(data,scm_cat_1)
scm_fig_2 = SPM(data,scm_cat_2)

In [299]:
scm_fig_2

With the exception of the Black and Latino populations, we are not able to observe strong correlations between the incarceration of different races.

There appears to be a strong negative correlation between the incarceration of the Black and Latino populations in urban, small/mid-sized, and suburban areas. As the Black share of the incarcerated population increases, the Latino share of the incarcerated population decreases.
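The visual impression from a scatterplot matrix can be backed by correlation coefficients. A minimal sketch, assuming a wide table with one column of prison shares per race (the same shape the `SPM` function builds internally); the `wide` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical prison shares (percent), one row per county-year
wide = pd.DataFrame({
    'Black':  [30.0, 25.0, 20.0, 15.0],
    'Latino': [30.0, 36.0, 40.0, 45.0],
    'Asian':  [1.0, 3.0, 2.0, 1.5],
})

# Pairwise Pearson correlation between the columns
corr = wide.corr()

print(corr.round(2))
```

Values near -1 or 1 indicate a strong linear relationship and values near 0 indicate a weak one, so the matrix gives a quick numeric companion to each panel.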

## Takeaways:

1. About 10% of the racial data is missing from the dataset
2. There is significant overlap between the normalized incarcerated populations of the urbanicity categories
3. The Black population is overrepresented in the incarcerated population
4. The Asian population is underrepresented in the incarcerated population
5. The Native American population is overrepresented in the rural incarcerated population
6. There is approximately equal representation for the Latino population in the incarcerated population
7. An increase in the incarcerated Black population is correlated with a decrease in the incarcerated Latino population

## Issues with Analysis:

1. Issues with the dataset prevented calculations of the 'White' population