### **Dataset Cleaning - Emily's Sets**

In this notebook, I (Emily) will explain my data cleaning process for the datasets I was assigned to work on.  My collaborators and I worked independently on this portion of the work and used the methods that worked the best for us personally.  However, we all discussed and agreed upon which data to delete, preserve, or modify.

One decision that became clear as we cleaned the data was that we unfortunately needed to exclude Alpine County from the dataset.  Alpine County is very small, and therefore, almost all of its social condition data were suppressed for privacy.  After that decision was made, Alpine County was removed from any further datasets we cleaned.  It remained, however, in datasets that had already been cleaned, and was removed for good at a later step.

In [1]:
# Imports
import numpy as np
import os
import pandas as pd

# Thanks to stackoverflow for this tip to suppress SettingWithCopyWarning
# https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
pd.options.mode.chained_assignment = None 
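
Suppressing the warning globally was enough for this notebook. An alternative (not what I did here) is to take an explicit `.copy()` whenever a filtered dataframe is going to be modified afterwards. A minimal sketch, using a made-up dataframe rather than one of the project datasets:

```python
import pandas as pd

# Hypothetical toy data, not one of the project datasets
toy = pd.DataFrame({'county': ['Alameda', 'Alpine', 'Amador'],
                    'calendar_year': [2015, 2014, 2015]})

# .copy() makes the filtered result an independent dataframe,
# so later in-place edits do not trigger SettingWithCopyWarning
toy_2015 = toy[toy['calendar_year'] == 2015].copy()
toy_2015.drop(columns=['calendar_year'], inplace=True)

print(toy_2015)
print(toy.shape)  # the original dataframe is untouched
```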

In [2]:
os.chdir('./../02_data')
os.listdir()

['.Rhistory',
 '00_ignore',
 '01_original_datasets',
 '02_cleaned_datasets',
 '03_output',
 '04_data-dictionaries',
 'ca_dropout_and_predictors_v6_eks.csv']

#### Abortion Costs

In [3]:
# Read it in
abortion_costs = pd.read_csv('./01_original_datasets/abortions_funded_costs.csv')

# Rename columns
col_names = [col.lower().replace(' ', '_') for col in abortion_costs.columns]
abortion_costs.columns = col_names

# Get the shape
abortion_costs.shape

(815, 5)

In [4]:
# 2015 only
abortion_costs = abortion_costs[abortion_costs['calendar_year']==2015]

# Real counties only
abortion_costs = abortion_costs[abortion_costs['county']!='Unknown']
abortion_costs = abortion_costs[abortion_costs['county']!='Total'] 

# Get the shape
abortion_costs.shape

(114, 5)

In [5]:
# Take a look at the column counts - this will help me identify unnecessary features

# Define a function to use this again later
def col_counts(df):
  '''A function to print the number of unique values for every column in a dataframe
  Arg: 
    {df}, a dataframe
  Return: 
    Nothing, the purpose of this function is to see it in the output.
  Raise:
    Hopefully not!'''
  for i in list(df.columns):
    print('='*20)
    print(i)
    print(df[i].nunique())

# Run it
col_counts(abortion_costs)

calendar_year
1
delivery_system
2
county
58
total_expenditures
55
date_of_data
1
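
For reference, pandas can return the same per-column counts as a single Series through `DataFrame.nunique()`, which may be handier when the counts need to be filtered or sorted rather than just read. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'county': ['Alameda', 'Amador', 'Alameda'],
                    'calendar_year': [2015, 2015, 2015]})

# One entry per column: the number of distinct non-NA values
unique_counts = toy.nunique()
print(unique_counts)
```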


In [6]:
# Look for missing data
abortion_costs.isna().sum()

calendar_year          0
delivery_system        0
county                 0
total_expenditures    58
date_of_data           0
dtype: int64

Only one column has NAs, and the number of NAs (58) is exactly the number of counties.

In [7]:
# Take a look at the column counts if I were to drop those
col_counts(abortion_costs.dropna())

calendar_year
1
delivery_system
1
county
56
total_expenditures
55
date_of_data
1


It looks like an entire category of `delivery_system` is missing.

In [8]:
# Check that hypothesis
abortion_costs['total_expenditures'].isna().groupby(abortion_costs['delivery_system']).sum()

delivery_system
Fee-for-Service     0
Managed Care       58
Name: total_expenditures, dtype: int64

In [9]:
# Check that hypothesis
abortion_costs['total_expenditures'].isna().groupby(abortion_costs['delivery_system']).value_counts(dropna = False)

delivery_system  total_expenditures
Fee-for-Service  False                 56
Managed Care     True                  58
Name: count, dtype: int64

Two counties seem to be missing from the Fee-for-Service level that are present in the Managed Care level, but they're NA there.  They're not suppressed for small numbers or anything; they just aren't present in the dataset at all.
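
One way to find out which counties are absent from one level is a set difference. The sketch below uses a made-up dataframe with the same two `delivery_system` levels; the county values are for illustration only and are not the actual missing counties.

```python
import pandas as pd

# Hypothetical toy data with the same two delivery-system levels
toy = pd.DataFrame({
    'county': ['Alameda', 'Amador', 'Butte', 'Alameda', 'Butte'],
    'delivery_system': ['Managed Care'] * 3 + ['Fee-for-Service'] * 2,
})

# Set of counties reported under each delivery system
managed = set(toy.loc[toy['delivery_system'] == 'Managed Care', 'county'])
ffs = set(toy.loc[toy['delivery_system'] == 'Fee-for-Service', 'county'])

# Counties present under Managed Care but absent under Fee-for-Service
missing = managed - ffs
print(sorted(missing))
```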

In [10]:
# Drop the unnecessary category (rows)
abortion_costs.dropna(inplace = True)

# Drop unnecessary columns
drop_cols = ['calendar_year', 'delivery_system', 'date_of_data']
abortion_costs.drop(columns = drop_cols, inplace = True)

# Get the shape
abortion_costs.shape

(56, 2)

In [11]:
# Save to csv 
abortion_costs.to_csv('./03_output/abortion_costs_eks.csv', index = False)

The datasets that resulted from my initial cleaning, and that went on to be part of the final analysis, can be found in `./02_cleaned_datasets`.  I have edited this code to send them to `./03_output` instead, so that the original files are not overwritten every time the notebook is run.  The same is true throughout this notebook.  Curious readers - or at least those using Git Bash on a PC - can run the following command in their terminal to confirm that the files are the same.

```diff 02_data/03_output/<dataset name>_eks.csv 02_data/02_cleaned_datasets/<dataset name>_simplified_eks.csv```
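
For readers without `diff` at hand, a roughly equivalent check can be sketched in Python with the standard library's `filecmp` module. The two files below are hypothetical stand-ins written to a temporary folder; in practice the paths would be the two CSVs named in the command above.

```python
import filecmp
import os
import tempfile

# Hypothetical stand-ins for the two CSVs being compared
tmp_dir = tempfile.mkdtemp()
path_a = os.path.join(tmp_dir, 'example_eks.csv')
path_b = os.path.join(tmp_dir, 'example_simplified_eks.csv')
for path in (path_a, path_b):
    with open(path, 'w') as f:
        f.write('county,total_expenditures\nAlameda,100\n')

# shallow=False compares the file contents, not just the file metadata
same = filecmp.cmp(path_a, path_b, shallow=False)
print(same)
```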

#### Abortion Counts

In [12]:
# Read it in
abortion_counts = pd.read_csv('./01_original_datasets/abortions_funded_counts.csv')

# Rename columns
col_names = [col.lower().replace(' ', '_') for col in abortion_counts.columns]
col_names = [col.lower().replace('-', '_') for col in col_names]
abortion_counts.columns = col_names
abortion_counts.columns

# Get the shape
abortion_counts.shape

(1231, 7)

In [13]:
# 2015 only
abortion_counts = abortion_counts[abortion_counts['calendar_year']==2015]

# Real counties only
abortion_counts = abortion_counts[abortion_counts['county']!='Unknown']
abortion_counts = abortion_counts[abortion_counts['county']!='Statewide'] 

# Get the shape
abortion_counts.shape

(174, 7)

In [14]:
# Take a look at the column counts
col_counts(abortion_counts)

calendar_year
1
county
58
delivery_system
3
total_abortion_related_services
140
annotation_code
2
annotation_description
2
date_of_data
1


In [15]:
# Look for missing data
abortion_counts.isna().sum()

calendar_year                        0
county                               0
delivery_system                      0
total_abortion_related_services     25
annotation_code                    149
annotation_description             149
date_of_data                         0
dtype: int64

This dataset did not come with a data dictionary, but the annotation columns indicate that the numbers from some counties were suppressed for privacy reasons (i.e., because the counts were so small).  I hypothesized that this was why the count column had missing values.

In [16]:
# Check those rows out
abortion_counts[abortion_counts['total_abortion_related_services'].isna()].head(2)

Unnamed: 0,calendar_year,county,delivery_system,total_abortion_related_services,annotation_code,annotation_description,date_of_data
178,2015,Alpine,Managed Care,,1.0,Cell suppressed for small numbers,1/12/2022
179,2015,Alpine,Total,,1.0,Cell suppressed for small numbers,1/12/2022


In [17]:
print(f'''The total number of rows in this dataset is {abortion_counts.shape[0]}.
Are the annotation columns only filled in when the count was suppressed and NA elsewhere, 
and the count columns only filled in when they weren't suppressed, and NA elsewhere?
{abortion_counts.shape[0]==(
    abortion_counts['total_abortion_related_services'].isna().sum() + abortion_counts[
    'annotation_code'].isna().sum())}''')

The total number of rows in this dataset is 174.
Are the annotation columns only filled in when the count was suppressed and NA elsewhere, 
and the count columns only filled in when they weren't suppressed, and NA elsewhere?
True
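
Comparing the two NA totals to the row count is a quick sanity check, but the totals could in principle add up even if some rows had both values missing and others had neither. A stricter, row-by-row version of the same check might look like the sketch below (made-up data; only the two column names are taken from this dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking the two columns being compared
toy = pd.DataFrame({
    'total_abortion_related_services': [12.0, np.nan, 30.0, np.nan],
    'annotation_code': [np.nan, 1.0, np.nan, 2.0],
})

count_missing = toy['total_abortion_related_services'].isna()
annot_missing = toy['annotation_code'].isna()

# True only if every row is missing exactly one of the two columns
exactly_one_missing = bool((count_missing ^ annot_missing).all())
print(exactly_one_missing)
```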


In [18]:
# What is in these annotation columns?
abortion_counts['annotation_code'].value_counts(dropna = False)

annotation_code
NaN    149
1.0     16
2.0      9
Name: count, dtype: int64

In [19]:
abortion_counts['annotation_description'].value_counts(dropna = False)

annotation_description
NaN                                       149
Cell suppressed for small numbers          16
Cell suppressed for complementary cell      9
Name: count, dtype: int64

The missing values correspond to suppressions.  I left them in at this stage, and my collaborators and I later decided to impute them with 0s.  Although the true values were probably not 0 (0s were in fact reported elsewhere in this data), 0 is a reasonable stand-in given that the values are missing precisely because they were so small.

The annotation columns, as received, list two types of suppression.  For our purposes, I combined these two types at this stage.  In a later stage, in the interest of reducing our dimensions, we dropped the suppression markers entirely.  Because the values were imputed with 0s, rather than with an estimate within the range of the unsuppressed values, these rows were essentially already marked as different.

In [20]:
abortion_counts.columns

Index(['calendar_year', 'county', 'delivery_system',
       'total_abortion_related_services', 'annotation_code',
       'annotation_description', 'date_of_data'],
      dtype='object')

In [21]:
# Recode these annotation columns
abortion_counts.drop(columns = ['annotation_description'], inplace = True)
abortion_counts['annotation_code'] = abortion_counts['annotation_code'].fillna(0)
annot_2s = abortion_counts['annotation_code']==2
abortion_counts.loc[annot_2s, 'annotation_code'] = 1
abortion_counts.rename(columns = {
    'annotation_code': 'abortion_rs_count_total_suppressed',
    'total_abortion_related_services': 'abortion_rs_count_total'}, inplace = True)
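
The recode in the cell above, together with the 0-imputation we settled on later, can be sketched more compactly. This is an illustration on made-up data with hypothetical column names (`count`, `suppressed`), not the code used in the project; it relies on the fact that any non-missing annotation code means the count was suppressed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: counts are NA exactly where a code is present
toy = pd.DataFrame({
    'count': [12.0, np.nan, np.nan, 30.0],
    'annotation_code': [np.nan, 1.0, 2.0, np.nan],
})

# Any non-missing annotation code (1 or 2) becomes a single 0/1 flag
toy['suppressed'] = toy['annotation_code'].notna().astype(int)

# Suppressed counts are imputed with 0
toy['count'] = toy['count'].fillna(0)

print(toy[['count', 'suppressed']])
```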

Data were listed on separate rows depending on the type of payment Medi-Cal made for the abortion-related services, but in the interest of minimizing our dimensions, we chose to use the total count only.

In [22]:
# Drop unnecessary rows
abortion_counts = abortion_counts[abortion_counts['delivery_system']=='Total']
abortion_counts = abortion_counts[abortion_counts['county']!='Alpine']

# Drop unnecessary columns
abortion_counts.drop(
    columns = ['calendar_year', 'delivery_system', 'date_of_data'], inplace = True)

# Get the shape
abortion_counts.shape

(57, 3)

In [23]:
# Save to csv
abortion_counts.to_csv('./03_output/abortion_counts_eks.csv', index = False)

#### Daycare Slots

In [24]:
# Read it in
daycare_slots = pd.read_csv('./01_original_datasets/daycare_slots.csv')

# Rename columns
col_names = {
    'reportyear': 'report_year', 'geotypevalue': 'geotype_value'}
daycare_slots.rename(columns = col_names, inplace = True)

# Get the shape
daycare_slots.shape

(20101, 28)

In [25]:
# Drop redundant columns
drop_cols = ['ind_id', 'ind_definition', 'race_eth_code',
       'race_eth_name', 'version', 'region_name',
       'strata_name_code', 'strata_name', 'ca_decile']
daycare_slots.drop(columns = drop_cols, inplace = True)

# Get the shape
daycare_slots.shape

(20101, 19)

In [26]:
# Real counties only
daycare_slots = daycare_slots[daycare_slots['geotype']=='CO']

# Get the shape
daycare_slots.shape

(116, 19)

In [27]:
# Take a look at the column counts
col_counts(daycare_slots)

report_year
1
geotype
1
geotype_value
58
geoname
58
county_fips
58
county_name
58
region_code
14
strata_level_name_code
2
strata_level_name
2
facility_capacity
108
total_pop
116
rate_slots
111
ll_95ci
111
ul_95ci
111
se
111
rse
108
ca_rr
111
no_facility
72
pct_nonwhite
115


In [28]:
# Look for missing data
daycare_slots.isna().sum()

report_year               0
geotype                   0
geotype_value             0
geoname                   0
county_fips               0
county_name               0
region_code               0
strata_level_name_code    0
strata_level_name         0
facility_capacity         5
total_pop                 0
rate_slots                5
ll_95ci                   5
ul_95ci                   5
se                        5
rse                       5
ca_rr                     5
no_facility               5
pct_nonwhite              0
dtype: int64

This dataset contains two rows per county - one for infant (0-2) daycare availability, and one for child (2-5) daycare availability.  To condense these into one row, I first broke the dataframe into two dataframes (one per type), gave the columns within those dataframes different names, then merged them back together.
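
As a compact illustration of this long-to-wide reshaping (not the code I actually used, which follows), the same result can be sketched with `DataFrame.pivot`. The data and the `age_group` column are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: two rows per county, one per age group
toy = pd.DataFrame({
    'county_name': ['Alameda', 'Alameda', 'Amador', 'Amador'],
    'age_group': ['child', 'infant', 'child', 'infant'],
    'facility_capacity': [100.0, 10.0, 50.0, 5.0],
    'total_pop': [1000.0, 800.0, 400.0, 300.0],
})

# Spread the age groups across columns: one row per county
wide = toy.pivot(index='county_name', columns='age_group',
                 values=['facility_capacity', 'total_pop'])

# Flatten the two-level column index into names like 'child_facility_capacity'
wide.columns = [f'{group}_{measure}' for measure, group in wide.columns]
wide = wide.reset_index()
print(wide)
```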

In [29]:
# Divide the df
df1 = daycare_slots[daycare_slots['strata_level_name_code']==1] # child
df2 = daycare_slots[daycare_slots['strata_level_name_code']==2] # infant

In [30]:
# Take a look at the column counts
col_counts(df1)

report_year
1
geotype
1
geotype_value
58
geoname
58
county_fips
58
county_name
58
region_code
14
strata_level_name_code
1
strata_level_name
1
facility_capacity
58
total_pop
58
rate_slots
58
ll_95ci
58
ul_95ci
58
se
58
rse
58
ca_rr
58
no_facility
50
pct_nonwhite
58


In [31]:
col_counts(df2)

report_year
1
geotype
1
geotype_value
58
geoname
58
county_fips
58
county_name
58
region_code
14
strata_level_name_code
1
strata_level_name
1
facility_capacity
51
total_pop
58
rate_slots
53
ll_95ci
53
ul_95ci
53
se
53
rse
51
ca_rr
53
no_facility
33
pct_nonwhite
58


In [32]:
# Drop unnecessary columns
drop_cols = ['county_fips', 'region_code',
    'geoname', 'report_year', 'strata_level_name_code', 
    'strata_level_name', 'geotype', 'geotype_value',
    'll_95ci', 'ul_95ci', 'se', 'rse', 'ca_rr']

df1.drop(columns = drop_cols, inplace = True) #child
df2.drop(columns = drop_cols, inplace = True) #infant

# Get the shapes
print(df1.shape)
print(df2.shape)

(58, 6)
(58, 6)


In [33]:
# Rename the columns
infant_cols = ['county_name', 'infant_facility_capacity',
    'infant_total_pop', 'infant_rate_slots',
    'infant_num_facility', 'infant_pct_nonwhite']

child_cols = ['county_name', 'child_facility_capacity', 
    'child_total_pop', 'child_rate_slots', 
    'child_num_facility', 'child_pct_nonwhite']

df1.columns = child_cols
df2.columns = infant_cols

# Get the shapes
print(df1.shape)
print(df2.shape)

(58, 6)
(58, 6)


In [34]:
# Merge them back together
daycare_slots = df1.merge(df2, on = 'county_name', how = 'left')

In [35]:
# Rename the columns again
cols = [''.join(['daycare_slots_', c]) for c in list(daycare_slots.columns)]
daycare_slots.columns = cols
daycare_slots.rename(columns = {'daycare_slots_county_name': 'county'}, inplace = True)

# Drop Alpine county
daycare_slots = daycare_slots[daycare_slots['county']!='Alpine']

In [36]:
daycare_slots

Unnamed: 0,county,daycare_slots_child_facility_capacity,daycare_slots_child_total_pop,daycare_slots_child_rate_slots,daycare_slots_child_num_facility,daycare_slots_child_pct_nonwhite,daycare_slots_infant_facility_capacity,daycare_slots_infant_total_pop,daycare_slots_infant_rate_slots,daycare_slots_infant_num_facility,daycare_slots_infant_pct_nonwhite
0,Alameda,30404.0,78508.0,387.272635,528.0,78.641667,2252.0,58612.0,38.422166,98.0,78.78762
2,Amador,320.0,1202.0,266.222962,14.0,28.785358,24.0,834.0,28.776978,2.0,28.057554
3,Butte,2984.0,9973.0,299.207861,74.0,41.441893,443.0,7322.0,60.502595,25.0,42.024037
4,Calaveras,471.0,1673.0,281.530185,18.0,25.702331,52.0,1156.0,44.982699,3.0,25.346021
5,Colusa,306.0,1506.0,203.187251,11.0,79.150066,110.0,1097.0,100.273473,4.0,80.94804
6,Contra Costa,18271.0,56056.0,325.941915,320.0,67.241687,1785.0,38823.0,45.9779,73.0,68.894727
7,Del Norte,277.0,1346.0,205.794948,10.0,47.17682,20.0,972.0,20.576132,1.0,48.868313
8,El Dorado,2718.0,8201.0,331.422997,62.0,33.556883,275.0,5318.0,51.71117,19.0,34.373825
9,Fresno,13564.0,63454.0,213.76115,290.0,81.860876,1061.0,47329.0,22.417545,48.0,82.37233
10,Glenn,305.0,1729.0,176.402545,10.0,60.323887,48.0,1308.0,36.697248,3.0,62.461774


In [37]:
# Save to csv 
daycare_slots.to_csv('./03_output/daycare_slots_eks.csv', index = False)

#### E-cigarettes

In [38]:
# Read it in
ecig_availability = pd.read_csv('./01_original_datasets/ecig_availability.csv')

# Rename columns
col_names = [col.lower().replace(' ', '_') for col in list(ecig_availability.columns)]
col_names = [col.lower().replace('-', '_') for col in col_names]
ecig_availability.columns = col_names
col_names = {
  'percentage': 'ecigs_percentage_tobacco_stores_that_sell',
  'lower95ci': 'ecigs_lower95ci', 'upper95ci': 'ecigs_upper95ci'}
ecig_availability.rename(columns = col_names, inplace = True)

# Get the shape
ecig_availability.shape

(186, 5)

In [39]:
# Take a look at the column counts
col_counts(ecig_availability)

county
62
year
3
ecigs_percentage_tobacco_stores_that_sell
63
ecigs_lower95ci
60
ecigs_upper95ci
44


In [40]:
# No 2015 data - this dataset comes in 3-year intervals, so we kept 2013 and 2016 and dropped 2019.
ecig_availability = ecig_availability[ecig_availability['year']!=2019]

# Real counties only - several cities were evaluated on their own
not_counties = ['STATEWIDE', 'Berkeley', 'Long Beach', 'Pasadena']
ecig_availability = ecig_availability[~ecig_availability['county'].isin(not_counties)]

# Drop unnecessary columns
ecig_availability.drop(columns = ['ecigs_lower95ci', 'ecigs_upper95ci'], inplace = True)

# Get the shape
ecig_availability.shape

(116, 3)

In [41]:
# Look for missing data - see if either year is better
ecig_availability.isna().groupby(ecig_availability['year']).sum()

year (index),county,year,ecigs_percentage_tobacco_stores_that_sell
2013,0,0,0
2016,0,0,0


In [42]:
# Convert the rows for 2013 and 2016 into columns

# Split them up
ecigs_2013 = ecig_availability[ecig_availability['year']==2013] 
ecigs_2016 = ecig_availability[ecig_availability['year']==2016]

# Drop columns
ecigs_2013.drop(columns = ['year'], inplace = True)
ecigs_2016.drop(columns = ['year'], inplace = True)

# Rename the columns
ecigs_2013.columns = ['county', 'ecigs_sold_pct_tobacco_stores_2013']
ecigs_2016.columns = ['county', 'ecigs_sold_pct_tobacco_stores_2016']

# Merge them back together
ecig_availability = ecigs_2016.merge(ecigs_2013, on = 'county', how = 'left')

#Get the shape
ecig_availability.shape

(58, 3)
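
Checking the shape after a merge confirms that no rows were duplicated. As a possible extra safeguard (an assumption on my part, not something done in this notebook), `merge` can also enforce that directly through its `validate` and `indicator` parameters. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data: one row per county in each year's dataframe
ecigs_a = pd.DataFrame({'county': ['Alameda', 'Amador'], 'pct_2016': [50.0, 60.0]})
ecigs_b = pd.DataFrame({'county': ['Alameda', 'Amador'], 'pct_2013': [40.0, 55.0]})

# validate raises a MergeError if a county appears more than once on either side;
# indicator adds a '_merge' column showing where each row was found
merged = ecigs_a.merge(ecigs_b, on='county', how='left',
                       validate='one_to_one', indicator=True)
print(merged['_merge'].value_counts())
```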

In [43]:
# Drop Alpine County
ecig_availability = ecig_availability[ecig_availability['county']!='Alpine']

In [44]:
# Save to csv
ecig_availability.to_csv('./03_output/ecig_availability_eks.csv', index = False)

#### Poverty Rate

In [45]:
# Read it in
poverty_rate = pd.read_csv('./01_original_datasets/poverty_rate.csv')

# Get the shape
poverty_rate.shape

(32005, 26)

In [46]:
# Real counties only
poverty_rate = poverty_rate[poverty_rate['geotype']=='CO']

# Get the shape
poverty_rate.shape

(693, 26)

In [47]:
# Drop columns, round 1
drop_cols = ['ind_id', 'ind_definition', 'geotype', 'geoname', 'version',
  'region_name', 'strata_two_code', 'strata_two_name', 'ca_decile']
poverty_rate.drop(columns = drop_cols, inplace = True)

# Get the shape
poverty_rate.shape

(693, 17)

In [48]:
# Rename columns
col_names = [f'povr_{col}' for col in poverty_rate.columns]
poverty_rate.columns = col_names
col_names = {'povr_county_fips': 'county_fips',
    'povr_reportyear': 'year', 'povr_county_name': 'county'}
poverty_rate.rename(columns = col_names, inplace = True)

# Get the shape
poverty_rate.shape

(693, 17)

In [49]:
# Take a look at the column counts
col_counts(poverty_rate)

year
2
povr_race_eth_code
9
povr_race_eth_name
9
povr_geotypevalue
58
county
58
county_fips
58
povr_region_code
14
povr_strata_one_code
2
povr_strata_one_name
2
povr_numerator
660
povr_denominator
680
povr_estimate
683
povr_ll_95ci
657
povr_ul_95ci
655
povr_se
681
povr_rse
681
povr_ca_rr
685


In [50]:
# Look for missing data
poverty_rate.isna().sum()

year                     0
povr_race_eth_code       0
povr_race_eth_name       0
povr_geotypevalue        0
county                   0
county_fips              0
povr_region_code         0
povr_strata_one_code     0
povr_strata_one_name     0
povr_numerator           0
povr_denominator         0
povr_estimate            0
povr_ll_95ci            12
povr_ul_95ci            12
povr_se                 12
povr_rse                12
povr_ca_rr               0
dtype: int64

This dataset came with up to 18 rows per county - 2 report years (one for child poverty, one for overall poverty), and 9 race/ethnicity categories.  Each of these combinations needs to be converted to a column rather than a row.

In [51]:
# Split them up by year/type
pov_kids = poverty_rate[poverty_rate['povr_strata_one_code']==1] 
pov_whole = poverty_rate[poverty_rate['povr_strata_one_code']==3] 

# Drop differentiating columns
pov_kids.drop(columns = ['povr_strata_one_name', 'povr_strata_one_code', 'year'], inplace = True)
pov_whole.drop(columns = ['povr_strata_one_name', 'povr_strata_one_code', 'year'], inplace = True)

# Rename the columns
pov_kids.columns = [
  'povr_kids_race_eth_code', 'povr_kids_race_eth_name', 'povr_geotypevalue',
  'county', 'county_fips', 'povr_region_code', 'povr_kids_numerator', 
  'povr_kids_denominator', 'povr_kids_estimate', 'povr_kids_ll_95ci',
  'povr_kids_ul_95ci', 'povr_kids_se', 'povr_kids_rse', 'povr_kids_ca_rr']

# Drop some more columns
whole_drop = ['county_fips', 'povr_region_code', 'povr_geotypevalue']
pov_whole.drop(columns = whole_drop, inplace = True)

# Rename some more columns
pov_whole.columns = [
  'povr_whole_race_eth_code', 'povr_whole_race_eth_name', 'county', 
  'povr_whole_numerator', 'povr_whole_denominator', 'povr_whole_estimate', 
  'povr_whole_ll_95ci', 'povr_whole_ul_95ci', 'povr_whole_se', 
  'povr_whole_rse', 'povr_whole_ca_rr']

# Get the shapes
pov_whole.shape, pov_kids.shape

((400, 11), (293, 14))

In [52]:
# Split them up by race
pov_kids['povr_kids_race_eth_code'].unique() #n=9
pov_whole['povr_whole_race_eth_code'].unique() #n=9

# Drop redundant columns
pov_kids.drop(columns = ['povr_kids_race_eth_name'], inplace = True)
pov_whole.drop(columns = ['povr_whole_race_eth_name'], inplace = True)

# Kids first
pov_kids_1 = pov_kids[pov_kids['povr_kids_race_eth_code']==1] # continental native
pov_kids_2 = pov_kids[pov_kids['povr_kids_race_eth_code']==2] # Asian
pov_kids_3 = pov_kids[pov_kids['povr_kids_race_eth_code']==3] # Black or AA
pov_kids_4 = pov_kids[pov_kids['povr_kids_race_eth_code']==4] # Latine
pov_kids_5 = pov_kids[pov_kids['povr_kids_race_eth_code']==5] # island native
pov_kids_6 = pov_kids[pov_kids['povr_kids_race_eth_code']==6] # white
pov_kids_7 = pov_kids[pov_kids['povr_kids_race_eth_code']==7] # multiple
pov_kids_8 = pov_kids[pov_kids['povr_kids_race_eth_code']==8] # other
pov_kids_9 = pov_kids[pov_kids['povr_kids_race_eth_code']==9] # total

In [53]:
# Get the shapes

# kinds = ['kids', 'whole']
# nums = list(range(2, 10))
# for i in kinds:
#   print('')
#   for j in nums:
# 	  print(f'print(pov_{i}_{j}.shape)')

print(pov_kids_2.shape)
print(pov_kids_3.shape)
print(pov_kids_4.shape)
print(pov_kids_5.shape)
print(pov_kids_6.shape)
print(pov_kids_7.shape)
print(pov_kids_8.shape)
print(pov_kids_9.shape)

(34, 13)
(34, 13)
(34, 13)
(29, 13)
(34, 13)
(34, 13)
(27, 13)
(34, 13)


In [54]:
# Drop the differentiating column
#for i in list(range(1, 10)): print(f"pov_kids_{i}.drop(columns = ['povr_kids_race_eth_code'], inplace = True)")
pov_kids_1.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
pov_kids_2.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
pov_kids_3.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
pov_kids_4.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
pov_kids_5.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
pov_kids_6.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
pov_kids_7.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
pov_kids_8.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
pov_kids_9.drop(columns = ['povr_kids_race_eth_code'], inplace = True)

# Rename the first set of columns
pov_kids_1.columns = [
    'povr_geotypevalue', 'county', 'county_fips', 'povr_region_code', 
    'povr_kids_cont_native_numerator', 'povr_kids_cont_native_denominator',
    'povr_kids_cont_native_estimate', 'povr_kids_cont_native_ll_95ci', 
    'povr_kids_cont_native_ul_95ci', 'povr_kids_cont_native_se', 
    'povr_kids_cont_native_rse', 'povr_kids_cont_native_ca_rr']
pov_kids_1.columns

Index(['povr_geotypevalue', 'county', 'county_fips', 'povr_region_code',
       'povr_kids_cont_native_numerator', 'povr_kids_cont_native_denominator',
       'povr_kids_cont_native_estimate', 'povr_kids_cont_native_ll_95ci',
       'povr_kids_cont_native_ul_95ci', 'povr_kids_cont_native_se',
       'povr_kids_cont_native_rse', 'povr_kids_cont_native_ca_rr'],
      dtype='object')

In [55]:
# Drop redundant columns in all but 1 df
# for i in list(range(2, 10)):
#   print(f'pov_kids_{i}.drop(columns = whole_drop, inplace = True)')

pov_kids_2.drop(columns = whole_drop, inplace = True)
pov_kids_3.drop(columns = whole_drop, inplace = True)
pov_kids_4.drop(columns = whole_drop, inplace = True)
pov_kids_5.drop(columns = whole_drop, inplace = True)
pov_kids_6.drop(columns = whole_drop, inplace = True)
pov_kids_7.drop(columns = whole_drop, inplace = True)
pov_kids_8.drop(columns = whole_drop, inplace = True)
pov_kids_9.drop(columns = whole_drop, inplace = True)

In [56]:
# Rename columns in n>1 dfs

# Define the components of the names
races = ['asian', 'black', 'latine', 'island_native', 
  'white', 'multiracial', 'other_race', 'total']

cols = ['povr_numerator', 'povr_denominator', 
  'povr_estimate', 'povr_ll_95ci', 'povr_ul_95ci', 
  'povr_se', 'povr_rse', 'povr_ca_rr',]

kinds = ['kids', 'whole']

nums = list(range(2, 10))

# Commented out because the output is long

# Generate the column names
# for i in kinds:
# 	print('')
# 	these_cols = []
# 	for j in cols:
# 		c = j.replace('povr', f'povr_{i}') # povr_{kind}_variable_name
# 		these_cols.append(c)
# 	for k, m in zip(nums, races):
# 	  specific_cols = ['county']
# 	  for n in these_cols:
# 	    p = n.split('_')
# 	    a = ('_'.join([p[0], p[1]])) # povr_{kind}
# 	    b = ('_'.join(p[2:]))  # variable name
# 	    d = ('_'.join([a, m, b])) # povr_{kind}_{race}_variable_name
# 	    specific_cols.append(d)
# 	  print(f'pov_{i}_{k}.columns = {specific_cols}') # pov_{kind}_{race code}.col...
# 	  print('')

In [57]:
# Use that output to rename the columns
pov_kids_2.columns = ['county', 'povr_kids_asian_numerator', 'povr_kids_asian_denominator', 'povr_kids_asian_estimate', 'povr_kids_asian_ll_95ci', 'povr_kids_asian_ul_95ci', 'povr_kids_asian_se', 'povr_kids_asian_rse', 'povr_kids_asian_ca_rr']
pov_kids_3.columns = ['county', 'povr_kids_black_numerator', 'povr_kids_black_denominator', 'povr_kids_black_estimate', 'povr_kids_black_ll_95ci', 'povr_kids_black_ul_95ci', 'povr_kids_black_se', 'povr_kids_black_rse', 'povr_kids_black_ca_rr']
pov_kids_4.columns = ['county', 'povr_kids_latine_numerator', 'povr_kids_latine_denominator', 'povr_kids_latine_estimate', 'povr_kids_latine_ll_95ci', 'povr_kids_latine_ul_95ci', 'povr_kids_latine_se', 'povr_kids_latine_rse', 'povr_kids_latine_ca_rr']
pov_kids_5.columns = ['county', 'povr_kids_island_native_numerator', 'povr_kids_island_native_denominator', 'povr_kids_island_native_estimate', 'povr_kids_island_native_ll_95ci', 'povr_kids_island_native_ul_95ci', 'povr_kids_island_native_se', 'povr_kids_island_native_rse', 'povr_kids_island_native_ca_rr']
pov_kids_6.columns = ['county', 'povr_kids_white_numerator', 'povr_kids_white_denominator', 'povr_kids_white_estimate', 'povr_kids_white_ll_95ci', 'povr_kids_white_ul_95ci', 'povr_kids_white_se', 'povr_kids_white_rse', 'povr_kids_white_ca_rr']
pov_kids_7.columns = ['county', 'povr_kids_multiracial_numerator', 'povr_kids_multiracial_denominator', 'povr_kids_multiracial_estimate', 'povr_kids_multiracial_ll_95ci', 'povr_kids_multiracial_ul_95ci', 'povr_kids_multiracial_se', 'povr_kids_multiracial_rse', 'povr_kids_multiracial_ca_rr']
pov_kids_8.columns = ['county', 'povr_kids_other_race_numerator', 'povr_kids_other_race_denominator', 'povr_kids_other_race_estimate', 'povr_kids_other_race_ll_95ci', 'povr_kids_other_race_ul_95ci', 'povr_kids_other_race_se', 'povr_kids_other_race_rse', 'povr_kids_other_race_ca_rr']
pov_kids_9.columns = ['county', 'povr_kids_total_numerator', 'povr_kids_total_denominator', 'povr_kids_total_estimate', 'povr_kids_total_ll_95ci', 'povr_kids_total_ul_95ci', 'povr_kids_total_se', 'povr_kids_total_rse', 'povr_kids_total_ca_rr']
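
The generated assignments above could also be replaced by a small helper that inserts the race label into each column name. The function below, `add_race_label`, is hypothetical and shown only as a sketch on made-up data; it assumes the columns already carry a `povr_{kind}_` prefix, as they do at this point in the notebook.

```python
import pandas as pd

def add_race_label(df, kind, race):
    """Insert a race label after the 'povr_{kind}_' prefix of each column name.

    Hypothetical helper for illustration; columns without the prefix
    (such as 'county') are left unchanged.
    """
    prefix = f'povr_{kind}_'
    new_names = {
        col: f'{prefix}{race}_{col[len(prefix):]}'
        for col in df.columns if col.startswith(prefix)
    }
    return df.rename(columns=new_names)

# Hypothetical toy data with two of the measure columns
toy = pd.DataFrame({'county': ['Alameda'],
                    'povr_kids_numerator': [10],
                    'povr_kids_ll_95ci': [0.1]})

renamed = add_race_label(toy, 'kids', 'asian')
print(list(renamed.columns))
```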

These sub-dfs don't have every county listed.  The easiest way to join them back together is with a base dataframe that does have all the counties listed.
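
The idea of merging incomplete pieces onto a complete base can be sketched with `functools.reduce`, which applies the same left merge to each piece in turn. The dataframes below are made up for illustration.

```python
from functools import reduce

import pandas as pd

# Hypothetical toy data: a complete base list and two incomplete pieces
base = pd.DataFrame({'county': ['Alameda', 'Amador', 'Butte']})
piece_a = pd.DataFrame({'county': ['Alameda', 'Butte'], 'a_estimate': [0.1, 0.2]})
piece_b = pd.DataFrame({'county': ['Amador'], 'b_estimate': [0.3]})

# Left-merge each piece onto the running result, starting from the base;
# counties missing from a piece get NaN in that piece's columns
combined = reduce(
    lambda left, right: left.merge(right, on='county', how='left'),
    [piece_a, piece_b],
    base,
)
print(combined)
```

Because every merge is a left merge from the base, the result always keeps exactly one row per county in the base, however many counties each piece is missing.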

In [58]:
# Create a base
counties = abortion_counts[['county']]
counties.shape

(57, 1)

In [59]:
# Merge them back together
# goal: (57, 76)
poverty = counties.merge(
  pov_kids_2, on = 'county', how = 'left').merge(
    pov_kids_1, on = 'county', how = 'left').merge(
    pov_kids_3, on = 'county', how = 'left').merge(
    pov_kids_4, on = 'county', how = 'left').merge(
    pov_kids_5, on = 'county', how = 'left').merge(
    pov_kids_6, on = 'county', how = 'left').merge(
    pov_kids_7, on = 'county', how = 'left').merge(
    pov_kids_8, on = 'county', how = 'left').merge(
    pov_kids_9, on = 'county', how = 'left')

# Moment of truth
poverty.shape

(57, 76)

In [60]:
# Put it back under its own name
pov_kids = poverty.copy(deep=True)

In [61]:
# now whole

# Split them

# Commented out long output
# a = '''
# pov_kids_1 = pov_kids[pov_kids['povr_kids_race_eth_code']==1] # continental native
# pov_kids_2 = pov_kids[pov_kids['povr_kids_race_eth_code']==2] # Asian
# pov_kids_3 = pov_kids[pov_kids['povr_kids_race_eth_code']==3] # Black or AA
# pov_kids_4 = pov_kids[pov_kids['povr_kids_race_eth_code']==4] # Latine
# pov_kids_5 = pov_kids[pov_kids['povr_kids_race_eth_code']==5] # island native
# pov_kids_6 = pov_kids[pov_kids['povr_kids_race_eth_code']==6] # white
# pov_kids_7 = pov_kids[pov_kids['povr_kids_race_eth_code']==7] # multiple
# pov_kids_8 = pov_kids[pov_kids['povr_kids_race_eth_code']==8] # other
# pov_kids_9 = pov_kids[pov_kids['povr_kids_race_eth_code']==9] # total
# '''
# print(a.replace('_kids', '_whole'))

# Use that output to do the splits
pov_whole_1 = pov_whole[pov_whole['povr_whole_race_eth_code']==1] # continental native
pov_whole_2 = pov_whole[pov_whole['povr_whole_race_eth_code']==2] # Asian
pov_whole_3 = pov_whole[pov_whole['povr_whole_race_eth_code']==3] # Black or AA
pov_whole_4 = pov_whole[pov_whole['povr_whole_race_eth_code']==4] # Latine
pov_whole_5 = pov_whole[pov_whole['povr_whole_race_eth_code']==5] # island native
pov_whole_6 = pov_whole[pov_whole['povr_whole_race_eth_code']==6] # white
pov_whole_7 = pov_whole[pov_whole['povr_whole_race_eth_code']==7] # multiple
pov_whole_8 = pov_whole[pov_whole['povr_whole_race_eth_code']==8] # other
pov_whole_9 = pov_whole[pov_whole['povr_whole_race_eth_code']==9] # total

# Get the shapes with code generated above
print(pov_whole_2.shape)
print(pov_whole_3.shape)
print(pov_whole_4.shape)
print(pov_whole_5.shape)
print(pov_whole_6.shape)
print(pov_whole_7.shape)
print(pov_whole_8.shape)
print(pov_whole_9.shape)

(47, 10)
(44, 10)
(56, 10)
(21, 10)
(58, 10)
(50, 10)
(19, 10)
(58, 10)


`pov_whole_6` has all 58 counties (for now), so it will be the base onto which all the other dataframes are merged.

In [62]:
# Drop the differentiating column

# Generate the code, then comment out the long output
# a = '''
# pov_kids_1.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
# pov_kids_2.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
# pov_kids_3.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
# pov_kids_4.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
# pov_kids_5.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
# pov_kids_6.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
# pov_kids_7.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
# pov_kids_8.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
# pov_kids_9.drop(columns = ['povr_kids_race_eth_code'], inplace = True)
# '''
# print(a.replace('_kids', '_whole'))

# Use the output to do the drops
pov_whole_1.drop(columns = ['povr_whole_race_eth_code'], inplace = True)
pov_whole_2.drop(columns = ['povr_whole_race_eth_code'], inplace = True)
pov_whole_3.drop(columns = ['povr_whole_race_eth_code'], inplace = True)
pov_whole_4.drop(columns = ['povr_whole_race_eth_code'], inplace = True)
pov_whole_5.drop(columns = ['povr_whole_race_eth_code'], inplace = True)
pov_whole_6.drop(columns = ['povr_whole_race_eth_code'], inplace = True)
pov_whole_7.drop(columns = ['povr_whole_race_eth_code'], inplace = True)
pov_whole_8.drop(columns = ['povr_whole_race_eth_code'], inplace = True)
pov_whole_9.drop(columns = ['povr_whole_race_eth_code'], inplace = True)

In [63]:
# Rename the "first" set of columns
pov_whole_6.columns = [
  'county', 'povr_whole_white_numerator', 'povr_whole_white_denominator', 'povr_whole_white_estimate', 'povr_whole_white_ll_95ci', 'povr_whole_white_ul_95ci', 'povr_whole_white_se', 'povr_whole_white_rse', 'povr_whole_white_ca_rr']

# Generate names for the actual 1st one
a = '''
'povr_kids_cont_native_numerator', 'povr_kids_cont_native_denominator', 
  'povr_kids_cont_native_estimate', 'povr_kids_cont_native_ll_95ci', 
  'povr_kids_cont_native_ul_95ci', 'povr_kids_cont_native_se', 
  'povr_kids_cont_native_rse', 'povr_kids_cont_native_ca_rr']
  '''
print(a.replace('_kids', '_whole'))


'povr_whole_cont_native_numerator', 'povr_whole_cont_native_denominator', 
  'povr_whole_cont_native_estimate', 'povr_whole_cont_native_ll_95ci', 
  'povr_whole_cont_native_ul_95ci', 'povr_whole_cont_native_se', 
  'povr_whole_cont_native_rse', 'povr_whole_cont_native_ca_rr']
  


In [64]:
# Rename columns in the other dfs
pov_whole_2.columns = ['county', 'povr_whole_asian_numerator', 'povr_whole_asian_denominator', 'povr_whole_asian_estimate', 'povr_whole_asian_ll_95ci', 'povr_whole_asian_ul_95ci', 'povr_whole_asian_se', 'povr_whole_asian_rse', 'povr_whole_asian_ca_rr']
pov_whole_3.columns = ['county', 'povr_whole_black_numerator', 'povr_whole_black_denominator', 'povr_whole_black_estimate', 'povr_whole_black_ll_95ci', 'povr_whole_black_ul_95ci', 'povr_whole_black_se', 'povr_whole_black_rse', 'povr_whole_black_ca_rr']
pov_whole_4.columns = ['county', 'povr_whole_latine_numerator', 'povr_whole_latine_denominator', 'povr_whole_latine_estimate', 'povr_whole_latine_ll_95ci', 'povr_whole_latine_ul_95ci', 'povr_whole_latine_se', 'povr_whole_latine_rse', 'povr_whole_latine_ca_rr']
pov_whole_5.columns = ['county', 'povr_whole_island_native_numerator', 'povr_whole_island_native_denominator', 'povr_whole_island_native_estimate', 'povr_whole_island_native_ll_95ci', 'povr_whole_island_native_ul_95ci', 'povr_whole_island_native_se', 'povr_whole_island_native_rse', 'povr_whole_island_native_ca_rr']
pov_whole_1.columns = ['county', 'povr_whole_cont_native_numerator', 'povr_whole_cont_native_denominator', 'povr_whole_cont_native_estimate', 'povr_whole_cont_native_ll_95ci', 'povr_whole_cont_native_ul_95ci', 'povr_whole_cont_native_se', 'povr_whole_cont_native_rse', 'povr_whole_cont_native_ca_rr']
pov_whole_7.columns = ['county', 'povr_whole_multiracial_numerator', 'povr_whole_multiracial_denominator', 'povr_whole_multiracial_estimate', 'povr_whole_multiracial_ll_95ci', 'povr_whole_multiracial_ul_95ci', 'povr_whole_multiracial_se', 'povr_whole_multiracial_rse', 'povr_whole_multiracial_ca_rr']
pov_whole_8.columns = ['county', 'povr_whole_other_race_numerator', 'povr_whole_other_race_denominator', 'povr_whole_other_race_estimate', 'povr_whole_other_race_ll_95ci', 'povr_whole_other_race_ul_95ci', 'povr_whole_other_race_se', 'povr_whole_other_race_rse', 'povr_whole_other_race_ca_rr']
pov_whole_9.columns = ['county', 'povr_whole_total_numerator', 'povr_whole_total_denominator', 'povr_whole_total_estimate', 'povr_whole_total_ll_95ci', 'povr_whole_total_ul_95ci', 'povr_whole_total_se', 'povr_whole_total_rse', 'povr_whole_total_ca_rr']

In [65]:
# Merge them back together
# goal: (57, 73)
poverty = counties.merge(
  pov_whole_6, on = 'county', how = 'left').merge(
    pov_whole_1, on = 'county', how = 'left').merge(
      pov_whole_2, on = 'county', how = 'left').merge(
        pov_whole_3, on = 'county', how = 'left').merge(
          pov_whole_4, on = 'county', how = 'left').merge(
            pov_whole_5, on = 'county', how = 'left').merge(
              pov_whole_7, on = 'county', how = 'left').merge(
                pov_whole_8, on = 'county', how = 'left').merge(
                  pov_whole_9, on = 'county', how = 'left')
poverty.shape
# I can't believe that works.

(57, 73)
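
The split, rename, and merge sequence above is a long-to-wide reshape. As a minimal sketch of the same idea on toy data, `DataFrame.pivot` can produce one row per county and one column per race-by-measure pair in a single step. The values, the `race` column, and the `code_labels` mapping below are hypothetical and only illustrate the pattern; this is not the code that produced the files in this notebook.

```python
import pandas as pd

# Toy long-format data (hypothetical values, not from the real dataset)
long_df = pd.DataFrame({
    'county': ['Alameda', 'Alameda', 'Butte', 'Butte'],
    'race_eth_code': [2, 6, 2, 6],
    'estimate': [0.10, 0.08, 0.15, 0.12],
    'se': [0.01, 0.01, 0.02, 0.02],
})

# Map the numeric codes to the labels used in the column names (assumed mapping)
code_labels = {2: 'asian', 6: 'white'}
long_df['race'] = long_df['race_eth_code'].map(code_labels)

# One row per county, one column per (measure, race) pair
wide_df = long_df.pivot(index='county', columns='race', values=['estimate', 'se'])

# Flatten the two-level column index into names like 'povr_whole_white_estimate'
wide_df.columns = [f'povr_whole_{race}_{measure}' for measure, race in wide_df.columns]
wide_df = wide_df.reset_index()

print(wide_df.shape)
```

A county with no rows for a given race ends up with NaN in that race's columns, which mirrors what the left merges produce. A county with no rows at all would be missing from the pivoted result, so a final left merge onto `counties` would still be needed to guarantee all 57 rows.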

In [66]:
# Put it back under its own name
pov_whole = poverty.copy(deep = True)
pov_whole.shape, pov_kids.shape

((57, 73), (57, 76))

In [67]:
# Merge the whole thing back together
# goal: (57, 148)
poverty_rate = pov_whole.merge(pov_kids, on = 'county', how = 'left')
poverty_rate.shape

(57, 148)

In [68]:
# At a later stage, we opted to drop many of those columns
poverty_rate = poverty_rate[['county', 'povr_whole_white_estimate', 'povr_whole_cont_native_estimate', 
  'povr_whole_asian_estimate', 'povr_whole_black_estimate', 
  'povr_whole_latine_estimate', 'povr_whole_island_native_estimate', 
  'povr_whole_multiracial_estimate', 'povr_whole_other_race_estimate', 
  'povr_whole_total_estimate', 'povr_kids_asian_estimate', 
  'povr_kids_cont_native_estimate', 'povr_kids_black_estimate', 
  'povr_kids_latine_estimate', 'povr_kids_island_native_estimate', 
  'povr_kids_white_estimate', 'povr_kids_multiracial_estimate', 
  'povr_kids_other_race_estimate', 'povr_kids_total_estimate']]

In [69]:
# Save to csv 
poverty_rate.to_csv('./03_output/poverty_rate_eks.csv', index = False)

#### Suicide Rate

Suicide rate was assessed in 3-year intervals, with one interval starting in 2015.  To keep our estimates centered on 2015, rather than projecting past the years covered by the other predictors, we combined the data from 2012-2014 and 2015-2017.

In [70]:
# Read it in
suicide_rate = pd.read_csv('./01_original_datasets/suicide_lghc_indicator_21_2.csv')

# Rename the columns
col_names = [f'sui_{col}' for col in suicide_rate.columns]
suicide_rate.columns = col_names
col_names = {'sui_year': 'year', 'sui_geography': 'county'}
suicide_rate.rename(columns = col_names, inplace = True)

# Get the shape
suicide_rate.shape

(1404, 9)

In [71]:
# Keep only the two intervals on either side of 2015
first = suicide_rate['year']=='2012-2014'
second = suicide_rate['year']=='2015-2017'
suicide_rate = suicide_rate[first | second] 

# Real counties only
suicide_rate = suicide_rate[suicide_rate['county']!='CALIFORNIA']

# Get the shape
suicide_rate.shape # goal = (455, 9)

(455, 9)

In [72]:
# Take a look at the column counts
col_counts(suicide_rate)

sui_indicator
1
county
54
year
2
sui_strata
2
sui_strata_name
8
sui_numerator
217
sui_denominator
451
sui_rate
442
sui_age_adjusted_rate
429


In [73]:
# Look for missing data
suicide_rate.isna().sum()

sui_indicator            0
county                   0
year                     0
sui_strata               0
sui_strata_name          0
sui_numerator            0
sui_denominator          0
sui_rate                 0
sui_age_adjusted_rate    0
dtype: int64

Although this dataset tantalizingly offered by-race rates, the sum of those rows did not equal the county total.  We concluded that small numbers must have been suppressed.  Because including both the by-race rows and the total would introduce massive collinearity, we opted to use only the totals.
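
One way to run that kind of check is sketched below on toy data. The strata labels and counts are hypothetical (the real dataset has eight `sui_strata_name` values), so this only shows the shape of the comparison, not our actual numbers.

```python
import pandas as pd

# Toy rows: by-race numerators plus a 'Total' row per county (hypothetical values)
toy = pd.DataFrame({
    'county': ['A', 'A', 'A', 'B', 'B', 'B'],
    'sui_strata_name': ['White', 'Latine', 'Total', 'White', 'Latine', 'Total'],
    'sui_numerator': [30, 10, 45, 20, 5, 25],
})

is_total = toy['sui_strata_name'] == 'Total'

# Sum the by-race rows within each county
by_race_sum = toy[~is_total].groupby('county')['sui_numerator'].sum()

# Pull the reported totals, indexed the same way
totals = toy[is_total].set_index('county')['sui_numerator']

# A positive gap means the by-race rows do not account for the whole total
gap = totals - by_race_sum
print(gap)
```

In this toy example county A has a gap of 5 and county B has a gap of 0; a nonzero gap is the pattern that would suggest suppressed or omitted strata.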

In [74]:
# Drop unnecessary rows
suicide_rate = suicide_rate[suicide_rate['sui_strata_name']=='Total'] # goal: 100-116

# Drop unnecessary columns
suicide_rate.drop(columns = ['sui_indicator', 'sui_strata', 
    'sui_strata_name', 'sui_numerator', 'sui_denominator'], inplace = True)

# Get the shape
suicide_rate.shape

(103, 4)

In [75]:
# Split up the rows and remerge as columns
sui_1 = suicide_rate[suicide_rate['year']=='2012-2014']
sui_2 = suicide_rate[suicide_rate['year']=='2015-2017']

# Drop differentiating columns
sui_1.drop(columns = ['year'], inplace = True)
sui_2.drop(columns = ['year'], inplace = True)

# Rename columns
sui_1.columns = ['county', 'sui_rate_2012_2014', 'sui_age_adjusted_rate_2012_2014']
sui_2.columns = ['county', 'sui_rate_2015_2017', 'sui_age_adjusted_rate_2015_2017']

# Save these guys as CSVs for safekeeping
sui_1.to_csv('./03_output/suicide_rate_2012_2014.csv', index = False)
sui_2.to_csv('./03_output/suicide_rate_2015_2017.csv', index = False)

# Get the shapes
sui_1.shape, sui_2.shape

((51, 3), (52, 3))

In [76]:
# Merge them back together - goal: (57, 5)
suicide_rate = counties.merge(sui_1, on = 'county', how = 'left').merge(
    sui_2, on = 'county', how = 'left')
suicide_rate.shape 

(57, 5)

In [77]:
# Check NAs again
suicide_rate.isna().sum()

county                             0
sui_rate_2012_2014                 6
sui_age_adjusted_rate_2012_2014    6
sui_rate_2015_2017                 5
sui_age_adjusted_rate_2015_2017    5
dtype: int64

The original dataframe didn't have any NAs because whenever a given report didn't include a particular county, that county simply wasn't listed.  Because I merged the sub-dfs onto a base that does include all counties, there are now NAs.  However, between the two reports, most counties are covered.  Therefore, when a county was covered in both reports, I averaged the two values together to get an estimate centered on 2015, and when a county was only covered in one report or the other, I used that single value.  Counties that were not covered in either report were left NA for now and imputed later.
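
For reference, the rule described above (average when both values exist, otherwise take whichever one exists, otherwise stay NA) matches the default behavior of a row-wise `mean`, which skips NAs. The following is a minimal sketch on toy values, not the code used in this notebook, which fills the one-report counties by hand in the cells below.

```python
import numpy as np
import pandas as pd

# Toy data: county A is in both reports, B and C in one each, D in neither
toy = pd.DataFrame({
    'county': ['A', 'B', 'C', 'D'],
    'sui_rate_2012_2014': [10.0, 24.0, np.nan, np.nan],
    'sui_rate_2015_2017': [12.0, np.nan, 30.0, np.nan],
})

# mean(axis=1) skips NAs by default, so a lone value passes through unchanged
# and a row with no values at all stays NA
toy['sui_rate_avg'] = toy[['sui_rate_2012_2014', 'sui_rate_2015_2017']].mean(axis=1)

print(toy)
```

The same call applied to the two age-adjusted columns would give the adjusted average.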

In [78]:
# Create new columns with the averages
suicide_rate['sui_rate_avg'] = (
  suicide_rate['sui_rate_2012_2014'] + suicide_rate['sui_rate_2015_2017'])/2
suicide_rate['sui_adjusted_rate_avg'] = (
  suicide_rate['sui_age_adjusted_rate_2012_2014'] + suicide_rate[
      'sui_age_adjusted_rate_2015_2017'])/2
suicide_rate.isna().sum()

county                             0
sui_rate_2012_2014                 6
sui_age_adjusted_rate_2012_2014    6
sui_rate_2015_2017                 5
sui_age_adjusted_rate_2015_2017    5
sui_rate_avg                       8
sui_adjusted_rate_avg              8
dtype: int64

In [79]:
# See which counties are missing from at least one of the two reports
suicide_rate[suicide_rate['sui_rate_avg'].isna()]

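| | county | sui_rate_2012_2014 | sui_age_adjusted_rate_2012_2014 | sui_rate_2015_2017 | sui_age_adjusted_rate_2015_2017 | sui_rate_avg | sui_adjusted_rate_avg |
|---|---|---|---|---|---|---|---|
| 4 | Colusa | 24.28 | 26.57 | NaN | NaN | NaN | NaN |
| 6 | Del Norte | 24.16 | 22.69 | NaN | NaN | NaN | NaN |
| 9 | Glenn | NaN | NaN | 29.8 | 28.31 | NaN | NaN |
| 23 | Modoc | NaN | NaN | NaN | NaN | NaN | NaN |
| 24 | Mono | NaN | NaN | NaN | NaN | NaN | NaN |
| 33 | San Benito | NaN | NaN | 10.34 | 11.24 | NaN | NaN |
| 44 | Sierra | NaN | NaN | NaN | NaN | NaN | NaN |
| 51 | Trinity | NaN | NaN | 39.51 | 43.87 | NaN | NaN |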


In [80]:
# Get a list
list(suicide_rate[suicide_rate['sui_rate_avg'].isna()]['county'])

['Colusa',
 'Del Norte',
 'Glenn',
 'Modoc',
 'Mono',
 'San Benito',
 'Sierra',
 'Trinity']

In [81]:
# Generate some code, then comment out the long output
a = ['Colusa', 'Del Norte']
# for i in a:
#   print('')
#   print(f"suicide_rate.loc[(suicide_rate['county']=='{i}'), 'sui_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='{i}'), 'sui_rate_2012_2014']")
#   print(f"suicide_rate.loc[(suicide_rate['county']=='{i}'), 'sui_adjusted_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='{i}'), 'sui_age_adjusted_rate_2012_2014']")

In [82]:
# Generate some code, then comment out the long output
a = ['Glenn', 'San Benito', 'Trinity']
# for i in a:
#   print('')
#   print(f"suicide_rate.loc[(suicide_rate['county']=='{i}'), 'sui_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='{i}'), 'sui_rate_2015_2017']")
#   print(f"suicide_rate.loc[(suicide_rate['county']=='{i}'), 'sui_adjusted_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='{i}'), 'sui_age_adjusted_rate_2015_2017']")

In [83]:
# Run that code to fill in the NAs
suicide_rate.loc[(suicide_rate['county']=='Colusa'), 'sui_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='Colusa'), 'sui_rate_2012_2014']
suicide_rate.loc[(suicide_rate['county']=='Colusa'), 'sui_adjusted_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='Colusa'), 'sui_age_adjusted_rate_2012_2014']

suicide_rate.loc[(suicide_rate['county']=='Del Norte'), 'sui_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='Del Norte'), 'sui_rate_2012_2014']
suicide_rate.loc[(suicide_rate['county']=='Del Norte'), 'sui_adjusted_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='Del Norte'), 'sui_age_adjusted_rate_2012_2014']

suicide_rate.loc[(suicide_rate['county']=='Glenn'), 'sui_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='Glenn'), 'sui_rate_2015_2017']
suicide_rate.loc[(suicide_rate['county']=='Glenn'), 'sui_adjusted_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='Glenn'), 'sui_age_adjusted_rate_2015_2017']

suicide_rate.loc[(suicide_rate['county']=='San Benito'), 'sui_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='San Benito'), 'sui_rate_2015_2017']
suicide_rate.loc[(suicide_rate['county']=='San Benito'), 'sui_adjusted_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='San Benito'), 'sui_age_adjusted_rate_2015_2017']

suicide_rate.loc[(suicide_rate['county']=='Trinity'), 'sui_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='Trinity'), 'sui_rate_2015_2017']
suicide_rate.loc[(suicide_rate['county']=='Trinity'), 'sui_adjusted_rate_avg'] = suicide_rate.loc[(suicide_rate['county']=='Trinity'), 'sui_age_adjusted_rate_2015_2017']

# Drop redundant columns
suicide_rate = suicide_rate[[
    'county', 'sui_rate_avg', 'sui_adjusted_rate_avg']]

# Check the NAs again
suicide_rate.isna().sum()

county                   0
sui_rate_avg             3
sui_adjusted_rate_avg    3
dtype: int64

In [84]:
# Save to csv 
suicide_rate.to_csv('./03_output/suicide_rate_eks.csv', index = False)

#### Graduation Cohort & Dropout Rate

In [85]:
# Read it in
graduation_rate_2019 = pd.read_csv('./01_original_datasets/graduation_rate_ish_2019.txt', sep='\t')

# Get the shape
graduation_rate_2019.shape

(198022, 34)

In [86]:
# Reduce to a single y variable, one row per county

# Real counties only
graduation_rate_2019 = graduation_rate_2019[graduation_rate_2019['AggregateLevel']=='C']
# I don't care whether it's a charter school or not.
graduation_rate_2019 = graduation_rate_2019[graduation_rate_2019['CharterSchool']=='All']
# I don't care whether it's a DASS school or not.
graduation_rate_2019 = graduation_rate_2019[graduation_rate_2019['DASS']=='All']
# I do care whether there are disparities in graduation rates, but not for THIS model.
graduation_rate_2019 = graduation_rate_2019[graduation_rate_2019['ReportingCategory']=='TA']

# Get the shape
graduation_rate_2019.shape

(57, 34)

In [87]:
# Drop columns
dropout_rate_2019 = graduation_rate_2019[['CountyName', 'CohortStudents', 'Dropout (Rate)']]

# Get the shape
dropout_rate_2019.shape

(57, 3)

In [88]:
# Look for missing data
dropout_rate_2019.isna().sum()

CountyName        0
CohortStudents    0
Dropout (Rate)    0
dtype: int64

In [89]:
# Rename the columns
dropout_rate_2019.columns = ['county', 
    'graduation_2019_cohort_size', 'dropout_rate_2019_cohort']

In [90]:
# Save to csv 
dropout_rate_2019.to_csv('./03_output/dropout_rate_2019.csv', index = False)

#### Unemployment Rate

In [91]:
# Read it in
joblessness = pd.read_csv('./01_original_datasets/Calfornia_Unemployment_Data_By County_2015_may-13-2024.csv')

# Get the shape
joblessness.shape

(708, 9)

In [92]:
# Take a look at the column counts
col_counts(joblessness)

Year
1
Period
12
Area
58
Adjusted
2
Preliminary
1
Labor Force
622
Employment
638
Unemployment
452
Unemployment Rate
121


In [93]:
# Drop unnecessary rows
joblessness = joblessness[joblessness['Adjusted']=='Not Adj']

# Drop unnecessary columns
joblessness = joblessness[['Period', 'Area', 'Labor Force',
       'Employment', 'Unemployment', 'Unemployment Rate']]

# Get the shape
joblessness.shape

(696, 6)

In [94]:
# Look for missing data
joblessness.isna().sum()

Period               0
Area                 0
Labor Force          0
Employment           0
Unemployment         0
Unemployment Rate    0
dtype: int64

In [95]:
# fix county names
county_list = []
for i in joblessness['Area']:
  county_list.append(i.replace(' County', ''))

joblessness['Area'] = county_list

This dataset is broken down by month.  I aggregated it to an overall estimate for 2015 by averaging each month within a county.

In [96]:
# Define a function to average each county
def county_average(county_name, dataframe, column_name, target):
  '''Take the average of a variable within
  one level of another variable.
  Args:
    county_name: the level of the variable to subset by
    dataframe: the df to do this on
    column_name: the variable to subset by
    target: the variable to take the average of
  Returns:
    the average of 'target'
  Raises:
    TypeError, IndexError/KeyError, or math errors, depending on the inputs'''
  df = dataframe[dataframe[column_name]==county_name]
  x = df[target].mean()
  return x

In [97]:
# Define a new dataframe
unemployment = counties.copy(deep = True)

# Pull out the averages we want
for i in list(unemployment['county']):
  unemployment.loc[unemployment['county']==i, 1] = county_average(
    i, joblessness, 'Area', 'Unemployment Rate')
unemployment.columns = ['county', 'unemployment_rate']

for i in list(unemployment['county']):
  unemployment.loc[unemployment['county']==i, 2] = county_average(
    i, joblessness, 'Area', 'Labor Force')
unemployment.columns = ['county', 'unemployment_rate', 'labor_force_size']

# Get the shape
unemployment.shape

(57, 3)
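
The county-by-county loop above is equivalent to a grouped mean. As a hedged alternative, the sketch below does the name cleanup and the monthly averaging with `str.replace` and `groupby` on toy data; the values are hypothetical, and it assumes the rate and labor force columns are already numeric.

```python
import pandas as pd

# Toy monthly data for two counties (hypothetical values)
toy_jobs = pd.DataFrame({
    'Period': ['Jan', 'Feb', 'Jan', 'Feb'],
    'Area': ['Alameda County', 'Alameda County', 'Butte County', 'Butte County'],
    'Labor Force': [1000, 1200, 500, 700],
    'Unemployment Rate': [4.0, 5.0, 6.0, 8.0],
})

# Strip the ' County' suffix so the names match the county list
toy_jobs['Area'] = toy_jobs['Area'].str.replace(' County', '', regex=False)

# Average every month within each county, then rename to the final column names
toy_unemployment = (
    toy_jobs.groupby('Area', as_index=False)[['Unemployment Rate', 'Labor Force']]
    .mean()
    .rename(columns={
        'Area': 'county',
        'Unemployment Rate': 'unemployment_rate',
        'Labor Force': 'labor_force_size',
    })
)

print(toy_unemployment)
```

Merging the grouped result onto `counties` with `how = 'left'` would then keep the same 57-row base used elsewhere in this notebook.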

In [98]:
# Save to csv
unemployment.to_csv('./03_output/unemployment_eks.csv', index = False)

#### Overall Merge

In [99]:
# Moment of truth 
emilys_dfs_merged = counties.merge(
  abortion_costs, left_on = 'county', right_on = 'county', how = 'left').merge(
    abortion_counts, left_on = 'county', right_on = 'county', how = 'left').merge(
      daycare_slots, left_on = 'county', right_on = 'county', how = 'left').merge(
        dropout_rate_2019, left_on = 'county', right_on = 'county', how = 'left').merge(
          ecig_availability, left_on = 'county', right_on = 'county', how = 'left').merge(
            poverty_rate, left_on = 'county', right_on = 'county', how = 'left').merge(
              sui_1, left_on = 'county', right_on = 'county', how = 'left').merge(
                sui_2, left_on = 'county', right_on = 'county', how = 'left').merge(
                  unemployment, left_on = 'county', right_on = 'county', how = 'left').merge(
    suicide_rate, left_on = 'county', right_on = 'county', how = 'left')

# Get the shape
emilys_dfs_merged.shape

(57, 44)

In [100]:
emilys_dfs_merged.shape

(57, 44)

In [101]:
# Save mine to CSV
emilys_dfs_merged.to_csv('./03_output/emilys_dfs_merged.csv', index = False)
# Note: Unemployment rate was actually added later, so these files won't match exactly.

In [102]:
# Combine mine and Eli's
elis_dfs = pd.read_csv('./02_cleaned_datasets/eli_concat.csv')

# goal: (57, 57)
eli_emily_dfs_merged = emilys_dfs_merged.merge(elis_dfs, left_on = 'county', right_on = 'county', how = 'left')
eli_emily_dfs_merged.shape

(57, 61)

#### Group Data Cleaning

The following data cleaning steps were taken together during a group call.  I ran them on my computer and distributed the resulting dataframe as a CSV.

In [103]:
# Give it a more convenient name
big_df = eli_emily_dfs_merged.copy(deep = True)

# Check for NAs
big_df.isna().sum()

county                                   0
total_expenditures                       1
abortion_rs_count_total                  1
abortion_rs_count_total_suppressed       0
daycare_slots_child_facility_capacity    0
                                        ..
std_gonorrhea_supressed                  0
std_syphilis_cases                       0
std_population                           0
std_syphilis_rate                        0
std_syphilis_supressed                   0
Length: 61, dtype: int64

In [104]:
# Drop columns with too many NAs - not practical to impute that much
big_df.drop(columns = ['povr_whole_white_estimate',
       'povr_whole_cont_native_estimate', 'povr_whole_asian_estimate',
       'povr_whole_black_estimate', 'povr_whole_latine_estimate',
       'povr_whole_island_native_estimate', 'povr_whole_multiracial_estimate',
       'povr_whole_other_race_estimate', 'povr_kids_asian_estimate', 'povr_kids_cont_native_estimate',
       'povr_kids_black_estimate', 'povr_kids_latine_estimate',
       'povr_kids_island_native_estimate', 'povr_kids_white_estimate',
       'povr_kids_multiracial_estimate', 'povr_kids_other_race_estimate',
       'povr_kids_total_estimate'], inplace = True)

# Get the shape
big_df.shape

(57, 44)

In [105]:
# Drop counts columns where rates are available
big_df.drop(columns = ['std_chlamydia_cases', 'std_gonorrhea_cases', 
    'std_syphilis_cases', 'adolescent_birth_births'], inplace=True)

# Drop redundant columns
big_df.drop(columns = ['labor_force_size', 'sui_rate_2012_2014', 'sui_rate_2015_2017', 
    'sui_age_adjusted_rate_2012_2014', 'sui_age_adjusted_rate_2015_2017', 
    'adolescent_birth_population', 'sui_rate_avg'], inplace = True)

# Get the shape
big_df.shape

(57, 33)

In [106]:
# Fix a string column
big_df.rename(columns = {
    'total_expenditures': 'abortion_medicaid_expenditures'}, inplace = True)

# Impute 0 where suppressed
big_df['abortion_medicaid_expenditures'] = big_df['abortion_medicaid_expenditures'].fillna('$0')

# Get a list
exp = list(big_df['abortion_medicaid_expenditures'])

# Get rid of spaces, commas, and the leading $
x = []
for i in exp:
  j = i.replace(' ', '')
  k = j.replace(',', '')
  x.append(k[1:])

# Put back in
big_df['abortion_medicaid_expenditures'] = x

# Set dtype
big_df['abortion_medicaid_expenditures'] = big_df['abortion_medicaid_expenditures'].astype('float')
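
The same cleanup can also be written with pandas string methods instead of a Python loop. This is a minimal sketch on a toy Series with hypothetical values; unlike the loop above, which drops the first character, it removes `$`, commas, and whitespace wherever they appear.

```python
import pandas as pd

# Toy expenditure strings, including a missing (suppressed) value
toy_exp = pd.Series(['$1,234 ', ' $56', None, '$7,890,123'])

cleaned = (
    toy_exp.fillna('$0')                     # impute 0 where suppressed
    .str.replace(r'[$,\s]', '', regex=True)  # drop $, commas, and whitespace
    .astype('float')
)

print(cleaned.tolist())
```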

In [107]:
# Save to CSV
big_df.to_csv('./03_output/big_df_eks.csv', index = False)

#### Combine GINI Coefficients

Eli found a dataset on income inequality that covered nearly all of the counties, but was from a range of years.  Radha found one from the same source that was specifically from 2015, but many of the counties had been evaluated as groups rather than individually.  I combined these estimates similarly to the suicide rate estimates.

In [108]:
# This df had gone around a few times by now
big_df = pd.read_csv('./02_cleaned_datasets/ca_dropout_and_predictors_v1.csv')

# Subset for sanity
gini = big_df[['county', 'gini_coefficient_2015', 'gini_suppressed_2015', 
  'gini_combined_2015', 'gini_coefficient_range', 'gini_suppressed_range']]

In [109]:
# Define conditions
a_na = gini['gini_coefficient_2015'].isna()
a_pop = gini['gini_coefficient_2015'].notna()
b_na = gini['gini_coefficient_range'].isna()
b_pop = gini['gini_coefficient_range'].notna()
c_0 = gini['gini_combined_2015']==0
c_1 = gini['gini_combined_2015']==1

In [110]:
# Fill in the appropriate values
gini.loc[(a_na & b_pop), 'gini_coef'] = gini.loc[
  (a_na & b_pop), 'gini_coefficient_range']
gini.loc[(a_na & b_pop), 'gini_suppressed'] = gini.loc[
  (a_na & b_pop), 'gini_suppressed_range']

gini.loc[(a_pop & b_na), 'gini_coef'] = gini.loc[
  (a_pop & b_na), 'gini_coefficient_2015']
gini.loc[(a_pop & b_na), 'gini_suppressed'] = gini.loc[
  (a_pop & b_na), 'gini_suppressed_2015']

gini.loc[(a_pop & b_pop & c_0), 'gini_coef'] = gini.loc[
  (a_pop & b_pop & c_0), 'gini_coefficient_2015']
gini.loc[(a_pop & b_pop & c_0), 'gini_suppressed'] = gini.loc[
  (a_pop & b_pop & c_0), 'gini_suppressed_2015']

gini.loc[(a_pop & b_pop & c_1), 'gini_coef'] = (gini.loc[
  (a_pop & b_pop & c_1), 'gini_coefficient_2015'] + gini.loc[
  (a_pop & b_pop & c_1), 'gini_coefficient_range'])/2

In [111]:
# Mark the new column as suppressed if either of the contributing columns were suppressed
gini['total_suppression'] = gini[
  'gini_suppressed_2015'] + gini['gini_suppressed_range']
gini['total_suppression'] = gini['total_suppression'].fillna(0)

gini.loc[(a_pop & b_pop & c_1), 'gini_suppressed'] = gini.loc[
  (a_pop & b_pop & c_1), 'total_suppression']

gini = gini[['county', 'gini_coef', 'gini_suppressed']]
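
The four cases used for the coefficient can also be laid out as one table of conditions and choices with `np.select`. The sketch below uses toy values and a hypothetical `toy_gini` frame, and covers only the coefficient, not the suppression flag.

```python
import numpy as np
import pandas as pd

# Toy data: A has only the range estimate, B only the 2015 estimate,
# C has both with an individual 2015 estimate, D has both with a grouped one
toy_gini = pd.DataFrame({
    'county': ['A', 'B', 'C', 'D'],
    'gini_coefficient_2015': [np.nan, 0.40, 0.42, 0.44],
    'gini_combined_2015': [np.nan, 0, 0, 1],
    'gini_coefficient_range': [0.45, np.nan, 0.50, 0.48],
})

a_pop = toy_gini['gini_coefficient_2015'].notna()
b_pop = toy_gini['gini_coefficient_range'].notna()
c_0 = toy_gini['gini_combined_2015'] == 0
c_1 = toy_gini['gini_combined_2015'] == 1

conditions = [
    ~a_pop & b_pop,       # only the range estimate exists
    a_pop & ~b_pop,       # only the 2015 estimate exists
    a_pop & b_pop & c_0,  # both exist, 2015 is an individual estimate
    a_pop & b_pop & c_1,  # both exist, 2015 is a grouped estimate
]
choices = [
    toy_gini['gini_coefficient_range'],
    toy_gini['gini_coefficient_2015'],
    toy_gini['gini_coefficient_2015'],
    (toy_gini['gini_coefficient_2015'] + toy_gini['gini_coefficient_range']) / 2,
]

# Rows matching none of the conditions stay NA
toy_gini['gini_coef'] = np.select(conditions, choices, default=np.nan)

print(toy_gini[['county', 'gini_coef']])
```

Keeping the conditions and choices in parallel lists makes it easier to see that each county falls into exactly one case.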

In [112]:
# Drop redundant columns
big_df.drop(columns = ['gini_coefficient_2015', 
  'gini_suppressed_2015', 'gini_combined_2015', 
  'gini_coefficient_range', 'gini_suppressed_range'], inplace = True)

# Merge
new_df = big_df.merge(gini, how = 'left', on = 'county')

# Save to CSV
new_df.to_csv('./03_output/ca_dropout_and_predictors_v2.csv', index = False)