# Population data download and preparation

In this notebook we download U.S. population data from the [U.S. Census Bureau](https://www.census.gov) using the API for the American Community Survey (ACS) 5-Year dataset.

More information about the data can be found at the following URL: [ACS 5 year data](https://www.census.gov/data/developers/data-sets/acs-5year.html).

In [1]:
# import libraries
import requests
import pandas as pd

In this dataset, the population is grouped into the following age categories:

In [2]:
ages = [
    'under 5 years',
    '5 to 9 years',
    '10 to 14 years',
    '15 to 17 years',
    '18 and 19 years',
    '20 years',
    '21 years',
    '22 to 24 years',
    '25 to 29 years',
    '30 to 34 years',
    '35 to 39 years',
    '40 to 44 years',
    '45 to 49 years',
    '50 to 54 years',
    '55 to 59 years',
    '60 and 61 years',
    '62 to 64 years',
    '65 and 66 years',
    '67 to 69 years',
    '70 to 74 years',
    '75 to 79 years',
    '80 to 84 years',
    '85 years and over'
]

The age group '10 to 14 years' overlaps both the 'Child 0-11' and 'Teen 12-17' age groups used in the gun incidents dataset. To work with the same age groups, we assume that the population in this range is evenly distributed across each single year of age, and assign 2/5 of it (ages 10 and 11) to the 'Child 0-11' group and 3/5 (ages 12 to 14) to the 'Teen 12-17' group.
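The split can be sketched in plain Python. This is a minimal illustration of the even-distribution assumption, not the notebook's code: the helper name `split_10_to_14` is hypothetical, and it uses integer arithmetic with the remainder assigned to the teen share so that the two parts always add up to the original count.

```python
def split_10_to_14(count):
    """Split a '10 to 14 years' count into (child, teen) shares.

    Assumes the population is spread evenly over the five single years:
    ages 10-11 (2 of 5) go to 'Child 0-11', ages 12-14 (3 of 5) to 'Teen 12-17'.
    """
    child = count * 2 // 5   # two of the five single-year ages
    teen = count - child     # the remaining three ages, plus any rounding remainder
    return child, teen


# 24155 is the 'Males 10 to 14 years' count in the first row of the 2016 sample below
child, teen = split_10_to_14(24155)
print(child, teen)
```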

In the code below we define the variables to query the API and download the data:

In [3]:
host = "https://api.census.gov/data"
dataset = "acs/acs5"

vars_to_retrieve = {}
for i, age in enumerate(ages):
    # B01001_003E..B01001_025E hold the male age groups and
    # B01001_027E..B01001_049E the female ones (zero-padded to 3 digits)
    males_suf = f"{i+3:03d}E"
    females_suf = f"{i+27:03d}E"

    vars_to_retrieve['B01001_'+males_suf] = "Males " + age
    vars_to_retrieve['B01001_'+females_suf] = "Females " + age

predicates = {}
predicates["get"] = ",".join(vars_to_retrieve.keys())
predicates["for"] = "congressional district:*"
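To make the request concrete, the sketch below assembles the URL that these settings correspond to, using only the standard library. It is an illustration rather than the notebook's code: `requests` builds the query string itself, and its exact percent-encoding may differ from `urlencode`. The variable list is shortened to two entries for readability.

```python
from urllib.parse import urlencode, parse_qs

host = "https://api.census.gov/data"
dataset = "acs/acs5"

# A shortened variable list for illustration; the notebook requests all 46 variables.
params = {
    "get": "B01001_003E,B01001_027E",
    "for": "congressional district:*",
}

# Year-specific endpoint followed by the encoded query string
url = "/".join([host, "2016", dataset]) + "?" + urlencode(params)
print(url)

# Decoding the query string recovers the original predicates
decoded = parse_qs(urlencode(params))
print(decoded)
```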

As an example, we now make a query for the year 2016:

In [4]:
base_url = "/".join([host, "2016", dataset])
req = requests.get(base_url, params=predicates)

# the first row of the response holds the column names, the rest the data
rows = req.json()
population_df = pd.DataFrame(columns=rows[0], data=rows[1:])
# replace the variable codes with readable labels; other columns keep their names
population_df.columns = population_df.columns.map(lambda x: vars_to_retrieve.get(x, x))
population_df.head()

Unnamed: 0,Males under 5 years,Females under 5 years,Males 5 to 9 years,Females 5 to 9 years,Males 10 to 14 years,Females 10 to 14 years,Males 15 to 17 years,Females 15 to 17 years,Males 18 and 19 years,Females 18 and 19 years,...,Males 70 to 74 years,Females 70 to 74 years,Males 75 to 79 years,Females 75 to 79 years,Males 80 to 84 years,Females 80 to 84 years,Males 85 years and over,Females 85 years and over,state,congressional district
0,21966,21389,22381,22137,24155,23036,14742,13722,8614,8792,...,12662,15291,8931,10840,5885,7742,3906,8542,1,1
1,21659,20239,22446,21620,23014,21553,13455,13432,8500,7823,...,11917,14105,8250,11491,5356,8036,3668,7685,1,2
2,20475,20475,22282,21188,22772,21565,14166,13165,10791,10799,...,11808,13943,8208,10528,4873,7587,3513,7347,1,3
3,20894,19852,22456,20403,22469,22756,14240,13180,8257,7136,...,14077,17079,9172,11968,5635,8938,4181,7851,1,4
4,22343,22165,19246,18447,20242,18193,11799,12054,13940,16007,...,9034,13191,6400,9762,4630,7973,4326,10049,42,2
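The response body is a JSON array of arrays: the first inner array holds the column names and each following one holds a row, which is why the code above uses element 0 as the header. A minimal sketch of that reshaping without pandas is shown here; `rows_to_records` is a hypothetical helper, and the payload is a hand-written excerpt based on the first two rows above, not a real API response.

```python
def rows_to_records(payload, rename):
    """Turn a header-plus-rows payload into a list of dicts with readable keys."""
    header, *rows = payload
    # Map variable codes to labels; columns without a label keep their name
    names = [rename.get(col, col) for col in header]
    return [dict(zip(names, row)) for row in rows]


# Hand-written excerpt: the API returns every value as a string
payload = [
    ["B01001_003E", "B01001_027E", "state", "congressional district"],
    ["21966", "21389", "01", "01"],
    ["21659", "20239", "01", "02"],
]
rename = {
    "B01001_003E": "Males under 5 years",
    "B01001_027E": "Females under 5 years",
}

records = rows_to_records(payload, rename)
print(records[0])
```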


In [5]:
population_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 437 entries, 0 to 436
Data columns (total 48 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Males under 5 years        437 non-null    object
 1   Females under 5 years      437 non-null    object
 2   Males 5 to 9 years         437 non-null    object
 3   Females 5 to 9 years       437 non-null    object
 4   Males 10 to 14 years       437 non-null    object
 5   Females 10 to 14 years     437 non-null    object
 6   Males 15 to 17 years       437 non-null    object
 7   Females 15 to 17 years     437 non-null    object
 8   Males 18 and 19 years      437 non-null    object
 9   Females 18 and 19 years    437 non-null    object
 10  Males 20 years             437 non-null    object
 11  Females 20 years           437 non-null    object
 12  Males 21 years             437 non-null    object
 13  Females 21 years           437 non-null    object
 14  Males 22 t

All columns are returned as strings (`object` dtype), so we cast the numerical columns to a nullable unsigned integer type:

In [6]:
columns_to_cast = [x for x in population_df.columns if x not in ["state", "congressional district"]]
population_df[columns_to_cast] = population_df[columns_to_cast].astype('UInt64')
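A nullable integer dtype is useful here because a count can be missing (the 2019 'ZZ' rows further down show empty values). The sketch below is one hedged way to make the conversion explicit, going through `pd.to_numeric` first; the two-element series is made up for illustration, and the notebook itself casts directly with `astype('UInt64')`.

```python
import pandas as pd

# Made-up sample: one count as a string and one missing value
raw = pd.Series(['21966', None], dtype='object')

# Parse the strings as numbers, then store them as nullable unsigned integers
counts = pd.to_numeric(raw).astype('UInt64')

print(counts.dtype)
print(counts.isna().tolist())
```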

We retrieve the state name from the code, using the official mapping provided by the Census Bureau:

In [7]:
usa_states = pd.read_csv(
    'https://www2.census.gov/geo/docs/reference/state.txt',
    sep='|',
    dtype={'STATE': str, 'STATE_NAME': str}
)
usa_num_name = usa_states.set_index('STATE').to_dict()['STATE_NAME']
population_df['state_name'] = population_df['state'].map(lambda x: usa_num_name[x])
population_df.head()

Unnamed: 0,Males under 5 years,Females under 5 years,Males 5 to 9 years,Females 5 to 9 years,Males 10 to 14 years,Females 10 to 14 years,Males 15 to 17 years,Females 15 to 17 years,Males 18 and 19 years,Females 18 and 19 years,...,Females 70 to 74 years,Males 75 to 79 years,Females 75 to 79 years,Males 80 to 84 years,Females 80 to 84 years,Males 85 years and over,Females 85 years and over,state,congressional district,state_name
0,21966,21389,22381,22137,24155,23036,14742,13722,8614,8792,...,15291,8931,10840,5885,7742,3906,8542,1,1,Alabama
1,21659,20239,22446,21620,23014,21553,13455,13432,8500,7823,...,14105,8250,11491,5356,8036,3668,7685,1,2,Alabama
2,20475,20475,22282,21188,22772,21565,14166,13165,10791,10799,...,13943,8208,10528,4873,7587,3513,7347,1,3,Alabama
3,20894,19852,22456,20403,22469,22756,14240,13180,8257,7136,...,17079,9172,11968,5635,8938,4181,7851,1,4,Alabama
4,22343,22165,19246,18447,20242,18193,11799,12054,13940,16007,...,13191,6400,9762,4630,7973,4326,10049,42,2,Pennsylvania
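The `dtype` argument matters because state codes are zero-padded strings such as '01', and the lookup only works if the keys keep that form. The sketch below uses a hand-written two-column excerpt in the same pipe-delimited layout (the real file has more columns) to show what happens with and without it.

```python
import io
import pandas as pd

# Hand-written excerpt; only the STATE and STATE_NAME columns are reproduced
sample = "STATE|STATE_NAME\n01|Alabama\n42|Pennsylvania\n"

as_str = pd.read_csv(io.StringIO(sample), sep='|', dtype={'STATE': str, 'STATE_NAME': str})
as_default = pd.read_csv(io.StringIO(sample), sep='|')

codes_str = as_str['STATE'].tolist()      # codes kept as text, leading zero preserved
codes_int = as_default['STATE'].tolist()  # codes parsed as integers, leading zero lost

print(codes_str, codes_int)
```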


Now that we have tested the API, we can download the data for every year from 2013 to 2020:

In [8]:
years = ["20"+str(i) for i in range(13, 21)]
population_df = pd.DataFrame()

for year in years:
    base_url = "/".join([host, year, dataset])
    req = requests.get(base_url, params=predicates)
    
    population_year_df = pd.DataFrame(
        columns=req.json()[0],
        data=req.json()[1:]
    )
    population_year_df.columns = population_year_df.columns.map(lambda x: vars_to_retrieve[x] if x in vars_to_retrieve else x)
    columns_to_cast = [x for x in population_year_df.columns if x not in ["state", "congressional district"]]
    population_year_df[columns_to_cast] = population_year_df[columns_to_cast].astype('UInt64')
    
    population_year_df['year'] = year
    population_year_df['state_name'] = population_year_df['state'].map(lambda x: usa_num_name[x])
    population_df = pd.concat([population_df, population_year_df])

population_df.head()

Unnamed: 0,Males under 5 years,Females under 5 years,Males 5 to 9 years,Females 5 to 9 years,Males 10 to 14 years,Females 10 to 14 years,Males 15 to 17 years,Females 15 to 17 years,Males 18 and 19 years,Females 18 and 19 years,...,Males 75 to 79 years,Females 75 to 79 years,Males 80 to 84 years,Females 80 to 84 years,Males 85 years and over,Females 85 years and over,state,congressional district,year,state_name
0,22186,19450,23272,22210,23858,23604,15575,14519,10681,9379,...,8411,10417,6175,8551,4594,8736,36,1,2013,New York
1,19327,20779,22937,21950,26645,25324,16093,15963,9839,9135,...,7373,10791,5555,9189,4349,9301,36,2,2013,New York
2,19150,18238,22696,21689,26899,23183,17370,15360,8636,7772,...,10006,12443,7906,12171,7755,14125,36,3,2013,New York
3,20733,20158,22266,21096,24335,23010,16087,15704,10127,9698,...,7384,10767,6302,9617,5872,12277,36,4,2013,New York
4,24339,22668,23207,22743,24458,25369,17220,15645,11145,10582,...,6945,9606,4360,7803,3441,8264,36,5,2013,New York
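One consequence of concatenating the per-year frames as above is that each frame keeps its own row labels, so index values repeat across years in `population_df`. The sketch below, on made-up two-row frames, shows that behaviour and the `ignore_index=True` alternative. It also collects the frames in a list and concatenates once, a common pattern that avoids growing a DataFrame inside the loop.

```python
import pandas as pd

# Two tiny stand-ins for the per-year frames built in the loop (values are made up)
frame_2013 = pd.DataFrame({'state': ['01', '02'], 'year': '2013'})
frame_2014 = pd.DataFrame({'state': ['01', '02'], 'year': '2014'})

# Collect the frames in a list and concatenate once at the end
frames = [frame_2013, frame_2014]
stacked = pd.concat(frames)                        # keeps each frame's own index
renumbered = pd.concat(frames, ignore_index=True)  # assigns a fresh 0..n-1 index

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```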


We now group the population by age as stated above:

In [9]:
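child_ages = [ages[i] for i in range(2)]  # 'under 5 years', '5 to 9 years'
# note: range(3, 5) also picks up '18 and 19 years', so the teen totals
# below cover ages 12-19; range(3, 4) would stop at 17
teen_ages = [ages[i] for i in range(3, 5)]
adult_ages = [ages[i] for i in range(5, len(ages))]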

population_df['male_child'] = population_df[['Males '+age for age in child_ages]].sum(axis=1)
population_df['male_child'] += ((2/5)*population_df['Males '+ages[2]]).astype('UInt64')
population_df['male_teen'] = population_df[['Males '+age for age in teen_ages]].sum(axis=1)
population_df['male_teen'] += ((3/5)*population_df['Males '+ages[2]]).astype('UInt64')
population_df['male_adult'] = population_df[['Males '+age for age in adult_ages]].sum(axis=1)

population_df['female_child'] = population_df[['Females '+age for age in child_ages]].sum(axis=1)
population_df['female_child'] += ((2/5)*population_df['Females '+ages[2]]).astype('UInt64')
population_df['female_teen'] = population_df[['Females '+age for age in teen_ages]].sum(axis=1)
population_df['female_teen'] += ((3/5)*population_df['Females '+ages[2]]).astype('UInt64')
population_df['female_adult'] = population_df[['Females '+age for age in adult_ages]].sum(axis=1)

population_df.head()

Unnamed: 0,Males under 5 years,Females under 5 years,Males 5 to 9 years,Females 5 to 9 years,Males 10 to 14 years,Females 10 to 14 years,Males 15 to 17 years,Females 15 to 17 years,Males 18 and 19 years,Females 18 and 19 years,...,state,congressional district,year,state_name,male_child,male_teen,male_adult,female_child,female_teen,female_adult
0,22186,19450,23272,22210,23858,23604,15575,14519,10681,9379,...,36,1,2013,New York,55001.0,40570.0,262261.0,51101.0,38060.0,273423.0
1,19327,20779,22937,21950,26645,25324,16093,15963,9839,9135,...,36,2,2013,New York,52922.0,41919.0,254920.0,52858.0,40292.0,278666.0
2,19150,18238,22696,21689,26899,23183,17370,15360,8636,7772,...,36,3,2013,New York,52605.0,42145.0,256785.0,49200.0,37041.0,282972.0
3,20733,20158,22266,21096,24335,23010,16087,15704,10127,9698,...,36,4,2013,New York,52733.0,40815.0,252800.0,50458.0,39208.0,279627.0
4,24339,22668,23207,22743,24458,25369,17220,15645,11145,10582,...,36,5,2013,New York,57329.0,43039.0,249486.0,55558.0,41448.0,300275.0
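As a quick sanity check, the male totals in the first row (New York, district 1, 2013) can be reproduced by hand from the counts shown in the same row. The sketch below is plain Python rather than the notebook's pandas code. It follows the cell as written, so the teen total includes '18 and 19 years' (because `teen_ages` spans indices 3 and 4), and the fractional shares are truncated.

```python
# Counts for New York, district 1, 2013, copied from the output above
males = {
    'under 5 years': 22186,
    '5 to 9 years': 23272,
    '10 to 14 years': 23858,
    '15 to 17 years': 15575,
    '18 and 19 years': 10681,
}

# int() truncates the fractional part, which reproduces the totals shown above
child_share = int((2 / 5) * males['10 to 14 years'])
teen_share = int((3 / 5) * males['10 to 14 years'])

male_child = males['under 5 years'] + males['5 to 9 years'] + child_share
male_teen = males['15 to 17 years'] + males['18 and 19 years'] + teen_share

print(male_child, male_teen)
```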


We rename and reorder the columns and we sort the rows:

In [10]:
population_df.rename(
    columns={
        'state': 'state_code',
        'state_name': 'state',
        'congressional district': 'congressional_district'
    },
    inplace=True
)
cols = [
    'state_code',
    'state',
    'congressional_district',
    'year',
    'male_child',
    'male_teen',
    'male_adult',
    'female_child',
    'female_teen',
    'female_adult'
    ] + \
    ['Males '+age for age in ages] + \
    ['Females '+age for age in ages]
population_df = population_df[cols]
population_df.sort_values(
    by=['year', 'state_code', 'congressional_district'],
    inplace=True
)

In [11]:
population_df['congressional_district'].unique()

array(['01', '02', '03', '04', '05', '06', '07', '00', '08', '09', '10',
       '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21',
       '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32',
       '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43',
       '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '98',
       'ZZ'], dtype=object)

In [12]:
population_df[population_df['congressional_district']=='ZZ']

Unnamed: 0,state_code,state,congressional_district,year,male_child,male_teen,male_adult,female_child,female_teen,female_adult,...,Females 50 to 54 years,Females 55 to 59 years,Females 60 and 61 years,Females 62 to 64 years,Females 65 and 66 years,Females 67 to 69 years,Females 70 to 74 years,Females 75 to 79 years,Females 80 to 84 years,Females 85 years and over
217,9,Connecticut,ZZ,2018,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
233,17,Illinois,ZZ,2018,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
150,26,Michigan,ZZ,2018,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
423,9,Connecticut,ZZ,2019,,,0.0,,,0.0,...,,,,,,,,,,
27,17,Illinois,ZZ,2019,,,0.0,,,0.0,...,,,,,,,,,,
150,26,Michigan,ZZ,2019,,,0.0,,,0.0,...,,,,,,,,,,
308,9,Connecticut,ZZ,2020,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
133,17,Illinois,ZZ,2020,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
335,26,Michigan,ZZ,2020,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


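We drop the rows with district code 'ZZ', which the Census Bureau uses for areas not assigned to any congressional district; as shown above, they contain no population: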

In [13]:
population_df = population_df[population_df['congressional_district'] != 'ZZ']
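Before discarding rows it can be worth confirming that they really carry no population. The sketch below does this on a hypothetical miniature of `population_df`: the non-zero counts are made up, the 'ZZ' rows mimic the zeros and missing values seen in the output above, and only two of the count columns are included.

```python
import pandas as pd

# Hypothetical miniature of population_df (non-zero counts are made up)
df = pd.DataFrame({
    'state': ['Connecticut', 'Connecticut', 'Connecticut'],
    'congressional_district': ['01', 'ZZ', 'ZZ'],
    'year': ['2018', '2018', '2019'],
    'male_adult': [250000.0, 0.0, 0.0],
    'male_child': [50000.0, 0.0, None],
})

count_cols = ['male_adult', 'male_child']
zz_rows = df[df['congressional_district'] == 'ZZ']

# Treat missing counts as zero and sum everything in the 'ZZ' rows
zz_total = zz_rows[count_cols].fillna(0).sum().sum()
print(zz_total)

# Keep every row that is not a 'ZZ' row
kept = df[df['congressional_district'] != 'ZZ']
```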

In [14]:
population_df[population_df['congressional_district']=='98']['state'].unique()

array(['District of Columbia', 'Puerto Rico'], dtype=object)

We set the congressional district of the District of Columbia to 0, the notation used in the gun incidents dataset:

In [15]:
population_df.loc[population_df['state'] == 'District of Columbia', 'congressional_district'] = 0

We convert the state names to uppercase:

In [16]:
population_df['state'] = population_df['state'].str.upper()

We check whether there are any missing values:

In [17]:
population_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3496 entries, 296 to 261
Data columns (total 56 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   state_code                 3496 non-null   object 
 1   state                      3496 non-null   object 
 2   congressional_district     3496 non-null   object 
 3   year                       3496 non-null   object 
 4   male_child                 3496 non-null   Float64
 5   male_teen                  3496 non-null   Float64
 6   male_adult                 3496 non-null   float64
 7   female_child               3496 non-null   Float64
 8   female_teen                3496 non-null   Float64
 9   female_adult               3496 non-null   float64
 10  Males under 5 years        3496 non-null   UInt64 
 11  Males 5 to 9 years         3496 non-null   UInt64 
 12  Males 10 to 14 years       3496 non-null   UInt64 
 13  Males 15 to 17 years       3496 non-null   UInt

We save the data to a CSV file:

In [18]:
population_df.to_csv('../data/population.csv', index=False)