# Build GA 2020 Source Data

This notebook accepts the Census 2020 DHC data and polling location data to build a csv to be used with scip.  Clone this notebook for new states or years.

## Census Data:

### 2020 Redistricting [P4 group](https://api.census.gov/data/2010/dec/sf1/groups/P4.html)
[HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE FOR THE POPULATION 18 YEARS AND OVER](https://data.census.gov/table?g=050XX00US13135$1000000&d=DEC+Redistricting+Data+(PL+94-171)&tid=DECENNIALPL2020.P4)
1. Select Geography:
   1. Filter for Geography -> Blocks -> State -> County Name, State -> All Blocks within County Name, State
   1. If asked to select table vintage, select 2020;  DEC Redistricting Data (PL-94-171)

Columns we want from P4:
* Total population
* Total hispanic
* Total non-hispanic

### 2020 Redistricting [P3 group](https://api.census.gov/data/2010/dec/sf1/groups/P3.html)
Data source that includes Race and geography: 
[RACE FOR THE POPULATION 18 YEARS AND OVER](https://data.census.gov/table?q=P3:+RACE+FOR+THE+POPULATION+18+YEARS+AND+OVER&tid=DECENNIALPL2020.P3)
1. Select Geography:
   1. Filter for Geography -> Blocks -> State -> County Name, State -> All Blocks within County Name, State
   1. If asked to select table vintage, select 2020;  DEC Redistricting Data (PL-94-171)

Colums we want from P3:
* White alone
* Black or African American alone
* American Indian And Alaska Native alone
* Asian alone
* Native Hawaiian and Other Pacific Islander alone
* Some Other Race alone
* Two or More Races


### 2020 Tiger/Line Shapefiles: Blocks (2020) 
Source Data https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2020&layergroup=Blocks+%282020%29

Documentation: https://www.census.gov/programs-surveys/geography/technical-documentation/complete-technical-documentation/tiger-geo-line/2020.html

Columns we want from blocks:
* GEO_ID - obtained by converting values in GEOID20 column and preppending "1000000US", e.g. 131510703153004 -> 1000000US131510703153004
* geometry - the polygon of the block
* INTPTLAT20 - latitude of block centroid
* INTPTLON20 - longitude of block centroid

**Susama:** please verify the above

# Instructions


## Redistricting source data:
1. Download desired Census Data
   1. P4 zip file from Census
   1. P3 zip file from Census
   1. Recommend downloading one county's worth of data at a time.
1. Create a directory to datasets/census/redistricting/County_Name_ST **Chad**: This is a change to  your folder structure. I want to indicate geography in the name
   1. E.g datasets/census/redistricting/Gwinett_GA
   1. Key files: DECENNIALPL2020.P3-Data.csv; DECENNIALPL2020.P4-Data.csv
   1. Note, this requires one to filter for an individual county from the census.
1. Unzip the downloaded file P11 file zip to the directory 

## Tiger/Line Shapefile
1. Download desired Tiger/Line file zip file from Census
1. Create a directory to datasets/census/tiger/<download name> e.g. datasets/census/tiger/tl_2020_13_tabblock20
1. Unzip the downloaded Tiger/Line zip file to the directory 
    
## Run cells
1. Update constants such as P11_SOURCE_FILE, P3_SOURCE_FILE, BLOCK_SOURCE_FILE as needed
1. Run each cell


In [74]:
# Load the P11 csv source data into a data frame and filter out unneeded columns

import pandas as pd  
import numpy as np  
#import fiona #; help(fiona.open)
import geopandas as gpd

P4_SOURCE_FILE = 'datasets/census/redistricting/Gwinett_GA/DECENNIALPL2020.P4-Data.csv'
P3_SOURCE_FILE = 'datasets/census/redistricting/Gwinett_GA/DECENNIALPL2020.P3-Data.csv'

# The column in the P11 data that contains the GEO id.  This will be used later to join
# against the Block Shape File
P4_GEOID = 'GEO_ID'
P3_GEOID = 'GEO_ID'

# Prefix to add to Shape files to join them with this P11.  Note this
# needs to match the prefix found in GEO_ID output from this cell.
GEO_ID_PREFIX = '1000000US'

PL_P4_COLUMNS = [
    P4_GEOID,
    'NAME',
    'P4_001N', # Total population
    'P4_002N', # Total hispanic
    'P4_003N', # Total non-hispanic
]

PL_P3_COLUMNS = [
    P3_GEOID,
    'NAME',
    'P3_001N', # Total population
    'P3_002N', # White alone
    'P3_003N', # Black or African American alone
    'P3_004N', # American Indian or Alaska Native alone
    'P3_005N', # Asian alone
    'P3_006N', # Native Hawaiian and Other Pacific Islander alone
    'P3_007N', # Some other race alone 
    'P3_008N', # Two or More Races   
]

pd.set_option('display.max_columns', None)

#print('DHC P11 File')
#print(f'  Source {P11_SOURCE_FILE}')

p4_df = pd.read_csv(
    P4_SOURCE_FILE,
    header=[0,1], # DHC files have two headers rows when exported to CSV - tell pandas to only take top one
    low_memory=False, # files are too big, set this to False to prevent errors
    # nrows=10, # limit rows loaded - testing purposes only
)

p3_df = pd.read_csv(
    P3_SOURCE_FILE,
    header=[0,1], # DHC files have two headers rows when exported to CSV - tell pandas to take top one
    low_memory=False, # files are too big, set this to False to prevent errors
    # nrows=10, # limit rows loaded - testing purposes only
)


# Filter out the un-needed columns and keep only one header
p4_df = p4_df[PL_P4_COLUMNS]
p4_df.columns = p4_df.columns=[multicols[0] for multicols in p4_df.columns]

p3_df = p3_df[PL_P3_COLUMNS]
p3_df.columns = p3_df.columns=[multicols[0] for multicols in p3_df.columns]



In [85]:
#Merge the data sets
demographics = p4_df.merge(p3_df, left_on=['GEO_ID', 'NAME'], right_on=['GEO_ID', 'NAME'],how = 'outer')

#Consistency check for the data pull
demographics['Pop_diff'] = demographics.P4_001N- demographics.P3_001N
if demographics.loc[demographics.Pop_diff != 0].shape[0]!=0:
    raise ValueError('Populations different in P3 and P4. Are both pulled from the voting age universe?')

#Change column names
demographics.drop(['P4_001N'], axis =1)
#demographics = demographics.rename(columns = {'P4_002N': 'hispanic', 'P4_003N':'non-hispanic', 'P3_001N':'population', 'P3_002N':'white', 'P3_003N':'black', 
#                      'P3_004N':'native', 'P3_005N':'asian', 'P3_006N':'hispanic', 'P3_007N':'other', 'P3_008N':'multiple_races'})
#Note, Hispanic is an ethnicity, not a race. The P4 columns add to the total population. The P3 columns add to the total population
demographics.columns

Index(['GEO_ID', 'NAME', 'P4_001N', 'P4_002N', 'P4_003N', 'P3_001N', 'P3_002N',
       'P3_003N', 'P3_004N', 'P3_005N', 'P3_006N', 'P3_007N', 'P3_008N',
       'Pop_diff'],
      dtype='object')

In [2]:
# Load the census block shape file using geopandas and filter out unneeded columns

BLOCK_SOURCE_FILE = 'datasets/census/tiger/tl_2020_13_tabblock20/tl_2020_13_tabblock20.shp'

BLOCKS_GEOID = 'GEOID20'

print('Shape File')
print(f'  Source {BLOCK_SOURCE_FILE}')
print(f'  Layers: {fiona.listlayers(BLOCK_SOURCE_FILE)}')

blocks_gdf = gpd.read_file(BLOCK_SOURCE_FILE)

# Create a GEO_ID column for the blocks file
blocks_geoids=blocks_gdf.apply(lambda row: f'{GEO_ID_PREFIX}{row[BLOCKS_GEOID]}', axis=1)
blocks_gdf = blocks_gdf.assign(GEO_ID = blocks_geoids)
blocks_gdf

Shape File
  Source datasets/census/tiger/tl_2020_13_tabblock20/tl_2020_13_tabblock20.shp
  Layers: ['tl_2020_13_tabblock20']


Unnamed: 0,STATEFP20,COUNTYFP20,TRACTCE20,BLOCKCE20,GEOID20,NAME20,MTFCC20,UR20,UACE20,UATYPE20,FUNCSTAT20,ALAND20,AWATER20,INTPTLAT20,INTPTLON20,HOUSING20,POP20,geometry,GEO_ID
0,13,151,070315,3004,131510703153004,Block 3004,G5040,U,03817,U,S,75161,0,+33.5064575,-084.1940099,31,95,"POLYGON ((-84.19809 33.50737, -84.19781 33.507...",1000000US131510703153004
1,13,161,960300,2024,131619603002024,Block 2024,G5040,R,,,S,130224,0,+31.7508557,-082.7342260,5,11,"POLYGON ((-82.73693 31.75329, -82.73297 31.753...",1000000US131619603002024
2,13,161,960102,3033,131619601023033,Block 3033,G5040,U,37891,U,S,199830,0,+31.8642367,-082.5914771,14,15,"POLYGON ((-82.59510 31.86420, -82.59501 31.864...",1000000US131619601023033
3,13,121,007806,1008,131210078061008,Block 1008,G5040,U,03817,U,S,15500,0,+33.7526075,-084.5173215,13,33,"POLYGON ((-84.51803 33.75287, -84.51799 33.753...",1000000US131210078061008
4,13,121,000502,2036,131210005022036,Block 2036,G5040,U,03817,U,S,1491,0,+33.7962344,-084.3743644,0,0,"POLYGON ((-84.37467 33.79614, -84.37465 33.796...",1000000US131210005022036
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
232712,13,251,970500,2007,132519705002007,Block 2007,G5040,R,,,S,7766465,15669,+32.5629842,-081.5048558,34,81,"POLYGON ((-81.52109 32.55130, -81.52107 32.551...",1000000US132519705002007
232713,13,251,970200,1079,132519702001079,Block 1079,G5040,R,,,S,1098966,61978,+32.9336306,-081.6341132,0,0,"POLYGON ((-81.64892 32.93368, -81.64816 32.933...",1000000US132519702001079
232714,13,251,970500,1065,132519705001065,Block 1065,G5040,R,,,S,2376584,1729,+32.5269416,-081.6030725,0,0,"POLYGON ((-81.61204 32.52818, -81.61204 32.528...",1000000US132519705001065
232715,13,131,950501,2010,131319505012010,Block 2010,G5040,R,,,S,0,22810,+30.8587725,-084.1294148,0,0,"POLYGON ((-84.13059 30.85896, -84.12944 30.859...",1000000US131319505012010


In [3]:
# Combine the DHC p11 data with the block groups shape file using a join on GEO IDs

combined_df = p11_df.copy()
combined_df.columns = combined_df.columns.droplevel(1)

combined_df = combined_df.join(blocks_gdf.set_index(P11_GEOID), on=P11_GEOID, how='inner')
combined_df

Unnamed: 0,GEO_ID,NAME,P11_001N,P11_002N,P11_003N,P11_005N,P11_006N,P11_007N,P11_008N,P11_009N,P11_010N,STATEFP20,COUNTYFP20,TRACTCE20,BLOCKCE20,GEOID20,NAME20,MTFCC20,UR20,UACE20,UATYPE20,FUNCSTAT20,ALAND20,AWATER20,INTPTLAT20,INTPTLON20,HOUSING20,POP20,geometry
1,1000000US130019501001000,"Block 1000, Block Group 1, Census Tract 9501, ...",12,0,12,12,0,0,0,0,0,13,001,950100,1000,130019501001000,Block 1000,G5040,R,,,S,16677473,20707,+31.9226639,-082.3107529,8,12,"POLYGON ((-82.34985 31.92087, -82.34960 31.920..."
2,1000000US130019501001001,"Block 1001, Block Group 1, Census Tract 9501, ...",9,0,9,5,0,2,0,0,0,13,001,950100,1001,130019501001001,Block 1001,G5040,R,,,S,331208,0,+31.9079015,-082.3314389,2,10,"POLYGON ((-82.33439 31.90390, -82.33416 31.904..."
3,1000000US130019501001002,"Block 1002, Block Group 1, Census Tract 9501, ...",0,0,0,0,0,0,0,0,0,13,001,950100,1002,130019501001002,Block 1002,G5040,R,,,S,0,613882,+31.9344955,-082.3076355,0,0,"POLYGON ((-82.35306 31.93902, -82.34760 31.938..."
4,1000000US130019501001003,"Block 1003, Block Group 1, Census Tract 9501, ...",0,0,0,0,0,0,0,0,0,13,001,950100,1003,130019501001003,Block 1003,G5040,R,,,S,576473,0,+31.9427806,-082.3107383,0,0,"POLYGON ((-82.31536 31.94550, -82.31516 31.945..."
5,1000000US130019501001004,"Block 1004, Block Group 1, Census Tract 9501, ...",0,0,0,0,0,0,0,0,0,13,001,950100,1004,130019501001004,Block 1004,G5040,R,,,S,0,5475,+31.9391665,-082.3540595,0,0,"POLYGON ((-82.35489 31.93956, -82.35484 31.939..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
232713,1000000US133219506002054,"Block 2054, Block Group 2, Census Tract 9506, ...",0,0,0,0,0,0,0,0,0,13,321,950600,2054,133219506002054,Block 2054,G5040,R,,,S,42401,0,+31.3390687,-083.8502829,0,0,"POLYGON ((-83.85137 31.34027, -83.85055 31.340..."
232714,1000000US133219506002055,"Block 2055, Block Group 2, Census Tract 9506, ...",11,0,11,10,0,0,0,0,0,13,321,950600,2055,133219506002055,Block 2055,G5040,R,,,S,1365255,0,+31.3350513,-083.8346621,7,11,"POLYGON ((-83.85060 31.33752, -83.85040 31.337..."
232715,1000000US133219506002056,"Block 2056, Block Group 2, Census Tract 9506, ...",7,0,7,6,0,0,0,0,0,13,321,950600,2056,133219506002056,Block 2056,G5040,R,,,S,266576,0,+31.3360159,-083.8797673,3,8,"POLYGON ((-83.88364 31.33533, -83.88303 31.335..."
232716,1000000US133219506002057,"Block 2057, Block Group 2, Census Tract 9506, ...",0,0,0,0,0,0,0,0,0,13,321,950600,2057,133219506002057,Block 2057,G5040,R,,,S,45380,0,+31.3340078,-083.8840827,0,0,"POLYGON ((-83.88901 31.33359, -83.88827 31.333..."


In [None]:
# TODO combine above with pollng locations