<img src="../data/images/good_husband.png" style="float: left; padding: 5px 10px 5px 10px;" />  

# Census2020 DataMapping
### Good Husband Version

Example paragraphs below of loading all the data files without filtering, joining/merging, then filtering - then loading the shapefiles we need based on the county FIPS numbers in the filtered rows.

We've imported below the os and pathlib modules and included some code to make it easier to refer to folders where the various data files are stored when we need to load them.

In [1]:
# os module provides a variety of frequently used file system functions including path.join
# pathlib module makes it easy to manipulate folder and file paths with Python
# everything builds off basedir, so if you move later only have to change basedir
import os, pathlib
# Immediate parent folder of this notebook
# Where we expect other folders with data to be found
base_dir = pathlib.Path(os.getcwd()).parent

# The data_archive folder - compressed files go here
# these will be preserved in git
data_archive_dir = os.path.join(base_dir, "data_archive")

# The data folder - large/numerous files go here
# These will not be preserved in git!
# Only put files here that we can recreate with some
# python code (e.g. downlod or unpack from data_archive,
# or generate from a dataframe)
data_dir = os.path.join(base_dir, "data")

# data/shapes folder: Other folders containing shape files go here
shapes_dir = os.path.join(data_dir,"shapes")

# data/geogson folder: When we generate geojson and want to save
# for re-use we'll save those files here 
json_dir = os.path.join(data_dir,"geojson")

### State Codes

In [2]:
import pandas as pd

# This is a list of all states and territories in the US
# we got the file from here: https://www.census.gov/library/reference/code-lists/ansi.html
# We'll give the columns more descriptive names 

file_state_fips = "../data_archive/state.txt"
df_state_fips = pd.read_csv(file_state_fips, delimiter="|")
df_state_fips.reset_index(inplace=True)
df_state_fips.rename(columns={'STATE': 'State FIPS', 'STUSAB': 'State', 'STATE_NAME': 'State Name'}, inplace=True)
df_state_fips = df_state_fips[['State FIPS', 'State', 'State Name',]]
df_state_fips.head()
#?? Julie - need to drop index

Unnamed: 0,State FIPS,State,State Name
0,1,AL,Alabama
1,2,AK,Alaska
2,4,AZ,Arizona
3,5,AR,Arkansas
4,6,CA,California


### Place Codes
NOTE: Changed this a bit to only load the cols we want (dropped place fips because we don't use it anywhere) and renamed cols directly (no copy)

In [3]:
place_file = os.path.join(data_archive_dir, 'PLACElist.txt')

df_place = pd.read_csv(place_file,
                       usecols=['STATE','STATEFP', 'PLACENAME', 'COUNTY'],
                       delimiter="|",
                       encoding_errors='ignore'
                      )[['STATE','STATEFP', 'COUNTY', 'PLACENAME']] # Reorder cols

# Rename columns
df_place.columns = ['State', 'State FIPS', 'County', 'Place']

df_place.head(3)

Unnamed: 0,State,State FIPS,County,Place
0,AL,1,Chambers County,Abanda CDP
1,AL,1,Henry County,Abbeville city
2,AL,1,Jefferson County,Adamsville city


#### Annoying that some Place names are not capitalized correctly!
see New York city (city not capitlaized) below

Somewhat useless example of using an index (useless here... we'll see why) probably useful in another situation.

In [4]:
df_place.loc[df_place['Place'].str.startswith("New York")]

Unnamed: 0,State,State FIPS,County,Place
17317,MN,27,Otter Tail County,New York Mills city
17318,MN,27,Otter Tail County,New York Mills city
24316,NY,36,"Bronx County, Kings County, New York County, Q...",New York city
24317,NY,36,Oneida County,New York Mills village


#### Let's just change the New York's to have Uppercase last words

In [5]:
# Create an index of just the rows where Place starts with "New York"
idx = df_place.loc[df_place['Place'].str.startswith("New York"),'Place'].index
idx # just a bunch of numbers - the rows where this occurs

Int64Index([17317, 17318, 24316, 24317], dtype='int64')

#### Change just those rows
use iloc[idx, column name] to select just our rows (not a view) and assign something to them
Here we're assigning the same selection but calling Series.str.title() which uppercases the first letter of each word in a atring (also see str.upper() and str.lower()... and firends)

In [6]:
df_place.loc[idx,'Place'] = df_place.loc[idx,'Place'].str.title()
df_place.iloc[idx]

Unnamed: 0,State,State FIPS,County,Place
17317,MN,27,Otter Tail County,New York Mills City
17318,MN,27,Otter Tail County,New York Mills City
24316,NY,36,"Bronx County, Kings County, New York County, Q...",New York City
24317,NY,36,Oneida County,New York Mills Village


#### We only changed the New Yorks
here we're selecting row numbers 2 and 17317 (from the index shown above)
first row was left alone - didn't get the effect of str.title() 

In [7]:
df_place.iloc[[2,24316]]


Unnamed: 0,State,State FIPS,County,Place
2,AL,1,Jefferson County,Adamsville city
24316,NY,36,"Bronx County, Kings County, New York County, Q...",New York City


We see from the above that New York City spans these 5 counties: Bronx, Kings, New York, Queens, and Richmond. Let's split and pivot this list, so that each county appears as a separate record, and save it as a new DataFrame.

In [8]:
# code below splits every Place into separate record for each county

# This is the original version - very slow - like 8-10 seconds

# df_counties = (df_place['County']    # County can contain multiple comma-separated names
#                    .str.split(",")        # split into lists
#                    .apply(pd.Series)      # convert to multiple cols / Series
#                    .stack()               # pivot the series into rows
#                    .str.strip()           # strip leading/trailing spaces
#                    .reset_index(level=1)  # convert to DF by resetting index
#                    .drop('level_1',axis=1) # drop the new 'index'
#                    .rename(columns={0:'County'}) # now-split counties are in column "0" so rename
#                   )

# Much faster < 1 second
df_counties = (df_place['County']
               .str.strip()
               .str.split(",",expand=True)
               .melt(ignore_index=False)
               .dropna()
               .drop('variable',axis=1)
               .rename(columns={'value':'County'})
              )
df_counties = df_counties.join(df_place, rsuffix='_messy').drop(['State'], axis=1)
df_counties = df_counties.drop(['County_messy'], axis=1)

#df_counties
df_counties.loc[df_counties['Place'] == ('New York City')]

Unnamed: 0,County,State FIPS,Place
24316,Bronx County,36,New York City
24316,Kings County,36,New York City
24316,New York County,36,New York City
24316,Queens County,36,New York City
24316,Richmond County,36,New York City


In [9]:
# Now we want to add county codes for the counties that make up NYC (we need these to be able to isolate the NYC census tracts in the shapefiles)
# So, we've downloaded this list of all counties in the US, including their County ANSI code and County Name.
# we got the file from here: ftp://ftp2.census.gov/geo/docs/reference/codes/national_county.txt

national_county_file = os.path.join(data_archive_dir, 'national_county.txt')

df_county_fips = pd.read_csv(national_county_file,
                                 delimiter=",",
                                 usecols=['State ANSI', 'County ANSI', 'County Name'],
                                 encoding_errors='ignore')
df_county_fips.columns = ['State FIPS', 'County FIPS', 'County']

df_county_fips

Unnamed: 0,State FIPS,County FIPS,County
0,1,1,Autauga County
1,1,3,Baldwin County
2,1,5,Barbour County
3,1,7,Bibb County
4,1,9,Blount County
...,...,...,...
3230,72,153,Yauco Municipio
3231,74,300,Midway Islands
3232,78,10,St. Croix Island
3233,78,20,St. John Island


In [10]:
df_codes = pd.merge(df_counties,
                    df_county_fips, on=['State FIPS','County']
                   )[['State FIPS', 'County FIPS', 'County', 'Place']] # Reorder columns
df_codes

Unnamed: 0,State FIPS,County FIPS,County,Place
0,1,17,Chambers County,Abanda CDP
1,1,17,Chambers County,Cusseta town
2,1,17,Chambers County,Five Points town
3,1,17,Chambers County,Fredonia CDP
4,1,17,Chambers County,Huguley CDP
...,...,...,...,...
41331,72,153,Yauco Municipio,Palomas comunidad
41332,72,153,Yauco Municipio,Yauco zona urbana
41333,72,127,San Juan Municipio,San Juan zona urbana
41334,72,139,Trujillo Alto Municipio,Trujillo Alto zona urbana


In [11]:
# stopping now so we can watch a show....
# next... select county FIPS codes from above where Place == "New York City"
# use to_list() on the series to make the list of County FIPS a regular list
# save the list as a variable e.g. 
nyc_counties = df_codes.loc[df_codes['Place'].str.startswith("New York")]
nyc_counties #sorry - something is wrong here - must have i ntroduced a bug above...

Unnamed: 0,State FIPS,County FIPS,County,Place
15199,27,111,Otter Tail County,New York Mills City
15200,27,111,Otter Tail County,New York Mills City
24171,36,65,Oneida County,New York Mills Village
24957,36,5,Bronx County,New York City


## Downloading NY State shapefiles

Now we can finally download our shapefiles! We want just the files for NY State, with means the ones containing 36 in the file name. 

We got the file from [here](ftp://ftp2.census.gov/geo/tiger/TIGER2020PL/LAYER/TRACT/2020).

In [12]:
# Uses the geopandas function read_file to grab our file
import geopandas as gpd

tract_shapefiles_dir = os.path.join(data_dir,"tiger2020PL_NY_tracts")
ny_shapefiles=os.listdir(shapes_dir)
files_to_load = []
#zipfile = "zip://../data/tiger2020PL_NY_tracts/tl_2020_36003_tract20.zip"
#zipfile = "zip://../data/tiger2020PL_NY_tracts/tl_2020_36047_tract20.zip"
for file in ny_shapefiles:
    county_fips = file.replace(".zip","").split("_")[2]


#gdf_NY_tracts = gpd.read_file(zipfile)
#gdf_NY_tracts.rename(columns={'STATEFP20': 'State FIPS', 'COUNTYFP20': 'County FIPS', 'TRACTCE20': 'Tract #', 'GEOID20': 'GEOID', 'PLACENAME': "Place", 'COUNTY':'County'},inplace=True
#gdf_NY_tracts.head()


Now let's take our NY state shapefiles dataframe 'gdf_NY_tracts' and whittle it down to just the tracts within the 5 counties that makeup NYC. We'll match on gdf_NY_tracts['COUNTYFP20'] and df_nyc_codes['COUNTY ANSI'].

In [13]:
# Patrick, the dtypes for gdf_NY_tracts['COUNTYFP20'] and df_nyc_codes['COUNTY ANSI'] are different, so I can't match them up.

# Create a view of the counties data with just the columns we want to include in the join/merge
# by creating a veiew before we merge, we reduce the number of additional / superfluous columns
# produced in the merge of the two dataframes. We'll name our view after the df_nyc_codes
# dataframe with a _v suffix - just so we remember it's a view
#
df_nyc_codes_v = df_nyc_codes[['State ANSI', 'County ANSI', 'County', 'Place']]


# List the columns in the counties data we want to use as keys to match
# against columns in the NY Shapes data
#
counties_join_cols = ['State ANSI', 'County ANSI']


# List the columns in the NY shapes data we want to use as keys to match
# our tracts_join_cols list above
#
tracts_join_cols = ['STATEFP20', 'COUNTYFP20']

# GeoPandas loaded our NY shapes columns as string ("object") types.
# To use our two sets of join columns as keys to match on we'll first
# need to convert the NY shapes columns to numeric (int64) types

gdf_NY_tracts[tracts_join_cols] = gdf_NY_tracts[tracts_join_cols].apply(pd.to_numeric)


# Do the merge - "left" here is the gdf_ny_tracts data and "right" is the nyc_codes
#
ny_shapes = gdf_NY_tracts.merge(df_nyc_codes_v, left_on=tracts_join_cols, right_on=counties_join_cols)

# Create a view of just the columns we need for rendering a map with GeoJSON data
#
geojson_cols = ['']

#gdf_NY_tracts[tracts_join_cols]
#df_nyc_codes_v[counties_join_cols]

NameError: name 'df_nyc_codes' is not defined

## FAQ

**Choropleth**
A choropleth  is a map made of different colored polygons, where each color represents a different quantity or range of quantities. 

**Shapefile**
A standardize file format for storing geospatial information, including geometry (coordinates) and attributes of geographic features, needed to create maps. 

**TIGER?**
TIGER, also referred to as MAF/TIGER, is the Census Bureau's geographic database system. The acronym MAF/TIGER stands for Master Address. File/Topologically Integrated Geographic Encoding and Referencing. 

**GeoJSON** 
A format for storing a variety of geographic data and is based on JSON. We know that JSON is format for storing data. Similar to a Python dictionaries, it uses key value pairs. Dictionaries can be nested.

**Which areas do the TIGER/Line shapefiles describe?**
Shapefiles are available for the fifty states, District of Columbia, Puerto Rico, and the Island areas (American Samoa, the Commonwealth of the Northern
Mariana Islands, Guam, and the United States Virgin Islands).

