# Mapping US Census Tracts 

With population density for every Census Tract in the US now tidily contained in one DataFrame, we can move on to the fun part: mapping! First we'll map the geometry of the Census Tracts; later we'll add color to visually represent population density. We'll be building a tile-map choropleth (as opposed to an outline-based choropleth).

Again, we import the os and pathlib modules below and include some code that makes it easier to refer to the folders where our various data files are stored.

In [1]:
# os module provides a variety of frequently used file system functions including path.join
# pathlib module makes it easier to manipulate folder and file paths with Python
# everything builds off base_dir, so if we move our code later, we'll only need to change base_dir
import os, pathlib
# base_dir - the immediate parent folder of this notebook
# we expect our data folders to be found here
base_dir = pathlib.Path(os.getcwd()).parent

# data_archive - we'll store compressed files here
# these will be preserved in git
data_archive_dir = os.path.join(base_dir, "data_archive")

# data_dir - large/numerous files will go here
# these will not be preserved in git!
# we'll only put files here that can be recreated with some python code (e.g. downloaded 
# or unpacked from data_archive, or generated from a DataFrame)
data_dir = os.path.join(base_dir, "data")

# shapes_dir - folders containing shapefiles go here
shapes_dir = os.path.join(data_dir,"shapes")

# json_dir - we'll store here GeoJSON we've generated and want to save for re-use 
json_dir = os.path.join(data_dir,"geojson")

## Downloading NY State shapefiles

Now we can finally download our shapefiles! 

Source: ftp://ftp2.census.gov/geo/tiger/TIGER2020PL/LAYER/TRACT/2020

In [2]:
# List the shapefile archives in our shapes folder; geopandas (imported here) will read them later with read_file
import geopandas as gpd

tract_shapefiles_dir = os.path.join(shapes_dir,"tiger2020PL_NY_tracts") # provide the full path for the directory containing our shapefiles
ny_shapefiles=[x for x in pathlib.Path(tract_shapefiles_dir).iterdir() if x.is_file()] # make a list of all the files in the directory with their full path

In [3]:
# this code block loops through our 'ny_shapefiles' list, and creates a separate list of FIPS codes for county and state 
county_codes=[] # create an empty list for County FIPS
state_codes=[] # create an empty list for State FIPS
filtered_shapefiles=[]
for file in ny_shapefiles: 
    filename_parts = file.name.replace(".zip","").split("_") # take each filename - remove '.zip', split the remaining string wherever "_"  appears, and save it as a list
    if len(filename_parts) >=3: # take every 'filename_parts' list containing 3 or more elements (fewer than 3 parts indicates a file is extraneous and we don't want it)
        if len(filename_parts[2]) ==5: # take from each list the element at index 2 (position 3), but only if it contains 5 digits [State FIPS + County FIPS = 5 digits]
            # file.name -->  tl_2020_36013_tract20.zip
            # 36013 <---filename_parts[2]
            # 36 <---filename_parts[2][0:2]
            # 013 <---filename_parts[2][-3:]
            county_codes.append(filename_parts[2][-3:]) # take the last 3 digits of the element at index 2, County FIPS, and append it to our list
            state_codes.append(filename_parts[2][0:2])  # take the first 2 digits of the element at index 2, State FIPS, and append it to our list
            filtered_shapefiles.append(file)
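
The same filename check can also be written as a single pattern match. The following is only a sketch: the regular expression and the helper name `parse_fips` are assumptions for illustration, based on the `tl_2020_36013_tract20.zip` naming seen above, and are not part of the notebook's own code.

```python
import re
from pathlib import Path

# Assumed naming pattern for county-level tract archives,
# e.g. tl_2020_36013_tract20.zip (2-digit State FIPS + 3-digit County FIPS)
TRACT_FILE_RE = re.compile(r"^tl_\d{4}_(\d{2})(\d{3})_tract\d{2}\.zip$")

def parse_fips(path):
    """Return (state_fips, county_fips) as strings, or None if the name does not match."""
    match = TRACT_FILE_RE.match(Path(path).name)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Hypothetical file names: one county-level archive and two extraneous files
examples = ["tl_2020_36013_tract20.zip", "tl_2020_36_tract20.zip", "README.txt"]
for name in examples:
    print(name, "->", parse_fips(name))
```

Keeping the codes as strings here preserves leading zeros (for example `013`); they can still be cast to integers later if they are needed as merge keys.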

In [4]:
len(county_codes)

62

In [5]:
len(state_codes)

62

In [6]:
# pandas is needed to build the DataFrame below
import pandas as pd

# let's zip our 3 lists into one and call it 'files_to_load'
files_to_load = list(zip(state_codes, county_codes, filtered_shapefiles))

# now let's turn it into a DataFrame and rename the columns for consistency
df_ny_shapes = pd.DataFrame.from_records(files_to_load).rename({0: 'State FIPS', 1: 'County FIPS', 2: 'File name'}, axis=1)

# cast the State and County codes to integers so they can be used as merge keys later
df_ny_shapes['State FIPS'] = df_ny_shapes['State FIPS'].astype(int)
df_ny_shapes['County FIPS'] = df_ny_shapes['County FIPS'].astype(int)
df_ny_shapes.info()
df_ny_shapes.head()

In [None]:
# merge our two DataFrames into one that contains just the counties within New York City and the full path to their shapefile
df_ny = df_nyc_codes.merge(df_ny_shapes)
df_ny

In [None]:
# now let's create a GeoDataFrame with just the columns we need and rename them for consistency
import geopandas as gpd

zipfile = os.path.join(shapes_dir, 'tiger2020PL_NY_tracts/tl_2020_36005_tract20.zip')
geo_df = gpd.read_file(zipfile)[['STATEFP20','COUNTYFP20', 'TRACTCE20',
                                 'GEOID20', 'ALAND20', 'geometry']].rename(
                                    {'STATEFP20': 'State FIPS','COUNTYFP20': 'County FIPS', 
                                     'TRACTCE20': 'Census Tract', 'GEOID20': 'GEOID', 'ALAND20': 'Land Area'}, axis=1
                                    )

geo_df

### this is the longer way of writing the above code
#geo_df = gpd.read_file(zipfile)

# geo_df['State FIPS'] = geo_df['State FIPS'].astype(int)
# geo_df['County FIPS'] = geo_df['County FIPS'].astype(int)
# geo_df['Census Tract'] = geo_df['Census Tract'].astype(int)
# geo_df['GEOID'] = geo_df['GEOID'].astype(int)
# geo_df
# make a geo dataframe for each of the 5 counties
# concatenate into one big DF and map it?



In [None]:
geo_df.info()

In [None]:
from keplergl import KeplerGl
ny_map = KeplerGl(height=600, show_docs=False)
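# itertuples() puts the index at position 0, so row[5] and row[7] are the 5th and 7th columns of df_ny
# (expected here to hold the layer name and the path to the zipped shapefile)
for row in df_ny.itertuples():
    zipfile = f"zip://{row[7]}"
    ny_map.add_data(data=gpd.read_file(zipfile), name=row[5])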
ny_map

In [None]:
#ny_map.config

In [None]:
import tarfile

import pandas as pd

# 32 MB+ of census data saved in a 4.7 MB archive
census_data_archive = os.path.join(data_archive_dir, "census_data_2022_03_01.tgz")

# This is the US Census file with population data we will extract
# this file is contained in the above tgz file
census_2020_file = "DECENNIALPL2020.P1_data_with_overlays_2021-12-02T121459.csv"

use_cols = [0, 1, 2]
col_names = ['GEOID', 'CENSUS TRACT NAME', 'POPULATION']

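# This extracts a DataFrame from a file stored inside a tgz archive
def extract_from_tgz(filename):
    with tarfile.open(filename) as tf:
        for file in tf.getmembers():
            if file.name == census_2020_file:
                data = tf.extractfile(file)
                # skiprows=1 drops the first header row; header=0 with names= replaces the next header row with col_names
                return pd.read_csv(data, low_memory=False, skiprows=1, header=0, usecols=use_cols, names=col_names)
    # fail loudly instead of silently returning None when the file is not in the archive
    raise FileNotFoundError(f"{census_2020_file} not found in {filename}")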

df_census_raw = extract_from_tgz(census_data_archive)

# change some options that determine how much data is displayed in the notebook


df_census_raw.head(5)

In [None]:
# state_fp = df_census_raw['GEOID'].str.slice(9,11).rename('State FIPS').astype(int)
# county_fp = df_census_raw['GEOID'].str.slice(11,14).rename('County FIPS').astype(int)
# df_census_pop = pd.concat([df_census_raw, state_fp, county_fp], axis=1).drop('GEOID', axis=1)
# df_census_pop
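
The slicing in the commented-out cell above relies on the layout of the GEOID string: a 9-character prefix ending in `US`, then 2 digits of State FIPS, 3 digits of County FIPS, and 6 digits of Census Tract. Here is a minimal plain-Python sketch of that split, assuming GEOIDs shaped like `1400000US36005000100`; the helper name `split_geoid` and the sample value are illustrative assumptions, not taken from the data file.

```python
def split_geoid(geoid):
    """Split a tract GEOID such as '1400000US36005000100' into its FIPS parts."""
    prefix, sep, code = geoid.partition("US")
    # After 'US' we expect 11 digits: 2 (state) + 3 (county) + 6 (tract)
    if not sep or len(code) != 11 or not code.isdigit():
        raise ValueError(f"unexpected tract GEOID: {geoid!r}")
    return {
        "State FIPS": int(code[:2]),
        "County FIPS": int(code[2:5]),
        "Census Tract": code[5:],  # kept as a string to preserve leading zeros
    }

# Illustrative GEOID (assumed format)
print(split_geoid("1400000US36005000100"))
```

Casting the state and county parts to `int` mirrors the `.astype(int)` calls used for the shapefile DataFrame, so the two sets of keys have the same type when they are merged.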

In [None]:
df_all = df_codes.merge(df_census_pop, on=['State FIPS', 'County FIPS'], how='inner')
#df_codes.dtypes
df_ny = df_ny_shapes.merge(df_all, on=['State FIPS', 'County FIPS'], how='inner')
df_ny
# This is probably not the DataFrame we want:
# its rows are at Census Tract granularity, but the shapefiles merged in
# are at county granularity, so each county's file name is repeated
# across all of that county's Census Tracts in NY.
# We want to load the file for each county we care about only ONCE,
# so filter this DataFrame to the rows we want and then take the
# "unique" list of file names (see the cell below this for an example filter).
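
The "load each county file only once" idea from the comments above can be sketched without pandas. The rows below are made-up placeholders, not real output from `df_ny`; in pandas, calling `drop_duplicates()` on the file name column would play the same role.

```python
# Hypothetical tract-level rows: (County FIPS, Census Tract, shapefile name).
# The county file name repeats once per tract, as in the merged DataFrame.
tract_rows = [
    (5, "000100", "tl_2020_36005_tract20.zip"),
    (5, "000200", "tl_2020_36005_tract20.zip"),
    (47, "000100", "tl_2020_36047_tract20.zip"),
    (5, "000400", "tl_2020_36005_tract20.zip"),
]

# dict.fromkeys keeps the first occurrence of each key and preserves order,
# which gives an ordered list of unique file names
unique_files = list(dict.fromkeys(path for _, _, path in tract_rows))

print(unique_files)
# → ['tl_2020_36005_tract20.zip', 'tl_2020_36047_tract20.zip']
```

Each name in `unique_files` could then be read once and added to the map as a single layer per county.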

In [None]:
# df_ny.loc[df_ny['Place'] == 'New York City']

## FAQ

**Choropleth**
A choropleth is a map made of different colored polygons, where each color represents a different quantity or range of quantities.

**Shapefile**
A standardized file format for storing the geospatial information needed to create maps, including the geometry (coordinates) and attributes of geographic features.

**TIGER**
TIGER, also referred to as MAF/TIGER, is the Census Bureau's geographic database system. The acronym MAF/TIGER stands for Master Address File/Topologically Integrated Geographic Encoding and Referencing.

**GeoJSON**
A format for storing a variety of geographic data, based on JSON. JSON is a text format for storing data; similar to a Python dictionary, it uses key-value pairs, and these objects can be nested.

**Which areas do the TIGER/Line shapefiles describe?**
Shapefiles are available for the fifty states, the District of Columbia, Puerto Rico, and the Island Areas (American Samoa, the Commonwealth of the Northern Mariana Islands, Guam, and the United States Virgin Islands).
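
To make the GeoJSON entry above more concrete, here is a small sketch that builds one GeoJSON Feature as nested Python dictionaries and round-trips it through the standard `json` module. The coordinates and property values are invented for illustration and do not describe a real Census Tract.

```python
import json

# One polygon feature; coordinates are [longitude, latitude] pairs.
# The outer ring must close, so the first and last positions are identical.
feature = {
    "type": "Feature",
    "properties": {"GEOID": "36005000100", "name": "Example tract"},  # made-up attributes
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [-73.90, 40.80],
            [-73.89, 40.80],
            [-73.89, 40.81],
            [-73.90, 40.81],
            [-73.90, 40.80],
        ]],
    },
}

# A FeatureCollection is the usual top-level container for many features
collection = {"type": "FeatureCollection", "features": [feature]}

text = json.dumps(collection)   # dictionary -> GeoJSON text
restored = json.loads(text)     # GeoJSON text -> dictionary
ring = restored["features"][0]["geometry"]["coordinates"][0]
print(restored["type"], len(ring))
```

The nesting mirrors the FAQ description: key-value pairs at every level, with the geometry and the attributes stored side by side inside each feature.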

