# Geographic Data

The Census Bureau provides geographic information in shapefile format. The goal of this notebook is to turn these shapefiles into two clean GEO DataFrames: one of NY State census tracts, complete with population density, and one of NY State places.

### Data sources

The files were downloaded from the Census Bureau FTP server with an FTP client; file locations are provided below.

*2020 Census Redistricting Data (P.L. 94-171) Shapefiles - census tract level, NY State*
* file location: ftp2.census.gov/geo/tiger/TIGER2020PL/LAYER/TRACT/2020
* file names beginning with "tl_2020_36" (NY State FIPS code is 36)

*2020 Census Redistricting Data (P.L. 94-171) Shapefiles - places level, NY State*
* file location: ftp2.census.gov/geo/tiger/TIGER2020PL/LAYER/PLACE/2020
* file name: tl_2020_36_place20.zip  (36 for NY State)

### Code to help manage our file paths
The code below is used in each notebook of this project to make it easier to refer to the folders where our various data files are stored.

In [1]:
import os, pathlib
base_dir = pathlib.Path(os.getcwd()).parent
data_archive_dir = os.path.join(base_dir, "data_archive")
clean_data_dir = os.path.join(data_archive_dir, "clean")
data_dir = os.path.join(base_dir, "data")
shapes_dir = os.path.join(data_dir,"shapes")
json_dir = os.path.join(data_dir,"geojson")

## NY State census tract geographic data

Let's load the NY State Census Tract shapefiles as a GEO DataFrame and call it *geodf_tract_ny*.

In [2]:
import warnings
import geopandas as gpd

# Full path to the zipped tract shapefile
big_ny_shapefile = os.path.join(shapes_dir, "tiger2020PL_NY_tracts/tl_2020_36_tract20.zip")
# Smaller Parquet copy that keeps only the columns we need
small_clean_shapefile = os.path.join(clean_data_dir, "tl_2020_36_tract20.parquet")

# Use the geopandas function read_file to load the shapefile, keeping only the columns we need
geodf_tract_ny = gpd.read_file(big_ny_shapefile)[['STATEFP20', 'COUNTYFP20', 'TRACTCE20', 'GEOID20', 'ALAND20', 'geometry']]

# Silence the geopandas warning about its initial Parquet implementation, then save the trimmed copy
warnings.filterwarnings('ignore', message='.*initial implementation of Parquet.*')
geodf_tract_ny.to_parquet(small_clean_shapefile, index=False, compression='BROTLI')

In [3]:
# Rename the columns
geodf_tract_ny = geodf_tract_ny.rename(columns={'STATEFP20': 'State FIPS', 'COUNTYFP20': 'County FIPS', 'TRACTCE20': 'Census Tract Code', 'GEOID20': 'GEOID Census Tract', 'ALAND20': 'Land Area'})

# Set the data types of each column as we want them to be
geodf_tract_ny = geodf_tract_ny.astype({'State FIPS': 'int', 'County FIPS':'int', 'Census Tract Code':'int', 'GEOID Census Tract': 'int', 'Land Area': 'int'})

geodf_tract_ny.shape # we want to see 5411 rows, since NY State had a total of 5411 census tracts for the 2020 census (https://www.census.gov/geographies/reference-files/time-series/geo/tallies.html#tract_bg_block)

(5411, 6)

In [4]:
geodf_tract_ny.head(3)

Unnamed: 0,State FIPS,County FIPS,Census Tract Code,GEOID Census Tract,Land Area,geometry
0,36,47,700,36047000700,176774,"POLYGON ((-74.00154 40.69279, -74.00132 40.693..."
1,36,47,900,36047000900,163469,"POLYGON ((-73.99405 40.69090, -73.99374 40.691..."
2,36,47,1100,36047001100,168507,"POLYGON ((-73.99073 40.69305, -73.99045 40.693..."
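
Because the FIPS columns are cast to integers, leading zeros are dropped: county "047" shows up as 47 and tract "000700" as 700. A tract GEOID is the 2-digit state code, 3-digit county code, and 6-digit tract code joined together, so the original zero-padded string can be rebuilt when needed. The sketch below illustrates this; the helper name `build_tract_geoid` is hypothetical and not part of the notebook.

```python
def build_tract_geoid(state_fips: int, county_fips: int, tract_code: int) -> str:
    """Rebuild the 11-character tract GEOID from its integer parts.

    Hypothetical helper: pads state to 2 digits, county to 3, tract to 6.
    """
    return f"{state_fips:02d}{county_fips:03d}{tract_code:06d}"


# First row of the table above: State FIPS 36, County FIPS 47, Census Tract Code 700
geoid = build_tract_geoid(36, 47, 700)
print(geoid)  # 36047000700
```

For New York the integer GEOID and the padded string have the same digits because the state code 36 has no leading zero; for a state such as Alabama (01) they would differ.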


As usual, we'll conduct our checks for correct data types and missing values.

In [5]:
geodf_tract_ny.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 5411 entries, 0 to 5410
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   State FIPS          5411 non-null   int64   
 1   County FIPS         5411 non-null   int64   
 2   Census Tract Code   5411 non-null   int64   
 3   GEOID Census Tract  5411 non-null   int64   
 4   Land Area           5411 non-null   int64   
 5   geometry            5411 non-null   geometry
dtypes: geometry(1), int64(5)
memory usage: 253.8 KB


In [6]:
# double check for null values
geodf_tract_ny[geodf_tract_ny.isna().any(axis=1)]

Unnamed: 0,State FIPS,County FIPS,Census Tract Code,GEOID Census Tract,Land Area,geometry


No null values, so onto the next step...

## Calculating population density for NY State census tracts

We're finally ready to use the population data we so carefully cleaned in our first notebook! We'll begin by merging *df_census_pop* into *geodf_tract_ny* on column "GEOID Census Tract". Then we can add a column to calculate population density.

In [7]:
import pandas as pd

# might be nice to use the parquet file here if we can...but seems tricky to load these as GEO DataFrames
# clean_census_pop_file_pq = os.path.join(clean_data_dir,'census_pop.parquet')

clean_census_pop_file = os.path.join(clean_data_dir,'census_pop.csv')
df_census_pop = pd.read_csv(clean_census_pop_file, low_memory=False, index_col=0)
# df_census_pop.head(3)

In [8]:
geodf_tract_ny = geodf_tract_ny.merge(df_census_pop, on=['GEOID Census Tract'])
geodf_tract_ny = geodf_tract_ny[['State FIPS', 'County FIPS', 'Census Tract Name', 'GEOID Census Tract', 'Population', 'Land Area', 'geometry']]
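
The merge above uses the pandas default (`how="inner"`), which silently drops any tract that has no match in the population table. One way to check for that, sketched here on toy data with plain pandas DataFrames rather than the real GEO DataFrame, is to do a left merge with `indicator=True` and `validate="one_to_one"`. The third tract is deliberately left out of the toy population table to show what an unmatched row looks like.

```python
import pandas as pd

# Toy stand-ins for geodf_tract_ny and df_census_pop (values are illustrative)
tracts = pd.DataFrame({
    "GEOID Census Tract": [36047000700, 36047000900, 36047001100],
    "Land Area": [176774, 163469, 168507],
})
population = pd.DataFrame({
    "GEOID Census Tract": [36047000700, 36047000900],
    "Population": [4415, 5167],
})

# how="left" keeps every tract; validate raises if a GEOID is duplicated on either side;
# indicator adds a "_merge" column saying where each row was found
checked = tracts.merge(
    population,
    on="GEOID Census Tract",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Tracts with no population record
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["GEOID Census Tract"].tolist())
```

If `unmatched` is empty, the inner merge and the left merge give the same rows, and the row count should still match the 5411 tracts loaded earlier.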

In [9]:
# add a column to calculate population density
# land area is measured in whole sq meters
# sq miles = sq meters / 2,589,988
# population density (people per sq mile) = population / (land area in sq meters / 2,589,988)
geodf_tract_ny['Population Density'] = (geodf_tract_ny['Population']/(geodf_tract_ny['Land Area']/2589988)).round(0)
geodf_tract_ny = geodf_tract_ny[['State FIPS', 'County FIPS', 'Census Tract Name', 'GEOID Census Tract', 'Population', 'Land Area', 'Population Density', 'geometry']]
geodf_tract_ny.head(3)

Unnamed: 0,State FIPS,County FIPS,Census Tract Name,GEOID Census Tract,Population,Land Area,Population Density,geometry
0,36,47,"Census Tract 7, Kings County, New York",36047000700,4415,176774,64686.0,"POLYGON ((-74.00154 40.69279, -74.00132 40.693..."
1,36,47,"Census Tract 9, Kings County, New York",36047000900,5167,163469,81865.0,"POLYGON ((-73.99405 40.69090, -73.99374 40.691..."
2,36,47,"Census Tract 11, Kings County, New York",36047001100,1578,168507,24254.0,"POLYGON ((-73.99073 40.69305, -73.99045 40.693..."
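
As a sanity check on the formula, the density for the first rows can be recomputed by hand from the Population and Land Area values shown above. This is a minimal sketch in plain Python; the function name `people_per_sq_mile` is made up for illustration.

```python
SQ_METERS_PER_SQ_MILE = 2_589_988  # same conversion factor used in the cell above


def people_per_sq_mile(population: int, land_area_sq_m: int) -> float:
    """Population divided by land area converted from sq meters to sq miles."""
    return population / (land_area_sq_m / SQ_METERS_PER_SQ_MILE)


# Census Tract 7, Kings County: 4415 people on 176774 sq meters
tract_7 = round(people_per_sq_mile(4415, 176774))
# Census Tract 9, Kings County: 5167 people on 163469 sq meters
tract_9 = round(people_per_sq_mile(5167, 163469))

print(tract_7, tract_9)  # 64686 81865
```

Both values agree with the "Population Density" column in the output above.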


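Our population density calculation produces a NaN value only when a tract has both zero population and zero land area (0/0), as can happen for tracts that are entirely water. A tract with zero population but nonzero land area already gets a density of 0. So, we'll replace any NaN "Population Density" with 0.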

In [10]:
# tracts with zero population and zero land area give 0/0 = NaN for Population Density,
# so let's replace the NaNs with 0s
geodf_tract_ny[geodf_tract_ny['Population'] == 0] # zero-population rows (not displayed, since this is not the last expression in the cell)
geodf_tract_ny['Population Density'] = geodf_tract_ny['Population Density'].fillna(0)
geodf_tract_ny['Population Density'] = geodf_tract_ny['Population Density'].astype(int) # change dtype to integer
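
To see which inputs actually produce NaN, here is a small sketch on toy data using the same density formula. It also covers a case the cell above does not handle: a tract with residents but zero land area would give `inf`, which `fillna(0)` leaves alone and `astype(int)` cannot convert. That last row is hypothetical, and mapping `inf` to 0 is an assumption made here for illustration, not something the notebook does.

```python
import numpy as np
import pandas as pd

SQ_METERS_PER_SQ_MILE = 2589988

# Toy rows: normal tract, empty tract with land, water-only tract, hypothetical edge case
toy = pd.DataFrame({
    "Population": [4415, 0, 0, 12],
    "Land Area": [176774, 250000, 0, 0],
})

density = toy["Population"] / (toy["Land Area"] / SQ_METERS_PER_SQ_MILE)
# row 0 -> about 64686, row 1 -> 0.0, row 2 -> NaN (0/0), row 3 -> inf (12/0)

# Assumption: treat both NaN and inf as "no meaningful density" and store 0
cleaned = (
    density.replace([np.inf, -np.inf], np.nan)
    .fillna(0)
    .round(0)
    .astype(int)
)
print(cleaned.tolist())  # [64686, 0, 0, 0]
```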

## NY State "places" geographic data

Let's load the NY State "Place" shapefiles as a GEO DataFrame and call it *geodf_place_ny*.

In [11]:
import geopandas as gpd 

# make GEO DataFrame from the NY State "places" shapefiles
shapefile_place_ny = os.path.join(shapes_dir,"tl_2020_36_place20.zip")
geodf_place_ny = gpd.read_file(shapefile_place_ny, encoding_errors='ignore')

# Select only the columns we want from the DataFrame 
geodf_place_ny = geodf_place_ny[['STATEFP20', 'PLACEFP20', 'GEOID20', 'NAME20', 'ALAND20', 'geometry']]

# rename the columns
geodf_place_ny = geodf_place_ny.rename(columns={'STATEFP20': 'State FIPS', 'PLACEFP20': 'Place FIPS', 'GEOID20': 'GEOID', 'NAME20': 'Name', 'ALAND20': 'Land Area'})

geodf_place_ny.head(3)

Unnamed: 0,State FIPS,Place FIPS,GEOID,Name,Land Area,geometry
0,36,1517,3601517,Altamont,3279144,"MULTIPOLYGON (((-74.02271 42.70420, -74.02231 ..."
1,36,56979,3656979,Peekskill,11251161,"POLYGON ((-73.95544 41.27786, -73.95204 41.280..."
2,36,18388,3618388,Cortland,10085376,"POLYGON ((-76.20049 42.61248, -76.19596 42.612..."


And, of course, we'll conduct our checks for missing values and correct datatypes.

In [12]:
geodf_place_ny.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 1293 entries, 0 to 1292
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   State FIPS  1293 non-null   object  
 1   Place FIPS  1293 non-null   object  
 2   GEOID       1293 non-null   object  
 3   Name        1293 non-null   object  
 4   Land Area   1293 non-null   int64   
 5   geometry    1293 non-null   geometry
dtypes: geometry(1), int64(1), object(4)
memory usage: 60.7+ KB


Our data types for State FIPS, Place FIPS, and GEOID should be integers. Let's fix that now.

In [13]:
# Set the data types of each column as we want them to be
geodf_place_ny['State FIPS'] = geodf_place_ny['State FIPS'].astype(int)
geodf_place_ny['Place FIPS'] = geodf_place_ny['Place FIPS'].astype(int)
geodf_place_ny['GEOID'] = geodf_place_ny['GEOID'].astype(int)
geodf_place_ny.dtypes

State FIPS       int64
Place FIPS       int64
GEOID            int64
Name            object
Land Area        int64
geometry      geometry
dtype: object

One final check for missing values.

In [14]:
# another check for null values
geodf_place_ny[geodf_place_ny.isna().any(axis=1)]

Unnamed: 0,State FIPS,Place FIPS,GEOID,Name,Land Area,geometry


All clear! We now have a clean GEO DataFrame of NY State census tracts with population density and a second with NY State places. Let's go ahead and save them as .csv files.

In [15]:
# save DF geodf_tract_ny as a CSV
clean_geodf_tract_ny_file = os.path.join(clean_data_dir, 'geodf_tract_ny.csv')
geodf_tract_ny.to_csv(clean_geodf_tract_ny_file)
geodf_tract_ny.head(3)

Unnamed: 0,State FIPS,County FIPS,Census Tract Name,GEOID Census Tract,Population,Land Area,Population Density,geometry
0,36,47,"Census Tract 7, Kings County, New York",36047000700,4415,176774,64686,"POLYGON ((-74.00154 40.69279, -74.00132 40.693..."
1,36,47,"Census Tract 9, Kings County, New York",36047000900,5167,163469,81865,"POLYGON ((-73.99405 40.69090, -73.99374 40.691..."
2,36,47,"Census Tract 11, Kings County, New York",36047001100,1578,168507,24254,"POLYGON ((-73.99073 40.69305, -73.99045 40.693..."


In [16]:
# save DF geodf_place_ny as a CSV
clean_geodf_place_ny_file = os.path.join(clean_data_dir, 'geodf_place_ny.csv')
geodf_place_ny.to_csv(clean_geodf_place_ny_file)
geodf_place_ny.head(3)

Unnamed: 0,State FIPS,Place FIPS,GEOID,Name,Land Area,geometry
0,36,1517,3601517,Altamont,3279144,"MULTIPOLYGON (((-74.02271 42.70420, -74.02231 ..."
1,36,56979,3656979,Peekskill,11251161,"POLYGON ((-73.95544 41.27786, -73.95204 41.280..."
2,36,18388,3618388,Cortland,10085376,"POLYGON ((-76.20049 42.61248, -76.19596 42.612..."
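
One thing to keep in mind when these CSV files are loaded in a later notebook: `to_csv` writes the geometry column as WKT text, so `pd.read_csv` returns plain strings rather than geometries, and the coordinate reference system is not stored. The sketch below shows one way to rebuild a GEO DataFrame, using an in-memory buffer and two made-up points instead of the real files. It assumes the coordinates are NAD83 (EPSG:4269), the CRS used by TIGER/Line shapefiles; verify against `geodf_place_ny.crs` before relying on it.

```python
import io

import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Toy GEO DataFrame standing in for geodf_place_ny (names and points are made up)
original = gpd.GeoDataFrame(
    {"Name": ["Place A", "Place B"]},
    geometry=[Point(-74.0, 40.7), Point(-73.9, 40.8)],
    crs="EPSG:4269",  # assumption: NAD83, as used by TIGER/Line files
)

# Write to CSV (an in-memory buffer here instead of a file on disk)
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# Reading back gives an ordinary DataFrame whose geometry column holds WKT strings
df = pd.read_csv(buffer, index_col=0)

# Parse the WKT strings and reattach the CRS, which the CSV does not carry
df["geometry"] = gpd.GeoSeries.from_wkt(df["geometry"])
restored = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4269")

print(type(restored).__name__, restored.geometry.iloc[0])
```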
