## Mapping New York City

Steps
1. Load the list of shapefiles for NY State as a DataFrame (df_ny_files)
2. Use df_codes_nyc to isolate just the NYC census tract shapefiles and save the result as df_codes_files_nyc
3. Make a GeoDataFrame with just the NYC geo info
4. Map the NYC shapes
5. Add population density

*GIS*
* **2020 Census Redistricting Data (P.L. 94-171) Shapefiles** - downloaded from the [ftp archive](https://www2.census.gov/geo/tiger/TIGER2020PL/LAYER/TRACT/2020/) via ftp client
* To download just the files for NY State, select only files whose names begin with "tl_2020_36"
* Data can also be downloaded through a browser, but this may result in formatting issues, so I recommend avoiding it if possible
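
The prefix rule above can be sketched in a few lines of Python. This is only an illustration of the filter, not the download procedure I used: `BASE_URL` is the archive address linked above, while `is_ny_tract_file` and the sample listing are hypothetical.

```python
# Base URL of the Census TIGER 2020 PL tract archive (from the link above)
BASE_URL = "https://www2.census.gov/geo/tiger/TIGER2020PL/LAYER/TRACT/2020/"

# New York's State FIPS code is 36, so its tract files start with this prefix
NY_PREFIX = "tl_2020_36"


def is_ny_tract_file(filename: str) -> bool:
    """Return True for zipped tract shapefiles that belong to NY State."""
    return filename.startswith(NY_PREFIX) and filename.endswith(".zip")


# Hypothetical directory listing: two NY counties and one file from another state
listing = [
    "tl_2020_36005_tract20.zip",
    "tl_2020_36047_tract20.zip",
    "tl_2020_34003_tract20.zip",
]

# Keep only the NY files and turn each name into a full URL
ny_urls = [BASE_URL + name for name in listing if is_ny_tract_file(name)]
```

The same check works on a local folder listing, which is handy for confirming that nothing from another state slipped into the download.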

In [1]:
# os module provides a variety of frequently used file system functions including path.join
# pathlib module makes it easier to manipulate folder and file paths with Python
# everything builds off base_dir, so if we move our code later, we'll only need to change base_dir
import os, pathlib
# base_dir - the immediate parent folder of this notebook
# we expect our data folders to be found here
base_dir = pathlib.Path(os.getcwd()).parent

# data_archive - we'll store compressed files here
# these will be preserved in git
data_archive_dir = os.path.join(base_dir, "data_archive")

# data_dir - large/numerous files will go here
# these will not be preserved in git!
# we'll only put files here that can be recreated with some python code (e.g. downloaded 
# or unpacked from data_archive, or generated from a DataFrame)
data_dir = os.path.join(base_dir, "data")

# shapes_dir - folders containing shapefiles go here
shapes_dir = os.path.join(data_dir,"shapes")

# json_dir - we'll store here GeoJSON we've generated and want to save for re-use 
json_dir = os.path.join(data_dir,"geojson")
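
Because data_dir is not preserved in git, these folders may not exist on a fresh clone. A minimal sketch of creating them on demand follows; it uses a temporary folder as a stand-in for base_dir (an assumption, so the example does not touch the real project) and pathlib's `/` operator in place of `os.path.join`.

```python
import pathlib
import tempfile

# Stand-in for base_dir so the sketch does not touch the real project folder
base_dir = pathlib.Path(tempfile.mkdtemp())

# Same layout as above, built with pathlib's "/" operator
data_dir = base_dir / "data"
shapes_dir = data_dir / "shapes"
json_dir = data_dir / "geojson"

# Create any folders that are missing; existing folders are left untouched
for folder in (shapes_dir, json_dir):
    folder.mkdir(parents=True, exist_ok=True)
```

`parents=True` also creates data_dir itself, so the two calls are enough for the whole tree.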

## Geographic Data

The Census Bureau provides this geographic information in the form of shapefiles. We'll convert it to GeoJSON for mapping.

In [2]:
# geopandas provides read_file, which we'll use later to load each shapefile
# for now we only build a list of the shapefile archives on disk
import geopandas as gpd

tract_shapefiles_dir = os.path.join(shapes_dir,"tiger2020PL_NY_tracts") # provide the full path for the directory containing our shapefiles
ny_shapefiles=[x for x in pathlib.Path(tract_shapefiles_dir).iterdir() if x.is_file()] # make a list of all the files in the directory with their full path

In [3]:
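# pandas handles our tabular data: the file list, the FIPS codes, and later the population data
# (the population data is stored compressed as tgz files to save time and space)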
import pandas as pd

# these options determine how much data is displayed in the notebook
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

In [4]:
# this code block loops through our 'ny_shapefiles' list, and creates a separate list of FIPS codes for county and state 
county_codes=[] # create an empty list for County FIPS
state_codes=[] # create an empty list for State FIPS
filtered_shapefiles=[]

for file in ny_shapefiles: 
    filename_parts = file.name.replace(".zip","").split("_") # take each filename - remove '.zip', split the remaining string wherever "_"  appears, and save it as a list
    if len(filename_parts) >=3: # take every 'filename_parts' list containing 3 or more elements (fewer than 3 parts indicates a file is extraneous and we don't want it)
        if len(filename_parts[2]) ==5: # take from each list the element at index 2 (position 3), but only if it contains 5 digits [State FIPS + County FIPS = 5 digits]
            # file.name -->  tl_2020_36013_tract20.zip
            # 36013 <--- filename_parts[2]
            # 013   <--- filename_parts[2][-3:]
            county_codes.append(filename_parts[2][-3:]) # take the last 3 digits of the element at index 2, County FIPS, and append it to our list
            state_codes.append(filename_parts[2][0:2])  # take the first 2 digits of the element at index 2, State FIPS, and append it to our list
            filtered_shapefiles.append(file)
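
The length checks in the loop above can also be expressed as a single pattern match. This is an alternative sketch, not what the notebook runs; `TRACT_FILE_PATTERN` and `parse_fips` are hypothetical names, and the pattern assumes every tract file follows the tl_2020_36013_tract20.zip shape shown in the comments.

```python
import re

# tl_<year>_<2-digit State FIPS><3-digit County FIPS>_tract<2 digits>.zip
TRACT_FILE_PATTERN = re.compile(r"^tl_\d{4}_(\d{2})(\d{3})_tract\d{2}\.zip$")


def parse_fips(filename):
    """Return (state_fips, county_fips) as strings, or None if the name does not match."""
    match = TRACT_FILE_PATTERN.match(filename)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Example: a county-level file and an extraneous file
county_file = parse_fips("tl_2020_36013_tract20.zip")
other_file = parse_fips("README.txt")
```

Keeping the codes as strings matters here: the County FIPS "013" would lose its leading zero if converted to an integer.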

In [5]:
# let's zip our 3 lists into one and call it 'files_list'
files_list = list(zip(state_codes, county_codes, filtered_shapefiles))

In [6]:

# now let's turn it into a DataFrame and rename the columns for consistency
df_ny_files = pd.DataFrame.from_records(files_list).rename({0: 'State FIPS', 1: 'County FIPS', 2: 'File name'}, axis=1)
df_ny_files

# let's check out the datatypes
df_ny_files.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   State FIPS   62 non-null     object
 1   County FIPS  62 non-null     object
 2   File name    62 non-null     object
dtypes: object(3)
memory usage: 1.6+ KB


In [7]:
df_ny_files.head()

Unnamed: 0,State FIPS,County FIPS,File name
0,36,25,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36025_tract20.zip
1,36,85,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36085_tract20.zip
2,36,45,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36045_tract20.zip
3,36,11,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36011_tract20.zip
4,36,17,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36017_tract20.zip


In [8]:
df_codes_nyc = pd.read_csv('codes_nyc.csv')
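# TODO: drop the leftover index column "Unnamed: 0"
# read_csv parses the FIPS columns as integers, so we convert them to strings below
# note: astype(str) does not restore leading zeros, so Bronx County comes out as '5' rather than '005'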
df_codes_nyc['State FIPS']=df_codes_nyc['State FIPS'].astype(str)
df_codes_nyc['Place FIPS']=df_codes_nyc['Place FIPS'].astype(str)
df_codes_nyc['County FIPS']=df_codes_nyc['County FIPS'].astype(str)
df_codes_nyc

Unnamed: 0.1,Unnamed: 0,State Name,State,State FIPS,Place,Place FIPS,County,County FIPS
0,25080,New York,NY,36,New York City,51000,Bronx County,5
1,25081,New York,NY,36,New York City,51000,Kings County,47
2,25082,New York,NY,36,New York City,51000,New York County,61
3,25083,New York,NY,36,New York City,51000,Queens County,81
4,25084,New York,NY,36,New York City,51000,Richmond County,85


In [9]:
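###### trying to make a df here with files for just NYC, including columns: State FIPS, County FIPS, Place, Place FIPS, File name
# note: how='right' keeps every row of df_ny_files, and 'County FIPS' is zero-padded there ('005') but not in df_codes_nyc ('5'),
# so no rows match and every column that comes from df_codes_nyc is NaN in the output below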
df_codes_files_nyc = pd.merge(df_codes_nyc, df_ny_files, on=('State FIPS', 'County FIPS'), how='right')
df_codes_files_nyc
# df_codes_files_nyc[['State FIPS', 'County FIPS', 'File name']]

Unnamed: 0.1,Unnamed: 0,State Name,State,State FIPS,Place,Place FIPS,County,County FIPS,File name
0,,,,36,,,,025,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36025_tract20.zip
1,,,,36,,,,085,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36085_tract20.zip
2,,,,36,,,,045,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36045_tract20.zip
3,,,,36,,,,011,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36011_tract20.zip
4,,,,36,,,,017,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36017_tract20.zip
...,...,...,...,...,...,...,...,...,...
57,,,,36,,,,071,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36071_tract20.zip
58,,,,36,,,,113,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36113_tract20.zip
59,,,,36,,,,117,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36117_tract20.zip
60,,,,36,,,,057,/home/julie/git/portfolio/data/shapes/tiger2020PL_NY_tracts/tl_2020_36057_tract20.zip
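
The all-NaN result above comes from the key mismatch ('85' in df_codes_nyc versus '085' in df_ny_files) combined with the right join. Here is a minimal sketch of one way to get NYC-only rows, using small made-up frames (`codes`, `files`) as stand-ins for df_codes_nyc and df_ny_files; the real frames have more columns.

```python
import pandas as pd

# Hypothetical stand-in for df_codes_nyc: FIPS codes read from CSV as integers
codes = pd.DataFrame({
    "State FIPS": [36, 36],
    "County": ["Bronx County", "Kings County"],
    "County FIPS": [5, 47],
})

# Hypothetical stand-in for df_ny_files: FIPS codes parsed from file names as strings
files = pd.DataFrame({
    "State FIPS": ["36", "36", "36"],
    "County FIPS": ["005", "047", "025"],
    "File name": [
        "tl_2020_36005_tract20.zip",
        "tl_2020_36047_tract20.zip",
        "tl_2020_36025_tract20.zip",
    ],
})

# Restore fixed-width FIPS strings: 2 digits for state, 3 for county
codes["State FIPS"] = codes["State FIPS"].astype(str).str.zfill(2)
codes["County FIPS"] = codes["County FIPS"].astype(str).str.zfill(3)

# An inner merge keeps only the counties that appear in both frames
nyc_files = pd.merge(codes, files, on=["State FIPS", "County FIPS"], how="inner")
```

With the real data this should leave five rows, one per borough, each with a usable File name.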


In [10]:
#### now I want to load just the shapefiles shown above (all NYC census tracts) as a GEO DataFrame

In [11]:
# I want this GEO DataFrame to just be NYC census tracts
# include just the columns we need and rename them for consistency
import geopandas as gpd

zipfile = os.path.join(shapes_dir, 'tiger2020PL_NY_tracts/tl_2020_36005_tract20.zip')
geo_df_nyc = gpd.read_file(zipfile)[['STATEFP20','COUNTYFP20', 'TRACTCE20',
                                 'GEOID20', 'ALAND20', 'geometry']].rename(
                                    {'STATEFP20': 'State FIPS','COUNTYFP20': 'County FIPS', 
                                     'TRACTCE20': 'Census Tract', 'GEOID20': 'GEOID', 'ALAND20': 'Land Area'}, axis=1
                                    )

geo_df_nyc.head(3)

### this is the longer way of writing the above code
#geo_df = gpd.read_file(zipfile)

# geo_df['State FIPS'] = geo_df['State FIPS'].astype(int)
# geo_df['County FIPS'] = geo_df['County FIPS'].astype(int)
# geo_df['Census Tract'] = geo_df['Census Tract'].astype(int)
# geo_df['GEOID'] = geo_df['GEOID'].astype(int)
# geo_df
# make a geo dataframe for each of the 5 counties
# concatenate into one big DF and map it?


Unnamed: 0,State FIPS,County FIPS,Census Tract,GEOID,Land Area,geometry
0,36,5,100,36005000100,1677210,"POLYGON ((-73.89772 40.79514, -73.89611 40.79692, -73.89250 40.80121, -73.87226 40.79499, -73.86712 40.79374, -73.87021 40.79091, -73.87030 40.79069, -73.87097 40.78906, -73.87095 40.78895, -73.87098 40.78888, -73.87096 40.78874, -73.87095 40.78861, -73.87095 40.78847, -73.87089 40.78834, -73.87085 40.78821, -73.87080 40.78806, -73.87079 40.78803, -73.87079 40.78800, -73.87079 40.78796, -73.87080 40.78793, -73.87081 40.78790, -73.87107 40.78757, -73.87122 40.78744, -73.87137 40.78727, -73.87147 40.78704, -73.87163 40.78680, -73.87188 40.78662, -73.87211 40.78647, -73.87226 40.78642, -73.87247 40.78630, -73.87268 40.78615, -73.87279 40.78608, -73.87287 40.78598, -73.87282 40.78596, -73.87277 40.78594, -73.87281 40.78590, -73.87285 40.78593, -73.87288 40.78596, -73.87298 40.78589, -73.87309 40.78585, -73.87320 40.78582, -73.87329 40.78581, -73.87336 40.78579, -73.87343 40.78576, -73.87351 40.78572, -73.87369 40.78575, -73.87386 40.78576, -73.87404 40.78578, -73.87421 40.78580, -73.87436 40.78581, -73.87460 40.78584, -73.87484 40.78586, -73.87508 40.78587, -73.87532 40.78588, -73.87556 40.78588, -73.87581 40.78588, -73.87605 40.78587, -73.87629 40.78586, -73.87653 40.78584, -73.87677 40.78581, -73.87697 40.78578, -73.87717 40.78574, -73.87737 40.78569, -73.87757 40.78564, -73.87776 40.78558, -73.87794 40.78551, -73.87813 40.78544, -73.87831 40.78536, -73.88293 40.78621, -73.88522 40.78663, -73.88530 40.78668, -73.88537 40.78667, -73.88577 40.78668, -73.88596 40.78675, -73.88613 40.78671, -73.88632 40.78673, -73.88659 40.78674, -73.88692 40.78673, -73.88715 40.78683, -73.88728 40.78681, -73.88770 40.78687, -73.88831 40.78701, -73.88856 40.78716, -73.88905 40.78737, -73.88940 40.78779, -73.88952 40.78806, -73.88961 40.78820, -73.88980 40.78865, -73.88987 40.78889, -73.88988 40.78901, -73.89013 40.78964, -73.89030 40.78988, -73.89057 40.79011, -73.89080 40.79021, -73.89092 40.79023, -73.89116 40.79017, -73.89140 40.79019, -73.89161 40.79016, -73.89177 40.79022, -73.89197 40.79024, -73.89210 40.79032, -73.89218 40.79039, -73.89987 40.79245, -73.89772 40.79514))"
1,36,5,200,36005000200,452832,"POLYGON ((-73.86648 40.80590, -73.86231 40.80992, -73.86222 40.81003, -73.86115 40.81142, -73.86164 40.81170, -73.86279 40.81232, -73.86337 40.81265, -73.86443 40.81327, -73.86463 40.81421, -73.86513 40.81414, -73.86522 40.81453, -73.86425 40.81466, -73.86333 40.81478, -73.86240 40.81491, -73.86148 40.81503, -73.86055 40.81515, -73.85972 40.81527, -73.85956 40.81511, -73.85838 40.81400, -73.85819 40.81381, -73.85772 40.81336, -73.85716 40.81282, -73.85613 40.81141, -73.85562 40.81061, -73.85546 40.80995, -73.85531 40.80933, -73.85529 40.80890, -73.85516 40.80832, -73.85498 40.80772, -73.85481 40.80726, -73.85460 40.80653, -73.85442 40.80578, -73.85586 40.80567, -73.85650 40.80568, -73.85649 40.80547, -73.85648 40.80541, -73.85653 40.80479, -73.85543 40.80451, -73.85526 40.80461, -73.85463 40.79972, -73.86297 40.79767, -73.86648 40.80590))"
2,36,5,400,36005000400,770689,"POLYGON ((-73.85960 40.81528, -73.85870 40.81540, -73.85778 40.81553, -73.85678 40.81566, -73.85552 40.81583, -73.85472 40.81594, -73.85362 40.81609, -73.85163 40.81635, -73.85115 40.81440, -73.85090 40.81434, -73.85012 40.81442, -73.84910 40.81455, -73.84811 40.81469, -73.84705 40.81483, -73.84663 40.81301, -73.84610 40.81308, -73.84647 40.81231, -73.84639 40.81162, -73.84313 40.81160, -73.84512 40.80630, -73.84665 40.80102, -73.85307 40.80011, -73.85463 40.79972, -73.85526 40.80461, -73.85543 40.80451, -73.85653 40.80479, -73.85648 40.80541, -73.85649 40.80547, -73.85650 40.80568, -73.85586 40.80567, -73.85442 40.80578, -73.85460 40.80653, -73.85481 40.80726, -73.85498 40.80772, -73.85516 40.80832, -73.85529 40.80890, -73.85531 40.80933, -73.85546 40.80995, -73.85562 40.81061, -73.85613 40.81141, -73.85716 40.81282, -73.85772 40.81336, -73.85819 40.81381, -73.85838 40.81400, -73.85956 40.81511, -73.85972 40.81527, -73.85960 40.81528))"


In [12]:
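#### now I want to load the map of NYC census tracts
# note: this cell currently fails with the DriverError shown below
# row[7] is a positional index that does not point at the 'File name' column (the last one),
# and its value is NaN after the merge above, so the path becomes "zip://nan"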

from keplergl import KeplerGl
ny_map = KeplerGl(height=600, show_docs=False)
for row in df_codes_files_nyc.itertuples():
    zipfile = f"zip://{row[7]}"
    ny_map.add_data(data=gpd.read_file(zipfile), name=row[5])
ny_map

DriverError: '/vsizip/nan' does not exist in the file system, and is not recognized as a supported dataset name.
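
The "nan" in the error message is what a missing value looks like once it is formatted into a string (the zip:// prefix appears to be translated into /vsizip/ before the file is opened). A small sketch that reproduces the bad path and guards against it follows; `zip_uri` and the sample values are hypothetical.

```python
import math


def zip_uri(path):
    """Build a zip:// URI for a shapefile archive, or return None if the path is missing."""
    # pandas marks missing values with float NaN, which formats as the string "nan"
    if path is None or (isinstance(path, float) and math.isnan(path)):
        return None
    return f"zip://{path}"


# Formatting NaN directly reproduces the path seen in the error message
broken = f"zip://{float('nan')}"

# Hypothetical file names: one valid path and one missing value
candidates = ["tl_2020_36005_tract20.zip", float("nan")]
uris = [uri for uri in map(zip_uri, candidates) if uri is not None]
```

In the loop itself, iterating over df_codes_files_nyc['File name'] directly avoids positional indexes altogether, so a reordered or added column cannot silently shift which value gets read.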

In [None]:
# now I want to load the population data for just NYC and add it to the map
# should I join this with geo_df_nyc?
import tarfile

# 32mb+ of census data saved in a 4.7mb archive
census_data_archive = os.path.join(data_archive_dir, "census_data_2022_03_01.tgz")

# This is the US Census file with population data we will extract
# this file is contained in the above tgz file
census_2020_file = "DECENNIALPL2020.P1_data_with_overlays_2021-12-02T121459.csv"

use_cols = [0, 1, 2]
col_names = ['GEOID', 'CENSUS TRACT NAME', 'POPULATION']

# This extracts a DataFrame from a tgz archived file
def extract_from_tgz(filename):
    with tarfile.open(filename) as tf:
        for file in tf.getmembers():
            if file.name == census_2020_file:
                data = tf.extractfile(file)
                return pd.read_csv(data, low_memory=False, skiprows=1, header=0, usecols=use_cols, names=col_names)

df_census_raw = extract_from_tgz(census_data_archive)


df_census_raw.head(5)
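
One thing to watch in extract_from_tgz: if the member name is not found, the function falls through and returns None, and the failure only surfaces later at df_census_raw.head(5). Below is a standard-library sketch of the same idea that raises instead. The archive is built in memory with made-up contents, and `read_member_rows`, "sample.csv", and the column names are hypothetical.

```python
import csv
import io
import tarfile

# Build a tiny in-memory .tgz with one CSV member (made-up values, two header rows)
csv_bytes = (
    b"GEO_ID,NAME,P1_001N\n"
    b"id,Geographic Area Name,Total\n"
    b"1400000US36005000100,Census Tract 1,100\n"
)
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tf:
    info = tarfile.TarInfo(name="sample.csv")
    info.size = len(csv_bytes)
    tf.addfile(info, io.BytesIO(csv_bytes))
buffer.seek(0)


def read_member_rows(fileobj, member_name):
    """Return the CSV rows of one archive member, or raise if it is absent."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tf:
        try:
            member = tf.extractfile(member_name)
        except KeyError:
            # extractfile looks the name up and raises KeyError when it is missing
            raise FileNotFoundError(f"{member_name} not found in archive")
        text = io.TextIOWrapper(member, encoding="utf-8")
        return list(csv.reader(text))


rows = read_member_rows(buffer, "sample.csv")
```

Looking the member up by name also removes the need to loop over every entry in the archive.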

In [None]:
# geo_df is only defined in commented-out code above, so inspect geo_df_nyc instead
geo_df_nyc.info()

In [None]:
# this is for later if i want to save the config file for my map
#ny_map.config

In [None]:
# state_fp = df_census_raw['GEOID'].str.slice(9,11).rename('State FIPS').astype(int)
# county_fp = df_census_raw['GEOID'].str.slice(11,14).rename('County FIPS').astype(int)
# df_census_pop = pd.concat([df_census_raw, state_fp, county_fp], axis=1).drop('GEOID', axis=1)
# df_census_pop
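
The commented-out slicing above converts the FIPS parts to int, which would drop leading zeros again and make them harder to match against the shapefile columns. Here is a sketch that keeps everything as strings and also shows the density calculation planned in step 5. It assumes the census GEOID values look like '1400000US36005000100' (consistent with the slice positions 9, 11, and 14 used above) and that Land Area (ALAND20) is in square metres. `split_geo_id` and `density_per_km2` are hypothetical helpers, and the population figure is made up.

```python
def split_geo_id(geo_id):
    """Split a census GEO_ID such as '1400000US36005000100' into FIPS parts (all strings)."""
    # Everything after "US" is assumed to match GEOID20 in the shapefile
    fips = geo_id.split("US", 1)[1]
    return {
        "GEOID": fips,
        "State FIPS": fips[0:2],
        "County FIPS": fips[2:5],
        "Census Tract": fips[5:],
    }


def density_per_km2(population, land_area_m2):
    """Population per square kilometre; land area is given in square metres."""
    if land_area_m2 <= 0:
        return None  # no land area to divide by
    return population / (land_area_m2 / 1_000_000)


parts = split_geo_id("1400000US36005000100")

# Land area taken from the first Bronx tract above; the population is a made-up value
density = density_per_km2(population=1_000, land_area_m2=1_677_210)
```

With a matching GEOID string on both sides, the population table could then be merged onto geo_df_nyc in the same way as the file list.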