## GEOID codes

*Geographic Identifier (GEOID)* is a general term for the unique numeric codes used to identify geographic entities. The Census Bureau uses Federal Information Processing Series (FIPS) codes to identify most of the geographies for which it tabulates data, such as states, counties, and "places". Because FIPS codes are maintained by the American National Standards Institute (ANSI), they may be referred to as "ANSI", "FIPS", or "ANSI/FIPS" codes. To avoid confusion, we'll refer to them here simply as "FIPS" codes. The exception is census tract codes, which are maintained by the Census Bureau itself, so we'll call them "Census Tract Codes."


### Why do we need GEOIDs?
Regardless of what we call them, GEOIDs are the critical piece of information that allows us to tie population data to geographic data. A unique GEOID can be derived for every piece of geography in the US by concatenating its FIPS code with those of the geographies in which it nests. The geography we'll be mapping, the *census tract*, nests within a *county*, which nests within a *state*. So, to derive a census tract's unique GEOID, we concatenate its STATE + COUNTY + TRACT codes.

For example, the GEOID for the census tract in Brooklyn where I currently sit writing this is:

```
STATE + COUNTY + TRACT
NY + Kings County + Tract #
36 + 047 + 016500
= 36047016500
```
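
The concatenation above can be sketched in a few lines of Python. This is only an illustration: the function name `build_tract_geoid` is hypothetical, and it assumes the 2-, 3-, and 6-digit widths shown in the example, with leading zeros restored where they were lost.

```python
def build_tract_geoid(state_fips, county_fips, tract_code):
    """Concatenate STATE + COUNTY + TRACT codes into an 11-character GEOID."""
    state = str(state_fips).zfill(2)    # e.g. 36     -> "36"
    county = str(county_fips).zfill(3)  # e.g. 47     -> "047"
    tract = str(tract_code).zfill(6)    # e.g. 16500  -> "016500"
    return state + county + tract


# the Brooklyn tract from the example above
print(build_tract_geoid(36, 47, 16500))  # 36047016500
```

Padding inside the function means it accepts codes whether they arrive as integers or as strings.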

### Which specific GEOIDs do we need?

A glance at the [FTP Archive](https://www2.census.gov/geo/tiger/TIGER2020PL/LAYER/TRACT/2020/) of shapefiles quickly reveals that the dataset required to map every census tract in the US is *enormous*. For this initial mapping exercise, we'll therefore narrow our scope to a smaller area of particular interest within the indoor farming industry: New York City. We will, however, write our code in such a way that we can easily add more locations when we're ready.

Given the [hierarchy by which census geographic entities are organized](https://www2.census.gov/geo/pdfs/reference/geodiagram.pdf), accessing New York City shapefiles at the census tract level is not quite as straightforward as one might expect. We've downloaded the full set of NY state shapefiles and will be joining several sets of GEOIDs to isolate the specific census tracts belonging to New York City. As mentioned above, we'll need state FIPS, county FIPS, and census tract codes. We've also got a fourth to add to our list: "Place". We need this last one because, interestingly, "City" isn't one of the categories within the hierarchy of census geographic entities; rather, the Census Bureau categorizes New York City, the area we'll be mapping, as a "Place".

We'll produce a clean DataFrame for each, containing just the info we want, then merge into one big DataFrame with the name and GEOIDs for every state, county, place, and census tract in the US. While our initial mapping exercise will be limited to NYC, we'll include all of the US in our codes DataFrame, as we'll need it for future mapping of additional areas.

### Data Sources

The list below details the data sources used for this section of the project. Since several routes can be taken to access the same census data, I've included the specific steps I followed to access them.

*GEOID Codes*

Files were downloaded from the census.gov FTP archive via an FTP client.
* State FIPS codes - ftp://ftp2.census.gov/geo/docs/reference/codes/state.txt
* County - ftp://ftp2.census.gov/geo/docs/reference/codes/national_county.txt
* Place - ftp://ftp2.census.gov/geo/docs/reference/codes/PLACElist.txt

In [1]:
# os module provides a variety of frequently used file system functions including path.join
# pathlib module makes it easier to manipulate folder and file paths with Python
# everything builds off base_dir, so if we move our code later, we'll only need to change base_dir
import os, pathlib
# base_dir - the immediate parent folder of this notebook
# we expect our data folders to be found here
base_dir = pathlib.Path(os.getcwd()).parent

# data_archive - we'll store compressed files here
# these will be preserved in git
data_archive_dir = os.path.join(base_dir, "data_archive")

# data_dir - large/numerous files will go here
# these will not be preserved in git!
# we'll only put files here that can be recreated with some python code (e.g. downloaded 
# or unpacked from data_archive, or generated from a DataFrame)
data_dir = os.path.join(base_dir, "data")

# shapes_dir - folders containing shapefiles go here
shapes_dir = os.path.join(data_dir,"shapes")

# json_dir - we'll store here GeoJSON we've generated and want to save for re-use 
json_dir = os.path.join(data_dir,"geojson")


### State Codes

Let's go ahead and load as a DataFrame the list of codes for the states and state equivalents we downloaded from the Census FTP Site and examine the first few rows. We'll include just the columns we want, rename them to better describe their contents, and reorder them as we like. 

In [2]:
import pandas as pd

# these options determine how much data is displayed in the notebook
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# this text file uses UTF-8 encoding

file_state = os.path.join(data_archive_dir, 'state.txt')

df_state = pd.read_csv(file_state, 
                       usecols=['STATE', 'STUSAB', 'STATE_NAME'], # use only these columns
                       delimiter="|", # fields in this file are pipe-delimited
                       encoding="iso-8859-1", # ASCII-range bytes decode identically under UTF-8 and ISO-8859-1
                       encoding_errors='ignore')[['STATE_NAME', 'STUSAB', 'STATE']] # reorder cols

df_state.rename(columns={'STATE': 'State FIPS', 'STUSAB': 'State', 'STATE_NAME': 'State Name'}, inplace=True) # rename columns
df_state.head()

Unnamed: 0,State Name,State,State FIPS
0,Alabama,AL,1
1,Alaska,AK,2
2,Arizona,AZ,4
3,Arkansas,AR,5
4,California,CA,6


In [3]:
df_state.shape

(57, 3)

In [4]:
df_state.tail()

Unnamed: 0,State Name,State,State FIPS
52,Guam,GU,66
53,Northern Mariana Islands,MP,69
54,Puerto Rico,PR,72
55,U.S. Minor Outlying Islands,UM,74
56,U.S. Virgin Islands,VI,78


In [5]:
df_state.dtypes

State Name    object
State         object
State FIPS     int64
dtype: object

Note: this list includes the United States (the 50 states and the District of Columbia), Puerto Rico, the Island Areas (American Samoa, Guam, the Commonwealth of the Northern Mariana Islands, and the United States Virgin Islands), and the U.S. Minor Outlying Islands. Our population data is limited to the United States and Puerto Rico; it does not include the Island Areas or the U.S. Minor Outlying Islands.

In [6]:
df_state['State FIPS'] = df_state['State FIPS'].astype(str)
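
One thing to keep in mind: because `State FIPS` was parsed as an integer, casting it to `str` yields values such as `"1"` rather than `"01"`. If zero-padded codes are needed later (for example, to build an 11-digit tract GEOID), two options are sketched below. The inline sample is made up to mimic the pipe-delimited layout of `state.txt`, and it assumes the file itself stores zero-padded codes.

```python
import io
import pandas as pd

# a tiny made-up sample in the same pipe-delimited layout (assumption)
sample = io.StringIO(
    "STATE|STUSAB|STATE_NAME\n"
    "01|AL|Alabama\n"
    "36|NY|New York\n"
)

# Option 1: read the code column as text so leading zeros are never dropped
df_sample = pd.read_csv(sample, delimiter="|", dtype={"STATE": str})
print(df_sample["STATE"].tolist())

# Option 2: restore the padding on codes that were already parsed as integers
padded = pd.Series([1, 36]).astype(str).str.zfill(2)
print(padded.tolist())
```

Whichever option is used, the same choice needs to be applied to every table that will be merged on these codes, so the keys stay comparable.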

Now we can do our usual search for data oddities, making sure there are no missing values or datatypes that don't make sense.

In [7]:
df_state.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57 entries, 0 to 56
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   State Name  57 non-null     object
 1   State       57 non-null     object
 2   State FIPS  57 non-null     object
dtypes: object(3)
memory usage: 1.5+ KB


<!-- ****DOUBLE CHECK - do we want State FIPS as integer or string? Population data has GEOIDs as strings. What about shapefiles? Let's leave it for now.****

Hmm. "State" and "State Name" are of type "object", which makes sense since our visual inspection makes it clear they are strings. State FIPS, however, is showing up as an integer type. This makes sense, but since our population data has GEOIDs formatted as strings, we'll follow its lead and change the dtype here, so we'll be able match on this column if needed in the future.  -->

According to `.info()`, our list has no null values. To be safe, let's double-check this using `.isna().sum()`.

In [8]:
df_state.isna().sum()

State Name    0
State         0
State FIPS    0
dtype: int64

Looks good. Let's move on to our next list, County Codes.

### County Codes

Now, let's load as a DataFrame the list of County FIPS codes we downloaded previously. Again, we'll include only the columns we need, reorder and rename them as we like, and examine the first few rows.

In [9]:
# this text file uses UTF-8 encoding 
file_national_county = os.path.join(data_archive_dir, 'national_county.txt')

df_county = pd.read_csv(file_national_county,
                        delimiter=",",
                        usecols=['State ANSI', 'County ANSI', 'County Name'], # use only these columns
                        encoding="iso-8859-1", # ASCII-range bytes decode identically under UTF-8 and ISO-8859-1
                        encoding_errors='ignore')[['State ANSI', 'County Name', 'County ANSI']] # reorder columns

# rename columns
df_county.columns = ['State FIPS', 'County', 'County FIPS']
df_county.head(3)

Unnamed: 0,State FIPS,County,County FIPS
0,1,Autauga County,1
1,1,Baldwin County,3
2,1,Barbour County,5


Let's proceed with our usual checks - making sure there are no missing values or data types that don't seem quite right.

In [10]:
df_county.tail(3)

Unnamed: 0,State FIPS,County,County FIPS
3232,78,St. Croix Island,10
3233,78,St. John Island,20
3234,78,St. Thomas Island,30


In [11]:
df_county['State FIPS'] = df_county['State FIPS'].astype(str)
df_county['County FIPS'] = df_county['County FIPS'].astype(str)
df_county.dtypes

State FIPS     object
County         object
County FIPS    object
dtype: object

In [12]:
df_county.isna().sum()

State FIPS     0
County         0
County FIPS    0
dtype: int64

Before we move on to places, let's add to *df_county* the "State Name" and "State" columns from our *df_state* DataFrame.

In [13]:
df_county = pd.merge(df_state, df_county, on=['State FIPS'], how='left') 
df_county.head()

Unnamed: 0,State Name,State,State FIPS,County,County FIPS
0,Alabama,AL,1,Autauga County,1
1,Alabama,AL,1,Baldwin County,3
2,Alabama,AL,1,Barbour County,5
3,Alabama,AL,1,Bibb County,7
4,Alabama,AL,1,Blount County,9


In [14]:
df_county.tail()

Unnamed: 0,State Name,State,State FIPS,County,County FIPS
3230,Puerto Rico,PR,72,Yauco Municipio,153
3231,U.S. Minor Outlying Islands,UM,74,Midway Islands,300
3232,U.S. Virgin Islands,VI,78,St. Croix Island,10
3233,U.S. Virgin Islands,VI,78,St. John Island,20
3234,U.S. Virgin Islands,VI,78,St. Thomas Island,30


In [15]:
df_county.shape

(3235, 5)
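
A merge like the one above can be given a built-in sanity check. The sketch below uses small made-up frames rather than the real `df_state` and `df_county`; `validate="one_to_many"` asks pandas to raise an error if a state code is duplicated on the left side, and the row-count comparison confirms that no county rows were lost or duplicated.

```python
import pandas as pd

# toy stand-ins for df_state and df_county (values are illustrative only)
states = pd.DataFrame({"State FIPS": ["1", "36"], "State": ["AL", "NY"]})
counties = pd.DataFrame({
    "State FIPS": ["1", "1", "36"],
    "County": ["Autauga County", "Baldwin County", "Kings County"],
    "County FIPS": ["1", "3", "47"],
})

# raises pandas.errors.MergeError if a State FIPS appears more than once in `states`
merged = pd.merge(states, counties, on="State FIPS", how="left",
                  validate="one_to_many")

# each county row should appear exactly once in the result
print(len(merged) == len(counties))
```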

<!-- The Tally page tells us there are 3,143 Counties & Equivalents in the 50 states and DC. (This does not include Puerto Rico and the Island Areas.) We can check if the number of rows in our DF looks right by selecting all states and excluding PR (72), American Samoa (60), Guam (66), Commonwealth of the Northern Mariana Islands (69), United States Virgin Islands (78) and comparing this to the Tally page count (3,143). -->

Looks good. Let's move on to our list of Place codes.

### Place Codes

As per usual, we'll load our list as a DataFrame, rename and reorder columns, and inspect the first few rows. 

In [16]:
# this text file uses ANSI encoding
file_place = os.path.join(data_archive_dir, 'PLACElist.txt')

df_place = pd.read_csv(file_place, 
                       delimiter="|", 
                       usecols=['STATE', 'STATEFP', 'PLACEFP', 'PLACENAME', 'COUNTY'], # use only these columns
                       encoding="iso-8859-1", # single-byte Latin-1 decoding; accented names such as "Colón" come through intact
                       encoding_errors='ignore')[['STATE','STATEFP', 'PLACENAME', 'PLACEFP', 'COUNTY']] # reorder columns

df_place.rename(columns={'STATE': 'State', 'STATEFP': 'State FIPS', 'PLACENAME': 'Place', 'PLACEFP': 'Place FIPS', 'COUNTY': 'County'}, inplace=True) # rename columns

df_place.head(3)  # display first 3 rows

Unnamed: 0,State,State FIPS,Place,Place FIPS,County
0,AL,1,Abanda CDP,100,Chambers County
1,AL,1,Abbeville city,124,Henry County
2,AL,1,Adamsville city,460,Jefferson County


Since we'll be mapping just New York City to start, let's convert the FIPS columns to strings and then see what the data for this "place" looks like.

In [17]:
df_place['State FIPS'] = df_place['State FIPS'].astype(str)
df_place['Place FIPS'] = df_place['Place FIPS'].astype(str)
df_place.dtypes

State         object
State FIPS    object
Place         object
Place FIPS    object
County        object
dtype: object

In [18]:
df_place.loc[df_place['Place'] == ("New York city")]

Unnamed: 0,State,State FIPS,Place,Place FIPS,County
24316,NY,36,New York city,51000,"Bronx County, Kings County, New York County, Queens County, Richmond County"


Oh no! We see from the above that the "County" column for the Place called "New York city" includes not one, but several counties. This breaks one of our cardinal rules of data tidiness! Let's fix this throughout our DataFrame, so that each county associated with a place appears in a separate row. To do this, we'll first extract our county column, save it as a Series called `srs_place_expanded`, and clean up the string.

In [19]:
# search for any rows containing a comma in the "County" column
df_place[df_place['County'].str.contains(',')]

Unnamed: 0,State,State FIPS,Place,Place FIPS,County
11,AL,1,Altoona town,1660,"Blount County, Etowah County"
15,AL,1,Arab city,2116,"Cullman County, Marshall County"
17,AL,1,Argo town,2320,"Jefferson County, St. Clair County"
48,AL,1,Birmingham city,7000,"Jefferson County, Shelby County"
53,AL,1,Boaz city,7912,"Etowah County, Marshall County"
...,...,...,...,...,...
41276,PR,72,La Fermina comunidad,40563,"Humacao Municipio, Las Piedras Municipio"
41309,PR,72,Mariano Colón comunidad,51055,"Coamo Municipio, Santa Isabel Municipio"
41317,PR,72,Monserrate comunidad,53979,"Vega Alta Municipio, Vega Baja Municipio"
41331,PR,72,Palmas comunidad,58516,"Arroyo Municipio, Patillas Municipio"


In [20]:
# split the "County" string wherever a comma appears (one column per county),
# then melt to long form so that each county lands in its own row;
# ignore_index=False preserves the original index, so we can use join to add it back to df_place
srs_place_expanded = (df_place['County']
                          .str.split(",", expand=True)
                          .melt(ignore_index=False)
                          .dropna() # drop the empty slots of places with fewer counties
                          .drop('variable', axis=1) # drop the column melt adds (labeled 'variable')
                          .rename(columns={'value': 'County'})
                     )

# strip the space left in front of each county name by the comma split
srs_place_expanded['County'] = srs_place_expanded['County'].str.strip()

# rejoin our series into our df
# adding rsuffix '_messy' as a reminder that this is the 'County' column we'll be dropping shortly
df_place_expanded = srs_place_expanded.join(df_place, rsuffix='_messy')

# now let's drop "County_messy"
df_place_expanded = df_place_expanded.drop(['County_messy'], axis=1)

# reorder the columns
df_place_expanded = df_place_expanded[['State', 'State FIPS', 'Place', 'Place FIPS', 'County']]
df_place_expanded

Unnamed: 0,State,State FIPS,Place,Place FIPS,County
0,AL,1,Abanda CDP,100,Chambers County
1,AL,1,Abbeville city,124,Henry County
2,AL,1,Adamsville city,460,Jefferson County
3,AL,1,Addison town,484,Winston County
4,AL,1,Akron town,676,Hale County
...,...,...,...,...,...
41409,PR,72,Vieques zona urbana,86014,Vieques Municipio
41410,PR,72,Villalba zona urbana,86831,Villalba Municipio
41411,PR,72,Yabucoa zona urbana,87863,Yabucoa Municipio
41412,PR,72,Yauco zona urbana,88035,Yauco Municipio


In [21]:
# let's look at New York City to make sure the melt worked and each county really is in a new row 
df_place_expanded.loc[df_place_expanded['Place'] == ("New York city")]
# df_place_expanded.loc[df_place_expanded['Place'].str.startswith("New York")]

Unnamed: 0,State,State FIPS,Place,Place FIPS,County
24316,NY,36,New York city,51000,Bronx County
24316,NY,36,New York city,51000,Kings County
24316,NY,36,New York city,51000,New York County
24316,NY,36,New York city,51000,Queens County
24316,NY,36,New York city,51000,Richmond County
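
For comparison, the same one-county-per-row reshaping can be sketched with `DataFrame.explode`, which keeps the original index on every expanded row. This is an alternative to the melt-and-join approach above, shown on a small made-up frame rather than the real `df_place`.

```python
import pandas as pd

# toy frame mimicking the shape of df_place (values are illustrative)
toy = pd.DataFrame({
    "Place": ["New York city", "Abbeville city"],
    "County": ["Bronx County, Kings County", "Henry County"],
})

# turn each comma-separated string into a list, then give each list item its own row
expanded = toy.assign(County=toy["County"].str.split(",")).explode("County")

# remove the space that followed each comma
expanded["County"] = expanded["County"].str.strip()
print(expanded)
```

Because the other columns travel along with each exploded row, no rejoin step is needed with this approach.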


Let's run through the usual data tidiness checks.

In [22]:
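# Let's check for any characters with diacritic marks
# note: 'Place FIPS' was cast to str above, so comparing it with the integer 3820
# matches no rows (hence the empty result); compare with '3820' to match string codes
df_place_expanded.loc[df_place_expanded['Place FIPS'] == 3820]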

Unnamed: 0,State,State FIPS,Place,Place FIPS,County
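
A more direct way to look for diacritics is to search for any character outside the ASCII range. This is a rough proxy, since it also flags other non-ASCII characters such as curly quotes, and the sample names below are taken from the output earlier in this section purely for illustration.

```python
import pandas as pd

names = pd.Series(["Mariano Colón comunidad", "Palmas comunidad"])

# True wherever the string contains at least one non-ASCII character
has_non_ascii = names.str.contains(r"[^\x00-\x7F]", regex=True)
print(names[has_non_ascii].tolist())
```

The same boolean mask could be applied to the `Place` or `County` column to list every affected row at once.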


In [23]:
import json
# Open a json file with mappings of characters with diacritics to their
# closest approximate in the english 26 letter alphabet
diacritic_mapping_file = os.path.join(data_dir,"util/diacritic_translate.json")
with open(diacritic_mapping_file, "r") as mapping_file:
    diac_char_map = json.load(mapping_file)

# Now we have a dict of diacritic char to english char
# but we need the integer ordinal of each to use with
# pandas Series.str.translate
    
# Create an empty dict
diac_ord_map = dict()

# Iterate over the entries in the character dict
for k,v in diac_char_map.items():
    # populate the ordinal dict with the ordinals
    # of each key => value character
    diac_ord_map[ord(k)] = ord(v)
    
# Use pandas Series.str.translate to translate all our County names
# with diacritics
df_place_expanded['County'] = df_place_expanded['County'].str.translate(diac_ord_map)
# note: 'Place FIPS' holds strings, so this integer comparison matches no rows
df_place_expanded.loc[df_place_expanded['Place FIPS'] == 3820]

Unnamed: 0,State,State FIPS,Place,Place FIPS,County
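
If maintaining a JSON mapping file becomes a burden, the standard library offers another route: decompose each character with `unicodedata.normalize` and drop the combining marks. The sketch below is an alternative to the mapping approach above, not a replacement for it; the helper name `strip_diacritics` is hypothetical, and letters that have no decomposition (such as "ø") pass through unchanged.

```python
import unicodedata


def strip_diacritics(text):
    """Return text with combining accent marks removed."""
    # NFKD splits an accented letter into its base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # keep everything except the combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Mariano Colón comunidad"))  # Mariano Colon comunidad
```

With pandas, this could be applied to a column via `Series.map(strip_diacritics)`.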


Adios, diacritics! Now let's probe a little using `.info()` to detect any oddities that might need our attention.

In [24]:
df_place_expanded.info() # do the dtypes make sense?

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42887 entries, 0 to 41413
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   State       42887 non-null  object
 1   State FIPS  42887 non-null  object
 2   Place       42887 non-null  object
 3   Place FIPS  42887 non-null  object
 4   County      42887 non-null  object
dtypes: object(5)
memory usage: 2.0+ MB


Looks like all of our dtypes make sense and, according to `.info()`, there are no missing values. To be safe, let's double check for missing values using `.isna().sum()`.

In [25]:
df_place_expanded.isna().sum()

State         0
State FIPS    0
Place         0
Place FIPS    0
County        0
dtype: int64

Looks good! No missing values. 

Now let's merge into one big list with State Name, State, State FIPS, County, County FIPS, Place, and Place FIPS. 

In [26]:
# strip any whitespace left around the county names by the comma split
df_place_expanded['County'] = df_place_expanded['County'].str.strip()
df_place_expanded.dtypes

State         object
State FIPS    object
Place         object
Place FIPS    object
County        object
dtype: object

In [27]:
# the FIPS codes were already converted to strings above (to be consistent with their format in the shapefiles); let's take another look
df_place_expanded.head()

Unnamed: 0,State,State FIPS,Place,Place FIPS,County
0,AL,1,Abanda CDP,100,Chambers County
1,AL,1,Abbeville city,124,Henry County
2,AL,1,Adamsville city,460,Jefferson County
3,AL,1,Addison town,484,Winston County
4,AL,1,Akron town,676,Hale County


In [28]:
df_codes = pd.merge(df_place_expanded, df_county, on=['State', 'State FIPS', 'County'], how='left')
df_codes = df_codes[['State Name', 'State', 'State FIPS', 'Place', 'Place FIPS', 'County', 'County FIPS']]
df_codes.head(3)

Unnamed: 0,State Name,State,State FIPS,Place,Place FIPS,County,County FIPS
0,Alabama,AL,1,Abanda CDP,100,Chambers County,17
1,Alabama,AL,1,Abbeville city,124,Henry County,67
2,Alabama,AL,1,Adamsville city,460,Jefferson County,73
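
Because this merge matches on county *names*, any spelling difference between the two lists would silently leave `County FIPS` empty for that row. A quick way to surface such rows is sketched below on made-up frames; "Example town" and "Nonexistent County" are invented to show what a failed match looks like.

```python
import pandas as pd

# toy stand-ins for df_place_expanded and df_county (values are illustrative)
places = pd.DataFrame({
    "State": ["NY", "NY"],
    "Place": ["New York city", "Example town"],
    "County": ["Kings County", "Nonexistent County"],
})
counties = pd.DataFrame({
    "State": ["NY"],
    "County": ["Kings County"],
    "County FIPS": ["47"],
})

merged = pd.merge(places, counties, on=["State", "County"], how="left")

# rows where the left merge found no matching county
missing = merged[merged["County FIPS"].isna()]
print(missing[["Place", "County"]])
```

An empty `missing` frame would indicate that every place found its county.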


It's been bothering us that the "city" in "New York city" is not capitalized. Let's change this throughout our DataFrame, so that every place name is in title case (note that this also turns suffixes such as "CDP" into "Cdp"). Since our main concern right now is New York City, let's create a subset of df_codes that contains just New York City and call it *df_codes_nyc*.

In [29]:
df_codes['Place'] = df_codes['Place'].str.title()
df_codes_nyc=df_codes[df_codes['Place']=='New York City']
df_codes_nyc

Unnamed: 0,State Name,State,State FIPS,Place,Place FIPS,County,County FIPS
25080,New York,NY,36,New York City,51000,Bronx County,5
25081,New York,NY,36,New York City,51000,Kings County,47
25082,New York,NY,36,New York City,51000,New York County,61
25083,New York,NY,36,New York City,51000,Queens County,81
25084,New York,NY,36,New York City,51000,Richmond County,85


Done. Above, we've got a clean DataFrame of the codes for New York City. Below, we save the clean DataFrame for the entire US, followed by the New York City subset, as .csv files.

In [30]:
df_codes.to_csv('codes.csv')
df_codes.head()

Unnamed: 0,State Name,State,State FIPS,Place,Place FIPS,County,County FIPS
0,Alabama,AL,1,Abanda Cdp,100,Chambers County,17
1,Alabama,AL,1,Abbeville City,124,Henry County,67
2,Alabama,AL,1,Adamsville City,460,Jefferson County,73
3,Alabama,AL,1,Addison Town,484,Winston County,133
4,Alabama,AL,1,Akron Town,676,Hale County,65


In [31]:
df_codes_nyc.to_csv('codes_nyc.csv')
df_codes_nyc.head()

Unnamed: 0,State Name,State,State FIPS,Place,Place FIPS,County,County FIPS
25080,New York,NY,36,New York City,51000,Bronx County,5
25081,New York,NY,36,New York City,51000,Kings County,47
25082,New York,NY,36,New York City,51000,New York County,61
25083,New York,NY,36,New York City,51000,Queens County,81
25084,New York,NY,36,New York City,51000,Richmond County,85
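
When these files are read back later, two defaults are worth knowing about: `to_csv` writes the index as an unnamed first column, and `read_csv` infers numeric types, which would turn code strings back into integers. The round trip below is a sketch on a made-up frame, using an in-memory buffer in place of a real file.

```python
import io
import pandas as pd

# made-up codes, zero-padded to show what type inference would destroy
df = pd.DataFrame({"State FIPS": ["01", "36"], "County FIPS": ["001", "047"]})

buffer = io.StringIO()           # stands in for a file on disk
df.to_csv(buffer, index=False)   # index=False: don't write the index column
buffer.seek(0)

restored = pd.read_csv(buffer, dtype=str)  # dtype=str: keep codes as text
print(restored)
```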
