## GEOID codes and why we need them

*Geographic Identifier (GEOID)* is a general term for the unique numeric codes used to identify geographic entities. Specifically, the Census Bureau uses Federal Information Processing Series (FIPS) codes to identify most of the geographies for which it tabulates data, such as states, counties, and "places". Because FIPS codes are maintained by the American National Standards Institute (ANSI), they are often referred to interchangeably as "ANSI", "FIPS", or "ANSI/FIPS" codes. To avoid confusion, we'll refer to them as "FIPS" codes. Because the codes for census tracts are maintained by the Census Bureau itself, rather than ANSI, we'll simply refer to them as "census tract" codes.

Regardless of what we call them, GEOIDs are the critical piece of information that allows us to tie population data to geographic data. A unique GEOID can be derived for every piece of geography in the US by concatenating its FIPS code with those of the geographies in which it nests. The geography we'll be mapping, the *census tract*, nests within a *county*, which nests within a *state*. So, to derive a census tract's unique GEOID we concatenate its: STATE FIPS + COUNTY FIPS + CENSUS TRACT codes.

For example, the GEOID for the census tract in Brooklyn where I sit writing this is:

```
STATE + COUNTY + TRACT
NY + Kings County + Tract #
36 + 047 + 016500
= 36047016500
```
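
The concatenation above can be sketched in Python. This is a minimal illustration rather than part of the notebook's pipeline: the function name `build_tract_geoid` is hypothetical, and the field widths (2, 3, and 6 digits) are taken from the example above.

```python
def build_tract_geoid(state_fips, county_fips, tract_code):
    """Zero-pad each code to its fixed width and concatenate them."""
    state = str(state_fips).zfill(2)     # state FIPS: 2 digits
    county = str(county_fips).zfill(3)   # county FIPS: 3 digits
    tract = str(tract_code).zfill(6)     # census tract: 6 digits
    return state + county + tract


# the Brooklyn tract from the example above
print(build_tract_geoid(36, 47, 16500))
```

Padding matters because a code stored as an integer (for example `47` for Kings County) loses its leading zeros, and the concatenated GEOID would then come out too short.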

### Which specific GEOIDs do we need here?

A glance at the [FTP Archive](https://www2.census.gov/geo/tiger/TIGER2020PL/LAYER/TRACT/2020/) of shapefiles for the census tract layer quickly reveals that the dataset required to map every census tract in the US is *enormous*. For this initial mapping exercise, we'll therefore narrow our scope to an area of particular interest within the indoor farming industry: New York State and, more specifically, New York City. We will, however, write our code so that we can easily add more locations when we're ready.

Given the [hierarchy according to which census geographic entities are organized](https://www2.census.gov/geo/pdfs/reference/geodiagram.pdf), accessing New York City geography data at the census tract level is not quite as straightforward as one might expect. We've therefore downloaded the full set of NY state files so we can map population density for the entire state, and will then use GEOIDs to draw a boundary around New York City. As mentioned above, we'll need state FIPS, county FIPS, and census tract codes. There's also a fourth to add to our list: "place". We need this last one because, interestingly, "city" isn't one of the categories within the hierarchy of census geographic entities; instead, the Census Bureau categorizes New York City, the area we'll be mapping, as a "place".

The goal of this section is to produce two clean DataFrames of US GEOIDs, containing just the columns we want. The first DataFrame will contain state and county FIPS codes; the second will contain place FIPS codes. While our initial mapping exercise will be limited to NYC, we'll include all US codes here, as we'll need them to map additional areas in the future. (Note: our population data in the previous notebook is limited to the United States and Puerto Rico. It does not include the other Island Areas, so we'll need to access that information separately when we're ready to map those areas.)

### Data Sources

The list below details the data sources used for this section of the project. Since several routes can be taken to access the same census data, I've included the specific steps I followed to access them.

**GEOID Codes:**

Files downloaded from Census Bureau FTP server via FTP client

*State FIPS codes*
* file location: ftp://ftp2.census.gov/geo/docs/reference/codes/state.txt

*County FIPS codes*
* file location: ftp://ftp2.census.gov/geo/docs/reference/codes/national_county.txt

*Place FIPS codes*
* file location: ftp://ftp2.census.gov/geo/docs/reference/codes/PLACElist.txt

In [1]:
# also included in the previous notebook, this code makes it easier to refer 
# to the folders where we'll store various data files for this project

import os, pathlib
base_dir = pathlib.Path(os.getcwd()).parent
data_archive_dir = os.path.join(base_dir, "data_archive")
clean_data_dir = os.path.join(data_archive_dir, "clean")
data_dir = os.path.join(base_dir, "data")
shapes_dir = os.path.join(data_dir,"shapes")
json_dir = os.path.join(data_dir,"geojson")

## State FIPS codes

Let's go ahead and load as a DataFrame the list of codes for the states and state equivalents and examine the first few rows. We'll include just the columns we want, rename them to better describe their contents, and reorder them as we like. 

In [2]:
import pandas as pd

# these options determine how much data is displayed in the notebook
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

file_state = os.path.join(data_archive_dir, 'state.txt')

df_state = pd.read_csv(file_state, 
                       usecols=['STATE', 'STUSAB', 'STATE_NAME'], # use only these columns
                       delimiter="|", # fields in this txt file are pipe-delimited
                       encoding="iso-8859-1", 
                       encoding_errors='ignore')[['STATE_NAME', 'STUSAB', 'STATE',]] # reorder cols

df_state.rename(columns={'STATE': 'State FIPS', 'STUSAB': 'State', 'STATE_NAME': 'State Name'}, inplace=True) # rename columns
df_state.head(3)

Unnamed: 0,State Name,State,State FIPS
0,Alabama,AL,1
1,Alaska,AK,2
2,Arizona,AZ,4


In [3]:
df_state.tail(3)

Unnamed: 0,State Name,State,State FIPS
54,Puerto Rico,PR,72
55,U.S. Minor Outlying Islands,UM,74
56,U.S. Virgin Islands,VI,78


In [4]:
df_state.shape

(57, 3)

Now we can do our usual search for data oddities, starting with a thorough search for missing values.

In [5]:
df_state.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57 entries, 0 to 56
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   State Name  57 non-null     object
 1   State       57 non-null     object
 2   State FIPS  57 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 1.5+ KB


<!-- ****DOUBLE CHECK - do we want State FIPS as integer or string? Population data has GEOIDs as strings. What about shapefiles? Let's leave it for now.****

Hmm. "State" and "State Name" are of type "object", which makes sense since our visual inspection makes it clear they are strings. State FIPS, however, is showing up as an integer type. This makes sense, but since our population data has GEOIDs formatted as strings, we'll follow its lead and change the dtype here, so we'll be able to match on this column if needed in the future. -->

According to `.info()`, our list has no null values. Let's double-check this using `.isna().sum()`.

In [6]:
df_state.isna().sum()

State Name    0
State         0
State FIPS    0
dtype: int64

Lastly, let's confirm that our datatypes make sense. "State Name" and "State" column values are strings, so we want to see dtype "object" when we run `.dtypes`. State FIPS codes are integers, so we want to see an integer datatype below.

In [7]:
df_state.dtypes

State Name    object
State         object
State FIPS     int64
dtype: object

Our dtypes look good, so we're ready to move on to our list of county FIPS codes.

## County FIPS codes

Let's load as a DataFrame our list of County FIPS codes. Again, we'll include only the columns we need, reorder and rename them, and visually inspect the first and last few rows of our DataFrame.

In [8]:
# this text file uses UTF-8 encoding 
file_national_county = os.path.join(data_archive_dir, 'national_county.txt')

df_county = pd.read_csv(file_national_county,
                        delimiter=",",
                        usecols=['State ANSI', 'County ANSI', 'County Name'], # use only these columns
                        encoding="utf-8",
                        encoding_errors='ignore')[['State ANSI', 'County Name', 'County ANSI']] # reorder columns

# rename columns
df_county.columns = ['State FIPS', 'County', 'County FIPS']
df_county.head(3)

Unnamed: 0,State FIPS,County,County FIPS
0,1,Autauga County,1
1,1,Baldwin County,3
2,1,Barbour County,5


In [9]:
df_county.tail(3)

Unnamed: 0,State FIPS,County,County FIPS
3232,78,St. Croix Island,10
3233,78,St. John Island,20
3234,78,St. Thomas Island,30


In [10]:
df_county.shape

(3235, 3)

Let's proceed with our usual checks, first making sure there are no missing values.

In [11]:
df_county.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3235 entries, 0 to 3234
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   State FIPS   3235 non-null   int64 
 1   County       3235 non-null   object
 2   County FIPS  3235 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 75.9+ KB


In [12]:
df_county.isna()

Unnamed: 0,State FIPS,County,County FIPS
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
...,...,...,...
3230,False,False,False
3231,False,False,False
3232,False,False,False
3233,False,False,False


In [13]:
df_county.isna().sum()

State FIPS     0
County         0
County FIPS    0
dtype: int64

In [14]:
df_county.dtypes

State FIPS      int64
County         object
County FIPS     int64
dtype: object

Let's add to *df_county* the "State Name" and "State" columns from our *df_state* DataFrame and save the result as a new DataFrame, *df_state_county*. We're including the full names because our data includes not just the states, but also the island territories, the 2-letter abbreviations for which may be less familiar.

In [15]:
df_state_county = pd.merge(df_state, df_county, on=['State FIPS'], how='left') 
df_state_county.head(3)

Unnamed: 0,State Name,State,State FIPS,County,County FIPS
0,Alabama,AL,1,Autauga County,1
1,Alabama,AL,1,Baldwin County,3
2,Alabama,AL,1,Barbour County,5
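
A left merge like this one silently keeps states that match no county (their county columns become NaN), so it can be worth checking the join explicitly. The sketch below uses toy data with a made-up state code (`99`, `ZZ`) rather than the real files; `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# toy stand-ins for df_state and df_county; code 99 / "ZZ" is hypothetical
toy_state = pd.DataFrame({"State FIPS": [1, 2, 99],
                          "State": ["AL", "AK", "ZZ"]})
toy_county = pd.DataFrame({"State FIPS": [1, 1, 2],
                           "County": ["Autauga County", "Baldwin County",
                                      "Aleutians East Borough"]})

merged = pd.merge(toy_state, toy_county, on="State FIPS", how="left",
                  validate="one_to_many",  # raises if a state code repeats on the left
                  indicator=True)          # adds a "_merge" column

# rows tagged "left_only" are states that matched no county
unmatched = merged.loc[merged["_merge"] == "left_only", "State"].tolist()
print(unmatched)
```

In the notebook itself, the merged shape equals the shape of *df_county*, which suggests every state code found at least one county.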


While we're pretty confident that our state names and abbreviations contain only standard ASCII characters, depending on the file format, it's possible that some county names include characters with diacritic marks. We want to ensure consistent treatment of these between files, which will become especially important if we need to perform merges on these columns at any point. So, let's check *df_state_county* for these now.

In [16]:
df_state_county.shape

(3235, 5)

In [17]:
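# this function flags values that contain non-ASCII characters
# (such as letters with diacritics); non-string values are flagged too

def is_non_ascii(value):
    if not isinstance(value, str):
        return True
    try:
        value.encode('ascii')
        return False
    except UnicodeEncodeError:
        return True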

In [18]:
# call our function 'is_non_ascii' to check for characters 
# with diacritics    

df_state_county.loc[ (df_state_county["State Name"].apply(is_non_ascii)) | (df_state_county["County"].apply(is_non_ascii)) ]

Unnamed: 0,State Name,State,State FIPS,County,County FIPS


Good! No diacritics found in *df_state_county*. Before moving on, let's take a final look at its first and last few rows.
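
As an aside, the same check can be written more compactly with the built-in `str.isascii()` method, assuming Python 3.7 or later. The sketch below runs on a small toy Series rather than the real DataFrame.

```python
import pandas as pd

names = pd.Series(["Autauga County", "Doña Ana County", None])

# True where the value is missing, not a string, or contains a non-ASCII character
flags = names.map(lambda v: not (isinstance(v, str) and v.isascii()))
print(flags.tolist())
```

Like the function above, this treats non-string values as suspicious, so missing entries surface in the same filter.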

In [19]:
df_state_county.head(3)

Unnamed: 0,State Name,State,State FIPS,County,County FIPS
0,Alabama,AL,1,Autauga County,1
1,Alabama,AL,1,Baldwin County,3
2,Alabama,AL,1,Barbour County,5


In [20]:
df_state_county.tail(3)

Unnamed: 0,State Name,State,State FIPS,County,County FIPS
3232,U.S. Virgin Islands,VI,78,St. Croix Island,10
3233,U.S. Virgin Islands,VI,78,St. John Island,20
3234,U.S. Virgin Islands,VI,78,St. Thomas Island,30


Looking good. After one last look at the shape and dtypes, we'll move on to our list of Place codes.

In [21]:
df_state_county.shape

(3235, 5)

In [22]:
df_state_county.dtypes

State Name     object
State          object
State FIPS      int64
County         object
County FIPS     int64
dtype: object
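
Because both FIPS columns are stored as integers, their leading zeros are gone (Alabama's `01` shows up as `1`). If we later need string GEOIDs to match against other datasets, one possible approach is to pad and concatenate the columns. This is a sketch on toy rows, and the column name `County GEOID` is hypothetical.

```python
import pandas as pd

# toy rows shaped like df_state_county (integer FIPS columns)
toy = pd.DataFrame({"State FIPS": [1, 36, 78],
                    "County FIPS": [1, 47, 10]})

# pad the state code to 2 digits and the county code to 3, then concatenate
toy["County GEOID"] = (toy["State FIPS"].astype(str).str.zfill(2)
                       + toy["County FIPS"].astype(str).str.zfill(3))
print(toy["County GEOID"].tolist())
```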

## Place FIPS Codes

As per usual, we'll load our list as a DataFrame, rename and reorder columns, and inspect the first few rows. 

In [23]:
# this text file uses ANSI encoding
file_place = os.path.join(data_archive_dir, 'PLACElist.txt')

df_place = pd.read_csv(file_place, 
                       delimiter="|", 
                       usecols=['STATE', 'STATEFP', 'PLACEFP', 'PLACENAME', 'COUNTY'], # use only these columns
                       encoding="iso-8859-1" # decode the file as Latin-1 (ISO-8859-1)
                       )[['STATE','STATEFP', 'PLACENAME', 'PLACEFP', 'COUNTY']] # reorder columns
df_place.rename(columns={'STATE': 'State', 'STATEFP': 'State FIPS', 'PLACENAME': 'Place', 'PLACEFP': 'Place FIPS', 'COUNTY': 'County'}, inplace=True) # rename columns
                         
df_place.head() 

Unnamed: 0,State,State FIPS,Place,Place FIPS,County
0,AL,1,Abanda CDP,100,Chambers County
1,AL,1,Abbeville city,124,Henry County
2,AL,1,Adamsville city,460,Jefferson County
3,AL,1,Addison town,484,Winston County
4,AL,1,Akron town,676,Hale County


In [24]:
df_place.tail()  

Unnamed: 0,State,State FIPS,Place,Place FIPS,County
41409,PR,72,Vieques zona urbana,86014,Vieques Municipio
41410,PR,72,Villalba zona urbana,86831,Villalba Municipio
41411,PR,72,Yabucoa zona urbana,87863,Yabucoa Municipio
41412,PR,72,Yauco zona urbana,88035,Yauco Municipio
41413,PR,72,Yaurel comunidad,88121,Arroyo Municipio


In [25]:
df_place.shape

(41414, 5)

In [26]:
df_place.dtypes

State         object
State FIPS     int64
Place         object
Place FIPS     int64
County        object
dtype: object

While all of our state names contain only standard ASCII characters, depending on how the file is formatted, it's possible that some place or county names include characters with diacritic marks. Let's take a look.

In [27]:
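# this flags values that contain non-ASCII characters
# (such as letters with diacritics); non-string values are flagged too

def is_non_ascii(value):
    if not isinstance(value, str):
        return True
    try:
        value.encode('ascii')
        return False
    except UnicodeEncodeError:
        return True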
    
df_place.loc[ (df_place["County"].apply(is_non_ascii)) | (df_place["Place"].apply(is_non_ascii)) ]

Unnamed: 0,State,State FIPS,Place,Place FIPS,County
2599,CA,6,La Cañada Flintridge city,39003,Los Angeles County
2982,CA,6,Piñon Hills CDP,57302,San Bernardino County
3506,CO,8,Cañon City city,11810,Fremont County
22545,NM,35,Anthony CDP,3820,Doña Ana County
22561,NM,35,Berino CDP,6830,Doña Ana County
...,...,...,...,...,...
41395,PR,72,Tallaboa comunidad,81413,Peñuelas Municipio
41396,PR,72,Tallaboa Alta comunidad,81456,Peñuelas Municipio
41399,PR,72,Tierras Nuevas Poniente comunidad,82187,Manatí Municipio
41405,PR,72,Vázquez comunidad,85111,Salinas Municipio


Looks like there are more than a few diacritics in *df_place*! Now that we've confirmed their presence, let's address them by mapping each one to its closest approximation in the 26-letter English alphabet.

In [28]:
import json

# open a json file of characters with diacritics mapped to their
# closest approximate in the English alphabet
# create a dictionary of these character mappings
diacritic_mapping_file = os.path.join(data_dir,"util/diacritic_translate.json")
with open(diacritic_mapping_file, "r") as mapping_file:
    diac_char_map = json.load(mapping_file)

# now we need the integer ordinal of each character to use with
# pandas Series.str.translate
# create an empty dictionary to store these
diac_ord_map = dict()

# iterate over the entries in the character mappings dictionary
for k,v in diac_char_map.items():
    # populate the ordinal dict with the ordinals
    # of each key => value character
    diac_ord_map[ord(k)] = ord(v)

In [29]:
# use pandas Series.str.translate to translate all our County and Place names
# with diacritics
df_place['County'] = df_place['County'].str.translate(diac_ord_map)
df_place['Place'] = df_place['Place'].str.translate(diac_ord_map)
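
For comparison, here is an alternative that does not rely on a mapping file: Unicode decomposition with the standard-library `unicodedata` module. This is a sketch of a different technique from the one used above, and the helper name `strip_diacritics` is hypothetical.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for name in ["Doña Ana County", "Peñuelas Municipio", "Manatí Municipio"]:
    print(strip_diacritics(name))
```

One limitation: letters without a Unicode decomposition, such as "ø" or "ß", pass through unchanged, which is one reason an explicit mapping file can be the safer choice.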

In [30]:
# let's confirm that our character replacement worked by checking 
# a row we know contained a diacritic in the "Place" column
df_place.loc[df_place['Place FIPS'] == 39003]

Unnamed: 0,State,State FIPS,Place,Place FIPS,County
2599,CA,6,La Canada Flintridge city,39003,Los Angeles County


In [31]:
# let's check another row we know contained 
# a diacritic in the "County" column
df_place.loc[df_place['Place FIPS'] == 81413]

Unnamed: 0,State,State FIPS,Place,Place FIPS,County
41395,PR,72,Tallaboa comunidad,81413,Penuelas Municipio


In [32]:
df_place.shape

(41414, 5)

In [33]:
df_place.dtypes

State         object
State FIPS     int64
Place         object
Place FIPS     int64
County        object
dtype: object

In [34]:
# df_place = pd.merge(df_place, df_state_county ) 
df_place = pd.merge(df_place, df_state) 
df_place = df_place[['State Name', 'State', 'State FIPS', 'Place', 'Place FIPS', 'County',]] # reorder columns
df_place.head()

Unnamed: 0,State Name,State,State FIPS,Place,Place FIPS,County
0,Alabama,AL,1,Abanda CDP,100,Chambers County
1,Alabama,AL,1,Abbeville city,124,Henry County
2,Alabama,AL,1,Adamsville city,460,Jefferson County
3,Alabama,AL,1,Addison town,484,Winston County
4,Alabama,AL,1,Akron town,676,Hale County


Our spot checks look good. Let's probe our data a bit more, running `.info()` to detect other pieces of data that might need some attention. 

In [35]:
df_place.dtypes

State Name    object
State         object
State FIPS     int64
Place         object
Place FIPS     int64
County        object
dtype: object

In [36]:
df_place.shape

(41414, 6)

In [37]:
df_place.info() # any missing values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41414 entries, 0 to 41413
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   State Name  41414 non-null  object
 1   State       41414 non-null  object
 2   State FIPS  41414 non-null  int64 
 3   Place       41414 non-null  object
 4   Place FIPS  41414 non-null  int64 
 5   County      41414 non-null  object
dtypes: int64(2), object(4)
memory usage: 2.2+ MB


In [38]:
df_place.isna().sum() # another check for missing values

State Name    0
State         0
State FIPS    0
Place         0
Place FIPS    0
County        0
dtype: int64

Looks good! No missing values. Now let's confirm that our datatypes make sense.

In [39]:
df_place.dtypes # do our datatypes make sense?

State Name    object
State         object
State FIPS     int64
Place         object
Place FIPS     int64
County        object
dtype: object

Since we'll be paying close attention to New York City in our initial mapping exercise, let's see what the geographic data for this "place" looks like.

In [40]:
df_place.loc[df_place['Place'].str.startswith("New York")]

Unnamed: 0,State Name,State,State FIPS,Place,Place FIPS,County
17317,Minnesota,MN,27,New York Mills city,46060,Otter Tail County
17318,Minnesota,MN,27,New York Mills city,46060,Otter Tail County
24316,New York,NY,36,New York city,51000,"Bronx County, Kings County, New York County, Queens County, Richmond County"
24317,New York,NY,36,New York Mills village,51011,Oneida County
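
The two Minnesota rows in this result appear to be identical, which hints that the place list may contain repeated rows. A minimal sketch for spotting and dropping exact duplicates, using toy rows that mirror the output above:

```python
import pandas as pd

# toy rows mirroring the search result above
toy_place = pd.DataFrame({
    "State": ["MN", "MN", "NY", "NY"],
    "Place": ["New York Mills city", "New York Mills city",
              "New York city", "New York Mills village"],
    "Place FIPS": [46060, 46060, 51000, 51011],
})

# keep=False marks every member of a duplicate group, not just the repeats
dupes = toy_place[toy_place.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first copy of each repeated row
deduped = toy_place.drop_duplicates().reset_index(drop=True)
print(len(deduped))
```

Whether such rows should be dropped from the real data depends on why they repeat, so it is worth inspecting them before removing anything.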


Strangely, the word "city" in the name "New York city" is not capitalized. Let's change this throughout our DataFrame, so that every place name is in titlecase. 

In [41]:
df_place['Place'] = df_place['Place'].str.title()
df_place.loc[df_place['Place'] == ("New York City")]

Unnamed: 0,State Name,State,State FIPS,Place,Place FIPS,County
24316,New York,NY,36,New York City,51000,"Bronx County, Kings County, New York County, Queens County, Richmond County"
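
One side effect to be aware of: `str.title()` also lowercases the remaining letters of each word, so an acronym such as "CDP" becomes "Cdp". If that matters, a possible alternative is to capitalize only the words written entirely in lowercase. The helper name below is hypothetical and the sketch is not part of the notebook's pipeline.

```python
def capitalize_lowercase_words(name):
    """Capitalize words written entirely in lowercase; leave others untouched."""
    return " ".join(word.capitalize() if word.islower() else word
                    for word in name.split(" "))

for place in ["New York city", "Abanda CDP", "Vieques zona urbana"]:
    print(place.title(), "|", capitalize_lowercase_words(place))
```

It could be applied to the whole column with `df_place['Place'].map(capitalize_lowercase_words)`.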


Done. We now have a clean nationwide DataFrame of State and County FIPS codes, as well as one of Place FIPS codes. Let's save both as .csv files, along with a Parquet copy of the place list.

In [42]:
clean_codes_file = os.path.join(clean_data_dir,'codes.csv')

df_state_county.to_csv(clean_codes_file)
df_state_county.head()

Unnamed: 0,State Name,State,State FIPS,County,County FIPS
0,Alabama,AL,1,Autauga County,1
1,Alabama,AL,1,Baldwin County,3
2,Alabama,AL,1,Barbour County,5
3,Alabama,AL,1,Bibb County,7
4,Alabama,AL,1,Blount County,9


In [43]:
clean_place_file = os.path.join(clean_data_dir,'place.csv')

df_place.to_csv(clean_place_file)
df_place.head()

Unnamed: 0,State Name,State,State FIPS,Place,Place FIPS,County
0,Alabama,AL,1,Abanda Cdp,100,Chambers County
1,Alabama,AL,1,Abbeville City,124,Henry County
2,Alabama,AL,1,Adamsville City,460,Jefferson County
3,Alabama,AL,1,Addison Town,484,Winston County
4,Alabama,AL,1,Akron Town,676,Hale County


In [44]:
clean_place_file_pq = os.path.join(clean_data_dir,'place.parquet')
df_place.to_parquet(clean_place_file_pq, compression='BROTLI')
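
A note on the CSV files: by default `to_csv` writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below shows a round trip without the index, using an in-memory buffer and toy data in place of the real files.

```python
import io
import pandas as pd

toy = pd.DataFrame({"State": ["AL", "NY"], "State FIPS": [1, 36]})

# index=False leaves the row index out of the file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# dtype=str on read keeps codes as text if they were stored zero-padded
restored = pd.read_csv(buffer, dtype={"State FIPS": str})
print(restored.columns.tolist())
```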