# What's the state of indoor agriculture in the US?

Having worked in food distribution for a decade, I probably spend a lot more time thinking about the logistics of food than most people do. When I eat, it's hard for me *not* to imagine where my food came from and how it reached me. Still, it wasn't until the pandemic reached New York City that the thought of losing access to fresh food began to seem like a real possibility. The combination of fear and free time is powerful. Overnight, my long-dormant interest in gardening reawakened. The ability to grow a few vegetables at home did provide a bit of comfort. It was really the knowledge, though, that food was being grown on a commercial scale only blocks from my home in this crowded city that set my mind closer to something like ease. I'd been won over to local, hydroponically grown greens years before based on quality alone and had happily paid a premium. As the pandemic gained momentum and my city locked down, what had once been a luxury - indoor farms - became for me a symbol of hope. Sure, the world was a mess, but we might just be able to dig ourselves out of it.

Once I became attuned, indoor farms seemed to begin cropping up left and right. Apparently, the industry was quite a bit larger than I'd realized. Looking to understand the landscape better, I decided that the question of *where* exactly these farms were located was a good place to start.

## Examining population density and the emergence of indoor farming
So, what kinds of places become home to indoor farms? There are a lot of ways to characterize a place, so I decided to narrow my initial scope to population density using data from the most recent US Census, 2020. Given that my own neighborhood can vary significantly in the span of a few blocks, I wanted the population data to be pretty granular and opted for the smallest geographic unit for which data was available. While census *blocks* are actually the smallest unit for which census data is gathered, 2020 data at this level was not yet available when I began this project, so I took what I could get - census *tracts*.

## Taming the data beast

The [US Decennial Census of Population and Housing](https://www.census.gov/programs-surveys/decennial-census/data/datasets.html) provides a rich source of raw data, but it can be challenging to decipher and synthesize into usable features. The "Data sources" section of my notebooks details the specific files I used and the steps I took to access them. The remaining paragraphs walk through the data cleaning and analysis techniques I applied towards my goal of contributing to the understanding of the indoor farming industry and food security overall.  

Steps: 

1. Population data - all census tracts in the US and Puerto Rico

    - clean and organize data
    - produce a tidy, ready-to-use DataFrame
    
2. GEOID codes - state, county, "place" code tables

    - clean and organize data
    - merge state and county into one DataFrame
    - create a separate DataFrame for places 
    
3. Geographic data - shapefiles for NY State at the census tract level
   - create a GEO DataFrame for NY State
   - merge with population data and calculate population density
   - use the GEOID codes DataFrame to isolate shapefiles for NYC
   - map NY State census tracts with population density in Kepler
   - add color based on population density
   - add a layer for places


4. Farm data - NYC indoor farms, addresses
    - produce a DataFrame of indoor farms in NYC
    - turn addresses into GEO info
    - plot in Kepler


The goal of this section is to produce a clean, tidy, ready-to-use DataFrame of population data for every census tract in the US and Puerto Rico. 

### Data Sources

The list below details the data sources used for this section of the project. Since several routes can be taken to access the same census data, I've included the specific steps I followed to access them. To save both time and space, I've stored the population data compressed as a tgz file.

**Population Data:**

*2020 Census: Redistricting File (Public Law 94-171)* - downloaded from Census.gov
* accessed via ["Table" tab](https://data.census.gov/cedsci/table?q=United%20States) on Census.gov
* filters selected: *Years: 2020 > Geography: All Census Tracts within United States > Topics: Populations and People*
* table name: "Decennial Census, P1 | RACE, 2020: DEC Redistricting Data (PL 94-171)" 

### Initial setup to make life easier
I've imported below the Python `os` and `pathlib` modules and included some additional code to make it easier to refer to the folders where we'll be storing the various data files for this project.

In [1]:
# os module provides a variety of frequently used file system functions including path.join
# pathlib module makes it easier to manipulate folder and file paths with Python
# everything builds off base_dir, so if we move our code later, we'll only need to change base_dir
import os, pathlib
# base_dir - the immediate parent folder of this notebook
# we expect our data folders to be found here
base_dir = pathlib.Path(os.getcwd()).parent

# data_archive - we'll store compressed files here
# these will be preserved in git
data_archive_dir = os.path.join(base_dir, "data_archive")

# clean_data_dir - cleaned, ready-to-use output files go here
clean_data_dir = os.path.join(data_archive_dir, "clean")

# data_dir - large/numerous files will go here
# these will not be preserved in git!
# we'll only put files here that can be recreated with some python code (e.g. downloaded 
# or unpacked from data_archive, or generated from a DataFrame)
data_dir = os.path.join(base_dir, "data")

# shapes_dir - folders containing shapefiles go here
shapes_dir = os.path.join(data_dir,"shapes")

# json_dir - we'll store here GeoJSON we've generated and want to save for re-use 
json_dir = os.path.join(data_dir,"geojson")
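
The cell above only builds the folder paths; it doesn't create anything on disk. Below is a minimal sketch of one way to make sure the folders exist before writing to them, assuming we're free to create them. The helper name `ensure_dirs` is hypothetical, and the demo runs in a temporary folder standing in for `base_dir`.

```python
import pathlib
import tempfile


def ensure_dirs(base_dir, relative_paths):
    """Create each folder under base_dir (if missing) and return the full paths."""
    created = []
    for rel in relative_paths:
        path = pathlib.Path(base_dir) / rel
        # parents=True builds intermediate folders; exist_ok=True makes reruns harmless
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# demo in a throwaway folder so nothing in the real project is touched
with tempfile.TemporaryDirectory() as tmp:
    dirs = ensure_dirs(tmp, ["data_archive/clean", "data/shapes", "data/geojson"])
    all_exist = all(d.is_dir() for d in dirs)
    print(all_exist)
```

In the notebook itself, the same call could take `base_dir` in place of the temporary folder.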


## Population data

Let's load our population data as a pandas DataFrame using the `extract_from_tgz` function included below. (Alternative code for opening regular non-compressed .csv files as a pandas DataFrame is included below.)

In [2]:
# population data is stored compressed as tgz files to save time and space
import tarfile
import pandas as pd

# these options determine how much data is displayed in the notebook
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# 32mb+ of census data saved in a 4.7mb archive
file_census_data_tgz = os.path.join(data_archive_dir, 'census_data_2022_03_01.tgz')

# this is the population data file we'll extract from the above tgz file
file_census_data_csv = 'DECENNIALPL2020.P1_data_with_overlays_2021-12-02T121459.csv'

# this function creates a DataFrame from our tgz archive file
def extract_from_tgz(filename):
    with tarfile.open(filename) as tf:
        for file in tf.getmembers():
            if file.name == file_census_data_csv:
                data = tf.extractfile(file)
                return pd.read_csv(data, low_memory=False, usecols=[0, 1, 2])

# now call the function to extract our tgz file and load it as a pandas DataFrame
df_census_pop = extract_from_tgz(file_census_data_tgz)

# alternative code for loading .csv files as a pandas DataFrame
# use the line of code below and ignore the extract_from_tgz function if working with a regular .csv file
# df_census_pop = pd.read_csv(filename, low_memory=False)

# 17mb of GIS data saved in a 2.4mb archive
# not sure we are using this file anymore...
# file_gis = os.path.join(data_archive_dir, '2020_Gaz_tracts_national.gz')
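
One thing to note about `extract_from_tgz`: if the expected file isn't in the archive, it quietly returns `None`. Here is a hedged sketch of a variant that takes the member name as an argument and fails loudly instead. The name `read_csv_from_tgz` and the tiny archive with made-up rows are assumptions for illustration only.

```python
import io
import os
import tarfile
import tempfile

import pandas as pd


def read_csv_from_tgz(archive_path, member_name, **read_csv_kwargs):
    """Load a single CSV member of a .tgz archive into a DataFrame."""
    with tarfile.open(archive_path) as tf:
        try:
            member = tf.getmember(member_name)
        except KeyError:
            raise FileNotFoundError(
                f"{member_name!r} not found in {archive_path}"
            ) from None
        # extractfile returns a file-like object that pandas can read directly
        return pd.read_csv(tf.extractfile(member), **read_csv_kwargs)


# build a tiny archive with made-up rows to exercise the helper
with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "toy.tgz")
    payload = b"GEO_ID,NAME,P1_001N\nA1,Tract one,10\nA2,Tract two,20\n"
    with tarfile.open(archive_path, "w:gz") as tf:
        info = tarfile.TarInfo(name="toy.csv")
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))

    df_toy = read_csv_from_tgz(archive_path, "toy.csv", usecols=[0, 1, 2])

    try:
        read_csv_from_tgz(archive_path, "missing.csv")
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True

print(df_toy.shape)
print(missing_raises)
```

Passing the member name in, rather than reading it from a global variable, also makes the function reusable for other archives.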


### Data inspection

Calling `.shape` tells us, before even setting eyes on it, that our DataFrame is enormous: 85,000+ rows! (The source table has 73 columns, but we loaded only the first three by passing `usecols=[0, 1, 2]` to `read_csv`.) The [Tallies list on Census.gov](https://www.census.gov/geographies/reference-files/time-series/geo/tallies.html#tract_bg_block) is a handy reference to quickly check that we haven't accidentally dropped any rows. The *2020 Census Tallies of Census Tracts, Block Groups & Blocks* table on that page indicates that there are 85,395 census tracts total for the US and Puerto Rico. This checks out with what `.shape` is telling us (we can subtract 1 from the total rows in our table, since the values in the first row are descriptions rather than data).

In [3]:
df_census_pop.shape

(85396, 3)

We can get a pretty good sense of the actual data by using `.head(n)` to view just the first few rows.

In [4]:
df_census_pop.head(3)

Unnamed: 0,GEO_ID,NAME,P1_001N
0,id,Geographic Area Name,!!Total:
1,1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775
2,1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055


### Data cleanup

It looks like the columns have been named using codes that aren't very meaningful to us. Fortunately, as mentioned above, the first row of the table includes descriptions of the data, so we'll replace the original column names with these. Let's also specify that the values in the first column are the full GEOIDs found in .csv files downloaded from data.census.gov. They include 9 characters not found in the shorter form GEOIDs used in Tiger/Line shapefiles, which we'll be looking at shortly. For reference, the census.gov site provides a detailed explanation of [how GEOIDs work](https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html).

In [5]:
df_census_pop = df_census_pop.drop(0)
df_census_pop.columns = ["GEOID Census Tract Full", "Census Tract Name", "Population"]
df_census_pop.head(2)

Unnamed: 0,GEOID Census Tract Full,Census Tract Name,Population
1,1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775
2,1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055
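
As an aside, the description row can also be handled at load time. The sketch below is an alternative to the approach above, not what this notebook does. It uses made-up rows in the same two-header-row layout and asks `read_csv` to skip the description row and apply our own column names in one step.

```python
import io

import pandas as pd

# made-up rows in the same two-header-row layout as the census download
raw = (
    "GEO_ID,NAME,P1_001N\n"
    "id,Geographic Area Name,!!Total:\n"
    "X1,Tract one,10\n"
    "X2,Tract two,20\n"
)

df_toy = pd.read_csv(
    io.StringIO(raw),
    header=0,  # line 0 holds the original code-style names, which `names` replaces
    names=["GEOID Census Tract Full", "Census Tract Name", "Population"],
    skiprows=[1],  # line 1 is the description row, so it never reaches the DataFrame
)
print(df_toy.dtypes)
```

Because the description row never enters the data, pandas can infer `Population` as an integer column right away. The index also starts at 0 here rather than 1.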


In [6]:
df_census_pop.tail()

Unnamed: 0,GEOID Census Tract Full,Census Tract Name,Population
85391,1400000US72153750501,"Census Tract 7505.01, Yauco Municipio, Puerto Rico",3968
85392,1400000US72153750502,"Census Tract 7505.02, Yauco Municipio, Puerto Rico",1845
85393,1400000US72153750503,"Census Tract 7505.03, Yauco Municipio, Puerto Rico",2155
85394,1400000US72153750601,"Census Tract 7506.01, Yauco Municipio, Puerto Rico",4368
85395,1400000US72153750602,"Census Tract 7506.02, Yauco Municipio, Puerto Rico",2587


Better. Now we can search for any data oddities that might need our attention, such as missing/null values. Let's run `.info()` and see what we find. 

In [7]:
df_census_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85395 entries, 1 to 85395
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   GEOID Census Tract Full  85395 non-null  object
 1   Census Tract Name        85395 non-null  object
 2   Population               85395 non-null  object
dtypes: object(3)
memory usage: 2.0+ MB


According to `.info()`, our columns are free of null values. We *really* want to make sure this is the case, so that we can avoid the annoying complications null values can bring once we start working with the data. Let's probe a bit more, using `.isna()` to return a boolean indicating whether the observation in each column is missing, `True`, or not, `False`. 

In [8]:
df_census_pop.isna()

Unnamed: 0,GEOID Census Tract Full,Census Tract Name,Population
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
...,...,...,...
85391,False,False,False
85392,False,False,False
85393,False,False,False
85394,False,False,False


All of the rows displayed so far are returning `False`, indicating that there are no missing values. So far, so good. As another check, let's use `.sum()` to get a count of missing values for our full dataset, in case any might be lurking deeper in the data and are just not visible to us in the current display.

In [9]:
df_census_pop.isna().sum()

GEOID Census Tract Full    0
Census Tract Name          0
Population                 0
dtype: int64

Good - still seeing 0 missing values. This time, let's tell pandas to actually *show* us any rows with missing values.

In [10]:
df_census_pop[df_census_pop.isna().any(axis=1)]

Unnamed: 0,GEOID Census Tract Full,Census Tract Name,Population


Again, nothing! So, it looks like we are in the clear - no missing values to address. On to the next step. 
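
Since our real data has nothing missing, the checks above all come back empty. To see what they look like when something *is* missing, here is a small sketch on a toy DataFrame with made-up values. One caveat worth keeping in mind: `.isna()` only flags true missing values such as `None` or `NaN`, so a blank string counts as present and needs its own check.

```python
import pandas as pd

# toy data: one missing name, one blank name, one missing population
df_toy = pd.DataFrame(
    {
        "Census Tract Name": ["Tract one", None, ""],
        "Population": ["10", "20", None],
    }
)

# count of missing values in each column
missing_per_column = df_toy.isna().sum()

# only the rows that contain at least one missing value
rows_with_missing = df_toy[df_toy.isna().any(axis=1)]

# blank strings are not "missing" to pandas, so count them separately
blank_count = int((df_toy == "").sum().sum())

print(missing_per_column)
print(rows_with_missing)
print(blank_count)
```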

We'll now call `.dtypes` to check whether the datatypes for each column make sense.

In [11]:
df_census_pop.dtypes

GEOID Census Tract Full    object
Census Tract Name          object
Population                 object
dtype: object

Based on our initial visual inspection using `.head()` we know that the values in both "GEOID Census Tract Full" and "Census Tract Name" are strings. Because pandas uses the "object" type for storing strings, the dtypes for "GEOID Census Tract Full" and "Census Tract Name" are also correctly reflected when we run `.dtypes`. Population, however, is an integer, but is showing up here as an object. Let's fix that by converting the column to an integer type with `.astype(int)`, then calling `.dtypes` again to confirm that it now shows `int64`.

In [12]:
df_census_pop['Population'] = df_census_pop['Population'].astype(int)
df_census_pop.dtypes

GEOID Census Tract Full    object
Census Tract Name          object
Population                  int64
dtype: object
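
The `.astype(int)` call works here because every remaining value is a clean string of digits; a single stray value would make it raise a `ValueError`. As a hedged sketch of a more defensive route, `pd.to_numeric` with `errors="coerce"` can surface any offenders first. The `"n/a"` entry below is made up for illustration.

```python
import pandas as pd

# two real-looking values plus one made-up bad entry
raw_population = pd.Series(["1775", "2055", "n/a"], name="Population")

# errors="coerce" turns anything unparseable into NaN instead of raising
as_numbers = pd.to_numeric(raw_population, errors="coerce")

# the original strings that failed to parse
bad_values = raw_population[as_numbers.isna()]
print(bad_values.tolist())

# once the bad rows are dealt with, the rest converts cleanly
clean_population = as_numbers.dropna().astype("int64")
print(clean_population.tolist())
```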

In preparation for the next section of our data journey, let's add a new column for "GEOID Census Tract" to our *df_census_pop* DataFrame. We'll be using this in future steps to match our population data with our GIS data. Because we know that the last 11 digits of "GEOID Census Tract Full" represent the census tract level, we can create the new "GEOID Census Tract" column using a lambda to grab just the digits we need.

In [13]:
df_census_pop['GEOID Census Tract'] = df_census_pop['GEOID Census Tract Full'].apply(lambda x: str(x)[-11:])
df_census_pop.head()
df_census_pop.dtypes

GEOID Census Tract Full    object
Census Tract Name          object
Population                  int64
GEOID Census Tract         object
dtype: object

In [14]:
df_census_pop.tail()

Unnamed: 0,GEOID Census Tract Full,Census Tract Name,Population,GEOID Census Tract
85391,1400000US72153750501,"Census Tract 7505.01, Yauco Municipio, Puerto Rico",3968,72153750501
85392,1400000US72153750502,"Census Tract 7505.02, Yauco Municipio, Puerto Rico",1845,72153750502
85393,1400000US72153750503,"Census Tract 7505.03, Yauco Municipio, Puerto Rico",2155,72153750503
85394,1400000US72153750601,"Census Tract 7506.01, Yauco Municipio, Puerto Rico",4368,72153750601
85395,1400000US72153750602,"Census Tract 7506.02, Yauco Municipio, Puerto Rico",2587,72153750602
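
To make the anatomy of these codes a bit more concrete, here is a small sketch that splits a full tract GEOID into its pieces: the 9-character prefix, then 2 digits for the state, 3 for the county, and 6 for the tract. The function name `split_tract_geoid` is hypothetical; the example value is the last row shown above.

```python
def split_tract_geoid(full_geoid):
    """Split a full tract GEOID such as '1400000US72153750602' into its parts."""
    prefix, sep, short = full_geoid.partition("US")
    if not sep or len(short) != 11 or not short.isdigit():
        raise ValueError(f"not a census tract GEOID: {full_geoid!r}")
    return {
        "prefix": prefix + sep,  # the 9 extra characters in the long form
        "state": short[:2],      # 2-digit state code
        "county": short[2:5],    # 3-digit county code
        "tract": short[5:],      # 6-digit tract code
        "geoid": short,          # the 11-digit short form used for matching
    }


parts = split_tract_geoid("1400000US72153750602")
print(parts)
```

For the column itself, pandas' `.str[-11:]` accessor would take the same slice as the lambda above.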


In [15]:
clean_census_pop_file = os.path.join(clean_data_dir,'census_pop.csv')
df_census_pop.to_csv(clean_census_pop_file)
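
A caution for whenever this CSV gets read back in: CSV files don't store dtypes, so pandas will guess them again on load. The sketch below uses two toy rows built from values shown above to illustrate how an all-digit GEOID loses its leading zero unless we tell `read_csv` to treat the column as text. It also passes `index=False`, which is a choice made for this sketch, not something the cell above does.

```python
import io

import pandas as pd

df_toy = pd.DataFrame(
    {
        "GEOID Census Tract": ["01001020100", "72153750602"],
        "Population": [1775, 2587],
    }
)

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_toy.to_csv(buffer, index=False)

# default parsing guesses types, so the all-digit GEOID becomes a number
buffer.seek(0)
guessed = pd.read_csv(buffer)

# declaring the column as text keeps the leading zero intact
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"GEOID Census Tract": str})

print(guessed["GEOID Census Tract"].tolist())
print(as_text["GEOID Census Tract"].tolist())
```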

### Save the DataFrame to Parquet

In [16]:
# Experiment - let's save as arrow/parquet and see if it's a smaller file with dtypes preserved
import pyarrow as pa
import pyarrow.parquet as pq

census_table = pa.Table.from_pandas(df_census_pop, preserve_index=False)

clean_census_pop_file_pq = os.path.join(clean_data_dir,'census_pop.parquet')
pq.write_table(census_table, clean_census_pop_file_pq, compression='BROTLI')
census_table.schema

GEOID Census Tract Full: string
Census Tract Name: string
Population: int64
GEOID Census Tract: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 607

There we have it - a clean, tidy, ready-to-use DataFrame of population data for every census tract in the US and Puerto Rico. 