# What's the state of indoor farming in the US?

Having worked in food distribution for a decade, I probably spend more time than most thinking about the logistics of food. When I eat, it's hard for me *not* to imagine where exactly my food came from and how it reached me. Still, it wasn't until the pandemic that I'd ever seriously considered the possibility of losing access to fresh food. The combination of fear and abundant free time is powerful. My long-dormant interest in gardening reawakened. I found a bit of comfort in the ability to grow a few vegetables at home. It was the knowledge that food was being grown right here in this crowded city, though, that gave me real comfort. Indoor farms became for me a symbol of hope. Sure, the world was a mess, but we might just be able to dig ourselves out of it.

Once I became attuned, it seemed like indoor farms were suddenly popping up left and right. Apparently, the industry was quite a bit larger than I'd realized. I wanted to understand the landscape better, and the question of where exactly these farms were located seemed like a good place to start.

## Investigating relationships between population density and the emergence of indoor farming
So, what kinds of places become home to indoor farms? There are a lot of ways to characterize a place, so I decided to start by narrowing my scope to population density.

The US Decennial Census of Population and Housing provides a rich source of raw data, but it is challenging to decipher and synthesize into usable features. To begin, I downloaded raw population data from the 2020 US Census (include URL / reference). Deciphering and translating the data required multiple steps. The notebook cells below outline the first steps of my journey applying data analysis techniques toward my goal of contributing to the understanding of indoor farming and food security overall.

*maybe consider looking at this link and adding some color about what is different about now* https://www.census.gov/library/stories/2021/08/united-states-adult-population-grew-faster-than-nations-total-population-from-2010-to-2020.html

### Joining US Census population data with GIS data to display population density

In [1]:
# data files are stored compressed to save time and space
import tarfile
import pandas as pd

# 32mb+ of census data saved in a 4.7mb archive
census_data_archive = "../data_archive/census_data_2022_03_01.tgz"

# 17mb of GIS data saved in a 2.4mb archive
gis_file = "2020_Gaz_tracts_national.gz"

# This is the US Census file with population data we will extract
census_2020_file = "DECENNIALPL2020.P1_data_with_overlays_2021-12-02T121459.csv"

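# Extract one named CSV member from a tgz archive into a DataFrame
def extract_from_tgz(filename, member_name):
    with tarfile.open(filename) as tf:
        for member in tf.getmembers():
            if member.name == member_name:
                data = tf.extractfile(member)
                return pd.read_csv(data, low_memory=False)
    # Fail loudly instead of silently returning None when the member is missing
    raise FileNotFoundError(f"{member_name} not found in {filename}")

df_census = extract_from_tgz(census_data_archive, census_2020_file)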

# change some options that determine how much data is displayed in the notebook
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
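
Before extracting, it can help to confirm what an archive actually contains. The sketch below is only an illustration of the same `tarfile` plus `pandas` pattern: it builds a tiny gzip-compressed tar archive in memory (the member name `example.csv` and its contents are made up) and reads one named member into a DataFrame.

```python
import io
import tarfile

import pandas as pd


def read_csv_member(archive, member_name):
    """Read one named CSV member from a gzip-compressed tar file object."""
    with tarfile.open(fileobj=archive, mode="r:gz") as tf:
        names = tf.getnames()
        if member_name not in names:
            raise FileNotFoundError(f"{member_name!r} not in archive: {names}")
        # read_csv consumes the member fully before the archive is closed
        return pd.read_csv(tf.extractfile(member_name))


# Build a small archive in memory (hypothetical data, not the census file).
csv_bytes = b"GEO_ID,P1_001N\nA,10\nB,20\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tf:
    info = tarfile.TarInfo(name="example.csv")
    info.size = len(csv_bytes)
    tf.addfile(info, io.BytesIO(csv_bytes))
buffer.seek(0)

df_example = read_csv_member(buffer, "example.csv")
print(df_example)
```

Raising an error when the member is missing makes a mistyped file name obvious right away, rather than surfacing later as a confusing `None`.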

Using .shape, we can tell before even setting eyes on it that our DataFrame is going to be enormous - 73 columns and 85,000+ rows!

In [2]:
df_census.shape

(85396, 73)

We can get a pretty good sense of the data just by viewing the first few rows of our DataFrame. So, let's use .head() to limit the display to the first 5 rows.

In [3]:
df_census.head()

Unnamed: 0,GEO_ID,NAME,P1_001N,P1_002N,P1_003N,P1_004N,P1_005N,P1_006N,P1_007N,P1_008N,...,P1_062N,P1_063N,P1_064N,P1_065N,P1_066N,P1_067N,P1_068N,P1_069N,P1_070N,P1_071N
0,id,Geographic Area Name,!!Total:,!!Total:!!Population of one race:,!!Total:!!Population of one race:!!White alone,!!Total:!!Population of one race:!!Black or African American alone,!!Total:!!Population of one race:!!American Indian and Alaska Native alone,!!Total:!!Population of one race:!!Asian alone,!!Total:!!Population of one race:!!Native Hawaiian and Other Pacific Islander alone,!!Total:!!Population of one race:!!Some Other Race alone,...,!!Total:!!Population of two or more races:!!Population of four races:!!American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,!!Total:!!Population of two or more races:!!Population of five races:,!!Total:!!Population of two or more races:!!Population of five races:!!White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander,!!Total:!!Population of two or more races:!!Population of five races:!!White; Black or African American; American Indian and Alaska Native; Asian; Some Other Race,!!Total:!!Population of two or more races:!!Population of five races:!!White; Black or African American; American Indian and Alaska Native; Native Hawaiian and Other Pacific Islander; Some Other Race,!!Total:!!Population of two or more races:!!Population of five races:!!White; Black or African American; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,!!Total:!!Population of two or more races:!!Population of five races:!!White; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,!!Total:!!Population of two or more races:!!Population of five races:!!Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,!!Total:!!Population of two or more races:!!Population of six races:,!!Total:!!Population of two or more races:!!Population of six races:!!White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race
1,1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775,1653,1389,213,5,8,3,35,...,0,0,0,0,0,0,0,0,0,0
2,1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055,1984,842,1104,2,12,4,20,...,0,0,0,0,0,0,0,0,0,0
3,1400000US01001020300,"Census Tract 203, Autauga County, Alabama",3216,3039,2244,714,12,14,6,49,...,0,0,0,0,0,0,0,0,0,0
4,1400000US01001020400,"Census Tract 204, Autauga County, Alabama",4246,3993,3578,327,17,32,1,38,...,0,0,0,0,0,0,0,0,0,0


The population data columns have been named using Census Bureau codes that aren't very meaningful to us. In a moment, we'll rename them using the descriptions provided in the first row (index 0). But first, let's use regex to remove all of those distracting exclamation points! 

In [4]:
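# Assign the cleaned row back explicitly: calling replace(..., inplace=True) on
# df_census.iloc[0] can end up modifying a temporary copy instead of df_census
df_census.iloc[0] = df_census.iloc[0].str.replace('!!', ' ', regex=True).str.lstrip()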
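
To see what this cleanup does in isolation, here is a minimal sketch on a handful of labels written in the same "!!"-separated style. The Series `labels` is a stand-in built for illustration, not the real description row.

```python
import pandas as pd

# Stand-in labels in the same "!!"-separated style as the description row.
labels = pd.Series([
    "id",
    "!!Total:",
    "!!Total:!!Population of one race:",
])

# Replace each "!!" with a space, then drop the leading space this leaves behind.
cleaned = labels.str.replace("!!", " ", regex=True).str.lstrip()
print(cleaned.tolist())
```

Labels without any "!!" (like `id`) pass through unchanged, so the same expression can safely be applied to the whole row.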

Now we can go ahead and replace the column names with the cleaned up descriptions.

In [5]:
df_census.columns = df_census.iloc[0]
df_census = df_census.drop(0)
df_census

Unnamed: 0,id,Geographic Area Name,Total:,Total: Population of one race:,Total: Population of one race: White alone,Total: Population of one race: Black or African American alone,Total: Population of one race: American Indian and Alaska Native alone,Total: Population of one race: Asian alone,Total: Population of one race: Native Hawaiian and Other Pacific Islander alone,Total: Population of one race: Some Other Race alone,...,Total: Population of two or more races: Population of four races: American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total: Population of two or more races: Population of five races:,Total: Population of two or more races: Population of five races: White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander,Total: Population of two or more races: Population of five races: White; Black or African American; American Indian and Alaska Native; Asian; Some Other Race,Total: Population of two or more races: Population of five races: White; Black or African American; American Indian and Alaska Native; Native Hawaiian and Other Pacific Islander; Some Other Race,Total: Population of two or more races: Population of five races: White; Black or African American; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total: Population of two or more races: Population of five races: White; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total: Population of two or more races: Population of five races: Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total: Population of two or more races: Population of six races:,Total: Population of two or more races: Population of six races: White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race
1,1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775,1653,1389,213,5,8,3,35,...,0,0,0,0,0,0,0,0,0,0
2,1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055,1984,842,1104,2,12,4,20,...,0,0,0,0,0,0,0,0,0,0
3,1400000US01001020300,"Census Tract 203, Autauga County, Alabama",3216,3039,2244,714,12,14,6,49,...,0,0,0,0,0,0,0,0,0,0
4,1400000US01001020400,"Census Tract 204, Autauga County, Alabama",4246,3993,3578,327,17,32,1,38,...,0,0,0,0,0,0,0,0,0,0
5,1400000US01001020501,"Census Tract 205.01, Autauga County, Alabama",4322,4055,3241,632,29,93,2,58,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85391,1400000US72153750501,"Census Tract 7505.01, Yauco Municipio, Puerto Rico",3968,1967,707,180,35,0,0,1045,...,0,0,0,0,0,0,0,0,0,0
85392,1400000US72153750502,"Census Tract 7505.02, Yauco Municipio, Puerto Rico",1845,915,380,48,10,6,0,471,...,0,0,0,0,0,0,0,0,0,0
85393,1400000US72153750503,"Census Tract 7505.03, Yauco Municipio, Puerto Rico",2155,1152,492,76,7,0,1,576,...,0,0,0,0,0,0,0,0,0,0
85394,1400000US72153750601,"Census Tract 7506.01, Yauco Municipio, Puerto Rico",4368,2221,782,204,23,1,0,1211,...,0,0,0,0,0,0,0,0,0,0
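
As an aside, a file with this two-row header layout can also be handled at load time. The sketch below is an alternative under stated assumptions, not what this notebook does: it uses a small in-memory CSV that mimics the layout (code row, description row, then data) and passes `header=1` to `pd.read_csv` so the description row becomes the column names directly.

```python
import io

import pandas as pd

# A tiny stand-in for the census CSV: codes on line 0, descriptions on line 1.
raw = io.StringIO(
    "GEO_ID,NAME,P1_001N\n"
    "id,Geographic Area Name,!!Total:\n"
    '1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775\n'
    '1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055\n'
)

# header=1 uses the second line as column names and skips the line above it.
df_alt = pd.read_csv(raw, header=1)
df_alt.columns = df_alt.columns.str.replace("!!", " ", regex=False).str.strip()

print(df_alt.dtypes)
```

One side effect of this approach is that the description row never lands in the data, so pandas can infer a numeric type for the population column on its own.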


Now that the names for each column describe their contents, we're able to see that we really only need the first column of population data. We can "throw away" the other 70 columns! So, let's create a new, much smaller DataFrame with just the columns we need. We'll call it "df_census_pop".

In [6]:
# .copy() gives us an independent DataFrame, so later in-place changes are safe
df_census_pop = df_census[['id', 'Geographic Area Name', 'Total:']].copy()
#df_census = None
df_census_pop

Unnamed: 0,id,Geographic Area Name,Total:
1,1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775
2,1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055
3,1400000US01001020300,"Census Tract 203, Autauga County, Alabama",3216
4,1400000US01001020400,"Census Tract 204, Autauga County, Alabama",4246
5,1400000US01001020501,"Census Tract 205.01, Autauga County, Alabama",4322
...,...,...,...
85391,1400000US72153750501,"Census Tract 7505.01, Yauco Municipio, Puerto Rico",3968
85392,1400000US72153750502,"Census Tract 7505.02, Yauco Municipio, Puerto Rico",1845
85393,1400000US72153750503,"Census Tract 7505.03, Yauco Municipio, Puerto Rico",2155
85394,1400000US72153750601,"Census Tract 7506.01, Yauco Municipio, Puerto Rico",4368


There's still room to improve our column names. Let's make them more descriptive.

In [7]:
df_census_pop.rename(columns={
    'id': 'GEOID Full',
    'Geographic Area Name': 'Census Tract Name',
    'Total:': 'Population'},
    inplace=True)
df_census_pop.head()

Unnamed: 0,GEOID Full,Census Tract Name,Population
1,1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775
2,1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055
3,1400000US01001020300,"Census Tract 203, Autauga County, Alabama",3216
4,1400000US01001020400,"Census Tract 204, Autauga County, Alabama",4246
5,1400000US01001020501,"Census Tract 205.01, Autauga County, Alabama",4322


Let's run .info() on our new DataFrame to see what data oddities might need some cleanup.

In [8]:
df_census_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85395 entries, 1 to 85395
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   GEOID Full         85395 non-null  object
 1   Census Tract Name  85395 non-null  object
 2   Population         85395 non-null  object
dtypes: object(3)
memory usage: 2.0+ MB


According to .info(), our columns are free of null values. This sounds a little suspicious... Let's probe a bit more to see if this is actually the case.

We can start by using isna() to identify possible missing values. This will give us a boolean indicating if the observation in that column is missing (True) or not (False). 

In [9]:
df_census_pop.isna()

Unnamed: 0,GEOID Full,Census Tract Name,Population
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
...,...,...,...
85391,False,False,False
85392,False,False,False
85393,False,False,False
85394,False,False,False


All of the rows displayed so far are returning False (no missing values). So far, so good. Let's use .sum() to get a count for our full dataset and make sure there aren't any missing values we just haven't seen yet. 

In [10]:
df_census_pop.isna().sum()

0
GEOID Full           0
Census Tract Name    0
Population           0
dtype: int64

Good - still seeing 0 missing values. But, let's check another way to be extra sure. This time, we'll use pandas to display any rows containing missing values. 

In [11]:
df_census_pop[df_census_pop.isna().any(axis=1)]

Unnamed: 0,GEOID Full,Census Tract Name,Population


Looks like we are in the clear, no missing values!
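
One caveat: `isna()` only flags true nulls. In a column of dtype object, placeholder text would not count as missing. A hedged sketch of an extra check follows, using a made-up Series (the values "N/A" and "unknown" are hypothetical, not something observed in the census file) and `pd.to_numeric` with `errors="coerce"` to surface entries that cannot be parsed as numbers.

```python
import pandas as pd

# Hypothetical object column mixing numeric strings with placeholder text.
population = pd.Series(["1775", "2055", "N/A", "unknown", "3216"], dtype=object)

# Placeholder strings are not counted as missing values.
print(population.isna().sum())

# Coerce to numbers; anything that cannot be parsed becomes NaN.
as_numbers = pd.to_numeric(population, errors="coerce")
suspect = population[as_numbers.isna()]
print(suspect.tolist())
```

If `suspect` comes back empty on the real column, the later cast to integers should go through cleanly.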

In [12]:
df_census_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85395 entries, 1 to 85395
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   GEOID Full         85395 non-null  object
 1   Census Tract Name  85395 non-null  object
 2   Population         85395 non-null  object
dtypes: object(3)
memory usage: 2.0+ MB


We've learned from .info() that all of our columns are of datatype object. Based on our visual inspection of the first few rows of data, we know that the values in both the GEOID Full and Census Tract Name columns are actually strings, while the values in the Population column are really integers. Let's change this. (pandas typically reports string columns as object, so only Population should show a new dtype afterwards.)

In [13]:
df_census_pop['GEOID Full'] = df_census_pop['GEOID Full'].astype(str)
df_census_pop['Census Tract Name'] = df_census_pop['Census Tract Name'].astype(str)
df_census_pop['Population'] = df_census_pop['Population'].astype(int)
df_census_pop.dtypes

0
GEOID Full           object
Census Tract Name    object
Population            int32
dtype: object
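
Because `object` can hold any Python value, the dtype alone does not prove a column contains only strings. A minimal sketch of a stricter check is below; the helper name `all_strings` and both example Series are assumptions made for illustration.

```python
import pandas as pd


def all_strings(series):
    """Return True only if every value in the Series is a Python str."""
    return bool(series.map(lambda value: isinstance(value, str)).all())


# Both Series have dtype object, but only the first holds strings throughout.
ids = pd.Series(["1400000US01001020100", "1400000US01001020200"], dtype=object)
mixed = pd.Series(["1400000US01001020100", 1775], dtype=object)

print(all_strings(ids))
print(all_strings(mixed))
```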

Now we're ready to load our second dataset into a DataFrame. Let's call our DataFrame "df_gaz." We'll be matching this up shortly with our df_census_pop DataFrame, so we can calculate population density for each census tract.

Since we took a peek at the data before loading it and have also spent HOURS on the Census website, we know that the GEOIDs we are looking at here are for the census tract level. So, we're going to rename our columns to make this very clear.

(Per the Census.gov website, the GEOID Structure for a census tract is: STATE+COUNTY+TRACT | Digits: 2+3+6=11 | Example GeoArea: Census Tract 2231 in Harris County, TX | Example GEOID: 48201223100)
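
The 2+3+6 structure described above can be sketched as a small helper that splits a tract-level GEOID into its parts. The function name `split_tract_geoid` is made up for this illustration.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into state, county, and tract codes."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected 11 digits, got {geoid!r}")
    return {
        "state": geoid[:2],    # 2 digits
        "county": geoid[2:5],  # 3 digits
        "tract": geoid[5:],    # 6 digits
    }


# Example GEOID from the Census.gov description: Census Tract 2231, Harris County, TX
parts = split_tract_geoid("48201223100")
print(parts)
```

Keeping the pieces as strings matters here, since state codes such as Alabama's "01" would lose their leading zero if converted to integers.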

In [14]:
file_gaz = "../data_archive/2020_Gaz_tracts_national.gz"
df_gaz = pd.read_csv(file_gaz, delimiter='\t', compression='gzip', dtype={'GEOID': str})
df_gaz.rename(columns={'GEOID' : 'GEOID Tract'}, inplace=True)
df_gaz

Unnamed: 0,USPS,GEOID Tract,ALAND,AWATER,ALAND_SQMI,AWATER_SQMI,INTPTLAT,INTPTLONG
0,AL,01001020100,9825304,28435,3.794,0.011,32.481973,-86.491565
1,AL,01001020200,3320818,5669,1.282,0.002,32.475758,-86.472468
2,AL,01001020300,5349271,9054,2.065,0.003,32.474024,-86.459703
3,AL,01001020400,6384282,8408,2.465,0.003,32.471030,-86.444835
4,AL,01001020501,6203654,0,2.395,0.000,32.447861,-86.422558
...,...,...,...,...,...,...,...,...
85390,PR,72153750501,1820185,0,0.703,0.000,18.031211,-66.867347
85391,PR,72153750502,689931,0,0.266,0.000,18.024746,-66.860442
85392,PR,72153750503,3298433,1952,1.274,0.001,18.023148,-66.876603
85393,PR,72153750601,10985103,4527,4.241,0.002,18.017809,-66.839070
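
One thing worth checking after loading a tab-delimited file like this is whether any column names carry stray whitespace, since a padded name will not match when you select it later. I have not confirmed that the file used here has this problem, so treat the padded `INTPTLONG` header in the sketch below as an assumption; the data row is copied from the first row shown above.

```python
import io

import pandas as pd

# Stand-in for a tab-delimited file whose last header has trailing spaces.
raw = io.StringIO(
    "USPS\tGEOID\tALAND_SQMI\tINTPTLONG   \n"
    "AL\t01001020100\t3.794\t-86.491565\n"
)
df_demo = pd.read_csv(raw, delimiter="\t", dtype={"GEOID": str})
print(list(df_demo.columns))  # the padded name is preserved as read

# Strip whitespace from every column name in one step.
df_demo.columns = df_demo.columns.str.strip()
print(list(df_demo.columns))
```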


Let's check .dtypes on our new DataFrame to make sure the columns we'll be using are of the correct types. We want "GEOID Tract" to hold strings and "ALAND_SQMI" to hold floating point numbers.

In [15]:
df_gaz.dtypes

USPS                                                                                                                                       object
GEOID Tract                                                                                                                                object
ALAND                                                                                                                                       int64
AWATER                                                                                                                                      int64
ALAND_SQMI                                                                                                                                float64
AWATER_SQMI                                                                                                                               float64
INTPTLAT                                                                                                                    

Good. "ALAND_SQMI" is of type float64, which is what we hoped to see. "GEOID Tract" is of type object, but this *probably* means it is a string, which is what we want. To make sure, let's use .astype() to cast it as a string and run .info() again.

In [16]:
df_gaz['GEOID Tract'] = df_gaz['GEOID Tract'].astype(str)
df_gaz.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85395 entries, 0 to 85394
Data columns (total 8 columns):
 #   Column                                                                                                                                  Non-Null Count  Dtype  
---  ------                                                                                                                                  --------------  -----  
 0   USPS                                                                                                                                    85395 non-null  object 
 1   GEOID Tract                                                                                                                             85395 non-null  object 
 2   ALAND                                                                                                                                   85395 non-null  int64  
 3   AWATER                                                        

Ok! We're almost ready for our matchup. We just need to add a new "GEOID Tract" column to df_census_pop, so we can use it to match with our df_gaz DataFrame. We know that the last 11 digits of the GEOID Full identify the census tract. So, we'll create the new "GEOID Tract" column using a lambda to slice just those last 11 characters from the "GEOID Full" column.

In [17]:
df_census_pop["GEOID Tract"] = df_census_pop["GEOID Full"].apply(lambda x: str(x)[-11:])
df_census_pop.head()

Unnamed: 0,GEOID Full,Census Tract Name,Population,GEOID Tract
1,1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775,01001020100
2,1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055,01001020200
3,1400000US01001020300,"Census Tract 203, Autauga County, Alabama",3216,01001020300
4,1400000US01001020400,"Census Tract 204, Autauga County, Alabama",4246,01001020400
5,1400000US01001020501,"Census Tract 205.01, Autauga County, Alabama",4322,01001020501
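
The same slice can also be written with the vectorized `.str` accessor, which avoids a Python-level lambda. The sketch below uses two GEOID values from the data above and adds a quick format check; it is an alternative shown for comparison, not a replacement for the cell above.

```python
import pandas as pd

geoid_full = pd.Series(["1400000US01001020100", "1400000US72153750601"])

# Take the last 11 characters of every value in one vectorized step.
geoid_tract = geoid_full.str[-11:]

# Confirm each result is exactly 11 digits (leading zeros included).
is_valid = geoid_tract.str.fullmatch(r"\d{11}")
print(geoid_tract.tolist(), is_valid.all())
```

A check like this would catch a GEOID that had been read as a number somewhere upstream and lost its leading zero.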


Our new GEOID Tract column looks good, so now we can go ahead with the join. We'll join df_census_pop and df_gaz on the 'GEOID Tract' column.

In [18]:
#df_census_pop.join(df_gaz.set_index('GEOID Tract'), on="GEOID Tract").head(10)

In [19]:
df_census_pop = df_census_pop.join(df_gaz.set_index('GEOID Tract'), on="GEOID Tract")
df_census_pop.head(10)

Unnamed: 0,GEOID Full,Census Tract Name,Population,GEOID Tract,USPS,ALAND,AWATER,ALAND_SQMI,AWATER_SQMI,INTPTLAT,INTPTLONG
1,1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775,01001020100,AL,9825304,28435,3.794,0.011,32.481973,-86.491565
2,1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055,01001020200,AL,3320818,5669,1.282,0.002,32.475758,-86.472468
3,1400000US01001020300,"Census Tract 203, Autauga County, Alabama",3216,01001020300,AL,5349271,9054,2.065,0.003,32.474024,-86.459703
4,1400000US01001020400,"Census Tract 204, Autauga County, Alabama",4246,01001020400,AL,6384282,8408,2.465,0.003,32.47103,-86.444835
5,1400000US01001020501,"Census Tract 205.01, Autauga County, Alabama",4322,01001020501,AL,6203654,0,2.395,0.0,32.447861,-86.422558
6,1400000US01001020502,"Census Tract 205.02, Autauga County, Alabama",3284,01001020502,AL,2097390,379,0.81,0.0,32.465736,-86.418855
7,1400000US01001020503,"Census Tract 205.03, Autauga County, Alabama",3616,01001020503,AL,3107823,43155,1.2,0.017,32.475447,-86.425533
8,1400000US01001020600,"Census Tract 206, Autauga County, Alabama",3729,01001020600,AL,8041611,59779,3.105,0.023,32.44734,-86.476828
9,1400000US01001020700,"Census Tract 207, Autauga County, Alabama",3409,01001020700,AL,22411848,772012,8.653,0.298,32.430692,-86.439202
10,1400000US01001020801,"Census Tract 208.01, Autauga County, Alabama",3143,01001020801,AL,124272664,8117631,47.982,3.134,32.418071,-86.527127
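
A join like this silently leaves NaN in rows that find no match, so it can be worth verifying that every tract lined up. Here is a minimal sketch using `DataFrame.merge` with `validate` and `indicator`; the two small frames reuse values from the rows above, and the GEOID "99999999999" is a made-up key added to show what an unmatched row looks like.

```python
import pandas as pd

left = pd.DataFrame({
    "GEOID Tract": ["01001020100", "01001020200", "99999999999"],
    "Population": [1775, 2055, 100],
})
right = pd.DataFrame({
    "GEOID Tract": ["01001020100", "01001020200"],
    "ALAND_SQMI": [3.794, 1.282],
})

# validate raises if keys are duplicated on either side;
# indicator adds a "_merge" column recording where each row came from.
merged = left.merge(
    right, on="GEOID Tract", how="left", validate="one_to_one", indicator=True
)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

An empty `unmatched` frame on the real data would confirm that every census tract found its land area.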


Before we go any further, let's create a new DataFrame containing just the columns we want. While we're at it, let's also improve our column names.

In [20]:
df_census_pop = df_census_pop[['GEOID Tract', 'USPS', 'Census Tract Name', 'Population', 'ALAND_SQMI']].copy()
df_census_pop.rename(columns={'USPS': 'State', 'ALAND_SQMI': 'Square Miles'}, inplace=True)
df_census_pop['Population Density'] = df_census_pop['Population']/df_census_pop['Square Miles']
#df.index = [x for x in range(1, len(df.values))]
df_census_pop[df_census_pop['State'] == 'NY']


Unnamed: 0,GEOID Tract,State,Census Tract Name,Population,Square Miles,Population Density
49733,36001000100,NY,"Census Tract 1, Albany County, New York",2073,0.914,2268.052516
49734,36001000201,NY,"Census Tract 2.01, Albany County, New York",3125,0.238,13130.252101
49735,36001000202,NY,"Census Tract 2.02, Albany County, New York",2598,0.557,4664.272890
49736,36001000301,NY,"Census Tract 3.01, Albany County, New York",3190,0.255,12509.803922
49737,36001000302,NY,"Census Tract 3.02, Albany County, New York",3496,1.968,1776.422764
...,...,...,...,...,...,...
55139,36123150301,NY,"Census Tract 1503.01, Yates County, New York",2854,57.651,49.504779
55140,36123150302,NY,"Census Tract 1503.02, Yates County, New York",2296,71.484,32.119076
55141,36123150400,NY,"Census Tract 1504, Yates County, New York",3836,27.464,139.673755
55142,36123150501,NY,"Census Tract 1505.01, Yates County, New York",1815,9.676,187.577511


In [22]:
# Park Slope Brooklyn NY
df_census_pop[df_census_pop['GEOID Tract'] == '36047016500']

Unnamed: 0,GEOID Tract,State,Census Tract Name,Population,Square Miles,Population Density
51081,36047016500,NY,"Census Tract 165, Kings County, New York",5080,0.072,70555.555556
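
For very small tracts like this one, the land area in square miles is rounded to three decimal places, so the resulting density is only approximate. The Gazetteer's ALAND column gives land area in square meters, which allows a finer calculation. The sketch below compares the two approaches for Census Tract 201 in Autauga County, using the values shown earlier; the variable names are my own.

```python
SQ_METERS_PER_SQ_MILE = 2_589_988.110336  # 1609.344 m per mile, squared

population = 1775
aland_sq_meters = 9_825_304  # ALAND for tract 01001020100
aland_sq_miles = 3.794       # ALAND_SQMI for the same tract (rounded)

# Density from the rounded square-mile figure.
density_from_sqmi = population / aland_sq_miles

# Density from the unrounded square-meter figure, converted to square miles.
density_from_sqm = population / (aland_sq_meters / SQ_METERS_PER_SQ_MILE)

print(round(density_from_sqmi, 2), round(density_from_sqm, 2))
```

The gap is small for a tract of several square miles, but it grows as the land area shrinks toward the rounding precision.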
