## Joining US Census Population Data with GIS Data to Display Population Density



What's the state of urban farming in the US? Phrased differently, what kinds of places become home to urban farms? There are a lot of ways to characterize a place, so we'll start by narrowing our scope to population density.


In [1]:
# data files are stored compressed to save time and space
import tarfile
import pandas as pd

# 32mb+ of census data saved in a 4.7mb archive
census_data_archive = "../data_archive/census_data_2022_03_01.tgz"

# 17mb of GIS data saved in a 2.4mb archive
gis_file = "2020_Gaz_tracts_national.gz"

# This is the US Census file with population data we will extract
census_2020_file = "DECENNIALPL2020.P1_data_with_overlays_2021-12-02T121459.csv"

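# This extracts one CSV member from a tgz archive and returns it as a DataFrame
def extract_from_tgz(archive_path, member_name):
    with tarfile.open(archive_path) as tf:
        for member in tf.getmembers():
            if member.name == member_name:
                data = tf.extractfile(member)
                return pd.read_csv(data, low_memory=False)
    # Fail loudly instead of silently returning None when the file isn't in the archive
    raise FileNotFoundError(f"{member_name} not found in {archive_path}")

df_census = extract_from_tgz(census_data_archive, census_2020_file)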

# change some options that determine how much data is displayed in the notebook
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

Using .shape, we can tell before even setting eyes on it that the DataFrame is going to be enormous - 73 columns and 85,000+ rows!

In [2]:
df_census.shape

(85396, 73)

We can get a pretty good sense of the data by viewing just the first few rows of our DataFrame. So, let's use .head() to limit the display to just the first 5 rows.

In [3]:
df_census.head()

Unnamed: 0,GEO_ID,NAME,P1_001N,P1_002N,P1_003N,P1_004N,P1_005N,P1_006N,P1_007N,P1_008N,...,P1_062N,P1_063N,P1_064N,P1_065N,P1_066N,P1_067N,P1_068N,P1_069N,P1_070N,P1_071N
0,id,Geographic Area Name,!!Total:,!!Total:!!Population of one race:,!!Total:!!Population of one race:!!White alone,!!Total:!!Population of one race:!!Black or African American alone,!!Total:!!Population of one race:!!American Indian and Alaska Native alone,!!Total:!!Population of one race:!!Asian alone,!!Total:!!Population of one race:!!Native Hawaiian and Other Pacific Islander alone,!!Total:!!Population of one race:!!Some Other Race alone,...,!!Total:!!Population of two or more races:!!Population of four races:!!American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,!!Total:!!Population of two or more races:!!Population of five races:,!!Total:!!Population of two or more races:!!Population of five races:!!White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander,!!Total:!!Population of two or more races:!!Population of five races:!!White; Black or African American; American Indian and Alaska Native; Asian; Some Other Race,!!Total:!!Population of two or more races:!!Population of five races:!!White; Black or African American; American Indian and Alaska Native; Native Hawaiian and Other Pacific Islander; Some Other Race,!!Total:!!Population of two or more races:!!Population of five races:!!White; Black or African American; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,!!Total:!!Population of two or more races:!!Population of five races:!!White; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,!!Total:!!Population of two or more races:!!Population of five races:!!Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,!!Total:!!Population of two or more races:!!Population of six races:,!!Total:!!Population of two or more races:!!Population of six races:!!White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race
1,1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775,1653,1389,213,5,8,3,35,...,0,0,0,0,0,0,0,0,0,0
2,1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055,1984,842,1104,2,12,4,20,...,0,0,0,0,0,0,0,0,0,0
3,1400000US01001020300,"Census Tract 203, Autauga County, Alabama",3216,3039,2244,714,12,14,6,49,...,0,0,0,0,0,0,0,0,0,0
4,1400000US01001020400,"Census Tract 204, Autauga County, Alabama",4246,3993,3578,327,17,32,1,38,...,0,0,0,0,0,0,0,0,0,0


The population data columns have been named using Census Bureau codes that aren't very meaningful to us. In a moment, we'll rename them using the descriptions that have been provided in the first row (index 0). But first, let's use regex to remove all of those distracting exclamation points! 

In [4]:
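# Clean the description row (index 0): swap each '!!' for a space, then trim the leading space.
# Assigning the result back avoids relying on an in-place change to a temporary row object.
df_census.iloc[0] = df_census.iloc[0].str.replace('!!', ' ', regex=True).str.lstrip()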

Now we can go ahead and replace the column names with the cleaned up descriptions.

In [5]:
df_census.columns = df_census.iloc[0]
df_census = df_census.drop(0)
df_census

Unnamed: 0,id,Geographic Area Name,Total:,Total: Population of one race:,Total: Population of one race: White alone,Total: Population of one race: Black or African American alone,Total: Population of one race: American Indian and Alaska Native alone,Total: Population of one race: Asian alone,Total: Population of one race: Native Hawaiian and Other Pacific Islander alone,Total: Population of one race: Some Other Race alone,...,Total: Population of two or more races: Population of four races: American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total: Population of two or more races: Population of five races:,Total: Population of two or more races: Population of five races: White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander,Total: Population of two or more races: Population of five races: White; Black or African American; American Indian and Alaska Native; Asian; Some Other Race,Total: Population of two or more races: Population of five races: White; Black or African American; American Indian and Alaska Native; Native Hawaiian and Other Pacific Islander; Some Other Race,Total: Population of two or more races: Population of five races: White; Black or African American; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total: Population of two or more races: Population of five races: White; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total: Population of two or more races: Population of five races: Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race,Total: Population of two or more races: Population of six races:,Total: Population of two or more races: Population of six races: White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race
1,1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775,1653,1389,213,5,8,3,35,...,0,0,0,0,0,0,0,0,0,0
2,1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055,1984,842,1104,2,12,4,20,...,0,0,0,0,0,0,0,0,0,0
3,1400000US01001020300,"Census Tract 203, Autauga County, Alabama",3216,3039,2244,714,12,14,6,49,...,0,0,0,0,0,0,0,0,0,0
4,1400000US01001020400,"Census Tract 204, Autauga County, Alabama",4246,3993,3578,327,17,32,1,38,...,0,0,0,0,0,0,0,0,0,0
5,1400000US01001020501,"Census Tract 205.01, Autauga County, Alabama",4322,4055,3241,632,29,93,2,58,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85391,1400000US72153750501,"Census Tract 7505.01, Yauco Municipio, Puerto Rico",3968,1967,707,180,35,0,0,1045,...,0,0,0,0,0,0,0,0,0,0
85392,1400000US72153750502,"Census Tract 7505.02, Yauco Municipio, Puerto Rico",1845,915,380,48,10,6,0,471,...,0,0,0,0,0,0,0,0,0,0
85393,1400000US72153750503,"Census Tract 7505.03, Yauco Municipio, Puerto Rico",2155,1152,492,76,7,0,1,576,...,0,0,0,0,0,0,0,0,0,0
85394,1400000US72153750601,"Census Tract 7506.01, Yauco Municipio, Puerto Rico",4368,2221,782,204,23,1,0,1211,...,0,0,0,0,0,0,0,0,0,0


Now that the names for each column describe their contents, we're able to see that we really only need the first column of population data. We can "throw away" the other 70 columns! So, let's create a new, much smaller DataFrame with just the columns we need. We'll call it "df_census_pop".

In [6]:
# .copy() gives us an independent DataFrame, so later edits don't touch df_census
df_census_pop = df_census[['id', 'Geographic Area Name', 'Total:']].copy()
#df_census = None
df_census_pop

Unnamed: 0,id,Geographic Area Name,Total:
1,1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775
2,1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055
3,1400000US01001020300,"Census Tract 203, Autauga County, Alabama",3216
4,1400000US01001020400,"Census Tract 204, Autauga County, Alabama",4246
5,1400000US01001020501,"Census Tract 205.01, Autauga County, Alabama",4322
...,...,...,...
85391,1400000US72153750501,"Census Tract 7505.01, Yauco Municipio, Puerto Rico",3968
85392,1400000US72153750502,"Census Tract 7505.02, Yauco Municipio, Puerto Rico",1845
85393,1400000US72153750503,"Census Tract 7505.03, Yauco Municipio, Puerto Rico",2155
85394,1400000US72153750601,"Census Tract 7506.01, Yauco Municipio, Puerto Rico",4368


Let's give the columns shorter, clearer names, and then run .info() on our new DataFrame to see what we might need to clean up.

In [7]:
df_census_pop.rename(columns={
    'id': 'GEOID Full',
    'Geographic Area Name': 'Tract Name',
    'Total:': 'Population'},
    inplace=True)
df_census_pop.head()

Unnamed: 0,GEOID Full,Tract Name,Population
1,1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775
2,1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055
3,1400000US01001020300,"Census Tract 203, Autauga County, Alabama",3216
4,1400000US01001020400,"Census Tract 204, Autauga County, Alabama",4246
5,1400000US01001020501,"Census Tract 205.01, Autauga County, Alabama",4322


In [8]:
df_census_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85395 entries, 1 to 85395
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   GEOID Full  85395 non-null  object
 1   Tract Name  85395 non-null  object
 2   Population  85395 non-null  object
dtypes: object(3)
memory usage: 2.0+ MB


According to .info(), all of the columns are free of null values. This sounds a little suspicious. We want to be sure this is actually the case, so we'll probe further.

To do this we can use isna() to identify if any values are missing. This will give us a boolean and indicate if the observation in that column is missing (True) or not (False). 

In [9]:
# df_census_pop['Tract Name'].isna() 
df_census_pop.isna() 

Unnamed: 0,GEOID Full,Tract Name,Population
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
...,...,...,...
85391,False,False,False
85392,False,False,False
85393,False,False,False
85394,False,False,False


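Every value we can see is False, but the display only shows a handful of the 85,000+ rows. Let's count the missing values in each column to be sure, and then list any rows that contain one.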

In [10]:
df_census_pop.isna().sum()

0
GEOID Full    0
Tract Name    0
Population    0
dtype: int64

In [11]:
df_census_pop[df_census_pop.isna().any(axis=1)]

Unnamed: 0,GEOID Full,Tract Name,Population


Both checks agree: the counts are all zero and the filter returns no rows, so there are no null values to deal with. We've also learned from .info() that all of our columns are of datatype object. Based on our visual inspection of the first few rows of data, we know that the values in both columns GEOID Full and Tract Name are actually strings, while values in the column Population are really integers stored as strings. We'll fix this in a moment.
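One way the Population fix could look is sketched below. It runs on a small stand-in DataFrame rather than the real data: the first two rows are copied from the output above, and the third row, with its "(X)" placeholder, is made up purely to show what happens when a non-numeric string is hiding in the column. The names toy_pop, bad_rows, and toy_clean are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_census_pop. The first two rows come from the output above;
# the third row and its "(X)" placeholder are hypothetical.
toy_pop = pd.DataFrame({
    "GEOID Full": ["1400000US01001020100", "1400000US01001020200", "1400000US99999999999"],
    "Population": ["1775", "2055", "(X)"],
})

# errors="coerce" turns anything that is not a number into NaN instead of raising
population_numeric = pd.to_numeric(toy_pop["Population"], errors="coerce")

# Rows that isna() would have missed, because the "missing" value was a string
bad_rows = toy_pop[population_numeric.isna()]

# Keep only the clean rows and store Population as a real integer
is_clean = population_numeric.notna()
toy_clean = toy_pop[is_clean].copy()
toy_clean["Population"] = population_numeric[is_clean].astype("int64")
```

If bad_rows comes back empty on the real DataFrame, the column can be converted directly; if not, those rows are worth a look before deciding whether to drop them.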

Okay, now let's load in the dataset containing the area in square miles of each census tract. We're going to match this up to our first dataframe, so we can calculate population density. 

We took a peek at the data before loading it and, having previously spent HOURS on the Census website, we know that the GEOIDs we are looking at here are for the census tract level. So, we're going to rename our column to indicate this.

(Per the Census.gov website, the GEOID Structure for a census tract is: STATE+COUNTY+TRACT | Digits: 2+3+6=11 | Example GeoArea: Census Tract 2231 in Harris County, TX | Example GEOID: 48201223100)
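To make the 2+3+6 structure concrete, here is a small sketch that pulls the tract-level GEOID out of a full GEOID and splits it into its parts. The function name split_tract_geoid is hypothetical, and the sketch assumes the full GEOID ends with the 11-digit tract GEOID, as it does in the rows shown earlier (for example 1400000US01001020100).

```python
def split_tract_geoid(geoid_full):
    """Split a census GEOID into state, county, and tract codes (hypothetical helper)."""
    # The tract-level GEOID is the last 11 characters: STATE (2) + COUNTY (3) + TRACT (6)
    geoid_tract = geoid_full[-11:]
    if len(geoid_tract) != 11 or not geoid_tract.isdigit():
        raise ValueError(f"Expected an 11-digit tract GEOID, got {geoid_tract!r}")
    return {
        "geoid_tract": geoid_tract,
        "state": geoid_tract[:2],
        "county": geoid_tract[2:5],
        "tract": geoid_tract[5:],
    }

# First row of the census data shown earlier
autauga = split_tract_geoid("1400000US01001020100")

# The Census.gov example: Census Tract 2231 in Harris County, TX
harris = split_tract_geoid("48201223100")
```

Keeping every part as a string matters here: a state code like 01 loses its leading zero the moment it is treated as a number.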

In [12]:
file_gaz = "../data_archive/2020_Gaz_tracts_national.gz"
df_gaz = pd.read_csv(file_gaz, delimiter = '\t', compression='gzip', dtype={'GEOID' : str})
df_gaz.rename(columns={'GEOID' : 'GEOID Tract'}, inplace=True)
df_gaz

Unnamed: 0,USPS,GEOID Tract,ALAND,AWATER,ALAND_SQMI,AWATER_SQMI,INTPTLAT,INTPTLONG
0,AL,01001020100,9825304,28435,3.794,0.011,32.481973,-86.491565
1,AL,01001020200,3320818,5669,1.282,0.002,32.475758,-86.472468
2,AL,01001020300,5349271,9054,2.065,0.003,32.474024,-86.459703
3,AL,01001020400,6384282,8408,2.465,0.003,32.471030,-86.444835
4,AL,01001020501,6203654,0,2.395,0.000,32.447861,-86.422558
...,...,...,...,...,...,...,...,...
85390,PR,72153750501,1820185,0,0.703,0.000,18.031211,-66.867347
85391,PR,72153750502,689931,0,0.266,0.000,18.024746,-66.860442
85392,PR,72153750503,3298433,1952,1.274,0.001,18.023148,-66.876603
85393,PR,72153750601,10985103,4527,4.241,0.002,18.017809,-66.839070


In [13]:
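# The last 11 characters of the full GEOID are the tract-level GEOID (STATE+COUNTY+TRACT)
df_census_pop["GEOID Tract"] = df_census_pop["GEOID Full"].str[-11:]
# Preview the join; the result is not assigned, so df_census_pop itself is unchanged
df_census_pop.join(df_gaz.set_index('GEOID Tract'), on="GEOID Tract").head(10)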

Unnamed: 0,GEOID Full,Tract Name,Population,GEOID Tract,USPS,ALAND,AWATER,ALAND_SQMI,AWATER_SQMI,INTPTLAT,INTPTLONG
1,1400000US01001020100,"Census Tract 201, Autauga County, Alabama",1775,1001020100,AL,9825304,28435,3.794,0.011,32.481973,-86.491565
2,1400000US01001020200,"Census Tract 202, Autauga County, Alabama",2055,1001020200,AL,3320818,5669,1.282,0.002,32.475758,-86.472468
3,1400000US01001020300,"Census Tract 203, Autauga County, Alabama",3216,1001020300,AL,5349271,9054,2.065,0.003,32.474024,-86.459703
4,1400000US01001020400,"Census Tract 204, Autauga County, Alabama",4246,1001020400,AL,6384282,8408,2.465,0.003,32.47103,-86.444835
5,1400000US01001020501,"Census Tract 205.01, Autauga County, Alabama",4322,1001020501,AL,6203654,0,2.395,0.0,32.447861,-86.422558
6,1400000US01001020502,"Census Tract 205.02, Autauga County, Alabama",3284,1001020502,AL,2097390,379,0.81,0.0,32.465736,-86.418855
7,1400000US01001020503,"Census Tract 205.03, Autauga County, Alabama",3616,1001020503,AL,3107823,43155,1.2,0.017,32.475447,-86.425533
8,1400000US01001020600,"Census Tract 206, Autauga County, Alabama",3729,1001020600,AL,8041611,59779,3.105,0.023,32.44734,-86.476828
9,1400000US01001020700,"Census Tract 207, Autauga County, Alabama",3409,1001020700,AL,22411848,772012,8.653,0.298,32.430692,-86.439202
10,1400000US01001020801,"Census Tract 208.01, Autauga County, Alabama",3143,1001020801,AL,124272664,8117631,47.982,3.134,32.418071,-86.527127
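With population and land area side by side, the density calculation mentioned earlier becomes a single division. The sketch below is one possible way to do it, not necessarily the approach this notebook takes next. It uses two rows copied from the preview above, swaps join for merge so that validate and indicator can flag duplicate or unmatched GEOIDs, and assumes Population has already been converted to integers. The column name Density and the variable names are illustrative.

```python
import pandas as pd

# Two rows copied from the join preview above
pop = pd.DataFrame({
    "GEOID Tract": ["01001020100", "01001020200"],
    "Population": [1775, 2055],
})
area = pd.DataFrame({
    "GEOID Tract": ["01001020100", "01001020200"],
    "ALAND_SQMI": [3.794, 1.282],
})

# validate raises if a GEOID appears more than once on either side;
# indicator adds a "_merge" column saying where each row was found
merged = pop.merge(area, on="GEOID Tract", how="left",
                   validate="one_to_one", indicator=True)

# Tracts with no matching land-area row would show up here
unmatched = merged[merged["_merge"] != "both"]

# People per square mile of land
merged["Density"] = merged["Population"] / merged["ALAND_SQMI"]
```

On the full dataset, any tract with a land area of 0 would produce an infinite density, so those rows would need to be handled before plotting.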
