### Week 3: Exploring Linguistic Isolation in LA County
This notebook explores linguistic isolation at the census tract level across LA County using census data. (Author: Aron Walker, with substantial credit to Yoh Kawano's instructions and models)

#### Step 0: Setting up libraries and data

In [18]:
# Here are the libraries I will use:
import geopandas as gp
import numpy as np

For data, CalEnviroScreen has conveniently already attached my census variables of interest (total population and linguistic isolation) to census tract geometry.

In [3]:
# Here is the data I will use:
data_file = 'data/calenviroscreen40shpf2021shp.zip'
data = gp.read_file(data_file)

ERROR 1: PROJ: proj_create_from_database: Open of /opt/conda/share/proj failed


#### Step 1: Exploring the structure of the census data

In this section, I:

* count the number of rows and columns
* identify the types of data in each column
* look at a few rows to get a sense of the inputs

In [4]:
# What are its dimensions?
data.shape

(8035, 67)

In [5]:
# What types of data are present?
data.info(verbose=True, show_counts=True)

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 8035 entries, 0 to 8034
Data columns (total 67 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Tract       8035 non-null   float64 
 1   ZIP         8035 non-null   int64   
 2   County      8035 non-null   object  
 3   ApproxLoc   8035 non-null   object  
 4   TotPop19    8035 non-null   int64   
 5   CIscore     8035 non-null   float64 
 6   CIscoreP    8035 non-null   float64 
 7   Ozone       8035 non-null   float64 
 8   OzoneP      8035 non-null   float64 
 9   PM2_5       8035 non-null   float64 
 10  PM2_5_P     8035 non-null   float64 
 11  DieselPM    8035 non-null   float64 
 12  DieselPM_P  8035 non-null   float64 
 13  Pesticide   8035 non-null   float64 
 14  PesticideP  8035 non-null   float64 
 15  Tox_Rel     8035 non-null   float64 
 16  Tox_Rel_P   8035 non-null   float64 
 17  Traffic     8035 non-null   float64 
 18  TrafficP    8035 non-null   float64 
 19

CalEnviroScreen has also already adopted column names that make sense. My target variables are `TotPop19` and `Ling_Isol`.
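
Because the `info()` listing above is cut off before it reaches `Ling_Isol`, it can help to confirm directly that both target columns exist. Here is a minimal sketch of that check; the `toy` table and its values are made up for illustration and stand in for `data`:

```python
import pandas as pd

# Hypothetical stand-in for the CalEnviroScreen table (values are made up).
toy = pd.DataFrame({
    "County": ["Los Angeles", "Orange"],
    "TotPop19": [4365, 4710],
    "Ling_Isol": [12.5, 3.1],
})

targets = ["TotPop19", "Ling_Isol"]

# List any target column that is absent from the table.
missing = [col for col in targets if col not in toy.columns]
print(missing)

# Show the dtypes of just the target columns.
print(toy[targets].dtypes)
```

An empty `missing` list means every target column is present; swapping `toy` for `data` would run the same check on the real table.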

In [6]:
# What do the first few rows look like?
data.head()

Unnamed: 0,Tract,ZIP,County,ApproxLoc,TotPop19,CIscore,CIscoreP,Ozone,OzoneP,PM2_5,...,Elderly65,Hispanic,White,AfricanAm,NativeAm,OtherMult,Shape_Leng,Shape_Area,AAPI,geometry
0,6083002000.0,93454,Santa Barbara,Santa Maria,4495,36.019653,69.162885,0.03419,10.566273,7.567724,...,12.5028,68.921,20.8899,0.4004,0.267,1.3126,6999.357689,2847611.0,8.2091,"POLYGON ((-39795.070 -341919.191, -38126.384 -..."
1,6083002000.0,93455,Santa Barbara,Santa Maria,13173,37.030667,70.637922,0.035217,11.561917,7.624775,...,5.3519,78.6229,13.224,2.5051,0.0,0.9489,19100.578232,16352920.0,4.699,"POLYGON ((-39795.070 -341919.191, -39803.632 -..."
2,6083002000.0,93454,Santa Barbara,Santa Maria,2398,31.21314,61.069087,0.03419,10.566273,7.548835,...,12.8857,65.7214,30.6088,0.9591,0.0,2.1685,4970.985897,1352329.0,0.5421,"POLYGON ((-38115.747 -341130.248, -38126.384 -..."
3,6083002000.0,93455,Santa Barbara,Orcutt,4496,6.639331,5.988401,0.036244,13.615432,7.66057,...,14.4128,22.9537,69.1948,0.9342,0.7117,2.5356,6558.956012,2417717.0,3.6699,"POLYGON ((-37341.662 -348530.437, -37252.307 -..."
4,6083002000.0,93455,Santa Barbara,Orcutt,4008,14.022852,23.121533,0.036244,13.615432,7.66321,...,18.8872,33.4082,59.7804,0.6986,1.4721,1.3723,6570.36873,2608422.0,3.2685,"POLYGON ((-39465.107 -348499.262, -38244.305 -..."


In [7]:
# What do the last few rows look like?
data.tail()

Unnamed: 0,Tract,ZIP,County,ApproxLoc,TotPop19,CIscore,CIscoreP,Ozone,OzoneP,PM2_5,...,Elderly65,Hispanic,White,AfricanAm,NativeAm,OtherMult,Shape_Leng,Shape_Area,AAPI,geometry
8030,6037430000.0,91016,Los Angeles,Monrovia,5339,17.124832,30.610187,0.062365,88.69944,11.873339,...,17.4752,28.7132,53.3995,1.5733,0.0,7.1549,7166.130635,1938016.0,9.159,"POLYGON ((185152.883 -426843.064, 185240.372 -..."
8031,6037431000.0,91007,Los Angeles,Arcadia,4365,13.84199,22.566818,0.059387,79.987554,11.816074,...,10.4926,10.9507,26.3918,3.3677,0.0,3.3677,3941.781806,485563.0,55.9221,"POLYGON ((179874.001 -429709.190, 179885.911 -..."
8032,6037431000.0,91016,Los Angeles,Monrovia,6758,39.697849,74.508321,0.061338,84.579963,11.892654,...,7.2951,58.2273,16.1438,8.9967,0.0,1.1098,8020.091253,3015661.0,15.5223,"POLYGON ((184530.475 -428031.241, 184535.255 -..."
8033,6037534000.0,90201,Los Angeles,Bell,6986,62.931044,97.049924,0.046325,46.9944,12.019728,...,9.4188,91.4114,6.9425,0.6728,0.2577,0.7157,4949.116808,811895.5,0.0,"POLYGON ((167498.880 -447404.351, 167453.159 -..."
8034,6037534000.0,90201,Los Angeles,Bell Gardens,2358,63.315048,97.226425,0.047165,50.541381,12.025885,...,10.0933,91.0941,1.3147,1.9084,0.0,0.0,4420.126752,509871.8,5.6828,"POLYGON ((169695.249 -447290.043, 169560.378 -..."


In [8]:
# What does a random row look like? (sample() returns one row by default)
data.sample()

Unnamed: 0,Tract,ZIP,County,ApproxLoc,TotPop19,CIscore,CIscoreP,Ozone,OzoneP,PM2_5,...,Elderly65,Hispanic,White,AfricanAm,NativeAm,OtherMult,Shape_Leng,Shape_Area,AAPI,geometry
552,6059022000.0,92869,Orange,Orange,4710,17.975262,32.690368,0.048635,56.963286,11.819791,...,18.9597,25.6476,59.3206,0.0,0.0,4.1189,8098.754305,2519600.0,10.913,"POLYGON ((204440.041 -465942.014, 204537.179 -..."


#### Step 2: Limiting the data to what interests me

First, I am only looking at LA County:

In [30]:
# Filter by `County` for Los Angeles
data_la = data[data['County']=='Los Angeles']

# Did it work?
np.unique(data_la['County'])

array(['Los Angeles'], dtype=object)
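
The same filter-then-verify pattern can be sketched on a small table. This is only an illustration: `toy` and its values are made up, and `unique()` is one of several ways to run the check. After a successful filter, the only county left should be the one filtered for.

```python
import pandas as pd

# Hypothetical table with a few counties (values are made up).
toy = pd.DataFrame({
    "County": ["Los Angeles", "Orange", "Los Angeles", "Kern"],
    "TotPop19": [5339, 4710, 0, 2100],
})

# Keep only the rows where County is exactly 'Los Angeles'.
toy_la = toy[toy["County"] == "Los Angeles"]

# The distinct county names that survive the filter.
remaining = toy_la["County"].unique().tolist()
print(remaining)

# Fail loudly if any other county slipped through.
assert remaining == ["Los Angeles"]
```

Turning the visual check into an `assert` makes the cell stop with an error if the filter ever misbehaves, for example after a re-run in a different order.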

Next, for each census tract, I really only want:

* tract geometry
* total population
* linguistic isolation score

In [31]:
# Here are the columns I want as labeled in the data:
columns_of_interest = ['geometry',
                       'TotPop19',
                       'Ling_Isol']

# Here is the data with just those columns:
data2 = data_la[columns_of_interest]

# Did it work?
data2.info(verbose=True, show_counts=True)

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 2343 entries, 5692 to 8034
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   geometry   2343 non-null   geometry
 1   TotPop19   2343 non-null   int64   
 2   Ling_Isol  2343 non-null   float64 
dtypes: float64(1), geometry(1), int64(1)
memory usage: 73.2 KB


Finally, I only want census tracts with people.

In [12]:
# Drop the rows (tracts) without people
data2 = data2[data2['TotPop19']>0]

In [33]:
# What are the dimensions now?
data2.shape

(2327, 3)
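
Comparing row counts before and after the filter shows how many tracts were removed. In the outputs above, the count goes from 2343 to 2327, so 16 tracts had no recorded population. A minimal sketch of that bookkeeping, using a made-up `toy` table rather than the real data:

```python
import pandas as pd

# Hypothetical population column; two rows have no residents.
toy = pd.DataFrame({"TotPop19": [5339, 0, 4365, 0, 2358]})

rows_before = len(toy)

# Same condition as in the notebook: keep rows with at least one person.
populated = toy[toy["TotPop19"] > 0]

rows_dropped = rows_before - len(populated)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```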

#### Step 3: Exploring the data's distribution