# 1. Introduction to GeoPandas


### HOMEWORK AND RECAP ARE STILL IN THIS FILE

Welcome back! We're diving into using a popular Python package, `GeoPandas`, so we can start looking at our data spatially! In this notebook we'll be covering the following topics:

- [1.1 Introduction](#section1)
    - GeoPandas and Geospatial Data in Python
- [1.2 Data Preparation](#section2)
    - Reading in and writing out data as csv
    - Preprocessing the ACS 5 year data
- [1.3 Mapping Census Tracts](#section3)
    - Learning about census tracts data
    - Reading in shapefiles with GeoPandas
    - Exploring a GeoDataFrame
    - Mapping geospatial data stored in a GeoDataFrame
- [1.4 Spatial Subsetting](#section4)
    - Subsetting by bounding box coordinates
- [1.5 Attribute Joins](#section5)
    - Joining a pandas DataFrame to a GeoPandas GeoDataFrame
- [1.6 Data Driven Mapping](#section6)
    - Types of thematic mapping
    - Reading in and writing out spatial data in different file formats (e.g., shapefile, csv, geojson)
- [1.7 Coordinate Reference Systems (CRS)](#section7)
    - Handling CRS in GeoPandas (i.e., getting, setting, transforming)
    - Spatial measurement calculations (area, length)
- [1.8 Recap](#section8)
- [1.9 Homework](#section9)
- [1.10 References](#section10)
    



**INSTRUCTOR NOTES**:
- Datasets used:
    - "../notebook_data/census/ACS5yr/census_variables_CA.csv"
    - "../notebook_data/census/Tracts/cb_2018_06_tract_500k.zip"


- Expected time to complete:
    - Lecture + Questions: 1.5 hours
    - Homework: 40 minutes
    
---


<a id="section1"></a>
## 1.1 Introduction

The goal of this notebook is to give you a **tip of the iceberg introduction** to working with geospatial data in Python using the **GeoPandas** package.  

> #### Assumptions
> This lesson assumes you have basic working knowledge of Python and of geospatial data. If you need a geospatial refresher, we refer you to these freely available online resources:
> - The Open Textbook Library: [Essentials of Geographic Information Systems by Jonathan E. Campbell and Michael Shin](http://open.umn.edu/opentextbooks/BookDetail.aspx?bookId=67)
> - The Open Textbook Library: [Nature of Geographic Information Systems by David DiBiase](http://open.umn.edu/opentextbooks/BookDetail.aspx?bookId=428) from Esri.
> - Online Gitbook: [Intro to GIS and Spatial Analysis by Manuel Gimond](https://mgimond.github.io/Spatial/index.html)


#### Terminology

Just so we are on the same page:

- `Geographic data` is data about locations on or near the surface of the Earth.

- `Geospatial data` is geographic data that can be explicitly located on the surface of the Earth because it contains coordinates like latitude and longitude.

- `Spatial data` is a more generic term that includes geospatial data as well as other kinds of spatial data.
 
 
### GeoPandas and related Geospatial Packages

[GeoPandas](http://geopandas.org/) is a relatively new package that makes it easier to work with geospatial data in Python. In the last few years it has grown more powerful and stable, which is welcome because working with geospatial data in Python used to be quite complex. GeoPandas is now the go-to package for working with `vector` geospatial data in Python. 

> **Pro-tip**: If you work with `raster` data you will want to check out the [rasterio](https://rasterio.readthedocs.io/en/latest/) package. We will not cover raster data in this tutorial.

### GeoPandas = pandas + geo
GeoPandas gives you access to all of the functionality of [pandas](https://pandas.pydata.org/), which is the primary data analysis tool for working with tabular data in Python. GeoPandas extends pandas with attributes and methods for working with geospatial data.


### Import Libraries

Let's start by importing the libraries that we will use.

In [None]:
import pandas as pd
import geopandas as gpd

import matplotlib # base python plotting library
import matplotlib.pyplot as plt # submodule of matplotlib

# To display plots, maps, charts etc in the notebook
%matplotlib inline  


<a id="section2"></a>
## 1.2 Data preparation

In this lesson we will use ACS and census tract data to demonstrate how to work with GeoPandas, using data for Alameda County as our primary example.


<img src ="https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/CampanileMtTamalpiasSunset-original.jpg/1280px-CampanileMtTamalpiasSunset-original.jpg" height="100" width="400"> 
        

As you are probably aware, Berkeley (and, of course, the University) is located in Alameda County. 

### American Community Survey 5 Year Data (or ACS5)

To get started, let's read the ACS 5 year data for California tracts into a `dataframe` using the pandas `read_csv` function. 

As we read in the ACS data we will tell pandas to make sure that the data in the column `FIPS_11_digit` is read in as a string to preserve leading zeros in the census tract identifiers.

In [None]:
# Read in the ACS5 data for CA into a pandas DataFrame.
# Note: We force the FIPS_11_digit to be read in as a string to preserve any leading zeroes.
acs5data_df = pd.read_csv("../notebook_data/census/ACS5yr/census_variables_CA.csv", dtype={'FIPS_11_digit':str})
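To see why the `dtype` argument matters, here is a minimal sketch that reads a tiny in-memory CSV twice. The two rows are made up for illustration and are not taken from the ACS file. Without `dtype`, pandas infers an integer column and the leading zero is lost.

```python
import io
import pandas as pd

# Toy CSV text standing in for the real file (values are illustrative only)
csv_text = "FIPS_11_digit,c_race\n06001400100,3000\n06001400200,2000\n"

# Default type inference: the identifier is parsed as an integer
inferred_df = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps the identifier exactly as written
string_df = pd.read_csv(io.StringIO(csv_text), dtype={'FIPS_11_digit': str})

print(inferred_df['FIPS_11_digit'].iloc[0])  # leading zero dropped
print(string_df['FIPS_11_digit'].iloc[0])    # leading zero preserved
```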

Pandas provides a number of methods to view information about a dataframe.

The pandas dataframe attribute `shape` tells us the number of rows and columns in the dataframe.

In [None]:
# Take a look at the shape of the dataframe
acs5data_df.shape

Each row in our dataframe is an observation. For the ACS5 data each observation is about a census tract.

Each column in our dataframe is a variable for that observation.

Let's use `head` to take a look at the first 5 rows in the dataframe.

In [None]:
# Take a look at the data
acs5data_df.head()

A `...` in the middle of the top row indicates that there are too many columns to display.

The pandas dataframe `columns` attribute returns the column names (as a pandas `Index`).

In [None]:
acs5data_df.columns

We can see more information about the variables included in our ACS5 year data using the `info` method. This method tells us at a glance what variables (or columns) are included in the data, the data type of each variable, and which variables have values for all rows.

In [None]:
acs5data_df.info()
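If you want the missing-value information from `info` as numbers you can work with, one option is to count missing values per column with `isna().sum()`. The sketch below uses a small made-up dataframe rather than the ACS data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the ACS table (values are illustrative only)
toy_df = pd.DataFrame({
    'c_race': [3000, 2000, 4000],
    'med_rent': [1800.0, np.nan, 2100.0],
})

# Number of missing values in each column
missing_counts = toy_df.isna().sum()
print(missing_counts)
```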

### Brief review of the ACS data

These variables were combined from different ACS 5 year tables. We have information for the following:

- `c_race` - Total population
- `c_white` - Total white non-Latinx
- `c_black` - Total black and African American non-Latinx
- `c_asian` - Total Asian non-Latinx
- `c_latinx` - Total Latinx
- `state_fips` - State level FIPS code
- `county_fips` - County level FIPS code
- `tract_fips` - Tracts level FIPS code
- `med_rent` - Median rent
- `med_hhinc` - Median household income
- `c_tenants` - Total tenants
- `c_owners` - Total owners
- `c_renters` - Total renters
- `c_movers` - Total number of people who moved
- `c_stay` - Total number of people who stayed
- `c_movelocal` - Number of people who moved locally
- `c_movecounty` - Number of people who moved counties
- `c_movestate` - Number of people who moved states
- `c_moveabroad` - Number of people who moved abroad
- `c_commute` - Total number of commuters
- `c_car` - Number of commuters who use a car
- `c_carpool` - Number of commuters who carpool
- `c_transit` - Number of commuters who use public transit
- `c_bike` - Number of commuters who bike
- `c_walk` - Number of commuters who walk
- `year` - ACS data year
- `FIPS_11_digit` - 11-digit FIPS code
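The 11-digit code is the 2-digit state code, the 3-digit county code, and the 6-digit tract code joined together. The sketch below rebuilds it from the three parts. It assumes the parts are stored as integers (check your own data with `.dtypes`), and the tract values and the `fips_rebuilt` column name are made up for illustration.

```python
import pandas as pd

# Toy rows; the tract codes are illustrative only
toy_df = pd.DataFrame({
    'state_fips': [6, 6],
    'county_fips': [1, 75],
    'tract_fips': [400100, 10100],
})

# Zero-pad each part to its fixed width (2 + 3 + 6 = 11 digits) and concatenate
toy_df['fips_rebuilt'] = (
    toy_df['state_fips'].astype(str).str.zfill(2)
    + toy_df['county_fips'].astype(str).str.zfill(3)
    + toy_df['tract_fips'].astype(str).str.zfill(6)
)
print(toy_df['fips_rebuilt'].tolist())
```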

The ACS variables that start with `c_` are counts, those that start with `med_` are medians. Variables that end in `_moe` denote margin of error. There are also a number of derived variables that start with `p_`. These are proportions calculated from the counts divided by the table denominator (the total count for whom that variable was assessed).

We're going to drop all of the margin-of-error columns, whose names end with `_moe`. We can do that in two steps. First, we use `filter` to identify the columns whose names contain the string `_moe`.
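As a small worked example of a proportion, the sketch below divides a count by its table denominator. The numbers are made up, and the column name `p_transit` is hypothetical: it follows the `p_` naming pattern but may not match a column in the file.

```python
import pandas as pd

# Toy counts (illustrative only): commuters and transit commuters per tract
toy_df = pd.DataFrame({
    'c_commute': [1000, 500],
    'c_transit': [250, 50],
})

# Proportion = count / table denominator
toy_df['p_transit'] = toy_df['c_transit'] / toy_df['c_commute']
print(toy_df)
```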

In [None]:
moe_cols = acs5data_df.filter(like='_moe',axis=1).columns
moe_cols

Note how we set the filter `like=` to a value that matches the pattern of the names of the columns we want to drop. You need to make sure you match all of the columns that you want to drop, and only those columns.
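Because `like=` matches the text anywhere in a column name, a stricter alternative is the `regex=` argument with a trailing `$`, which only matches names that end with the pattern. This is a sketch on a toy dataframe; the column `c_moe_flag` is invented purely to show the difference.

```python
import pandas as pd

# Toy frame with no rows; 'c_moe_flag' is a made-up column name
toy_df = pd.DataFrame(columns=['c_race', 'c_race_moe', 'med_rent_moe', 'c_moe_flag'])

# like= matches the substring anywhere in the column name
like_cols = list(toy_df.filter(like='_moe', axis=1).columns)

# regex= with a trailing $ matches only names that end with '_moe'
regex_cols = list(toy_df.filter(regex='_moe$', axis=1).columns)

print(like_cols)
print(regex_cols)
```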

<div style="display:inline-block;vertical-align:top;">
    <img src="http://www.pngall.com/wp-content/uploads/2016/03/Light-Bulb-Free-PNG-Image.png" width="30" align=left > 
</div>  
<div style="display:inline-block;">

#### Question
</div>

What do you think happens if you match `_mo` instead of `_moe` in the filter?

Now that we've got our list of moe columns, we can use `.drop()` to remove them from the dataframe. 

In [None]:
# Drop MOE columns
acs5data_df.drop(moe_cols, axis=1, inplace=True)

Check that you no longer have the moe columns in the dataframe.

In [None]:
acs5data_df.columns

### Select data for our county and year of interest

Our ACS5 data contains observations for all CA counties and two ACS 5 year periods.

The counties are identified by a unique Census FIPS code. 
- You can see the list of all CA Counties and their FIPS codes [here](https://en.wikipedia.org/wiki/List_of_counties_in_California).

Let's use `.unique()` to check the unique set of county FIPS codes included in our dataframe.

In [None]:
acs5data_df['county_fips'].unique()  #what counties are in our dataframe

Now use `.unique()` to see what years are included.

In [None]:
acs5data_df['year'].unique()

We are interested in Alameda County, which has the FIPS code `001`.  Moreover, we are only interested in the 2018 ACS 5 year data.  Let's filter the data to keep only the rows that match these two conditions.


In [None]:
acs5data_df_ac = acs5data_df[(acs5data_df['year']==2018) & (acs5data_df['county_fips']==1)]
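The filter above combines two boolean conditions with `&`. In Python `&` binds more tightly than `==`, so each comparison needs its own parentheses, or you can give each condition a name first. The sketch below shows the named version on made-up rows (the year 2012 and all counts are illustrative only).

```python
import pandas as pd

# Toy rows standing in for the ACS table (values are illustrative only)
toy_df = pd.DataFrame({
    'year': [2012, 2018, 2018, 2018],
    'county_fips': [1, 1, 75, 1],
    'c_race': [2900, 3000, 4000, 2000],
})

# Each comparison produces a boolean Series with one value per row
is_2018 = toy_df['year'] == 2018
is_alameda = toy_df['county_fips'] == 1

# Keep rows where both conditions hold; .copy() makes the result independent of toy_df
subset_df = toy_df[is_2018 & is_alameda].copy()
print(subset_df)
```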

<div style="display:inline-block;vertical-align:top;">
    <img src="http://www.pngall.com/wp-content/uploads/2016/03/Light-Bulb-Free-PNG-Image.png" width="30" align=left > 
</div>  
<div style="display:inline-block;">

#### Question
</div>

Why do we filter on `county_fips==1` instead of `county_fips==001` or `county_fips=='001'`?

In [None]:
# Write your thoughts here

Now, check the contents of our dataframe again.

In [None]:
# now what is the shape of the data when filtered for Alameda County?
print(acs5data_df_ac.shape)

In [None]:
# Take a look at the first 5 rows
acs5data_df_ac.head()

>**Pro-tip:** Checking your row and column counts with `.shape` and your values with `.head` often helps to make sure that the data are consistent with your understanding of them.

### Saving our output

It's a good idea to save your data if you have done any major processing on it. Let's save our Alameda County subset of the ACS5 data to a CSV file.

In [None]:
# Save processed data to a csv file - give it a name that is meaningful
acs5data_df_ac.to_csv('../outdata/acs5data_2018_AC.csv')
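Two details are worth knowing when you read a saved CSV back in. By default `to_csv` also writes the row index as an unnamed first column, and a CSV stores no type information, so the string `dtype` has to be requested again. The sketch below shows one way to handle both, using a toy dataframe and a temporary folder (the file name `toy_acs.csv` is made up).

```python
import os
import tempfile
import pandas as pd

# Toy data with a non-default index, like the one left behind by filtering
toy_df = pd.DataFrame(
    {'FIPS_11_digit': ['06001400100', '06001400200'], 'c_race': [3000, 2000]},
    index=[10, 11],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_acs.csv')

    # index=False keeps the row labels out of the file
    toy_df.to_csv(out_path, index=False)

    # Ask for strings again on read to keep the leading zeros
    reloaded_df = pd.read_csv(out_path, dtype={'FIPS_11_digit': str})

print(list(reloaded_df.columns))
print(reloaded_df['FIPS_11_digit'].tolist())
```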

Confirm that the file was saved with a [shell command](https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html#Shell-Commands-in-IPython).  Shell commands are prefaced by a `!` and allow you to access the file system and run commands like you would from a terminal window. (The command may differ if you are on a Windows computer.)

In [None]:
!ls ../outdata
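If `ls` is not available on your system, one operating-system-independent alternative is to list the folder from Python with the standard library's `pathlib` module. This is a sketch that assumes the same `../outdata` folder used above and simply returns an empty list if the folder does not exist.

```python
from pathlib import Path

out_dir = Path('../outdata')

# List file names if the folder exists; otherwise fall back to an empty list
file_names = sorted(p.name for p in out_dir.iterdir()) if out_dir.is_dir() else []
print(file_names)
```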

#### Exercise

Now do this for the SF ACS data:
1. Find the FIPS code for [SF county](https://en.wikipedia.org/wiki/List_of_counties_in_California)
2. Subset the ACS data to keep only rows for SF county in 2018 and assign to `acs5data_df_sf`
3. Save out ACS data as `acs5data_2018_SF.csv`




In [None]:
# Your code here


*Click here for solution*

<!--- 
    # SOLUTION
    # 1 & 2 Subset ACS data for SF
    acs5data_df_sf = acs5data_df[(acs5data_df['county_fips']==75) & (acs5data_df.year==2018)]

    # SOLUTION
    acs5data_df_sf.head()

    # SOLUTION
    # 3. Save out ACS data as 'acs5data_2018_SF.csv'
    acs5data_df_sf.to_csv('../outdata/acs5data_2018_SF.csv')
--->

<a id="section8"></a>
## 1.8 Recap
This lesson provided a broad overview of using [GeoPandas](http://geopandas.org/) to work with geospatial data in Python. 

Below is a quick recap of the GeoPandas capabilities and geospatial concepts we covered:

- Reading and writing spatial data to/from GeoPandas (gpd) GeoDataFrames (gdf), with a focus on ESRI Shapefiles and GeoJSON files.
    - `gpd.read_file()`
    - `gdf.to_file()`
- Plotting a GeoDataFrame
    - `gdf.plot()`
- Spatially subsetting a GeoDataFrame
    - `gdf.cx[xmin:xmax, ymin:ymax]`
- Using attribute joins to merge GeoPandas GeoDataFrames with pandas DataFrames (df)
    - `gdf.merge(df)`
- Choropleth mapping
    - `.plot(column='<column_name>')`
- Adding columns to a GeoDataFrame to transform counts to densities
    - `tracts_acs_gdf_ac['pop_dens_km2'] = tracts_acs_gdf_ac['c_race']/ (tracts_acs_gdf_ac['ALAND']/SQMETER_PER_SQKM)`

- Getting, setting (defining), and transforming (projecting) a CRS using `EPSG` codes
    - `.crs`
    - `.to_crs()`
- Spatial measurements: accessing the spatial attributes of GeoDataFrame geometries
    - `.area`
    - `.length`
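The attribute join and the density calculation from the recap can be sketched with plain pandas. The two small dataframes below are made-up stand-ins for the tracts GeoDataFrame and the ACS table, and the constant values are assumptions based on standard unit conversions (`ALAND` is land area in square meters).

```python
import pandas as pd

SQMETER_PER_SQKM = 1000 ** 2          # 1 km = 1000 m
SQMETER_PER_SQMILE = 1609.344 ** 2    # 1 international mile = 1609.344 m

# Plain DataFrames standing in for the tracts GeoDataFrame and the ACS table
tracts_df = pd.DataFrame({
    'GEOID': ['06001400100', '06001400200'],
    'ALAND': [2_000_000, 500_000],    # land area in square meters (illustrative)
})
acs_df = pd.DataFrame({
    'FIPS_11_digit': ['06001400100', '06001400200'],
    'c_race': [3000, 2000],           # total population (illustrative)
})

# Attribute join on the tract identifier, which is a string in both tables
joined_df = tracts_df.merge(acs_df, left_on='GEOID', right_on='FIPS_11_digit', how='inner')

# Counts to densities: people per square kilometer and per square mile
joined_df['pop_dens_km2'] = joined_df['c_race'] / (joined_df['ALAND'] / SQMETER_PER_SQKM)
joined_df['pop_dens_mi2'] = joined_df['c_race'] / (joined_df['ALAND'] / SQMETER_PER_SQMILE)
print(joined_df[['GEOID', 'pop_dens_km2', 'pop_dens_mi2']])
```

Calling `merge` on a GeoDataFrame takes the same arguments and returns a GeoDataFrame, which is why the GeoDataFrame goes on the left side of the join in the lesson.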

<a id="section9"></a>
## 1.9 Homework

#### Exercise 1
1. Compare the values in the `GEOID` column of the tracts gdf and `FIPS_11_digit` in the ACS dataframe.
2. Join the two datasets and name the output geodataframe `tracts_acs_gdf_sf`
3. Check your output data - type, columns, shape, data values, etc. 

In [None]:
# Your code here

*Click here for answers*

<!--- 
    # SOLUTION
    # 1.a - look at census tract identifiers in the tract data
    tracts_gdf_sf['GEOID']

    # SOLUTION
    # 1.b - look at census tract identifiers in the ACS data
    acs5data_df_sf['FIPS_11_digit'].head()

    # SOLUTION
    # 2. Join the two datasets and name the output tracts_acs_gdf_sf
    tracts_acs_gdf_sf = tracts_gdf_sf.merge(acs5data_df_sf, left_on='GEOID',right_on="FIPS_11_digit", how='inner')
    tracts_acs_gdf_sf.head(2)

    # SOLUTION
    # 3. Check your output data
    print(tracts_gdf_sf.shape)
    print(tracts_acs_gdf_sf.shape)
--->

#### Exercise 2

Plot population density for SF county. Here are the steps you'll need to take:
1. Create a population density per km2 variable and add it to the data frame
2. Repeat but for population density per mile2
3. Create choropleth maps for both variables

In [None]:
# Your code here

*Click here for answers*

<!--- 
    # SOLUTION
    # 1. Create a population density per km2 variable and add it to the data frame
    tracts_acs_gdf_sf['pop_dens_km2'] = tracts_acs_gdf_sf['c_race']/ (tracts_acs_gdf_sf['ALAND']/SQMETER_PER_SQKM)

    # SOLUTION
    # 2. Repeat but for population density per mile2
    tracts_acs_gdf_sf['pop_dens_mi2'] = tracts_acs_gdf_sf['c_race']/ (tracts_acs_gdf_sf['ALAND']/SQMETER_PER_SQMILE)

    # SOLUTION
    # 3. Plot population density - km^2
    fig, ax = plt.subplots(figsize = (10,10)) 
    tracts_acs_gdf_sf.plot(column='pop_dens_km2', legend=True,
                        legend_kwds={'label': "Population per Sq KM",
                                     'orientation': "horizontal"},
                        ax=ax)
    plt.show()
--->

#### Exercise 3

Do you remember how to read in data from a file to a geodataframe? Test that below by completing the code.

In [None]:
# read in Alameda county Geojson file to a geodataframe
ac_tracts_from_geojson = ...

# Uncomment line below and plot
#ac_tracts_from_geojson.plot(column='pop_dens_mi2')

In [None]:
# read in Alameda county GeoPackage file to a geodataframe
ac_tracts_from_gpkg = ...
# Uncomment line below and plot
#ac_tracts_from_gpkg.plot(column='pop_dens_mi2')

*Click here for answers*

<!--- 
    # SOLUTION
    # read in Alameda county Geojson file to a geodataframe
    ac_tracts_from_geojson = gpd.read_file("../outdata/tracts_acs_gdf_ac.json")
    ac_tracts_from_geojson.plot(column='pop_dens_mi2')

    # SOLUTION
    # read in Alameda county GeoPackage file to a geodataframe
    # (assumes the file was saved as tracts_acs_gdf_ac.gpkg earlier in the lesson)
    ac_tracts_from_gpkg = gpd.read_file("../outdata/tracts_acs_gdf_ac.gpkg", driver="GPKG")
    ac_tracts_from_gpkg.plot(column='pop_dens_mi2')
--->

#### Exercise 4
1. Check the CRS of the geodataframe `tracts_acs_gdf_sf` 
2. Transform the CRS of `tracts_acs_gdf_sf` to UTM Zone 10N, NAD83 and call it `tracts_acs_sf_utm10`
3. Display and compare your two CRS definitions.
4. Use plot to make a map of the data in both CRSs
5. Calculate the area of SF using the `.area` geodataframe attribute and the `ALAND` column

In [None]:
# Your code here

*Click here for answers*

<!--- 
# 1. Check the CRS 
tracts_acs_gdf_sf.crs
# 2. Transform the CRS of your SF tracts ACS data 
tracts_acs_sf_utm10 = tracts_acs_gdf_sf.to_crs('epsg:26910')
# 3. Display the CRS definitions
tracts_acs_gdf_sf.crs
tracts_acs_sf_utm10.crs

# 4. Plot and compare your two CRSs

# plot geographic gdf
tracts_acs_gdf_sf.plot();
# plot utm gdf
tracts_acs_sf_utm10.plot();

# 5. Calculate the area of SF using the 2 above methods
tracts_acs_sf_utm10.area.sum() / SQMETER_PER_SQKM
tracts_acs_sf_utm10.ALAND.sum() / SQMETER_PER_SQKM
--->

<a id="section10"></a>
## 1.10 References

- [Kaggle Learn: Geospatial Analysis in Python](https://www.kaggle.com/learn/geospatial-analysis), an online interactive tutorial

- [Campbell & Shin, Geographic Information System Basics, v1.0](https://2012books.lardbucket.org/books/geographic-information-system-basics/index.html)

- [Intro to Python GIS: Map Projections and Coordinate Reference Systems](https://automating-gis-processes.github.io/CSC/notebooks/L2/projections.html)

- [ESRI Coordinate systems, map projections, and geographic (datum) transformations](http://resources.esri.com/help/9.3/arcgisengine/dotnet/89b720a5-7339-44b0-8b58-0f5bf2843393.htm)

#### Installing GeoPandas on Your Computer

To install GeoPandas on your own computer, see the instructions in this file [s0_0_Geopandas_Installation.md](https://github.com/dataforhousing/curriculum_dev/blob/master/code/s0_0_Geopandas_Installation.md) or on the [GeoPandas.org](https://geopandas.org/install.html) website.

The geospatial functionality of GeoPandas is provided by several lower level spatial data packages that are included in GeoPandas and which you may have used previously. These include:
- [shapely](https://pypi.python.org/pypi/Shapely) - for geometry processing
- [fiona](https://pypi.python.org/pypi/Fiona) - for spatial data file IO
- [GDAL/Ogr](https://gdal.org) - for spatial data file IO
- [pyproj](https://github.com/jswhit/pyproj) - for map projections and coordinate systems
- [PROJ.4](https://proj.org) - for map projections and coordinate systems
- [geopy](https://geopy.readthedocs.io/en/stable/) - for geocoding and for geodesic distance calculations
- [pysal](https://pysal.org/) - for spatial analysis functions such as data classification methods and spatial autocorrelation
- [descartes](https://bitbucket.org/sgillies/descartes/src/default/) - for plotting Shapely geometric objects with Matplotlib

These packages may be installed as dependencies when you install GeoPandas, or you may need to install them directly. We list the packages above for reference only, in case you have questions about what is being installed on your system or need help getting GeoPandas to run.


## Congrats, you're done with GeoPandas part 1!
<br>


---
<div style="display:inline-block;vertical-align:middle;">
<a href="https://dlab.berkeley.edu" target="_blank"><img src ="" width="75" align="left">
</a>
</div>

<div style="display:inline-block;vertical-align:middle;">
    <div style="font-size:larger">&nbsp;Data Science for Housing Workshop, University of California Berkeley</div>
    <div>&nbsp;Tim Thomas, Patty Frontiera, Emmanuel Lopez, Ethan Ebinger, Hikari Murayama, Karen Chapple, Claudia von Vacano</div>
    <div>&copy; UC Regents, 2019-2020</div>
</div>