# Geospatial Data in Python with GeoPandas 

A [D-Lab](https://dlab.berkeley.edu) Workshop, Fall 2019 

---



Outline for day 1:

0. Setup
1. Geospatial Data: Types and importing
2. GeoDataFrame 
3. GeoSeries
4. Geometries (Points, Linestrings, Polygons)
5. Mapping 
6. Subsetting
7. Mapping: Overlays
8. CRS

Outline for day 2:

8. CRS
9. Spatial Measurements
10. Spatial Relationship Queries
11. Combining Data: Attributes + Spatial Joins
12. Data-driven Mapping


# Introduction



The goal of this notebook is to give you a **tip-of-the-iceberg introduction** to working with geospatial data in Python using the **geopandas** package.  Most of the sample data and use cases are related to a UC Berkeley research project that Patty Frontiera, D-Lab's Data Services Lead, has been working on called [The Louisiana Slave Conspiracies](https://dlab.berkeley.edu/landing-page/louisiana-slave-conspiracies). This project explores several slave conspiracies that occurred in colonial Louisiana during the late 1700s and early 1800s. Since very little data exist for this time period, we begin with an exploration of US Census data from the early 1800s, shortly after the Louisiana Purchase made the Louisiana and Orleans Territories part of the United States.

> ### Assumptions

> This tutorial assumes you have basic working knowledge of Python and of geospatial data.   If you need a geospatial refresher, we can start with this **very** [Brief Introduction to Geospatial Data](https://docs.google.com/presentation/d/1d9GNcLDsnLxfLmrNRNZE976sHN5qNfkU9Rl2gabUsWc/edit?usp=sharing).

 
 \
## GeoPandas and related Geospatial Packages

[GeoPandas](http://geopandas.org/) is a relatively new package that makes it easier to work with geospatial data in Python. In the last few years it has grown more powerful and stable. This is great news because working with geospatial data in Python used to be quite complex.  GeoPandas is now the go-to package for working with geospatial data.

`GeoPandas` provides convenient, unified access to the functionality of the [pandas](https://pandas.pydata.org/) package, extending it with the geospatial processing capabilities provided by a number of lower-level spatial data packages, including [shapely](https://pypi.python.org/pypi/Shapely) for geometry processing, [fiona](https://pypi.python.org/pypi/Fiona) and [GDAL/OGR](https://gdal.org) for spatial data file IO, and [pyproj](https://github.com/jswhit/pyproj) and [PROJ.4](https://github.com/OSGeo/proj.4/wiki) for map projections and coordinate systems.


We will also use a few other optional geospatial libraries that are  commonly used with geopandas, including:

- **rtree** for spatial indexing to improve performance
- **geopy** for geocoding and for geodesic distance calculations
- **pysal** for spatial analysis functions such as data classification methods.
- **descartes** for plotting Shapely geometric objects with Matplotlib.


Finally, we will use a number of standard Python libraries including pandas, numpy, and matplotlib.


# **0.- Setup**



Installing Geopandas can be a bit complex due to the libraries that it depends on.  See the [Geopandas documentation](http://geopandas.org/install.html) for help with this process. Read it carefully, as that will save you many headaches!

We will use the [Google Colaboratory](https://colab.research.google.com/notebooks/welcome.ipynb) Jupyter notebook environment for this workshop so that we will all have the same working environment.

\
## About Google Colab

Google Colab is a freemium (*i.e., extra stuff costs $$*) Jupyter notebook environment that requires no setup and runs entirely in the cloud.

- A google account is required!

From the browser you can write and execute Python code and save and share your notebooks.

You can also install libraries that are not readily available and import local or remote data.

- However, the libraries you install and data you import are only available to you temporarily in the Colab environment.

\
### Why we like Colab

- It's free for our needs

- It's fast

- It removes a lot of local package install problems so we can get right to work.

- It ensures that all workshop participants have the same computing environment.

### Learning more

To learn more go to the [Welcome to Google Colab](https://colab.research.google.com/notebooks/welcome.ipynb) site.

\

## An alternative to Colab for the Berkeley community
- https://datahub.berkeley.edu
- You sign in with bCourses (credentials)
- Learn more at https://data.berkeley.edu/academics/resources/berkeley-data-stack

\

# Getting Started

- Log in to **Google Colaboratory** at <https://colab.research.google.com/notebooks/welcome.ipynb>

- From the **File** menu select **Open Notebook**

- Click on the **GitHub** tab

- Insert the URL to this GitHub repo: https://github.com/dlab-berkeley/Geospatial-Fundamentals-in-Python and click "SEARCH"

- Then, open the notebook **Geopandas_Intro_F2019_GC_workshop_11_12_2019.ipynb**

*If you are warned that this is not a Google notebook, select "Run anyway".*


\

## Install Geopandas and dependencies

Google Colaboratory comes with a Jupyter notebook environment with the most common Python packages already installed. To install a library that's not available by default, you can use **!pip install** or **!apt-get install**.

* You can execute system commands within a Jupyter notebook by prepending the command with an exclamation mark (also called a bang).

<br>

To run Geopandas in Google Colab, execute but do not change the code in the following cell. (*The install process we will follow is from [this notebook](https://colab.research.google.com/drive/1tSmJmjD3sTI31Cg1UCIKiE10dBUmWUG7#scrollTo=wHnmdr_QkKec&forceEdit=true&offline=true&sandboxMode=true)*).

>**IMPORTANT** - if you are installing these Python packages on your local computer, see the [Geopandas documentation](http://geopandas.org/install.html). Do not use the code below, as it is for the Google Colaboratory environment.

If you have a geopandas environment installed locally, you can get the data and notebook for this tutorial from this GitHub repository: https://github.com/dlab-berkeley/Geospatial-Fundamentals-in-Python



In [0]:
%%time
#######################################################
# Code to install geopandas in Google Colaboratory
# You need to run this code each time you run this 
# notebook on Google Colab
# Should take about 2 - 8 minutes.
# Note: the %%time cell magic must be the first line of the cell.
#######################################################
!apt update
# The -y flag answers "yes" to apt's confirmation prompts
!apt upgrade -y
!apt install -y gdal-bin python-gdal python3-gdal
# Install rtree - Geopandas requirement
!apt install -y python3-rtree
# Install pysal
!pip install pysal
# Install mapclassify
!pip install mapclassify
# Install Geopandas (GitHub no longer serves the git:// protocol, so use https)
!pip install git+https://github.com/geopandas/geopandas.git
# Install descartes - Geopandas requirement
!pip install descartes

## Import GeoPandas and Related Libraries

Next, import the libraries that we will use.


In [0]:
import pandas as pd
import geopandas as gpd
import mapclassify
import matplotlib.pyplot as plt
from shapely.geometry import Point, Polygon, LineString

## While things are being installed
\
Check out the https://dlab.berkeley.edu webpage for more info on
\
consulting, working groups, workshops, etc

\
Also, don't forget to **provide feedback** to improve our services

\

---

# **1.- Geospatial Data**

**LOCATION** --> Where
\
**ATTRIBUTES** --> What
\
**METADATA** --> When/Who/How

Encode location geometrically with coordinates

<img src="https://ucd3ed58a4a615469784742cf2ab.dl.dropboxusercontent.com/cd/0/inline/AsMIbQ0k28KFmyNfXcLTy-KvaAkTtFm6v5Ev6QdswOJS_kn__AiACDz2PsTk90UqSwmLZxf8I9CWaZskKMpGjny_Yg5UidjYUGF3t38yPwjrpifjqNwY5mxAKeBhZu8FXa0/file#" width="800px"></img>

\

\
Geospatial information is represented using two types of data models: vector and raster.

*   Vector data represents geographic information as points, lines or polygons.
*   Raster data represents geographic information as a continuous surface of grid cells.


\
<img src="https://uce1cd08627386c74cef5dd668ab.dl.dropboxusercontent.com/cd/0/inline/AsN6iS3Vs-xjLgWxPtR4g6HNFlZw22nyUiKj6jXHNtw3KIY6sM2hI59kS9b4k-KIpPLxOu-NStBaSiLywJG5-NRB8ATMDsLsQ7oVSyYza6AX3QYDqABOFqRQEwbKGaTPUBo/file#" width="1000px"></img>


Take a look at the map below. Can you identify the types of geospatial data that are shown?
<p>&nbsp;</p>
<p>
  <b>Geospatial Data for the City of Berkeley</b>
<img src="https://raw.githubusercontent.com/dlab-berkeley/Geospatial-Fundamentals-in-Python/master/data/geospatial_data_berkeley.png" width="800px"></img>
</p>

<br>



## GeoPandas
GeoPandas provides support for working with vector spatial data. If you are interested in working with raster data in Python, check out the rasterio package at <https://rasterio.readthedocs.io/en/stable>. This workshop only covers vector data.


## About the Data 

This tutorial uses historical census data for the USA and the Orleans Territory, most of which is now called Louisiana, that were obtained from the `NHGIS`, or *National Historical Geographic Information System* website ([IPUMS NHGIS, University of Minnesota, www.nhgis.org](https://www.nhgis.org)).  A cartographic boundary file for the United States was obtained from the [US Census website](https://www.census.gov/geo/maps-data/data/tiger-cart-boundary.html).


## Fetch the Data with `wget`

The data and related notebooks for this tutorial are in this github repository: https://github.com/dlab-geo/geopandas_intro

In the Google Colaboratory environment you can use the command **wget** to fetch the data from that repo and use it for the duration of this session.


In [0]:
!wget 'https://raw.githubusercontent.com/dlab-geo/geopandas_intro/master/data/us_states.zip'
!wget 'https://raw.githubusercontent.com/dlab-geo/geopandas_intro/master/data/uscounties_1810.zip'
!wget 'https://raw.githubusercontent.com/dlab-geo/geopandas_intro/master/data/orleans_census_data1810.csv'
!wget 'https://raw.githubusercontent.com/dlab-geo/geopandas_intro/master/data/lsc_points.csv'
!wget 'https://raw.githubusercontent.com/dlab-geo/geopandas_intro/master/data/orleans_places.csv'


### Take a look at the data files

Make sure that all of the data has been transferred. You can list the files using the **ls** system command (on macOS and Linux, including Colab) or the **dir** command (on Windows).

* You can execute system commands within a Jupyter notebook by prepending the command with an exclamation mark (also called a bang).

In [0]:
!ls


Some of the files we just fetched are zipped. Let's take a look at those:

### Two ways of unzipping files


#### 1 - Combining python commands with system commands

In [0]:
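# Capture the output of the ls system command in a Python variable
myfiles = !ls *.zip

# ls may print several names on one line or one name per line,
# so join all output lines before splitting into a list of file names
myfiles = " ".join(myfiles).split()

print(myfiles)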

Now, unzip the zipped files.

In [0]:
for f in myfiles:
  print("Unzipping: ", f)
  !unzip {f}

Take another look at our files.

In [0]:
!ls
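
If you are working outside Colab, or would rather avoid shell commands, the same unzipping step can be sketched with the Python standard library alone. This is a minimal sketch, not part of the original workshop: it builds a small throwaway archive (`demo_archive.zip`, a made-up name) in a temporary folder so that it runs anywhere, then finds and extracts every `.zip` file in that folder.

```python
import tempfile
import zipfile
from pathlib import Path

# Work in a temporary folder so the sketch does not touch the workshop data.
workdir = Path(tempfile.mkdtemp())

# Build a small stand-in archive (hypothetical file names, not workshop files).
demo_zip = workdir / "demo_archive.zip"
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("demo.txt", "hello")

# Find every zip file in the folder and extract it next to the archive.
extracted = []
for zip_path in sorted(workdir.glob("*.zip")):
    print("Unzipping:", zip_path.name)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(workdir)
        extracted.extend(zf.namelist())

print(sorted(p.name for p in workdir.iterdir()))
```

To try this on the workshop files, point `workdir` at the folder that holds the downloaded archives, for example `Path(".")`, and skip the step that builds the stand-in archive.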

#### 2 - Directly via geopandas (more on this below)

## Spatial Data File Formats

There are many different types of [vector geospatial data file formats](https://en.wikipedia.org/wiki/GIS_file_formats#Vector). You may have heard of Shapefiles, GeoJSON, KML, Spatialite files and others.

Of all of the available formats the most commonly used ones are the [ESRI Shapefile](https://en.wikipedia.org/wiki/Shapefile) and the **CSV** file.

Let's start with a brief discussion of the ESRI Shapefile.

* **ESRI Shapefile**: a collection of 3 to 15 files that collectively make up the Shapefile.
    * `.shp` - the spatial data encoded geometrically as points, lines or polygons
    * `.shx` - the spatial data index
    * `.dbf` - the attribute table that describes each feature
    * `.prj` - a text file that identifies the coordinate reference system (CRS) for the data
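
Because a Shapefile is really several files, a common cause of read errors is a missing component. The sketch below is one possible way to check which of the files listed above sit next to a `.shp` file. The helper names `shapefile_parts` and `missing_required` are made up for this example, and treating `.shp`, `.shx` and `.dbf` as the required minimum follows the "3 to 15 files" description above.

```python
from pathlib import Path

# Extensions named in the list above; the first three are the required minimum.
REQUIRED = (".shp", ".shx", ".dbf")
OPTIONAL = (".prj",)


def shapefile_parts(shp_path):
    """Report which component files exist alongside a .shp file."""
    shp = Path(shp_path)
    # with_suffix swaps only the final extension, e.g. counties.shp -> counties.dbf
    return {ext: shp.with_suffix(ext).exists() for ext in REQUIRED + OPTIONAL}


def missing_required(shp_path):
    """Return the required extensions that are not present on disk."""
    parts = shapefile_parts(shp_path)
    return [ext for ext in REQUIRED if not parts[ext]]


print(shapefile_parts("./uscounties_1810.shp"))
print(missing_required("./uscounties_1810.shp"))
```

If `.prj` is reported as missing, the geometries can still be read, but the resulting GeoDataFrame will not know its coordinate reference system.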



<img src="https://uce778cebd73ef35d88344dc080a.dl.dropboxusercontent.com/cd/0/inline/AsOibjlFvTuODXsQU_1ZRR0t2eMbW7rVGRkwM3TwIHguSRnYnDg4FyT0-tlqQ_FvBbqwB-z0Ai7QO7x98z2xh3mg8342GPHfuNn283yYlWwn2t8Zbwh3lA1_hS6OTMcrnqg/file#" width="1000px"></img>

## Reading in Spatial Data from a Shapefile

GeoPandas makes it easy to read in almost any kind of vector data file with the [read_file](http://geopandas.org/io.html) command. Let's use it to read the **uscounties_1810** shapefile into a GeoDataFrame named `usa1810`.

In [0]:
usa1810 = gpd.read_file("./uscounties_1810.shp")  #US counties in 1810

Take a look at the first rows of data with the `head` method.

In [0]:
usa1810.head()

GeoPandas can also read in a zipped shapefile. This can be quite convenient.

* Note, the syntax for reading in a zipped file is slightly different.

In [0]:
# Read in the zipped shapefile

usa1810 = gpd.read_file("zip://./uscounties_1810.zip")  #US counties in 1810

# Take a look at the GeoDataFrame
usa1810.head()

*However, sometimes GeoPandas cannot read a zipped shapefile due to its content or the way it was created. If this is the case, unzip it and read it in directly.*


# **2.- GeoDataFrame**

The `gpd.read_file` command returns a GeoPandas **GeoDataFrame** object.  We can double-check this with the `type` function.



In [0]:
type(usa1810)


The `GeoDataFrame` is a **pandas** DataFrame with extra geospatial capabilities. So if you know `pandas` then working with GeoPandas will be much, MUCH easier. 

Let's take a look at the GeoDataFrame again using the **head** method.

- *Do you notice anything different about the GeoDataFrame compared to a regular DataFrame?*

In [0]:
usa1810.sort_values(by="STATENAM").head(20)

Because a GeoDataFrame is a pandas DataFrame you can use all the pandas DataFrame methods with it.  Some examples are shown below.


In [0]:
# How many states or territories did the USA have in 1810?

usa1810.STATENAM.nunique()  

In [0]:
# What states had the most counties in 1810?
usa1810.STATENAM.value_counts()

**Suggestion**: If you don't know pandas or want to refresh your knowledge of it we recommend you take an online tutorial or D-Lab workshop to get familiar with its methods for data manipulation and analysis.  That will make it easier for you to get the most out of GeoPandas.

### Rename columns

The columns that contain the county and state names are labeled `NHGISNAM` and `STATENAM`. Use the pandas `rename` method to rename the county and state name columns. This will make our work with the data more intuitive moving forward.

In [0]:
usa1810.rename(columns={'NHGISNAM' : 'COUNTY', 'STATENAM': 'STATE'}, inplace=True)
usa1810.head()

## CSV Files as Geospatial Data File Format


A **CSV** is a text file with a `.csv` file extension that contains rows of comma separated values where, typically, the first row has the column names.


For example, take a look at the file `lsc_points.csv` which contains the names and locations of Louisiana Slave Conspiracies:

In [0]:
!cat lsc_points.csv

Creating a GeoDataFrame from a CSV file is a two-step process:

1. Read the csv file into a Pandas DataFrame.

2. Convert the DataFrame to a GeoDataFrame.

We show these steps below.

In [0]:
# Read the csv file into a Pandas DataFrame.
lsc_df = pd.read_csv("./lsc_points.csv")
 
# Take a look at the data
lsc_df

Once we identify the columns in the dataframe that contain the geometry, here longitude and latitude, we can create a GeoDataFrame as follows.

In [0]:
#Convert the DataFrame to a GeoDataFrame.
lsc_locs = gpd.GeoDataFrame(
    lsc_df, geometry=gpd.points_from_xy(lsc_df.longitude, lsc_df.latitude))

# Take a look
lsc_locs.head()
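
The same two steps can be tried without any data files. The sketch below uses a few made-up rows held in a string instead of a CSV file on disk; the place names and coordinates are invented for illustration. Note the argument order in `points_from_xy`: x is the longitude and y is the latitude.

```python
import io

import geopandas as gpd
import pandas as pd

# Made-up sample rows standing in for a real CSV file.
csv_text = """name,latitude,longitude
Place A,30.0,-91.0
Place B,29.5,-90.25
"""

# Step 1: read the CSV into a plain pandas DataFrame.
places_df = pd.read_csv(io.StringIO(csv_text))

# Step 2: build point geometries, x (longitude) first and y (latitude) second,
# and attach them as the geometry column of a new GeoDataFrame.
points = gpd.points_from_xy(places_df["longitude"], places_df["latitude"])
places_gdf = gpd.GeoDataFrame(places_df, geometry=points)

print(type(places_df).__name__, "->", type(places_gdf).__name__)
print(places_gdf)
```

Swapping the two arguments is an easy mistake to make, and it silently places the points in the wrong part of the world, so it is worth plotting the result as a check.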

### Challenge

Read in the CSV file **orleans_places.csv** and create a GeoDataFrame from it called **orleans_places**.

Then take a look at the GeoDataFrame.

In [0]:
## Your code here to read the csv file into a Pandas dataframe


In [0]:
## Your code to create a gdf from the df and view it
#Convert the DataFrame to a GeoDataFrame.


# Take a look


### Challenge - Solution

In [0]:
# Read the csv file into a Pandas df.
orleans_places_df = pd.read_csv("./orleans_places.csv")

#take a look
orleans_places_df.head()

In [0]:
#Convert the df to a gdf.
orleans_places = gpd.GeoDataFrame(
    orleans_places_df, geometry=gpd.points_from_xy(orleans_places_df.longitude, orleans_places_df.latitude))

# Take a look
orleans_places.head()

# GeoDataFrame Deep Dive

It's a good idea to get familiar with the GeoDataFrame structure and components. This will help you understand the different geospatial analysis methods that GeoPandas provides and to troubleshoot when you get stuck.

## The GeoDataFrame Geometry Column

All GeoPandas GeoDataFrames must have one *special* geometry column that contains the spatial data. 

This column is named **geometry** by default, but it could be something else. 

When you read in a spatial data file to create a new GeoDataFrame the `geometry` column is automatically created. 
 
 
 You can always get the name of your special geometry column:

In [0]:
usa1810.geometry.name

In [0]:
lsc_locs.geometry.name
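
To see a geometry column that is not named `geometry`, here is a small sketch with a made-up table whose points live in a column called `site` (the column name and values are hypothetical). Passing the column name to the `geometry` argument tells GeoPandas which column is the special one.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Hypothetical table whose geometries live in a column called "site".
df = pd.DataFrame({
    "name": ["A", "B"],
    "site": [Point(-91.0, 30.0), Point(-90.25, 29.5)],
})

# Tell GeoPandas which column is the special geometry column.
sites = gpd.GeoDataFrame(df, geometry="site")

# The .geometry accessor always points at the active geometry column,
# whatever that column is called.
print(sites.geometry.name)
print(type(sites["site"]).__name__)
```

This is why code that uses `gdf.geometry` is more portable than code that uses `gdf["geometry"]`.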

# **3.- GeoSeries**


The geometry column is of type **GeoSeries**, taking its name and its base functionality from the pandas **Series** object.





In [0]:
type(usa1810.geometry)

Not all columns in the GeoDataFrame are of type GeoSeries. What is the type of the COUNTY column?

In [0]:
# Your code here

A **GeoDataFrame** is a tabular data structure comprised of GeoSeries and Series objects - these are the columns in the table.

The data within each column also has a data type. You can check the type of data within the GeoSeries and Series columns using the **dtypes** attribute.



In [0]:
usa1810.dtypes

The `dtypes` attribute shows that the data in the geometry column are of type `geometry`. GeoPandas extends pandas by adding this data type.

* Note, pandas labels character string data as an "object".
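
The distinction between the type of a column and the type of the data inside it can be checked on a toy GeoDataFrame. The sketch below uses made-up values; depending on your pandas version, the text column may be reported as `object` or as a dedicated string type.

```python
import geopandas as gpd
from shapely.geometry import Point

# A made-up GeoDataFrame with one text, one numeric and one geometry column.
toy = gpd.GeoDataFrame(
    {"county": ["A", "B"], "population": [1200, 3400]},
    geometry=[Point(0, 0), Point(1, 1)],
)

# The type of the data inside each column.
print(toy.dtypes)

# The type of the column objects themselves.
print(type(toy["county"]).__name__)   # an ordinary pandas column
print(type(toy.geometry).__name__)    # the special geometry column
```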

 



## Any Questions?

# **4.- Geometries: Points, Linestrings, and Polygons**

The GeoDataFrame is a pandas DataFrame that contains a special geometry column. 

That geometry column itself is of type GeoSeries and it contains data of type geometry.

GeoPandas supports three basic types of vector geometries:
- **Points / MultiPoints**
    - POINT (-122 38)

    - MULTIPOINT ((-122 38), (-123 39))

- **LineStrings / MultiLineStrings**
    - LINESTRING (30 10, 10 30, 40 40)

    - MULTILINESTRING ((10 10, 20 20, 10 40), (40 40, 30 30, 40 20, 30 10))
    
- **Polygons / MultiPolygons**
    - POLYGON ((35 10, 45 45, 15 40, 10 20, 35 10), (20 30, 35 35, 30 20, 20 30))
    - MULTIPOLYGON (((30 20, 45 40, 10 40, 30 20)), ((15 5, 40 10, 10 20, 5 10, 15 5)))

 

 
**Notes**

* These geometries are displayed above and in the GeoDataFrame in what is called **Well-Known Text** format.

* A GeoSeries can contain mixed geometry types. But that is not always a great idea.
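
Well-Known Text is also something Shapely can read and write directly, which is handy for checking what a WKT string means. The following is a small sketch using Shapely's `wkt` module with example coordinates like those in the list above.

```python
from shapely import wkt

# Parse WKT strings into Shapely geometry objects.
point = wkt.loads("POINT (-122 38)")
line = wkt.loads("LINESTRING (30 10, 10 30, 40 40)")
# A polygon with one outer ring and one inner ring (a hole).
polygon = wkt.loads(
    "POLYGON ((35 10, 45 45, 15 40, 10 20, 35 10), "
    "(20 30, 35 35, 30 20, 20 30))"
)

# Each object knows its geometry type and can write itself back out as WKT.
for geom in (point, line, polygon):
    print(geom.geom_type, "->", geom.wkt)

print("Holes in polygon:", len(polygon.interiors))
```

Note that each ring of a polygon repeats its first coordinate at the end to close the shape.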



<img src="https://ucad11083bdd7ea2e7023d2559aa.dl.dropboxusercontent.com/cd/0/inline/AsNQWnwp-9gweXwcltxIRazE6AX29CP-I-Ip6fm9beT2LU7cZIta1IwXfDUiS3ExVIFbU68dm0_F71puELBvdueRUiRNtciawjwfEPZ3Bt3lXy6iNB3w8iKrnqOutDju0zo/file#" width="400px"></img>





Let's check the specific geometry type(s) in our GeoDataFrame.

In [0]:
set(usa1810.geom_type)  # set returns unique values

### Question
Why would this dataframe of **counties** contain both Polygon and MultiPolygon geometries?

### Question

What specific geometry types are in the orleans_places gdf?

In [0]:
# Your answer here...
set(orleans_places.geom_type)

### Question
What if your dataframe is lacking some data that can be added manually? 
Answer: Create/input your data via Shapely

In [0]:
point_demo = Point(1,1)
print(point_demo)

In [0]:
polygon_demo = Polygon([(1, 1), (2,2), (2, 1)])
print(polygon_demo)

## **Progression: GeoDataFrame, GeoSeries, Geometry**


Let's take a look at a GeoDataFrame a bit more closely.

First, let's subset the `usa1810` GeoDataFrame to select only the rows for the state of New York.

In [0]:
ny_gdf = usa1810[usa1810['STATE']=='New York']

print("The ny_gdf object is of type: ", type(ny_gdf), "\n")

ny_gdf.head()

Now let's create a GeoSeries from the geometry column in the GeoDataFrame.

In [0]:
ny_gs = ny_gdf.geometry

print("The ny_gs object is of type: ", type(ny_gs), "\n")

ny_gs.head()

Finally, let's get the geometry value itself.  

To extract a single value from a Series or GeoSeries you use its row index. 

In [0]:
# Get the geometry for NY County

#First get the index for the County
NYC_index = ny_gdf[ny_gdf.COUNTY=='New York'].index.values[0]
print(NYC_index)

# Fetch the geometry
ny_geom = ny_gs[NYC_index]

print("The ny_geom object is of type: ", type(ny_geom), "\n")

ny_geom


As shown above, when you return a single geometry, the object is plotted. To see the data  in  [well-known text](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry) format, or WKT,  use the `print` function.

In [0]:
print(ny_geom)

In [0]:
ny_gdf[ny_gdf.COUNTY=='New York'].index.values[0]
ny_gs[ny_gdf[ny_gdf.COUNTY=='New York'].index.values[0]]

In [0]:
ny_gdf[ny_gdf.COUNTY=='New York'].geometry.squeeze()
#type(ny_gdf[ny_gdf.COUNTY=='New York'].geometry.squeeze())

## GeoPandas Attributes and Methods

GeoPandas extends pandas with spatial attributes and methods that apply to the special `geometry` column.


For example, the code in the following cell returns the **total_bounds** attribute. These are the coordinates for the minimum bounding box that contain all geometries in the `geometry` GeoSeries.

In [0]:
usa1810.geometry.total_bounds

GeoPandas will apply a spatial method to the geometry column even if you do not explicitly reference it.

In [0]:
usa1810.total_bounds

Most GeoPandas geometry methods and attributes apply to **each** geometry in the GeoSeries rather than to **all** of them in the aggregate.

For example, let's use the bounds attribute to see the bounding coordinates of each county in the usa1810 geodataframe.

In [0]:
usa1810.bounds.head()
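
The difference between `bounds` and `total_bounds` is easy to verify on shapes small enough to check by hand. The sketch below uses two made-up rectangles rather than the county data.

```python
import geopandas as gpd
from shapely.geometry import box

# Two made-up rectangles, each defined as box(minx, miny, maxx, maxy).
shapes = gpd.GeoDataFrame(
    {"name": ["left", "right"]},
    geometry=[box(0, 0, 1, 1), box(2, 1, 3, 3)],
)

per_row = shapes.bounds          # DataFrame: one row of bounds per geometry
overall = shapes.total_bounds    # array: a single box around all geometries

print(per_row)
print(overall)
```

The per-row result has the columns `minx`, `miny`, `maxx` and `maxy`, while the overall result is a single array in that same order.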

To see all of the attributes and methods of a GeoDataFrame, enter its name followed by a period and hit the tab key. Try that below.

In [0]:
#usa1810

In the rest of this tutorial we will explore the basic GeoPandas methods for working with GeoDataFrames, GeoSeries and geometries.

## Summary

GeoPandas extends Pandas with attributes and methods for **GeoDataFrames**, **GeoSeries** and **geometry** objects.

These objects have their own methods and the methods take arguments that may also be one of these types of objects.

\
You can use the "dot-tab" command to see what is available for each type of geospatial object. This is a great way to explore the data, when used along with the help page and the GeoPandas online documentation.

\
As you work with GeoPandas and read through the [online documentation](http://geopandas.org) keep in mind which type of object you are working with and what type is required as input to a method or returned by a specific method or attribute.




### Any Questions?

---

# **5.- Mapping GeoDataFrames**


One of the first things to do with geographic data once you read it into GeoPandas is visualize it.

The GeoPandas **plot** method will display the data in a GeoDataFrame or GeoSeries. 

This uses `matplotlib` and the matplotlib `pyplot` module under the hood.

In [0]:
# Plot a GeoDataFrame
usa1810.plot()  # it's really that simple!
plt.show()

We can also plot a subset of the geodataframe.

In [0]:
# Plot all the 1810 counties in New York state
usa1810[usa1810['STATE']=='New York'].plot()

And we can plot a geoseries with plot()

In [0]:
# plot the geometry geoseries
usa1810[usa1810['STATE']=='New York'].geometry.plot()

Pretty cool to be able to make a map with a single command. However, there is always room for improvement. You can find out more about the plotting options for basic maps in the geopandas documentation and in the [matplotlib](https://matplotlib.org/) documentation.

<br>

For now, let's use some options to make a prettier map. Take a minute to consider what each option does.

In [0]:
usa1810.plot(linewidth=0.5, edgecolor='grey', facecolor='pink',  figsize=(10,8) )
plt.show()
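
For reference, here is a sketch of what each of those options controls, using two made-up squares so that it runs without the workshop data. It also shows that `plot` returns a matplotlib axes object, which you can keep working with. The `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import io

import matplotlib
matplotlib.use("Agg")  # draw off-screen; remove this line inside a notebook

import geopandas as gpd
from shapely.geometry import box

# Two made-up adjacent squares.
squares = gpd.GeoDataFrame(
    {"name": ["a", "b"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)

ax = squares.plot(
    linewidth=0.5,      # thickness of the outlines
    edgecolor="grey",   # outline color
    facecolor="pink",   # fill color
    figsize=(10, 8),    # figure width and height, in inches
)
ax.set_title("Two made-up squares")

# Render to an in-memory PNG to confirm the figure draws without a display.
png_bytes = io.BytesIO()
ax.figure.savefig(png_bytes, format="png")
print(png_bytes.getbuffer().nbytes, "bytes written")
```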

When you have time, take a look at the method documentation for **plot** to see all of the available options.


In [0]:
#gpd.GeoDataFrame.plot?
#gpd.GeoSeries.plot?

## Question

Can you think of why the options for plotting a GeoDataFrame are different from those for a GeoSeries?

## Challenge

Let's take a few minutes to practice some of what we have done so far with a different data set.

- Read the **us_states** shapefile into a GeoPandas GeoDataFrame named **usa**.
- Take a look at the data in this dataframe using `head`.
- Then, make a map of the `usa`, 
    - setting the `figsize` to (14,10)
    - the fill color to green,
    - and the outline color to white

In [0]:
# your code here to load the data from the zip file into a geodataframe


In [0]:
# your code here to plot the geodataframe


## Challenge - solution

In [0]:
usa = gpd.read_file('zip://./us_states.zip')
usa.head()

In [0]:
usa.plot(linewidth=0.25, edgecolor='white', facecolor='green',figsize=(14,10))

# **6.- Spatial Subsetting**

It's never easy to make a nice map of the entire US. Why is that? 

We can zoom in on the contiguous USA by spatially subsetting the data using the GeoPandas **cx** indexer.  It takes the form:
>usa1810.cx[xmin:xmax, ymin:ymax]

>where:
- **xmin** is the minimum X coordinate value
- **xmax** is the maximum X coordinate value 
- **ymin** is the minimum Y coordinate value
- **ymax** is the maximum Y coordinate value 

Since our data use geographic coordinates, X values are in decimal degrees of `longitude` and Y values are in decimal degrees of `latitude`.

Let's give it a try.

In [0]:
usa.cx[-130:-80, 25:45].plot(linewidth=0.25, edgecolor='white', facecolor='green',figsize=(14,10))

## Questions

How did that last map turn out?

What exactly is **cx** doing?  Let's explore it a bit more. 

- Change the minimum Y value to 30 and then 35. Do Texas and Florida get clipped?

Take a second to uncomment the command below and read the documentation for `cx`. Then update the values in the previous code cell to get all states.

In [0]:
#usa.cx?
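
Another way to explore what `cx` does is to try it on shapes small enough to reason about. The sketch below uses two made-up squares and a selection window that overlaps only one corner of the larger square. It suggests that `cx` selects whole features that intersect the window rather than clipping them to it.

```python
import geopandas as gpd
from shapely.geometry import box

# Made-up polygons: one large square and one small square far away.
shapes = gpd.GeoDataFrame(
    {"name": ["big", "far"]},
    geometry=[box(0, 0, 10, 10), box(20, 20, 21, 21)],
)

# A window from x=8 to 12 and y=8 to 12 overlaps only a corner of "big".
window = shapes.cx[8:12, 8:12]

print(window["name"].tolist())
# The selected geometry keeps its full extent; it is not cut to the window.
print(window.geometry.iloc[0].bounds)
```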

### Saving a spatial subset

We can make that subset permanent.

In [0]:
# FYI: conus is shorthand for contiguous USA
conus= usa.cx[-130:-50, 20:50].copy().reset_index(drop=True)
conus.head()

In [0]:
# Plot the subset
conus.plot()

In [0]:
conus.STATE.nunique()

## *Any questions?*



---



# **7.- Map Overlays**



A key strength of geospatial data analysis is the ability to overlay data that are located in the same coordinate space. Let's overlay the USA in 1810 on top of the USA in 2017 to visualize the change. Both of these data sets use the same coordinate reference system -  decimal degrees of latitude and longitude referenced to the **World Geodetic System of 1984**.  This is called the **WGS84** coordinate reference system (more about that in a minute). 


The general process for creating a map with multiple data layers is as follows:

- First identify your base map - the layer to draw first, or at the bottom of the stack of layers.
- Then you add one or more additional layers, referencing the base map as the **ax**.

In [0]:
# Map the US states with the 1810 states and territories overlaid.
base = conus.plot(color='white', edgecolor='black',  figsize=(14,10))
usa1810.plot(ax=base, color="blue", edgecolor="blue", alpha=0.5)

We can add even more layers. These will draw in the order that you add them. Consider the following code.

In [0]:
base = usa1810.plot(facecolor="blue", edgecolor="blue",  figsize=(14,10))
conus.plot(ax=base, color='None', edgecolor='black')
conus.centroid.plot(ax=base, color="red")  # Hey - what's happening here?

What's different in the code for the previous two maps?

<br>

We can get even fancier with our maps by using more **matplotlib** options. To access these you need to import matplotlib (we imported `matplotlib.pyplot` as `plt` during setup).

In [0]:
# Mapping with advanced matplotlib settings

fig, ax = plt.subplots(1, figsize=(14,10))  # Initialize the plot figure (drawing area) and axes (data area)

ax.set_aspect('equal')   # set the aspect ratio for the x and y axes to be equal. 
                         # This is done automatically in gdf.plot()
    
base = usa1810.plot(ax=ax, color='grey', edgecolor='grey')  # Set the base map, or bottom map layer

conus.plot(ax=base, color='None', edgecolor="blue")  # draw the data with the base
_ = ax.axis('off') # Don't show the x, y axes and labels in the plot
ax.set_title("The Geographic Extent of the Contiguous USA 1810 and 2017")  # Give the plot a title

plt.show()

### Challenge

Make a map that displays the `orleans_places` over the `Orleans Territory` in the usa1810 GeoDataFrame.

- Color the points "red" so that they are visible.

In [0]:
# Your code here


### Challenge - Solution

In [0]:
base = usa1810[usa1810['STATE']=='Orleans Territory'].plot()
orleans_places.plot(ax=base, color="red")

## Any Questions?

---

# **8.- Coordinate Reference Systems (CRS)**



Did you notice anything funny about the **shape** of the USA as we mapped it above?  How does it differ from the shape of the US in the map below?

\
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Map_of_USA_with_state_names.svg/640px-Map_of_USA_with_state_names.svg.png" width="800px"></img>


#### Why does the shape differ? 

Here's why:

<img src="http://tse3.mm.bing.net/th?id=OIP.lyDmHXX9VdoEOWDQlqppSAHaEy" width="500px"></img>

When we map the spheroidal earth on a 2D plane like a computer screen we get distortion!

## Map Projections and CRS Transformations

In order to reduce distortion in maps we apply a map projection (math) to transform 3D geographic coordinates to 2D projected map coordinates.
<img src="https://www.e-education.psu.edu/natureofgeoinfo/sites/www.e-education.psu.edu.natureofgeoinfo/files/image/projection.gif"></img>

\
- Seen differently 

\

<img src="https://ucb339e0d48d488fb3cbc37aa1df.dl.dropboxusercontent.com/cd/0/inline/AsMSu3gL6IIoMExcLjCruH9RV7-mpgshMfqQ9qK60XM2DYYz2x_Lew7JQotgsdcSguc5DjmyazWEGTnVNV_dHtDY5KdcSywSwgnvv6OBorlTp_150Gym6Cg_K5FtljLUmoc/file#"></img>

\

CRS transformations are often necessary for GeoPandas spatial operations like area and distance calculations which assume a 2D plane.
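
To make the problem concrete, here is a minimal sketch with two made-up cells, each one degree wide and one degree tall, at different latitudes. Measured directly on longitude and latitude they look identical, but after transforming to an equal-area projected CRS (EPSG:5070, which is introduced later in this section) the northern cell turns out to be noticeably smaller. The `to_crs` method used here is explained below.

```python
import warnings

import geopandas as gpd
from shapely.geometry import box

# Two made-up 1 x 1 degree cells at different latitudes.
cells = gpd.GeoDataFrame(
    {"name": ["south", "north"]},
    geometry=[box(-100, 30, -99, 31), box(-100, 45, -99, 46)],
    crs="EPSG:4326",
)

# Planar area computed directly on longitude/latitude, in "square degrees".
# GeoPandas may warn that this is not meaningful, so silence that here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    area_deg = cells.geometry.area

# Transform to an equal-area CRS in meters, then convert m^2 to km^2.
area_km2 = cells.to_crs(epsg=5070).geometry.area / 1e6

print(area_deg.tolist())
print(area_km2.round(0).tolist())
```

The square-degree values are the same for both cells, which is exactly why they should not be used as measurements.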




## Transforming a CRS
The process for transforming a CRS is:

1. Make sure a **crs** is defined for the geopandas dataframe by checking the **crs** property. 
2. If it is not set, you can **define** it.
3. Transform the coordinate geometry to a new CRS using the **to_crs** method.
- This returns a new geodataframe with the new coordinate values and CRS.
- You need to know what CRS to use!!
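
The three steps above can be sketched as a small helper function. This is only an illustration under stated assumptions: the function name `reproject` and its parameters are made up for this example, the sample point is invented, and `set_crs` is assumed to be available, which is the case in recent GeoPandas releases.

```python
import geopandas as gpd
from shapely.geometry import Point


def reproject(gdf, target_epsg, assumed_epsg=None):
    """Check the CRS, define it if it is missing, then transform."""
    # 1. Check whether a CRS is defined.
    if gdf.crs is None:
        # 2. Define it, but only if the caller states what it is.
        if assumed_epsg is None:
            raise ValueError("No CRS is defined; pass assumed_epsg to define one.")
        gdf = gdf.set_crs(epsg=assumed_epsg)
    # 3. Transform the coordinates; this returns a new GeoDataFrame.
    return gdf.to_crs(epsg=target_epsg)


# Made-up point with longitude/latitude coordinates and no CRS defined.
pts = gpd.GeoDataFrame({"name": ["A"]}, geometry=[Point(-90.0, 30.0)])
pts_albers = reproject(pts, target_epsg=5070, assumed_epsg=4326)

print(pts.crs)          # the original is left unchanged
print(pts_albers.crs)   # the transformed copy carries the new CRS
```

The helper refuses to guess a missing CRS on purpose: defining the wrong CRS produces no error, only wrong locations.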


Let's check the CRS of our GeoDataFrames

In [0]:
# Check the CRS of our gdfs
print("The CRS of the conus geodata frame is: " + str(conus.crs))
print("The CRS of the usa1810 geodata frame is: " + str(usa1810.crs))
print("The CRS of the lsc_locs geodata frame is: " + str(lsc_locs.crs))
print("The CRS of the orleans_places geodata frame is: " + str(orleans_places.crs))


### The good news and the bad news...

The CRS of half of these GeoDataFrames is set!

This is not surprising. 

* The two GeoDataFrames that were created from shapefiles have a CRS because the shapefile format included that information.

* The two GeoDataFrames created from CSV files do not have CRSs because that info was not in the CSV file.


### The confusing news

The CRS of the first two GeoDataFrames is set to **epsg:4326** - what's that?

### EPSG 4326

**4326** is the [EPSG](http://www.epsg.org/) code for `WGS84`. This is the most common CRS for latitude and longitude data. It is the default CRS for most mapping software when the data does not have a defined CRS.

* **EPSG** stands for European Petroleum Survey Group, the organization that created these codes.




## Setting a CRS

We can set, or define, the CRS of a GeoDataFrame if we know what it is. 

**Question:** What do you think is the CRS for the `lsc_locs` and `orleans_places` GeoDataFrames?




Let's set the CRS for these gdfs to WGS84 since coordinates are longitude and latitude values.

In [0]:
lsc_locs.crs = usa1810.crs

In [0]:
# Set it (the older {'init': 'epsg:4326'} form is deprecated in current pyproj releases)
lsc_locs.crs = "EPSG:4326"
orleans_places.crs = "EPSG:4326"

# Check it
print("The CRS of the lsc_locs geodata frame is: " + str(lsc_locs.crs))
print("The CRS of the orleans_places geodata frame is: " + str(orleans_places.crs))

**Note**: Setting the CRS does not change any of the geometry data. It simply records which CRS the coordinates are in, so that the software can look it up in its internal database and interpret the coordinates properly.

## Reprojecting a GeoDataFrame

There are a number of reasons why you might want to transform your data to another CRS, including:

- To make prettier maps
- To make more accurate spatial measurements
- To get all data in the same CRS for spatial analysis.

This process is called **reprojecting** the data because the operation is a mathematical transformation of the geometry based on a specific [map projection](https://en.wikipedia.org/wiki/Map_projection).

\

<img src="https://ucd9cd5cb85cb2c0e6049e8ba1e7.dl.dropboxusercontent.com/cd/0/inline/AsMNnCV0L5bC869ivYiJC9V0z_6Fy2Z3hx9Po0rM2jI0RJ8dtO7F9vElWCiqoXk5pN1MGyg4C3_YyjYdjQzBRDIjqgPhnIVDgqIAidHRjzA1rwhZWjUBYlI3obtI7bx0Usw/file#" width="1000px"></img>

\


### Improving our Maps

For example, we can make our maps look better by transforming the data from a geographic CRS (longitude and latitude) to a 2D projected `CRS`. 


Common map projections for data that span the entire contiguous USA, and their EPSG codes, include:

- **Web Mercator** (epsg:3857)
- **USA Contiguous Albers Equal Area** (epsg:5070)

Let's plot the conus GeoDataFrame using the Web Mercator projection.

In [0]:
conus.to_crs(3857).plot()

The above code did not change the geodataframe. It dynamically transformed the geometry and then plotted it.

Since these transformations can be computationally intensive and we often want to reuse the result, let's save the output to a new object.

In [0]:
# Transform geographic crs to web mercator - 3857
conus_3857 = conus.to_crs(epsg=3857)
conus_3857.plot()

## Challenge

Now you try it! Transform the **conus** geodataframe to **USA Contiguous Albers** (5070) CRS and save the output GeoDataFrame as **conus_5070**.

Then, map the output GeoDataFrame.

In [0]:
# Your code here


## Challenge Solution

In [0]:
#@title
# Transform the conus geodataframe to USA Albers (5070)
conus_5070 = conus.to_crs(epsg=5070)
conus_5070.plot()

## Multiplots

Let's plot all the data in all 3 CRS together.

We get very different maps of the USA depending on the CRS. 

- **WGS84** is the most common CRS for longitude and latitude data. But it shouldn't be used for maps because of the distortion to shape. More commonly, these data are transformed before mapping or spatial analysis.

- **Web Mercator** is often used for making maps of areas because it preserves shape. This is the CRS used by most online maps like Google Maps. BUT BEWARE - area distortion increases as you move away from the equator and towards the poles.  Don't use this CRS for spatial analysis.

- **USA Contiguous Albers** is used for maps and area-based analysis of the contiguous USA.  For smaller areas within the USA you should use a CRS that is more customized to a specific state or region.


In [0]:
# RUN CODE - DO Not Change
fig, ax = plt.subplots(ncols=3, figsize=(18,4), subplot_kw=dict(aspect='equal'))
# Don't show the coordinate axis
ax[0].axis('off')
ax[1].axis('off')
ax[2].axis('off')
# Show a title
ax[0].set_title('WGS 84 (4326)')
ax[1].set_title('Web Mercator (3857)')
ax[2].set_title('Albers EA (5070)')
# display
conus.plot(ax=ax[0])
conus_3857.plot(ax=ax[1])
conus_5070.plot(ax=ax[2])
plt.show()

## Challenge

Update the following code to display the conus_3857 and conus_5070 GeoDataFrames in the same map. 

* Do they overlay?
* Should you display data with different CRSs in the same map?

In [0]:
# Update the code below to map the conus_3857 and conus_5070 geodataframes overlayed.
base = conus_3857.plot(color="blue", edgecolor="white", alpha=0.50, figsize=(14,10)) # UPDATE this with Web Mercator geodataframe
conus_5070.plot(ax=base, color="yellow", edgecolor='black', alpha=0.50)       # Update this code with the Albers geodataframe


**Takeaway**

The GeoPandas `plot()` method does not dynamically transform data with different CRSs so that they overlay on a map. You need to do that transformation explicitly.
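
Here is a minimal sketch of why two layers in different CRSs do not line up, and how to fix it. The rectangle is a made-up stand-in for a real layer (very roughly the extent of Louisiana), not one of the workshop datasets.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical rectangle in longitude/latitude
layer_a = gpd.GeoSeries([box(-94, 29, -89, 33)], crs="EPSG:4326")

# The same rectangle expressed in Web Mercator meters
layer_b = layer_a.to_crs(epsg=3857)

# The coordinate ranges are wildly different, so the layers will not overlay
print(layer_a.total_bounds)
print(layer_b.total_bounds)

# Bring layer_b into layer_a's CRS before plotting them on the same axes
layer_b_matched = layer_b.to_crs(layer_a.crs)
print(layer_b_matched.total_bounds)
```

Comparing `total_bounds` (or the `crs` property) of each layer is a quick check to run before overlaying anything.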


### CRSs - The Fine Print

1. GeoPandas data need to be in the same CRS in order to be mapped or analyzed together.
2. The units of a CRS are part of the CRS definition. These are typically decimal degrees for geographic (lat/lon) data and meters or feet for projected data.

    * You can use <https://spatialreference.org> to look up the units by EPSG code.

3. It's not obvious what the best projected CRS is for your map or analysis. You need to review the recent literature (as these things change), try different CRSs and check your results.  Here is a good starting place, [epsg.io](http://epsg.io/).


> A detailed discussion of CRSs and map projections is beyond the scope of this notebook. Understanding these, however, is **necessary** for working successfully with geospatial data! There are a number of online resources that can be found with a web search to help you get started.  Gaining this understanding takes time so be kind to yourself and ask for help if you need it.

## More ad-hoc projections


https://projectionwizard.org/#

This can be useful for finding a projection that is tailored to your area of interest.

<img src="https://uc7d0845f9e0a2cc252f15af153e.dl.dropboxusercontent.com/cd/0/inline/AsMsFOWFJh7GEENp1spMizsV8unCfZ8eEUUnk2j5DZuQXuII140dLqYmnzCp09tgUJ80raYaPWDdwIlwhvE6DBs6_RGzJA2RJ6w-V91KLNXbZHMTpk0kyo2Vf2NQ8KqYJ-8/file#" width="1000px"></img>

In [0]:
#The use of this would be:
#gpd_demo = usa1810.to_crs("+proj=stere +lat_0=35.460669951495305 +lon_0=-115.31249999999999")
#gpd_demo.plot()

## Any Questions?


---



# **9.- Spatial Measurements**



GeoPandas uses the [Shapely library](https://shapely.readthedocs.io/en/stable/manual.html) to compute spatial measurements like area and length for individual geometries or all the geometries in a GeoSeries.  The available measurements depend on the geometry type. For example, we can compute area and perimeter for polygons, length for lines, and distances between points.  Read the GeoPandas and Shapely documentation to get a sense of all the measurements you can compute.


## Calculating Area

Let's compute the area of a single state geometry in the `conus` GeoDataFrame.

In [0]:
conus[conus['STATE']=='Utah'].area

Above, **area** is returned as a pandas series containing one item.  

The item contains an index value, data value and a data type.

You can retrieve the data value by referencing the index value as follows.

In [0]:
conus[conus['STATE']=='Utah'].area[26]

You can also use the *squeeze* method to return just the data value when the Series has only one element.


In [0]:
conus[conus['STATE']=='Utah'].area.squeeze()

In [0]:
conus.crs

### Question  - What are the units for the above area value?

## Spatial Measurements and CRSs

It doesn't make sense to compute spatial measurements using geographic coordinates (latitude and longitude) because the units are decimal degrees. 
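
To make this concrete, the sketch below estimates how long one degree of longitude is on the ground at several latitudes. It assumes a spherical Earth with a mean radius of 6371 km, which is an approximation, and the function name is just for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def km_per_degree_longitude(lat_deg):
    """Approximate ground length of one degree of longitude at a latitude."""
    return math.radians(1) * EARTH_RADIUS_KM * math.cos(math.radians(lat_deg))

for lat in (0, 30, 45, 60):
    print(f"latitude {lat:>2}: {km_per_degree_longitude(lat):6.1f} km per degree")
```

Because a degree of longitude shrinks as you move away from the equator, an area or distance expressed in "degrees" does not correspond to any fixed ground measurement.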

Let's redo the above area measurement using the Albers GeoDataFrame `conus_5070`.  The units for this CRS are meters, so areas are returned in square meters.  

We will convert the result to square kilometers by dividing by 1000 x 1000.

* You can find the units for a CRS by looking it up by EPSG code on the website <https://spatialreference.org>.

In [0]:
# Area Utah in sq kilometers
conus_5070[conus_5070['STATE']=='Utah'].area.squeeze() / (1000 *1000)

How close is that area measurement to what is reported in [Wikipedia](https://en.wikipedia.org/wiki/Utah) for the total area in square kilometers of Utah?

### Challenge
Calculate the area of Utah using the Web Mercator GeoDataFrame. Does it give a similar result to the Albers GeoDataFrame? Note, the units are also meters for this CRS.

In [0]:
# Your code here
# Calculate the area of Utah in sq KM using the Web Mercator geodataframe


### Challenge - Solution

In [0]:
# Calculate the area of Utah in sq KM using the Web Mercator geodataframe
conus_3857[conus_3857['STATE']=='Utah'].area.squeeze() / (1000 *1000)
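
A rough way to anticipate how far off the Web Mercator area will be: on a sphere, the Mercator projection stretches lengths by about `1 / cos(latitude)`, so small areas are inflated by about that factor squared. The sketch below applies this rule of thumb. It is an approximation, and the latitude of 39.5 degrees north used for the middle of Utah is an assumption made for the example.

```python
import math

def mercator_area_inflation(lat_deg):
    """Approximate factor by which spherical Mercator inflates small areas."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2

# 39.5 is an assumed latitude for roughly the middle of Utah
for lat in (0, 30, 39.5, 60):
    print(f"latitude {lat}: areas inflated about {mercator_area_inflation(lat):.2f}x")
```

You can compare the factor printed for 39.5 degrees with the ratio of your Web Mercator and Albers results for Utah.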

### Spatial Measurements and GeoDataFrames

We can compute the area of all geometries in the geodataframe.

In [0]:
#conus.geometry.area
conus_5070.area

Above, we dynamically calculated area. But we can also add it as a new column in the GeoDataFrame.

In [0]:
# Update the GeoDataFrame
conus_5070['areaKM'] = conus_5070.area / (1000*1000)
conus_5070.head(15)

## Calculating Length or Perimeter

Similarly we can calculate the perimeter of one or all state polygons.

In [0]:
# Perimeter of all states in kilometers
conus_5070['perimeterKM'] = conus_5070.length / 1000


In [0]:
conus_5070.head()

## Calculating Distance
We can compute the shortest distance between geometries using the GeoSeries **distance** method.  This method calculates the shortest distance between two geometries or between a GeoSeries and a geometry.

### Computing the Distance between two points

Let's compute the distance between two `orleans_places`: **Baton Rouge** and **New Orleans**

In [0]:
# Compute the distance between Baton Rouge and NOLA
baton_rouge = orleans_places[orleans_places.place == 'Baton Rouge'].geometry
new_orleans = orleans_places[orleans_places.place == 'New Orleans'].geometry

baton_rouge.distance(new_orleans.squeeze())

As with area calculations, distance calculations require a GeoDataFrame with an appropriate CRS.  Let's dynamically convert to EPSG 5070 and check the result.


In [0]:
baton_rouge.to_crs(5070).distance(new_orleans.to_crs(5070).squeeze()) / 1000

You can check that on Google Maps to see if it is more or less correct.
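
Another independent check is the great-circle (haversine) distance computed directly from longitude and latitude. The sketch below assumes a spherical Earth, and the coordinates are approximate modern city-center values supplied for illustration rather than values taken from `orleans_places`.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in kilometers between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Assumed approximate coordinates (longitude, latitude)
baton_rouge_lonlat = (-91.19, 30.45)
new_orleans_lonlat = (-90.07, 29.95)

check_km = haversine_km(*baton_rouge_lonlat, *new_orleans_lonlat)
print(round(check_km, 1))
```

If the projected distance and the great-circle distance agree to within a few kilometers, the CRS you chose is probably reasonable for this region.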

We can extend this and calculate the distance between all places and New Orleans, the capital of the Orleans Territory.

In [0]:
orleans_places['dist2nola_km'] = orleans_places.to_crs(5070).distance(new_orleans.to_crs(5070).geometry.squeeze()) / 1000
orleans_places

In [0]:
orleans_places.to_crs(5070).distance(new_orleans.to_crs(5070).geometry.squeeze()) / 1000

We could also apply the distance function to the orleans_places gdf with `map`.

In [0]:
orleans_places.to_crs(5070).geometry.map(lambda g: g.distance(new_orleans.to_crs(5070).geometry.squeeze())/1000)

### Computing the Distance between two polygons

Distance calculations aren't limited to points or to points and polygons. 

We can also compute the shortest distance in kilometers between CA and Washington state.

First, let's get the geometry for both states.

In [0]:
#Extract the geometry for WA
wa_geom = conus_5070[conus_5070['STATE']=='Washington'].geometry
print(type(wa_geom))
print(wa_geom)
print()
#Extract the geometry for CA 
ca_geom = conus_5070[conus_5070['STATE']=='California'].geometry
print(type(ca_geom))
print(ca_geom)



Now, compute the distance using the `distance` method.

* Note the different implementations below.


In [0]:
# Compute the distance between a GeoSeries and a geometry
wa_geom.distance(ca_geom[3]) / 1000

In [0]:
# Compute the distance between two geometries
wa_geom[47].distance(ca_geom[3]) / 1000

In [0]:
# Doing it all on one line
conus_5070[conus_5070['STATE']=='Washington'].geometry[47].distance(conus_5070[conus_5070['STATE']=='California'].geometry[3]) / 1000

*How does the command in the next cell differ from the previous command?*

In [0]:
conus_5070[conus_5070['STATE']=='Washington'].geometry[47].centroid.distance(conus_5070[conus_5070['STATE']=='California'].geometry[3].centroid) / 1000

Alternatively, we can reset the indices so we know that the result will be in the first (zero-indexed) row.

In [0]:
wa_geom = conus_5070[conus_5070['STATE']=='Washington'].reset_index().geometry
ca_geom = conus_5070[conus_5070['STATE']=='California'].reset_index().geometry
wa_geom.distance(ca_geom) / 1000

### Apply Distance Calculation to all rows in a GeoDataFrame

What state is the farthest from CA?

In [0]:
# Calculate the distance between each state's geometry and CA geometry
conus_5070['dist2cal'] = conus_5070.distance(ca_geom[0]) / 1000


In [0]:
# Display the 5 states farthest from CA
conus_5070.sort_values(by='dist2cal', ascending=False).head()

### Challenge

Use the results from the previous distance calculations to view the states that border CA.

In [0]:
# Your code here

### Challenge - Solution

In [0]:
# Display the 5 states NEAREST TO CA
conus_5070.sort_values(by='dist2cal', ascending=True).head()

## CRSs and Distance Calculations

Compute the minimum distance in KM between Washington & California using the Web Mercator GeoDataFrame `conus_3857`.

* *Do you get the same result?*

In [0]:
# Your code here
wm_dist_m = conus_3857[conus_3857['STATE']=='Washington'].squeeze().geometry.distance(conus_3857[conus_3857['STATE']=='California'].squeeze().geometry)
wm_dist_km = wm_dist_m / 1000
print("web mercator dist KM:", wm_dist_km)
#
al_dist_m = conus_5070[conus_5070['STATE']=='Washington'].squeeze().geometry.distance(conus_5070[conus_5070['STATE']=='California'].squeeze().geometry)
al_dist_km = al_dist_m / 1000
print("Albers dist KM:", al_dist_km)


 ### Question
 
 Which of the above CRSs returned the best result?  Let's check it in [Google Maps](http://maps.google.com) to find out.

## Spatial Measurements and CRSs - Recap

The output of spatial measurements depends on the CRS and is expressed in the units of the CRS. The Shapely library assumes a two dimensional planar coordinate system and makes no transformation on the data - that is left for the analyst.

Key Takeaways: 

1. **Don't use geographic coordinates for spatial measurement queries**. The results in decimal degrees are meaningless!

2. Use the CRS that is best for the type of spatial operation and the geographic region. 

3. Always check your work.


## Any Questions?


---



# **10.- Spatial Relationship Queries**



[Spatial relationship queries](https://en.wikipedia.org/wiki/Spatial_relation) consider how two geometries or sets of geometries relate to one another in space. 

<img src="https://upload.wikimedia.org/wikipedia/commons/5/55/TopologicSpatialRelarions2.png" height="400px"></img>


Here is a list of the most commonly used GeoPandas methods to test spatial relationships.

- [within](http://geopandas.org/reference.html?highlight=distance#geopandas.GeoSeries.within)
- [contains](http://geopandas.org/reference.html?highlight=distance#geopandas.GeoSeries.contains) (the inverse of `within`)
- [intersects](http://geopandas.org/reference.html?highlight=distance#geopandas.GeoSeries.intersects)

<br>
There are several other GeoPandas spatial relationship predicates but they are more complex to employ properly. For example, the following two operations only work with geometries that are completely aligned.

- [touches](http://geopandas.org/reference.html?highlight=distance#geopandas.GeoSeries.touches)
- [equals](http://geopandas.org/reference.html?highlight=distance#geopandas.GeoSeries.equals)


All of these methods take the form:

    Geoseries.contains(geometry)

Let's consider some spatial relationship queries between GeoPandas geometries.

To start, let's create some GeoPandas polygon objects that represent Louisiana geographies.

In [0]:
# Louisiana today
la_poly = conus[conus['STATE']=='Louisiana'].reset_index() 

# Louisiana in 1810 as the Orleans Territory
orleans_poly = usa1810[usa1810['STATE']=='Orleans Territory'].reset_index() 

# The Parish (or county) of Pointe Coupee, Louisiana in 1810
ptcoupee_poly = usa1810[usa1810['COUNTY']=='Pointe Coupee'].reset_index()


### Questions

What types of GeoPandas objects are these?

Let's plot the three on the same map...

In [0]:
base = orleans_poly.plot(color="lightpink", edgecolor="floralwhite", figsize=(10,10))
la_poly.plot(ax=base, color='none', edgecolor="black", linewidth=2, alpha=0.5)
ptcoupee_poly.plot(ax=base, color="grey")
plt.title('Louisiana in 1810 and Today, showing Pointe Coupee Parish')
plt.show()

All but one of these three GeoDataFrames have a GeoSeries with just one geometry. 

* **Question** - Which one has more than one?

In [0]:
print("la_poly has this many geometries: ", len(la_poly.geometry))
print("ptcoupee_poly has this many geometries: ", len(ptcoupee_poly.geometry))
print("orleans_poly has this many geometries: ", len(orleans_poly.geometry))



Let's consider a few simple spatial relationship queries.

<br>

Is Pointe Coupee Parish (ptcoupee_poly) within Louisiana (la_poly)?


In [0]:
ptcoupee_poly.within(la_poly)

### Challenge

Restate the above query using **contains**.

In [0]:
# Your code here

### Challenge - Solution

In [0]:
la_poly.contains(ptcoupee_poly)

### Check your work.

These queries seem simple but can be tricky. Sometimes it is good to ask questions you know are not true just to test that your syntax is correct.

* Does Pointe Coupee Parish contain Louisiana?

In [0]:
ptcoupee_poly.contains(la_poly)

### Spatial Relationship queries with more complex GeoSeries


In the above queries we compared geometries 1 to 1 - where each spatial object only contained one geometry.

Now, let's ask more complex queries, comparing GeoSeries with more than one geometry.

Keep in mind:

* la_poly has just one geometry - for the state of Louisiana.
* ptcoupee_poly has one geometry - for Pointe Coupee Parish.
* orleans_poly has 20 geometries, one for each parish in Orleans Territory
* conus has 49 geometries, one for each US State in 2017.
* usa1810 has many geometries, one for each county in all states and territories in 1810.




Even though we already know the answer, let's see how we ask the question:

> *In what US state (conus) is Pointe Coupee Parish located?*


First, let's check that `conus` contains Pointe Coupee Parish.


In [0]:
conus.contains(ptcoupee_poly.geometry[0])

### Question:

*How does the above **contains** syntax and result differ from what we used earlier?*

### Important

When comparing one geometry (ptcoupee_poly) to a GeoSeries with more than one geometry (conus) you need to explicitly reference that one geometry.

* GeoSeries geometry in `la_poly` is **implicitly** compared to the one geometry in the `ptcoupee_poly` GeoSeries:

  `la_poly.contains(ptcoupee_poly)`

<br>
<p>
  Compared to:
</p>
  

* GeoSeries geometries in `conus` are **explicitly** compared to the one geometry in the **ptcoupee_poly** GeoSeries: 

  `conus.contains(ptcoupee_poly.geometry[0])`
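
The difference comes from how GeoPandas pairs up rows: a single geometry is tested against every row, while two GeoSeries are compared row by row after matching on the index. The toy example below is a sketch with made-up squares and a made-up point, not the workshop data. In the second call GeoPandas may also warn that the indices are not equal.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical toy data: two square "states" and a one-row "parish" GeoSeries
states = gpd.GeoSeries([box(0, 0, 1, 1), box(1, 0, 2, 1)])
parish = gpd.GeoSeries([Point(1.5, 0.5)])

# Explicit single geometry: tested against every row of states
one_vs_many = states.contains(parish.iloc[0])
print(one_vs_many)

# GeoSeries vs. GeoSeries: row 0 is compared with row 0, and row 1 of states
# has no partner in parish, so the state that holds the point is missed
row_by_row = states.contains(parish)
print(row_by_row)
```

This is why the one-to-one queries earlier worked without an explicit geometry (both objects had a single row with index 0), while a query against many rows needs one.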
 


### Answering questions with spatial relationship queries

Use the results from the `contains` query to answer the question *what state contains Pointe Coupee Parish?*

In [0]:
type(ptcoupee_poly.geometry[0])

In [0]:
conus[conus.contains(ptcoupee_poly.geometry[0])]


### Challenge: 

1. Were all Orleans Territory Parishes within what is now Louisiana?

2. If not, what parishes are not now in Louisiana?

    * Hint: use the not operator (~)

3. Make a map of those Parishes on top of Louisiana (la_poly).

In [0]:
# Your code - were all Orleans Territory parishes in Louisiana?
orleans_poly.within(la_poly.geometry[0])

In [0]:
# Your code - what parishes were not?

In [0]:
# Your code - map of those parishes?

### Challenge - Solution

In [0]:
# were all Orleans Territory parishes in Louisiana?
orleans_poly.within(la_poly.geometry[0])

In [0]:
#what parishes were not?
parishes_not_in_la= orleans_poly[~orleans_poly.within(la_poly.geometry[0])]
parishes_not_in_la

In [0]:
#map of those parishes?
base = parishes_not_in_la.plot(color="pink", edgecolor="white",figsize=(10,10))
la_poly.plot(ax=base, color="none", edgecolor="black", linewidth=2)
plt.title('Orleans Territory Parishes Not Within Present-Day Louisiana')
plt.show()

### Question

Why does the [within](http://geopandas.org/reference.html?highlight=distance#geopandas.GeoSeries.within) operator indicate that there are several Orleans Territory parishes that are not within Louisiana?

*Discussion*

##  Intersects - the most general and therefore most useful spatial relationship query

The most useful, fastest and most general purpose spatial relationship query is **intersects**. You don't need to worry about selecting the correct spatial relationship predicate for your query or differences due to the resolution and alignment of your geometries.
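
The effect of small misalignments can be sketched with two Shapely rectangles. The shapes below are invented for illustration: one "parish" sits fully inside the "state", and the other pokes out past the boundary by a tiny amount, as often happens when boundaries come from different sources.

```python
from shapely.geometry import box

state = box(0, 0, 10, 10)
parish_inside = box(2, 2, 4, 4)
parish_overhang = box(8, 8, 10.001, 10)  # extends slightly past the state edge

# A parish fully inside the state passes both tests
print(parish_inside.within(state), parish_inside.intersects(state))

# A tiny overhang is enough to make within() fail, but intersects() still holds
print(parish_overhang.within(state), parish_overhang.intersects(state))

# intersects() gives the same answer in either direction
print(state.intersects(parish_overhang))
```

The trade-off is that `intersects` is also true for geometries that merely share a border, so it answers a broader question than `within`.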

</br>

Below, we use `intersects` to ask again whether all Orleans Territory parishes are in Louisiana.




In [0]:
orleans_poly.intersects(la_poly.geometry[0])


Here's another example that demonstrates the flexibility of `intersects`. 

Let's use `intersects` to see what states border Louisiana.

In [0]:
conus[conus.intersects(la_poly.geometry[0])]

Intersects is not a directional operator like contains or within. You can compare two geometries in any order and get the same result.

In [0]:
print(ptcoupee_poly.intersects(la_poly))

print(la_poly.intersects(ptcoupee_poly))

## More Complex Spatial Relationship Queries

Spatial relationship queries can get complex very quickly.  Consider this question:

What Orleans Parish contains each of the `orleans_places`?

* This is a tough one! We are comparing two GeoSeries each with many geometries.
* This requires one or more for loops or apply functions.

In [0]:
for index, row in orleans_places.iterrows():
  p_geom = row['geometry']
  p_name = row['place']
  
  for index2,row2 in orleans_poly.iterrows():
    o_geom = row2['geometry']
    o_name = row2['COUNTY']
    
    if o_geom.contains(p_geom):
      print(p_name, " was in", o_name, "Parish")


When your queries start to get that hairy it's a good time to ask *is there another way*?  There often is!

We will turn to this in our next section on Spatial Joins.

## Any Questions?



---



# **11.- Combining Data: via Attributes & Spatial Joins**

Joins are used to combine data in different tables.

* **Attribute joins** combine data based on common values.

* **Spatial joins** combine data based on location.



# Attribute Joins

Attribute joins combine data from different tables based on a column with shared values.  Although these are not spatial they are widely used in geospatial analysis and in all data analysis.  We use the **merge** command for geopandas attribute joins.

<br>
Let's use an attribute join to join some census data for Orleans Territory to a subset of the usa1810 data.

First, read in the CSV file to a Pandas DataFrame named **orleans_census1810**



In [0]:
orleans_census1810 = pd.read_csv('./orleans_census_data1810.csv')
orleans_census1810.head()

Then, subset `usa1810` to a new GeoDataFrame keeping only the data where the STATE is Orleans Territory - name this gdf  **orleans**.


In [0]:
orleans = usa1810[usa1810['STATE'] == 'Orleans Territory']

orleans.plot()


Next, let's take a look at the values in the **orleans** GeoDataFrame.  Compare it to the **orleans_census1810** data frame.

- What column should we use for the join?

In [0]:
orleans.head()

Join the attribute data in **orleans_census1810** to the **orleans** GeoDataFrame using the **merge** command.

In [0]:
orleans_popdata = orleans.merge(orleans_census1810, on='GISJOIN')
orleans_popdata.head()

You can see that we now have a number of population related attributes in the geodataframe.

What happened to the columns that were in both dataframes?
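
As a hint, here is a small pandas-only sketch of what `merge` does when both tables share a column name other than the join key. The table contents and the `total_pop` column are hypothetical and only loosely mimic the workshop data.

```python
import pandas as pd

# Two hypothetical tables that share the key GISJOIN and also a COUNTY column
geo_table = pd.DataFrame({"GISJOIN": ["G1", "G2"],
                          "COUNTY": ["Orleans", "Pointe Coupee"]})
census_table = pd.DataFrame({"GISJOIN": ["G1", "G2"],
                             "COUNTY": ["ORLEANS", "POINTE COUPEE"],
                             "total_pop": [100, 200]})

# Default behavior: overlapping non-key columns get _x and _y suffixes
merged = geo_table.merge(census_table, on="GISJOIN")
print(list(merged.columns))

# The suffixes can be chosen explicitly to make the origin of each column clear
renamed = geo_table.merge(census_table, on="GISJOIN",
                          suffixes=("_geo", "_census"))
print(list(renamed.columns))
```

The `_x` suffix marks the column from the left table and `_y` the column from the right table.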

## Questions?



---



# Spatial Joins

We can use a spatial join to combine attributes from different GeoDataFrames for objects that are colocated in space.

In `geopandas` this is done with the **sjoin** operator.  

First take a quick look at the documentation for **sjoin**.

In [0]:
#gpd.sjoin?

Let's explore spatial joins revisiting the question we asked in the last section.

* What Orleans Parish contains each of the orleans_places?

Here the goal is to add the parish name to the GeoDataFrame orleans_places.

### SJOIN

We are now ready to use **sjoin** to add the parish (county) for each place.

In [0]:
# sjoin syntax - for reference
## gpd.sjoin(left_df, right_df, how='inner', predicate='intersects', lsuffix='left', rsuffix='right') 
## (in GeoPandas releases before 0.10 the predicate argument is named op)

# sjoin in action
orleans_places2 = gpd.sjoin(orleans_places, orleans_poly)

orleans_places2.head()

The result of this `sjoin` is a new `GeoDataFrame` that has one row for each orleans_place and additional columns of attribute data from the orleans_poly GeoDataFrame for the geometries spatially *intersected*.

* Check the number of rows in `orleans_places` and `orleans_places2` - do they match?

In [0]:
print(len(orleans_places))
print(len(orleans_places2))

By default, `sjoin` only returns rows where an intersection was found. This is determined by the **how=** function parameter which defaults to `inner`.

If we set this to **how='left'** we will keep all the rows for the GeoDataFrame named on the left, here orleans_places.

This way we can see what places are not within Orleans Territory.
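
The difference between the two `how` settings can be sketched on toy data. Everything below is hypothetical (two square "parishes" and three points, one of which falls outside both), and the `predicate` keyword assumes a recent GeoPandas release.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical polygons and points, all in the same CRS
parishes = gpd.GeoDataFrame({"COUNTY": ["West", "East"]},
                            geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
                            crs="EPSG:4326")
places = gpd.GeoDataFrame({"place": ["a", "b", "c"]},
                          geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
                          crs="EPSG:4326")

# Inner join: only the places that fall within a parish are returned
inner = gpd.sjoin(places, parishes, how="inner", predicate="within")

# Left join: every place is kept; unmatched places get missing values
left = gpd.sjoin(places, parishes, how="left", predicate="within")

print(len(inner), len(left))
print(left[["place", "COUNTY"]])
```

Rows with a missing `COUNTY` in the left join are exactly the places that did not match any polygon.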

In [0]:
orleans_places2 = gpd.sjoin(orleans_places, orleans_poly, how="left")

orleans_places2.head()

In [0]:
len(orleans_places2)

We can use some Pandas to massage our results...

In [0]:
# How many orleans_places do not have COUNTY data?
orleans_places2[orleans_places2.COUNTY.isnull()]

We can make a quick plot to check the results visually.

In [0]:
base=orleans_poly.plot(figsize=(14,14))
orleans_places.plot(ax=base, color="black")
orleans_places2[orleans_places2.COUNTY.isnull()].plot(ax=base, color="red")

Finally, we can subset out the columns we want to keep.

In [0]:
orleans_places2[['place','COUNTY']]

#### NEW Challenge

What do you get when you reverse this join?
So that instead of joining the county data to the places points
you join the places points to the county data?

In [0]:
#joining places to orleans counties (in the orleans_poly gdf)
counties_with_places_data = gpd.sjoin(orleans_poly, orleans_places, how="left")
counties_with_places_data.head()

In [0]:
type(counties_with_places_data.geometry[0])

In [0]:
# Are the input and output county gdfs the same length?
print(len(counties_with_places_data))
print(len(orleans_poly))
counties_with_places_data[['COUNTY','place']]

### Challenge

Use a spatial join to identify the parishes in which the three LSC slave conspiracies (lsc_locs) took place.

In [0]:
# Your code here
lsc_locs
gpd.sjoin(lsc_locs,orleans_poly)

### Challenge - Solution

In [0]:
lsc_with_parish = gpd.sjoin(lsc_locs, usa1810, how="left")
lsc_with_parish[['name','COUNTY']]

### Any Questions?

---

# **12.- Data Driven Mapping**



Data driven mapping refers to the process of creating thematic maps by using data values to determine the symbology of mapped features - including their color, shape, size.  This is in contrast to setting the same symbology for all features as we have done above.

### Mapping categorical data
We can symbolize the color of our features by a categorical data value.

In [0]:
conus.plot(color="pink", edgecolor="blue")

In [0]:
conus.plot(column="STATE", edgecolor="white")


### Mapping quantitative data

We can also color areas by quantitative data values. 

</br>

Let's map the parishes in Orleans Territory by the number of non-white slaves. These values are in the column `nwslave_pop`.


In [0]:
orleans_popdata.plot(column='nwslave_pop', cmap="Reds",edgecolor='grey', legend=True, figsize=(8,6))
plt.show()

### cmaps - colormaps

Note the use of the **cmap** option to set the matplotlib color palette for mapping the data.  Take a look at the [documentation](https://matplotlib.org/users/colormaps.html) for these colormaps and rerun the previous code with a different value for **cmap**.   I strongly recommend that you read this documentation to improve your use of colormaps to effectively map data values. 

### Discussion

Above, the plot option **column=** tells the plot command to use the values in the **nwslave_pop** column to determine the geometry colors based on the colormap specified by the **cmap** option. You can see the list of available [color maps here](https://matplotlib.org/users/colormaps.html). The full range of values in the `nwslave_pop` column is being scaled to the color palette called **Reds**.  This is called an `unclassified` or `classless` map. This map is a good first effort as it imposes no grouping on the data, thus making it easier to spot trends and outliers. But it is harder to interpret the data values within an area. 


### Graduated Color Maps

A more common practice is to use a **classification scheme** to bin data values into 4-7 classes and map those classes to a color palette.  This type of map is called a **graduated color map** or a **choropleth map**.

</br>

Let's try that below with **quantile** classification which is the most commonly used scheme when mapping data.

In [0]:

orleans_popdata.plot(column='nwslave_pop', cmap='Reds', edgecolor='black', 
                     legend=True, figsize=(8,6), scheme='quantiles')
plt.show()

Wow that gives a very different looking map!
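
To see why the choice of classification scheme changes a map so much, the sketch below bins the same values two ways using only the Python standard library. The values are hypothetical and deliberately skewed, and this is an illustration of the idea rather than the exact code GeoPandas runs internally.

```python
import bisect
import statistics

# Hypothetical, strongly skewed counts (not taken from the census table)
values = [1, 2, 3, 4, 5, 6, 8, 10, 50, 100]
k = 5  # number of classes

# Equal interval: cut the value range into k bins of equal width
lo, hi = min(values), max(values)
width = (hi - lo) / k
equal_bounds = [lo + width * i for i in range(1, k)] + [hi]

# Quantiles: choose cut points so each bin holds about the same number of values
quantile_bounds = statistics.quantiles(values, n=k, method="inclusive") + [hi]

def count_per_class(vals, upper_bounds):
    """Count how many values fall in each class, given inclusive upper bounds."""
    counts = [0] * len(upper_bounds)
    for v in vals:
        counts[bisect.bisect_left(upper_bounds, v)] += 1
    return counts

equal_counts = count_per_class(values, equal_bounds)
quantile_counts = count_per_class(values, quantile_bounds)
print("equal interval:", equal_counts)
print("quantiles:     ", quantile_counts)
```

With skewed data, equal intervals put most features in the lowest class and leave some classes empty, while quantiles spread the features evenly across the colors.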

In [0]:
orleans_popdata.plot?

## Challenge

In the code cell below recreate the above map with the classification schemes **equal_interval** and **fisher_jenks** to see how the look of the map changes.


In [0]:
# Your code here

## Choropleth Maps

The maps we just made are called `choropleth maps`. A [choropleth map](https://en.wikipedia.org/wiki/Choropleth_map) is a data map that colors areas by data values.  This is the most common type of data map. It is also sometimes loosely called a **heatmap**.

<br>

**Important**, when the areas being mapped vary in size it is not considered good cartographic practice to map **counts**.  Why do you think this is so?

Instead, choropleth maps typically symbolize areas by area-weighted densities, ratios or rates that can be compared across the different sized areas.

<br>

Let's map the ratio of non-white slaves (nwslave_pop) to free whites (white_pop).

In [0]:
# Create a new column that is the ratio of non-white slaves (nwslave_pop) to free whites (white_pop)
orleans_popdata['slave2white_ratio'] = orleans_popdata['nwslave_pop'] / orleans_popdata['white_pop']

# Map the ratio
orleans_popdata.plot(column='slave2white_ratio', cmap='Reds', edgecolor='black', 
                     legend=True, figsize=(8,6), scheme='quantiles')

plt.show()

Let's redo the above map by adding labels and a few more niceties.

We will also use **fisher_jenks** classification to minimize within bin variance and maximize between bin variance. This creates groupings that better reflect the data.

In [0]:
fig, ax = plt.subplots(1, figsize=(12,12))

orleans_popdata.plot(ax=ax, column='slave2white_ratio', cmap='OrRd', edgecolor='black', legend=True, scheme='fisher_jenks')

for polygon, name in zip(orleans_popdata.geometry, orleans_popdata.COUNTY_x):
    ax.annotate(name, xy=(polygon.centroid.x, polygon.centroid.y))

_ = ax.axis('off')

ax.set_title("Ratio of Non-White Slaves to Free Whites, Orleans Territory, 1810")
plt.show()

Needless to say, labels are a bit tricky, regardless of the software you use to make a map!


## Question

In 1791 and 1795 two slave revolts were planned in the same parish in Orleans Territory. Although these plots involved different people and had different origins, both were discovered and thwarted, leading to the trial and execution or imprisonment of many enslaved persons. Soon thereafter, the [German Coast Uprising of 1811](https://en.wikipedia.org/wiki/1811_German_Coast_uprising), which was the largest slave revolt in US history, occurred in a different Orleans parish. 

- *Does the map symbology indicate the two parishes in which these three events occurred?*

As a check, we can add the **lsc_locs** points to the map above.

In [0]:
fig, ax = plt.subplots(1, figsize=(12,12))

orleans_popdata.plot(ax=ax, column='slave2white_ratio', cmap='OrRd', edgecolor='black', legend=True, scheme='fisher_jenks')

lsc_locs.plot(ax=ax, color='white', edgecolor="black", linewidth=3, markersize=100)

for polygon, name in zip(orleans_popdata.geometry, orleans_popdata.COUNTY_x):
    ax.annotate(name, xy=(polygon.centroid.x, polygon.centroid.y))

_ = ax.axis('off')

ax.set_title("Ratio of Non-White Slaves to Free Whites, Orleans Territory, 1810")
plt.show()

###  Any Questions?



---



# Interactive Mapping with Folium

[Folium](https://python-visualization.github.io/folium/) is a widely used Python library for creating interactive maps.  See the online documentation for details. 

Below are a few examples for you to consider. Just a taste!

First load the library. Install it if necessary.

In [0]:
#!pip install folium
import folium
from folium import Choropleth, Circle, Marker


In [0]:
# Create a simple point from the ptcoupee_poly
ptcoupee_pt = ptcoupee_poly.centroid

# note how we extract the coordinate value, which is what folium needs
ptcoupee_pt.y.squeeze()


In [0]:
# Create a map centered on Pointe Coupee
map1 = folium.Map(location=[ptcoupee_pt.y.squeeze(), ptcoupee_pt.x.squeeze()], tiles='Stamen Toner',
    zoom_start=10)

#display the map
map1

### Point markers with Popups

First, we need to make sure the coordinates are numeric!

In [0]:
lsc_locs.dtypes

In [0]:
lsc_locs['latitude'] = pd.to_numeric(lsc_locs["latitude"])
lsc_locs['longitude'] = pd.to_numeric(lsc_locs["longitude"])
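
As a small illustration of why this conversion matters, the sketch below uses a made-up table (the place names and coordinates are hypothetical, not the `lsc_locs` data). It assumes coordinates were read in as text and shows one way to handle an entry that cannot be parsed.

```python
import pandas as pd

# Hypothetical coordinates stored as text, as often happens after reading a CSV
locs = pd.DataFrame({
    "name": ["Place A", "Place B", "Place C"],
    "latitude": ["30.70", "30.05", "n/a"],
    "longitude": ["-91.60", "-90.48", "-90.07"],
})

print(locs.dtypes)  # latitude and longitude are text columns, not numbers

# errors="coerce" turns unparseable entries into NaN instead of raising an error
locs["latitude"] = pd.to_numeric(locs["latitude"], errors="coerce")
locs["longitude"] = pd.to_numeric(locs["longitude"], errors="coerce")

# Drop rows without a usable coordinate before building map markers
clean = locs.dropna(subset=["latitude", "longitude"])
print(clean)
```

Without `errors="coerce"`, `pd.to_numeric` raises an error on the first value it cannot parse, which is often the behavior you want while you are still checking the data.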

In [0]:
map2 = folium.Map(location=[ptcoupee_pt.y.squeeze(), ptcoupee_pt.x.squeeze()], tiles='Stamen Toner',
    zoom_start=9)

for i in lsc_locs.index:
    folium.CircleMarker(
        location=[lsc_locs.latitude[i], lsc_locs.longitude[i]],
        radius=10,
        popup=lsc_locs.name[i],
        color='red',
        fill=True,
        fill_color='red'
    ).add_to(map2)


map2

### Challenge

Create a Folium map of all of the places in the `orleans_places` GeoDataFrame.

* Add a popup with the place name or description

In [0]:
# Your code here

In [0]:
%who

In [0]:
#orleans_popdata.head()
popdata2map = orleans_popdata[['GISJOIN','geometry']].set_index('GISJOIN')
#popdata2map.head()
popdata2color=orleans_popdata[['GISJOIN','slave2white_ratio']].set_index('GISJOIN')
#popdata2color.head()

In [0]:
#folium.Map?
#folium.Choropleth?
#folium.Choropleth(geo_data, data=None, columns=None, key_on=None, bins=6, fill_color='blue', 
#                  nan_fill_color='black', fill_opacity=0.6, nan_fill_opacity=None, line_color='black', 
#                  line_weight=1, line_opacity=1, name=None, legend_name='', 
#                  overlay=True, control=True, show=True, topojson=None, smooth_factor=None, highlight=None, **kwargs)

In [0]:


map3 = folium.Map(location=[ptcoupee_pt.y.squeeze(), ptcoupee_pt.x.squeeze()], 
                  tiles='Stamen Toner',
                  width=800,height=600,zoom_start=7)

folium.Choropleth(geo_data=popdata2map.__geo_interface__,
           data=popdata2color.slave2white_ratio,
           fill_color="Reds",
           fill_opacity=0.8,
           line_color="grey",
           key_on="feature.id",
           legend=True,
           legend_name="Ratio of Non-White Slaves to Free Whites"
          ).add_to(map3)

for i in lsc_locs.index:
    folium.CircleMarker(
        location=[lsc_locs.latitude[i], lsc_locs.longitude[i]],
        radius=8,
        popup=lsc_locs.name[i],
        color='red',
        fill=True,
        fill_color='black',
        fill_opacity=1
    ).add_to(map3)

map3
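
The choropleth cell works because both inputs share the same key. With `GISJOIN` as the index, the GeoDataFrame's `__geo_interface__` should expose each row's index value as the feature's `id`, and `key_on="feature.id"` tells Folium to match those ids against the index of the data series. The plain-Python sketch below imitates that matching step. It is a conceptual illustration only, not Folium's actual code, and the ids, values, and `lookup` helper are hypothetical.

```python
# A tiny GeoJSON-like FeatureCollection (geometries omitted for brevity)
geo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "G001", "properties": {}, "geometry": None},
        {"type": "Feature", "id": "G002", "properties": {}, "geometry": None},
        {"type": "Feature", "id": "G003", "properties": {}, "geometry": None},
    ],
}

# Plays the role of the Series indexed by GISJOIN
values = {"G001": 0.8, "G002": 2.4}

def lookup(feature, key_on):
    """Follow a dotted path such as 'feature.id' into a feature dictionary."""
    obj = feature
    for part in key_on.split(".")[1:]:  # skip the leading 'feature'
        obj = obj[part]
    return obj

# Pair every feature with its data value; None marks a feature with no data
joined = {}
for feature in geo["features"]:
    key = lookup(feature, "feature.id")
    joined[key] = values.get(key)

unmatched = [key for key, value in joined.items() if value is None]
print(joined)
print("features without data:", unmatched)
```

If a choropleth comes out in a single color, mismatched keys are a common cause, so comparing the feature ids with the data index is a useful first check.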

## Any Questions?

---

## Saving Data from Google Colab to Google Drive

Lastly, to save data from Google Colab to your Google Drive, you should adapt the following code:

In [0]:
# Set up PyDrive so that Colaboratory can read from and write to Google Drive:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from google.colab import files
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
# Save the GeoDataFrame as a shapefile in the Colab session.
# (geopandasFile is a placeholder; use the name of your own GeoDataFrame.)
# The saved files appear in the "Files" panel on the left-hand side,
# next to "Table of contents" and "Code snippets".

geopandasFile.to_file("geopandasFileAsShapefile.shp")


In [0]:
# In theory you can right-click and download each file (.shp, .shx, .prj, .dbf, .cpg).
# If that doesn't work, run the code below to save the files to
# your Google Drive folder.

# Files with format: .shp
# ('name_of_gdf.shp' is a placeholder; use the file name you saved above.)
# PyDrive stores the Drive file name under the 'title' key.
uploaded = drive.CreateFile({'title': 'name_of_gdf.shp'})
uploaded.SetContentFile('name_of_gdf.shp')
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))

# Repeat the above process for the files with these extensions:
# .shx
# .prj
# .dbf
# .cpg
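
Rather than repeating the upload by hand for each extension, the component files can be collected in a loop. The sketch below is one possible approach using only the standard library. The helper name `shapefile_parts` and the file name `name_of_gdf.shp` are placeholders, and the upload step is left as comments because it needs the authenticated `drive` client from the Colab session.

```python
from pathlib import Path

# Component files that commonly accompany a shapefile
SHAPEFILE_EXTENSIONS = [".shp", ".shx", ".prj", ".dbf", ".cpg"]

def shapefile_parts(shp_path):
    """Return the existing component files that share a shapefile's base name."""
    path = Path(shp_path)
    candidates = [path.with_name(path.stem + ext) for ext in SHAPEFILE_EXTENSIONS]
    return [p for p in candidates if p.exists()]

# In Colab, each part could then be uploaded with the `drive` client, e.g.:
# for part in shapefile_parts("name_of_gdf.shp"):
#     uploaded = drive.CreateFile({'title': part.name})
#     uploaded.SetContentFile(str(part))
#     uploaded.Upload()

print(shapefile_parts("name_of_gdf.shp"))
```

Only files that actually exist are returned, so a shapefile written without a `.cpg` or `.prj` file will not cause an error.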


# Spatial Data Processing



Spatial relationship queries return `True` or `False` when comparing geometries based on a spatial relationship predicate. 

Geometric processing operations, on the other hand, construct new geometries from one or more input geometries. 

These transformations, which are also called **geoprocessing**, make up the bulk of spatial preprocessing operations - the work you do to prepare your data for analysis!

## Common Types of Geoprocessing operations

Below is a list of some common types of geoprocessing operations.

- Coordinate system transformations
- Dimensionality transformations (points to polygons, polygons to points or lines, lines to polygons or points)
- Geometric Aggregations (simplify, dissolve / groupby operations on geometries)
- Spatial overlay operations that perform set operations on input geometries and return new geometries that are the set intersection, union, difference.
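
The difference between a relationship query and a geoprocessing operation can be seen in a few lines of `shapely`, the geometry package that GeoPandas builds on. This is a minimal sketch with made-up geometries in arbitrary planar units; the variable names are illustrative only.

```python
from shapely.geometry import Point, Polygon

# Hypothetical geometries: a square parcel and a nearby site
parcel = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])
site = Point(5, 2)

# Spatial relationship queries return True or False
print(parcel.contains(site))
print(parcel.intersects(site))

# Geoprocessing operations construct new geometries
zone = site.buffer(2)                # point to polygon (dimensionality change)
center = parcel.centroid             # polygon to point
overlap = parcel.intersection(zone)  # overlay: part of the parcel inside the zone
remainder = parcel.difference(zone)  # overlay: part of the parcel outside the zone

print(zone.geom_type, center.geom_type, overlap.geom_type)
print(overlap.area, remainder.area, parcel.area)
```

The same methods are available on a GeoSeries, where they are applied to every geometry in the column at once.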

An in-depth review of all of the types of and methods for geoprocessing is beyond the scope of this workshop.   A good way to get an overview is to work through the different sections of the GeoPandas documentation.

Instead, we will work through a few of these as we explore our historical Louisiana data.

### Question

What types of geoprocessing operations have we explored so far?


## End Note

We encourage you to explore the resources listed below to learn more about geoprocessing and other operations in GeoPandas!

### Any Questions?
---


# Next Steps



### Start with the Package Documentation

- [GeoPandas Documentation](http://geopandas.org/)
- [Shapely Documentation](https://shapely.readthedocs.io/en/stable/)

### Check out the excellent Kaggle Tutorial on Geospatial Analysis  
- <https://www.kaggle.com/learn/geospatial-analysis>
- They also have a great Pandas tutorial.

### For a deep dive, check out the SciPy 2018 workshop on GeoPandas

- The notebooks are [here](https://github.com/geopandas/scipy2018-geospatial-data).
- And be sure to watch the [YouTube video](https://www.youtube.com/watch?v=kJXUUO5M4ok). 

### More GeoPandas Practice

- Try this [GeoPandas tutorial](https://www.datacamp.com/community/tutorials/geospatial-data-python) on plotting the path of Hurricane Florence.



### Interactive mapping 

- The [mplleaflet](https://github.com/jwass/mplleaflet) and [folium](https://github.com/python-visualization/folium) packages are popular choices for creating interactive web maps in Python notebooks. Check out the online documentation and do a web search for an online tutorial to get started.


# Thank you!


---

Last updated 11/12/2019 by Sergio Castellanos (sergioc [at] berkeley [dot] edu)