# Introduction to GeoPandas

In this lesson, we'll learn about a package that is core to using geospatial data in Python: GeoPandas. We'll explore the structure of geospatial data (which, turns out, is not too different from DataFrames!), including geometries, shapefiles, and how to save your hard work.


<!--
Instructor Notes

Expected time to complete:
    Lecture + Questions: 30 minutes
    Exercises: 5 minutes
-->

## What is GeoPandas?

[GeoPandas](http://geopandas.org/) is a relatively new package that makes it easier to work with geospatial data in Python. In the last few years it has grown more powerful and stable, allowing Python practitioners to work with geospatial data more easily and flexibly, which had been difficult in the past. GeoPandas is now the go-to package for working with `vector` geospatial data in Python.

> **Protip**: If you work with `raster` data, check out the [rasterio](https://rasterio.readthedocs.io/en/latest/) package. We will not cover raster data in this tutorial.

GeoPandas gives you access to all of the functionality of [pandas](https://pandas.pydata.org/), which is the primary data analysis tool for working with tabular data in Python. GeoPandas extends pandas with attributes and methods for working with geospatial data. So, if you're familiar with pandas, working with geospatial data is a natural next step.

### Import Libraries

Let's start by importing the libraries that we will use. If you haven't already, you can install GeoPandas within this notebook:

In [None]:
# Install GeoPandas if you don't have it yet
%pip install geopandas

In [None]:
import pandas as pd
import geopandas as gpd

import matplotlib # Base python plotting library
import matplotlib.pyplot as plt # Submodule of matplotlib

# To display plots, maps, charts etc in the notebook
%matplotlib inline  

### Read in a Shapefile

As we discussed in the initial geospatial overview, a *shapefile* is one type of geospatial data that holds vector data. 

> To learn more about ESRI Shapefiles, this is a good place to start: [ESRI Shapefile Wiki Page](https://en.wikipedia.org/wiki/Shapefile) 

The tricky thing to remember about shapefiles is that they're actually a collection of 3 to 9+ files together. Here's a list of all the files that can make up a shapefile:
 
* `shp`: The main file that stores the feature geometry
* `shx`: The index file that stores the index of the feature geometry  
* `dbf`: The dBASE table that stores the attribute information of features 
* `prj`: The file that stores the coordinate system information. (should be required!)
* `xml`: Metadata: Stores information about the shapefile.
* `cpg`: Specifies the code page for identifying the character set to be used.

Despite the multiple files, the shapefile remains the most commonly used file format for vector spatial data, and it's really easy to visualize in one go!
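Because a shapefile is really a bundle of files, it can help to check which pieces are present before reading it. The following is a minimal sketch using only the standard library. The helper name `shapefile_components` is hypothetical (it is not part of GeoPandas), and it only looks at file names on disk.

```python
from pathlib import Path

# Extensions from the list above; the first three are the mandatory ones
REQUIRED = ("shp", "shx", "dbf")
OPTIONAL = ("prj", "xml", "cpg")


def shapefile_components(shp_path):
    """Return {extension: True/False} for the files that sit next to shp_path.

    Hypothetical helper for illustration only.
    """
    shp_path = Path(shp_path)
    found = {}
    for ext in REQUIRED + OPTIONAL:
        candidates = [shp_path.with_suffix("." + ext)]
        if ext == "xml":
            # Metadata is commonly stored as <name>.shp.xml, so check that too
            candidates.append(Path(str(shp_path) + ".xml"))
        found[ext] = any(path.exists() for path in candidates)
    return found


status = shapefile_components("../data/california_counties/CaliforniaCounties.shp")
missing = [ext for ext in REQUIRED if not status[ext]]
print(status)
print("Missing required files:", missing)
```

If `missing` is not empty, the shapefile is incomplete and is unlikely to read correctly.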

Let's try it out with California counties, and use GeoPandas for the first time. We can use a flexible function called `gpd.read_file` to read in many different types of geospatial data. When using it, we'll specify the `shp` file:

In [None]:
# Read in the counties shapefile
counties = gpd.read_file('../data/california_counties/CaliforniaCounties.shp')

In [None]:
# Plot out California counties
counties.plot()

Bam! Amazing! We're off to a running start.

## Exploring the GeoPandas GeoDataFrame

Before we get in too deep, let's discuss what a *GeoDataFrame* is and how it's different from a pandas *DataFrame*.

A [GeoPandas GeoDataFrame](https://geopandas.org/data_structures.html#geodataframe), or `gdf` for short, is just like a pandas DataFrame (`df`) but with an extra geometry column as well as accompanying methods and attributes that work on that column. Let's emphasize this point, because it's important:

> A [GeoPandas GeoDataFrame](https://geopandas.org/data_structures.html#geodataframe), or `gdf` for short, is just like a pandas DataFrame (`df`) but with an extra geometry column as well as accompanying methods and attributes that work on that column.

This means all the methods and attributes of a pandas DataFrame also work on a GeoPandas GeoDataFrame!
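To make that concrete, here is a small sketch on a toy GeoDataFrame. The data is made up for illustration and is not the counties dataset.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Toy data (made up for illustration)
toy = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"], "value": [10, 20, 30]},
    geometry=[Point(0, 0), Point(1, 1), Point(2, 2)],
)

# A GeoDataFrame is a subclass of pandas.DataFrame
print(isinstance(toy, pd.DataFrame))

# Regular pandas operations work unchanged
print(toy["value"].mean())
print(toy.sort_values("value", ascending=False).head(2))

# The geometry column adds spatial attributes on top
print(toy.geometry.x.tolist())
```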

With that in mind, let's start exploring our dataframe just like we would do in pandas.

In [None]:
# Find the number of rows and columns in counties
counties.shape

In [None]:
# Look at the first couple of rows in our geodataframe
counties.head()

In [None]:
# Look at all the variables included in our data
counties.columns

It looks like we have a good amount of information about the total population for different years and the densities, as well as race, age, and occupancy info. Notice at the end - just like we promised - a geometry column containing many numbers. Let's explore what this means, next.

## Plot the GeoDataFrame

We're able to plot our GeoDataFrame because of the extra `geometry` column. What exactly does this column provide?

### GeoPandas Geometries

There are three main types of geometries that can be associated with your GeoDataFrame: points, lines and polygons.

<img src ="https://datacarpentry.org/organization-geospatial/fig/dc-spatial-vector/pnt_line_poly.png" width="450"></img>

In the GeoDataFrame, these geometries are stored as [Shapely](https://shapely.readthedocs.io/) objects and displayed in a format known as [Well-Known Text (WKT)](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry). Consider the following examples:

- POINT (30 10)
- LINESTRING (30 10, 10 30, 40 40)
- POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))

In each case, the x and y values within a coordinate pair are separated by a space, and coordinate pairs are separated by commas.
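As a quick sketch, the three WKT strings above can be parsed into geometry objects with Shapely, which is installed alongside GeoPandas. The variable names here are only for illustration.

```python
from shapely import wkt

examples = [
    "POINT (30 10)",
    "LINESTRING (30 10, 10 30, 40 40)",
    "POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))",
]

# Parse each WKT string into a Shapely geometry object
shapes = [wkt.loads(text) for text in examples]

for shape in shapes:
    # geom_type names the kind of geometry; .wkt converts back to text
    print(shape.geom_type, "->", shape.wkt)
```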

Your GeoDataFrame may also include the variants **multipoints, multilines, and multipolygons** if the row-level feature of interest is made up of multiple parts. For example, a GeoDataFrame of states, where one row represents one state, would have a POLYGON geometry for Utah but a MULTIPOLYGON for Hawaii, which includes many islands.

Note that it's OK to mix and match geometries of the same family, e.g., POLYGON and MULTIPOLYGON, in the same GeoDataFrame.
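The sketch below builds a tiny GeoDataFrame that mixes a single-part and a multi-part feature. The shapes and names are made up for illustration.

```python
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon

# Made-up shapes: one single-part feature and one two-part feature
mainland = Polygon([(0, 0), (4, 0), (4, 3), (0, 3)])
island_a = Polygon([(6, 0), (7, 0), (7, 1), (6, 1)])
island_b = Polygon([(8, 2), (9, 2), (9, 3), (8, 3)])

regions = gpd.GeoDataFrame(
    {"name": ["landlocked", "islands"]},
    geometry=[mainland, MultiPolygon([island_a, island_b])],
)

# One row per feature, even when a feature has several parts
print(regions.geom_type.tolist())

# The individual parts of a multi-part geometry are available through .geoms
print(len(regions.geometry.iloc[1].geoms))
```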

---

### Challenge 1

What kind of geometry would a GeoDataFrame containing roads have? What about one that includes landmarks in the San Francisco Bay Area?

---

You can check the types of geometries in a GeoDataFrame, or in a subset of it, by combining the `type` attribute with the `unique` method:

In [None]:
# Let's check what geometries we have in our counties GeoDataFrame
counties['geometry'].head()

In [None]:
# Let's check to make sure that we only have polygons and multipolygons 
counties['geometry'].type.unique()

In [None]:
counties.plot()

Just like with other plots you can make in Python, we can start customizing our map with colors, size, etc.

In [None]:
# We can run the following line of code to get more info about the parameters we can specify:
?counties.plot

In [None]:
# Make the figure size bigger
counties.plot(figsize=(6, 9))

In [None]:
# Customize our plot further
counties.plot(figsize=(6, 9), 
              edgecolor='grey', # Grey colored border lines
              facecolor='pink', # Fill in our counties as pink
              linewidth=2)      # Make the linewidth larger
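Since `plot` returns a matplotlib Axes object, you can keep customizing the map with regular matplotlib calls. Here is a minimal sketch on made-up square "counties"; the title text and data are only for illustration.

```python
import geopandas as gpd
from shapely.geometry import box

# Two made-up square "counties" side by side
squares = gpd.GeoDataFrame(
    {"name": ["west", "east"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)

# .plot() returns a matplotlib Axes, so the usual Axes methods apply
ax = squares.plot(figsize=(6, 3), edgecolor="grey", facecolor="pink", linewidth=2)
ax.set_title("Toy counties")
ax.set_axis_off()  # Hide the coordinate axes for a cleaner map
```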

## Subset the GeoDataFrame

Since we'll be focusing on Berkeley later in the workshop, let's subset Alameda County from our GeoDataFrame:

In [None]:
# See all county names included in our dataset
counties['NAME'].values

It looks like Alameda County is specified as "Alameda" in this dataset.

So, let's create a new GeoDataFrame called `alameda_county` that is a subset of our counties GeoDataFrame:

In [None]:
alameda_county = counties.loc[counties['NAME'] == 'Alameda'].copy().reset_index(drop=True)

In [None]:
alameda_county

In [None]:
# Plot our newly subsetted GeoDataFrame
alameda_county.plot()

Nice! Looks like we have what we were looking for.

You can also plot one or more counties on the fly, without saving the subset to a new GeoDataFrame:

In [None]:
bay_area_counties = ['Alameda',
                     'Contra Costa',
                     'Marin',
                     'Napa',
                     'San Francisco', 
                     'San Mateo',
                     'Santa Clara',
                     'Santa Cruz',
                     'Solano',
                     'Sonoma']
counties.loc[counties['NAME'].isin(bay_area_counties)].plot()
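Both subsetting patterns above rely on a boolean mask, exactly as in pandas. The following sketch shows the mask explicitly on a toy GeoDataFrame; the point locations are made up and do not correspond to the real counties.

```python
import geopandas as gpd
from shapely.geometry import Point

places = gpd.GeoDataFrame(
    {"NAME": ["Alameda", "Marin", "Fresno", "Napa"]},
    geometry=[Point(0, 0), Point(1, 1), Point(2, 2), Point(3, 3)],
)

# Boolean mask: True for the rows we want to keep
mask = places["NAME"].isin(["Alameda", "Napa"])

# .copy() gives an independent object; reset_index renumbers the rows from 0
subset = places.loc[mask].copy().reset_index(drop=True)

print(type(subset).__name__)
print(subset["NAME"].tolist())
```

The result of the subset is still a GeoDataFrame, so `plot` and the other spatial methods remain available.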

## Save Your Data

Let's not forget to save our Alameda County GeoDataFrame, `alameda_county`. This way we won't need to repeat the subsetting steps we did above.

We can save it as a shapefile:

In [None]:
alameda_county.to_file("../data/outdata/alameda_county.shp")

One of the problems of saving to a shapefile is that our column names get truncated to 10 characters (this is a shapefile limitation). 

Instead of renaming every column to an obscure name of 10 characters or fewer, we can save our GeoDataFrame in spatial data file formats that do not have this limitation, such as [GeoJSON](https://en.wikipedia.org/wiki/GeoJSON) or [GPKG](https://en.wikipedia.org/wiki/GeoPackage) (geopackage) files.

These formats have the added benefit of outputting only one file in contrast to the multi-file shapefile format.

In [None]:
alameda_county.to_file("../data/outdata/alameda_county.json", driver="GeoJSON")

In [None]:
alameda_county.to_file("../data/outdata/alameda_county.gpkg", driver="GPKG")

You can read these in, just as you would a shapefile with `gpd.read_file`:

In [None]:
alameda_county_test2 = gpd.read_file("../data/outdata/alameda_county.json")
alameda_county_test2.plot()

In [None]:
alameda_county_test = gpd.read_file("../data/outdata/alameda_county.gpkg")
alameda_county_test.plot()

There are also many other formats we could use for data output.

**NOTE**: If you're working with point data (i.e. a single latitude and longitude value per feature),
then CSV might be a good option!
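As a sketch of that round trip, the example below turns a small CSV with latitude and longitude columns into a GeoDataFrame using `gpd.points_from_xy`, then writes plain columns back out. The landmark coordinates are approximate and only illustrative, and the CRS `EPSG:4326` (plain latitude/longitude in WGS84) is an assumption about the data.

```python
import io

import geopandas as gpd
import pandas as pd

# Illustrative, approximate coordinates; in practice this would be a CSV file on disk
csv_text = """name,latitude,longitude
Campanile,37.8721,-122.2578
Ferry Building,37.7955,-122.3937
"""

df = pd.read_csv(io.StringIO(csv_text))

# x is longitude and y is latitude; EPSG:4326 is assumed here
landmarks = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
    crs="EPSG:4326",
)
print(landmarks)

# Going the other way: drop the geometry and write the plain columns back to CSV text
csv_out = pd.DataFrame(landmarks.drop(columns="geometry")).to_csv(index=False)
print(csv_out)
```

Note the argument order: longitude comes first because it is the x coordinate.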

## Overview

In this lesson, we learned about:

- The `geopandas` package
- Reading in shapefiles
    - `gpd.read_file`
- GeoDataFrame structures
    - `shape`, `head`, `columns`
- Plotting GeoDataFrames
    - `plot`
- Subsetting GeoDataFrames
    - `loc`
- Saving out GeoDataFrames
    - `to_file`

---

### Challenge 2: IO, Manipulation, and Mapping

Now, you'll get a chance to practice the operations we learned above.

In the following cell, compose code to:

1. Read in the California places data (`../data/census/Places/cb_2018_06_place_500k.zip`).
2. Subset "Berkeley" from the data.
3. Plot, and customize as desired.
4. Save out as a shapefile (`berkeley_places.shp`).

*Note: reading in a zipped shapefile uses the same syntax as reading in a regular shapefile. The only difference is that instead of passing just the filepath, you'll want to write `zip://../data/census/Places/cb_2018_06_place_500k.zip`*

---

In [None]:
# YOUR CODE HERE
