<a href="https://colab.research.google.com/github/cul-data-club/meetings/blob/main/2022/march-24-geopandas/Hello%2C%20GeoPandas!.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hello, GeoPandas (2022 Edition)!

[GeoPandas](https://geopandas.org/en/stable/) is a library that aims to make working with spatial data in Python “easier,” largely by incorporating the [Pandas](https://pandas.pydata.org/) syntax we know and love.

Google Colab does not ship with GeoPandas, so step one is to install it and a few other libraries, which we'll do in the cell below. If you're trying to install it on your home machine with Anaconda, you'll have to use the `-c conda-forge` flag, like this:

```
conda install -c conda-forge geopandas
```

You can also install from Anaconda Navigator.

In [None]:
# Install geopandas and other spatial libraries
# You only need to run this cell once per session

import sys
!{sys.executable} -m pip install rtree
!{sys.executable} -m pip install geopandas
# !{sys.executable} -m pip install fiona
# !{sys.executable} -m pip install geoplot
# !{sys.executable} -m pip install shapely
# !{sys.executable} -m pip install pyproj

Geographic data can take several primitive forms. The [GeoData@Columbia](https://geodata.library.columbia.edu/) library offers ten different primitive formats the data can take, but they boil down to four, more or less:

1. **Points** With point data, every observation/row/member is at least two coordinates. Each point is independent of the others.
2. **Lines** Instead of one point, every observation/row/member is at least two points connected with a line, where order matters.
3. **Polygons** Like lines, except the lines close to make shapes with calculable areas.
4. **Rasters** “Pictures” of the area under study, where each pixel represents a certain amount of space, like with satellite photography or other remote sensing data sources.

The first three types, as a whole, are called “vector data.”

For vector data, every observation/row/member will typically have other properties that can take familiar data types: numeric variables (discrete or continuous) and categorical variables.

GeoPandas, then, merges the “geometry” of an observation/row/member with its other properties to create a dataframe with geometries.
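
As a rough sketch of that idea, the cell below builds a tiny GeoDataFrame by hand from one point, one line, and one polygon. The column names and coordinates are made up for illustration, and the geometries come from Shapely, which GeoPandas uses under the hood.

```python
import geopandas
from shapely.geometry import LineString, Point, Polygon

# Hypothetical toy records: ordinary attribute columns plus one geometry per row
toy = geopandas.GeoDataFrame(
    {
        "name": ["a point", "a line", "a polygon"],
        "visitors": [10, 20, 30],
    },
    geometry=[
        Point(0, 0),                                # a single coordinate pair
        LineString([(0, 0), (1, 1), (2, 0)]),       # ordered points joined by a line
        Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),  # a closed ring with an area
    ],
)

print(toy.geom_type.tolist())  # ['Point', 'LineString', 'Polygon']
print(toy.area.tolist())       # only the polygon has a non-zero area
```

The attribute columns behave like any Pandas columns, while the `geometry` column carries the spatial information.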

Even though geospatial data typically only has the four primitives mentioned above (often in some mixture), the data can be *formatted* in many, many ways. For GeoPandas, we will look at two file formats:

1. **Shapefile** Created by Esri, the company behind ArcGIS, [shapefiles](https://en.wikipedia.org/wiki/Shapefile) are an established vector format. Every shapefile is actually a combination of files, including one that ends in `*.shp`, which are often bundled together as a `.zip`. GeoPandas can read them even as `.zip` files without unbundling.
2. **GeoJSON** A comparative newcomer to geospatial data encoding, [GeoJSON](http://geojson.org/) encodes all of the data into a giant, plain text file formatted as JSON, or JavaScript Object Notation. As such, every GeoJSON data file is also a valid JavaScript object. With only one file, GeoJSON is somewhat more portable than shapefiles, and the file format is especially web-friendly.
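
To give a feel for what a GeoJSON file contains, here is a minimal sketch that builds one by hand using only Python's standard `json` module. The feature and its `name` property are invented for illustration. Note that GeoJSON stores coordinates as longitude first, then latitude.

```python
import json

# Hypothetical toy data: a FeatureCollection holding a single point feature
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            # GeoJSON coordinate order is [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [-73.9626, 40.8075]},
            "properties": {"name": "example point"},
        }
    ],
}

# Serialize to plain text, as it would be stored in a .json or .geojson file
text = json.dumps(feature_collection, indent=2)

# Parse it back and inspect the first feature
parsed = json.loads(text)
first = parsed["features"][0]
print(first["geometry"]["type"], first["properties"]["name"])
```

Each feature pairs a `geometry` with a `properties` dictionary, which maps neatly onto one row of a GeoDataFrame.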

You can create your own toy GeoJSON data at [http://geojson.io/](http://geojson.io/)

In fact, go ahead and do so, and save your file as `test.json` or something similar. Then you can upload the file to your Colab.

Now let’s import GeoPandas and fire up inline Matplotlib.

In [None]:
import geopandas
%matplotlib inline

GeoPandas has [three datasets built in](https://geopandas.org/en/stable/docs/reference/api/geopandas.datasets.available.html): two from [Natural Earth](http://naturalearth.org), and one of NYC. Much like the `read_*` functions in regular Pandas, the [`read_file()`](https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html) function creates a GeoDataFrame from a file. Here, we can read in the built-in NYC data. Note that the `geopandas.datasets` module was deprecated in GeoPandas 0.14 and removed in 1.0, so the cell below needs an older release.

GeoDataFrames have a built-in [`.plot()`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.plot.html) method.

In [None]:
nyc = geopandas.read_file(geopandas.datasets.get_path('nybb'))
nyc.plot()

They also have an [`.explore()`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.explore.html) method that creates an interactive map with some more context. This relies on a version of [folium](https://python-visualization.github.io/folium/) that's newer than what is installed by default in Colab as of this writing.

In [None]:
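import folium
from IPython.display import display

# Compare only the major and minor parts so versions like "0.12.1.post1" still parse
major, minor = (int(number) for number in folium.__version__.split(".")[:2])
if (major, minor) >= (0, 10):
  # Inside an if block the map is not shown automatically, so display it explicitly
  display(nyc.explore("Shape_Area"))
else:
  print("Version of folium is too low to use the explore method in GeoPandas.")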

The GeoPandas documentation features a clarifying graphic that describes how a GeoDataFrame differs from a regular Pandas DataFrame:

![GeoDataFrame schematic](https://geopandas.org/en/stable/_images/dataframe.svg)

The index section and the data section work more or less exactly like they do in Pandas, but GeoPandas adds another column for geometry, which holds the spatial information. This is similar to how tables look in [PostGIS](https://postgis.net/). In shapefile language, the GeoDataFrame is a combination of the `.shp` file (the geometries) and the `.dbf` file (the attribute table).

Back to GeoPandas. GeoDataFrames have a [`.crs`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_crs.html) property that gives us the coordinate reference system, usually identified by an EPSG code. We can subsequently look that code up like so: [http://spatialreference.org/ref/epsg/2263/](http://spatialreference.org/ref/epsg/2263/)

In [None]:
nyc.crs

We can read in our own GeoJSON file now, but note that the CRS is different from the NYC data’s.

In [None]:
gdf = geopandas.read_file("https://raw.githubusercontent.com/cul-data-club/meetings/main/2019/geopandas/test.json")
gdf.crs

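Luckily, unifying the CRSes is rather trivial: reproject one GeoDataFrame into the other’s CRS with `.to_crs()`. **Note:** reprojecting is not the same as merely setting the CRS. `.to_crs()` transforms the coordinates, while `.set_crs()` only labels the existing coordinates and leaves them unchanged.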

In [None]:
# Reproject our GeoJSON data into the same CRS as the NYC data
gdf = gdf.to_crs(nyc.crs)
gdf.crs
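
The difference between reprojecting and relabeling can be sketched with a single toy point. The coordinates here are invented for illustration and are assumed to be longitude/latitude (EPSG:4326). `to_crs()` changes the coordinate values, while `set_crs()` with `allow_override=True` keeps the same numbers under a new label.

```python
import geopandas
from shapely.geometry import Point

# Hypothetical point given as longitude/latitude (EPSG:4326)
pt = geopandas.GeoDataFrame(geometry=[Point(-73.9626, 40.8075)], crs="EPSG:4326")

# Reprojecting: the coordinates are transformed into EPSG:2263
reprojected = pt.to_crs("EPSG:2263")

# Relabeling: the coordinates stay the same, only the CRS metadata changes
relabeled = pt.set_crs("EPSG:2263", allow_override=True)

print(pt.geometry.iloc[0].x)           # original longitude
print(reprojected.geometry.iloc[0].x)  # a very different number after reprojection
print(relabeled.geometry.iloc[0].x)    # unchanged, but now mislabeled as EPSG:2263
```

Relabeling is only appropriate when the CRS metadata is missing or known to be wrong. For lining up two correctly labeled datasets, `to_crs()` is the one to reach for.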

As mentioned above, GeoDataFrames behave much like regular DataFrames.

In [None]:
nyc.head()

In [None]:
gdf.head()

In [None]:
gdf[gdf.sentiment.str.contains("happy")]

We can plot data together by using one plot as the `ax` for the other.

In [None]:
base = nyc.plot("Shape_Area", legend=True, figsize=(10, 10))
gdf.plot(ax=base, color="red", edgecolor="white")

## NYC MTA data

Now let’s grab the [subway station location data](https://data.cityofnewyork.us/Transportation/Subway-Stations/arq3-7z49) from the City of New York. Export it as a shapefile and upload the `.zip` to Colab.

In [None]:
stations = geopandas.read_file("./Subway Stations.zip")
stations.head()

In [None]:
stations.plot()

## Conclusion

And that's about it. The next cell has a commented-out method you can use on a GeoDataFrame to do further analysis. I would have included more, but things start breaking.

In [None]:
# stations["buffered"] = stations.buffer(1000)
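
One possible source of trouble with buffering is units: `.buffer()` measures distance in the units of the CRS, so on longitude/latitude data a value of 1000 would mean 1,000 degrees. The sketch below uses an invented toy point rather than the real station data, and it assumes the data arrives as longitude/latitude (EPSG:4326). It reprojects to EPSG:2263, whose units are US survey feet, before buffering.

```python
import geopandas
from shapely.geometry import Point

# Hypothetical station location as longitude/latitude (EPSG:4326)
toy_station = geopandas.GeoDataFrame(
    {"name": ["example station"]},
    geometry=[Point(-73.9626, 40.8075)],
    crs="EPSG:4326",
)

# Reproject to EPSG:2263 (units: US survey feet) so the buffer distance is meaningful
projected = toy_station.to_crs("EPSG:2263")

# Buffer by 1,000 feet; the result is a GeoSeries of polygons approximating circles
buffered = projected.buffer(1000)

print(buffered.geom_type.iloc[0])
print(round(buffered.area.iloc[0]))  # a little under pi * 1000**2 square feet
```

If your own data is already in a projected CRS, check `.crs` first and skip the reprojection step.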


Finally, GeoPandas has a [gallery page](https://geopandas.org/en/stable/gallery/index.html) where you can see how others are using it. 

As I've mentioned in emails, GeoPandas is being used in spatial data science courses, as well. Yoh Kawano at UCLA uses it in his [intro to GIS and spatial data science course](https://yohman.github.io/22W-UP206A/).