<font color="white">.</font> | <font color="white">.</font> | <font color="white">.</font>
-- | -- | --
![NASA](http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg) | <h1><font size="+3">ASTG Python Courses</font></h1> | ![NASA](https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png)

---

<center><h1><font color="red" size="+3">Introduction to GeoPandas</font></h1></center>

## Reference Documents

* [GeoPandas User Guide](https://geopandas.org/en/stable/docs/user_guide.html)
* [GeoPandas Tutorial](http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/geopandas.html)

_______

# <font color="red"> What is GeoPandas? </font>

- A Python library for working with vector geospatial data (such as shapefiles) as tabular data, like Pandas, where every row is associated with a geometry. 
- Provides access to many spatial functions for manipulating geometries, plotting maps, and geocoding. 
- Extends the capabilities of Pandas to enable spatial operations. 
- Includes new data types such as `GeoDataFrame` and `GeoSeries`, which are subclasses of the Pandas DataFrame and Series and enable efficient vector data processing in Python. 
- Is built on top of the following libraries that allow it to be spatially aware:
  - `Shapely` for geometric operations (e.g. buffer, intersection)
  - `PyProj` for working with projections
  - `Fiona` for file input and output.

# <font color="red"> GeoPandas Data Structure </font>

GeoPandas implements two main data structures:
- GeoSeries
- GeoDataFrame. 

These are subclasses of pandas.Series and pandas.DataFrame, respectively.

### GeoSeries
- A vector where each entry in the vector is a set of shapes corresponding to one observation. 
- An entry may consist of only one shape (like a single polygon) or multiple shapes that are meant to be thought of as one observation (like the many polygons that make up the State of Hawaii or a country like Indonesia).

#### Attributes and Methods for GeoSeries
The GeoSeries class implements nearly all of the attributes and methods of Shapely objects. When applied to a GeoSeries, they will apply elementwise to all geometries in the series.

Some important attributes are:
- `area`: shape area (units of projection)
- `bounds`: minimum and maximum coordinates on each axis for each shape (`minx`, `miny`, `maxx`, `maxy`)
- `total_bounds`: minimum and maximum coordinates on each axis for the entire GeoSeries
- `geom_type`: type of geometry.
- `is_valid`: tests whether the coordinates form a valid geometric shape.

Some basic methods are:
- `distance(other)`: returns a Series with the minimum distance from each entry to `other`
- `centroid`: returns a GeoSeries of centroids (an attribute, accessed without parentheses)
- `representative_point()`: returns GeoSeries of points that are guaranteed to be within each geometry. It does NOT return centroids.
- `to_crs()`: change coordinate reference system.
- `plot()`: plot GeoSeries.
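
As a minimal sketch of the attributes and methods listed above, the cell below builds a small GeoSeries from two made-up polygons (a unit square and a U-shaped polygon; the coordinates are illustrative assumptions and no CRS is attached). It also shows why `representative_point()` differs from `centroid`: the centroid of the U shape falls in the gap between its arms.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two illustrative shapes with made-up coordinates (no CRS attached):
# a unit square and a U-shaped polygon.
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
u_shape = Polygon([(3, 0), (6, 0), (6, 3), (5, 3), (5, 1), (4, 1), (4, 3), (3, 3)])
shapes = gpd.GeoSeries([square, u_shape])

print(shapes.area)          # one area per geometry
print(shapes.bounds)        # minx, miny, maxx, maxy for each geometry
print(shapes.total_bounds)  # minx, miny, maxx, maxy for the whole series
print(shapes.geom_type)     # geometry type of each entry
print(shapes.is_valid)      # validity check for each entry

# Minimum distance from each shape to a single point.
print(shapes.distance(Point(2, 0.5)))

# Compare centroids with representative points (elementwise tests).
centroids = shapes.centroid
inside_points = shapes.representative_point()
print(shapes.contains(centroids))      # the U shape does not contain its centroid
print(shapes.contains(inside_points))  # representative points lie inside
```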

### GeoDataFrame
- A tabular data structure that contains a GeoSeries.
- It always has one GeoSeries column that holds a special status. 
- This GeoSeries is referred to as the GeoDataFrame’s “geometry”. 
- When a spatial method is applied to a GeoDataFrame (or a spatial attribute like `area` is called), the operation always acts on the “geometry” column.
- The geometry column defines a point, line, or polygon associated with the rest of the columns. This column is a collection of shapely objects. Whatever you can do with shapely objects, you can also do with the geometry object.
- The Coordinate Reference System (CRS) of the geometry column tells us where each point, line, or polygon lies on the Earth's surface. GeoPandas uses it to place the geometries on a map.
- The “geometry” column – no matter its name – can be accessed through the geometry attribute (`gdf.geometry`), and the name of the `geometry` column can be found by typing `gdf.geometry.name`.

A GeoDataFrame may also contain other columns with geometrical (shapely) objects, but only one column can be the active geometry at a time.
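
The idea of an active geometry can be sketched as follows. The table, the column name `buffered`, and the coordinates are hypothetical and chosen only for illustration; `set_geometry` is the GeoPandas method that switches the active column.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical table with two geometry columns: a site location and a
# 1-unit buffer around it. Names and coordinates are made up.
sites = gpd.GeoDataFrame(
    {"site": ["A", "B"]},
    geometry=[Point(0, 0), Point(3, 4)],
)
sites["buffered"] = sites.geometry.buffer(1.0)

print(sites.geometry.name)  # name of the active geometry column
print(sites.area)           # computed on the points, so the areas are zero

# Switch the active geometry; spatial attributes now use the buffers.
sites_buffered = sites.set_geometry("buffered")
print(sites_buffered.geometry.name)
print(sites_buffered.area)  # close to pi for a buffer of radius 1
```

Both columns stay in the table; only the column that spatial operations act on changes.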

---

## Required Packages

```
   Matplotlib
   Pandas
   GeoPandas
   mapclassify
```

----

### <font color="red">Uncomment and run the cell below only if in Google Colab</font>

In [None]:
#!sudo apt-get update && apt-get install -y libspatialindex-dev
#!pip install rtree
#!pip install geopandas
#!pip install mapclassify

----

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib.ticker as mticker

In [None]:
import pandas as pd  
# combines the capabilities of pandas and shapely for geospatial operations
import geopandas as gpd  
# geometry classes for building points and polygons
from shapely.geometry import Point, Polygon, MultiPolygon 
# stands for "well known text," allows for interchange across GIS programs
from shapely import wkt 
# spatial index that can speed up spatial joins
import rtree  
# visualize all columns in dataframe
pd.set_option('display.max_columns', None)  

In [None]:
print(f"Version of Pandas:    {pd.__version__}")
print(f"Version of GeoPandas: {gpd.__version__}")

# <font color="red"> Creating GeoDataFrame </font>

- We start with a Pandas DataFrame that has latitude and longitude coordinates as columns representing locations of cities.
- We perform transformations to create a GeoPandas GeoDataFrame that includes the "geometry" column (representing points).

[Mapping in Python](https://cybergisxhub.cigi.illinois.edu/notebook/spatial-data-exploration-and-visualization-on-google-colab/)

In [None]:
cities = ['Paris', 'New York', 'Mumbai', 'Tokyo', 
          'Moscow', 'Mexico City', 'Sao Paulo', 'Yaounde', 
          'Vancouver', 'Sydney', 'Harare']
countries = ['France', 'USA', 'India', 'Japan', 
             'Russia', 'Mexico', 'Brazil', 'Cameroon', 
             'Canada', 'Australia', 'Zimbabwe']
longitudes = [2.25, -73.92, 72.83, 139.69, 37.36, -99.13, 
              -46.63, 11.50, -123.08, 151.20, 31.0]
latitudes = [48.85, 40.69, 19.08, 35.68, 55.45, 19.43,
             -23.55, 3.84, 49.32, -33.87, -18.0]

df = pd.DataFrame({
    'City': cities,
    'Country': countries,
    'Longitude': longitudes,
    'Latitude': latitudes
})
df


We generate coordinate pairs by zipping the longitude and latitude together (in that order, since x comes before y) and store them in a new column named `Coordinates`.



In [None]:
df["Coordinates"] = list(zip(df.Longitude, df.Latitude))
df

- Our next step is to turn each tuple into a `Shapely` `Point` object.
- We do this by applying Shapely’s `Point` constructor to the `Coordinates` column.

In [None]:
df["Coordinates"] = df["Coordinates"].apply(Point)
df

- Finally, we convert our DataFrame into a GeoDataFrame by calling the `geopandas.GeoDataFrame` constructor.
- GeoDataFrame is a data structure with the convenience of a normal DataFrame but also an understanding of how to plot maps.

>The most important property of a GeoDataFrame is that it always has one GeoSeries column that holds a special status. This GeoSeries is referred to as the GeoDataFrame’s “geometry”. When a spatial method is applied to a GeoDataFrame (or a spatial attribute like `area` is called), the operation always acts on the “geometry” column.

In [None]:
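# the coordinates are longitude/latitude in degrees, so we declare WGS84 (EPSG:4326)
gdf = gpd.GeoDataFrame(df, geometry="Coordinates", crs="EPSG:4326")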
gdf.head()
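
As an alternative sketch, the zip and `Point` steps can be combined with `geopandas.points_from_xy`, which builds the point geometries directly from two coordinate columns. The two-city table below is an illustrative stand-in for `df`, and the choice of EPSG:4326 assumes the coordinates are longitude/latitude in degrees.

```python
import pandas as pd
import geopandas as gpd

# A small table in the same layout as the cities DataFrame (two cities only).
cities_df = pd.DataFrame({
    "City": ["Paris", "Tokyo"],
    "Longitude": [2.25, 139.69],
    "Latitude": [48.85, 35.68],
})

# points_from_xy takes x (longitude) first, then y (latitude).
# Declaring EPSG:4326 records that the coordinates are lon/lat degrees.
cities_gdf = gpd.GeoDataFrame(
    cities_df,
    geometry=gpd.points_from_xy(cities_df["Longitude"], cities_df["Latitude"]),
    crs="EPSG:4326",
)
print(cities_gdf.geometry.name)  # name of the active geometry column
print(cities_gdf.crs)
```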

When displayed, it does not look different from a vanilla Pandas DataFrame, but its type is different:

In [None]:
print('gdf is of type:', type(gdf))

How can we tell which column is the geometry column?

In [None]:
print('\nThe geometry column is:', gdf.geometry.name)

Plot the city locations:

In [None]:
gdf.plot()

# <font color="red"> Manipulating the World Map</font>

From [Spatial Analysis with Colab](https://cybergisxhub.cigi.illinois.edu/notebook/spatial-data-exploration-and-visualization-on-google-colab/)

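Obtain the dataset from the Natural Earth database. Note that `gpd.datasets` is only available in GeoPandas versions before 1.0; in later versions the dataset must be downloaded from Natural Earth directly.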

In [None]:
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
world.head()

We can set the index to be the country abbreviations.

In [None]:
world = world.set_index("iso_a3")
world.head()

`world` is a GeoDataFrame with the following columns:

- `pop_est`: Contains a population estimate for the country
- `continent`: The country’s continent
- `name`: The country’s name
- `iso_a3`: The country’s 3-letter abbreviation (now used as the index)
- `gdp_md_est`: An estimate of the country’s GDP
- `geometry`: A POLYGON or MULTIPOLYGON for each country

In [None]:
world.geometry.name

What is the CRS?

In [None]:
world.crs

A CRS has the following components:

- **Datum** - The reference system, which in our case defines the starting point of measurement (Prime Meridian) and the model of the shape of the Earth (Ellipsoid). The most common Datum is WGS84.

- **Area of use** - In our case, the area of use is the whole world, but there are many CRS that are optimized for a particular area of interest.

- **Axes and Units** - In a geographic CRS, longitude and latitude are measured in degrees. In a projected CRS, x and y coordinates are usually measured in meters.
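
These components can be inspected on any CRS object. The sketch below uses a single made-up point and reprojects it from WGS84 (EPSG:4326) to Web Mercator (EPSG:3857); the target CRS is an assumption chosen only because its axes are in meters.

```python
import geopandas as gpd
from shapely.geometry import Point

# One point at longitude 1 degree, latitude 0 degrees in WGS84 (EPSG:4326).
pts = gpd.GeoSeries([Point(1.0, 0.0)], crs="EPSG:4326")
print(pts.crs.datum)                               # datum of the CRS
print(pts.crs.area_of_use)                         # region the CRS is intended for
print([ax.unit_name for ax in pts.crs.axis_info])  # axis units (degrees)

# Reproject to Web Mercator (EPSG:3857), whose axes are in metres.
pts_m = pts.to_crs("EPSG:3857")
print([ax.unit_name for ax in pts_m.crs.axis_info])
print(pts_m.iloc[0].x)  # roughly 111 km east of the origin
```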

Show the world map:

In [None]:
world.plot();

Show the geometry of the USA:

In [None]:
world.loc["USA", 'geometry']

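Filter the data to exclude Antarctica, compute the GDP per capita, and plot it:

In [None]:
# .copy() avoids a SettingWithCopyWarning when the new column is added below
world_gdp = world[(world.pop_est>0) & (world.name!="Antarctica")].copy()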

world_gdp['gdp_per_cap'] = world_gdp.gdp_md_est / world_gdp.pop_est

world_gdp.plot(column='gdp_per_cap');

Add legend:

In [None]:
fig, axes = plt.subplots(1, 1)

world.plot(column='pop_est', ax=axes, legend=True);

Resize the colorbar:

In [None]:
from mpl_toolkits.axes_grid1 import make_axes_locatable

fig, axes = plt.subplots(1, 1)

divider = make_axes_locatable(axes)

cax = divider.append_axes("right", size="5%", pad=0.1)

world.plot(column='pop_est', ax=axes, legend=True, cax=cax);

We can use the `legend_kwds` argument to add more features:

In [None]:

fig, axes = plt.subplots(1, 1)

world.plot(column='pop_est',
           ax=axes,
           legend=True,
           legend_kwds={'label': "Population by Country",
                        'orientation': "horizontal"});

We can use the `cmap` argument to select the colormap:

In [None]:
world_gdp.plot(column='gdp_per_cap', cmap='OrRd');

To draw only the boundaries, with no fill color, you have two options:
- Use `world.plot(facecolor="none", edgecolor="black")`. 
- Use `world.boundary.plot()`. 

In [None]:
world_gdp.boundary.plot();

We can group the values into classes before coloring them by using the `scheme` option (this requires the `mapclassify` package):

In [None]:
world_gdp.plot(column='gdp_per_cap', 
               cmap='OrRd', 
               scheme='quantiles');
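
To see what a quantile classification does, here is a rough illustration with `pandas.qcut` on made-up numbers. This is not the code path GeoPandas uses (it delegates to `mapclassify`, whose bin edges may be handled slightly differently); it only shows the idea of equal-count classes.

```python
import pandas as pd

# Ten made-up values with one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# Five quantile classes: each class receives the same number of
# observations, so the outlier does not stretch the color scale.
classes = pd.qcut(values, q=5, labels=False)
print(classes.tolist())                     # class index (0 to 4) for each value
print(classes.value_counts().sort_index())  # two observations per class
```

With a continuous color scale, the single outlier would push the other nine values into nearly the same color; with quantile classes they are spread over the whole colormap.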

Plot cities on top of the map:

In [None]:
base = world_gdp.boundary.plot();
gdf.plot(ax=base, marker='o', color='red', markersize=5)

In [None]:
fig, axes = plt.subplots(figsize=(15, 10))
world_gdp.plot(ax=axes)
gdf.plot(ax=axes, marker='o', color='red', markersize=9)

# <font color="red"> US State Census Data</font>

In [None]:
state_df = gpd.read_file("http://www2.census.gov/geo/tiger/GENZ2020/shp/cb_2020_us_state_5m.zip")
state_df.head()

We can do a quick plot of the USA with state boundaries:

In [None]:
fig, axes = plt.subplots(figsize=(15, 10))
state_df.plot(ax=axes);

How can we restrict the map to the area covering the USA?

We first need to grab the spatial extent of the `state_df` object:

In [None]:
df_bounds = state_df.geometry.total_bounds
df_bounds

It is a NumPy array of four values: `(xmin, ymin, xmax, ymax)`.
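
A minimal sketch of how these four values map onto axis limits, using two made-up rectangles in place of the state polygons (the coordinates are illustrative assumptions):

```python
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import box

# Two made-up rectangles standing in for a layer of polygons.
layer = gpd.GeoSeries([box(0, 0, 2, 1), box(5, 3, 6, 7)])

# total_bounds is ordered (minx, miny, maxx, maxy).
minx, miny, maxx, maxy = layer.total_bounds

fig, ax = plt.subplots()
layer.plot(ax=ax)
# Restrict the view to the extent of the layer.
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
```

Any of the four values can be replaced with a hand-picked number to zoom in on part of the layer.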

In [None]:
fig, axes = plt.subplots(figsize=(15, 10))
xlim =([-176.0, -64.0])
ylim =([13.0, df_bounds[-1]])
axes.set_xlim(xlim)
axes.set_ylim(ylim)
state_df.plot(ax=axes);

In [None]:
norm = matplotlib.colors.LogNorm(vmin=state_df.ALAND.min(), vmax=state_df.ALAND.max())
fig, axes = plt.subplots(figsize=(15, 10))
xlim =([-176.0, -64.0])
ylim =([13.0, df_bounds[-1]])
axes.set_xlim(xlim)
axes.set_ylim(ylim)
state_df.to_crs('epsg:4326').plot("ALAND", 
                                  ax=axes, 
                                  legend=True,  
                                  norm=norm);


### <font color="blue">Zoom in on the State of Wisconsin</font>

Draw the map of the state:

In [None]:
fig, axes = plt.subplots(figsize=(10, 10))
state_df.query("NAME == 'Wisconsin'").plot(ax=axes, 
                                           edgecolor="black", 
                                           color="white")
plt.show()

#### Get the US County Census Data

In [None]:
county_df = gpd.read_file("http://www2.census.gov/geo/tiger/GENZ2020/shp/cb_2020_us_county_5m.zip")
county_df.head()

In [None]:
county_df.info()

#### Get the Data for the State of Wisconsin

In [None]:
# .copy() lets us modify the Wisconsin subset later without a SettingWithCopyWarning
wis_county_df = county_df.query("STATEFP == '55'").copy()
wis_county_df

#### Plot the map of the different counties in Wisconsin

In [None]:
fig, axes = plt.subplots(figsize=(10, 10))

state_df.query("NAME == 'Wisconsin'").plot(ax=axes, edgecolor="black", 
                                           color="white")

wis_county_df.plot(ax=axes, edgecolor="red", color="white")

plt.show()

#### Use 2016 Presidential Election Results

In [None]:
url = "https://datascience.quantecon.org/assets/data/ruhl_cleaned_results.csv"

pres_election_2016 = pd.read_csv(url, thousands=",")
pres_election_2016.head()

In [None]:
pres_election_2016.info()

In [None]:
pres_election_2016["county"]

In [None]:
pres_election_2016["county"] = pres_election_2016["county"].str.title()
pres_election_2016["county"] = pres_election_2016["county"].str.strip()

In [None]:
wis_county_df["NAME"] = wis_county_df["NAME"].str.title()
wis_county_df["NAME"] = wis_county_df["NAME"].str.strip()

In [None]:
res_states = wis_county_df.merge(
    pres_election_2016, 
    left_on="NAME", 
    right_on="county", 
    how="inner"
    )
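
An inner merge silently drops rows whose names do not match on both sides. One way to check for such rows, sketched here with made-up tables (the county names and totals are illustrative, not taken from the election data), is an outer merge with `indicator=True`:

```python
import pandas as pd

# Made-up county tables; the names and numbers are illustrative only.
counties = pd.DataFrame({"NAME": ["Adams", "Dane", "St. Croix"]})
results = pd.DataFrame({
    "county": ["Adams", "Dane", "Saint Croix"],
    "total": [100, 200, 300],
})

# An outer merge with indicator=True labels each row by its origin:
# 'both', 'left_only', or 'right_only'.
check = counties.merge(
    results, left_on="NAME", right_on="county",
    how="outer", indicator=True,
)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["NAME", "county", "_merge"]])
```

Here the two spellings of the same county fail to match, so each appears as an unmatched row.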

In [None]:
res_states.head()

In [None]:
%%time
res_states["trump_share"] = res_states["trump"] / (res_states["total"])
res_states["rel_trump_share"] = res_states["trump"] / (res_states["trump"]+res_states["clinton"])

In [None]:
res_states.head()

Show the vote map:

In [None]:
fig, axes = plt.subplots(figsize = (10,8))

# Plot the state
state_df[state_df['NAME'] == 'Wisconsin'].plot(ax=axes, edgecolor='black',color='white')
# Plot the counties and pass 'rel_trump_share' as the data to color
res_states.plot(
    ax=axes, edgecolor='black', column='rel_trump_share', 
    legend=True, cmap='RdBu_r',
    vmin=0.01, vmax=0.95
)

# Add text to let people know what we are plotting
axes.annotate('Republican vote share',
              xy=(0.76, 0.06),  xycoords='figure fraction')

# No axis with long and lat
plt.axis('off')

plt.show()

Number of counties won by each candidate:

In [None]:
res_states.eval("trump > clinton").sum()

In [None]:
res_states.eval("trump < clinton").sum()

Total number of votes obtained by each candidate:

In [None]:
res_states["trump"].sum()

In [None]:
res_states["clinton"].sum()

# <font color="red"> [Smithsonian Global Volcanism Database](https://volcano.si.edu/) </font>

In [None]:
server = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?'
query = 'service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=json'
gf = gpd.read_file(server+query)
gf.head()

In [None]:
gf.info()

In [None]:
gf.iloc[2]

#### Subsetting
- Often we only want points in a certain bounding box.
- Subsetting is very easy in GeoPandas with the `.cx` coordinate indexer. 

In [None]:
# bounding box: longitudes from -124 to -120, latitudes from 45 to 49
xmin, xmax, ymin, ymax = -124, -120, 45, 49
subset = gf.cx[xmin:xmax, ymin:ymax]
subset
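
The `.cx` indexer can be tried without downloading any data. The sketch below uses three made-up points, two of which fall inside the same bounding box:

```python
import geopandas as gpd
from shapely.geometry import Point

# Made-up points; only the first two fall inside the box below.
pts = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(-122, 46), Point(-121, 48), Point(-100, 40)],
)

# .cx slices by coordinates: x range first, then y range.
xmin, xmax, ymin, ymax = -124, -120, 45, 49
inside = pts.cx[xmin:xmax, ymin:ymax]
print(inside["name"].tolist())
```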

#### Plot the locations of volcanoes on the map of the world

In [None]:
fig, axes = plt.subplots(figsize=(15, 10))
world.plot(ax=axes, edgecolor="black", color="white")
gf.plot(ax=axes, marker='o', color='red', markersize=5);

### <font color="blue">Focus on Colombia</font>

#### Get the volcanoes located in Colombia

In [None]:
colombia = world.query('name == "Colombia"')
colombia

In [None]:
# the `op` argument was deprecated in favor of `predicate`
colombian_volcanoes = gpd.sjoin(gf, colombia, how="inner", predicate="within")
colombian_volcanoes
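
The spatial join can be sketched on a toy example. The points, the square region, and the names `v1`, `v2`, `v3`, and `Squareland` are all hypothetical; the `predicate` argument assumes GeoPandas 0.10 or later.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Made-up data: three points and one square "country".
points = gpd.GeoDataFrame(
    {"volcano": ["v1", "v2", "v3"]},
    geometry=[Point(1, 1), Point(2, 3), Point(10, 10)],
    crs="EPSG:4326",
)
regions = gpd.GeoDataFrame(
    {"country": ["Squareland"]},
    geometry=[box(0, 0, 5, 5)],
    crs="EPSG:4326",
)

# Keep each point that lies within a region and attach the region's attributes.
joined = gpd.sjoin(points, regions, how="inner", predicate="within")
print(joined[["volcano", "country"]])
```

Both layers should share the same CRS before joining; otherwise reproject one of them with `to_crs()` first.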

#### Plot the location of volcanoes on the map of Colombia

In [None]:
fig, axes = plt.subplots(figsize=(10, 10))
colombia.plot(ax=axes, edgecolor="black", color="white")
colombian_volcanoes.plot(ax=axes, marker='o', color='red', markersize=5);