# Working with geographical data (Part 1)

## Using `GeoPandas` _GeoDataFrames_ 

<img style="float: left; width:300px" src="img/panda_globe.png"/>

<img src="img/geoloc.png"/>

<br style="clear: left"/>

## COMM3180 Spring 2026

### Instructor: Matt O'Donnell (mbod@asc.upenn.edu)

-----

## Overview

* This notebook works through the basic steps of working with datasets that have a geographical dimension. This could mean that specific locations (often _longitude_ and _latitude_ coordinates) are given to locate a place or event, or that geometric shapes are defined for the boundaries of an area (e.g. a street, campus, political ward, etc.).


* There is an extension to the __Pandas__ Python library called __GeoPandas__ that adds functionality on top of the core __Pandas__ functionality to make plotting, analyzing and transforming geographic features in a dataset really nice and easy!


* A lot of data available through initiatives like data.gov contains some kind of spatial location data which situates counts, statistics and events in a geographic space. 


## Working with geographic and geolocated data 

### Some terminology

* __GIS__ (Geographic Information System) - A system to encode, manipulate, analyze and present geographic spatial data


* __CRS__ (Coordinate Reference System) - A standard that defines how the coordinate values in a dataset map to actual locations on the Earth. 
Different CRSs use different units, e.g. coordinates may be defined in meters or in decimal degrees.


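* __Projection__ - A method for representing the curved surface of the Earth on a flat map. Transforming geographic data from one CRS to another is often called _reprojection_. 
Compare two projections: https://map-projections.net/compare.php?p1=mercator-84&p2=robinson&sps=1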


* __Shape definition__ - A vector used to define points, lines, shapes and nested sets of shapes
for example to mark:
  - building or event locations (points)
  - streets or routes (lines)
  - perimeter of a location, e.g. campus, a school district, a city (shapes)


* __Longitude, Latitude & Elevation__ - a system for locating geographical points on the surface of the Earth.
    - __latitude__ - the angle north or south of the equator (0° at the equator, 90° at the poles).
    - __longitude__ - the angle east or west of a reference meridian (a line running from the north pole to the south pole), conventionally the prime meridian through Greenwich.
    - Elevation is also needed to locate a point precisely on the curved surface of the Earth, but it is often not needed for geographic data analysis at a local level.



* __geocoding__ - from place name to geographical coordinates (long/lat)


* __geolookup__ (or reverse geocoding) - from geographical coordinates to place/street name 
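
* To make the point about CRS units concrete, here is a minimal sketch that converts a longitude/latitude pair in decimal degrees (as used by EPSG:4326) into spherical Web Mercator meters (as used by EPSG:3857) with the standard formulas. The function name `lonlat_to_web_mercator` and the approximate coordinates for Philadelphia City Hall are illustrative assumptions. In practice `GeoPandas` can do this kind of conversion for a whole data frame with `to_crs`.

```python
import math

# Radius of the sphere used by Web Mercator (EPSG:3857), in meters
EARTH_RADIUS_M = 6378137.0

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert decimal degrees to spherical Web Mercator (x, y) in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# Approximate location of Philadelphia City Hall (illustrative values)
lon, lat = -75.1635, 39.9526
x, y = lonlat_to_web_mercator(lon, lat)

print(f"degrees: ({lon}, {lat})")
print(f"meters:  ({x:.0f}, {y:.0f})")
```

* The same place is described by small decimal numbers in one CRS and by values in the millions in the other, which is why two layers must share a CRS before they are plotted together.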


### File formats

* There is a range of file formats that you will come across on `data.gov` and other sites that encode geographic data. The main ones are:

  - Shapefile (.shp extension)
  - GeoJSON (.geojson extension)
  - GeoPackage (.gpkg extension)


------

## Setup

* Import the required modules


* `geopandas` is an extension of Pandas to add geographical data structures and functions 
for processing and plotting geographical data
  - A `GeoDataFrame` has a `geometry` column that contains the geographical vector definitions
  for the spatial item in each row


* `shapely` is a module that provides data structures for geographic shapes
    - `Point`
    - `LineString`
    - `Polygon`
  
  that are used in geographic data formats, e.g. the `geometry` column in `GeoDataFrame`

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
import shapely
import json
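
* As a quick illustration of the `shapely` data structures, here is a small sketch that builds one geometry of each kind and asks some basic questions of them. The coordinates are made up for the example; note that the line class in `shapely` is called `LineString`.

```python
from shapely.geometry import Point, LineString, Polygon

# A point is a single (x, y) pair; for long/lat data x is the longitude
pt = Point(0.5, 0.5)

# A line is an ordered sequence of coordinates
line = LineString([(0, 0), (1, 0), (1, 1)])

# A polygon is a closed ring of coordinates (here a unit square)
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])

print(pt.geom_type, line.geom_type, square.geom_type)
print(line.length)          # total length of the two segments
print(square.area)          # area in squared coordinate units
print(square.contains(pt))  # spatial predicate: is the point inside the square?
```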

## Example: _Rubbish/Recycling Collection Day Boundary_

* https://www.opendataphilly.org/dataset/rubbish-recycling-collection-boundary

> The data is used to determine the day of collection for a given location and set of households in the City of Philadelphia. The file is also used to aggregate data such as households, tonnage, and mileage.

### Shape files (`.shp`, `.shx`)

* https://gisgeography.com/arcgis-shapefile-files-types-extensions/

> ArcGIS shapefiles have mandatory and optional files. The mandatory file extensions needed for a shapefile are .shp, .shx and .dbf. But the optional files are: .prj, .xml, .sbn and .sbx

### Loading a shape file

* A shape file is actually a zipped folder containing a series of files

In [None]:
ls data/Rubbish_Recyc_Coll_Bnd_SHP/

* You load it into `GeoPandas` by pointing to the file with the `.shp` extension

In [None]:
rub_df=gpd.read_file('data/Rubbish_Recyc_Coll_Bnd_SHP/Rubbish_Recyc_Coll_Bnd.shp')

In [None]:
rub_df.shape

In [None]:
rub_df.head()

In [None]:
rub_df.plot(figsize=(16,8), color='#A0A0A0', edgecolor='blue')

In [None]:
rub_df['SANDIS'].unique()

In [None]:
rub_df['SANDIS'].value_counts(dropna=False)

In [None]:
rub_df['COLLDAY'].value_counts(dropna=False)

In [None]:
fig,ax=plt.subplots(figsize=(16,8))
rub_df.plot(ax=ax, column='COLLDAY', categorical=True)

* Some rows have no `COLLDAY` value (they show up as `None` or `NA` in the counts above), so we first need to subset the data frame to remove those rows.

In [None]:
# use ~ (not -) to negate a boolean mask; recent pandas versions reject - here
rub_df2=rub_df[~rub_df['COLLDAY'].isnull()]
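
* The subsetting step relies on a boolean mask. Here is a small sketch with made-up data showing how the mask is built and flipped. The column name matches the collection-day data, but the values are invented for the example.

```python
import pandas as pd

# Made-up collection-day values with one missing entry
days = pd.DataFrame({"COLLDAY": ["MON", None, "TUE"]})

# isnull() marks the rows with missing values; ~ flips the boolean mask
keep = ~days["COLLDAY"].isnull()
print(keep.tolist())

# Indexing with the mask keeps only the rows where it is True
print(days[keep])

# notna() builds the same mask in one step
same_mask = days["COLLDAY"].notna()
print(keep.equals(same_mask))
```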

In [None]:
rub_df.shape

In [None]:
rub_df2.shape

In [None]:
rub_df2.plot(column='COLLDAY', categorical=True, legend=True)

* Looking at the `geojson` version

In [None]:
rub_gj_df=gpd.read_file('data/Rubbish_Recyc_Coll_Bnd.geojson')

In [None]:
rub_gj_df.head()

* `GeoJSON` is a format that uses `JSON` to encode geographical shape and location data.


* Here is a peek at what it looks like:

In [None]:
# open the file in a with block so it is closed after reading
with open('data/Rubbish_Recyc_Coll_Bnd.geojson') as f:
    jdata = json.load(f)

In [None]:
jstr=json.dumps(jdata, indent=4)
print(jstr[:5000])
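
* To see the overall structure without scrolling through thousands of coordinates, here is a hand-written sketch of a minimal `GeoJSON` document with a single polygon feature. The property names mimic the collection-day data, but the values and coordinates are made up.

```python
import json

# A FeatureCollection holds a list of features; each feature pairs
# a dictionary of properties with one geometry
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"COLLDAY": "MON", "SANDIS": "1"},  # made-up values
            "geometry": {
                "type": "Polygon",
                # One ring of [longitude, latitude] pairs; first and last match
                "coordinates": [[
                    [-75.20, 39.95], [-75.19, 39.95],
                    [-75.19, 39.96], [-75.20, 39.96],
                    [-75.20, 39.95],
                ]],
            },
        }
    ],
}

# Round-trip through a JSON string, as if reading it back from a file
text = json.dumps(feature_collection)
parsed = json.loads(text)
first = parsed["features"][0]
ring = first["geometry"]["coordinates"][0]

print(first["geometry"]["type"], first["properties"])
print(len(ring), "coordinate pairs in the ring")
```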

In [None]:
sc_df=pd.read_csv('data/philly_sanitation_centers.csv')
sc_df

In [None]:
gpd.points_from_xy(sc_df['long'],sc_df['lat'])

In [None]:
sc_gdf=gpd.GeoDataFrame(sc_df,geometry=gpd.points_from_xy(sc_df['long'],sc_df['lat']))
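
* The step above turns an ordinary table with longitude and latitude columns into a `GeoDataFrame`. Here is a self-contained sketch of the same idea with made-up rows. It also records a CRS, assuming the coordinates are plain long/lat in decimal degrees (EPSG:4326); that assumption is not confirmed for the sanitation centers file.

```python
import pandas as pd
import geopandas as gpd

# Made-up rows standing in for a CSV with long/lat columns
demo_df = pd.DataFrame({
    "name": ["Site A", "Site B"],
    "long": [-75.16, -75.10],
    "lat": [39.95, 40.03],
})

# x is the longitude and y is the latitude;
# EPSG:4326 is an assumption about how the coordinates were recorded
demo_gdf = gpd.GeoDataFrame(
    demo_df,
    geometry=gpd.points_from_xy(demo_df["long"], demo_df["lat"]),
    crs="EPSG:4326",
)

print(demo_gdf.geometry.geom_type.tolist())
print(demo_gdf.geometry.x.tolist())
print(demo_gdf.crs)
```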

In [None]:
sc_gdf

In [None]:
sc_gdf.plot()

In [None]:
base=rub_df2.plot(color='#f0fff0', edgecolor='#202020', figsize=(10,10))
sc_gdf.plot(ax=base)

In [None]:
lead_df = gpd.read_file('data/child_blood_lead_levels_by_zip.geojson')

In [None]:
lead_df.head()

In [None]:
lead_df.plot(column='num_bll_5plus', legend=True, cmap='bwr')

In [None]:
pdiv_df = gpd.read_file('data/Political_Divisions.geojson')

In [None]:
pdiv_df.shape

In [None]:
pdiv_df.head()

In [None]:
pdiv_df.plot(color='white', edgecolor='black', figsize=(16,10))

In [None]:
vote_df = gpd.read_file('data/qualified_voter_listing_2018_primary_by_ward.geojson')

In [None]:
vote_df.shape

In [None]:
vote_df.sample(10)

In [None]:
pdiv_df['ward']=pdiv_df['DIVISION_NUM'].str[:2]

In [None]:
pdiv_df.head()

In [None]:
wdiv_df=pdiv_df.dissolve(by='ward')
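
* `dissolve` merges all of the shapes that share a value into a single shape. Here is a small sketch with three made-up squares standing in for political divisions, two of which belong to the same ward.

```python
import geopandas as gpd
from shapely.geometry import box

# Three made-up unit squares: two divisions in ward "01", one in ward "02"
divisions = gpd.GeoDataFrame(
    {"DIVISION_NUM": ["0101", "0102", "0201"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(5, 0, 6, 1)],
)
divisions["ward"] = divisions["DIVISION_NUM"].str[:2]

# dissolve combines the geometries within each ward;
# the grouping column becomes the index of the result
wards = divisions.dissolve(by="ward")

print(wards.index.tolist())
print(wards.geometry.area.tolist())
```

* Note that after `dissolve` the `ward` values live in the index of the result rather than in an ordinary column.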

In [None]:
wdiv_df.plot()

In [None]:
wdiv_df.shape

In [None]:
vote_df['ward_num']=vote_df['ward'].str[2:]

In [None]:
vote_df.head()

In [None]:
vdf = vote_df.merge(wdiv_df, left_on='ward_num', right_on='ward')

In [None]:
vdf

In [None]:
vote_gdf=gpd.GeoDataFrame(vdf, geometry='geometry_y')
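
* The `geometry_y` name comes from the merge: both tables have a `geometry` column, so pandas adds `_x` and `_y` suffixes to tell them apart. Here is a sketch with made-up data that shows the suffixes and then picks the ward polygons as the active geometry. The column names follow the ones used above, but the numbers and shapes are invented.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Made-up voter counts with point geometry
votes = gpd.GeoDataFrame(
    {"ward_num": ["01", "02"], "rep": [10, 30], "total": [100, 120]},
    geometry=[Point(0.5, 0.5), Point(5.5, 0.5)],
)

# Made-up ward shapes with polygon geometry
shapes = gpd.GeoDataFrame(
    {"ward": ["01", "02"]},
    geometry=[box(0, 0, 1, 1), box(5, 0, 6, 1)],
)

# Both inputs have a `geometry` column, so the result has
# geometry_x (from votes) and geometry_y (from shapes)
merged = votes.merge(shapes, left_on="ward_num", right_on="ward")
print(list(merged.columns))

# Choose the polygons from the right-hand table as the active geometry
merged_gdf = gpd.GeoDataFrame(merged, geometry="geometry_y")
merged_gdf["rep_per"] = merged_gdf["rep"] / merged_gdf["total"]

print(merged_gdf.geometry.geom_type.tolist())
print(merged_gdf["rep_per"].tolist())
```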

In [None]:
vote_gdf.shape

In [None]:
vote_gdf['rep_per']=vote_gdf['rep']  / vote_gdf['total']

In [None]:
vote_gdf.plot(column='rep_per', figsize=(10,10), cmap='Reds', legend=True)