# GeoSpatial 1: time series data, interactive maps
---

**<font color='red'>FYI internet access is required for Sections D and E.</font>**

---

Required packages:
* pandas
* numpy
* matplotlib
* geopandas
* geodatasets
* folium
* mapclassify

---

**Which sample of Carbon Polluters are we examining?**    
US university facilities with large GHG emissions (> 25,000 metric tons of carbon dioxide equivalent (CO₂e) per year) in any year between 2011 and 2022 (the latest reporting year). 

**What scientific approaches are we taking?**    
Statistical and geospatial approaches.

**What outputs will we develop?**    
Statistical graphs and interactive maps, with historical and/or regional dimensions.

**What will our outputs tell us?**    
Who and where the significant US sources of carbon pollution are within the Higher Education sector, at facility and state level, both recently and since 2011.  

**Beyond the well-known Eco impacts of Carbon Polluters, what makes this sample significant?**    
That this sector, and individual universities within it, are large fossil fuel burners may come as a surprise, seemingly out of sync with the green credentials or reputation they have garnered, especially relating to clean energy.
    
For further/corroborating findings, see this Reuters article -> https://www.reuters.com/investigates/special-report/usa-pollution-universities/

---
**Data Source - University Emitters**
* Filename: `Py4EE_GeoSpatial1_Data.csv`
* Org: U.S. Environmental Protection Agency (EPA)
* Resource: Facility Level Information on GreenHouse gases Tool (FLIGHT), which provides information about greenhouse gas (GHG) emissions from large facilities in the U.S. that are required to report annual data about their GHG emissions to EPA as part of the Greenhouse Gas Reporting Program (GHGRP) -> https://ghgdata.epa.gov/ghgp/main.do
* Related resources: All GHGRP data products -> https://www.epa.gov/ghgreporting/find-and-use-ghgrp-data

**Data Source - US state shapefile**
* Tutorial subfolder: `cb_2022_us_state_20m`
* Org: U.S. Census Bureau 
* Resource: Cartographic boundary files -> https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html
    * Search for "States > 1 : 20,000,000 (national) shapefile..."

## A. Set-up Jupyter Notebook, University Emitters dataset & US state shapefile

>**A0.** Import the required packages and submodule with their conventional aliases.
>```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

>**A1.** (OPTIONAL) If autocompletion is not working, try running this magic command.
>```
%config Completer.use_jedi = False
```

>**A2.** Set pandas display options to show all columns/not truncate their display.
>```
pd.options.display.max_columns = None
```

>**A3.** Read-in the university emitters dataset `"Py4EE_GeoSpatial1_Data.csv"`, and assign to `raw_data`.    
>
>**Code Detail:** Although we are still passing just the file path, the only required parameter of the `read_csv()` function, note from the docstring the range of options available for customising the call.
>```
raw_data = pd.read_csv("Py4EE_GeoSpatial1_Data.csv")
```
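As a hedged illustration of those options, the sketch below parses a small made-up CSV (the column names and values are invented stand-ins, not the EPA file) with the `thousands` and `na_values` parameters, so that comma-formatted numbers and `---` placeholders arrive as numeric data. This tutorial deliberately cleans these by hand in Section C, so treat this as an alternative to be aware of rather than the approach used here.

```python
import io
import pandas as pd

# Made-up sample mimicking comma-formatted numbers and "---" placeholders
sample_csv = io.StringIO(
    'FACILITY,2011,2012\n'
    'Campus A,"30,000","31,500"\n'
    'Campus B,---,"26,000"\n'
)

# thousands="," strips the separators; na_values marks "---" as missing
sample = pd.read_csv(sample_csv, thousands=",", na_values=["---"])
print(sample.dtypes)
```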

>**A4.** Copy `raw_data` to create the `DataFrame` we will be prepping called `df_prep`.
>```
df_prep = raw_data.copy()
```

>**A5.** Ensure the US state shapefile `cb_2022_us_state_20m` is a subfolder in your current working directory (or locally accessible).  
> * This shapefile is an Esri vector data storage format that stores the official location, shape, and attributes for each state at the 1 : 20,000,000 (national) ratio scale. 
> * `cb_2022_us_state_20m` contains a set of related files, such as the `.shp`, `.shx`, `.dbf`, and `.prj` file components of the shapefile.
> * Download here -> https://www2.census.gov/geo/tiger/GENZ2022/shp/cb_2022_us_state_20m.zip

---
## B. Inspect the University Emitters dataset

>**B0.** Have a look at the dataset of large carbon polluting US university facilities.
>
>**Code Detail:** Instead of `head()` returning a default number of rows (5), start specifying the `n` argument.
>```
df_prep.head(2)
```

<font color="green">***B0. Comments***      
*- There is missing data that requires data prep (also called cleaning/munging/wrangling).*

>**B1.** Find out the dimensions of `df_prep`.
>```
df_prep.shape
```

<font color="green">***B1. Comment***     
*- There are 123 US university facilities which were large carbon polluters in at least one of the years between 2011-2022.*

>**B2.** (OPTIONAL) Find out which US university we saw in `GeoSpatial0` had the worst 5-year total pollution, i.e. retrieve the row where the `"GHGRP ID"` is 1001250.
>```
df_prep[df_prep["GHGRP ID"] == 1001250]
```
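The row lookup in **B2.** is Boolean mask indexing. A minimal sketch on an invented mini-table (the IDs and names are hypothetical) shows the intermediate mask, plus an equivalent `query()` call in which backticks quote a column label containing a space.

```python
import pandas as pd

# Hypothetical mini-table; IDs and names are invented for illustration
toy = pd.DataFrame({
    "GHGRP ID": [1000001, 1000002, 1000003],
    "FACILITY": ["Campus A", "Campus B", "Campus C"],
})

mask = toy["GHGRP ID"] == 1000002    # Boolean Series, one value per row
by_mask = toy[mask]                  # keeps only the rows where mask is True
by_query = toy.query("`GHGRP ID` == 1000002")    # same filter written as a query string

print(mask.tolist())
print(by_mask)
```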

---
## C. Prepare the University Emitters dataset

>**C0.** Some of the columns are definitely not needed, so let's prepare to drop them, and with some efficiency. Instead of manually counting, let's programmatically find out the index positions of each `df_prep` column.
>```
list(enumerate(df_prep.columns))
```

>**C1.** Perform a drop operation that removes 4 particular columns from `df_prep` inplace (`SUBPARTS`, `CHANGE IN EMISSIONS (2021 TO 2022)`, `CHANGE IN EMISSIONS (2011 TO 2022)`, and `SECTORS`). Review the modification.    
>
>**Code Detail:** Pass the relevant subset of the `df_prep.columns` attribute as the named `columns` argument of the `drop()` method.    
>
>**Tech Note:** The `inplace` parameter, accepted by several pandas methods, is used in this Tutorial where possible for brevity, but its existence is controversial and currently in flux.
>```
df_prep.drop(columns = df_prep.columns[[10,23,24,25]], inplace=True)
df_prep.head(2)
```
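Dropping by position depends on the column order staying fixed. As a hedged alternative, the sketch below drops by label on an invented stand-in table (only a few of the tutorial's column labels are reproduced, with made-up values) and returns a new `DataFrame` instead of modifying in place.

```python
import pandas as pd

# Invented stand-in carrying a few of the tutorial's column labels
toy = pd.DataFrame({
    "FACILITY": ["Campus A"],
    "SUBPARTS": ["C"],
    "2011": ["30,000"],
    "SECTORS": ["Other"],
})

# Dropping by label does not depend on column order; the original is left untouched
trimmed = toy.drop(columns=["SUBPARTS", "SECTORS"])
print(list(trimmed.columns))
```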

>**C2.** See what `dtype` pandas inferred for each column when it originally read-in `Py4EE_GeoSpatial1_Data.csv`.
>```
df_prep.dtypes
```

<font color="green">***C2. Comment***     
*- pandas' inferences are only partially accurate. The columns that are essential to correct are the 12 years of reported emissions data. These are all currently `object` `dtype`, so basically string data, not numeric as required.*

>**C3.** Before we start modifying the 12 reported emissions data columns, let's shorten their long, cumbersome labels to just the reporting year reference. Review the modification.    
>
>**Code Detail:** Perform a renaming operation inplace where a `replace()` string method that returns a copy of the original string with the substring `"TOTAL REPORTED EMISSIONS, "` replaced by nothing is applied to each column label of `df_prep`.
>```
df_prep.rename(columns = lambda x: x.replace("TOTAL REPORTED EMISSIONS, ", ""), inplace=True)
df_prep.head(2)
```
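To see what the callable passed to `rename()` does, here is a small sketch on an empty `DataFrame` whose labels are assumed to follow the pattern described in **C3.** Labels that do not contain the substring pass through unchanged.

```python
import pandas as pd

# Invented labels assumed to follow the pattern described in C3
toy = pd.DataFrame(columns=[
    "FACILITY",
    "TOTAL REPORTED EMISSIONS, 2011",
    "TOTAL REPORTED EMISSIONS, 2012",
])

# rename() calls the function once per column label and uses the return value as the new label
renamed = toy.rename(columns=lambda label: label.replace("TOTAL REPORTED EMISSIONS, ", ""))
print(list(renamed.columns))
```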

>**C4.** Now start the process towards converting these 12 columns to a numeric `dtype`. Remove the comma characters from the string data in these 12 columns, but not anywhere else in `df_prep` (e.g. `"REPORTED ADDRESS"`). Review the modification.
> 
>**Code Detail:** Use the pandas `loc` indexer to select these 12 `df_prep` columns by inputting a label-based slice after the comma. `map()` applies a function to a `DataFrame` elementwise, as opposed to row/column-wise (`DataFrame.map()` requires pandas 2.1 or later; earlier versions name it `applymap()`).    
>```
df_prep.loc[:, "2011":] = df_prep.loc[:, "2011":].map(lambda x: x.replace(",", ""))
df_prep.head(2)
```

<font color="green">***C4. Comment***        
*- As a specific example of the general point that there are typically **multiple ways to do the same thing in scientific Python**, other ways we can select these 12 priority columns include:*
```
df_prep.iloc[:, -12:]
df_prep.filter(regex="^20")
```
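In the same spirit, the comma removal itself can be done column-wise with the `.str` accessor instead of elementwise. The sketch below uses invented data and a hypothetical `year_cols` list; it leaves the address column untouched, as **C4.** requires.

```python
import pandas as pd

# Invented rows: commas appear both in the address and in the emissions strings
toy = pd.DataFrame({
    "REPORTED ADDRESS": ["1 Main St, Suite 2", "5 College Rd"],
    "2011": ["30,000", "---"],
    "2012": ["31,500", "26,000"],
})

year_cols = ["2011", "2012"]    # hypothetical helper listing the emissions columns
# apply() hands each selected column (a Series) to the function in turn
toy[year_cols] = toy[year_cols].apply(lambda col: col.str.replace(",", "", regex=False))
print(toy)
```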

>**C5.** The last string clean-up step is to deal with the `"---"` instances in the 12 columns, which we assume are EPA notation for `N/A`. We will replace these string values inplace with `NaN`. Review the modification.
>```
df_prep.replace("---", np.nan, inplace=True)
df_prep.head(2)
```

>**C6.** Now convert the 12 columns to a numeric `dtype`. Review the modification by eye.  
>
>```
df_prep[df_prep.columns[-12:]] = df_prep[df_prep.columns[-12:]].apply(pd.to_numeric)
df_prep.head(2)
```
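`pd.to_numeric` raises an error by default if it meets a string it cannot parse, which is why the `"---"` placeholders had to be handled first. A hedged sketch on invented data compares that strict route with `errors="coerce"`, which silently turns anything unparseable into `NaN`. The lenient route is shorter, but it can also hide unexpected values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"2011": ["30000", "---"], "2012": ["31500", "26000"]})

# Strict route, as in C5 and C6: mark placeholders as missing, then convert
strict = toy.replace("---", np.nan).apply(pd.to_numeric)

# Lenient route: extra keyword arguments to apply() are forwarded to pd.to_numeric
lenient = toy.apply(pd.to_numeric, errors="coerce")

print(strict.dtypes)
print(lenient)
```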

>**C7.** Now review the modification more formally by accessing the `dtypes` attribute again.
>```
df_prep.dtypes
```

>**C8.** Now that the columns with the 12 years of reported emissions data are a numeric `dtype` (as well as some other columns), let's compute some quick summary statistics.    
>
>**Tech Note:** To show fewer/no decimal places, use `pd.set_option("display.precision", 0)`
>```
df_prep.describe()
```

>**C9.** The subset of 12 emissions columns is time series data. Generate a quick matplotlib line plot.
>
>**Code Detail:** Access the `T` attribute of the `df_prep` subset to return the transpose, then call the `plot()` method, using the `legend` keyword to not place a legend on the plot.  
>
>**Tech Note:** The `plot()` method for `DataFrame` or `Series` data structures uses the backend specified by the option `plotting.backend`, which has the default value of matplotlib.
>```
df_prep[df_prep.columns[-12:]].T.plot(legend=False)
```
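The transpose matters because `plot()` puts the index on the x-axis and draws one line per column. A small sketch with invented values makes the reshaping visible.

```python
import pandas as pd

# Invented values: rows are facilities, columns are reporting years
toy = pd.DataFrame(
    {"2011": [30000, 45000], "2012": [31000, 44000], "2013": [29000, 47000]},
    index=["Campus A", "Campus B"],
)

# After .T the years form the index (x-axis) and each facility becomes a column (one line each)
ax = toy.T.plot(legend=False)
print(toy.T.shape)
print(len(ax.get_lines()))
```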

<font color="green">***C9. Comment***    
*- By eye it looks like all data points in the time series are above the 25,000 metric ton CO₂e threshold as specified in the original EPA FLIGHT data request, but a programmatic check can optionally be performed in **C10.***

>**C10.** (OPTIONAL) Check that all the data points in the time series are either greater than 25,000 or `NaN` by generating a Boolean array for the conditions then determining whether all the values are `True` or not.
>```
( (df_prep.iloc[:, -12:] > 25000) | (df_prep.iloc[:, -12:].isna()) ).values.all()
```
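If the check ever returned `False`, the same Boolean array can point to the rows responsible. A minimal sketch on invented data, where one value deliberately falls below the threshold:

```python
import numpy as np
import pandas as pd

# Invented values: 24000 is below the threshold, NaN stands for a missing report
toy = pd.DataFrame({"2011": [30000, np.nan], "2012": [31000, 24000]})

# True where a value passes the threshold or is missing
ok = (toy > 25000) | toy.isna()
all_ok = bool(ok.values.all())

# Rows in which at least one value fails the check
offending_rows = toy[~ok.all(axis=1)]

print(all_ok)
print(offending_rows)
```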

>**C11.** Final task for **Section C.**, create a new `"Cumulative"` column with the sum of each facility's reported emissions data over the 12 years. Review the modification.
>
>**Code Detail:** Extend `df_prep` by assigning to a new column label, `"Cumulative"`, using the indexing operator. To perform the `sum()` operation across the columns axis, a named `axis` argument of either `"columns"` or `1` must be passed.
>```
df_prep["Cumulative"] = df_prep.loc[:, "2011":].sum(axis=1)
df_prep.head(2)
```
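One detail worth knowing about the row-wise `sum()` is that it skips `NaN` by default, so a facility with missing years still gets a total, and a row with no data at all sums to 0. The sketch below uses invented values; `min_count=1` is an optional argument that returns `NaN` for such empty rows instead.

```python
import numpy as np
import pandas as pd

# Invented values: the second row has no reported data at all
toy = pd.DataFrame({"2011": [30000, np.nan], "2012": [31000, np.nan]})

# axis=1 (or axis="columns") adds up the values within each row; NaN is skipped
default_totals = toy.sum(axis=1)

# min_count=1 requires at least one valid value, otherwise the result is NaN
strict_totals = toy.sum(axis=1, min_count=1)

print(default_totals.tolist())
print(strict_totals.tolist())
```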

>**C12.** (OPTIONAL) Calculate the total GHG emissions/volume of carbon pollution that the large facilities in the US university sector have been responsible for creating over the reported 12-year period.
>```
df_prep["Cumulative"].sum()
```

<font color="green">***C12. Comment***     
*- These 123 US university facilities are responsible for ~105 million metric tons (CO₂e) of carbon pollution from 2011 to 2022.*

---
## D. Map dataset for regional trends - Basic

<font color="green">***D. Intro***       
*- For mapping Sections D. and E. we use the geographic pandas extension, geopandas, to create a `GeoDataFrame` object.*       
*- A `GeoDataFrame` is a pandas `DataFrame` that has a column with geometry, and extends pandas functionality in order to make basic maps.*

>**D0.** Import geopandas with its conventional alias and geodatasets.
>```
import geopandas as gpd
import geodatasets
```

>**D1.** Create a `GeoDataFrame` called `geo_df` using `df_prep`. Review the new object.
> 
>**Code Detail:** Call the geopandas `GeoDataFrame()` constructor, inputting `df_prep` as well as a `geometry` keyword argument, which is the result of another geopandas function, `points_from_xy()`, called with `df_prep`'s `"LONGITUDE"` and `"LATITUDE"` columns as the required positional `x` and `y` arguments respectively.    
>```
geo_df = gpd.GeoDataFrame(df_prep, geometry=gpd.points_from_xy(df_prep["LONGITUDE"], df_prep["LATITUDE"]))
geo_df.head()
```
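The call above does not attach a coordinate reference system (CRS) to the points. As a hedged sketch on invented coordinates, a CRS can be declared with the `crs` argument. This assumes the coordinates are longitude/latitude in decimal degrees compatible with EPSG:4326 (WGS 84), which is an assumption about the data rather than something this tutorial confirms.

```python
import geopandas as gpd
import pandas as pd

# Invented coordinates; assumed to be longitude/latitude in decimal degrees
toy = pd.DataFrame({
    "FACILITY": ["Campus A", "Campus B"],
    "LONGITUDE": [-71.1, -122.3],
    "LATITUDE": [42.4, 37.9],
})

# x is longitude and y is latitude; crs labels the coordinates without transforming them
toy_geo = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy["LONGITUDE"], toy["LATITUDE"]),
    crs="EPSG:4326",
)
print(toy_geo.crs)
print(toy_geo.geometry.iloc[0])
```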

>**D2.** Generate a default plot of this new `GeoDataFrame` `geo_df`. Then try a customised plot where the colour of the points is based on their `"Cumulative"` column value. Finally, try customising the colormap, `cmap`, used to reflect the `"Cumulative"` values.
> 
>**Tech Note:** geopandas uses matplotlib for this `plot()` method. matplotlib has a range of built-in colormaps -> https://matplotlib.org/stable/tutorials/colors/colormaps.html
>```
geo_df.plot()
geo_df.plot(column="Cumulative")
geo_df.plot(column="Cumulative", cmap="cool")
```

>**D3.** Evidently, the `geo_df` plot needs a base map to contextualise the points. Download one of the datasets available from `geodatasets`, `"naturalearth.land"`, assign it to `base_map`, then plot. 
>
>**Tech Notes:** `base_map` is also a `GeoDataFrame` object. Run `geodatasets.data` for full list of datasets.
>```
base_map = gpd.read_file(geodatasets.get_path("naturalearth.land"))
base_map.plot()
```

>**D6.** Use matplotlib to construct a basic map that plots `geo_df` and `base_map` together, visualising the location of the 123 large university emitters and the relative size of their cumulative emissions over 2011-2022. 
>
>**Tech Note:** Many matplotlib plotting routines start with `fig, ax = plt.subplots()`, even when the output is a single plot, because a `Figure` and an `Axes` object are created in one step. `Axes` does not refer to an `axis`; it is an instance of the class `plt.Axes`, a bounding box object with ticks and labels. Conventionally, `ax` refers to an individual `Axes` instance, although it is sometimes also used for a group of them.
>```
fig, ax = plt.subplots(figsize=(12,8))    # Create an empty matplotlib Figure and Axes  
base_map.plot(ax=ax)
geo_df.plot(ax=ax, column="Cumulative", cmap="autumn_r")    # Other input options include legend=True, legend_kwds={"orientation": "horizontal"}
#plt.tight_layout()
#plt.savefig("<SomeFilename>.png", dpi=600)    # Save the Figure/Axes using matplotlib - use the optional dpi (dots-per-inch) argument to control the resolution of the png
plt.show()    # Display plot
```
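The layering idea is independent of geopandas: any two plotting calls given the same `Axes` share one coordinate system. A minimal sketch in plain matplotlib, with made-up shapes standing in for the base map and the facilities:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))    # one Figure containing one Axes

# Both calls draw onto the same Axes, so the two layers line up
ax.fill([0, 4, 4, 0], [0, 0, 3, 3], color="lightgrey")    # stand-in for a base map polygon
points = ax.scatter([1, 2, 3], [1, 2, 1], c=[10, 50, 90], cmap="autumn_r")    # stand-in for facilities

# The colorbar occupies a second Axes inside the same Figure
fig.colorbar(points, ax=ax, label="Cumulative (made-up values)")
```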

---
## E. Map dataset for regional trends - Interactive

<font color="green">***E. Intro***    
*- In this section we extend our geopandas extension of pandas with folium, an optional geopandas dependency which provides interactive mapping from the Open-Source leaflet.js library via a Python interface.*         
*- See geopandas docs -> https://geopandas.org/en/stable/gallery/plotting_with_folium.html*

>**E0.** Import folium.
>```
import folium
```

>**E1.** Test out the extended geopandas functionality that folium now enables. Call `explore()` on `base_map`.
>
>**Tech Note:** The leaflet/folium map takes a moment to render.
>```
base_map.explore()
```

>**E2.** Rather than using a world map as a base layer to plot US-only data, let's prepare a regional map. Read-in the US Census states shapefile we double-checked in **Section A.**, creating another `GeoDataFrame`, and assign to `US_state_boundaries`.    
>
>**Tech Note:** Leave the `"cb_2022_us_state_20m"` directory and files as-is, they contain dependencies.
>```
US_state_boundaries = gpd.read_file("cb_2022_us_state_20m/cb_2022_us_state_20m.shp")
```

>**E3.** (OPTIONAL) `US_state_boundaries` is another `GeoDataFrame` like `base_map`, with polygon/multipolygon geometries, although for US states rather than world land masses. What folium now allows us to do is visualise `GeoDataFrame` data on a leaflet map. Try calling `explore()` again.    
>```
US_state_boundaries.explore()
```

>**E4.** Through a plotting routine harnessing folium-extended capabilities of geopandas with `geo_df` and `US_state_boundaries`, construct an improved version of our **D6.** basic map, namely an interactive map that visualises the location of the 123 large university emitters and the relative size of their cumulative emissions over 2011-2022.
>
>**Code Details:** 
>* Create a `base_layer` by calling `US_state_boundaries.explore()`, which returns a folium map object, setting several optional arguments that adjust formatting and interactivity.    
>* Then call `explore()` on `geo_df`, specifying this plot to be drawn on the existing map instance, `base_layer`, using the `m` keyword and `"Cumulative"` as the column to plot. Use available keyword arguments to fine-tune formatting and interactivity.
>* Finally, display `base_layer`.
>
>**Tech Notes:**
> * The `radius` of 4,828 meters set in `marker_kwds` is significant because this is the value used in EPA material on GHGRP facilities and their impact on surrounding communities, namely 3 miles -> https://edap.epa.gov/public/extensions/GHGRP-Demographic-Data-Highlights/GHGRP-Demographic-Data-Highlights.html
> * `min_zoom` sets how far out the map can zoom (smaller int); `max_zoom` sets how far in (larger int).
>```
base_layer = US_state_boundaries.explore(location=[39, -97], width="60%", height="60%", zoom_start=4, min_zoom=3, tooltip=False, style_kwds=dict(fillOpacity=1, weight=1, fillColor="gainsboro"), highlight_kwds=dict(fillOpacity=0, weight=3))

geo_df.explore(m=base_layer, column="Cumulative", marker_type="circle", marker_kwds=dict(radius=4828, fill=False), cmap="autumn_r", popup=["FACILITY", "Cumulative"], tooltip=False)

#base_layer.save("<SomeFilename>.html")    # Save map, e.g. as html
base_layer
```
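The 4,828 figure used for `radius` can be reproduced with a one-line conversion from the 3-mile distance, using the exact definition of the international mile. The constant name is chosen here for illustration.

```python
# Convert the 3-mile radius into the meters expected by marker_kwds
METERS_PER_MILE = 1609.344    # exact definition of the international mile
radius_m = round(3 * METERS_PER_MILE)
print(radius_m)    # 4828
```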