|  Sunrise logo | ![EEW logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/eew.jpg?raw=true) | ![EDGI logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/edgi.png?raw=true) |
|---|---|---|

#### This notebook is licensed under GPL 3.0. Please visit our GitHub repo for more information: https://github.com/edgi-govdata-archiving/ECHO-COVID19
#### The notebook was collaboratively authored by the Environmental Data & Governance Initiative (EDGI) following our authorship protocol: https://docs.google.com/document/d/1CtDN5ZZ4Zv70fHiBTmWkDJ9mswEipX6eCYrwicP66Xw/
#### For more information about this project, visit https://www.environmentalenforcementwatch.org/

## How to Run
* A "cell" in a Jupyter notebook is a block of code that performs a set of actions to load or use specific data. The notebook works by running one cell after another as you, the notebook user, select from the options offered.
* If you click on a gray **code** cell, a little “play button” arrow appears on the left. If you click the play button, it will run the code in that cell (“**running** a cell”). The button will animate. When the animation stops, the cell has finished running.
![Where to click to run the cell](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/pressplay.JPG?raw=true)
* You may get a warning that the notebook was not authored by Google. We know, we authored it! It’s okay. Click “Run Anyway” to continue.
![Error Message](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/warning-message.JPG?raw=true)
* You may also get a warning that the "runtime" has restarted after you run the second cell. That's to be expected. Carry on!
![Error Message](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/restart.png?raw=true)
* **It is important to run cells in order because they depend on each other.**
* Run all of the cells in a Notebook to make a complete report. Please feel free to look at and **learn about each result as you create it**!
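
To illustrate why the order matters, here is a minimal sketch that is not part of the notebook itself. Each string stands in for one cell, and the helper `run_cells` is a hypothetical name used only for this illustration.

```python
# Each string stands in for one notebook cell. All cells share one namespace,
# just as the cells of a notebook share one Python session.
cells = [
    "facilities = ['A', 'B', 'C']",  # first cell: defines a variable
    "count = len(facilities)",       # second cell: uses that variable
]


def run_cells(order):
    """Run the stand-in cells in the given order and report what happened."""
    namespace = {}
    try:
        for i in order:
            exec(cells[i], namespace)
    except NameError as err:
        return f"error: {err}"
    return f"count = {namespace['count']}"


in_order = run_cells([0, 1])      # works: the variable exists when it is needed
out_of_order = run_cells([1, 0])  # fails: the second cell runs before the first

print(in_order)
print(out_of_order)
```

Running the second "cell" first fails because the variable it needs has not been created yet. The same thing happens in this notebook if a cell is skipped.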

# **Let's begin!** 
These first few cells give us access to external Python code we will need. Hover over the "[ ]" on the top left corner of the cell below and you should see a "play" button appear. Click on it to run the cell then move to the next one.
### 1.  Bring in extra code

In [None]:
# Code stored in GitHub projects
!git clone -b program-specific-info --single-branch https://github.com/ericnost/ECHO_modules.git &>/dev/null;
!git clone -b add_geos https://github.com/edgi-govdata-archiving/ECHO-Geo.git &>/dev/null;
!git clone -b split https://github.com/edgi-govdata-archiving/ECHO-Sunrise.git &>/dev/null; # This has the utilities file for mapping and make_data_sets.py

After you run the following cell, you may see an error message. That's to be expected! You can dismiss it and proceed to the third cell here ("_# Import main code libraries_").
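
If you would like to confirm, once the runtime has restarted, that the geospatial libraries are available, a check along these lines can be run in a new cell. This is only a sketch: the helper name `is_installed` is hypothetical, and it assumes the libraries are imported under the module names `rtree` and `osgeo` (the name GDAL's Python bindings normally use).

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can find a module with this name."""
    return importlib.util.find_spec(module_name) is not None


# Assumed module names for the packages installed in the restart cell
for name in ["rtree", "osgeo"]:
    status = "available" if is_installed(name) else "not found"
    print(f"{name}: {status}")
```

A "not found" result suggests the install step did not complete and that the cell may need to be run again.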

In [None]:
# Import geospatial code libraries
import os
def restart_runtime():
  os.kill(os.getpid(), 9) # https://stackoverflow.com/questions/52678841/google-colab-how-to-restart-runtime-using-code
!apt update  &>/dev/null;
!apt install gdal-bin python-gdal python3-gdal  &>/dev/null;
!apt install python3-rtree  &>/dev/null;
restart_runtime() # Necessary to install the above ^^^ https://stackoverflow.com/questions/57831187/need-to-restart-runtime-before-import-an-installed-package-in-colab

In [None]:
# Import main code libraries
%run ECHO_modules/DataSet.py
%run ECHO-Sunrise/utilities.py
import pandas as pd
!pip install geopandas &>/dev/null;
import geopandas
import rtree
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import requests
import csv
import datetime
import ipywidgets as widgets

### 2. Which facilities does EPA track in Massachusetts?
This may take just a little bit of time to load - there are thousands! The next cell will load in the data and give you a preview of it.
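
To see what the query in the next cell asks for, here is a small sketch that runs the same SQL text against a tiny in-memory SQLite table. The four rows are made up for illustration and are not real ECHO records. The notebook's own `get_data` function talks to the project's database rather than to SQLite, so treat this as an analogy only.

```python
import sqlite3

# Build a toy stand-in for the ECHO_EXPORTER table (hypothetical rows)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ECHO_EXPORTER "
    "(REGISTRY_ID TEXT, FAC_NAME TEXT, FAC_STATE TEXT, FAC_ACTIVE_FLAG TEXT)"
)
rows = [
    ("1001", "Example Plant 1", "MA", "Y"),
    ("1002", "Example Plant 2", "MA", "N"),  # inactive: filtered out
    ("1003", "Example Plant 3", "NY", "Y"),  # other state: filtered out
    ("1004", "Example Plant 4", "MA", "Y"),
]
conn.executemany("INSERT INTO ECHO_EXPORTER VALUES (?, ?, ?, ?)", rows)

# Same query text as the notebook cell
sql = "select * from ECHO_EXPORTER where FAC_STATE = 'MA' and FAC_ACTIVE_FLAG='Y'"
matches = conn.execute(sql).fetchall()
num_facilities = len(matches)
conn.close()

print("Matching facilities:", num_facilities)
for row in matches:
    print(row)
```

Only rows that are both in Massachusetts and flagged as active pass the filter.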

In [None]:
echo_data_sql = "select * from ECHO_EXPORTER where FAC_STATE = 'MA' and FAC_ACTIVE_FLAG='Y'"
try:
    echo_data = get_data( echo_data_sql, 'REGISTRY_ID' )
    num_facilities = echo_data.shape[0]
    print("\nThere are %s facilities in Massachusetts currently tracked in the ECHO database." %(num_facilities))
    print(echo_data)
except pd.errors.EmptyDataError:
    print("\nThere are no facilities in this region.\n")

### 3.  Run this next cell to choose how you want to *zoom in* on the data.
Which program do you want to look at, and do you want to view the information by congressional district, county, state house district, town, watershed, or zip code?
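
As a rough sketch of how the region choice is used later in the notebook, each dropdown label maps to a field name, and that field name picks the GeoJSON file to load. The dictionary and the path pattern are copied from the notebook's cells. The helper `geojson_path` is a hypothetical name added for this illustration.

```python
# Dropdown label -> field name, as defined in the notebook
region_field = {
    'Congressional District': {"field": "congressional_districts"},
    'County': {"field": "county"},
    'State Districts': {"field": "state_house_districts"},
    'Town': {"field": "town"},
    'Watershed': {"field": "watersheds"},
    'Zip Code': {"field": "zip_code"},
}


def geojson_path(region_label):
    """Hypothetical helper: build the GeoJSON path for a dropdown label."""
    geo = region_field[region_label]["field"].lower()
    return "ECHO-Geo/ma_" + geo + ".geojson"


# Show which file each choice would lead to
paths = {label: geojson_path(label) for label in region_field}
for label, path in paths.items():
    print(label, "->", path)
```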

In [None]:
# Only list the data set if it has the correct flag set.
%run ECHO-Sunrise/make_data_sets.py
data_sets=make_data_sets()

data_set_choices = []
for k, v in data_sets.items():
    if ( v.has_echo_flag( echo_data ) ):
        data_set_choices.append( k )

data_set_widget=widgets.Dropdown(
    options=list(data_set_choices),
    description='Data sets:',
    disabled=False,
    value='Greenhouse Gases'
) 
display(data_set_widget)

# The different possible geographies for analysis
region_field = { 
    'Congressional District': { "field": "congressional_districts" },
    'County': { "field": "county" },
    'State Districts': { "field": "state_house_districts" }, 
    'Town': {"field": "town"},
    'Watershed': {"field": "watersheds"},
    'Zip Code': { "field": "zip_code" },
}

style = {'description_width': 'initial'}
select_region_widget = widgets.Dropdown(
    options=region_field.keys(),
    style=style,
    value='Congressional District',
    description='Region of interest:',
    disabled=False
)
display( select_region_widget )

### 4. Here are all the facilities in this program
This may take some time because we're looking at all records under this program for all facilities across the state!

We'll get all the data from the database and map where these facilities are:

In [None]:
program = data_sets[ data_set_widget.value ]
program_data = None

my_prog_data, bars = get_program_data(echo_data, program, program_data)
map_of_facilities = mapper_marker(my_prog_data)
map_of_facilities

Next, let's create a bar chart to show trends over time (2010 - 2018):

In [None]:
ax = bars.plot(kind='bar', title = program.name, figsize=(20, 10), fontsize=16)
ax.set_xlabel( 'Reporting Year' )
ax.set_ylabel( program.name )
ax    

### 5. Now we bring the geographic data and the facility data together. First, let's rank each geography.
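
Before the real thing, here is a simplified sketch of the idea: place each facility in a region, add up the values per region, and sort from highest to lowest. It is not the notebook's method. The notebook uses `geopandas` with real boundary shapes, while this sketch assumes made-up rectangular regions and made-up facilities so that it runs with plain Python.

```python
from collections import defaultdict

# Hypothetical regions as boxes: (min_lon, min_lat, max_lon, max_lat)
regions = {
    "West": (-73.5, 42.0, -72.0, 42.9),
    "East": (-72.0, 42.0, -70.5, 42.9),
}

# Hypothetical facilities: (name, longitude, latitude, value)
facilities = [
    ("Plant A", -73.0, 42.3, 10.0),
    ("Plant B", -71.1, 42.4, 25.0),
    ("Plant C", -71.5, 42.2, 5.0),
    ("Plant D", -69.0, 41.0, 99.0),  # outside every region
]


def region_of(lon, lat):
    """Return the name of the first region containing the point, or None."""
    for name, (min_lon, min_lat, max_lon, max_lat) in regions.items():
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            return name
    return None


# "Join" facilities to regions and sum the values per region.
# Facilities that fall in no region are dropped, like an inner join.
totals = defaultdict(float)
for _, lon, lat, value in facilities:
    region = region_of(lon, lat)
    if region is not None:
        totals[region] += value

# Rank regions from highest to lowest total
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

The cell below follows the same three steps (join, sum, sort), with real boundaries in place of the boxes.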

In [None]:
# Read in and map geojson for the selected geography
geo = region_field[select_region_widget.value]["field"].lower()
geo_json_data = geopandas.read_file("ECHO-Geo/ma_"+geo+".geojson")

# Get rid of any null geographies
geo_json_data = geo_json_data[geo_json_data["geometry"].notna()]
        
# Make a geodataframe out of the facilities data   
gdf = geopandas.GeoDataFrame(
    my_prog_data, crs= "EPSG:4326", geometry=geopandas.points_from_xy(my_prog_data["FAC_LONG"], my_prog_data["FAC_LAT"]))

# Join the facilities and the counties, towns, etc. - whatever the chosen geography is
join = geopandas.sjoin(gdf, geo_json_data, how="inner", predicate='intersects') # "predicate" replaces the older "op" argument (geopandas 0.10 and later)

# get geo and attribute data column names
geo_column = {"county": "COUNTY", "state_house_districts": "REP_DIST", "town": "town","zip_code": "geoid10","watersheds": "huc12","congressional_districts": "ids"}
g = geo_column[geo]
a = program.agg_col

join.to_csv("full_program_data-"+program.name+"-"+g+".csv")

data = join.groupby(join[g])[[a]].agg("sum")
data = data.sort_values(by=a, ascending=False)
data.to_csv("geos_ranked-"+program.name+"-"+g+".csv")

sns.set(style='whitegrid')
plt.figure(figsize=(10,6))
unit = data[0:20].index # First 20 rows
values = data[0:20][a] # First 20 rows
sns.barplot(x=values, y=unit, order=list(unit), orient="h")

plt.title('Top 20 %s in Massachusetts from 2010-2018' %(geo))
plt.xlabel(program.name)

plt.show()

### 6. Now, let's map it!
Areas shaded grey are those where there was no data, either because there were no emissions, violations, etc. or because there was nothing we could pull from ECHO at this time.
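
The next cell also sorts facilities into four groups (quartiles) by their value, using `pd.qcut`. As a minimal sketch of what that grouping means, the example below does something similar with Python's standard library on made-up numbers. It is meant to approximate the idea, not to reproduce `pd.qcut` exactly in every edge case.

```python
from bisect import bisect_left
from statistics import quantiles

# Hypothetical facility values (not real ECHO data)
values = [5, 10, 20, 40, 80, 160, 320, 640]

# Three cut points split the values into four groups
cuts = quantiles(values, n=4, method="inclusive")

# Label each value 0 (lowest quarter) through 3 (highest quarter).
# A value exactly equal to a cut point goes into the lower group.
labels = [bisect_left(cuts, v) for v in values]

print("Cut points:", cuts)
print("Quartile labels:", labels)
```

Each facility ends up with a label from 0 to 3, which a map can then use to vary how facilities are drawn.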

In [None]:
data.reset_index(inplace=True)
att_data = data.rename(columns={g: "geo", a: "value"}) 

ranked = my_prog_data.set_index("Index")
ranked = ranked.sort_values(by=a, ascending=False) # sort_values returns a new DataFrame, so keep the result
ranked.to_csv("facilities_ranked-"+program.name+".csv")
ranked['quantile'] = pd.qcut(ranked[a], 4, labels=False, duplicates="drop")

mp = mapper_area(ranked, geo_json_data, att_data, g, a)
mp