| ![EEW logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/eew.jpg?raw=true) | ![EDGI logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/edgi.png?raw=true) |
|---|---|

This notebook is licensed under GPL 3.0. Please visit our [GitHub repo](https://github.com/edgi-govdata-archiving/ECHO-Cross-Program) for more information.

The notebook was collaboratively authored by EDGI following our [authorship protocol](https://docs.google.com/document/d/1CtDN5ZZ4Zv70fHiBTmWkDJ9mswEipX6eCYrwicP66Xw/).

For more information about this project, visit https://www.environmentalenforcementwatch.org/

Note:  This notebook pulls data from a copy of EPA's ECHO database hosted by Stony Brook University. The data sets are updated on a weekly basis, meaning that some of the results from your run may not exactly match those in [EEW's Congressional Report Cards](https://www.environmentalenforcementwatch.org/reports). For instance, the Report Cards show ten facilities that have spent at least three of the past 12 quarters in non-compliance with different environmental protection laws. These results will therefore change as we enter new parts of the year. In addition, the Report Cards estimate the number of facilities that were active in 2019, since EPA does not provide such figures. Our estimate is based on the number of facilities EPA records as active at the *current* moment in time. In short, we use active right now as a proxy for active in 2019. This number informs several metrics in the Report Cards - including violations and inspections per 1000 facilities - and these will change as the number of facilities reported as active right now by the EPA changes. Please see the [CD-Report repo](https://github.com/edgi-govdata-archiving/CD-report) for facility counts and non-compliance rates as we recorded them in mid-September 2020 in order to produce the Report Cards.

# Examining Data from the EPA's Risk Screening Environmental Indicators (RSEI) 

This notebook examines data from the Risk Screening Environmental Indicators (RSEI) database (https://epa.gov/rsei). 

As data is retrieved from each RSEI data set, a subset of the available fields is selected. Those fields are listed in the ***columns*** variable in the code blocks.

Additional columns can be added by modifying the list in the ***columns*** variable.

The fields available and their meaning can be found in the data dictionary at this link: (https://www.epa.gov/rsei/rsei-data-dictionary-site-data).
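
The ***columns*** values in the code cells are plain strings of double-quoted field names separated by commas. As a minimal sketch, assuming you would rather keep the field names in a Python list, a small helper can build that string. The name `quote_columns` is hypothetical and is not part of ECHO_modules.

```python
def quote_columns(names):
    """Join field names into the double-quoted, comma-separated form used in the cells."""
    return ", ".join(f'"{name}"' for name in names)


# Field names taken from the facility cell later in this notebook.
columns = quote_columns(["FacilityName", "FacilityID", "FRSID"])
print(columns)
```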

In [None]:
# Install our codebase 
# !pip install ECHO_modules >&/dev/null;
%pip install git+https://github.com/edgi-govdata-archiving/ECHO_modules@neighborhoods >&/dev/null;
%pip install geopandas >&/dev/null;

### Select the type of region and then the state
A state selection is not necessary for Zip Code and Neighborhood region types.

An additional choice lets you supply a list of FRSIDs and use those facilities in the analysis.

In [1]:
from ECHO_modules.get_data import get_echo_data
from ECHO_modules.utilities import show_region_type_widget, \
    show_state_widget, show_year_range_widget
from ECHO_modules.rsei_utilities import show_rsei_pick_region_widget

region_type_widget = show_region_type_widget(region_types=('City', 'County', 'State', 'Zip Code', 'Neighborhood',
                                                           'FRSID List'), 
                                             default_value='City')
state_widget = None
# display( region_type_widget )
print('(The State will be ignored for Zip Code and Neighborhood regions.)')
state_widget = show_state_widget()

Dropdown(description='Region of interest:', options=('City', 'County', 'State', 'Zip Code', 'Neighborhood', 'F…

(The State will be ignored for Zip Code and Neighborhood regions.)


Dropdown(description='State:', options=('AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI'…

## Select the regions to look for
For Neighborhoods, only rectangles are currently supported.

City, county and state names will be automatically converted to upper case. Don't worry about the case as you type in your selections.

Multiple selections can be made with a comma-separated list.
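
The conversion itself happens inside the widget code, which is not shown here. The sketch below only illustrates the behavior described above (comma-separated entries, case ignored) with a hypothetical helper named `normalize_regions`.

```python
def normalize_regions(text):
    """Split a comma-separated entry, trim whitespace, and upper-case each name."""
    return [part.strip().upper() for part in text.split(",") if part.strip()]


print(normalize_regions("Jessup, georgetown ,Perry"))
```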


In [2]:
from ECHO_modules.utilities import polygon_map

description = None
region_widget = None
region_type = region_type_widget.value
if region_type == 'Neighborhood':
    (map,shapes) = polygon_map()
    display(map)
elif region_type != 'State':
    if region_type == 'FRSID List':
        description = 'Select file with FRSID column'
    region_widget = show_rsei_pick_region_widget( type=region_type,
                                            state_widget=state_widget, 
                                            description=description )

Text(value='', description='Select file with FRSID column')

In [14]:
if 'Standard' in regions_selected.columns:
    fac_selected = regions_selected[['FRSID', 'Standard']]

## Get the facilities that are in the chosen regions
These are the producers of toxic waste in the chosen region, as reported to the EPA's Toxics Release Inventory (TRI).

In [25]:
import pandas as pd 
from ECHO_modules.rsei_utilities import get_rsei_facilities, get_this_by_that
from ECHO_modules.utilities import get_frsid_list

state = state_widget.value if state_widget is not None else None
regions_selected = None
if region_type == 'Zip Code':
    regions_selected = str(region_widget.value)
elif region_type == 'Neighborhood':
    regions_selected = shapes.pop()
elif region_type == 'FRSID List':
    regions_selected = get_frsid_list(region_widget.value)
elif region_type != 'State':
    regions_selected = region_widget.value
    
columns = '"FacilityName", "FacilityID", "FacilityNumber", "FRSID", "Latitude", "Longitude", "Street",'
columns += '"City", "County", "State", "ZIPCode", "StandardizedParentCompany"'

if region_type == 'FRSID List':
    fac_df = get_this_by_that(this_name='facility', that_series=regions_selected['FRSID'], this_key='FRSID',
                              this_columns=columns)
    if 'Standard' in regions_selected.columns:
        # Work on an explicit copy so the astype() assignment below
        # does not raise pandas' SettingWithCopyWarning.
        fac_selected = regions_selected[['FRSID', 'Standard']].copy()
        fac_selected['FRSID'] = fac_selected['FRSID'].astype("int")
        fac_df = pd.merge(fac_df, fac_selected, on='FRSID')
else:
    fac_df = get_rsei_facilities(state=state, region_type=region_type, regions_selected=regions_selected, 
                                 rsei_type='facility', columns=columns)
# If the columns aren't specified, all columns are returned ("select * from ...")
# fac_df = get_rsei_facilities(state=state, region_type=region_type, regions_selected=regions_selected, 
#                              rsei_type='facility')
fac_df

100) reading facility_data_rsei_v2312
200) reading facility_data_rsei_v2312
300) reading facility_data_rsei_v2312
400) reading facility_data_rsei_v2312

Unnamed: 0,FacilityName,FacilityID,FacilityNumber,FRSID,Latitude,Longitude,Street,City,County,State,ZIPCode,StandardizedParentCompany,Standard
0,ELITE SPICE INC.,20794LTSPC7151M,226,110000339269,39.174650,-76.776740,7151 MONTEVIDEO RD,JESSUP,HOWARD,MD,20794,ELITE SPICE INC,Sterilizer Rule
1,WINYAH GENERATING STATION,29440WNYHG661ST,2876,110000353457,33.330842,-79.357839,661 STEAM PLANT DR,GEORGETOWN,GEORGETOWN,SC,29440,SOUTH CAROLINA PUBLIC SERVICE AUTHORITY,MATS
2,ARVESTA CORP,44081CMRCS3647S,6406,110000386001,41.748776,-81.155151,3647 SHEPARD RD,PERRY,LAKE,OH,44081,,HON Rule
3,CHEVRON PRODUCTS CO PASCAGOULA REFINERY,39567CHVRNPOBOX,6456,110000377477,30.343733,-88.493800,250 INDUSTRIAL RD,PASCAGOULA,JACKSON,MS,39581,CHEVRON CORP,HON Rule
4,CLIFTY CREEK STATION,47250CLFTY1335C,10982,110000402314,38.738349,-85.419145,1335 CLIFTY HOLLOW RD,MADISON,JEFFERSON,IN,47250,OHIO VALLEY ELECTRIC CORP,MATS
...,...,...,...,...,...,...,...,...,...,...,...,...,...
345,MIDWEST STERILIZATION CORP,78045MDWST121GE,34956,110024942678,27.620278,-99.503056,12010 GENERAL MILTON DR,LAREDO,WEBB,TX,78045,MIDWEST STERILIZATION CORP,Sterilizer Rule
346,STERIS ISOMEDIX SERVICES,75050CSMDF1175I,38272,110002131014,32.799940,-97.047086,1175 ISUZU PKWY,GRAND PRAIRIE,TARRANT,TX,75050,STERIS CORP,Sterilizer Rule
347,BAKELITE CHEMICALS LLC,75901GRGPC1429E,49676,110017769789,31.336660,-94.712030,1429 E LUFKIN AVE,LUFKIN,ANGELINA,TX,75901,BAKELITE LLC,HON Rule
348,DAK AMERICAS LLC-COLUMBIA SITE,2905WDKMRC57KAV,57924,110043305688,33.866387,-81.012712,570 K AVE,GASTON,CALHOUN,SC,29053,DAK AMERICAS LLC,HON Rule


#### See a map of these producing facilities in the regions selected

You can use the rectangle selection tool in this next map to home in on a subset of the facilities.
If you don't select a rectangle, all facilities will be included.

Skip this if you want to use all of the facilities already selected.

In [None]:
from ECHO_modules.utilities import ipymapper

(map, shapes) = ipymapper(fac_df, no_text=False, lat_field='Latitude', long_field='Longitude',
                name_field='FacilityName', info_field='LatLongSource')
display(map)

#### Choose the years for the submissions you want to see

In [26]:
from ECHO_modules.utilities import show_year_range_widget
year_range = show_year_range_widget()

SelectionRangeSlider(description='Dates', index=(0, 54), layout=Layout(width='500px'), options=(1970, 1971, 19…

### If there are too many of these producing facilities, the number of submissions and releases can get very large. 
It may take a very long time to identify them all, and it may not work at all. 

If you try it and get stuck, we suggest you run it again and use either a rectangular region selection in the previous map or this next widget to
reduce the number of facilities. Being more specific with the date range will also work to reduce the number of submissions retrieved.

You can add to your selection with Ctrl+click, and extend the selection with Shift+click.

Skip these next two cells if you want to use all the facilities that have been chosen.
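
One reason narrowing the selection helps is that each lookup passes a list of identifiers to the database. The numbered progress lines in the cell outputs (`100) reading ...`, `200) reading ...`) suggest the lookups proceed in groups, though that is an inference from the output and not a description of how ECHO_modules works internally. A minimal sketch of batched `IN` queries, using a hypothetical `batched_in_queries` helper and a batch size chosen purely for illustration:

```python
def batched_in_queries(table, key, ids, batch_size=100):
    """Yield one SELECT statement per batch of integer ids (illustrative only)."""
    clean = [int(i) for i in ids]  # int() rejects anything that is not a number
    for start in range(0, len(clean), batch_size):
        chunk = clean[start:start + batch_size]
        id_list = ", ".join(str(i) for i in chunk)
        yield f'select * from "{table}" where "{key}" in ({id_list})'


# 250 made-up facility numbers need three queries at 100 ids per batch.
queries = list(batched_in_queries("submissions_data_rsei_v2312", "FacilityNumber",
                                  range(1, 251)))
print(len(queries))
```

Fewer facilities, or a narrower year range, means fewer and smaller requests of this kind.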

In [None]:
from ECHO_modules.rsei_utilities import get_facs_in_rect
from ECHO_modules.utilities import show_fac_widget

try:
    if len(shapes) > 0:
        fac_df = get_facs_in_rect(fac_df, 'Latitude', 'Longitude', shapes)
    fac_widget = show_fac_widget(fac_df['FacilityName'])
except NameError:
    print('Will use all facilities.')

We'll work in several steps to follow the chain from facilities to their submissions (with the associated chemical),
then from the submission to releases (using SubmissionNumber),
then from releases to elements (using ReleaseNumber),
and from releases to offsite facilities (using releases.OffsiteNumber with offsite.FacilityNumber).
We can then try to connect the offsite facility (offsite.TRIFID) with facility (FacilityID)
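
The chain described above can be sketched on toy data. All values below are invented, only the key column names mirror the notebook, and the sketch uses `merge` where the notebook's cells use `set_index().join()`.

```python
import pandas as pd

# Toy tables with invented values; only the key column names mirror the notebook.
facility = pd.DataFrame({"FacilityNumber": [1, 2],
                         "FacilityName": ["PLANT A", "PLANT B"]})
submissions = pd.DataFrame({"SubmissionNumber": [10, 11, 12],
                            "FacilityNumber": [1, 1, 2]})
releases = pd.DataFrame({"ReleaseNumber": [100, 101, 102],
                         "SubmissionNumber": [10, 11, 12],
                         "OffsiteNumber": [float("nan"), 900.0, float("nan")]})
offsite = pd.DataFrame({"FacilityNumber": [900], "Name": ["LANDFILL X"]})

# facility -> submissions -> releases
chain = (facility.merge(submissions, on="FacilityNumber")
                 .merge(releases, on="SubmissionNumber"))

# releases -> offsite: drop releases with no OffsiteNumber, then match the
# remaining numbers against the offsite table's FacilityNumber.
sent = chain.dropna(subset=["OffsiteNumber"]).astype({"OffsiteNumber": "int64"})
sent = sent.merge(offsite, left_on="OffsiteNumber", right_on="FacilityNumber",
                  suffixes=("_left", "_right"))

print(sent[["FacilityName", "Name"]])
```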

### Get the submissions and chemicals made by these facilities

In [28]:
from ECHO_modules.rsei_utilities import get_this_by_that, add_chemical_to_submissions

# Get the submissions 
columns = '"SubmissionNumber", "FacilityNumber", "ChemicalNumber", "SubmissionYear", "OneTimeReleaseQty", "TradeSecretInd"'

sub_df = get_this_by_that(this_name='submissions', that_series=fac_df['FacilityNumber'], this_key='FacilityNumber',
                          this_columns=columns, years=year_range.value, year_field='SubmissionYear')

# Get the chemical names and add them to the submissions
columns = '"ChemicalNumber", "Chemical", "RfCInhale"'
# columns = '"Chemical", "RfCInhale", "RfDOral"'
# columns = '*'
sub_df = add_chemical_to_submissions(submissions=sub_df, chemical_columns=columns)
# columns = ["SubmissionNumber", "ChemicalNumber", "Chemical", "RfCInhale"]
# sub_df[columns]
print(f'{len(sub_df)} submissions found for these facilities')

chem_widget = None
chem_numbers_widget = None


100) reading submissions_data_rsei_v2312
200) reading submissions_data_rsei_v2312
300) reading submissions_data_rsei_v2312
select "ChemicalNumber", "Chemical", "RfCInhale" from "chemical_data_rsei_v2312" where "ChemicalNumber" in (293, 493, 292, 331, 283, 549, 157, 347, 356, 632, 29, 153, 172, 502, 406, 595, 609, 521, 614, 360, 333, 45, 402, 610, 318, 273, 5, 124, 171, 225, 364, 379, 381, 599, 409, 82, 572, 592, 575, 193, 80, 261, 51, 60, 114, 116, 155, 163, 169, 238, 290, 291, 332, 627, 619, 591, 393, 589, 590, 454, 491, 531, 36, 191, 453, 519, 346, 359, 40, 154, 582, 130, 138, 126, 127, 208, 212, 565, 567, 222, 38, 58, 505, 536, 343, 315, 162, 376, 84, 629, 160, 161, 32, 275, 399, 370, 12, 334, 398, 115, 140, 252, 253, 316, 397, 514, 19, 11, 548, 224, 158, 410, 165, 75, 550, 214, 584, 460, 3, 6, 166, 456, 463, 467, 474, 613, 624, 608, 371, 201, 16, 17, 41, 55, 495, 85, 594, 83, 236, 135, 286, 891, 892, 269, 418, 195, 375, 441, 329, 352, 530, 223, 216, 215, 63, 447, 267, 419, 118, 489

#### Select the chemicals of interest
If you want to see the results for all the chemicals you can skip this cell.

In [None]:
from ECHO_modules.rsei_utilities import show_select_multiple_widget

chem_widget = show_select_multiple_widget(sub_df['Chemical'], 'Select Chemicals')
chem_widget.layout.height = '200px'
display(chem_widget)

...or enter a list of the chemical numbers if you know them.

In [None]:
from ECHO_modules.utilities import make_widget

widget_parms = {'type' : 'text', 'default' : '', 'description' : 'Enter list of chemical numbers'}
chem_numbers_widget = make_widget(widget_parms)
display(chem_numbers_widget)

#### This cell can be skipped if you didn't select any chemicals from the dropdown list or enter chemical numbers.
If skipped, we will proceed with the full list of chemicals.
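
The cell below converts each comma-separated entry with `int()`, which raises a `ValueError` on an empty entry such as the one left by a trailing comma. A more tolerant parser could look like this sketch; `parse_chemical_numbers` is a hypothetical name, not an ECHO_modules function.

```python
def parse_chemical_numbers(text):
    """Turn '51, 359,, 474 ' into [51, 359, 474], skipping empty entries."""
    return [int(item) for item in text.split(",") if item.strip()]


print(parse_chemical_numbers("51, 359,, 474 "))
```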

In [None]:
if chem_widget is not None and len(chem_widget.value) > 0:
    sub_df = sub_df[sub_df['Chemical'].isin(chem_widget.value)]
    print(f'Chemicals selected: {chem_widget.value}')
elif chem_numbers_widget is not None and len(chem_numbers_widget.value) > 0:
    chem_numbers = [int(item) for item in chem_numbers_widget.value.split(",")]
    sub_df = sub_df[sub_df['ChemicalNumber'].isin(chem_numbers)]
    print(f'ChemicalNumbers selected: {chem_numbers_widget.value}')

    
print(sub_df)

### We may filter the releases by which "media" we are interested in.
The available media and their descriptions are listed here.

Multiple media may be chosen.
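
The selection works by showing a combined label (code, tab, description) and then matching the chosen labels back to rows of the media table. Here is a small sketch of that round trip with a placeholder table; the real codes and descriptions come from `get_media()`, and the tuple stands in for the widget's value.

```python
import pandas as pd

# Placeholder media table; the real codes and text come from get_media().
media_df = pd.DataFrame({"Media": [1, 2, 3],
                         "MediaText": ["medium one", "medium two", "medium three"]})
media_df["Display"] = media_df["Media"].astype(str) + "\t" + media_df["MediaText"]

# Stand-in for media_widget.value, a tuple of the chosen labels.
selected = ("1\tmedium one", "3\tmedium three")
chosen_media = media_df[media_df["Display"].isin(selected)]
media_codes = chosen_media["Media"].to_list()
print(media_codes)
```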

In [29]:
from ECHO_modules.rsei_utilities import get_media, show_select_multiple_widget

media_df = get_media()
media_df['Display'] = media_df['Media'].astype(str) + '\t' + media_df['MediaText']
media_widget = show_select_multiple_widget(media_df['Display'], 'Select Media')
media_widget.layout.height = '200px'
display(media_widget)

SelectMultiple(description='Select Media', layout=Layout(height='200px', width='70%'), options=('1\t1 Fugitive…

In [30]:
chosen_media = None
if len(media_widget.value ) > 0:
    chosen_media = media_df[media_df['Display'].isin(media_widget.value)]

Start a linking dataframe with minimal fields to trace from the facility to the offsite locations, via submissions and releases.
Join on the FacilityNumber fields of fac_df and sub_df (submissions).

### Get the releases for the submissions and the elements for the releases
Note that many of the releases will not show an OffsiteNumber.
Those releases are not sent offsite.
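
As an illustration with invented rows, a missing OffsiteNumber can be used to split releases into those that stay on site and those that are sent elsewhere:

```python
import pandas as pd

# Invented release rows; a missing OffsiteNumber means nothing was sent offsite.
releases = pd.DataFrame({"ReleaseNumber": [100, 101, 102],
                         "OffsiteNumber": [float("nan"), 900.0, float("nan")],
                         "PoundsReleased": [5.0, 20.0, 7.5]})

is_offsite = releases["OffsiteNumber"].notna()
offsite_releases = releases[is_offsite]
onsite_releases = releases[~is_offsite]
print(len(onsite_releases), len(offsite_releases))
```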

In [31]:
# Link the facilities to the submissions
link_df = fac_df.set_index('FacilityNumber').join(sub_df.set_index('FacilityNumber'), lsuffix='_left', rsuffix='_right')

# Get the releases for the submissions
columns = '"ReleaseNumber", "SubmissionNumber", "Media", "PoundsReleased", "OffsiteNumber", "TEF"'
filter = None
if chosen_media is not None:
    filter = {'filter_field' : 'Media', 'filter_list' : chosen_media['Media'].to_list(), 'int_flag' : True}
rel_df = get_this_by_that(this_name='releases', that_series=sub_df['SubmissionNumber'], this_key='SubmissionNumber',
                          this_columns=columns, filter = filter)
if chosen_media is not None:
    rel_df = rel_df.dropna(subset=['Media'])
print(f'{len(rel_df)} releases found for these submissions')

# Get the elements for the releases
columns = '"ReleaseNumber", "ElementNumber", "PoundsPT", "ScoreCategory", "Score", "Population", "ScoreA", "PopA", "ScoreB", "PopB"'
element_df = get_this_by_that(this_name='elements', that_series=rel_df['ReleaseNumber'], this_key='ReleaseNumber', 
                              this_columns=columns)
print(f'{len(element_df)} elements found for these releases')
# Add the elements to the releases dataframe
rel_df = rel_df.set_index('ReleaseNumber').join(element_df.set_index('ReleaseNumber'), how='left')

100) reading releases_data_rsei_v2312
200) reading releases_data_rsei_v2312
300) reading releases_data_rsei_v2312
400) reading releases_data_rsei_v2312
500) reading releases_data_rsei_v2312
600) reading releases_data_rsei_v2312
700) reading releases_data_rsei_v2312
800) reading releases_data_rsei_v2312
900) reading releases_data_rsei_v2312
1000) reading releases_data_rsei_v2312
1100) reading releases_data_rsei_v2312
1200) reading releases_data_rsei_v2312
1300) reading releases_data_rsei_v2312
1400) reading releases_data_rsei_v2312
1500) reading releases_data_rsei_v2312
1600) reading releases_data_rsei_v2312
1700) reading releases_data_rsei_v2312
1800) reading releases_data_rsei_v2312
1900) reading releases_data_rsei_v2312
2000) reading releases_data_rsei_v2312
2100) reading releases_data_rsei_v2312
2200) reading releases_data_rsei_v2312
2300) reading releases_data_rsei_v2312
2400) reading releases_data_rsei_v2312
2500) reading releases_data_rsei_v2312
2600) reading releases_data_rsei_v

In [32]:
# Continue the linking process for facilities by joining the previous link with the releases.
# link_df1 will have all of the releases
# link_df2 will have only releases sent to offsite facilities
link_df1 = link_df.set_index('SubmissionNumber').join(rel_df.set_index('SubmissionNumber'))
# Keep only the releases that name an offsite facility
link_df2 = link_df1.dropna(subset=['OffsiteNumber'])

## Write out the chemical and element data for these facilities
This first file will contain all of the releases, the chemicals, the score, pounds released pre- and post-treatment, population affected and the year.

In [36]:
tmp_df = link_df1[link_df1.index.notnull()]
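wanted = ['FacilityID', 'FacilityName', 'Standard', 'Media', 'Chemical', 'Score', 'ScoreCategory', 
          'PoundsReleased', 'PoundsPT', 'Population', 'SubmissionYear']
# 'Standard' is only present when an FRSID list supplied it, so keep just the columns that exist
fac_chemicals = tmp_df[[col for col in wanted if col in tmp_df.columns]]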

from ipywidgets import widgets
filename_widget = widgets.Text(
    value='facility_chemicals.csv',
    description="Filename",
    disabled=False
)
filename_widget

Text(value='facility_chemicals.csv', description='Filename')

#### Write the file.

In [37]:
fac_chemicals.to_csv(filename_widget.value)

### Get the total pounds released, grouped by the facility and chemical
The next file will sum the pounds of the chemical released by the facility for the entire period.
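
To make the grouping concrete, here is a sketch on invented numbers. Each facility and chemical pair collapses to one row holding the total pounds.

```python
import pandas as pd

# Invented release rows for two facilities.
releases = pd.DataFrame({
    "FacilityName": ["PLANT A", "PLANT A", "PLANT A", "PLANT B"],
    "Chemical": ["Benzene", "Benzene", "Mercury", "Benzene"],
    "PoundsReleased": [10.0, 15.0, 1.0, 4.0],
})

totals = releases.groupby(["FacilityName", "Chemical"],
                          as_index=False)["PoundsReleased"].sum()
print(totals)
```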

In [None]:
fac_chem_pounds = link_df1.groupby(['FacilityName', 'Chemical'], as_index=False).agg({'Score' : 'mean', 'PoundsReleased' : 'sum', 'PoundsPT' : 'sum'})

fac_chem_pounds

### Choose a file to write to

In [None]:
from ipywidgets import widgets
filename_widget = widgets.Text(
    value='pounds_released.csv',
    description="Filename",
    disabled=False
)
filename_widget

#### Write the file

In [None]:
fac_chem_pounds.to_csv(filename_widget.value)

### Get the offsite facilities from the releases and sum the pounds released

In [None]:
if len(link_df2) > 0:
    columns = '"FacilityNumber", "TRIFID", "FRSID", "Name", "Street", "City", "State", "ZIPCode", "Latitude", "Longitude"'
    off_df = get_this_by_that(this_name='offsite', that_series=link_df2['OffsiteNumber'].dropna(), this_key='OffsiteID', this_columns=columns)
    print(f'{len(off_df)} offsite facilities found for these releases')

    # Continue the linking process started earlier. 
    # This time link the OffsiteNumber from releases with the FacilityNumber in offsite.
    link_df3 = link_df2.set_index('OffsiteNumber').join(off_df.set_index('FacilityNumber'), lsuffix='_left', rsuffix='_right')
    
    # Sum the pounds of chemical released by producing and offsite facilities
    off_chem_pounds = link_df3.groupby(['FacilityName', 'Name', 'Chemical'])['PoundsReleased'].sum()
else:
    print('There are no offsite releases for these facilities and media.')

#### If there are no offsite releases, the cells that follow having to do with the offsite facilities won't have meaning.
This can happen if we have chosen only air releases, for example.

### Choose a file to write the chemicals and pounds transferred to each offsite facility 

In [None]:
from ipywidgets import widgets
filename_widget = widgets.Text(
    value='offsite_pounds_released.csv',
    description="Filename",
    disabled=False
)
filename_widget

#### Write the file

In [None]:
off_chem_pounds.to_csv(filename_widget.value)

### Link the producing facilities with their offsite facilities.

Map the releasing facilities and the offsite facilities they send to.

`df_dicts` is a tuple of dictionaries containing the facilities to map. Each DataFrame must have a latitude and a longitude field. The dictionaries should have these keys:

- `'DataFrame'`: the DataFrame
- `'marker_color'`: circle border color
- `'marker_fill_color'`: circle interior color
- `'name_field'`: facility name field in the DataFrame
- `'lat_field'`: latitude field
- `'long_field'`: longitude field
- `'info_field'`: info field

The facilities producing waste will be shown with green circles.
The offsite facilities receiving the waste from the green facilities are shown with blue circles.
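
Before building the map it can help to confirm that each dictionary carries every key listed above. The check below is a sketch; `missing_keys` is a hypothetical helper, and whether the mapping function strictly requires all seven keys is an assumption based on that list.

```python
REQUIRED_KEYS = ("DataFrame", "marker_color", "marker_fill_color",
                 "name_field", "lat_field", "long_field", "info_field")


def missing_keys(df_dict):
    """Return the required keys that a marker dictionary lacks."""
    return [key for key in REQUIRED_KEYS if key not in df_dict]


fac_dict = {
    "DataFrame": None,  # placeholder; the notebook passes a pandas DataFrame here
    "marker_color": "black",
    "marker_fill_color": "green",
    "name_field": "FacilityName",
    "lat_field": "Latitude",
    "long_field": "Longitude",
}
print(missing_keys(fac_dict))
```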

In [None]:
from ECHO_modules.rsei_utilities import mapper2
'''
Pare the linking information down to just the latitude/longitude for the originating facility (_left)
and the coordinates for the offsite facility (_right).
There may be multiple transfers between the same two facilities, so we drop duplicates.
(The multiple transfers may be of interest. They will exist in link_df3.)
'''
link_df3 = link_df3.dropna(subset=['Latitude_left', 'Longitude_left', 'Latitude_right', 'Longitude_right'])
link_df4 = link_df3.drop_duplicates(subset=['Latitude_left', 'Longitude_left', 'Latitude_right', 'Longitude_right'])
link_df4 = link_df4[['Latitude_left', 'Longitude_left', 'Latitude_right', 'Longitude_right']]

fac_df = fac_df.dropna(subset=['Latitude', 'Longitude'])
off_df = off_df.dropna(subset=['Latitude', 'Longitude'])
fac_dict = {
    'DataFrame' : fac_df,
    'marker_color' : 'black',
    'marker_fill_color' : 'green',
    'name_field' : 'FacilityName',
    'lat_field' : 'Latitude',
    'long_field' : 'Longitude',
    'info_field' : None
}
off_dict = {
    'DataFrame' : off_df,
    'marker_color' : 'yellow',
    'marker_fill_color' : 'blue',
    'name_field' : 'Name',
    'lat_field' : 'Latitude',
    'long_field' : 'Longitude',
    'info_field' : None
}
map_facs_and_offs = mapper2(df_dicts=(fac_dict, off_dict), link_df=link_df4 )
display(map_facs_and_offs)

#### Save the map as an HTML file

In [None]:
from ipywidgets import widgets
filename_widget = widgets.Text(
    value='map_facs_to_offs.html',
    description="Filename",
    disabled=False
)
filename_widget

In [None]:
map_facs_and_offs.save(filename_widget.value)

#### Select one or more producing facilities to show their offsites
To zero in on one or a few producing facilities, you can select them from the list.

In [None]:
from ECHO_modules.utilities import show_fac_widget

fac_widget = show_fac_widget(fac_df['FacilityName'])
fac_widget.layout.height = '200px'

In [None]:
import pandas as pd

fac_off_df = link_df3[link_df3['FacilityName'].isin(fac_widget.value)]
off_count = len(fac_off_df['Name'].unique())
print(f'{len(fac_off_df)} transfers from these facilities to {off_count} offsite locations')
# df = pd.DataFrame(fac_widget.value)
fac_off_df = fac_off_df.drop_duplicates(subset=['Latitude_left', 'Longitude_left', 'Latitude_right', 'Longitude_right'])
fac_off_link_df = fac_off_df[['Latitude_left', 'Longitude_left', 'Latitude_right', 'Longitude_right']]
fac_off_link_df

### Show the map again for just these selected producing facilities

In [None]:
fac_dict = {
    'DataFrame' : fac_off_df,
    'marker_color' : 'black',
    'marker_fill_color' : 'green',
    'name_field' : 'FacilityName',
    'lat_field' : 'Latitude_left',
    'long_field' : 'Longitude_left',
    'info_field' : None
}
off_dict = {
    'DataFrame' : fac_off_df,
    'marker_color' : 'yellow',
    'marker_fill_color' : 'blue',
    'name_field' : 'Name',
    'lat_field' : 'Latitude_right',
    'long_field' : 'Longitude_right',
    'info_field' : None
}
map_facs_and_offs = mapper2(df_dicts=(fac_dict, off_dict), link_df=fac_off_link_df )
display(map_facs_and_offs)

#### Save the map as HTML

In [None]:
from ipywidgets import widgets
filename_widget = widgets.Text(
    value='map_facs_to_offs.html',
    description="Filename",
    disabled=False
)
filename_widget

In [None]:
map_facs_and_offs.save(filename_widget.value)

## See offsite facilities in the chosen region
These offsite facilities may be receiving from other facilities outside of this region. They aren't necessarily linked to the producing facilities in fac_df.

The offsite database table does not have a County field, so if our type of region is County, we select all offsite facilities in the state.

In [None]:
from ECHO_modules.rsei_utilities import get_rsei_facilities

columns = '"Name", "OffsiteID", "FacilityNumber", "TRIFID", "FRSID", "Latitude", "Longitude", "Street",'
columns += '"City", "State", "ZIPCode"'
# columns = '*'

this_region_type = region_type
if region_type == 'County':
    this_region_type = 'State'

off_df2 = get_rsei_facilities(state=state, region_type=this_region_type, regions_selected=regions_selected, 
                             rsei_type='offsite', columns=columns)
off_df2

#### In this map you can use a rectangle to select a subset of the offsite facilities
If your region is a county, the offsite facilities selected will initially be all in the state. You can make a selection on this map to filter to the area of your county. (You can only choose a rectangle.)
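
Conceptually, the rectangle filter keeps the sites whose coordinates fall inside a bounding box. The sketch below shows that test in plain Python. It is not the `get_facs_in_rect` implementation, the rectangle is hypothetical, and the two sample sites are rows borrowed from the facility table shown earlier purely for their coordinates.

```python
def in_rectangle(lat, lon, south, west, north, east):
    """True when the point lies inside the bounding box (edges included)."""
    return south <= lat <= north and west <= lon <= east


# Coordinates copied from two rows of the facility table shown earlier.
sites = [
    {"Name": "ELITE SPICE INC.", "Latitude": 39.174650, "Longitude": -76.776740},
    {"Name": "WINYAH GENERATING STATION", "Latitude": 33.330842, "Longitude": -79.357839},
]
# Hypothetical rectangle, chosen so that only the first site falls inside it.
box = {"south": 38.5, "west": -77.5, "north": 39.8, "east": -76.0}

inside = [site["Name"] for site in sites
          if in_rectangle(site["Latitude"], site["Longitude"], **box)]
print(inside)
```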

In [None]:

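from ECHO_modules.utilities import ipymapper

# Map only the offsite facilities that have coordinates
to_map = off_df2.dropna(subset=['Latitude', 'Longitude'])
(map, shapes) = ipymapper(to_map, no_text=False, lat_field='Latitude', long_field='Longitude',
                name_field='Name', info_field='POTW_Incin')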
display(map)

#### Use either the selected rectangle or choose from the list to use a subset of facilities

In [None]:

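from ECHO_modules.rsei_utilities import get_facs_in_rect
from ECHO_modules.utilities import show_fac_widget

if len(shapes) > 0:
    off_df2 = get_facs_in_rect(off_df2, 'Latitude', 'Longitude', shapes)
fac_widget = show_fac_widget(off_df2['Name'])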

## What chemicals are imported into the region at these sites, and where do they come from?
So far we have looked at the chemicals produced within our region or by our selected facilities.

Now we will look at what the offsite facilities within our region receive, and from which producing facilities, whether those are inside or outside the region.

The submissions will be filtered to the year range selected earlier.

In [None]:
# Get releases for these offsite facilities.

columns = '"ReleaseNumber", "SubmissionNumber", "Media", "PoundsReleased", "OffsiteNumber", "TEF"'
off_rel_df = get_this_by_that(this_name='releases', that_series=off_df2['OffsiteID'], this_key='OffsiteNumber',
                          this_columns=columns)

# Link releases to offsite facilities
off_link_df = off_df2.set_index('OffsiteID').join(off_rel_df.set_index('OffsiteNumber'), how='left').dropna(subset=['ReleaseNumber', 
                                                                                                                    'SubmissionNumber'])

# Get elements for the releases.
columns = '"ReleaseNumber", "ElementNumber", "PoundsPT", "ScoreCategory", "Score", "Population", "ScoreA", "PopA", "ScoreB", "PopB"'
off_element_df = get_this_by_that(this_name='elements', that_series=off_rel_df['ReleaseNumber'], this_key='ReleaseNumber', 
                              this_columns=columns)
off_rel_df = off_rel_df.set_index('ReleaseNumber').join(off_element_df.set_index('ReleaseNumber'), how='left')

# Get submissions for the releases.
columns = '"SubmissionNumber", "FacilityNumber", "ChemicalNumber", "SubmissionYear", "OneTimeReleaseQty", "TradeSecretInd"'
off_sub_df = get_this_by_that(this_name='submissions', that_series=off_rel_df['SubmissionNumber'], this_key='SubmissionNumber',
                              this_columns=columns, years=year_range.value, year_field='SubmissionYear')

# Get chemicals for the submissions.
columns = '"ChemicalNumber", "Chemical", "RfCInhale"'
off_sub_df = add_chemical_to_submissions(submissions=off_sub_df, chemical_columns=columns)

# Link submissions to offsite facilities.
off_link_df2 = off_link_df.set_index('SubmissionNumber').join(off_sub_df.set_index('SubmissionNumber'), how='left', lsuffix='_left', rsuffix='_right')

# Get producing facilities from the submissions.
columns = '"FacilityName", "FacilityID", "FacilityNumber", "FRSID", "Latitude", "Longitude", "Street",'
columns += '"City", "County", "State", "ZIPCode", "StandardizedParentCompany"'
off_fac_df = get_this_by_that(this_name='facility', that_series=off_sub_df['FacilityNumber'], this_key='FacilityNumber',
                              this_columns=columns)

# Link offsite facilities to producing facilities
off_link_df3 = off_link_df2.set_index('FacilityNumber_right').join(off_fac_df.set_index('FacilityNumber'), how='left', lsuffix='_left', rsuffix='_right')

# Sum pounds released by offsite facility, producing facility and chemical
off_chem_pounds = off_link_df3.groupby(['Name', 'FacilityName', 'Chemical'])['PoundsReleased'].sum()

#### Write the pounds of chemicals received by offsite facilities

In [None]:
from ipywidgets import widgets
filename_widget = widgets.Text(
    value='offsite_pounds_received.csv',
    description="Filename",
    disabled=False
)
filename_widget

In [None]:
off_chem_pounds.to_csv(filename_widget.value)

In [None]:
'''
Pare the linking information down to just the latitude/longitude for the offsite receiving facility (_left)
and the coordinates for the producing facility (_right).
There may be multiple transfers between the same two facilities, so we drop duplicates.
(The multiple transfers may be of interest. They will exist in off_link_df3.)
'''
off_link_df3 = off_link_df3.dropna(subset=['Latitude_left', 'Longitude_left', 'Latitude_right', 'Longitude_right'])
off_link_df4 = off_link_df3.drop_duplicates(subset=['Latitude_left', 'Longitude_left', 'Latitude_right', 'Longitude_right'])
off_link_df4 = off_link_df4[['Latitude_left', 'Longitude_left', 'Latitude_right', 'Longitude_right']]

fac_dict = {
    'DataFrame' : off_link_df3,
    'marker_color' : 'black',
    'marker_fill_color' : 'green',
    'name_field' : 'FacilityName',
    'lat_field' : 'Latitude_right',
    'long_field' : 'Longitude_right',
    'info_field' : None
}
off_dict = {
    'DataFrame' : off_link_df3,
    'marker_color' : 'yellow',
    'marker_fill_color' : 'blue',
    'name_field' : 'Name',
    'lat_field' : 'Latitude_left',
    'long_field' : 'Longitude_left',
    'info_field' : None
}
map_facs_and_offs = mapper2(df_dicts=(fac_dict, off_dict), link_df=off_link_df4 )
display(map_facs_and_offs)

#### Save the map to HTML

In [None]:
from ipywidgets import widgets
filename_widget = widgets.Text(
    value='map_offs_from_facs.html',
    description="Filename",
    disabled=False
)
filename_widget

In [None]:
map_facs_and_offs.save(filename_widget.value)

## From here down it is just ideas and scratch code

### Add the chemicals and pounds released to the facilities dataframe
We'll be able to see this information in the popups on the upcoming maps.

In [None]:
# Use fac_chem_pounds to build an info field on fac_df
# Truncate the chemical name to 20 characters - truncate(after=20)

In [None]:
df = pd.read_csv(filename_widget.value)

In [None]:
fac_with_chem = fac_df.set_index('FacilityName').join(df.set_index('FacilityName'))
fac_with_chem

In [None]:

# All the releases where "Media" <= 2 (I think those are the direct air releases) 
rsql = 'select * from "releases_data_rsei_v2312" where "Media" <= 2;' 
get_echo_data(rsql)
# All the releases above a certain weight 
rsql = 'select * from "releases_data_rsei_v2312" where "PoundsReleased" > 100000;' 
releases = get_echo_data(rsql)

In [None]:
len(releases)

In [None]:
# List the media codes and their descriptions 
media_sql = 'select "Media", "MediaText" from "media_data_rsei_v2312";' 
media_types = get_echo_data(media_sql)
media_types

In [None]:
# Get Exxon facilities 
rsql = 'select * from "facility_data_rsei_v2312" where "StandardizedParentCompany" like \'%EXXON%\';' 
facs = get_echo_data(rsql) 
# Get their submissions 
these_fac_numbers = list(facs["FacilityNumber"].unique()) 
rsql = 'select * from "submissions_data_rsei_v2312" where "FacilityNumber" in ({});'.format(','.join([str(fac) for fac in these_fac_numbers])) 
# You shouldn't do SQL like this but I'm being quick 
subs = get_echo_data(rsql) 

# Use these submission numbers to get releases 
# Ok, actually there are too many submissions (>20,000) to easily get all the Exxon releases from the database. 
# An enterprising SQL writer could do this with some joins, I bet! No time right now for me though 
# But this is the general idea.... 
these_submission_numbers = list(subs["SubmissionNumber"].unique())[0:50] 
# Just do the first 50 as a test 
rsql = 'select * from "releases_data_rsei_v2312" where "SubmissionNumber" in ({});'.format(','.join([str(fac) for fac in these_submission_numbers])) 
res = get_echo_data(rsql) 
res

### Filter the submissions to specific chemicals
There may be many different chemicals in the submission list. We provide this way of making it a little easier to choose which chemicals we want to see.

We can write the chemicals to a file and work with it externally to select the set of chemicals we are interested in. Then we can read in our selections rather than having to pick them out of a long list of chemicals a few cells later.

If there aren't too many chemicals you can skip this file writing and reading and select the chemicals from the full list.
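
The write, edit, and read-back cycle can be sketched with the standard `csv` module. The rows below are placeholders (in the notebook they would come from `sub_df`), the external edit is simulated by rewriting the file, and the file goes to a temporary directory so nothing in your working folder is touched.

```python
import csv
import os
import tempfile

# Placeholder rows; in the notebook these would come from sub_df.
rows = [{"ChemicalNumber": 1, "Chemical": "Chemical A"},
        {"ChemicalNumber": 2, "Chemical": "Chemical B"},
        {"ChemicalNumber": 3, "Chemical": "Chemical C"}]
fields = ["ChemicalNumber", "Chemical"]
path = os.path.join(tempfile.mkdtemp(), "chemicals.csv")


def write_rows(target, data):
    """Write the given rows to a CSV file with a header line."""
    with open(target, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(data)


# Step 1: write the full list for editing outside the notebook.
write_rows(path, rows)

# Step 2 (simulated): the external edit keeps only the rows of interest.
write_rows(path, [row for row in rows if row["Chemical"] != "Chemical B"])

# Step 3: read the edited file back and collect the selected numbers.
with open(path, newline="") as handle:
    chem_numbers = [int(row["ChemicalNumber"]) for row in csv.DictReader(handle)]
print(chem_numbers)
```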

In [None]:
from ipywidgets import widgets
filename_widget = widgets.Text(
    value='chemicals.csv',
    description="Filename",
    disabled=False
)
filename_widget
 

In [None]:
from ECHO_modules.get_data import get_echo_data

chemicals = "'1,2,4-Trichlorobenzene', '1,2,4-Trimethylbenzene', 'Benzene', 'Hexachlorobenzene', "
chemicals += "'Mercury', 'Mercury compounds', 'Polychlorinated biphenyls ', 'Polycyclic aromatic compounds'"

sql = f'select "ChemicalNumber", "Chemical" from "chemical_data_rsei_v2312" where "Chemical" in ({chemicals})'
df = get_echo_data(sql)
df

chemical_numbers = "51, 321, 359, 360, 474, 564, 575, 609"