# USB Carbon Dioxide - EDA of facilities data

## Ashley Moss, Bona Lee, Dylan Mair, Emma Tebbe

Getting this to work:

1. Go to the [EPA site](https://www.epa.gov/ghgreporting/ghg-reporting-program-data-sets), or use the direct links in step 2.
2. Download two zip files to your laptop:
  * Pull down the [2019 facility data](https://www.epa.gov/sites/default/files/2020-11/2019_data_summary_spreadsheets.zip).
  * Pull down the [emissions data](https://www.epa.gov/sites/default/files/2020-11/emissions_by_unit_and_fuel_type_c_d_aa_10_2020.zip) for these facilities.

3. Extract the files from the zip files.
4. Open your [Google Drive](https://drive.google.com/drive/my-drive).
5. Create a subdirectory at the top level called `co2` (if you name it differently, you need only change `DATADIR` below).
6. Go into your `co2` subdirectory and upload the two spreadsheets:
  * `Emissions by Unit and Fuel Type.xlsx`
  * `ghgp_data_2019.xlsx`
  * You may as well upload this Colab notebook while you're at it.
7. In your Google Drive `co2` subdirectory, double-click on the Colab notebook.
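
Before running the cells below, it can help to confirm that the two spreadsheets landed where the notebook expects them. The following is a minimal sketch, not part of the original notebook: the helper name `missing_files` is hypothetical, and `./data` is an assumed location that you should replace with your own `DATADIR`.

```python
from pathlib import Path

# File names taken from the upload step above.
EXPECTED_FILES = [
    "Emissions by Unit and Fuel Type.xlsx",
    "ghgp_data_2019.xlsx",
]

def missing_files(datadir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in datadir."""
    base = Path(datadir)
    return [name for name in expected if not (base / name).is_file()]

if __name__ == "__main__":
    # "./data" is an assumption; point this at your own DATADIR.
    missing = missing_files("./data")
    if missing:
        print("Missing files:", missing)
    else:
        print("All expected files found.")
```

An empty list means both spreadsheets are in place and the `pd.read_excel` calls should be able to find them.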

The rest of the notebook should run from here! For some things it will be easier to explore the Excel spreadsheets directly on your laptop. The experiments below focus on joining the latitude/longitude data of facilities (and some other attributes) with the emissions data.

Hopefully I'll shortly have this info aggregated according to Facility ID and plotted on a very basic map of California. Then it is on to wells and the FracTracker social setting attributes.

In [2]:
# Code goes here

#from google.colab import drive

# sign in to mount your google drive
#drive.mount('/content/drive', force_remount=True)

In [None]:
import os
import time
import numpy as np
import pandas as pd

#DATADIR = '/content/drive/MyDrive/co2/data/'

#os.chdir(DATADIR)
#print("Contents of the Top Directory:")
#print(os.listdir())

# No need to use colab, can be handled through Git

In [5]:
emissions = pd.read_excel(r'./data/Emissions by Unit and Fuel Type.xlsx',
                          sheet_name='UNIT_DATA', skiprows=4)

In [6]:
print(list(emissions.columns))
print(emissions.shape)
emissions = emissions[emissions.State == 'CA']
print(emissions.shape)

['Facility Id', 'FRS Id', 'Facility Name', 'City', 'State', 'Primary NAICS Code', 'Reporting Year', 'Industry Type (subparts)', 'Industry Type (sectors)', 'Unit Name', 'Unit Type', 'Unit Reporting Method', 'Unit Maximum Rated Heat Input Capacity (mmBTU/hr)', 'Unit CO2 emissions (non-biogenic) ', 'Unit Methane (CH4) emissions ', 'Unit Nitrous Oxide (N2O) emissions ', 'Unit Biogenic CO2 emissions (metric tons)']
(190709, 17)
(10279, 17)


In [7]:
emsCA2019 = emissions[emissions['Reporting Year'] == 2019]
emsCA2019.shape

(927, 17)

In [8]:
# Unique values of Facility ID for 2019 emissions data from facilities in California 
len(set(emsCA2019['Facility Id']))

298

In [None]:
# Unique values of Facility Name for 2019 emissions data from facilities in California
len(set(emsCA2019['Facility Name']))

295

In [9]:
# Unique values of FRS ID for 2019 emissions data from facilities in California
len(set(emsCA2019['FRS Id']))

368

In [11]:
facilities = pd.read_excel(r'./data/ghgp_data_2019.xlsx',
                          sheet_name='Direct Emitters', skiprows=3)

print(list(facilities.columns))
print(facilities.shape)
facilities = facilities[facilities.State == 'CA']
print(facilities.shape)

['Facility Id', 'FRS Id', 'Facility Name', 'City', 'State', 'Zip Code', 'Address', 'County', 'Latitude', 'Longitude', 'Primary NAICS Code', 'Industry Type (subparts)', 'Industry Type (sectors)', 'Total reported direct emissions', 'CO2 emissions (non-biogenic) ', 'Methane (CH4) emissions ', 'Nitrous Oxide (N2O) emissions ', 'HFC emissions', 'PFC emissions', 'SF6 emissions ', 'NF3 emissions', 'Other Fully Fluorinated GHG emissions', 'HFE emissions', 'Very Short-lived Compounds emissions', 'Other GHGs (metric tons CO2e)', 'Biogenic CO2 emissions (metric tons)', 'Stationary Combustion', 'Electricity Generation', 'Adipic Acid Production', 'Aluminum Production', 'Ammonia Manufacturing', 'Cement Production', 'Electronics Manufacture', 'Ferroalloy Production', 'Fluorinated GHG Production', 'Glass Production', 'HCFC–22 Production from HFC–23 Destruction', 'Hydrogen Production', 'Iron and Steel Production', 'Lead Production', 'Lime Production', 'Magnesium Production', 'Miscellaneous Use of Carbo

In [12]:
print(len(set(facilities['Facility Id'])))
print(len(set(facilities['Facility Name'])))
print(len(set(facilities['FRS Id'])))

383
380
372


In [13]:
ems = set(emsCA2019['Facility Id'])
facs = set(facilities['Facility Id'])

for fac in facs:
    if fac not in ems:
        print('facility id', fac, 'not in emissions data.')

facility id 1011716 not in emissions data.
facility id 1007736 not in emissions data.
facility id 1003679 not in emissions data.
facility id 1003680 not in emissions data.
facility id 1007817 not in emissions data.
facility id 1005782 not in emissions data.
facility id 1007836 not in emissions data.
facility id 1007839 not in emissions data.
facility id 1003748 not in emissions data.
facility id 1009930 not in emissions data.
facility id 1003800 not in emissions data.
facility id 1003813 not in emissions data.
facility id 1003814 not in emissions data.
facility id 1007951 not in emissions data.
facility id 1001819 not in emissions data.
facility id 1005955 not in emissions data.
facility id 1003918 not in emissions data.
facility id 1008030 not in emissions data.
facility id 1004010 not in emissions data.
facility id 1004011 not in emissions data.
facility id 1006089 not in emissions data.
facility id 1004052 not in emissions data.
facility id 1006107 not in emissions data.
facility id

In [14]:
for em in ems:
    if em not in facs:
        print('facility id', em, 'not in facilities data.')

The loop above prints nothing, so every Facility Id in the 2019 emissions data also appears in the facilities data. That means we can merge the latitude/longitude data into our emissions data. We could aggregate emissions by site first, but we keep the unit-level breakdown for now, since we might need it later if we get smart about different types of sources.
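
The same coverage check can be expressed with a pandas merge instead of nested loops. This is a sketch under stated assumptions: the helper `unmatched_ids` and the toy frames are hypothetical stand-ins for `emsCA2019` and `facilities`, and the values are made up. It relies on the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column marking rows found only on the left side.

```python
import pandas as pd

def unmatched_ids(left, right, key="Facility Id"):
    """Return the sorted key values in `left` that have no match in `right`."""
    right_keys = right[[key]].drop_duplicates()
    merged = left.merge(right_keys, how="left", on=key, indicator=True)
    missing = merged.loc[merged["_merge"] == "left_only", key]
    return sorted(missing.unique().tolist())

# Toy frames standing in for emsCA2019 and facilities (values are made up).
toy_emissions = pd.DataFrame({"Facility Id": [1, 1, 2],
                              "Unit Name": ["a", "b", "c"]})
toy_facilities = pd.DataFrame({"Facility Id": [1, 2, 3],
                               "Latitude": [34.1, 35.2, 36.3]})

# Emissions side is fully covered by the facilities table.
print(unmatched_ids(toy_emissions, toy_facilities))
# Facility 3 has no emissions rows.
print(unmatched_ids(toy_facilities, toy_emissions))
```

An empty result in the first direction is what justifies a `how='left'` merge onto the emissions rows without introducing missing coordinates.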

In [15]:
facilitiesB = facilities.rename(inplace=False,
                                columns={'Facility Name':'Facility Name2',
                                         'City':'City2',
                                         'Primary NAICS Code':'Primary NAICS Code2',
                                         'Industry Type (subparts)':'Industry Type (subparts)2',
                                         'Industry Type (sectors)':'Industry Type (sectors)2'})

emsCA2019loc = pd.merge(emsCA2019, facilitiesB[['Facility Id',
#                                               'Facility Name2',
                                               'City2',
                                               'Zip Code',
                                               'Address',
                                               'County',
                                               'Latitude',
                                               'Longitude',
#                                               'Primary NAICS Code2',
                                               'Industry Type (subparts)2',
                                               'Industry Type (sectors)2']],
                        how='left', on='Facility Id')
emsCA2019loc.shape
# tran2_df = pd.merge(tran_df, dupeval_df, how='left', on='value')

(927, 25)

In [16]:
diff_cities = emsCA2019loc[emsCA2019loc.City != emsCA2019loc.City2]
len(diff_cities)

2

In [17]:
diff_cities

Unnamed: 0,Facility Id,FRS Id,Facility Name,City,State,Primary NAICS Code,Reporting Year,Industry Type (subparts),Industry Type (sectors),Unit Name,...,Unit Nitrous Oxide (N2O) emissions,Unit Biogenic CO2 emissions (metric tons),City2,Zip Code,Address,County,Latitude,Longitude,Industry Type (subparts)2,Industry Type (sectors)2
116,1003599,110043800000.0,CALABASAS SANITARY LANDFILL,AGOURA &#40;UNINCORP. LA COUNTY&#41;,CA,562212,2019,"C,HH",Waste,GP-Turbines,...,108.472,30117.6,AGOURA (UNINCORP. LA COUNTY),91301,5300 LOST HILLS ROAD,LOS ANGELES COUNTY,34.141301,-118.711378,"C,HH",Waste
560,1010139,,Newport Fab LLC (dba TowerJazz),Newport Beach,CA,334413,2019,"C,I",Other,GP-Facility wide,...,7.45,0.0,Newport Beach,92660,4321 Jamboree Road,ORANGE COUNTY,33.66196,-117.85845,"C,I",Other


In [18]:
# Testing confirmed no mismatches for Facility Name.
# Testing confirmed no mismatches for Primary NAICS Code.

In [19]:
diff_sub = emsCA2019loc[emsCA2019loc['Industry Type (subparts)'] != emsCA2019loc['Industry Type (subparts)2']]
print(len(diff_sub))
diff_sub

133


Unnamed: 0,Facility Id,FRS Id,Facility Name,City,State,Primary NAICS Code,Reporting Year,Industry Type (subparts),Industry Type (sectors),Unit Name,...,Unit Nitrous Oxide (N2O) emissions,Unit Biogenic CO2 emissions (metric tons),City2,Zip Code,Address,County,Latitude,Longitude,Industry Type (subparts)2,Industry Type (sectors)2
110,1004613,1.100438e+11,BETA OCS PARCELS,Offshore,CA,211120,2019,"C,W",Petroleum and Natural Gas Systems,GP-ICEs,...,2.384,0.0,Offshore,0,,,33.57600,-118.12320,"C,W-OFFSH",Petroleum and Natural Gas Systems
111,1004613,1.100438e+11,BETA OCS PARCELS,Offshore,CA,211120,2019,"C,W",Petroleum and Natural Gas Systems,GP-Turbines,...,49.468,0.0,Offshore,0,,,33.57600,-118.12320,"C,W-OFFSH",Petroleum and Natural Gas Systems
127,1005164,1.100143e+11,"California Resources Elk Hills, LLC - Gas Proc...",TUPMAN,CA,211130,2019,"C,D,NN,W","Natural Gas and Natural Gas Liquids Suppliers,...",CTG-1,...,381.440,0.0,TUPMAN,93276,28590 Highway 119,KERN COUNTY,35.23893,-119.35951,"C,D,NN-FRAC,W-PROC","Natural Gas and Natural Gas Liquids Suppliers,..."
128,1005164,1.100143e+11,"California Resources Elk Hills, LLC - Gas Proc...",TUPMAN,CA,211130,2019,"C,D,NN,W","Natural Gas and Natural Gas Liquids Suppliers,...",CTG-2,...,378.460,0.0,TUPMAN,93276,28590 Highway 119,KERN COUNTY,35.23893,-119.35951,"C,D,NN-FRAC,W-PROC","Natural Gas and Natural Gas Liquids Suppliers,..."
129,1005164,1.100143e+11,"California Resources Elk Hills, LLC - Gas Proc...",TUPMAN,CA,211130,2019,"C,D,NN,W","Natural Gas and Natural Gas Liquids Suppliers,...",GP-01,...,14.006,0.0,TUPMAN,93276,28590 Highway 119,KERN COUNTY,35.23893,-119.35951,"C,D,NN-FRAC,W-PROC","Natural Gas and Natural Gas Liquids Suppliers,..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
903,1010774,,Valero Wilmington Asphalt Plant,Wilmington,CA,324110,2019,"C,MM,Y","Petroleum Product Suppliers, Refineries",CP-FG,...,4.470,0.0,Wilmington,90744,1651 Alameda Street,LOS ANGELES COUNTY,33.79699,-118.23971,"C,MM-REF,Y","Petroleum Product Suppliers,Refineries"
904,1010774,,Valero Wilmington Asphalt Plant,Wilmington,CA,324110,2019,"C,MM,Y","Petroleum Product Suppliers, Refineries",CP-NG,...,6.556,0.0,Wilmington,90744,1651 Alameda Street,LOS ANGELES COUNTY,33.79699,-118.23971,"C,MM-REF,Y","Petroleum Product Suppliers,Refineries"
917,1011526,,"Wild Goose Storage, LLC",Gridley,CA,486210,2019,"C,W",Petroleum and Natural Gas Systems,GP-Dehy,...,0.298,0.0,Gridley,95948,2780 West Liberty Rd.,BUTTE COUNTY,39.34764,-121.81628,"C,W-UNSTG",Petroleum and Natural Gas Systems
918,1011526,,"Wild Goose Storage, LLC",Gridley,CA,486210,2019,"C,W",Petroleum and Natural Gas Systems,GP-NGC Engines,...,18.178,0.0,Gridley,95948,2780 West Liberty Rd.,BUTTE COUNTY,39.34764,-121.81628,"C,W-UNSTG",Petroleum and Natural Gas Systems


In [20]:
diff_secs = emsCA2019loc[emsCA2019loc['Industry Type (sectors)'] != emsCA2019loc['Industry Type (sectors)2']]
print(len(diff_secs))
diff_secs

165


Unnamed: 0,Facility Id,FRS Id,Facility Name,City,State,Primary NAICS Code,Reporting Year,Industry Type (subparts),Industry Type (sectors),Unit Name,...,Unit Nitrous Oxide (N2O) emissions,Unit Biogenic CO2 emissions (metric tons),City2,Zip Code,Address,County,Latitude,Longitude,Industry Type (subparts)2,Industry Type (sectors)2
127,1005164,1.100143e+11,"California Resources Elk Hills, LLC - Gas Proc...",TUPMAN,CA,211130,2019,"C,D,NN,W","Natural Gas and Natural Gas Liquids Suppliers,...",CTG-1,...,381.440,0.0,TUPMAN,93276,28590 Highway 119,KERN COUNTY,35.238930,-119.359510,"C,D,NN-FRAC,W-PROC","Natural Gas and Natural Gas Liquids Suppliers,..."
128,1005164,1.100143e+11,"California Resources Elk Hills, LLC - Gas Proc...",TUPMAN,CA,211130,2019,"C,D,NN,W","Natural Gas and Natural Gas Liquids Suppliers,...",CTG-2,...,378.460,0.0,TUPMAN,93276,28590 Highway 119,KERN COUNTY,35.238930,-119.359510,"C,D,NN-FRAC,W-PROC","Natural Gas and Natural Gas Liquids Suppliers,..."
129,1005164,1.100143e+11,"California Resources Elk Hills, LLC - Gas Proc...",TUPMAN,CA,211130,2019,"C,D,NN,W","Natural Gas and Natural Gas Liquids Suppliers,...",GP-01,...,14.006,0.0,TUPMAN,93276,28590 Highway 119,KERN COUNTY,35.238930,-119.359510,"C,D,NN-FRAC,W-PROC","Natural Gas and Natural Gas Liquids Suppliers,..."
130,1005321,1.100438e+11,California Resources Production Corporation – ...,PIRU,CA,211130,2019,"C,NN,W","Natural Gas and Natural Gas Liquids Suppliers,...",Hot Oil Heater,...,0.596,0.0,PIRU,93040,3824 GUIBERSON ROAD-GAS PL,VENTURA COUNTY,34.390033,-118.795921,"C,NN-FRAC,W-PROC","Natural Gas and Natural Gas Liquids Suppliers,..."
165,1003610,1.100205e+11,CHEVRON PRODS.CO. RICHMOND REFY,RICHMOND,CA,324110,2019,"C,P,PP,Y","Chemicals, Refineries, Suppliers of CO2",CP-1 RFG V-475 minus H2 Plant,...,1498.344,0.0,RICHMOND,94801,841 CHEVRON WAY,CONTRA COSTA COUNTY,37.938779,-122.396453,"C,P,PP,Y","Chemicals,Refineries,Suppliers of CO2"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
900,1008504,1.100331e+11,VALERO REFINING CO - CALI FORNIA BENICIA REFINERY,BENICIA,CA,324110,2019,"C,MM,P,PP,Y","Chemicals, Petroleum Product Suppliers, Refine...",Natural Gas to SRU-A Train,...,1.490,0.0,BENICIA,94510,3400 E SECOND ST,SOLANO COUNTY,38.071670,-122.139750,"C,MM-REF,P,PP,Y","Chemicals,Petroleum Product Suppliers,Refineri..."
901,1008504,1.100331e+11,VALERO REFINING CO - CALI FORNIA BENICIA REFINERY,BENICIA,CA,324110,2019,"C,MM,P,PP,Y","Chemicals, Petroleum Product Suppliers, Refine...",Natural Gas to SRU-B Train,...,0.596,0.0,BENICIA,94510,3400 E SECOND ST,SOLANO COUNTY,38.071670,-122.139750,"C,MM-REF,P,PP,Y","Chemicals,Petroleum Product Suppliers,Refineri..."
902,1008504,1.100331e+11,VALERO REFINING CO - CALI FORNIA BENICIA REFINERY,BENICIA,CA,324110,2019,"C,MM,P,PP,Y","Chemicals, Petroleum Product Suppliers, Refine...",Propane Combustion,...,0.894,0.0,BENICIA,94510,3400 E SECOND ST,SOLANO COUNTY,38.071670,-122.139750,"C,MM-REF,P,PP,Y","Chemicals,Petroleum Product Suppliers,Refineri..."
903,1010774,,Valero Wilmington Asphalt Plant,Wilmington,CA,324110,2019,"C,MM,Y","Petroleum Product Suppliers, Refineries",CP-FG,...,4.470,0.0,Wilmington,90744,1651 Alameda Street,LOS ANGELES COUNTY,33.796990,-118.239710,"C,MM-REF,Y","Petroleum Product Suppliers,Refineries"


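Okay, so 2 rows have City mismatches, and neither looks like a real difference: one has HTML-escaped parentheses (`&#40;`, `&#41;`) in the emissions data, and the other shows no visible difference in the output, so it may be a whitespace issue. Subparts and Sectors (both from Industry Type) differ in many more rows (133 and 165 respectively), but these look like different labels for the same things: the facilities file adds suffixes to subparts (e.g. `W-OFFSH` for `W`, `MM-REF` for `MM`), and in the rows shown the sector lists differ only in the spacing after commas. So maybe not a big deal.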
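
If these columns ever need to be compared directly, the labels could be normalized first. The helpers below are a hypothetical sketch, not something the notebook does: they assume the only differences are the ones visible in the output above (HTML entities, subpart suffixes after a hyphen, and comma spacing). Dropping the subpart suffix discards detail, so it only makes sense for comparison purposes.

```python
import html
import re

def normalize_city(value):
    """Decode HTML entities and trim surrounding whitespace."""
    return html.unescape(value).strip()

def normalize_subparts(value):
    """Drop the '-SUFFIX' detail from each subpart code, e.g. 'W-OFFSH' -> 'W'."""
    return ",".join(part.strip().split("-")[0] for part in value.split(","))

def normalize_sectors(value):
    """Use a single ', ' separator between sector names."""
    return re.sub(r"\s*,\s*", ", ", value.strip())

# Values copied from the mismatch tables above.
print(normalize_city("AGOURA &#40;UNINCORP. LA COUNTY&#41;"))
print(normalize_subparts("C,D,NN-FRAC,W-PROC"))
print(normalize_sectors("Petroleum Product Suppliers,Refineries"))
```

Applied to both sides with `Series.map`, the remaining mismatches (if any) would be the ones worth a closer look.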

In [None]:
!pip install redis

Collecting redis
  Downloading redis-3.5.3-py2.py3-none-any.whl (72 kB)
Installing collected packages: redis
Successfully installed redis-3.5.3


In [None]:
import redis
import matplotlib.pyplot as plt
import gmaps
import gmaps.datasets
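# Read the key from an environment variable instead of hard-coding it
# (the variable name GMAPS_API_KEY is an assumption; `os` is imported above).
gmaps.configure(api_key=os.environ["GMAPS_API_KEY"])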
%matplotlib inline

ModuleNotFoundError: ignored

In [None]:
emsCA2019locagg = emsCA2019loc.groupby('Facility Id').agg({'Latitude':'min',
                                                           'Longitude':'min',
                                                           'Unit CO2 emissions (non-biogenic) ':'sum'})
emsCA2019locagg.reset_index(inplace=True)
emsCA2019locagg.head()

In [None]:
# Looks like the gmaps widget does not work on Google Colab. Ugh.

fig = gmaps.figure(center=(35.8,-119.4), zoom_level=6)
# fig = gmaps.figure()

emission_layer = gmaps.heatmap_layer(emsCA2019locagg[['Latitude', 'Longitude']],
                                     weights=emsCA2019locagg['Unit CO2 emissions (non-biogenic) '])
facility_layer = gmaps.symbol_layer(emsCA2019locagg[['Latitude', 'Longitude']],
                                     fill_color='black', scale=10)
emission_layer.max_intensity = 50000
emission_layer.point_radius = 250
# fig.add_layer(emission_layer)
fig.add_layer(facility_layer)
fig
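
Since the gmaps widget does not render on Colab, a plain matplotlib scatter plot is one possible fallback for a very basic map. This is a sketch under assumptions, not the notebook's method: `plot_facilities` is a hypothetical helper, the sample points are made up, and missing emission values are not handled.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

def plot_facilities(lats, lons, emissions, max_marker=300.0):
    """Scatter facilities by longitude/latitude, marker area scaled by emissions."""
    peak = max(emissions) if len(emissions) else 0.0
    scale = max_marker / peak if peak > 0 else 0.0
    # Keep tiny emitters visible with a minimum marker size.
    sizes = [max(e * scale, 5.0) for e in emissions]
    fig, ax = plt.subplots(figsize=(6, 7))
    ax.scatter(lons, lats, s=sizes, alpha=0.5, color="black")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.set_title("CO2 emissions (non-biogenic) by facility")
    return fig, ax

# Made-up sample points; with the notebook's data you would pass the
# Latitude, Longitude and emissions columns of emsCA2019locagg instead.
fallback_fig, fallback_ax = plot_facilities(
    [34.1, 35.2, 37.9],
    [-118.7, -119.4, -122.4],
    [100.0, 5000.0, 20000.0],
)
```

This gives no base map or coastline, only the facility positions, but it is enough to sanity-check the coordinates and the relative size of emitters.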

In [None]:
emsCA2019locagg['Longitude'].mean()
