# Methane leak detection
Methane is the dominant hydrocarbon in natural gas.  When methane leaks into the atmosphere without being burned, it causes significantly more climate damage, pound for pound, than CO2.  Although methane is valuable, a surprising amount of it leaks from pipelines, storage facilities, and even the pipes in your home; most estimates are that more than 2% of methane leaks into the atmosphere before being burned or used for other purposes.  That means that 25% or more of the climate damages associated with natural gas use come from leaks, rather than from burning the gas and emitting CO2.  

As you might imagine, preventing so-called "fugitive emissions" of methane is therefore an extremely important goal for the climate change mitigation community.  
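
The link between a leak rate of a few percent and a share of climate damages of roughly a quarter can be sanity-checked with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not the method of any cited study. It assumes the gas is pure methane, that all gas that does not leak is burned completely, and that methane has a global warming potential of 30 (the value used later in this notebook). The function name `leak_share` is made up for illustration.

```python
# Mass of CO2 produced per unit mass of CH4 burned (ratio of molar masses)
CO2_PER_CH4_BURNED = 44.01 / 16.04

def leak_share(leak_rate, gwp=30.0):
    """Fraction of total CO2-equivalent emissions that comes from leaked methane.

    leak_rate: fraction of produced methane that escapes unburned (assumption)
    gwp: global warming potential of methane relative to CO2 (assumption)
    """
    leaked_co2e = leak_rate * gwp                            # CO2e from leaks
    burned_co2 = (1.0 - leak_rate) * CO2_PER_CH4_BURNED      # CO2 from combustion
    return leaked_co2e / (leaked_co2e + burned_co2)

for rate in (0.02, 0.03):
    print(f"leak rate {rate:.0%}: {leak_share(rate):.1%} of CO2e comes from leaks")
```

Under these assumptions a 2% leak rate gives a share of roughly 18%, and a 3% leak rate gives roughly 25%. The result is sensitive to the warming potential chosen: passing a larger `gwp`, such as a 20-year value, pushes the share considerably higher.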

The data presented in this notebook give you a chance to explore the state of the art in monitoring and modeling methane emissions from oil and gas facilities.  The data come from the study documented in [this](https://eartharxiv.org/repository/view/2935/) paper and can be downloaded [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OX4QOA).  

The data set inspired a [follow-on study](https://www.climatechange.ai/papers/neurips2020/20) that won a best paper award at NeurIPS 2020.  The gist of the NeurIPS paper is to predict which facilities will have very large leaks and to allocate inspection and repair resources on the basis of those predictions.  It's a really nice example of how prediction model output can be used in a nuanced way to make good resource allocation decisions.

In [1]:
import numpy as np
import pandas as pd

In [2]:
xls = pd.ExcelFile('Wang_FEMPEA_SI_vf.xlsx')
df = pd.read_excel(xls, 'Component_Emissions')

In [3]:
df.head()

Unnamed: 0,Survey,Operator,Site,Site.Type,Treatment.Group,Vent/Leak,Emitting Component,Tank Relation,Emission (kg/d)
0,Survey 1,L,78,Oil MWPro,Annual,Leak,Valves,No,22.3
1,Survey 1,L,78,Oil MWPro,Annual,Vent,Open-Ended Line (Tank),Yes,69.95
2,Survey 1,L,78,Oil MWPro,Annual,Leak,Pneumatics,No,44.13
3,Survey 1,L,102,Gas SW,Annual,Vent,Open-Ended Line (Tank),Yes,69.95
4,Survey 1,L,102,Gas SW,Annual,Vent,Open-Ended Line (Non-Tank),No,12.1


Survey 1 is the first visit made to each facility in the data set. Let's focus on that first, since emissions measured in later surveys may reflect repairs that facilities made in response to earlier inspections.

In [4]:
# Keep only the rows recorded during the first survey
df_s1 = df.loc[df['Survey'] == 'Survey 1']

Let's see which emissions uncovered in the first visits are the largest:

In [5]:
df_s1_sort = df_s1.sort_values(by='Emission (kg/d)',ascending=False)
df_s1_sort.head()

Unnamed: 0,Survey,Operator,Site,Site.Type,Treatment.Group,Vent/Leak,Emitting Component,Tank Relation,Emission (kg/d)
296,Survey 1,F,105,Oil SW,Annual,Vent,Open-Ended Line (Tank),Yes,2532.85
655,Survey 1,P,89,Oil MW,Bi-Annual,Leak,Flange/Connector,No,1701.17
31,Survey 1,L,141,Gas SW,Bi-Annual,Vent,Open-Ended Line (Tank),Yes,1529.54
166,Survey 1,L,98,Gas MW,Bi-Annual,Vent,Pneumatics,No,1526.89
439,Survey 1,D,163,Oil SW,Tri-Annual,Vent,Open-Ended Line (Tank),Yes,1247.52
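
Sorting shows the largest individual emitters. To see which component types dominate in aggregate, one option is to group by the `Emitting Component` column. The sketch below uses only the five rows printed by `df.head()` above so that it runs without the Excel file; the name `demo` is made up for illustration, and on the real data you would substitute `df_s1`.

```python
import pandas as pd

# Five rows copied from the df.head() output shown earlier (not the full data set)
demo = pd.DataFrame({
    "Vent/Leak": ["Leak", "Vent", "Leak", "Vent", "Vent"],
    "Emitting Component": [
        "Valves",
        "Open-Ended Line (Tank)",
        "Pneumatics",
        "Open-Ended Line (Tank)",
        "Open-Ended Line (Non-Tank)",
    ],
    "Emission (kg/d)": [22.3, 69.95, 44.13, 69.95, 12.1],
})

# Count and total emissions per component type, largest total first
by_component = (
    demo.groupby("Emitting Component")["Emission (kg/d)"]
    .agg(["count", "sum"])
    .sort_values("sum", ascending=False)
)

# Each component type's share of the total emission rate
by_component["share"] = by_component["sum"] / by_component["sum"].sum()
print(by_component)
```

With only five rows the shares are not meaningful, but the same pattern applied to `df_s1` shows whether a few component types account for most of the measured emissions.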


With a global warming potential on the order of 30 (1 ton of methane does the same damage as 30 tons of carbon dioxide), the top emitter is emitting the equivalent of $2533\cdot 30 \approx 76{,}000$ kg of CO2 per day, or about 76 metric tonnes per day.  That's more than most American households emit in a year.

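The total emission rate revealed in Survey 1 is:

In [9]:
# Emissions are in kg/d: multiply by the GWP of 30, then divide by 1000 kg per metric tonne
total_co2e = df_s1_sort['Emission (kg/d)'].sum() * 30 / 1000
print(f'{total_co2e:.1f} metric tonnes of CO2-equivalent emissions per day')

1519.0 metric tonnes of CO2-equivalent emissions per day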


# Next steps
Example prediction exercises:
* Construct a prediction model that identifies facilities with the largest leaks, or even the size of the leak, based on site information that can be gathered without measuring leaks.  The operator and the site type (the `Operator` and `Site.Type` columns) are two obvious options here. 
* Construct a prediction model that classifies the type of leak (from the `Emitting Component` field) based on all other measurements, including emissions.  The application of the model in this case would be as follows: New technologies enable measuring the size of leaks with computer vision, but they may not find the source of the leak.  A machine learning model could help narrow down the set of options for where to look for the leak.
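
As a starting point for the first exercise, the component-level rows need to be rolled up to one row per site before any model can be fit. The sketch below does that and then builds a very simple baseline that predicts each site's total from the mean for its site type. It is an illustration only, not the approach of the NeurIPS paper. It uses the ten rows printed earlier in this notebook so that it runs on its own; the names `demo`, `site_totals`, and `Predicted (kg/d)` are made up here, and grouping by operator, site, and site type together is an assumption about what identifies a site.

```python
import pandas as pd

# Ten component-level rows copied from the two output tables shown earlier
rows = [
    ("L", 78, "Oil MWPro", 22.3),
    ("L", 78, "Oil MWPro", 69.95),
    ("L", 78, "Oil MWPro", 44.13),
    ("L", 102, "Gas SW", 69.95),
    ("L", 102, "Gas SW", 12.1),
    ("F", 105, "Oil SW", 2532.85),
    ("P", 89, "Oil MW", 1701.17),
    ("L", 141, "Gas SW", 1529.54),
    ("L", 98, "Gas MW", 1526.89),
    ("D", 163, "Oil SW", 1247.52),
]
demo = pd.DataFrame(rows, columns=["Operator", "Site", "Site.Type", "Emission (kg/d)"])

# Roll component-level measurements up to one row per site
site_totals = demo.groupby(
    ["Operator", "Site", "Site.Type"], as_index=False
)["Emission (kg/d)"].sum()

# Baseline: predict a site's total as the mean total of its site type.
# This is computed in-sample; a real evaluation needs held-out sites.
baseline = site_totals.groupby("Site.Type")["Emission (kg/d)"].mean()
site_totals["Predicted (kg/d)"] = site_totals["Site.Type"].map(baseline)

# One-hot encoded features and target, ready to pass to a model of your choice
X = pd.get_dummies(site_totals[["Operator", "Site.Type"]])
y = site_totals["Emission (kg/d)"]

print(site_totals)
```

Any model you build should beat this kind of group-mean baseline on sites it has not seen. With the full data set, replace `demo` with the relevant columns of `df_s1`.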

Note: if you're interested in working with these data, the study authors have promised that an updated data set will be available soon, but we can't guarantee it will arrive in time for your project.

One more thing to consider as you contemplate working with these data for your project: one of your jobs will eventually be to merge data sets together to expand what you study, learn and model.  However, the facilities in this data set are anonymized, which limits what you can merge with it.  You will likely need to use this one as a "warm up" data set, and then move on to other options.  The International Energy Agency has a [terrific set of resources](https://www.iea.org/articles/methane-tracker-data-explorer) for methane leak data that you may be able to use as a next step.  