In [None]:
%load_ext autoreload

In [None]:
%autoreload 2

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

In [None]:
path = Path('/app/data/raw/2022.03.22OGW.xlsx')
assert path.exists()

In [None]:
# eip = dbcp.extract.eip_infrastructure.extract(path)
# hardcode the extract function so this notebook can be easily rerun in the future without maintenance
proj = pd.read_excel(path, sheet_name='Project')

In [None]:
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [None]:
proj.shape

# Cleaning
## Projects Cleaning
- [x] Accuracy
- [x] Atomicity
- [ ] Consistency
- [x] Completeness
- [x] Uniformity
- [x] Validity
    - [x] Range Validation
    - [x] Uniqueness Validation
    - [x] Set Membership Validation
    - [x] Type Validation
    - [x] Cross-Field Validation

### Accuracy
The biggest accuracy risks for this dataset are probably 1) bad reporting to the EPA (double checking that would take a huge research effort) and 2) transcription errors by EIP between the PDFs and the database. I'll do a small spot check to guard against the second.

Results: I only checked 3 facilities but still managed to find conflicting information about CO2e numbers 😕

In [None]:
proj.sample(3, random_state=42)

Sinton Compressor Station: the [permit](https://api.oilandgaswatch.org/d/98/f8/98f85e1d868f4e63966d01637fc5408c.1638199494.pdf) confirmed all emissions numbers EXCEPT GHG (not mentioned). I couldn't find a source for the 450,475 figure. The [environmental impact statement](https://www.ferc.gov/sites/default/files/2020-05/corpuschristiFEIS.pdf) submitted to FERC claims only 155,000 tpy of CO2e.

Golden Pass LNG Terminal: GHG numbers confirmed on page 25 of [the permit doc](https://api.oilandgaswatch.org/d/18/54/18545bea701e4bed938050997b308fdf.1638219234.pdf)

NGPL Compressor Station: had to go digging for the docs, but found the [FERC Environmental Assessment](https://www.ferc.gov/sites/default/files/2020-04/CP19-99-EA.pdf) that confirms the 173.4 tpy CO2e numbers.

### Atomicity
By inspection I see that all the ID and associated name fields can contain multiple values. I'll only worry about Facility IDs and Air Construction Permit IDs.

On a related note, both the facilities table and the project table have a column linking the two. I'll have to combine them to get a complete association entity table.
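
A minimal sketch of how the multi-valued `Facility (ID)` strings could be turned into an association table with one row per (project, facility) pair. The toy frame and its IDs are made up for illustration, and the output column names `project_id` and `facility_id` are my own placeholders, not names from the dataset.

```python
import pandas as pd

# Toy stand-in for the project table; the IDs are invented.
toy_proj = pd.DataFrame({
    'id': [1, 2, 3],
    'Facility (ID)': ['1234', '1234, 567', None],
})

assoc = (
    toy_proj[['id', 'Facility (ID)']]
    .rename(columns={'id': 'project_id', 'Facility (ID)': 'facility_id'})
    # split the delimited string into a list, then give each element its own row
    .assign(facility_id=lambda df: df['facility_id'].str.split(','))
    .explode('facility_id')
    # projects with no facility contribute no rows
    .dropna(subset=['facility_id'])
    .assign(facility_id=lambda df: df['facility_id'].str.strip().astype(int))
    .reset_index(drop=True)
)
print(assoc)
```

The same reshaping applied to the link column in the facilities table, followed by a concat and `drop_duplicates`, would presumably give the combined table.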

### Completeness
For this purpose, I'll limit the scope of 'completeness' to only look at missing values within the data. For better or worse, it is EIP's job to ensure projects are in the dataset at all.

We are not interested in already-operating projects, so I'll remove those and assess completeness based on the remaining subset.

Notable missing values and lack of missing values:
* all projects are linked to a facility ID!
* all projects have an operating status
* 95/308 (30.8%) are missing Air Construction Permit IDs. Likely because many of these projects are too new to have gone through the permitting process.
* 86/308 (27.9%) are missing CO2e estimates. Same newness reason.

For evidence of the "too new to have a permit" hypothesis, compare completeness of criteria pollutants before/after subsetting by operational status:
* For all projects, only around 66/672 (9.8%) are missing criteria pollutants (NOx, VOC, CO, SO2, PM2.5)
* For not-operational projects, 60/308 (19.5%) are missing criteria pollutants. So nearly all of the missing values.
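
The before/after comparison above can be sketched on toy data as follows. The column names mirror this notebook, but the values and the non-operating status labels (`Proposed`, `Under construction`) are invented, and `missing_summary` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy data; values and status labels are invented for illustration.
toy = pd.DataFrame({
    'Operating Status': ['Operating', 'Operating', 'Proposed', 'Under construction'],
    'Nitrogen Oxides (NOx)': [1.0, 2.0, np.nan, 3.0],
})

def missing_summary(df, col):
    """Return 'n_missing/n_total (pct%)' for one column."""
    n_missing = int(df[col].isna().sum())
    n_total = len(df)
    return f"{n_missing}/{n_total} ({n_missing / n_total:.1%})"

# Same filter as used in this notebook: everything that is not 'Operating'
not_operating = toy[toy['Operating Status'].ne('Operating')]

all_summary = missing_summary(toy, 'Nitrogen Oxides (NOx)')
subset_summary = missing_summary(not_operating, 'Nitrogen Oxides (NOx)')
print(all_summary, subset_summary)
```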

In [None]:
def calc(num, denom=308):
    percent = 1- num/denom
    n = denom - num
    return f"{n}/{denom} ({percent*100:.1f}%)"

In [None]:
calc(248)

In [None]:
len(proj)

In [None]:
proj['Operating Status'].value_counts()

In [None]:
proj.loc[proj['Operating Status'].ne('Operating'), :].count()

### Consistency - defer
Defer until I've cleaned the related datasets
### Uniformity
Important columns to check consistent representation:
* ID fields (check consistent array delimiters)
* all the emissions - check metric vs short tons

Secondary importance:
* modified_on
* project cost (supposed to be in millions $)
* jobs promised has inconsistent formatting/delimiters

#### ID Fields
Want to check for consistent array delimiters.

In [None]:
# exclude ID cols with numeric types (no arrays present)
id_cols = [col for col in proj.columns if '(ID)' in col and pd.api.types.is_object_dtype(proj[col])]
id_cols

In [None]:
# mandatory opening pattern, optional delimiter, optional repeating pattern, optional closing pattern, mandatory end of line
array_pattern = r'(?:\d{3,5})(?:, ?)?(?:\d{3,5}, ?)*(?:\d{3,5})?$'

In [None]:
test_case = pd.Series([
    '1234',
    '1234,567',
    '1234, 567',
    '12345, 678, 9012',
    '1234\t5678', # tab is bad, no comma
    '12, 3456', # too short
    '1234    5678', # too many spaces, no comma
])
pd.concat([test_case, test_case.str.match(array_pattern)], axis=1)

In [None]:
# all pass the formatting test
for col in id_cols:
    assert proj[col].str.match(array_pattern).all()
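
One caveat on the pattern above: because every piece after the first ID is optional, it also accepts a trailing delimiter (`'1234,'`) and an undelimited run of six to ten digits (`'12345678'`). A stricter variant is sketched below. This is an alternative I have not run against the real data, so the assertion above may or may not still pass with it. `Series.str.match` uses `re.match` under the hood, so plain `re` is enough to compare the two.

```python
import re

# Pattern used in this notebook
loose_pattern = re.compile(r'(?:\d{3,5})(?:, ?)?(?:\d{3,5}, ?)*(?:\d{3,5})?$')
# Stricter sketch: one ID, then zero or more ", ID" groups, then end of string
strict_pattern = re.compile(r'\d{3,5}(?:, ?\d{3,5})*$')

edge_cases = ['1234, 567', '1234,', '12345678']
# Map each case to (loose result, strict result)
results = {
    s: (bool(loose_pattern.match(s)), bool(strict_pattern.match(s)))
    for s in edge_cases
}
print(results)
```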

#### Emissions
Check metric vs short tons

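Edit: the two units differ only by a factor of 0.907 (one short ton is about 0.907 metric tons), so I won't be able to tell them apart from the distribution of values. The difference is within the noise.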
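
To put a number on "within the noise": on a log10 axis, like the histogram in the next cell, a unit mix-up would shift a value by only a few hundredths of a decade. A quick check of that arithmetic:

```python
import math

# 1 short ton = 2000 lb = 907.18474 kg, i.e. about 0.907 metric tons
SHORT_TON_IN_METRIC_TONS = 0.90718474

# Size of the shift a short-vs-metric mix-up would cause on a log10 axis
shift_in_decades = abs(math.log10(SHORT_TON_IN_METRIC_TONS))
print(round(shift_in_decades, 3))
```

That is roughly 0.04 of an order of magnitude, which is small compared with a distribution that spans several orders of magnitude.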

In [None]:
proj['Greenhouse Gases (CO2e)'].replace(0, np.nan).transform(np.log10).plot.hist(bins=50)

#### Date Modified

In [None]:
# to_datetime works on all values present
timestamps = pd.to_datetime(proj['modified_on'])
timestamps.dtypes, timestamps.isna().sum()

#### Jobs
Check array delimiter, naming, and order.

* Array delimiter: can be `,` or `;` or none
* naming: `temporary`, `permanent`, `full-time`, `construction`, `operating` and none given
* order: not consistent. Needs a parser.

In [None]:
# only 74/672 (11%) have jobs numbers
proj['Number of Jobs Promised'].dropna().shape

In [None]:
jobs = proj['Number of Jobs Promised'].dropna()

**What special characters are present?**

In [None]:
from functools import reduce
# strip digits, word characters, and whitespace; whatever remains is a special character
reduce(set.union, [set(item) for item in jobs.str.replace(r'\d+|\w+|\s+', '', regex=True).to_list()])

In [None]:
# not a delimiter
jobs[jobs.str.contains('>')]

In [None]:
# not an array delimiter. It is a range delimiter
jobs[jobs.str.contains('-')]

In [None]:
# not a delimiter
jobs[jobs.str.contains(r'\(|\)')]

**What job types are present?**

In [None]:
jobs.str.extractall('([a-zA-Z]+)')[0].value_counts()

In [None]:
# repeat but without that long parenthetical
jobs.str.replace(r'\(.+\)', '', regex=True).str.extractall('([a-zA-Z]+)')[0].value_counts()

I think `permanent`, `full-time`, and `operating` are equivalent, as are `temporary` and `construction`. `Unkown` should be treated as null.
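
A rough sketch of what the parser might look like, under the assumption that each count is written as a number followed by its label (for example `100 permanent, 1200 construction`). That layout is my guess, not something verified against every row. `parse_jobs`, `JOB_TYPE`, and `TOKEN` are hypothetical names, and the example strings are invented.

```python
import re

# Assumed mapping, following the equivalences guessed above
JOB_TYPE = {
    'permanent': 'permanent', 'full-time': 'permanent', 'operating': 'permanent',
    'temporary': 'temporary', 'construction': 'temporary',
}
# A run of digits, optional whitespace, then one of the known labels
TOKEN = re.compile(
    r'(\d+)\s*(permanent|full-time|operating|temporary|construction)',
    re.IGNORECASE,
)

def parse_jobs(text):
    """Return {'permanent': int or None, 'temporary': int or None} for one raw string."""
    out = {'permanent': None, 'temporary': None}
    for number, label in TOKEN.findall(text):
        out[JOB_TYPE[label.lower()]] = int(number)
    return out

print(parse_jobs('100 permanent, 1200 construction'))
print(parse_jobs('300 temporary; 50 operating'))
```

Because it searches for number-label pairs rather than splitting on a delimiter, the order of the entries and the choice of `,` or `;` do not matter. Ranges, `>` prefixes, thousands separators, and unlabeled numbers are deliberately left unhandled here and would need their own rules.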

### Range Validation
Check IDs, Emissions, Cost, Jobs, expected completion year
#### Emissions
Kind of hard to interpret, but no outrageous smoking guns

In [None]:
emission_cols = [
    'Greenhouse Gases (CO2e)',
    'Particulate Matter (PM2.5)',
    'Nitrogen Oxides (NOx)',
    'Volatile Organic Compounds (VOC)',
    'Sulfur Dioxide (SO2)',
    'Carbon Monoxide (CO)',
    'Hazardous Air Pollutants (HAPs)',
]

In [None]:
# Sulfur Dioxide (SO2) is absent from the summary: the column is not numeric, so describe() skips it
proj[emission_cols].describe()

In [None]:
# a single value causes the issue
proj.loc[proj['Sulfur Dioxide (SO2)'].str.contains(',').fillna(False), ['id', 'name', 'Sulfur Dioxide (SO2)']]

In [None]:
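# keep the first of the duplicated values; note that .str methods return NaN for any non-string cells
sulfur = pd.to_numeric(proj['Sulfur Dioxide (SO2)'].str.split(',').str[0], errors='raise')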

In [None]:
sulfur.describe()

In [None]:
emission_cols.remove('Sulfur Dioxide (SO2)')

In [None]:
extremely_large_idx = [proj.loc[:, col].nlargest(5).index for col in emission_cols] + [sulfur.nlargest(5).index]

In [None]:
extremely_large = pd.Index([])
for index in extremely_large_idx:
    extremely_large = extremely_large.union(index)
extremely_large

In [None]:
proj.loc[extremely_large, ['id', 'name', 'Project Description', 'Sulfur Dioxide (SO2)'] + emission_cols].sort_values('Greenhouse Gases (CO2e)')

In [None]:
import matplotlib.pyplot as plt

In [None]:
# NOTE: this only includes positive values (most but not all of them)
n = len(emission_cols)+1
fig, axes = plt.subplots(nrows=n, figsize=(5, n*4))
for i, col in enumerate(emission_cols):
    proj.loc[:, col].replace(0,np.nan).transform(np.log10).hist(bins=40, ax=axes[i])
    axes[i].set_title(col)
sulfur.replace(0,np.nan).transform(np.log10).hist(bins=40, ax=axes[n-1])
axes[n-1].set_title('Sulfur Dioxide (SO2)')

In [None]:
# Negative values only
n = len(emission_cols)+1
fig, axes = plt.subplots(nrows=n, figsize=(5, n*4))
for i, col in enumerate(emission_cols):
    proj.loc[:, col].mul(-1).replace(0,np.nan).transform(np.log10).hist(bins=10, ax=axes[i])
    axes[i].set_title(col)
sulfur.mul(-1).replace(0,np.nan).transform(np.log10).hist(bins=10, ax=axes[n-1])
axes[n-1].set_title('Sulfur Dioxide (SO2)')

#### IDs
There are lots of ID columns, but I only care about Facility IDs and Air Construction IDs

In [None]:
# defined way up near the top
id_cols

In [None]:
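# split the delimited IDs into columns, then convert every column to numeric.
# .apply builds new numeric columns; assigning through .loc[:, col] can keep the object dtype in recent pandas
fac_ids = proj['Facility (ID)'].str.split(',', expand=True).apply(pd.to_numeric, errors='coerce')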

fac_ids.head()

In [None]:
# they all look in the same range
fac_ids.describe()

In [None]:
# same approach as for the facility IDs: split, then convert every column to numeric
air_const_ids = proj['Air Construction (ID)'].str.split(',', expand=True).apply(pd.to_numeric, errors='coerce')

air_const_ids.head()

In [None]:
# they all look in the same range
air_const_ids.describe()

#### Project Cost
Check uniformity at the same time: should be in millions of dollars. Check vs thousands or single dollars.

In [None]:
# wrong dtype
proj['Project Cost (million $)'].hist(bins=30)

By manual inspection (there are not that many values present), I see that there are a handful of values of the form "XX, XX". The first number is repeated in an array. So I want to use the same method that fixed the identical issue in the `State` column. But first I need to check that there are no commas present as thousands separators or for other reasons. Check that a split on commas produces two identical values:

In [None]:
proj['Project Cost (million $)'].str.split(',', expand=True).dropna()
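
The check described above can also be made programmatic instead of visual. A sketch on invented values in the same "XX, XX" shape, assuming the column was read in as text (`.str` methods return NaN for non-string cells):

```python
import pandas as pd

# Toy values; the numbers are invented for illustration.
toy_cost = pd.Series(['120', '45.5, 45.5', None, '8000, 8000'])

# Split on commas, then convert each resulting column to numeric
parts = toy_cost.str.split(',', expand=True)
parts = parts.apply(lambda col: pd.to_numeric(col.str.strip(), errors='coerce'))

# Wherever a second value exists, it must repeat the first one exactly
has_second = parts[1].notna()
assert (parts.loc[has_second, 0] == parts.loc[has_second, 1]).all()

# Safe to keep only the first value
cost_clean = parts[0]
print(cost_clean)
```

If a comma were acting as a thousands separator, the two halves would differ and the assertion would fail.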

In [None]:
# definitely no single dollar amounts.
# As for thousands, check that the 8-14 billion dollar projects are plausible
cost = pd.to_numeric(proj['Project Cost (million $)'].str.split(',').str[0], errors='raise')
cost.hist(bins=20)

In [None]:
# log transform
cost.transform(np.log10).hist(bins=20)

In [None]:
# Yes, costs are in millions. If they were in thousands, it would mean these megafacilities were being built with 6-14.5 million dollars. I'd buy one at that price!
pd.set_option('display.max_colwidth', None)  # None disables truncation of long text columns
proj.loc[cost.nlargest(8).index, ['name', 'Project Cost (million $)', 'Project Description']]

#### Jobs
Skipping for now because I need to make a parser first.

### Uniqueness Validation
Check the `id` field

In [None]:
proj['id'].duplicated().sum()

### Set Membership Validation
* classification
* industry sector
* project type
* operating status

#### Classification
Doesn't look like any erroneous categories to me.

In [None]:
proj['Classification'].value_counts()

#### Industry Sector
One entry is a multi-valued (one-to-many) array. Simplify by picking a single sector for it.

In [None]:
proj['Industry Sector'].value_counts()

#### Project Type
This column has a fair number of multi-valued array entries. But the categories themselves look consistent -- no misspellings, etc.

In [None]:
proj['Project Type'].value_counts()

In [None]:
# split and combine value counts
proj['Project Type'].str.split(',', expand=True).stack().str.strip().value_counts()

#### Operating Status
Just need to replace "Unknown" with Null

In [None]:
proj['Operating Status'].value_counts()

### Type Validation
Already did this during range validation, but `Sulfur Dioxide (SO2)` and `Project Cost` require parsing duplicated comma-separated array values in what should be numeric columns. Also, `Number of Jobs Promised` needs parsing into two columns: permanent and temporary jobs.

#### Completion Year
I started converting this to numeric, but I would have to model multi-valued items. I think the benefit (sorting, quantitative analysis) is small relative to the cost of 1) communicating the modeling and 2) actually doing the modeling. I think we have other fields we would filter on first.

In [None]:
proj['Actual or Expected Completion Year'].str.len().hist(bins=40)

In [None]:
proj['Actual or Expected Completion Year'].str.len().nlargest(8)

In [None]:
proj.loc[proj['Actual or Expected Completion Year'].str.len().nlargest(8).index, 'Actual or Expected Completion Year']

### Cross-Field Validation
None really needed. I could check that the date columns are in a logical order (modified > created, for example) but I'm not planning to really use those columns. So I skipped it.
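
For the record, the skipped check would be short. A sketch on invented timestamps: `modified_on` is a real column in this table, but `created_on` is an assumed name for the creation-date column.

```python
import pandas as pd

# Toy dates; 'created_on' is a hypothetical column name.
toy_dates = pd.DataFrame({
    'created_on': ['2021-05-01', '2021-06-15', '2022-01-10'],
    'modified_on': ['2021-05-03', '2021-06-15', '2021-12-31'],
}).apply(pd.to_datetime)

# Rows where the record claims to be modified before it was created
out_of_order = toy_dates[toy_dates['modified_on'] < toy_dates['created_on']]
print(out_of_order)
```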