In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
import pandas as pd
import numpy as np
from pathlib import Path

In [4]:
path = Path('/app/data/raw/fossil_infrastructure.xlsx')
assert path.exists()

In [5]:
# eip = dbcp.extract.eip_infrastructure.extract(path)
# hardcode the extract function so this notebook can be easily rerun in the future without maintenance
air = pd.read_excel(path, sheet_name='Air Construction')

In [6]:
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [7]:
air.shape

(879, 16)

# Cleaning
## Projects Cleaning
Columns I care about:
* id
* name
* modified on
* project ID (1:m as arrays)
* statute (1:m as arrays)
* permit type (1:m as arrays)
* permitting action (1:m as arrays)
* permit status
* description
* research notes

Cleaning Checklist:
- [x] Accuracy
- [x] Atomicity
- [ ] Consistency
- [x] Completeness
- [x] Uniformity
- [x] Validity
    - [x] Range Validation
    - [x] Uniqueness Validation
    - [x] Set Membership Validation
    - [x] Type Validation
    - [x] Cross-Field Validation

### Accuracy
The most important item to spot check here is the permit status. "Final" permit statuses are of little interest and also presumably don't change over time, so I'll only check 1 of those.

Results: 4/4 match dates and status 👍🏼

In [8]:
filter_ = air['Permit Status'].isin({"Application Pending", "Draft Issued"})
air.loc[filter_,:].sample(3, random_state=42)

Unnamed: 0,id,name,created_on,modified_on,Date Last Checked,Project (ID),Project,Permit Status,Description or Purpose,Application Date,Draft Permit Issuance Date,Last Day to Comment,Final Permit Issuance Date,Deadline to Begin Construction,Detailed Permitting History,Document URL
755,4558,0520-00492-V2 and PSD-LA-803 (M-2),2021-11-24T08:48:13.060662,2022-08-10T19:21:53.139398,2022-08-10,2933,Lake Charles Methanol - Initial Construction[2...,Draft Issued,This permit application would renew the facili...,2020-08-21,2022-01-20,2022-02-28,,,,
106,1720,R6PSD-DWP-GM8,2021-05-20T19:18:48.607217,2022-09-02T17:16:52.422962,2022-08-19,2762,Bluewater SPM Deepwater Port - Initial Constru...,Draft Issued,EPA Region 6 permit authorizing construction o...,2019-05-30,2020-11-12,2021-01-11,,,,
847,5260,169207,2022-06-27T14:23:14.151646,2022-08-26T18:07:50.889054,2022-08-26,5258,Cedar Bayou Hydrogen Plant - Initial Construct...,Application Pending,This permit would authorize construction of th...,2022-05-27,,,,,,


Alaska LNG Liquefaction Plant: confirmed on [AK DEC website](https://dec.alaska.gov/Applications/Air/airtoolsweb/AirPermitsApprovalsAndPublicNotices). Dates match; the status is a little more ambiguous, but I think "draft" is right.

Gulf LNG: confirmed on [MS state website](https://opcgis.deq.state.ms.us/enonline/ai_info.aspx?ai=23844). Application date and status match.

Delfin LNG: confirmed on [LA DEQ website](https://deq.louisiana.gov/public-notices?keyword=delfin&startDate=&endDate=). Dates and status match.

In [9]:
filter_ = air['Permit Status'].eq("Final Issued")
air.loc[filter_,:].sample(1, random_state=42)

Unnamed: 0,id,name,created_on,modified_on,Date Last Checked,Project (ID),Project,Permit Status,Description or Purpose,Application Date,Draft Permit Issuance Date,Last Day to Comment,Final Permit Issuance Date,Deadline to Begin Construction,Detailed Permitting History,Document URL
638,3670,152723,2021-08-24T17:03:45.752043,2022-02-17T20:00:59.538202,,3664,Enterprise Mont Belvieu - Frac X[3664],Final Issued,Permit authorizing construction of a tenth fra...,2018-07-12,,,2018-07-31,,"Registration No. 152723 (issued 7/31/2018, rev...",


Heim Gas Plant Expansion: confirmed on [TCEQ website](https://www15.tceq.texas.gov/crpub/index.cfm?fuseaction=iwr.pgmdetail&addn_id=120534092019308&re_id=578462662019220&program_code=AIRNSR&lgcy_sys_cd=NSR&program=AIR%20NEW%20SOURCE%20PERMITS&IdType=REG). Dates and status match.

### Atomicity
Most of the columns are 1:m values encoded as csv array strings, but most values are singletons. See Range Validation and Set Membership Validation for decisions on modeling as 1:1 vs 1:m.
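
If a column does end up modeled as 1:m, one way to make it atomic is to split the csv string and explode it into one row per element. This is a minimal sketch on made-up toy data, not part of the pipeline; `explode_array_column` is a hypothetical helper name.

```python
import pandas as pd

# Toy data with hypothetical values, only shaped like the real sheet
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "Project (ID)": ["2723", "4232, 4645", None],
})


def explode_array_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return one row per (id, element) for a comma-delimited string column."""
    out = df[["id", col]].dropna(subset=[col]).copy()
    out[col] = out[col].astype(str).str.split(",")  # string -> list of strings
    out = out.explode(col)                          # one row per list element
    out[col] = out[col].str.strip()                 # drop space after the comma
    return out.reset_index(drop=True)


long_form = explode_array_column(toy, "Project (ID)")
print(long_form)
```

Rows with a missing value are dropped here, which suits a separate association table; keep them instead if the 1:1 model wins out.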

### Completeness
For this purpose, I'll limit the scope of 'completeness' to only look at missing values within the data. For better or worse, it is EIP's job to ensure projects are in the dataset at all.

We are not interested in already-issued permits, so I'll remove those and assess completeness based on the remaining subset.

Notable missing values and lack of missing values:
* 6 records are missing permit status, and 5 of those are also missing project ID. These rows are almost entirely NaN apart from `id`, `name`, and the timestamps.

In [10]:
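def calc(num, denom=105):
    """Format the count and share of missing values, given `num` non-null values out of `denom` rows."""
    percent = 1 - num / denom
    n = denom - num
    return f"{n}/{denom} ({percent*100:.1f}%)"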

In [11]:
calc(102)

'3/105 (2.9%)'

In [12]:
len(air)

879

In [13]:
air['Permit Status'].value_counts()

Final Issued                                      754
Application Pending                                43
Draft Issued                                       27
Expired                                            20
Withdrawn (UARG v. EPA 134 S. Ct. 2427 (2014))     12
Withdrawn                                           9
Void                                                6
Denied                                              2
Name: Permit Status, dtype: int64

In [14]:
# .ne() is True for NaN, so rows with a missing status are counted here too
air.loc[air['Permit Status'].ne('Final Issued'),:].count()

id                                125
name                              123
created_on                        125
modified_on                       125
Date Last Checked                  82
Project (ID)                      120
Project                           120
Permit Status                     119
Description or Purpose            110
Application Date                  112
Draft Permit Issuance Date         32
Last Day to Comment                28
Final Permit Issuance Date         40
Deadline to Begin Construction     27
Detailed Permitting History        42
Document URL                       11
dtype: int64
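
The non-null counts above can also be turned into missing counts and percentages in one step, rather than feeding them to `calc` one at a time. A rough sketch on toy data with hypothetical values; `missing_summary` is a made-up helper name.

```python
import pandas as pd

# Toy data with hypothetical values, only shaped like the real sheet
toy = pd.DataFrame({
    "Permit Status": ["Final Issued", "Draft Issued", "Application Pending", None],
    "Application Date": ["2018-07-12", "2020-08-21", None, None],
})


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column."""
    n_missing = df.isna().sum()
    return pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(1),
    })


# Same filter as the notebook: everything that is not already issued
not_final = toy.loc[toy["Permit Status"].ne("Final Issued"), :]
summary = missing_summary(not_final)
print(summary)
```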

In [15]:
air.loc[air['Permit Status'].isna(),:]

Unnamed: 0,id,name,created_on,modified_on,Date Last Checked,Project (ID),Project,Permit Status,Description or Purpose,Application Date,Draft Permit Issuance Date,Last Day to Comment,Final Permit Issuance Date,Deadline to Begin Construction,Detailed Permitting History,Document URL
606,3169,,2021-07-06T12:27:00.231068,2021-07-06T12:27:00.231068,,,,,,,,,,,,
607,3170,,2021-07-06T12:27:32.120889,2021-07-06T12:27:32.120889,,,,,,,,,,,,
813,4874,"156320, PSDTX1558M1, GHGPSDTX193M1",2022-03-17T02:30:28.758823,2022-03-17T02:30:33.514155,,,,,,,,,,,,
821,4913,Jackson Generation,2022-04-01T19:24:37.632988,2022-04-01T19:24:47.500382,2022-04-01,,,,,,,,,,,
846,5259,169207,2022-06-27T14:23:14.151359,2022-06-27T14:23:25.286010,,,,,,,,,,,,
878,5503,159743,2022-09-14T14:34:27.100584,2022-09-14T14:34:27.299773,,5502.0,Junction Compressor - Initial Construction[5502],,,,,,,,,


### Consistency - defer
Defer until I've cleaned the related datasets.

### Uniformity
Important columns to check for consistent representation:
* all array fields -- check consistent delimiters
    * project ID (1:m as arrays)
    * statute (1:m as arrays)
    * permit type (1:m as arrays)
    * permitting action (1:m as arrays)
* modified on -- check consistent date format

#### Array Fields
Want to check for consistent array delimiters.

In [16]:
# exclude ID cols with numeric types (no arrays present)
id_cols = [
    'Project (ID)',
]

In [17]:
# mandatory opening pattern, optional delimiter, optional repeating pattern, optional closing pattern, mandatory end of line
array_pattern = r'(?:\d{3,5})(?:, ?)?(?:\d{3,5}, ?)*(?:\d{3,5})?$'

In [18]:
test_case = pd.Series([
    '1234',
    '1234,567',
    '1234, 567',
    '12345, 678, 9012',
    '1234\t5678', # tab is bad, no comma
    '12, 3456', # too short
    '1234    5678', # too many spaces, no comma
])
pd.concat([test_case, test_case.str.match(array_pattern)], axis=1)

Unnamed: 0,0,1
0,1234,True
1,"1234,567",True
2,"1234, 567",True
3,"12345, 678, 9012",True
4,1234\t5678,False
5,"12, 3456",False
6,1234    5678,False


In [19]:
# all pass the formatting test
for col in id_cols:
    assert air[col].str.match(array_pattern).all()
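
One caveat with the assertion above: `.str.match` does not return a plain `False` for missing entries unless told to, so they may not count as failures. A stricter variant, sketched here on toy data, passes `na=False` so that any missing entry fails the check. `all_match_strict` is a hypothetical helper, and whether missing IDs should fail at all is a modeling choice.

```python
import pandas as pd

# Same pattern as the notebook cell above
array_pattern = r'(?:\d{3,5})(?:, ?)?(?:\d{3,5}, ?)*(?:\d{3,5})?$'


def all_match_strict(values: pd.Series, pattern: str) -> bool:
    """True only if every entry, including missing ones, matches the pattern."""
    return bool(values.str.match(pattern, na=False).all())


# Toy series with hypothetical values
complete = pd.Series(["1234", "1234, 567"])
with_gap = pd.Series(["1234", None])

complete_ok = all_match_strict(complete, array_pattern)
with_gap_ok = all_match_strict(with_gap, array_pattern)
print(complete_ok, with_gap_ok)
```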

In [20]:
array_cols = [
    'Permitting Action', 
    'Permit Type',
    'Statute',
]

In [21]:
special_chars = air.loc[:, array_cols].copy()
for col in array_cols:
    special_chars.loc[:, col] = special_chars.loc[:, col].str.replace(r'\w|\s|,', '', regex=True)

KeyError: "None of [Index(['Permitting Action', 'Permit Type', 'Statute'], dtype='object')] are in the [columns]"

In [None]:
# no other delimiters present
special_chars.loc[special_chars.fillna('').ne('').any(axis=1),:]

#### Date Modified

In [22]:
# to_datetime works on all values present
timestamps = pd.to_datetime(air['modified_on'], errors='raise')
timestamps.dtypes, timestamps.isna().sum()

(dtype('<M8[ns]'), 0)
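
With `errors='raise'` a single bad value stops the cell without showing which rows are at fault. If that ever happens, a sketch like the following (toy data, hypothetical values) uses `errors='coerce'` to list the offending entries instead.

```python
import pandas as pd

# Toy series with hypothetical values: one good timestamp, one bad string, one missing
toy = pd.Series(["2021-05-21T15:13:50.395199", "not a date", None])

# Unparseable strings become NaT instead of raising
parsed = pd.to_datetime(toy, errors="coerce")

# Entries that were present in the raw data but failed to parse
unparseable = toy[parsed.isna() & toy.notna()]
print(unparseable)
```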

### Range Validation
Check project ID and date modified

#### Project ID

In [24]:
proj_ids = air['Project (ID)'].str.split(',', expand=True)
for col in proj_ids.columns:
    proj_ids.loc[:, col] = pd.to_numeric(proj_ids.loc[:, col], errors='raise')

proj_ids.head()

Unnamed: 0,0,1
0,2723.0,
1,2723.0,
2,2728.0,
3,2727.0,
4,2875.0,


In [25]:
# they all look in the same range
proj_ids.describe()

Unnamed: 0,0,1
count,861.0,3.0
mean,3304.284553,4370.666667
std,688.358292,237.584371
min,2723.0,4232.0
25%,2864.0,4233.5
50%,3008.0,4235.0
75%,3397.0,4440.0
max,5502.0,4645.0


#### Date Modified
The range looks fine.

In [26]:
pd.to_datetime(air['modified_on']).describe()

count                            879
unique                           721
top       2021-05-21 15:13:50.395199
freq                             159
first     2021-05-21 15:13:50.395199
last      2022-09-14 14:34:27.299773
Name: modified_on, dtype: object

### Uniqueness Validation
Check the `id` field

In [27]:
air['id'].duplicated().sum()

0

### Set Membership Validation
* statute (1:m as arrays)
* permit type (1:m as arrays)
* permitting action (1:m as arrays)
* permit status

#### Statute

In [29]:
air.columns

Index(['id', 'name', 'created_on', 'modified_on', 'Date Last Checked',
       'Project (ID)', 'Project', 'Permit Status', 'Description or Purpose',
       'Application Date', 'Draft Permit Issuance Date', 'Last Day to Comment',
       'Final Permit Issuance Date', 'Deadline to Begin Construction',
       'Detailed Permitting History', 'Document URL'],
      dtype='object')

In [28]:
air['Statute'].value_counts()

KeyError: 'Statute'

In [None]:
# split and combine value counts
air['Statute'].str.split(',', expand=True).stack().str.strip().value_counts()

#### Permit Type

In [30]:
air['Permit Type'].value_counts()

KeyError: 'Permit Type'

In [None]:
# split and combine value counts
air['Permit Type'].str.split(',', expand=True).stack().str.strip().value_counts()

#### Permitting Action
A bunch of 1:m categories, but very few actual values

In [31]:
air['Permitting Action'].value_counts()

KeyError: 'Permitting Action'

In [None]:
# split and combine value counts
air['Permitting Action'].str.split(',', expand=True).stack().str.strip().value_counts()

#### Permit Status
Will combine at least the two `withdrawn` categories, maybe even all of `expired`, `withdrawn`, `void`, `denied` into a single "Nope" category.

In [32]:
air['Permit Status'].value_counts()

Final Issued                                      754
Application Pending                                43
Draft Issued                                       27
Expired                                            20
Withdrawn (UARG v. EPA 134 S. Ct. 2427 (2014))     12
Withdrawn                                           9
Void                                                6
Denied                                              2
Name: Permit Status, dtype: int64
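
The consolidation described above could look roughly like this. It is a sketch rather than the final transform: `consolidate_status`, `STATUS_MAP`, and `DEAD` are hypothetical names, and the "Nope" label is just the placeholder from the note above.

```python
import pandas as pd

# Merge the two withdrawn variants into one label
STATUS_MAP = {
    "Withdrawn (UARG v. EPA 134 S. Ct. 2427 (2014))": "Withdrawn",
}
# Statuses that could optionally collapse into a single category
DEAD = {"Expired", "Withdrawn", "Void", "Denied"}


def consolidate_status(status: pd.Series, collapse_dead: bool = False) -> pd.Series:
    """Merge withdrawn variants; optionally collapse all dead statuses."""
    out = status.replace(STATUS_MAP)  # exact-value replacement, not regex
    if collapse_dead:
        # keep values that are not in DEAD (missing values stay missing)
        out = out.where(~out.isin(DEAD), "Nope")
    return out


# Toy series using category names from the value counts above
toy = pd.Series([
    "Final Issued",
    "Withdrawn (UARG v. EPA 134 S. Ct. 2427 (2014))",
    "Withdrawn",
    "Void",
    None,
])
merged = consolidate_status(toy)
collapsed = consolidate_status(toy, collapse_dead=True)
print(collapsed.value_counts(dropna=False))
```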

### Type Validation
Only the project ID and date modified fields will change type after transformation.

### Cross-Field Validation
None really needed. I could check that the date columns are in a logical order (application < draft issued < last comment date < final issued < construction deadline) but I'm not planning to really use those columns. So I skipped it.
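
For reference, the skipped date-order check could be sketched as below. This was not run against the real sheet: the toy values are hypothetical, `out_of_order` is a made-up helper, and it assumes missing dates should simply be ignored. Equal dates are allowed, so the ordering is non-strict.

```python
import pandas as pd

# Column names from the sheet, in their expected chronological order
DATE_COLS = [
    "Application Date",
    "Draft Permit Issuance Date",
    "Last Day to Comment",
    "Final Permit Issuance Date",
    "Deadline to Begin Construction",
]


def out_of_order(df: pd.DataFrame) -> pd.Series:
    """Flag rows whose non-missing dates are not in non-decreasing order."""
    dates = df[DATE_COLS].apply(pd.to_datetime, errors="coerce")
    ordered = dates.apply(lambda row: row.dropna().is_monotonic_increasing, axis=1)
    return ~ordered


# Toy data with hypothetical values: row 1 has a final date before its application date
toy = pd.DataFrame({
    "Application Date": ["2019-05-30", "2022-05-27"],
    "Draft Permit Issuance Date": ["2020-11-12", None],
    "Last Day to Comment": ["2021-01-11", None],
    "Final Permit Issuance Date": [None, "2022-01-01"],
    "Deadline to Begin Construction": [None, None],
})
flags = out_of_order(toy)
print(flags)
```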