In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
import pandas as pd
import numpy as np
from pathlib import Path

In [4]:
path = Path('/app/data/raw/2022.03.22OGW.xlsx')
assert path.exists()

In [5]:
# eip = dbcp.extract.eip_infrastructure.extract(path)
# hardcode the extract function so this notebook can be easily rerun in the future without maintenance
air = pd.read_excel(path, sheet_name='Air Construction')

In [6]:
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [7]:
air.shape

(836, 26)

# Cleaning
## Projects Cleaning
Columns I care about:
* id
* name
* modified on
* project ID (1:m as arrays)
* statute (1:m as arrays)
* permit type (1:m as arrays)
* permitting action (1:m as arrays)
* permit status
* description
* research notes

Cleaning Checklist:
- [x] Accuracy
- [x] Atomicity
- [ ] Consistency
- [x] Completeness
- [x] Uniformity
- [x] Validity
    - [x] Range Validation
    - [x] Uniqueness Validation
    - [x] Set Membership Validation
    - [x] Type Validation
    - [x] Cross-Field Validation

### Accuracy
The most important item to spot check here is the permit status. "Final" permit statuses are of little interest and also presumably don't change over time, so I'll only check 1 of those.

Results: 4/4 match dates and status 👍🏼

In [8]:
filter_ = air['Permit Status'].isin({"Application Pending", "Draft Issued"})
air.loc[filter_,:].sample(3, random_state=42)

Unnamed: 0,id,name,created_by,created_on,modified_by,modified_on,private,Date Last Checked,Project (ID),Project,Statute,Permit Type,Permitting Action,Permit Status,Description or Purpose,Application Date,Draft Permit Issuance Date,Last Day to Comment,Final Permit Issuance Date,Deadline to Begin Construction,Document(s),Detailed Permitting History,Research Notes,Document URL,Review Flag,FOIA Flag
5,1591,AQ1539CPT01,EIP Test Account,2021-05-20T19:18:48.607217,Griffin Bird,2022-02-09T16:26:23.548159,False,,2944,Alaska LNG - New Liquefaction Plant[2944],Clean Air Act,Major,Initial,Draft Issued,Permits authorizing initial construction of th...,2018-05-01,2020-09-11,2020-12-10,,,{u'url': u'https://api.oilandgaswatch.org/d/9a...,AQ1539CPT01 (draft permit issued 9/11/2020),Public comment period ends 12/10/2020. Emissio...,,,
294,1979,1280-00132,EIP Test Account,2021-05-20T19:18:48.607217,Alexandra Shaykevich,2022-03-09T17:46:37.380210,False,,2892,Gulf LNG Liquefaction Project[2892],Clean Air Act,Major,Initial,Application Pending,"This application, if approved, would authorize...",2015-06-17,,,,,{u'url': u'https://api.oilandgaswatch.org/d/ce...,"Application submitted 9/30/2015, revised 3/29/...",No final permit as of 2/10/2022. GB,,,
719,4278,0560-00990-V1,Alexandra Shaykevich,2021-10-20T14:53:16.529822,Griffin Bird,2022-02-23T19:34:06.789121,False,2022-02-23,2825,Delfin Onshore Facility - Initial Construction...,Clean Air Act,True Minor,"Renewal, Extension",Draft Issued,"Application to renew Permit No. 0560-00990-V0,...",2021-01-05,2022-02-22,2022-03-29,,,{u'url': u'https://api.oilandgaswatch.org/d/4f...,,,,,


Alaska LNG Liquefaction Plant: confirmed on [AK DEC website](https://dec.alaska.gov/Applications/Air/airtoolsweb/AirPermitsApprovalsAndPublicNotices). Dates match; the status is more ambiguous, but I think "draft" is right.

Gulf LNG: confirmed on [MS state website](https://opcgis.deq.state.ms.us/enonline/ai_info.aspx?ai=23844). Application date and status match.

Delfin LNG: confirmed on [LA DEQ website](https://deq.louisiana.gov/public-notices?keyword=delfin&startDate=&endDate=). Dates and status match.

In [9]:
filter_ = air['Permit Status'].eq("Final Issued")
air.loc[filter_,:].sample(1, random_state=42)

Unnamed: 0,id,name,created_by,created_on,modified_by,modified_on,private,Date Last Checked,Project (ID),Project,Statute,Permit Type,Permitting Action,Permit Status,Description or Purpose,Application Date,Draft Permit Issuance Date,Last Day to Comment,Final Permit Issuance Date,Deadline to Begin Construction,Document(s),Detailed Permitting History,Research Notes,Document URL,Review Flag,FOIA Flag
789,4688,159014 - Heim Plant Expansion,Alexandra Shaykevich,2022-01-03T19:58:39.079966,Alexandra Shaykevich,2022-01-15T23:10:06.193460,False,,4684,Heim Gas Plant - Initial Construction[4684],Clean Air Act,True Minor,Major Modification,Final Issued,"Permit by Rule authorized maintenance, startup...",2021-08-04,,,2021-08-30,,{u'url': u'https://api.oilandgaswatch.org/d/97...,,https://www15.tceq.texas.gov/crpub/index.cfm?f...,,,


Heim Gas Plant Expansion: confirmed at [TX CEQ website](https://www15.tceq.texas.gov/crpub/index.cfm?fuseaction=iwr.pgmdetail&addn_id=120534092019308&re_id=578462662019220&program_code=AIRNSR&lgcy_sys_cd=NSR&program=AIR%20NEW%20SOURCE%20PERMITS&IdType=REG). Dates and status match.

### Atomicity
Several of the columns hold 1:m values encoded as comma-separated array strings, but most values are singletons. See Range Validation and Set Membership Validation for decisions on modeling as 1:1 vs 1:m.
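A minimal sketch of how a comma-separated column could be made atomic, assuming a 1:m model is chosen. The `toy` frame is a hypothetical miniature (the ids and actions are borrowed from rows shown above), not the real `air` data.

```python
import pandas as pd

# Hypothetical miniature of the real data: one array-encoded column
toy = pd.DataFrame({
    "id": [1591, 1979, 4278],
    "Permitting Action": ["Initial", "Initial", "Renewal, Extension"],
})

# Split each string on commas, then emit one row per (id, action) pair
actions = toy.assign(
    **{"Permitting Action": toy["Permitting Action"].str.split(",")}
).explode("Permitting Action")

# Remove the whitespace left behind by the ", " delimiter
actions["Permitting Action"] = actions["Permitting Action"].str.strip()
actions = actions.reset_index(drop=True)
```

The result is a long table that could serve as an association table between permits and actions; singleton values pass through unchanged.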

### Completeness
For this purpose, I'll limit the scope of 'completeness' to only look at missing values within the data. For better or worse, it is EIP's job to ensure projects are in the dataset at all.

We are not interested in already-issued permits, so I'll remove those and assess completeness based on the remaining subset.

Notable missing values and lack of missing values:
* The only 3 records missing project ID + permit info are entirely NaN rows
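The subset-then-count approach described above can be sketched as follows. The `toy` frame and its values are hypothetical; only the column names and the `'Final Issued'` label come from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the permits table
toy = pd.DataFrame({
    "Permit Status": ["Final Issued", "Draft Issued", "Application Pending", np.nan],
    "Project (ID)": ["2723", "2944", "2892", np.nan],
    "Application Date": ["2015-01-01", "2018-05-01", np.nan, np.nan],
})

# Drop already-issued permits; rows with a missing status are kept by .ne()
not_final = toy.loc[toy["Permit Status"].ne("Final Issued"), :]

# Count and percentage of missing values per column in the remaining subset
missing = not_final.isna().sum().to_frame("n_missing")
missing["pct_missing"] = (missing["n_missing"] / len(not_final) * 100).round(1)
```

Note that `.ne('Final Issued')` evaluates to `True` for NaN statuses, so entirely empty rows stay in the denominator.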

In [14]:
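def calc(num, denom=105):
    """Format the count and share of missing values, given `num` non-null values out of `denom` rows."""
    percent = 1 - num / denom
    n = denom - num
    return f"{n}/{denom} ({percent*100:.1f}%)"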

In [15]:
calc(102)

'3/105 (2.9%)'

In [11]:
len(air)

836

In [12]:
air['Permit Status'].value_counts()

Final Issued                                      731
Application Pending                                29
Draft Issued                                       28
Expired                                            20
Withdrawn (UARG v. EPA 134 S. Ct. 2427 (2014))     12
Withdrawn                                           8
Void                                                4
Denied                                              1
Name: Permit Status, dtype: int64

In [13]:
air.loc[air['Permit Status'].ne('Final Issued'),:].count().T

id                                105
name                              103
created_by                        105
created_on                        105
modified_by                       105
modified_on                       105
private                           105
Date Last Checked                  38
Project (ID)                      102
Project                           102
Statute                           102
Permit Type                       102
Permitting Action                 101
Permit Status                     102
Description or Purpose             90
Application Date                   95
Draft Permit Issuance Date         32
Last Day to Comment                31
Final Permit Issuance Date         40
Deadline to Begin Construction     26
Document(s)                        89
Detailed Permitting History        43
Research Notes                     80
Document URL                        4
Review Flag                         1
FOIA Flag                           3
dtype: int64

In [17]:
air.loc[air['Permit Status'].isna(),:]

Unnamed: 0,id,name,created_by,created_on,modified_by,modified_on,private,Date Last Checked,Project (ID),Project,Statute,Permit Type,Permitting Action,Permit Status,Description or Purpose,Application Date,Draft Permit Issuance Date,Last Day to Comment,Final Permit Issuance Date,Deadline to Begin Construction,Document(s),Detailed Permitting History,Research Notes,Document URL,Review Flag,FOIA Flag
613,3169,,Courtney Bernhardt,2021-07-06T12:27:00.231068,Commons Cloud Bot,2021-07-06T12:27:00.231068,True,,,,,,,,,,,,,,,,,,,
614,3170,,Courtney Bernhardt,2021-07-06T12:27:32.120889,Commons Cloud Bot,2021-07-06T12:27:32.120889,True,,,,,,,,,,,,,,,,,,,
831,4874,"156320, PSDTX1558M1, GHGPSDTX193M1",Griffin Bird,2022-03-17T02:30:28.758823,Griffin Bird,2022-03-17T02:30:33.514155,True,,,,,,,,,,,,,,,,,,,


### Consistency - defer
Defer until I've cleaned the related datasets.

### Uniformity
Important columns to check consistent representation:
* all array fields -- check consistent delimiters
    * project ID (1:m as arrays)
    * statute (1:m as arrays)
    * permit type (1:m as arrays)
    * permitting action (1:m as arrays)
* modified on -- check consistent date format

#### Array Fields
Want to check for consistent array delimiters.

In [26]:
# exclude ID cols with numeric types (no arrays present)
id_cols = [
    'Project (ID)',    
]

In [22]:
# mandatory opening pattern, optional delimiter, optional repeating pattern, optional closing pattern, mandatory end of line
array_pattern = r'(?:\d{3,5})(?:, ?)?(?:\d{3,5}, ?)*(?:\d{3,5})?$'

In [23]:
test_case = pd.Series([
    '1234',
    '1234,567',
    '1234, 567',
    '12345, 678, 9012',
    '1234\t5678', # tab is bad, no comma
    '12, 3456', # too short
    '1234    5678', # too many spaces, no comma
])
pd.concat([test_case, test_case.str.match(array_pattern)], axis=1)

Unnamed: 0,0,1
0,1234,True
1,"1234,567",True
2,"1234, 567",True
3,"12345, 678, 9012",True
4,1234\t5678,False
5,"12, 3456",False
6,1234 5678,False


In [27]:
# all pass the formatting test
for col in id_cols:
    assert air[col].str.match(array_pattern).all()

In [36]:
array_cols = [
    'Permitting Action', 
    'Permit Type',
    'Statute',
]

In [37]:
special_chars = air.loc[:, array_cols].copy()
for col in array_cols:
    special_chars.loc[:, col] = special_chars.loc[:, col].str.replace(r'\w|\s|,', '', regex=True)

In [38]:
# no other delimiters present
special_chars.loc[special_chars.fillna('').ne('').any(axis=1),:]

Unnamed: 0,Permitting Action,Permit Type,Statute


#### Date Modified

In [29]:
# to_datetime works on all values present
timestamps = pd.to_datetime(air['modified_on'], errors='raise')
timestamps.dtypes, timestamps.isna().sum()

(dtype('<M8[ns]'), 0)

### Range Validation
Check project ID and date modified

#### Project ID

In [43]:
proj_ids = air['Project (ID)'].str.split(',', expand=True)
for col in proj_ids.columns:
    proj_ids.loc[:, col] = pd.to_numeric(proj_ids.loc[:, col], errors='raise')

proj_ids.head()

Unnamed: 0,0,1
0,2723.0,
1,2723.0,
2,2728.0,
3,2727.0,
4,2875.0,


In [44]:
# they all look in the same range
proj_ids.describe()

Unnamed: 0,0,1
count,821.0,3.0
mean,3213.851401,3936.666667
std,555.05112,895.577095
min,2723.0,2930.0
25%,2861.0,3582.5
50%,2998.0,4235.0
75%,3148.0,4440.0
max,4880.0,4645.0


#### Date Modified
The range looks fine: all timestamps fall between May 2021 and March 2022.

In [45]:
pd.to_datetime(air['modified_on']).describe()

count                            836
unique                           657
top       2021-05-21 15:13:50.395199
freq                             180
first     2021-05-21 15:13:50.395199
last      2022-03-22 14:29:27.480232
Name: modified_on, dtype: object

### Uniqueness Validation
Check the `id` field

In [46]:
air['id'].duplicated().sum()

0

### Set Membership Validation
* statute (1:m as arrays)
* permit type (1:m as arrays)
* permitting action (1:m as arrays)
* permit status

#### Statute

In [47]:
air['Statute'].value_counts()

Clean Air Act                        813
Clean Air Act, Natural Gas Act        18
Clean Air Act, Deepwater Port Act      1
Name: Statute, dtype: int64

In [51]:
# split and combine value counts
air['Statute'].str.split(',', expand=True).stack().str.strip().value_counts()

Clean Air Act         832
Natural Gas Act        18
Deepwater Port Act      1
dtype: int64

#### Permit Type

In [48]:
air['Permit Type'].value_counts()

Major                                                                      379
Air Construction Major                                                     216
True Minor                                                                 170
Synthetic Minor                                                             33
Air Construction Major, Certificate of Public Convenience and Necessity     20
Air Construction Minor                                                       9
Air Construction Major, Deepwater Port License                               1
Name: Permit Type, dtype: int64

In [50]:
# split and combine value counts
air['Permit Type'].str.split(',', expand=True).stack().str.strip().value_counts()

Major                                              379
Air Construction Major                             237
True Minor                                         170
Synthetic Minor                                     33
Certificate of Public Convenience and Necessity     20
Air Construction Minor                               9
Deepwater Port License                               1
dtype: int64

#### Permitting Action
Many 1:m combinations appear, but they reduce to only six distinct values once split.

In [49]:
air['Permitting Action'].value_counts()

Initial                                   328
Major Modification                        181
Minor Modification                         52
Extension                                   7
Administrative Amendment                    4
Minor Modification, Renewal                 4
Major Modification, Initial                 2
Renewal, Extension                          2
Initial, Renewal                            1
Major Modification, Renewal                 1
Initial, Major Modification                 1
Initial, Minor Modification                 1
Extension, Renewal, Major Modification      1
Extension, Renewal                          1
Renewal                                     1
Extension, Initial                          1
Name: Permitting Action, dtype: int64

In [53]:
# split and combine value counts
air['Permitting Action'].str.split(',', expand=True).stack().str.strip().value_counts()

Initial                     334
Major Modification          186
Minor Modification           57
Extension                    12
Renewal                      11
Administrative Amendment      4
dtype: int64

#### Permit Status
Will combine at least the two `withdrawn` categories, maybe even all of `expired`, `withdrawn`, `void`, `denied` into a single "Nope" category.

In [52]:
air['Permit Status'].value_counts()

Final Issued                                      731
Application Pending                                29
Draft Issued                                       28
Expired                                            20
Withdrawn (UARG v. EPA 134 S. Ct. 2427 (2014))     12
Withdrawn                                           8
Void                                                4
Denied                                              1
Name: Permit Status, dtype: int64
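A rough sketch of the two consolidation options mentioned above, assuming exact string matches on the status labels. The `status` series is a hypothetical sample, and `"Nope"` is only the placeholder name from the note, not a final category label.

```python
import pandas as pd

# Hypothetical sample covering the non-active status labels seen above
status = pd.Series([
    "Final Issued",
    "Draft Issued",
    "Expired",
    "Withdrawn (UARG v. EPA 134 S. Ct. 2427 (2014))",
    "Withdrawn",
    "Void",
    "Denied",
])

# Option 1: only merge the two withdrawn variants (exact match, not regex)
merge_withdrawn = {"Withdrawn (UARG v. EPA 134 S. Ct. 2427 (2014))": "Withdrawn"}
minimal = status.replace(merge_withdrawn)

# Option 2: additionally collapse every dead-end status into one category
inactive = {"Expired", "Withdrawn", "Void", "Denied"}
coarse = minimal.where(~minimal.isin(inactive), "Nope")
```

Option 1 keeps the reason a permit ended; option 2 is simpler if only active permits matter downstream.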

### Type Validation
Only the project ID and date modified fields will change type after transformation
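For illustration, the two type changes might look like the sketch below. The `toy` values are hypothetical, and the nullable `Int64` dtype is my assumption about a reasonable target type, not a decision recorded in this notebook.

```python
import pandas as pd

# Hypothetical miniature: string-encoded project IDs and ISO timestamps
toy = pd.DataFrame({
    "Project (ID)": ["2723", "2930, 4645", None],
    "modified_on": [
        "2022-02-09T16:26:23.548159",
        "2022-03-09T17:46:37.380210",
        "2021-07-06T12:27:00.231068",
    ],
})

# Timestamps: string -> datetime64, failing loudly on any unparseable value
toy["modified_on"] = pd.to_datetime(toy["modified_on"], errors="raise")

# Project IDs: split the array strings, one ID per row, then cast to integer
project_ids = toy["Project (ID)"].str.split(",").explode().str.strip()
project_ids = pd.to_numeric(project_ids, errors="raise").astype("Int64")
```

The nullable integer type keeps missing project IDs as `<NA>` instead of forcing the whole column to float.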

### Cross-Field Validation
None really needed. I could check that the date columns are in a logical order (application < draft issued < last comment date < final issued < construction deadline) but I'm not planning to really use those columns. So I skipped it.
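If the date ordering ever does need checking, one possible sketch is below. It was not run against the real data. The first toy row reuses dates from the Alaska LNG record above; the second is a deliberately broken hypothetical. The check allows equal dates, which is slightly looser than the strict ordering described.

```python
import pandas as pd

# Date columns in their expected chronological order
date_cols = [
    "Application Date",
    "Draft Permit Issuance Date",
    "Last Day to Comment",
    "Final Permit Issuance Date",
    "Deadline to Begin Construction",
]

toy = pd.DataFrame(
    [
        ["2018-05-01", "2020-09-11", "2020-12-10", None, None],
        ["2021-08-04", None, None, "2021-01-30", None],  # final before application
    ],
    columns=date_cols,
)

# Parse every column; unparseable or missing values become NaT
dates = toy[date_cols].apply(pd.to_datetime, errors="coerce")

# Per row: ignore missing dates, then require the rest to be non-decreasing
in_order = dates.apply(lambda row: row.dropna().is_monotonic_increasing, axis=1)
```

Dropping missing dates per row means a permit with only an application date and a final date is still compared across the gap.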