In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
path = Path('/app/data/raw/2022.03.22OGW.xlsx')
assert path.exists()

In [3]:
# eip = dbcp.extract.eip_infrastructure.extract(path)
# hardcode the extract function so this notebook can be easily rerun in the future without maintenance
sheets_to_read = [
    'Facility',
    'Company',
    'Project',
    'Air Construction',  # permit status is key to identifying actionable projects
    'Pipelines',
]
eip = pd.read_excel(path, sheet_name=sheets_to_read)
rename_dict = {
    'Facility': 'eip_facilities',
    'Company': 'eip_companies',
    'Project': 'eip_projects',
    'Air Construction': 'eip_air_constr_permits',
    'Pipelines': 'eip_pipelines',
}
eip = {rename_dict[key]: df for key, df in eip.items()}

In [4]:
eip.keys()

dict_keys(['eip_facilities', 'eip_companies', 'eip_projects', 'eip_air_constr_permits', 'eip_pipelines'])

In [5]:
{k: df.shape for k, df in eip.items()}

{'eip_facilities': (563, 59),
 'eip_companies': (439, 16),
 'eip_projects': (672, 47),
 'eip_air_constr_permits': (836, 26),
 'eip_pipelines': (176, 64)}

In [6]:
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [7]:
fac = eip['eip_facilities']
cos = eip['eip_companies']
proj = eip['eip_projects']
air = eip['eip_air_constr_permits']
pipe = eip['eip_pipelines']

# Outline of work
Two parts: data cleaning and data normalization/structuring
# Structuring and Normalizing
**5 entities and 5 many:many relationships means 10 tables...** But cutting both pipelines and companies cuts the total in half, to 5 tables.

The *only* purpose of bringing in the companies table is to add one column with ownership info. But the marginal cost is 3 tables (2 if cutting pipelines), or 30% of tables. I'll confirm with DBCP that this is OK.

Pipelines are approved at the federal level so I'm not sure they are actionable for Down Ballot people. They also have only very coarse location information (state). We punted on them last time so I would like to do so again. Marginal cost is also 3 tables, or 2 additional if cutting companies.

## Entity Relationships
### Entities
* facilities
* companies
* projects
* permits (air construction permits only; there are many other permit types that I didn't integrate)
* pipelines

### Relationships
many : many
* facilities : companies
* facilities : projects
* facilities : pipelines
* companies : pipelines
* projects : permits

one : many
* none

one : one
* none

no direct relationship
* facilities : permits (air construction permits are mediated through projects. Other permits not considered here do have direct relationships)
* companies : projects (mediated through facilities)
* companies : permits (mediated through projects then through facilities)
* projects : pipelines (mediated through facilities)
* permits : pipelines (mediated through projects then through facilities)
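
The table counts quoted under "Structuring and Normalizing" can be tallied from the two lists above. This is only a counting sketch, assuming one table per entity plus one association table per many:many relationship; `count_tables` is a name I made up for illustration.

```python
# Entities and many:many relationships, copied from the lists above.
entities = {"facilities", "companies", "projects", "permits", "pipelines"}
many_to_many = [
    ("facilities", "companies"),
    ("facilities", "projects"),
    ("facilities", "pipelines"),
    ("companies", "pipelines"),
    ("projects", "permits"),
]


def count_tables(dropped=()):
    """Count entity tables plus association tables left after dropping entities."""
    dropped = set(dropped)
    kept_entities = entities - dropped
    # An association table only survives if both of its entities survive.
    kept_links = [pair for pair in many_to_many if not dropped & set(pair)]
    return len(kept_entities) + len(kept_links)


everything = count_tables()
without_companies = count_tables({"companies"})
without_pipelines = count_tables({"pipelines"})
without_both = count_tables({"companies", "pipelines"})
print(everything, without_companies, without_pipelines, without_both)
```

Dropping either entity alone removes 3 tables, and dropping the second one afterwards removes only 2 more, because the companies : pipelines association table is shared between them.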

# Cleaning
Need to clean facilities, projects, and permits via the usual checklist. But I can ignore many unnecessary columns and prefix them with 'raw_' to discourage use.
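
A minimal sketch of the 'raw_' prefixing idea. The `keep` set is hypothetical (I haven't settled on which columns get cleaned); the column names are a handful taken from the facilities sheet.

```python
# A few column names from the facilities sheet, for illustration only.
columns = [
    "id",
    "name",
    "State",
    "County or Parish",
    "Location",
    "Featured",
    "Research Notes",
]

# Hypothetical choice of columns that will actually be cleaned.
keep = {"id", "name", "State", "County or Parish", "Location"}

# Everything else gets a 'raw_' prefix to flag it as unvetted.
rename_map = {col: f"raw_{col}" for col in columns if col not in keep}
print(rename_map)

# With the real table this would be applied as:
# fac = fac.rename(columns=rename_map)
```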
## Facilities Cleaning
- [x] Accuracy
- [x] Atomicity
- [ ] Consistency
- [x] Completeness
- [ ] Uniformity
- [ ] Validity
    - [ ] Range Validation
    - [ ] Uniqueness Validation
    - [ ] Set Membership Validation
    - [ ] Type Validation
    - [ ] Cross-Field Validation

### Accuracy
I'm mostly using this table for location information, so I'll focus on the "street address" and "coordinates" columns. I don't have "golden data" to compare against, but I can at least spot check some items by googling them. \[Update: 3/3 spot checks of location are good. Obviously this is far from comprehensive but gives a small measure of confidence.]

In [8]:
fac.sample(3, random_state=42)

Unnamed: 0,id,name,created_by,created_on,modified_by,modified_on,private,CCS/CCUS,Company (ID),Company,Project (ID),Project,State,Completeness Review Notes,Final Review,Facility Alias,Facility Description,Initial Review,State Facility ID Number(s),Previous Facility Name(s),Sector,Primary NAICS Code,Primary SIC Code,Street Address,City,ZIP Code,County or Parish,Research Notes,Associated Facilities (ID),Associated Facilities,Pipelines (ID),Pipelines,Air Operating (ID),Air Operating,CWA-NPDES (ID),CWA-NPDES,CWA Wetland (ID),CWA Wetland,Other Permits (ID),Other Permits,Congressional Representatives,Link to EJSCREEN Report,Estimated Population within 3 miles,Percent People of Color within 3 miles,Percent Low-Income within 3 miles,Percent under 5 Years Old within 3 miles,Percent People over 64 Years Old within 3 miles,Air Toxics Cancer Risk (NATA Cancer Risk),Respiratory Hazard Index,PM2.5 (ug/m3),O3 (ppb),Wastewater Discharge Indicator,Location,Facility Footprint,EPA FRS ID,Featured,Featured Facility Descriptors (ID),Featured Facility Descriptors,Facility ID
250,996,Oak Grove Gas Plant,EIP Test Account,2021-05-20T19:13:45.411472,Alexandra Shaykevich,2022-03-07T15:31:46.717963,False,,2716,"Williams Ohio Valley Midstream, LLC[2716]",3003,Oak Grove Gas Plant - Initial Construction[3003],WV,QAQC Flag - 3 permits marked private - need to...,,,,,051-00157,,Natural Gas,,,5258 Fork Ridge,Moundsville,26041,Marshall,,,,,,,,,,,,,,"David B. McKinley, Republican",https://ejscreen.epa.gov/mapper/EJSCREEN_repor...,3282.0,5.0,46.0,10.0,14.0,25.0,0.32,8.33,43.1,0.0068,"-80.6959, 39.8758",,,False,,,10386.0
521,4478,MarkWest Houston Complex,Alexandra Shaykevich,2021-11-07T23:16:48.953992,Courtney Bernhardt,2021-11-18T14:23:26.340753,False,,2592,"MarkWest Liberty Midstream & Resources, LLC[2592]",4477,Houston Gas Plant No. 4 and De-ethanizer Proje...,PA,,,,,,707367,,Petrochemicals and Plastics,211112,,800 Western Ave,Washington,15301,Washington,https://www.ahs.dep.pa.gov/eFACTSWeb/searchRes...,1006.0,Shell Chemical Appalachia Petrochemicals Compl...,3691.0,Appalachian to Texas Express (ATEX) Expansion[...,,,,,,,,,,,8642.0,9.0,25.0,5.0,24.0,29.0,0.39,9.42,45.5,0.025,"-80.254988, 40.262146",,,,,,
268,1015,Formosa Point Comfort Plant,EIP Test Account,2021-05-20T19:13:45.411472,Alexandra Shaykevich,2022-01-12T21:52:21.032416,False,,2545,Formosa Plastics Corporation[2545],"3028, 3029, 3030, 3573","Point Comfort EDC/VCM Reactor[3028], Point Com...",TX,FOIA pending for retroactive PSDs. PN publishe...,CB,,The Formosa Point Comfort Plant is an existing...,,RN100218973,,Petrochemicals and Plastics,"325110, 325181","2812, 2821",201 Formosa Dr,Point Comfort,77978,Calhoun,https://www.fpcusa.com/about-formosa/our-opera...,,,,,,,,,,,4248.0,Comptroller Application No. 1537[4248],"Michael Cloud, Republican",https://ejscreen.epa.gov/mapper/EJSCREEN_repor...,768.0,43.0,39.0,10.0,8.0,24.0,0.28,8.47,32.6,5e-06,"-96.54722, 28.68889",,,False,,,10169.0


Googling "Oak Grove Gas Plant" turns up the facility. [Street address](https://www.google.com/maps/place/Williams+Natural+Gas+Oak+Grove+Facility/@39.871189,-80.6944623,1177m/data=!3m1!1e3!4m13!1m7!3m6!1s0x0:0x6769abd010d373f9!2zMznCsDUyJzMyLjkiTiA4MMKwNDEnNDUuMiJX!3b1!8m2!3d39.8758!4d-80.6959!3m4!1s0x8835e69402fb74cd:0x94b44b7720f51c5!8m2!3d39.8690544!4d-80.693195) and coordinates match. Owner also matches.

MarkWest Houston Complex location is also good. Google maps labels the [corporate office](https://www.google.com/maps/place/MarkWest+Houston+Plant/@40.262237,-80.2596898,1240m/data=!3m1!1e3!4m13!1m7!3m6!1s0x8834528cbcacb571:0xbd8b49797f3fdd4!2s800+Western+Ave,+Washington,+PA+15301!3b1!8m2!3d40.2584361!4d-80.2555021!3m4!1s0x8834539d500f0e45:0x248d758337e3de37!8m2!3d40.2585062!4d-80.254957) as across the street from the given address, which belongs to a different facility building. But that doesn't matter for our purposes -- we aren't sending them a letter. Owner also matches.

Formosa Point Comfort plant street address matches [google maps](https://www.google.com/maps/place/Formosa+Plastics+Corporation,+Texas/@28.6804226,-96.5626898,13964m/data=!3m1!1e3!4m5!3m4!1s0x0:0x469e4fbb5f6d12a1!8m2!3d28.6975144!4d-96.5449333) and coordinates are inside the facility. Owner also matches.
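
To put a number on "coordinates match", a great-circle distance between the recorded point and a reference point is enough. The sketch below uses the haversine formula on the Oak Grove record. The reference point is my reading of the place pin embedded in the Google Maps link above, so treat it as an assumption; `haversine_km` is an illustrative helper, not part of the pipeline.

```python
import math


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Recorded "Location" for Oak Grove Gas Plant (longitude first, then latitude).
recorded = (-80.6959, 39.8758)
# Assumed reference: the place pin in the Google Maps URL (my interpretation).
reference = (-80.693195, 39.8690544)

distance_km = haversine_km(*recorded, *reference)
print(f"{distance_km:.2f} km apart")
```

A gap of under a kilometer is within the footprint of a large gas plant, which is good enough for county-level work.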

### Atomicity
By inspection I see that all the ID fields and their associated name fields can contain multiple values: company, project, pipelines, and permits. The location fields are mercifully single-valued.
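
Fixing atomicity for these fields eventually means one row per (facility, related ID) pair. A rough sketch on two rows from the sample above, assuming the IDs arrive as comma-delimited strings; the `facility_id` and `project_id` names are placeholders I picked, not final schema.

```python
import pandas as pd

# Two facilities from the sample, with their 'Project (ID)' values as strings.
toy = pd.DataFrame(
    {
        "id": [996, 1015],
        "Project (ID)": ["3003", "3028, 3029, 3030, 3573"],
    }
)

links = (
    toy.rename(columns={"id": "facility_id"})
    # Split the delimited string into a list, then emit one row per list item.
    .assign(project_id=lambda df: df["Project (ID)"].str.split(","))
    .explode("project_id")
    # Strip the space left after each comma and convert to integers.
    .assign(project_id=lambda df: df["project_id"].str.strip().astype(int))
    .loc[:, ["facility_id", "project_id"]]
    .reset_index(drop=True)
)
print(links)
```

The result has the shape of a facilities : projects association table.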

In [9]:
# Street address does not look multi-valued but has other problems. Thankfully lat/lon is still available.
# A little more digging suggests the bad addresses belong to facilities that have not been built yet.
# Can't check for sure until I can join project status onto facilities.
pd.options.display.max_colwidth = 0
fac.loc[fac['Street Address'].str.len().nlargest(10).index, ['id', 'name', 'Street Address', 'Location']]

Unnamed: 0,id,name,Street Address,Location
461,3839,Rio Bravo Compressor Station 2,From intersection of I69E N and US 77 N turn left onto Unnamed Rd. Go 1.5 mi site on R,"-97.786294, 26.609886"
11,750,Annova LNG Brownsville,USFWS Access Road (left from intersection of Boca Chica Blvd and Kingston Ave),"-97.2675, 26.00556"
275,1022,Praxair Clear Lake Plant,NW corner of Celanese Industrial Complex (N end of Bayport Blvd),"-95.066606, 29.625159"
398,3739,Shintech Addis Plant Expansion,"1 S (Southern portion of Addis Plant property), Addis, LA 70710","-91.260466, 30.322319"
357,1105,Turkey Creek Compressor Station,W on Onyx Rd (towards the intersection of Johnsons Landing Rd),"-92.424444, 30.939722"
284,1031,El Paso Natural Gas - Red Mountain Compressor Station,1.4 miles on Co Rd D0006 from the intersection with NM-418,"-107.998849, 32.257081"
522,4480,Lone Star Alkylate Production Facility,Approx. 1.8 miles SW from FM 1942 and Hatcherville Rd,"-94.923882, 29.84787"
544,4676,Chickahominy Power Generating Station,State Rd. 106 (Roxbury Rd.) and Chambers/Landfill Rd.,"-77.155, 37.436"
89,829,Corpus Christi Polymer & Desalination Plant,7001 Joe Fulton International Trade Corridor STE 200,"-97.49595, 27.834238"
375,1124,Willcox and Dragoon Compressor Stations,Arzberger Rd (6 miles E of Kansas Settlement Rd),"-109.662345, 32.109089"


In [10]:
# location is not multi-valued - only see two decimal points per coordinate pair
fac['Location'].str.count(r'\.').agg(['min', 'max'])  # raw string avoids an invalid-escape warning

min    2.0
max    2.0
Name: Location, dtype: float64
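
Since each `Location` holds exactly one pair, it can be split into two numbers. A minimal sketch, assuming the order is longitude then latitude (the sampled rows are all negative-first, which fits US sites). `parse_location` is a hypothetical helper, and the bounds are just the global valid ranges.

```python
def parse_location(text):
    """Split a 'lon, lat' string into floats and check global coordinate bounds."""
    lon_text, lat_text = text.split(",")
    lon, lat = float(lon_text), float(lat_text)  # float() tolerates surrounding spaces
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    return lon, lat


lon, lat = parse_location("-80.6959, 39.8758")
print(lon, lat)

# A swapped pair is caught whenever the longitude magnitude exceeds 90.
try:
    parse_location("28.68889, -96.54722")
    swapped_accepted = True
except ValueError:
    swapped_accepted = False
print(swapped_accepted)
```

The bounds check only catches swaps when the longitude magnitude is above 90, so it is a weak range validation, not a full one.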

In [11]:
# 179 of 563 facility IDs are missing, but none are multi-valued (the column is numeric)
fac['Facility ID'].describe()

count    384.000000  
mean     10195.119792
std      113.430883  
min      10000.000000
25%      10096.750000
50%      10194.500000
75%      10293.250000
max      10393.000000
Name: Facility ID, dtype: float64

### Completeness
Notable missing values and lack of missing values:
* 93/563 (16.5%) missing street address. Plus some addresses are not missing but look unusable.
* 4/563 (0.7%) of facilities are missing linked Project IDs
* 9/563 (1.6%) missing "Location" (coordinates)
* 3/563 (0.5%) missing county (none missing state). But the true test is how successful `addfips` is with these pairs.
* 60 to 100 (10% to 18%) missing EJ Screen metrics, depending on which metric

I don't know what `Facility ID` is (vs `id` of this facility table), but 179/563 (31.8%) rows are missing `Facility ID`. They have different numerical ranges and I see that the companies and project tables thankfully use the `id` numbers, which are 100% complete.

Based on these nan counts, I should first try `addfips` on state/county pairs. If too many fail, the most complete option is to geocode via coordinates.
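
The counts and percentages above come from comparing `count()` against the row total. A sketch of the same tally in one step, on a toy frame with made-up values and made-up missingness (not rows from the real table):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the facilities table; values are invented for illustration.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "County or Parish": ["A", "B", "C", np.nan],
        "Facility ID": [10001.0, np.nan, 10002.0, np.nan],
    }
)

n_missing = toy.isna().sum()
summary = pd.DataFrame(
    {
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(toy)).round(1),
    }
).sort_values("n_missing", ascending=False)
print(summary)
```

Swapping `toy` for `fac` should reproduce the figures listed above in a single table.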

In [21]:
len(fac)

563

In [12]:
fac.count().T

id                                                 563
name                                               563
created_by                                         563
created_on                                         563
modified_by                                        563
modified_on                                        563
private                                            563
CCS/CCUS                                           17 
Company (ID)                                       563
Company                                            563
Project (ID)                                       559
Project                                            559
State                                              563
Completeness Review Notes                          44 
Final Review                                       113
Facility Alias                                     172
Facility Description                               352
Initial Review                                     56 
State Faci

### Consistency - defer
Defer until I've cleaned the related datasets
### Uniformity
Important columns to check consistent representation:
* coordinates
* ID fields (check consistent array delimiters)

Secondary importance:
* street address (this is a luxury field)
* modified_on

#### Coordinates

In [13]:
# "-XX.X, YY.Y" with 2 or 3 digits before the decimal and 2 to 7 digits after.
# Plus optional leading/trailing whitespace.
coord_pattern = r'\s*-\d{2,3}\.\d{2,7}, \d{2,3}\.\d{2,7}\s*'
fac['Location'].str.match(coord_pattern).sum()

554

In [14]:
# matches count, so they all have the same formatting
fac['Location'].count()

554
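
One caveat on this check: `Series.str.match` only anchors at the start of the string, so the trailing `\s*` in the pattern is not actually enforced and anything could follow a valid pair. The sketch below shows the difference with plain `re`, using a made-up bad value. pandas has a matching `Series.str.fullmatch` (pandas 1.1 and later) that could be swapped in for a stricter count.

```python
import re

coord_pattern = r'\s*-\d{2,3}\.\d{2,7}, \d{2,3}\.\d{2,7}\s*'

good = "-80.6959, 39.8758"
# Hypothetical malformed value: a valid pair followed by extra text.
trailing_junk = "-80.6959, 39.8758; -81.0, 40.0"

# re.match anchors at the start only, like Series.str.match.
prefix_good = re.match(coord_pattern, good) is not None
prefix_junk = re.match(coord_pattern, trailing_junk) is not None

# re.fullmatch requires the whole string to fit the pattern.
full_good = re.fullmatch(coord_pattern, good) is not None
full_junk = re.fullmatch(coord_pattern, trailing_junk) is not None

print(prefix_good, prefix_junk, full_good, full_junk)
```

The earlier decimal-point count (exactly two per value) already rules out a second pair, so this mostly guards against other trailing junk.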

In [15]:
# tighten criteria to 3+ digits after decimal
# Reveals that only 2 facilities have poor precision (plus or minus about a km)
coord_pattern = r'\s*-\d{2,3}\.\d{3,7}, \d{2,3}\.\d{3,7}\s*'
fac['Location'].str.match(coord_pattern).sum()

552
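
For a sense of what the digit counts mean on the ground: one degree of latitude is roughly 111 km, so each extra decimal place shrinks the grid step tenfold. A back-of-envelope sketch; the helper names are mine, and longitude steps are smaller than this by a factor of cos(latitude).

```python
KM_PER_DEGREE_LAT = 111.0  # rough figure; varies slightly with latitude


def decimal_places(number_text):
    """Count the digits after the decimal point in a coordinate string."""
    _, _, fraction = number_text.strip().partition(".")
    return len(fraction)


def approx_step_km(places):
    """Approximate north-south size of the last reported digit, in km."""
    return KM_PER_DEGREE_LAT * 10 ** -places


# Latitude of Chickahominy Power Generating Station from the table above.
places = decimal_places("37.436")
print(places, round(approx_step_km(places), 3))

# Two decimal places give a step of about 1.1 km.
print(round(approx_step_km(2), 2))
```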

#### ID Fields
Want to check for consistent array delimiters.

In [16]:
# exclude ID cols with numeric types (no arrays present)
id_cols = [col for col in fac.columns if '(ID)' in col and pd.api.types.is_object_dtype(fac[col])]
id_cols

['Company (ID)',
 'Project (ID)',
 'Associated Facilities (ID)',
 'Pipelines (ID)',
 'Air Operating (ID)',
 'CWA-NPDES (ID)',
 'Other Permits (ID)']

In [17]:
# opening pattern, optional delimiter, optional repeating pattern, optional closing pattern, end of line
array_pattern = r'(?:\d{3,5})(?:, ?)?(?:\d{3,5}, ?)*(?:\d{3,5})?$'

In [18]:
test_case = pd.Series([
    '1234',
    '1234,567',
    '1234, 567',
    '12345, 678, 9012',
    '1234\t5678', # tab is bad, no comma
    '12, 3456', # too short
    '1234    5678', # too many spaces, no comma
])
pd.concat([test_case, test_case.str.match(array_pattern)], axis=1)

Unnamed: 0,0,1
0,1234,True
1,"1234,567",True
2,"1234, 567",True
3,"12345, 678, 9012",True
4,1234\t5678,False
5,"12, 3456",False
6,1234    5678,False


In [19]:
# all pass the formatting test
for col in id_cols:
    assert fac[col].str.match(array_pattern).all()

#### Date Modified

In [20]:
# to_datetime works on all values present
timestamps = pd.to_datetime(fac['modified_on'])
timestamps.dtypes, timestamps.isna().sum()

(dtype('<M8[ns]'), 0)

#### Street Address - defer
Hard to test, and I don't care that much if it's wrong. The best way to test is probably to outsource to a pre-built geocoder.