## Purpose
**This notebook is a work in progress.** It attempts to link the customers from one quarter of EQR data to [FERC CIDs](https://www.ferc.gov/company-registration).

In [138]:
# from pudl_rmi.process_eqr import engine, DATE_COLUMNS

import pandas as pd
import sqlalchemy as sa

## EQR Sales

In [7]:
engine = sa.create_engine("sqlite:////Users/bendnorman/catalyst/rmi-ferc1-eia/outputs/eqr.db")

with engine.connect() as conn:
    contracts = pd.read_sql("select * from contracts", conn)
    identities = pd.read_sql("select * from identities", conn)

In [8]:
contracts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 778269 entries, 0 to 778268
Data columns (total 33 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   contract_unique_id                     778269 non-null  object 
 1   seller_company_name                    778269 non-null  object 
 2   seller_history_name                    0 non-null       object 
 3   customer_company_name                  778269 non-null  object 
 4   contract_affiliate                     778269 non-null  object 
 5   ferc_tariff_reference                  778241 non-null  object 
 6   contract_service_agreement_id          778189 non-null  object 
 7   contract_execution_date                778269 non-null  object 
 8   commencement_date_of_contract_term     778269 non-null  object 
 9   contract_termination_date              260131 non-null  object 
 10  actual_termination_date                15911 non-null   

In [9]:
identities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22998 entries, 0 to 22997
Data columns (total 16 columns):
 #   Column                                           Non-Null Count  Dtype 
---  ------                                           --------------  ----- 
 0   filer_unique_id                                  22998 non-null  object
 1   company_name                                     22998 non-null  object
 2   company_identifier                               22998 non-null  object
 3   contact_name                                     22998 non-null  object
 4   contact_title                                    22998 non-null  object
 5   contact_address                                  22998 non-null  object
 6   contact_city                                     22998 non-null  object
 7   contact_state                                    22998 non-null  object
 8   contact_zip                                      22998 non-null  object
 9   contact_country_name                   

In [10]:
identities = identities.query("quarter == 'Q1'")
contracts = contracts.query("quarter == 'Q1'")

## Explore FERC CID dataset

In [19]:
cids = pd.read_excel("/Users/bendnorman/catalyst/rmi-ferc1-eia/inputs/eqr_data/FERC-CID-Listing-9-1-2022.xlsx", header=2)
cids.columns = [col.lower().replace(" ", "_") for col in cids.columns]

In [23]:
assert cids.cid.is_unique
assert cids.organization_name.is_unique

AssertionError: 

In [34]:
cids.duplicated(keep=False).value_counts()

False    5026
dtype: int64

In [40]:
non_cid_fields = cids.columns.to_list()
non_cid_fields.remove("cid")
cids.duplicated(subset=non_cid_fields, keep=False).value_counts()

False    5010
True       16
dtype: int64

There are 16 records that duplicate another record in every field except CID. This means a few company names have multiple CIDs. We'll probably be linking EQR buyers on company name, so there is the possibility that a buyer links to multiple CIDs. To keep it simple for now, I'm going to drop duplicate organization names.

In [41]:
cids = cids.drop_duplicates(subset=["organization_name"])
assert cids.cid.is_unique
assert cids.organization_name.is_unique
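
Dropping duplicates throws away the extra CIDs. As a possible alternative, here is a minimal sketch that keeps every CID for a name as a list. The frame `toy_cids` and its names and CIDs are made up for illustration; they are not values from the FERC listing.

```python
import pandas as pd

# Toy stand-in for the CID listing; names and CIDs here are made up.
toy_cids = pd.DataFrame(
    {
        "organization_name": ["acme power llc", "acme power llc", "beta energy inc"],
        "cid": ["C900001", "C900002", "C900003"],
    }
)

# One row per organization name, with all of its CIDs kept as a list.
cids_by_name = (
    toy_cids.groupby("organization_name")["cid"]
    .agg(list)
    .reset_index(name="cids")
)
print(cids_by_name)
```

The organization name stays unique, so a later `validate="1:1"` merge on name would still pass, and ambiguous names can be spotted by the length of the `cids` list.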

In [113]:
cids.region.value_counts()

Western      3410
Pipelines     780
Eastern       531
Central       255
Name: region, dtype: int64

In [114]:
cids.program.value_counts()

FPA (Market Based Rate) Public Utilities                                     3100
FPA (Traditional Cost of Service and Market Based Rates) Public Utilities    1083
ICA Oil Pipelines                                                             395
NGA Gas Pipelines                                                             248
NGPA 311 and NGA Hinshaw Gas Pipelines                                        141
Power Administrations                                                           9
Name: program, dtype: int64

## Try to link CIDs to buyers

In [87]:
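def clean_company_names(s):
    # Normalize names: lowercase, trim whitespace, drop periods and commas.
    s = s.str.lower()
    s = s.str.strip()
    # regex=False treats "." as a literal period rather than a regex wildcard.
    s = s.str.replace(".", "", regex=False)
    s = s.str.replace(",", "", regex=False)
    return s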

s = pd.Series(["ben llc.", "ben, llc"])
clean_company_names(s)


0    ben llc
1    ben llc
dtype: object

In [112]:
eqr_customers = clean_company_names(contracts["customer_company_name"])
eqr_customers = eqr_customers.drop_duplicates().to_frame(name="customer_company_name")

cids["organization_name"] = clean_company_names(cids["organization_name"])
cids = cids.drop_duplicates(subset=["organization_name"])
assert cids.cid.is_unique
assert cids.organization_name.is_unique

print(cids.organization_name.nunique())
print(eqr_customers.customer_company_name.nunique())

4976
11846



There are far more unique EQR customer names (11,846) than FERC CID organization names (4,976).

In [91]:
merged_customers = eqr_customers.merge(cids, how="left", left_on="customer_company_name", right_on="organization_name", validate="1:1")

In [92]:
merged_customers.cid.isna().value_counts()

True     9429
False    2417
Name: cid, dtype: int64
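
Only 2,417 of the 11,846 cleaned customer names (about 20%) matched a CID exactly. A minimal sketch of how the match rate could be tracked directly from the merge, using pandas' `indicator=True` option. The frames `toy_customers` and `toy_cids` and their contents are made up for illustration.

```python
import pandas as pd

# Toy frames; the names and CID below are made up for illustration.
toy_customers = pd.DataFrame(
    {"customer_company_name": ["acme power llc", "city of example", "beta energy inc"]}
)
toy_cids = pd.DataFrame(
    {"organization_name": ["acme power llc"], "cid": ["C900001"]}
)

toy_merged = toy_customers.merge(
    toy_cids,
    how="left",
    left_on="customer_company_name",
    right_on="organization_name",
    validate="1:1",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Share of customer names that found an exact CID match.
match_rate = (toy_merged["_merge"] == "both").mean()
print(f"{match_rate:.1%} of customer names matched a CID")
```

Recomputing this number after each change to the name cleaning gives a quick way to see whether the change helped.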

In [96]:
merged_customers.sample(10)

Unnamed: 0,customer_company_name,organization_name,cid,address,program,region
10525,village of greenwich,,,,,
10915,pacific gas and electric company on behalf of ...,,,,,
3685,city of glencoe,,,,,
8737,r:eweb/a:corp,,,,,
7323,wildhorse wind energy llc,wildhorse wind energy llc,C010276,"30 Ivan Allen Jr. Blvd., N.W., Atlanta, GA 30308",FPA (Market Based Rate) Public Utilities,Western
7183,city of grand island utilities department,,,,,
5063,merrill lynch commodities canada ulc,,,,,
8649,tdy industries inc (wah chang),,,,,
795,midwest energy inc,midwest energy inc,C011423,"1330 CANTERBURY DR, Hays, KS 67601",FPA (Traditional Cost of Service and Market Ba...,Central
10148,mt jackson solar i llc,,,,,
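
Some unmatched names in the sample contain phrases like "on behalf of", which suggests the name holds two parties. A minimal sketch of splitting such names, assuming the phrase list below covers the common cases (it is a guess based on the names seen in this notebook, not an exhaustive set). The function `split_agent` and the example names are hypothetical.

```python
import re

# Phrases that seem to separate an acting party from the party it represents.
# This list is a guess, not an exhaustive set.
AGENT_PATTERN = re.compile(r"\s+(?:acting as agent for|as agent for|on behalf of)\s+")


def split_agent(name):
    """Return (acting_party, represented_party); the second is None if no phrase is found."""
    parts = AGENT_PATTERN.split(name, maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return name, None


# Made-up example names.
print(split_agent("acme services inc acting as agent for the acme operating companies"))
print(split_agent("acme power company on behalf of beta energy inc"))
print(split_agent("city of example"))
```

Either half of the split could then be tried against the CID names separately.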


In [101]:
name = "jackson"
cids[cids.organization_name.str.contains(name)]

Unnamed: 0,organization_name,cid,address,program,region
3619,jackson generation llc,C011151,"1900 East Golf Rd. Suite 1030, Schaumburg, IL ...",FPA (Traditional Cost of Service and Market Ba...,Central
4714,jackson prairie,C011209,"239 Zandecki Road, Chehallis, WA 98532",NGA Gas Pipelines,Pipelines
4937,jackson pipeline company,C000107,"370 17th Street Suite 2500, Denver, CO 80202",NGPA 311 and NGA Hinshaw Gas Pipelines,Pipelines


**Who needs to file for a CID?**
> You must obtain a Company Identifier if your company is required by Commission regulations to submit an electronic filing for which a Company Identifier is required. If you are uncertain about whether an electronic filing you are required to submit requires a Company Identifier, you should obtain legal advice regarding FERC regulatory affairs and, if deemed necessary, consult with appropriate FERC staff.

Which forms require a CID? EQR sellers and Form 1 respondents need one. What types of entities are likely to buy energy without having to file most FERC forms? I think we need to answer this to understand how many matches we should expect.

**What types of entities would purchase energy?**

**Can we assume a seller and buyer are in the same region?**

**What is cid.program?**
> Regulatory Programs refer to the Commission’s different statutory mandates and regulatory schemes.

**How do filers input customer name information?**
It seems like it's a free-for-all :(
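
If names are entered free-form, exact matching will miss typos and small variations. A minimal sketch of generating fuzzy match candidates with the standard library's `difflib.get_close_matches`. The names are made up and the 0.8 cutoff is an arbitrary starting point, not a tuned value.

```python
from difflib import get_close_matches

# Made-up examples of cleaned CID names and free-form customer names.
cid_names = ["acme power llc", "beta energy inc", "gamma wind energy llc"]
customer_names = ["acme powr llc", "city of example"]

# For each customer name, keep the single closest CID name above the cutoff.
candidates = {
    name: get_close_matches(name, cid_names, n=1, cutoff=0.8)
    for name in customer_names
}
for name, match in candidates.items():
    print(name, "->", match or "no candidate")
```

Candidates from a fuzzy match would still need manual review before being treated as links.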

**Find region of buyer?**
To improve entity matching, we could categorize the EQR customer's Region by using the Point of Delivery Specific Location field.

In [107]:
entergy_purchases = merged_customers[merged_customers.customer_company_name.str.contains("entergy")]

In [110]:
entergy_purchases.cid.isna().value_counts()

True     18
False    16
Name: cid, dtype: int64

In [116]:
entergy_purchases

Unnamed: 0,customer_company_name,organization_name,cid,address,program,region
53,entergy arkansas inc,entergy arkansas inc,C000776,"425 West Capitol Ave, Little Rock, AR 72201",FPA (Traditional Cost of Service and Market Ba...,Central
98,entergy services inc acting as agent for the o...,,,,,
130,entergy koch trading lp,,,,,
333,entergy nuclear power marketing llc,entergy nuclear power marketing llc,C000860,"100 First Stamford Place, Stamford, CT 06902",FPA (Market Based Rate) Public Utilities,Western
364,entergy power marketing corp,,,,,
507,entergy services inc as agent for the entergy ...,,,,,
527,entergy services inc,,,,,
2181,entergy-koch trading lp,,,,,
3113,entergy louisiana llc,entergy louisiana llc,C004995,"4809 Jefferson Highway, Jefferson, LA 70121",FPA (Traditional Cost of Service and Market Ba...,Central
3471,entergy services llc,entergy services llc,C002146,"639 Loyola Avenue, New Orleans, LA 70113",FPA (Traditional Cost of Service and Market Ba...,Central
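
The table lists "entergy koch trading lp" and "entergy-koch trading lp" as separate customers, so the current cleaning leaves some duplicates behind. A minimal sketch of a stricter normalization for single strings. The function `normalize_name` is hypothetical, and treating hyphens and slashes as spaces is an assumption about what filers intend.

```python
import re


def normalize_name(name):
    """Lowercase, turn hyphens and slashes into spaces, drop other punctuation, collapse spaces."""
    name = name.lower()
    name = re.sub(r"[-/]", " ", name)
    # Note: this also strips non-ASCII letters, which may be too aggressive.
    name = re.sub(r"[^a-z0-9\s]", "", name)
    return re.sub(r"\s+", " ", name).strip()


names = ["entergy koch trading lp", "entergy-koch trading lp"]
print({normalize_name(n) for n in names})
```

It could be applied to a column with something like `contracts["customer_company_name"].map(normalize_name)`.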


## Explore customer information
EQR provides very little information about the customer. What can we use to help associate the buyer with a CID?
- Point of Delivery Specific Location: This could give us a better sense of the location. The column might be missing values though. What do the location codes even mean? Are they reliable? What field in Form 1 could we connect this to?

In [119]:
contracts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 177530 entries, 0 to 177529
Data columns (total 33 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   contract_unique_id                     177530 non-null  object 
 1   seller_company_name                    177530 non-null  object 
 2   seller_history_name                    0 non-null       object 
 3   customer_company_name                  177530 non-null  object 
 4   contract_affiliate                     177530 non-null  object 
 5   ferc_tariff_reference                  177523 non-null  object 
 6   contract_service_agreement_id          177510 non-null  object 
 7   contract_execution_date                177530 non-null  object 
 8   commencement_date_of_contract_term     177530 non-null  object 
 9   contract_termination_date              65348 non-null   object 
 10  actual_termination_date                3766 non-null    

In [129]:
print(contracts.point_of_delivery_specific_location.isna().value_counts())
# value_counts() drops missing values by default, so no explicit NaN filter is needed.
contracts.point_of_delivery_specific_location.value_counts().sample(10)

True     122157
False     55373
Name: point_of_delivery_specific_location, dtype: int64


Singing River EPA - North Lucedale 115 kV           1
Jacksonville 46 kV                                  1
Gautier                                             5
Mile-Hi 115 kV Substation, Lake County, OR         20
The ISO Grid at SCE's Victor 115kV Substation      31
Cascade - Cottonwood 115 kV Line                    2
Kantor, HillTop, Griffith, N Havasu, Blk Mesa +     6
Midway-Temblor 115 kV Transmission Line             2
BHCE/West Station                                   1
115 kV Keene Road Substation                        3
Name: point_of_delivery_specific_location, dtype: int64

Some of these locations look like substations or transmission lines, and some might connect to generators in EIA-860.
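
Several of the location strings include a voltage, which might be usable as a feature when comparing against other datasets. A minimal sketch of pulling it out with a regular expression. The helper `extract_kv` is hypothetical, and the pattern only covers the "115 kV" and "115kV" styles seen in the sample above.

```python
import re

# Matches a voltage such as "115 kV" or "115kV"; case-insensitive.
KV_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*kv\b", re.IGNORECASE)


def extract_kv(location):
    """Return the first voltage in kV mentioned in the string as a float, or None."""
    match = KV_PATTERN.search(location)
    return float(match.group(1)) if match else None


locations = [
    "Singing River EPA - North Lucedale 115 kV",
    "The ISO Grid at SCE's Victor 115kV Substation",
    "Jacksonville 46 kV",
    "Gautier",
]
voltages = [extract_kv(loc) for loc in locations]
print(voltages)
```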

In [136]:
print(contracts.point_of_delivery_balancing_authority.isna().value_counts())
print()
print(contracts.point_of_delivery_balancing_authority.value_counts().head(10))

True     111101
False     66429
Name: point_of_delivery_balancing_authority, dtype: int64

BPAT    15468
CISO    10680
ISNE     7399
PJM      4148
PGE      3031
SCL      2830
MISO     2769
SWPP     2744
PACE     2357
SOCO     2142
Name: point_of_delivery_balancing_authority, dtype: int64


Most contracts are missing the point of delivery specific location (122,157 of 177,530) and the balancing authority (111,101 of 177,530). Are there other features we could use?
- Transaction table has timezone of transaction.
- Is there an LLC encoder library we could use to create some additional features?
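
Without a known library for this, a hand-rolled version of the legal-suffix feature could look like the minimal sketch below. The suffix list is a small hand-picked assumption and the function `split_legal_suffix` is hypothetical.

```python
import re

# A small, hand-picked set of legal-form suffixes; extend as needed.
LEGAL_SUFFIXES = ("llc", "inc", "lp", "corp", "ulc", "company")
SUFFIX_PATTERN = re.compile(r"\s+(" + "|".join(LEGAL_SUFFIXES) + r")$")


def split_legal_suffix(name):
    """Return (base_name, suffix); suffix is None when no known suffix ends the name."""
    match = SUFFIX_PATTERN.search(name)
    if match:
        return name[: match.start()], match.group(1)
    return name, None


for name in ["entergy services inc", "entergy services llc", "city of glencoe"]:
    print(split_legal_suffix(name))
```

Comparing base names would flag "entergy services inc" (unmatched above) and "entergy services llc" (matched) as likely the same organization under different legal forms, though that would need checking case by case.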