# Purpose
The field used to link identities and contracts is the company name field. This isn't ideal because it is a freeform text field: utilities might report different names in the identities and contracts tables.
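Because the join key is freeform text, one option is to normalize names before comparing them. The following is a minimal sketch, not part of the original analysis: `normalize_company_name` is a hypothetical helper, and the rules it applies (casefolding, dropping punctuation, collapsing whitespace) are assumptions about which differences are cosmetic.

```python
import re


def normalize_company_name(name: str) -> str:
    """Hypothetical helper: reduce cosmetic differences between company names."""
    name = name.casefold()                      # ignore case differences
    name = re.sub(r"[^\w\s]", "", name)         # drop punctuation such as "," and "."
    return re.sub(r"\s+", " ", name).strip()    # collapse repeated whitespace


# Two spellings that differ only in case, punctuation, and spacing.
a = normalize_company_name("First Point Power, LLC")
b = normalize_company_name("first point  power llc.")
print(a == b)
```

This only catches formatting differences. Names that differ by whole words would still fail to match and need manual review.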

## TL;DR
Luckily, only 5 of ~2,500 seller company names in the contracts table can't be linked to an identity in 2020. TBD if company names and CIDs remain consistent across years.

In [1]:
from pudl_rmi.process_eqr import engine, DATE_COLUMNS

import pandas as pd

year = 2020
quarter = 'Q2'

## Get Contracts

In [2]:
with engine.connect() as conn:
    all_contracts = pd.read_sql("select * from contracts", conn)

In [3]:
all_contracts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 778269 entries, 0 to 778268
Data columns (total 33 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   contract_unique_id                     778269 non-null  object 
 1   seller_company_name                    778269 non-null  object 
 2   seller_history_name                    0 non-null       object 
 3   customer_company_name                  778269 non-null  object 
 4   contract_affiliate                     778269 non-null  object 
 5   ferc_tariff_reference                  778241 non-null  object 
 6   contract_service_agreement_id          778189 non-null  object 
 7   contract_execution_date                778269 non-null  object 
 8   commencement_date_of_contract_term     778269 non-null  object 
 9   contract_termination_date              260131 non-null  object 
 10  actual_termination_date                15911 non-null   

In [4]:
contracts = all_contracts.query("year == @year & quarter == @quarter")

## Get Identities

In [5]:
with engine.connect() as conn:
    all_identities = pd.read_sql_table("identities", conn)

In [6]:
all_identities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22998 entries, 0 to 22997
Data columns (total 16 columns):
 #   Column                                           Non-Null Count  Dtype 
---  ------                                           --------------  ----- 
 0   filer_unique_id                                  22998 non-null  object
 1   company_name                                     22998 non-null  object
 2   company_identifier                               22998 non-null  object
 3   contact_name                                     22998 non-null  object
 4   contact_title                                    22998 non-null  object
 5   contact_address                                  22998 non-null  object
 6   contact_city                                     22998 non-null  object
 7   contact_state                                    22998 non-null  object
 8   contact_zip                                      22998 non-null  object
 9   contact_country_name                   

In [7]:
identities = all_identities.query("year == @year & quarter == @quarter")

## Does every `contracts.seller_company_name` exist in `identities.company_name`?

In [8]:
contract_company_names = pd.Series(contracts.seller_company_name.unique())
identity_company_names = pd.Series(identities.company_name.unique())
print(contract_company_names.shape)
print(identity_company_names.shape)

(2315,)
(2810,)


In [9]:
is_name_in_both = contract_company_names.isin(identity_company_names)
print(is_name_in_both.value_counts())

contract_company_names[~is_name_in_both]

True     2314
False       1
dtype: int64


1224    FirstLight Power Resources Management, LLC
dtype: object

In [10]:
contract_company_names[contract_company_names.str.contains("First")]

1143                  FirstLight CT Housatonic LLC
1144                       FirstLight CT Hydro LLC
1145                       FirstLight MA Hydro LLC
1222                       First Choice Energy LLC
1223               FirstLight Power Management LLC
1224    FirstLight Power Resources Management, LLC
2076                        First Point Power, LLC
dtype: object

In [11]:
identity_company_names[identity_company_names.str.contains("First")]

1449       FirstLight CT Housatonic LLC
1450            FirstLight CT Hydro LLC
1451            FirstLight MA Hydro LLC
1539            First Choice Energy LLC
1540    FirstLight Power Management LLC
2525             First Point Power, LLC
dtype: object

For 2020 Q2, there is ONE seller company name in the contracts data that does not exist in the identities table.
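The manual `str.contains("First")` search above could also be done with fuzzy matching. This is a sketch using `difflib.get_close_matches` from the Python standard library; the `cutoff` of 0.6 is the library default and is an assumption about what counts as "close", and the candidate list is copied from the identities output above.

```python
from difflib import get_close_matches

# The seller name that has no exact match in the identities table.
unmatched = "FirstLight Power Resources Management, LLC"

# Identity company names containing "First", copied from the output above.
identity_names = [
    "FirstLight CT Housatonic LLC",
    "FirstLight CT Hydro LLC",
    "FirstLight MA Hydro LLC",
    "First Choice Energy LLC",
    "FirstLight Power Management LLC",
    "First Point Power, LLC",
]

# Return up to 3 candidates ranked by similarity, best first.
candidates = get_close_matches(unmatched, identity_names, n=3, cutoff=0.6)
print(candidates)
```

A close candidate is only a lead for review, not proof that two names refer to the same filer. Here the top candidate also appears as its own seller name in the contracts data.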

## All of 2020

In [12]:
contract_company_names = pd.Series(all_contracts.seller_company_name.unique())
identity_company_names = pd.Series(all_identities.company_name.unique())
print(contract_company_names.shape)
print(identity_company_names.shape)

(2483,)
(3005,)


In [13]:
is_name_in_both = contract_company_names.isin(identity_company_names)
print(is_name_in_both.value_counts())

contract_company_names[~is_name_in_both]

True     2478
False       5
dtype: int64


182                  Cayuga Operating Company, LLC
1477         Newmont Nevada Energy Investment, LLC
1616                Fortistar North Tonawanda Inc.
1921    FirstLight Power Resources Management, LLC
2467               VESI Pomona Energy Storage Inc.
dtype: object

It looks like 5 seller company names in the 2020 contracts data do not have corresponding identity information. Not too bad?
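The comparison above counts unique company names, not contract rows. To see how many contracts are affected, the same `isin` check can be applied to the full contracts frame. This is a sketch on toy data: the frames `toy_contracts` and `toy_identities` and their rows are made up for illustration, and only the column names come from the tables above.

```python
import pandas as pd

# Toy stand-ins for the contracts and identities tables (illustrative rows only).
toy_contracts = pd.DataFrame(
    {
        "contract_unique_id": ["C1", "C2", "C3", "C4", "C5"],
        "seller_company_name": ["Seller A", "Seller A", "Seller B", "Seller B", "Seller C"],
    }
)
toy_identities = pd.DataFrame({"company_name": ["Seller A", "Seller C"]})

# Flag each contract row whose seller name has an exact match in identities.
has_identity = toy_contracts["seller_company_name"].isin(toy_identities["company_name"])
unlinked = toy_contracts[~has_identity]

n_unlinked_names = unlinked["seller_company_name"].nunique()  # unique names without a match
n_unlinked_contracts = len(unlinked)                          # contract rows without a match
per_name = unlinked["seller_company_name"].value_counts()     # contracts per unmatched name

print(n_unlinked_names, n_unlinked_contracts)
print(per_name)
```

On the real data, replacing the toy frames with `all_contracts` and `all_identities` would show whether the 5 unmatched names account for a handful of contracts or many.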