# Introduction

Utilities are often part of a nested hierarchy of holding companies and subsidiaries, which makes it difficult to understand the complex web of political and economic incentives that inform these companies' behavior.
Subsidiary relationships are reported in the SEC’s Form 10-K, along with other useful information about each company.
PUDL has extracted a beta version of several tables of corporate data from the SEC Form 10-K and its attachments, including a conservative set of likely links between SEC-identified entities and those from FERC and the EIA.
Four output tables are available:

* `out_sec10k__quarterly_filings`: information about the Form 10-K filings themselves (filing date, subversion of the 10-K used, source URL, etc.)
* `out_sec10k__quarterly_company_information`: attributes describing the companies which file 10-Ks
* `out_sec10k__parents_and_subsidiaries`: ownership information about parent companies and their subsidiary companies
* `out_sec10k__changelog_company_name`: information about company name changes

In this notebook, we will introduce PUDL's SEC 10-K tables and demonstrate some tasks they can help with.

In [1]:
import pandas as pd

In [2]:
def s3(table):
    """Get the S3 address for a PUDL table's parquet file."""
    return f"s3://pudl.catalyst.coop/nightly/{table}.parquet"

# 1. Warm-up: Find all historical names for a company

_Valaris, Limited_ is an oil industry company headquartered in Texas that's been around in one form or another since the 1980s.
It hasn't always been called _Valaris_, though -- and if we want to be able to look up data about this company that it filed at different times over the years, we'll need to know what names it used at what times.

We can use `out_sec10k__changelog_company_name`, the SEC 10-K company name changelog in PUDL, to help with this.

First, we need the company's Central Index Key (CIK), which is a unique identifier used by the SEC to identify corporations.
The SEC provides a CIK lookup utility at https://www.sec.gov/search-filings/cik-lookup that will let us search by company name.

The CIK for _Valaris, Limited_ is `0000314808`.
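CIKs are often written without their leading zeros, which matters when filtering on a string column. A minimal sketch, assuming the PUDL tables store the CIK as a zero-padded 10-character string (as the value above suggests); `normalize_cik` is a hypothetical helper, not part of PUDL:

```python
def normalize_cik(cik):
    """Return a CIK as a zero-padded, 10-character string (hypothetical helper)."""
    # int() strips any existing leading zeros; zfill() pads back out to 10 characters
    return str(int(cik)).zfill(10)


print(normalize_cik(314808))
print(normalize_cik("314808"))
```

Both calls should print the same padded form of the Valaris CIK.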

We can then use the CIK to filter `out_sec10k__changelog_company_name`.
This will give us a listing of all the names _Valaris, Limited_ has ever filed with the SEC, and the dates when each name was active.

In [3]:
valaris_names = pd.read_parquet(
    s3("out_sec10k__changelog_company_name"),
    columns=[
        "central_index_key",
        "name_change_date",
        "company_name_old",
        "company_name_new",
    ],
    dtype_backend="pyarrow",
    engine="pyarrow",
    filters=[("central_index_key","=","0000314808")],
)
valaris_names

Unnamed: 0,central_index_key,name_change_date,company_name_old,company_name_new
0,314808,1987-10-15,blocker energy corp,energy service company inc
1,314808,1992-07-03,energy service company inc,ensco international inc
2,314808,1995-05-26,ensco international inc,ensco international plc
3,314808,2009-12-23,ensco international plc,ensco rowan plc
4,314808,2019-04-10,ensco rowan plc,valaris plc
5,314808,2019-08-01,valaris plc,valaris ltd


Before October 1987, _Valaris_ was known as _Blocker Energy Corporation_. From then until July 1992, it was _Energy Service Company, Incorporated_. Periodic name changes continued; the company has been _Valaris, Limited_ since August 2019.

If we just want a list of all the names they've ever used so that we can keyword-search in other datasets, we can reshape the data frame to combine the two name columns, deduplicate the names, and print the result as CSV:

In [4]:
print(
    valaris_names.drop(columns="name_change_date")
    .set_index("central_index_key")
    # put all the names in a single column
    .stack()
    # drop the index level that holds the original column names
    .droplevel(1)
    .groupby("central_index_key")
    # join unique names together, |-delimited
    .agg(lambda x: "|".join(sorted(set(x))))
    # format for output
    .reset_index()
    .rename(columns={0:"names"})
    .to_csv(index=False)
)

central_index_key,names
0000314808,blocker energy corp|energy service company inc|ensco international inc|ensco international plc|ensco rowan plc|valaris ltd|valaris plc



This method and format would be suitable for compiling historical names of many different corporations into a single file, which could then be used as input to some future process downstream.
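As a rough sketch of how that might look for several corporations at once, the example below applies the same idea to a toy changelog. The two companies and their names are made up for illustration, and `all_names_per_cik` is a hypothetical helper; only the column names come from the PUDL table.

```python
import pandas as pd

# Toy changelog with two made-up companies (hypothetical data, real column names)
changelog = pd.DataFrame({
    "central_index_key": ["0000000001", "0000000001", "0000000002"],
    "name_change_date": pd.to_datetime(["1990-01-01", "2000-01-01", "2010-01-01"]),
    "company_name_old": ["acme corp", "acme inc", "widget co"],
    "company_name_new": ["acme inc", "acme holdings", "widget ltd"],
})


def all_names_per_cik(df):
    """Return one row per CIK with every old/new name, |-delimited."""
    return (
        # put old and new names in a single "name" column
        df.melt(
            id_vars="central_index_key",
            value_vars=["company_name_old", "company_name_new"],
            value_name="name",
        )
        .groupby("central_index_key")["name"]
        # join unique names together, |-delimited
        .agg(lambda x: "|".join(sorted(set(x))))
        .rename("names")
        .reset_index()
    )


names = all_names_per_cik(changelog)
print(names.to_csv(index=False))
```

Using `melt` rather than `stack` is only a stylistic choice here; either reshapes the two name columns into one.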

# 2. Leverage Industry Codes

The SEC uses a system called [Standard Industrial Classification (SIC) coding](https://www.sec.gov/search-filings/standard-industrial-classification-sic-code-list) to indicate a company's type of business. Internally at the SEC, SIC codes are used to determine which department reviews each company's finances. We can use them to reduce the scale of data we have to deal with, focusing our attention on industries of interest and excluding corporations in industries not relevant to our analyses.

SIC codes are included in the following PUDL tables:

* `out_sec10k__quarterly_company_information` - for SEC 10-K filers only
* `out_sec10k__parents_and_subsidiaries` - for subsidiary companies as well as parent companies/SEC 10-K filers
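SIC codes are hierarchical: the first two digits identify a major group, and the electric and gas codes used in this notebook all begin with 49. The sketch below shows one way to use that for coarse filtering. It assumes the codes are stored as strings, and the toy rows are made up for illustration.

```python
import pandas as pd

# Toy records (made-up companies); the SIC codes themselves appear in this notebook
toy = pd.DataFrame({
    "company": ["a", "b", "c", "d"],
    "industry_id_sic": ["4911", "4922", "2621", None],
})

# The first two digits of a SIC code give its major group
toy["sic_major_group"] = toy.industry_id_sic.str[:2]

# Keep only companies in major group 49; rows with a missing code are excluded
utilities = toy.loc[toy.sic_major_group == "49"]
print(utilities.company.tolist())
```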

## Use SIC codes to evaluate the performance of PUDL record linkages between SEC and EIA

The SEC 10-K does not include EIA or FERC utility IDs. PUDL uses a statistical model to conservatively predict linkages between corporations in the SEC 10-K data and their counterparts in the EIA data.
We want to evaluate the performance of this model (are the links any good?), but we don't want to manually examine each of the thousands of links.
Instead, we can use industry codes to compute some summary statistics that can help us decide whether the links roughly make sense as a whole:

* which industries have the greatest numbers of links
* link coverage (percentage of SEC companies with a link) in industries where we expect most utility companies to operate
* link coverage (percentage of SEC companies with a link) in industries where we expect most companies to be utilities (subtly different -- more below)

In [10]:
company_quarters = pd.read_parquet(s3("out_sec10k__quarterly_company_information"))

In [28]:
import polars as pl
pl_company_quarters = pl.read_parquet(
    s3("out_sec10k__quarterly_company_information"),
    storage_options={
        "skip_signature": "true",
        "region": "us-west-2",
    },
)

#### What industries have the highest representation among record linkages?

For how many quarterly records do we have a link between the SEC filer and an EIA utility ID?

In [11]:
linked_company_quarters = company_quarters.loc[company_quarters.utility_id_eia.notna()]
linked_company_quarters.shape[0]

15178

How many unique companies are found within those records?

In [12]:
linked_companies = (
    linked_company_quarters.drop_duplicates(subset=["central_index_key","utility_id_eia"])
)
linked_companies.shape[0]

529

How many records do we retain if we require a valid industry code?

In [13]:
linked_company_quarters.loc[
    linked_company_quarters.industry_id_sic.notna()
].shape[0]

15133

How many _unique companies_ do we retain if we require a valid industry code?

In [14]:
linked_companies_with_sic = (
    linked_company_quarters.loc[
        linked_company_quarters.industry_id_sic.notna()
    ].drop_duplicates(subset=["central_index_key","utility_id_eia"])
)
linked_companies_with_sic.shape[0]

526

Do companies sometimes change their industry code across filings?

In other words: is the number of unique (company, industry) pairs greater than the number of unique companies?

In [15]:
linked_company_quarters.drop_duplicates(subset=["central_index_key","industry_id_sic"]).shape[0]

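632

Yes: 632 unique (company, industry) pairs is more than the 529 unique linked companies counted above, so some companies appear under more than one industry code across their filings (a missing code counts as its own value here).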
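A more direct check is to count the distinct industry codes per company. The sketch below does this on toy data with made-up CIKs; the column names match the PUDL table, but nothing here is computed from the real records.

```python
import pandas as pd

# Toy quarterly records: company 1 switches from 4911 to 4931, company 2 never changes
toy_quarters = pd.DataFrame({
    "central_index_key": ["0000000001"] * 3 + ["0000000002"] * 2,
    "industry_id_sic": ["4911", "4911", "4931", "2621", "2621"],
})

# nunique() ignores missing values by default, so only real codes are counted
sic_per_company = toy_quarters.groupby("central_index_key").industry_id_sic.nunique()

# Companies that reported more than one distinct SIC code
changed = sic_per_company.loc[sic_per_company > 1]
print(changed)
```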

Among quarterly records with a link between the filer and an EIA utility company, as well as a valid SIC code, what industries are most commonly seen?

In other words: which industries hold the greatest percentage of available links?

In [16]:
quarterly_links_per_sic = (
    linked_company_quarters
    .loc[linked_company_quarters.industry_id_sic.notna()]
    [["industry_id_sic","industry_name_sic"]]
    .value_counts()
    .pipe(lambda links_with_sic: pd.DataFrame({
        "links_with_sic": links_with_sic,
        "fraction_with_sic": links_with_sic/links_with_sic.sum()
    }))
    .sort_values("fraction_with_sic", ascending=False)
    .reset_index()
)
quarterly_links_per_sic.head(10)

Unnamed: 0,industry_id_sic,industry_name_sic,links_with_sic,fraction_with_sic
0,4911,electric services,6903,0.456155
1,4931,electric & other services combined,2105,0.1391
2,6798,real estate investment trusts,413,0.027291
3,6189,asset-backed securities,376,0.024846
4,1311,crude petroleum & natural gas,229,0.015132
5,2621,paper mills,185,0.012225
6,2834,pharmaceutical preparations,177,0.011696
7,4991,cogeneration services & small power producers,136,0.008987
8,4922,natural gas transmission,124,0.008194
9,2631,paperboard mills,124,0.008194


Does the distribution over industries change significantly if we only count unique (company, industry) pairs?

In [17]:
unique_links_per_sic = (
    linked_company_quarters
    .loc[linked_company_quarters.industry_id_sic.notna()]
    .drop_duplicates(subset=["central_index_key","industry_id_sic"]) # drop multiple instances of the same company
    [["industry_id_sic","industry_name_sic"]]
    .value_counts()
    .pipe(lambda links_with_sic: pd.DataFrame({
        "links_with_sic": links_with_sic,
        "fraction_with_sic": links_with_sic/links_with_sic.sum()
    }))
    .sort_values("fraction_with_sic", ascending=False)
    .reset_index()
)
unique_links_per_sic.head(10)

Unnamed: 0,industry_id_sic,industry_name_sic,links_with_sic,fraction_with_sic
0,4911,electric services,124,0.197452
1,4931,electric & other services combined,49,0.078025
2,6798,real estate investment trusts,19,0.030255
3,6189,asset-backed securities,17,0.02707
4,2621,paper mills,15,0.023885
5,1311,crude petroleum & natural gas,14,0.022293
6,6770,blank checks,11,0.017516
7,2834,pharmaceutical preparations,11,0.017516
8,2911,petroleum refining,10,0.015924
9,4991,cogeneration services & small power producers,10,0.015924


lol, "blank checks"[<sup id="fn1-back">1</sup>](#fn1 "https://en.wikipedia.org/wiki/Special-purpose_acquisition_company") -- but otherwise pretty close to the distribution over all quarterly links.

[<sup id="fn1">1</sup>](#fn1-back) https://en.wikipedia.org/wiki/Special-purpose_acquisition_company

#### What industries have the best link coverage?

Among quarterly records with a valid SIC code, and with or without links to an EIA utility, which industries have the greatest link coverage?

In other words: which industries have the fewest unlinked records?

In [18]:
sics_by_link_coverage = (
    company_quarters
    .loc[company_quarters.industry_id_sic.notna()]
    .assign(has_utility_id=lambda x: x.utility_id_eia.notna())
    [["industry_id_sic","industry_name_sic", "has_utility_id"]]
    .value_counts()
    .unstack("has_utility_id").fillna(0)
    .rename_axis(columns=None)
    .assign(
        sic_total_records=lambda x: x.sum(axis="columns"),
        fraction_with_utility_id=lambda x: x[True]/x.sic_total_records,
    )
    .drop(columns=[False, True])
    .sort_values("fraction_with_utility_id", ascending=False)
    .reset_index()
)
sics_by_link_coverage.head(10)

Unnamed: 0,industry_id_sic,industry_name_sic,sic_total_records,fraction_with_utility_id
0,4911,electric services,11827.0,0.583664
1,4991,cogeneration services & small power producers,247.0,0.550607
2,2631,paperboard mills,226.0,0.548673
3,4931,electric & other services combined,4063.0,0.51809
4,2621,paper mills,454.0,0.407489
5,2600,papers & allied products,40.0,0.4
6,2650,paperboard containers & boxes,290.0,0.344828
7,3011,tires & inner tubes,115.0,0.269565
8,3760,guided missiles & space vehicles & parts,158.0,0.265823
9,2511,"wood household furniture, (no upholstered)",153.0,0.228758


Does this ranking change if we only consider unique (company, industry) pairs?

In [19]:
sics_by_unique_link_coverage = (
    company_quarters
    .loc[company_quarters.industry_id_sic.notna()]
    .drop_duplicates(subset=["central_index_key","industry_id_sic"]) # drop multiple instances of the same company
    .assign(has_utility_id=lambda x: x.utility_id_eia.notna())
    [["industry_id_sic","industry_name_sic", "has_utility_id"]]
    .value_counts()
    .unstack("has_utility_id").fillna(0)
    .rename_axis(columns=None)
    .assign(
        sic_total_records=lambda x: x.sum(axis="columns"),
        fraction_with_utility_id=lambda x: x[True]/x.sic_total_records,
    )
    .drop(columns=[False, True])
    .sort_values("fraction_with_utility_id", ascending=False)
    .reset_index()
)
sics_by_unique_link_coverage.head(10)

Unnamed: 0,industry_id_sic,industry_name_sic,sic_total_records,fraction_with_utility_id
0,2631,paperboard mills,21.0,0.47619
1,4931,electric & other services combined,105.0,0.466667
2,4911,electric services,303.0,0.409241
3,2600,papers & allied products,8.0,0.375
4,4991,cogeneration services & small power producers,28.0,0.357143
5,2621,paper mills,44.0,0.340909
6,2650,paperboard containers & boxes,31.0,0.225806
7,4932,gas & other services combined,18.0,0.222222
8,2611,pulp mills,5.0,0.2
9,2732,book printing,7.0,0.142857


This change is more significant: in the top 10 industries we still see electric services and paper, but there are no more tires, space vehicles, or furniture, and we've added gas. The overall proportion of each SIC which is linked has also dropped: the majority of _quarterly records_ for 4911 Electric Services have links to EIA utilities, but a _minority_ of unique companies associated with this industry are linked.
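One way this gap can arise is when linked companies tend to have more filings than unlinked ones. The toy example below (made-up data, same column names) illustrates the arithmetic: one linked company with six records and two unlinked companies with one record each.

```python
import pandas as pd

# Company A is linked and has six records; B and C are unlinked with one record each
toy = pd.DataFrame({
    "central_index_key": ["A"] * 6 + ["B", "C"],
    "industry_id_sic": ["4911"] * 8,
    "utility_id_eia": [100] * 6 + [None, None],
})

# Record-level coverage: share of quarterly records that carry a link
record_coverage = toy.utility_id_eia.notna().mean()

# Company-level coverage: share of unique (company, industry) pairs that carry a link
company_coverage = (
    toy.drop_duplicates(subset=["central_index_key", "industry_id_sic"])
    .utility_id_eia.notna()
    .mean()
)
print(f"records: {record_coverage:.2f}, companies: {company_coverage:.2f}")
```

Here three quarters of the records are linked but only one company in three is, which mirrors the pattern seen for 4911 above.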

TODO: conclusions

## SIC codes of interest

The following SIC codes represent industries where we expect nearly all companies to have a corresponding EIA utility ID:

* 4911: Electric Services
* 4931: Electric & Other Services Combined
* 4991: Cogeneration Services & Small Power Producers

#### Within industries we most associate with electric utilities, what percent of SEC filers have links to an EIA utility ID?

In [31]:
likely_electric_utility_records = (
    company_quarters.loc[
        company_quarters.industry_id_sic.notna() &
        company_quarters.industry_id_sic.isin(["4911", "4931", "4991"])
    ]
)
(
    likely_electric_utility_records
    .drop_duplicates(subset=["central_index_key","industry_id_sic"]) # drop multiple instances of the same company
    .groupby(["industry_id_sic","industry_name_sic"])
    .aggregate(
        count_with_utility_id=pd.NamedAgg(column="utility_id_eia", aggfunc=lambda x: x.notna().sum()),
        fraction_with_utility_id=pd.NamedAgg(column="utility_id_eia", aggfunc=lambda x: x.notna().mean()),
        total_unique_companies=pd.NamedAgg(column="utility_id_eia", aggfunc=lambda x: x.size),
    )
    .sort_values("fraction_with_utility_id", ascending=False)
)

Unnamed: 0_level_0,Unnamed: 1_level_0,count_with_utility_id,fraction_with_utility_id,total_unique_companies
industry_id_sic,industry_name_sic,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
4931,electric & other services combined,49,0.466667,105
4911,electric services,124,0.409241,303
4991,cogeneration services & small power producers,10,0.357143,28


#### What are some typical companies in these industries that have not been linked to an EIA utility ID?

Top 10 unlinked companies with the most SEC 10-K quarterly records:

In [93]:
top_unlinked = (
    likely_electric_utility_records.loc[likely_electric_utility_records.utility_id_eia.isna()]
    .groupby("central_index_key")
    .agg(
        available_record_count=pd.NamedAgg(column="filename_sec10k", aggfunc=lambda x: x.size),
    )
    .sort_values("available_record_count", ascending=False)
    .head(10)
    # glue company information back on so we can see who they are beyond just the CIK
    .merge(
        (
            likely_electric_utility_records
            .drop(columns=["filename_sec10k", "filer_count", "report_date", "filing_date", "source_url"])
            .drop_duplicates()
        ), 
        on="central_index_key", how="left")
    .groupby("central_index_key")
    .agg(
        # collapse multiple values used across different filings into a single list
        lambda x: x.drop_duplicates()
    )
    .sort_values("central_index_key")
)
top_unlinked

Unnamed: 0_level_0,available_record_count,utility_id_eia,utility_name_eia,company_name,fiscal_year_end,taxpayer_id_irs,incorporation_state,industry_name_sic,industry_group_sic,industry_id_sic,...,business_zip_code,business_zip_code_4,business_postal_code,mail_street_address,mail_street_address_2,mail_city,mail_state,mail_zip_code,mail_zip_code_4,mail_postal_code
central_index_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
23426,139,,,connecticut light & power co,1231,06-0303850,CT,electric services,office of energy & transportation,4911,...,06037,1616,,"[<NA>, 107 selden street]",,"[<NA>, berlin]","[<NA>, CT]","[<NA>, 06037]",,
44570,163,,,"[gulf states utilities co, entergy gulf states...",1231,74-0662730,"[TX, LA]",electric services,office of energy & transportation,4911,...,"[77701, 70802]","[<NA>, 5717]",,"[<NA>, 350 pine st, 446 north boulevard]",,"[<NA>, beaumont, baton rouge]","[<NA>, TX, LA]","[<NA>, 77701, 70802]","[<NA>, 5717]",
66901,219,,,"[mississippi power & light co, entergy mississ...","[1231, 1204, 0721]","[64-0205830, 83-1950019]","[MS, TX]",electric services,office of energy & transportation,4911,...,"[39215, 39201, 70113]","[1640, <NA>]",,"[<NA>, 308 east pearl street, 639 loyola ave]",,"[<NA>, jackson, new orleans]","[<NA>, MI, MS, LA]","[<NA>, 39201, 70113]",,
71508,219,,,"[new orleans public service inc, entergy new o...",1231,"[72-0273040, 82-2212934]","[LA, TX]",electric & other services combined,office of energy & transportation,4931,...,"[70161, 70113, 70112]",,,"[<NA>, po box 61000, 1600 perdido st]","[<NA>, bldg 505]","[<NA>, new orl, new orleans]","[<NA>, LA]","[<NA>, 70161, 70112]",,
72741,140,,,"[northeast utilities, northeast utilities syst...",1231,04-2147929,MA,electric services,office of energy & transportation,4911,...,"[01090, 01105, 01104]","[0010, <NA>]",,"[107 seldon st, <NA>, 107 selden st]",,"[berlin, <NA>]","[CT, <NA>]","[06037, <NA>]","[1616, <NA>]",
78100,144,,,peco energy co,1231,23-0970240,PA,electric & other services combined,office of energy & transportation,4931,...,"[19103, 19101, <NA>]","[<NA>, 8699]",,"[<NA>, 2301 market street]","[<NA>, po box 8699]","[<NA>, philadelphia]","[<NA>, PA]","[<NA>, 19101]","[<NA>, 8699]",
92122,182,,,southern co,1231,58-0690070,DE,electric services,office of energy & transportation,4911,...,"[30346, 30303, 30308]",,,"[<NA>, 64 perimeter center east, 270 peachtree...",,"[<NA>, atlanta]","[<NA>, GA]","[<NA>, 30346, 30303, 30308]",,
202584,219,,,"[system energy resources inc, system energy re...",1231,72-0752777,AR,electric services,office of energy & transportation,4911,...,39213,,,"[<NA>, po box 31995, echelon one, 1340 echelon...","[<NA>, 1340 echelon pkwy]","[<NA>, jackson]","[<NA>, MS]","[<NA>, 39286, 39213]","[<NA>, 1995]",
315256,149,,,public service co of new hampshire,1231,02-0181050,NH,electric services,office of energy & transportation,4911,...,03105,"[<NA>, 0330]",,"[<NA>, 1000 elm street, 780 n. commercial street]",,"[<NA>, manchester]","[<NA>, NH]","[<NA>, 03105]","[<NA>, 0330]",
1109357,139,,,exelon corp,1231,23-2990190,PA,electric & other services combined,office of energy & transportation,4931,...,"[60690, <NA>, 60680]","[3005, <NA>, 5398]",,"[p o box 767, <NA>, po box 805398]",,"[chicago, <NA>]","[IL, <NA>]","[60690, <NA>, 60680]","[<NA>, 5398]",


#### Could we manually find these in EIA if we needed to?

Let's look at Connecticut Light & Power Co. The SEC data has it located in Berlin, CT:

In [89]:
top_unlinked.iloc[0:1].T.dropna()

central_index_key,0000023426
available_record_count,139
company_name,connecticut light & power co
fiscal_year_end,1231
taxpayer_id_irs,06-0303850
incorporation_state,CT
industry_name_sic,electric services
industry_group_sic,office of energy & transportation
industry_id_sic,4911
film_number,"[94517972, 94519900, 96534651, 96534893, 97562..."
sec10k_type,"['10-k', '10-k/a'] Categories (8, object): ['1..."


We can look in the list of utilities reporting to the EIA each year using PUDL table `out_eia__yearly_utilities`:

In [43]:
eia_utilities = pd.read_parquet(s3("out_eia__yearly_utilities"))

In [44]:
pl_eia_utilities = pl.read_parquet(
    s3("out_eia__yearly_utilities"),
    storage_options={
        "skip_signature": "true",
        "region": "us-west-2",
    },
)

Then look for utilities with a name that starts with "Connecticut Light".

We'll also grab any utilities located in Berlin, CT.

In [95]:
(
    eia_utilities.loc[
        eia_utilities.utility_name_eia.str.contains("(?i)^connecticut light") |
        ((eia_utilities.city == "Berlin") & (eia_utilities.state == "CT"))
    ]
    .drop_duplicates(subset=["utility_id_eia","utility_id_pudl","utility_name_eia","street_address","city","state","zip_code"])
    .sort_values("report_date")
    .T
)

Unnamed: 0,151997,151991,122134,122133,104366
utility_id_eia,4176,4176,29868,29868,64810
utility_id_pudl,75,75,4423,4423,13966
utility_name_eia,Connecticut Light & Power Co,Connecticut Light & Power Co,Northeast Generation Services,Northeast Generation Services,Eversource Energy
report_date,2002-01-01,2009-01-01,2010-01-01,2011-01-01,2023-01-01
street_address,,301 Hammer Mill Road,,107 Selden Street,107 Selden Street
city,Hartford,Rocky Hill,Berlin,Berlin,Berlin
state,CT,CT,CT,CT,CT
zip_code,06141,06067,06037,06037,06037
plants_reported_owner,,,,,
plants_reported_operator,,,,,


So... sort of. We have two records for a Connecticut Light & Power Co, but the one from 2002 is in Hartford, and the one from 2009 is in Rocky Hill. So, a state match, but not a city match.

We also have three records for utilities in Berlin, CT, two of which have a street address that exactly matches the 107 Selden Street we find in the SEC data, but none of them are called Connecticut Light & Power.
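Comparing names and addresses across the two datasets by eye works for one company, but the SEC and EIA data differ in case and abbreviation (for example "107 selden st" versus "107 Selden Street"). A minimal sketch of one possible normalization step follows; the `normalize` helper and its abbreviation map are hypothetical and would need extending for real data.

```python
import re

# Hypothetical abbreviation map; extend as needed for real data
_ABBREVIATIONS = {"street": "st", "avenue": "ave", "company": "co", "and": "&"}


def normalize(text):
    """Lowercase, strip punctuation other than '&', and apply abbreviations."""
    tokens = re.sub(r"[^\w\s&]", "", text.lower()).split()
    return " ".join(_ABBREVIATIONS.get(token, token) for token in tokens)


print(normalize("107 Selden Street") == normalize("107 selden st"))
print(
    normalize("Connecticut Light and Power Company")
    == normalize("connecticut light & power co")
)
```

This only handles formatting differences; it would not catch a misspelling such as "seldon" for "selden".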

Let's do a quick check through the company name changelog to see if we can find reinforcement for any of these potential matches.

In [79]:
clp_names = pd.read_parquet(
    s3("out_sec10k__changelog_company_name"),
    columns=[
        "central_index_key",
        "name_change_date",
        "company_name_old",
        "company_name_new",
    ],
    dtype_backend="pyarrow",
    engine="pyarrow",
    filters=[("central_index_key","=","0000023426")],
)
clp_names

Unnamed: 0,central_index_key,name_change_date,company_name_old,company_name_new


No results, which means only one company name was ever filed with the SEC for the CIK of Connecticut Light & Power Co.

We can try cross-referencing the other two EIA utility names, Eversource Energy and Northeast Generation Services, to see if they appear in the SEC data under a different CIK.

In [102]:
(
    likely_electric_utility_records.loc[
        likely_electric_utility_records.company_name.str.contains("(?i)^eversource energy") |
        likely_electric_utility_records.company_name.str.contains("(?i)^northeast generation")
    ]
    .drop(columns=["filename_sec10k", "filer_count", "report_date", "filing_date", "source_url"])
    .drop_duplicates()
    .groupby("central_index_key")
    .agg(
        # collapse multiple values used across different filings into a single list
        lambda x: x.drop_duplicates()
    )
    .sort_values("central_index_key")
).T.dropna()

central_index_key,0000072741
company_name,eversource energy
fiscal_year_end,1231
taxpayer_id_irs,04-2147929
incorporation_state,MA
industry_name_sic,electric services
industry_group_sic,office of energy & transportation
industry_id_sic,4911
film_number,"[161463498, 17630367, 18637878, 19632597, 2065..."
sec10k_type,10-k
sec_act,1934 act


Okay, Eversource Energy has its _business_ address in Massachusetts, but its _mailing_ address is an exact match for Connecticut Light & Power Co. Hmm, this whiffs of some kind of buyout/takeover. We can use the parents and subsidiaries table to check:

In [100]:
subs = pd.read_parquet(
    s3("out_sec10k__parents_and_subsidiaries"),
    dtype_backend="pyarrow",
    engine="pyarrow",
    filters=[("parent_company_central_index_key","=","0000072741")],
)

In [98]:
subs.loc[
    subs.subsidiary_company_central_index_key=="0000023426"
]

Unnamed: 0,filename_sec10k,subsidiary_company_name,subsidiary_company_location,subsidiary_company_id_sec10k,fraction_owned,parent_company_central_index_key,parent_company_name,filing_date,report_date,parent_company_phone_number,...,subsidiary_company_mail_street_address,subsidiary_company_mail_street_address_2,subsidiary_company_mail_zip_code,subsidiary_company_mail_zip_code_4,subsidiary_company_incorporation_state,subsidiary_company_utility_id_eia,subsidiary_company_utility_name_eia,subsidiary_company_industry_name_sic,subsidiary_company_industry_id_sic,subsidiary_company_taxpayer_id_irs
472,72741/0000072741-96-000049,connecticut light and power company,,0000072741_connecticut light and power company_,1.0,72741,northeast utilities,1996-03-14,1996-01-01,2036655000,...,,,,,CT,,,electric services,4911,06-0303850
514,72741/0000072741-97-000054,connecticut light and power company,,0000072741_connecticut light and power company_,1.0,72741,northeast utilities system,1997-03-25,1997-01-01,2036655000,...,,,,,CT,,,electric services,4911,06-0303850
555,72741/0000072741-98-000076,connecticut light and power company,,0000072741_connecticut light and power company_,1.0,72741,northeast utilities system,1998-03-19,1998-01-01,4137855871,...,,,,,CT,,,electric services,4911,06-0303850
601,72741/0000072741-99-000089,connecticut light and power company,,0000072741_connecticut light and power company_,1.0,72741,northeast utilities system,1999-03-23,1999-01-01,4137855871,...,,,,,CT,,,electric services,4911,06-0303850


Okay! So here's what the PUDL SEC data helped us find out:

* The company in the SEC 10-K data called _Connecticut Light & Power Co_ (CIK 0000023426) wasn't auto-matched to an EIA utility.
* Alas, there is no EIA utility that matches both the company name and city/state.
* A utility in the EIA data called _Eversource Energy_ is one of two utilities that match city/state with the SEC's _Connecticut Light & Power Co_.
* _Eversource Energy_ (CIK 0000072741) is a listed parent company for _Connecticut Light & Power Co_ in the SEC data.
* While the SEC's _Eversource Energy_ wasn't auto-matched to an EIA utility, the combination of the name and city/state match strongly suggests correspondence to EIA utility ID 64810.
* In applications with lower confidence requirements, we might also consider _Connecticut Light & Power Co_ to correspond to EIA utility ID 4176, even though only the company name and state match the SEC data. If we go that route, it also suggests a parent-subsidiary link between EIA utilities 64810 (_Eversource_) and 4176 (_CL&P_).

or in other words:

* Add one high-confidence link between CIK 0000072741 and EIA utility ID 64810
* Add one medium-confidence link between CIK 0000023426 and EIA utility ID 4176
* Add one medium-confidence parent-subsidiary relationship between EIA utility IDs 64810 and 4176
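If we wanted to keep track of manual findings like these, one option is a small table of our own. The sketch below records the three conclusions above; the column names, the `confidence` labels, and the two-table layout are assumptions made for illustration, not a PUDL convention.

```python
import pandas as pd

# Manually identified SEC-to-EIA links (IDs taken from the investigation above)
manual_links = pd.DataFrame(
    [
        ("0000072741", 64810, "high"),
        ("0000023426", 4176, "medium"),
    ],
    columns=["central_index_key", "utility_id_eia", "confidence"],
)

# Manually inferred parent-subsidiary relationship between EIA utilities
manual_parent_subsidiary = pd.DataFrame(
    [(64810, 4176, "medium")],
    columns=["parent_utility_id_eia", "subsidiary_utility_id_eia", "confidence"],
)

print(manual_links)
print(manual_parent_subsidiary)
```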

These tangles can be a real mess, and having the address information, name changes, and parent-subsidiary information at your fingertips can really help point the way!

## Select a meaningful subset of respondents

### Electricity

Above, we identified the three SIC codes most associated with the production and distribution of electricity:

* 4911: Electric Services
* 4931: Electric & Other Services Combined
* 4991: Cogeneration Services & Small Power Producers

We can filter all SEC respondents using these codes.

In [123]:
out_sec10k__quarterly_company_information = pd.read_parquet(
    s3("out_sec10k__quarterly_company_information")
)

In [135]:
electricity_sic = ["4911", "4931", "4991"]
info_columns = [
    "central_index_key", "company_name",
    "utility_id_eia", "utility_name_eia",
    "business_city", "business_state", "business_street_address",
    "mail_city", "mail_state", "mail_street_address",
]
     
sec_electricity = out_sec10k__quarterly_company_information.loc[
    out_sec10k__quarterly_company_information.industry_id_sic.isin(electricity_sic),
    info_columns
].drop_duplicates()
print(sec_electricity.shape)
sec_electricity.head()
    

(983, 10)


Unnamed: 0,central_index_key,company_name,utility_id_eia,utility_name_eia,business_city,business_state,business_street_address,mail_city,mail_state,mail_street_address
594,3153,alabama power co,195,Alabama Power Co,birmingham,AL,600 n 18th st,,,
656,3153,alabama power co,195,Alabama Power Co,birmingham,AL,600 n 18th st,birmingham,AL,600 n 18th st
1033,3673,allegheny power system inc,363,Allegheny Energy SupplyWheatld,new york,NY,12 east 49th st,new york,NY,12 east 49th street
1036,3673,allegheny power system inc,363,Allegheny Energy SupplyWheatld,,,,,,
1037,3673,allegheny energy inc,363,Allegheny Energy SupplyWheatld,hagerstown,MD,10435 downsville pike,hagerstown,MD,10435 downsville pike


### Natural Gas

We can do the same for natural gas:

* 4922: NATURAL GAS TRANSMISSION
* 4923: NATURAL GAS TRANSMISSION & DISTRIBUTION
* 4924: NATURAL GAS DISTRIBUTION
* 4932: GAS & OTHER SERVICES COMBINED

In [134]:
ng_sic = ["4922", "4923", "4924", "4932"]
sec_ng = out_sec10k__quarterly_company_information.loc[
    out_sec10k__quarterly_company_information.industry_id_sic.isin(ng_sic),
    info_columns
].drop_duplicates()
print(sec_ng.shape)
sec_ng.head()
    

(511, 10)


Unnamed: 0,central_index_key,company_name,utility_id_eia,utility_name_eia,business_city,business_state,business_street_address,mail_city,mail_state,mail_street_address
525,3146,alabama gas corp,,,birmingham,AL,2101 sixth ave north,birmingham,AL,2101 sixth ave north
540,3146,alabama gas corp,,,birmingham,AL,2101 sixth ave north,birmingham,AL,605 richard arrington jr blvd north
543,3146,alabama gas corp,,,birmingham,AL,605 richard arrington jr blvd north,birmingham,AL,605 richard arrington jr blvd north
567,3146,alabama gas corp,,,birmingham,AL,2102 6th avenue north,birmingham,AL,2102 6th avenue north
570,3146,alabama gas corp,,,birmingham,AL,2101 6th avenue north,birmingham,AL,2101 6th avenue north


### Fuel

Do we mean extraction?

* 1311: Office of Energy & Transportation; CRUDE PETROLEUM & NATURAL GAS
* 1381: Office of Energy & Transportation; DRILLING OIL & GAS WELLS
* 1382: Office of Energy & Transportation; OIL & GAS FIELD EXPLORATION SERVICES
* 1389: Office of Energy & Transportation; OIL & GAS FIELD SERVICES, NEC
* 3533: Office of Energy & Transportation; OIL & GAS FIELD MACHINERY & EQUIPMENT

Or refining?

* 2911: Office of Energy & Transportation; PETROLEUM REFINING
* 2990: Office of Energy & Transportation; MISCELLANEOUS PRODUCTS OF PETROLEUM & COAL

Or wholesale?

* 5171: Office of Trade & Services; WHOLESALE-PETROLEUM BULK STATIONS & TERMINALS
* 5172: Office of Trade & Services; WHOLESALE-PETROLEUM & PETROLEUM PRODUCTS (NO BULK STATIONS)

Or coal?

* 1220: Office of Energy & Transportation; BITUMINOUS COAL & LIGNITE MINING
* 1221: Office of Energy & Transportation; BITUMINOUS COAL & LIGNITE SURFACE MINING
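Whichever definition of "fuel" we settle on, the same `isin` filter used for electricity and natural gas applies. A minimal sketch, with the code lists from this section gathered into one lookup; the group names, the `select_by_sic_group` helper, and the toy rows are all hypothetical.

```python
import pandas as pd

# SIC codes listed in this section, grouped under made-up labels
SIC_GROUPS = {
    "electricity": ["4911", "4931", "4991"],
    "natural_gas": ["4922", "4923", "4924", "4932"],
    "fuel_extraction": ["1311", "1381", "1382", "1389", "3533"],
    "fuel_refining": ["2911", "2990"],
    "fuel_wholesale": ["5171", "5172"],
    "coal": ["1220", "1221"],
}


def select_by_sic_group(df, group):
    """Keep rows whose industry_id_sic belongs to the named group."""
    return df.loc[df.industry_id_sic.isin(SIC_GROUPS[group])]


# Toy stand-in for the company information table
toy = pd.DataFrame({
    "company_name": ["x", "y", "z"],
    "industry_id_sic": ["2911", "4911", "1221"],
})
refiners = select_by_sic_group(toy, "fuel_refining")
print(refiners)
```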

### Links between companies in different industries

If an electric utility is closely related to a company in another industry, it might behave differently from a company whose only concern is producing and/or distributing electricity.

We can use the parent/subsidiary relationships in concert with SIC codes to identify cases when a parent company in an electricity industry has a subsidiary company in an industry not otherwise related to electricity, and vice versa.

Let's look at the subsidiaries of electricity-related parent companies first:

In [155]:
sec_electricity_ciks = sec_electricity[["central_index_key"]].drop_duplicates()

# cross-industry relationships where an electricity company owns a company in another industry
electricity_as_parent = (
    out_sec10k__parents_and_subsidiaries.merge(
        sec_electricity_ciks,
        left_on="parent_company_central_index_key",
        right_on="central_index_key",
        how="inner"
    )
)
print(f"{electricity_as_parent.shape[0]:6d} subsidiaries of companies in electricity industries")

# industry is only known for a subset of subsidiaries
electricity_as_parent_known_sic = electricity_as_parent.dropna(subset="subsidiary_company_industry_id_sic")
print(f"{electricity_as_parent_known_sic.shape[0]:6d} subsidiaries where SIC is known")

electricity_as_parent_unknown_sic_known_eia = electricity_as_parent.loc[
    electricity_as_parent.subsidiary_company_industry_id_sic.isna() &
    electricity_as_parent.subsidiary_company_utility_id_eia.notna()
]
print(f"{electricity_as_parent_unknown_sic_known_eia.shape[0]:6d} subsidiaries with unknown SIC known to be an EIA utility")

print(f"{electricity_as_parent.shape[0] - (electricity_as_parent_known_sic.shape[0] + electricity_as_parent_unknown_sic_known_eia.shape[0]):6d}"
      " subsidiaries with insufficient industry metadata to decide either way")

# keep only subsidiaries outside the electricity industries
xindustry_electricity_as_parent = (
    electricity_as_parent_known_sic
    .loc[lambda x: ~x.subsidiary_company_industry_id_sic.isin(electricity_sic)]
)
print(f"{xindustry_electricity_as_parent.shape[0]:6d} subsidiaries where SIC is not an electricity industry")

150505 subsidiaries of companies in electricity industries
  5023 subsidiaries where SIC is known
 13436 subsidiaries with unknown SIC known to be an EIA utility
132046 subsidiaries with insufficient industry metadata to decide either way
    65 subsidiaries where SIC is not an electricity industry


There are only 65 known electric-parent/non-electric-subsidiary relationships in the entire dataset.
However, coverage of the metadata we need to query that relationship is quite low:
a SIC code is known for only 5,023 of the 150,505 subsidiary records (about 3%).

But if we assume the distribution over industries is the same among labeled and unlabeled subsidiaries,
we can get a rough idea of typical industry relationships in this direction by looking at the frequency table:

In [159]:
xindustry_electricity_as_parent[["subsidiary_company_industry_name_sic","subsidiary_company_industry_id_sic"]].value_counts()

subsidiary_company_industry_name_sic                 subsidiary_company_industry_id_sic
natural gas distribution                             4924                                  22
natural gas transmission                             4922                                  11
natural gas transmission & distribution              4923                                  11
telephone communications (no radiotelephone)         4813                                   8
electrical work                                      1731                                   7
airports, flying fields & airport terminal services  4581                                   1
blank checks                                         6770                                   1
miscellaneous fabricated metal products              3490                                   1
services-facilities support management services      8744                                   1
services-prepackaged software                        7372         

The strong presence of natural gas transmission and distribution in this list is notable: SIC codes 4922, 4923, and 4924 together account for 44 of the 65 relationships.
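
To get a feel for what the same-distribution assumption implies, here is a rough back-of-the-envelope sketch using only the counts printed above. The variable names are made up for illustration, and the assumption itself is strong: subsidiaries with a known SIC code are SEC filers and may differ systematically from those without one.

```python
# Counts copied from the printed output above.
total_subsidiaries = 150_505
known_sic = 5_023
unknown_sic_known_eia = 13_436
cross_industry_known = 65

# Rows with neither a SIC code nor an EIA utility ID.
undecided = total_subsidiaries - (known_sic + unknown_sic_known_eia)

# Share of SIC-labeled subsidiaries that fall outside the electricity industries.
cross_industry_share = cross_industry_known / known_sic

# Assumption: undecided rows have the same cross-industry share as labeled rows.
estimated_hidden = cross_industry_share * undecided

print(f"SIC coverage: {known_sic / total_subsidiaries:.1%}")
print(f"Cross-industry share among labeled rows: {cross_industry_share:.2%}")
print(f"Estimated cross-industry rows among undecided: {estimated_hidden:,.0f}")
```

The estimate is only as good as the assumption behind it, so treat it as an order-of-magnitude guide rather than a count.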

We can do a similar analysis of the parents of electricity-related companies:

In [158]:

# cross-industry relationships where an electricity company is owned by a company in another industry
electricity_as_subsidiary = (
    out_sec10k__parents_and_subsidiaries.merge(
        sec_electricity_ciks,
        left_on="subsidiary_company_central_index_key",
        right_on="central_index_key",
        how="inner"
    )
)
print(f"{electricity_as_subsidiary.shape[0]:6d} parents of companies in electricity industries")

# industry is only known for a subset of parent companies
electricity_as_subsidiary_known_sic = electricity_as_subsidiary.dropna(subset="parent_company_industry_id_sic")
print(f"{electricity_as_subsidiary_known_sic.shape[0]:6d} parent companies where SIC is known")

electricity_as_subsidiary_unknown_sic_known_eia = electricity_as_subsidiary.loc[
    electricity_as_subsidiary.parent_company_industry_id_sic.isna() &
    electricity_as_subsidiary.parent_company_utility_id_eia.notna()
]
print(f"{electricity_as_subsidiary_unknown_sic_known_eia.shape[0]:6d} parent companies with unknown SIC known to be an EIA utility")

print(f"{electricity_as_subsidiary.shape[0] - (electricity_as_subsidiary_known_sic.shape[0] + electricity_as_subsidiary_unknown_sic_known_eia.shape[0]):6d}"
      " parent companies with insufficient industry metadata to decide either way")

# keep only parent companies outside the electricity industries
xindustry_electricity_as_subsidiary = (
    electricity_as_subsidiary_known_sic
    .loc[lambda x: ~x.parent_company_industry_id_sic.isin(electricity_sic)]
)
print(f"{xindustry_electricity_as_subsidiary.shape[0]:6d} parent companies where SIC is not an electricity industry")

  7162 parents of companies in electricity industries
  7103 parent companies where SIC is known
     0 parent companies with unknown SIC known to be an EIA utility
    59 parent companies with insufficient industry metadata to decide either way
   408 parent companies where SIC is not an electricity industry


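Metadata is significantly more complete for parent companies, since they are guaranteed to be SEC filers: a SIC code is known for 7,103 of the 7,162 parent records (about 99%), compared with about 3% of subsidiary records. We can be more certain that we have captured all cross-industry relationships in this direction.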

We have a longer tail here in the distribution over non-electricity industries:

In [160]:
xindustry_electricity_as_subsidiary[["parent_company_industry_name_sic","parent_company_industry_id_sic"]].value_counts()

parent_company_industry_name_sic                             parent_company_industry_id_sic
electric, gas & sanitary services                            4900                              119
fire, marine & casualty insurance                            6331                               50
natural gas transmission                                     4922                               41
gas & other services combined                                4932                               40
refuse systems                                               4953                               35
natural gas distribution                                     4924                               31
natural gas transmission & distribution                      4923                               24
electrical work                                              1731                               12
telephone communications (no radiotelephone)                 4813                                9
crude petroleum &
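
The two analyses above repeat the same steps with the parent and subsidiary roles swapped. One possible way to factor that out is sketched below. The `cross_industry_links` helper and the toy data are hypothetical. Only the `{role}_company_central_index_key` and `{role}_company_industry_id_sic` column patterns come from the cells above, and SIC codes are assumed to be stored as strings.

```python
import pandas as pd


def cross_industry_links(links, known_ciks, known_role, other_role, known_sic_codes):
    """Keep links whose `known_role` company is in `known_ciks` and whose
    `other_role` company has a known SIC code outside `known_sic_codes`."""
    matched = links.merge(
        known_ciks,
        left_on=f"{known_role}_company_central_index_key",
        right_on="central_index_key",
        how="inner",
    )
    other_sic = f"{other_role}_company_industry_id_sic"
    # Drop rows where the other company's industry is unknown.
    labeled = matched.dropna(subset=[other_sic])
    return labeled.loc[~labeled[other_sic].isin(known_sic_codes)]


# Toy data, made up for illustration only.
toy_links = pd.DataFrame({
    "parent_company_central_index_key": ["0001", "0001", "0004", "0001"],
    "subsidiary_company_central_index_key": ["0002", "0003", "0001", "0005"],
    "parent_company_industry_id_sic": ["4911", "4911", "6331", "4911"],
    "subsidiary_company_industry_id_sic": ["4924", None, "4911", "4911"],
})
toy_electricity_ciks = pd.DataFrame({"central_index_key": ["0001", "0005"]})
toy_electricity_sic = ["4911"]

as_parent = cross_industry_links(
    toy_links, toy_electricity_ciks, "parent", "subsidiary", toy_electricity_sic
)
as_subsidiary = cross_industry_links(
    toy_links, toy_electricity_ciks, "subsidiary", "parent", toy_electricity_sic
)
print(as_parent["subsidiary_company_central_index_key"].tolist())
print(as_subsidiary["parent_company_central_index_key"].tolist())
```

With the real tables, the same call would take `out_sec10k__parents_and_subsidiaries`, `sec_electricity_ciks`, and `electricity_sic` in place of the toy inputs.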

# Leverage Subsidiary Relationships

## Find all historical subsidiaries of a company

## Working with multiple layers of subsidiary nesting