# Notebook to map respondent ID to entity ID

This notebook provides a simple first draft of a mapping from the `respondent_id` values in the
historical DBF-based FERC data to the `entity_id` values in the new XBRL-based data. It relies on
the years of data that FERC has migrated to the new XBRL format. Each migrated filing contains
the `respondent_id` in its file name and the `entity_id` embedded in the filing itself.

The first step is to extract the migrated filings to a SQLite database to make the entity IDs
accessible. These filings can be downloaded [here](https://ferc.gov/filing-forms/eforms-refresh/migrated-data-downloads).
Not every filer appears in every year of data, so using the entire set of years will
provide the best coverage. To create the SQLite database, extract the downloaded zip files to a single
directory, then use the FERC XBRL extractor tool with the following command:

```
xbrl_extract {path_to_filing_directory} ferc1.sqlite
```
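
The unpacking step that precedes this command can be scripted. The following is a minimal sketch, assuming the downloaded archives are plain zip files sitting in one folder. The helper name `extract_archives` and the `downloads` and `filings` paths are hypothetical and not part of the extractor tool.

```python
import zipfile
from pathlib import Path


def extract_archives(download_dir: Path, filing_dir: Path) -> int:
    """Unpack every zip archive in download_dir into filing_dir.

    Returns the number of archives that were unpacked.
    """
    filing_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for archive in sorted(download_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Files with the same name in different archives overwrite each other
            zf.extractall(filing_dir)
        count += 1
    return count


# Hypothetical locations; adjust to wherever the downloads were saved.
# extract_archives(Path("downloads"), Path("filings"))
```

The resulting directory would then be passed as `{path_to_filing_directory}` in the command above.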

In [None]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

The only data needed to perform the mapping are the `filing_name` and `entity_id` columns from the `identification_001_duration` table.

In [None]:
engine = create_engine("sqlite:///ferc1.sqlite")

# Select RespondentLegalName as well for convenience
id_table = pd.read_sql(
    "SELECT filing_name, entity_id, RespondentLegalName FROM identification_001_duration",
    engine,
)

The `respondent_id` is embedded in each filing name with the format `{UtilityName}-{respondent_id}-{year}{quarter}{form_number}`.
The next step is to extract that ID, then drop duplicate pairs (the same pair appears in filings from different years).

In [None]:
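# Pull the numeric ID out of the filing name; names that do not match become NaN
id_table["respondent_id"] = pd.to_numeric(
    id_table["filing_name"].str.extract(r".+-(\d+)-.+", expand=False)
)
# Keep one row per (entity_id, respondent_id) pair
map_table = (
    id_table.drop("filing_name", axis=1)
    .drop_duplicates(subset=["entity_id", "respondent_id"])
    .convert_dtypes()
    .sort_values(by=["respondent_id"])
)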
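
To see how the pattern behaves, it can help to run it on a few names first. The sketch below uses invented filing names that only follow the documented format, so the values are illustrative and not taken from the FERC data. It also shows one way to find names where the pattern failed to match.

```python
import pandas as pd

# Hypothetical filing names, invented to illustrate the naming pattern.
sample = pd.DataFrame(
    {
        "filing_name": [
            "Example_Power_Co-123-2021Q4F1",
            "Example-Hyphenated-Utility-45-2020Q4F1",
            "no_id_in_this_name",
        ]
    }
)

# expand=False returns a Series when the pattern has a single capture group.
extracted = sample["filing_name"].str.extract(r".+-(\d+)-.+", expand=False)
sample["respondent_id"] = pd.to_numeric(extracted)

# Rows where the pattern did not match have a missing ID and are worth inspecting.
unmatched = sample[sample["respondent_id"].isna()]
```

Because the leading `.+` is greedy, hyphens inside the utility name do not prevent the ID from being found in these examples.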

Save the mapping to a CSV file.

In [None]:
map_table.to_csv("respondent_map.csv", index=False)

Check for any `entity_id` values that map to multiple `respondent_id` values. These will need further
analysis to determine whether they are mistakes or whether something else is happening.

In [None]:
# keep=False shows every row for a repeated entity_id, not only the later occurrences
map_table.loc[map_table["entity_id"].duplicated(keep=False), :]
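
The same kind of check can be run in both directions. The sketch below uses a small invented table (the IDs are hypothetical) to show that passing `keep=False` to `duplicated` marks every row in a repeated group, whereas the default marks only the second and later occurrences.

```python
import pandas as pd

# Hypothetical mapping rows, invented for illustration.
demo_map = pd.DataFrame(
    {
        "entity_id": ["C000001", "C000001", "C000002", "C000003"],
        "respondent_id": [10, 11, 12, 12],
    }
)

# Entities that map to more than one respondent.
multi_respondent = demo_map[demo_map["entity_id"].duplicated(keep=False)]

# Respondents that map to more than one entity.
multi_entity = demo_map[demo_map["respondent_id"].duplicated(keep=False)]
```

Here `multi_respondent` holds both rows for the repeated entity, and `multi_entity` holds both rows for the repeated respondent, which makes the conflicting pairs easier to compare side by side.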