# Working with the EIA Extract / Transform
This notebook walks through PUDL's extract and transform steps for the EIA-860 and EIA-923 datasets. It is meant to make it easier to test and add new years of data, or new tables from the spreadsheets that haven't been integrated yet.

In [1]:
%load_ext autoreload
%autoreload 2
import pudl
from pudl import constants as pc
import logging
import sys
from pathlib import Path
import pandas as pd
pd.options.display.max_columns = None

In [2]:
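# Configure the root logger to print INFO-level messages to stdout.
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
# Replace any existing handlers so each message is emitted only once.
logger.handlers = [handler]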

In [3]:
pudl_settings = pudl.workspace.setup.get_defaults()

## Set the scope for the Extract-Transform

In [9]:
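# Tables and years to process for each dataset.
eia923_tables = pc.PUDL_TABLES['eia923']
eia923_years = list(range(2001, 2020))
eia860_tables = pc.PUDL_TABLES['eia860']
eia860_years = list(range(2001, 2021))
# If True, also extract the EIA-860M data and append it to the EIA-860 data.
eia860_ytd = True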
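
Because Python's `range` excludes its stop value, each year list above ends one year before the second argument. A quick standalone check that repeats the same two assignments:

```python
# range() excludes its stop value, so the last year is stop - 1.
eia923_years = list(range(2001, 2020))
eia860_years = list(range(2001, 2021))

print(eia923_years[0], eia923_years[-1])  # first and last EIA-923 year
print(eia860_years[0], eia860_years[-1])  # first and last EIA-860 year
```

When adding a new year of data, the stop value is the number to bump.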

## Create a locally cached datastore

In [10]:
ds = pudl.workspace.datastore.Datastore(local_cache_path=Path(pudl_settings["data_dir"]))

# EIA-860

## Extract just the EIA-860

In [11]:
%%time
eia860_extractor = pudl.extract.eia860.Extractor(ds)
eia860_raw_dfs = eia860_extractor.extract(year=eia860_years)

Extracting eia860 spreadsheet data.
Extra columns found in page boiler_generator_assn: {'plant_name', 'utility_name', 'steam_plant_type', 'generator_association'}
Extra columns found in page generator: {'winter_capacity', 'ferccogen', 'summer_capacity', 'planned_derates_net_summer_cap', 'fercewgdoc', 'fercdock', 'fercother'}
Extra columns found in page generator_proposed: {'winter_estimated_capacity', 'summer_estimated_capacity', 'winter_capacity', 'summer_capacity'}
Extra columns found in page plant: {'ferc_exempt_wholesale_generator_docket_number', 'ownertransdist'}
Extra columns found in page utility: {'areacode'}
CPU times: user 5min 50s, sys: 6.4 s, total: 5min 57s
Wall time: 6min 8s
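
The "Extra columns found in page ..." messages above suggest that the extractor compares the columns it finds against the columns it expects for each page. A minimal sketch of that kind of check is shown here. The helper `find_extra_columns` and the expected column names are hypothetical, chosen for illustration, and this is not PUDL's actual implementation.

```python
def find_extra_columns(found, expected):
    """Return the column names in `found` that are not in `expected`."""
    return set(found) - set(expected)


# Illustrative column names only (assumption); PUDL keeps its own column maps.
expected = ["utility_id_eia", "utility_name", "state"]
found = ["utility_id_eia", "utility_name", "state", "areacode"]

extra = find_extra_columns(found, expected)
if extra:
    print(f"Extra columns found in page utility: {extra}")
```

A message like this is usually a prompt to decide whether the new column should be mapped into an existing table or deliberately ignored.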


## Extract EIA-860m

In [12]:
if eia860_ytd:
    eia860m_raw_dfs = pudl.extract.eia860m.Extractor(ds).extract(
        year_month=pc.WORKING_PARTITIONS['eia860m']['year_month'])
    eia860_raw_dfs = pudl.extract.eia860m.append_eia860m(
        eia860_raw_dfs=eia860_raw_dfs, eia860m_raw_dfs=eia860m_raw_dfs)

Extracting eia860m spreadsheet data.


## Transform just the EIA-860

In [15]:
%%time
eia860_transformed_dfs = pudl.transform.eia860.transform(
    eia860_raw_dfs, eia860_tables=eia860_tables)

Transforming raw EIA 860 DataFrames for ownership_eia860 concatenated across all years.
Transforming raw EIA 860 DataFrames for generators_eia860 concatenated across all years.
Transforming raw EIA 860 DataFrames for plants_eia860 concatenated across all years.
Transforming raw EIA 860 DataFrames for boiler_generator_assn_eia860 concatenated across all years.
Transforming raw EIA 860 DataFrames for utilities_eia860 concatenated across all years.
CPU times: user 2min 10s, sys: 20.6 s, total: 2min 31s
Wall time: 2min 38s


# EIA-923

## Extract just the EIA-923

In [None]:
%%time
eia923_extractor = pudl.extract.eia923.Extractor(ds)
eia923_raw_dfs = eia923_extractor.extract(year=eia923_years)

## Transform just the EIA-923

In [None]:
%%time
eia923_transformed_dfs = pudl.transform.eia923.transform(
    eia923_raw_dfs, eia923_tables=eia923_tables)

# Combined EIA Data

## Merge the EIA-860 and EIA-923 DataFrame Dictionaries

In [None]:
%%time
eia_transformed_dfs = eia923_transformed_dfs.copy()
eia_transformed_dfs.update(eia860_transformed_dfs.copy())
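
Note that `dict.update` overwrites entries whose keys already exist, so if the two dictionaries ever shared a table name, the EIA-860 version would silently replace the EIA-923 one. The sketch below checks for that before merging. It uses toy DataFrames, and the table names are stand-ins for illustration rather than a statement of what the real dictionaries contain.

```python
import pandas as pd

# Toy stand-ins for the transformed dictionaries (table name -> DataFrame).
eia923_dfs = {"generation_eia923": pd.DataFrame({"plant_id_eia": [1, 2]})}
eia860_dfs = {"plants_eia860": pd.DataFrame({"plant_id_eia": [1, 2, 3]})}

# dict.update() overwrites matching keys, so check for collisions first.
overlap = set(eia923_dfs) & set(eia860_dfs)
if overlap:
    raise ValueError(f"Table names present in both dictionaries: {overlap}")

merged = eia923_dfs.copy()  # shallow copy: the DataFrames themselves are shared
merged.update(eia860_dfs)
print(sorted(merged))
```

The `.copy()` on a dictionary is shallow, so it protects the original dictionary from gaining new keys but does not duplicate the DataFrames.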

## Set all column data types

In [None]:
%%time
eia_transformed_dfs = pudl.helpers.convert_dfs_dict_dtypes(
    eia_transformed_dfs, 'eia')
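
To make the idea of setting data types across a dictionary of tables concrete, here is a rough sketch using plain pandas. The `apply_dtypes` helper and the `DTYPES` mapping are hypothetical names introduced for this example; they are assumptions and do not describe how `convert_dfs_dict_dtypes` works internally.

```python
import pandas as pd

# Hypothetical column -> dtype mapping (assumption); PUDL defines its own.
DTYPES = {"plant_id_eia": "Int64", "plant_name": "string", "capacity_mw": "float64"}


def apply_dtypes(dfs, dtypes):
    """Cast every column that has an entry in `dtypes`, leaving others alone."""
    out = {}
    for name, df in dfs.items():
        wanted = {col: dtypes[col] for col in df.columns if col in dtypes}
        out[name] = df.astype(wanted)
    return out


raw = {
    "plants": pd.DataFrame({"plant_id_eia": [1.0, None], "plant_name": ["A", "B"]}),
    "generators": pd.DataFrame({"plant_id_eia": [1.0, 2.0], "capacity_mw": [10, 3]}),
}
typed = apply_dtypes(raw, DTYPES)
print(typed["plants"].dtypes)
```

Using the nullable `Int64` type lets an ID column keep integer values even when some entries are missing, which a plain NumPy integer column cannot do.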

## Run the entity resolution process

In [None]:
entities_dfs, eia_transformed_dfs = pudl.transform.eia.transform(
    eia_transformed_dfs,
    eia860_years=eia860_years,
    eia923_years=eia923_years,
)
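
As a rough intuition for what entity resolution involves, the toy sketch below reconciles conflicting names reported for the same plant by keeping the most frequently reported value. This is a deliberate simplification under stated assumptions: the records are made up, `most_common` is a hypothetical helper, and this is not the algorithm that `pudl.transform.eia.transform` uses.

```python
import pandas as pd

# Made-up records for the same plants as they might appear across tables and years.
records = pd.DataFrame({
    "plant_id_eia": [1, 1, 1, 2, 2],
    "plant_name": ["Alpha", "Alpha", "ALPHA PLANT", "Beta", "Beta"],
})


def most_common(values):
    """Return the most frequently occurring non-null value."""
    return values.dropna().mode().iloc[0]


# One row per plant, holding the value reported most often for that plant.
plants_entity = (
    records.groupby("plant_id_eia")["plant_name"]
    .agg(most_common)
    .reset_index()
)
print(plants_entity)
```

The result has one row per `plant_id_eia`, similar in spirit to an entity table that holds one canonical record per entity.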