# Catalyst Cooperative Jupyter Notebook Template
This notebook lays out a standard format and some best practices for creating interactive or exploratory notebooks that can be shared easily between PUDL users and turned into reusable Python modules for integration into our underlying Python packages.

## Begin with a narrative outline
Each notebook should start with a brief Markdown explanation of the purpose of the analysis, outlining the stages or steps taken to accomplish it.
As the analysis develops, you can add new sections or details to this section.

## Notebooks should be runnable
Insofar as it's possible, another PUDL user who has cloned the repository that the notebook is part of should be able to update their `pudl-dev` conda environment, open the notebook, and run all cells successfully.
If there are required data or other prerequisites that the notebook cannot manage on its own -- like a file that needs to be downloaded by hand and placed in a particular location -- those steps should be laid out clearly at the beginning of the notebook.

## Avoid Troublesome Elements

### Don't hardcode passwords or access tokens
Most of our work is done in public GitHub repositories.
No authentication information should ever appear in a notebook.
These values can be stored in environment variables on your local computer.
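
As a minimal sketch of reading a secret from the environment, the helper below fails with a clear message when the variable is missing. The name `get_secret` and the variable names in the comments are hypothetical, chosen only for illustration:

```python
import os


def get_secret(name):
    """Return the value of environment variable `name`, or raise if it is unset."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(
            f"Set the {name} environment variable before running this notebook."
        )
    return value


# Hypothetical usage inside a notebook cell:
# API_KEY_EIA = get_secret("API_KEY_EIA")
```

Raising early keeps the failure close to its cause, instead of surfacing later as a confusing authentication error.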

### Don't hardcode values, especially late in the notebook
If the analysis depends on particular choices of input values, those should be called out explicitly at the beginning of the notebook.
(NB: We should explore ways to parameterize notebooks; [papermill](https://papermill.readthedocs.io/en/latest/) is one tool that does this.)
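
A parameters cell near the top of the notebook might look like the sketch below. The names and values are hypothetical placeholders. If papermill is used, it looks for a cell tagged `parameters` and injects overriding values in a new cell right after it:

```python
# Hypothetical analysis parameters, gathered in one place near the top of the notebook.
STATES = ["CO", "TX"]  # which US states to look at
START_YEAR = 2010      # first report year to include
END_YEAR = 2018        # last report year to include (inclusive)

# Derived values are computed from the parameters rather than typed in again later.
YEARS = list(range(START_YEAR, END_YEAR + 1))
```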

### Don't hardcode absolute file paths
For anyone else to be able to use the notebook, the files it uses need to be stored somewhere that makes sense on both your computer and theirs.
We assume that anyone using this template has the PUDL package installed, and has a local PUDL data management environment set up.
  * Input data that needs to be stored on disk and accessed via a shared location should be saved under `<PUDL_IN>/data/local/<data_source>/`.
  * Intermediate saved data products (e.g. a pickled result of a computationally intensive process) and results should be saved to a location relative to the notebook, and within the directory hierarchy of the repository that the notebook is part of.
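
The two conventions above can be sketched with `pathlib`. The helper `local_data_dir` and the `results` directory name are assumptions made for illustration; in a real notebook the `pudl_in` value would come from your PUDL settings:

```python
import pathlib


def local_data_dir(pudl_in, data_source):
    """Build <PUDL_IN>/data/local/<data_source>/ from the PUDL input directory."""
    return pathlib.Path(pudl_in) / "data" / "local" / data_source


# Intermediate products and results live next to the notebook, so this path is
# deliberately relative (hypothetical directory name).
RESULTS_DIR = pathlib.Path("results")
```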
  
### Don't require avoidable long-running computations
Consider persisting to disk the results of computations that take more than a few minutes, if the outputs are small enough to be checked into the repository and shared with other users.
Only require the expensive computation to be run if this pre-computed output is not available.
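
One way to sketch this pattern, assuming the result can be pickled. The name `load_or_compute` and the cache location are hypothetical and not part of PUDL:

```python
import pathlib
import pickle


def load_or_compute(cache_path, compute):
    """Load a cached result from cache_path, or call compute() and cache its output."""
    cache_path = pathlib.Path(cache_path)
    if cache_path.exists():
        # Pre-computed output is available, so skip the expensive step.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Deleting the cached file forces the computation to run again, which is useful when the inputs or the code have changed.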

### Don't litter
Don't leave lots of additional code lying around, even commented out, "just in case" you want to look at it again later.
Notebooks need to be relatively linear in the end, even though the thought processes and exploratory analyses that generate them may not be.
Once you have a working analysis, either prune those branches, or encapsulate them as options within functions.

### Don't load unnecessary libraries
Only import libraries which are required by the notebook, to avoid unnecessary dependencies.
If your analysis requires a new library that isn't yet part of the shared `pudl-dev` environment, add it to the `devtools/environment.yml` file so that others will get it when they update their environment.

## Related Resources:
Lots of these guidelines are taken directly from Emily Riederer's post: [RMarkdown Driven Development](https://emilyriederer.netlify.app/post/rmarkdown-driven-development/).
For a more in-depth explanation of the motivations behind this layout, do go check it out!

# Import Libraries
* Because you will very likely be editing the PUDL Python packages or your own local module while working in the notebook, use the `%autoreload` magic with autoreload level 2 so that any changes you make in those files are always reflected in the code running in the notebook.
* Put all import statements at the top of the notebook, so everyone can see what its dependencies are up front, and so that if an import is going to fail, it fails early, before the rest of the notebook is run.
* Try to avoid importing individual functions and classes from deep within packages. Later in the notebook it may not be clear to other users where those names came from, and importing at the package level also reduces the chance of namespace collisions.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Standard libraries
import logging
import os
import pathlib
import sys

# 3rd party libraries
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import seaborn as sns
import sqlalchemy as sa

# Local libraries
import pudl

# Configure Display Parameters

In [3]:
sns.set()
%matplotlib inline
mpl.rcParams['figure.figsize'] = (10,4)
mpl.rcParams['figure.dpi'] = 150
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

# Use Python Logging facilities
* Using a logger from the beginning will make the transition into the PUDL package easier.
* Creating a logging handler here will also allow you to see the logging output coming from PUDL and other underlying packages.

In [4]:
# logger=logging.getLogger()
# logger.setLevel(logging.INFO)
# handler = logging.StreamHandler(stream=sys.stdout)
# formatter = logging.Formatter('%(message)s')
# handler.setFormatter(formatter)
# logger.handlers = [handler]

# Define Functions
In many cases, the eventual product of a notebook analysis is going to be the creation of new, reusable functions that are integrated into the underlying PUDL code. You should begin the process of accumulating and organizing those functions as soon as you notice repeated patterns in your code.
* Functions should be used to encapsulate any potentially reusable code.
* Functions should be defined immediately after the imports, to avoid accidental dependence on zombie variables that are defined further down in the code.
* While they may evolve over time, having brief docstrings explaining what they do will help others understand them.
* If there's a particular type of plot or visualization that is made repeatedly in the notebook, it's good to parameterize and functionalize it here too, so that as you refine the presentation of the data and results, those improvements can be made in a single place, and shown uniformly throughout the notebook.
* As these functions mature and become more general purpose tools, you will probably want to start migrating them into their own local module, under a `src` directory in the same directory as the notebook. You will want to import this module 

## Dummy EIA 861 ETL

In [5]:
def test_etl_eia(eia_inputs, pudl_settings):
    """
    This is a dummy function that runs the first part of the EIA ETL
    process -- everything up until the entity harvesting begins. For
    use in this notebook only.

    """
    eia860_tables = eia_inputs["eia860_tables"]
    eia860_years = eia_inputs["eia860_years"]
    eia861_tables = eia_inputs["eia861_tables"]
    eia861_years = eia_inputs["eia861_years"]
    eia923_tables = eia_inputs["eia923_tables"]
    eia923_years = eia_inputs["eia923_years"]

    # generate CSVs for the static EIA tables, return the list of tables
    #static_tables = _load_static_tables_eia(datapkg_dir)

    # Extract EIA forms 860, 861, 923
    eia860_raw_dfs = pudl.extract.eia860.Extractor().extract(eia860_years, testing=True)
    eia861_raw_dfs = pudl.extract.eia861.Extractor().extract(eia861_years, testing=True)
    eia923_raw_dfs = pudl.extract.eia923.Extractor().extract(eia923_years, testing=True)

    # Transform EIA forms 860, 861, 923
    eia860_transformed_dfs = pudl.transform.eia860.transform(eia860_raw_dfs, eia860_tables=eia860_tables)
    eia861_transformed_dfs = pudl.transform.eia861.transform(eia861_raw_dfs, eia861_tables=eia861_tables)
    eia923_transformed_dfs = pudl.transform.eia923.transform(eia923_raw_dfs, eia923_tables=eia923_tables)

    # create an eia transformed dfs dictionary
    eia_transformed_dfs = eia860_transformed_dfs.copy()
    eia_transformed_dfs.update(eia861_transformed_dfs.copy())
    eia_transformed_dfs.update(eia923_transformed_dfs.copy())

    # convert types..
    eia_transformed_dfs = pudl.helpers.convert_dfs_dict_dtypes(eia_transformed_dfs, 'eia')

    return eia860_raw_dfs, eia861_raw_dfs, eia923_raw_dfs, eia_transformed_dfs

# Define Notebook Parameters
If there are overarching parameters which determine the nature of the analysis -- which US states to look at, which utilities are of interest, a particular start and end date -- state those clearly at the beginning of the analysis, so that they can be referred to by the rest of the notebook and easily changed if need be.
* If the analysis depends on local (non-PUDL managed) datasets, define the paths to those data here.
* If there are external URLs or other resource locations that will be accessed, define those here as well.
* This is also where you should create your `pudl_settings` dictionary and connections to your local PUDL databases.

In [6]:
EIA861_YEARS = list(range(2001, 2019))
pudl_settings = pudl.workspace.setup.get_defaults()
display(pudl_settings)

ferc1_engine = sa.create_engine(pudl_settings['ferc1_db'])
display(ferc1_engine)

pudl_engine = sa.create_engine(pudl_settings['pudl_db'])
display(pudl_engine)

pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine)


# Is there other external data you need to pull in?
# If so, put it in a (relatively) standard place on the filesystem.
my_new_data_url = "https://mynewdata.website.gov/path/to/new/data.csv"
my_new_datadir = pathlib.Path(pudl_settings["data_dir"]) / "local/new_data_source"

# Store API keys and other secrets in environment variables
# and read them in here, if needed:
# API_KEY_EIA = os.environ["API_KEY_EIA"]
# API_KEY_FRED = os.environ["API_KEY_FRED"]

{'pudl_in': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR',
 'data_dir': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/data',
 'settings_dir': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/settings',
 'pudl_out': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR',
 'sqlite_dir': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/sqlite',
 'parquet_dir': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/parquet',
 'datapkg_dir': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/datapkg',
 'notebook_dir': '/Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/notebook',
 'ferc1_db': 'sqlite:////Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/sqlite/ferc1.sqlite',
 'pudl_db': 'sqlite:////Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/sqlite/pudl.sqlite'}

Engine(sqlite:////Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/sqlite/ferc1.sqlite)

Engine(sqlite:////Users/aesharpe/Desktop/Work/Catalyst_Coop/PUDL_DIR/sqlite/pudl.sqlite)

# Load Data
* Given the above parameters and functions, it should now be possible to load the required data into local variables for further wrangling, analysis, and visualization.
* If the data is not yet present on the machine of the person running the notebook, this step should also acquire the data from its original source, and place it in the appropriate location under `<PUDL_IN>/data/local/`.
* If there are steps which have to be done manually here, put them first so that they fail first if the user hasn't read the instructions, and they can fix the situation before a bunch of other work gets done. Try to minimize the amount of work in the filesystem that has to be done manually though.
* If this process takes a little while, don't be shy about producing `logging` output.
* Using the `%%time` cell magic can also help users understand which pieces of work / data acquisition are slow.

## EIA 861 (2001-2018)
* Not yet fully integrated into PUDL
* Post-transform harvesting process isn't compatible w/ EIA 861 structure
* Only getting the `sales_eia861`, `balancing_authority_eia861`, and `service_territory_eia861` tables

### Already Transformed EIA 861 DataFrames

In [None]:
%%time
eia_inputs = {
    "eia860_years": [],
    "eia860_tables": pudl.constants.pudl_tables["eia860"],
    "eia861_years": EIA861_YEARS,
    "eia861_tables": pudl.constants.pudl_tables["eia861"],
    "eia923_years": [],
    "eia923_tables": pudl.constants.pudl_tables["eia923"],
}
eia860_raw_dfs, eia861_raw_dfs, eia923_raw_dfs, eia_transformed_dfs = test_etl_eia(eia_inputs=eia_inputs, pudl_settings=pudl_settings)



# Sanity Check Data
If there's any validation that can be done on the data which you've loaded to flag if/when it is inappropriate for the analysis that follows, do it here. If you find the data is unusable, use `assert` statements or `raise` Exceptions to stop the notebook from proceeding, and indicate what the problem is.
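
As a minimal sketch of such a check, the function below stops the notebook when expected tables are missing from a dictionary of DataFrames keyed by table name. The name `check_required_tables` is hypothetical:

```python
def check_required_tables(dfs, required):
    """Raise a ValueError naming any required tables that are absent from dfs."""
    missing = sorted(set(required) - set(dfs))
    if missing:
        raise ValueError(f"Missing required tables: {missing}")


# Hypothetical usage with the dictionaries loaded above:
# check_required_tables(eia_transformed_dfs, ["sales_eia861", "mergers_eia861"])
```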

In [34]:
list(eia861_raw_dfs.keys())

['advanced_metering_infrastructure_eia861',
 'balancing_authority_eia861',
 'demand_response_eia861',
 'demand_side_management_eia861',
 'distributed_generation_eia861',
 'distribution_systems_eia861',
 'dynamic_pricing_eia861',
 'frame_eia861',
 'green_pricing_eia861',
 'mergers_eia861',
 'net_metering_eia861',
 'non_net_metering_eia861',
 'operational_data_eia861',
 'reliability_eia861',
 'sales_eia861',
 'service_territory_eia861',
 'short_form_eia861',
 'utility_data_eia861']

In [35]:
list(eia_transformed_dfs.keys())

['service_territory_eia861',
 'balancing_authority_eia861',
 'sales_eia861',
 'advanced_metering_infrastructure_eia861',
 'demand_response_eia861',
 'distribution_systems_eia861',
 'dynamic_pricing_eia861',
 'green_pricing_eia861',
 'mergers_eia861',
 'balancing_authority_assn_eia861']

# Preliminary Data Wrangling
Once all of the data is loaded and looks like it's in good shape, do any initial wrangling that's specific to this particular analysis. This should mostly make use of the higher level functions which were defined above. If this step takes a while, don't be shy about producing `logging` outputs.

In [12]:
def zero_pad_zips(zip_series, N):
    """
    Zero-pad the ZIP codes in zip_series to N digits.

    Returns a nullable string Series; missing and all-zero codes become NA.
    """
    zip_series = (
        pd.to_numeric(zip_series)  # Make sure it's all numerical values
        .astype(pd.Int64Dtype())  # Make it a (nullable) Integer
        .fillna(0)  # fill the NA
        .astype(str).str.zfill(N)  # Make it a string and zero pad it.
        .astype(pd.StringDtype())  # Make it nullable
        .replace({N*"0": pd.NA})  # All-zero Zip codes aren't valid.
    )
    return zip_series

In [36]:
eia_transformed_dfs['mergers_eia861']

Unnamed: 0,utility_id_eia,utility_name_eia,state,report_date,entity_type,merge_address,merge_city,merge_company,merge_date,merge_state,merge_zip_4,merge_zip_5,new_parent
0,2409,Brownsville Public Utilities Board,TX,2007-01-01,Municipal,12567 FM Road 3430,Vernon,Acquired 7.81% Intrst of Oklaunion Plant,2007-02-14,TX,,76384,Oklaunion Power Station
1,2485,City of Buffalo,MN,2007-01-01,Municipal,,,Xcel,2005-12-16,,,,City of Buffalo
2,4410,Coral Power LLC,TX,2007-01-01,Retail Power Marketer,,,Avista Energy,2007-07-01,,,,"Coral Power, L.L.C."
3,5779,Elkhorn Public Service Co,WV,2007-01-01,Investor Owned,1 Riverside Plaza,Columbus,American Electric Power,2007-06-30,OH,,43215,American Electric Power
4,7601,Green Mountain Power Corp,VT,2007-01-01,Investor Owned,85 Swift Street,South Burlington,Northern New England Energy Corp,2007-04-12,VT,,05403,Northern New England Energy Corp
...,...,...,...,...,...,...,...,...,...,...,...,...,...
183,54913,NSTAR Electric Company,,2018-01-01,,247 Station Drive,Westwood,Western Massachusetts Electric Company,2017-12-31,MA,,02090,NSTAR Electric Company
184,56286,"Frontier Utilities, Inc.",,2018-01-01,,20455 Tx Hwy 249,Houston,"Nextera Energy Services, LLC",2019-04-01,TX,,77077,"Nextera Energy Services, LLC"
185,58263,V247 Power Corporation,,2018-01-01,,4475 Trinity Mills #700127,Dallas,OnPAC Energy,2018-04-20,TX,,75370,Pegasus Alliance Corporation
186,59385,"Dynegy Energy Services, LLC",,2018-01-01,,6555 Sierra Drive,Irving,Vistra Energy,2018-04-09,TX,,75039,Vistra Energy


In [30]:
mer = eia861_raw_dfs['mergers_eia861']
fra = eia861_raw_dfs['frame_eia861']
merg = pd.merge(mer,fra, on=['utility_id_eia'], how='outer', suffixes=('_mer', '_fra'))
merg = merg.loc[(merg['ownership_mer'].notnull()) & (merg['ownership_code'].notnull())]
merg['compare_codes'] = merg['ownership_mer'] == merg['ownership_code']
merg.loc[merg['compare_codes']==False]

Unnamed: 0,merge_address,merge_city,merge_company,merge_date,merge_state,merge_zip_4,merge_zip_5,new_parent,ownership_mer,report_year_mer,state,utility_id_eia,utility_name_eia_mer,advanced_metering,demand_response,distribution_systems,dynamic_pricing,energy_efficiency,mergers,monthly,net_metering,non_net_metering_distributed,operational_data,ownership_fra,ownership_code,reliability,report_year_fra,sales_to_ultimate_customers,service_territory,short_form,utility_data,utility_name_eia_fra,compare_codes
206,1900 North Akard Street,Dallas,Cap Rock Energy,2010-07-13 00:00:00,TX,,75201.0,"Sharyland Utilities, L.P.",I,2010.0,TX,17008,Sharyland Utilities LP,,,,,,,,,,X,Transmission,T,,2018.0,,,,X,Sharyland Utilities LP,False
228,2000 Westchester Avenue,Purchase,TGX LP,2011-01-01 00:00:00,NY,,10577.0,MS TGX LLC,W,2011.0,NY,12917,Morgan Stanley Capital Grp Inc,,,,,,,X,,,X,Retail Power Marketer,R,,2018.0,X,,,X,Morgan Stanley Capital Grp Inc,False


In [89]:
fuzzy_df = fuzzy_match(df, 'utility_name_eia_base', 'new_parent')
fd = fuzzy_df.loc[fuzzy_df['fuzzy']<100].reset_index()
fd.iloc[10:20]

Unnamed: 0,utility_id_eia,state,report_date_base,utility_id_pudl,utility_name_eia_base,city,entity_type,plants_reported_asset_manager,plants_reported_operator,plants_reported_other_relationship,plants_reported_owner,street_address,zip_code,merge_address,merge_city,merge_company,merge_date,merge_state,merge_zip_4,merge_zip_5,new_parent,ownership_code,utility_name_eia_861,report_date_861,fuzzy
10,10171,KY,2015-01-01,163,Kentucky Utilities Co,Lexington,I,,True,,True,One Quality Street,40507,220 West Main Street,Louisville,Kentucky Utilities,2010-11-01,KY,,40202,PPL Corporation,I,Kentucky Utilities Co,2010-01-01,22
11,10171,KY,2014-01-01,163,Kentucky Utilities Co,Lexington,I,,True,,True,One Quality Street,40507,220 West Main Street,Louisville,Kentucky Utilities,2010-11-01,KY,,40202,PPL Corporation,I,Kentucky Utilities Co,2010-01-01,22
12,10171,KY,2013-01-01,163,Kentucky Utilities Co,Lexington,I,True,True,True,True,One Quality Street,40507,220 West Main Street,Louisville,Kentucky Utilities,2010-11-01,KY,,40202,PPL Corporation,I,Kentucky Utilities Co,2010-01-01,22
13,10171,KY,2012-01-01,163,Kentucky Utilities Co,Lexington,I,,,,,220 West Main Street,40507,220 West Main Street,Louisville,Kentucky Utilities,2010-11-01,KY,,40202,PPL Corporation,I,Kentucky Utilities Co,2010-01-01,22
14,10171,KY,2011-01-01,163,Kentucky Utilities Co,Lexington,I,,,,,One Quality Street,40507,220 West Main Street,Louisville,Kentucky Utilities,2010-11-01,KY,,40202,PPL Corporation,I,Kentucky Utilities Co,2010-01-01,22
15,13511,NY,2013-01-01,213,New York State Elec & Gas Corp,Ithaca,I,True,True,True,True,P O Box 3287Corporate Acctg &,14852,70 Farm View Drive,New Gloucester,"Iberdrola, S.A.",2008-09-16,ME,5117.0,4260,"Iberdrola, S.A.",I,New York State Elec & Gas Corp,2008-01-01,22
16,13511,NY,2017-01-01,213,New York State Elec & Gas Corp,Ithaca,I,,,,True,P O Box 3287Corporate Acctg &,14852,70 Farm View Drive,New Gloucester,"Iberdrola, S.A.",2008-09-16,ME,5117.0,4260,"Iberdrola, S.A.",I,New York State Elec & Gas Corp,2008-01-01,22
17,13511,NY,2015-01-01,213,New York State Elec & Gas Corp,Ithaca,I,True,True,,True,P O Box 3287Corporate Acctg &,14852,70 Farm View Drive,New Gloucester,"Iberdrola, S.A.",2008-09-16,ME,5117.0,4260,"Iberdrola, S.A.",I,New York State Elec & Gas Corp,2008-01-01,22
18,13511,NY,2014-01-01,213,New York State Elec & Gas Corp,Ithaca,I,True,True,,True,P O Box 3287Corporate Acctg &,14852,70 Farm View Drive,New Gloucester,"Iberdrola, S.A.",2008-09-16,ME,5117.0,4260,"Iberdrola, S.A.",I,New York State Elec & Gas Corp,2008-01-01,22
19,10171,KY,2017-01-01,163,Kentucky Utilities Co,Lexington,I,,,,True,One Quality Street,40507,220 West Main Street,Louisville,Kentucky Utilities,2010-11-01,KY,,40202,PPL Corporation,I,Kentucky Utilities Co,2010-01-01,22


In [131]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

In [172]:
df = eia_transformed_dfs['advanced_metering_infrastructure_eia861'].copy()
df.columns.tolist()
df.sort_values('energy_served_ami_mwh', ascending=False)
df = df.drop(index=46605)

In [181]:
#sns.jointplot("energy_served_ami_mwh", "advanced_metering_infrastructure", data=df, kind='reg');

In [180]:
#plt.hist(df['energy_served_ami_mwh'], normed=True, alpha=0.5)
#sns.distplot(df['energy_served_ami_mwh'])
#plt.boxplot(df)
#df[['energy_served_ami_mwh']].boxplot()

In [179]:
#df.sort_values('energy_served_ami_mwh',ascending=False)
df_s = (
    df.loc[df['utility_id_eia']==3046]
    .set_index(['utility_id_eia','utility_name_eia','state','balancing_authority_code_eia','report_date','customer_class'])
    [['advanced_metering_infrastructure']]  #energy_served_ami_mwh
    .unstack()
)


In [178]:
#df['energy_served_ami_mwh'].hist(bins=100)

## Totals Analysis

In [41]:
df = compare_totals(df, idx_cols, class_cols)
df.loc[df['tot_comp']==True]

Unnamed: 0,utility_id_eia,state,report_date,customer_attributes,residential,commercial,industrial,transportation,other,total,sum,tot_comp


In [13]:
import math

# df should be a dataframe with a 'customer_class' column, one of whose values is "total"
#df = tidy_ami

idx_cols = [
    'utility_id_eia',
    'state',
    #'balancing_authority_code_eia',
    'report_date'
]

class_cols = [
     'customers',
     'rec_revenues',
     'rec_sales_mwh',
     'revenues',
     'sales_mwh'
]

# NOTE: 'level_4' and unstack(level=3) below assume that idx_cols has three
# entries; edit those level numbers if the number of index columns changes.
def compare_totals(df, idx_cols, class_cols):
    # Make long df and then unstack customer classes as columns
    piv_df = (
        df.reset_index()
        .set_index(idx_cols+['customer_class'])[class_cols]
        .stack()
        .reset_index()
        .rename(columns={'level_4':'customer_attributes', 0:'value'})
        .drop_duplicates()
        .set_index(idx_cols+['customer_class', 'customer_attributes'])
        .unstack(level=3)
        .fillna(0)
    )
    
    # Get rid of categorical index
    piv_df.columns = [x[1] for x in piv_df.columns.tolist()]
    piv_df = piv_df.reset_index()
    
    # Add sum column
    piv_df['sum'] = (
        piv_df['commercial']
        + piv_df['industrial']
        + piv_df['residential']
        + piv_df['transportation']
    )
    
    # Add comparison bool col
    piv_df['tot_comp'] = (
        abs(piv_df['sum'] - piv_df['total']) < 1
    )
    
    # Show spots where totals don't match up
    false_df = piv_df.loc[piv_df['tot_comp']==False]
    
    return false_df

In [17]:
# # Define base comparison df 
# eia_base_df = (
#     pudl_out.utils_eia860()
#     #[['report_date', 'utility_id_eia', 'utility_id_pudl', 'utility_name_eia', 'state']]
#     .assign(utility_id_eia=lambda x: x.utility_id_eia.astype('Int64'))
# )
# # Shorten df
# #df = df[['utility_id_eia', 'state', 'utility_name_eia', 'report_date']]

# # Merge dfs together 
# df_merge = (
#     pd.merge(eia_base_df, df, on=['utility_id_eia', 'state', 'report_date'], how='outer', suffixes=('_base', '_861'))
#     .set_index(['utility_id_eia', 'report_date'])
#     .assign(utility_name_eia_base=lambda x: x.utility_name_eia_base.astype('string'),
#             utility_name_eia_861=lambda x: x.utility_name_eia_861.astype('string'))
# )
# df_merge

In [67]:
eia_base_df.loc[eia_base_df['utility_id_eia']==2409]

Unnamed: 0,report_date,utility_id_eia,utility_id_pudl,utility_name_eia,city,entity_type,plants_reported_asset_manager,plants_reported_operator,plants_reported_other_relationship,plants_reported_owner,state,street_address,zip_code
1371,2018-01-01,2409,716,Brownsville Public Utilities Board,Brownsville,M,,,,True,TX,P.O. Box 3270,78523
1372,2017-01-01,2409,716,Brownsville Public Utilities Board,Brownsville,M,,,,True,TX,P.O. Box 3270,78523
1373,2016-01-01,2409,716,Brownsville Public Utilities Board,Brownsville,M,,,,True,TX,P.O. Box 3270,78523
1374,2015-01-01,2409,716,Brownsville Public Utilities Board,Brownsville,M,,True,,True,TX,P.O. Box 3270,78523
1375,2014-01-01,2409,716,Brownsville Public Utilities Board,Brownsville,M,,True,,True,TX,P.O. Box 3270,78523
1376,2013-01-01,2409,716,Brownsville Public Utilities Board,Brownsville,M,True,True,True,True,TX,P.O. Box 3270,78523
1377,2012-01-01,2409,716,Brownsville Public Utilities Board,Brownsville,M,,,,,TX,,78523
1378,2011-01-01,2409,716,Brownsville Public Utilities Board,Brownsville,M,,,,,TX,P.O. Box 3270,78523


## Fuzzy Match Columns

In [85]:
from fuzzywuzzy import fuzz

def fuzzy_match(df, col1, col2):
    # Define base comparison df 
    eia_base_df = (
        pudl_out.utils_eia860()
        #[['report_date', 'utility_id_eia', 'utility_id_pudl', 'utility_name_eia', 'state']]
        .assign(utility_id_eia=lambda x: x.utility_id_eia.astype('Int64'))
    )
    # Shorten df
    #df = df[['utility_id_eia', 'state', 'utility_name_eia', 'report_date']]
    
    # Merge dfs together 
    df_merge = (
        pd.merge(eia_base_df, df, on=['utility_id_eia', 'state'], how='outer', suffixes=('_base', '_861'))
        .set_index(['utility_id_eia', 'state'])
        .assign(utility_name_eia_base=lambda x: x.utility_name_eia_base.astype('string'),
                utility_name_eia_861=lambda x: x.utility_name_eia_861.astype('string'))
    )
    
    # Only run where both columns for utility name have values
    df_merge_no_null = (
        df_merge.loc[
            (df_merge[col1].notnull()) 
            & (df_merge[col2].notnull())].copy()
    )
    
    # Fuzzy match
    df_merge_no_null['fuzzy'] = (
        # Compare the two columns passed in, rather than hardcoded column names
        df_merge_no_null.apply(lambda x: fuzz.ratio(x[col1], x[col2]), axis=1)
    )
#     df_merge_no_null = (
#         df_merge_no_null.assign(fuzzy=lambda x: fuzz.ratio(x[col1], x[col2])))
    
    #df_merge_no_null.loc[:,'fuzzy'] = fuzzy_series
    df_merge_no_null = df_merge_no_null.sort_values('fuzzy')
    
    return df_merge_no_null

In [56]:
# This one used to work but now pivot_table takes WAY too long.
def compare_totals_loong(df, idx_cols, class_cols):
    no_tot = df.loc[df['customer_class']!='total']
    tot = df.loc[df['customer_class']=='total']
    
    pivot_df = no_tot.pivot_table(class_cols, idx_cols, 'customer_class')
    
    sum_df = no_tot.groupby(idx_cols).sum().reset_index()
    combo_df = (
        pd.merge(sum_df, tot, on=idx_cols, how='outer', suffixes=['_sum', '_tot'])
        #.fillna(0)
    ).set_index(idx_cols)
    
    # make bool col for whether sum is the same as total (in this case can be off by one)
    for col in class_cols:
        combo_df[col+'_bool'] = (
            abs(combo_df[col+'_sum'] - combo_df[col+'_tot']) <=1
        )
    
    combo_df = combo_df.reset_index()
    bad_idx_list = []
    
    # make a list of the indexes where the comparison is False (meaning sum is different from total)
    for col in combo_df[combo_df.filter(regex='bool').columns]:
        if len(combo_df[col].unique().tolist()) > 1:
            bad_idxs = combo_df.index[combo_df[col]==False].tolist()
            # Accumulate in place; a bare `bad_idx_list + bad_idxs` discards the result
            bad_idx_list += bad_idxs
    bad_idx_list = list(set(bad_idx_list))
    
    bad_df = combo_df.iloc[bad_idx_list]
    
    return bad_df

# DSM Stuff

In [37]:
dsm = eia861_raw_dfs['demand_side_management_eia861'].copy()
#dsm.columns.tolist()

In [50]:
idx_cols = ['utility_id_eia', 'state', 'ba_code', 'report_year']

dsm_test = dsm[['utility_id_eia', 'state', 'ba_code', 'report_year', 
                'residential_annual_load_management_energy_effects ', 
                'commercial_annual_load_management_energy_effects ', 
                'industrial_annual_load_management_energy_effects ',
                'transportation_annual_load_management_energy_effects ',
                'total_annual_load_management_energy_effects ',
]]

dr_test = raw_dr_eia861[['utility_id_eia', 'state', 'balancing_authority_code_eia', 'report_year',
                         'residential_energy_savings_mwh',
                         'commercial_energy_savings_mwh',
                         'industrial_energy_savings_mwh',
                         'transportation_energy_savings_mwh',
                         'total_energy_savings_mwh'
]].rename(columns={'balancing_authority_code_eia': 'ba_code'})

In [84]:
merge = pd.merge(dsm_test, dr_test, on=idx_cols, how='outer')
merge = pudl.helpers.oob_to_nan(merge, 
                       ['residential_annual_load_management_energy_effects ', 
                        'commercial_annual_load_management_energy_effects ', 
                        'industrial_annual_load_management_energy_effects ',
                        'transportation_annual_load_management_energy_effects ',
                        'residential_energy_savings_mwh',
                        'commercial_energy_savings_mwh',
                        'industrial_energy_savings_mwh',
                        'transportation_energy_savings_mwh',
                        'total_energy_savings_mwh',
                        'total_annual_load_management_energy_effects '])

In [71]:
dr_col = 'residential_energy_savings_mwh'
dsm_col = 'residential_annual_load_management_energy_effects '

In [36]:
merge = merge.loc[merge['report_year'].isin(range(2012,2014))]
merge = merge.loc[merge[dr_col].notna()]
#merge = merge.loc[merge[dsm_col].notna()]


NameError: name 'merge' is not defined

In [34]:
#test.plot.scatter('residential_annual_load_management_energy_effects ', 'residential_energy_savings_mwh')

In [33]:
#raw_dr_eia861.columns.tolist()

# Data Analysis and Visualization
* Now that you've got the required data in a usable form, you can tell the story of your analysis through a mix of visualizations, and further data wrangling steps.
* This narrative should be readable, with figures that have titles, legends, and labeled axes as appropriate so others can understand what you're showing them.
* The code should be concise and make use of the parameters and functions which you've defined above when possible. Functions should contain comprehensible chunks of work that make sense as one step in the story of the analysis.
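
A small plotting helper in this spirit might look like the following sketch. The function name, its arguments, and the axis label are hypothetical, and it assumes matplotlib is available as in the imports above:

```python
import matplotlib.pyplot as plt


def plot_annual_series(years, values, title, ylabel, label):
    """Plot one annual series with a title, labeled axes, and a legend."""
    fig, ax = plt.subplots()
    ax.plot(years, values, marker="o", label=label)
    ax.set_title(title)
    ax.set_xlabel("Report year")
    ax.set_ylabel(ylabel)
    ax.legend()
    return fig, ax
```

Returning `fig` and `ax` lets a later cell refine or save the figure without repeating the plotting code.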