## Playing with PUDL
This notebook is meant to help get you up and running with the PUDL database, so you can play with it!

### Importing external code
We need to import a bunch of outside code to do our work here.  Sometimes we import entire packages (like `numpy` and `pandas`) and sometimes we just pull in a couple of pieces we need from a particular part of a large package (like `declarative_base`).

In [78]:
import sys
import os.path
import numpy as np
import pandas as pd
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.engine.url import URL
from sqlalchemy import Integer, String, Numeric
from pudl.ferc1 import db_connect_ferc1, cleanstrings, ferc1_meta
from sqlalchemy.sql import select

### Importing our own code
We also need to tell Python where to look to find our own code.  It has a list of directories that it looks in, but our little project isn't in that list, unless we add it -- which is what `sys.path.append()` does.  You'll need to change this path to reflect where on your computer the PUDL project folder (which you pull down with `git`) lives.

Once Python knows to look in the `pudl` project folder, it will let you import `pudl` modules just like any other Python module.  Here we're pulling in several modules (`ferc1`, `pudl`, `models`, `models_ferc1`, `settings`, and `constants`) from the `pudl` package (which is a directory inside the `pudl` project directory).

In [2]:
sys.path.append('/Users/christinagosnell/code/pudl')
sys.path.append('/Users/zaneselvans/code/catalyst/pudl')
sys.path.append('/Users/Nannerz/Desktop/working/pudl/')
sys.path.append('/home/alana/Dropbox/catalyst/pudl')
from pudl import ferc1, pudl, models, models_ferc1, settings, constants
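
Because every collaborator keeps the project folder somewhere different, one option is to append a path only when it actually exists and is not already listed. The following is a minimal sketch of that idea: the helper name `add_project_path` is hypothetical (it is not part of PUDL), and a temporary directory with a tiny module stands in for a real project checkout.

```python
import os
import sys
import tempfile


def add_project_path(path):
    """Append path to sys.path if it is an existing directory not already listed.

    Hypothetical helper for illustration; it is not part of PUDL.
    Returns True if the path was added, False otherwise.
    """
    path = os.path.abspath(os.path.expanduser(path))
    if os.path.isdir(path) and path not in sys.path:
        sys.path.append(path)
        return True
    return False


# Stand-in for a project folder: a temporary directory holding one tiny module.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "demo_pudl_module.py"), "w") as f:
    f.write("GREETING = 'hello from the project folder'\n")

added_first = add_project_path(demo_dir)   # directory exists, so it is added
added_again = add_project_path(demo_dir)   # already listed, so nothing happens
added_missing = add_project_path(os.path.join(demo_dir, "does_not_exist"))

import demo_pudl_module  # importable now that its folder is on sys.path
print(demo_pudl_module.GREETING)
```

With a guard like this, paths that belong to other people's machines are skipped instead of piling up in `sys.path`.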

### Automatically reloading a work in progress
Because you're probably going to be editing the Python modules related to PUDL while you're working with this notebook, it's useful to have them get automatically reloaded before every cell is executed -- this means you're always using the freshest version of the module, with all your recent edits.

In [3]:
%load_ext autoreload
%autoreload 1
%aimport pudl.pudl
%aimport pudl.ferc1
%aimport pudl.constants
%aimport pudl.settings
%aimport pudl.models
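
To see why reloading matters, here is a simplified manual equivalent using the standard library's `importlib.reload`. It is only a sketch: the module `demo_reload_module` is a throwaway file created for the demonstration, and the real `autoreload` extension does more bookkeeping than this.

```python
import importlib
import os
import sys
import tempfile

# Avoid writing .pyc files so the edited source is always re-read in this demo.
sys.dont_write_bytecode = True

# A throwaway module standing in for something like pudl.constants.
demo_dir = tempfile.mkdtemp()
module_path = os.path.join(demo_dir, "demo_reload_module.py")
with open(module_path, "w") as f:
    f.write("VALUE = 1\n")

sys.path.append(demo_dir)
import demo_reload_module

before = demo_reload_module.VALUE

# Simulate editing the module on disk while the notebook is still running.
with open(module_path, "w") as f:
    f.write("VALUE = 2  # edited\n")

# Without a reload, the already-imported module still holds the old value.
stale = demo_reload_module.VALUE

# importlib.reload re-executes the module's source in place.
importlib.reload(demo_reload_module)
after = demo_reload_module.VALUE

print(before, stale, after)
```

The `stale` value is the behavior you would get in a notebook without reloading: edits on disk are invisible until the module is re-executed.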

### Connecting to our databases
We have two different databases that we're working with right now: the FERC Form 1 database (`ferc1`) and our own PUDL database (`pudl`). For this software to work, you'll need to have the PostgreSQL database server running on your computer, and you'll need to have created empty databases to receive the tables and data we're going to create.  On a Mac, the easiest Postgres install to get running is probably Postgres.app.  You'll need to fire it up at the command line at least once to create the databases (one called `ferc1` and another called `pudl_sandbox`) and a user named `catalyst` with no password.  This information is stored in the `settings` module if you need to look it up.

Here are two shortcuts for connecting to the two databases once they're created:

In [4]:
pudl_engine  = pudl.pudl.db_connect_pudl()
ferc1_engine = pudl.ferc1.db_connect_ferc1()
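
For orientation, the connection details mentioned above (databases `ferc1` and `pudl_sandbox`, user `catalyst`, no password) can be written as PostgreSQL connection URLs of the kind SQLAlchemy accepts. This is a sketch only: `build_postgres_url` is a hypothetical helper, and the host `localhost` and port `5432` are assumptions rather than values taken from the `settings` module.

```python
from urllib.parse import quote_plus


def build_postgres_url(database, username="catalyst", password=None,
                       host="localhost", port=5432):
    """Assemble a PostgreSQL connection URL string.

    Hypothetical helper for illustration; host and port are assumed defaults.
    """
    credentials = username
    if password:
        # Percent-encode the password so special characters survive in a URL.
        credentials = f"{username}:{quote_plus(password)}"
    return f"postgresql://{credentials}@{host}:{port}/{database}"


ferc1_url = build_postgres_url("ferc1")
pudl_url = build_postgres_url("pudl_sandbox")
print(ferc1_url)
print(pudl_url)
```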

### Initializing the FERC Form 1 database
Now that you've got an empty database, let's put some data in it!  This function initializes the database by reading in the FERC Form 1 database structure from `refyear` and data from `years` (which can eventually be a list of years, but that's not working yet...). In order for this to work, you need to have the FERC Form 1 data downloaded into the data directory. There's a script called `get_ferc1.sh` down in `data/ferc/form1/` that will get it for you if you don't have it.

In [5]:
pudl.ferc1.init_db(refyear=2015, years=[2015,], ferc1_tables=pudl.constants.ferc1_default_tables)

Defining new FERC Form 1 DB based on 2015...
Clearing any existing FERC Form 1 database MetaData...
Ingesting FERC Form 1 Data from 2015...


### Initializing the PUDL database
We can also initialize the PUDL database now. Because we're messing around with it a lot, and playing with re-importing data over and over again, it's not a bad idea to destroy whatever old version of it exists in postgres (with `drop_tables_pudl`) and then re-create the tables anew (with `create_tables_pudl`).

Then we can actually initialize some of the data (mostly glue tables connecting plants to utilities, and a few lists of constants like the US States for now) using `pudl.pudl.init_db()`. Eventually we'll need to tell this where to pull data from... but since we're not really putting any meaningful data into it yet, the function takes no arguments.

In [8]:
pudl.pudl.drop_tables_pudl(pudl_engine)
pudl.pudl.create_tables_pudl(pudl_engine)
pudl.pudl.init_db()

Ingesting static PUDL tables...
Sniffing EIA923/FERC1 glue tables...
Ingesting f1_fuel from FERC Form 1 into PUDL.
Ingesting f1_steam from FERC Form 1 into PUDL.
Ingesting f1_plant_in_srvce from FERC Form 1 into PUDL.
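
The drop-then-recreate pattern described above is what makes it safe to re-run the import over and over. Here is a minimal sketch of the same idea, using an in-memory SQLite database from the standard library as a stand-in for Postgres. The table `us_states` and the helper `reset_states_table` are made up for illustration and are not the PUDL schema or API.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL server.
conn = sqlite3.connect(":memory:")


def reset_states_table(conn):
    """Drop the demo table if present, recreate it empty, then load static rows."""
    conn.execute("DROP TABLE IF EXISTS us_states")
    conn.execute("CREATE TABLE us_states (abbr TEXT PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO us_states (abbr, name) VALUES (?, ?)",
        [("CO", "Colorado"), ("MT", "Montana")],
    )
    conn.commit()


# Running the reset twice leaves the same two rows rather than duplicates
# (or a primary-key error), because the table is rebuilt from scratch each time.
reset_states_table(conn)
reset_states_table(conn)
row_count = conn.execute("SELECT COUNT(*) FROM us_states").fetchone()[0]
print(row_count)
```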


### Pulling data out of the database!
Now we're ready to pull some data out of one of the databases, just to show that it works. `pd.read_sql()` takes an SQL query and a database connection, and puts the results of the query into a pandas DataFrame you can play with easily.

In [9]:
ferc1_fuel = pd.read_sql('''SELECT respondent_id, report_year, plant_name, fuel, fuel_unit, fuel_quantity,\
                                fuel_avg_heat, fuel_cost_delvd, fuel_cost_burned, fuel_cost_btu, fuel_cost_kwh,\
                                fuel_generaton \
                                FROM f1_fuel WHERE plant_name <> '' AND fuel NOT IN ('Total','') ''',ferc1_engine)

In [10]:
ferc1_fuel_pudl = pd.read_sql('''SELECT respondent_id, report_year, plant_name, fuel, fuel_unit, fuel_qty_burned,\
                                fuel_avg_mmbtu_per_unit, fuel_cost_per_unit_delivered, fuel_cost_per_unit_burned, fuel_cost_per_mmbtu, fuel_cost_per_kwh,\
                                fuel_mmbtu_per_kwh \
                                FROM fuel_ferc1 WHERE plant_name <> '' AND fuel NOT IN ('Total','') ''',pudl_engine)
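
Both queries share the same `WHERE` clause, which discards rows with a blank plant name as well as rows whose fuel is blank or is the `Total` summary line. The sketch below shows which rows survive that filter. It uses an in-memory SQLite database from the standard library instead of Postgres and pandas, and the rows are made-up illustrative data, not real FERC Form 1 records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f1_fuel (plant_name TEXT, fuel TEXT, fuel_quantity REAL)")
conn.executemany(
    "INSERT INTO f1_fuel VALUES (?, ?, ?)",
    [
        ("Plant A", "Gas", 100.0),    # kept
        ("Plant A", "Total", 100.0),  # dropped: summary row
        ("", "Oil", 5.0),             # dropped: blank plant name
        ("Plant B", "", 0.0),         # dropped: blank fuel
        ("Plant B", "Coal", 42.0),    # kept
    ],
)

# Same filter as the notebook queries, applied to the toy table.
query = """
    SELECT plant_name, fuel, fuel_quantity
    FROM f1_fuel
    WHERE plant_name <> '' AND fuel NOT IN ('Total', '')
    ORDER BY plant_name
"""
kept_rows = conn.execute(query).fetchall()
print(kept_rows)
```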

### Examining the data we pulled
The `sample()` DataFrame method returns a random sample of records from the DataFrame, which is useful for seeing what kinds of things are in there, without always seeing just the first few records.

In [11]:
ferc1_fuel.sample(5)

Unnamed: 0,respondent_id,report_year,plant_name,fuel,fuel_unit,fuel_quantity,fuel_avg_heat,fuel_cost_delvd,fuel_cost_burned,fuel_cost_btu,fuel_cost_kwh,fuel_generaton
382,166,2015,Plant X,Gas,Mcf,8625289.0,1041.0,3.18,3.19,3.06,0.0,0.0
62,7,2015,Saguaro 1,Gas,MCF,22044.0,1065324.0,2.008,3.724,3.496,0.097,0.0
1360,45,2015,Riverbend,Oil,Barrels,0.0,0.0,0.0,0.0,0.0,0.0,0.0
753,70,2015,Boardman,Oil,Barrels,1083.0,138800.0,76.795,105.956,18.182,0.0,0.0
985,87,2015,Little Gypsy,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
ferc1_fuel_pudl.sample(5)

Unnamed: 0,respondent_id,report_year,plant_name,fuel,fuel_unit,fuel_qty_burned,fuel_avg_mmbtu_per_unit,fuel_cost_per_unit_delivered,fuel_cost_per_unit_burned,fuel_cost_per_mmbtu,fuel_cost_per_kwh,fuel_mmbtu_per_kwh
576,159,2015,Hardeeville Peaking,oil,bbl,55.0,138142.0,0.0,118.667,20.453,1.085,0.0
945,51,2015,Energy Center,oil,bbl,6.0,132295.0,114.6,114.6,20.625,0.313,15182.283
83,88,2015,Paddy's Run CT,gas,mcf,1029359.0,1012.0,3.524,3.524,3.483,0.038,10899.0
574,159,2015,Hagood Combined,gas,mcf,280102.0,1031.0,3.863,3.863,3.748,0.045,0.0
174,210,2015,Walter Scott #2,gas,mcf,18205.0,1025048.0,6.285,6.285,6.131,0.0,0.0


In [6]:
pudl.pudl.init_db(ferc1_tables=['f1_fuel', 'f1_steam', 'f1_hydro', 'f1_plant_in_srvce',
                                'f1_accumdepr_prvsn', 'f1_purchased_pwr'],
                  ferc1_years=[2015],
                  eia923_tables=[], eia923_years=[2014, 2015, 2016],
                  debug=True)

Ingesting static PUDL tables...
Sniffing EIA923/FERC1 glue tables...
Ingesting f1_fuel from FERC Form 1 into PUDL.
Ingesting f1_steam from FERC Form 1 into PUDL.
Ingesting f1_hydro from FERC Form 1 into PUDL.
Ingesting f1_plant_in_srvce from FERC Form 1 into PUDL.
Ingesting f1_purchased_pwr from FERC Form 1 into PUDL.
Ingesting f1_accumdepr_prvsn from FERC Form 1 into PUDL.


In [59]:
pudl_power = pd.read_sql('''SELECT *
                                FROM purchased_power_ferc1''',pudl_engine)
pudl_power.replace(to_replace='', value=np.nan,inplace=True)
pudl_power.dropna(subset=['statistical_classification','rate_schedule_tariff_number'], inplace=True)
pudl_power.sample(40)

[autoreload of pudl.pudl failed: Traceback (most recent call last):
  File "/Users/Nannerz/anaconda3/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 247, in check
    superreload(m, reload, self.old_objects)
ImportError: cannot import name 'month_dict_2015_eia923'
]


Unnamed: 0,id,respondent_id,report_year,authority_company_name,statistical_classification,rate_schedule_tariff_number,average_billing_demand,average_monthly_ncp_demand,average_monthly_cp_demand,mwh_purchased,mwh_received,mwh_delivered,demand_charges,energy_charges,other_charges,settlement_total
2932,2933,17,2015,CPI Roxboro,LU,(1),,,,281716.0,0.0,0.0,0.0,22749269.0,0.0,22749269.0
672,672,88,2015,"East Kentucky Power Cooperative, Inc.",IU,(9),,,,5572.0,0.0,0.0,1628301.0,0.0,0.0,1628301.0
1330,1331,176,2015,Public Service Company of Colorado,SF,WSPP,,0.0,0.0,4168.0,0.0,0.0,0.0,110466.0,0.0,110466.0
3914,3915,70,2015,Curry Cattle Company,LU,-,.084Mw,,,744.0,0.0,0.0,26796.0,30741.0,0.0,57537.0
850,851,101,2015,Catalyst New Martinsville Hydro El. Co,LU,Special Contract,,,,112858.0,0.0,0.0,4175750.0,8417564.0,0.0,12593314.0
4814,4814,157,2015,Grant County Public Utility District,SF,Schedule No. 42,,,,21.0,0.0,0.0,0.0,380.0,0.0,380.0
715,715,141,2015,Exelon Generation Co.,SF,WSPP-1,,,,55829.0,0.0,0.0,0.0,1187557.0,0.0,1187557.0
688,688,141,2015,Avista Corp. - AVWP (was WWP),SF,WSPP-1,,,,117318.0,0.0,0.0,0.0,4207697.0,0.0,4207697.0
4960,4959,187,2015,Sacramento Municipal Utility District,SF,WSPP,,,,400.0,0.0,0.0,0.0,7500.0,0.0,7500.0
6208,6209,45,2015,LYNWOOD SOLAR I LLC,LU,(1),,,,179.0,0.0,0.0,0.0,9965.0,0.0,9965.0


In [41]:
pudl_na = pudl_power.dropna(subset=['statistical_classification','rate_schedule_tariff_number'])

In [7]:
pudl_power1 = pd.read_sql('''SELECT * FROM purchased_power_ferc1 WHERE respondent_id=122 AND report_year=2015''',pudl_engine)

In [42]:
pudl_na.sample(30)

Unnamed: 0,id,respondent_id,report_year,authority_company_name,statistical_classification,rate_schedule_tariff_number,average_billing_demand,average_monthly_ncp_demand,average_monthly_cp_demand,mwh_purchased,mwh_received,mwh_delivered,demand_charges,energy_charges,other_charges,settlement_total
2343,2344,134,2015,"Iberdrola Renewables, LLC",AD,T-11,,,,0.0,478.0,1670.0,0.0,0.0,-54646.0,-54646.0
2039,2040,138,2015,"Energy America, LLC",SF,,,,,380969.0,0.0,0.0,0.0,29580990.0,0.0,29580990.0
5225,5226,169,2015,Total,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5674,5675,55,2015,MORGAN STANLEY CAPITAL GROUP,OS,177,,,,5418.0,0.0,0.0,0.0,198510.0,0.0,198510.0
5368,5368,403,2015,WACMM,OS,,,,,10.0,0.0,0.0,0.0,190.0,0.0,190.0
6300,6300,45,2015,SOUTHDATA INC,LU,(1),,,,13.0,0.0,0.0,0.0,712.0,0.0,712.0
717,717,141,2015,"Glendale, City of",SF,WSPP-1,,,,160.0,0.0,0.0,0.0,3190.0,0.0,3190.0
6400,6400,45,2015,"PJM Settlements, Inc",OS,(2),,,,544719.0,0.0,0.0,0.0,20496982.0,0.0,20496982.0
5094,5095,9,2015,PJM Interconnection,,,,,,219432.0,0.0,0.0,792451.0,7521663.0,841270.0,9155384.0
4808,5035,226,2015,Morgan Stanley,IF,,,,,563132.0,0.0,0.0,0.0,13319741.0,217457.0,13537198.0


In [14]:
rows_length = len(pudl_power1.axes[0])
rows_length

6797

In [13]:
problem_pudl = pd.read_sql('''SELECT * FROM purchased_power_ferc1 \
WHERE respondent_id=122 AND report_year=2015 AND authority_company_name= 'MONTANA PURCHASES' ''',pudl_engine)
problem_pudl

Unnamed: 0,id,respondent_id,report_year,authority_company_name,statistical_classification,rate_schedule_tariff_number,average_billing_demand,average_monthly_ncp_demand,average_monthly_cp_demand,mwh_purchased,mwh_received,mwh_delivered,demand_charges,energy_charges,other_charges,settlement_total
0,1,122,2015,MONTANA PURCHASES,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [76]:
f1_purchased_pwr = ferc1_meta.tables['f1_purchased_pwr']
select_pp = select([f1_purchased_pwr])
read_pp = pd.read_sql(select_pp, ferc1_engine)
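# Drop FERC bookkeeping columns that are not carried over into PUDL.
read_pp.drop(['spplmnt_num', 'row_number', 'row_seq',
              'row_prvlg', 'report_prd'],
             axis=1, inplace=True)
# Treat empty strings as missing values so dropna() can catch them.
read_pp.replace(to_replace='', value=np.nan, inplace=True)
# Discard rows lacking a statistical classification or a rate schedule number.
read_pp.dropna(subset=['sttstcl_clssfctn', 'rtsched_trffnbr'],
               inplace=True)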
read_pp.rename(columns={
        # FERC 1 DB Name  PUDL DB Name
        'athrty_co_name': 'authority_company_name',
        'sttstcl_clssfctn': 'statistical_classification',
        'rtsched_trffnbr': 'rate_schedule_tariff_number',
        'avgmth_bill_dmnd': 'average_billing_demand',
        'avgmth_ncp_dmnd': 'average_monthly_ncp_demand',
        'avgmth_cp_dmnd': 'average_monthly_cp_demand',
        'mwh_recv': 'mwh_received',
        'mwh_delvd': 'mwh_delivered',
        'dmnd_charges': 'demand_charges',
        'erg_charges': 'energy_charges',
        'othr_charges': 'other_charges',
        'settlement_tot': 'settlement_total'},
        inplace=True)


In [81]:
out_pp = read_pp.to_sql(name='purchased_power_ferc1',
                               con=pudl_engine, index=False, if_exists='append',
                               dtype={'respondent_id': Integer,
                                      'report_year': Integer,
                                      'authority_company_name': String,
                                      'statistical_classification': String,
                                      'rate_schedule_tariff_number': String,
                                      'average_billing_demand': String,
                                      'average_monthly_ncp_demand': String,
                                      'average_monthly_cp_demand': String,
                                      'mwh_purchased' : Numeric(14, 2),
                                      'mwh_received' : Numeric(14, 2),
                                      'mwh_delivered' : Numeric(14, 2),
                                      'demand_charges' : Numeric(14, 2),
                                      'energy_charges' : Numeric(14, 2),
                                      'other_charges' : Numeric(14, 2),
                                      'settlement_total' : Numeric(14, 2)})
out_pp

In [82]:
ferc1_engine = pudl.ferc1.db_connect_ferc1()
pudl.ferc1.init_db(refyear=2015, years=[2015,], def_db=True)
pudl_engine = pudl.pudl.db_connect_pudl()
pudl.pudl.drop_tables_pudl(pudl_engine)
ft = constants.ferc1_pudl_tables+['f1_purchased_pwr',]
pudl.pudl.init_db(eia923_tables=[], eia923_years=[],
                  ferc1_tables=ft, ferc1_years=[2015,],
                  debug=True, verbose=True)

Defining new FERC Form 1 DB based on 2015...
Clearing any existing FERC Form 1 database MetaData...
Ingesting FERC Form 1 Data from 2015...
Ingesting static PUDL tables...
Sniffing EIA923/FERC1 glue tables...
Ingesting f1_fuel from FERC Form 1 into PUDL.
Ingesting f1_steam from FERC Form 1 into PUDL.
Ingesting f1_hydro from FERC Form 1 into PUDL.
Ingesting f1_plant_in_srvce from FERC Form 1 into PUDL.
Ingesting f1_purchased_pwr from FERC Form 1 into PUDL.
Ingesting f1_accumdepr_prvsn from FERC Form 1 into PUDL.
