# Getting started with PUDL
This notebook is meant to help you start working with [the data that's loaded into the PUDL database](https://github.com/catalyst-cooperative/pudl#available-data).
If you haven't gotten the database up and running yet, you should work your way through [our getting started guide on GitHub](https://github.com/catalyst-cooperative/pudl/blob/master/docs/getting_started.md) first, and then come back here. We'll assume that you're at least somewhat familiar with the Python programming language and Jupyter Notebooks.

If you're not yet familiar with Jupyter Notebooks, here are a few resources:
 * [A Jupyter Notebook tutorial from DataCamp](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)
 * [An introduction to Jupyter Notebooks from Real Python](https://realpython.com/jupyter-notebook-introduction/)
 * [A gallery of interesting Notebooks from Project Jupyter](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks)

In [1]:
# Pandas for data analysis:
import pandas as pd

# SQLAlchemy gives us a bunch of database tools:
import sqlalchemy as sa

# Import the top level PUDL package:
import pudl

# Connecting to our databases
Currently PUDL creates two distinct databases: the FERC Form 1 database and
the PUDL database. The FERC Form 1 database is a compilation of all the raw
Form 1 data that FERC has released since 1994. We store this cloned data in
an SQLite database, in the file `sqlite/ferc1.sqlite` within your PUDL
workspace (the `ferc1_db` entry in the settings shown below). If you really
need access to the original FERC Form 1 data, you can connect to that
database directly like this:

In [2]:
pudl_settings = pudl.workspace.setup.get_defaults()
pudl_settings

{'pudl_in': '/Users/christinagosnell/code/pudl_work',
 'data_dir': '/Users/christinagosnell/code/pudl_work/data',
 'settings_dir': '/Users/christinagosnell/code/pudl_work/settings',
 'pudl_out': '/Users/christinagosnell/code/pudl_work',
 'sqlite_dir': '/Users/christinagosnell/code/pudl_work/sqlite',
 'parquet_dir': '/Users/christinagosnell/code/pudl_work/parquet',
 'datapackage_dir': '/Users/christinagosnell/code/pudl_work/datapackage',
 'notebook_dir': '/Users/christinagosnell/code/pudl_work/notebook',
 'ferc1_db': 'sqlite:////Users/christinagosnell/code/pudl_work/sqlite/ferc1.sqlite',
 'pudl_db': 'sqlite:////Users/christinagosnell/code/pudl_work/sqlite/pudl.sqlite'}

In [3]:
ferc1_engine = sa.create_engine(pudl_settings["ferc1_db"])
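
If you're not sure which tables a database contains, SQLAlchemy's inspector can list them. The sketch below runs against a throwaway in-memory SQLite database with two made-up tables rather than the real `ferc1.sqlite`, so the table definitions are assumptions for illustration only.

```python
import sqlalchemy as sa

# Throwaway in-memory SQLite database standing in for ferc1.sqlite.
engine = sa.create_engine("sqlite://")

# Create two made-up tables; engine.begin() commits when the block exits.
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE f1_fuel (respondent_id INTEGER, fuel TEXT)"))
    conn.execute(sa.text("CREATE TABLE f1_steam (respondent_id INTEGER, plant_name TEXT)"))

# Ask the inspector which tables exist in the database.
table_names = sorted(sa.inspect(engine).get_table_names())
print(table_names)
```

With the real database, the same call would be `sa.inspect(ferc1_engine).get_table_names()`.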

Then you can use Pandas to pull a particular FERC Form 1 table with `pd.read_sql_table()`, or run an arbitrary SQL query on the database with `pd.read_sql()`. For instance, this will create a dataframe called `f1_fuel_df` containing all of the original FERC Form 1 data from the `f1_fuel` table, which contains information about the fuel consumed by large thermal generators, from 1994 to the present.

In [4]:
f1_fuel_df = pd.read_sql_table('f1_fuel', ferc1_engine)
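
The cell above reads a whole table. As a minimal sketch of the other route, running an SQL query through `pd.read_sql()`, the example below uses an in-memory SQLite database and a few made-up rows (the values are not real FERC data) so that it runs without a PUDL installation.

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database; the rows below are made up for illustration.
conn = sqlite3.connect(":memory:")
toy_fuel = pd.DataFrame({
    "respondent_id": [1, 1, 2, 2],
    "report_year": [2016, 2017, 2017, 2017],
    "fuel": ["coal", "coal", "gas", "coal"],
    "fuel_quantity": [100.0, 120.0, 50.0, 30.0],
})
toy_fuel.to_sql("f1_fuel", conn, index=False)

# Let the database filter and aggregate before the data reaches pandas.
fuel_2017 = pd.read_sql(
    """SELECT fuel, SUM(fuel_quantity) AS total_quantity
       FROM f1_fuel
       WHERE report_year = 2017
       GROUP BY fuel
       ORDER BY fuel""",
    conn,
)
print(fuel_2017)
conn.close()
```

Against the real database you would pass `ferc1_engine` as the second argument instead of the toy connection.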

If you want to see a random 10-row sample of what's in that dataframe, the `df.sample()` method is here for you:

In [5]:
f1_fuel_df.sample(10)

Unnamed: 0,respondent_id,report_year,spplmnt_num,row_number,row_seq,row_prvlg,plant_name,fuel,fuel_unit,fuel_quantity,...,fuel_f,fuel_unit_f,fuel_quantity_f,fuel_avg_heat_f,fuel_cost_delvd_f,fuel_cost_burned_f,fuel_cost_btu_f,fuel_cost_kwh_f,fuel_generaton_f,report_prd
2582,281,2017,0,3,3,,"Prairie Creek 1,3",,,,...,0,0,0,0,0,0,0,0,0,12
90,133,2017,2,1,1,,,,,,...,0,0,0,0,0,0,0,0,0,12
2060,45,2017,1,6,6,,Dan River,,,,...,0,0,0,0,0,0,0,0,0,12
3142,169,2017,0,8,8,,,,,,...,0,0,0,0,0,0,0,0,0,12
608,210,2017,3,9,9,,Quad-Cities,,,,...,0,0,0,0,0,0,0,0,0,12
5580,159,2017,21,1,1,,,,,,...,0,0,0,0,0,0,0,0,0,12
3823,195,2017,5,14,14,,,,,,...,0,0,0,0,0,0,0,0,0,12
5597,80,2017,0,3,3,,Murray Gill,Oil,Barrel,,...,0,0,0,0,0,0,0,0,0,12
5702,17,2017,3,3,3,,Morehead,,,,...,0,0,0,0,0,0,0,0,0,12
5361,159,2017,6,7,7,,,,,,...,0,0,0,0,0,0,0,0,0,12


However, there are a bunch of "footnote" reference columns ending in `_f` that aren't particularly useful here. To drop them and look at a different sample of the remaining columns, we can do:

In [6]:
# Match only names that end in '_f', so other columns containing '_f' are kept:
f1_fuel_df = f1_fuel_df.drop(columns=f1_fuel_df.filter(regex='_f$').columns)
f1_fuel_df.sample(10)

Unnamed: 0,respondent_id,report_year,spplmnt_num,row_number,row_seq,row_prvlg,plant_name,fuel,fuel_unit,fuel_quantity,fuel_avg_heat,fuel_cost_delvd,fuel_cost_burned,fuel_cost_btu,fuel_cost_kwh,fuel_generaton,report_prd
242,22,2017,2,3,3,,,,,,,0.0,0.0,0.0,0.0,0.0,12
235,22,2017,1,11,11,,Coughlin,,,,,0.0,0.0,0.0,0.0,0.0,12
1298,99,2017,0,9,9,,Watson,,,,,0.0,0.0,0.0,0.0,0.0,12
537,186,2017,8,13,13,,Remington Solar,,,,,0.0,0.0,0.0,0.0,0.0,12
2171,144,2017,1,12,12,,CAYUGA,Oil,Barrels,8241.0,137000.0,74.409,72.946,12.677,0.0,10309.0,12
699,176,2017,0,10,10,,Demoss Petrie,,,,,0.0,0.0,0.0,0.0,0.0,12
212,22,2017,0,3,3,,Teche,,,,,0.0,0.0,0.0,0.0,0.0,12
291,82,2017,2,7,7,,,,,,,0.0,0.0,0.0,0.0,0.0,12
1669,130,2017,8,5,5,,,,,,,0.0,0.0,0.0,0.0,0.0,12
460,186,2017,3,11,11,,Yorktown,No. 6 Oil,bbls,90846.0,6409536.0,41.66,43.26,6.75,0.05,0.0,12
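
One thing to watch when selecting columns by name: `DataFrame.filter(like=...)` matches a substring anywhere in the column name, while `DataFrame.filter(regex=...)` can be anchored to the end of the name. The sketch below shows the difference on an empty dataframe with made-up columns; `net_fuel_use` is a hypothetical name, not a real FERC Form 1 column.

```python
import pandas as pd

# Empty dataframe with illustrative column names only.
toy = pd.DataFrame(
    columns=["plant_name", "fuel", "fuel_f", "fuel_unit_f", "net_fuel_use"]
)

# Substring match: also catches 'net_fuel_use' because it contains '_f'.
like_matches = list(toy.filter(like="_f").columns)

# Anchored regex: only names that actually end in '_f'.
suffix_matches = list(toy.filter(regex="_f$").columns)

print(like_matches)
print(suffix_matches)
```

For the `f1_fuel` columns shown above, both approaches appear to select the same footnote columns, but the anchored pattern is the safer habit.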


### FERC Form 1, but from PUDL
But that's just the FERC Form 1 DB... and it's a mess. The PUDL database includes a cleaned-up subset of the FERC Form 1 data, as well as data from the EIA 860 and EIA 923. We can connect to it in a similar way and access the same fuel data in an easier-to-use form. Note that the fuel types have been standardized, several derived columns are now gone, and the column names are more informative.

In [7]:
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
# Drop the 'id' column, which contains an autoincremented surrogate primary key:
fuel_ferc1_df = pd.read_sql('fuel_ferc1', pudl_engine).drop('id', axis=1)
fuel_ferc1_df.sample(10)

Unnamed: 0,record_id,utility_id_ferc1,report_year,plant_name,fuel_type_code_pudl,fuel_unit,fuel_qty_burned,fuel_mmbtu_per_unit,fuel_cost_per_unit_burned,fuel_cost_per_unit_delivered,fuel_cost_per_mmbtu
2644,f1_fuel_2006_89_0_3,89,2006,blount station,,ton,3076.0,0.011975,116.963,116.963,4.884
10098,f1_fuel_2014_194_4_7,194,2014,sheboygan fallsunit2,gas,mcf,61784.0,1.033,4.39,4.39,4.244
11181,f1_fuel_2015_44_0_13,44,2015,monroe,coal,ton,8257031.0,19.782,50.812,50.126,2.565
5191,f1_fuel_2009_7_4_10,7,2009,saguaro 1,gas,mcf,1907.0,1.026481,10.535,3.902,10.263
1340,f1_fuel_2005_42_2_7,42,2005,killen,coal,ton,1289055.0,22.724,45.659,45.772,2.009
6317,f1_fuel_2010_193_2_1,193,2010,elm road- unit 1,coal,ton,48886.0,25.962,72.023,72.023,2.77418
2307,f1_fuel_2006_194_2_4,194,2006,edgewater unit4(wpl),coal,ton,794056.0,17.0,32.03,30.92,1.843
3793,f1_fuel_2008_82_0_13,82,2008,haefling,gas,mcf,1723.0,1.025,15.096,15.096,14.728
5459,f1_fuel_2009_17_1_11,17,2009,h.b. robinson,coal,ton,408053.0,23.814,91.982,91.408,3.8626
3523,f1_fuel_2007_195_5_11,195,2007,w marinette m33-wps,gas,mcf,105276.0,1.003,7.265,7.265,7.245


### Selecting data with an SQL query
Of course, we can also pull other data. Here we'll pull data from the EIA 923 Fuel Receipts and Costs table into a dataframe, using an SQL query rather than reading the whole table directly. Let's limit it to coal records.

In [8]:
frc_eia923 = pd.read_sql(
    """SELECT * FROM fuel_receipts_costs_eia923
       WHERE fuel_type_code_pudl='coal'""", pudl_engine)
frc_eia923.sample(10)

Unnamed: 0,id,plant_id_eia,report_date,contract_type_code,contract_expiration_date,energy_source_code,fuel_type_code_pudl,fuel_group_code,fuel_group_code_simple,mine_id_pudl,...,sulfur_content_pct,ash_content_pct,mercury_content_ppm,fuel_cost_per_mmbtu,primary_transportation_mode_code,secondary_transportation_mode_code,natural_gas_transport_code,natural_gas_delivery_contract_type_code,moisture_content_pct,chlorine_content_ppm
2408,5702,564,2009-02-01,C,2011-12-01,BIT,coal,coal,coal,68.0,...,1.31,10.7,,2.937,RR,,,,,
80539,206400,6165,2012-09-01,C,,BIT,coal,coal,coal,443.0,...,0.52,15.4,9.0,2.108,TR,TC,,,,
95838,239579,3399,2013-08-01,C,2016-12-01,BIT,coal,coal,coal,437.0,...,2.93,8.6,,2.265,RV,,,,,
42836,109516,56068,2010-10-01,C,2012-12-01,BIT,coal,coal,coal,51.0,...,1.8,7.2,,2.827,RR,,,,,
21592,53670,8102,2009-11-01,C,,BIT,coal,coal,coal,489.0,...,3.42,10.76,,1.791,RV,,,,,
14031,33780,8042,2009-07-01,C,2010-12-01,BIT,coal,coal,coal,303.0,...,1.27,11.0,,3.809,RR,,,,,
38037,96575,3470,2010-08-01,C,2010-12-01,SUB,coal,coal,coal,59.0,...,0.31,5.4,,,RR,,,,,
142675,344926,10151,2016-08-01,S,,WC,coal,coal,coal,596.0,...,2.23,52.9,0.0,,TR,,,,11.83,0.0
151291,372581,4125,2017-05-01,S,,PC,coal,petroleum_coke,petroleum_coke,,...,5.95,0.4,0.0,1.813,TR,,,,,
48421,123247,54035,2011-01-01,C,2014-05-01,BIT,coal,coal,coal,238.0,...,1.31,8.5,,,RR,,,,,
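
The query above filters only on fuel type, which is why the sample includes records from 2009 onward. A date condition can be added to the `WHERE` clause in the same way. The sketch below is a minimal illustration using an in-memory SQLite database with made-up rows; it assumes dates are stored as ISO-formatted text, which may not match how the real PUDL database stores them. The `?` placeholders are the style used by Python's `sqlite3` driver, and other drivers may expect a different placeholder syntax.

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database with a few made-up fuel receipt rows.
conn = sqlite3.connect(":memory:")
toy_frc = pd.DataFrame({
    "plant_id_eia": [564, 6165, 3399, 4125],
    "report_date": ["2013-08-01", "2014-01-01", "2016-08-01", "2017-05-01"],
    "fuel_type_code_pudl": ["coal", "coal", "coal", "gas"],
})
toy_frc.to_sql("fuel_receipts_costs_eia923", conn, index=False)

# Pass the filter values as parameters instead of pasting them into the SQL.
query = """
    SELECT * FROM fuel_receipts_costs_eia923
    WHERE fuel_type_code_pudl = ?
      AND report_date >= ?
    ORDER BY report_date
"""
recent_coal = pd.read_sql(
    query, conn, params=("coal", "2014-01-01"), parse_dates=["report_date"]
)
print(recent_coal)
conn.close()
```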


### Using a tabular output object
The normalized database tables are great, but we've also created some routines to compile dataframes that are a little more human friendly, for instance by including the names of the various plants and utilities referenced, rather than just their unique IDs. These routines are bundled together into tabular output objects, defined by the `pudl.output.pudltabl.PudlTabl` class. These objects can also aggregate the data appropriately at different time resolutions. This example creates an output object with all available data, at monthly resolution:

In [9]:
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine=pudl_engine, freq='MS')

Now that we have the object, we can ask it for some data, say about per-generator net generation:

In [10]:
gen_eia923 = pudl_out.gen_eia923()

In [11]:
gen_eia923.sample(10)

Unnamed: 0,report_date,plant_id_eia,plant_id_pudl,plant_name,utility_id_eia,utility_id_pudl,utility_name,generator_id,net_generation_mwh
17437,2014-02-01,568,1542,Bridgeport Station,15452,270,PSEG Power Connecticut LLC,3,221605.0
118829,2017-06-01,6112,192,Fort St Vrain,15466,272,Public Service Co of Colorado,2,51560.0
231351,2011-04-01,55087,661,Zeeland Generating Station,4254,81,Consumers Energy Co,2C,37860.0
40204,2012-05-01,1317,1772,Pratt,15321,1160,City of Pratt- (KS),3,0.0
18547,2014-08-01,609,88,Cape Canaveral,6452,121,Florida Power & Light Co,3A,126154.0
188412,2013-01-01,50572,3735,CSL Gas Recovery,54842,3800,WM Renewable Energy LLC,GEN3,1334.0
281074,2013-11-01,57926,6135,Heat Plant 770,57305,3812,Wright Patterson AFB,HP,
177438,2017-07-01,50290,3600,SEMASS Resource Recovery,16929,3118,SEMASS Partnership,GEN2,17021.0
197780,2015-09-01,50900,3869,Covington Facility,20508,2400,MeadWestvaco Corp,GEN5,19947.88
259107,2017-04-01,55545,4608,Hidalgo Energy Center,2934,757,Calpine Corp - Hidalgo,CTG2,65197.0
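
The `freq='MS'` argument used above is a pandas offset alias meaning "month start". To give a feel for what aggregating to a coarser time resolution involves, here is a minimal pandas sketch that rolls made-up monthly generation up to annual totals per generator. It is an illustration under assumed data, not the code the output object runs internally.

```python
import pandas as pd

# Made-up monthly net generation for two generators.
monthly = pd.DataFrame({
    "report_date": pd.to_datetime(
        ["2016-01-01", "2016-02-01", "2017-01-01", "2016-01-01", "2017-03-01"]
    ),
    "plant_id_eia": [1, 1, 1, 2, 2],
    "generator_id": ["1", "1", "1", "A", "A"],
    "net_generation_mwh": [100.0, 150.0, 200.0, 50.0, 75.0],
})

# Group by generator and by calendar year ('YS' = year start), then sum.
annual = (
    monthly
    .groupby(
        ["plant_id_eia", "generator_id",
         pd.Grouper(key="report_date", freq="YS")]
    )["net_generation_mwh"]
    .sum()
    .reset_index()
)
print(annual)
```

Summing is the right aggregation for a quantity like net generation; ratios such as heat rates would need to be recomputed from aggregated totals rather than summed or averaged naively.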


### Requesting derived values from output objects
The output objects also know how to use the available data to calculate some interesting derived values on a whole-dataset basis, for example per-generator marginal fuel costs, heat rates, and capacity factors. This can take a couple of minutes.

In [None]:
mcoe_df = pudl_out.mcoe()
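
For orientation, the quantities mentioned above follow from a few standard definitions: heat rate is fuel energy consumed per unit of electricity generated, fuel cost per MWh is heat rate times fuel cost per mmBTU, and capacity factor is actual generation divided by the maximum possible generation over the period. The sketch below applies them to two made-up generators. The `capacity_mw`, `fuel_consumed_mmbtu`, and `capacity_factor` column names are assumptions for illustration, and PUDL's own calculation may handle many details differently.

```python
import pandas as pd

# Made-up annual data for two generators.
gens = pd.DataFrame({
    "generator_id": ["A", "B"],
    "capacity_mw": [100.0, 50.0],
    "net_generation_mwh": [438000.0, 87600.0],
    "fuel_consumed_mmbtu": [4380000.0, 700800.0],
    "fuel_cost_per_mmbtu": [2.0, 4.0],
})
hours_in_period = 8760  # one non-leap year

# Heat rate: mmBTU of fuel burned per MWh generated.
gens["heat_rate_mmbtu_mwh"] = gens.fuel_consumed_mmbtu / gens.net_generation_mwh

# Fuel cost per MWh: heat rate times the cost of each mmBTU of fuel.
gens["fuel_cost_per_mwh"] = gens.heat_rate_mmbtu_mwh * gens.fuel_cost_per_mmbtu

# Capacity factor: actual generation over maximum possible generation.
gens["capacity_factor"] = gens.net_generation_mwh / (
    gens.capacity_mw * hours_in_period
)
print(gens[["generator_id", "heat_rate_mmbtu_mwh",
            "fuel_cost_per_mwh", "capacity_factor"]])
```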

### Plotting from the dataframe
Now that we've got some interesting analysis stored in a dataframe, we should make a few visualizations so we can see what's going on.

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline

In [None]:
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = (10,4)
mpl.rcParams['figure.dpi'] = 150

In [None]:
coal_mcoe = mcoe_df[mcoe_df.fuel_type_code_pudl=='coal']
gas_mcoe = mcoe_df[mcoe_df.fuel_type_code_pudl=='gas']
start_year = min(mcoe_df.report_date.dt.year)
end_year = max(mcoe_df.report_date.dt.year)
plt.hist(coal_mcoe.heat_rate_mmbtu_mwh,
         weights=coal_mcoe.net_generation_mwh,
         range=(6,14), bins=100,
         label='Coal', alpha=0.75)
plt.hist(gas_mcoe.heat_rate_mmbtu_mwh,
         weights=gas_mcoe.net_generation_mwh,
         range=(6,14), bins=100,
         label='Freedom Molecules', alpha=0.75)
plt.xlabel('mmBTU/MWh')
plt.ylabel('MWh')
plt.title(f"Generation weighted heat rates by fuel, {start_year}-{end_year}")
plt.legend();
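
The `weights` argument is what makes the histogram above "generation weighted": instead of counting records in each heat rate bin, it sums the net generation of the records that fall in the bin. A minimal sketch with made-up numbers, using `numpy.histogram` (which `plt.hist` relies on for the binning):

```python
import numpy as np

# Made-up heat rates (mmBTU/MWh) and net generation (MWh) for five records.
heat_rates = np.array([9.5, 9.7, 10.2, 10.4, 12.0])
net_gen = np.array([100.0, 300.0, 50.0, 150.0, 10.0])

# Unweighted: number of records per bin.
counts, edges = np.histogram(heat_rates, bins=4, range=(9, 13))

# Weighted: total MWh generated by the records in each bin.
weighted, _ = np.histogram(heat_rates, bins=4, range=(9, 13), weights=net_gen)

# The same weights give a generation-weighted average heat rate.
avg_heat_rate = np.average(heat_rates, weights=net_gen)

print(counts)
print(weighted)
print(avg_heat_rate)
```

In the weighted version, a small generator with an unusual heat rate contributes very little to the picture, which is usually what you want when asking how most of the electricity was generated.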

In [None]:
coal_mcoe = mcoe_df[mcoe_df.fuel_type_code_pudl=='coal']
gas_mcoe = mcoe_df[mcoe_df.fuel_type_code_pudl=='gas']
start_year = min(mcoe_df.report_date.dt.year)
end_year = max(mcoe_df.report_date.dt.year)
plt.hist(coal_mcoe.fuel_cost_per_mwh,
         weights=coal_mcoe.net_generation_mwh,
         range=(0,70), bins=100,
         label='Coal', alpha=0.75)
plt.hist(gas_mcoe.fuel_cost_per_mwh,
         weights=gas_mcoe.net_generation_mwh,
         range=(0,70), bins=100,
         label='Freedom Molecules', alpha=0.75)
plt.xlabel('$/MWh')
plt.ylabel('MWh')
plt.title(f"Generation weighted fuel costs, {start_year}-{end_year}")
plt.legend();

## Other notebooks to play with:
Any Jupyter Notebook that's deployed into the `notebooks` directory in your PUDL workspace should work, because they are tested automatically with pytest before we do a release.