# Data Wrangling & Analysis Take Home Questions
* These questions are examples of the kind of work we do to make public energy data usable for analysis. We don’t expect you to come up with clean, comprehensive solutions. Rather, we want to be able to explore your general approach together, and see how you think about these kinds of problems.
* **Please choose one of the problems below** and spend several hours working on it. This doesn't have to happen all at once. It isn't an exam! We want you to have time to play with the data, step away from it to think, and then come back to it again.
* Feel free to use whatever documentation or online resources you would normally consult while working on a data wrangling problem.
* Feel free to use additional third-party libraries if you want to. You should be able to install them from within the notebook using `!pip install packagename` or `!conda install -y packagename` (the `-y` flag skips the confirmation prompt, which a notebook cell can't answer).

## Questions to keep in mind:
* What assumptions are you making about the data?
* How will you test whether / when those assumptions are valid?
* How would you / did you deal with the data that don’t conform to those assumptions?
* If there are records that can’t reasonably be cleaned automatically, but are high value in an advocacy context, how would you integrate manual cleaning into the automated process so that the manual effort is captured and can be incrementally improved over time?
* What expectations do you have about the output data?
* What kind of data validation checks would you design to make sure that the output meets your expectations? These could be either integrated into the table transformation process, or run on the final output.
* How do you decide when data isn’t recoverable?
* How will you evaluate the completeness of the data that you’ve been able to extract?
* What kind of queries are you trying to make easy with the structure of the output data?
* What parts of this process might make sense to generalize / abstract for re-use in extracting, cleaning, and reorganizing data from other tables?
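
To make a couple of these questions concrete, here is a minimal sketch of one way manual fixes and output checks could fit together. It is not how PUDL does it. The toy tables, the `fuel_type` column, and the helper names `apply_manual_fixes` and `validate` are all hypothetical.

```python
import pandas as pd

# Toy output of an automated cleaning step (all values are made up).
cleaned = pd.DataFrame({
    "respondent_id": [1, 1, 2],
    "report_year": [1994, 1995, 1994],
    "plant_name": ["alpha", "alpha", "beta"],
    "fuel_type": ["coal", None, "hydro"],
})

# Hypothetical manual corrections. In practice these could live in a
# version-controlled CSV so that the manual effort accumulates over time.
manual = pd.DataFrame({
    "respondent_id": [1],
    "report_year": [1995],
    "plant_name": ["alpha"],
    "fuel_type": ["coal"],
})

keys = ["respondent_id", "report_year", "plant_name"]


def apply_manual_fixes(auto, fixes, keys):
    """Overwrite automated values with manual ones wherever a fix exists."""
    auto = auto.set_index(keys)
    fixes = fixes.set_index(keys)
    # Fail loudly if a fix points at a record that no longer exists.
    orphaned = fixes.index.difference(auto.index)
    if len(orphaned) > 0:
        raise ValueError(f"Manual fixes with no matching record: {list(orphaned)}")
    # DataFrame.update only copies over the non-null values from `fixes`.
    auto.update(fixes)
    return auto.reset_index()


def validate(df, keys):
    """Return a list of problems found in the output table."""
    problems = []
    if df.duplicated(subset=keys).any():
        problems.append("duplicate key combinations")
    if df[keys].isna().any().any():
        problems.append("null values in key columns")
    return problems


result = apply_manual_fixes(cleaned, manual, keys)
problems = validate(result, keys)
print(result)
print(problems)
```

Keeping the manual fixes in their own table, keyed the same way as the automated output, means they can be re-applied every time the automated process is re-run, and a stale fix gets flagged instead of silently ignored.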

# Background on the FERC Form 1 Database
* The FERC Form 1 collects financial data about electric utilities in the US. It’s a treasure-trove of information if you want to understand how these utilities make and spend money. The capital they have locked up in existing fossil fuel infrastructure is one of the big reasons they fight against the transition to clean energy. Data from the FERC Form 1 can help advocates understand which utilities will be easiest to engage in the transition, and which ones may be hopeless pyromaniacs.
* Unfortunately, FERC does not organize its data very well, or do much quality control, so this data is difficult to extract and use. We’ve built a script that pulls together all of FERC’s annual Visual FoxPro databases into a single SQLite database covering all the years of data. Then we write extract and transform functions to pull tables from this multi-year DB and clean them up for easier analysis.
* To help us understand how you approach working with messy data and turning it into something usable, we’d like you to develop a strategy for reshaping and cleaning the data in a couple of these tables.
* [Here is some documentation about the FERC Form 1 Database](https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/ferc1_db_notes.html), including a mapping between database tables and the pages of the PDF that their data is collected from.

## Other things you should know about the database structure:
  * The FERC Form 1 database generally mirrors the structure of the old paper forms that were used to collect this data.
  * Columns on the paper form have been translated into columns in the database, and the numbered rows on the paper forms are identified by a `row_number` field in the database.
  * Different row numbers correspond to different reported values, meaning that the original records in the database are not all observations of the same variable.
  * In addition, what row number corresponds to what variable has changed over the years, as new rows have been added, or old ones have been removed or split into multiple rows offering more granular information.
  * The `f1_row_lit_tbl` maps combinations of table names, row numbers, and years to the description that applied to that row number, and thus indicates how the meanings associated with individual rows have changed over time. The `row_chg_yr` column indicates the last year in which the meaning of a row number changed.
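
As a rough illustration of how the row label table can be used, the sketch below attaches a description to each reported value by joining on year and row number. The values are copied from the sample outputs shown later in this notebook. It assumes that the label table has an entry for every `report_year`, which you would want to check against the real data.

```python
import pandas as pd

# A few records in the shape of f1_elc_op_mnt_expn.
expenses = pd.DataFrame({
    "respondent_id": [1, 1, 1],
    "report_year": [1994, 1994, 1994],
    "row_number": [4, 5, 6],
    "crnt_yr_amt": [2058807.0, 100684754.0, 889217.0],
})

# The matching rows in the shape of f1_row_lit_tbl.
labels = pd.DataFrame({
    "sched_table_name": ["f1_elc_op_mnt_expn"] * 3,
    "report_year": [1994, 1994, 1994],
    "row_number": [4, 5, 6],
    "row_literal": [
        "(500) Operation Supervision and Engineering",
        "(501) Fuel",
        "(502) Steam Expenses",
    ],
})

# Join on year AND row number, since a row number's meaning can change
# between years. validate= raises if a (year, row) pair has two labels.
labeled = expenses.merge(
    labels[["report_year", "row_number", "row_literal"]],
    on=["report_year", "row_number"],
    how="left",
    validate="many_to_one",
)

# Records that found no label deserve a closer look.
n_unlabeled = int(labeled["row_literal"].isna().sum())
print(labeled)
print(n_unlabeled)
```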

# Setting up access to the PUDL Data
The cells below will create SQLAlchemy database connection engines that point at the FERC Form 1 DB and the PUDL DB.

In [1]:
import sqlalchemy as sa
import pandas as pd
import pudl

In [2]:
pudl_settings = pudl.workspace.setup.get_defaults()
ferc1_engine = sa.create_engine(pudl_settings['ferc1_db'])
pudl_engine = sa.create_engine(pudl_settings['pudl_db'])

# Question 1: Tidy Data and Database Normalization
* Within the FERC Form 1 DB, examine the Electric Operating and Maintenance Expenses table: `f1_elc_op_mnt_expn`.
* Refer to the [blank Form 1 (PDF)](https://catalystcoop-pudl.readthedocs.io/en/dev/_downloads/6a316a949a522f595e7575b6fd7034b8/ferc1_blank_2022-11-30.pdf) for a description of what specific information is being collected in this table. In general it describes different categories of electric utility operating and maintenance expenses, across a wide range of utility assets.
* Design a process for normalizing the table, re-shaping it into one or more tidy, long-form tables where each column corresponds to a single variable, and each row represents a single observation (all the data that pertains to one utility in a given year) while minimizing the duplication of information.
* The process should account for the fact that row numbers in the original database table correspond to different quantities in different years.
* The original table implicitly groups sets of rows into categories and sub-categories, though this is only obvious if you look at the PDF of the blank form. In the normalized table it should be possible to identify those groups, but without duplicating information in the table.
* The FERC account numbers and sub-account numbers that appear in this table show up in many different contexts within the Form 1 and other FERC reporting. Being able to use them to address the data is valuable.
* You can ignore the columns whose names end with `_f` as they refer to footnotes.
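
One possible starting point, sketched on toy data: pull the FERC account number out of each row label, forward-fill the implicit category headings, and keep the category information in a small lookup table so it isn't repeated on every record. The regular expressions and the column names `ferc_account`, `category`, `subcategory`, `activity`, and `amount` are assumptions made for illustration. The real labels will need more careful handling, and this toy example covers only a single year.

```python
import pandas as pd

# Row labels for one year, copied from the f1_row_lit_tbl sample below.
labels = pd.DataFrame({
    "row_number": [1, 2, 3, 4, 5, 8],
    "row_literal": [
        "1. POWER PRODUCTION EXPENSES",
        "A. Steam Power Generation",
        "Operation",
        "(500) Operation Supervision and Engineering",
        "(501) Fuel",
        "(Less) (504) Steam Transferred-Cr.",
    ],
})
lit = labels["row_literal"]

# Assumed pattern: an account number in parentheses, e.g. "(501)",
# optionally with a decimal sub-account. "(Less)" does not match.
labels["ferc_account"] = lit.str.extract(r"\((\d{3}(?:\.\d+)?)\)", expand=False)

# Assumed heading patterns: "1. ..." for categories, "A. ..." for
# sub-categories, and a bare "Operation" / "Maintenance" line.
labels["category"] = lit.where(lit.str.match(r"^\d+\.\s")).ffill()
labels["subcategory"] = lit.where(lit.str.match(r"^[A-Z]\.\s"))
labels["subcategory"] = labels.groupby("category")["subcategory"].ffill()
labels["activity"] = lit.where(lit.isin(["Operation", "Maintenance"]))
labels["activity"] = labels.groupby(["category", "subcategory"])["activity"].ffill()

# Only rows with an account number hold reported values.
accounts = labels.dropna(subset=["ferc_account"])

# Lookup table: the grouping of each account is stored exactly once.
account_info = accounts[
    ["ferc_account", "category", "subcategory", "activity"]
].drop_duplicates()

# Reported values, in the shape of f1_elc_op_mnt_expn.
expenses = pd.DataFrame({
    "respondent_id": [1, 1],
    "report_year": [1994, 1994],
    "row_number": [4, 5],
    "crnt_yr_amt": [2058807.0, 100684754.0],
})

# Tidy table: one row per utility, year, and account.
tidy = (
    expenses.merge(
        accounts[["row_number", "ferc_account"]],
        on="row_number",
        validate="many_to_one",
    )
    .rename(columns={"crnt_yr_amt": "amount"})
    [["respondent_id", "report_year", "ferc_account", "amount"]]
)
print(account_info)
print(tidy)
```

Addressing values by `ferc_account` rather than `row_number` is what makes the result stable across years. With the full table, the merge would also need to include `report_year`.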

In [3]:
electricity_expenses = pd.read_sql("f1_elc_op_mnt_expn", ferc1_engine)
row_labels = (
    pd.read_sql("f1_row_lit_tbl", ferc1_engine)
    .query("sched_table_name=='f1_elc_op_mnt_expn'")
)

In [4]:
electricity_expenses.head(20)

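| | respondent_id | report_year | spplmnt_num | row_number | row_seq | row_prvlg | crnt_yr_amt | prev_yr_amt | crnt_yr_amt_f | prev_yr_amt_f | report_prd |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1994 | 0 | 4 | 4 | N | 2058807.0 | 2006612.0 | 0 | 0 | 12 |
| 1 | 1 | 1994 | 0 | 5 | 5 | N | 100684754.0 | 89214112.0 | 0 | 0 | 12 |
| 2 | 1 | 1994 | 0 | 6 | 6 | N | 889217.0 | 885222.0 | 0 | 0 | 12 |
| 3 | 1 | 1994 | 0 | 9 | 9 | N | 737882.0 | 690677.0 | 0 | 0 | 12 |
| 4 | 1 | 1994 | 0 | 10 | 10 | N | 2069398.0 | 2054883.0 | 0 | 0 | 12 |
| 5 | 1 | 1994 | 0 | 11 | 11 | N | 67319486.0 | 67003989.0 | 0 | 0 | 12 |
| 6 | 1 | 1994 | 0 | 13 | 13 | N | 173759544.0 | 161855495.0 | 0 | 0 | 12 |
| 7 | 1 | 1994 | 0 | 15 | 15 | N | 855653.0 | 876005.0 | 0 | 0 | 12 |
| 8 | 1 | 1994 | 0 | 16 | 16 | N | 627088.0 | 716465.0 | 0 | 0 | 12 |
| 9 | 1 | 1994 | 0 | 17 | 17 | N | 6560762.0 | 9402606.0 | 0 | 0 | 12 |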


In [5]:
row_labels.head(20)

| | sched_table_name | report_year | row_number | row_seq | row_literal | row_status | row_chg_yr |
|---|---|---|---|---|---|---|---|
| 362 | f1_elc_op_mnt_expn | 1994 | 1 | 1 | 1. POWER PRODUCTION EXPENSES | A | 1994 |
| 363 | f1_elc_op_mnt_expn | 1994 | 2 | 2 | A. Steam Power Generation | A | 1994 |
| 364 | f1_elc_op_mnt_expn | 1994 | 3 | 3 | Operation | A | 1994 |
| 365 | f1_elc_op_mnt_expn | 1994 | 4 | 4 | (500) Operation Supervision and Engineering | A | 1994 |
| 366 | f1_elc_op_mnt_expn | 1994 | 5 | 5 | (501) Fuel | A | 1994 |
| 367 | f1_elc_op_mnt_expn | 1994 | 6 | 6 | (502) Steam Expenses | A | 1994 |
| 368 | f1_elc_op_mnt_expn | 1994 | 7 | 7 | (503) Steam from Other Sources | A | 1994 |
| 369 | f1_elc_op_mnt_expn | 1994 | 8 | 8 | (Less) (504) Steam Transferred-Cr. | A | 1994 |
| 370 | f1_elc_op_mnt_expn | 1994 | 9 | 9 | (505) Electric Expenses | A | 1994 |
| 371 | f1_elc_op_mnt_expn | 1994 | 10 | 10 | (506) Miscellaneous Steam Power Expenses | A | 1994 |


# Question 2: Data Cleaning
* Examine the Small Plants table: `f1_gnrt_plant` and the [blank Form 1 (PDF)](https://catalystcoop-pudl.readthedocs.io/en/dev/_downloads/6a316a949a522f595e7575b6fd7034b8/ferc1_blank_2022-11-30.pdf) describing the information being collected in this table.
* This table describes smaller power plants in the US, including their date of construction, fuel types, fuel consumption, net electricity generation, and operational and maintenance costs.
* Unfortunately, the table does not provide a unique ID for each power plant so it’s difficult to assemble a complete time series tracking the plant across all the years of available data. In general the table also doesn’t use controlled vocabularies for variables like the type of plant or fuel. In some cases, information that should be part of a single record is spread across multiple rows. Sometimes numeric IDs and names are stored in the same field. It also seems like different records may be reporting the same kind of value in a given column, but using different units of measurement (e.g. kilowatts vs. megawatts, or tons of coal vs. gallons of diesel fuel).
* Design a process for cleaning up the data in this table so that it can be used for analysis, including tracking individual plants across years, and ensuring that the values reported within each column are comparable. Derived values of interest might be the total per-plant cost per megawatt-hour of electricity generated each year, or the per-plant quantity of fuel consumed per megawatt-hour of electricity generated each year.
* You can ignore the columns whose names end with `_f` as they refer to footnotes.
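
As a hedged sketch of a first cleaning pass, the example below flags rows that look like headings rather than plants, separates a trailing numeric ID from the plant name, and builds a candidate key for linking records across years. The plant names and values come from the sample output below. The heuristics and the column names `is_header`, `name_id`, `name_clean`, and `plant_key` are assumptions that would need testing against the full table.

```python
import pandas as pd

# A few records in the shape of f1_gnrt_plant.
plants = pd.DataFrame({
    "respondent_id": [25, 41, 23, 161, 161],
    "report_year": [2002, 2011, 1997, 2007, 2006],
    "plant_name": [
        "Pierce Mills 2396",
        "Webber - FPC #2566",
        "Lewiston Canal Facilities:",
        "Hydro",
        "Bishop Creek No. 6",
    ],
    "yr_constructed": [1928.0, 1907.0, None, None, 1913.0],
    "capacity_rating": [0.2, 4.3, None, 0.0, 1.6],
})

# Assumed heuristic: a row with no construction year and no capacity is
# probably a heading or category label, not an actual plant.
plants["is_header"] = (
    plants["yr_constructed"].isna()
    & plants["capacity_rating"].fillna(0).eq(0)
)

# Assumed pattern: a 4-digit ID at the end of the name, optionally
# preceded by "#". Short numbers like "No. 6" are left in the name.
id_pattern = r"(?:#\s*)?(\d{4})\s*$"
plants["name_id"] = plants["plant_name"].str.extract(id_pattern, expand=False)

# Normalize the remaining name: drop the ID and punctuation, collapse
# whitespace, and lowercase.
plants["name_clean"] = (
    plants["plant_name"]
    .str.replace(id_pattern, "", regex=True)
    .str.replace(r"[^A-Za-z0-9 ]+", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
    .str.lower()
)

# Candidate key for tracking a plant across years: utility + clean name.
records = plants.loc[~plants["is_header"]].copy()
records["plant_key"] = (
    records["respondent_id"].astype(str)
    + "_"
    + records["name_clean"].str.replace(" ", "_")
)
print(records[["report_year", "plant_key", "name_id"]])
```

A key built from exact string matches will miss plants whose names are spelled differently from year to year, so fuzzy matching or comparing attributes like construction year and capacity would be a natural next step.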

In [6]:
small_plants = pd.read_sql("f1_gnrt_plant", ferc1_engine)
small_plants.sample(10)

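| | respondent_id | report_year | spplmnt_num | row_number | row_seq | row_prvlg | plant_name | yr_constructed | capacity_rating | net_demand | ... | net_demand_f | net_generation_f | plant_cost_f | plant_cost_mw_f | operation_f | expns_fuel_f | expns_maint_f | kind_of_fuel_f | fuel_cost_f | report_prd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4951 | 157 | 1999 | 0 | 4 | 4 | N | FLEISH | 1905.0 | 2.0 | 2.5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| 2940 | 23 | 1997 | 0 | 17 | 0 | N | Lewiston Canal Facilities: | | | 7.0 | ... | 470230279 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| 6413 | 73 | 2001 | 0 | 1 | 1 | N | Hydroelectric | | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| 10429 | 161 | 2007 | 0 | 13 | 13 | | Hydro | | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| 9634 | 161 | 2006 | 0 | 35 | 35 | | Bishop Creek No. 6 | 1913.0 | 1.6 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| 7153 | 25 | 2002 | 0 | 16 | 16 | N | Pierce Mills 2396 | 1928.0 | 0.2 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| 13293 | 41 | 2011 | 0 | 10 | 10 | | Webber - FPC #2566 | 1907.0 | 4.3 | 3.1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| 6829 | 157 | 2001 | 0 | 8 | 8 | N | | | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| 17528 | 61 | 2017 | 0 | 5 | 5 | | W, Danville Station # 15 | 1917.0 | 1.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| 10101 | 121 | 2006 | 0 | 24 | 24 | | Thornapple | 1927.0 | 1.4 | 1.6 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |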
