# Take Home Project: Wrangling FERC Form 1

* This task is an example of the kind of work we do to make public energy data usable for analysis.
* We want to be able to explore your general approach together and see how you think about these kinds of problems.
* **Spend 2-4 hours working on it.** This doesn't have to happen all at once. We want you to have time to play with the data, step away from it to think, and then come back to it again.
* Feel free to use whatever documentation or online resources you would normally consult while working on a data wrangling problem.
* Feel free to use additional third-party libraries if you want to. You should be able to install them from within the notebook using `!pip install packagename` or `!conda install packagename`.

## Email us your notebook within a week.
* Send it to [hello@catalyst.coop](mailto:hello@catalyst.coop). (Normally we'd have you make a PR, but we don't want everyone looking at each other's solutions.)
* We'll review your notebook and if it looks good, we'll reach out to schedule a longer conversation about it, and another technical interview.

## Some questions to keep in mind:
* What assumptions are you making about the data?
* Is the raw data well structured?
* How will you test whether / when those assumptions are valid?
* How would you / did you deal with the data that don’t conform to those assumptions?
* If there are records that can’t reasonably be cleaned automatically but are high value in an advocacy context, how would you integrate manual cleaning into the automated process so that the manual effort is captured and can be incrementally improved over time?
* What expectations do you have about the output data?
* What kind of data validation checks would you design to make sure that the output meets your expectations? These could be either integrated into the table transformation process, or run on the final output.
* How do you decide when data isn’t recoverable?
* How will you evaluate the completeness of the data that you’ve been able to extract?
* What kind of queries are you trying to make easy with the structure of the output data?
* What parts of this process might make sense to generalize / abstract for re-use in extracting, cleaning, and reorganizing data from other tables?
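
To make the questions about assumptions and validation checks concrete, here is one possible shape such checks could take. It is only a sketch: the helper `run_checks`, the column names `report_year` and `capacity_mw`, and the thresholds are all hypothetical and chosen for illustration, not taken from the real table.

```python
import pandas as pd

def run_checks(df, checks):
    """Apply each named check to df and count the rows that fail it.

    Each check is a function that takes the DataFrame and returns a boolean
    Series that is True for rows that pass.
    """
    return {name: int((~check(df)).sum()) for name, check in checks.items()}

# Toy data standing in for a cleaned table (hypothetical columns and values).
toy = pd.DataFrame({
    "report_year": [2019, 2020, 1820],
    "capacity_mw": [2.5, -1.0, 4.0],
})

# Each entry encodes one explicit assumption about the output data.
checks = {
    "year_in_plausible_range": lambda d: d["report_year"].between(1990, 2030),
    "capacity_non_negative": lambda d: d["capacity_mw"] >= 0,
}

failures = run_checks(toy, checks)
print(failures)
```

Writing each assumption as a named check makes it easy to report how many records violate it, and to decide case by case whether those records should be fixed, flagged, or dropped.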

# Background on the FERC Form 1 Database
* The FERC Form 1 collects financial data about electric utilities in the US. It’s a treasure-trove of information if you want to understand how these utilities make and spend money. The capital they have locked up in existing fossil fuel infrastructure is one of the big reasons they fight against the transition to clean energy. Data from the FERC Form 1 can help advocates understand which utilities will be easiest to engage in the transition, and which ones may be hopeless pyromaniacs.
* Unfortunately, FERC does not organize its data very well, or do much quality control, so this data is difficult to extract and use. We’ve built a script that pulls together all of FERC’s annual Visual FoxPro databases into a single SQLite database covering all the years of data. Then we write extract and transform functions to pull tables from this multi-year DB and clean them up for easier analysis.
* To help us understand how you approach working with messy data and turning it into something usable, we’d like you to develop a strategy for reshaping and cleaning the data in one of these tables.
* [Here is some documentation about the FERC Form 1 Database](https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/ferc1_db_notes.html), including a mapping between database tables and the pages of the PDF that their data is collected from.

# Set up access to the FERC Form 1 DB
* You can download a copy of our FERC Form 1 SQLite DB from: https://data.catalyst.coop/ferc1.db
* Substitute the path to that file on your system below:

In [None]:
import sqlalchemy as sa
import pandas as pd

FERC1_DB_PATH = "/path/to/your/copy/of/ferc1.db"

ferc1_engine = sa.create_engine(f"sqlite:///{FERC1_DB_PATH}")
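
Before going further, it may be worth confirming that the path really points at the downloaded database, since SQLite will silently create a new, empty file if the path is wrong. The sketch below uses only the standard-library `sqlite3` module and an in-memory stand-in database; the helper `has_table` and the single toy column are illustrative assumptions, not part of our tooling.

```python
import sqlite3

def has_table(conn, table_name):
    """Return True if the SQLite connection exposes a table with this name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None

# Stand-in for the real file: an in-memory DB containing an empty toy table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE f1_gnrt_plant (plant_name TEXT)")

print(has_table(demo_conn, "f1_gnrt_plant"))
print(has_table(demo_conn, "no_such_table"))
```

Against the real file you would connect with `sqlite3.connect(FERC1_DB_PATH)` instead of `":memory:"`.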

# Prepare the FERC Form 1 Small Plants table for quantitative analysis.
* Explore the Small Plants table (named `f1_gnrt_plant` in the FERC 1 DB).
* Refer to the [blank Form 1 (PDF)](https://catalystcoop-pudl.readthedocs.io/en/dev/_downloads/6a316a949a522f595e7575b6fd7034b8/ferc1_blank_2022-11-30.pdf) (pages 410-411) for more context about the table.
* Our goal is to make as much of the information as possible available for easy programmatic analysis.
* Unfortunately, in its raw form this data is only semi-structured.

## Identify issues in the data
* Make a list of issues that would need to be addressed before this table would be ready for analytical use.
* Show us how you identified the issues you highlight, and briefly talk through why they're problematic and how you might approach fixing them.
* Don't worry about cataloging every possible issue, but do try to identify several of the biggest problems.
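
One possible starting point for spotting issues is a per-column profile. The sketch below assumes nothing about the real table: `profile_columns` is a hypothetical helper, and the toy frame and its values are made up for illustration.

```python
import pandas as pd

def profile_columns(df):
    """Summarize dtype, share of missing values, and distinct count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "frac_null": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })

# Toy data with some typical problems: inconsistent case, a missing row,
# and a row that looks like a total rather than a plant.
toy = pd.DataFrame({
    "plant_name": ["Alpha Hydro", "alpha hydro", None, "Total"],
    "capacity_rating": [1.5, 1.5, None, 3.0],
})

profile = profile_columns(toy)
print(profile)
```

A profile like this only points at where to look; reading a sample of the actual rows alongside the blank form is still needed to understand what each oddity means.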

## Tackle one or more of the issues you identified
* Choose one or more of the major issues you identified above, and address them in Python, using pandas and whatever other packages you find useful.
* Imagine these being the first few steps of an ETL pipeline that would ultimately output a tidy, well-structured, analysis-ready database table.
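
If it helps to picture "the first few steps of an ETL pipeline", one common pattern is to write each transformation as a small function and chain them with `DataFrame.pipe`. This is only a sketch under assumptions: the step functions `drop_blank_rows` and `normalize_strings`, the column names, and the toy values are hypothetical, and you are free to structure your work differently.

```python
import pandas as pd

def drop_blank_rows(df, cols):
    """Drop rows where every one of the given columns is missing."""
    return df.dropna(subset=cols, how="all")

def normalize_strings(df, cols):
    """Strip whitespace and lowercase the given text columns."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].str.strip().str.lower()
    return out

# Toy input standing in for a raw table (hypothetical columns and values).
raw = pd.DataFrame({
    "plant_name": ["  Alpha Hydro ", None, "BETA DIESEL"],
    "capacity_mw": [1.5, None, 0.8],
})

# Each step takes a DataFrame and returns a new one, so steps can be
# tested on their own and reordered or reused for other tables.
clean = (
    raw
    .pipe(drop_blank_rows, cols=["plant_name", "capacity_mw"])
    .pipe(normalize_strings, cols=["plant_name"])
    .reset_index(drop=True)
)
print(clean)
```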

In [None]:
# Read the small plants table, ignoring footnote reference columns:
small_plants = pd.read_sql("f1_gnrt_plant", ferc1_engine)
small_plants = small_plants.loc[:, ~small_plants.columns.str.endswith("_f")]

In [None]:
small_plants.info()