# Take Home Project: Wrangling coal mine data

The task is to **write code to extract and clean** some data about coal mines, and **write notes on how to integrate that work into an existing system**.

* This task is an example of the kind of work we do to make public energy data usable for analysis. We want to get at a few different angles:
  * How do you approach building code from scratch?
  * How do you think about messy datasets?
  * How do you think about testing software that deals with said messy datasets?
  * How do you think about integrating code into an existing system?
* **Spend up to two hours working on this.** If you don't finish, don't worry! The point is to have something concrete and technical to talk about.
* Feel free to use whatever documentation or online resources you would normally consult while working on a data wrangling problem.
* Feel free to use additional third-party libraries if you want to. You should be able to install them from within the notebook using `!pip install packagename` or `!conda install packagename`.

## Email us your notebook!
* Send it to [hello@catalyst.coop](mailto:hello@catalyst.coop). (Normally we'd have you make a PR, but we don't want everyone looking at each other's solutions.)
* We'll review your notebook and if it looks good, we'll reach out to schedule a longer conversation about it.

## The longer conversation

If we think your notebook looks good, we'll schedule a 60-minute conversation about it.

We'll ask you to walk us through your Python code for the extraction piece and your English words for the integration piece.

We'll also ask some of the following questions:

* What assumptions are you making about the input data?
  * How will you test whether / when those assumptions are valid?
  * How would you / did you deal with the data that don’t conform to those assumptions?
* What expectations do you have about the output data?
  * How will you evaluate the completeness of the data that you’ve been able to extract?
  * What kind of queries are you trying to make easy with the structure of the output data?
  * What kind of data validation checks would you design to make sure that the output meets your expectations? These could be either integrated into the table transformation process, or run on the final output.
* Did you try anything that didn't work? What was it?
* If there are records that can’t reasonably be cleaned automatically, but are high value in an advocacy context, how would you integrate manual cleaning into the automated process so that the manual effort is captured and can be incrementally improved over time?
* How do you decide when data isn’t recoverable?
* What parts of this process might make sense to generalize / abstract for re-use in extracting, cleaning, and reorganizing data from other tables?
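
To make the validation question above concrete, here is a minimal sketch of checks run on the final output table. The function name `validate_mines` and the column names `mine_id` and `latitude` are hypothetical; the actual checks should follow from whatever assumptions you made about the data.

```python
import pandas as pd


def validate_mines(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed.

    Assumes (hypothetically) a cleaned table with `mine_id` and `latitude` columns.
    """
    problems = []
    # A primary key candidate must be present on every row and unique.
    if df["mine_id"].isna().any():
        problems.append("mine_id has missing values")
    if df["mine_id"].duplicated().any():
        problems.append("mine_id is not unique")
    # Missing coordinates are tolerated here, impossible ones are not.
    latitude = df["latitude"].dropna()
    if not latitude.between(-90, 90).all():
        problems.append("latitude outside [-90, 90]")
    return problems
```

Returning a list of problems instead of raising on the first failure is one design choice among several; it makes it easy to report every failed expectation at once.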

## Background on the MSHA Coal Mine Data

* The Mine Safety and Health Administration (MSHA) collects a variety of information about mines, including who owns them, what and how much they produce, mining methods used, environmental and safety violations, number of employees, and location.
* This information can be helpful for understanding the economic and environmental consequences of shutting down coal-fired power plants. It's especially relevant right now, since the Inflation Reduction Act (IRA) provides tax benefits for clean energy projects in former coal communities. (You can read more about "energy and coal communities" [here](https://www.resources.org/common-resources/what-is-an-energy-community/), but that's not required to answer this interview question.)

## Extract and Clean the MSHA Mine Data Set

Please write code in this notebook to address the following requirements.

### Extract

* Design and implement a function or class that can be used to extract the [MSHA Mines Data Set (ZIP)](https://arlweb.msha.gov/OpenGovernmentData/DataSets/Mines.zip).
    * This function or class should also be adaptable to extracting the other similarly formatted data sources available from the MSHA website, e.g. the [Controller/Operator History (ZIP)](https://arlweb.msha.gov/OpenGovernmentData/DataSets/ControllerOperatorHistory.zip) or [Employment/Production Data Set (ZIP)](https://arlweb.msha.gov/OpenGovernmentData/DataSets/MinesProdYearly.zip).
    * The input to this function can be a URL or local path to the published zipfile. You don't have to worry about handling both.
    * The output should be a pandas dataframe.
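
As an illustration of the shape such a function might take (not a reference solution), the sketch below reads a single text file out of a local zip archive. The name `extract_msha_zip` and its parameters are hypothetical, and the pipe delimiter is an assumption that should be checked against the actual files.

```python
import zipfile

import pandas as pd


def extract_msha_zip(zip_path, member=None, sep="|", encoding="latin-1"):
    """Read one delimited text file from a local zip archive into a DataFrame.

    Assumptions (not confirmed here): the archive holds a single .txt file
    and that file is pipe-delimited.
    """
    with zipfile.ZipFile(zip_path) as archive:
        if member is None:
            text_files = [n for n in archive.namelist() if n.lower().endswith(".txt")]
            if len(text_files) != 1:
                raise ValueError(f"Expected exactly one .txt member, found: {text_files}")
            member = text_files[0]
        with archive.open(member) as raw:
            # Read everything as strings so that no type guessing happens
            # before the cleaning step (e.g. leading zeros in IDs survive).
            return pd.read_csv(raw, sep=sep, encoding=encoding, dtype=str)
```

Keeping the delimiter, encoding, and member name as parameters is what would let the same function be pointed at the other MSHA archives.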

### Transform/Clean
* Take the extracted MSHA Mines data frame and impose some order on it, in preparation for loading it into a well-normalized relational database.
    * Clean the columns you think are required to define a clear structure for the data.
    * Feel free to clean any other columns you think would be helpful from an advocacy or research perspective, but you don't need to clean every column.
    * Any columns you've cleaned should end up with well-defined data types.
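
For a sense of what "well-defined data types" could look like in practice, here is a small sketch on toy data. The column names, the seven-character zero-padded ID, and the `MM/DD/YYYY` date format are all assumptions made for illustration; check them against the Definitions File and the real data before relying on them.

```python
import pandas as pd


def clean_mines(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a typed subset of an all-string mines table (illustrative only)."""
    df = raw.copy()
    df.columns = [str(c).strip().lower() for c in df.columns]
    out = pd.DataFrame(index=df.index)
    # IDs are identifiers, not quantities, so keep them as padded strings.
    out["mine_id"] = df["mine_id"].str.strip().str.zfill(7).astype("string")
    out["mine_name"] = df["current_mine_name"].str.strip().str.title().astype("string")
    # A small, repeated vocabulary fits a categorical type.
    out["status"] = df["current_mine_status"].str.strip().str.lower().astype("category")
    # Unparseable dates and numbers become missing values instead of raising.
    out["status_date"] = pd.to_datetime(
        df["current_status_dt"], format="%m/%d/%Y", errors="coerce"
    )
    out["latitude"] = pd.to_numeric(df["latitude"], errors="coerce")
    return out


toy = pd.DataFrame(
    {
        "MINE_ID": ["100003", "4608436"],
        "CURRENT_MINE_NAME": [" big branch ", "UPPER MINE"],
        "CURRENT_MINE_STATUS": ["Active", "Abandoned "],
        "CURRENT_STATUS_DT": ["01/22/1979", "not a date"],
        "LATITUDE": ["33.5", ""],
    }
)
cleaned = clean_mines(toy)
```

Coercing bad values to missing is convenient but silent, so it is worth counting how many values were lost per column and deciding whether that number is acceptable.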

### Some hints
* The MSHA dataset has a [Definitions File](https://arlweb.msha.gov/OpenGovernmentData/DataSets/Mines_Definition_File.txt) with column type and description information.
* You'll need to use a Latin character encoding (e.g. `encoding="latin-1"` in pandas) when reading the extracted .txt files.

## Write about integrating this with the rest of our codebase

Take a look at the [main Public Utility Data Liberation (PUDL) codebase](https://github.com/catalyst-cooperative/pudl).

It's certainly a little overwhelming to jump into. But we're curious how you'd approach it.

We'd like you to answer the following questions in a notebook cell below. Please show your thought process for the first two questions - we're particularly interested in how you deal with any wrong turns during your investigation.

* Where would you put this MSHA extraction code you just wrote? How would you actually persist the outputs to our database?
* What other data tables do we already integrate that deal with similar data? What would you learn from connecting the datasets?

As a starting point:
* the main body of our source code lives in [`/src/pudl`](https://github.com/catalyst-cooperative/pudl/tree/main/src/pudl)
* we use [Dagster](https://docs.dagster.io/getting-started) as an ETL orchestration tool, so that might help demystify things as well.
* we maintain a [data dictionary](https://catalystcoop-pudl.readthedocs.io/en/stable/data_dictionaries/pudl_db.html) that describes the many data tables we integrate already.