# Notebook Preamble
For this notebook to work you need to have the `pudl_catalog` package installed.

## Installation
To install the most recently released version, run:
```
pip install catalystcoop.pudl_catalog
```
or
```
mamba install -c conda-forge catalystcoop.pudl_catalog
```

If you want to work with the development version in the repository, you can clone it locally and create a conda environment in the top level directory, where `environment.yml` is, and then activate that environment:

```
mamba env create
mamba activate pudl-catalog
```

Or you can install it in editable mode with `pip`:

```
pip install --editable ./
```

## Configuration / Environment
* You need to have configured a Google Cloud Platform project & billing account. See the [PUDL Catalog documentation](https://catalystcoop-pudl-catalog.readthedocs.io/en/latest/) on using public "requester pays" data for more information.
* The catalog makes use of two environment variables: `PUDL_INTAKE_CACHE` and `PUDL_INTAKE_PATH`.
* `PUDL_INTAKE_PATH` is the source location for the catalog data, `gs://intake.catalyst.coop/REF` by default, where `REF` is either the catalog version (e.g. `v2022.06.10`) or `dev` for unreleased versions of the catalog that refer to the nightly PUDL data builds. You should not need to set this environment variable unless you're working with your own copy of the underlying catalog data in some special situation.
* `PUDL_INTAKE_CACHE` is the path to the directory where Intake will cache the data locally. This directory needs to exist -- it won't be created if it doesn't. By default it's set to `$HOME/.cache/intake`.
* If you need to set either of these environment variables to custom values, it must be done before the `intake` package is imported by Python, since paths within the catalog are set at import.
* That means it can be done either in the environment where you're running Jupyter beforehand, or in the notebook itself prior to the `import intake` line.
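
Because the cache directory is not created automatically, it can help to create it and set the variable in one step before importing `intake`. The following is a minimal sketch; `configure_pudl_cache` is a hypothetical helper written for this notebook, not part of `pudl_catalog`.

```python
import os
from pathlib import Path


def configure_pudl_cache(cache_dir):
    """Create the cache directory and point PUDL_INTAKE_CACHE at it.

    Hypothetical helper (not part of pudl_catalog). Call it before
    `import intake`, since catalog paths are set at import time.
    """
    cache_dir = Path(cache_dir).expanduser()
    # The directory must already exist when the catalog is used.
    cache_dir.mkdir(parents=True, exist_ok=True)
    os.environ["PUDL_INTAKE_CACHE"] = str(cache_dir)
    return cache_dir


# Example usage with the documented default location:
# configure_pudl_cache(Path.home() / ".cache" / "intake")
# import intake
```

Setting the variable in the shell that launches Jupyter works equally well; the helper only matters when you configure things from inside the notebook.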

# Data is cached locally
* The contents of the data catalogs are only updated occasionally, so it doesn't make sense to download the whole dataset again every time you want to access it, or to directly access the version that's stored in the cloud.
* Downloading data from cloud storage also incurs a small cost ($0.10-0.20/GB).
* By default the PUDL Catalog creates a local copy (cache) of the data the first time you access it.
* Where exactly this cached data is stored is determined by the `PUDL_INTAKE_CACHE` environment variable.
* Subsequent access will refer to this local copy rather than the remote data.
* When a new version of the catalog is released, and you upgrade your installation of the `catalystcoop.pudl_catalog` package, the new data will be downloaded locally again when you attempt to access it for the first time.
* Each of the SQLite databases is about 1 GB, and the EPA CEMS dataset is about 5 GB, so this may take a few minutes, depending on the speed of your network connection.
* Once the data has been cached, subsequent access should be much faster.
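
To get a feel for the numbers above, here is a rough back-of-the-envelope estimate of the first download. It assumes two SQLite databases of about 1 GB each (the `pudl` and `ferc1` entries used in this notebook) plus the roughly 5 GB EPA CEMS dataset, and the $0.10-0.20/GB range quoted above. The function name is made up for illustration, and actual charges depend on your Google Cloud billing.

```python
def estimate_download_cost(total_gb, low_rate=0.10, high_rate=0.20):
    """Return a (low, high) estimate in USD for downloading `total_gb` once."""
    return round(total_gb * low_rate, 2), round(total_gb * high_rate, 2)


# Assumption: two ~1 GB SQLite databases plus the ~5 GB EPA CEMS dataset.
total_gb = 2 * 1 + 5
low, high = estimate_download_cost(total_gb)
print(f"~{total_gb} GB -> ${low:.2f} to ${high:.2f}")
```

Because the data is cached, this cost is only paid again when a new catalog version triggers a fresh download.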

In [None]:
%load_ext autoreload
%autoreload 3

In [None]:
# Standard Library Imports
import logging
import os
import sys
from pathlib import Path

logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter("%(message)s")
handler.setFormatter(formatter)
logger.handlers = [handler]

# Where to cache downloaded data locally. Defaults to ~/.cache/intake
# os.environ["PUDL_INTAKE_CACHE"] = str(Path.home() / ".cache/intake")

# You can override the default path to the data in your environment, if you have it.
# By default it reads from Google Cloud Storage:
# os.environ["PUDL_INTAKE_PATH"] = "gs://intake.catalyst.coop/dev"

# 3rd Party Imports:
import intake
import pandas as pd
from pudl_catalog.helpers import year_state_filter


# Explore installed Intake catalogs
* When you install and import `intake`, it provides a built-in (but empty) catalog at `intake.cat`
* If other Intake catalog packages (like `catalystcoop.pudl_catalog`) are installed as well, they register their existence with the top-level Intake catalog.
* Listing the built-in catalog will show you which (sub-)catalogs are available, in this case including `pudl_cat`:

In [None]:
list(intake.cat)

## Catalog Level Metadata
* To avoid going through the main Intake catalog every time, we can store a reference to the PUDL Catalog in a variable.
* Looking at the text representation of the catalog, you can see high level information about it:

In [None]:
pudl_cat = intake.cat.pudl_cat
pudl_cat

## The contents of `pudl_cat`
* Listing that catalog will show us the data sources it contains by name.
* These sources can also be sub-catalogs nested within it, as is the case for the catalog entries which represent whole SQL databases with multiple tables.

In [None]:
list(pudl_cat)

## Inspecting an individual data source
* Some of the entries in the PUDL Catalog are data sources.
* In our case this means they represent a particular tabular dataset.
* One example is the EPA CEMS hourly emissions data, which is stored in Apache Parquet files.
* Looking at that catalog entry, we can see some metadata related to the source.

In [None]:
pudl_cat.hourly_emissions_epacems

## Inspecting an SQLite sub-catalog
* Several of the PUDL Catalog entries are entire databases containing many separate tables.
* These databases are each used to populate a whole sub-catalog, with each table in the database being represented by a data source within that catalog.
* The top level catalog representation shows some basic metadata.
* The first time you access this sub-catalog, it should download the data and cache it locally. It's about 1 GB, so it could take a couple of minutes depending on the speed of your network connection. Subsequent access will be much faster.
* If you don't have GCP / Requester Pays set up correctly, this is the first place that would cause a problem.

In [None]:
pudl_cat.pudl

## Identifying sources within an SQLite sub-catalog
* As with the top level PUDL Catalog (or any Intake catalog), looking at the `list()` representation of the catalog will show you all the available sources within it.
* In the case of an SQL database derived catalog, each table becomes its own independent data source:

In [None]:
list(pudl_cat.pudl)

# Reading data from the PUDL Catalog
* Every source exists as an attribute of the catalog.
* You can see what form it will be returned in by looking at the `.container` attribute.
* In our case every data source is going to be returned as a dataframe.
* You can also look at the `.container` attribute to differentiate between data sources and sub-catalogs.
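
As a sketch of that last point, the helper below sorts entries into sub-catalogs and data sources based on `.container`. It assumes that sub-catalogs report `"catalog"` and tabular sources report `"dataframe"`; `split_entries` is a hypothetical name, and the stand-in objects exist only so the example runs without network access.

```python
from types import SimpleNamespace


def split_entries(catalog_entries):
    """Split a {name: entry} mapping into (sub-catalog names, source names)."""
    subcatalogs, sources = [], []
    for name, entry in catalog_entries.items():
        # Assumption: nested catalogs identify themselves as "catalog".
        if entry.container == "catalog":
            subcatalogs.append(name)
        else:
            sources.append(name)
    return sorted(subcatalogs), sorted(sources)


# Stand-ins for real catalog entries, using names that appear in this notebook.
fake_entries = {
    "pudl": SimpleNamespace(container="catalog"),
    "ferc1": SimpleNamespace(container="catalog"),
    "hourly_emissions_epacems": SimpleNamespace(container="dataframe"),
}
print(split_entries(fake_entries))
```

With the real catalog, something like `split_entries({name: pudl_cat[name] for name in pudl_cat})` should behave the same way, provided the entries expose `.container` as in the cells below.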

In [None]:
pudl_cat.pudl.fuel_ferc1.container

In [None]:
pudl_cat.hourly_emissions_epacems.container

In [None]:
pudl_cat.ferc1.container

In [None]:
pudl_cat.ferc1.f1_steam.container

## Reading an SQL catalog source into a Pandas dataframe
* The SQL table data sources have a `.read()` method that will read the whole table into a dataframe.

In [None]:
fuel_ferc1 = pudl_cat.pudl.fuel_ferc1.read()
fuel_ferc1

## Reading data directly from SQLite
* If you need to query the underlying database rather than reading an entire table, the sub-catalog's `uri` attribute is an SQLAlchemy URI that you can use to connect to it.
* However, this isn't the recommended usage pattern, and it will only work if the data has already been cached locally.
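
Since direct queries only work against a locally cached file, it may be worth checking for that file first. The sketch below assumes the URI has the usual SQLite form `sqlite:///<path>`; both function names are hypothetical and not part of `pudl_catalog` or SQLAlchemy.

```python
from pathlib import Path

SQLITE_PREFIX = "sqlite:///"


def sqlite_uri_to_path(uri):
    """Extract the file path from a `sqlite:///<path>` style URI."""
    if not uri.startswith(SQLITE_PREFIX):
        raise ValueError(f"Not a file-based SQLite URI: {uri}")
    return Path(uri[len(SQLITE_PREFIX):])


def is_cached(uri):
    """Return True if the database file behind the URI exists locally."""
    return sqlite_uri_to_path(uri).is_file()
```

For example, `is_cached(pudl_cat.pudl.uri)` returning `False` would suggest accessing `pudl_cat.pudl` through the catalog first so that the download and caching can happen.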

In [None]:
pudl_cat.pudl.uri

In [None]:
import sqlalchemy as sa
engine = sa.create_engine(pudl_cat.pudl.uri)
sql = """
SELECT utility_id_ferc1, plant_id_ferc1, report_year, plant_name_ferc1, capacity_mw
  FROM plants_steam_ferc1
 LIMIT 10
"""
df = pd.read_sql(sql, engine)
df

## Reading Parquet data into a Dask dataframe
* The Parquet datasets are typically too large to fit in memory. EPA CEMS is about a billion rows.
* Rather than reading the entire table all at once, we can select subsets using filters.
* If we want to operate on the entire dataset, we can also use Dask to serialize or distribute computations, only returning a Pandas dataframe once the data has been consolidated or aggregated to a reasonable scale.
* For more on how to work with Dask, you can check out [this self-guided tutorial](https://coiled.io/blog/how-to-learn-dask-in-2021/).
* Here we create a Dask dataframe, but we don't compute its contents yet.
* However, if this is the first time you're accessing the data, this query will trigger the download and local caching of the entire dataset, so it may still take a couple of minutes to run.
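
Parquet row filters are commonly written in disjunctive normal form: an outer list of alternatives (OR), each of which is an inner list of `(column, operator, value)` conditions that must all hold (AND). The sketch below shows how a year and state selection could be expanded into that form. It is an illustration only, not necessarily how `year_state_filter` is implemented, and it assumes the columns are named `year` and `state`.

```python
from itertools import product


def sketch_year_state_filter(years=(), states=()):
    """Build DNF filters selecting every requested (year, state) combination."""
    year_terms = [("year", "=", year) for year in years]
    state_terms = [("state", "=", state) for state in states]
    if year_terms and state_terms:
        # One AND-group per (year, state) pair; the groups are OR-ed together.
        return [list(pair) for pair in product(year_terms, state_terms)]
    if year_terms or state_terms:
        # Only one kind of condition was given: each term is its own group.
        return [[term] for term in year_terms + state_terms]
    # No conditions at all means no filtering.
    return None


for group in sketch_year_state_filter(years=[2018, 2020], states=["ID", "ME"]):
    print(group)
```

Two years and two states expand into four AND-groups, so only the data matching those four combinations needs to be read.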

In [None]:
%%time
# Read a couple of years of data for a couple of states into a dataframe
TEST_YEARS = [2018, 2020]
TEST_STATES = ["ID", "ME"]
TEST_FILTERS = year_state_filter(years=TEST_YEARS, states=TEST_STATES)
display(TEST_FILTERS)
epacems_dd = (
    pudl_cat.hourly_emissions_epacems(filters=TEST_FILTERS)
    .to_dask()
)
epacems_dd

In [None]:
epacems_df = epacems_dd.compute()
epacems_df.sample(20)