Notebook land: intro notebooks for CEMS and output tables #823

Merged: 35 commits into sprint27 on Dec 11, 2020

Conversation

aesharpe (Member):

This branch contains the work-in-progress notebooks related to RMI work and dataset tours.

CEMS_by_utility.ipynb is what I used to aggregate CEMS data by utility for RMI. For now, you can ignore this (it was just a tool for me).

explore-CEMS.ipynb is the CEMS intro notebook, woo! This one is close to done (with a few commented-out map portions re: Hawaii and Alaska that I wasn't sure how to adapt to the new map).

explore-output-tables.ipynb is the beginnings of an intro-to-output-tables notebook. This one still needs lots of work / direction / correction. Should I give examples of all the tables, or should it just serve as a hub of information about the tables (either permanently, or until we get metadata for the output tables)?

aesharpe changed the title from "Notebook land" to "Notebook land: intro notebooks for CEMS and output tables" on Nov 12, 2020
codecov bot commented on Nov 12, 2020:

Codecov Report

Merging #823 (ed5266e) into sprint27 (ce1ea8f) will increase coverage by 0.34%.
The diff coverage is 20.00%.


@@             Coverage Diff              @@
##           sprint27     #823      +/-   ##
============================================
+ Coverage     70.31%   70.66%   +0.34%     
============================================
  Files            39       40       +1     
  Lines          4871     4887      +16     
============================================
+ Hits           3425     3453      +28     
+ Misses         1446     1434      -12     
Impacted Files                            Coverage Δ
src/pudl/analysis/service_territory.py   21.77% <0.00%> (ø)
src/pudl/output/eia860.py                100.00% <ø> (ø)
src/pudl/output/pudltabl.py              58.25% <0.00%> (ø)
src/pudl/output/epacems.py               25.00% <25.00%> (ø)
src/pudl/workspace/datastore.py          61.05% <0.00%> (+12.63%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Review comment on the notebook (Member):

Let's see if we can use the automatically generated Jupyter table of contents instead of trying to maintain one by hand, which may get frustrating with edits to the notebook over time. It's currently a plugin, but it will be integrated into the JupyterLab core in 3.0 (which is coming out momentarily).



Reply (Member Author):

oooh yes! that would make a big difference

Reply (Member):

If you don't have it installed already, the plugin is called jupyterlab-toc (you can install it directly from the plugin manager on the left-hand side of JupyterLab -- it looks like a puzzle piece).

Review comment on the notebook (Member):

geoplot isn't part of the pudl-dev environment, and doesn't seem to be getting used below. Do we need it? It caused this cell to fail for me.



Reply (Member Author):

Oh yeah, that was left over from some other maps that didn't make the cut.

Review comment on the notebook (Member):

I think we will always want users to access only the root of the parquet dataset -- Dask and other tools that know how to work with Parquet can handle the hierarchy / partitioning efficiently. See my edits to the notebook.


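As a minimal sketch, assuming a dataset root at parquet/epacems and a few illustrative EPA CEMS column names (not the notebook's actual code), reading from the root looks like:

    import dask.dataframe as dd

    # Point Dask at the dataset root rather than individual files; it
    # discovers the partition hierarchy itself and reads lazily.
    epacems = dd.read_parquet(
        "parquet/epacems",  # hypothetical dataset root
        columns=["plant_id_eia", "operating_datetime_utc", "gross_load_mw"],
    )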

Review comment on the notebook (Member):

This is a little bit dangerous -- it'll trigger the full computation of any tasks that exist in the dataframe. In this case it works okay because you've only just read it in, and the number of records is stored in the parquet file metadata, but in general this could be very computationally intensive.


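To illustrate the difference (variable names carried over from the sketch above):

    # len() on a Dask dataframe executes the entire task graph:
    n_rows = len(epacems)

    # A common alternative is to count rows per partition and sum the
    # counts -- still a computation, but it never materializes the
    # whole dataframe in memory.
    n_rows = epacems.map_partitions(len).compute().sum()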

Reply (Member Author):

hmmm good to know.

Review comment on the notebook (Member):

You can also do dd.dtypes to get a list of both the column names and their data types, which is a bit more informative.


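For example, with the lazy dataframe from the earlier sketch (.dtypes only reads metadata, so it triggers no computation):

    epacems.columns  # column names only
    epacems.dtypes   # column names plus their data types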

Reply (Member Author):

better, agreed.

Review comment on the notebook (Member):

This kind of iteration and aggregation is what Dask is designed to do automatically. If you point it at the top level of the dataset, it will do all this work for you seamlessly under the hood.


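A sketch of that pattern, with illustrative column names (the state and CO2 mass columns are assumptions about the CEMS schema):

    import dask.dataframe as dd

    epacems = dd.read_parquet("parquet/epacems")  # hypothetical root
    # One lazy groupby over the whole dataset replaces the manual
    # per-file loop; .compute() is the only eager step.
    co2_by_state = epacems.groupby("state")["co2_mass_tons"].sum().compute()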

Review comment on the notebook (Member):

Man, is Texas really that bad? This seems like a crazy outlier. Might there be some bad data somewhere? It might be more illustrative to plot the GHG emissions per capita on a state-by-state basis, since you've got the census data handy already.


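A hypothetical sketch of that per-capita view, reusing co2_by_state from the sketch above (the census dataframe and its column names are assumptions):

    # Divide each state's CO2 total by its census population.
    population = census_df.set_index("state")["population"]
    co2_per_capita = (co2_by_state / population).sort_values(ascending=False)
    co2_per_capita.head(10)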

Review comment on the notebook (Member):

Hmm yeah, these shapefiles are a little janky, aren't they, since they include coastal waters, not just the land boundaries. Let's use contextily to get a nice basemap to display this info.


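Roughly along these lines, assuming a GeoDataFrame of state geometries called states_gdf (a sketch, not the notebook's code):

    import contextily as ctx

    # Reproject to Web Mercator so the polygons line up with the map
    # tiles, then draw the basemap underneath them.
    ax = states_gdf.to_crs(epsg=3857).plot(alpha=0.5, edgecolor="k", figsize=(12, 8))
    ctx.add_basemap(ax)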

Review comment on the notebook (Member):

For time-series based aggregation, take a look at the pandas.Grouper documentation -- there's a bunch of time-specific functionality that knows how to work with datetime columns directly at whatever frequency you're interested in.


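For instance, a monthly aggregation might look like this (the dataframe and column names are assumptions):

    import pandas as pd

    # Group on the datetime column at monthly frequency, and by state.
    monthly_load = (
        cems_df.groupby([pd.Grouper(key="operating_datetime_utc", freq="MS"), "state"])
        ["gross_load_mw"]
        .sum()
    )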

Review comment on the notebook (Member):

I think we could make a really cool comparative seasonal load chart here with seaborn -- with a grid of different states and a statistical envelope showing the monthly load for each state across all of the years of data.


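A rough sketch of that figure (the tidy dataframe and its columns are hypothetical; seaborn computes the envelope from the repeated yearly observations):

    import seaborn as sns

    # One panel per state; ci="sd" draws a standard-deviation band
    # around the mean monthly load across years.
    g = sns.relplot(
        data=monthly_load_df,  # columns: state, month, year, gross_load_mw
        x="month", y="gross_load_mw",
        col="state", col_wrap=4,
        kind="line", ci="sd",
    )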

Reply (Member Author):

ooo that would be cool!

zaneselvans merged commit 3c72fc2 into sprint27 on Dec 11, 2020