Integrate EIA 860 M into ETL #824
Conversation
working on Issue #811
Codecov Report
@@ Coverage Diff @@
## sprint27 #824 +/- ##
============================================
+ Coverage 70.47% 71.18% +0.71%
============================================
Files 40 41 +1
Lines 4897 4941 +44
============================================
+ Hits 3451 3517 +66
+ Misses 1446 1424 -22
Continue to review full report at Codecov.
I have added a few nits/cosmetic comments; the bulk of what was asked is in the biggest comment added to extract/excel.py. TL;DR: I think we should be using `**kwargs` for the partition and make excel generic to the point it does not actually care what the names of these partition attributes are.
src/pudl/constants.py
Outdated
@@ -1633,6 +1633,7 @@
 # The full set of years we currently expect to be able to ingest, per source:
 working_years = {
     'eia860': tuple(range(2009, 2020)),
+    'eia860m': ["2020-08"],
This feels a bit like a hack. We are storing a completely different type (list instead of tuple, string instead of number, and yyyy-mm instead of a year), and there's no easy way to know that eia860m is different from the rest.
Yes, I was wondering about this. I suppose it would be better to have `'eia860m': tuple(['2018-10'])` here, but maybe it doesn't make sense at all to have this year-month inside of `working_years`, because it is not really a year at all.
I don't think there's any reason it needs to be a collection at all, is there? If it's only ever allowed to be a single value, we should just store the single value. But anything that's being stored and used as an argument for other functions, like in the other datasets here, needs to be non-mutable (i.e. tuple instead of list) because otherwise the arguments can get accidentally changed in one place, and that change will appear everywhere the value is referenced since it has a scope outside of any of the functions.
I went ahead and changed `working_years` to `working_partitions`, and within each data set is another dictionary with the name of the partition as the keys and the disparate partition types as the values.
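Judging from references elsewhere in this PR (`pc.working_partitions['eia923']['years']` and `pc.working_partitions['eia860m']['year_month']`), the reworked constant presumably looks something like this sketch; the exact year ranges are illustrative:

```python
# Hypothetical sketch of the new working_partitions constant in
# src/pudl/constants.py; the year ranges here are placeholders.
working_partitions = {
    'eia860': {'years': tuple(range(2009, 2020))},
    'eia923': {'years': tuple(range(2009, 2020))},
    # eia860m is partitioned by year-month rather than year, and is a
    # single value rather than a collection.
    'eia860m': {'year_month': '2020-08'},
}
```

Each dataset maps partition names to whatever value type fits that partition, so callers can introspect the partition scheme instead of assuming years.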
                      eia_transformed_dfs,
                      entities_dfs,
                      debug=False):
    def _manage_strictness(col, eia860_ytd):
nit: I would probably call this something like `_get_column_strictness`, as the word "manage" implies (at least to me) some side-effects that go beyond returning a dict with strictness targets for each column.
src/pudl/extract/excel.py
Outdated
         return {}

-    def extract(self, years):
+    def extract(self, partitions):
Okay. I think I understand it now. I think the best approach here may be to consider that the partition is a dict-like structure that assigns a list of values to each named key (such as `year` or `year_month`).

The cleanest way to support such an approach would be to change the signature of this function to `def extract(self, **partition)`, then expand this into a sequence of single-valued mappings (e.g. `year=[1,2,3]` will turn into `[{year=1}, {year=2}, {year=3}]`), and then plumb this all the way down to `self.ds.get_resources(self._dataset_name, **partition)`.

The above generic approach could support composite keys (e.g. `partition=dict(year=2018, month=8)`), but this would require reorganizing our excel metadata files (right now it assumes that the partition matches the column name), or at the least adding a function that maps a structured partition to a metadata column name (e.g. `meta_col = lambda **p: f'{p["year"]}_{p["month"]}'`). Composite partition keys are more of a long-term consideration and do not immediately apply.
Do you mean passing kwargs into extract like `extract(year=1, year=2, year=3)`, or passing in `extract(year=[1,2,3])`? For the former, I'm not sure how you would distinguish between the years. And if it is the latter, I'm not sure how to convert `extract`'s kwargs into `get_resources`'s kwargs.
You can't pass a named argument more than once, so it'd have to be `years=[1,2,3]`. My idea was to add a helper function that will fan out the multi-valued partition into a sequence of single-valued assignments, e.g. `{month=9, year=[1,2]} --> [{month=9, year=1}, {month=9, year=2}]`.
Here is the helper function:

import itertools

def iterate_multivalue_dict(**kwargs):
    """Fan multi-valued kwargs out into a sequence of single-valued dicts."""
    single_valued = {k: v for k, v in kwargs.items() if not isinstance(v, list)}
    # Transform multi-valued {k: vlist} into {k1: [{k1: v1}, {k1: v2}, ...], k2: [...], ...}
    multi_valued = {k: [{k: v} for v in vlist]
                    for k, vlist in kwargs.items() if isinstance(vlist, list)}
    for value_assignments in itertools.product(*multi_valued.values()):
        result = dict(single_valued)
        for k_v in value_assignments:
            result.update(k_v)
        yield result
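To make the fan-out concrete, here is a self-contained check of the behavior described above (the helper is duplicated so the snippet runs on its own):

```python
import itertools

def iterate_multivalue_dict(**kwargs):
    # Same helper as above: fan multi-valued kwargs out into single-valued dicts.
    single_valued = {k: v for k, v in kwargs.items() if not isinstance(v, list)}
    multi_valued = {k: [{k: v} for v in vlist]
                    for k, vlist in kwargs.items() if isinstance(vlist, list)}
    for value_assignments in itertools.product(*multi_valued.values()):
        result = dict(single_valued)
        for k_v in value_assignments:
            result.update(k_v)
        yield result

print(list(iterate_multivalue_dict(month=9, year=[1, 2])))
# → [{'month': 9, 'year': 1}, {'month': 9, 'year': 2}]
```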
test/etl_test.py
Outdated
     def test_extract_eia923(self, pudl_datastore_fixture):
         """Spot check extraction eia923 excel files."""
         extractor = pudl.extract.eia923.Extractor(pudl_datastore_fixture)
         assert "Page 3 Boiler Fuel Data" in extractor.load_excel_file(
-            2018, "stocks").sheet_names
+            part=2018, page="stocks").sheet_names
This is another reason why using `**kwargs` might be cleaner here. Instead of a vague `part=$somevalue`, using `year=2018` or `year_month="2018-08"` makes it more apparent what we are passing through.
I very much like the idea of not needing to specify exactly what all of the partitions are, since the possibilities will probably change over time (e.g. epacems has year + state, ferceqr has year + quarter, the LMP data may be partitioned by which grid operator it came from, etc.) and doing the right thing can be left to the extract function, which is going to be dataset specific. But this will affect the metadata that we use to specify filename / column / tab maps for the spreadsheets.
standardizes the working_years to be working_partitions. Adds in new plants to the mapping sheet
src/pudl/extract/excel.py
Outdated
-        return self._sheet_name.at[page, str(year)]
+    def get_sheet_name(self, page, **partition):
+        """Returns name of the excel sheet that contains the data for given partition and page."""
+        return self._sheet_name.at[page, str(list(partition.values())[0])]
This feels like a hack... is there a better way to unpack the kwarg here? At this stage it should always be set up as `partition_name=partition` (`year=2019` or `year_month="2020-08"`).
This indeed feels like a hack. The implicit assumption here is that `**partition` contains exactly one `key=value` pair, and we use the stringified value when pulling the sheet name from the pandas dataframe. I would make this assumption explicit in the form of an `AssertionError`:

if len(partition) != 1:
    raise AssertionError(f"Expecting exactly one partition attribute (found: {partition})")
return self._sheet_name.at[page, str(list(partition.values())[0])]
I now see that this value extraction happens multiple times, so I would suggest creating a helper function `_get_partition_key(**partitions)` that will do that. Such an approach could easily be modified if we ever encounter multi-dimensional partitions (e.g. year, month), where you could construct keys in the form of `"month=1,year=2019"` and use these for indexing pandas dataframes.
this is a great idea!
         """Extracts dataframes.

         Returns dict where keys are page names and values are
         DataFrames containing data across given years.

         Args:
-            years (list): list of years to extract.
+            partitions (list, tuple or string): list of partitions to
I tried using the `**partitions` format here for the docs but I got a flake8 error. Is the right way to document kwargs to just list them like normal?... that seems weird.
src/pudl/extract/excel.py
Outdated
@@ -152,12 +152,13 @@ def __init__(self, ds):
         self._file_cache = {}
         self.ds = ds

-    def process_raw(self, df, year, page):
+    def process_raw(self, df, partition, page):
It is strange that `process_renamed` and `process_raw` handle partition differently (one uses `**kwargs` while the other expects a dict in a positional argument), in addition to page and partition having a different order in the two cases.
         # we are going to skip
-        if self.excel_filename(yr, page) == '-1':
+        if self.excel_filename(page, **partition) == '-1':
Now that I think of it, a string-valued `'-1'` is a very non-standard way of encoding a missing value. We should just return None here.
This is a very good point. I'm going to add an issue for that.
src/pudl/output/pudltabl.py
Outdated
@@ -82,15 +82,14 @@ def __init__(self, pudl_engine, ds=None, freq=None, start_date=None, end_date=No
         if start_date is None:
             self.start_date = \
                 pd.to_datetime(
-                    f"{min(pc.working_years['eia923'])}-01-01")
+                    f"{min(pc.working_partitions['eia923']['years'])}-01-01")
Hm. I was thinking about how to make this conversion better and thought about using `pd.Timestamp`, but that did not work. I do think, however, that this could be done better by using the following code instead:

self.start_date = pd.to_datetime(min(pc.working_partitions['eia923']['years']), format='%Y')
src/pudl/output/pudltabl.py
Outdated
-            pd.to_datetime(
-                f"{max(pc.working_years['eia923'])}-12-31")
+        self.end_date = pd.to_datetime(
+            pc.working_partitions['eia860m']['year_month'])
It is somewhat confusing that we are pulling the date ranges from different working_partitions. The start is given by eia923, the end by eia860m. Is this guaranteed to be logically correct?
Here we are basically trying to set the default start and end dates based on the min and max EIA dates. So these are different because 923 has the oldest year right now and 860m has the newest.

To make this clear I'm thinking about incorporating a little helper function that grabs all EIA dates from `pc.working_partitions`, so we can just grab the max/min.
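A rough sketch of what that helper could look like, assuming the `working_partitions` shape used in this PR (tuples of years, plus eia860m's `year_month` string); the function name is hypothetical:

```python
import pandas as pd

def get_eia_date_range(working_partitions):
    """Collect all EIA dates from working_partitions and return (min, max)."""
    dates = []
    for parts in working_partitions.values():
        for year in parts.get('years', ()):
            dates.append(pd.to_datetime(f"{year}-01-01"))
        if 'year_month' in parts:
            # A 'YYYY-MM' string parses to the first day of that month.
            dates.append(pd.to_datetime(parts['year_month']))
    return min(dates), max(dates)

partitions = {
    'eia923': {'years': (2009, 2019)},
    'eia860m': {'year_month': '2020-08'},
}
start, end = get_eia_date_range(partitions)
# start == Timestamp('2009-01-01'), end == Timestamp('2020-08-01')
```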
@zaneselvans: please give this a good lookin' over. Generally speaking this PR adds the metadata and extractor for EIA 860 M and incorporates this into the ETL process. There is now an eia860m extract module which extracts 860m and appends it to the 860 data. Some notes:

- `data_source` column in the generators table with just the data source code (eia860 or eia860m). I haven't added an indicator of which eia860m this is yet... I don't think our old little datasets table exists anymore and it felt weird to recreate it because the information should be populating in the datapackage metadata and having this table would be another duplication of data. idk. We should make some decisions on what is best for this.
- `manage_strictness` function to help with this column as well as the plant and utility name from a while back.
- `noqa: C901` exceptions 🎉

@rousik: I'd love to have your perspective on whether there is a cleaner way to call `ds.get_resource()` inside the standard excel extractor with the format of `partition_name=part` (i.e. `year=2019` or `year_month='2020-08'`). Right now I got around this just by overriding `get_datapackage_resources()` in the non-year dataset specific extractor. My original thought was to pass a dictionary into the excel extractor like `{'year': [2009,2010,2011]}` or `{'year_month': ['2020-08']}` and somehow using the keys to call `ds.get_resource()` properly, but I couldn't work out the mechanics of that.

Working on Issue #811