
Integrate EIA 860 M into ETL #824

Merged
merged 18 commits into from
Dec 7, 2020

Conversation


@cmgosnell cmgosnell commented Nov 12, 2020

@zaneselvans:
please give this a good lookin' over. Generally speaking, this PR adds the metadata and extractor for EIA 860 M and incorporates them into the ETL process. There is now an eia860m extract module which extracts the 860M data and appends it to the 860 data. Some notes:

  • This is not currently integrated into the tests. As I see it, the only way to do that would be to run the YTD data in the standard tests, which would not be terrible but is slightly outside our current convention. I also have a few little data sanity tests in a notebook that rely on having the YTD 2020 data and the 2019 data. I assume these should live in the validation tests, not our CI tests, but idk.
  • I did add a data_source column in the generators table with just the data source code (eia860 or eia860m). I haven't added an indicator of which eia860m release this is yet... I don't think our old little datasets table exists anymore, and it felt weird to recreate it because that information should be populated in the datapackage metadata; having this table would be another duplication of data. idk. We should make some decisions on what is best for this.
  • The harvesting process required a little revamp because the longitude column in 860M has an additional decimal point and is just slightly off compared to the rest of the data. I added a little manage_strictness function to handle this column, as well as the plant and utility names from a while back.
  • I also cleaned up some of the etl parameter validation functions and got rid of a few of the noqa: C901 exceptions 🎉
  • ALSO, this does not deal with the outputs at all. That is for issue update pudl_out to integrate EIA 860 M #812.

@rousik:
I'd love your perspective on whether there is a cleaner way to call ds.get_resource() inside the standard Excel extractor with the format partition_name=part (i.e. year=2019 or year_month='2020-08'). Right now I got around this by overriding get_datapackage_resources() in the non-year, dataset-specific extractor. My original thought was to pass a dictionary into the Excel extractor, like {'year': [2009, 2010, 2011]} or {'year_month': ['2020-08']}, and somehow use the keys to call ds.get_resource() properly, but I couldn't work out the mechanics of that.

Working on Issue #811

@cmgosnell cmgosnell added this to the PUDL Sprint 27 milestone Nov 12, 2020
@cmgosnell cmgosnell linked an issue Nov 12, 2020 that may be closed by this pull request

codecov bot commented Nov 13, 2020

Codecov Report

Merging #824 (5e422b0) into sprint27 (c462d87) will increase coverage by 0.71%.
The diff coverage is 92.86%.


@@             Coverage Diff              @@
##           sprint27     #824      +/-   ##
============================================
+ Coverage     70.47%   71.18%   +0.71%     
============================================
  Files            40       41       +1     
  Lines          4897     4941      +44     
============================================
+ Hits           3451     3517      +66     
+ Misses         1446     1424      -22     
Impacted Files Coverage Δ
src/pudl/transform/ferc1.py 91.18% <ø> (ø)
src/pudl/workspace/datastore.py 48.42% <0.00%> (ø)
src/pudl/output/pudltabl.py 58.45% <50.00%> (+0.20%) ⬆️
src/pudl/extract/eia861.py 95.83% <83.33%> (+3.83%) ⬆️
src/pudl/extract/excel.py 92.98% <89.47%> (-1.25%) ⬇️
src/pudl/etl.py 81.00% <96.88%> (+3.08%) ⬆️
src/pudl/constants.py 100.00% <100.00%> (ø)
src/pudl/extract/eia860.py 100.00% <100.00%> (+5.56%) ⬆️
src/pudl/extract/eia860m.py 100.00% <100.00%> (ø)
src/pudl/extract/eia923.py 100.00% <100.00%> (ø)
... and 6 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c462d87...5e422b0.

@rousik (Collaborator) left a comment:

I have added a few nits/cosmetic comments; the bulk of what you asked about is in the big comment on extract/excel.py. TL;DR: I think we should be using **kwargs for the partition and make the Excel extractor generic to the point that it does not actually care what the names of these partition attributes are.

@@ -1633,6 +1633,7 @@
# The full set of years we currently expect to be able to ingest, per source:
working_years = {
'eia860': tuple(range(2009, 2020)),
'eia860m': ["2020-08"],
Collaborator:

This feels a bit like a hack. We are storing a completely different type (a list instead of a tuple, a string instead of a number, and yyyy-mm instead of a year), and there's no easy way to know that eia860m is different from the rest.

Member Author:

Yes, I was wondering about this. I suppose it would be better to have 'eia860m': tuple(['2018-10']) here, but maybe it doesn't make sense at all to have this year-month inside working_years, because it is not really a year at all.

Member:

I don't think there's any reason it needs to be a collection at all, is there? If it's only ever allowed to be a single value, we should just store the single value. But anything that's stored and used as an argument to other functions, like the other datasets here, needs to be immutable (i.e. a tuple instead of a list); otherwise the arguments can get accidentally changed in one place, and that change will appear everywhere the value is referenced, since it has a scope outside of any of the functions.

Member Author:

I went ahead and changed working_years to working_partitions; within each dataset is another dictionary with the partition names as keys and the disparate partition types as values.
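As a sketch, the restructured constant might look like this (the key names follow the thread above, but the exact values are illustrative, not the real contents of pudl.constants):

```python
# Illustrative shape of the new working_partitions constant: each dataset
# maps partition names to that dataset's partition values.
working_partitions = {
    'eia860': {'years': tuple(range(2009, 2020))},
    'eia860m': {'year_month': '2020-08'},
}

# Callers look up a dataset's partition by name:
eia860m_month = working_partitions['eia860m']['year_month']
print(eia860m_month)  # prints 2020-08
```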

eia_transformed_dfs,
entities_dfs,
debug=False):
def _manage_strictness(col, eia860_ytd):
Collaborator:

Nit: I would probably call this something like _get_column_strictness, as the word "manage" implies (at least to me) some side effects that go beyond returning a dict with strictness targets for each column.

return {}

def extract(self, years):
def extract(self, partitions):
Collaborator:

Okay, I think I understand it now. I think the best approach here may be to consider that the partition is a dict-like structure that assigns a list of values to each named key (such as year or year_month).

The cleanest way to support such an approach would be to change the signature of this function to def extract(self, **partition), then expand this into a sequence of single-valued mappings (e.g. year=[1,2,3] would turn into [{year=1}, {year=2}, {year=3}]) and plumb this all the way down to self.ds.get_resources(self._dataset_name, **partition).

This generic approach could also support composite keys (e.g. partition=dict(year=2018, month=8)), but that would require reorganizing our Excel metadata files (right now they assume the partition matches the column name), or at least adding a function that maps a structured partition to a metadata column name (e.g. meta_col = lambda **p: f'{p["year"]}_{p["month"]}'). Composite partition keys are more of a long-term consideration and do not immediately apply.

Member Author:

Do you mean passing kwargs into extract like extract(year=1, year=2, year=3), or passing in extract(year=[1,2,3])? For the former, I'm not sure how you would distinguish between the years; if it's the latter, I'm not sure how to convert extract's kwargs into get_resources's kwargs.

Collaborator:

You can't pass a named argument more than once, so it'd have to be years=[1,2,3]. My idea was to add a helper function that fans out the multi-valued partition into a sequence of single-valued assignments, e.g. {month=9, year=[1,2]} --> [{month=9, year=1}, {month=9, year=2}].

Here is the helper function:

import itertools

def iterate_multivalue_dict(**kwargs):
    single_valued = {k: v for k, v in kwargs.items() if type(v) != list}

    # Transform multi-valued {k: vlist} into {k1: [{k1: v1}, {k1: v2}, ...], k2: [...], ...}
    multi_valued = {k: [{k: v} for v in vlist] for k, vlist in kwargs.items() if type(vlist) == list}

    for value_assignments in itertools.product(*multi_valued.values()):
        result = dict(single_valued)
        for k_v in value_assignments:
            result.update(k_v)
        yield result
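To make the fan-out concrete, here is a runnable version of the helper (same logic, with isinstance checks instead of type comparisons) together with a usage example:

```python
import itertools

def iterate_multivalue_dict(**kwargs):
    """Fan out list-valued kwargs into single-valued partition dicts."""
    single_valued = {k: v for k, v in kwargs.items() if not isinstance(v, list)}
    # Transform multi-valued {k: vlist} into {k: [{k: v1}, {k: v2}, ...], ...}
    multi_valued = {k: [{k: v} for v in vlist]
                    for k, vlist in kwargs.items() if isinstance(vlist, list)}
    # The cartesian product of the per-key assignment lists covers every combination.
    for value_assignments in itertools.product(*multi_valued.values()):
        result = dict(single_valued)
        for k_v in value_assignments:
            result.update(k_v)
        yield result

parts = list(iterate_multivalue_dict(month=9, year=[1, 2]))
print(parts)  # [{'month': 9, 'year': 1}, {'month': 9, 'year': 2}]
```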

test/etl_test.py

def test_extract_eia923(self, pudl_datastore_fixture):
"""Spot check extraction eia923 excel files."""
extractor = pudl.extract.eia923.Extractor(pudl_datastore_fixture)
assert "Page 3 Boiler Fuel Data" in extractor.load_excel_file(
2018, "stocks").sheet_names
part=2018, page="stocks").sheet_names
Collaborator:

This is another reason why using **kwargs might be cleaner here. Instead of the vague part=$somevalue, using year=2018 or year_month="2018-08" makes it more apparent what we are passing through.

Member:

I very much like the idea of not needing to specify exactly what all of the partitions are, since the possibilities will probably change over time (e.g. epacems has year + state, ferceqr has year + quarter, the LMP data may be partitioned by which grid operator it came from, etc.), and doing the right thing can be left to the extract function, which is going to be dataset specific. But this will affect the metadata that we use to specify the filename/column/tab maps for the spreadsheets.

@cmgosnell cmgosnell modified the milestones: PUDL Sprint 27, Sprint 28 Nov 18, 2020
return self._sheet_name.at[page, str(year)]
def get_sheet_name(self, page, **partition):
"""Returns name of the excel sheet that contains the data for given partition and page."""
return self._sheet_name.at[page, str(list(partition.values())[0])]
Member Author:

This feels like a hack... is there a better way to unpack the kwarg here? At this stage it should always be set up as partition_name=partition (year=2019 or year_month="2020-08").

Collaborator:

This indeed feels like a hack. The implicit assumption here is that **partition contains exactly one key=value pair, and that we use the stringified value when pulling the sheet name from the pandas dataframe.

I would make this assumption explicit in the form of an AssertionError:

if len(partition) != 1:
    raise AssertionError(f"Expecting exactly one partition attribute (found: {partition})")
return self._sheet_name.at[page, str(list(partition.values())[0])]

Collaborator:

I now see that this value extraction happens multiple times, so I would suggest creating a helper function _get_partition_key(**partitions) that does it.

Such an approach could easily be modified if we ever encounter multi-dimensional partitions (e.g. year, month), where you could construct keys of the form "month=1,year=2019" and use these for indexing pandas dataframes.

Member Author:

this is a great idea!

"""Extracts dataframes.

Returns dict where keys are page names and values are
DataFrames containing data across given years.

Args:
years (list): list of years to extract.
partitions (list, tuple or string): list of partitions to
Member Author:

I tried using the **partitions format here for the docs but got a flake8 error. Is the right way to document kwargs just to list them like normal?... that seems weird.

@cmgosnell cmgosnell requested a review from rousik December 3, 2020 22:36
@@ -152,12 +152,13 @@ def __init__(self, ds):
self._file_cache = {}
self.ds = ds

def process_raw(self, df, year, page):
def process_raw(self, df, partition, page):
Collaborator:

It is strange that process_renamed and process_raw handle the partition differently (one uses **kwargs while the other expects a dict as a positional argument), and page and partition appear in a different order in the two signatures.

# we are going to skip
if self.excel_filename(yr, page) == '-1':
if self.excel_filename(page, **partition) == '-1':
Collaborator:

Now that I think of it, a string-valued '-1' is a very non-standard way of encoding a missing value. We should just return None here.

Member Author:

This is a very good point. I'm going to add an issue for that.

@@ -82,15 +82,14 @@ def __init__(self, pudl_engine, ds=None, freq=None, start_date=None, end_date=No
if start_date is None:
self.start_date = \
pd.to_datetime(
f"{min(pc.working_years['eia923'])}-01-01")
f"{min(pc.working_partitions['eia923']['years'])}-01-01")
@rousik (Collaborator) commented Dec 4, 2020:

Hm. I was thinking about how to make this conversion better and tried pd.Timestamp, but that did not work. I do think, however, that this could be done better with the following code instead:

self.start_date = pd.to_datetime(min(pc.working_partitions['eia923']['years']), format='%Y')
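For reference, a quick standalone comparison of the two conversion styles (with an explicit str() cast in the format-based call, which avoids relying on pandas' handling of bare ints):

```python
import pandas as pd

years = (2009, 2010, 2011)

# Original style: build an ISO date string by hand.
via_string = pd.to_datetime(f"{min(years)}-01-01")

# Suggested style: parse the year directly with a format code.
via_format = pd.to_datetime(str(min(years)), format='%Y')

print(via_string == via_format)  # prints True; both are 2009-01-01
```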

pd.to_datetime(
f"{max(pc.working_years['eia923'])}-12-31")
self.end_date = pd.to_datetime(
pc.working_partitions['eia860m']['year_month'])
Collaborator:

It is somewhat confusing that we are pulling the date range from different working_partitions entries. The start is given by eia923, the end by eia860m. Is this guaranteed to be logically correct?

Member Author:

Here we are basically trying to set the default start and end dates based on the min and max EIA dates. These are different because 923 has the oldest year right now and 860m has the newest.

To make this clearer, I'm thinking about incorporating a little helper function that grabs all EIA dates from pc.working_partitions so we can just take the max/min.
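A minimal sketch of that helper, assuming the working_partitions shape from this PR (the dict contents and the function name here are hypothetical):

```python
import pandas as pd

# Hypothetical subset of pc.working_partitions; the real values live in pudl.constants.
working_partitions = {
    'eia923': {'years': tuple(range(2001, 2020))},
    'eia860': {'years': tuple(range(2009, 2020))},
    'eia860m': {'year_month': '2020-08'},
}

def get_eia_date_range(partitions):
    """Collect all EIA partition values as timestamps and return (min, max)."""
    dates = []
    for working in partitions.values():
        for value in working.values():
            if isinstance(value, str):
                # 'YYYY-MM' strings parse to the first day of that month.
                dates.append(pd.to_datetime(value))
            else:
                # Year tuples parse to January 1 of each year.
                dates.extend(pd.to_datetime(str(y), format='%Y') for y in value)
    return min(dates), max(dates)

start, end = get_eia_date_range(working_partitions)
print(start.date(), end.date())  # 2001-01-01 2020-08-01
```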

@cmgosnell cmgosnell merged commit 6252cd5 into sprint27 Dec 7, 2020
@cmgosnell cmgosnell deleted the eia860m branch December 7, 2020 20:25
Successfully merging this pull request may close these issues.

integrate EIA 860 M into ETL