Integrate EIA 860 M into ETL #824
Conversation
working on Issue #811
Codecov Report
@@ Coverage Diff @@
## sprint27 #824 +/- ##
============================================
+ Coverage 70.47% 71.18% +0.71%
============================================
Files 40 41 +1
Lines 4897 4941 +44
============================================
+ Hits 3451 3517 +66
+ Misses 1446 1424 -22
Continue to review full report at Codecov.
I have added a few nits/cosmetic comments; the bulk of what was asked is in the biggest comment added to extract/excel.py. TL;DR: I think we should be using `**kwargs` for the partition and make excel generic to the point it does not actually care what the names of these partition attributes are.
src/pudl/constants.py
Outdated
@@ -1633,6 +1633,7 @@
 # The full set of years we currently expect to be able to ingest, per source:
 working_years = {
     'eia860': tuple(range(2009, 2020)),
+    'eia860m': ["2020-08"],
This feels a bit like a hack. We are storing a completely different type (list instead of tuple, string instead of number, and yyyy-mm instead of a year), and there's no easy way to know that eia860m is different from the rest.
Yes, I was wondering about this. I suppose it would be better to have `'eia860m': tuple(['2018-10'])` here, but maybe it doesn't make sense at all to have this year-month inside of `working_years`, because it is not really a year at all.
I don't think there's any reason it needs to be a collection at all, is there? If it's only ever allowed to be a single value, we should just store the single value. But anything that's being stored and used as an argument for other functions, like in the other datasets here, needs to be non-mutable (i.e. tuple instead of list) because otherwise the arguments can get accidentally changed in one place, and that change will appear everywhere the value is referenced since it has a scope outside of any of the functions.
I went ahead and changed `working_years` to `working_partitions`, and within each data set is another dictionary with the name of the partition as the keys and the disparate partition types as the values.
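Judging from references elsewhere in this PR (`pc.working_partitions['eia923']['years']` and `pc.working_partitions['eia860m']['year_month']`), the reworked constant presumably looks something like this sketch; the exact year ranges are illustrative:

```python
# Hypothetical sketch of the new working_partitions constant in
# src/pudl/constants.py; the year ranges here are placeholders.
working_partitions = {
    'eia860': {'years': tuple(range(2009, 2020))},
    'eia923': {'years': tuple(range(2009, 2020))},
    # eia860m is partitioned by year-month rather than year, and is a
    # single value rather than a collection.
    'eia860m': {'year_month': '2020-08'},
}
```

Each dataset maps partition names to whatever value type fits that partition, so callers can introspect the partition scheme instead of assuming years.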
                      eia_transformed_dfs,
                      entities_dfs,
                      debug=False):
    def _manage_strictness(col, eia860_ytd):
nit: I would probably call this something like `_get_column_strictness`, as the word "manage" implies (at least to me) some side-effects that go beyond returning a dict with strictness targets for each column.
src/pudl/extract/excel.py
Outdated
         return {}

-    def extract(self, years):
+    def extract(self, partitions):
Okay. I think I understand it now. I think the best approach here may be to consider that the partition is a dict-like structure that assigns a list of values to each named key (such as `year` or `year_month`).

The cleanest way to support such an approach would be to change the signature of this function to `def extract(self, **partition)`, then expand this into a sequence of single-valued mappings (e.g. `year=[1,2,3]` will turn into `[{year=1}, {year=2}, {year=3}]`), and then plumb this all the way down to `self.ds.get_resources(self._dataset_name, **partition)`.

The above generic approach could support composite keys (e.g. `partition=dict(year=2018, month=8)`), but this would require reorganizing our excel metadata files (right now it assumes that the partition matches the column name), or at the least adding a function that maps a structured partition to a metadata column name (e.g. `meta_col = lambda **p: f'{p["year"]}_{p["month"]}'`). Composite partition keys are more of a long-term consideration and do not immediately apply.
Do you mean passing kwargs into extract like `extract(year=1, year=2, year=3)`, or passing in `extract(year=[1,2,3])`? For the former, I'm not sure how you would distinguish between the years. And if it is the latter, I'm not sure how to convert `extract`'s kwargs into `get_resources`'s kwargs.
You can't pass a named argument more than once, so it'd have to be `years=[1,2,3]`. My idea was to add a helper function that will fan out the multi-valued partition into a sequence of single-valued assignments, e.g. `{month=9, year=[1,2]} --> [{month=9, year=1}, {month=9, year=2}]`.
Here is the helper function:

import itertools

def iterate_multivalue_dict(**kwargs):
    """Fan multi-valued kwargs out into a sequence of single-valued dicts."""
    single_valued = {k: v for k, v in kwargs.items() if not isinstance(v, list)}
    # Transform multi-valued {k: vlist} into {k1: [{k1: v1}, {k1: v2}, ...], k2: [...], ...}
    multi_valued = {k: [{k: v} for v in vlist]
                    for k, vlist in kwargs.items() if isinstance(vlist, list)}
    for value_assignments in itertools.product(*multi_valued.values()):
        result = dict(single_valued)
        for k_v in value_assignments:
            result.update(k_v)
        yield result
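To make the fan-out concrete, here is a self-contained check of the behavior described above (the helper is duplicated so the snippet runs on its own):

```python
import itertools

def iterate_multivalue_dict(**kwargs):
    # Same helper as above: fan multi-valued kwargs out into single-valued dicts.
    single_valued = {k: v for k, v in kwargs.items() if not isinstance(v, list)}
    multi_valued = {k: [{k: v} for v in vlist]
                    for k, vlist in kwargs.items() if isinstance(vlist, list)}
    for value_assignments in itertools.product(*multi_valued.values()):
        result = dict(single_valued)
        for k_v in value_assignments:
            result.update(k_v)
        yield result

print(list(iterate_multivalue_dict(month=9, year=[1, 2])))
# → [{'month': 9, 'year': 1}, {'month': 9, 'year': 2}]
```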
test/etl_test.py
Outdated
     def test_extract_eia923(self, pudl_datastore_fixture):
         """Spot check extraction eia923 excel files."""
         extractor = pudl.extract.eia923.Extractor(pudl_datastore_fixture)
         assert "Page 3 Boiler Fuel Data" in extractor.load_excel_file(
-            2018, "stocks").sheet_names
+            part=2018, page="stocks").sheet_names
This is another reason why using `**kwargs` might be cleaner here. Instead of a vague `part=$somevalue`, using `year=2018` or `year_month="2018-08"` makes it more apparent what we are passing through.
I very much like the idea of not needing to specify exactly what all of the partitions are, since the possibilities will probably change over time (e.g. epacems has year + state, ferceqr has year + quarter, the LMP data may be partitioned by which grid operator it came from, etc.) and doing the right thing can be left to the extract function, which is going to be dataset specific. But this will affect the metadata that we use to specify filename / column / tab maps for the spreadsheets.
standardizes the working_years to be working_partitions. Adds in new plants to the mapping sheet
src/pudl/extract/excel.py
Outdated
-        return self._sheet_name.at[page, str(year)]
+    def get_sheet_name(self, page, **partition):
+        """Returns name of the excel sheet that contains the data for given partition and page."""
+        return self._sheet_name.at[page, str(list(partition.values())[0])]
This feels like a hack... is there a better way to unpack the kwarg here? At this stage it should always be set up as `partition_name=partition` (`year=2019` or `year_month="2020-08"`).
This indeed feels like a hack. The implicit assumption here is that `**partition` contains exactly one `key=value` pair, and we use the stringified value when pulling the sheet name from the pandas dataframe. I would make this assumption explicit in the form of an `AssertionError`:

if len(partition) != 1:
    raise AssertionError(f"Expecting exactly one partition attribute (found: {partition})")
return self._sheet_name.at[page, str(list(partition.values())[0])]
I now see that this value extraction happens multiple times, so I would suggest creating a helper function `_get_partition_key(**partitions)` that will do that. Such an approach could easily be modified if we ever encounter multi-dimensional partitions (e.g. year, month), where you could construct keys in the form of `"month=1,year=2019"` and use these for indexing pandas dataframes.
this is a great idea!
         """Extracts dataframes.

         Returns dict where keys are page names and values are
         DataFrames containing data across given years.

         Args:
-            years (list): list of years to extract.
+            partitions (list, tuple or string): list of partitions to
I tried using the `**partitions` format here for the docs but I got a flake8 error. Is the right way to document kwargs to just list them like normal?... that seems weird.
src/pudl/extract/excel.py
Outdated
@@ -152,12 +152,13 @@ def __init__(self, ds):
         self._file_cache = {}
         self.ds = ds

-    def process_raw(self, df, year, page):
+    def process_raw(self, df, partition, page):
It is strange that `process_renamed` and `process_raw` handle partition differently (one uses `**kwargs` while the other expects a dict in a positional argument), in addition to page and partition having a different order in the two cases.
         # we are going to skip
-        if self.excel_filename(yr, page) == '-1':
+        if self.excel_filename(page, **partition) == '-1':
Now that I think of it, a string-valued `'-1'` is a very non-standard way of encoding a missing value. We should just return None here.
This is a very good point. I'm going to add an issue for that.
src/pudl/output/pudltabl.py
Outdated
@@ -82,15 +82,14 @@ def __init__(self, pudl_engine, ds=None, freq=None, start_date=None, end_date=No
         if start_date is None:
             self.start_date = \
                 pd.to_datetime(
-                    f"{min(pc.working_years['eia923'])}-01-01")
+                    f"{min(pc.working_partitions['eia923']['years'])}-01-01")
Hm. I was thinking about how to make this conversion better and thought about using `pd.Timestamp`, but that did not work. I do think, however, that this could be done better by using the following code instead:

self.start_date = pd.to_datetime(min(pc.working_partitions['eia923']['years']), format='%Y')
src/pudl/output/pudltabl.py
Outdated
-            pd.to_datetime(
-                f"{max(pc.working_years['eia923'])}-12-31")
+        self.end_date = pd.to_datetime(
+            pc.working_partitions['eia860m']['year_month'])
It is somewhat confusing that we are pulling the date ranges from different working_partitions. The start is given by eia923, the end by eia860m. Is this guaranteed to be logically correct?
Here we are basically trying to set the default start and end dates based on the min and max EIA dates. So these are different because 923 has the oldest year right now and 860m has the newest.

To make this clear I'm thinking about incorporating a little helper function that grabs all EIA dates from `pc.working_partitions`, so we can just grab the max/min.
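A rough sketch of what that helper could look like, assuming the `working_partitions` shape used in this PR (tuples of years, plus eia860m's `year_month` string); the function name is hypothetical:

```python
import pandas as pd

def get_eia_date_range(working_partitions):
    """Collect all EIA dates from working_partitions and return (min, max)."""
    dates = []
    for parts in working_partitions.values():
        for year in parts.get('years', ()):
            dates.append(pd.to_datetime(f"{year}-01-01"))
        if 'year_month' in parts:
            # A 'YYYY-MM' string parses to the first day of that month.
            dates.append(pd.to_datetime(parts['year_month']))
    return min(dates), max(dates)

partitions = {
    'eia923': {'years': (2009, 2019)},
    'eia860m': {'year_month': '2020-08'},
}
start, end = get_eia_date_range(partitions)
# start == Timestamp('2009-01-01'), end == Timestamp('2020-08-01')
```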
@zaneselvans: please give this a good lookin' over. Generally speaking this PR adds the metadata and extractor for EIA 860 M and incorporates this into the ETL process. There is now an eia860m extract module which extracts 860m and appends it to the 860 data. Some notes:

- `data_source` column in the generators table with just the data source code (eia860 or eia860m). I haven't added an indicator of which eia860m this is yet... I don't think our old little datasets table exists anymore and it felt weird to recreate it because the information should be populating in the datapackage metadata and having this table would be another duplication of data. idk. We should make some decisions on what is best for this.
- `manage_strictness` function to help with this column as well as the plant and utility name from a while back.
- `noqa: C901` exceptions 🎉

@rousik: I'd love to have your perspective on whether there is a cleaner way to call `ds.get_resource()` inside the standard excel extractor with the format of `partition_name=part` (i.e. `year=2019` or `year_month='2020-08'`). Right now I got around this just by overriding `get_datapackage_resources()` in the non-year dataset specific extractor. My original thought was to pass a dictionary into the excel extractor like `{'year': [2009,2010,2011]}` or `{'year_month': ['2020-08']}` and somehow using the keys to call `ds.get_resource()` properly, but I couldn't work out the mechanics of that.

Working on Issue #811