
Remove leading zeroes from all numeric EIA generator_ids. #968

Merged: 4 commits, Apr 23, 2021

Conversation

zaneselvans (Member) commented on Apr 21, 2021

I integrated a new helper function into the process_raw functions embedded in each of the EIA Excel extractors to fix the leading zeroes in generator IDs. I also added a bit of code to remove some duplicate generators that turned up in the ownership table as a result.

I'm not sure this is really the right place for the code to go, but until we refactor the extract or transform steps to have an "apply these fixes to all of these dataframes uniformly" step, it was the simplest place to drop it in.

Closes #964
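For readers landing here from the issue: a minimal sketch of what a leading-zero fix like this can look like. The helper name fix_leading_zero_gen_ids matches the one discussed in this PR, but the body below is illustrative, not the merged implementation.

```python
import pandas as pd

def fix_leading_zero_gen_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Strip leading zeroes from purely numeric generator_id values (sketch)."""
    if "generator_id" not in df.columns:
        return df
    # Only touch values that are nothing but digits; mixed alphanumeric IDs
    # like "GT01" are left alone.
    fixed = df["generator_id"].astype("string").str.replace(
        r"^0+(\d+$)", r"\1", regex=True
    )
    return df.assign(generator_id=fixed)
```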

codecov bot commented on Apr 21, 2021

Codecov Report

Merging #968 (022dbf8) into dev (4ed4112) will decrease coverage by 0.13%.
The diff coverage is 97.50%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #968      +/-   ##
==========================================
- Coverage   83.41%   83.29%   -0.13%     
==========================================
  Files          47       47              
  Lines        5583     5613      +30     
==========================================
+ Hits         4657     4675      +18     
- Misses        926      938      +12     
Impacted Files Coverage Δ
src/pudl/extract/excel.py 97.22% <ø> (ø)
src/pudl/validate.py 35.19% <ø> (ø)
src/pudl/transform/eia860.py 96.15% <93.33%> (-0.65%) ⬇️
src/pudl/extract/eia860.py 100.00% <100.00%> (ø)
src/pudl/extract/eia860m.py 100.00% <100.00%> (ø)
src/pudl/extract/eia861.py 96.30% <100.00%> (+0.46%) ⬆️
src/pudl/extract/eia923.py 100.00% <100.00%> (ø)
src/pudl/helpers.py 92.26% <100.00%> (+0.56%) ⬆️
src/pudl/analysis/timeseries_cleaning.py 85.68% <0.00%> (-2.64%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

cmgosnell (Member) left a comment

I think the de-duping needs a small amount of in-line documentation. And I don't love the fix_leading_zero_gen_ids() being in the extract step, but I don't feel strongly about that one.

logger.info("Fixed %s EIA generator IDs with leading zeros.", num_fixes)
df = (
    df.drop("generator_id", axis="columns")
    .assign(generator_id=fixed_generator_id)
Member

why drop the column and then assign? if you just assigned, wouldn't it just fully replace it?

Member Author

I feel like I did this to avoid the "assigning to a copy of a slice" (SettingWithCopyWarning) error.
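For context on the question above: .assign already returns a new DataFrame, so the usual alternative to the drop/assign pattern is to take an explicit copy and overwrite the column directly, which also avoids the SettingWithCopyWarning. A sketch, not the code in this PR:

```python
# df may be a slice of a larger frame; copying first makes the subsequent
# column assignment unambiguous and warning-free.
df = df.copy()
df["generator_id"] = fixed_generator_id
```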

bad_na_regexes = [
    r'^\.$',  # Nothing but a decimal point
    r'^\s$',  # A single whitespace character
    r'^$',    # The empty string
Member

thanks for explaining these! for the regex illiterate in the crowd (me, it me)

Member Author

Everyone is regex illiterate beyond a certain point!

One question I had about these -- do we actually only want to replace a single whitespace character? Or do we want to assign NA to any field which is nothing but whitespace? I feel like we probably want to do the latter, but I didn't want to change it while trying to fix something else first.

Member

Oh good catch... I would definitely think we want the latter (convert to NA if a field is nothing but whitespace)
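If the decision is to NA out any all-whitespace field, the single-space pattern could become r'^\s+$'. A small sketch, assuming the patterns are fed to DataFrame.replace with regex=True; the surrounding frame and column name here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"generator_id": ["1", ".", "   ", "", "GT1"]})
bad_na_regexes = [
    r"^\.$",   # nothing but a decimal point
    r"^\s+$",  # one or more whitespace characters and nothing else
    r"^$",     # the empty string
]
df = df.replace(to_replace=bad_na_regexes, value=pd.NA, regex=True)
# ".", "   ", and "" all become <NA>; "1" and "GT1" are untouched.
```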

"utility_id_eia": pd.Int64Dtype(),
"plant_id_eia": pd.Int64Dtype(),
"owner_state": pd.StringDtype()
})
Member

I know you didn't add these but I think we can delete them because the types will be applied uniformly in the post transform step

Member Author

I don't love them either, but I'd like to not touch them as part of this PR, just in case they do matter -- and instead integrate this kind of type casting cleanup into the integration of the new metadata / harvesting system.

Member Author

Uuuugh, okay actually (per our chat) I need to drop all but one copy of the duplicates, not drop one of the duplicates. And add an assertion here that there are no more duplicates thereafter. And also add some inline comments explaining wtf is happening here.
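A sketch of that "keep one copy, then assert the dupes are gone" logic; own_df and pk_cols are stand-ins for the real ownership table and its primary-key columns, not names from this PR:

```python
# Keep the first copy of each duplicated ownership record and drop the rest.
deduped = own_df.drop_duplicates(subset=pk_cols, keep="first")

# Assert that no duplicates remain on the primary key.
remaining = deduped.duplicated(subset=pk_cols).sum()
if remaining != 0:
    raise AssertionError(
        f"Found {remaining} duplicate ownership records after de-duping."
    )
```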

test/unit/helpers.py (outdated comment, resolved)
@@ -40,6 +50,7 @@ def process_raw(self, df, page, **partition):
         if page in pages_eia860m:
             df = df.assign(data_source='eia860')
             self.cols_added.append('data_source')
+        df = fix_leading_zero_gen_ids(df)
Member

Should at least get into the standard process_raw method in the GenericExtractor. Even though it is basically not being run right now because it is always overwritten? At least as a template to let our future selves know it should be added? Or is it specific enough to tables which contain generator ids that it should be done for all datasets... obviously it would just do nothing for non-gen tables but we would probably remove it entirely for a non-gen data set... hm. just musing on your original point of maybe wanting this in a standard place.

In truth it feels more like it should be an early phase transform step. In which case it could be applied to all dfs within the data set specific main transform function.
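If it did move to an early transform step, the fix could be applied uniformly to every extracted dataframe before any table-specific transforms run. A sketch; eia_raw_dfs is a hypothetical name for the dict of extracted tables:

```python
# Apply the generator_id cleanup to every table in the dataset at once;
# tables without a generator_id column pass through unchanged.
eia_raw_dfs = {
    name: fix_leading_zero_gen_ids(df)
    for name, df in eia_raw_dfs.items()
}
```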

Member Author

I totally agree that this should be a pre-transform step, not an extract step. My first attempt was to integrate it there, but the differences between how the eia860, eia923, and eia861 transform modules are structured meant that it wasn't straightforward and I broke everything. Integrating it upstream was easier for now. But as we standardize those modules, probably with a common class definition that they all use, I totally agree.

I think this is an EIA-specific early transform fix, not a generic one. We could make it into something more generic that can be applied to any "string" field containing a subset of integer values, which should ditch any spurious leading zeroes. But in that case we would need to tell it which columns it should apply to, or have that information stored in the metadata about the columns somehow. Unfortunately generator_id is a little too generic a name (I think we should go to generator_id_eia personally).
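A sketch of what that more generic version might look like: the caller names the "string" columns whose purely numeric values should lose their leading zeroes. The function name, signature, and example columns are hypothetical, not part of this PR:

```python
import pandas as pd

def remove_leading_zeros(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Strip leading zeroes from fully numeric values in the named string columns."""
    for col in cols:
        if col in df.columns:
            cleaned = df[col].astype("string").str.replace(
                r"^0+(\d+$)", r"\1", regex=True
            )
            df = df.assign(**{col: cleaned})
    return df

# e.g. remove_leading_zeros(gens, ["generator_id", "boiler_id"])
```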
