Remove leading zeroes from all numeric EIA generator_ids. #968
Conversation
Codecov Report
@@ Coverage Diff @@
## dev #968 +/- ##
==========================================
- Coverage 83.41% 83.29% -0.13%
==========================================
Files 47 47
Lines 5583 5613 +30
==========================================
+ Hits 4657 4675 +18
- Misses 926 938 +12
Continue to review full report at Codecov.
I think the de-duping needs a bb amount of in-line documentation. And I don't love the fix_leading_zero_gen_ids()
being in the extract step, but I don't feel strongly about that one.
logger.info("Fixed %s EIA generator IDs with leading zeros.", num_fixes)
df = (
    df.drop("generator_id", axis="columns")
    .assign(generator_id=fixed_generator_id)
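The snippet above only shows the tail of the helper. A hypothetical sketch of how `fixed_generator_id` and `num_fixes` might be computed (the actual implementation in the PR may differ): strip leading zeros only from IDs that are entirely numeric, leaving mixed IDs like `"GT01"` alone.

```python
import pandas as pd

def fix_leading_zero_gen_ids(df):
    # Hypothetical sketch -- purely numeric IDs like "0001" become "1";
    # mixed alphanumeric IDs like "GT01" are left untouched.
    is_numeric = df["generator_id"].str.match(r"^\d+$", na=False)
    fixed_generator_id = df["generator_id"].copy()
    # Round-tripping through int handles "0" correctly (lstrip("0") would not).
    fixed_generator_id.loc[is_numeric] = (
        fixed_generator_id.loc[is_numeric].astype(int).astype(str)
    )
    num_fixes = int((fixed_generator_id != df["generator_id"]).sum())
    df = (
        df.drop("generator_id", axis="columns")
        .assign(generator_id=fixed_generator_id)
    )
    return df, num_fixes

df = pd.DataFrame({"generator_id": ["0001", "GT01", "010", "7"]})
df, num_fixes = fix_leading_zero_gen_ids(df)
```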
why drop the column and then assign? if you just assigned, wouldn't it just fully replace it?
I feel like I did this to avoid the SettingWithCopyWarning ("A value is trying to be set on a copy of a slice") error.
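For context on that exchange, a minimal sketch (with made-up column names) of the difference: item assignment on a sliced frame can trigger pandas' SettingWithCopyWarning, while `.assign()` always returns a new DataFrame, and it overwrites an existing column of the same name on its own, so the preceding `.drop()` is belt-and-suspenders rather than required.

```python
import pandas as pd

df = pd.DataFrame({"generator_id": ["0001", "GT01"], "plant_id": [1, 2]})

# Item assignment on a slice may warn that you are setting a value
# on a copy:
#   sub = df[df["plant_id"] == 1]
#   sub["generator_id"] = "1"   # SettingWithCopyWarning territory

# .assign() returns a brand-new DataFrame and replaces the existing
# column outright, sidestepping the warning. (lstrip("0") is used here
# purely for illustration; it would mangle an ID that is exactly "0".)
out = df.assign(generator_id=df["generator_id"].str.lstrip("0"))
```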
bad_na_regexes = [
    r'^\.$',  # Nothing but a decimal point
    r'^\s$',  # A single whitespace character
    r'^$',    # The empty string
thanks for explaining these! for the regex illiterate in the crowd (me, it me)
Everyone is regex illiterate beyond a certain point!
One question I had about these -- do we actually only want to replace a single whitespace character? Or do we want to assign NA to any field which is nothing but whitespace? I feel like we probably want to do the latter, but I didn't want to change it while trying to fix something else first.
Oh good catch... I would definitely think we want the latter (convert to NA if a field is nothing but whitespace)
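To illustrate the proposed change: swapping `^\s$` for `^\s+$` catches fields that are *any* run of whitespace, not just a single character. A sketch of the cleanup idiom (variable names are illustrative, not the PR's actual code):

```python
import numpy as np
import pandas as pd

# The "latter" option from the discussion: one or more whitespace chars.
bad_na_regexes = [
    r'^\.$',   # Nothing but a decimal point
    r'^\s+$',  # One OR MORE whitespace characters
    r'^$',     # The empty string
]
pattern = "|".join(bad_na_regexes)

s = pd.Series([".", " ", "   ", "", "ok"])
# With regex=True and a non-string value, pandas replaces the whole
# matched element with NaN -- the standard whitespace-to-NA idiom.
cleaned = s.replace(pattern, np.nan, regex=True)
```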
"utility_id_eia": pd.Int64Dtype(), | ||
"plant_id_eia": pd.Int64Dtype(), | ||
"owner_state": pd.StringDtype() | ||
}) |
I know you didn't add these, but I think we can delete them because the types will be applied uniformly in the post-transform step.
I don't love them either, but I'd like to not touch them as part of this PR, just in case they do matter -- and instead integrate this kind of type-casting cleanup into the integration of the new metadata / harvesting system.
Uuuugh, okay actually (per our chat) I need to drop all but one copy of the duplicates, not drop one of the duplicates. And add an assertion here that there are no more duplicates thereafter. And also add some inline comments explaining wtf is happening here.
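The "drop all but one copy, then assert none remain" pattern described above can be sketched like this. The key columns are illustrative placeholders, not necessarily the real ownership-table keys:

```python
import pandas as pd

# Hypothetical key columns identifying a unique ownership record.
key_cols = ["plant_id_eia", "generator_id", "owner_utility_id_eia"]

df = pd.DataFrame({
    "plant_id_eia": [1, 1, 2],
    "generator_id": ["1", "1", "2"],
    "owner_utility_id_eia": [10, 10, 20],
})

# keep="first" retains exactly one copy of each duplicated row,
# rather than dropping one duplicate and keeping the rest.
deduped = df.drop_duplicates(subset=key_cols, keep="first")

# Post-condition: no duplicates remain on the key columns.
assert not deduped.duplicated(subset=key_cols).any()
```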
@@ -40,6 +50,7 @@ def process_raw(self, df, page, **partition):
if page in pages_eia860m:
    df = df.assign(data_source='eia860')
    self.cols_added.append('data_source')
df = fix_leading_zero_gen_ids(df)
Should at least get into the standard process_raw method in the GenericExtractor. Even though it is basically not being run right now because it is always overwritten? At least as a template to let our future selves know it should be added? Or is it specific enough to tables which contain generator IDs that it should be done for all datasets... obviously it would just do nothing for non-gen tables, but we would probably remove it entirely for a non-gen dataset... hm. Just musing on your original point of maybe wanting this in a standard place.
In truth it feels more like it should be an early phase transform step. In which case it could be applied to all dfs within the data set specific main transform function.
I totally agree that this should be a pre-transform step, not an extract step. My first attempt was to integrate it there, but the differences between how the eia860, eia923, and eia861 transform modules are structured meant that it wasn't straightforward and I broke everything. Integrating it upstream was easier for now. But as we standardize those modules, probably with a common class definition that they all use, I totally agree.
I think this is an EIA-specific early transform fix though, not a generic one, though we could make it into something more generic that can be applied to any "string" field that contains a subset of integer values, which should ditch any spurious leading zeroes. But in that case we would need to tell it which columns it should apply to, or have that information stored in the metadata about the columns somehow. Unfortunately generator_id is a little too generic a name (I think we should go to generator_id_eia personally).
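The generic, column-aware version mused about above might look like this. The function name and the metadata-driven column list are hypothetical, purely to show the shape of the idea:

```python
import pandas as pd

def strip_spurious_leading_zeros(df, cols):
    # Hypothetical generic helper: for each named string column, strip
    # leading zeros from values that are entirely numeric. In practice
    # `cols` would come from column metadata rather than being hard-coded.
    for col in cols:
        if col not in df.columns:
            continue
        numeric = df[col].str.match(r"^\d+$", na=False)
        df.loc[numeric, col] = df.loc[numeric, col].astype(int).astype(str)
    return df

df = pd.DataFrame({"generator_id": ["007", "GT1"], "other": ["01", "x"]})
# Only columns we explicitly name get cleaned:
df = strip_spurious_leading_zeros(df, ["generator_id"])
```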
I integrated a new helper function into the process_raw functions which we have embedded in each of the EIA Excel extractors, to fix the leading zeroes in generator IDs. Also added a bit of code to remove some duplicate generators that resulted in the ownership table. Not sure this is really the right place for the code to go, but until we re-factor the extract or transform steps to have an "apply these fixes to all of these dataframes uniformly" step, it was the simplest place to drop it in.
Closes #964