Remove leading zeroes from all numeric EIA generator_ids. #968
Conversation
Codecov Report
@@ Coverage Diff @@
## dev #968 +/- ##
==========================================
- Coverage 83.41% 83.29% -0.13%
==========================================
Files 47 47
Lines 5583 5613 +30
==========================================
+ Hits 4657 4675 +18
- Misses 926 938 +12
Continue to review full report at Codecov.
I think the de-duping needs a bb amount of in-line documentation. And I don't love the fix_leading_zero_gen_ids()
being in the extract step, but I don't feel strongly about that one.
logger.info("Fixed %s EIA generator IDs with leading zeros.", num_fixes)
df = (
    df.drop("generator_id", axis="columns")
    .assign(generator_id=fixed_generator_id)
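The snippet above only shows the tail of the helper. A hypothetical sketch of how `fixed_generator_id` and `num_fixes` might be computed (the actual implementation in the PR may differ): strip leading zeros only from IDs that are entirely numeric, leaving mixed IDs like `"GT01"` alone.

```python
import pandas as pd

def fix_leading_zero_gen_ids(df):
    # Hypothetical sketch -- purely numeric IDs like "0001" become "1";
    # mixed alphanumeric IDs like "GT01" are left untouched.
    is_numeric = df["generator_id"].str.match(r"^\d+$", na=False)
    fixed_generator_id = df["generator_id"].copy()
    # Round-tripping through int handles "0" correctly (lstrip("0") would not).
    fixed_generator_id.loc[is_numeric] = (
        fixed_generator_id.loc[is_numeric].astype(int).astype(str)
    )
    num_fixes = int((fixed_generator_id != df["generator_id"]).sum())
    df = (
        df.drop("generator_id", axis="columns")
        .assign(generator_id=fixed_generator_id)
    )
    return df, num_fixes

df = pd.DataFrame({"generator_id": ["0001", "GT01", "010", "7"]})
df, num_fixes = fix_leading_zero_gen_ids(df)
```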
why drop the column and then assign? if you just assigned, wouldn't it just fully replace it?
I feel like I did this to avoid the SettingWithCopyWarning ("A value is trying to be set on a copy of a slice") error.
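For context on that exchange, a minimal sketch (with made-up column names) of the difference: item assignment on a sliced frame can trigger pandas' SettingWithCopyWarning, while `.assign()` always returns a new DataFrame, and it overwrites an existing column of the same name on its own, so the preceding `.drop()` is belt-and-suspenders rather than required.

```python
import pandas as pd

df = pd.DataFrame({"generator_id": ["0001", "GT01"], "plant_id": [1, 2]})

# Item assignment on a slice may warn that you are setting a value
# on a copy:
#   sub = df[df["plant_id"] == 1]
#   sub["generator_id"] = "1"   # SettingWithCopyWarning territory

# .assign() returns a brand-new DataFrame and replaces the existing
# column outright, sidestepping the warning. (lstrip("0") is used here
# purely for illustration; it would mangle an ID that is exactly "0".)
out = df.assign(generator_id=df["generator_id"].str.lstrip("0"))
```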
bad_na_regexes = [
    r'^\.$',  # Nothing but a decimal point
    r'^\s$',  # A single whitespace character
    r'^$',    # The empty string
thanks for explaining these! for the regex illiterate in the crowd (me, it me)
Everyone is regex illiterate beyond a certain point!
One question I had about these -- do we actually only want to replace a single whitespace character? Or do we want to assign NA to any field which is nothing but whitespace? I feel like we probably want to do the latter, but I didn't want to change it while trying to fix something else first.
Oh good catch... I would definitely think we want the latter (convert to NA if a field is nothing but whitespace)
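To illustrate the proposed change: swapping `^\s$` for `^\s+$` catches fields that are *any* run of whitespace, not just a single character. A sketch of the cleanup idiom (variable names are illustrative, not the PR's actual code):

```python
import numpy as np
import pandas as pd

# The "latter" option from the discussion: one or more whitespace chars.
bad_na_regexes = [
    r'^\.$',   # Nothing but a decimal point
    r'^\s+$',  # One OR MORE whitespace characters
    r'^$',     # The empty string
]
pattern = "|".join(bad_na_regexes)

s = pd.Series([".", " ", "   ", "", "ok"])
# With regex=True and a non-string value, pandas replaces the whole
# matched element with NaN -- the standard whitespace-to-NA idiom.
cleaned = s.replace(pattern, np.nan, regex=True)
```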
"utility_id_eia": pd.Int64Dtype(), | ||
"plant_id_eia": pd.Int64Dtype(), | ||
"owner_state": pd.StringDtype() | ||
}) |
I know you didn't add these, but I think we can delete them because the types will be applied uniformly in the post-transform step.
I don't love them either, but I'd like to not touch them as part of this PR, just in case they do matter -- and instead integrate this kind of type-casting cleanup into the integration of the new metadata / harvesting system.
Uuuugh, okay actually (per our chat) I need to drop all but one copy of the duplicates, not drop one of the duplicates. And add an assertion here that there are no more duplicates thereafter. And also add some inline comments explaining wtf is happening here.
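The "drop all but one copy, then assert none remain" pattern described above can be sketched like this. The key columns are illustrative placeholders, not necessarily the real ownership-table keys:

```python
import pandas as pd

# Hypothetical key columns identifying a unique ownership record.
key_cols = ["plant_id_eia", "generator_id", "owner_utility_id_eia"]

df = pd.DataFrame({
    "plant_id_eia": [1, 1, 2],
    "generator_id": ["1", "1", "2"],
    "owner_utility_id_eia": [10, 10, 20],
})

# keep="first" retains exactly one copy of each duplicated row,
# rather than dropping one duplicate and keeping the rest.
deduped = df.drop_duplicates(subset=key_cols, keep="first")

# Post-condition: no duplicates remain on the key columns.
assert not deduped.duplicated(subset=key_cols).any()
```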
@@ -40,6 +50,7 @@ def process_raw(self, df, page, **partition):
if page in pages_eia860m:
    df = df.assign(data_source='eia860')
    self.cols_added.append('data_source')
df = fix_leading_zero_gen_ids(df)
Should at least get into the standard process_raw method in the GenericExtractor. Even though it is basically not being run right now because it is always overwritten? At least as a template to let our future selves know it should be added? Or is it specific enough to tables which contain generator IDs that it should be done for all datasets... obviously it would just do nothing for non-gen tables, but we would probably remove it entirely for a non-gen dataset... hm. Just musing on your original point of maybe wanting this in a standard place.
In truth it feels more like it should be an early phase transform step. In which case it could be applied to all dfs within the data set specific main transform function.
I totally agree that this should be a pre-transform step, not an extract step. My first attempt was to integrate it there, but the differences between how the eia860, eia923, and eia861 transform modules are structured meant that it wasn't straightforward and I broke everything. Integrating it upstream was easier for now. But as we standardize those modules, probably with a common class definition that they all use, I totally agree.
I think this is an EIA-specific early transform fix though, not a generic one, though we could make it into something more generic that can be applied to any "string" field that contains a subset of integer values, which should ditch any spurious leading zeroes. But in that case we would need to tell it which columns it should apply to, or have that information stored in the metadata about the columns somehow. Unfortunately generator_id is a little too generic a name (I think we should go to generator_id_eia personally).
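The generic, column-aware version mused about above might look like this. The function name and the metadata-driven column list are hypothetical, purely to show the shape of the idea:

```python
import pandas as pd

def strip_spurious_leading_zeros(df, cols):
    # Hypothetical generic helper: for each named string column, strip
    # leading zeros from values that are entirely numeric. In practice
    # `cols` would come from column metadata rather than being hard-coded.
    for col in cols:
        if col not in df.columns:
            continue
        numeric = df[col].str.match(r"^\d+$", na=False)
        df.loc[numeric, col] = df.loc[numeric, col].astype(int).astype(str)
    return df

df = pd.DataFrame({"generator_id": ["007", "GT1"], "other": ["01", "x"]})
# Only columns we explicitly name get cleaned:
df = strip_spurious_leading_zeros(df, ["generator_id"])
```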
I integrated a new helper function into the process_raw functions which we have embedded in each of the EIA Excel extractors, to fix the leading zeroes in generator IDs. Also added a bit of code to remove some duplicate generators that resulted in the ownership table. Not sure this is really the right place for the code to go, but until we re-factor the extract or transform steps to have an "apply these fixes to all of these dataframes uniformly" step, it was the simplest place to drop it in.
Closes #964