Split TableTransformer.transform() into 3 phases #1900

zaneselvans · 2022-09-03T01:10:07Z

Within each dataset, there is often a set of standard operations that have to happen at
the beginning and the end of each table's transformation. To allow those operations to
be defined just once per dataset, I create 3 separate phases, defined as abstract
methods in the AbstractTableTransformer class:

start_transform()
main_transform()
finish_transform()

These are called by the (no longer abstract) TableTransformer.transform() method and
should be defined by the dataset-specific abstract table transformer. Of course they can
also be overridden by the individual concrete table transformer when necessary.
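
As a rough illustration of the three-phase layout described above, here is a minimal sketch. The method names come from this PR description; the two `Example*` classes, the `Rows` alias, and the use of plain lists of dicts in place of pandas DataFrames are assumptions made only so the example runs on its own. It is not the actual PUDL implementation.

```python
from abc import ABC, abstractmethod

# Stand-in for a pandas DataFrame so the sketch needs no third-party packages.
Rows = list[dict]


class AbstractTableTransformer(ABC):
    """Sketch: three abstract phases plus a concrete transform()."""

    @abstractmethod
    def start_transform(self, rows: Rows) -> Rows: ...

    @abstractmethod
    def main_transform(self, rows: Rows) -> Rows: ...

    @abstractmethod
    def finish_transform(self, rows: Rows) -> Rows: ...

    def transform(self, rows: Rows) -> Rows:
        """Run the three phases in order."""
        rows = self.start_transform(rows)
        rows = self.main_transform(rows)
        return self.finish_transform(rows)


class ExampleDatasetTransformer(AbstractTableTransformer):
    """Hypothetical dataset-level class: start/finish are defined once here."""

    def start_transform(self, rows: Rows) -> Rows:
        # Shared first step, e.g. normalize column names.
        return [{k.lower(): v for k, v in row.items()} for row in rows]

    def finish_transform(self, rows: Rows) -> Rows:
        # Shared last step, e.g. enforce a stable column order.
        return [dict(sorted(row.items())) for row in rows]


class ExampleTableTransformer(ExampleDatasetTransformer):
    """Hypothetical concrete table: only the main phase is table-specific."""

    def main_transform(self, rows: Rows) -> Rows:
        return [row for row in rows if row.get("fuel_qty") is not None]


result = ExampleTableTransformer().transform(
    [{"Fuel_Qty": 10, "Plant": "A"}, {"Fuel_Qty": None, "Plant": "B"}]
)
```

Because `ExampleDatasetTransformer` leaves `main_transform()` undefined, it stays abstract and cannot be instantiated, which mirrors the idea of a dataset-specific abstract transformer sitting between the base class and the concrete tables.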

This commit also adds a decorator @cache_df() that enables the optional caching of the
outputs of the individual decorated methods in the self._cached_dfs dictionary for
forensic / debugging purposes.

This caching is only done if self.cache_dfs = True. By default it is False.

The transform() method removes all cached dataframes after calling the finish_transform()
method unless TableTransformer.clear_cached_dfs is False. It is True by default.
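
The caching behavior described above could look roughly like the following sketch. The attribute names `cache_dfs`, `clear_cached_dfs`, and `_cached_dfs` come from this description; the `key` argument, the `SketchTransformer` class, and the use of plain lists instead of dataframes are assumptions for illustration, and the real decorator may differ.

```python
from functools import wraps


def cache_df(key: str):
    """Decorator factory that optionally caches a method's return value.

    The ``key`` parameter is an assumption of this sketch.
    """

    def decorator(method):
        @wraps(method)
        def wrapper(self, *args, **kwargs):
            result = method(self, *args, **kwargs)
            if self.cache_dfs:
                # A real implementation would probably store a copy.
                self._cached_dfs[key] = result
            return result

        return wrapper

    return decorator


class SketchTransformer:
    """Hypothetical transformer; lists stand in for dataframes."""

    def __init__(self, cache_dfs: bool = False, clear_cached_dfs: bool = True):
        self.cache_dfs = cache_dfs
        self.clear_cached_dfs = clear_cached_dfs
        self._cached_dfs: dict = {}

    @cache_df(key="start")
    def start_transform(self, df: list) -> list:
        return df + ["start"]

    @cache_df(key="main")
    def main_transform(self, df: list) -> list:
        return df + ["main"]

    @cache_df(key="finish")
    def finish_transform(self, df: list) -> list:
        return df + ["finish"]

    def transform(self, df: list) -> list:
        out = self.finish_transform(self.main_transform(self.start_transform(df)))
        if self.clear_cached_dfs:
            # Default: drop the intermediate outputs once the transform is done.
            self._cached_dfs.clear()
        return out


# Debugging setup: keep every intermediate output for inspection.
debug = SketchTransformer(cache_dfs=True, clear_cached_dfs=False)
final = debug.transform([])

# Default setup: nothing is cached.
default = SketchTransformer()
default.transform([])
```

With both flags left at their defaults the cache stays empty, so the decorator adds no memory overhead outside of debugging sessions.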

For more information on how decorators work, I found this post useful

zaneselvans · 2022-09-03T01:34:26Z

@cmgosnell if I do any more work on the row dropper methods before you get back to your computer I'll add them to this PR, but I will try and make sure that each commit is a self-contained piece of functionality so you can look at them individually.

Create a fuel_ferc1 table specific drop_invalid_rows method, which both makes use of the parameterized method inherited from the AbstractTableTransformer, and adds a separate method for identifying rows which we believe to be plant totals, rather than rows that pertain to individual fuels. This results in dropping more than 2/3 of all the records in the fuel_ferc1 table.

I also made some name changes, trying to be a little more standard and (hopefully) informative:

* remove_invalid_rows => drop_invalid_rows, since there were already several methods defined in the FERC 1 table transformers that do similar things in more specific circumstances, and they were named drop_*_cols or drop_*_rows.
* cols_to_check => required_valid_cols and cols_to_not_check => allowed_invalid_cols to provide some indication as to what is being checked for (validity / invalidity) in relation to the other transform parameter (invalid_values)
* DropInvalidRows => InvalidRows to follow the convention of the other TransformParams which are nouns describing their contents, while the methods & functions that they parameterize are verbs/actions.
* Moved the documentation of the InvalidRows parameters into the class definition.
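
To make the renamed parameters concrete, here is a small sketch of how a parameterized row dropper might use them. The names `InvalidRows`, `invalid_values`, `required_valid_cols`, `allowed_invalid_cols`, and `drop_invalid_rows` come from the comment above. The exact semantics are an assumption: this sketch drops a row when every checked column holds an invalid value, and it uses lists of dicts rather than dataframes. The actual PUDL behavior may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InvalidRows:
    """Sketch of the parameters; field semantics are assumed, not confirmed."""

    invalid_values: set = field(default_factory=set)
    # Check only these columns (at least one must hold a valid value) ...
    required_valid_cols: Optional[list] = None
    # ... or check every column except these. Set at most one of the two.
    allowed_invalid_cols: Optional[list] = None


def drop_invalid_rows(rows: list, params: InvalidRows) -> list:
    """Keep rows with at least one valid value among the checked columns."""
    if params.required_valid_cols and params.allowed_invalid_cols:
        raise ValueError("Set required_valid_cols or allowed_invalid_cols, not both.")
    kept = []
    for row in rows:
        if params.required_valid_cols is not None:
            cols = params.required_valid_cols
        else:
            skip = set(params.allowed_invalid_cols or [])
            cols = [col for col in row if col not in skip]
        if any(row[col] not in params.invalid_values for col in cols):
            kept.append(row)
    return kept


rows = [
    {"plant": "A", "fuel_qty": 10, "fuel_cost": 0},
    {"plant": "B", "fuel_qty": 0, "fuel_cost": None},
]

# Check everything except the identifying "plant" column.
kept = drop_invalid_rows(
    rows, InvalidRows(invalid_values={0, None}, allowed_invalid_cols=["plant"])
)

# Require a valid value specifically in "fuel_cost".
kept_strict = drop_invalid_rows(
    rows, InvalidRows(invalid_values={0, None}, required_valid_cols=["fuel_cost"])
)
```

In this reading, `required_valid_cols` is an allow-list of columns to inspect and `allowed_invalid_cols` is a deny-list, which is why the two are treated as mutually exclusive here.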

cmgosnell

This looks good overall. I have some questions and requests for more docs/explanations.

I think I would prefer going with transform_{phase} as a naming convention for these key transform abstract methods. But that is not a hill I will die on.

zaneselvans · 2022-09-06T01:38:15Z

I've responded to your questions/comments and added better documentation of the cache_df decorator.

Sorry about confusingly merging in some code changes from the pandas-1.5 branch accidentally.

I still think I prefer the start_transform() and finish_transform() since it reads more like English ("What does this method do? It starts the transform. What does that one do? It finishes the transform.") but I've switched it for now.

cmgosnell

Thanks for the decorator docs. Some of that seemed inferable, but it's nice to make it clear.

On the naming: I've tended to use "reverse notation" for many reasons. While it's slightly less colloquial, it clearly groups elements and it is easier to use with tab completion.

zaneselvans added the ferc1 (Anything having to do with FERC Form 1), rmi, and xbrl (Related to the FERC XBRL transition) labels Sep 3, 2022

zaneselvans self-assigned this Sep 3, 2022

zaneselvans linked an issue Sep 3, 2022 that may be closed by this pull request

Refine generic table transform architecture #1853

Closed

16 tasks

zaneselvans marked this pull request as ready for review September 3, 2022 01:16

zaneselvans added 11 commits September 3, 2022 13:37

Bump pandas & setuptools versions. Fix bad geopandas version in testenv.

1efa3f6

Convert set of dataframe columns into a list.

745d649

Avoid infinite df.replace() recursion.

26483e9

Replace df.iteritems() with df.items() in BGA compilation.

e717230

Update setuptools and geopandas in environment.yml

538f299

Merge branch 'xbrl_steam' into ferc1-transform-phases

9d43281

Merge branch 'pandas-1.5' into ferc1-transform-phases

9bcc34e

Revert to pandas 1.4.4

c51dbc4

Merge branch 'xbrl_steam' into ferc1-transform-phases

bfa153e

Merge branch 'ferc1-transform-phases' into drop-invalid-fuel-rows

83f9b5a

cmgosnell requested changes Sep 5, 2022

cmgosnell reviewed Sep 5, 2022

zaneselvans added 7 commits September 5, 2022 19:04

Integrate compilation of transform params into AbstractTabletransformer

e2e867a

Integrate compilation of transform params into AbstractTabletransformer

85241ef

Expand docstring to explain what modules in the subpackage need to do.

d2354fc

Merge branch 'xbrl_steam' into ferc1-transform-phases

9a94f26

Merge branch 'ferc1-transform-phases' into drop-invalid-fuel-rows

fd20677

Add documentation explaining the cache_df decorator.

6d00cf4

Switch from saying {phase}_transform() to transform_{phase}

aca7439

Merge branch 'ferc1-transform-phases' into drop-invalid-fuel-rows

2db046c

Clean up a couple of docstrings.

39d4853

cmgosnell reviewed Sep 6, 2022

cmgosnell approved these changes Sep 6, 2022

zaneselvans added 3 commits September 6, 2022 13:19

Add a link to deeper documentation on how decorators work.

ad0e931

Merge branch 'ferc1-transform-phases' into drop-invalid-fuel-rows

a78cfba

Merge pull request #1903 from catalyst-cooperative/drop-invalid-fuel-rows Implement drop_invalid_rows() for fuel_ferc1 table

b1f18b5

zaneselvans merged commit d92c190 into xbrl_steam Sep 6, 2022

zaneselvans deleted the ferc1-transform-phases branch September 6, 2022 18:33

zaneselvans mentioned this pull request Sep 6, 2022

Refactor fuel_ferc1 transform for XBRL + DBF inputs #1722

Closed

36 tasks
