Harmonising wildcard processing #209

Closed
olejandro opened this issue Mar 6, 2024 · 5 comments · Fixed by #211

@olejandro
Member

Currently there are two transforms dealing with wildcard processing:

  • process_uc_wildcards deals with wildcards in uc_t tables;
  • process_wildcards deals with wildcards in tfm tables and modifies the dataframe with attribute data based on their content.

Would it be practical to:

  • move the logic that modifies the dataframe with attribute data based on the content of tfm tables out of process_wildcards;
  • combine the wildcard processing done in process_uc_wildcards and process_wildcards into a single transform?

I believe these changes would also make it easier to address #153 and #154.
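
Purely for illustration, here is a rough sketch of the shape that split could take, with made-up names and a toy "multiply by *x" update rule (not the actual xl2times code):

```python
import re
import pandas as pd

# Toy data: an attribute table and a TFM_UPD-style table with a process wildcard.
processes = ["COALPLANT", "COALMINE", "GASPLANT"]
attributes = pd.DataFrame(
    {
        "process": ["COALPLANT", "COALMINE", "GASPLANT"],
        "attribute": ["NCAP_COST", "NCAP_COST", "NCAP_COST"],
        "value": [100.0, 50.0, 80.0],
    }
)
tfm_upd = pd.DataFrame({"process": ["COAL*"], "attribute": ["NCAP_COST"], "value": ["*1.1"]})


def expand_wildcards(df: pd.DataFrame, names: list[str]) -> pd.DataFrame:
    """Step 1 (shared by uc_t and tfm tables): replace wildcard patterns with explicit matches."""

    def matches(pattern: str) -> list[str]:
        return [n for n in names if re.fullmatch(pattern.replace("*", ".*"), n)]

    out = df.copy()
    out["process"] = out["process"].map(matches)
    return out.explode("process", ignore_index=True)


def apply_tfm_updates(attr: pd.DataFrame, upd: pd.DataFrame) -> pd.DataFrame:
    """Step 2 (tfm-specific): modify the attribute dataframe using the already-expanded rows."""
    attr = attr.copy()
    for _, row in upd.iterrows():
        mask = (attr["process"] == row["process"]) & (attr["attribute"] == row["attribute"])
        # Toy rule only: treat "*x" as "multiply the existing value by x".
        attr.loc[mask, "value"] *= float(str(row["value"]).lstrip("*"))
    return attr


print(apply_tfm_updates(attributes, expand_wildcards(tfm_upd, processes)))
```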

@olejandro
Member Author

@siddharth-krishna @SamRWest what do you think?

@siddharth-krishna
Collaborator

The only potential issue I can see with separating e.g. process_wildcards into two steps, one that expands the wildcards and a second that applies/inserts/updates the resulting rows, is if the intermediate table with the expanded rows is huge and consumes a lot of memory. Right now, we expand the wildcards in each row and immediately apply/insert/update it, IIRC, which avoids the creation of this intermediate expanded table. If you think the intermediate tables are unlikely to be too large, then your plan sounds good.

An alternative idea might be to move the helper functions inside process_wildcards to utils or something and also use them in process_uc_wildcards. This would avoid the creation of such intermediate tables. My intent in #166 was anyway that the helper functions should eventually live somewhere more general (and perhaps even allow a Jupyter-based scenario workflow in the future).

@olejandro
Member Author

olejandro commented Mar 6, 2024

Memory can definitely be an issue. Could we, e.g., store resolved wildcards as a list / string in the original table until they are used, or would that not help?
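
Something like this, perhaps (a minimal sketch with made-up names):

```python
import pandas as pd

# Keep the matched names as a list in the tfm table instead of exploding right away.
tfm = pd.DataFrame({"process": ["COAL*", "GAS*"], "value": [1.0, 2.0]})
matches = {"COAL*": ["COALPLANT", "COALMINE"], "GAS*": ["GASPLANT"]}  # resolved elsewhere

tfm["matched_processes"] = tfm["process"].map(matches)  # compact: one list per rule
# ...only when the rows are actually applied:
expanded = tfm.explode("matched_processes", ignore_index=True)
print(expanded)
```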

If choosing to expand them, we could also reduce the size by dropping the records that would otherwise be overwritten.

@SamRWest
Collaborator

SamRWest commented Mar 6, 2024

The fastest approach I've found (and what I've implemented in _match_uc_wildcards()) is to essentially pre-build a lookup table with all unique wildcard patterns and their lists of matches, then join these back to the table (thus re-duplicating them) and explode the match-lists for each row (making the DF long format).

The regex lookups are slow, and the DF row iteration is incredibly slow. This minimises the regex lookups and mostly avoids the iteration, making it much faster.
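
Roughly this pattern, if I understand it correctly (a simplified sketch with illustrative column names, not the actual _match_uc_wildcards implementation):

```python
import re
import pandas as pd

processes = ["COALPLANT", "COALMINE", "GASPLANT", "GASHEATER"]
uc_t = pd.DataFrame({"pset_pn": ["COAL*", "GAS*", "COAL*"], "uc_n": ["UC1", "UC2", "UC3"]})

# 1. Run the (slow) regex matching once per *unique* pattern, not once per row.
unique_patterns = uc_t["pset_pn"].unique()
lookup = pd.DataFrame(
    {
        "pset_pn": unique_patterns,
        "process": [
            [p for p in processes if re.fullmatch(pat.replace("*", ".*"), p)]
            for pat in unique_patterns
        ],
    }
)

# 2. Join the match lists back onto the original rows (re-duplicating them)...
merged = uc_t.merge(lookup, on="pset_pn", how="left")

# 3. ...and explode to long format: one row per matched process.
long_format = merged.explode("process", ignore_index=True)
print(long_format)
```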

There are probably some additional gains to be had by building this lookup table of unique wildcard matches once for wildcards in all tables, as this should eliminate additional duplicates. Yes, it'll use more RAM to store, but it'll be far less than the end-result joined/exploded tables, so I doubt it'd be a bottleneck.

I've got this working for ~TFM_UPD over here but got sidetracked on some other stuff and need to get back to it.

I've opened PR #210 for you guys to have a look at. Happy to push on if you think this is the right approach?

The remaining performance issue is that it still has to iterate over updates.copy().iterrows() because of the eval_and_update() call, which I'm not sure how to vectorise yet. If this logic (and similar table-specific operations for other tags in there) could be separated out from the wildcard expansion logic, it would make things a lot cleaner.

@olejandro
Member Author

@SamRWest I've opened #211 which I believe simplifies the workflow. Do you think it will make some of the optimisations in #210 easier to do?
