Harmonising wildcard processing #209

Closed
olejandro opened this issue Mar 6, 2024 · 5 comments · Fixed by #211

@olejandro
Member

Currently there are two transforms dealing with wildcard processing:

  • process_uc_wildcards deals with wildcards in uc_t tables;
  • process_wildcards deals with wildcards in tfm tables and modifies the dataframe with attribute data based on their content.

Would it be practical to:

  • move the logic that modifies the dataframe with attribute data based on the content of tfm tables out of process_wildcards;
  • combine the wildcard processing done in process_uc_wildcards and process_wildcards into a single transform?

I believe these changes would also make it easier to address #153 and #154.
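
Purely for illustration, here is a rough sketch of the shape that split could take, with made-up names and a toy "multiply by *x" update rule (not the actual xl2times code):

```python
import re
import pandas as pd

# Toy data: an attribute table and a TFM_UPD-style table with a process wildcard.
processes = ["COALPLANT", "COALMINE", "GASPLANT"]
attributes = pd.DataFrame(
    {
        "process": ["COALPLANT", "COALMINE", "GASPLANT"],
        "attribute": ["NCAP_COST", "NCAP_COST", "NCAP_COST"],
        "value": [100.0, 50.0, 80.0],
    }
)
tfm_upd = pd.DataFrame({"process": ["COAL*"], "attribute": ["NCAP_COST"], "value": ["*1.1"]})


def expand_wildcards(df: pd.DataFrame, names: list[str]) -> pd.DataFrame:
    """Step 1 (shared by uc_t and tfm tables): replace wildcard patterns with explicit matches."""

    def matches(pattern: str) -> list[str]:
        return [n for n in names if re.fullmatch(pattern.replace("*", ".*"), n)]

    out = df.copy()
    out["process"] = out["process"].map(matches)
    return out.explode("process", ignore_index=True)


def apply_tfm_updates(attr: pd.DataFrame, upd: pd.DataFrame) -> pd.DataFrame:
    """Step 2 (tfm-specific): modify the attribute dataframe using the already-expanded rows."""
    attr = attr.copy()
    for _, row in upd.iterrows():
        mask = (attr["process"] == row["process"]) & (attr["attribute"] == row["attribute"])
        # Toy rule only: treat "*x" as "multiply the existing value by x".
        attr.loc[mask, "value"] *= float(str(row["value"]).lstrip("*"))
    return attr


print(apply_tfm_updates(attributes, expand_wildcards(tfm_upd, processes)))
```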

@olejandro
Member Author

@siddharth-krishna @SamRWest what do you think?

@siddharth-krishna
Collaborator

The only potential issue I can see with separating e.g. process_wildcards into two steps, one that expands the wildcards and a second that applies/inserts/updates the resulting rows, is if the intermediate table with the expanded rows is huge and consumes a lot of memory. Right now, we expand the wildcards in each row and immediately apply/insert/update it, IIRC, which avoids the creation of this intermediate expanded table. If you think the intermediate tables are unlikely to be too large, then your plan sounds good.

An alternative idea might be to move the helper functions inside process_wildcards to utils or something and also use them in process_uc_wildcards. This would avoid the creation of such intermediate tables. My intent in #166 was anyway that the helper functions should eventually live somewhere more general (and perhaps even allow a Jupyter-based scenario workflow in the future).

@olejandro
Member Author

olejandro commented Mar 6, 2024

Memory can definitely be an issue. Could we, e.g., store resolved wildcards as a list / string in the original table until they are used, or would that not help?
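
Something like this, perhaps (a minimal sketch with made-up names):

```python
import pandas as pd

# Keep the matched names as a list in the tfm table instead of exploding right away.
tfm = pd.DataFrame({"process": ["COAL*", "GAS*"], "value": [1.0, 2.0]})
matches = {"COAL*": ["COALPLANT", "COALMINE"], "GAS*": ["GASPLANT"]}  # resolved elsewhere

tfm["matched_processes"] = tfm["process"].map(matches)  # compact: one list per rule
# ...only when the rows are actually applied:
expanded = tfm.explode("matched_processes", ignore_index=True)
print(expanded)
```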

If choosing to expand them, we could also reduce the size by dropping the records that would otherwise be overwritten.

@SamRWest
Collaborator

SamRWest commented Mar 6, 2024

The fastest approach I've found (and what I've implemented in _match_uc_wildcards()) is to essentially pre-build a lookup table with all unique wildcard patterns and their lists of matches, then join these back to the table (thus re-duplicating them) and explode the match-lists for each row (making the DF long format).

The regex lookups are slow, and the DF row iteration is incredibly slow. This minimises the regex lookups and mostly avoids the iteration, making it much faster.
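
Roughly this pattern, if I understand it correctly (a simplified sketch with illustrative column names, not the actual _match_uc_wildcards implementation):

```python
import re
import pandas as pd

processes = ["COALPLANT", "COALMINE", "GASPLANT", "GASHEATER"]
uc_t = pd.DataFrame({"pset_pn": ["COAL*", "GAS*", "COAL*"], "uc_n": ["UC1", "UC2", "UC3"]})

# 1. Run the (slow) regex matching once per *unique* pattern, not once per row.
unique_patterns = uc_t["pset_pn"].unique()
lookup = pd.DataFrame(
    {
        "pset_pn": unique_patterns,
        "process": [
            [p for p in processes if re.fullmatch(pat.replace("*", ".*"), p)]
            for pat in unique_patterns
        ],
    }
)

# 2. Join the match lists back onto the original rows (re-duplicating them)...
merged = uc_t.merge(lookup, on="pset_pn", how="left")

# 3. ...and explode to long format: one row per matched process.
long_format = merged.explode("process", ignore_index=True)
print(long_format)
```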

There are probably some additional gains to be had by building this lookup table of unique wildcard matches once for wildcards in all tables, as this should eliminate additional duplicates. Yes, it'll use more RAM to store, but it'll be far less than the end-result joined/exploded tables, so I doubt it'd be a bottleneck.

I've got this working for ~TFM_UPD over here but got sidetracked on some other stuff and need to get back to it.

I've opened PR #210 for you guys to have a look at. Happy to push on if you think this is the right approach?

The remaining performance issue is that it still has to iterate over updates.copy().iterrows() because of the eval_and_update() call, which I'm not sure how to vectorise yet. If this logic (and similar table-specific operations for other tags in there) could be separated out from the wildcard expansion logic, it would make things a lot cleaner.

@olejandro
Member Author

@SamRWest I've opened #211 which I believe simplifies the workflow. Do you think it will make some of the optimisations in #210 easier to do?
