Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Isolate expressions per-context and ensure memory stability #1431

Merged
merged 3 commits into from May 29, 2021

Conversation

sc1f
Copy link
Contributor

@sc1f sc1f commented May 26, 2021

This PR refactors the way that expressions are handled and stored in Perspective - previously, expressions columns were calculated and stored on the main table inside t_gstate and its intermediate tables (prev, current, delta, transitions, etc.). By having all expressions on the gnode's tables, expressions were "real" columns to the engine, and it guaranteed that all the features available to real columns (pivot, filter, etc.) were available to expression columns without having to do large refactors to the code that generated aggregates, pivots etc.

However, this meant that expressions were not "isolated" to the View they were created in - a user could not create a new expression named new column of float type, and then replace it immediately with an expression that returns a string, as the main table schema had already recorded new column as a float column and could not overwrite it.

Now, expression columns are stored per-context on data tables that only contain the expression columns belonging to the View. This means that each View can create expression columns that use the same aliases as other Views, and they can be of any type:

const v1 = await table.view({
  expressions: [`// new column\n "ask" - "bid"`]
});

const v2 = await table.view({
  expressions: [`// new column\n upper("client")`]
});

And now, expression columns are guaranteed to be deleted when view.delete() is called - when the context is deleted, the data tables that contain their expressions are cleaned up from the heap.

To implement this, I added a join function to t_data_table - given two data tables, it creates a new data table that references the columns from the source tables using their shared pointers without copying any of the underlying memory or allocating new data stores on the heap. When the context receives data from the gnode during notify, it is provided a t_data_table that contains the "real" columns joined with the expression columns from the context being notified. This means that the context will continue to treat expression columns as "real" columns so there is no loss of functionality.

Some work remains on the UI to allow for duplicate expression aliases to be created - right now, perspective-viewer does not allow aliases to be reused. Because #1426 changes the expressions UI "bigly", this PR does not contain any UI changes, and those changes will be added on in a new PR before the Perspective 0.9.0 release.

Changelog

  • Store expressions per-context using the t_expression_tables struct
  • t_vocab for string expressions are now per-context, instead of a static global vocab for all expressions
  • Added tests using multiple views with identical expression aliases
  • Added tests to assert that there are no memory leaks or excess allocations when views with expressions are created and deleted
  • Updated the View documentation to reflect the expressions API

@sc1f sc1f marked this pull request as draft May 26, 2021 10:12
sc1f added 2 commits May 26, 2021 19:45
working for realsies

properly construct vocab
@sc1f sc1f marked this pull request as ready for review May 26, 2021 12:17
@sc1f sc1f requested a review from texodus May 26, 2021 12:57
@sc1f sc1f added C++ enhancement Feature requests or improvements JS Python labels May 26, 2021
@sc1f sc1f added this to the 0.9.0 milestone May 26, 2021
Copy link
Member

@texodus texodus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Check my review notes for a some debug cruft leftover.

There is a mostly-trivial performance regression here of about 5%, which is IMO acceptable for the massive improvement in code organization, test coverage, inline comments and general good engineering. We discussed t_data_table construction costs at length offline, and I do not suspect these are the cause of this slowdown - but IMO we should measure the impact of this anyway and more generally make some project-wide rules about shared_ptr vs. const ref vs. std::move just so we're consistent (as this is completely ad-hoc right now :) ).

Screen Shot 2021-05-28 at 9 35 02 PM

@@ -250,6 +250,7 @@ if (PSP_WASM_BUILD)
-s DISABLE_EXCEPTION_CATCHING=0 \
-s ASSERTIONS=2 \
-s DEMANGLE_SUPPORT=1 \
-s SAFE_HEAP=1 \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this setting basically makes debug-build Perspective unusable (though it is quite useful for finding memory leaks!)

computed_function::order order_fn = computed_function::order(vocab); \
computed_function::upper upper_fn = computed_function::upper(vocab); \
computed_function::lower lower_fn = computed_function::lower(vocab); \
computed_function::length length_fn = computed_function::length(vocab); \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this needs to be a macro and not a function? The scope-capture of vocab specifically is something I'd like to stylistically avoid if possible. (Looks also like recompute is deprecated?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this macro additionally captures the vocab - it is basically just inlining this code into where the macro is invoked, and the macro doesn't take vocab as a parameter. AFAIK since it's just a text replacement macro, it shouldn't capture vocab or have any notion of what vocab is during the preprocessor step. I think leaving it as a macro works well for readability in the compute function.

Copy link
Member

@texodus texodus May 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what macro variable capture means. There's little risk of vocab being mis-used here as you mention the macro won't compile, but for complexity analysis purposes I think its best if we avoid variable capture (in this case with a macro argument) and try where possible to force scope (with {}, though this doesn't work in this case since it introduces symbols) in our macros.

Regarding readability - this is identical to a function call, no?

@@ -42,9 +42,25 @@ t_ctx_grouped_pkey::init() {
m_tree = std::make_shared<t_stree>(pivots, m_config.get_aggregates(), m_schema, m_config);
m_tree->init();
m_traversal = std::shared_ptr<t_traversal>(new t_traversal(m_tree));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR but - we should not compile this as there is no public API (nor tests ...)

const std::string& colname,
const std::vector<t_tscalar>& pkeys,
std::vector<t_tscalar>& out_data) const {
std::shared_ptr<t_data_table> master_table = m_gstate->get_table();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can go within the else block


for (const std::string& column : other_schema.m_columns) {
// Only append unique columns
if (!schema.has_column(column)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative to this seems an error ...

// Check if this expression is already in the array, if so then
// we need to replace the expression so the last expression tagged
// with the alias is the one that is applied to the engine.
if (expression_idx_map[expression_alias] !== undefined) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is impossible without a malformed config, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is possible in the current UI, if one creates an expression column and then replaces it with another one - the viewer API will pass both expressions sequentially: ["// a \n 123", "// a\n upper('abcde')"]. With all the expression UI/fix PRs merged we can fix this on the viewer side.

@@ -160,22 +159,15 @@ def _parse_expression_strings(expressions):

alias_match = re.match(ALIAS_REGEX, expression)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: ALIAS_REGEX is wrong (corrected in JS but not python ..)

const expression_schema = await view.expression_schema();
expect(Object.keys(expression_schema).length).toEqual(1);
await view.delete();
}, 3000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These looks correct on-the-face, but why 3,000? In theory, these test should run identically with 1 or 1,000,000 - iteration checks are just there for offensively slow operations.

scripts/test_python.js Show resolved Hide resolved
auto joined_delta = delta->join(ctx_expression_tables->m_delta);
auto joined_prev = prev->join(ctx_expression_tables->m_prev);
auto joined_current = current->join(ctx_expression_tables->m_current);
auto joined_transitions = transitions->join(ctx_expression_tables->m_transitions);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do this join once per context? The column sets do not change per gnode.process() call.

@texodus texodus merged commit d721184 into master May 29, 2021
@texodus texodus deleted the improved-expressions branch May 29, 2021 01:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ enhancement Feature requests or improvements JS Python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants