
Fix /datasets?filter to select across namespaces #3359

Merged
5 commits merged, Mar 27, 2023

Conversation

dbutenhof
Member

PBENCH-1117

I discovered that the single simplistic LEFT JOIN allows combining native Dataset and Metadata terms in a SELECT, but with limitations: because the SQL join constructs a row for each Metadata match, matches for, e.g., server.origin and dataset.metalog.pbench.script or global.server.legacy will appear on separate table rows. Each has duplicate Dataset columns, but that doesn't help when trying to select across namespaces.

The only effective solution I was able to find was to cascade the joins in order to build a new table with a separate column for each metadata namespace row matching the dataset. This allows a single SELECT to work across the columns in the new table.
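A minimal, self-contained sketch of the cascaded-join idea. The table and column names here are illustrative stand-ins, not the actual pbench schema: each metadata namespace gets its own aliased `LEFT JOIN`, so a single `SELECT` can apply predicates to several namespaces on the same row.

```python
from sqlalchemy import JSON, Column, ForeignKey, Integer, String, and_, create_engine
from sqlalchemy.orm import Session, aliased, declarative_base

Base = declarative_base()

class Dataset(Base):
    __tablename__ = "datasets"
    id = Column(Integer, primary_key=True)
    name = Column(String)

class Metadata(Base):
    __tablename__ = "dataset_metadata"
    id = Column(Integer, primary_key=True)
    dataset_ref = Column(Integer, ForeignKey("datasets.id"))
    key = Column(String)   # namespace: "metalog", "server", "global", ...
    value = Column(JSON)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Dataset(id=1, name="drb"))
    session.add(Metadata(dataset_ref=1, key="server", value={"origin": "EC2"}))
    session.add(Metadata(dataset_ref=1, key="global", value={"legacy": "true"}))
    session.commit()

    # One aliased LEFT JOIN per namespace: each namespace's row becomes its
    # own set of columns, so one SELECT can filter across namespaces.
    # (A "user" namespace join would add a third ON predicate for the user id.)
    aliases = {k: aliased(Metadata, name=k) for k in ("server", "global")}
    query = session.query(Dataset)
    for key, alias in aliases.items():
        query = query.outerjoin(
            alias, and_(alias.dataset_ref == Dataset.id, alias.key == key)
        )
    query = query.filter(
        aliases["server"].value["origin"].as_string() == "EC2",
        aliases["global"].value["legacy"].as_string() == "true",
    )
    names = [d.name for d in query.all()]
    print(names)
```

With a single flat join, the two metadata predicates would land on different result rows and the combined filter could never match; with the cascaded aliases both land on one row.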

This broke the keysummary code, requiring some adjustment there. With the join now renaming multiple copies of the Metadata table, the simplistic override of the SELECT terms doesn't work. Instead of trying to "optimize" the SELECT, the new code just iterates through the returned datasets and their metadata collections.
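The revised keysummary pass can be pictured as follows. This is a sketch under assumptions: the real code and its `accumulate()` helper differ in detail, and the data shapes here are invented for illustration.

```python
# Aggregate the key namespaces of each returned dataset by iterating the
# rows, rather than tailoring the SELECT to the multi-join query.
METALOG = "metalog"

def accumulate(aggregate: dict, key: str, value) -> None:
    # Merge one metadata key into the key-space tree; only key names are
    # collected, so leaves become empty dicts.
    node = aggregate.setdefault(key, {})
    if isinstance(value, dict):
        for k, v in value.items():
            accumulate(node, k, v)

def keysummary(results) -> dict:
    aggregate: dict = {"dataset": {}}
    for _dataset, metadata_rows in results:
        for key, value in metadata_rows:
            # The metalog namespace nests under "dataset"; others are top level.
            target = aggregate["dataset"] if key == METALOG else aggregate
            accumulate(target, key, value)
    return aggregate

summary = keysummary(
    [("drb", [(METALOG, {"pbench": {"script": "fio"}}),
              ("server", {"origin": "EC2"})])]
)
print(summary)
```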

I'm somewhat unhappy with the `filter` unit tests, because the `sqlite3`
engine seems to generate odd and incorrect SQL for queries involving the
`user` namespace. (PostgreSQL handles this correctly on a live server.) This
should be a fourth `LEFT JOIN`, matching on `user_id` as well as the dataset
reference and primary key name. I've made the unit tests "pass" ... but the
SQL being compared wouldn't actually work correctly. (Also note that the query
builder omits the `user` join if the query doesn't reference that alias, but
doesn't similarly optimize the other joins, which seems odd.)
@dbutenhof dbutenhof added labels bug, Server, and API (Of and relating to application programming interfaces to services and functions) on Mar 24, 2023
@dbutenhof dbutenhof self-assigned this Mar 24, 2023
webbnh
webbnh previously approved these changes Mar 24, 2023
@webbnh webbnh left a comment

I'm offering my approval, assuming you don't find anything below worth changing. However, there are a few things that I'm concerned about.

Foremost is the change to filter_query(): it's not obvious to me that terms is being properly managed. (However, apparently you think the tests will pass... I just poked Mr. Jenkins to see what he thinks; if they do pass, and my concern is well-founded, then I think you'll need to expand the tests....)

Next, I have a thought about the four-joins approach.

And, finally, there's a large nit about the tests' overwriting/reusing certain identifiers.

(And, of course, there are a bunch of other nits, smaller things, and other comments/questions.)

        expression = aliases[native_key].value[keys].as_string()
        filter = expression.contains(v) if contains else expression == v
    else:
        filter = and_(*terms)
Member

After this change, can terms ever have more than one "term" in it? If not, consider changing it to a scalar.

Relatedly, if we take the if branch instead of the else branch here, I think we will ignore terms altogether...is that desired?

Member Author

terms is only used when we're referencing the Dataset table. And you're right: with the changes I think there's never more than one term and I could probably simplify. That doesn't seem critical.

Member

I agree, there's nothing critical here.

However, as the code stands with this change, I don't think the two if statements (lines 246 and 276, respectively) need to be separate from each other (that is, the path through the first dictates the path through the second).

So, I think it would be reasonable (even good!) to combine them together. We can get rid of the use_dataset flag, and the scope of terms becomes limited to only one block (and, in fact, becomes so simple that we can get rid of it altogether). You would have to replace the elif with an else, but that's easily done by moving the conditional to the contained if. And, the result would be much more straightforward code.
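The suggested shape, sketched with stand-in values rather than SQLAlchemy expressions (everything here is hypothetical; the real filter_query() builds column and JSON terms): once the two conditionals are folded together, the use_dataset flag and the one-element terms list both disappear.

```python
def build_filter(native_key: str, key_expr: str, value: str, contains: bool):
    # Single conditional replaces the flag-coupled pair of if statements.
    if native_key == "dataset":
        # Dataset-column case: build the term directly -- no terms list,
        # no use_dataset flag carried between two branches.
        return ("eq", f"dataset.{key_expr}", value)
    # Metadata-namespace case: match against the aliased JSON expression.
    op = "contains" if contains else "eq"
    return (op, f"{native_key}.value[{key_expr}]", value)

print(build_filter("dataset", "name", "drb", False))
print(build_filter("server", "origin", "EC2", True))
```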

Member

Your latest revision is much improved. However, I still suggest combining the two if statements, if you opt to make a change for some other reason.

elif k:
    self.accumulate(aggregate, k, v)

Database.dump_query(query, current_app.logger)
Member

The effect of this is conditional upon DEBUG-level logging being enabled, right? It would be nice if "debug" appeared in its name....

OK, having read further, I see that you've enhanced this to work for non-DEBUG-level; in that case, I humbly request that you explicitly include the third argument here, which will make it more obvious to the reader that it's a "debug thing".

Member Author

It's a debug thing until it's not. The default behavior of this method hasn't changed; I just extended it to allow externally overriding the level to info for testing. I didn't see any point in renaming it, and adding the default argument here is redundant.

Member

Adding the default argument here makes it clear to the reader that the function is intended to be active only for debugging (otherwise, this behavior is implicit, and the concerned reader has to go find the definition of dump_query() to figure that out). (We can keep the argument's default value for compatibility with other, existing code; but it would be good from a code-as-documentation perspective to supply the value explicitly here.)
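A sketch of that suggestion; the real dump_query() signature is assumed, not quoted. The `DEBUG` default is kept for existing callers, while passing the level explicitly at the call site makes the debug-only intent visible to the reader.

```python
import logging

def dump_query(query, logger, level: int = logging.DEBUG):
    # A no-op unless the requested level is enabled -- a debug aid by default.
    if logger.isEnabledFor(level):
        logger.log(level, "QUERY %s", query)

# Capture emitted messages to demonstrate the call.
records: list[str] = []
handler = logging.Handler()
handler.emit = lambda record: records.append(record.getMessage())
logger = logging.getLogger("dump_query_demo")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

# Explicit third argument documents the intent at the call site.
dump_query("SELECT 1", logger, logging.DEBUG)
```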

Comment on lines +345 to +348
if m.key == Metadata.METALOG:
    self.accumulate(aggregate["dataset"], m.key, m.value)
else:
    self.accumulate(aggregate, m.key, m.value)
Member

Alternately,

                a = aggregate["dataset"] if m.key == Metadata.METALOG else aggregate
                self.accumulate(a, m.key, m.value)

Member Author

Yeah, but I like the explicit if better here than the local variable, especially when I'm pretty sure black wouldn't let me put it on one line anyway.

@webbnh webbnh (Member), Mar 24, 2023

If you really like it better, that's fine. (Myself, I prefer a common call point with conditional data over multiple conditional call points.) However, I checked with Mr. Black before I posted it, and he said it was fine.

lib/pbench/test/unit/server/test_datasets_list.py (outdated; resolved)
npalaska
npalaska previously approved these changes Mar 24, 2023
@npalaska npalaska left a comment

Looks good as is, I just have a couple of questions.

lib/pbench/server/api/resources/datasets_list.py (outdated; resolved)
lib/pbench/test/unit/server/test_datasets_list.py (outdated; resolved)
lib/pbench/test/unit/server/test_datasets_list.py (outdated; resolved)
@dbutenhof dbutenhof dismissed stale reviews from npalaska and webbnh via e32fd47 March 24, 2023 20:35
npalaska
npalaska previously approved these changes Mar 24, 2023
webbnh
webbnh previously approved these changes Mar 25, 2023
@webbnh webbnh left a comment

LGTM, I just have a couple of pointed questions and a coding suggestion.

lib/pbench/server/api/resources/datasets_list.py (outdated; resolved)
lib/pbench/server/api/resources/datasets_list.py (outdated; resolved)
lib/pbench/test/unit/server/test_datasets_list.py (outdated; resolved)
lib/pbench/test/unit/server/test_datasets_list.py (outdated; resolved)
lib/pbench/server/api/resources/datasets_list.py (outdated; resolved)
siddardh-ra
siddardh-ra previously approved these changes Mar 27, 2023
@webbnh webbnh left a comment

Ship it! (Unless you want to make any of the changes from my earlier reviews.... 😉)

@npalaska npalaska left a comment

👍

@dbutenhof dbutenhof merged commit 2d1e6af into distributed-system-analysis:main Mar 27, 2023
4 checks passed
@dbutenhof dbutenhof deleted the joiner branch March 27, 2023 13:39
@portante portante added this to the v0.73 milestone Mar 28, 2023
Labels
bug, Server, API (Of and relating to application programming interfaces to services and functions)
Projects
Status: Done
Development
Successfully merging this pull request may close these issues: none yet.

5 participants