
Query's `query_hash` is incorrect for queries with parameters #1909

Open
redshirtrob opened this issue Aug 5, 2017 · 7 comments
@redshirtrob redshirtrob commented Aug 5, 2017

Issue Summary

When using parameters in queries, the hash of the fully rendered query text does not match the query_hash stored in the queries table. Consequently, the latest_query_data_id does not get updated.

If the query is created with a parameter, saved, run, used in a dashboard, and that dashboard is subsequently shared, the shared dashboard will crash because there will be no cached data for the query, i.e. latest_query_data_id is empty.
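The mechanism can be illustrated with a simplified stand-in for Redash's `gen_query_hash` (an assumption: the real function also strips comments; here it just normalizes whitespace and case before MD5-hashing):

```python
import hashlib

def gen_query_hash(sql):
    # Simplified stand-in: collapse whitespace, lowercase, then MD5.
    normalized = " ".join(sql.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

template = "SELECT 1 WHERE d >= CURRENT_DATE - interval '{{day}}' DAY"
rendered = "SELECT 1 WHERE d >= CURRENT_DATE - interval '1' DAY"

# The queries table stores the hash of the template text, but the job is
# keyed by the hash of the rendered text, so store_result() never finds
# the saved row and latest_query_data_id is never updated.
print(gen_query_hash(template) == gen_query_hash(rendered))  # False
```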

Steps to Reproduce

  1. Create a new query with a parameter, e.g.
SELECT ts_bucket,
       COUNT(*) AS records
FROM db.table
WHERE ts_date >= CURRENT_DATE - interval '{{day}}' DAY
  AND ts_bucket >= CURRENT_TIMESTAMP - interval '24' hour
GROUP BY ts_bucket
  2. Save the query

  3. Observe the data stored in the queries table, i.e. query_hash and latest_query_data_id

=> select id,query_hash,latest_query_data_id from queries where id=7;
 id |            query_hash            | latest_query_data_id
----+----------------------------------+----------------------
  7 | 7d6b4ae0f8b0838f73c83f19dedf602e |                  186
  4. Execute the query

  5. Observe the log messages. Note the logged query_hash does NOT match the table's value.

[2017-08-05 05:10:55,082][PID:637][INFO][root] Inserting job for c453626aef08a7d0e8506e90ac5bf4f0 with metadata={'Username': u'email', 'Query ID': 7}
  6. Observe the data stored in the queries table is unchanged, especially latest_query_data_id
=> select id,query_hash,latest_query_data_id from queries where id=7;
 id |            query_hash            | latest_query_data_id
----+----------------------------------+----------------------
  7 | 7d6b4ae0f8b0838f73c83f19dedf602e |                  186
  7. Remove the parameter from the query and replace it with the actual value
SELECT ts_bucket,
       COUNT(*) AS records
FROM db.table
WHERE ts_date >= CURRENT_DATE - interval '1' DAY
  AND ts_bucket >= CURRENT_TIMESTAMP - interval '24' hour
GROUP BY ts_bucket
  8. Save the query

  9. Observe the new values in the database. Note that query_hash now matches the value from the log message in Step 5.

=> select id,query_hash,latest_query_data_id from queries where id=7;
 id |            query_hash            | latest_query_data_id
----+----------------------------------+----------------------
  7 | c453626aef08a7d0e8506e90ac5bf4f0 |                  188
  10. Execute the query

  11. Observe the log messages. Note the logged query_hash now matches the table's value.

[2017-08-05 05:20:12,731][PID:1387][INFO][root] Inserting job for c453626aef08a7d0e8506e90ac5bf4f0 with metadata={'Username': u'email', 'Query ID': 7}
  12. Observe the values in the database. Note that latest_query_data_id has been updated.
=> select id,query_hash,latest_query_data_id from queries where id=7;
 id |            query_hash            | latest_query_data_id
----+----------------------------------+----------------------
  7 | c453626aef08a7d0e8506e90ac5bf4f0 |                  189

I expect parameterized queries to behave just like any other query: when they succeed, the data should be linked to the queries table, and the latest results should be available in any shared dashboards.

Technical details:

I believe the simplest solution is to pass along a query_hash field in the metadata parameter to tasks/queries.py::enqueue_query(). That way QueryExecutor could choose the metadata version when calling models.QueryResult.store_result(). The query_hash field in metadata should contain the same value as the query_hash column in the queries table.
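A minimal sketch of that idea (the function body and the metadata field name are illustrative, not the actual Redash code):

```python
import hashlib

def gen_query_hash(sql):
    # Simplified stand-in for Redash's hash function.
    return hashlib.md5(" ".join(sql.split()).lower().encode("utf-8")).hexdigest()

def enqueue_query(query, data_source, user_id, metadata=None):
    """Hypothetical variant: let callers supply the saved row's hash via
    metadata; fall back to hashing the (rendered) query text as today."""
    metadata = dict(metadata or {})
    metadata.setdefault("query_hash", gen_query_hash(query))
    return metadata  # QueryExecutor would pass this on to store_result()

# A caller that knows the saved query's hash passes it along explicitly,
# so results link back to the templated row even for non-default runs:
saved_hash = gen_query_hash("SELECT 1 WHERE d > interval '{{day}}' DAY")
meta = enqueue_query("SELECT 1 WHERE d > interval '3' DAY", "pg", 1,
                     metadata={"query_hash": saved_hash})
print(meta["query_hash"] == saved_hash)  # True
```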

My main concern with replacing the value of the query_hash formal parameter to enqueue_query is that it's used in a bunch of places in the redis_pipeline and I'm not sure what kind of unintended side-effects that might cause.

  • Redash Version: 1.0.3+b2850
  • Browser/OS: Chrome/macOS
  • How did you install Redash: Docker
@arikfr arikfr commented Aug 6, 2017

Thank you for the elaborate description. Another possible solution is to have the query_hash represent the query with its default parameter(s), then the rest of the system will work as intended.

WDYT?
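A sketch of that approach, assuming mustache-style `{{name}}` placeholders and a dict of default values (both assumptions about how the parameters are represented):

```python
import hashlib
import re

def render_with_defaults(template, defaults):
    # Substitute each {{name}} placeholder with its default value.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(defaults[m.group(1)]), template)

def gen_query_hash(sql):
    # Simplified stand-in for Redash's hash function.
    return hashlib.md5(" ".join(sql.split()).lower().encode("utf-8")).hexdigest()

template = "SELECT 1 WHERE d >= CURRENT_DATE - interval '{{day}}' DAY"

# Saving hashes the default-rendered text, so a run with default
# parameters produces the same hash and the saved row gets linked:
saved = gen_query_hash(render_with_defaults(template, {"day": 1}))
run = gen_query_hash("SELECT 1 WHERE d >= CURRENT_DATE - interval '1' DAY")
print(saved == run)  # True
```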

@redshirtrob redshirtrob commented Aug 6, 2017

@arikfr I'm not sure I understand how that will work. I assume you mean the query_hash value stored in the queries table will be the hash of the default rendered Query? If so, I think there's still an issue in that enqueue_query generates the hash based on the text sent from the FE. This could be a Query rendered with non-default values.

def enqueue_query(query, data_source, user_id, scheduled_query=None, metadata={}):
    query_hash = gen_query_hash(query)
    logging.info("Inserting job for %s with metadata=%s", query_hash, metadata)

OTOH, perhaps I'm misunderstanding something about the architecture. I realize there are quite a few moving parts here and I'm still working my way through the code.

Anyway, I want to propose another approach. Ultimately, the Query and QueryResult are linked via models.QueryResult.store_result(), specifically this bit of code:

        queries = db.session.query(Query).filter(
            Query.query_hash == query_hash,
            Query.data_source == data_source)
        for q in queries:
            q.latest_query_data = query_result
            db.session.add(q)

Suppose we altered the interfaces a bit to pass along a query_id as well (I've done a preliminary scan, and this seems quite possible). Then, we could change that bit of code to:

        # Change here: use sqlalchemy.or_, since Python's `or` does not
        # compose SQLAlchemy column expressions into a SQL OR.
        queries = db.session.query(Query).filter(
            or_(Query.query_hash == query_hash, Query.id == query_id),
            Query.data_source == data_source)
        for q in queries:
            q.latest_query_data = query_result
            db.session.add(q)

I believe this would preserve existing behavior (in that "duplicate" queries are updated with the latest data) while ensuring the source parameterized query also gets linked.
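The intended matching semantics can be sketched in plain Python (a stand-in for the ORM filter, with query_id as a hypothetical new argument):

```python
def queries_to_update(rows, query_hash, data_source, query_id=None):
    """Rows whose hash matches OR whose id is the source query's id,
    restricted to the same data source (stand-in for the ORM filter)."""
    return [r for r in rows
            if r["data_source"] == data_source
            and (r["query_hash"] == query_hash or r["id"] == query_id)]

rows = [
    {"id": 7, "query_hash": "7d6b4ae0", "data_source": "pg"},  # templated source
    {"id": 9, "query_hash": "c453626a", "data_source": "pg"},  # duplicate query
]
# The rendered-text hash matches row 9; query_id=7 also pulls in the
# parameterized source row, so both get latest_query_data updated.
print([r["id"] for r in queries_to_update(rows, "c453626a", "pg", query_id=7)])
# [7, 9]
```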

I'm happy to continue kicking this around. And, once we land on an approach, I'm happy to send out a PR.

@arikfr arikfr commented Aug 6, 2017

I assume you mean the query_hash value stored in the queries table will be the hash of the default rendered Query?

Exactly.

If so, I think there's still an issue in that enqueue_query generates the hash based on the text sent from the FE. This could be a Query rendered with non-default values.

It definitely can be, and in that case there is no reason to update the query's latest result reference. latest_query_data is mostly used as an optimization to avoid looking up the last result; it's also used for shared dashboards. For the optimization it doesn't matter much, but for shared dashboards it would result in unexpected behavior if latest_query_data suddenly started referencing "random" results rather than the default one (because a shared dashboard only shows the default one).

@redshirtrob redshirtrob commented Aug 7, 2017

Ah, it's the point about shared dashboards that I was missing. Thanks for the explanation.

In that case I think your solution is appropriate. IIUC, saving the query would also need to render the template with default values and generate the hash from that. Is that what you mean?

@arikfr arikfr commented Aug 7, 2017

In that case I think your solution is appropriate. IIUC, saving the query would also need to render the template with default values and generate the hash from that. Is that what you mean?

Yes.
Not sure I'm 100% comfortable with this, but it feels like a reasonable solution for this issue (and not a hard one either).

@redshirtrob redshirtrob commented Aug 7, 2017

Here's one other thing to consider, that I tried to call out in my initial report. If you create a brand new query with a parameter, such that the first query save is already templated, you can never share this query in a dashboard because latest_query_data can't get set.

Anyway, I can understand being somewhat queasy with this change, but I think it addresses a pretty clear bug. And, as you said, it's a pretty small change.

I'll craft a PR.

@arikfr arikfr commented Aug 9, 2017

Thanks, @redshirtrob.
