@gaoran10
Contributor

@gaoran10 gaoran10 commented Feb 14, 2023

Motivation

Currently, Pulsar SQL does not support querying compacted data. Compacted data holds only the latest message for each key, so it acts like a table view of the topic. Querying it through Pulsar SQL is useful, for example, to read topic policies.

How to use

Use the __compacted_query__ flag in the WHERE clause to indicate whether the query should read compacted data.

trino> select * from pulsar."public/default"."pt-5" where __compacted_query__=true;
 age |  name   | __partition__ | __event_time__ |    __publish_time__     | __message_id__ | __sequence_id__ | __producer_name__ | __key__ | __properties__ | __compacted_query__
-----+---------+---------------+----------------+-------------------------+----------------+-----------------+-------------------+---------+----------------+---------------------
 102 | user-92 |             0 | NULL           | 2023-02-13 03:07:19.925 | (2414,18,0)    |              18 | standalone-0-8    | 2       | {}             | true
 107 | user-97 |             0 | NULL           | 2023-02-13 03:07:19.937 | (2414,19,0)    |              19 | standalone-0-8    | 7       | {}             | true
 100 | user-90 |             3 | NULL           | 2023-02-13 03:07:19.921 | (2419,18,0)    |              18 | standalone-0-8    | 0       | {}             | true
 104 | user-94 |             2 | NULL           | 2023-02-13 03:07:19.930 | (2418,18,0)    |              18 | standalone-0-8    | 4       | {}             | true
 109 | user-99 |             2 | NULL           | 2023-02-13 03:07:19.941 | (2418,19,0)    |              19 | standalone-0-8    | 9       | {}             | true
 103 | user-93 |             1 | NULL           | 2023-02-13 03:07:19.927 | (2416,18,0)    |              18 | standalone-0-8    | 3       | {}             | true
 108 | user-98 |             1 | NULL           | 2023-02-13 03:07:19.939 | (2416,19,0)    |              19 | standalone-0-8    | 8       | {}             | true
 101 | user-91 |             4 | NULL           | 2023-02-13 03:07:19.923 | (2417,18,0)    |              18 | standalone-0-8    | 1       | {}             | true
 106 | user-96 |             4 | NULL           | 2023-02-13 03:07:19.935 | (2417,19,0)    |              19 | standalone-0-8    | 6       | {}             | true
(9 rows)

Query 20230214_110459_00000_6kfee, FINISHED, 1 node
Splits: 21 total, 21 done (100.00%)
1.20 [9 rows, 8.23KB] [7 rows/s, 6.84KB/s]

Compacted query workflow

  1. Read data from the compacted ledger (if one exists) and cache each message ID with its corresponding key.
  2. Read the uncompacted data and cache each message ID with its corresponding key.
  3. Read the complete data for each key using its latest message ID.
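
The three steps above can be sketched roughly as follows. This is a minimal, in-memory illustration rather than the code from this PR: CompactedQuerySketch, Position, Message, and the reader function are hypothetical names, and the sketch assumes that messages in the compacted ledger keep their original (ledgerId, entryId, batchIndex) message IDs so that positions from both sources can be compared directly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; these are NOT the classes changed in the PR.
final class CompactedQuerySketch {

    // A message ID as shown in the __message_id__ column: (ledgerId, entryId, batchIndex).
    record Position(long ledgerId, long entryId, int batchIndex) implements Comparable<Position> {
        @Override
        public int compareTo(Position other) {
            int c = Long.compare(ledgerId, other.ledgerId);
            if (c != 0) {
                return c;
            }
            c = Long.compare(entryId, other.entryId);
            return c != 0 ? c : Integer.compare(batchIndex, other.batchIndex);
        }
    }

    record Message(Position position, String key, String value) {}

    // Steps 1 and 2: scan the compacted data, then the uncompacted data,
    // and keep only the greatest position seen for each key.
    static Map<String, Position> latestPositions(List<Message> compacted, List<Message> uncompacted) {
        Map<String, Position> latest = new LinkedHashMap<>();
        for (List<Message> source : List.of(compacted, uncompacted)) {
            for (Message m : source) {
                latest.merge(m.key(), m.position(), (a, b) -> a.compareTo(b) >= 0 ? a : b);
            }
        }
        return latest;
    }

    // Step 3: read the complete message for each cached position.
    // The reader stands in for whatever fetches an entry by message ID.
    static List<Message> query(List<Message> compacted, List<Message> uncompacted,
                               Function<Position, Message> reader) {
        List<Message> rows = new ArrayList<>();
        for (Position p : latestPositions(compacted, uncompacted).values()) {
            rows.add(reader.apply(p));
        }
        return rows;
    }
}
```

Caching only the message ID per key in the first two passes keeps memory proportional to the number of distinct keys; the full payload is fetched once per key in the last step.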

About Splits

For a compacted query, a non-partitioned topic or a partitioned topic can have only one sub-query task. This differs from a normal query, where one topic can be split into multiple sub-tasks.
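
To illustrate the difference in split planning, here is a small sketch under stated assumptions. SplitPlanSketch, EntryRange, and plan are hypothetical names and not part of the Pulsar SQL connector; the sketch models a topic as a contiguous range of entries. The assumed reason for the single task, which the description above does not spell out, is that finding the latest message per key needs a view of the whole topic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of split planning; not the connector's actual split manager.
final class SplitPlanSketch {

    // Half-open range of entries handled by one sub-query task.
    record EntryRange(long startEntry, long endEntryExclusive) {}

    static List<EntryRange> plan(long totalEntries, int targetSplits, boolean compactedQuery) {
        if (totalEntries < 0 || targetSplits < 1) {
            throw new IllegalArgumentException("totalEntries must be >= 0 and targetSplits >= 1");
        }
        // Compacted query: one task covers the whole topic.
        if (compactedQuery) {
            return List.of(new EntryRange(0, totalEntries));
        }
        // Normal query: divide the entries into contiguous, near-equal ranges.
        List<EntryRange> ranges = new ArrayList<>();
        long base = totalEntries / targetSplits;
        long remainder = totalEntries % targetSplits;
        long start = 0;
        for (int i = 0; i < targetSplits; i++) {
            long size = base + (i < remainder ? 1 : 0);
            if (size == 0) {
                continue; // more requested splits than entries
            }
            ranges.add(new EntryRange(start, start + size));
            start += size;
        }
        return ranges;
    }
}
```

With 10 entries and 3 requested splits, a normal query in this sketch yields three ranges, while a compacted query yields a single range covering all 10 entries.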

Modifications

Add a virtual metadata column __compacted_query__ that indicates whether to execute a compacted query.
Add a map in PulsarRecordCursor to cache the latest message ID of each key.
Add a CompactedLedgerReader in PulsarRecordCursor to get the latest message ID of each key and to read the complete data by that message ID.

Verifying this change

Add unit tests and integration tests for compacted queries using Pulsar SQL.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: (gaoran10#22)

@github-actions github-actions bot added the doc-required Your PR changes impact docs and you will update later. label Feb 14, 2023
@gaoran10 gaoran10 self-assigned this Feb 14, 2023
@gaoran10 gaoran10 added the area/sql Pulsar SQL related features label Feb 14, 2023
@gaoran10 gaoran10 changed the title from "[fea][sql] Support querying compacted data in Pulsar SQL" to "[feat][sql] Support querying compacted data in Pulsar SQL" Feb 14, 2023
@github-actions

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Mar 24, 2023
@tisonkun tisonkun self-requested a review June 13, 2023 03:34
@tisonkun
Member

I'm going to take a look at this PR. Could you rebase on the latest master? IIRC we made some changes and fixed a lot of flaky tests, which should ease the review process.

@github-actions github-actions bot removed the Stale label Jun 14, 2023
@github-actions

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Jul 14, 2023
@Technoboy- Technoboy- added this to the 3.2.0 milestone Jul 31, 2023
@Technoboy- Technoboy- modified the milestones: 3.2.0, 3.3.0 Dec 22, 2023
@coderzc coderzc removed this from the 3.3.0 milestone May 8, 2024
@coderzc coderzc added this to the 3.4.0 milestone May 8, 2024
@tisonkun
Member

tisonkun commented May 9, 2024

Closed. We no longer bundle pulsar-sql.

@tisonkun tisonkun closed this May 9, 2024
