Pulsar SQL supports pulsar's primitive schema #4728
Conversation
@@ -140,9 +140,10 @@ private void initialize(List<PulsarColumnHandle> columnHandles, PulsarSplit puls
        this.readOffloaded = pulsarConnectorConfig.getManagedLedgerOffloadDriver() != null;
        this.pulsarConnectorConfig = pulsarConnectorConfig;

        Schema schema = PulsarConnectorUtils.parseSchema(pulsarSplit.getSchema());

        this.schemaHandler = getSchemaHandler(schema, pulsarSplit.getSchemaType(), columnHandles);
Please delete the getSchemaHandler method, since it has been refactored into a separate class.
    @Override
    public Object deserialize(ByteBuf byteBuf) {
        byte[] data = ByteBufUtil.getBytes(byteBuf);
I would recommend not allocating a new byte array every time we deserialize. This could heavily degrade performance. We should reuse pre-allocated buffers. Please take a look at what is done in the JSONSchemaHandler:
Pulsar's primitive schemas decode data from a byte[] and cannot be told how many bytes of it are valid, so we need to allocate a new byte[] unless we modify the primitive schemas' decode method.
> Pulsar's primitive schemas decode data from a byte[] and cannot be told how many bytes of it are valid, so we need to allocate a new byte[] unless we modify the primitive schemas' decode method.
I don't quite follow. Why can't you just do something like this:

    int size = payload.readableBytes();
    byte[] buffer = tmpBuffer.get();
    if (buffer.length < size) {
        // If the thread-local buffer is not big enough, replace it with
        // a bigger one
        buffer = new byte[size * 2];
        tmpBuffer.set(buffer);
    }
    payload.readBytes(buffer, 0, size);
    schema.decode(buffer);
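For reference, the buffer-reuse pattern in that snippet can be written out self-contained roughly as follows. This is a sketch only: `java.nio.ByteBuffer` stands in for Netty's ByteBuf, and `tmpBuffer` is the thread-local scratch buffer the snippet assumes.

```java
import java.nio.ByteBuffer;

public class BufferReuseSketch {
    // Thread-local scratch buffer, reused across deserialize() calls to
    // avoid allocating a fresh byte[] per message.
    private static final ThreadLocal<byte[]> tmpBuffer =
            ThreadLocal.withInitial(() -> new byte[1024]);

    // Copies the readable bytes of the payload into the reusable buffer
    // and returns it. ByteBuffer stands in for Netty's ByteBuf here.
    static byte[] copyIntoScratch(ByteBuffer payload) {
        int size = payload.remaining();
        byte[] buffer = tmpBuffer.get();
        if (buffer.length < size) {
            // Scratch buffer too small: grow it (with headroom) and keep it
            // for subsequent calls on this thread.
            buffer = new byte[size * 2];
            tmpBuffer.set(buffer);
        }
        payload.get(buffer, 0, size);
        return buffer;
    }

    public static void main(String[] args) {
        ByteBuffer payload = ByteBuffer.wrap("hello".getBytes());
        byte[] buf = copyIntoScratch(payload);
        System.out.println(new String(buf, 0, 5)); // prints "hello"
    }
}
```

Note that the returned buffer may be longer than the payload, which is exactly the objection raised below: a decode method that takes only a byte[] cannot know how many of the bytes are valid.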
For example, StringSchema's decode method:

    public String decode(byte[] bytes) {
        if (null == bytes) {
            return null;
        } else {
            return new String(bytes, charset);
        }
    }
So we need to overload the decode method to allow the number of readable bytes in the byte[] to be passed in.
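The proposed overload could look like the sketch below. This method does not exist on Pulsar's StringSchema; it is the hypothetical change being discussed, shown with an explicit length parameter so an oversized reusable buffer can be passed in directly.

```java
import java.nio.charset.StandardCharsets;

public class StringSchemaSketch {
    // Hypothetical overload: decode only the first `length` bytes, so a
    // reusable (possibly oversized) scratch buffer can be handed in
    // without copying it into an exactly-sized array first.
    public static String decode(byte[] bytes, int length) {
        if (bytes == null) {
            return null;
        }
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] scratch = new byte[64]; // oversized reusable buffer
        byte[] src = "pulsar".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(src, 0, scratch, 0, src.length);
        System.out.println(decode(scratch, src.length)); // prints "pulsar"
    }
}
```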
I see. I think you can just use byteBuf.array(). I looked at the code and it seems to just return the underlying byte array.
But not all ByteBuf implementations support that method: array() throws UnsupportedOperationException when the buffer is not backed by an accessible byte array (e.g. direct buffers).
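The usual defensive pattern is to check hasArray() and fall back to a copy. The sketch below uses `java.nio.ByteBuffer` as a stand-in for Netty's ByteBuf, since both expose hasArray()/array() with the same caveat for direct buffers.

```java
import java.nio.ByteBuffer;

public class ArrayAccessSketch {
    // Returns the payload bytes, avoiding a copy only when the backing
    // array exists and maps one-to-one onto the payload.
    static byte[] bytesOf(ByteBuffer buf) {
        if (buf.hasArray() && buf.arrayOffset() == 0
                && buf.array().length == buf.remaining()) {
            // Heap buffer whose backing array is exactly the payload:
            // safe to hand out without copying.
            return buf.array();
        }
        // Direct (or sliced/offset) buffer: fall back to a copy.
        byte[] copy = new byte[buf.remaining()];
        buf.duplicate().get(copy); // duplicate() leaves buf's position intact
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(new String(bytesOf(ByteBuffer.wrap("ok".getBytes()))));
        ByteBuffer direct = ByteBuffer.allocateDirect(2);
        direct.put((byte) 'h').put((byte) 'i').flip();
        System.out.println(new String(bytesOf(direct)));
    }
}
```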
In general we should just improve our Schema interface to take a ByteBuf as the byte source. We shouldn't do this optimization separately in the Presto plugin; that works against the purpose of moving Presto to use the Schema interface, which is why I started this change.
I would suggest creating a follow-up issue to add methods to Schema that support deserializing from ByteBuf, addressing the problem in the Schema implementations, and then changing the code here.
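One possible shape for that follow-up is sketched below: a ByteBuf-based decode with a default implementation that preserves today's byte[] behavior, so existing Schema implementations keep working while hot ones override it to avoid the copy. This is a hypothetical interface, not Pulsar's actual Schema API, and `java.nio.ByteBuffer` again stands in for ByteBuf.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical extension of the Schema interface discussed above.
interface SchemaSketch<T> {
    T decode(byte[] bytes);

    // Default bridges to the existing byte[] method; implementations can
    // override this to decode straight from the buffer without the copy.
    default T decode(ByteBuffer payload) {
        byte[] data = new byte[payload.remaining()];
        payload.duplicate().get(data);
        return decode(data);
    }
}

public class SchemaByteBufSketch {
    public static void main(String[] args) {
        // A minimal string schema using only the byte[] method; the
        // ByteBuffer entry point comes for free from the default.
        SchemaSketch<String> stringSchema =
                bytes -> new String(bytes, StandardCharsets.UTF_8);
        System.out.println(stringSchema.decode(ByteBuffer.wrap("abc".getBytes())));
    }
}
```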
@sijie can you create a follow-up issue then?
            return VarbinaryType.VARBINARY;
        case STRING:
            return VarcharType.VARCHAR;
        case DATE:
We should use the corresponding Presto time data types here:
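For illustration, a mapping along the lines the reviewer suggests might look like this. The Presto constants being referred to (DateType.DATE, TimeType.TIME, TimestampType.TIMESTAMP) live in Presto's SPI; plain strings are used here so the sketch stays self-contained, and the mapping itself is an assumption, not the PR's final code.

```java
// Illustrative mapping from Pulsar primitive schema type names to the
// corresponding Presto type names.
public class TypeMappingSketch {
    static String prestoTypeFor(String pulsarSchemaType) {
        switch (pulsarSchemaType) {
            case "BYTES":     return "varbinary"; // VarbinaryType.VARBINARY
            case "STRING":    return "varchar";   // VarcharType.VARCHAR
            case "DATE":      return "date";      // DateType.DATE
            case "TIME":      return "time";      // TimeType.TIME
            case "TIMESTAMP": return "timestamp"; // TimestampType.TIMESTAMP
            default:
                throw new IllegalArgumentException(pulsarSchemaType);
        }
    }

    public static void main(String[] args) {
        System.out.println(prestoTypeFor("DATE")); // prints "date"
    }
}
```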
Generally looks good! Just a few comments. @congbobo184 I would also recommend profiling the code (the primitive-schema code path) with YourKit or another Java profiler to make sure unnecessary objects are not being allocated, especially on the critical path.
    @Override
    public Object deserialize(ByteBuf byteBuf) {
        byte[] data = ByteBufUtil.getBytes(byteBuf);
        return schema.decode(data);
For time-based types, i.e. Date, Time, and Timestamp, there is not really any point in deserializing them to POJOs, as that creates additional objects that need to be allocated and then GCed. For those types we can simply return a long, which is pretty much what those Schemas already do internally, and also what the extractField method below does.
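To make the allocation argument concrete, here is a minimal sketch (my own illustration, not code from the PR) contrasting the two paths: going through a Date POJO just to call getTime() versus returning the long directly.

```java
import java.util.Date;

public class TimeDecodeSketch {
    // Path 1: wrap the millis in a Date POJO, then immediately unwrap it.
    // The Date object is a short-lived allocation per record that the GC
    // then has to collect.
    static long viaPojo(long millis) {
        Date d = new Date(millis); // extra allocation, discarded right away
        return d.getTime();
    }

    // Path 2: return the long directly; equivalent result, no allocation.
    static long direct(long millis) {
        return millis;
    }

    public static void main(String[] args) {
        long ts = 1561939200000L;
        System.out.println(viaPojo(ts) == direct(ts)); // prints "true"
    }
}
```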
Thanks for reviewing my PR. I will think carefully about the comments you left and change my code accordingly.
run Integration Tests
    public Object deserialize(ByteBuf byteBuf) {
        byte[] data = ByteBufUtil.getBytes(byteBuf);
        Object currentRecord = schema.decode(data);
        switch (schemaInfo.getType()) {
For these date, time, and timestamp types, I think we can just decode with the LongSchema. For example, there is no point in deserializing to a Date object just to get the long value via getTime().
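What LongSchema effectively does is read 8 big-endian bytes into a long, so decoding a timestamp that way yields the epoch-millis value with no intermediate object. The sketch below reimplements that decoding for illustration; the real implementation lives in Pulsar's schema package.

```java
import java.nio.ByteBuffer;

public class LongDecodeSketch {
    // Reassemble a big-endian 8-byte encoding into a long, the same
    // wire format a long-valued schema would use.
    static long decodeLong(byte[] bytes) {
        long value = 0;
        for (byte b : bytes) {
            value = (value << 8) | (b & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        // Encode an epoch-millis timestamp big-endian, then decode it back.
        byte[] encoded = ByteBuffer.allocate(8).putLong(1561939200000L).array();
        System.out.println(decodeLong(encoded)); // prints 1561939200000
    }
}
```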
In principle I would suggest not shortcutting the schema implementation; that would make maintenance very hard, because if we change anything in a schema implementation, we might forget to update the corresponding part in Presto. For example, if we changed the Date schema implementation to include tz information, Presto using LongSchema would bypass the whole backward-compatibility logic, which would cause problems.
I would actually encourage sticking to Schema and letting it take care of all the backward-compatibility handling. If this part becomes a bottleneck, we can always find a way to improve it. Correctness and maintainability are the top priorities for pulsar-presto at this moment.
@sijie Schema in Pulsar is created for serializing from a POJO to bytes and vice versa. That is a different scenario from what happens in the Presto connector, especially here: in the connector we need to map types to Presto types, which is a different goal from what Schema is designed for.
> For example, if we change the implementation in Date schema to include tz information
We would have to change the logic here regardless, if we decided to add tz info. The current logic in this PR still returns a long (timestamp) at the end of the day, which will not contain any tz info; it just does so by creating unnecessary objects like Date.
We can leave the logic as it is now, but let's create some issues about improving the performance of this and of Schema.
Also, if we intend to use only the existing Schemas for deserializing data in the Presto connector, I would suggest optimizing the Schemas first for the connector's use cases, e.g. adding support for ByteBuf, using DSL-JSON, etc., before we replace the current deserializing code.
> That is a different scenario from what happens in the Presto connector, especially here: in the connector we need to map types to Presto types, which is a different goal from what Schema is designed for.
How is that different? The schema defines the types stored in Pulsar. It is the source of truth for interpreting Pulsar types, because it handles schema versioning and schema evolution; it is not simply serialization and deserialization.
If Presto wants to serialize and deserialize raw bytes, it can define its own way of doing so. But if it is using the schema types defined in Pulsar, it should stick to Pulsar Schema, because Pulsar Schema handles versioning and evolution. It is technically wrong to bypass it.
> We would have to change the logic here regardless, if we decided to add tz info. The current logic in this PR still returns a long (timestamp) at the end of the day, which will not contain any tz info; it just does so by creating unnecessary objects like Date.
Yes. That is called schema evolution. That is why we added versioning and many other features to handle it, and why I am so insistent on using the Schema interface and implementations. You can't bypass them: the moment you bypass the Schema interface and implementation, you are bypassing schema versioning and schema evolution. When a schema evolves, you would have to re-implement the logic in Presto, and that is going to be a disaster.
> Also, if we intend to use only the existing Schemas for deserializing data in the Presto connector, I would suggest optimizing the Schemas first for the connector's use cases, e.g. adding support for ByteBuf, using DSL-JSON, etc., before we replace the current deserializing code.
@jerrypeng this PR is NOT replacing any deserializing code. It doesn't touch AVRO or JSON; it adds support for new types without changing any existing behavior. I have created a follow-up issue to optimize the date types.
We can optimize the Schemas first, before changing AVRO and JSON to use the Schema implementations. But it doesn't make any sense to block this PR.
> @jerrypeng this PR is NOT replacing any deserializing code. It doesn't touch AVRO or JSON; it adds support for new types without changing any existing behavior. I have created a follow-up issue to optimize the date types.
> We can optimize the Schemas first, before changing AVRO and JSON to use the Schema implementations. But it doesn't make any sense to block this PR.
My comment is about future work, NOT about this PR; sorry if that wasn't clear. In an ideal situation we would just use Schema for deserializing data in Presto, but I don't want to sacrifice performance to do so.
I am OK with this PR going in.
> In an ideal situation we would just use Schema for deserializing data in Presto, but I don't want to sacrifice performance to do so.
I think we agreed on this.
run java8 tests
run java8 tests
run java8 tests
run java8 tests
@congbobo184 there is a test failure related to this change. Can you please take a look?
…_pulsar_primitive_schemas
# Conflicts:
#	pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarMetadata.java
run java8 tests
run java8 tests
run java8 tests
run java8 tests
Motivation
Continues PR #4151.
Verifying this change
Tests have been added for this change.
Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes
- Dependencies (does it add or upgrade a dependency): no
- The public API: no
- The schema: no
- The default values of configurations: no
- The wire protocol: no
- The rest endpoints: no
- The admin cli options: no
- Anything that affects deployment: no
Documentation
Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
If a feature is not applicable for documentation, explain why?
If a feature is not documented yet in this PR, please create a followup issue for adding the documentation