Pulsar SQL unusable when topic contains malformed message #4982

@ghost

Description

Describe the bug

$ pulsar sql --execute 'select * from pulsar."test/app".topic1' \
 --output-format ALIGNED

Query 20190820_051411_00001_t5ejs failed: Malformed data. Length is negative: -62

and the server log shows:

2019-08-20T01:14:12.305-0400	ERROR	remote-task-callback-1	com.facebook.presto.execution.StageStateMachine	Stage 20190820_051411_00001_t5ejs.1 failed
org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -62
	at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
	at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
	at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
	at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
	at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:414)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:181)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
	at org.apache.pulsar.sql.presto.AvroSchemaHandler.deserialize(AvroSchemaHandler.java:70)
	at org.apache.pulsar.sql.presto.PulsarRecordCursor.advanceNextPosition(PulsarRecordCursor.java:421)
	at com.facebook.presto.spi.RecordPageSource.getNextPage(RecordPageSource.java:93)
	at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:242)
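
For context on where a negative length can come from: Avro's binary encoding stores a string as a zigzag-encoded variable-length long (the byte count) followed by the UTF-8 bytes. If plain text sits where the decoder expects that length prefix, the first text byte is decoded as the length, and any odd byte value comes out negative. The sketch below is a hypothetical stand-alone helper (`ZigZagLengthDemo` and `readZigZagLong` are names made up for illustration, not Avro or Pulsar APIs). It shows that the byte `0x7B` (`{`) decodes to -62, the value in the error above, and that the `y` of `yoshi` decodes to -61. Whether the failing payload really started with that byte is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Avro or Pulsar: decodes the zigzag varint
// that Avro's binary encoding places in front of a string or bytes value.
class ZigZagLengthDemo {

    // Reads one variable-length zigzag long from the start of the buffer.
    static long readZigZagLong(byte[] buf) {
        long raw = 0;
        int shift = 0;
        for (byte b : buf) {
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                // Zigzag decoding: even raw values are >= 0, odd ones are < 0.
                return (raw >>> 1) ^ -(raw & 1);
            }
            shift += 7;
            if (shift > 63) {
                break; // more than 10 bytes cannot be a valid long
            }
        }
        throw new IllegalArgumentException("truncated or oversized varint");
    }

    public static void main(String[] args) {
        byte[] text = "yoshi".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readZigZagLong(text));               // 'y' = 0x79, prints -61
        System.out.println(readZigZagLong(new byte[] {0x7B}));  // '{' = 0x7B, prints -62
    }
}
```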

To Reproduce
Steps to reproduce the behavior:

  1. Create a topic with an AvroSchema.
  2. Using the CLI, send a non-Avro message to the topic:
    $ pulsar-client produce persistent://test/app/topic1 -m yoshi
  3. Query the topic from the CLI:
    $ pulsar sql --execute 'select * from pulsar."test/app".topic1' --output-format ALIGNED

Expected behavior
I expect the query to return the correctly formed Avro messages, perhaps with a <malformed> marker in place of each message that cannot be decoded:

 comment     | __event_time__ |      __publish_time__       | __message_id__ | __sequence_id__ | __producer_name__ | __key__ | __properties__
-------------+----------------+-----------------------------+----------------+-----------------+-------------------+---------+----------------
 yoshi!      | NULL           | 2019-08-16 23:30:17.405 UTC | (14438,4,0)    |               0 | standalone-7-71   | NULL    | {}
 yoshi!      | NULL           | 2019-08-16 23:15:21.035 UTC | (14438,3,0)    |               0 | standalone-7-70   | NULL    | {}
 <malformed> | NULL           | 2019-08-16 23:14:21.035 UTC | (14438,2,0)    |               0 | standalone-7-70   | NULL    | {}
 yoshi!      | NULL           | 2019-08-16 23:12:21.035 UTC | (14438,1,0)    |               0 | standalone-7-70   | NULL    | {}
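
A minimal sketch of the behavior I have in mind, assuming decode failures surface as unchecked exceptions (as the AvroRuntimeException in the stack trace does). Everything here is hypothetical: `TolerantDecodeDemo`, `decodeAll`, and the toy one-byte-length decoder are stand-ins for illustration, not the connector's actual code or Avro's real decoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: keep scanning when one payload fails to decode.
class TolerantDecodeDemo {
    static final String MALFORMED = "<malformed>";

    // Decodes each payload; a failure yields a marker row instead of
    // aborting the whole scan.
    static List<String> decodeAll(List<byte[]> payloads, Function<byte[], String> decoder) {
        List<String> rows = new ArrayList<>();
        for (byte[] payload : payloads) {
            try {
                rows.add(decoder.apply(payload));
            } catch (RuntimeException e) {
                rows.add(MALFORMED);
            }
        }
        return rows;
    }

    // Toy stand-in for a schema decoder: one length byte, then ASCII text.
    static String toyDecode(byte[] payload) {
        int length = payload[0];
        if (length < 0 || length > payload.length - 1) {
            throw new IllegalStateException("Malformed data. Length is " + length);
        }
        return new String(payload, 1, length, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] good = {6, 'y', 'o', 's', 'h', 'i', '!'};
        byte[] bad = "yoshi".getBytes(StandardCharsets.US_ASCII); // no length prefix
        List<String> rows = decodeAll(Arrays.asList(good, bad, good), TolerantDecodeDemo::toyDecode);
        System.out.println(rows); // prints [yoshi!, <malformed>, yoshi!]
    }
}
```

Catching per message keeps one bad payload from failing the whole query. Whether to emit a marker row or silently skip the message is a separate design choice.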

Labels

area/sql, lifecycle/stale, type/bug