Motivation
Native indexing tasks currently support only text file formats because FirehoseFactory is tightly coupled with InputRowParser (#5584). Other issues around FirehoseFactory and InputRowParser are:
The sampler and the indexing task use the same Firehose interface even though their use cases are quite different.
InputRowParser doesn't have to be exposed in the user spec.
Parser is only for text file formats, but all ParseSpecs have makeParser().
Parallel indexing tasks will need to be able to read a portion of a file for finer-grained parallelism in the future. To do this, the interface abstracting file formats should expose whether a format is splittable.
Kafka/Kinesis tasks can use only implementations of ByteBufferInputRowParser. This forces us to have duplicate implementations for batch and realtime tasks (AvroHadoopInputRowParser vs AvroStreamInputRowParser).
Since it's not easy to modify or improve the existing interfaces without a huge change, we need new interfaces that can be used instead of FirehoseFactory, Firehose, InputRowParser, and ParseSpec. The new interfaces should support the following storage types and file formats.
Storage types
HDFS
cloud (S3, GCP, etc.)
local
http
sql
byte (inline, kafka/kinesis tasks)
Druid (reingestion)
File formats
csv
tsv
json
regex
influx
javascript
avro
orc
parquet
protobuf
thrift
Proposed changes
The proposed new interfaces are:
InputSource
InputSource abstracts the storage where input data comes from for batch ingestion. This will replace FiniteFirehoseFactory.
public interface InputSource
{
  /**
   * Returns true if this inputSource can be processed in parallel using ParallelIndexSupervisorTask.
   */
  boolean isSplittable();

  /**
   * Returns true if this inputSource supports different {@link InputFormat}s.
   */
  boolean needsFormat();

  InputSourceReader reader(
      InputRowSchema inputRowSchema,
      @Nullable InputFormat inputFormat,
      @Nullable File temporaryDirectory
  );
}
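To make the contract more concrete, here is a minimal, self-contained sketch of a source whose data is embedded inline. The types are simplified stand-ins defined only for this example: InputRowSchema is reduced to a list of column names, InputFormat to a one-method line parser, and InputSourceReader to a single read() call. InlineInputSource and InlineInputSourceDemo are hypothetical names, not part of the proposal.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in: parses one line of text into column values, in order.
interface InputFormat
{
  List<String> parseLine(String line);
}

// Simplified stand-in: returns every row as a column-name -> value map.
interface InputSourceReader
{
  List<Map<String, String>> read();
}

interface InputSource
{
  boolean isSplittable();

  boolean needsFormat();

  // inputFormat and temporaryDirectory may be null, as in the proposed signature.
  InputSourceReader reader(List<String> columns, InputFormat inputFormat, File temporaryDirectory);
}

// Hypothetical source that carries its data directly in the spec.
class InlineInputSource implements InputSource
{
  private final String data;

  InlineInputSource(String data)
  {
    this.data = data;
  }

  @Override
  public boolean isSplittable()
  {
    return false; // one in-memory blob, nothing to distribute to parallel workers
  }

  @Override
  public boolean needsFormat()
  {
    return true; // raw text needs an InputFormat to become rows
  }

  @Override
  public InputSourceReader reader(List<String> columns, InputFormat inputFormat, File temporaryDirectory)
  {
    if (inputFormat == null) {
      throw new IllegalArgumentException("inputFormat is required when needsFormat() is true");
    }
    return () -> {
      List<Map<String, String>> rows = new ArrayList<>();
      for (String line : data.split("\n")) {
        if (line.isEmpty()) {
          continue;
        }
        List<String> values = inputFormat.parseLine(line);
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.size() && i < values.size(); i++) {
          row.put(columns.get(i), values.get(i));
        }
        rows.add(row);
      }
      return rows;
    };
  }
}

class InlineInputSourceDemo
{
  public static void main(String[] args)
  {
    InputSource source = new InlineInputSource("2019-01-01,a\n2019-01-02,b\n");
    InputFormat csv = line -> Arrays.asList(line.split(","));
    System.out.println(source.reader(Arrays.asList("ts", "dim"), csv, null).read());
  }
}
```

The point of the sketch is that the source decides whether it needs a format at all; a source that already yields structured rows could return false from needsFormat() and ignore the inputFormat argument.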
InputRowSchema is the schema for InputRow to be created.
SplittableSource is the splittable InputSource that can be processed in parallel.
Check HttpInputSource as an example.
InputSourceReader and InputSourceSampler
You can create InputSourceReader and InputSourceSampler from InputSource. InputSourceReader is for reading inputs and creating segments while InputSourceSampler is for sampling inputs.
The reader and the sampler are the interfaces that users will use directly.
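One way the two could differ in practice is sketched below, assuming the reader fails fast on unparseable input while the sampler stops at a row limit and records bad input instead of throwing. That behavior is an assumption made for illustration; the proposal only says what each is for. The classes reuse the names InputSourceReader and InputSourceSampler, but their methods, the Sample holder, and the use of plain numbers as rows are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader: parses every line and fails on the first bad one.
class InputSourceReader
{
  List<Long> read(List<String> lines)
  {
    List<Long> rows = new ArrayList<>();
    for (String line : lines) {
      rows.add(Long.parseLong(line.trim())); // a NumberFormatException aborts the read
    }
    return rows;
  }
}

// Hypothetical sampler: inspects at most maxRows lines and keeps going after bad input.
class InputSourceSampler
{
  static class Sample
  {
    final List<Long> rows = new ArrayList<>();
    final List<String> unparseable = new ArrayList<>();
  }

  Sample sample(List<String> lines, int maxRows)
  {
    Sample sample = new Sample();
    for (String line : lines) {
      if (sample.rows.size() + sample.unparseable.size() >= maxRows) {
        break;
      }
      try {
        sample.rows.add(Long.parseLong(line.trim()));
      }
      catch (NumberFormatException e) {
        sample.unparseable.add(line); // reported back to the caller, not thrown
      }
    }
    return sample;
  }
}
```

Keeping the two behind separate interfaces means neither has to carry the other's concerns, which is the problem the Motivation section describes with the shared Firehose interface.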
SplitIteratingReader is an example of InputSourceReader.
ObjectSource and InputFormat
InputSourceReader and InputSourceSampler will internally use ObjectSource and InputFormat.
ObjectSource knows how to read bytes from the given object.
You can directly open an InputStream on the ObjectSource or fetch() the remote object into a local disk and open a FileInputStream on it. This may be useful to avoid expensive random access on remote storage (e.g., an ORC file in S3) or holding connections for too long (as in SqlFirehoseFactory). Check HttpSource as an example ObjectSource.
InputFormat knows how to parse bytes.
ObjectReader actually reads and parses data and returns an iterator of InputRow.
For example, OrcInputFormat creates OrcReader. Note that this implementation is really not optimized but will show how it could be implemented.
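As a library-free stand-in for that division of labor (it is not the OrcInputFormat code the note refers to), the sketch below wires an ObjectSource, an InputFormat, and an ObjectReader together. The method names open(), fetch(), createReader(), and read(), the use of String[] as a row, and the CsvInputFormat class are all assumptions made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Stand-in for ObjectSource: knows how to get bytes, nothing about their format.
interface ObjectSource
{
  InputStream open() throws IOException;

  // Copies the object into temporaryDirectory (or the default temp dir if null).
  default File fetch(File temporaryDirectory) throws IOException
  {
    File local = File.createTempFile("fetched-", ".tmp", temporaryDirectory);
    try (InputStream in = open()) {
      Files.copy(in, local.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
    return local;
  }
}

// Stand-in for ObjectReader: reads and parses, returning an iterator of rows.
interface ObjectReader
{
  Iterator<String[]> read(ObjectSource source, File temporaryDirectory) throws IOException;
}

// Stand-in for InputFormat: knows how to parse bytes, so it creates the matching reader.
interface InputFormat
{
  ObjectReader createReader();
}

// Hypothetical text format. It fetches first only to show the local-file path;
// a streaming format could call source.open() directly instead.
class CsvInputFormat implements InputFormat
{
  @Override
  public ObjectReader createReader()
  {
    return (source, temporaryDirectory) -> {
      File local = source.fetch(temporaryDirectory);
      try {
        List<String[]> rows = new ArrayList<>();
        for (String line : Files.readAllLines(local.toPath(), StandardCharsets.UTF_8)) {
          if (!line.isEmpty()) {
            rows.add(line.split(","));
          }
        }
        return rows.iterator();
      }
      finally {
        local.delete(); // the fetched copy is only needed while parsing
      }
    };
  }
}
```

Because the reader only sees an ObjectSource, the same format code would work whether the bytes come from HTTP, cloud storage, or an inline byte buffer.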
Deprecated ParseSpec
The existing ParseSpec will be split into TimestampSpec, DimensionsSpec, and InputFormat. TimestampSpec and DimensionsSpec will be at the top level of DataSchema. InputFormat will be in ioConfig.
An example spec is:
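In code form, the split described above might be modeled roughly as follows. This is only a sketch of where each piece ends up: LegacyParseSpec, NewSpec, and ParseSpecSplitter are hypothetical names, and each spec is reduced to a plain string or list rather than the real Druid classes.

```java
import java.util.List;

// Hypothetical, flattened view of the deprecated ParseSpec.
class LegacyParseSpec
{
  final String format;            // e.g. "json" or "csv"
  final String timestampColumn;   // stands in for TimestampSpec
  final List<String> dimensions;  // stands in for DimensionsSpec

  LegacyParseSpec(String format, String timestampColumn, List<String> dimensions)
  {
    this.format = format;
    this.timestampColumn = timestampColumn;
    this.dimensions = dimensions;
  }
}

// New layout: timestamp and dimensions sit at the top level of DataSchema.
class DataSchema
{
  final String timestampColumn;
  final List<String> dimensions;

  DataSchema(String timestampColumn, List<String> dimensions)
  {
    this.timestampColumn = timestampColumn;
    this.dimensions = dimensions;
  }
}

// New layout: the format lives in ioConfig.
class IOConfig
{
  final String inputFormat;

  IOConfig(String inputFormat)
  {
    this.inputFormat = inputFormat;
  }
}

class NewSpec
{
  final DataSchema dataSchema;
  final IOConfig ioConfig;

  NewSpec(DataSchema dataSchema, IOConfig ioConfig)
  {
    this.dataSchema = dataSchema;
    this.ioConfig = ioConfig;
  }
}

class ParseSpecSplitter
{
  // Moves each part of the old ParseSpec to its new home.
  static NewSpec split(LegacyParseSpec legacy)
  {
    return new NewSpec(
        new DataSchema(legacy.timestampColumn, legacy.dimensions),
        new IOConfig(legacy.format)
    );
  }
}
```

A conversion of this shape is also what a backward-compatibility path would need: an old spec that still carries a ParseSpec can be read and mapped onto the new layout.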
Prefetch and cache
The FirehoseFactorys extending PrefetchableTextFilesFirehoseFactory currently support prefetching and caching, which I don't find very useful. I ran a couple of tests in a cluster and on my laptop. The total time taken to download 2 files (200 MB each) from S3 into local storage was 4 sec and 20 sec, whereas the total ingestion time was 20 min and 30 min, respectively. I think the more important issue is probably the indexing or the segment merge speed. Prefetching and caching will not be supported with the new interfaces until downloading becomes a real bottleneck.
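Put as a share of the total, the download accounts for roughly 0.3% and 1.1% of the ingestion time in the two tests. The snippet below only reproduces that arithmetic from the figures quoted above; DownloadShare and percent are names made up for the example.

```java
class DownloadShare
{
  // Share of total ingestion time spent downloading, as a percentage.
  static double percent(double downloadSeconds, double ingestionMinutes)
  {
    return 100.0 * downloadSeconds / (ingestionMinutes * 60.0);
  }

  public static void main(String[] args)
  {
    System.out.printf("first test:  %.2f%%%n", percent(4, 20));  // 4 s out of 20 min
    System.out.printf("second test: %.2f%%%n", percent(20, 30)); // 20 s out of 30 min
  }
}
```

Even if prefetching hid the download entirely, the saving would be about one percent of the run in these tests.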
FirehoseFactory
FirehoseFactory will remain for RealtimeIndexTask and AppenderatorDriverRealtimeIndexTask.
Rationale
One possible alternative would be modifying existing interfaces. I think adding new ones will be better because the new ones are pretty different from existing ones.
Operational impact
A couple of interfaces will be deprecated but kept for a couple of future releases. This means the old spec will still be respected.
ParseSpec will be deprecated and split into TimestampSpec, DimensionsSpec, and InputFormat.
DataSchema will have TimestampSpec and DimensionsSpec from the deprecated ParseSpec.
IOConfig will have InputFormat and InputSource.
FirehoseFactory will be deprecated for all batch tasks in favor of InputSource. This will not be applied to the compaction task since it doesn't have firehoseFactory in its spec.
InputRowParser will be deprecated.
Test plan (optional)
Unit tests will be added to test backward compatibility and the implementations of new interfaces.
Future work (optional)
A new method can be added to ObjectSource for more optimized data scan.