[IOTDB-560] add TSRecordOutputFormat to write TsFile via Flink DataSet/DataStream API. #1084
Conversation
Thanks for the PR @WeiZhong94!
@WeiZhong94 Could you please rebase the PR? I would like to have a look at it then.
sunjincheng121 left a comment:
Thanks for the PR @WeiZhong94 !
Overall it's pretty good! I only left a few comments in the PR and one as follows:
Though the "OutputFormat" can be used in the DataStream API, it would be better if we supported writing TsFile via "StreamingFileSink", which is integrated with the checkpointing mechanism to provide exactly-once semantics. Of course, we could do this in a follow-up PR.
What do you think?
import java.util.stream.Collectors;

/**
 * The example of writing TsFile via Flink DataSet API.
writing TsFile -> writing to TsFile?
import java.util.stream.Collectors;

/**
 * The example of writing TsFile via Flink DataStream API.
writing TsFile -> writing to TsFile?
flink-tsfile-connector/README.md (Outdated)
}
```

### TSRecordOutputFormat Example
Example of TSRecordOutputFormat?
import java.io.IOException;
import java.io.Serializable;

public interface TSRecordConverter<T> extends Serializable {
Would it be better to add JavaDoc?
void open(Schema schema) throws IOException;

void covertAndCollect(T input, Collector<TSRecord> collector) throws IOException;
Add JavaDoc? Add a semantic description of this method.
void open(Schema schema) throws IOException;

void covertAndCollect(T input, Collector<TSRecord> collector) throws IOException;
Regarding the method name covertAndCollect, I thought about it again and its semantics are not very clear. In TSRecordConverter the main goal of covertAndCollect is to convert the T to a TSRecord. So I would like to change the name from covertAndCollect to convert, which makes the semantics clearer. What do you think?
BTW: typo covert -> convert
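A minimal sketch of what the renamed interface might look like with JavaDoc, as suggested in the review. The `Schema`, `TSRecord`, and `Collector` types below are simplified stand-ins for the real IoTDB/Flink classes, used only to keep the sketch self-contained:

```java
import java.io.IOException;
import java.io.Serializable;

// NOTE: Schema, TSRecord and Collector are simplified stand-ins for the real
// org.apache.iotdb / Flink classes, used only to keep this sketch compilable.
class Schema { }

class TSRecord {
    final long time;
    final String deviceId;

    TSRecord(long time, String deviceId) {
        this.time = time;
        this.deviceId = deviceId;
    }
}

interface Collector<R> {
    void collect(R record);
}

/** Converts elements of type T into TSRecords for writing to a TsFile. */
interface TSRecordConverter<T> extends Serializable {

    /** Called once before any conversion; receives the TsFile schema. */
    void open(Schema schema) throws IOException;

    /**
     * Converts one input element and emits the resulting TSRecord(s)
     * through the collector ("convert", as proposed in the review,
     * instead of "covertAndCollect").
     */
    void convert(T input, Collector<TSRecord> collector) throws IOException;
}
```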
public TSRecordConverter<T> getConverter() {
    return converter;
}
} (no newline at end of file)
Please add a newline at the end of the file.
import java.net.URISyntaxException;
import java.util.Optional;

public abstract class TsFileOutputFormat<T> extends FileOutputFormat<T> {
flink-tsfile-connector/README.md (Outdated)
Types.FLOAT,
Types.INT,
Types.INT,
Types.FLOAT,
Types.INT,
Types.INT
DataPoint templateDataPoint = templateRecord.dataPointList.get(dataPointIndexMapping[i]);
Object o = input.getField(i);
if (o != null) {
    Class typeClass = o.getClass();
templateDataPoint.type could be used to switch on the data type.
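The reviewer's suggestion — dispatching on the declared data type rather than inspecting `o.getClass()` — might look roughly like this. The `TSDataType` enum here is a simplified stand-in for IoTDB's real enum, and `FieldCaster` is a hypothetical helper, not code from the PR:

```java
// NOTE: TSDataType is a simplified stand-in for IoTDB's data-type enum, and
// FieldCaster is a hypothetical helper illustrating the reviewer's suggestion
// of switching on the declared type instead of calling value.getClass().
enum TSDataType { INT32, INT64, FLOAT, DOUBLE, BOOLEAN, TEXT }

final class FieldCaster {

    /** Casts a raw field value according to the data type declared on the
     *  template data point. */
    static Object cast(TSDataType type, Object value) {
        if (value == null) {
            return null; // null fields are skipped, as in the original code
        }
        switch (type) {
            case INT32:   return ((Number) value).intValue();
            case INT64:   return ((Number) value).longValue();
            case FLOAT:   return ((Number) value).floatValue();
            case DOUBLE:  return ((Number) value).doubleValue();
            case BOOLEAN: return (Boolean) value;
            case TEXT:    return value.toString();
            default:
                throw new IllegalArgumentException("Unknown type: " + type);
        }
    }
}
```

This avoids a chain of `instanceof`/class-equality checks and keeps the conversion aligned with the schema rather than with whatever runtime type the input happens to carry.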
@sunjincheng121 @qiaojialin Thanks for your review! Sorry for the late reply; I have been really busy with work recently :(. I have addressed your comments, please take a look.
Yes, "StreamingFileSink" is much better than "OutputFormat" for streaming jobs. To support writing TsFile via "StreamingFileSink", we need to implement the Flink interface "org.apache.flink.api.common.serialization.BulkWriter". The current blocker is that the BulkWriter wraps the output target as an "FSDataOutputStream" object. It is possible to write TsFile data via "FSDataOutputStream", but this needs further discussion as "FSDataOutputStream" does not support the "truncate" method, which is required by the "TsFileOutput" interface.
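One way around the missing truncate() is to buffer and write only on finish(), so the stream is append-only. The sketch below illustrates that idea with a simplified stand-in for Flink's `BulkWriter` interface; it is not a real TsFile encoder, just the buffering pattern:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// NOTE: this interface is a simplified stand-in mirroring the shape of Flink's
// org.apache.flink.api.common.serialization.BulkWriter, kept local so the
// sketch compiles without a Flink dependency.
interface BulkWriter<T> {
    void addElement(T element) throws IOException;
    void flush() throws IOException;
    void finish() throws IOException;
}

/**
 * Buffers incoming elements and writes everything on finish(), so the
 * underlying stream is only ever appended to -- one way to cope with
 * FSDataOutputStream not supporting truncate().
 */
class BufferingTsFileWriterSketch implements BulkWriter<String> {
    private final OutputStream out;
    private final StringBuilder buffer = new StringBuilder();

    BufferingTsFileWriterSketch(OutputStream out) {
        this.out = out;
    }

    @Override
    public void addElement(String element) {
        // A real implementation would encode a TSRecord into TsFile chunks.
        buffer.append(element).append('\n');
    }

    @Override
    public void flush() {
        // TsFile chunks are finalized in finish(); nothing to flush mid-stream.
    }

    @Override
    public void finish() throws IOException {
        out.write(buffer.toString().getBytes(StandardCharsets.UTF_8));
        out.flush();
    }
}
```

The trade-off is memory: the whole part file is held until finish(), which is acceptable for checkpoint-sized rolling files but would need chunked spilling for very large parts.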
Hi @WeiZhong94 I am fine with the current approach.
qiaojialin left a comment:
Hi, thanks. There are only two data type errors in the README. As for the truncate method in TsFileOutput, I think it is OK not to support it; it is only used when restarting the IoTDB server.
@sunjincheng121 @qiaojialin Thanks for your reply! I have corrected the README.md. It seems the Travis test failure is not related to this PR, so I just rebased the PR to re-trigger the tests; I hope this works.
TsFile is a columnar storage file format in Apache IoTDB. It is designed for time series data, supports efficient compression and querying, and is easy to integrate into big data processing frameworks.
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams, and it is becoming more and more popular in IoT scenarios. So it would be great to integrate IoTDB and Flink.
This pull request adds a TSRecordOutputFormat to support writing TsFiles in Flink via the DataStream/DataSet API.
More details can be found in the discussion thread [1] or in the IoTDB wiki [2].
[1]https://lists.apache.org/thread.html/r6dd6afe4e8e4ca42e3ddfbc80609597788f90b214e7a81788c3b51b3%40%3Cdev.iotdb.apache.org%3E
[2]https://cwiki.apache.org/confluence/display/IOTDB/%5BImprovement+Proposal%5D+Add+Flink+Connector+Support+for+TsFile