-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-18735][table] Add support for more types to DataGen source #13010
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit 60b2d39 (Tue Jul 28 17:12:39 UTC 2020) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. DetailsThe Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
JingsongLi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sjwiesman , sorry for late response.
Now data generator source can be a bounded source, you can implement a ITCase for it, like BlackHoleConnectorITCase.
| if (isRunning && generator.hasNext() && (numberOfRows == null || outputSoFar < toOutput)) { | ||
| synchronized (ctx.getCheckpointLock()) { | ||
| if (numberOfRows != null) { | ||
| outputSoFar++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an idea: print outputSoFar in close. It is good for debugging.
| if (isRunning && generator.hasNext()) { | ||
| if (isRunning && generator.hasNext() && (numberOfRows == null || outputSoFar < toOutput)) { | ||
| synchronized (ctx.getCheckpointLock()) { | ||
| if (numberOfRows != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this if?
| Set<ConfigOption<?>> options = keyContainer.getOptions(); | ||
| options.addAll(valContainer.getOptions()); | ||
|
|
||
| return DataGeneratorContainer.of( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a look to the comments of RowData. The data strutures should be:
* +--------------------------------+-----------------------------------------+
* | Row | {@link RowData} |
* +--------------------------------+-----------------------------------------+
* | ARRAY | {@link ArrayData} |
* +--------------------------------+-----------------------------------------+
* | MAP / MULTISET | {@link MapData} |
* +--------------------------------+-----------------------------------------+
|
|
||
| - Random generator is the default generator, you can specify random max and min values. For char/varchar/string, the length can be specified. It is a unbounded generator. | ||
| - Sequence generator, you can specify sequence start and end values. It is a bounded generator, when the sequence number reaches the end value, the reading ends. | ||
| Time types are always the local machines current system time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can have a table to show all types.
Display the generation strategies they support, and the required parameters?
|
@JingsongLi thanks for taking a look. I've updated my PR accordingly. |
|
@JingsongLi what do you think of the status of this PR? |
JingsongLi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am very sorry for the late response. Left some comments. Feel free to ping me...
| throw new ValidationException("Unsupported type: " + logicalType); | ||
| } | ||
|
|
||
| private interface SerializableSupplier<T> extends Supplier<T>, Serializable { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think lambda is already Serializable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its not, the source gets caught in the closure cleaner without this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant we could do something like return (Supplier<T> & Serializable) () -> {...}, but it is OK for now.
| public DataGeneratorSource(DataGenerator<T> generator, long rowsPerSecond) { | ||
| public DataGeneratorSource(DataGenerator<T> generator, String name, long rowsPerSecond, Long numberOfRows) { | ||
| this.generator = generator; | ||
| this.name = name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to add name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a remnant from an old commit, will remove.
| return mapper.apply(generator.next()); | ||
| } | ||
|
|
||
| public interface SerializableFunction<A, B> extends Function<A, B>, Serializable {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, we need this.
|
|
||
| private static final long serialVersionUID = 1L; | ||
|
|
||
| private final DataGenerator[] fieldGenerators; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add ? to DataGenerator like DataGenerator<?>[]? At least, The compiler will have no warning.
|
|
||
| import org.junit.Test; | ||
|
|
||
| public class DataGeneratorConnectorITCase extends StreamingTestBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use BatchTestBase.
|
|
||
| @Override | ||
| public ScanRuntimeProvider getScanRuntimeProvider(ScanContext context) { | ||
| boolean isBounded = numberOfRows == null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the opposite?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch
| public void testTypes() { | ||
| tEnv().executeSql(TABLE); | ||
| tEnv().executeSql(SINK); | ||
| tEnv().from("datagen_t").executeInsert("sink"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should use: Lists.newArrayList(tEnv().executeSql("select * from datagen_t").collect()).
In this way, the types can be verified.
| @Override | ||
| public DataGeneratorContainer visit(DecimalType decimalType) { | ||
| return DataGeneratorContainer.of( | ||
| SequenceGenerator.bigDecimalGenerator( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the structures of Decimal, Times are wrong, you can take a look to the comments of RowData.
|
@JingsongLi thank you for the thorough. Review. I've corrected the types and updated the test to validate them. If you don't have any other comments I'll merge on green. |
JingsongLi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for update, Looks good to me except time type generator, and left some minor comments.
| private final long rowsPerSecond; | ||
|
|
||
| @Nullable | ||
| private Long numberOfRows; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
| @Nullable | ||
| private Long numberOfRows; | ||
|
|
||
| private int outputSoFar; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
transient
|
|
||
| private int outputSoFar; | ||
|
|
||
| private int toOutput; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
transient
|
|
||
| @Override | ||
| public DataGeneratorContainer visit(TimeType timeType) { | ||
| return DataGeneratorContainer.of(TimeGenerator.of(() -> LocalTime.now().get(MILLI_OF_SECOND))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be MILLI_OF_DAY?
| }; | ||
| } | ||
|
|
||
| private static class BigDecimalRandomGenerator implements DataGenerator<DecimalData> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can be named DecimalDataRandomGenerator
BigDecimal makes confused
|
Sorry for the long back and forth, just pushed a fix addressing your comments. |
What is the purpose of the change
Updates the DataGen source to support most types. This enables using DataGen in conjunction with Flink's LIKE clause as it currently does not support overwriting physical columns with computed columns.
Brief change log
I refactored the table factory and added support for all types except RAW and STRUCTURED. See commit messages for a full list of changes. I also added support for creating bounded tables which is useful when prototyping a query.
Verifying this change
New unit tests
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)Documentation