
[FLINK-13192][hive] Add tests for different Hive table formats #9264

Closed
lirui-apache wants to merge 5 commits

Conversation

lirui-apache
Contributor

What is the purpose of the change

Add tests for different table storage formats and fix an issue with HiveTableOutputFormat.

Brief change log

  • Make sure to use a common base class of the SerDe to support both Hive 2.3.4 and 1.2.1.
  • Include some Hadoop dependencies in tests, since the Hive runner needs to run MR jobs.
  • Add tests for different table formats.

Verifying this change

Added new test cases.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): yes
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? NA

@lirui-apache
Contributor Author

cc @KurtYoung @xuefuz @zjuwangg

@flinkbot
Collaborator

flinkbot commented Jul 29, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit e1dd937 (Tue Aug 06 16:01:34 UTC 2019)

Warnings:

  • 1 pom.xml file was touched: Check for build and licensing issues.
  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Jul 29, 2019

CI report:

@xuefuz xuefuz left a comment


Thanks for the contribution. The PR looks good. I just have a couple of minor comments.

@@ -256,10 +258,12 @@ public void configure(Configuration parameters) {
 	public void open(int taskNumber, int numTasks) throws IOException {
 		try {
 			StorageDescriptor sd = hiveTablePartition.getStorageDescriptor();
-			serializer = (AbstractSerDe) Class.forName(sd.getSerdeInfo().getSerializationLib()).newInstance();
+			serializer = (Serializer) Class.forName(sd.getSerdeInfo().getSerializationLib()).newInstance();
 			Preconditions.checkArgument(serializer instanceof Deserializer,
Contributor

Not sure if I understand this. The Serializer and Deserializer interfaces are independent. While a SerDe class may implement both, it seems weird to name a variable "serializer" and later cast it to the Deserializer type.

Contributor Author

The problem is that SerDeUtils.initializeSerDe requires a Deserializer, so we have to do the cast if we want to reuse this util method. Since most, if not all, SerDe libs implement both Serializer and Deserializer, I suppose this cast is OK?
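As a self-contained sketch of why the cast is usually safe — using hypothetical stand-in interfaces, not Hive's actual org.apache.hadoop.hive.serde2 types — a SerDe class typically implements both roles, so an instance created by class name can be used on either path:

```java
// Stand-ins for Hive's Serializer/Deserializer interfaces (hypothetical).
interface Serializer { String serialize(Object record); }
interface Deserializer { Object deserialize(String data); }

// Hypothetical SerDe that, like most Hive SerDe libs, implements both.
class EchoSerDe implements Serializer, Deserializer {
    @Override public String serialize(Object record) { return String.valueOf(record); }
    @Override public Object deserialize(String data) { return data; }
}

public class SerDeCastDemo {
    public static void main(String[] args) throws Exception {
        // Instantiate by class name, mirroring how HiveTableOutputFormat
        // loads sd.getSerdeInfo().getSerializationLib().
        Object serDe = Class.forName("EchoSerDe").getDeclaredConstructor().newInstance();

        // Cast to Serializer for the write path...
        Serializer serializer = (Serializer) serDe;

        // ...and verify it also implements Deserializer, so it can be handed
        // to a utility that expects one (cf. SerDeUtils.initializeSerDe).
        if (!(serDe instanceof Deserializer)) {
            throw new IllegalArgumentException("SerDe must also implement Deserializer");
        }
        System.out.println(serializer.serialize(42)); // prints "42"
    }
}
```

The instanceof check fails fast with a clear message for the rare SerDe that only implements one of the two interfaces.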

Contributor

Casting is fine, but can we name the variable differently?

Contributor Author

Any suggestions about the name? Like serDe?

Contributor

yeah! :)

@lirui-apache
Contributor Author

Updated to add a test for CSV tables. Also found that the Hive table schema can be obtained from either the metastore or the SerDe. For CSV tables, we should get the schema from the SerDe, but currently HiveCatalog doesn't support that. Hence the changes to HiveCatalog.
@xuefuz please have another look. Thanks.
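The fallback described above can be sketched as follows. This is a hypothetical simulation, not HiveCatalog's actual code: the class and helper names are made up, and the real SerDe-side lookup in Hive goes through the deserializer's object inspector (cf. MetaStoreUtils.getFieldsFromDeserializer):

```java
import java.util.List;

// Hypothetical sketch: prefer the column list stored in the metastore, but
// fall back to the SerDe for tables (e.g. CSV) whose metastore columns are empty.
public class SchemaSourceDemo {
    // Stand-in for reading table columns from the metastore; CSV tables
    // may not carry usable column info there.
    static List<String> columnsFromMetastore(boolean csvTable) {
        return csvTable ? List.of() : List.of("id", "name");
    }

    // Stand-in for deriving columns from the SerDe's object inspector.
    static List<String> columnsFromSerDe() {
        return List.of("col_0", "col_1");
    }

    // Use metastore columns when present, otherwise ask the SerDe.
    static List<String> tableSchema(boolean csvTable) {
        List<String> cols = columnsFromMetastore(csvTable);
        return cols.isEmpty() ? columnsFromSerDe() : cols;
    }

    public static void main(String[] args) {
        System.out.println(tableSchema(false)); // prints "[id, name]"
        System.out.println(tableSchema(true));  // prints "[col_0, col_1]"
    }
}
```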

@lirui-apache lirui-apache force-pushed the FLINK-13192 branch 3 times, most recently from ca8af79 to a20e191 Compare August 2, 2019 02:55
@lirui-apache
Contributor Author

Latest travis build succeeded. @xuefuz do you have any further comments?

@@ -124,7 +124,7 @@
 	private transient int numNonPartitionColumns;

 	// SerDe in Hive-1.2.1 and Hive-2.3.4 can be of different classes, make sure to use a common base class
-	private transient Serializer serializer;
+	private transient Serializer recordSerDe;
Contributor

Maybe the type here should be just "Object".

Contributor Author

It has to be a Serializer because we need it to serialize records. Besides, using Object means we'd have to use reflection to call the serialize method, and doing that for each record might hurt performance.
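The two options can be contrasted with a minimal, self-contained example (using a hypothetical stand-in interface, not Hive's Serializer): a typed field costs one virtual call per record, while an Object-typed field forces a reflective Method.invoke on the per-record path.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a serializer interface.
interface Serializer { String serialize(Object record); }

class SimpleSerializer implements Serializer {
    @Override public String serialize(Object record) { return String.valueOf(record); }
}

public class ReflectionVsDirect {
    public static void main(String[] args) throws Exception {
        Object serDe = new SimpleSerializer();

        // Typed field: one virtual call per record, easily inlined by the JIT.
        Serializer typed = (Serializer) serDe;
        String direct = typed.serialize(1);

        // Object-typed field: every record pays for Method.invoke --
        // argument boxing and access checks, with no inlining guarantees.
        Method m = serDe.getClass().getMethod("serialize", Object.class);
        String reflective = (String) m.invoke(serDe, 1);

        System.out.println(direct.equals(reflective)); // prints "true"
    }
}
```

Both calls produce the same result; the difference is purely per-record overhead, which is why a typed field is preferable on a hot write path.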


@xuefuz xuefuz left a comment


PR looks good. I just have a couple of minor comments.

@lirui-apache
Contributor Author

@xuefuz thanks for the review. PR updated to address your comments.


@xuefuz xuefuz left a comment


LGTM

@lirui-apache
Contributor Author

@KurtYoung could you help review and merge this PR? Thanks.

@KurtYoung
Contributor

Sure, I will take a look soon.

KurtYoung pushed a commit to KurtYoung/flink that referenced this pull request Aug 6, 2019

@KurtYoung KurtYoung left a comment


LGTM, merging this.

@KurtYoung KurtYoung closed this in 24078de Aug 6, 2019
@lirui-apache lirui-apache deleted the FLINK-13192 branch August 6, 2019 03:41
becketqin pushed a commit to becketqin/flink that referenced this pull request Aug 17, 2019
becketqin pushed a commit to becketqin/flink that referenced this pull request Aug 19, 2019