
[HUDI-306] Support Glue catalog and other hive metastore implementations #961

Merged

merged 3 commits into apache:master Nov 12, 2019

Conversation

umehrot2
Contributor

Hudi currently does not work with the AWS Glue Catalog or other Hive metastore implementations. The issue/exception it runs into has also been reported in a separate GitHub issue.

As mentioned in the issue, the reason for this is:

That is the reason the table gets created in the Glue metastore, but while reading or scanning partitions Hudi talks to the local Hive metastore, where it does not find the table.

Note: We need to remove the shading of Hive in hudi-spark-bundle by default, because otherwise we get a NoSuchMethodError at runtime: HiveConf is shaded and relocated to a new namespace, but Hive.java is not shaded, so Hive.get(conf) fails with NoSuchMethodError. We cannot shade Hive.java, since it lives in hive-exec, which is itself a huge bundle jar with numerous dependencies. A similar issue already exists in Hudi because of the shading of Hive, which we have reported here: https://issues.apache.org/jira/browse/HUDI-281 . So this PR helps fix that as well.
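For context, the kind of maven-shade-plugin relocation that produces this mismatch looks roughly like the fragment below (a sketch only; the exact patterns live in the bundle poms):

```xml
<!-- Sketch of the relocation involved: HiveConf is rewritten into a
     hudi-internal namespace when the bundle is shaded. -->
<relocation>
  <pattern>org.apache.hadoop.hive.conf.</pattern>
  <shadedPattern>org.apache.hudi.org.apache.hadoop_hive.conf.</shadedPattern>
</relocation>
```

Because hive-exec (which contains org.apache.hadoop.hive.ql.metadata.Hive) is not bundled, the real Hive.get still expects the original org.apache.hadoop.hive.conf.HiveConf in its method descriptor, while callers compiled against the shaded bundle pass the relocated class; hence the NoSuchMethodError.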

@umehrot2 umehrot2 changed the title Support Glue catalog and other hive metastore implementations [HUDI-306] Support Glue catalog and other hive metastore implementations Oct 18, 2019
@vinothchandar
Member

Changes look fine to me. CI seems to have stalled due to inactivity; wondering if Travis is acting up again. Re-kicked the test.

@umehrot2
Contributor Author

@vinothchandar Can you help out with this integration test failure? I am not sure why it is stalling.

Even on my local setup with the master branch, the tests seem to stall. Not sure how to debug this. I did ssh into the docker container but could not find any logs.

@vinothchandar
Member

@umehrot2 It passes for me all the time locally. If it fails, the docker containers may still be running locally; you can use docker/stop_demo.sh to kill them and re-run the tests if it helps. https://travis-ci.org/apache/incubator-hudi/builds used to be spotless, but we have been seeing some Travis weirdness of late.

Let me take a look at the failure itself.

@vinothchandar
Member

@umehrot2 I did not respond here since there was a mailing list thread on the same topic. It seems the jetty threads are somehow not exiting cleanly, and it is intermittent? Do you see it always stalling? Others and I are not seeing that.

I did ssh into the docker container but could not find any logs.

stdout is piped back to the test process; that's why. You can use jstack and jmap to get thread and heap dumps and see what's going on.

@umehrot2
Contributor Author

@umehrot2 I did not respond here since there was a mailing list thread on the same topic. It seems the jetty threads are somehow not exiting cleanly, and it is intermittent? Do you see it always stalling? Others and I are not seeing that.

I did ssh into the docker container but could not find any logs.

stdout is piped back to the test process; that's why. You can use jstack and jmap to get thread and heap dumps and see what's going on.

@vinothchandar sorry for the late response here; I have been busy with on-call. Yes, at least on an EC2 instance with Amazon Linux I was able to consistently reproduce the integration test hang. Is there any progress on debugging that? Willing to help with this if required.

@vinothchandar
Member

The latest on that is summarized at https://issues.apache.org/jira/browse/HUDI-312 and we are actively debugging it. Will try again today. If you can take a crack at it, sure, by all means :)

@umehrot2
Contributor Author

umehrot2 commented Nov 4, 2019

@vinothchandar now that the integration tests are fixed, can we trigger another Travis build for this?

@vinothchandar
Member

@umehrot2 I was hoping you would rebase and re-push :) That would trigger CI.

@umehrot2
Contributor Author

umehrot2 commented Nov 4, 2019

@umehrot2 I was hoping you would rebase and re-push :) That would trigger CI.

Sure. I thought a rebase was not required since this merges cleanly. Will do that.

@umehrot2
Contributor Author

umehrot2 commented Nov 4, 2019

@vinothchandar @bvaradar I would need your input here. The tests finally got past the hang, but now the integration test fails with:

###### Stderr #######

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.metadata.Hive.get(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/metadata/Hive;
	at org.apache.hudi.hive.HoodieHiveClient.<init>(HoodieHiveClient.java:111)
	at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:60)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:443)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:385)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:227)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:298)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

As you can see from the stack trace (and I did not know this until now), hudi-utilities-bundle also shades Hive. I had removed the shading of Hive from hudi-spark-bundle to unblock this PR, but not from hudi-utilities-bundle, which is why this issue surfaces through DeltaStreamer, which uses the shaded Hive. The reason it happens is that I added a call to Hive.get in this code; the Hive class lives in hive-exec, which has never been shaded in Hudi (and ideally should not be, because it is a fat jar itself), but HiveConf is shaded and relocated, so the method descriptor no longer matches.

I think we need to make a larger call here on whether it is fine to remove the Hive shading from hudi-utilities-bundle. With this change introducing Hive.get, things will break even if Hive is shaded, because hive-exec is not shaded while the other Hive modules are, which produces this NoSuchMethodError. I need your suggestions on whether we should move away from shading Hive completely.

@vinothchandar
Member

utilities-bundle is very similar to spark-bundle. For now, could we do a similar fix (controlled via a property) for utilities-bundle as well?

Longer term, I am wondering if both bundles can do away with bundling Hive at all (shaded or unshaded). We probably need to design that more carefully?
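One way to sketch the "control via a property" idea is to key the relocation off a Maven property that defaults to a no-op and is overridden by an opt-in profile. The profile id and property name below are hypothetical, not the PR's actual diff:

```xml
<!-- Default: the relocation maps the Hive package onto itself (a no-op),
     so Hive stays unshaded and external metastores such as Glue work. -->
<properties>
  <hive.shade.prefix>org.apache.hadoop.hive.</hive.shade.prefix>
</properties>

<!-- In the maven-shade-plugin configuration of hudi-utilities-bundle: -->
<relocation>
  <pattern>org.apache.hadoop.hive.</pattern>
  <shadedPattern>${hive.shade.prefix}</shadedPattern>
</relocation>

<!-- Hypothetical opt-in profile, activated with: mvn package -Pshade-hive -->
<profile>
  <id>shade-hive</id>
  <properties>
    <hive.shade.prefix>org.apache.hudi.org.apache.hadoop_hive.</hive.shade.prefix>
  </properties>
</profile>
```

With the profile inactive the "relocation" rewrites nothing, so the bundle behaves as if Hive were never shaded; activating it restores the old shaded behavior for users who want it.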

Member

@vinothchandar vinothchandar left a comment

@umehrot2 we moved the build instructions to the README. Do you mind adding an EMR build section to that file? Otherwise LGTM; will merge once you make that change.

@@ -144,40 +144,6 @@
<shadedPattern>org.apache.hudi.com.databricks.</shadedPattern>
</relocation>
<!-- TODO: Revisit GH ISSUE #533 & PR#633-->
Member

remove this line as well?

@umehrot2
Contributor Author

@umehrot2 we moved the build instructions to the README. Do you mind adding an EMR build section to that file? Otherwise LGTM; will merge once you make that change.

Hudi will be released as an officially supported application in a day or two. On that note, do we really need EMR-specific build instructions any more? From a customer's perspective, Hudi comes packaged on EMR, so they do not have to build it themselves. Thoughts?

@vinothchandar
Member

@umehrot2 More than EMR, I was wondering if we should provide some guidance on how and when to use the profiles you added. I am OK either way; let me know what you think.

@umehrot2
Contributor Author

@umehrot2 More than EMR, I was wondering if we should provide some guidance on how and when to use the profiles you added. I am OK either way; let me know what you think.

To be honest, the profiles added as part of this PR should not be enabled: if we shade Hive now, it will cause failures due to the NoSuchMethodError reported above. They exist mainly as an option, per your suggestions, for anyone who really wants to experiment with Hive shaded; @n3nash mentioned a while back that in some cases he requires Hive shaded.

@vinothchandar vinothchandar merged commit 0bb5999 into apache:master Nov 12, 2019
sumit-dp pushed a commit to Schedule1/incubator-hudi that referenced this pull request Feb 25, 2020
…ons (apache#961)

- Support Glue catalog and other metastore implementations
- Remove shading from hudi utilities bundle
- Add maven profile to optionally shade hive in utilities bundle
sumit-dp pushed a commit to Schedule1/incubator-hudi that referenced this pull request Mar 6, 2020
…st for hoodie-client module (apache#930)
