
Improving out of box experience for data source #295

Merged
merged 2 commits into apache:master on Jun 11, 2018

Conversation

vinothchandar
Member

  • Fixes Default Configuration Changes to Hoodie to process large upserts #246
  • Bump up default parallelism to 1500, to handle large upserts
  • Add docs on S3 configuration & tuning tips with tested Spark knobs
  • Fix bug to not duplicate hoodie metadata fields when input dataframe is another hoodie dataset
  • Improve speed of ROTablePathFilter by removing directory check
  • Move to spark-avro 4.0 to handle issue with nested fields with same name
  • Keep AvroConversionUtils in sync with spark-avro 4.0

@vinothchandar vinothchandar self-assigned this Jan 5, 2018
@vinothchandar
Member Author

Tested this with up to 400GB of input, shuffling up to 1TB of intermediate data.

@vinothchandar changed the title from "Improving out of box experience for data source" to "(WIP) Improving out of box experience for data source" on Mar 26, 2018
Contributor

@bvaradar left a comment


Looks great. Will test with schema involving all avro data-types.

```diff
@@ -42,7 +42,7 @@
   private static final String BASE_PATH_PROP = "hoodie.base.path";
   private static final String AVRO_SCHEMA = "hoodie.avro.schema";
   public static final String TABLE_NAME = "hoodie.table.name";
-  private static final String DEFAULT_PARALLELISM = "200";
+  private static final String DEFAULT_PARALLELISM = "1500";
```
Contributor

Is this intentional?

Member Author

Yes. 200 is too small: Spark partitions need to stay under 2GB, and we'd like to support 500GB upserts in a stable way out of the box.
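A quick back-of-the-envelope check (a sketch; the 500GB figure is from the comment above, and an even shuffle distribution is assumed) shows why 200 partitions is too small:

```python
GB = 1024 ** 3
upsert_bytes = 500 * GB  # target out-of-box upsert size from the comment above

# Per-partition shuffle size at the old and new default parallelism.
old_default = upsert_bytes / 200   # 2.5 GB -> exceeds Spark's ~2GB partition limit
new_default = upsert_bytes / 1500  # ~0.33 GB -> comfortably under the limit

print(old_default / GB)            # 2.5
print(round(new_default / GB, 2))  # 0.33
```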

@vinothchandar
Member Author

@n3nash do you see any issues (esp. the config default changes) with merging this?


```
spark.kryoserializer.buffer.max 512m
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.shuffle.memoryFraction 0.2
```
Member Author
This is deprecated; remove it.

```
spark.shuffle.memoryFraction 0.2
spark.shuffle.service.enabled true
spark.sql.hive.convertMetastoreParquet false
spark.storage.memoryFraction 0.6
```
Member Author
Same here; remove it.
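For context on the two removals above: `spark.shuffle.memoryFraction` and `spark.storage.memoryFraction` were deprecated in Spark 1.6 when the unified memory manager landed; the corresponding modern knobs (Spark's default values shown, purely for reference) are:

```
spark.memory.fraction 0.6
spark.memory.storageFraction 0.5
```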

Member Author

@n3nash you may want to double-check the configs to see if we are still setting old props.

Also anything you can add here for reliable spark configs would be appreciated.


```
spark.driver.extraClassPath /etc/hive/conf
spark.driver.extraJavaOptions -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
```

Member Author
Include GC tuning knobs.
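A sketch of what such GC tuning knobs might look like alongside the logging flags above (illustrative CMS settings, not taken from this PR):

```
spark.executor.extraJavaOptions -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
```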

 - Fixes apache#246
 - Bump up default parallelism to 1500, to handle large upserts
 - Add docs on S3 configuration & tuning tips with tested Spark knobs
 - Fix bug to not duplicate hoodie metadata fields when input dataframe is another hoodie dataset
 - Improve speed of ROTablePathFilter by removing directory check
 - Move to spark-avro 4.0 to handle issue with nested fields with same name
 - Keep AvroConversionUtils in sync with spark-avro 4.0
@CLAassistant

CLAassistant commented May 24, 2018

CLA assistant check
All committers have signed the CLA.

 - hoodie-hadoop-mr now needs objectsize bundled
 - Also updated docs with additional tuning tips
@vinothchandar changed the title from "(WIP) Improving out of box experience for data source" to "Improving out of box experience for data source" on Jun 11, 2018
@vinothchandar vinothchandar merged commit 8f1d362 into apache:master Jun 11, 2018
vinishjail97 pushed a commit to vinishjail97/hudi that referenced this pull request on Dec 15, 2023

3 participants