
Improving out of box experience for data source #295

Merged
merged 2 commits into apache:master on Jun 11, 2018

Conversation

vinothchandar
Member

  • Fixes Default Configuration Changes to Hoodie to process large upserts #246
  • Bump up default parallelism to 1500, to handle large upserts
  • Add docs on S3 configuration & tuning tips with tested Spark knobs
  • Fix bug to not duplicate hoodie metadata fields when input dataframe is another hoodie dataset
  • Improve speed of ROTablePathFilter by removing directory check
  • Move to spark-avro 4.0 to handle issue with nested fields with same name
  • Keep AvroConversionUtils in sync with spark-avro 4.0

@vinothchandar vinothchandar self-assigned this Jan 5, 2018
@vinothchandar
Member Author

Tested this with up to 400GB of input, shuffling up to 1TB of intermediate data.

@vinothchandar changed the title from "Improving out of box experience for data source" to "(WIP) Improving out of box experience for data source" on Mar 26, 2018
Contributor

@bvaradar left a comment


Looks great. Will test with schema involving all avro data-types.

```diff
@@ -42,7 +42,7 @@
   private static final String BASE_PATH_PROP = "hoodie.base.path";
   private static final String AVRO_SCHEMA = "hoodie.avro.schema";
   public static final String TABLE_NAME = "hoodie.table.name";
-  private static final String DEFAULT_PARALLELISM = "200";
+  private static final String DEFAULT_PARALLELISM = "1500";
```
Contributor

Is this intentional?

Member Author

Yes. 200 is too small: Spark partitions need to stay under 2GB, and we'd like to support 500GB upserts in a stable way out of the box.
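A quick back-of-the-envelope check (a sketch; the 500GB figure is from the comment above, and an even shuffle distribution is assumed) shows why 200 partitions is too small:

```python
GB = 1024 ** 3
upsert_bytes = 500 * GB  # target out-of-box upsert size from the comment above

# Per-partition shuffle size at the old and new default parallelism.
old_default = upsert_bytes / 200   # 2.5 GB -> exceeds Spark's ~2GB partition limit
new_default = upsert_bytes / 1500  # ~0.33 GB -> comfortably under the limit

print(old_default / GB)            # 2.5
print(round(new_default / GB, 2))  # 0.33
```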

@vinothchandar
Member Author

@n3nash do you see any issues (esp. the config default changes) with merging this?


```
spark.kryoserializer.buffer.max 512m
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.shuffle.memoryFraction 0.2
```
Member Author
This is deprecated; remove it.

```
spark.shuffle.memoryFraction 0.2
spark.shuffle.service.enabled true
spark.sql.hive.convertMetastoreParquet false
spark.storage.memoryFraction 0.6
```
Member Author
Same here; remove it.
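For context on the two removals above: `spark.shuffle.memoryFraction` and `spark.storage.memoryFraction` were deprecated in Spark 1.6 when the unified memory manager landed; the corresponding modern knobs (Spark's default values shown, purely for reference) are:

```
spark.memory.fraction 0.6
spark.memory.storageFraction 0.5
```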

Member Author

@n3nash you may want to double-check the configs to see if we are still setting old props.

Also anything you can add here for reliable spark configs would be appreciated.


```
spark.driver.extraClassPath /etc/hive/conf
spark.driver.extraJavaOptions -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
```

Member Author
Include GC tuning knobs.
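A sketch of what such GC tuning knobs might look like alongside the logging flags above (illustrative CMS settings, not taken from this PR):

```
spark.executor.extraJavaOptions -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
```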

 - Fixes apache#246
 - Bump up default parallelism to 1500, to handle large upserts
 - Add docs on S3 configuration & tuning tips with tested Spark knobs
 - Fix bug to not duplicate hoodie metadata fields when input dataframe is another hoodie dataset
 - Improve speed of ROTablePathFilter by removing directory check
 - Move to spark-avro 4.0 to handle issue with nested fields with same name
 - Keep AvroConversionUtils in sync with spark-avro 4.0
@CLAassistant

CLAassistant commented May 24, 2018

CLA assistant check
All committers have signed the CLA.

 - hoodie-hadoop-mr now needs objectsize bundled
 - Also updated docs with additional tuning tips
@vinothchandar changed the title from "(WIP) Improving out of box experience for data source" to "Improving out of box experience for data source" on Jun 11, 2018
@vinothchandar vinothchandar merged commit 8f1d362 into apache:master Jun 11, 2018
vinishjail97 pushed a commit to vinishjail97/hudi that referenced this pull request on Dec 15, 2023

3 participants