Removing randomization from UpsertPartitioner #253
Conversation
@@ -379,7 +359,7 @@ public int getPartition(Object key) {
     List<InsertBucket> targetBuckets = partitionPathToInsertBuckets.get(keyLocation._1().getPartitionPath());
     // pick the target bucket to use based on the weights.
     double totalWeight = 0.0;
-    double r = rand.nextDouble();
+    double r = Math.floorMod(keyLocation._1().getRecordKey().hashCode(), MOD_BASE) / (MOD_BASE-1.0);
Not sure if this will give us a double uniformly distributed in range 0.0-1.0 .. Thoughts?
scala> Math.floorMod("dd".hashCode, 1015073) / (1015073-1.0);
res0: Double = 0.0031524857350020493
scala> Math.floorMod("cc".hashCode, 1015073) / (1015073-1.0);
res1: Double = 0.0031209608776520286
scala> Math.floorMod("cdjf".hashCode, 1015073) / (1015073-1.0);
res2: Double = 0.0035248731124491663
scala> Math.floorMod("jdha;fhfjs".hashCode, 1015073) / (1015073-1.0);
res3: Double = 0.6503627328898837
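The clustering seen in the REPL output above can be reproduced in plain Java. This is a small sketch, where `MOD_BASE` is the 7-digit prime from the snippet and `toUnit` is a hypothetical helper mirroring the expression under review:

```java
public class HashCodeSpread {
    // The 7-digit prime used in the expression under review.
    static final int MOD_BASE = 1015073;

    // Normalize a key's hashCode into [0.0, 1.0], mirroring
    // Math.floorMod(hashCode, MOD_BASE) / (MOD_BASE - 1.0).
    static double toUnit(String key) {
        return Math.floorMod(key.hashCode(), MOD_BASE) / (MOD_BASE - 1.0);
    }

    public static void main(String[] args) {
        // Short keys have small hashCodes relative to the 7-digit modulus,
        // so they all land near 0.0 instead of spreading across [0.0, 1.0].
        System.out.println(toUnit("dd"));   // ~0.00315
        System.out.println(toUnit("cc"));   // ~0.00312
        System.out.println(toUnit("cdjf")); // ~0.00352
    }
}
```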
Seems to be a function of the string length?
hashCode returns an int, so MOD_BASE has to be an int for a fair distribution (it can't be total_number_of_records, which is a long). Yeah, the hashCode value is somewhat related to the length of the string. Using a lower value for MOD_BASE (a 3- or 4-digit prime) would reduce this problem. Also, having all keys 1 or 2 characters long is very unlikely, right? The number of combinations you get is very low. Should we replace the 7-digit prime with a 3/4/5-digit prime? What do you think?
We can assume .hashCode is reasonably uniform, I guess, or use MurmurHash from Guava.. that seems more standard instead of dealing with prime numbers and so forth.. wdyt?
I am just concerned that if it's not uniformly distributed, we will generate skew by sending lots of records to a few ranges in 0.0-1.0, which would in turn map to only a few buckets.. In this case, if .hashCode() maps to ranges depending on key length, then it seems like a misfit.. Doing a MurmurHash (or MD5) seems like a better option..
Then a simple (1.0 * (hashValue % totalInsertsForPartition)) / totalInsertsForPartition
will yield a uniform value between 0.0-1.0?
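Assuming the hash itself is uniform, the normalization proposed above can be sketched as follows. This is a hypothetical helper, not the PR's code; note that Java's `%` can return negative values for negative hashes, so `Math.floorMod` is used here to keep the result in range:

```java
public class UniformBucketValue {
    // Map a stable 64-bit hash into [0.0, 1.0) using the partition's
    // insert count as the modulus, per the formula discussed above.
    static double toUnit(long hashValue, long totalInsertsForPartition) {
        // floorMod keeps the remainder non-negative even for negative
        // hashes, which a plain % would not.
        return (1.0 * Math.floorMod(hashValue, totalInsertsForPartition))
                / totalInsertsForPartition;
    }

    public static void main(String[] args) {
        System.out.println(toUnit(42L, 1000L)); // 0.042
        System.out.println(toUnit(-7L, 1000L)); // 0.993 (floorMod(-7, 1000) = 993)
    }
}
```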
Md5 hash makes sense. Updated it to use this. The hash values also look fairly distributed.
Here is the output for the hash values:
- "key" : "hash_value" : "time in nanoseconds"
- a : 6289574019528802036 : 16492818
- aa : 9205979493955281985 : 33314
- aaa : 5232998391707188295 : 27652
- aaaa : 3170461272618321804 : 29583
- aaaaa : 4125589970280795993 : 38808
- aaaaaa : 3221507088467735029 : 36931
- aaaaaaa : 5198010149255477597 : 30166
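One way to derive such a stable long is to fold the MD5 digest of the record key. This is only an illustrative sketch, not the actual change in the PR; the folding scheme (taking the first 8 digest bytes) is an assumption, so its values will differ from the table above:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5KeyHash {
    // Hypothetical helper: derive a stable long from a record key via MD5.
    static long md5Hash(String recordKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(recordKey.getBytes(StandardCharsets.UTF_8));
            // Fold the first 8 bytes of the 16-byte digest into a long.
            long h = 0L;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFFL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Same key always yields the same value, unlike rand.nextDouble().
        System.out.println(md5Hash("a"));
        System.out.println(md5Hash("aa"));
    }
}
```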
We may need to use the total number of inserts as the MOD_BASE.. ?
Left a suggestion around making this uniformly distributed
Force-pushed from 45145a8 to 25cbf56
@@ -379,7 +359,9 @@ public int getPartition(Object key) {
     List<InsertBucket> targetBuckets = partitionPathToInsertBuckets.get(keyLocation._1().getPartitionPath());
     // pick the target bucket to use based on the weights.
     double totalWeight = 0.0;
-    double r = rand.nextDouble();
+    final long totalInserts = Math.max(1, globalStat.getNumInserts());
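For reference, the deterministic value computed from the record key feeds the existing cumulative-weight bucket selection in getPartition. A minimal standalone sketch of that walk follows; the bucket numbers and weights are made up for illustration, and the real code iterates InsertBucket objects:

```java
import java.util.Arrays;
import java.util.List;

public class WeightedBucketPick {
    // Pick the target bucket based on weights: the first bucket whose
    // cumulative weight reaches r wins. Each entry is {bucketNumber, weight}.
    static int pickBucket(List<double[]> buckets, double r) {
        double totalWeight = 0.0;
        for (double[] b : buckets) {
            totalWeight += b[1];
            if (r <= totalWeight) {
                return (int) b[0];
            }
        }
        // Fall back to the last bucket for r at the upper edge.
        return (int) buckets.get(buckets.size() - 1)[0];
    }

    public static void main(String[] args) {
        List<double[]> buckets = Arrays.asList(
                new double[]{0, 0.25}, new double[]{1, 0.50}, new double[]{2, 0.25});
        System.out.println(pickBucket(buckets, 0.10)); // 0
        System.out.println(pickBucket(buckets, 0.60)); // 1
    }
}
```

Since r is now a pure function of the record key, the same key always walks to the same bucket, which is the point of removing the randomization.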
Should this be getWorkloadStat(partitionPath).getNumInserts() and not the global one? .. guess it does not matter?
Changes look good. Have you been able to test on a real dataset for any regressions on skew? Just calling this out, since this is a pretty critical change.
Force-pushed from 25cbf56 to 968da92
…ically assigned to output partitions
Force-pushed from 968da92 to 5aaf353
@ovj Could you take a look at the failing tests?
Hi @vinothchandar, those tests are flaky. If I run them locally they pass: "com.uber.hoodie.TestMergeOnReadTable".
Seems related to #243 .. restarted the build
@ovj I'd like to merge this in, plus the other PR on plugging in the partitioner, to cut a new release.. Can you update the PRs with the follow-ups we discussed offline? Thanks!
@vinothchandar Everything looks good with the testing results. I tested this fix to make sure it handles small-file growth correctly, and there is no noticeable difference from the existing approach. Also, for new inserts the data is getting distributed fairly evenly.
Thanks @vinothchandar. Sure, let me know.
Fixing UpsertPartitioner to ensure that input records are deterministically assigned to output partitions
@vinothchandar FYI I have separated this change from custom partitioner.