Added support for Disk Spillable Compaction to prevent OOM issues #289
vinothchandar merged 1 commit into apache:master from
Conversation
Will address the indentation introduced in this diff.
Force-pushed c56657a to cada120
@vinothchandar please take a look when you get a chance.

Will do.. @n3nash, can you remove the "WIP" prefix from PRs that are ready for final review?

I will try to get to the reviews before the Monday meeting next week.
Force-pushed 550e66a to 3d2dc25
Rename to hoodie.compaction.spill.threshold?
Rename to SpillableMapTestUtils.
Let's pick this up from a jobConf property.. it can default to 0.75 * freeMemory(). Without that, it will fill up the heap and potentially OOM with fragmentation issues.
Yeah agreed, I just put it there to start running tests. There are 3 properties we could use: 1) https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapred/JobConf.html#getMemoryForMapTask() 2) https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapred/JobConf.html#getMemoryForReduceTask() 3) https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapred/JobConf.html#MAPRED_TASK_MAXVMEM_PROPERTY (deprecated).
I think getMemoryForMapTask() makes the most sense, but it could be that a query to this table is started in a reduce stage of a previous query (by default, reduce memory = map task memory).
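The "conf value if present, else 0.75 * freeMemory()" fallback could look roughly like this. This is only a sketch: java.util.Properties stands in for the Hadoop JobConf, and the property name is purely illustrative, not an actual Hudi/Hadoop key.

```java
import java.util.Properties;

public class SpillMemoryBudget {
    // Illustrative property name; not an actual Hudi/Hadoop config key.
    static final String PROP = "hoodie.compaction.spillable.max.memory.bytes";

    /** Memory budget for the spillable map: the configured value if present,
     *  otherwise 0.75 * currently free heap. */
    static long maxMemoryBytes(Properties conf) {
        String configured = conf.getProperty(PROP);
        if (configured != null) {
            return Long.parseLong(configured);
        }
        return (long) (0.75 * Runtime.getRuntime().freeMemory());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // No property set: falls back to a fraction of free heap.
        System.out.println(maxMemoryBytes(conf) > 0);

        conf.setProperty(PROP, "1048576");
        System.out.println(maxMemoryBytes(conf)); // prints 1048576
    }
}
```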
I think we should simplify this more by assuming that a HoodieAvroDataBlock can be held in memory, i.e. during write we batch and create multiple data blocks.
You mean just use the old implementation of Maps.newHashMap()? The problem is we only write 1 data block during a write batch, so its size can be fairly large, given we have the other records map which holds all compacted records.
@vinothchandar here, I think you're referring to just using Maps.newHashMap() instead of the spillable map? With the new log-format changes and lazy reading of data, this should work.
How expensive is this constructor? An alternative is to reset this every time and then update() as needed.
/**
 * Creates a new CRC32 object.
 */
public CRC32() {
}

That's all it does; no other member variables are initialized.
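For reference, java.util.zip.CRC32 does support the reuse pattern suggested above: reset() clears the running checksum so one instance can be recycled across values. A minimal self-contained sketch:

```java
import java.util.zip.CRC32;

public class CrcReuseDemo {
    public static void main(String[] args) {
        byte[] block1 = "hello".getBytes();
        byte[] block2 = "world".getBytes();

        // Option 1: a fresh CRC32 per value (the constructor is trivially cheap).
        CRC32 fresh = new CRC32();
        fresh.update(block1, 0, block1.length);
        long crc1 = fresh.getValue();

        // Option 2: reuse one instance via reset() between values.
        CRC32 reused = new CRC32();
        reused.update(block1, 0, block1.length);
        long crc1Again = reused.getValue();
        reused.reset();                          // clears state for the next value
        reused.update(block2, 0, block2.length);
        long crc2 = reused.getValue();

        System.out.println(crc1 == crc1Again);   // same input => same checksum
        System.out.println(crc1 != crc2);        // different input => different checksum
    }
}
```

Since the constructor initializes nothing, the two options should perform about the same; reset() mainly saves allocation churn in tight loops.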
Rename to computeValueSize()? Since you are actually serializing here.
Seems like this method could be shared with some code in CompactedRecordScanner as well?
Yeah, that's why I added a TODO; will look into this now that we are cleaning it up.
Do you always have to flush()? It could be costly, right?
At the moment, SizeAwareDataOutputStream uses DataOutputStream (backed by FileOutputStream), which doesn't buffer anything; it writes through on every write(..), so flush() is a no-op ("The flush method of OutputStream does nothing"). But we could consider using BufferedOutputStream and then flushing only when a get(..) is called, because get(..) reads from an InputStream over the file, so the data needs to be on disk first. However, BufferedOutputStream doesn't provide APIs like writeLong()/writeInt() or readLong()/readInt(), so we would have to implement them ourselves.
To clarify, I am talking about fsync/flush at the Linux level (not particularly the Java APIs which could trigger them)..
Given the entire task is going to be retried if any failure happens, what I would suggest is to let Linux pdflush take care of flushing your appends to the log file completely.. Using a BufferedOutputStream may be tricky if you need to do a get (like you mentioned).. Most implementations keep a write buffer and decouple it from the flushing.. Flushing on every get will get expensive as well and may not actually help with anything, just incurring more system calls..
What do you think about the options?
I agree with you. From this documentation: https://docs.oracle.com/javase/7/docs/api/java/io/OutputStream.html#flush() it seems this will just incur an additional system call. My worry is that we perform get(..) after put(..), but the OS should manage that irrespective of this extra system call. Most implementations I've seen keep a BufferedOutputStream, but in this case we can just rely on the OS buffer and pdflush.
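If buffering is ever adopted here, note that wrapping a BufferedOutputStream in a DataOutputStream restores writeLong()/writeInt() without hand-rolling them, since DataOutputStream decorates any OutputStream. A sketch under that assumption (the temp file is illustrative):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BufferedWriteSketch {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("spill", ".log");
        f.deleteOnExit();

        // DataOutputStream supplies writeLong()/writeInt();
        // BufferedOutputStream batches the underlying write() syscalls.
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            out.writeInt(42);
            out.writeLong(123456789L);
            out.flush(); // would be needed before any get(..) reads the file
        }

        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            System.out.println(in.readInt());   // 42
            System.out.println(in.readLong());  // 123456789
        }
    }
}
```

Flushing before each get(..) still pushes data to the OS page cache only; whether it reaches disk stays up to pdflush, as discussed above.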
Does with or without flush() make a difference performance-wise? What's the action item here?
So I ran a small test on a Hadoop box:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class TestFlush {
  public static void main(String[] args) throws IOException {
    for (int i = 1; i < 1000; i = i * 10) {
      int count = 100000 * i;
      System.out.println("NumWrites => " + count);

      // Write count single bytes without calling flush().
      FileOutputStream withoutFlush = new FileOutputStream(new File("/tmp/test1"));
      long startTime = System.currentTimeMillis();
      for (int j = 0; j < count; j++) {
        withoutFlush.write(1);
      }
      System.out.println("WithoutFlush => " + (System.currentTimeMillis() - startTime));
      withoutFlush.close();

      // Same writes, but with flush() after every write().
      FileOutputStream withFlush = new FileOutputStream(new File("/tmp/test2"));
      startTime = System.currentTimeMillis();
      for (int j = 0; j < count; j++) {
        withFlush.write(1);
        withFlush.flush();
      }
      System.out.println("WithFlush => " + (System.currentTimeMillis() - startTime));
      withFlush.close();
    }
  }
}
NumWrites => 100000
WithoutFlush => 117
WithFlush => 117
NumWrites => 1000000
WithoutFlush => 1171
WithFlush => 1223
NumWrites => 10000000
WithoutFlush => 12182
WithFlush => 11930
With larger writes:
....
byte[] data = new byte[10000];
...
fileOutputStream.write(data);
....
NumWrites => 100000
WithoutFlush => 838
WithFlush => 848
NumWrites => 1000000
WithoutFlush => 17770
WithFlush => 14958
From the tests there doesn't seem to be much difference, since OutputStream's flush() does not really have any implementation.
Action item: Removed flush() since it's a no-op. Use BufferedOutputStream in the future to avoid flushing to disk all the time if we see write() performance degrade. At the moment, from running large jobs, I see write(..) takes < 1 ms.
Force-pushed 3d2dc25 to 5769029
Let's continue this review after we work through the perf issues as well.
Force-pushed 5769029 to 257ec74
@n3nash can you please point out portions you want me to re-review?
Should we invest in a RecordSampler class that will estimate sizes with probability p for each record? That way it's kept more up to date, instead of just using the first record.
In this case, let's change the variable to metadataEntrySampleSize or something, to denote that this is not an accurate average.
Yes, I plan to invest some time in the RecordSampler, but I have no cycles right now. I'll rename it to what you mentioned, that seems clearer, and open a task to add a record sampler (which might be needed in other places too).
Again, we may benefit from a generic size sampler class.
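A generic sampler along the lines discussed could keep a running average by sampling each observed record with probability p, rather than pinning the estimate to the first record. The class and method names below are hypothetical, not from the PR:

```java
import java.util.Random;

// Hypothetical sketch of a generic size sampler: each observed record is
// sampled with probability p, and the estimate is a running average over
// the sampled sizes.
public class SizeSampler {
    private final double samplingProbability;
    private final Random random;
    private long sampledCount = 0;
    private double averageSize = 0.0;

    public SizeSampler(double samplingProbability, long seed) {
        this.samplingProbability = samplingProbability;
        this.random = new Random(seed);
    }

    /** Called once per record with its serialized size in bytes. */
    public void observe(long sizeInBytes) {
        if (random.nextDouble() < samplingProbability) {
            sampledCount++;
            // Incremental running average over sampled records.
            averageSize += (sizeInBytes - averageSize) / sampledCount;
        }
    }

    public double estimate() {
        return averageSize;
    }

    public static void main(String[] args) {
        SizeSampler sampler = new SizeSampler(0.1, 42L);
        for (int i = 0; i < 100000; i++) {
            sampler.observe(100 + (i % 11)); // sizes in [100, 110]
        }
        // The estimate should land inside the observed size range.
        System.out.println(sampler.estimate() >= 100 && sampler.estimate() <= 110);
    }
}
```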
How much does this improve the runtime by, compared to the cost of sorting?
Here are the numbers from sorting vs non-sorting:
----------------- With Sorting enabled---------------------------
18/02/20 20:21:05 INFO log.HoodieCompactedLogRecordScanner: ExtraLogs:: Number of records 43588
18/02/20 20:21:05 INFO log.HoodieCompactedLogRecordScanner: ExtraLogs:: Number of records in memory 17109
18/02/20 20:21:05 INFO log.HoodieCompactedLogRecordScanner: ExtraLogs:: Number of records in disk 26479
18/02/20 20:21:05 INFO collection.LazyFileIterable: ExtraLogs:: Time Taken to sort 18
18/02/20 20:21:14 INFO compact.HoodieRealtimeTableCompactor: ExtraLogs::Time take to read from spillabel 8513
18/02/21 04:52:40 INFO log.HoodieCompactedLogRecordScanner: ExtraLogs:: Number of records 187068
18/02/21 04:52:40 INFO log.HoodieCompactedLogRecordScanner: ExtraLogs:: Number of records in memory 16341
18/02/21 04:52:40 INFO log.HoodieCompactedLogRecordScanner: ExtraLogs:: Number of records in disk 170727
18/02/21 04:52:40 INFO collection.LazyFileIterable: ExtraLogs:: Time Taken to sort 146
18/02/21 04:53:45 INFO compact.HoodieRealtimeTableCompactor: ExtraLogs::Time take to read from spillabel 64863
----------------- Without Sorting ---------------------------
18/02/20 22:47:11 INFO log.HoodieCompactedLogRecordScanner: ExtraLogs:: Number of records 170444
18/02/20 22:47:11 INFO log.HoodieCompactedLogRecordScanner: ExtraLogs:: Number of records in memory 16341
18/02/20 22:47:11 INFO log.HoodieCompactedLogRecordScanner: ExtraLogs:: Number of records in disk 154103
18/02/20 22:47:11 INFO collection.LazyFileIterable: ExtraLogs:: Time Taken to sort 0
18/02/20 22:48:26 INFO compact.HoodieRealtimeTableCompactor: ExtraLogs::Time take to read from spillabel 74631
18/02/21 19:04:38 INFO log.HoodieCompactedLogRecordScanner: ExtraLogs:: Number of records 242087
18/02/21 19:04:38 INFO log.HoodieCompactedLogRecordScanner: ExtraLogs:: Number of records in memory 17740
18/02/21 19:04:38 INFO log.HoodieCompactedLogRecordScanner: ExtraLogs:: Size in memory 2900694448
18/02/21 19:04:38 INFO log.HoodieCompactedLogRecordScanner: ExtraLogs:: Number of records in disk 224347
18/02/21 19:04:38 INFO collection.LazyFileIterable: ExtraLogs:: Time Taken to sort 0
18/02/21 19:06:01 INFO compact.HoodieRealtimeTableCompactor: ExtraLogs::Time take to read from spillabel 82955
It seems like sorting gives a minor benefit, but this is controlled by the layout of the dataset on disk in both cases and by how many back-and-forth disk seeks the non-sorting path actually does.
But I realized that this will not matter at the moment, since we started passing a Map<> to HoodieMergeHandle instead of an Iterator. So when we think about SortMerge we can spend more time on this, WDYT?
Sure. Gains also depend on how IO-loaded the box really is, or whether the seek was the bottleneck.. Even without sorting, it's only taking < 1 ms to read, so it's definitely not seeking every time, and it seems like the page cache is kicking in..
This will come in handy on a loaded system anyway.. Let's revisit down the line.. sure
@vinothchandar here, should I make the change to make 0.75 a constant? Or can we just pick it up from jobConf and default to some value if not present?
Yes, a job conf would be good..
Yes, added a CONF, but we probably need to document and expose this conf properly, since there should be an easy way to set this for Presto/Hive queries.
Can you expand on what you mean by other objects?
Well, the other object is just the disk-based metadata Map<String, Metadata>. I did some calculations: say for a 1 GB parquet file there are about 400K entries spilled to disk (worst case); then 400K * (averageSizeOfKey = 36 bytes + metadataSize = 4 + 8 + 8 bytes + sizeOfFilePath = max 100 bytes) ≈ 60 MB. So it's not huge; we could technically get rid of sizingFactor, but I kept it to have a way to be conservative.
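The back-of-the-envelope estimate above checks out; spelled out with the same assumed entry count and per-field sizes:

```java
public class SpillMetadataEstimate {
    public static void main(String[] args) {
        long entries = 400_000L;        // worst-case entries spilled for a 1 GB parquet file
        long keySize = 36;              // assumed average key size in bytes
        long metadataSize = 4 + 8 + 8;  // fixed-width metadata fields
        long filePathSize = 100;        // assumed max file path length in bytes

        long totalBytes = entries * (keySize + metadataSize + filePathSize);
        System.out.println(totalBytes); // 156 bytes/entry * 400K entries
        System.out.println(totalBytes < 64L * 1024 * 1024); // under ~64 MB
    }
}
```

That is 62,400,000 bytes, i.e. roughly the 60 MB cited in the comment.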
@vinothchandar here, I guess we can change this to maxMemorySizeInBytes, since we will use only the in-memory Maps.newHashMap() below? The problem is we need to account for Maps.newHashMap(), since we are still reading at the block level.
We need to budget for the size of a data block, correct (after we split it into pieces). But we can remove the /2 here, sure.
@vinothchandar here, should we rename this since it can also be used in HoodieMergeHandle, or have a different config for the MergeHandle spillable map? I feel a different config with the same defaults makes sense, thoughts?
You mean usage inside compaction vs COW/MergeHandle? In that case, we are doing one of those cases, right?
Yeah, we are doing one of those cases, but using the DEFAULT_MAX_SIZE_IN_MEMORY_PER_COMPACTION_IN_BYTES name to set the size of the spillable map seems weird for COW?
The answer would depend on whether the SpillableMap is ever used by the COW path. I don't think that's true today, so it's okay for now. We can add a new param when we actually change COW. That's my take.
@vinothchandar added comments, will rebase as well in some time. The rebase shouldn't change the code much, since it's mostly in the Scanner code and that code doesn't have many changes in this diff, so you should be good without the rebase too.
Could this happen outside, in the caller of this method, so value remains an opaque byte[] here?
The Map's types are <Key, HoodieRecord>.
Why make a LinkedHashMap and throw it away? Can't we scan just using the sorted entries?
I was using the LinkedHashMap for something else and forgot to remove the code; done.
Force-pushed 257ec74 to 911f6ce
I think this is in good shape to merge for an initial version.. can you file follow-up tickets for all open items? We can circle back on this..
Force-pushed af9c1a6 to 261b950
Force-pushed 026ec73 to f3db019
Yes, let me file the tickets first and add some more test results tomorrow before we can merge this.
Force-pushed f3db019 to d3a93bb
Force-pushed d3a93bb to 604cc25
Some assumptions at the moment: