Analyze & Fix Errors due to tez config changes #21
groupby3_map_multi_distinct Analysis:
We had increased the Tez container size from 128mb to 256mb to address OOM errors. This qtest sets the property
set hive.map.aggr=true;
When this property is true, a preliminary check, checkMapSideAggregation(), runs first to verify that there is enough space to hold the hash table required for map-side aggregation. The space allotted for this aggregation is half the container size: half of 128mb was not enough to store the generated table, but half of 256mb is, so map-side aggregation now happens. With this aggregation, hashes are generated and stored for only the 307 distinct rows out of 500, and duplicate rows are mapped to these hashes. This explains the expected change in statistics.
mm_all Analysis:
We had increased the Tez container size from 128mb to 256mb to address OOM errors. The total memory allocated to the LLAP daemon is 4096mb, so with each container sized at 256mb, the available slots = 4096/256 = 16.
With the increased container size, the split size increases and each task gets more resources. As a result, each task processes a larger number of rows, and each task corresponds to one hive side file. The total amount of data processed remains the same; only the amount processed per task increases. Thus, only 16 hive files are generated.
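The slot arithmetic above can be spelled out in a few lines; the figures are the ones from the analysis, and the helper function is illustrative, not part of any Hive/LLAP API:

```python
def available_slots(daemon_mb, container_mb):
    """Each concurrently running task occupies one container-sized slot
    inside the LLAP daemon's memory allocation."""
    return daemon_mb // container_mb

available_slots(4096, 128)  # → 32 slots with the old container size
available_slots(4096, 256)  # → 16 slots, hence 16 writer tasks / hive files
```

Since each writer task produces one file, halving the number of slots halves the number of output files, which is exactly the diff the qtest observed.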
mm_dp Analysis:
The failure in this test case arises only because of a difference in the random numbers generated. The random number generation depends not only on the seed value passed but also on the available task resources. As above, the task resources have increased and each task processes a larger number of rows, generating more random numbers per task, so the random numbers generated differ between container sizes of 128mb and 256mb.
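A toy model makes the effect concrete. This is a deliberate simplification using Python's random module, not Hive's actual RNG: assume each task seeds its generator with the same base seed, so the random value a given row receives depends on where the task boundaries fall, i.e. on the split size.

```python
import random

def per_task_randoms(total_rows, rows_per_task, seed=42):
    """Toy model: every task reseeds with the same base seed but covers a
    different row range, so the value assigned to a given global row
    shifts when the split (rows-per-task) size changes."""
    values = []
    for start in range(0, total_rows, rows_per_task):
        rng = random.Random(seed)  # same seed in every task
        for _ in range(min(rows_per_task, total_rows - start)):
            values.append(rng.random())
    return values

small_splits = per_task_randoms(8, rows_per_task=2)  # eight tasks of 2 rows
large_splits = per_task_randoms(8, rows_per_task=4)  # four tasks of 4 rows
small_splits != large_splits  # → True: same seed, different per-row values
```

The first row of each task still matches across the two runs, but rows deeper into a task see values that the smaller splits never reached, mirroring how the 128mb and 256mb runs produced different random numbers despite identical seeds.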