This repository has been archived by the owner on Jun 7, 2021. It is now read-only.
[TRAFODION-2376] Improve UPDATE STATS performance on varchar columns #1029
This pull request submits a performance enhancement to the UPDATE STATISTICS utility. This work is the completion of a prototype originally done by Barry Fritchman (@blfritch).
For the moment, the feature is turned off by default. Use CQD USTAT_COMPARE_VARCHARS 'ON' to turn on this enhancement.
This feature compacts varchars in memory for the internal sort code path in UPDATE STATISTICS. In the old code, varchars are expanded out to their full length. (Actually, we already truncate them at 256 characters -- the setting of CQD USTAT_MAX_CHAR_COL_LENGTH_IN_BYTES -- possibly giving up some accuracy in UEC computation but improving performance dramatically for very long varchar columns.) In the new code, we estimate the average length of the column's values and allocate space on the assumption that the column still adheres to that average. For columns that already have statistics, we use the average varchar length stored in SB_HISTOGRAMS column V2. For columns that don't, we guess that the average is one-half the declared length of the column.
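As a rough illustration of the estimate described above, here is a minimal Python sketch. It is not the code in this pull request: the function name, its parameters, and the choice to apply the 256 truncation limit after the estimate are assumptions made for the example, and per-value overhead such as a length prefix is not modeled.

```python
# Hypothetical sketch of the per-value space estimate; not the actual
# Trafodion implementation.

# Default truncation limit mentioned above
# (CQD USTAT_MAX_CHAR_COL_LENGTH_IN_BYTES).
MAX_CHAR_COL_LENGTH = 256


def estimated_value_length(declared_len, avg_len_from_stats=None,
                           max_len=MAX_CHAR_COL_LENGTH):
    """Return the space to reserve for one varchar value.

    declared_len       -- declared length of the varchar column
    avg_len_from_stats -- average length from existing statistics
                          (SB_HISTOGRAMS column V2), or None if the
                          column has no statistics yet
    """
    if avg_len_from_stats is not None:
        estimate = avg_len_from_stats
    else:
        # No statistics yet: guess one-half the declared length
        # (rounded up so a 1-character column still gets 1).
        estimate = (declared_len + 1) // 2
    # Assumption: the estimate never exceeds the truncation limit
    # or the declared length.
    return min(estimate, declared_len, max_len)


# Column with existing statistics: use the stored average.
print(estimated_value_length(200, avg_len_from_stats=30))
# Column without statistics: half the declared length.
print(estimated_value_length(100))
# Very long column without statistics: capped at the truncation limit.
print(estimated_value_length(4000))
```

By contrast, the old behavior corresponds to reserving `min(declared_len, max_len)` for every value, regardless of how long the values actually are.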
The performance gain from using this feature comes from reducing the number of scans of the table or sample table because more columns can fit in memory in each scan.
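To make the effect on scan count concrete, the following Python sketch packs columns into a fixed memory budget, one batch per scan. The greedy packing, the function name, and the numbers are all assumptions chosen for illustration; the real batching logic in UPDATE STATISTICS may differ.

```python
# Hypothetical sketch: how narrower per-value estimates can reduce the
# number of table scans. Not the actual UPDATE STATISTICS batching code.

def count_scans(bytes_per_value, rows, memory_budget):
    """Count scans needed if each scan loads as many columns as fit.

    bytes_per_value -- space reserved per value, one entry per column
    rows            -- number of rows read per column
    memory_budget   -- bytes available for column data in one scan
    """
    scans = 0
    remaining = 0  # no scan started yet
    for width in bytes_per_value:
        needed = width * rows
        if needed > memory_budget:
            raise ValueError("a single column exceeds the memory budget")
        if needed > remaining:
            # Column does not fit in the current scan: start a new one.
            scans += 1
            remaining = memory_budget
        remaining -= needed
    return scans


rows = 1000
budget = 600_000

# Four varchar columns expanded to the 256 truncation limit.
expanded = [256, 256, 256, 256]
# The same columns with an assumed average length of 64.
compacted = [64, 64, 64, 64]

print(count_scans(expanded, rows, budget))   # 2 scans
print(count_scans(compacted, rows, budget))  # 1 scan
```

With these made-up numbers, only two expanded columns fit per scan, while all four compacted columns fit in a single scan.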
Also included in this pull request is a tool, analyzeULOG.py, that can be used to scan ULOGs from UPDATE STATISTICS runs to extract timing data. This is useful for determining where time is spent during UPDATE STATISTICS processing.