Skip to content
This repository has been archived by the owner on Jun 7, 2021. It is now read-only.

[TRAFODION-2164] Incremental Update Statistics documentation #707

Merged
merged 2 commits into from Sep 19, 2016
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/shared/revisions.txt
Expand Up @@ -26,6 +26,7 @@
[cols="2",options="header"]
|===
| Version | Date
| 2.1.0 | TBD
| 2.0.1 | July 7, 2016
| 2.0.0 | June 6, 2016
| 1.3.0 | January, 2016
Copy link
Contributor

@svarnau svarnau Sep 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my preference was to remove this section, so we don't try to track release dates in documents.

Though it looks like this file is shared by all the docs? That is less bad than I thought. We could keep it if updating it is added to https://cwiki.apache.org/confluence/display/TRAFODION/Create+Release#CreateRelease-UpdateProjectWebSite

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is shared in all documents. My thought was that perhaps the release manager could supply the date at the tail end of the release cycle. If this is agreeable, I can add a comment to the wiki as you suggest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that sounds good. Just add revisions.txt to the list of those files that needs to be updated and docs published. Thanks.

Expand Down
15 changes: 1 addition & 14 deletions docs/sql_reference/src/asciidoc/_chapters/about.adoc
Expand Up @@ -51,20 +51,7 @@ This edition includes updates for these new features:
[cols="50%,50%",options="header"]
|===
| New Feature | Location in the Manual
| On Line Analytical Process (OLAP) window functions | <<olap_functions,OLAP Functions>>
| Ability to cancel DDL, update statistics, and
additional child query operations in addition to
DML statements | <<control_query_cancel_statement,CONTROL QUERY CANCEL Statement>>
| Authorization required to run the CONTROL QUERY
CANCEL Statement | <<control_query_cancel_statement,CONTROL QUERY CANCEL Statement>>
| Ability to grant privileges on behalf of a role
using the GRANTED BY clause. | <<grant_component_privilege_statement,GRANT COMPONENT PRIVILEGE Statement>>
| Authorization required for all SHOWDDL commands | <<showddl_statement,SHOWDDL Statement>> +
<<SHOWDDL_SCHEMA_Statement,SHOWDDL SCHEMA Statement>>
| Ability to display the DDL syntax of a library object
using the SHOWDDL LIBRARY command | <<showddl_statement,SHOWDDL Statement>>
| Listing of HBase objects using the GET HBASE OBJECTS
command through an SQL interface | <<get_hbase_objects_statement,GET HBASE OBJECTS Statement>>
| Incremental UPDATE STATISTICS | <<update_statistics_statement,UPDATE STATISTICS Statement>>
|===

<<<
Expand Down
4 changes: 4 additions & 0 deletions docs/sql_reference/src/asciidoc/_chapters/sql_statements.adoc
Expand Up @@ -2344,6 +2344,10 @@ creates an HBase table if it does not already exist when the table is created. T
+
specifies the ANSI logical name of the table. See <<database_object_names,Database Object Names>>. This name must be
unique among names of tables and views within its schema.
+
TIP: Avoid using table names that begin with "TRAF_SAMPLE_", as {project-name} SQL assumes that
such tables are persistent sample tables. See also <<update_statistics_incremental_update_statistics,
Incremental Update Statistics>>.

* `SALT USING _num_ PARTITIONS [ON (_column_[, _column_]&#8230;)]`
+
Expand Down
86 changes: 83 additions & 3 deletions docs/sql_reference/src/asciidoc/_chapters/sql_utilities.adoc
Expand Up @@ -686,11 +686,12 @@ devise optimized access plans.
UPDATE STATISTICS is a {project-name} SQL extension.

```
UPDATE STATISTICS FOR TABLE table [CLEAR | on-clause]
UPDATE STATISTICS FOR TABLE table [CLEAR | on-clause | sample-table-clause ]

on-clause is:
ON column-group-list CLEAR
| ON column-group-list [histogram-option]...
| ON column-group-list INCREMENTAL WHERE predicate

column-group-list is:
column-list [,column-list]...
Expand All @@ -716,8 +717,12 @@ histogram-option is:

sample-option is:
[r ROWS]
| RANDOM percent PERCENT
| RANDOM percent PERCENT [PERSISTENT]
| PERIODIC size ROWS EVERY period ROWS

sample-table-clause is:
CREATE SAMPLE RANDOM percent PERCENT
| REMOVE SAMPLE
```

[[update_statistics_syntax]]
Expand Down Expand Up @@ -872,7 +877,6 @@ than specified, or if the sample size is greater than the system limit,
A row sample is used to read _r_ rows from the table. The value _r_ must
be an integer that is greater than zero

<<<
**** `RANDOM _percent_ PERCENT`
+
directs {project-name} SQL to choose rows randomly from the table. The value
Expand All @@ -881,6 +885,12 @@ addition, only the first four digits to the right of the decimal point
are significant. For example, value 0.00001 is considered to be 0.0000,
Value 1.23456 is considered to be 1.2345.

***** `PERSISTENT`
+
directs {project-name} SQL to create a persistent sample table and store the
random sample in it. This table can then be used later for updating statistics
incrementally.

**** `PERIODIC _size_ ROWS EVERY _period_ ROW`
+
directs {project-name} SQL to choose the first _size_ number of rows from
Expand All @@ -890,6 +900,28 @@ _size_ &#60;= _period_). The size of the _period_ is defined by the number
of rows specified for _period_. The value _period_ must be an integer
that is greater than zero (_period_ > 0).

* `INCREMENTAL WHERE _predicate_`
+
directs {project-name} SQL to update statistics incrementally. That is, instead
of taking a fresh sample of the entire table, {project-name} SQL will use a previously
created persistent sample table. {project-name} SQL will update the persistent sample
by replacing any rows satisfying the _predicate_ with a fresh sample of rows from
the original table satisfying the _predicate_. The sampling rate used is the
_percent_ specified when the persistent sample table was created. Statistics
are then generated from this updated sample. See also
<<update_statistics_incremental_update_statistics,
Incremental Update Statistics>>.

* `CREATE SAMPLE RANDOM _percent_ PERCENT`
+
Creates a persistent sample table associated with this table. The sample is
created using a random sample of _percent_ percent of the rows. The table
can then be used for later incremental statistics update.

* `REMOVE SAMPLE`
+
Drops the persistent sample table associated with this table.

[[update_statistics_considerations]]
=== Considerations for UPDATE STATISTICS

Expand Down Expand Up @@ -1046,6 +1078,35 @@ NECESSARY COLUMN[S] to generate the needed histograms:
update statistics for table _table-name_ on necessary columns sample;
```

[[update_statistics_incremental_update_statistics]]
==== Incremental Update Statistics

UPDATE STATISTICS processing time can be lengthy for very large tables.
One strategy for reducing the time is to create histograms only for
columns that actually need them (for example, using the ON NECESSARY COLUMNS
column group). Another strategy is to update statistics incrementally. These
strategies can be used together if desired.

To use the incremental update statistics feature, you must first create
statistics for the table and create a persistent sample table. One way to
do this is to perform a normal update statistics command, adding the
PERSISTENT keyword to the _sample-option_. Another way to do this if you
already have reasonably up-to-date statistics for the table, is to create
a persistent sample table separately using the CREATE SAMPLE option.

You can then perform update statistics incrementally by using the INCREMENTAL
WHERE _predicate_ syntax in the on-clause. The _predicate_ should be chosen
to describe the set of rows that have changed since the last statistics update
was performed. For example, if your table contains a column with a timestamp
giving the date and time of last update, this is a particularly useful column
to use in the _predicate_.

If you decide later that you wish to change the _percent_ sampling rate used
for the persistent sample table, you can do so by dropping the persistent
sample table (using REMOVE SAMPLE) and creating a new one (by using the
CREATE SAMPLE option). Using a higher _percent_ results in more accurate
histograms, but at the price of a longer-running operation.

<<<
[[update_statistics_examples]]
=== Examples of UPDATE STATISTICS
Expand Down Expand Up @@ -1108,4 +1169,23 @@ table ADDRESS:
UPDATE STATISTICS FOR TABLE address ON street CLEAR;
```

* This example generates statistics but also creates a persistent
sample table for use when updating statistics incrementally:
+
```
UPDATE STATISTICS FOR TABLE address
ON (street), (city), (state), (zip, type)
SAMPLE RANDOM 5 PERCENT PERSISTENT;
```

* This example updates statistics incrementally. It assumes that
a persistent sample table has already been created. The predicate
in the WHERE clause describes the set of rows that have changed
since statistics were last updated. Here we assume that rows
with a state of California are the only rows that have changed:
+
```
UPDATE STATISTICS FOR TABLE address
ON EXISTING COLUMNS
INCREMENTAL WHERE state = 'CA';
```