Skip to content

Using column average width in estimation

Hadi Moshayedi edited this page May 4, 2015 · 1 revision

Description

CStoreGetForeignPaths uses the following formula to calculate totalDiskAccessCost:

double queryColumnRatio = (double) queryColumnCount / relationColumnCount;
double queryPageCount = relationPageCount * queryColumnRatio;
double totalDiskAccessCost = seq_page_cost * queryPageCount;

This is the best we can do if we don't have column widths. But ANALYZE command collects information about average width of each column and puts it into pg_statistic catalog table. We can use this information to get better estimations.

Technical Details

  • pg_statistic stores the average width values in the stawidth column. The comments for this column say:

    /*
     * stawidth is the average width in bytes of non-null entries.  For
     * fixed-width datatypes this is of course the same as the typlen, but for
     * var-width types it is more useful.  Note that this is the average width
     * of the data as actually stored, post-TOASTing (eg, for a
     * moved-out-of-line value, only the size of the pointer object is
     * counted).  This is the appropriate definition for the primary use of
     * the statistic, which is to estimate sizes of in-memory hash tables of
     * tuples.
     */
    int32		stawidth;
  • We can get this value using get_attavgwidth. Comments for this function say:

    /*
      * get_attavgwidth
      *
      *	  Given the table and attribute number of a column, get the average
      *	  width of entries in the column.  Return zero if no data available.
      *
      * Currently this is only consulted for individual tables, not for inheritance
      * trees, so we don't need an "inh" parameter.
      *
      * Calling a hook at this point looks somewhat strange, but is required
      * because the optimizer calls this function without any other way for
      * plug-ins to control the result.
      */
     int32
     get_attavgwidth(Oid relid, AttrNumber attnum);
  • We don't have toasting in cstore tables, so consulting this value should be fine. We need to double check this.

  • We have one bit of overhead per column for the "exists" values, that exists for both NULL and non-NULL values. If we have lots of NULLs this can be significant. We can get the ratio of nulls using pg_statistic->stanullfrac.