PARQUET-1610: Minor grammatical fixes (#132)
sudohalt authored and Fokko committed Jun 25, 2019
1 parent 4157b4c commit 96a8f31
Showing 1 changed file with 11 additions and 14 deletions.
25 changes: 11 additions & 14 deletions PageIndex.md
@@ -19,36 +19,36 @@

# ColumnIndex Layout to Support Page Skipping

-This documents describes the format for column index pages in the Parquet
+This document describes the format for column index pages in the Parquet
footer. These pages contain statistics for DataPages and can be used to skip
pages when scanning data in ordered and unordered columns.

## Problem Statement
In previous versions of the format, Statistics are stored for ColumnChunks in
ColumnMetaData and for individual pages inside DataPageHeader structs. When
-reading pages, a reader had to process the page header in order to determine
+reading pages, a reader had to process the page header to determine
whether the page could be skipped based on the statistics. This means the reader
had to access all pages in a column, thus likely reading most of the column
data from disk.

## Goals
1. Make both range scans and point lookups I/O efficient by allowing direct
access to pages based on their min and max values. In particular:
-2. A single-row lookup in a rowgroup based on the sort column of that rowgroup
-will only read one data page per retrieved column.
-* Range scans on the sort column will only need to read the exact data
+2. A single-row lookup in a row group based on the sort column of that row group
+will only read one data page per the retrieved column.
+* Range scans on the sort column will only need to read the exact data
pages that contain relevant data.
* Make other selective scans I/O efficient: if we have a very selective
predicate on a non-sorting column, for the other retrieved columns we
should only need to access data pages that contain matching rows.
3. No additional decoding effort for scans without selective predicates, e.g.,
-full-row group scans. If a reader determines that it does not need to read
+full-row group scans. If a reader determines that it does not need to read
the index data, it does not incur any overhead.
4. Index pages for sorted columns use minimal storage by storing only the
boundary elements between pages.

## Non-Goals
-* Support for the equivalent of secondary indices, ie, an index structure
+* Support for the equivalent of secondary indices, i.e., an index structure
sorted on the key values over non-sorted data.


@@ -64,9 +64,9 @@ We add two new per-column structures to the row group metadata:
skipped. Hence the OffsetIndexes for each column in a RowGroup are stored
together.

-The new index structures are stored separately from RowGroup, near the footer,
-so that a reader does not have to pay the I/O and deserialization cost for
-reading the them if it is not doing selective scans. The index structures'
+The new index structures are stored separately from RowGroup, near the footer.
+This is done so that a reader does not have to pay the I/O and deserialization
+cost for reading them if it is not doing selective scans. The index structures'
location and length are stored in ColumnChunk.

![Page Index Layout](doc/images/PageIndexLayout.png)
@@ -92,10 +92,7 @@ a binary search in `min_values` and `max_values`. For unordered columns, a
reader can find matching pages by sequentially reading `min_values` and
`max_values`.

-For range scans this approach can be extended to return ranges of rows, page
+For range scans, this approach can be extended to return ranges of rows, page
indices, and page offsets to scan in each column. The reader can then
initialize a scanner for each column and fast forward them to the start row of
the scan.
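
The page lookup described in the last hunk can be illustrated with a small Python sketch. It is not taken from any Parquet implementation: the function names are hypothetical, `min_values` and `max_values` are modelled as plain Python lists of comparable values (one entry per page), the ordered case assumes ascending order, and null pages are ignored.

```python
from bisect import bisect_left, bisect_right


def matching_pages_ordered(min_values, max_values, lo, hi):
    """Pages that may hold values in [lo, hi], for an ascending column.

    Assumes both lists are sorted in non-decreasing order, so two binary
    searches are enough to locate the candidate page range.
    """
    # First page whose max is >= lo; every earlier page ends below lo.
    first = bisect_left(max_values, lo)
    # One past the last page whose min is <= hi; later pages start above hi.
    last = bisect_right(min_values, hi)
    return list(range(first, last))  # empty when no page can match


def matching_pages_unordered(min_values, max_values, lo, hi):
    """Same question for an unordered column: check every page in turn."""
    return [
        i
        for i, (mn, mx) in enumerate(zip(min_values, max_values))
        if mx >= lo and mn <= hi
    ]


# Hypothetical per-page statistics for a sorted column with four pages.
sorted_min = [1, 10, 20, 30]
sorted_max = [9, 19, 29, 39]
point_lookup = matching_pages_ordered(sorted_min, sorted_max, 15, 15)
range_scan = matching_pages_ordered(sorted_min, sorted_max, 18, 25)

# The same pages in a different order, as in an unordered column.
unsorted_min = [20, 1, 30, 10]
unsorted_max = [29, 9, 39, 19]
unordered_scan = matching_pages_unordered(unsorted_min, unsorted_max, 18, 25)
```

With these sample statistics, the point lookup selects a single page and the range scan selects two adjacent pages. A reader would then map the selected page indices to file offsets and row ranges through the OffsetIndex; that step is left out here.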


