PARQUET-1610: Minor grammatical fixes (#132)

apache · Jun 25, 2019 · 96a8f31
1 parent 4157b4c
commit 96a8f31
Showing 1 changed file with 11 additions and 14 deletions.
diff --git a/PageIndex.md b/PageIndex.md
@@ -19,36 +19,36 @@
 
 # ColumnIndex Layout to Support Page Skipping
 
-This documents describes the format for column index pages in the Parquet
+This document describes the format for column index pages in the Parquet
 footer. These pages contain statistics for DataPages and can be used to skip
 pages when scanning data in ordered and unordered columns.
 
 ## Problem Statement
 In previous versions of the format, Statistics are stored for ColumnChunks in
 ColumnMetaData and for individual pages inside DataPageHeader structs. When
-reading pages, a reader had to process the page header in order to determine
+reading pages, a reader had to process the page header to determine
 whether the page could be skipped based on the statistics. This means the reader
 had to access all pages in a column, thus likely reading most of the column
 data from disk.
 
 ## Goals
 1. Make both range scans and point lookups I/O efficient by allowing direct
    access to pages based on their min and max values. In particular:
-2. A single-row lookup in a rowgroup based on the sort column of that rowgroup
-   will only read one data page per retrieved column.
-    * Range scans on the sort column will only need to read the exact data
+2. A single-row lookup in a row group based on the sort column of that row group
+  will only read one data page per the retrieved column.
+    * Range scans on the sort column will only need to read the exact data 
       pages that contain relevant data.
     * Make other selective scans I/O efficient: if we have a very selective
       predicate on a non-sorting column, for the other retrieved columns we
       should only need to access data pages that contain matching rows.
 3. No additional decoding effort for scans without selective predicates, e.g.,
-   full-row group scans. If a reader determines that it does not need to read
+   full-row group scans. If a reader determines that it does not need to read 
    the index data, it does not incur any overhead.
 4. Index pages for sorted columns use minimal storage by storing only the
    boundary elements between pages.
 
 ## Non-Goals
-* Support for the equivalent of secondary indices, ie, an index structure
+* Support for the equivalent of secondary indices, i.e., an index structure
   sorted on the key values over non-sorted data.
 
 
@@ -64,9 +64,9 @@ We add two new per-column structures to the row group metadata:
   skipped. Hence the OffsetIndexes for each column in a RowGroup are stored
   together.
 
-The new index structures are stored separately from RowGroup, near the footer,
-so that a reader does not have to pay the I/O and deserialization cost for
-reading the them if it is not doing selective scans. The index structures'
+The new index structures are stored separately from RowGroup, near the footer.  
+This is done so that a reader does not have to pay the I/O and deserialization 
+cost for reading them if it is not doing selective scans. The index structures'
 location and length are stored in ColumnChunk.
 
  ![Page Index Layout](doc/images/PageIndexLayout.png)
@@ -92,10 +92,7 @@ a binary search in `min_values` and `max_values`. For unordered columns, a
 reader can find matching pages by sequentially reading `min_values` and
 `max_values`.
 
-For range scans this approach can be extended to return ranges of rows, page
+For range scans, this approach can be extended to return ranges of rows, page
 indices, and page offsets to scan in each column. The reader can then
 initialize a scanner for each column and fast forward them to the start row of
 the scan.
-
-
-
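
For reference, the page lookup described in the context of the last hunk (a binary search over `min_values` and `max_values` for ordered columns, a sequential pass for unordered ones) could look roughly like the Python sketch below. It is illustrative only: the function names, the plain-list representation of the index, and the assumption of ascending order are not taken from PageIndex.md or from any Parquet library, and pages containing only nulls are ignored.

```python
from bisect import bisect_left, bisect_right


def matching_pages_ordered(min_values, max_values, lo, hi):
    """Return indices of pages that may hold values in [lo, hi].

    Assumption: the column is sorted ascending, so both min_values and
    max_values are non-decreasing from page to page.
    """
    # Index of the first page whose max value is >= lo.
    first = bisect_left(max_values, lo)
    # One past the last page whose min value is <= hi.
    last = bisect_right(min_values, hi)
    return list(range(first, last))


def matching_pages_unordered(min_values, max_values, lo, hi):
    """Return indices of pages that may hold values in [lo, hi].

    No ordering is assumed, so every page's [min, max] range is checked.
    """
    return [
        i
        for i, (mn, mx) in enumerate(zip(min_values, max_values))
        if mx >= lo and mn <= hi
    ]


# Hypothetical per-page statistics for a sorted column with four pages.
sorted_min = [1, 10, 20, 30]
sorted_max = [9, 19, 29, 39]

# A range scan for values in [15, 22] touches only the second and third pages.
range_pages = matching_pages_ordered(sorted_min, sorted_max, 15, 22)

# A point lookup is a range scan with lo == hi.
point_pages = matching_pages_ordered(sorted_min, sorted_max, 10, 10)

# The same pages in shuffled order need the sequential variant.
unsorted_pages = matching_pages_unordered(
    [20, 1, 30, 10], [29, 9, 39, 19], 15, 22
)
```

A point lookup on the sorted column selects a single page, which matches goal 2 in the document. The returned page indices would then be mapped to file offsets through the OffsetIndex, a step this sketch leaves out.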