# Parquet

Summarizing the [Parquet documentation](https://parquet.apache.org/documentation/latest/):

- The goal is interoperability across the Hadoop ecosystem
- Compression can be specified per column
- Handles complex nested data structures (similar to XML)
- Nested data is encoded with the record shredding and assembly algorithm from the 2010 Google Dremel paper

Each column chunk lives in a single row group and is guaranteed to be contiguous in the file. Here's the hierarchy for a table with 2 columns:
```
File
    Row Group 1
        Column Chunk 1
        Column Chunk 2
    Row Group 2
        Column Chunk 1
        Column Chunk 2        
    Row Group 3
        ...
```
The docs recommend large row groups (512 MB to 1 GB), with the HDFS block size set to match so that each row group fits in a single HDFS block.

- Row-based MapReduce jobs run in parallel across row groups.
- IO runs in parallel across column chunks.

Reading data from the top of this page: http://anson.ucdavis.edu/~clarkf/

I used conda to install pyarrow: https://anaconda.org/conda-forge/pyarrow

In [1]:
import os

import pandas as pd
import pyarrow.parquet as pq

Creating the dataset object as below is cheap because it only reads the metadata, not the column data.

Parquet file metadata is serialized with Apache Thrift.

In [4]:
pems = pq.ParquetDataset("/Users/clark/data/pems/pems_sorted/")

The schema from the database (Hive) is preserved.

In [7]:
pems.schema

<pyarrow._parquet.ParquetSchema object at 0x11417a288>
timeperiod: BYTE_ARRAY UTF8
flow1: INT32
occupancy1: DOUBLE
speed1: DOUBLE
flow2: INT32
occupancy2: DOUBLE
speed2: DOUBLE
flow3: INT32
occupancy3: DOUBLE
speed3: DOUBLE
flow4: INT32
occupancy4: DOUBLE
speed4: DOUBLE
flow5: INT32
occupancy5: DOUBLE
speed5: DOUBLE
flow6: INT32
occupancy6: DOUBLE
speed6: DOUBLE
flow7: INT32
occupancy7: DOUBLE
speed7: DOUBLE
flow8: INT32
occupancy8: DOUBLE
speed8: DOUBLE
 

## Nulls

Parquet's handling of nulls is quite a bit different from using IEEE NaN or a special bit pattern as a sentinel value.

> Nullity is encoded in the definition levels (which is run-length encoded). NULL values are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs would be encoded with run-length encoding (0, 1000 times) for the definition levels and nothing else.
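
A toy model of the idea in that quote, for a flat (non-nested) optional column. This is a simplified sketch in plain Python, not the actual Parquet encoding: the real format uses a hybrid of run-length encoding and bit packing, and the function names here are made up.

```python
from itertools import groupby


def encode_optional_column(values):
    """Split a flat nullable column into RLE definition levels and dense values."""
    # Definition level 0 means NULL, 1 means the value is present.
    levels = [0 if v is None else 1 for v in values]
    runs = [(level, len(list(group))) for level, group in groupby(levels)]
    # NULLs take no space in the data itself.
    dense = [v for v in values if v is not None]
    return runs, dense


def decode_optional_column(runs, dense):
    """Rebuild the original column from the runs and the dense values."""
    remaining = iter(dense)
    out = []
    for level, count in runs:
        for _ in range(count):
            out.append(next(remaining) if level == 1 else None)
    return out


runs, dense = encode_optional_column([None] * 1000)
print(runs, dense)  # [(0, 1000)] []
```

A column of 1000 NULLs collapses to a single run and an empty value list, which matches the example in the quote.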

# Apache Arrow

> Powering Columnar In-Memory Analytics

A bold claim... R, Python (NumPy), and Julia all compete in this space.

Source: [Arrow Docs](https://arrow.apache.org/)

Essentially, Arrow is a specification for a memory layout, along with high-performance C++ and Java implementations.

My initial experiments using it for shared memory were positive.

![Common memory](common_memory.png)

## Some thoughts

To load data from Parquet into a high-level language, one has to go from Parquet to Arrow, and then from Arrow to the native data structures of that language. Why not load directly from Parquet into the high-level language?

In [19]:
pems_table = pems.read(["timeperiod", "flow1", "occupancy1"])
pems_table

pyarrow.Table
timeperiod: string
flow1: int32
occupancy1: double
station: dictionary<values=int64, indices=int32>
-- metadata --
org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"timeperiod","type":"string","nullable":true,"metadata":{}},{"name":"flow1","type":"integer","nullable":true,"metadata":{}},{"name":"occupancy1","type":"double","nullable":true,"metadata":{}},{"name":"speed1","type":"double","nullable":true,"metadata":{}},{"name":"flow2","type":"integer","nullable":true,"metadata":{}},{"name":"occupancy2","type":"double","nullable":true,"metadata":{}},{"name":"speed2","type":"double","nullable":true,"metadata":{}},{"name":"flow3","type":"integer","nullable":true,"metadata":{}},{"name":"occupancy3","type":"double","nullable":true,"metadata":{}},{"name":"speed3","type":"double","nullable":true,"metadata":{}},{"name":"flow4","type":"integer","nullable":true,"metadata":{}},{"name":"occupancy4","type":"double","nullable":true,"metadata":{}},{"name":"spee

In [21]:
# Metadata tells us things like the shape
pems_table.shape

(3932049, 4)

In [26]:
# We can pull out underlying pieces

flow1 = pems_table[1]
flow1

<pyarrow.lib.Column at 0x114099e40>

In [27]:
flow1.type

DataType(int32)

In [32]:
flow1.data.num_chunks

1420

In [34]:
flow1.data.chunk(0)

<pyarrow.lib.Int32Array object at 0x13da43228>
[
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  ...
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0
]

### Conversion to a pandas DataFrame


In [22]:
pems_df = pems_table.to_pandas()

In [23]:
pems_df.head()

|   | timeperiod | flow1 | occupancy1 | station |
|---|---|---|---|---|
| 0 | 01/01/2016 00:00:05 | 0 | 0.0 | 402260 |
| 1 | 01/01/2016 00:00:35 | 0 | 0.0 | 402260 |
| 2 | 01/01/2016 00:01:06 | 0 | 0.0 | 402260 |
| 3 | 01/01/2016 00:01:35 | 0 | 0.0 | 402260 |
| 4 | 01/01/2016 00:02:05 | 0 | 0.0 | 402260 |
