<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Reading Data - Parquet Files

**Preliminaries**

The `read_parquet` function is available from pandas 0.21 onward.

You need to install pandas manually, using the **Import Library** function: search for and install `pandas==0.23.4`.

**Technical Accomplishments:**
- Introduce the Parquet file format.
- Read data from:
  - Parquet files without a Schema.
  - Parquet files with a Schema.

## ![Python Tiny Logo](https://dl.dropboxusercontent.com/s/wl9nvyva3qjsaz2/logo_python_tiny.png) Getting Started

Let's start by importing the libraries and creating some useful variables.

In [None]:
%load_ext autotime

import pandas
import s3fs  # not called directly; pandas uses it to open s3:// paths
import boto3
import qcutils

s3 = boto3.client('s3')
baseUri = "s3://quantia-master/training/"

## ![Python Tiny Logo](https://dl.dropboxusercontent.com/s/wl9nvyva3qjsaz2/logo_python_tiny.png) Reading from Parquet Files

[Apache Parquet](https://parquet.apache.org) is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

### About Parquet Files
* Free & Open Source.
* Increased query performance over row-based data stores.
* Provides efficient data compression.
* Designed for performance on large data sets.
* Supports limited schema evolution.
* Is a splittable "file format".
* A <a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS" target="_blank">Column-Oriented</a> data store

**Row Format** 

| ID |  Name | Score |
|:--:|:-----:|:-----:|
| 1  | john  | 4.1   |
| 2  | mike  | 3.5   |
| 3  | sally | 6.4   |

**Columnar View**

```
ID: 1, 2, 3
Name: john, mike, sally
Score: 4.1, 3.5, 6.4
```
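
To make the difference concrete, here is a minimal sketch in plain Python that pivots the three rows above into the columnar layout. It only illustrates the idea: the real Parquet format adds row groups, encodings and compression on top, and the variable names are illustrative.

```python
# Row-oriented layout: one record per row, as in the table above.
rows = [
    {"ID": 1, "Name": "john", "Score": 4.1},
    {"ID": 2, "Name": "mike", "Score": 3.5},
    {"ID": 3, "Name": "sally", "Score": 6.4},
]

# Column-oriented layout: one list of values per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

for name, values in columns.items():
    print(f"{name}: {', '.join(str(v) for v in values)}")

# A computation on one column only needs that column's values.
average_score = sum(columns["Score"]) / len(columns["Score"])
```

This is why a columnar store can answer a query on `Score` without touching `ID` or `Name`.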

**See also**:
* [Apache Parquet](https://parquet.apache.org)
* [Apache Parquet on Wikipedia](https://en.wikipedia.org/wiki/Apache_Parquet)

### Data Source

The data for this example shows the traffic to various articles on Wikipedia (<a href="https://dumps.wikimedia.org/other/pagecounts-raw" target="_blank">23 MB</a> from Wikipedia). 

The original file, captured on August 5th, 2016, was downloaded and converted to a Parquet file.

**Note**: If the Parquet file is partitioned (e.g. it was saved using Spark), pandas is unable to read it as a whole and can only read the individual parts one at a time. A workaround is to read the fragments separately and then concatenate them.
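
The workaround described in the note above can be sketched as follows. This is a minimal sketch, not part of the original course material: `read_parquet_parts` is a hypothetical helper name, and the `reader` parameter is an assumption added only so the function can be tried without S3 access.

```python
import pandas


def read_parquet_parts(part_paths, columns=None, reader=pandas.read_parquet):
    """Read each part file on its own and concatenate the results.

    Hypothetical helper: `part_paths` is a list of paths or URIs of single
    part files, `columns` is forwarded to the reader, and `reader` defaults
    to pandas.read_parquet.
    """
    frames = [reader(path, columns=columns) for path in part_paths]
    # ignore_index=True rebuilds a single 0..n-1 index across all parts.
    return pandas.concat(frames, ignore_index=True)
```

With real data, `part_paths` would hold the full URI of every part file in the Parquet directory. Note that every part is loaded into memory, so this only suits data sets that fit on a single machine.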

For this training we use a single fragment of the original parquet file: 

```
s3://quantia-master/training/master-bip/training/wikipedia_pageviews_by_second.parquet/part-00000-tid-863803156164904753-537caea0-8c3b-4349-b236-0762d3215bce-184-c000.snappy.parquet
```

Unlike our CSV and JSON examples, the Parquet "file" is actually 11 files: 8 contain the bulk of the data and the other 3 contain metadata.
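
When working with such a directory, the data files need to be told apart from the metadata files. The sketch below assumes that data files follow the `part-*.parquet` naming seen in the path above and that everything else is metadata. The function name `is_data_part` and the example keys are hypothetical, shortened names, not the real bucket contents.

```python
import posixpath


def is_data_part(key):
    """Return True if an object key looks like a Parquet data part file."""
    name = posixpath.basename(key)
    return name.startswith("part-") and name.endswith(".parquet")


# Hypothetical keys, for illustration only.
example_keys = [
    "wikipedia_pageviews_by_second.parquet/_SUCCESS",
    "wikipedia_pageviews_by_second.parquet/part-00000-example.snappy.parquet",
    "wikipedia_pageviews_by_second.parquet/part-00001-example.snappy.parquet",
]

data_parts = [key for key in example_keys if is_data_part(key)]
print(data_parts)
```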

### Read in the Parquet Files

To read in these files, we will specify the location of the Parquet directory.

Let's try to read the Parquet file by passing the base location.

In [None]:
parquetFile = baseUri + "wikipedia_pageviews_by_second.parquet"
tempDF = pandas.read_parquet(parquetFile)
tempDF.info()

So why didn't it work?

pandas is not a distributed framework, and it is not able to automatically concatenate the different Parquet parts.

Look in the folder on AWS S3:

In [None]:
qcutils.list_s3_bucket_objects(limit=10)
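
If `qcutils` is not available in your environment, a similar listing can be sketched with the boto3 S3 client method `list_objects_v2`. The function name `list_keys` is hypothetical, and the function takes the client as an argument so that the `s3` client created earlier can be passed in.

```python
def list_keys(s3_client, bucket, prefix, limit=10):
    """Return up to `limit` object keys under `prefix` in `bucket`.

    Hypothetical helper built on the boto3 S3 client method list_objects_v2.
    """
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=limit)
    # "Contents" is absent from the response when nothing matches the prefix.
    return [obj["Key"] for obj in response.get("Contents", [])]
```

For example, `list_keys(s3, "quantia-master", "training/wikipedia_pageviews_by_second.parquet/")` would list the directory used here, assuming the bucket and prefix are split from `baseUri` in that way.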

Now we will read a single part of the parquet file.

In [None]:
parquetFile = baseUri + "wikipedia_pageviews_by_second.parquet/part-00000-tid-863803156164904753-537caea0-8c3b-4349-b236-0762d3215bce-184-c000.snappy.parquet"

tempDF = pandas.read_parquet(parquetFile)
tempDF

### Read only a subset of the columns
We can read only specific columns by passing the `columns` argument.

In [None]:
tempDF = pandas.read_parquet(parquetFile, columns=['timestamp', 'requests'])
tempDF

### Review: Reading from Parquet Files
* We do not need to specify the schema - the column names and data types are stored in the parquet files.
* Unlike the CSV or JSON readers that have to load the entire file and then infer the schema, the parquet reader can "read" the schema very quickly because it's reading that schema from the metadata.
* It is possible to read only a subset of the columns.

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.