<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Reading Data with Spark - Parquet Files

**Technical Accomplishments:**
- Introduce the Parquet file format.
- Read data from:
  - Parquet files without a Schema.
  - Parquet files with a Schema.

## Getting Started

Let's start by importing libraries and creating some useful variables.

In [1]:
%load_ext autotime

import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io

s3 = boto3.client('s3')
baseUri = "s3a://quantia-master/training/"

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate()
)
qcutils.init_spark_session(spark)

spark

## Reading from Parquet Files

[Apache Parquet](https://parquet.apache.org/) is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

### About Parquet Files
* Free & Open Source.
* Increased query performance over row-based data stores.
* Provides efficient data compression.
* Designed for performance on large data sets.
* Supports limited schema evolution.
* Is a splittable "file format".
* A <a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS" target="_blank">Column-Oriented</a> data store

**Row Format** 

| ID |  Name | Score |
|:--:|:-----:|:-----:|
| 1  | john  | 4.1   |
| 2  | mike  | 3.5   |
| 3  | sally | 6.4   |

**Columnar View**

```
ID: 1, 2, 3
Name: john, mike, sally
Score: 4.1, 3.5, 6.4
```
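
The difference between the two layouts can be sketched in plain Python. This is only an illustration of the idea, not how Parquet arranges bytes on disk; the names `rows`, `to_columns`, and `columns` are made up for this example.

```python
# Row-oriented layout: one tuple per record (same data as the table above).
rows = [
    (1, "john", 4.1),
    (2, "mike", 3.5),
    (3, "sally", 6.4),
]
column_names = ["ID", "Name", "Score"]


def to_columns(names, records):
    """Pivot row-oriented records into a dict holding one list per column."""
    return {name: [record[i] for record in records] for i, name in enumerate(names)}


columns = to_columns(column_names, rows)

# Row layout: every full record is visited, even though only Score is needed.
total_from_rows = sum(record[2] for record in rows)

# Columnar layout: only the Score values are touched.
total_from_columns = sum(columns["Score"])

print(columns["Name"])
print(total_from_rows == total_from_columns)
```

Both totals are identical; what changes is how much unrelated data has to be visited to compute them.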

**See also**:
* [Apache Parquet](https://parquet.apache.org)
* [Apache Parquet on Wikipedia](https://en.wikipedia.org/wiki/Apache_Parquet)

### Data Source

The data for this example shows the traffic to various articles on Wikipedia (<a href="https://dumps.wikimedia.org/other/pagecounts-raw" target="_blank">23 MB</a> from Wikipedia).

In [None]:
qcutils.list_s3_bucket_objects(limit=10)

Unlike our CSV and JSON examples, the Parquet "file" is actually a directory of 11 files: 8 hold the bulk of the data and the other 3 hold metadata.
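
As a rough sketch, the split between data files and the rest can be made from the file names alone. Spark conventionally writes data files whose names start with `part-` and marker files whose names start with `_` (for example `_SUCCESS`). The listing below is hypothetical, as is the helper `split_parquet_listing`; the real names in the bucket may differ.

```python
def split_parquet_listing(file_names):
    """Separate data part files from marker/metadata files by name prefix."""
    data_files = [name for name in file_names if name.startswith("part-")]
    other_files = [name for name in file_names if not name.startswith("part-")]
    return data_files, other_files


# Hypothetical directory listing: 8 data part files plus 3 marker files.
listing = ["_SUCCESS", "_committed_0001", "_started_0001"] + [
    f"part-{i:05d}.snappy.parquet" for i in range(8)
]

data_files, other_files = split_parquet_listing(listing)
print(len(data_files), len(other_files))
```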

### Read in the Parquet Files

To read in these files, we will specify the location of the Parquet directory.

In [2]:
parquetFile = baseUri + "wikipedia_pageviews_by_second.parquet"

(spark.read              # The DataFrameReader
  .parquet(parquetFile)  # Creates a DataFrame from Parquet after reading in the file
  .printSchema()
)

root
 |-- timestamp: string (nullable = true)
 |-- site: string (nullable = true)
 |-- requests: integer (nullable = true)

time: 4.84 s


In [4]:
parquetFile = baseUri + "wikipedia_pageviews_by_second.parquet"

(spark.read              # The DataFrameReader
  .parquet(parquetFile)  # Creates a DataFrame from Parquet after reading in the file
)

timestamp,site,requests
2015-03-22T14:13:34,mobile,1425
2015-03-22T14:23:18,desktop,2534
2015-03-22T14:36:47,desktop,2444
2015-03-22T14:38:39,mobile,1488
2015-03-22T14:57:11,mobile,1519
2015-03-22T15:03:18,mobile,1559
2015-03-22T15:16:47,mobile,1510
2015-03-22T15:45:03,desktop,2673
2015-03-22T15:58:32,desktop,2463
2015-03-22T16:06:11,desktop,2525


time: 3.21 s


### Review: Reading from Parquet Files
* We do not need to specify the schema - the column names and data types are stored in the parquet files.
* Only one job is required to **read** that schema from the parquet file's metadata.
* Unlike the CSV or JSON readers that have to load the entire file and then infer the schema, the parquet reader can "read" the schema very quickly because it's reading that schema from the metadata.
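
To see why the schema is cheap to read, recall that a Parquet file begins and ends with the magic bytes `PAR1`, and the 4 bytes before the trailing magic store the length of the footer metadata. A reader can therefore jump to the end of the file instead of scanning the data. The sketch below locates that footer in a synthetic byte string. The function name `footer_metadata_range` and the placeholder metadata are assumptions for illustration, since real Parquet metadata is Thrift-encoded.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def footer_metadata_range(file_bytes):
    """Return (start, end) byte offsets of the footer metadata."""
    if (
        len(file_bytes) < 12
        or file_bytes[:4] != PARQUET_MAGIC
        or file_bytes[-4:] != PARQUET_MAGIC
    ):
        raise ValueError("not a Parquet file: magic bytes missing")
    # The 4 bytes before the trailing magic hold the metadata length (little-endian).
    (metadata_length,) = struct.unpack("<I", file_bytes[-8:-4])
    end = len(file_bytes) - 8
    start = end - metadata_length
    if start < 4:
        raise ValueError("corrupt footer length")
    return start, end


# Synthetic stand-in for a real file. The "metadata" is a placeholder string,
# not real Thrift-encoded Parquet metadata.
fake_data = b"x" * 1000
fake_metadata = b"timestamp:string,site:string,requests:int"
fake_file = (
    PARQUET_MAGIC
    + fake_data
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + PARQUET_MAGIC
)

start, end = footer_metadata_range(fake_file)
print(fake_file[start:end] == fake_metadata)
```

Only the last few bytes and the footer itself are inspected, regardless of how large the data section is.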

### Read in the Parquet Files w/Schema

If we want to avoid the extra job entirely, we can, again, specify the schema, even for Parquet files:

***WARNING*** *Providing a schema may avoid this one-time hit to determine the `DataFrame`'s schema.*  
*However, if you specify the wrong schema, it will conflict with the true schema and result in an analysis exception at runtime.*

In [6]:
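# Import only the types needed to declare the schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# nullable=False is requested here, but Spark reports columns read from
# Parquet files as nullable (see the printSchema output below).
parquetSchema = StructType(
  [
    StructField("timestamp", StringType(), False),
    StructField("site", StringType(), False),
    StructField("requests", IntegerType(), False)
  ]
)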

(spark.read               # The DataFrameReader
  .schema(parquetSchema)  # Use the specified schema
  .parquet(parquetFile)   # Creates a DataFrame from Parquet after reading in the file
  .printSchema()
)

root
 |-- timestamp: string (nullable = true)
 |-- site: string (nullable = true)
 |-- requests: integer (nullable = true)

time: 283 ms
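
Because a wrong schema only surfaces as an error later, it can help to compare the schema you plan to supply with the one stored in the files. The following is a minimal sketch on plain `(column_name, type_name)` pairs, the same shape that `DataFrame.dtypes` returns. `schema_mismatches` is a hypothetical helper, not part of Spark.

```python
def schema_mismatches(expected, actual):
    """Compare two lists of (column_name, type_name) pairs and describe differences."""
    actual_types = dict(actual)
    problems = []
    for name, type_name in expected:
        if name not in actual_types:
            problems.append(f"missing column: {name}")
        elif actual_types[name] != type_name:
            problems.append(
                f"type mismatch for {name}: "
                f"expected {type_name}, found {actual_types[name]}"
            )
    return problems


# Schema stored in the files, written as (name, type) pairs.
stored = [("timestamp", "string"), ("site", "string"), ("requests", "int")]

# A schema that matches, and one that wrongly declares requests as a string.
good = [("timestamp", "string"), ("site", "string"), ("requests", "int")]
bad = [("timestamp", "string"), ("site", "string"), ("requests", "string")]

print(schema_mismatches(good, stored))
print(schema_mismatches(bad, stored))
```

An empty list means the supplied schema agrees with the stored one on names and types.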


Let's have a look at the execution time for reading 20 rows:

In [7]:
(spark.read               # The DataFrameReader
  .schema(parquetSchema)  # Use the specified schema
  .parquet(parquetFile)   # Creates a DataFrame from Parquet after reading in the file
)

timestamp,site,requests
2015-03-22T14:13:34,mobile,1425
2015-03-22T14:23:18,desktop,2534
2015-03-22T14:36:47,desktop,2444
2015-03-22T14:38:39,mobile,1488
2015-03-22T14:57:11,mobile,1519
2015-03-22T15:03:18,mobile,1559
2015-03-22T15:16:47,mobile,1510
2015-03-22T15:45:03,desktop,2673
2015-03-22T15:58:32,desktop,2463
2015-03-22T16:06:11,desktop,2525


time: 1.21 s


### A note on the time to count when the source is a Parquet file

In [8]:
spark.read.parquet(parquetFile).count()

7200000

time: 1.28 s


It took approximately 1 sec. If you remember, the same task took around 3 sec when the source was a file. **It is almost 3 times faster**.

In [9]:
df = spark.read.parquet(parquetFile)
df.select(df.requests).groupBy().sum()

sum(requests)
13342978934


time: 2.38 s


It took approximately 1.5 sec (timings vary between runs; the run shown above took 2.38 s). If you remember, the same task took around 9 sec when the source was a file. **It is about 6 times faster**.

#### Discussion
* Why is counting 3x faster?
* Why is summing a column even faster (6x)?
* What would you expect if there were 100 columns instead of 3?
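
One back-of-envelope way to reason about the last question is to count how many values each layout has to touch in order to sum a single column. This is a deliberately simplified model, not a measurement: it ignores compression, encoding, metadata, and I/O overhead, and `values_touched` is a made-up helper.

```python
def values_touched(num_rows, num_columns, columns_needed, columnar):
    """Count the values a reader must visit under a simplified cost model."""
    if columnar:
        # A columnar reader can skip the columns it does not need.
        return num_rows * columns_needed
    # A row-based reader has to go through every field of every record.
    return num_rows * num_columns


num_rows = 7_200_000  # row count reported by count() above

ratios = {}
for num_columns in (3, 100):
    row_cost = values_touched(num_rows, num_columns, 1, columnar=False)
    columnar_cost = values_touched(num_rows, num_columns, 1, columnar=True)
    ratios[num_columns] = row_cost // columnar_cost

print(ratios)
```

Under this model the advantage of the columnar layout grows with the number of columns that a query does not need.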

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.