# Apache Parquet Introduction

- Columnar file format
- Supports query optimizations such as column pruning and predicate pushdown, so a query reads only the data it needs
- Far more efficient than CSV or JSON for analytical workloads
- Combines efficient compression with encodings suited to large volumes of complex data
- Spark SQL supports both reading and writing Parquet files
- Stores the schema alongside the data, so the schema of the original data is preserved automatically
- Often quoted as reducing data storage by about 75% on average; the actual saving depends on the data and the compression codec
- Spark ships with Parquet support built in, so we don't need to add any dependency libraries.

## Apache Parquet Advantages:

- Reduces I/O operations.
- Reads only the columns a query needs.
- Consumes less space.
- Supports type-specific encoding.
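
The advantages above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Parquet's actual on-disk format: the helper names `to_columns` and `run_length_encode` are made up for this example, and Parquet's real encodings (dictionary encoding and a run-length/bit-packing hybrid, among others) are more elaborate.

```python
from itertools import groupby

# Rows as a row-oriented format (CSV, JSON lines) would store them.
rows = [
    ("James", "M", 3000),
    ("Michael", "M", 4000),
    ("Robert", "M", 4000),
    ("Maria", "F", 4000),
    ("Jen", "F", -1),
]


def to_columns(rows, names):
    """Pivot row tuples into one list per column (a toy columnar layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeated values into [value, count] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]


table = to_columns(rows, ("firstname", "gender", "salary"))

# Column pruning: a query on salary touches one list instead of every row.
total_salary = sum(table["salary"])

# Values of one type stored side by side compress well when they repeat.
encoded_gender = run_length_encode(table["gender"])
```

Here the five gender values collapse into two runs, which is the kind of saving a type-aware encoding can make once a column's values are stored together.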

## Example

In [1]:
__name__

'__main__'

In [2]:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [None]:
# Sample rows; the trailing spaces in some first names are part of the data.
data = (("James ", "", "Smith", "36636", "M", 3000),
        ("Michael ", "Rose", "", "40288", "M", 4000),
        ("Robert ", "", "Williams", "42114", "M", 4000),
        ("Maria ", "Anne", "Jones", "39192", "F", 4000),
        ("Jen", "Mary", "Brown", "", "F", -1))

# Column names, in the same order as the fields of each row.
columns = ("firstname", "middlename", "lastname", "dob", "gender", "salary")

In [3]:
df = spark.createDataFrame(data, columns)

In [4]:
df.show()

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+



## Spark Write DataFrame to Parquet file format
- Writing a Spark DataFrame to Parquet preserves the column names and data types
- All columns are automatically converted to be nullable for compatibility reasons

In [5]:
df.write.parquet("data/people.parquet")

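AnalysisException: path file:/D:/22-Trngs/2-Confirmed/5-PySpark-56-hours-Geet/GH/Labs/Day3-Day4/data/people.parquet already exists.

- The write fails because the default save mode ("errorifexists") refuses to write to a path that already exists; the directory was presumably left over from an earlier run of this cell.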
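
One way to avoid the "already exists" error when re-running the cell is to pass an explicit save mode to the writer. The sketch below wraps that in `write_parquet`, a hypothetical helper that is not part of PySpark; it assumes `df` is a Spark DataFrame such as the one created above. Whether "overwrite" is the right default depends on the use case, since it replaces any data already at the path.

```python
# Save modes accepted by DataFrameWriter.mode().
SAVE_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def write_parquet(df, path, mode="overwrite"):
    """Write `df` to `path` as Parquet with an explicit save mode.

    Hypothetical convenience wrapper, not a PySpark API.
    """
    if mode.lower() not in SAVE_MODES:
        raise ValueError(f"unsupported save mode: {mode!r}")
    # "overwrite" replaces whatever is at `path`; "append" adds new part
    # files; "ignore" silently skips the write if the path exists.
    df.write.mode(mode).parquet(path)
```

With this helper, `write_parquet(df, "data/people.parquet")` can be run repeatedly without raising the exception shown above.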

In [7]:
! dir "data/people.parquet"

 Volume in drive D is Windows
 Volume Serial Number is 949A-70EE

 Directory of D:\22-Trngs\2-Confirmed\5-PySpark-56-hours-Geet\GH\Labs\Day3-Day4\data\people.parquet

07-05-2022  13:59    <DIR>          .
07-05-2022  13:59    <DIR>          ..
07-05-2022  13:59                24 .part-00000-c2417797-c2db-42af-9be4-f2b02c94ddb3-c000.snappy.parquet.crc
07-05-2022  13:59                24 .part-00001-c2417797-c2db-42af-9be4-f2b02c94ddb3-c000.snappy.parquet.crc
07-05-2022  13:59                24 .part-00002-c2417797-c2db-42af-9be4-f2b02c94ddb3-c000.snappy.parquet.crc
07-05-2022  13:59                24 .part-00003-c2417797-c2db-42af-9be4-f2b02c94ddb3-c000.snappy.parquet.crc
07-05-2022  13:59                 8 ._SUCCESS.crc
07-05-2022  13:59             1,683 part-00000-c2417797-c2db-42af-9be4-f2b02c94ddb3-c000.snappy.parquet
07-05-2022  13:59             1,690 part-00001-c2417797-c2db-42af-9be4-f2b02c94ddb3-c000.snappy.parquet
07-05-2022  13:59             1,710 part-00002-c2417797-c2db-4

## Spark Read Parquet file into DataFrame

In [8]:
parqDF = spark.read.parquet("data/people.parquet")
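
Because Parquet is columnar, selecting a subset of columns right after reading lets Spark avoid loading the others. A minimal sketch, assuming `spark` is the SparkSession created earlier; `read_columns` is a hypothetical helper name used only for illustration.

```python
def read_columns(spark, path, columns):
    """Load a Parquet dataset and keep only the named columns."""
    # select() narrows the DataFrame to the requested columns, which allows
    # Spark to read just those column chunks from the Parquet files.
    return spark.read.parquet(path).select(*columns)
```

For example, `read_columns(spark, "data/people.parquet", ["firstname", "salary"])` would return a DataFrame with only those two columns.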

## Append to existing Parquet file

In [9]:
df.write.mode('append').parquet("data/people.parquet")

## Using SQL queries on Parquet
- We can also create a temporary view over the Parquet data and query it with Spark SQL statements
- The temporary view remains available for as long as the SparkSession that created it is active.

In [10]:
parqDF.createOrReplaceTempView("ParquetTable")
parqSQL = spark.sql("select * from ParquetTable where salary >= 4000")

- Without partitioning, the predicate above makes Spark scan every file in the dataset, a performance bottleneck comparable to a full table scan in a traditional database
- Partitioning the data on the columns used in filters improves performance.

## Spark parquet partition – Improving performance
- Partitioning is a feature of many databases and data processing frameworks, and it is key to making jobs work at scale
- We can partition Parquet output with the partitionBy() method of the Spark DataFrame writer.

In [11]:
df.write.partitionBy("gender","salary").parquet("data/people2.parquet")

- partitionBy() creates a directory hierarchy with one directory per distinct value of each partition column
- Because gender is listed first and salary second, each gender directory contains one salary directory per salary value.
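
The resulting layout can be mimicked in plain Python to see which directory each row would land in. This is a sketch of the naming scheme only (Spark does the actual writing); `partition_dir` is a hypothetical helper, and the rows are a trimmed copy of the sample data.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_dir(root, row, partition_cols):
    """Build the key=value directory a row would be written under."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return str(PurePosixPath(root, *parts))


rows = [
    {"firstname": "James", "gender": "M", "salary": 3000},
    {"firstname": "Michael", "gender": "M", "salary": 4000},
    {"firstname": "Robert", "gender": "M", "salary": 4000},
    {"firstname": "Maria", "gender": "F", "salary": 4000},
    {"firstname": "Jen", "gender": "F", "salary": -1},
]

# Group first names by the directory their row belongs to.
layout = defaultdict(list)
for row in rows:
    directory = partition_dir("data/people2.parquet", row, ["gender", "salary"])
    layout[directory].append(row["firstname"])

for directory in sorted(layout):
    print(directory, layout[directory])
```

Five rows with two genders and three distinct salaries yield four directories, because only the gender and salary combinations that actually occur get a directory.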

In [12]:
! dir "data/people2.parquet\gender=F"
! dir "data/people2.parquet\gender=M"

 Volume in drive D is Windows
 Volume Serial Number is 949A-70EE

 Directory of D:\22-Trngs\2-Confirmed\5-PySpark-56-hours-Geet\GH\Labs\Day3-Day4\data\people2.parquet\gender=F

13-05-2022  09:40    <DIR>          .
13-05-2022  09:40    <DIR>          ..
13-05-2022  09:40    <DIR>          salary=-1
13-05-2022  09:40    <DIR>          salary=4000
               0 File(s)              0 bytes
               4 Dir(s)  1,789,914,001,408 bytes free
 Volume in drive D is Windows
 Volume Serial Number is 949A-70EE

 Directory of D:\22-Trngs\2-Confirmed\5-PySpark-56-hours-Geet\GH\Labs\Day3-Day4\data\people2.parquet\gender=M

13-05-2022  09:40    <DIR>          .
13-05-2022  09:40    <DIR>          ..
13-05-2022  09:40    <DIR>          salary=3000
13-05-2022  09:40    <DIR>          salary=4000
               0 File(s)              0 bytes
               4 Dir(s)  1,789,914,001,408 bytes free


## Read from partitioned data

In [13]:
parqDF = spark.read.parquet("data/people2.parquet")
parqDF.createOrReplaceTempView("Table2")
# Both filter columns are partition columns, so Spark can skip non-matching directories.
df = spark.sql("select * from Table2 where gender='M' and salary >= 4000")

In [14]:
df.show()

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|  Robert |          |Williams|42114|     M|  4000|
| Michael |      Rose|        |40288|     M|  4000|
+---------+----------+--------+-----+------+------+



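- On large datasets this query can run significantly faster than the same query on unpartitioned data, because Spark reads only the partition directories that match the filter (partition pruning)
- Here it skips the gender=F directory entirely and, under gender=M, reads only the salary directories that satisfy salary >= 4000.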
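
The pruning described above can be sketched in plain Python: the partition values are recovered from the directory names, so non-matching directories can be discarded before any file is opened. Spark performs this step inside its query planner; `parse_partition` and `prune` are hypothetical names used only to illustrate the idea.

```python
def parse_partition(path):
    """Extract {column: value} from the key=value segments of a path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def prune(paths, predicate):
    """Keep only the directories whose partition values satisfy predicate."""
    return [p for p in paths if predicate(parse_partition(p))]


partitions = [
    "data/people2.parquet/gender=F/salary=-1",
    "data/people2.parquet/gender=F/salary=4000",
    "data/people2.parquet/gender=M/salary=3000",
    "data/people2.parquet/gender=M/salary=4000",
]

# Same condition as the SQL query: gender='M' and salary >= 4000.
# Values parsed from directory names are strings, so salary is converted.
to_read = prune(
    partitions,
    lambda p: p["gender"] == "M" and int(p["salary"]) >= 4000,
)
```

Only one of the four directories survives the filter, so three quarters of the directories are never read in this example.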

## Spark Read a specific Parquet partition

In [15]:
# Read only the gender=M partition. Because the path points inside the
# partition directory, the gender column is not part of the result.
parqDF = spark.read.parquet("data/people2.parquet/gender=M")

In [16]:
parqDF.show()

+---------+----------+--------+-----+------+
|firstname|middlename|lastname|  dob|salary|
+---------+----------+--------+-----+------+
|  Robert |          |Williams|42114|  4000|
| Michael |      Rose|        |40288|  4000|
|   James |          |   Smith|36636|  3000|
+---------+----------+--------+-----+------+
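
Note that the gender column is missing from this output, since Spark infers partition columns only from directories below the path it is given. If the column is needed, Spark's "basePath" read option tells partition discovery where the dataset root is. A minimal sketch, assuming `spark` is the SparkSession created earlier; `read_partition` is a hypothetical helper name.

```python
def read_partition(spark, base_path, partition_subdir):
    """Read one partition directory while keeping its partition columns."""
    # basePath marks the dataset root, so key=value directories between the
    # root and the path being read are still turned into columns.
    return (
        spark.read
        .option("basePath", base_path)
        .parquet(f"{base_path}/{partition_subdir}")
    )
```

Calling `read_partition(spark, "data/people2.parquet", "gender=M")` should then return the same three rows with gender included as a column.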

