# Read Partitioned Data in Parquet Format and Write Updates #
This notebook demonstrates reading partitioned data, filtering by partition, and writing updated partitions back to the storage layer.

This notebook depends on the output data created by the notebook `creditcard_hash_anonymize.ipynb`.


In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.\
        builder.\
        appName("hash_anon_partitions").\
        master("spark://spark-master:7077").\
        config("spark.executor.memory", "512m").\
        config("spark.eventLog.enabled", "true").\
        config("spark.eventLog.dir", "file:///opt/workspace/events").\
        getOrCreate()      

22/02/23 08:20:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


# Read Parquet Data #

### Configuration Setting to Allow Writing Partitions Back ###
Without this configuration option, writing an updated partition back fails with

```
java.io.FileNotFoundException ...   
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
```

and *all the data in the updated partition is deleted*.

In [3]:
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
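
The difference between the two overwrite modes can be sketched without Spark. The toy model below is an assumption-laden illustration, not Spark's implementation: a "table" is a plain dict keyed by partition, and the hypothetical `overwrite` function mimics the documented semantics as we understand them (`static` clears existing partitions before writing, `dynamic` replaces only the partitions that receive new rows).

```python
# Toy model only: a "table" maps a partition key to its list of rows.
# `overwrite` is a hypothetical helper, not part of any Spark API.

def overwrite(table, new_rows, mode):
    """Return the table contents after overwriting with `new_rows`."""
    if mode == "static":
        # Static mode: existing partitions are cleared before writing.
        result = {}
    elif mode == "dynamic":
        # Dynamic mode: partitions that receive no new rows are kept.
        result = dict(table)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Group the incoming rows by partition; a touched partition is replaced whole.
    incoming = {}
    for row in new_rows:
        key = (row["issue_year"], row["issue_month"])
        incoming.setdefault(key, []).append(row)
    result.update(incoming)
    return result

table = {
    ("2012", "08"): [{"issue_year": "2012", "issue_month": "08", "credit_limit": 100}],
    ("2013", "08"): [{"issue_year": "2013", "issue_month": "08", "credit_limit": 200}],
}
update = [{"issue_year": "2013", "issue_month": "08", "credit_limit": 2000}]

static_result = overwrite(table, update, "static")
dynamic_result = overwrite(table, update, "dynamic")
print(sorted(static_result))   # partitions left after a static overwrite
print(sorted(dynamic_result))  # partitions left after a dynamic overwrite
```

In this sketch the 2012 partition survives only in `dynamic` mode, which is the behaviour we rely on when writing a single year back.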

### Demonstrate Partition Filtering ###
Without a partition filter, the scan covers all 132 partitions:

In [4]:
# explain() prints the plan and returns None, so the result is not assigned
spark.read.parquet("/opt/workspace/dataout/credit-cards/").explain()

== Physical Plan ==
*(1) FileScan parquet [card_type#0,bank#1,card_number#2,card_holder#3,expiry_date#4,billing_date#5,credit_limit#6,issue_year#7,issue_month#8] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/opt/workspace/dataout/credit-cards], PartitionCount: 132, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<card_type:string,bank:string,card_number:string,card_holder:string,expiry_date:string,bill...


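Reading only the `issue_year=2013` directory reduces the scan to 12 partitions (one per month). Note that `PartitionFilters` is still empty: the reduction comes from the path that is read, not from a filter expression, and the `issue_year` column no longer appears in the plan.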

In [5]:
# explain() prints the plan and returns None, so the result is not assigned
spark.read.parquet("/opt/workspace/dataout/credit-cards/issue_year=2013").explain()

== Physical Plan ==
*(1) FileScan parquet [card_type#18,bank#19,card_number#20,card_holder#21,expiry_date#22,billing_date#23,credit_limit#24,issue_month#25] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/opt/workspace/dataout/credit-cards/issue_year=2013], PartitionCount: 12, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<card_type:string,bank:string,card_number:string,card_holder:string,expiry_date:string,bill...
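
The partition counts in the two plans follow directly from the directory layout. The sketch below rebuilds a layout with the same shape in plain Python and counts the leaf directories under each read path. The year range 2010 to 2020 is an assumption chosen only so that 11 years x 12 months gives 132 partitions; the real years are not stated in this notebook, and `dirs_under` is a hypothetical helper.

```python
from itertools import product

# Assumed layout: 11 issue years x 12 months = 132 partition directories.
# The year range is illustrative; only the counts are taken from the plans above.
root = "/opt/workspace/dataout/credit-cards"
partition_dirs = [
    f"{root}/issue_year={year}/issue_month={month:02d}"
    for year, month in product(range(2010, 2021), range(1, 13))
]

def dirs_under(prefix, dirs):
    """Keep only the partition directories located below `prefix`."""
    prefix = prefix.rstrip("/") + "/"
    return [d for d in dirs if d.startswith(prefix)]

all_dirs = dirs_under(root, partition_dirs)
year_2013 = dirs_under(f"{root}/issue_year=2013", partition_dirs)
print(len(all_dirs), len(year_2013))
```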


### Read in Partition Data for year 2013 ###

In [6]:
# Use the basePath option to stop the partition column from being dropped
df_read = spark.read.option("basePath", "/opt/workspace/dataout/credit-cards/").parquet("/opt/workspace/dataout/credit-cards/issue_year=2013")
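
Why `basePath` matters can be illustrated with a small path-parsing sketch. It assumes, as a simplification, that partition columns are taken from the `key=value` directory segments found between the base path and the data directory; `partition_columns` is a hypothetical helper, not a Spark function.

```python
from pathlib import PurePosixPath

def partition_columns(file_dir, base_path):
    """Parse the key=value segments found between base_path and file_dir."""
    relative = PurePosixPath(file_dir).relative_to(PurePosixPath(base_path))
    columns = {}
    for segment in relative.parts:
        if "=" in segment:
            key, value = segment.split("=", 1)
            columns[key] = value
    return columns

root = "/opt/workspace/dataout/credit-cards"
leaf = f"{root}/issue_year=2013/issue_month=08"

# Base is the year directory itself: issue_year lies above it and is lost.
without_base = partition_columns(leaf, f"{root}/issue_year=2013")
# Base is the table root: both partition columns are recovered.
with_base = partition_columns(leaf, root)
print(without_base)
print(with_base)
```

With the table root as the base, both `issue_year` and `issue_month` are recovered, which matches the schema we get when the `basePath` option is set.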

# Save updates to a Partition #

In [7]:
df_read.show(5)

+-------------------+----------------+--------------------+--------------------+-----------+------------+------------+----------+-----------+
|          card_type|            bank|         card_number|         card_holder|expiry_date|billing_date|credit_limit|issue_year|issue_month|
+-------------------+----------------+--------------------+--------------------+-----------+------------+------------+----------+-----------+
|   American Express|American Express|fa97545a484641d8a...|1b5f1b515cab9b586...|    08/2027|           1|      104100|      2013|          8|
|               Visa|  First National|70af4edc8e5c207e3...|a2cde7687a0e544a8...|    08/2015|          21|       55300|      2013|          8|
|        Master Card|           Chase|73ab033eaade21c1c...|e8c999919fa7720d5...|    08/2028|           9|      171900|      2013|          8|
|           Discover|        Discover|87deb4b3c75c3b79f...|ac7ca170a090c60a1...|    08/2030|          11|      137200|      2013|          8|
|Japan

NOTE: we have lost the leading 0 of the month (`8` instead of `08`). This is because Spark infers the type of the partition columns from the directory names, so `issue_month` is read as an integer, not a string.

In [8]:
df_read.printSchema()

root
 |-- card_type: string (nullable = true)
 |-- bank: string (nullable = true)
 |-- card_number: string (nullable = true)
 |-- card_holder: string (nullable = true)
 |-- expiry_date: string (nullable = true)
 |-- billing_date: integer (nullable = true)
 |-- credit_limit: integer (nullable = true)
 |-- issue_year: integer (nullable = true)
 |-- issue_month: integer (nullable = true)



Try again with `spark.sql.sources.partitionColumnTypeInference.enabled` set to `False`:

In [9]:
spark.conf.set('spark.sql.sources.partitionColumnTypeInference.enabled', False)

In [10]:
df_read = spark.read.option("basePath", "/opt/workspace/dataout/credit-cards/").parquet("/opt/workspace/dataout/credit-cards/issue_year=2013")

In [11]:
df_read.printSchema()

root
 |-- card_type: string (nullable = true)
 |-- bank: string (nullable = true)
 |-- card_number: string (nullable = true)
 |-- card_holder: string (nullable = true)
 |-- expiry_date: string (nullable = true)
 |-- billing_date: integer (nullable = true)
 |-- credit_limit: integer (nullable = true)
 |-- issue_year: string (nullable = true)
 |-- issue_month: string (nullable = true)



In [12]:
df_read.show(5)

+-------------------+----------------+--------------------+--------------------+-----------+------------+------------+----------+-----------+
|          card_type|            bank|         card_number|         card_holder|expiry_date|billing_date|credit_limit|issue_year|issue_month|
+-------------------+----------------+--------------------+--------------------+-----------+------------+------------+----------+-----------+
|   American Express|American Express|fa97545a484641d8a...|1b5f1b515cab9b586...|    08/2027|           1|      104100|      2013|         08|
|               Visa|  First National|70af4edc8e5c207e3...|a2cde7687a0e544a8...|    08/2015|          21|       55300|      2013|         08|
|        Master Card|           Chase|73ab033eaade21c1c...|e8c999919fa7720d5...|    08/2028|           9|      171900|      2013|         08|
|           Discover|        Discover|87deb4b3c75c3b79f...|ac7ca170a090c60a1...|    08/2030|          11|      137200|      2013|         08|
|Japan

The leading zero of the `issue_month` string is now retained.
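
The effect of type inference on the directory value `08` can be reproduced in plain Python. This is a simplified sketch that models only the integer case; `infer_partition_value` is a hypothetical helper and does not reflect the full set of types Spark may infer.

```python
def infer_partition_value(raw, infer_types=True):
    """Return a partition value, optionally converting numeric strings to int."""
    if infer_types:
        try:
            return int(raw)   # "08" becomes 8, so the leading zero is gone
        except ValueError:
            return raw        # non-numeric values stay strings
    return raw                # inference disabled: keep the directory text as-is

inferred = infer_partition_value("08")
kept = infer_partition_value("08", infer_types=False)
print(repr(inferred), repr(kept))
```

Keeping the value as a string also matters for the write step: the partition directory name is built from the column value, so a string `08` maps back to the existing `issue_month=08` directory.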

### Change the partition data and write it back ###

In [13]:
# Increase the credit_limit by a factor of 10
df_update = df_read.withColumn("credit_limit", df_read.credit_limit*10)

In [14]:
df_update.show(5)


+-------------------+----------------+--------------------+--------------------+-----------+------------+------------+----------+-----------+
|          card_type|            bank|         card_number|         card_holder|expiry_date|billing_date|credit_limit|issue_year|issue_month|
+-------------------+----------------+--------------------+--------------------+-----------+------------+------------+----------+-----------+
|   American Express|American Express|fa97545a484641d8a...|1b5f1b515cab9b586...|    08/2027|           1|     1041000|      2013|         08|
|               Visa|  First National|70af4edc8e5c207e3...|a2cde7687a0e544a8...|    08/2015|          21|      553000|      2013|         08|
|        Master Card|           Chase|73ab033eaade21c1c...|e8c999919fa7720d5...|    08/2028|           9|     1719000|      2013|         08|
|           Discover|        Discover|87deb4b3c75c3b79f...|ac7ca170a090c60a1...|    08/2030|          11|     1372000|      2013|         08|
|Japan

In [15]:
df_update.write.format("parquet")\
                .mode("overwrite")\
                .partitionBy('issue_year', 'issue_month')\
                .save("/opt/workspace/dataout/credit-cards/")

In [16]:
spark.stop()

### Check the Updates to the Partition ###

In [17]:
spark = SparkSession.\
        builder.\
        appName("hash_anon_readupdate").\
        master("spark://spark-master:7077").\
        config("spark.executor.memory", "512m").\
        config("spark.eventLog.enabled", "true").\
        config("spark.eventLog.dir", "file:///opt/workspace/events").\
        getOrCreate()    

In [18]:
df_read = spark.read.option("basePath", "/opt/workspace/dataout/credit-cards/").parquet("/opt/workspace/dataout/credit-cards/issue_year=2013")
df_read.show(5)


+-------------------+----------------+--------------------+--------------------+-----------+------------+------------+----------+-----------+
|          card_type|            bank|         card_number|         card_holder|expiry_date|billing_date|credit_limit|issue_year|issue_month|
+-------------------+----------------+--------------------+--------------------+-----------+------------+------------+----------+-----------+
|   American Express|American Express|fa97545a484641d8a...|1b5f1b515cab9b586...|    08/2027|           1|     1041000|      2013|          8|
|               Visa|  First National|70af4edc8e5c207e3...|a2cde7687a0e544a8...|    08/2015|          21|      553000|      2013|          8|
|        Master Card|           Chase|73ab033eaade21c1c...|e8c999919fa7720d5...|    08/2028|           9|     1719000|      2013|          8|
|           Discover|        Discover|87deb4b3c75c3b79f...|ac7ca170a090c60a1...|    08/2030|          11|     1372000|      2013|          8|
|Japan

The credit limits are now ten times their original values. `issue_month` is shown as `8` again because this new session uses the default partition column type inference.

In [19]:
spark.stop()