# Partitioning

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [3]:
from pyspark.storagelevel import StorageLevel
from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

In [4]:
spark = (
    SparkSession
    .builder
    .config("spark.driver.memory", "10g")
    .master("local[*]")
    .appName("6_0_partitioning")
    .getOrCreate()
)
sc = spark.sparkContext
sc.setLogLevel("ERROR")

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/11 15:34:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [20]:
listening_activity_file = "../data/partitioning/raw/Spotify_Listening_Activity.csv"
df_listening_actv = spark.read.csv(listening_activity_file, header=True, inferSchema=True)
df_listening_actv = (
    df_listening_actv
    .withColumnRenamed("listen_date", "listen_time")
    .withColumn("listen_date", F.to_date("listen_time", "yyyy-MM-dd HH:mm:ss.SSSSSS"))
    .withColumn("listen_hour", F.hour("listen_time"))
)

df_listening_actv.show(5, False)
df_listening_actv.printSchema()
df_listening_actv.count()

+-----------+-------+--------------------------+---------------+-----------+-----------+
|activity_id|song_id|listen_time               |listen_duration|listen_date|listen_hour|
+-----------+-------+--------------------------+---------------+-----------+-----------+
|1          |12     |2023-06-27 10:15:47.008867|69             |2023-06-27 |10         |
|2          |44     |2023-06-27 10:15:47.008867|300            |2023-06-27 |10         |
|3          |75     |2023-06-27 10:15:47.008867|73             |2023-06-27 |10         |
|4          |48     |2023-06-27 10:15:47.008867|105            |2023-06-27 |10         |
|5          |10     |2023-06-27 10:15:47.008867|229            |2023-06-27 |10         |
+-----------+-------+--------------------------+---------------+-----------+-----------+
only showing top 5 rows

root
 |-- activity_id: integer (nullable = true)
 |-- song_id: integer (nullable = true)
 |-- listen_time: string (nullable = true)
 |-- listen_duration: integer (nullable = 

11779

In [34]:
songs_file = "../data/partitioning/raw/Spotify_Songs.csv"
df_songs = spark.read.csv(songs_file, header=True, inferSchema=True)

df_songs.show(5, False)
df_songs.printSchema()
df_songs.count()

+-------+------+---------+--------------------------+
|song_id|title |artist_id|release_date              |
+-------+------+---------+--------------------------+
|1      |Song_1|2        |2021-10-15 10:15:47.006571|
|2      |Song_2|45       |2020-12-07 10:15:47.006588|
|3      |Song_3|25       |2022-07-11 10:15:47.006591|
|4      |Song_4|25       |2019-03-09 10:15:47.006593|
|5      |Song_5|26       |2019-09-07 10:15:47.006596|
+-------+------+---------+--------------------------+
only showing top 5 rows

root
 |-- song_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- artist_id: integer (nullable = true)
 |-- release_date: string (nullable = true)



100

In [23]:
artists_file = "../data/partitioning/raw/Spotify_Artists.csv"
df_artists = spark.read.csv(artists_file, header=True, inferSchema=True)

df_artists.show(5, False)
df_artists.printSchema()
df_artists.count()

+---------+--------+----------+---------+
|artist_id|name    |genre     |country  |
+---------+--------+----------+---------+
|1        |Artist_1|Electronic|France   |
|2        |Artist_2|Electronic|Australia|
|3        |Artist_3|Jazz      |France   |
|4        |Artist_4|Classical |Australia|
|5        |Artist_5|Hip-Hop   |USA      |
+---------+--------+----------+---------+
only showing top 5 rows

root
 |-- artist_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- country: string (nullable = true)



50

## Partitioning By `listen_date`

Let's say we want to **analyse users' listening behaviour over time**. If we're given the complete dataset with no partitions, Spark has to scan the whole dataset to find a particular date (similar to the bookshelf analogy, where you would scan the entire bookshelf to find a book if it isn't organized). Since our use case needs analysis by date, partitioning (creating folders) on date lets Spark pinpoint the exact folder. This makes the search much cheaper because Spark no longer scans the entire dataset.

In [24]:
# Partitioning listening activity by the listen date
(
    df_listening_actv
    .write
    .partitionBy("listen_date")
    .mode("overwrite")
    .parquet("../data/partitioning/partitioned/listening_activity_pt")
)

                                                                                

In [25]:
# **TODO: Example to show partition pruning 
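
Until a Spark example is filled in here, the idea behind partition pruning can be sketched in plain Python. This is only a model of the folder layout that `partitionBy("listen_date")` produces, not Spark itself: the helpers `write_partitioned` and `files_to_scan` are hypothetical names, and the three rows are taken from the sample output shown in this notebook.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def write_partitioned(root: Path, rows: List[Dict], column: str) -> None:
    """Append each row to a file inside a Hive-style `column=value` folder."""
    for row in rows:
        folder = root / f"{column}={row[column]}"
        folder.mkdir(parents=True, exist_ok=True)
        with open(folder / "part-0.csv", "a") as f:
            f.write(f"{row['activity_id']},{row['song_id']}\n")


def files_to_scan(root: Path, column: str, wanted: Optional[str] = None) -> List[Path]:
    """List the files a reader has to open; with `wanted`, only one folder is visited."""
    value = wanted if wanted is not None else "*"
    return sorted(root.glob(f"{column}={value}/*.csv"))


rows = [
    {"activity_id": 1, "song_id": 12, "listen_date": "2023-06-27"},
    {"activity_id": 2, "song_id": 44, "listen_date": "2023-06-27"},
    {"activity_id": 4456, "song_id": 16, "listen_date": "2023-07-18"},
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_partitioned(root, rows, "listen_date")
    all_files = files_to_scan(root, "listen_date")
    pruned_files = files_to_scan(root, "listen_date", wanted="2023-07-18")
    full_scan_count = len(all_files)
    pruned_scan_count = len(pruned_files)
    pruned_folder = pruned_files[0].parent.name

print(full_scan_count, pruned_scan_count, pruned_folder)
```

Without a date filter the reader opens every folder. With the filter it opens only `listen_date=2023-07-18`, which is the effect a filter on the partition column has when Spark reads the partitioned Parquet output.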

## What Problems Does Partitioning Solve? 
1. `Fast Search (Query Performance)`: Spark processes only the relevant partitions instead of the entire dataset (example above). This greatly reduces I/O and query execution time.
2. `Parallelism / Resource Utilization`: Each core processes one partition at a time, so more partitions allow more parallelism. This does not mean we should force the partition count up; as a rule of thumb, each partition should be around `128MB` in size.
3. [TBR] `Joins`: Use `Pre-Partitioning`; partition early.
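
The `128MB` guideline turns into a partition count with simple arithmetic. The sketch below assumes the data is spread evenly across partitions; `suggested_partitions` is a hypothetical helper, and the dataset sizes are made up for illustration.

```python
import math

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # the 128 MB rule of thumb


def suggested_partitions(total_bytes: int, target_bytes: int = TARGET_PARTITION_BYTES) -> int:
    """Partition count that keeps evenly sized partitions at or below the target."""
    return max(1, math.ceil(total_bytes / target_bytes))


one_gib = 1024 ** 3  # hypothetical dataset sizes below
print(suggested_partitions(one_gib))        # 8
print(suggested_partitions(10 * one_gib))   # 80
print(suggested_partitions(5 * 1024 ** 2))  # 1
```

A small dataset still gets one partition, which is why raising the partition count by force does not help: the extra partitions would just be tiny.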



## Partitioning Examples
1. Single/multi level partitioning
2. Using `repartition`/`coalesce` with `partitionBy` (controlling number of files inside each partition): 
    - `partitionBy` controls how data is laid out in storage: the output directory is organized into subdirectories, one per distinct value of the column(s) passed to `partitionBy`.
    - The number of files inside each of those subdirectories depends on the number of in-memory partitions, which is what `repartition`/`coalesce` controls.
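
The relationship between in-memory partitions and output files can be modelled in plain Python. This sketch assumes each in-memory partition writes one file into every `partitionBy` directory whose value it contains, and it ignores other settings that can split files further. `round_robin` is a hypothetical stand-in for `repartition(n)`, and the dates are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def files_per_directory(partitions: List[List[str]]) -> Dict[str, int]:
    """Count output files per partitionBy value.

    `partitions` holds, for each in-memory partition, the partitionBy value
    of every row in it.
    """
    counts: Dict[str, int] = defaultdict(int)
    for partition in partitions:
        for value in set(partition):  # one file per distinct value in this partition
            counts[value] += 1
    return dict(counts)


def round_robin(values: List[str], n: int) -> List[List[str]]:
    """Spread rows over n in-memory partitions."""
    partitions: List[List[str]] = [[] for _ in range(n)]
    for i, value in enumerate(values):
        partitions[i % n].append(value)
    return partitions


# Hypothetical data: 6 rows for one date, 2 rows for another.
dates = ["2023-06-27"] * 6 + ["2023-07-18"] * 2
layout = files_per_directory(round_robin(dates, 3))
print(layout)
```

With three in-memory partitions, the busy date ends up with three files while the sparse date gets only two, so the number given to `repartition`/`coalesce` is an upper bound per directory rather than an exact count.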

#### 1. Single/multi level partitioning

In [None]:
(
    df_listening_actv
    .write
    .mode("overwrite")
    .partitionBy("listen_date", "listen_hour")
    .parquet("../data/partitioning/partitioned/listening_activity_pt_2")
)

In [None]:
(
    df_listening_actv
    .write
    .mode("overwrite")
    .partitionBy("listen_hour", "listen_date")
    .parquet("../data/partitioning/partitioned/listening_activity_pt_3")
)

#### 2. Using `repartition`/`coalesce` with `partitionBy`

In [None]:
(
    df_listening_actv
    .repartition(3)
    .write
    .mode("overwrite")
    .partitionBy("listen_date")
    .parquet("../data/partitioning/partitioned/listening_activity_pt_4")
)

In [None]:
# The coalesce method reduces the number of partitions in a DataFrame.
# It avoids a full shuffle: instead of redistributing every row, it merges
# existing partitions into fewer ones, which means it can only decrease the number of partitions.

(
    df_listening_actv
    .coalesce(3)
    .write
    .mode("overwrite")
    .partitionBy("listen_date")
    .parquet("../data/partitioning/partitioned/listening_activity_pt_5")
)

## Experimenting With `spark.sql.files.maxPartitionBytes`

In [None]:
spark.stop()
spark = SparkSession.builder.appName("Test spark.sql.files.maxPartitionBytes").getOrCreate()

df_default = spark.read.csv("../data/partitioning/raw/listening_activity.csv", header=True, inferSchema=True)
default_partitions = df_default.rdd.getNumPartitions()
print(f"Number of partitions with default maxPartitionBytes: {default_partitions}")


In [None]:
# Cap each input partition at 1000 bytes (the default is 128 MB)
spark.conf.set("spark.sql.files.maxPartitionBytes", "1000")

df_modified = spark.read.csv("../data/partitioning/raw/listening_activity.csv", header=True, inferSchema=True)
modified_partitions = df_modified.rdd.getNumPartitions()
print(f"Number of partitions with modified maxPartitionBytes: {modified_partitions}")
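
To get a feel for what the two reads should print, the split size can be estimated by hand. The sketch below is a simplified reading of how Spark 3.x sizes file splits for a single splittable file; the exact count can differ by version and file format. The file size and core count are hypothetical, and the `128 MB` and `4 MB` constants are the defaults of `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`.

```python
import math

MAX_PARTITION_BYTES = 128 * 1024 * 1024  # default spark.sql.files.maxPartitionBytes
OPEN_COST_BYTES = 4 * 1024 * 1024        # default spark.sql.files.openCostInBytes


def max_split_bytes(file_bytes: int, cores: int,
                    max_partition_bytes: int = MAX_PARTITION_BYTES,
                    open_cost: int = OPEN_COST_BYTES) -> int:
    """Approximate split size: capped by the setting, floored by the open cost."""
    bytes_per_core = (file_bytes + open_cost) // cores
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def estimated_splits(file_bytes: int, cores: int,
                     max_partition_bytes: int = MAX_PARTITION_BYTES) -> int:
    """Estimate how many splits one file is cut into."""
    split = max_split_bytes(file_bytes, cores, max_partition_bytes)
    return max(1, math.ceil(file_bytes / split))


file_bytes = 1_000_000  # hypothetical CSV size
cores = 8               # hypothetical local[*] parallelism
default_estimate = estimated_splits(file_bytes, cores)
modified_estimate = estimated_splits(file_bytes, cores, max_partition_bytes=1000)
print(default_estimate, modified_estimate)  # 1 1000
```

Under these assumptions a small file is read as a single partition with the defaults, and lowering the setting to 1000 bytes multiplies the partition count accordingly.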

## Dynamic Partition Pruning
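- Which partitions to prune is determined at runtime: Spark uses the rows that survive the filter on one side of a join to skip partitions on the other side.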
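
The runtime decision can be modelled in plain Python before looking at the Spark version. This is only a sketch of the idea, not Spark's implementation: the songs come from the sample output in this notebook, while the listening rows and their grouping by date are hypothetical.

```python
from typing import Dict, List

# Fact table, already grouped by partition value (listen_date). Hypothetical rows.
listening_partitions: Dict[str, List[dict]] = {
    "2019-03-09": [{"activity_id": 1, "song_id": 4}],
    "2021-10-15": [{"activity_id": 2, "song_id": 1}],
    "2022-07-11": [{"activity_id": 3, "song_id": 3}],
}

# Small table that carries the filter.
songs = [
    {"song_id": 1, "release_date": "2021-10-15"},
    {"song_id": 3, "release_date": "2022-07-11"},
    {"song_id": 4, "release_date": "2019-03-09"},
]

# Step 1: apply the filter to the small table (ISO dates compare correctly as strings).
selected = [s for s in songs if s["release_date"] > "2019-12-31"]

# Step 2: at runtime, collect the join-key values that survived the filter.
keys = {s["release_date"] for s in selected}

# Step 3: read only the partitions whose value appears in those keys.
scanned = sorted(d for d in listening_partitions if d in keys)

# Step 4: join on both the date and the song id, touching only the scanned partitions.
joined = [
    (activity["activity_id"], song["song_id"])
    for d in scanned
    for activity in listening_partitions[d]
    for song in selected
    if song["release_date"] == d and song["song_id"] == activity["song_id"]
]

print(scanned)  # ['2021-10-15', '2022-07-11']
print(joined)   # [(2, 1), (3, 3)]
```

The `2019-03-09` partition is never read, even though the filter was written against the songs table and not against `listen_date`.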

In [26]:
df_listening_actv_pt = spark.read.parquet("../data/partitioning/partitioned/listening_activity_pt")
df_listening_actv_pt.show(5, False)

+-----------+-------+--------------------------+---------------+-----------+-----------+
|activity_id|song_id|listen_time               |listen_duration|listen_hour|listen_date|
+-----------+-------+--------------------------+---------------+-----------+-----------+
|4456       |16     |2023-07-18 10:15:47.023264|151            |10         |2023-07-18 |
|4457       |65     |2023-07-18 10:15:47.023264|181            |10         |2023-07-18 |
|4458       |60     |2023-07-18 10:15:47.023264|280            |10         |2023-07-18 |
|4459       |3      |2023-07-18 10:15:47.023264|249            |10         |2023-07-18 |
|4460       |45     |2023-07-18 10:15:47.023264|130            |10         |2023-07-18 |
+-----------+-------+--------------------------+---------------+-----------+-----------+
only showing top 5 rows



In [35]:
df_songs = (
    df_songs
    .withColumnRenamed("release_date", "release_datetime")
    .withColumn("release_date", F.to_date("release_datetime", "yyyy-MM-dd HH:mm:ss.SSSSSS"))
)
df_songs.show(5, False)
df_songs.printSchema()

+-------+------+---------+--------------------------+------------+
|song_id|title |artist_id|release_datetime          |release_date|
+-------+------+---------+--------------------------+------------+
|1      |Song_1|2        |2021-10-15 10:15:47.006571|2021-10-15  |
|2      |Song_2|45       |2020-12-07 10:15:47.006588|2020-12-07  |
|3      |Song_3|25       |2022-07-11 10:15:47.006591|2022-07-11  |
|4      |Song_4|25       |2019-03-09 10:15:47.006593|2019-03-09  |
|5      |Song_5|26       |2019-09-07 10:15:47.006596|2019-09-07  |
+-------+------+---------+--------------------------+------------+
only showing top 5 rows

root
 |-- song_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- artist_id: integer (nullable = true)
 |-- release_datetime: string (nullable = true)
 |-- release_date: date (nullable = true)



In [42]:
# Filter the small (songs) side; the join key on the large side is its partition column
df_selected_songs = df_songs.filter(F.col("release_date") > F.lit("2019-12-31"))
df_listening_actv_of_selected_songs = df_listening_actv_pt.join(
    df_selected_songs,
    on=(
        (df_selected_songs.release_date == df_listening_actv_pt.listen_date)
        & (df_selected_songs.song_id == df_listening_actv_pt.song_id)
    ),
    how="inner"
)

In [43]:
df_listening_actv_of_selected_songs.explain(True)

== Parsed Logical Plan ==
Join Inner, ((release_date#924 = listen_date#751) AND (song_id#881 = song_id#747))
:- Relation [activity_id#746,song_id#747,listen_time#748,listen_duration#749,listen_hour#750,listen_date#751] parquet
+- Filter (release_date#924 > cast(2019-12-31 as date))
   +- Project [song_id#881, title#882, artist_id#883, release_datetime#919, to_date('release_datetime, Some(yyyy-MM-dd HH:mm:ss.SSSSSS)) AS release_date#924]
      +- Project [song_id#881, title#882, artist_id#883, release_date#884 AS release_datetime#919]
         +- Relation [song_id#881,title#882,artist_id#883,release_date#884] csv

== Analyzed Logical Plan ==
activity_id: int, song_id: int, listen_time: string, listen_duration: int, listen_hour: int, listen_date: date, song_id: int, title: string, artist_id: int, release_datetime: string, release_date: date
Join Inner, ((release_date#924 = listen_date#751) AND (song_id#881 = song_id#747))
:- Relation [activity_id#746,song_id#747,listen_time#748,listen_du

In [44]:
spark.stop()