## Let's see what partitioning is all about
What is partitioning? In Spark, a DataFrame is stored in a distributed way, and partitioning is a way to keep that distributed data organized.


In [1]:
from pyspark.storagelevel import StorageLevel
from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

In [2]:
spark = (
    SparkSession
    .builder
    .config("spark.driver.memory", "10g")
    .master("local[*]")
    .appName("6_0_partitioning")
    .getOrCreate()
)
sc = spark.sparkContext
sc.setLogLevel("ERROR")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/08/19 17:35:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
listening_activity_file = "/Users/bhushanchowdary/Documents/GitHub/pyspark/Optimization/data/partitioning/raw/Spotify_Listening_Activity.csv"
# Read the CSV file used for this experiment
df_listening_actv = spark.read.csv(listening_activity_file, header=True, inferSchema=True)
df_listening_actv.show(5, False)

+-----------+-------+--------------------------+---------------+
|activity_id|song_id|listen_date               |listen_duration|
+-----------+-------+--------------------------+---------------+
|1          |12     |2023-06-27 10:15:47.008867|69             |
|2          |44     |2023-06-27 10:15:47.008867|300            |
|3          |75     |2023-06-27 10:15:47.008867|73             |
|4          |48     |2023-06-27 10:15:47.008867|105            |
|5          |10     |2023-06-27 10:15:47.008867|229            |
+-----------+-------+--------------------------+---------------+
only showing top 5 rows


In [5]:
# Let's do some operations on df_listening_actv
df_listening_actv = (
    df_listening_actv
    .withColumnRenamed("Listen_date", "Listen_time")  # the original column holds a full timestamp
    .withColumn("Listen_date", F.to_date("listen_time", "yyyy-MM-dd HH:mm:ss.SSSSSS"))  # date part only
    .withColumn("Listen_hour", F.hour("listen_time"))  # hour of the day
)


In [6]:
df_listening_actv.show(5)

+-----------+-------+--------------------+---------------+-----------+-----------+
|activity_id|song_id|         Listen_time|listen_duration|Listen_date|Listen_hour|
+-----------+-------+--------------------+---------------+-----------+-----------+
|          1|     12|2023-06-27 10:15:...|             69| 2023-06-27|         10|
|          2|     44|2023-06-27 10:15:...|            300| 2023-06-27|         10|
|          3|     75|2023-06-27 10:15:...|             73| 2023-06-27|         10|
|          4|     48|2023-06-27 10:15:...|            105| 2023-06-27|         10|
|          5|     10|2023-06-27 10:15:...|            229| 2023-06-27|         10|
+-----------+-------+--------------------+---------------+-----------+-----------+
only showing top 5 rows


In [7]:
df_listening_actv.printSchema()
df_listening_actv.count()  # count the number of rows (records)

root
 |-- activity_id: integer (nullable = true)
 |-- song_id: integer (nullable = true)
 |-- Listen_time: timestamp (nullable = true)
 |-- listen_duration: integer (nullable = true)
 |-- Listen_date: date (nullable = true)
 |-- Listen_hour: integer (nullable = true)



11779

In [8]:
df_listening_actv.explain()

== Physical Plan ==
*(1) Project [activity_id#17, song_id#18, listen_date#19 AS Listen_time#39, listen_duration#20, cast(gettimestamp(listen_date#19, yyyy-MM-dd HH:mm:ss.SSSSSS, TimestampType, try_to_date, Some(America/Chicago), true) as date) AS Listen_date#40, hour(listen_date#19, Some(America/Chicago)) AS Listen_hour#41]
+- FileScan csv [activity_id#17,song_id#18,listen_date#19,listen_duration#20] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/Users/bhushanchowdary/Documents/GitHub/pyspark/Optimization/data..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<activity_id:int,song_id:int,listen_date:timestamp,listen_duration:int>




Partitioning by listen_date
* Let's say we want to analyze user behavior over time.
* Without partitioning, finding the activity for one particular date makes Spark scan the whole dataset, even for a simple lookup.
* Since our use case needs analysis by date, creating partitions (folders) by date lets Spark pinpoint the exact folder immediately.
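To make the "folders by date" idea concrete, here is a minimal plain-Python sketch of the directory layout that a write partitioned on one column produces. It is not Spark code: the sample rows and the helper `layout_by_date` are hypothetical. It only mimics the Hive-style `column=value` folder naming and the fact that the partition value lives in the folder name rather than inside the files.

```python
from collections import defaultdict

# Hypothetical sample rows: (activity_id, song_id, listen_date)
rows = [
    (1, 12, "2023-06-27"),
    (2, 44, "2023-06-27"),
    (3, 75, "2023-06-28"),
]

def layout_by_date(rows):
    """Group rows under Hive-style folder names such as 'Listen_date=2023-06-27'."""
    folders = defaultdict(list)
    for activity_id, song_id, listen_date in rows:
        # The date is encoded in the folder name, so it is not repeated in the stored rows.
        folders[f"Listen_date={listen_date}"].append((activity_id, song_id))
    return dict(folders)

folders = layout_by_date(rows)
for name in sorted(folders):
    print(name, folders[name])
```

Each distinct date ends up in its own folder, which is what allows a date filter to skip every other folder.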

In [9]:
# Partition the data on Listen_date and write it as Parquet
(df_listening_actv
 .write
 .partitionBy("Listen_date")   # partition data by 'listen_date'
 .mode("overwrite")            # overwrite if directory already exists
 .parquet("/Users/bhushanchowdary/Documents/GitHub/pyspark/Optimization/data/partitioning/partitioned/listening_activity_pt"))


                                                                                

Let's see what partition pruning is

In [10]:
df_listening_actv_pt_pruned = spark.read.parquet("/Users/bhushanchowdary/Documents/GitHub/pyspark/Optimization/data/partitioning/partitioned/listening_activity_pt")
df_listening_actv_pt_pruned.filter("listen_date = '2019-01-01'").explain()

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [activity_id#79,song_id#80,Listen_time#81,listen_duration#82,Listen_hour#83,Listen_date#84] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/bhushanchowdary/Documents/GitHub/pyspark/Optimization/data..., PartitionFilters: [isnotnull(Listen_date#84), (Listen_date#84 = 2019-01-01)], PushedFilters: [], ReadSchema: struct<activity_id:int,song_id:int,Listen_time:timestamp,listen_duration:int,Listen_hour:int>
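In the plan above, the date condition appears under `PartitionFilters` rather than `DataFilters` or `PushedFilters`, which indicates that it is applied at the folder level before any file is opened. The following plain-Python sketch imitates that selection step. The folder listing and the `prune` helper are hypothetical and only illustrate the idea, not Spark's actual implementation.

```python
# Hypothetical folder listing of a table partitioned by Listen_date
partition_dirs = [
    "Listen_date=2023-06-26",
    "Listen_date=2023-06-27",
    "Listen_date=2023-06-28",
]

def prune(partition_dirs, column, value):
    """Keep only the folders whose encoded partition value matches the filter."""
    wanted = f"{column}={value}"
    return [d for d in partition_dirs if d == wanted]

matching = prune(partition_dirs, "Listen_date", "2023-06-27")
missing = prune(partition_dirs, "Listen_date", "2019-01-01")
print(matching)  # only one folder would be scanned
print(missing)   # no folder matches, so nothing would be scanned
```

Note that a filter value with no matching folder selects nothing at all. The sample rows shown earlier are dated 2023-06-27, so a filter on that date would be the one to try when inspecting actual results.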




### What problems will it solve?
1. Fast search -> Spark processes only the relevant partition instead of the entire dataset.
2. Parallelism -> more partitions let more cores work at the same time; when reading files, Spark aims for input partitions of up to 128 MB by default.
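As a rough illustration of the 128 MB figure, the sketch below estimates how many input partitions a single splittable file would be read into. It assumes the default value of `spark.sql.files.maxPartitionBytes` (128 MB) and deliberately ignores the other settings Spark takes into account when sizing splits, so treat the result as a ballpark only. The helper name `estimate_input_partitions` is hypothetical.

```python
import math

# Assumed default of spark.sql.files.maxPartitionBytes (128 MB)
MAX_PARTITION_BYTES = 128 * 1024 * 1024

def estimate_input_partitions(file_size_bytes, max_partition_bytes=MAX_PARTITION_BYTES):
    """Ballpark number of input partitions for one splittable file."""
    return max(1, math.ceil(file_size_bytes / max_partition_bytes))

one_gb = 1024 ** 3
ten_mb = 10 * 1024 * 1024
print(estimate_input_partitions(one_gb))  # 8
print(estimate_input_partitions(ten_mb))  # 1
```

Under this estimate a 1 GB file gives eight cores something to do, while a 10 MB file keeps only one busy.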

# Partitioning Examples
1. Single/multi level partitioning
2. Using `repartition`/`coalesce` with `partitionBy` (controlling number of files inside each partition): 
    - `partitionBy` affects how data is laid out in storage: it ensures the output directory is organized into subdirectories, one per distinct value of the column given in `partitionBy`.
    - The number of files in each `partitionBy` value directory depends on the number supplied to `repartition`/`coalesce`, because each in-memory partition that holds rows for a value writes its own file into that value's directory.
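The interaction between the number of in-memory partitions and the files per directory can be simulated in plain Python. This is a sketch under stated assumptions, not Spark code: rows are dealt round-robin into `num_partitions` groups as a stand-in for `repartition(n)`, and every group writes one file per distinct partition value it contains. The helper `files_per_directory` and the sample rows are hypothetical.

```python
from collections import defaultdict

def files_per_directory(rows, num_partitions, key):
    """Count the files each partition directory would receive."""
    # Deal rows round-robin into in-memory partitions (stand-in for repartition(n)).
    in_memory = [rows[i::num_partitions] for i in range(num_partitions)]
    files = defaultdict(int)
    for partition in in_memory:
        # One output file per distinct partition value present in this in-memory partition.
        for value in {key(row) for row in partition}:
            files[value] += 1
    return dict(files)

# Hypothetical rows: (listen_date, activity_id)
rows = [("2023-06-27", i) for i in range(6)] + [("2023-06-28", i) for i in range(2)]

single = files_per_directory(rows, 1, key=lambda r: r[0])
three = files_per_directory(rows, 3, key=lambda r: r[0])
print(single)
print(three)
```

With one in-memory partition every directory gets a single file. With three, a directory gets up to three files, and fewer when its rows happen to land in fewer in-memory partitions.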

In [11]:
(
    df_listening_actv
    .write
    .mode("overwrite")
    .partitionBy("listen_date", "listen_hour")
    .parquet("/Users/bhushanchowdary/Documents/GitHub/pyspark/Optimization/data/partitioning/partitioned/listening_activity_pt_2")
)
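With two partition columns the folders nest, one level per column, in the order given to `partitionBy`. The plain-Python sketch below lists the leaf folders such a write would create for a few hypothetical rows. The helper `leaf_directories` is hypothetical, and the point is only that the number of leaf folders equals the number of distinct (date, hour) combinations.

```python
# Hypothetical rows: (listen_date, listen_hour)
rows = [
    ("2023-06-27", 10),
    ("2023-06-27", 10),
    ("2023-06-27", 11),
    ("2023-06-28", 10),
]

def leaf_directories(rows):
    """Return the nested folder paths, one per distinct (date, hour) pair."""
    return sorted({f"listen_date={d}/listen_hour={h}" for d, h in rows})

leaves = leaf_directories(rows)
for path in leaves:
    print(path)
```

Because every extra partition column multiplies the number of folders, multi-level partitioning on high-cardinality columns can produce many small files.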

                                                                                

In [12]:
spark.stop()