
Caching and persisting DataFrames for faster processing (using the cache() and persist() methods)

In [19]:
from pyspark.sql import SparkSession
from pyspark import StorageLevel
import time

In [10]:
spark = SparkSession.builder.appName("CachePersistExample").getOrCreate()

In [11]:
data = [(i, i * 2) for i in range(1, 1000)]  # 999 rows (1 through 999)
columns = ["Number", "Double"]
df = spark.createDataFrame(data, columns)

In [12]:
# Without Cache or Persist (the first action also pays one-time job start-up cost)
start_time = time.time()
df.filter(df.Number > 500).count()
df.filter(df.Number > 500).count()  # Repeated action
no_cache_time = time.time() - start_time
print(f"Without Cache/Persist: {no_cache_time:.4f} seconds")

Without Cache/Persist: 2.2735 seconds


In [14]:
# Using Cache
df_cached = df.cache()  # cache() is lazy; it marks df itself and returns the same DataFrame
df_cached.count()  # Trigger caching with an action

start_time = time.time()
df_cached.filter(df_cached.Number > 500).count()
df_cached.filter(df_cached.Number > 500).count()  # Repeated action
cache_time = time.time() - start_time
print(f"With Cache: {cache_time:.4f} seconds")


With Cache: 0.9934 seconds


In [15]:
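# Using Persist
# df was already cached above, and persist() on an already-cached DataFrame
# keeps the existing storage level, so release the cache first.
df.unpersist()
df_persisted = df.persist(StorageLevel.MEMORY_AND_DISK)
df_persisted.count()  # Trigger persisting with an action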

999

In [16]:
start_time = time.time()
df_persisted.filter(df_persisted.Number > 500).count()
df_persisted.filter(df_persisted.Number > 500).count()  # Repeated action
persist_time = time.time() - start_time
print(f"With Persist (MEMORY_AND_DISK): {persist_time:.4f} seconds")

With Persist (MEMORY_AND_DISK): 0.9201 seconds


In [17]:
# Unpersist DataFrames to free resources
df_cached.unpersist()
df_persisted.unpersist()

DataFrame[Number: bigint, Double: bigint]

In [20]:
# Stop Spark Session
spark.stop()

Repartitioning and Coalescing

In [21]:
from pyspark.sql import SparkSession

In [22]:
spark = SparkSession.builder.appName("RepartitionCoalescingExample").getOrCreate()

In [25]:
data = [(i, i * 2) for i in range(1, 1000)]  # 999 rows (1 through 999)
columns = ["Number", "Double"]
df = spark.createDataFrame(data, columns)
print(f"Default Number of Partitions: {df.rdd.getNumPartitions()}") # to check number of partitions


Default Number of Partitions: 2


In [26]:
# Increase partitions to 4 (repartition() performs a full shuffle)
df_repartitioned = df.repartition(4)
print(f"Partitions after Repartitioning: {df_repartitioned.rdd.getNumPartitions()}")

Partitions after Repartitioning: 4


In [27]:
# Reduce partitions to 2 (coalesce() merges existing partitions and avoids a full shuffle)
df_coalesced = df.repartition(4).coalesce(2)
print(f"Partitions after Coalescing: {df_coalesced.rdd.getNumPartitions()}")


Partitions after Coalescing: 2


In [28]:
spark.stop()

Repartition a DataFrame to improve processing speed.

In [29]:
from pyspark.sql import SparkSession
import time

In [30]:
spark = SparkSession.builder.appName("RepartitionExample").getOrCreate()


In [31]:
data = [(i, f"Name_{i}") for i in range(1, 100)]  # 99 rows (IDs 1 through 99)
columns = ["ID", "Name"]
df = spark.createDataFrame(data, columns)
df.show()

+---+-------+
| ID|   Name|
+---+-------+
|  1| Name_1|
|  2| Name_2|
|  3| Name_3|
|  4| Name_4|
|  5| Name_5|
|  6| Name_6|
|  7| Name_7|
|  8| Name_8|
|  9| Name_9|
| 10|Name_10|
| 11|Name_11|
| 12|Name_12|
| 13|Name_13|
| 14|Name_14|
| 15|Name_15|
| 16|Name_16|
| 17|Name_17|
| 18|Name_18|
| 19|Name_19|
| 20|Name_20|
+---+-------+
only showing top 20 rows



In [32]:
print(f"Default Number of Partitions: {df.rdd.getNumPartitions()}")


Default Number of Partitions: 2


In [33]:
# Without Repartitioning
start_time = time.time()
df.filter(df.ID > 50).count()  # filter() is a transformation; count() is the action that runs it
no_repartition_time = time.time() - start_time
print(f"Processing Time without Repartitioning: {no_repartition_time:.4f} seconds")

Processing Time without Repartitioning: 0.9038 seconds


In [35]:
df_repartitioned = df.repartition(6)  # More partitions can raise parallelism, but repartition() itself adds a shuffle
start_time = time.time()
df_repartitioned.filter(df_repartitioned.ID > 50).count()
repartition_time = time.time() - start_time
print(f"Processing Time with Repartitioning: {repartition_time:.4f} seconds")

Processing Time with Repartitioning: 0.8939 seconds
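With only 99 rows, a single timed run is dominated by scheduling overhead, so the two numbers above are close to noise. One way to get steadier measurements is to discard a warm-up run and take the median of several repeats. The helper below is a hypothetical sketch (the name `time_action` is not part of PySpark), and it uses a plain Python stand-in workload so that it runs without a cluster.

```python
import statistics
import time


def time_action(action, repeats=5, warmup=1):
    """Run `action` several times and return the median wall-clock seconds."""
    for _ in range(warmup):
        action()  # warm-up runs are executed but not measured
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; it records each call so the number of runs can be checked.
calls = []
elapsed = time_action(lambda: calls.append(sum(range(10_000))), repeats=5, warmup=1)
print(f"Median time: {elapsed:.6f} seconds over 5 measured runs")
```

With Spark, the same helper could be called as `time_action(lambda: df.filter(df.ID > 50).count())` for each variant being compared.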


List of common optimization techniques

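1. **DataFrame API Optimization**: Prefer the DataFrame API over RDDs, because DataFrame queries go through the Catalyst optimizer. Avoid calling collect() on large datasets, since it pulls every row to the driver. Apply transformations (e.g., select(), filter(), withColumn()) so that they reduce shuffles and avoid unnecessary computation, for example by filtering rows and selecting only the needed columns early.
2. **Caching and Persisting**: Use cache() or persist() for a DataFrame that several actions reuse, and call unpersist() once it is no longer needed.
3. **Partition, Repartition, Coalesce**: Use repartition() to increase or rebalance partitions (full shuffle) and coalesce() to reduce them without a full shuffle.
4. **Avoid UDFs (Use Built-in Functions)**: Built-in functions are visible to the optimizer, while Python UDFs add serialization overhead between the JVM and Python.
5. **Avoid Shuffles**: Operations such as join(), distinct(), groupBy(), and repartition() move data between partitions. Use broadcast joins or skew join optimization when appropriate.
6. **Enable Adaptive Query Execution**: Dynamically optimizes query plans based on runtime statistics (enabled by default since Spark 3.2).
   spark.conf.set("spark.sql.adaptive.enabled", "true")
7. **Columnar Storage Format**: Save DataFrames as Parquet or ORC files instead of CSV or JSON.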

