<h2>INTRODUCTION TO PARTITIONS</h2>

Spark is a distributed computing platform that splits data into partitions so that computations can run in parallel. The challenge with such platforms is managing those partitions so that your Spark computations keep running efficiently. This topic covers how to manage Spark partitions with repartition and coalesce.

In [2]:
#Let's create a dataframe to illustrate how data is partitioned.
ds = spark.range(1, 11)
ds = ds.withColumnRenamed("id", "num_idb")
display(ds)

In [3]:
"""
Check the number of partitions
"""

ds.rdd.getNumPartitions()

In [4]:
"""
Get the partition distribution: which partition each row lives in
"""
from pyspark.sql.functions import spark_partition_id

ds.select(*ds.columns, spark_partition_id().alias("pid")).show()

<h2>coalesce</h2>
The coalesce method reduces the number of partitions in a DataFrame. Here’s how to consolidate the data into two partitions:

In [6]:
"""
Use the coalesce function to reduce the number of partitions. In this case we reduce to 2 partitions.
The next cell verifies the partition distribution after the coalesce.
"""
dsCoalesce = ds.coalesce(2)
dsCoalesce.rdd.getNumPartitions()

In [7]:
from pyspark.sql.functions import spark_partition_id
dsCoalesce.select(*dsCoalesce.columns, spark_partition_id().alias("pid")).show()

<h2>Increasing partitions using coalesce</h2>
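You can try to increase the number of partitions with coalesce, but it won’t work: when you request more partitions than the DataFrame already has, it keeps its current number of partitions.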

In [9]:
"""
Use the coalesce function to try to increase the number of partitions and see that it does not affect the partition count.
"""
from pyspark.sql.functions import spark_partition_id

dsCoalesceInc = ds.coalesce(10)

dsCoalesceInc.rdd.getNumPartitions()

#dsCoalesceInc.select(*dsCoalesceInc.columns, spark_partition_id().alias("pid")).show()

<h2>Repartition</h2>
The repartition method can be used to either increase or decrease the number of partitions in a DataFrame.

In [11]:
"""
Use repartition on the ds DataFrame to get 2 partitions.
"""
dsRepartition = ds.repartition(2)
dsRepartition.rdd.getNumPartitions()

In [12]:
from pyspark.sql.functions import spark_partition_id
dsRepartition.select(*dsRepartition.columns, spark_partition_id().alias("pid")).show()

<h2>Increasing partitions using repartition</h2>
Unlike coalesce, repartition can also be used to increase the number of partitions.

In [14]:
"""
Increase the number of partitions to 10.
"""
dsRepartitionInc = ds.repartition(10)
dsRepartitionInc.rdd.getNumPartitions()

In [15]:
from pyspark.sql.functions import spark_partition_id
dsRepartitionInc.select(*dsRepartitionInc.columns, spark_partition_id().alias("pid")).show()

In [16]:
"""
The results above show only the partition ID assigned to each row, so empty partitions do not appear.
To see the full partition structure, including empty partitions, use the code below.
"""
print('Partitions structure: {}'.format(dsRepartitionInc.rdd.glom().collect()))
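The nested list returned by glom().collect() holds one inner list per partition, which can be hard to read at a glance. The following is a minimal sketch of how it could be summarized. The helper name summarize_partitions and the sample data are hypothetical, and plain integers stand in for the Row objects that Spark would actually return.

```python
def summarize_partitions(glommed):
    """Return (sizes, empty_ids) for a list of per-partition row lists."""
    sizes = [len(rows) for rows in glommed]
    empty_ids = [pid for pid, n in enumerate(sizes) if n == 0]
    return sizes, empty_ids


# Hypothetical stand-in for dsRepartitionInc.rdd.glom().collect():
# 10 rows spread over 10 partitions, three of which are empty.
sample = [[1, 7], [], [2], [3, 8], [], [4], [5, 9], [], [6], [10]]

sizes, empty_ids = summarize_partitions(sample)
print('Rows per partition: {}'.format(sizes))
print('Empty partition ids: {}'.format(empty_ids))
```

In the notebook, the same helper could be called on the real glom().collect() result instead of the sample list.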

<h2>Differences between coalesce and repartition</h2>
The repartition algorithm does a full shuffle of the data and creates roughly equal-sized partitions. coalesce combines existing partitions to avoid a full shuffle, so the resulting partitions can be uneven in size.
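To make the difference concrete, here is a toy model in plain Python. It is not Spark's implementation: toy_coalesce and toy_repartition are hypothetical names, the starting layout is made up, and Spark's real coalescer also takes data locality into account while its shuffle does not use this exact row order. The sketch only shows the idea that coalesce glues whole neighbouring partitions together, while repartition moves every row.

```python
def toy_coalesce(partitions, n):
    """Merge neighbouring partitions; individual rows are never redistributed."""
    size = len(partitions)
    if n >= size:
        # Asking for more partitions than exist changes nothing.
        return [list(p) for p in partitions]
    merged = []
    for i in range(n):
        start = (i * size) // n
        end = ((i + 1) * size) // n
        merged.append([row for p in partitions[start:end] for row in p])
    return merged


def toy_repartition(partitions, n):
    """Deal every row out round-robin, a stand-in for a full shuffle."""
    out = [[] for _ in range(n)]
    rows = [row for p in partitions for row in p]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


# Made-up starting layout: 10 rows in 4 partitions.
parts = [[1, 2], [3, 4, 5], [6, 7], [8, 9, 10]]

coalesced = toy_coalesce(parts, 2)         # whole partitions merged
no_op = toy_coalesce(parts, 10)            # still 4 partitions
repartitioned = toy_repartition(parts, 8)  # rows spread over 8 partitions

print(coalesced)
print(len(no_op))
print(len(repartitioned))
```

In this model, coalescing only concatenates existing partitions, which is why it is cheap but cannot create new ones, whereas repartitioning touches every row and can therefore target any partition count.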