<br><br><br>
<span style="color:red;font-size:50px">Partitioning</span>
<br><br>

<li>Spark automatically partitions data. </li>
<li>If using hadoop,  Spark uses the HDFS partition size for each partition (default = 512MB)</li>
<li>Otherwise, the default is usually  = the number of cores on the machine</li>
<li>But, partitioning is controllable by the user</li>

<li><span style="color:red">Partition size too big</span>: you lose the benefits of working on data in parallel
<li><span style="color:red">Partition size too small</span>: the overhead of managing partitions may become too expensive
<li>So, try to get close to <span style="color:red">just right</span> by understanding the structure of your data

<br><br><br>
<span style="color:green;font-size:xx-large">Creating partitions</span>
<p>
<li>sc.parallelize partitions the data into multiple (usually number of cores) partitions</li>
<li>on OSX, the command <span style="color:red">sysctl -n hw.ncpu</span> will tell you how many cores your machine has</li>
<li>In the example below, each partition will be saved in a separate file on the disk</li>

In [1]:
val grades = Array(("John","A") , ("Jack","B+"), ("Jill","C"),
                   ("Qing","A+"),("Mahesh","A"),("Thierry","B+"))

val grades_RDD = sc.parallelize(grades)
val size = grades_RDD.partitions.size

grades_RDD.saveAsTextFile("grades")

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.0.149:4044
SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1666124313972)
SparkSession available as 'spark'


grades: Array[(String, String)] = Array((John,A), (Jack,B+), (Jill,C), (Qing,A+), (Mahesh,A), (Thierry,B+))
grades_RDD: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:27
size: Int = 8


In [4]:
!wc -l grades/* //word count

       0 grades/_SUCCESS
       0 grades/part-00000
       1 grades/part-00001
       1 grades/part-00002
       1 grades/part-00003
       0 grades/part-00004
       1 grades/part-00005
       1 grades/part-00006
       1 grades/part-00007
       6 total



In [3]:
!cat grades/part-00001 //Concatenate files and print on the standard output

(John,A)



In [9]:
!rm -r grades 
// remove

rm: grades: No such file or directory



<br><br><br>
<span style="color:green;font-size:xx-large">Purposeful partitioning</span>
<p>
<li>If you have some knowledge about the distribution of your data, you can guide partitioning</li>
<li>Spark allows for user driven partitioning on PairRDDs</li>
<li>Spark provides two partitioning mechanisms</li>
<ol>
    <li><span style="color:red">Hash Partitioners</span>: Creates key, value pairs by hashing on the key</li>
    <li><span style="color:red">Range Partitioners</span>: Sequentially allocates data across the partitions</li>
</ol>

<span style="color:blue;font-size:large">hash partitioning</span>
<p>
        <li>decide on the number of partitions</li>
        <li>hash keys to partition </li>
        <li>allocate key,value pairs to the correct partition</li>
        <li>The partitioning step is a transformation, not an action</li>

In [2]:
val NYC_Data_Path = "nyc_311_2022_clean.csv"

NYC_Data_Path: String = nyc_311_2022_clean.csv


In [11]:
sc.textFile(NYC_Data_Path).getNumPartitions

res2: Int = 49


In [15]:
import org.apache.spark.HashPartitioner
val hash_data = sc.textFile(NYC_Data_Path)
                        .mapPartitionsWithIndex{ (idx,iter) => if (idx==0) iter.drop(1) else iter}
                        .map(l=>l.split(","))
                        .map(t => (t(2),t(10).toDouble))
                        .partitionBy(new HashPartitioner(10)) //Partition the data into 10 hashed sets

import org.apache.spark.HashPartitioner
hash_data: org.apache.spark.rdd.RDD[(String, Double)] = ShuffledRDD[16] at partitionBy at <console>:31


In [13]:
hash_data.getNumPartitions
//the 49 parition will not occur since there are transformations in the middle

res3: Int = 10


In [16]:
val cat = hash_data.mapPartitions(iter => Iterator(iter.length)) //number of pairs in each partition
cat.collect

cat: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[17] at mapPartitions at <console>:28
res5: Array[Int] = Array(1494658, 583954, 0, 332647, 11897, 4435357, 61183, 236898, 1679035, 0)


<span style="color:blue;font-size:large">Examine the partitions</span>

In [14]:
val cat = hash_data.mapPartitions(iter => Iterator(iter.length)) //number of pairs in each partition
cat.collect

cat: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[10] at mapPartitions at <console>:25
res4: Array[Int] = Array(1494658, 583954, 0, 332647, 11897, 4435357, 61183, 236898, 1679035, 0)


In [17]:
!rm -r nyc_data

In [18]:
hash_data.saveAsTextFile("nyc_data")

In [18]:
!grep "NYPD" nyc_data/part-00005 | wc -l

 4435356



<span style="color:blue;font-size:large">range partitioning</span>
<p>
<li>Find the minimum and maximum values of the keys</li>
<li>divide the difference by the number of partitions to get partition key range sizes</li>
<li>set partition boundaries using the key range size</li>
<li>Range Partitions are transformations, not actions</li>

In [29]:
import org.apache.spark.RangePartitioner
val agency_time_map = sc.textFile(NYC_Data_Path)
                        .mapPartitionsWithIndex{ (idx,iter) => if (idx==0) iter.drop(1) else iter}
                        .map(l=>l.split(","))
                        .map(t => (t(2),t(10).toDouble))
val range_data = agency_time_map.partitionBy(new RangePartitioner(10,agency_time_map))
val cat = range_data.mapPartitions(iter => Iterator(iter.length))
cat.collect

import org.apache.spark.RangePartitioner
agency_time_map: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[64] at map at <console>:50
range_data: org.apache.spark.rdd.RDD[(String, Double)] = ShuffledRDD[67] at partitionBy at <console>:51
cat: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[68] at mapPartitions at <console>:52
res10: Array[Int] = Array(1315468, 1775422, 11897, 1233935, 57511, 4435356, 6040)


In [30]:
cat.getNumPartitions

res11: Int = 7


<span style="color:blue;font-size:x-large">hash partitioner vs range partitioner</span>
<p>
<li>When the data is ordered, range partitioners are preferred</li>
<li>Range Partitioners need a distribution and typically make two passes (more expensive)</li>
<li>If the distribution is unstable, then range partitioning can be problematic</li>
<li>Let's look at an example</li>

In [31]:
import org.apache.spark.HashPartitioner
import org.apache.spark.RangePartitioner
val rdd1 = sc.range(1,1000).map(x => (x,1))
val rdd2 = sc.range(900,1900).map(x=>(x,1))

val hash_data1 = rdd1.partitionBy(new HashPartitioner(10))
val hash_data2 = rdd2.partitionBy(new HashPartitioner(10))
val h1 = hash_data1.mapPartitions(iter => Iterator(iter.length))
val h2 = hash_data2.mapPartitions(iter => Iterator(iter.length))


val r_partitioner = new RangePartitioner(10,rdd1)
val ran_data1 = rdd1.partitionBy(r_partitioner)
val ran_data2 = rdd2.partitionBy(r_partitioner)
val r1 = ran_data1.mapPartitions(iter => Iterator(iter.length))
val r2 = ran_data2.mapPartitions(iter => Iterator(iter.length))

import org.apache.spark.HashPartitioner
import org.apache.spark.RangePartitioner
rdd1: org.apache.spark.rdd.RDD[(Long, Int)] = MapPartitionsRDD[71] at map at <console>:46
rdd2: org.apache.spark.rdd.RDD[(Long, Int)] = MapPartitionsRDD[74] at map at <console>:47
hash_data1: org.apache.spark.rdd.RDD[(Long, Int)] = ShuffledRDD[75] at partitionBy at <console>:48
hash_data2: org.apache.spark.rdd.RDD[(Long, Int)] = ShuffledRDD[76] at partitionBy at <console>:49
h1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[77] at mapPartitions at <console>:50
h2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[78] at mapPartitions at <console>:51
r_partitioner: org.apache.spark.RangePartitioner[Long,Int] = org.apache.spark.RangePartitioner@b2cf1b58
ran_data1: org.apache.spark.rdd.RDD[(Long, Int)] = Shu...


In [37]:
val r2_partitioner = new RangePartitioner(10,rdd2)
val ran_data22 = rdd2.partitionBy(r2_partitioner)
val r22 = ran_data22.mapPartitions(iter => Iterator(iter.length))
r22.collect

r2_partitioner: org.apache.spark.RangePartitioner[Long,Int] = org.apache.spark.RangePartitioner@f303e653
ran_data22: org.apache.spark.rdd.RDD[(Long, Int)] = ShuffledRDD[90] at partitionBy at <console>:50
r22: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[91] at mapPartitions at <console>:51
res17: Array[Int] = Array(104, 98, 94, 107, 95, 96, 106, 96, 107, 97)


In [32]:
h1.collect


res12: Array[Int] = Array(99, 100, 100, 100, 100, 100, 100, 100, 100, 100)


In [33]:
h2.collect


res13: Array[Int] = Array(100, 100, 100, 100, 100, 100, 100, 100, 100, 100)


In [34]:
r1.collect


res14: Array[Int] = Array(101, 100, 102, 103, 96, 91, 107, 96, 98, 105)


In [35]:
r2.collect

res15: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 1000)


<br><br><br>
<span style="color:green;font-size:xx-large">Calibrating a Spark model</span>
<p>
<li>A spark context comes with a handy UI for evaluating the performance of a Spark model</li>
<li>http://localhost:4040 (usually)</li>
<li>Let's see how well the different partitions perform on our data when computing average processing time for each agency</li>

<span style="color:blue;font-size:large">Average processing time analysis</span>

In [3]:
//All transformations. So no processing cost

val combiner = (x: Double) => (1,x)
val merger = (x: (Int, Double),y: Double) => {
    val (c,acc) = x
    (c+1, acc + y)
}
val mergeAndCombiner = (x1: (Int, Double), x2: (Int, Double)) => {
    val (c1, acc1) = x1
    val (c2, acc2) = x2
    (c1+c2,acc1+acc2)
}
val getAvgFunction = (x: (String, (Int, Double))) => {
    val (identifier, (count,total)) = x
    (identifier,total/count)
}

val agency_time_map = sc.textFile(NYC_Data_Path)
                        .mapPartitionsWithIndex{ (idx,iter) => if (idx==0) iter.drop(1) else iter}
                        .map(l=>l.split(","))
                        .map(t => (t(2),t(10).toDouble))



combiner: Double => (Int, Double) = $Lambda$2326/0x0000000800e8e840@5774109
merger: ((Int, Double), Double) => (Int, Double) = $Lambda$2327/0x0000000800e8d840@14eed62e
mergeAndCombiner: ((Int, Double), (Int, Double)) => (Int, Double) = $Lambda$2328/0x0000000800e8d040@1ed65bc5
getAvgFunction: ((String, (Int, Double))) => (String, Double) = $Lambda$2329/0x0000000800e8c040@389b2be2
agency_time_map: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[4] at map at <console>:45


<span style="color:blue;font-size:large">Hash partitioner</span>


In [4]:
import org.apache.spark.HashPartitioner
val hash_data = agency_time_map.partitionBy(new HashPartitioner(2))
val h_avg_times = hash_data
                    .combineByKey(combiner,merger,mergeAndCombiner)
                    .map(t => (t._1,t._2._2.toDouble/t._2._1))
                    .collect

import org.apache.spark.HashPartitioner
hash_data: org.apache.spark.rdd.RDD[(String, Double)] = ShuffledRDD[5] at partitionBy at <console>:28
h_avg_times: Array[(String, Double)] = Array((DOITT,28.392176800309258), (DSNY,6.984900241530284), (MAYORâS OFFICE OF SPECIAL ENFORCEMENT,10.993121710573053), (DCA,3.1134426905912482), (DOE,43.447434186637715), (DPR,65.99335925229805), (HPD,13.242378653308426), (EDC,56.00096076391685), (DOHMH,15.397117888616057), (DOF,19.762443628704336), (OSE,0.12648533950617283), (TLC,53.68203580195772), (DEP,5.0066760309900165), (FDNY,402.1443981481481), (DOT,14.494687185115032), (OFFICE OF TECHNOLOGY AND INNOVATION,0.7780439814814815), (DFTA,13.390725308641976), (NYPD,0.3395727580072987), (DOB,39.106442592864965), (DHS,1.2574107460376571))


<span style="color:blue;font-size:large">Range partitioner</span>

In [5]:
import org.apache.spark.RangePartitioner
val r_partitioner = new RangePartitioner(2,agency_time_map)
val r_avg_times = agency_time_map.partitionBy(r_partitioner)
                .combineByKey(combiner,merger,mergeAndCombiner)
                    .map(t => (t._1,t._2._2.toDouble/t._2._1))
                    .collect

import org.apache.spark.RangePartitioner
r_partitioner: org.apache.spark.RangePartitioner[String,Double] = org.apache.spark.RangePartitioner@474db71
r_avg_times: Array[(String, Double)] = Array((DOITT,28.392176800309258), (DSNY,6.984900241530284), (DOF,19.762443628704336), (MAYORâS OFFICE OF SPECIAL ENFORCEMENT,10.993121710573053), (DEP,5.0066760309900165), (DCA,3.1134426905912482), (FDNY,402.1443981481481), (DOT,14.494687185115032), (DOE,43.447434186637715), (DPR,65.99335925229805), (DFTA,13.390725308641976), (NYPD,0.3395727580072987), (HPD,13.242378653308426), (EDC,56.00096076391685), (DOB,39.106442592864965), (DOHMH,15.397117888616057), (DHS,1.2574107460376571), (OSE,0.12648533950617283), (TLC,53.68203580195772), (OFFICE OF TECHNOLOGY AND INNOVATION,0.7780439814814815))


<span style="color:blue;font-size:large">Custom partitioner</span>
<li>Spark allows you to write your own partitioner</li>
<li>Only on (key,value) pairs</li>
<li>Hard to come up with a good use case because you need to know the key values in advance</li>
<li>but, consider our nyc data. We know what the various departments are so we know the keys</li>
<li>Let's build a partitioner that will partition based on the length of the agency name</li>


<li>First, we need to decide on the number of partitions</li>
<li>Then, we'll define a max key length and a min key length</li>
<li>Finally partition the data using the number of partitions, the max and the min key lengths</li>

In [6]:
h_avg_times

res0: Array[(String, Double)] = Array((DOITT,28.392176800309258), (DSNY,6.984900241530284), (MAYORâS OFFICE OF SPECIAL ENFORCEMENT,10.993121710573053), (DCA,3.1134426905912482), (DOE,43.447434186637715), (DPR,65.99335925229805), (HPD,13.242378653308426), (EDC,56.00096076391685), (DOHMH,15.397117888616057), (DOF,19.762443628704336), (OSE,0.12648533950617283), (TLC,53.68203580195772), (DEP,5.0066760309900165), (FDNY,402.1443981481481), (DOT,14.494687185115032), (OFFICE OF TECHNOLOGY AND INNOVATION,0.7780439814814815), (DFTA,13.390725308641976), (NYPD,0.3395727580072987), (DOB,39.106442592864965), (DHS,1.2574107460376571))


In [7]:
//max and min key lengths
val max_len = h_avg_times.map(t=>t._1.length).max
val min_len = h_avg_times.map(t=>t._1.length).min
val keys = h_avg_times.map(t=>t._1)
val key_lengths = h_avg_times.map(t=>t._1.length)

max_len: Int = 39
min_len: Int = 3
keys: Array[String] = Array(DOITT, DSNY, MAYORâS OFFICE OF SPECIAL ENFORCEMENT, DCA, DOE, DPR, HPD, EDC, DOHMH, DOF, OSE, TLC, DEP, FDNY, DOT, OFFICE OF TECHNOLOGY AND INNOVATION, DFTA, NYPD, DOB, DHS)
key_lengths: Array[Int] = Array(5, 4, 39, 3, 3, 3, 3, 3, 5, 3, 3, 3, 3, 4, 3, 35, 4, 4, 3, 3)


<li>Since the two ultra long keys are outliers, we'll reset max to 6</li>
<li>Note that we'll be using for loops and vars in this example!</li>

In [10]:
val max_len=6
val min_len=3
val n = 3//the number of partitions

max_len: Int = 6
min_len: Int = 3
n: Int = 3


In [11]:
!rm -r custom

rm: custom: No such file or directory



In [12]:
val max_len=6
val min_len=3
val n = 3//the number of partitions

class KeyLenPartitioner(num:Int,max_len:Int,min_len:Int) extends org.apache.spark.Partitioner{
    override def numPartitions: Int = num //This is necessary!

    override def getPartition(key: Any): Int = {
        import scala.util.control.Breaks._
        val partition_increment = (max_len-min_len)/num //Note that this will be an Int
        val key_length = key.toString.size
        var partition = 0
        breakable {
            for(i<-num-1 to 0 by -1) {
                if (key_length >= min_len + partition_increment*num) {
                    partition = partition + 1
                    break
                } 
            }
        }
        partition
    }

 }



agency_time_map
    .partitionBy(new KeyLenPartitioner(n,max_len,min_len))
    .saveAsTextFile("custom")
   

defined class KeyLenPartitioner


In [15]:
!wc -l custom/*

       0 custom/_SUCCESS

 8778116 custom/part-00000
   57513 custom/part-00001
       0 custom/part-00002
 8835629 total



In [34]:
import scala.math.sqrt

import scala.math.sqrt


In [8]:
2.23606797749979 * 2.23606797749979

res6: Double = 5.000000000000001


22/10/03 04:28:24 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 680521 ms exceeds timeout 120000 ms
22/10/03 04:28:24 WARN SparkContext: Killing executors is not supported by current scheduler.


In [7]:
sqrt(43)*sqrt(43)

res5: Double = 42.99999999999999
