# Tuning Spark

This part is for the interested student and looks at the importance of tuning Spark for best performance. By design, much of the performance tuning can be done at the application level, e.g. by making the right choices of partitioning and of the methods used (avoiding shuffling as much as possible). Understanding the RDD DAG …

More and more of the tuning is done 'under the hood' in Spark. For example, in Python, stored objects are always serialized with the Pickle library, so it does not matter whether you choose a serialized storage level (this is not the case for Java and Scala on Spark). Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even if the user never calls persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend calling persist on the resulting RDD if you plan to reuse it.
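
To make the remark about Pickle concrete, here is a small plain-Python sketch of a serialize and deserialize round trip. The sample records are hypothetical, and PySpark's own serializers may add batching and other details not shown here, so treat this only as an illustration of the idea.

```python
import pickle

# Hypothetical sample records, shaped like the (first letter, name) pairs used below.
records = [("A", "Anna"), ("B", "Bo"), ("A", "Alva")]

# Stored Python objects end up as bytes; pickle does the conversion both ways.
payload = pickle.dumps(records)
restored = pickle.loads(payload)

print(len(payload))          # size of the serialized form in bytes
print(restored == records)   # the round trip preserves the data
```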

There is more to read about tuning Spark; a good starting point is: https://spark.apache.org/docs/latest/tuning.html

## Create RDD

*Initiate the Spark context - ONLY the first time for each notebook. If you run into problems with the cell below, see [Help](/notebooks/spark_course/1-Course-Information-and-Links/If-you-get-problems-initiating-spark-context.ipynb)*

In [1]:
import os
from pyspark import SparkContext
sc = SparkContext(appName="search", master=os.environ['MASTER'])

Load the data and cache it (we don't want to measure HDFS performance this time).

In [2]:
nn = sc.textFile("/uuData/names").cache()

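Count the names. There are not many: 427500. Since cache() is lazy, this first action is also when the data is actually read and stored in memory.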

In [3]:
nn.count()

427500

## Naive way, using groupByKey

In [4]:
import time

In [5]:
start = time.time()
# Key each name by its first letter, then count the distinct names per letter.
sc.textFile("/uuData/names").map(lambda name: (name[0], name)).groupByKey().mapValues(lambda names: len(set(names))).collect()
end = time.time()
print(end - start)

3.10979700089


But we already had the names cached (see '.cache()' above), so let's see what difference that makes to the time for counting names.

In [6]:
start = time.time()
# Same job as before, but starting from the cached RDD instead of HDFS.
nn.map(lambda name: (name[0], name)).groupByKey().mapValues(lambda names: len(set(names))).collect()
end = time.time()
print(end - start)

1.73540711403


So it seems that for this size of data something else takes most of the time. Let's break the job up into steps.

In [7]:
start = time.time()
# map is a lazy transformation: this only records the step, no Spark job runs.
nn.map(lambda name: (name[0], name))
end = time.time()
print(end - start)

0.000255823135376


In [8]:
start = time.time()
# groupByKey is also lazy: still no Spark job runs here.
nn.map(lambda name: (name[0], name)).groupByKey()
end = time.time()
print(end - start)

0.0273609161377


In [9]:
start = time.time()
# mapValues is lazy as well: the lineage grows, but nothing is computed.
nn.map(lambda name: (name[0], name)).groupByKey().mapValues(lambda names: len(set(names)))
end = time.time()
print(end - start)

0.030730009079


In [10]:
start = time.time()
# collect is an action: it triggers map, the groupByKey shuffle and mapValues.
nn.map(lambda name: (name[0], name)).groupByKey().mapValues(lambda names: len(set(names))).collect()
end = time.time()
print(end - start)

1.60329914093


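At first glance collect looks like the dominant step, but the comparison is misleading: map, groupByKey and mapValues are lazy transformations, so the three cells before the last one only time the construction of the lineage. The actual work, including the shuffle for groupByKey, runs only when the action collect() is called, so the 1.6 seconds in the last cell covers the whole pipeline.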
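
Because transformations are lazy, a timer only measures real work when it wraps an action such as collect() or count(). A minimal sketch of a timing helper follows; `time_action` and its `repeats` parameter are hypothetical names introduced here, not part of Spark, and the helper works with any zero-argument callable.

```python
import time

def time_action(action, repeats=3):
    """Run a zero-argument callable several times; return (last_result, timings)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.time()
        result = action()  # the work happens here, e.g. an RDD action
        timings.append(time.time() - start)
    return result, timings

# Stand-in workload so the sketch runs without a cluster.
result, timings = time_action(lambda: sum(range(100000)), repeats=3)
print(min(timings))
```

In this notebook the callable could be, for example, `lambda: nn.map(lambda name: (name[0], name)).groupByKey().mapValues(lambda names: len(set(names))).collect()`. Repeating the run and looking at the minimum reduces the influence of warm-up effects on a single measurement.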

## 1) (as before) avoid groupByKey, and 2) use partitioning

In [11]:
start = time.time()
# Remove duplicate names first, then count one per name with reduceByKey.
sc.textFile("/uuData/names").distinct(numPartitions=1).map(lambda name: (name[0], 1)).reduceByKey(lambda a, b: a + b).collect()
end = time.time()
print(end - start)

1.67960309982
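
One reason to prefer reduceByKey over groupByKey is that reduceByKey combines values within each partition before the shuffle, so fewer records have to be moved. The following plain-Python sketch imitates that on lists standing in for partitions. The function names and sample data are hypothetical, and the sketch only counts records; it does not model Spark's actual shuffle.

```python
from collections import Counter

def records_shuffled_group(partitions):
    """groupByKey-style: every (key, value) pair is sent through the shuffle."""
    return sum(len(part) for part in partitions)

def records_shuffled_reduce(partitions):
    """reduceByKey-style: each partition first combines its values per key,
    so it sends at most one record per distinct key."""
    return sum(len(Counter(key for key, _ in part)) for part in partitions)

def count_by_key(partitions):
    """Final result of summing the values per key, the same for both strategies."""
    totals = Counter()
    for part in partitions:
        for key, value in part:
            totals[key] += value
    return dict(totals)

# Two hypothetical partitions of (first letter, 1) pairs.
partitions = [
    [("A", 1), ("A", 1), ("B", 1), ("A", 1)],
    [("B", 1), ("B", 1), ("C", 1)],
]
print(records_shuffled_group(partitions))   # 7 pairs shuffled
print(records_shuffled_reduce(partitions))  # 4 pairs shuffled
print(count_by_key(partitions))
```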


Increase the number of partitions to 2 (more doesn't help on this small dataset).

In [12]:
start = time.time()
# Same job, but distinct now spreads its output over 2 partitions.
sc.textFile("/uuData/names").distinct(numPartitions=2).map(lambda name: (name[0], 1)).reduceByKey(lambda a, b: a + b).collect()
end = time.time()
print(end - start)

1.62720608711


But we already had the names cached (see '.cache()' above), so let's see what difference that makes to the time for counting names.

In [13]:
start = time.time()
# Start from the cached RDD, 1 partition.
nn.distinct(numPartitions=1).map(lambda name: (name[0], 1)).reduceByKey(lambda a, b: a + b).collect()
end = time.time()
print(end - start)

1.40223312378


Now with 2 partitions.

In [14]:
start = time.time()
# Start from the cached RDD, 2 partitions.
nn.distinct(numPartitions=2).map(lambda name: (name[0], 1)).reduceByKey(lambda a, b: a + b).collect()
end = time.time()
print(end - start)

1.44529819489
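
The numPartitions argument controls how many partitions the shuffled data is spread over; each record goes to a partition chosen from a hash of its key. A rough plain-Python sketch of that assignment is shown here. It uses zlib.crc32 as a stand-in hash (an assumption made for reproducibility; Spark uses its own hash function), and `partition_sizes` is a hypothetical helper.

```python
import zlib

def partition_sizes(keys, num_partitions):
    """Count how many keys land in each partition under hash(key) mod num_partitions."""
    sizes = [0] * num_partitions
    for key in keys:
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        sizes[index] += 1
    return sizes

# Hypothetical sample of names.
names = ["Anna", "Bo", "Alva", "Carl", "Britt", "Dan", "Eva", "Axel"]
for n in (1, 2, 4):
    print(n, partition_sizes(names, n))
```

With very little data per partition, the overhead of scheduling extra tasks can outweigh any gain from parallelism, which is consistent with the timings above.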


In these runs we went from about 3.1 seconds (no caching, groupByKey, no partitioning) to about 1.4 seconds (cached data, distinct and reduceByKey).

### Exercise (if there's time for it)
So in the case above, tuning gave us roughly a 2x speedup. Create a larger data set and try the steps above again (try different numbers of partitions etc). See the next notebook.
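
One possible way to create a larger data set for this exercise is to generate synthetic names locally and then copy the file to HDFS. The sketch below is only a starting point: `generate_names`, `write_names` and the name length range are all assumptions, and random letter strings are of course not realistic names.

```python
import random
import string

def generate_names(n, seed=0, min_len=3, max_len=8):
    """Return n pseudo-random capitalized 'names' (duplicates are possible)."""
    rng = random.Random(seed)
    names = []
    for _ in range(n):
        length = rng.randint(min_len, max_len)
        letters = [rng.choice(string.ascii_lowercase) for _ in range(length)]
        names.append("".join(letters).capitalize())
    return names

def write_names(path, names):
    """Write one name per line, so the file can be read line by line."""
    with open(path, "w") as handle:
        for name in names:
            handle.write(name + "\n")

sample = generate_names(1000, seed=42)
print(len(sample), sample[:3])
```

A file written with `write_names` could then be copied to HDFS (for example with `hdfs dfs -put`) and loaded with sc.textFile in the same way as /uuData/names.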