# Partitions matter!

This part is for the interested student and looks at the importance of tuning Spark for best performance. By design, much of the performance tuning can be done at the application level, e.g. by making the right choices for partitioning and for the methods used (avoiding shuffling as much as possible). Understanding the RDD DAG …

## Create RDD

*Initiate the Spark context. Do this ONLY the first time for each notebook. If you run into problems with the cell below, see [Help](/notebooks/spark_course/1-Course-Information-and-Links/If-you-get-problems-initiating-spark-context.ipynb)*

In [2]:
import os
from pyspark import SparkContext
sc = SparkContext(appName="search", master=os.environ['MASTER'])

### Exercise (if there's time for it)
In the case above, tuning gave us an x(?) speedup. Create a larger data set and try the above again (with different numbers of partitions, etc.).

Create a larger dataset out of the current one, e.g. by concatenating it with itself a number of times, then put it into HDFS and read it into Spark (note: you can use a repeat count other than '100').
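A minimal sketch of one way to build the larger file locally, assuming a local copy of `names` is available and that it ends with a newline (otherwise the last and first names of adjacent copies would be joined). The helper name `replicate_file` and its `copies` parameter are hypothetical, chosen for illustration; `copies=10` plays the role of the '1..10' range.

```python
import shutil


def replicate_file(src_path, dst_path, copies):
    """Write `copies` back-to-back copies of src_path into dst_path."""
    with open(dst_path, "wb") as dst:
        for _ in range(copies):
            with open(src_path, "rb") as src:
                # Stream in chunks so large inputs never have to fit in memory.
                shutil.copyfileobj(src, dst)


# Example usage (the local paths are assumptions):
# replicate_file("names", "namesMedium", 10)
```

The resulting file could then be uploaded with something like `hdfs dfs -put namesMedium /uuData/`.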

Upload the data and cache it (we don't want to measure HDFS performance this time).

namesMedium is about 29M if you use '1..10' above.
namesLarge is about 290M if you use '1..100' above.

Then put it into HDFS.

Check that it's there

In [2]:
%%sh
hdfs dfs -ls /uuData/

Found 11 items
-rw-r--r--   1 ubuntu supergroup        534 2015-04-17 08:23 /uuData/README.md
-rw-r--r--   1 ubuntu supergroup     174449 2015-04-17 08:16 /uuData/access_log
-rw-r--r--   1 ubuntu supergroup      14989 2015-04-17 08:16 /uuData/error_log
-rw-r--r--   1 ubuntu supergroup     197105 2015-04-17 08:16 /uuData/lr_data.txt
drwxr-xr-x   - ubuntu supergroup          0 2015-04-17 08:16 /uuData/movies
-rw-r--r--   1 ubuntu supergroup    3004200 2015-04-17 08:16 /uuData/names
-rw-r--r--   1 ubuntu supergroup  303424200 2015-04-17 14:13 /uuData/namesLarge
-rw-r--r--   1 ubuntu supergroup   30042000 2015-04-17 14:26 /uuData/namesMedium
drwxr-xr-x   - ubuntu supergroup          0 2015-04-17 08:16 /uuData/pagecounts
-rw-r--r--   1 ubuntu supergroup         73 2015-04-17 08:16 /uuData/people.json
-rw-r--r--   1 ubuntu supergroup         32 2015-04-17 08:16 /uuData/people.txt


Let's start with namesMedium. Try namesLarge or another size if you have time.

In [3]:
nn = sc.textFile("/uuData/namesMedium").cache()

More names now: 4275000

In [4]:
nn.count()

4275000

## Naive way, using groupByKey

In [5]:
import time

In [6]:
start = time.time()
# Key each name by its first letter, group, then count the distinct names per letter.
sc.textFile("/uuData/names").map(lambda name: (name[0], name)).groupByKey().mapValues(lambda names: len(set(names))).collect()
end = time.time()
print(end - start)

4.03503704071
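To make clear what the pipeline above computes, here is a plain-Python sketch (no Spark involved) of the same logic: the number of distinct names per first letter. The function name `distinct_names_per_initial` is hypothetical, and the sample names are made up for illustration.

```python
def distinct_names_per_initial(names):
    """Count the distinct names for each first letter."""
    groups = {}
    for name in names:
        # Same role as map(...).groupByKey(): collect names under their first letter.
        groups.setdefault(name[0], set()).add(name)
    # Same role as mapValues(...): the size of each set is the distinct count.
    return dict((initial, len(unique)) for initial, unique in groups.items())


sample = ["anna", "adam", "anna", "bob"]
print(distinct_names_per_initial(sample))
```

On a small sample like this, the result can be compared with what the Spark job returns for the same input.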


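But we already had the names cached (see '.cache()' above), so let's see what difference that makes to the time for counting names. Note that nn holds namesMedium, which is ten times larger than names, so the two timings are not directly comparable.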

In [7]:
start = time.time()
nn.map(lambda name: (name[0], name)).groupByKey().mapValues(lambda names: len(set(names))).collect()
end = time.time()
print(end - start)

14.2072148323


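So for this size of data, something other than reading the input dominates the time. Let's break it up into steps. Keep in mind that transformations such as map, groupByKey and mapValues are lazy: they only build up the DAG, and nothing is computed until an action such as collect() is called.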

In [8]:
start = time.time()
# Transformation only: returns immediately, nothing is computed yet.
nn.map(lambda name: (name[0], name))
end = time.time()
print(end - start)

0.000245094299316


In [9]:
start = time.time()
# Still only transformations, so still no job is run.
nn.map(lambda name: (name[0], name)).groupByKey()
end = time.time()
print(end - start)

0.0188281536102


In [10]:
start = time.time()
nn.map(lambda name: (name[0], name)).groupByKey().mapValues(lambda names: len(set(names)))
end = time.time()
print(end - start)

0.0321490764618


In [11]:
start = time.time()
# collect() is an action: this is where the whole pipeline actually runs.
nn.map(lambda name: (name[0], name)).groupByKey().mapValues(lambda names: len(set(names))).collect()
end = time.time()
print(end - start)

13.9009840488
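The start/end pattern repeated in the cells above could be wrapped in a small helper. This is only a sketch, and the name `timed` is hypothetical. Because Spark transformations are lazy, the callable passed in should end in an action (such as `collect()` or `count()`); otherwise only the time to build the DAG is measured.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.time()
    result = action()
    return result, time.time() - start


# Stand-in workload so the sketch runs without a cluster. In the notebook one
# would instead pass a lambda that ends in an action, e.g. lambda: nn.count()
total, seconds = timed(lambda: sum(range(1000)))
print(total)  # 499500
```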


## 1) (as before) avoid groupByKey, and 2) use partitioning
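As a rough illustration of why `reduceByKey` is preferred, the following plain-Python model (not Spark code) counts how many values would have to cross the shuffle for each approach. It is a simplification, under the assumption that `reduceByKey` fully combines the values per key within each partition before shuffling, while `groupByKey` has to ship every value. The function names and sample partitions are hypothetical.

```python
def values_shuffled_group_by_key(partitions):
    """groupByKey model: every value is sent to the partition that owns its key."""
    return sum(len(partition) for partition in partitions)


def values_shuffled_reduce_by_key(partitions):
    """reduceByKey model: each partition pre-combines, sending one value per key."""
    return sum(len(set(key for key, _ in partition)) for partition in partitions)


# Two made-up input partitions of (first_letter, 1) pairs.
partitions = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("c", 1)],
]
print(values_shuffled_group_by_key(partitions))   # 7
print(values_shuffled_reduce_by_key(partitions))  # 4
```

With few distinct keys (here, first letters) and many records, the gap between the two numbers grows with the size of the data.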

In [12]:
start = time.time()
# Deduplicate first, then count one per name for each first letter.
sc.textFile("/uuData/names").distinct(numPartitions=1).map(lambda name: (name[0], 1)).reduceByKey(lambda a, b: a + b).collect()
end = time.time()
print(end - start)

1.51583003998


Increase the number of partitions to 2 (more doesn't help on this small dataset).

In [13]:
start = time.time()
sc.textFile("/uuData/names").distinct(numPartitions=2).map(lambda name: (name[0], 1)).reduceByKey(lambda a, b: a + b).collect()
end = time.time()
print(end - start)

1.50060105324
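The `numPartitions` argument controls how many partitions the shuffle behind `distinct` produces, with each record assigned to a partition based on a hash of its key. The sketch below models that idea in plain Python. It uses `zlib.crc32` as a deterministic stand-in and is not Spark's actual hash function; the function name `assign_partitions` and the sample names are hypothetical.

```python
import zlib


def assign_partitions(keys, num_partitions):
    """Spread keys over num_partitions buckets by hashing each key."""
    buckets = [[] for _ in range(num_partitions)]
    for key in keys:
        # crc32 is a deterministic stand-in for the partitioner's hash function.
        index = (zlib.crc32(key.encode("utf-8")) & 0xffffffff) % num_partitions
        buckets[index].append(key)
    return buckets


names = ["anna", "adam", "bob", "carl", "dora", "emil", "fred", "gina"]
for n in (1, 2, 8):
    sizes = [len(bucket) for bucket in assign_partitions(names, n)]
    print("%d partitions: %s" % (n, sizes))
```

Equal keys always land in the same bucket, which is what lets `distinct` and `reduceByKey` finish their work per partition after the shuffle.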


But we already had the names cached (see '.cache()' above), so let's see what difference that makes to the time for counting names. Again, nn holds the ten times larger namesMedium.

In [14]:
start = time.time()
nn.distinct(numPartitions=1).map(lambda name: (name[0], 1)).reduceByKey(lambda a, b: a + b).collect()
end = time.time()
print(end - start)

10.1247398853


#### Now with 8 partitions.

In [17]:
start = time.time()
nn.distinct(numPartitions=8).map(lambda name: (name[0], 1)).reduceByKey(lambda a, b: a + b).collect()
end = time.time()
print(end - start)

10.2350699902


From ? to ? seconds, compared with using no caching, groupByKey and no partitioning.

Do you see any new tuning effects?