# Homework 3: PySpark - II
### CS186, UC Berkeley, Spring 2016
### Due: Thursday Feb 25, 2016, 11:59 PM
### Note: **This homework is to be done individually!  Do not modify any existing method signatures.**
### **This is the second of two .ipynb files in this homework.**

In [60]:
## On some computers it may be possible to run this lab 
## locally by using this script; you will need to run
## this each time you start the notebook.
## You do not need to run this on inst machines.

# from local_install import setup_environment
# setup_environment()

In [61]:
import pyspark
from utils import SparkContext as sc

In [62]:
from utils import CleanRDD
from utils import tests

# Part 3: CacheMap

In this part, we'll construct an rdd that is backed by a `ClockMap` and will behave like `rdd.map(func)`.  
First, implement the `ClockMap` class so that it maintains a cache (of limited `cacheSize`) using the clock replacement policy.
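
As a warm-up, the clock policy itself can be traced on a short reference string. The following is a minimal sketch, not the required `ClockMap` interface: the function name `clock_trace` and its return value are hypothetical and chosen only for illustration. It assumes that a frame's reference bit is set both when a page is loaded and when it is hit.

```python
def clock_trace(refs, size):
    """Simulate clock replacement over a list of page references.

    Returns (hits, misses, final frame contents). Hypothetical helper.
    """
    frames = [None] * size     # page held by each frame
    ref_bit = [False] * size   # one reference bit per frame
    hand = 0                   # position of the clock hand
    hits = misses = 0
    for page in refs:
        if page in frames:
            # Hit: give the page a second chance on the next sweep.
            ref_bit[frames.index(page)] = True
            hits += 1
            continue
        misses += 1
        # Sweep: clear set bits until a frame with a cleared bit is found.
        while ref_bit[hand]:
            ref_bit[hand] = False
            hand = (hand + 1) % size
        frames[hand] = page
        ref_bit[hand] = True
        hand = (hand + 1) % size
    return hits, misses, frames

print(clock_trace([1, 2, 3, 1, 4, 5], 3))  # (1, 5, [4, 5, 3])
```

In this trace every bit is set when page 4 arrives, so the hand sweeps the full circle, clears all three bits, and evicts page 1 even though it was just hit.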

### * BEGIN STUDENT CODE *

In [63]:
class ClockMap:
    
    def __init__(self, cacheSize, func):
        """
        Do not change existing variables.
        [Optional] You are free to add additional items and methods.
        """
        self.cacheSize = cacheSize
        self.fn = func
        self._p = 0 # pointer
        self._increments = 0
        self._miss_count = 0
        self.buffers = [[None, 0] for x in range(cacheSize)]
        self.items_to_index = {}
        
    def _increment(self):
        """
        Do not change this method.
        Updates the clock pointer. The modulo maintains the clock nature.
        """
        self._increments += 1
        self._p = (self._p + 1) % self.cacheSize

    def __getitem__(self, k):
        """
        Returns func(k) using the buffer to cache limited results.
        
        :param k: Value to be evaluated
        
        >>> clock = ClockMap(4, lambda x: x ** 2)
        >>> clock[4]
        16
        >>> clock[3]
        9
        >>> clock._p
        2
        """
        
        # First check if in cache (only slots that have been filled can match,
        # so a key of None never matches an empty slot)
        for i in range(len(self.buffers)):
            if i in self.items_to_index and self.buffers[i][0] == k:
                # Cache hit: set the reference bit of this slot
                self.items_to_index[i] = True
                return self.buffers[i][1]
            
        # Else, cache miss
        self._miss_count += 1
        
        # Move pointer around, to find where to replace
        while self.items_to_index.get(self._p):
            # Second chance! Clear the reference bit and move on
            self.items_to_index[self._p] = False
            self._increment()
        
        # Found where to replace, now replace. And don't forget to set its reference bit. 
        result = self.fn(k)
        self.buffers[self._p] = [k, result]
        self.items_to_index[self._p] = True
        
        self._increment()
        return result


        
        

Now implement `cacheMap`, which will return an rdd.

In [64]:
def cacheMap(rdd, cacheSize, func):
    """
    Returns an RDD that behaves like rdd.map(func) but
    is implemented using the ClockMap.
    
    :param rdd: Given RDD
    :param cacheSize: Number of cache/buffer pages in the ClockMap
    :param func: Function to map with
    """
    
    def iterate(y, itr):
        clock = ClockMap(cacheSize, func)
        for x in itr:
            yield clock[x]
            
    return rdd.mapPartitionsWithIndex(iterate)

### * END STUDENT CODE *
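
The point of building the cache inside the per-partition function is that each partition gets its own bounded cache, created once and reused for every element of that partition. The sketch below illustrates this in plain Python with lists standing in for partitions. It swaps in `functools.lru_cache` (an LRU policy, not the clock policy) purely to keep the example short, and `cache_map_partitions` and `slow_square` are hypothetical names.

```python
from functools import lru_cache

def cache_map_partitions(partitions, cache_size, func):
    """Apply func to every element, building one bounded cache per partition."""
    results = []
    for part in partitions:
        # A fresh cache per partition, shared by all elements in it.
        cached = lru_cache(maxsize=cache_size)(func)
        results.append([cached(x) for x in part])
    return results

calls = []  # records every real evaluation of the function

def slow_square(x):
    calls.append(x)
    return x * x

out = cache_map_partitions([[1, 1, 2, 1], [2, 2]], cache_size=2, func=slow_square)
print(out)         # [[1, 1, 4, 1], [4, 4]]
print(len(calls))  # 3
```

The value 2 is computed once in each partition because caches are not shared across partitions.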

Free test for you!

In [65]:
clock = ClockMap(4, lambda x: x ** 2)
print clock[4], clock[3]
print clock._p

16 9
2


Output should be:
```
16 9
2
```

# Part 4: External Algorithms

You'll need an understanding of the partitioning step of external hashing, and the divide step of external sorting (recall the lecture on external algorithms).
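
As a refresher, the divide step of external sorting writes sorted runs to disk, and a later merge step combines them lazily. The following is a minimal standalone sketch that does not use the homework's `utils` module: it assumes Python 3.5+ (for the `key` argument of the standard library's `heapq.merge`), uses `pickle` as a stand-in serializer, and bounds each run by an element count (`run_size`) rather than by measured memory. The names `external_sort` and `_read_run` are hypothetical.

```python
import heapq
import os
import pickle
import tempfile

def _read_run(path):
    """Lazily yield the pickled items stored in one run file."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

def external_sort(items, run_size, key=None):
    """Yield items in sorted order, holding at most run_size items per run."""
    tmpdir = tempfile.mkdtemp()
    paths = []
    run = []

    def flush():
        # Divide step: sort the in-memory run and spill it to its own file.
        run.sort(key=key)
        path = os.path.join(tmpdir, "run%d" % len(paths))
        with open(path, "wb") as f:
            for item in run:
                pickle.dump(item, f)
        paths.append(path)
        del run[:]

    for item in items:
        run.append(item)
        if len(run) >= run_size:
            flush()
    if run:
        flush()

    try:
        # Merge step: only one item per run is in memory at a time.
        for item in heapq.merge(*[_read_run(p) for p in paths], key=key):
            yield item
    finally:
        for p in paths:
            os.remove(p)
        os.rmdir(tmpdir)

print(list(external_sort([5, 3, 9, 1, 7, 2], run_size=2)))  # [1, 2, 3, 5, 7, 9]
```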

In [66]:
from utils import *
import itertools
import bisect
import os

The following are some tools you may want to use (example use cases included). You should Google the unfamiliar ones!

In [67]:
# itertools.islice
generator = (y for y in range(100))
test1 = itertools.islice(generator, 5)
print next(test1)
print next(test1)
test2 = itertools.islice(generator, 5)
print next(generator)
print next(test2)

0
1
2
3
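
The output interleaves because both `islice` objects draw lazily from the same underlying generator. The same property makes `islice` convenient for reading a stream in fixed-size batches. A minimal sketch follows, where `batches` is a hypothetical helper name:

```python
import itertools

def batches(iterable, size):
    """Yield successive lists of up to `size` items from one shared iterator."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk

print(list(batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

A chunk shorter than `size` signals that the stream is exhausted.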


In [68]:
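# heapq.merge (note: this call passes a list of iterables plus key/reverse,
# unlike the standard library's heapq.merge(*iterables); the heapq name used
# here is presumably the one made available by `from utils import *`)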
generator1 = (odd for odd in range(100) if odd % 2)
generator2 = (even for even in range(100)[::2])
key = lambda x: x
test2 = heapq.merge([generator1, generator2], key=key, reverse=False)
next(test2)

0
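
For comparison, here is the same idea with the standard library's `heapq.merge`, which takes the sorted inputs as separate positional arguments. This is a sketch assuming Python 3.5+, where `key` and `reverse` are available as keyword arguments. Each input must already be sorted in the requested order.

```python
import heapq

odds = (n for n in range(10) if n % 2)
evens = (n for n in range(0, 10, 2))

# Inputs are passed as separate arguments, not as one list.
merged = list(heapq.merge(odds, evens))
print(merged)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# With reverse=True, each input must be sorted in descending order.
descending = list(heapq.merge([9, 5, 1], [8, 4], reverse=True))
print(descending)  # [9, 8, 5, 4, 1]
```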

In [69]:
# bisect.bisect_left
buckets = [2, 4, 4]
print "If we insert 3, it goes to %d" % bisect.bisect_left(buckets, 3)
print "If we insert 1, it goes to %d" % bisect.bisect_left(buckets, 1)
print "If we insert 4, it goes to %d" % bisect.bisect_left(buckets, 4)

buckets2 = [(1, 2), (3, 4), (5, 6)]
print bisect.bisect_left(buckets2, (0, 0))
print bisect.bisect_left(buckets2, (5, 7))

If we insert 3, it goes to 1
If we insert 1, it goes to 0
If we insert 4, it goes to 1
0
3
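
One way to read these results: with a sorted list of boundaries, `bisect_left` returns the index of the bucket a key falls into, and a key equal to a boundary lands in the lower bucket. A minimal sketch, in which `assign_bucket` and the sample values are made up for illustration:

```python
import bisect

def assign_bucket(boundaries, key):
    """Return the bucket index for key, given ascending bucket boundaries."""
    return bisect.bisect_left(boundaries, key)

boundaries = [10, 20, 30]               # 3 boundaries define 4 buckets
counts = [0] * (len(boundaries) + 1)
for key in [5, 10, 11, 20, 25, 30, 31, 99]:
    counts[assign_bucket(boundaries, key)] += 1

print(counts)  # [2, 2, 2, 2]
```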


In [70]:
# RDD.sample
rdd = sc.parallelize(range(100))
fraction = 0.1
rdd.sample(False, fraction).collect()

[6, 12, 16, 17, 22, 43, 45, 47, 52, 64, 81, 88, 97]
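
Note that the sample above has 13 elements rather than exactly 10: `fraction` is an expected proportion, not a guaranteed size. The sketch below assumes that sampling without replacement keeps each element independently with probability `fraction` (Bernoulli sampling). `bernoulli_sample` is a hypothetical plain-Python stand-in, not part of the Spark API.

```python
import random

def bernoulli_sample(items, fraction, seed=None):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]

# The sample size varies from run to run around fraction * len(items).
sizes = [len(bernoulli_sample(range(100), 0.1, seed=s)) for s in range(5)]
print(sizes)
```

Under this scheme a fraction of 1.0 keeps every element and a fraction of 0 keeps none.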

In [71]:
# Serializer and os.unlink (Serializer is provided via utils.GeneralTools)
generator1 = (odd for odd in range(100) if odd % 2)
filename = "temp"
with open(filename, "w") as f:
    serializer.dump_stream(generator1, f)

with open(filename, "r") as f:
    stream = serializer.load_stream(f)
    print next(stream)

os.unlink(filename)

1
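
If the provided serializer is not at hand, the same dump/load pattern can be tried with `pickle`. This is a stand-in sketch only: the `dump_stream` and `load_stream` functions below are hypothetical and are not the ones from `utils`. The file is opened in binary mode, which `pickle` requires on Python 3.

```python
import os
import pickle
import tempfile

def dump_stream(items, fileobj):
    """Write each item to fileobj as its own pickle record."""
    for item in items:
        pickle.dump(item, fileobj)

def load_stream(fileobj):
    """Lazily read items back, one pickle record at a time."""
    while True:
        try:
            yield pickle.load(fileobj)
        except EOFError:
            return

fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, "wb") as f:
    dump_stream((n for n in range(100) if n % 2), f)

with open(path, "rb") as f:
    stream = load_stream(f)
    first = next(stream)   # only the first record has been read so far
    rest = list(stream)

os.unlink(path)
print(first)  # 1
```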


In [72]:
# get_used_memory - returns an int in MB
get_used_memory()

74

No need to modify the following function - it should come in handy!

In [73]:
def get_sort_dir(partId, n):
    """
    Returns a path for a temporary file, creating its directory if needed.

    :param partId: Partition index; used as the subdirectory name
    :param n: Unique identification for file
    """
    d = "tmp/sort/" + str(partId) + "/"
    if not os.path.exists(d):
        os.makedirs(d)
    return os.path.join(d, str(n))

### * BEGIN STUDENT CODE *

In [51]:
def externalSortStream(iterator, partId=0, reverse=False, keyfunc=None, serial=serializer, limit=10, batch=100):
    """
    Given an iterator, returns an iterator of sorted elements (according to parameters). 
    
    :param iterator: iterator. Expects (Key, Value).
    :param partId: partition index, used to name the temporary run files.
    :param keyfunc: function applied to the key.
    :param reverse: Reverse default ordering if true. (default is ascending; reverse is descending) 
    :param serial: serializer used to write and read the runs. See README.
    :param limit: memory limit in MB (compared against get_used_memory()).
    :param batch: Number of elements to read at a time.
    """
    
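    # Fall back to the identity so that the default keyfunc=None does not
    # fail when sorting or merging.
    if keyfunc is None:
        keyfunc = lambda x: x
    
    all_runs = [] # can be used to hold a list of iterators
    all_runs_paths = [] # paths of the run files, so they can be removed later
    run = [] # used to hold the current run of elements
    length_c = 0 # size of the most recently read batch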
    
    def load(fileobj):
        """
        Returns a generator object that outputs elements 
        from a serialized (saved) stream. Closes the file when done.
        
        :param fileobj: open Python file object
        """
        for item in serial.load_stream(fileobj):
            yield item
        fileobj.close()
   
    
    
    # Get all runs into all_runs
    while True:
        
        # Fill up a single run
        while True:
            
            c = list(itertools.islice(iterator, batch))
            length_c = len(c)
        
            # Load up
            run = run + c
        
            # End case: hit memory limit, or nothing left to stream
            if get_used_memory() > limit or length_c < batch: 
                break
        
        
        # Sort run and save into stream, using the serializer that was passed in.
        sorted_run = sorted(run, key=lambda x: keyfunc(x[0]), reverse=reverse)
        srun_path = get_sort_dir(partId, len(all_runs))
        with open(srun_path, "w") as f:
            serial.dump_stream(sorted_run, f)
        
        # Link into all_runs (the run is fully written and closed at this point)
        f = open(srun_path, "r")
        all_runs.append(load(f))
        all_runs_paths.append(srun_path)
        

        # Keep making more runs, until nothing left to stream
        if length_c < batch:
            break
    
        # Clear out run
        run = []
        
        
        
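    # The run files are already open for reading, so on POSIX systems unlinking
    # them here only removes the directory entries; their contents stay
    # readable while the merge below consumes them.
    for path in all_runs_paths:
        os.unlink(path)
    
    return heapq.merge(all_runs, key=lambda x: keyfunc(x[0]), reverse=reverse)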

    

In [76]:
# Remember to run the import box above.

def partitionByKey(rdd, ascending=True, numPartitions=None, keyfunc=lambda x: x):
    """        
    Uses sampling to partition the elements by the return value of 
    keyfunc.

    :param ascending: Smallest first.
    :param numPartitions: Number of partitions of the returning RDD.
    :param keyfunc: function to be applied to the key.
    """
    # Base cases done.

    if numPartitions is None:
        numPartitions = rdd.getNumPartitions()

    if numPartitions == 1:
        if rdd.getNumPartitions() > 1:
            rdd = rdd.coalesce(1)
        return rdd
    
    # Compare on keyfunc values, kept in ascending order for bisect.
    boundaries = sorted(keyfunc(b) for b in getBuckets(rdd, ascending, numPartitions, keyfunc))
    
    def bucket(key):
        p = bisect.bisect_left(boundaries, keyfunc(key))
        # For descending output, partition 0 must receive the largest keys.
        return p if ascending else len(boundaries) - p
    
    return rdd.partitionBy(numPartitions, bucket)


def getBuckets(rdd, ascending=True, numPartitions=None, keyfunc=lambda x: x):
    """        
    [Optional] Returns a list of at most (numPartitions - 1) bucket boundaries,
    in an order as specified by the given parameters: ascending, keyfunc. 
    Bucket boundaries are determined by sampling as specified in the README.

    :param ascending: Smallest first.
    :param numPartitions: Number of partitions of the returning RDD.
    :param keyfunc: function to be applied to the key.
    """
    # Base cases done.
    if numPartitions is None:
        numPartitions = rdd.getNumPartitions()
    
    count = rdd.count()
    if count == 0:
        return []
    
    # Sample about 10 per partition. Float division keeps Python 2 from
    # flooring the ratio to an integer; the fraction is capped at 1.0.
    fraction = min(1.0, (10.0 * numPartitions) / count)
    samples = rdd.sample(False, fraction).collect()
        
    # Order the sampled keys (not the whole (key, value) pairs) by keyfunc.
    sorted_keys = sorted((kv[0] for kv in samples), key=keyfunc, reverse=not ascending)
    
    skip_by = max(1, len(sorted_keys) // numPartitions)
          
    boundaries = [sorted_keys[i] for i in range(skip_by - 1, len(sorted_keys) - 1, skip_by)]
    return boundaries[:numPartitions - 1]
    


In [78]:
def sortByKey(rdd, ascending=True, numPartitions=None, keyfunc=lambda x: x):
    """
    Returns an RDD after executing an external sort using 
    functions partitionByKey and externalSortStream. 

    :param ascending: Smallest first.
    :param numPartitions: Number of partitions of the returning RDD.
    :param keyfunc: function to be applied to the key.
    
    """
    
    def iterate(y, itr):
        return externalSortStream(itr, y, not ascending, keyfunc)

    # Pass the caller's settings through so the partition order matches the sort order.
    return partitionByKey(rdd, ascending, numPartitions, keyfunc).mapPartitionsWithIndex(iterate)

### * END STUDENT CODE *

Here are tests for `partitionByKey` and `externalSortStream`:

In [79]:
test_stream = ((i, i) for i in range(100))
list(externalSortStream(test_stream, keyfunc=(lambda x: abs(50 - (x ** 2)))))[:10]

[(7, 7),
 (6, 6),
 (8, 8),
 (5, 5),
 (9, 9),
 (4, 4),
 (3, 3),
 (2, 2),
 (1, 1),
 (0, 0)]

Your output should be:
```
[(7, 7),
 (6, 6),
 (8, 8),
 (5, 5),
 (9, 9),
 (4, 4),
 (3, 3),
 (2, 2),
 (1, 1),
 (0, 0)]
```

In [87]:
rdd = CleanRDD(sc.parallelize(range(20), 4).map(lambda x: (x * 37 % 6, x ** 3 % 34)))
print("rdd count = " + str(rdd.count()))

def iterate(y, itr):
    for item in itr:
        yield (y, item)
            
    
partitioned = partitionByKey(rdd, ascending=True, numPartitions=5)
print("numPartitions = " + str(partitioned.getNumPartitions()))
partitioned.mapPartitionsWithIndex(iterate).collect()



# FOR KICKS
newRdd = partitionByKey(rdd, ascending=True)
def counterFunction(y, iterator):
    count = 0
    for item in iterator:
        count += 1
    yield count
print(newRdd.mapPartitionsWithIndex(counterFunction).collect())


rdd count = 20
numPartitions = 5
[8, 3, 6, 3]


Your output should look rather well-distributed. Try forcing a skewed distribution and observe how effective the partitioning is.
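
To see what skew does without starting Spark, the boundary selection and bucket assignment can be simulated in plain Python. The sketch below is a simplification under stated assumptions: it treats the full key list as the sample, picks evenly spaced boundaries, and uses hypothetical helper names (`pick_boundaries`, `partition_sizes`).

```python
import bisect

def pick_boundaries(sorted_keys, num_partitions):
    """Pick up to (num_partitions - 1) evenly spaced boundaries."""
    step = max(1, len(sorted_keys) // num_partitions)
    picks = [sorted_keys[i] for i in range(step - 1, len(sorted_keys) - 1, step)]
    return picks[:num_partitions - 1]

def partition_sizes(keys, num_partitions):
    """Count how many keys land in each range partition."""
    boundaries = pick_boundaries(sorted(keys), num_partitions)
    sizes = [0] * num_partitions
    for k in keys:
        sizes[bisect.bisect_left(boundaries, k)] += 1
    return sizes

uniform = list(range(20))
skewed = [0] * 17 + [1, 2, 3]   # one hot key dominates

print(partition_sizes(uniform, 4))  # [5, 5, 5, 5]
print(partition_sizes(skewed, 4))   # [17, 0, 0, 3]
```

With the hot key, all three boundaries collapse to the same value, so two partitions stay empty: range partitioning cannot split the records of a single key across partitions.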

Here's a test for `sortByKey`:

In [88]:
rdd = CleanRDD(sc.parallelize(range(100), 4).map(lambda x: (x *((-1) ** x) , x)))
sortByKey(rdd, keyfunc=lambda key: key, ascending=False).collect()[-10:]

[(-81, 81),
 (-83, 83),
 (-85, 85),
 (-87, 87),
 (-89, 89),
 (-91, 91),
 (-93, 93),
 (-95, 95),
 (-97, 97),
 (-99, 99)]

Your output should be:
```
[(-81, 81),
 (-83, 83),
 (-85, 85),
 (-87, 87),
 (-89, 89),
 (-91, 91),
 (-93, 93),
 (-95, 95),
 (-97, 97),
 (-99, 99)]
```

# Testing

In [90]:
tests.test3ClockMap(ClockMap)
tests.test3CacheMap(cacheMap)
tests.test4(sortByKey)

Task 3: PASS - task3ClockMap.txt matched reference output.
Task 3: PASS - task3CacheMap.txt matched reference output.
Task 4: PASS - task4.txt matched reference output.
