# Basics of Spark on HDInsight

<a href="http://spark.apache.org/" target="_blank">Apache Spark</a> is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. When you provision a Spark cluster in HDInsight, you provision Azure compute resources with Spark installed and configured. The data to be processed is stored in Azure Blob storage (WASB).



Now that you have created a Spark cluster, let us go over some basics of working with Spark on HDInsight. For a detailed discussion of working with Spark, see the [Spark Programming Guide](http://spark.apache.org/docs/2.0.0/sql-programming-guide.html).

----------
## Notebook setup

When you use PySpark kernel notebooks on HDInsight, you do not need to create a SparkContext or a SparkSession. A SparkSession, which contains the SparkContext, is created for you automatically when you run the first code cell, and its progress is printed. The contexts are available under the following variable names:
- SparkSession (spark)
- SparkContext (sc)

To run the cells below, place the cursor in the cell and then press **SHIFT + ENTER**.

## How do I make an RDD?

RDDs can be created from stable storage or by transforming other RDDs. Run the cells below to create RDDs from the sample data files available in the storage container associated with your Spark cluster. One such sample data file is available on the cluster at `wasb:///example/data/fruits.txt`.  The /// notation reads data from the default container.

In [1]:
from pyspark import SparkContext
from pyspark.sql import Row, SparkSession, SQLContext
# Note: this wildcard import shadows Python built-ins such as sum, min, and max.
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Local mode only: on HDInsight the kernel already provides `sc` and `spark`.
sc = SparkContext("local", 'app')
spark = SparkSession.builder.appName('name').config('spark.sql.shuffle.partitions', 10).getOrCreate()

In [2]:
# In local mode:
fruits = sc.textFile('../data/fruits.txt')
yellowThings = sc.textFile('../data/yellowthings.txt')
print fruits.collect()
print yellowThings.collect()

[u'apple', u'banana', u'canary melon', u'grape', u'lemon', u'orange', u'pineapple', u'strawberry']
[u'banana', u'bee', u'butter', u'canary melon', u'gold', u'lemon', u'pineapple', u'sunflower']


In [4]:
# You can also read from other containers.
# The 'cluster' container under the storage account 'msbd' has been made public.
# On the cluster, a public container is read with a URI of the form
# wasb://<container>@<account>.blob.core.windows.net/<path>
# The file can also be accessed from the web at:
# https://msbd.blob.core.windows.net/cluster/data/course.txt
# In local mode, read a local copy of the file instead:

txtfile = sc.textFile('../data/course.txt')
txtfile.collect()

[u'Course Information',
 u'Lecture videos',
 u'',
 u'Course Description',
 u'Big data systems, including cloud computing and parallel data processing',
 u'frameworks, emerge as enabling technologies in managing and mining the ',
 u'massive amount of data across hundreds or even thousands of commodity ',
 u'servers in data centers. This course exposes students to both the theory ',
 u'and hands-on experience of this new technology.']

----------
## PySpark magics 

The PySpark kernel provides some predefined “magics”, which are special commands that you can call with `%%` (e.g. `%%MAGIC` <args>). The magic command must be the first word in a code cell, and it allows for multiple lines of content. You can’t put comments before a cell magic.

For more information on magics, see [here](http://ipython.readthedocs.org/en/stable/interactive/magics.html).

In [6]:
%%info

UsageError: Line magic function `%info` not found.


### Session configuration (%%configure)
 
Use the `%%configure` magic to configure new or existing Livy sessions.
* If a session is already running, you can change the configuration by using the `-f` argument with `%%configure` magic. This will delete the current session and recreate it with the applied configurations. If you don't provide the `-f` argument, an error will be displayed and no configuration changes will be applied.
* If you haven't already started the session, then the `-f` argument is not mandatory. Even if you use it with a session that you are just creating, it will not delete any currently running sessions.

These are some session attributes that can be used for configuration:
- **"name"**: Name of the application
- **"driverMemory"**: Memory for driver (e.g. 1000M, 2G) 
- **"executorMemory"**: Memory for executor (e.g. 1000M, 2G) 
- **"executorCores"**: Number of cores used by executor
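
The body of a `%%configure` cell is a JSON object built from attributes like the ones above. As a minimal sketch, the snippet below assembles such a body in plain Python so the JSON can be checked before it is pasted into a cell. The application name and the memory and core values are illustrative assumptions, not recommended settings.

```python
import json

# Hypothetical session settings: the keys are the attributes listed above,
# the values are placeholders chosen only for illustration.
session_config = {
    "name": "fruits-demo",
    "driverMemory": "2G",
    "executorMemory": "2G",
    "executorCores": 4,
}

# Serialize to the JSON text that would follow `%%configure -f` in a cell.
config_body = json.dumps(session_config, sort_keys=True)
print("%%configure -f")
print(config_body)
```

Building the body with `json.dumps` avoids hand-typed quoting mistakes, which would otherwise surface only when the session is recreated.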

In [8]:
%%Configure -f
{"executorCores":4}

UsageError: Cell magic `%%Configure` not found.


----------

## 1. RDD operations

In [9]:
# map
fruitsReversed = fruits.map(lambda fruit: fruit[::-1])

In [18]:
fruitsReversed.persist()
# try changing the file and re-execute with and without cache
print fruitsReversed.collect()

[u'elppa', u'ananab', u'nolem yranac', u'eparg', u'nomel', u'egnaro', u'elppaenip', u'yrrebwarts']


In [12]:
# filter
shortFruits = fruits.filter(lambda fruit: len(fruit) <= 5)
shortFruits.collect()

[u'apple', u'grape', u'lemon']

In [17]:
# flatMap
characters = fruits.flatMap(lambda fruit: list(fruit))
print characters.collect()

[u'a', u'p', u'p', u'l', u'e', u'b', u'a', u'n', u'a', u'n', u'a', u'c', u'a', u'n', u'a', u'r', u'y', u' ', u'm', u'e', u'l', u'o', u'n', u'g', u'r', u'a', u'p', u'e', u'l', u'e', u'm', u'o', u'n', u'o', u'r', u'a', u'n', u'g', u'e', u'p', u'i', u'n', u'e', u'a', u'p', u'p', u'l', u'e', u's', u't', u'r', u'a', u'w', u'b', u'e', u'r', u'r', u'y']


In [22]:
# union: return a new RDD containing all items from both RDDs; duplicates are kept
fruitsAndYellowThings = fruits.union(yellowThings)
print fruitsAndYellowThings.collect()

[u'apple', u'banana', u'canary melon', u'grape', u'lemon', u'orange', u'pineapple', u'strawberry', u'banana', u'bee', u'butter', u'canary melon', u'gold', u'lemon', u'pineapple', u'sunflower']


In [19]:
# intersection
yellowFruits = fruits.intersection(yellowThings)
yellowFruits.collect()

[u'lemon', u'canary melon', u'pineapple', u'banana']

In [24]:
# distinct  remove duplicates
distinctFruitsAndYellowThings = fruitsAndYellowThings.distinct()
print distinctFruitsAndYellowThings.collect()

[u'butter', u'grape', u'sunflower', u'gold', u'orange', u'lemon', u'apple', u'canary melon', u'bee', u'strawberry', u'pineapple', u'banana']


## 2. RDD actions (to trigger the operations)
Transformations are lazy; an action triggers their execution and returns a result to the driver. The following are examples of some common actions. For a detailed list, see [RDD Actions](https://spark.apache.org/docs/2.0.0/programming-guide.html#actions).

Run the actions below to understand this better. Place the cursor in the cell and press **SHIFT + ENTER**.

In [25]:
# collect   Return all the elements of the dataset as an array 
fruitsArray = fruits.collect()
yellowThingsArray = yellowThings.collect()
print fruitsArray

[u'apple', u'banana', u'canary melon', u'grape', u'lemon', u'orange', u'pineapple', u'strawberry']


In [26]:
# count
numFruits = fruits.count()
numFruits

8

In [27]:
# take    Return an array with the first n elements of the dataset.
first3Fruits = fruits.take(3)
first3Fruits

[u'apple', u'banana', u'canary melon']

In [23]:
# reduce  Aggregate the elements of the dataset using a function func (which takes two arguments and returns one)
letterSet = fruits.map(lambda fruit: set(fruit)).reduce(lambda x, y: x.union(y))  # set union removes duplicates
letterSet

set([u'a', u' ', u'c', u'b', u'e', u'g', u'i', u'm', u'l', u'o', u'n', u'p', u's', u'r', u't', u'w', u'y'])

In [24]:
letterSet = fruits.flatMap(lambda fruit: list(fruit)).distinct().collect()
letterSet

[u'a', u'c', u'e', u'g', u'i', u'm', u'o', u's', u'w', u'y', u' ', u'b', u'l', u'n', u'p', u'r', u't']
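
The last two cells compute the same set of letters in two ways: `map` plus `reduce` with set union, and `flatMap` plus `distinct`. The plain-Python sketch below mirrors both approaches without Spark, assuming the eight fruit names shown in the earlier output, to show that they agree.

```python
from functools import reduce

fruits = ['apple', 'banana', 'canary melon', 'grape',
          'lemon', 'orange', 'pineapple', 'strawberry']

# Approach 1: map each fruit to its set of letters, then reduce with set union.
letters_by_reduce = reduce(lambda x, y: x.union(y), (set(f) for f in fruits))

# Approach 2: flatten every fruit into characters, then drop duplicates.
letters_by_distinct = set(ch for f in fruits for ch in f)

print(sorted(letters_by_reduce))
print(letters_by_reduce == letters_by_distinct)
```

The two results differ only in type and ordering: `reduce` returns a Python set on the driver, while `distinct().collect()` returns a list.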

## 3. Closure

In [28]:
counter = 0
rdd = sc.parallelize(xrange(10))

# Wrong: don't do this! Each task updates its own copy of counter, not the driver's.
def increment_counter(x):
    global counter
    counter += x

rdd.foreach(increment_counter)

print counter

0
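
The counter prints 0 because Spark serializes the function together with the variables it references and sends a copy to each task. The tasks update their copies, not the driver's variable. The plain-Python sketch below imitates that behavior with `copy.deepcopy`. The names `run_task` and `driver_state` and the two-way split of the data are hypothetical stand-ins, not Spark APIs.

```python
import copy

def run_task(partition, task_state):
    # Each task mutates only the copy of the state it was shipped.
    for x in partition:
        task_state["counter"] += x
    return task_state["counter"]

driver_state = {"counter": 0}
partitions = [list(range(0, 5)), list(range(5, 10))]

# Simulate shipping the closure: every task receives its own deep copy.
partial_sums = [run_task(p, copy.deepcopy(driver_state)) for p in partitions]

print(driver_state["counter"])  # the driver's counter was never touched
print(sum(partial_sums))        # results must be sent back and combined
```

Sending partial results back to the driver and combining them there is, roughly, what an accumulator or an action such as `sum()` does for you.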


In [29]:
rdd = sc.parallelize(xrange(10))
accum = sc.accumulator(0)

def g(x):
    global accum
    accum += x

a = rdd.foreach(g)

print accum.value

45


In [32]:
rdd = sc.parallelize(xrange(10))
accum = sc.accumulator(0)

def g(x):
    global accum
    accum += x
    return x * x

a = rdd.map(g)
print accum.value

0
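
Here the accumulator stays at 0 because `map` is a lazy transformation and no action was run on `a`, so `g` was never called. The sketch below shows the same effect in plain Python with a generator expression standing in for the lazy RDD; the `calls` list is a hypothetical substitute for the accumulator.

```python
calls = []

def g(x):
    calls.append(x)   # side effect, standing in for `accum += x`
    return x * x

# Building the pipeline does not call g, just as rdd.map(g) does not.
squares = (g(x) for x in range(10))
calls_before = len(calls)

# Consuming the pipeline plays the role of an action such as collect().
result = list(squares)
calls_after = len(calls)

print(calls_before)
print(calls_after)
```

In the Spark cell, calling an action such as `a.count()` before reading `accum.value` would run `g` on every element.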


In [33]:
from operator import add

rdd = sc.parallelize(xrange(10))

print rdd.sum()

45


In [37]:
A = sc.parallelize(xrange(10))
print A.collect()

x = 5
B = A.filter(lambda z: z < x)
#B.cache()
print B.take(10)
x = 3
print B.count()
print B.collect() 

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4]
3
[0, 1, 2]
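
The output above suggests that the lambda reads `x` when a job runs, not when `filter` was called: `B` is not cached, so each action re-evaluates the filter with the current value of `x`. Plain Python closures behave the same way. The sketch below contrasts a late-binding lambda with one that freezes the value through a default argument; the names `late` and `bound` are illustrative.

```python
A = list(range(10))

x = 5
late = lambda z: z < x                  # looks up x each time it is called
bound = lambda z, limit=x: z < limit    # freezes the value x had here

x = 3
late_result = [z for z in A if late(z)]
bound_result = [z for z in A if bound(z)]

print(late_result)
print(bound_result)
```

Binding the value explicitly is one way to keep a later reassignment of `x` from changing the filter; uncommenting `B.cache()` in the cell above is worth trying to see how caching changes the results.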


In [38]:
A = sc.parallelize(xrange(10))
B = A.map(lambda x: x*2)
A = B.map(lambda x: x+1)
A.take(10)

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

In [39]:
# Linear-time selection
data = [34, 67, 21, 56, 47, 89, 12, 44, 74, 43, 26]
A = sc.parallelize(data,2)
k = 4

while True:
    x = A.first()
    A1 = A.filter(lambda z: z < x)
    A2 = A.filter(lambda z: z > x)
    mid = A1.count()
    if mid == k:
        print x
        break
    if k < mid:
        A = A1
    else:
        A = A2
        k = k - mid - 1
    A.cache()

43


In [13]:
sorted(data)

[12, 21, 26, 34, 43, 44, 47, 56, 67, 74, 89]
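
The loop above is a quickselect: it partitions the data around the first element and keeps only the side that contains the k-th smallest value. The sketch below is a plain-Python version of the same loop, with lists in place of RDDs. It assumes distinct values and `0 <= k < len(values)`; `select_kth` is a hypothetical helper name.

```python
def select_kth(values, k):
    # Return the element that would sit at index k (0-based) in sorted order.
    # Assumes distinct values and 0 <= k < len(values).
    while True:
        pivot = values[0]
        smaller = [v for v in values if v < pivot]
        larger = [v for v in values if v > pivot]
        mid = len(smaller)
        if mid == k:
            return pivot
        if k < mid:
            values = smaller
        else:
            values = larger
            k = k - mid - 1

data = [34, 67, 21, 56, 47, 89, 12, 44, 74, 43, 26]
print(select_kth(data, 4))
print(sorted(data)[4])
```

Both lines print the same value, which is the check the `sorted(data)` cell performs by eye.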

### 4. Computing Pi using Monte Carlo simulation

In [40]:
# From the official spark examples.
import sys
import random

partitions = 100
n = 100 * partitions

def f(_):
    x = random.random()
    y = random.random()
    return 1 if x ** 2 + y ** 2 < 1 else 0

count = sc.parallelize(xrange(1, n + 1), partitions) \
          .map(f).reduce(lambda a,b: a+b)

print "Pi is roughly", 4.0 * count / n

Pi is roughly 3.1664


In [48]:
# glom(): return an RDD created by assembling all elements within each partition into a list
rdd = sc.parallelize([1, 2, 3, 4], 2)
print rdd.glom().collect()

[[1, 2], [3, 4]]


In [49]:
# Correct version
partitions = 100
n = 100 * partitions

def f(index, it):  # it is an iterator over the elements of one partition
    random.seed(index + 9836)
    for i in it:
        x = random.random()
        y = random.random()
        yield 1 if x ** 2 + y ** 2 < 1 else 0

count = sc.parallelize(xrange(1, n + 1), partitions) \
          .mapPartitionsWithIndex(f).reduce(lambda a,b: a+b)

print "Pi is roughly", 4.0 * count / n

Pi is roughly 3.1488
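
The second version seeds the generator once per partition from the partition index, so each partition draws its own reproducible stream of random numbers. The sketch below reproduces that idea in plain Python with one `random.Random` instance per partition. It is an illustration under that assumption, not the Spark job itself, and `estimate_pi` is a hypothetical name.

```python
import random

def estimate_pi(partitions, samples_per_partition, base_seed=9836):
    inside = 0
    for index in range(partitions):
        # One independent generator per partition, seeded from its index.
        rng = random.Random(index + base_seed)
        for _ in range(samples_per_partition):
            x, y = rng.random(), rng.random()
            if x ** 2 + y ** 2 < 1:
                inside += 1
    return 4.0 * inside / (partitions * samples_per_partition)

first = estimate_pi(100, 100)
second = estimate_pi(100, 100)
print(first == second)   # identical seeds give identical estimates
```

Because every partition uses a different seed, no two partitions repeat the same sequence, and rerunning the whole estimate gives the same number.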


### 5. Key-Value Pairs

In [50]:
# reduceByKey: merge the values for each key using the given function
numFruitsByLength = fruits.map(lambda fruit: (len(fruit), 1)).reduceByKey(lambda x, y: x + y)
numFruitsByLength.collect()

[(9, 1), (10, 1), (12, 1), (5, 3), (6, 2)]
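
As a sketch of what `reduceByKey` computes, the plain-Python function below groups pairs by key and folds the values of each key with a two-argument function. `reduce_by_key` is a hypothetical helper, and the fruit list is assumed to match the sample file.

```python
def reduce_by_key(pairs, func):
    # Combine all values that share a key using the two-argument function.
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

fruits = ['apple', 'banana', 'canary melon', 'grape',
          'lemon', 'orange', 'pineapple', 'strawberry']

num_fruits_by_length = reduce_by_key(((len(f), 1) for f in fruits),
                                     lambda x, y: x + y)
print(sorted(num_fruits_by_length.items()))
```

The counts match the Spark output above; only the ordering differs, because Spark returns pairs in partition order rather than sorted by key.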

In [53]:
from operator import add

lines = sc.textFile('../data/course.txt')
counts = lines.flatMap(lambda x: x.split()) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
print counts.take(10)

[(u'and', 3), (u'Information', 1), (u'videos', 1), (u'computing', 1), (u'servers', 1), (u'exposes', 1), (u'course', 1), (u'as', 1), (u'including', 1), (u'frameworks,', 1)]


In [58]:
print counts.sortBy(lambda x: x[1], False).take(10)  # descending order by word frequency

[(u'data', 4), (u'and', 3), (u'of', 3), (u'in', 2), (u'Course', 2), (u'the', 2), (u'Information', 1), (u'videos', 1), (u'computing', 1), (u'servers', 1)]


In [59]:
# join: return a new RDD containing all pairs of elements having the same key in the original RDDs
products = sc.parallelize([(1, "Apple"), (2, "Orange"), (3, "TV"), (5, "Computer")])
#trans = sc.parallelize([(1, 134, "OK"), (3, 34, "OK"), (5, 162, "Error"), (1, 135, "OK"), (2, 53, "OK"), (1, 45, "OK")])
trans = sc.parallelize([(1, (134, "OK")), (3, (34, "OK")), (5, (162, "Error")), (1, (135, "OK")), (2, (53, "OK")), (1, (45, "OK"))])

print products.join(trans).collect()

[(2, ('Orange', (53, 'OK'))), (1, ('Apple', (134, 'OK'))), (1, ('Apple', (135, 'OK'))), (1, ('Apple', (45, 'OK'))), (3, ('TV', (34, 'OK'))), (5, ('Computer', (162, 'Error')))]
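
`join` is an inner join: every value on the left is paired with every value on the right that has the same key, and keys present on only one side are dropped. The sketch below expresses that in plain Python on the same data; `inner_join` is a hypothetical helper, not a Spark function.

```python
from collections import defaultdict

def inner_join(left, right):
    # Index the right side by key, then pair each left value with every match.
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]

products = [(1, "Apple"), (2, "Orange"), (3, "TV"), (5, "Computer")]
trans = [(1, (134, "OK")), (3, (34, "OK")), (5, (162, "Error")),
         (1, (135, "OK")), (2, (53, "OK")), (1, (45, "OK"))]

joined = inner_join(products, trans)
print(len(joined))
print(joined[0])
```

Product 1 appears three times in the result because it matches three transactions, which is the same multiplicity seen in the Spark output.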


### 6. K-means clustering

In [3]:
import numpy as np

def parseVector(line):
    return np.array([float(x) for x in line.split(' ')])

def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex

# The data file can be downloaded at http://www.cse.ust.hk/msbd5003/data/kmeans_data.txt
lines = sc.textFile('../data/kmeans_data.txt', 5)  

# The data file can be downloaded at http://www.cse.ust.hk/msbd5003/data/kmeans_bigdata.txt
# lines = sc.textFile('../data/kmeans_bigdata.txt', 5)  
# lines is an RDD of strings
K = 3
convergeDist = 0.01
# terminate the algorithm when the total squared distance from old centers to new centers is less than this value

data = lines.map(parseVector).cache() # data is an RDD of arrays

kCenters = data.takeSample(False, K, 1)  # initial centers as a list of arrays
tempDist = 1.0  # total distance from old centers to new centers

while tempDist > convergeDist:
    closest = data.map(lambda p: (closestPoint(p, kCenters), (p, 1)))
    # for each point in data, find its closest center
    # closest is an RDD of tuples (index of closest center, (point, 1))
        
    pointStats = closest.reduceByKey(lambda p1, p2: (p1[0] + p2[0], p1[1] + p2[1]))
    # pointStats is an RDD of tuples (index of center,
    # (array of sums of coordinates, total number of points assigned))
    
    newCenters = pointStats.map(lambda st: (st[0], st[1][0] / st[1][1])).collect()
    # compute the new centers
    
    tempDist = sum(np.sum((kCenters[i] - p) ** 2) for (i, p) in newCenters)
    # compute the total squared distance from old centers to new centers
    
    for (i, p) in newCenters:
        kCenters[i] = p
        
print "Final centers: ", kCenters


Final centers:  [array([ 0.05,  0.3 ,  0.05]), array([ 0.2,  0.4,  0.6]), array([ 9.1       ,  2.76666667,  6.16666667])]
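
Each pass of the loop above is an assign-then-average step. The sketch below performs that step in plain Python on four made-up 2-D points, so the map and reduce stages can be followed by hand. The points, the starting centers, and the helper names `closest_center` and `kmeans_step` are assumptions for illustration; they are not the contents of `kmeans_data.txt`.

```python
def closest_center(p, centers):
    # Index of the center with the smallest squared Euclidean distance to p.
    dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
    return dists.index(min(dists))

def kmeans_step(points, centers):
    # "Map" each point to its closest center, then "reduce" per center
    # by summing coordinates and counting points.
    stats = {}
    for p in points:
        i = closest_center(p, centers)
        total, count = stats.get(i, ((0.0,) * len(p), 0))
        stats[i] = (tuple(t + a for t, a in zip(total, p)), count + 1)
    # Centers with no assigned points stay where they were.
    new_centers = list(centers)
    for i, (total, count) in stats.items():
        new_centers[i] = tuple(t / count for t in total)
    # Total squared distance moved, used as the convergence measure.
    moved = sum(sum((a - b) ** 2 for a, b in zip(old, new))
                for old, new in zip(centers, new_centers))
    return new_centers, moved

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]

centers, moved = kmeans_step(points, centers)
print(centers)
print(moved)
```

A second call with the updated centers moves nothing, which is the condition the `while tempDist > convergeDist` loop waits for.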


### 7. PageRank

In [2]:
import re
from operator import add

def computeContribs(urls, rank):
    # Calculates URL contributions to the rank of other URLs.
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)

def parseNeighbors(urls):
    # Parse a "URL neighbor" string into a (URL, neighbor) pair.
    parts = urls.split(' ')
    return parts[0], parts[1]

# Loads in input file. It should be in format of:
#     URL         neighbor URL
#     URL         neighbor URL
#     URL         neighbor URL
#     ...

# The data file can be downloaded at http://www.cse.ust.hk/msbd5003/data/*
lines = sc.textFile("../data/pagerank_data.txt", 2)
# lines = sc.textFile("../data/dblp.in", 5)

numOfIterations = 10

# Load all URLs from the input file and group the neighbors of each URL.
links = lines.map(lambda urls: parseNeighbors(urls)) \
             .groupByKey()

# For every URL that links to other URL(s) in the input file,
# initialize its rank to one.
ranks = links.mapValues(lambda neighbors: 1.0)

# Calculates and updates URL ranks continuously using PageRank algorithm.
for iteration in range(numOfIterations):
    # Calculates URL contributions to the rank of other URLs.
    contribs = links.join(ranks) \
                    .flatMap(lambda url_urls_rank:
                             computeContribs(url_urls_rank[1][0],
                                             url_urls_rank[1][1]))
    # After the join, each element in the RDD is of the form
    # (url, (list of neighbor urls, rank))
    
    # Re-calculates URL ranks based on neighbor contributions.
    ranks = contribs.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15)
    # ranks = contribs.reduceByKey(add).map(lambda (url, rank): (url, rank * 0.85 + 0.15))

print ranks.top(5, lambda x: x[1])

[(u'1', 1.2981882732854677), (u'3', 0.9999999999999998), (u'4', 0.9999999999999998), (u'2', 0.7018117267145316)]
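
To follow the iteration without a cluster, the sketch below runs the same update rule (`rank * 0.85 + 0.15` over summed contributions) in plain Python with dictionaries instead of RDDs. The four-edge graph is a made-up example, not the contents of `pagerank_data.txt`, and `pagerank` is a hypothetical helper name.

```python
def pagerank(edges, iterations=10):
    # links: URL -> list of neighbor URLs (the groupByKey step).
    links = {}
    for src, dst in edges:
        links.setdefault(src, []).append(dst)
    ranks = {url: 1.0 for url in links}

    for _ in range(iterations):
        contribs = {}
        # Equivalent of links.join(ranks): only URLs that have both
        # out-links and a current rank contribute.
        for url, neighbors in links.items():
            if url not in ranks:
                continue
            share = ranks[url] / len(neighbors)
            for n in neighbors:
                contribs[n] = contribs.get(n, 0.0) + share
        # Equivalent of reduceByKey(add) followed by the damping step.
        ranks = {url: c * 0.85 + 0.15 for url, c in contribs.items()}
    return ranks

# Hypothetical graph: 'a' links to 'b' and 'c', and both link back to 'a'.
edges = [('a', 'b'), ('a', 'c'), ('b', 'a'), ('c', 'a')]
ranks = pagerank(edges)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this graph every page has both incoming and outgoing links, so the ranks keep summing to the number of pages; a page with no incoming links would drop out of `ranks` after the first iteration, as it would in the Spark version.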


### 8. Join vs. Broadcast Variables

In [1]:
products = sc.parallelize([(1, "Apple"), (2, "Orange"), (3, "TV"), (5, "Computer")])
trans = sc.parallelize([(1, (134, "OK")), (3, (34, "OK")), (5, (162, "Error")), (1, (135, "OK")), (2, (53, "OK")), (1, (45, "OK"))])

print trans.join(products).collect()


[(1, ((134, 'OK'), 'Apple')), (1, ((135, 'OK'), 'Apple')), (1, ((45, 'OK'), 'Apple')), (2, ((53, 'OK'), 'Orange')), (3, ((34, 'OK'), 'TV')), (5, ((162, 'Error'), 'Computer'))]


In [1]:
products = {1: "Apple", 2: "Orange", 3: "TV", 5: "Computer"}
trans = sc.parallelize([(1, (134, "OK")), (3, (34, "OK")), (5, (162, "Error")), (1, (135, "OK")), (2, (53, "OK")), (1, (45, "OK"))])

broadcasted_products = sc.broadcast(products)

def f(x):
    return (x[0], broadcasted_products.value[x[0]], x[1])

results = trans.map(lambda x: (x[0], broadcasted_products.value[x[0]], x[1]))
# results = trans.map(lambda x: (x[0], products[x[0]], x[1]))
print results.collect()


[(1, 'Apple', (134, 'OK')), (3, 'TV', (34, 'OK')), (5, 'Computer', (162, 'Error')), (1, 'Apple', (135, 'OK')), (2, 'Orange', (53, 'OK')), (1, 'Apple', (45, 'OK'))]
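
A broadcast variable ships one read-only copy of `products` to each executor, so the map above is a dictionary lookup per record rather than a shuffle of both datasets. One difference from `join` is worth noting: an inner join drops a transaction whose key has no product, while `products[key]` raises `KeyError`. The plain-Python sketch below shows both lookup styles; the transaction with key 4 and the `"unknown"` placeholder are hypothetical additions for illustration.

```python
products = {1: "Apple", 2: "Orange", 3: "TV", 5: "Computer"}
# Key 4 is a hypothetical transaction with no matching product.
trans = [(1, (134, "OK")), (3, (34, "OK")), (4, (20, "OK"))]

def lookup_strict(record):
    # Same shape as the map above: fails with KeyError on an unknown key.
    return (record[0], products[record[0]], record[1])

def lookup_lenient(record, default="unknown"):
    # dict.get keeps the record and substitutes a placeholder name.
    return (record[0], products.get(record[0], default), record[1])

lenient = [lookup_lenient(r) for r in trans]
print(lenient)

try:
    strict = [lookup_strict(r) for r in trans]
except KeyError as err:
    print("missing product id: %s" % err.args[0])
```

Which behavior is preferable depends on the data: failing fast surfaces bad keys early, while a default keeps the job running.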
