*Initialize the Spark context. Run this ONLY once per notebook. If the cell below fails, see [Help](/notebooks/spark_course/1-Course-Information-and-Links/If-you-get-problems-initiating-spark-context.ipynb)*

In [1]:
import os
from pyspark import SparkContext
sc = SparkContext(appName="search", master=os.environ['MASTER'])

## **Interactive Python Shell in the Browser**
A Python cell allows you to execute arbitrary Python commands just like in any Python shell.

In [2]:
print "The sum of 1 and 1 is %s" % (1+1) 

The sum of 1 and 1 is 2


#### Note: You can click in a cell at any time and make your own changes. Add cells when needed, e.g. when you want to try out something new or break some code up into several steps. IPyNB is a de facto IDE.

## Standard Python libraries can be imported:
Libraries that are not available by default can be uploaded to the Workspace.

In [3]:
import re
m = re.search('(?<=abc)def', 'abcdef')
m.group(0)

'def'

#### Note: This means that you can import whatever you need from the Python numerical libraries, matplotlib, etc. at any time.

# Spark Examples
These examples give a quick overview of the Spark API. They are "translated" to an Interactive Python Notebook from the examples on the Spark web site (http://spark.apache.org/examples.html). Spark is built on the concept of distributed datasets, which contain arbitrary Java or Python objects. You create a dataset from external data, then apply parallel operations to it. There are two types of operations: transformations, which define a new dataset based on previous ones, and actions, which kick off a job to execute on a cluster.

In [6]:
words = sc.parallelize(["hello", "world", "goodbye","goodbye", "hello", "again"])
wordcounts = words.map(lambda s: (s, 1)).reduceByKey(lambda a, b : a + b).collect()
wordcounts

[('world', 1), ('again', 1), ('hello', 2), ('goodbye', 2)]
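
To make the `map` / `reduceByKey` step above concrete, here is a plain-Python sketch of what that pair of operations computes when everything sits on one machine. The helper `reduce_by_key` and the names `sample_words` and `local_counts` are hypothetical and introduced only for illustration; they are not part of the Spark API, and Spark itself distributes this work across partitions.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func (illustrative local stand-in for reduceByKey)."""
    merged = {}
    for key, value in pairs:
        # The first value seen for a key is stored as-is; later ones are folded in with func.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


sample_words = ["hello", "world", "goodbye", "goodbye", "hello", "again"]
pairs = [(w, 1) for w in sample_words]            # same role as words.map(lambda s: (s, 1))
local_counts = reduce_by_key(pairs, lambda a, b: a + b)

print(sorted(local_counts.items()))
# [('again', 1), ('goodbye', 2), ('hello', 2), ('world', 1)]
```

The result holds the same pairs as the notebook output above, only sorted; the order returned by `collect()` is not guaranteed.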

### Exercise 1: Calculate the number of unique words in the "words" RDD.
(Hint: The answer should be 4.)

In [16]:
# Write your code here
wordcounts  # the (word, count) list from the cell above

#### Solution
Link to [Solution pages](/notebooks/spark_course/7-Solutions-to-exercises/)

### Exercise 2: Create an RDD of numbers, and use Spark to find the mean.
(Hint: Use reduce to sum all the numbers and divide by the count.)

In [42]:
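# Write your code here
import numpy.random

# 10 million uniform random numbers in [0, 1)
data = numpy.random.rand(1,10000000).flatten()
numpy.mean(data)  # local reference value (not displayed: only the last expression is shown)
numbers = sc.parallelize(data)
ct = numbers.count()
# Sum all the numbers with reduce, then divide by the count
numbers.reduce(lambda a,b: a+b)/ct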

0.50006752868025783

#### Solution
Link to [Solution pages](/notebooks/spark_course/7-Solutions-to-exercises/)

## Text Search (update)
In this example, we search through the error messages in a log file:

In [46]:
file = sc.textFile("/uuData/error_log")
errors = file.filter(lambda line: "error" in line)
# Count all the errors
print errors.count()

5


In [8]:
# Count errors containing "File does not exist"
print errors.filter(lambda line: "File does not exist" in line).count()
# Fetch the errors mentioning MySQL as a list of strings
print errors.filter(lambda line: "MySQL" in line).collect()

5
[]


## In-Memory Text Search
Spark can cache datasets in memory to speed up reuse. In the example above, we can load just the error messages in RAM using:

errors.cache()

After the first action that uses errors has run, the dataset stays in memory, so later actions on it will be much faster.
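
The effect of caching can be illustrated without a cluster. The sketch below is only an analogy in plain Python, not how Spark implements `cache()`: `ToyDataset`, `find_errors` and `toy_log` are hypothetical names with made-up data. The toy dataset re-runs its source for every action unless `cache()` was called, in which case the first action stores the records for reuse.

```python
class ToyDataset(object):
    """Hypothetical stand-in for an RDD: re-runs its source on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function that produces the records
        self._use_cache = False
        self._cached = None
        self.computations = 0        # how many times the source was actually evaluated

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        records = list(self._compute())
        if self._use_cache:
            self._cached = records
        return records

    def count(self):
        return len(self.collect())


toy_log = ["error: disk full", "ok", "error: timeout"]   # made-up log lines


def find_errors():
    return (line for line in toy_log if "error" in line)


uncached = ToyDataset(find_errors)
uncached.count()
uncached.collect()

cached = ToyDataset(find_errors).cache()
cached.count()
cached.collect()

print(uncached.computations)  # 2: each action re-ran the filter
print(cached.computations)    # 1: the second action reused the stored records
```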

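## Word Count
In this example, we use a few more transformations to build a dataset of (String, Int) pairs called counts. The cell below only defines the dataset; to save it to a file, call an action such as `saveAsTextFile` on counts with an output path.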

In [9]:
file = sc.textFile("/uuData/access_log")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
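
As a rough single-machine analogy to the pipeline above, the following sketch shows what `flatMap` followed by the counting step produces. The `sample_lines` are made up and are not taken from `/uuData/access_log`. It also shows a detail of `split(" ")`: consecutive spaces yield empty-string tokens, which then get counted like any other word.

```python
from collections import Counter
from itertools import chain

# Made-up lines; the second one contains a double space on purpose
sample_lines = ["GET /index.html 200", "GET  /about.html 200"]

# flatMap: split every line and flatten the per-line lists into one stream of tokens
tokens = chain.from_iterable(line.split(" ") for line in sample_lines)

# map + reduceByKey: count the occurrences of each token
token_counts = Counter(tokens)

print(token_counts["GET"])   # 2
print(token_counts[""])      # 1, the empty token created by the double space
```

If empty tokens are unwanted, calling `line.split()` without an argument splits on any run of whitespace and drops them.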

## Estimating Pi
Spark can also be used for compute-intensive tasks. This code estimates π by "throwing darts" at a circle. We pick random points in the square from (-1, -1) to (1, 1) and see how many fall in the unit circle. Since the circle has area π and the square has area 4, the fraction should be π / 4, so we use this to get our estimate.

In [10]:
partitions = 100

In [11]:
from random import random
from operator import add


n = 100000 * partitions

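def f(_):
    # Draw a random point in the square [-1, 1) x [-1, 1)
    x = random() * 2 - 1
    y = random() * 2 - 1
    # 1 if the point lies inside the unit circle, else 0
    return 1 if x ** 2 + y ** 2 < 1 else 0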

count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add)
print "Pi is roughly %f" % (4.0 * count / n)

Pi is roughly 3.138500
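
The same estimate can be reproduced sequentially, which helps when checking the logic before running it on a cluster. This is a minimal sketch: the function name `estimate_pi`, the sample size and the fixed seed are assumptions chosen for illustration. The statistical error of such an estimate shrinks roughly in proportion to one over the square root of the number of samples.

```python
import math
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi from the fraction of random points in [-1, 1] x [-1, 1] inside the unit circle."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y < 1.0:
            inside += 1
    # inside / num_samples approximates (circle area) / (square area) = pi / 4
    return 4.0 * inside / num_samples


pi_estimate = estimate_pi(200000)
print("Estimate: %f, error: %f" % (pi_estimate, abs(pi_estimate - math.pi)))
```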


## Logistic Regression
This is an iterative machine learning algorithm that seeks to find the best hyperplane that separates two sets of points in a multi-dimensional feature space. It can be used to classify messages into spam vs non-spam, for example. Because the algorithm applies the same MapReduce operation repeatedly to the same dataset, it benefits greatly from caching the input in RAM across iterations.

In [12]:
iterations = 20

In [13]:
"""
A logistic regression implementation that uses NumPy (http://www.numpy.org)
to act on batches of input data using efficient matrix operations.
In practice, one may prefer to use the LogisticRegression algorithm in
MLlib, as shown in examples/src/main/python/mllib/logistic_regression.py.
"""

import numpy as np

D = 10  # Number of dimensions


# Read a batch of points from the input file into a NumPy matrix object. We operate on batches to
# make further computations faster.
# The data file contains lines of the form <label> <x1> <x2> ... <xD>. We load each block of these
# into a NumPy array of size numLines * (D + 1) and pull out column 0 vs the others in gradient().
def readPointBatch(iterator):
    strs = list(iterator)
    matrix = np.zeros((len(strs), D + 1))
    for i in xrange(len(strs)):
        matrix[i] = np.fromstring(strs[i].replace(',', ' '), dtype=np.float32, sep=' ')
    return [matrix]


points = sc.textFile("/uuData/lr_data.txt").mapPartitions(readPointBatch).cache()

In [14]:
# Initialize w to a random value
w = 2 * np.random.ranf(size=D) - 1
print "Initial w: " + str(w)

# Compute logistic regression gradient for a matrix of data points
def gradient(matrix, w):
    Y = matrix[:, 0]    # point labels (first column of input file)
    X = matrix[:, 1:]   # point coordinates
    # For each point (x, y), compute gradient function, then sum these up
    return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)

# Add in place so that reduce() does not allocate a new array for every pair of partial gradients
def add(x, y):
    x += y
    return x

for i in range(iterations):
    print "On iteration %i" % (i + 1)
    w -= points.map(lambda m: gradient(m, w)).reduce(add)

print "Final w: " + str(w)

Initial w: [-0.12481223 -0.99055981 -0.77579258  0.33652082  0.22640931  0.03879368
  0.90980746  0.92434431 -0.85749732  0.39375674]
On iteration 1
On iteration 2
On iteration 3
On iteration 4
On iteration 5
On iteration 6
On iteration 7
On iteration 8
On iteration 9
On iteration 10
On iteration 11
On iteration 12
On iteration 13
On iteration 14
On iteration 15
On iteration 16
On iteration 17
On iteration 18
On iteration 19
On iteration 20
Final w: [ 505.71631611  643.38593344  613.49460251  397.34343394  470.52502102
  507.92458279  330.25231094  389.52729754  616.64854737  442.3351731 ]
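
The expression used in `gradient()` is the gradient of the logistic loss, the sum over points of log(1 + exp(-y w·x)), provided the labels y are coded as -1 and +1. That label coding is an assumption here; the notebook does not say how `/uuData/lr_data.txt` encodes its labels. The sketch below checks the formula on made-up toy data in pure Python by comparing it with central finite differences of the loss. The names `logistic_loss`, `logistic_gradient`, `toy_points` and `toy_w` are hypothetical.

```python
import math


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def logistic_loss(points, w):
    """Sum of log(1 + exp(-y * w.x)) over (y, x) pairs, assuming labels y in {-1, +1}."""
    return sum(math.log(1.0 + math.exp(-y * dot(w, x))) for y, x in points)


def logistic_gradient(points, w):
    """Same per-point expression as gradient() in the notebook, written with plain lists."""
    grad = [0.0] * len(w)
    for y, x in points:
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        for j, xj in enumerate(x):
            grad[j] += scale * xj
    return grad


# Made-up toy data: (label, features)
toy_points = [(1, [0.5, 1.0]), (-1, [-1.5, 0.3]), (1, [2.0, -0.7])]
toy_w = [0.1, -0.2]

analytic = logistic_gradient(toy_points, toy_w)

# Central finite differences of the loss should agree with the analytic gradient
eps = 1e-6
numeric = []
for j in range(len(toy_w)):
    w_plus = list(toy_w)
    w_minus = list(toy_w)
    w_plus[j] += eps
    w_minus[j] -= eps
    diff = logistic_loss(toy_points, w_plus) - logistic_loss(toy_points, w_minus)
    numeric.append(diff / (2 * eps))

print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places, which indicates that each iteration of the loop above moves `w` against the gradient of this loss.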
