# Machine Learning At Scale

Data Analytics and Machine Learning at Scale 

---
__Name:__  *Dr. James G. Shanahan*   
__Email:__  *James.Shanahan  @ gmail.com*   
__Quiz:__  Debugging strategies in Spark

# First, choose which Spark cluster backs this notebook to get your sc/sqlContext

* Back this notebook with Spark running on your local machine in a container
* Back this notebook with Spark running on an EMR cluster (note that Spark jobs on EMR have to read and write their data from/to S3)
* Back this notebook with Spark running natively on your local machine

### Run the next cell if you wish to launch a Spark cluster on your local machine in a container and back this notebook with that cluster

In [15]:
import os
import sys 
import pyspark
from pyspark.sql import SQLContext

# We can give a name to our app (to find it in Spark WebUI) and configure execution mode
# In this case, it is local multicore execution with "local[*]"
app_name = "example-logs"
master = "local[*]"
conf = pyspark.SparkConf().setAppName(app_name).setMaster(master)
sc = pyspark.SparkContext(conf=conf)
sqlContext = SQLContext(sc)
print(sc)
print(sqlContext)

<pyspark.context.SparkContext object at 0x7f08e8082610>
<pyspark.sql.context.SQLContext object at 0x7f08c1ffc190>


### Run the next cell if you wish to back this notebook by an EMR cluster that is already up and running

In [None]:
import os
import sys 
# First, we initialize the Spark environment
import findspark
findspark.init('/usr/lib/spark')

import pyspark
from pyspark.sql import SQLContext

# We can give a name to our app (to find it in Spark WebUI) and configure execution mode
# In this case, it is local multicore execution with "local[*]"
app_name = "example-logs"
master = "local[*]"
conf = pyspark.SparkConf().setAppName(app_name).setMaster(master)
sc = pyspark.SparkContext(conf=conf)
sqlContext = SQLContext(sc)
print(sc)
print(sqlContext)

### Run the next cell if you wish to launch a Spark cluster on your local machine in native mode and back this notebook with that cluster

In [1]:
import os
import sys #current as of 9/26/2015

# Point SPARK_HOME at your local Spark install (this path is specific to the author's machine)
spark_home = os.environ['SPARK_HOME'] = '/Users/jshanahan/Dropbox/Lectures-UC-Berkeley-ML-Class-2015/spark-1.6.1-bin-hadoop2.6/'
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0,os.path.join(spark_home,'python'))
sys.path.insert(0,os.path.join(spark_home,'python/lib/py4j-0.9-src.zip'))

# First, we initialize the Spark environment

import findspark
#findspark.init()

import pyspark
from pyspark.sql import SQLContext

# We can give a name to our app (to find it in Spark WebUI) and configure execution mode
# In this case, it is local multicore execution with "local[*]"
app_name = "example-logs"
master = "local[*]"
conf = pyspark.SparkConf().setAppName(app_name).setMaster(master)
sc = pyspark.SparkContext(conf=conf)
sqlContext = SQLContext(sc)


print(sc)
print(sqlContext)


# Import some libraries to work with dates
import dateutil.parser
import dateutil.relativedelta as dateutil_rd

<pyspark.context.SparkContext object at 0x1050241d0>
<pyspark.sql.context.SQLContext object at 0x100575210>


# Create some data

In [12]:
%%writefile wordcount.txt
hello hi hi hallo
bonjour hola hi ciao
nihao konnichiwa ola
hola nihao hello

Overwriting wordcount.txt


In [13]:
cat wordcount.txt

hello hi hi hallo
bonjour hola hi ciao
nihao konnichiwa ola
hola nihao hello

## NOTES on Inputs to Spark

Source: http://spark.apache.org/docs/latest/programming-guide.html

All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

The textFile method also takes an optional second argument that controls the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS on Hadoop 2.x and later, 64MB on older versions), but you can ask for more partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
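
As a rough illustration of the partitioning rule above, the sketch below estimates how many partitions a file of a given size would start with. The helper name estimate_partitions and the file sizes are hypothetical, and the arithmetic is a simplification: the actual splits are computed by the underlying Hadoop input format, so treat the numbers as approximate.

```python
import math

MB = 1024 * 1024

def estimate_partitions(file_size_bytes, block_size_bytes, min_partitions=None):
    """Rough estimate only: one partition per block, or more if requested.

    Hypothetical helper for illustration; it is not part of the Spark API.
    """
    blocks = max(1, int(math.ceil(file_size_bytes / float(block_size_bytes))))
    if min_partitions is None:
        return blocks
    # In this simplified model, asking for fewer partitions than blocks has no effect
    return max(blocks, min_partitions)

one_gb = 1024 * MB
print(estimate_partitions(one_gb, 64 * MB))        # 16
print(estimate_partitions(one_gb, 128 * MB))       # 8
print(estimate_partitions(one_gb, 128 * MB, 20))   # 20
print(estimate_partitions(one_gb, 128 * MB, 2))    # 8
```

The last call shows the "no fewer partitions than blocks" rule: requesting 2 partitions for an 8-block file still yields 8 in this model.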

In [5]:
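# Pitfall: parallelize expects a collection, so this distributes the
# characters of the string 'wordcount.txt', not the contents of the file
rdd = sc.parallelize('wordcount.txt')
rdd.first()
#rdd.count()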

'w'
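
The 'w' above comes from how Python treats strings: parallelize expects a collection, and a string is a sequence of characters, so each character becomes one element. A minimal, Spark-free sketch of the same effect (the variable names are illustrative only):

```python
path = 'wordcount.txt'

# Iterating over a string yields its characters, which is what gets distributed
as_elements = list(path)
print(as_elements[0])     # w
print(len(as_elements))   # 13

# Wrapping the string in a list gives a one-element collection instead
print(len([path]))        # 1
```

Either way the file itself is never read; it is sc.textFile that loads the file's lines.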

In [16]:
rdd = sc.textFile('wordcount.txt')  #create an RDD
rdd.count()

4

In [5]:
rdd.first()

u'hello hi hi hallo'

# Debugging in Spark  

* ### PART 1: Write mapper/reducer functions as standalone code and debug them on a test record (key-value pair)
* ### PART 2: In a multi-operation call: break it down and debug step by step on a small test data set


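## PART 1: Debug each closure independently with small unit tests
A closure here is any function you pass to a Spark operation (e.g., a mapper, reducer, or filter function). Test it on its own first, before running it on the cluster.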

In [17]:
# This is an example of a mapper function (referred to as a closure in Spark because
# the function and its state are serialized and shipped to each worker)

def mySplitFunction(string):
    # return the tokens; without the return statement the function yields None
    return string.split()
mySplitFunction("hello hi hi hallo")

In [8]:
# Exercise: debug this function so that it returns the first token in a string record.
# As written, it returns the first character rather than the first token.
def mySplitFunction(string):
    toks = string.split()[0]
    return toks[0]

# Call the mapper directly on a fake record to debug it
print(mySplitFunction("hello hi hi hallo"))


h


In [10]:
# Solution: string.split()[0] is already the first token, so indexing it again
# with [0] returned only its first character. Return the token itself.
def mySplitFunction(string):
    toks = string.split()[0]
    return toks

# Call the mapper directly on a fake record to confirm the fix
print(mySplitFunction("hello hi hi hallo"))


hello
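
The manual check above can be turned into small assertions that run without Spark. This is a sketch under assumptions: first_token is a hypothetical name for the corrected mapper, and returning None for a blank record is an illustrative choice, not something the notebook specifies.

```python
def first_token(record):
    """Return the first whitespace-separated token of a record, or None if it has none."""
    toks = record.split()
    return toks[0] if toks else None

# Test records modeled on wordcount.txt, plus edge cases a real file may contain
assert first_token("hello hi hi hallo") == "hello"
assert first_token("  bonjour hola") == "bonjour"   # split() ignores leading whitespace
assert first_token("") is None                       # blank line: no IndexError
print("all closure tests passed")
```

Once the function passes on hand-picked records like these, it can be handed to map or flatMap with more confidence that any remaining failure comes from the data or the pipeline, not the closure.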


## PART 2: In a multi-operation call: break it down and debug step by step on a small test data set
### Call one operation at a time and take a couple of results (e.g., take(1)) to examine


In [18]:
# output the tokens from each record (one-to-many transformation)

def mySplitFunction(string):
    return string.split()

logFileNAME = 'wordcount.txt'
text_file = sc.textFile(logFileNAME)

# debug flatMap on its own: take(3) brings back only the first three tokens
counts = text_file.flatMap(lambda line: line.split(" ")).take(3)
print(counts)

#              .map(lambda word: (word, 1)) \
#              .reduceByKey(lambda a, b: a + b)
# wordCounts = counts.collect()
# for v in counts.collect():
#     print v

[u'hello', u'hi', u'hi']


In [21]:
# output each token with a count of 1 (one-to-one map function)

def mySplitFunction(string):
    return string.split()

logFileNAME = 'wordcount.txt'
text_file = sc.textFile(logFileNAME)

# debug flatMap followed by map: take(3) brings back only the first three pairs
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .take(3)
print(counts)

#              .reduceByKey(lambda a, b: a + b)
# wordCounts = counts.collect()
# for v in counts.collect():
#     print v

[(u'hello', 1), (u'hi', 1), (u'hi', 1)]


In [5]:
# complete word count
#
# Count words in file/directory
logFileNAME = 'wordcount.txt'
text_file = sc.textFile(logFileNAME)
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
wordCounts = counts.collect()  # collect once and reuse the local list
for v in wordCounts:
    print(v)

(u'ciao', 1)
(u'bonjour', 1)
(u'nihao', 2)
(u'hola', 2)
(u'konnichiwa', 1)
(u'hallo', 1)
(u'hi', 3)
(u'hello', 2)
(u'ola', 1)


In [10]:
print(wordCounts)

[(u'ciao', 1), (u'bonjour', 1), (u'nihao', 2), (u'hola', 2), (u'konnichiwa', 1), (u'hallo', 1), (u'hi', 3), (u'hello', 2), (u'ola', 1)]
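
The same step-by-step approach can be rehearsed without a cluster by emulating each stage on an in-memory copy of the data and checking the intermediate results. The sketch below uses plain Python stand-ins for flatMap, map, and reduceByKey; these are not Spark calls, and the variable names are illustrative.

```python
from collections import defaultdict

lines = [
    "hello hi hi hallo",
    "bonjour hola hi ciao",
    "nihao konnichiwa ola",
    "hola nihao hello",
]

# Stage 1 (stand-in for flatMap): one line in, many tokens out
tokens = [tok for line in lines for tok in line.split(" ")]
assert tokens[:3] == ["hello", "hi", "hi"]

# Stage 2 (stand-in for map): one token in, one (token, 1) pair out
pairs = [(tok, 1) for tok in tokens]
assert pairs[:3] == [("hello", 1), ("hi", 1), ("hi", 1)]

# Stage 3 (stand-in for reduceByKey): sum the values that share a key
totals = defaultdict(int)
for word, n in pairs:
    totals[word] += n
local_counts = dict(totals)

print(local_counts["hi"])   # 3
print(len(local_counts))    # 9
```

If a stage's assertion fails here, the bug is in the logic of that stage and can be fixed before any Spark job is launched.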


__sortByKey([ascending], [numTasks])__	

When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

In [35]:
wordCounts

[(u'hallo', 1),
 (u'konnichiwa', 1),
 (u'ola', 1),
 (u'ciao', 1),
 (u'bonjour', 1),
 (u'nihao', 2),
 (u'hello', 2),
 (u'hola', 2),
 (u'hi', 3)]

In [34]:
# last element
wordCounts[8:]

[(u'hi', 3)]

In [36]:
# first 5 elements
wordCounts[:5]

[(u'hallo', 1), (u'konnichiwa', 1), (u'ola', 1), (u'ciao', 1), (u'bonjour', 1)]
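
The listings above are ordered by count rather than by word. As a local sketch of the difference, the code below sorts the collected pairs by key and by value with Python's sorted. On an RDD, sortByKey orders by key, while sortBy with a key function such as lambda kv: kv[1] orders by count. The order among equal counts here follows Python's stable sort and need not match what Spark returns.

```python
word_counts = [('ciao', 1), ('bonjour', 1), ('nihao', 2), ('hola', 2),
               ('konnichiwa', 1), ('hallo', 1), ('hi', 3), ('hello', 2), ('ola', 1)]

# Sort by key (the word), analogous to what sortByKey does on an RDD
by_key = sorted(word_counts)
print(by_key[:3])      # [('bonjour', 1), ('ciao', 1), ('hallo', 1)]

# Sort by value (the count), ascending
by_count = sorted(word_counts, key=lambda kv: kv[1])
print(by_count[-1:])   # [('hi', 3)]

# Top 3 by count, descending
top3 = sorted(word_counts, key=lambda kv: kv[1], reverse=True)[:3]
print(top3)            # [('hi', 3), ('nihao', 2), ('hola', 2)]
```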