# Word Count 

The word count program is the "hello world" of distributed computing, and it is an essential first step in learning how to engage the cluster with Spark.

When running Spark with the Jupyter notebook driver, a Spark context is automatically added to the interpreter and is accessible through the `sc` variable:

In [1]:
sc

<pyspark.context.SparkContext at 0x7f666dfb6250>

The first step is to load an RDD (a resilient distributed dataset) with data from HDFS. We'll load the complete works of Shakespeare, which are stored on HDFS under our user folder.

The RDD is loaded as a collection of strings, where each element in the RDD is a single line from the text file. 

In [1]:
text = sc.textFile("hdfs://10.0.0.125/user/ec2-user/shakespeare.txt")

To count words, we need to split each line on whitespace; Python's `str.split()` does this. To apply a transformation to the RDD, we pass a closure that describes the work, and Spark maps it over each element of the RDD:

In [2]:
text = text.map(lambda x: x.split())
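
To check what the lambda does to a single line before involving the cluster, the same function can be applied to a small in-memory list. This is a minimal sketch: the sample lines are made up for illustration, and a list comprehension stands in for `RDD.map`.

```python
# Hypothetical sample lines, standing in for elements of the RDD.
sample_lines = ["hamlet@0\tHAMLET", "to be  or not to be", ""]

# The same function that is passed to text.map() above.
split_line = lambda x: x.split()

# Apply the function to every element, as RDD.map would.
tokens = [split_line(line) for line in sample_lines]
print(tokens)
```

With no arguments, `str.split()` splits on any run of whitespace (tabs included) and returns an empty list for an empty line.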

Transformations _do not execute_ immediately after they are declared. Instead, a _lineage_ is built up, describing how each RDD is derived from the RDDs before it. We can see the lineage by printing the DAG as follows:

In [8]:
print(text.toDebugString())

(2) PythonRDD[2] at RDD at PythonRDD.scala:43 []
 |  hdfs://10.0.0.125/user/ec2-user/shakespeare.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
 |  hdfs://10.0.0.125/user/ec2-user/shakespeare.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []
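
Deferred execution can be illustrated without Spark by using a Python generator. This is only an analogy under that assumption, not a description of how Spark is implemented; the names `split_and_log` and `calls` are hypothetical.

```python
calls = []  # records each time the function actually runs

def split_and_log(line):
    calls.append(line)
    return line.split()

lines = ["a b", "c d e"]

# Declaring the generator does no work yet, much like declaring a transformation.
lazy = (split_and_log(line) for line in lines)
declared_calls = len(calls)

# Consuming the generator plays the role of an action and triggers the work.
result = list(lazy)
executed_calls = len(calls)

print(declared_calls, executed_calls, result)
```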


To cause execution to occur on the cluster, we need to call an _action_ method on the RDD. For example, we can use the `take` method to fetch the first element of the RDD:

In [9]:
text.take(1)

[[u'hamlet@0', u'HAMLET']]

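The functions passed to transformations are closures: they can carry variables from the scope in which they were defined. The next cells show this in plain Python, first with a function factory whose inner function captures `sep`, then with a function that reads the module-level variable `SEP`:

In [10]: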
def splitter(sep=" "):
    def inner(line):
        return line.split(sep)
    return inner 

In [11]:
tab_splitter = splitter("\t")
comma_splitter = splitter(",")

In [14]:
map(comma_splitter, ["a\tb\tc", "a,b,c"])

[['a\tb\tc'], ['a', 'b', 'c']]

In [17]:
SEP = ","

def splitter2(line):
    return line.split(SEP)

In [18]:
map(splitter2, ["a\tb\tc", "a,b,c"])

[['a\tb\tc'], ['a', 'b', 'c']]
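
The two styles differ in when the separator is looked up. The sketch below repeats the definitions from the cells above in plain Python and then rebinds `SEP`; the test string is made up for illustration.

```python
def splitter(sep=" "):
    # sep is captured in the closure when splitter() is called.
    def inner(line):
        return line.split(sep)
    return inner

SEP = ","

def splitter2(line):
    # SEP is looked up each time the function is called.
    return line.split(SEP)

comma_splitter = splitter(",")
before = (comma_splitter("a,b\tc"), splitter2("a,b\tc"))

SEP = "\t"  # rebinding the global changes splitter2 but not comma_splitter
after = (comma_splitter("a,b\tc"), splitter2("a,b\tc"))

print(before)
print(after)
```

The factory version keeps its separator regardless of later changes to the surrounding module, which makes its behavior easier to reason about.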

In [1]:
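from operator import add

# Number of words on each line (each element is a list of words after the split above)
lens = text.map(lambda x: len(x))
# reduce is an action: it returns a plain Python number, not an RDD
count = lens.reduce(add)
print(count)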
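
The same map-then-reduce pattern can be mimicked locally with `functools.reduce`. This is a sketch on hypothetical data shaped like the RDD after the split, not a substitute for running the job on the cluster.

```python
from functools import reduce
from operator import add

# Hypothetical tokenized lines, shaped like the RDD after text.map(lambda x: x.split()).
tokenized = [["to", "be", "or"], ["not", "to", "be"], []]

# map: one length per line.
lens = [len(words) for words in tokenized]

# reduce: fold the lengths together with add, giving a single number.
total = reduce(add, lens)
print(total)
```

The result of the reduction is an ordinary integer, so it can be printed directly.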

In [21]:
text = sc.textFile("hdfs://10.0.0.125/user/ec2-user/shakespeare.txt")
text = text.flatMap(lambda line: line.split())
text = text.map(lambda word: (word, 1))
text = text.reduceByKey(add)
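
The three transformations above can be traced step by step on a small in-memory list. This is a minimal local sketch with made-up input lines; plain Python constructs stand in for `flatMap`, `map`, and `reduceByKey`.

```python
from itertools import chain
from operator import add

lines = ["to be or not to be", "that is the question"]

# flatMap: split each line and flatten the results into one sequence of words.
words = list(chain.from_iterable(line.split() for line in lines))

# map: pair each word with a count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values that share a key using the add function.
counts = {}
for word, n in pairs:
    counts[word] = add(counts[word], n) if word in counts else n

print(counts)
```

Unlike `map`, which would yield one list per line, the flattening step produces a single sequence of words, which is what makes the per-word pairing possible.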

In [24]:
text = text.sortBy(lambda token: token[1], ascending=False)
text.coalesce(1).saveAsTextFile('shakes-counts')
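
The sort key used above can be tried locally with `sorted`. The sketch also formats each pair as a tab-separated line, which is one possible way to prepare records before saving; the sample counts and the output format are assumptions for illustration, not part of the original job.

```python
# Hypothetical (word, count) pairs, shaped like the RDD after reduceByKey.
counts = [("be", 2), ("or", 1), ("to", 2), ("not", 1)]

# Same key function as the sortBy call; reverse=True mirrors ascending=False.
ranked = sorted(counts, key=lambda token: token[1], reverse=True)

# One tab-separated line per pair.
formatted = ["%s\t%d" % (word, n) for word, n in ranked]
print(formatted)
```

Python's `sorted` is stable, so tied words keep their input order here; no such ordering should be assumed for ties on the cluster.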

In [20]:
text = sc.textFile("hdfs://10.0.0.125/user/ec2-user/shakespeare.txt")
text = text.map(lambda line: " ".join(line.split("\t")[1:]))
text = text.flatMap(lambda line: line.split())

text.take(20)

# lens = text.map(lambda words: len(words))

# lens.reduce(add)

[u'HAMLET',
 u'DRAMATIS',
 u'PERSONAE',
 u'CLAUDIUS',
 u'king',
 u'of',
 u'Denmark.',
 u'(KING',
 u'CLAUDIUS:)',
 u'HAMLET',
 u'son',
 u'to',
 u'the',
 u'late,',
 u'and',
 u'nephew',
 u'to',
 u'the',
 u'present',
 u'king.']
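
The first `map` in the cell above drops the leading tab-separated field (such as `hamlet@0`) from each line before the words are split. A local sketch of that step follows; the second and third sample lines are hypothetical.

```python
# Keep everything after the first tab-separated field.
strip_key = lambda line: " ".join(line.split("\t")[1:])

sample = ["hamlet@0\tHAMLET", "hamlet@8\tDRAMATIS PERSONAE", "no tab here"]

stripped = [strip_key(line) for line in sample]

# Flatten the remaining text into words, as flatMap does.
words = [word for line in stripped for word in line.split()]
print(stripped)
print(words)
```

A line that contains no tab has no fields after the first, so it becomes an empty string and contributes no words.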