# Tutorial Lesson 1: Word Count

## Introduction

What words did Shakespeare most commonly use? This seems like a simple question, but answering it is a great way to learn the ins and outs of Spark. While you could probably figure this one out by writing a normal program in your language of choice (or even by hand), in order to implement it in a scalable way, you need to use a parallel data processing framework like Spark. 

The file ../data/shakespeare.txt contains the complete works of Shakespeare. We'll use Spark to count how many times each word was used, and we'll do it in 3 different ways.

If you haven't used IPython/Jupyter before: The gray boxes contain Python code. You can edit the code by clicking in the box, and then run it by pressing Ctrl-Enter. The output will appear below the box. You can always revert using the Revert option in the file menu. If you want to download this locally, you can get it from github.com/dfeldman/spark-training-materials.

## Some necessary helper code (not important to understand, just press Ctrl-Enter to run)

In [1]:
import pyspark
import pyspark.sql
import pandas, pandas.plotting
import matplotlib.pyplot as plt
from pyspark.sql.functions import *

from IPython.display import display, HTML

try: sc = pyspark.SparkContext('local[*]')
except ValueError: pass
spark = pyspark.sql.SparkSession(sc)

# Useful function for displaying the first 50 rows of a DataFrame as an HTML table
def show(df):
    header = '</th><th>'.join(str(c) for c in df.columns)
    rows = '</tr><tr>'.join(
        '<td>{}</td>'.format('</td><td>'.join(str(v) for v in row))
        for row in df.take(50))
    display(HTML(
        '<table><tr><th>{}</th></tr><tr>{}</tr></table>'.format(header, rows)))


def show_rdd(rdd):
    show(rdd.toDF())

def show_string_rdd(rdd):
    show(rdd.map(lambda x: (x,)).toDF())

## Python hints (skip if you are a Python expert)

If you're not familiar with Python, here are a few quick tips that will help you get started before we start with Spark.

Python has built-in lists, which work like the arrays found in many other programming languages (this tutorial calls them arrays). Spark doesn't use these built-in arrays to store your data, but RDDs have some similar properties, so it's good to know about arrays first.

In [None]:
x = [1, 2, 3]
y = ["a","b", "c"]
print(x)
print(y)

You can also make an array by splitting a string. 

In [None]:
text = "Here's some text"
text_split = text.split(" ") # Make a new array by cutting the string at every space
print(text_split)

In a lot of programming, the way you interact with an array is by using a "for" or "while" loop to iterate over each element of the array (and possibly change it at each point). **But that is not the right way to use Spark.** Instead, we write small functions that can be applied in parallel to every element of the array, and return a new resulting array all at once. Python has this functionality built in for its arrays too. 

In [None]:
# Define a function that returns the input plus 1
def xplusone(x):
    return x + 1

print(xplusone(1))

print(list(map(xplusone, [1, 2, 3, 4, 5])))

To make the code simpler, Python has a way to define a quick, very simple function called a "lambda function". This code defines exactly the same function as the one above:

In [None]:
xplusone = lambda x: x + 1
print(xplusone(1))
print(list(map(xplusone, [1,2,3,4,5,6])))

But the whole point of a lambda function is that we don't even have to define it in advance. We can just define it when we need it:

In [None]:
print(list(map(lambda x: x + 1, [1,2,3,4,5,6])))

We'll use this style of programming throughout the tutorial (only using Spark RDDs instead of the built-in Python arrays, and Spark functions instead of the built-in map). The advantage is that Spark will take the function and run it across the entire cluster in parallel, instead of us defining the order in which to execute the function on the elements in the array. 
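
As a small preview of that style, here is a sketch that uses only built-in Python (no Spark). It chains filter and map on an ordinary array, the same two operations the RDD examples later in this lesson rely on. The sample values are made up for illustration.

```python
# Plain-Python preview of the map/filter style (no Spark involved).
sample_words = ["To", "be", "", "or", "not", "", "to", "be"]  # made-up sample data

# filter keeps only the elements for which the function returns True
non_empty = list(filter(lambda w: w != "", sample_words))

# map applies the function to every element and collects the results
lowered = list(map(lambda w: w.lower(), non_empty))

print(non_empty)
print(lowered)
```

Notice that neither step says anything about the order in which elements are processed. That is what lets a framework like Spark spread the work out.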

One other hint: In Python, most objects and functions have built-in documentation. This is very helpful as you're learning. You can see the documentation for an object by running print(<object>.__doc__), and see what's inside an object with print(dir(<object>)).

In [None]:
raw_data = sc.textFile(name="../data/shakespeare.txt")
print(raw_data.__doc__)

In [None]:
print(dir(raw_data))

In [None]:
print(raw_data.map.__doc__)
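
The same two tricks work on any Python object, not just Spark ones. A quick sketch on built-in types, which you can run even without a Spark context:

```python
# __doc__ holds the documentation string of an object or function
print(str.lower.__doc__)

# dir lists the names of an object's attributes and methods
list_members = dir([])
print("append" in list_members)  # True
```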

----

## Word Count Version 1: Using RDDs

RDDs are the core data structure in Spark. Everything in Spark is built out of RDDs. 
This is a solution to the word count problem using only RDDs. RDDs are hard to use though, so in real life you would use DataFrames (next section). 

RDDs work like a list or array in a traditional programming language. All they do is store a collection of items. You can efficiently apply a function to every element in an RDD using map. 

Let's create an RDD:

In [None]:
raw_data = sc.textFile(name="../data/shakespeare.txt")

At any point, you can see the contents of an RDD by using the show_string_rdd function that I defined above:

In [None]:
show_string_rdd(raw_data)

Our first operation will be to split each line into words:

In [None]:
words = raw_data.flatMap(lambda line: line.split(" "))
show_string_rdd(words)
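
The cell above uses flatMap rather than map because splitting a line produces an array of words, and we want one flat collection of words rather than a collection of arrays. Here is a rough plain-Python sketch of the difference. The helper flat_map is a hypothetical stand-in written for this illustration, not part of Spark, and the sample lines are made up.

```python
from itertools import chain

lines = ["to be or", "not to be"]  # made-up sample lines

# map: one output element per input element, so we get a list of lists
mapped = list(map(lambda line: line.split(" "), lines))

# Hypothetical helper that mimics flatMap on a plain list:
# apply the function, then concatenate the resulting lists into one.
def flat_map(func, items):
    return list(chain.from_iterable(map(func, items)))

flattened = flat_map(lambda line: line.split(" "), lines)

print(mapped)     # [['to', 'be', 'or'], ['not', 'to', 'be']]
print(flattened)  # ['to', 'be', 'or', 'not', 'to', 'be']
```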

Now, we'll convert all words to lower case:

In [None]:
words_lower = words.map(lambda x: x.lower())

And filter out empty words:

In [None]:
words_not_empty = words_lower.filter(lambda x: x != "")

This is where it gets a little bit tricky. An RDD is a collection of rows. But right now we just have a collection of individual strings (the words themselves). As a first step to getting a word count, we'll take each word, and turn it into a row of that word and the number 1 attached to it:

In [None]:
words_as_rows = words_not_empty.map(lambda word: (word, 1))  

This is where the magic happens. We'll use Spark's reduceByKey function to combine all the word rows, adding up all the attached numbers. 


In [None]:
words_reduced = words_as_rows.reduceByKey(lambda a, b: a + b)
show_rdd(words_reduced)
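
To see what reduceByKey does conceptually, here is a plain-Python sketch. The function reduce_by_key is a hypothetical helper written for this illustration. Spark's real implementation works in parallel across the cluster, which this sequential sketch does not attempt to show.

```python
def reduce_by_key(func, pairs):
    """Combine the values of all pairs that share a key, two at a time, using func."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)  # merge with the running value
        else:
            result[key] = value                     # first time we see this key
    return list(result.items())

pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
counts_sketch = reduce_by_key(lambda a, b: a + b, pairs)
print(counts_sketch)  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```

Because Spark may combine values in any order and grouping, the function passed to reduceByKey should give the same answer either way, as addition does.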

In an RDD of (key, value) rows, the first element in each row is the "key". In order to sort the RDD by the count of each word, we need to make the count the key.

In [None]:
words_flipped = words_reduced.map(lambda x: (x[1], x[0]))

Then we can sort them:

In [None]:
words_sorted = words_flipped.sortByKey(ascending=False)
show_rdd(words_sorted)
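
The flip-then-sort pattern can be tried on a small plain-Python array first. This sketch uses Python's built-in sorted in place of sortByKey, and the sample counts are made up:

```python
word_counts = [("be", 2), ("or", 1), ("the", 5), ("to", 3)]  # made-up (word, count) rows

# Flip each (word, count) row to (count, word) so the count becomes the key
flipped = list(map(lambda row: (row[1], row[0]), word_counts))

# Sort by the key (the first element of each row), largest first
by_count = sorted(flipped, key=lambda row: row[0], reverse=True)

print(by_count)  # [(5, 'the'), (3, 'to'), (2, 'be'), (1, 'or')]
```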

We can do the whole thing without any intermediate variables, like this:

In [None]:
raw_data = sc.textFile(name="../data/shakespeare.txt")
counts = (raw_data
    .flatMap(lambda line: line.split(" "))  # Split each line of text into words
    .filter(lambda x: x != "")              # Filter out empty words
    .map(lambda x: x.lower())               # Convert each word to lower case
    .map(lambda word: (word, 1))            # Turn each word X into tuple (X, 1)
    .reduceByKey(lambda a, b: a + b)        # Count the words
    .map(lambda x: (x[1], x[0]))            # Flip the structure (X, Y) to (Y, X) to make sorting easier
    .sortByKey(ascending=False))            # Sort by count, largest first

------

## Word Count Version 2: Using DataFrames


In general, you wouldn't use RDDs directly, because Spark provides a much easier-to-use high-level interface called DataFrames. These are similar to Pandas or R data frames and provide a lot of built-in functionality for free. Internally, they are built entirely using RDDs.

In [None]:
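counts2 = (raw_data
    .flatMap(lambda line: line.split(" "))  # Split each line of text into words
    .map(lambda x: (x,))                    # Wrap each word in a one-element tuple (a row)
    .toDF())                                # Convert to a DataFrame with a single column "_1"

df2 = counts2.groupBy("_1").count().sort(desc("count"))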
show(df2)

Can you modify this to exclude empty words, like the first example?
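
For intuition, the groupBy / count / sort chain in the cell above does on a DataFrame roughly what Python's collections.Counter does on an ordinary array. A plain-Python sketch with made-up data (this is an analogy, not how Spark computes the result):

```python
from collections import Counter

sample = ["to", "be", "or", "not", "to", "be", "to"]  # made-up words

# Counter groups equal items and counts them;
# most_common returns (item, count) rows sorted by count, largest first
top_words = Counter(sample).most_common()

print(top_words)  # [('to', 3), ('be', 2), ('or', 1), ('not', 1)]
```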

-------

## Word Count Version 3: Using SQL

Spark supports an even easier interface, at least for many simple programs: SQL queries. Spark's SQL processing is built on DataFrames, which in turn are built on RDDs. Here's an example:

In [None]:
counts3 = (raw_data
    .flatMap(lambda line: line.split(" "))).map(lambda x: (x,)).toDF()

# Install the "counts3" DataFrame as an SQL table
counts3.createOrReplaceTempView("table1")

df3 = spark.sql("select lower(_1) as word, count(*) as ct from table1 where _1 != '' group by lower(_1) order by ct desc")

show(df3)
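
A query of the same shape can be tried without Spark, using Python's built-in sqlite3 module as a stand-in. The table name words_demo, the column name raw_word, and the sample rows are all assumptions made for this sketch, and SQLite's SQL dialect is not identical to Spark's. The select / where / group by / order by structure is the part to compare.

```python
import sqlite3

# In-memory SQLite database standing in for Spark's temporary view
conn = sqlite3.connect(":memory:")
conn.execute("create table words_demo (raw_word text)")
conn.executemany(
    "insert into words_demo values (?)",
    [("The",), ("the",), ("",), ("king",), ("THE",), ("King",)],  # made-up rows
)

query = """
    select lower(raw_word) as word, count(*) as ct
    from words_demo
    where raw_word != ''
    group by lower(raw_word)
    order by ct desc
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)  # [('the', 3), ('king', 2)]
```

Grouping by the lowercased word is what merges "The", "the", and "THE" into a single row in this sketch.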

## Just for fun: graphing the distribution

The easiest way to get a plot out of Spark is to convert the Spark DataFrame into a Pandas DataFrame (Pandas is another Python library). Pandas has great support for plots. 

In [None]:
%matplotlib inline

pdf = counts.toDF().limit(100).toPandas()
pdf.plot()

## Questions

(You can use versions 1, 2, or 3 to answer these. Or, for an extreme challenge, try to do all 3.)

1. How many times did Shakespeare use the word "Romeo"?
1. How many distinct words are used?
1. What is the average number of times a word is used?
1. In version 1, change the file name to a nonexistent file. When does Spark notice that an error has occurred?