# Word Counting using PySpark with Jupyter

The customary introduction to large-scale distributed computing is a word count.

We do that here.  :)


## Create a Local Spark Context

Configure and create a Spark Context object in order to work with Spark in this notebook.  See the _Introduction_ notebook for more detail about these operations.

In [1]:
import pyspark

# Local Spark configuration
conf = (
    pyspark
      .SparkConf()
      .setMaster('local[*]')
      .setAppName('Word Counting Notebook')
)


# Create a Spark context for local work if this has not already been done.
try:
    sc
except NameError:
    sc = pyspark.SparkContext(conf = conf)

# Check that we are using the expected version of PySpark.
print('Version: %s' % sc.version)

Version: 1.6.1


## Process Some Text

Once we have a working Spark instance, we can perform some actual work.  Here we process the full text of Shakespeare's _The Taming of the Shrew_.  The text came from [lexically.net](http://lexically.net/wordsmith/support/shakespeare.html), which obtained the actual corpus from the [Online Library of Liberty](http://www.lexically.net).

If you download and process these files yourself, note that they are stored as UTF-16 (16 bit Unicode).  For simplicity, a local copy of this one play, stored as UTF-8, is located in the _data_ directory.
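If you do download the original files, they can be re-encoded with the Python standard library before handing them to Spark.  The following is a minimal sketch, assuming Python 3; the function name and both file names are hypothetical, and the demonstration writes its own throwaway UTF-16 file rather than using the real corpus.

```python
import os
import tempfile

def convert_utf16_to_utf8(source_path, target_path):
    """Re-encode a UTF-16 text file as UTF-8."""
    # The 'utf-16' codec uses the byte order mark to determine byte order.
    with open(source_path, 'r', encoding='utf-16') as source:
        text = source.read()
    with open(target_path, 'w', encoding='utf-8') as target:
        target.write(text)

# Demonstration on a throwaway file (hypothetical names and content).
workdir = tempfile.mkdtemp()
utf16_path = os.path.join(workdir, 'play_utf16.txt')
utf8_path = os.path.join(workdir, 'play_utf8.txt')
with open(utf16_path, 'w', encoding='utf-16') as f:
    f.write('sample line\n')

convert_utf16_to_utf8(utf16_path, utf8_path)
with open(utf8_path, 'r', encoding='utf-8') as f:
    converted = f.read()
print(repr(converted))
```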

Here we are creating a Spark Resilient Distributed Dataset ([RDD](http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds)) object.  This is a very common way to store data to be manipulated with Spark.  It is an immutable collection of records, and since it is immutable, Spark can split it into partitions to distribute among the available processing resources when it is time to process.  RDDs will in general work best if they are sized to fit into memory.

The Spark Context _textFile_ method creates a new RDD loaded with the contents of a text file with each element of the RDD corresponding to a line in the original text file.

In [2]:
# UTF8 encoded textfile.
shrewText = sc.textFile("data/tamingoftheshrew.txt")

As in most text processing and NLP workflows, removing stop words reduces the amount of data that later steps need to handle.  We can use the standard stop words available from the _stop_words_ Python package for our list of words to remove from the corpus.  These stop words will be needed by each worker in a distributed processing environment.

Spark includes the concept of a [_broadcast_](http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables) variable.  This is a shared variable that Spark will make available to each distributed worker by copying the data as efficiently as possible to the destination workers.  Each host which executes worker processes will then have the data available without any additional copy operations.

In [3]:
# Grab stop words to remove from the corpus.
from stop_words import get_stop_words
stopwords = sc.broadcast( set(get_stop_words('en')) )

When we are processing input lines, we collapse each run of whitespace in a line into a single space.  These runs are found and replaced using a regular expression pattern.  This pattern is broadcast to all worker processes since the pattern itself is immutable and common to all workers.

In addition to the whitespace, we find what appear to be HTML element tags in the file.  These indicate speaker and other non-verbal information for the play.  We can remove these using a regular expression as well, and this regular expression is also broadcast.

In [4]:
# Regular expressions to find runs of whitespace and element tags.
import re
multispace = sc.broadcast( re.compile(r'\s+') )
elementtag = sc.broadcast( re.compile(r'<[^>]*>'))
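To see what these two patterns do, here is a small sketch in plain Python with no Spark involved.  The patterns are the same as in the cell above, but the sample line and its tag name are made up for illustration and may not match the markup in the actual file.

```python
import re

multispace_pattern = re.compile(r'\s+')      # Same pattern as multispace.
elementtag_pattern = re.compile(r'<[^>]*>')  # Same pattern as elementtag.

# Hypothetical input line; the real tags in the corpus may differ.
sample = '<SPEAKER>  Some   spoken\ttext </SPEAKER>'

# Replace each tag with a space, then collapse runs of whitespace.
without_tags = elementtag_pattern.sub(' ', sample)
collapsed = multispace_pattern.sub(' ', without_tags)
pieces = collapsed.split(' ')

print(repr(collapsed))
print(pieces)
```

Note that a single leading and trailing space can survive, so splitting on a space yields empty strings at the ends.  This is why empty records need to be filtered out later.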

Now that we have the corpus to process and some immutable data to work with, we can start working on the data. First, we split the input into individual words. We can easily do this by splitting on whitespace of any kind and size, then creating an output record for each word resulting from the split.

In the code below, the splitting is done internal to the flatMap call. In that call, each line is processed to replace whitespace of any kind with a single space, all text is converted to lower case for counting, and then the single spaces are used to split the line into a record for each word.

Once the words have been converted to records, the stop words are removed and any remaining empty records are removed.  Note that use of the broadcast variables requires that the _.value_ attribute be accessed to obtain the original variable from the broadcast variable.

In [5]:

# Remove outside whitespace, convert to lower case, collapse remaining whitespace to single spaces.
# Defined before the pipeline so that the name exists when split_line is used.
def normalize_line(line):
    normal_line = line.strip().lower()  # Remove surrounding spaces, make lower case.
    normal_line = elementtag.value.sub(' ', normal_line) # Remove element tags.
    normal_line = multispace.value.sub(' ', normal_line) # Replace multi-spaces with single spaces.
    return normal_line

# Split a line into words at each space.
def split_line(line):
    return normalize_line(line).split(" ")

# Split the input and filter out the undesirable parts.
shrewWords = (
    shrewText.flatMap(split_line)
      .filter(lambda w: len(w) > 0)               # Remove empty words.
      .filter(lambda w: w not in stopwords.value) # Remove stop words.
)
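The difference between mapping and flat-mapping can be illustrated without Spark.  The sketch below mimics the steps of the cell above on ordinary Python lists; the input lines and the tiny stop word set are hypothetical stand-ins, and `normalize` is a local copy of the normalization logic that uses plain patterns instead of broadcast variables.

```python
import re
from itertools import chain

TAG = re.compile(r'<[^>]*>')
SPACES = re.compile(r'\s+')
STOP_WORDS = {'the', 'a', 'is'}  # Hypothetical stand-in for the broadcast set.

def normalize(line):
    line = line.strip().lower()
    line = TAG.sub(' ', line)
    return SPACES.sub(' ', line)

lines = ['<TAG> The   cat is  here', '', 'A dog']  # Hypothetical input lines.

# A plain map produces one list of words per input line ...
per_line = [normalize(line).split(' ') for line in lines]

# ... while flatMap behaves like flattening those lists into one sequence.
flattened = list(chain.from_iterable(per_line))

# The two filters: drop empty strings, then drop stop words.
words = [w for w in flattened if len(w) > 0 and w not in STOP_WORDS]
print(words)
```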


Now that we have a record for each individual word in the corpus, we can group and count them.  To do this we create Key/Value records by mapping the words.  The Key is set to the word, and the Value is given an integer value of 1.

Calling _.reduceByKey()_ on these Key/Value records groups the records for each word together and processes them using the provided function.  Here we add up all of the individual Values for the records of that Key.  Since each word in the corpus started with a Value of 1, adding these together results in the count of the number of times that (Key) word appears in the corpus.
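As a rough illustration of this group-then-reduce idea, the following plain Python sketch performs the same computation on a short, made-up list of words.  It only mirrors the result of _.reduceByKey()_; Spark itself combines values within each partition before moving data between workers, which this sketch does not attempt to model.

```python
from functools import reduce
from itertools import groupby

words = ['thou', 'shall', 'thou', 'good', 'thou', 'shall']  # Hypothetical input.

# Generate the Key/Value records: every word starts with a Value of 1.
pairs = [(word, 1) for word in words]

def add(x, y):
    return x + y

# Group the records by Key, then reduce the Values of each group by addition.
counts = {
    key: reduce(add, (value for _, value in group))
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0])
}
print(counts)
```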

We go on to sort the result in descending order by count, then save the result.  This is done by writing the RDD to a text file.

The saved result is written as one file per RDD partition, and these parts need to be combined in order to see the entire output together.  Other storage types can provide a single output file for review.  The output is a directory with the location name, which holds the partition files along with some metadata files.

In [6]:
# Count the words by mapping a value for each row
# and adding the values up for each unique key.
shrewCounts = (
    shrewWords.map(lambda word: (word, 1)) # Generate the Key/Value records.
      .reduceByKey(lambda x, y: x + y)     # Generate the word counts.
      .map(lambda t: (t[1],t[0]))          # Swap Key and Value to sort by Value.
      .sortByKey(False)                    # Sort in descending order.
      .map(lambda t: (t[1],t[0]))          # Swap back to original sense of Key/Value.
)

resultsLocation = 'shrewcounts'

# Ensure that there is no previous output in the location.
# Choose to store multiple results by using multiple locations.
import shutil
shutil.rmtree(resultsLocation,ignore_errors=True)

# Store the results
shrewCounts.saveAsTextFile(resultsLocation)
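One way to review the whole output is to read the partition files back in name order.  The sketch below assumes the results were written to the local file system and that the partition files use the usual `part-` name prefix; the helper name is hypothetical, and the demonstration builds its own small directory rather than reading the real results.

```python
import glob
import os
import tempfile

def read_all_parts(location):
    """Return the lines of every part-* file under location, in name order."""
    lines = []
    for path in sorted(glob.glob(os.path.join(location, 'part-*'))):
        with open(path, 'r', encoding='utf-8') as part:
            lines.extend(line.rstrip('\n') for line in part)
    return lines

# Build a small stand-in for an output directory (hypothetical contents).
demo_location = tempfile.mkdtemp()
demo_files = [
    ('part-00000', "('will', 146)\n('thou', 112)\n"),
    ('part-00001', "('shall', 99)\n"),
    ('_SUCCESS', ''),  # Metadata file, skipped by the part-* pattern.
]
for name, body in demo_files:
    with open(os.path.join(demo_location, name), 'w', encoding='utf-8') as f:
        f.write(body)

combined = read_all_parts(demo_location)
print(combined)
```

For the actual results, the call would be `read_all_parts(resultsLocation)`.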

Now we can collect and display some interesting information about the processing results.  The results look plausible, since Shakespeare certainly makes heavy use of words like _thou_ and _shall_, but they also show that ETL processing is not complete for this corpus.  Entries such as _sir,_ and _'tis_ demonstrate that we have not dealt with punctuation marks.  If you examine the result file(s) created above, you will see many examples of punctuation and other symbols in the counted words.  Indeed, Shakespeare is full of 'em.

In [7]:
# Count the number of unique words and the total number of words.
# Sans stop words, of course.
countOfUniqueWords = shrewCounts.count()
totalCountOfWords  = shrewCounts.map(lambda t: t[1]).reduce(lambda x,y: x+y)

# Look at some results.
print('Unique words: ',countOfUniqueWords,', Total words: ',totalCountOfWords,'\n')
for k,v in shrewCounts.take(10):
    print(k,': ',v)


Unique words:  4845 , Total words:  11257 

will :  146
thou :  112
shall :  99
thy :  85
good :  76
sir, :  74
you, :  67
why, :  55
'tis :  52
let :  49
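One possible next step for the punctuation issue is to strip punctuation from both ends of each word before counting.  The sketch below is only one option among several, shown in plain Python; the helper name is hypothetical, and the sample words are taken from the output above plus two made-up entries (`o'er` and `--`).

```python
import string

def strip_punctuation(word):
    """Remove punctuation from both ends of a word, leaving the interior alone."""
    return word.strip(string.punctuation)

samples = ['sir,', 'you,', 'why,', "'tis", "o'er", '--']
cleaned = [strip_punctuation(word) for word in samples]
print(cleaned)
```

This choice also turns _'tis_ into _tis_, which may or may not be desired, and a word made only of punctuation becomes an empty string.  In a Spark pipeline, a step like this would therefore need to run before the filter that removes empty words.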
