# 03: WordCount - Alternative Implementations of WordCount

This exercise also implements *Word Count*, but it uses a slightly simpler approach. It also shows one way to make the code more configurable: we'll define variables for the input and output locations. The corresponding Spark program, [WordCount3.scala](https://github.com/deanwampler/spark-scala-tutorial/blob/master/src/main/scala/sparktutorial/WordCount3.scala), uses a utility library to support command-line arguments. (The library demonstrates some idiomatic but fairly advanced Scala code; you can ignore the details and just use it.)

Next, instead of using the old approach of creating a `SparkContext`, as we did in <a href="02_WordCount.ipynb" target="02_WC">02_WordCount</a>, we'll use the now-recommended approach of creating a `SparkSession` and extracting the `SparkContext` from it when needed. Finally, we'll also use [Kryo Serialization](http://spark.apache.org/docs/latest/tuning.html), which produces a more compact serialized form than Java serialization and therefore makes better use of memory and network bandwidth (not that we really need it for this small dataset...).
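
As a rough illustration of that setup, here is a minimal sketch for a standalone program with Spark on the classpath. The application name and the `local[*]` master are assumptions chosen for the example, and this is not necessarily how `WordCount3.scala` does it. In a notebook that already provides `spark` and `sc`, `getOrCreate()` may simply return the existing session.

```scala
import org.apache.spark.sql.SparkSession

// Assumed values for illustration: a local master and an arbitrary app name.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("WordCount3 sketch")
  // Switch from the default Java serialization to Kryo.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Extract the SparkContext when the RDD API is needed.
val sc = spark.sparkContext
```

Call `spark.stop()` when the program is finished with the session.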

This version also does some data cleansing to improve the results. The sacred text files included in the `data` directory, such as `kjvdat.txt`, are actually formatted records of the form:

```text
book|chapter#|verse#|text
```

That is, pipe-separated fields with the book of the Bible (e.g., Genesis, but abbreviated "Gen"), the chapter and verse numbers, and then the verse text. We just want to count words in the verses, although including the book names wouldn't change the results significantly. (Now you can figure out the answer to one of the exercises in the previous example...)

The corresponding Spark program is [WordCount2.scala](https://github.com/deanwampler/spark-scala-tutorial/blob/master/src/main/scala/sparktutorial/WordCount2.scala). It shows you how to structure a Spark program, including imports and one way to construct the required [SparkContext](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext).

We'll use the KJV Bible text again. Subsequent exercises will add the ability to specify different input sources using command-line arguments.

This time, we'll define variables for the input and output data locations. So, if you want to use the same code to process different data, just edit the next cell:

In [1]:
val in  = "../data/kjvdat.txt"    // input file
val out = "output/kjv-wc3"        // output location (directory)

in = ../data/kjvdat.txt
out = output/kjv-wc3



As before, we read the text file, convert the lines to lower case, and tokenize into words. Let's start this time by defining a helper method that handles the record format; it will strip off the leading `book|chapter#|verse#|`, leaving just the verse text. We do this by splitting the line on `|` (together with any surrounding whitespace), which returns an `Array[String]`. Then we keep the last element.

In [2]:
def toText(str: String): String = {
  val ary = str.toLowerCase.split("\\s*\\|\\s*")
  if (ary.length > 0) ary.last else ""
}

toText: (str: String)String
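
To see what the helper does with different kinds of lines, here is a small sketch that can be run on its own. The sample strings are written by hand for illustration; only the first one follows the `book|chapter#|verse#|text` format.

```scala
// Same helper as in the cell above, repeated so this block is self-contained.
def toText(str: String): String = {
  val ary = str.toLowerCase.split("\\s*\\|\\s*")
  if (ary.length > 0) ary.last else ""
}

// Hand-written sample lines (illustrative only).
val record    = "Gen|1|1| In the beginning God created the heaven and the earth."
val noPipes   = "A line without any pipes"
val onlyPipes = "|||"

val fromRecord    = toText(record)     // the lower-cased verse text
val fromNoPipes   = toText(noPipes)    // the whole line, lower-cased
val fromOnlyPipes = toText(onlyPipes)  // empty: split drops trailing empty strings
```

A line with no `|` at all passes through unchanged apart from the lower-casing, which is worth keeping in mind when trying other input files.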


In [3]:
val input = sc.textFile(in).map(line => toText(line))  // could also write ...map(toText)

input = MapPartitionsRDD[2] at map at <console>:31



Recall that if you will read `input` several times, you should cache the data so Spark doesn't reread it from disk each time!

In [4]:
input.cache

MapPartitionsRDD[2] at map at <console>:31

The following is one long statement that is similar to what we saw in _02_WordCount_, but with a few differences. 

Take the `input` and:
1. Split each line on non-alphabetic characters (a crude form of tokenization). `flatMap` "flattens" each array returned into a single `RDD` of words.
2. Use `countByValue` to treat each word as a key, then count all the keys. This returns a Scala `Map[String,Long]` to the driver, so be careful about `OutOfMemory` (`OOM`) errors for very large datasets.

In [5]:
val wc1 = input
  .flatMap(line => line.split("""[^\p{IsAlphabetic}]+"""))
  .countByValue()     // Returns a Scala Map[T, Long] to the driver; no more RDD!

wc1 = Map(onam -> 4, professed -> 2, confesseth -> 3, brink -> 6, youthful -> 1, healings -> 1, sneezed -> 1, forgotten -> 46, precious -> 76, inkhorn -> 3, exorcists -> 1, derided -> 2, eatest -> 3, lover -> 4, centurion -> 21, plentiful -> 4, pasture -> 20, sargon -> 1, speaker -> 2, terrible -> 52, lion -> 104, rate -> 5, zorites -> 1, mole -> 1, lights -> 10, arimathaea -> 4, spokes -> 1, rage -> 18, submitted -> 3, engraver -> 3, ahava -> 3, ferret -> 1, snow -> 24, desolate -> 148, laughing -> 1, jabbok -> 7, shuttle -> 1, arodites -> 1, michael -> 15, darkened -> 19, camest -> 28, abhorrest -> 2, beheld -> 53, looks -> 5, alpha -> 4, crieth -> 17, adders -> 1, holpen -> 5, chargeable -> 5, galatians -> 1, jezaniah -> 2, pudens -> 1, jeremiah -> ...
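
To make the two steps concrete, here is a sketch of the same logic on an ordinary Scala collection, with no Spark involved. The sample lines are made up for illustration, and the `groupBy` expression is only a local stand-in for what `countByValue` computes. The extra `filter` is an addition: `split` returns a leading empty string when a line starts with a non-alphabetic character, which would otherwise be counted as a "word".

```scala
// Made-up sample lines (illustrative only).
val lines = Seq("hello, world! hello", "  spark says: hello")

val words = lines
  .flatMap(line => line.split("""[^\p{IsAlphabetic}]+"""))
  .filter(_.nonEmpty)   // drop the leading "" produced by the second line

// Local stand-in for countByValue: group equal words, then count each group.
val counts: Map[String, Long] =
  words.groupBy(identity).map { case (word, group) => (word, group.size.toLong) }
```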




Now let's convert back to an `RDD` for output. First we convert each key-value pair to a comma-separated string. Note that calling `map` on a Scala `Map` passes each key-value pair to the function as a two-element tuple; we extract the first and second elements with the `_1` and `_2` methods, respectively, and use them to format the output strings. Then we create the `RDD` with a single partition (the `1` argument you'll see below).

In [6]:
val wc2 = wc1.map(key_value => s"${key_value._1},${key_value._2}").toSeq  // returns a Seq[String]
val wc = sc.makeRDD(wc2, 1)

wc2 = List(onam,4, professed,2, confesseth,3, brink,6, youthful,1, healings,1, sneezed,1, forgotten,46, precious,76, inkhorn,3, exorcists,1, derided,2, eatest,3, lover,4, centurion,21, plentiful,4, pasture,20, sargon,1, speaker,2, terrible,52, lion,104, rate,5, zorites,1, mole,1, lights,10, arimathaea,4, spokes,1, rage,18, submitted,3, engraver,3, ahava,3, ferret,1, snow,24, desolate,148, laughing,1, jabbok,7, shuttle,1, arodites,1, michael,15, darkened,19, camest,28, abhorrest,2, beheld,53, looks,5, alpha,4, crieth,17, adders,1, holpen,5, chargeable,5, galatians,1, jezaniah,2, pudens,1, jeremiah,147, coney,2, ditch,6, despitefully,3, sheweth,20, nahaliel,2, sorrows,22, wiser,8, hananeel,4, nicopolis,1, rouse,1, lattice,3, shivers,1, forgettest,2, prophesying,6, shimrith,1,...
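
As a side note on style, the tuple accessors can be replaced with a pattern match that names the two elements. The sketch below uses plain Scala and two counts copied from the output above; the names `viaAccessors` and `viaPattern` are just for illustration.

```scala
// Two entries copied from the output above.
val counts: Map[String, Long] = Map("lion" -> 104L, "snow" -> 24L)

// Tuple accessors, as in the cell above.
val viaAccessors = counts.map(kv => s"${kv._1},${kv._2}").toSeq

// Pattern matching gives the two tuple elements readable names.
val viaPattern = counts.map { case (word, count) => s"$word,$count" }.toSeq
```

Both forms produce the same strings; which one reads better is largely a matter of taste.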




Save the results.

> **Note:** If you run the next cell more than once, _delete the output directory first!_ Spark, following Hadoop conventions, won't overwrite an existing directory.

In [7]:
println(s"Writing output to: $out")
wc.saveAsTextFile(out)

Writing output to: output/kjv-wc3
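
Before rerunning the save cell, the output directory has to be removed. One way to do that from Scala is sketched below, assuming the output is on the local filesystem (a directory on HDFS or another store would need that system's own API). The helper name `deleteRecursively` is made up for this example.

```scala
import java.io.File

// Recursively delete a file or directory on the local filesystem.
def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) {
    // listFiles() returns null on an I/O error, so fall back to an empty array.
    Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  }
  f.delete()   // returns false if nothing was deleted; ignored here
  ()
}

val out = "output/kjv-wc3"   // same value as the `out` variable defined earlier
deleteRecursively(new File(out))
```

If the directory doesn't exist, the call does nothing, so it is safe to run before every save.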


## Recap

Question: how is the output in `notebooks/output/kjv-wc3` different from the output we generated for _02_WordCount_, `notebooks/output/kjv-wc2`?

The `countByValue` function is very convenient for situations like this, but it's not widely used because of its narrow purpose and the risk of exceeding available memory in the job driver.

## Exercises

### Exercise 1: Try different inputs

Change the input `in` and output `out` definitions above to try different files. Does the helper function `toText` need to be changed?

### Exercise 2: Sort by word length

How would you tell the Scala collections library or the `RDD` API to sort by the length of the words, rather than alphabetically? Look at the sort methods in both libraries. Most of the time, you pass a function that takes the full "record" as its argument and returns the value to use for sorting.
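
As a hint rather than a solution, here is a sketch of that pattern applied to a different key: sorting `(word, count)` records by descending count with the Scala collections `sortBy` method. The three records use counts from the output earlier in this notebook.

```scala
// Counts copied from the output earlier in this notebook.
val counts = Seq(("lion", 104L), ("snow", 24L), ("terrible", 52L))

// The function receives the whole (word, count) record and returns the sort key.
// Negating the count sorts from largest to smallest.
val byCountDesc = counts.sortBy { case (_, count) => -count }
```

The `RDD` API also has a `sortBy` method that takes a key function in the same way.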

### Exercise 3: Repeat any of the _02_WordCount_ exercises

You might try doing some of them with Scala collection transformations, rather than `RDD` transformations.