You should remove the `raise NotImplementedError()` statements below and insert your code in their place. The cells which say `DO NOT CHANGE THE CONTENT OF THIS CELL` are there to help you: if they fail, your code is probably wrong. You should not change their content. If you change them to match what your program produces, you will still not get the marks.

If you encounter an error while running your notebook that doesn't appear to be connected to RDDs (such as missing `imp`), check that you have run the initialization cells since you started your latest cluster.

Before you turn your solution in, make sure everything runs as expected. With an attached cluster, you should **Clear State and Results** (under the **Clear** dropdown menu) and then click on the **Run all** icon. This runs all cells in the notebook from scratch. You should only submit this notebook if all cells run.

# Character counting

This homework lets you practise writing functions along with using RDDs. The result will be a number of functions which you combine to take an input text and compute character frequencies in Spark.

The first four cells check that your Databricks setup is correct, download some files for you and move them to the right place if needed.

In [0]:
# DO NOT CHANGE THE CONTENT OF THIS CELL
import imp

try:
    imp.find_module('dbutils')
except ImportError:
    import pyspark
    sc = pyspark.SparkContext()

In [0]:
# DO NOT CHANGE THE CONTENT OF THIS CELL
import os

def file_exists(path):
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
            return False
        else:
            raise

file1 = "/FileStore/pride_and_prejudice.txt"
file2 = "/FileStore/madame_bovary.txt"
file3 = "/FileStore/polish_fairy_tales.txt"

os.environ['download_files'] = "false"
            
try:
    imp.find_module('dbutils')     
    if not file_exists(file1) == True:
        os.environ['download_files'] = "true"
except ImportError:
    print("Are you running this on Databricks?")
    os.environ['download_files'] = "false"

In [0]:
%%bash
# DO NOT CHANGE THE CONTENT OF THIS CELL
if [ $download_files = "true" ]; then
    curl http://www.gutenberg.org/files/1342/1342-0.txt >/tmp/pride_and_prejudice.txt
    curl http://www.gutenberg.org/cache/epub/14155/pg14155.txt >/tmp/madame_bovary.txt    
    curl http://www.gutenberg.org/cache/epub/36668/pg36668.txt >/tmp/polish_fairy_tales.txt
fi

In [0]:
# DO NOT CHANGE THE CONTENT OF THIS CELL
try:
    imp.find_module('dbutils')
    if not file_exists(file1) == True:
        dbutils.fs.cp("file:/tmp/pride_and_prejudice.txt", file1)
        dbutils.fs.cp("file:/tmp/madame_bovary.txt", file2)
        dbutils.fs.cp("file:/tmp/polish_fairy_tales.txt", file3)
except ImportError:
    pass

Write a function `to_lower` which takes a single line string and returns the string in lower case. So for the input "hElLO", `to_lower("hElLO")` should return `hello`.

In [0]:
def to_lower(s):
    return s.lower()
#     raise NotImplementedError()

In [0]:
# DO NOT CHANGE THE CONTENT OF THIS CELL
assert to_lower("hElLO WORLD!") == "hello world!", 'Unexpected lower case of hello world'
assert to_lower("123 RDDs") == "123 rdds", "Unexpected lower case of 123 RDDs"

Write a function `to_characters` which takes a single line string and returns a list of the (non space) characters contained within it. So for the input string "hello world", `to_characters("hello world")` should return `['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']`.

In [0]:
def to_characters(s):
    # Keep every character except spaces (the name `c` avoids shadowing the built-in `chr`).
    return [c for c in s if c != " "]
#     raise NotImplementedError()

In [0]:
# DO NOT CHANGE THE CONTENT OF THIS CELL
assert to_characters("hello world") == ['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd'], "Unexpected hello world result"
assert to_characters("RDDs are fun!") == ['R', 'D', 'D', 's', 'a', 'r', 'e', 'f', 'u', 'n', '!'], "Unexpected RDDs are fun result"

Write the function `rdd_from_file` which takes a file path (such as `"/FileStore/pride_and_prejudice.txt"` or a variable which represents a file path, e.g. `file1`) as input and returns an RDD constructed by reading the lines into separate records.

In [0]:
print(dbutils.fs.head("dbfs:/FileStore/pride_and_prejudice.txt", 500))

In [0]:
def rdd_from_file(filename):
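    # sc.textFile already yields one record per line of the file.
    lines = sc.textFile(filename)
    # split("\n") wraps each line in a one-element list;
    # rdd_to_character_count below unwraps it again with text[0].
    return lines.map(lambda line: line.split("\n"))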
#     raise NotImplementedError()

rdd1 = rdd_from_file(file1)
rdd2 = rdd_from_file(file2)
rdd3 = rdd_from_file(file3)

In [0]:
# DO NOT CHANGE THE CONTENT OF THIS CELL
assert rdd1.count() == 14580, "Something has gone wrong with the RDD reading of file1"
assert rdd2.count() == 15609, "Something has gone wrong with the RDD reading of file2"
assert rdd3.count() == 3112, "Something has gone wrong with the RDD reading of file3"

Write the function `rdd_to_character_value_pairs` which takes an RDD as input, lower cases each record, splits it into characters (minus spaces), and turns each character into a pair whose second element is `1`, returning an RDD of such pairs. So for an RDD containing the single record "Hello World", collecting the result of `rdd_to_character_value_pairs` should give: `[('h', 1), ('e', 1), ('l', 1), ('l', 1), ('o', 1), ('w', 1), ('o', 1), ('r', 1), ('l', 1), ('d', 1)]`

This function can use the functions you have defined above (i.e. you should just need to construct pairs out of the character list returned by `to_characters` after `to_lower` has been used).
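
The difference between mapping and flat-mapping can be sketched without Spark. The snippet below is only a plain-Python illustration on a local list, not the RDD solution itself; `to_lower` and `to_characters` are redefined here as stand-ins for the notebook's functions, and `records` is made-up sample data.

```python
from itertools import chain

# Stand-ins for the notebook's helper functions (assumed behaviour).
def to_lower(s):
    return s.lower()

def to_characters(s):
    return [c for c in s if c != " "]

records = ["Hello World", "Hey"]  # hypothetical sample records

# A plain map keeps one output per record, so the result is a list of lists.
nested = [to_characters(to_lower(r)) for r in records]

# Flattening (what flatMap does on an RDD) yields one output per character.
flat = list(chain.from_iterable(nested))

# Pair every character with the value 1.
pairs = [(c, 1) for c in flat]
print(pairs)
```

On an RDD, the same three steps would correspond to `map`, `flatMap`, and a final `map`.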

In [0]:
def rdd_to_character_value_pairs(rdd):
    return (rdd.map(to_lower)
               .flatMap(to_characters)
               .map(lambda char: (char, 1)))
#     raise NotImplementedError()

In [0]:
# Test 

text = ["   Hello   World!" ,"Hey Hey"]
text_rdd = sc.parallelize(text)

rdd_to_character_value_pairs(text_rdd).collect()

In [0]:
# DO NOT CHANGE THE CONTENT OF THIS CELL
test_strings = ["hello WoRld", "RDDs are great fun!"]
testRDD = sc.parallelize(test_strings)
character_value_pairs = rdd_to_character_value_pairs(testRDD).collect()
assert character_value_pairs == [('h', 1), ('e', 1), ('l', 1), ('l', 1), ('o', 1), ('w', 1), ('o', 1), ('r', 1), ('l', 1), ('d', 1), ('r', 1), ('d', 1), ('d', 1), ('s', 1), ('a', 1), ('r', 1), ('e', 1), ('g', 1), ('r', 1), ('e', 1), ('a', 1), ('t', 1), ('f', 1), ('u', 1), ('n', 1), ('!', 1)], "Your function's output does not appear to match the specification"

Write a function `rdd_to_character_count` which takes an RDD as input, creates a pair RDD of the form `[(character, 1), ...]` and then adds up the 1s to get overall character counts. It should return an RDD containing (character, value) pairs where each character occurs only once and the value represents the frequency of the character in the original RDD.

Again, you can make use of previous functions you have created in this notebook.
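
To see what "adding up the 1s" means, here is a minimal plain-Python sketch of the idea behind a by-key reduction. The helper `reduce_by_key` is a hypothetical local function written for illustration; it is not Spark's `reduceByKey`, which does the same kind of merging across partitions.

```python
def reduce_by_key(pairs, func):
    """Merge the values of pairs that share a key using func (local illustration)."""
    totals = {}
    for key, value in pairs:
        if key in totals:
            totals[key] = func(totals[key], value)
        else:
            totals[key] = value
    return list(totals.items())

pairs = [(c, 1) for c in "hello"]  # made-up input
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```

Each character ends up with exactly one pair whose value is the number of times it occurred.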

In [0]:
def rdd_to_character_count(rdd):
    # Records produced by rdd_from_file are one-element lists, hence text[0].
    return (rdd.flatMap(lambda text: to_characters(text[0]))
               .map(to_lower)
               .map(lambda char: (char, 1))
               .reduceByKey(lambda v1, v2: v1 + v2))
#     raise NotImplementedError()

In [0]:
rdd1.take(10)

rdd_new = rdd_to_character_count(rdd1)

rdd_new.collect()

In [0]:
# DO NOT CHANGE THE CONTENT OF THIS CELL
from pyspark.rdd import RDD

count1 = rdd_to_character_count(rdd1)
assert isinstance(count1, RDD), "The output of your rdd_to_character_count is expected to be an RDD"

assert count1.lookup('s') == [33875], "Your counting may not be working as expected (rdd1 - s)"
assert count1.lookup('h') == [34662], "Your counting may not be working as expected (rdd1 - h)"
assert count1.lookup('a') == [42760], "Your counting may not be working as expected (rdd1 - a)"

count2 = rdd_to_character_count(rdd2)
count3 = rdd_to_character_count(rdd3)

assert isinstance(count2, RDD), "The output of your rdd_to_character_count is expected to be an RDD"
assert isinstance(count3, RDD), "The output of your rdd_to_character_count is expected to be an RDD"

assert count2.lookup('s') == [44074], "Your counting may not be working as expected (rdd2 - s)"
assert count2.lookup('h') == [6177], "Your counting may not be working as expected (rdd2 - h)"
assert count2.lookup('a') == [46611], "Your counting may not be working as expected (rdd2 - a)"

assert count3.lookup('s') == [5882], "Your counting may not be working as expected (rdd3 - s)"
assert count3.lookup('h') == [6856], "Your counting may not be working as expected (rdd3 - h)"
assert count3.lookup('a') == [7778], "Your counting may not be working as expected (rdd3 - a)"

Write a function `rdd_with_ordered_character_count` which takes an RDD, converts it into a pair RDD of the form `(character, numerical_value)` representing the frequency of each character (lower cased) in the input, and returns an RDD containing `(character, value)` pairs where the values are presented in decreasing order. So for the input "hEllO", the output should be `[('l', 2), ('o', 1), ('h', 1), ('e', 1)]`. You should consider using the function(s) you have defined previously in this homework.
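
Sorting pairs by their second element can be tried out locally first. This is a plain-Python sketch on made-up counts, not the RDD solution. It also shows that characters with equal counts can legitimately appear in different orders unless a tie-break is added, which is worth keeping in mind when comparing against an expected output that contains ties.

```python
counts = [('h', 1), ('e', 1), ('l', 2), ('o', 1)]  # made-up counts for "hello"

# Sort by the count (second element), largest first.
by_count = sorted(counts, key=lambda pair: pair[1], reverse=True)

# Optional tie-break: largest count first, then alphabetical by character.
deterministic = sorted(counts, key=lambda pair: (-pair[1], pair[0]))

print(by_count)
print(deterministic)
```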

In [0]:
def rdd_with_ordered_character_count(rdd):
    cc_rdd = rdd_to_character_count(rdd)
    ord_cc_rdd = cc_rdd.sortBy(lambda pair: pair[1], ascending=False)
    return ord_cc_rdd
#     raise NotImplementedError()

In [0]:
# rdd1.take(10)

# rdd_new = rdd_to_character_count(rdd1)
#rdd1.flatMap(lambda text: to_characters(text[0])).map(lambda text: to_lower(text)).map(lambda line: (line,1)).reduceByKey(lambda v1,v2 : v1+v2)
# rdd_new2 = rdd_new.map(lambda line: (line[1], line[0])).sortByKey(ascending = False).map(lambda line: (line[1], line[0]))
# rdd_new2.take(20)

rdd_new = rdd_with_ordered_character_count(rdd2)
rdd_new.take(10)

In [0]:
ordered1 = rdd_with_ordered_character_count(rdd1)
ordered2 = rdd_with_ordered_character_count(rdd2)
ordered3 = rdd_with_ordered_character_count(rdd3)

assert ordered1.take(10) == [('e', 71358), ('t', 48274), ('a', 42760), ('o', 41411), ('i', 38885), ('n', 38736), ('h', 34662), ('s', 33875), ('r', 33556), ('d', 22850)], "Your sorting may be going wrong"
assert ordered2.take(10) == [('e', 79764), ('a', 46611), ('s', 44074), ('t', 40678), ('i', 40356), ('n', 36296), ('r', 35857), ('l', 35195), ('u', 33588), ('o', 29092)], "Your sorting may be going wrong"
assert ordered3.take(10) == [('e', 13055), ('t', 8876), ('o', 7908), ('a', 7778), ('n', 7040), ('h', 6856), ('i', 6624), ('r', 6365), ('s', 5882), ('d', 4795)], "Your sorting may be going wrong"

Now you can compare the results to see how the order of most frequent characters differs between languages! (You shouldn't need to change the following cell.) `file1` was an English text, `file2` was in French and `file3` was Polish.

In [0]:
zipped = zip(ordered1.collect(), ordered2.collect(), ordered3.collect())
for values in zipped:
    print(values)
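
One thing to be aware of when reading the comparison: `zip` stops at the shortest input, so if the texts contain different numbers of distinct characters, the tail of the longer rankings is not printed. A minimal sketch of an alternative using `itertools.zip_longest` follows; the two short lists are made-up toy rankings, not results from the files.

```python
from itertools import zip_longest

# Toy rankings of different lengths (hypothetical data).
ranking_a = [('e', 3), ('t', 2), ('a', 1)]
ranking_b = [('e', 4), ('a', 2)]

# zip drops the unmatched last entry of ranking_a.
truncated = list(zip(ranking_a, ranking_b))

# zip_longest keeps it and pads the shorter ranking with None.
padded = list(zip_longest(ranking_a, ranking_b, fillvalue=None))

for row in padded:
    print(row)
```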