# Big Data Assignment 3: Word Count

Rules:
1. All words have to be converted to lowercase
2. Words are separated by any number of spaces or tabs
3. If a word contains any special character (anything other than a-z and A-Z), that word is filtered out (dropped), with one exception: if a word ends with one of {".", ";", ",", "?", ":"}, drop that trailing character and keep the rest as a proper word
4. Therefore a valid word can have only {a-z, A-Z}
5. You may use Spark's collect() for debugging purposes, but your submitted solution should not include debugging statements. For each transformation, you need to provide a single comment line
6. You have to be able to explain your transformations
7. Must use some Python functions for transformations
8. You must keep transformations as simple as possible and use Python functions
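
Rules 1 to 4 can be tried out in plain Python before any Spark code is written. The sketch below is only an illustration: `normalize_token` is a hypothetical helper (not part of the notebook), the sample line is made up, and the check is deliberately ASCII-only, which is stricter than `str.isalpha()`.

```python
import re

# The five trailing characters that rule 3 allows to be stripped.
TRAILING = (".", ";", ",", "?", ":")

def normalize_token(token):
    """Hypothetical helper: return the cleaned word, or None if it is dropped."""
    word = token.lower()                 # rule 1: lowercase
    if word.endswith(TRAILING):          # rule 3 exception: strip one trailing mark
        word = word[:-1]
    # rule 4: only a-z may remain (ASCII only, so accented letters are rejected)
    return word if re.fullmatch(r"[a-z]+", word) else None

# Made-up sample line with a double space and a tab (rule 2).
line = "The  fox,\tthe DOG; it's 28, done."
words = [w for w in map(normalize_token, line.split()) if w is not None]
print(words)  # ['the', 'fox', 'the', 'dog', 'done']
```

Here "it's" is dropped because of the apostrophe, and "28," is dropped because digits remain after the comma is stripped.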

You may download the input from here:
https://github.com/mahmoudparsian/big-data-mapreduce-course/blob/master/data/205-0.txt

### Step 1: Start Spark

In [1]:
spark.version

'3.3.2'

In [2]:
spark

### Step 2: Input Text

Create the path to the text file you want to run the word count on, then create an RDD from it.

In [3]:
input_path = "205-0.txt"

In [4]:
rdd = sc.textFile(input_path)

### Step 3: Define Function

1. word_split - this function will split the lines of text into individual word elements, which will need to be cleaned.

In [5]:
def word_split(record):
    # split() with no argument splits on any run of spaces or tabs (rule 2)
    tokens = record.split()
    return tokens
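
Rule 2 allows any number of spaces or tabs between words, so it matters which form of `str.split` a splitting function uses. A quick plain-Python comparison on a made-up line:

```python
# Made-up line containing a doubled space and a tab.
line = "simplify,  simplify\tsimplify"

# Splitting on a single space keeps an empty string for the doubled space
# and leaves the tab inside the last token.
on_space = line.split(" ")
print(on_space)       # ['simplify,', '', 'simplify\tsimplify']

# Splitting with no argument treats any run of whitespace as one separator.
on_whitespace = line.split()
print(on_whitespace)  # ['simplify,', 'simplify', 'simplify']
```

With the single-space form, the tab-joined token would later fail the alphabetical check and both words would be lost from the count.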

2. make_low - this function will convert all our words to lowercase.

In [6]:
def make_low(record):
    low_record = record.lower()
    return low_record

3. check_words - this function will check to see if the word fits within the assignment's rules. All non-alphabetical words, like numbers, get dropped. Any word that contains a special character will also get dropped, unless it has one of {".", ";", ",", "?", ":"} on the end. In that case the word will be kept but the character will be dropped. 

In [9]:
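def check_words(record):
    # trailing characters that rule 3 allows us to strip
    spec_char = (".", ";", ",", "?", ":")
    if record.isalpha():
        return record
    elif record.endswith(spec_char):
        # drop the single trailing special character
        return record[:-1]
    else:
        # any other token is dropped; the None is filtered out in Step 5
        return None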
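
To see which branch each kind of token takes, the function can be exercised in plain Python without Spark. The definition is repeated here (same behavior as the notebook's version) so the snippet runs on its own; the sample tokens are made up.

```python
def check_words(record):
    spec_char = (".", ";", ",", "?", ":")
    if record.isalpha():
        return record
    elif record.endswith(spec_char):
        return record[:-1]
    return None

# Made-up tokens covering each branch.
samples = ["walden", "pond.", "28,", "don't", "end..", ""]
results = [check_words(s) for s in samples]
for token, result in zip(samples, results):
    print(repr(token), "->", repr(result))
# 'walden' -> 'walden'
# 'pond.' -> 'pond'
# '28,' -> '28'
# "don't" -> None
# 'end..' -> 'end.'
# '' -> None
```

Note that "28," and "end.." come back as "28" and "end.", which are still not valid words, so a later alphabetical filter is needed to remove them.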

### Step 4: Begin Transformations

Use _word_split_ to create word elements instead of line elements. 

In [7]:
word_rdd = rdd.flatMap(word_split)

Use _make_low_ to make each element lowercase.

In [8]:
low_word_rdd = word_rdd.map(make_low)

Use _check_words_ to drop elements that don't fit in the assignment's rules.

In [10]:
clean_rdd = low_word_rdd.map(check_words)

### Step 5: Filter RDD

Next we will run three filters on the RDD. 

1. The first filter removes the None values that _check_words_ returns for dropped tokens (its else branch), along with any empty strings.
2. The second filter drops any remaining elements that are not purely alphabetical. This is necessary because _check_words_ still keeps values like "28" (from "28,"), since the token had a special character at the end.
3. The third filter keeps only words with at least 4 characters, so shorter words are ignored.

In [11]:
# drop None/empty values, then non-alphabetical elements, then words shorter than 4 characters
filter_rdd = clean_rdd.filter(lambda x: bool(x)).filter(lambda x: x.isalpha()).filter(lambda x: len(x) >= 4)
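
The effect of the three filters can be traced on a small made-up list in plain Python. This is only a local sketch of the same predicates, not Spark code; the list stands in for the contents of the cleaned RDD.

```python
# Made-up stand-in for the cleaned elements (None marks a dropped token).
cleaned = ["walden", None, "28", "", "the", "pond", "end.", "woods"]

# Filter 1: drop None and empty strings (both are falsy).
step1 = [x for x in cleaned if x]
# Filter 2: keep only purely alphabetical elements.
step2 = [x for x in step1 if x.isalpha()]
# Filter 3: keep only words with at least 4 characters.
step3 = [x for x in step2 if len(x) >= 4]

print(step1)  # ['walden', '28', 'the', 'pond', 'end.', 'woods']
print(step2)  # ['walden', 'the', 'pond', 'woods']
print(step3)  # ['walden', 'pond', 'woods']
```

The order matters: the first filter has to run before the second, because calling `isalpha()` on None would raise an AttributeError.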

### Step 6: Create (K, V) Pairs

Each word becomes a key with a count of 1 as its value.

In [12]:
count_rdd = filter_rdd.map(lambda x: (x, 1))

### Step 7: Reduce by Key

This will sum the counts for each key, resulting in a word count for each word appearing in the document that meets our requirements. 

In [13]:
reduced_rdd = count_rdd.reduceByKey(lambda x,y: x+y)
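
As a mental model, the pairing and reduction steps can be imitated locally in plain Python. The `reduce_by_key` function below is a hypothetical stand-in written for this illustration; it is not how Spark executes `reduceByKey` (Spark combines values within each partition and then across partitions), but it produces the same result for an associative function such as addition. The word list is made up.

```python
from collections import defaultdict
from functools import reduce

# Made-up filtered words, turned into (word, 1) pairs.
words = ["that", "with", "that", "which", "that"]
pairs = [(w, 1) for w in words]

def reduce_by_key(pairs, func):
    """Hypothetical local stand-in: group values by key, then fold each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)  # {'that': 3, 'with': 1, 'which': 1}
```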

### Step 8: Select Top 5 Words

In [14]:
reduced_rdd.takeOrdered(5, key = lambda x: -x[1])

[('that', 1330), ('with', 916), ('which', 870), ('they', 699), ('have', 673)]
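
`takeOrdered(n, key=...)` returns the n elements with the smallest key values, so negating the count yields the largest counts first. The same idea can be sketched locally with `heapq.nsmallest`; the (word, count) pairs below are made up for illustration and are not taken from the input file.

```python
import heapq

# Made-up (word, count) pairs.
counts = [("pond", 3), ("woods", 7), ("life", 5), ("house", 2)]

# Smallest negated count == largest count, mirroring key=lambda x: -x[1].
top2 = heapq.nsmallest(2, counts, key=lambda kv: -kv[1])
print(top2)  # [('woods', 7), ('life', 5)]

# An equivalent formulation: sort by count descending and slice.
top2_sorted = sorted(counts, key=lambda kv: kv[1], reverse=True)[:2]
print(top2_sorted)  # [('woods', 7), ('life', 5)]
```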