# Part 1) RDDs

Calculate chi-square values using RDDs and transformations. Write the output to a file output_rdd.txt.

For this part, we first set the configuration and initialize the Spark context that is used throughout.

In [None]:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Part 1) RDDs")
sc = SparkContext(conf=conf)

### Read the data

We use the Spark context `sc` to read the file with `sc.textFile`, parse each line with `json.loads`, and keep only *category* and *reviewText* for further analysis. The stopwords are loaded with `sc.textFile` as well.

In [None]:
import json
# create rdd
documents = sc.textFile("hdfs:///user/data/reviews_devset.json")
# load as json
dataset = documents.map(json.loads)
# keep only (category, reviewText)
rdd_data = dataset.map(lambda e: (e['category'], e['reviewText']) )
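
To illustrate what the two `map` calls do to a single input line, here is a minimal plain-Python sketch. The sample line is hypothetical (the `overall` field is only an assumed example of an extra attribute that gets dropped); the real records come from `reviews_devset.json`.

```python
import json

# hypothetical input line; "overall" stands in for any field that is not needed later
line = '{"category": "Book", "reviewText": "Great plot.", "overall": 5.0}'

record = json.loads(line)                           # dict with all fields of the review
pair = (record["category"], record["reviewText"])   # keep only the two fields used later
```

Each element of `rdd_data` has the same shape as `pair`: a `(category, reviewText)` tuple.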

In [None]:
# read stopwords into a set so that membership tests are fast
stopwords = set(sc.textFile("/user/data/stopwords").collect())

### Step 1

#### Tokenization, case folding and removing stop words

In this step we define a function called `tokenization_casefolding_stopwords`, which `flatMap` calls for every row.

The function needs `re` for the tokenization. For each row it lowercases the review text (case folding), replaces delimiters such as tabs, digits, and common punctuation characters with spaces, then splits the text on whitespace and keeps only the unique words.

Words that are stopwords or consist of a single character are dropped. For each row the function emits three kinds of key-value pairs:

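- `(categoryCount~category, 1)` counts the number of reviews in each category
- `(wordCount~word, 1)` counts the number of reviews that contain each word (a word is counted at most once per review)
- `((category,word), 1)` counts the number of reviews in each category that contain the word; the key is stored as the string `category,word`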

In [None]:
import re
# raw string so that the regex escapes are not treated as (invalid) Python string escapes
delimiter = r'\d+|\t|\.|\?|!|,|;|:|\(|\)|\[|\]|\{|\}|-|"|`|~|#|&|\*|%|\$|\\|/'
def tokenization_casefolding_stopwords(row):
    # get the text from the row entry and lower case reviewText
    category = row[0]
    reviewText = row[1].lower()
    # remove unwanted chars
    reviewText = re.sub(delimiter, " ",reviewText)
    # split on any whitespace (spaces, newlines, ...) and keep only the unique words
    unique_words = set(reviewText.split())
    # drop single-character tokens and stopwords once, then reuse the result
    tokens = [w for w in unique_words if len(w) > 1 and w not in stopwords]
    # ("category,word", 1) pairs
    categoryWord = [(category + "," + w, 1) for w in tokens]
    # ("wordCount~word", 1) pairs
    wordCount = [("wordCount~" + w, 1) for w in tokens]
    # return all three kinds of pairs, including one ("categoryCount~category", 1) pair
    return categoryWord + wordCount + [("categoryCount~" + category, 1)]
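
To make the emitted pairs concrete, the following sketch runs a simplified stand-in on one toy review. `emit_pairs`, `DELIMS`, and `STOPWORDS` are hypothetical names used only for this illustration; the delimiter pattern and stopword set are shortened, and the tokens are sorted only so that the output order is deterministic.

```python
import re

# simplified stand-ins for illustration (not the full delimiter list or stopword file)
DELIMS = r'\d+|[.,;:!?()\[\]{}"-]'
STOPWORDS = {"the", "is", "and"}

def emit_pairs(category, text):
    # case folding, then replace delimiters with spaces
    cleaned = re.sub(DELIMS, " ", text.lower())
    # unique tokens, without single characters and stopwords
    tokens = sorted({w for w in cleaned.split() if len(w) > 1 and w not in STOPWORDS})
    return ([(category + "," + w, 1) for w in tokens]
            + [("wordCount~" + w, 1) for w in tokens]
            + [("categoryCount~" + category, 1)])

pairs = emit_pairs("Book", "The plot is great, great plot!")
```

Although "great" and "plot" each occur twice in the review, every key is emitted only once, so the later sums count reviews rather than word occurrences.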

To get the number of rows in the dataset, we use `rdd_data.count()`.

In [None]:
N = rdd_data.count()

Here we call `tokenization_casefolding_stopwords` for every row and then use `reduceByKey` to sum the values emitted for each key.

From `mapped_rdd` we keep all `category,word` entries (the keys without a `wordCount~` or `categoryCount~` prefix) in a *categoryWord* RDD.
The `categoryCount~` and `wordCount~` entries are collected to the driver as a dictionary with the `collectAsMap` action.

In [None]:
mapped_rdd = rdd_data.flatMap(lambda row: tokenization_casefolding_stopwords(row)).reduceByKey(lambda x, y: x + y)

In [None]:
categoryWord = mapped_rdd.filter(lambda x: "wordCount~" not in x[0] and "categoryCount~" not in x[0])

In [None]:
resultAsMap = mapped_rdd.filter(lambda x: "wordCount~" in x[0] or "categoryCount~" in x[0]).collectAsMap()
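
The effect of the reduce and the two filters can be sketched without Spark. The snippet below is a plain-Python analogy under assumed toy data, not the distributed implementation: a `Counter` plays the role of `reduceByKey`, and two dictionary comprehensions play the role of the filters.

```python
from collections import Counter

# hypothetical mapper output for two tiny "Book" reviews
pairs = [
    ("Book,story", 1), ("wordCount~story", 1), ("categoryCount~Book", 1),
    ("Book,story", 1), ("Book,plot", 1),
    ("wordCount~story", 1), ("wordCount~plot", 1), ("categoryCount~Book", 1),
]

# analogue of reduceByKey(lambda x, y: x + y): sum the values per key
reduced = Counter()
for key, value in pairs:
    reduced[key] += value

PREFIXES = ("wordCount~", "categoryCount~")
# keys without a prefix are the (category,word) counts
category_word = {k: v for k, v in reduced.items() if not k.startswith(PREFIXES)}
# prefixed keys become the lookup dictionary (the role of resultAsMap)
result_as_map = {k: v for k, v in reduced.items() if k.startswith(PREFIXES)}
```

In the notebook only the prefixed totals are collected to the driver; the `(category,word)` counts stay in an RDD.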

### Step 2

In this step we calculate the chi-square value for each `(category,word)` key. To do this we define a function called `chi_square_calc`.

The function takes one `(key, count)` pair and computes chi-square from that count, the total number of reviews `N`, and the per-category and per-word counts stored in `resultAsMap`. It outputs `(category, word:chi-square)` tuples.
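
Reading the variable names off `chi_square_calc`, the computation corresponds to the usual chi-square statistic for a 2x2 contingency table of "review is in the category" against "review contains the word":

$$\chi^2 = \frac{N\,(AD - BC)^2}{(A + B)\,(A + C)\,(B + D)\,(C + D)}$$

Here $A$ is the number of reviews in the category that contain the word, $B$ the number of reviews outside the category that contain the word, $C$ the number of reviews in the category that do not contain the word, $D$ the number of reviews that are neither in the category nor contain the word, and $N = A + B + C + D$ the total number of reviews.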

In [None]:
def chi_square_calc(row):
    category = row[0].split(",")[0]
    word = row[0].split(",")[1]
    cCount = resultAsMap["categoryCount~" + category]
    wCount = resultAsMap["wordCount~" + word]
    A = row[1]
    B = wCount - A
    C = cCount - A
    D = N - A - B - C
    chi_square = N * ((A * D) - (B * C)) * ((A * D) - (B * C)) / ((A + B) * (A + C) * (B + D) * (C + D))
    return (category, word + ":" + str(chi_square))
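
As a quick sanity check of the arithmetic, the same computation can be run on hand-picked counts. This is a standalone sketch: the function name `chi_square` and all numbers are hypothetical, and it takes the counts as arguments instead of reading them from `resultAsMap`.

```python
def chi_square(a, word_count, category_count, n):
    """Chi-square for one (category, word) pair from raw counts.

    a: reviews in the category that contain the word
    word_count: reviews in any category that contain the word
    category_count: reviews in the category
    n: total number of reviews
    """
    b = word_count - a        # word present, other categories
    c = category_count - a    # category reviews without the word
    d = n - a - b - c         # neither in the category nor containing the word
    diff = a * d - b * c
    return n * diff * diff / ((a + b) * (a + c) * (b + d) * (c + d))

# hypothetical counts: 100 reviews, 40 in the category, 30 contain the word, 20 both
score = chi_square(a=20, word_count=30, category_count=40, n=100)

# word spread evenly across categories: A*D equals B*C, so the statistic is zero
independent = chi_square(a=10, word_count=20, category_count=50, n=100)
```

With the first set of counts the score is about 12.7; in the second case the word carries no information about the category and the score is 0.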

After calculating the chi-square value for each `(category,word)` key, we use `reduceByKey()` to concatenate all `word:chi-square` strings of a category into one space-separated string. `mapValues()` then splits that string again, sorts the entries by chi-square value in descending order, and keeps the top 200 entries per category. `sortByKey()` orders the categories alphabetically.

Finally, we join each *category* with its `word:chi-square` entries, separated by spaces.

In [None]:
def join_values(values):
    return values[0] + " " + " ".join(values[1])

In [None]:
results = categoryWord.map(lambda row: chi_square_calc(row)) \
    .reduceByKey(lambda x, y: x + " " + y) \
    .mapValues(lambda values: sorted(values.split(" "), key=lambda x: float(x.split(":")[1]), reverse=True)[:200]) \
    .sortByKey() \
    .map(lambda x: join_values(x))
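
The per-category selection can be tried out on a small list in plain Python. The sketch below uses hypothetical entries and helper names (`score_of`, `top_k_sorted`, `top_k_heap`); it shows the sort-and-slice approach used above next to `heapq.nlargest`, which is one possible alternative when only a few entries out of many are kept.

```python
import heapq

def score_of(entry):
    """Extract the numeric score from a "word:score" string."""
    return float(entry.rsplit(":", 1)[1])

def top_k_sorted(entries, k):
    # same approach as the pipeline: sort everything, then slice
    return sorted(entries, key=score_of, reverse=True)[:k]

def top_k_heap(entries, k):
    # alternative: avoids a full sort when k is much smaller than len(entries)
    return heapq.nlargest(k, entries, key=score_of)

# hypothetical entries for one category
entries = ["battery:12.5", "cable:3.0", "charger:48.25", "phone:7.75"]
line = "Electronics" + " " + " ".join(top_k_sorted(entries, 2))
```

Both helpers return the same entries in the same order; `line` has the layout of one output row, the category followed by its top entries.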

### Step 3

In this step we build a single line containing all unique words, sorted in ascending order, that appear in the `results` RDD from the previous step. We collect the distinct words into a sorted list and concatenate them into one space-separated string with `join()`.

After that we append this line to the `results` RDD by taking the union with `one_line_rdd`, and write the result to a text file with `.saveAsTextFile()`. Before saving, we delete any existing output at the target path, because `saveAsTextFile()` fails if the path already exists.

In [None]:
words_array = sorted(results.flatMap(lambda x: x.split(" ")[1: ]).map(lambda x: x.split(":")[0] + " ").distinct().collect())

In [None]:
one_line_words = "".join(words_array).strip()

In [None]:
one_line_rdd = sc.parallelize([one_line_words])
output = sc.union([results, one_line_rdd])
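
The construction of the merged word line can be checked on toy data without Spark. The following sketch assumes two hypothetical result lines in the `category word:score ...` layout produced in Step 2.

```python
# hypothetical per-category result lines
result_lines = [
    "Book plot:48.25 story:12.5",
    "Electronics charger:30.0 plot:2.5",
]

words = set()
for line in result_lines:
    for entry in line.split(" ")[1:]:     # skip the leading category name
        words.add(entry.split(":")[0])    # keep the word, drop the score

# all distinct words, sorted ascending, on one space-separated line
dictionary_line = " ".join(sorted(words))
# analogue of the union: category lines first, dictionary line last
output_lines = result_lines + [dictionary_line]
```

The word "plot" appears in both categories but only once in `dictionary_line`.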

In [None]:
pathToSave = "/user/Solution/output_rdd.txt"

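import subprocess

try:
    # remove any previous output; a non-zero exit code (e.g. path does not exist) is ignored
    subprocess.call(["hadoop", "fs", "-rm", "-r", "-skipTrash", pathToSave])
except OSError:
    # the hadoop executable could not be started
    pass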

output.saveAsTextFile(pathToSave)