# Lab session: PySpark
In this lab session we will cover PySpark, the Python API for Apache Spark, whose core operations build on the MapReduce paradigm. As in previous sessions, fill in the CODE_HERE placeholders. Check your work by running the included assert statements.

The documentation for PySpark is located at https://spark.apache.org/docs/latest/api/python/pyspark

#### First, the necessary packages are initialised
You do not need to code the following three blocks.

In [2]:
import json

In [3]:
import re
from collections import Counter
#REMARK: if you installed Spark yourself on Windows (following the tutorial) you might need the next two lines; otherwise you can comment them out
import findspark
findspark.init()
import numpy as np
import pyspark

In [4]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = pyspark.SparkConf().setAppName('appName').setMaster('local')
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession(sc)
from pyspark.sql import functions as F

### Exercise 1: Wordcount
For the first exercise we will perform a word count on a sample file.

First, the text file located at "data/wordcount.txt" needs to be converted to an RDD and then split into individual words.
Hint: use sc.textFile and flatMap
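
As a rough illustration of why flatMap rather than map is suggested here, the following plain-Python sketch (no Spark involved; the list `lines` is made-up data) contrasts the two shapes of output: a map-like step yields one list per line, while a flatMap-like step flattens those lists into a single sequence of words.

```python
from itertools import chain

# Made-up input standing in for the lines of a text file.
lines = ["to be or", "not to be"]

# map-like: one list of words per line, so the result is nested.
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-line lists are flattened into one sequence of words.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```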

In [None]:
words = CODE_HERE

Now count how often each word occurs. Hint: use map and reduceByKey
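
To get a feel for what reduceByKey does with (key, value) pairs, here is a minimal plain-Python sketch. The helper `reduce_by_key` is hypothetical and only mimics the idea on a local list; it is not part of PySpark.

```python
def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for the idea behind reduceByKey."""
    result = {}
    for key, value in pairs:
        # Combine the new value with the running result for this key.
        result[key] = func(result[key], value) if key in result else value
    return list(result.items())

# Made-up (word, 1) pairs, as a map step might produce them.
pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
counts = reduce_by_key(pairs, lambda a, b: a + b)

print(sorted(counts))
```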

In [None]:
wordCounts=CODE_HERE

Finally, save the output in a human-readable format. You can check the output of the word count by opening "data/output/part-00000". Note: if you re-run the following block, you will get an error because the output files already exist. If you want to re-run this part of the code, delete the output files first.
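
If you prefer to delete the output from Python rather than by hand, a small helper along these lines could work. The name `remove_output_dir` is an assumption made for this sketch, and it deletes the whole directory, so double-check the path before calling it.

```python
import os
import shutil

def remove_output_dir(path):
    """Delete the directory at `path` if it exists; return True if it was removed."""
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False

# Example (commented out on purpose): clear the previous output before saving again.
# remove_output_dir("data/output/")
```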

In [None]:
wordCounts.saveAsTextFile("data/output/")

## Exercise 2: Ecommerce data
In this integrated exercise, we will process ecommerce data.
### Let's load the ecommerce data and transform it into an RDD
#### the file contains JSON objects of the form

```
{
  "": "9",
  "Clothing ID": "1077",
  "Age": "34",
  "Title": "Such a fun dress!",
  "Review Text": "I'm 5\"5' and 125 lbs. i ordered the s petite to make sure the length wasn't too long. i typically wear an xs regular in retailer dresses. if you're less busty (34b cup or smaller), a s petite will fit you perfectly (snug, but not tight). i love that i could dress it up for a party, or down for work. i love that the tulle is longer then the fabric underneath.",
  "Rating": "5",
  "Recommended IND": "1",
  "Positive Feedback Count": "0",
  "Division Name": "General",
  "Department Name": "Dresses",
  "Class Name": "Dresses"
}
```

In [5]:
ecommerce_data = sc.textFile("./data/ecommerce.json")

### First convert each line (which is a string) to JSON using json.loads

#### remember: Spark uses lazy evaluation, so applying map to an RDD does not run the function on the data until you call an action such as collect(), take(), reduce(), count() or saveAsTextFile(), which triggers the execution
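
Python's built-in map is lazy as well, which makes it a convenient local analogy. The sketch below uses no Spark at all; the function `parse` and the list `calls` are made up to show that nothing runs until the result is consumed.

```python
calls = []

def parse(x):
    calls.append(x)   # record that the function actually ran
    return x * 2

data = [1, 2, 3]

# Building the pipeline runs nothing yet, much like a transformation on an RDD.
lazy = map(parse, data)
calls_before = list(calls)

# Consuming the pipeline triggers the work, much like an action such as collect().
result = list(lazy)

print(calls_before)  # []
print(result)        # [2, 4, 6]
```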

Hint: use json.loads to transform a string into a JSON object <br>
Hint: also put the RDD in memory using persist
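
As a small local example of what json.loads does to a single line, consider the shortened, made-up record below. Note that all values arrive as strings, so numeric fields need an explicit conversion.

```python
import json

# A shortened, made-up record in the same shape as the lines of the file.
line = '{"": "9", "Clothing ID": "1077", "Age": "34", "Rating": "5"}'

record = json.loads(line)   # str -> dict
age = int(record["Age"])    # values are strings, so convert before doing arithmetic

print(record[""], age)
```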

In [None]:
ecommerce_json = ecommerce_data.map(CODE_HERE).CODE_HERE

### Let's first try to determine the average age of the reviewers on this e-commerce website

#### First construct an RDD where we extract the Age field of each JSON record and transform it to int

In [None]:
age_rdd = ecommerce_json.map(lambda js: CODE_HERE)

#### Let's sum all ages so that we can later divide by the total number of records to get the average

In [None]:
age_sum = age_rdd.reduce(lambda x, y: CODE_HERE)

#### Now we still need to figure out how many records are in the RDD; use the count() function

In [None]:
n_records = age_rdd.CODE_HERE

#### compute the average now using the sum and the number of records
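
The sum-then-divide pattern can be tried locally with functools.reduce, which folds a list pairwise in a way comparable to reduce on an RDD. The ages below are made up and unrelated to the data set.

```python
from functools import reduce

ages = [34, 50, 41, 27]                    # made-up ages
total = reduce(lambda x, y: x + y, ages)   # pairwise fold over the values
n = len(ages)                              # plays the role of count()
average = total / n

print(total, n, average)
```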

In [None]:
avg_age = CODE_HERE

In [None]:
assert np.isclose(avg_age,43.1985438)

### We will now do the same, but compute the average number of words in the Review Text
#### you can simply use the split function with the space character to split the text into words
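
One detail worth knowing: splitting on the space character and splitting on arbitrary whitespace can give different word counts. The sketch below uses a made-up string with a double space. It is an assumption that the expected value in the assert was computed with the space-character form, so results may differ if you call split() without arguments.

```python
text = "love  this dress"        # made-up text, note the double space

by_space = text.split(" ")       # keeps an empty string between the two spaces
by_whitespace = text.split()     # collapses runs of whitespace

print(by_space, len(by_space))
print(by_whitespace, len(by_whitespace))
```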

In [None]:
nwords_sum = ecommerce_json.map(lambda js: len(CODE_HERE)).reduce(CODE_HERE)
avg_nwords = CODE_HERE

In [None]:
assert np.isclose(avg_nwords, 58.0843907)

### Next we will count the number of reviews per rating (i.e. 1, 2, 3, 4 and 5)

#### Similar to the word count exercise, build an RDD of tuples with the rating and a count of 1

In [None]:
rating_counts = ecommerce_json.map(CODE_HERE).CODE_HERE.collect()

#### sort the resulting tuples (after collecting the results) from high to low

In [None]:
sorted_rating_counts = sorted(rating_counts, key = CODE_HERE, reverse = True)

In [None]:
assert sorted_rating_counts[0] == ('5', 13131)

### A second way to achieve the same thing is to first group all records with the same key together and then count how many records each group contains
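
The group-then-count idea can be sketched in plain Python as follows. The helper `group_by_key` is hypothetical and only imitates the grouping step on a local list of made-up pairs.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Hypothetical local stand-in: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())

# Made-up (rating, None) pairs; the value is irrelevant, only its presence counts.
pairs = [("5", None), ("4", None), ("5", None), ("1", None), ("5", None)]
grouped = group_by_key(pairs)

# Counting the grouped values gives the number of records per key.
counts = [(key, len(values)) for key, values in grouped]

print(sorted(counts))
```

Compared with the reduceByKey approach, grouping first materialises every value per key before counting, which is usually the more expensive route on large data.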

In [None]:
rating_counts = ecommerce_json.map(lambda js: (CODE_HERE, None)).CODE_HERE.map(CODE_HERE)

In [None]:
sorted_rating_counts = sorted(CODE_HERE.collect(), CODE_HERE, reverse = True)

In [None]:
assert sorted_rating_counts[0] == ('5', 13131)

### Next we will count the number of reviews per Department Name

In [None]:
category_counts = ecommerce_json.map(CODE_HERE).CODE_HERE

#### let's filter out all the Department names with fewer than 1000 reviews

In [None]:
category_counts_filtered = category_counts.filter(CODE_HERE).collect()

In [None]:
assert len(category_counts_filtered) == 5

#### select the Department with most reviews

In [None]:
largest_category = sorted(CODE_HERE)[0][0]

In [None]:
assert largest_category == "Tops"

#### slightly more complex, let's now count the number of reviews per Department Name and Rating
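
Counting per combination of two fields works the same way as counting per single field, as long as the key is a tuple. A local sketch with collections.Counter and made-up records:

```python
from collections import Counter

# Made-up records with only the two fields of interest.
records = [
    {"Department Name": "Tops", "Rating": "5"},
    {"Department Name": "Tops", "Rating": "4"},
    {"Department Name": "Jackets", "Rating": "5"},
    {"Department Name": "Tops", "Rating": "5"},
]

# A tuple of both fields serves as a composite key.
pair_counts = Counter((r["Department Name"], r["Rating"]) for r in records)

# Selecting one department means testing the first element of the key.
tops_only = {key: cnt for key, cnt in pair_counts.items() if key[0] == "Tops"}

print(pair_counts)
print(tops_only)
```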

In [None]:
category_rating_counts = (ecommerce_json
                          .map(lambda js: (CODE_HERE,1))
                          .CODE_HERE
                         )

#### only keep the counts for the Jackets department and sort from high to low number of reviews

In [None]:
jackets = category_rating_counts.CODE_HERE.collect()

In [None]:
sorted_jackets = CODE_HERE

In [None]:
assert sorted_jackets[0] == (("Jackets",'5'), 631)

### Create a list of all the ages per Clothing ID

In [None]:
id2agelist = (ecommerce_json
              .CODE_HERE #map
              .groupByKey()
              .CODE_HERE #map
             )

#### keep only the Clothing IDs that have more than 500 reviews and compute both the average and standard deviation of the age per Clothing ID

#### The output should be tuples of the following form

```
('Clothing ID', {"avg": 43.4564, "std": 12.14566})
```

Hint: use np.array and np.std and np.mean
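
Keep in mind that np.std uses ddof=0 by default, i.e. it computes the population standard deviation. The sketch below uses only the standard library on made-up numbers to show the difference between the population and the sample variant; statistics.pstdev corresponds to the np.std default.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]               # made-up numbers

avg = statistics.mean(values)                   # same quantity as np.mean
std_population = statistics.pstdev(values)      # divides by n, like np.std by default
std_sample = statistics.stdev(values)           # divides by n - 1, like np.std(..., ddof=1)

print(avg, std_population, std_sample)
```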

In [None]:
def compute_summary_statistics(tpl):
    clothing_id, age_list = tpl  # named clothing_id to avoid shadowing the built-in id()
    age_array = np.array(age_list)
    return (clothing_id, {"avg": CODE_HERE, "std": CODE_HERE })

stats = id2agelist.filter(CODE_HERE).map(CODE_HERE)

#### sort the results according to the average age from high to low

In [None]:
sorted_stats = CODE_HERE

In [None]:
assert sorted_stats[0] == ('829', {'avg': 44.64136622390892, 'std': 12.447020037376479})

### Next we will compute the term frequency - inverse document frequency of all the words in the reviews

https://en.wikipedia.org/wiki/Tf%E2%80%93idf <br>

#### First we will compute the document frequencies, i.e. count in how many reviews each word occurs (use a regular expression to keep only real words, without punctuation or numbers)

The output of the document_frequencies RDD should be tuples of the form

```
('word', document_count)
```
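
To illustrate how a regular expression plus a set yields each word at most once per document, here is a local sketch. The pattern `[A-Za-z]+` is only one possible choice and an assumption of this example; the pattern the asserts expect may differ.

```python
import re

# One possible pattern (an assumption): runs of ASCII letters only.
word_regex = re.compile(r"[A-Za-z]+")

text = "Love it! Love the fit, size 4 runs small."   # made-up review text

# findall returns every match; the set removes repeats within this document.
unique_words = set(word_regex.findall(text))
lowered = sorted(word.lower() for word in unique_words)

print(lowered)
```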



In [None]:
#define the regular expression to extract only words with at least one character
regex = CODE_HERE

In [None]:
def document_terms(js):
    words = set(regex.findall(CODE_HERE))
    for word in words:
        yield CODE_HERE #also lower case the words
        
#use flatMap and reduceByKey
document_frequencies = ecommerce_json.CODE_HERE.CODE_HERE

In [None]:
assert document_frequencies.filter(lambda x: x[0]=="wonderful").collect()[0][1] == 290

#### Now we will compute the inverse document frequency using the formula np.log(number of documents/document count)

Hint: we computed the number of documents earlier and stored it in the variable n_records
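
A quick worked example of the formula on made-up counts shows the intended behaviour: the rarer the word, the larger its inverse document frequency. math.log is the natural logarithm, as is np.log.

```python
import math

n_documents = 1000                                            # made-up corpus size
document_counts = {"and": 900, "dress": 100, "narrowing": 1}  # made-up counts

# idf = log(number of documents / document count)
idf = {word: math.log(n_documents / cnt) for word, cnt in document_counts.items()}

for word, value in idf.items():
    print(word, round(value, 4))
```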

In [None]:
inverse_document_frequencies = document_frequencies.CODE_HERE #map
print(inverse_document_frequencies.take(2))

In [None]:
assert sorted(inverse_document_frequencies.collect(), key = lambda x: x[1], reverse=True)[0][0] == "narrowing"

### We will now compute the term frequencies per document

The document_id can be found under the empty key, i.e. js[""]

#### The output should be an RDD of tuples of the form

```
('word', ('document_id', count))
```

word has to be the first element in the tuple as we will use this as a key to join on at some point
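
Counting (record_id, word) pairs with a Counter and then reshaping them so that the word comes first can be tried locally on made-up input:

```python
from collections import Counter

record_id = "1"                                   # made-up document id
words = ["it", "fits", "and", "it", "flatters"]   # made-up, already lower-cased

# Count each (record_id, word) pair within this one document.
word_count = Counter()
for word in words:
    word_count[(record_id, word)] += 1

# Reshape so that the word is the first element and can serve as a join key.
pairs = [(word, (rid, cnt)) for ((rid, word), cnt) in word_count.items()]

print(pairs)
```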

In [None]:
def document_term_frequencies(js):
    record_id = js[""]
    words = CODE_HERE #use regex to extract all words
    word_count = Counter()
    CODE_HERE #for loop to count record_id-word pairs in document, lower case the words
    return [CODE_HERE for ((record_id, word), cnt) in word_count.items()]

term_frequencies = CODE_HERE #use flatMap
print(term_frequencies.take(2))

In [None]:
assert term_frequencies.filter(lambda x: x[0]=="it" and x[1][0]=='1').collect()[0][1][1] == 4

### Now we need to join the inverse document frequencies with the term frequencies per document

Hint: term frequencies should be stored in the term_frequencies RDD and inverse document frequencies in the inverse_document_frequencies RDD

#### the output of the join will be an RDD with tuples of the form

```
('word', (idf, ('document_id', word count/term frequency) ) )
```
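
The shape of a joined record, (key, (left_value, right_value)), can be reproduced locally. The helper `inner_join` below is hypothetical and works on plain lists; unlike this sketch, Spark gives no guarantee about the order of the joined records.

```python
def inner_join(left, right):
    """Hypothetical local inner join on the first element of each pair."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    # Keys that occur on only one side are dropped.
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

# Made-up inputs: (word, idf) on the left, (word, (document_id, count)) on the right.
idf = [("comfortable", 2.0), ("silky", 5.0)]
tf = [("comfortable", ("0", 1)), ("comfortable", ("7", 2)), ("dress", ("7", 1))]

joined = inner_join(idf, tf)
print(joined)
```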

In [None]:
tf_idf_joined = CODE_HERE

In [None]:
assert tf_idf_joined.filter(lambda x: x[0]=='comfortable' and x[1][1][0]=="0").collect()[0] == ('comfortable', (2.073244314833701, ('0', 1)))

### Now let's multiply the term frequency/word count of each document by the inverse document frequency of the word

#### and now make the keys (i.e. first element of the tuples) of the records in the resulting RDD the id of the document so that we can group all words that belong to the same document together again

The output of this RDD should look like:

```
('document_id', ('word', tf*idf))
```
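
Re-keying nested tuples is mostly a matter of unpacking them carefully. A local sketch on made-up joined records, using nested tuple unpacking to move the document id to the front:

```python
# Made-up joined records of the form (word, (idf, (document_id, count))).
joined = [
    ("comfortable", (2.0, ("0", 1))),
    ("comfortable", (2.0, ("7", 2))),
]

# Unpack the nested tuple, multiply count by idf, and put the document id first.
rekeyed = [
    (doc_id, (word, cnt * idf))
    for (word, (idf, (doc_id, cnt))) in joined
]

print(rekeyed)
```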

In [None]:
tf_idf = tf_idf_joined.map(lambda x: CODE_HERE)

In [None]:
assert np.isclose(tf_idf.filter(lambda x: x[0]=='0' and x[1][0]=="comfortable").collect()[0][1][1], 2.07324431)

### Finally we need to group all words that belong to the same document

#### The output should be a tuple of the form ('document_id', words_dict) where words_dict is a dict of the following format

```
{
    "word": tf_idf_word,
    "word2": tf_idf_word2,
    ...
}
```
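
Collecting (word, score) pairs into one dictionary per document can be sketched locally like this, again on made-up values:

```python
# Made-up records of the form (document_id, (word, tf_idf)).
pairs = [
    ("0", ("comfortable", 2.0)),
    ("0", ("silky", 5.0)),
    ("7", ("comfortable", 4.0)),
]

# Group by document id and build a word -> score dictionary for each document.
per_document = {}
for doc_id, (word, score) in pairs:
    per_document.setdefault(doc_id, {})[word] = score

print(per_document)
```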

In [None]:
def review_tf_idf(tpl):
    document_id, tpls = tpl
    words = CODE_HERE #build a dict mapping word to tf_idf
    CODE_HERE #return tuple with document_id and words dictionary

per_review = CODE_HERE #use groupByKey and map

In [None]:
expected = {'comfortable': 2.073244314833701, 'silky': 5.235846040622725, 'wonderful': 4.394278854944507, 'absolutely': 3.3820511804752176, 'sexy': 4.717052247207557, 'and': 0.36522226581530454}
assert per_review.filter(lambda x: x[0] == '0').collect()[0][1] == expected