# Text Processing

In this exercise you will once again build a word count, this time with Spark and RDDs.
Next, you will build a model that generates sentences for you based on a training corpus.

## Checking the Data

In [None]:
! hdfs dfs -ls /dataset/text/

## Init Spark

In [2]:
import findspark
import pyspark

findspark.init()
sc = pyspark.SparkContext(appName="Text Processing")

## Read the holmes.txt
If you want, you can also try out `gutenberg_all.txt`.

In [4]:
holmes_raw = sc.textFile("/dataset/text/holmes.txt")

### Caching

With `rdd.cache()` you can mark an entire RDD to be kept in memory. Note, however, that caching is lazy: you need an action to trigger the computation before anything is actually stored.

You can also see the cached RDDs in the Spark UI (in the Storage tab).

In [None]:
%time holmes_raw.cache()

In [None]:
%time holmes_raw.count()

Do you see the time difference if we do the count again?

In [None]:
%time holmes_raw.count()

## Splitting the Words

In [8]:
holmes_words = holmes_raw.map(lambda line: line.split(" "))

As you can see, we now have a list of lists of strings (one inner list per line of text).

In [None]:
holmes_words.take(2)

With `flatMap` we flatten the result and get a single list of strings back.
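The difference between the two operations can be illustrated without Spark. The following is a minimal plain-Python sketch, assuming two made-up input lines; it only mimics what `map` and `flatMap` do to the shape of the data and is not the PySpark call itself.

```python
from itertools import chain

# Two hypothetical input lines (not taken from holmes.txt)
lines = ["The Project Gutenberg", "EBook of"]

# map-like: one output element per input line, so we get a list of lists
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-line lists are concatenated into one flat list
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

The number of elements changes: the mapped result has one entry per line, the flattened result one entry per word.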

In [10]:
holmes_words = ... your code ....

In [11]:
holmes_words.take(2)

['The', 'Project']

### Cleaning the Words

Try to:
- lowercase the words
- remove empty strings
- remove all special characters such as `.`, `,`, `:`, `[`, `]` etc. (see https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string)

If you need help, consult the PySpark API reference for RDDs: https://spark.apache.org/docs/latest/api/python/reference/index.html
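The three cleaning steps can be tried out on a small list first. This is a plain-Python sketch, assuming a made-up list of tokens; the regular expression is the same one this notebook uses later in the text generator section.

```python
import re

# Hypothetical raw tokens, including an empty string and punctuation
raw_words = ["The", "Project", "", "Gutenberg's", "[EBook]", "Holmes,", "--"]

# 1. lowercase every token
lowered = [w.lower() for w in raw_words]

# 2. drop every character that is neither a word character nor whitespace
stripped = [re.sub(r"[^\w\s]", "", w) for w in lowered]

# 3. remove tokens that are empty (or became empty in step 2)
cleaned = [w for w in stripped if w != ""]

print(cleaned)
```

Filtering empty strings last matters here: a token such as `--` only becomes empty after the punctuation has been removed.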

In [12]:
import re

holmes_words_cleaned = holmes_words \
    .map(lambda word: word.lower()) \
    ... your code ...

In [13]:
holmes_words_cleaned.take(2)

['the', 'project']

Let us cache this RDD as well.

In [14]:
holmes_words_cleaned.cache()
holmes_words_cleaned.count()

107507

## Group Words for Word Count

It is tempting to use `groupBy`, but this is inefficient, as the example below shows: every occurrence of a word is collected into a list, and all of these occurrences have to be moved across the network in a shuffle.

In [15]:
groupByExample = holmes_words_cleaned.groupBy(lambda word: word)

In [16]:
res = groupByExample.take(1)
print("word: :" + res[0][0])
print("values: :" + str(list(res[0][1])))

word: :in
values: :['in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', ...

It is therefore more efficient to use the classic `map` and `reduceByKey` approach, which already sums up the counts inside each partition before the shuffle.
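Why this saves work can be shown with a small simulation. The sketch below is plain Python, not Spark: the two partitions are made up, and the lists named `shuffled_...` merely stand in for the records that would have to cross the network.

```python
from collections import Counter

# Two hypothetical partitions of the cleaned words
partitions = [
    ["in", "the", "in", "a"],
    ["the", "in", "in"],
]

# groupBy-style: every single occurrence is sent to the shuffle
shuffled_group_by = [(w, w) for part in partitions for w in part]

# reduceByKey-style: count inside each partition first ...
partials = [Counter(part) for part in partitions]
# ... so only one (word, partial_count) record per word and partition is sent
shuffled_reduce_by_key = [(w, c) for partial in partials for w, c in partial.items()]

# Merge the partial counts into the final word count
word_count_sketch = Counter()
for w, c in shuffled_reduce_by_key:
    word_count_sketch[w] += c

print(len(shuffled_group_by), len(shuffled_reduce_by_key))
print(dict(word_count_sketch))
```

With frequent words such as "in", the gap between the two record counts grows with the size of the corpus.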

In [17]:
word_count = holmes_words_cleaned.... your code ....

In [18]:
word_count.take(2)

[('project', 89), ('gutenberg', 32)]

Let us sort the output. Have a look at the `sortByKey` API Reference https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.sortByKey.html?highlight=sortbykey

- What is the key at the moment? How can you swap the keys and the values?
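The idea behind the hint can be sketched on a plain Python list. The pairs below are toy values, not real counts from the corpus; in Spark the sorting step would roughly correspond to `sortByKey(ascending=False)` on the swapped pairs.

```python
# Hypothetical (word, count) pairs; the word is the key
toy_word_count = [("a", 3), ("b", 7), ("c", 1)]

# Swap each pair so that the count becomes the key
swapped = [(count, word) for word, count in toy_word_count]

# Sort by the new key, largest count first
sorted_desc = sorted(swapped, key=lambda pair: pair[0], reverse=True)

print(sorted_desc)
```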

In [40]:
sorted_word_count = word_count.... your code ....

In [None]:
sorted_word_count.take(4)

## Text Generator (advanced)

We will try to implement a text generator. Here is how:

- First, you load all words from holmes.txt into an RDD.
- Next, we compute [bigrams](https://en.wikipedia.org/wiki/Bigram). For example, the text "hello world how are you" would produce

```
("hello", "world"), ("world", "how"), ("how", "are"), ("are", "you")
```

- With the bigrams, we can count, for a given seed word, which words follow it and how often. E.g. in this very tiny example the probability that the word "world" comes after "hello" is 100%.
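The tiny example above can be reproduced in a few lines of plain Python. This is only a sketch of the idea on a local list; the names `followers` and `p_world_after_hello` are made up for illustration.

```python
from collections import Counter, defaultdict

text = "hello world how are you"
words_example = text.split(" ")

# Pair every word with its successor
bigrams_example = list(zip(words_example, words_example[1:]))

# For every word, count which words follow it
followers = defaultdict(Counter)
for current, nxt in bigrams_example:
    followers[current][nxt] += 1

# Relative frequency of "world" among all words that follow "hello"
total = sum(followers["hello"].values())
p_world_after_hello = followers["hello"]["world"] / total

print(bigrams_example)
print(p_world_after_hello)
```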

In [22]:
import re

words = sc.textFile("/dataset/text/holmes.txt") \
    .flatMap(lambda word: word.split(" ")) \
    .map(lambda word: word.lower()) \
    .map(lambda word: re.sub(r'[^\w\s]', '', word)) \
    .filter(lambda word: word != '')

In [23]:
words.cache()
words.count()

107507

In [24]:
words.take(2)

['the', 'project']

### Compute the Bigrams 

We will use `zipWithIndex` to give every word its position, so that we can afterwards join each word with its successor on that position.
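The join-on-index idea can be sketched with dictionaries. This is plain Python on a made-up four-word list, with dictionaries standing in for the keyed RDDs; it mirrors the shape of the outputs shown below but is not the Spark code itself.

```python
# Hypothetical short word list
sample_words = ["the", "project", "gutenberg", "ebook"]

# Key every word by its position
first_sketch = {index: word for index, word in enumerate(sample_words)}

# Same words, but every key shifted down by one
second_sketch = {index - 1: word for index, word in enumerate(sample_words)}

# Inner join: keep only the keys that exist on both sides
common_keys = first_sketch.keys() & second_sketch.keys()
bigram_sketch = {key: (first_sketch[key], second_sketch[key]) for key in common_keys}

print(sorted(bigram_sketch.items()))
```

The keys `-1` and `3` have no partner on the other side, so the join yields one pair fewer than there are words.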

In [25]:
first = words.zipWithIndex().map(lambda word_index: (word_index[1], word_index[0]))
second = words.... your code ....

In [26]:
print(first.take(2))
print(second.take(2))

[(0, 'the'), (1, 'project')]
[(-1, 'the'), (0, 'project')]


In [27]:
print(first.count())
print(second.count())

107507
107507


In [28]:
# inner join first and second

bigram = ... your code ...

In [29]:
print(bigram.count())
print(bigram.take(2))

107506
[(0, ('the', 'project')), (4, ('of', 'the'))]


### Define a method for computing the probabilities of following words

In [43]:
def get_probabilities(words):
    words = list(words)
    word_counts = [[x,words.count(x)] for x in set(words)]
    ... your code ...


In [44]:
get_probabilities(["hallo", "hallo", "welt"])

[('hallo', 0.6666666666666666), ('welt', 0.3333333333333333)]
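One possible way to arrive at such a result is sketched below. It is an assumption about how the function could be written, not the notebook's reference solution: it uses `collections.Counter` instead of the `words.count(x)` scaffold above, and the name `get_probabilities_sketch` is hypothetical.

```python
from collections import Counter

def get_probabilities_sketch(words):
    """Return (word, relative frequency) pairs, most frequent first."""
    words = list(words)          # the input may be an iterable without len()
    total = len(words)
    counts = Counter(words)
    return [(word, count / total) for word, count in counts.most_common()]

print(get_probabilities_sketch(["hallo", "hallo", "welt"]))
```

`Counter` makes a single pass over the list, whereas calling `words.count(x)` for every distinct word scans the list once per word.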

In [46]:
# groupByKey is exactly what we want
bigram.map(lambda x: x[1]).groupByKey().take(1)

[('of', <pyspark.resultiterable.ResultIterable at 0x7fe022780460>)]

In [47]:
word_probabilities = bigram.map(lambda x: x[1]).groupByKey().mapValues(get_probabilities)

In [48]:
model = word_probabilities.collectAsMap()

In [51]:
model["eager"]

[('face', 0.3333333333333333),
 ('eyes', 0.16666666666666666),
 ('nature', 0.16666666666666666),
 ('and', 0.16666666666666666),
 ('enough', 0.16666666666666666)]

### Naive approach, always take the next word with the highest probability

In [54]:
def generate_sentences_naive(model, seed_word = "woman", num_words = 30):
    ... your code ...

In [55]:
generate_sentences_naive(model)

woman i have been a little more than the door and i have been a little more than the door and i have been a little more than the door 
--------------------------------------------------
end of generating sentences
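The repetition in the output above is a direct consequence of the greedy rule. The sketch below shows this on a hand-made toy model; both `toy_model` and `generate_greedy` are hypothetical names, and the model only assumes the same shape as `model` (a dict mapping a word to a list of `(next_word, probability)` pairs).

```python
def generate_greedy(model, seed_word, num_words):
    """Always follow the most probable next word."""
    sentence = [seed_word]
    current = seed_word
    for _ in range(num_words):
        candidates = model.get(current)
        if not candidates:       # unknown word or no followers: stop early
            break
        current = max(candidates, key=lambda pair: pair[1])[0]
        sentence.append(current)
    return " ".join(sentence)

# Hypothetical toy model, not derived from holmes.txt
toy_model = {
    "i": [("have", 0.6), ("am", 0.4)],
    "have": [("been", 0.7), ("a", 0.3)],
    "been": [("i", 0.6), ("there", 0.4)],
}

greedy_sentence = generate_greedy(toy_model, "i", 6)
print(greedy_sentence)
```

Because the choice is deterministic, the generator is stuck in the same loop as soon as it reaches a word it has already visited.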


### Better Approach, take the next word based on the probabilities we have computed

I have used the `choice` function from `numpy.random` (see https://stackoverflow.com/questions/3679694/a-weighted-version-of-random-choice).

In [58]:
from numpy.random import choice

def generate_sentences(model, seed_word = "love", num_words = 50):
    ... your code ...

In [59]:
generate_sentences(model)

love matter lady my eye on different matter so sorely and down between the barred was face seared with only bring up wellington street sherlock holmes i oh the tip had still of iron piping not to me introduce a cellar which could at climbing down from if possible that 
--------------------------------------------------
end of generating sentences
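Weighted sampling of the next word can be sketched as follows. Note that this sketch deliberately swaps in `random.choices` from the Python standard library instead of `numpy.random.choice`, so that it runs without NumPy; `toy_model_sampled` and `generate_sampled` are hypothetical names, and the toy model is made up.

```python
import random

def generate_sampled(model, seed_word, num_words, rng):
    """Draw each next word at random, weighted by its probability."""
    sentence = [seed_word]
    current = seed_word
    for _ in range(num_words):
        candidates = model.get(current)
        if not candidates:       # unknown word or no followers: stop early
            break
        next_words = [word for word, _prob in candidates]
        weights = [prob for _word, prob in candidates]
        current = rng.choices(next_words, weights=weights, k=1)[0]
        sentence.append(current)
    return sentence

# Hypothetical toy model in the same shape as `model`
toy_model_sampled = {
    "i": [("have", 0.6), ("am", 0.4)],
    "have": [("been", 0.7), ("i", 0.3)],
    "am": [("i", 1.0)],
    "been": [("i", 1.0)],
}

sampled = generate_sampled(toy_model_sampled, "i", 10, random.Random(0))
print(" ".join(sampled))
```

Passing a seeded `random.Random` instance makes a run repeatable; with a different seed the generated sentence will generally differ, which is what breaks the loop seen in the naive approach.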


## Stop the Spark Context

In [60]:
sc.stop()