# Challenge: Transforming Hamlet into a Data Set

## Introduction

In the previous two missions, we covered the basics of PySpark, the MapReduce paradigm, transformations and actions, and how to do basic data cleanup in PySpark. In this challenge, you'll use the techniques you've learned to transform the text of Hamlet into a format that's more useful for data analysis.

**Resources**
* [PySpark's documentation for the RDD data structure](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD)
* [Visual representation of methods](http://nbviewer.ipython.org/github/jkthompson/pyspark-pictures/blob/master/pyspark-pictures.ipynb) (IPython Notebook format)
* [Visual representation of methods](http://training.databricks.com/visualapi.pdf) (PDF format)

## Extract Line Numbers

The first value in each element (or line from the play) is a line number that identifies the line of the play the text is from. It appears in the following format:

```python
'hamlet@0',
'hamlet@8',
'hamlet@9',
...
```

We don't need the `hamlet@` prefix at the beginning of these IDs for our data analysis, so let's extract just the numeric part of the ID from each line.

* Transform the RDD `split_hamlet` into a new RDD `hamlet_with_ids` that contains the clean version of the line ID for each element.
  * For example, we want to transform `hamlet@0` to `0`, and leave the rest of the values in that element untouched.
    * Recall that `map()` applies a function to each element in the RDD. Here, each element is a list that we can index and slice like any other Python list.

In [1]:
import pyspark

In [2]:
sc = pyspark.SparkContext()

In [3]:
raw_hamlet = sc.textFile("data/hamlet.txt")
split_hamlet = raw_hamlet.map(lambda line: line.split('\t'))
split_hamlet.take(5)

[['hamlet@0', '', 'HAMLET'],
 ['hamlet@8'],
 ['hamlet@9'],
 ['hamlet@10', '', 'DRAMATIS PERSONAE'],
 ['hamlet@29']]

In [6]:
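def format_id(x):
    # Keep only the part of the ID after '@', e.g. 'hamlet@0' -> '0'.
    # (Named line_id to avoid shadowing the built-in id().)
    line_id = x[0].split('@')[1]
    # Return the cleaned ID followed by the remaining values, unchanged.
    return [line_id] + x[1:]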

hamlet_with_ids = split_hamlet.map(format_id)
hamlet_with_ids.take(5)

[['0', '', 'HAMLET'], ['8'], ['9'], ['10', '', 'DRAMATIS PERSONAE'], ['29']]
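The IDs in the output above are still strings. If integer IDs are preferred, one option is to cast while extracting. The following is a minimal sketch in plain Python on a few elements copied from the `split_hamlet` preview; `format_id_as_int` is a hypothetical name, not part of the challenge, and with Spark it would be passed to `split_hamlet.map(...)` instead of a list comprehension.

```python
# Sample elements copied from the split_hamlet preview above.
sample = [['hamlet@0', '', 'HAMLET'], ['hamlet@8'], ['hamlet@10', '', 'DRAMATIS PERSONAE']]

def format_id_as_int(x):
    # Hypothetical variant of format_id: keep the part after '@' and cast it to int.
    line_id = int(x[0].split('@')[1])
    return [line_id] + x[1:]

# Plain-Python stand-in for split_hamlet.map(format_id_as_int).
converted = [format_id_as_int(x) for x in sample]
print(converted)
```

The rest of this challenge keeps the IDs as strings, as shown in the outputs.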

## Remove Blank Values

Next, we want to get rid of elements that don't contain any actual words (those that contain only an ID). These typically represent blank lines between paragraphs or sections in the play. We also want to remove any blank values (`''`) within elements, since they carry no useful information for our analysis.

* Clean up the RDD and store the result as a new RDD `hamlet_text_only`.

In [19]:
real_text = hamlet_with_ids.filter(lambda line: len(line) > 1)
hamlet_text_only = real_text.map(lambda line: [value for value in line if value != ''])
hamlet_text_only.take(10)

[['0', 'HAMLET'],
 ['10', 'DRAMATIS PERSONAE'],
 ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
 ['75', 'HAMLET', 'son to the late, and nephew to the present king.'],
 ['132', 'POLONIUS', 'lord chamberlain. (LORD POLONIUS:)'],
 ['177', 'HORATIO', 'friend to Hamlet.'],
 ['204', 'LAERTES', 'son to Polonius.'],
 ['230', 'LUCIANUS', 'nephew to the king.'],
 ['261', 'VOLTIMAND', '|'],
 ['273', '|']]
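The solution above filters by length first and removes blanks second. That works for the elements shown in the preview, but an element holding an ID plus only blank values would pass the filter and end up as an ID-only element. The sketch below reverses the order on a small hand-written sample; the last sample element is a hypothetical edge case (it does not appear in the preview), and `drop_blanks` is an illustrative helper name.

```python
# Sample elements; the last one is a hypothetical edge case (an ID plus a lone blank value).
sample = [['0', '', 'HAMLET'], ['8'], ['10', '', 'DRAMATIS PERSONAE'], ['5', '']]

def drop_blanks(line):
    # Remove empty-string values from a single element.
    return [value for value in line if value != '']

# Plain-Python stand-ins for rdd.map(drop_blanks).filter(lambda line: len(line) > 1).
without_blanks = [drop_blanks(line) for line in sample]
text_only = [line for line in without_blanks if len(line) > 1]
print(text_only)
```

With an RDD, the same idea would be expressed as a `map()` followed by a `filter()`, assuming the data does contain such elements.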

## Remove Pipe Characters

If you've been using `take()` to preview the RDD after each task, you may have noticed there are some pipe characters (`|`) in odd places that add no value for us. The pipe character may appear as a standalone value in an element, or as part of an otherwise useful string value.

* Remove any list items that contain only the pipe character (`|`), and replace any pipe characters that appear within strings with an empty string.
  * Assign the resulting RDD to `clean_hamlet`.

In [20]:
hamlet_text_only.take(10)

[['0', 'HAMLET'],
 ['10', 'DRAMATIS PERSONAE'],
 ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
 ['75', 'HAMLET', 'son to the late, and nephew to the present king.'],
 ['132', 'POLONIUS', 'lord chamberlain. (LORD POLONIUS:)'],
 ['177', 'HORATIO', 'friend to Hamlet.'],
 ['204', 'LAERTES', 'son to Polonius.'],
 ['230', 'LUCIANUS', 'nephew to the king.'],
 ['261', 'VOLTIMAND', '|'],
 ['273', '|']]

In [55]:
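def replace_pipeline(x):
    # Drop items that are exactly "|" and strip pipes embedded in other strings.
    results = []

    for term in x:
        if term == "|":
            # Standalone pipe: skip it entirely.
            continue
        # replace() is a no-op when the string contains no pipe.
        results.append(term.replace("|", ""))

    return results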

In [56]:
clean_hamlet = hamlet_text_only.map(replace_pipeline)
clean_hamlet.take(20)

[['0', 'HAMLET'],
 ['10', 'DRAMATIS PERSONAE'],
 ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
 ['75', 'HAMLET', 'son to the late, and nephew to the present king.'],
 ['132', 'POLONIUS', 'lord chamberlain. (LORD POLONIUS:)'],
 ['177', 'HORATIO', 'friend to Hamlet.'],
 ['204', 'LAERTES', 'son to Polonius.'],
 ['230', 'LUCIANUS', 'nephew to the king.'],
 ['261', 'VOLTIMAND'],
 ['273'],
 ['276', 'CORNELIUS'],
 ['288'],
 ['291', 'ROSENCRANTZ', '  courtiers.'],
 ['317'],
 ['320', 'GUILDENSTERN'],
 ['335'],
 ['338', 'OSRIC'],
 ['348', 'A Gentleman, (Gentlemen:)'],
 ['376', 'A Priest. (First Priest:)'],
 ['405', 'MARCELLUS']]
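After the pipes are removed, some elements in the output above (for example `['273']` and `['288']`) contain only an ID. The task does not ask for these to be dropped, but if they are unwanted, one option is to reapply the same length check used earlier. This is a plain-Python sketch on elements copied from the `clean_hamlet` preview; `with_text` is a hypothetical name.

```python
# Elements copied from the clean_hamlet preview above.
sample = [['261', 'VOLTIMAND'], ['273'], ['276', 'CORNELIUS'], ['288']]

# Plain-Python stand-in for clean_hamlet.filter(lambda line: len(line) > 1):
# keep only elements that still have at least one value besides the ID.
with_text = [line for line in sample if len(line) > 1]
print(with_text)
```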

In [57]:
sc.stop()