In a previous file, we touched briefly on transformations and actions, and how these two kinds of methods affect the execution of code. In this file, we'll dive deeper into how those mechanisms work, and explore a wider range of the functions built into [the Spark core](http://spark.apache.org/docs/latest/api/python/pyspark.html).

The file **hamlet.txt** contains the entire text of [Shakespeare's play Hamlet](https://en.wikipedia.org/wiki/Hamlet). Shakespeare is well known for his unique writing style and is arguably one of the most influential writers in history. *Hamlet* is one of his most popular plays.

Let's perform some text analysis on it. The file is in pure text format, though, and not ready for analysis. Before we can proceed, we'll have to clean up and reformat the data.

In [1]:
import findspark
findspark.init()
findspark.find()

'C:\\spark-2.4.5-bin-hadoop2.7'

In [2]:
import pyspark

sc = pyspark.SparkContext()

In [4]:
# Read the text file into an RDD

raw_hamlet = sc.textFile("hamlet.txt")
raw_hamlet.take(5)

['hamlet@0\t\tHAMLET',
 'hamlet@8',
 'hamlet@9',
 'hamlet@10\t\tDRAMATIS PERSONAE',
 'hamlet@29']

The text file uses the tab character (\t) as a delimiter. We'll need to split the file on the tab delimiter and convert the results into an RDD that's more manageable.

In [6]:
split_hamlet = raw_hamlet.map(lambda line: line.split("\t"))
split_hamlet.take(5)

[['hamlet@0', '', 'HAMLET'],
 ['hamlet@8'],
 ['hamlet@9'],
 ['hamlet@10', '', 'DRAMATIS PERSONAE'],
 ['hamlet@29']]
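The empty string in the first and fourth elements comes from the two consecutive tab characters in the raw line: `split("\t")` emits an empty field between adjacent delimiters. A quick check in plain Python, using raw lines copied from the output above (no Spark needed), might look like this:

```python
# One raw line copied from the take(5) output above
raw_line = "hamlet@0\t\tHAMLET"

# Two adjacent tabs produce an empty string between them
fields = raw_line.split("\t")
print(fields)

# A line with no tab stays a single-element list
no_tab_fields = "hamlet@8".split("\t")
print(no_tab_fields)
```

This is worth keeping in mind later: the position of the spoken text inside each element depends on how many tabs the raw line contained.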

Lambda functions are great for writing quick functions with simple logic that we can pass into PySpark methods. They fall short when we need to write more customized logic, though. Thankfully, PySpark lets us define a named function in Python first, then pass it in. A function that produces a sequence of values for each element (as **flatMap()** expects, versus the single Boolean value that **filter()** requires) has to return something iterable. A convenient way to do that is to use a **yield** statement to specify the values that should be pulled later.

If we're unfamiliar with the **yield** statement in Python, read this excellent [Stack Overflow answer](https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do/231855#231855) on the topic. To summarize, **yield** is a Python technique that allows the interpreter to generate data on the fly and pull it when necessary, instead of storing it to memory immediately. Because of its unique architecture, Spark takes advantage of this technique to reduce overhead and improve the speed of computations.
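As a minimal illustration of that behavior in plain Python (the `squares` function is a made-up example, not part of PySpark), the sketch below shows that the body of a generator function doesn't run until a value is requested:

```python
def squares(limit):
    """Yield squares one at a time instead of building a full list."""
    for n in range(limit):
        print(f"computing {n} squared")
        yield n * n

gen = squares(3)    # creates the generator; the body has not run yet
first = next(gen)   # runs the body up to the first yield
rest = list(gen)    # pulls the remaining values

print(first, rest)
```

The "computing" messages appear only when `next()` or `list()` asks for values, which is the on-demand behavior described above.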

Spark runs the named function on every element in the RDD and restricts it in scope. PySpark serializes the function and ships it to Python worker processes, so each instance of the function only has access to the object(s) we pass into it, any variables that were serialized along with it, and the Python libraries available in the workers' environment. If we refer to objects that can't be serialized, or to libraries that aren't installed where the function actually runs, the computation may crash.

Finally, not all functions require us to use **yield**; only the ones that generate a custom sequence of data do. For **map()** or **filter()**, we use **return** to return a value for every single element in the RDD we're running the functions on.

In the following code cell, we'll use the **flatMap()** method with the named function **hamlet_speaks** to check whether a line in the play contains the text **HAMLET** in all caps (indicating that Hamlet spoke). **flatMap()** is different from **map()** because it doesn't require exactly one output for every element in the RDD: the function can produce zero, one, or many values per element. The **flatMap()** method is useful whenever we want to generate a sequence of values from an RDD.

In this case, we want an RDD object that contains tuples of the unique line IDs and the text "hamlet speaketh!", but **only for the elements in the RDD that have "HAMLET" in one of the values**. We can't use the **map()** method for this because it produces a return value for every element in the RDD.

We want each element in the resulting RDD to have the following format:

1. The first value should be the unique line ID (e.g. 'hamlet@0'), which is the first value in each of the elements in the **split_hamlet** RDD.

2. The second value should be the string "hamlet speaketh!"

In [7]:
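def hamlet_speaks(line):
    # line is a list of tab-separated fields; the first field is the line ID
    line_id = line[0]

    # Yield a pair only when one of the fields is exactly "HAMLET"
    if "HAMLET" in line:
        yield line_id, "hamlet speaketh!"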

In [11]:
hamlet_spoken = split_hamlet.flatMap(hamlet_speaks)

In [20]:
hamlet_spoken.take(5)

[('hamlet@0', 'hamlet speaketh!'),
 ('hamlet@75', 'hamlet speaketh!'),
 ('hamlet@1004', 'hamlet speaketh!'),
 ('hamlet@9144', 'hamlet speaketh!'),
 ('hamlet@12313', 'hamlet speaketh!')]
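To see why the lines without "HAMLET" disappear from the result, it can help to mimic the two methods in plain Python. The sketch below is only an emulation of the semantics (it uses `itertools.chain`, not Spark), applied to three rows copied from the **split_hamlet** output above:

```python
from itertools import chain

def hamlet_speaks(line):
    # Same logic as the function above: yield a pair only when HAMLET appears
    if "HAMLET" in line:
        yield line[0], "hamlet speaketh!"

# Three rows copied from the split_hamlet output above
rows = [
    ["hamlet@0", "", "HAMLET"],
    ["hamlet@8"],
    ["hamlet@10", "", "DRAMATIS PERSONAE"],
]

# map-like: exactly one result per input row (each result holds 0 or 1 pairs)
mapped = [list(hamlet_speaks(row)) for row in rows]

# flatMap-like: the per-row sequences are chained together, so empty ones vanish
flat_mapped = list(chain.from_iterable(hamlet_speaks(row) for row in rows))

print(mapped)
print(flat_mapped)
```

The map-like version keeps an (empty) placeholder for every row, while the flatMap-like version keeps only the pairs that were actually yielded.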

**hamlet_spoken** now contains the line IDs for the lines where Hamlet spoke. While this is handy, we don't have the full line anymore. Instead, let's use a **filter()** with a named function to extract the original lines where Hamlet spoke. The functions we pass into **filter()** *must* return values, which will be either **True** or **False**.

In [21]:
def filter_hamlet_speaks(line):
    # True when one of the fields is exactly "HAMLET", False otherwise
    return "HAMLET" in line

In [22]:
hamlet_spoken_lines = split_hamlet.filter(filter_hamlet_speaks)

In [23]:
hamlet_spoken_lines.take(5)

[['hamlet@0', '', 'HAMLET'],
 ['hamlet@75', 'HAMLET', 'son to the late, and nephew to the present king.'],
 ['hamlet@1004', '', 'HAMLET'],
 ['hamlet@9144', '', 'HAMLET'],
 ['hamlet@12313',
  'HAMLET',
  '[Aside]  A little more than kin, and less than kind.']]

As we've discussed before, Spark has two kinds of methods, transformations and actions. While we've explored some of the transformations, we haven't used any actions other than **take()**.

Whenever we use an action method, Spark forces the evaluation of lazy code. If we only chain together transformation methods and print the resulting RDD object, we'll see the type of RDD (e.g. a PythonRDD or PipelinedRDD object), but not the elements within it. That's because the computation hasn't actually happened yet.
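Python's built-in `map()` behaves in a loosely similar way, which makes it a handy analogy (this is plain Python, not Spark, and the `shout` function is made up for illustration). Building the `map` object does no work; converting it to a list forces the evaluation, much like an action does:

```python
calls = []

def shout(word):
    calls.append(word)   # record that the function actually ran
    return word.upper()

words = ["to", "be", "or", "not"]

lazy = map(shout, words)     # like a transformation: nothing runs yet
calls_before = len(calls)
print(lazy)                  # shows a map object, not the values

result = list(lazy)          # like an action: forces the evaluation
calls_after = len(calls)

print(calls_before, calls_after, result)
```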

Even though Spark simplifies chaining lots of transformations together, it's good practice to use actions to observe the intermediate RDD objects between those transformations. This will let us know whether our transformations are working the way we expect them to.

### count()

The **count()** method returns the number of elements in an RDD. **count()** is useful when we want to make sure the result of a transformation contains the right number of elements. For example, if we know there should be an element in the resulting RDD for every element in the initial RDD, we can compare the counts of both to ensure they match.

To get the number of elements in the RDD **hamlet_spoken_lines**, run **.count()** on it:

**hamlet_spoken_lines.count()**

### collect()

We've used **take()** to preview the first few elements of an RDD, similar to the way we've used **head()** in pandas. But what about returning all of the elements in a collection? We need to do this to write an RDD to a CSV, for example. It's also useful for running some basic Python code over a collection without going through PySpark.

Running **.collect()** on an RDD returns a list representation of it. To get a list of all the elements in **hamlet_spoken_lines**, for example, we would write:

**hamlet_spoken_lines.collect()**

In [32]:
# Number of elements in hamlet_spoken_lines
spoken_count = hamlet_spoken_lines.count()
spoken_count

381

In [33]:
spoken_collect = hamlet_spoken_lines.collect()
len(spoken_collect)

381

In [35]:
spoken_101 = spoken_collect[100]
spoken_101

['hamlet@58478', 'HAMLET', 'A goodly one; in which there are many confines,']
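Once **collect()** has returned an ordinary Python list, the standard library applies to it directly. As a rough sketch of the CSV use case mentioned earlier, the example below writes two rows copied from the outputs above with the `csv` module. The `rows` list stands in for **spoken_collect**, and the file name in the comment is hypothetical:

```python
import csv
import io

# Stand-in for spoken_collect: two rows copied from the outputs above
rows = [
    ["hamlet@75", "HAMLET", "son to the late, and nephew to the present king."],
    ["hamlet@58478", "HAMLET", "A goodly one; in which there are many confines,"],
]

# In-memory buffer; swap for open("hamlet_spoken.csv", "w", newline="")
# (hypothetical file name) to write to disk instead
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The `csv` module quotes the fields that contain commas, so the text of each line survives a round trip through the file. Keep in mind that **collect()** pulls every element into the driver's memory, so this approach suits small results like this one.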

We've done some initial cleanup of the Hamlet data set, and we should now have a better idea of how to use PySpark to transform data into a format that's better suited for analysis. We also learned how to use actions to explore an RDD before chaining another transformation to it.

If we'd like to learn how to install PySpark and integrate it with IPython Notebook, [this wonderful blog post](https://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/) will walk us through the steps. 