# PySpark Part 2
Notebook Created by Danielle Savage

In [34]:
# Let's begin by checking our SparkContext
sc 

## An Introduction to Transformations

* map (function) - Applies a function to each element of the RDD and returns one output element per input element
* flatMap (function) - Applies a function that returns a sequence to each element of the RDD, then flattens all of the results into a single RDD
* filter (function) - Returns an RDD containing only the elements that meet the filtering condition
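
To make the difference between these three concrete, here is a plain-Python sketch that mimics them on a small list. It does not use Spark; the `lines` list and the variable names are illustrative assumptions, and each comment names the RDD call that the line is meant to imitate.

```python
from itertools import chain

# Hypothetical sample data standing in for an RDD of text lines
lines = ["the quick fox", "jumps over", "the lazy dog"]

# Imitates rdd.map: one output element (here a list) per input element
mapped = [line.split() for line in lines]

# Imitates rdd.flatMap: the per-element lists are flattened into one sequence
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

# Imitates rdd.filter: keep only the elements for which the test is True
filtered = [word for word in flat_mapped if word == "the"]

print(mapped)
print(flat_mapped)
print(filtered)
```

The mapped result keeps one inner list per line, while the flattened result is a single list of words, which is the shape a word-level filter needs.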

### A Review of Python Functions

Before continuing, it may be beneficial to review the kinds of functions that can be passed to a transformation.

The easiest way to pass in a function is to use a lambda function.

Lambda Function: Also known as an anonymous function, a lambda is a compact way to define a function inline. For example, the two functions below are equivalent.

**User Defined Function**
```
def my_function( x ):
    result = x + 5
    return result
```
**Lambda Function**
```
lambda x : x + 5
```

Of course, when passing lambda functions you can still use user-defined functions. This means that you can also do the following...

**Both Together**
```
lambda x : my_function( x )
```
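
As a quick check of the equivalence described above, the sketch below applies all three forms to the same input. Python's built-in `map` stands in for an RDD's `map` here, and the names `add_five_lambda`, `wrapped`, and `numbers` are illustrative assumptions rather than part of the notebook.

```python
def my_function(x):
    result = x + 5
    return result

# The same logic written as a lambda, and a lambda that wraps the named function
add_five_lambda = lambda x: x + 5
wrapped = lambda x: my_function(x)

numbers = [1, 2, 3]

# A named function can also be passed directly, with no lambda around it
via_def = list(map(my_function, numbers))
via_lambda = list(map(add_five_lambda, numbers))
via_wrapped = list(map(wrapped, numbers))

print(via_def, via_lambda, via_wrapped)
```

All three calls produce the same list, so the choice between them is mostly a matter of readability.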

**Using what we just learned, let's split each line on whitespace**

### Map Example

Let's load The Project Gutenberg EBook of Walden, and On The Duty Of Civil
Disobedience, by Henry David Thoreau.
If you have difficulty viewing this file in the repository, feel free to download it [here](http://www.gutenberg.org/files/205/205-0.txt).

In [16]:
# Load in the txt file
walden = sc.textFile('data/Walden.txt')
# Apply the map function to split the text into words
words = walden.map(lambda line: line.split())
words.collect();

The output shows that this gave us a nested array, with one list of words for each line. If we want a single flat array of words instead, we use flatMap.

In [18]:
words_all = walden.flatMap(lambda line: line.split())
words_all.collect();

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Walden,',
 'and',
 'On',
 'The',
 'Duty',
 'Of',
 'Civil',
 'Disobedience,',
 'by',
 'Henry',
 'David',
 'Thoreau',
 'This',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever.',
 'You',
 'may',
 'copy',
 'it,',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'eBook',
 'or',
 'online',
 'at',
 'www.gutenberg.org',
 'Title:',
 'Walden,',
 'and',
 'On',
 'The',
 'Duty',
 'Of',
 'Civil',
 'Disobedience',
 'Author:',
 'Henry',
 'David',
 'Thoreau',
 'Posting',
 'Date:',
 'July',
 '12,',
 '2008',
 '[EBook',
 '#205]',
 'Release',
 'Date:',
 'January,',
 '1995',
 '[Last',
 'updated:',
 'July',
 '29,',
 '2011]',
 'Language:',
 'English',
 'Character',
 'set',
 'encoding:',
 'UTF-8',
 '***',
 'START',
 'OF',
 'THIS',
 'PRO

Now that we have a list of words, let's filter! Pick a word and check whether it appears in the text. I will choose `Walden` and see whether the resulting array is long or short.

In [24]:
all_waldens = words_all.filter(lambda word: word == 'Walden')
all_waldens.collect()

['Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden',
 'Walden']
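
Note that the exact comparison only matches the bare token `Walden`. Tokens with attached punctuation, such as the `'Walden,'` visible in the earlier word list, are skipped. One possible way to include them is to strip punctuation before comparing. The sketch below is plain Python on a hypothetical handful of tokens; `normalize` and `tokens` are illustrative names, not part of the original notebook.

```python
import string

def normalize(token):
    """Strip leading and trailing punctuation from a token."""
    return token.strip(string.punctuation)

# Hypothetical tokens standing in for a few elements of words_all
tokens = ["Walden,", "and", "Walden", "Pond.", "Walden;"]

# Exact match, as in the cell above
exact = [t for t in tokens if t == "Walden"]

# Match after removing surrounding punctuation
loose = [t for t in tokens if normalize(t) == "Walden"]

print(len(exact), len(loose))
```

With an RDD, the same helper could be passed as `words_all.filter(lambda word: normalize(word) == 'Walden')`, and calling `count()` on the result would report the number of matches without collecting every element.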

## More Transformations

Now that we know the basics of applying transformations, let's look at a few more.

* distinct ( ) - Returns only one of each element
* rdd1.union (rdd2) - Returns all elements from both RDDs, keeping any duplicates
* rdd1.intersection (rdd2) - Returns the elements common to both RDDs
* rdd1.subtract (rdd2) - Returns the elements of rdd1 that are not in rdd2
* rdd1.cartesian (rdd2) - Returns the Cartesian product of the two RDDs
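
The behaviour of these set-like transformations can be sketched with ordinary Python lists. This is only an analogy under stated assumptions: `list1` and `list2` are hypothetical stand-ins for `rdd1` and `rdd2`, no Spark is involved, and `sorted` is used purely to make the printed output predictable, since Spark does not guarantee element order for these operations.

```python
from itertools import product

# Hypothetical stand-ins for rdd1 and rdd2
list1 = [1, 2, 2, 3]
list2 = [3, 4]

# Imitates rdd1.distinct(): one copy of each element
distinct = sorted(set(list1))

# Imitates rdd1.union(rdd2): everything from both, duplicates kept
union = list1 + list2

# Imitates rdd1.intersection(rdd2): elements present in both
intersection = sorted(set(list1) & set(list2))

# Imitates rdd1.subtract(rdd2): elements of list1 that are not in list2
exclude = set(list2)
subtract = [x for x in list1 if x not in exclude]

# Imitates rdd1.cartesian(rdd2): every (a, b) pair
cartesian = list(product(list1, list2))

print(distinct, union, intersection, subtract, len(cartesian))
```

If a union without duplicates is wanted, chaining `distinct()` after `union()` is one way to get it.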

Using distinct, let's find the distinct words in Walden.

In [33]:
# words_all was built with flatMap, so distinct() operates on individual words
words_all.distinct().collect();