# RDD Basic Operations - Part I

This notebook covers the following topics:
* Transformations: `map` and `filter`
* Action: `collect`

As explained in the introductory notebook, we'll use the `textFile` method to create an RDD from an existing file. I already have the `gzip` file downloaded on my machine. If you're starting straight from this notebook, you may want to copy the two commands for downloading that file from the introductory notebook.

In [10]:
data_gzip_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_gzip_file)
print type(raw_data) 

<class 'pyspark.rdd.RDD'>


### The `filter` transformation
As the name suggests, this transformation, when applied to an RDD, returns a new RDD containing only the elements that satisfy the filtering criterion.

Let's select some specific interactions from the dataset. The last element in each line is a label that tells us whether the connection was `normal.`, `neptune.`, `smurf.`, etc. I've picked two such labels for demo purposes.

In [14]:
normal_raw_data = raw_data.filter(lambda x: 'normal.' in x)
smurf_raw_data = raw_data.filter(lambda x: 'smurf.' in x)
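
Note that `'normal.' in x` is a substring test over the whole line, not a comparison against the last field. The plain-Python sketch below contrasts the two on a few made-up, shortened records (real records have 42 fields). The helper `has_label` and the sample lines are assumptions for illustration only; on the actual dataset the two predicates may well give identical counts.

```python
def has_label(line, label):
    """Return True if the last comma-separated field of `line` equals `label`."""
    return line.rsplit(",", 1)[-1] == label

# Made-up, shortened records for illustration only
sample_lines = [
    "0,tcp,http,SF,181,normal.",
    "0,icmp,ecr_i,SF,1032,smurf.",
    "0,tcp,normal.example,SF,10,neptune.",  # contrived: 'normal.' appears mid-line
]

# Substring test, as used in the notebook cell
substring_hits = [l for l in sample_lines if "normal." in l]
# Exact comparison against the last field
exact_hits = [l for l in sample_lines if has_label(l, "normal.")]

print(len(substring_hits))
print(len(exact_hits))
```

If the stricter behaviour is wanted, the same predicate could be passed to the RDD, e.g. `raw_data.filter(lambda x: has_label(x, 'normal.'))`.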

In [15]:
from time import time
t0 = time()
normal_count = normal_raw_data.count()
smurf_count = smurf_raw_data.count()
tt = time() - t0
print "There are {} 'normal' interactions".format(normal_count)
print "There are {} 'smurf' interactions".format(smurf_count)
print "Count completed in {} seconds".format(round(tt,3))

There are 97278 'normal' interactions
There are 280790 'smurf' interactions
Count completed in 2.274 seconds


An astute reader will notice that we didn't measure the elapsed time for the transformation step (in this case `filter`). This is because in Spark, distributed computation actually takes place when we execute an _action_ (here, `count`), not when a _transformation_ is applied to an RDD. Regardless of how many transformations are applied to an RDD, no computation takes place until an action is called on it.
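
The same idea can be sketched without Spark, using a Python generator as a rough analogy: defining the pipeline does no work, and the predicate only runs once the result is consumed. This is not how Spark is implemented, just an illustration of lazy evaluation; the name `is_normal` and the shortened sample lines are made up.

```python
calls = []  # records every line the predicate actually sees

def is_normal(line):
    calls.append(line)
    return line.endswith("normal.")

# Made-up, shortened records for illustration only
lines = ["0,tcp,http,SF,normal.", "0,icmp,ecr_i,SF,smurf.", "0,tcp,smtp,SF,normal."]

# Building the pipeline does no work, loosely like raw_data.filter(...)
lazy_normal = (line for line in lines if is_normal(line))
calls_before = len(calls)

# Consuming the pipeline triggers the work, loosely like calling count()
normal_count = sum(1 for _ in lazy_normal)
calls_after = len(calls)

print("Predicate calls before consuming: {}".format(calls_before))
print("Predicate calls after consuming: {}".format(calls_after))
print("Matches: {}".format(normal_count))
```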

### The `map` transformation
This transformation is analogous to Python's built-in `map` function: it applies a function (often written as a `lambda`) to every element of the RDD and returns a new RDD of the results.
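
For comparison, here is the plain-Python counterpart on an ordinary list, using the built-in `map`. The two sample lines are made-up, shortened records; the point is only that the function is applied element by element and the result has one output per input.

```python
# Made-up, shortened records for illustration only
lines = ["0,tcp,http,SF,normal.", "0,icmp,ecr_i,SF,smurf."]

# Apply the same function to every element; list() materialises the result
split_rows = list(map(lambda x: x.split(","), lines))

print(split_rows[0])
print(len(split_rows))
```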

In the code snippet below, we'll parse each line as comma-separated values (CSV) and then pretty-print the first element of the resulting list.

In [18]:
from pprint import pprint
csv_data = raw_data.map(lambda x: x.split(","))
print type(csv_data)
t0 = time()
head_rows = csv_data.take(5) # Returns an array with the first 5 elements of the csv_data
tt = time() - t0
print "Parse completed in {} seconds".format(round(tt,3))
pprint(head_rows[0]) # Print the first element of the array
print "Length of 0th element of the array", len(head_rows[0])

<class 'pyspark.rdd.PipelinedRDD'>
Parse completed in 0.105 seconds
[u'0',
 u'tcp',
 u'http',
 u'SF',
 u'181',
 u'5450',
 u'0',
 u'0',
 u'0',
 u'0',
 u'0',
 u'1',
 u'0',
 u'0',
 u'0',
 u'0',
 u'0',
 u'0',
 u'0',
 u'0',
 u'0',
 u'0',
 u'8',
 u'8',
 u'0.00',
 u'0.00',
 u'0.00',
 u'0.00',
 u'1.00',
 u'0.00',
 u'0.00',
 u'9',
 u'9',
 u'1.00',
 u'0.00',
 u'0.11',
 u'0.00',
 u'0.00',
 u'0.00',
 u'0.00',
 u'0.00',
 u'normal.']
Length of 0th element of the array 42


Instead of an anonymous function built around `split`, as in the previous snippet, we can also pass a user-defined function to `map`. In the snippet below, the `parse_interaction` function splits each line of the dataset on commas and returns a key-value pair (a tuple, not a dictionary): the key is the tag, i.e. the last element (e.g. `normal.`), and the value is the full list of 42 elements, which still includes the tag.

In [6]:
def parse_interaction(line):
    elems = line.split(",")
    tag = elems[41]
    return (tag, elems)

key_csv_data = raw_data.map(parse_interaction)
head_rows = key_csv_data.take(5)
pprint(head_rows[0])

(u'normal.',
 [u'0',
  u'tcp',
  u'http',
  u'SF',
  u'181',
  u'5450',
  u'0',
  u'0',
  u'0',
  u'0',
  u'0',
  u'1',
  u'0',
  u'0',
  u'0',
  u'0',
  u'0',
  u'0',
  u'0',
  u'0',
  u'0',
  u'0',
  u'8',
  u'8',
  u'0.00',
  u'0.00',
  u'0.00',
  u'0.00',
  u'1.00',
  u'0.00',
  u'0.00',
  u'9',
  u'9',
  u'1.00',
  u'0.00',
  u'0.11',
  u'0.00',
  u'0.00',
  u'0.00',
  u'0.00',
  u'0.00',
  u'normal.'])
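
To make the shape of the result explicit, the sketch below runs the same parsing logic on a single made-up, shortened record in plain Python. It uses `elems[-1]` rather than `elems[41]` only because the sample line has fewer than 42 fields; that substitution is an assumption of this example, not a change to the notebook's function.

```python
def parse_interaction(line):
    elems = line.split(",")
    tag = elems[-1]  # last field; the notebook uses elems[41] on full 42-field records
    return (tag, elems)

sample_line = "0,tcp,http,SF,181,5450,normal."  # made-up, shortened record
pair = parse_interaction(sample_line)
key, value = pair

print(type(pair).__name__)  # the result is a tuple, not a dict
print(key)
print(len(value))           # the value list still contains the tag
```

Spark treats an RDD of such 2-tuples as a key-value (pair) RDD.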


### The `collect` action
`collect` returns all the elements of the RDD as a list to the driver program. The driver program is the process that runs the application's main function and creates the `SparkContext` (the `sc` object used in this notebook).

Broadly speaking, this action brings all the RDD elements into the driver's memory so that we can work with them locally. For this reason, it should be used with caution, particularly with large RDDs.

In [8]:
t0 = time()
all_raw_data = raw_data.collect()
tt = time() - t0
print "Data collected in {} seconds".format(round(tt,3))

Data collected in 4.205 seconds
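
One possible way to apply that caution is to check the size first and fall back to `take` when the RDD is too large. The sketch below is only an illustration: `collect_with_limit` and its `max_rows` threshold are hypothetical names, and `FakeRDD` is a tiny in-memory stand-in so the example runs without Spark. It relies only on the `count`, `collect` and `take` methods, which real RDDs also provide. Keep in mind that on a real RDD, `count()` is itself an action and costs an extra pass over the data.

```python
class FakeRDD(object):
    """Tiny in-memory stand-in exposing the three RDD methods used below."""

    def __init__(self, items):
        self._items = list(items)

    def count(self):
        return len(self._items)

    def collect(self):
        return list(self._items)

    def take(self, n):
        return self._items[:n]


def collect_with_limit(rdd, max_rows=1000):
    """Return every element if the RDD has at most max_rows, else only the first max_rows."""
    if rdd.count() <= max_rows:
        return rdd.collect()
    return rdd.take(max_rows)


small = collect_with_limit(FakeRDD(range(5)), max_rows=10)
capped = collect_with_limit(FakeRDD(range(50)), max_rows=10)

print(len(small))
print(len(capped))
```

With a live Spark session, the same helper could be called as `collect_with_limit(raw_data, max_rows=1000)`.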
