# Definition of an RDD and its creation from a file

### Definition of an RDD:
In Spark, data is split across multiple nodes, i.e. machines, of a cluster. A Resilient Distributed Dataset (RDD) represents such a distributed dataset: it is an immutable, distributed collection of data. Any operation performed on an RDD is sent to all the nodes, and each node applies it to its own chunk of the data.

It supports two types of operations:
1. __Transformation__: Operations on an RDD, such as filter() or map(), which yield another RDD. Transformations are evaluated lazily, i.e. only when an action needs their result.
2. __Action__: Operations on an RDD which trigger a computation. The result of this computation is returned to the driver program or written to a stable storage system. Examples include count(), first(), take(n) and collect().
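The difference between the two kinds of operation can be illustrated with a plain-Python analogy. This is only a sketch and does not use Spark: the `partitions` list, the `is_even` helper and the `evaluated` log are hypothetical names introduced here, and generator expressions stand in for lazy transformations.

```python
# Plain-Python analogy only: no Spark involved, all names are hypothetical.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]  # data split into chunks, one per "node"

evaluated = []  # records which elements have actually been processed


def is_even(x):
    evaluated.append(x)  # side effect so we can observe when work happens
    return x % 2 == 0


# "Transformation": build one lazy generator per chunk; nothing is computed yet.
filtered = [(x for x in part if is_even(x)) for part in partitions]
work_before_action = len(evaluated)

# "Action": each chunk is processed independently, then the partial results
# are combined in one place (the role the driver program plays in Spark).
partial_counts = [sum(1 for _ in part) for part in filtered]
total = sum(partial_counts)

print("elements processed before the action: %d" % work_before_action)
print("per-chunk counts: %s, total: %d" % (partial_counts, total))
```

No element is touched until the counting step runs, which mirrors how a chain of transformations in Spark only describes work that a later action actually carries out.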

### RDD Creation:

There are three methods to create RDDs in Spark. These are as follows:
1. `parallelize` method: Invoking this method in the driver program creates a parallelized collection from an existing in-memory collection.
2. `textFile` method: This method creates an RDD from the path or URI of a text file, which it reads as a collection of lines. _In this notebook, we will use this method to create an RDD_.
3. __Existing RDD__: As mentioned earlier, a transformation of an existing RDD also results in a new RDD.

#### Data:
Following __[this](https://github.com/jadianes/spark-py-notebooks)__ excellent tutorial on Spark for Python developers, I too will be using the reduced 10% dataset provided for the KDD'99 competition. This dataset can be downloaded from the __[UCI KDD Archive](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)__.

This reduced dataset has about half a million network interactions. It is provided as a _gzip_ file, which we download to the local working directory.

In [2]:
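# Download the gzip-compressed dataset into the working directory.
# (Python 2 API; in Python 3 the function is urllib.request.urlretrieve.)
import urllib
f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")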

Read data from the gzip file. Note that the `textFile` method can read directly from a compressed file. 

In [3]:
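# Path of the file downloaded above
data_gzip_file = "./kddcup.data_10_percent.gz"
# sc is the SparkContext, assumed to be already created by the PySpark session
raw_data = sc.textFile(data_gzip_file)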
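As a rough local analogy for what reading a gzip file "as a collection of lines" means, the sketch below uses only Python's standard `gzip` module on a tiny, hypothetical file. It does not involve Spark, which performs the decompression itself when `textFile` is given a compressed file; the file name `sample.txt.gz` and its contents are made up for illustration.

```python
import gzip
import os
import tempfile

# Hypothetical miniature file standing in for the real dataset.
sample_lines = ["first record", "second record", "third record"]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")

# Write the lines into a gzip-compressed text file.
with gzip.open(sample_path, "wb") as fh:
    fh.write("\n".join(sample_lines).encode("utf-8"))

# Read it back: decompress, decode, and split into one string per line.
with gzip.open(sample_path, "rb") as fh:
    lines = fh.read().decode("utf-8").splitlines()

print(lines)
```

Each element of `lines` is one line of the decompressed file, which is the same shape of data that the elements of `raw_data` have.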

Check if the data is loaded correctly. 

In [4]:
# Count the number of lines in the dataset
raw_data.count()

494021

In [5]:
# Read the first few lines of the dataset
raw_data.take(5)

[u'0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,219,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,39,39,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,49,49,1.00,0.00,0.02,0.00,0.00,0.00,0.00,0.00,normal.']