# RDD: Definition and its creation

### Definition of an RDD:
In Spark, data is split across multiple nodes, i.e. machines, and the partitioned data is accessed through the APIs that Spark defines. A Resilient Distributed Dataset (RDD) is a representation of such a distributed dataset: an immutable, distributed collection of data. Any operation performed on an RDD is sent to all the nodes, where each node applies it to its own chunk of the data. Let's break down the term Resilient Distributed Dataset and look at the meaning and relevance of each word, starting with the last word and working backwards.

`Dataset` refers to the object created by reading data, for example a file on disk. The file may be stored on one's local machine or in a data source such as HDFS, S3 or HBase. Once the data is read into Spark, we can reference it through an RDD. RDDs are immutable, meaning that we cannot change a section of an RDD in place. Every time a transformation is applied to an RDD, a new RDD is created. A series of transformations thus forms a data analysis pipeline.

The second term, `distributed`, refers to the fact that Spark stores the data spread across a cluster of machines, say hundreds of instances on Amazon. Depending on the use case at hand (with a view to optimizing performance), we can configure how many partitions the data is split into, and hence the size of each partition on the cluster.

Lastly, node failures are common in a distributed environment. When one occurs, an application must be able to recover the work already done. `Resilient` refers to the fact that Spark keeps track of how each partition is derived at every compute step. If a partition is lost, Spark can re-create it from the partitions that were used to create it. This also means that Spark can work out where to start the re-computation, so it can recover the lost partition without redoing the whole job.
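
The recovery idea can be illustrated without Spark. The following is a toy sketch in plain Python, not Spark's implementation: partitions are ordinary lists, and `recompute_partition` and `lineage` are hypothetical names introduced only for this illustration. It shows that when the steps that produced a partition are recorded, a lost partition can be rebuilt from its source partition alone.

```python
# Toy model of lineage-based recovery (plain Python, not Spark internals).
# All names here are hypothetical and exist only for illustration.

def recompute_partition(source_partition, lineage):
    """Rebuild one partition by replaying the recorded steps on its source."""
    data = list(source_partition)
    for step in lineage:
        data = step(data)
    return data

# Source data split into three partitions, as if spread over three nodes.
source = [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]

# Lineage: the ordered per-partition steps that produced the derived data.
lineage = [
    lambda part: [x * x for x in part],            # square every element
    lambda part: [x for x in part if x % 2 == 0],  # keep the even values
]

derived = [recompute_partition(p, lineage) for p in source]

# Simulate a node failure: the second derived partition disappears.
derived[1] = None

# Only the lost partition is rebuilt, from its own source partition.
derived[1] = recompute_partition(source[1], lineage)
print(derived)
```

The other two partitions are never touched during recovery, which is the property the paragraph above describes.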

Two types of operations are possible on RDDs:
1. __Transformation__: Operations on an RDD, such as filter() or map(), which yield another RDD.
2. __Action__: Operations on an RDD which trigger a computation. The result of this computation is returned to the driver program or written to a stable storage system. Examples include count(), first(), take(n) and collect().
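
The difference between the two kinds of operation can be sketched in plain Python. The `ToyRDD` class below is a hypothetical stand-in written for this illustration and is not Spark's API or implementation. It assumes the behaviour described above: a transformation only records a step and returns a new object, while an action runs the recorded steps and returns an ordinary result.

```python
# Toy model of transformations versus actions (not Spark's API).
# ToyRDD and its methods are hypothetical and only mimic the behaviour described.

class ToyRDD(object):
    def __init__(self, source, steps=()):
        self._source = source        # original data, never modified
        self._steps = tuple(steps)   # recorded transformations

    # Transformations: record the step and return a NEW ToyRDD.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    # Actions: run the recorded steps and return a plain Python result.
    def collect(self):
        data = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def count(self):
        return len(self.collect())


calls = []

def double(x):
    calls.append(x)   # record every time the function actually runs
    return 2 * x

numbers = ToyRDD(range(5))
doubled = numbers.map(double)             # transformation: nothing runs yet
large = doubled.filter(lambda x: x > 4)   # another transformation
calls_before_action = len(calls)          # no call to double() so far

result = large.collect()                  # action: the pipeline executes
print(calls_before_action, result, numbers.collect())
```

In PySpark the analogous calls are `map()` and `filter()` for transformations and `collect()` and `count()` for actions. Note that `numbers` is unchanged after the pipeline runs, mirroring the immutability of RDDs.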

### RDD Creation:

There are three methods to create RDDs in Spark. These are as follows:
1. `parallelize` method: By invoking this method in the driver program, we can create a parallelized collection from an existing collection.
2. `textFile` method: This method creates an RDD from the path or URI of a text file. It reads the file as a collection of lines. _In this notebook, we will use this method to create an RDD_.
3. __Existing RDD__: As mentioned earlier, a transformation of an existing RDD also results in a new RDD.

In the following paragraphs, we'll see the first two methods of creating an RDD. 

#### `parallelize` method
We will create an RDD from a list in the `driver program` (`PySpark` in this case) and distribute it over 3 partitions. Next, we collect the data from all the partitions and print the resulting list. Note that collecting the partitioned data requires the whole RDD to fit into the memory of the driver program, so collecting a big RDD should be done with caution. Finally, the `glom()` method lets us see how the data is actually partitioned across the cluster of nodes.

In [7]:
integer_RDD = sc.parallelize(range(10), 3)  # Distribute a list across 3 partitions
print(integer_RDD.collect())  # Gather data from all partitions
print(integer_RDD.glom().collect())  # View how data is distributed across partitions

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
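
The partition boundaries in the `glom()` output above are consistent with cutting the sequence into contiguous, near-equal slices. The sketch below reproduces that layout in plain Python. `split_into_partitions` is a hypothetical helper, not a Spark function, and whether Spark applies exactly this rule to every kind of input is an assumption.

```python
# Sketch: cut a sequence into contiguous, near-equal slices.
# split_into_partitions is a hypothetical helper, not a Spark function.

def split_into_partitions(items, num_partitions):
    items = list(items)
    n = len(items)
    partitions = []
    for i in range(num_partitions):
        start = (i * n) // num_partitions        # first index of slice i
        end = ((i + 1) * n) // num_partitions    # one past the last index
        partitions.append(items[start:end])
    return partitions

partitions = split_into_partitions(range(10), 3)
print(partitions)
```

With 10 elements and 3 partitions, the integer division places the extra element in the last slice, which matches the output shown above.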


#### `textFile` method
To demonstrate this method, I follow __[this](https://github.com/jadianes/spark-py-notebooks)__ excellent tutorial on Spark for Python developers and use the reduced (10%) dataset provided for the KDD'99 competition. The dataset can be downloaded from the __[UCI KDD Archive](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)__.

This reduced dataset has about half a million network interactions. It is provided as a _gzip_ file, which we download to the local machine.

In [2]:
import urllib
# Python 2 API; on Python 3 the equivalent function is urllib.request.urlretrieve
f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")

Read data from the gzip file. Note that the `textFile` method can read directly from a compressed file. 

In [3]:
data_gzip_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_gzip_file)

Check if the data is loaded correctly. 

In [4]:
# Count the number of lines in the dataset
raw_data.count()

494021
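
One way to cross-check this count without Spark is to read the compressed file with Python's standard `gzip` module. This is a minimal sketch, assuming the file was downloaded to the path used above. `count_lines_gz` is a hypothetical helper name.

```python
import gzip

def count_lines_gz(path):
    """Count the lines in a gzip-compressed text file without Spark."""
    with gzip.open(path, "rb") as handle:
        # Iterating over the file object yields one line at a time,
        # so the whole file is never held in memory at once.
        return sum(1 for _ in handle)
```

Calling `count_lines_gz("./kddcup.data_10_percent.gz")` should agree with the value returned by `raw_data.count()`, provided both read the same file.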

In [5]:
# Read the first few lines of the dataset
raw_data.take(5)

[u'0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,219,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,39,39,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,49,49,1.00,0.00,0.02,0.00,0.00,0.00,0.00,0.00,normal.']
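
The third creation method listed earlier, transforming an existing RDD, works by passing ordinary Python functions to transformations such as `map()` and `filter()`. The sketch below defines two such per-line functions and tries them on the first record shown above. `split_fields` and `is_normal` are hypothetical helper names, and treating the last field as the record's label is an assumption about the dataset layout.

```python
# Sketch of per-line functions that could be passed to RDD transformations.
# split_fields and is_normal are hypothetical helper names.

def split_fields(line):
    """Split one comma-separated record into a list of string fields."""
    return line.split(",")

def is_normal(line):
    """True when the record's last field (assumed to be the label) is 'normal.'."""
    return split_fields(line)[-1] == "normal."

# First record from the take(5) output above.
sample = ("0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,"
          "0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,"
          "0.00,0.00,0.00,0.00,0.00,normal.")
fields = split_fields(sample)
print(fields[1], fields[2], fields[-1], is_normal(sample))
```

With the `raw_data` RDD from above, `raw_data.map(split_fields)` and `raw_data.filter(is_normal)` would each return a new RDD, and an action such as `count()` or `take(5)` would then trigger the computation.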