![](http://spark.apache.org/images/spark-logo.png) ![](https://upload.wikimedia.org/wikipedia/commons/f/f8/Python_logo_and_wordmark.svg)


Sampling RDDs
==============

So far we have introduced RDD creation together with some basic transformations such as map and filter and some actions such as count, take, and collect.


This notebook shows how to sample RDDs. We first introduce the ```sample``` transformation, which is useful in many statistical learning scenarios, and then compare its results with the ```takeSample``` action.

## 1. Getting the data and creating the RDD

In this case we will use the complete dataset provided for the KDD Cup 1999, containing nearly five million network interactions. The file is provided as a Gzip file that we will download locally.

In [1]:
import urllib
# Python 2 API; in Python 3 this function is urllib.request.urlretrieve
f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz",
                       "kddcup.data.gz")

In [3]:
data_file = "./kddcup.data.gz"
raw_data = sc.textFile(data_file)

## 2. Sampling RDDs

In Spark, there are two sampling operations: the transformation ```sample``` and the action ```takeSample```. With the transformation, we tell Spark to apply successive transformations to a sample of a given RDD. With the action, we retrieve a sample into local memory, where it can be used by any other standard library (e.g. Scikit-learn).

### The ```sample``` transformation

The ```sample``` transformation takes up to three parameters.
- First is whether the sampling is done with replacement or not.
- Second is the expected sample size as a fraction of the RDD's size.
- Finally, we can optionally provide a *random seed*.

In [4]:
raw_data_sample = raw_data.sample(False, 0.1, 1234)
sample_size = raw_data_sample.count()
total_size = raw_data.count()
print("Sample size is {} of {}".format(sample_size, total_size))

Sample size is 489957 of 4898431
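
Note that the sample holds 489957 records, which is close to but not exactly one tenth of 4898431. That is what we would expect if each record is kept or dropped independently with the given probability, so the fraction sets the expected size rather than an exact one. The following is a minimal pure-Python sketch of that idea, not Spark's actual implementation; `bernoulli_sample` is a hypothetical helper and the list of integers stands in for an RDD.

```python
import random


def bernoulli_sample(items, fraction, seed=None):
    """Keep each item independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]


# Stand-in data: 100000 integers instead of an RDD of text lines.
records = list(range(100000))
first = bernoulli_sample(records, 0.1, seed=1234)
second = bernoulli_sample(records, 0.1, seed=1234)

# The size is close to 10% of the input, and the same seed gives the same sample.
print("Sample size is {} of {}".format(len(first), len(records)))
print("Reproducible with the same seed: {}".format(first == second))
```

Fixing the seed is what makes a sampled experiment repeatable; leaving it out gives a different sample on each run.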


But the power of sampling as a transformation comes from using it as part of a sequence of further transformations. This will prove more powerful once we start doing aggregations and key-value pair operations, and will be especially useful when using Spark's machine learning library MLlib.



In the meantime, imagine we want an approximation of the proportion of ```normal.``` interactions in our dataset. We could get it by counting the total number of tags, as we did in previous notebooks. However, we want a quicker response and do not need the exact answer, just an approximation. We can do it as follows.

In [5]:
from time import time

# transformations to be applied
raw_data_sample_items = raw_data_sample.map(lambda x: x.split(","))
sample_normal_tags = raw_data_sample_items.filter(lambda x: "normal." in x)

# actions + time
t0 = time()
sample_normal_tags_count = sample_normal_tags.count()
tt = time() - t0

sample_normal_ratio = sample_normal_tags_count / float(sample_size)
print("The ratio of 'normal' interactions is {}".format(round(sample_normal_ratio, 3)))
print("Count done in {} seconds".format(round(tt, 3)))

The ratio of 'normal' interactions is 0.199
Count done in 21.444 seconds
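
To get a feeling for how rough this approximation is, we can estimate the standard error of the sampled proportion. The sketch below reuses the figures printed above (a sample of 489957 records and a ratio of 0.199) and assumes the records were drawn independently, which is a simplification; the 1.96 factor is the usual normal approximation for a 95% interval.

```python
import math

# Figures reported by the cells above.
sample_size = 489957
sample_normal_ratio = 0.199

# Standard error of a sample proportion, assuming independent draws.
standard_error = math.sqrt(
    sample_normal_ratio * (1 - sample_normal_ratio) / sample_size
)

# Approximate 95% interval using the normal approximation (z = 1.96).
margin = 1.96 * standard_error
lower = sample_normal_ratio - margin
upper = sample_normal_ratio + margin

print("Standard error: {:.5f}".format(standard_error))
print("Approximate 95% interval: {:.4f} to {:.4f}".format(lower, upper))
```

With a sample this large the interval is narrow, which supports using the sample when only an approximate ratio is needed.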
