![](http://spark.apache.org/images/spark-logo.png) ![](https://upload.wikimedia.org/wikipedia/commons/f/f8/Python_logo_and_wordmark.svg)

RDD creation
================

## 1. Getting the data files

In [2]:
import urllib
# Python 2 API; in Python 3 the same function is urllib.request.urlretrieve.
f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz",
                       "kddcup.data_10_percent.gz")
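
The download cell above uses the Python 2 location of `urlretrieve`. As a minimal sketch, assuming you may also want to run it under Python 3 and avoid re-downloading on every run, something like the following could be used. The helper name `download_if_missing` is hypothetical and not part of the original notebook.

```python
import os

try:
    from urllib.request import urlretrieve  # Python 3 location
except ImportError:
    from urllib import urlretrieve  # Python 2 location


def download_if_missing(url, local_path):
    """Download url to local_path unless a file already exists there."""
    if not os.path.exists(local_path):
        urlretrieve(url, local_path)
    return local_path


# Example call (commented out so the sketch does not hit the network):
# download_if_missing(
#     "http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz",
#     "kddcup.data_10_percent.gz",
# )
```

The existence check only looks at the path, so a partially downloaded file would be treated as complete; delete it manually in that case.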

## 2. Creating an RDD from a file
The most common way of creating an RDD is to load it from a file.
Notice that Spark's ```textFile``` can handle compressed files directly.

In [3]:
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)

Now we have our data file loaded into the ```raw_data``` RDD.
Without getting into Spark **transformations and actions**, the most basic check that the RDD was loaded correctly is to ```count()``` its elements. Each element corresponds to one line of the file.

In [4]:
raw_data.count()

494021
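
One way to cross-check this count outside Spark is to count the lines of the gzip file with the standard library. The sketch below is only an illustration: it builds a tiny stand-in file instead of reading the real dataset (an assumption made to keep it self-contained), and the helper name `count_lines` is hypothetical.

```python
import gzip
import os
import tempfile


def count_lines(gz_path):
    """Count the lines in a gzip-compressed text file."""
    with gzip.open(gz_path, "rb") as handle:
        return sum(1 for _ in handle)


# Build a small stand-in file so the sketch runs without the real dataset.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(sample_path, "wb") as handle:
    handle.write(b"0,tcp,http,SF\n0,udp,private,SF\n0,icmp,ecr_i,SF\n")

sample_count = count_lines(sample_path)  # three lines were written above
```

Run against `kddcup.data_10_percent.gz`, `count_lines` would be expected to agree with the value returned by `raw_data.count()`.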


We can also inspect the first few entries in our data with ```take()```.

In [5]:
raw_data.take(5)

[u'0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,219,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,39,39,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,49,49,1.00,0.00,0.02,0.00,0.00,0.00,0.00,0.00,normal.']
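
Each element is a single comma-separated string. As a rough sketch of how one such record could be broken into fields, the snippet below splits the first line shown above in plain Python. The variable names `protocol`, `service`, `flag`, and `label` are assumptions based on the usual KDD Cup 1999 column layout; the notebook itself does not name the columns.

```python
# First record copied from the take(5) output above.
line = ("0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,"
        "0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,"
        "0.00,0.00,0.00,0.00,0.00,normal.")

# Split the record into its individual fields.
fields = line.split(",")

# Assumed layout: columns 2 to 4 are categorical, the last one is the label.
protocol, service, flag = fields[1:4]
label = fields[-1]

print(protocol, service, flag, label)
```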

## 3. Creating an RDD using parallelize

Another way of creating an RDD is to parallelize an existing in-memory collection, such as a list or, as here, a NumPy array.

In [6]:
import numpy as np

In [7]:
a = np.arange(100)
data = sc.parallelize(a)

In [8]:
data.count()

100

In [9]:
data.take(5)

[0, 1, 2, 3, 4]