## RDD Introduction

Once a SparkContext is available, it can be used to build RDDs.

In [46]:
import platform
print(platform.python_version())
print(sc.pythonVer)
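# Compare major.minor only; slicing the string with [0:3] breaks for versions such as 3.10
same_vers = sc.pythonVer == ".".join(platform.python_version_tuple()[:2])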
print("Same python version for python and context: ", same_vers)

3.5.4
3.5
Same python version for python and context:  True


In [34]:
sc.getConf().getAll()

[('spark.driver.port', '50155'),
 ('spark.sql.catalogImplementation', 'hive'),
 ('spark.driver.host', '192.168.1.164'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.app.id', 'local-1527434635278'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.name', 'PySparkShell'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.master', 'local[2]')]

In [20]:
from pyspark import SparkContext

In [21]:
sc = SparkContext.getOrCreate()

In [48]:
sc.defaultMinPartitions

2

We can parallelize standard Python lists into an RDD.

In [136]:
data = sc.parallelize([("David",23),("Joan",50),("Ila",40),("David",30)])

In [137]:
type(data)

pyspark.rdd.RDD

In [138]:
data.getNumPartitions()

2

#### Typical functions

- `rdd.min()`
- `rdd.max()`
- `rdd.mean()`
- `rdd.stdev()`
- `rdd.variance()`
- `rdd.histogram(5)`
- `rdd.stats()`


In [139]:
data.countByValue()

defaultdict(int,
            {('David', 23): 1,
             ('David', 30): 1,
             ('Ila', 40): 1,
             ('Joan', 50): 1})

In [140]:
data.countByKey()

defaultdict(int, {'David': 2, 'Ila': 1, 'Joan': 1})

In [141]:
data.max()

('Joan', 50)

In [142]:
data.min()

('David', 23)

You cannot index an RDD!

In [143]:
data[1]

TypeError: 'RDD' object does not support indexing

In [144]:
data

ParallelCollectionRDD[179] at parallelize at PythonRDD.scala:175

In [145]:
data.count()

4

## Selecting and filtering data 

### `collect` an RDD into a Python object

The `collect` action brings every element of an RDD back to the driver as a Python list. Be careful with large datasets: data that fits across several worker machines may not fit in the driver's memory.

In [146]:
data_here = data.collect()

In [147]:
type(data_here)

list

In [148]:
data_here[2]

('Ila', 40)

In [149]:
data.take(2)

[('David', 23), ('Joan', 50)]

In [150]:
data.first()

('David', 23)

In [151]:
data.top(2)

[('Joan', 50), ('Ila', 40)]

In [152]:
data.filter(lambda x:23 in x).collect()

[('David', 23)]

In [153]:
# unique items
data.distinct().collect()

[('David', 23), ('David', 30), ('Joan', 50), ('Ila', 40)]

## Reshape, group, and aggregate data

### `reduce` method of an RDD

In [154]:
example = list(range(20))

In [155]:
rdd = sc.parallelize(example)

In [156]:
rdd.reduce(lambda a,b: a+b)

190

In [157]:
sum(rdd.collect())

190

### `groupBy` method

In [158]:
#rdd.groupBy(lambda x: x%3).mapValues(list).collect()

In [159]:
nonmult, mult = rdd.groupBy(lambda x: x%3==0).mapValues(list).collect()

In [160]:
nonmult

(False, [1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19])

In [163]:
mult

(True, [0, 3, 6, 9, 12, 15, 18])

### `groupByKey` method

In [164]:
data.groupByKey().mapValues(list).collect()

[('Joan', [50]), ('David', [23, 30]), ('Ila', [40])]

## Sort

### `sortBy`

In [178]:
rdd.sortBy(lambda x: x%2==0).collect()

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [174]:
rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]