# Unit 3: Programming with RDDs

## Contents
```
3.1 Before we begin: Passing functions to Spark
3.2 Transformations
3.3 Actions
3.4 Loading data from HDFS
3.5 Saving results back to HDFS
```

## Before we begin: Passing functions to Spark

Using lambda functions:

In [1]:
rdd1 = sc.parallelize(range(4))
rdd1.collect()

[0, 1, 2, 3]

In [2]:
rdd2 = rdd1.map(lambda x: 2*x)
rdd2.collect()

[0, 2, 4, 6]

Using regular functions defined with `def`:

In [3]:
def double(x):
    return 2*x

In [4]:
rdd3 = rdd1.map(double)
rdd3.collect()

[0, 2, 4, 6]

## Transformations

### map

In [5]:
rdd1 = sc.parallelize(range(4))
rdd1.collect()

[0, 1, 2, 3]

In [6]:
rdd2 = rdd1.map(lambda x: x + 5)
rdd2.collect()

[5, 6, 7, 8]

In [10]:
def plus_five(x):
    return x + 5

rdd1.map(plus_five).collect()

[5, 6, 7, 8]

### filter

In [11]:
rdd1 = sc.parallelize(['a1', 'a2', 'b1', 'b2'])
rdd1.collect()

['a1', 'a2', 'b1', 'b2']

In [14]:
rdd2 = rdd1.filter(lambda x: 'a' in x)
rdd2.collect()

['a1', 'a2']

### flatMap

In [16]:
rdd1 = sc.parallelize(['Space: the final frontier.',
                       'These are the voyages of the starship Enterprise.'])
rdd1.collect()

['Space: the final frontier.',
 'These are the voyages of the starship Enterprise.']

In [17]:
rdd2 = rdd1.map(lambda line: line.split())
rdd2.collect()

[['Space:', 'the', 'final', 'frontier.'],
 ['These', 'are', 'the', 'voyages', 'of', 'the', 'starship', 'Enterprise.']]

In [18]:
rdd3 = rdd1.flatMap(lambda line: line.split())
rdd3.collect()

['Space:',
 'the',
 'final',
 'frontier.',
 'These',
 'are',
 'the',
 'voyages',
 'of',
 'the',
 'starship',
 'Enterprise.']
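
The difference between `map` and `flatMap` can be mimicked on local lists in plain Python, without Spark. This is only an analogy, not how Spark executes these transformations, and the helper names `local_map` and `local_flat_map` are hypothetical:

```python
from itertools import chain

lines = ['Space: the final frontier.',
         'These are the voyages of the starship Enterprise.']

def local_map(func, items):
    """Apply func to every item: exactly one output element per input."""
    return [func(item) for item in items]

def local_flat_map(func, items):
    """Apply func to every item and flatten the resulting iterables."""
    return list(chain.from_iterable(func(item) for item in items))

nested = local_map(lambda line: line.split(), lines)
flat = local_flat_map(lambda line: line.split(), lines)

print(len(nested))  # 2: one list of words per input line
print(len(flat))    # 12: all the words in a single flat list
```

`map` keeps one result per input line, so the result is a list of lists, while `flatMap` concatenates the per-line results into a single collection of words.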

### distinct

In [19]:
rdd1 = sc.parallelize([1, 1, 1, 2, 2])
rdd1.collect()

[1, 1, 1, 2, 2]

In [20]:
rdd2 = rdd1.distinct()
rdd2.collect()

[2, 1]

## Actions

In [21]:
rdd1 = sc.parallelize([1, 1, 1, 2, 2])

### reduce

In [22]:
rdd1.reduce(lambda a, b: a + b)

7
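
Spark reduces each partition locally and then combines the partial results, so the function passed to `reduce` should be commutative and associative. The following plain-Python sketch emulates that behaviour. The helper `partitioned_reduce` and the two ways of splitting the data are assumptions made for illustration, not Spark's actual partitioning:

```python
from functools import reduce

def partitioned_reduce(func, partitions):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

# The same data, [1, 1, 1, 2, 2], split into partitions in two ways.
split_a = [[1, 1, 1], [2, 2]]
split_b = [[1], [1, 1, 2, 2]]

add = lambda a, b: a + b   # commutative and associative
sub = lambda a, b: a - b   # neither commutative nor associative

sum_a = partitioned_reduce(add, split_a)
sum_b = partitioned_reduce(add, split_b)
diff_a = partitioned_reduce(sub, split_a)
diff_b = partitioned_reduce(sub, split_b)

print(sum_a, sum_b)    # 7 7: addition does not depend on the split
print(diff_a, diff_b)  # -1 5: subtraction depends on the split
```

With addition the result is the same however the data is partitioned. With subtraction it changes with the partitioning, which is why such functions are unsuitable for `reduce`.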

### count

In [25]:
rdd1.count()

5

### collect

In [27]:
rdd1.collect()

[1, 1, 1, 2, 2]

### first

In [28]:
rdd1.first()

1

### take

In [33]:
rdd1.take(4)

[1, 1, 1, 2]

### takeSample

In [37]:
rdd1.takeSample(withReplacement=False, num=10)

[2, 2, 1, 1, 1]

In [39]:
rdd1.takeSample(withReplacement=True, num=10)

[2, 2, 1, 1, 2, 1, 1, 2, 1, 1]
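
Note that asking for 10 elements without replacement returned only 5, because each element of the RDD can be drawn at most once, whereas sampling with replacement can return any number of elements. A plain-Python analogy using the standard `random` module (not Spark's sampling algorithm) shows the same behaviour:

```python
import random

data = [1, 1, 1, 2, 2]
num = 10
rng = random.Random(42)  # fixed seed, only to make the run repeatable

# Without replacement: each element can be drawn at most once,
# so the sample size is capped at len(data).
without_replacement = rng.sample(data, min(num, len(data)))

# With replacement: elements may be drawn repeatedly,
# so any sample size is possible.
with_replacement = rng.choices(data, k=num)

print(len(without_replacement))  # 5
print(len(with_replacement))     # 10
```

`takeSample` also accepts an optional `seed` argument for making a sample repeatable.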

## Loading data from HDFS

### textFile

In [40]:
rdd = sc.textFile('datasets/meteogalicia.txt')

In [44]:
rdd.take(5)

[u'',
 u'',
 u'ESTACI\ufffdN AUTOM\ufffdTICA:Santiago-EOAS',
 u'CONCELLO:Santiago de Compostela',
 u'PROVINCIA:A Coru\ufffda']

In [51]:
rdd.takeSample(withReplacement=False, num=5)

[u'      1          2017-06-25 19:50:00    Velocidade do Vento (km/h)                 2,99',
 u'      1          2017-06-19 13:40:00    Chuvia (L/m2)                             0',
 u'      1          2017-06-12 00:30:00    Chuvia (L/m2)                             0',
 u'      1          2017-06-05 12:30:00    Temperatura media (\ufffdC)                    15,92',
 u'      1          2017-06-10 07:20:00    Visibilidade (m)                          19851']

Several files can be loaded at once using a glob pattern, but **be careful with the number of partitions generated**, since Spark creates at least one partition per file:

In [52]:
rdd1 = sc.textFile('datasets/slurmd/slurmd.log.*')

In [53]:
print(rdd1.toDebugString())

(10) datasets/slurmd/slurmd.log.* MapPartitionsRDD[58] at textFile at NativeMethodAccessorImpl.java:0 []
 |   datasets/slurmd/slurmd.log.* HadoopRDD[57] at textFile at NativeMethodAccessorImpl.java:0 []


In [54]:
rdd1.takeSample(withReplacement=False, num=5)

[u'1493976072 2017 May  5 11:21:12 c6606 daemon info slurmd _run_prolog: run job script took usec=29474',
 u'1496661357 2017 Jun  5 13:15:57 c6601 user info slurmstepd task/cgroup: /slurm/uid_12619/job_762985: alloc=5120MB mem.limit=5120MB memsw.limit=unlimited',
 u'1496045723 2017 May 29 10:15:23 c6601 user err slurmstepd error: gres/mic unable to set OFFLOAD_DEVICES, no device files configured',
 u'1489799048 2017 Mar 18 02:04:08 c6604 daemon info slurmd _run_prolog: run job script took usec=22594',
 u'1492933702 2017 Apr 23 09:48:22 c6610 daemon err slurmd error: gres/mic unable to set OFFLOAD_DEVICES, no device files configured']

## Saving results back to HDFS

In [55]:
rdd.saveAsTextFile('results_directory')

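`saveAsTextFile` creates the directory `results_directory` and writes a separate file into it for each partition of the RDD. The call fails if the directory already exists.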
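
The one-file-per-partition layout can be emulated in plain Python. This is a minimal sketch that writes to a local temporary directory instead of HDFS. The helper `save_partitions_as_text` and the way the data is split into three partitions are assumptions for illustration, and the `part-00000` style of file names follows the convention Spark typically uses:

```python
import os
import tempfile

def save_partitions_as_text(partitions, out_dir):
    """Write each partition to its own part-NNNNN file inside out_dir."""
    os.makedirs(out_dir)  # raises an error if out_dir already exists
    for index, partition in enumerate(partitions):
        path = os.path.join(out_dir, 'part-%05d' % index)
        with open(path, 'w') as handle:
            for record in partition:
                handle.write('%s\n' % record)
    return sorted(os.listdir(out_dir))

# Hypothetical split of [1, 1, 1, 2, 2] into three partitions.
partitions = [[1, 1], [1, 2], [2]]
out_dir = os.path.join(tempfile.mkdtemp(), 'results_directory')
files = save_partitions_as_text(partitions, out_dir)

print(files)  # ['part-00000', 'part-00001', 'part-00002']
```

Reading the results back therefore means reading the whole directory rather than a single file.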

## Exercises
Now apply the concepts above to solve the following problem:
* Unit 3 Working with meteorological data 1