# Based on Drabas & Lee  -- Learning PySpark

## Resilient Distributed Datasets

### Creating RDDs

#### Start the Jupyter notebook from its own folder; otherwise Python might not find some of the files to load!


In [1]:
# set the kernel to python 2 or Python [default]!

In [1]:
sc

In [2]:
# you only need to run this cell if the above spark context is not available when you start the notebook

if 0:
    import findspark
    findspark.init()
    import pyspark
    from pyspark import SparkContext
    sc = SparkContext()

There are two ways to create an RDD in PySpark. 

1) You can **parallelize a list** with **sc.parallelize()**

In [3]:
data = sc.parallelize([('Amber', 22), ('Alfred', 23), ('Skye',4), ('Albert', 12), ('Amber', 9)])

In [4]:
print(data)

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195


or 2) **read from a repository** (a file or a database). 
We will discuss it below.

#### Schema

RDDs are schema-less data structures (unlike DataFrames, which we will discuss later). 

Thus we can **mix almost everything in an RDD**: tuples, dicts, lists, etc.

In [9]:
data_heterogenous = sc.parallelize([('Ferrari', 'fast'), {'Porsche': 100000}, ['Spain','visited', 4504]])
print(data_heterogenous)

ParallelCollectionRDD[3] at parallelize at PythonRDD.scala:195


The **.collect()** method returns all the elements of the RDD to the driver, where they are serialized as a list.

In [10]:
data_heterogenous_collected = data_heterogenous.collect() 
print(data_heterogenous_collected)

[('Ferrari', 'fast'), {'Porsche': 100000}, ['Spain', 'visited', 4504]]


After collect, you can access the data in the object as you would normally do in Python.

In [11]:
data_heterogenous_collected[1]['Porsche']

100000

In [12]:
data_heterogenous_collected[2]

['Spain', 'visited', 4504]

In [13]:
data_heterogenous_collected[0]

('Ferrari', 'fast')

#### Reading from files

Note that to execute the code below you will have to change the path where the data is stored. The dataset can be downloaded from http://tomdrabas.com/data/VS14MORT.txt.gz

In [14]:
!pwd

/home/ec2-user/Projects/ScalableML/PySparkDemos


In [15]:
#!wget http://tomdrabas.com/data/VS14MORT.txt.gz

In [19]:
# Fix hdfs if files are corrupt...
#
#!hdfs fsck -list-corruptfileblocks / 
#!hdfs dfsadmin -safemode leave
#!hdfs dfs -rm /hdfs_data/VS14MORT.txt.gz
#!hdfs dfs -rm /hdfs_data/*
#!hdfs dfs -rm -r /user/ec2-user/data_key*
#!hdfs fsck /
#!hdfs fsck / -delete
#!hdfs fsck /hdfs_data/ -delete

In [16]:
# Fix hdfs if the files are corrupt (e.g. having missing blocks)...

if 0:
    #!hdfs fsck -list-corruptfileblocks / 
    !hdfs dfsadmin -safemode leave
    !hdfs dfs -rm /hdfs_data/*
    !hdfs dfs -rm -r /user/ec2-user/data_key*
    !hdfs fsck / -delete

In [17]:
if 0:
    !hdfs dfs -mkdir -p /hdfs_data
    !hdfs dfs -ls /hdfs_data
    !hdfs dfs -put data/VS14MORT.txt.gz /hdfs_data
    !hdfs fsck /hdfs_data/VS14MORT.txt.gz

In [18]:
if 0:
    data_from_file_long = sc.textFile('data/VS14MORT.txt.gz', 4)
else:
    data_from_file_long = sc.textFile('/hdfs_data/VS14MORT.txt.gz', 4)
print(data_from_file_long)


/hdfs_data/VS14MORT.txt.gz MapPartitionsRDD[5] at textFile at NativeMethodAccessorImpl.java:0


The last parameter in **sc.textFile(..., n)** specifies the **minimum number of partitions** the dataset is divided into.

Spark can read from a multitude of filesystems: local ones such as NTFS, FAT, or Mac OS Extended (HFS+), distributed filesystems such as HDFS or S3, and data stores such as Cassandra, among many others.

Note that Spark can automatically work with compressed datasets (like the Gzipped one in our preceding example).

Depending on how the data is read, the object holding it will be represented slightly differently. The **data read from a file** is represented as **MapPartitionsRDD** instead of the **ParallelCollectionRDD** we get when we **.parallelize(...)** a collection.

We can use **.sample()** to sample from an RDD and make it smaller.

In [20]:
#to make the computations quicker in this demo, let us make the RDD smaller ...
data_from_file=data_from_file_long.sample(False, 0.0001,345)
data_from_file.count()

259

When you read from a text file, each row from the file forms an element of an RDD.

In [21]:
data_from_file.take(1)

[u'                   2                                        00 002  F4001 030101031S7                2014U7UN                                    R571380 11013636 0111R571                                                                                                                                                                          01 R571                                                                                                 01  11                                 100 601']

In [22]:
data_from_file.take(2)

[u'                   2                                        00 002  F4001 030101031S7                2014U7UN                                    R571380 11013636 0111R571                                                                                                                                                                          01 R571                                                                                                 01  11                                 100 601',
 u'                   1                                        06 006  F1065 391909  2M5                2014U7UN                                    E780173 111   37 0311I469 21I519 31E780                                                                                                                                                            03 E780 I469 I519                                                                                       02  32                                 100 702']

To **make it more readable**, let's create a list of elements so each line is represented as a list of values.

#### User defined functions

First, let's define the method with the help of the following code, which will parse the unreadable row into something that we can use:

In [23]:
def extractInformation(row):
    import re
    import numpy as np

    selected_indices = [
         2,4,5,6,7,9,10,11,12,13,14,15,16,17,18,
         19,21,22,23,24,25,27,28,29,30,32,33,34,
         36,37,38,39,40,41,42,43,44,45,46,47,48,
         49,50,51,52,53,54,55,56,58,60,61,62,63,
         64,65,66,67,68,69,70,71,72,73,74,75,76,
         77,78,79,81,82,83,84,85,87,89
    ]

    '''
        Input record schema
        schema: n-m (o) -- xxx
            n - position from
            m - position to
            o - number of characters
            xxx - description
        1. 1-19 (19) -- reserved positions
        2. 20 (1) -- resident status
        3. 21-60 (40) -- reserved positions
        4. 61-62 (2) -- education code (1989 revision)
        5. 63 (1) -- education code (2003 revision)
        6. 64 (1) -- education reporting flag
        7. 65-66 (2) -- month of death
        8. 67-68 (2) -- reserved positions
        9. 69 (1) -- sex
        10. 70 (1) -- age: 1-years, 2-months, 4-days, 5-hours, 6-minutes, 9-not stated
        11. 71-73 (3) -- number of units (years, months etc)
        12. 74 (1) -- age substitution flag (if the age reported in positions 70-74 is calculated using dates of birth and death)
        13. 75-76 (2) -- age recoded into 52 categories
        14. 77-78 (2) -- age recoded into 27 categories
        15. 79-80 (2) -- age recoded into 12 categories
        16. 81-82 (2) -- infant age recoded into 22 categories
        17. 83 (1) -- place of death
        18. 84 (1) -- marital status
        19. 85 (1) -- day of the week of death
        20. 86-101 (16) -- reserved positions
        21. 102-105 (4) -- current year
        22. 106 (1) -- injury at work
        23. 107 (1) -- manner of death
        24. 108 (1) -- manner of disposition
        25. 109 (1) -- autopsy
        26. 110-143 (34) -- reserved positions
        27. 144 (1) -- activity code
        28. 145 (1) -- place of injury
        29. 146-149 (4) -- ICD code
        30. 150-152 (3) -- 358 cause recode
        31. 153 (1) -- reserved position
        32. 154-156 (3) -- 113 cause recode
        33. 157-159 (3) -- 130 infant cause recode
        34. 160-161 (2) -- 39 cause recode
        35. 162 (1) -- reserved position
        36. 163-164 (2) -- number of entity-axis conditions
        37-56. 165-304 (140) -- list of up to 20 conditions
        57. 305-340 (36) -- reserved positions
        58. 341-342 (2) -- number of record axis conditions
        59. 343 (1) -- reserved position
        60-79. 344-443 (100) -- record axis conditions
        80. 444 (1) -- reserved position
        81. 445-446 (2) -- race
        82. 447 (1) -- bridged race flag
        83. 448 (1) -- race imputation flag
        84. 449 (1) -- race recode (3 categories)
        85. 450 (1) -- race recode (5 categories)
        86. 451-483 (33) -- reserved positions
        87. 484-486 (3) -- Hispanic origin
        88. 487 (1) -- reserved
        89. 488 (1) -- Hispanic origin/race recode
     '''

    record_split = re\
        .compile(
            r'([\s]{19})([0-9]{1})([\s]{40})([0-9\s]{2})([0-9\s]{1})([0-9]{1})([0-9]{2})' + 
            r'([\s]{2})([FM]{1})([0-9]{1})([0-9]{3})([0-9\s]{1})([0-9]{2})([0-9]{2})' + 
            r'([0-9]{2})([0-9\s]{2})([0-9]{1})([SMWDU]{1})([0-9]{1})([\s]{16})([0-9]{4})' +
            r'([YNU]{1})([0-9\s]{1})([BCOU]{1})([YNU]{1})([\s]{34})([0-9\s]{1})([0-9\s]{1})' +
            r'([A-Z0-9\s]{4})([0-9]{3})([\s]{1})([0-9\s]{3})([0-9\s]{3})([0-9\s]{2})([\s]{1})' + 
            r'([0-9\s]{2})([A-Z0-9\s]{7})([A-Z0-9\s]{7})([A-Z0-9\s]{7})([A-Z0-9\s]{7})' + 
            r'([A-Z0-9\s]{7})([A-Z0-9\s]{7})([A-Z0-9\s]{7})([A-Z0-9\s]{7})([A-Z0-9\s]{7})' + 
            r'([A-Z0-9\s]{7})([A-Z0-9\s]{7})([A-Z0-9\s]{7})([A-Z0-9\s]{7})([A-Z0-9\s]{7})' + 
            r'([A-Z0-9\s]{7})([A-Z0-9\s]{7})([A-Z0-9\s]{7})([A-Z0-9\s]{7})([A-Z0-9\s]{7})' + 
            r'([A-Z0-9\s]{7})([\s]{36})([A-Z0-9\s]{2})([\s]{1})([A-Z0-9\s]{5})([A-Z0-9\s]{5})' + 
            r'([A-Z0-9\s]{5})([A-Z0-9\s]{5})([A-Z0-9\s]{5})([A-Z0-9\s]{5})([A-Z0-9\s]{5})' + 
            r'([A-Z0-9\s]{5})([A-Z0-9\s]{5})([A-Z0-9\s]{5})([A-Z0-9\s]{5})([A-Z0-9\s]{5})' + 
            r'([A-Z0-9\s]{5})([A-Z0-9\s]{5})([A-Z0-9\s]{5})([A-Z0-9\s]{5})([A-Z0-9\s]{5})' + 
            r'([A-Z0-9\s]{5})([A-Z0-9\s]{5})([A-Z0-9\s]{5})([\s]{1})([0-9\s]{2})([0-9\s]{1})' + 
            r'([0-9\s]{1})([0-9\s]{1})([0-9\s]{1})([\s]{33})([0-9\s]{3})([0-9\s]{1})([0-9\s]{1})')
    
    # Parsing starts. When parsing fails, we fill the row with '-99' to indicate that it could not be parsed.
    try:
        rs = np.array(record_split.split(row))[selected_indices]
    except Exception:
        rs = np.array(['-99'] * len(selected_indices))
    return rs
#     return record_split.split(row)

Once the record is parsed, we try to **convert the list into a NumPy array** and return only the selected fields; 
**if this fails we return an array of default values '-99'** so we know this record did not parse properly.
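
Because the record is fixed-width, the positions listed in the schema can also be used directly with string slicing. The following is a minimal sketch and not the method used in this notebook: `FIELDS`, `slice_record`, and the synthetic `record` are hypothetical names, built from only four entries of the schema, with the 1-based inclusive positions converted to Python slices.

```python
# Hypothetical subset of the schema: name -> (first position, last position),
# 1-based and inclusive, as listed in the docstring of extractInformation.
FIELDS = {
    'resident_status': (20, 20),
    'month_of_death': (65, 66),
    'sex': (69, 69),
    'current_year': (102, 105),
}


def slice_record(row, fields=FIELDS):
    """Return a dict of field name -> substring, or None if the row is too short."""
    needed = max(last for _, last in fields.values())
    if len(row) < needed:
        return None
    # A 1-based inclusive range (first, last) is the Python slice [first - 1:last].
    return {name: row[first - 1:last] for name, (first, last) in fields.items()}


# Synthetic record for illustration only (not taken from the dataset):
# blanks everywhere except the four fields defined above.
chars = [' '] * 488
chars[19] = '2'            # position 20
chars[64:66] = '02'        # positions 65-66
chars[68] = 'F'            # position 69
chars[101:105] = '2014'    # positions 102-105
record = ''.join(chars)

parsed = slice_record(record)
print(parsed)
```

Slicing avoids the long regular expression, but it performs no validation of the characters in each field, so it is a trade-off rather than a drop-in replacement.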

Note: Defining pure Python methods can slow down your application because Spark constantly needs to switch between the Python interpreter and the JVM. Whenever possible, you should use built-in Spark functions.

Now, we will use the `extractInformation(...)` method to split and convert our dataset.
Note that **we pass only the function itself (its name, without calling it) to .map(...)**: Spark will hand over **one element of the RDD to the extractInformation(...) method at a time in each partition**:

In [24]:
# it uses lazy evaluation, so this is quick: nothing is computed yet
data_from_file_converted = data_from_file.map(extractInformation)

In [25]:
data_from_file_converted

PythonRDD[10] at RDD at PythonRDD.scala:53

In [26]:
data_from_file_converted.take(1)

[array([u'2', u'00', u' ', u'0', u'02', u'F', u'4', u'001', u' ', u'03',
        u'01', u'01', u'03', u'1', u'S', u'7', u'2014', u'U', u'7', u'U',
        u'N', u' ', u' ', u'R571', u'380', u'110', u'136', u'36', u'01',
        u'11R571 ', u'       ', u'       ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ', u'01',
        u'R571 ', u'     ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'01', u' ', u' ', u'1', u'1', u'100', u'6'], 
       dtype='<U40')]

### Passing  variables to the workers

When a job is submitted for execution, the job is sent to the driver (or master) node. 

The driver node creates a DAG for the job and decides which executor (or worker) nodes will run the specific tasks.

The driver then instructs the workers to execute their tasks and return the results  to the driver when done. 

Each executor gets a copy of the variables and methods from the driver. If, when running the task, the executor alters these variables or overwrites the methods, **it does so without affecting either other executors' copies** or the variables and methods of the driver.
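
The copy semantics described above can be imitated without Spark. This is only a plain-Python analogy, assuming that shipping a variable to an executor behaves like handing the task a deep copy; `driver_state` and `run_task` are hypothetical names and no Spark API is involved.

```python
import copy

# Variables that live on the (simulated) driver.
driver_state = {'threshold': 10, 'seen': []}


def run_task(state, partition):
    """Simulated task: mutates its own copy of the state and returns a result."""
    state['threshold'] += 1              # visible only inside this task
    state['seen'].extend(partition)
    return [x for x in partition if x > state['threshold']]


partitions = [[5, 12, 30], [11, 8, 40]]

# Each simulated executor receives its own deep copy of the driver's variables.
results = [run_task(copy.deepcopy(driver_state), part) for part in partitions]

print(results)        # per-task results sent back to the driver
print(driver_state)   # unchanged by the tasks
```

Both tasks start from the same threshold of 10, and the driver's dictionary is untouched afterwards, which mirrors why mutating a plain variable inside a task is not a way to send information back to the driver.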

### Transformations

#### .map(...)

The method is **applied to each element of the RDD**: in the case of the `data_from_file_converted` dataset you can think of this as a transformation of each row.

In [28]:
# select the column at index 16 (i.e. the 17th element in each row, the year) and convert it to int
data_2014 = data_from_file_converted.map(lambda row: int(row[16]))
data_2014.take(10)

[2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014]

You can combine more columns.

In [31]:
# select a couple of columns
# Since the evaluation is lazy, this is fast
data_2014_2 = data_from_file_converted.map(lambda row: (row[16], int(row[16]), row[5], row[21]))

In [32]:
# if we need the values, then all the computation is actually done, so it might take some time:
data_2014_2.take(10)

[(u'2014', 2014, u'F', u' '),
 (u'2014', 2014, u'F', u' '),
 (u'2014', 2014, u'F', u' '),
 (u'2014', 2014, u'M', u' '),
 (u'2014', 2014, u'M', u' '),
 (u'2014', 2014, u'M', u' '),
 (u'2014', 2014, u'F', u' '),
 (u'2014', 2014, u'M', u' '),
 (u'2014', 2014, u'M', u' '),
 (u'2014', 2014, u'M', u' ')]

#### .filter(...)

The **`.filter(...)`** method allows you to **select elements of your dataset that fit specified criteria**.

In [33]:
data_filtered = data_from_file_converted.filter(lambda row: row[5] == 'F' and row[21] == '9')

The filter itself is lazy, so the cell above returns immediately; the actions below (such as `.count()`) might take a while, depending on how fast your computer is. 

In [35]:
data_filtered

PythonRDD[17] at RDD at PythonRDD.scala:53

In [36]:
data_filtered.count()

5

In [37]:
data_filtered.take(1)

[array([u'1', u'  ', u'2', u'1', u'03', u'F', u'1', u'037', u' ', u'33',
        u'13', u'06', u'  ', u'2', u'D', u'3', u'2014', u'N', u'1', u'B',
        u'Y', u'9', u'0', u'X44 ', u'420', u'122', u'   ', u'39', u'03',
        u'11T509 ', u'12X44  ', u'61T509 ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ', u'02',
        u'X44  ', u'T509 ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'01', u' ', u' ', u'1', u'1', u'100', u'6'], 
       dtype='<U40')]

#### .flatMap(...)

The **`.flatMap(...)` method works similarly to `.map(...)` but returns flattened results: every item of the tuple or list returned for a row becomes a separate element of the new RDD.** 

In [39]:
data_filtered_flat = data_filtered.flatMap(lambda row: (row[16], int(row[16]) + 1))

In [40]:
data_filtered_flat.count()

10

In [41]:
data_filtered_flat.take(8)

[u'2014', 2015, u'2014', 2015, u'2014', 2015, u'2014', 2015]

The results are flattened.
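
The difference between mapping and flat-mapping can be reproduced on an ordinary Python list. This is a small sketch using `itertools.chain.from_iterable` as a stand-in for the flattening step; `rows` and `year_pair` are illustrative names, not part of the notebook.

```python
from itertools import chain

rows = ['2014', '2014', '2014']


def year_pair(row):
    """Return the year as a string together with the following year as an int."""
    return (row, int(row) + 1)


# map: one output element (a tuple) per input row
mapped = list(map(year_pair, rows))

# flatMap: the tuples are unpacked, so each input row contributes two elements
flat_mapped = list(chain.from_iterable(map(year_pair, rows)))

print(mapped)
print(flat_mapped)
```

Three input rows give three tuples with the map and six separate values with the flat map, the same doubling seen above where 5 filtered rows produced a count of 10.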

#### .distinct()

This method returns an **RDD of the distinct elements**; combined with `.map(...)`, it gives the distinct values in a specified column.
It shuffles the data around, so it might take a while.

In [42]:
data_from_file_converted.count()

259

In [43]:
# Here we will find the distinct values of column 5. 
distinct_gender = data_from_file_converted.map(lambda row: row[5]).distinct().collect()
distinct_gender

['-99', u'M', u'F']

#### .sample(...)

The `.sample()` method returns a randomized sample from the dataset.

In [44]:
fraction = 0.1
# arguments: with replacement? (False), fraction of the dataset to sample, random seed (605)
data_sample = data_from_file_converted.sample(False, fraction, 605)

data_sample.take(1)

[array([u'1', u'06', u' ', u'0', u'06', u'F', u'1', u'065', u' ', u'39',
        u'19', u'09', u'  ', u'2', u'M', u'5', u'2014', u'U', u'7', u'U',
        u'N', u' ', u' ', u'E780', u'173', u'111', u'   ', u'37', u'03',
        u'11I469 ', u'21I519 ', u'31E780 ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ', u'03',
        u'E780 ', u'I469 ', u'I519 ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'02', u' ', u' ', u'3', u'2', u'100', u'7'], 
       dtype='<U40')]

Let's confirm that we really got around 10% of all the records.

In [45]:
print('Original dataset: {0}, sample: {1}'.format(data_from_file_converted.count(), data_sample.count()))

Original dataset: 259, sample: 18
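
The sample size is only approximately 10% because sampling without replacement keeps each element independently with probability `fraction`. The sketch below imitates that idea in plain Python; `bernoulli_sample` is a hypothetical helper, and since Spark uses its own random number generator the counts will not match the output above.

```python
import random


def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]


population = list(range(259))

# The size of the sample varies from seed to seed around fraction * len(population).
sizes = [len(bernoulli_sample(population, 0.1, seed)) for seed in range(5)]
print(sizes)
```

If an exact number of records is required, `.takeSample(...)`, described in the Actions section, takes the number of records instead of a fraction.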


#### .leftOuterJoin(...)

Left outer join, just like in the SQL world, **joins two key-value RDDs on their keys**, and returns the records from the left RDD with the records from the right one appended where the keys match (and `None` where there is no match).

In [46]:
rdd1 = sc.parallelize([('a', 1), ('b', 4), (8,'c')])
rdd2 = sc.parallelize([('a', 4), ('a', 2), ('b', '6'), ('a',1), ('d', 15)])

rdd3 = rdd1.leftOuterJoin(rdd2)
rdd3.collect()

[('a', (1, 4)),
 ('a', (1, 2)),
 ('a', (1, 1)),
 ('b', (4, '6')),
 (8, ('c', None))]

`'d'` is missing because it appears only in the right RDD, and a left outer join keeps only the keys of the left RDD.

If we used `.join(...)` method instead we would have gotten only the values for `'a'` and `'b'` as these two values intersect between these two RDDs.

In [47]:
rdd4 = rdd1.join(rdd2)
rdd4.collect()

[('a', (1, 2)), ('a', (1, 4)), ('a', (1, 1)), ('b', (4, '6'))]

Another useful method is the `.intersection(...)` that returns the records that are *equal* in both RDDs.

In [48]:
rdd5 = rdd1.intersection(rdd2)
rdd5.collect()

[('a', 1)]
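
The three operations can be mimicked on plain Python lists to make their semantics explicit. This is a sketch only: `join_pairs` is a hypothetical helper, and unlike Spark it returns results in a fixed order, whereas the order of elements returned by an RDD join is not guaranteed.

```python
from collections import defaultdict

left = [('a', 1), ('b', 4), (8, 'c')]
right = [('a', 4), ('a', 2), ('b', '6'), ('a', 1), ('d', 15)]


def join_pairs(left_pairs, right_pairs, keep_unmatched):
    """Join two lists of (key, value) pairs on the key.

    keep_unmatched=True imitates a left outer join,
    keep_unmatched=False imitates an inner join.
    """
    right_by_key = defaultdict(list)
    for key, value in right_pairs:
        right_by_key[key].append(value)

    joined = []
    for key, value in left_pairs:
        matches = right_by_key.get(key)
        if matches:
            joined.extend((key, (value, match)) for match in matches)
        elif keep_unmatched:
            joined.append((key, (value, None)))
    return joined


left_outer = join_pairs(left, right, keep_unmatched=True)
inner = join_pairs(left, right, keep_unmatched=False)
# intersection compares whole records, not just keys
common = list(set(left) & set(right))

print(left_outer)
print(inner)
print(common)
```

Note that the join operations compare keys only, while the intersection requires the whole `(key, value)` record to be equal, which is why only `('a', 1)` survives.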

#### .repartition(...)

Repartitioning the dataset **changes the number of partitions the dataset is divided into**.

This functionality should be used sparingly and only when really necessary as **it shuffles the data around, which in effect results in a significant hit in terms of performance**:

In [49]:
rdd1 = rdd1.repartition(6)

len(rdd1.glom().collect())

6

In [51]:
rdd1.glom()

PythonRDD[64] at RDD at PythonRDD.scala:53

In [50]:
rdd1.glom().collect()

[[], [('b', 4)], [], [], [(8, 'c'), ('a', 1)], []]

The **.glom()** method, in contrast to .collect(), **produces a list where each element 
is another list of all elements of the dataset present in a specified partition**; the main 
list returned has as many elements as the number of partitions.

### Actions

**Actions, in contrast to transformations, execute the scheduled task on the 
dataset**; once you have finished defining your transformations, an action triggers their execution. 

The scheduled task might contain no transformations (for example, .take(n) will 
just return n records from an RDD even if you did not apply any transformations to it) 
or it might **execute the whole chain of transformations**.
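
Python's own lazy iterators give a rough feel for this split between defining work and executing it. The following is an analogy only, not Spark code: `map` and `filter` play the role of transformations, `islice` plus `list` plays the role of an action such as `.take(2)`, and `calls` and `parse` are illustrative names.

```python
from itertools import islice

calls = []


def parse(value):
    """Record that the function really ran, then transform the value."""
    calls.append(value)
    return value * 2


data = [1, 2, 3, 4, 5]

# "Transformations": these only build a lazy pipeline, nothing is computed yet.
pipeline = map(parse, data)
pipeline = filter(lambda x: x > 2, pipeline)
calls_before_action = len(calls)

# "Action": pulling two results forces just enough of the pipeline to run.
first_two = list(islice(pipeline, 2))

print(calls_before_action)
print(first_two)
print(calls)
```

Only the first three inputs are ever parsed, because the action stops as soon as it has the two results it asked for.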

#### .take(...)

The method **returns the first `n` elements of the RDD**, scanning only as many partitions as it needs (often just the first one).

The method is **preferred to .collect(...) as it returns only those `n` elements, in contrast to .collect(...), which returns the whole RDD** to the driver. 
This is especially important when you deal with large datasets:

In [52]:
data_first = data_from_file_converted.take(1)
data_first

[array([u'2', u'00', u' ', u'0', u'02', u'F', u'4', u'001', u' ', u'03',
        u'01', u'01', u'03', u'1', u'S', u'7', u'2014', u'U', u'7', u'U',
        u'N', u' ', u' ', u'R571', u'380', u'110', u'136', u'36', u'01',
        u'11R571 ', u'       ', u'       ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ', u'01',
        u'R571 ', u'     ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'01', u' ', u' ', u'1', u'1', u'100', u'6'], 
       dtype='<U40')]

If you want somewhat **randomized records you can use .takeSample(...)** 
instead, which takes three arguments: 
* the first specifies whether the sampling should be with replacement, 
* the second specifies the number of records to return, 
* and the third is a seed for the pseudo-random number generator:

In [53]:
data_from_file_converted.count()

259

In [54]:
data_take_sampled = data_from_file_converted.takeSample(False, 2, 667)
data_take_sampled

[array([u'1', u'  ', u'2', u'1', u'10', u'F', u'1', u'064', u' ', u'38',
        u'18', u'08', u'  ', u'4', u'M', u'6', u'2014', u'U', u'7', u'C',
        u'N', u' ', u' ', u'C349', u'093', u'027', u'   ', u'08', u'02',
        u'11C349 ', u'21F179 ', u'       ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ',
        u'       ', u'       ', u'       ', u'       ', u'       ', u'02',
        u'C349 ', u'F179 ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'     ', u'     ', u'     ', u'     ',
        u'     ', u'     ', u'01', u' ', u' ', u'1', u'1', u'100', u'6'], 
       dtype='<U40'),
 array([u'1', u'12', u' ', u'0', u'07', u'M', u'1', u'055', u' ', u'37',
        u'17', u'08', u'  ', u'4', u'M', u'1', u'2014', u'U', u'7', u'U',
        u'Y', u' ', u' ', u'I250', u'214', u'062', u'

#### .collect(...)

This method **returns all the elements of the RDD** to the driver

In [55]:
rdd1.collect()

[('b', 4), ('a', 1), (8, 'c')]

#### .reduce(...)

Another action that processes your data, the **`.reduce(...)` method *reduces* the elements of an RDD using a specified method**.

In [56]:
rdd1 = sc.parallelize([('a', 1), ('b', 4), ('c',8)])

rdd1.map(lambda row: row[1]).reduce(lambda x, y: x + y)

13

A word of caution is necessary here. **The functions passed as a reducer 
need to be associative**, that is, changing how the elements are grouped does not change the 
result, **and commutative**, that is, changing the order of operands 
does not change the result either.
An example of the associativity rule is **(5 + 2) + 3 = 5 + (2 + 3)**, and of the 
commutativity rule is **5 + 2 = 2 + 5**. Thus, you need to be careful about 
what functions you pass to the reducer.

**If the reducing function is not associative and commutative, you will sometimes get wrong results**, depending on how your data is partitioned.

In [61]:
# using only 1 partition!
data_reduce = sc.parallelize([1.0, 2.0, .5, .1, 5, .2], 1)

In [62]:
data_reduce.collect()

[1.0, 2.0, 0.5, 0.1, 5, 0.2]

If we were to reduce the data by *dividing* the current result by the subsequent element, we would expect a value of 10.

In [63]:
works = data_reduce.reduce(lambda x, y: x / y)
works

10.0

However, if you were to **partition the data into 3 partitions, the result will be wrong**.

In [64]:
data_reduce = sc.parallelize([1.0, 2.0, .5, .1, 5, .2], 3)
data_reduce.reduce(lambda x, y: x / y)

0.004
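
Where the value 0.004 comes from can be traced by hand with a plain-Python imitation of a partitioned reduce. This sketch assumes the list is cut into contiguous, equally sized slices and that the partial results are merged in partition order; `split_into_partitions` and `partitioned_reduce` are hypothetical helpers, and Spark itself does not guarantee the merge order.

```python
from functools import reduce


def split_into_partitions(items, num_partitions):
    """Split a list into contiguous, roughly equal slices."""
    n = len(items)
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


def partitioned_reduce(func, items, num_partitions):
    """Reduce each partition on its own, then reduce the partial results."""
    partitions = split_into_partitions(items, num_partitions)
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


def divide(x, y):
    return x / y


def add(x, y):
    return x + y


data = [1.0, 2.0, .5, .1, 5, .2]

# Division is neither associative nor commutative: the answer depends on the partitioning.
div_one = partitioned_reduce(divide, data, 1)
div_three = partitioned_reduce(divide, data, 3)

# Addition is both, so the partitioning does not matter (up to floating-point rounding).
add_one = partitioned_reduce(add, data, 1)
add_three = partitioned_reduce(add, data, 3)

print(div_one, div_three)
print(add_one, add_three)
```

With three partitions the partial results are 1.0 / 2.0, 0.5 / 0.1 and 5 / 0.2, that is roughly 0.5, 5 and 25, and dividing those in turn gives about 0.004 instead of 10.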

The **`.reduceByKey(...)` method works in a similar way to the `.reduce(...)` method but performs a reduction on a key-by-key basis**.

In [65]:
data_key = sc.parallelize([('a', 4),('b', 3),('c', 2),('a', 8),('d', 2),('b', 1),('d', 3)],4)
data_key.collect()

[('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2), ('b', 1), ('d', 3)]

In [66]:
data_key.reduceByKey(lambda x, y: x + y).collect()

[('a', 12), ('d', 5), ('c', 2), ('b', 4)]
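
The key-by-key reduction can be written out in plain Python to show what happens for each key. This is a sketch with a hypothetical `reduce_by_key` helper that works on a local list; the real method performs the same kind of merging within and then across partitions.

```python
def reduce_by_key(func, pairs):
    """Combine the values of equal keys with `func`, one pair at a time."""
    totals = {}
    for key, value in pairs:
        # The first value seen for a key is stored as-is; later ones are merged in.
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


pairs = [('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2), ('b', 1), ('d', 3)]

summed = reduce_by_key(lambda x, y: x + y, pairs)
print(sorted(summed.items()))
```

As with `.reduce(...)`, the function passed in should be associative and commutative, because the order in which values of the same key are merged is not fixed in a distributed setting.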

#### .count()

The **`.count()` method counts the number of elements in the RDD**.

In [67]:
data_reduce.collect()

[1.0, 2.0, 0.5, 0.1, 5, 0.2]

In [68]:
data_reduce.count()

6

The .count(...) method produces the same result as the following method, but **it 
does not require moving the whole dataset to the driver**:

In [69]:
len(data_reduce.collect()) # WRONG -- DON'T DO THIS! collect() moves the whole dataset to the driver

6

If your dataset is in the form of *key-value* pairs, you can use the `.countByKey()` method to get the number of elements for each distinct key.

In [70]:
data_key.collect()

[('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2), ('b', 1), ('d', 3)]

In [71]:
data_key.countByKey()

defaultdict(int, {'a': 2, 'b': 2, 'c': 1, 'd': 2})

In [72]:
data_key.countByKey().items()

[('a', 2), ('c', 1), ('b', 2), ('d', 2)]
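
What is being counted can be checked locally with `collections.Counter`. This is an illustrative sketch on a plain list, not Spark code: only the keys are counted and the values are ignored.

```python
from collections import Counter

pairs = [('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2), ('b', 1), ('d', 3)]

# Count how many pairs carry each key; the values play no role.
counts = Counter(key for key, _ in pairs)
print(sorted(counts.items()))
```

Like the dictionary returned by `.countByKey()`, the result lives entirely in local memory, so this approach suits datasets with a modest number of distinct keys.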

#### .saveAsTextFile(...)

As the name suggests, the `.saveAsTextFile(...)` method takes the RDD and saves it to text files: each partition to a separate file.

In [73]:
data_key.glom().collect()

[[('a', 4)], [('b', 3), ('c', 2)], [('a', 8), ('d', 2)], [('b', 1), ('d', 3)]]

In [95]:
#!rm -r data_key.txt
#!hdfs dfs -rm -r /user/ec2-user/data_key*

In [94]:
# make sure you delete the existing files before you run this cell!

data_key.saveAsTextFile('data_key.txt')

To read it back, you need to parse it again because, as before, all the rows are read as strings.

In [107]:
def parseInput(row):
    import re
    
    pattern = re.compile(r'\(\'([a-z])\', ([0-9])\)')
    row_split = pattern.split(row)
    
    return (row_split[1], int(row_split[2]))
    
data_key_reread = sc \
    .textFile('data_key.txt') \
    .map(parseInput)
data_key_reread.collect()

[(u'a', 4), (u'b', 3), (u'c', 2), (u'a', 8), (u'd', 2), (u'b', 1), (u'd', 3)]
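
The regular expression in `parseInput` matches only single-letter keys and single-digit values. As a possible alternative, assuming every saved line is the printed form of a Python tuple, the standard library's `ast.literal_eval` can parse the line directly. `parse_line` and the sample `lines` below are hypothetical and are not read from the saved files.

```python
import ast


def parse_line(line):
    """Parse a line such as "('a', 4)" back into a (key, int) tuple."""
    key, value = ast.literal_eval(line.strip())
    return (key, int(value))


# Sample lines in the format written by saveAsTextFile for an RDD of tuples.
lines = ["('a', 4)", "('b', 3)", "('abc', 120)"]

parsed = [parse_line(line) for line in lines]
print(parsed)
```

`ast.literal_eval` evaluates only Python literals, so it does not execute arbitrary code from the file, and it handles longer keys and multi-digit values without changes to a pattern.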