# Intro to RDDs

References:

- RDD API docs: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

Rules of thumb:
- Hit tab to auto-complete
- To see all available methods, place a dot (.) after the RDD (e.g. words.) and hit tab 
- Use `.collect()` to see the contents of the RDD. It returns every element to the driver, so prefer `.take(n)` on large RDDs

In [1]:
# verify that you're using the virtual environment
!which python

# you should see: /path/to/data-eng-bootcamp/.venv_data_eng_bootcamp/bin/python
# otherwise, stop the pyspark process in your terminal, activate the virtual environment, and run this command again

/Users/dunyangan/learning/data-eng-bootcamp/.venv_data_eng_bootcamp/bin/python


In [2]:
# like in the pyspark shell, SparkContext is already defined
sc

## 1. RDD Transformations and Actions

### 1.1 Working with in-memory data

#### 1.1.1 Working with numbers

In [5]:
rdd = sc.parallelize([1, 2, 3, 4, 5])

In [6]:
rdd.collect()

[1, 2, 3, 4, 5]

In [7]:
rdd.map(lambda n: n ** 2).collect()

[1, 4, 9, 16, 25]

In [8]:
# TODO: double each number in the original rdd
rdd.map(lambda n: n*2).collect()

[2, 4, 6, 8, 10]

In [12]:
# TODO: calculate sum of all numbers in the original rdd
rdd.reduce(lambda a,b: a+b)

15

In [17]:
# TODO: (i) square each number, (ii) filter out odd numbers (i.e. keep only even numbers) and (iii) calculate sum of remaining numbers (answer: 20)
rdd.map(lambda x: x ** 2).filter(lambda x: x % 2 == 0).reduce(lambda x, y: x + y)

20
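
For intuition, here is the same square / filter / sum pipeline in plain Python, with no Spark involved. It is only a sketch of what each step does to the data: the built-in `map`, `filter` and `functools.reduce` stand in for the RDD methods of the same names, and the list `numbers` is a stand-in for the RDD.

```python
from functools import reduce

# Stand-in for sc.parallelize([1, 2, 3, 4, 5])
numbers = [1, 2, 3, 4, 5]

# (i) square each number, mirroring rdd.map(...)
squared = map(lambda n: n ** 2, numbers)

# (ii) keep only the even squares, mirroring rdd.filter(...)
evens = filter(lambda n: n % 2 == 0, squared)

# (iii) fold the remaining values into one, mirroring rdd.reduce(...)
total = reduce(lambda a, b: a + b, evens)

print(total)  # 20
```

One difference worth keeping in mind: Spark applies the `reduce` function within and across partitions, so the function should be associative and commutative (addition is both).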

#### 1.1.2 Working with strings

In [18]:
words = sc.parallelize(['hello', 'world', 'goodbye', 'world'])

In [19]:
words.collect()

['hello', 'world', 'goodbye', 'world']

In [20]:
wordcounts = words.map(lambda word: (word, 1))

In [21]:
# TODO: Given the list in the preceding cell, how would you create a list of (word, count)?
# hint: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduceByKey
wordcounts.collect()

[('hello', 1), ('world', 1), ('goodbye', 1), ('world', 1)]

In [10]:
wordcounts.reduceByKey(lambda a,b: a+b).collect()

[('hello', 1), ('world', 2), ('goodbye', 1)]
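
To see what `reduceByKey` is doing conceptually, the sketch below merges the values of each key in plain Python. `reduce_by_key` is a hypothetical helper written for illustration; it is not part of the Spark API and ignores partitioning and shuffling entirely.

```python
def reduce_by_key(pairs, func):
    """Hypothetical helper: combine all values that share a key using func."""
    merged = {}
    for key, value in pairs:
        if key in merged:
            # Key seen before: fold the new value into the running result
            merged[key] = func(merged[key], value)
        else:
            # First time this key appears: its value is the starting point
            merged[key] = value
    return list(merged.items())


pairs = [('hello', 1), ('world', 1), ('goodbye', 1), ('world', 1)]
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```

The sketch keeps keys in first-seen order because it uses a dict. Spark makes no such promise about the order of `reduceByKey` results.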

In [22]:
sentences = sc.parallelize(["Basics of the Unix Philosophy", "The ‘Unix philosophy’ originated with Ken Thompson's early meditations on how to design a small but capable operating system with a clean service interface. It grew as the Unix culture learned things about how to get maximum leverage out of Thompson's design. It absorbed lessons from many sources along the way"])

In [23]:
sentences.collect()

['Basics of the Unix Philosophy',
 "The ‘Unix philosophy’ originated with Ken Thompson's early meditations on how to design a small but capable operating system with a clean service interface. It grew as the Unix culture learned things about how to get maximum leverage out of Thompson's design. It absorbed lessons from many sources along the way"]

In [24]:
# TODO: word count, again! 
# hint: Highlight the whitespace between this cell and the next cell to see the hint

sentences.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()

[('Basics', 1),
 ('of', 2),
 ('the', 3),
 ('Unix', 2),
 ('Philosophy', 1),
 ('The', 1),
 ('‘Unix', 1),
 ('philosophy’', 1),
 ('originated', 1),
 ('with', 2),
 ('Ken', 1),
 ("Thompson's", 2),
 ('early', 1),
 ('meditations', 1),
 ('on', 1),
 ('how', 2),
 ('to', 2),
 ('design', 1),
 ('a', 2),
 ('small', 1),
 ('but', 1),
 ('capable', 1),
 ('operating', 1),
 ('system', 1),
 ('clean', 1),
 ('service', 1),
 ('interface.', 1),
 ('It', 2),
 ('grew', 1),
 ('as', 1),
 ('culture', 1),
 ('learned', 1),
 ('things', 1),
 ('about', 1),
 ('get', 1),
 ('maximum', 1),
 ('leverage', 1),
 ('out', 1),
 ('design.', 1),
 ('absorbed', 1),
 ('lessons', 1),
 ('from', 1),
 ('many', 1),
 ('sources', 1),
 ('along', 1),
 ('way', 1)]

<font color='white'>Hint: use flatMap!</font>
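
The difference between `map` and `flatMap` is easiest to see on a tiny input. The plain-Python sketch below imitates both (a list comprehension for `map`, `itertools.chain.from_iterable` for `flatMap`); the two-line `lines` list is made up for illustration.

```python
from itertools import chain

# Illustrative input, standing in for an RDD of sentences
lines = ["hello world", "goodbye world"]

# Like rdd.map(...): one output element per input element, so a list of lists
mapped = [line.split(" ") for line in lines]

# Like rdd.flatMap(...): the per-line lists are flattened into one sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

Word count needs one element per word, which is why the solution starts with `flatMap` rather than `map`.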

In [32]:
# Bonus task 1: strip non-alphabetical characters (e.g. so that ‘Unix and Unix are counted as the same word)
import re
sentences.flatMap(lambda line: line.split(" ")).map(lambda s: re.sub('[^a-zA-Z]+', '', s)).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect()

[('Basics', 1),
 ('of', 2),
 ('the', 3),
 ('Unix', 3),
 ('Philosophy', 1),
 ('The', 1),
 ('philosophy', 1),
 ('originated', 1),
 ('with', 2),
 ('Ken', 1),
 ('Thompsons', 2),
 ('early', 1),
 ('meditations', 1),
 ('on', 1),
 ('how', 2),
 ('to', 2),
 ('design', 2),
 ('a', 2),
 ('small', 1),
 ('but', 1),
 ('capable', 1),
 ('operating', 1),
 ('system', 1),
 ('clean', 1),
 ('service', 1),
 ('interface', 1),
 ('It', 2),
 ('grew', 1),
 ('as', 1),
 ('culture', 1),
 ('learned', 1),
 ('things', 1),
 ('about', 1),
 ('get', 1),
 ('maximum', 1),
 ('leverage', 1),
 ('out', 1),
 ('absorbed', 1),
 ('lessons', 1),
 ('from', 1),
 ('many', 1),
 ('sources', 1),
 ('along', 1),
 ('way', 1)]
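
It can help to try the regular expression on a few tokens before running it inside a Spark pipeline. The sketch below uses only the standard `re` module; the `'--'` token is a made-up example showing that a token with no letters collapses to an empty string.

```python
import re

# Same pattern as in the cell above: one or more non-alphabetical characters
NON_ALPHA = re.compile(r'[^a-zA-Z]+')

# Tokens taken from the sentences above, plus '--' as an invented edge case
tokens = ['‘Unix', 'philosophy’', "Thompson's", 'design.', '--']

cleaned = [NON_ALPHA.sub('', token) for token in tokens]
print(cleaned)

# Empty strings are falsy, so a simple truthiness check removes them
non_empty = [token for token in cleaned if token]
print(non_empty)
```

Note that the apostrophe is stripped too, so `Thompson's` becomes `Thompsons`. Empty strings left over after cleaning would be counted like any other word unless they are filtered out, for example with `.filter(lambda s: s)`.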

In [35]:
# Bonus task 2: sort word count by frequency in descending order
# hint: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sortBy
sentences.flatMap(lambda line: line.split(" ")).map(lambda s: re.sub('[^a-zA-Z]+', '', s)).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).sortBy(lambda x: x[1], ascending=False).collect()

[('the', 3),
 ('Unix', 3),
 ('of', 2),
 ('with', 2),
 ('Thompsons', 2),
 ('how', 2),
 ('to', 2),
 ('design', 2),
 ('a', 2),
 ('It', 2),
 ('Basics', 1),
 ('Philosophy', 1),
 ('The', 1),
 ('philosophy', 1),
 ('originated', 1),
 ('Ken', 1),
 ('early', 1),
 ('meditations', 1),
 ('on', 1),
 ('small', 1),
 ('but', 1),
 ('capable', 1),
 ('operating', 1),
 ('system', 1),
 ('clean', 1),
 ('service', 1),
 ('interface', 1),
 ('grew', 1),
 ('as', 1),
 ('culture', 1),
 ('learned', 1),
 ('things', 1),
 ('about', 1),
 ('get', 1),
 ('maximum', 1),
 ('leverage', 1),
 ('out', 1),
 ('absorbed', 1),
 ('lessons', 1),
 ('from', 1),
 ('many', 1),
 ('sources', 1),
 ('along', 1),
 ('way', 1)]
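
Words with the same count appear above in no particular order. If a stable, repeatable ordering is wanted, one option is to sort on a composite key: count descending, then word ascending. The sketch below shows the idea with Python's built-in `sorted` on a small hand-picked sample; passing the same key function to `sortBy` (with the default ascending order) should behave the same way, though that is an assumption not run here.

```python
# Small hand-picked sample of (word, count) pairs
counts = [('of', 2), ('the', 3), ('Unix', 3), ('a', 2), ('Basics', 1)]

# Negating the count sorts it in descending order;
# the word itself breaks ties in ascending order
by_frequency = sorted(counts, key=lambda pair: (-pair[1], pair[0]))

print(by_frequency)
```

String comparison is case-sensitive, so uppercase letters sort before lowercase ones (`'Unix'` comes before `'the'`).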

### 1.2 Creating an RDD by reading from a file

In [None]:
# TODO: Word count on a text file!
# Print the top 15 most frequent (word, count) pairs to the screen

In [36]:
words = sc.textFile('../data/word_count/unix-philosophy-basics.txt')

In [42]:
words.flatMap(lambda line: line.split(" ")).map(lambda s: re.sub('[^a-zA-Z]+', '', s)).filter(lambda s: s).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).sortBy(lambda x: x[1], ascending=False).take(20)

[('the', 214),
 ('to', 185),
 ('of', 161),
 ('and', 133),
 ('is', 109),
 ('a', 99),
 ('in', 79),
 ('it', 65),
 ('that', 64),
 ('be', 58),
 ('for', 53),
 ('you', 50),
 ('Rule', 46),
 ('programs', 43),
 ('as', 41),
 ('Unix', 38),
 ('with', 36),
 ('code', 34),
 ('are', 32),
 ('can', 29)]

In [None]:
# TODO: submit the preceding task as a Spark job
# 1. Create a Python file named ./jobs/unix_philosophy_word_count.py
# 2. Define the SparkContext object (unlike in the pyspark shell, it is not predefined in a standalone script)
#   - from pyspark import SparkContext
#   - sc = SparkContext("local", "Unix Word Count")
# 3. Copy your solution code into the file
# 4. Submit the job: ${SPARK_HOME}/bin/spark-submit --master local ./jobs/unix_philosophy_word_count.py


What are the top 15 most common words in this file?

Highlight the section below to see the answer! vvv

<font color='white'>
[('', 226),
 ('the', 214),
 ('to', 185),
 ('of', 161),
 ('and', 133),
 ('is', 109),
 ('a', 99),
 ('in', 79),
 ('it', 65),
 ('that', 64),
 ('be', 58),
 ('for', 53),
 ('you', 50),
 ('Rule', 46),
 ('programs', 43)]
</font>

In [43]:
rdd = sc.parallelize([1,2])

