# Spark RDD
## Run a script in Spark

In [None]:
#!spark-submit --master spark://localhost:7077 --name wordcount "/usr/local/spark-3.0.1-bin-hadoop3.2/examples/src/main/python/wordcount.py" /demo/txt/victor_hugo-texts.txt

In [None]:
#!cat /usr/local/spark-3.0.1-bin-hadoop3.2/examples/src/main/python/wordcount.py

## pySpark

In [None]:
#!pip install pyspark

In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName('RDD Test') \
    .getOrCreate()

In [None]:
spark

In [2]:
sc = spark.sparkContext

In [3]:
numbers_rdd = sc.parallelize(range(50))
numbers_rdd

PythonRDD[1] at RDD at PythonRDD.scala:53

In [4]:
numbers_rdd.take(3)

[0, 1, 2]

#### `.mean()`
Compute the average of the RDD (requires numerical values).

In [8]:
numbers_rdd.mean()

24.5
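To see where `24.5` comes from, here is a plain-Python sketch (no Spark required) of how a mean can be computed over data split into partitions: each partition reports a partial sum and count, and the partials are combined at the end. The split into 4 partitions and the variable names are assumptions made for illustration; this is not Spark's actual implementation.

```python
# Plain-Python illustration only; Spark is not used here.
numbers = list(range(50))

# Hypothetical split of the data into 4 partitions.
partitions = [numbers[i::4] for i in range(4)]

# Each partition computes its own (sum, count) pair independently.
partials = [(sum(p), len(p)) for p in partitions]

# The partial results are then combined into a global mean.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean = total / count

print(mean)
```

The result matches `numbers_rdd.mean()` above because the sum of 0..49 is 1225 and there are 50 elements.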

In [5]:
text_rdd = sc.textFile('/demo/txt/victor_hugo-texts.txt')
text_rdd

/demo/txt/victor_hugo-texts.txt MapPartitionsRDD[4] at textFile at NativeMethodAccessorImpl.java:0

In [6]:
text_rdd.take(1)

['Mes vers fuiraient']

`.collect()` returns every element of the RDD to the driver as a Python list, so avoid calling it on large datasets.

In [7]:
#text_rdd.collect()

And now we will apply some **transformations**.

#### `.map(func)`
Applies `func` to every element of the RDD and returns a new RDD. It is a transformation, so it is lazy: nothing is computed until an action is called.

In [14]:
out = text_rdd.map(lambda s: s.lower())

How do I get my result? Call an action such as `.take(n)` or `.collect()`.

In [15]:
out.take(3)

['mes vers fuiraient',
 '',
 'mes vers fuiraient, doux et frêles, vers votre jardin si beau,']
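The lazy behaviour described above can be imitated in plain Python, where the built-in `map` is also lazy. This is only an analogy, not Spark code: the helper `lower_and_log` and the sample lines are made up for illustration, and `next(...)` stands in for an action like `.take(2)`.

```python
# Plain-Python analogy for lazy transformations; Spark is not used here.
calls = []

def lower_and_log(s):
    """Hypothetical helper: records each call, then lowercases the string."""
    calls.append(s)
    return s.lower()

lines = ["Mes vers fuiraient", "", "Doux et frêles"]

# Building the pipeline does not call the function yet.
lazy = map(lower_and_log, lines)
calls_before = list(calls)

# Pulling two elements (similar in spirit to take(2)) triggers only two calls.
first_two = [next(lazy), next(lazy)]
calls_after = list(calls)

print(calls_before)
print(first_two)
print(calls_after)
```

Only the elements that are actually requested get processed, which is why `.take(3)` on an RDD can be much cheaper than `.collect()`.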

### Chaining operations

In [19]:
length = text_rdd.map(lambda s: s.lower()).map(lambda s: (s, len(s)))

In [20]:
length.take(3)

[('mes vers fuiraient', 18),
 ('', 0),
 ('mes vers fuiraient, doux et frêles, vers votre jardin si beau,', 62)]

Let's add a new **transformation**, `filter`, which keeps only the elements for which a given function returns `True`.

_Note: when chaining operations across several lines, we break the line using Python's line-continuation character, `\`._

In [21]:
text_rdd \
    .map(lambda s: s.lower()) \
    .map(lambda s: len(s)) \
    .filter(lambda c: c > 50) \
    .take(3)

[62, 53, 118]

In [23]:
text_rdd \
    .map(lambda s: s.lower()) \
    .map(lambda s: len(s)) \
    .filter(lambda c: c > 80) \
    .count()

471
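As a rough plain-Python equivalent of the `map` / `map` / `filter` chain, the sketch below runs the same three steps on a small list. The three sample lines are adapted from the output shown earlier and are an assumption for illustration; the real RDD holds the whole file. Generator expressions are used so that each step stays lazy until the final list is built.

```python
# Plain-Python illustration only; Spark is not used here.
lines = [
    "Mes vers fuiraient",
    "",
    "Mes vers fuiraient, doux et frêles, vers votre jardin si beau,",
]

# Step 1: lowercase each line (like .map(lambda s: s.lower())).
lowered = (s.lower() for s in lines)

# Step 2: replace each line by its length (like .map(lambda s: len(s))).
lengths = (len(s) for s in lowered)

# Step 3: keep only lengths above 50 (like .filter(lambda c: c > 50)).
long_lengths = [n for n in lengths if n > 50]

# Counting the survivors plays the role of the .count() action.
count = len(long_lengths)

print(long_lengths, count)
```

In the Spark version the same chain can also be wrapped in parentheses instead of ending each line with `\`, which is the style used for `spark.read` at the end of this notebook.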

### Key-value tuples
It's common to use tuples as key-value pairs, like so:

In [27]:
tuples_rdd = sc.parallelize([
    ('banana', 4), ('orange', 12), ('apple', 3),
    ('pineapple', 1), ('banana', 3), ('orange', 6)])
tuples_rdd.collect()

[('banana', 4),
 ('orange', 12),
 ('apple', 3),
 ('pineapple', 1),
 ('banana', 3),
 ('orange', 6)]

In [26]:
tuples_rdd.groupByKey().collect()

[('banana', <pyspark.resultiterable.ResultIterable at 0x7fd055d5a670>),
 ('orange', <pyspark.resultiterable.ResultIterable at 0x7fd055d5af10>),
 ('apple', <pyspark.resultiterable.ResultIterable at 0x7fd055d5ac10>),
 ('pineapple', <pyspark.resultiterable.ResultIterable at 0x7fd06dd3b370>)]

In [28]:
tuples_rdd.groupByKey().map(lambda t: (t[0], sum(t[1]))).collect()

[('banana', 7), ('orange', 18), ('apple', 3), ('pineapple', 1)]

In [30]:
from operator import add
tuples_rdd.reduceByKey(add).collect()

[('banana', 7), ('orange', 18), ('apple', 3), ('pineapple', 1)]
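Both approaches above give the same totals. The plain-Python sketch below shows what each one computes on the same pairs: grouping collects every value per key before summing, while reducing folds each value into a running result. The helper names `group_by_key` and `reduce_by_key` are hypothetical stand-ins written for this example, not Spark functions, and they ignore partitioning entirely.

```python
# Plain-Python illustration only; Spark is not used here.
from operator import add

pairs = [
    ('banana', 4), ('orange', 12), ('apple', 3),
    ('pineapple', 1), ('banana', 3), ('orange', 6),
]

def group_by_key(pairs):
    """Hypothetical helper: collect all values for each key into a list."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return list(groups.items())

def reduce_by_key(pairs, func):
    """Hypothetical helper: fold values per key without storing them all."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return list(result.items())

grouped = group_by_key(pairs)
summed_groups = [(key, sum(values)) for key, values in grouped]
totals = reduce_by_key(pairs, add)

print(grouped)
print(summed_groups)
print(totals)
```

The reducing version never holds more than one value per key at a time, which is the intuition behind preferring `reduceByKey` over `groupByKey` followed by a sum.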

In [None]:
!pwd
!ls /root/ipynb/data/titanic

In [None]:
df = (spark.read
          .format("csv")
          .option('header', 'true')
          .load("/demo/titanic/titanic-train.csv"))

In [None]:
df.head(1)