# RDD Concepts

## Spark Set Up

In [1]:
import warnings
warnings.filterwarnings('ignore')

from pyspark.sql import SparkSession

app_name = "week1_demo"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .getOrCreate()
sc = spark.sparkContext


## RDD Creation

In [2]:
## Create an RDD holding the integers 1 through 999 (range() excludes its end value)
my_RDD = sc.parallelize(range(1,1000))

In [3]:
## What happens when we try to display it?
my_RDD

PythonRDD[1] at RDD at PythonRDD.scala:53

In [5]:
## Uh oh! What's happening? ... LAZY EVALUATION! Let's force execution with an action
my_RDD.take(5)

[1, 2, 3, 4, 5]
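Lazy evaluation can be imitated without Spark. The following is a plain-Python sketch, not Spark's implementation: a generator stands in for the RDD and `itertools.islice` stands in for `take(5)`. The names `lazy_source` and `calls` are hypothetical and exist only for this illustration.

```python
from itertools import islice

calls = []  # records every element the source actually produces

def lazy_source(start, stop):
    """Yield integers one at a time, logging each element produced."""
    for x in range(start, stop):
        calls.append(x)
        yield x

# Building the pipeline does no work, much like defining an RDD.
pipeline = (x for x in lazy_source(1, 1000))
produced_before = len(calls)

# Asking for results is what triggers computation, much like take(5).
first_five = list(islice(pipeline, 5))
produced_after = len(calls)

print(produced_before, first_five, produced_after)
```

Nothing is produced until results are requested. Spark's `take` works partition by partition and may compute more elements than it returns, so the sketch only captures the idea of deferring work, not the exact amount of work done.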

## Narrow Transformations

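Let's start by discussing narrow transformations. In a narrow transformation, each input partition contributes to at most one output partition, so partitions can be processed independently and in parallel, with no data exchanged between nodes. Spark pipelines a series of narrow transformations into a single stage. No need to shuffle.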
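To see why no data needs to move, here is a plain-Python sketch that models partitions as ordinary lists. It is an illustration under stated assumptions, not how Spark is implemented: the helpers `split_into_partitions` and `run_narrow_stage` are hypothetical, and the partitioning is simple contiguous chunking.

```python
def split_into_partitions(data, num_partitions):
    """Split a list into contiguous chunks, a stand-in for RDD partitions."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def run_narrow_stage(partition):
    """Apply a map step then a filter step to one partition only."""
    mapped = ((x, x ** 3) for x in partition)        # plays the role of map()
    return [kv for kv in mapped if kv[1] % 5 == 0]   # plays the role of filter()

partitions = split_into_partitions(list(range(1, 21)), 4)

# Each partition is processed on its own; on a cluster these calls could
# run on different nodes with no data exchanged between them.
result_partitions = [run_narrow_stage(p) for p in partitions]
print(result_partitions)
```

Both steps run back to back inside `run_narrow_stage`, which mirrors the way consecutive narrow transformations share a single stage.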

### Main Narrow Transformations
- `map()`
- `flatMap()`
- `filter()`
- `mapValues()`

In [6]:
## Let's use our previous RDD and expand each element into a tuple
cubes_RDD = my_RDD.map(lambda x: (x, x**3))
cubes_RDD

PythonRDD[3] at RDD at PythonRDD.scala:53

In [7]:
## Same as before, Spark WILL NOT EXECUTE until needed (Lazy Evaluation!)
cubes_RDD.take(5)

[(1, 1), (2, 8), (3, 27), (4, 64), (5, 125)]

How many RDDs do we have right now? 1 or 2?

Let's check the Spark UI!

In [11]:
## Let's now filter our cubes, keeping only those that are multiples of 5
cubes_mult_5_RDD = cubes_RDD.filter(lambda x: x[1] % 5 == 0)
cubes_mult_5_RDD.take(5)

[(5, 125), (10, 1000), (15, 3375), (20, 8000), (25, 15625)]

In [12]:
## Great! One of the powers of Spark is the ability to chain transformations. The action at the end (take) is what triggers a Spark job!
my_RDD = sc.parallelize(range(1,1000))\
        .map(lambda x: (x, x**3))\
        .filter(lambda x: x[1] % 5 == 0)

my_RDD.take(5)

[(5, 125), (10, 1000), (15, 3375), (20, 8000), (25, 15625)]

## Wide Transformations

Wide transformations (roughly equivalent to the reduce phase of MapReduce) are those in which an output partition needs data from more than one input partition, thus forcing a shuffle and starting a new stage.
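To make the shuffle concrete, here is a plain-Python sketch of what has to happen before values with the same key can be combined. It is a simplified model, not Spark's implementation: the functions `shuffle` and `reduce_partition`, the toy partitioner based on character codes, and the sample data are all assumptions made for this illustration.

```python
def shuffle(partitions, num_output_partitions):
    """Route every (key, value) pair to the output partition chosen by its key."""
    outputs = [[] for _ in range(num_output_partitions)]
    for partition in partitions:
        for key, value in partition:
            # Toy deterministic partitioner: total of the key's character codes.
            target = sum(ord(c) for c in key) % num_output_partitions
            outputs[target].append((key, value))
    return outputs

def reduce_partition(partition, func):
    """Combine all values that share a key within one partition."""
    merged = {}
    for key, value in partition:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Two input partitions; both keys appear in both partitions.
input_partitions = [
    [("a", 1), ("b", 10)],
    [("a", 2), ("b", 20)],
]

shuffled = shuffle(input_partitions, 2)
totals = [reduce_partition(p, lambda x, y: x + y) for p in shuffled]

print(shuffled)
print(totals)
```

Every output partition receives pairs from both input partitions, which is exactly the cross-partition dependency that makes the transformation wide. Spark's `reduceByKey` also combines values within each input partition before shuffling, a step this sketch leaves out.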

### Main Wide Transformations
- `reduceByKey()`
- `aggregateByKey()`
- `groupByKey()`

In [34]:
## Let's create a list of (item, sale amount) tuples
sales_amounts = [
    ('Item1', 1200),
    ('Item1', 850),
    ('Item2',  350),
    ('Item2', 400)
]

sales_RDD = sc.parallelize(sales_amounts)

## collect() is an important action, similar to take(n), but it brings everything back to the driver
sales_RDD.collect()

[('Item1', 1200), ('Item1', 850), ('Item2', 350), ('Item2', 400)]

In [53]:
## Let's compute our average sale amount per item
sales_RDD.mapValues(lambda x: (x,1))\
        .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))\
        .mapValues(lambda x: x[0]/float(x[1]))\
        .collect()

[('Item2', 375.0), ('Item1', 1025.0)]
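Why carry a `(sum, count)` pair instead of averaging directly inside the reduce? A reduce may combine partial results in any grouping, so the combining function has to give the same answer regardless of grouping. The plain-Python sketch below uses three made-up values (hypothetical, not taken from the sales data) to show that pairwise averaging fails this requirement while the pair approach does not.

```python
from functools import reduce

values = [10, 20, 60]  # hypothetical amounts for a single key

# Naive approach: average two values every time they are combined.
# The result depends on how the values happen to be grouped.
naive_left = reduce(lambda x, y: (x + y) / 2, values)        # ((10 + 20) / 2 + 60) / 2
naive_right = (values[0] + (values[1] + values[2]) / 2) / 2  # (10 + (20 + 60) / 2) / 2

# Pair approach: carry (sum, count) and divide once at the end.
pairs = [(v, 1) for v in values]
total, count = reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), pairs)
mean = total / count

print(naive_left, naive_right, mean)
```

The two naive groupings disagree with each other and with the true mean, while adding sums and counts is safe under any grouping. With only two values per key, as in the sales data above, the naive approach would give the right answer by coincidence.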

# Actions