# Big Data Management
Databricks + Spark. File 2.  
2020.03.06

Render markdown with %md

### map(func)
Return a new distributed dataset formed by passing each element of the source through a function func.

In [3]:
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# apply map(func) transformation to the RDD
rdd1 = rdd.map(lambda x: x * 5 + 1)

# show results of the new rdd
rdd1.collect()
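
To check the expected result without a cluster, here is a minimal plain-Python sketch of the same computation. It is not Spark code; the list `data` stands in for the RDD and the names are illustrative only.

```python
# Plain-Python stand-in for rdd.map(lambda x: x * 5 + 1); no Spark needed.
data = [1, 2, 3, 4, 5]

# map applies the function to every element and yields exactly one output per input
mapped = [x * 5 + 1 for x in data]

print(mapped)  # [6, 11, 16, 21, 26]
```

The output always has the same number of elements as the input, which is the main difference from flatMap and filter.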

### flatMap(func)

Similar to map, but each input item can be mapped to 0 or more output items, so func should return a sequence (in Python, any iterable such as a list) rather than a single item.

In [5]:
# Create RDD using sample data
rdd = sc.parallelize([1, 2, 3, 4, 5])

# apply flatMap(func) transformation to the RDD
rdd2 = rdd.flatMap(lambda x: [x, x * 3])

# show results of a new rdd
rdd2.collect()
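
The difference between map and flatMap is easiest to see side by side. The following plain-Python sketch (not Spark code; `nested` and `flat` are illustrative names) applies the same function both ways.

```python
from itertools import chain

data = [1, 2, 3, 4, 5]
func = lambda x: [x, x * 3]

# map-like behaviour: one output per input, so the lists stay nested
nested = [func(x) for x in data]

# flatMap-like behaviour: the returned lists are flattened into one sequence
flat = list(chain.from_iterable(func(x) for x in data))

print(nested)  # [[1, 3], [2, 6], [3, 9], [4, 12], [5, 15]]
print(flat)    # [1, 3, 2, 6, 3, 9, 4, 12, 5, 15]
```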

### filter(func)
Return a new dataset formed by selecting those elements of the source on which func returns true.

In [7]:
# Create RDD using sample data
rdd = sc.parallelize([1, 2, 3, 4, 5])

# apply filter(func) transformation to the RDD
rdd.filter(lambda x: x % 2 == 0).collect()
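
As a quick sanity check, the same predicate can be applied to a plain Python list. This is a local sketch rather than Spark code, and `is_even` is a name introduced here for illustration.

```python
data = [1, 2, 3, 4, 5]

# the predicate decides which elements survive; it never changes them
is_even = lambda x: x % 2 == 0

evens = [x for x in data if is_even(x)]

print(evens)  # [2, 4]
```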

### Key-value pairs

A key/value RDD is an RDD whose elements are pairs consisting of a key and a value. Each element should be a tuple such as (1, 2); you can then apply key-value pair operations such as join(), groupByKey(), or reduceByKey().

In [9]:
# Set up the textFile RDD to read the README.md file
# Note: this is lazy; nothing is read until an action such as take() runs
textFile = sc.textFile("/databricks-datasets/samples/docs/README.md")

# split each line of the README file into words, then make a tuple of (word, 1)
textFile.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).take(1)
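
One detail worth knowing when building (word, 1) pairs: `split(' ')` and `split()` behave differently on repeated or trailing spaces. The sketch below is plain Python on a made-up line (the string is hypothetical, not taken from the README).

```python
# hypothetical line with a double space and a trailing space
line = "Spark  is fast "

# splitting on a single space keeps empty strings, which would become keys
by_space = line.split(' ')

# splitting on any whitespace drops them
by_whitespace = line.split()

pairs = [(word, 1) for word in by_whitespace]

print(by_space)       # ['Spark', '', 'is', 'fast', '']
print(by_whitespace)  # ['Spark', 'is', 'fast']
print(pairs)          # [('Spark', 1), ('is', 1), ('fast', 1)]
```

If empty-string keys are unwanted in a word count, using `x.split()` in the flatMap, or filtering out empty strings afterwards, are two possible options.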

### reduceByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. As with groupByKey, the number of reduce tasks is configurable through an optional second argument.

In [11]:
# Set up the textFile RDD to read the README.md file
textFile = sc.textFile("/databricks-datasets/samples/docs/README.md")

# split each line of the README file into words, then make a tuple of (word, 1)
rdd_key = textFile.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))

# reduce by key: sum the 1s for each word to get its count
rdd_key.reduceByKey(lambda x, y: x + y).take(3)
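
To make the semantics concrete, here is a minimal single-machine sketch of what reducing by key computes. `reduce_by_key` is a hypothetical helper written for this illustration, not part of the Spark API, and the sample pairs are made up.

```python
def reduce_by_key(pairs, func):
    """Fold the values of each key with func; a local stand-in for reduceByKey."""
    totals = {}
    for key, value in pairs:
        # first value for a key is kept as-is; later ones are merged with func
        totals[key] = func(totals[key], value) if key in totals else value
    return totals

pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]
counts = reduce_by_key(pairs, lambda x, y: x + y)

print(counts)  # {'a': 3, 'b': 1, 'c': 1}
```

In Spark the merge order across partitions is not fixed, so the function passed to reduceByKey should be associative and commutative, as addition is here.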

### union

Return a new dataset that contains the union of the elements in the source dataset and the argument.

In [13]:
# create some rdds
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = rdd1.map(lambda x: x * 2)

# combine these rdds with a union
rdd1.union(rdd2).collect()
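
Note that union keeps duplicates. The plain-Python sketch below (lists stand in for the RDDs; the names are illustrative) shows the combined elements, and what removing duplicates would look like. In Spark, the `distinct()` transformation serves that purpose.

```python
rdd1_data = [1, 2, 3, 4, 5]
rdd2_data = [x * 2 for x in rdd1_data]

# union simply concatenates; 2 and 4 appear twice
combined = rdd1_data + rdd2_data

# removing duplicates is a separate step
deduplicated = sorted(set(combined))

print(combined)      # [1, 2, 3, 4, 5, 2, 4, 6, 8, 10]
print(deduplicated)  # [1, 2, 3, 4, 5, 6, 8, 10]
```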

### groupByKey()
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.

In [15]:
# Set up the textFile RDD to read the README.md file
textFile = sc.textFile("/databricks-datasets/samples/docs/README.md")

# split each line of the README file into words, then make a tuple of (word, 1)
rdd_key = textFile.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))

# group by key
rdd_gp = rdd_key.groupByKey()

for (key, value) in rdd_gp.take(5):
  print(key, sum(value))
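
The performance advice above comes down to how much data is held per key. This plain-Python sketch (made-up pairs, not Spark code) computes the same sums both ways: grouping first collects every value, while reducing keeps only one running total per key.

```python
from collections import defaultdict

pairs = [("spark", 1), ("data", 1), ("spark", 1),
         ("rdd", 1), ("data", 1), ("spark", 1)]

# groupByKey-style: collect all values per key, then aggregate
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)
grouped_sums = {key: sum(values) for key, values in groups.items()}

# reduceByKey-style: keep a single running value per key
running_sums = {}
for key, value in pairs:
    running_sums[key] = running_sums.get(key, 0) + value

print(dict(groups))   # {'spark': [1, 1, 1], 'data': [1, 1], 'rdd': [1]}
print(grouped_sums)   # {'spark': 3, 'data': 2, 'rdd': 1}
print(running_sums)   # {'spark': 3, 'data': 2, 'rdd': 1}
```

In Spark, reduceByKey can combine values on each partition before the shuffle, so less data crosses the network than with groupByKey followed by a sum.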

### join
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

In [17]:
# create two RDDs
rdd1 = sc.parallelize([('rock', 1), ('paper', 2), ('scissor', 1), ('hammer', 3)])
rdd2 = sc.parallelize([('hammer', 2), ('paper', 3), ('water', 1), ('fire', 3)])

# perform a left outer join: every key of rdd2 is kept, matched or not
rdd2.leftOuterJoin(rdd1).collect()
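
A minimal local sketch of left outer join semantics follows. `left_outer_join` is a hypothetical helper written for this illustration, not the Spark implementation; keys missing from the right side are paired with None. Spark does not guarantee the order of the collected result, so only the contents should be compared.

```python
from collections import defaultdict

def left_outer_join(left, right):
    """Pair each (k, v) in left with every (k, w) in right, or with None."""
    right_index = defaultdict(list)
    for key, w in right:
        right_index[key].append(w)

    result = []
    for key, v in left:
        matches = right_index.get(key)  # .get avoids creating empty entries
        if matches:
            result.extend((key, (v, w)) for w in matches)
        else:
            result.append((key, (v, None)))
    return result

pairs1 = [('rock', 1), ('paper', 2), ('scissor', 1), ('hammer', 3)]
pairs2 = [('hammer', 2), ('paper', 3), ('water', 1), ('fire', 3)]

# mirrors rdd2.leftOuterJoin(rdd1): pairs2 is the left side
joined = left_outer_join(pairs2, pairs1)

print(joined)
# [('hammer', (2, 3)), ('paper', (3, 2)), ('water', (1, None)), ('fire', (3, None))]
```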

### stats()

Return the count, mean, standard deviation, max and min of the RDD's elements in one operation.

In [19]:
# stats() is an action: it returns the count, mean, std dev, max and min
sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9]).stats()
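
For reference, the same figures can be reproduced with Python's standard library. This is a local sketch, not Spark code. It computes both the population and the sample standard deviation because the two are easy to confuse; the object returned by stats() exposes them as `stdev()` (population) and `sampleStdev()` (sample).

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

count = len(data)
mean = sum(data) / count
population_stdev = statistics.pstdev(data)  # divides by n
sample_stdev = statistics.stdev(data)       # divides by n - 1

print(count, mean, max(data), min(data))  # 9 5.0 9 1
print(round(population_stdev, 4))         # 2.582
print(round(sample_stdev, 4))             # 2.7386
```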

### sample(withReplacement, fraction, seed=None)

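Sample a fraction of the data, set by the fraction parameter, with or without replacement, using a given random number generator seed.  
Parameters:
- withReplacement – whether elements can be sampled multiple times (replaced when sampled out)
- fraction – expected size of the sample as a fraction of this RDD’s size. Without replacement, it is the probability that each element is chosen and must be in [0, 1]; with replacement, it is the expected number of times each element is chosen and must be >= 0
- seed – seed for the random number generator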

In [21]:
# Set up the textFile RDD to read the README.md file
textFile = sc.textFile("/databricks-datasets/samples/docs/README.md")

# split each line of the README file into words, then sample about 2% of them
# without replacement, using seed 3
rdd_key = textFile.flatMap(lambda x: x.split(' '))
rdd_key.sample(False, 0.02, 3).collect()
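
Sampling without replacement can be thought of as an independent coin flip per element, which is why the sample size is only approximately fraction times the RDD size. The sketch below illustrates that idea in plain Python. `bernoulli_sample` is a hypothetical helper, the word list is made up, and the code does not reproduce Spark's random number generator, so it will not select the same elements as Spark for a given seed.

```python
import random

def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]

# hypothetical stand-in for the words of the README
words = [f"word{i}" for i in range(1000)]

sample_a = bernoulli_sample(words, 0.02, 3)
sample_b = bernoulli_sample(words, 0.02, 3)

# the same seed gives the same sample; the size is close to, not exactly, 2%
print(sample_a == sample_b)  # True
print(len(sample_a))
```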

### Functions 
Create a function and use it for a transformation.

In [23]:
# create a function that tells whether a line is small or large

def strLenType(line):
  if len(line) < 15:
    return "Small"
  else:
    return "Large"

# Set up the textFile RDD to read the README.md file
textFile = sc.textFile("/databricks-datasets/samples/docs/README.md")

# label each line of the README as "Small" or "Large" by its length
textFile.map(lambda x: strLenType(x)).take(5)
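
A named function can also be passed directly, without wrapping it in a lambda; in Spark, `textFile.map(strLenType)` is equivalent to the call above. The plain-Python sketch below checks the function's behaviour on a few made-up lines, including the boundary at 15 characters. The snake_case name `str_len_type` and the sample lines are assumptions for this illustration.

```python
def str_len_type(line):
    """Label a line by length: fewer than 15 characters is Small."""
    return "Small" if len(line) < 15 else "Large"

# hypothetical lines; the last one is exactly 15 characters long
lines = ["", "Spark", "Welcome to the Spark documentation!", "x" * 15]

# pass the function itself, no lambda needed
labels = list(map(str_len_type, lines))

print(labels)  # ['Small', 'Small', 'Large', 'Large']
```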