# Resilient Distributed Datasets

A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data in stable storage or other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Spark uses the concept of the RDD to achieve faster and more efficient MapReduce operations.

# Setting up SparkContext
SparkContext (aka Spark context) is the heart of a Spark application.

You could also assume that a SparkContext instance is a Spark application.

Spark context sets up internal services and establishes a connection to a Spark execution environment.

Once a SparkContext is created, you can use it to create RDDs, accumulators and broadcast variables, access Spark services, and run jobs (until the SparkContext is stopped).

A Spark context is essentially a client of Spark’s execution environment.

In [None]:
import os, sys
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2.1"
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], 'python'))
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], 'python/lib/py4j-0.10.4-src.zip'))

In [None]:
import pyspark

In [None]:
sparkConf = pyspark.SparkConf() \
    .set("spark.executor.memory", "2560m")\
    .set("spark.driver.memory", "2560m")\
    .set("spark.yarn.executor.memoryOverhead", 3584)\
    .set("spark.yarn.driver.memoryOverhead", 3584)\
    .set("spark.python.worker.memory", "1536m")\
    .set("spark.executor.instances", 11)\
    .set("spark.default.parallelism", 300)

Other configuration properties can be found [here](https://spark.apache.org/docs/latest/configuration.html).

In [None]:
sc = pyspark.SparkContext(
    master='yarn-client',
    appName='seminar3-rdd',
    conf=sparkConf
)
sc

Web UI (aka Application UI or webUI or Spark UI) is the web interface of a running Spark application to monitor and inspect Spark job executions in a web browser.

In [None]:
port = sc.uiWebUrl.split(':')[-1]
print('http://cluster1:{}'.format(port))

# Getting the Data Files

In this notebook, we will use the reduced dataset (10 percent) provided for the KDD Cup 1999, containing nearly half a million network interactions. The file is provided as a Gzip file that we will download locally.

The KDD Cup 1999 competition dataset is described in detail 
[here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99).

In [None]:
! wget "http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz" -O "/data/kddcup.data_10_percent.gz"

Put the data into HDFS.

In [None]:
! hdfs dfs -put /data/kddcup.data_10_percent.gz ./

## Creating an RDD from a File
The most common way of creating an RDD is to load it from a file. Notice that Spark's textFile can handle compressed files directly.

In [None]:
data_path = 'kddcup.data_10_percent.gz'
raw_data = sc.textFile(data_path)

Now we have our data file loaded into the raw_data RDD.

Without getting into Spark transformations and actions, the most basic thing we can do to check that we got our RDD contents right is to count() the number of lines loaded from the file into the RDD and look at a few of them.

In [None]:
raw_data.count()

In [None]:
raw_data.take(5)

Another way of creating an RDD is to parallelize an already existing list.

In [None]:
rdd_list = sc.parallelize([x + 5 for x in range(100)])
print(rdd_list.count())
rdd_list.take(5)
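
To make the idea of partitions a little more concrete, here is a plain-Python sketch (it does not use Spark) that splits the same list into contiguous slices that could, in principle, be processed independently. The helper `split_into_partitions` and the choice of four slices are assumptions made for illustration only; how Spark actually slices a collection, and how many partitions it uses by default, depends on its configuration.

```python
def split_into_partitions(items, num_slices):
    """Split a list into num_slices contiguous, near-equal chunks (illustrative helper)."""
    n = len(items)
    # Slice boundaries: 0, n/num_slices, 2n/num_slices, ..., n (integer division).
    bounds = [i * n // num_slices for i in range(num_slices + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_slices)]


values = [x + 5 for x in range(100)]
partitions = split_into_partitions(values, 4)

# Each slice can be counted on its own; the partial counts add up to the total.
partition_sizes = [len(p) for p in partitions]
total = sum(partition_sizes)
print(partition_sizes)
print(total)
```

Counting per slice and then summing the partial results mirrors, in spirit, how an action such as count() can be computed in parallel.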

# RDD Basic Operations
This section introduces three basic but essential Spark operations. Two of them are the transformations map and filter. The other is the action collect. At the same time we will introduce the concept of persistence in Spark.

### The filter Transformation
This transformation can be applied to RDDs in order to keep just the elements that satisfy a certain condition. More concretely, a function is evaluated on every element in the original RDD. The resulting RDD will contain just those elements for which the function returns True.

For example, imagine we want to count how many interactions tagged 'normal.' we have in our dataset. We can filter our raw_data RDD as follows.

In [None]:
normal_raw_data = raw_data.filter(lambda x: 'normal.' in x)

In [None]:
%%time
normal_raw_data.count()
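
In Spark, a transformation such as filter is lazy: it only describes a computation, and nothing runs until an action such as count() asks for a result. Python generators behave in a similar way, so the following Spark-free sketch may help build intuition. The sample lines and the `calls` list are made up for illustration.

```python
calls = []

def is_normal(line):
    calls.append(line)          # record every evaluation of the predicate
    return 'normal.' in line

lines = ['0,tcp,http,normal.', '0,icmp,ecr_i,smurf.', '0,udp,domain_u,normal.']

# "Transformation": builds a lazy pipeline; the predicate has not run yet.
pipeline = (line for line in lines if is_normal(line))
calls_before_action = len(calls)

# "Action": materializing the result forces the predicate to run on every line.
result = list(pipeline)
calls_after_action = len(calls)

print(calls_before_action)
print(calls_after_action)
print(len(result))
```

The predicate is only evaluated once the result is requested, which is loosely analogous to the filter above doing its work when count() is called.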

### The map Transformation
By using the map transformation in Spark, we can apply a function to every element in our RDD. Python's lambdas are especially expressive for this purpose.

In this case we want to read our data file as a CSV formatted one. We can do this by applying a lambda function to each element in the RDD as follows.

In [None]:
csv_data = raw_data.map(lambda x: x.split(","))
csv_data.take(1)[0]
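
As a plain-Python analogue (no Spark involved), the built-in map applies the same kind of function to every element of a list and produces exactly one output per input. The two sample lines below are shortened, made-up records, not actual rows from the dataset.

```python
sample_lines = [
    '0,tcp,http,SF,181,5450,normal.',
    '0,icmp,ecr_i,SF,1032,0,smurf.',
]

# One input line becomes one list of fields.
rows = list(map(lambda line: line.split(','), sample_lines))
first_row = rows[0]

print(len(rows))
print(first_row[1])
print(first_row[-1])
```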

### The flatMap Transformation
By using flatMap you can map each row to multiple new rows, as in the classic word count example.


In [None]:
texts = sc.parallelize(['Of course we can use predefined functions with map and not just lambda',
                       'Imagine we want to have each element in the RDD as a key-value pair where the key is the tag (e.g. normal) and the value is the whole list of elements that represents the row in the CSV formatted file', 
                       'We could proceed as follows'])
words = texts.flatMap(lambda x: x.lower().split(' '))
words.take(5)

In [None]:
words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collect()
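
The difference between map and flatMap can be sketched in plain Python (no Spark involved): map keeps one output per input, so splitting gives a list of lists, while flatMap flattens those lists into a single sequence. The `tokenize` helper and the two short strings are illustrative assumptions.

```python
from collections import Counter
from itertools import chain

texts = ['We could proceed as follows', 'as follows']

def tokenize(text):
    return text.lower().split(' ')

# map-like: one list of words per input string.
mapped = [tokenize(t) for t in texts]

# flatMap-like: all words in one flat list.
flat = list(chain.from_iterable(mapped))

# Word count over the flattened words.
word_counts = Counter(flat)

print(len(mapped))
print(len(flat))
print(word_counts['as'])
```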

#### Using map to create PairRDD
If the elements of your RDD are tuples of length 2, you can use \*ByKey operations on it, with the first value of each tuple being the key and the second being the value. Let's create such an RDD.

Of course we can use predefined functions with map and not just lambda. Imagine we want to have each element in the RDD as a key-value pair where the key is the tag (e.g. normal) and the value is the whole list of elements that represents the row in the CSV formatted file. We could proceed as follows.

In [None]:
def parse_interaction(line):
    elems = line.split(",")
    tag = elems[41]
    return (tag, elems)

key_csv_data = raw_data.map(parse_interaction)

You can change the key with the standard map function. Let's say we want to aggregate the data by tag and protocol.

In [None]:
def protocol_key(x):
    tag = x[0]
    proto = x[1][1]
    return '{}_{}'.format(tag, proto), 1

type_protocol = key_csv_data.map(protocol_key)
protocols_by_type = dict(type_protocol.reduceByKey(lambda x, y: x + y).collect())
protocols_by_type
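
What reduceByKey computes can be sketched in plain Python as a fold over the values of each key. The helper `reduce_by_key` is a hypothetical, single-machine stand-in for illustration, not Spark's implementation, and the sample pairs are made up.

```python
def reduce_by_key(pairs, func):
    """Combine the values of equal keys with func (illustrative helper, not Spark's API)."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


pairs = [
    ('normal._tcp', 1),
    ('smurf._icmp', 1),
    ('normal._tcp', 1),
    ('normal._udp', 1),
]
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts['normal._tcp'])
```

In Spark the combining function should be associative and commutative, because partial results from different partitions are merged in no fixed order; addition satisfies both.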

Another way to achieve this is to use the groupBy family of functions. In this case, for each key we get an iterable of the corresponding values as the second element of the tuple.

In [None]:
grouped_by = key_csv_data.groupByKey()
grouped_by.take(2)
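
The shape of a grouped result can be sketched in plain Python: every key maps to the collection of all values that shared it. The helper `group_by_key` and the shortened sample rows are illustrative assumptions, not Spark code.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect all values of each key into a list (illustrative helper, not Spark's API)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


pairs = [
    ('normal.', ['0', 'tcp']),
    ('smurf.', ['0', 'icmp']),
    ('normal.', ['0', 'udp']),
]
grouped = group_by_key(pairs)
print(sorted(grouped.keys()))
print(len(grouped['normal.']))
```

Unlike the reduce-style aggregation, grouping keeps every value for a key, so the statistic still has to be computed from each group afterwards.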

Then we can map the values to the desired statistic. Write a function that gives us the same results as above.

In [None]:
def protocol_counter(values):
    # Task 1
    pass

assert protocol_counter([(0, 'udp'), (0, 'udp'), (0, 'tcp')])['udp'] == 2

In [None]:
protocols_by_type2 = dict(grouped_by.mapValues(protocol_counter).collect())
assert protocols_by_type2['normal.']['udp'] == protocols_by_type['normal._udp']

# DataFrame API
DataFrame is another Spark API, which is very convenient for structured data.

To use it, we need to instantiate a SparkSession, which is essentially just an enhanced SparkContext.

In our case we can construct it directly from the SparkContext, but if we don't have one already, we can create a session via the builder, in almost the same way as with the context.

In [None]:
from pyspark.sql import SparkSession

ss = SparkSession(sc)

A DataFrame is a Dataset organized into named columns.
It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as CSV and other structured data files, tables in Hive, external databases, or existing RDDs.
To create one, we use the DataFrameReader available in SparkSession.

In [None]:
data = ss.read.csv(data_path)
data.show(5)

Sometimes it's more convenient to use the pandas dataframe representation in notebooks like this one.

In [None]:
data.limit(5).toPandas()

We don't have column names in our data, but they are available separately. Let's rename the columns.

In [None]:
import requests
header = requests.get('http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names').text.split('\n')[1:-1]

types = [h.split(':')[1].strip(' .') for h in header]
header = [h.split(':')[0] for h in header]
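
To see what this parsing does without making a network request, the same steps can be run on a small inline string. The sample below is made up; it only assumes the shape the code above relies on: a first line that is skipped, followed by `name: type.` lines and a trailing newline.

```python
sample_names = (
    'back,smurf,normal.\n'
    'duration: continuous.\n'
    'protocol_type: symbolic.\n'
    'src_bytes: continuous.\n'
)

# Drop the first line and the empty string left after the final newline.
sample_lines = sample_names.split('\n')[1:-1]

# The text after the colon is the type; strip the space and the trailing dot.
sample_types = [l.split(':')[1].strip(' .') for l in sample_lines]
# The text before the colon is the column name.
sample_header = [l.split(':')[0] for l in sample_lines]

print(sample_header)
print(sample_types)
```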

In [None]:
data_with_header = data
for i, h in enumerate(header + ['tag']):
    data_with_header = data_with_header.withColumnRenamed('_c{}'.format(i), h)
data_with_header.limit(5).toPandas()

DataFrames have a schema: information about the columns in the dataframe. You can view it like this.

In [None]:
data_with_header.printSchema()

All of the columns have string type. That's because we read them from CSV and didn't use the inferSchema flag. Let's cast the continuous columns ourselves.

To do this, we use Spark column functions.
Column represents a column in a Dataset that holds a Catalyst Expression that produces a value per row.
You can construct a Column instance from its name using pyspark.sql.functions.col and then call different functions on it, including cast.

In [None]:
import pyspark.sql.functions as sf

In [None]:
def cast_if_continuous(col_name, t):
    if t == u'continuous':
        return sf.col(col_name).cast('float').alias(col_name)
    else:
        return sf.col(col_name)

data_with_types = data_with_header.select([cast_if_continuous(h, t) for h, t in zip(header, types)] + ['tag'])
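
The effect of this selective cast on a single row can be sketched in plain Python. The helper `cast_row` and the three-column sample are illustrative assumptions, not part of the Spark API.

```python
def cast_row(row, types):
    """Convert values whose declared type is 'continuous' to float (illustrative helper)."""
    return [float(v) if t == 'continuous' else v for v, t in zip(row, types)]


sample_types = ['continuous', 'symbolic', 'continuous']
sample_row = ['0', 'tcp', '181']
typed_row = cast_row(sample_row, sample_types)
print(typed_row)
```

One difference worth keeping in mind: this sketch raises a ValueError on a string that is not a number, whereas a Spark cast in its default mode is expected to yield a null for such a value.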

Now we have appropriate types in our dataframe.

In [None]:
data_with_types.printSchema()

You can also apply different transformations to columns. For example, let's calculate the mean of the four error-rate columns for each row.

There are several ways to introduce a new column into our dataframe.
One of them is to use .withColumn, which accepts a column name and a column expression.

Another is to use .select with different column expressions as arguments.
Expressions can also be strings or constants, which are internally converted to columns using sf.col or sf.lit (literal value).
To provide a name for the new column, you can call .alias on the column.

You can use the '\*' wildcard to select all columns in the dataframe.

In [None]:
# Sum the four rates first, then divide by 4 to get their mean.
mean_er_df = data_with_types.select('tag', sf.col('protocol_type'), 
                         ((sf.col('dst_host_serror_rate') + 
                           sf.col('dst_host_srv_serror_rate') +
                           sf.col('dst_host_rerror_rate') + 
                           sf.col('dst_host_srv_rerror_rate')) / 4).alias('mean_err_rate'))
mean_er_df.orderBy('mean_err_rate', ascending=False).show(5)
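
When averaging several columns, the placement of parentheses matters, because `/` binds tighter than `+` in column expressions just as it does for plain numbers. A small sketch with made-up values shows the difference.

```python
a, b, c, d = 1.0, 1.0, 0.0, 0.0

# Parenthesized sum: the mean of the four values.
mean_of_four = (a + b + c + d) / 4

# Without the parentheses only the last term is divided.
last_term_divided = a + b + c + d / 4

print(mean_of_four)
print(last_term_divided)
```

Since each of the four inputs is a rate, their mean cannot exceed 1, which makes a result such as 2.0 an easy sanity check for a misplaced parenthesis.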

It's a lot easier to do aggregations on data using the DataFrame API, because the sf module also provides so-called aggregate functions, which can be used with .groupBy.

Let's calculate the same statistic as we did with the RDD API.

First, group the data by two columns.

In [None]:
grouped_df = data_with_types.groupBy('tag', 'protocol_type')
grouped_df

Now, aggregate it with the corresponding function.

In [None]:
pt_df = grouped_df.agg(sf.count('protocol_type').alias('count'))
pt_df.show(5)

In [None]:
protocols_by_type3 = {'{}_{}'.format(r['tag'], r['protocol_type']):r['count'] for r in pt_df.collect()}
assert protocols_by_type3['normal._tcp'] == protocols_by_type['normal._tcp']

As an exercise, calculate the mean payload size (src_bytes column) for each tag. A list of aggregate functions can be found [here](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$).

In [None]:
# Task 2
mean_src_bytes_by_tag_df = ...
mean_src_bytes_by_tag = ...
assert mean_src_bytes_by_tag['teardrop.'] == 28