# Key benefits of DataFrames

In this module we take a closer look at the RDD API and at the key differences between RDDs and DataFrames. Unlike in the previous exercise, we will also perform an aggregation on the data.

Let's start with a simple exercise: each student took a different number of tests during the year, and we would like to calculate the average score each student achieved.

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

sc = pyspark.SparkContext()
spark = SparkSession(sc)

In [2]:
# For now, let's assume our data is just a list of (name, score) tuples, one tuple per exam
data = [("Andras", 10), ("Bob", 20), ("Bob", 30), ("Andras", 12), ("Bob", 35)]

## Calculating avg. scores through the RDD API

Since Spark knows nothing about the structure of the data stored in an RDD, we have to spell out explicitly how these averages are calculated:

1. First we need to transform the dataset into key-value pairs. This is not difficult, as the name will be the key. For the value, however, we need to add an extra field that keeps track of the number of elements, since we will need this count when calculating the average.

2. We need to perform a **reduceByKey** operation. This is an ordinary **reduce** operation, but it is applied separately to the values that share the same key. During the reduce we simply sum up both fields of the value (the scores and the number of elements).

3. Finally, we calculate the average itself by dividing the sum of scores by the number of elements.

In [3]:
rdd = sc.parallelize(data)
avg_rdd = rdd \
    .map(lambda x: (x[0], (x[1], 1))) \
    .reduceByKey(lambda value1, value2: (value1[0] + value2[0], value1[1] + value2[1])) \
    .map(lambda x: (x[0], x[1][0] / x[1][1]))
avg_rdd.collect()

[('Bob', 28.333333333333332), ('Andras', 11.0)]
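
To make the three steps above more concrete, here is a minimal plain-Python sketch that mimics the per-key semantics of the job without using Spark at all. The names `pairs`, `grouped`, `add_pairs`, `totals` and `averages` are made up for this illustration. Note also that this is only a model of the result: Spark does not collect all values of a key into a list first, it combines values pairwise within and across partitions.

```python
from collections import defaultdict
from functools import reduce

data = [("Andras", 10), ("Bob", 20), ("Bob", 30), ("Andras", 12), ("Bob", 35)]

# Step 1: turn every record into (key, (score, 1)); the 1 counts the exam.
pairs = [(name, (score, 1)) for name, score in data]

# Step 2: collect the values that share a key, then reduce each group
# pairwise, summing the scores and the counts (illustrative helper names).
grouped = defaultdict(list)
for name, value in pairs:
    grouped[name].append(value)


def add_pairs(value1, value2):
    """Same logic as the lambda passed to reduceByKey in the cell above."""
    return (value1[0] + value2[0], value1[1] + value2[1])


totals = {name: reduce(add_pairs, values) for name, values in grouped.items()}
print(totals)    # {'Andras': (22, 2), 'Bob': (85, 3)}

# Step 3: divide the sum of scores by the number of exams.
averages = {name: total / count for name, (total, count) in totals.items()}
print(averages)  # {'Andras': 11.0, 'Bob': 28.333333333333332}
```

The intermediate `totals` dictionary shows why the extra counter is needed: an average cannot be built up by a pairwise reduce on its own, whereas a (sum, count) pair can.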

### What are the drawbacks of this solution?

1. Cryptic and hard to read
2. Spark has no knowledge of the structure of the underlying data, so we need to encode it explicitly in our job
3. Hard to maintain, as small changes in the data may require rewriting the entire job
4. Not language-agnostic: the same job looks very different in Python and in Scala because of the differences in syntax

## Calculating avg. scores with DataFrames

In [4]:
# The second column holds exam scores, so name it accordingly
df = spark.createDataFrame(data, ['Name', 'Score'])
result_df = df.groupBy('Name').agg(avg('Score'))
result_df.show()

+------+------------------+
|  Name|        avg(Score)|
+------+------------------+
|Andras|              11.0|
|   Bob|28.333333333333332|
+------+------------------+



### It is not hard to see the advantages of the DataFrame APIs compared to RDDs

1. The code is more expressive
2. Spark understands our intent. With RDDs we only pass opaque higher-order (lambda) functions to the operations, whereas with DataFrames Spark knows what the operations actually do. In this case, we want to calculate an average grouped by Name. This allows Spark to optimize our jobs automatically and more effectively.
3. The DataFrame API is very similar in Python and Scala, which facilitates uniformity across programming languages.