# Key benefits of DataFrames

In this module we take a closer look at the RDD API and examine the key differences between RDDs and DataFrames. Unlike in the previous exercise, we now also perform some aggregation on the data.

Let's start with a simple exercise: each student took a different number of tests during the year, and we would like to calculate the average score each student achieved.

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

sc = pyspark.SparkContext()
spark = SparkSession(sc)

In [3]:
# Assume for now that our data is just a list of tuples, each holding a name and the score achieved on a particular exam
data = [("Andras", 10), ("Bob", 20), ("Bob", 30), ("Andras", 12), ("Bob", 35)]

## Task #1: Calculating avg. scores through the RDD API

Since Spark knows nothing about the structure of the data stored in an RDD, we have to write very explicit code to calculate these averages:

0. First, copy the data from a Python list into the Spark framework. We can use the SparkContext's **parallelize()** method, as we did in the previous lab.

1. Next, transform the dataset into key-value pairs. This should not be very difficult, as the name will be the key. For the value, however, we also need to add an extra field that keeps track of the number of elements, because we will need this count when calculating the average.

2. Then perform a **reduceByKey** operation. This is a regular **reduce**, but it is applied separately to the values that share the same key. During the reduce step we simply sum both value fields (the scores and the element counts).

3. The last computation step is to calculate the average itself by dividing the sum of scores by the number of elements.

4. Finally, use the RDD's **collect()** method to copy the resulting RDD back into a regular Python list so that we can print the result.
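To make the logic of steps 1 to 3 concrete before writing the Spark version, here is a minimal plain-Python sketch of the same computation. It does not use Spark at all: `reduce_by_key` is a hypothetical helper written only for this illustration, and it merely imitates what **reduceByKey** does on a single machine. The real RDD method distributes this work across partitions.

```python
# Same sample data as in the notebook cell above.
data = [("Andras", 10), ("Bob", 20), ("Bob", 30), ("Andras", 12), ("Bob", 35)]


def reduce_by_key(pairs, func):
    """Hypothetical helper (not a Spark API): merge all values that share
    a key using func, similar in spirit to RDD.reduceByKey."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


# Step 1: turn (name, score) into (name, (score, 1)) so the count travels with the score.
pairs = [(name, (score, 1)) for name, score in data]

# Step 2: per key, sum the scores and the counts field by field.
totals = reduce_by_key(pairs, lambda a, b: (a[0] + b[0], a[1] + b[1]))

# Step 3: divide the summed score by the number of elements.
averages = [(name, total / count) for name, (total, count) in totals]

print(averages)
```

With this data, Andras ends up with a total of 22 over 2 tests and Bob with 85 over 3 tests. Carrying the count alongside the score is what allows the averages to be computed from a single reduce pass.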

In [4]:
rdd = sc.parallelize(data)

# rdd = rdd.map(..).reduceByKey(...) .... 

rdd.collect()

[('Andras', 10), ('Bob', 20), ('Bob', 30), ('Andras', 12), ('Bob', 35)]

### Question: What are the drawbacks of this solution?


## Task #2: Calculating avg. scores with DataFrames

Let's implement the same logic as above with the DataFrame API.


In [None]:
# The data is already structured (has 2 columns), so we can create a dataframe.
df = spark.createDataFrame( ... )

# Aggregate the dataframe by the Name column.
df.groupBy(**column name**).agg(**aggregate_function**).show()

### Question: What are the advantages of the DataFrame API over RDDs?
