# <u><p style="text-align: center;">DataFrames</p></u>

### Learning goals  
Students will:  
* Learn about Spark DataFrames

### Background

As we have seen, RDDs are the building blocks of Spark. RDDs have several advantages, but in some cases their use can be problematic. One reason is that Spark does not optimize transformations that we apply directly to RDDs. Another is that working with RDDs from some programming languages (including Python) can lead to poor performance. Finally, chains of RDD transformations can be difficult to comprehend, since they show how the result will be computed but not what the result will be.

Spark **DataFrames** were conceived to overcome these problems. Like RDDs, DataFrames are distributed collections of data. The difference is that DataFrames provide a high-level abstraction over RDDs that allows us to manipulate data with a query language. Behind this abstraction is a logical plan that describes the data and its schema, and Spark converts this logical plan into a physical plan for execution. This brings us closer to stating **what** we want to do rather than **how** we have to do it, because we let Spark figure out the most efficient way to carry out the operations. DataFrames are generally faster than RDDs, and their performance is largely the same no matter which programming language we use with Spark.

### Code examples

Before proceeding to the examples, we are going to initialize Spark:

In [None]:
from pyspark import SparkContext
import os

#'swan_spark_conf' is a configuration provided by a plugin for Jupyter. We further extend this configuration with proxy settings.
swan_spark_conf = swan_spark_conf.setAll([('spark.ui.proxyBase', os.environ['JUPYTERHUB_SERVICE_PREFIX'] + 'proxy/4040')])

#instantiate a SparkContext object with our configuration
sc = SparkContext.getOrCreate(conf=swan_spark_conf)

#### Example 1:

A DataFrame can be created directly from a local collection of rows together with a list of column names:

In [None]:
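from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sqlContext is not defined earlier in this notebook, so we create it from sc
cows = sqlContext.createDataFrame([("Joel", "Angus", 450), ("Marcia", "Belted Galloway", 320), ("Gregor", "Hereford", 390), ("Anne", "Angus", 400), ("Ravi", "Belted Galloway", 250)],
                                  ("Name", "Breed", "Weight"))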

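Two operations that we can apply to this DataFrame are `orderBy`, which sorts the rows, and `groupBy`, which groups rows that share a value so that they can be aggregated.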

#### Example 2:

DataFrames provide a convenient way to work with tabular data. In this example, we are going to read a file with Spark and convert it into a DataFrame. The file contains the minimum and maximum daily temperatures for the years 2010-2015 in De Bilt, Netherlands.

Then, we are going to find the minimum and maximum temperatures that occurred during these years and also count on how many days the temperature was below 0 $^\circ$C.

So, the first step is to load the data into a DataFrame:

In [None]:
from pyspark.sql import SQLContext

sqlc = SQLContext(sc)

dataDF = sqlc.read.csv('/home/jovyan/datasets/knmi-debilt.csv', header=True, inferSchema=True)

and then examine what the data look like by using the function `show`:

In [None]:
dataDF.show()

Dates are formatted as YYYYMMDD and temperatures are in degrees Celsius.

Next, to find the minimum and maximum temperatures we are going to use **aggregations** over the DataFrame. We can perform aggregations by using the `agg` function. The parameters of `agg` are expressions that indicate the aggregations that we want to perform. To find the maximum temperature, a possible solution is:

In [None]:
from pyspark.sql import functions as F

result = dataDF.agg(F.max("Tmax")) #notice that Tmax is the name of the column
result.show()

and similarly for the minimum:

In [None]:
result = dataDF.agg(F.min("Tmin")) #notice that Tmin is the name of the column
result.show()

Now, to find on how many days the temperature was below 0 $^\circ$C, we are first going to keep only the days that satisfy this condition by using the `filter` function:

In [None]:
below_zeroDF = dataDF.filter(F.col("Tmin") < 0)

followed by the `count` function:

In [None]:
below_zeroDF.count()

#### Example 3:

When working with DataFrames we can also write SQL queries against them. Using the previous dataset, we are going to extract the date of the maximum temperature with an SQL query:

In [None]:
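dataDF.createOrReplaceTempView("temperatures")  # register the DataFrame under a name that SQL can refer to
df2 = sqlc.sql("SELECT Date FROM temperatures WHERE Tmax = (SELECT MAX(Tmax) FROM temperatures)")  # sqlc is the SQLContext from Example 2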

### Quiz

### More advanced examples:

In [None]:
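dataDF.createOrReplaceTempView("data_view")  # unlike createTempView, this is safe to run more than once
df2 = sqlc.sql("SELECT Date FROM data_view WHERE Tmax = (SELECT MIN(Tmax) FROM data_view)")  # date(s) with the lowest daily maximum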

### Further reading

* 