# Window Functions

Spark also supports window functions for aggregations. Window functions allow more complex aggregations like sliding windows or ranking, where for each row a set of 'surrounding' rows is used to calculate an additional metric.

In this example, we use the weather data and add a sliding average temperature to the existing columns. The resulting DataFrame will contain both metrics: the actual temperature (as stored in the original records) and an averaged value.

In [None]:
from pyspark.sql import SparkSession

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","4G") \
        .getOrCreate()

spark

# 1 General Preparations

First we enable Matplotlib inline graphics and set the base location of all data.

In [None]:
%matplotlib inline

In [None]:
storageLocation = "s3://dimajix-training/data/weather"

# 2 Loading Data

Again we load data for the single year 2003 from S3 (or whatever storage location is used).

In [None]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

rawWeatherData = spark.read.text(storageLocation + "/2003")
weatherData = rawWeatherData.select(
    substring(col("value"),5,6).alias("usaf"),
    substring(col("value"),11,5).alias("wban"),
    to_timestamp(substring(col("value"),16,12),"yyyyMMddHHmm").alias("timestamp"),
    to_timestamp(substring(col("value"),16,12),"yyyyMMddHHmm").cast("long").alias("ts"),
    substring(col("value"),42,5).alias("report_type"),
    substring(col("value"),61,3).alias("wind_direction"),
    substring(col("value"),64,1).alias("wind_direction_qual"),
    substring(col("value"),65,1).alias("wind_observation"),
    (substring(col("value"),66,4).cast("float") / lit(10.0)).alias("wind_speed"),
    substring(col("value"),70,1).alias("wind_speed_qual"),
    (substring(col("value"),88,5).cast("float") / lit(10.0)).alias("air_temperature"),
    substring(col("value"),93,1).alias("air_temperature_qual")
)

In [None]:
# Make the weather data available as a temporary view
weather_all = weatherData.cache()
weather_all.createOrReplaceTempView("weather_all")

In [None]:
# Peek inside the data, just to make sure everything looks right
# YOUR CODE HERE

# 3 Pick a single station

For our first steps, we limit ourselves to a single weather station. We pick the one with `usaf='954920'` and `wban='99999'`. This is enough to demonstrate the basic use of window functions for a sliding average.

In [None]:
weather_single = weatherData.where("usaf='954920' and wban='99999'").cache()
weather_single.createOrReplaceTempView("weather_single")

In [None]:
# Peek inside the data, just to make sure everything looks right
# YOUR CODE HERE

# 4 Sliding Average

Now we want to calculate the sliding average of the temperature as an additional metric. First we use SQL for that task; later we will see how to use the DataFrame API to perform the same task.

In order to perform a windowed aggregation, you use the following syntax to specify a column expression:
```
    AGGREGATE_FUNCTION(columns) OVER(window_specification)
```
The term `window_specification` is constructed from the following components:
```
    PARTITION BY category
    ORDER BY ordering_column [ASC|DESC]
    RANGE BETWEEN start PRECEDING AND end FOLLOWING
```

* `PARTITION BY` works similarly to a `GROUP BY` operation. It controls which rows are in the same partition as the given row, so only records from the same partition are used for creating each window. It also ensures that all rows having the same value for the category column are collected on the same machine before ordering them and calculating the frame. If no partitioning specification is given, then all data must be collected on a single machine.
* `ORDER BY` sorts all records of a single partition accordingly. It controls the way that rows in a partition are ordered, determining the position of the given row in its partition.
* `RANGE BETWEEN` states which rows will be included in the frame for the current input row, based on the value of their ordering column relative to that of the current row. For example, with an ordering column measured in seconds, "3600 preceding to the current row" describes a frame including the current input row and all rows whose ordering value is at most 3600 seconds smaller.
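
To make the three components above concrete, here is a minimal plain-Python sketch (not Spark code) of how such a windowed average could be evaluated on one machine. The helper `windowed_avg` and the sample rows are made up for illustration; Spark evaluates windows in a distributed and far more efficient way.

```python
import statistics
from collections import defaultdict


def windowed_avg(rows, preceding, following):
    """Hypothetical helper modelling
    AVG(value) OVER (PARTITION BY category ORDER BY ts
                     RANGE BETWEEN preceding PRECEDING AND following FOLLOWING)
    for rows given as (category, ts, value) tuples.
    """
    # PARTITION BY: a row only ever sees rows of its own category
    partitions = defaultdict(list)
    for category, ts, value in rows:
        partitions[category].append((ts, value))

    output = []
    for category, part in partitions.items():
        # ORDER BY: sort each partition by the ordering column
        part.sort()
        for ts, value in part:
            # RANGE BETWEEN: keep rows whose ordering value lies in
            # [ts - preceding, ts + following]
            frame = [v for t, v in part if ts - preceding <= t <= ts + following]
            output.append((category, ts, value, statistics.mean(frame)))
    return output


# Made-up sample data: two categories with irregular ordering values
rows = [
    ("A", 0, 10.0), ("A", 10, 20.0), ("A", 25, 30.0),
    ("B", 0, 100.0), ("B", 5, 200.0),
]
averaged = windowed_avg(rows, preceding=10, following=0)
for row in averaged:
    print(row)
```

Note that the rows of category `B` never influence the averages of category `A`, and that the row at `ts=25` averages only itself because no other row of `A` lies within 10 units before it.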

### Frame Types

As an alternative to `RANGE BETWEEN` there is also `ROWS BETWEEN`. While `RANGE BETWEEN` refers to the values of the sorting column, `ROWS BETWEEN` simply counts the number of rows. Both window types have their use: `RANGE BETWEEN` is perfect for sliding averages over time windows of constant duration, while `ROWS BETWEEN` is useful for ordered entries lacking a proper arithmetic scale.
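
The difference between the two frame types shows up as soon as the ordering values are irregularly spaced. The following plain-Python sketch (again only a model of the semantics, with made-up sample data and hypothetical helper names) builds both kinds of frame for the same row.

```python
# Irregularly spaced samples, sorted by ts: (ts in seconds, temperature)
samples = [(0, 10.0), (60, 12.0), (3000, 20.0), (3060, 22.0)]


def range_frame(samples, i, preceding):
    """Models RANGE BETWEEN <preceding> PRECEDING AND CURRENT ROW:
    all rows whose ts is at most `preceding` smaller than the current ts."""
    ts = samples[i][0]
    return [v for t, v in samples if ts - preceding <= t <= ts]


def rows_frame(samples, i, preceding):
    """Models ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW:
    the current row plus up to `preceding` rows directly before it."""
    start = i - preceding if i >= preceding else 0
    return [v for _, v in samples[start:i + 1]]


current = 2  # index of the sample at ts=3000
range_values = range_frame(samples, current, preceding=120)
rows_values = rows_frame(samples, current, preceding=2)
print(range_values)  # only samples from the last 120 seconds
print(rows_values)   # the last three samples, however old they are
```

Here the range frame contains a single value, because the two older samples are almost an hour away, while the rows frame still picks them up.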

### Boundaries

Both frame types (range and rows) support different boundary types:
* `UNBOUNDED PRECEDING`
* `UNBOUNDED FOLLOWING`
* `CURRENT ROW`
* `<value> PRECEDING`
* `<value> FOLLOWING`
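
As a rough illustration of how these boundaries combine, the following plain-Python sketch mimics three common `ROWS BETWEEN` frames on a small made-up list. It models the semantics only and is not Spark code.

```python
import math
from itertools import accumulate

values = [3.0, 1.0, 4.0, 1.0, 5.0]  # one partition, already in ORDER BY order
n = len(values)

# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -> running total
running_total = list(accumulate(values))

# ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING -> sum over a centered 3-row frame
# (the frame is simply cut off at the edges of the partition)
centered = [math.fsum(values[(i - 1 if i > 0 else 0):i + 2]) for i in range(n)]

# ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING -> whole partition
whole = [math.fsum(values)] * n

print(running_total)
print(centered)
print(whole)
```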

## 4.1 Sliding average calculation

In [None]:
result = spark.sql("""
-- YOUR CODE HERE
""").toPandas()

result

### Draw a picture

In order to verify our approach, let's draw a picture with Matplotlib, which shows the current temperature and the sliding average in a single plot.

In [None]:
# YOUR CODE HERE

## 4.2 Window Aggregation Functions

We already used simple standard aggregation functions, which are also available without windows. But there are also some special functions that were specifically designed for windowed aggregation. The ranking functions as well as `lag` and `lead` cannot be used without a window definition.

These are:

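| Function class | SQL | DataFrame Function | Description |
|----------------|-----|--------------------|-------------|
| Ranking functions | `rank` | `rank` | Get rank in window |
| | `dense_rank` | `dense_rank` | Get rank in window without gaps after ties |
| | `percent_rank` | `percent_rank` | Get relative rank in window, between 0 and 1 |
| | `ntile` | `ntile` | Get bucket number when the window is split into n buckets |
| | `row_number` | `row_number` | Get row number in window |
| Analytic functions | `cume_dist` | `cume_dist` | Get cumulative distribution of the value in window |
| | `first_value` | `first` | Pick first value in window |
| | `last_value` | `last` | Pick last value in window |
| | `lag` | `lag` | Pick preceding value |
| | `lead` | `lead` | Pick following value |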
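
To get a feeling for what some of these functions return, here is a small plain-Python sketch that reproduces their results for a single, already sorted window. The variable names are deliberately different from the Spark function names so that nothing imported in this notebook gets overwritten. The data is made up.

```python
temps = [5.0, 7.0, 7.0, 9.0]  # one window, already sorted by the ORDER BY column

# row_number: consecutive numbering, ties are broken arbitrarily
row_numbers = list(range(1, len(temps) + 1))

# rank: ties share a rank, and the following rank skips ahead
ranks = [temps.index(t) + 1 for t in temps]

# dense_rank: ties share a rank, but there are no gaps afterwards
distinct = sorted(set(temps))
dense_ranks = [distinct.index(t) + 1 for t in temps]

# lag / lead with offset 1: NULL (None here) where no such row exists
lagged = [None] + temps[:-1]
leading = temps[1:] + [None]

print(row_numbers, ranks, dense_ranks)
print(lagged, leading)
```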

## 4.3 Exercise: Comparing to previous day

Another use case for window functions is to compare today's temperature to yesterday's at the same time. This can be achieved by using the function `FIRST_VALUE` together with an appropriate window with a range from 86400 (the number of seconds in one day) preceding to the current row.
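
Before writing the query, it may help to see what such a frame actually returns. The plain-Python sketch below models `FIRST_VALUE` over a range of 86400 seconds preceding the current row, using made-up readings and a hypothetical helper. It is not the solution to the exercise, only an illustration of the semantics.

```python
ONE_DAY = 86400  # seconds

# Made-up readings sorted by ts: (ts in seconds, temperature)
readings = [
    (0, 10.0), (3600, 11.0), (7200, 12.0),
    (86400, 15.0), (90000, 16.0), (95000, 17.0),
]


def first_value_in_range(readings, i, preceding):
    """Models FIRST_VALUE(temp) OVER (ORDER BY ts
    RANGE BETWEEN <preceding> PRECEDING AND CURRENT ROW)."""
    ts = readings[i][0]
    for t, temp in readings:
        # The first row inside the frame wins; the current row is always inside
        if ts - preceding <= t <= ts:
            return temp


prev_temps = [first_value_in_range(readings, i, ONE_DAY) for i in range(len(readings))]
print(prev_temps)
```

The result is the earliest reading inside the frame. That is the value from exactly 24 hours ago only if a reading exists at that time; otherwise it is the next later one (see the last row), and during the first day it is simply the first reading overall.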

**Exercise**: Create a DataFrame with the columns `timestamp`, `temp` (current temperature) and `prev_temp` (previous temperature) and plot the first 300 records.

In [None]:
# YOUR CODE HERE

### Draw a picture

Again, draw a picture of the result.

In [None]:
# YOUR CODE HERE

# 5 DataFrame Window API

In addition to the SQL interface, there is also a direct Python interface for creating windowed aggregations. Let us reformulate the initial sliding window average aggregation using the Spark DataFrame API instead of SQL.

## 5.1 Sliding average

In [None]:
from pyspark.sql.window import Window

window_spec = # YOUR CODE HERE

result = # YOUR CODE HERE

result

### Draw a picture

Using Matplotlib, let's make a picture containing the current temperature and the average temperature in a single plot.

In [None]:
result.plot(x='timestamp', y=['temp','avg_temp'], figsize=[16,8])

## 5.2 Exercise: Compare temperature to previous day

Now perform the same task as the previous exercise: Make a plot of the current temperature and the one 24h ago using the `first` function. But this time, use the DataFrame API instead of SQL.

In [None]:
from pyspark.sql.window import Window

window_spec = # YOUR CODE HERE

result = # YOUR CODE HERE

result

### Draw a picture

In order to verify our approach, let's draw a picture with Matplotlib, which shows the current temperature and the previous temperature in a single plot.

In [None]:
# YOUR CODE HERE

## 5.3 Partitioned Windows

So far we only used windows covering a specific time range. This was good enough, since we were only looking at a single station. But in most cases, you want to perform analyses covering multiple different entities (different weather stations in this example). In these cases you also need to *partition* the aggregation window, such that only records from the same entity are processed.

Let us calculate the difference between the current temperature and the average of the last day, but this time for all stations at once.
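
The intended result can be sketched in plain Python as follows. This only models the semantics with made-up stations and readings; `None` stands in for a SQL `NULL`, that is, a reading without a valid temperature. Like SQL's `AVG`, the sketch skips such values when averaging.

```python
import statistics
from collections import defaultdict

ONE_DAY = 86400  # seconds

# Made-up data: (station, ts, temp); None marks an invalid reading
rows = [
    ("S1", 0, 10.0), ("S1", 43200, None), ("S1", 86400, 14.0),
    ("S2", 0, 30.0), ("S2", 86400, 20.0),
]

# Partitioning: every station gets its own list of readings
by_station = defaultdict(list)
for station, ts, temp in rows:
    by_station[station].append((ts, temp))

diffs = []
for station, readings in by_station.items():
    readings.sort(key=lambda r: r[0])
    for ts, temp in readings:
        # Frame: same station, last 24 hours, invalid readings skipped
        frame = [v for t, v in readings if ts - ONE_DAY <= t <= ts and v is not None]
        avg = statistics.mean(frame) if frame else None
        # Any NULL operand makes the difference NULL as well
        diff = temp - avg if temp is not None and avg is not None else None
        diffs.append((station, ts, diff))

for row in diffs:
    print(row)
```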

In [None]:
from pyspark.sql.window import Window

window_spec = # YOUR CODE HERE
    
# Common column expression for valid temperature value or NULL otherwise    
valid_temp = # YOUR CODE HERE

result = # YOUR CODE HERE

result.limit(300).toPandas()

### Draw a Picture

In order to check the result, we again pick a single station. But this time, we pick it from the final result and not from the input data.

In [None]:
pdf = result.where("usaf='954920' and wban='99999'").limit(300).toPandas()
pdf.plot(x='timestamp', y=['temp','temp_avg_diff'], figsize=[16,8])

## 5.4 Exercise: Min/Max Change Analysis

Now we want to calculate for every weather station:
* The maximum upward difference of temperature within 5 days
* The maximum downward difference of temperature within 5 days

Logically, we want to perform the following steps for every weather station:
1. For every measurement, look back five days
2. Within these five days, find the minimum and maximum temperature
3. Calculate the difference between the current temperature and the minimum (`temp_rise`) and between the maximum and the current temperature (`temp_fall`)
4. Calculate the overall maximum of `temp_rise` and `temp_fall` per station for the whole year
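
The four steps can be sketched in plain Python on a handful of made-up readings. This is only a model of the logic, not the Spark solution of the exercise. It calls `builtins.min` and `builtins.max` explicitly, because the star import of `pyspark.sql.functions` earlier in this notebook replaces the names `min` and `max` with Spark column functions.

```python
import builtins
from collections import defaultdict

ONE_DAY_S = 24 * 60 * 60
FIVE_DAYS_S = 5 * ONE_DAY_S

# Made-up data: (station, ts, temp)
rows = [
    ("S1", 0 * ONE_DAY_S, 10.0), ("S1", 1 * ONE_DAY_S, 4.0), ("S1", 2 * ONE_DAY_S, 12.0),
    ("S2", 0 * ONE_DAY_S, 20.0), ("S2", 1 * ONE_DAY_S, 21.0), ("S2", 7 * ONE_DAY_S, 5.0),
]

by_station = defaultdict(list)
for station, ts, temp in rows:
    by_station[station].append((ts, temp))

max_change = {}
for station, readings in by_station.items():
    rises, falls = [], []
    for ts, temp in readings:
        # Steps 1 and 2: look back five days and find minimum and maximum
        frame = [v for t, v in readings if ts - FIVE_DAYS_S <= t <= ts]
        # Step 3: rise above the recent minimum, fall below the recent maximum
        rises.append(temp - builtins.min(frame))
        falls.append(builtins.max(frame) - temp)
    # Step 4: overall maximum per station
    max_change[station] = (builtins.max(rises), builtins.max(falls))

print(max_change)
```

For station `S2` the drop to 5.0 does not count as a fall, because the previous readings are more than five days old and therefore outside the frame.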

In [None]:
# Calculate the number of seconds for five days
one_day = 24*60*60
five_days = 5*one_day
five_days

In [None]:
# Create a window, which creates a new partition per weather station and looks back 5 days
window_spec = # YOUR CODE HERE
    
# Create a column representing a valid temperature or NULL otherwise    
valid_temp = # YOUR CODE HERE

# Calculate the difference for each day from the maximum and minimum temperature of the last five days using the window
# The resulting DataFrame should have the following columns:
#   timestamp
#   usaf
#   wban
#   temp_rise = valid_temp - min(valid_temp).over(window_spec)
#   temp_fall = max(valid_temp).over(window_spec) - valid_temp
weather_rise_fall = # YOUR CODE HERE

# Calculate the maximum rise and fall for each station for the whole year. This should be done by a simple grouped aggregation.
# The groups are determined by the weather station id, which is given by usaf and wban
result = # YOUR CODE HERE

# Finally show the whole result by converting it to a Pandas DataFrame
result.toPandas()