# Window functions

In this notebook you will:
* solve an analytical question using window functions

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, unix_timestamp, row_number, lead, avg
from pyspark.sql import Window

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('WF I')
    .getOrCreate()
)

# Task 1

* compute the average time between two consecutive answers for each user who posted at least 2 answers

In [None]:
base_path = os.getcwd()

project_path = '/'.join(base_path.split('/')[:-3])

data_input_path = os.path.join(project_path, 'data/answers')

In [None]:
answersDF = (
    spark
    .read
    .option('path', data_input_path)
    .load()
)

#### Take only users that answered at least 2 questions:

Hint:
* Define a [Window](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) per user_id
* Use count as a window function
* Filter only users with count > 1
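
As a mental model for the hint above, the following plain-Python sketch (no Spark required) emulates what `count` returns over an unordered window and over an ordered one. The rows and variable names are made up for illustration and are not part of the dataset; the dates within each user are kept distinct so that ties do not complicate the picture.

```python
from collections import Counter

# Hypothetical (user_id, creation_date) rows; dates within a user are distinct.
rows = [
    ("a", "2020-01-01"),
    ("a", "2020-01-03"),
    ("a", "2020-01-02"),
    ("b", "2020-01-05"),
]

# Unordered window: the frame is the whole partition, so every row of a
# user gets the same count.
partition_size = Counter(user_id for user_id, _ in rows)
full_count = [(u, d, partition_size[u]) for u, d in rows]

# Ordered window with the default frame: the frame ends at the current
# row, so the count grows along the sort order (a running count).
running_count = []
seen = Counter()
for u, d in sorted(rows):
    seen[u] += 1
    running_count.append((u, d, seen[u]))

# Keeping users with at least two answers relies on the partition-wide count.
at_least_two = [(u, d) for u, d, n in full_count if n > 1]
```

With the running count, the first answer of every user would carry the value 1 and be dropped by the `count > 1` filter, which is why the partition-wide count is the one to use here.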

In [None]:
# Do not sort this window. Without orderBy the frame covers the whole partition,
# so count sees all records for a given user_id.
# With orderBy the default frame becomes
# .rangeBetween(Window.unboundedPreceding, Window.currentRow), which yields a running count,
# and you would have to add .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) explicitly.

w = Window().partitionBy('user_id')

data = (
    answersDF
    .filter(col('user_id').isNotNull())
    .withColumn('r', count('*').over(w))
    .filter(col('r') > 1)
)

In [None]:
data.orderBy('user_id', 'answer_id').show(n=5)

#### Compute the average time between answers:

Hint:
* Define a new window, also per user, but sorted by creation_date
* Use the [lead](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.lead.html#pyspark.sql.functions.lead) function to add a new column that contains the creation date of the next answer
* Compute the time difference (use [unix_timestamp](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.unix_timestamp.html#pyspark.sql.functions.unix_timestamp), which returns the timestamp in seconds)
* Group by user and compute the average
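
The steps above can be sketched in plain Python to check the logic on a tiny input before running it in Spark. This is only an emulation under assumed data: the rows below are hypothetical, and `zip(dates, dates[1:])` stands in for what `lead` provides over a window sorted by creation date.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (user_id, creation_date) rows, not taken from the dataset.
rows = [
    ("a", datetime(2020, 1, 1, 12, 0, 0)),
    ("a", datetime(2020, 1, 1, 12, 0, 30)),
    ("a", datetime(2020, 1, 1, 12, 2, 0)),
    ("b", datetime(2020, 1, 2, 8, 0, 0)),
    ("b", datetime(2020, 1, 2, 9, 0, 0)),
]

# Partition by user.
by_user = defaultdict(list)
for user_id, created in rows:
    by_user[user_id].append(created)

avg_gap = {}
for user_id, dates in by_user.items():
    # Sort each partition by creation date.
    dates.sort()
    # Pair each answer with the next one; the last answer has no
    # successor and drops out, like the null produced by lead.
    gaps = [(nxt - cur).total_seconds() for cur, nxt in zip(dates, dates[1:])]
    if gaps:
        avg_gap[user_id] = sum(gaps) / len(gaps)
```

A user with n answers contributes n - 1 gaps, so user `a` averages over two gaps (30 s and 90 s) and user `b` over a single one.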

In [None]:
w2 = Window().partitionBy('user_id').orderBy('creation_date')

resultDF = (
    data
    .withColumn('next_answer', lead('creation_date').over(w2))
    .filter(col('next_answer').isNotNull())
    .withColumn('diff', unix_timestamp(col('next_answer')) - unix_timestamp(col('creation_date'))) # in sec
    .groupBy('user_id')
    .agg(
        avg('diff').alias('avg_response_period')
    )
    .orderBy('avg_response_period')
)

In [None]:
resultDF.show(n=10)

#### Bonus question (if you have time)

* Check the answers of the user with the fastest average response time. Also check the questions that correspond to these answers.

In [None]:
answersDF.filter(col('user_id') == '731255').select('question_id', 'creation_date', 'body').collect()

In [None]:
(
    spark.read.parquet(os.path.join(project_path, 'output/questions-transformed'))
    .filter(col('question_id').isin(['9833024', '9833336']))
    .select('tags', 'creation_date', 'title')
).show(truncate=False)

To read more about window functions, see my [article](https://towardsdatascience.com/spark-sql-102-aggregations-and-window-functions-9f829eaa7549).

In [None]:
spark.stop()