# Using window functions, solve the following problems:

1. Consider users who answered at least 5 questions and had their answers accepted.
2. Compute their average response time.
3. Identify users who always improved: each time they answered a new question, their response time was shorter than for the previous one.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, unix_timestamp, avg, lag, when, every
from pyspark.sql import Window

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Window Functions')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

project_path = '/'.join(base_path.split('/')[:-3])

answers_input_path = os.path.join(project_path, 'data/answers')

questions_input_path = os.path.join(project_path, 'data/questions-json')

In [None]:
# create the input dataframes for answers and questions:


### 1. Consider users who answered at least 5 questions and had their answers accepted.

Hint:
* Join answers with questions and use accepted_answer_id as the joining key
* Filter out rows where user_id is Null
* Define a [Window](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) per user_id (from the answers dataset)
* Use `count` as the window function
* Filter only users with count >= 5

In [None]:
# For this window, avoid sorting: without a sort, all records for a given user_id
# are considered in the window.
# With a sort, the default frame becomes .rangeBetween(Window.unboundedPreceding, Window.currentRow),
# so you would have to add the frame definition .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing).

# your code here:


### 2. Compute average response time for each user

Hint:
* to compute the response time, subtract the creation_times (either cast them to long or use [unix_timestamp](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.unix_timestamp.html))
* compute avg of the response_time over the same window that you used above
* sort the result by the average response time to see which users answer questions quickly

In [None]:
# your code here:


### 3. Identify users that always improved: when answering a new question, their response time decreased as compared to the previous one.

Hint:
* to compare previous with current response time, you can use [lag](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.lag.html)
  * note that the function `lag` requires sorting the window
  * note that the first row in each window has no previous value, so `lag` returns Null for it
* add a new column that indicates whether the user improved the response_time
  * use a when-otherwise condition and assign True to the row if the response_time is lower than in the previous row
* you can use [every](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.every.html) to check if the user improved every time

In [None]:
# your code here:


To read more about window functions check my [article](https://towardsdatascience.com/spark-sql-102-aggregations-and-window-functions-9f829eaa7549).

In [None]:
spark.stop()