## Compute and visualize the response time

* Compute the response time
  * for each question, compute the time it took to get an accepted answer
  * consider only questions with an accepted answer
* Plot the number of answered questions as a function of response time
  * choose the hour as the time unit
  * create a bar chart (to see how many questions were answered within the first hour, within the second hour, and so on)
  * plot a cumulative sum (to see, for example, how many questions in total were answered within the first 10 hours)

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, unix_timestamp, ceil

import os
import matplotlib.pyplot as plt

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Interactive Analytics I')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# the project root is three directory levels above the current working directory
project_path = '/'.join(base_path.split('/')[:-3])

answers_input_path = os.path.join(project_path, 'data/answers')

questions_input_path = os.path.join(project_path, 'data/questions-json')

In [None]:
answersDF = (
    spark
    .read
    .option('path', answers_input_path)
    .load()
)

In [None]:
questionsDF = (
    spark
    .read
    .format('json')
    .option('path', questions_input_path)
    .load()
)

#### Compute response time:

For each question, compute how long it took to get an accepted answer. Consider only questions that actually have an accepted answer.

Hint:
* for each question and answer we know the time when it was created (`creation_date`)
* [join](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join) questions with answers (use the `accepted_answer_id` field in the join)
* use [unix_timestamp](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.unix_timestamp.html#pyspark.sql.functions.unix_timestamp) to compare the times (or cast the timestamps to long)
* convert to hours
* [ceil](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.ceil.html#pyspark.sql.functions.ceil) the numbers
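The arithmetic behind the last three hints can be sketched in plain Python, independent of Spark. This is only an illustration: the helper name `response_hours` and the timestamps are made up and do not come from the dataset.

```python
import math
from datetime import datetime


def response_hours(question_time: datetime, answer_time: datetime) -> int:
    """Return the 1-based hour bucket in which the answer arrived."""
    # Difference in seconds, analogous to subtracting two unix timestamps.
    seconds = (answer_time - question_time).total_seconds()
    # Convert to hours and round up: bucket 1 means "within the first hour".
    return math.ceil(seconds / 3600)


# Hypothetical timestamps, for illustration only.
asked = datetime(2021, 1, 1, 12, 0, 0)
after_20_minutes = response_hours(asked, datetime(2021, 1, 1, 12, 20, 0))
after_exactly_1_hour = response_hours(asked, datetime(2021, 1, 1, 13, 0, 0))
after_1_hour_1_second = response_hours(asked, datetime(2021, 1, 1, 13, 0, 1))

print(after_20_minutes, after_exactly_1_hour, after_1_hour_1_second)
```

Rounding up means that an answer arriving after exactly one hour still falls into the first bucket, while one second later it moves to the second.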

In [None]:
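hourly_data = (
    # questions come from JSON, so cast creation_date to a timestamp first
    questionsDF.withColumn('creation_date', col('creation_date').cast('timestamp')).alias('questions')
    # inner join keeps only questions that have an accepted answer
    .join(answersDF.alias('answers'), questionsDF['accepted_answer_id'] == answersDF['answer_id'])
    .select(
        col('questions.creation_date').alias('question_time'),
        col('answers.creation_date').alias('answer_time')
    )
    # response time in seconds
    .withColumn('response_time', unix_timestamp('answer_time') - unix_timestamp('question_time'))
    # keep only rows where the answer came after the question
    .filter(col('response_time') > 0)
    # convert seconds to hours and round up (1 = answered within the first hour)
    .withColumn('hours', ceil(col('response_time') / 3600))
)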

#### Aggregate the data and visualize:

Hint:
* group by hour
* count
* convert to Pandas
* visualize (take the first 24 hours to cut off the long tail)
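As a rough mental model of the group-and-count step, here is a plain-Python sketch on a handful of made-up hour buckets (the values are hypothetical, not taken from the dataset). It drops the long tail by filtering on the hour value; taking the first 24 rows of the sorted result gives the same rows whenever no hour bucket is empty.

```python
from collections import Counter

# Hypothetical hour buckets, one entry per answered question.
hours = [1, 1, 2, 1, 3, 2, 30]

# Group by hour and count the questions in each bucket.
counts = Counter(hours)

# Order by hour and keep only the first 24 hours (drops the bucket 30).
first_day = sorted(
    (hour, cnt) for hour, cnt in counts.items() if hour <= 24
)

print(first_day)
```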

In [None]:
hourly_data_grouped = (
    hourly_data
    .groupBy('hours')
    .agg(count('*').alias('cnt'))
    .orderBy('hours')
)

In [None]:
# rows are ordered by hour, so limit(24) keeps the 24 earliest hour buckets
hourly_data_local = hourly_data_grouped.limit(24).toPandas()

In [None]:
# inspect the data locally:

hourly_data_local.head(5)

For a bar chart you can use `df.plot.bar` (or, equivalently, `df.plot` with `kind='bar'`).

In [None]:
hourly_data_local.plot(
    x='hours', y='cnt', figsize=(12, 6), 
    title='Response time of questions',
    legend=False,
    kind='bar',
    xlabel='Hour',
    ylabel='Number of answered questions'
)
plt.show()

#### Note

As you can see, a large portion of the questions that have an accepted answer were answered within the first hour.

#### Cumulative sum

* to compute the cumulative sum you can use [cumsum()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cumsum.html)
* add a new column to the Pandas DataFrame as `df['new_col'] = df['cnt'].cumsum()`
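To see what `cumsum()` produces, here is a minimal sketch on a toy DataFrame. The counts are made up for illustration; the column names `hours` and `cnt` mirror the ones used in this notebook.

```python
import pandas as pd

# Made-up counts per hour bucket, already sorted by hour.
df = pd.DataFrame({'hours': [1, 2, 3, 4], 'cnt': [50, 20, 10, 5]})

# Running total: each row holds the sum of cnt up to and including that hour.
df['cumsum'] = df['cnt'].cumsum()

print(df['cumsum'].tolist())
```

The running total is only meaningful when the rows are sorted by hour, which is why the grouped data is ordered before it is converted to Pandas.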

In [None]:
hourly_data_local['cumsum'] = hourly_data_local['cnt'].cumsum()

hourly_data_local.plot(
    x='hours',
    y='cumsum',
    figsize=(12, 6),
    title='Cumulative number of answered questions',
    xlabel='Hour',
    ylabel='Number of answered questions',
    legend=False
)
plt.show()

In [None]:
spark.stop()