# Interactive Analytics

In this notebook you will answer two basic analytical questions about the data and visualise the results with the Python library Matplotlib. Along the way we will see one way Spark integrates with the Python library Pandas, which in turn gives you access to other libraries of the Python ecosystem, for example Matplotlib for visualisation.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, count, unix_timestamp, when, lit, ceil

import os
import matplotlib.pyplot as plt

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Interactive Analytics I')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

project_path = '/'.join(base_path.split('/')[:-3])

answers_input_path = os.path.join(project_path, 'data/answers')

questions_input_path = os.path.join(project_path, 'output/questions-transformed')

# Task I

* Find out how many answers are being produced per week
* Plot the time evolution: put the date dimension on the x axis and the number of answers per week on the y axis

#### Read the data from the source:

In [None]:
# use the answers dataset
# your code here:


#### Group the data

Hint:
* use `groupBy(window)`, where the `window` will be "1 week"
* docs for [window](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.window.html#pyspark.sql.functions.window)
* the output of grouping by `window` is a struct column with two subfields, `start` and `end`
* use the `start` subfield and cast it to `date` - this will be used in the plot

In [None]:
# your code here:


#### Visualise the data:

Hint:
* convert the aggregated data to a Pandas dataframe using [toPandas()](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toPandas.html#pyspark.sql.DataFrame.toPandas)
* use the plotting options of the Pandas dataframe
 * [plot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html)
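As a rough illustration of the plotting step, the sketch below builds a small Pandas dataframe by hand in place of the result of `toPandas()`. The column names `week_start` and `cnt` and all the values are made up for the example.

```python
import pandas as pd

# Made-up weekly counts standing in for the aggregated Spark result.
weekly_pdf = pd.DataFrame({
    'week_start': pd.to_datetime(['2020-12-31', '2021-01-07', '2021-01-14']),
    'cnt': [2, 1, 4],
})

# DataFrame.plot returns a Matplotlib Axes object that can be customised further.
ax = weekly_pdf.plot(x='week_start', y='cnt', kind='line', figsize=(10, 4), legend=False)
ax.set_xlabel('week')
ax.set_ylabel('answers per week')
```

Because `plot` hands back a Matplotlib `Axes`, labels, titles and limits can all be adjusted with the usual Matplotlib calls.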

In [None]:
# your code here:


In [None]:
# your code here:


# Task II

* Compute the response time
 * for each question, compute the time it took to get an accepted answer
 * consider only questions with an accepted answer
* Plot the number of answered questions as a function of response time
 * choose hour as the time unit
 * create a bar chart (to see how many questions were answered within the first hour, within the second hour and so on)
 * plot a cumulative sum (to see, for example, how many questions in total were answered within the first 10 hours and so on)

#### Read the data from the source:

In [None]:
# use the questions dataset
# your code here:


#### Compute response time:

For each question, compute how long it took to get an accepted answer. Consider only questions that actually have an accepted answer.

Hint:
* for each question and answer we know the time when it was created (`created_date`)
* [join](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join) questions with answers (use `accepted_answer_id` field in the join)
* use [unix_timestamp](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.unix_timestamp.html#pyspark.sql.functions.unix_timestamp) to compare the times
* convert to hours
* [ceil](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.ceil.html#pyspark.sql.functions.ceil) the numbers

In [None]:
# your code here:


#### Aggregate the data and visualise:

Hint:
* group by hour
* count
* convert to Pandas
* visualise (take the first 24 hours)

In [None]:
# your code here:


In [None]:
# your code here:


For a bar chart you can use `df.plot.bar`.
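A minimal sketch of such a bar chart, assuming the aggregated result has already been converted to Pandas. The dataframe below is made up, and the column names `response_hours` and `cnt` are placeholders.

```python
import pandas as pd

# Made-up counts per response hour, deliberately unsorted.
hours_pdf = pd.DataFrame({
    'response_hours': [3, 1, 30, 2],
    'cnt': [5, 40, 1, 12],
})

# Keep the first 24 hours and sort so the bars appear in time order.
first_day = hours_pdf[hours_pdf['response_hours'] <= 24].sort_values('response_hours')

ax = first_day.plot.bar(x='response_hours', y='cnt', legend=False)
ax.set_xlabel('response time [hours]')
ax.set_ylabel('number of questions')
```

Sorting matters here because a grouped Spark result carries no guaranteed row order after `toPandas()` unless it was ordered explicitly.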

In [None]:
# your code here:


#### Note

As you can see, a large portion of the questions that have an accepted answer were answered within the first hour.


* to compute the cumulative sum you can use [cumsum()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cumsum.html)
* add a new column to the Pandas DataFrame as `df['new_col'] = df['cnt'].cumsum()`
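For illustration, the cumulative sum on a small hand-made dataframe could look like this. The values and the column names `response_hours` and `cumulative_cnt` are assumptions; `cnt` follows the hint above.

```python
import pandas as pd

# Made-up counts per response hour.
hours_pdf = pd.DataFrame({
    'response_hours': [1, 2, 3, 4],
    'cnt': [40, 12, 5, 3],
})

# Sort by hour first: cumsum() simply accumulates in row order.
hours_pdf = hours_pdf.sort_values('response_hours')
hours_pdf['cumulative_cnt'] = hours_pdf['cnt'].cumsum()
```

The last value of `cumulative_cnt` equals the total of `cnt` over the rows kept, which is a handy check against the overall number of questions with an accepted answer.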

In [None]:
# your code here:


In [None]:
# Also check the total number of questions with an accepted answer:
# your code here:


In [None]:
spark.stop()