## What is the probability a user will answer at least k questions next year?

In this notebook we will see how Pandas user-defined functions (UDFs) can be used for statistical modeling with the [scipy](http://scipy.github.io/devdocs/reference/index.html) package.

You will:
* prepare the input data using aggregation
* use a vanilla UDF to compute the probability using the Poisson distribution
* use a Pandas UDF to compute the probability using the Poisson distribution
* compare the performance of both UDFs

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, lit, count, year, pandas_udf, avg, desc

from pyspark.sql.types import IntegerType

import os
import time

from scipy.stats import poisson
import pandas as pd

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Statistical Modeling')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

project_path = ('/').join(base_path.split('/')[0:-3]) 

answers_input_path = os.path.join(project_path, 'data/answers')

## Task
For each user, compute the probability that the user will answer at least 5 questions in the next year. Use a simple model based on the Poisson distribution.

* Create a DataFrame with two columns, `user_id` and `answers`, where `answers` is the average number of questions the user answered per year in the past.
* Find the last year in the data and filter it out, as it is incomplete.
* Implement a UDF that uses the Poisson distribution from the scipy package to compute the probability that a user who answered on average n questions per year will answer at least 5 questions in the next year.
* Implement the UDF again, but this time as a Pandas UDF.

In [None]:
# we will need answers dataset:


## Create input DataFrame
* first find out what is the last year and filter it out as it is incomplete
* filter for rows where user_id is not null
* compute average number of answers per user and year
  * group by user and year
  * use count to see how many questions each user answered in each year
  * group by again but now only per user
  * compute the average per year for each user
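
The steps above can be sanity-checked in plain Python on a few made-up rows before writing them in Spark. This is only a sketch of the logic, not the Spark solution: the rows and the `average_answers_per_year` helper are hypothetical, and it assumes the average is taken over the years in which a user posted at least one answer.

```python
from collections import Counter, defaultdict

# Made-up (user_id, year) pairs, one per answer; None marks a missing user_id.
rows = [
    (1, 2019), (1, 2019), (1, 2020),
    (2, 2020), (2, 2021),
    (None, 2020),
    (1, 2021),
]

def average_answers_per_year(rows):
    """Hypothetical helper: filter -> count per (user, year) -> average per user."""
    last_year = max(year for _, year in rows)
    # Drop the incomplete last year and rows without a user_id.
    kept = [(u, y) for u, y in rows if y < last_year and u is not None]
    # Step 1: number of answers per user and year.
    per_user_year = Counter(kept)
    # Step 2: average of the yearly counts for each user.
    yearly_counts = defaultdict(list)
    for (user, _year), n in per_user_year.items():
        yearly_counts[user].append(n)
    return {user: sum(c) / len(c) for user, c in yearly_counts.items()}

result = average_answers_per_year(rows)
print(result)
```

Years in which a user posted nothing never appear as a group, so they do not lower the average; a `groupBy` followed by `count` in Spark behaves the same way.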

In [None]:
# see what is the last year:


In [None]:
# your code here:


In [None]:
input_df.show(n=5)

## Define a python function
### Hint:

* it should take `year_average` as its argument and return the probability that at least `k` questions will be answered in the next year
* use the [cdf](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html#scipy.stats.poisson) function of `poisson` in scipy
* define `k` as a constant equal to 5
* test that the function works
* to calculate the probability, use `1 - poisson.cdf(k - 1, mu)` with `mu` set to `year_average`; for `k = 5` this is `1 - poisson.cdf(4, mu)`. The cumulative distribution function gives P(X <= k), but we need P(X >= k) = 1 - P(X <= k - 1), so we compute the cumulative probability at 4 and subtract it from 1.
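
As a cross-check for the scipy-based function, the same tail probability can be computed from the Poisson probability mass function using only the standard library. This is a minimal sketch; the name `prob_at_least` is hypothetical and not part of the notebook.

```python
import math

def prob_at_least(k, mu):
    """P(X >= k) for X ~ Poisson(mu), computed as 1 - P(X <= k - 1)."""
    # Sum the probability mass for 0, 1, ..., k - 1 (k itself is excluded).
    cdf_below_k = sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf_below_k

k = 5
for mu in (1.0, 5.0, 10.0):
    print(mu, round(prob_at_least(k, mu), 4))
```

The result of your scipy function for the same `mu` should agree with this value up to floating-point error. Using `poisson.cdf(5, mu)` instead of `poisson.cdf(4, mu)` would give P(X >= 6) and show up as a mismatch here.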

In [None]:
# implement the function:

k = 5


In [None]:
# test the function:


## Define the UDF:
### Hint:

* once you have the Python function, make a UDF from it. See udf in [docs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.udf.html#pyspark.sql.functions.udf)
* the return type will be a float type (for example `FloatType()` or the type string `'float'`), since we compute a probability
* wrap the return value in `float()`: scipy returns a NumPy scalar rather than a plain Python float, and the UDF should return a plain Python float

In [None]:
# your code here:


## Apply the udf

In [None]:
# your code here:


## Try it with Pandas
* create local Pandas dataframe with input data
  * this will be for testing the Pandas function so you can sample the spark dataframe to get just some rows for testing
  * see sample in [docs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sample.html)
* pass a local Pandas Series to `poisson.cdf` to see what it returns
* define a function that takes a Pandas Series as its input argument and also returns a Pandas Series

### Hint:

* create a Pandas Series from the Pandas DataFrame as `local_data['answers']`, where `local_data` is the local Pandas DataFrame


In [None]:
# local pd dataframe:


In [None]:
# Check what the poisson.cdf returns if we pass in Pandas series:


In [None]:
# We can easily create a pandas series from the numpy array:


In [None]:
# Now define a function from it:


In [None]:
# Test the function:


### Hint

* Once you have the function make a pandas udf from it
* See [pandas_udf](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.pandas_udf.html#pyspark.sql.functions.pandas_udf) in the docs

In [None]:
# define pandas udf 


In [None]:
# Apply the UDF:


## Compare the performance of both UDFs
### Hint

* write the query result using the `noop` format to make sure all transformations are executed
* use the Python `time` module to define `start_time` and `end_time` so you can subtract them and compute the execution time for each query
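
The timing pattern can be sketched with a small helper. The `timed` function and the stand-in workload are hypothetical. In the notebook the callable would instead trigger the Spark write, for example `df.write.format("noop").mode("overwrite").save()`, where `df` is assumed to be the DataFrame with the UDF column applied.

```python
import time

def timed(action):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start_time = time.perf_counter()
    result = action()
    end_time = time.perf_counter()
    return result, end_time - start_time

# Stand-in workload so the sketch runs without Spark.
# In the notebook, pass a callable that performs the noop write instead.
result, elapsed = timed(lambda: sum(i * i for i in range(100_000)))
print(f"elapsed: {elapsed:.4f} s")
```

Run each query more than once if you can. The first run may include one-off start-up costs that distort the comparison.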

In [None]:
# execution of vanilla UDF:


In [None]:
# execution of Pandas UDF:


In [None]:
spark.stop()