# Statistical modeling

In this notebook we will see how user-defined functions (UDFs) can be used for statistical modeling with the [scipy](http://scipy.github.io/devdocs/reference/index.html) package. We will also implement a Pandas UDF, which performs better than a vanilla UDF because it can leverage [Apache Arrow](https://arrow.apache.org/) under the hood to exchange data and can use the vectorized execution supported by the scipy package.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, lit, count, year, pandas_udf, avg

from pyspark.sql.types import IntegerType

import os
import re

from scipy.stats import poisson
import pandas as pd

In [None]:
spark = (
    SparkSession
    .builder
    .appName('UDFs II')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

project_path = '/'.join(base_path.split('/')[:-3])

answers_input_path = os.path.join(project_path, 'data/answers')

### Task

For each user, compute the probability that the user will answer exactly 5 questions in the next year. Use a simple model based on the Poisson distribution.

1. Create a DataFrame with two columns, `user_id` and `answers`, where the second is the average number of questions the user answered per year.
2. Implement a UDF that uses the [poisson](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html) distribution from the scipy package to compute the probability that a user who answered <i>n</i> questions per year on average will answer 5 questions in the next year.
3. Implement the UDF again, but this time as a Pandas UDF.
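
The model treats the number of answers in a year as Poisson distributed with a rate equal to the user's yearly average, so the probability of exactly k answers is rate^k * exp(-rate) / k!. A minimal sketch of that formula in plain Python follows; `poisson_pmf` is a hypothetical helper name used only for illustration (the notebook itself relies on `scipy.stats.poisson.pmf`), and the average of 5 is an illustrative value.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events when the expected count is `rate`."""
    return rate ** k * math.exp(-rate) / math.factorial(k)


# A user who averages 5 answers per year (illustrative value)
print(round(poisson_pmf(5, 5.0), 4))  # 0.1755
```

Note that this is the probability of exactly 5 answers, not of 5 or more.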

In [None]:
# We will need the answers dataset:

answersDF = (
    spark
    .read
    .option('path', answers_input_path)
    .load()
)

In [None]:
answersDF.show()

#### Create input DataFrame

* filter for rows where `user_id` is not null
* compute the average number of answers per user per year
  * group by user and year
  * use count to see how many questions each user answered in each year
  * group by again, but now only per user
  * compute the average per year for each user
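
The two-stage aggregation above can be sketched in plain Python on a handful of made-up rows. The data and variable names here are hypothetical and serve only to show the logic. One consequence worth noticing is that a year in which a user wrote no answers produces no row in the first stage, so it does not pull the average down.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows of (user_id, creation_year); None marks a missing user
rows = [
    (1, 2019), (1, 2019), (1, 2020),
    (2, 2020), (2, 2020), (2, 2020), (2, 2020),
    (None, 2020),
]

# Stage 1: count answers per (user, year), skipping rows without a user
per_user_year = Counter((user, year) for user, year in rows if user is not None)

# Stage 2: collect the yearly counts per user and average them
yearly_counts = defaultdict(list)
for (user, _year), n in per_user_year.items():
    yearly_counts[user].append(n)

average_per_year = {user: sum(c) / len(c) for user, c in yearly_counts.items()}
print(average_per_year)  # {1: 1.5, 2: 4.0}
```

User 2 has no answers in 2019, yet the average is 4.0 rather than 2.0, because only the years that appear in the data are counted.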

In [None]:
input_df = (
    answersDF
    .filter(col('user_id').isNotNull())
    .withColumn('creation_year', year('creation_date'))
    .groupBy(
        'creation_year', 'user_id',
    )
    .agg(
        count('*').alias('answers')
    )
    .groupBy('user_id')
    .agg(
        avg('answers').alias('answers')
    )
)

In [None]:
input_df.show(n=5)

#### Define UDF:

Hint:
* the return type is float, since the UDF returns a probability
* use the `pmf` function of [poisson](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html) in scipy

In [None]:
@udf('float')
def get_probability(k, year_average):
    return float(poisson.pmf(k, year_average))

#### Apply the UDF

In [None]:
(
    input_df
    .withColumn('probability', get_probability(lit(5), col('answers')))
).show(n=5)

#### Try it with Pandas

* create a local Pandas DataFrame from the input data
* pass a local Pandas Series to poisson to see what it returns

Hint:
* get a Pandas Series from the Pandas DataFrame as `local_data['answers']`, where `local_data` is the local Pandas DataFrame

In [None]:
local_data = input_df.toPandas()

In [None]:
# It returns a NumPy array

poisson.pmf(5, local_data['answers'])

In [None]:
# We can easily create a pandas series from it:

pd.Series(poisson.pmf(5, local_data['answers']))

In [None]:
# Define a pandas udf:

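@pandas_udf('float')
def get_probability_pd(k: pd.Series, year_average: pd.Series) -> pd.Series:
    # Both arguments arrive as Pandas Series, one value per row in the batch
    return pd.Series(poisson.pmf(k, year_average))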
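
Inside a Pandas UDF every argument arrives as a Pandas Series, including the literal created with `lit(5)`, which is expanded to one value per row of the batch. The body can therefore be checked locally before it is registered with Spark. The sketch below assumes made-up averages, and `probability_batch` is a hypothetical name for the undecorated function.

```python
import pandas as pd
from scipy.stats import poisson


def probability_batch(k: pd.Series, year_average: pd.Series) -> pd.Series:
    # poisson.pmf works elementwise on both inputs and returns a NumPy array
    return pd.Series(poisson.pmf(k, year_average))


# Made-up yearly averages standing in for one batch of rows
averages = pd.Series([0.5, 2.0, 5.0])
# The literal 5 repeated once per row, mimicking the expanded literal column
k = pd.Series([5] * len(averages))

result = probability_batch(k, averages)
print(result)
```

The result has the same length as the input batch, which is what Spark requires from a scalar Pandas UDF.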

In [None]:
# Apply the UDF:

(
    input_df
    .withColumn('probability', get_probability_pd(lit(5), col('answers')))
).show(n=5)

#### Compare the performance of both UDFs

Hint:
* run the query with the noop format
* check the execution time in SparkUI

In [None]:
# execution of vanilla UDF:

(
    input_df
    .withColumn('probability', get_probability(lit(5), col('answers')))
    .write
    .mode('overwrite')
    .format('noop')
    .save()
)

In [None]:
# execution of Pandas UDF:

(
    input_df
    .withColumn('probability', get_probability_pd(lit(5), col('answers')))
    .write
    .mode('overwrite')
    .format('noop')
    .save()
)
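
Besides reading the durations from the Spark UI, a rough wall-clock measurement can be taken on the driver. A minimal sketch follows, where `timed` is a hypothetical helper and not part of PySpark. It times any zero-argument callable, so it can wrap the `noop` writes above; a stand-in workload is used here so that the sketch runs without a Spark session.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


# Stand-in workload so the sketch runs without a Spark session
_, seconds = timed(lambda: sum(range(1_000_000)))
print(f"elapsed: {seconds:.3f} s")

# With Spark, a noop write would be wrapped the same way, for example:
# timed(lambda: df.write.mode('overwrite').format('noop').save())
```

A single measurement is noisy and the first run may include warm-up costs, so repeating each query a few times gives a fairer comparison.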

In [None]:
spark.stop()