# Predictive Analytics


* build an ML prototype that predicts whether a question will be answered within 30 minutes
* model it as a binary classification problem
* first prepare a simple model with a few basic features
* then try to improve it by adding more features
* use a random forest as the classifier
* for modelling, consider only questions that have an accepted answer

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp, when, lit, length, array_sort, udf, desc

from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
from pyspark.ml.feature import VectorAssembler, Tokenizer, SQLTransformer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Predictive Analytics')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# the project root is three directory levels above the current working directory
project_path = '/'.join(base_path.split('/')[:-3])

answers_input_path = os.path.join(project_path, 'data/answers')

users_input_path = os.path.join(project_path, 'data/users')

questions_input_path = os.path.join(project_path, 'data/questions-json')

model_output_path = os.path.join(project_path, 'output/models/binary-classification')

<b>Load the data:</b>

Hint:
* load all three datasets: answers, questions, users

In [None]:
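# no format is given for answers and users, so Spark uses its default data source
# (Parquet, unless spark.sql.sources.default is configured otherwise)
answersDF = (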
    spark
    .read
    .option('path', answers_input_path)
    .load()
)

questionsDF = (
    spark
    .read.format('json')
    .option('path', questions_input_path)
    .load()
)

usersDF = (
    spark
    .read
    .option('path', users_input_path)
    .load()
)

<b>Add label to the dataset</b>

Hint:
* join questions with answers on `accepted_answer_id`
* also join with users on `user_id`
* compute the response time with `unix_timestamp`, or by casting the timestamps to long
  * for questions, first cast `creation_date` to timestamp if it is a string
* use a `when` condition to compute the label (1 when the response time is at most 1800 seconds, otherwise 0)
* cache the DataFrame for faster access in the next steps

In [None]:
# your code here



<b>Take a look at the distribution of classes</b>

Hint:
* group by the label and count the rows

In [None]:
# your code here



<b>Add some basic features:</b>

Hint:
* add the feature `title_complexity`
  * compute it as the length of the question title

In [None]:
# your code here



<b>Prepare data</b>

Hint:
* split the data for training and testing using [randomSplit](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.randomSplit.html#pyspark.sql.DataFrame.randomSplit)

In [None]:
# your code here



<b>Build the pipeline and train the model:</b>

Hint:
* use: 
 * [VectorAssembler](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html#pyspark.ml.feature.VectorAssembler)
 * [RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html#pyspark.ml.classification.RandomForestClassifier)
 * [Pipeline](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Pipeline.html#pyspark.ml.Pipeline)

In [None]:
# your code here



<b>Evaluate the model</b>

Hint:
* use [BinaryClassificationEvaluator](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.BinaryClassificationEvaluator.html#pyspark.ml.evaluation.BinaryClassificationEvaluator) with areaUnderROC

In [None]:
# your code here



### Add more features

Hint:
* add features:
  * `question_size`: the number of words in the question body
    * use [Tokenizer](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.Tokenizer.html#pyspark.ml.feature.Tokenizer) to split the text into words
    * use a [SQLTransformer](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.SQLTransformer.html#pyspark.ml.feature.SQLTransformer) to compute the size
  * you can also try other features, such as the user's reputation, upvotes and downvotes
* train the model with this new pipeline
* evaluate the model
* see if the model improved

In [None]:
# your code here:



### Explore the importance of the features

Hint:
* access the last stage of the model to get the instance of the `RandomForestClassificationModel`
  * use `model.stages`
* see [featureImportances](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassificationModel.html#pyspark.ml.classification.RandomForestClassificationModel.featureImportances)

In [None]:
# your code here:



### Hyperparameter tuning:

Hint:
* use [ParamGridBuilder](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.ParamGridBuilder.html#pyspark.ml.tuning.ParamGridBuilder) to find optimal `numTrees` and optimal `maxDepth`
* after you fit the [CrossValidator](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html) access the best model as `cross_model.bestModel`
* score the predictions computed by the bestModel with the evaluator (this gives areaUnderROC rather than plain accuracy)

Note:

If you run in local mode, keep the grid to 2 x 2 to avoid a long run (a 3 x 3 grid can take over an hour).

In [None]:
# your code here:



### Explore the params of the model

Hint:
* see `avgMetrics` of the cross_model
* see the stages of the bestModel
  * access the last stage to get the instance of the `RandomForestClassificationModel`
* see:
  * `getNumTrees`
  * `getMaxDepth()`
  * `toDebugString` to see full description of the model

In [None]:
# your code here:



### Save the model so you can use it later in an ML application

* use [write](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.PipelineModel.html#pyspark.ml.PipelineModel.write) on the [PipelineModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.PipelineModel.html#pyspark.ml.PipelineModel)

In [None]:
# your code here



In [None]:
spark.stop()