# Predictive Analytics

## Task I

* build an ML prototype that predicts whether a question will be answered within the next 2 hours
* model it as binary classification
* first prepare a simple model with some basic features
* then try to improve it by adding more features
* use a random forest as the classifier
* for modelling, consider only questions that have an accepted answer
* if you run in local mode, skip the hyperparameter tuning since it may run too long

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, unix_timestamp, when, lit, length, array_sort, udf, desc
)

from pyspark.sql.types import (
    ArrayType, StructType, StructField, StringType, IntegerType
)

from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, Tokenizer, SQLTransformer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Predictive Analytics I')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

project_path = '/'.join(base_path.split('/')[:-3])

answers_input_path = os.path.join(project_path, 'data/answers')

questions_input_path = os.path.join(project_path, 'data/questions')

model_output_path = os.path.join(project_path, 'output/models/clustering')

<b>Load the data:</b>

In [None]:
answersDF = (
    spark
    .read
    .option('path', answers_input_path)
    .load()
)

questionsDF = (
    spark
    .read
    .option('path', questions_input_path)
    .load()
)

<b>Add label to the dataset</b>

hint:
* join questions with answers
* compute response time using [unix_timestamp](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.unix_timestamp.html#pyspark.sql.functions.unix_timestamp)
* use 'when' condition to compute the label

In [None]:
# your code here


<b>Take a look at the distribution of classes</b>

In [None]:
# your code here


<b>Add some basic features:</b>

hint:
* add feature 'title_complexity'
 * compute the length of the question title

In [None]:
# your code here


<b>Prepare data</b>

hint:
* split the data for training and testing using [randomSplit](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.randomSplit.html#pyspark.sql.DataFrame.randomSplit)

In [None]:
# your code here


<b>Build the pipeline and train the model:</b>

hint:
* use: 
 * [VectorAssembler](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html#pyspark.ml.feature.VectorAssembler)
 * [RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html#pyspark.ml.classification.RandomForestClassifier)
 * [Pipeline](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Pipeline.html#pyspark.ml.Pipeline)

In [None]:
# your code here


<b>Evaluate the model</b>

hint:
* use [BinaryClassificationEvaluator](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.BinaryClassificationEvaluator.html#pyspark.ml.evaluation.BinaryClassificationEvaluator) with areaUnderROC

In [None]:
# your code here


<b>Add more features</b>

hint:
* add features: 
    * 'question_size': the number of words in the question body
     * use [Tokenizer](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.Tokenizer.html#pyspark.ml.feature.Tokenizer) to split the text into words
     * use a [SQLTransformer](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.SQLTransformer.html#pyspark.ml.feature.SQLTransformer) to compute the size
    * you can try to create a feature using the tags

* train the model with this new pipeline
* evaluate the model
* see if the model improved

In [None]:
# your code here


#### Note

* Similarly, you could look for other features and try to improve the evaluation metric

<b>Hyperparameter tuning:</b>

hint:
* use [ParamGridBuilder](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.ParamGridBuilder.html#pyspark.ml.tuning.ParamGridBuilder) to find the optimal numTrees and maxDepth

Note:

If you run in local mode, skip the hyperparameter tuning since it may run too long (more than an hour).

In [None]:
# your code here


### Save the model so you can use it later in an ML application

* use [write](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.PipelineModel.html#pyspark.ml.PipelineModel.write) on the [PipelineModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.PipelineModel.html#pyspark.ml.PipelineModel)

In [None]:
# your code here
