# Predictive Analytics

## Task I

* build an ML prototype that predicts whether a question will be answered within the next 2 hours
* model it as binary classification
* first prepare simple model with some basic features
* then try to improve it by adding some more features
* use random forest as a classifier
* for modelling, consider only questions that have an accepted answer
* if you run in local mode, skip the hyperparameter tuning since it may run too long

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, unix_timestamp, when, lit, length, array_sort, udf, desc
)

from pyspark.sql.types import (
    ArrayType, StructType, StructField, StringType, IntegerType
)

from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, Tokenizer, SQLTransformer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Predictive Analytics I')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

project_path = '/'.join(base_path.split('/')[:-3])

answers_input_path = os.path.join(project_path, 'data/answers')

questions_input_path = os.path.join(project_path, 'output/questions-transformed')

<b>Load the data:</b>

In [None]:
answersDF = (
    spark
    .read
    .option('path', answers_input_path)
    .load()
)

questionsDF = (
    spark
    .read
    .option('path', questions_input_path)
    .load()
)

<b>Add label to the dataset</b>

hint:
* join questions with answers
* compute response time using unix_timestamp
* use 'when' condition to compute the label

In [None]:
data_with_label = (
    questionsDF.alias('questions')
    .join(answersDF.alias('answers'), questionsDF['accepted_answer_id'] == answersDF['answer_id'])
    .select(
        col('questions.tags'),
        col('questions.creation_date').alias('question_time'),
        col('questions.title'),
        col('questions.body').alias('message'),
        col('answers.creation_date').alias('answer_time')
    )
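    # response time in seconds between the question and its accepted answer
    .withColumn('response_time', unix_timestamp('answer_time') - unix_timestamp('question_time'))
    # label is 1 if the answer arrived within 2 hours (7200 seconds), otherwise 0
    .withColumn('label', when(col('response_time') <= 7200, lit(1)).otherwise(lit(0)))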
).cache()
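
The labelling rule can be checked outside Spark on a few hand-made timestamps. The following is a minimal plain-Python sketch of the same threshold logic; the function name `label_for` and the example timestamps are made up for illustration and are not part of the dataset.

```python
from datetime import datetime

TWO_HOURS_SECONDS = 2 * 60 * 60  # 7200, the same threshold as in the Spark cell


def label_for(question_time: datetime, answer_time: datetime) -> int:
    """Return 1 if the accepted answer arrived within two hours, else 0."""
    response_time = (answer_time - question_time).total_seconds()
    return 1 if response_time <= TWO_HOURS_SECONDS else 0


# Hypothetical timestamps, chosen to hit both sides of the threshold.
asked = datetime(2020, 1, 1, 12, 0, 0)
fast = label_for(asked, datetime(2020, 1, 1, 13, 30, 0))      # 90 minutes later
boundary = label_for(asked, datetime(2020, 1, 1, 14, 0, 0))   # exactly 2 hours later
slow = label_for(asked, datetime(2020, 1, 1, 14, 0, 1))       # 2 hours and 1 second later

print(fast, boundary, slow)  # prints: 1 1 0
```

Because the comparison is `<=`, an answer that arrives after exactly 7200 seconds still counts as answered in time.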

In [None]:
data_with_label.count()

<b>Take a look at the distribution of classes</b>

In [None]:
(
    data_with_label
    .groupBy('label')
    .count()
).show()

<b>Add some basic features:</b>

hint:
* add feature 'title_complexity'
    * compute the length of the question title

In [None]:
data_with_basic_features = (
    data_with_label
    .withColumn('title_complexity', length('title'))
)

<b>Prepare data</b>

hint:
* split the data for training and testing using randomSplit

In [None]:
train_data, test_data = data_with_basic_features.randomSplit([0.7, 0.3], 24)
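
The weights `[0.7, 0.3]` are best read as probabilities rather than exact proportions, so the resulting split sizes are only approximately 70/30, and the seed makes the split repeatable. The sketch below is a simplified plain-Python model of that idea, not Spark's implementation; the helper `random_split` is hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability
    proportional to the corresponding weight."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[index].append(row)
                break
        else:
            # Guard against floating-point rounding in the cumulative sum.
            splits[-1].append(row)
    return splits


rows = list(range(1000))
train, test = random_split(rows, [0.7, 0.3], seed=24)
train_again, test_again = random_split(rows, [0.7, 0.3], seed=24)

print(len(train), len(test))   # roughly 700 and 300, not exactly
print(train == train_again)    # same seed, same split
```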

<b>Build the pipeline and train the model:</b>

hint:
* use:
    * VectorAssembler
    * RandomForestClassifier
    * Pipeline

In [None]:
features = ['title_complexity']

assembler = VectorAssembler(inputCols=features, outputCol='features')

# Classifier:
rf = RandomForestClassifier(labelCol='label', featuresCol='features', seed=42)

pipeline = Pipeline(stages=[assembler, rf])

rf_model = pipeline.fit(train_data)

<b>Evaluate the model</b>

hint:
* use BinaryClassificationEvaluator with areaUnderROC

In [None]:
evaluator = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderROC')

predictions = rf_model.transform(test_data)

evaluator.evaluate(predictions)
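
To interpret the number returned by the evaluator: the area under the ROC curve can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, so 0.5 corresponds to random ranking and 1.0 to perfect ranking. The following plain-Python sketch computes it by counting pairs; the function `area_under_roc` and the toy labels and scores are made up for illustration, and Spark's evaluator may give slightly different values because it approximates the curve.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs that are ranked
    correctly; tied scores count as half a correct pair."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError('need at least one example of each class')
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy data: one positive (score 0.4) is ranked below one negative (score 0.6).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]

print(area_under_roc(toy_labels, toy_scores))  # prints: 0.75
```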

<b>Add more features</b>

hint:
* add feature 'message_size': the number of words in the question body
    * use Tokenizer to split the text into words
    * use a SQLTransformer to compute the size

* train the model with this new pipeline
* evaluate the model
* see if the model improved

In [None]:
sizeTrans = SQLTransformer(statement="SELECT *, size(words) AS message_size FROM __THIS__")

In [None]:
features = ['title_complexity', 'message_size']

tokenizer = Tokenizer(inputCol='message', outputCol='words')

assembler = VectorAssembler(inputCols=features, outputCol='features')

rf = RandomForestClassifier(labelCol='label', featuresCol='features', seed=42)

pipeline = Pipeline(stages=[tokenizer, sizeTrans, assembler, rf])

rf_model = pipeline.fit(train_data)
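
The first two stages of this pipeline (Tokenizer followed by the SQLTransformer) together turn a message into a word count. A rough plain-Python equivalent is sketched below, assuming lowercasing and whitespace splitting; the function name and the example sentence are made up, and Spark's Tokenizer may count slightly differently in edge cases such as repeated whitespace.

```python
def message_size(message: str) -> int:
    """Approximate the Tokenizer + size(words) steps: lowercase the text,
    split it on whitespace, and count the resulting tokens."""
    words = message.lower().split()
    return len(words)


# Hypothetical question body used only for illustration.
example = 'How do I join two DataFrames in Spark?'

print(message_size(example))  # prints: 8
```

Note that punctuation stays attached to the tokens (`spark?` is one word), which is fine for a size feature but would matter for features based on the words themselves.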

In [None]:
evaluator = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderROC')

predictions = rf_model.transform(test_data)

evaluator.evaluate(predictions)

#### Note

* Similarly, you could look for other features and try to improve the evaluation metric.

<b>Hyperparameter tuning:</b>

hint:
* use ParamGridBuilder to find the optimal numTrees and maxDepth

Note:

If you run in local mode, skip the hyperparameter tuning since it may run too long (more than an hour).

In [None]:
paramGrid = (
    ParamGridBuilder()
    .addGrid(rf.maxDepth, [3, 5, 8])
    .addGrid(rf.numTrees, [50, 100, 150])
    .build()
)

#cross_model = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramGrid).fit(train_data)
#rf_model = cross_model.bestModel
#predictions = rf_model.transform(test_data)
#evaluator.evaluate(predictions)
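
To see why the cross-validation is slow in local mode, it helps to count how many forests have to be trained. The sketch below does the arithmetic in plain Python for the grid above, assuming 3 folds (CrossValidator's default `numFolds`, since the commented-out call does not set it); the variable names are made up for illustration.

```python
from itertools import product

max_depths = [3, 5, 8]
num_trees = [50, 100, 150]
num_folds = 3  # assumption: CrossValidator's default, not set explicitly above

# Every (maxDepth, numTrees) pair is one candidate model.
param_combinations = list(product(max_depths, num_trees))

# Each candidate is trained once per fold, then the best one is refit on all training data.
fits_during_search = len(param_combinations) * num_folds
total_fits = fits_during_search + 1

# Total number of individual trees grown during the search itself.
trees_during_search = sum(trees for _, trees in param_combinations) * num_folds

print(len(param_combinations), fits_during_search, total_fits)  # prints: 9 27 28
print(trees_during_search)                                      # prints: 2700
```

Shrinking the grid or the number of folds reduces this count proportionally, which is one way to make the search feasible on a single machine.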

In [None]:
spark.stop()