# Predictive Analytics


* build an ML prototype that predicts whether a question will be answered within 30 minutes of being posted
* model it as a binary classification problem
* first prepare a simple model with a few basic features
* then try to improve it by adding more features
* use a random forest as the classifier
* for modelling, consider only questions that have an accepted answer

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp, when, lit, length, array_sort, udf, desc

from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
from pyspark.ml.feature import VectorAssembler, Tokenizer, SQLTransformer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Predictive Analytics')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# The project root is three directories above the current working directory.
project_path = '/'.join(base_path.split('/')[:-3])

answers_input_path = os.path.join(project_path, 'data/answers')

users_input_path = os.path.join(project_path, 'data/users')

questions_input_path = os.path.join(project_path, 'data/questions-json')

model_output_path = os.path.join(project_path, 'output/models/binary-classification')

<b>Load the data:</b>

Hint:
* load all three datasets: answers, questions, users

In [None]:
answersDF = (
    spark
    .read
    .option('path', answers_input_path)
    .load()
)

questionsDF = (
    spark
    .read.format('json')
    .option('path', questions_input_path)
    .load()
)

usersDF = (
    spark
    .read
    .option('path', users_input_path)
    .load()
)

<b>Add label to the dataset</b>

hint:
* join questions with answers on `accepted_answer_id` (it matches `answer_id` in the answers dataset)
* also join with users on `user_id`
* compute the response time in seconds using `unix_timestamp`, or by casting the timestamps to long
  * for questions, first cast `creation_date` to timestamp if it is a string
* use a `when` condition to compute the label (1 when the response time is at most 1800 seconds, otherwise 0)
* cache the DataFrame for faster access in the next steps
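
The labeling rule in the hints can be checked on a single pair of timestamps without Spark. The sketch below is a plain-Python illustration only: the helper name `label_for` and the sample timestamps are made up for the example, and the 1800-second threshold is the 30-minute window from the task description.

```python
from datetime import datetime

THRESHOLD_SECONDS = 30 * 60  # 30 minutes, the window used for the label


def label_for(question_time: datetime, answer_time: datetime) -> int:
    """Return 1 if the accepted answer arrived within 30 minutes, else 0."""
    response_time = (answer_time - question_time).total_seconds()
    return 1 if response_time <= THRESHOLD_SECONDS else 0


# Hypothetical timestamps, for illustration only.
asked = datetime(2020, 1, 1, 12, 0, 0)
fast = datetime(2020, 1, 1, 12, 29, 59)
slow = datetime(2020, 1, 1, 12, 30, 1)

print(label_for(asked, fast))
print(label_for(asked, slow))
```

In the Spark version the same comparison is applied to every row at once, with `unix_timestamp` supplying the seconds.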

In [None]:
# your code here

data_with_label = (
    questionsDF.withColumn('creation_date', col('creation_date').cast('timestamp')).alias('questions')
    .join(answersDF.alias('answers'), questionsDF['accepted_answer_id'] == answersDF['answer_id'])
    .join(usersDF.alias('users'), questionsDF['user_id'] == usersDF['user_id'])
    .select(
        col('questions.tags'),
        col('questions.creation_date').alias('question_time'),
        col('questions.title'),
        col('questions.body').alias('message'),
        col('answers.creation_date').alias('answer_time'),
        col('users.reputation'),
        col('users.upvotes'),
        col('users.downvotes')
    )
    .withColumn('response_time', unix_timestamp('answer_time') - unix_timestamp('question_time'))
    .withColumn('label', when(col('response_time') <= 1800, lit(1)).otherwise(0))
).cache()

In [None]:
data_with_label.count()

<b>Take a look at the distribution of classes</b>

Hint:
* group by `label` and count the rows in each group

In [None]:
# your code here

(
    data_with_label
    .groupBy('label')
    .count()
).show()

<b>Add some basic features:</b>

hint:
* add the feature `title_complexity`
  * compute it as the length of the question title

In [None]:
# your code here

data_with_basic_features = (
    data_with_label
    .withColumn('title_complexity', length('title'))
)

<b>Prepare data</b>

hint:
* split the data into training and test sets using [randomSplit](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.randomSplit.html#pyspark.sql.DataFrame.randomSplit)
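
A random split assigns rows to the splits by chance, so the resulting sizes are only approximately proportional to the weights. The plain-Python sketch below mimics that idea on a range of integers. It is a conceptual illustration, not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative, normalized upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = list(accumulate(w / total for w in weights))
    bounds[-1] = 1.0  # guard against floating-point rounding
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        # The first bound that exceeds the draw decides the split.
        index = next(i for i, bound in enumerate(bounds) if draw < bound)
        splits[index].append(row)
    return splits


train, test = random_split(range(1000), [0.7, 0.3], seed=24)
print(len(train), len(test))  # roughly 700 and 300; exact sizes depend on the seed
```

Checking the sizes of the real `train_data` and `test_data` with `count()` is a quick way to confirm the split behaves as expected.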

In [None]:
# your code here

# 70/30 split with a fixed seed
train_data, test_data = data_with_basic_features.randomSplit([0.7, 0.3], seed=24)

<b>Build the pipeline and train the model:</b>

hint:
* use:
  * [VectorAssembler](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html#pyspark.ml.feature.VectorAssembler)
  * [RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html#pyspark.ml.classification.RandomForestClassifier)
  * [Pipeline](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Pipeline.html#pyspark.ml.Pipeline)

In [None]:
# your code here

features = ['title_complexity']

assembler = VectorAssembler(inputCols=features, outputCol='features')

# Classifier:
rf = RandomForestClassifier(labelCol='label', featuresCol='features', seed=42)

pipeline = Pipeline(stages=[assembler, rf])

rf_model = pipeline.fit(train_data)

<b>Evaluate the model</b>

hint:
* use [BinaryClassificationEvaluator](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.BinaryClassificationEvaluator.html#pyspark.ml.evaluation.BinaryClassificationEvaluator) with areaUnderROC
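
Area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The plain-Python sketch below computes that definition directly on made-up labels and scores. `area_under_roc` is a hypothetical helper, not part of Spark.

```python
def area_under_roc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted scores, for illustration only.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
print(area_under_roc(labels, scores))
```

A value of 0.5 corresponds to a random ranking and 1.0 to a perfect one. Spark derives the metric from the ROC curve rather than by enumerating pairs, so treat this as an explanation of the metric, not of the evaluator's internals.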

In [None]:
# your code here

evaluator = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderROC')

predictions = rf_model.transform(test_data)

evaluator.evaluate(predictions)

### Add more features

Hint:
* add features:
  * `message_size`: number of words in the question body
    * use [Tokenizer](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.Tokenizer.html#pyspark.ml.feature.Tokenizer) to split the text into words
    * use a [SQLTransformer](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.SQLTransformer.html#pyspark.ml.feature.SQLTransformer) to compute the size
  * you can also try other features, such as the user's reputation, upvotes, and downvotes
* train the model with this new pipeline
* evaluate the model
* see whether the model improved
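
The word-count feature (named `message_size` in the pipeline code) can be approximated in plain Python by lowercasing the text and splitting on whitespace. This is only a rough stand-in for the `Tokenizer` and `size(words)` stages: Spark may count tokens differently for text with repeated whitespace, and the helper and sample sentence below are made up for illustration.

```python
def message_size(message: str) -> int:
    """Approximate word count: lowercase the text, then split on whitespace."""
    words = message.lower().split()
    return len(words)


# Hypothetical question text, for illustration only.
print(message_size("How do I join two DataFrames in PySpark?"))
```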

In [None]:
sizeTrans = SQLTransformer(statement="SELECT *, size(words) AS message_size FROM __THIS__")

In [None]:
# your code here:

features = ['title_complexity', 'message_size', 'reputation', 'upvotes', 'downvotes']

tokenizer = Tokenizer(inputCol='message', outputCol='words')

assembler = VectorAssembler(inputCols=features, outputCol='features')

rf = RandomForestClassifier(labelCol='label', featuresCol='features', seed=42)

pipeline = Pipeline(stages=[tokenizer, sizeTrans, assembler, rf])

rf_model = pipeline.fit(train_data)

In [None]:
evaluator = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderROC')

predictions = rf_model.transform(test_data)

evaluator.evaluate(predictions)

In [None]:
rf_model.stages[-1].getMaxDepth()

### Explore the importance of the features

Hint:
* access the last stage of the fitted pipeline to get the `RandomForestClassificationModel` instance
  * use `rf_model.stages`
* see [featureImportances](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassificationModel.html#pyspark.ml.classification.RandomForestClassificationModel.featureImportances)
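
Importances are easier to read when paired with the feature names and sorted. The sketch below shows one way to do that in plain Python. The importance values are invented for illustration and are not results from this model; real values come from the fitted model's `featureImportances` vector.

```python
feature_names = ['title_complexity', 'message_size', 'reputation', 'upvotes', 'downvotes']
# Hypothetical importance values, for illustration only.
example_importances = [0.05, 0.20, 0.45, 0.25, 0.05]

# Sort (name, importance) pairs from most to least important.
ranked = sorted(
    zip(feature_names, example_importances),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.2f}")
```

The order of the names must match the order of `inputCols` given to the `VectorAssembler`, because the importance vector is indexed by position.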

In [None]:
# your code here:

importances = rf_model.stages[-1].featureImportances

In [None]:
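# featureImportances is a Spark Vector; convert it to an array before pairing with names.
for feature, importance in zip(features, importances.toArray()):
    print(f"{feature}: {importance}")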

### Hyperparameter tuning:

hint:
* use [ParamGridBuilder](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.ParamGridBuilder.html#pyspark.ml.tuning.ParamGridBuilder) to search for the best `numTrees` and `maxDepth`
* after you fit the [CrossValidator](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html), access the best model as `cross_model.bestModel`
* compute the area under the ROC curve using the evaluator on the predictions made by the best model

Note:

If you run in local mode, keep the grid at 2 x 2 to avoid a long run (a 3 x 3 grid can take over an hour).
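
The cost of a grid grows quickly because every parameter combination is trained once per fold. The sketch below counts the fits for a 2 x 2 grid, assuming `CrossValidator`'s default of 3 folds and one final refit of the best combination on the full training data. The variable names are illustrative.

```python
from itertools import product

max_depths = [3, 5]
num_trees = [50, 100]
num_folds = 3  # CrossValidator's default numFolds

# Every (maxDepth, numTrees) combination in the grid.
grid = list(product(max_depths, num_trees))
# Each combination is trained once per fold, plus one final refit of the best one.
total_fits = len(grid) * num_folds + 1

print(len(grid), total_fits)
```

By the same arithmetic, a 3 x 3 grid needs 9 * 3 + 1 = 28 pipeline fits, more than twice as many as the 2 x 2 grid.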

In [None]:
paramGrid = (
  ParamGridBuilder()
  .addGrid(rf.maxDepth, [3, 5])
  .addGrid(rf.numTrees, [50, 100])
  .build()
)

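# Cross-validate the whole pipeline over the grid (numFolds defaults to 3).
cross_validator = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramGrid)
cross_model = cross_validator.fit(train_data)

# Evaluate the best model on the held-out test data.
rf_model = cross_model.bestModel
predictions = rf_model.transform(test_data)
evaluator.evaluate(predictions)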

### Explore the params of the model

Hint:
* see `avgMetrics` of the `cross_model`
* see the stages of the `bestModel`
  * access the last stage to get the `RandomForestClassificationModel` instance
* see:
  * `getNumTrees`
  * `getMaxDepth()`
  * `toDebugString` for a full description of the model
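
Each entry of `avgMetrics` corresponds to the parameter combination at the same position in the grid, so the best combination can be found by index. The sketch below shows this lookup in plain Python. The metric values are invented for illustration and are not results from this notebook.

```python
# Hypothetical values, for illustration only. In the notebook these would be
# the (maxDepth, numTrees) pairs from the grid and cross_model.avgMetrics.
param_combinations = [(3, 50), (3, 100), (5, 50), (5, 100)]
avg_metrics = [0.61, 0.62, 0.64, 0.65]

# For areaUnderROC, larger is better, so pick the index of the maximum.
best_index = max(range(len(avg_metrics)), key=avg_metrics.__getitem__)
best_max_depth, best_num_trees = param_combinations[best_index]

print(best_index, best_max_depth, best_num_trees)
```

With the real objects, pairing the grid and the metrics with `zip(paramGrid, cross_model.avgMetrics)` gives the same view without relying on a hand-written list.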

In [None]:
# your code here:

cross_model.bestModel.stages[-1].getNumTrees

In [None]:
cross_model.bestModel.stages[-1].getMaxDepth()

In [None]:
cross_model.bestModel.stages[-1].summary

In [None]:
cross_model.bestModel.stages[-1].toDebugString

In [None]:
cross_model.avgMetrics

### Save the model so you can use it later in an ML application

* use [write](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.PipelineModel.html#pyspark.ml.PipelineModel.write) on the [PipelineModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.PipelineModel.html#pyspark.ml.PipelineModel)
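
Loading the saved pipeline in another application could look like the sketch below. `PipelineModel.load` is the counterpart of `write().save()`. The helper name `load_saved_model` is hypothetical.

```python
def load_saved_model(path):
    """Load a PipelineModel previously saved with write().save(path)."""
    # Imported inside the function so that defining the helper
    # does not itself require pyspark to be importable.
    from pyspark.ml import PipelineModel

    return PipelineModel.load(path)
```

For example, `load_saved_model(model_output_path).transform(new_data)` would score a DataFrame, assuming `new_data` has the input columns the pipeline expects.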

In [None]:
# your code here

(
    rf_model
    .write()
    .overwrite()
    .save(model_output_path)
)

In [None]:
spark.stop()