# Presentation
In this workshop we will explore MLlib features and apply them to the Titanic dataset.

We will try to predict passenger survival from a few features, using a logistic regression model.

## Load dataset
We need to download the dataset and move it to the distributed file storage (DBFS here).

In [0]:
# download dataset, make sure it is available on your gateway
import urllib.request  # the request submodule must be imported explicitly
url = "https://www.dropbox.com/s/uf1pfvrmbrmqoqz/titanic-passengers.csv?dl=1"
urllib.request.urlretrieve(url, "titanic.csv")
dbutils.fs.ls("file:/databricks/driver/")

# move the dataset to the file storage
dbutils.fs.mv("file:/databricks/driver/titanic.csv", "dbfs:/titanic.csv", recurse=True)

## Tools of the trade
We need a few imports to train models with MLlib.

In [0]:
from pyspark.sql import functions as F # you already know this one! needed whenever you want to transform columns
from pyspark.ml.feature import *       # this package contains most of the MLlib feature engineering tools
from pyspark.ml import Pipeline        # Pipeline chains feature transformers and a model into a single estimator

## Question 0
Load the dataset.

Make sure the inferred schema is correct.

In [0]:
df = spark.read.format("csv").load("dbfs:/titanic.csv", header=True, delimiter=";", inferSchema=True)
df.printSchema()
display(df)

In [0]:
train, test = df.cache().randomSplit([0.9, 0.1], seed=12345)

## Question 1
On the training set, fit a model that predicts passenger survival probability as a function of ticket price.

You will need to convert the Survived column to 0/1 before passing it to the logistic regression. Transform it with StringIndexer.

Use a pipeline ending with a logistic regression.

Compute the model AUC on the validation set (the `test` split).

Documentation:
- https://spark.apache.org/docs/latest/ml-classification-regression.html#binomial-logistic-regression
- https://spark.apache.org/docs/latest/ml-pipeline.html#example-pipeline
- https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification
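
By default, StringIndexer orders labels by descending frequency, so the most common value of the label column becomes 0.0. The plain-Python sketch below mimics that ordering so you can predict which class ends up as 1. The helper name `frequency_index` and the sample labels are illustrative assumptions, not part of the Spark API or of the dataset.

```python
from collections import Counter

def frequency_index(labels):
    """Map each distinct label to an index, most frequent label first.

    Mimics StringIndexer's default ordering (descending frequency);
    ties are broken alphabetically in this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample in which "No" is the majority class.
labels = ["No", "Yes", "No", "No", "Yes"]
mapping = frequency_index(labels)
indexed = [mapping[label] for label in labels]

print(mapping)   # {'No': 0.0, 'Yes': 1.0}
print(indexed)   # [0.0, 1.0, 0.0, 0.0, 1.0]
```

If the minority class is the one you want to call "positive", this default ordering already gives it index 1.0 in a binary problem.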

In [0]:
from pyspark.ml.classification import LogisticRegression

stages=[]
stages += [VectorAssembler(inputCols=["Fare"], outputCol="vec_fare")]
stages += [StringIndexer(inputCol="Survived", outputCol="int_survived")]
stages += [LogisticRegression(featuresCol="vec_fare", labelCol="int_survived")]
pipeline = Pipeline(stages=stages)

predictor = pipeline.fit(train)
df_pred = predictor.transform(test)

from pyspark.ml.evaluation import BinaryClassificationEvaluator
# AUC needs a continuous score: evaluate rawPrediction, not the thresholded prediction
evaluator = BinaryClassificationEvaluator(labelCol="int_survived", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
auc = evaluator.evaluate(df_pred)
display(df_pred)
print(auc)
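
AUC measures how well the model ranks positives above negatives, so it is most informative when computed on a continuous score rather than on 0/1 predictions. The sketch below is a plain-Python illustration: it computes AUC as the probability that a random positive outranks a random negative, with ties counting for one half. The function name `auc_from_scores` and the toy labels and scores are assumptions made for the example; this is not how Spark computes the metric internally.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count for one half. Quadratic in the number of rows, so only
    suitable for small illustrative inputs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.7, 0.9]                      # hypothetical model scores
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]  # thresholded predictions

auc_scores = auc_from_scores(labels, scores)
auc_hard = auc_from_scores(labels, hard)
print(auc_scores)  # 1.0
print(auc_hard)    # 0.75
```

The scores rank every positive above every negative, but thresholding at 0.5 merges 0.6, 0.7 and 0.9 into the same value and the ranking information is lost.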

## Question 2
We will do a lot of feature engineering now, and we don't want you to copy-paste code all along.

Write the following function:

Inputs:
- pipeline
- training set
- validation set

Outputs:
- auc
- transformed dataset (with prediction)

Make sure it works on the previous pipeline.

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
def analyze(pipeline, train, test):
  # fit the pipeline on the training set, then score the validation set
  predictor = pipeline.fit(train)
  df_pred = predictor.transform(test)
  # AUC needs a continuous score: evaluate rawPrediction, not the thresholded prediction
  evaluator = BinaryClassificationEvaluator(labelCol="int_survived", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
  auc = evaluator.evaluate(df_pred)
  return (auc, df_pred)

In [0]:
(auc, pred) = analyze(pipeline, train, test)
print(auc)
display(pred)

## Question 3
Relying on a raw continuous feature may be a bit rough.
We can try to bucketize the numeric feature into five buckets instead.

In [0]:
stages=[]

stages += [QuantileDiscretizer(inputCol="Fare", outputCol="buk_fare", numBuckets=5)]

#or:
#buckets = [float("-inf")]+[10*i for i in range(0,5)]+[float("inf")]
#stages += [Bucketizer(splits=buckets, inputCol="Fare", outputCol="buk_fare")]

stages += [OneHotEncoder(inputCol="buk_fare", outputCol="vec_fare")]
stages += [StringIndexer(inputCol="Survived", outputCol="int_survived")]
stages += [LogisticRegression(featuresCol="vec_fare", labelCol="int_survived")]
pipeline = Pipeline(stages=stages)
(auc, pred) = analyze(pipeline, train, test)
print(auc)
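
To make the bucketing step concrete, here is a plain-Python sketch of the idea behind quantile discretization: compute cut points so that each bucket holds roughly the same number of rows, then replace each value by its bucket index. The helper names and the sample fares are assumptions made for illustration. Spark's QuantileDiscretizer uses approximate quantiles, so its cut points may differ slightly from the exact ones computed here.

```python
from bisect import bisect_right
from statistics import quantiles

def quantile_splits(values, num_buckets):
    """Return the num_buckets - 1 interior cut points of the sample."""
    return quantiles(values, n=num_buckets, method="inclusive")

def bucket_index(value, splits):
    """Index of the bucket holding value.

    A value equal to a cut point falls in the upper bucket.
    """
    return float(bisect_right(splits, value))

# Hypothetical ticket prices, heavily skewed like real fares tend to be.
fares = [7.25, 8.05, 13.0, 26.0, 31.0, 52.0, 71.28, 80.0, 151.55, 512.33]
splits = quantile_splits(fares, 5)
buckets = [bucket_index(fare, splits) for fare in fares]

print(len(splits))  # 4
print(buckets)      # [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
```

Each bucket receives two fares even though the raw values span two orders of magnitude, which is why quantile buckets are often more robust than fixed-width ones on skewed features.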

## Question 4
Why don't you try to rely on other numerical features now?

You can try to leverage 'Age', and maybe 'PassengerId' while we're at it.

Is it better?

In [0]:
stages=[]
stages += [QuantileDiscretizer(inputCol="PassengerId", outputCol="buk_passengerid", numBuckets=5)]
stages += [OneHotEncoder(inputCol="buk_passengerid", outputCol="vec_passengerid")]
stages += [QuantileDiscretizer(inputCol="Fare", outputCol="buk_fare", numBuckets=5)]
stages += [OneHotEncoder(inputCol="buk_fare", outputCol="vec_fare")]

stages += [StringIndexer(inputCol="Survived", outputCol="int_survived")]
stages += [VectorAssembler(inputCols=["vec_passengerid", "vec_fare"], outputCol="features")]
stages += [LogisticRegression(featuresCol="features", labelCol="int_survived")]
pipeline = Pipeline(stages=stages)
(auc, df_pred) = analyze(pipeline, train, test)
print(auc)

## Question 5
We should try to use categorical features.

Remember, Spark only understands vectors, so you need to convert categories into vectors with OneHotEncoder.

Try several categories and identify what works.
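
As a reminder of what one-hot encoding produces, the sketch below builds the vector by hand in plain Python. It assumes OneHotEncoder's default `dropLast=True`, where the last category is encoded as an all-zeros vector, and it treats the input value itself as the category index. The function `one_hot` is a hypothetical helper, and Spark returns sparse vectors rather than the dense lists used here.

```python
def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector for a category index.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is encoded as all zeros.
    """
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} outside [0, {num_categories})")
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

# A column with values 1, 2, 3 used directly as indices implies
# 4 categories (0 to 3), with index 0 never observed.
first_class = one_hot(1, 4)
third_class = one_hot(3, 4)
third_class_full = one_hot(3, 4, drop_last=False)

print(first_class)       # [0.0, 1.0, 0.0]
print(third_class)       # [0.0, 0.0, 0.0]
print(third_class_full)  # [0.0, 0.0, 0.0, 1.0]
```

Dropping the last slot avoids a redundant column: with an intercept in the logistic regression, the all-zeros vector already identifies the last category.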

In [0]:
categories = ["Pclass", "SibSp", "Parch"]
for col_name in categories:
  stages=[]
  stages += [OneHotEncoder(inputCol=col_name, outputCol=f"vec_{col_name}")]
  stages += [StringIndexer(inputCol="Survived", outputCol="int_survived")]
  stages += [LogisticRegression(featuresCol=f"vec_{col_name}", labelCol="int_survived")]
  pipeline = Pipeline(stages=stages)
  (auc, pred) = analyze(pipeline, train, test)
  print(f"{col_name} : {auc}")

Sex is not numeric: we need to index it with StringIndexer before one-hot encoding it!

In [0]:
stages=[]
stages += [StringIndexer(inputCol="Sex", outputCol="int_sex")]
stages += [OneHotEncoder(inputCol="int_sex", outputCol="vec_sex")]
stages += [StringIndexer(inputCol="Survived", outputCol="int_survived")]
stages += [LogisticRegression(featuresCol="vec_sex", labelCol="int_survived")]
pipeline = Pipeline(stages=stages)
(auc, pred) = analyze(pipeline, train, test)
print(f"Sex : {auc}")  # col_name is left over from the previous loop, so label this result explicitly

## Question 6

Try to:
- rely on the Name feature
- cross features, e.g. build a feature like "passenger is male and passenger is older than 30 years"
- use feature hashing
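
The last two ideas can be prototyped in plain Python before translating them into pipeline stages. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the sex values are assumed to be "male" and "female", the sample name is illustrative, and MD5 is used as a stable hash purely for the demo. In Spark, `HashingTF` and `FeatureHasher` from `pyspark.ml.feature` are natural starting points for the hashing part.

```python
import hashlib

def cross_male_over_30(sex, age):
    """Crossed feature: 1.0 if the passenger is male AND older than 30.

    A missing age (None) yields 0.0 in this sketch.
    """
    return 1.0 if sex == "male" and age is not None and age > 30 else 0.0

def hash_tokens(tokens, num_features=16):
    """Hashing trick: count tokens into a fixed-size vector.

    Each token lands in slot hash(token) % num_features, so no vocabulary
    has to be stored; distinct tokens may collide in the same slot.
    """
    vector = [0.0] * num_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % num_features] += 1.0
    return vector

# Illustrative passenger name, split into lowercase word tokens.
name = "Braund, Mr. Owen Harris"
tokens = name.lower().replace(",", " ").replace(".", " ").split()
name_vector = hash_tokens(tokens)

print(tokens)                          # ['braund', 'mr', 'owen', 'harris']
print(len(name_vector))                # 16
print(cross_male_over_30("male", 35))  # 1.0
```

Titles such as "mr" or "mrs" inside the name are the kind of token that hashing can expose to the model without any manual parsing.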