# Presentation
In this workshop we will explore MLlib features and apply them to the Titanic dataset.

We will try to predict passenger survival from a few features, using a logistic regression model.

## Install Spark Environment
Since we are not running on Databricks, we need to install Spark ourselves every time we start a session.  
We need to install Spark as well as a Java Runtime Environment.  
Then we need to set up a few environment variables.

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!curl -O https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

conf = SparkConf().set('spark.ui.port', '4050')
sc = SparkContext(conf=conf)
spark = SparkSession.builder.master('local[*]').getOrCreate()

## Optional step: enable the Spark UI through a secure tunnel
This step is useful if you want to look at the Spark UI.
First, create a free ngrok account: https://dashboard.ngrok.com/login.  
Then log in on the website and copy your AuthToken.

In [None]:
# this step downloads ngrok, configures your AuthToken, then starts the tunnel
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
!./ngrok authtoken my_ngrok_auth_token_retrieved_from_website # <-------------- change this line !
get_ipython().system_raw('./ngrok http 4050 &')

**Now** get the Spark UI URL at https://dashboard.ngrok.com/endpoints/status. We're done!

## Load dataset
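We need to download the dataset and move it to distributed file storage (DBFS in the cell below, which relies on the Databricks-only `dbutils` helper).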

In [None]:
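# download the dataset to the driver's local disk
import urllib.request  # `import urllib` alone does not load the `request` submodule
url = "https://www.dropbox.com/s/1tl236ptjuwvcib/titanic-passengers.csv?dl=1"
urllib.request.urlretrieve(url, "titanic.csv")

# Databricks only: `dbutils` is not defined elsewhere (e.g. with the local install above).
# Outside Databricks, skip the two calls below and read "titanic.csv" from the working directory.
dbutils.fs.ls("file:/databricks/driver/")

# move the dataset to the file storage
dbutils.fs.mv("file:/databricks/driver/titanic.csv", "dbfs:/titanic.csv", recurse=True)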

## Tools of the trade
We need a few imports to train a model with MLlib.

In [None]:
from pyspark.sql import functions as F # you already know this one! needed whenever you want to transform columns
from pyspark.ml.feature import *       # this package contains most of MLlib's feature engineering tools
from pyspark.ml import Pipeline        # a Pipeline chains feature transformers and a final estimator

## Question 0
Load the dataset.

Make sure the resulting schema is correct.

In [None]:
train, test = df.cache().randomSplit([0.9, 0.1], seed=12345)

## Question 1
On the training set, fit a model that predicts passenger survival probability as a function of ticket price.

You will need to convert the survived column to 0/1 before passing it to the logistic regression. Transform it with StringIndexer.

Use a pipeline ending with a logistic regression.

Compute the model's AUC on the validation set (the `test` split created above).

Documentation:
- https://spark.apache.org/docs/latest/ml-classification-regression.html#binomial-logistic-regression
- https://spark.apache.org/docs/latest/ml-pipeline.html#example-pipeline
- https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification
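
For intuition about the metric asked for above, here is a minimal plain-Python sketch of AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The `auc` function and the toy labels and scores are illustrative assumptions, not part of the workshop code; Spark's evaluator estimates the same quantity from the ROC curve, so values may differ slightly.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie: half credit
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = survived) and model scores, for illustration only.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
result = auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which makes it convenient for comparing the pipelines built in the next questions.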

## Question 2
We will do a lot of feature engineering now, and we don't want you to copy-paste code all along.

Write the following function:

Inputs:
- pipeline
- training set
- validation set

Outputs:
- auc
- transformed dataset (with prediction)

Make sure it works on the previous pipeline.
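
One possible shape for such a function is sketched below. The name `evaluate_pipeline` and the extra `evaluator` argument are assumptions made for illustration: the evaluator is expected to be something like a `BinaryClassificationEvaluator` configured with `metricName="areaUnderROC"` and the right label column. You could just as well build the evaluator inside the function to match the signature above exactly.

```python
def evaluate_pipeline(pipeline, train, valid, evaluator):
    """Fit `pipeline` on `train`, score `valid`, and return (auc, predictions).

    `evaluator` is an assumed extra argument: any object whose
    `evaluate(dataset)` method returns the metric as a float.
    """
    model = pipeline.fit(train)            # fitting a Pipeline returns a fitted model
    predictions = model.transform(valid)   # validation set with prediction columns added
    auc = evaluator.evaluate(predictions)  # a single number summarizing ranking quality
    return auc, predictions
```

Keeping the function free of column names means every pipeline from the following questions can be passed to it unchanged.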

## Question 3
Relying on a raw continuous feature may be a bit rough.
We can try to bucketize the numeric feature into five buckets instead.
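
To see what bucketizing does, here is a plain-Python sketch of equal-frequency bucketing into five buckets. The helper names and the fare values are made up for illustration. In MLlib, `QuantileDiscretizer` (which learns the cut points) and `Bucketizer` (which applies given cut points) play these roles, and Spark computes approximate quantiles, so its boundaries may differ from the ones below.

```python
import bisect
import statistics


def fit_quantile_splits(values, num_buckets=5):
    """Return the num_buckets - 1 cut points splitting `values` into equal-frequency buckets."""
    return statistics.quantiles(values, n=num_buckets)


def bucketize(value, splits):
    """Map a value to a bucket index between 0 and len(splits)."""
    return bisect.bisect_right(splits, value)


# Made-up fares, for illustration only (not taken from the dataset).
fares = [7.25, 8.05, 13.0, 26.0, 30.0, 53.1, 71.28, 77.96, 151.55, 512.33]
splits = fit_quantile_splits(fares)               # learned on the training data
buckets = [bucketize(f, splits) for f in fares]   # bucket index per fare
```

The bucket index is a category rather than a quantity, so a very large fare no longer dominates the linear term of the logistic regression.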

## Question 4
Why don't you try to rely on other numerical features now?

You can try to leverage 'Age', and maybe 'PassengerId' while we're at it.

Is it better?

## Question 5
We should try to use categorical features.

Remember, Spark only understands vectors, so you need to convert categories into vectors with OneHotEncoder.

Try several categories and identify what works.

Sex is not numeric: we need to convert it to an index (e.g., with StringIndexer) before one-hot-encoding it!
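
The two-step conversion (string to index, then index to vector) can be sketched in plain Python as follows. The helper names and sample values are illustrative assumptions. The sketch mimics two MLlib defaults: `StringIndexer` assigns index 0 to the most frequent label, and `OneHotEncoder` drops the last category so that the encoded columns are not linearly dependent.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first."""
    counts = Counter(values)
    # Ties are broken alphabetically here to keep the result deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}


def one_hot(index, size, drop_last=True):
    """Encode a category index as a 0/1 vector; the last category maps to all zeros when dropped."""
    width = size - 1 if drop_last else size
    vec = [0.0] * width
    if index < width:
        vec[index] = 1.0
    return vec


# Made-up sample of the 'Sex' column, for illustration only.
sexes = ["male", "female", "male", "male", "female"]
mapping = fit_string_index(sexes)
encoded = [one_hot(mapping[s], len(mapping)) for s in sexes]
```

With only two categories and the last one dropped, the encoded vector has a single component, which is all a linear model needs to tell the two groups apart.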

## Question 6

Try to:
- rely on the name feature
- cross features, e.g., use a feature like: passenger is male and older than 30 years
- use feature hashing
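
The ideas in this list can be illustrated with a small plain-Python sketch, under stated assumptions: all helper names are hypothetical, and MD5 stands in for the MurmurHash3 function Spark uses, so bucket indices will not match those of MLlib's `HashingTF` or `FeatureHasher`. Only the principle is the same: each token is hashed to a position in a fixed-size vector.

```python
import hashlib


def name_tokens(name):
    """Split a passenger name into lowercase tokens (surname, title, first names)."""
    return name.lower().replace(",", " ").replace(".", " ").split()


def hash_features(tokens, num_features=16):
    """Hashing trick: count tokens in a fixed-size vector indexed by a hash of the token."""
    vec = [0.0] * num_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % num_features
        vec[idx] += 1.0   # distinct tokens may collide in the same slot
    return vec


def male_over_30(sex, age):
    """Crossed feature: 1.0 when the passenger is male AND older than 30, else 0.0."""
    return 1.0 if sex == "male" and age is not None and age > 30 else 0.0


tokens = name_tokens("Braund, Mr. Owen Harris")
hashed = hash_features(tokens)
crossed = male_over_30("male", 35)
```

Hashing avoids building a vocabulary of every name token, at the cost of occasional collisions; a larger `num_features` makes collisions rarer.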