# \[05\] Training Models

## First thoughts on how to train

We have a little more than 8 weeks of event data, aggregated per week and user.

For training our model we have to make some decisions:

### How far in the future is a churn counted as predictable?  

A churn that will happen in 2 months cannot be predicted from the current data.  
My feeling is that one or two weeks into the future should be fine. 

### How much data from the past do we need to make a prediction?  

Very old history data will not have an impact on the user's current decision.  
But we only have two months of data, so we cannot go too far back anyway.  

### Which time spans should be compared?

It might be important to detect changes between past behaviour and current behaviour.  
So comparing the events from the last week with the events from the last month might give new insights.

#### Example: 1 week future, 1 week new history, 3 weeks old history

The following image illustrates how the training data is derived from the full dataset.  
The green line is "now", the current time.  
We want to predict whether the user will churn in the near future (= next week).  
The blue line marks the latest history (one week ago until "now").  
And finally the purple line marks the old history (4 weeks ago until 1 week ago).  

![weeks-for-training-datasets](imgs/weeks-for-training-datasets.png)

If we limit the history to 4 weeks, we can create multiple training datasets:

* predicting the results of week 0 with data from weeks 1, 2, 3, 4 (as shown in the image)
* predicting the results of week 1 with data from weeks 2, 3, 4, 5
* ...
* predicting the results of week 3 with data from weeks 4, 5, 6, 7

So we get 4 times as many training rows as we have users.
Maybe we can also make a prediction for week 4, but then we only have partial data (5 of 7 days) for week 8.
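
The sliding-window idea described above can be sketched in plain Python. This is only an illustration and not part of the notebook's pipeline: the function name `training_windows` and its parameters are assumptions, and week 0 is taken to be the most recent full week, as in the image.

```python
def training_windows(num_full_weeks, history_weeks=4, label_weeks=1):
    """
    Enumerate (label_weeks, history_weeks) pairs for a sliding window.

    Week ids count backwards from the end of the data: week 0 is the most
    recent full week, week num_full_weeks-1 is the oldest full week.
    """
    windows = []
    first_label_week = 0
    while True:
        label = list(range(first_label_week, first_label_week + label_weeks))
        hist_start = first_label_week + label_weeks
        history = list(range(hist_start, hist_start + history_weeks))
        if history[-1] > num_full_weeks - 1:
            break  # the history would reach beyond the oldest full week
        windows.append((label, history))
        first_label_week += 1
    return windows


windows = training_windows(num_full_weeks=8)
for label, history in windows:
    print(f"predict week(s) {label} from weeks {history}")
```

With 8 full weeks and a 4-week history this yields the four windows listed above; a fifth window would need week 8, which is only partially covered.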

A potential problem: if a user has already churned (downgraded) within the history window,  
the comparison between old and latest history might not show any differences.  


## Starter Model

As a first attempt I will use the following setup:

* Label = 1: in the next week the user will have at least one churn
* History data: one month of history data will be used to make a prediction
* Comparison: the last week of the history data will be put into relation with the previous three weeks


## Setup Spark Session

For a detailed description of what is done here, see [01-setup-spark-session.ipynb](01-setup-spark-session.ipynb).


In [3]:
EVENT_DATA_URL = "s3a://udacity-dsnd/sparkify/sparkify_event_data.json"
# EVENT_DATA_URL = "s3a://udacity-dsnd/sparkify/mini_sparkify_event_data.json"

CLEAN_DATA_URL = EVENT_DATA_URL.replace("/sparkify/", "/sparkify/output/02-cleaned-")
WEEK_AGGREGATED_DATA_URL = EVENT_DATA_URL.replace("/sparkify/", "/sparkify/output/04-week-aggregated-")
MODEL_URL = EVENT_DATA_URL.replace("/sparkify/", "/sparkify/output/05-model-").replace(".json", "")

EXECUTOR_INSTANCES = 2
EXECUTOR_MEM = '6g'

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from cryptography.fernet import Fernet
import base64
import socket

!./install-s3-jars.sh

def decrypt(encrypted_text):
    """
    decrypts an encrypted text. The seed (master-password) for decryption is read from the file ".seed.txt"
    
    Input: encrypted_text
    
    Output: the decrypted text. If the text was not encrypted with the same seed, 
            an exception is raised.
    """
    with open('.seed.txt') as f:
        seed = f.read().strip()
    return Fernet(base64.b64encode((seed*32)[:32].encode('ascii')).decode('ascii')).decrypt(encrypted_text.encode('ascii')).decode('ascii')

AWS_ACCESS_KEY_ID='V6ge1JcQpvyYGJjb'
AWS_SECRET_ACCESS_KEY = decrypt('gAAAAABkDFI6865LaVJVgtTYo0aMx9-JTPbTo6cwOUjg5eNNPsZhBDoHbRZ8xuXQT0ImNfvqcecZuoJd1VzYQEpBaxyCnKvosii8O1KeqoL2NwKdKtL_AUfT4eW4dvJVP--VjEvc0gB4')
OWN_IP=socket.gethostbyname(socket.gethostname())
APP_NAME = "Sparkify"
SPARK_MASTER = "spark://bit-spark-master-svc.spark.svc.cluster.local:7077"
S3_HOST = "minio-api-service.minio.svc"

print(f'### SETUP SPARK SESSION "{APP_NAME}"')
spark = SparkSession.builder \
    .master(SPARK_MASTER) \
    .config("spark.jars","/home/jovyan/jars/aws-java-sdk-bundle-1.11.1026.jar,/home/jovyan/jars/hadoop-aws-3.3.2.jar") \
    .config("spark.driver.host", OWN_IP) \
    .config("spark.hadoop.fs.s3a.endpoint", S3_HOST) \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
    .config("spark.hadoop.fs.s3a.access.key", AWS_ACCESS_KEY_ID) \
    .config("spark.hadoop.fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY) \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.executor.instances", EXECUTOR_INSTANCES) \
    .config("spark.executor.memory", EXECUTOR_MEM) \
    .appName(APP_NAME).getOrCreate()
print(f"Spark version: {spark.version}")
sc = spark.sparkContext
sc.setLogLevel("WARN")



### SETUP SPARK SESSION "Sparkify"
Spark version: 3.3.2


## Get Aggregated Data

There are two ways to get the data aggregated per week:  
load the aggregated data saved in step 04 from S3, or reapply the transformations to the original dataset.  

**Apply only one of the two possibilities.**

### Possibility 1 - Load aggregated dataset

In [4]:
print(f"### LOAD DATA {WEEK_AGGREGATED_DATA_URL}")
df_userweek = spark.read.json(WEEK_AGGREGATED_DATA_URL)
print(f"### PERSIST")
df_userweek_persist = df_userweek.persist()
df_userweek = df_userweek_persist

### LOAD DATA s3a://udacity-dsnd/sparkify/output/04-week-aggregated-sparkify_event_data.json
### PERSIST


### Possibility 2 - Load original dataset and Apply Transformations

For a detailed description of what is done here, see [02-data-introspection.ipynb](02-data-introspection.ipynb) and [04-aggregate-data.ipynb](04-aggregate-data.ipynb).

In [33]:
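import datetime

import pyspark.sql.functions as F
from pyspark.sql import Window
from pyspark.sql.types import IntegerType

# NOTE: one_hour, one_day and one_week are defined in the "Constants" cell further down;
# that cell has to be executed before this one.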

print(f"### LOAD DATA {EVENT_DATA_URL}")
df = spark.read.json(EVENT_DATA_URL)

# --- Step 02 Cleanup

def norm_colname(name):
    """
    Input: name which can contain spaces with upper and lowercase letters.
    Output: all spaces replaced with an underscore and all letters converted to lowercase
    """
    return name.replace(' ', '_').lower()

print(f"### DROP UNUSED COLUMNS")
df = df.drop("artist", "auth", "firstName", "lastName", "length", "location", "method", "song", "userAgent")
print(f"### REMOVE EMPTY USERID")
df = df.filter(df.userId != '')
print(f"### ADD ID")
w = Window().orderBy("ts")
df = df.withColumn("id", F.row_number().over(w))
print(f"### VECTORIZE PAGE FEATURES")
page_features = df.groupBy("id").pivot("page").agg(F.lit(1)).na.fill(0)
page_features = page_features.toDF(*(("pg_"+norm_colname(col)) if col!="id" else "id" for col in page_features.columns))
df = df.join(page_features, "id")
print(f"### VECTORIZE LEVEL FEATURE")
df = df.withColumn("paid", (df.level == 'paid').cast('int'))
df = df.drop("level")
print(f"### VECTORIZE GENDER FEATURE")
df = df.withColumn("male", (df.gender == 'M').cast('int'))
df = df.drop("gender")
print(f"### VECTORIZE STATUS FEATURES")
status_features = df.groupBy("id").pivot("status").agg(F.lit(1)).na.fill(0)
status_features = status_features.toDF(*(("status_"+col) if col != "id" else "id" for col in status_features.columns)).drop("status_200")
df = df.join(status_features, "id")
df = df.drop("status")
print(f"### ADD SID")
df_sess_user = df.select("sessionId", "userId").dropDuplicates()
w = Window().orderBy("sessionId", "userId")
df_sess_user = df_sess_user.withColumn("sid", F.row_number().over(w))
df = df.join(df_sess_user, ["sessionId", "userId"])
df_session_start = df.groupBy("sid").agg(F.min("id").alias("id")).drop("sid").withColumn("session_start", F.lit(1).cast("int"))
df = df.join(df_session_start, "id", how="outer").fillna(0)
df = df.drop("sessionId", "itemInSession")
print(f"### PERSIST")
df_persist = df.persist()
df = df_persist

# --- Step 04 Aggregate Week

print(f"### TODO - MOVE TO CLEANUP / MAKE GENERIC")
ts_last = df.agg(F.max(df.ts).alias("ts_last")).collect()[0].ts_last
df = df.where(F.col("ts")!=ts_last)
print(f"### GET LAST TS")
ts_first = df.agg(F.min(df.ts).alias("ts_first")).collect()[0].ts_first
ts_last = df.agg(F.max(df.ts).alias("ts_last")).collect()[0].ts_last
days = (ts_last - ts_first)/one_day
print(f"first timestamp: {datetime.datetime.fromtimestamp(ts_first/1000.0)}")
print(f"last timestamp: {datetime.datetime.fromtimestamp(ts_last/1000.0)}")
print(f"days: {days}")
print(f"### ADD WEEK ID")
df = df.withColumn("wid", F.floor((ts_last-F.col("ts"))/one_week))
df.groupBy("wid").count().sort("wid").show()
print(f"### AGG MALE")
df_user = df.groupBy("userId").agg(F.max(F.col("male")).alias("usermale"), F.max((ts_last-F.col("registration"))/one_day).alias("userregistration"))
print(f"### COUNT NUM EVENTS")
aggs = [F.count(F.col("id")).alias("num_events")]
df_num_events_agg = df.groupBy("userId", "wid").agg(*aggs)
print(f"### AGG PG/STATUS")
sum_cols = [col for col in df.columns if col.startswith("pg_") or col.startswith("status_")]
aggs = [F.sum(F.col(col)).alias(col) for col in sum_cols]
df_pg_status_agg = df.groupBy("userId", "wid").agg(*aggs)
print(f"### AGG SESSIONSTART")
aggs = [F.sum(F.col("session_start")).alias("session_start"), F.max(F.col("id")).alias("max_id")]
df_sessionstart_agg = df.groupBy("userId", "wid").agg(*aggs)
print(f"### AGG SESSIONHOURS")
df_sessionhours_agg = df.groupBy("userId", "wid", "sid").agg(F.max(F.col("ts")), F.min(F.col("ts"))).withColumn("session_hours", (F.col("max(ts)")-F.col("min(ts)"))/one_hour).groupBy("userId", "wid").agg(F.sum(F.col("session_hours")).alias("session_hours"))
print(f"### AGG LAST PAID")
df_paid_agg = df.join(df_sessionstart_agg.withColumnRenamed("max_id", "id"), ["userId", "wid", "id"]).select("userId", "wid", "id", "paid").drop("id")
df_sessionstart_agg = df_sessionstart_agg.drop("max_id")
print(f"### PUTTING TOGETHER")
df_userweek = df_pg_status_agg.join(df_sessionstart_agg, ["userId", "wid"])
df_userweek = df_userweek.join(df_sessionhours_agg, ["userId", "wid"])
df_userweek = df_userweek.join(df_paid_agg, ["userId", "wid"])
df_userweek = df_userweek.join(df_num_events_agg, ["userId", "wid"])
df_userweek = df_userweek.join(df_user, ["userId"])

print(f"### PERSIST")
df_userweek_persist = df_userweek.persist()
df_userweek = df_userweek_persist
df_persist.unpersist()


### LOAD DATA s3a://udacity-dsnd/sparkify/mini_sparkify_event_data.json
### DROP UNUSED COLUMNS
### REMOVE EMPTY USERID
### ADD ID
### VECTORIZE PAGE FEATURES
### VECTORIZE LEVEL FEATURE
### VECTORIZE GENDER FEATURE
### VECTORIZE STATUS FEATURES
### ADD SID
### PERSIST


## Imports

Here are all imports that are needed in the cells below.

In [5]:
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel, DecisionTreeClassifier, DecisionTreeClassificationModel, LinearSVC
from pyspark.ml.feature import RegexTokenizer, VectorAssembler, Normalizer, StandardScaler, MinMaxScaler, MaxAbsScaler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
import pyspark.sql.functions as F
from pyspark.sql import Window
from pyspark.sql.types import IntegerType
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import datetime


## Constants

Constants that are used in the cells below.

In [6]:
# timestamp constants for ts in milliseconds
one_hour =        60*60*1000  #     3.600.000
one_day =      24*60*60*1000  #    86.400.000
one_week =   7*24*60*60*1000  #   604.800.000
one_month = 28*24*60*60*1000  # 2.419.200.000

## Setup three Dataframes

We will set up the following three dataframes:
* df_label
* df_newhistory
* df_oldhistory

The code is kept flexible, so that we can also experiment with the timeframes.

In [12]:
# weeks to look into the future from the predict-timestamp for label
FUTURE_LOOKAHEAD_WEEKS = 2
# weeks to look into the past from the predict-timestamp for new history
PAST_NEAR_HISTORY_WEEKS = 1
# weeks to look into the past from the predict-timestamp for old history
PAST_OLD_HISTORY_WEEKS = 3

current_week = 2

label_week_min = current_week-FUTURE_LOOKAHEAD_WEEKS
label_week_max = current_week-1

newhistory_week_min = current_week
newhistory_week_max = newhistory_week_min+PAST_NEAR_HISTORY_WEEKS-1

oldhistory_week_min = newhistory_week_max+1
oldhistory_week_max = oldhistory_week_min+PAST_OLD_HISTORY_WEEKS-1

print(f"Weeks to aggregate:")
print(f"  label: {label_week_min} - {label_week_max}")
print(f"  newhistory: {newhistory_week_min} - {newhistory_week_max}")
print(f"  oldhistory: {oldhistory_week_min} - {oldhistory_week_max}")


Weeks to aggregate:
  label: 0 - 1
  newhistory: 2 - 2
  oldhistory: 3 - 5
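
The week arithmetic from the cell above can also be wrapped into a small function, which makes it easier to experiment with different timeframes. The following is a minimal sketch in plain Python; the function name `week_ranges` and its return format are assumptions and do not appear in the notebook.

```python
def week_ranges(current_week, future_weeks, near_weeks, old_weeks):
    """
    Return inclusive (min, max) week-id ranges for the label window,
    the new history and the old history. Week ids count backwards,
    so week 0 is the most recent week.
    """
    label = (current_week - future_weeks, current_week - 1)
    if label[0] < 0:
        raise ValueError("label window reaches beyond the most recent week")
    new_min = current_week
    new_max = new_min + near_weeks - 1
    old_min = new_max + 1
    old_max = old_min + old_weeks - 1
    return {
        "label": label,
        "newhistory": (new_min, new_max),
        "oldhistory": (old_min, old_max),
    }


ranges = week_ranges(current_week=2, future_weeks=2, near_weeks=1, old_weeks=3)
for name, (week_min, week_max) in ranges.items():
    print(f"  {name}: {week_min} - {week_max}")
```

Called with the values used in the notebook, it reproduces the ranges printed above.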


## Constant and Current User Info

Get the constant and current user info from current week (=newhistory_week_min)

In [11]:
df_user = df_userweek.where(F.col("wid") == newhistory_week_min).select("userId", "paid", "usermale", "userregistration")

Because "current_week" is now set 2 weeks into the past, the relative value in days for  
"userregistration" has to be adapted.  
Going 1 week into the past means that userregistration has to be reduced by 7 days.

Theoretically the value can become negative, but that means that there is no history data at all for this user. So, the user will not be used for training/prediction.

In [13]:
df_user = df_user.withColumn("userregistration", F.col("userregistration")-7*newhistory_week_min)

## Helper function to get dataset

In [14]:
def aggregate_week_data(from_week, to_week):
    """
    Input: from_week, to_week
    Output: data summed per user over the weeks from_week..to_week (both inclusive)
    """
    dropcols = ["paid", "usermale", "userregistration", "wid"]
    df_weeks = df_userweek.where((F.col("wid")>=from_week)&(F.col("wid")<=to_week))
    if from_week == to_week:
        # no aggregation necessary, if there is only one week
        return df_weeks.drop(*dropcols)
    aggs = [F.sum(F.col(col)).alias(col) for col in df_weeks.columns if not col in ["userId", *dropcols]]
    df_weeks = df_weeks.groupBy("userId").agg(*aggs)
    return df_weeks    


## Create three Datasets

Using the helper function we can now split the data into three partitions as explained above.

In [16]:
df_label = aggregate_week_data(label_week_min, label_week_max)
df_newhistory = aggregate_week_data(newhistory_week_min, newhistory_week_max)
df_oldhistory = aggregate_week_data(oldhistory_week_min, oldhistory_week_max)

## Set Label

Check in df_label if a user churned ("pg_cancellation_confirmation" or "pg_submit_downgrade" event) and add column "label" to df_user containing this info.

In [17]:
df_label = df_label.withColumn("label", F.when(F.col("pg_cancellation_confirmation")+F.col("pg_submit_downgrade")>0, F.lit(1)).otherwise(F.lit(0))).select("userId", "label")
df_user = df_user.join(df_label, "userId")

## Add history

Add new and old history to df_user. Rename the columns to make them distinct.

### Helper function

Add a prefix to all columns in a dataframe

In [18]:
def prefix_columns(df_orig, prefix, do_not_change_cols):
    """
    Input:  df_orig - Original DataFrame
            prefix - string to be added to all column names
            do_not_change_cols - list of columns which should not be changed
    Output: new DataFrame, where all columns were renamed (prefix was added to the name at the beginning)
            except for the columns in "do_not_change_cols", those are unchanged.
    """
    newcols = [prefix+col if not col in do_not_change_cols else col for col in df_orig.columns]
    return df_orig.toDF(*newcols)

Add prefix "nh_" to all columns in new-history and prefix "oh_" to all columns in old-history:

In [19]:
df_user = df_user.join(prefix_columns(df_newhistory, "nh_", ["userId"]), "userId")
df_user = df_user.join(prefix_columns(df_oldhistory, "oh_", ["userId"]), "userId")

## Normalize Column values

To make the values from the old and the new history comparable, the absolute values (number of specific events) are divided by the aggregated session_hours (the time the user spent in Sparkify during the aggregated time interval).
As an exception, session_start and session_hours are handled differently: they are divided by the number of weeks in the aggregation interval. So here the new history is divided by 1 and the old history is divided by 3, which normalizes these two columns to "per week":

In [20]:
for c in df_newhistory.columns:
    if not c in ["userId", "session_hours", "session_start"]:
        df_user = df_user.withColumn("nhn_"+c, F.col("nh_"+c)/F.greatest(F.col("nh_session_hours"), F.lit(0.01)))
df_user = df_user.withColumn("nhn_session_hours", F.col("nh_session_hours")/PAST_NEAR_HISTORY_WEEKS)
df_user = df_user.withColumn("nhn_session_start", F.col("nh_session_start")/PAST_NEAR_HISTORY_WEEKS)

for c in df_oldhistory.columns:
    if not c in ["userId", "session_hours", "session_start"]:
        df_user = df_user.withColumn("ohn_"+c, F.col("oh_"+c)/F.greatest(F.col("oh_session_hours"), F.lit(0.01)))
df_user = df_user.withColumn("ohn_session_hours", F.col("oh_session_hours")/PAST_OLD_HISTORY_WEEKS)
df_user = df_user.withColumn("ohn_session_start", F.col("oh_session_start")/PAST_OLD_HISTORY_WEEKS)        
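
To make the effect of this normalization concrete, here is a small standalone sketch in plain Python (no Spark). The function name `normalize_history` and all numbers are invented for illustration; only the two rules (events per session hour, sessions per week) and the 0.01 lower bound are taken from the cell above.

```python
def normalize_history(counts, session_hours, session_start, weeks, min_hours=0.01):
    """
    counts:        dict of event name -> absolute count in the interval
    session_hours: total hours spent in sessions during the interval
    session_start: number of sessions started during the interval
    weeks:         length of the interval in weeks
    Returns event rates per session hour, plus session_hours and
    session_start per week.
    """
    hours = max(session_hours, min_hours)  # same guard as F.greatest(..., 0.01)
    normalized = {name: value / hours for name, value in counts.items()}
    normalized["session_hours"] = session_hours / weeks
    normalized["session_start"] = session_start / weeks
    return normalized


# Invented example user: 1 week of new history, 3 weeks of old history
new = normalize_history({"pg_nextsong": 60, "pg_thumbs_up": 3},
                        session_hours=4.0, session_start=2, weeks=1)
old = normalize_history({"pg_nextsong": 270, "pg_thumbs_up": 18},
                        session_hours=18.0, session_start=9, weeks=3)
print("new:", new)
print("old:", old)
```

In this made-up case the user still plays 15 songs per session hour in both intervals, but the weekly session time drops from 6 to 4 hours and the thumbs-up rate drops from 1.0 to 0.75 per hour. These are the kinds of changes the normalized columns are meant to expose.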


## Create Feature Column

Multiple test runs have shown that the best selection for training is the set of normalized columns "nhn_..." and "ohn_...".  
The columns containing the absolute values are not used for training.  

In [40]:
featureCols = ["paid", "usermale", "userregistration"]
featureCols = [*featureCols, *[col for col in df_user.columns if col.startswith("nhn_") or  col.startswith("ohn_")]]

In [41]:
featureCols

['paid',
 'usermale',
 'userregistration',
 'nhn_num_events',
 'nhn_pg_about',
 'nhn_pg_add_friend',
 'nhn_pg_add_to_playlist',
 'nhn_pg_cancel',
 'nhn_pg_cancellation_confirmation',
 'nhn_pg_downgrade',
 'nhn_pg_error',
 'nhn_pg_help',
 'nhn_pg_home',
 'nhn_pg_login',
 'nhn_pg_logout',
 'nhn_pg_nextsong',
 'nhn_pg_register',
 'nhn_pg_roll_advert',
 'nhn_pg_save_settings',
 'nhn_pg_settings',
 'nhn_pg_submit_downgrade',
 'nhn_pg_submit_registration',
 'nhn_pg_submit_upgrade',
 'nhn_pg_thumbs_down',
 'nhn_pg_thumbs_up',
 'nhn_pg_upgrade',
 'nhn_status_307',
 'nhn_status_404',
 'nhn_session_hours',
 'nhn_session_start',
 'ohn_num_events',
 'ohn_pg_about',
 'ohn_pg_add_friend',
 'ohn_pg_add_to_playlist',
 'ohn_pg_cancel',
 'ohn_pg_cancellation_confirmation',
 'ohn_pg_downgrade',
 'ohn_pg_error',
 'ohn_pg_help',
 'ohn_pg_home',
 'ohn_pg_login',
 'ohn_pg_logout',
 'ohn_pg_nextsong',
 'ohn_pg_register',
 'ohn_pg_roll_advert',
 'ohn_pg_save_settings',
 'ohn_pg_settings',
 'ohn_pg_submit_downg

Pack all feature columns as a vector into the column "features".

In [42]:
assembler = VectorAssembler(inputCols=featureCols, outputCol="features")
df_testtrain_vec=assembler.transform(df_user).select("userId", "label","features")

## Train / Test split

Divide the data into training and test data. 70% of the data will be used for training, the other 30% for testing (validation).

In [43]:
df_train, df_test = df_testtrain_vec.randomSplit([0.7, 0.3], seed=42)

## Handling unbalanced data

The data is unbalanced: there are many more users who do not churn than users who churn.  
To handle this, we add a "weight" column, which can be used during training to give the rows different weights.  
The sum of all weights for rows labeled "1" should be equal to the sum of all weights for rows labeled "0".

Here is a helper function:

In [44]:
def add_weight_col(df_train):
    """
    Input:  Training DataFrame with a column "label" containing a binary classification (1 or 0)
    Output: Newly created DataFrame, which contains an additional "weight" column, which compensates 
            the different frequency of both partitions.  
    """
    label_counts = df_train.agg(F.sum(F.col("label")).alias("l1"), F.sum(1-F.col("label")).alias("l0")).collect()[0]
    l0 = label_counts.l0
    l1 = label_counts.l1
    w1 = l0 / (l0+l1)
    w0 = l1 / (l0+l1)
    print(f"label 0: {l0}, label 1: {l1}")
    df_result = df_train.withColumn("weight", F.when(F.col("label")==1, F.lit(w1)).otherwise(F.lit(w0)))
    return df_result

In [45]:
df_train = add_weight_col(df_train)
df_train = df_train.persist()

label 0: 6315, label 1: 1183
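
As a quick plausibility check, the weighting rule can be reproduced in plain Python with the label counts printed above. The function name `class_weights` is only illustrative; the formula is the one used in `add_weight_col`.

```python
def class_weights(n_negative, n_positive):
    """
    Return (w_negative, w_positive) so that both classes contribute
    the same total weight: n_negative * w_negative == n_positive * w_positive.
    """
    total = n_negative + n_positive
    w_positive = n_negative / total  # the rare class gets the larger weight
    w_negative = n_positive / total
    return w_negative, w_positive


# label counts of the training set as printed above
w0, w1 = class_weights(n_negative=6315, n_positive=1183)
sum0 = 6315 * w0
sum1 = 1183 * w1
print(f"w0={w0:.4f}, w1={w1:.4f}")
print(f"total weight label 0: {sum0:.2f}, label 1: {sum1:.2f}")
```

Both classes end up with the same total weight, so the rare churn rows are not drowned out by the majority class during training.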


In [46]:
print(f"### CLASSIFIERS")

def confuse(df_test_pred):
    n00 = df_test_pred.where((F.col("label")==0)&(F.col("prediction")==0)).count()
    n01 = df_test_pred.where((F.col("label")==0)&(F.col("prediction")==1)).count()
    n10 = df_test_pred.where((F.col("label")==1)&(F.col("prediction")==0)).count()
    n11 = df_test_pred.where((F.col("label")==1)&(F.col("prediction")==1)).count()
    s00 = "{:5d}".format(n00)
    s01 = "{:5d}".format(n01)
    s10 = "{:5d}".format(n10)
    s11 = "{:5d}".format(n11)
    print(f"                  ")
    print(f" Confusion Matrix: ")
    print(f"                  ")
    print(f"     | prediction| ")
    print(f"     |   0 |  1  | ")
    print(f" ----+-----+-----+ ")
    print(f" l 0 |{s00}|{s01}| ")
    print(f" b --+-----+-----+ ")
    print(f" l 1 |{s10}|{s11}| ")
    print(f" ----+-----+-----+ ")
    print(f"                   ")
    TP = n11
    TN = n00
    FP = n01
    FN = n10
    accuracy = 0
    if TP+TN+FP+FN!=0:
        accuracy = (TP+TN)/(TP+TN+FP+FN)
    precision = 0
    if TP+FP!=0:
        precision = TP/(TP+FP)
    recall = 0
    if TP+FN!=0:
        recall = TP/(TP+FN)
    f1 = 0
    if precision+recall!=0:
        f1 = 2*precision*recall/(precision+recall)
    print(f"CALC")
    print(f"  accuraccy: {accuracy}")
    print(f"  precision: {precision}")
    print(f"  recall:    {recall}")
    print(f"  f1:        {f1}")
    # https://towardsdatascience.com/matthews-correlation-coefficient-when-to-use-it-and-when-to-avoid-it-310b3c923f7e
    mcc = -9
    nenn = (TN+FN)*(FP+TP)*(TN+FP)*(FN+TP)
    if nenn!=0:   
        mcc = (TN*TP-FP*FN)/math.sqrt(nenn)
    print(f"  mcc:       {mcc}")
    return (accuracy, precision, recall, f1)
    
    
def hyper_tune_rf(num_tree_values, max_depth_values):
    best_f1 = -1
    best_model = None
    best_model_name = "?"
    for num_trees in num_tree_values:
        for max_depth in max_depth_values:
            model_name = f"rf_{num_trees}_{max_depth}"
            rf = RandomForestClassifier(featuresCol="features", numTrees=num_trees, maxDepth=max_depth, weightCol = "weight", seed=42)
            rf_model = rf.fit(df_train)
            predict_test  = rf_model.transform(df_test)
            accuracy, precision, recall, f1 = confuse(predict_test)
            print(f"  {model_name}: f1 {f1}")
            if f1 > best_f1:
                best_f1 = f1
                best_model = rf_model
                best_model_name = model_name
    print(f"best f1 {best_f1} for {best_model_name}")
    return (best_model, best_f1, best_model_name)


def hyper_tune_lr(max_iters, reg_params, elastic_net_params):
    # https://towardsdatascience.com/beginners-guide-to-linear-regression-with-pyspark-bfc39b45a9e9
    evaluator = RegressionEvaluator(predictionCol="prediction_orig", labelCol="label", metricName="rmse") 
    
    best_err = 9999
    best_model = None
    best_model_name = "?"
    for  max_iter in  max_iters:
        for reg_param in reg_params:
            for elastic_net_param in elastic_net_params:
                model_name = f"lr_{max_iter}_{reg_param}_{elastic_net_param}"
                lr = LinearRegression(featuresCol="features", maxIter= max_iter, regParam=reg_param, elasticNetParam=elastic_net_param)
                model = lr.fit(df_train)
                predict_test  = model.transform(df_test)
                predict_test = predict_test.withColumnRenamed("prediction", "prediction_orig")
                err = evaluator.evaluate(predict_test)
                print(f"err: {err}")
                thr = 0.15
                predict_test = predict_test.withColumn("prediction", F.when(F.col("prediction_orig")>=thr,1).otherwise(0))
                accuracy, precision, recall, f1 = confuse(predict_test)
                print(f"  {model_name}: f1 {f1}")
                if err < best_err:
                    best_err = err
                    best_model = model
                    best_model_name = model_name
    print(f"best err {best_err} for {best_model_name}")
    return (best_model, best_err, best_model_name)


def hyper_tune_dt(max_depths, max_bins_list):
    best_f1 = -1
    best_model = None
    best_model_name = "?"
    for  max_depth in max_depths:
        for max_bins in max_bins_list:
            model_name = f"dt_{max_depth}_{max_bins}"
            dt = DecisionTreeClassifier(featuresCol="features", maxDepth=max_depth, maxBins=max_bins)
            model = dt.fit(df_train)
            predict_test  = model.transform(df_test)
            accuracy, precision, recall, f1 = confuse(predict_test)
            print(f"  {model_name}: f1 {f1}")
            if f1 > best_f1:
                best_f1 = f1
                best_model = model
                best_model_name = model_name
    print(f"best f1 {best_f1} for {best_model_name}")
    return (best_model, best_f1, best_model_name)


def hyper_tune_sv(max_iters, reg_params):
    best_f1 = -1
    best_model = None
    best_model_name = "?"
    for  max_iter in max_iters:
        for reg_param in reg_params:
            model_name = f"svm_{max_iter}_{reg_param}"
            lsvc = LinearSVC(featuresCol="features", maxIter=max_iter, regParam=reg_param)
            model = lsvc.fit(df_train)
            predict_test  = model.transform(df_test)
            accuracy, precision, recall, f1 = confuse(predict_test)
            print(f"  {model_name}: f1 {f1}")
            if f1 > best_f1:
                best_f1 = f1
                best_model = model
                best_model_name = model_name
    print(f"best f1 {best_f1} for {best_model_name}")
    return (best_model, best_f1, best_model_name)
    

### CLASSIFIERS


In [47]:
## Fit scaler to train dataset
#scaler = MaxAbsScaler().setInputCol('features').setOutputCol('scaled_features')
#df_train = df_train.drop("scaled_features")
#scaler_model = scaler.fit(df_train)
## Scale train and test features
#df_train = scaler_model.transform(df_train)
#df_test = df_test.drop("scaled_features")
#df_test = scaler_model.transform(df_test)


## Train Models

Different Models can now be trained and validated against the test set.

In [49]:
model, f1, model_name = hyper_tune_rf([100], [5])  

                  
 Confusion Matrix: 
                  
     | prediction| 
     |   0 |  1  | 
 ----+-----+-----+ 
 l 0 | 1762|  890| 
 b --+-----+-----+ 
 l 1 |  164|  322| 
 ----+-----+-----+ 
                   
CALC
  accuraccy: 0.6641172721478649
  precision: 0.26567656765676567
  recall:    0.6625514403292181
  f1:        0.3792697290930506
  mcc:       0.24294854801598947
  rf_100_5: f1 0.3792697290930506
best f1 0.3792697290930506 for rf_100_5
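
The metrics printed above can be re-derived from the four cells of the confusion matrix alone. The following standalone sketch (plain Python, no Spark) uses the same formulas as `confuse`; the function name `metrics_from_confusion` is an assumption made for this illustration.

```python
import math


def metrics_from_confusion(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denominator = (tn + fn) * (fp + tp) * (tn + fp) * (fn + tp)
    # confuse() reports the sentinel -9 when MCC is undefined; this sketch uses 0.0
    mcc = (tn * tp - fp * fn) / math.sqrt(denominator) if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# counts taken from the confusion matrix of rf_100_5 above
m = metrics_from_confusion(tn=1762, fp=890, fn=164, tp=322)
for name, value in m.items():
    print(f"  {name}: {value}")
```

Recall (about 0.66) is much higher than precision (about 0.27): the model finds two thirds of the churners, but roughly three out of four flagged users would not have churned.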


## Save the model

The model is stored and can be loaded later to make predictions on newly collected data.

In [50]:
# -----------------

print(f"### SAVE MODEL {model_name} {f1*100}")
model_url = f'{MODEL_URL}_{model_name}_f1val{round(f1,3)}'
model.write().overwrite().save(model_url)
print(f"model saved to {model_url}")



### SAVE MODEL rf_100_5 37.926972909305064
model saved to s3a://udacity-dsnd/sparkify/output/05-model-sparkify_event_data_rf_100_5_f1val0.379


## Feature Importance

Which features were selected to make the predictions?

In [51]:
featimp = model.featureImportances
nameimp = {}
for i in range(len(featimp)):
    nameimp[featureCols[i]] = featimp[i]
sorted(nameimp.items(), key=lambda x:-x[1])

[('ohn_session_hours', 0.14584920440860216),
 ('ohn_session_start', 0.1071155057665855),
 ('nhn_session_hours', 0.07361439347198338),
 ('paid', 0.0664014373321682),
 ('ohn_pg_downgrade', 0.05502855242220322),
 ('ohn_pg_about', 0.05126235028010777),
 ('nhn_session_start', 0.03257332015388863),
 ('ohn_pg_login', 0.031356496574380596),
 ('ohn_pg_thumbs_up', 0.02738591484059622),
 ('ohn_pg_settings', 0.02442703365003172),
 ('nhn_pg_downgrade', 0.02312319305938885),
 ('nhn_pg_settings', 0.018972942112722876),
 ('ohn_pg_logout', 0.017614706100878206),
 ('nhn_pg_add_friend', 0.01682085101030543),
 ('nhn_pg_roll_advert', 0.01648004111998699),
 ('nhn_pg_nextsong', 0.015579254523506893),
 ('ohn_pg_home', 0.01510688537047101),
 ('nhn_pg_thumbs_down', 0.01503028877306283),
 ('nhn_pg_help', 0.014747180288166345),
 ('nhn_status_307', 0.01407328272069372),
 ('ohn_pg_add_friend', 0.012779985419448144),
 ('ohn_num_events', 0.012645668614730378),
 ('nhn_num_events', 0.012642013781147104),
 ('ohn_pg_save
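
To see how much of the importance comes from the old history, the new history and the constant user columns, the values can be summed per column prefix. The sketch below is plain Python; the function name `importance_by_group` and the group name "user" are assumptions, and only the first five entries of the (truncated) list above are used, so the totals are partial.

```python
from collections import defaultdict


def importance_by_group(pairs):
    """
    Sum feature importances per group:
    "ohn" (old history), "nhn" (new history), "user" (everything else).
    """
    totals = defaultdict(float)
    for name, value in pairs:
        if name.startswith(("nhn_", "ohn_")):
            group = name.split("_", 1)[0]
        else:
            group = "user"
        totals[group] += value
    return dict(totals)


# first five entries of the sorted importance list printed above
top = [
    ("ohn_session_hours", 0.14584920440860216),
    ("ohn_session_start", 0.1071155057665855),
    ("nhn_session_hours", 0.07361439347198338),
    ("paid", 0.0664014373321682),
    ("ohn_pg_downgrade", 0.05502855242220322),
]
groups = importance_by_group(top)
for group, total in sorted(groups.items(), key=lambda item: -item[1]):
    print(f"{group}: {total:.3f}")
```

Applied to the full `nameimp` dictionary from the cell above (via `nameimp.items()`), the same function would give the complete split.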

The selected features make sense: session_hours in the old history and session_hours in the new history are among the top features.  
"paid" is also useful, because a downgrade is only possible if the user has a "paid" account.  
"downgrade" is obviously a good choice   :-)

# Comments on the Result

I really struggled with the poor results of my model.  

I tried multiple optimizations, but in every case there is either a large number of False-Positives,  
which is bad for precision, or too many False-Negatives, which is bad for recall.

Looking at the current confusion matrix

```
     | prediction| 
     |   0 |  1  | 
 ----+-----+-----+ 
 l 0 | 1762|  890| 
 b --+-----+-----+ 
 l 1 |  164|  322| 
 ----+-----+-----+ 
```

we can say the following:

 * "66% of the users who will churn are detected"
 
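So, if our retention offer convinces them to stay, it would be helpful.  
On the other hand, we have almost 3 times as many False-Positives as True-Positives (890 vs. 322).  

With respect to cost, only about 1/4 of the money invested in retaining users for Sparkify
reaches the right people (precision ≈ 0.27). 

So, the calculation is as follows (2/3 is the recall; 2.5 is the number of flagged users, 890 + 322 = 1212, divided by the 486 actual churners):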

```
CHURNERS * 2/3 * BINDING_SUCCESS --> Users kept in Sparkify (paid or free)
COST = COST_OF_OFFER * CHURNERS * 2.5
```
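
The two formulas can be turned into a small calculator. This is a rough sketch in plain Python: the function name `retention_estimate`, the binding success rate of 0.5 and the offer cost of 1.0 are purely hypothetical; recall and precision are taken from the confusion matrix above.

```python
def retention_estimate(churners, binding_success, cost_of_offer,
                       recall=322 / 486, precision=322 / 1212):
    """
    churners:        number of users who would churn without an offer
    binding_success: fraction of offered churners who stay (hypothetical input)
    cost_of_offer:   cost of one offer (hypothetical input)
    Returns (users_kept, total_cost).
    """
    detected = churners * recall          # churners who get flagged by the model
    users_kept = detected * binding_success
    offers = detected / precision         # all flagged users, including False-Positives
    total_cost = cost_of_offer * offers
    return users_kept, total_cost


# test-set numbers: 486 churners; success rate and offer cost are made up
kept, cost = retention_estimate(churners=486, binding_success=0.5, cost_of_offer=1.0)
print(f"users kept: {kept:.0f}, cost: {cost:.0f}, cost per kept user: {cost / kept:.2f}")
```

The ratio `recall / precision` is the factor 2.5 from the formula above: for every real churner, about 2.5 offers are sent out.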







In [2]:
print("### STOP SPARK SESSION")
spark.stop()  

### STOP SPARK SESSION
