### Diseño de Prueba del Modelo

Usaremos varios tipos de modelos distintos para poder predecir la probabilidad de cancelación de la suscripción. Usaremos un random forest, gradient Boosting Machine. Para ello usamos la librería de H2o.

La evaluación de cualquiera de nuestros modelos es sometida a un conjunto de prueba. Los conjuntos de entrenamieno y prueba se construyeron a partir de la base de datos y se respetó la partición que aparece en los lineamientos del concurso. De hecho, ellos divieron a partir de Abril 2017, antes de esta fechha se consideraron datos de entrenamiento y después datos de prueba. Los datos de validación se construyeron de forma aleatoria la prueba de entrenamiento en una proporción del 20%.

Usaremos como medida de error la Log Loss, pues estamos en un problema de clasificación binaria, que está dada por la siguiente fórmula:


\begin{align}
\log loss = - \frac{1}{N} \sum_{i=1}^n y_i \log(p_i) + (1-y_i) \log(1 - p_i)
\end{align}

También usamos el AUC como medida de error, que es el área bajo la curva ROC. Esto nos da un resumen de qué tan bueno es el predictor sin importar donde ponemos el punto de corte.

In [1]:
import warnings
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch
import numpy as np
import pandas as pd

Cargamos los datos de entrenamiento de la carpeta de AWS.

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_152-release"; OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12); OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)
  Starting server from /home/lorena/anaconda3/envs/for_spark/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpqd0cxxak
  JVM stdout: /tmp/tmpqd0cxxak/h2o_lorena_started_from_python.out
  JVM stderr: /tmp/tmpqd0cxxak/h2o_lorena_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


0,1
H2O cluster uptime:,02 secs
H2O cluster timezone:,America/Mexico_City
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.0.2
H2O cluster version age:,"28 days, 9 hours and 24 minutes"
H2O cluster name:,H2O_from_python_lorena_kgcph4
H2O cluster total nodes:,1
H2O cluster free memory:,1.688 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


In [3]:
#spotify = h2o.import_file("s3://proyectomineria/data/consolidated_train_table/part-00000-acf23e82-8c41-458f-9399-57e2a260de4b-c000.csv")
#spotify = h2o.import_file("/home/toto/Desktop/3er_Semestre/Mineria/Proyecto/kkbox_churn_prediction/data/part-00000-acf23e82-8c41-458f-9399-57e2a260de4b-c000.csv")
spotify = h2o.import_file("/home/lorena/Documents/mineria/proyecto/part-00000-acf23e82-8c41-458f-9399-57e2a260de4b-c000(1).csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


Convertimos el tipo de variables a factores y mostramos las primeras líneas del conjunto de entrenamiento.

In [4]:
spotify["is_churn"] = spotify["is_churn"].asfactor()
spotify["city"] = spotify["city"].asfactor()
spotify["gender"] = spotify["gender"].asfactor()

In [5]:
spotify

msno,is_churn,city,bd,gender,registered_via,registered_init_time,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0,1,0,,7,2014-07-14T00:00:00.000Z,1483980000.0,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,41.0,30,149.0,149.0,1,1481620000.0,,0.0
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0,15,31,male,9,2006-06-03T00:00:00.000Z,1487550000.0,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,34.0,30,149.0,149.0,1,1483110000.0,,0.0
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0,9,31,male,9,2004-03-30T00:00:00.000Z,1487640000.0,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,34.0,30,149.0,149.0,1,1483110000.0,,0.0
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0,15,29,male,9,2008-03-22T00:00:00.000Z,1487510000.0,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,34.0,30,149.0,149.0,1,1483110000.0,,0.0
+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=,1,13,29,female,3,2012-06-12T00:00:00.000Z,1486150000.0,2.4,0.866667,0.866667,1.33333,18.2,20.5333,4813.96,32.0,410,1788.0,1788.0,0,1452730000.0,,0.0
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0,5,24,female,9,2014-03-20T00:00:00.000Z,1487390000.0,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,23.0,30,149.0,149.0,1,1480500000.0,,0.0
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0,13,32,male,3,2015-03-16T00:00:00.000Z,1487430000.0,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,37.0,30,149.0,149.0,1,1482570000.0,,0.0
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0,5,0,,3,2013-02-27T00:00:00.000Z,1486560000.0,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,40.0,30,149.0,149.0,1,1483020000.0,,0.2
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0,6,31,female,9,2008-04-17T00:00:00.000Z,1487640000.0,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,36.4,30,167.6,167.6,1,1481310000.0,,0.2
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,0,1,0,,7,2012-12-17T00:00:00.000Z,1487190000.0,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,41.0,30,99.0,99.0,1,1479380000.0,,0.0




Especificamos cuáles son los predictores y cuál es la variable de respuesta.

## Ingeniería de Características
En esta sección agregamos variables manualmente, considerando po el contexto cuales variables tienen sentido y son interpretables. Podemos agregar una variable de descuentos a partir de cuánto cuesta el producto menos la cantidad que se pagó realmente. También agregamos una variable binaria si el cliente tiene o no descuento.

In [6]:
spotify["discount"] = spotify["plan_list_price"] - spotify["actual_amount_paid"]
spotify["is_discount"]=spotify["discount"]>0
spotify["amount_per_day"]=spotify["plan_list_price"]/spotify["payment_plan_days"]
spotify["bd"]=(spotify["bd"]<=0 or spotify["bd"]>100).ifelse(np.nan,spotify["bd"])

In [7]:
spotify

msno,is_churn,city,bd,gender,registered_via,registered_init_time,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,discount,is_discount,amount_per_day
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0,1,,,7,2014-07-14T00:00:00.000Z,1483980000.0,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,41.0,30,149.0,149.0,1,1481620000.0,,0.0,0,0,4.96667
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0,15,31.0,male,9,2006-06-03T00:00:00.000Z,1487550000.0,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,34.0,30,149.0,149.0,1,1483110000.0,,0.0,0,0,4.96667
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0,9,31.0,male,9,2004-03-30T00:00:00.000Z,1487640000.0,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,34.0,30,149.0,149.0,1,1483110000.0,,0.0,0,0,4.96667
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0,15,29.0,male,9,2008-03-22T00:00:00.000Z,1487510000.0,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,34.0,30,149.0,149.0,1,1483110000.0,,0.0,0,0,4.96667
+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=,1,13,29.0,female,3,2012-06-12T00:00:00.000Z,1486150000.0,2.4,0.866667,0.866667,1.33333,18.2,20.5333,4813.96,32.0,410,1788.0,1788.0,0,1452730000.0,,0.0,0,0,4.36098
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0,5,24.0,female,9,2014-03-20T00:00:00.000Z,1487390000.0,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,23.0,30,149.0,149.0,1,1480500000.0,,0.0,0,0,4.96667
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0,13,32.0,male,3,2015-03-16T00:00:00.000Z,1487430000.0,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,37.0,30,149.0,149.0,1,1482570000.0,,0.0,0,0,4.96667
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0,5,,,3,2013-02-27T00:00:00.000Z,1486560000.0,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,40.0,30,149.0,149.0,1,1483020000.0,,0.2,0,0,4.96667
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0,6,31.0,female,9,2008-04-17T00:00:00.000Z,1487640000.0,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,36.4,30,167.6,167.6,1,1481310000.0,,0.2,0,0,5.58667
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,0,1,,,7,2012-12-17T00:00:00.000Z,1487190000.0,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,41.0,30,99.0,99.0,1,1479380000.0,,0.0,0,0,3.3




In [8]:
predictors = ["city", "bd", "gender", "registered_via", "registered_init_time", "date", "num_25", "num_50", "num_75", "num_985", "num_100", "num_unq", "total_secs","payment_method_id", "payment_plan_days", "plan_list_price", "actual_amount_paid", "is_auto_renew", "transaction_date", "membership_expire_date","is_cancel"]#, "discount", "is_discount", "amount_per_day"] 
predictors = ["city", "bd", "gender", "registered_via", "registered_init_time","date","num_25", "num_50", "num_75", "num_985", "num_100", "num_unq", "total_secs","payment_method_id", "payment_plan_days", "plan_list_price", "actual_amount_paid", "is_auto_renew", "is_cancel","discount", "is_discount", "amount_per_day"]
response = "is_churn"

Hacemos la separación para el conjunto de entrenamiento y validación, 80% y 20% respectivamente.

In [9]:
train, valid = spotify.split_frame(ratios = [0.8], seed=1234)

## Modelos

### Gradient Boosting Machine

La idea de gradient boosting (GBM) es replicar la idea del residual en regresión, y usar
árboles de regresión. Es una técnica directa de un método ensamblado. La heurística nos dice que podemos obtener resultados muy buenos a partir de muchas aproximaciones burdas. Se construyen árboles en paralelo, y en cierto sentido este modelo aprende de los errores de los árboles generados en iteraciones previas.

A continuación mostramos la fase de entrenamiento usando la librería de H2o. Notese que usamos el conjunto de validación.

In [10]:
bin_num = [8,16,32,64,128,256,512,1024,2048,4096]
label = ["8","16","32","64","128","256","512","1024","2048","4096"]

In [11]:
df=pd.DataFrame(index=range(1,len(bin_num)),columns=['bin_num','training_score','validation_score'])
for key, num in enumerate(bin_num):
    spotify_gbm = H2OGradientBoostingEstimator(nbins_cats = num, seed=1234)
    spotify_gbm.train(x=predictors, y=response, training_frame=train, validation_frame=valid)
    df.loc[key]=[num, spotify_gbm.auc(train=True),spotify_gbm.auc(valid=True)]
    #print(label[key], 'training score', spotify_gbm.auc(train=True))
    #print(label[key], 'validation score', spotify_gbm.auc(valid=True))

gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


Se muestra el AUC, esto nos dice que nuestro predictor es extremadamente bueno. Habrá que verificar con los datos de prueba para validar que en realidad no se sobreajustó. 

In [12]:
print(label[key], 'training score', spotify_gbm.auc(train=True))
print(label[key], 'validation score', spotify_gbm.auc(valid=True))

4096 training score 0.9733425312532167
4096 validation score 0.9671266221288971


In [13]:
print(df[df['training_score']==df['training_score'].max()])
print(df[df['validation_score']==df['validation_score'].max()])

   bin_num  training_score  validation_score
9   4096.0        0.973343          0.967127
   bin_num  training_score  validation_score
6    512.0        0.970728          0.968103


In [14]:
df

Unnamed: 0,bin_num,training_score,validation_score
1,16.0,0.968807,0.967226
2,32.0,0.969142,0.967202
3,64.0,0.969056,0.967114
4,128.0,0.969117,0.967201
5,256.0,0.969795,0.967771
6,512.0,0.970728,0.968103
7,1024.0,0.970871,0.968027
8,2048.0,0.970247,0.967558
9,4096.0,0.973343,0.967127
0,8.0,0.969089,0.967691


Todos los modelos tienen un desempeño muy similar, el modelo con el mejor desempeño en el conjunto de entrenamiento es el último, pero su desempeño en el conjunto de validación es menor, lo cual quiere decir que está empezando a sobre ajustar.

Aquí se hace una predicción de los mismos datos con los que se ajustó.

In [15]:
final_gbm_predictions = spotify_gbm.predict(valid[1:])

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [16]:
final_gbm_predictions[:]

predict,p0,p1
0,0.996403,0.00359705
0,0.997364,0.00263618
0,0.997204,0.00279572
0,0.97096,0.0290397
1,0.130443,0.869557
0,0.997338,0.00266152
0,0.903726,0.0962741
0,0.905332,0.0946678
0,0.996374,0.00362579
0,0.99707,0.00292988




### Gradient Boosting Machine reduced Model

Se hace exactamente el mismo proceso usando todas las covariables que se extrajeron a partir de la tabla de logs.

In [17]:
spotify_logs=spotify[:,["msno","num_25","num_50","num_75","num_985","num_100","num_unq","total_secs","is_churn"]]

In [18]:
spotify_logs

msno,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,is_churn
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,0
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,0
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,0
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,0
+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=,2.4,0.866667,0.866667,1.33333,18.2,20.5333,4813.96,1
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,0
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,0
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,0
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,0
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,0




In [19]:
train_logs, valid_logs = spotify_logs.split_frame(ratios = [0.8], seed=1234)

In [20]:
bin_num1 = [8,16,32,64,128,256,512,1024,2048,4096]
label1 = ["8","16","32","64","128","256","512","1024","2048","4096"]

In [21]:
log_predictors = ["num_25", "num_50", "num_75", "num_985", "num_100", "num_unq", "total_secs"] 

In [22]:
df_logs=pd.DataFrame(index=range(1,len(bin_num)),columns=['bin_num','training_score','validation_score'])
for key, num in enumerate(bin_num1):
    spotify_gbm1 = H2OGradientBoostingEstimator(nbins_cats = num, seed=1234)
    spotify_gbm1.train(x=log_predictors, y=response, training_frame=train_logs, validation_frame=valid_logs)
    df_logs.loc[key]=[num, spotify_gbm.auc(train=True),spotify_gbm.auc(valid=True)]

gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [23]:
df_logs

Unnamed: 0,bin_num,training_score,validation_score
1,16.0,0.973343,0.967127
2,32.0,0.973343,0.967127
3,64.0,0.973343,0.967127
4,128.0,0.973343,0.967127
5,256.0,0.973343,0.967127
6,512.0,0.973343,0.967127
7,1024.0,0.973343,0.967127
8,2048.0,0.973343,0.967127
9,4096.0,0.973343,0.967127
0,8.0,0.973343,0.967127


In [24]:
model_path = h2o.save_model(model=spotify_gbm1, path="/home/lorena/Documents/mineria/proyecto", force=True)
model_path

'/home/lorena/Documents/mineria/proyecto/GBM_model_python_1545345036982_833'

Este modelo tiene casi el mismo desempeño que el anterior. Lo cual sugiere que las variables que más explican son las que se encuentran en la tabla de user_log

### Distributed Random Forest

El segundo modelo con el que se evaluó fue un modelo de Random Forest (DRF). Es un método muy poderoso para la clasificacion.

Distributed Random Forest (DRF) is a powerful classification and regression tool. When given a set of data, DRF generates a forest of classification or regression trees, rather than a single classification or regression tree. Each of these trees is a weak learner built on a subset of rows and columns. More trees will reduce the variance. Both classification and regression take the average prediction over all of their trees to make a final prediction, whether predicting for a class or numeric value.

In [25]:
rf_v1 = H2ORandomForestEstimator(
    model_id="rf_covType_v1",
    ntrees=200,
    stopping_rounds=2,
    score_each_iteration=True,
seed=1000000)

In [26]:
rf_v1.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [27]:
rf_v1.score_history()

Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
0,,2018-12-20 16:33:27,0.009 sec,0.0,,,,,,,,,,,,
1,,2018-12-20 16:33:28,1.135 sec,1.0,0.239308,1.392855,0.758461,0.175108,8.422703,0.066998,0.239808,1.387094,0.765313,0.168558,8.319327,0.067463
2,,2018-12-20 16:33:29,2.045 sec,2.0,0.228797,1.257147,0.791033,0.22608,8.482562,0.061064,0.207022,0.596827,0.870207,0.371741,10.280253,0.062151
3,,2018-12-20 16:33:30,2.744 sec,3.0,0.222285,1.065798,0.815689,0.274482,8.669392,0.06405,0.196336,0.331332,0.912732,0.489462,11.524974,0.067951
4,,2018-12-20 16:33:30,3.412 sec,4.0,0.2168,0.927872,0.827185,0.30691,9.019514,0.06258,0.190387,0.23266,0.930394,0.551277,12.056767,0.053354
5,,2018-12-20 16:33:31,4.145 sec,5.0,0.211442,0.783892,0.84477,0.350278,9.262347,0.063398,0.186743,0.186505,0.940567,0.592254,12.152203,0.053248
6,,2018-12-20 16:33:32,4.800 sec,6.0,0.207293,0.660039,0.860471,0.385117,9.575216,0.0621,0.184399,0.15843,0.94871,0.617363,12.130608,0.049126
7,,2018-12-20 16:33:32,5.346 sec,7.0,0.203077,0.551374,0.874591,0.418408,10.044987,0.061021,0.182889,0.142957,0.952414,0.637304,12.699154,0.050379
8,,2018-12-20 16:33:33,5.890 sec,8.0,0.200276,0.470683,0.88437,0.443479,10.49955,0.059696,0.181894,0.133475,0.952919,0.648875,12.992669,0.051165
9,,2018-12-20 16:33:33,6.428 sec,9.0,0.197254,0.408074,0.895545,0.472718,10.630107,0.058958,0.180842,0.122769,0.956268,0.660479,13.2818,0.046895


Se generó un segundo modelo cambiando los hiperparámetros para poder compararlos.

In [28]:
rf_v2 = H2ORandomForestEstimator(
    model_id="rf_covType_v2",
    ntrees=200,
    max_depth=30,
    stopping_rounds=2,
    stopping_tolerance=0.01,
    score_each_iteration=True,
seed=1234)

In [29]:
rf_v2.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [30]:
final_rf_predictions = rf_v2.predict(valid[1:])

drf prediction progress: |████████████████████████████████████████████████| 100%


In [31]:
final_rf_predictions

predict,p0,p1
0,1.0,0.0
0,0.996959,0.00304055
0,0.991593,0.00840696
0,0.875933,0.124067
1,0.17297,0.82703
0,0.996944,0.00305576
0,0.945153,0.0548472
0,0.797175,0.202825
0,0.981544,0.0184565
0,0.996879,0.0031214




In [32]:
print('training score', rf_v1.auc(train=True))
print('validation score', rf_v1.auc(valid=True))

training score 0.9605057272906306
validation score 0.9667065862521441


In [33]:
print('training score', rf_v2.auc(train=True))
print('validation score', rf_v2.auc(valid=True))

training score 0.9376495535743474
validation score 0.9597401958566787


In [34]:
print('training score', rf_v2.logloss(train=True))
print('validation score', rf_v2.logloss(valid=True))

training score 0.18555483352663613
validation score 0.11890341834114056


### Ajuste de hiperparámetros óptimo.

Nos quedaremos con el modelo de Gradient Boosting porque tiene menor Log Loss. En esta sección se ajustan los hiperámetros del modelo. Será imporante usar los datos de validación.

In [35]:
print('validation score RF', rf_v1.logloss(valid=True))
print('validation score GBM', spotify_gbm.logloss(valid=True))

validation score RF 0.10340326719293488
validation score GBM 0.10300302139168878


In [36]:
gbm_params1 = {'learn_rate': [0.01, 0.1],
                'max_depth': [3, 5, 9]}#,
                #'sample_rate': [0.8, 1.0],
                #'col_sample_rate': [0.2, 0.5, 1.0]}
gbm_params2 = {'learn_rate': [i * 0.01 for i in range(1, 11)],
                'max_depth': [i for i in range(2, 11)]},
                #'sample_rate': [i * 0.1 for i in range(5, 11)]}#,
               # 'col_sample_rate': [i * 0.1 for i in range(1, 11)]}

gbm_grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                          grid_id='gbm_grid',
                          hyper_params=gbm_params1)#,
                          #search_criteria=search_criteria)


In [37]:
gbm_grid.train(x=predictors, y=response, training_frame=train, validation_frame=valid, seed=42)

gbm Grid Build progress: |████████████████████████████████████████████████| 100%


In [38]:
# Get los resultados del grid, en orden de mejor a peor por AUC
gbm_gridperf = gbm_grid.get_grid(sort_by='auc', decreasing=True)
print(gbm_gridperf)

# Escojes el mejor modelo (usando el AUC de valicación)
best_gbm = gbm_gridperf.models[0]

    learn_rate max_depth         model_ids                 auc
0          0.1         9  gbm_grid_model_6  0.9685849185877624
1          0.1         5  gbm_grid_model_4  0.9680274278698804
2         0.01         9  gbm_grid_model_5  0.9655053167310022
3         0.01         5  gbm_grid_model_3   0.962887734574687
4          0.1         3  gbm_grid_model_2  0.9625596744183935
5         0.01         3  gbm_grid_model_1  0.9521618430104382



## Evaluación

Summarize assessment results in terms of business success criteria, including a final statement regarding
whether the project already meets the initial business objectives.




### Reentrenamiento final de modelo con datos de entrenamiento y prueba con hiperparámetros optimizados.
Se mejora el ajuste de hiperparámetros para este modelo usando todos los datos.

In [39]:
#spotify_test = h2o.import_file("s3://proyectomineria/data/resumen_final_test/part-00000-326c4568-e87c-4af5-9c77-6ee2aa5d17ae-c000.csv")
#spotify_test = h2o.import_file("/home/toto/Desktop/3er_Semestre/Mineria/Proyecto/kkbox_churn_prediction/data/part-00000-326c4568-e87c-4af5-9c77-6ee2aa5d17ae-c000.csv")
spotify_test = h2o.import_file("/home/lorena/Documents/mineria/proyecto/testR.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [40]:
spotify_test

avg_num_unq,date,bd,payment_plan_days,city,avg_num_50,registered_init_time,msno,avg_num_75,plan_list_price,actual_amount_paid,avg_num_25,avg_num_100,membership_expire_date,is_churn,is_auto_renew,payment_method_id,registered_via,avg_num_985,gender,total_secs,is_cancel,transaction_date
13.0667,1483980000.0,0,30,1,0.933333,2014-07-14T00:00:00.000Z,++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0.733333,149,149,5.73333,6.46667,,0,1.0,41,7,0.666667,,1978.66,0,1485600000.0
61.1333,1487550000.0,31,30,15,1.26667,2006-06-03T00:00:00.000Z,+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,1.4,149,149,29.1333,33.6,,0,1.0,34,9,1.26667,male,9395.27,0,1487030000.0
19.8667,1487640000.0,31,30,9,1.86667,2004-03-30T00:00:00.000Z,+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,1.06667,149,149,12.4667,67.8,,0,1.0,34,9,3.26667,male,17219.0,0,1487030000.0
27.8,1487510000.0,29,30,15,0.6,2008-03-22T00:00:00.000Z,+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0.333333,149,149,2.33333,33.0,,0,1.0,34,9,0.666667,male,8571.42,0,1487030000.0
30.9333,1487390000.0,24,30,5,4.4,2014-03-20T00:00:00.000Z,+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0.933333,149,149,10.7333,19.6,,0,1.0,23,9,0.533333,female,5353.78,0,1484480000.0
46.6667,1487430000.0,32,30,13,1.6,2015-03-16T00:00:00.000Z,+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,1.66667,149,149,12.3333,40.6,,0,1.0,37,3,0.933333,male,10597.8,0,1486560000.0
20.6667,1486560000.0,0,30,5,0.4,2013-02-27T00:00:00.000Z,+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0.333333,149,149,1.0,21.5333,,0,1.0,40,3,0.133333,,5413.59,0,1486470000.0
15.0,1487640000.0,31,30,6,0.466667,2008-04-17T00:00:00.000Z,+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0.333333,180,180,1.26667,281.667,,0,1.0,36,9,1.86667,female,48663.0,0,1484400000.0
18.6667,1486030000.0,0,30,4,1.66667,2015-11-03T00:00:00.000Z,+2KZws+cYLzerLNA6dgCOpxKysRv4BQ8SiKtA0rV4QE=,0.866667,180,180,5.2,13.4667,,0,0.5,29,7,0.466667,,3910.75,0,1485000000.0
23.2667,1487190000.0,0,30,1,2.73333,2012-12-17T00:00:00.000Z,+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,1.33333,99,99,8.93333,11.4,,0,1.0,41,7,3.0,,3921.46,0,1484910000.0




In [41]:
spotify_test["num_25"] = spotify_test["avg_num_25"]
spotify_test["num_50"] = spotify_test["avg_num_50"]
spotify_test["num_75"] = spotify_test["avg_num_75"]
spotify_test["num_985"] = spotify_test["avg_num_985"]
spotify_test["num_100"] = spotify_test["avg_num_100"]
spotify_test["num_unq"] = spotify_test["avg_num_unq"]
spotify_test = spotify_test[:, ["msno","is_churn","city","bd","gender","registered_via","registered_init_time","date","num_25","num_50","num_75","num_985","num_100","num_unq","total_secs","payment_method_id","payment_plan_days","plan_list_price","actual_amount_paid","is_auto_renew","transaction_date","membership_expire_date","is_cancel"]]
spotify_test["is_churn"] = spotify_test["is_churn"].asfactor()
spotify_test["city"] = spotify_test["city"].asfactor()
spotify_test["gender"] = spotify_test["gender"].asfactor()
spotify_test["discount"] = spotify_test["plan_list_price"] - spotify_test["actual_amount_paid"]
spotify_test["is_discount"]=spotify_test["discount"]>0
spotify_test["amount_per_day"]=spotify_test["plan_list_price"]/spotify_test["payment_plan_days"]
spotify_test["bd"]=(spotify_test["bd"]<=0 or spotify_test["bd"]>100).ifelse(np.nan,spotify_test["bd"])

In [42]:
spotify_test

msno,is_churn,city,bd,gender,registered_via,registered_init_time,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,discount,is_discount,amount_per_day
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0,1,,,7,2014-07-14T00:00:00.000Z,1483980000.0,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,41,30,149,149,1.0,1485600000.0,,0,0,0,4.96667
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0,15,31.0,male,9,2006-06-03T00:00:00.000Z,1487550000.0,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,34,30,149,149,1.0,1487030000.0,,0,0,0,4.96667
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0,9,31.0,male,9,2004-03-30T00:00:00.000Z,1487640000.0,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,34,30,149,149,1.0,1487030000.0,,0,0,0,4.96667
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0,15,29.0,male,9,2008-03-22T00:00:00.000Z,1487510000.0,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,34,30,149,149,1.0,1487030000.0,,0,0,0,4.96667
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0,5,24.0,female,9,2014-03-20T00:00:00.000Z,1487390000.0,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,23,30,149,149,1.0,1484480000.0,,0,0,0,4.96667
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0,13,32.0,male,3,2015-03-16T00:00:00.000Z,1487430000.0,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,37,30,149,149,1.0,1486560000.0,,0,0,0,4.96667
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0,5,,,3,2013-02-27T00:00:00.000Z,1486560000.0,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,40,30,149,149,1.0,1486470000.0,,0,0,0,4.96667
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0,6,31.0,female,9,2008-04-17T00:00:00.000Z,1487640000.0,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,36,30,180,180,1.0,1484400000.0,,0,0,0,6.0
+2KZws+cYLzerLNA6dgCOpxKysRv4BQ8SiKtA0rV4QE=,0,4,,,7,2015-11-03T00:00:00.000Z,1486030000.0,5.2,1.66667,0.866667,0.466667,13.4667,18.6667,3910.75,29,30,180,180,0.5,1485000000.0,,0,0,0,6.0
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,0,1,,,7,2012-12-17T00:00:00.000Z,1487190000.0,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,41,30,99,99,1.0,1484910000.0,,0,0,0,3.3




In [43]:
best_gbm.train(x=predictors, y=response, training_frame=spotify)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [44]:
final_gbm_predictions = best_gbm.predict(spotify_test[1:])

gbm prediction progress: |████████████████████████████████████████████████| 100%




In [45]:
prediciones=best_gbm.predict(spotify_test)

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [46]:
pred_df=prediciones.as_data_frame()
membs=spotify_test.as_data_frame().msno

In [47]:
result = pd.concat([membs,pred_df], axis=1, sort=False)

In [48]:
result.to_csv("test_predict_mochado.csv")

### Generación y carga de datos de prueba en Kaggle
Tuvimos pérdidas de datos en el conjunto de prueba. La tabla de prueba, en su forma original solo tiene una columna con ID's y es necesario construirla. Desafortunadamente no aparecen todas las covariantes en otras tablas por lo que se obtuvo una tabla llena de nulos y por esa razón no se sometió a Kaggle.

### Flask

In [49]:
model_path = h2o.save_model(model=best_gbm, path="/home/lorena/Documents/mineria/proyecto", force=True)

In [50]:
model_path

'/home/lorena/Documents/mineria/proyecto/GBM_model_python_1545345036982_1271'

### Comentarios Finales

La pregunta que queríamos contestar es ¿qué clientes no van a renovar el servicio de KKBox? ¿qué caracteristicas tienen los clientes que no van a renovar la suscripción?

La respuesta se puede dar viendo las variables más importantes del modelo que se utiliza para predecir.

In [51]:
best_gbm.varimp

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1545345036982_1271


ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.029064692507093592
RMSE: 0.17048370158784562
LogLoss: 0.09884709609538847
Mean Per-Class Error: 0.07938537538420998
AUC: 0.969611512930258
pr_auc: 0.7544750828606192
Gini: 0.9392230258605161
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.38188734074473263: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,215337.0,4857.0,0.0221,(4857.0/220194.0)
1,4791.0,10666.0,0.31,(4791.0/15457.0)
Total,220128.0,15523.0,0.0409,(9648.0/235651.0)


Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.3818873,0.6885733,198.0
max f2,0.1724276,0.7536267,287.0
max f0point5,0.5891555,0.7121277,122.0
max accuracy,0.5053373,0.9600935,156.0
max precision,0.9808936,1.0,0.0
max recall,0.0023121,1.0,399.0
max specificity,0.9808936,1.0,0.0
max absolute_mcc,0.3818873,0.6666639,198.0
max min_per_class_accuracy,0.1154268,0.9144936,319.0


Gains/Lift Table: Avg response rate:  6.56 %, avg score:  6.57 %



0,1,2,3,4,5,6,7,8,9,10,11,12,13
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100021,0.8182538,14.7345955,14.7345955,0.9664828,0.8780832,0.9664828,0.8780832,0.1473766,0.1473766,1373.4595480,1373.4595480
,2,0.0200678,0.7450159,13.2209812,13.9753878,0.8672007,0.7792134,0.9166843,0.8284915,0.1330789,0.2804555,1222.0981184,1297.5387802
,3,0.0300020,0.6652483,11.1883444,13.0525492,0.7338744,0.7063027,0.8561528,0.7880327,0.1111471,0.3916025,1018.8344389,1205.2549240
,4,0.0400083,0.5901687,9.7693292,12.2313960,0.6407973,0.6257495,0.8022910,0.7474447,0.0977551,0.4893576,876.9329184,1123.1395985
,5,0.0500019,0.5202065,8.6035592,11.5063212,0.5643312,0.5573867,0.7547314,0.7094589,0.0859805,0.5753380,760.3559165,1050.6321241
,6,0.1000038,0.2015305,5.0654726,8.2858969,0.3322583,0.3306238,0.5434949,0.5200413,0.2532833,0.8286213,406.5472580,728.5896911
,7,0.1505022,0.1099230,2.0165168,6.1823154,0.1322689,0.1373048,0.4055151,0.3916207,0.1018309,0.9304522,101.6516810,518.2315362
,8,0.2000204,0.0251886,0.8061124,4.8513530,0.0528751,0.0751197,0.3182136,0.3132660,0.0399172,0.9703694,-19.3887595,385.1352971
,9,0.3057869,0.0042500,0.1443572,3.2232820,0.0094688,0.0090893,0.2114240,0.2080564,0.0152682,0.9856376,-85.5642836,222.3282036



Scoring History: 


0,1,2,3,4,5,6,7,8,9
,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error
,2018-12-20 16:35:29,0.008 sec,0.0,0.2475689,0.2420866,0.5,0.0,1.0,0.9344072
,2018-12-20 16:35:30,0.330 sec,1.0,0.2325844,0.2010178,0.9550498,0.6396281,12.7009088,0.0532610
,2018-12-20 16:35:30,0.563 sec,2.0,0.2227437,0.1822136,0.9595423,0.6644571,12.9568553,0.0488943
,2018-12-20 16:35:30,0.765 sec,3.0,0.2147500,0.1689177,0.9611154,0.6767785,13.0816319,0.0508761
,2018-12-20 16:35:30,0.991 sec,4.0,0.2083623,0.1590817,0.9625334,0.6862189,13.0816319,0.0514150
,2018-12-20 16:35:31,1.238 sec,5.0,0.2030530,0.1512609,0.9630646,0.6921243,13.1148040,0.0463100
,2018-12-20 16:35:31,1.474 sec,6.0,0.1987282,0.1448235,0.9633168,0.6965331,13.1498168,0.0466877
,2018-12-20 16:35:31,1.699 sec,7.0,0.1951023,0.1395203,0.9643469,0.6995691,13.1756234,0.0461275
,2018-12-20 16:35:31,1.936 sec,8.0,0.1921164,0.1349763,0.9644754,0.7020741,13.7824644,0.0458814


Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
date,10552.8808594,1.0,0.3002550
is_auto_renew,9704.1933594,0.9195776,0.2761078
is_cancel,7791.2460938,0.7383051,0.2216798
amount_per_day,3439.7360840,0.3259523,0.0978688
plan_list_price,948.5032349,0.0898810,0.0269872
---,---,---,---
bd,9.9843407,0.0009461,0.0002841
num_985,8.7884912,0.0008328,0.0002501
gender,8.0912981,0.0007667,0.0002302



See the whole table with table.as_data_frame()


<bound method ModelBase.varimp of >

Podemos ver como todas estas variables estan presentes en la tabla de "user_log". Sin embargo, cuando estas variables no aparecen en una observación es más complicado predecir si un usuario va a renovar o no su suscripción.

Se tiene una herramienta para la predicción de si un cliente va a abandonar o no el servicio.

Para contestar la pregunta de que debe de hacer KKBox para que los clientes renueven su suscripción es importante ver las variables explicativas. Si un cliente ha usado el servicio en los ultimos dias entonces es probable que renueve la suscripción. Hay variables que no aportan mucho en terminos explicativos, por ejemplo, "is cancel" es importante, pero eso es bastante obvio, si el cliente cancela el servicio ya sabemos que no va a renovar su suscripción; de igual manera si se tiene el servicio en "auto renew" es probable que el cliente renueve el servicio. De las variables más explicativas son variables del servicios. Por lo que KKBox tiene que asegurarse que su servicio sigue siendo deseable.