### Diseño de Prueba del Modelo

Usaremos varios tipos de modelos distintos para poder predecir la probabilidad de cancelación de la suscripción. Usaremos un random forest, gradient Boosting Machine. Para ello usamos la librería de H2o.

La evaluación de cualquiera de nuestros modelos es sometida a un conjunto de prueba. Los conjuntos de entrenamieno y prueba se construyeron a partir de la base de datos y se respetó la partición que aparece en los lineamientos del concurso. De hecho, ellos divieron a partir de Abril 2017, antes de esta fechha se consideraron datos de entrenamiento y después datos de prueba. Los datos de validación se construyeron de forma aleatoria la prueba de entrenamiento en una proporción del 20%.

Usaremos como medida de error la Log Loss, pues estamos en un problema de clasificación binaria, que está dada por la siguiente fórmula:


\begin{align}
\log loss = - \frac{1}{N} \sum_{i=1}^n y_i \log(p_i) + (1-y_i) \log(1 - p_i)
\end{align}

También usamos el AUC como medida de error, que es el área bajo la curva ROC. Esto nos da un resumen de qué tan bueno es el predictor sin importar donde ponemos el punto de corte.

In [164]:
import warnings
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch
import numpy as np
import pandas as pd

Cargamos los datos de entrenamiento de la carpeta de AWS.

In [165]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


0,1
H2O cluster uptime:,1 hour 57 mins
H2O cluster timezone:,America/Mexico_City
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.0.2
H2O cluster version age:,"28 days, 8 hours and 18 minutes"
H2O cluster name:,H2O_from_python_lorena_olreqv
H2O cluster total nodes:,1
H2O cluster free memory:,1.128 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


In [166]:
#spotify = h2o.import_file("s3://proyectomineria/data/consolidated_train_table/part-00000-acf23e82-8c41-458f-9399-57e2a260de4b-c000.csv")
#spotify = h2o.import_file("/home/toto/Desktop/3er_Semestre/Mineria/Proyecto/kkbox_churn_prediction/data/part-00000-acf23e82-8c41-458f-9399-57e2a260de4b-c000.csv")
spotify = h2o.import_file("/home/lorena/Documents/mineria/proyecto/part-00000-acf23e82-8c41-458f-9399-57e2a260de4b-c000(1).csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


Convertimos el tipo de variables a factores y mostramos las primeras líneas del conjunto de entrenamiento.

In [167]:
spotify["is_churn"] = spotify["is_churn"].asfactor()
spotify["city"] = spotify["city"].asfactor()
spotify["gender"] = spotify["gender"].asfactor()

In [168]:
spotify

msno,is_churn,city,bd,gender,registered_via,registered_init_time,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0,1,0,,7,2014-07-14T00:00:00.000Z,1483980000.0,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,41.0,30,149.0,149.0,1,1481620000.0,,0.0
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0,15,31,male,9,2006-06-03T00:00:00.000Z,1487550000.0,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,34.0,30,149.0,149.0,1,1483110000.0,,0.0
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0,9,31,male,9,2004-03-30T00:00:00.000Z,1487640000.0,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,34.0,30,149.0,149.0,1,1483110000.0,,0.0
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0,15,29,male,9,2008-03-22T00:00:00.000Z,1487510000.0,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,34.0,30,149.0,149.0,1,1483110000.0,,0.0
+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=,1,13,29,female,3,2012-06-12T00:00:00.000Z,1486150000.0,2.4,0.866667,0.866667,1.33333,18.2,20.5333,4813.96,32.0,410,1788.0,1788.0,0,1452730000.0,,0.0
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0,5,24,female,9,2014-03-20T00:00:00.000Z,1487390000.0,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,23.0,30,149.0,149.0,1,1480500000.0,,0.0
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0,13,32,male,3,2015-03-16T00:00:00.000Z,1487430000.0,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,37.0,30,149.0,149.0,1,1482570000.0,,0.0
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0,5,0,,3,2013-02-27T00:00:00.000Z,1486560000.0,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,40.0,30,149.0,149.0,1,1483020000.0,,0.2
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0,6,31,female,9,2008-04-17T00:00:00.000Z,1487640000.0,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,36.4,30,167.6,167.6,1,1481310000.0,,0.2
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,0,1,0,,7,2012-12-17T00:00:00.000Z,1487190000.0,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,41.0,30,99.0,99.0,1,1479380000.0,,0.0




Especificamos cuáles son los predictores y cuál es la variable de respuesta.

## Ingeniería de Características
En esta sección agregamos variables manualmente, considerando po el contexto cuales variables tienen sentido y son interpretables. Podemos agregar una variable de descuentos a partir de cuánto cuesta el producto menos la cantidad que se pagó realmente. También agregamos una variable binaria si el cliente tiene o no descuento.

In [169]:
spotify["discount"] = spotify["plan_list_price"] - spotify["actual_amount_paid"]
spotify["is_discount"]=spotify["discount"]>0
spotify["amount_per_day"]=spotify["plan_list_price"]/spotify["payment_plan_days"]
spotify["bd"]=(spotify["bd"]<=0 or spotify["bd"]>100).ifelse(np.nan,spotify["bd"])

In [170]:
spotify

msno,is_churn,city,bd,gender,registered_via,registered_init_time,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,discount,is_discount,amount_per_day
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0,1,,,7,2014-07-14T00:00:00.000Z,1483980000.0,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,41.0,30,149.0,149.0,1,1481620000.0,,0.0,0,0,4.96667
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0,15,31.0,male,9,2006-06-03T00:00:00.000Z,1487550000.0,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,34.0,30,149.0,149.0,1,1483110000.0,,0.0,0,0,4.96667
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0,9,31.0,male,9,2004-03-30T00:00:00.000Z,1487640000.0,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,34.0,30,149.0,149.0,1,1483110000.0,,0.0,0,0,4.96667
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0,15,29.0,male,9,2008-03-22T00:00:00.000Z,1487510000.0,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,34.0,30,149.0,149.0,1,1483110000.0,,0.0,0,0,4.96667
+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=,1,13,29.0,female,3,2012-06-12T00:00:00.000Z,1486150000.0,2.4,0.866667,0.866667,1.33333,18.2,20.5333,4813.96,32.0,410,1788.0,1788.0,0,1452730000.0,,0.0,0,0,4.36098
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0,5,24.0,female,9,2014-03-20T00:00:00.000Z,1487390000.0,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,23.0,30,149.0,149.0,1,1480500000.0,,0.0,0,0,4.96667
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0,13,32.0,male,3,2015-03-16T00:00:00.000Z,1487430000.0,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,37.0,30,149.0,149.0,1,1482570000.0,,0.0,0,0,4.96667
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0,5,,,3,2013-02-27T00:00:00.000Z,1486560000.0,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,40.0,30,149.0,149.0,1,1483020000.0,,0.2,0,0,4.96667
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0,6,31.0,female,9,2008-04-17T00:00:00.000Z,1487640000.0,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,36.4,30,167.6,167.6,1,1481310000.0,,0.2,0,0,5.58667
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,0,1,,,7,2012-12-17T00:00:00.000Z,1487190000.0,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,41.0,30,99.0,99.0,1,1479380000.0,,0.0,0,0,3.3




In [171]:
predictors = ["city", "bd", "gender", "registered_via", "registered_init_time", "date", "num_25", "num_50", "num_75", "num_985", "num_100", "num_unq", "total_secs","payment_method_id", "payment_plan_days", "plan_list_price", "actual_amount_paid", "is_auto_renew", "transaction_date", "membership_expire_date","is_cancel"]#, "discount", "is_discount", "amount_per_day"] 
predictors = ["city", "bd", "gender", "registered_via", "registered_init_time","num_25", "num_50", "num_75", "num_985", "num_100", "num_unq", "total_secs","payment_method_id", "payment_plan_days", "plan_list_price", "actual_amount_paid", "is_auto_renew", "is_cancel","discount", "is_discount", "amount_per_day"]
response = "is_churn"

Hacemos la separación para el conjunto de entrenamiento y validación, 80% y 20% respectivamente.

In [172]:
train, valid = spotify.split_frame(ratios = [0.8], seed=1234)

## Modelos

### Gradient Boosting Machine

La idea de gradient boosting (GBM) es replicar la idea del residual en regresión, y usar
árboles de regresión. Es una técnica directa de un método ensamblado. La heurística nos dice que podemos obtener resultados muy buenos a partir de muchas aproximaciones burdas. Se construyen árboles en paralelo, y en cierto sentido este modelo aprende de los errores de los árboles generados en iteraciones previas.

A continuación mostramos la fase de entrenamiento usando la librería de H2o. Notese que usamos el conjunto de validación.

In [173]:
bin_num = [8,16,32,64,128,256,512,1024,2048,4096]
label = ["8","16","32","64","128","256","512","1024","2048","4096"]

In [174]:
df=pd.DataFrame(index=range(1,len(bin_num)),columns=['bin_num','training_score','validation_score'])
for key, num in enumerate(bin_num):
    spotify_gbm = H2OGradientBoostingEstimator(nbins_cats = num, seed=1234)
    spotify_gbm.train(x=predictors, y=response, training_frame=train, validation_frame=valid)
    df.loc[key]=[num, spotify_gbm.auc(train=True),spotify_gbm.auc(valid=True)]
    #print(label[key], 'training score', spotify_gbm.auc(train=True))
    #print(label[key], 'validation score', spotify_gbm.auc(valid=True))

gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


Se muestra el AUC, esto nos dice que nuestro predictor es extremadamente bueno. Habrá que verificar con los datos de prueba para validar que en realidad no se sobreajustó. 

In [175]:
print(label[key], 'training score', spotify_gbm.auc(train=True))
print(label[key], 'validation score', spotify_gbm.auc(valid=True))

4096 training score 0.9660228744883572
4096 validation score 0.9541978004297422


In [176]:
print(df[df['training_score']==df['training_score'].max()])
print(df[df['validation_score']==df['validation_score'].max()])

   bin_num  training_score  validation_score
9   4096.0        0.966023          0.954198
   bin_num  training_score  validation_score
4    128.0        0.959484          0.955245


In [177]:
df

Unnamed: 0,bin_num,training_score,validation_score
1,16.0,0.958286,0.95493
2,32.0,0.958279,0.954488
3,64.0,0.959048,0.954978
4,128.0,0.959484,0.955245
5,256.0,0.959274,0.954579
6,512.0,0.960334,0.95438
7,1024.0,0.960435,0.954717
8,2048.0,0.962146,0.954426
9,4096.0,0.966023,0.954198
0,8.0,0.957817,0.954589


Todos los modelos tienen un desempeño muy similar, el modelo con el mejor desempeño en el conjunto de entrenamiento es el último, pero su desempeño en el conjunto de validación es menor, lo cual quiere decir que está empezando a sobre ajustar.

Aquí se hace una predicción de los mismos datos con los que se ajustó.

In [178]:
final_gbm_predictions = spotify_gbm.predict(valid[1:])

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [179]:
final_gbm_predictions[:]

predict,p0,p1
0,0.996392,0.00360826
0,0.997609,0.00239101
0,0.99704,0.00296019
0,0.739144,0.260856
1,0.124758,0.875242
0,0.997609,0.00239101
0,0.915241,0.0847592
0,0.915941,0.0840587
0,0.996321,0.00367914
0,0.997495,0.00250488




### Gradient Boosting Machine reduced Model

Se hace exactamente el mismo proceso usando todas las covariables que se extrajeron a partir de la tabla de logs.

In [180]:
spotify_logs=spotify[:,["msno","num_25","num_50","num_75","num_985","num_100","num_unq","total_secs","is_churn"]]

In [181]:
spotify_logs

msno,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,is_churn
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,0
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,0
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,0
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,0
+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=,2.4,0.866667,0.866667,1.33333,18.2,20.5333,4813.96,1
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,0
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,0
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,0
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,0
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,0




In [182]:
train_logs, valid_logs = spotify_logs.split_frame(ratios = [0.8], seed=1234)

In [183]:
bin_num1 = [8,16,32,64,128,256,512,1024,2048,4096]
label1 = ["8","16","32","64","128","256","512","1024","2048","4096"]

In [184]:
log_predictors = ["num_25", "num_50", "num_75", "num_985", "num_100", "num_unq", "total_secs"] 

In [185]:
df_logs=pd.DataFrame(index=range(1,len(bin_num)),columns=['bin_num','training_score','validation_score'])
for key, num in enumerate(bin_num1):
    spotify_gbm1 = H2OGradientBoostingEstimator(nbins_cats = num, seed=1234)
    spotify_gbm1.train(x=log_predictors, y=response, training_frame=train_logs, validation_frame=valid_logs)
    df_logs.loc[key]=[num, spotify_gbm.auc(train=True),spotify_gbm.auc(valid=True)]

gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [186]:
df_logs

Unnamed: 0,bin_num,training_score,validation_score
1,16.0,0.966023,0.954198
2,32.0,0.966023,0.954198
3,64.0,0.966023,0.954198
4,128.0,0.966023,0.954198
5,256.0,0.966023,0.954198
6,512.0,0.966023,0.954198
7,1024.0,0.966023,0.954198
8,2048.0,0.966023,0.954198
9,4096.0,0.966023,0.954198
0,8.0,0.966023,0.954198


In [187]:
model_path = h2o.save_model(model=spotify_gbm1, path="/home/lorena/Documents/mineria/proyecto", force=True)
model_path

'/home/lorena/Documents/mineria/proyecto/GBM_model_python_1545334017097_5859'

Este modelo tiene casi el mismo desempeño que el anterior. Lo cual sugiere que las variables que más explican son las que se encuentran en la tabla de user_log

### Distributed Random Forest

El segundo modelo con el que se evaluó fue un modelo de Random Forest (DRF). Es un método muy poderoso para la clasificacion.

Distributed Random Forest (DRF) is a powerful classification and regression tool. When given a set of data, DRF generates a forest of classification or regression trees, rather than a single classification or regression tree. Each of these trees is a weak learner built on a subset of rows and columns. More trees will reduce the variance. Both classification and regression take the average prediction over all of their trees to make a final prediction, whether predicting for a class or numeric value.

In [188]:
rf_v1 = H2ORandomForestEstimator(
    model_id="rf_covType_v1",
    ntrees=200,
    stopping_rounds=2,
    score_each_iteration=True,
seed=1000000)

In [189]:
rf_v1.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [190]:
rf_v1.score_history()

Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
0,,2018-12-20 15:26:38,0.006 sec,0.0,,,,,,,,,,,,
1,,2018-12-20 15:26:38,0.371 sec,1.0,0.245711,1.529905,0.72003,0.190559,7.375163,0.089022,0.249436,1.580358,0.716015,0.185531,6.906414,0.090517
2,,2018-12-20 15:26:39,0.740 sec,2.0,0.236396,1.28003,0.775655,0.234305,7.610209,0.082654,0.219145,0.607001,0.860514,0.36848,8.621298,0.08189
3,,2018-12-20 15:26:39,1.295 sec,3.0,0.233434,1.108593,0.785717,0.265138,7.50555,0.084114,0.211103,0.32246,0.903563,0.444185,9.837826,0.08495
4,,2018-12-20 15:26:40,1.739 sec,4.0,0.228218,0.938032,0.803948,0.301631,7.622941,0.080238,0.206755,0.237483,0.920345,0.479558,10.223623,0.07524
5,,2018-12-20 15:26:40,2.216 sec,5.0,0.223709,0.791193,0.821909,0.333974,7.764927,0.07963,0.203557,0.194429,0.929099,0.510163,10.346394,0.077917
6,,2018-12-20 15:26:41,2.728 sec,6.0,0.219798,0.676722,0.84088,0.360421,8.015534,0.082265,0.201219,0.166313,0.93671,0.533692,11.159321,0.067569
7,,2018-12-20 15:26:41,3.281 sec,7.0,0.216557,0.580743,0.856649,0.38336,8.345906,0.080646,0.200243,0.151537,0.941581,0.543269,11.606062,0.072987
8,,2018-12-20 15:26:42,3.871 sec,8.0,0.21352,0.497769,0.871819,0.404332,8.69877,0.0758,0.199253,0.141744,0.945007,0.552556,11.798961,0.07388
9,,2018-12-20 15:26:43,4.486 sec,9.0,0.211306,0.430494,0.88453,0.422395,9.392875,0.082861,0.198694,0.134426,0.948027,0.555917,11.702511,0.069099


Se generó un segundo modelo cambiando los hiperparámetros para poder compararlos.

In [191]:
rf_v2 = H2ORandomForestEstimator(
    model_id="rf_covType_v2",
    ntrees=200,
    max_depth=30,
    stopping_rounds=2,
    stopping_tolerance=0.01,
    score_each_iteration=True,
seed=1234)

In [192]:
rf_v2.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [193]:
final_rf_predictions = rf_v2.predict(valid[1:])

drf prediction progress: |████████████████████████████████████████████████| 100%


In [194]:
final_rf_predictions

predict,p0,p1
0,0.999892,0.000107603
0,0.995853,0.00414703
0,0.989979,0.010021
0,0.871023,0.128977
1,0.193195,0.806805
0,0.995629,0.00437075
0,0.942481,0.0575189
0,0.804888,0.195112
0,0.982656,0.0173438
0,0.995689,0.00431088




In [195]:
print('training score', rf_v1.auc(train=True))
print('validation score', rf_v1.auc(valid=True))

training score 0.9449293111555299
validation score 0.9541046242303209


In [196]:
print('training score', rf_v2.auc(train=True))
print('validation score', rf_v2.auc(valid=True))

training score 0.9241879122188484
validation score 0.9460506474754001


In [197]:
print('training score', rf_v2.logloss(train=True))
print('validation score', rf_v2.logloss(valid=True))

training score 0.19336082636679344
validation score 0.13010450501512197


### Ajuste de hiperparámetros óptimo.

Nos quedaremos con el modelo de Gradient Boosting porque tiene menor Log Loss. En esta sección se ajustan los hiperámetros del modelo. Será imporante usar los datos de validación.

In [198]:
print('validation score RF', rf_v1.logloss(valid=True))
print('validation score GBM', spotify_gbm.logloss(valid=True))

validation score RF 0.12082164323006764
validation score GBM 0.1212337911389242


In [199]:
gbm_params1 = {'learn_rate': [0.01, 0.1],
                'max_depth': [3, 5, 9]}#,
                #'sample_rate': [0.8, 1.0],
                #'col_sample_rate': [0.2, 0.5, 1.0]}
gbm_params2 = {'learn_rate': [i * 0.01 for i in range(1, 11)],
                'max_depth': [i for i in range(2, 11)]},
                #'sample_rate': [i * 0.1 for i in range(5, 11)]}#,
               # 'col_sample_rate': [i * 0.1 for i in range(1, 11)]}

gbm_grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                          grid_id='gbm_grid',
                          hyper_params=gbm_params1)#,
                          #search_criteria=search_criteria)


In [200]:
gbm_grid.train(x=predictors, y=response, training_frame=train, validation_frame=valid, seed=42)

gbm Grid Build progress: |████████████████████████████████████████████████| 100%


In [201]:
# Get los resultados del grid, en orden de mejor a peor por AUC
gbm_gridperf = gbm_grid.get_grid(sort_by='auc', decreasing=True)
print(gbm_gridperf)

# Escojes el mejor modelo (usando el AUC de valicación)
best_gbm = gbm_gridperf.models[0]

     learn_rate max_depth          model_ids                 auc
0           0.1         9   gbm_grid_model_6  0.9734527216980862
1           0.1         5   gbm_grid_model_4  0.9716787980921754
2          0.01         9   gbm_grid_model_5  0.9704326936098066
3           0.1         9  gbm_grid_model_18  0.9685849185877624
4           0.1         9  gbm_grid_model_12   0.968329670408501
5           0.1         5  gbm_grid_model_16  0.9680274278698804
6           0.1         5  gbm_grid_model_10  0.9678245798653033
7           0.1         3   gbm_grid_model_2  0.9665087524432309
8          0.01         9  gbm_grid_model_17  0.9655053167310022
9          0.01         5   gbm_grid_model_3  0.9647187724312055
10         0.01         5   gbm_grid_model_9  0.9632185653485847
11         0.01         5  gbm_grid_model_15   0.962887734574687
12         0.01         9  gbm_grid_model_11  0.9625944041283593
13          0.1         3  gbm_grid_model_14  0.9625596744183935
14          0.1         3

## Evaluación

Summarize assessment results in terms of business success criteria, including a final statement regarding
whether the project already meets the initial business objectives.




### Reentrenamiento final de modelo con datos de entrenamiento y prueba con hiperparámetros optimizados.
Se mejora el ajuste de hiperparámetros para este modelo usando todos los datos.

In [202]:
#spotify_test = h2o.import_file("s3://proyectomineria/data/resumen_final_test/part-00000-326c4568-e87c-4af5-9c77-6ee2aa5d17ae-c000.csv")
#spotify_test = h2o.import_file("/home/toto/Desktop/3er_Semestre/Mineria/Proyecto/kkbox_churn_prediction/data/part-00000-326c4568-e87c-4af5-9c77-6ee2aa5d17ae-c000.csv")
spotify_test = h2o.import_file("/home/lorena/Documents/mineria/proyecto/testR.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [203]:
spotify_test

avg_num_unq,date,bd,payment_plan_days,city,avg_num_50,registered_init_time,msno,avg_num_75,plan_list_price,actual_amount_paid,avg_num_25,avg_num_100,membership_expire_date,is_churn,is_auto_renew,payment_method_id,registered_via,avg_num_985,gender,total_secs,is_cancel,transaction_date
13.0667,1483980000.0,0,30,1,0.933333,2014-07-14T00:00:00.000Z,++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0.733333,149,149,5.73333,6.46667,,0,1.0,41,7,0.666667,,1978.66,0,1485600000.0
61.1333,1487550000.0,31,30,15,1.26667,2006-06-03T00:00:00.000Z,+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,1.4,149,149,29.1333,33.6,,0,1.0,34,9,1.26667,male,9395.27,0,1487030000.0
19.8667,1487640000.0,31,30,9,1.86667,2004-03-30T00:00:00.000Z,+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,1.06667,149,149,12.4667,67.8,,0,1.0,34,9,3.26667,male,17219.0,0,1487030000.0
27.8,1487510000.0,29,30,15,0.6,2008-03-22T00:00:00.000Z,+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0.333333,149,149,2.33333,33.0,,0,1.0,34,9,0.666667,male,8571.42,0,1487030000.0
30.9333,1487390000.0,24,30,5,4.4,2014-03-20T00:00:00.000Z,+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0.933333,149,149,10.7333,19.6,,0,1.0,23,9,0.533333,female,5353.78,0,1484480000.0
46.6667,1487430000.0,32,30,13,1.6,2015-03-16T00:00:00.000Z,+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,1.66667,149,149,12.3333,40.6,,0,1.0,37,3,0.933333,male,10597.8,0,1486560000.0
20.6667,1486560000.0,0,30,5,0.4,2013-02-27T00:00:00.000Z,+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0.333333,149,149,1.0,21.5333,,0,1.0,40,3,0.133333,,5413.59,0,1486470000.0
15.0,1487640000.0,31,30,6,0.466667,2008-04-17T00:00:00.000Z,+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0.333333,180,180,1.26667,281.667,,0,1.0,36,9,1.86667,female,48663.0,0,1484400000.0
18.6667,1486030000.0,0,30,4,1.66667,2015-11-03T00:00:00.000Z,+2KZws+cYLzerLNA6dgCOpxKysRv4BQ8SiKtA0rV4QE=,0.866667,180,180,5.2,13.4667,,0,0.5,29,7,0.466667,,3910.75,0,1485000000.0
23.2667,1487190000.0,0,30,1,2.73333,2012-12-17T00:00:00.000Z,+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,1.33333,99,99,8.93333,11.4,,0,1.0,41,7,3.0,,3921.46,0,1484910000.0




In [204]:
spotify_test["num_25"] = spotify_test["avg_num_25"]
spotify_test["num_50"] = spotify_test["avg_num_50"]
spotify_test["num_75"] = spotify_test["avg_num_75"]
spotify_test["num_985"] = spotify_test["avg_num_985"]
spotify_test["num_100"] = spotify_test["avg_num_100"]
spotify_test["num_unq"] = spotify_test["avg_num_unq"]
spotify_test = spotify_test[:, ["msno","is_churn","city","bd","gender","registered_via","registered_init_time","date","num_25","num_50","num_75","num_985","num_100","num_unq","total_secs","payment_method_id","payment_plan_days","plan_list_price","actual_amount_paid","is_auto_renew","transaction_date","membership_expire_date","is_cancel"]]
spotify_test["is_churn"] = spotify_test["is_churn"].asfactor()
spotify_test["city"] = spotify_test["city"].asfactor()
spotify_test["gender"] = spotify_test["gender"].asfactor()
spotify_test["discount"] = spotify_test["plan_list_price"] - spotify_test["actual_amount_paid"]
spotify_test["is_discount"]=spotify_test["discount"]>0
spotify_test["amount_per_day"]=spotify_test["plan_list_price"]/spotify_test["payment_plan_days"]
spotify_test["bd"]=(spotify_test["bd"]<=0 or spotify_test["bd"]>100).ifelse(np.nan,spotify_test["bd"])

In [205]:
spotify_test

msno,is_churn,city,bd,gender,registered_via,registered_init_time,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,discount,is_discount,amount_per_day
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0,1,,,7,2014-07-14T00:00:00.000Z,1483980000.0,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,41,30,149,149,1.0,1485600000.0,,0,0,0,4.96667
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0,15,31.0,male,9,2006-06-03T00:00:00.000Z,1487550000.0,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,34,30,149,149,1.0,1487030000.0,,0,0,0,4.96667
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0,9,31.0,male,9,2004-03-30T00:00:00.000Z,1487640000.0,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,34,30,149,149,1.0,1487030000.0,,0,0,0,4.96667
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0,15,29.0,male,9,2008-03-22T00:00:00.000Z,1487510000.0,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,34,30,149,149,1.0,1487030000.0,,0,0,0,4.96667
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0,5,24.0,female,9,2014-03-20T00:00:00.000Z,1487390000.0,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,23,30,149,149,1.0,1484480000.0,,0,0,0,4.96667
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0,13,32.0,male,3,2015-03-16T00:00:00.000Z,1487430000.0,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,37,30,149,149,1.0,1486560000.0,,0,0,0,4.96667
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0,5,,,3,2013-02-27T00:00:00.000Z,1486560000.0,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,40,30,149,149,1.0,1486470000.0,,0,0,0,4.96667
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0,6,31.0,female,9,2008-04-17T00:00:00.000Z,1487640000.0,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,36,30,180,180,1.0,1484400000.0,,0,0,0,6.0
+2KZws+cYLzerLNA6dgCOpxKysRv4BQ8SiKtA0rV4QE=,0,4,,,7,2015-11-03T00:00:00.000Z,1486030000.0,5.2,1.66667,0.866667,0.466667,13.4667,18.6667,3910.75,29,30,180,180,0.5,1485000000.0,,0,0,0,6.0
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,0,1,,,7,2012-12-17T00:00:00.000Z,1487190000.0,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,41,30,99,99,1.0,1484910000.0,,0,0,0,3.3




In [206]:
best_gbm.train(x=predictors, y=response, training_frame=spotify)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [207]:
final_gbm_predictions = best_gbm.predict(spotify_test[1:])

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [208]:
prediciones=best_gbm.predict(spotify_test)

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [209]:
pred_df=prediciones.as_data_frame()
membs=spotify_test.as_data_frame().msno

In [210]:
result = pd.concat([membs,pred_df], axis=1, sort=False)

In [211]:
result.to_csv("test_predict_mochado.csv")

### Generación y carga de datos de prueba en Kaggle
Tuvimos pérdidas de datos en el conjunto de prueba. La tabla de prueba, en su forma original solo tiene una columna con ID's y es necesario construirla. Desafortunadamente no aparecen todas las covariantes en otras tablas por lo que se obtuvo una tabla llena de nulos y por esa razón no se sometió a Kaggle.

### Flask

In [212]:
model_path = h2o.save_model(model=best_gbm, path="/home/lorena/Documents/mineria/proyecto", force=True)

In [213]:
model_path

'/home/lorena/Documents/mineria/proyecto/GBM_model_python_1545334017097_6271'

### Comentarios Finales

La pregunta que queríamos contestar es ¿qué clientes no van a renovar el servicio de KKBox? ¿qué caracteristicas tienen los clientes que no van a renovar la suscripción?

La respuesta se puede dar viendo las variables más importantes del modelo que se utiliza para predecir.

In [214]:
best_gbm.varimp

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1545334017097_6271


ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.035237645756813755
RMSE: 0.18771692986199662
LogLoss: 0.11489323442170712
Mean Per-Class Error: 0.08419715413733375
AUC: 0.9595068363110685
pr_auc: 0.6563580825421097
Gini: 0.919013672622137
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.32626717229591234: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,212502.0,7692.0,0.0349,(7692.0/220194.0)
1,5661.0,9796.0,0.3662,(5661.0/15457.0)
Total,218163.0,17488.0,0.0567,(13353.0/235651.0)


Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.3262672,0.5946881,204.0
max f2,0.1657003,0.7197323,303.0
max f0point5,0.4671354,0.6312399,138.0
max accuracy,0.4641476,0.9513688,140.0
max precision,0.9597882,1.0,0.0
max recall,0.0026182,1.0,399.0
max specificity,0.9597882,1.0,0.0
max absolute_mcc,0.2579934,0.5679624,241.0
max min_per_class_accuracy,0.1874026,0.9027625,290.0


Gains/Lift Table: Avg response rate:  6.56 %, avg score:  6.57 %



0,1,2,3,4,5,6,7,8,9,10,11,12,13
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100021,0.6777997,13.8678546,13.8678546,0.9096309,0.7732473,0.9096309,0.7732473,0.1387074,0.1387074,1286.7854570,1286.7854570
,2,0.0200042,0.5918192,11.4422737,12.6550641,0.7505303,0.6282916,0.8300806,0.7007695,0.1144465,0.2531539,1044.2273663,1165.5064117
,3,0.0300020,0.5248324,9.7258546,11.6789372,0.6379457,0.5586764,0.7660537,0.6534185,0.0972375,0.3503914,872.5854644,1067.8937169
,4,0.0400041,0.4639760,8.3892758,10.8564346,0.5502758,0.4940751,0.7121035,0.6135784,0.0839102,0.4343016,738.9275829,985.6434594
,5,0.0500019,0.4149903,7.0274638,10.0908354,0.4609508,0.4381431,0.6618858,0.5785003,0.0702594,0.5045610,602.7463834,909.0835416
,6,0.1000038,0.2567555,4.7652964,7.4280659,0.3125690,0.3270081,0.4872274,0.4527542,0.2382739,0.7428350,376.5296427,642.8065922
,7,0.1500015,0.1867775,3.2025820,6.0196510,0.2100662,0.2212462,0.3948455,0.3755892,0.1601216,0.9029566,220.2582049,501.9650991
,8,0.2000246,0.0417230,1.3140069,4.8428407,0.0861893,0.1165863,0.3176553,0.3108165,0.0657307,0.9686873,31.4006946,384.2840654
,9,0.3019465,0.0050030,0.1656715,3.2640643,0.0108668,0.0112896,0.2140990,0.2097113,0.0168856,0.9855729,-83.4328522,226.4064348



Scoring History: 


0,1,2,3,4,5,6,7,8,9
,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error
,2018-12-20 15:28:32,0.018 sec,0.0,0.2475689,0.2420866,0.5,0.0,1.0,0.9344072
,2018-12-20 15:28:32,0.275 sec,1.0,0.2365464,0.2089622,0.9437967,0.5279200,10.7054541,0.0780519
,2018-12-20 15:28:33,0.468 sec,2.0,0.2291430,0.1922037,0.9471793,0.5368189,11.2051049,0.0766091
,2018-12-20 15:28:33,0.656 sec,3.0,0.2232677,0.1803427,0.9477386,0.5400711,11.2616311,0.0688858
,2018-12-20 15:28:33,0.875 sec,4.0,0.2180669,0.1706770,0.9505813,0.5594826,11.2716751,0.0674854
,2018-12-20 15:28:33,1.121 sec,5.0,0.2138789,0.1631844,0.9509486,0.5662901,11.3958829,0.0683002
,2018-12-20 15:28:34,1.361 sec,6.0,0.2104485,0.1571033,0.9524804,0.5730491,12.3165279,0.0677994
,2018-12-20 15:28:34,1.596 sec,7.0,0.2076304,0.1519563,0.9530572,0.5792164,12.1964676,0.0677358
,2018-12-20 15:28:34,1.838 sec,8.0,0.2052718,0.1476896,0.9532229,0.5866584,12.3141004,0.0670356


Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
is_auto_renew,10046.8408203,1.0,0.3488143
is_cancel,8146.3046875,0.8108325,0.2828300
amount_per_day,4542.9794922,0.4521799,0.1577268
payment_method_id,1424.7645264,0.1418122,0.0494661
registered_init_time,1399.0455322,0.1392523,0.0485732
---,---,---,---
num_25,49.5311508,0.0049300,0.0017197
num_985,40.4902573,0.0040301,0.0014058
gender,8.2533884,0.0008215,0.0002865



See the whole table with table.as_data_frame()


<bound method ModelBase.varimp of >

Podemos ver como todas estas variables estan presentes en la tabla de "user_log". Sin embargo, cuando estas variables no aparecen en una observación es más complicado predecir si un usuario va a renovar o no su suscripción.

Se tiene una herramienta para la predicción de si un cliente va a abandonar o no el servicio.

Para contestar la pregunta de que debe de hacer KKBox para que los clientes renueven su suscripción es importante ver las variables explicativas. Hay variables que no aportan mucho en terminos explicativos, por ejemplo, "is cancel" es importante, pero eso es bastante obvio, si el cliente cancela el servicio ya sabemos que no va a renovar su suscripción; de igual manera si se tiene el servicio en "auto renew" es probable que el cliente renueve el servicio. Las variables más explicativas son variables del servicio no caracteristicas de los clientes. Por lo que KKBox tiene que asegurarse que su servicio sigue siendo deseable.