### Diseño de Prueba del Modelo

Usaremos varios tipos de modelos distintos para poder predecir la probabilidad de cancelación de la suscripción. Usaremos un random forest, gradient Boosting Machine. Para ello usamos la librería de H2o.

La evaluación de cualquiera de nuestros modelos es sometida a un conjunto de prueba. Los conjuntos de entrenamieno y prueba se construyeron a partir de la base de datos y se respetó la partición que aparece en los lineamientos del concurso. De hecho, ellos divieron a partir de Abril 2017, antes de esta fechha se consideraron datos de entrenamiento y después datos de prueba. Los datos de validación se construyeron de forma aleatoria la prueba de entrenamiento en una proporción del 20%.

Usaremos como medida de error la Log Loss, pues estamos en un problema de clasificación binaria, que está dada por la siguiente fórmula:


\begin{align}
\log loss = - \frac{1}{N} \sum_{i=1}^n y_i \log(p_i) + (1-y_i) \log(1 - p_i)
\end{align}

También usamos el AUC como medida de error, que es el área bajo la curva ROC. Esto nos da un resumen de qué tan bueno es el predictor sin importar donde ponemos el punto de corte.

In [57]:
import warnings
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch
import numpy as np
import pandas as pd

Cargamos los datos de entrenamiento de la carpeta de AWS.

In [58]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


0,1
H2O cluster uptime:,4 mins 20 secs
H2O cluster timezone:,America/Mexico_City
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.0.2
H2O cluster version age:,"28 days, 5 hours and 48 minutes"
H2O cluster name:,H2O_from_python_lorena_y0ya6b
H2O cluster total nodes:,1
H2O cluster free memory:,1.258 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


In [59]:
#spotify = h2o.import_file("s3://proyectomineria/data/consolidated_train_table/part-00000-acf23e82-8c41-458f-9399-57e2a260de4b-c000.csv")
#spotify = h2o.import_file("/home/toto/Desktop/3er_Semestre/Mineria/Proyecto/kkbox_churn_prediction/data/part-00000-acf23e82-8c41-458f-9399-57e2a260de4b-c000.csv")
spotify = h2o.import_file("/home/lorena/Documents/mineria/proyecto/part-00000-acf23e82-8c41-458f-9399-57e2a260de4b-c000(1).csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


Convertimos el tipo de variables a factores y mostramos las primeras líneas del conjunto de entrenamiento.

In [60]:
spotify["is_churn"] = spotify["is_churn"].asfactor()
spotify["city"] = spotify["city"].asfactor()
spotify["gender"] = spotify["gender"].asfactor()

In [61]:
spotify

msno,is_churn,city,bd,gender,registered_via,registered_init_time,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0,1,0,,7,2014-07-14T00:00:00.000Z,1483980000.0,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,41.0,30,149.0,149.0,1,1481620000.0,,0.0
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0,15,31,male,9,2006-06-03T00:00:00.000Z,1487550000.0,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,34.0,30,149.0,149.0,1,1483110000.0,,0.0
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0,9,31,male,9,2004-03-30T00:00:00.000Z,1487640000.0,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,34.0,30,149.0,149.0,1,1483110000.0,,0.0
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0,15,29,male,9,2008-03-22T00:00:00.000Z,1487510000.0,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,34.0,30,149.0,149.0,1,1483110000.0,,0.0
+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=,1,13,29,female,3,2012-06-12T00:00:00.000Z,1486150000.0,2.4,0.866667,0.866667,1.33333,18.2,20.5333,4813.96,32.0,410,1788.0,1788.0,0,1452730000.0,,0.0
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0,5,24,female,9,2014-03-20T00:00:00.000Z,1487390000.0,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,23.0,30,149.0,149.0,1,1480500000.0,,0.0
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0,13,32,male,3,2015-03-16T00:00:00.000Z,1487430000.0,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,37.0,30,149.0,149.0,1,1482570000.0,,0.0
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0,5,0,,3,2013-02-27T00:00:00.000Z,1486560000.0,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,40.0,30,149.0,149.0,1,1483020000.0,,0.2
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0,6,31,female,9,2008-04-17T00:00:00.000Z,1487640000.0,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,36.4,30,167.6,167.6,1,1481310000.0,,0.2
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,0,1,0,,7,2012-12-17T00:00:00.000Z,1487190000.0,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,41.0,30,99.0,99.0,1,1479380000.0,,0.0




Especificamos cuáles son los predictores y cuál es la variable de respuesta.

## Ingeniería de Características
En esta sección agregamos variables manualmente, considerando po el contexto cuales variables tienen sentido y son interpretables. Podemos agregar una variable de descuentos a partir de cuánto cuesta el producto menos la cantidad que se pagó realmente. También agregamos una variable binaria si el cliente tiene o no descuento.

In [62]:
spotify["discount"] = spotify["plan_list_price"] - spotify["actual_amount_paid"]
spotify["is_discount"]=spotify["discount"]>0
spotify["amount_per_day"]=spotify["plan_list_price"]/spotify["payment_plan_days"]
spotify["bd"]=(spotify["bd"]<=0 or spotify["bd"]>100).ifelse(np.nan,spotify["bd"])

In [63]:
spotify

msno,is_churn,city,bd,gender,registered_via,registered_init_time,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,discount,is_discount,amount_per_day
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0,1,,,7,2014-07-14T00:00:00.000Z,1483980000.0,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,41.0,30,149.0,149.0,1,1481620000.0,,0.0,0,0,4.96667
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0,15,31.0,male,9,2006-06-03T00:00:00.000Z,1487550000.0,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,34.0,30,149.0,149.0,1,1483110000.0,,0.0,0,0,4.96667
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0,9,31.0,male,9,2004-03-30T00:00:00.000Z,1487640000.0,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,34.0,30,149.0,149.0,1,1483110000.0,,0.0,0,0,4.96667
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0,15,29.0,male,9,2008-03-22T00:00:00.000Z,1487510000.0,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,34.0,30,149.0,149.0,1,1483110000.0,,0.0,0,0,4.96667
+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=,1,13,29.0,female,3,2012-06-12T00:00:00.000Z,1486150000.0,2.4,0.866667,0.866667,1.33333,18.2,20.5333,4813.96,32.0,410,1788.0,1788.0,0,1452730000.0,,0.0,0,0,4.36098
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0,5,24.0,female,9,2014-03-20T00:00:00.000Z,1487390000.0,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,23.0,30,149.0,149.0,1,1480500000.0,,0.0,0,0,4.96667
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0,13,32.0,male,3,2015-03-16T00:00:00.000Z,1487430000.0,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,37.0,30,149.0,149.0,1,1482570000.0,,0.0,0,0,4.96667
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0,5,,,3,2013-02-27T00:00:00.000Z,1486560000.0,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,40.0,30,149.0,149.0,1,1483020000.0,,0.2,0,0,4.96667
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0,6,31.0,female,9,2008-04-17T00:00:00.000Z,1487640000.0,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,36.4,30,167.6,167.6,1,1481310000.0,,0.2,0,0,5.58667
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,0,1,,,7,2012-12-17T00:00:00.000Z,1487190000.0,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,41.0,30,99.0,99.0,1,1479380000.0,,0.0,0,0,3.3




In [64]:
predictors = ["city", "bd", "gender", "registered_via", "registered_init_time", "date", "num_25", "num_50", "num_75", "num_985", "num_100", "num_unq", "total_secs","payment_method_id", "payment_plan_days", "plan_list_price", "actual_amount_paid", "is_auto_renew", "transaction_date", "membership_expire_date","is_cancel", "discount", "is_discount", "amount_per_day"] 
response = "is_churn"

Hacemos la separación para el conjunto de entrenamiento y validación, 80% y 20% respectivamente.

In [65]:
train, valid = spotify.split_frame(ratios = [0.8], seed=1234)

## Modelos

### Gradient Boosting Machine

La idea de gradient boosting (GBM) es replicar la idea del residual en regresión, y usar
árboles de regresión. Es una técnica directa de un método ensamblado. La heurística nos dice que podemos obtener resultados muy buenos a partir de muchas aproximaciones burdas. Se construyen árboles en paralelo, y en cierto sentido este modelo aprende de los errores de los árboles generados en iteraciones previas.

A continuación mostramos la fase de entrenamiento usando la librería de H2o. Notese que usamos el conjunto de validación.

In [66]:
bin_num = [8,16,32,64,128,256,512,1024,2048,4096]
label = ["8","16","32","64","128","256","512","1024","2048","4096"]

In [67]:
df=pd.DataFrame(index=range(1,len(bin_num)),columns=['bin_num','training_score','validation_score'])
for key, num in enumerate(bin_num):
    spotify_gbm = H2OGradientBoostingEstimator(nbins_cats = num, seed=1234)
    spotify_gbm.train(x=predictors, y=response, training_frame=train, validation_frame=valid)
    df.loc[key]=[num, spotify_gbm.auc(train=True),spotify_gbm.auc(valid=True)]
    #print(label[key], 'training score', spotify_gbm.auc(train=True))
    #print(label[key], 'validation score', spotify_gbm.auc(valid=True))

gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


Se muestra el AUC, esto nos dice que nuestro predictor es extremadamente bueno. Habrá que verificar con los datos de prueba para validar que en realidad no se sobreajustó. 

In [68]:
print(label[key], 'training score', spotify_gbm.auc(train=True))
print(label[key], 'validation score', spotify_gbm.auc(valid=True))

4096 training score 0.9761901043183583
4096 validation score 0.9718443141290372


In [69]:
print(df[df['training_score']==df['training_score'].max()])
print(df[df['validation_score']==df['validation_score'].max()])

   bin_num  training_score  validation_score
9   4096.0         0.97619          0.971844
   bin_num  training_score  validation_score
9   4096.0         0.97619          0.971844


In [70]:
df

Unnamed: 0,bin_num,training_score,validation_score
1,16.0,0.972887,0.97177
2,32.0,0.973409,0.971685
3,64.0,0.97329,0.971818
4,128.0,0.972324,0.970741
5,256.0,0.973648,0.97183
6,512.0,0.973555,0.971335
7,1024.0,0.973425,0.970748
8,2048.0,0.97421,0.97089
9,4096.0,0.97619,0.971844
0,8.0,0.97268,0.971275


Todos los modelos tienen un desempeño muy similar, el modelo con el mejor desempeño en el conjunto de entrenamiento es el último, pero su desempeño en el conjunto de validación es menor, lo cual quiere decir que está empezando a sobre ajustar.

Aquí se hace una predicción de los mismos datos con los que se ajustó.

In [71]:
final_gbm_predictions = spotify_gbm.predict(valid[1:])

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [72]:
final_gbm_predictions[:]

predict,p0,p1
0,0.996551,0.00344907
0,0.995995,0.00400488
0,0.996572,0.00342782
0,0.965165,0.0348354
1,0.173522,0.826478
0,0.997216,0.00278375
0,0.899156,0.100844
0,0.804109,0.195891
0,0.996381,0.00361921
0,0.997361,0.00263921




### Gradient Boosting Machine reduced Model

Se hace exactamente el mismo proceso usando todas las covariables que se extrajeron a partir de la tabla de logs.

In [73]:
spotify_logs=spotify[:,["msno","num_25","num_50","num_75","num_985","num_100","num_unq","total_secs","is_churn"]]

In [74]:
spotify_logs

msno,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,is_churn
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,0
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,0
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,0
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,0
+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=,2.4,0.866667,0.866667,1.33333,18.2,20.5333,4813.96,1
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,0
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,0
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,0
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,0
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,0




In [75]:
train_logs, valid_logs = spotify_logs.split_frame(ratios = [0.8], seed=1234)

In [76]:
bin_num1 = [8,16,32,64,128,256,512,1024,2048,4096]
label1 = ["8","16","32","64","128","256","512","1024","2048","4096"]

In [77]:
log_predictors = ["num_25", "num_50", "num_75", "num_985", "num_100", "num_unq", "total_secs"] 

In [78]:
df_logs=pd.DataFrame(index=range(1,len(bin_num)),columns=['bin_num','training_score','validation_score'])
for key, num in enumerate(bin_num1):
    spotify_gbm1 = H2OGradientBoostingEstimator(nbins_cats = num, seed=1234)
    spotify_gbm1.train(x=log_predictors, y=response, training_frame=train_logs, validation_frame=valid_logs)
    df_logs.loc[key]=[num, spotify_gbm.auc(train=True),spotify_gbm.auc(valid=True)]

gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [79]:
df_logs

Unnamed: 0,bin_num,training_score,validation_score
1,16.0,0.97619,0.971844
2,32.0,0.97619,0.971844
3,64.0,0.97619,0.971844
4,128.0,0.97619,0.971844
5,256.0,0.97619,0.971844
6,512.0,0.97619,0.971844
7,1024.0,0.97619,0.971844
8,2048.0,0.97619,0.971844
9,4096.0,0.97619,0.971844
0,8.0,0.97619,0.971844


In [80]:
model_path = h2o.save_model(model=spotify_gbm1, path="/home/lorena/Documents/mineria/proyecto", force=True)
model_path

'/home/lorena/Documents/mineria/proyecto/GBM_model_python_1545331842912_1680'

Este modelo tiene casi el mismo desempeño que el anterior. Lo cual sugiere que las variables que más explican son las que se encuentran en la tabla de user_log

### Distributed Random Forest

El segundo modelo con el que se evaluó fue un modelo de Random Forest (DRF). Es un método muy poderoso para la clasificacion.

Distributed Random Forest (DRF) is a powerful classification and regression tool. When given a set of data, DRF generates a forest of classification or regression trees, rather than a single classification or regression tree. Each of these trees is a weak learner built on a subset of rows and columns. More trees will reduce the variance. Both classification and regression take the average prediction over all of their trees to make a final prediction, whether predicting for a class or numeric value.

In [81]:
rf_v1 = H2ORandomForestEstimator(
    model_id="rf_covType_v1",
    ntrees=200,
    stopping_rounds=2,
    score_each_iteration=True,
seed=1000000)

In [82]:
rf_v1.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [83]:
rf_v1.score_history()

Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
0,,2018-12-20 12:57:42,0.009 sec,0.0,,,,,,,,,,,,
1,,2018-12-20 12:57:43,1.038 sec,1.0,0.228247,1.490779,0.724261,0.144583,8.978217,0.056729,0.225414,1.429605,0.73282,0.148376,9.053709,0.055479
2,,2018-12-20 12:57:44,1.927 sec,2.0,0.223131,1.29541,0.771345,0.202941,8.954221,0.05833,0.198343,0.583631,0.85674,0.384831,11.178754,0.058539
3,,2018-12-20 12:57:45,2.632 sec,3.0,0.219264,1.116331,0.795641,0.250004,9.096266,0.060958,0.189151,0.324248,0.910554,0.507657,12.239988,0.048148
4,,2018-12-20 12:57:45,3.134 sec,4.0,0.213187,0.954386,0.81942,0.290982,9.499896,0.060199,0.183589,0.222323,0.935918,0.577245,13.214154,0.050039
5,,2018-12-20 12:57:46,3.549 sec,5.0,0.210175,0.743433,0.837275,0.344902,9.604657,0.061534,0.182642,0.165013,0.94072,0.627963,13.070399,0.049359
6,,2018-12-20 12:57:46,4.175 sec,6.0,0.20519,0.626553,0.857658,0.384346,9.937069,0.06148,0.179972,0.139721,0.950668,0.653738,12.025819,0.049954
7,,2018-12-20 12:57:47,4.757 sec,7.0,0.202235,0.534683,0.870819,0.419153,10.157883,0.060238,0.179459,0.130338,0.954252,0.662961,12.602705,0.046555
8,,2018-12-20 12:57:47,5.403 sec,8.0,0.198933,0.461061,0.885567,0.450701,10.345223,0.05987,0.178261,0.125145,0.955719,0.67538,12.570555,0.044834
9,,2018-12-20 12:57:48,6.014 sec,9.0,0.19545,0.399547,0.895726,0.481049,10.627464,0.057471,0.176588,0.118537,0.959304,0.688122,13.117101,0.041689


Se generó un segundo modelo cambiando los hiperparámetros para poder compararlos.

In [84]:
rf_v2 = H2ORandomForestEstimator(
    model_id="rf_covType_v2",
    ntrees=200,
    max_depth=30,
    stopping_rounds=2,
    stopping_tolerance=0.01,
    score_each_iteration=True,
seed=1234)

In [85]:
rf_v2.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [86]:
final_rf_predictions = rf_v2.predict(valid[1:])

drf prediction progress: |████████████████████████████████████████████████| 100%


In [87]:
final_rf_predictions

predict,p0,p1
0,1.0,0.0
0,0.997467,0.00253298
0,0.993596,0.0064038
0,0.958457,0.0415427
1,0.32801,0.67199
0,0.997791,0.00220904
0,0.913596,0.086404
0,0.69316,0.30684
0,0.993408,0.00659198
0,0.997634,0.00236561




In [88]:
print('training score', rf_v1.auc(train=True))
print('validation score', rf_v1.auc(valid=True))

training score 0.9616455827987666
validation score 0.9705483678097715


In [89]:
print('training score', rf_v2.auc(train=True))
print('validation score', rf_v2.auc(valid=True))

training score 0.9446001799492036
validation score 0.9632700395440698


In [90]:
print('training score', rf_v2.logloss(train=True))
print('validation score', rf_v2.logloss(valid=True))

training score 0.17060865286583715
validation score 0.11714252622701693


### Ajuste de hiperparámetros óptimo.

Nos quedaremos con el modelo de Gradient Boosting porque tiene menor Log Loss. En esta sección se ajustan los hiperámetros del modelo. Será imporante usar los datos de validación.

In [91]:
print('validation score RF', rf_v1.logloss(valid=True))
print('validation score GBM', spotify_gbm.logloss(valid=True))

validation score RF 0.09935382446200215
validation score GBM 0.09864033817617632


In [92]:
gbm_params1 = {'learn_rate': [0.01, 0.1],
                'max_depth': [3, 5, 9]}#,
                #'sample_rate': [0.8, 1.0],
                #'col_sample_rate': [0.2, 0.5, 1.0]}
gbm_params2 = {'learn_rate': [i * 0.01 for i in range(1, 11)],
                'max_depth': [i for i in range(2, 11)]},
                #'sample_rate': [i * 0.1 for i in range(5, 11)]}#,
               # 'col_sample_rate': [i * 0.1 for i in range(1, 11)]}

gbm_grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                          grid_id='gbm_grid',
                          hyper_params=gbm_params1)#,
                          #search_criteria=search_criteria)


In [93]:
gbm_grid.train(x=predictors, y=response, training_frame=train, validation_frame=valid, seed=42)

gbm Grid Build progress: |████████████████████████████████████████████████| 100%


In [94]:
# Get los resultados del grid, en orden de mejor a peor por AUC
gbm_gridperf = gbm_grid.get_grid(sort_by='auc', decreasing=True)
print(gbm_gridperf)

# Escojes el mejor modelo (usando el AUC de valicación)
best_gbm = gbm_gridperf.models[0]

    learn_rate max_depth         model_ids                 auc
0          0.1         9  gbm_grid_model_6  0.9739041054477297
1         0.01         9  gbm_grid_model_5  0.9710978980345377
2          0.1         5  gbm_grid_model_4    0.97074761803746
3         0.01         5  gbm_grid_model_3    0.96517119561995
4          0.1         3  gbm_grid_model_2  0.9647902668093772
5         0.01         3  gbm_grid_model_1  0.9502427858900389



## Evaluación

Summarize assessment results in terms of business success criteria, including a final statement regarding
whether the project already meets the initial business objectives.




### Reentrenamiento final de modelo con datos de entrenamiento y prueba con hiperparámetros optimizados.
Se mejora el ajuste de hiperparámetros para este modelo usando todos los datos.

In [95]:
#spotify_test = h2o.import_file("s3://proyectomineria/data/resumen_final_test/part-00000-326c4568-e87c-4af5-9c77-6ee2aa5d17ae-c000.csv")
#spotify_test = h2o.import_file("/home/toto/Desktop/3er_Semestre/Mineria/Proyecto/kkbox_churn_prediction/data/part-00000-326c4568-e87c-4af5-9c77-6ee2aa5d17ae-c000.csv")
spotify_test = h2o.import_file("/home/lorena/Documents/mineria/proyecto/testR.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [96]:
spotify_test

avg_num_unq,date,bd,payment_plan_days,city,avg_num_50,registered_init_time,msno,avg_num_75,plan_list_price,actual_amount_paid,avg_num_25,avg_num_100,membership_expire_date,is_churn,is_auto_renew,payment_method_id,registered_via,avg_num_985,gender,total_secs,is_cancel,transaction_date
13.0667,1483980000.0,0,30,1,0.933333,2014-07-14T00:00:00.000Z,++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0.733333,149,149,5.73333,6.46667,,0,1.0,41,7,0.666667,,1978.66,0,1485600000.0
61.1333,1487550000.0,31,30,15,1.26667,2006-06-03T00:00:00.000Z,+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,1.4,149,149,29.1333,33.6,,0,1.0,34,9,1.26667,male,9395.27,0,1487030000.0
19.8667,1487640000.0,31,30,9,1.86667,2004-03-30T00:00:00.000Z,+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,1.06667,149,149,12.4667,67.8,,0,1.0,34,9,3.26667,male,17219.0,0,1487030000.0
27.8,1487510000.0,29,30,15,0.6,2008-03-22T00:00:00.000Z,+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0.333333,149,149,2.33333,33.0,,0,1.0,34,9,0.666667,male,8571.42,0,1487030000.0
30.9333,1487390000.0,24,30,5,4.4,2014-03-20T00:00:00.000Z,+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0.933333,149,149,10.7333,19.6,,0,1.0,23,9,0.533333,female,5353.78,0,1484480000.0
46.6667,1487430000.0,32,30,13,1.6,2015-03-16T00:00:00.000Z,+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,1.66667,149,149,12.3333,40.6,,0,1.0,37,3,0.933333,male,10597.8,0,1486560000.0
20.6667,1486560000.0,0,30,5,0.4,2013-02-27T00:00:00.000Z,+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0.333333,149,149,1.0,21.5333,,0,1.0,40,3,0.133333,,5413.59,0,1486470000.0
15.0,1487640000.0,31,30,6,0.466667,2008-04-17T00:00:00.000Z,+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0.333333,180,180,1.26667,281.667,,0,1.0,36,9,1.86667,female,48663.0,0,1484400000.0
18.6667,1486030000.0,0,30,4,1.66667,2015-11-03T00:00:00.000Z,+2KZws+cYLzerLNA6dgCOpxKysRv4BQ8SiKtA0rV4QE=,0.866667,180,180,5.2,13.4667,,0,0.5,29,7,0.466667,,3910.75,0,1485000000.0
23.2667,1487190000.0,0,30,1,2.73333,2012-12-17T00:00:00.000Z,+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,1.33333,99,99,8.93333,11.4,,0,1.0,41,7,3.0,,3921.46,0,1484910000.0




In [97]:
spotify_test["num_25"] = spotify_test["avg_num_25"]
spotify_test["num_50"] = spotify_test["avg_num_50"]
spotify_test["num_75"] = spotify_test["avg_num_75"]
spotify_test["num_985"] = spotify_test["avg_num_985"]
spotify_test["num_100"] = spotify_test["avg_num_100"]
spotify_test["num_unq"] = spotify_test["avg_num_unq"]
spotify_test = spotify_test[:, ["msno","is_churn","city","bd","gender","registered_via","registered_init_time","date","num_25","num_50","num_75","num_985","num_100","num_unq","total_secs","payment_method_id","payment_plan_days","plan_list_price","actual_amount_paid","is_auto_renew","transaction_date","membership_expire_date","is_cancel"]]
spotify_test["is_churn"] = spotify_test["is_churn"].asfactor()
spotify_test["city"] = spotify_test["city"].asfactor()
spotify_test["gender"] = spotify_test["gender"].asfactor()
spotify_test["discount"] = spotify_test["plan_list_price"] - spotify_test["actual_amount_paid"]
spotify_test["is_discount"]=spotify_test["discount"]>0
spotify_test["amount_per_day"]=spotify_test["plan_list_price"]/spotify_test["payment_plan_days"]
spotify_test["bd"]=(spotify_test["bd"]<=0 or spotify_test["bd"]>100).ifelse(np.nan,spotify_test["bd"])

In [98]:
spotify_test

msno,is_churn,city,bd,gender,registered_via,registered_init_time,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,discount,is_discount,amount_per_day
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0,1,,,7,2014-07-14T00:00:00.000Z,1483980000.0,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,41,30,149,149,1.0,1485600000.0,,0,0,0,4.96667
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0,15,31.0,male,9,2006-06-03T00:00:00.000Z,1487550000.0,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,34,30,149,149,1.0,1487030000.0,,0,0,0,4.96667
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0,9,31.0,male,9,2004-03-30T00:00:00.000Z,1487640000.0,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,34,30,149,149,1.0,1487030000.0,,0,0,0,4.96667
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0,15,29.0,male,9,2008-03-22T00:00:00.000Z,1487510000.0,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,34,30,149,149,1.0,1487030000.0,,0,0,0,4.96667
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0,5,24.0,female,9,2014-03-20T00:00:00.000Z,1487390000.0,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,23,30,149,149,1.0,1484480000.0,,0,0,0,4.96667
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0,13,32.0,male,3,2015-03-16T00:00:00.000Z,1487430000.0,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,37,30,149,149,1.0,1486560000.0,,0,0,0,4.96667
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0,5,,,3,2013-02-27T00:00:00.000Z,1486560000.0,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,40,30,149,149,1.0,1486470000.0,,0,0,0,4.96667
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0,6,31.0,female,9,2008-04-17T00:00:00.000Z,1487640000.0,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,36,30,180,180,1.0,1484400000.0,,0,0,0,6.0
+2KZws+cYLzerLNA6dgCOpxKysRv4BQ8SiKtA0rV4QE=,0,4,,,7,2015-11-03T00:00:00.000Z,1486030000.0,5.2,1.66667,0.866667,0.466667,13.4667,18.6667,3910.75,29,30,180,180,0.5,1485000000.0,,0,0,0,6.0
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,0,1,,,7,2012-12-17T00:00:00.000Z,1487190000.0,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,41,30,99,99,1.0,1484910000.0,,0,0,0,3.3




In [99]:
best_gbm.train(x=predictors, y=response, training_frame=spotify)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [44]:
final_gbm_predictions = best_gbm.predict(spotify_test[1:])

gbm prediction progress: |████████████████████████████████████████████████| 100%


### Generación y carga de datos de prueba en Kaggle
Tuvimos pérdidas de datos en el conjunto de prueba. La tabla de prueba, en su forma original solo tiene una columna con ID's y es necesario construirla. Desafortunadamente no aparecen todas las covariantes en otras tablas por lo que se obtuvo una tabla llena de nulos y por esa razón no se sometió a Kaggle.

### Flask

In [100]:
model_path = h2o.save_model(model=best_gbm, path="/home/lorena/Documents/mineria/proyecto", force=True)

In [101]:
model_path

'/home/lorena/Documents/mineria/proyecto/GBM_model_python_1545331842912_2112'

In [104]:
predictors = ["city", "bd", "gender", "registered_via", "registered_init_time", "date", "num_25", "num_50", "num_75", "num_985", "num_100", "num_unq", "total_secs","payment_method_id", "payment_plan_days", "plan_list_price", "actual_amount_paid", "is_auto_renew", "transaction_date", "membership_expire_date","is_cancel", "discount", "is_discount", "amount_per_day"] 

entrada=h2o.H2OFrame([[6,31,'female' ,9,'2008-04-17T00:00:00.000Z',1.48764e+09,1.26667,0.466667,0.333333,1.86667 ,281.667 ,15 ,48663 ,36,30,180,180,1 ,1.4844e+09 ,'nan',0,0,0,6 ]],column_names=predictors )

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [102]:
model = h2o.load_model('/home/lorena/Documents/mineria/proyecto/GBM_model_python_1545331842912_2112')

In [105]:
model.predict(entrada)

gbm prediction progress: |████████████████████████████████████████████████| 100%


predict,p0,p1
0,0.995919,0.00408149




In [25]:
list(model.predict(entrada).as_data_frame().p1)

gbm prediction progress: |████████████████████████████████████████████████| 100%


[0.07019678105532251, 0.07019678105532251]

### Comentarios Finales

La pregunta que queríamos contestar es ¿qué clientes no van a renovar el servicio de KKBox? ¿qué caracteristicas tienen los clientes que no van a renovar la suscripción?

La respuesta se puede dar viendo las variables más importantes del modelo que se utiliza para predecir.

In [161]:
spotify_gbm.varimp

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1545267303907_1544


ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.025181063719013828
RMSE: 0.15868542377614217
LogLoss: 0.08911879358469388
Mean Per-Class Error: 0.07788901383613367
AUC: 0.9761390246657221
pr_auc: 0.8161295616919984
Gini: 0.9522780493314442
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.38229947354839167: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,173314.0,2925.0,0.0166,(2925.0/176239.0)
1,3352.0,8997.0,0.2714,(3352.0/12349.0)
Total,176666.0,11922.0,0.0333,(6277.0/188588.0)


Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.3822995,0.7413786,197.0
max f2,0.2157611,0.7731338,266.0
max f0point5,0.5643941,0.7811257,137.0
max accuracy,0.4775978,0.9675589,164.0
max precision,0.9801671,1.0,0.0
max recall,0.0027364,1.0,399.0
max specificity,0.9801671,1.0,0.0
max absolute_mcc,0.3921909,0.7238143,193.0
max min_per_class_accuracy,0.0996866,0.9173452,323.0


Gains/Lift Table: Avg response rate:  6.55 %, avg score:  6.56 %



0,1,2,3,4,5,6,7,8,9,10,11,12,13
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100165,0.8487063,15.2149288,15.2149288,0.9962943,0.9044424,0.9962943,0.9044424,0.1524010,0.1524010,1421.4928834,1421.4928834
,2,0.0200013,0.7650320,14.6713646,14.9435790,0.9607010,0.8075324,0.9785260,0.8560645,0.1464896,0.2988906,1367.1364636,1394.3579050
,3,0.0300019,0.6876133,13.0204688,14.3025423,0.8525981,0.7277271,0.9365500,0.8132854,0.1302130,0.4291036,1202.0468769,1330.2542290
,4,0.0400025,0.6197091,10.7694176,13.4192611,0.7051962,0.6550395,0.8787116,0.7737239,0.1077010,0.5368046,976.9417576,1241.9261111
,5,0.0500032,0.5075914,9.4333620,12.6220813,0.6177094,0.5643313,0.8265111,0.7318454,0.0943396,0.6311442,843.3362012,1162.2081291
,6,0.1000011,0.1928804,4.4151197,8.5188181,0.2891081,0.3191041,0.5578239,0.5254857,0.2207466,0.8518908,341.5119675,751.8818070
,7,0.1500520,0.0886558,1.6373322,6.2234493,0.1072147,0.1264424,0.4075200,0.3923819,0.0819500,0.9338408,63.7332154,522.3449296
,8,0.2000021,0.0285082,0.7262888,4.8505437,0.0475584,0.0672536,0.3176202,0.3111817,0.0362782,0.9701190,-27.3711153,385.0543749
,9,0.3138217,0.0052024,0.1785768,3.1560746,0.0116935,0.0092503,0.2066641,0.2016746,0.0203255,0.9904446,-82.1423177,215.6074559




ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.029329559515360582
RMSE: 0.17125875018626224
LogLoss: 0.09949140668682328
Mean Per-Class Error: 0.08128073390842139
AUC: 0.9711340222032976
pr_auc: 0.7414389196720271
Gini: 0.9422680444065952
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3363429546383909: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,42847.0,1108.0,0.0252,(1108.0/43955.0)
1,876.0,2232.0,0.2819,(876.0/3108.0)
Total,43723.0,3340.0,0.0422,(1984.0/47063.0)


Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.3363430,0.6923077,210.0
max f2,0.1441327,0.7580483,298.0
max f0point5,0.5196911,0.7197339,148.0
max accuracy,0.4651229,0.9605210,166.0
max precision,0.9809284,1.0,0.0
max recall,0.0027333,1.0,399.0
max specificity,0.9809284,1.0,0.0
max absolute_mcc,0.3179044,0.6704092,219.0
max min_per_class_accuracy,0.0996309,0.9140927,322.0


Gains/Lift Table: Avg response rate:  6.60 %, avg score:  6.61 %



0,1,2,3,4,5,6,7,8,9,10,11,12,13
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100291,0.8363720,14.0196779,14.0196779,0.9258475,0.8943177,0.9258475,0.8943177,0.1406049,0.1406049,1301.9677895,1301.9677895
,2,0.0200157,0.7527209,12.7906097,13.4064485,0.8446809,0.7959589,0.8853503,0.8452427,0.1277349,0.2683398,1179.0609683,1240.6448532
,3,0.0300023,0.6812475,11.2763561,12.6974234,0.7446809,0.7179019,0.8385269,0.8028559,0.1126126,0.3809524,1027.6356143,1169.7423445
,4,0.0400102,0.6167981,10.0628738,12.0384363,0.6645435,0.6517409,0.7950080,0.7650571,0.1007079,0.4816602,906.2873838,1103.8436263
,5,0.0500181,0.5056034,9.6449270,11.5595311,0.6369427,0.5612498,0.7633815,0.7242783,0.0965251,0.5781853,864.4927002,1055.9531054
,6,0.1000149,0.2001066,4.9745771,8.2677535,0.3285168,0.3286306,0.5459953,0.5264965,0.2487130,0.8268983,397.4577075,726.7753550
,7,0.1500329,0.0906013,1.9812663,6.1719610,0.1308411,0.1314707,0.4075910,0.3948026,0.0990991,0.9259974,98.1266313,517.1961034
,8,0.2000085,0.0303028,0.8498362,4.8421366,0.0561224,0.0694524,0.3197705,0.3135083,0.0424710,0.9684685,-15.0163830,384.2136570
,9,0.3136009,0.0052024,0.1869449,3.1559346,0.0123457,0.0094826,0.2084152,0.2033842,0.0212355,0.9897040,-81.3055119,215.5934607



Scoring History: 


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
,2018-12-19 21:12:18,0.017 sec,0.0,0.2473733,0.2417906,0.5,0.0,1.0,0.9345186,0.2483511,0.2432733,0.5,0.0,1.0,0.9339609
,2018-12-19 21:12:18,0.249 sec,1.0,0.2314424,0.1992102,0.9557214,0.6486054,13.1375388,0.0502630,0.2321319,0.1998287,0.9561146,0.6554836,13.0163575,0.0481482
,2018-12-19 21:12:18,0.522 sec,2.0,0.2218755,0.1811407,0.9610447,0.6807193,13.2038340,0.0484548,0.2224980,0.1817207,0.9613900,0.6634200,12.9191338,0.0479995
,2018-12-19 21:12:19,0.738 sec,3.0,0.2141779,0.1684220,0.9617178,0.6868675,13.2721161,0.0490222,0.2147524,0.1689717,0.9619775,0.6685359,13.4154845,0.0488494
,2018-12-19 21:12:19,0.983 sec,4.0,0.2074436,0.1580406,0.9633711,0.6982998,13.8665082,0.0483647,0.2079692,0.1585502,0.9635987,0.6934135,13.5671973,0.0450460
,2018-12-19 21:12:19,1.192 sec,5.0,0.2016971,0.1496734,0.9649577,0.7067961,13.7310369,0.0449498,0.2021624,0.1501301,0.9651622,0.7014525,13.6630923,0.0446423
,2018-12-19 21:12:19,1.405 sec,6.0,0.1969132,0.1429133,0.9654358,0.7132042,13.7205626,0.0448968,0.1973880,0.1434040,0.9655918,0.7091710,13.5866052,0.0432187
,2018-12-19 21:12:19,1.615 sec,7.0,0.1929787,0.1373164,0.9657791,0.7205447,13.9847305,0.0445203,0.1935450,0.1379228,0.9658551,0.7145968,13.8007918,0.0438986
,2018-12-19 21:12:20,1.836 sec,8.0,0.1896627,0.1325676,0.9669842,0.7237647,14.0002428,0.0427387,0.1903160,0.1332676,0.9671021,0.7175398,13.8673745,0.0421775


Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
date,7605.6538086,1.0,0.2430066
transaction_date,6338.9101562,0.8334471,0.2025332
is_auto_renew,5606.2451172,0.7371155,0.1791240
is_cancel,5143.9160156,0.6763279,0.1643522
registered_init_time,3349.7875977,0.4404339,0.1070284
plan_list_price,1762.1883545,0.2316945,0.0563033
actual_amount_paid,660.5139160,0.0868451,0.0211039
payment_plan_days,328.7272339,0.0432214,0.0105031
payment_method_id,212.2201538,0.0279029,0.0067806


<bound method ModelBase.varimp of >

Podemos ver como todas estas variables estan presentes en la tabla de "user_log". Sin embargo, cuando estas variables no aparecen en una observación es más complicado predecir si un usuario va a renovar o no su suscripción.

Se tiene una herramienta para la predicción de si un cliente va a abandonar o no el servicio.

Para contestar la pregunta de que debe de hacer KKBox para que los clientes renueven su suscripción es importante ver las variables explicativas. Si los clientes utilizan el servicio, entonces no cancelan su suscripción. Por lo tanto KKBox debe de incentivar tener usuarios activos ya que estos son más propensos a renovar suscripción.