### Diseño de Prueba del Modelo

Usaremos varios tipos de modelos distintos para poder predecir la probabilidad de cancelación de la suscripción. Usaremos un random forest, gradient Boosting Machine. Para ello usamos la librería de H2o.

La evaluación de cualquiera de nuestros modelos es sometida a un conjunto de prueba. Los conjuntos de entrenamieno y prueba se construyeron a partir de la base de datos y se respetó la partición que aparece en los lineamientos del concurso. De hecho, ellos divieron a partir de Abril 2017, antes de esta fechha se consideraron datos de entrenamiento y después datos de prueba. Los datos de validación se construyeron de forma aleatoria la prueba de entrenamiento en una proporción del 20%.

Usaremos como medida de error la Log Loss, pues estamos en un problema de clasificación binaria, que está dada por la siguiente fórmula:


\begin{align}
\log loss = - \frac{1}{N} \sum_{i=1}^n y_i \log(p_i) + (1-y_i) \log(1 - p_i)
\end{align}

También usamos el AUC como medida de error, que es el área bajo la curva ROC. Esto nos da un resumen de qué tan bueno es el predictor sin importar donde ponemos el punto de corte.

In [82]:
import warnings
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch
import numpy as np
import pandas as pd

Cargamos los datos de entrenamiento de la carpeta de AWS.

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_152-release"; OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12); OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)
  Starting server from /home/lorena/anaconda3/envs/for_spark/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpl1f19406
  JVM stdout: /tmp/tmpl1f19406/h2o_lorena_started_from_python.out
  JVM stderr: /tmp/tmpl1f19406/h2o_lorena_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


0,1
H2O cluster uptime:,01 secs
H2O cluster timezone:,America/Mexico_City
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.0.2
H2O cluster version age:,27 days
H2O cluster name:,H2O_from_python_lorena_hk1sm5
H2O cluster total nodes:,1
H2O cluster free memory:,1.688 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


In [24]:
#spotify = h2o.import_file("s3://proyectomineria/data/consolidated_train_table/part-00000-acf23e82-8c41-458f-9399-57e2a260de4b-c000.csv")
#spotify = h2o.import_file("/home/toto/Desktop/3er_Semestre/Mineria/Proyecto/kkbox_churn_prediction/data/part-00000-acf23e82-8c41-458f-9399-57e2a260de4b-c000.csv")
spotify = h2o.import_file("/home/lorena/Documents/mineria/proyecto/part-00000-acf23e82-8c41-458f-9399-57e2a260de4b-c000(1).csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


Convertimos el tipo de variables a factores y mostramos las primeras líneas del conjunto de entrenamiento.

In [25]:
spotify["is_churn"] = spotify["is_churn"].asfactor()
spotify["city"] = spotify["city"].asfactor()
spotify["gender"] = spotify["gender"].asfactor()

In [26]:
spotify

msno,is_churn,city,bd,gender,registered_via,registered_init_time,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0,1,0,,7,2014-07-14T00:00:00.000Z,1483980000.0,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,41.0,30,149.0,149.0,1,1481620000.0,,0.0
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0,15,31,male,9,2006-06-03T00:00:00.000Z,1487550000.0,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,34.0,30,149.0,149.0,1,1483110000.0,,0.0
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0,9,31,male,9,2004-03-30T00:00:00.000Z,1487640000.0,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,34.0,30,149.0,149.0,1,1483110000.0,,0.0
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0,15,29,male,9,2008-03-22T00:00:00.000Z,1487510000.0,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,34.0,30,149.0,149.0,1,1483110000.0,,0.0
+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=,1,13,29,female,3,2012-06-12T00:00:00.000Z,1486150000.0,2.4,0.866667,0.866667,1.33333,18.2,20.5333,4813.96,32.0,410,1788.0,1788.0,0,1452730000.0,,0.0
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0,5,24,female,9,2014-03-20T00:00:00.000Z,1487390000.0,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,23.0,30,149.0,149.0,1,1480500000.0,,0.0
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0,13,32,male,3,2015-03-16T00:00:00.000Z,1487430000.0,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,37.0,30,149.0,149.0,1,1482570000.0,,0.0
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0,5,0,,3,2013-02-27T00:00:00.000Z,1486560000.0,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,40.0,30,149.0,149.0,1,1483020000.0,,0.2
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0,6,31,female,9,2008-04-17T00:00:00.000Z,1487640000.0,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,36.4,30,167.6,167.6,1,1481310000.0,,0.2
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,0,1,0,,7,2012-12-17T00:00:00.000Z,1487190000.0,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,41.0,30,99.0,99.0,1,1479380000.0,,0.0




Especificamos cuáles son los predictores y cuál es la variable de respuesta.

## Ingeniería de Características
En esta sección agregamos variables manualmente, considerando po el contexto cuales variables tienen sentido y son interpretables. Podemos agregar una variable de descuentos a partir de cuánto cuesta el producto menos la cantidad que se pagó realmente. También agregamos una variable binaria si el cliente tiene o no descuento.

In [112]:
spotify["discount"] = spotify["plan_list_price"] - spotify["actual_amount_paid"]
spotify["is_discount"]=spotify["discount"]>0
spotify["amount_per_day"]=spotify["plan_list_price"]/spotify["payment_plan_days"]
spotify["bd"]=(spotify["bd"]<=0 or spotify["bd"]>100).ifelse(np.nan,spotify["bd"])

In [113]:
spotify

msno,is_churn,city,bd,gender,registered_via,registered_init_time,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel,discount,is_discount,amount_per_day
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0,1,,,7,2014-07-14T00:00:00.000Z,1483980000.0,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,41.0,30,149.0,149.0,1,1481620000.0,,0.0,0,0,4.96667
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0,15,31.0,male,9,2006-06-03T00:00:00.000Z,1487550000.0,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,34.0,30,149.0,149.0,1,1483110000.0,,0.0,0,0,4.96667
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0,9,31.0,male,9,2004-03-30T00:00:00.000Z,1487640000.0,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,34.0,30,149.0,149.0,1,1483110000.0,,0.0,0,0,4.96667
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0,15,29.0,male,9,2008-03-22T00:00:00.000Z,1487510000.0,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,34.0,30,149.0,149.0,1,1483110000.0,,0.0,0,0,4.96667
+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=,1,13,29.0,female,3,2012-06-12T00:00:00.000Z,1486150000.0,2.4,0.866667,0.866667,1.33333,18.2,20.5333,4813.96,32.0,410,1788.0,1788.0,0,1452730000.0,,0.0,0,0,4.36098
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0,5,24.0,female,9,2014-03-20T00:00:00.000Z,1487390000.0,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,23.0,30,149.0,149.0,1,1480500000.0,,0.0,0,0,4.96667
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0,13,32.0,male,3,2015-03-16T00:00:00.000Z,1487430000.0,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,37.0,30,149.0,149.0,1,1482570000.0,,0.0,0,0,4.96667
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0,5,,,3,2013-02-27T00:00:00.000Z,1486560000.0,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,40.0,30,149.0,149.0,1,1483020000.0,,0.2,0,0,4.96667
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0,6,31.0,female,9,2008-04-17T00:00:00.000Z,1487640000.0,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,36.4,30,167.6,167.6,1,1481310000.0,,0.2,0,0,5.58667
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,0,1,,,7,2012-12-17T00:00:00.000Z,1487190000.0,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,41.0,30,99.0,99.0,1,1479380000.0,,0.0,0,0,3.3




In [29]:
predictors = ["city", "gender", "registered_via", "registered_init_time", "date", "num_25", "num_50", "num_75", "num_985", "num_100", "num_unq", "total_secs","payment_method_id", "payment_plan_days", "plan_list_price", "actual_amount_paid", "is_auto_renew", "transaction_date", "is_cancel"] 
response = "is_churn"

Hacemos la separación para el conjunto de entrenamiento y validación, 80% y 20% respectivamente.

In [30]:
train, valid = spotify.split_frame(ratios = [0.8], seed=1234)

## Modelos

### Gradient Boosting Machine

La idea de gradient boosting (GBM) es replicar la idea del residual en regresión, y usar
árboles de regresión. Es una técnica directa de un método ensamblado. La heurística nos dice que podemos obtener resultados muy buenos a partir de muchas aproximaciones burdas. Se construyen árboles en paralelo, y en cierto sentido este modelo aprende de los errores de los árboles generados en iteraciones previas.

A continuación mostramos la fase de entrenamiento usando la librería de H2o. Notese que usamos el conjunto de validación.

In [31]:
bin_num = [8,16,32,64,128,256,512,1024,2048,4096]
label = ["8","16","32","64","128","256","512","1024","2048","4096"]

In [114]:
df=pd.DataFrame(index=range(1,len(bin_num)),columns=['bin_num','training_score','validation_score'])
for key, num in enumerate(bin_num):
    spotify_gbm = H2OGradientBoostingEstimator(nbins_cats = num, seed=1234)
    spotify_gbm.train(x=predictors, y=response, training_frame=train, validation_frame=valid)
    df.loc[key]=[num, spotify_gbm.auc(train=True),spotify_gbm.auc(valid=True)]
    #print(label[key], 'training score', spotify_gbm.auc(train=True))
    #print(label[key], 'validation score', spotify_gbm.auc(valid=True))

gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


Se muestra el AUC, esto nos dice que nuestro predictor es extremadamente bueno. Habrá que verificar con los datos de prueba para validar que en realidad no se sobreajustó. 

In [34]:
print(label[key], 'training score', spotify_gbm.auc(train=True))
print(label[key], 'validation score', spotify_gbm.auc(valid=True))

4096 training score 0.9761939106469716
4096 validation score 0.9711941449713034


In [35]:
print(df[df['training_score']==df['training_score'].max()])
print(df[df['validation_score']==df['validation_score'].max()])

     bin_num training_score validation_score
4096       9       0.976194         0.971194
   bin_num training_score validation_score
16       1       0.972592         0.971782


In [68]:
df

Unnamed: 0,bin_num,training_score,validation_score
1,16.0,0.565955,0.524743
2,32.0,0.565955,0.524743
3,64.0,0.565955,0.524743
4,128.0,0.565955,0.524743
5,256.0,0.565955,0.524743
6,512.0,0.565955,0.524743
7,1024.0,0.565955,0.524743
8,2048.0,0.565955,0.524743
9,4096.0,0.565955,0.524743
0,8.0,0.565955,0.524743


Todos los modelos tienen un desempeño muy similar, el modelo con el mejor desempeño en el conjunto de entrenamiento es el último, pero su desempeño en el conjunto de validación es menor, lo cual quiere decir que está empezando a sobre ajustar.

Aquí se hace una predicción de los mismos datos con los que se ajustó.

In [37]:
final_gbm_predictions = spotify_gbm.predict(valid[1:])

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [38]:
final_gbm_predictions[:]

predict,p0,p1
0,0.996101,0.00389887
0,0.995024,0.00497576
0,0.994554,0.00544602
0,0.96348,0.0365199
1,0.225799,0.774201
0,0.997099,0.00290078
0,0.9071,0.0929004
0,0.80014,0.19986
0,0.996138,0.00386171
0,0.996924,0.00307643




### Gradient Boosting Machine reduced Model

Se hace exactamente el mismo proceso usando todas las covariables que se extrajeron a partir de la tabla de logs.

In [58]:
spotify_logs=spotify[:,["msno","num_25","num_50","num_75","num_985","num_100","num_unq","total_secs","is_churn"]]

In [59]:
spotify_logs

msno,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs,is_churn
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,0
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,0
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,0
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,0
+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=,2.4,0.866667,0.866667,1.33333,18.2,20.5333,4813.96,1
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,0
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,0
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,0
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,0
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,0




In [62]:
train_logs, valid_logs = spotify_logs.split_frame(ratios = [0.8], seed=1234)

In [55]:
bin_num1 = [8,16,32,64,128,256,512,1024,2048,4096]
label1 = ["8","16","32","64","128","256","512","1024","2048","4096"]

In [60]:
predictors = ["num_25", "num_50", "num_75", "num_985", "num_100", "num_unq", "total_secs"] 
response = "is_churn"

In [63]:
df_logs=pd.DataFrame(index=range(1,len(bin_num)),columns=['bin_num','training_score','validation_score'])
for key, num in enumerate(bin_num1):
    spotify_gbm1 = H2OGradientBoostingEstimator(nbins_cats = num, seed=1234)
    spotify_gbm1.train(x=predictors, y=response, training_frame=train_logs, validation_frame=valid_logs)
    df_logs.loc[key]=[num, spotify_gbm.auc(train=True),spotify_gbm.auc(valid=True)]

gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [65]:
df_logs

Unnamed: 0,bin_num,training_score,validation_score
1,16.0,0.976194,0.971194
2,32.0,0.976194,0.971194
3,64.0,0.976194,0.971194
4,128.0,0.976194,0.971194
5,256.0,0.976194,0.971194
6,512.0,0.976194,0.971194
7,1024.0,0.976194,0.971194
8,2048.0,0.976194,0.971194
9,4096.0,0.976194,0.971194
0,8.0,0.976194,0.971194


Este modelo tiene casi el mismo desempeño que el anterior. Lo cual sugiere que las variables que más explican son las que se encuentran en la tabla de user_log

### Distributed Random Forest

El segundo modelo con el que se evaluó fue un modelo de Random Forest (DRF). Es un método muy poderoso para la clasificacion.

Distributed Random Forest (DRF) is a powerful classification and regression tool. When given a set of data, DRF generates a forest of classification or regression trees, rather than a single classification or regression tree. Each of these trees is a weak learner built on a subset of rows and columns. More trees will reduce the variance. Both classification and regression take the average prediction over all of their trees to make a final prediction, whether predicting for a class or numeric value.

In [69]:
rf_v1 = H2ORandomForestEstimator(
    model_id="rf_covType_v1",
    ntrees=200,
    stopping_rounds=2,
    score_each_iteration=True,
seed=1000000)

In [70]:
rf_v1.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [71]:
rf_v1.score_history()

Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
0,,2018-12-19 16:53:49,0.011 sec,0.0,,,,,,,,,,,,
1,,2018-12-19 16:53:50,0.617 sec,1.0,0.269186,0.868662,0.513917,0.067324,0.994387,0.669054,0.269384,0.897111,0.506091,0.065404,0.914862,0.933961
2,,2018-12-19 16:53:51,1.252 sec,2.0,0.266339,0.795675,0.510733,0.066931,1.027813,0.759736,0.260098,0.413837,0.510568,0.067228,1.280553,0.808385
3,,2018-12-19 16:53:51,1.691 sec,3.0,0.263564,0.678781,0.509434,0.066906,1.020261,0.776457,0.255662,0.308066,0.509762,0.067107,1.187021,0.889467
4,,2018-12-19 16:53:52,2.093 sec,4.0,0.263017,0.637131,0.510677,0.067274,1.029005,0.934384,0.254603,0.292731,0.506109,0.066037,0.928401,0.895311
5,,2018-12-19 16:53:52,2.469 sec,5.0,0.261161,0.559632,0.511625,0.067604,1.075853,0.786831,0.253422,0.280011,0.506834,0.066495,0.996642,0.878461
6,,2018-12-19 16:53:52,2.898 sec,6.0,0.259759,0.504029,0.511247,0.067405,1.016064,0.776067,0.252892,0.273616,0.504515,0.065993,0.771594,0.892357
7,,2018-12-19 16:53:53,3.296 sec,7.0,0.258784,0.46317,0.510229,0.067011,0.963579,0.820876,0.252512,0.270614,0.502117,0.06563,0.642995,0.900304
8,,2018-12-19 16:53:53,3.693 sec,8.0,0.257628,0.426994,0.509276,0.066802,1.028358,0.850921,0.252047,0.267808,0.503804,0.066297,0.771972,0.885196
9,,2018-12-19 16:53:54,4.099 sec,9.0,0.256217,0.385987,0.509921,0.066836,0.995969,0.779375,0.251595,0.264298,0.501874,0.066045,0.868043,0.872936


Hicimos otro bosque ... PONER RAZONAMIENTO

In [72]:
rf_v2 = H2ORandomForestEstimator(
    model_id="rf_covType_v2",
    ntrees=200,
    max_depth=30,
    stopping_rounds=2,
    stopping_tolerance=0.01,
    score_each_iteration=True,
seed=1234)

In [73]:
rf_v2.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [74]:
final_rf_predictions = rf_v2.predict(valid[1:])

drf prediction progress: |████████████████████████████████████████████████| 100%


In [75]:
final_rf_predictions

predict,p0,p1
1,0.97136,0.02864
1,0.932399,0.0676007
1,0.848484,0.151516
1,0.985223,0.0147775
1,0.815392,0.184608
1,0.945014,0.0549863
1,0.911694,0.0883059
1,0.961508,0.0384919
1,0.885579,0.114421
1,0.954337,0.0456633




In [76]:
print('training score', rf_v2.auc(train=True))
print('validation score', rf_v2.auc(valid=True))

training score 0.5055928356865304
validation score 0.5126451719444552


In [77]:
print('training score', rf_v1.auc(train=True))
print('validation score', rf_v1.auc(valid=True))

training score 0.5071932725488784
validation score 0.5104037166828658


In [79]:
print('training score', rf_v1.logloss(train=True))
print('validation score', rf_v1.logloss(valid=True))

training score 0.2579732998808618
validation score 0.2474873861906916


### Ajuste de hiperparámetros óptimo.

Nos quedaremos con el modelo de Gradient Boosting porque tiene menor Log Loss. En esta sección se ajustan los hiperámetros del modelo. Será imporante usar los datos de validación.

In [80]:
print('validation score RF', rf_v1.logloss(valid=True))
print('validation score GBM', spotify_gbm.logloss(valid=True))

validation score RF 0.2474873861906916
validation score GBM 0.24325090576825645


In [None]:
predictors = ["city", "gender", "registered_via", "registered_init_time", "date", "num_25", "num_50", "num_75", "num_985", "num_100", "num_unq", "total_secs","payment_method_id", "payment_plan_days", "plan_list_price", "actual_amount_paid", "is_auto_renew", "transaction_date", "is_cancel"] 
response = "is_churn"

In [110]:
gbm_params1 = {#'learn_rate': [0.01, 0.1]}#,
                'max_depth': [3, 5, 9]}#,
                #'sample_rate': [0.8, 1.0],
                #'col_sample_rate': [0.2, 0.5, 1.0]}
gbm_params2 = {'learn_rate': [i * 0.01 for i in range(1, 11)],
                'max_depth': [i for i in range(2, 11)]},
                #'sample_rate': [i * 0.1 for i in range(5, 11)]}#,
               # 'col_sample_rate': [i * 0.1 for i in range(1, 11)]}

# Search criteria
#search_criteria = {'strategy': 'RandomDiscrete', 'max_models': 36, 'seed': 1}

# Train and validate a random grid of GBMs
gbm_grid2 = H2OGridSearch(model=H2OGradientBoostingEstimator,
                          grid_id='gbm_grid2',
                          hyper_params=gbm_params1)#,
                          #search_criteria=search_criteria)


In [111]:
gbm_grid2.train(x=predictors, y=response, training_frame=train, validation_frame=valid, seed=42)

gbm Grid Build progress: |████████████████████████████████████████████████| 100%

Hyper-parameter: col_sample_rate, 0.5
Hyper-parameter: learn_rate, 0.01
Hyper-parameter: max_depth, 5
Hyper-parameter: sample_rate, 1.0
failure_details: None
failure_stack_traces: water.Job$JobCancelledException
	at hex.tree.SharedTree$Driver.scoreAndBuildTrees(SharedTree.java:450)
	at hex.tree.SharedTree$Driver.computeImpl(SharedTree.java:360)
	at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:215)
	at hex.ModelBuilder.trainModelNested(ModelBuilder.java:329)
	at hex.grid.GridSearch.startBuildModel(GridSearch.java:360)
	at hex.grid.GridSearch.buildModel(GridSearch.java:342)
	at hex.grid.GridSearch.gridSearch(GridSearch.java:220)
	at hex.grid.GridSearch.access$000(GridSearch.java:70)
	at hex.grid.GridSearch$1.compute2(GridSearch.java:137)
	at water.H2O$H2OCountedCompleter.compute(H2O.java:1310)
	at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
	at jsr166y.ForkJoinTask.doExec(ForkJoinTask.jav

TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

In [None]:
# Get the grid results, sorted by validation AUC
gbm_gridperf2 = gbm_grid2.get_grid(sort_by='auc', decreasing=True)
gbm_gridperf2

# Grab the top GBM model, chosen by validation AUC
best_gbm2 = gbm_gridperf2.models[0]

# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance
best_gbm_perf2 = best_gbm2.model_performance(test)

best_gbm_perf2.auc()  #0.7811331652127048


## Evaluación

Summarize assessment results in terms of business success criteria, including a final statement regarding
whether the project already meets the initial business objectives.




### Reentrenamiento final de modelo con datos de entrenamiento y prueba con hiperparámetros optimizados.
Se mejora el ajuste de hiperparámetros para este modelo usando todos los datos.

In [55]:
spotify_test = h2o.import_file("s3://proyectomineria/data/resumen_final_test/part-00000-326c4568-e87c-4af5-9c77-6ee2aa5d17ae-c000.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [56]:
spotify_test

avg_num_unq,date,bd,payment_plan_days,city,avg_num_50,registered_init_time,msno,avg_num_75,plan_list_price,actual_amount_paid,avg_num_25,avg_num_100,membership_expire_date,is_churn,is_auto_renew,payment_method_id,registered_via,avg_num_985,gender,total_secs,is_cancel,transaction_date
13.0667,1483980000.0,0,30,1,0.933333,2014-07-14T00:00:00.000Z,++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0.733333,149,149,5.73333,6.46667,,0,1.0,41,7,0.666667,,1978.66,0,1485600000.0
61.1333,1487550000.0,31,30,15,1.26667,2006-06-03T00:00:00.000Z,+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,1.4,149,149,29.1333,33.6,,0,1.0,34,9,1.26667,male,9395.27,0,1487030000.0
19.8667,1487640000.0,31,30,9,1.86667,2004-03-30T00:00:00.000Z,+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,1.06667,149,149,12.4667,67.8,,0,1.0,34,9,3.26667,male,17219.0,0,1487030000.0
27.8,1487510000.0,29,30,15,0.6,2008-03-22T00:00:00.000Z,+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0.333333,149,149,2.33333,33.0,,0,1.0,34,9,0.666667,male,8571.42,0,1487030000.0
30.9333,1487390000.0,24,30,5,4.4,2014-03-20T00:00:00.000Z,+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0.933333,149,149,10.7333,19.6,,0,1.0,23,9,0.533333,female,5353.78,0,1484480000.0
46.6667,1487430000.0,32,30,13,1.6,2015-03-16T00:00:00.000Z,+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,1.66667,149,149,12.3333,40.6,,0,1.0,37,3,0.933333,male,10597.8,0,1486560000.0
20.6667,1486560000.0,0,30,5,0.4,2013-02-27T00:00:00.000Z,+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0.333333,149,149,1.0,21.5333,,0,1.0,40,3,0.133333,,5413.59,0,1486470000.0
15.0,1487640000.0,31,30,6,0.466667,2008-04-17T00:00:00.000Z,+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0.333333,180,180,1.26667,281.667,,0,1.0,36,9,1.86667,female,48663.0,0,1484400000.0
18.6667,1486030000.0,0,30,4,1.66667,2015-11-03T00:00:00.000Z,+2KZws+cYLzerLNA6dgCOpxKysRv4BQ8SiKtA0rV4QE=,0.866667,180,180,5.2,13.4667,,0,0.5,29,7,0.466667,,3910.75,0,1485000000.0
23.2667,1487190000.0,0,30,1,2.73333,2012-12-17T00:00:00.000Z,+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,1.33333,99,99,8.93333,11.4,,0,1.0,41,7,3.0,,3921.46,0,1484910000.0




In [57]:
spotify_test = spotify_test[:, ["msno","is_churn","city","bd","gender","registered_via","registered_init_time","date","avg_num_25","avg_num_50","avg_num_75","avg_num_985","avg_num_100","avg_num_unq","total_secs","payment_method_id","payment_plan_days","plan_list_price","actual_amount_paid","is_auto_renew","transaction_date","membership_expire_date","is_cancel"]]

In [58]:
spotify_test

msno,is_churn,city,bd,gender,registered_via,registered_init_time,date,avg_num_25,avg_num_50,avg_num_75,avg_num_985,avg_num_100,avg_num_unq,total_secs,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0,1,0,,7,2014-07-14T00:00:00.000Z,1483980000.0,5.73333,0.933333,0.733333,0.666667,6.46667,13.0667,1978.66,41,30,149,149,1.0,1485600000.0,,0
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0,15,31,male,9,2006-06-03T00:00:00.000Z,1487550000.0,29.1333,1.26667,1.4,1.26667,33.6,61.1333,9395.27,34,30,149,149,1.0,1487030000.0,,0
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0,9,31,male,9,2004-03-30T00:00:00.000Z,1487640000.0,12.4667,1.86667,1.06667,3.26667,67.8,19.8667,17219.0,34,30,149,149,1.0,1487030000.0,,0
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0,15,29,male,9,2008-03-22T00:00:00.000Z,1487510000.0,2.33333,0.6,0.333333,0.666667,33.0,27.8,8571.42,34,30,149,149,1.0,1487030000.0,,0
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0,5,24,female,9,2014-03-20T00:00:00.000Z,1487390000.0,10.7333,4.4,0.933333,0.533333,19.6,30.9333,5353.78,23,30,149,149,1.0,1484480000.0,,0
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0,13,32,male,3,2015-03-16T00:00:00.000Z,1487430000.0,12.3333,1.6,1.66667,0.933333,40.6,46.6667,10597.8,37,30,149,149,1.0,1486560000.0,,0
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0,5,0,,3,2013-02-27T00:00:00.000Z,1486560000.0,1.0,0.4,0.333333,0.133333,21.5333,20.6667,5413.59,40,30,149,149,1.0,1486470000.0,,0
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0,6,31,female,9,2008-04-17T00:00:00.000Z,1487640000.0,1.26667,0.466667,0.333333,1.86667,281.667,15.0,48663.0,36,30,180,180,1.0,1484400000.0,,0
+2KZws+cYLzerLNA6dgCOpxKysRv4BQ8SiKtA0rV4QE=,0,4,0,,7,2015-11-03T00:00:00.000Z,1486030000.0,5.2,1.66667,0.866667,0.466667,13.4667,18.6667,3910.75,29,30,180,180,0.5,1485000000.0,,0
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,0,1,0,,7,2012-12-17T00:00:00.000Z,1487190000.0,8.93333,2.73333,1.33333,3.0,11.4,23.2667,3921.46,41,30,99,99,1.0,1484910000.0,,0




In [59]:
final_gbm_predictions = spotify_gbm.predict(spotify_test[1:])

gbm prediction progress: |████████████████████████████████████████████████| 100%




In [62]:
len(spotify_test[:,:])

196372

In [63]:
msnos_test = spotify_test[:,0]

In [64]:
msnos_test[10,0]

'+2oK/qWmYvAnfNZsVV5pdsJ9n6d/LZn6CdwiJajGZas='

In [65]:
msnos_test["prediction"] = final_gbm_predictions[2]

In [66]:
msnos_test

msno,prediction
++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0.0029779
+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0.00350683
+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0.00350683
+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=,0.00350683
+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0.00388029
+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0.00347604
+0l/WkoOIugT69NYawwewSLZjIJ17kHIpDdWqcp53RI=,0.00416238
+2Df04hr61UUJijMM2xR97gtoQWWDJpnJVKQ7VMYN9o=,0.00612556
+2KZws+cYLzerLNA6dgCOpxKysRv4BQ8SiKtA0rV4QE=,0.0188944
+2eLsQv6T46iKwO+m+r6OFI2X3Oc9dGBMdti2COAe4w=,0.0027343




In [None]:
spotify_test.nrows

In [None]:
spotify_test["msno"].tolist()

### Generación y carga de datos de prueba en Kaggle
Tuvimos pérdidas de datos en el conjunto de prueba. La tabla de prueba, en su forma original solo tiene una columna con ID's y es necesario construirla. Desafortunadamente no aparecen todas las covariantes en otras tablas por lo que se obtuvo una tabla llena de nulos y por esa razón no se sometió a Kaggle.

### Comentarios Finales

La pregunta que queríamos contestar es ¿qué clientes no van a renovar el servicio de KKBox? ¿qué caracteristicas tienen los clientes que no van a renovar la suscripción?

La respuesta se puede dar viendo las variables más importantes del modelo que se utiliza para predecir.

In [102]:
spotify_gbm.varimp

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1545256950576_1699


ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.06089497429265418
RMSE: 0.24676907077803364
LogLoss: 0.2397857158321531
Mean Per-Class Error: 0.45560947894756376
AUC: 0.5659549282143584
pr_auc: 0.0928522650996405
Gini: 0.1319098564287169
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.06704358431411025: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,110775.0,65464.0,0.3715,(65464.0/176239.0)
1,6714.0,5635.0,0.5437,(6714.0/12349.0)
Total,117489.0,71099.0,0.3827,(72178.0/188588.0)


Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.0670436,0.1350542,223.0
max f2,0.0604966,0.2662390,287.0
max f0point5,0.0736968,0.1123804,167.0
max accuracy,0.1814872,0.9346724,36.0
max precision,0.4808032,1.0,0.0
max recall,0.0316515,1.0,393.0
max specificity,0.4808032,1.0,0.0
max absolute_mcc,0.1112265,0.0619533,76.0
max min_per_class_accuracy,0.0661220,0.5347000,232.0


Gains/Lift Table: Avg response rate:  6.55 %, avg score:  6.55 %



0,1,2,3,4,5,6,7,8,9,10,11,12,13
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100006,0.0894150,2.9636142,2.9636142,0.1940615,0.1086091,0.1940615,0.1086091,0.0296380,0.0296380,196.3614160,196.3614160
,2,0.0200119,0.0819983,1.6743669,2.3186489,0.1096398,0.0851074,0.1518283,0.0968520,0.0167625,0.0464005,67.4366860,131.8648897
,3,0.0300284,0.0770477,1.5602982,2.0656868,0.1021705,0.0791912,0.1352640,0.0909609,0.0156288,0.0620293,56.0298228,106.5686790
,4,0.0407343,0.0750146,1.4068859,1.8925393,0.0921248,0.0758205,0.1239261,0.0869817,0.0150619,0.0770913,40.6885940,89.2539313
,5,0.0501410,0.0742256,1.4634489,1.8120395,0.0958286,0.0745747,0.1186548,0.0846541,0.0137663,0.0908576,46.3448925,81.2039488
,6,0.1000435,0.0721321,1.2413891,1.5273948,0.0812879,0.0729218,0.1000159,0.0788019,0.0619483,0.1528059,24.1389095,52.7394825
,7,0.1542834,0.0713672,1.0704546,1.3667527,0.0700948,0.0716756,0.0894968,0.0762966,0.0580614,0.2108673,7.0454572,36.6752749
,8,0.2003521,0.0705897,1.1284894,1.3119668,0.0738950,0.0709097,0.0859094,0.0750579,0.0519880,0.2628553,12.8489389,31.1966806
,9,0.3001092,0.0687494,1.0682677,1.2309605,0.0699516,0.0697729,0.0806050,0.0733012,0.1065673,0.3694226,6.8267702,23.0960547




ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.0616985064809762
RMSE: 0.2483918406086967
LogLoss: 0.24325090576825645
Mean Per-Class Error: 0.47994523400336164
AUC: 0.524742727842489
pr_auc: 0.07005882592746376
Gini: 0.049485455684977975
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.06252534045887229: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,13366.0,30589.0,0.6959,(30589.0/43955.0)
1,825.0,2283.0,0.2654,(825.0/3108.0)
Total,14191.0,32872.0,0.6675,(31414.0/47063.0)


Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.0625253,0.1269038,262.0
max f2,0.0534759,0.2622512,329.0
max f0point5,0.0656056,0.0858150,228.0
max accuracy,0.4171972,0.9339396,0.0
max precision,0.0773244,0.0757790,134.0
max recall,0.0309780,1.0,396.0
max specificity,0.4171972,0.9999772,0.0
max absolute_mcc,0.0632714,0.0210637,254.0
max min_per_class_accuracy,0.0658937,0.5105904,225.0


Gains/Lift Table: Avg response rate:  6.60 %, avg score:  6.55 %



0,1,2,3,4,5,6,7,8,9,10,11,12,13
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100079,0.0884869,0.9323429,0.9323429,0.0615711,0.1127795,0.0615711,0.1127795,0.0093308,0.0093308,-6.7657057,-6.7657057
,2,0.0200157,0.0820528,1.2538405,1.0930917,0.0828025,0.0849303,0.0721868,0.0988549,0.0125483,0.0218790,25.3840510,9.3091727
,3,0.0300023,0.0771877,1.2242901,1.1367626,0.0808511,0.0792748,0.0750708,0.0923374,0.0122265,0.0341055,22.4290096,13.6762572
,4,0.0400102,0.0751958,0.7394444,1.0373803,0.0488323,0.0759944,0.0685077,0.0882495,0.0074003,0.0415058,-26.0555597,3.7380279
,5,0.0510167,0.0742495,1.3154712,1.0973766,0.0868726,0.0746589,0.0724698,0.0853174,0.0144788,0.0559846,31.5471221,9.7376576
,6,0.1001211,0.0721732,1.1073511,1.1022686,0.0731285,0.0729346,0.0727929,0.0792443,0.0543758,0.1103604,10.7351139,10.2268599
,7,0.1548775,0.0713672,1.1399503,1.1155908,0.0752813,0.0716889,0.0736727,0.0765731,0.0624196,0.1727799,13.9950278,11.5590823
,8,0.2000085,0.0706568,1.0693881,1.1051654,0.0706215,0.0709241,0.0729842,0.0752984,0.0482625,0.2210425,6.9388093,10.5165390
,9,0.3001509,0.0688116,0.9767305,1.0623144,0.0645024,0.0697997,0.0701543,0.0734638,0.0978121,0.3188546,-2.3269518,6.2314355



Scoring History: 


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
,2018-12-19 16:53:16,0.009 sec,0.0,0.2473733,0.2417906,0.5,0.0,1.0,0.9345186,0.2483511,0.2432733,0.5,0.0,1.0,0.9339609
,2018-12-19 16:53:16,0.211 sec,1.0,0.2473316,0.2416284,0.5363688,0.0774899,1.7223519,0.7358475,0.2483403,0.2432265,0.5218290,0.0692189,1.0965284,0.7384145
,2018-12-19 16:53:16,0.349 sec,2.0,0.2472989,0.2415037,0.5369854,0.0767684,1.8439060,0.7359111,0.2483348,0.2431994,0.5222600,0.0688040,1.0211729,0.7387119
,2018-12-19 16:53:16,0.486 sec,3.0,0.2472675,0.2413793,0.5423392,0.0783020,1.8554798,0.5591342,0.2483222,0.2431463,0.5275527,0.0699152,1.0182050,0.7033976
,2018-12-19 16:53:16,0.686 sec,4.0,0.2472363,0.2412634,0.5450137,0.0785600,1.9571268,0.5618491,0.2483172,0.2431188,0.5302658,0.0706474,1.0133628,0.6929010
---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---
,2018-12-19 16:53:19,3.233 sec,17.0,0.2469502,0.2403184,0.5587048,0.0874699,2.5765795,0.5381042,0.2483394,0.2431162,0.5267856,0.0705651,1.0158514,0.7109832
,2018-12-19 16:53:19,3.444 sec,18.0,0.2469295,0.2402493,0.5593947,0.0880440,2.6907720,0.5961249,0.2483375,0.2431036,0.5270346,0.0707237,1.0179856,0.6783460
,2018-12-19 16:53:19,3.685 sec,19.0,0.2469136,0.2402074,0.5596111,0.0883336,2.6926719,0.5962362,0.2483428,0.2431156,0.5267281,0.0706208,1.0053136,0.6541019



See the whole table with table.as_data_frame()
Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
num_100,70.3109436,1.0,0.2670194
num_50,42.8118553,0.6088932,0.1625863
num_75,38.6480789,0.5496737,0.1467736
num_unq,35.5148964,0.5051119,0.1348747
num_985,31.9771004,0.4547955,0.1214392
num_25,25.4513607,0.3619829,0.0966565
total_secs,18.6034718,0.2645886,0.0706503


<bound method ModelBase.varimp of >

Podemos ver como todas estas variables estan presentes en la tabla de "user_log". Sin embargo, cuando estas variables no aparecen en una observación es más complicado predecir si un usuario va a renovar o no su suscripción.

Se tiene una herramienta para la predicción de si un cliente va a abandonar o no el servicio.

Para contestar la pregunta de que debe de hacer KKBox para que los clientes renueven su suscripción es importante ver las variables explicativas. Si los clientes utilizan el servicio, entonces no cancelan su suscripción. Por lo tanto KKBox debe de incentivar tener usuarios activos ya que estos son más propensos a renovar suscripción.