In [297]:
!pip install plotly



In [298]:
import pandas as pd
import json
import statistics
import seaborn as sns
import plotly.express as px

# 📌 Extraction

To start my analysis, I will import the data from the Telecom X API. The data is available in JSON format and contains essential customer information, including demographics, contracted services, and churn status.

✅ Loading the data directly from the API using Python.

✅ Normalizing the columns that contain dictionaries.

✅ Converting the data to a Pandas DataFrame for easier manipulation.

In [299]:
with open('../data/TelecomX_Data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)  # parse the JSON straight from the file handle

In [300]:
pd_data = pd.DataFrame(data)

# 🔧 Transformation

## Getting to know the dataset

Now that the data is extracted, it is essential to understand the structure of the dataset and the meaning of its columns. This step helps identify which variables are most relevant for the churn analysis.

📌 To support this step, README.md includes a data dictionary describing each column.

What do I need to do?

✅ Explore the dataset columns and check their data types.

✅ Consult the dictionary to better understand the meaning of each variable.

✅ Identify the columns most relevant to the churn analysis.



In [301]:
pd_data.sample(3)

Unnamed: 0,customerID,Churn,customer,phone,internet,account
6861,9470-YFUYI,No,"{'gender': 'Male', 'SeniorCitizen': 1, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': '...","{'Contract': 'One year', 'PaperlessBilling': '..."
5434,7435-ZNUYY,No,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'No', 'OnlineSecurity': 'N...","{'Contract': 'One year', 'PaperlessBilling': '..."
6722,9254-RBFON,Yes,"{'gender': 'Female', 'SeniorCitizen': 0, 'Part...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'One year', 'PaperlessBilling': '..."


## Normalizing our data

The columns contain dictionaries, so I normalize them to split each key into its own column and store the result in a new DataFrame, which makes the data easier to analyze and compare.
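As a minimal sketch of what the normalization does, the example below flattens a single made-up record with the same nested layout as the sample above. The record and the name `sample_records` are hypothetical and only serve as an illustration; the real data has more keys.

```python
import pandas as pd

# Hypothetical record mimicking the nested layout of the Telecom X data.
sample_records = [
    {
        "customerID": "0000-DEMO",
        "Churn": "No",
        "customer": {"gender": "Female", "tenure": 9},
        "account": {
            "Contract": "One year",
            "Charges": {"Monthly": 65.6, "Total": "593.3"},
        },
    }
]

# json_normalize flattens nested dictionaries into columns,
# joining the nested keys with "." by default.
flat = pd.json_normalize(sample_records)
print(list(flat.columns))
```

Nested keys such as `account` → `Charges` → `Monthly` end up as a single column named `account.Charges.Monthly`, which is why the normalized DataFrame uses dotted column names.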

In [302]:
data_norm = pd.json_normalize(data)

In [303]:
data_norm.sample(3)

Unnamed: 0,customerID,Churn,customer.gender,customer.SeniorCitizen,customer.Partner,customer.Dependents,customer.tenure,phone.PhoneService,phone.MultipleLines,internet.InternetService,...,internet.OnlineBackup,internet.DeviceProtection,internet.TechSupport,internet.StreamingTV,internet.StreamingMovies,account.Contract,account.PaperlessBilling,account.PaymentMethod,account.Charges.Monthly,account.Charges.Total
114,0195-IESCP,Yes,Male,0,Yes,No,10,Yes,Yes,Fiber optic,...,No,No,No,No,Yes,Month-to-month,Yes,Electronic check,85.25,855.3
7015,9659-QEQSY,No,Female,0,No,No,45,Yes,Yes,Fiber optic,...,Yes,Yes,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,115.65,5125.5
3748,5153-RTHKF,No,Female,0,No,No,34,Yes,Yes,Fiber optic,...,No,No,No,Yes,No,Month-to-month,No,Electronic check,85.35,2896.6


Let's look at the data types stored in the columns.

In [304]:
data_norm.dtypes

customerID                    object
Churn                         object
customer.gender               object
customer.SeniorCitizen         int64
customer.Partner              object
customer.Dependents           object
customer.tenure                int64
phone.PhoneService            object
phone.MultipleLines           object
internet.InternetService      object
internet.OnlineSecurity       object
internet.OnlineBackup         object
internet.DeviceProtection     object
internet.TechSupport          object
internet.StreamingTV          object
internet.StreamingMovies      object
account.Contract              object
account.PaperlessBilling      object
account.PaymentMethod         object
account.Charges.Monthly      float64
account.Charges.Total         object
dtype: object

The **account.Charges.Total** column has dtype *object*, so we convert it to *float64*. With `errors='coerce'`, any value that cannot be parsed as a number becomes NaN.
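The following small sketch shows how `pd.to_numeric` behaves with `errors='coerce'` on text values. The three strings are made-up examples, and the blank entry is an assumption about what an unparseable value might look like.

```python
import pandas as pd

# Hypothetical values: two valid amounts and one blank string.
raw_totals = pd.Series(["593.3", " ", "280.85"])

# errors="coerce" replaces anything that cannot be parsed with NaN
# instead of raising an exception.
totals = pd.to_numeric(raw_totals, errors="coerce")
print(totals.dtype, int(totals.isna().sum()))
```

Because coercion silently produces NaN, it is worth counting the missing values afterwards, as the notebook does further down.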

In [305]:
data_norm['account.Charges.Total'] = pd.to_numeric(data_norm['account.Charges.Total'], errors='coerce')

In [306]:
data_norm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7267 entries, 0 to 7266
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   customerID                 7267 non-null   object 
 1   Churn                      7267 non-null   object 
 2   customer.gender            7267 non-null   object 
 3   customer.SeniorCitizen     7267 non-null   int64  
 4   customer.Partner           7267 non-null   object 
 5   customer.Dependents        7267 non-null   object 
 6   customer.tenure            7267 non-null   int64  
 7   phone.PhoneService         7267 non-null   object 
 8   phone.MultipleLines        7267 non-null   object 
 9   internet.InternetService   7267 non-null   object 
 10  internet.OnlineSecurity    7267 non-null   object 
 11  internet.OnlineBackup      7267 non-null   object 
 12  internet.DeviceProtection  7267 non-null   object 
 13  internet.TechSupport       7267 non-null   objec

Our DataFrame has 7267 rows and 21 columns.

In [307]:
data_norm.shape

(7267, 21)

In total we have 152607 values (7267 × 21).

In [308]:
data_norm.size

152607

Here is the list of columns.

In [309]:
data_norm.columns

Index(['customerID', 'Churn', 'customer.gender', 'customer.SeniorCitizen',
       'customer.Partner', 'customer.Dependents', 'customer.tenure',
       'phone.PhoneService', 'phone.MultipleLines', 'internet.InternetService',
       'internet.OnlineSecurity', 'internet.OnlineBackup',
       'internet.DeviceProtection', 'internet.TechSupport',
       'internet.StreamingTV', 'internet.StreamingMovies', 'account.Contract',
       'account.PaperlessBilling', 'account.PaymentMethod',
       'account.Charges.Monthly', 'account.Charges.Total'],
      dtype='object')

## Checking for inconsistencies in the data

In this step, I check for problems in the data that could affect the analysis, paying attention to missing values, duplicates, formatting errors, and inconsistent categories. This is essential to make sure the data is ready for the following stages.
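One possible way to gather these checks in a single table is sketched below. The helper `quality_report` and the `demo` frame are hypothetical and not part of the original notebook; they only illustrate the kind of counts inspected in the next cells.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column counts of NaN values, empty strings and distinct values."""
    return pd.DataFrame({
        "nan_count": df.isna().sum(),
        "empty_str_count": (df == "").sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })

# Hypothetical frame with one empty label and one missing amount.
demo = pd.DataFrame({
    "Churn": ["No", "Yes", ""],
    "Total": [593.3, None, 280.85],
})

report = quality_report(demo)
print(report)
print("duplicated rows:", int(demo.duplicated().sum()))
```

Empty strings are counted separately because `isna()` does not treat `''` as missing.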

In [310]:
data_norm.describe()

Unnamed: 0,customer.SeniorCitizen,customer.tenure,account.Charges.Monthly,account.Charges.Total
count,7267.0,7267.0,7267.0,7256.0
mean,0.162653,32.346498,64.720098,2280.634213
std,0.369074,24.571773,30.129572,2268.632997
min,0.0,0.0,18.25,18.8
25%,0.0,9.0,35.425,400.225
50%,0.0,29.0,70.3,1391.0
75%,0.0,55.0,89.875,3785.3
max,1.0,72.0,118.75,8684.8


'Churn' has 3 distinct values when there should be only 2.

In [311]:
data_norm.nunique()

customerID                   7267
Churn                           3
customer.gender                 2
customer.SeniorCitizen          2
customer.Partner                2
customer.Dependents             2
customer.tenure                73
phone.PhoneService              2
phone.MultipleLines             3
internet.InternetService        4
internet.OnlineSecurity         3
internet.OnlineBackup           3
internet.DeviceProtection       3
internet.TechSupport            3
internet.StreamingTV            3
internet.StreamingMovies        3
account.Contract                3
account.PaperlessBilling        2
account.PaymentMethod           4
account.Charges.Monthly      1585
account.Charges.Total        6530
dtype: int64

We list the unique values in each column to verify them.

In [312]:
for columna in data_norm.columns:
    print(f'{columna} \t {data_norm[columna].unique()}')

customerID 	 ['0002-ORFBO' '0003-MKNFE' '0004-TLHLJ' ... '9992-UJOEL' '9993-LHIEB'
 '9995-HOTOH']
Churn 	 ['No' 'Yes' '']
customer.gender 	 ['Female' 'Male']
customer.SeniorCitizen 	 [0 1]
customer.Partner 	 ['Yes' 'No']
customer.Dependents 	 ['Yes' 'No']
customer.tenure 	 [ 9  4 13  3 71 63  7 65 54 72  5 56 34  1 45 50 23 55 26 69 11 37 49 66
 67 20 43 59 12 27  2 25 29 14 35 64 39 40  6 30 70 57 58 16 32 33 10 21
 61 15 44 22 24 19 47 62 46 52  8 60 48 28 41 53 68 51 31 36 17 18 38 42
  0]
phone.PhoneService 	 ['Yes' 'No']
phone.MultipleLines 	 ['No' 'Yes' 'No phone service']
internet.InternetService 	 ['DSL' 'Fiber optic' 'No' 'Fibjsoner optic']
internet.OnlineSecurity 	 ['No' 'Yes' 'No internet service']
internet.OnlineBackup 	 ['Yes' 'No' 'No internet service']
internet.DeviceProtection 	 ['No' 'Yes' 'No internet service']
internet.TechSupport 	 ['Yes' 'No' 'No internet service']
internet.StreamingTV 	 ['Yes' 'No' 'No internet service']
internet.StreamingMovies 	 ['No' 'Yes' 'No 

⚠️ 'Churn' should not contain empty values.

We check for duplicated rows.

In [313]:
int(data_norm.duplicated().sum())

0

There are no duplicated rows.

We check that there is only one record per customer.

In [314]:
data_norm['customerID'].is_unique

True

No customer has more than one record.

We check for NaN values in each numeric column.

In [315]:
for columna in ['customer.SeniorCitizen', 'customer.tenure', 'account.Charges.Monthly', 'account.Charges.Total']:
    print(columna, data_norm[columna].isna().sum())


customer.SeniorCitizen 0
customer.tenure 0
account.Charges.Monthly 0
account.Charges.Total 11


⚠️ There are NaN values in the *account.Charges.Total* column.

## Handling inconsistencies

I apply the necessary corrections, adjusting the data so that it is complete and consistent for the following stages of the analysis.

In [316]:
data_norm.sample(3)

Unnamed: 0,customerID,Churn,customer.gender,customer.SeniorCitizen,customer.Partner,customer.Dependents,customer.tenure,phone.PhoneService,phone.MultipleLines,internet.InternetService,...,internet.OnlineBackup,internet.DeviceProtection,internet.TechSupport,internet.StreamingTV,internet.StreamingMovies,account.Contract,account.PaperlessBilling,account.PaymentMethod,account.Charges.Monthly,account.Charges.Total
3108,4325-NFSKC,Yes,Male,1,No,No,19,Yes,Yes,Fiber optic,...,No,No,No,Yes,No,Month-to-month,Yes,Electronic check,90.6,1660.0
6667,9168-INPSZ,No,Female,1,Yes,No,44,Yes,Yes,Fiber optic,...,No,Yes,Yes,Yes,Yes,Month-to-month,No,Electronic check,104.15,4495.65
7243,9961-JBNMK,Yes,Male,1,No,No,21,Yes,No,Fiber optic,...,No,Yes,No,Yes,Yes,Month-to-month,Yes,Bank transfer (automatic),96.8,2030.3


We look for the rows where 'Churn' == '' and those where 'account.Charges.Total' is NaN.

In [317]:
data_norm[data_norm['Churn']=='']

Unnamed: 0,customerID,Churn,customer.gender,customer.SeniorCitizen,customer.Partner,customer.Dependents,customer.tenure,phone.PhoneService,phone.MultipleLines,internet.InternetService,...,internet.OnlineBackup,internet.DeviceProtection,internet.TechSupport,internet.StreamingTV,internet.StreamingMovies,account.Contract,account.PaperlessBilling,account.PaymentMethod,account.Charges.Monthly,account.Charges.Total
30,0047-ZHDTW,,Female,0,No,No,11,Yes,Yes,Fiber optic,...,No,No,No,No,No,Month-to-month,Yes,Bank transfer (automatic),79.00,929.30
75,0120-YZLQA,,Male,0,No,No,71,Yes,No,No,...,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,Yes,Credit card (automatic),19.90,1355.10
96,0154-QYHJU,,Male,0,No,No,29,Yes,No,DSL,...,Yes,No,Yes,No,No,One year,Yes,Electronic check,58.75,1696.20
98,0162-RZGMZ,,Female,1,No,No,5,Yes,No,DSL,...,Yes,No,Yes,No,No,Month-to-month,No,Credit card (automatic),59.90,287.85
175,0274-VVQOQ,,Male,1,Yes,No,65,Yes,Yes,Fiber optic,...,Yes,Yes,No,Yes,Yes,One year,Yes,Bank transfer (automatic),103.15,6792.45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7158,9840-GSRFX,,Female,0,No,No,14,Yes,Yes,DSL,...,Yes,No,No,No,No,One year,Yes,Mailed check,54.25,773.20
7180,9872-RZQQB,,Female,0,Yes,No,49,No,No phone service,DSL,...,No,No,No,Yes,No,Month-to-month,No,Bank transfer (automatic),40.65,2070.75
7211,9920-GNDMB,,Male,0,No,No,9,Yes,Yes,Fiber optic,...,No,No,No,No,No,Month-to-month,Yes,Electronic check,76.25,684.85
7239,9955-RVWSC,,Female,0,Yes,Yes,67,Yes,No,No,...,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,Yes,Bank transfer (automatic),19.25,1372.90


In [318]:
data_norm[data_norm['account.Charges.Total'].isna()]

Unnamed: 0,customerID,Churn,customer.gender,customer.SeniorCitizen,customer.Partner,customer.Dependents,customer.tenure,phone.PhoneService,phone.MultipleLines,internet.InternetService,...,internet.OnlineBackup,internet.DeviceProtection,internet.TechSupport,internet.StreamingTV,internet.StreamingMovies,account.Contract,account.PaperlessBilling,account.PaymentMethod,account.Charges.Monthly,account.Charges.Total
975,1371-DWPAZ,No,Female,0,Yes,Yes,0,No,No phone service,DSL,...,Yes,Yes,Yes,Yes,No,Two year,No,Credit card (automatic),56.05,
1775,2520-SGTTA,No,Female,0,Yes,Yes,0,Yes,No,No,...,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.0,
1955,2775-SEFEE,No,Male,0,No,Yes,0,Yes,Yes,DSL,...,Yes,No,Yes,No,No,Two year,Yes,Bank transfer (automatic),61.9,
2075,2923-ARZLG,No,Male,0,Yes,Yes,0,Yes,No,No,...,No internet service,No internet service,No internet service,No internet service,No internet service,One year,Yes,Mailed check,19.7,
2232,3115-CZMZD,No,Male,0,No,Yes,0,Yes,No,No,...,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.25,
2308,3213-VVOLG,No,Male,0,Yes,Yes,0,Yes,Yes,No,...,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.35,
2930,4075-WKNIU,No,Female,0,Yes,Yes,0,Yes,Yes,DSL,...,Yes,Yes,Yes,Yes,No,Two year,No,Mailed check,73.35,
3134,4367-NUYAO,No,Male,0,Yes,Yes,0,Yes,Yes,No,...,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.75,
3203,4472-LVYGI,No,Female,0,Yes,Yes,0,No,No phone service,DSL,...,No,Yes,Yes,Yes,No,Two year,Yes,Bank transfer (automatic),52.55,
4169,5709-LVOEQ,No,Female,0,Yes,Yes,0,Yes,No,DSL,...,Yes,Yes,No,Yes,Yes,Two year,No,Mailed check,80.85,


We will remove those records. First we build a set with their indices.

In [319]:
filas_a_eliminar = set(data_norm[data_norm['Churn']==''].index) | set(data_norm[data_norm['account.Charges.Total'].isna()].index)

In [320]:
filas_a_eliminar

{30,
 75,
 96,
 98,
 175,
 219,
 312,
 351,
 368,
 374,
 380,
 382,
 395,
 439,
 451,
 495,
 540,
 590,
 640,
 669,
 681,
 739,
 791,
 842,
 876,
 877,
 903,
 912,
 932,
 973,
 975,
 992,
 1013,
 1017,
 1160,
 1172,
 1218,
 1236,
 1303,
 1364,
 1366,
 1517,
 1657,
 1705,
 1764,
 1775,
 1795,
 1805,
 1825,
 1860,
 1883,
 1955,
 2021,
 2075,
 2101,
 2138,
 2151,
 2154,
 2158,
 2200,
 2232,
 2245,
 2264,
 2308,
 2390,
 2394,
 2429,
 2467,
 2494,
 2576,
 2584,
 2613,
 2627,
 2644,
 2690,
 2726,
 2733,
 2751,
 2879,
 2913,
 2919,
 2930,
 2945,
 2953,
 2973,
 2989,
 3053,
 3060,
 3076,
 3134,
 3177,
 3199,
 3202,
 3203,
 3207,
 3220,
 3249,
 3266,
 3290,
 3300,
 3305,
 3320,
 3365,
 3378,
 3438,
 3468,
 3538,
 3590,
 3617,
 3619,
 3688,
 3724,
 3804,
 3827,
 3833,
 3844,
 3858,
 3900,
 3924,
 3968,
 4021,
 4072,
 4081,
 4128,
 4169,
 4196,
 4199,
 4282,
 4327,
 4390,
 4393,
 4396,
 4411,
 4413,
 4431,
 4497,
 4541,
 4578,
 4579,
 4599,
 4609,
 4662,
 4665,
 4713,
 4750,
 4753,
 4762,
 4769,


We drop the rows at the stored indices and reset the index.
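An alternative to collecting indices is to filter with a boolean mask. The sketch below shows that approach on a small made-up frame; `demo`, `keep` and `clean` are hypothetical names, and this is not the method the notebook itself uses.

```python
import pandas as pd

# Hypothetical frame: row 1 has an empty Churn label, row 2 a missing total.
demo = pd.DataFrame({
    "Churn": ["No", "", "Yes", "No"],
    "Total": [593.3, 100.0, None, 280.85],
})

# Keep a row only when Churn is non-empty and Total is not NaN.
keep = (demo["Churn"] != "") & demo["Total"].notna()
clean = demo[keep].reset_index(drop=True)
print(len(clean))
```

Both approaches should leave the same rows; the mask version simply avoids the intermediate set of indices.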

In [321]:
# Pass the labels as a sorted list so the drop does not depend on set handling.
data_norm.drop(index=sorted(filas_a_eliminar), inplace=True)
data_norm.reset_index(drop=True, inplace=True)

I verify that no records with 'Churn' == '' remain and that the remaining row count is correct.

In [322]:
len(data_norm[data_norm['Churn'] == '']) == 0 and len(data_norm) == len(pd_data) - len(filas_a_eliminar)

True

## Daily charges column

Now that the data is clean, we create the "Cuentas_Diarias" column requested by the challenge, under the name "account.Charges.Daily". We divide the monthly charge by 30 to get the daily value, which gives a more detailed view of customer behavior over time.
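As a quick worked example of the formula, the sketch below applies it to the monthly charges of the first three customers shown in the output that follows. The 30-day month is the notebook's own simplification; the constant name `DAYS_PER_MONTH` is introduced here only for illustration.

```python
import pandas as pd

# Monthly charges of the first three customers in the cleaned data.
monthly = pd.Series([65.6, 59.9, 73.9], name="account.Charges.Monthly")

DAYS_PER_MONTH = 30  # simplification: every month is treated as 30 days

# Daily charge = monthly charge / 30, rounded to two decimals.
daily = (monthly / DAYS_PER_MONTH).round(2)
print(daily.tolist())
```

Since the daily value is a fixed fraction of the monthly charge, it carries the same information on a different scale.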

In [323]:
data_norm['account.Charges.Daily'] = (data_norm['account.Charges.Monthly'] / 30).round(2)
data_norm.head(3)

Unnamed: 0,customerID,Churn,customer.gender,customer.SeniorCitizen,customer.Partner,customer.Dependents,customer.tenure,phone.PhoneService,phone.MultipleLines,internet.InternetService,...,internet.DeviceProtection,internet.TechSupport,internet.StreamingTV,internet.StreamingMovies,account.Contract,account.PaperlessBilling,account.PaymentMethod,account.Charges.Monthly,account.Charges.Total,account.Charges.Daily
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,...,No,Yes,Yes,No,One year,Yes,Mailed check,65.6,593.3,2.19
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,...,No,No,No,Yes,Month-to-month,No,Mailed check,59.9,542.4,2.0
2,0004-TLHLJ,Yes,Male,0,No,No,4,Yes,No,Fiber optic,...,Yes,No,No,No,Month-to-month,Yes,Electronic check,73.9,280.85,2.46


## Data standardization and transformation

Standardizing and transforming the data is an optional but highly recommended step, since it makes the information more consistent, understandable, and suitable for analysis. During this phase we can, for example, convert text values such as "Yes" and "No" into binary values (1 and 0), which simplifies mathematical processing and the application of analytical models.

In addition, translating or renaming columns and values makes the information more accessible and easier to understand, especially when working with external sources or technical terms. Although it is not mandatory, it can significantly improve the clarity and communication of the results, making them easier to interpret and avoiding confusion, especially when sharing information with non-technical stakeholders.
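The binary conversion mentioned above could look like the following sketch. The notebook itself does not apply it (it only relabels 'Churn' and 'customer.SeniorCitizen'), so the series and the mapping here are illustrative assumptions.

```python
import pandas as pd

# Hypothetical Yes/No column, similar to customer.Partner.
partner = pd.Series(["Yes", "No", "Yes"], name="customer.Partner")

# map() replaces each label by its dictionary value;
# labels missing from the dictionary would become NaN.
partner_binary = partner.map({"Yes": 1, "No": 0})
print(partner_binary.tolist())
```

Columns with a third category, such as 'No phone service', would need an explicit entry in the mapping to avoid NaN values.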

I decided to keep the column names in English, since they read more compactly than their Spanish equivalents.

I will, however, recode the values of these columns:
```
{
    'Churn': {'No': 'Active', 'Yes': 'Churned'},
    'customer.SeniorCitizen': {0: 'No', 1: 'Yes'}
}
```


In [324]:
diccionario_reemplazo_multiple = {
    'Churn': {'No': 'Active', 'Yes': 'Churned'},
    'customer.SeniorCitizen': {0: 'No', 1: 'Yes'}
}


In [325]:
data_norm = data_norm.replace(diccionario_reemplazo_multiple)
data_norm.sample(3)

Unnamed: 0,customerID,Churn,customer.gender,customer.SeniorCitizen,customer.Partner,customer.Dependents,customer.tenure,phone.PhoneService,phone.MultipleLines,internet.InternetService,...,internet.DeviceProtection,internet.TechSupport,internet.StreamingTV,internet.StreamingMovies,account.Contract,account.PaperlessBilling,account.PaymentMethod,account.Charges.Monthly,account.Charges.Total,account.Charges.Daily
2632,3750-YHRYO,Active,Male,No,Yes,Yes,7,Yes,No,No,...,No internet service,No internet service,No internet service,No internet service,One year,No,Mailed check,20.65,150.0,0.69
76,0123-CRBRT,Active,Female,No,Yes,Yes,61,Yes,Yes,DSL,...,Yes,Yes,Yes,Yes,Two year,No,Mailed check,88.1,5526.75,2.94
6966,9903-LYSAB,Active,Male,No,Yes,No,18,Yes,Yes,Fiber optic,...,No,No,No,No,Month-to-month,Yes,Electronic check,73.15,1305.95,2.44


# 📊 Load and analysis

## Descriptive analysis

To begin, we run a descriptive analysis of the data, computing metrics such as the mean, median, standard deviation, and other measures that help us understand the distribution and behavior of the customers.
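A minimal sketch of these metrics on a small made-up series is shown below. It also separates the sample standard deviation (the pandas default, which `describe()` reports) from the population standard deviation (what `statistics.pstdev` computes); the values are hypothetical.

```python
import pandas as pd

# Hypothetical values chosen so the results are easy to check by hand.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "sample_std": values.std(),            # ddof=1 (pandas default)
    "population_std": values.std(ddof=0),  # divides by n instead of n - 1
}
for name, value in summary.items():
    print(f"{name}: {value:0.3f}")
```

With several thousand rows the two standard deviations differ only slightly, which explains small discrepancies between `describe()` and a population-based calculation.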

In [326]:
data_norm.describe()

Unnamed: 0,customer.tenure,account.Charges.Monthly,account.Charges.Total,account.Charges.Daily
count,7032.0,7032.0,7032.0,7032.0
mean,32.421786,64.798208,2283.300441,2.159891
std,24.54526,30.085974,2266.771362,1.002955
min,1.0,18.25,18.8,0.61
25%,9.0,35.5875,401.45,1.1875
50%,29.0,70.35,1397.475,2.34
75%,55.0,89.8625,3794.7375,2.9925
max,72.0,118.75,8684.8,3.96


We start by evaluating:
* *customer.tenure*: How long the customer has been with the company. It is a numeric variable and highly relevant for understanding loyalty or retention.

* *account.Charges.Monthly*: The monthly amount the customer pays. It is key to seeing whether customers who pay more tend to leave or stay.

* *account.Charges.Total*: The cumulative total paid by the customer. It gives an idea of the customer's value to the company.

* *account.Charges.Daily*: The monthly charge divided by 30, as computed above. It can be useful for comparison with other consumption metrics.

In [327]:
data_norm[data_norm['account.Charges.Total'].isna()]

Unnamed: 0,customerID,Churn,customer.gender,customer.SeniorCitizen,customer.Partner,customer.Dependents,customer.tenure,phone.PhoneService,phone.MultipleLines,internet.InternetService,...,internet.DeviceProtection,internet.TechSupport,internet.StreamingTV,internet.StreamingMovies,account.Contract,account.PaperlessBilling,account.PaymentMethod,account.Charges.Monthly,account.Charges.Total,account.Charges.Daily


In [328]:
def analisis_Descriptivo(columna):
    # Print min, max, mean, median and population standard deviation of a column.
    serie = data_norm[columna]
    print(
f'''The minimum and maximum values of {columna} are {serie.min()} and {serie.max()}
The mean of {columna} is {serie.mean():0.3f}
The median of {columna} is {serie.median():0.3f}
The population standard deviation of {columna} is {serie.std(ddof=0):0.3f}\n'''
)


In [329]:
# The loop variable must match the name passed to the function; otherwise
# a stale global `columna` is reused and the same column is printed four times.
for columna in ['customer.tenure', 
               'account.Charges.Monthly', 
               'account.Charges.Total', 
               'account.Charges.Daily']:
    analisis_Descriptivo(columna)

The minimum and maximum values of customer.tenure are 1 and 72
The mean of customer.tenure is 32.422
The median of customer.tenure is 29.000
The population standard deviation of customer.tenure is 24.544

The minimum and maximum values of account.Charges.Monthly are 18.25 and 118.75
The mean of account.Charges.Monthly is 64.798
The median of account.Charges.Monthly is 70.350
The population standard deviation of account.Charges.Monthly is 30.084

The minimum and maximum values of account.Charges.Total are 18.8 and 8684.8
The mean of account.Charges.Total is 2283.300
The median of account.Charges.Total is 1397.475
The population standard deviation of account.Charges.Total is 2266.610

The minimum and maximum values of account.Charges.Daily are 0.61 and 3.96
The mean of account.Charges.Daily is 2.160
The median of account.Charges.Daily is 2.340
The population standard deviation of account.Charges.Daily is 1.003

## Churn distribution

In this step, the goal is to understand how the "churn" variable is distributed among customers. I will use a bar chart to visualize the proportion of customers who stayed and those who cancelled.
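Alongside the chart, the proportions can also be computed numerically. The sketch below does so with `value_counts(normalize=True)` on a made-up series; the four labels are hypothetical and do not reflect the real distribution.

```python
import pandas as pd

# Hypothetical churn labels, using the notebook's Active/Churned naming.
churn = pd.Series(["Active", "Active", "Churned", "Active"], name="Churn")

# normalize=True returns fractions instead of counts; multiply for percent.
share = churn.value_counts(normalize=True) * 100
print(share.round(1).to_dict())
```

On the real data the same call would be applied to `data_norm['Churn']`.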

In [330]:
data_norm.head(2)

Unnamed: 0,customerID,Churn,customer.gender,customer.SeniorCitizen,customer.Partner,customer.Dependents,customer.tenure,phone.PhoneService,phone.MultipleLines,internet.InternetService,...,internet.DeviceProtection,internet.TechSupport,internet.StreamingTV,internet.StreamingMovies,account.Contract,account.PaperlessBilling,account.PaymentMethod,account.Charges.Monthly,account.Charges.Total,account.Charges.Daily
0,0002-ORFBO,Active,Female,No,Yes,Yes,9,Yes,No,DSL,...,No,Yes,Yes,No,One year,Yes,Mailed check,65.6,593.3,2.19
1,0003-MKNFE,Active,Male,No,No,No,9,Yes,Yes,DSL,...,No,No,No,Yes,Month-to-month,No,Mailed check,59.9,542.4,2.0


In [331]:
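def comp_churn(df, column, title, xtitle, barNorm=None, out=True):
    # Stacked histogram of `column`, split by churn status.
    fig = px.histogram(df,
                       x=column,
                       text_auto=True,
                       color='Churn',
                       barmode='relative',
                       barnorm=barNorm,
                       title=title,
                       # Explicit mapping so colors do not depend on row order.
                       color_discrete_map={'Active': 'green', 'Churned': 'red'}
                       )
    fig.update_layout(
        title_x=0.5,  # Center the title
        xaxis_title=xtitle,
        # Without barNorm the bars show counts, not rates.
        yaxis_title='Number of customers' if barNorm is None else f'Share of customers ({barNorm})',
        bargap=0.2
    )
    fig.show()

    if out:
        # Churn rate per category: churned customers / all customers in it.
        print(title)
        churned = df.loc[df['Churn'] == 'Churned', column].value_counts()
        totals = df[column].value_counts()
        for index, value in churned.items():
            print(f"{index}: {value * 100 / totals.loc[index]:0.3f} %")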

In [332]:
comp_churn(data_norm, 'Churn', 'Churn Rate', 'Churn Rate', out=False)

## Churn count by categorical variables

Next, we explore how churn is distributed across categorical variables such as gender, contract type, and payment method, among others.

This analysis can reveal interesting patterns, for example whether customers with certain profiles are more likely to cancel the service, which helps guide strategic actions.
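A compact way to obtain the churn rate per category is a row-normalized cross-tabulation, sketched below on a made-up frame. This is an alternative illustration, not the function used in the notebook; `demo` and `rates` are hypothetical names.

```python
import pandas as pd

# Hypothetical customers: two month-to-month contracts, two one-year contracts.
demo = pd.DataFrame({
    "account.Contract": ["Month-to-month", "Month-to-month", "One year", "One year"],
    "Churn": ["Churned", "Active", "Active", "Active"],
})

# normalize="index" makes each row sum to 1, so the "Churned" column
# is the churn rate within that category.
rates = pd.crosstab(demo["account.Contract"], demo["Churn"], normalize="index") * 100
print(rates.round(1))
```

Reading the "Churned" column row by row gives the same kind of per-category percentages that the notebook prints after each chart.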

In [333]:
data_norm.head(2)

Unnamed: 0,customerID,Churn,customer.gender,customer.SeniorCitizen,customer.Partner,customer.Dependents,customer.tenure,phone.PhoneService,phone.MultipleLines,internet.InternetService,...,internet.DeviceProtection,internet.TechSupport,internet.StreamingTV,internet.StreamingMovies,account.Contract,account.PaperlessBilling,account.PaymentMethod,account.Charges.Monthly,account.Charges.Total,account.Charges.Daily
0,0002-ORFBO,Active,Female,No,Yes,Yes,9,Yes,No,DSL,...,No,Yes,Yes,No,One year,Yes,Mailed check,65.6,593.3,2.19
1,0003-MKNFE,Active,Male,No,No,No,9,Yes,Yes,DSL,...,No,No,No,Yes,Month-to-month,No,Mailed check,59.9,542.4,2.0


I create a dictionary that stores the title and x-axis label for each chart comparing the churn rate.

In [345]:
dicc_cat = {'customer.gender': ['Churn Rate by Gender', 'Gender'],
           'customer.SeniorCitizen': ['Churn Rate by Senior Citizen Status', 'Senior Citizen' ],
           'customer.tenure': ['Churn rate by customer tenure', 'Customer tenure'],
           'account.Contract': ["Customer Churn by Contract Type", 'Contract Type'],
           'account.PaymentMethod': ["Customer Churn by Payment Method", 'Payment Method'],
           'customer.Partner': ['Customer Churn by Partner Status', 'Partner Status'],
           'customer.Dependents': ['Customer Churn by Dependents', 'Dependents Status'],
           'phone.PhoneService': ['Customer Churn by Phone Service', 'Phone Service'],
           'phone.MultipleLines': ['Customer Churn by Multiple Lines', 'Multiple Lines'],
           'internet.InternetService': ['Customer Churn by Internet Service Type', 'Internet Service Type'],
           'internet.OnlineSecurity': ['Customer Churn by Online Security Subscription', 'Online Security Subscription'],
           'internet.OnlineBackup': ['Customer Churn by Online Backup Subscription', 'Online Backup Subscription'],
           'internet.DeviceProtection': ['Customer Churn by Device Protection Plan', 'Device Protection Plan'],
           'internet.TechSupport': ['Customer Churn by Tech Support Subscription', 'Tech Support Subscription'],
           'internet.StreamingMovies': ['Customer Churn by Streaming Movies Subscription', 'Streaming Movies Subscription'],
           'account.PaperlessBilling': ['Customer Churn by Paperless Billing Status', 'Paperless Billing Status'],
           'account.Charges.Daily': ['Customer Churn by Daily Charges', 'Daily Charges']
            }

In [346]:
for key, value in dicc_cat.items():
    comp_churn(data_norm, key, value[0], value[1])

Churn Rate by Gender
Female: 26.960 %
Male: 26.205 %


Churn Rate by Senior Citizen Status
No: 23.650 %
Yes: 41.681 %


Churn rate by customer tenure
1: 61.990 %
2: 51.681 %
3: 47.000 %
4: 47.159 %
5: 48.120 %
7: 38.931 %
9: 38.655 %
10: 38.793 %
8: 34.146 %
6: 36.364 %
12: 32.479 %
13: 34.862 %
15: 37.374 %
11: 31.313 %
16: 35.000 %
22: 30.000 %
17: 29.885 %
14: 31.579 %
18: 24.742 %
24: 24.468 %
25: 29.114 %
32: 27.536 %
19: 26.027 %
20: 25.352 %
21: 26.984 %
31: 24.615 %
30: 22.222 %
43: 23.077 %
49: 22.727 %
37: 23.077 %
29: 20.833 %
26: 18.987 %
35: 17.045 %
33: 21.875 %
41: 20.000 %
39: 25.000 %
53: 20.000 %
42: 21.538 %
47: 20.588 %
38: 22.034 %
66: 14.607 %
27: 18.056 %
40: 20.312 %
54: 19.118 %
23: 15.294 %
46: 16.216 %
28: 21.053 %
34: 18.462 %
58: 16.418 %
70: 9.244 %
50: 14.706 %
56: 12.500 %
36: 20.000 %
67: 10.204 %
48: 14.062 %
65: 11.842 %
68: 9.000 %
55: 14.062 %
52: 10.000 %
61: 10.526 %
51: 11.765 %
59: 13.333 %
69: 8.421 %
57: 12.308 %
71: 3.529 %
60: 7.895 %
72: 1.657 %
45: 9.836 %
44: 11.765 %
62: 7.143 %
63: 5.556 %
64: 5.000 %


Customer Churn by Contract Type
Month-to-month: 42.710 %
One year: 11.277 %
Two year: 2.849 %


Customer Churn by Payment Method
Electronic check: 45.285 %
Mailed check: 19.202 %
Bank transfer (automatic): 16.732 %
Credit card (automatic): 15.253 %


Customer Churn by Partner Status
No: 32.976 %
Yes: 19.717 %


Customer Churn by Dependents
No: 31.279 %
Yes: 15.531 %


Customer Churn by Phone Service
Yes: 26.747 %
No: 25.000 %


Customer Churn by Multiple Lines
Yes: 28.648 %
No: 25.081 %
No phone service: 25.000 %


Customer Churn by Internet Service Type
Fiber optic: 41.906 %
DSL: 18.998 %
No: 7.434 %


Customer Churn by Online Security Subscription
No: 41.779 %
Yes: 14.640 %
No internet service: 7.434 %


Customer Churn by Online Backup Subscription
No: 39.942 %
Yes: 21.567 %
No internet service: 7.434 %


Customer Churn by Device Protection Plan
No: 39.140 %
Yes: 22.539 %
No internet service: 7.434 %


Customer Churn by Tech Support Subscription
No: 41.647 %
Yes: 15.196 %
No internet service: 7.434 %


Customer Churn by Streaming Movies Subscription
No: 33.729 %
Yes: 29.952 %
No internet service: 7.434 %


Customer Churn by Paperless Billing Status
Yes: 33.589 %
No: 16.376 %


Customer Churn by Daily Charges
2.48: 53.030 %
2.34: 44.118 %
2.84: 49.123 %
2.68: 40.909 %
2.33: 47.170 %
2.5: 54.348 %
2.66: 45.455 %
2.98: 44.231 %
0.66: 9.163 %
0.67: 10.233 %
2.64: 38.889 %
2.32: 42.857 %
2.82: 41.667 %
3.15: 52.632 %
2.51: 37.736 %
2.83: 41.667 %
0.65: 9.479 %
3.34: 43.478 %
3.32: 46.341 %
2.65: 47.368 %
3.16: 36.000 %
1.5: 40.909 %
3.13: 58.621 %
3.14: 36.957 %
2.7: 36.170 %
2.52: 34.043 %
3.18: 43.243 %
0.84: 18.391 %
3.33: 57.143 %
2.49: 32.653 %
0.68: 7.273 %
3.17: 38.095 %
2.36: 42.857 %
2.86: 35.714 %
1.51: 44.118 %
2.99: 31.250 %
2.67: 36.585 %
2.85: 46.875 %
3.2: 48.276 %
3.02: 29.167 %
2.35: 37.838 %
2.46: 50.000 %
3.0: 35.897 %
2.97: 33.333 %
3.47: 36.842 %
3.38: 45.161 %
1.84: 33.333 %
2.31: 38.235 %
3.03: 39.394 %
2.63: 36.111 %
1.69: 46.429 %
3.28: 68.421 %
2.8: 38.235 %
3.48: 30.769 %
3.19: 41.379 %
1.68: 29.268 %
2.69: 31.579 %
2.81: 30.769 %
2.53: 40.000 %
1.64: 40.000 %
3.35: 34.286 %
3.3: 28.947 %
3.49: 42.308 %
3.53: 47.826 %
1.52: 44.000 %
3.2

In [337]:
df_grouped = data_norm.groupby(['customer.tenure', 'Churn']).size().unstack(fill_value=0)

# Churn rate = churned customers / all customers with that tenure, in percent.
df_grouped['churn_rate'] = (df_grouped.get('Churned', 0) / df_grouped.sum(axis=1)) * 100

px.line(df_grouped.reset_index(),
        x='customer.tenure',
        y='churn_rate',
        markers=True,
        title='Churn Rate by Customer Tenure')


In [338]:
px.scatter(data_norm,
           x="customer.tenure",
           y="account.Charges.Monthly",
           color="Churn",
           title="Churn by Tenure vs Monthly Charges")

In [347]:
px.box(data_norm, x = 'Churn', y = 'account.Charges.Monthly', color = 'Churn', color_discrete_sequence=['green', 'red'], title = 'Churn by Monthly Charges')