# **Customer Churn Prediction for a Telecommunications Company**

Dataset downloaded from Kaggle: https://www.kaggle.com/datasets/blastchar/telco-customer-churn/data

Description of the Telco Customer Churn dataset

🔔 General context:

Each row: one customer of a telco (telecommunications) company.

Goal: predict whether that customer left (Churn=Yes) or stayed (Churn=No).

🕹️ Includes demographic data, contracted products, add-on services, and billing behavior.

---

📋 Field details:

| Field | Business description |
|---|---|
| customerID | Unique customer identifier. Adds nothing to the predictive model but is useful as a reference. |
| gender | Customer gender (Male or Female). |
| SeniorCitizen | Whether the customer is a senior citizen (1 = yes, 0 = no). |
| Partner | Whether the customer has a partner (Yes = married or living together, No = single). |
| Dependents | Whether the customer has dependents (children, etc.). |
| tenure | Number of months the customer has been with the company. Indicates seniority. |
| PhoneService | Whether the customer has phone service (Yes or No). |
| MultipleLines | Whether the customer has multiple phone lines (Yes, No, No phone service). |
| InternetService | Type of internet service (DSL, Fiber optic, No). |
| OnlineSecurity | Whether the customer has online security service, such as antivirus or protection (Yes, No, No internet service). |
| OnlineBackup | Whether the customer has online backup service (Yes, No, No internet service). |
| DeviceProtection | Whether the customer has device protection (Yes, No, No internet service). |
| TechSupport | Whether the customer has contracted tech support (Yes, No, No internet service). |
| StreamingTV | Whether the customer has streaming TV service (Yes, No, No internet service). |
| StreamingMovies | Whether the customer has movie streaming service (Yes, No, No internet service). |
| Contract | Contract type (Month-to-month, One year, Two year). Reflects the commitment period. |
| PaperlessBilling | Whether the customer uses electronic billing (Yes = no paper invoice). |
| PaymentMethod | Payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)). |
| MonthlyCharges | Monthly amount in dollars paid by the customer. |
| TotalCharges | Total amount billed to the customer over the whole relationship with the company. |
| Churn | Target variable: whether the customer left (Yes) or stayed (No). |


---

🔎 Key business insights by variable

Demographic variables: gender, SeniorCitizen, Partner, Dependents.

They allow analysis of socio-demographic profiles that may be more prone to churn.

Service usage: PhoneService, InternetService, OnlineSecurity, etc.

They show which products and bundles each customer has contracted.

This matters because churn could be related to the type or number of services.


Billing and payment method: MonthlyCharges, TotalCharges, PaymentMethod.

Information about the customer's economic behavior.

Typical example: customers who pay by check could be more prone to churn.

Contractual relationship: tenure, Contract, PaperlessBilling.

tenure is especially predictive: churn is higher among new customers.
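
One way to check the tenure claim is to compute the churn rate per tenure band. The following is a minimal sketch on made-up toy data (the `toy` frame and the band edges are assumptions for illustration, not values from the dataset); in the notebook the same idea would apply to `df` before `Churn` is encoded.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use df loaded from train.csv.
toy = pd.DataFrame({
    "tenure": [1, 3, 5, 14, 20, 30, 45, 60, 70, 72],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No"],
})

# Bucket tenure (months) into bands; the edges are an arbitrary choice.
# include_lowest=True keeps customers with tenure 0 in the first band.
bands = pd.cut(
    toy["tenure"],
    bins=[0, 12, 24, 48, 72],
    labels=["0-12", "13-24", "25-48", "49-72"],
    include_lowest=True,
)

# Share of churned customers within each band.
churn_rate_by_band = (toy["Churn"] == "Yes").groupby(bands, observed=False).mean()
print(churn_rate_by_band)
```

A rate that falls as the band index grows would support the statement that newer customers churn more.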


---

✅ Interpretive summary: 👉 This dataset simulates the customer base of a telco that wants to anticipate who might cancel their service, by analyzing:

Demographic profile.

Contracted products.

Tenure.

Billing.

Payment method.

In [None]:
# Check the NumPy version and change it if needed for Numba compatibility
import sys
import subprocess

def check_and_fix_numpy():
    try:
        import numpy as np
        numpy_version = np.__version__
        print(f"🔍 Current NumPy: {numpy_version}")

        # Check whether the version is compatible with Numba
        major, minor = [int(x) for x in numpy_version.split('.')[:2]]

        if major > 2 or (major == 2 and minor > 1):
            print("⚠️ NumPy version incompatible with Numba detected")
            print("🔧 Installing a compatible NumPy...")

            # Uninstall, then install a compatible version (any 2.1.x or older,
            # matching the check above)
            subprocess.check_call([sys.executable, "-m", "pip", "uninstall", "numpy", "-y"])
            subprocess.check_call([sys.executable, "-m", "pip", "install", "numpy<2.2"])

            print("✅ NumPy changed to a compatible version")
            print("🔄 Restart the kernel after running this cell")
        else:
            print("✅ NumPy version is compatible")

    except ImportError:
        print("📦 NumPy is not installed; it will be installed in the next cell")
    except Exception as e:
        print(f"❌ Error while checking NumPy: {e}")

check_and_fix_numpy()

In [None]:
# Upgrade pip first, then install the required libraries with compatible versions
!pip install --upgrade pip
!pip install "numpy<2.2" numba
!pip install ydata_profiling

## **Automated EDA with ydata-profiling (formerly Pandas Profiling)**

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import skew, pearsonr
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')


In [None]:
# 1. Read the dataset
#import kagglehub

# Load the training and test sets
df = pd.read_csv("train.csv")
display(df.head())

df_test = pd.read_csv("test.csv")
display(df_test.head())  # call head(); without parentheses only the bound method is shown



In [None]:
# 2. Dataset shape
df.shape


In [None]:
# 3. Dataset info (dtypes and non-null counts)
df.info()

In [None]:
# 4. Generate the EDA report
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Customer Churn Prediction EDA Report", explorative=True)
profile.to_notebook_iframe()  # View the report in Jupyter Notebook
profile.to_file("eda_report.html")  # Export to HTML

# Univariate Analysis
Generate frequency plots for the categorical columns of the training data (`df`, loaded from `train.csv`), restricted to customers who have churned (where 'Churn' is 'Yes').

## Identify categorical columns

Identify the categorical columns suitable for frequency plotting by iterating through the DataFrame columns, keeping those with `object` dtype and fewer than 10 unique values, and excluding the 'customerID' column.



In [None]:
categorical_columns = []
for col in df.columns:
    if df[col].dtype == 'object' and col != 'customerID':
        if df[col].nunique() < 10:
            categorical_columns.append(col)

print("Categorical columns suitable for frequency plotting:")
print(categorical_columns)

## Filter data

Filter the DataFrame to include only the rows where 'Churn' is 'Yes' and display the head to verify.



In [None]:
display(df.head(5))
df_churned = df[df['Churn'] == 'Yes']
display(df_churned.head())

## Generate frequency plots

For each identified categorical column (excluding 'Churn'), create a frequency bar plot showing the distribution of values among churned customers.



In [None]:
categorical_columns = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']

for col in categorical_columns:
    plt.figure(figsize=(8, 6))
    df_churned[col].value_counts().plot(kind='bar', color='deeppink')
    plt.title(f'Frequency of {col} for Churned Customers')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

## Summary:

### Data Analysis Key Findings

*   **Gender:** Among churned customers, the distribution of gender is nearly equal.
*   **Partner:** A significantly higher number of churned customers do not have a partner compared to those who do.
*   **Dependents:** The vast majority of churned customers do not have dependents.
*   **PhoneService:** Almost all churned customers have phone service.
*   **MultipleLines:** Churned customers are relatively split between having no multiple lines and having multiple lines.
*   **InternetService:** The highest frequency of churned customers have Fiber optic internet service, followed by DSL.
*   **OnlineSecurity:** A substantial majority of churned customers do not have online security.
*   **OnlineBackup:** A large number of churned customers do not have online backup.
*   **DeviceProtection:** More churned customers do not have device protection than those who do.
*   **TechSupport:** A significant majority of churned customers do not have tech support.
*   **StreamingTV:** Churned customers are relatively split between having and not having streaming TV.
*   **StreamingMovies:** Churned customers are relatively split between having and not having streaming movies.
*   **Contract:** The overwhelming majority of churned customers are on a Month-to-month contract.
*   **PaperlessBilling:** A much higher number of churned customers use paperless billing compared to those who do not.
*   **PaymentMethod:** Electronic check is the most frequent payment method among churned customers, followed by Mailed check.
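
The findings above are raw counts among churned customers, so a category can look prominent simply because it is common overall. A complementary view is the churn rate within each category. The sketch below uses a made-up `toy` frame (an assumption for illustration only); with the notebook's data the same call would take `df[col]` and `df['Churn']` before the target is encoded.

```python
import pandas as pd

# Hypothetical toy data: 6 month-to-month customers and 4 two-year customers.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 6 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "No"],
})

# Row-normalized crosstab: each row sums to 1, so the "Yes" column is the
# churn rate inside that category.
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(rates["Yes"])
```

Comparing these rates across categories separates "many churners because many customers" from "many churners because the category is risky".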



# Correlation Analysis of Churn vs. Other Variables
Analyze the correlation of all columns except 'customerID' with the 'Churn' column in the training data (`df`).

## Data preparation

Convert the 'TotalCharges' column to a numeric type, coercing unparseable values to NaN, and fill the resulting missing values with 0. Also encode the categorical 'Churn' column numerically (Yes = 1, No = 0).
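
To make the coercion step concrete, here is a small sketch on a hypothetical series (the values in `raw` are made up): a blank string cannot be parsed as a number, so `errors="coerce"` turns it into NaN, which is then filled with 0.

```python
import pandas as pd

# Hypothetical raw values as they might appear in a text column; " " is a blank entry.
raw = pd.Series(["29.85", " ", "1889.5"])

# Unparseable entries become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")
n_missing = int(numeric.isna().sum())

# Replace the NaN values with 0, as the notebook does.
filled = numeric.fillna(0)
print(n_missing)
print(filled.tolist())
```

Counting the NaN values before filling them shows how many rows the choice of fill value actually affects.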



In [None]:
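# Convert TotalCharges to numeric; unparseable entries become NaN and are filled with 0.
# Assigning the result back avoids chained in-place modification of a column.
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce').fillna(0)
# Encode the target: Yes -> 1, anything else -> 0
df['Churn'] = (df['Churn'] == 'Yes').astype(int)
display(df.head())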

## Handle categorical features

Convert the remaining categorical columns into a numerical format suitable for correlation calculation. First drop the non-informative 'customerID' column, then one-hot encode all remaining categorical columns with `pd.get_dummies()`, setting `drop_first=True` so that one level per variable is dropped and the dummy columns are not perfectly collinear. Finally, display the head of the resulting DataFrame to verify the changes.
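
As an illustration of what `drop_first=True` does, the sketch below encodes a single made-up column (the `toy` frame is an assumption). The first category in sorted order is dropped and becomes the implicit baseline, represented by all zeros.

```python
import pandas as pd

# Hypothetical single-column frame using the Contract categories.
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

# drop_first=True removes the first level ("Month-to-month"), leaving two dummy columns.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(list(encoded.columns))
print(encoded)
```

A row with zeros in every remaining dummy column therefore belongs to the dropped level.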



In [None]:
df_encoded = df.drop('customerID', axis=1)
df_encoded = pd.get_dummies(df_encoded, drop_first=True)
display(df_encoded.head())

## Calculate correlation

Calculate the correlation matrix of the processed DataFrame, extract the correlation of every column with the 'Churn' column, then sort and display the values.



In [None]:
correlation_matrix = df_encoded.corr()
churn_correlation = correlation_matrix['Churn'].sort_values(ascending=False)
display(churn_correlation)

## Visualize correlation

Generate a heatmap to visualize the correlations with the 'Churn' column.



In [None]:
plt.figure(figsize=(8, 10))
sns.heatmap(churn_correlation.to_frame(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation with Churn')
plt.show()

## Findings:

*   The 'TotalCharges' column was successfully converted to numeric type, and missing values were filled with 0.
*   The 'Churn' column was encoded numerically (Yes=1, No=0).
*   Categorical features (excluding 'customerID') were successfully one-hot encoded.
*   Features with the strongest positive correlation with 'Churn' include 'InternetService\_Fiber optic' (correlation ~0.31) and 'PaymentMethod\_Electronic check' (correlation ~0.30).
*   Features with the strongest negative correlation with 'Churn' include 'tenure' (correlation ~-0.35) and 'Contract\_Two year' (correlation ~-0.30).


# Logistic Regression Model Implementation
Implement a logistic regression model with scikit-learn to predict service churn, using only the variables most strongly correlated with churn (correlation above 0.2 or below -0.2).

Most strongly correlated variables: InternetService_Fiber optic, PaymentMethod_Electronic check, InternetService_No, StreamingTV_No internet service, OnlineSecurity_No internet service, OnlineBackup_No internet service, DeviceProtection_No internet service, StreamingMovies_No internet service, TechSupport_No internet service, Contract_Two year, tenure

Use 70% of the data to train the model and the remaining 30% to test it, then evaluate the model's performance.

### Analysis steps for the training data (`train.csv`):
1. Create frequency plots for columns where 'Churn' is 'Yes'.
2. Calculate the correlation matrix for all columns except 'customerID' with the 'Churn' column.
3. Implement a logistic regression model using scikit-learn to predict 'Churn'. Use only the columns with an absolute correlation greater than 0.2 with 'Churn' as features (the variables listed above).
4. Split the data into 70% for training and 30% for testing.
5. Train the logistic regression model on the training data.
6. Make predictions on the testing data.
7. Evaluate the model's performance using appropriate metrics.

## Select features

Based on the correlation analysis, select the features with an absolute correlation greater than 0.2 with the 'Churn' column as the independent variables (X), taken from the `df_encoded` DataFrame. Set the 'Churn' column as the dependent variable (y), then display the head of `X` and `y` to verify the selection.
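
The threshold rule can also be applied programmatically instead of typing the feature names by hand. This is a minimal sketch on a hypothetical correlation series: `churn_corr` stands in for the notebook's `churn_correlation`, and the `gender_Male` value is made up for illustration.

```python
import pandas as pd

# Hypothetical correlations with Churn; only a few columns are shown.
churn_corr = pd.Series({
    "Churn": 1.0,
    "InternetService_Fiber optic": 0.31,
    "gender_Male": -0.01,   # made-up value for illustration
    "tenure": -0.35,
    "Contract_Two year": -0.30,
})

threshold = 0.2

# Keep columns whose absolute correlation exceeds the threshold,
# then drop the target itself (its self-correlation is always 1).
selected = churn_corr[churn_corr.abs() > threshold].drop("Churn").index.tolist()
print(selected)
```

Deriving the list this way keeps it in sync with the correlation table if the data or the threshold changes.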



In [None]:
selected_features = [
    'InternetService_Fiber optic', 'PaymentMethod_Electronic check',
    'InternetService_No', 'StreamingTV_No internet service',
    'OnlineSecurity_No internet service', 'OnlineBackup_No internet service',
    'DeviceProtection_No internet service', 'StreamingMovies_No internet service',
    'TechSupport_No internet service', 'Contract_Two year', 'tenure'
]

X = df_encoded[selected_features]
y = df_encoded['Churn']

print("Feature matrix (X) head:")
display(X.head())
print("\nTarget vector (y) head:")
display(y.head())

## Split data

Split the dataset into training and testing sets using a 70/30 ratio.
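
Because churners are the minority class, one optional variation is a stratified split, which keeps the class proportions equal in both subsets. The sketch below uses synthetic arrays (an assumption); it is an alternative to consider, not what the notebook's reported results are based on.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with 30% positives, and a dummy single-feature matrix.
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the 70/30 class balance in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(len(y_te), int(y_te.sum()))
```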



In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

## Initialize and train model

Initialize a Logistic Regression model and train it using the training data.



In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)
print("Logistic Regression model trained successfully.")

With the model trained, make predictions on the test set and evaluate them.



In [None]:
y_pred = model.predict(X_test)
print("Predictions made on the test set.")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

test_probabilities = model.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, test_probabilities)
print("\nROC AUC Score:", roc_auc)

fpr, tpr, thresholds = roc_curve(y_test, test_probabilities)
plt.figure()
plt.plot(fpr, tpr, label=f'Logistic Regression (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

### The model reached about 80% accuracy with good discriminative power (AUC = 0.84)

## Key Findings

*   The data was split into training (70%) and testing (30%) sets, with 4930 samples for training and 2113 for testing.
*   A logistic regression model was trained using features with an absolute correlation greater than 0.2 with the 'Churn' column.
*   The trained model achieved an accuracy of 0.80 on the test set.
*   The ROC AUC score for the model was 0.84, indicating good discriminative power.
*   The confusion matrix shows that the model correctly predicted 1408 non-churned customers and 274 churned customers in the test set.
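
The reported accuracy follows directly from the counts above, as this short worked check shows (the variable names are illustrative).

```python
# Counts taken from the findings above.
n_test = 2113
true_negatives = 1408   # non-churned customers predicted correctly
true_positives = 274    # churned customers predicted correctly

# Accuracy is the share of correct predictions over all test samples.
accuracy = (true_negatives + true_positives) / n_test
misclassified = n_test - (true_negatives + true_positives)
print(round(accuracy, 3))  # 0.796
print(misclassified)       # 431
```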

## Conclusion

*   The current model provides a good starting point for churn prediction.
*   Investigating the coefficients of the logistic regression model can help understand the impact of each selected feature on the probability of churn, which can provide useful business insights.
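
One common way to read logistic regression coefficients is as odds ratios, obtained by exponentiating them. The sketch below fits a model on synthetic data (the generating coefficients, the `fiber` column, and the sample size are assumptions for illustration); in the notebook the same two lines would use `model.coef_[0]` and `X_train.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: churn gets less likely with tenure and more likely with fiber.
rng = np.random.default_rng(0)
n = 2000
tenure = rng.integers(0, 73, size=n)
fiber = rng.integers(0, 2, size=n)
logit = -0.5 - 0.05 * tenure + 1.0 * fiber
churn = rng.random(n) < 1 / (1 + np.exp(-logit))

X_demo = pd.DataFrame({"tenure": tenure, "fiber": fiber})
clf = LogisticRegression(max_iter=1000).fit(X_demo, churn)

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-unit increase in the feature, holding the others fixed.
odds_ratios = pd.Series(np.exp(clf.coef_[0]), index=X_demo.columns)
print(odds_ratios)
```

An odds ratio below 1 means the feature lowers the odds of churn, and a value above 1 means it raises them. Features on very different scales, such as months of tenure versus a 0/1 flag, are not directly comparable in magnitude.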


## Visualize ROC and Precision-Recall Curves

Using the predicted probabilities from the trained logistic regression model, plot the ROC curve and the Precision-Recall curve.

In [None]:
from sklearn.metrics import precision_recall_curve, auc

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, test_probabilities)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'Logistic Regression (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

# Plot Precision-Recall curve
precision, recall, _ = precision_recall_curve(y_test, test_probabilities)
pr_auc = auc(recall, precision)

plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label=f'Logistic Regression (PR AUC = {pr_auc:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="lower left")
plt.show()

## Summary of Visualizations

The ROC curve shows the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) at various threshold settings. A higher AUC indicates better discrimination ability of the model.

The Precision-Recall curve shows the trade-off between precision (the proportion of correctly predicted positive instances among all predicted positive instances) and recall (the proportion of correctly predicted positive instances among all actual positive instances) at various threshold settings. This curve is particularly useful when dealing with imbalanced datasets.
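
A useful reference point when reading a Precision-Recall curve is the positive-class prevalence, which is the precision a random ranking would achieve on average. The sketch below compares it with average precision on made-up labels and scores (both arrays are assumptions for illustration).

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical labels (30% positives) and model scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.35, 0.25, 0.8, 0.9])

# Baseline: share of positives in the data.
baseline = y_true.mean()

# Average precision summarizes the Precision-Recall curve in one number.
ap = average_precision_score(y_true, scores)
print(baseline, ap)
```

The further average precision sits above the prevalence, the more the model's ranking helps on the minority class.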

In [None]:
# Generate the Kaggle submission file from the already-loaded df_test
from datetime import datetime
import os

print("🎯 Generating Kaggle submission file...")

# Reuse df_test, which is already loaded (no need to re-read test.csv)
print(f"📊 Using existing df_test: {df_test.shape}")

# Step 1: Save the customer IDs from the test data
customer_ids_test = df_test['customerID'].copy()
print(f"📊 Test customer IDs: {len(customer_ids_test)}")

# Step 2: Apply the same preprocessing used on the training data
print("🔧 Preprocessing df_test...")

# Work on a copy
test_processed = df_test.copy()

# Convert TotalCharges to numeric (same treatment as training);
# assigning the result back avoids chained in-place modification
test_processed['TotalCharges'] = pd.to_numeric(test_processed['TotalCharges'], errors='coerce').fillna(0)

# Drop customerID and apply one-hot encoding
test_encoded = test_processed.drop('customerID', axis=1)
test_encoded = pd.get_dummies(test_encoded, drop_first=True)

print(f"📊 Encoded test data: {test_encoded.shape}")
print(f"📋 Columns after encoding: {len(test_encoded.columns)}")

# Step 3: Make sure the features match those of the trained model
selected_features = [
    'InternetService_Fiber optic', 'PaymentMethod_Electronic check',
    'InternetService_No', 'StreamingTV_No internet service',
    'OnlineSecurity_No internet service', 'OnlineBackup_No internet service',
    'DeviceProtection_No internet service', 'StreamingMovies_No internet service',
    'TechSupport_No internet service', 'Contract_Two year', 'tenure'
]

# Check for missing features and create them
print("🔍 Checking model features...")
missing_features = []
for feature in selected_features:
    if feature not in test_encoded.columns:
        print(f"⚠️ Missing feature: {feature} - creating it with value 0")
        test_encoded[feature] = 0
        missing_features.append(feature)
    else:
        print(f"✅ {feature}")

if missing_features:
    print(f"📝 Created {len(missing_features)} missing features")
else:
    print("✅ All features are present")

# Keep only the model's features, in the training order
X_test_submission = test_encoded[selected_features]
print(f"📊 Final features for prediction: {X_test_submission.shape}")

# Step 4: Generate probability predictions
print("🤖 Generating predictions for submission...")
submission_probabilities = model.predict_proba(X_test_submission)[:, 1]

print(f"📊 Predictions generated: {len(submission_probabilities)}")
print(f"📊 Customer IDs: {len(customer_ids_test)}")

# Check that the lengths match
if len(submission_probabilities) == len(customer_ids_test):
    print("✅ Lengths match")

    # Step 5: Create the submission file
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Create the submissions directory if it does not exist
    os.makedirs("submissions", exist_ok=True)

    submission_file = f"submissions/submission_grupoM_{timestamp}.csv"
    print(f"📁 Submission file: {submission_file}")

    # Build the submission DataFrame
    submission_df = pd.DataFrame({
        'customerID': customer_ids_test,
        'Churn': submission_probabilities
    })

    # Save the file
    submission_df.to_csv(submission_file, index=False)

    print(f"✅ File '{submission_file}' generated successfully")

    # Show prediction statistics
    print("\n📈 Prediction statistics:")
    print(f"   - Total predictions: {len(submission_probabilities):,}")
    print(f"   - Mean: {submission_probabilities.mean():.4f}")
    print(f"   - Minimum: {submission_probabilities.min():.4f}")
    print(f"   - Maximum: {submission_probabilities.max():.4f}")
    print(f"   - Predictions > 0.5 (Churn): {(submission_probabilities > 0.5).sum():,} ({(submission_probabilities > 0.5).mean()*100:.1f}%)")

    print("\n📋 First 10 rows of the submission file:")
    display(submission_df.head(10))

    print("\n🎯 File ready to upload to Kaggle!")

else:
    print("❌ Error: lengths do not match")
    print(f"   - Predictions: {len(submission_probabilities)}")
    print(f"   - Customer IDs: {len(customer_ids_test)}")