# Support Vector Machine - Example - Churn

**Context**  
This dataset contains details of a bank's customers and a binary variable indicating whether each customer closed their account or remains a customer.

**Content**  
The dataset comes from Kaggle: [Churn Modelling](https://www.kaggle.com/datasets/shrutimechlearn/churn-modelling).  
It contains 10,000 rows with the following columns:  

| Variable        | Definition                                                             | Value           |
| --------------- | ---------------------------------------------------------------------- | --------------- |
| RowNumber       | Row number                                                             | Numeric         |
| CustomerId      | Customer ID                                                            | Numeric         |
| Surname         | Customer's surname                                                     | String          |
| CreditScore     | Credit score                                                           | Numeric         |
| Geography       | Customer's country                                                     | String          |
| Gender          | Customer's gender                                                      | Female, Male    |
| Age             | Customer's age                                                         | Years           |
| Tenure          | Number of years the customer has been with the bank                    | Years           |
| Balance         | Customer's account balance                                             | Numeric         |
| NumOfProducts   | Number of products the customer uses                                   | Numeric         |
| HasCrCard       | Whether the customer has a credit card with the bank                   | 0 = No, 1 = Yes |
| IsActiveMember  | Whether the customer is an active member of the bank                   | 0 = No, 1 = Yes |
| EstimatedSalary | Customer's estimated salary                                            | USD             |
| Exited          | Whether the customer closed their bank account **(target variable)**   | 0 = No, 1 = Yes |

**Problem statement**  
The goal is to predict whether a customer will close their account, based on the customer's characteristics.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

## Load Data

In [2]:
# Load the data
df = pd.read_csv('Churn_Modelling.csv')
df.head()

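|   | RowNumber | CustomerId | Surname  | CreditScore | Geography | Gender | Age | Tenure | Balance   | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
| - | --------- | ---------- | -------- | ----------- | --------- | ------ | --- | ------ | --------- | ------------- | --------- | -------------- | --------------- | ------ |
| 0 | 1         | 15634602   | Hargrave | 619         | France    | Female | 42  | 2      | 0.0       | 1             | 1         | 1              | 101348.88       | 1      |
| 1 | 2         | 15647311   | Hill     | 608         | Spain     | Female | 41  | 1      | 83807.86  | 1             | 0         | 1              | 112542.58       | 0      |
| 2 | 3         | 15619304   | Onio     | 502         | France    | Female | 42  | 8      | 159660.8  | 3             | 1         | 0              | 113931.57       | 1      |
| 3 | 4         | 15701354   | Boni     | 699         | France    | Female | 39  | 1      | 0.0       | 2             | 0         | 0              | 93826.63        | 0      |
| 4 | 5         | 15737888   | Mitchell | 850         | Spain     | Female | 43  | 2      | 125510.82 | 1             | 1         | 1              | 79084.1         | 0      |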


In [3]:
# Rename columns
df.columns = ['num_renglon', 'id_cliente', 'apellido', 'calificacion_credito', 'geografia', 'genero',
              'edad', 'permanencia', 'balance', 'num_productos', 'tarjeta', 'activo', 'sueldo', 'cancelacion']

## Mappings

In [4]:
print(df['geografia'].unique())

['France' 'Spain' 'Germany']


In [5]:
# Map each country to an integer code, touching only the 'geografia' column
# (a DataFrame-wide replace could also rewrite matching values in other columns, e.g. surnames)
df['geografia'] = df['geografia'].map({'France': 0, 'Spain': 1, 'Germany': 2})
df.head()

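|   | num_renglon | id_cliente | apellido | calificacion_credito | geografia | genero | edad | permanencia | balance   | num_productos | tarjeta | activo | sueldo    | cancelacion |
| - | ----------- | ---------- | -------- | -------------------- | --------- | ------ | ---- | ----------- | --------- | ------------- | ------- | ------ | --------- | ----------- |
| 0 | 1           | 15634602   | Hargrave | 619                  | 0         | Female | 42   | 2           | 0.0       | 1             | 1       | 1      | 101348.88 | 1           |
| 1 | 2           | 15647311   | Hill     | 608                  | 1         | Female | 41   | 1           | 83807.86  | 1             | 0       | 1      | 112542.58 | 0           |
| 2 | 3           | 15619304   | Onio     | 502                  | 0         | Female | 42   | 8           | 159660.8  | 3             | 1       | 0      | 113931.57 | 1           |
| 3 | 4           | 15701354   | Boni     | 699                  | 0         | Female | 39   | 1           | 0.0       | 2             | 0       | 0      | 93826.63  | 0           |
| 4 | 5           | 15737888   | Mitchell | 850                  | 1         | Female | 43   | 2           | 125510.82 | 1             | 1       | 1      | 79084.1   | 0           |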
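
The integer codes 0, 1, 2 imply an ordering between countries that the data does not necessarily have, and a distance-based model such as an RBF-kernel SVM may pick up on that artificial order. A common alternative is one-hot encoding. The sketch below is illustrative only: it uses a small toy frame (the values are made up, not read from `Churn_Modelling.csv`) and it is not what this notebook does in the following cells.

```python
import pandas as pd

# Toy frame that reuses the 'geografia' column name; the rows are illustrative
toy = pd.DataFrame({'geografia': ['France', 'Spain', 'Germany', 'France']})

# One indicator column per country; drop_first removes the redundant first category
encoded = pd.get_dummies(toy, columns=['geografia'], drop_first=True, dtype=int)
print(encoded)
```

With `drop_first=True`, France becomes the baseline category, represented by zeros in both remaining indicator columns.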


In [6]:
print(df['genero'].unique())

['Female' 'Male']


In [7]:
# Map gender to an integer code, touching only the 'genero' column
df['genero'] = df['genero'].map({'Female': 0, 'Male': 1})
df.head()

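|   | num_renglon | id_cliente | apellido | calificacion_credito | geografia | genero | edad | permanencia | balance   | num_productos | tarjeta | activo | sueldo    | cancelacion |
| - | ----------- | ---------- | -------- | -------------------- | --------- | ------ | ---- | ----------- | --------- | ------------- | ------- | ------ | --------- | ----------- |
| 0 | 1           | 15634602   | Hargrave | 619                  | 0         | 0      | 42   | 2           | 0.0       | 1             | 1       | 1      | 101348.88 | 1           |
| 1 | 2           | 15647311   | Hill     | 608                  | 1         | 0      | 41   | 1           | 83807.86  | 1             | 0       | 1      | 112542.58 | 0           |
| 2 | 3           | 15619304   | Onio     | 502                  | 0         | 0      | 42   | 8           | 159660.8  | 3             | 1       | 0      | 113931.57 | 1           |
| 3 | 4           | 15701354   | Boni     | 699                  | 0         | 0      | 39   | 1           | 0.0       | 2             | 0       | 0      | 93826.63  | 0           |
| 4 | 5           | 15737888   | Mitchell | 850                  | 1         | 0      | 43   | 2           | 125510.82 | 1             | 1       | 1      | 79084.1   | 0           |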


In [8]:
# Select the columns used for modeling (drops row number, customer id and surname)
df = df[['calificacion_credito', 'geografia', 'genero', 'edad', 'permanencia', 'balance',
         'num_productos', 'tarjeta', 'activo', 'sueldo', 'cancelacion']]
df.head()

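|   | calificacion_credito | geografia | genero | edad | permanencia | balance   | num_productos | tarjeta | activo | sueldo    | cancelacion |
| - | -------------------- | --------- | ------ | ---- | ----------- | --------- | ------------- | ------- | ------ | --------- | ----------- |
| 0 | 619                  | 0         | 0      | 42   | 2           | 0.0       | 1             | 1       | 1      | 101348.88 | 1           |
| 1 | 608                  | 1         | 0      | 41   | 1           | 83807.86  | 1             | 0       | 1      | 112542.58 | 0           |
| 2 | 502                  | 0         | 0      | 42   | 8           | 159660.8  | 3             | 1       | 0      | 113931.57 | 1           |
| 3 | 699                  | 0         | 0      | 39   | 1           | 0.0       | 2             | 0       | 0      | 93826.63  | 0           |
| 4 | 850                  | 1         | 0      | 43   | 2           | 125510.82 | 1             | 1       | 1      | 79084.1   | 0           |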


## Standardization

In [9]:
# Independent variables
X = df[['calificacion_credito', 'geografia', 'genero', 'edad', 'permanencia', 'balance',
        'num_productos', 'tarjeta', 'activo', 'sueldo']]
X.head()

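|   | calificacion_credito | geografia | genero | edad | permanencia | balance   | num_productos | tarjeta | activo | sueldo    |
| - | -------------------- | --------- | ------ | ---- | ----------- | --------- | ------------- | ------- | ------ | --------- |
| 0 | 619                  | 0         | 0      | 42   | 2           | 0.0       | 1             | 1       | 1      | 101348.88 |
| 1 | 608                  | 1         | 0      | 41   | 1           | 83807.86  | 1             | 0       | 1      | 112542.58 |
| 2 | 502                  | 0         | 0      | 42   | 8           | 159660.8  | 3             | 1       | 0      | 113931.57 |
| 3 | 699                  | 0         | 0      | 39   | 1           | 0.0       | 2             | 0       | 0      | 93826.63  |
| 4 | 850                  | 1         | 0      | 43   | 2           | 125510.82 | 1             | 1       | 1      | 79084.1   |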


In [10]:
# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
X_adj = scaler.fit_transform(X)
print(X_adj)

[[-0.32622142 -0.9025865  -1.09598752 ...  0.64609167  0.97024255
   0.02188649]
 [-0.44003595  0.301665   -1.09598752 ... -1.54776799  0.97024255
   0.21653375]
 [-1.53679418 -0.9025865  -1.09598752 ...  0.64609167 -1.03067011
   0.2406869 ]
 ...
 [ 0.60498839 -0.9025865  -1.09598752 ... -1.54776799  0.97024255
  -1.00864308]
 [ 1.25683526  1.50591651  0.91241915 ...  0.64609167 -1.03067011
  -0.12523071]
 [ 1.46377078 -0.9025865  -1.09598752 ...  0.64609167 -1.03067011
  -1.07636976]]
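
To make the transformation concrete: for each column, `StandardScaler` subtracts the column mean and divides by the column's population standard deviation. The sketch below checks this on a small made-up array (the numbers are illustrative and do not come from the dataset).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only: two columns on very different scales
toy = np.array([[600.0, 30.0],
                [700.0, 40.0],
                [800.0, 50.0]])

# Manual z-score per column; np.std defaults to the population formula (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

scaled = StandardScaler().fit_transform(toy)

# True when the manual computation agrees with StandardScaler
print(np.allclose(manual, scaled))
```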


In [11]:
# Dependent variable
y = df['cancelacion']
y.head()

0    1
1    0
2    1
3    0
4    0
Name: cancelacion, dtype: int64

In [12]:
print('X:', len(X_adj), 'y:', len(y))

X: 10000 y: 10000


## Modeling

In [13]:
# Training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_adj, y, test_size=0.3, random_state=0)

In [14]:
print('X_train:', len(X_train), 'y_train:', len(y_train))
print('X_test:',  len(X_test),  'y_test:',  len(y_test))

X_train: 7000 y_train: 7000
X_test: 3000 y_test: 3000


In [15]:
# Training
model = SVC(kernel='rbf')
model.fit(X_train, y_train)
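
In this notebook the scaler is fitted on all 10,000 rows before the split, so the test rows contribute to the means and standard deviations used in training. A common way to avoid this is to split first and wrap the scaler and the model in a pipeline, so the scaler is fitted on the training rows only. The sketch below is a variation, not the notebook's procedure: it runs on synthetic data from `make_classification` as a stand-in for the churn features, and the names `X_demo`, `y_demo` and `pipe` are hypothetical. It also passes `stratify`, which keeps the class proportions similar in both subsets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the churn features (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.8, 0.2], random_state=0)

# Split first, keeping the class ratio similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# The scaler is fitted inside the pipeline, using the training rows only
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
pipe.fit(X_tr, y_tr)

# Mean accuracy on the held-out rows
print(pipe.score(X_te, y_te))
```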

In [16]:
# Predictions
prediction = model.predict(X_test)
prediction

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

## Evaluation

In [17]:
print(confusion_matrix(y_test, prediction))

[[2320   59]
 [ 355  266]]
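
In scikit-learn's layout, rows are the true classes and columns are the predicted classes, so the matrix reads as `[[TN, FP], [FN, TP]]` when class 1 (account closed) is the positive class. As a worked check, the sketch below derives precision, recall, F1 and accuracy for class 1 directly from the four counts printed above; these are the same quantities that `classification_report` reports.

```python
import numpy as np

# Counts copied from the confusion matrix printed above
cm = np.array([[2320, 59],
               [355, 266]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)          # of the predicted churners, how many churned
recall = tp / (tp + fn)             # of the actual churners, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

The low recall comes from the 355 false negatives: more than half of the 621 customers who actually left were predicted to stay.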


In [18]:
print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       0.87      0.98      0.92      2379
           1       0.82      0.43      0.56       621

    accuracy                           0.86      3000
   macro avg       0.84      0.70      0.74      3000
weighted avg       0.86      0.86      0.84      3000
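
The report shows that the classes are imbalanced (621 of the 3,000 test rows belong to class 1) and that recall for class 1 is only 0.43. One option that is often tried in this situation is `class_weight='balanced'` in `SVC`, which weights errors on each class inversely to its frequency. The sketch below only shows how such a comparison could be set up: it uses synthetic data, the names `X_demo`, `y_demo` and `recalls` are hypothetical, and it makes no claim about the effect on the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced data, roughly 80% / 20% (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# Compare recall on the minority class with and without class weighting
recalls = {}
for cw in (None, 'balanced'):
    clf = SVC(kernel='rbf', class_weight=cw).fit(X_tr, y_tr)
    recalls[str(cw)] = recall_score(y_te, clf.predict(X_te))

print(recalls)
```

Higher recall on the minority class usually comes at the cost of lower precision, so both should be compared before choosing a setting.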

