## Data understanding

We will work with the Churn-Modeling dataset downloaded from ***Kaggle***. It contains details of a bank's customers, and the target is a binary variable indicating whether the customer left the bank (closed their account) or remains a customer.

The features capture the customer's sociodemographic information, their financial products, and the behavior and balance of their account. They are:

* **CustomerId**: Unique ID that identifies the customer.
* **Surname**: Customer's surname.
* **CreditScore**: Customer's credit score.
* **Geography**: Country the customer belongs to.
* **Gender**: Gender.
* **Age**: Age.
* **Tenure**: Number of years the customer has been with the bank.
* **Balance**: Customer's account balance.
* **NumOfProducts**: Number of bank products the customer uses.
* **HasCrCard**: Whether the customer has a credit card with the bank.
* **IsActiveMember**: Whether the customer is an active member of the bank.
* **EstimatedSalary**: Estimated salary in dollars.
* **Exited**: 1 if the customer closed their account with the bank; 0 if the customer was retained.

In [1]:
import warnings
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

warnings.filterwarnings(action="ignore")

In [2]:
churn = pd.read_csv(
    "https://raw.githubusercontent.com/stivenlopezg/DS-ONLINE-76/master/data/churn-modeling.csv",
    dtype={"CustomerId": "category"},
)
churn.head()

|  | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | Yes | Yes | 101348.88 | 1 |
| 1 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | No | Yes | 112542.58 | 0 |
| 2 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | Yes | No | 113931.57 | 1 |
| 3 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | No | No | 93826.63 | 0 |
| 4 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | Yes | Yes | 79084.1 | 0 |
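
Before preprocessing, it usually helps to look at the class balance of the target and at missing values per column. The following is a minimal sketch on a toy frame built from a few columns of the five preview rows above; the names `preview`, `class_share`, and `missing_per_column` are illustrative and not part of the original notebook. On the real data the same two calls would be applied to `churn`.

```python
import pandas as pd

# Toy frame: the five preview rows, restricted to a few columns (illustrative only)
preview = pd.DataFrame(
    {
        "Geography": ["France", "Spain", "France", "France", "Spain"],
        "Balance": [0.0, 83807.86, 159660.8, 0.0, 125510.82],
        "HasCrCard": ["Yes", "No", "Yes", "No", "Yes"],
        "Exited": [1, 0, 1, 0, 0],
    }
)

# Share of each class in the binary target
class_share = preview["Exited"].value_counts(normalize=True)

# Number of missing values in each column
missing_per_column = preview.isna().sum()

print(class_share)
print(missing_per_column)
```

Five rows say nothing about the full dataset; the point is only the pattern of the check.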


### Preprocessing

In [3]:
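# Identifier-like columns carry no predictive signal, so drop them
cols_to_drop = ["CustomerId", "Surname"]

churn.drop(labels=cols_to_drop, axis="columns", inplace=True)

# Split the remaining columns by dtype
numerical_features = churn.select_dtypes(include="number").columns.tolist()
categorical_features = churn.select_dtypes(exclude="number").columns.tolist()

# The target is numeric but is not a feature
numerical_features.remove("Exited")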
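
To make the dtype-based split concrete, here is a small sketch on a toy frame (the frame `toy` and the names `numeric_cols` and `categorical_cols` are hypothetical). It shows that columns holding `Yes`/`No` strings, such as `HasCrCard`, end up in the categorical group, while the numeric target has to be removed by hand.

```python
import pandas as pd

# Toy frame mimicking a few columns of the dataset (illustrative only)
toy = pd.DataFrame(
    {
        "CreditScore": [619, 608, 502],
        "Geography": ["France", "Spain", "France"],
        "Balance": [0.0, 83807.86, 159660.8],
        "HasCrCard": ["Yes", "No", "Yes"],
        "Exited": [1, 0, 1],
    }
)

# Numeric dtypes (int, float) versus everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# The target is numeric, so it must be excluded from the feature list explicitly
numeric_cols.remove("Exited")

print(numeric_cols)
print(categorical_cols)
```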

In [4]:
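# Separate the target from the features
exited = churn.pop("Exited")

# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    churn, exited, test_size=0.3, random_state=42
)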
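
The split above is purely random. When the target is imbalanced, one common option is to pass `stratify` so both splits keep roughly the same class proportions. The sketch below uses a synthetic target with 20% positives; the data and the names `X`, `y`, `X_tr`, `X_te`, `y_tr`, and `y_te` are made up for illustration, and the notebook itself does not stratify.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target: 200 positives out of 1000 rows (illustrative only)
y = pd.Series([1] * 200 + [0] * 800, name="Exited")
X = pd.DataFrame({"feature": np.arange(len(y))})

# stratify=y asks the splitter to preserve the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Positive rate in each split
print(y_tr.mean(), y_te.mean())
```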

In [5]:
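# Numeric features: fill missing values with the median, then standardize
numeric_preprocessing = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Categorical features: fill missing values with the most frequent category
categoric_preprocessing = make_pipeline(SimpleImputer(strategy="most_frequent"))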
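
As an alternative to applying the two pipelines by hand, scikit-learn's `ColumnTransformer` can bundle them, together with a `OneHotEncoder`, into one object that is fitted on the training split and reused on the test split. This is a sketch under assumptions, not the approach the notebook takes: the toy frames `train` and `test` and the name `preprocessor` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing value per column (illustrative only)
train = pd.DataFrame(
    {
        "Age": [42.0, 41.0, np.nan, 39.0],
        "Geography": ["France", "Spain", "France", np.nan],
    }
)
# "Germany" never appears in the training rows
test = pd.DataFrame({"Age": [43.0], "Geography": ["Germany"]})

numeric_steps = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
categoric_steps = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),  # unseen categories become all zeros
)

preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, ["Age"]),
        ("cat", categoric_steps, ["Geography"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Learn medians, means, and categories from the training rows only
train_matrix = preprocessor.fit_transform(train)
test_matrix = preprocessor.transform(test)

print(train_matrix.shape, test_matrix.shape)
```

Because the encoder learns its categories during `fit`, the test matrix always has the same columns as the training matrix, even when a category is missing from or new to the test split.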

In [6]:
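# Assign whole columns (not through .loc) so integer columns can become float after scaling
X_train[numerical_features] = numeric_preprocessing.fit_transform(X_train[numerical_features])
X_train[categorical_features] = categoric_preprocessing.fit_transform(X_train[categorical_features])

# dtype=int keeps the 0/1 indicators on pandas >= 2.0, where the default is bool
X_train = pd.get_dummies(data=X_train, columns=categorical_features, dtype=int)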

X_train.head()

|  | CreditScore | Age | Tenure | Balance | NumOfProducts | EstimatedSalary | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male | HasCrCard_No | HasCrCard_Yes | IsActiveMember_No | IsActiveMember_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9069 | -0.344595 | -0.65675 | -0.34217 | 1.583725 | 0.819663 | 1.248986 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 2603 | -0.095181 | -0.46638 | 0.698162 | 1.344106 | -0.903352 | 1.522114 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 7738 | -0.947345 | -0.561565 | 0.351385 | -1.222055 | 0.819663 | 1.264394 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 1579 | -0.354987 | 0.199916 | 1.04494 | -0.618965 | -0.903352 | 1.647781 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 5058 | 0.642668 | -0.180824 | 1.391718 | 1.152808 | 0.819663 | 0.875726 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
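
The numeric columns above are z-scores: `StandardScaler` subtracts the mean and divides by the standard deviation, both estimated on the data passed to `fit`. The sketch below, with made-up ages and illustrative names (`train_age`, `test_age`, `manual`, `scaled`), checks this by hand and shows why new data should go through `transform` only, so that it is scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for a single numeric feature (illustrative only)
train_age = np.array([[42.0], [41.0], [39.0], [43.0]])
test_age = np.array([[50.0]])

# fit() stores the training mean and standard deviation
scaler = StandardScaler().fit(train_age)

# Manual z-score using training statistics; np.std defaults to the population
# standard deviation (ddof=0), which is what StandardScaler uses
manual = (test_age - train_age.mean()) / train_age.std()

# transform() reuses the stored training statistics
scaled = scaler.transform(test_age)

print(manual, scaled)
```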


In [7]:
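# Reuse the statistics learned on the training split: transform only, no refitting
X_test[numerical_features] = numeric_preprocessing.transform(X_test[numerical_features])
X_test[categorical_features] = categoric_preprocessing.transform(X_test[categorical_features])

X_test = pd.get_dummies(data=X_test, columns=categorical_features, dtype=int)
# Align with the training columns in case a category is missing from the test split
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)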

X_test.head()

|  | CreditScore | Age | Tenure | Balance | NumOfProducts | EstimatedSalary | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male | HasCrCard_No | HasCrCard_Yes | IsActiveMember_No | IsActiveMember_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6252 | -0.583617 | -0.65675 | -0.688948 | 0.324894 | 0.819663 | -1.024156 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 4684 | -0.303026 | 0.390286 | -1.382503 | -1.222055 | 0.819663 | 0.790674 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1731 | -0.531655 | 0.485471 | -0.34217 | -1.222055 | 0.819663 | -0.733117 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4742 | -1.518919 | 1.913248 | 1.04494 | 0.683891 | 0.819663 | 1.212328 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 4521 | -0.957737 | -1.132675 | 0.698162 | 0.777369 | -0.903352 | 0.24046 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
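
One caveat when one-hot encoding each split separately with `pd.get_dummies`: the resulting columns match only if every category appears in both splits. Here that holds, but the toy sketch below shows what can go wrong and one way to align the columns with `reindex`. The frames and names (`train`, `test`, `train_encoded`, `test_encoded`, `test_aligned`) are hypothetical.

```python
import pandas as pd

# Toy splits: the test split happens to contain only one of the three countries
train = pd.DataFrame({"Geography": ["France", "Germany", "Spain"]})
test = pd.DataFrame({"Geography": ["France", "France"]})

train_encoded = pd.get_dummies(train, columns=["Geography"], dtype=int)
test_encoded = pd.get_dummies(test, columns=["Geography"], dtype=int)

# Encoded separately, the test split is missing two indicator columns
print(test_encoded.columns.tolist())

# Reindex to the training columns; categories absent from the test split get 0
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

A model trained on the training columns expects the same columns in the same order at prediction time, which is what the reindexing step guarantees.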
