## Understanding the data

We will work with the Churn-Modeling dataset downloaded from ***Kaggle***. It contains details about a bank's customers, and the target is a binary variable indicating whether the customer left the bank (closed their account) or remains a customer.

The features capture the customer's sociodemographic information, financial product information, and account behavior and balance. They are:

* **CustomerId**: Unique ID that identifies the customer.
* **Surname**: Customer's surname.
* **CreditScore**: Customer's credit score.
* **Geography**: Country the customer belongs to.
* **Gender**: Gender.
* **Age**: Age.
* **Tenure**: Number of years the customer has been with the bank.
* **Balance**: Customer's bank balance.
* **NumOfProducts**: Number of bank products the customer uses.
* **HasCrCard**: Whether the customer has a credit card with the bank.
* **IsActiveMember**: Whether the customer is an active member of the bank.
* **EstimatedSalary**: Estimated salary in dollars.
* **Exited**: 1 if the customer closed their account with the bank; 0 if the customer was retained.


### Loading modules

In [None]:
# !pip install missingno

In [19]:
import numpy as np
import pandas as pd
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

In [4]:
churn = pd.read_csv("https://raw.githubusercontent.com/stivenlopezg/DS-ONLINE-76/master/data/churn-modeling.csv",
                    dtype={"CustomerId": "category"})
churn.head()

,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,15634602,Hargrave,619,France,Female,42,2,0.0,1,Yes,Yes,101348.88,1
1,15647311,Hill,608,Spain,Female,41,1,83807.86,1,No,Yes,112542.58,0
2,15619304,Onio,502,France,Female,42,8,159660.8,3,Yes,No,113931.57,1
3,15701354,Boni,699,France,Female,39,1,0.0,2,No,No,93826.63,0
4,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,Yes,Yes,79084.1,0


In [5]:
churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   CustomerId       10000 non-null  category
 1   Surname          10000 non-null  object  
 2   CreditScore      10000 non-null  int64   
 3   Geography        9980 non-null   object  
 4   Gender           10000 non-null  object  
 5   Age              10000 non-null  int64   
 6   Tenure           10000 non-null  int64   
 7   Balance          10000 non-null  float64 
 8   NumOfProducts    10000 non-null  int64   
 9   HasCrCard        10000 non-null  object  
 10  IsActiveMember   10000 non-null  object  
 11  EstimatedSalary  9988 non-null   float64 
 12  Exited           10000 non-null  int64   
dtypes: category(1), float64(2), int64(5), object(5)
memory usage: 1.3+ MB


In [6]:
# Descriptive statistics

churn.describe()

,CreditScore,Age,Tenure,Balance,NumOfProducts,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,10000.0,9988.0,10000.0
mean,650.5288,38.9218,5.0128,76485.889288,1.5302,100066.908601,0.2037
std,96.653299,10.487806,2.892174,62397.405202,0.581654,57519.993379,0.402769
min,350.0,18.0,0.0,0.0,1.0,11.58,0.0
25%,584.0,32.0,3.0,0.0,1.0,50910.6775,0.0
50%,652.0,37.0,5.0,97198.54,1.0,100185.24,0.0
75%,718.0,44.0,7.0,127644.24,2.0,149388.2475,0.0
max,850.0,92.0,10.0,250898.09,4.0,199992.48,1.0


In [7]:
churn.describe(exclude="number")

,CustomerId,Surname,Geography,Gender,HasCrCard,IsActiveMember
count,10000,10000,9980,10000,10000,10000
unique,10000,2932,3,2,2,2
top,15565701,Smith,France,Male,Yes,Yes
freq,1,32,5008,5457,7055,5151


### Missing data

Let's check whether the dataset contains missing values, expressed as a percentage of rows per column.

In [11]:
churn.isna().mean() * 100

CustomerId         0.00
Surname            0.00
CreditScore        0.00
Geography          0.20
Gender             0.00
Age                0.00
Tenure             0.00
Balance            0.00
NumOfProducts      0.00
HasCrCard          0.00
IsActiveMember     0.00
EstimatedSalary    0.12
Exited             0.00
dtype: float64
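
The percentages above are easier to act on next to absolute counts. A minimal sketch, using a made-up `toy` frame in place of `churn` (the values and the `missing` name are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `churn`; the values are invented for illustration.
toy = pd.DataFrame({
    "Geography": ["France", None, "Spain", "Germany"],
    "EstimatedSalary": [101348.88, 112542.58, np.nan, np.nan],
    "Age": [42, 41, 42, 39],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})

# Keep only columns with at least one missing value, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("pct_missing", ascending=False)
print(missing)
```

The same two expressions can be applied to `churn` directly.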

### Preprocessing

* Numerical variables:
    * Outliers
    * Imputation
    * Scaling
    * Discretization (optional)

* Categorical variables:
    * Imputation
    * Encoding (OHE, LabelEncoder, or OrdinalEncoder)
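
The imputation, scaling and encoding steps listed above can also be bundled into a single object. The following is only a sketch of that alternative, not the approach this notebook takes: it uses scikit-learn's `Pipeline` and `ColumnTransformer` on a made-up `toy` frame, and the names `num_cols`, `cat_cols`, `numeric`, `categorical` and `preprocess` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing value per column type (invented values).
toy = pd.DataFrame({
    "Age": [42.0, 41.0, 39.0, 43.0],
    "EstimatedSalary": [101348.88, np.nan, 93826.63, 79084.10],
    "Geography": pd.Series(["France", "Spain", np.nan, "France"], dtype="object"),
})
num_cols = ["Age", "EstimatedSalary"]
cat_cols = ["Geography"]

# Numerical branch: median imputation, then standardization.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: most-frequent imputation, then one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 asks for a dense array as output.
preprocess = ColumnTransformer(
    [("num", numeric, num_cols), ("cat", categorical, cat_cols)],
    sparse_threshold=0,
)

features = preprocess.fit_transform(toy)
print(features.shape)
```

One advantage of this design is that `preprocess.transform` can later be applied to unseen data with the statistics learned during `fit`.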

In [12]:
cols_to_drop = ["CustomerId", "Surname"]

churn.drop(labels=cols_to_drop, axis=1, inplace=True)

churn.sample(n=1)

,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
6548,683,France,Male,38,7,109346.13,2,Yes,No,102665.92,0


In [16]:
numerical_features = churn.select_dtypes(include="number").columns.tolist()
numerical_features.remove("Exited")

In [17]:
categorical_features = churn.select_dtypes(exclude="number").columns.tolist()
categorical_features

['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']

In [18]:
numerical_features

['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

In [20]:
exited = churn.pop("Exited")

# train_data, test_data, train_label, test_label

X_train, X_test, y_train, y_test = train_test_split(churn, exited,
                                                    test_size=0.3)
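
The split above is random and unseeded, so the counts shown in later outputs will vary from run to run. Since roughly 20% of customers exited (the mean of `Exited` is 0.2037), it may also be worth keeping that proportion equal in both partitions. A sketch under those assumptions, with synthetic data and hypothetical names (`X`, `y`, `X_tr`, `X_te`, `y_tr`, `y_te`); `random_state=42` is an arbitrary choice:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with a 20% positive class, loosely mirroring the churn rate.
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([1] * 20 + [0] * 80, name="Exited")

# stratify=y keeps the class proportion similar in both partitions;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```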

In [22]:
X_train.isna().sum()

CreditScore         0
Geography          16
Gender              0
Age                 0
Tenure              0
Balance             0
NumOfProducts       0
HasCrCard           0
IsActiveMember      0
EstimatedSalary     8
dtype: int64

#### Preprocessing the numerical variables

In [21]:
# Imputation

imputer_num = SimpleImputer(strategy="median").fit(X_train[numerical_features])

imputer_num.statistics_

array([6.5200000e+02, 3.7000000e+01, 5.0000000e+00, 9.7245995e+04,
       1.0000000e+00, 1.0099709e+05])
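
The `statistics_` array holds one value per fitted column, in column order; with `strategy="median"` these are the training medians, computed while ignoring missing entries. A small sketch on invented data (the `train`, `imputer` and `filled` names are illustrative) that checks this against `DataFrame.median`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented training data with one missing salary.
train = pd.DataFrame({
    "Age": [30.0, 40.0, 50.0],
    "EstimatedSalary": [1000.0, np.nan, 3000.0],
})

imputer = SimpleImputer(strategy="median").fit(train)

# The learned statistics match the per-column medians (NaN is skipped).
print(imputer.statistics_)
print(train.median().to_numpy())

# transform fills the gap with the learned median.
filled = imputer.transform(train)
print(filled)
```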

In [23]:
X_train.loc[:, numerical_features] = imputer_num.transform(X_train[numerical_features])

X_train.isna().sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value[:, i].tolist(), pi)


CreditScore         0
Geography          16
Gender              0
Age                 0
Tenure              0
Balance             0
NumOfProducts       0
HasCrCard           0
IsActiveMember      0
EstimatedSalary     0
dtype: int64
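
The warning printed above is pandas' `SettingWithCopyWarning`, which is typically raised when assigning into a frame that pandas treats as derived from another one, as the partitions returned by `train_test_split` can be. The assignment still took effect here (`EstimatedSalary` shows 0 missing values). One common way to avoid the warning, sketched below on invented data with hypothetical names, is to take an explicit copy of each partition before modifying it:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Invented data with a single missing age.
toy = pd.DataFrame({
    "Age": [42.0, 41.0, np.nan, 39.0, 43.0, 44.0],
    "Tenure": [2.0, 1.0, 8.0, 1.0, 2.0, 8.0],
})
labels = pd.Series([1, 0, 1, 0, 0, 1])
cols = ["Age", "Tenure"]

X_tr, X_te, y_tr, y_te = train_test_split(toy, labels, test_size=0.33, random_state=0)

# Explicit copies make it unambiguous that we are editing independent frames.
X_tr = X_tr.copy()
X_te = X_te.copy()

imp = SimpleImputer(strategy="median").fit(X_tr[cols])
X_tr[cols] = imp.transform(X_tr[cols])
X_te[cols] = imp.transform(X_te[cols])

print(X_tr.isna().sum().sum(), X_te.isna().sum().sum())
```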

In [24]:
# Scaling

scaler = StandardScaler().fit(X_train[numerical_features])

X_train.loc[:, numerical_features] = scaler.transform(X_train[numerical_features])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value[:, i].tolist(), pi)


In [25]:
X_train.head()

,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
220,0.786534,France,Female,-0.745541,1.362446,0.611323,0.802502,Yes,Yes,-0.043223
9245,0.317031,France,Female,0.196121,-1.748049,0.704282,0.802502,Yes,No,1.309926
7609,0.317031,France,Male,0.666953,-1.748049,0.471697,-0.910677,Yes,No,-1.655739
8961,-0.496774,France,Male,0.47862,1.362446,-1.218782,-0.910677,Yes,No,0.657394
7188,0.598732,France,Female,-0.463042,-0.365607,-1.218782,-0.910677,Yes,Yes,-0.659392
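
The scaled values above are z-scores: `StandardScaler` subtracts each column's training mean and divides by its standard deviation, $z = (x - \mu) / \sigma$, where $\sigma$ is the population standard deviation (`ddof=0`). A short sketch that reproduces the transformation by hand on a few credit scores taken from the `head()` output (the `x` and `manual` names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single column of credit scores, shaped (n_samples, 1).
x = np.array([[619.0], [608.0], [502.0], [699.0], [850.0]])

scaler = StandardScaler().fit(x)

# Manual z-score; np.std uses ddof=0 by default, matching StandardScaler.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(scaler.mean_, scaler.scale_)
print(np.allclose(scaler.transform(x), manual))
```

After scaling, each column has mean 0 and unit variance on the training data.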


#### Preprocessing the categorical variables

In [26]:
# Imputation

imputer_cat = SimpleImputer(strategy="most_frequent").fit(X_train[categorical_features])

X_train.loc[:, categorical_features] = imputer_cat.transform(X_train[categorical_features])

X_train.isna().sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value[:, i].tolist(), pi)


CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
dtype: int64

In [27]:
# One Hot Encoder

X_train = pd.get_dummies(data=X_train, columns=categorical_features)

X_train

,CreditScore,Age,Tenure,Balance,NumOfProducts,EstimatedSalary,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male,HasCrCard_No,HasCrCard_Yes,IsActiveMember_No,IsActiveMember_Yes
220,0.786534,-0.745541,1.362446,0.611323,0.802502,-0.043223,1,0,0,1,0,0,1,0,1
9245,0.317031,0.196121,-1.748049,0.704282,0.802502,1.309926,1,0,0,1,0,0,1,1,0
7609,0.317031,0.666953,-1.748049,0.471697,-0.910677,-1.655739,1,0,0,0,1,0,1,1,0
8961,-0.496774,0.478620,1.362446,-1.218782,-0.910677,0.657394,1,0,0,0,1,0,1,1,0
7188,0.598732,-0.463042,-0.365607,-1.218782,-0.910677,-0.659392,1,0,0,1,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2601,-0.298540,0.196121,1.362446,-1.218782,0.802502,0.962041,0,0,1,1,0,0,1,0,1
6478,2.080275,-0.368876,1.362446,0.409180,-0.910677,-1.690671,1,0,0,0,1,0,1,0,1
9240,-0.642842,0.666953,1.708057,-1.218782,0.802502,1.181616,0,0,1,1,0,1,0,0,1
9282,-0.329840,-0.463042,-1.402438,-1.218782,-0.910677,0.685160,0,0,1,0,1,0,1,1,0
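
One caveat with `pd.get_dummies` is that it only creates columns for the categories present in the frame it receives, so encoding another partition separately can yield a different set of columns. A possible remedy, sketched here on invented data with hypothetical names (`train`, `test`, `train_enc`, `test_enc`), is to reindex the second frame to the training columns:

```python
import pandas as pd

# The toy test set lacks Germany and Spain, so its dummies would miss two columns.
train = pd.DataFrame({"Geography": ["France", "Germany", "Spain"]})
test = pd.DataFrame({"Geography": ["France", "France"]})

train_enc = pd.get_dummies(train, columns=["Geography"])

# Align to the training layout; categories absent from test are filled with 0.
test_enc = pd.get_dummies(test, columns=["Geography"]).reindex(
    columns=train_enc.columns, fill_value=0
)

print(list(test_enc.columns))
```

scikit-learn's `OneHotEncoder`, fitted on the training data, is another way to guarantee a consistent layout.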


### On the test set

In [28]:
X_test.isna().sum()

CreditScore        0
Geography          4
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    4
dtype: int64

In [29]:
X_test.loc[:, numerical_features] = imputer_num.transform(X_test[numerical_features])

X_test.isna().sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value[:, i].tolist(), pi)


CreditScore        0
Geography          4
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
dtype: int64
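
Note that the test set is only transformed with `imputer_num`, never refitted, so the fill values are the training medians. The same rule would apply to the scaler. A minimal sketch of that fit-on-train, transform-on-test pattern, using invented data and hypothetical names (`train`, `test`, `test_ready`):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

cols = ["Balance", "EstimatedSalary"]
train = pd.DataFrame({
    "Balance": [0.0, 100.0, 200.0],
    "EstimatedSalary": [50.0, 60.0, 70.0],
})
test = pd.DataFrame({
    "Balance": [300.0, 100.0],
    "EstimatedSalary": [np.nan, 80.0],
})

# Both transformers learn their statistics from the training data only.
imputer = SimpleImputer(strategy="median").fit(train[cols])
scaler = StandardScaler().fit(imputer.transform(train[cols]))

# The test set is only transformed, reusing the training median, mean and std.
test_ready = pd.DataFrame(
    scaler.transform(imputer.transform(test[cols])),
    columns=cols,
    index=test.index,
)
print(test_ready)
```

The missing test salary is filled with the training median (60.0), which is also the training mean here, so its scaled value is 0.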