# Data cleaning

The decision taken for each column of the dataset is:

> Keep ✔; Delete ❌; Output (target) ⏭️; Check together 🅰️ 🅱️


`RowNumber` — Corresponds to the record (row) number and has no effect on the output ❌

`CustomerId` — Contains random values and has no effect on customer leaving the bank ❌

`Surname` — The surname of a customer has no impact on their decision to leave the bank ❌

`CreditScore` — Can have an effect on customer churn, since a customer with a higher credit score is less likely to leave the bank ✔
 
`Geography` — A customer’s location can affect their decision to leave the bank ✔

`Gender` — It’s interesting to explore whether gender plays a role in a customer leaving the bank ✔ 

`Age` — This is certainly relevant, since older customers are less likely to leave their bank than younger ones ✔ 

`Tenure` — Refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank ✔ 

`Balance` — Also a very good indicator of customer churn, as people with a higher balance in their accounts are less likely to leave the bank compared to those with lower balances ✔

`NumOfProducts` — Refers to the number of products that a customer has purchased through the bank ✔ 

`HasCrCard` — Denotes whether or not a customer has a credit card. This column is also relevant, since people with a credit card are less likely to leave the bank ✔

`IsActiveMember` — Active customers are less likely to leave the bank ✔

`EstimatedSalary` — As with balance, people with lower salaries are more likely to leave the bank compared to those with higher salaries ✔

`Exited` — Whether or not the customer left the bank ⏭️

`Complain` — Whether or not the customer has filed a complaint 🅰️

`Satisfaction Score` — Score provided by the customer for their complaint resolution 🅰️

`Card Type` — Type of card held by the customer 🅱️

`Point Earned` — The points earned by the customer for using their credit card 🅱️
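The decisions in the list above can be written down once in code so that later steps share a single definition. This is a minimal sketch, assuming the column names match the CSV header (which spells the last column `Point Earned`). The constant names and the `split_features_target` helper are hypothetical and not part of the original notebook, and the identifier values in the sample frame are placeholders.

```python
import pandas as pd

# Column roles copied from the list above; all names here are illustrative.
DROP_COLUMNS = ["RowNumber", "CustomerId", "Surname"]  # marked "delete"
TARGET_COLUMN = "Exited"                               # marked "output"
CHECK_TOGETHER = {                                     # groups A and B
    "A": ["Complain", "Satisfaction Score"],
    "B": ["Card Type", "Point Earned"],
}


def split_features_target(df):
    """Drop the deleted columns and separate the target from the features."""
    features = df.drop(columns=DROP_COLUMNS + [TARGET_COLUMN])
    target = df[TARGET_COLUMN]
    return features, target


# Two-row frame: CreditScore, Geography and Exited come from the first two
# previewed records; the identifier values are made-up placeholders.
sample = pd.DataFrame(
    {
        "RowNumber": [1, 2],
        "CustomerId": [101, 102],
        "Surname": ["Placeholder", "Placeholder"],
        "CreditScore": [619, 608],
        "Geography": ["France", "Spain"],
        "Exited": [1, 0],
    }
)

features, target = split_features_target(sample)
print(list(features.columns))
print(target.tolist())
```

Keeping the roles in named constants means a change of mind about a column only has to be made in one place.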

In [2]:
import pandas as pd
import os

## Delete unnecessary features

In [5]:
# Load the raw records, then drop the identifier columns marked for deletion
data = pd.read_csv(os.path.join("data", "Customer-Churn-Records.csv"))
data = data.drop(columns=["RowNumber", "CustomerId", "Surname"])

data.columns

Index(['CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance',
       'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary',
       'Exited', 'Complain', 'Satisfaction Score', 'Card Type',
       'Point Earned'],
      dtype='object')
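As an alternative to deleting columns after loading, the unused columns can be skipped at read time. The sketch below assumes the same three identifier columns and uses an in-memory CSV with placeholder identifier values in place of the real file; the names `csv_text`, `UNUSED` and `slim` are illustrative.

```python
import io

import pandas as pd

# Stand-in for data/Customer-Churn-Records.csv; identifier values are placeholders.
csv_text = io.StringIO(
    "RowNumber,CustomerId,Surname,CreditScore,Geography,Exited\n"
    "1,101,Placeholder,619,France,1\n"
    "2,102,Placeholder,608,Spain,0\n"
)

UNUSED = {"RowNumber", "CustomerId", "Surname"}

# A callable passed to `usecols` is evaluated on each column name;
# returning True keeps the column.
slim = pd.read_csv(csv_text, usecols=lambda name: name not in UNUSED)
print(list(slim.columns))
```

Both approaches give the same columns; filtering at read time simply avoids parsing values that are discarded anyway.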

In [9]:
data.head(10)

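| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Complain | Satisfaction Score | Card Type | Point Earned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1 | 2 | DIAMOND | 464 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 1 | 3 | DIAMOND | 456 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 1 | 3 | DIAMOND | 377 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 5 | GOLD | 350 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 5 | GOLD | 425 |
| 5 | 645 | Spain | Male | 44 | 8 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 | 1 | 5 | DIAMOND | 484 |
| 6 | 822 | France | Male | 50 | 7 | 0.0 | 2 | 1 | 1 | 10062.8 | 0 | 0 | 2 | SILVER | 206 |
| 7 | 376 | Germany | Female | 29 | 4 | 115046.74 | 4 | 1 | 0 | 119346.88 | 1 | 1 | 2 | DIAMOND | 282 |
| 8 | 501 | France | Male | 44 | 4 | 142051.07 | 2 | 0 | 1 | 74940.5 | 0 | 0 | 3 | GOLD | 251 |
| 9 | 684 | France | Male | 27 | 2 | 134603.88 | 1 | 1 | 1 | 71725.73 | 0 | 0 | 3 | GOLD | 342 |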


## Check that data types are correct

In [6]:
data.dtypes

CreditScore             int64
Geography              object
Gender                 object
Age                     int64
Tenure                  int64
Balance               float64
NumOfProducts           int64
HasCrCard               int64
IsActiveMember          int64
EstimatedSalary       float64
Exited                  int64
Complain                int64
Satisfaction Score      int64
Card Type              object
Point Earned            int64
dtype: object
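The dtypes alone do not show whether the `int64` flag columns really hold only 0 and 1, so a value check can complement them. The sketch below uses the first three previewed records restricted to a few columns. Treating `HasCrCard`, `IsActiveMember`, `Exited` and `Complain` as binary flags follows the feature descriptions, while converting the text columns to pandas categoricals is an optional step assumed here, not something the original notebook does.

```python
import pandas as pd

# First three records from the preview above, restricted to a few columns.
sample = pd.DataFrame(
    {
        "Geography": ["France", "Spain", "France"],
        "Gender": ["Female", "Female", "Female"],
        "HasCrCard": [1, 0, 1],
        "IsActiveMember": [1, 1, 0],
        "Exited": [1, 0, 1],
        "Complain": [1, 1, 1],
        "Card Type": ["DIAMOND", "DIAMOND", "DIAMOND"],
    }
)

BINARY_COLUMNS = ["HasCrCard", "IsActiveMember", "Exited", "Complain"]
TEXT_COLUMNS = ["Geography", "Gender", "Card Type"]

# int64 is a fine storage type for flags, but the values should only be 0 or 1.
is_binary = sample[BINARY_COLUMNS].isin([0, 1]).all()
print(is_binary)

# Optional: store the text columns as categoricals (an assumption about later steps).
typed = sample.astype({name: "category" for name in TEXT_COLUMNS})
print(typed.dtypes)
```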

The data types appear to be correct given each feature's description. Just to clarify:

- 

In [18]:
data.describe()

| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Complain | Satisfaction Score | Point Earned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 650.5288 | 38.9218 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2038 | 0.2044 | 3.0138 | 606.5151 |
| std | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402842 | 0.403283 | 1.405919 | 225.924839 |
| min | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 | 0.0 | 1.0 | 119.0 |
| 25% | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 | 0.0 | 2.0 | 410.0 |
| 50% | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 | 0.0 | 3.0 | 605.0 |
| 75% | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 | 0.0 | 4.0 | 801.0 |
| max | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 | 1.0 | 5.0 | 1000.0 |


In [19]:
data.describe(include='object')

| | Geography | Gender | Card Type |
|---|---|---|---|
| count | 10000 | 10000 | 10000 |
| unique | 3 | 2 | 4 |
| top | France | Male | DIAMOND |
| freq | 5014 | 5457 | 2507 |
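The `unique`, `top` and `freq` rows above summarize what `value_counts()` reports in full for a categorical column. As a small illustration, the sketch below uses only the `Card Type` values of the ten previewed records, so its counts differ from the full-dataset figures in the table.

```python
import pandas as pd

# Card types of the ten previewed records, in order.
card_type = pd.Series(
    ["DIAMOND", "DIAMOND", "DIAMOND", "GOLD", "GOLD",
     "DIAMOND", "SILVER", "DIAMOND", "GOLD", "GOLD"],
    name="Card Type",
)

# Full frequency table, sorted from most to least common.
counts = card_type.value_counts()
print(counts)

# `top` is the most frequent label, `freq` its count, `unique` the number of labels.
top = counts.idxmax()
freq = int(counts.max())
unique = card_type.nunique()
print(top, freq, unique)
```

Running `data[column].value_counts()` on each of the three object columns would show the complete distribution rather than only the most frequent label.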


## Check for missing data

In [16]:
data.isnull().sum()

CreditScore           0
Geography             0
Gender                0
Age                   0
Tenure                0
Balance               0
NumOfProducts         0
HasCrCard             0
IsActiveMember        0
EstimatedSalary       0
Exited                0
Complain              0
Satisfaction Score    0
Card Type             0
Point Earned          0
dtype: int64

In [17]:
data.isna().sum()

CreditScore           0
Geography             0
Gender                0
Age                   0
Tenure                0
Balance               0
NumOfProducts         0
HasCrCard             0
IsActiveMember        0
EstimatedSalary       0
Exited                0
Complain              0
Satisfaction Score    0
Card Type             0
Point Earned          0
dtype: int64

Both checks agree (`isnull()` is an alias of `isna()` in pandas): there is no null or NaN data in this dataset.
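One caveat is that `isna()` only flags real missing values (`NaN`, `None`, `NaT`); an empty or whitespace-only string in a text column would pass the check. The sketch below shows one way to cover that case. The frame is hypothetical and not taken from the dataset, and nothing in the outputs above suggests that the dataset actually contains blank strings.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one real NaN and one whitespace-only string.
frame = pd.DataFrame(
    {
        "Geography": ["France", "  ", "Spain"],
        "Balance": [0.0, np.nan, 125510.82],
    }
)

# isna() counts only the real NaN in Balance.
plain_missing = frame.isna().sum()
print(plain_missing)

# Treat empty or whitespace-only text as missing before counting.
cleaned = frame.replace(r"^\s*$", np.nan, regex=True)
cleaned_missing = cleaned.isna().sum()
print(cleaned_missing)
```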

## Check for duplicated data

In [22]:
sum(data.duplicated())

0

There are no duplicated rows in this dataset.
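Because the check runs after the identifier columns were removed, two different customers with identical attributes would have been counted as duplicates, so a result of 0 also rules that case out. For reference, the sketch below shows how duplicates could be inspected and removed if any had been found. The frame is hypothetical, with rows 0 and 2 made identical on purpose.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 2 are identical.
frame = pd.DataFrame(
    {
        "CreditScore": [619, 608, 619],
        "Geography": ["France", "Spain", "France"],
        "Age": [42, 41, 42],
    }
)

# duplicated() marks every repeat after the first occurrence.
n_duplicates = int(frame.duplicated().sum())

# keep=False marks all members of each duplicate group, which helps inspection.
groups = frame[frame.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = frame.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(groups), len(deduplicated))
```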