# ML Course 2 - Practice

## Subject

We provide here a data set on the customers of a bank:
- CustomerId = customer id in the information system of the bank
- Surname = surname of the customer
- CreditScore = score assigned by the bank as an estimate of the customer's capacity to repay a loan
- Geography = country of the customer
- Gender = sex of the customer
- Age = age of the customer
- Tenure = loan duration (years)
- Balance = amount of money on main account (\$)
- NumOfProducts = number of products the customer has in the bank
- HasCrCard = tells if the customer owns a credit card
- IsActiveMember = tells if the customer has an active account
- EstimatedSalary = estimated salary of the customer (\$)
- Exited = tells if the customer has left the bank

In [36]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn

In [37]:
dataset_raw = pd.read_csv('bank_churn.csv')
dataset_raw.sample(n=10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
2119,2120,15791836,Wildman,690,France,Male,29,5,0.0,2,1,0,108577.97,0
4145,4146,15729018,Alexander,666,France,Female,33,2,147229.65,1,1,1,56410.17,0
6161,6162,15666430,Peck,579,France,Male,38,8,0.0,2,0,0,91763.67,0
8733,8734,15714241,Haddon,749,Spain,Male,42,9,222267.63,1,0,0,101108.85,1
2869,2870,15594084,Anderson,524,France,Male,22,9,0.0,2,1,0,74405.34,0
4671,4672,15808674,Ejikemeifeuwa,616,Germany,Female,45,6,128352.59,3,1,1,144000.59,1
1426,1427,15710206,Larson,591,France,Female,39,4,150500.64,1,1,0,14928.8,0
8557,8558,15752622,Kerr,729,France,Female,32,7,38550.06,1,0,1,179230.23,0
7180,7181,15632789,Maclean,794,France,Male,30,8,0.0,2,1,1,24113.91,0
3468,3469,15769586,Horan,820,France,Female,49,1,0.0,2,1,1,119087.25,0


The objective of the bank is to predict churn, i.e. whether a customer is likely to leave the bank, based on his/her profile (estimated salary, geography, age, etc.).
Here, we will prepare the data set so that it can be used directly for ML processing. The work is limited to data exploration and preparation; we will not make ML predictions yet.

Your tasks:
- Explore the data with at least three plots of your choice. Choose plots that provide interesting and meaningful information. 
Examples: distribution of the values within one feature, distribution of a feature depending on the target value, etc.  
For each graph, you must:
    - Plot the graph
    - Provide a title, axis labels and a legend if applicable
    - Write a Markdown cell underneath to explain what insights you can draw from your graph. 
- Prepare the data set. Example: dropping irrelevant data, preparing the predictors and the response, data encoding, train/test split, data scaling.
    - Explain why you drop a feature
    - Explain your choice of encoding
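
The preparation steps listed above could look roughly like the sketch below. It is only one possible approach, not the expected solution. The rows are copied from the ten-row sample displayed earlier; the decisions to drop `CustomerId` and `Surname`, to one-hot encode `Geography` and `Gender`, and to standardise the listed numeric columns are assumptions made for illustration, as are the variable names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

columns = ["CustomerId", "Surname", "CreditScore", "Geography", "Gender", "Age",
           "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember",
           "EstimatedSalary", "Exited"]
# Rows copied from the sample shown above (RowNumber omitted).
rows = [
    (15791836, "Wildman", 690, "France", "Male", 29, 5, 0.0, 2, 1, 0, 108577.97, 0),
    (15729018, "Alexander", 666, "France", "Female", 33, 2, 147229.65, 1, 1, 1, 56410.17, 0),
    (15666430, "Peck", 579, "France", "Male", 38, 8, 0.0, 2, 0, 0, 91763.67, 0),
    (15714241, "Haddon", 749, "Spain", "Male", 42, 9, 222267.63, 1, 0, 0, 101108.85, 1),
    (15594084, "Anderson", 524, "France", "Male", 22, 9, 0.0, 2, 1, 0, 74405.34, 0),
    (15808674, "Ejikemeifeuwa", 616, "Germany", "Female", 45, 6, 128352.59, 3, 1, 1, 144000.59, 1),
    (15710206, "Larson", 591, "France", "Female", 39, 4, 150500.64, 1, 1, 0, 14928.8, 0),
    (15752622, "Kerr", 729, "France", "Female", 32, 7, 38550.06, 1, 0, 1, 179230.23, 0),
    (15632789, "Maclean", 794, "France", "Male", 30, 8, 0.0, 2, 1, 1, 24113.91, 0),
    (15769586, "Horan", 820, "France", "Female", 49, 1, 0.0, 2, 1, 1, 119087.25, 0),
]
df = pd.DataFrame(rows, columns=columns)

# Response and predictors. Identifier-like columns are dropped here on the
# assumption that they carry no signal about churn.
y = df["Exited"]
X = df.drop(columns=["CustomerId", "Surname", "Exited"])

# One-hot encoding for nominal categories; drop_first avoids a redundant column.
X = pd.get_dummies(X, columns=["Geography", "Gender"], drop_first=True)

# Split before scaling so that the test rows do not influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

numeric_cols = ["CreditScore", "Age", "Tenure", "Balance",
                "NumOfProducts", "EstimatedSalary"]
scaler = StandardScaler()

def scale(frame, fit):
    """Return a copy of frame with numeric_cols standardised."""
    values = scaler.fit_transform(frame[numeric_cols]) if fit \
        else scaler.transform(frame[numeric_cols])
    scaled = pd.DataFrame(values, columns=numeric_cols, index=frame.index)
    return frame.drop(columns=numeric_cols).join(scaled)

X_train_prepared = scale(X_train, fit=True)   # fit on the training rows only
X_test_prepared = scale(X_test, fit=False)    # reuse the training statistics
```

On the full data set, passing `stratify=y` to `train_test_split` may be worth considering so that both splits keep a similar share of exited customers.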
   
Bonus: Feature Engineering!  
Enrich the dataset by creating new features for your model to learn from. A new feature can be a combination of existing features or can draw on external information.

**The bonus part can contain columns calculated from other columns; they just need to make sense.**
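
As an illustration of the bonus task, the sketch below derives three columns from existing ones. The feature names (`BalanceToSalary`, `ZeroBalance`, `AgeGroup`) and the age bins are hypothetical choices, not part of the data set or the assignment; the four rows are copied from the sample displayed earlier.

```python
import pandas as pd

# Four rows copied from the sample shown above (subset of columns).
df = pd.DataFrame({
    "Age": [29, 33, 38, 42],
    "Balance": [0.0, 147229.65, 0.0, 222267.63],
    "EstimatedSalary": [108577.97, 56410.17, 91763.67, 101108.85],
})

# Ratio of account balance to estimated salary (hypothetical feature).
df["BalanceToSalary"] = df["Balance"] / df["EstimatedSalary"]

# Flag for an empty main account (hypothetical feature).
df["ZeroBalance"] = (df["Balance"] == 0).astype(int)

# Coarse age bands; the bin edges are an arbitrary illustrative choice.
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 30, 45, 60, 120],
    labels=["<=30", "31-45", "46-60", ">60"],
)
```

Whether such columns help is something to check against the target during exploration; a derived feature still needs a short justification of why it should relate to churn.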

In [38]:
dataset_raw.duplicated().sum()

0

In [39]:
dataset_raw.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [40]:
dataset_raw.sample(n=10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
5571,5572,15708867,Niu,684,Spain,Female,38,3,134168.5,3,1,0,3966.5,1
7177,7178,15697310,O'Callaghan,559,Germany,Female,28,3,152264.81,1,0,0,64242.31,0
1974,1975,15679283,Parkhill,694,France,Female,33,4,129731.64,2,1,0,178123.86,0
8515,8516,15811389,Padovano,724,Germany,Female,35,0,171982.95,2,0,1,167313.07,0
2158,2159,15685706,Bird,731,France,Female,40,7,118991.79,1,1,1,156048.64,0
3430,3431,15780925,Tretyakova,625,France,Male,37,1,177069.24,2,1,1,96088.54,0
7282,7283,15567860,Burrows,581,Spain,Female,44,7,189318.16,2,1,0,45026.23,1
274,275,15800116,Bowman,712,Germany,Male,28,4,145605.44,1,0,1,93883.53,0
4398,4399,15707007,Onio,743,France,Female,39,8,0.0,1,1,0,94263.44,0
3834,3835,15704819,Ositadimma,734,Spain,Female,39,6,92126.26,2,0,0,112973.34,0
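
For the exploration task, one plot that relates a feature to the target is the churn rate per country. The sketch below shows the shape of such a plot with a title, axis labels and a legend. It uses only the `Geography` and `Exited` values of the ten sampled rows printed just above, so the resulting rates say nothing about the full data set; on the real data the same `groupby` would be applied to `dataset_raw`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# (Geography, Exited) pairs copied from the ten-row sample shown above.
sample = pd.DataFrame({
    "Geography": ["Spain", "Germany", "France", "Germany", "France",
                  "France", "Spain", "Germany", "France", "Spain"],
    "Exited": [1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of customers who exited.
churn_rate = (
    sample.groupby("Geography")["Exited"].mean().sort_values(ascending=False)
)

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(churn_rate.index, churn_rate.values,
       label="Share of customers who exited")
ax.set_title("Churn rate by country")
ax.set_xlabel("Geography")
ax.set_ylabel("Churn rate")
ax.set_ylim(0, 1)
ax.legend()
fig.tight_layout()
```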


### We set 'Exited' aside as the response, since churn is what we want to predict

In [41]:
# pop() removes the column from the predictors and returns it, so the target is kept in y
y = dataset_raw.pop('Exited')
dataset_raw.sample(n=10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
1947,1948,15657812,Ch'iu,688,France,Male,52,1,0.0,2,1,1,172033.57
1489,1490,15617705,Ozioma,609,France,Female,39,8,141675.23,1,0,1,175664.25
5191,5192,15681075,Chukwualuka,682,France,Female,58,1,0.0,1,1,1,706.5
232,233,15787174,Sergeyev,512,France,Female,37,1,0.0,2,0,1,156105.03
490,491,15714689,Houghton,591,Spain,Male,29,1,97541.24,1,1,1,196356.17
7502,7503,15697844,Whitehouse,721,Spain,Female,32,10,0.0,1,1,0,136119.96
4975,4976,15573278,Kennedy,743,France,Male,39,6,0.0,2,1,0,44265.28
9809,9810,15763907,Watts,820,France,Female,39,1,104614.29,1,1,0,61538.43
9110,9111,15727391,Collier,688,Germany,Male,29,9,144553.5,2,1,0,143454.95
2218,2219,15752488,Emery,733,Spain,Female,31,9,102289.85,1,1,1,115441.66


In [56]:
# Raw string so that \d and \s reach the regex engine unchanged
dataset_raw['Surname'].str.findall(r'[^a-zA-Z\d\s:]').sum()

['?',
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 '?',
 "'",
 "'",
 '?',
 "'",
 '-',
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 '?',
 "'",
 "'",
 "'",
 "'",
 '?',
 '?',
 "'",
 "'",
 '-',
 '?',
 "'",
 '-',
 '?',
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 '?',
 "'",
 "'",
 "'",
 '?',
 "'",
 "'",
 '?',
 "'",
 '?',
 "'",
 '?',
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 '?',
 "'",
 "'",
 '?',
 "'",
 '?',
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 '?',
 '?',
 "'",
 '?',
 '?',
 '?',
 "'",
 "'",
 "'",
 '?',
 '?',
 "'",
 "'",
 "'",
 "'",
 '?',
 '-',
 "'",
 '?',
 "'",
 '?',
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 '?',
 "'",
 "'",
 "'",
 '?',
 '?',
 "'",
 "'",
 '?',
 "'",
 '?',
 '?',
 '?',
 "'",
 '-',
 "'",
 "'",
 "'",
 '?',
 "'",
 "'",
 '-',
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 "'",
 '?',
 "'",
 "'",
 '?',
 '?',
 "'",
 "'",
 '?',
 '?',
 "'",
 '-',
 '?',
 "'",
 "'",
 "'",
 "'",
 "'",
 '?'
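
The flat list above is hard to read. A more compact view, sketched below under the assumption that the goal is to see which special characters occur and in which surnames, is to count the matches and list the affected rows. The surnames are a handful taken from the samples shown earlier; on the real data the same calls would be applied to `dataset_raw['Surname']`.

```python
import pandas as pd

# A few surnames copied from the samples displayed earlier in the notebook.
surnames = pd.Series(
    ["O'Callaghan", "Ch'iu", "Niu", "Bird", "Parkhill"], name="Surname"
)

# Same pattern as above: anything that is not a letter, digit, whitespace or colon.
special = surnames.str.findall(r"[^a-zA-Z\d\s:]")

# One row per matched character; surnames without a match become NaN and are dropped.
counts = special.explode().dropna().value_counts()

# Surnames that contain at least one special character.
flagged = surnames[special.str.len() > 0]
```

Here `counts` gives the number of occurrences per character and `flagged` the surnames to inspect. The `'?'` entries in the output above may point to characters lost during encoding, which would be worth checking on the flagged rows.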