In [1]:
import pandas as pd

The goal is to predict whether bank customers will churn, i.e. leave the bank.

### EDA on bank churn dataset

In [2]:
# importing the bank churn dataset
df = pd.read_csv('Bank Customer Churn Prediction.csv')

In [3]:
# getting info on the columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customer_id       10000 non-null  int64  
 1   credit_score      10000 non-null  int64  
 2   country           10000 non-null  object 
 3   gender            10000 non-null  object 
 4   age               10000 non-null  int64  
 5   tenure            10000 non-null  int64  
 6   balance           10000 non-null  float64
 7   products_number   10000 non-null  int64  
 8   credit_card       10000 non-null  int64  
 9   active_member     10000 non-null  int64  
 10  estimated_salary  10000 non-null  float64
 11  churn             10000 non-null  int64  
dtypes: float64(2), int64(8), object(2)
memory usage: 937.6+ KB


OK, no null data, that's good. Let's look at the values.

In [4]:
df.head()

,customer_id,credit_score,country,gender,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
0,15634602,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,15647311,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,15619304,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,15701354,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,15737888,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


customer_id is irrelevant, so we can drop it. gender can be dropped too, since we don't want predictions that treat people differently based on gender. Not sure what products_number and active_member mean. tenure seems like it would be impacted by age, e.g. someone who is 20 years old probably won't have a high tenure. Do we want age to be used in our predictions? Probably not, but I guess it depends on the interests of the bank: if we find out young people keep leaving, maybe we can find out why and tailor a strategy to keep them.

In [5]:
df = df.drop(labels=['customer_id','gender'],axis=1)

In [6]:
# checking details of the numerical columns to see what we can learn about tenure-age relationship.
# do we have lots of young people, etc.
df.describe()

,credit_score,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,650.5288,38.9218,5.0128,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,96.653299,10.487806,2.892174,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,584.0,32.0,3.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


Max of tenure is only 10!? OK, that's really low considering 75% of people are 32 or older. Do people really switch banks that often? Mean of churn is 0.2, meaning only about 20% of customers churned; since the variable is binary, the dataset is imbalanced.
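
A quick way to confirm the imbalance directly, rather than reading it off describe(), is to look at the class shares. This is a minimal sketch on a made-up toy column (the values below are hypothetical, not the real data); on the real frame the same two calls would be applied to df['churn'].

```python
import pandas as pd

# Hypothetical toy data standing in for the real churn column (2 churners out of 10).
toy = pd.DataFrame({"churn": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})

# For a 0/1 column the mean is the fraction of 1s, i.e. the churn rate.
churn_rate = toy["churn"].mean()

# Share of each class; normalize=True returns proportions instead of counts.
class_share = toy["churn"].value_counts(normalize=True)

print(churn_rate)
print(class_share)
```

With a split like this, plain accuracy is a weak metric: always predicting "no churn" already gets the majority share right.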

In [7]:
# let's check whether tenure looks different for younger people (under 32, the 25th percentile of age)
df_young = df[df['age']<32]
df_young.describe()

,credit_score,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
count,2372.0,2372.0,2372.0,2372.0,2372.0,2372.0,2372.0,2372.0,2372.0
mean,651.560708,27.351602,5.03204,73591.305232,1.554384,0.711636,0.513069,100765.968347,0.076307
std,96.312125,3.171252,2.848306,62959.481157,0.534742,0.453097,0.499935,58875.806811,0.265545
min,363.0,18.0,0.0,0.0,1.0,0.0,0.0,90.07,0.0
25%,582.0,25.0,3.0,0.0,1.0,0.0,0.0,48573.8125,0.0
50%,653.0,28.0,5.0,93597.185,2.0,1.0,1.0,103309.37,0.0
75%,717.0,30.0,7.0,126701.625,2.0,1.0,1.0,151968.1925,0.0
max,850.0,31.0,10.0,214346.96,4.0,1.0,1.0,199953.33,1.0


Oh interesting, the churn rate is much lower for young people (about 7.6% vs. 20.4% overall). Also, tenure is basically the same (mean of 5.03 vs. 5.01).
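
To compare more than one age group at a time, the ages can be binned and the churn rate computed per bin. This is a rough sketch on hypothetical toy data; the bin edges and labels are my own choice (the first bin matches the under-32 cut used above), and the column name age_band is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset.
toy = pd.DataFrame({
    "age":   [22, 25, 30, 35, 40, 45, 50, 55, 60, 65],
    "churn": [0,  0,  0,  0,  1,  0,  1,  1,  0,  1],
})

# pd.cut uses right-closed intervals by default: (17, 31], (31, 44], (44, 100].
bins = [17, 31, 44, 100]
labels = ["18-31", "32-44", "45+"]
toy["age_band"] = pd.cut(toy["age"], bins=bins, labels=labels)

# Mean of a 0/1 column per group is the churn rate in that group.
churn_by_band = toy.groupby("age_band", observed=True)["churn"].mean()
print(churn_by_band)
```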

Thinking about modeling now, I'd probably start with Naive Bayes since we have a binary target variable. Naive Bayes assumes the features are independent of each other given the class, so let's check the correlations and then encode the categorical variables.
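
For reference, this is the standard textbook form of the Naive Bayes model, where $y$ is the churn label and $x_1, \dots, x_n$ are the features (the notation is mine, not from the dataset):

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The product form only holds if the features are conditionally independent given $y$, so strongly correlated feature pairs are a rough warning sign, which is why correlation is worth a look before fitting.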

In [10]:
df_corr = df.corr(numeric_only=True)
df_corr

,credit_score,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
credit_score,1.0,-0.003965,0.000842,0.006268,0.012238,-0.005458,0.025651,-0.001384,-0.027094
age,-0.003965,1.0,-0.009997,0.028308,-0.03068,-0.011721,0.085472,-0.007201,0.285323
tenure,0.000842,-0.009997,1.0,-0.012254,0.013444,0.022583,-0.028362,0.007784,-0.014001
balance,0.006268,0.028308,-0.012254,1.0,-0.30418,-0.014858,-0.010084,0.012797,0.118533
products_number,0.012238,-0.03068,0.013444,-0.30418,1.0,0.003183,0.009612,0.014204,-0.04782
credit_card,-0.005458,-0.011721,0.022583,-0.014858,0.003183,1.0,-0.011866,-0.009933,-0.007138
active_member,0.025651,0.085472,-0.028362,-0.010084,0.009612,-0.011866,1.0,-0.011421,-0.156128
estimated_salary,-0.001384,-0.007201,0.007784,0.012797,0.014204,-0.009933,-0.011421,1.0,0.012097
churn,-0.027094,0.285323,-0.014001,0.118533,-0.04782,-0.007138,-0.156128,0.012097,1.0


In [11]:
# checking correlation since we want to use Naive Bayes: keep only entries with |r| > 0.25
df_corr = df_corr[df_corr.abs() > 0.25]
df_corr

,credit_score,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
credit_score,1.0,,,,,,,,
age,,1.0,,,,,,,0.285323
tenure,,,1.0,,,,,,
balance,,,,1.0,-0.30418,,,,
products_number,,,,-0.30418,1.0,,,,
credit_card,,,,,,1.0,,,
active_member,,,,,,,1.0,,
estimated_salary,,,,,,,,1.0,
churn,,0.285323,,,,,,,1.0
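
The masked matrix is mostly NaN and still shows the diagonal and both mirror-image entries for each pair. An alternative is to list each pair once and keep only the strong ones. This is a minimal sketch on hypothetical toy columns a, b and c (made up for illustration); the 0.25 threshold is the same one used above.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data: b is exactly 2 * a, c is uncorrelated with both.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [2, 5, 1, 5, 2],
})
corr = toy.corr()

# combinations() yields each unordered column pair once and skips the diagonal.
pairs = {(x, y): corr.loc[x, y] for x, y in combinations(corr.columns, 2)}

# Keep only pairs whose absolute correlation exceeds the threshold.
strong = {pair: r for pair, r in pairs.items() if abs(r) > 0.25}
print(strong)
```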


### Feature engineering

In [12]:
# checking for any shenanigans or non-conventional naming
df.value_counts('country')

country
France     5014
Germany    2509
Spain      2477
Name: count, dtype: int64

In [13]:
# dummy encoding to convert the remaining categorical variable (country) to binary columns
df = pd.get_dummies(df, drop_first=True)
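
To see what drop_first=True does, here is a small sketch on a hypothetical toy frame (made-up rows, not the real data). The first category in sorted order is dropped, so it becomes the baseline encoded as all zeros.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "country": ["France", "Germany", "Spain", "France"],
    "age": [42, 41, 39, 43],
})

# Only the object column is encoded; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

With three countries this yields two dummy columns; a row with zeros in both is France.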

### Modeling