## Data Preparation and Machine Learning

In [1]:
import pandas as pd
import numpy as np
import pickle

In [2]:
#Load the dataframe saved in EDA
churn_clean_df= pd.read_pickle('churn.pkl')

In [3]:
churn_clean_df.head() #just to be sure the data remains the same

Unnamed: 0,customer_id,vintage,age,gender,dependents,occupation,city,customer_nw_category,branch_code,days_since_last_transaction,...,average_monthly_balance_prevQ,average_monthly_balance_prevQ2,current_month_credit,previous_month_credit,current_month_debit,previous_month_debit,current_month_balance,previous_month_balance,churn,Age_group
0,1,3135,66,Male,0.0,self_employed,187.0,2,755,224.0,...,1458.71,1449.07,0.2,0.2,0.2,0.2,1458.71,1458.71,0,Old Age
1,2,310,35,Male,0.0,self_employed,,2,3214,60.0,...,7799.26,12419.41,0.56,0.56,5486.27,100.56,6496.78,8787.61,0,Young Adult
2,4,2356,31,Male,0.0,salaried,146.0,2,41,,...,4910.17,2815.94,0.61,0.61,6046.73,259.23,5006.28,5070.14,0,Young Adult
3,5,478,90,,,self_employed,1020.0,2,582,147.0,...,2084.54,1006.54,0.47,0.47,0.47,2143.33,2291.91,1669.79,1,Old Age
4,6,2531,42,Male,2.0,self_employed,1494.0,3,388,58.0,...,1643.31,1871.12,0.33,714.61,588.62,1538.06,1157.15,1677.16,1,Middle Age


In [4]:
churn_clean_df.columns #recheck the dataframe

Index(['customer_id', 'vintage', 'age', 'gender', 'dependents', 'occupation',
       'city', 'customer_nw_category', 'branch_code',
       'days_since_last_transaction', 'current_balance',
       'previous_month_end_balance', 'average_monthly_balance_prevQ',
       'average_monthly_balance_prevQ2', 'current_month_credit',
       'previous_month_credit', 'current_month_debit', 'previous_month_debit',
       'current_month_balance', 'previous_month_balance', 'churn',
       'Age_group'],
      dtype='object')

### Dealing with missing values

##### <i>check columns with missing values</i>

In [5]:
churn_clean_df.isna().sum()

customer_id                          0
vintage                              0
age                                  0
gender                             525
dependents                        2463
occupation                          80
city                               803
customer_nw_category                 0
branch_code                          0
days_since_last_transaction       3223
current_balance                      0
previous_month_end_balance           0
average_monthly_balance_prevQ        0
average_monthly_balance_prevQ2       0
current_month_credit                 0
previous_month_credit                0
current_month_debit                  0
previous_month_debit                 0
current_month_balance                0
previous_month_balance               0
churn                                0
Age_group                            0
dtype: int64

Four columns treated as categorical have missing values:
* gender
* dependents
* occupation
* city

Among the numeric columns, only days_since_last_transaction has missing values.

#### Dealing with missing values

From the EDA, we decided to leave the city column out of the machine learning section because it has too many unique values. High-cardinality categorical columns are hard to encode for most models: one-hot encoding city would add one dummy column per city.

There are three common ways to deal with missing categorical values: replace them with the most frequent value (the mode), replace them with an arbitrary placeholder value, or drop the rows that contain them. For this project, we will replace missing categorical values with the mode.

The conventional way to deal with missing numerical values is to either replace them with the mean or drop the rows they are in. Here we replace them with the mean.

In [6]:
# replacing missing categorical values with the mode; city is left out because it will be dropped
na_cat = ['gender', 'dependents', 'occupation']

for col in na_cat:
    mode = churn_clean_df[col].value_counts().index[0]  # most frequent value

    # assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which newer pandas versions do not reliably apply to the dataframe
    churn_clean_df[col] = churn_clean_df[col].fillna(mode)

# replacing missing numerical values with the column mean
mean_days = churn_clean_df['days_since_last_transaction'].mean()
churn_clean_df['days_since_last_transaction'] = churn_clean_df['days_since_last_transaction'].fillna(mean_days)

### Getting the Log of the Benchmark columns

In [7]:
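# NOTE: 'average_monthly_balance_prevQ2' is listed twice, so the loop below
# log-transforms that column twice; 'branch_code' is transformed here but is
# dropped before modelling.
benchmark_col = ['branch_code',
                 'current_balance',
                 'average_monthly_balance_prevQ',
                 'average_monthly_balance_prevQ2',
                 'current_month_credit',
                 'previous_month_credit',
                 'current_month_debit',
                 'previous_month_debit',
                 'previous_month_balance',
                 'previous_month_end_balance',
                 'average_monthly_balance_prevQ2']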

# getting the log of the benchmark columns; columns whose minimum is <= 0 are
# shifted first so that the argument of the logarithm is always positive
negative_log_columns_minvalue_dict = {}   # original minimum of columns with negative values
log_columns_withzeros_minvalue_dict = {}  # original minimum of columns whose minimum is zero

for col in benchmark_col:
    minimum = churn_clean_df[col].min()

    if minimum < 0:
        # shift by |minimum| + 1 so the smallest value maps to log(1) = 0
        churn_clean_df[col] = np.log(churn_clean_df[col] + np.abs(minimum) + 1)
        # keep the minimum in case the transform has to be reversed later
        negative_log_columns_minvalue_dict[col] = minimum

    elif minimum == 0:
        churn_clean_df[col] = np.log(churn_clean_df[col] + 1)
        # keep the minimum in case the transform has to be reversed later
        log_columns_withzeros_minvalue_dict[col] = minimum

    else:
        churn_clean_df[col] = np.log(churn_clean_df[col])

In [8]:
# saving the recorded minimums of the benchmarks with negative or zero minimum values

filenames = ['churn benchmark with negative values.pkl',
             'churn benchmark with zero minimum values.pkl']
logged_benchmarks = [negative_log_columns_minvalue_dict,
                     log_columns_withzeros_minvalue_dict]

for log_file, fname in zip(logged_benchmarks, filenames):
    with open(fname, 'wb') as outfile:
        pickle.dump(log_file, outfile)
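
The recorded minimums are what make the log transform reversible. As a minimal sketch of how a saved minimum could be used to get back to the original scale, the helper names `log_shift` and `inverse_log_shift` and the sample balances below are hypothetical and are not part of this notebook:

```python
import numpy as np


def log_shift(values, minimum):
    """Forward transform, mirroring the three branches of the loop above."""
    if minimum < 0:
        return np.log(values + np.abs(minimum) + 1)
    if minimum == 0:
        return np.log(values + 1)
    return np.log(values)


def inverse_log_shift(logged, minimum):
    """Undo log_shift, given the minimum recorded before the transform."""
    if minimum < 0:
        return np.exp(logged) - np.abs(minimum) - 1
    if minimum == 0:
        return np.exp(logged) - 1
    return np.exp(logged)


# Hypothetical balances, including a negative one.
balances = np.array([-250.0, 0.0, 1458.71, 7799.26])
minimum = balances.min()

logged = log_shift(balances, minimum)
restored = inverse_log_shift(logged, minimum)

print(np.allclose(restored, balances))
```

A column that went through the loop more than once would need the inverse applied once per pass, in reverse order, using the minimum recorded for each pass.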

### Dealing with categorical variables

In [9]:
# columns left out of the feature matrix: the target, identifiers and categorical
# columns (gender and occupation are added back below as dummy variables)
cat_col = ['customer_id', 'occupation', 'gender', 'city', 'Age_group', 'churn', 'branch_code']

# separating the dependent variable from the dataframe
y = churn_clean_df['churn']

# getting the dummy variables; drop_first avoids the dummy variable trap
usable_cat_col = pd.get_dummies(churn_clean_df.loc[:, ['gender', 'occupation']],
                                prefix=['gender', 'occupation'], drop_first=True)

# dropping the columns listed above, then appending the dummy variables
X_unscaled = churn_clean_df.drop(cat_col, axis=1)
X_unscaled = pd.concat([X_unscaled, usable_cat_col], axis=1)
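
To see what `drop_first=True` does, here is a small sketch on a made-up three-row frame (the `toy` data is an assumption for illustration only). The first category of each column in sorted order is dropped and becomes the reference category:

```python
import pandas as pd

# Hypothetical toy data with the same two columns encoded above.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "occupation": ["salaried", "company", "student"],
})

dummies = pd.get_dummies(toy, prefix=["gender", "occupation"], drop_first=True)

# 'Female' and 'company' come first alphabetically, so they get no column.
print(list(dummies.columns))
print(dummies)
```

The second row (Female, company) is all zeros in the result: it is represented by the absence of every other category, which is why the remaining dummy coefficients are read relative to it.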

In [10]:
X_unscaled.head()

Unnamed: 0,vintage,age,dependents,customer_nw_category,days_since_last_transaction,current_balance,previous_month_end_balance,average_monthly_balance_prevQ,average_monthly_balance_prevQ2,current_month_credit,previous_month_credit,current_month_debit,previous_month_debit,current_month_balance,previous_month_balance,gender_Male,occupation_retired,occupation_salaried,occupation_self_employed,occupation_student
0,3135,66,0.0,2,224.0,8.848462,8.435827,7.285308,2.379147,-1.609438,-1.609438,-1.609438,-1.609438,1458.71,8.799606,1,0,0,1,0
1,310,35,0.0,2,60.0,9.29609,9.380524,8.961784,2.422367,-0.579818,-0.579818,8.610004,4.610755,6496.78,9.543989,1,0,0,1,0
2,2356,31,0.0,2,69.997814,9.150391,9.101179,8.499064,2.38592,-0.494296,-0.494296,8.707273,5.557716,5006.28,9.234356,1,0,1,0,0
3,478,90,0.0,2,147.0,8.961478,8.60199,7.642303,2.376833,-0.755023,-0.755023,-0.755023,7.670116,2291.91,8.830939,1,0,0,1,0
4,2531,42,2.0,3,58.0,8.769147,8.423386,7.404468,2.381297,-1.108663,6.571737,6.377781,7.338277,1157.15,8.832016,1,0,0,1,0


### Scaling and dividing datasets into train sets and test sets

In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [12]:
scale= StandardScaler()
scale.fit(X_unscaled)
X = pd.DataFrame(scale.transform(X_unscaled), columns=X_unscaled.columns)

In [13]:
#saving the scaler for future use
with open('churn_scaler.pkl', 'wb') as f:
    pickle.dump(scale, f)

In [14]:
X.head()

Unnamed: 0,vintage,age,dependents,customer_nw_category,days_since_last_transaction,current_balance,previous_month_end_balance,average_monthly_balance_prevQ,average_monthly_balance_prevQ2,current_month_credit,previous_month_credit,current_month_debit,previous_month_debit,current_month_balance,previous_month_balance,gender_Male,occupation_retired,occupation_salaried,occupation_self_employed,occupation_student
0,0.478644,0.999147,-0.330877,-0.341489,1.894492,-0.727799,-0.814513,-1.255369,-0.575232,-0.882673,-0.900892,-1.129796,-1.145103,-0.142564,-0.826844,0.813874,-0.277108,-0.556106,0.785274,-0.279606
1,-1.275909,-0.741756,-0.330877,-0.341489,-0.1229903,0.138514,0.715389,0.729051,0.883783,-0.644332,-0.664697,1.223916,0.290772,-0.022705,0.656674,0.813874,-0.277108,-0.556106,0.785274,-0.279606
2,-0.005178,-0.966389,-0.330877,-0.341489,1.74818e-16,-0.143464,0.263,0.181336,-0.346602,-0.624535,-0.645078,1.246319,0.509369,-0.058165,0.03959,0.813874,-0.277108,1.798219,-1.273441,-0.279606
3,-1.171568,2.346943,-0.330877,-0.341489,0.9472594,-0.509075,-0.545418,-0.832799,-0.653352,-0.684889,-0.704889,-0.93301,0.996998,-0.122742,-0.764398,0.813874,-0.277108,-0.556106,0.785274,-0.279606
4,0.103512,-0.348649,1.756001,1.172672,-0.1475938,-0.881302,-0.834661,-1.114321,-0.502664,-0.766751,0.975873,0.709797,0.920396,-0.149738,-0.762253,0.813874,-0.277108,-0.556106,0.785274,-0.279606


In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
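
The cells above fit the scaler on the full dataset and split afterwards, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, which is not what this notebook does, is to split first and learn the scaling statistics from the training rows only. A minimal NumPy sketch of that idea on randomly generated, hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(42)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # hypothetical data

# Split first: here simply the first 67 rows for training, the rest for testing.
train, test = features[:67], features[67:]

# Learn the statistics from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# Apply the same training statistics to both parts.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

With scikit-learn the equivalent would be to call `fit` on `X_train` only and `transform` on both `X_train` and `X_test`. The results reported below come from the notebook's original order of operations.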

## Machine Learning

In [16]:
# importing the logistic regression model
from sklearn.linear_model import LogisticRegression

In [17]:
LR= LogisticRegression()

In [18]:
LR.fit(X_train, y_train)

LogisticRegression()

###### Evaluating the model

In [19]:
from sklearn.metrics import classification_report

In [20]:
print(classification_report(y_test,LR.predict(X_test)))

              precision    recall  f1-score   support

           0       0.86      0.98      0.91      7645
           1       0.74      0.28      0.41      1722

    accuracy                           0.85      9367
   macro avg       0.80      0.63      0.66      9367
weighted avg       0.84      0.85      0.82      9367



The accuracy of our model is 85%, which looks good, although we might be able to do better. A quick look at the recall for churn (class 1) tells a different story: at 0.28 it is extremely low, which defeats the purpose of the analysis. A recall of 0.28 means that only 28% of the customers who will eventually churn are predicted to churn; in other words, the model identifies only 28 out of every 100 churners. One need not be an expert to conclude that this is not good. Our bank would probably throw the model in the trash, and maybe chase me off the premises. At this point there is little value in improving accuracy, because it is not the metric that matters here. Given the bank's objective, we have to focus on recall. We wouldn't want to disappoint our employer now, would we?

It is quite common for accuracy to be a misleading metric when evaluating classification models. This usually happens when the dataset is imbalanced, i.e. some classes of the response variable occur much more often than others. In our churn dataset, class 0 occurs much more frequently than class 1: going by the churn countplot in the Exploratory Data Analysis notebook, class 0 ("No") makes up about 80% of the total, while class 1 ("Yes") makes up only about 20%. A model that predicts "No" for every customer would therefore already score about 80% accuracy without identifying a single churner.

This means we have to rerun the model and optimize for recall. It is important to note that there is no free lunch: a model tuned for recall will almost certainly produce more false positives, reducing our precision.
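
To make the gap between accuracy and recall concrete, the sketch below uses only the class supports and the rounded class 1 scores from the classification report above. Because the report rounds to two decimals, the reconstructed counts are estimates rather than the model's exact confusion matrix:

```python
# Supports taken from the classification report above.
n_stay, n_churn = 7645, 1722
total = n_stay + n_churn

# A "model" that predicts class 0 for every customer.
baseline_accuracy = n_stay / total
baseline_recall_churn = 0 / n_churn

# Approximate counts implied by recall 0.28 and precision 0.74 for class 1.
true_pos = round(0.28 * n_churn)   # churners the model catches
pred_pos = round(true_pos / 0.74)  # customers the model flags as churners
false_pos = pred_pos - true_pos
false_neg = n_churn - true_pos     # churners the model misses
true_neg = n_stay - false_pos
accuracy = (true_pos + true_neg) / total

print(f"baseline accuracy: {baseline_accuracy:.2f}")
print(f"model accuracy:    {accuracy:.2f}")
print(f"missed churners:   {false_neg} of {n_churn}")
```

The always-"No" baseline lands within a few points of the fitted model's accuracy while catching no churners at all, which is why accuracy alone says little here.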

#### Searching for the best model that drastically improves recall using cross-validation

In [21]:
from sklearn.model_selection import GridSearchCV

In [22]:
param_grid={'penalty':['l1', 'l2'], # penalty is used because of the suspicion that some variables might be irrelevant 
            'C':[1,10,50,100,1000],
           'fit_intercept':[True, False]}

grid = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid=param_grid,scoring='recall', cv=5)
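
As a rough illustration of how much work this search involves, the sketch below enumerates the same parameter grid in plain Python. It only counts the candidates and does not fit anything:

```python
from itertools import product

param_grid = {
    "penalty": ["l1", "l2"],
    "C": [1, 10, 50, 100, 1000],
    "fit_intercept": [True, False],
}
cv_folds = 5

# Every combination of one value per parameter is a candidate model.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

# Each candidate is fitted once per cross-validation fold.
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

Each candidate is scored by its mean recall across the folds. With its default `refit=True`, `GridSearchCV` then fits the best candidate once more on the whole training set.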

In [23]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=LogisticRegression(solver='liblinear'),
             param_grid={'C': [1, 10, 50, 100, 1000],
                         'fit_intercept': [True, False],
                         'penalty': ['l1', 'l2']},
             scoring='recall')

In [24]:
grid.best_params_

{'C': 50, 'fit_intercept': False, 'penalty': 'l2'}

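The L2 penalty was chosen. This does not prove that every independent variable is relevant; it only means that L1, which can shrink coefficients to exactly zero, did not give a better cross-validated recall than L2. Note also that the search selected fit_intercept=False. With standardized features and no intercept, an average customer gets a predicted churn probability of about 0.5 rather than something near the churn rate, so many more customers are flagged as churners. This is probably a large part of the recall gain seen below.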

In [25]:
model= grid.best_estimator_

In [26]:
print(classification_report(y_test,model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.92      0.68      0.78      7645
           1       0.35      0.75      0.47      1722

    accuracy                           0.69      9367
   macro avg       0.63      0.71      0.63      9367
weighted avg       0.82      0.69      0.73      9367



The new model identifies about 3 out of every 4 customers who will churn, a clear improvement on the first model, which caught only about 1 in 4. As expected, precision is now much lower: at 0.35, roughly two out of every three customers flagged as churners will not actually churn, and overall accuracy has fallen to 0.69. The bank may not care much about this for now. However, if the bank launches a strategy to reduce churn by targeting the customers our model flags, it may waste resources on customers who wouldn't have churned anyway. So what next? There is usually a middle ground; in our case, we'll consider the coefficients of the variables in our model.

###### Model coefficients

In [27]:
pd.Series(model.coef_.ravel(), index=X.columns)

vintage                          -0.127373
age                              -0.014610
dependents                        0.004770
customer_nw_category              0.044354
days_since_last_transaction      -0.006714
current_balance                  -1.297846
previous_month_end_balance       -0.540634
average_monthly_balance_prevQ     1.065785
average_monthly_balance_prevQ2   -0.091754
current_month_credit             -0.007859
previous_month_credit             0.056871
current_month_debit               0.233502
previous_month_debit              0.046023
current_month_balance             0.259036
previous_month_balance            0.567856
gender_Male                       0.018734
occupation_retired                0.016279
occupation_salaried               0.017560
occupation_self_employed          0.096477
occupation_student                0.047647
dtype: float64

From these coefficients, one can conclude that older customers are less likely to churn, because the coefficient for age is negative: the higher the age, the less likely a customer is to churn, i.e. to stop transacting with the bank. Another interesting demographic variable is dependents, whose positive coefficient implies that the more dependents a customer has, the more likely they are to churn. Both coefficients are small, though, and since the features were standardized, each one is the change in the log-odds of churn per one standard deviation of the variable, with the other variables held fixed.

For the nominal variable gender, the coefficient for gender_Male is positive, which means male customers are more likely to churn than female customers. For occupation, customers in every other occupation, especially the self-employed with their comparatively high coefficient, are more likely to churn than those in the reference category, company.

Wait! The positive coefficient for retired customers seems to contradict the age coefficient. Recall that Female and company were dropped as reference categories to avoid the dummy variable trap, so each dummy coefficient is a comparison against its reference category with everything else held fixed. A better interpretation is therefore that retired customers are slightly more likely to churn than company customers of the same age and gender.

It is important to reiterate that the point of checking the coefficients is to support the bank's decision making. For example, the bank may want to spend money on a customer retention program to keep customers predicted to churn from actually churning. Because our model's precision is low, the bank would end up spending a lot on customers who were never going to churn. To minimize this, the bank can use the model's coefficients to prioritize. For example, it might prioritize a self-employed 35-year-old man with a high average monthly balance who has been predicted to churn over an 80-year-old woman with a high current month credit.
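
One way to turn this prioritization into something the bank could apply is to rank flagged customers by their linear score, the sum of each coefficient times the customer's standardized feature value. The sketch below is illustrative only. It copies a handful of coefficients from the output above, while the two customers and their z-score values are hypothetical. Features that are left out are treated as average (z-score 0), so they contribute nothing to the score:

```python
import math

# Coefficients copied from the output above (the chosen model has no intercept).
coefficients = {
    "age": -0.014610,
    "current_balance": -1.297846,
    "average_monthly_balance_prevQ": 1.065785,
    "current_month_credit": -0.007859,
    "gender_Male": 0.018734,
    "occupation_self_employed": 0.096477,
}


def churn_score(customer):
    """Partial linear score: sum of coefficient * standardized feature value."""
    return sum(coefficients[name] * value for name, value in customer.items())


def to_probability(score):
    """Logistic function mapping a linear score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical customers described by standardized (z-score) feature values.
customers = {
    "self-employed man, 35, high average balance": {
        "age": -0.74, "average_monthly_balance_prevQ": 1.5,
        "gender_Male": 0.81, "occupation_self_employed": 0.79,
    },
    "woman, 80, high current month credit": {
        "age": 2.35, "current_month_credit": 1.5,
        "gender_Male": -1.23, "occupation_self_employed": -1.27,
    },
}

# Highest score first: these are the customers to contact first.
ranked = sorted(customers, key=lambda name: churn_score(customers[name]), reverse=True)

for name in ranked:
    score = churn_score(customers[name])
    print(f"{name}: score={score:.3f}, probability={to_probability(score):.2f}")
```

On the fitted model itself, the same ranking is available from `model.decision_function(X)` or `model.predict_proba(X)`, which use all the features rather than the subset shown here.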

Finally, we save the model for future use.

In [28]:
with open('Churn LR model.pkl', 'wb') as f:
    pickle.dump(model, f)

#  <i>Thank You</i>