Section 4 -  Business case 

# TOC 
1. [Data](#data)
    + [Data preprocessing](#preprocessing)
    + [Encoding]()
    + [Splitting the data]()
    + [ANN modelling](#modelling)


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [2]:
data = pd.read_csv("../input/Churn_Modelling.csv")

# Data <a name = "data"></a>

## data preprocessing <a name = "preprocessing"> </a>

In [3]:
data.describe()
data.columns.tolist()

The data above contains information about the customers of a bank. We aim to predict churn, i.e. whether a customer has left the bank (the `Exited` column). The `IsActiveMember` variable could indicate, for example, whether the customer has logged in during the past month.

In [4]:
features = [
'CreditScore',
'Geography',
'Gender',
'Age',
'Tenure',
'Balance',
'NumOfProducts',
'HasCrCard',
'IsActiveMember',
'EstimatedSalary',
]
X = data[features].copy()  # copy, so the encoding below does not write into a view of data
y = data['Exited']

## Encoding Categorical variables 

In [5]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder 

In [6]:
labelencode_country = LabelEncoder()
X.loc[:,'Geography'] = labelencode_country.fit_transform(X.loc[:,'Geography'])

In [7]:
labelencode_gender = LabelEncoder()
X.loc[:,'Gender'] = labelencode_gender.fit_transform(X.loc[:,'Gender'])

In [8]:
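# one-hot encode column 1 (Geography)
# note: the categorical_features argument exists only in older scikit-learn (it was removed in 0.22)
onehot_geo = OneHotEncoder(categorical_features = [1])
X = onehot_geo.fit_transform(X).toarray()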

In [9]:
# drop one variable to avoid dummy variable trap
X = X[:,1:]

In [10]:
X.shape
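The encoding steps above can be illustrated without scikit-learn. This is a plain-Python sketch, not the notebook's pipeline: `one_hot` is a hypothetical helper and the country names are assumed category values for `Geography`. It shows why dropping one dummy column loses no information.

```python
def one_hot(values, drop_first=False):
    """One-hot encode a list of labels; categories are sorted, as LabelEncoder does."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    if drop_first:
        # the dropped column is implied: it is 1 exactly when all the others are 0
        rows = [r[1:] for r in rows]
        categories = categories[1:]
    return categories, rows

cats, encoded = one_hot(["France", "Spain", "Germany", "France"], drop_first=True)
print(cats)     # ['Germany', 'Spain']
print(encoded)  # [[0, 0], [0, 1], [1, 0], [0, 0]]
```

A row of all zeros stands for the dropped category (France here), so the remaining two columns still identify every country.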

## Splitting the data 


In [11]:
from sklearn.model_selection import train_test_split


In [12]:
# train on 8000, validate on the remaining 2000
x_train, x_valid, y_train, y_valid = train_test_split(X,y, test_size = 0.2, random_state= 0)
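What the split does can be sketched in a few lines of plain Python. `split_indices` is a hypothetical helper, and the 10,000 rows are an assumption taken from the 8000 / 2000 comment above. The shuffle is not the one scikit-learn produces for `random_state = 0`; only the idea is the same.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle the row indices and cut them into a train part and a validation part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # a fixed seed makes the split reproducible
    n_valid = int(round(n_rows * test_size))
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = split_indices(10000, test_size=0.2, seed=0)
print(len(train_idx), len(valid_idx))  # 8000 2000
```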

In [13]:
# feature scaling
# very important for deep learning:
# scaling eases the computation and keeps one independent variable from dominating the others
# the scaler is fitted on the training set only and then applied to the validation set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_valid = sc.transform(x_valid)
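As a rough illustration of the scaling cell above, here is standardisation done by hand on one feature. The age values are made up and `fit_scaler` / `apply_scaler` are hypothetical helpers. The point is that the mean and standard deviation come from the training data only and are then reused for the validation data.

```python
import statistics

def fit_scaler(column):
    """Return the mean and the population standard deviation of one feature."""
    return statistics.mean(column), statistics.pstdev(column)

def apply_scaler(column, mean, std):
    return [(v - mean) / std for v in column]

train_age = [20.0, 30.0, 40.0, 50.0]   # made-up values
valid_age = [35.0, 60.0]               # made-up values

mean, std = fit_scaler(train_age)                   # statistics from the training data only
train_scaled = apply_scaler(train_age, mean, std)
valid_scaled = apply_scaler(valid_age, mean, std)   # reuse them, do not refit
print(mean, valid_scaled[0])  # 35.0 0.0
```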



Taking a quick look here:


In [14]:
x_train.shape

In [15]:
x_valid.shape

# Modelling<a name = "modelling"></a>

In [16]:
# we need the Sequential module to initialise the network
# and the Dense module to build the layers
from keras.models import Sequential
from keras.layers import Dense

In [17]:
# there are 2 ways to create a deep learning model:
# one is to define the sequence of layers,
# the other is to define the graph

# let us create a classifier (since it is a classification problem)
classifier = Sequential()
'''
# then we need to build the layers one by one: the input layer, the hidden layers and finally the output layer

# recall and refer to the NN chart (summary from section 3)
# step 1: initialise the weights randomly. Dense will do this for us
# step 2: feed the 11 input variables (the number of columns we have) into the input nodes
# step 3: forward propagate through the activation functions of the hidden layers. Some of the best ones are the rectifier (ReLU)
# and the sigmoid (a good choice for the output of a classifier)
# step 4: compute the error
# step 5: back propagate the error from right to left. There are several ways of updating the weights; the learning rate parameter decides
# how much the weights are updated

# step 6: repeat steps 1-5 and update the weights either after each observation or after a batch of observations, e.g. after every 10 observations

# step 7: when the entire training set has passed through the ANN, that is 1 epoch. We then repeat for many more epochs.
'''

# add the first hidden layer
# Dense(units = number of nodes we want in this hidden layer)
# input_dim tells this layer how many inputs arrive from the previous (input) layer; units is how many nodes this
# .add call creates (a hidden layer)
# there is no fixed rule for the number of nodes. general tip:
# nodes in HL = average(nodes in input layer, nodes in output layer), then tune further as a hyperparameter
# (11 + 1) / 2 = 6

# kernel_initializer is the type of random initialisation for the weights (called init in older Keras versions)
# activation = the activation function we want
# input_dim = required for the first add only, since later layers infer their input size
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
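To make the hidden layer less of a black box, here is a minimal sketch of what one fully connected layer with a ReLU activation computes for a single observation. It is plain Python rather than Keras; `dense_relu` is a hypothetical helper, and the weight range and the input values are assumptions made for the illustration.

```python
import random

def dense_relu(inputs, weights, biases):
    """Forward pass of one fully connected layer followed by ReLU."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(max(0.0, z))   # ReLU: negative sums become 0
    return outputs

rng = random.Random(0)
n_inputs, n_units = 11, 6
# small uniform random weights; the range is an assumption for this sketch
weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_inputs)] for _ in range(n_units)]
biases = [0.0] * n_units

x = [rng.gauss(0.0, 1.0) for _ in range(n_inputs)]   # one made-up, already scaled observation
hidden = dense_relu(x, weights, biases)
print(len(hidden))  # 6
```

Each of the 6 nodes holds 11 weights plus a bias, so this layer has 6 * 11 + 6 = 72 trainable parameters.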



In [18]:
# time to add another hidden layer
# second HL
# units can again start from the average; kernel_initializer is still needed to initialise the weights randomly
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

In [19]:
# time to add the output layer
# the sigmoid function returns a value between 0 and 1 (which is what we need for a probability)

# if the dependent variable has more than 2 categories, change units to the number of classes
# and use activation = 'softmax'. softmax is the generalisation of sigmoid to more than 2 categories
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
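The link between sigmoid and softmax mentioned in the comments can be checked numerically. A small sketch in plain Python with a made-up logit: a two-class softmax over `[0, z]` gives the same probability as `sigmoid(z)`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    shift = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

z = 0.8                                        # made-up logit
two_class = softmax([0.0, z])
print(math.isclose(two_class[1], sigmoid(z)))  # True
print(softmax([1.0, 2.0, 3.0]))                # three probabilities that sum to 1
```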



In [20]:
# time to compile before training
# optimizer = the training algorithm: we want adam (a type of stochastic gradient descent)
# loss = the objective function: log loss = binary_crossentropy; with more than 2 outcomes, use categorical_crossentropy

classifier.compile(optimizer= 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'] )
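For intuition about the loss chosen above, here is a hand-rolled version of binary cross-entropy (log loss). It is a sketch with made-up labels and probabilities; the clipping constant `eps` is an assumption for the illustration, not a value taken from Keras.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean log loss for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

good = binary_crossentropy([1, 0, 1], [0.9, 0.1, 0.8])   # confident and right
bad = binary_crossentropy([1, 0, 1], [0.2, 0.7, 0.3])    # confident and wrong
print(good < bad)  # True
```

Confident wrong predictions are penalised much more heavily than confident right ones are rewarded, which is what pushes the sigmoid output towards the true label during training.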

In [21]:
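# time to train
# fit to the training set
# x = training features, y = training labels
# batch_size = number of observations per weight update, epochs = passes over the training set (no fixed rules for either)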
classifier.fit(x_train, y_train, batch_size= 10, epochs= 100)
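A quick bit of arithmetic shows what these two settings mean for training. It assumes the 8000 training rows mentioned in the split cell.

```python
import math

n_train = 8000     # rows in the training split, per the comment in the split cell
batch_size = 10
epochs = 100

updates_per_epoch = math.ceil(n_train / batch_size)   # one weight update per batch
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)  # 800 80000
```

A larger batch size means fewer, smoother updates per epoch; a smaller one means more, noisier updates.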

In [22]:
y_pred_val = classifier.predict(x_valid)
y_pred_val[:,0]

In [23]:
results = pd.DataFrame({'pred':y_pred_val[:,0], 'index':y_valid.index, 'actual':y_valid})

In [24]:
results['pred_convert'] =  results['pred'] > 0.5
y_prediction = y_pred_val > 0.5

In [25]:
from sklearn.metrics import confusion_matrix, accuracy_score

confusion_matrix(y_valid, y_prediction)

In [26]:
accuracy_score(y_valid,y_prediction)
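The two metrics above are easy to reproduce by hand. The sketch below uses made-up labels and probabilities; `confusion_counts` and `accuracy` are hypothetical helpers that follow the same row and column layout as `confusion_matrix` (rows are actual classes, columns are predicted classes).

```python
def confusion_counts(actual, predicted):
    """Return [[TN, FP], [FN, TP]] for 0/1 labels."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[int(a)][int(p)] += 1
    return matrix

def accuracy(matrix):
    correct = matrix[0][0] + matrix[1][1]       # the diagonal holds the correct predictions
    total = sum(sum(row) for row in matrix)
    return correct / total

actual = [0, 0, 1, 1, 0, 1]                      # made-up labels
probabilities = [0.1, 0.6, 0.8, 0.3, 0.2, 0.9]   # made-up model outputs
predicted = [p > 0.5 for p in probabilities]     # same 0.5 threshold as above

cm = confusion_counts(actual, predicted)
print(cm)            # [[2, 1], [1, 2]]
print(accuracy(cm))  # 0.6666666666666666
```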

# homework 

Use our ANN model to predict whether the customer with the following information will leave the bank:

- Geography: France
- Credit Score: 600
- Gender: Male
- Age: 40 years old
- Tenure: 3 years
- Balance: $60000
- Number of Products: 2
- Does this customer have a credit card? Yes
- Is this customer an Active Member? Yes
- Estimated Salary: $50000

So should we say goodbye to that customer?

In [27]:
features

In [28]:
# build a one-row DataFrame for the new customer, with the columns in the same order as features
# (assigning scalars to the columns of an empty DataFrame adds no rows, so the row is passed in directly)
new_customer = {'CreditScore':600,'Geography':'France','Gender':'Male','Age':40,'Tenure':3,'Balance':60000,'NumOfProducts':2,'HasCrCard':1,'IsActiveMember':1,'EstimatedSalary':50000}
x_test = pd.DataFrame([new_customer], columns = features)

In [31]:
# encode and transform 

x_test.loc[:,'Geography'] = labelencode_country.transform(x_test.loc[:,'Geography'])
x_test.loc[:,'Gender'] = labelencode_gender.transform(x_test.loc[:,'Gender'])
x_test = onehot_geo.transform(x_test).toarray()
x_test = x_test[:,1:]
# scaling 
x_test = sc.transform(x_test)

In [32]:
x_test.shape

In [33]:
# predict 
classifier.predict(x_test)

Ans: he will not leave the bank! 


# Tuning the ANN

## Implementing k-fold cross validation

In [34]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Dense

'''
https://stackoverflow.com/questions/42815131/keras-for-implement-convolution-neural-network
'''

def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer= 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'] )
    
    return classifier 




In [35]:
# KerasClassifier is a wrapper that lets scikit-learn treat the Keras model as an estimator,
# so we fit the wrapper to our data instead of using the Keras classifier directly
# build_fn: a function that returns the compiled model (the architecture)

classifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, epochs = 100)

# estimator: the model (sk learn model)
# cv is the number of folds
accuracies = cross_val_score(estimator= classifier, X = x_train, y = y_train, cv =10 )#, n_jobs = -1)
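The idea behind `cv = 10` can be sketched as follows. This is a simplified illustration: `k_fold_indices` is a hypothetical helper that cuts the rows into contiguous folds, whereas scikit-learn may also stratify the folds by class. The fold scores at the end are made up, only to show how the results are summarised.

```python
import statistics

def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size

folds = list(k_fold_indices(8000, 10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 7200 800

# made-up fold accuracies, only to show how the scores are summarised
fold_scores = [0.84, 0.86, 0.83, 0.85, 0.84]
print(statistics.mean(fold_scores), statistics.pstdev(fold_scores))
```

Every row is used for testing exactly once, so the mean score is a steadier estimate than a single train / validation split, and the spread across folds hints at the variance of the model.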





In [None]:
print('the mean accuracy is : {}'.format(accuracies.mean()))


## Drop out regularization 

Dropout helps prevent overfitting in deep learning.

In [36]:
'''
During training, some neurons are randomly disabled (dropout) to prevent them from becoming too dependent on
each other when they learn the correlations. Because a different set of neurons is switched off each time, the ANN
learns several independent correlations in the data, since the configuration of the neurons is never the same

this also prevents neurons from learning too much === prevents overfitting
'''
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout # applied to the neurons to randomly disable them 

# we will apply dropout to some layers: the first and the second hidden layer
# if there is overfitting, it should be applied to all the hidden layers
# let us add the dropout
# the Dropout argument is:
    # rate (called p in older Keras versions): the fraction of neurons to drop. e.g. with 0.1, each neuron is disabled with probability 0.1
    # at each iteration. start with 0.1 and increase by 0.1 to see if the overfitting is solved. In general, do not go over 0.5, as the model
    # may then underfit. In this example, we disable 10% of the 6 nodes on average.



classifier = Sequential()
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
classifier.add(Dropout(rate = 0.1)) # dropout on the first hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
classifier.add(Dropout(rate = 0.1)) # dropout on the second hidden layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
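What a dropout layer does to the activations during training can be sketched in plain Python. `dropout` below is a hypothetical helper implementing the common "inverted dropout" scheme (dropped values become 0, kept values are scaled up by 1 / (1 - rate)). The activations are made up, and this is an illustration of the idea rather than Keras' actual code.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)
hidden = [0.5, 1.2, 0.0, 0.7, 2.0, 0.3]   # made-up activations of a 6-node layer
for step in range(3):
    # a different random subset of nodes may be switched off at every training step
    print(dropout(hidden, rate=0.1, rng=rng))
```

At prediction time no nodes are dropped; the rescaling during training keeps the expected size of the activations the same in both modes.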

## Parameter tuning 

In [47]:
'''
Using Grid Search for parameter tuning 
'''

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense


def build_classifier(optimizer):
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer= optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'] )
    return classifier 




In [48]:
classifier = KerasClassifier(build_fn = build_classifier) # epochs and batch size are left out here; grid search will choose them


'''
Create a dict for the hyperparameters.
each key is a parameter name and each value is the list of values to try

if you want to tune any parameter inside the ANN architecture, the build function must accept it as an argument, like
optimizer in this example. rmsprop is another optimizer based on SGD
'''
parameters = {
    'batch_size':[25,32], # powers of 2 such as 32 are a common choice
    'epochs':[100,500],
    'optimizer': ['adam','rmsprop']
}



grid_search = GridSearchCV(estimator= classifier ,
                          param_grid= parameters,
                          scoring = 'accuracy',
                          cv =10)
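Before fitting, it helps to count how much work this grid implies. The sketch below only enumerates the combinations with `itertools.product`; it repeats the same `parameters` dict and does not train anything.

```python
from itertools import product

parameters = {
    'batch_size': [25, 32],
    'epochs': [100, 500],
    'optimizer': ['adam', 'rmsprop'],
}
cv = 10

names = list(parameters)
combinations = [dict(zip(names, values)) for values in product(*parameters.values())]
print(len(combinations))        # 8
print(len(combinations) * cv)   # 80 model fits, plus one final refit on all the data
print(combinations[0])          # {'batch_size': 25, 'epochs': 100, 'optimizer': 'adam'}
```

With up to 500 epochs per fit, this search can take a long time, so it may be worth starting with a smaller grid.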

In [None]:
# fit this grid search to the data

final_ann = grid_search.fit(x_train, y_train)

In [None]:
best_parameter = final_ann.best_params_
best_accuracy = final_ann.best_score_