# Neural Networks - Practical 1

This is an initial practical for neural networks. It loads an existing dataset of 10k customers, pre-processes it, and learns an initial neural network model.



## Required imports

Please note that this practical also switches off some warnings.

In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import tensorflow as tf

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix



# use the public tf.keras API rather than the private tensorflow.python modules
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Sequential


## Read in the data 

This file contains a not very biological dataset. It comprises customers and their shopping behaviour. I chose this one to demonstrate a bit of pre-processing, a task which will potentially be required for next week's task.

A more detailed introduction to data wrangling will be given in another lecture.


In [None]:
churn_file = './data/Churn_Modelling.csv'

df_churn = pd.read_csv(churn_file)
#df_churn_attributes = list(df_churn.columns.values)


## Drop non-data columns

Some of the columns (RowNumber, CustomerId and Surname) contain information that is specific to a single customer. We do not use these here.

In [None]:
# keep a copied dataframe without non-data columns
df_churn_copy = df_churn.drop(['RowNumber', 'Surname', 'CustomerId'],axis=1)

## Have a look at the dataframe

In [None]:
df_churn_copy.head()




## Encoding string data

To encode categorical data such as in the columns Geography and Gender, we use the LabelEncoder.

In [None]:
labelencoder_geography = LabelEncoder()
df_churn_copy['Geography'] = labelencoder_geography.fit_transform(df_churn_copy['Geography'])

labelencoder_gender = LabelEncoder()
df_churn_copy['Gender'] = labelencoder_gender.fit_transform(df_churn_copy['Gender'])


In [None]:
df_churn['Geography'].unique()

## Encoding categorical data using one-hot-encoding

Not all modelling algorithms can easily cope with categorical data. In this section we map categorical values to numerical ones.

As an example, one could map 

| Geography | Geography_mapped |
| ------------- |:-------------:|
| France       | 0 |
| Spain        | 1 |
| Germany      | 2 |

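This is what we did in the section just before. Note that the LabelEncoder assigns the integers in alphabetical order, so there Germany is mapped to 1 and Spain to 2.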
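
The integer mapping can be reproduced by hand to see which code each country receives. The following is a minimal sketch, assuming a made-up toy array `countries` rather than the real column. It uses `np.unique`, which sorts the distinct values and returns the position of each entry in that sorted list; this mirrors the alphabetical ordering of the `LabelEncoder`.

```python
import numpy as np

# Hypothetical toy column; the real data lives in df_churn['Geography']
countries = np.array(['France', 'Spain', 'Germany', 'France', 'Germany'])

# np.unique sorts the distinct values and, with return_inverse=True,
# also returns the index of each entry in that sorted list
classes, codes = np.unique(countries, return_inverse=True)

mapping = {str(name): int(code) for code, name in enumerate(classes)}
print(mapping)         # {'France': 0, 'Germany': 1, 'Spain': 2}
print(codes.tolist())  # [0, 2, 1, 0, 1]
```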

However, one problem with this approach is that numerical values have an inherent ordinal meaning. For example, if one would like to know how similar ```'France'``` is to ```'Spain'``` or ```'France'``` is to ```'Germany'```, one would, after mapping, compare 0 to 1 or 0 to 2. For algorithms, and especially for neural networks, these differences carry meaning: Germany would appear to be twice as far from France as Spain is.

One approach is to use the so-called one-hot encoding. Here, one creates an additional variable for each of the possible values. The mapping could look like the following:

| Geography | Geography__France | Geography__Spain | Geography__Germany | 
| ------------- |:-------------:|:-------------:|:-------------:|
| France      | 1 | 0 | 0 | 
| Spain       | 0 | 1 | 0 | 
| Germany     | 0 | 0 | 1 | 

For simplicity in the later session, we will use both approaches. But, please keep in mind that some of the results should be taken with a pinch of salt, when using the simply mapped version.
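
The difference between the two encodings can be made concrete by measuring distances between countries. The sketch below is only an illustration: the dictionaries `int_codes` and `one_hot` are typed in by hand from the two tables above, and the helper functions are hypothetical names. With the integer codes France is closer to Spain than to Germany, whereas with one-hot vectors all countries are equally far apart.

```python
import numpy as np

# Integer codes from the first table (France=0, Spain=1, Germany=2)
int_codes = {'France': 0, 'Spain': 1, 'Germany': 2}

# One-hot vectors from the second table: one column per country
one_hot = {
    'France':  np.array([1, 0, 0]),
    'Spain':   np.array([0, 1, 0]),
    'Germany': np.array([0, 0, 1]),
}

def int_distance(a, b):
    """Absolute difference between the integer codes of two countries."""
    return abs(int_codes[a] - int_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two countries."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

# Print both distances for each pair of countries
for pair in [('France', 'Spain'), ('France', 'Germany')]:
    print(pair, int_distance(*pair), round(one_hot_distance(*pair), 3))
```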



In [None]:
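# we only one-hot encode column index number 1 (i.e. the second one, Geography)
# the encoded columns come first, all other columns are passed through unchanged
column_transformer = make_column_transformer((OneHotEncoder(), [1]),
                                             remainder='passthrough', sparse_threshold=0)
df_churn_copy2 = pd.DataFrame(column_transformer.fit_transform(df_churn_copy))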


## Train-test split

For simplicity of the exercise, we just use a train-test split. Please feel free to do a proper cross-validation (CV) in your own time.
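
For those who want to try cross-validation, the idea can be sketched with plain NumPy: shuffle the row indices, cut them into k folds, and use each fold once as the test set. The function `k_fold_indices` below is a hypothetical helper written for illustration; scikit-learn provides ready-made tools such as `KFold` in `sklearn.model_selection`, which are the better choice in practice.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

# Toy example with 20 samples: every split has 16 training and 4 test rows
splits = list(k_fold_indices(n_samples=20, n_folds=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```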

In [None]:

X = df_churn_copy2[range(12)]
y = df_churn_copy2[12]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

## Scaling

The datasets in other practicals were already scaled. Here, we do the scaling ourselves. Remember that pre-processing steps should generally be fitted on the training data only. Hence, we use the StandardScaler and fit the scaling on the training data only. The learnt scaling is then applied to the test set.
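
To see what "fit on the training data, apply to the test set" means, here is a minimal NumPy sketch of standardisation. The arrays `train` and `test` are made-up toy data. The mean and standard deviation are computed from `train` only and then reused for `test`, so the scaled test set does not necessarily have mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical toy data: 4 training rows and 2 test rows with 2 features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[2.5, 25.0], [5.0, 50.0]])

# "fit": learn mean and standard deviation from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the same learnt statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately [0, 0]
print(train_scaled.std(axis=0))   # approximately [1, 1]
print(test_scaled)
```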


In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

## Learning your first Neural Network

We use the Keras module of TensorFlow. It allows us to build a neural network as a sequence of layers (using Sequential).

In [None]:
#Initializing Neural Network
neural_network = Sequential()

### Adding the layers

We subsequently add two layers. The input layer does not need to be added explicitly; it is described indirectly by defining the input dimension of the first layer.

The first layer takes in a 12-dimensional vector, uses ReLU as activation function and has 6 hidden nodes.

The output layer takes the 6 outputs from the hidden layer, uses the Sigmoid activation function and returns a single output. 
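
What these two layers compute can be written out in a few lines of NumPy. This is only a sketch of the forward pass with randomly initialised, untrained weights (the names `W1`, `b1`, `W2`, `b2` are chosen for illustration); it is not how Keras stores or trains the model. It also shows where the number of trainable parameters comes from: 12·6 + 6 for the hidden layer and 6·1 + 1 for the output layer.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: negative values are set to zero."""
    return np.maximum(z, 0.0)

def sigmoid(z):
    """Logistic function, squashing values into the range between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Randomly initialised (untrained) weights, for illustration only
W1, b1 = rng.normal(size=(12, 6)), np.zeros(6)   # 12 inputs -> 6 hidden nodes
W2, b2 = rng.normal(size=(6, 1)), np.zeros(1)    # 6 hidden nodes -> 1 output

x = rng.normal(size=(3, 12))          # a batch of 3 hypothetical customers
hidden = relu(x @ W1 + b1)            # shape (3, 6)
output = sigmoid(hidden @ W2 + b2)    # shape (3, 1), values between 0 and 1

n_params = W1.size + b1.size + W2.size + b2.size
print(hidden.shape, output.shape, n_params)  # (3, 6) (3, 1) 85
```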


In [None]:
# Adding the input layer and the first hidden layer

neural_network.add(Dense(activation = 'relu', input_dim = 12, units=6))
neural_network.add(Dense(activation = 'sigmoid', units=1))


## Compiling the network

The network needs to be compiled for TensorFlow. Here we use Adam as the optimiser (in contrast to simple gradient descent with a fixed learning rate), binary cross-entropy as the loss, and accuracy as the performance metric.
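
The loss and the metric named in the compile step can be computed by hand for a handful of predictions. The following is a minimal sketch, assuming made-up labels `y_true` and network outputs `y_prob`; the Keras implementation may differ in details such as the clipping constant `eps`.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under y_prob."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    return float(np.mean((y_prob > threshold).astype(int) == y_true))

y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.6])   # hypothetical network outputs

print(round(binary_crossentropy(y_true, y_prob), 4))
print(accuracy(y_true, y_prob))  # 0.5
```

Note that the loss is sensitive to how confident a prediction is, whereas accuracy only looks at which side of the threshold it falls on.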

In [None]:
# Compiling Neural Network
neural_network.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

## Training the network

Here, we train the network. 


'batch_size' defines how many examples are presented to the network at a time; the gradient is calculated once per batch.

'epochs' defines how many passes over the full training set are used for training.
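
Together, the two parameters determine how many gradient updates the network receives. A small sketch, assuming the 70% training split of the 10,000 customers used in this practical (the helper `updates_per_epoch` is a hypothetical name):

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of batches (and hence gradient updates) in one epoch."""
    return math.ceil(n_samples / batch_size)

n_train = 7000      # 70% of the 10,000 customers
batch_size = 10
epochs = 10

steps = updates_per_epoch(n_train, batch_size)
print(steps)            # 700
print(steps * epochs)   # 7000
```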

In [None]:
# Fitting our model 
neural_network.fit(X_train, y_train, batch_size = 10, epochs = 10)

## Predicting the test set

In [None]:
# Predicting the Test set results
y_pred = neural_network.predict(X_test)
y_pred = (y_pred > 0.5)

cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix')
print(cm)
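
The confusion matrix can be turned into summary numbers. The sketch below uses a made-up matrix `cm_example`, not a result from this dataset, in the layout that scikit-learn returns: rows are the true classes and columns the predicted classes, so for two classes the entries are true negatives, false positives, false negatives and true positives.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes
cm_example = np.array([[2250, 150],
                       [ 400, 200]])

tn, fp, fn, tp = cm_example.ravel()

accuracy = (tp + tn) / cm_example.sum()
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

In this toy matrix the accuracy looks reasonable although only a third of the positive class is found, which is why it is worth looking beyond accuracy when the classes are imbalanced.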

## Additional layers

One can also add additional layers to design a 'deeper' network. 

Please observe how the metric (accuracy) changes over each epoch.


In [None]:
#Initializing Neural Network
neural_network = Sequential()
neural_network.add(Dense(activation = 'relu', input_dim = 12, units=6))
neural_network.add(Dense(activation = 'relu', units=6))
neural_network.add(Dense(activation = 'relu', units=6))
neural_network.add(Dense(activation = 'relu', units=6))
neural_network.add(Dense(activation = 'relu', units=6))
neural_network.add(Dense(activation = 'sigmoid', units=1))
neural_network.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
neural_network.fit(X_train, y_train, batch_size = 10, epochs = 10)
y_pred = neural_network.predict(X_test)
y_pred = (y_pred > 0.5)

cm = confusion_matrix(y_test, y_pred)
print(cm)

## Sigmoid as the only activation function 

What happens when we use the same design, but with the Sigmoid function as activation in every layer?

Why does this happen?

In [None]:
#Initializing Neural Network
neural_network = Sequential()
neural_network.add(Dense(activation = 'sigmoid', input_dim = 12, units=6))
neural_network.add(Dense(activation = 'sigmoid', units=6))
neural_network.add(Dense(activation = 'sigmoid', units=6))
neural_network.add(Dense(activation = 'sigmoid', units=6))
neural_network.add(Dense(activation = 'sigmoid', units=6))
neural_network.add(Dense(activation = 'sigmoid', units=1))
neural_network.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
neural_network.fit(X_train, y_train, batch_size = 10, epochs = 10)
y_pred = neural_network.predict(X_test)
y_pred = (y_pred > 0.5)

cm = confusion_matrix(y_test, y_pred)
print(cm)

# Multiple classes

So far we have worked with binary classes. To enable the NN to work with multiple classes, we have to add some additional architecture around it.

For this example, we first convert the binary class into a binary vector of length 2, with one entry for each class.
We use this as the output for training. To normalize the output over all possible classes (here just two), we use the softmax activation, which maps the original output into probabilities. The effect is that the prediction will now be the probability for each class.
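
Both steps can be sketched in NumPy: turning labels into vectors of length 2, and applying softmax to the raw outputs of the last layer. The arrays `y` and `logits` are made-up examples, and `softmax` is written out by hand here rather than taken from Keras.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum keeps exp() stable."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Binary labels turned into vectors of length 2 (one column per class)
y = np.array([0, 1, 1, 0])
y_one_hot = np.eye(2)[y]

# Hypothetical raw outputs of the last layer for 4 examples
logits = np.array([[2.0, 0.0], [0.5, 1.5], [0.2, 0.1], [-1.0, 3.0]])
probs = softmax(logits)

print(y_one_hot)
print(probs.sum(axis=1))      # each row sums to 1
print(probs.argmax(axis=1))   # most probable class per example: [0 1 0 1]
```

Taking the `argmax` of each row converts the predicted probabilities back into a single class label, which is what is needed for a confusion matrix.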

Have a look at the following code.




In [None]:
onehotencoder_labels = OneHotEncoder()

# the encoder expects a 2D array with a single column, so we reshape the labels
onehotencoder_labels.fit(y_train.to_numpy().reshape(-1, 1))

# encode using the new representation
y2_train = onehotencoder_labels.transform(y_train.to_numpy().reshape(-1, 1)).toarray()
y2_test = onehotencoder_labels.transform(y_test.to_numpy().reshape(-1, 1)).toarray()

In [None]:
#Initializing Neural Network
neural_network = Sequential()
neural_network.add(Dense(activation = 'relu', input_dim = 12, units=6))
neural_network.add(Dense(activation = 'sigmoid', units=6))
neural_network.add(Dense(activation = 'softmax', units=2))



In [None]:
# categorical cross-entropy matches the one-hot labels and the softmax output
neural_network.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
neural_network.fit(X_train, y2_train, batch_size = 10, epochs = 10)



In [None]:
y_pred = neural_network.predict(X_test)



In [None]:
y_pred