<img src="py-logo.png" width="100pt"/>



# PYTHON FOR DATA SCIENCE 
# VI – NEURAL NETWORKS
*Lasse Ruokolainen*

*Seasoned Data Master, BILOT Consulting Oy* 

***

## (1) Data preparation
When using neural networks, some data preparation is typically required. How the data are prepared affects the model in several ways: it can reduce modeling errors, speed up training, and simplify the overall system. Typically, categorical features are either label- or one-hot encoded, and numerical features are scaled.
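
As a minimal sketch of these three steps, the toy example below label-encodes a binary feature, one-hot encodes a multi-level feature, and min-max scales a numeric one. The toy values and the `embarked` column are assumptions made purely for illustration; they are not taken from the dataset used later.

```python
import pandas as pd

# Toy data; the values and the 'embarked' column are illustrative assumptions
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "embarked": ["S", "C", "S"],
    "fare": [10.0, 30.0, 50.0],
})

# Label encoding: map a binary category to 1/0
toy["sex"] = (toy["sex"] == "female").astype(int)

# One-hot encoding: one indicator column per category level
toy = pd.get_dummies(toy, columns=["embarked"], dtype=int)

# Min-max scaling: map the numeric feature to the range [0, 1]
fare_range = toy["fare"].max() - toy["fare"].min()
toy["fare"] = (toy["fare"] - toy["fare"].min()) / fare_range

print(toy)
```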

### (1.1) Preprocessing

Read in the example data:

In [None]:
import pandas as pd

# Read the Titanic dataset
df = pd.read_csv('Datasets/titanic_dataset.csv')
df.head()

Let's define a function for the necessary preprocessing, so that we can easily process new data:

In [None]:
import numpy as np

# Preprocessing function
def preprocess(data, columns_to_ignore, columns_to_scale):

    # Delete columns (returns a copy, so the caller's DataFrame is untouched)
    data = data.drop(columns_to_ignore, axis = 1)

    # Numeric encoding of "sex": female = 1, male = 0
    data['sex'] = np.where(data['sex'] == 'female', 1, 0)

    # Min-max scale values to the range [0, 1]:
    data[columns_to_scale] = data[columns_to_scale].apply(
        lambda x: (x - x.min()) / (x.max() - x.min()))

    return data.values
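
When genuinely new data arrive, it is usually safer to reuse the scaling parameters learned from the training data than to recompute them on the new records. The sketch below shows one way this could be done. The helper names `fit_min_max` and `apply_min_max` are hypothetical and are not part of the notebook's `preprocess` function.

```python
import pandas as pd

def fit_min_max(train, columns):
    """Record the per-column minimum and range of the training data."""
    mins = train[columns].min()
    ranges = train[columns].max() - mins
    return mins, ranges

def apply_min_max(data, columns, mins, ranges):
    """Scale `columns` with previously recorded parameters (hypothetical helper)."""
    out = data.copy()
    out[columns] = (out[columns] - mins) / ranges
    return out

# Toy training data and one new record (illustrative values)
train = pd.DataFrame({"age": [20.0, 40.0, 60.0], "fare": [0.0, 50.0, 100.0]})
new = pd.DataFrame({"age": [30.0], "fare": [25.0]})

mins, ranges = fit_min_max(train, ["age", "fare"])
scaled_new = apply_min_max(new, ["age", "fare"], mins, ranges)
print(scaled_new)
```

With this approach, a new value outside the training range is mapped outside [0, 1], which makes such records easy to spot.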

And now we can apply the function to our data:

In [None]:
import pandas as pd
import numpy as np

# Ignore 'name' and 'ticket':
to_ignore = ["name", "ticket"]

# Scale 'age' and 'fare':
to_scale = ["age", "fare"]

# Preprocess data
X = preprocess(df.drop('survived',axis=1), to_ignore, to_scale)

print(X.shape)

### (1.2) Data splitting

Of course, we need to split the data to be able to judge the reliability of our model. The challenge here is that we have quite a small sample at our disposal:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, df.survived.values, 
    test_size = 0.10, random_state = 123
)

print(X_train.shape)
print(X_test.shape)
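
With a small test set, a purely random split can leave the classes unevenly represented. One option, sketched below on synthetic data, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The synthetic arrays `X_toy` and `y_toy` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 records, 6 features, 40% positives
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(200, 6))
y_toy = np.array([1] * 80 + [0] * 120)

# stratify=y_toy preserves the 40/60 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.10, random_state=123, stratify=y_toy
)

print("Share of positives in train:", y_tr.mean())
print("Share of positives in test: ", y_te.mean())
```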

## (2) Modeling

### (2.1) Simple MLP

MLP stands for *Multi-Layer Perceptron*.
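
To make the term concrete, here is a minimal NumPy sketch of what such a network computes in a forward pass: each hidden layer applies a linear map followed by a ReLU, and the output layer applies a sigmoid to produce a probability. The layer sizes mirror the ones used in this section (6 inputs, hidden layers of 11 and 7 units, 1 output); the weights are random and untrained, so the outputs are meaningless beyond showing the mechanics.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, weights, biases):
    """Forward pass: ReLU on hidden layers, sigmoid on the output layer."""
    a = X
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)
    return sigmoid(a @ weights[-1] + biases[-1])

# Random, untrained parameters (illustrative assumption)
rng = np.random.default_rng(0)
sizes = [6, 11, 7, 1]
weights = [rng.normal(scale=0.1, size=(m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

X_demo = rng.normal(size=(4, 6))   # 4 records with 6 features each
probs = mlp_forward(X_demo, weights, biases)
print(probs.shape)                 # one probability per record
```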

#### (a) *Model fitting*

In [None]:
from sklearn.neural_network import MLPClassifier

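# Two hidden layers with 11 and 7 units. Note that learning_rate='adaptive'
# only takes effect with solver='sgd'; the default solver is 'adam'.
mlp = MLPClassifier(hidden_layer_sizes = (11, 7),
                    verbose = True, batch_size = 16,
                    max_iter = 100, learning_rate = 'adaptive',
                    early_stopping = True, activation = 'relu')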

mlp.fit(X_train,y_train)

Plot training loss:

In [None]:
import matplotlib.pyplot as plt

plt.plot(mlp.loss_curve_,'.-',markersize = 10)
plt.ylabel("Training loss")
plt.xlabel("Training epoch")
plt.show()
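
When `early_stopping=True`, `MLPClassifier` sets aside part of the training data as a validation set and records its accuracy after every epoch in `validation_scores_`. The sketch below, run on synthetic data from `make_classification` (an assumption, so that the example is self-contained), shows how these scores could be inspected alongside the loss curve.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data with 6 features (illustrative assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     random_state=0)

demo = MLPClassifier(hidden_layer_sizes=(11, 7), early_stopping=True,
                     validation_fraction=0.1, max_iter=100, random_state=0)
demo.fit(X_demo, y_demo)

# One loss value and one validation score are recorded per epoch
print("Epochs run:", demo.n_iter_)
print("Best validation accuracy: %.2f" % demo.best_validation_score_)
```

Plotting `demo.validation_scores_` next to `demo.loss_curve_` would show where the validation accuracy stops improving.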

In [None]:
print('Training Accuracy: %.2f \n' %mlp.score(X_train,y_train))

#### (b) *Model evaluation*

This level of training accuracy does not suggest that the model is over-fitting. 
Still, we need to see how well the model performs on the test data. The `classification_report` function from `sklearn.metrics` is quite useful:

In [None]:
from sklearn.metrics import classification_report

# Calculate model predictions:
preds = mlp.predict(X_test)

print('Test Accuracy: %.2f \n' %mlp.score(X_test,y_test))

print('Classification report:')
print(classification_report(y_true = y_test, y_pred = preds))
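
The precision, recall and F1 values in the report can all be derived from the confusion matrix. The sketch below does this by hand for the positive class, using a small set of made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not model output).

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print("precision: %.2f, recall: %.2f, f1: %.2f" % (precision, recall, f1))
```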

Given the rather small amount of training data (a bit over 1000 records), the model is performing surprisingly well.

Let's plot the ROC curve also:

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve

predsp = mlp.predict_proba(X_test)[:,1]
auc = round(roc_auc_score(y_true=y_test, y_score=predsp),2)
fpr, tpr, ths = roc_curve(y_true=y_test, y_score=predsp,pos_label=1)

plt.plot(fpr,tpr)
plt.plot([0,1],[0,1],'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.gca().set_aspect('equal', adjustable='box')
plt.title('AUC: {}'.format(auc))
plt.show()

An AUC of 0.82 means that, given a randomly chosen survivor and a randomly chosen non-survivor, the model assigns the survivor the higher predicted probability about 82% of the time.
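
The AUC can be read as a ranking probability: the share of (positive, negative) pairs in which the positive record receives the higher score. The sketch below checks this on four made-up scores by counting such pairs directly and comparing the result with `roc_auc_score`; ties are counted as half, which is the usual convention.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(pairwise_auc, roc_auc_score(y_true, y_score))
```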

### (2.2) Keras deep learning

Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

In other words, it is much more flexible than the `MLPClassifier` in `sklearn`. Still, one does not need to program everything from scratch.

#### (a) *Model building*
With Keras one needs to build the network architecture explicitly:

In [None]:
from keras.models import Sequential
from keras.layers import Dense

dnn = Sequential()
# Only the first layer needs the input dimension; later layers infer it
dnn.add(Dense(11, input_dim = 6, kernel_initializer='uniform', activation='relu'))
dnn.add(Dense(7, kernel_initializer='uniform', activation='relu'))
dnn.add(Dense(1, activation='sigmoid'))

dnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
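
As a quick sanity check on the architecture, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-output pair plus one bias per output unit. The sketch below does the arithmetic for the 6-11-7-1 network in plain Python; the result should match the total that `dnn.summary()` reports.

```python
# Layer widths: 6 inputs, hidden layers of 11 and 7 units, 1 output
layer_sizes = [6, 11, 7, 1]

# Weights (n_in * n_out) plus biases (n_out) for each dense layer
params_per_layer = [n_in * n_out + n_out
                    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(params_per_layer)

print(params_per_layer, total_params)  # [77, 84, 8] 169
```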

#### (b) *Model fitting*

In [None]:
history = dnn.fit(X_train, y_train, 
                    validation_data = (X_test, y_test), 
                    epochs = 10, batch_size = 16)

Visualize training loss:

In [None]:
import matplotlib.pyplot as plt

H = pd.DataFrame(history.history)

# One x-value per recorded epoch, so the axis always matches the fit above
epochs = range(1, len(H) + 1)

plt.plot(epochs,H['loss'],'.-',label='train')
plt.plot(epochs,H['val_loss'],'r.-',label='test')
plt.ylabel("Loss")
plt.xlabel("Epoch")
plt.legend()
plt.show()
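
The loss plotted here is the binary cross-entropy chosen when compiling the model. As a rough sketch of what that number measures, the NumPy function below averages the negative log-likelihood of the true labels under the predicted probabilities. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of binary labels (illustrative sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 to keep the logarithms finite
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

# Confident, correct predictions give a lower loss than hesitant ones
confident = binary_crossentropy([1, 0], [0.9, 0.1])
unsure = binary_crossentropy([1, 0], [0.6, 0.4])
print("%.3f %.3f" % (confident, unsure))
```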

#### (c) *Model evaluation*

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Calculate model predictions:
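# predict_classes was removed from recent Keras versions; threshold the
# sigmoid output at 0.5 instead
preds = (dnn.predict(X_test) > 0.5).astype(int).ravel()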

print('Test Accuracy: %.2f \n' %accuracy_score(y_test,preds))

print('Classification report:')
print(classification_report(y_true = y_test, y_pred = preds))

Because of the small amount of data, it is very unlikely that the results will differ significantly between the two models implemented here.