# Modelling and Training

We can now specify our neural network. To find suitable hyperparameters, we combine manual trials of common, reasonable hyperparameter combinations with an automated grid search using the talos package. In doing so, we aim to find good hyperparameters without relying on heavy computational power.

As we are dealing with a binary classification problem and use the sigmoid function as the activation in the output layer, a single output neuron is sufficient. After trying out several other architectures, we decided to use three hidden layers containing 80, 40, and 10 neurons, respectively. We chose not to include dropout, as it did not improve our results. However, we included an L2 regularizer to prevent overfitting, which indeed improved our cross-validated performance measure. Furthermore, we specified glorot_normal as the initializer for the weights and initialized the biases to zero. Although the He initializer is often the best choice when using the ReLU activation function in the hidden layers, the Glorot initializer turned out to be the better option here. We also tried out several optimizers (SGD, Adam, Adadelta, Adagrad, RMSprop, Adamax, Nadam) and decided to use Adam, which incorporates momentum by design.
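To make the architecture and the initializer choice more concrete, the following sketch counts the trainable parameters per layer and compares the standard deviations that the Glorot-normal and He-normal schemes would use, namely sqrt(2 / (fan_in + fan_out)) and sqrt(2 / fan_in). It is an illustration in plain Python, not part of the original notebook. The loop also covers the output layer for completeness, although the model only sets glorot_normal explicitly on the hidden layers.

```python
import math

# Layer widths as described above: 125 inputs, three hidden layers, one output neuron.
layer_sizes = [125, 80, 40, 10, 1]

total_params = 0
for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_params = fan_in * fan_out + fan_out             # weights plus biases
    glorot_std = math.sqrt(2.0 / (fan_in + fan_out))  # Glorot (Xavier) normal
    he_std = math.sqrt(2.0 / fan_in)                  # He normal
    total_params += n_params
    print(f"{fan_in:>3} -> {fan_out:<3} params={n_params:<6} "
          f"glorot_std={glorot_std:.4f} he_std={he_std:.4f}")

print("total trainable parameters:", total_params)
```

The He scheme always yields the larger standard deviation, because it ignores fan_out. The gap is widest in the first hidden layer, where fan_out is largest relative to fan_in.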

In [0]:
# Three hidden ReLU layers (80, 40, 10 neurons) with L2 regularization and Glorot-normal weights
basemodel = Sequential()
basemodel.add(Dense(80, kernel_regularizer=l2(0.01), kernel_initializer=glorot_normal(seed=42), bias_initializer='zeros', activation='relu', input_dim = 125))
basemodel.add(Dense(40, kernel_regularizer=l2(0.01), kernel_initializer=glorot_normal(seed=42), bias_initializer='zeros', activation='relu'))
basemodel.add(Dense(10, kernel_regularizer=l2(0.01), kernel_initializer=glorot_normal(seed=42), bias_initializer='zeros', activation='relu'))
# Single sigmoid output neuron for binary classification
basemodel.add(Dense(1, activation='sigmoid'))

# Adam optimizer with a learning rate of 0.00005
basemodel.compile(optimizer=Adam(lr = 0.00005), loss='binary_crossentropy', metrics=['accuracy'])

Having specified the model, we can now train it. For this, we need to fix the remaining training hyperparameters: together with the learning rate of 0.00005 set in the optimizer above, we start with a batch size of 24 and 10 epochs. Further down, we will try to optimize the learning rate and the batch size using the talos package.

In [0]:
story = basemodel.fit(x = X_train, y = y_train, validation_data=(X_test, y_test), epochs = 10, batch_size=24)

Train on 41079 samples, validate on 10270 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
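
As a quick sanity check on these settings, the number of weight updates follows directly from the 41079 training samples reported in the log, the batch size of 24, and the 10 epochs. The sketch below is plain arithmetic and assumes the final, smaller batch of each epoch is still used for an update.

```python
import math

n_train = 41079   # training samples, taken from the log above
batch_size = 24
epochs = 10

# Number of gradient updates per epoch; the last batch may be smaller.
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print("updates per epoch:", steps_per_epoch)
print("size of final batch:", last_batch)
print("updates over all epochs:", total_updates)
```

Doubling the batch size to 48 would roughly halve the number of updates per epoch, which is one reason the learning rate and the batch size are tuned together.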


To evaluate this model further below, we save its predictions on the test set:

In [0]:
predictions = basemodel.predict(X_test)

Because the learning rate is a very important hyperparameter and interacts strongly with the batch size, we now search for the best values of both using a grid search with the talos package:

In [0]:
import talos as ta

In [0]:
X_talos=np.asarray(X)
y_talos=np.asarray(y)

def minimodel(x_train, y_train, x_val, y_val, params):

    model = Sequential()
    model.add(Dense(80, kernel_regularizer=l2(0.01), kernel_initializer=glorot_normal(seed=42), bias_initializer='zeros', activation='relu', input_dim = 125))
    model.add(Dense(40, kernel_regularizer=l2(0.01), kernel_initializer=glorot_normal(seed=42), bias_initializer='zeros', activation='relu'))
    model.add(Dense(10, kernel_regularizer=l2(0.01), kernel_initializer=glorot_normal(seed=42), bias_initializer='zeros', activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer=Adam(lr = params['learningrate']), loss='binary_crossentropy', metrics=['accuracy'])

    story = model.fit(x = x_train, y = y_train,
          validation_data=(x_val, y_val),
          epochs = 10,
          batch_size=params['batchsize'])
    
    return story, model
  
p = {'learningrate': [0.00001, 0.00003, 0.00005, 0.00007, 0.0001],
    'batchsize': [24, 48]}

t = ta.Scan(X_talos, y_talos, p, minimodel)

  0%|          | 0/10 [00:00<?, ?it/s]

Train on 35944 samples, validate on 15405 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


 10%|█         | 1/10 [00:15<02:20, 15.58s/it]

100%|██████████| 10/10 [03:15<00:00, 21.30s/it]
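
The ten rounds in the progress bar correspond to the full grid of five learning rates times two batch sizes. The sketch below enumerates that grid with `itertools.product` and derives the validation share from the sample counts in the log. It only illustrates what a full grid contains and is not how talos works internally. The order in which talos visits the combinations may differ.

```python
from itertools import product

# Same parameter dictionary as passed to ta.Scan above.
p = {'learningrate': [0.00001, 0.00003, 0.00005, 0.00007, 0.0001],
     'batchsize': [24, 48]}

# Every combination of one value per hyperparameter.
grid = [dict(zip(p.keys(), values)) for values in product(*p.values())]
print("number of combinations:", len(grid))
for combo in grid:
    print(combo)

# Split sizes reported in the training log of each round.
n_fit, n_val = 35944, 15405
n_total = n_fit + n_val
val_share = n_val / n_total
print("samples in total:", n_total)
print("validation share:", round(val_share, 3))
```

The validation share of about 30% means each configuration is trained on fewer samples than the base model above, so the scores of the two runs are not directly comparable.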


Let's have a look at the results of our grid search:

In [0]:
print(t.data)

   round_epochs  val_loss   val_acc  ...       acc  learningrate  batchsize
0            10  1.540983  0.918338  ...  0.920654       0.00007         48
1            10  2.031460  0.918338  ...  0.920627       0.00001         48
2            10  1.524614  0.918403  ...  0.920571       0.00003         48
3            10  0.369941  0.918403  ...  0.920571       0.00005         24
4            10  0.392205  0.918403  ...  0.920571       0.00003         24
5            10  0.330297  0.918403  ...  0.920571       0.00005         48
6            10  0.548199  0.917884  ...  0.919931       0.00001         24
7            10  0.382778  0.918663  ...  0.920154       0.00010         48
8            10  0.357133  0.918403  ...  0.920627       0.00010         24
9            10  0.350741  0.918403  ...  0.920126       0.00007         24

[10 rows x 7 columns]


Looking at the accuracies on the validation set, which are nearly identical across all ten combinations (about 0.918), our initial guess of a learning rate of 0.00005 and a batch size of 24 seems to be a reasonable choice. Indeed, when trying out different combinations, the cross-validated AUC seems to be maximized with these values, and we therefore stick to them.
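
Since the validation accuracies barely differ, it can help to look at the validation loss as well. The following sketch copies the visible columns of the printed table into plain Python and ranks the rows by `val_loss`. This is only one possible criterion, based on a single run per configuration. It does not replace the cross-validated AUC used for the final decision, which is not part of this table.

```python
# (val_loss, val_acc, learningrate, batchsize), copied from the printed t.data
rows = [
    (1.540983, 0.918338, 0.00007, 48),
    (2.031460, 0.918338, 0.00001, 48),
    (1.524614, 0.918403, 0.00003, 48),
    (0.369941, 0.918403, 0.00005, 24),
    (0.392205, 0.918403, 0.00003, 24),
    (0.330297, 0.918403, 0.00005, 48),
    (0.548199, 0.917884, 0.00001, 24),
    (0.382778, 0.918663, 0.00010, 48),
    (0.357133, 0.918403, 0.00010, 24),
    (0.350741, 0.918403, 0.00007, 24),
]

# Rank configurations by validation loss (lower is better).
by_loss = sorted(rows, key=lambda r: r[0])
best = by_loss[0]
for val_loss, val_acc, lr, bs in by_loss:
    print(f"val_loss={val_loss:.6f} val_acc={val_acc:.6f} lr={lr:.5f} batch={bs}")

# Spread of the validation accuracy across the whole grid.
acc_values = [r[1] for r in rows]
acc_spread = max(acc_values) - min(acc_values)
print("val_acc spread:", round(acc_spread, 6))
```

The accuracy spread is below 0.001, whereas the validation loss varies by a factor of about six, so the loss separates the configurations far more clearly than the accuracy does.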

One more issue we need to consider is that the dataset is quite imbalanced: there are far more cases in which users do not purchase an item than cases in which they do. This might be a problem, as our model might become biased towards the majority class. One possible way to address this is to oversample the minority class, which we do using the popular SMOTE algorithm. However, this method does not seem to improve our cross-validated result. We therefore stick to training our model on the original, imbalanced dataset.

In [0]:
# Class distribution of the target variable 'Kauf' (purchase)
print(minidata.groupby('Kauf').size())

# Oversample the minority class in the training set only; newer imbalanced-learn
# releases name these sampling_strategy and fit_resample
X_train_smote, y_train_smote = SMOTE(random_state=12, ratio = 1.0).fit_sample(X_train, y_train)
X_train_smote = pd.DataFrame(X_train_smote)
y_train_smote = pd.DataFrame(y_train_smote)

# Same architecture and training settings as basemodel
smotemodel = Sequential()
smotemodel.add(Dense(80, kernel_regularizer=l2(0.01), kernel_initializer=glorot_normal(seed=42), bias_initializer='zeros', activation='relu', input_dim = 125))
smotemodel.add(Dense(40, kernel_regularizer=l2(0.01), kernel_initializer=glorot_normal(seed=42), bias_initializer='zeros', activation='relu'))
smotemodel.add(Dense(10, kernel_regularizer=l2(0.01), kernel_initializer=glorot_normal(seed=42), bias_initializer='zeros', activation='relu'))
smotemodel.add(Dense(1, activation='sigmoid'))
smotemodel.compile(optimizer=Adam(lr = 0.00005), loss='binary_crossentropy', metrics=['accuracy'])
story2 = smotemodel.fit(x = X_train_smote, y = y_train_smote, validation_data=(X_test, y_test), epochs = 10, batch_size=24, verbose=0)

# Evaluate on the original, non-resampled test set
predictions_smote = smotemodel.predict(X_test)
print('AUC: ', metrics.roc_auc_score(y_test, predictions_smote))

Kauf
0    47237
1     4112
dtype: int64
AUC:  0.7263959240711243
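
The class counts above also show why accuracy alone says little here: always predicting "no purchase" would already be right in about 92% of all cases. The sketch below computes that baseline and then illustrates the interpolation idea behind SMOTE with NumPy. The function `smote_sketch` and the toy data are hypothetical and written for illustration only. They are not the imbalanced-learn implementation used in the cell above.

```python
import numpy as np

# Class counts from the output above ('Kauf' = purchase).
n_no_purchase, n_purchase = 47237, 4112
majority_share = n_no_purchase / (n_no_purchase + n_purchase)
print("accuracy of always predicting 'no purchase':", round(majority_share, 4))


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise squared distances between all minority samples (toy sizes only).
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base sample and one of its k nearest neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class: 20 samples with 3 features (made-up data).
X_minority = np.random.default_rng(1).normal(size=(20, 3))
X_new = smote_sketch(X_minority, n_new=30)
print(X_new.shape)
```

Because every synthetic point lies on a line segment between two existing minority samples, oversampling adds no information from outside the region the minority class already covers. That may be one reason it does not improve the result here.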
