# Format of the data

Before we train a model, we must preprocess our data into a format the model accepts. The Keras `fit` function expects `x` and `y` data: `x` is the input (sample) data and `y` is the output (target) data for the given `x`.

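* `x` can be one of:
    * a NumPy array
    * a tensor or a list of tensors
    * a dictionary mapping input names to the corresponding NumPy arrays or tensors (useful for models with named inputs)
    * a `tf.data` dataset
    * a generator or `keras.utils.Sequence`
* `y` must be consistent with `x`: NumPy targets for NumPy inputs, tensor targets for tensor inputs. When `x` is a dataset, generator, or `keras.utils.Sequence`, `y` is not passed separately, because the targets come from `x`.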
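
As a small illustration of the NumPy-array and dictionary options, the sketch below builds `x` both ways alongside a matching `y`. The input name `age_input` and the toy values are assumptions made for this example only; they do not come from the model built later.

```python
import numpy as np

# Five samples with one feature each: shape (n_samples, n_features)
x_array = np.array([[13.0], [40.0], [64.0], [65.0], [100.0]])
# One integer target per sample, in the same order as x_array
y_array = np.array([0, 0, 0, 1, 1])

# Dictionary form for a model with named inputs;
# "age_input" is a hypothetical input-layer name
x_dict = {"age_input": x_array}

print(x_array.shape, y_array.shape)
```

Either `x_array` or `x_dict` could be paired with `y_array`, provided the model's input layer matches the shape (and, for the dictionary, the name).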

# Data Preparation and Processing

Now that we know what kind of data our model expects, we can prepare the data to train it.

In [17]:
import numpy as np
from random import randint
from sklearn.utils import shuffle
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

## Example Data - Hypothetical Covid-19 Trial

### Trial Details

* An experimental drug was tested on individuals aged 13 to 100 in a clinical trial for COVID-19.
* The trial had 3000 participants. Half were under 65 years old; half were 65 or older.

### Findings 
* Around 85% of patients who were 65 or older experienced side effects.
* Around 85% of patients who were under 65 experienced no side effects.
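
The findings translate into fixed label counts per age group. The short sketch below works them out; the variable names are chosen for this example, and the percentages are treated as exact rather than approximate.

```python
participants_per_group = 1500  # half of the 3000 participants

# 85% of the older group had side effects (integer arithmetic avoids rounding issues)
older_with = participants_per_group * 85 // 100
older_without = participants_per_group - older_with

# 85% of the younger group had no side effects, so 15% had them
younger_with = participants_per_group * 15 // 100
younger_without = participants_per_group - younger_with

print(older_with, older_without, younger_with, younger_without)
```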

In [57]:
# create sample data 

ages = []
labels = []  # 1 = had side effects, 0 = no side effects

random_younger_population = [randint(13,64) for x in range(1500)]
ages.extend(random_younger_population)
# 15% of the 1500 younger people experienced side effects
labels.extend([1 for x in range(225)])
labels.extend([0 for x in range(1275)])


random_older_population = [randint(65, 100) for x in range(1500)]
ages.extend(random_older_population)
# 85% of 1500 older people experienced side effect 
labels.extend([1 for x in range(1275)])
labels.extend([0 for x in range(225)])

ages = np.array(ages)
labels = np.array(labels)

ages, labels = shuffle(ages, labels)


In [22]:
X_train, X_test, y_train, y_test = train_test_split(ages, labels, test_size=0.33)

In [23]:
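# Rescale ages from the original 13 to 100 range into 0 to 1.
# Small, uniformly scaled inputs generally make training faster and more stable.
scaler = MinMaxScaler(feature_range=(0, 1))
# fit_transform expects a 2-D array, so reshape to (n_samples, 1)
scaled_X_train = scaler.fit_transform(X_train.reshape(-1, 1))
# Reuse the scaler fitted on the training data for the test set
scaled_X_test = scaler.transform(X_test.reshape(-1, 1))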
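
To show what the scaling step computes, here is a minimal NumPy sketch of min-max scaling to the 0 to 1 range. It is a hand-written equivalent under the default `feature_range`, not scikit-learn's implementation, and the toy ages and the helper `min_max_scale` are assumptions for illustration.

```python
import numpy as np

train_ages = np.array([13, 30, 64, 65, 100], dtype=float)  # toy training values
test_ages = np.array([20, 80], dtype=float)                # toy test values

# The minimum and maximum come from the training data only
age_min, age_max = train_ages.min(), train_ages.max()

def min_max_scale(values):
    """Map values linearly so that age_min -> 0 and age_max -> 1."""
    return (values - age_min) / (age_max - age_min)

scaled_train = min_max_scale(train_ages)
scaled_test = min_max_scale(test_ages)  # same statistics as the training set

print(scaled_train)
print(scaled_test)
```

Applying the training-set minimum and maximum to the test data keeps both sets on the same scale, which is what calling `transform` (rather than `fit_transform`) on test data achieves.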

# Building a model 

In [36]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Activation, Dense

In [29]:
# check whether any GPUs are available
physical_devices = tf.config.experimental.list_physical_devices('GPU')
print('Num of GPUs available: ', len(physical_devices))
if len(physical_devices) > 0: 
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

Num of GPUs available:  0


In [61]:
model = Sequential()

# 1st hidden layer
model.add(Dense(16, input_shape=(1,), activation='relu'))
# 2nd hidden layer
model.add(Dense(units=32, activation='relu'))
# output layer
model.add(Dense(units=2, activation='softmax'))
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 16)                32        
_________________________________________________________________
dense_13 (Dense)             (None, 32)                544       
_________________________________________________________________
dense_14 (Dense)             (None, 2)                 66        
Total params: 642
Trainable params: 642
Non-trainable params: 0
_________________________________________________________________


# Training Model

In [62]:
from tensorflow.keras.optimizers import Adam

In [63]:
model.compile(optimizer=Adam(learning_rate=0.0001),
              loss='sparse_categorical_crossentropy',
             metrics=['accuracy'])

In [64]:
model.fit(x=scaled_X_train, y=y_train, validation_split=0.1, batch_size=10, epochs=30, shuffle=True, verbose=2)

Train on 1809 samples, validate on 201 samples
Epoch 1/30
1809/1809 - 1s - loss: 0.6774 - accuracy: 0.5345 - val_loss: 0.6598 - val_accuracy: 0.6468
Epoch 2/30
1809/1809 - 0s - loss: 0.6598 - accuracy: 0.6186 - val_loss: 0.6400 - val_accuracy: 0.7015
Epoch 3/30
1809/1809 - 0s - loss: 0.6448 - accuracy: 0.6600 - val_loss: 0.6240 - val_accuracy: 0.7463
Epoch 4/30
1809/1809 - 0s - loss: 0.6311 - accuracy: 0.6799 - val_loss: 0.6088 - val_accuracy: 0.7612
Epoch 5/30
1809/1809 - 0s - loss: 0.6176 - accuracy: 0.7137 - val_loss: 0.5951 - val_accuracy: 0.7761
Epoch 6/30
1809/1809 - 0s - loss: 0.6041 - accuracy: 0.7363 - val_loss: 0.5809 - val_accuracy: 0.7761
Epoch 7/30
1809/1809 - 0s - loss: 0.5909 - accuracy: 0.7496 - val_loss: 0.5681 - val_accuracy: 0.8010
Epoch 8/30
1809/1809 - 0s - loss: 0.5780 - accuracy: 0.7689 - val_loss: 0.5557 - val_accuracy: 0.8060
Epoch 9/30
1809/1809 - 0s - loss: 0.5655 - accuracy: 0.7739 - val_loss: 0.5458 - val_accuracy: 0.8358
Epoch 10/30
1809/1809 - 0s - loss: 

<tensorflow.python.keras.callbacks.History at 0x269bcab8848>
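
To help read the loss values in the log, the sketch below computes sparse categorical cross-entropy for a single sample by hand: the negative log of the probability the model assigns to the true integer label. This is a simplified illustration, not the Keras implementation, and the probability values are made up.

```python
import math

def sparse_categorical_crossentropy(label, probabilities):
    """Loss for one sample: -log of the probability given to the true class."""
    return -math.log(probabilities[label])

# A model that cannot tell the classes apart predicts 50/50
undecided = sparse_categorical_crossentropy(1, [0.5, 0.5])
# A confident, correct prediction gives a much smaller loss
confident = sparse_categorical_crossentropy(1, [0.1, 0.9])

print(undecided, confident)
```

The undecided case gives ln 2, about 0.693, which is close to the first-epoch loss of 0.6774 in the log: at the start of training the network is barely better than a coin flip.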

# Creating Validation Data Set

We can create a validation data set in two ways. We can build a data set of (x, y) pairs, just as we built the training set, and pass it to `fit` through the `validation_data` parameter. Or we can simply pass the `validation_split` parameter to `fit`, as we did earlier. When using `validation_split`, it is important to remember that the shuffling performed by `fit` happens after the split: Keras takes the validation samples from the end of `x` and `y` before shuffling, so the data should already be shuffled when it is passed in.
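
To make the `validation_split` behaviour concrete, here is a sketch of holding out the last fraction of the samples in their original order. The function `split_for_validation` is a hypothetical helper written for this example, not Keras code; the sample count of 2010 matches the training set used above.

```python
def split_for_validation(samples, validation_split):
    """Keep the first part for training and the last part for validation."""
    split_at = int(len(samples) * (1 - validation_split))
    return samples[:split_at], samples[split_at:]

samples = list(range(2010))  # stand-in for the 2010 training samples
train_part, val_part = split_for_validation(samples, 0.1)

print(len(train_part), len(val_part))
```

The sizes 1809 and 201 agree with the "Train on 1809 samples, validate on 201 samples" line in the training output. Because the held-out part is always the tail, unshuffled data sorted by class or age would give an unrepresentative validation set.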

In [66]:
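# Scale X_test with the scaler fitted on the training data; the output below came from the raw, unscaled X_test, which is most likely why every row is [0., 1.].
model.predict(scaler.transform(X_test.reshape(-1, 1)))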

array([[0., 1.],
       [0., 1.],
       [0., 1.],
       ...,
       [0., 1.],
       [0., 1.],
       [0., 1.]], dtype=float32)
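
Each row returned by `predict` holds one probability per class, so the predicted label is the index of the largest value. The sketch below shows that step on made-up probabilities; the values are illustrative, not output from the model above.

```python
import numpy as np

# Toy softmax outputs: column 0 = no side effects, column 1 = side effects
probabilities = np.array([[0.9, 0.1],
                          [0.2, 0.8],
                          [0.4, 0.6]])

# Index of the most probable class in each row
predicted_labels = np.argmax(probabilities, axis=1)

print(predicted_labels)
```

The resulting integer labels can be compared directly with `y_test`, which uses the same 0/1 encoding.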