In [1]:
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score

# Tabular Data

Tabular data is the kind of data you see most often: data that you can cleanly write in a table, with a set number of rows and columns. For our first example below, all the data is numeric.

This is the one type of data that we will go over that is not necessarily suited to neural networks. Because it is so simple and so well studied, traditional ML can do quite well on it. 

That being said, it makes a nice springboard for the rest of the tutorial.

To make this data we will be using sklearn's `make_classification`. This will generate a dummy classification dataset:

In [2]:
dataset = make_classification(n_samples=10_000, n_features=20, n_classes=2)
X_syn, y_syn = dataset

Because we have two classes, this is binary classification: we predict either 0 or 1 from these 20 features.

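So now that we have the data, we can just throw it into a NN, right?

Well, not quite yet. A NN trains by gradient descent, which is sensitive to the scale of its inputs, so we first need to scale all the inputs. The cells below switch to the Titanic training set (`titanic_train.csv`): we clean it, one-hot encode it, fit a random forest as a baseline, and then scale the inputs: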

In [3]:
df = pd.read_csv('titanic_train.csv')
# Drop the columns that are not used as features.
df = df.drop(columns=['Name', 'Ticket', 'PassengerId'])
df.shape

(891, 9)

In [4]:
# check n/a
df.isna().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [5]:
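# Fill missing categorical values: the most frequent port for Embarked, a placeholder label for Cabin.
# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about.
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Cabin'] = df['Cabin'].fillna('cabin_unkown')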

In [6]:
numerical_cols = df.select_dtypes([np.number]).columns
cat_cols = list(set(df.columns)-set(numerical_cols))

In [7]:
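# Impute missing Age values with a k-nearest-neighbours imputer fitted on the numeric columns.
# Note: numerical_cols still contains the target column 'Survived' at this point.
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
knn_predictions = imputer.fit_transform(df[numerical_cols])

# Look up Age within numerical_cols (the imputer's input), not within df.columns.
age_pos = list(numerical_cols).index('Age')
df['Age'] = knn_predictions[:, age_pos]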

In [8]:
df.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin',
       'Embarked'],
      dtype='object')

In [9]:
X = df.drop(['Survived'], axis=1)
y = df['Survived']

In [10]:
cat_cols

['Cabin', 'Sex', 'Embarked']

In [11]:
X = pd.get_dummies(X, columns=cat_cols)

In [12]:
X.head()

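|   | Pclass | Age | SibSp | Parch | Fare | Cabin_A10 | Cabin_A14 | Cabin_A16 | Cabin_A19 | Cabin_A20 | ... | Cabin_F38 | Cabin_F4 | Cabin_G6 | Cabin_T | Cabin_cabin_unkown | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 3 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |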


In [13]:
y.shape, X.shape

((891,), (891, 158))

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=42)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train,y_train)
preds = clf.predict(X_test)

f1 = f1_score(y_test, preds)
acc = accuracy_score(y_test, preds)  # preds are already 0/1 class labels
print("f1 score:", f1)
print("Accuracy score:", acc)

f1 score: 0.7659574468085106
Accuracy score: 0.8156424581005587


In [15]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

standardized_x = ss.fit_transform(X)

Perfect, now we can just throw it into a NN :) 

Yup, for this data there is not much else to do but build the NN.

In [16]:
import tensorflow as tf

# dropout probability
p = .1

We are going to be using keras to build our NN. Because this is tabular data, the NN can follow a fairly simple structure:

1. Standardize/Normalize
2. (Optional) Regularize/Dropout
3. Apply a Dense Layer

Let me talk about the first and the last.

In LR (linear or logistic regression), features on very different scales give coefficients on very different scales, which creates issues; that's why we standardize.

Standardizing is important because of the way that NNs train by using gradient descent. If a particular layer's input is too big, then the gradients might be massive and the training process goes out of whack.
Standardization is a monotonic function of each feature: it subtracts the mean and divides by the standard deviation, so the order of the values is preserved.
In mathematics, a monotonic function is a function between ordered sets that preserves or reverses the given order.
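
To make the standardization step concrete, here is a minimal NumPy sketch of the column-wise z-score. The `features` array and the `standardize` helper are made up for illustration; the population standard deviation (`ddof=0`) is used because that is the convention `StandardScaler` follows.

```python
import numpy as np

# Hypothetical toy feature matrix: two columns on very different scales.
features = np.array([[1.0, 1000.0],
                     [2.0, 3000.0],
                     [3.0, 2000.0],
                     [4.0, 4000.0]])

def standardize(matrix):
    """Column-wise z-score: subtract the mean, divide by the standard deviation."""
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)  # population std (ddof=0)
    return (matrix - mean) / std

scaled = standardize(features)

# Each column now has mean 0 and standard deviation 1 ...
print(scaled.mean(axis=0), scaled.std(axis=0))
# ... and the ordering of the rows within each column is unchanged.
print(np.array_equal(np.argsort(features, axis=0), np.argsort(scaled, axis=0)))
```

Both columns end up on the same scale, while the ranking of the values inside each column stays exactly as it was.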

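The dense layer is the core of the NN. It applies a linear (affine) map followed by a non-linear activation, and stacking such layers lets the NN approximate a very wide class of non-linear functions. Without the non-linearity, the stacked layers would collapse into a single linear model, and the NN couldn't learn anything a linear model can't.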
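
As a rough illustration of what a dense layer computes, here is a NumPy sketch. The helper `dense_relu` and the random weights are assumptions made for the example; this is not how keras implements the layer internally.

```python
import numpy as np

def dense_relu(inputs, weights, bias):
    """One dense layer: an affine map followed by an element-wise ReLU."""
    return np.maximum(inputs @ weights + bias, 0.0)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 158))           # 4 rows, 158 features (the width of X)
weights = rng.normal(size=(158, 30)) * 0.1  # 158 inputs -> 30 units
bias = np.zeros(30)

hidden = dense_relu(batch, weights, bias)
print(hidden.shape)  # (4, 30)

# Without an activation, two stacked layers are just one linear map:
w1 = rng.normal(size=(158, 30))
w2 = rng.normal(size=(30, 5))
stacked = (batch @ w1) @ w2
collapsed = batch @ (w1 @ w2)
print(np.allclose(stacked, collapsed))  # True
```

The second half shows why the activation matters: remove it, and any number of layers is equivalent to a single matrix multiplication.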

Dropout is a simple way of regularizing NNs. The reason I put this as optional, is that there is some debate on whether you need dropout in addition to batch normalization.

Ultimately you can experiment with the amount of dropout you need in your network, and if it's none, so be it.
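
For intuition, dropout can be sketched in a few lines of NumPy. This is a simplified version of the usual "inverted dropout" scheme (zero a random fraction of the activations during training and rescale the rest); the `dropout` function below is a hypothetical helper, not the keras layer itself.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero out roughly `rate` of the entries and rescale the survivors."""
    if not training or rate == 0.0:
        return activations  # dropout is a no-op at prediction time
    keep_mask = rng.random(activations.shape) >= rate
    # Dividing by (1 - rate) keeps the expected value of each activation unchanged.
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 50))
dropped = dropout(activations, rate=0.1, rng=rng)

print((dropped == 0).mean())  # fraction of zeroed entries, close to 0.1
print(dropped.mean())         # close to 1.0 thanks to the rescaling
```

With `rate=0.1` (the `p` used in this notebook), about one activation in ten is silenced on each training step.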

---

So, all that being said, below is our first NN.

In [17]:
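# Take the input width from X (158 columns after one-hot encoding) instead of hard-coding it.
inputs = tf.keras.layers.Input((X.shape[1],), name='numeric_inputs')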

In [32]:
x = tf.keras.layers.Dropout(p)(inputs)
x = tf.keras.layers.Dense(30, activation='relu')(x)
#x = tf.keras.layers.BatchNormalization()(x)

x = tf.keras.layers.Dense(50, activation='relu')(x)
#x = tf.keras.layers.BatchNormalization()(x)

x = tf.keras.layers.Dense(50, activation='relu')(x)
#x = tf.keras.layers.BatchNormalization()(x)

x = tf.keras.layers.Dense(30, activation='relu')(x)
#x = tf.keras.layers.BatchNormalization()(x)

out = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(x)

Now there are probably a couple of questions as to the above:

* Why so many layers?
* Why so many neurons in each layer?

Well, a good rule of thumb is that your NN can have as many params as the number of data points that you have. The above NN has 10,431 params against only 891 rows, so by that rule it is heavily over-parameterized, and we could probably decrease the number of parameters or regularize more.
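
The parameter count of a stack of dense layers is easy to check by hand: each layer has `n_in * n_out` weights plus `n_out` biases. A small sketch, where `dense_param_count` is a hypothetical helper and the widths are the ones used in the network above:

```python
def dense_param_count(layer_widths):
    """Total weights and biases for consecutive fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_widths, layer_widths[1:]))

# Input width followed by the width of every Dense layer, including the output.
widths = [158, 30, 50, 50, 30, 1]
total = dense_param_count(widths)
print(total)  # 10431
```

Dropout layers add no parameters, so they are left out of `widths`.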

As for the width vs. the depth of the network, there are plenty of results on either side of the aisle, and honestly I'm not sure what to tell you other than to experiment.

Some things you might want to keep in mind are:

* Skip connections seem to be pretty cool
* Alternating small and large layers might be a thing too
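
To show what a skip connection means, here is a minimal NumPy sketch of a residual block: the layer's output is added back onto its input. The helper names and the width of 30 are assumptions made for the example, and the network in this notebook does not use skip connections.

```python
import numpy as np

def dense_relu(inputs, weights, bias):
    """An affine map followed by an element-wise ReLU."""
    return np.maximum(inputs @ weights + bias, 0.0)

def residual_block(inputs, weights, bias):
    """Dense + ReLU with a skip connection; the layer must keep the width unchanged."""
    return inputs + dense_relu(inputs, weights, bias)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 30))        # 4 rows of 30 activations
weights = rng.normal(size=(30, 30)) * 0.1
bias = np.zeros(30)

out = residual_block(hidden, weights, bias)
print(out.shape)  # (4, 30)

# With all-zero weights the block passes its input through untouched,
# which is part of why deep stacks of such blocks tend to be easier to train.
identity_out = residual_block(hidden, np.zeros((30, 30)), bias)
print(np.allclose(identity_out, hidden))  # True
```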

In [33]:
model = tf.keras.models.Model(inputs=inputs, outputs=out)
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [34]:
model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
numeric_inputs (InputLayer)  [(None, 158)]             0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 158)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 30)                4770      
_________________________________________________________________
dense_5 (Dense)              (None, 50)                1550      
_________________________________________________________________
dense_6 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_7 (Dense)              (None, 30)                1530      
_________________________________________________________________
output (Dense)               (None, 1)                 31  

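As a final amendment to our data, I always like to feed the NN from a generator (keras's `fit_generator` function, or plain `fit` in TensorFlow 2.1+, where `fit_generator` is deprecated), so I will often make a generator to feed data to the NN instead of passing arrays to the default fit function.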

In [35]:
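import numpy as np

# Plain NumPy labels, so positional indexing with batch_idx is unambiguous.
y_array = y.to_numpy()

def bootstrap_sample_generator(batch_size):
    """Yield random batches drawn with replacement, forever."""
    # Note: standardized_x covers every row of X, including the rows held out in X_test.
    while True:
        batch_idx = np.random.choice(
            standardized_x.shape[0], batch_size)
        yield ({'numeric_inputs': standardized_x[batch_idx]},
               {'output': y_array[batch_idx]})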

In [36]:
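batch_size = 64

# In TensorFlow 2.1+ `fit` accepts generators directly; `fit_generator` is deprecated.
model.fit(
    bootstrap_sample_generator(batch_size),
    # 10_000 is the size of the synthetic dataset; X has 891 rows, so each epoch
    # draws roughly 11 times that many samples (with replacement).
    steps_per_epoch=10_000 // batch_size,
    epochs=10,
)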

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fd30534fa10>

In [37]:
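# The model was trained on standardized inputs, so X_test has to go through the same scaler.
# (The scores recorded below came from a run that passed the unscaled X_test.)
pred = model.predict(ss.transform(X_test))
f1 = f1_score(y_test, pred.round())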

print(f"f1 score:",f1)

f1 score: 0.6567164179104478


In [38]:
accuracy_score(y_test, pred.round())

0.7430167597765364

In [26]:
def count_elements(array):
    (unique, counts) = np.unique(array.round(), return_counts=True)
    frequencies = np.asarray((unique, counts)).T
    return frequencies

In [27]:
count_elements(pred)

array([[  0., 140.],
       [  1.,  39.]])

In [28]:
count_elements(y_test)

array([[  0, 105],
       [  1,  74]])