# Neural Network and Decision Tree Analysis

I will be practicing supervised learning techniques gained from the machine learning course. While insurance.csv dataset is used throughout the process, it is to be noted that I do not intend to educe any meaningful outcome from the data. The data is solely used to implement the machine learning techniques. 

In [6]:
import pandas as pd
import numpy as np
import math
# importing packages necessary to implement neural network
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Sequential
from tensorflow.keras.losses import MeanSquaredError, BinaryCrossentropy
from tensorflow.keras.activations import sigmoid

In [7]:
df = pd.read_csv("Datasets/insurance.csv")
df.head

<bound method NDFrame.head of       age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

[1338 rows x 7 columns]>

The dataset consists of 1338 examples with 7 columns. The columns are:

| Features     | Data     |
| ----------- | ----------- |
| age    | 18-64   |
| sex    | female/male    |
| bmi    | 16.0-53.1   |
| children    | 0-5    |
| smoker    | yes/no    |
| region    | SE/SW/NE/NW    |
| charges    | 1120-63800    |

# Neural Network Algorithm

Neural network is the algorithm that try to mimic the brain. Its composed of the input layer, hidden layer(s), and the output layer. It's to be noted that the input layer has to be composed of only numerical data. We will go through the process to turn non-numeric data to numerical data, applying one hot encoding where necessary. One-hot encoding refers to splitting up categorical data into the number of categories, making each category a binary data. 

In [8]:
#The following code convers 'male' to 1 'female' to 0
df_encoded = pd.get_dummies(df, columns=['sex'], dtype=int, drop_first=True)

#The following code convers 'yes' to 1 'no' to 0
df_encoded = pd.get_dummies(df_encoded, columns=['smoker'], dtype=int, drop_first=True)

#The following code implements one-hot encoding on 'region' feature
df_encoded = pd.get_dummies(df_encoded, columns=['region',], dtype=int)
df_encoded.head

<bound method NDFrame.head of       age     bmi  children      charges  sex_male  smoker_yes  \
0      19  27.900         0  16884.92400         0           1   
1      18  33.770         1   1725.55230         1           0   
2      28  33.000         3   4449.46200         1           0   
3      33  22.705         0  21984.47061         1           0   
4      32  28.880         0   3866.85520         1           0   
...   ...     ...       ...          ...       ...         ...   
1333   50  30.970         3  10600.54830         1           0   
1334   18  31.920         0   2205.98080         0           0   
1335   18  36.850         0   1629.83350         0           0   
1336   21  25.800         0   2007.94500         0           0   
1337   61  29.070         0  29141.36030         0           1   

      region_northeast  region_northwest  region_southeast  region_southwest  
0                    0                 0                 0                 1  
1                  

I will implement the neural network algorithm to calculate the probability of a specific person being a smoker. Hence I need to drop the column and store in a different array. 

In [9]:
smoker = df_encoded['smoker_yes'].to_numpy()
print(smoker.shape)

(1338,)


Before proceeding any further, I will start by scaling every feature by z-score normalization. 

In [10]:
#The code below implements z-score normalization on all the features of df_encoded
df_encoded_drop = df_encoded.drop('smoker_yes', axis=1)
df_z_scaled = df_encoded_drop.copy()

for column in df_z_scaled.columns:
    df_z_scaled[column] = (df_z_scaled[column]-df_z_scaled[column].mean()) / df_z_scaled[column].std()
    
df_z_scaled.head



<bound method NDFrame.head of            age       bmi  children   charges  sex_male  region_northeast  \
0    -1.438227 -0.453151 -0.908274  0.298472 -1.010141         -0.565056   
1    -1.509401  0.509431 -0.078738 -0.953333  0.989221         -0.565056   
2    -0.797655  0.383164  1.580335 -0.728402  0.989221         -0.565056   
3    -0.441782 -1.305043 -0.908274  0.719574  0.989221         -0.565056   
4    -0.512957 -0.292447 -0.908274 -0.776512  0.989221         -0.565056   
...        ...       ...       ...       ...       ...               ...   
1333  0.768185  0.050278  1.580335 -0.220468  0.989221         -0.565056   
1334 -1.509401  0.206062 -0.908274 -0.913661 -1.010141          1.768415   
1335 -1.509401  1.014499 -0.908274 -0.961237 -1.010141         -0.565056   
1336 -1.295877 -0.797515 -0.908274 -0.930014 -1.010141         -0.565056   
1337  1.551106 -0.261290 -0.908274  1.310563 -1.010141         -0.565056   

      region_northwest  region_southeast  region_southwes

In [11]:
#creating nparray of all necessary features
x_train = df_z_scaled[['age', 'bmi', 'children', 'charges', 'sex_male', 'region_northeast', 'region_northwest', 
                      'region_southeast', 'region_southwest']].to_numpy()
x_train.shape

(1338, 9)

Each neuron has an activation. The activation can be linear, sigmoid or relu. Since the neural network aims to predict the chances of a person being a smoker, we will use sigmoid activation. 

In [15]:
model = Sequential(
    [
        tf.keras.Input(shape=(9,)),
        #Create a dense layer with sigmoid activation, which is good for binary situations.
        Dense(3, activation='sigmoid', name='layer1'),
        Dense(1, activation='sigmoid', name='layer2')
    ]
)

In [10]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 layer1 (Dense)              (None, 3)                 30        
                                                                 
 layer2 (Dense)              (None, 1)                 4         
                                                                 
Total params: 34
Trainable params: 34
Non-trainable params: 0
_________________________________________________________________


In [16]:
#Describes the random biases and weights Tensorflow has initiated
W1, b1 = model.get_layer("layer1").get_weights()
W2, b2 = model.get_layer("layer2").get_weights()
print(f"W1{W1.shape}:\n", W1, f"\nb1{b1.shape}:", b1)
print(f"W2{W2.shape}:\n", W2, f"\nb2{b2.shape}:", b2)

W1(9, 3):
 [[-0.0243066  -0.6859954   0.5371651 ]
 [-0.46246472 -0.33147198  0.54098004]
 [ 0.322226   -0.69254774 -0.66457164]
 [-0.06176066  0.36415774  0.04277718]
 [ 0.35824233 -0.3874821  -0.5155927 ]
 [-0.6085216   0.6997232   0.3565182 ]
 [-0.00584257  0.42300636  0.40524262]
 [-0.16720092  0.31298     0.66041654]
 [-0.5707931  -0.4807679  -0.70085335]] 
b1(3,): [0. 0. 0.]
W2(3, 1):
 [[ 0.2804159]
 [-0.6430382]
 [ 0.7749101]] 
b2(1,): [0.]


In [18]:
model.compile(
    loss = tf.keras.losses.BinaryCrossentropy(),
    optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=0.01),
)
#epochs means the entire data set should be applied during training 10 times.
model.fit(
    x_train,smoker,            
    epochs=10,
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x29b096850>

In [19]:
#After fitting, the weights have been updated
W1, b1 = model.get_layer("layer1").get_weights()
W2, b2 = model.get_layer("layer2").get_weights()
print(f"W1{W1.shape}:\n", W1, f"\nb1{b1.shape}:", b1)
print(f"W2{W2.shape}:\n", W2, f"\nb2{b2.shape}:", b2)

W1(9, 3):
 [[ 0.97016275  0.85319906 -0.9603485 ]
 [ 1.4447821   1.4506173  -1.4787594 ]
 [ 0.16927463  0.22694588 -0.1928326 ]
 [-3.2898934  -2.8743703   3.203796  ]
 [-0.0976601  -0.22181818  0.13050695]
 [-0.2908088   0.31043464  0.10585709]
 [-0.35587466  0.26202214  0.16500361]
 [-0.40883565  0.13569534  0.23385766]
 [-0.40078142  0.2355054   0.21645196]] 
b1(3,): [ 1.4255282  1.3723803 -1.46342  ]
W2(3, 1):
 [[-3.2533317]
 [-4.443698 ]
 [ 2.9462836]] 
b2(1,): [-0.23018911]


## Predictions
Since now we have a trained model, we can use it to make predictions. Since this model predicts a probability, in order to make decision there has to be a threshold. We will set 0.5 as the threshold. 

In [58]:
X_test = np.array([
    [18, 33.770, 1, 1725.55230, 1, 0, 1, 0, 0],  # neg example
    [19, 27.900, 0, 16884.92400, 0, 0, 0, 1, 0]])   # pos example
print(X_test.shape)
col_means = np.mean(df_encoded_drop, axis=0)
col_means = col_means.values.reshape(1,-1)
print(col_means.shape)
col_std = np.std(df_encoded_drop, axis=0)
col_std = col_std.values.reshape(1,-1)
X_testn =(X_test - col_means) / col_std


predictions = model.predict(X_testn)
print("predictions = \n", predictions)

(2, 9)
(1, 9)
predictions = 
 [[4.8179663e-04]
 [7.3478907e-01]]


To convert the probabilities to a decision, we apply a threshold:

In [60]:
yhat = (predictions >= 0.5).astype(int)
print(f"decisions = \n{yhat}")

decisions = 
[[0]
 [1]]


It can be seen that the neural network was able to predict the probabilities of a person being a smoker. Now what if what I'm trying to guess is not binary, but categorical? This can be achieved by setting the activation as softmax function. For the given data, I will implement softmax function to predict the number of kids a person has based on features such as insurance cost, age, etc. 

# Multiclass Neural Network

In [3]:
#reading csv file and storing to a new variable
df2 = pd.read_csv("Datasets/insurance.csv")
df2.head

<bound method NDFrame.head of       age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

[1338 rows x 7 columns]>

In [5]:
#The following code convers 'male' to 1 'female' to 0
df2_encoded = pd.get_dummies(df2, columns=['sex'], dtype=int, drop_first=True)

#The following code convers 'yes' to 1 'no' to 0
df2_encoded = pd.get_dummies(df2_encoded, columns=['smoker'], dtype=int, drop_first=True)

#The following code implements one-hot encoding on 'region' feature
df2_encoded = pd.get_dummies(df2_encoded, columns=['region',], dtype=int)
df2_encoded.head

<bound method NDFrame.head of       age     bmi  children      charges  sex_male  smoker_yes  \
0      19  27.900         0  16884.92400         0           1   
1      18  33.770         1   1725.55230         1           0   
2      28  33.000         3   4449.46200         1           0   
3      33  22.705         0  21984.47061         1           0   
4      32  28.880         0   3866.85520         1           0   
...   ...     ...       ...          ...       ...         ...   
1333   50  30.970         3  10600.54830         1           0   
1334   18  31.920         0   2205.98080         0           0   
1335   18  36.850         0   1629.83350         0           0   
1336   21  25.800         0   2007.94500         0           0   
1337   61  29.070         0  29141.36030         0           1   

      region_northeast  region_northwest  region_southeast  region_southwest  
0                    0                 0                 0                 1  
1                  

In [6]:
#storing 'children' column to y_train
y_train = df2_encoded['children'].to_numpy()

In [7]:
#dropping 'children' column, then implementing z-score normalization on df2_encoded
df2_encoded_drop = df2_encoded.drop('children', axis=1)
df2_z_scaled = df2_encoded_drop.copy()

for column in df2_z_scaled.columns:
    df2_z_scaled[column] = (df2_z_scaled[column]-df2_z_scaled[column].mean()) / df2_z_scaled[column].std()
    
df2_z_scaled.head



<bound method NDFrame.head of            age       bmi   charges  sex_male  smoker_yes  region_northeast  \
0    -1.438227 -0.453151  0.298472 -1.010141    1.969850         -0.565056   
1    -1.509401  0.509431 -0.953333  0.989221   -0.507273         -0.565056   
2    -0.797655  0.383164 -0.728402  0.989221   -0.507273         -0.565056   
3    -0.441782 -1.305043  0.719574  0.989221   -0.507273         -0.565056   
4    -0.512957 -0.292447 -0.776512  0.989221   -0.507273         -0.565056   
...        ...       ...       ...       ...         ...               ...   
1333  0.768185  0.050278 -0.220468  0.989221   -0.507273         -0.565056   
1334 -1.509401  0.206062 -0.913661 -1.010141   -0.507273          1.768415   
1335 -1.509401  1.014499 -0.961237 -1.010141   -0.507273         -0.565056   
1336 -1.295877 -0.797515 -0.930014 -1.010141   -0.507273         -0.565056   
1337  1.551106 -0.261290  1.310563 -1.010141    1.969850         -0.565056   

      region_northwest  region_so

In [8]:
#Creating and storing all features to nparray
x_train = df2_z_scaled[['age', 'bmi', 'charges', 'sex_male', 'smoker_yes', 
                        'region_northeast', 'region_northwest', 'region_southeast', 'region_southwest']].to_numpy()
x_train.shape

(1338, 9)

In [25]:
model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(6, activation = 'linear')    # < softmax activation here
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.legacy.Adam(0.001),
)

model.fit(
    x_train,y_train,
    epochs=10
)
        

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2a225bc10>

In [26]:
p_preferred = model.predict(x_train)
print(f"two example output vectors:\n {p_preferred[:2]}")
print("largest value", np.max(p_preferred), "smallest value", np.min(p_preferred))

two example output vectors:
 [[ 2.5756073   1.140963    0.821749    0.5840602  -1.5949247  -2.1311479 ]
 [ 2.05817     1.365968    0.56684935  0.48262838 -1.6258609  -1.4897108 ]]
largest value 2.8685749 smallest value -3.3586311


The example output vectors are not probabilities. The output must be sent through a softmax function when performing prediction that expects a probability.

In [27]:
sm_preferred = tf.nn.softmax(p_preferred).numpy()
print(f"two example output vectors:\n {sm_preferred[:2]}")
print("largest value", np.max(sm_preferred), "smallest value", np.min(sm_preferred))

two example output vectors:
 [[0.6360243  0.15150103 0.11009884 0.08680721 0.00982278 0.00574587]
 [0.50343585 0.25195596 0.11331093 0.10415858 0.01264707 0.01449169]]
largest value 0.63833654 smallest value 0.0012424922


In [28]:
#to select the most likely category, softmax is not required. 
for i in range(5):
    print( f"{p_preferred[i]}, category: {np.argmax(p_preferred[i])}")

[ 2.5756073  1.140963   0.821749   0.5840602 -1.5949247 -2.1311479], category: 0
[ 2.05817     1.365968    0.56684935  0.48262838 -1.6258609  -1.4897108 ], category: 0
[ 1.8462933   1.4596443   0.82461405  0.36059633 -1.6800543  -1.2358103 ], category: 0
[ 1.4287852   1.1868529   1.3219203   0.44475603 -1.166343   -1.5011184 ], category: 0
[ 1.5756184   1.364863    0.96450603  0.3092387  -0.9414451  -1.342018  ], category: 0


It can be seen that the neural network predicts the first 5 examples of the x_train to be all in category 0. In other words, the network predicts, given features stored in x_train, first 5 examples to have 0 child. But we know that is not certainly true. 3rd person has 3 children, 2nd person 1 child. This may suggest that the neural network is not accurate, perhaps we can logically deduce that the features stored in x_train such as age, bmi, insurance cost, and smoker were not very good predictors for assuming the number of children.

# Model evaluation and selection

How do we know, if the architecture for the 'model' neural network is optimal for calculating the probabilities of a person being a smoker, or predicting the number of children a person has as seen in previous examples? In order to answer such critical questions, we have to be able to evaluate a model or an architecture. Here I aim to implement code to achieve such result. I will use 'x_train' and 'smoker' data used in predicting the probability of a person being a smoker to conduct the model evaluation.

In [12]:
print(f"the shape of the inputs x is: {x_train.shape}")
print(f"the shape of the targets y is: {smoker.shape}")

the shape of the inputs x is: (1338, 9)
the shape of the targets y is: (1338,)


In [13]:
#This process splits the data into the training, cross validation, and test sets.

# Get 60% of the dataset as the training set. Put the remaining 40% in temporary variables.
x_bc_train, x_, y_bc_train, y_ = train_test_split(x_train, smoker, test_size=0.40, random_state=1)

# Split the 40% subset above into two: one half for cross validation and the other for the test set
x_bc_cv, x_bc_test, y_bc_cv, y_bc_test = train_test_split(x_, y_, test_size=0.50, random_state=1)

# Delete temporary variables
del x_, y_

print(f"the shape of the training set (input) is: {x_bc_train.shape}")
print(f"the shape of the training set (target) is: {y_bc_train.shape}\n")
print(f"the shape of the cross validation set (input) is: {x_bc_cv.shape}")
print(f"the shape of the cross validation set (target) is: {y_bc_cv.shape}\n")
print(f"the shape of the test set (input) is: {x_bc_test.shape}")
print(f"the shape of the test set (target) is: {y_bc_test.shape}")

the shape of the training set (input) is: (802, 9)
the shape of the training set (target) is: (802,)

the shape of the cross validation set (input) is: (268, 9)
the shape of the cross validation set (target) is: (268,)

the shape of the test set (input) is: (268, 9)
the shape of the test set (target) is: (268,)


The evaluation of the error for classification model will be measured by getting the fraction of the data the model has misclassified. 

In [18]:
def build_models():
    
    model_1 = Sequential(
        [
            Dense(25, activation = 'relu'),
            Dense(15, activation = 'relu'),
            Dense(1, activation = 'linear')
        ],
        name='model_1'
    )

    model_2 = Sequential(
        [
            Dense(20, activation = 'relu'),
            Dense(12, activation = 'relu'),
            Dense(12, activation = 'relu'),
            Dense(20, activation = 'relu'),
            Dense(1, activation = 'linear')
        ],
        name='model_2'
    )

    model_3 = Sequential(
        [
            Dense(32, activation = 'relu'),
            Dense(16, activation = 'relu'),
            Dense(8, activation = 'relu'),
            Dense(4, activation = 'relu'),
            Dense(12, activation = 'relu'),
            Dense(1, activation = 'linear')
        ],
        name='model_3'
    )
    
    model_list = [model_1, model_2, model_3]
    
    return model_list

In [20]:
# Initialize lists that will contain the errors for each model
nn_train_error = []
nn_cv_error = []

# Build the models
models_bc = build_models()

# Loop over each model
for model in models_bc:
    
    # Setup the loss and optimizer
    model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.legacy.Adam(learning_rate=0.01),
    )

    print(f"Training {model.name}...")

    # Train the model
    model.fit(
        x_bc_train, y_bc_train,
        epochs=200,
        verbose=0
    )
    
    print("Done!\n")

    # Set the threshold for classification
    threshold = 0.5
    
    # Record the fraction of misclassified examples for the training set
    yhat = model.predict(x_bc_train)
    yhat = tf.math.sigmoid(yhat)
    yhat = np.where(yhat >= threshold, 1, 0)
    train_error = np.mean(yhat != y_bc_train)
    nn_train_error.append(train_error)

    # Record the fraction of misclassified examples for the cross validation set
    yhat = model.predict(x_bc_cv)
    yhat = tf.math.sigmoid(yhat)
    yhat = np.where(yhat >= threshold, 1, 0)
    cv_error = np.mean(yhat != y_bc_cv)
    nn_cv_error.append(cv_error)

# Print the result
for model_num in range(len(nn_train_error)):
    print(
        f"Model {model_num+1}: Training Set Classification Error: {nn_train_error[model_num]:.5f}, " +
        f"CV Set Classification Error: {nn_cv_error[model_num]:.5f}"
        )

Training model_1...
Done!

Training model_2...
Done!

Training model_3...
Done!

Model 1: Training Set Classification Error: 0.32828, CV Set Classification Error: 0.31961
Model 2: Training Set Classification Error: 0.33486, CV Set Classification Error: 0.33560
Model 3: Training Set Classification Error: 0.33705, CV Set Classification Error: 0.33103


In [22]:
# Select the model with the lowest error
model_num = 1

# Compute the test error
yhat = models_bc[model_num-1].predict(x_bc_test)
yhat = tf.math.sigmoid(yhat)
yhat = np.where(yhat >= threshold, 1, 0)
nn_test_error = np.mean(yhat != y_bc_test)

print(f"Selected Model: {model_num}")
print(f"Training Set Classification Error: {nn_train_error[model_num-1]:.4f}")
print(f"CV Set Classification Error: {nn_cv_error[model_num-1]:.4f}")
print(f"Test Set Classification Error: {nn_test_error:.4f}")

Selected Model: 1
Training Set Classification Error: 0.3283
CV Set Classification Error: 0.3196
Test Set Classification Error: 0.3349


This allows a determine which of the 3 models are able to predict the outcome with the smallest error. 