# Healthcare data: Vitamin D and Osteoporosis



**1. Our data (`healthTrain.csv`, `healthTest.csv`) contain 5 variables: Gender (column A; 1 for male, 0 for female), RIDAGEYR (column B), vitamin (column C), Calcium (column D), and Osteop (column E). Our goal is to build a neural network that predicts the binary Osteop variable (1 if osteoporosis; otherwise 0) from the other four variables (Gender, RIDAGEYR, vitamin, Calcium). You should solve the problem based on `HW4.ipynb`, which I uploaded to LearnUs. In `HW4.ipynb`, I loaded `healthTrain.csv` to fit our model later. I also added a normalization layer to improve the performance of the neural network.**

**(a) Search for the definitions of two functions:**
- `tf.keras.layers.Dense`
- `tf.keras.layers.Dropout`

**Explain (1) the role of these functions and (2) the arguments (options) of these functions. You can take a look at the descriptions at https://www.tensorflow.org/.
But do not simply copy-paste the descriptions. You should paraphrase them. (25 points)**

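(1) Role
- `tf.keras.layers.Dense`: adds a densely connected (fully connected) layer, in which every output node is connected to every input.
- `tf.keras.layers.Dropout`: adds a layer that randomly drops (sets to zero) a fraction of its inputs during training, in order to reduce overfitting.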

(2) Arguments

- `tf.keras.layers.Dense` <br>
 *units*: dimensionality of the output (the number of output nodes) <br>
 *activation*: activation function to use; if none is given, no activation is applied <br>
 *use_bias*: whether the layer uses a bias vector <br>
 *kernel_initializer*: initializer for the weights matrix <br>
 *bias_initializer*: initializer for the bias vector <br>
 *kernel_regularizer*: regularizer applied to the weights matrix <br>
 *bias_regularizer*: regularizer applied to the bias vector <br>
 *activity_regularizer*: regularizer applied to the layer's output (its activation) <br>
 *kernel_constraint*: constraint applied to the weights matrix <br>
 *bias_constraint*: constraint applied to the bias vector <br>
<br>
- `tf.keras.layers.Dropout` <br>
 *rate*: fraction of the input units to drop (a float between 0 and 1) <br>
 *noise_shape*: shape of the binary dropout mask that is multiplied with the input; it must be compatible with the input shape <br>
 *seed*: integer used as the random seed
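
To make the argument descriptions above concrete, here is a minimal NumPy sketch of what the two layers compute. It is not the TensorFlow implementation, and the helper names `dense_forward` and `dropout_forward` are hypothetical. It assumes that a Dense layer computes `activation(x @ kernel + bias)` and that Dropout, in training mode, zeroes each input with probability `rate` and scales the surviving inputs by `1 / (1 - rate)`.

```python
import numpy as np

def dense_forward(x, kernel, bias=None, activation=None):
    """Hypothetical helper: output = activation(x @ kernel + bias)."""
    out = x @ kernel                        # (batch, n_in) @ (n_in, units) -> (batch, units)
    if bias is not None:                    # corresponds to use_bias=True
        out = out + bias
    return activation(out) if activation is not None else out

def dropout_forward(x, rate, rng, training=True):
    """Hypothetical helper: dropout as applied during training."""
    if not training or rate == 0.0:
        return x                            # dropout does nothing at inference time
    keep_mask = rng.random(x.shape) >= rate # each entry is kept with probability 1 - rate
    return x * keep_mask / (1.0 - rate)     # rescale the surviving entries

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))                 # 3 samples, 4 features (as in this dataset)
kernel = rng.normal(size=(4, 100))          # units=100
bias = np.zeros(100)

hidden = dense_forward(x, kernel, bias, activation=relu)
dropped = dropout_forward(hidden, rate=0.4, rng=rng)
print(hidden.shape, dropped.shape)
```

With `rate=0.4`, roughly 40% of the entries of `hidden` are set to zero in `dropped`, and the rest are divided by 0.6.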

**(b) In `HW4.ipynb`, I wrote the skeleton of the neural network using `tf.keras.models.Sequential`. Feel free to use this to build your own neural network. You can choose the number of nodes and the types of activation functions, but you have to use the sigmoid activation function at the last layer for binary classification. You may add more hidden layers if you want. Using the `model.summary` function, report the structure of your neural network. (25 points)**

In [1]:
# Importing the libraries
from numpy import loadtxt
import numpy as np
import tensorflow as tf
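# Import from tensorflow.keras throughout, to avoid mixing it with the standalone keras package
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout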
from tensorflow.keras.layers.experimental import preprocessing

In [2]:
# Loading the dataset
dataset_train = loadtxt('healthTrain.csv', delimiter=',')

# Splitting the dataset into input(X) and output(y) variables
x_train = dataset_train[:,0:4]
y_train = dataset_train[:,4]
print(x_train.shape)
print(y_train.shape)

(7000, 4)
(7000,)


In [3]:
# Normalization
normalizer = preprocessing.Normalization()
normalizer.adapt(np.array(x_train))
print(normalizer.mean.numpy())

[ 0.51271427 50.356857   60.863575    9.411029  ]
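
The printed vector is the per-column mean that `adapt` learned from the training data. As a rough sketch of what the normalization layer then does with it, the NumPy code below standardizes each column as `(x - mean) / sqrt(var)`. The helper names `adapt_stats` and `normalize` are hypothetical, and the toy rows are made up for illustration; they are not taken from `healthTrain.csv`.

```python
import numpy as np

def adapt_stats(x):
    """Hypothetical helper: per-feature mean and variance over the sample axis."""
    return x.mean(axis=0), x.var(axis=0)

def normalize(x, mean, var):
    """Standardize each column: (x - mean) / sqrt(var)."""
    return (x - mean) / np.sqrt(var)

# Toy rows in the same column order as the data (Gender, RIDAGEYR, vitamin, Calcium).
# The values are invented for this example.
x_toy = np.array([
    [1.0, 35.0, 55.0, 9.2],
    [0.0, 62.0, 70.5, 9.6],
    [1.0, 48.0, 61.0, 9.4],
    [0.0, 71.0, 49.5, 9.1],
])

mean, var = adapt_stats(x_toy)
x_scaled = normalize(x_toy, mean, var)
print(x_scaled.mean(axis=0))   # approximately 0 for every column
print(x_scaled.std(axis=0))    # approximately 1 for every column
```

Putting all four inputs on a comparable scale keeps a large-valued column such as RIDAGEYR from dominating a 0/1 column such as Gender.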


In [4]:
# Constructing the model
model = tf.keras.models.Sequential([
    normalizer,
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(1, activation='sigmoid') # for binary classification
])

The model has two fully connected layers with 100 nodes each and ReLU activation, followed by one dropout layer that drops units at a rate of 0.4. <br>
The last layer must have an output dimension of 1, because it has to produce a single value that is passed through the sigmoid function.
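
As a small illustration of why a single sigmoid output suits binary classification, the sketch below shows how the sigmoid function maps any real-valued input to a value strictly between 0 and 1, which can be read as the predicted probability that Osteop is 1. The input values are arbitrary examples, not outputs of the trained model.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)            # for negative z, avoid overflow in exp(-z)
    return e / (1.0 + e)

for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(z, round(sigmoid(z), 4))
```

An input of 0 gives exactly 0.5, and larger positive or negative inputs move the output towards 1 or 0.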

In [5]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
normalization (Normalization (None, 4)                 9         
_________________________________________________________________
dense (Dense)                (None, 100)               500       
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
=================================================================
Total params: 10,710
Trainable params: 10,701
Non-trainable params: 9
_________________________________________________________________


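The finished model is a sequential model consisting of a normalization layer, two fully connected layers, one dropout layer, and a final fully connected layer. <br>
The three fully connected layers use $4\times100+100$ (bias) $=500$, $100\times100+100=10100$, and $100\times1+1=101$ parameters respectively, which gives 10,701 trainable parameters. Together with the 9 non-trainable parameters of the normalization layer, the model has 10,710 parameters in total. The dropout layer has no parameters.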
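
The parameter counts can be checked with a few lines of plain Python. This is only a sketch of the arithmetic; `dense_params` is a hypothetical helper, and the 9 non-trainable parameters are taken directly from the `model.summary()` output rather than derived.

```python
def dense_params(n_inputs, units, use_bias=True):
    """Weights (n_inputs * units) plus one bias per unit."""
    return n_inputs * units + (units if use_bias else 0)

layer_sizes = [4, 100, 100, 1]    # input features, two hidden layers, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
trainable = sum(per_layer)
non_trainable = 9                 # normalization layer, as reported by model.summary()

print(per_layer)                  # [500, 10100, 101]
print(trainable)                  # 10701
print(trainable + non_trainable)  # 10710
```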

**(c) Compile your model through the `model.compile` function. (1) Explain why we are using the binary crossentropy loss function for this problem. (2) Fit the neural network using the `model.fit` function. You may choose "epochs" yourself. (3) Report the accuracy for your model. (25 points)**

In [6]:
# Compiling the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

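(1) Reason for using the binary cross-entropy loss function <br>
<br>
The appropriate loss function depends on the form of the output. For regression data, where the output is a number, MSE or RMSE is typically used. For classification data, where the output is a label such as 0 or 1, cross-entropy is typically used. This is a binary classification problem, so among the cross-entropy losses, binary cross-entropy is the one used here. Because the sigmoid output is interpreted as the probability that Osteop is 1, minimizing binary cross-entropy is the same as maximizing the likelihood of the observed labels, and confident wrong predictions are penalized heavily.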
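
For a label $y \in \{0, 1\}$ and a predicted probability $p$, the binary cross-entropy of one sample is $-\bigl(y\log p + (1-y)\log(1-p)\bigr)$, and the loss is its mean over the samples. The NumPy sketch below computes it on made-up labels and predictions. The function name and the clipping constant `eps` are assumptions for illustration, not the Keras internals.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)); eps is an assumed clipping value."""
    p = np.clip(p_pred, eps, 1.0 - eps)    # avoid log(0)
    losses = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float(np.mean(losses))

y = np.array([1.0, 0.0, 1.0, 0.0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])

print(binary_cross_entropy(y, mostly_right))   # small loss
print(binary_cross_entropy(y, mostly_wrong))   # much larger loss
```

The second set of predictions is confidently wrong on every sample, so its loss is more than ten times larger than the first.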

(2) Fitting the NN

In [7]:
model.fit(x_train, y_train, epochs=100)

Epoch 1/100
...
[per-epoch training output truncated in the export]

<tensorflow.python.keras.callbacks.History at 0x12289803108>

The model was fitted with the number of epochs set to 100.

(3) Accuracy <br>
<br>
With epochs=100, the training accuracy is 0.9376.

**(d) In a similar way, load the test data set from `healthTest.csv`. Report the accuracy of your model using the `model.evaluate` function. (25 points)**

In [8]:
# Load test dataset
dataset_test = loadtxt('healthTest.csv', delimiter=',')

# Test X, y split
x_test = dataset_test[:,0:4]
y_test = dataset_test[:,4]

# Accuracy calculation
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print('\nAccuracy:', test_acc)

96/96 - 0s - loss: 0.1700 - accuracy: 0.9432

Accuracy: 0.9432300329208374


`healthTest.csv` was loaded and the accuracy was computed on it. The model reaches an accuracy of about 0.943 on the test set.
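
For reference, the accuracy reported for a sigmoid output is the fraction of samples whose predicted probability, thresholded at 0.5, matches the label. The sketch below reproduces that calculation in NumPy on made-up values. `binary_accuracy` here is a hypothetical helper written for this example, and the labels and probabilities are not taken from `healthTest.csv`.

```python
import numpy as np

def binary_accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    y_hat = (p_pred > threshold).astype(y_true.dtype)   # probability -> 0/1 label
    return float(np.mean(y_hat == y_true))

y = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
p = np.array([0.8, 0.3, 0.6, 0.4, 0.1])
print(binary_accuracy(y, p))   # 0.6
```

Here three of the five thresholded predictions agree with the labels, so the accuracy is 0.6.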