# Assignment - Autoencoder - McCartney

In this assignment, we will focus on healthcare. This data set contains data about patients with and without heart problems. Each row represents a single patient. There are two files: heart-normal (contains patients without any heart problems) and heart-anomaly (contains patients with heart problems). This is an anomaly detection task: build an autoencoder on normal patients to identify anomalous observations. Supervised learning is not an option, because there are only 20 anomalous observations, which is not enough to build a binary classification model.

## Description of Variables

Descriptions of the variables are provided in "Heart - Data Dictionary.docx".

## Goal

Use the **heart-normal.csv** data set to train an autoencoder on healthy (i.e., normal) patients. Then, use the observations in the **heart-anomaly.csv** data set to check whether the autoencoder can successfully detect patients who have a heart anomaly.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Read and Prepare the Data

In [2]:
# Common imports
import numpy as np
import pandas as pd

random_state=42

In [6]:
heart_anomaly = pd.read_csv("heart-anomaly.csv")

heart_normal = pd.read_csv("heart-normal.csv")

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer

In [29]:
heart_normal.isna().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
dtype: int64

In [31]:
heart_anomaly.isna().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
dtype: int64

In [8]:
heart_normal.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
dtype: object

In [10]:
# Identify the numerical columns
numeric_columns = heart_normal.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = heart_normal.select_dtypes('object').columns.to_list()

In [11]:
# Identify the binary columns so we can pass them through without transforming
binary_columns = ['sex', 'fbs', 'exang']

In [12]:
# Be careful: numeric_columns still includes the binary columns,
# so we need to remove the binary columns from numeric_columns.

for col in binary_columns:
    numeric_columns.remove(col)

In [16]:
# Identify the categorical columns so we can one-hot encode them
# ('thal' is stored as int64, so select_dtypes('object') does not find it)
categorical_columns = ['thal']

In [17]:
# Be careful: numeric_columns still includes the categorical columns,
# so we need to remove the categorical columns from numeric_columns.

for col in categorical_columns:
    numeric_columns.remove(col)

In [18]:
binary_columns

['sex', 'fbs', 'exang']

In [19]:
numeric_columns

['age',
 'cp',
 'trestbps',
 'chol',
 'restecg',
 'thalach',
 'oldpeak',
 'slope',
 'ca']

In [20]:
categorical_columns

['thal']
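
The binary and categorical lists above are typed by hand. As a cross-check, here is a minimal sketch that assumes a column counts as binary when it holds exactly two distinct values. The `toy` frame is made up for illustration: it borrows a few column names from the data set, but none of the values are read from the CSV files.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror heart-normal.csv, values are invented.
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "sex": [1, 1, 0, 1],
    "fbs": [1, 0, 0, 0],
    "exang": [0, 0, 0, 1],
    "thal": [1, 2, 2, 3],
})

# Assumption: a column is binary when it contains exactly two distinct values.
binary_candidates = [col for col in toy.columns if toy[col].nunique() == 2]
print(binary_candidates)
```

On the real data the same comprehension could be run against `heart_normal`, with the result compared to the hand-written `binary_columns` list.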

### Pipeline

In [32]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [33]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=9999)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [34]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [35]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns)],
        remainder='passthrough')
    
# remainder='passthrough' is optional here: every column is already assigned to a transformer.
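
`handle_unknown='ignore'` in the categorical transformer matters because the encoder is fitted on normal patients only, so the anomaly file could contain a category the encoder has never seen. The sketch below uses made-up category codes (not values read from the CSV files) to show that an unseen category is encoded as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up category codes for illustration; not read from the CSV files.
seen = np.array([[1], [2], [3]])
unseen = np.array([[2], [9]])  # 9 never appears during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(seen)

# transform() returns a sparse matrix by default, so convert it for printing.
encoded = encoder.transform(unseen).toarray()
print(encoded)
```

The first row has a single 1 in the column for category 2; the second row is all zeros because 9 was not present when the encoder was fitted.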

### Transform

In [36]:
#Fit and transform the train data

normal_x = preprocessor.fit_transform(heart_normal)

normal_x

array([[ 1.10306652,  1.71093264,  0.97372481, ...,  1.        ,
         1.        ,  0.        ],
       [-1.62754823,  0.65755993,  0.04323489, ...,  1.        ,
         0.        ,  0.        ],
       [-1.20745366, -0.39581278,  0.04323489, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-1.20745366, -0.39581278, -0.57709173, ...,  1.        ,
         0.        ,  0.        ],
       [-1.52252459,  0.65755993,  0.53949618, ...,  1.        ,
         0.        ,  0.        ],
       [-1.52252459,  0.65755993,  0.53949618, ...,  1.        ,
         0.        ,  0.        ]])

In [37]:
normal_x.shape

(165, 16)
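
The width of 16 is consistent with 9 scaled numeric columns, 3 binary columns, and a one-hot encoding of `thal`, which would imply 4 distinct `thal` values (an inference from the shape, not something the notebook prints). The sketch below reproduces the same column arithmetic on a small made-up frame; `toy` and `toy_preprocessor` are hypothetical names, and the imputers are left out for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with invented values.
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "thal": [1, 2, 2, 3],   # three distinct categories
    "sex": [1, 1, 0, 1],
})

toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "chol"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["thal"]),
    ("binary", "passthrough", ["sex"]),
])

toy_x = toy_preprocessor.fit_transform(toy)

# 2 scaled numeric + 3 one-hot + 1 binary = 6 output columns
print(toy_x.shape)
```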

In [38]:
# Transform the test data
anomaly_x = preprocessor.transform(heart_anomaly)

anomaly_x

array([[ 1.5231611 , -1.44918549,  1.90421473,  0.81980549, -1.18012347,
        -2.6400108 ,  1.17814884, -1.0035591 ,  3.1150997 ,  0.        ,
         0.        ,  1.        ,  0.        ,  1.        ,  0.        ,
         1.        ],
       [ 1.5231611 , -1.44918549, -0.57709173, -0.24780329, -1.18012347,
        -1.54145941,  2.59146023, -1.0035591 ,  1.93351016,  0.        ,
         0.        ,  0.        ,  1.        ,  1.        ,  0.        ,
         1.        ],
       [ 0.99804287, -1.44918549,  0.6635615 ,  0.48266588, -1.18012347,
         0.08021169,  3.87628877, -2.69322494,  1.93351016,  0.        ,
         0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 1.10306652, -1.44918549,  0.04323489,  0.22044617, -1.18012347,
        -0.59984393,  1.04966598, -1.0035591 ,  0.75192062,  0.        ,
         0.        ,  0.        ,  1.        ,  1.        ,  0.        ,
         0.        ],
       [ 0.05283008, -1.44918549,  0

In [39]:
anomaly_x.shape

(20, 16)

# Autoencoder

In [40]:
import tensorflow as tf
from tensorflow import keras

In [41]:
model = keras.models.Sequential()

#Encoder
model.add(keras.layers.InputLayer(input_shape=(normal_x.shape[1],)))
model.add(keras.layers.Dense(55, activation='relu'))
model.add(keras.layers.Dense(50, activation='relu'))

#Decoder
model.add(keras.layers.Dense(55, activation='relu'))
model.add(keras.layers.Dense(normal_x.shape[1], activation=None))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 55)                935       
_________________________________________________________________
dense_1 (Dense)              (None, 50)                2800      
_________________________________________________________________
dense_2 (Dense)              (None, 55)                2805      
_________________________________________________________________
dense_3 (Dense)              (None, 16)                896       
Total params: 7,436
Trainable params: 7,436
Non-trainable params: 0
_________________________________________________________________
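
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic, using the layer widths from the model above (`dense_params` is a helper introduced here for illustration):

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_inputs + 1) * n_units

# Input width followed by the units of each Dense layer in the model above.
layer_sizes = [16, 55, 50, 55, 16]

counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))
```

The per-layer counts and their total match the summary table. Note that the narrowest hidden layer (50 units) is wider than the 16 input features, so this network does not compress its input.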


In [42]:
# Compile with Nadam at its default settings. MSE is both the loss and the
# tracked metric, which is why evaluate() returns two identical values.
model.compile(loss='mse', optimizer='Nadam', metrics=['mean_squared_error'])

In [43]:
from tensorflow.keras.callbacks import EarlyStopping

earlystop = EarlyStopping(monitor='val_loss', patience=5, verbose=1, mode='auto')

callback = [earlystop]
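
`EarlyStopping` monitors `val_loss`, which is only informative when the validation data is held out from training. In this notebook the validation data passed to `model.fit` is the training set itself, so `val_loss` simply mirrors the training loss. One possible alternative, sketched here on a random stand-in array (`toy_x` is not the real `normal_x`), is to split the normal patients first:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in with the same shape as normal_x; not the real data.
rng = np.random.default_rng(42)
toy_x = rng.normal(size=(165, 16))

# Hold out roughly 20% of the normal patients for validation.
train_x, val_x = train_test_split(toy_x, test_size=0.2, random_state=42)
print(train_x.shape, val_x.shape)

# Hypothetical usage with the model from this notebook:
# model.fit(train_x, train_x,
#           validation_data=(val_x, val_x),
#           epochs=100, batch_size=100, callbacks=callback)
```

With only 165 rows the held-out set is small, so the validation loss would be noisy; that trade-off is worth keeping in mind.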

In [44]:
model.fit(normal_x, normal_x, 
          validation_data = (normal_x, normal_x),
          epochs=100, batch_size=100, callbacks=callback)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x2613b1df610>

In [45]:
model.evaluate(normal_x, normal_x)



[0.01818540319800377, 0.01818540319800377]

In [46]:
# Multiply by 1000 so the small error value is easier to read:

model.evaluate(normal_x, normal_x)[0]*1000



18.18540319800377

In [47]:
model.evaluate(anomaly_x, anomaly_x)



[0.08003966510295868, 0.08003966510295868]

In [48]:
# Multiply by 1000 so the small error value is easier to read:

model.evaluate(anomaly_x, anomaly_x)[0]*1000



80.03966510295868

## Predict first 20 in normal data

In [51]:
from sklearn.metrics import mean_squared_error

for i in range(0,20):
    prediction = model.predict(normal_x[i:i+1])
    print((mean_squared_error(normal_x[i:i+1], prediction))*1000)

    
# Error terms are multiplied by 1000 so the small values are easier to read

57.881622319451544
15.06441297281386
20.101264290640785
13.555628624686983
10.580669087980551
48.063486554558246
11.440084221048151
19.54098287855321
19.398707572248238
11.983455222502055
10.296929725230799
8.946287488738028
4.349599301704395
21.088141437297857
24.60331630215143
2.8384562524692183
13.49065907741606
17.461861053404984
10.485419038714724
22.787566941401327
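
The loop above calls `model.predict` once per row. The same per-row errors could be computed in one pass by predicting the whole array and averaging the squared differences across features. A minimal sketch with made-up arrays standing in for the inputs and their reconstructions (`per_row_mse` is a helper introduced here, not part of the notebook):

```python
import numpy as np

def per_row_mse(x, x_hat):
    """Mean squared error of each row, averaged over the feature axis."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return np.mean((x - x_hat) ** 2, axis=1)

# Made-up inputs and reconstructions for illustration.
x = np.array([[1.0, 2.0], [0.0, 0.0]])
x_hat = np.array([[1.0, 0.0], [1.0, 1.0]])

errors = per_row_mse(x, x_hat)
print(errors * 1000)
```

With the trained model this would look like `per_row_mse(normal_x, model.predict(normal_x))`, which should give the same numbers as the loop because `mean_squared_error` on a one-row slice also averages over the features.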


## Predict all 20 in anomaly data

In [52]:
for i in range(0,20):
    prediction = model.predict(anomaly_x[i:i+1])
    print((mean_squared_error(anomaly_x[i:i+1], prediction))*1000)

    
# Error terms are multiplied by 1000 so the small values are easier to read

80.80814779364759
72.3301107634933
128.14193937452788
38.04292732918064
77.80655766858357
116.78861590808884
70.91113204431538
26.087085760191517
161.4536023284414
70.46701559032006
104.1474409811993
58.93828602061158
56.17054187416331
92.91841157838196
105.27897816239779
26.34827799836879
123.39900754995524
19.10632139301839
90.36656127829009
81.28217754177147
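
To turn these error terms into an actual detection rule, a threshold is needed. The sketch below is one simple option, not the only one: it takes the largest error among the 20 normal patients printed earlier as the cutoff and flags any anomaly-file patient above it. The values are copied from the printed outputs and rounded to two decimals; the choice of the maximum as the threshold is an assumption made for illustration.

```python
import numpy as np

# Errors (x1000) copied from the printed outputs above, rounded to two decimals.
normal_errors = np.array([
    57.88, 15.06, 20.10, 13.56, 10.58, 48.06, 11.44, 19.54, 19.40, 11.98,
    10.30, 8.95, 4.35, 21.09, 24.60, 2.84, 13.49, 17.46, 10.49, 22.79,
])
anomaly_errors = np.array([
    80.81, 72.33, 128.14, 38.04, 77.81, 116.79, 70.91, 26.09, 161.45, 70.47,
    104.15, 58.94, 56.17, 92.92, 105.28, 26.35, 123.40, 19.11, 90.37, 81.28,
])

# Assumed rule: anything above the worst normal reconstruction is an anomaly.
threshold = normal_errors.max()
flagged = anomaly_errors > threshold

n_flagged = int(flagged.sum())
n_missed = int((~flagged).sum())
print(threshold, n_flagged, n_missed)
```

A threshold based on only 20 normal rows is fragile. Computing it from all 165 normal patients, or from a percentile of their errors, would be a natural refinement.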


# Discussion

Provide a brief discussion (one-paragraph): can the model successfully detect patients with heart anomalies? If not, why? <br>
Discuss any other relevant issues about your autoencoder. 

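Yes, for the most part. The mean reconstruction error on the anomaly data (80.0, scaled by 1000) is about 4.4 times the mean error on the normal data (18.2). Detection is not perfect, however: several anomalous patients have error terms well below the anomaly mean, and a few (e.g., 19.1 and 26.1) fall within the range observed for normal patients.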

# Extra Credit (3 points):

# Build a GAN

Build a GAN that can generate patients with **normal hearts**. Test the effectiveness of your GAN using the autoencoder you built earlier. Hint: when you send your newly generated data to the autoencoder, the error term should be small.

# Discussion

Provide a brief discussion (one-paragraph): can the GAN generate patients with normal heart? If not, why? <br>
Discuss any other relevant issues about your GAN. 