# Assignment - RNN Sequence Classification

In this assignment, we will focus on healthcare. This data set is made available by MIT. It contains 9,026 heartbeat measurements. Each row represents a single measurement captured on a timeline, with 187 data points (columns) per measurement. This is a multiclass classification task: predict whether a measurement represents a normal heartbeat or one of several anomalies.

## Description of Variables

You will use the **heartbeat.csv** data set for this assignment. Each row represents a single measurement. Columns labeled T1 to T187 are the values of a measurement on the timeline (there are 187 data points, or columns, in a single measurement).

The last column is the target variable. It shows the label (category) of the measurement as follows:<br>
0 = Normal<br>
1 = Supraventricular premature beat<br>
2 = Premature ventricular contraction<br>
3 = Fusion of ventricular and normal beat<br>
4 = Unclassifiable beat

## Goal

Use the data set **heartbeat.csv** to predict the column called **Target**. The input variables are the columns labeled **T1 to T187**.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable or undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Important note:

It appears that the original sequences in the data set did not all have the same length. For example, one sequence (i.e., row) had 187 columns, while another had only 150. The creators of the data set therefore applied zero padding: they appended zeros to the end of the shorter sequences so that every sequence (i.e., row) has 187 columns.

If you don't account for the zero padding, you will create biased models. To account for it, you can use the following as the first layer of your models:<br>
`tf.keras.layers.Masking(mask_value=0, input_shape=[n_steps, n_inputs])`<br>
(Of course, you need to enter your own values for n_steps and n_inputs). After you add this layer, continue with your other layers as usual.
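
To make the effect of masking concrete, here is a minimal NumPy sketch of the rule the Keras `Masking` layer applies: a time step is skipped when all of its features equal `mask_value`. The toy sequences and the names `batch`, `keep`, and `kept_steps` are assumptions made for illustration; they are not part of the assignment data.

```python
import numpy as np

# Two toy zero-padded sequences (hypothetical values),
# reshaped to (batch, time steps, features) with one feature per step.
batch = np.array([
    [0.9, 0.4, 0.2, 0.0, 0.0],
    [0.7, 0.0, 0.3, 0.1, 0.0],
], dtype=np.float32)[..., np.newaxis]

mask_value = 0.0

# A time step is kept when at least one of its features differs from mask_value.
keep = np.any(batch != mask_value, axis=-1)

# Number of time steps each sequence still contributes after masking.
kept_steps = keep.sum(axis=1)

print(keep)
print(kept_steps)
```

Note that the rule looks only at values, not at position: the zero in the middle of the second toy sequence is masked as well, not just the padding at the end.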

# Read and Prepare the Data

In [None]:
# Common imports
import numpy as np
import tensorflow as tf
from tensorflow import keras
import pandas as pd

In [None]:
data = pd.read_csv("heartbeat.csv")

In [None]:
data.shape

In [None]:
data.head()

In [None]:
y = data['Target']
x = data.drop('Target', axis=1)

In [None]:
# Split the data

from sklearn.model_selection import train_test_split

# Fix random_state so the split is reproducible across runs
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3, random_state=42)

### Data Transformation

In [None]:
#Target variables need to be an array with integer type
train_y = np.array(train_y)
test_y = np.array(test_y)

train_y = train_y.astype(np.int32)
test_y = test_y.astype(np.int32)

In [None]:
#Check the first 10 values of the train_y data set
train_y[0:10]

In [None]:
# Convert input variables to a 2-D array with float data type
train_x = np.array(train_x)
test_x = np.array(test_x)

train_x = train_x.astype(np.float32)
test_x = test_x.astype(np.float32)

In [None]:
train_x

In [None]:
# Keras recurrent layers expect 3-D input:
# (samples, time steps, features per time step)

train_x = np.reshape(train_x, (train_x.shape[0], train_x.shape[1], 1))
test_x = np.reshape(test_x, (test_x.shape[0], test_x.shape[1], 1))

In [None]:
train_x.shape, train_y.shape

In [None]:
train_x
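
Recurrent layers in Keras take input shaped `(samples, time steps, features)`. The sketch below shows the reshape on a small stand-in array, so the shapes are easy to check. The array `features_2d` is hypothetical and only mimics the layout of the real feature matrix.

```python
import numpy as np

# Hypothetical stand-in for the feature matrix: 4 measurements, 187 time steps each.
features_2d = np.zeros((4, 187), dtype=np.float32)

# Add a trailing axis: one feature (the signal value) per time step.
features_3d = features_2d.reshape(features_2d.shape[0], features_2d.shape[1], 1)

# np.expand_dims produces the same array with less bookkeeping.
same = np.expand_dims(features_2d, axis=-1)

print(features_3d.shape)  # (4, 187, 1)
```

Here `n_steps` corresponds to the second dimension (187) and `n_inputs` to the third (1).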

## Baseline Accuracy

In [None]:
data['Target'].value_counts()/len(data)
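
The cell above lists each class's share of the data. A common choice of baseline for a classifier is the accuracy obtained by always predicting the most frequent class, which is the largest of those shares. The sketch below computes it on a small set of made-up labels; the values in `labels` are assumptions for illustration and are not taken from heartbeat.csv.

```python
import numpy as np

# Hypothetical labels for illustration only; not taken from heartbeat.csv.
labels = np.array([0, 0, 0, 0, 0, 0, 1, 2, 2, 4])

# Count how often each of the five classes (0-4) occurs.
class_counts = np.bincount(labels, minlength=5)
class_share = class_counts / labels.size

# Always predicting the most frequent class yields this accuracy.
majority_class = int(np.argmax(class_counts))
baseline_accuracy = float(class_share[majority_class])

print(majority_class, baseline_accuracy)
```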

# LSTM Model

### Make sure to add the masking layer to the model!

In [None]:
n_steps = 187
n_inputs = 1

model = keras.models.Sequential([
    # Skip the zero-padded time steps
    keras.layers.Masking(mask_value=0, input_shape=[n_steps, n_inputs]),
    keras.layers.LSTM(1, activation='sigmoid'),
    # One output unit per class (labels 0-4); softmax yields class probabilities
    keras.layers.Dense(5, activation='softmax')
])

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

# Stop training when the validation loss has not improved for 5 epochs
earlystop = EarlyStopping(monitor='val_loss', patience=5, verbose=1, mode='auto')

callback = [earlystop]

In [None]:
np.random.seed(42)
tf.random.set_seed(42)

optimizer = keras.optimizers.Nadam(learning_rate=0.001)

# Integer class labels (0-4) call for sparse categorical cross-entropy
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])

history = model.fit(train_x, train_y, epochs=20,
                   validation_data = (test_x, test_y), callbacks=callback)
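
Because **Target** holds five integer-coded classes (0 to 4), a common pairing for this kind of task is a 5-unit softmax output with sparse categorical cross-entropy as the loss. The NumPy sketch below shows what that loss computes. The helper names and the toy `logits` are assumptions for illustration; Keras provides its own implementation.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 per row."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(labels, probs):
    """Mean negative log-probability assigned to the true integer class."""
    picked = probs[np.arange(labels.size), labels]
    return float(-np.log(picked).mean())

# Hypothetical raw scores for two measurements over the five classes.
logits = np.array([[2.0, 0.5, 0.1, 0.1, 0.1],
                   [0.1, 0.2, 3.0, 0.1, 0.1]])
labels = np.array([0, 2])

probs = softmax(logits)
loss = sparse_categorical_crossentropy(labels, probs)
predicted = probs.argmax(axis=1)

print(predicted, loss)
```

The predicted class is the index of the largest probability, so the integer labels can be used directly without one-hot encoding.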

In [None]:
# evaluate the model

scores = model.evaluate(test_x, test_y, verbose=0)

scores

# In results, first is loss, second is accuracy

In [None]:
# extract the accuracy from model.evaluate

print("%s: %.2f" % (model.metrics_names[0], scores[0]))
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

# GRU Model

### Make sure to add the masking layer to this model too!

In [None]:
n_steps = 187
n_inputs = 1

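model = keras.models.Sequential([
    # Skip the zero-padded time steps
    keras.layers.Masking(mask_value=0, input_shape=[n_steps, n_inputs]),
    keras.layers.GRU(2, return_sequences=True),
    keras.layers.GRU(2, return_sequences=True),
    keras.layers.GRU(2, return_sequences=True),
    # The last recurrent layer returns only its final output
    keras.layers.GRU(1, activation='sigmoid'),
    # One output unit per class (labels 0-4); softmax yields class probabilities
    keras.layers.Dense(5, activation='softmax')
])

In [None]:
np.random.seed(42)
tf.random.set_seed(42)

optimizer = keras.optimizers.Nadam(learning_rate=0.001)

# Integer class labels (0-4) call for sparse categorical cross-entropy
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])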

history = model.fit(train_x, train_y, epochs=20,
                   validation_data = (test_x, test_y), callbacks=callback)
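
Both models are trained with the `EarlyStopping` callback defined earlier, which monitors `val_loss` with `patience=5`. The plain-Python sketch below illustrates the patience rule in simplified form. The function `stopping_epoch` and the loss values are hypothetical, and details of the real callback, such as `min_delta`, are left out.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember the new best loss and reset the counter.
            best = loss
            waited = 0
        else:
            # No improvement: count the epoch and stop once patience runs out.
            waited += 1
            if waited >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value occurs at epoch 3.
losses = [0.90, 0.80, 0.75, 0.76, 0.77, 0.78, 0.79, 0.80, 0.81]
print(stopping_epoch(losses, patience=5))  # 8
```

In this toy run the loss stops improving after epoch 3, so training ends five epochs later, at epoch 8, even though more epochs were requested.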

In [None]:
# evaluate the model

scores = model.evaluate(test_x, test_y, verbose=0)

scores

# In results, first is loss, second is accuracy

In [None]:
# extract the accuracy from model.evaluate

print("%s: %.2f" % (model.metrics_names[0], scores[0]))
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

# Discussion

Briefly answer the following questions: (2 points) 
1) Which model performs the best (and why)?<br>
2) What is the baseline value?<br>
3) Does the best model perform better than the baseline (and why)?<br>
4) Does the best model exhibit any overfitting; what did you do about it?

1- The LSTM, with 6.72% accuracy and the lower loss.<br>
2- 20%<br>
3- No. The baseline was 20% and the best model reached 6.72%.<br>
4- No.