# Introduction to Dataset

# Choosing a Model

### Baseline Examples

### Introduction to Time Series

### From Neural Networks to Recurrent Neural Networks

### Recurrent Neural Networks Improved: Long Short-Term Memory Model

# Tutorial by Jason Brownlee

Here we reuse the series_to_supervised() formatter written by Jason Brownlee, which frames a time series as a supervised learning dataset:

In [49]:
from pandas import DataFrame
from pandas import concat
 
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
	"""
	Frame a time series as a supervised learning dataset.
	Arguments:
		data: Sequence of observations as a list or NumPy array.
		n_in: Number of lag observations as input (X).
		n_out: Number of observations as output (y).
		dropnan: Boolean whether or not to drop rows with NaN values.
	Returns:
		Pandas DataFrame of series framed for supervised learning.
	"""
	n_vars = 1 if type(data) is list else data.shape[1]
	df = DataFrame(data)
	cols, names = list(), list()
	# input sequence (t-n, ... t-1)
	for i in range(n_in, 0, -1):
		cols.append(df.shift(i))
		names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
	# forecast sequence (t, t+1, ... t+n)
	for i in range(0, n_out):
		cols.append(df.shift(-i))
		if i == 0:
			names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
		else:
			names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
	# put it all together
	agg = concat(cols, axis=1)
	agg.columns = names
	# drop rows with NaN values
	if dropnan:
		agg.dropna(inplace=True)
	return agg
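To make the effect of the shifting concrete, here is a minimal sketch of the same idea on a toy series. The values and the names `toy` and `framed` are made up for illustration; only `shift`, `concat` and `dropna` are taken from the function above.

```python
import pandas as pd

# Toy univariate series; the values are invented for illustration only.
toy = pd.DataFrame({"var1": [10, 20, 30, 40, 50]})

# shift(1) moves every value one row down, so row t holds the value from t-1.
framed = pd.concat([toy.shift(1), toy], axis=1)
framed.columns = ["var1(t-1)", "var1(t)"]

# The first row has no predecessor, so it contains NaN and is dropped.
framed = framed.dropna()
print(framed)
```

Each remaining row pairs an observation with the one that came just before it, which is the input/output framing a supervised model needs.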

# Implementation of Tutorial

### Selecting Variables From Our Dataset

### Preprocessing Our Data

Loading data from our previously-made Bolus.csv table into a Pandas dataframe:

In [50]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load Bolus Data
file_path = './DataTables/Bolus.csv'
df = pd.read_csv(file_path)
df['CompletionDateTime'] = pd.to_datetime(df['CompletionDateTime'])
df = df.sort_values(by='CompletionDateTime')

Here we prepare and clean our data:

In [51]:
# Cull Features We're Not Using
df = df[["BG (mg/dL)", "CompletionDateTime", "InsulinDelivered", "FoodDelivered", "CarbSize"]]

# Check for NaN values
print("NaN values:", df.isnull().values.any())

# Organize data, make dataframe indexable by date
df.columns = ["BG", "Date", "InsulinDelivered", "FoodDelivered", "CarbSize"]
columns_titles = ["Date", "BG", "InsulinDelivered", "FoodDelivered", "CarbSize"]
df = df.reindex(columns=columns_titles)
df.set_index("Date", inplace=True)
print(df.head())

# normalize features
values = df.values
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)

NaN values: False
                      BG  InsulinDelivered  FoodDelivered  CarbSize
Date                                                               
2022-04-27 11:38:14    0              5.00           5.00        75
2022-04-27 20:34:41  132              1.96           1.67        25
2022-04-28 06:43:30    0              1.07           1.07        16
2022-04-28 11:39:14    0              5.00           5.00        75
2022-04-28 18:09:54    0              1.07           1.07        16
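The normalization step above rescales each column to the range [0, 1]. As a rough sketch of what that means, the snippet below applies the same min-max formula by hand with NumPy on a made-up two-column array (the array `toy` is hypothetical, not taken from Bolus.csv) and then inverts it, which is what we would need to do to read a scaled prediction back in mg/dL.

```python
import numpy as np

# Hypothetical two-column array standing in for (BG, InsulinDelivered).
toy = np.array([[0.0, 1.0],
                [132.0, 5.0],
                [66.0, 3.0]])

# Min-max scaling to [0, 1], computed per column.
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
scaled_toy = (toy - col_min) / (col_max - col_min)

# Inverting the scaling recovers the original units.
restored = scaled_toy * (col_max - col_min) + col_min
print(scaled_toy)
```

Because the minimum and maximum are taken per column, each feature is scaled independently of the others.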


### Multiple Timesteps

The "n_in" parameter in Jason's series_to_supervised() function controls how many lagged copies of each feature are added: every feature's time series is paired with *n* copies of itself, offset by 1, 2, ..., *n* timesteps into the past (t-1, t-2, ..., t-n). If we leave it at its default of 1, printing the head of the dataframe shows two timesteps for each variable. For variable 1, the blood glucose level shown in the code cell above, there are var1(t-1) and var1(t). If you run the code cell below, you will notice that var1(t-1) is the same series as var1(t), except that the entire series is shifted one step into the past. The same holds for every other variable / feature.

It will be more convenient for us to do this later, but if we culled the last three columns (var2(t), var3(t), var4(t)), we would be left with a dataframe containing the previous-timestep series for all four of our features (var1(t-1), var2(t-1), var3(t-1), var4(t-1)), which are our input variables, and one series for the current blood glucose level (var1(t)), which is our output variable.
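As a small sketch of that culling step, assuming the column layout produced with n_in=1 and four features, the snippet below builds a placeholder dataframe with the same column names and drops the trailing current-timestep columns. The dataframe `toy` and its zero/one values are invented; only the column naming follows series_to_supervised().

```python
import pandas as pd

n_features = 4

# Column names in the order series_to_supervised() emits them for n_in=1, n_out=1.
names = [f"var{j + 1}(t-1)" for j in range(n_features)]
names += [f"var{j + 1}(t)" for j in range(n_features)]

# Placeholder values; only the column layout matters here.
toy = pd.DataFrame([[0.0] * 8, [1.0] * 8], columns=names)

# Drop the last three columns, keeping var1(t) as the only output column.
trimmed = toy.drop(columns=toy.columns[-(n_features - 1):])
print(list(trimmed.columns))
```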

In [52]:
# Transform to supervised set with one timestep
df_reframed = series_to_supervised(scaled, n_in=1, n_out=1)
print(df_reframed.head())

   var1(t-1)  var2(t-1)  var3(t-1)  var4(t-1)  var1(t)   var2(t)   var3(t)  \
1       0.00   0.267380   0.267380   0.166667     0.22  0.104813  0.089305   
2       0.22   0.104813   0.089305   0.055556     0.00  0.057219  0.057219   
3       0.00   0.057219   0.057219   0.035556     0.00  0.267380  0.267380   
4       0.00   0.267380   0.267380   0.166667     0.00  0.057219  0.057219   
5       0.00   0.057219   0.057219   0.035556     0.00  0.142781  0.142781   

    var4(t)  
1  0.055556  
2  0.035556  
3  0.166667  
4  0.035556  
5  0.088889  


If we set the "n_in" parameter to more timesteps, the layout is the same as above but with more lagged columns. For our initial attempt, we use 10 timesteps because we would later like to feed data from the last several hours into the LSTM to get our predicted blood glucose level. "n_out" is set to 1 because we only want to predict the current blood glucose level.

We *would* like to be able to predict future blood glucose levels (which would be n_out >= 2), but because our data was not recorded at regular intervals, timestep predictions into the future would tell us "how much?" but not exactly "when?". If our data were recorded in regular one hour intervals, we could then rest assured that future timestep predictions would be *n* number of hours into the future precisely. This is a limitation in our dataset or, at least, in the tutorial we have followed to implement our LSTM. The tutorial was working with data on regular intervals. We are not.
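The irregular spacing is easy to see by measuring the gap between consecutive records. The sketch below does this for the five timestamps printed by df.head() earlier; the variable names `stamps` and `gap_hours` are ours, and the same idea would apply to the full "Date" index.

```python
import pandas as pd

# Timestamps copied from the df.head() output shown earlier.
stamps = pd.to_datetime([
    "2022-04-27 11:38:14",
    "2022-04-27 20:34:41",
    "2022-04-28 06:43:30",
    "2022-04-28 11:39:14",
    "2022-04-28 18:09:54",
])

# Difference between each record and the previous one, expressed in hours.
gaps = pd.Series(stamps).diff().dropna()
gap_hours = gaps.dt.total_seconds() / 3600

print(gap_hours.round(2).tolist())
print("shortest gap (h):", round(gap_hours.min(), 2))
print("longest gap (h):", round(gap_hours.max(), 2))
```

Even in these five rows the gaps range from under five hours to over ten, so one timestep does not correspond to a fixed amount of time.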

In [53]:
# Transform to supervised set with multiple timesteps
n_timesteps = 10
df_reframed = series_to_supervised(scaled, n_in=n_timesteps, n_out=1)
print(df_reframed.head())

    var1(t-10)  var2(t-10)  var3(t-10)  var4(t-10)  var1(t-9)  var2(t-9)  \
10        0.00    0.267380    0.267380    0.166667       0.22   0.104813   
11        0.22    0.104813    0.089305    0.055556       0.00   0.057219   
12        0.00    0.057219    0.057219    0.035556       0.00   0.267380   
13        0.00    0.267380    0.267380    0.166667       0.00   0.057219   
14        0.00    0.057219    0.057219    0.035556       0.00   0.142781   

    var3(t-9)  var4(t-9)  var1(t-8)  var2(t-8)  ...  var3(t-2)  var4(t-2)  \
10   0.089305   0.055556        0.0   0.057219  ...   0.246524   0.166667   
11   0.057219   0.035556        0.0   0.267380  ...   0.000000   0.000000   
12   0.267380   0.166667        0.0   0.057219  ...   0.106952   0.066667   
13   0.057219   0.035556        0.0   0.142781  ...   0.000000   0.000000   
14   0.142781   0.088889        0.0   0.057219  ...   0.000000   0.000000   

    var1(t-1)  var2(t-1)  var3(t-1)  var4(t-1)   var1(t)   var2(t)   var3(t)  \


### Split Test and Train Sets

Here we take data only from Feb 2024 through Oct 2024 because that is the longest stretch of the Bolus dataset without massive holes in the record. We split our data into a training set, consisting of the first half of our time range, and a testing set, consisting of the second half. We split it this way rather than with standard shuffling methods because our data is only useful insofar as it records a time series. That is, if we scrambled the time series, the data would be relatively meaningless, yielding no information about the relative order of the records or the trends in the data over time.

In [54]:
# Split into train and test sets
values = df_reframed.values
index_2024 = 6893
index_last = 11349
index_midpoint = index_2024 + ((index_last - index_2024) // 2)
train = values[index_2024:index_midpoint,:]
test = values[index_midpoint:,:]
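The row positions above are hard-coded. One possible alternative, sketched here under the assumption that the dataframe is indexed by a sorted datetime index, is to look the positions up from the dates themselves. The dataframe `toy` is a hypothetical stand-in with one row per day, not the Bolus data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Bolus dataframe: one row per day, sorted by date.
dates = pd.date_range("2023-12-01", "2024-12-31", freq="D")
toy = pd.DataFrame({"BG": np.arange(len(dates))}, index=dates)

# Row positions of the first record on or after each boundary date.
start = toy.index.searchsorted(pd.Timestamp("2024-02-01"))
stop = toy.index.searchsorted(pd.Timestamp("2024-11-01"))
midpoint = start + (stop - start) // 2

# Chronological split: first half for training, second half for testing.
toy_values = toy.values
toy_train = toy_values[start:midpoint, :]
toy_test = toy_values[midpoint:stop, :]
print(len(toy_train), len(toy_test))
```

Note that series_to_supervised() drops the first n_in rows when dropnan=True, so positions computed on the original dataframe would need to be shifted by n_in before they are applied to the reframed values.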

Here we cull all the current timestep features except variable 1, which is the blood glucose level variable we are trying to predict.

In [55]:
# Split into input and outputs
n_features = 4
n_obs = n_timesteps * n_features
train_X, train_y = train[:, 0:n_obs], train[:, -n_features]
test_X, test_y = test[:, 0:n_obs], test[:, -n_features]
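The slicing above is easier to follow on a small array. The sketch below uses random placeholder data (the array `toy` is hypothetical) with the same column layout: 10 timesteps of 4 features, followed by the 4 current-timestep columns. It also shows the reshape to [samples, timesteps, features], which is the 3D input shape that Keras LSTM layers expect.

```python
import numpy as np

n_timesteps, n_features = 10, 4
n_obs = n_timesteps * n_features

# Placeholder reframed array: 6 rows, 40 lag columns plus 4 current-step columns.
rng = np.random.default_rng(0)
toy = rng.random((6, n_obs + n_features))

# Inputs are all lag columns; the target is var1(t), the fourth column from the end.
toy_X, toy_y = toy[:, :n_obs], toy[:, -n_features]

# Reshape the flat lag columns into [samples, timesteps, features].
toy_X_3d = toy_X.reshape((toy_X.shape[0], n_timesteps, n_features))
print(toy_X.shape, toy_y.shape, toy_X_3d.shape)
```

Because the lag columns are ordered oldest first, `toy_X_3d[i, 0]` holds the four features at t-10 and `toy_X_3d[i, -1]` holds them at t-1.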

### Create and Fit Model

In [58]:
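from keras.models import Sequential
from keras.layers import Input, Dense
from keras.layers import LSTM

# reshape inputs to 3D [samples, timesteps, features], as the LSTM layer expects;
# without this, train_X is 2D and train_X.shape[2] raises an IndexError
train_X = train_X.reshape((train_X.shape[0], n_timesteps, n_features))
test_X = test_X.reshape((test_X.shape[0], n_timesteps, n_features))

# design LSTM
model = Sequential()
model.add(Input(shape=(train_X.shape[1], train_X.shape[2])))
model.add(LSTM(100))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')

# fit LSTM
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)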

# Tuning

### Are we overfitting? Underfitting? How do we tell?

### Which input variables from our dataset are the best predictors?

### What is the optimal number of timesteps?

### What are the best LSTM hyperparameters?

# Results

### Summary

### Limitations

### Credits