## Training on Cork Data

I will use historical data from the [MET Office](https://www.met.ie//climate/available-data/historical-data) to predict the weather 48 hours into the future from the previous 3 days. The data is hourly and runs from \[01-Jan-1992 $\rightarrow$ 01-Feb-2022\].

According to the key provided with the dataset the columns are:
- **rain** | Precipitation Amount | mm
- **temp** | Air Temperature | °C
- **wetb** | Wet Bulb Air Temperature | °C
- **dewpt** | Dew Point Air Temperature | °C                 
- **vappr** | Vapour Pressure | hPa
- **rhum** | Relative Humidity | %
- **msl** | Mean Sea Level Pressure | hPa
- **wdsp** | Mean Hourly Wind Speed | kt
- **wddir** | Predominant Hourly Wind Direction | deg
- **ww** | Synop Code Present Weather
- **w** | Synop Code Past Weather
- **sun** | Sunshine duration | hours
- **vis** | Visibility | m
- **clht** | Cloud Ceiling Height | 100s feet
- **clamt** | Cloud Amount | 
------------------------
Using this data I will train a recurrent model that predicts the next 48 hours of weather given the previous 72 hours. I will train two models: the first predicts only the temperature, as in the tutorial, and the second predicts the temperature, rainfall and wind vector, as in my opinion these three are the most important factors in gauging general weather.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow.keras as keras
import seaborn as sns
OUT_STEPS = 48

### Defining my own plotting function

In [None]:
# Custom plotting function: plots the inputs, true labels and predictions
# for one example taken from the test set.
# `features` must be in the same order as the window's label_columns.
def plot_results(model, window, features, subplot=False):
    pred_num = len(features)
    # Create the figure once; use a taller one when each feature gets its own subplot.
    plt.figure(figsize=(20, 10) if subplot else (20, 3))
    label_width = OUT_STEPS
    shift = OUT_STEPS
    input_width = 72
    batch_num = 1  # which example in the batch to plot
    total_window_size = input_width + shift

    input_slice = slice(0, input_width)
    input_indices = np.arange(total_window_size)[input_slice]

    label_start = total_window_size - label_width
    labels_slice = slice(label_start, None)
    label_indices = np.arange(total_window_size)[labels_slice]

    # Take a single batch from the (shuffled) test set.
    inputs, labels = next(iter(window.test))

    column_indices = window.column_indices
    predictions = model(inputs)
    print(predictions.shape)
    cols = ['red', 'orange', 'blue', 'green', 'yellow']

    for i in range(pred_num):
        if subplot:
            plt.subplot(pred_num, 1, i + 1)
            plt.title(features[i])

        plot_col_index = column_indices[features[i]]
        plt.plot(input_indices, inputs[batch_num, :, plot_col_index], label='Input', marker='.', zorder=-10, color=cols[i])
        plt.scatter(label_indices, labels[batch_num, :, i], marker='o', label='True', color=cols[i])
        plt.scatter(label_indices, predictions[batch_num, :, i], marker='x', color='green', label='Prediction')
        plt.legend()
        plt.grid()

### Read in the data
I first need to read in the data using pandas and check that all the data was read correctly.

In [None]:
# For this import to work, launch Jupyter Notebook from the directory that contains the notebook and the data.
data = pd.read_csv('hly3904.csv')
print(data.shape)
data.dtypes

We can now see that this is a massive dataset with over 200,000 entries, which we will cut down.

-------
Some of the columns were read in as objects (strings) rather than floats or ints, so we need to change them to numeric values. Any blank values are then interpolated with a cubic spline so that we are not missing data. This should give a good enough approximation considering the size of the dataset.
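
As a rough sketch of that clean-up on a toy column (the values and the blank `' '` entry are made up for illustration, not taken from the Cork file): `errors='coerce'` turns anything that cannot be parsed into `NaN`, and `interpolate` then fills the gap from its neighbours. Linear interpolation is swapped in here only so the sketch runs without SciPy; the notebook itself uses `method='cubic'`.

```python
import numpy as np
import pandas as pd

# Toy column as it might arrive from a CSV: numbers stored as strings,
# with a blank entry where nothing was recorded (assumed format).
raw = pd.Series(['10.0', '11.0', ' ', '13.0', '14.0'])

# errors='coerce' turns unparseable entries into NaN instead of raising.
numeric = pd.to_numeric(raw, errors='coerce')

# Fill the gap from its neighbours. Linear is used here only to keep the
# sketch free of SciPy; the notebook uses method='cubic'.
filled = numeric.interpolate(method='linear')

print(numeric.isna().sum())
print(filled.tolist())
```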

In [None]:
data = data.drop(index=np.arange(0, 100000))  # drop the first 100,000 hours
date_time = pd.to_datetime(data.pop('date'))
cols = ['wetb', 'vappr', 'rhum', 'vis']
for x in cols:
    # Unparseable entries (e.g. blanks) become NaN and are then filled by interpolation.
    data[x] = pd.to_numeric(data[x], errors='coerce')
    data[x] = data[x].interpolate(method='cubic')

Next we remove the categorical data that we do not need.

In [None]:
cols = ['ww','w','ind','ind.1','ind.2','ind.3','ind.4']
data = data.drop(cols,axis=1)


data.head()

Like the tutorial, we will combine the wind speed and direction into a wind vector, as they are not great predictors on their own. We will then normalise the data around 0, as the last part showed that the model will not train well at all unless everything is on the same scale.
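
A minimal sketch of both steps on made-up readings (the speeds and directions below are hypothetical). Direction in degrees is awkward for a model because 359° and 1° are almost the same wind but numerically far apart; the x and y components do not have that jump. Note that NumPy's `std` uses the population formula, whereas pandas' `DataFrame.std` uses the sample formula, so the two differ very slightly.

```python
import numpy as np

# Hypothetical readings: speed in knots, direction in degrees.
speed = np.array([10.0, 10.0, 5.0])
direction_deg = np.array([0.0, 90.0, 180.0])

# Convert to radians and split the wind into x and y components.
direction_rad = np.deg2rad(direction_deg)
wx = speed * np.cos(direction_rad)
wy = speed * np.sin(direction_rad)


def standardise(values):
    """Rescale to zero mean and unit standard deviation."""
    return (values - values.mean()) / values.std()


wx_norm = standardise(wx)
print(np.round(wx, 6), np.round(wy, 6))
print(np.round(wx_norm, 3))
```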

In [None]:
wv = data.pop('wdsp')
# Convert to radians.
wd_rad = data.pop('wddir')*np.pi / 180
# Calculate the wind x and y components.
data['Wx'] = wv*np.cos(wd_rad)
data['Wy'] = wv*np.sin(wd_rad)

timestamp_s = date_time.map(pd.Timestamp.timestamp)
day = 24*60*60
year = (365.2425)*day

data['Day sin'] = np.sin(timestamp_s * (2 * np.pi / day))
data['Day cos'] = np.cos(timestamp_s * (2 * np.pi / day))
data['Year sin'] = np.sin(timestamp_s * (2 * np.pi / year))
data['Year cos'] = np.cos(timestamp_s * (2 * np.pi / year))

data_mean = data.mean()
data_std = data.std()
data_df = (data - data_mean)/data_std
data_df

If we look at the data again now, we have removed 7 useless predictors and all of the data has been normalised. I want to see if I can remove even more features for better training. I will look at how correlated the different columns are with each other: if a column is more than 90% correlated with another, I will remove it, as it is just repeating information that another column already provides.
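
The same check can also be done without reading the table by eye. The sketch below uses a synthetic frame (the column names are borrowed from the dataset, but the numbers are random and purely illustrative) and lists every pair of columns whose absolute correlation is above 0.9.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    'temp': base,
    'wetb': base + 0.1 * rng.normal(size=500),  # near-duplicate of temp
    'rain': rng.normal(size=500),               # unrelated column
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the
# diagonal (every column against itself) is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs > 0.9]
print(high)
```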

In [None]:
data_df.corr().abs()

This is hard to pull information from, so let's put it into a seaborn heatmap.

In [None]:
sns.heatmap(data_df.corr().abs())

We can see from this that there is a cluster of highly correlated data in the top left containing temp, wetb, dewpt and vappr. This makes sense, as all of them report a type of temperature except vappr, which is the vapour pressure. I will remove wetb and dewpt and check again.

In [None]:
data_df = data_df.drop('wetb', axis=1)
data_df = data_df.drop('dewpt',axis=1)
sns.heatmap(data_df.corr().abs())

This, in my opinion, looks much better: all features are now contributing something different.

Now let's take a quick look at the data using the plotting code from the tutorial.

In [None]:
plot_cols = ['temp', 'rain', 'vis']
plot_features = data_df[plot_cols]
plot_features.index = date_time
_ = plot_features.plot(subplots=True)

plot_features = data_df[plot_cols][:480]
plot_features.index = date_time[:480]
_ = plot_features.plot(subplots=True)

## Splitting into Train, Test, Validation

I split the data chronologically into 70% training, 20% validation and 10% test. I have also modified the WindowGenerator class from the tutorial, removing its plotting method in favour of my own custom plotting function.
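
A small sketch of the chronological split used in the next cell, on a stand-in array of hour indices. The helper name `chronological_split` is my own for illustration. The rows are not shuffled before splitting, so the validation and test sets always come from later in time than the training set.

```python
import numpy as np


def chronological_split(rows, train_end=0.7, val_end=0.9):
    """Split rows in time order: [0, 70%) train, [70%, 90%) val, [90%, 100%) test."""
    n = len(rows)
    return (rows[:int(n * train_end)],
            rows[int(n * train_end):int(n * val_end)],
            rows[int(n * val_end):])


hours = np.arange(1000)  # stand-in for the hourly rows of data_df
train, val, test = chronological_split(hours)
print(len(train), len(val), len(test))
```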

In [None]:
n = data_df.shape[0]

train_df = data_df[0:int(n*0.7)]
val_df = data_df[int(n*0.7):int(n*0.9)]
test_df = data_df[int(n*0.9):]

## Data Windowing

In [None]:
class WindowGenerator():
    def __init__(self, input_width, label_width, shift,
                   train_df=train_df, val_df=val_df, test_df=test_df,
                   label_columns=None):
        self.train_df = train_df
        self.val_df = val_df
        self.test_df = test_df

        # Work out the label column indices.
        self.label_columns = label_columns
        if label_columns is not None:
            self.label_columns_indices = {name: i for i, name in
                                        enumerate(label_columns)}
        self.column_indices = {name: i for i, name in
                               enumerate(train_df.columns)}

        # Work out the window parameters.
        self.input_width = input_width
        self.label_width = label_width
        self.shift = shift

        self.total_window_size = input_width + shift

        self.input_slice = slice(0, input_width)
        self.input_indices = np.arange(self.total_window_size)[self.input_slice]

        self.label_start = self.total_window_size - self.label_width
        self.labels_slice = slice(self.label_start, None)
        self.label_indices = np.arange(self.total_window_size)[self.labels_slice]

    def __repr__(self):
        return '\n'.join([
            f'Total window size: {self.total_window_size}',
            f'Input indices: {self.input_indices}',
            f'Label indices: {self.label_indices}',
            f'Label column name(s): {self.label_columns}'])

    def split_window(self, features):
        inputs = features[:, self.input_slice, :]
        labels = features[:, self.labels_slice, :]
        if self.label_columns is not None:
            labels = tf.stack(
                [labels[:, :, self.column_indices[name]] for name in self.label_columns],
                axis=-1)

        # Slicing doesn't preserve static shape information, so set the shapes
        # manually. This way the `tf.data.Datasets` are easier to inspect.
        inputs.set_shape([None, self.input_width, None])
        labels.set_shape([None, self.label_width, None])
        return inputs, labels

    def make_dataset(self, data):
        data = np.array(data, dtype=np.float32)
        ds = tf.keras.utils.timeseries_dataset_from_array(
          data=data,
          targets=None,
          sequence_length=self.total_window_size,
          sequence_stride=1,
          shuffle=True,
          batch_size=256,)

        ds = ds.map(self.split_window)
        return ds
    
    @property
    def train(self):
        return self.make_dataset(self.train_df)

    @property
    def val(self):
        return self.make_dataset(self.val_df)

    @property
    def test(self):
        return self.make_dataset(self.test_df)

    @property
    def example(self):
        """Get and cache an example batch of `inputs, labels` for plotting."""
        result = getattr(self, '_example', None)
        if result is None:
            # No example batch was found, so get one from the `.train` dataset
            result = next(iter(self.train))
            # And cache it for next time
            self._example = result
        return result

### Making the Windowed Data
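
To make the window layout concrete, here is the same index arithmetic that `WindowGenerator.__init__` performs, written out on its own for the settings used in this notebook (72 input hours, 48 label hours, shift of 48). It is only a restatement of the class logic, not extra functionality.

```python
import numpy as np

input_width = 72   # hours of history fed to the model
label_width = 48   # hours the model must predict
shift = 48         # how far the end of the labels lies beyond the inputs

total_window_size = input_width + shift
input_indices = np.arange(total_window_size)[:input_width]
label_indices = np.arange(total_window_size)[total_window_size - label_width:]

print(total_window_size)
print(input_indices[0], input_indices[-1])
print(label_indices[0], label_indices[-1])
```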

In [None]:
OUT_STEPS = 48
cork_window_temp = WindowGenerator(input_width=72,
                               label_width=OUT_STEPS,
                               shift=OUT_STEPS, train_df=train_df, val_df=val_df ,test_df=test_df, label_columns = ['temp'])

## Defining the Model
The model below predicts temperature. It is the same as the model used in the tutorial section, except that this time it has a single-unit output layer, so the output dimensions are `(None, 48, 1)`. This matches the windowed data, as I have asked for only temperature in the labels, and should hopefully improve the accuracy. You will see this output shape in the model summary.
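
To see where `(None, 48, 1)` comes from, the sketch below traces only the array shapes through the layers using plain NumPy and random weights. It is not the Keras model: biases and the LSTM itself are left out, and a batch size of 4 is an arbitrary choice. A Dense layer applied to a 3-D tensor acts on the last axis, which is why the final `Dense(1)` keeps the 48 time steps.

```python
import numpy as np

batch, lstm_units = 4, 32
out_steps, num_features = 48, 15

rng = np.random.default_rng(0)

# Stand-in for the output of LSTM(32, return_sequences=False): one vector per example.
lstm_out = rng.normal(size=(batch, lstm_units))

# Dense(OUT_STEPS * num_features): (4, 32) -> (4, 720)
w1 = rng.normal(size=(lstm_units, out_steps * num_features))
dense1 = lstm_out @ w1

# Reshape([OUT_STEPS, num_features]): (4, 720) -> (4, 48, 15)
reshaped = dense1.reshape(batch, out_steps, num_features)

# Dense(1) acts on the last axis: (4, 48, 15) -> (4, 48, 1)
w2 = rng.normal(size=(num_features, 1))
output = reshaped @ w2

print(dense1.shape, reshaped.shape, output.shape)
```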

In [None]:
#num_features = data_df.columns.size
num_features = 15
temp_lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=False),
    tf.keras.layers.Dense(OUT_STEPS*num_features,kernel_initializer=tf.initializers.zeros()),
    #tf.keras.layers.Dense(8, activation = 'selu'),
    tf.keras.layers.Reshape([OUT_STEPS, num_features]),
    tf.keras.layers.Dense(1)
])

In [None]:
def compile_and_fit(model, window,MAX_EPOCHS = 3 ,patience=2):
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                    patience=patience,
                                                    mode='min')
    model.compile(loss=tf.losses.MeanSquaredError(),
                optimizer=tf.optimizers.Adam(),
                metrics=[tf.metrics.MeanAbsoluteError()])
    
    history = model.fit(window.train, epochs=MAX_EPOCHS,
                      validation_data=window.val,
                      callbacks=[early_stopping])
    model.summary()
    return history

compile_and_fit(temp_lstm_model,cork_window_temp,MAX_EPOCHS=1)

for i in range(4):
    plot_results(model=temp_lstm_model,window = cork_window_temp, features=['temp'], subplot=False)

As you can see, the mean absolute error is relatively high. I believe this is due to the sheer amount of data: some results are very wrong and some are very close. Even so, the model has performed very well in terms of getting the general trends correct, which can be seen especially well if you plot several times. In my opinion, though, the temperature is the easiest of all the features to predict, as it clearly rises and falls with each day cycle. The next model tests the LSTM's ability to predict the temperature, rainfall and wind vectors. As these are far less tied to the day-night cycle, I am not expecting results as good.
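
One thing to keep in mind when reading the error is that the labels are normalised, so the reported mean absolute error is in standard deviations rather than °C. Because $|(y-\mu)/\sigma - (\hat{y}-\mu)/\sigma| = |y-\hat{y}|/\sigma$, multiplying by the column's standard deviation converts it back. The helper name and the numbers below are hypothetical and are not results from this notebook.

```python
def mae_in_original_units(mae_normalised, column_std):
    """Convert an MAE measured on z-scored data back to the column's own units."""
    return mae_normalised * column_std


# Hypothetical numbers for illustration only, not results from this notebook.
example_mae = 0.4   # MAE reported by Keras, in standard deviations
example_std = 4.5   # standard deviation of 'temp' in °C, i.e. data_std['temp']
print(mae_in_original_units(example_mae, example_std))
```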

In [None]:
accuracy = temp_lstm_model.evaluate(cork_window_temp.test, verbose=1)

In [None]:
OUT_STEPS = 48
cork_window_multi = WindowGenerator(input_width=72,
                               label_width=OUT_STEPS,
                               shift=OUT_STEPS, train_df=train_df, val_df=val_df ,test_df=test_df, label_columns = ['temp','rain','Wx','Wy'])

num_features = 15
multi_lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=False),
    tf.keras.layers.Dense(OUT_STEPS*num_features,kernel_initializer=tf.initializers.zeros()),
    #tf.keras.layers.Dense(8, activation = 'selu'),
    tf.keras.layers.Reshape([OUT_STEPS, num_features]),
    tf.keras.layers.Dense(4)
])
compile_and_fit(multi_lstm_model,cork_window_multi)
plot_results(model=multi_lstm_model,window = cork_window_multi, features=['temp','rain','Wx','Wy'], subplot=True)

From the output graphs the LSTM appears to have trained very well, and better than I expected. It is by no means perfect and definitely struggles the most with predicting the rain, as it has a lot of '0' values, but I am impressed that it managed to predict the wind vectors relatively well.

I wish I had a more powerful computer so I could run more epochs on a deeper network of LSTMs. I believe there is potential here to develop quite an accurate model. I would also love to experiment with predicting categorical data using a sliding window, as the 'ww' codes that I dropped in the beginning classify the general weather.

## Summary

In summary, from going through the tutorial and modifying it, I have drawn a few conclusions about training:
1. For time series data, normalising the data is absolutely critical, or else convergence will not happen. For regular models this step is important but not an absolute deal breaker. Here, the model trained on unnormalised data almost always output a straight line and was less than useless.

2. Giving the LSTM layers more units or adding more LSTM layers is a good way to increase accuracy and improve convergence. However, this comes at the cost of training time. As with CNNs, it appears that one can only really leverage their full depth with more powerful hardware.

3. From what I could see, there appears to be a trade-off between giving the model too much historical data and too little. That is why I landed on giving my Cork model 3 days of history to predict the next 48 hours.

4. There did not seem to be much overfitting, although I am sure this is bound to happen with more complex models.

5. Overall performance was good, especially on the temperature metric. I believe this is because temperature simply oscillates at different frequencies, which I think is also why it was correlated with the cos wave in the correlation heatmap.
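
The last point can be illustrated with a synthetic signal. The sketch below assumes an idealised temperature that is a pure daily cycle peaking at 15:00 (an assumption made for illustration; real temperature is much noisier) and measures its correlation with daily sine and cosine features built the same way as `Day sin` and `Day cos`. A signal with a daily period correlates with both in proportions set by the time of day at which it peaks.

```python
import numpy as np

hours = np.arange(24 * 30)            # 30 whole days of hourly steps
angle = 2 * np.pi * hours / 24
day_sin, day_cos = np.sin(angle), np.cos(angle)

# Synthetic temperature: a pure daily cycle peaking at 15:00 (assumed peak time).
temp = np.cos(2 * np.pi * (hours - 15) / 24)

corr_sin = np.corrcoef(temp, day_sin)[0, 1]
corr_cos = np.corrcoef(temp, day_cos)[0, 1]
print(round(corr_sin, 3), round(corr_cos, 3))
```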