
# Task 1



In this problem, I used the weatherAUS.csv dataset (https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package?resource=download), which contains about 10 years of daily weather observations from numerous Australian weather stations. The dataset has two fields, Humidity9am and Humidity3pm, that record the humidity at a given location on a given date. I combined these two fields into a median humidity (MedHum) column, which for two readings is simply their midpoint, and added it to the dataset. The task is to predict the humidity at a given location on a given day from the humidity of the seven preceding consecutive days. Because this is time-series data and the model must process all seven consecutive values in order to predict the output, a sequence model is needed. I used the TensorFlow framework to solve this problem.
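
Because MedHum is built from exactly two readings per day, its median coincides with their arithmetic mean. A minimal check of that property, using made-up readings rather than rows from the dataset:

```python
import statistics

# Illustrative 9am/3pm humidity pairs (hypothetical values, not taken from weatherAUS.csv)
readings = [(80.0, 40.0), (55.0, 65.0), (90.0, 30.0)]

midpoints = []
for hum_9am, hum_3pm in readings:
    med = statistics.median([hum_9am, hum_3pm])
    mean = statistics.mean([hum_9am, hum_3pm])
    # With exactly two values, the median is the average of the two
    assert med == mean
    midpoints.append(med)

print(midpoints)
```

So the `median(axis=1)` call used below on the two humidity columns gives the same result a `mean(axis=1)` call would.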

## Exploring Data

In [75]:
# Tensorflow
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, SimpleRNN, LSTM, GRU

# Data manipulation
import pandas as pd 
import numpy as np 
import math 

# Sklearn
import sklearn 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error 
from sklearn.preprocessing import MinMaxScaler 

import time

In [76]:
# read the weatherAUS.csv file
df=pd.read_csv('weatherAUS.csv', encoding='utf-8')

# drop records where Humidity9am or Humidity3pm is NaN
df = df.dropna(subset=['Humidity9am', 'Humidity3pm'])

# median daily humidity (with two readings, this is the midpoint between Humidity9am and Humidity3pm)
df['MedHum'] = df[['Humidity9am', 'Humidity3pm']].median(axis=1)

# viewing the first few rows of the data
df.head()

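| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow | MedHum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No | 46.5 |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No | 34.5 |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No | 34.0 |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No | 30.5 |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No | 57.5 |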


In [77]:
# what's the data size?
df.shape

(140186, 24)

In [78]:
# general information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 140186 entries, 0 to 145459
Data columns (total 24 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           140186 non-null  object 
 1   Location       140186 non-null  object 
 2   MinTemp        139955 non-null  float64
 3   MaxTemp        139975 non-null  float64
 4   Rainfall       138041 non-null  float64
 5   Evaporation    80295 non-null   float64
 6   Sunshine       74323 non-null   float64
 7   WindGustDir    132203 non-null  object 
 8   WindGustSpeed  132234 non-null  float64
 9   WindDir9am     131076 non-null  object 
 10  WindDir3pm     138293 non-null  object 
 11  WindSpeed9am   139322 non-null  float64
 12  WindSpeed3pm   139396 non-null  float64
 13  Humidity9am    140186 non-null  float64
 14  Humidity3pm    140186 non-null  float64
 15  Pressure9am    127650 non-null  float64
 16  Pressure3pm    127709 non-null  float64
 17  Cloud9am       86889 non-null

In [79]:
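# preprocessing data
def prep_data(data, time_step):
    """Split a (n, 1) series into non-overlapping windows of length time_step.

    Each window X[i] is paired with the value that immediately follows it, y[i].
    """
    # indices of the targets: every time_step-th element, starting at time_step
    y_indices = np.arange(start=time_step, stop=len(data), step=time_step)
    y = data[y_indices]

    # keep only the rows that fill complete windows
    rows_X = len(y)
    X = data[:time_step * rows_X]
    # reshape into (samples, time_step, features), the layout Keras recurrent layers expect
    X = np.reshape(X, (rows_X, time_step, 1))
    return X, y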
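
To make the windowing concrete, here is a small worked example on a toy series. The function body repeats the same logic as `prep_data` so the snippet runs on its own; the toy array is an assumption for illustration only.

```python
import numpy as np

def prep_data(data, time_step):
    # Same windowing logic as prep_data in the notebook
    y_indices = np.arange(start=time_step, stop=len(data), step=time_step)
    y = data[y_indices]
    rows_X = len(y)
    X = data[:time_step * rows_X]
    X = np.reshape(X, (rows_X, time_step, 1))
    return X, y

# Toy series 0, 1, ..., 20 with shape (21, 1), the same layout the scaler returns
toy = np.arange(21, dtype=float).reshape(-1, 1)
X_toy, y_toy = prep_data(toy, time_step=7)

print(X_toy.shape, y_toy.shape)
print(X_toy[0].ravel(), y_toy[0])
print(X_toy[1].ravel(), y_toy[1])
```

The first window holds the values 0 to 6 with target 7, and the second holds 7 to 13 with target 14. Note that each target is also the first input of the next window, and that the values after the last complete window (15 to 20 here) are discarded.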

In [80]:
# select only Canberra location
dfCan = df[df['Location']=='Canberra'].copy()

# select data for model
X = dfCan[['MedHum']]

# scaling data
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

# create training, validation and testing samples
train_data, test_data = train_test_split(X, test_size=0.2, shuffle=False)
train_data, valid_data = train_test_split(train_data, test_size=0.2, shuffle=False)

print('train_data.shape:', train_data.shape)
print('valid_data.shape:', valid_data.shape)
print('test_data.shape:', test_data.shape)

# prepare input X and target y
time_step = 7
X_train, y_train = prep_data(train_data, time_step)
X_valid, y_valid = prep_data(valid_data, time_step)
X_test, y_test = prep_data(test_data, time_step)

train_data.shape: (2152, 1)
valid_data.shape: (539, 1)
test_data.shape: (673, 1)
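
Because `prep_data` uses non-overlapping windows, the 2152 training rows yield only a few hundred training samples. The sketch below counts them and contrasts that with an overlapping sliding window. The overlapping variant is an alternative and not what this notebook uses; it assumes NumPy 1.20 or later for `sliding_window_view`, and the random series is only a stand-in for the scaled MedHum values.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

time_step = 7
n_train = 2152  # number of training rows printed above

# Stand-in for the scaled training series (random values, used for shape checking only)
rng = np.random.default_rng(0)
series = rng.random(n_train)

# Non-overlapping windows, as produced by prep_data: one target every time_step rows
n_non_overlapping = len(np.arange(time_step, n_train, time_step))

# Overlapping alternative: every run of time_step + 1 consecutive values
windows = sliding_window_view(series, time_step + 1)
X_overlap = windows[:, :-1].reshape(-1, time_step, 1)  # first 7 values are the inputs
y_overlap = windows[:, -1]                              # 8th value is the target

print(n_non_overlapping, X_overlap.shape, y_overlap.shape)
```

Overlapping windows give the model more training examples from the same series, at the cost of strongly correlated samples, so results obtained this way would not be directly comparable to the numbers reported in this notebook.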


## Part 1

In [81]:
## RNN Module
# define RNN Network
model = Sequential(name="RNN-Model") 
model.add(Input(shape=(time_step,1), name='Input-Layer'))
model.add(SimpleRNN(units=1, activation='tanh', name='Hidden-Recurrent-Layer')) # Hidden Recurrent Layer with activation='tanh'
model.add(Dense(units=1, activation='tanh', name='Hidden-Layer')) # Hidden Layer with activation='tanh'
model.add(Dense(units=1, activation='linear', name='Output-Layer')) # Output Layer with activation='linear'

# compile RNN model
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['MeanSquaredError', 'MeanAbsoluteError'])

# model summary
model.summary()

# track start time
start_time = time.time()

# training the model
epochs = 30
model.fit(X_train, y_train, batch_size=32, epochs=epochs)

# testing the model
# Predict the result on validation data
pred_valid = model.predict(X_valid)
# Predict the result on test data
pred_test = model.predict(X_test)

print("")
print("RNN: Validation MSE: ", mean_squared_error(y_valid, pred_valid))
print("RNN: Test MSE: ", mean_squared_error(y_test, pred_test))

# track end time
end_time = time.time()
print("Took", (end_time - start_time), "seconds to complete", epochs, "epochs for basic RNN Network")

Model: "RNN-Model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Hidden-Recurrent-Layer (Sim  (None, 1)                3         
 pleRNN)                                                         
                                                                 
 Hidden-Layer (Dense)        (None, 1)                 2         
                                                                 
 Output-Layer (Dense)        (None, 1)                 2         
                                                                 
Total params: 7
Trainable params: 7
Non-trainable params: 0
_________________________________________________________________
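
The timer in the cell above brackets `model.fit` and both `model.predict` calls, so the reported seconds include inference as well as training. If the two should be measured separately, one option is a small context manager like the sketch below. The `timed` helper is hypothetical, and the two `sum(...)` calls are stand-in workloads where the notebook would call `model.fit(...)` and `model.predict(...)`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

durations = {}

# Stand-in workloads; in the notebook these would be the fit and predict calls
with timed("train", durations):
    sum(i * i for i in range(100_000))
with timed("predict", durations):
    sum(range(10_000))

for label, seconds in durations.items():
    print(f"{label}: {seconds:.4f} s")
```

`time.perf_counter` is used here because it is a monotonic clock intended for measuring short intervals.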

## Part 2 

In [86]:
## LSTM Module
# define LSTM model
model_LSTM = Sequential(name="LSTM-Model") 
model_LSTM.add(Input(shape=(time_step,1), name='Input-Layer'))
model_LSTM.add(LSTM(units=1, activation='tanh', name='LSTM-Layer')) # LSTM Layer with activation='tanh'
model_LSTM.add(Dense(units=1, activation='tanh', name='Hidden-Layer')) # Hidden Layer with activation='tanh'
model_LSTM.add(Dense(units=1, activation='linear', name='Output-Layer')) # Output Layer with activation='linear'

# compile LSTM model
model_LSTM.compile(optimizer='adam', loss='mean_squared_error', metrics=['MeanSquaredError', 'MeanAbsoluteError'])

# model summary
model_LSTM.summary()

# track start time
start_time = time.time()

# training LSTM model
model_LSTM.fit(X_train, y_train, batch_size=32, epochs=epochs)

# testing LSTM model
# Predict the result on validation data
pred_valid = model_LSTM.predict(X_valid)
# Predict the result on test data
pred_test = model_LSTM.predict(X_test)

print("")
print("LSTM: Validation MSE: ", mean_squared_error(y_valid, pred_valid))
print("LSTM: Test MSE: ", mean_squared_error(y_test, pred_test))

# track end time
end_time = time.time()
print("Took", (end_time - start_time), "seconds to complete", epochs, "epochs for LSTM Network")

Model: "LSTM-Model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 LSTM-Layer (LSTM)           (None, 1)                 12        
                                                                 
 Hidden-Layer (Dense)        (None, 1)                 2         
                                                                 
 Output-Layer (Dense)        (None, 1)                 2         
                                                                 
Total params: 16
Trainable params: 16
Non-trainable params: 0
_________________________________________________________________

In [92]:
## GRU Module
# define GRU model
model_GRU = Sequential(name="GRU-Model") 
model_GRU.add(Input(shape=(time_step,1), name='Input-Layer'))
model_GRU.add(GRU(units=1, activation='tanh', name='GRU-Layer')) # GRU Layer with activation='tanh'
model_GRU.add(Dense(units=1, activation='tanh', name='Hidden-Layer')) # Hidden Layer with activation='tanh'
model_GRU.add(Dense(units=1, activation='linear', name='Output-Layer')) # Output Layer with activation='linear'

# compile GRU model
model_GRU.compile(optimizer='adam', loss='mean_squared_error', metrics=['MeanSquaredError', 'MeanAbsoluteError'])

# model summary
model_GRU.summary()

# track start time
start_time = time.time()

# training GRU model
model_GRU.fit(X_train, y_train, batch_size=32, epochs=epochs)

# testing GRU model
# Predict the result on validation data
pred_valid = model_GRU.predict(X_valid)
# Predict the result on test data
pred_test = model_GRU.predict(X_test)

print("")
print("GRU: Validation MSE: ", mean_squared_error(y_valid, pred_valid))
print("GRU: Test MSE: ", mean_squared_error(y_test, pred_test))

# track end time
end_time = time.time()
print("Took", (end_time - start_time), "seconds to complete", epochs, "epochs for GRU Network")

Model: "GRU-Model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 GRU-Layer (GRU)             (None, 1)                 12        
                                                                 
 Hidden-Layer (Dense)        (None, 1)                 2         
                                                                 
 Output-Layer (Dense)        (None, 1)                 2         
                                                                 
Total params: 16
Trainable params: 16
Non-trainable params: 0
_________________________________________________________________

I ran all three models for 30 epochs with a batch size of 32. The results for the basic RNN, LSTM and GRU models are:

**Basic RNN:** <br>
Validation MSE:  0.0506 <br>
Test MSE:  0.0399 <br>
Time: **2.8437** seconds to complete 30 epochs

**LSTM:** <br>
Validation MSE: 0.0176 <br>
Test MSE:  0.0204 <br>
Time: 5.2067 seconds to complete 30 epochs 

**GRU:** <br>
Validation MSE:  **0.0141** <br>
Test MSE:  **0.0191** <br>
Time: 5.05814 seconds to complete 30 epochs 
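
The MSE values above are computed on the MinMax-scaled MedHum, so they are easier to interpret as RMSE, i.e. as a fraction of the scaled range. The sketch below derives RMSE and time relative to the fastest model from the numbers reported above. Converting RMSE back to humidity points would need the minimum and maximum of Canberra's MedHum, which this notebook does not print.

```python
import math

# MSE values and times exactly as reported above (MSE is on MinMax-scaled MedHum)
results = {
    "Basic RNN": {"valid_mse": 0.0506, "test_mse": 0.0399, "seconds": 2.8437},
    "LSTM":      {"valid_mse": 0.0176, "test_mse": 0.0204, "seconds": 5.2067},
    "GRU":       {"valid_mse": 0.0141, "test_mse": 0.0191, "seconds": 5.05814},
}

fastest = min(r["seconds"] for r in results.values())

summary = {}
for name, r in results.items():
    summary[name] = {
        "valid_rmse": math.sqrt(r["valid_mse"]),
        "test_rmse": math.sqrt(r["test_mse"]),
        "relative_time": r["seconds"] / fastest,  # 1.0 for the fastest model
    }
    s = summary[name]
    print(f"{name}: valid RMSE {s['valid_rmse']:.3f}, "
          f"test RMSE {s['test_rmse']:.3f}, time x{s['relative_time']:.2f}")

# Model with the lowest test MSE
best_test = min(results, key=lambda n: results[n]["test_mse"])
print("Lowest test MSE:", best_test)
```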

From the above results, we can see that GRU beats the other two models on both validation and test loss, while the basic RNN is the fastest. Since the basic RNN has the least complex structure of the three models, it should train the fastest, which is what happens here. LSTM performs close to GRU, but because GRU has a less complex structure than LSTM, it should be faster than LSTM, which is consistent with the slightly shorter GRU time measured here (5.06 s vs. 5.21 s).
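
The "Param #" column in the model summaries gives one concrete measure of structural complexity. The sketch below reproduces those counts from the layer sizes, assuming Keras defaults (`use_bias=True`, and `reset_after=True` for GRU, which keeps separate input and recurrent biases). The function names are illustrative and not part of Keras.

```python
def simple_rnn_params(units, input_dim):
    # input kernel + recurrent kernel + bias
    return units * input_dim + units * units + units

def lstm_params(units, input_dim):
    # four weight sets: input, forget and output gates plus the cell candidate
    return 4 * (units * input_dim + units * units + units)

def gru_params(units, input_dim, reset_after=True):
    # three weight sets: update gate, reset gate and candidate state;
    # with reset_after=True there are two bias vectors per weight set
    bias = 2 * units if reset_after else units
    return 3 * (units * input_dim + units * units + bias)

# Configuration used in this notebook: units=1 and one input feature per time step
for name, fn in [("SimpleRNN", simple_rnn_params), ("LSTM", lstm_params), ("GRU", gru_params)]:
    print(name, fn(1, 1))

# A wider layer, for comparison (hypothetical size, not used in this notebook)
for name, fn in [("SimpleRNN", simple_rnn_params), ("LSTM", lstm_params), ("GRU", gru_params)]:
    print(name, fn(32, 1))
```

With a single unit, the LSTM and GRU layers both have 12 parameters, matching the summaries above, so the gap in parameter count between them only appears at larger layer widths. That is in line with the small timing difference between the two models measured here.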