## Plan
* Assume the input has n features (the list below names 9, but the train/test split later in this notebook selects 10 columns)
* The neural network has only one hidden layer, which contains the following:
    * Weights for each input feature and each neuron (w_jk in the instructions)
    * A bias for each neuron
    * Activation function sigma
    * Weights that multiply the result of the activation function
    * A bias added to the result after the activation function

* The first function will compute the sum of the loss gradients over all input samples
* The second function will take the averaged gradients and update all the weights accordingly.

One option is to store the different weights in a dictionary of arrays like this:

{
    weights_features: a matrix (array of arrays) with one row per neuron and one column per feature,
    bias_features: an array with one item per neuron (the number of bias parameters does not depend on the number of features),
    weights_after_activation: an array with one output-layer weight per neuron, applied after the activation,
    bias_after_activation: the scalar output-layer bias
}

To keep it simple to understand, you could loop over each feature list (one per sample) and compute the approximation and the gradient for each one with the function?
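
The plan above can also be written as a formula. This is only a sketch: w_jk is the notation from the instructions, while v_j (weights after the activation), b_j (hidden biases), c (output bias), m (number of neurons) and the loss choice of half the squared error are assumptions made here for illustration.

$$
z_j = b_j + \sum_{k=1}^{n} w_{jk} x_k, \qquad
\hat{y} = c + \sum_{j=1}^{m} v_j \, \sigma(z_j), \qquad
L = \tfrac{1}{2} (\hat{y} - y)^2
$$

With e = \hat{y} - y, the chain rule then gives the per-sample gradients:

$$
\frac{\partial L}{\partial v_j} = e \, \sigma(z_j), \qquad
\frac{\partial L}{\partial c} = e, \qquad
\frac{\partial L}{\partial w_{jk}} = e \, v_j \, \sigma'(z_j) \, x_k, \qquad
\frac{\partial L}{\partial b_j} = e \, v_j \, \sigma'(z_j)
$$

If the loss is taken as the plain squared error instead, every gradient picks up an extra factor of 2.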

### input features
The input features_list should be a list holding the values of the following predictors, strictly in this order:
[tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, PULocationID, DOLocationID, payment_type, extra]

NOTE: Still to check whether this needs to be a numpy array.

In [11]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [9]:
# import the data
data = pd.read_parquet('data/nytaxi2022_cleaned.parquet')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39656098 entries, 0 to 39656097
Data columns (total 13 columns):
 #   Column                 Dtype              
---  ------                 -----              
 0   tpep_pickup_datetime   datetime64[ns, UTC]
 1   tpep_dropoff_datetime  datetime64[ns, UTC]
 2   passenger_count        float64            
 3   trip_distance          float64            
 4   RatecodeID             float64            
 5   PULocationID           int64              
 6   DOLocationID           int64              
 7   payment_type           int64              
 8   extra                  float64            
 9   total_amount           float64            
 10  trip_duration_min      float32            
 11  pickup_hour            float32            
 12  pickup_dow             float32            
dtypes: datetime64[ns, UTC](2), float32(3), float64(5), int64(3)
memory usage: 3.4 GB


In [10]:
data.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,extra,total_amount,trip_duration_min,pickup_hour,pickup_dow
0,2022-01-01 00:35:40+00:00,2022-01-01 00:53:29+00:00,2.0,3.8,1.0,142,236,1,3.0,21.95,17.816668,0.0,5.0
1,2022-01-01 00:33:43+00:00,2022-01-01 00:42:07+00:00,1.0,2.1,1.0,236,42,1,0.5,13.3,8.4,0.0,5.0
2,2022-01-01 00:53:21+00:00,2022-01-01 01:02:19+00:00,1.0,0.97,1.0,166,166,1,0.5,10.56,8.966666,0.0,5.0
3,2022-01-01 00:25:21+00:00,2022-01-01 00:35:23+00:00,1.0,1.09,1.0,114,68,2,0.5,11.8,10.033334,0.0,5.0
4,2022-01-01 00:36:48+00:00,2022-01-01 01:14:20+00:00,1.0,4.3,1.0,68,163,1,0.5,30.3,37.533333,0.0,5.0


In [13]:
# performing train test split

NUM = ["passenger_count","trip_distance","extra","trip_duration_min","pickup_hour","pickup_dow"]
CAT = ["RatecodeID","PULocationID","DOLocationID","payment_type"]
TARGET = "total_amount"

X, y = data[NUM + CAT], data[TARGET].astype("float32")
print("final X,y:", X.shape, y.shape)   # must show non-zero rows

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.30, random_state=42)

final X,y: (39656098, 10) (39656098,)


In [15]:
# preprocess data into desired features_list form
Xtr.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27759268 entries, 25756598 to 21081788
Data columns (total 10 columns):
 #   Column             Dtype  
---  ------             -----  
 0   passenger_count    float64
 1   trip_distance      float64
 2   extra              float64
 3   trip_duration_min  float32
 4   pickup_hour        float32
 5   pickup_dow         float32
 6   RatecodeID         float64
 7   PULocationID       int64  
 8   DOLocationID       int64  
 9   payment_type       int64  
dtypes: float32(3), float64(4), int64(3)
memory usage: 2.0 GB
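
The functions further down iterate over (features_list, actual_val) pairs, while Xtr is still a DataFrame at this point. A minimal sketch of one possible conversion is shown here on a toy frame; `to_batch`, `toy_X` and `toy_y` are hypothetical names, and the values are only placeholders in the style of the real columns.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for Xtr / ytr (hypothetical values, not the real data)
toy_X = pd.DataFrame({
    "trip_distance": [3.8, 2.1, 0.97, 1.09],
    "passenger_count": [2.0, 1.0, 1.0, 1.0],
    "payment_type": [1, 1, 1, 2],
})
toy_y = pd.Series([21.95, 13.3, 10.56, 11.8], dtype="float32")

def to_batch(X_df, y_ser):
    """Return a list of (features, target) pairs with features as 1-D float arrays."""
    X_arr = X_df.to_numpy(dtype=np.float64)   # shape (n_samples, n_features)
    y_arr = y_ser.to_numpy(dtype=np.float64)  # shape (n_samples,)
    return list(zip(X_arr, y_arr))

batch = to_batch(toy_X, toy_y)
features_list, actual_val = batch[0]
print(features_list.shape, actual_val)
```

Converting everything to one float dtype up front keeps np.dot and np.outer from mixing int64 and float columns. For the full training set a list of pairs would be memory hungry, so slicing the array per mini batch is probably the better fit.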


In [None]:
# # Only use this function once at the beginning to initialise the weights and biases in the desired format;
# # after that, just pass in the newly computed weights and biases from backpropagation.
# # Earlier version kept for reference: np.random.rand draws from uniform [0, 1), which is not Xavier initialisation
# def initialise_weights(features_list, number_of_neurons):
#     weights_dict = {}
#     weights_dict['weights_features'] = np.random.rand(number_of_neurons, len(features_list))
#     weights_dict['bias_features'] = np.random.rand(number_of_neurons)
#     weights_dict['output_weights'] = np.random.rand(number_of_neurons)
#     weights_dict['output_bias'] = np.random.rand(1)[0]
#     return weights_dict

# using Xavier initialisation (optional: test other initialisation methods)
# Xavier initialisation aims to keep:
#   - the variance of activations the same across layers (forward pass stability)
#   - the variance of gradients the same across layers (backward pass stability)
# This reduces the risk of vanishing/exploding gradients and allows stable signal propagation
# through the network

def initialise_weights(features_list, number_of_neurons):
    n_in = len(features_list)
    
    # Xavier/Glorot uniform initialisation: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))
    limit_hidden = np.sqrt(6 / (n_in + number_of_neurons))
    limit_output = np.sqrt(6 / (number_of_neurons + 1))  # +1 for scalar output neuron
    
    weights_dict = {}
    # Hidden layer weights and bias
    weights_dict['weights_features'] = np.random.uniform(-limit_hidden, limit_hidden, (number_of_neurons, n_in))
    weights_dict['bias_features'] = np.zeros(number_of_neurons)  # usually initialised to 0
    weights_dict['weights_after_activation'] = np.random.uniform(-limit_output, limit_output, number_of_neurons)
    weights_dict['bias_after_activation'] = 0.0  # usually initialised to 0
    return weights_dict
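
Since the comment above suggests testing other initialisation methods, here is a hedged sketch of one common alternative, He (Kaiming) normal initialisation, which is often paired with ReLU and draws weights with standard deviation sqrt(2 / fan_in). The function name `initialise_weights_he`, the `seed` argument, and taking `n_in` directly instead of a features_list are all assumptions for illustration; the dictionary layout is the same as above.

```python
import numpy as np

def initialise_weights_he(n_in, number_of_neurons, seed=0):
    """He (Kaiming) normal initialisation: std = sqrt(2 / fan_in) for each layer."""
    rng = np.random.default_rng(seed)
    return {
        # Hidden layer: fan_in is the number of input features
        'weights_features': rng.normal(0.0, np.sqrt(2 / n_in), (number_of_neurons, n_in)),
        'bias_features': np.zeros(number_of_neurons),
        # Output layer: fan_in is the number of hidden neurons
        'weights_after_activation': rng.normal(0.0, np.sqrt(2 / number_of_neurons), number_of_neurons),
        'bias_after_activation': 0.0,
    }

weights = initialise_weights_he(n_in=10, number_of_neurons=16)
print(weights['weights_features'].shape, weights['weights_after_activation'].shape)
```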

In [None]:
# activation function list with their derivative
def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def diff_relu(x):
    return np.where(x > 0, 1, 0)

def diff_sigmoid(x):
    sig = sigmoid(x)
    return sig * (1 - sig)

def diff_tanh(x):
    return 1 - np.tanh(x)**2
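
A quick way to gain confidence in the derivative functions is to compare them with a central finite difference. The sketch below repeats the sigmoid and tanh definitions so it runs on its own; `numeric_derivative` is a hypothetical helper, and ReLU is left out because it is not differentiable at 0, which is one of the test points.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def diff_sigmoid(x):
    sig = sigmoid(x)
    return sig * (1 - sig)

def diff_tanh(x):
    return 1 - np.tanh(x) ** 2

def numeric_derivative(func, x, eps=1e-6):
    """Central finite difference approximation of func'(x)."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x = np.linspace(-3.0, 3.0, 13)
max_err_sigmoid = np.max(np.abs(numeric_derivative(sigmoid, x) - diff_sigmoid(x)))
max_err_tanh = np.max(np.abs(numeric_derivative(np.tanh, x) - diff_tanh(x)))
print(max_err_sigmoid, max_err_tanh)  # both should be tiny if the derivatives are right
```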


In [None]:
def calculate_diff(weights_dict, features_list, actual_val, activation_func):
    # perform matrix multiplication of features and weights
    z = np.dot(weights_dict['weights_features'], features_list) + weights_dict['bias_features']
    # apply activation function
    if activation_func == 'relu':
        activated_output = relu(z)
    elif activation_func == 'sigmoid':
        activated_output = sigmoid(z)
    elif activation_func == 'tanh':
        activated_output = tanh(z)
    else:
        raise ValueError("Unsupported activation function")
    
    # calculate final output
    final_output = np.dot(weights_dict['weights_after_activation'], activated_output) + weights_dict['bias_after_activation']

    # signed prediction error (not the loss itself; callers square it when they need the squared error)
    diff = final_output - actual_val

    return diff, z  # z is the pre-activation output, needed in the gradient step
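
With tens of millions of rows, calling the function once per sample may be slow. As a sketch under that assumption, the same forward pass can be applied to a whole batch with matrix products. `forward_batch` is a hypothetical name and the weights below are hand-made values, not ones from this notebook.

```python
import numpy as np

def forward_batch(weights_dict, X, activation=np.tanh):
    """Forward pass for a whole batch X of shape (n_samples, n_features)."""
    # Each row of Z holds the pre-activations of one sample, shape (n_samples, n_neurons)
    Z = X @ weights_dict['weights_features'].T + weights_dict['bias_features']
    A = activation(Z)
    predictions = A @ weights_dict['weights_after_activation'] + weights_dict['bias_after_activation']
    return predictions, Z

# Tiny hand-made example: 2 neurons, 3 features
weights = {
    'weights_features': np.array([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]),
    'bias_features': np.array([0.0, 1.0]),
    'weights_after_activation': np.array([2.0, -1.0]),
    'bias_after_activation': 0.5,
}
X = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
predictions, Z = forward_batch(weights, X)
print(Z)
```

Subtracting the targets from `predictions` gives the same signed errors as the per-sample function, one per row.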


In [None]:
# This function can be run in parallel across mini batches to compute the RMSE of the whole process.
# Recording the RMSE is essential for checking convergence of the model (if it does not decrease,
# the iteration should terminate).
# NOTE: THIS FUNCTION OUTPUTS THE SUM OF SQUARED DIFFERENCES, NOT THE RMSE
def calculate_squared_diff(weights_dict, data_batch, activation_func):
    total_loss = 0
    for features_list, actual_val in data_batch:
        loss = calculate_diff(weights_dict, features_list, actual_val, activation_func)[0] ** 2
        total_loss += loss
    return total_loss, len(data_batch) # returns the total loss and number of samples in the mini batch in this process
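
Because the function returns a sum and a count rather than the RMSE, the partial results from several batches still have to be combined. The sketch below shows one way to do that and to decide when to stop. `rmse_from_partials`, `has_converged`, `tol` and `patience` are hypothetical names and defaults, and the stopping rule is only one of several reasonable choices.

```python
import math

def rmse_from_partials(partials):
    """Combine (sum_of_squared_diffs, n_samples) pairs from several batches into one RMSE."""
    total_sq = sum(sq for sq, _ in partials)
    total_n = sum(n for _, n in partials)
    return math.sqrt(total_sq / total_n)

def has_converged(rmse_history, tol=1e-4, patience=3):
    """True when the last `patience` epochs improved on the earlier best RMSE by less than tol."""
    if len(rmse_history) <= patience:
        return False
    best_before = min(rmse_history[:-patience])
    best_recent = min(rmse_history[-patience:])
    return best_before - best_recent < tol

partials = [(8.0, 2), (10.0, 2)]  # e.g. outputs of calculate_squared_diff on two batches
print(rmse_from_partials(partials))

history = [5.0, 3.0, 2.5, 2.5, 2.5, 2.5]
print(has_converged(history))
```

Summing the squared differences first and dividing once at the end matters: averaging per-batch RMSE values would give a different number whenever the batches differ in size.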

In [None]:
# Gradient of the loss with respect to the model parameters, averaged over the mini batch.
# The chain rule below uses the signed error directly, which matches the loss 0.5 * (prediction - actual)^2;
# for the plain squared error, multiply every gradient by 2.
def calculate_gradients(weights_dict, data_batch, activation_func):
    data_batch_size = len(data_batch)

    # Initialise the accumulators once, before the loop, with the same shapes as the weights
    # (initialising them inside the loop would discard every sample except the last)
    grad_weights_features = np.zeros_like(weights_dict['weights_features'])
    grad_bias_features = np.zeros_like(weights_dict['bias_features'])
    grad_weights_after_activation = np.zeros_like(weights_dict['weights_after_activation'])
    grad_bias_after_activation = 0.0 # bias_after_activation is just scalar

    for features_list, actual_val in data_batch:
        # signed difference between the prediction and the actual value
        error, z = calculate_diff(weights_dict, features_list, actual_val, activation_func)
        
        # apply activation function and its derivative
        if activation_func == 'relu':
            activated_output = relu(z)
            activated_derivative = diff_relu(z)
        elif activation_func == 'sigmoid':
            activated_output = sigmoid(z)
            activated_derivative = diff_sigmoid(z)
        elif activation_func == 'tanh':
            activated_output = tanh(z)
            activated_derivative = diff_tanh(z)
        else:
            raise ValueError("Unsupported activation function")

        # Gradient w.r.t. output layer weights and bias
        grad_weights_after_activation += error * activated_output
        grad_bias_after_activation += error

        # Gradient w.r.t. hidden layer weights and bias
        delta_hidden = error * weights_dict['weights_after_activation'] * activated_derivative
        grad_weights_features += np.outer(delta_hidden, features_list)
        grad_bias_features += delta_hidden

    # Keys must match the ones read in update_weights
    gradient_dict = {
        'grad_weights_features': grad_weights_features/data_batch_size,
        'grad_bias_features': grad_bias_features/data_batch_size,
        'grad_weights_after_activation': grad_weights_after_activation/data_batch_size,
        'grad_bias_after_activation': grad_bias_after_activation/data_batch_size
        }

    return gradient_dict
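
Chain-rule gradients are easy to get subtly wrong, so a numerical check can help. The sketch below is self-contained and re-implements the hidden-layer weight gradient for a single sample, then compares it with central finite differences. It assumes a tanh activation and the loss 0.5 * (prediction - actual)^2; the helper names and the random test values are hypothetical.

```python
import numpy as np

def loss(weights, x, y):
    """0.5 * squared error of the one-hidden-layer network with tanh activation."""
    z = weights['weights_features'] @ x + weights['bias_features']
    pred = weights['weights_after_activation'] @ np.tanh(z) + weights['bias_after_activation']
    return 0.5 * (pred - y) ** 2

def analytic_grad_hidden_weights(weights, x, y):
    """Chain-rule gradient of the loss w.r.t. weights_features for a single sample."""
    z = weights['weights_features'] @ x + weights['bias_features']
    error = weights['weights_after_activation'] @ np.tanh(z) + weights['bias_after_activation'] - y
    delta_hidden = error * weights['weights_after_activation'] * (1 - np.tanh(z) ** 2)
    return np.outer(delta_hidden, x)

def numeric_grad_hidden_weights(weights, x, y, eps=1e-6):
    """Central finite differences, perturbing one entry of weights_features at a time."""
    W = weights['weights_features']
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        loss_plus = loss(weights, x, y)
        W[idx] = original - eps
        loss_minus = loss(weights, x, y)
        W[idx] = original  # restore the entry before moving on
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
weights = {
    'weights_features': rng.normal(size=(4, 3)),
    'bias_features': rng.normal(size=4),
    'weights_after_activation': rng.normal(size=4),
    'bias_after_activation': 0.1,
}
x, y = rng.normal(size=3), 1.5
max_abs_diff = np.max(np.abs(analytic_grad_hidden_weights(weights, x, y)
                             - numeric_grad_hidden_weights(weights, x, y)))
print(max_abs_diff)  # should be very small if the chain rule is applied correctly
```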

In [None]:
def update_weights(weights_dict, gradients, learning_rate):
    # Update weights and biases using the computed gradients
    weights_dict['weights_features'] -= learning_rate * gradients['grad_weights_features']
    weights_dict['bias_features'] -= learning_rate * gradients['grad_bias_features']
    weights_dict['weights_after_activation'] -= learning_rate * gradients['grad_weights_after_activation']
    weights_dict['bias_after_activation'] -= learning_rate * gradients['grad_bias_after_activation']
    return weights_dict
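
One way the pieces might fit together is a mini-batch gradient descent loop. The sketch below does not call the notebook's functions; it is a compact vectorised version run on synthetic data so that it stands alone. The tanh activation, learning rate, batch size, number of epochs, and the data itself are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical): 200 samples, 3 features, linear target
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3

n_neurons, learning_rate, batch_size = 8, 0.01, 20
W = rng.uniform(-0.5, 0.5, (n_neurons, 3))   # plays the role of weights_features
b = np.zeros(n_neurons)                      # bias_features
v = rng.uniform(-0.5, 0.5, n_neurons)        # weights_after_activation
c = 0.0                                      # bias_after_activation

def rmse(W, b, v, c):
    pred = np.tanh(X @ W.T + b) @ v + c
    return np.sqrt(np.mean((pred - y) ** 2))

rmse_start = rmse(W, b, v, c)
for epoch in range(100):
    order = rng.permutation(len(X))          # reshuffle the samples every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        A = np.tanh(Xb @ W.T + b)            # hidden activations, (batch, n_neurons)
        error = A @ v + c - yb               # signed error per sample
        delta = error[:, None] * v * (1 - A ** 2)
        # Average the per-sample gradients over the mini batch, then take one step
        grad_v, grad_c = A.T @ error / len(idx), error.mean()
        grad_W, grad_b = delta.T @ Xb / len(idx), delta.mean(axis=0)
        W -= learning_rate * grad_W
        b -= learning_rate * grad_b
        v -= learning_rate * grad_v
        c -= learning_rate * grad_c
rmse_end = rmse(W, b, v, c)
print(rmse_start, rmse_end)
```

All gradients are computed from the current parameters before any of them is updated, which mirrors the split between calculate_gradients and update_weights above.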
    