Project: Take ChatGPT's code for an autoencoder, feed it my own data (I used Smarket, ISLP's built-in dataset for the S&P 500 between 2001 and 2005), compute the reconstruction errors, and draw some possible conclusions and explanations from them.

In [1]:
# ChatGPT code for an autoencoder

import numpy as np
import matplotlib.pyplot as plt
from ISLP import load_data

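# Load Smarket and drop the categorical Direction column so only numeric columns remain
Smarket = load_data('Smarket')
Smarket = Smarket.drop('Direction', axis=1)
X = Smarket.values

# Min-max scale each column to the range [0, 1]
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))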

# Activation functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Autoencoder class
class Autoencoder:
    def __init__(self, input_dim, hidden_dim):
        self.W1 = np.random.randn(input_dim, hidden_dim) * 0.1
        self.b1 = np.zeros((1, hidden_dim))
        self.W2 = np.random.randn(hidden_dim, input_dim) * 0.1
        self.b2 = np.zeros((1, input_dim))

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.h = sigmoid(self.z1)
        self.z2 = self.h @ self.W2 + self.b2
        self.out = sigmoid(self.z2)
        return self.out

    def backward(self, X, lr=0.1):
        # Backpropagate the squared error; gradients are summed (not averaged) over all rows of X
        error = self.out - X
        d_out = error * sigmoid_deriv(self.z2)
        dW2 = self.h.T @ d_out
        db2 = np.sum(d_out, axis=0, keepdims=True)
        d_hidden = (d_out @ self.W2.T) * sigmoid_deriv(self.z1)
        dW1 = X.T @ d_hidden
        db1 = np.sum(d_hidden, axis=0, keepdims=True)
        # Gradient-descent update of weights and biases
        self.W2 -= lr * dW2
        self.b2 -= lr * db2
        self.W1 -= lr * dW1
        self.b1 -= lr * db1

    def train(self, X, epochs=100, lr=0.1):
        for epoch in range(epochs):
            self.forward(X)
            self.backward(X, lr)
            if epoch % 10 == 0:
                loss = mse_loss(X, self.out)
                print(f"Epoch {epoch}, Loss: {loss:.4f}")

# Initialize and train
input_dim = X.shape[1]
ae = Autoencoder(input_dim=input_dim, hidden_dim=5)
ae.train(X, epochs=100, lr=0.1)

# Show reconstructions
reconstructed = ae.forward(X)

print("\nOriginal vs Reconstructed (first 5 samples):")
for i in range(5):
    print(f"\nOriginal     : {np.round(X[i], 2)}")
    print(f"Reconstructed: {np.round(reconstructed[i], 2)}")


Epoch 0, Loss: 0.0296
Epoch 10, Loss: 0.1871
Epoch 20, Loss: 0.2034
Epoch 30, Loss: 0.1873
Epoch 40, Loss: 0.1911
Epoch 50, Loss: 0.1975
Epoch 60, Loss: 0.2605
Epoch 70, Loss: 0.2757
Epoch 80, Loss: 0.2665
Epoch 90, Loss: 0.2662

Original vs Reconstructed (first 5 samples):

Original     : [0.   0.5  0.44 0.22 0.36 0.93 0.3  0.55]
Reconstructed: [0. 1. 1. 1. 0. 0. 0. 0.]

Original     : [0.   0.55 0.5  0.44 0.22 0.36 0.34 0.56]
Reconstructed: [0. 1. 1. 1. 0. 0. 0. 0.]

Original     : [0.   0.56 0.55 0.5  0.44 0.22 0.38 0.4 ]
Reconstructed: [0. 1. 1. 1. 0. 0. 0. 0.]

Original     : [0.   0.4  0.56 0.55 0.5  0.44 0.33 0.52]
Reconstructed: [0. 1. 1. 1. 0. 0. 0. 0.]

Original     : [0.   0.52 0.4  0.56 0.55 0.5  0.3  0.48]
Reconstructed: [0. 1. 1. 1. 0. 0. 0. 0.]


In [2]:
Smarket

      Year    Lag1    Lag2    Lag3    Lag4    Lag5   Volume   Today
0     2001   0.381  -0.192  -2.624  -1.055   5.010  1.19130   0.959
1     2001   0.959   0.381  -0.192  -2.624  -1.055  1.29650   1.032
2     2001   1.032   0.959   0.381  -0.192  -2.624  1.41120  -0.623
3     2001  -0.623   1.032   0.959   0.381  -0.192  1.27600   0.614
4     2001   0.614  -0.623   1.032   0.959   0.381  1.20570   0.213
...    ...     ...     ...     ...     ...     ...      ...     ...
1245  2005   0.422   0.252  -0.024  -0.584  -0.285  1.88850   0.043
1246  2005   0.043   0.422   0.252  -0.024  -0.584  1.28581  -0.955
1247  2005  -0.955   0.043   0.422   0.252  -0.024  1.54047   0.130
1248  2005   0.130  -0.955   0.043   0.422   0.252  1.42236  -0.298


In [3]:
X

array([[0.        , 0.49770061, 0.44392304, ..., 0.93214453, 0.29868045,
        0.55194744],
       [0.        , 0.55194744, 0.49770061, ..., 0.3629282 , 0.33630024,
        0.55879869],
       [0.        , 0.55879869, 0.55194744, ..., 0.21567339, 0.37731727,
        0.40347255],
       ...,
       [1.        , 0.37231347, 0.46597841, ..., 0.45969029, 0.42354456,
        0.47414359],
       [1.        , 0.47414359, 0.37231347, ..., 0.48559362, 0.38130811,
        0.43397466],
       [1.        , 0.43397466, 0.47414359, ..., 0.50154857, 0.36706837,
        0.4160488 ]])

In [4]:
reconstructed

array([[2.78291830e-07, 9.99997602e-01, 9.99999002e-01, ...,
        6.66114923e-08, 3.08288153e-03, 2.65632238e-03],
       [2.63236088e-07, 9.99997585e-01, 9.99999025e-01, ...,
        6.61983506e-08, 3.02030396e-03, 2.58378791e-03],
       [2.63461910e-07, 9.99997589e-01, 9.99999023e-01, ...,
        6.62261838e-08, 3.02411252e-03, 2.58785366e-03],
       ...,
       [3.26557619e-07, 9.99997631e-01, 9.99998925e-01, ...,
        6.82729572e-08, 3.26345606e-03, 2.86968894e-03],
       [3.26259510e-07, 9.99997628e-01, 9.99998925e-01, ...,
        6.83044236e-08, 3.26143998e-03, 2.86750787e-03],
       [3.25434913e-07, 9.99997625e-01, 9.99998926e-01, ...,
        6.83244679e-08, 3.25669890e-03, 2.86223933e-03]])

In [5]:
def reconstruction_error(original_data, reconstructed_data):
    error = np.mean((original_data - reconstructed_data) ** 2, axis=1)
    return error

In [6]:
error = reconstruction_error(X, reconstructed)
print(error)

[0.32079757 0.17011802 0.1490098  ... 0.34668117 0.34647414 0.34533041]
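
The rows with the largest and smallest errors can be read off by sorting the error vector. Here is a minimal sketch using np.argsort. The helper name extreme_rows and the toy error values are made up for illustration and do not come from the Smarket run:

```python
import numpy as np

def extreme_rows(error, k=5):
    """Return (indices of the k largest errors, indices of the k smallest),
    each ordered from most to least extreme."""
    order = np.argsort(error)      # row indices, ascending by error
    smallest = order[:k]
    largest = order[::-1][:k]
    return largest, smallest

# Toy values standing in for the real error array
toy_error = np.array([0.30, 0.10, 0.45, 0.20, 0.05, 0.40])
largest, smallest = extreme_rows(toy_error, k=2)
for i in largest:
    print(f"row {i}: error {toy_error[i]:.4f}")
```

Calling extreme_rows(error, k=5) on the real error array would give the row indices directly instead of picking them out by eye.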


Maximum reconstruction errors: 44 (.4610648), 61 (.4482979), 64 (.4411615), 382 (.4391942), 0 (.4337981)

Dates: I don't know if they're perfectly accurate, but it's my best guess at each of them.
44 - Thursday, March 8, 2001
61 - Monday, April 2, 2001
64 - Thursday, April 5, 2001
382 - Tuesday, July 16, 2002
0 - Tuesday, January 2, 2001

The first three all look like they were caused mainly by the dot-com recession, which began in the US in March of 2001. This was a major stock market crash caused by the bursting of the dot-com bubble, when many of the dot-com startups that had sprouted up around the adoption of the World Wide Web collapsed. Online shopping companies such as Pets.com collapsed completely, while companies like Amazon and Cisco lost large parts of their stock valuation. This mainly showed in the NASDAQ, but it looks like it would have affected the S&P as well.

Day 382 falls in mid-July 2002 and seems to have been part of the turnaround from a stock downturn, seemingly with roots in the dot-com recession as well as the 9/11 attacks. This date had the largest single-day change, at 5.733, and it followed significant downturns in the prior days.

Day 0 was also probably partly due to the dot-com recession, which had begun in the EU months before it had an effect on the US stock market. It could also simply be because it's the start of the dataset, but I find that unlikely: each row is almost completely made up of values from the prior days, so there really shouldn't be that large a difference between the first entry and any other in the dataset.

Dates around 9/11 (172 and 174) fall outside the top 10, which I found interesting. 9/11 had a larger change than the top 3 dates, yet those dates had higher reconstruction errors, so I'm not quite sure why this is the case. My best guess is that the later dates could have had a larger effect on the reconstruction of the vectors, since the market behaved differently later on than it did in 2001. That would explain why more of the later dates are present in the minimum reconstruction errors and more of the earlier dates are present in the maximum reconstruction errors.
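
One way to check a guess like this is to split each row's error into per-column contributions, which would show whether a single column such as Year dominates the error. Here is a small sketch. The helper name error_by_feature is hypothetical, and the toy arrays only stand in for X and reconstructed:

```python
import numpy as np

def error_by_feature(original, reconstructed):
    """Per-row, per-column share of the mean squared reconstruction error.
    Summing one row of the result gives that row's reconstruction error."""
    sq = (original - reconstructed) ** 2
    return sq / original.shape[1]

# Toy arrays; the first column plays the role of a scaled Year column
toy_X = np.array([[0.0, 0.5, 0.4],
                  [1.0, 0.5, 0.6]])
toy_R = np.array([[0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0]])
contrib = error_by_feature(toy_X, toy_R)
print(contrib.sum(axis=1))   # per-row error, same as reconstruction_error
print(contrib.mean(axis=0))  # average contribution of each column
```

On the real data, a Year column with a much larger average contribution than the other columns would mean the ranking mostly reflects the year of each row and not how unusual that day's returns were.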

Minimum reconstruction errors: 1074 (.0966696), 1200 (.1078873), 1083 (.1144973), 1210 (.1167006), 441 (.11910858)

Dates

1074 - Wednesday, April 27, 2005
1200 - Wednesday, October 26, 2005
1083 - Tuesday, May 10, 2005
1210 - Wednesday, November 9, 2005
441 - Tuesday, October 8, 2002

I can't think of any especially interesting explanation for these beyond what was said above (they are likely this low because they were the most "typical" days in this 5-year span), but I thought they were worth noting anyway. Notably, 441 is the only three-digit entry to make it into the top 20 on this list.