# Introduction

In this post, I train Karpathy's __[nanoGPT](https://johncollinsai-nanogpt-voqqf4ls3a-as.a.run.app/)__ on high-frequency (tick-by-tick) data for __[AAPL](https://www.google.com/search?q=aapl&oq=AAPL&aqs=chrome.0.0i512l5j69i61l3.1590j1j9&sourceid=chrome&ie=UTF-8)__ and __[JPM](https://www.google.com/search?q=jpm+stock+price&oq=JPM+stock+pri&aqs=chrome.0.0i512j69i57j0i512l8.4577j1j9&sourceid=chrome&ie=UTF-8)__. I want to see how nanoGPT performs as a volatility predictor.  I also want to explore the use of LLMs for tasks, in this case volatility prediction, that are typically performed by models more specific to finance.  In the case of volatility prediction, the established model classes include stochastic volatility models such as the __[MSM](https://github.com/johncollinsai/markov-switching-multifractal)__ of Calvet & Fisher, ARCH and GARCH, and Jump Diffusion models. More recently deep learning has been applied to volatility prediction and this __[post](https://johncollinsai-deep-learning-finance-voqqf4ls3a-as.a.run.app/)__ describes these developments in some detail. However, the application of LLMs to volatility prediction appears to be quite novel and the use of nanoGPT provides a great basis for an under-the-hood examination.

I begin with __[my earlier implementation of Karpathy's nanoGPT](https://github.com/johncollinsai/nanogpt)__. Starting with a very simple bigram language model, following Karpathy, I define and build a transformer piece by piece. Then I train it on a rigorously prepared tick-by-tick price dataset. To reduce the impact of microstructure noise, and in particular bid-ask bounce, I compute a weighted mid-price ($\mathrm{WMP}$) from the CloseBid and CloseAsk prices and sizes: $\mathrm{WMP} = I P^a + \left( 1 - I \right) P^b$, where $P^a$ and $P^b$ are the best ask and best bid prices and the weight $I$ is the imbalance $I = \frac {Q^b} {Q^b + Q^a}$, with $Q^b$ the bid size (that is, total volume at the best bid) and $Q^a$ the ask size. I discuss microstructure noise, bid-ask bounce, $\mathrm{WMP}$, and my motivation for using $\mathrm{WMP}$, in detail in my post __[high frequency data](https://johncollinsai-high-frequency-data-voqqf4ls3a-as.a.run.app)__. I obtain raw returns from the $\mathrm{WMP}$ and, for tractability, log returns.
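The weighted mid-price and the two return series can be sketched in a few lines. This is a minimal illustration using made-up quotes, not the preprocessing code used for this post; the function name `weighted_mid_price` and the toy values are assumptions introduced here.

```python
import math

def weighted_mid_price(bid, ask, bid_size, ask_size):
    """WMP = I * P^a + (1 - I) * P^b, with imbalance I = Q^b / (Q^b + Q^a)."""
    imbalance = bid_size / (bid_size + ask_size)
    return imbalance * ask + (1.0 - imbalance) * bid

# Toy quotes (made up for illustration): (bid, ask, bid size, ask size)
quotes = [
    (100.00, 100.02, 300, 100),
    (100.01, 100.03, 100, 100),
    (100.02, 100.04, 100, 300),
]

wmp = [weighted_mid_price(*q) for q in quotes]

# Raw (simple) returns and log returns between consecutive ticks
raw_returns = [p1 / p0 - 1.0 for p0, p1 in zip(wmp, wmp[1:])]
log_returns = [math.log(p1 / p0) for p0, p1 in zip(wmp, wmp[1:])]
```

When the bid size dominates, the imbalance is close to one and the WMP moves towards the ask, reflecting buying pressure; with equal sizes it reduces to the ordinary mid-price.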

I use an NVIDIA GeForce RTX 3080 Ti Laptop GPU and a deep learning framework that includes PyTorch, CUDA, cuDNN, and NVIDIA Drivers, on Ubuntu 22.04 LTS.  Source code as always may be found on __[my GitHub](https://github.com/johncollinsai/nanogpt)__.

#### Prepare and import high-frequency data, describe and visualize it

In [None]:
from volgpt_describe import volgpt_import, volgpt_describe

df_data_AAPL, df_data_JPM, AAPL_rr, JPM_rr, AAPL_lr, JPM_lr, AAPL_stats, JPM_stats, device = volgpt_import(dp=8)
volgpt_describe(AAPL_stats, JPM_stats, df_data_AAPL, df_data_JPM, AAPL_lr, JPM_lr)


#### Create text file for input to nanoGPT

In [None]:
# save df_data_AAPL and df_data_JPM as a text file with a comma delimiter
df_data_AAPL.to_csv('df_data_AAPL.txt', sep=',', index=True)
df_data_JPM.to_csv('df_data_JPM.txt', sep=',', index=True)

#### Train nanoGPT and generate new high-frequency data

The basic mechanics here are:

* The decode() function converts the output produced by m.generate() from a list of token IDs to a human-readable string. The output of m.generate() is a tensor of long integers, which is converted to a Python list with .tolist(). The resulting list is passed to decode(), which maps each token ID back to its text representation through the vocabulary lookup (the itos mapping), rather than through the trained model itself. Finally, the output of decode() is assigned to the variable generated_text.

* I generate 50k tokens, and then decode the tokens into a string. The output string, generated_text, contains rows of financial data, including stock names (e.g., AAPL, JPM), timestamps, and the various numerical values. Each row starts with the stock symbol (e.g., AAPL or JPM), followed by a timestamp. After the timestamp, the other values are separated by commas.

* Note that I generate 50k tokens because I need a lot of generated text to obtain date-time stamps that align between the generated high-frequency data and the real high-frequency data.
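The encode/decode round trip described above can be sketched as follows. This assumes the character-level tokenizer from Karpathy's lecture, where `stoi` and `itos` are plain dictionaries built from the training text; the toy text is made up, and in the notebook the token IDs would come from `m.generate(...)` followed by `.tolist()` rather than from `encode`.

```python
# Toy training text (made up); the real input is the comma-delimited tick file
text = "AAPL,2018-01-02 09:30:00,0.00012\n"

# Character-level vocabulary, following Karpathy's lecture code
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> token ID
itos = {i: ch for i, ch in enumerate(chars)}   # token ID -> character

def encode(s):
    """Map a string to a list of integer token IDs."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer token IDs back to a string."""
    return "".join(itos[i] for i in ids)

token_ids = encode("AAPL,2018")
round_trip = decode(token_ids)
```

Because decoding is a pure lookup, any malformed rows in the generated text come from the sampled token IDs, not from decode() itself.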

In [None]:
# use volgpt to generate text
from volgpt import train_and_generate

# Set max_new_tokens sufficiently high that date-time stamps match the original data
max_new_tokens=50000

# obtain test_data and preds from the function train_and_generate
text_file_path = 'df_data_AAPL.txt' # specify the path to the text file
test_data, generated_text, itos = train_and_generate(text_file_path, 
                                                     max_iters=5000, 
                                                     learning_rate=1e-3, 
                                                     device=device, 
                                                     max_new_tokens=max_new_tokens)

# print(generated_text)

#### Clean generated_text and convert data types

* My clean_data function takes a text_data string containing the generated predictions in CSV format, and column names as a list. It cleans the data by removing invalid rows and converting the data types of the columns. The function returns a tuple of three values: the original DataFrame, the cleaned DataFrame, and a list of indices of invalid rows.

* It is essential to clean and preprocess test_data in the same way that I did for the generated predictions. This will ensure that the data is in a consistent format, and I can compare the model's predictions with the actual data accurately.

* I also align the timing of the predictions and the test data. If the model's predictions and the test data have different time steps, it is not possible to accurately calculate the MSE and MAE; I therefore ensure that both datasets are aligned in terms of their time steps before calculating the performance metrics.
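The alignment step amounts to an inner join on the timestamp. The sketch below uses plain dictionaries and made-up values to show the idea; in a DataFrame workflow the equivalent would be a pandas `merge` with `how="inner"` on the timestamp column. The variable names are illustrative and are not taken from volgpt.

```python
from datetime import datetime

# Toy rows (made up): timestamp -> log return
generated = {
    datetime(2018, 1, 2, 9, 30, 0): 0.0010,
    datetime(2018, 1, 2, 9, 30, 1): -0.0004,
    datetime(2031, 5, 1, 9, 30, 0): 0.0020,   # timestamp with no real counterpart
}
test = {
    datetime(2018, 1, 2, 9, 30, 0): 0.0008,
    datetime(2018, 1, 2, 9, 30, 1): -0.0001,
    datetime(2018, 1, 2, 9, 30, 2): 0.0003,
}

# Inner join: keep only timestamps present in both series
common = sorted(generated.keys() & test.keys())
merged = [(ts, generated[ts], test[ts]) for ts in common]
```

Only the two shared timestamps survive the join, which is why a large number of generated tokens is needed to end up with enough matched rows.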

In [None]:
# clean the data using the clean_data function in volgpt_clean_data.py
from volgpt_clean_data import clean_data
df, df_clean, invalid_rows = clean_data(generated_text)

print("Original DataFrame: ")
print(df)

print("Cleaned DataFrame: ")
print(df_clean)

print("Invalid Rows: ")
print(invalid_rows)

#### Evaluate the model: MSE, MAE, and paired t-test 

I evaluate the accuracy of the raw return and log return predictions by the model using Mean Squared Error (MSE) and Mean Absolute Error (MAE). To briefly recap:

* Mean Squared Error (MSE) measures the average squared difference between the predicted values and the actual values. A lower MSE indicates better accuracy.

* Mean Absolute Error (MAE) measures the average absolute difference between the predicted values and the actual values. This metric is less sensitive to outliers compared to MSE.

To evaluate the significance of the model's errors, I also perform a hypothesis test, namely a paired t-test.

* The paired t-test helps determine whether there is a significant difference between the mean of the true values and the mean of the predicted values. If the p-value is small (typically below 0.05), it suggests that the difference is significant, and that the model's errors are unlikely to be due to random chance.
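For reference, the three quantities can be computed by hand as below. This is a minimal sketch on made-up numbers using only the standard library, not the code in volgpt_stats; it stops at the t-statistic, since the p-value requires the Student t distribution (for example via `scipy.stats.ttest_rel`, which returns both).

```python
import math
import statistics

def mse(actual, predicted):
    """Mean of squared differences."""
    return statistics.fmean([(a - p) ** 2 for a, p in zip(actual, predicted)])

def mae(actual, predicted):
    """Mean of absolute differences."""
    return statistics.fmean([abs(a - p) for a, p in zip(actual, predicted)])

def paired_t_statistic(actual, predicted):
    """t = mean(d) / (stdev(d) / sqrt(n)) for differences d = actual - predicted."""
    diffs = [a - p for a, p in zip(actual, predicted)]
    n = len(diffs)
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Toy values (made up for illustration)
actual = [0.010, -0.020, 0.015, 0.000]
predicted = [0.012, -0.015, 0.010, 0.004]

example_mse = mse(actual, predicted)
example_mae = mae(actual, predicted)
example_t = paired_t_statistic(actual, predicted)
```

The t-statistic is compared against a Student t distribution with n - 1 degrees of freedom; a value close to zero means the average error is small relative to its variability.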


In [None]:
# statistical analysis of the generated data
from volgpt_stats import volgpt_stats
generated_clean, test_data_clean, merged_data, rr_mae, rr_mse, lr_mae, lr_mse, raw_t_stat, raw_p_value, log_t_stat, log_p_value = volgpt_stats(generated_text, test_data, itos)

print("Clean generated data: ")
print(generated_clean, end="\n\n")

print("Clean test data: ")
print(test_data_clean, end="\n\n")

print("Merged data: ")
print(merged_data, end="\n\n")

print("Generated data date range: ", generated_clean['DateTimeIndex'].min(), "to", generated_clean['DateTimeIndex'].max())
print("Test data date range: ", test_data_clean['DateTimeIndex'].min(), "to", test_data_clean['DateTimeIndex'].max(), end="\n\n")

print(f"Raw returns MSE: {rr_mse:.4f}, MAE: {rr_mae:.4f}")
print(f"Log returns MSE: {lr_mse:.4f}, MAE: {lr_mae:.4f}", end="\n\n")

print(f"Raw returns paired t-test results: T-statistic = {raw_t_stat:.2f}, p-value = {raw_p_value:.6f}")
print(f"Log returns paired t-test results: T-statistic = {log_t_stat:.2f}, p-value = {log_p_value:.6f}")

In [None]:
print("Generated clean data:")
print(generated_clean.head())
print("Shape:", generated_clean.shape)
print("\nTest data clean:")
print(test_data_clean.head())
print("Shape:", test_data_clean.shape)

# Merging and filtering code

print("\nMerged data before filtering:")
print(merged_data.head())
print("Shape:", merged_data.shape)


#### Results

* The merged data shows the generated returns (rr_generated, lr_generated) alongside the actual test returns (rr_test, lr_test). Note that the generated data has a date range from 2010-07-29 to 2030-09-07, while the test data has a date range from 2018-01-02 to 2020-02-19. The generated timestamps therefore span a much broader range than the test data, including dates outside the sample period, and only rows whose timestamps match the test data can be compared.

* The MSE and MAE values for both raw returns and log returns are relatively low, which indicates that the model's predictions are close to the actual values on average. However, tick-by-tick returns are themselves very small, so low absolute errors are to be expected. To assess the model properly, these metrics should be compared against a suitable benchmark, such as a naive forecasting model or other established models in the field.

* Paired t-test results: the p-values for both raw returns and log returns are greater than the commonly used significance level of 0.05, so there is no strong evidence to reject the null hypothesis that the true and predicted values have the same mean. In other words, the test detects no systematic bias in the predictions. Failing to reject the null does not, however, show that the predictions match the true values observation by observation.

* In summary, the results indicate that the model performs reasonably well in predicting raw returns and log returns. It remains crucial to compare these results with a suitable baseline or benchmark, and to consider the specific use case, before deciding whether they are good enough. Further work could explore other performance metrics and additional analysis, such as comparing the model's performance under different market conditions, to better assess its robustness and applicability.
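One simple benchmark of the kind mentioned above is a naive forecast that always predicts a zero return. The sketch below compares a model's MAE with that baseline on made-up numbers; the values and variable names are illustrative only and are not results from this post.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy values (made up for illustration)
actual = [0.0010, -0.0020, 0.0015, 0.0000, -0.0005]
model_pred = [0.0008, -0.0015, 0.0010, 0.0002, -0.0003]

# Naive benchmark: always forecast a zero return
zero_forecast = [0.0] * len(actual)

mae_model = mae(actual, model_pred)
mae_naive = mae(actual, zero_forecast)

# A ratio below 1 means the model beats the naive benchmark on this metric
relative_mae = mae_model / mae_naive
```

Reporting the error relative to a baseline makes the result interpretable regardless of how small the returns are in absolute terms.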

# References

Bollerslev, T., Hood, B., Huss, J., Pedersen, L.H. (2017). Risk Everywhere: Modeling and Managing Volatility. Available at SSRN: https://ssrn.com/abstract=2722591

Calvet, L.E. & Fisher, A.J. (2008).  Multifractal Volatility Theory, Forecasting, and Pricing.  Elsevier, Academic Press.

__[Colab for Karpathy's video](https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing)__

__[Karpathy's nanoGPT GitHub repo](https://github.com/karpathy/nanoGPT)__

__[Karpathy's YouTube video](https://www.youtube.com/watch?v=kCc8FmEb1nY)__

Stoikov, S. (2020). The micro-price: A high frequency estimator of future prices. Available at SSRN: https://ssrn.com/abstract=2970694.

Vaswani, A., et al. (2017).  Attention Is All You Need. arXiv:1706.03762
