## Dataset Curator for Project 3
### Example code for simple equity trading decisions

Three sample Python files were generated (via multiple queries) by GPT-4o, Claude 3 Opus and Gemini 1.5 Pro.
This notebook creates training data from these files, converts it to the HuggingFace dataset format and uploads it to the hub.

It goes without saying: this trading code was generated by LLMs, is over-simplified and untrusted - do not make actual trading decisions based on this!

In [None]:
import os
import glob
import matplotlib.pyplot as plt
import random
from datasets import Dataset
from dotenv import load_dotenv
from huggingface_hub import login
import transformers
from transformers import AutoTokenizer

In [None]:
# Load environment variables from a file called .env
load_dotenv()
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-hf-token-if-not-using-env')

In [None]:
# Constants

DATASET_NAME = "trade_code_dataset"
BASE_MODEL = "bigcode/starcoder2-3b"

In [None]:
# A utility method to convert the text contents of a file into a list of methods

def extract_method_bodies(text):
    chunks = text.split('def trade')[1:]
    results = []
    for chunk in chunks:
        lines = chunk.split('\n')[1:]
        # After splitting on '\n', blank lines are empty or whitespace-only strings, so filter on that
        body = '\n'.join(line for line in lines if line.strip())
        results.append(body)
    return results          
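
To make the splitting step concrete, here is a minimal, self-contained sketch of the same idea run on a tiny made-up source string. The helper name `split_trade_bodies` and the sample text are hypothetical, chosen for illustration; this is not necessarily how the real generated files look.

```python
def split_trade_bodies(text):
    """Return the body of every function whose name starts with 'trade'."""
    bodies = []
    for chunk in text.split("def trade")[1:]:
        # Drop the rest of the signature line, e.g. "2():"
        lines = chunk.split("\n")[1:]
        # Keep only non-blank lines
        bodies.append("\n".join(line for line in lines if line.strip()))
    return bodies

# Made-up input, for illustration only
sample = (
    "import random\n"
    "\n"
    "def trade2():\n"
    "    # Buy one share of the first ticker\n"
    "\n"
    "    return [Trade(tickers[0], 1)]\n"
    "\n"
    "def trade3():\n"
    "    return []\n"
)

for body in split_trade_bodies(sample):
    print(body)
    print("---")
```

Note that this approach takes everything between one `def trade` and the next as the body, so any other top-level code that follows a trade function in a file would be swept into that body.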

In [None]:
# Read all .py files and convert into training data

bodies = []
for filename in glob.glob("*.py"):
    with open(filename, 'r') as file:
        content = file.read()
        extracted = extract_method_bodies(content)
        bodies += extracted

print(f"Extracted {len(bodies)} trade method bodies")

In [None]:
# Let's look at one

print(random.choice(bodies))

In [None]:
# Visualize the number of lines of code in each method body

%matplotlib inline
fig, ax = plt.subplots(1, 1)
lengths = [len(body.split('\n')) for body in bodies]
ax.set_xlabel('Lines of code')
ax.set_ylabel('Count of training samples');
# Extend the bin edges past max(lengths) so the longest bodies are not dropped from the plot
_ = ax.hist(lengths, rwidth=0.7, color="green", bins=range(0, max(lengths) + 2))

In [None]:
# Add the prompt to the start of every training example

prompt = """
# tickers is a list of stock tickers
import tickers

# prices is a dict; the key is a ticker and the value is a list of historic prices, today first
import prices

# Trade represents a decision to buy or sell a quantity of a ticker
import Trade

import random
import numpy as np

def trade():
"""

data = [prompt + body for body in bodies]
print(random.choice(data))

In [None]:
# Distribution of tokens in our dataset

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenized_data = [tokenizer.encode(each) for each in data]
token_counts = [len(tokens) for tokens in tokenized_data]

%matplotlib inline
fig, ax = plt.subplots(1, 1)
ax.set_xlabel('Number of tokens')
ax.set_ylabel('Count of training samples');
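# Extend the bin edges past max(token_counts) so the longest examples are not dropped from the plot
_ = ax.hist(token_counts, rwidth=0.7, color="purple", bins=range(0, max(token_counts) + 20, 20))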

In [None]:
CUTOFF = 320
truncated = len([tokens for tokens in tokenized_data if len(tokens) > CUTOFF])
percentage = truncated/len(tokenized_data)*100
print(f"With cutoff at {CUTOFF}, we truncate {truncated} datapoints which is {percentage:.1f}% of the dataset")
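
When choosing a cutoff, it can help to compare several candidates side by side. The sketch below does this on made-up token counts; the helper `truncated_share`, the counts and the candidate cutoffs are all assumptions for illustration, not values from this dataset.

```python
def truncated_share(token_counts, cutoff):
    """Percentage of examples that are longer than `cutoff` tokens."""
    too_long = sum(1 for count in token_counts if count > cutoff)
    return 100 * too_long / len(token_counts)

# Made-up token counts, for illustration only
example_counts = [120, 180, 210, 250, 290, 310, 330, 360, 400, 520]

for cutoff in (256, 320, 384, 512):
    share = truncated_share(example_counts, cutoff)
    print(f"cutoff={cutoff}: {share:.1f}% truncated")
```

In the notebook itself, the same function could be called with the real `token_counts` list computed from the tokenizer.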

In [None]:
random.seed(42)
random.shuffle(data)

In [None]:
# No train/test split here - with more training data, we would make one!

dataset = Dataset.from_dict({'text':data})
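
If there were enough data to hold some out, a split might look roughly like the plain-Python sketch below. The helper `split_examples`, the 10% test fraction and the placeholder examples are assumptions for illustration; the `datasets` library also offers `Dataset.train_test_split`, which would likely be the more natural choice inside this notebook.

```python
import random

def split_examples(examples, test_fraction=0.1, seed=42):
    """Shuffle a copy of `examples` and split off a held-out test portion."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one example
    test_size = max(1, int(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]

# Placeholder examples, for illustration only
examples = [f"example {i}" for i in range(20)]
train, test = split_examples(examples)
print(len(train), len(test))
```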

In [None]:
login(token=os.environ['HF_TOKEN'])

In [None]:
dataset.push_to_hub(DATASET_NAME, private=True)

## Now head over to Google Colab for fine-tuning in the cloud

Follow this link for the Colab: https://colab.research.google.com/drive/19E9hoAzWKvn9c9SHqM4Xan_Ph4wNewHS?usp=sharing
