# GR5398 26Spring: FinGPT Large Language Model Track
## Assignment 1

### 1. Data Preparation

In this part, you build your own dataset for later training. The example below shows how we generate a dataset from the component stocks of the Dow Jones 30.

We used **Finnhub** to fetch the raw data and **GPT-4** to generate benchmark responses. You may swap in other data sources or models to see whether the results improve.

For details on this part, please refer to [`prepare_data.ipynb`](https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_Forecaster/prepare_data.ipynb).

In [None]:
!pip install finnhub-python
import os
import re
import csv
import math
import time
import json
import random
import finnhub
import datasets
import pandas as pd
import yfinance as yf
from datetime import datetime
from collections import defaultdict
from datasets import Dataset
from openai import OpenAI

+ Raw Financial Data Acquisition (News and Returns)

In [None]:
# Load Dow Jones 30 dataset provided by FinGPT (May 2023 - May 2024)
splits = {
    "train": "data/train-00000-of-00001-7c4c80aa07272d4c.parquet",
    "test": "data/test-00000-of-00001-28531804b005ddc6.parquet"
}

train_df = pd.read_parquet(
    "hf://datasets/FinGPT/fingpt-forecaster-dow30-202305-202405/"
    + splits["train"]
)

test_df = pd.read_parquet(
    "hf://datasets/FinGPT/fingpt-forecaster-dow30-202305-202405/"
    + splits["test"]
)

print("Train size:", len(train_df))
print("Test size:", len(test_df))
train_df.head()


+ Transform to Llama Training Format
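
The helper functions in Section 4 look for prompts that begin with `[INST]<<SYS>>` and strip everything up to `[/INST]`, which is the Llama-2 instruction layout. The following is a minimal sketch of that wrapping, not the code from `prepare_data.ipynb`: the function name `to_llama_format`, the output field names, and the exact whitespace are assumptions made for illustration.

```python
# Llama-2 style instruction markers (the whitespace here is an assumption).
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def to_llama_format(system_prompt: str, user_prompt: str, answer: str) -> dict:
    """Wrap one sample in instruction markers (hypothetical helper)."""
    prompt = B_INST + B_SYS + system_prompt + E_SYS + user_prompt + E_INST
    return {"prompt": prompt, "answer": answer}


# Placeholder texts; real samples carry news, financials, and a GPT-4 response.
sample = to_llama_format(
    "You are a seasoned stock market analyst.",
    "[Company Introduction]: ...",
    "[Positive Developments]: ...",
)
print(sample["prompt"])
```

Keeping the system text, the user text, and the answer as separate arguments makes it easy to change the markers later if you switch to a base model with a different chat template.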

+ Test-time Information Fetching

### 2. Fine-tune LLM

This is the core part of fine-tuning. It relies on **DeepSpeed** to manage VRAM while training on GPU(s). DeepSpeed has to be launched in a brand-new subprocess, so the training code itself is not included in this notebook. We strongly recommend running `train.sh` from your own terminal. If you prefer to stay in the notebook, the cell below shows one way to launch the same kind of command through `subprocess`.

In [None]:
import subprocess
import textwrap

### Adjust based on your setup
cmd = textwrap.dedent("""
export NCCL_IGNORE_DISABLED_P2P=1
export TRANSFORMERS_NO_ADVISORY_WARNINGS=1
export TOKENIZERS_PARALLELISM=0

ds \
    --include localhost:0 \
    train_lora.py \
    --run_name nasdaq-100-20231231-20241231 \
    --base_model llama2 \
    --dataset fingpt-forecaster-nasdaq-100-20231231-20241231-1-4-06 \
    --max_length 4096 \
    --batch_size 1 \
    --gradient_accumulation_steps 16 \
    --learning_rate 5e-5 \
    --num_epochs 1 \
    --log_interval 10 \
    --warmup_ratio 0.03 \
    --scheduler constant \
    --evaluation_strategy steps \
    --ds_config config.json
""")

subprocess.run(cmd, shell=True, executable="/bin/bash")
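
The flags above interact: with `--batch_size 1`, `--gradient_accumulation_steps 16`, and a single GPU (`--include localhost:0`), each optimizer step sees 16 samples. The sketch below works through that arithmetic. It assumes `--batch_size` is a per-device batch size, which this notebook does not confirm, and the sample count of 1000 is made up for the example.

```python
import math


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Number of samples contributing to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


def optimizer_steps_per_epoch(num_samples: int, eff_batch: int) -> int:
    """Rough number of optimizer steps per epoch (rounded up)."""
    return math.ceil(num_samples / eff_batch)


eff = effective_batch_size(per_device_batch=1, grad_accum_steps=16, num_gpus=1)
steps = optimizer_steps_per_epoch(num_samples=1000, eff_batch=eff)  # 1000 is a made-up count
print(eff, steps)
```

If you add GPUs or change the accumulation steps, recomputing this number helps you decide whether the learning rate still makes sense.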

### 3. Try your own fine-tuned LLMs

In this part, you can try out your fine-tuned models by giving them some inputs and inspecting their responses.

If you made it, congratulations!

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import re

In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cache_dir = "./pretrained-models" ### Adjust based on your setup

In [None]:
from huggingface_hub import login

# Ensure your Hugging Face token is set as a Colab secret named 'HF_TOKEN'
# Or, uncomment the line below and replace 'hf_token_here' with your actual token string
# login(token='hf_token_here')
login() # This will use the HF_TOKEN from Colab secrets if available

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", ### Change to your base model, e.g., 'meta-llama/Llama-2-7b-hf' or a local path
    trust_remote_code=True,
    device_map="auto",
    cache_dir=cache_dir,
    torch_dtype=torch.float16,   # optional if you have enough VRAM
)

tokenizer = AutoTokenizer.from_pretrained(
    'meta-llama/Llama-3.1-8B', ### Change to your base model
    cache_dir=cache_dir,
)

model = PeftModel.from_pretrained(
    base_model,
    'your_finetuned_model_path_here', ### Change to your fine-tuned model path or Hugging Face identifier
    cache_dir=cache_dir,
    # offload_folder="./offload2/",
    torch_dtype=torch.float16,
    # offload_buffers=True
)
model = model.eval()

In [None]:
prompt = """
    Your prompt here
"""

In [None]:
inputs = tokenizer(
    prompt,
    return_tensors='pt',
    max_length=4096,
    padding=False,
    truncation=True
)
inputs = {key: value.to(model.device) for key, value in inputs.items()}

res = model.generate(
    **inputs, max_length=4096, do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True
)
output = tokenizer.decode(res[0], skip_special_tokens=True)
answer = re.sub(r'.*\[/INST\]\s*', '', output, flags=re.DOTALL) # don't forget to import re
print(answer)
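
The `re.sub` call above keeps only the text after the last `[/INST]` marker. The sketch below shows that behaviour on two toy strings; `extract_answer` is a hypothetical wrapper name and the strings are invented. Note that when the marker is absent, for example with a prompt that does not use the Llama-2 layout, the output is returned unchanged and still contains the prompt.

```python
import re


def extract_answer(output: str) -> str:
    """Drop everything up to and including the last [/INST] marker."""
    return re.sub(r'.*\[/INST\]\s*', '', output, flags=re.DOTALL)


with_marker = "[INST]<<SYS>>\nsystem text\n<</SYS>>\n\nuser text[/INST] [Prediction]: Up by 1-2%"
without_marker = "user text\n[Prediction]: Up by 1-2%"

print(extract_answer(with_marker))     # only the generated part remains
print(extract_answer(without_marker))  # no marker, so nothing is stripped
```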

### 4. Comparison

Now that you have fine-tuned LLMs based on Llama3 and DeepSeek (plus Qwen or any other base model you trained), compare them on selected metrics to see which fine-tuned model performs best.

In [None]:
import os
import torch
import re
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from datasets import load_dataset, load_from_disk
from datasets import Dataset
from sklearn.metrics import accuracy_score
from tqdm import tqdm
from utils import *  # expected to provide calc_metrics
import time
import json, pickle

# huggingface_hub reads the access token from the HF_TOKEN environment variable
os.environ["HF_TOKEN"] = "your huggingface token"

In [None]:
# Llama3
llama3_base_model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-3.1-8B',
    trust_remote_code=True,
    device_map="auto",
    cache_dir=cache_dir,
    torch_dtype=torch.float16,
)

llama3_model = PeftModel.from_pretrained(
    llama3_base_model,
    'your_finetuned_model', ### Change to your fine-tuned model
    cache_dir=cache_dir,
    torch_dtype=torch.float16,
)
llama3_model = llama3_model.eval()

llama3_tokenizer = AutoTokenizer.from_pretrained(
    'meta-llama/Llama-3.1-8B',
    cache_dir=cache_dir,
)
llama3_tokenizer.padding_side = "right"
llama3_tokenizer.pad_token_id = llama3_tokenizer.eos_token_id

In [None]:
# DeepSeek
deepseek_base_model = AutoModelForCausalLM.from_pretrained(
    'deepseek-ai/DeepSeek-R1-Distill-Llama-8B',
    trust_remote_code=True,
    device_map="auto",
    cache_dir=cache_dir,
    torch_dtype=torch.float16,
)

deepseek_model = PeftModel.from_pretrained(
    deepseek_base_model,
    'your_finetuned_model', ### Change to your fine-tuned model
    cache_dir=cache_dir,
    torch_dtype=torch.float16,
)
deepseek_model = deepseek_model.eval()

deepseek_tokenizer = AutoTokenizer.from_pretrained(
    'deepseek-ai/DeepSeek-R1-Distill-Llama-8B',
    cache_dir=cache_dir,
)
deepseek_tokenizer.padding_side = "right"
deepseek_tokenizer.pad_token_id = deepseek_tokenizer.eos_token_id

In [None]:
# load_dataset returns a DatasetDict keyed by split name, so select the split directly
test_dataset = load_dataset("your_dataset_name")["test"]

In [None]:
def filter_by_ticker(test_dataset, ticker_code):

    filtered_data = []

    for row in test_dataset:
        prompt_content = row['prompt']

        ticker_symbol = re.search(r"ticker\s([A-Z]+)", prompt_content)

        if ticker_symbol and ticker_symbol.group(1) == ticker_code:
            filtered_data.append(row)

    filtered_dataset = Dataset.from_dict({key: [row[key] for row in filtered_data] for key in test_dataset.column_names})

    return filtered_dataset

def get_unique_ticker_symbols(test_dataset):

    ticker_symbols = set()

    for i in range(len(test_dataset)):
        prompt_content = test_dataset[i]['prompt']

        ticker_symbol = re.search(r"ticker\s([A-Z]+)", prompt_content)

        if ticker_symbol:
            ticker_symbols.add(ticker_symbol.group(1))

    return list(ticker_symbols)

def insert_guidance_after_intro(prompt):

    intro_marker = (
        "[INST]<<SYS>>\n"
        "You are a seasoned stock market analyst. Your task is to list the positive developments and "
        "potential concerns for companies based on relevant news and basic financials from the past weeks, "
        "then provide an analysis and prediction for the companies' stock price movement for the upcoming week."
    )
    guidance_start_marker = "Based on all the information before"
    guidance_end_marker = "Following these instructions, please come up with 2-4 most important positive factors"

    intro_pos = prompt.find(intro_marker)
    guidance_start_pos = prompt.find(guidance_start_marker)
    guidance_end_pos = prompt.find(guidance_end_marker)

    if intro_pos == -1 or guidance_start_pos == -1 or guidance_end_pos == -1:
        return prompt

    guidance_section = prompt[guidance_start_pos:guidance_end_pos].strip()

    new_prompt = (
        f"{prompt[:intro_pos + len(intro_marker)]}\n\n"
        f"{guidance_section}\n\n"
        f"{prompt[intro_pos + len(intro_marker):guidance_start_pos]}"
        f"{prompt[guidance_end_pos:]}"
    )

    return new_prompt


def apply_to_all_prompts_in_dataset(test_dataset):

    updated_dataset = test_dataset.map(lambda x: {"prompt": insert_guidance_after_intro(x["prompt"])})

    return updated_dataset

test_dataset = apply_to_all_prompts_in_dataset(test_dataset)

unique_symbols = set(test_dataset['symbol'])

def test_demo(model, tokenizer, prompt):

    inputs = tokenizer(
        prompt, return_tensors='pt',
        padding=False, max_length=8000,
        truncation=True  # max_length is only enforced when truncation is enabled
    )
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    start_time = time.time()
    res = model.generate(
        **inputs, max_length=4096, do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=True
    )
    end_time = time.time()
    output = tokenizer.decode(res[0], skip_special_tokens=True)
    return output, end_time - start_time

def test_acc(test_dataset, modelname):
    answers_base, answers_fine_tuned, gts, times_base, times_fine_tuned = [], [], [], [], []
    if modelname == "llama3":
        base_model = llama3_base_model
        model = llama3_model
        tokenizer = llama3_tokenizer
    elif modelname == "deepseek":
        base_model = deepseek_base_model
        model = deepseek_model
        tokenizer = deepseek_tokenizer
    ### Add other models here
    elif modelname == "your_model_name":  # Add other models as needed
        base_model = your_base_model  # Define your base model
        model = your_finetuned_model  # Define your fine-tuned model
        tokenizer = your_tokenizer     # Define your tokenizer
    else:
        # Fail early instead of hitting a NameError inside the loop
        raise ValueError(f"Unknown modelname: {modelname!r}")

    for i in tqdm(range(len(test_dataset)), desc="Processing test samples"):
        try:
            prompt = test_dataset[i]['prompt']
            gt = test_dataset[i]['answer']

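            # PeftModel.from_pretrained injects the LoRA layers into the base model in place,
            # so disable the adapter to obtain genuine base-model output.
            with model.disable_adapter():
                output_base, time_base = test_demo(model, tokenizer, prompt)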
            answer_base = re.sub(r'.*\[/INST\]\s*', '', output_base, flags=re.DOTALL)

            output_fine_tuned, time_fine_tuned = test_demo(model, tokenizer, prompt)
            answer_fine_tuned = re.sub(r'.*\[/INST\]\s*', '', output_fine_tuned, flags=re.DOTALL)

            answers_base.append(answer_base)
            answers_fine_tuned.append(answer_fine_tuned)
            gts.append(gt)
            times_base.append(time_base)
            times_fine_tuned.append(time_fine_tuned)

        except Exception as e:
            print(f"Error processing sample {i}: {e}")
    return answers_base, answers_fine_tuned, gts, times_base, times_fine_tuned
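
The ticker helpers above rely on the pattern `ticker\s([A-Z]+)` appearing in each prompt. The sketch below applies the same pattern to plain dictionaries so you can check it without loading a dataset. The function name `group_by_ticker` and the example prompts are invented for illustration; real prompts are much longer.

```python
import re
from collections import defaultdict

# Same pattern as in filter_by_ticker and get_unique_ticker_symbols.
TICKER_PATTERN = re.compile(r"ticker\s([A-Z]+)")


def group_by_ticker(rows):
    """Group rows by the first ticker symbol found in each prompt (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        match = TICKER_PATTERN.search(row["prompt"])
        if match:  # rows without a recognizable ticker are skipped
            groups[match.group(1)].append(row)
    return dict(groups)


rows = [
    {"prompt": "... trading under the ticker AAPL ...", "answer": "a"},
    {"prompt": "... trading under the ticker MSFT ...", "answer": "b"},
    {"prompt": "... trading under the ticker AAPL ...", "answer": "c"},
    {"prompt": "no symbol mentioned", "answer": "d"},
]
groups = group_by_ticker(rows)
print({ticker: len(items) for ticker, items in groups.items()})
```

Grouping in a single pass avoids rescanning the whole dataset once per ticker, which is what repeated calls to `filter_by_ticker` would do.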

In [None]:
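### Llama3 Result Evaluating

os.makedirs("./comparison_results", exist_ok=True)  # open(..., "wb") fails if the folder is missing
llama3_answers_base, llama3_answers_fine_tuned, llama3_gts, llama3_base_times, llama3_fine_tuned_times = test_acc(test_dataset, "llama3")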
llama3_base_metrics = calc_metrics(llama3_answers_base, llama3_gts)
llama3_fine_tuned_metrics = calc_metrics(llama3_answers_fine_tuned, llama3_gts)

with open("./comparison_results/llama3_base_metrics.pkl", "wb") as f:
    pickle.dump(llama3_base_metrics, f)

with open("./comparison_results/llama3_fine_tuned_metrics.pkl", "wb") as f:
    pickle.dump(llama3_fine_tuned_metrics, f)

with open("./comparison_results/llama3_base_times.pkl", "wb") as f:
    pickle.dump(llama3_base_times, f)

with open("./comparison_results/llama3_fine_tuned_times.pkl", "wb") as f:
    pickle.dump(llama3_fine_tuned_times, f)
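
`calc_metrics` is imported from `utils` and its implementation is not shown in this notebook. To illustrate the kind of metric involved, the sketch below computes a simple up/down direction accuracy. It is not the course's `calc_metrics`: the function names, the assumption that answers contain a phrase like `Prediction: Up by 1-2%`, and the example strings are all hypothetical.

```python
import re


def parse_direction(answer: str):
    """Return 1 for 'up', -1 for 'down', or None if no prediction is found."""
    match = re.search(r"prediction.*?\b(up|down)\b", answer, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    return 1 if match.group(1).lower() == "up" else -1


def direction_accuracy(answers, references):
    """Share of samples whose predicted direction matches the reference direction."""
    pairs = [(parse_direction(a), parse_direction(r)) for a, r in zip(answers, references)]
    valid = [(a, r) for a, r in pairs if a is not None and r is not None]
    if not valid:
        return {"valid_count": 0, "accuracy": None}
    correct = sum(a == r for a, r in valid)
    return {"valid_count": len(valid), "accuracy": correct / len(valid)}


answers = [
    "[Prediction & Analysis]:\nPrediction: Up by 1-2%",
    "Prediction: Down by 0-1%",
    "no forecast given",  # unparseable, so it is excluded from the score
]
references = [
    "Prediction: Up by 2-3%",
    "Prediction: Up by 0-1%",
    "Prediction: Down by 1-2%",
]
result = direction_accuracy(answers, references)
print(result)
```

Reporting `valid_count` next to the accuracy matters, because a model that often fails to produce a parseable prediction can otherwise look better than it is.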

In [None]:
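### DeepSeek Result Evaluating

os.makedirs("./comparison_results", exist_ok=True)  # open(..., "wb") fails if the folder is missing
deepseek_answers_base, deepseek_answers_fine_tuned, deepseek_gts, deepseek_base_times, deepseek_fine_tuned_times = test_acc(test_dataset, "deepseek")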
deepseek_base_metrics = calc_metrics(deepseek_answers_base, deepseek_gts)
deepseek_fine_tuned_metrics = calc_metrics(deepseek_answers_fine_tuned, deepseek_gts)

with open("./comparison_results/deepseek_base_metrics.pkl", "wb") as f:
    pickle.dump(deepseek_base_metrics, f)

with open("./comparison_results/deepseek_fine_tuned_metrics.pkl", "wb") as f:
    pickle.dump(deepseek_fine_tuned_metrics, f)

with open("./comparison_results/deepseek_base_times.pkl", "wb") as f:
    pickle.dump(deepseek_base_times, f)

with open("./comparison_results/deepseek_fine_tuned_times.pkl", "wb") as f:
    pickle.dump(deepseek_fine_tuned_times, f)

In [None]:
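### Comparing Llama3 and DeepSeek Results

os.makedirs("./comparison_results", exist_ok=True)  # open(..., "wb") fails if the folder is missing

# The second argument is used as the reference, so this measures how closely the two models agree
comparison_metrics = calc_metrics(llama3_answers_fine_tuned, deepseek_answers_fine_tuned) ### Change based on your models

with open("./comparison_results/comparison_metrics.pkl", "wb") as f:
    pickle.dump(comparison_metrics, f)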
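
Once the pickle files are written, the generation times can be summarized to compare inference speed. The sketch below shows one way to do that with the standard library. `summarize_times` is a hypothetical helper, and the demo writes made-up timings to a temporary file so that it runs on its own; point it at your own files under `./comparison_results/` instead.

```python
import pickle
import statistics
import tempfile
from pathlib import Path


def summarize_times(path):
    """Load a pickled list of per-sample generation times (seconds) and summarize it."""
    with open(path, "rb") as f:
        times = pickle.load(f)
    return {
        "n": len(times),
        "mean_s": statistics.mean(times),
        "median_s": statistics.median(times),
    }


# Demo with made-up timings written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_times.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump([1.0, 2.0, 6.0], f)
    summary = summarize_times(demo_path)

print(summary)
```

The median is reported alongside the mean because a few very long generations can dominate the average.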