# Run Open Source Inference 

This notebook runs inference on the open-source models using the multi-GPU process implemented in the Huggingface_constructor file. It requires the llm-base environment to run correctly. Backtest results are stored in the Data folder, under a folder for each model.

We test with the following open-source models:

- Llama 3.2 3B Instruct
- Qwen 2.5 7B Instruct
- DeepSeek R1 Distill Qwen (7B for the earnings runs, 14B for the stock recommendation runs)
- Qwen 2.5 3B Instruct

We also test a fine-tuned version of the Qwen 2.5 3B model. To provide a comparison, the model is also run before fine-tuning, with the earnings prompt and with a new EPS estimate prompt.

In [16]:
import prompts
import importlib
import datetime
import torch
import json
from huggingface_hub import login
import constructors.huggingface_strategy as hs

import models.model_helper as mh

from constructors.huggingface_strategy import HuggingfaceRun
from transformers import BitsAndBytesConfig
from accelerate import Accelerator, notebook_launcher

In [12]:
importlib.reload(prompts)
importlib.reload(mh)
importlib.reload(hs)

<module 'constructors.huggingface_strategy' from '/project/constructors/huggingface_strategy.py'>

### Set up the environment
Log into Huggingface, then check the versions of PyTorch and Accelerate currently loaded (the GPU count check is commented out).

In [3]:
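# Log into Huggingface with login credentials
# pass.txt is expected to hold a line of the form KEY=<token>
with open('pass.txt') as p:
    first_line = p.readline()

# Keep everything after the first '='; strip() also copes with a missing trailing newline
hf_login = first_line.split('=', 1)[-1].strip()
login(hf_login, add_to_git_credential=False)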

In [4]:
print(f'Torch version: {torch.__version__}')
#print(f'Device Count: {torch.cuda.device_count()}')
import accelerate
accelerate.__version__

Torch version: 2.1.2.post300


'1.4.0'

### Run 1: Llama 3.2 Earning analysis - Base
The first model we test is one of the smallest, the Llama 3.2 3B model. Because it is a Huggingface model, we use the HuggingfaceRun object to create the strategy and produce the list of trades from the run. The run must be started through notebook_launcher to use multiple GPUs.

Results File: Results/Earnings/results - llama - base - 2025-03-30 896587.json
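
The body of `run_multi_gpu` is not shown in this notebook, so the following is only a sketch of one common way to divide prompts between the processes that `notebook_launcher` starts: each process takes a strided slice of the prompt list. The function name `shard_for_rank` and the strided scheme are assumptions for illustration (Accelerate also offers `split_between_processes` for this purpose); the figure of 887 prompts comes from the progress bar below.

```python
def shard_for_rank(items, rank, world_size):
    """Return the share of `items` handled by process `rank` (hypothetical helper)."""
    if not 0 <= rank < world_size:
        raise ValueError(f'rank must be in [0, {world_size}), got {rank}')
    # Strided slicing gives disjoint shards whose sizes differ by at most one
    return items[rank::world_size]


# Stand-in for the 887 prompts reported by the progress bar
prompts_demo = list(range(887))
shards = [shard_for_rank(prompts_demo, rank, 4) for rank in range(4)]
shard_sizes = [len(shard) for shard in shards]
print(shard_sizes)  # [222, 222, 222, 221]
```

With a split like this, every process loads its own copy of the model, which is consistent with the repeated "starting backtest..." lines in the output.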

In [6]:
# Create the run config
run_config = {
    'model_s3_loc': 'llama',
    'model_reload': False,
    'model_quant': None,
    'multi-gpu':True,
    'dataset': 'data_quarterly_pit_indu_blended',
    'data_location': 'data_quarterly_pit_indu_refresh_blended.json'

}
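
The same few keys recur in every `run_config` in this notebook, so one option is to build each config from shared defaults. This is a minimal sketch, not part of the project code: `BASE_RUN_CONFIG` and `make_run_config` are hypothetical names, and the defaults simply mirror the values used in the cells here.

```python
from copy import deepcopy

# Defaults mirror the run_config dictionaries used throughout this notebook
BASE_RUN_CONFIG = {
    'model_s3_loc': None,
    'model_reload': False,
    'model_quant': None,
    'multi-gpu': True,
}


def make_run_config(model_s3_loc, **overrides):
    """Return a fresh config: shared defaults plus per-run overrides (hypothetical helper)."""
    config = deepcopy(BASE_RUN_CONFIG)
    config['model_s3_loc'] = model_s3_loc
    config.update(overrides)
    return config


llama_config = make_run_config(
    'llama',
    dataset='data_quarterly_pit_indu_blended',
    data_location='data_quarterly_pit_indu_refresh_blended.json',
)

# Keys that are not valid identifiers, such as 'multi-gpu', can be passed by unpacking a dict
single_gpu_config = make_run_config('qwen3b', **{'multi-gpu': False})
```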

In [7]:
ir = hs.HuggingfaceRun(run_name='Llama 3B Earnings',
                       model_id='meta-llama/Llama-3.2-3B-Instruct',
                       dataset_id='data_quarterly_pit_indu_refresh_blended.json',
                       system_prompt=prompts.SYSTEM_PROMPTS['BASE_EARN'],
                       run_config=run_config)

# Create the prompts and save to the Data folder
prompt_set = ir.create_all_prompts(force_refresh=True, is_save_prompts=True)

Requesting all datasets...
Saving data...


In [9]:
# Run the multi-gpu model with notebook_launcher
notebook_launcher(ir.run_multi_gpu, num_processes=torch.cuda.device_count())

Launching training on 4 GPUs.


  0%|          | 0/887 [00:00<?, ?it/s]

Memory footprint: 6.4 GB
starting backtest...
starting backtest...
starting backtest...
starting backtest...


888it [26:08,  1.34s/it]                         

Finished backtest


888it [26:45,  1.81s/it]


### Run 2: Llama 3.2 Earning analysis - Chain of Thought

Run the Llama 3.2 model on the Chain of Thought prompts. Results are saved to the following file:

Results/Earnings/results - llama - cot - 2025-03-30 332729.json


In [5]:
run_config = {
    'model_s3_loc': 'llama',
    'model_reload': False,
    'model_quant': None,
    'multi-gpu':True,
    'data_location': 'data_quarterly_pit_indu_refresh_blended.json'
}

In [6]:
ir = HuggingfaceRun(run_name='Llama 3B Earnings COT',
                       model_id='meta-llama/Llama-3.2-3B-Instruct',
                       dataset_id='data_quarterly_pit_indu_refresh_blended.json',
                       system_prompt=prompts.SYSTEM_PROMPTS['COT_EARN'],run_config=run_config)

# Create the prompts and save to the Data folder
prompt_set = ir.create_all_prompts(force_refresh=True, is_save_prompts=True)

Requesting all datasets...
Saving data...


In [7]:
# Run the multi-gpu model with notebook_launcher
notebook_launcher(ir.run, num_processes=torch.cuda.device_count())

Launching training on 4 GPUs.


  0%|          | 0/887 [00:00<?, ?it/s]

Memory footprint: 6.4 GB
starting backtest...
starting backtest...
starting backtest...
starting backtest...


888it [1:03:35,  4.31s/it]                         

Finished run in 1:03:43.495022
called log
Saved bclarke16/tmp/fs/logs/Results_2025-04-21 16:40:45.191930.json
Run Completed!


888it [1:03:44,  4.31s/it]


### Run 3: DeepSeek R1 Qwen 7B Base

Results are saved to the following file:

Results/Earnings/results - Deepseek - BASE C -2025-04-01 20:43:33.936328.json

In [5]:
run_config = {
    'model_s3_loc': 'deepseek7B',
    'model_reload': False,
    'model_quant': None,
    'multi-gpu':True
}

In [6]:
ir = HuggingfaceRun(run_name='DeepSeek 7B Earnings Base',
                       model_id='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
                       dataset_id='data_quarterly_pit_indu_refresh_blended.json',
                       system_prompt=prompts.SYSTEM_PROMPTS['BASE_EARN'],run_config=run_config)

# Create the prompts and save to the Data folder
prompt_set = ir.create_all_prompts(force_refresh=True, is_save_prompts=True)

Requesting all datasets...
Saving data...


In [7]:
# Run the multi-gpu model with notebook_launcher
notebook_launcher(ir.run_multi_gpu, num_processes=torch.cuda.device_count())

Launching training on 8 GPUs.
deepseek7B
deepseek7B
deepseek7B
deepseek7B
deepseek7B
deepseek7B
deepseek7Bdeepseek7B



Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


tokenizer_config.json:   0%|          | 0.00/3.07k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

  0%|          | 0/887 [00:00<?, ?it/s]

Memory footprint: 5.4 GB
starting backtest...starting backtest...starting backtest...


starting backtest...starting backtest...
starting backtest...

starting backtest...
starting backtest...


888it [2:17:03,  9.49s/it]                           

Finished run in 2:17:37.334527
Called Save run
called log
Saved bclarke16/tmp/fs/logs/Results_2025-03-30 17:44:18.481428.json
Run Completed!


888it [2:17:37,  9.30s/it]


### Run 4: DeepSeek 7B COT

Results are saved to the following file:

Results/Earnings/results - Deepseek - COT C -2025-04-01 21:13:35.944412.json

In [5]:
run_config = {
    'model_s3_loc': 'deepseek7B',
    'model_reload': False,
    'model_quant': None,
    'multi-gpu':True
}

In [6]:
ir = HuggingfaceRun(run_name='DeepSeek 7B Earnings COT',
                       model_id='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
                       dataset_id='data_quarterly_pit_indu_refresh_blended.json',
                       system_prompt=prompts.SYSTEM_PROMPTS['COT_EARN'],run_config=run_config)

# Create the prompts and save to the Data folder
prompt_set = ir.create_all_prompts(force_refresh=True, is_save_prompts=True)

Requesting all datasets...
Saving data...


In [7]:
notebook_launcher(ir.run_multi_gpu, num_processes=torch.cuda.device_count())

Launching training on 8 GPUs.
deepseek7B
deepseek7B
deepseek7B
deepseek7B
deepseek7B
deepseek7B
deepseek7B
deepseek7B


Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


  0%|          | 0/887 [00:00<?, ?it/s]

Memory footprint: 5.4 GB
starting backtest...starting backtest...starting backtest...
starting backtest...

starting backtest...

starting backtest...
starting backtest...
starting backtest...


888it [2:26:10,  9.47s/it]                           

Finished run in 2:30:12.540523
Called Save run
called log
Saved bclarke16/tmp/fs/logs/Results_2025-03-30 20:11:39.828629.json
Run Completed!


888it [2:30:13, 10.15s/it]
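
The elapsed times printed by the progress bars in Runs 1 to 4 can be put on a common scale by converting them to seconds. The sketch below does only that arithmetic; `parse_elapsed` is a hypothetical helper, the times are copied from the final progress-bar line of each run, and the Llama runs used 4 GPUs while the DeepSeek runs used 8, so the figures are not a like-for-like benchmark.

```python
def parse_elapsed(text):
    """Convert a tqdm-style 'MM:SS' or 'H:MM:SS' string to seconds (hypothetical helper)."""
    seconds = 0
    for part in text.split(':'):
        seconds = seconds * 60 + int(part)
    return seconds


# Final elapsed time reported by each run's progress bar
elapsed = {
    'Llama 3B BASE': '26:45',
    'Llama 3B COT': '1:03:44',
    'DeepSeek 7B BASE': '2:17:37',
    'DeepSeek 7B COT': '2:30:13',
}

for name, text in elapsed.items():
    total = parse_elapsed(text)
    # 888 is the iteration count shown by the progress bars
    print(f'{name}: {total} s total, {total / 888:.2f} s per iteration')
```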


### Qwen 2.5 - 3B Instruct
For this model, we apply quantization to reduce its size. The model will later be fine-tuned, so its memory footprint must be small enough to permit training on a single GPU. We use the bitsandbytes library, through the BitsAndBytesConfig integration in Huggingface transformers, to convert the weights to 4-bit.
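
As a rough guide to why 4-bit weights matter here, the size of the weights alone is the parameter count times the bits per parameter. The sketch below is a back-of-the-envelope estimate under stated assumptions: `weight_footprint_gb` is a hypothetical helper, the parameter counts are round approximations, and real footprints are higher because some layers usually stay in higher precision and activations need memory too.

```python
def weight_footprint_gb(num_params, bits_per_param):
    """Approximate size of the model weights in GB (weights only, hypothetical helper)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumed round parameter counts, for illustration only
approx_llama_3b_params = 3.2e9
approx_qwen_3b_params = 3.0e9

# 16-bit Llama estimate; close to the 6.4 GB footprint logged in Run 1
llama_16bit = weight_footprint_gb(approx_llama_3b_params, 16)

# The same 3B-class model at 16-bit versus 4-bit weights
qwen_16bit = weight_footprint_gb(approx_qwen_3b_params, 16)
qwen_4bit = weight_footprint_gb(approx_qwen_3b_params, 4)

print(f'Llama 3B, 16-bit: {llama_16bit:.1f} GB')
print(f'Qwen 3B, 16-bit: {qwen_16bit:.1f} GB, 4-bit: {qwen_4bit:.1f} GB')
```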

In [4]:
run_config = {
    'model_s3_loc': 'qwen3b',
    'model_reload': False,
    'model_quant': None,
    'multi-gpu':True,
}

In [5]:
ir = HuggingfaceRun(run_name='Qwen 3B Earnings BASE',
                       model_id='Qwen/Qwen2.5-3B-Instruct',
                       dataset_id='data_quarterly_pit_indu_refresh_blended.json',
                       system_prompt=prompts.SYSTEM_PROMPTS['BASE_EARN'],run_config=run_config)

# Create the prompts and save to the Data folder
prompt_set = ir.create_all_prompts(force_refresh=True, is_save_prompts=True)

Requesting all datasets...
Saving data...


In [6]:
notebook_launcher(ir.run_multi_gpu, num_processes=torch.cuda.device_count())

Launching training on 4 GPUs.


Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


tokenizer_config.json:   0%|          | 0.00/7.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

  0%|          | 0/887 [00:00<?, ?it/s]

Memory footprint: 2.0 GB
starting backtest...
starting backtest...
starting backtest...
starting backtest...


888it [16:02,  1.09s/it]                         

Finished run in 0:16:02.309246
called log
Saved bclarke16/tmp/fs/logs/Results_2025-04-29 16:17:41.890187.json
Run Completed!


888it [16:02,  1.08s/it]


In [11]:
run_config = {
    'model_s3_loc': 'qwen3b',
    'model_reload': False,
    'model_quant': None,
    'multi-gpu':True
}

In [12]:
ir = HuggingfaceRun(run_name='Qwen 3B Earnings COT',
                       model_id='Qwen/Qwen2.5-3B-Instruct',
                       dataset_id='data_quarterly_pit_indu_refresh_blended.json',
                       system_prompt=prompts.SYSTEM_PROMPTS['COT_EARN'],run_config=run_config)

# Create the prompts and save to the Data folder
prompt_set = ir.create_all_prompts(force_refresh=True, is_save_prompts=True)

Requesting all datasets...
Saving data...


In [13]:
notebook_launcher(ir.run_multi_gpu, num_processes=torch.cuda.device_count())

Launching training on 4 GPUs.
qwen3b
qwen3b
qwen3b
qwen3b


Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
  0%|          | 0/887 [00:00<?, ?it/s]

Memory footprint: 2.0 GB
starting backtest...
starting backtest...starting backtest...

starting backtest...


888it [13:42,  1.09it/s]                         

Finished run in 0:13:42.324489
Called Save run
called log
Saved bclarke16/tmp/fs/logs/Results_2025-04-03 20:06:52.680533.json
Run Completed!


888it [13:42,  1.08it/s]


## Run 5 - Fine-tuned model - EPS only

In [6]:
run_config = {
    'model_hf_id': 'Qwen/Qwen2.5-3B-Instruct',
    'model_s3_loc': 'qwen3b',
    'model_reload': False,
    'model_quant': None,
    'system_prompt': prompts.SYSTEM_PROMPTS['BASE_FINE_TUNED'],
    'multi-gpu':False,
    'dataset': 'data_quarterly_pit_indu_blended_base',
    'data_location': 'data_quarterly_pit_indu_refresh_blended.json',
    'fine_tuned_dir': 'fine_tuned2'
}

In [4]:
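# NOTE: `mft` is not imported anywhere in this notebook; import the module that
# defines FineTunedInference under that alias before running this cell.
ftm = mft.FineTunedInference("EPS only FT Run", run_config)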

qwen3b


Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


In [5]:
prompt_set = ftm.create_all_prompts(force_refresh=True, is_save_prompts=True)
prompt_set_appended = ftm.reformat_prompts(prompt_set, "The next period EPS is ")

Requesting all datasets...
Saving data...
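
The implementation of `reformat_prompts` is not shown in this notebook. Judging from the commented-out code at the end, it appends the answer prefix to each prompt so the model completes the sentence with a number. The sketch below illustrates that idea only: `append_suffix` is a hypothetical stand-in, and it assumes each record is a dict whose `'prompt'` value is a plain string.

```python
from copy import deepcopy


def append_suffix(prompt_set, suffix):
    """Return a copy of `prompt_set` with `suffix` appended to every prompt (hypothetical)."""
    updated = deepcopy(prompt_set)  # leave the caller's prompts untouched
    for record in updated:
        record['prompt'] = record['prompt'] + suffix
    return updated


# Made-up record for illustration; real prompts come from create_all_prompts
demo_prompts = [
    {'date': '2020-03-31', 'security': 'XYZ', 'prompt': 'Quarterly figures go here. '},
]
demo_appended = append_suffix(demo_prompts, 'The next period EPS is ')
print(demo_appended[0]['prompt'])
```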


In [6]:
output_eps = ftm.run_finetuned_backtest(prompt_set_appended)

100%|██████████| 887/887 [1:13:27<00:00,  4.97s/it]


In [7]:
# save the output
with open("Results/Earnings/results - Qwen3B Finetuned - EPS only.json", 'w') as f:
    json.dump(output_eps, f)
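
Because the prompt ends with "The next period EPS is ", each response is expected to begin with a number. The following sketch shows one way to pull that number out of a response. It is an assumption-laden illustration: `parse_eps` is a hypothetical helper, and the record layout (`date`, `security`, `response`) is taken from the commented-out code at the end of this notebook, which may differ from what `run_finetuned_backtest` returns.

```python
import re

# Optional minus sign, optional dollar sign, then an integer or decimal number
_EPS_PATTERN = re.compile(r'(-?)\s*\$?\s*(\d+(?:\.\d+)?)')


def parse_eps(response):
    """Return the first number found in `response` as a float, or None (hypothetical)."""
    match = _EPS_PATTERN.search(response.replace(',', ''))
    if match is None:
        return None
    sign, digits = match.groups()
    return float(sign + digits)


# Made-up records for illustration; they are not results from this notebook
demo_output = [
    {'date': '2020-03-31', 'security': 'XYZ', 'response': '1.25 per share'},
    {'date': '2020-06-30', 'security': 'XYZ', 'response': '-$0.40'},
    {'date': '2020-09-30', 'security': 'XYZ', 'response': 'n/a'},
]
demo_estimates = [parse_eps(record['response']) for record in demo_output]
print(demo_estimates)
```

Responses that do not contain a number come back as `None`, so they can be counted or filtered before any comparison against the base model.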

### Qwen Base

In [7]:
ir = HuggingfaceRun(run_name='Qwen Earnings',
                       model_id='Qwen/Qwen2.5-3B-Instruct',
                       dataset_id='data_quarterly_pit_indu_refresh_blended.json',
                       system_prompt=prompts.SYSTEM_PROMPTS['BASE_FINE_TUNED'],run_config=run_config)

In [8]:
prompt_set = ir.create_all_prompts(force_refresh=True, is_save_prompts=True)

Requesting all datasets...
Saving data...


In [9]:
notebook_launcher(ir.run_multi_gpu, num_processes=torch.cuda.device_count())

Launching training on 4 GPUs.


Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


tokenizer_config.json:   0%|          | 0.00/7.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

  0%|          | 0/887 [00:00<?, ?it/s]

Memory footprint: 2.0 GB
starting backtest...
starting backtest...starting backtest...

starting backtest...


888it [1:15:25,  4.15s/it]                         

Finished backtest


888it [1:16:07,  5.14s/it]


In [14]:
output = ir.cached_results

## Stock Recommendation Runs

This section runs each of the Stock Price Recommendation runs.

### Run 1 - Llama 3B

Results from this run: Results/Stock Price/results - 2025-02-23 13:59:33.728956.json

In [13]:
run_config = {
    'model_s3_loc': 'llama',
    'model_reload': False,
    'model_quant': None,
    'multi-gpu':True,

}

In [17]:
ir = hs.HuggingfaceRun(run_name='Llama 3B Recommendation - BASE',
                       model_id='meta-llama/Llama-3.2-3B-Instruct',
                       dataset_id='data_quarterly_pit_indu.json',
                       system_prompt=prompts.SYSTEM_PROMPTS['BASE'],
                       run_config=run_config)

# Create the prompts and save to the Data folder
prompt_set = ir.create_all_prompts(force_refresh=True, is_save_prompts=True)

Requesting all datasets...
Saving data...


In [18]:
notebook_launcher(ir.run_multi_gpu, num_processes=torch.cuda.device_count())

Launching training on 4 GPUs.


  0%|          | 0/896 [00:00<?, ?it/s]

Memory footprint: 6.4 GB
starting backtest...starting backtest...

starting backtest...
starting backtest...


100%|██████████| 896/896 [43:12<00:00,  2.56s/it]  

Finished backtest


100%|██████████| 896/896 [44:27<00:00,  2.98s/it]


### Run 2 - Llama 3B CoT

Results from this run: Results/Stock Price/results - 2025-02-23 14:47:14.366854.json

In [None]:
ir = hs.HuggingfaceRun(run_name='Llama 3B Recommendation - COT',
                       model_id='meta-llama/Llama-3.2-3B-Instruct',
                       dataset_id='data_quarterly_pit_indu.json',
                       system_prompt=prompts.SYSTEM_PROMPTS['COT'],
                       run_config=run_config)

# Create the prompts and save to the Data folder
prompt_set = ir.create_all_prompts(force_refresh=True, is_save_prompts=True)

In [None]:
notebook_launcher(ir.run_multi_gpu, num_processes=torch.cuda.device_count())

### Run 3 - DeepSeek 14B

Results from this run: Results/Stock Price/results - 2025-02-23 15:39:43.667370.json

In [None]:
run_config = {
    'model_s3_loc': 'deepseek14Q',
    'model_reload': False,
    'model_quant': None,
    'multi-gpu':True,
}

In [None]:
ir = hs.HuggingfaceRun(run_name='DeepSeek 14B Recommendation - BASE',
                       model_id='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
                       dataset_id='data_quarterly_pit_indu.json',
                       system_prompt=prompts.SYSTEM_PROMPTS['BASE'],
                       run_config=run_config)

# Create the prompts and save to the Data folder
prompt_set = ir.create_all_prompts(force_refresh=True, is_save_prompts=True)

### Run 4 - DeepSeek 14B - COT

Results from this run: Results/Stock Price/results - 2025-02-23 20:25:09.495607.json

In [None]:
ir = hs.HuggingfaceRun(run_name='DeepSeek 14B Recommendation - COT',
                       model_id='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
                       dataset_id='data_quarterly_pit_indu.json',
                       system_prompt=prompts.SYSTEM_PROMPTS['COT'],
                       run_config=run_config)

# Create the prompts and save to the Data folder
prompt_set = ir.create_all_prompts(force_refresh=True, is_save_prompts=True)

### Unused Code

In [15]:
# Old Code
# run_name = f"{run_config['model_s3_loc']}_{run_config['dataset']}_finetuned"
# ir = model_inference.InferenceRun(run_name, run_config)

# # Create the prompts and save to the Data folder
# prompt_set = ir.create_all_prompts(force_refresh=True, is_save_prompts=True)
# for prompt in prompt_set:
#     prompt['prompt'] += 'The next period EPS is '
#     #prompt['prompt'] += "\nAnswer in JSON format with the next period EPS, the direction, the magnitude and a confidence."

# from peft import PeftModel
# from transformers import AutoTokenizer
# import json
# from tqdm import tqdm

# outputs = []
# def run_finetuned_backtest(prompts, fine_tuned_model, tokenizer):
#     count = 0
#     progress = tqdm(total=len(prompts), position=0, leave=True)
#     for prompt in prompts:
#         tokens = tokenizer.apply_chat_template(prompt['prompt'], tokenize=False, add_generation_prompt=True )
#         print(tokens)
#         model_inputs = tokenizer([tokens], return_tensors='pt').to("cuda")
#         generated_ids = fine_tuned_model.generate(**model_inputs, 
#                                        pad_token_id=tokenizer.eos_token_id, 
#                                        max_new_tokens=50,
#                                       temperature=0.001)
#         parsed_ids = [
#             output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
#         ]
#         resp = {
#             'date': prompt['date'],
#             'security': prompt['security'],
#             'response': tokenizer.batch_decode(parsed_ids, skip_special_tokens=True)[0]
#         }
#         outputs.append(resp)
#         progress.update()

# # Load the base model 
# model_helper = mh.ModelHelper('tmp/fs')
# base_model = ir.load_model_from_storage(run_config['model_s3_loc'])
# # clear the local folder once completed loading into memory
# model_helper.clear_folder(run_config['model_s3_loc'])

# # Update the base model with the fine-tuned modules
# fine_tuned_model = PeftModel.from_pretrained(base_model, 'fine_tuned2')
# # Create the tokenizer
# tokenizer = AutoTokenizer.from_pretrained(run_config['model_hf_id'])
# run_finetuned_backtest(prompt_set[:2], fine_tuned_model, tokenizer)

# # save the output
# with open(f"Results/Earnings/results - Qwen3B Finetuned - EPS only.json", 'w') as f:
#     json.dump(outputs, f)