# Backtest the strategies

Use an LLM to predict a buy/sell/hold recommendation for each company on a given date. Steps needed:

1. Load the LLM: use the DeepSeek R1 Qwen model at 7B parameters first, then try the quantised models
2. Step through each date and each financial statement to get a result
3. Log the results to a file and save it to S3 (the log is needed to resume after a kernel crash)
4. Build or adopt a backtesting framework to apply the results
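Step 3 can be sketched as an append-only JSON Lines log that is read back on restart to find which (date, security) pairs are already done. This is a minimal local sketch, not the notebook's actual logger: the function names, the record fields and the file name are all assumptions, and the upload to S3 is left out.

```python
import json
from pathlib import Path


def load_done_keys(log_path):
    """Return the set of (date, security) pairs already recorded in the log."""
    done = set()
    path = Path(log_path)
    if not path.exists():
        return done
    with path.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                # A crash can leave a half-written line; skip it so the pair is re-run.
                continue
            done.add((rec["date"], rec["security"]))
    return done


def _ends_mid_line(path):
    """True if the file is non-empty and its last byte is not a newline."""
    if not path.exists() or path.stat().st_size == 0:
        return False
    with path.open("rb") as f:
        f.seek(-1, 2)
        return f.read(1) != b"\n"


def append_result(log_path, date, security, recommendation):
    """Append one result as a JSON line and flush it immediately."""
    path = Path(log_path)
    rec = {"date": date, "security": security, "recommendation": recommendation}
    # Start on a fresh line if a previous write was cut off mid-record.
    prefix = "\n" if _ends_mid_line(path) else ""
    with path.open("a") as f:
        f.write(prefix + json.dumps(rec) + "\n")
        f.flush()
```

On restart, the main loop would call `load_done_keys` once and skip any pair already in the returned set, so only unfinished work is sent to the model again.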


## Load libraries needed

In [12]:
import json
import boto3
from s3fs import S3FileSystem
import os

import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login
import torch

import pandas as pd

from IPython.display import Markdown, display

from helper import get_s3_folder
import s3_model
import company_data
from s3_model import S3ModelHelper

In [31]:
import importlib
importlib.reload(company_data)

<module 'company_data' from '/project/company_data.py'>

## Load the LLM

Models to test:
- Qwen (Qwen/Qwen2.5-7B-Instruct)
- Llama (meta-llama/Llama-3.2-7B-Instruct)
- DeepSeek (deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)

In [2]:
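# Log in to Hugging Face using the token stored in pass.txt (a line of the form KEY=token)

with open('pass.txt') as p:
    first_line = p.readline()

# Keep everything after the first '='; strip() also copes with a missing trailing newline
hf_login = first_line.split('=', 1)[-1].strip()
login(hf_login, add_to_git_credential=False)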

In [10]:
c = s3_model.S3ModelHelper(s3_sub_folder='tmp/fs')
c.clear_folder('deepseek')

In [11]:
# Flag to download from Huggingface again or use stored model
USE_HF = False

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
model_id_s3 = 'deepseek'


if USE_HF:
    # Load the weights once and hand the same objects to the pipeline;
    # passing model_id to the pipeline as well would load a second copy.
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto', torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )
else:
    model_helper = s3_model.S3ModelHelper(s3_sub_folder='tmp/fs')
    model = model_helper.load_model(model_id_s3)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    model_helper.clear_folder(model_id_s3)

Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

Device set to use cuda:0
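Once the model is loaded, its free-text output has to be reduced to a single label. DeepSeek-R1 distilled models typically write their reasoning inside `<think>` tags before the final answer, so the sketch below strips that block first. The function name and the assumption that the prompt asks for exactly one of BUY, SELL or HOLD are illustrative, not taken from the notebook.

```python
import re


def extract_recommendation(generated_text):
    """Return 'BUY', 'SELL' or 'HOLD', or None if the answer is missing or ambiguous."""
    # Remove complete reasoning blocks so labels mentioned there are ignored.
    answer = re.sub(r"<think>.*?</think>", "", generated_text, flags=re.DOTALL)
    if "</think>" in answer:
        # Opening tag was part of the prompt: keep only the text after the reasoning.
        answer = answer.split("</think>")[-1]
    elif "<think>" in answer:
        # Reasoning never closed (for example, generation hit the token limit).
        return None
    labels = re.findall(r"\b(BUY|SELL|HOLD)\b", answer.upper())
    # Accept the answer only when exactly one distinct label appears.
    return labels[0] if len(set(labels)) == 1 else None
```

Returning `None` for ambiguous or truncated outputs lets the caller log the failure and retry rather than record a guessed label.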


## Load Financial PIT dataset

In [15]:
# Load the point-in-time (PIT) dataset from S3 using the helper module
sec_helper = company_data.SecurityData('tmp/fs','data_quarterly_pit.json')
all_data = sec_helper.get_all_data()

In [32]:
sec_helper = company_data.SecurityData('tmp/fs','data_quarterly_pit.json', all_data)

In [33]:
sec_helper.get_security_statement('2020-01-31','AON UN Equity','is')

| Line item | t | t-1 | t-2 | t-3 | t-4 | t-5 |
|---|---|---|---|---|---|---|
| 01 Revenue (Adj) | 968700000.0 | 948600000.0 | 1211200000.0 | 1224800000.0 | 953400000.0 | 972800000.0 |
| 02 Sales and Services Revenues (Adj) | 968700000.0 | 948600000.0 | 1211200000.0 | 1224800000.0 | 953400000.0 | 972800000.0 |
| 05 Cost of Revenue (Adj) | 780900000.0 | 727500000.0 | 811600000.0 | 847300000.0 | 734000000.0 | 739000000.0 |
| 06 Cost of Goods & Services Sold (Adj) | 780900000.0 | 727500000.0 | 811600000.0 | 847300000.0 | 734000000.0 | 739000000.0 |
| 08 Gross Profit (Adj) | 187800000.0 | 221100000.0 | 399600000.0 | 377500000.0 | 219400000.0 | 233800000.0 |
| 10 Operating Expenses (Adj) | 161000000.0 | 148000000.0 | 158800000.0 | 143900000.0 | 141200000.0 | 133700000.0 |
| 11 Selling, General and Administrative Expense (Adj) | 161000000.0 | 148000000.0 | 158800000.0 | 143900000.0 | 141200000.0 | 133700000.0 |
| 14 Operating Income or Losses (Adj) | 26800000.0 | 73100000.0 | 240800000.0 | 233600000.0 | 78200000.0 | 100100000.0 |
| 15 Non-Operating (Income) Loss (Adj) | 16300000.0 | 9300000.0 | 10800000.0 | 10000000.0 | 13900000.0 | 10800000.0 |
| 16 Net Interest Expense (Adj) | 13900000.0 | 13100000.0 | 11400000.0 | 10000000.0 | 10700000.0 | 9600000.0 |
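Before a statement like the one above is passed to the model, it may help to render it as compact text with values in millions, which keeps the prompt shorter than raw floats. The sketch below assumes the statement is available as a nested mapping of period to line items (the shape `DataFrame.to_dict()` produces); the function name and the `max_items` parameter are hypothetical.

```python
def statement_to_prompt_text(statement, max_items=None):
    """Render {period: {line_item: value}} as pipe-separated rows, values in millions."""
    if not statement:
        return ""
    periods = list(statement)
    # Use the first period's line items to fix the row order.
    items = list(statement[periods[0]])
    lines = ["line item | " + " | ".join(periods)]
    for item in items[:max_items]:
        values = []
        for period in periods:
            value = statement[period].get(item)
            # Missing entries are marked explicitly rather than shown as zero.
            values.append("n/a" if value is None else f"{value / 1e6:,.1f}")
        lines.append(f"{item} | " + " | ".join(values))
    return "\n".join(lines)


example = {
    "t": {"01 Revenue (Adj)": 968700000.0, "08 Gross Profit (Adj)": 187800000.0},
    "t-1": {"01 Revenue (Adj)": 948600000.0, "08 Gross Profit (Adj)": 221100000.0},
}
print(statement_to_prompt_text(example))
```

If `get_security_statement` returns a pandas DataFrame, as the table output suggests, its `to_dict()` result could be passed in directly; that return type is an assumption here.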


In [23]:
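# Assumed reconstruction: the original call was left incomplete; the output below matches DataFrame.to_json()
sec_helper.get_security_statement('2020-01-31','AON UN Equity','is').to_json()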

'{"t":{"01 Revenue (Adj)":968700000.0,"02 Sales and Services Revenues (Adj)":968700000.0,"05 Cost of Revenue (Adj)":780900000.0,"06 Cost of Goods & Services Sold (Adj)":780900000.0,"08 Gross Profit (Adj)":187800000.0000000298,"10 Operating Expenses (Adj)":161000000.0,"11 Selling, General and Administrative Expense (Adj)":161000000.0,"14 Operating Income or Losses (Adj)":26800000.0,"15 Non-Operating (Income) Loss (Adj)":16300000.0,"16 Net Interest Expense (Adj)":13900000.0,"17 Interest Expense (Adj)":15500000.0,"18 Interest Income (Adj)":1600000.0,"19 Foreign Exch Losses (Gains) (Adj)":0.0,"20 Other Non-Operating (Income) Loss (Adj)":2400000.0,"21 Pretax Income (Loss), Adjusted (Adj)":10500000.0,"22 Abnormal Losses (Gains)":68600000.0,"23 Merger \\/ Acquisition Expense":3400000.0,"27 Other Abnormal Items":33400000.0,"28 Pretax Income (Loss), GAAP":10500000.0,"29 Income Tax Expense (Benefit)":400000.0,"32 Income (Loss) from Continuing Operations":10100000.0,"36 Net Income Including Minor