# Fine tuning and domain adaptation for Financial Services 

In this example, we show how to adapt a popular open-source model (Llama 3) to the financial services domain using a method called domain adaptation, or continued pre-training. The idea is to take a pre-trained model and further refine it on a domain-specific text dataset. This is an example of unsupervised learning: it does not use labeled examples of inputs and outputs, but instead trains the model to learn the language patterns of a specialised domain such as financial services. The expectation is that after fine-tuning our text generation model on financial documents, the model can generate more insightful finance-related text, and can therefore be used to solve multiple domain-specific NLP tasks.
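
To make the training signal concrete: in continued pre-training the targets come from the text itself, because the model is asked to predict each next token from the tokens before it. The sketch below is only an illustration; it uses whitespace-separated words as stand-in tokens (a real run uses the model's tokenizer) and an invented example sentence.

```python
# Toy illustration of the next-token objective used in continued pre-training.
# The sentence and the whitespace "tokenizer" are illustrative assumptions.
text = "Net sales increased primarily due to higher unit sales"
tokens = text.split()

# Every prefix of the text becomes a context; the token that follows is the target.
training_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in training_pairs[:3]:
    print(f"{' '.join(context)!r} -> {target!r}")
```

No separate labels are needed: the same token sequence serves as both input and target.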

## Fine-tuning dataset

SEC filings are critical for regulation and disclosure in finance. Filings notify the investor community about companies’ business conditions and the future outlook of the companies. The text in SEC filings covers the entire gamut of a company’s operations and business conditions. For more information on [10-K filings](https://www.investopedia.com/terms/1/10-k.asp) see [How to Read a 10-K](https://www.investor.gov/introduction-investing/general-resources/news-alerts/alerts-bulletins/investor-bulletins/how-read).

We will use a subset of SEC filings from three companies (Amazon, Apple and Meta) for the years 2022-2023, prepared in domain adaptation dataset format. The data is downloaded from the publicly available [EDGAR](https://www.sec.gov/edgar/searchedgar/companysearch) system. Instructions for accessing the data are available [here](https://www.sec.gov/os/accessing-edgar-data).

License: [Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/legalcode).



In this demo notebook, we demonstrate how to fine-tune the Llama-3-8B model using QLoRA, Hugging Face PEFT, and bitsandbytes.



## Setup Development environment


This notebook is using the Hugging Face container for the `us-east-1` region. Make sure you are using the right image for your AWS region, otherwise edit [config.yaml](./config.yaml). Container Images are available [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md)

Install the required libraries, including the Hugging Face libraries, and restart the kernel.

In [None]:
%pip install -r requirements.txt

%pip install -q -U datasets==2.18.0
%pip install -q -U langchain==0.1.5
%pip install -q -U scikit-learn

We will use Amazon SageMaker, which allows us to fine-tune the Llama 3 model using Hugging Face libraries. The code itself uses Hugging Face libraries, so it can also be used to train locally.

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")


## Visualize and upload the dataset

Load the annual reports (10-K filings) from EDGAR and convert them into a Hugging Face `Dataset` for training.

In [None]:
from langchain_community.document_loaders.web_base import WebBaseLoader, default_header_template

annual_reports = [
    "https://www.sec.gov/Archives/edgar/data/1018724/000101872424000008/amzn-20231231.htm", # Amazon 2023
    "https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm", # Apple 2023
    "https://www.sec.gov/Archives/edgar/data/1326801/000132680124000012/meta-20231231.htm", # Meta 2023
]

# SEC website requires specific User Agent to be set with Company Name and Email Address
# https://www.sec.gov/os/webmaster-faq#code-support 
sec_header_template = default_header_template.copy()
sec_header_template["User-Agent"] = "Sample Company Name AdminContact@<sample company domain>.com"

loader = WebBaseLoader(annual_reports, header_template=sec_header_template)

data = loader.load()

In [None]:
from datasets import Dataset

def strip_spaces(doc):
    return {"text": doc.page_content.replace("  ", "")}

stripped_data = list(map(strip_spaces, data))

train_dataset = Dataset.from_list(stripped_data)

train_dataset

Use the Hugging Face Trainer class to fine-tune the model. Define the hyperparameters we want to use. We also create a DataCollator that will take care of padding our inputs and labels.

In [None]:
HF_TOKEN = "<HF_TOKEN>" # change to your HuggingFace Token

In [None]:
! huggingface-cli login --token {HF_TOKEN}

In [None]:
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)

tokenizer.pad_token = tokenizer.eos_token

Next, we tokenize the text and split it into fixed-length chunks so that the LLM can consume it. For additional details, refer to the blog post [Leveraging qLoRA for Fine-Tuning of Task-Fine-Tuned Models Without Catastrophic Forgetting: A Case Study with LLaMA2(-chat)](https://medium.com/towards-data-science/leveraging-qlora-for-fine-tuning-of-task-fine-tuned-models-without-catastrophic-forgetting-d9bcd594cff4).

In [None]:
from itertools import chain
from functools import partial

remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

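    # get the number of tokens that fit into whole chunks for this batch
    # (stays 0 if the batch is shorter than one chunk, so all tokens are carried over)
    batch_chunk_length = 0
    if batch_total_length >= chunk_length:
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length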

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result
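
To see what the remainder carry-over does, here is a self-contained sketch of the same concatenate-and-chunk idea on toy data. It is a simplified stand-in for the `chunk` function above (plain token lists instead of tokenizer output, and a local variable instead of the global `remainder`); the function name and the toy values are assumptions made for illustration.

```python
from itertools import chain

def chunk_with_remainder(batches, chunk_length):
    """Concatenate token lists batch by batch and cut fixed-size chunks,
    carrying leftover tokens over into the next batch."""
    remainder = []
    chunks = []
    for batch in batches:
        # Prepend what was left over from the previous batch.
        tokens = remainder + list(chain(*batch))
        # Largest multiple of chunk_length that fits into this batch.
        usable = (len(tokens) // chunk_length) * chunk_length
        chunks.extend(tokens[i:i + chunk_length] for i in range(0, usable, chunk_length))
        remainder = tokens[usable:]
    return chunks, remainder

toy_batches = [
    [[1, 2, 3], [4, 5]],      # first batch: 5 tokens in two documents
    [[6, 7, 8, 9, 10, 11]],   # second batch: 6 tokens in one document
]
toy_chunks, toy_leftover = chunk_with_remainder(toy_batches, chunk_length=4)
print(toy_chunks)    # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(toy_leftover)  # [9, 10, 11]
```

Token `5` does not fit into the first chunk, so it is carried into the second batch rather than dropped; only the final leftover is never emitted.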



In [None]:
# tokenize and chunk dataset
chunk_size = 2048

lm_train_dataset = train_dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(train_dataset.features)
).map(
    partial(chunk, chunk_length=chunk_size),
    batched=True,
)

print(f"Total number of train samples: {len(lm_train_dataset)}")


After processing the dataset, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) of the `datasets` library to upload it to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.

In [None]:
# save lm_train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/sec_data/train'
lm_train_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")


## Fine-Tune Llama 3 8B with QLoRA on Amazon SageMaker

We are going to use the method introduced in the paper "[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)" by Tim Dettmers et al. QLoRA is a technique that reduces the memory footprint of large language models during fine-tuning without significantly sacrificing performance. In short, QLoRA works as follows:

* Quantize the pretrained model to 4 bits and freeze it.
* Attach small, trainable adapter layers. (LoRA)
* Finetune only the adapter layers, while using the frozen quantized model for context.
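
The memory saving behind these steps can be checked with simple arithmetic. The numbers below are illustrative assumptions (a round 8 billion parameters, one 4096 x 4096 projection matrix, and a hypothetical LoRA rank of 8), not values read from `train.py`.

```python
# Back-of-the-envelope numbers behind the three steps above.
# All sizes are illustrative assumptions, not values read from train.py.
n_params = 8_000_000_000        # roughly the parameter count of an 8B model
d_in, d_out = 4096, 4096        # shape of one square projection matrix (assumed)
rank = 8                        # LoRA rank (hypothetical choice)

# Step 1: storing the frozen weights in 4 bits instead of 16 bits.
fp16_gb = n_params * 2 / 1e9    # 2 bytes per parameter
int4_gb = n_params * 0.5 / 1e9  # 0.5 bytes per parameter (ignores quantization constants)

# Steps 2 and 3: a LoRA adapter replaces the update to a d_out x d_in matrix
# with two small matrices of shapes (d_out x rank) and (rank x d_in).
full_matrix_params = d_in * d_out
adapter_params = rank * (d_in + d_out)

print(f"fp16 weights: {fp16_gb:.0f} GB, 4-bit weights: {int4_gb:.0f} GB")
print(f"trainable params per matrix: {adapter_params:,} vs {full_matrix_params:,} "
      f"({adapter_params / full_matrix_params:.2%})")
```

Under these assumptions the frozen weights shrink from about 16 GB to about 4 GB, and each adapted matrix trains well under 1% of the parameters it would need with full fine-tuning.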

We prepared a [train.py](./train.py) script, which implements QLoRA using PEFT to train our model. The script also merges the LoRA weights into the model weights after training, so you can use the result as a normal model without any additional code. If the model is too large to fit into memory, it is temporarily offloaded to disk.
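
The contents of `train.py` are not reproduced in this notebook. As a rough sketch of what a QLoRA setup with `transformers`, `bitsandbytes` and PEFT commonly looks like, the hypothetical helper below builds the two configuration objects involved. The function name, the rank, alpha and dropout defaults, and the choice of target modules are all assumptions, not the script's actual settings.

```python
def build_qlora_configs(rank=64, alpha=16, dropout=0.1):
    """Return (quantization config, LoRA config) for a QLoRA-style setup.

    Imports live inside the function so the sketch can be loaded without
    the GPU libraries installed. All defaults are hypothetical and are
    not taken from train.py.
    """
    import torch
    from transformers import BitsAndBytesConfig
    from peft import LoraConfig

    # Load the frozen base model weights in 4-bit NF4 precision.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    # Small trainable adapters attached to the attention projections.
    lora_config = LoraConfig(
        r=rank,
        lora_alpha=alpha,
        lora_dropout=dropout,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
    return bnb_config, lora_config
```

In a typical script, the quantization config is passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)` and the LoRA config to `peft.get_peft_model`; after training, calling `merge_and_unload()` on the PEFT model folds the adapters back into the base weights.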

To create a SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure for us.
SageMaker starts and manages all the required EC2 instances, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`. It then starts the training job by running the entry point script (`train.py`).


In [None]:
import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name 
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
source_dir = "/home/ec2-user/SageMaker/book/sagemaker/scripts"

# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'epochs': 10,                                      # number of training epochs
  'per_device_train_batch_size': 2,                 # batch size for training
  'lr': 2e-4,                                       # learning rate used during training
  'hf_token': HfFolder.get_token(),                 # huggingface token to access llama 3
  'merge_weights': True,                            # whether to merge LoRA into the model (needs more memory)
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',        # train script
    source_dir           = source_dir,        # directory which includes all the files needed for training
    instance_type        = 'ml.g5.12xlarge',  # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)
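
In SageMaker script mode, the `hyperparameters` dictionary reaches the entry point as command-line arguments (`--epochs 10`, `--lr 0.0002`, and so on). The sketch below shows one way a training script could read them with `argparse`. It is an assumption about how `train.py` might be structured, not a copy of it; the default values and the `str_to_bool` helper are hypothetical.

```python
import argparse

def str_to_bool(value):
    # Booleans arrive as strings such as "True" on the command line.
    return str(value).lower() in ("true", "1", "yes")

def parse_args(argv=None):
    """Parse hyperparameters passed to the training script as CLI arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, required=True)
    parser.add_argument("--dataset_path", type=str, default="/opt/ml/input/data/training")
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=1)
    parser.add_argument("--lr", type=float, default=2e-4)
    parser.add_argument("--hf_token", type=str, default=None)
    parser.add_argument("--merge_weights", type=str_to_bool, default=True)
    # Ignore any extra arguments the platform may add.
    args, _ = parser.parse_known_args(argv)
    return args

example_args = parse_args([
    "--model_id", "meta-llama/Meta-Llama-3-8B-Instruct",
    "--epochs", "10",
    "--lr", "0.0002",
    "--merge_weights", "True",
])
print(example_args.epochs, example_args.lr, example_args.merge_weights)
```

Parsing booleans through a helper avoids the common pitfall where `type=bool` treats any non-empty string, including `"False"`, as true.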

We can now start our training job, with the `.fit()` method passing our S3 path to the training script.

In [None]:
# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

## Deploy Fine-Tuned model

Note: The training job (`train.py`) must have been run with `merge_weights=True`, so that the model artifact contains the merged model weights.

In [None]:
import json
import sagemaker
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

In [None]:
sagemaker_session = sagemaker.Session()

In [None]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

bucket_name = sagemaker_session.default_bucket()
job_prefix = 'huggingface-qlora-'


In [None]:
def get_last_job_name(job_name_prefix):
    import boto3
    sagemaker_client = boto3.client('sagemaker')
    
    search_response = sagemaker_client.search(
        Resource='TrainingJob',
        SearchExpression={
            'Filters': [
                {
                    'Name': 'TrainingJobName',
                    'Operator': 'Contains',
                    'Value': job_name_prefix
                },
                {
                    'Name': 'TrainingJobStatus',
                    'Operator': 'Equals',
                    'Value': "Completed"
                }
            ]
        },
        SortBy='CreationTime',
        SortOrder='Descending',
        MaxResults=1)
    
    return search_response['Results'][0]['TrainingJob']['TrainingJobName']

In [None]:
job_name = get_last_job_name(job_prefix)

job_name

### Inference configurations

In [None]:
instance_count = 1
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 3600

In [None]:
image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.4"
)

image_uri

In [None]:
model = HuggingFaceModel(
    image_uri=image_uri,
    model_data=f"s3://{bucket_name}/{job_name}/output/model.tar.gz",
    env={
        'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
        'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
    },
    role=role
)

In [None]:
predictor = model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)

### Deploying a base Llama 3 model for comparison

Let's also deploy the base model so that we can compare responses and see the difference after fine-tuning.

In [None]:
base_model = HuggingFaceModel(
    image_uri=image_uri,
    env={
        'HF_MODEL_ID': "meta-llama/Meta-Llama-3-8B-Instruct", # model id within the Hugging Face Model Hub
        'HUGGING_FACE_HUB_TOKEN': HF_TOKEN,
    }, # configuration for loading model from Hub
    role=role, # iam role with permissions to create an Endpoint
)

In [None]:
base_predictor = base_model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)

(Optional) If the model is predeployed, you can connect to it using the code below

In [None]:
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

base_predictor = Predictor(
    endpoint_name="huggingface-pytorch-tgi-inference-2024-06-29-12-26-05-297", # change to endpoint name
    sagemaker_session=sess,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

## Predict

We can now query the model. We will form some sample queries using the [Llama 3 prompting format](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/).

In [None]:
base_prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>{{question}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
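
The template above packs a single user turn between Llama 3's special tokens. As a sketch, the hypothetical helper below builds the same kind of prompt and can optionally add a system message. It also inserts the blank line after each header that Meta's prompt-format documentation shows, which the one-line template leaves out. The helper name and the system text are invented for illustration.

```python
def build_llama3_prompt(question, system=None):
    """Assemble a single-turn Llama 3 Instruct prompt (hypothetical helper)."""
    parts = ["<|begin_of_text|>"]
    if system:
        parts.append(f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>")
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{question}<|eot_id|>")
    # End with an open assistant header so the model writes the answer next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

example_prompt = build_llama3_prompt(
    "What drives sales growth at Amazon?",
    system="You are a helpful financial analyst.",
)
print(example_prompt)
```

Building the prompt in a function avoids the doubled braces needed in the f-string template and keeps the special tokens in one place.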

In [None]:
prompt = base_prompt.format(question="What drives sales growth at Amazon?")

response = predictor.predict({
    "inputs": prompt,
    "parameters": {
        "max_new_tokens": 1000 - len(prompt),
        "temperature": 0.1,
        # "top_p": 0.9,
    }
})

print(response[0]['generated_text'])

## Comparing performance with base model

Now let's compare the results from the base model and the fine-tuned model for the same set of questions.

In [None]:
# Defining a utility function for querying our endpoint
def query_llama3(question, predictor):
    prompt = base_prompt.format(question=question)

    response = predictor.predict({
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 2048 - len(prompt),
            "temperature": 0.01,
            "top_k": 250,
            "top_p": 0.8,
        }
    })

    return response[0]['generated_text']
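
Note that `len(prompt)` counts characters, not tokens, so `2048 - len(prompt)` only approximates the remaining budget and would go negative for prompts longer than 2048 characters. One hedged alternative is to count tokens and clamp the result. The helper name and the `count_tokens` callback below are assumptions, and the 2048 limit simply mirrors the constant used above; in the notebook, the callback could wrap the `tokenizer` loaded earlier.

```python
def remaining_new_tokens(prompt, count_tokens, max_total_tokens=2048, minimum=1):
    """Return how many tokens may still be generated for this prompt.

    count_tokens: any callable mapping a string to its token count.
    """
    used = count_tokens(prompt)
    # Never return less than `minimum`, even for very long prompts.
    return max(minimum, max_total_tokens - used)

# Stand-in token counter for illustration only; a real run might use
# lambda text: len(tokenizer(text)["input_ids"]) instead.
def whitespace_count(text):
    return len(text.split())

budget = remaining_new_tokens("What drives sales growth at Amazon?", whitespace_count)
print(budget)  # 2042 with the whitespace stand-in (6 words)
```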

In [None]:
questions = [
    # "How did the COVID-19 pandemic impact Amazon’s business?",
    "What is Apple's strategy for growth in Asia?",
    "What are Meta's plans to invest more in AI and Generative AI?",
    # "What was Amazon's net sales for year 2023?",
    "What are the key priorities for Amazon in 2023?",
]
for q in questions:
    print("\n\n\n")
    print("-"*40 + "BASE MODEL" + "-"*40)
    print(q)
    print(query_llama3(q, base_predictor))
    print("-"*40 + "FINE-TUNED MODEL" + "-"*40)
    print(query_llama3(q, predictor))


## Delete Endpoints

In [None]:
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

base_predictor.delete_model()
base_predictor.delete_endpoint(delete_endpoint_config=True)