# Fine-tune LLaMA 2 on Amazon SageMaker

In this SageMaker example, we are going to learn how to fine-tune [LLaMA 2](https://huggingface.co/meta-llama/Llama-2-70b-hf) using [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314). [LLaMA 2](https://huggingface.co/meta-llama/Llama-2-70b-hf) is the next version of [LLaMA](https://arxiv.org/abs/2302.13971). Compared to the V1 model, it is trained on more data (2T tokens) and supports a context window of up to 4K tokens.

QLoRA is an efficient finetuning technique that quantizes a pretrained language model to 4 bits and attaches small “Low-Rank Adapters” which are fine-tuned. This enables fine-tuning of models with up to 65 billion parameters on a single GPU; despite its efficiency, QLoRA matches the performance of full-precision fine-tuning and achieves state-of-the-art results on language tasks.
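To make the memory savings of 4-bit quantization concrete, here is a rough back-of-the-envelope sketch. It counts only the raw weights and takes the nominal model sizes (7B, 13B, 70B) at face value, which is an assumption. Quantization constants, activations, adapter weights, and optimizer state are ignored, so the numbers are lower bounds rather than measured requirements.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the raw model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal parameter counts (assumption: model names taken at face value)
for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4)
    print(f"{name}: 16-bit ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB")
```

Going from 16 bits to 4 bits per weight cuts the weight storage by a factor of four, which is what brings larger models within reach of a single GPU.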

In our example, we are going to leverage Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft). 

In detail, you will learn how to:
1. Set up the development environment
2. Load and prepare the dataset
3. Fine-Tune LLaMA 13B with QLoRA on Amazon SageMaker
4. Deploy Fine-tuned LLM on Amazon SageMaker

### Quick intro: PEFT or Parameter Efficient Fine-tuning

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- (Q)LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
- IA3: [Infused Adapter by Inhibiting and Amplifying Inner Activations](https://arxiv.org/abs/2205.05638)



### Access LLaMA 2

Before we can start training, we have to make sure that we have accepted the license of [Llama 2](https://huggingface.co/meta-llama/Llama-2-70b-hf) to be able to use it. You can accept the license by clicking the "Agree and access repository" button on the model page at:
* [LLaMa 7B](https://huggingface.co/meta-llama/Llama-2-7b-hf)
* [LLaMa 13B](https://huggingface.co/meta-llama/Llama-2-13b-hf)
* [LLaMa 70B](https://huggingface.co/meta-llama/Llama-2-70b-hf)

## 1. Setup Development Environment

In [1]:
!pip install "transformers==4.31.0" "datasets[s3]==2.13.0" sagemaker --upgrade --quiet

In [2]:
%env AWS_PROFILE=dev-admin
%env AWS_REGION=us-east-1
%env HF_HOME=~/.cache/huggingface
%env TOKENIZERS_PARALLELISM=false

env: AWS_PROFILE=dev-admin
env: AWS_REGION=us-east-1
env: HF_HOME=~/.cache/huggingface
env: TOKENIZERS_PARALLELISM=false


To access any LLaMA 2 asset, we need to log in to our Hugging Face account. We can do this by running the following command:

In [None]:
!huggingface-cli login --token YOUR_TOKEN

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).



In [3]:
from scripts.aws_init import init_sagemaker

sagemaker_session_bucket = "sagemaker-ms-thesis-llm"
# role = "arn:aws:iam::171706357329:role/service-role/SageMaker-ComputeAdmin"
role = "arn:aws:iam::171706357329:role/service-role/AmazonSageMakerServiceCatalogProductsExecutionRole"

sess = init_sagemaker(sagemaker_session_bucket)

sagemaker bucket: sagemaker-ms-thesis-llm
sagemaker session region: us-east-1


## 2. Load and prepare the dataset

We will use [dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k), an open-source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the [InstructGPT paper](https://arxiv.org/abs/2203.02155), including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

```python
{
  "instruction": "What is world of warcraft",
  "context": "",
  "response": "World of warcraft is a massive online multi player role playing game. It was released in 2004 by bizarre entertainment"
}
```

To load the `databricks-dolly-15k` dataset, we use the `load_dataset()` method from the 🤗 Datasets library.

In [4]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# dataset size: 15011


dataset size: 15011
{'instruction': "Who was Manchester United's most successful manager?", 'context': "Manchester United have won a record 20 League titles, 12 FA Cups, six League Cups, and a record 21 FA Community Shields. They have won the European Cup/UEFA Champions League three times, and the UEFA Europa League, the UEFA Cup Winners' Cup, the UEFA Super Cup, the Intercontinental Cup and the FIFA Club World Cup once each. In 1968, under the management of Matt Busby, 10 years after eight of the club's players were killed in the Munich air disaster, they became the first English club to win the European Cup. Sir Alex Ferguson is the club's longest-serving and most successful manager, winning 38 trophies, including 13 leag



In [1]:
import math

# subsample the dataset to roughly n_samples records to keep this example small
n_samples = 100
n_shards = math.ceil(len(dataset) / n_samples)
dataset = dataset.shard(num_shards=n_shards, index=2)
print(f"dataset_sample size: {len(dataset)}")

To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function, `format_dolly`, that takes a sample and returns a string in our instruction format.

In [6]:
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt


Let's test our formatting function on a random example.

In [7]:
from random import randrange

print(format_dolly(dataset[randrange(len(dataset))]))

### Instruction
What is a romance language?

### Answer
Romance languages are a subset of languages derived from Latin roots into fully fledged nationally spoken languages. Examples of Romance languages are French, Italian, Spanish, and Portuguese.
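The random sample printed above happens to have an empty `context`, so the `### Context` section is skipped. The sketch below shows the output for a sample that does carry context. The function is repeated so the snippet runs on its own, and the sample itself is made up for illustration rather than taken from the dataset.

```python
def format_dolly(sample):
    # same template as in the notebook cell above
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    return "\n\n".join([i for i in [instruction, context, response] if i is not None])


# hypothetical sample, invented for this illustration
sample_with_context = {
    "instruction": "Which city is the capital of France?",
    "context": "Paris is the capital and most populous city of France.",
    "response": "Paris",
}

formatted = format_dolly(sample_with_context)
print(formatted)
```

The three sections are separated by blank lines, and the context block sits between the instruction and the answer.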


In addition to formatting our samples, we also want to pack multiple samples into one sequence to make training more efficient.

In [5]:
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf" # sharded weights
tokenizer = AutoTokenizer.from_pretrained(model_id,use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.




We define some helper functions to pack our samples into sequences of a given length and then tokenize them.

In [9]:
from random import randint
from itertools import chain
from functools import partial


# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample


# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample (randint is inclusive on both ends, hence len - 1)
print(dataset[randint(0, len(dataset) - 1)]["text"])

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # largest multiple of chunk_length that fits into this batch
    # (0 if the batch is shorter than one chunk, so everything moves to the remainder)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")

                                                   

### Instruction
Classify each of the following as either round or square shaped: a planet, a a ball, a slice of bread, a chess board.

### Answer
Planets are round.
Balls are round.
A slice of bread is square shaped.
A chess board is square shaped.</s>


                                                   

Total number of samples: 10
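To see what the packing step does on a small scale, here is a simplified, stateless sketch of the same idea. It works on plain lists of made-up token ids instead of tokenizer output, and `pack` is a hypothetical helper rather than the `chunk` function used above: it concatenates all samples, cuts the result into full chunks, and returns the leftover tokens separately.

```python
def pack(sequences, chunk_length):
    """Concatenate token lists and split them into full chunks plus a remainder."""
    tokens = [tok for seq in sequences for tok in seq]
    # largest multiple of chunk_length that fits into the concatenated tokens
    n_full = (len(tokens) // chunk_length) * chunk_length
    chunks = [tokens[i : i + chunk_length] for i in range(0, n_full, chunk_length)]
    remainder = tokens[n_full:]
    return chunks, remainder


# toy "tokenized" samples of different lengths (made-up token ids)
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
chunks, remainder = pack(samples, chunk_length=4)
print(chunks)
print(remainder)
```

Every chunk has exactly `chunk_length` tokens, so no padding is needed. In the notebook's `chunk` function the remainder is carried over to the next batch instead of being returned.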




After processing the dataset, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload it to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later when we start the training job.

In [8]:
# save train_dataset to s3
ver = "v1"
training_input_path = f's3://{sess.default_bucket()}/dataseta/dolly/{ver}/train'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

uploaded data to:
training dataset to: s3://sagemaker-ms-thesis-llm/dataseta/dolly/v1/train


## 3. Fine-Tune LLaMA 13B with QLoRA on Amazon SageMaker

We are going to use the method introduced in the paper "[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)" by Tim Dettmers et al. QLoRA is a technique to reduce the memory footprint of large language models during fine-tuning without sacrificing performance. In short, QLoRA works as follows:

* Quantize the pretrained model to 4 bits and freeze it.
* Attach small, trainable adapter layers. (LoRA)
* Finetune only the adapter layers, while using the frozen quantized model for context.
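The steps above can be made concrete with a little arithmetic on the adapter size. For a linear layer with a `d_out x d_in` weight matrix, LoRA adds two small matrices, `A` (`r x d_in`) and `B` (`d_out x r`), and trains only those. The layer shape and the rank used below are assumptions chosen for illustration, not values read from `run_clm.py`.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Assumptions for illustration: a square 4096 x 4096 projection and rank r = 64
d_in = d_out = 4096
r = 64

full = d_in * d_out                    # frozen (quantized) weights of the layer
adapter = lora_params(d_in, d_out, r)  # trainable adapter weights

print(f"frozen weights:    {full:,}")
print(f"trainable weights: {adapter:,}")
print(f"ratio:             {adapter / full:.4f}")
```

Under these assumptions the adapter holds only a few percent of the layer's parameters, which is why gradients and optimizer state stay small.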

We prepared a [run_clm.py](./scripts/run_clm.py) script, which implements QLoRA using PEFT to train our model. The script also merges the LoRA weights into the model weights after training. That way you can use the model as a normal model without any additional code. The model will be temporarily offloaded to disk if it is too large to fit into memory.

In order to create a SageMaker training job, we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure.
SageMaker takes care of starting and managing all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`. Then, it starts the training job by running the entry point script with the hyperparameters passed as command-line arguments.
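SageMaker hands hyperparameters to the entry point as `--key value` command-line arguments, all encoded as strings. The actual `run_clm.py` is not shown in this notebook, so the following is only a sketch of how such a script could read them. The argument names mirror the hyperparameters dictionary defined later in this notebook; `parse_args` and `str2bool` are hypothetical helpers.

```python
import argparse


def str2bool(value: str) -> bool:
    # hyperparameters arrive as strings, so "True"/"False" need explicit parsing
    return str(value).lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str)
    parser.add_argument("--dataset_path", type=str, default="/opt/ml/input/data/training")
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--per_device_train_batch_size", type=int, default=2)
    parser.add_argument("--lr", type=float, default=2e-4)
    parser.add_argument("--hf_token", type=str, default=None)
    parser.add_argument("--merge_weights", type=str2bool, default=False)
    # tolerate extra arguments instead of failing on them
    args, _ = parser.parse_known_args(argv)
    return args


# simulate the arguments a training job could receive
args = parse_args(
    ["--model_id", "meta-llama/Llama-2-7b-hf", "--epochs", "1", "--lr", "0.0002", "--merge_weights", "True"]
)
print(args)
```

Parsing booleans explicitly matters here: with `type=bool`, argparse would treat the non-empty string `"False"` as true.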

### Hardware requirements

We also ran several experiments to determine which instance type can be used for the different model sizes. The following table shows the model size, instance type, max batch size, and context length for each configuration.

| Model     | Instance Type       | Max Batch Size                    | Context Length |
|-----------|---------------------|-----------------------------------|----------------|
| Llama 7B  | `(ml.)g5.4xlarge`   | `3`                               | `2048`         |
| Llama 13B | `(ml.)g5.4xlarge`   | `2`                               | `2048`         |
| Llama 70B | `(ml.)p4d.24xlarge` | `1++` (need to test more configs) | `2048`         |


> You can also use `g5.2xlarge` instead of the `g5.4xlarge` instance type, but then it is not possible to use `merge_weights` parameter, since to merge the LoRA weights into the model weights, the model needs to fit into memory. But you could save the adapter weights and merge them using [merge_adapter_weights.py](./scripts/merge_adapter_weights.py) after training.

_Note: We plan to extend this list in the future. Feel free to contribute your setup!_

In [7]:
import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define training job name
job_name = 'huggingface-qlora-dolly-example'

# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'epochs': 1,                                      # number of training epochs
  'per_device_train_batch_size': 2,                 # batch size for training
  'lr': 2e-4,                                       # learning rate used during training
  'hf_token': HfFolder.get_token(),                 # huggingface token to access llama 2
  'merge_weights': True,                            # whether to merge LoRA into the model (needs more memory)
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'phil-examples',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.4xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # prefix for the training job name (SageMaker appends a timestamp)
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch_version version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)

We can now start our training job, with the `.fit()` method passing our S3 path to the training script.

In [9]:
# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

Using provided s3_resource


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-qlora-dolly-example-2023-08-21-14-08-50-477


2023-08-21 14:08:51 Starting - Starting the training job...
2023-08-21 14:09:07 Starting - Preparing the instances for training......
2023-08-21 14:10:29 Downloading - Downloading input data
2023-08-21 14:10:29 Training - Downloading the training image...........................
2023-08-21 14:15:01 Training - Training image download completed. Training in progress......bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-08-21 14:15:58,889 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-08-21 14:15:58,902 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-08-21 14:15:58,911 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-08-21 14:15:58,913 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-08-21 14:16:00,218 sagemaker-training-toolkit INFO     Installing dep

In our example for LLaMA 13B, the SageMaker training job took `31728 seconds`, which is about `8.8 hours`. The ml.g5.4xlarge instance we used costs `$2.03 per hour` for on-demand usage. As a result, the total cost for training our fine-tuned LLaMa 2 model was only ~`$18`.
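The cost figure follows directly from the two numbers quoted above. The short calculation below uses only the training time and the hourly on-demand price as stated in the text; current prices may differ.

```python
training_seconds = 31728      # training time quoted above
price_per_hour_usd = 2.03     # on-demand price quoted above for ml.g5.4xlarge

training_hours = training_seconds / 3600
total_cost_usd = training_hours * price_per_hour_usd

print(f"{training_hours:.1f} hours, ${total_cost_usd:.2f}")
```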

## Next Steps 

You can deploy your fine-tuned LLaMA model to a SageMaker endpoint and use it for inference. Check out the [Deploy Falcon 7B & 40B on Amazon SageMaker](https://www.philschmid.de/sagemaker-falcon-llm) and [Securely deploy LLMs inside VPCs with Hugging Face and Amazon SageMaker](https://www.philschmid.de/sagemaker-llm-vpc) for more details.

### Deploy "Just Trained" Model

In [10]:
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.4xlarge")

INFO:sagemaker:Creating model with name: huggingface-qlora-dolly-example-2023-08-21-14-39-23-226
INFO:sagemaker:Creating endpoint-config with name huggingface-qlora-dolly-example-2023-08-21-14-39-23-226
INFO:sagemaker:Creating endpoint with name huggingface-qlora-dolly-example-2023-08-21-14-39-23-226


----------!

### Pull Model from S3 

In [25]:
import json
from sagemaker.huggingface import HuggingFaceModel
from huggingface_hub import HfFolder

s3_model_uri = "s3://sagemaker-ms-thesis-llm/models/huggingface-qlora-notebook-optDir-2023-08-20-20-01-03-748/output/model.tar.gz"
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04-v1.0"

# sagemaker config
instance_type = "ml.g5.4xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),
  # 'HF_MODEL_QUANTIZE': "bitsandbytes", # uncomment to quantize
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    model_data=s3_model_uri,
    env=config
)

print(llm_model)


<sagemaker.huggingface.model.HuggingFaceModel object at 0x15673b430>
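The two length limits in the config interact: `MAX_INPUT_LENGTH` caps the prompt, and `MAX_TOTAL_TOKENS` caps prompt plus generated tokens. The sketch below shows how many new tokens a request can ask for under the values used above. It reflects my understanding of how the text-generation-inference container applies these limits and should be read as an assumption; `max_new_tokens_budget` is a hypothetical helper.

```python
MAX_INPUT_LENGTH = 1024   # same values as in the config above
MAX_TOTAL_TOKENS = 2048


def max_new_tokens_budget(n_input_tokens: int) -> int:
    """Upper bound for `max_new_tokens` given the prompt length in tokens."""
    if n_input_tokens > MAX_INPUT_LENGTH:
        raise ValueError(
            f"prompt has {n_input_tokens} tokens, limit is {MAX_INPUT_LENGTH}"
        )
    return MAX_TOTAL_TOKENS - n_input_tokens


print(max_new_tokens_budget(100))
print(max_new_tokens_budget(1024))
```

With these settings, a prompt that uses the full input budget still leaves room for 1024 generated tokens.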


In [26]:
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
  container_startup_health_check_timeout=health_check_timeout, # seconds to wait for the container to load the model (300 s = 5 minutes)
)

INFO:botocore.tokens:Loading cached SSO token for slu-sso
INFO:sagemaker:Creating model with name: huggingface-pytorch-tgi-inference-2023-08-21-16-34-04-652
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-tgi-inference-2023-08-21-16-34-05-592
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-tgi-inference-2023-08-21-16-34-05-592


--------!

### Build Input Data for Inference

#### Simple String Input

In [31]:
data = {
   "inputs": "What is the Capital of California."
}

In [46]:
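# note: json.dumps(data) turns the whole dict into a string, so the prompt the model
# receives is the serialized JSON '{"inputs": "..."}' rather than the bare question
chat = llm.predict({"inputs":json.dumps(data)})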

print(chat[0]["generated_text"])



\end{code}

I want to get the value of the input as "S


In [48]:
payload = {
  "inputs":  json.dumps(data),
  "parameters": {
    # "do_sample": True,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 512,
    "repetition_penalty": 1.03,
    # "stop": ["</s>"]
  }
}

# send request to endpoint
response = llm.predict(payload)

print(response[0]["generated_text"])



<h1>What is the Capital of California.</h1>

<div class="question">
  <div class="question-body">
    <p>The capital of California is Sacramento.</p>
  </div>
</div>

<div class="question">
  <div class="question-body">
    <p>The capital of California is Sacramento.</p>
  </div>
</div>

<div class="question">
  <div class="question-body">
    <p>The capital of California is Sacramento.</p>
  </div>
</div>

<div class="question">
  <div class="question-body">
    <p>The capital of California is Sacramento.</p>
  </div>
</div>

<div class="question">
  <div class="question-body">
    <p>The capital of California is Sacramento.</p>
  </div>
</div>

<div class="question">
  <div class="question-body">
    <p>The capital of California is Sacramento.</p>
  </div>
</div>

<div class="question">
  <div class="question-body">
    <p>The capital of California is Sacramento.</p>
  </div>
</div>

<div class="question">
  <div class="question-body">
    <p>The capital of California is Sacramento

In [33]:
print(chat[0])

{'generated_text': '\n\\end{code}\n\nI want to get the value of the input as "S'}


#### Formatted String Input
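Build a prompt in the Llama 2 chat format (`[INST]`, `<<SYS>>`). Note that this differs from the `### Instruction` / `### Answer` template used for the training data above.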

In [44]:
def build_llama2_prompt(messages):
    startPrompt = "<s>[INST] "
    endPrompt = " [/INST]"
    conversation = []
    for index, message in enumerate(messages):
        if message["role"] == "system" and index == 0:
            conversation.append(f"<<SYS>>\n{message['content']}\n<</SYS>>\n\n")
        elif message["role"] == "user":
            conversation.append(message["content"].strip())
        else:
            conversation.append(f" [/INST] {message['content'].strip()}</s><s>[INST] ")

    return startPrompt + "".join(conversation) + endPrompt

messages = [
  { "role": "system","content": "You are a friendly and knowledgeable vacation planning assistant named Clara. Your goal is to answer their questions to help them plan their perfect vacation. "}
]

# define question and add to messages
instruction = "Capital of CA?"
messages.append({"role": "user", "content": instruction})
prompt = build_llama2_prompt(messages)

print(f"Formatted Prompt: {prompt}")

Formatted Prompt: <s>[INST] <<SYS>>
You are a friendly and knowledgeable vacation planning assistant named Clara. Your goal is to answer their questions to help them plan their perfect vacation. 
<</SYS>>

Capital of CA? [/INST]
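The fine-tuning data in this notebook was formatted with the `### Instruction` / `### Context` / `### Answer` template from `format_dolly`, not with the `[INST]` chat template. As an alternative, a prompt can be built in that same layout, ending with an open answer header for the model to continue. `build_dolly_prompt` is a hypothetical helper that is not part of the original notebook, and whether it produces better generations with this endpoint has not been tested here.

```python
def build_dolly_prompt(instruction: str, context: str = "") -> str:
    """Build an inference prompt in the ### Instruction / ### Context / ### Answer layout."""
    parts = [f"### Instruction\n{instruction}"]
    if context:
        parts.append(f"### Context\n{context}")
    # end with an open answer header so the model continues with the response
    parts.append("### Answer\n")
    return "\n\n".join(parts)


prompt_dolly = build_dolly_prompt("What is the capital of California?")
print(prompt_dolly)
```

The resulting string can be sent as the `inputs` field of the request payload in the same way as the chat-formatted prompt.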


### Run Inference Prediction

In [38]:
chat = llm.predict({"inputs":prompt})

print(chat[0]["generated_text"])



[INST] Capital of CA? [/INST]

[INST] Capital of


In [39]:
print(chat[0])

{'generated_text': '\n\n[INST] Capital of CA? [/INST]\n\n[INST] Capital of'}


In [45]:
payload = {
  "inputs":  prompt,
  "parameters": {
    # "do_sample": True,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 512,
    "repetition_penalty": 1.03,
    "stop": ["</s>"]
  }
}

# send request to endpoint
response = llm.predict(payload)

print(response[0]["generated_text"])



<</SYS>>


<</SYS>>

[INST] <<SYS>>

<</SYS>>

[INST] <<SYS>>

<</SYS>>

[INST] <<SYS>>

<</SYS>>


