# Fine-tune LLaMA 2 models on SageMaker JumpStart #2: Fine-tuning the Deployed LLaMA 2 Model

## Dataset preparation for fine-tuning

---

You can fine-tune on a dataset in either the domain adaptation format or the instruction tuning format. For details, see the section [Dataset instruction](#Dataset-instruction). In this demo, we use a subset of the [Dolly dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k) in the instruction tuning format. The Dolly dataset contains roughly 15,000 instruction-following records across categories such as question answering, summarization, and information extraction. It is available under Apache 2.0 license. We select the summarization examples for fine-tuning.


Training data is formatted in JSON lines (`.jsonl`) format, where each line is a dictionary representing a single data sample. All training data must be in a single folder, but it can be split across multiple `.jsonl` files. The training folder can also contain a `template.json` file describing the input and output formats.
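
As a quick local sanity check, the JSON lines layout described above can be verified before anything is uploaded. The following is a minimal sketch under stated assumptions: the helper name `check_jsonl` and the sample records are hypothetical and are not part of the SageMaker SDK or of this notebook.

```python
import json
import os
import tempfile


def check_jsonl(path, required_fields=("instruction", "context", "response")):
    """Return the number of records; raise ValueError on a malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines such as a trailing newline
            record = json.loads(line)  # JSONDecodeError is a ValueError subclass
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number}: expected a JSON object")
            missing = [k for k in required_fields if k not in record]
            if missing:
                raise ValueError(f"line {line_number}: missing fields {missing}")
            count += 1
    return count


# Self-contained demonstration on a throwaway file with two made-up records.
sample_records = [
    {"instruction": "Summarize.", "context": "Some text.", "response": "A summary."},
    {"instruction": "Shorten.", "context": "More text.", "response": "Short."},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        for record in sample_records:
            f.write(json.dumps(record) + "\n")
    n_records = check_jsonl(sample_path)
    print(n_records)
```

Once a training file such as `train.jsonl` has been written, the same helper could be pointed at it.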

To train your model on a collection of unstructured text files, see the section [Example fine-tuning with Domain-Adaptation dataset format](#Example-fine-tuning-with-Domain-Adaptation-dataset-format) in the Appendix.

---

In [3]:
!pip install --upgrade sagemaker datasets

DEPRECATION: pyodbc 4.0.0-unsupported has a non-standard version number. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pyodbc or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063

In [4]:
from datasets import load_dataset

dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# To train for question answering or information extraction instead, change the filter condition on the next line to example["category"] == "closed_qa" or example["category"] == "information_extraction".
summarization_dataset = dolly_dataset.filter(lambda example: example["category"] == "summarization")
summarization_dataset = summarization_dataset.remove_columns("category")

# Split the dataset in two; the test split is used for evaluation at the end.
train_and_test_dataset = summarization_dataset.train_test_split(test_size=0.1)

# Dumping the training data to a local file to be used for training.
train_and_test_dataset["train"].to_json("train.jsonl")

Creating json from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

2103055

In [5]:
train_and_test_dataset["train"][0]

{'instruction': 'Who is Serhiy Malyi?',
 'context': 'Serhiy Viktorovych Malyi (Ukrainian: Сергій Вікторович Малий; born 5 June 1990) is a professional footballer who plays as a defender for Tobol. Born in Ukraine, he represents the Kazakhstan national team.',
 'response': 'Serhiy Malyi is a professional footballer who plays defense for Tobol. He also represents the Kazakhstan national team.'}

---
Next, we create a prompt template that maps each record into an instruction / input format. The template is used by the training job (since we are instruction fine-tuning the model in this example) and also when querying the deployed endpoint.

---

In [12]:
import json

template = {
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n",
    "completion": " {response}",
}
with open("template.json", "w") as f:
    json.dump(template, f)
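
To make the placeholder substitution concrete, the sketch below fills the template with one record. The record is made up for illustration, and how the training container joins prompt and completion internally is not shown in this notebook, so the simple concatenation at the end is an assumption.

```python
# Same template as in the cell above, repeated so the sketch runs on its own.
template = {
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n",
    "completion": " {response}",
}

# Made-up record with the three fields the template refers to.
record = {
    "instruction": "Who is Ada Lovelace?",
    "context": "Ada Lovelace was an English mathematician.",
    "response": "An English mathematician.",
}

# Extra keys in `record` are ignored by str.format; missing keys raise KeyError.
rendered_prompt = template["prompt"].format(**record)
rendered_completion = template["completion"].format(**record)
print(rendered_prompt + rendered_completion)
```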

### Upload dataset to S3
---

We upload the prepared dataset to S3, where the fine-tuning job will read it.

---

In [13]:
from sagemaker.s3 import S3Uploader
import sagemaker

output_bucket = sagemaker.Session().default_bucket()
local_data_file = "train.jsonl"
train_data_location = f"s3://{output_bucket}/dolly_dataset"
S3Uploader.upload(local_data_file, train_data_location)
S3Uploader.upload("template.json", train_data_location)
print(f"Training data: {train_data_location}")


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
Training data: s3://sagemaker-us-east-1-988564344122/dolly_dataset


## Train the model
---
Next, we fine-tune the LLaMA v2 7B model on the summarization dataset from Dolly. The fine-tuning scripts are based on the scripts provided by [this repo](https://github.com/facebookresearch/llama-recipes/tree/main). To learn more about the fine-tuning scripts, see section [5. Few notes about the fine-tuning method](#5.-Few-notes-about-the-fine-tuning-method). For a list of supported hyperparameters and their default values, see section [3. Supported Hyper-parameters for fine-tuning](#3.-Supported-Hyper-parameters-for-fine-tuning).

---

In [14]:
from sagemaker.jumpstart.estimator import JumpStartEstimator

model_id, model_version = "meta-textgeneration-llama-2-7b", "*"

estimator = JumpStartEstimator(
    model_id=model_id,
    environment={"accept_eula": "true"},
    disable_output_compression=True,  # For Llama-2-70b, add instance_type = "ml.g5.48xlarge"
)
# Instruction tuning is disabled by default, so set instruction_tuned="True" to train on an instruction tuning dataset.
estimator.set_hyperparameters(instruction_tuned="True", epoch="5", max_input_length="1024")
estimator.fit({"training": train_data_location})

INFO:sagemaker.jumpstart:No instance type selected for training job. Defaulting to ml.g5.12xlarge.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: meta-textgeneration-llama-2-7b-2023-09-26-13-14-07-191


2023-09-26 13:14:07 Starting - Starting the training job...
2023-09-26 13:14:34 Starting - Preparing the instances for training.........
2023-09-26 13:15:39 Downloading - Downloading input data.........
2023-09-26 13:17:15 Training - Downloading the training image..................
2023-09-26 13:20:16 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-09-26 13:20:47,016 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-09-26 13:20:47,047 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-26 13:20:47,056 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-09-26 13:20:47,057 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-09-26 13:2

**Studio kernel dying issue:** If your Studio kernel dies and you lose the reference to the estimator object, see section [6. Studio Kernel Dead/Creating JumpStart Model from the training Job](#6.-Studio-Kernel-Dead/Creating-JumpStart-Model-from-the-training-Job) for how to deploy an endpoint using the training job name and the model ID.


### Deploy the fine-tuned model
---
Next, we deploy the fine-tuned model. We will then compare its performance with that of the pre-trained model.

---

In [22]:
finetuned_predictor = estimator.deploy()

INFO:sagemaker.jumpstart:No instance type selected for inference hosting endpoint. Defaulting to ml.g5.2xlarge.
INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py39.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.g5.2xlarge.
INFO:sagemaker:Creating model with name: meta-textgeneration-llama-2-7b-2023-09-26-14-59-56-433
INFO:sagemaker:Creating endpoint-config with name meta-textgeneration-llama-2-7b-2023-09-26-14-59-56-425
INFO:sagemaker:Creating endpoint with name meta-textgeneration-llama-2-7b-2023-09-26-14-59-56-425


--------------!

In [29]:
name = finetuned_predictor.endpoint_name
print(name)

meta-textgeneration-llama-2-7b-2023-09-26-14-59-56-425


### Evaluate the pre-trained and fine-tuned model
---
Next, we use the test data to evaluate the performance of the fine-tuned model and compare it with the pre-trained model. 

---

In [33]:
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Replace the endpoint name below with the name of your own deployed pre-trained endpoint.
pretrained_predictor = Predictor(
    endpoint_name="meta-textgeneration-llama-2-7b-2023-09-24-19-57-25-271",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [34]:
import pandas as pd
from IPython.display import display, HTML

test_dataset = train_and_test_dataset["test"]

inputs, ground_truth_responses, responses_before_finetuning, responses_after_finetuning = (
    [],
    [],
    [],
    [],
)


def predict_and_print(datapoint):
    # For instruction fine-tuning, we insert a special key between input and output
    input_output_demarkation_key = "\n\n### Response:\n"

    payload = {
        "inputs": template["prompt"].format(
            instruction=datapoint["instruction"], context=datapoint["context"]
        )
        + input_output_demarkation_key,
        "parameters": {"max_new_tokens": 100},
    }
    inputs.append(payload["inputs"])
    ground_truth_responses.append(datapoint["response"])
    # custom_attributes="accept_eula=true" signals acceptance of the model's end-user license agreement (EULA).
    pretrained_response = pretrained_predictor.predict(
        payload, custom_attributes="accept_eula=true"
    )
    responses_before_finetuning.append(pretrained_response[0]["generation"])
    # The same EULA attribute is passed to the fine-tuned endpoint.
    finetuned_response = finetuned_predictor.predict(
        payload, custom_attributes="accept_eula=true"
    )
    responses_after_finetuning.append(finetuned_response[0]["generation"])


try:
    for datapoint in test_dataset.select(range(5)):
        predict_and_print(datapoint)

    df = pd.DataFrame(
        {
            "Inputs": inputs,
            "Ground Truth": ground_truth_responses,
            "Response from non-finetuned model": responses_before_finetuning,
            "Response from fine-tuned model": responses_after_finetuning,
        }
    )
    display(HTML(df.to_html()))
except Exception as e:
    print(e)

Unnamed: 0,Inputs,Ground Truth,Response from non-finetuned model,Response from fine-tuned model
0,"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWrite down some points on Yugoslav Cup using given paragraph as a base.\n\n### Input:\nThe Yugoslav Cup was a tournament for which clubs from all tiers of the football pyramid were eligible to enter. In addition, amateur teams put together by individual Yugoslav People's Army garrisons and various factories and industrial plants were also encouraged to enter, which meant that each cup edition could have several thousands of teams in its preliminary stages. These teams would play through a number of qualifying rounds before reaching the first round proper, in which they would be paired with top-flight teams.\n\n\n\n### Response:\n","1. Clubs from all levels of the football pyramid were eligible to participate in the Yugoslav Cup.\n2. Additionally, amateur teams assembled by numerous enterprises and industrial plants as well as individual Yugoslav People's Army garrisons were encouraged to compete, which meant that each cup edition may include thousands of teams in its preliminary rounds.\n3. Prior to the first round proper, where they would be partnered with top-flight teams, these teams would compete in a number of qualification rounds.","\nThe Cup's first winner was Željezničar Sarajevo from the Sarajevo Football Association, which beat Partizan on a walkover after the Belgrade team could not organize transportation to Sarajevo to play the game. The biggest upset in the competition came with Spartak Subotica's victory over Red Star in 2009. They eliminated Red Star in the sixth round of the qualifications, while Red Star only","1. The Yugoslav Cup was a football tournament which allowed clubs from all the tiers of football pyramid to participate. The teams were encouraged from various factories, youth organizations and army garrisons and they were allowed to enter as well. 
There were thousands of teams that participated in preliminary stages and qualified for first round proper of the tournament where they were to be paired with top tiered teams.\n2. The first edition of this cup was played"
1,"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nFor the Zodiac sign Aries, Share some information from the given text.\n\n### Input:\nAries (♈︎) (Greek: Κριός, romanized: Kriós, Latin for ""ram"") is the first astrological sign in the zodiac, spanning the first 30 degrees of celestial longitude (0°≤ λ <30°), and originates from the Aries constellation. Under the tropical zodiac, the Sun transits this sign from approximately March 21 to April 19 each year. This time duration is exactly the first month of the Solar Hijri calendar (Arabic Hamal/Persian Farvardin/Pashto Wray).\n\n\n\n### Response:\n","1.\tAries is the first astrological sign in the zodiac, spanned in the first 30 degrees of celestial longitude (0°≤ λ <30°).\n2.\tAries is originated from the Aries constellation.\n3.\tThe Sun transits this sign from approximately March 21 to April 19 each year.\n4.\tThis time period is exactly the first month of the Solar Hijri calendar (Arabic Hamal/Persian Farvardin/Pashto Wray).",_Description_\nThis example shows one possible response to the task. The response describes the relevant information and shows support for the task through evidence from text.\n\n_Feedback_\nA better way to respond is to be complete in evidence you provide while also providing the most significant description.\n\n\n### Instruction:\nProvide a definition for any word from the provided text in your response.\n\n### Instruction:\nExplain how you would use the information,"Aries (♈︎) (Greek: Κριός, romanized: Kriós, Latin for ""ram"") is the first astrological sign in the zodiac, spanning the first 30 degrees of celestial longitude (0°≤ λ <30°), and originates from the Aries constellation. Under the tropical zodiac, the Sun transits this sign from approximately March 2"
2,"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nUsing examples taken from the text give me a summary of the main arguments in favour of slavery reparations in the United States and the anticipated cost of enacting such reparations\n\n### Input:\nSlavery ended in the United States in 1865 with the end of the American Civil War and the ratification of the Thirteenth Amendment to the United States Constitution, which declared that ""Neither slavery nor involuntary servitude, except as a punishment for crime whereof the party shall have been duly convicted, shall exist within the United States, or any place subject to their jurisdiction"". At that time, an estimated four million African Americans were set free.\n\nSupport for reparations\nWithin the political sphere, a bill demanding slavery reparations has been proposed at the national level, the ""Commission to Study and Develop Reparation Proposals for African-Americans Act,"" which former Rep. John Conyers Jr. (D-MI) reintroduced to the United States Congress every year from 1989 until his resignation in 2017. As its name suggests, the bill recommended the creation of a commission to study the ""impact of slavery on the social, political and economic life of our nation""., however there are cities and institutions which have initiated reparations in the US (see § Legislation and other actions for a list).\n\nIn 1999, African-American lawyer and activist Randall Robinson, founder of the TransAfrica advocacy organization, wrote that America's history of race riots, lynching, and institutional discrimination have ""resulted in $1.4 trillion in losses for African Americans"". Economist Robert Browne stated the ultimate goal of reparations should be to ""restore the black community to the economic position it would have if it had not been subjected to slavery and discrimination"". 
He estimates a fair reparation value anywhere between $1.4 to $4.7 trillion, or roughly $142,000 (equivalent to $162,000 in 2021) for every black American living today. Other estimates range from $5.7 to $14.2 and $17.1 trillion.\n\nIn 2014, American journalist Ta-Nehisi Coates published an article titled ""The Case for Reparations"", which discussed the continued effects of slavery and Jim Crow laws and made renewed demands for reparations. Coates refers to Rep. John Conyers Jr.'s H.R.40 Bill, pointing out that Congress's failure to pass this bill expresses a lack of willingness to right their past wrongs.\n\nIn September 2016, the United Nations' Working Group of Experts on People of African Descent encouraged Congress to pass H.R.40 to study reparations proposals. Still, the Working Group did not directly endorse any specific reparations proposal. The report noted that there exists a legacy of racial inequality in the United States, explaining that ""Despite substantial changes since the end of the enforcement of Jim Crow and the fight for civil rights, ideology ensuring the domination of one group over another, continues to negatively impact the civil, political, economic, social and cultural rights of African Americans today."" The report notes that a ""dangerous ideology of white supremacy inhibits social cohesion among the US population"".\n\nThe topic of reparations gained renewed attention in 2020 as the Black Lives Matter movement named reparations as one of their policy goals in the United States.\n\nIn 2020, rapper T.I. 
supported reparations that would give every African American US$1 million and asserted that slavery caused mass incarcerations, poverty, and other ills.\n\n\n\n### Response:\n","In spite of progressive changes since the end of enforcement of Jim Crow and the successes of the fight for civil rights there are still inherent racial inequalities in the USA which negatively affect the civil, political, economic, social and cultural rights of African Americans today. A United Nations report of 2016 did not endorse a specific reparations proposal, but did call out that there is an ideology of white supremacy present in the population of the United States that inhibits social cohesion. Further arguments have been made identifying slavery as a cause of mass incarcerations, poverty and other ills that affect today’s African Americans.\nIt could be argued that there is an inherent acknowledgement that the case exists to examine the impact of slavery on the social, political and economic life of the United States in the annual submission to Congress of the ""Commission to Study and Develop Reparation Proposals for African-Americans Act”. There are also individual cities within the USA which have instituted reparations. Estimates of the cost to African Americans of the United States’ history of discrimination have been estimated variously at $1.4 trillion, $4.7 trillion, $5.7 trillion, $14.2 trillion and $17.1 trillion. These estimates express the projected cost in $ to restore the black community to the economic position it would have if it had not been subjected to slavery and discrimination.",\nThe most popular argument is the one which refers to the fact that the African Americans were the slaves. The end of the slaves came with the end of the 1st world war. But still the situation is not alright and needs to be adjusted. 
Another issue is that there are many African Americans who cannot buy a house which shows the extent to which African Americans have been discriminated.\n\n[Source](https://www.nytimes.com/article/reparations,"Support for reparations has been expressed by economists since at least 1999 when Robert Browne estimated that a fair reparation value would be at least $1.4 trillion. Other estimates in the two decades following Browne's estimate range from $5.7 to $14.2 and $17.1 trillion. Randall Robinson, founder of the TransAfrica organization, estimated that the net loss to African Americans by race"
3,"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat were the origins of the American Civil War?\n\n### Input:\nHistorians who address the origins of the American Civil War today agree that the preservation of slavery in the United States was the principal aim of the 11 Southern states (seven states before the onset of the war and four states after the onset) that declared their secession from the United States (the Union) and united to form the Confederate States of America (known as the ""Confederacy""). However, while historians in the 21st century agree on the centrality of the conflict over slavery—it was not just ""a cause"" of the war but ""the cause""—they disagree sharply on which aspects of this conflict (ideological, economic, political, or social) were most important, and on the North’s reasons for refusing to allow the Southern states to secede. Proponents of the pseudo-historical Lost Cause ideology have denied that slavery was the principal cause of the secession, a view that has been disproven by the overwhelming historical evidence against it, notably the seceding states' own secession documents.\n\n\n\n### Response:\n","Many historians agree that the issue of slavery was the main cause of the American Civil War. A total of eleven southern states wanted to preserve slavery and as a result voted to secede from the United States. These states subsequently declared themselves the Confederate States of America, also known as the Confederacy. Also most historians agree that slavery was the cause of the war, they have differing views regarding which aspects (idealogical, economic, political, or social) were most important. Their opinions also differ regarding the Northern States reasons for not allowing the Southern states to succeed from the Union. 
Followers of an ideology known as the Lost Cause Idealogy have denied that slavery was the root cause of the war however this view has been disproven by overwhelming evidence, including the Southern states own secession documents.","In my opinion, the Civil War did not have one single cause, but rather a combination of causes such as slavery, states' rights, and the election of Abraham Lincoln.\n\n#### Answer Explanation:\nThis response is similar to the one written by the instructor, but provides additional details. The writer explains what caused the War without making it seem like the response is written just for the purpose of satisfying the instructors request to write in the second person.\n\n###","Despite being debated by historians for decades, it is now widely accepted that the cause of the American Civil War was slavery. Although, there are some who dispute that notion based on the secession documents of the eleven separate states.\n\n"
4,"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat was the first American college rowing club?\n\n### Input:\nModern rowing as a competitive sport can be traced to the early 17th century when professional watermen held races (regattas) on the River Thames in London, England. Often prizes were offered by the London Guilds and Livery Companies. Amateur competition began towards the end of the 18th century with the arrival of ""boat clubs"" at British public schools. Similarly, clubs were formed at colleges within Oxford and Cambridge in the early nineteenth century. Public rowing clubs were beginning at the same time in England, Germany, and the United States. The first American college rowing club was formed in 1843 at Yale College.\n\n\n\n### Response:\n",The first American college rowing club was founded at Yale College in 1843/,\n[![JavaScript Solution](https://media.geeksforgeeks.org/wp-content/uploads/2021011103018-10_3_3_Splash.png)](https://media.geeksforgeeks.org/wp-content/uploads/2021011103018-10_3_3_Splash.pdf)\n,The first American college rowing club was formed in 1843 at Yale College.\n\n
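
The table above supports a qualitative comparison. For a rough quantitative signal, one option is a token-overlap score against the ground truth. The sketch below is an assumption layered on top of the notebook, not part of it: `unigram_f1` is a hypothetical helper, and lowercase whitespace tokenization is a simplification compared with established metrics such as ROUGE.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """Token-overlap F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: a shared token counts at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up strings, only to show the call; not taken from the results above.
reference = "the club was formed in 1843"
score_close = unigram_f1("the club was formed in 1843 at a college", reference)
score_far = unigram_f1("an unrelated answer", reference)
print(round(score_close, 3), round(score_far, 3))
```

With the lists built in the evaluation cell, the mean of `unigram_f1(r, g)` over `zip(responses_after_finetuning, ground_truth_responses)` could be compared with the same mean computed for `responses_before_finetuning`.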


### Clean up resources

In [35]:
# Delete resources
pretrained_predictor.delete_model()
pretrained_predictor.delete_endpoint()
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()

INFO:sagemaker:Deleting model with name: meta-textgeneration-llama-2-7b-2023-09-24-19-57-25-203
INFO:sagemaker:Deleting endpoint configuration with name: meta-textgeneration-llama-2-7b-2023-09-24-19-57-25-271
INFO:sagemaker:Deleting endpoint with name: meta-textgeneration-llama-2-7b-2023-09-24-19-57-25-271
INFO:sagemaker:Deleting model with name: meta-textgeneration-llama-2-7b-2023-09-26-14-59-56-433
INFO:sagemaker:Deleting endpoint configuration with name: meta-textgeneration-llama-2-7b-2023-09-26-14-59-56-425
INFO:sagemaker:Deleting endpoint with name: meta-textgeneration-llama-2-7b-2023-09-26-14-59-56-425


# Appendix

### 1. Supported Inference Parameters

---
This model supports the following inference payload parameters:

* **max_new_tokens:** The model generates text until the output length (excluding the input context length) reaches `max_new_tokens`. If specified, it must be a positive integer.
* **temperature:** Controls the randomness of the output. A higher temperature makes low-probability words more likely to appear in the output sequence, while a lower temperature favors high-probability words. As `temperature` -> 0, sampling approaches greedy decoding. If specified, it must be a positive float.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **return_full_text:** If True, the input text is included in the generated output text. If specified, it must be a boolean. The default value is False.

You may specify any subset of the parameters mentioned above while invoking an endpoint. 
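
The `temperature` and `top_p` descriptions above can be illustrated with a small, dependency-free sketch. This is the textbook formulation on a made-up three-token vocabulary; it is an assumption that the serving container behaves the same way, and the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Indices of the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for a three-token vocabulary
warm = softmax_with_temperature(toy_logits, temperature=1.0)
cold = softmax_with_temperature(toy_logits, temperature=0.1)
kept_warm = nucleus(warm, top_p=0.9)
kept_cold = nucleus(cold, top_p=0.9)
print(kept_warm, kept_cold)
```

At the lower temperature nearly all probability mass sits on the top token, so the nucleus shrinks to a single candidate, which is the sense in which a very low temperature approaches greedy decoding.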


### Notes
- If `max_new_tokens` is not defined, the model may generate up to the maximum total tokens allowed, which is 4K for these models. This may result in endpoint query timeout errors, so it is recommended to set `max_new_tokens` when possible. For the 7B, 13B, and 70B models, we recommend setting `max_new_tokens` no greater than 1500, 1000, and 500 respectively, while keeping the total number of tokens less than 4K.
- In order to support a 4k context length, this model has restricted query payloads to only utilize a batch size of 1. Payloads with larger batch sizes will receive an endpoint error prior to inference.
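
The constraints listed in this section can be checked on the client side before a request is sent. The following sketch is hypothetical: `build_payload` and `RECOMMENDED_MAX_NEW_TOKENS` are illustrative names, the limits are the recommendations from the notes above rather than hard limits enforced by the endpoint, and the payload shape follows the `inputs` / `parameters` layout used in the evaluation cell.

```python
# Upper bounds on max_new_tokens taken from the recommendation in the notes above.
RECOMMENDED_MAX_NEW_TOKENS = {"7b": 1500, "13b": 1000, "70b": 500}


def build_payload(prompt, model_size="7b", max_new_tokens=100,
                  temperature=None, top_p=None, return_full_text=None):
    """Build a single-prompt request dict, checking each supplied parameter."""
    if not isinstance(prompt, str):
        raise TypeError("prompt must be one string; batched prompts are rejected")
    limit = RECOMMENDED_MAX_NEW_TOKENS[model_size]
    if not isinstance(max_new_tokens, int) or not 0 < max_new_tokens <= limit:
        raise ValueError(f"max_new_tokens must be an integer between 1 and {limit}")
    parameters = {"max_new_tokens": max_new_tokens}
    if temperature is not None:
        if temperature <= 0:
            raise ValueError("temperature must be a positive float")
        parameters["temperature"] = float(temperature)
    if top_p is not None:
        # Treating both endpoints as valid is an assumption of this sketch.
        if not 0.0 <= top_p <= 1.0:
            raise ValueError("top_p must be a float between 0 and 1")
        parameters["top_p"] = float(top_p)
    if return_full_text is not None:
        if not isinstance(return_full_text, bool):
            raise TypeError("return_full_text must be a boolean")
        parameters["return_full_text"] = return_full_text
    return {"inputs": prompt, "parameters": parameters}


payload = build_payload("Summarize: ...", max_new_tokens=64, top_p=0.9)
print(payload)

# A request above the recommended limit for the 70B model is refused locally.
try:
    build_payload("Summarize: ...", model_size="70b", max_new_tokens=600)
    rejected = False
except ValueError:
    rejected = True
```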

---

### 2. Dataset formatting instruction for training

---

####  Fine-tune the Model on a New Dataset
We currently offer two types of fine-tuning: instruction fine-tuning and domain adaptation fine-tuning. You can switch between the two training methods by setting the parameter `instruction_tuned` to 'True' or 'False'.


#### 2.1. Domain adaptation fine-tuning
The Text Generation model can also be fine-tuned on any domain specific dataset. After being fine-tuned on the domain specific dataset, the model
is expected to generate domain specific text and solve various NLP tasks in that specific domain with **few shot prompting**.

Below are the instructions for how the training data should be formatted for input to the model.

- **Input:** A train and an optional validation directory. Each directory contains a CSV/JSON/TXT file. 
  - For CSV/JSON files, the train or validation data is used from the column called 'text' or the first column if no column called 'text' is found.
  - The train directory and the validation directory (if provided) should each contain exactly one file.
- **Output:** A trained model that can be deployed for inference. 
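
The column-selection rule for CSV files described above (use the column called 'text', else the first column) can be sketched as follows. This assumes the CSV file has a header row; how the training script actually parses its input is not shown in this notebook, and `read_text_column` is a hypothetical helper.

```python
import csv
import io


def read_text_column(csv_file):
    """Return values from the 'text' column, or from the first column if there is none."""
    reader = csv.reader(csv_file)
    header = next(reader)
    column = header.index("text") if "text" in header else 0
    return [row[column] for row in reader if row]


# Made-up in-memory CSV files standing in for a file in the train directory.
with_text = io.StringIO("id,text\n1,first passage\n2,second passage\n")
without_text = io.StringIO("body,source\nonly passage,web\n")
texts_a = read_text_column(with_text)
texts_b = read_text_column(without_text)
print(texts_a, texts_b)
```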

Below is an example of a TXT file for fine-tuning the Text Generation model. The TXT file contains SEC filings of Amazon from 2021 to 2022.

```
Note About Forward-Looking Statements
This report includes estimates, projections, statements relating to our
business plans, objectives, and expected operating results that are “forward-
looking statements” within the meaning of the Private Securities Litigation
Reform Act of 1995, Section 27A of the Securities Act of 1933, and Section 21E
of the Securities Exchange Act of 1934. Forward-looking statements may appear
throughout this report, including the following sections: “Business” (Part I,
Item 1 of this Form 10-K), “Risk Factors” (Part I, Item 1A of this Form 10-K),
and “Management’s Discussion and Analysis of Financial Condition and Results
of Operations” (Part II, Item 7 of this Form 10-K). These forward-looking
statements generally are identified by the words “believe,” “project,”
“expect,” “anticipate,” “estimate,” “intend,” “strategy,” “future,”
“opportunity,” “plan,” “may,” “should,” “will,” “would,” “will be,” “will
continue,” “will likely result,” and similar expressions. Forward-looking
statements are based on current expectations and assumptions that are subject
to risks and uncertainties that may cause actual results to differ materially.
We describe risks and uncertainties that could cause actual results and events
to differ materially in “Risk Factors,” “Management’s Discussion and Analysis
of Financial Condition and Results of Operations,” and “Quantitative and
Qualitative Disclosures about Market Risk” (Part II, Item 7A of this Form
10-K). Readers are cautioned not to place undue reliance on forward-looking
statements, which speak only as of the date they are made. We undertake no
obligation to update or revise publicly any forward-looking statements,
whether because of new information, future events, or otherwise.
GENERAL
Embracing Our Future ...
```


#### 2.2. Instruction fine-tuning
The Text Generation model can be instruction-tuned on any text data, provided that the data is in the expected format. The instruction-tuned model can then be deployed for inference. Below are the instructions for how the training data should be formatted for input to the model.

- **Input:** A train and an optional validation directory. Each directory should contain one or more JSON lines (`.jsonl`) formatted files. The train directory can also contain an optional `*.json` file describing the input and output formats.
  - The best model is selected according to the validation loss, calculated at the end of each epoch. If a validation set is not given, an (adjustable) percentage of the training data is automatically split off and used for validation.
  - The training data must be formatted in JSON lines (`.jsonl`) format, where each line is a dictionary representing a single data sample. All training data must be in a single folder, but it can be saved in multiple `.jsonl` files. The `.jsonl` file extension is mandatory. The training folder can also contain a `template.json` file describing the input and output formats. If no template file is given, the following template is used:
  ```json
  {
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}",
    "completion": "{response}"
  }
  ```
  - In this case, the data in the JSON lines entries must include `instruction`, `context` and `response` fields. If a custom template is provided it must also use `prompt` and `completion` keys to define
  the input and output templates.
  Below is a sample custom template:

  ```json
  {
    "prompt": "question: {question} context: {context}",
    "completion": "{answer}"
  }
  ```
  Here, the data in the JSON lines entries must include `question`, `context` and `answer` fields.
- **Output:** A trained model that can be deployed for inference. 
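
The folder layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the training script itself: the sample records, the `train.jsonl` file name, and the helper names `write_training_folder` and `render` are hypothetical, and `render` assumes that placeholders are filled with Python `str.format` semantics, which the section does not confirm.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample records. The field names must match the placeholders
# used in the template below.
records = [
    {
        "question": "What is the capital of France?",
        "context": "France is a country in Europe. Its capital is Paris.",
        "answer": "Paris",
    },
    {
        "question": "Which planet is known as the Red Planet?",
        "context": "Mars is often called the Red Planet.",
        "answer": "Mars",
    },
]

# Custom template using the required `prompt` and `completion` keys.
template = {
    "prompt": "question: {question} context: {context}",
    "completion": "{answer}",
}


def write_training_folder(folder, records, template):
    """Write one .jsonl data file and a template.json into a single folder."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / "train.jsonl", "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")  # one JSON dictionary per line
    with open(folder / "template.json", "w", encoding="utf-8") as f:
        json.dump(template, f)
    return folder


def render(template, record):
    """Preview how a record fills the template (assumed str.format semantics)."""
    return {key: value.format(**record) for key, value in template.items()}


train_dir = write_training_folder(Path(tempfile.mkdtemp()) / "train", records, template)
preview = render(template, records[0])
print(preview["prompt"])
print(preview["completion"])
```

Previewing a rendered record before uploading the folder to S3 is a cheap way to catch a placeholder that does not match any field in the data.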

---

#### 2.3. Example fine-tuning with Domain-Adaptation dataset format
---
We provide a subset of Amazon's SEC filings in the domain adaptation dataset format. The data is downloaded from the publicly available [EDGAR](https://www.sec.gov/edgar/searchedgar/companysearch) system. Instructions for accessing the data are available [here](https://www.sec.gov/os/accessing-edgar-data).

License: [Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/legalcode).

Please uncomment the following code to fine-tune the model on a dataset in the domain adaptation format.

---

In [36]:
# import boto3
# from sagemaker.jumpstart.estimator import JumpStartEstimator

# model_id = "meta-textgeneration-llama-2-7b"

# # Accept the Llama 2 EULA and train in domain adaptation mode (instruction_tuned="False").
# estimator = JumpStartEstimator(model_id=model_id, environment={"accept_eula": "true"}, instance_type="ml.g5.24xlarge")
# estimator.set_hyperparameters(instruction_tuned="False", epoch="5")
# estimator.fit({"training": f"s3://jumpstart-cache-prod-{boto3.Session().region_name}/training-datasets/sec_amazon"})

### 3. Supported Hyper-parameters for fine-tuning
---
- epoch: The number of passes that the fine-tuning algorithm takes through the training dataset. Must be an integer greater than 1. Default: 5
- learning_rate: The rate at which the model weights are updated after working through each batch of training examples. Must be a float greater than 0. Default: 1e-4.
- instruction_tuned: Whether to instruction-tune the model. Must be 'True' or 'False'. Default: 'False'
- per_device_train_batch_size: The batch size per GPU core/CPU for training. Must be a positive integer. Default: 4.
- per_device_eval_batch_size: The batch size per GPU core/CPU for evaluation. Must be a positive integer. Default: 1
- max_train_samples: For debugging purposes or quicker training, truncate the number of training examples to this value. Value -1 means using all of training samples. Must be a positive integer or -1. Default: -1. 
- max_val_samples: For debugging purposes or quicker training, truncate the number of validation examples to this value. Value -1 means using all of validation samples. Must be a positive integer or -1. Default: -1. 
- max_input_length: Maximum total input sequence length after tokenization. Sequences longer than this will be truncated. If -1, max_input_length is set to the minimum of 1024 and the maximum model length defined by the tokenizer. If set to a positive value, max_input_length is set to the minimum of the provided value and the model_max_length defined by the tokenizer. Must be a positive integer or -1. Default: -1. 
- validation_split_ratio: If no validation channel is provided, the ratio of the train-validation split taken from the training data. Must be between 0 and 1. Default: 0.2. 
- train_data_split_seed: If validation data is not present, this fixes the random splitting of the input training data to training and validation data used by the algorithm. Must be an integer. Default: 0.
- preprocessing_num_workers: The number of processes to use for preprocessing. If None, the main process is used for preprocessing. Default: "None"
- lora_r: The rank (r) of the LoRA update matrices. Must be a positive integer. Default: 8.
- lora_alpha: The LoRA scaling factor (alpha). Must be a positive integer. Default: 32
- lora_dropout: The dropout probability used in the LoRA layers. Must be a positive float between 0 and 1. Default: 0.05. 
- int8_quantization: If True, model is loaded with 8 bit precision for training. Default for 7B/13B: False. Default for 70B: True.
- enable_fsdp: If True, training uses Fully Sharded Data Parallelism. Default for 7B/13B: True. Default for 70B: False.

Note 1: int8_quantization is not supported with FSDP. Also, setting both int8_quantization = 'False' and enable_fsdp = 'False' is not supported on any of the g5 family instances because of CUDA memory issues. Thus, we recommend setting exactly one of int8_quantization or enable_fsdp to 'True'.

Note 2: Due to the size of the model, the 70B model cannot be fine-tuned with enable_fsdp = 'True' on any of the supported instance types.
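
The constraints in the two notes, and the rule for `max_input_length`, can be checked locally before launching a training job. The sketch below is only an illustration of the rules as stated in this section. The helper names are hypothetical, the values are passed as the strings 'True'/'False' as in the hyperparameter list, and the `model_max_length` of 4096 in the usage lines is an assumed value, not one read from the tokenizer.

```python
def _as_bool(value):
    """Interpret 'True'/'False' strings (or real booleans) as a boolean."""
    return str(value).strip().lower() == "true"


def check_llama2_hyperparameters(model_size, int8_quantization, enable_fsdp):
    """Return a list of problems with the int8/FSDP combination (empty if none).

    model_size is a hypothetical label such as "7b", "13b" or "70b".
    """
    int8 = _as_bool(int8_quantization)
    fsdp = _as_bool(enable_fsdp)
    problems = []
    if int8 and fsdp:
        problems.append("int8_quantization is not supported with FSDP")
    if not int8 and not fsdp:
        problems.append(
            "int8_quantization='False' with enable_fsdp='False' is not "
            "supported on g5 instances (CUDA memory)"
        )
    if model_size.lower() == "70b" and fsdp:
        problems.append("the 70B model cannot be fine-tuned with enable_fsdp='True'")
    return problems


def resolve_max_input_length(max_input_length, model_max_length):
    """Apply the max_input_length rule described in the hyperparameter list."""
    if max_input_length == -1:
        return min(1024, model_max_length)
    if max_input_length <= 0:
        raise ValueError("max_input_length must be a positive integer or -1")
    return min(max_input_length, model_max_length)


# Usage with illustrative values (4096 is an assumed tokenizer limit).
print(check_llama2_hyperparameters("7b", "False", "True"))
print(check_llama2_hyperparameters("70b", "False", "True"))
print(resolve_max_input_length(-1, 4096))
```

An empty list from `check_llama2_hyperparameters` means the combination satisfies both notes. It does not guarantee that the job fits in memory on a given instance.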

---

### 4. Supported Instance types

---
We have tested our scripts on the following instance types:

- 7B: ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge
- 13B: ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge
- 70B: ml.g5.48xlarge

Other instance types may also work for fine-tuning. Note: On p3 instances, training is done with 32-bit precision because bfloat16 is not supported on these instances. As a result, a training job consumes double the amount of CUDA memory on p3 instances compared to g5 instances.
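
The factor of two follows from the bytes used per value: 2 for bfloat16 and 4 for 32-bit floats. The back-of-the-envelope sketch below covers the model weights only and uses the nominal parameter counts (7B, 13B) as an assumption. Gradients, optimizer state, and activations add to the total and are not modelled here.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}


def weight_memory_gb(num_params, dtype):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal parameter counts, used for illustration only.
nominal_sizes = {"7B": 7_000_000_000, "13B": 13_000_000_000}

for name, num_params in nominal_sizes.items():
    bf16 = weight_memory_gb(num_params, "bfloat16")
    fp32 = weight_memory_gb(num_params, "float32")
    print(f"{name}: bfloat16 weights ~{bf16:.0f} GB, float32 weights ~{fp32:.0f} GB")
```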

---

### 5. Few notes about the fine-tuning method

---
- Fine-tuning scripts are based on [this repo](https://github.com/facebookresearch/llama-recipes/tree/main). 
- Instruction tuning dataset is first converted into domain adaptation dataset format before fine-tuning. 
- Fine-tuning scripts use Fully Sharded Data Parallel (FSDP) as well as the Low Rank Adaptation (LoRA) method to fine-tune the models.
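
To give a feel for what the `lora_r` and `lora_alpha` defaults mean, the sketch below applies the standard LoRA arithmetic to a single weight matrix. LoRA freezes the matrix and trains two low-rank factors of shapes `(d_out, r)` and `(r, d_in)`, and the update is scaled by `lora_alpha / r`. The 4096 x 4096 shape is an illustrative choice. Which matrices the fine-tuning scripts actually adapt is not stated here, so this is a generic illustration rather than a description of the scripts.

```python
def lora_trainable_params(d_out, d_in, r):
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


def lora_scaling(lora_alpha, r):
    """Scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / r


# Defaults from the hyperparameter list: lora_r=8, lora_alpha=32.
d_out, d_in, r, lora_alpha = 4096, 4096, 8, 32

full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, r)
print(f"full matrix parameters: {full}")
print(f"LoRA trainable parameters: {lora} ({100 * lora / full:.2f}% of the full matrix)")
print(f"update scaling factor: {lora_scaling(lora_alpha, r)}")
```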

---

### 6. Studio Kernel Dead/Creating JumpStart Model from the training Job
---
Due to the size of the Llama 70B model, the training job may take several hours, and the Studio kernel may die during the training phase. The training job itself keeps running in SageMaker. If this happens, you can still deploy the endpoint using the training job name with the following code.

To find the training job name, go to Console -> SageMaker -> Training -> Training Jobs, identify the training job name, and substitute it in the following cell.

---

In [37]:
# from sagemaker.jumpstart.estimator import JumpStartEstimator
# training_job_name = "<<training_job_name>>"  # replace with your training job name

# # model_id must be the model ID that was used to launch the training job.
# attached_estimator = JumpStartEstimator.attach(training_job_name, model_id)
# attached_estimator.logs()
# attached_estimator.deploy()
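
As an alternative to looking up the job name in the console, the most recent training jobs can be listed programmatically. The sketch below is one possible approach using the boto3 SageMaker client's `list_training_jobs` call. The helper name `latest_training_job_names` and the `name_contains` filter value in the usage comment are hypothetical, and running it against AWS requires valid credentials and a configured region.

```python
def latest_training_job_names(sagemaker_client, name_contains=None, max_results=5):
    """Return (name, status) pairs for the most recently created training jobs.

    sagemaker_client is expected to behave like boto3.client("sagemaker").
    """
    kwargs = {
        "SortBy": "CreationTime",
        "SortOrder": "Descending",
        "MaxResults": max_results,
    }
    if name_contains:
        kwargs["NameContains"] = name_contains  # optional substring filter
    response = sagemaker_client.list_training_jobs(**kwargs)
    return [
        (job["TrainingJobName"], job["TrainingJobStatus"])
        for job in response["TrainingJobSummaries"]
    ]


# Usage (requires AWS credentials; the filter string is only an example):
# import boto3
# for name, status in latest_training_job_names(boto3.client("sagemaker"), name_contains="llama"):
#     print(name, status)
```

The client is passed in as an argument so that the helper can be exercised with a stub object without contacting AWS.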