# Continued pre-training Llama 2 models on SageMaker JumpStart

In [None]:
%pip install -U datasets==2.15.0

In [None]:
%pip install -U --force-reinstall \
             langchain==0.0.324 \
             typing_extensions==4.7.1 \
             pypdf==3.16.4

## Deploy Pre-trained Model

---

First, we deploy the Llama 2 7B model as a SageMaker endpoint. To train or deploy the 13B or 70B model instead, change `model_id` to "meta-textgeneration-llama-2-13b" or "meta-textgeneration-llama-2-70b", respectively.

---
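
As a small illustration of the naming scheme described above, the sketch below builds a `model_id` from a size label. The helper `llama2_model_id` is hypothetical (it is not part of the SageMaker SDK), and the ID pattern is inferred only from the IDs named in this notebook.

```python
# Hypothetical helper, not part of the SageMaker SDK.
# The ID pattern is an assumption inferred from the IDs named in this notebook.
SUPPORTED_SIZES = ("7b", "13b", "70b")


def llama2_model_id(size):
    """Return the JumpStart model_id string for a Llama 2 size label."""
    size = size.lower()
    if size not in SUPPORTED_SIZES:
        raise ValueError(f"Unsupported size {size!r}; expected one of {SUPPORTED_SIZES}")
    return f"meta-textgeneration-llama-2-{size}"


print(llama2_model_id("13b"))  # meta-textgeneration-llama-2-13b
```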

In [4]:
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [5]:
model_id, model_version = "meta-textgeneration-llama-2-7b", "2.*"

In [6]:
from sagemaker.jumpstart.model import JumpStartModel

pretrained_model = JumpStartModel(model_id=model_id, model_version=model_version)
pretrained_predictor = pretrained_model.deploy()

For forward compatibility, pin to model_version='2.*' in your JumpStartModel or JumpStartEstimator definitions. Note that major version upgrades may have different EULA acceptance terms and input/output signatures.
Using model 'meta-textgeneration-llama-2-7b' with wildcard version identifier '2.*'. You can pin to version '2.1.8' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


--------------!

## Invoke the endpoint

---
Next, we invoke the endpoint with a sample query. Later in this notebook, we fine-tune this model on a custom dataset and run inference with the fine-tuned model, so that the results from the pre-trained and fine-tuned models can be compared.

---

In [2]:
def print_response(payload, response):
    print(payload["inputs"])
    print(f"> {response[0]['generation']}")
    print("\n==================================\n")

In [3]:
# payload = {
#     "inputs": "What is the size of the Amazon consumer business in 2022?",
#     "parameters": {
#         "max_new_tokens": 100,  # assumed value, matching the later query
#         "top_p": 0.9,
#         "temperature": 0.6,
#         "return_full_text": False,
#     },
# }
# try:
#     response = pretrained_predictor.predict(payload, custom_attributes="accept_eula=true")
#     print_response(payload, response)
# except Exception as e:
#     print(e)
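
The request and response shapes used in this notebook can be wrapped in two small helpers. This is a minimal sketch: `build_payload` and `extract_generation` are hypothetical names, and the `top_p` key and the default values are assumptions rather than settings confirmed by the notebook.

```python
def build_payload(prompt, max_new_tokens=100, top_p=0.9, temperature=0.6):
    """Return a request body in the shape used by the cells in this notebook.

    The default values are assumptions chosen for illustration.
    """
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
            "return_full_text": False,
        },
    }


def extract_generation(response):
    """Return the generated text, or None if the response has another shape."""
    try:
        return response[0]["generation"]
    except (IndexError, KeyError, TypeError):
        return None
```

A payload built this way can be passed to `predict` exactly like the hand-written dictionaries in the other cells.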

### Dataset formatting for continued pre-training

#### Domain adaptation fine-tuning
The text generation model can also be fine-tuned on any domain-specific dataset. After fine-tuning, the model
is expected to generate domain-specific text and to solve various NLP tasks in that domain with **few-shot prompting**.

Below are the instructions for how the training data should be formatted for input to the model.

- **Input:** A train and an optional validation directory. Each directory contains a CSV/JSON/TXT file. 
  - For CSV/JSON files, the train or validation data is used from the column called 'text' or the first column if no column called 'text' is found.
  - The train directory and the validation directory (if provided) must each contain exactly one file.
- **Output:** A trained model that can be deployed for inference. 
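
The column-selection rule for CSV input can be mimicked locally to check which column would be read. The sketch below assumes the CSV has a header row; `select_text_column` is a hypothetical helper that mirrors the rule as stated above, not the loader that JumpStart actually uses.

```python
import csv
import io


def select_text_column(csv_text):
    """Return values from the 'text' column, or from the first column if absent."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # assumes the first row is a header
    index = header.index("text") if "text" in header else 0
    return [row[index] for row in reader if row]


sample = "id,text\n1,Net sales increased.\n2,Operating income decreased.\n"
print(select_text_column(sample))
# ['Net sales increased.', 'Operating income decreased.']
```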

In [26]:
import boto3

region = boto3.Session().region_name
s3_prefix = f"s3://jumpstart-cache-prod-{region}/training-datasets/sec_amazon"
s3_location = f"{s3_prefix}/AMZN_2021_2022_train_js.txt"

In [27]:
!aws s3 cp $s3_location ./

download: s3://jumpstart-cache-prod-us-east-1/training-datasets/sec_amazon/AMZN_2021_2022_train_js.txt to ./AMZN_2021_2022_train_js.txt


In [30]:
!head AMZN_2021_2022_train_js.txt

PART I

Item 1. Business  

This Annual Report on Form 10-K and the documents incorporated herein by
reference contain forward-looking statements based on expectations, estimates,
and projections as of the date of this filing. Actual results and outcomes may
differ materially from those expressed in forward-looking statements. See Item
1A of Part I — “Risk Factors.” As used herein, “Amazon.com,” “we,” “our,” and
similar terms include Amazon.com, Inc. and its subsidiaries, unless the


In [15]:
import boto3
from sagemaker.jumpstart.estimator import JumpStartEstimator

model_id, model_version = "meta-textgeneration-llama-2-7b", "2.*"

estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    environment={"accept_eula": "true"},
    instance_type="ml.g5.24xlarge",
)

estimator.set_hyperparameters(instruction_tuned="False", epoch="5")

estimator.fit({"training": s3_prefix})

INFO:sagemaker:Creating training-job with name: meta-textgeneration-llama-2-7b-2024-01-01-21-08-53-456


2024-01-01 21:08:53 Starting - Starting the training job......
2024-01-01 21:09:30 Starting - Preparing the instances for training....................................
2024-01-01 21:15:42 Downloading - Downloading input data...........................
2024-01-01 21:20:28 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-01-01 21:20:30,087 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-01-01 21:20:30,141 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-01-01 21:20:30,150 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-01-01 21:20:30,152 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-01-01 21:20:38,279 sagemaker-training-tool

### Deploy the continued pretrained model

In [16]:
cpt_predictor = estimator.deploy()

No instance type selected for inference hosting endpoint. Defaulting to ml.g5.2xlarge.
INFO:sagemaker.jumpstart:No instance type selected for inference hosting endpoint. Defaulting to ml.g5.2xlarge.
INFO:sagemaker:Creating model with name: meta-textgeneration-llama-2-7b-2024-01-01-21-35-23-514
INFO:sagemaker:Creating endpoint-config with name meta-textgeneration-llama-2-7b-2024-01-01-21-35-23-511
INFO:sagemaker:Creating endpoint with name meta-textgeneration-llama-2-7b-2024-01-01-21-35-23-511


------------!

In [18]:
payload = {
    "inputs": "What is the size of the Amazon consumer business in 2022?",
    "parameters": {"max_new_tokens": 100},
}

try:
    response = cpt_predictor.predict(payload, custom_attributes="accept_eula=true")
    print_response(payload, response)
except Exception as e:
    print(e)

What is the size of the Amazon consumer business in 2022?
>  Amazon.com Inc.'s consumer segment generated $811.0 billion in revenue in 2022, a 0.3% increase from 2021. The company generated $118.9 billion in the United States in 2022.
What is the size of the Amazon Prime subscriber base in 2022? Amazon claims its Prime subscriber base reached 247 million in 2022.
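
To compare the pre-trained and the continued pre-training endpoints on the same prompt, a helper along the following lines could be used. This is a sketch: `compare_generations` is a hypothetical name, and it only assumes that each predictor exposes `predict(payload, custom_attributes=...)` and returns a list whose first item has a `generation` key, as in the cells of this notebook.

```python
def compare_generations(predictors, payload):
    """Query each named predictor with the same payload and collect the text."""
    results = {}
    for name, predictor in predictors.items():
        try:
            response = predictor.predict(payload, custom_attributes="accept_eula=true")
            results[name] = response[0]["generation"]
        except Exception as exc:  # keep going if one endpoint fails
            results[name] = f"<error: {exc}>"
    return results


# Usage (requires both live endpoints created earlier in this notebook):
# outputs = compare_generations(
#     {"pre-trained": pretrained_predictor, "continued pre-training": cpt_predictor},
#     payload,
# )
# for name, text in outputs.items():
#     print(f"[{name}]\n{text}\n")
```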




In [19]:
# # Delete resources
# pretrained_predictor.delete_model()
# pretrained_predictor.delete_endpoint()
# cpt_predictor.delete_model()
# cpt_predictor.delete_endpoint()
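
The cleanup calls above can be grouped so that one failed deletion does not prevent the remaining ones. A minimal sketch, assuming only that each predictor has the `delete_model()` and `delete_endpoint()` methods used above; `delete_resources` is a hypothetical helper name.

```python
def delete_resources(*predictors):
    """Delete the model and endpoint behind each predictor.

    Returns a list of (action, error message) pairs for calls that failed.
    """
    failures = []
    for predictor in predictors:
        for action in ("delete_model", "delete_endpoint"):
            try:
                getattr(predictor, action)()
            except Exception as exc:  # record the failure and continue
                failures.append((action, str(exc)))
    return failures


# Usage (uncomment to remove the endpoints and stop incurring charges):
# print(delete_resources(pretrained_predictor, cpt_predictor))
```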