# SageMaker JumpStart - Embedding Drift Analysis Blog: Deploy Pre-Requisite Foundation Models

---
This notebook contains the pre-requisite steps required to deploy the two models used in the architecture below: an embedding model, GPT-J 6B, and a text generation model, Falcon-40B. The notebook was tested in the following notebook environment: Data Science 2.0, Python 3, ml.t3.medium.

**Note:** Deploying GPT-J 6B uses an `ml.g5.12xlarge` instance, and deploying Falcon-40B uses another `ml.g5.12xlarge`, so make sure your account limits are updated accordingly and clean up any unused endpoints.
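
One way to check the account limits mentioned above is to look up the endpoint quota for the instance type before deploying. The sketch below is a minimal example, not part of the original solution: the helper names `find_endpoint_quota` and `has_capacity` are hypothetical, and the quota naming pattern `"<instance type> for endpoint usage"` is an assumption that should be verified in the Service Quotas console.

```python
def find_endpoint_quota(quotas, instance_type):
    """Return the endpoint-usage quota value for an instance type, or None.

    `quotas` is a list of dicts with "QuotaName" and "Value" keys, i.e. the
    shape of the "Quotas" field in a Service Quotas listing.
    ASSUMPTION: the quota is named "<instance type> for endpoint usage".
    """
    wanted = f"{instance_type} for endpoint usage".lower()
    for quota in quotas:
        if quota.get("QuotaName", "").lower() == wanted:
            return quota.get("Value")
    return None


def has_capacity(quotas, instance_type, needed=1):
    """True if the quota for `instance_type` allows at least `needed` instances."""
    value = find_endpoint_quota(quotas, instance_type)
    return value is not None and value >= needed
```

The quota list could be collected with the `list_service_quotas` paginator of `boto3.client("service-quotas")` using `ServiceCode="sagemaker"`. Because both endpoints in this notebook use `ml.g5.12xlarge`, the relevant check would be `has_capacity(quotas, "ml.g5.12xlarge", needed=2)`.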

![SageMakerPrereqs](./images/sagemaker-prereq.png)

---

This notebook includes the following steps as a pre-requisite to the sample application:
  1. Set up the SageMaker environment
  2. Deploy & Validate Embedding Model
  3. Deploy & Validate Generative Model
  
Once you are done exploring the sample solution, please remember to delete unused endpoints using:

  4. Clean-Up Endpoints

## 1. Set up
Before executing the notebook, there are some initial steps required for setup.

In [None]:
!pip install ipywidgets==7.0.0 --quiet
!pip install --upgrade sagemaker --quiet

To host models on Amazon SageMaker, we need to set up and authenticate the use of AWS services. Here, we use the execution role associated with the current notebook instance as the AWS account role with SageMaker access. This role needs the necessary permissions, including access to your data in S3.

In [None]:
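import json

import boto3
import sagemaker

# Create one SageMaker session and expose it under both names used originally
sess = sagemaker.Session()
sagemaker_session = sess

# Execution role of the current notebook and the region we are running in
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name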

## 2. Deploy Embedding Model GPT-J 6B

[GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6b) is an open-source, 6 billion parameter model released by EleutherAI. It was trained on a large corpus of text data ([the Pile](https://pile.eleuther.ai/) dataset) and can perform various natural language processing tasks such as text generation, text classification, and text summarization. In this solution, GPT-J 6B is used as the embedding model.

In [None]:
model_id, model_version = "huggingface-textembedding-gpt-j-6b", "*"

### 2.1. Retrieve artifacts & deploy an endpoint

To host the pre-trained model, we create an instance of [`sagemaker.jumpstart.model.JumpStartModel`](https://sagemaker.readthedocs.io/en/stable/overview.html#deploy-a-pre-trained-model-directly-to-a-sagemaker-endpoint) and deploy it.

In [None]:
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.serializers import JSONSerializer
from sagemaker.utils import name_from_base


# Generate a unique name; note that `name` below sets the SageMaker model name
endpoint_name = name_from_base(f"jumpstart-embedding-{model_id}")

model = JumpStartModel(model_id=model_id, model_version=model_version, name=endpoint_name)

base_model_predictor = model.deploy()
base_model_predictor.serializer = JSONSerializer()
base_model_predictor.content_type = "application/json"
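
To validate the embedding endpoint, one option is to send a few sentences and compare the returned vectors. The sketch below is a minimal example under stated assumptions: it assumes the endpoint accepts a JSON body of the form `{"text_inputs": [...]}` and returns `{"embedding": [...]}`, which should be verified against the model's documentation. The helper names (`build_embedding_payload`, `parse_embeddings`, `cosine_similarity`, `validate_embedding_endpoint`) are hypothetical.

```python
import json
import math


def build_embedding_payload(texts):
    """Build the request body (ASSUMED format: {"text_inputs": [...]})."""
    return {"text_inputs": list(texts)}


def parse_embeddings(response):
    """Extract embedding vectors from a response (ASSUMED key: "embedding").

    Accepts a dict, or raw JSON as str/bytes, because the predictor's
    deserializer may or may not have decoded the body already.
    """
    if isinstance(response, (bytes, bytearray)):
        response = response.decode("utf-8")
    if isinstance(response, str):
        response = json.loads(response)
    return response["embedding"]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def validate_embedding_endpoint(embed_predictor, texts):
    """Send texts to the endpoint and return one vector per input text."""
    response = embed_predictor.predict(build_embedding_payload(texts))
    embeddings = parse_embeddings(response)
    if len(embeddings) != len(texts):
        raise ValueError("expected one embedding per input text")
    return embeddings
```

With the endpoint deployed, calling `validate_embedding_endpoint(base_model_predictor, [...])` on two related sentences and one unrelated sentence, then comparing the vectors with `cosine_similarity`, gives a quick sanity check that related text scores higher.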

<!-- ### 2.2. Validate deployment through simple query 

The model takes a text string as input and predicts the next words in the sequence. We use the following three input examples.

1. `This Form 10-K report shows that`
2. `We serve consumers through`
3. `Our vision is`

**The input examples relate to a company's performance in a financial report. The outputs from the model without fine-tuning provide limited insight.** -->

In [None]:
# parameters = {
#     "max_length": 400,
#     "num_return_sequences": 1,
#     "top_k": 250,
#     "top_p": 0.8,
#     "do_sample": True,
#     "temperature": 1,
# }

# res_gpt_before_finetune = []
# for quota_text in [
#     "This Form 10-K report shows that",
#     "We serve consumers through",
#     "Our vision is",
# ]:
#     payload = {"text_inputs": f"{quota_text}:", **parameters}
#     generated_texts = base_model_predictor.predict(payload)[0][0]["generated_text"]
#     res_gpt_before_finetune.append(generated_texts)
#     print(generated_texts)
#     print("\n")

## 3. Deploy Generative Model Falcon-40B

The Falcon model is a permissively licensed ([Apache-2.0](https://jumpstart-cache-prod-us-east-2.s3.us-east-2.amazonaws.com/licenses/Apache-License/LICENSE-2.0.txt)) open-source model trained on the [RefinedWeb dataset](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). The sample application uses it to generate answers to prompts in a RAG-based architecture, with external knowledge retrieved from a vector store (OpenSearch).


In [None]:
model_id, model_version = "huggingface-llm-falcon-40b-bf16", "*"

### 3.1. Retrieve artifacts & deploy an endpoint

To host the pre-trained model, we create an instance of [`sagemaker.jumpstart.model.JumpStartModel`](https://sagemaker.readthedocs.io/en/stable/overview.html#deploy-a-pre-trained-model-directly-to-a-sagemaker-endpoint) and deploy it.

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

my_model = JumpStartModel(model_id=model_id, model_version=model_version)
predictor = my_model.deploy()

### 3.2. Validate deployment through simple query 

---
Falcon is a causal decoder-only model built by [Technology Innovation Institute](https://www.tii.ae/) (TII) and trained on more than 1 trillion tokens of RefinedWeb enhanced with curated corpora. It was developed with custom tooling for data pre-processing and model training on Amazon SageMaker. As of June 6, 2023, Falcon-40B ranked above other open-source models such as LLaMA, StableLM, RedPajama, and MPT on the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Its architecture is optimized for inference, with FlashAttention and multiquery attention.


[RefinedWeb dataset](https://huggingface.co/datasets/tiiuae/falcon-refinedweb): Falcon RefinedWeb is a massive English web dataset built by TII and released under an Apache 2.0 license. It is a heavily filtered, large-scale de-duplication of CommonCrawl. Models trained on RefinedWeb have been observed to match or outperform models trained on curated datasets while relying only on web data.

**Model Sizes:**
- **Falcon-7B**: A 7 billion parameter model trained on 1.5 trillion tokens. It outperforms comparable open-source models (e.g., MPT-7B, StableLM, RedPajama); see the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) for a comparison. To use this model, set `model_id` in the cell above to `"huggingface-llm-falcon-7b-bf16"`.
- **Falcon-40B**: A 40 billion parameter model trained on 1 trillion tokens. It has surpassed models such as LLaMA-65B, StableLM, RedPajama, and MPT on the public leaderboard maintained by Hugging Face without specialized fine-tuning; see the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) for a comparison.

**Instruct models (Falcon-7B-instruct/Falcon-40B-instruct):** Instruct models are base Falcon models fine-tuned on a mixture of chat and instruction datasets. They are ready-to-use chat/instruct models. To use these models, set `model_id` in the cell above to `"huggingface-llm-falcon-7b-instruct-bf16"` or `"huggingface-llm-falcon-40b-instruct-bf16"`.

It is [recommended](https://huggingface.co/tiiuae/falcon-7b) that instruct models be used without further fine-tuning and that base models be fine-tuned on the specific task.

**Limitations:**

- Falcon models are trained mostly on English data and may not generalize well to other languages.
- Falcon carries the stereotypes and biases commonly encountered online and in its training data, so develop guardrails and take appropriate precautions for any production use. The base model is a raw, pretrained model and should be fine-tuned further for most use cases.


---

In [None]:
%%time


prompt = "Tell me about Amazon SageMaker."

payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 1024,
        "stop": ["<|endoftext|>", "</s>"]
    }
}

response = predictor.predict(payload)
print(response[0]["generated_text"])

In [None]:
def query_endpoint(payload):
    """Query endpoint and print the response"""
    response = predictor.predict(payload)
    print(f"\033[1m Input:\033[0m {payload['inputs']}")
    print(f"\033[1m Output:\033[0m {response[0]['generated_text']}")
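
In the sample application, Falcon answers questions using context retrieved from the vector store. The sketch below shows how such a request might be assembled, reusing the `inputs`/`parameters` payload format from the cell above. The prompt template and the helper name `build_rag_payload` are hypothetical and only illustrate the idea; they are not the application's actual implementation.

```python
def build_rag_payload(question, passages, max_new_tokens=256):
    """Assemble a text-generation payload whose prompt embeds retrieved passages.

    The prompt template is illustrative only (an assumption, not the
    template used by the sample application).
    """
    # Drop empty passages and render the rest as a bulleted context block
    context = "\n".join(f"- {p.strip()}" for p in passages if p.strip())
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return {
        "inputs": prompt,
        "parameters": {
            # Greedy decoding is chosen here to keep answers close to the context
            "do_sample": False,
            "max_new_tokens": max_new_tokens,
            "stop": ["<|endoftext|>", "</s>"],
        },
    }
```

With the endpoint deployed, the result can be passed to the `query_endpoint` helper defined above, for example `query_endpoint(build_rag_payload("What is Amazon SageMaker?", ["Amazon SageMaker is a managed machine learning service."]))`.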

## 4. Clean up the endpoints

In [None]:
# Delete the SageMaker endpoint for the embedding model and the attached resources
base_model_predictor.delete_model()
base_model_predictor.delete_endpoint()

# Delete the SageMaker endpoint for the generative model and the attached resources
predictor.delete_model()
predictor.delete_endpoint()
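
If one of the deletions above fails (for example, because an endpoint was already removed), the remaining calls in the cell do not run. A more defensive variant might look like the sketch below; the helper name `cleanup_predictors` is hypothetical, and it relies only on the `delete_model` and `delete_endpoint` methods already used in this notebook.

```python
def cleanup_predictors(predictors):
    """Delete the model and endpoint behind each predictor, continuing on errors.

    `predictors` maps a label to a predictor object. Returns a list of
    (step_label, error) tuples for every step that failed.
    """
    failures = []
    for label, pred in predictors.items():
        for step in ("delete_model", "delete_endpoint"):
            try:
                getattr(pred, step)()
            except Exception as err:  # broad on purpose: keep cleaning up
                failures.append((f"{label}.{step}", err))
    return failures
```

For this notebook, the call would be `cleanup_predictors({"embedding": base_model_predictor, "generative": predictor})`; an empty list means both endpoints and their models were deleted.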