# Introduction to SageMaker JumpStart - Text Generation with Falcon models


---
In this demo notebook, we demonstrate how to use the SageMaker Python SDK to deploy Falcon models for text generation. Falcon is a permissively licensed ([Apache-2.0](https://jumpstart-cache-prod-us-east-2.s3.us-east-2.amazonaws.com/licenses/Apache-License/LICENSE-2.0.txt)) open-source model trained on the [RefinedWeb dataset](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). We show several example use cases, including code generation, question answering, and translation.

---

In [None]:
!pip uninstall -y sagemaker --quiet
!pip install sagemaker --quiet

In [None]:
model_id, model_version = (
    "huggingface-textgeneration-falcon-7b-instruct-bf16",
    "*",
)

In [None]:
%%time
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.serializers import JSONSerializer


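# Pass the version selected above so that `model_version` is actually used
my_model = JumpStartModel(model_id=model_id, model_version=model_version)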
predictor = my_model.deploy()

In [None]:
%%time

predictor.serializer = JSONSerializer()
predictor.content_type = "application/json"


payload = {
    "text_inputs": "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "max_new_tokens": 50,
    "return_full_text": False,
    "do_sample": True,
    "top_k": 10,
}

response = predictor.predict(payload)
print(response["generated_texts"][0])

### About the model

---
Falcon is a causal decoder-only model built by [Technology Innovation Institute](https://www.tii.ae/) (TII) and trained on more than 1 trillion tokens of RefinedWeb enhanced with curated corpora. It was developed with custom data pre-processing and model training tooling built on Amazon SageMaker. As of June 6, 2023, Falcon-40B ranks first among open-source models on the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), ahead of LLaMA, StableLM, RedPajama, and MPT. Its architecture is optimized for inference, with FlashAttention and multiquery attention.


[RefinedWeb dataset](https://huggingface.co/datasets/tiiuae/falcon-refinedweb): Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license. It is built from CommonCrawl through heavy filtering and large-scale deduplication. Models trained on RefinedWeb alone have been observed to match or outperform models trained on curated datasets, even though they rely only on web data.

**Model Sizes:**
- **Falcon-7B**: A 7 billion parameter model trained on 1.5 trillion tokens. It outperforms comparable open-source models (e.g., MPT-7B, StableLM, and RedPajama). For a comparison, see the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). To use this model, set `model_id` in the cell above to "huggingface-textgeneration-falcon-7b-bf16".
- **Falcon-40B**: A 40 billion parameter model trained on 1 trillion tokens. It has surpassed well-known models such as LLaMA-65B, StableLM, RedPajama, and MPT on the public leaderboard maintained by Hugging Face, without specialized fine-tuning. For a comparison, see the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

**Instruct models (Falcon-7B-instruct/Falcon-40B-instruct):** Instruct models are base Falcon models fine-tuned on a mixture of chat and instruction datasets. They are ready-to-use chat/instruct models. To use these models, set `model_id` in the cell above to "huggingface-textgeneration-falcon-7b-instruct-bf16" or "huggingface-textgeneration-falcon-40b-instruct-bf16".

It is [recommended](https://huggingface.co/tiiuae/falcon-7b) to use the Instruct models without further fine-tuning and to fine-tune the base models on the specific task.

**Limitations:**

- Falcon models are mostly trained on English data and may not generalize to other languages. 
- Falcon carries the stereotypes and biases commonly encountered online and in its training data. We therefore recommend developing guardrails and taking appropriate precautions for any production use. The base models are raw, pretrained models and should be fine-tuned further for most use cases.


---

In [None]:
def query_endpoint(predictor, payload):
    """Query endpoint and print the response"""
    response = predictor.predict(payload)
    print(f"\033[1m Input:\033[0m {payload['text_inputs']}")
    print(f"\033[1m Output:\033[0m {response['generated_texts'][0]}")
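The helper above prints only the first entry of `generated_texts`. When a payload requests several sequences (for example with `num_return_sequences`), it can help to show all of them. The following is a minimal sketch, assuming the response keeps the `{"generated_texts": [...]}` shape used above; `format_generations` is a hypothetical helper name, and the sample response is hand-written for illustration rather than real model output.

```python
def format_generations(payload, response):
    """Return one formatted line for the input and one per generated sequence.

    Assumes `response` has the {"generated_texts": [...]} shape used by
    query_endpoint above. Illustrative helper, not part of the SageMaker SDK.
    """
    lines = [f"Input: {payload['text_inputs']}"]
    for i, text in enumerate(response.get("generated_texts", []), start=1):
        lines.append(f"Output {i}: {text.strip()}")
    return lines


# Offline illustration with a hand-written response (not real model output).
example_payload = {"text_inputs": "Hello", "num_return_sequences": 2}
example_response = {"generated_texts": [" Hi there!", " Hello to you."]}
for line in format_generations(example_payload, example_response):
    print(line)
```

With a live endpoint, the same helper could be called as `format_generations(payload, predictor.predict(payload))`.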

In [None]:
%%time
# Code Generation
payload = {"text_inputs": "Write a program to compute factorial in python:", "max_length": 100}
query_endpoint(predictor, payload)

Next, we explore a variety of prompts originally written for [Llama models](https://github.com/facebookresearch/llama/blob/main/example.py).

In [None]:
%%time

payload = {
    "text_inputs": "Building a website can be done in 10 simple steps:",
    "max_length": 110,
    "no_repeat_ngram_size": 3,
}
query_endpoint(predictor, payload)

In [None]:
# Translation
payload = {
    "text_inputs": """Translate English to French:

sea otter => loutre de mer

peppermint => menthe poivrée

plush girafe => girafe peluche

cheese =>""",
    "max_length": 50,
}
query_endpoint(predictor, payload)
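The translation prompt above follows a few-shot pattern: an instruction, a handful of solved examples, and a final item left open for the model to complete. A small sketch of how such a prompt could be assembled programmatically is shown here; `build_few_shot_prompt` and `few_shot_payload` are hypothetical names introduced for illustration.

```python
def build_few_shot_prompt(instruction, examples, query, arrow=" => "):
    """Assemble a few-shot prompt: instruction, solved examples, then the open query."""
    parts = [instruction]
    for source, target in examples:
        parts.append(f"{source}{arrow}{target}")
    # Leave the final item unanswered so the model completes it.
    parts.append(f"{query}{arrow.rstrip()}")
    # Blank lines separate the items, as in the hand-written prompt above.
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(
    "Translate English to French:",
    [
        ("sea otter", "loutre de mer"),
        ("peppermint", "menthe poivrée"),
        ("plush girafe", "girafe peluche"),
    ],
    "cheese",
)
few_shot_payload = {"text_inputs": prompt, "max_length": 50}
print(prompt)
```

The resulting `few_shot_payload` can be passed to `query_endpoint` in the same way as the hand-written payload.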

In [None]:
# Sentiment-analysis
payload = {
    "text_inputs": """Tweet: "I hate it when my phone battery dies."
                Sentiment: Negative
                ###
                Tweet: "My day has been :+1:"
                Sentiment: Positive
                ###
                Tweet: "This is the link to the article"
                Sentiment: Neutral
                ###
                Tweet: "This new music video was incredibile"
                Sentiment:"""
}
query_endpoint(predictor, payload)

In [None]:
# Question answering
payload = {
    "text_inputs": "Could you remind me when was the C programming language invented?",
    "max_length": 34,
}
query_endpoint(predictor, payload)

In [None]:
# Recipe generation
payload = {"text_inputs": "What is the recipe for a delicious lemon cheesecake?", "max_length": 70}
query_endpoint(predictor, payload)

### Supported Parameters

***
This model supports many parameters while performing inference. They include:

* **max_length:** Model generates text until the output length (which includes the input context length) reaches `max_length`. If specified, it must be a positive integer.
* **num_return_sequences:** Number of output sequences returned. If specified, it must be a positive integer.
* **num_beams:** Number of beams used in beam search. If specified, it must be an integer greater than or equal to `num_return_sequences`.
* **no_repeat_ngram_size:** Model ensures that no sequence of words of length `no_repeat_ngram_size` is repeated in the output sequence. If specified, it must be a positive integer greater than 1.
* **temperature:** Controls the randomness in the output. A higher temperature makes low-probability words more likely to appear in the output, while a lower temperature favors high-probability words. As `temperature` approaches 0, decoding becomes greedy. If specified, it must be a positive float.
* **early_stopping:** If True, text generation is finished when all beam hypotheses reach the end-of-sequence token. If specified, it must be boolean.
* **do_sample:** If True, the next word is sampled according to its predicted probability. If specified, it must be boolean.
* **top_k:** In each step of text generation, sample from only the `top_k` most likely words. If specified, it must be a positive integer.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **seed:** Fix the randomized state for reproducibility. If specified, it must be an integer.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.
* **stopping_criteria:** If specified, it must be a list of strings. Text generation stops if any one of the specified strings is generated.
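To make `temperature`, `top_k`, and `top_p` more concrete, here is a toy sketch of how they reshape the distribution over the next word in a single decoding step. It is a simplified illustration in plain Python, not the code that runs on the endpoint; the function name and the toy logits are made up for this example.

```python
import math


def _normalize(pairs):
    """Rescale (token, weight) pairs so the weights sum to 1."""
    total = sum(weight for _, weight in pairs)
    return [(tok, weight / total) for tok, weight in pairs]


def filter_next_token_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Return the distribution that sampling would draw the next token from."""
    # Temperature divides the logits before the softmax: values below 1
    # sharpen the distribution, values above 1 flatten it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = [(tok, math.exp(score - peak)) for tok, score in scaled.items()]
    ranked = sorted(_normalize(weights), key=lambda item: item[1], reverse=True)

    # top_k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = _normalize(ranked[:top_k])

    # top_p: keep the smallest set of tokens whose cumulative probability
    # reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = _normalize(kept)

    return dict(ranked)


toy_logits = {"giraffe": 3.0, "zebra": 2.0, "lion": 1.0, "otter": 0.0}
print(filter_next_token_probs(toy_logits, top_k=2))
print(filter_next_token_probs(toy_logits, temperature=0.1))
print(filter_next_token_probs(toy_logits, top_p=0.5))
```

With a low temperature almost all of the probability mass lands on the most likely word, which is why very small temperatures behave like greedy decoding.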

We may specify any subset of the parameters mentioned above while invoking an endpoint.
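Because a malformed payload only fails once it reaches the endpoint, it can be convenient to check it locally first. The sketch below encodes the constraints listed above in a hypothetical `validate_payload` helper. It is an illustration of those rules, not the validation the endpoint itself performs, and it ignores keys that are not in the list (such as `text_inputs`).

```python
def validate_payload(payload):
    """Return a list of problems found in `payload`, per the constraints above."""
    problems = []

    def is_int(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)

    def is_number(value):
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    for name in ("max_length", "num_return_sequences", "num_beams", "top_k"):
        if name in payload and not (is_int(payload[name]) and payload[name] > 0):
            problems.append(f"{name} must be a positive integer")

    if "no_repeat_ngram_size" in payload:
        value = payload["no_repeat_ngram_size"]
        if not (is_int(value) and value > 1):
            problems.append("no_repeat_ngram_size must be an integer greater than 1")

    if "temperature" in payload:
        value = payload["temperature"]
        if not (is_number(value) and value > 0):
            problems.append("temperature must be a positive float")

    if "top_p" in payload:
        value = payload["top_p"]
        # Assumption: the bounds 0 and 1 themselves are accepted.
        if not (is_number(value) and 0 <= value <= 1):
            problems.append("top_p must be a float between 0 and 1")

    for name in ("early_stopping", "do_sample", "return_full_text"):
        if name in payload and not isinstance(payload[name], bool):
            problems.append(f"{name} must be boolean")

    if "seed" in payload and not is_int(payload["seed"]):
        problems.append("seed must be an integer")

    if "stopping_criteria" in payload:
        value = payload["stopping_criteria"]
        if not (isinstance(value, list) and all(isinstance(s, str) for s in value)):
            problems.append("stopping_criteria must be a list of strings")

    # Cross-field rule: only checked when both values are given.
    beams, sequences = payload.get("num_beams"), payload.get("num_return_sequences")
    if is_int(beams) and is_int(sequences) and beams < sequences:
        problems.append("num_beams must be >= num_return_sequences")

    return problems


print(validate_payload({"text_inputs": "Hello", "top_p": 1.5, "num_beams": 0}))
```

An empty list means that none of the listed constraints is violated.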

***

### Impact of Inference parameters

#### Stopping criteria
---

The `stopping_criteria` parameter stops text generation as soon as one of the specified strings is generated.
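The effect on the returned text can be pictured with a small client-side sketch that cuts a string at the earliest stop string. This only mimics the visible result: on the endpoint, generation itself ends early. Whether the endpoint keeps or drops the stop string in its output is an assumption here (the sketch drops it), and `truncate_at_stop` and the sample text are made up for illustration.

```python
def truncate_at_stop(text, stopping_criteria):
    """Cut `text` at the earliest occurrence of any stop string.

    Assumption: the stop string itself is dropped from the result. The
    endpoint may behave differently; check its actual output to confirm.
    """
    cut = len(text)
    for stop in stopping_criteria:
        if not stop:
            continue  # an empty stop string would match at position 0
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# Hand-written sample text, not real model output.
sample = " Hello, Daniel! Giraffes are glorious.\nDaniel: Why?\nGirafatron: Because."
print(truncate_at_stop(sample, ["Daniel:"]))
```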

---

In [None]:
payload = {
    "text_inputs": "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "max_new_tokens": 50,
    "return_full_text": False,
    "do_sample": True,
    "top_k": 10,
}
print("Text generation without using stopping criteria:")
query_endpoint(predictor, payload)

In [None]:
payload = {
    "text_inputs": "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "max_new_tokens": 50,
    "return_full_text": False,
    "do_sample": True,
    "top_k": 10,
    "stopping_criteria": ["Daniel:"],
}
print("Text generation with stopping criteria:")
query_endpoint(predictor, payload)

### Instance types
---
Deploying large models such as Falcon requires the accelerated computing provided by GPU instances. GPU instances differ in inference speed, cost, and GPU memory. JumpStart provides a default instance type that works well with the examples above. However, if you change the input payload (e.g., by setting a high `num_return_sequences` or `num_beams`), you may run into CUDA out-of-memory errors. In that case, we recommend selecting an instance type with more GPU memory, which you can specify during deployment:

`my_model = JumpStartModel(model_id=model_id, instance_type="ml.g5.12xlarge")`


To learn more about the cost and the GPU memory available on each instance type, please see [this documentation](https://aws.amazon.com/ec2/instance-types/).
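As a rough guide when comparing instance types, the memory needed just to hold the model weights can be estimated from the parameter count and the numeric precision. The sketch below uses the nominal parameter counts quoted earlier (7 billion and 40 billion) and 2 bytes per parameter for bf16. It deliberately ignores activations, the attention cache, and the extra memory used by multiple beams or returned sequences, so the real requirement is higher; `weight_memory_gib` is a hypothetical helper.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype="bf16"):
    """Approximate memory needed to hold the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for name, params in [("Falcon-7B", 7e9), ("Falcon-40B", 40e9)]:
    print(f"{name}: ~{weight_memory_gib(params):.0f} GiB of weights in bf16")
```

Comparing this estimate with the total GPU memory of a candidate instance type gives a first indication of whether the model fits at all, before accounting for inference overhead.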




---

### Clean up the endpoint

In [None]:
# Delete the SageMaker model and endpoint to stop incurring charges
predictor.delete_model()
predictor.delete_endpoint()