# Using MistralLite with Hugging Face Transformers and FlashAttention-2

This notebook walks through text generation with the open-source MistralLite model using the Hugging Face Transformers library and FlashAttention-2. We load the model and tokenizer, create a text-generation pipeline, test it on a short prompt, and then run it on a longer prompt read from a file. FlashAttention-2 is enabled to reduce attention memory use and latency on long-context inputs.

## Installing packages for local deployment
The following cells install the libraries needed to run MistralLite on a SageMaker notebook instance of type ml.g5.2xlarge. Run them to install the packages and restart the kernel.

In [None]:
!pip install transformers==4.34.0 accelerate==0.23.0

In [None]:
!pip install flash-attn==2.3.1.post1 --no-build-isolation

In [None]:
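from IPython.display import display_html

def restartkernel():
    # Ask the front end to restart the kernel so the new packages are picked up.
    # The `Jupyter` JavaScript object is only available in the classic Notebook UI.
    display_html("<script>Jupyter.notebook.kernel.restart()</script>", raw=True)

restartkernel()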
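
After the kernel restarts, it can be useful to confirm that the installed versions match the ones pinned in the install cells. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the `EXPECTED` mapping are hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install cells of this notebook.
EXPECTED = {
    "transformers": "4.34.0",
    "accelerate": "0.23.0",
    "flash-attn": "2.3.1.post1",
}


def installed_versions(packages):
    """Return {package: installed version, or None if the package is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report each pinned package and whether the installed version matches.
for name, got in installed_versions(EXPECTED).items():
    status = "ok" if got == EXPECTED[name] else "MISMATCH"
    print(f"{name}: expected {EXPECTED[name]}, found {got} [{status}]")
```

A `MISMATCH` line suggests the install cell did not complete or the kernel was not restarted.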

## Deploying MistralLite using a Transformers Pipeline

Run the following cell to load the pretrained MistralLite model and tokenizer and create a text-generation pipeline from them. The pipeline handles prompt tokenization and text generation. Loading and initializing the model may take a few minutes.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import torch

model_id = "amazon/MistralLite"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.bfloat16,
                                             use_flash_attention_2=True,
                                             device_map="auto",)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

## Question-Answering Examples

The following two cells send example queries to the loaded model: a simple question-answering prompt and a long-context prompt built from Amazon Aurora documentation.

In [None]:
prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"

sequences = pipeline(
    prompt,
    max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"{seq['generated_text']}")
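
The prompts in this notebook wrap the user text in the same `<|prompter|>...</s><|assistant|>` template. A small helper can keep that formatting in one place. This is only a sketch; `build_prompt` is a hypothetical name introduced here for illustration.

```python
def build_prompt(user_text: str) -> str:
    """Wrap user text in the prompt template used by the cells in this notebook."""
    return f"<|prompter|>{user_text}</s><|assistant|>"


# Reproduces the prompt string from the short question-answering example.
print(build_prompt("What are the main challenges to support a long context for LLM?"))
```

The returned string can be passed to `pipeline(...)` in place of a hand-written prompt.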

In [None]:
with open("../example_long_ctx.txt", "r", encoding="utf-8") as fin:
    task_instruction = fin.read()
    task_instruction = task_instruction.format(
        my_question="please tell me how does pgvector help with Generative AI and give me some examples."
    )
prompt = f"<|prompter|>{task_instruction}</s><|assistant|>"
sequences = pipeline(
    prompt,
    max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    #eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"{seq['generated_text']}")
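
A prompt read from a file can be long, so it may help to check that the prompt plus the generation budget fits in the model's context window before calling the pipeline. The sketch below is one way to do that. The function name `fits_in_context` is hypothetical, and the 32,768-token default is an assumption based on the context length advertised for MistralLite; adjust it if your setup differs. The function accepts any callable that returns a mapping with an `input_ids` list, which is how Hugging Face tokenizers behave when called on a single string.

```python
def fits_in_context(prompt, tokenizer, max_new_tokens=400, context_window=32768):
    """Return (fits, n_prompt_tokens) for a prompt and a generation budget.

    context_window defaults to 32768, an assumed limit for MistralLite.
    """
    # Count the tokens the tokenizer produces for this prompt.
    n_prompt_tokens = len(tokenizer(prompt)["input_ids"])
    # The prompt and the newly generated tokens must share the window.
    fits = n_prompt_tokens + max_new_tokens <= context_window
    return fits, n_prompt_tokens
```

With the objects defined in this notebook, `fits_in_context(prompt, tokenizer)` returns whether the long prompt leaves room for 400 new tokens, along with the prompt's token count.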