# Demo: Text Generation With LLMs
In this notebook, we’ll use Zephyr-7B-beta, a fine-tuned version of Mistral AI’s Mistral-7B released by Hugging Face, to create a short, engaging story for a two-year-old. Zephyr-7B is trained to act as a helpful assistant in a chat format, which makes it well suited to crafting simple, friendly tales for young listeners.

Let’s dive in and bring a fun story to life!

Below, we filter out any annoying warning messages that might pop up when using certain libraries. Your code will run the same without this step; we just prefer cleaner output.

In [None]:
import warnings
warnings.filterwarnings("ignore")
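
If you would rather not silence warnings for the whole notebook, the standard library can also suppress them around a single call. The sketch below assumes a hypothetical `noisy_call` function that stands in for any library call that emits a warning.

```python
import warnings


def noisy_call():
    # Hypothetical stand-in for a library call that emits a warning.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return "result"


# Suppress warnings only inside this block; the previous filters are
# restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    value = noisy_call()

print(value)
```

Scoping the filter this way keeps warnings visible everywhere else, which can help when something later goes wrong.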

## 🔧 Step 1: Install Required Packages

We install the Hugging Face `transformers` package, which provides ready-made tools for loading and running transformer models.

We also install `accelerate`, which saves us from writing the boilerplate code needed to run on multiple GPUs, on TPUs, or in fp16.

*Note: it's atypical to install packages one at a time, line by line, but SageMaker notebooks sometimes don't pick up every package when they are installed in a single command. We use `%%capture` to hide the console output.*

In [None]:
%%capture
!pip install -U transformers
!pip install -U accelerate
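
Because `%%capture` hides the installer output, it can be useful to confirm afterwards which versions the notebook actually picked up. Here is a minimal sketch using only the standard library; the helper name `installed_version` is ours and not part of either package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("transformers", "accelerate"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either line reports `not installed`, rerunning the install cell (and restarting the kernel if needed) is usually the first thing to try.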

## 🤖 Step 2: Import Required Libraries
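We import:
* `AutoModelForCausalLM` - loads a pretrained causal (next-token prediction) language model
* `AutoTokenizer` - loads the tokenizer that matches a given model
* `pipeline` - bundles the tokenizer and model behind a single call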

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

## 🔄 Step 3: Load the Tokenizer

Using the `from_pretrained` method, we download the tokenizer used by Zephyr-7B.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

## 🧠 Step 4: Load the model - Zephyr-7B

### Architectural details
Zephyr-7B inherits its architecture from Mistral-7B. It is a decoder-only Transformer with the following architectural choices:

* Sliding Window Attention - trained with an 8k context length and a fixed cache size, with a theoretical attention span of 128K tokens.
* GQA (Grouped Query Attention) - allows faster inference and a smaller cache.
* Byte-fallback BPE tokenizer - ensures that characters are never mapped to out-of-vocabulary tokens.
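
The three ideas above can be illustrated with a small pure-Python sketch. The constants are assumptions taken from Mistral-7B's published configuration (window of 4096 tokens, 32 layers, 32 query heads, 8 key/value heads). The mask and the byte-fallback function are simplified toys, not the library's implementation.

```python
# Values assumed from Mistral-7B's published configuration.
WINDOW = 4096      # sliding-window size in tokens
N_LAYERS = 32      # Transformer layers
N_HEADS = 32       # query heads
N_KV_HEADS = 8     # key/value heads shared by groups of query heads (GQA)


def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j.

    Simplification: each token sees itself and the window - 1 tokens before it.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def theoretical_span(window, n_layers):
    """Each layer lets information travel up to `window` tokens further back."""
    return window * n_layers


def kv_cache_ratio(n_heads, n_kv_heads):
    """How many times fewer key/value heads are cached than with one per query head."""
    return n_heads // n_kv_heads


def byte_fallback(char, vocab):
    """Toy byte fallback: unknown characters become UTF-8 byte tokens."""
    if char in vocab:
        return [char]
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


# A tiny mask (5 tokens, window of 3) so the banded pattern is visible.
for row in sliding_window_mask(5, 3):
    print("".join("x" if allowed else "." for allowed in row))

print(theoretical_span(WINDOW, N_LAYERS))   # 4096 * 32 = 131072, i.e. 128K
print(kv_cache_ratio(N_HEADS, N_KV_HEADS))  # 32 // 8 = 4
print(byte_fallback("é", {"a", "b"}))       # 'é' is two UTF-8 bytes
```

The span calculation shows where the 128K figure comes from: stacking 32 layers of 4096-token windows lets information propagate across 131,072 positions, even though each layer only looks back 4096 tokens.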

In [None]:
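model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    device_map="cuda",       # place the model on the GPU
    torch_dtype="auto",      # use the dtype stored in the checkpoint
    trust_remote_code=True,  # allow custom model code from the Hub repo to run
)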
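
Before loading a model of this size, it helps to estimate whether the weights will fit in GPU memory. The sketch below is back-of-the-envelope arithmetic only. The parameter count is a rough assumption for a "7B" model, and the estimate ignores activations and the attention cache, so real usage will be higher.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

N_PARAMS = 7_000_000_000  # rough, assumed parameter count for a "7B" model


def weight_memory_gib(n_params, dtype):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gib(N_PARAMS, dtype):.1f} GiB")
```

With `torch_dtype="auto"`, the weights load in whatever format the checkpoint was saved in. If that is a 16-bit format, the estimate comes to roughly 13 GiB for the weights, half of what float32 would need.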


## 🗞 Step 5: Wrap everything in a pipeline object

In [None]:
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=500,  # cap the reply at 500 newly generated tokens
)

## 🧪 Step 6: Test the model by instructing (prompting)

In [None]:
# The prompt (user input / query)
messages = [
    {"role": "user", "content": "Tell me a short story for a 2 year old"}
]

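# Generate output
output = generator(messages)

# With chat-style input, recent transformers releases return the whole
# conversation as a list of messages; the last one is the model's reply.
print(output[0]["generated_text"][-1]["content"])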
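
When the pipeline receives a list of messages, it first turns them into a single prompt string using the model's chat template. The function below is a hand-written approximation of a Zephyr-style template, included only to show the idea. The exact role markers and whitespace are assumptions, and `format_zephyr_prompt` is a hypothetical helper. For the authoritative string, call `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.

```python
def format_zephyr_prompt(messages, eos_token="</s>"):
    """Approximate a Zephyr-style chat template (illustrative only)."""
    parts = [
        f"<|{message['role']}|>\n{message['content']}{eos_token}\n"
        for message in messages
    ]
    # A trailing assistant marker cues the model to begin its reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Tell me a short story for a 2 year old"}
]
print(format_zephyr_prompt(messages))
```

Seeing the flattened prompt makes it clearer why the role names matter: the model was fine-tuned on text laid out this way, so it expects the same markers at inference time.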

## Extension 🌟

1. Change your prompt to generate a different output
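
One possible way to approach this is to build several message lists and pass each one to the pipeline in turn. The sketch below assumes a hypothetical helper, `build_messages`, and example prompts of our own. The call to `generator` is left as a comment because it needs the model from the earlier steps to be loaded.

```python
def build_messages(prompt, system=None):
    """Build a chat-style message list, optionally with a system instruction."""
    messages = []
    if system is not None:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return messages


# Example prompts (our own; swap in anything you like).
prompts = [
    "Tell me a short story about a sleepy kitten",
    "Tell me a short rhyme about a red bus",
]

for prompt in prompts:
    messages = build_messages(prompt, system="Use very simple words.")
    print(messages)
    # With the pipeline from Step 5 available, you could then run:
    # output = generator(messages)
```

A system message is a convenient place to put instructions that should apply to every prompt, such as the reading level or the length of the story.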