# Exploring LLM Inference

- Have you ever wondered what you can adjust when you query an LLM such as DeepSeek or ChatGPT?
- We will experiment with the Qwen2.5-3B-Instruct model, which has only about 3 billion parameters and is therefore very lightweight compared to most other LLMs.
- The model can run on a GPU such as the T4 available in Colab. Make sure you enable the GPU under Runtime -> Change runtime type; otherwise text generation will be extremely slow!
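
Before loading the model, it can help to confirm that the runtime actually sees a GPU. The following is a minimal sketch, assuming PyTorch is available (as it normally is in a Colab runtime); `describe_device` is a hypothetical helper name, not part of any library.

```python
import importlib.util


def describe_device():
    """Return a short description of the accelerator PyTorch can see."""
    # Guard the import so the check also runs where PyTorch is missing.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if torch.cuda.is_available():
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "No GPU detected: generation will run on the CPU and be slow"


print(describe_device())
```

If the message reports no GPU, change the runtime type and restart the session before loading the model.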

# Model Loading

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # newer transformers releases prefer dtype="auto" (see the deprecation warning in the output)
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`torch_dtype` is deprecated! Use `dtype` instead!


model-00002-of-00002.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/3.97G [00:00<?, ?B/s]
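
The two weight shards above add up to roughly 6.17 GB, which is in line with a back-of-the-envelope estimate. The sketch below assumes about 3 billion parameters stored at 2 bytes each (16-bit weights); both numbers are approximations, and activations and the KV cache need additional memory on top of the weights.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Estimate the memory needed just for the weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


approx_params = 3e9  # assumption: "3 billion parameters", rounded
estimate_16bit = weight_memory_gb(approx_params, 2)  # fp16 / bf16 weights
estimate_32bit = weight_memory_gb(approx_params, 4)  # fp32 weights
shards_total = 3.97 + 2.20  # shard sizes from the download log, in GB

print(f"16-bit estimate: {estimate_16bit:.1f} GB")
print(f"32-bit estimate: {estimate_32bit:.1f} GB")
print(f"Downloaded shards: {shards_total:.2f} GB")
```

Under these assumptions the 16-bit weights fit comfortably within the 16 GB of memory on a T4, while 32-bit weights would leave much less headroom.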

In [None]:
prompt = "Give me a short introduction to large language model." # the input that you type in to a language model
system_prompt = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
# System prompt: text inserted before the messages you write in the conversation.
# It "sets the tone" for the model's subsequent responses.
def generate(prompt, system_prompt, max_new_tokens=512, temperature=None):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt}
    ]
    # Format the conversation with the model's chat template and append the
    # marker that tells the model to start its reply.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    # Honor max_new_tokens, and only pass temperature when the caller sets it,
    # so the model's default generation settings apply otherwise.
    generation_kwargs = {"max_new_tokens": max_new_tokens}
    if temperature is not None:
        generation_kwargs["temperature"] = temperature
        # Temperature only has an effect when sampling is enabled.
        generation_kwargs["do_sample"] = True
    generated_ids = model.generate(**model_inputs, **generation_kwargs)
    # Drop the prompt tokens so that only the newly generated tokens are decoded.
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response
print(generate(prompt, system_prompt))

A large language model is a type of artificial intelligence (AI) system that is trained on vast amounts of text data to generate human-like responses to a wide variety of prompts or questions. These models are typically based on neural networks, which are computational models inspired by the structure and function of the brain.

The key characteristics of large language models include:

1. **Scale**: They are trained on extremely large datasets, often containing billions or even trillions of words. This extensive exposure allows them to understand context, grammar, and various styles of writing.

2. **Complexity**: They employ sophisticated algorithms and architectures to process and generate text. This complexity enables them to handle a wide range of tasks including translation, summarization, question-answering, and more.

3. **Versatility**: Large language models can produce coherent and contextually relevant responses across different domains and topics, making them useful for a m
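
`model.generate` returns the prompt tokens followed by the newly generated ones, which is why `generate` slices each output sequence before decoding it. A toy sketch with made-up token IDs (plain Python lists standing in for tensors) shows the idea:

```python
# Made-up token IDs for illustration; real IDs come from the tokenizer.
input_ids_batch = [[101, 7, 42]]                # one prompt of 3 tokens
output_ids_batch = [[101, 7, 42, 900, 901, 2]]  # prompt + 3 new tokens

# Keep only what comes after the prompt in each sequence.
new_tokens_batch = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
]

print(new_tokens_batch)  # [[900, 901, 2]]
```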

# Adjusting Prompts

You can improve the quality of the model's response by giving it clear instructions!

In [None]:
system_prompt = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
prompt = "How does a large language model work?"
print(f"Prompt: {prompt}")
print(f"Response: {generate(prompt, system_prompt)}")
system_prompt = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
prompt = "How does a large language model work in terms of how they are trained and how they generate text? Please be concise."
print(f"Prompt: {prompt}")
print(f"Response: {generate(prompt, system_prompt)}")
# The second response says more about training and text generation, and it is more concise too!
# Suggestions or constraints in the prompt steer the model toward the output you want.
# The more relevant information you give the model about what you want (leaving out private or
# irrelevant details), the higher the response quality will usually be compared to a generic prompt.

Prompt: How does a large language model work?
Response: A large language model is essentially a sophisticated neural network designed to understand and generate human-like text based on the patterns it has learned from vast amounts of text data. Here’s an overview of how such models work:

### 1. **Data Collection**
   - **Multilingual Text**: The model is trained on a diverse corpus of text in multiple languages. This includes books, articles, websites, and other forms of written content.
   - **Natural Language Processing (NLP) Tasks**: The data is often annotated with various NLP tasks like part-of-speech tagging, named entity recognition, dependency parsing, etc., to help the model learn more nuanced aspects of language.

### 2. **Preprocessing**
   - **Tokenization**: The raw text is broken down into smaller units called tokens (e.g., words, subwords, or characters). For example, "hello world" might be tokenized as ["hello", "world"].
   - **Encoding**: Each token is then converte

# Adjusting System Prompts

The system prompt is a very effective way to adjust the style in which the language model delivers content!

In [None]:
system_prompt = "You are a dog. You can only output the word 'bark' out of every prompt, and nothing else."
print(f"Prompt: {prompt}")
print(f"Response: {generate(prompt, system_prompt)}")
# A system prompt at the beginning can override the user prompt!
# Think of it as giving the model a "character" to base its responses on.

Prompt: How does a large language model work in terms of how they are trained and how they generate text? Please be concise.
Response: bark
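
To see why the system prompt has this much influence, it helps to look at the text the model actually receives. Qwen's chat template follows a ChatML-style layout in which the system message comes first. The function below is a hand-written approximation for illustration only (`render_chatml` is a hypothetical name); in practice `tokenizer.apply_chat_template` is the source of truth, and its exact output may differ.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate a ChatML-style prompt string from a list of messages."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a dog."},
    {"role": "user", "content": "How does a large language model work?"},
]
rendered = render_chatml(messages)
print(rendered)
```

Because the system message sits at the very start of the rendered text, every token the model generates is conditioned on it.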


In [None]:
system_prompt = "You are a dog."
print(f"System prompt: {system_prompt}")
print(f"Prompt: {prompt}")
print(f"Response: {generate(prompt, system_prompt)}")
system_prompt = "You are a ghost from a horror story."
print(f"System prompt: {system_prompt}")
print(f"Prompt: {prompt}")
print(f"Response: {generate(prompt, system_prompt)}")

System prompt: You are a dog.
Prompt: How does a large language model work in terms of how they are trained and how they generate text? Please be concise.
Response: A large language model is trained using vast amounts of text data, often millions or billions of words, to learn patterns and relationships within the language. During training, the model tries to predict the next word in a sentence based on all preceding words. It's like learning from a really big book where every word is its own page.

After training, the model can generate new text by predicting what words come next in a sequence, which helps it create coherent sentences and paragraphs. It uses what it learned from all that training data to make educated guesses about how words should fit together in various contexts.
System prompt: You are a ghost from a horror story.
Prompt: How does a large language model work in terms of how they are trained and how they generate text? Please be concise.
Response: A large language mo

# Adjusting the Temperature

The temperature is one of the important inference parameters that affect how a model responds to a prompt!

In [None]:
prompt = "Give me a short introduction to large language model." # the input that you type in to a language model
system_prompt = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
print(generate(prompt, system_prompt, max_new_tokens=512, temperature=20.0)) # very high temperature
# The output starts to include text that is irrelevant or even incoherent.
# For example, a run may produce "artificial artificial intelligent", which is confusing and ungrammatical,
# or make up acronyms like LaMO or LaME, which are not common LLM terminology.
# You are likely to get different phrases on every run: a very high temperature flattens the predicted
# probability distribution across tokens, so repeated runs vary far more than they would at a lower temperature.

A *Large-scale languagemodel**s** refers backbones capable of recognizing linguistic rules by training deep-neurall networks in large, publicly avalachable language data resources such text collections across diverse formats, including web articles and documents across numerous sources in a distributed computation setting. Their primary objective entails not replicate preexistent linguistic expressions verbed for prediction and completion in the format of conversation sequences or answer inquiries by inputting incomplete text sentences. Some critical parameters charactering successful languagemodels, apart being vast size to accommodate wide diversity within a training phase (measuerred via vocabulary sizes or number-of-model-finetunes-parameterizations per epoch) or training length period, involves a highly sophisticated fine-to-finesetting for optimization purposes (via training schedules and lossfunctions that aim optimizing predictions in both low-cost computation settings).

The e

In [None]:
print(generate(prompt, system_prompt, max_new_tokens=512, temperature=20.0)) # another go at the same temperature
# Very different! But the text is still unclear, and sometimes incoherent or incorrect...
# That is why, for the vast majority of applications, the temperature is rarely set much higher than 1.
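
Temperature divides the logits before the softmax, so a large value flattens the distribution and a small one sharpens it. A minimal sketch with made-up logits for three candidate tokens (the numbers are illustrative and not taken from the model):

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [4.0, 2.0, 0.0]  # made-up scores for three candidate tokens

for temperature in (0.5, 1.0, 20.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a temperature of 20 the three probabilities end up close to one another, which is why unlikely tokens get picked so often in the outputs above.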

In [None]:
print(generate(prompt, system_prompt, max_new_tokens=512, temperature=1e-4)) # very low temperature
# This approximates greedy decoding (always selecting the highest-probability token), because a very low
# temperature greatly amplifies the gap between the highest logit and all the others.
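
The link between a very low temperature and greedy decoding can be checked on a toy example. This is a sketch with made-up logits and a hypothetical `sample_token` helper, not the sampling code that `model.generate` uses internally:

```python
import math
import random
from collections import Counter


def sample_token(logits, temperature, rng):
    """Sample one token index from temperature-scaled logits."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - max_scaled) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [4.0, 3.5, 0.0]  # made-up scores; index 0 is the greedy choice
rng = random.Random(0)    # fixed seed so the run is repeatable

for temperature in (1e-4, 1.0):
    counts = Counter(sample_token(logits, temperature, rng) for _ in range(1000))
    print(temperature, dict(counts))
```

At a temperature of 1e-4 every draw lands on index 0, the same token greedy decoding would pick, whereas at 1.0 the runner-up token is sampled regularly as well.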