# Inference


In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

Let's run the model! Change the index into `fourbit_models` below to choose which fine-tuned checkpoint to load.

In [9]:
from unsloth import FastLanguageModel
import torch


max_seq_length = 2048
dtype = None # None for auto detection.
load_in_4bit = True

fourbit_models = [
    "rcarioniporras/model_baseline_llama",
    "ericbanzuzi/model_datacentric_llama",
    "rcarioniporras/model_modelcentric_llama",
]

hf_token = 'token'  # placeholder: replace with your own Hugging Face access token

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = fourbit_models[2],
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token
)

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [10]:
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer


tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference


test_messages = [
    # coding tasks
    {"role": "user", "content": "What is a decorator in Python, and why is it used? Provide an example of a decorator to log the execution time of a function."},
    {"role": "user", "content": "Complete the following Python function to determine whether a given number is prime: def is_prime(num):  # Your code here"},
    {"role": "user", "content": "Explain Python's Global Interpreter Lock (GIL). How does it affect multithreading in Python? What are some ways to work around it?"},
    {"role": "user", "content": "Compare the stack and queue data structures. Provide an example use case for each."},
    {"role": "user", "content": "What is object-oriented programming (OOP), and what are its four main principles?"},

    # general tasks for following instructions
    {"role": "user", "content": "Describe the greenhouse effect and its role in global warming. Include examples of gases that contribute to this phenomenon."},
    {"role": "user", "content": "Explain the causes and consequences of the Industrial Revolution. How did it shape the modern world?"},
    {"role": "user", "content": "Explain the sine function and its relation to the unit circle."},
    {"role": "user", "content": "A bag contains 3 red balls, 2 blue balls, and 5 green balls. If one ball is selected at random, what is the probability that it is blue? Explain your answer."},
    {"role": "user", "content": "Simplify the expression 3(x+4)−2(2x−1). Show all steps and explain the rules of algebra used."}
]



output_file = "output_modelcentric.txt"

with open(output_file, "w") as f:
    for i, message in enumerate(test_messages):
        # Prepare messages and inputs
        messages = [message]
        f.write(f"User {i}: {message['content']}\n\n")

        inputs = tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,  # Must add for generation
            return_tensors="pt",
        ).to("cuda")

        f.write("Assistant:\n")
        text_streamer = TextStreamer(tokenizer, skip_prompt=True)
        
        # Sampling settings:
        # - temperature rescales the next-token logits before sampling; values above 1
        #   flatten the distribution and make the output more random.
        # - min_p drops every token whose probability is below min_p times the
        #   probability of the most likely token.
        # A high temperature combined with min_p filtering aims for varied, less
        # repetitive output that still stays coherent.
        # Note: both settings only take effect when sampling is enabled (do_sample=True,
        # passed here or set in the model's generation config).
        outputs = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=max_seq_length,
                                 use_cache=True, temperature=1.5, min_p=0.1)

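        # Decode only the newly generated tokens so the prompt is not written to the file a second time
        prompt_length = inputs.shape[1]
        generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
        f.write(generated_text + "\n")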

        f.write("\n" + "-" * 80 + "\n\n")

print(f"Results have been saved to {output_file}")
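
The generation call in the cell above relies on `temperature` and `min_p`. As a rough illustration of what those two settings do, here is a minimal pure-Python sketch on a made-up four-token distribution. The helper names `apply_temperature` and `min_p_filter` and the toy logits are hypothetical and are not part of Unsloth or Transformers; the sketch assumes temperature is applied before the min_p filter and that the min_p threshold is scaled by the top token's probability.

```python
import math


def apply_temperature(logits, temperature):
    """Divide the logits by the temperature, then softmax them into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * (top probability), then renormalize the rest."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


# Hypothetical logits for a 4-token vocabulary (illustration only)
toy_logits = [4.0, 3.0, 1.0, -2.0]

probs = apply_temperature(toy_logits, temperature=1.5)
filtered = min_p_filter(probs, min_p=0.1)

for token_id, (before, after) in enumerate(zip(probs, filtered)):
    print(f"token {token_id}: p={before:.3f} -> after min_p: {after:.3f}")
```

In this toy case the least likely token falls below the threshold and is removed, while the other three remain available for sampling. Raising the temperature flattens the distribution, and min_p then trims the low-probability tail that the flattening would otherwise expose.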

A decorator in Python is a technique to dynamically apply some behavior or function to another function. It essentially wraps another function to execute additional code before, after, or between each execution of the main function. Decors are useful for customizing the behavior of another function based on some condition.

For example, the decorator we want to use in this question is likely to log the execution time of a function. This is commonly used in Python scripts to monitor how quickly certain functions execute. By adding this decorator to a function, it can log the execution time at the beginning, middle, and end of the function's execution. It also keeps track of the number of times it executes, as well as when it started running and stopped. This can provide a great deal of valuable information for debugging purposes.

Here's an example of a decorator that logs execution times using the 'time' module:

```python
import time

def time_log(func):
    def wrapper(*args, **kwarg