# Load a Base Model

This notebook demonstrates how to load a pre-trained model from Hugging Face using the `transformers` library.

We will use `mistralai/Mistral-7B-v0.1` as an example.

### Authentication Required

The model `mistralai/Mistral-7B-v0.1` is a "gated model", which means you need to accept its terms of use on Hugging Face and log in with an access token to download it.

**Step 1: Grant Access on the Hugging Face Website**

1.  Go to the model's page: [https://huggingface.co/mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
2.  Log in with your Hugging Face account.
3.  Read and accept the terms and conditions to get access to the model.

**Step 2: Authenticate Your Local Environment**

After getting access on the website, you need to log in from your terminal.

1.  Activate the virtual environment: `source .venv/bin/activate`
2.  Run the login command:
    ```bash
    huggingface-cli login
    ```
3.  Paste your Hugging Face access token when prompted. You can create one in your account settings under **Settings > Access Tokens**; a token with the read role is sufficient for downloading.

Once authenticated, you can run the code cells below.
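Before starting a multi-gigabyte download, it can help to confirm from Python that a token is actually visible. The following is a minimal, standard-library-only sketch: `find_hf_token` is a hypothetical helper, and it assumes the token is available either through the `HF_TOKEN` environment variable or through the token file written by the login command (assumed here to live at `~/.cache/huggingface/token`, or under `HF_HOME` if that variable is set).

```python
import os
from pathlib import Path


def find_hf_token(env=None, token_file=None):
    """Return a Hugging Face token if one is visible, else None.

    Hypothetical helper for illustration; the file location is an assumption.
    """
    env = os.environ if env is None else env

    # 1. An explicit environment variable takes priority.
    token = (env.get("HF_TOKEN") or "").strip()
    if token:
        return token

    # 2. Otherwise look for the token file (assumed default location).
    if token_file is None:
        hf_home = env.get("HF_HOME") or os.path.expanduser("~/.cache/huggingface")
        token_file = Path(hf_home) / "token"
    token_file = Path(token_file)
    if token_file.is_file():
        return token_file.read_text().strip() or None
    return None


token = find_hf_token()
print("Token found" if token else "No token found; run the login command first")
```

This only checks that a token exists locally. It does not verify that the token is valid or that access to the gated model has been granted.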

In [None]:
import sys

# Print the path of the Python interpreter running this notebook.
# This helps confirm that the kernel is using the virtual environment.
print(sys.executable)
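Reading the interpreter path by eye works, but the same check can also be made programmatically. The sketch below assumes a `venv`-style environment, where `sys.prefix` differs from `sys.base_prefix`; `in_virtualenv` is a hypothetical helper name, and this check will not detect every kind of environment (for example, a conda environment may report `False`).

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment, while sys.base_prefix
    # points at the base Python installation. Outside a venv they are equal.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```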

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"

print(f"Loading tokenizer for {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Forcing CPU execution as a fallback for older GPUs or CUDA incompatibility.
device = "cpu"
print(f"Using device: {device}")

print(f"Loading model {model_name}...")
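# NOTE: Running on CPU is significantly slower than on a compatible GPU.
# 4-bit quantization (`load_in_4bit`) relies on CUDA kernels, so it is not used for CPU execution.
# Without a `torch_dtype` argument the weights are typically loaded in float32,
# which takes roughly 29 GB of RAM for this 7B-parameter model.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
)
model.to(device)  # Move the model to the selected device (CPU here)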

print("Model and tokenizer loaded successfully!")
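Whether the model fits in RAM depends mostly on the number of parameters and the number of bytes used per parameter. The following back-of-the-envelope sketch makes that arithmetic explicit. It covers the weights only (no activations or framework overhead), `estimate_weight_memory_gb` is a hypothetical helper, and the parameter count of about 7.24 billion is an approximation used for illustration.

```python
# Bytes needed to store one parameter in common weight formats.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "4bit": 0.5,
}


def estimate_weight_memory_gb(num_params, dtype="float32"):
    """Estimate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: roughly 7.24 billion parameters for a "7B" Mistral model.
APPROX_PARAMS = 7_240_000_000

for dtype in BYTES_PER_PARAM:
    gb = estimate_weight_memory_gb(APPROX_PARAMS, dtype)
    print(f"{dtype:>8}: ~{gb:.1f} GB")
```

If RAM is tight, passing a half-precision `torch_dtype` (such as `torch.bfloat16`) to `from_pretrained` should roughly halve the footprint, although half-precision speed and operator support on CPU depend on the hardware and the PyTorch build.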

In [None]:
import torch

# 1. Define a prompt
# Mistral-7B-v0.1 is a base model: it continues text rather than following
# instructions. The [INST] ... [/INST] format belongs to the Instruct variants.
# Do not add "<s>" by hand either, because the tokenizer already prepends the BOS token.
prompt = "// A simple Rust function that adds two numbers\nfn add("

# 2. Tokenize the prompt
# `return_tensors="pt"` ensures we get PyTorch tensors.
inputs = tokenizer(prompt, return_tensors="pt")

# 3. Generate a response
# We move the inputs to the same device as the model.
# `max_new_tokens` limits the length of the generated response.
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"].to(model.device),
        attention_mask=inputs["attention_mask"].to(model.device),
        max_new_tokens=100,
        # Reuse EOS as the pad token for open-ended generation.
        pad_token_id=tokenizer.eos_token_id,
    )

# 4. Decode and print the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
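For decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens, so the decoded `response` above repeats the prompt. One common way to print only the continuation is to slice off the first `prompt_length` token IDs before decoding. The sketch below shows the idea on plain Python lists so it runs without the model; `split_generated` is a hypothetical helper and the token IDs are made up.

```python
def split_generated(output_ids, prompt_length):
    """Split a generated sequence into (prompt part, newly generated part)."""
    ids = list(output_ids)
    return ids[:prompt_length], ids[prompt_length:]


# Toy IDs, made up for illustration; real ones come from the tokenizer.
prompt_ids = [1, 100, 200, 300]
full_output = prompt_ids + [400, 500, 2]

_, new_ids = split_generated(full_output, len(prompt_ids))
print(new_ids)  # [400, 500, 2]

# With the notebook's variables, the same idea would look like:
#   prompt_length = inputs["input_ids"].shape[1]
#   completion = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```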