<a href="https://colab.research.google.com/github/easonlai/deepseek_examples/blob/main/Loading_DeepSeek_R1_Example_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
# Install & upgrade necessary libraries.
!pip install --upgrade transformers
!pip install --upgrade bitsandbytes

In [1]:
# Display the information about Nvidia GPU devices.
!nvidia-smi

Wed Feb  5 14:34:42 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   45C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
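
The same GPU details can be read programmatically, which helps when checking available memory before loading a model. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV mode. `parse_gpu_query` and `query_gpus` are hypothetical helper names and are not part of this notebook.

```python
import subprocess


def parse_gpu_query(csv_text):
    """Parse 'name, memory.total' CSV rows (no header, no units) into dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a GPU name containing commas stays intact.
        name, memory_mib = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append({"name": name, "memory_total_mib": int(memory_mib)})
    return gpus


def query_gpus():
    """Run nvidia-smi and return one dict per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_query(result.stdout)


# Example using the values from the table above (Tesla T4, 15360 MiB).
sample = "Tesla T4, 15360\n"
print(parse_gpu_query(sample))
```

On a machine with a GPU, `query_gpus()` would return the live values instead of the sample string.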
                                                

In [3]:
# Import PyTorch
import torch
# Import the Hugging Face Hub login helper
from huggingface_hub import login
# Import the necessary classes from the transformers library
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# Import Colab secrets
from google.colab import userdata


In [4]:
# Access saved token secret from Colab.
hugging_face_token = userdata.get('HF_Token_DEV_R')

# Login to Hugging Face API with Access Token.
login(hugging_face_token)

In [5]:
# Define the model name, specifying which pre-trained model to use.
# https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B
model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

# Maximum sequence length for the model's input (defined for reference; not used in the cells shown here).
max_seq_length = 2048

# Configure the quantization settings for efficient model loading and inference.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, # Enable 4-bit quantization.
    bnb_4bit_use_double_quant=True, # Also quantize the quantization constants to save additional memory.
    bnb_4bit_quant_type="nf4", # Use the 4-bit NormalFloat (NF4) data type.
    bnb_4bit_compute_dtype=torch.bfloat16 # Data type used for computation. Note: the Tesla T4 (Turing) has no native bfloat16 support, so torch.float16 may be a better fit on this GPU.
)

# Load the tokenizer for the pre-trained model.
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the pre-trained language model with specific configurations.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32, # Set the data type for tensors based on CUDA availability.
    device_map="auto",  # Automatically map the model to available devices (CPU/GPU).
    quantization_config=quantization_config  # Apply the 4-bit quantization for efficient model loading and processing.
)

print("Model and tokenizer loaded successfully!")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/3.06k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/826 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/24.2k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-000002.safetensors:   0%|          | 0.00/8.67G [00:00<?, ?B/s]

model-00002-of-000002.safetensors:   0%|          | 0.00/7.39G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

Model and tokenizer loaded successfully!
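
The two checkpoint shards above total about 16 GB in 16-bit precision, which leaves almost no headroom on a 15 GiB T4. A back-of-the-envelope estimate shows why 4-bit loading helps. This is a rough sketch, not a measurement: the parameter count is an assumption taken from the model name, the per-parameter overhead figures are those reported in the QLoRA paper for a block size of 64, and `estimate_weight_memory_gib` is a hypothetical helper. Layers that stay in 16-bit (such as embeddings), activations, and the KV cache are ignored, so real usage will be higher.

```python
def estimate_weight_memory_gib(num_params, bits_per_param):
    """Rough size of the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 8 billion parameters, as the model name suggests.
NUM_PARAMS = 8e9

# Quantization constants add overhead per parameter: about 0.5 bits without
# double quantization and about 0.127 bits with it (QLoRA paper figures).
fp16_gib = estimate_weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = estimate_weight_memory_gib(NUM_PARAMS, 4 + 0.5)
nf4_dq_gib = estimate_weight_memory_gib(NUM_PARAMS, 4 + 0.127)

print(f"float16 weights:            {fp16_gib:.2f} GiB")
print(f"NF4 weights:                {nf4_gib:.2f} GiB")
print(f"NF4 + double quantization:  {nf4_dq_gib:.2f} GiB")
```

Under these assumptions, double quantization trims a few hundred MiB on top of the roughly fourfold saving from 4-bit weights.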


In [6]:
# Define a prompt template with two placeholders: the question and the start of the response.
prompt_format = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a computer programming expert with advanced knowledge in software development, debugging, and algorithm design.
Please answer the following computer programming question.

### Question:
{}

### Response:
<think>{}"""
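
The template above relies on positional `{}` placeholders filled by `str.format`. The sketch below uses a shortened, hypothetical template (not the notebook's full prompt) to show how the two slots are filled: the first receives the question, and the second is left empty so that the prompt ends at the opening `<think>` tag for the model to continue from.

```python
# A shortened, hypothetical template with the same two placeholders.
template = """### Question:
{}

### Response:
<think>{}"""

question = "Why does 'a' + 1 fail in Python?"

# The first {} receives the question; the second {} is filled with an empty
# string, so the prompt stops right after the opening <think> tag.
prompt = template.format(question, "")
print(prompt)
```

Only the template string is parsed for placeholders, so braces inside the question text pass through unchanged. Literal braces in the template itself would need to be doubled (`{{` and `}}`).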

In [7]:
# Craft a test programming question for inference.
question = """I got an error from the code below. Please help to debug.

def greet(name):
    print("Hello, " + name)

greet("Alice")
greet(123)

"""

# Format the question with the prompt template ('prompt_format') and tokenize it.
inputs = tokenizer([prompt_format.format(question, "")], return_tensors="pt").to("cuda")  # Convert input to PyTorch tensor & move to GPU

# Using the model to generate the response.
outputs = model.generate(
    input_ids=inputs.input_ids, # Tokenized input question.
    attention_mask=inputs.attention_mask, # Mask to ignore padding tokens in the input.
    max_new_tokens=1200, # Setting the maximum number of new tokens to generate.
    use_cache=True # Enable the cache for faster inference.
)

# Decode the generated output tokens to readable text.
response = tokenizer.batch_decode(outputs)

# Extract and print only the relevant response.
print(response[0].split("### Response:")[1])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



<think>
Okay, I need to help debug the given Python code. Let's look at the code step by step.

The function `greet` is defined to take one parameter, `name`, and print "Hello, " followed by that name. So, when someone calls `greet("Alice")`, it should work fine because it's passing a string. But when they call `greet(123)`, there might be an issue.

In the first call, `greet("Alice")`, everything works because both `print` and the function expect a string. So no error there.

In the second call, `greet(123)`, the problem arises. The function expects `name` to be a string, but it's receiving an integer, 123. When trying to concatenate an integer with a string using `+`, Python throws a TypeError because you can't add a string and an integer together.

So the error occurs at the `print` statement line. The solution is to ensure that `name` is a string when it's passed to the function. One way to handle this is to convert the input to a string using `str()`, like `str(name)`. Alternativ