In [1]:
!nvidia-smi

Tue Dec  9 02:40:15 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:18:00.0 Off |                   On |
| N/A   46C    P0             44W /  400W |   17699MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------

In [2]:
#!kill 3032198

In [3]:
import torch
print("CUDA:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))
print("Memory allocated:", torch.cuda.memory_allocated()/1024**2, "MB")


CUDA: True
Device: NVIDIA A100-SXM4-80GB MIG 1g.10gb
Memory allocated: 0.0 MB


In [4]:
!pip install transformers accelerate bitsandbytes



In [5]:
import os
from dotenv import load_dotenv
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

load_dotenv(".env.local")
HF_TOKEN = os.getenv("HF_TOKEN")
if HF_TOKEN is None: raise ValueError("Missing HF_TOKEN in .env.local")

model_name = "meta-llama/Llama-3.1-8B-Instruct"

# --- 4-BIT QUANTIZATION CONFIG ---
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    token=HF_TOKEN,
)

print("Loading 4-bit quantized model...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=HF_TOKEN,
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",   # automatically puts model on your MIG GPU
)

print("Model loaded!")


Loading tokenizer...


`torch_dtype` is deprecated! Use `dtype` instead!


Loading 4-bit quantized model...


2025-12-09 02:40:33.513048: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-12-09 02:40:33.536315: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-12-09 02:40:33.542234: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-12-09 02:40:33.558158: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Model loaded!


In [6]:
# ---- CHAT EXAMPLE ----
messages = [
    {"role": "system", "content": "You are a concise helpful assistant."},
    {"role": "user", "content": "Explain deep learning in simple words."}
]

# Convert messages ‚Üí model prompt
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# ---- GENERATE ----
output = model.generate(
    **inputs,
    max_new_tokens=10000,
    temperature=0.7,
    do_sample=True,
)

response_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("\n=== MODEL RESPONSE ===\n")
print(response_text)


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



=== MODEL RESPONSE ===

system

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a concise helpful assistant.user

Explain deep learning in simple words.assistant

**What is Deep Learning?**

Deep learning is a type of machine learning that uses artificial neural networks with multiple layers to analyze and interpret data. It's like a computer version of the human brain, where each layer processes information and passes it on to the next layer.

**Key Components:**

1. **Neural Networks:** Inspired by the human brain, these networks consist of interconnected nodes (neurons) that process and transmit information.
2. **Layers:** Multiple layers of neurons are stacked on top of each other, allowing the network to learn complex patterns in data.
3. **Artificial Intelligence:** Deep learning is a subset of AI that enables computers to learn from data without being explicitly programmed.

**How it Works:**

1. **Data Collection:** Gather a large dataset (e.g., images, 

In [7]:
# --- CONVERSATION LOOP USING LLAMA CHAT TEMPLATE ---

messages = [
    {"role": "system", "content": "You are a helpful, concise assistant."}
]

print("\nChat with Llama-3.1-8B (type 'exit' to quit)\n")

while True:
    user_input = input("User: ").strip()
    if user_input.lower() in ["exit", "quit"]:
        print("Exiting chat...")
        break

    # Add user's message
    messages.append({"role": "user", "content": user_input})

    # Convert conversation to a prompt
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate response
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
    )

    response = tokenizer.decode(output[0], skip_special_tokens=True)

    # The model produces the *full conversation*, so extract only last turn
    # (Llama 3 always puts assistant response at the end)
    assistant_reply = response.split("<|assistant|>")[-1].strip()

    print(f"\nAssistant: {assistant_reply}\n")

    # Add the assistant response to history
    messages.append({"role": "assistant", "content": assistant_reply})



Chat with Llama-3.1-8B (type 'exit' to quit)



User:  exit


Exiting chat...


In [8]:
import sys
import time

# ANSI Colors
RESET   = "\033[0m"
BOLD    = "\033[1m"
USER    = "\033[38;5;39m"   # Blue
ASSIST  = "\033[38;5;46m"   # Green
SYSTEM  = "\033[38;5;214m"  # Orange
SEPARATOR = f"{BOLD}\033[38;5;240m" + "-" * 60 + RESET

# Conversation state
messages = [
    {"role": "system", "content": "You are a helpful, concise assistant."}
]

print(f"{SYSTEM}System prompt loaded. Chat with Llama-3.1-8B!{RESET}")
print("Type 'exit' to quit.\n")

turn = 1

while True:
    user_input = input(f"{USER}User{turn}: {RESET}").strip()
    if user_input.lower() in ["exit", "quit"]:
        print(f"{SYSTEM}Exiting chat...{RESET}")
        break

    # Add user message
    messages.append({"role": "user", "content": user_input})

    # Convert to Llama chat prompt
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Model generate
    output = model.generate(
        **inputs,
        max_new_tokens=2200,
        temperature=0.7,
        do_sample=True,
    )

    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    assistant_reply = decoded.split("<|assistant|>")[-1].strip()

    # Print clean formatted response
    print(f"\n{SEPARATOR}")
    print(f"{ASSIST}Assistant{turn}:{RESET} {assistant_reply}")
    print(f"{SEPARATOR}\n")

    # Add to history
    messages.append({"role": "assistant", "content": assistant_reply})

    turn += 1


[38;5;214mSystem prompt loaded. Chat with Llama-3.1-8B![0m
Type 'exit' to quit.



[38;5;39mUser1: [0m congratulate me for finishing this off


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



[1m[38;5;240m------------------------------------------------------------[0m
[38;5;46mAssistant1:[0m system

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful, concise assistant.user

congratulate me for finishing this offassistant

Huge congratulations to you for completing whatever challenge or task you've been working on. That's a fantastic achievement. What was it that you've finished?
[1m[38;5;240m------------------------------------------------------------[0m



[38;5;39mUser2: [0m use emojies


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



[1m[38;5;240m------------------------------------------------------------[0m
[38;5;46mAssistant2:[0m system

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful, concise assistant.user

congratulate me for finishing this offassistant

system

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful, concise assistant.user

congratulate me for finishing this offassistant

Huge congratulations to you for completing whatever challenge or task you've been working on. That's a fantastic achievement. What was it that you've finished?user

use emojiesassistant

HUGE congratulations to you for completing whatever challenge or task you've been working on üéâüëè. That's a fantastic achievement üéä! You must be feeling proud and relieved üòä. What was it that you've finished?
[1m[38;5;240m------------------------------------------------------------[0m



[38;5;39mUser3: [0m exit


[38;5;214mExiting chat...[0m


In [9]:
## Next is the Embedings