In [7]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

### Define the input texts for the model to process

In [2]:
input_texts = [
    "Explain the theory of relativity.",
    "What is the capital of France?",
    "How does quantum computing work?",
    "What are the benefits of machine learning?",
]

### Load the **Llama 3.1 8B** model and tokenizer, then perform inference on the input texts.
- Held in float32, the 8B weights alone take roughly 30 GiB (about 8 billion parameters × 4 bytes), so this function needs at least 32 GiB of GPU memory.
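The memory figure above can be sanity-checked with a little arithmetic. The sketch below is only an estimate: it assumes a round 8 billion parameters, counts the weights only (no activations, KV cache, or CUDA overhead), and `weight_memory_gib` is a hypothetical helper, not part of any library.

```python
GIB = 2**30  # bytes in one GiB


def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed just to hold the model weights, in GiB."""
    return num_params * bytes_per_param / GIB


# Assumption: a nominal 8e9 parameters for the 8B model
for dtype_name, bytes_per_param in [("float32", 4), ("bfloat16", 2)]:
    size = weight_memory_gib(8e9, bytes_per_param)
    print(f"8B in {dtype_name}: {size:.1f} GiB")
# prints:
# 8B in float32: 29.8 GiB
# 8B in bfloat16: 14.9 GiB
```

Real usage will be somewhat higher than these numbers, since generation also allocates memory for inputs and intermediate tensors.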

In [3]:
def inference_8B(input_texts):
    # Load model directly
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
    
    ### NOTE: THIS MODEL REQUIRES AT LEAST 32GiB OF GPU MEMORY ###
    # Move the model to the GPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if torch.cuda.is_available():
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
    else:
        print("Using CPU")
    model.to(device)

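    # Reuse EOS as the pad token so prompts of different lengths can be batched
    tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True).to(device)

    with torch.no_grad():
        # max_length counts the prompt tokens as well as the newly generated ones
        outputs = model.generate(**inputs, max_length=200)
        # Each decoded string contains the prompt followed by the continuation
        generated_texts = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]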

    for i, text in enumerate(generated_texts):
        print(f"Question: {input_texts[i]}")
        print(f"Answer: {text}")
        print()
    
    # Clear GPU memory
    del model
    del inputs
    torch.cuda.empty_cache()

### Perform inference on the defined input texts using the **8B** model

In [4]:
inference_8B(input_texts)

Loading checkpoint shards: 100%|██████████| 4/4 [00:36<00:00,  9.10s/it]


Using GPU: NVIDIA A40


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Question: Explain the theory of relativity.
Answer: Explain the theory of relativity. What are the 4 postulates of special relativity? 2019-01-19
What is the theory of relativity?
If a person is moving with the same velocity as a photon, they would see the photon moving at the speed of light. It is also called the theory of relativity. It is the theory of relativity that explains how the laws of physics are the same for all non-accelerating observers, and are independent of the state of motion of the observers. The theory of relativity is a theory that states that the laws of physics are the same for all observers that are moving at a constant velocity, regardless of their relative motion. In 1905, Einstein published his paper on the Special Theory of Relativity. The theory of relativity was developed by Albert Einstein in 1905. The theory of relativity is a theory of gravitation in which gravity is treated as a geometric property of space and

Question: What is the capital of France?
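
The right-padding warning in the output above is worth a closer look. A decoder-only model continues from the last position of each row, so with right padding the shorter prompts end in pad tokens rather than in their own final token. The toy sketch below illustrates this with plain lists of made-up token IDs instead of the real tokenizer; `pad_batch` and `PAD` are hypothetical names used only for illustration.

```python
PAD = 0  # hypothetical pad token ID


def pad_batch(seqs, side, pad_id=PAD):
    """Pad token-ID lists to equal width; return (padded rows, attention masks)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(s) for s in seqs)
    padded, masks = [], []
    for s in seqs:
        fill = [pad_id] * (width - len(s))
        real = [1] * len(s)       # 1 marks a real token
        ignored = [0] * len(fill)  # 0 marks padding
        if side == "left":
            padded.append(fill + list(s))
            masks.append(ignored + real)
        else:
            padded.append(list(s) + fill)
            masks.append(real + ignored)
    return padded, masks


batch = [[11, 12, 13, 14], [21, 22]]  # two prompts of different lengths
right, _ = pad_batch(batch, "right")
left, left_mask = pad_batch(batch, "left")

# The last position is where generation continues from
print([row[-1] for row in right])  # [14, 0]  -> second prompt ends in padding
print([row[-1] for row in left])   # [14, 22] -> every prompt ends in a real token
```

In the notebook's code, the equivalent change would be setting `tokenizer.padding_side = "left"` before calling the tokenizer, as the warning suggests.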


### Load the **Llama 3.1 8B-Instruct** model and tokenizer, then perform inference on the input texts.
- Held in float32, the 8B-Instruct weights alone take roughly 30 GiB (about 8 billion parameters × 4 bytes), so this function needs at least 32 GiB of GPU memory.

In [5]:
def inference_8B_instruct(input_texts):

    # Load model directly
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
    
    ### NOTE: THIS MODEL REQUIRES AT LEAST 32GiB OF GPU MEMORY ###
    # Move the model to the GPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if torch.cuda.is_available():
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
    else:
        print("Using CPU")
    model.to(device)

    tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True).to(device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=200)
        generated_texts = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

    for i, text in enumerate(generated_texts):
        print(f"Question: {input_texts[i]}")
        print(f"Answer: {text}")
        print()
        
    # Clear GPU memory
    del model
    del inputs
    torch.cuda.empty_cache()

### Perform inference on the defined input texts using the **8B-Instruct** model

In [6]:
inference_8B_instruct(input_texts)

Loading checkpoint shards: 100%|██████████| 4/4 [00:47<00:00, 11.93s/it]


Using GPU: NVIDIA A40


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Question: Explain the theory of relativity.
Answer: Explain the theory of relativity. Relativity is a fundamental concept in physics that explains how space and time are intertwined and how they are affected by gravity and motion. There are two main components to the theory of relativity: special relativity and general relativity.
Special relativity, developed by Albert Einstein in 1905, posits that the laws of physics are the same for all observers in uniform motion relative to one another. This means that time and space are relative, and their measurement depends on the observer's frame of reference. The theory also introduces the concept of time dilation, where time appears to pass slower for an observer in motion relative to a stationary observer. Additionally, special relativity explains the concept of length contraction, where objects appear shorter to an observer in motion relative to a stationary observer.
General relativity, developed by Einstein in 1915, builds upon special r
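
Two things in the outputs above follow from how `generate` is called: each answer repeats the question, and each answer stops mid-sentence. `max_length=200` caps the prompt and the continuation together, and for a decoder-only model the returned sequence starts with the prompt tokens. The sketch below shows both effects on toy lists of made-up token IDs; `new_token_budget` and `strip_prompt` are hypothetical helpers, and the slicing assumes left-padded rows so that every prompt occupies the same number of leading positions.

```python
def new_token_budget(prompt_width, max_length):
    """Tokens left for generation when max_length also counts the prompt."""
    return max(max_length - prompt_width, 0)


def strip_prompt(output_rows, prompt_width):
    """Drop the (left-padded) prompt positions, keeping only new tokens."""
    return [row[prompt_width:] for row in output_rows]


prompt_width = 4  # width of the padded prompt batch
outputs = [
    [11, 12, 13, 14, 101, 102],  # prompt tokens, then generated tokens
    [0, 0, 21, 22, 201, 202],    # left-padded prompt, then generated tokens
]

print(new_token_budget(prompt_width, max_length=6))  # 2
print(strip_prompt(outputs, prompt_width))           # [[101, 102], [201, 202]]
```

In the notebook's code this would roughly correspond to passing `max_new_tokens` instead of `max_length` to `model.generate`, and decoding `outputs[:, inputs["input_ids"].shape[1]:]` rather than the full rows.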

### Load the **Llama 3.1 70B** model and tokenizer, then perform inference on the input texts.
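- Held in float32, the 70B weights alone take roughly 260 GiB (about 70 billion parameters × 4 bytes), and roughly 130 GiB in 16-bit precision. Run as written, this function therefore needs more memory than a single 80 GiB GPU provides.

In [7]:
def inference_70B(input_texts):
    # Load model directly
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B")
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-70B")

    ### NOTE: IN FLOAT32 THE WEIGHTS ALONE NEED ROUGHLY 260GiB OF GPU MEMORY ###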
    # Move the model to the GPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if torch.cuda.is_available():
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
    else:
        print("Using CPU")
    model.to(device)

    tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True).to(device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=200)
        generated_texts = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    for i, text in enumerate(generated_texts):
        print(f"Question: {input_texts[i]}")
        print(f"Answer: {text}")
        print()
    
    # Clear GPU memory
    del model
    del inputs
    torch.cuda.empty_cache()

### Perform inference on the defined input texts using the **70B** model

In [8]:
inference_70B(input_texts)

Loading checkpoint shards: 100%|██████████| 30/30 [05:41<00:00, 11.39s/it]


OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacity of 44.35 GiB of which 414.25 MiB is free. Including non-PyTorch memory, this process has 43.93 GiB memory in use. Of the allocated memory 43.61 GiB is allocated by PyTorch, and 13.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
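
The error above reports a total capacity of 44.35 GiB on the A40, which is far below what the 70B weights need. As a rough lower bound, the sketch below estimates how many such GPUs would be required just to hold the weights. It assumes a round 70 billion parameters and ignores activations, the KV cache, and framework overhead; `gpus_needed` is a hypothetical helper.

```python
import math

GIB = 2**30  # bytes in one GiB


def gpus_needed(num_params, bytes_per_param, gpu_capacity_gib):
    """Lower bound on GPUs required to hold the weights alone."""
    weights_gib = num_params * bytes_per_param / GIB
    return math.ceil(weights_gib / gpu_capacity_gib)


A40_CAPACITY_GIB = 44.35  # total capacity reported in the error message

print(gpus_needed(70e9, 4, A40_CAPACITY_GIB))  # 6  (float32)
print(gpus_needed(70e9, 2, A40_CAPACITY_GIB))  # 3  (16-bit)
```

On a multi-GPU machine, `from_pretrained` accepts `torch_dtype` (for example `torch.bfloat16`) and, with the `accelerate` package installed, `device_map="auto"` to spread layers across the available devices, in which case the explicit `model.to(device)` call is dropped. Whether that is enough depends on the hardware at hand.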