In [1]:
import json
import time

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

In [2]:
print(torch.cuda.is_available())

True


### Load Mistral 7B

In [5]:
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map='auto'
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
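
The warning above asks for a `BitsAndBytesConfig` passed through `quantization_config`. A minimal sketch of that form, assuming a `transformers` release that ships `BitsAndBytesConfig` and an installed `bitsandbytes`; the helper name `load_4bit_model` is hypothetical and only wraps the same call used in the cell above:

```python
def load_4bit_model(model_name):
    """Load a causal LM in 4-bit using the non-deprecated argument.

    Hypothetical helper: it mirrors the notebook's from_pretrained call,
    with load_in_4bit moved into a BitsAndBytesConfig.
    """
    # Imported inside the function so the helper can be defined
    # before the heavy libraries are actually needed.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Same request as load_in_4bit=True, expressed as a config object.
    quant_config = BitsAndBytesConfig(load_in_4bit=True)

    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",
    )
```

With this in place, `model = load_4bit_model(model_name)` would replace the deprecated call; further 4-bit options can be set on the config object if needed.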



In [9]:
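def generate_text(prompt, max_new_tokens=500):
    """Greedy-decode a completion for `prompt` and return the decoded text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding keeps outputs deterministic for comparison
        )

    # outputs[0] holds the prompt tokens followed by the generated tokens,
    # so the returned string begins with the prompt itself.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)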
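
`generate_text` decodes `outputs[0]`, which for a decoder-only model contains the prompt tokens followed by the newly generated ones, so every returned string starts with the prompt. A minimal sketch of slicing the prompt off before decoding; `completion_ids` is a hypothetical helper, not part of `transformers`, and the token ids below are stand-ins for illustration:

```python
def completion_ids(output_ids, prompt_length):
    """Return only the token ids produced after the prompt.

    Works on any sliceable sequence, e.g. a Python list or a 1-D tensor.
    """
    return output_ids[prompt_length:]


# Stand-in ids for illustration: 3 prompt tokens followed by 2 new tokens.
example_output = [1, 733, 16289, 4500, 2]
new_ids = completion_ids(example_output, prompt_length=3)
print(new_ids, len(new_ids))  # [4500, 2] 2
```

Inside `generate_text` this would be called as `completion_ids(outputs[0], inputs["input_ids"].shape[1])` before `tokenizer.decode`. The length of the result is also the number of generated tokens, which is useful when comparing throughput.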

In [10]:
# Test Prompts
prompts = [
    "Explain quantum computing in simple terms.",
    "Write a python function to calculate fibonacci.",
    "What is machine learning?"
]
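
Greedy decoding can stop at very different lengths for each of these prompts, so wall-clock latency alone mixes model speed with answer length. A minimal sketch of a timing helper that adds warm-up runs and a tokens-per-second figure; `time_call` and `tokens_per_second` are hypothetical names, and the warm-up and repeat counts are arbitrary defaults:

```python
import statistics
import time


def time_call(fn, *args, warmup=1, repeats=3, **kwargs):
    """Run fn a few times and return (last_result, mean_seconds, best_seconds)."""
    # Warm-up runs absorb one-off costs such as kernel compilation or caching.
    for _ in range(warmup):
        fn(*args, **kwargs)

    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)

    return result, statistics.mean(durations), min(durations)


def tokens_per_second(num_new_tokens, seconds):
    """Throughput normalised by the number of generated tokens."""
    return num_new_tokens / seconds
```

For example, `time_call(generate_text, prompts[0])` would time one prompt, and dividing the number of generated tokens by the measured seconds gives a figure that is comparable across prompts of different answer lengths.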

In [11]:
# Measure latency
times = []
outputs = []
for prompt in prompts:
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    output = generate_text(prompt)
    latency = time.perf_counter() - start
    times.append(latency)
    outputs.append(output)
    print(f"Prompt: {prompt}...")
    print(f"Latency: {latency:.2f}s")
    print(f"Output: {output}\n")

print(f"Average latency: {sum(times)/len(times):.2f}s")

# Measure memory (1 GiB = 1024**3 bytes)
print(f"GPU Memory allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
print(f"GPU Memory reserved: {torch.cuda.memory_reserved()/1024**3:.2f} GB")

baseline_outputs = {
    "model": model_name,
    "prompts": prompts,
    "outputs": outputs,
    "avg_latency": sum(times)/len(times),
    "Memory_GB": torch.cuda.memory_allocated()/1024**3
}

with open("../benchmarks/results/pytorch_baseline.json", "w") as f:
    json.dump(baseline_outputs, f, indent=4)
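
`torch.cuda.memory_allocated()` reports what is held at the moment it is called, which after generation is mostly the model weights. A minimal sketch of recording the peak instead, assuming a CUDA device is available; `bytes_to_gib` and `peak_gpu_memory_gib` are hypothetical helper names:

```python
GIB = 1024 ** 3  # bytes per GiB


def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def peak_gpu_memory_gib():
    """Peak memory allocated by tensors since the last reset, in GiB."""
    # Imported here so the conversion helper above stays usable without torch.
    import torch

    return bytes_to_gib(torch.cuda.max_memory_allocated())
```

Calling `torch.cuda.reset_peak_memory_stats()` before the latency loop and `peak_gpu_memory_gib()` after it would capture the high-water mark reached during generation, including activations and the KV cache.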

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Prompt: Explain quantum computing in simple terms....
Latency: 13.69s
Output: Explain quantum computing in simple terms. Quantum computing is a type of computing that uses quantum bits, or qubits, instead of classical bits to process information. Qubits can exist in multiple states at once, allowing quantum computers to perform many calculations simultaneously. This property, known as superposition, enables quantum computers to solve certain problems much faster than classical computers. Additionally, quantum computers can also use a phenomenon called entanglement, where qubits become interconnected and their states influence each other, even when separated by large distances. This allows quantum computers to perform certain calculations that are impossible for classical computers. Overall, quantum computing has the potential to revolutionize fields such as cryptography, optimization, and materials science, but it is still a complex and developing technology.



Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Prompt: Write a python function to calculate fibonacci....
Latency: 18.08s
Output: Write a python function to calculate fibonacci.

Here's a simple Python function to calculate Fibonacci numbers:

```python
def fibonacci(n):
    if n <= 0:
        print("Input should be a positive integer.")
        return None
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        a, b = 0, 1
        for _ in range(n - 2):
            a, b = b, a + b
        return b
```

This function takes an integer `n` as an argument and returns the `n`th Fibonacci number. The base cases are for `n` being 0, 1, or 2, and for larger values, it uses a loop to calculate the Fibonacci number recursively.

Here's an example usage:

```python
print(fibonacci(10))  # Output: 34
```

Prompt: What is machine learning?...
Latency: 42.28s
Output: What is machine learning?

Machine learning is a subset of artificial intelligence (AI) that provides systems the ability to automatically learn and i