# Mamba-2 Language Model demo

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import time

import torch
from transformers import AutoTokenizer

from mamba2 import Mamba2LMHeadModel

if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')



Official pretrained models on [Hugging Face](https://huggingface.co/state-spaces):
* `state-spaces/mamba2-130m`
* `state-spaces/mamba2-370m`
* `state-spaces/mamba2-780m`
* `state-spaces/mamba2-1.3b`
* `state-spaces/mamba2-2.7b`

Choose a model that fits in the available memory: system RAM when running on the CPU or on a machine with unified memory, VRAM when running on a discrete GPU.

Note that these are base models without fine-tuning for downstream tasks such as chat or instruction following.
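
As a rough guide for picking a checkpoint, the memory needed for the weights alone is about the parameter count times the bytes per parameter. The sketch below assumes the parameter counts implied by the checkpoint names and 4 bytes per parameter (float32); both are assumptions rather than measured values, and activations and recurrent state add to the total.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# Parameter counts are assumed from the checkpoint names, not measured.
APPROX_PARAMS = {
    "state-spaces/mamba2-130m": 130e6,
    "state-spaces/mamba2-370m": 370e6,
    "state-spaces/mamba2-780m": 780e6,
    "state-spaces/mamba2-1.3b": 1.3e9,
    "state-spaces/mamba2-2.7b": 2.7e9,
}


def weight_memory_gib(n_params: float, bytes_per_param: int = 4) -> float:
    """Memory for the weights alone, in GiB (float32 assumed by default)."""
    return n_params * bytes_per_param / 2**30


for name, n_params in APPROX_PARAMS.items():
    print(f"{name}: ~{weight_memory_gib(n_params):.1f} GiB in float32")
```

Passing `bytes_per_param=2` gives the corresponding figure for half-precision weights.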

In [3]:
model = Mamba2LMHeadModel.from_pretrained("state-spaces/mamba2-1.3b", device=device)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer.pad_token_id = tokenizer.eos_token_id



In [4]:
generation_config = dict(
    max_new_length=200,
    temperature=1.0,
    top_k=30,
    top_p=1.0,
)
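
To make the sampling parameters above concrete, here is a minimal plain-Python sketch of temperature, top-k, and top-p (nucleus) sampling over a list of logits. It is an illustration only: `sample_next_token` is a hypothetical helper, not the function used by `Mamba2LMHeadModel`, whose tensor-based implementation may order or normalize the filters differently.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample an index from a list of logits (illustrative sketch only)."""
    # Temperature rescales the logits: <1 sharpens, >1 flattens the distribution.
    scaled = [x / temperature for x in logits]
    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    # (probability, index) pairs, most probable first.
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    # top_k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # top_p: keep the shortest prefix whose cumulative probability reaches top_p.
    if top_p < 1.0:
        kept, mass = [], 0.0
        for p, i in ranked:
            kept.append((p, i))
            mass += p
            if mass >= top_p:
                break
        ranked = kept
    ids = [i for _, i in ranked]
    probs = [p for p, _ in ranked]
    # random.choices treats the weights as relative, so no renormalization is needed.
    return rng.choices(ids, weights=probs, k=1)[0]


# Same settings as generation_config, applied to a toy 4-token vocabulary.
print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=1.0, top_k=30, top_p=1.0))
```

With `top_k=30` and `top_p=1.0`, as in the configuration above, only the top-k filter is active: each token is drawn from the 30 most probable candidates.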

In [5]:
def generate(prompt: str, seed: int = 0, show_perf: bool = True):
    """Stream a completion for `prompt` to stdout and optionally report throughput."""
    torch.manual_seed(seed)  # fixed seed so sampling is reproducible

    # Tokenize the prompt and drop the batch dimension.
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)[0]
    print(prompt, end="")

    # perf_counter measures wall-clock time; process_time would count CPU time only.
    start = time.perf_counter()
    n_generated = 0
    for i, (token_id, _hidden_state) in enumerate(model.generate(input_ids, **generation_config)):
        token = tokenizer.decode([token_id])
        if i == 0:
            # The first token arrives only after the whole prompt has been processed,
            # so its latency is reported as prompt evaluation time.
            now = time.perf_counter()
            prompt_eval_elapsed, start = now - start, now
        else:
            n_generated += 1
        print(token, end="", flush=True)
    if show_perf:
        elapsed = time.perf_counter() - start
        print('\n\n---')
        print(f'Prompt eval | tokens: {input_ids.shape[0]} | elapsed: {prompt_eval_elapsed:.2f}s | tok/s: {input_ids.shape[0] / prompt_eval_elapsed:.2f}')
        print(f'Generation | tokens: {n_generated} | elapsed: {elapsed:.2f}s | tok/s: {n_generated / elapsed:.2f}')
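
The timing split in `generate` (time to the first token counted as prompt evaluation, the remainder as generation) can be sketched as a standalone helper that works on any iterator. The names `time_stream` and `fake_token_stream` are hypothetical, and the fake stream only stands in for `model.generate(...)` so the example runs without a model. The helper uses `time.perf_counter()`, which measures wall-clock time; `time.process_time()` counts only the CPU time of the process.

```python
import time


def time_stream(stream):
    """Consume an iterator; return (items, seconds to first item, seconds for the rest)."""
    items = []
    first_elapsed = 0.0
    start = time.perf_counter()
    for i, item in enumerate(stream):
        if i == 0:
            # Latency of the first item, then restart the clock for the rest.
            now = time.perf_counter()
            first_elapsed, start = now - start, now
        items.append(item)
    rest_elapsed = time.perf_counter() - start
    return items, first_elapsed, rest_elapsed


def fake_token_stream(n, delay=0.01):
    """Hypothetical stand-in for model.generate(...): yields n ids with a fixed delay."""
    for i in range(n):
        time.sleep(delay)
        yield i


tokens, first, rest = time_stream(fake_token_stream(5))
# The first token is attributed to prompt evaluation, so it is excluded from the rate.
rate = (len(tokens) - 1) / rest if rest > 0 else float("nan")
print(f"first token: {first:.3f}s | remaining {len(tokens) - 1} tokens: {rate:.1f} tok/s")
```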

In [6]:
generate("Mamba is a new state space model architecture")

Mamba is a new state space model architecture, for applications such as neural signal processing and classification. It has been implemented as a C library for Windows.

Features are:

Vector autoregressive models of any order

Bayesian state space models of any order

SVM models of any order

K-Means clustering models

Support for linear transformations, including scaling, translation, rotation, and scaling+translation

Support for random data as a distribution of samples for both the training and testing

Support for non-linear transformation models by the choice of activation function through a weighted least squares cost

The Mamba library can be downloaded at the following URL: https://github.com/louisdal/mamba

---
Prompt eval | tokens: 9 | elapsed: 1.34s | tok/s: 6.71
Generation | tokens: 144 | elapsed: 4.25s | tok/s: 33.86


In [7]:
generate("The meaning of life is")

The meaning of life is one that you choose. And this is what I want to tell you, as a person who has been through so much, and whose life will go on no matter what.

---
Prompt eval | tokens: 5 | elapsed: 0.21s | tok/s: 24.06
Generation | tokens: 34 | elapsed: 1.02s | tok/s: 33.39


In [8]:
generate("CUDA is Nvidia's biggest moat")

CUDA is Nvidia's biggest moat, but you can build a strong case for it even without it. If you're making high-end gaming PC (Gigabytes of RAM, beefy graphics cards, beefy cooling systems).

Nvidia's GPUs are the most powerful, reliable, and expensive parts in the industry. GPUs are very power hungry, so if they run hot, things can get complicated really fast (I learned this by the ways of my Razer Core. A lot!).

If you're looking to build a gaming PC or something that needs lots of RAM, you can build a PC with a huge amount of RAM, but most people use them like me. Most of the times, you can get away with 8GB RAM.

Then your graphics cards are your largest financial investment and your biggest power wasters. A good GPU can cost a few grand. But as Nvidia makes more and more powerful GPUs, the price comes down. It's hard to build a

---
Prompt eval | tokens: 9 | elapsed: 0.37s | tok/s: 24.32
Generation | tokens: 199 | elapsed: 5.89s | tok/s: 33.80
