## This notebook demonstrates how to create and configure a quantized Hugging Face model

Create a Hugging Face text-generation pipeline with a model.

Note that some Hugging Face models cannot be traced directly by torch.fx and require slight modifications in modeling_[model].py. For example, for llama models, in transformers.models.llama.modeling_llama.py, the line ```if query_states.device.type == "cuda" and causal_mask is not None:``` needs to be changed to ```if causal_mask is not None:```.

Models from the d-matrix domain already include the changes to modeling_[model].py that tracing requires.
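
The llama change described above is a one-line edit. As a minimal sketch, assuming you are working on a local copy of the modeling source held as a string, the helper below applies it with plain string replacement and reports how many lines were rewritten. The name `relax_cuda_guard` is hypothetical; it is not part of transformers or dmx.compressor.

```python
# Hypothetical helper: not part of transformers or dmx.compressor.
OLD_GUARD = 'if query_states.device.type == "cuda" and causal_mask is not None:'
NEW_GUARD = "if causal_mask is not None:"


def relax_cuda_guard(source: str) -> tuple[str, int]:
    """Return the patched source and the number of guards that were rewritten."""
    count = source.count(OLD_GUARD)
    return source.replace(OLD_GUARD, NEW_GUARD), count


# A toy snippet standing in for the relevant lines of modeling_llama.py.
snippet = (
    'if query_states.device.type == "cuda" and causal_mask is not None:\n'
    "    query_states = query_states.contiguous()\n"
)
patched, n_changed = relax_cuda_guard(snippet)
print(n_changed)
print(patched)
```

A count of zero means the guard was not found, for example because the installed transformers version words that line differently, and the file should then be inspected by hand.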

In [None]:
from dmx.compressor.modeling.hf import pipeline

pipe = pipeline(
    task="text-generation",
    model="d-matrix/opt",
    revision="opt-125m",
    dmx_config="BASELINE",
    trust_remote_code=True,  # allow the custom modeling code shipped with the model repo
    device_map="auto",  # enabling model parallel on multi-GPU nodes
)

Configure the model to formats equivalent to basic-mode execution on d-Matrix's hardware

In [None]:
from dmx.compressor import config_rules
pipe.model = pipe.model.transform(
    pipe.model.dmx_config,
    *config_rules.BASIC,
)

Run a forward pass

In [None]:
import torch

# Token IDs of a short prompt; the labels reuse the input IDs,
# as is usual for causal language modeling.
model_inputs = {
    "input_ids": torch.tensor(
        [[2, 11475, 2115, 10, 86, 11, 10, 1212, 444, 6, 444, 409]], device="cuda:0"
    ),
    "labels": torch.tensor(
        [[2, 11475, 2115, 10, 86, 11, 10, 1212, 444, 6, 444, 409]], device="cuda:0"
    ),
    "past_key_values": None,  # no cached keys/values from an earlier call
    "use_cache": True,
}

In [None]:
y = pipe.model(**model_inputs)
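
Standard Hugging Face causal language models shift the labels internally so that position i is trained to predict token i + 1, which is why `labels` can simply repeat `input_ids`. Assuming the dmx-wrapped model keeps that behavior, the sketch below illustrates the shift in plain Python on the same token IDs. `next_token_pairs` is a hypothetical helper for illustration only.

```python
def next_token_pairs(token_ids):
    """Pair each position's input token with the token it is trained to predict."""
    # Drop the last input and the first label so position i lines up with token i + 1.
    return list(zip(token_ids[:-1], token_ids[1:]))


token_ids = [2, 11475, 2115, 10, 86, 11, 10, 1212, 444, 6, 444, 409]
pairs = next_token_pairs(token_ids)
print(len(pairs))  # one fewer than the sequence length
print(pairs[:3])
```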

Configure to other formats

In [None]:
from dmx.compressor.modeling import nn, DmxConfigRule
bfp16 = "BFP[8|8]{64}(SN)"
bfp14 = "BFP[6|8]{64}(SN)"
rules = (
    DmxConfigRule(
        module_types=(nn.Embedding,),
        module_config=dict(
            input_formats=[bfp16],
            output_format=bfp16,
        ),
    ),
    DmxConfigRule(
        module_types=(nn.Linear,),
        module_config=dict(
            input_format=[bfp16],
            weight_format=bfp14,
        ),
    ),
)
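
The format strings above pack several fields into one token. As a rough sketch, the parser below splits a string such as `BFP[8|8]{64}(SN)` into its bracketed parts. The field names (`precision`, `exponent_bits`, `block_size`, `rounding`) are an assumed reading of the notation, not something this notebook states, and `parse_bfp` is a hypothetical helper rather than a dmx.compressor function.

```python
import re

# Hypothetical helper; the field names are an assumed interpretation of the notation.
_BFP_PATTERN = re.compile(r"^BFP\[(\d+)\|(\d+)\]\{(\d+)\}\((\w+)\)$")


def parse_bfp(fmt: str) -> dict:
    """Split a 'BFP[a|b]{c}(d)' string into named fields."""
    match = _BFP_PATTERN.match(fmt)
    if match is None:
        raise ValueError(f"not a BFP format string: {fmt!r}")
    precision, exponent_bits, block_size, rounding = match.groups()
    return {
        "precision": int(precision),
        "exponent_bits": int(exponent_bits),
        "block_size": int(block_size),
        "rounding": rounding,
    }


print(parse_bfp("BFP[8|8]{64}(SN)"))
print(parse_bfp("BFP[6|8]{64}(SN)"))
```

Under this reading, the two formats used in the rules differ only in the first field, 8 versus 6.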

In [None]:
pipe.model.configure(None, *rules)

Check quantized GraphModule

In [None]:
pipe.model._gm

Run text generation

In [None]:
prompt = "Once upon a time in a land far, far away"
generated_texts = pipe(prompt, max_length=50, num_return_sequences=1)
print(generated_texts)

Unquantize and run text generation again

In [None]:
rules = (
    DmxConfigRule(
        module_types=(nn.Embedding,),
        module_config=dict(
            input_formats=["SAME"],
            output_format="SAME",
        ),
    ),
    DmxConfigRule(
        module_types=(nn.Linear,),
        module_config=dict(
            input_format=["SAME"],
            weight_format="SAME",
        ),
    ),
)
pipe.model.configure(None, *rules)
prompt = "Once upon a time in a land far, far away"
generated_texts = pipe(prompt, max_length=50, num_return_sequences=1)
print(generated_texts)

In [None]:
pipe.model._gm

Evaluate the model on the perplexity metric

In [None]:
metric = pipe.evaluate(
    "d-matrix/dmx_perplexity",
    dataset="wikitext",
    dataset_version="wikitext-2-raw-v1",
)
print(metric)
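
For reference, perplexity is the exponential of the average negative log-likelihood per token, so lower values are better. The sketch below computes it from a list of per-token probabilities in plain Python. It illustrates the definition only and makes no claim about how `d-matrix/dmx_perplexity` is implemented.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that assigns probability 1/4 to every observed token is as uncertain
# as a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25]))
```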