In [6]:
!pip install -q -U transformers peft torch accelerate einops sentencepiece bitsandbytes

In [1]:
import torch
from peft import PeftModel, PeftConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)


In [2]:
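# PEFT adapter repo on the Hugging Face Hub; its config records the base model name.
peft_model_id = "dfurman/Mixtral-8x7B-peft-v0.1"
config = PeftConfig.from_pretrained(peft_model_id)

# Load the tokenizer from the adapter repo.
tokenizer = AutoTokenizer.from_pretrained(
    peft_model_id,
    use_fast=True,
    trust_remote_code=True,
)

# Store the base weights in 4-bit NF4 and run compute in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the quantized base model; device_map="auto" places layers on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Attach the adapter weights on top of the base model.
model = PeftModel.from_pretrained(
    model,
    peft_model_id,
)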


Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]
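
The 4-bit configuration above matters mainly for memory. As a rough back-of-the-envelope sketch, the snippet below estimates weight storage at different precisions. The parameter count (about 46.7 billion, on the assumption that the base model is Mixtral-8x7B) and the 4.5 bits-per-weight figure (4-bit values plus one 32-bit scaling constant per 64-weight block) are assumptions for illustration, not values read from this notebook. Layers that stay unquantized, activations, and the KV cache are ignored. `estimate_weight_gb` is a hypothetical helper, not part of any library.

```python
def estimate_weight_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: roughly 46.7 billion parameters in the base model.
N_PARAMS = 46.7e9

# Assumption: 4.5 bits approximates 4-bit weights plus per-block scaling constants.
for label, bits in [
    ("bfloat16", 16.0),
    ("4-bit", 4.0),
    ("4-bit + block constants", 4.5),
]:
    print(f"{label}: {estimate_weight_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the quantized weights need roughly a quarter of the bfloat16 footprint, which is the main reason to load the base model through `BitsAndBytesConfig`.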

In [3]:
messages = [
    {"role": "user", "content": "Tell me a recipe for a mai tai."},
]

print("\n\n*** Prompt:")
# Render the conversation with the tokenizer's chat template and tokenize it.
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
)
print(tokenizer.decode(input_ids[0]))

print("\n\n*** Generate:")
# Only input_ids is passed, so generate() warns about the missing attention mask.
with torch.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(
        input_ids=input_ids.to("cuda"),
        max_new_tokens=1024,
        return_dict_in_generate=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output["sequences"][0][len(input_ids[0]):],
    skip_special_tokens=True,
)
print(response)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.




*** Prompt:
<s> [INST] Tell me a recipe for a mai tai. [/INST] 


*** Generate:
1.5 oz light rum
2 oz dark rum
1 oz lime juice
0.5 oz orange curaçao
0.5 oz orgeat syrup

In a shaker filled with ice, combine the light rum, dark rum, lime juice, orange curaçao, and orgeat syrup. Shake well.

Strain the mixture into a chilled glass filled with fresh ice.

Garnish with a lime wedge and a cherry.
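
The warning printed by the cell above asks for an explicit `attention_mask` and `pad_token_id`. A minimal sketch of one way to supply both is shown below. `generate_reply` is a hypothetical helper name, not something defined in this notebook. It assumes a transformers version whose `apply_chat_template` accepts `return_dict=True`, and it assumes that reusing the EOS token as the pad token is acceptable for single-prompt generation (which is what the warning says `generate` falls back to anyway).

```python
def generate_reply(model, tokenizer, messages, max_new_tokens=1024):
    """Generate a reply for `messages`, passing an explicit attention mask."""
    # return_dict=True returns both input_ids and attention_mask.
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        return_tensors="pt",
        return_dict=True,
    )
    # Move every tensor to the device the model reports.
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}

    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        # Assumption: EOS doubles as the pad token for open-ended generation.
        pad_token_id=tokenizer.eos_token_id,
    )

    # Without return_dict_in_generate, generate() returns the sequences directly.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
```

With the `model`, `tokenizer`, and `messages` objects from the cells above, a call would look like `print(generate_reply(model, tokenizer, messages))`. Wrapping the steps in a function also avoids repeating the same generation code for each new prompt.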


In [7]:
messages = [
    {"role": "user", "content": "Recommend some games to play for 3 year old and 7 year olds."},
]

print("\n\n*** Prompt:")
# Render the conversation with the tokenizer's chat template and tokenize it.
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
)
print(tokenizer.decode(input_ids[0]))

print("\n\n*** Generate:")
# Only input_ids is passed, so generate() warns about the missing attention mask.
with torch.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(
        input_ids=input_ids.to("cuda"),
        max_new_tokens=1024,
        return_dict_in_generate=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output["sequences"][0][len(input_ids[0]):],
    skip_special_tokens=True,
)
print(response)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.




*** Prompt:
<s> [INST] Recommend some games to play for 3 year old and 7 year olds. [/INST] 


*** Generate:
1. Candy Land: A classic board game that is easy to understand and play for both 3 and 7 year olds.

2. Chutes and Ladders: Another classic board game that teaches counting and basic strategy.

3. Memory Matching Games: These games help with memory and concentration skills.

4. Jenga: A game of skill and strategy that can be played by both age groups.

5. Uno: A card game that teaches colors, numbers, and simple strategy.

6. Connect Four: A strategy game that teaches spatial reasoning and planning.

7. Guess Who?: A game of deduction and elimination that can be played by both age groups.

8. Operation: A game of fine motor skills and hand-eye coordination.

9. Hi Ho Cherry-O: A counting game that teaches numbers and basic addition.

10. Simon Says: A fun game that teaches listening skills and following directions.
