# Using gated Hugging Face models

We'll use Mistral 7B as the example for this notebook; it tends to work extremely well, and it is a gated model that requires no special license terms.

Directions:
1. Create a Hugging Face account: [https://huggingface.co/join](https://huggingface.co/join)
2. Create an access token with "read" permissions: [https://huggingface.co/docs/hub/security-tokens](https://huggingface.co/docs/hub/security-tokens).
    1. _Be sure to copy and save the token after it's generated,_ or you'll need to create a new one.  You can only see the access token once.
3. Navigate to the page of the gated model you want to access, e.g. [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3).  Accept the license terms on the page.
4. Then, run the following cell to load the gated model.

In [None]:
import re
import torch
import transformers

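# Paste the token created in step 2. Avoid sharing a notebook that still contains it.
access_token = "paste your huggingface access token here"

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
# The token authorizes the download of the gated tokenizer and weights.
tok = transformers.AutoTokenizer.from_pretrained(
    model_name,
    token=access_token,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    token=access_token,
    # Load the weights in 4-bit NF4 (needs the bitsandbytes package) and compute in float16.
    quantization_config=transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
    low_cpu_mem_usage=True,
)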
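
Pasting the token into the notebook makes it easy to leak when the notebook is shared. As a minimal sketch of an alternative, the token can be read from an environment variable instead. This assumes you have exported `HF_TOKEN` before starting the notebook; `read_hf_token` is a hypothetical helper name, not part of any library.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the access token stored in `env_var`, or None if it is unset or blank."""
    value = os.environ.get(env_var, "").strip()
    return value or None


access_token = read_hf_token()
if access_token is None:
    # Without a token, downloading a gated model is expected to fail.
    print("HF_TOKEN is not set; set it or paste the token manually.")
```

The resulting `access_token` can be passed to `from_pretrained(..., token=access_token)` exactly as in the loading cell.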

In [37]:
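def generate(text, chat_history=None):
    # Wrap the new user message in the model's chat template and tokenize it.
    prompt = [
        {"role": "user", "content": text},
    ]
    model_inputs = tok.apply_chat_template(prompt, return_tensors="pt")

    # Prepend the token ids of earlier turns so the model sees the whole conversation.
    if chat_history is not None:
        model_inputs = torch.hstack((chat_history, model_inputs))

    # Put the inputs on the same device as the model before generating.
    generated_ids = model.generate(
        model_inputs.to(model.device),
        max_new_tokens=256,
        do_sample=True,
    ).to("cpu")
    output = tok.decode(generated_ids[0])

    # Split the decoded text on the tokenizer's special tokens and drop empty
    # pieces, leaving one string per conversation turn.
    special_tokens = [t.content for t in tok.added_tokens_decoder.values()]
    pattern = "|".join(map(re.escape, special_tokens))
    turns = [piece for piece in re.split(pattern, output) if piece.strip() != ""]
    return generated_ids, turns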
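
The last three lines of `generate` split the decoded text on the tokenizer's special tokens. The sketch below shows that step in isolation on a hand-written string, so it runs without loading the model. The token list and the sample string are assumptions chosen for illustration; the real list comes from `tok.added_tokens_decoder`, and `split_on_special_tokens` is a hypothetical helper name.

```python
import re


def split_on_special_tokens(decoded, special_tokens):
    """Split `decoded` on any of `special_tokens` and drop whitespace-only pieces."""
    # re.escape keeps characters such as "[" and "/" from being read as regex syntax.
    pattern = "|".join(map(re.escape, special_tokens))
    return [piece for piece in re.split(pattern, decoded) if piece.strip() != ""]


# Illustrative stand-ins for the tokenizer's special tokens.
example_tokens = ["<s>", "</s>", "[INST]", "[/INST]"]
example_text = "<s>[INST] Hi there [/INST] Hello!</s>"
print(split_on_special_tokens(example_text, example_tokens))
# prints [' Hi there ', ' Hello!']
```

Each surviving piece keeps its surrounding spaces, which is why the responses printed in this notebook start with a space.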

In [38]:
ids, resp = generate("What's a cool name for an orcish DnD character?")
print(resp)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[" What's a cool name for an orcish DnD character? ", " Choosing a name for a D&D character can be fun and immersive. For an orcish character, you might want to consider a name that reflects strength, honor, or traditions of the orc race. Here are a few suggestions:\n\n1. Grimgut Zharkhan: This name combines a powerful-sounding personal name (Grimgut) with a clan or tribal name (Zharkhan).\n\n2. Boulderfist Wreckspear: Boulderfist is a strong, descriptive name, and Wreckspear is a dynamic, action-oriented surname.\n\n3. Blackscar Mammothslayer: Blackscar could denote a fearsome warrior, and Mammothslayer showcases a notable accomplishment.\n\n4. Fireheart Stonebringer: Fireheart suggests bravery or fearlessness, while Stonebringer could be a title relating to the orc's connection to their clan or tribe through powerful relics.\n\n5. Axebane Bloodaxe: Axebane sounds like a powerful orcish warrior, and Bloodaxe can be interpreted as a title or a family"]


In [36]:
ids, resp = generate("I don't like those very much.  Can you give me ones that sound a bit less stereotypical, maybe a bit more like real-world names?", ids)
print(resp)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 8.00 GiB of which 0 bytes is free. Of the allocated memory 7.15 GiB is allocated by PyTorch, and 146.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
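
The second call feeds every earlier token back into the model, so the prompt grows with each turn, and on an 8 GiB GPU that growth is one plausible contributor to the error above. A minimal sketch of one mitigation is to keep the conversation as a list of messages and pass only the most recent ones to the chat template. `trim_history` and its `max_messages` parameter are hypothetical names, and trimming is not guaranteed to avoid the error on every setup.

```python
def trim_history(messages, max_messages=4):
    """Keep at most `max_messages` recent messages, starting on a user turn."""
    if max_messages <= 0:
        return []
    recent = list(messages[-max_messages:])
    # Chat templates commonly expect a conversation to open with a user message,
    # so drop any leading non-user messages left over after trimming.
    while recent and recent[0]["role"] != "user":
        recent.pop(0)
    return recent


history = [
    {"role": "user", "content": "What's a cool name for an orcish DnD character?"},
    {"role": "assistant", "content": "Grimgut Zharkhan"},
    {"role": "user", "content": "Something that sounds more like a real-world name?"},
]
print(len(trim_history(history, max_messages=2)))
# prints 1
```

The trimmed list could then be passed to `tok.apply_chat_template` in place of the single-message `prompt`, instead of stacking token ids from earlier turns.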

In [None]:

tok("a test", return_dict=False)