When a model is available on Hugging Face (HF) but has no dedicated llama-index integration, I can still load it through llama-index's generic HF integration. However, the demo from the docs does not work consistently.

In [2]:
# Creating the ServiceContext further down also loads an embedding model, even
# though I am not using embeddings anywhere in the code. As far as I can tell,
# llama-index falls back to its default OpenAI embedding model when none is given,
# so I load my OpenAI API key from .env here to let that default load succeed.
from dotenv import load_dotenv
load_dotenv()

True
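
The `True` above only tells me that `load_dotenv()` loaded something from the `.env` file, not that the OpenAI key was part of it. A minimal sketch of an explicit check follows. It assumes the key is stored under the conventional `OPENAI_API_KEY` variable name, and `has_openai_key` is a helper name I made up for this note, not part of llama-index or python-dotenv.

```python
import os


def has_openai_key(env=None):
    """Return True if a non-blank OPENAI_API_KEY is present in the mapping.

    `env` defaults to the process environment; passing a plain dict makes
    the check easy to try out without touching real credentials.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


# Only reports presence; the key itself is never printed.
print("OpenAI key found:", has_openai_key())
```

Running this right after `load_dotenv()` should make a missing or blank key visible before the model is loaded.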

The cells below follow that docs demo: they load StabilityAI/stablelm-tuned-alpha-3b through the HuggingFaceLLM class.

In [3]:
from llama_index.prompts import PromptTemplate
import torch
from llama_index.llms import HuggingFaceLLM
from llama_index import ServiceContext

In [4]:
system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")
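
To see what the model actually receives, it helps to spell out how these two pieces are combined. My understanding, which I have not verified against the library source, is that the query is first substituted into the wrapper template and the system prompt is then prepended with a single space. The sketch below reproduces that with plain strings; `build_full_prompt` is a hypothetical helper, and the shortened system prompt is only a stand-in for the full text above.

```python
def build_full_prompt(system_prompt, query_wrapper, query):
    """Approximate the final prompt string: system prompt + wrapped query.

    Assumption: the wrapper is applied first, then the system prompt is
    prepended with one space in between.
    """
    wrapped = query_wrapper.format(query_str=query)
    return f"{system_prompt} {wrapped}"


# Stand-in for the much longer StableLM system prompt.
short_system = "<|SYSTEM|># StableLM Tuned (Alpha version)"
wrapper = "<|USER|>{query_str}<|ASSISTANT|>"

print(build_full_prompt(short_system, wrapper, "Albert Einstein is "))
```

If the real prompt differs from this, the special tokens `<|SYSTEM|>`, `<|USER|>` and `<|ASSISTANT|>` should still appear in this order, since that is the format the tuned model expects.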

In [5]:
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": True},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    model_kwargs={"offload_folder": "/Users/avilay/Desktop/temp/offload"}
)

service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
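
The `stopping_ids` argument lists token ids that should end generation. As I understand it, the integration compares the most recently generated token against this list after every step. The sketch below mimics that check on plain Python lists instead of tensors; `should_stop` and `truncate_at_stop` are names I invented for illustration and are not llama-index functions.

```python
STOPPING_IDS = [50278, 50279, 50277, 1, 0]


def should_stop(generated_ids, stopping_ids=STOPPING_IDS):
    """True once the most recent token is one of the stop ids."""
    return bool(generated_ids) and generated_ids[-1] in stopping_ids


def truncate_at_stop(token_ids, stopping_ids=STOPPING_IDS):
    """Replay a token stream and cut it off right after the first stop id."""
    kept = []
    for tok in token_ids:
        kept.append(tok)
        if should_stop(kept, stopping_ids):
            break
    return kept


print(truncate_at_stop([10, 11, 50278, 12]))  # [10, 11, 50278]
print(truncate_at_stop([0, 5]))               # [0]
```

The second call shows one consequence of this rule: if the very first generated token is already a stop id, there is no ordinary text left in the output.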


In [6]:
resp = llm.complete("Albert Einstein is ")

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


In [7]:
print(resp)
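
Because sampling is enabled (`do_sample=True`), two runs of the same prompt can differ, and some runs print nothing at all. One way to cope is to retry until a non-blank completion comes back. The sketch below is only that, a sketch: `complete_until_nonblank` is a hypothetical helper, and it assumes that `str()` of the response yields the generated text, which is what `print(resp)` relies on.

```python
def complete_until_nonblank(complete_fn, prompt, max_attempts=3):
    """Call complete_fn(prompt) up to max_attempts times.

    Returns the first completion that is not blank after stripping
    whitespace, or "" if every attempt came back blank.
    """
    for _ in range(max_attempts):
        text = str(complete_fn(prompt)).strip()
        if text:
            return text
    return ""


# Stand-in for llm.complete so the sketch runs without loading the model:
# it returns a blank string once, then some text.
_replies = iter(["", "a physicist"])
print(complete_until_nonblank(lambda p: next(_replies), "Albert Einstein is "))
```

With the real model this would be called as `complete_until_nonblank(llm.complete, "Albert Einstein is ")`. Retrying hides the symptom but does not explain why a completion is blank in the first place.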


