# From OpenAI to Open LLMs with Messages API

In [1]:
!pip install --upgrade -q huggingface_hub langchain langchain-community langchainhub langchain-openai llama-index chromadb bs4 sentence_transformers torch



In [3]:
import getpass
import os

# enter API key
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_API_KEY = getpass.getpass()

# enter OpenAI key (for RAG embeddings)
# os.environ["OPENAI_API_KEY"] = getpass.getpass()

## Create an Inference Endpoint using `huggingface_hub`

The `huggingface_hub` Python library allows you to programmatically create and manage Inference Endpoints in just a few steps. Here, we'll use it to deploy the powerful [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) as an endpoint running on [Text Generation Inference](https://huggingface.co/docs/text-generation-inference/index), our high-performance inference solution for serving LLMs in production.

We need to specify the endpoint name and model repository for the text-generation task. A protected Inference Endpoint means a valid HF token is required to access the deployed API. We also need to configure the hardware requirements like vendor, region, accelerator, instance type, and size. You can check out the list of available resources [here](https://api.endpoints.huggingface.cloud/#get-/v2/provider).

In [5]:
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "nous-hermes-2-mixtral-8x7b-demo",
    repository="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_type="p4de",
    instance_size="2xlarge",
    namespace="HF-test-lab",
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_INPUT_LENGTH": "4096",
            "MAX_BATCH_PREFILL_TOKENS": "4096",
            "MAX_TOTAL_TOKENS": "32000",
            "MAX_BATCH_TOTAL_TOKENS": "1024000",
            "MODEL_ID": "/repository",
        },
        # "url": "ghcr.io/huggingface/text-generation-inference:1.4.0",  # must be >= 1.4.0
        "url": "ghcr.io/huggingface/text-generation-inference:sha-ee1cf51",
    },
)

endpoint.wait()
print(endpoint.status)

running
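The token limits in the `env` block above depend on each other. The sketch below checks the relationships that, to my understanding, TGI's launcher enforces at startup; treat the exact rules as an assumption and confirm them against the TGI documentation for your image version. `check_tgi_limits` is a hypothetical helper, not part of `huggingface_hub` or TGI.

```python
def check_tgi_limits(env):
    """Return a list of violated constraints for a TGI-style `env` mapping."""
    max_input = int(env["MAX_INPUT_LENGTH"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    prefill = int(env["MAX_BATCH_PREFILL_TOKENS"])
    batch_total = int(env["MAX_BATCH_TOTAL_TOKENS"])

    problems = []
    # Assumed rule: the prompt budget must leave room for generated tokens.
    if max_input >= max_total:
        problems.append("MAX_INPUT_LENGTH must be < MAX_TOTAL_TOKENS")
    # Assumed rule: a single prefill batch must fit at least one full prompt.
    if prefill < max_input:
        problems.append("MAX_BATCH_PREFILL_TOKENS must be >= MAX_INPUT_LENGTH")
    # Assumed rule: the batch budget must fit at least one full request.
    if batch_total < max_total:
        problems.append("MAX_BATCH_TOTAL_TOKENS must be >= MAX_TOTAL_TOKENS")
    return problems


# Values copied from the deployment cell above.
mixtral_env = {
    "MAX_INPUT_LENGTH": "4096",
    "MAX_BATCH_PREFILL_TOKENS": "4096",
    "MAX_TOTAL_TOKENS": "32000",
    "MAX_BATCH_TOTAL_TOKENS": "1024000",
}
print(check_tgi_limits(mixtral_env))
```

Running a check like this locally is cheaper than waiting for a deployment to fail because of an inconsistent configuration.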


In [21]:
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "zephyr-7b-beta-arr-test",
    repository="HuggingFaceH4/zephyr-7b-beta",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_type="g5.2xlarge",
    instance_size="medium",
    namespace="HF-test-lab",
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_INPUT_LENGTH": "16384",
            "MAX_BATCH_PREFILL_TOKENS": "16384",
            "MAX_TOTAL_TOKENS": "18432",
            "MAX_BATCH_TOTAL_TOKENS": "18432",
            "MODEL_ID": "/repository",
        },
        # "url": "ghcr.io/huggingface/text-generation-inference:1.4.0",  # must be >= 1.4.0
        "url": "ghcr.io/huggingface/text-generation-inference:sha-ee1cf51",
    },
)

endpoint.wait()
print(endpoint.status)

running


It will take a few minutes for our deployment to spin up. We can use the `.wait()` utility to block the running thread until the endpoint reaches a final "running" state. Once it's running, we can run a quick check to confirm that everything works as expected.
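As a rough illustration of what a blocking wait does, here is a generic polling loop. It is not `huggingface_hub`'s implementation: `wait_until_running`, the status strings other than "running", and the fake status sequence are all assumptions made for this sketch.

```python
import time


def wait_until_running(get_status, timeout=60.0, refresh_every=5.0, sleep=time.sleep):
    """Poll `get_status()` until it returns "running" or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "running":
            return status
        if status == "failed":
            raise RuntimeError("endpoint failed to start")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(refresh_every)


# Hypothetical status sequence standing in for a real endpoint.
statuses = iter(["pending", "initializing", "running"])
print(wait_until_running(lambda: next(statuses), refresh_every=0.0))
```

Passing the status source as a callable keeps the loop testable without any network access.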

In [15]:
endpoint.client.text_generation(
    "<s>[INST] Why is open-source software important? [/INST]",
    max_new_tokens=100,
    do_sample=True,
)

'\n\nSci-fi writer William Gibson said that the “the future is here, it’s just not evenly distributed.” Open-source software enthusiasts would perhaps understand his words as a metaphor on the dissemination of technological knowledge and its global implications.\n\nAs cryptocurrencies and digital identity storage/access technologies gain ground, so should the understanding of the underlying technical principles they employ become an increasingly important part of education and public awareness.\n\nThe pressing need'

Great, we now have a working deployment! But notice how we needed to carefully format the prompt according to the model's instruction format? While our [chat templates](https://huggingface.co/docs/transformers/chat_templating) handle all of this nuance, the new Messages API makes things even simpler...
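To make the formatting burden concrete, here is a minimal sketch of the manual work a chat template takes over. It renders user/assistant turns in a simplified Mistral-instruct-like layout. `format_mistral_prompt` is a hypothetical helper, and the model's real template may differ in whitespace and in how it treats system messages.

```python
def format_mistral_prompt(messages, bos="<s>", eos="</s>"):
    """Render alternating user/assistant turns in a Mistral-instruct-like layout."""
    parts = [bos]
    for i, message in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError("roles must alternate user/assistant, starting with user")
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            # Assistant turns are closed with the end-of-sequence token.
            parts.append(f" {message['content']}{eos}")
    return "".join(parts)


prompt = format_mistral_prompt(
    [{"role": "user", "content": "Why is open-source software important?"}]
)
print(prompt)
```

For a single user turn this reproduces the hand-written prompt string used in the quick check above. Every model family needs its own variant of this function, which is exactly the bookkeeping a messages-based API removes.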

## Using the Messages API via the OpenAI SDK

The added support for messages in TGI makes Inference Endpoints directly compatible with the OpenAI Chat Completion API. This means that any existing script that uses OpenAI models via the OpenAI client libraries can be swapped over to use any open-source LLM running on a TGI endpoint!

The example below shows how to make this transition and stream responses from our Inference Endpoint. Simply replace the `base_url` with your endpoint URL (be sure to include the `v1/` suffix) and populate the `api_key` field with a valid Hugging Face user token.
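Since the `v1/` suffix is easy to forget or duplicate, a small helper can normalize the URL first. This is a sketch only: `openai_base_url` is a hypothetical name, and the example URL is a placeholder rather than a real endpoint.

```python
def openai_base_url(endpoint_url):
    """Return `endpoint_url` with exactly one trailing `/v1/` path segment."""
    url = endpoint_url.rstrip("/")
    if url.endswith("/v1"):
        # The suffix is already present; strip it so it is not duplicated.
        url = url[: -len("/v1")]
    return url + "/v1/"


# Placeholder endpoint URL, used only for illustration.
print(openai_base_url("https://my-endpoint.example.cloud"))
```

Plain string handling is used here because URL joining with filesystem helpers such as `os.path.join` depends on the operating system's path separator.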

In [25]:
BASE_URL

'https://y3y8movzdnat3ydy.us-east-1.aws.endpoints.huggingface.cloud'

In [4]:
from openai import OpenAI

# BASE_URL = endpoint.url
BASE_URL = "https://gw38hkbdke51vu7l.us-east-1.aws.endpoints.huggingface.cloud"

# init the client but point it to TGI
client = OpenAI(
    base_url=f"{BASE_URL.rstrip('/')}/v1/",  # build the URL as a string; os.path.join is meant for filesystem paths
    api_key=HF_API_KEY,
)
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "Why is open-source software important?",
        },
    ],
    stream=True,
    max_tokens=500,
)

# iterate and print stream
for message in chat_completion:
    # delta.content can be None (e.g. on the final chunk), so fall back to ""
    print(message.choices[0].delta.content or "", end="")


An open-source software is of great importance because of several reasons:

1. Cost: Open-source software is free, or at least less costly than proprietary software. Since the source code is available, users can download, install, and use the software at no direct cost.

2. Customization: The source code in open-source software is open, meaning developers and users can modify and customize the software to meet specific needs.

3. Security: Since the source code is transparent, developers and users can review it to identify and rectify any security vulnerabilities. Conversely, commonly used proprietary software may hide security weaknesses, making them difficult to detect.

4. Improved reliability: Open-source software tends to be more reliable than proprietary software because more developers, users and companies are actively contributing to its development. This creates a wide and collaborative community that maintains, tests and improves the software.

5. Collaboration: Open-source 
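If you need the full reply as a single string rather than printed fragments, the streamed deltas can be accumulated. The sketch below uses stand-in objects that mimic the `.choices[0].delta.content` attribute path used in the loop above, so it runs without a live endpoint. `collect_stream` and `fake_chunk` are hypothetical helpers.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate the text deltas of a chat-completion stream into one string."""
    pieces = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skip None or empty deltas (e.g. role-only or final chunks)
            pieces.append(delta)
    return "".join(pieces)


def fake_chunk(text):
    # Stand-in with the same attribute path as a streamed chunk.
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


stream = [fake_chunk("Open"), fake_chunk("-source"), fake_chunk(None)]
print(collect_stream(stream))
```

With a real client, the `chat_completion` iterator from the cell above would be passed in place of `stream`.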

In [24]:
from transformers import AutoTokenizer

# model_id = "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO"
model_id = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [35]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is open-source software important ?"},
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# print(prompt)
# generate text
out = endpoint.client.text_generation(prompt, max_new_tokens=500, model=model_id)
print(out)

Open-source software is important for several reasons:

1. Cost-effective: Open-source software is free to use, modify, and distribute, which makes it a cost-effective alternative to proprietary software. This is particularly beneficial for small businesses, startups, and non-profit organizations with limited budgets.

2. Customizable: Since the source code is available, users can modify and customize the software to suit their specific needs. This level of customization is not possible with proprietary software, which can be a significant limitation for some users.

3. Community-driven: Open-source software is developed and maintained by a community of developers who collaborate and share their knowledge. This community-driven approach ensures that the software is constantly being improved and updated, making it more reliable and secure.

4. Transparency: The open-source model promotes transparency, as the source code is available for anyone to see and review. This transparency helps 

In [32]:
print(out)

<|assistant|>
Open-source software is important for several reasons:

1. Cost-effective: Open-source software is free to use, modify, and distribute, which makes it a cost-effective alternative to proprietary software. This is particularly beneficial for small businesses, startups, and organizations with limited budgets.

2. Customizable: Since the source code is available, users can modify and customize the software to suit their specific needs. This level of customization is not possible with proprietary software, which can be a significant limitation for some users.

3. Community-driven: Open-source software is developed and maintained by a community of developers, who contribute to the project voluntarily. This community-driven approach ensures that the software is constantly being improved and updated, and that any issues are addressed promptly.

4. Security: Since the source code is available, security vulnerabilities can be identified and addressed more quickly than with proprie
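Some of the raw generations in this notebook begin with the template's `<|assistant|>` marker. Whether the marker appears depends on the chat template and the server version, so the following is only a defensive post-processing sketch. `strip_role_prefix` is a hypothetical helper.

```python
def strip_role_prefix(text, prefix="<|assistant|>"):
    """Drop a leading role marker and the whitespace that follows it."""
    stripped = text.lstrip()
    if stripped.startswith(prefix):
        return stripped[len(prefix):].lstrip()
    # No marker found: return the text unchanged.
    return text


print(strip_role_prefix("<|assistant|>\nOpen-source software is important for several reasons:"))
```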


## How to use Inference Endpoints with LangChain

In [5]:
import bs4
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

In [6]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model_name="tgi",
    openai_api_key=HF_API_KEY,
    openai_api_base=f"{BASE_URL.rstrip('/')}/v1/",  # build the URL as a string; os.path.join is meant for filesystem paths
)
llm.invoke("Why is open-source software important?")

AIMessage(content="<|assistant|>\nOpen-source software is important for several reasons:\n\n1. It's free: Open-source software is free to use, distribute, and modify. This means that anyone can download, install, and use it without paying any licensing fees.\n\n2. It's customizable: Since the source code is available, users can modify and customize the software to suit their specific needs. This allows for greater flexibility and customization than propriet")

In [7]:
from langchain_core.runnables import RunnableParallel
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load, chunk and index the contents of the blog
loader = WebBaseLoader(
    web_paths=("https://huggingface.co/blog/open-source-llms-as-agents",),
)
docs = loader.load()

# declare an HF embedding model
hf_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")

text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=hf_embeddings)

# Retrieve and generate using the relevant snippets of the blog
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

rag_chain_with_source.invoke(
    "According to this article which open-source model is the best for an agent behaviour?"
)


{'context': [Document(page_content='To overcome this weakness, amongst other approaches, one can integrate the LLM into a system where it can call tools: such a system is called an LLM agent.\nIn this post, we explain the inner workings of ReAct agents, then show how to build them using the ChatHuggingFace class recently integrated in LangChain. Finally, we benchmark several open-source LLMs against GPT-3.5 and GPT-4.', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'}),
  Document(page_content='Since the open-source models were not specifically fine-tuned for calling functions in the given output format, they are at a slight disadvantage compared to the OpenAI agents.\nDespite this, some models perform really well! 💪\nHere’s an example of Mixtral-8x7B 
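The `chunk_size=512, chunk_overlap=200` settings used in the RAG cell above control how much text neighbouring chunks share. The sketch below shows the idea with fixed-size character windows. It is a simplification: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph breaks, which this hypothetical `sliding_chunks` helper does not do.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

A larger overlap makes it less likely that a sentence relevant to the question is cut in half at a chunk boundary, at the cost of indexing more chunks.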

## LlamaIndex

In [8]:
from llama_index.llms import OpenAI, OpenAILike

llm = OpenAILike(
    model="tgi",
    api_key=HF_API_KEY,
    api_base=BASE_URL + "/v1/",
    is_chat_model=True,
    is_local=False,
    is_function_calling_model=False,
    context_window=4096,
)

In [9]:
llm.complete("Why is open-source software important?")

CompletionResponse(text='<|assistant|>\nOpen-source software is important for several reasons:\n\n1. Cost-effective: Open-source software is free to use, modify, and distribute, which makes it a cost-effective alternative to proprietary software. This is particularly beneficial for small businesses, startups, and organizations with limited budgets.\n\n2. Customizable: Since the source code is available, users can modify and customize the software to suit their specific needs. This level', additional_kwargs={}, raw={'id': '', 'choices': [Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='<|assistant|>\nOpen-source software is important for several reasons:\n\n1. Cost-effective: Open-source software is free to use, modify, and distribute, which makes it a cost-effective alternative to proprietary software. This is particularly beneficial for small businesses, startups, and organizations with limited budgets.\n\n2. Customizable: Since the source 

In [10]:
from llama_index.evaluation import (
    DatasetGenerator,
    ResponseEvaluator,
    QueryResponseEvaluator,
)
from llama_index.query_engine import RetrieverQueryEngine
from llama_index import (
    SimpleDirectoryReader,
    ServiceContext,
    LLMPredictor,
    VectorStoreIndex,
    load_index_from_storage,
    StorageContext,
)
from llama_index import download_loader
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.query_engine import CitationQueryEngine


SimpleWebPageReader = download_loader("SimpleWebPageReader")
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://huggingface.co/blog/open-source-llms-as-agents"]
)

# load the BAAI/bge-large-en-v1.5 embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
# pass the previously instantiated llm to your RAG pipeline
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, show_progress=True
)
# play with the similarity_top_k for best results
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=2,
)
response = query_engine.query(
    "According to this article which open-source model is the best for an agent behaviour?"
)
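The `similarity_top_k=2` argument in the cell above limits how many of the most similar chunks are handed to the LLM. As a rough sketch of that selection step, assuming cosine similarity over embedding vectors (the index may use a different metric or an approximate search), with toy two-dimensional vectors and hypothetical helper names:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings: documents 0 and 1 point roughly the same way as the query.
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.0], doc_vecs, k=2))
```

Raising `k` gives the model more context to draw on but also more tokens to process and more chances to include irrelevant passages.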


In [12]:
response.response

"Based on the benchmark provided in the article, Mixtral-8x7B is the best performing open-source model for agent behavior, even outperforming GPT-3.5. However, the authors note that Mixtral was not specifically fine-tuned for agent workflows, and with proper fine-tuning, its performance could potentially be even higher. The article also mentions that GPT-4's performance in agent workflows is not"