# From OpenAI to Open LLMs with Messages API

In [1]:
!pip install --upgrade -q huggingface_hub langchain langchain-community langchainhub langchain-openai llama-index chromadb bs4




In [20]:
import getpass
import os

# enter API key
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_API_KEY = getpass.getpass()

# enter OpenAI key (for RAG embeddings)
os.environ["OPENAI_API_KEY"] = getpass.getpass()

## Create an Inference Endpoint using `huggingface_hub`

The `huggingface_hub` Python library allows you to programmatically create and manage Inference Endpoints in just a few steps. Here, we'll use it to deploy the powerful [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) as an endpoint running on [Text Generation Inference](https://huggingface.co/docs/text-generation-inference/index), our high-performance inference solution for serving LLMs in production.

We need to specify the endpoint name and model repository for the text-generation task. A protected Inference Endpoint means a valid HF token is required to access the deployed API. We also need to configure the hardware requirements like vendor, region, accelerator, instance type, and size. You can check out the list of available resources [here](https://api.endpoints.huggingface.cloud/#get-/v2/provider).

In [4]:
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "mixtral-8x7b-instruct-v0-1-demo",
    repository="mistralai/Mixtral-8x7B-Instruct-v0.1",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_type="p4de",
    instance_size="2xlarge",
    namespace="HF-test-lab",
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_INPUT_LENGTH": "1024",
            "MAX_BATCH_PREFILL_TOKENS": "2048",
            "MAX_TOTAL_TOKENS": "32000",
            "MAX_BATCH_TOTAL_TOKENS": "1024000",
            "MODEL_ID": "/repository",
        },
        # "url": "ghcr.io/huggingface/text-generation-inference:1.4.0",  # must be >= 1.4.0
        "url": "ghcr.io/huggingface/text-generation-inference:sha-ee1cf51",
    },
)

endpoint.wait()
print(endpoint.status)

running
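For intuition about what the `endpoint.wait()` call above is doing, here is a minimal sketch of a blocking poll loop. It is not the `huggingface_hub` implementation: `wait_until_running` and `fetch_status` are hypothetical names, and the status strings other than `"running"` are assumptions used only for illustration.

```python
import time


def wait_until_running(fetch_status, timeout=600.0, poll_interval=5.0):
    """Poll `fetch_status()` until it reports "running", or give up.

    `fetch_status` is a hypothetical callable that returns the current
    status string; in a real setup it would query the endpoint.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status == "running":
            return status
        if status == "failed":  # assumed name for a terminal failure state
            raise RuntimeError("endpoint deployment failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint still '{status}' after {timeout}s")
        time.sleep(poll_interval)


# Simulated status sequence standing in for a real endpoint.
statuses = iter(["pending", "initializing", "running"])
final_status = wait_until_running(lambda: next(statuses), poll_interval=0.0)
print(final_status)
```

The loop checks the status before the deadline so that an endpoint that is already up returns immediately, without sleeping.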


It will take a few minutes for our deployment to spin up. We can use the `.wait()` utility to block the running thread until the endpoint reaches a final "running" state. Once it's running, we can run a quick check to confirm that everything is working as expected.

In [5]:
endpoint.client.text_generation(
    "<s>[INST] Why is open-source so important? [/INST]",
    max_new_tokens=100,
    do_sample=True,
)

' Open-source is important for several reasons:\n\n1. **Collaboration and Innovation:** Open-source software allows for collaboration between developers from all over the world. This collaborative environment can lead to faster innovation and the creation of better software.\n2. **Transparency:** With open-source software, the source code is available for anyone to view and audit. This transparency can lead to more secure and reliable software, as bugs and vulnerabilities can be identified and fixed more'

Great, we now have a working deployment! But notice how we needed to carefully format the prompt according to the model's instruction format? While our [chat templates](https://huggingface.co/docs/transformers/chat_templating) handle all of this nuance, the new Messages API makes things even simpler...
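To make the manual formatting concrete, here is a minimal sketch of how a list of chat messages could be flattened into the `[INST] ... [/INST]` prompt used above. It is a simplified approximation of the model's chat template, not a replacement for it: `format_mistral_prompt` is a hypothetical helper, and the `</s>` end-of-sequence token after assistant turns is an assumption about the template.

```python
BOS = "<s>"   # beginning-of-sequence token, as seen in the prompt above
EOS = "</s>"  # assumed end-of-sequence token closing each assistant turn


def format_mistral_prompt(messages):
    """Flatten alternating user/assistant messages into one prompt string."""
    prompt = BOS
    for index, message in enumerate(messages):
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("roles must alternate user/assistant, starting with user")
        if message["role"] == "user":
            prompt += f"[INST] {message['content']} [/INST]"
        else:
            prompt += f"{message['content']}{EOS}"
    return prompt


prompt = format_mistral_prompt(
    [{"role": "user", "content": "Why is open-source so important?"}]
)
print(prompt)
```

With a single user message, this reproduces the hand-written prompt passed to `text_generation` earlier. Sending structured messages instead moves this bookkeeping to the server.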

## Using the Messages API via the OpenAI SDK

The added support for messages in Text Generation Inference (TGI) makes Inference Endpoints directly compatible with the OpenAI Chat Completion API. This means that any existing script that uses OpenAI models via the OpenAI client libraries can be switched over to any open-source LLM running on a TGI endpoint!

The example below shows how to make this transition and stream responses from our Inference Endpoint. Simply replace the `base_url` with your endpoint URL (be sure to include the `v1/` suffix) and populate the `api_key` field with a valid Hugging Face user token.
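Since the `v1/` suffix is easy to get wrong, a small helper can build the base URL in one place. This is a sketch: `messages_api_base_url` is a hypothetical name, and the URL in the demo is the endpoint URL printed in this notebook. Plain string handling is used instead of `os.path.join`, which is meant for filesystem paths and joins with a backslash on Windows.

```python
def messages_api_base_url(endpoint_url):
    """Return the endpoint URL with exactly one trailing `/v1/` suffix."""
    return endpoint_url.rstrip("/") + "/v1/"


base = messages_api_base_url(
    "https://tv4qjkt9fpevpa9b.us-east-1.aws.endpoints.huggingface.cloud"
)
print(base)
```

The helper gives the same result whether or not the input already ends with a slash.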

In [32]:
BASE_URL

'https://tv4qjkt9fpevpa9b.us-east-1.aws.endpoints.huggingface.cloud'

In [36]:
from openai import OpenAI

BASE_URL = endpoint.url

# init the client but point it to TGI
# (string formatting keeps the URL separator a forward slash on every OS)
client = OpenAI(
    base_url=f"{BASE_URL.rstrip('/')}/v1/",
    api_key=HF_API_KEY,
)
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "user", "content": "Why is open-source software important?"},
    ],
    stream=True,
    max_tokens=500,
)

# iterate and print stream
for message in chat_completion:
    print(message.choices[0].delta.content, end="")

 Open-source software (OSS) is important for several reasons:

1. Cost-effectiveness: OSS is typically free to use, which can help organizations save money on software licenses. This is especially important for small businesses and startups that may have limited budgets.
2. Flexibility: OSS can be customized to meet specific needs, which can be difficult or impossible with proprietary software. This allows organizations to tailor the software to their workflows and processes.
3. Transparency: Because the source code is openly available, OSS is more transparent than proprietary software. This means that users can inspect the code to ensure that it meets their security and privacy requirements, and they can also contribute improvements and fixes.
4. Community support: OSS often has a large and active community of developers and users who can provide support and assistance. This can be especially helpful for users who are new to a particular software product or who are trying to solve a c
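When a streamed response needs to be kept rather than printed, the chunks can be accumulated into a single string. The sketch below assumes that some chunks may carry a delta whose `content` is `None` (the OpenAI API does this for the closing chunk; whether a given TGI version does is not confirmed here), so it skips those. `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only mimic the attribute layout used in the loop above.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a chat-completion stream into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # nothing to read from this chunk
        content = chunk.choices[0].delta.content
        if content:  # skip None and empty strings
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (demo only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_stream = [fake_chunk(" Open"), fake_chunk("-source"), fake_chunk(None)]
full_text = collect_stream(demo_stream)
print(full_text)
```

In the notebook, `collect_stream(chat_completion)` would take the place of the printing loop.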

## Using Inference Endpoints with LangChain

In [30]:
import bs4
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

In [35]:
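from langchain_community.chat_models.openai import ChatOpenAI
# import only the embeddings class: importing `OpenAI` from `langchain_openai`
# would shadow the `openai.OpenAI` client imported earlier
from langchain_openai import OpenAIEmbeddings

llm = ChatOpenAI(
    model_name="tgi",
    openai_api_key=HF_API_KEY,
    openai_api_base=f"{BASE_URL.rstrip('/')}/v1/",
)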
llm.invoke("Why is open-source software important?")

AIMessage(content=' Open-source software (OSS) is important for several reasons:\n\n1. Cost-effectiveness: OSS is typically free to use, which can help organizations save on software licensing costs.\n2. Flexibility: OSS can be customized to meet specific needs, which can be particularly important for organizations with unique requirements.\n3. Innovation: OSS is developed and maintained by a community of developers, which can lead to rapid innovation and feature development.\n')

In [37]:
# Load, chunk and index the contents of the blog.

loader = WebBaseLoader(
    web_paths=("https://huggingface.co/blog/open-source-llms-as-agents",),
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Retrieve and generate using the relevant snippets of the blog.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
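The `chunk_size=512, chunk_overlap=200` settings above control how much text neighbouring chunks share. As a rough illustration only, here is a naive fixed-window splitter. It is not the separator-aware algorithm that `RecursiveCharacterTextSplitter` implements, and `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size=512, chunk_overlap=200):
    """Split `text` into fixed-size windows that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(1000))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

With these settings each window starts 312 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. A larger overlap preserves more context across chunk boundaries at the cost of storing and embedding more duplicated text.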

In [38]:
from langchain_core.runnables import RunnableParallel

rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

rag_chain_with_source.invoke(
    "According to this article which open-source model is the best for an agent behaviour?"
)

{'context': [Document(page_content='As you can see, some open-source models do not perform well in powering agent workflows: while this was expected for the small Zephyr-7b, Llama2-70b performs surprisingly poorly.\n👉 But Mixtral-8x7B performs really well: it even beats GPT-3.5! 🏆', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'}),
  Document(page_content='As you can see, some open-source models do not perform well in powering agent workflows: while this was expected for the small Zephyr-7b, Llama2-70b performs surprisingly poorly.\n👉 But Mixtral-8x7B performs really well: it even beats GPT-3.5! 🏆', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No la