# Hugging Face Inference Endpoints x Langchain Chat Models


Open-source LLMs are becoming strong general-purpose agents. This notebook demonstrates how to use open-source LLMs as chat models within Langchain, so that they can be used and experimented with in agent-based pipelines.

In [53]:
!pip install -q transformers langchain text-generation langchain-experimental python-dotenv jinja2


In [1]:
from dotenv import load_dotenv

In [2]:
# load env vars for LangSmith
load_dotenv(override=True)

True

## First, let's see how the Llama-2 Chat wrapper works

This section builds upon this [Langchain integration](https://python.langchain.com/docs/integrations/chat/llama2_chat)

In [3]:
import os

from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.llms import HuggingFaceTextGenInference
from langchain_experimental.chat_models import Llama2Chat

In [4]:
# model = "https://b64oqapulf4lv8w1.us-east-1.aws.endpoints.huggingface.cloud"
model = "https://zsg12cshlzvfl3l3.us-east-1.aws.endpoints.huggingface.cloud"
hf_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")

llm = HuggingFaceTextGenInference(
    inference_server_url=model,
    max_new_tokens=512,
    top_k=50,
    temperature=0.1,
    repetition_penalty=1.03,
    server_kwargs={
        "headers": {
            "Authorization": f"Bearer {hf_token}",
            "Content-Type": "application/json",
        }
    },
)

model = Llama2Chat(llm=llm)




### Test Chat Model

In [29]:
from langchain.schema import SystemMessage, HumanMessage, AIMessage

messages = [
    SystemMessage(content="You're a helpful assistant"),
    HumanMessage(content="What is the purpose of model regularization?"),
]

In [30]:
model.invoke(messages)

AIMessage(content="  Model regularization is a technique used in machine learning to prevent overfitting, improve generalization, and promote interpretability of a model. Overfitting occurs when a model is trained too well on the training data and fails to generalize well to new, unseen data. This can result in poor performance on the test data.\n\nThe purpose of model regularization is to add a penalty term to the loss function that discourages the model from fitting the training data too closely. By adding this penalty term, the model is forced to find a simpler solution that generalizes better to new data.\n\nThere are several types of regularization techniques, including:\n\n1. L1 Regularization (Lasso): This adds a penalty term to the loss function based on the absolute value of the model's weights. L1 regularization tends to produce sparse models, where some of the weights are set to zero.\n2. L2 Regularization (Ridge): This adds a penalty term to the loss function based on the s

In [31]:
model._to_chat_prompt(messages)

"<s>[INST] <<SYS>>\nYou're a helpful assistant\n<</SYS>>\n\nWhat is the purpose of model regularization? [/INST]"
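The string above shows the Llama-2 chat layout for one system message and one user message. As a rough illustration, that single-turn case can be rebuilt by hand. The helper name `to_llama2_prompt` is hypothetical, and this is not how `Llama2Chat` is implemented; it is only a sketch that reproduces the output shown for this particular input.

```python
def to_llama2_prompt(system: str, user: str) -> str:
    """Hypothetical helper: build a single-turn Llama-2 chat prompt."""
    # The system text sits inside <<SYS>> tags within the first [INST] block,
    # followed by a blank line and then the user text.
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


prompt = to_llama2_prompt(
    "You're a helpful assistant",
    "What is the purpose of model regularization?",
)
print(repr(prompt))
```

Multi-turn conversations need additional `[INST] ... [/INST]` blocks, which this sketch deliberately leaves out.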

### Test in a LLMChain

In [27]:
# ruff: noqa: E402

from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
)

template_messages = [
    SystemMessage(content="You are a helpful assistant."),
    MessagesPlaceholder(variable_name="chat_history"),
    HumanMessagePromptTemplate.from_template("{text}"),
]
prompt_template = ChatPromptTemplate.from_messages(template_messages)

In [28]:
prompt_template

ChatPromptTemplate(input_variables=['chat_history', 'text'], input_types={'chat_history': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]]}, messages=[SystemMessage(content='You are a helpful assistant.'), MessagesPlaceholder(variable_name='chat_history'), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['text'], template='{text}'))])

In [12]:
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
chain = LLMChain(llm=model, prompt=prompt_template, memory=memory)

In [14]:
out = chain.run(
    text="What can I see in Vienna? Propose a few locations. Names only, no details."
)

In [15]:
print(out)

  Certainly! Vienna is a beautiful city with a rich history and culture, offering countless attractions to explore. Here are some of the top locations to consider visiting:

1. Schönbrunn Palace
2. St. Stephen's Cathedral
3. Hofburg Palace
4. Belvedere Palace
5. Prater Park
6. MuseumsQuartier
7. Ringstrasse
8. Vienna State Opera
9. Albertina Museum
10. Rathausplatz


In [16]:
out = chain.run(text="Tell me more about #7")

In [17]:
print(out)

  Of course! The seventh location I mentioned is the Ringstrasse, which is a grand boulevard that encircles the historic center of Vienna. The Ringstrasse is home to many of the city's most famous landmarks and attractions, including:

1. Vienna State Opera: A world-renowned opera house known for its opulent architecture and high-quality productions.
2. Parliament Building: A grand neo-Gothic structure that serves as the seat of Austria's federal government.
3. Town Hall: A beautiful building with a stunning clock tower that offers panoramic views of the city.
4. Imperial Palace: A former royal residence that now houses several museums and event spaces.
5. Museum of Natural History: A popular museum featuring exhibits on natural history, including dinosaurs, fossils, and minerals.
6. Museum of Art History: A museum showcasing works of art from the Middle Ages to the present day.
7. Vienna University: A prestigious university with a beautiful neo-Renaissance main building.
8. Augarten P
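The follow-up question "Tell me more about #7" only works because the chain's memory feeds the earlier turns back into the prompt through the `chat_history` placeholder. The idea can be sketched without any LangChain code. Everything here (`BufferMemorySketch`, `echo_model`) is a hypothetical stand-in, and the toy model only reports how many messages it received.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class BufferMemorySketch:
    """Hypothetical, simplified stand-in for a conversation buffer."""

    def __init__(self) -> None:
        self.history: List[Message] = []

    def run(self, text: str, model: Callable[[List[Message]], str]) -> str:
        # Prompt = system message + all earlier turns + the new user turn.
        prompt = (
            [{"role": "system", "content": "You are a helpful assistant."}]
            + self.history
            + [{"role": "user", "content": text}]
        )
        reply = model(prompt)
        # Store both sides of the turn so later calls can refer back to them.
        self.history.append({"role": "user", "content": text})
        self.history.append({"role": "assistant", "content": reply})
        return reply


def echo_model(prompt: List[Message]) -> str:
    """Toy model: reports how many messages it received."""
    return f"saw {len(prompt)} messages"


memory = BufferMemorySketch()
first = memory.run("What can I see in Vienna?", echo_model)
second = memory.run("Tell me more about #7", echo_model)
print(first, "|", second)
```

The second call sees the system message, the first question, the first answer, and the new question, which is what lets a real model resolve "#7".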

## Now let's build our own for `HuggingFaceH4/zephyr-7b-beta`

In [7]:
from transformers import AutoTokenizer

model_id = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [55]:
system_input = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\\n\\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
user_input = "How many helicopters can a human eat in one sitting?"
messages = [
    {"role": "system", "content": system_input},
    {"role": "user", "content": user_input},
]
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

"<|system|>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\\n\\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.</s>\n<|user|>\nHow many helicopters can a human eat in one sitting?</s>\n<|assistant|>\n"

In [23]:
print(tokenizer.chat_template)

{% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<|user|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'system' %}
{{ '<|system|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'assistant' %}
{{ '<|assistant|>
'  + message['content'] + eos_token }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|assistant|>' }}
{% endif %}
{% endfor %}
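To make the template easier to read, here is a plain-Python sketch that yields the same layout as the rendered prompts shown in this notebook. The function name `render_zephyr_prompt` is hypothetical, and the `"</s>"` default for `eos_token` is an assumption taken from those rendered outputs. The real rendering is done by `tokenizer.apply_chat_template`; this only approximates it.

```python
from typing import Dict, List


def render_zephyr_prompt(
    messages: List[Dict[str, str]],
    add_generation_prompt: bool = True,
    eos_token: str = "</s>",
) -> str:
    """Hypothetical helper mimicking the rendered Zephyr chat template."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ("system", "user", "assistant"):
            # The template only has branches for these three roles.
            continue
        # Each turn: role tag, newline, content, end-of-sequence token, newline.
        parts.append(f"<|{role}|>\n{message['content']}{eos_token}\n")
    if add_generation_prompt and messages:
        # An open assistant tag asks the model to produce the next turn.
        parts.append("<|assistant|>\n")
    return "".join(parts)


example = render_zephyr_prompt(
    [
        {"role": "system", "content": "You're a helpful assistant"},
        {"role": "user", "content": "What is the purpose of model regularization?"},
    ]
)
print(repr(example))
```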


In [45]:
t = messages[1]

In [48]:
t.dict()

{'content': 'What is the purpose of model regularization?',
 'additional_kwargs': {},
 'type': 'human',
 'example': False}

In [57]:
type(tokenizer)

transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast

In [103]:
llm.inference_server_url

'https://zsg12cshlzvfl3l3.us-east-1.aws.endpoints.huggingface.cloud'

In [104]:
from huggingface_hub import get_inference_endpoint, list_inference_endpoints

In [114]:
endpoint = list_inference_endpoints()[0]

In [118]:
endpoint.repository

'HuggingFaceH4/zephyr-7b-beta'

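In [None]:
# get_inference_endpoint requires the endpoint name as its first argument;
# none is given here, so this call raises a TypeError as written.
endpoint = get_inference_endpoint()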

In [27]:
from typing import Any, List, Optional, Union

from huggingface_hub import list_inference_endpoints
from langchain.callbacks.manager import (
    AsyncCallbackManagerForLLMRun,
    CallbackManagerForLLMRun,
)
from langchain.chat_models.base import BaseChatModel
from langchain.llms.base import LLM
from langchain.schema import (
    AIMessage,
    BaseMessage,
    ChatGeneration,
    ChatResult,
    HumanMessage,
    LLMResult,
    SystemMessage,
)

DEFAULT_SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant."""


class HFInferenceEndpointWrapper(BaseChatModel):
    """
    Wrapper for using HuggingFaceTextGenInference LLM as a ChatModel.

    Adapted from: https://python.langchain.com/docs/integrations/chat/llama2_chat
    """

    llm: LLM
    tokenizer: Any
    system_message: SystemMessage = SystemMessage(content=DEFAULT_SYSTEM_PROMPT)

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Check up front that the server URL maps to an endpoint this token can see.
        self._resolve_model_id()

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        llm_input = self._to_chat_prompt(messages)
        llm_result = self.llm._generate(
            prompts=[llm_input], stop=stop, run_manager=run_manager, **kwargs
        )
        return self._to_chat_result(llm_result)

    async def _agenerate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        llm_input = self._to_chat_prompt(messages)
        llm_result = await self.llm._agenerate(
            prompts=[llm_input], stop=stop, run_manager=run_manager, **kwargs
        )
        return self._to_chat_result(llm_result)

    def _to_chat_prompt(
        self,
        messages: List[BaseMessage],
    ) -> str:
        """Convert a list of messages into a prompt format expected by wrapped LLM."""
        if not messages:
            raise ValueError("at least one HumanMessage must be provided")

        # Fall back to the default system prompt when none is supplied.
        if not isinstance(messages[0], SystemMessage):
            messages = [self.system_message] + messages

        # Guard the index so a lone SystemMessage gives a clear error.
        if len(messages) < 2 or not isinstance(messages[1], HumanMessage):
            raise ValueError(
                "messages list must start with a SystemMessage or HumanMessage"
            )

        if not isinstance(messages[-1], HumanMessage):
            raise ValueError("last message must be a HumanMessage")

        messages = [self._to_chatml_format(m) for m in messages]

        return self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

    def _to_chatml_format(
        self, message: Union[AIMessage, SystemMessage, HumanMessage]
    ) -> dict:
        """Convert LangChain message to ChatML format."""

        if isinstance(message, SystemMessage):
            role = "system"
        elif isinstance(message, AIMessage):
            role = "assistant"
        elif isinstance(message, HumanMessage):
            role = "user"
        else:
            raise ValueError(f"Unknown message type: {type(message)}")

        return {"role": role, "content": message.content}

    @staticmethod
    def _to_chat_result(llm_result: LLMResult) -> ChatResult:
        chat_generations = []

        for g in llm_result.generations[0]:
            chat_generation = ChatGeneration(
                message=AIMessage(content=g.text), generation_info=g.generation_info
            )
            chat_generations.append(chat_generation)

        return ChatResult(
            generations=chat_generations, llm_output=llm_result.llm_output
        )

    def _resolve_model_id(self):
        """Resolve the model id for the given inference server url"""
        available_endpoints = list_inference_endpoints("*")

        for endpoint in available_endpoints:
            if endpoint.url == self.llm.inference_server_url:
                return endpoint.repository

        # Raise only after every endpoint has been checked without a match.
        raise ValueError(
            f"Could not find model id for inference server provided: {self.llm.inference_server_url}. Check to ensure the HF token you're using has access to the endpoint."
        )

    @property
    def _llm_type(self) -> str:
        return "hf-ie-style"
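A lookup like `_resolve_model_id` is easiest to reason about when the error is raised only after every endpoint has been compared against the server URL. The sketch below isolates that logic. `FakeEndpoint`, `resolve_model_id`, and the example URLs are hypothetical stand-ins, not part of `huggingface_hub`; only the `url` and `repository` attribute names are taken from the class above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeEndpoint:
    """Hypothetical stand-in exposing the two attributes the lookup reads."""

    url: str
    repository: str


def resolve_model_id(endpoints: List[FakeEndpoint], server_url: str) -> str:
    """Return the repository whose endpoint URL matches server_url."""
    for endpoint in endpoints:
        if endpoint.url == server_url:
            return endpoint.repository
    # Reached only when no endpoint matched.
    raise ValueError(f"Could not find model id for inference server: {server_url}")


endpoints = [
    FakeEndpoint("https://a.example", "org/model-a"),
    FakeEndpoint("https://b.example", "org/model-b"),
]
print(resolve_model_id(endpoints, "https://b.example"))
```

Raising inside the loop body would reject any URL that does not match the first endpoint in the list, even when a later entry matches.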

In [28]:
messages = [
    SystemMessage(content="You're a helpful assistant"),
    HumanMessage(content="What is the purpose of model regularization?"),
]

model = HFInferenceEndpointWrapper(llm=llm, tokenizer=tokenizer)

In [29]:
model._to_chat_prompt(messages)

"<|system|>\nYou're a helpful assistant</s>\n<|user|>\nWhat is the purpose of model regularization?</s>\n<|assistant|>\n"

In [30]:
model.invoke(messages)

AIMessage(content="Model regularization is a technique used in machine learning to prevent overfitting, which occurs when a model is trained too well on a limited dataset and fails to generalize well to new, unseen data. Overfitting occurs when a model learns the noise or random fluctuations in the training data rather than the underlying patterns. By adding a penalty term to the loss function, regularization encourages the model to find a simpler solution that generalizes better to new data.\n\nThere are several types of regularization, including:\n\n1. L1 regularization (Lasso): This adds a penalty term to the loss function based on the absolute value of the model's weights. L1 regularization tends to produce sparse models, where some weights are set to zero.\n2. L2 regularization (Ridge): This adds a penalty term to the loss function based on the square of the model's weights. L2 regularization tends to produce models with smaller weights.\n3. Dropout regularization: This is a type 