# Hugging Face Inference Endpoints x LangChain Chat Models


Open-source LLMs are becoming strong general-purpose agents. This notebook demonstrates how to use open-source LLMs as chat models via [Hugging Face Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) with [LangChain's ChatModel abstraction](https://python.langchain.com/docs/modules/model_io/chat/), so that they can be used and experimented with in agent-based workflows.

In particular, we will:
1. Use the [HuggingFaceTextGenInference](https://python.langchain.com/docs/integrations/llms/huggingface_textgen_inference) integration to call Inference Endpoints that serve LLMs via [Text Generation Inference (TGI)](https://huggingface.co/docs/text-generation-inference/index).
2. Create a wrapper around the `BaseChatModel` class that interfaces between LangChain's [Chat Messages](https://python.langchain.com/docs/modules/model_io/chat/#messages) and the hosted LLM by leveraging [Hugging Face's Chat Templates](https://huggingface.co/docs/transformers/chat_templating).
3. Use an open-source LLM to power a `ChatAgent` pipeline.



> Note: To run this notebook, you'll need to have:
> - An LLM deployed via a Hugging Face Inference Endpoint (the LLM must have a `chat_template` defined in its `tokenizer_config.json`)
> - A Hugging Face token with access to the deployed endpoint, saved as an environment variable: `HUGGINGFACEHUB_API_TOKEN`
> - A SerpAPI key saved as an environment variable: `SERPAPI_API_KEY`
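
Before running the rest of the notebook, it can help to confirm that both environment variables are actually set. The snippet below is a minimal sketch: the helper name `missing_env_vars` is hypothetical (it is not part of LangChain or `huggingface_hub`), and it only checks that the variables exist, not that the token or key is valid.

```python
import os
from typing import List, Mapping

# The two variables named in the note above.
REQUIRED_ENV_VARS = ["HUGGINGFACEHUB_API_TOKEN", "SERPAPI_API_KEY"]


def missing_env_vars(
    names: List[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]


# Report anything that still needs to be configured (e.g. in your .env file).
missing = missing_env_vars(REQUIRED_ENV_VARS)
print("Missing environment variables:", missing or "none")
```

If you keep the keys in a `.env` file, run this check after the `load_dotenv` cell in the Setup section.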


## Setup

In [None]:
!pip install -q transformers langchain langchain-experimental text-generation python-dotenv jinja2 google-search-results langchainhub numexpr

In [2]:
from dotenv import load_dotenv

load_dotenv(override=True)

True

## 1. Instantiate an LLM with `HuggingFaceTextGenInference`

You'll need a running Inference Endpoint. Replace `ENDPOINT_URL` below with the URL of your own endpoint.

In [None]:
import os
from langchain.llms import HuggingFaceTextGenInference

ENDPOINT_URL = "https://b64oqapulf4lv8w1.us-east-1.aws.endpoints.huggingface.cloud"
HF_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")

llm = HuggingFaceTextGenInference(
    inference_server_url=ENDPOINT_URL,
    max_new_tokens=512,
    top_k=50,
    temperature=0.1,
    repetition_penalty=1.03,
    server_kwargs={
        "headers": {
            "Authorization": f"Bearer {HF_TOKEN}",
            "Content-Type": "application/json",
        }
    },
)

## 2. Create a wrapper for `BaseChatModel` to apply chat templates

In [None]:
from typing import Any, List, Optional, Union

from transformers import AutoTokenizer
from huggingface_hub import list_inference_endpoints
from langchain.callbacks.manager import (
    AsyncCallbackManagerForLLMRun,
    CallbackManagerForLLMRun,
)
from langchain.chat_models.base import BaseChatModel
from langchain.llms.base import LLM
from langchain.schema import (
    AIMessage,
    BaseMessage,
    ChatGeneration,
    ChatResult,
    HumanMessage,
    LLMResult,
    SystemMessage,
)

DEFAULT_SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant."""


class HFInferenceEndpointChatWrapper(BaseChatModel):
    """
    Wrapper for using HuggingFaceTextGenInference LLM as a ChatModel.

    Upon instantiating this class, the model_id is resolved from the inference_server_url provided to the LLM,
    and the appropriate tokenizer is loaded from the HuggingFace Hub.

    Adapted from: https://python.langchain.com/docs/integrations/chat/llama2_chat
    """

    llm: LLM
    tokenizer: Any
    system_message: SystemMessage = SystemMessage(content=DEFAULT_SYSTEM_PROMPT)
    model_id: Optional[str] = None

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._resolve_model_id()
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        llm_input = self._to_chat_prompt(messages)
        llm_result = self.llm._generate(
            prompts=[llm_input], stop=stop, run_manager=run_manager, **kwargs
        )
        return self._to_chat_result(llm_result)

    async def _agenerate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        llm_input = self._to_chat_prompt(messages)
        llm_result = await self.llm._agenerate(
            prompts=[llm_input], stop=stop, run_manager=run_manager, **kwargs
        )
        return self._to_chat_result(llm_result)

    def _to_chat_prompt(
        self,
        messages: List[BaseMessage],
    ) -> str:
        """Convert a list of messages into a prompt format expected by wrapped LLM."""
        if not messages:
            raise ValueError("at least one HumanMessage must be provided")

        if not isinstance(messages[0], SystemMessage):
            messages = [self.system_message] + messages

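        # Guard against a list that contains only a SystemMessage (avoids an IndexError).
        if len(messages) < 2 or not isinstance(messages[1], HumanMessage):
            raise ValueError(
                "messages list must start with a SystemMessage or HumanMessage"
            )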

        if not isinstance(messages[-1], HumanMessage):
            raise ValueError("last message must be a HumanMessage")

        messages = [self._to_chatml_format(m) for m in messages]

        return self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

    def _to_chatml_format(
        self, message: Union[AIMessage, SystemMessage, HumanMessage]
    ) -> dict:
        """Convert LangChain message to ChatML format."""

        if isinstance(message, SystemMessage):
            role = "system"
        elif isinstance(message, AIMessage):
            role = "assistant"
        elif isinstance(message, HumanMessage):
            role = "user"
        else:
            raise ValueError(f"Unknown message type: {type(message)}")

        return {"role": role, "content": message.content}

    @staticmethod
    def _to_chat_result(llm_result: LLMResult) -> ChatResult:
        chat_generations = []

        for g in llm_result.generations[0]:
            chat_generation = ChatGeneration(
                message=AIMessage(content=g.text), generation_info=g.generation_info
            )
            chat_generations.append(chat_generation)

        return ChatResult(
            generations=chat_generations, llm_output=llm_result.llm_output
        )

    def _resolve_model_id(self):
        """Resolve the model_id from the LLM's inference_server_url"""
        available_endpoints = list_inference_endpoints("*")

        for endpoint in available_endpoints:
            if endpoint.url == self.llm.inference_server_url:
                self.model_id = endpoint.repository
                break

        if not self.model_id:
            raise ValueError(
                "Could not find model id for inference server provided: "
                f"{self.llm.inference_server_url}. "
                "Check to ensure the HF token you're using has access to the endpoint."
            )

    @property
    def _llm_type(self) -> str:
        return f"{self.model_id.lower()}-style"

Instantiate the chat model and define some messages to pass to it.

In [5]:
messages = [
    SystemMessage(content="You're a helpful assistant"),
    HumanMessage(
        content="What happens when an unstoppable force meets an immovable object?"
    ),
]

chat_model = HFInferenceEndpointChatWrapper(llm=llm)

Inspect which model (and therefore which chat template) is being used.

In [6]:
chat_model.model_id

'HuggingFaceH4/zephyr-7b-beta'

Inspect how the chat messages are formatted for the LLM call.

In [7]:
chat_model._to_chat_prompt(messages)

"<|system|>\nYou're a helpful assistant</s>\n<|user|>\nWhat happens when an unstoppable force meets an immovable object?</s>\n<|assistant|>\n"

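To make the formatted prompt above less of a black box, here is a minimal pure-Python sketch that reproduces the same string layout. It is an illustration only: the real formatting comes from the Jinja `chat_template` shipped with the model's tokenizer and applied by `apply_chat_template`. The function name `render_zephyr_style` and the hard-coded `</s>` end-of-sequence marker are assumptions inferred from the output shown above.

```python
from typing import Dict, List

EOS = "</s>"  # end-of-sequence marker, as seen in the formatted prompt above


def render_zephyr_style(
    messages: List[Dict[str, str]], add_generation_prompt: bool = True
) -> str:
    """Render role/content dicts in the layout shown in the notebook output."""
    # Each turn becomes: <|role|>, newline, content, EOS marker, newline.
    parts = [f"<|{m['role']}|>\n{m['content']}{EOS}\n" for m in messages]
    if add_generation_prompt:
        # Leave an open assistant turn so the model writes the reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You're a helpful assistant"},
    {
        "role": "user",
        "content": "What happens when an unstoppable force meets an immovable object?",
    },
]
print(repr(render_zephyr_style(example)))
```

Other models use different templates, which is why the wrapper loads the tokenizer that matches the deployed model instead of hard-coding a format like this one.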
Call the model.

In [8]:
res = chat_model.invoke(messages)
print(res.content)

According to the popular idiom, when an unstoppable force meets an immovable object, there is a paradoxical situation where both forces seem to contradict each other's very nature. The force that is absolutely unstoppable should overcome the object that is completely immovable, but in this scenario, both forces are presented as equally powerful and unyielding. This paradox raises philosophical questions about the nature of force, objectivity, and the limits of logic. In reality, such a scenario is impossible, as it defies the laws of physics, and both forces cannot exist simultaneously in the same place and time.


## 3. Take it for a spin!

Here we'll test out `Zephyr-7B-beta` as a zero-shot ReAct Agent. The example below is taken from the [LangChain ReAct agent documentation](https://python.langchain.com/docs/modules/agents/agent_types/react#using-chat-models).

In [9]:
from langchain.agents import load_tools
from langchain.utilities import SerpAPIWrapper
from langchain import hub
from langchain.agents.format_scratchpad import format_log_to_str
from langchain.tools.render import render_text_description
from langchain.agents import AgentExecutor
from langchain.agents.output_parsers import (
    ReActJsonSingleInputOutputParser,
)

Configure the agent with a `react-json` style prompt and access to a search engine and calculator.

In [10]:
# setup tools
tools = load_tools(["serpapi", "llm-math"], llm=llm)

# setup ReAct style prompt
prompt = hub.pull("hwchase17/react-json")
prompt = prompt.partial(
    tools=render_text_description(tools),
    tool_names=", ".join([t.name for t in tools]),
)

# define the agent
chat_model_with_stop = chat_model.bind(stop=["\nObservation"])
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_log_to_str(x["intermediate_steps"]),
    }
    | prompt
    | chat_model_with_stop
    | ReActJsonSingleInputOutputParser()
)

# instantiate AgentExecutor
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
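
Why bind `stop=["\nObservation"]`? In a ReAct loop the observation is supposed to come from the tool, so generation should end before the model starts writing one itself. The sketch below shows the effect of a stop sequence as plain string truncation. It is an illustration under that assumption, not how the endpoint or LangChain implements it, and `truncate_at_stop` is a hypothetical helper.

```python
from typing import List, Optional


def truncate_at_stop(text: str, stop: Optional[List[str]]) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for seq in stop or []:
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# A model left to run freely might invent its own tool result:
raw = 'Action: search for "x"\nObservation: (made-up result)'
print(truncate_at_stop(raw, ["\nObservation"]))
```

With the hallucinated observation removed, the `AgentExecutor` runs the real tool and appends the genuine result to the scratchpad.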

In [11]:
agent_executor.invoke(
    {
        "input": "Who is Leo DiCaprio's girlfriend? What is her current age raised to the 0.43 power?"
    }
)



> Entering new AgentExecutor chain...
Question: Who is Leo DiCaprio's girlfriend? What is her current age raised to the 0.43 power?

Thought: I need to use the Search tool to find out who Leo DiCaprio's current girlfriend is. Then, I can use the Calculator tool to raise her current age to the power of 0.43.

Action:
```
{
  "action": "Search",
  "action_input": "leo dicaprio girlfriend"
}
```
[0m[36;1m[1;3mLeonardo DiCaprio looked typically understated as he stepped out in London with his girlfriend Vittoria Ceretti and her family on Thursday.[0m[32;1m[1;3mNow, let's find out Vittoria Ceretti's current age.

Action:
```
{
  "action": "Search",
  "action_input": "vittoria ceretti age"
}
```
[0m[36;1m[1;3m25 years[0m[32;1m[1;3mNow, let's use the Calculator tool to raise Vittoria Ceretti's age to the power of 0.43.

Action:
```
{
  "action": "Calculator",
  "action_input": "25^0.43"
}
```
[0m[33;1m[1;3mAnswer: 3.991298452658078[0m[32;1m[1;3mFinal A

{'input': "Who is Leo DiCaprio's girlfriend? What is her current age raised to the 0.43 power?",
 'output': "Leo DiCaprio's current girlfriend is Vittoria Ceretti, and when her age of 25 is raised to the power of 0.43, it equals approximately 3.9913."}

Wahoo! Our open-source 7B-parameter Zephyr model was able to:

1. Plan out a series of actions: `I need to use the Search tool to find out who Leo DiCaprio's current girlfriend is. Then, I can use the Calculator tool to raise her current age to the power of 0.43.`
2. Then execute a search using the SerpAPI tool to find who Leo DiCaprio's current girlfriend is
3. Execute another search to find her age
4. And finally use a calculator tool to calculate her age raised to the power of 0.43
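
Each action in the trace above is a small JSON object inside a fenced block, which the output parser turns into a tool call. The snippet below is a rough sketch of that idea using only the standard library. It is not the implementation of `ReActJsonSingleInputOutputParser`; `parse_action` and `FENCE` are hypothetical names used for illustration.

```python
import json
import re
from typing import Any, Dict

# Three backticks, built programmatically so this example contains no literal fence.
FENCE = "`" * 3

# Match a JSON object wrapped in a fenced block, with an optional "json" tag.
ACTION_BLOCK = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)


def parse_action(llm_output: str) -> Dict[str, Any]:
    """Extract the first fenced JSON action from a ReAct-style completion."""
    match = ACTION_BLOCK.search(llm_output)
    if match is None:
        raise ValueError("no fenced JSON action found in the model output")
    return json.loads(match.group(1))


sample = (
    "Now, let's use the Calculator tool.\n\nAction:\n"
    + FENCE
    + '\n{\n  "action": "Calculator",\n  "action_input": "25^0.43"\n}\n'
    + FENCE
)
action = parse_action(sample)
print(action["action"], "->", action["action_input"])
```

The executor then looks up the tool named in `action` and passes it `action_input`. For the final step that is the calculator evaluating `25^0.43`.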

I'm excited to see how far open-source LLMs can go as general-purpose reasoning agents. Give it a try yourself!