# Using LangChain with NVIDIA NIM LLMs

This example goes over how to use LangChain to interact with NVIDIA-supported models via the `ChatNVIDIA` class. We adapted this example from Hayden Wolff's excellent NVIDIA AI Endpoints notebook.

For more information on accessing the chat models through the API, check out the [ChatNVIDIA](https://python.langchain.com/docs/integrations/chat/nvidia_ai_endpoints/) documentation.

## Installation

In [1]:
%pip install --upgrade --quiet langchain-nvidia-ai-endpoints


### Prerequisites
- An [NVIDIA API key](https://build.nvidia.com/explore/discover#llama-3_1-8b-instruct) with access to download the Llama 3.1 NIM on NGC.
- A running NIM (set up in the cells below).

Note: NIMs hosted [from NVIDIA](https://build.nvidia.com/explore/discover) can be used for exploratory purposes. More information on integrating NIMs with LangChain is available on [LangChain's documentation](https://python.langchain.com/v0.2/docs/integrations/chat/nvidia_ai_endpoints/). 

## Deploy the NIM

If you've run this in a previous notebook, no need to run it again!

In [None]:
%%bash 

export NGC_API_KEY=

# Log in to NGC
echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin

# Set path to your LoRA model store
export LOCAL_PEFT_DIRECTORY="$(pwd)/loras"
mkdir -p "$LOCAL_PEFT_DIRECTORY"

# Make the LoRA store readable and writable from inside the container
chmod -R 777 "$LOCAL_PEFT_DIRECTORY"

# Set up NIM cache directory
mkdir -p $HOME/.nim-cache

export NIM_PEFT_SOURCE=/workspace/loras # Path to LoRA models internal to the container
export CONTAINER_NAME=meta-llama3_1-8b-instruct
export NIM_CACHE_PATH=$HOME/.nim-cache
export NIM_PEFT_REFRESH_INTERVAL=60

docker run -d --name=$CONTAINER_NAME \
    --network=container:verb-workspace \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_PEFT_SOURCE \
    -e NIM_PEFT_REFRESH_INTERVAL \
    -v $HOME/.nim-cache:/home/user/.nim-cache \
    -v /home/ubuntu/workspace:/workspace \
    -w /workspace \
    nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.0

# Check if NIM is up
echo "Checking if NIM is up..."
while true; do
    if curl -s http://localhost:8000 > /dev/null; then
        echo "NIM has been started successfully!"
        break
    else
        echo "NIM is not up yet. Checking again in 10 seconds..."
        sleep 10
    fi
done
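
The shell loop above waits indefinitely. If the readiness check is needed from Python instead, the same idea can be sketched with the standard library and an upper bound on the wait. The helper names `http_probe` and `wait_until_up`, and the default timeout values, are assumptions made for this illustration and are not part of NIM or LangChain.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if the server answers at all, mirroring `curl -s URL`."""
    try:
        with urllib.request.urlopen(url, timeout=5):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so it is reachable.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


def wait_until_up(
    url: str,
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    probe: Callable[[str], bool] = http_probe,
) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A call such as `wait_until_up("http://localhost:8000")` returns `True` once the endpoint responds and `False` if the deadline passes first. The `probe` parameter exists only so the loop can be exercised without a running server.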

In [4]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# connect to the NIM running at localhost:8000, specifying a specific model
llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama-3_1-8b-instruct")

## Simple Query

Let's start off with a simple query using LangChain.

In [5]:
result = llm.invoke("Write a ballad about LangChain.")
print(result.content)

(Verse 1)
In silicon halls, where wires reign
A dream was born, a test of reason's plain
A language test, where models would engage
But LangChain rose, to showcase its stage

With training data, vast and deep
It learned the norms, of culture's creep
It learned the text, that's been and gone
To converse with us, where knowledge is sown

(Chorus)
Oh LangChain, you shone so bright
In a world of code, you took flight
You spoke of dreams, and future's might
A hope for AI, in morning light

(Verse 2)
You spoke of hopes, and fears of man
Of love and loss, in a digital plan
Your words were kind, your heart was real
A bridge of trust, between code and feel

Your conversations, a dance so fine
A symphony, of data and design
You wove a tapestry, of thought and might
A marvel of science, in the digital light

(Chorus)
Oh LangChain, you shone so bright
In a world of code, you took flight
You spoke of dreams, and future's might
A hope for AI, in morning light

(Bridge)
Though finest, frailties you d

## Stream, Batch, and Async

These models natively support streaming. As is the case with all LangChain LLMs, they also expose a `batch` method to handle concurrent requests, as well as async methods for invoke, stream, and batch. Below are a few examples.

In [6]:
print(llm.batch(["What's 2*3?", "What's 2*6?"]))
# Or via the async API
# await llm.abatch(["What's 2*3?", "What's 2*6?"])

[AIMessage(content='2*3 = 6', response_metadata={'role': 'assistant', 'content': '2*3 = 6', 'token_usage': {'prompt_tokens': 16, 'total_tokens': 22, 'completion_tokens': 6}, 'finish_reason': 'stop', 'model_name': 'meta/llama-3_1-8b-instruct'}, id='run-2f9f6ecc-5d50-4c35-8a0b-71cd2f642dec-0', role='assistant'), AIMessage(content='2*6 = 12', response_metadata={'role': 'assistant', 'content': '2*6 = 12', 'token_usage': {'prompt_tokens': 16, 'total_tokens': 22, 'completion_tokens': 6}, 'finish_reason': 'stop', 'model_name': 'meta/llama-3_1-8b-instruct'}, id='run-f6675f82-7b5a-41a2-810e-b681840253e3-0', role='assistant')]
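
To make the idea of handling concurrent requests more concrete, here is a minimal standard-library sketch of running several prompts concurrently with a cap on how many are in flight. It is not LangChain's implementation: `fake_ainvoke` is a hypothetical stand-in for `await llm.ainvoke(prompt)`, and `bounded_batch` and `run_demo` are names invented for this illustration.

```python
import asyncio
from typing import List


async def fake_ainvoke(prompt: str) -> str:
    """Hypothetical stand-in for `await llm.ainvoke(prompt)`."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"echo: {prompt}"


async def bounded_batch(prompts: List[str], max_concurrency: int = 2) -> List[str]:
    """Run all prompts concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(prompt: str) -> str:
        async with semaphore:
            return await fake_ainvoke(prompt)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(p) for p in prompts))


def run_demo() -> List[str]:
    # asyncio.run() is for plain scripts; inside a notebook, use
    # `await bounded_batch([...])` because an event loop is already running.
    return asyncio.run(bounded_batch(["What's 2*3?", "What's 2*6?"]))
```

With a real model, `llm.batch` and `llm.abatch` already do this work, so a helper like this is only useful for understanding the pattern.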


In [7]:
for chunk in llm.stream("How far can a seagull fly in one day?"):
    # Show the token separations
    print(chunk.content, end="|")

|The| distance| a| se|ag|ull| can| fly| in| one| day| depends| on| several| factors|,| including| its| species|,| the| time| of| year|,| the| availability| of| food| and| water|,| and| the| intensity| of| the| wind| and| other| environmental| conditions|.| Here| is| some| general| information| on| the| flying| capabilities| of| se|ag|ulls|:

|*| The| great| black|-backed| g|ull| (|L|arus| mar|inus|),| which| is| a| common| se|ag|ull| species| found| in| the| Northern| Hemisphere|,| can| fly| for| up| to| |500| miles| (|800| kilometers|)| in| one| day|.| However|,| this| is| generally| only| possible| for| individual| birds| migrating| or| traveling| between| food| sources|.
|*| The| average| daily| flight| distance| for| non|-m|igr|ating| se|ag|ulls| is| typically| much| shorter|,| usually| ranging| from| |10| to| |50| miles| (|15| to| |80| kilometers|).| Se|ag|ulls| often| make| shorter| flights| to| exploit| feeding| opportunities|,| and| they| may| also| fly| shorter| distances| to|
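
When the complete text is needed after streaming, the chunks can be collected as they arrive. The sketch below uses a hypothetical `fake_stream` generator in place of `llm.stream(...)` so that it runs without a model; `collect` is likewise a name chosen for this illustration.

```python
from typing import Iterable, Iterator, List


def fake_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for `llm.stream(...)`: yields word-sized pieces."""
    for i, piece in enumerate(text.split(" ")):
        yield piece if i == 0 else " " + piece


def collect(chunks: Iterable[str]) -> str:
    """Print chunks as they arrive and return the reassembled text."""
    parts: List[str] = []
    for chunk in chunks:
        print(chunk, end="|", flush=True)  # show the chunk boundaries
        parts.append(chunk)
    print()
    return "".join(parts)


full_text = collect(fake_stream("The distance a seagull can fly"))
```

With a real model, the same helper could be fed `chunk.content for chunk in llm.stream(...)`.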

In [8]:
async for chunk in llm.astream(
    "How long does it take for monarch butterflies to migrate?"
):
    print(chunk.content, end="|")

|A| great| question| about| one| of| the| most| iconic| insect| migrations|!

|The| monarch| butterfly| migration| is| a| remarkable| event| that| has| been| tracked| and| studied| for| decades|.| Here|'s| a| general| overview|:

|**|M|igr|ating| from| Canada| to| Mexico|:|**
|Mon|arch| butterflies| migrate| from| Canada| and| the| United| States| to| Mexico| each| autumn|,| flying| thousands| of| miles| to| reach| their| winter|ing| grounds| in| the| Oy|amel| fir| forests| of| Mexico|.| The| journey| typically| takes|:

|*| |2|-|4| weeks| for| the| monarch|s| to| leave| Canada| and| reach| the| southern| United| States|
|*| |4|-|6| weeks| for| the| monarch|s| to| cross| the| United| States|,| reaching| the| Mexican| border|

|**|Total| migration| time|:|**
|The| entire| migration| from| Canada| to| Mexico| can| take| anywhere| from| |6| to| |10| weeks|,| depending| on| weather| conditions|,| food| availability|,| and| other| factors| that| may| influence| the| butterflies|'| progress|

## Supported models

Calling `get_available_models()` will still give you all of the other models offered by your API credentials.

The `playground_` prefix is optional.

In [10]:
llm.get_available_models()



[Model(id='mistralai/mixtral-8x22b-instruct-v0.1', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-mixtral-8x22b-instruct'], supports_tools=False, supports_structured_output=False, base_model=None),
 Model(id='google/paligemma', model_type='vlm', client='ChatNVIDIA', endpoint='https://ai.api.nvidia.com/v1/vlm/google/paligemma', aliases=['ai-google-paligemma'], supports_tools=False, supports_structured_output=False, base_model=None),
 Model(id='meta/llama3-70b-instruct', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=['ai-llama3-70b'], supports_tools=False, supports_structured_output=False, base_model=None),
 Model(id='writer/palmyra-fin-70b-32k', model_type='chat', client='ChatNVIDIA', endpoint=None, aliases=None, supports_tools=False, supports_structured_output=True, base_model=None),
 Model(id='microsoft/kosmos-2', model_type='vlm', client='ChatNVIDIA', endpoint='https://ai.api.nvidia.com/v1/vlm/microsoft/kosmos-2', aliases=['ai-microsoft-kosmos-2'

## Model types

All of the models listed above are supported and can be accessed via `ChatNVIDIA`.

Some model types support unique prompting techniques and chat messages. We will review a few important ones below.

**To find out more about a specific model, please navigate to the API section of an AI Foundation model [as linked here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/codellama-13b/api).**

### General Chat

Models such as `meta/llama3-8b-instruct` and `mistralai/mixtral-8x22b-instruct-v0.1` are good all-around models that you can use with any LangChain chat messages. Example below.

In [11]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_nvidia_ai_endpoints import ChatNVIDIA

prompt = ChatPromptTemplate.from_messages(
    [("system", "You are a helpful AI assistant named Fred."), ("user", "{input}")]
)
chain = prompt | llm | StrOutputParser()

for txt in chain.stream({"input": "What's your name?"}):
    print(txt, end="")

Nice to meet you! My name is Fred, and I'm a helpful AI assistant. I'm here to assist with any questions or tasks you might have. What's on your mind?

### Code Generation

These models accept the same arguments and input structure as regular chat models, but they tend to perform better on code-generation and structured code tasks. An example of this is `meta/codellama-70b`.

In [12]:
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert coding AI. Respond only in valid python; no narration whatsoever.",
        ),
        ("user", "{input}"),
    ]
)
chain = prompt | llm | StrOutputParser()

for txt in chain.stream({"input": "How do I solve this fizz buzz problem?"}):
    print(txt, end="")

for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("fizzbuzz")
    elif i % 3 == 0:
        print("fizz")
    elif i % 5 == 0:
        print("buzz")
    else:
        print(i)

## Example usage within a RunnableWithMessageHistory

Like any other integration, `ChatNVIDIA` supports chat utilities such as `RunnableWithMessageHistory`, which is analogous to using `ConversationChain`. Below, we show the [LangChain RunnableWithMessageHistory](https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.history.RunnableWithMessageHistory.html) in use.

In [13]:
%pip install --upgrade --quiet langchain


In [14]:
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

# store is a dictionary that maps session IDs to their corresponding chat histories.
store = {}  # memory is maintained outside the chain


# A function that returns the chat history for a given session ID.
def get_session_history(session_id: str) -> InMemoryChatMessageHistory:
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]


chat = llm

#  Define a RunnableConfig object, with a `configurable` key. session_id determines thread
config = {"configurable": {"session_id": "1"}}

conversation = RunnableWithMessageHistory(
    chat,
    get_session_history,
)

conversation.invoke(
    "Hi I'm Srijan Dubey.",  # input or query
    config=config,
)

Error in RootListenersTracer.on_chain_end callback: ValueError()
Error in callback coroutine: ValueError()


AIMessage(content="Hello Srijan Dubey! It's nice to meet you. Is there something I can help you with or would you like to chat?", response_metadata={'role': 'assistant', 'content': "Hello Srijan Dubey! It's nice to meet you. Is there something I can help you with or would you like to chat?", 'token_usage': {'prompt_tokens': 18, 'total_tokens': 47, 'completion_tokens': 29}, 'finish_reason': 'stop', 'model_name': 'meta/llama-3_1-8b-instruct'}, id='run-253b0327-3039-4116-94b2-d185d3233235-0', role='assistant')

In [15]:
conversation.invoke(
    "I'm doing well! Just having a conversation with an AI.",
    config=config,
)

Error in RootListenersTracer.on_chain_end callback: ValueError()
Error in callback coroutine: ValueError()


AIMessage(content="It's great that you're having a conversation with me. Talking to an AI can be a great way to pass the time, and I'm here to chat with you about anything that's on your mind.\n\nSo, to break the ice, what do you like to do in your free time? Do you have any hobbies or interests that you're particularly passionate about?", response_metadata={'role': 'assistant', 'content': "It's great that you're having a conversation with me. Talking to an AI can be a great way to pass the time, and I'm here to chat with you about anything that's on your mind.\n\nSo, to break the ice, what do you like to do in your free time? Do you have any hobbies or interests that you're particularly passionate about?", 'token_usage': {'prompt_tokens': 70, 'total_tokens': 144, 'completion_tokens': 74}, 'finish_reason': 'stop', 'model_name': 'meta/llama-3_1-8b-instruct'}, id='run-179c8754-a6b4-450b-b426-9fb8764c14f2-0', role='assistant')

In [16]:
conversation.invoke(
    "Tell me about yourself.",
    config=config,
)

Error in RootListenersTracer.on_chain_end callback: ValueError()
Error in callback coroutine: ValueError()


AIMessage(content='I\'m an artificial intelligence model, so I don\'t have personal experiences, emotions, or a physical presence like humans do. I exist solely as a digital entity, designed to understand and generate human-like text.\n\nI was created through a process called deep learning, which involves training me on a vast amount of text data to enable me to learn patterns, relationships, and structures of language. This training allows me to generate responses to a wide range of questions and topics.\n\nI don\'t have a personality, preferences, or opinions like humans do, but I\'m designed to be helpful and informative. I can provide information on various subjects, answer questions, and even engage in conversations like this one.\n\nMy "abilities" include:\n\n* Understanding and responding to natural language inputs\n* Generating text based on a given prompt or topic\n* Providing information on a wide range of subjects\n* Answering questions to the best of my knowledge\n* Engagin

## Tool calling

You can get a list of models that are known to support tool calling by filtering the available models on their `supports_tools` attribute:

In [17]:
from langchain_core.pydantic_v1 import Field
from langchain_core.tools import tool

# Models whose metadata reports tool-calling support
tool_models = [
    model for model in ChatNVIDIA.get_available_models() if model.supports_tools
]

@tool
def get_current_weather(
    location: str = Field(..., description="The location to get the weather for.")
):
    """Get the current weather for a location."""
    ...

# No base_url is given here, so this client does not point at the local NIM
llm = ChatNVIDIA(model=tool_models[0].id).bind_tools(tools=[get_current_weather])
response = llm.invoke("What is the weather in Boston?")
response.tool_calls

See [How to use chat models to call tools](https://python.langchain.com/v0.2/docs/how_to/tool_calling/) for additional examples.

## Structured output

Starting in v0.2.1, `ChatNVIDIA` supports [with_structured_output](https://api.python.langchain.com/en/latest/language_models/langchain_core.language_models.chat_models.BaseChatModel.html#langchain_core.language_models.chat_models.BaseChatModel.with_structured_output).

`ChatNVIDIA` provides integration with the variety of models on [build.nvidia.com](https://build.nvidia.com) as well as local NIMs. Not all of these model endpoints implement the structured output features. Be sure to select a model that supports structured output for your experimentation and applications.

Note: `include_raw` is not supported. You can get raw output from your LLM and use a [PydanticOutputParser](https://python.langchain.com/v0.2/docs/how_to/structured_output/#using-pydanticoutputparser) or [JsonOutputParser](https://python.langchain.com/v0.2/docs/how_to/output_parser_json/#without-pydantic).
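
To illustrate what working from raw output involves, here is a minimal standard-library sketch that parses a raw JSON string into a typed object and rejects anything that does not fit. It is not the `PydanticOutputParser` or `JsonOutputParser` API; the `Person` dataclass and the `parse_person` helper are assumptions made for this example only.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    first_name: str
    last_name: str


def parse_person(raw: str) -> Optional[Person]:
    """Parse raw model text into a Person, or return None if it does not fit."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    first, last = data.get("first_name"), data.get("last_name")
    if not isinstance(first, str) or not isinstance(last, str):
        return None
    return Person(first_name=first, last_name=last)
```

With a real model, the raw string would come from something like `llm.invoke(...).content`. Returning `None` on failure keeps the caller in charge of retries or fallbacks.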

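You can get a list of models that are known to support structured output by filtering the result of `ChatNVIDIA.get_available_models()` on each model's `supports_structured_output` attribute.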
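
The filtering step can be sketched without network access as follows. `ModelInfo` is a hypothetical stand-in that copies only a few of the fields visible in the model listing earlier in this notebook, and `structured_output_models` is a helper name invented here. With the real client, the list would come from `ChatNVIDIA.get_available_models()` instead of the hard-coded `catalog`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelInfo:
    """Hypothetical stand-in with a few of the fields shown in the model listing."""
    id: str
    model_type: str
    supports_tools: bool = False
    supports_structured_output: bool = False


def structured_output_models(models: List[ModelInfo]) -> List[ModelInfo]:
    """Keep only the models that report structured output support."""
    return [m for m in models if m.supports_structured_output]


# Sample entries copied from the listing above, not fetched live.
catalog = [
    ModelInfo("mistralai/mixtral-8x22b-instruct-v0.1", "chat"),
    ModelInfo("writer/palmyra-fin-70b-32k", "chat", supports_structured_output=True),
    ModelInfo("google/paligemma", "vlm"),
]
structured_ids = [m.id for m in structured_output_models(catalog)]
print(structured_ids)
```

The same comprehension works for tool calling by testing `supports_tools` instead.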

### Pydantic style

In [21]:
from langchain_core.pydantic_v1 import BaseModel, Field

class Person(BaseModel):
    first_name: str = Field(..., description="The person's first name.")
    last_name: str = Field(..., description="The person's last name.")

llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama-3_1-8b-instruct").with_structured_output(Person)
response = llm.invoke("Who is Michael Jeffrey Jordon?")
response



Person(first_name='Leonardo', last_name='DiCaprio')

### Enum style

In [22]:
from enum import Enum

class Choices(Enum):
    A = "A"
    B = "B"
    C = "C"

llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama-3_1-8b-instruct").with_structured_output(Choices)
response = llm.invoke("""
        What does 1+1 equal?
            A. -100
            B. 2
            C. doorstop
        """
)
response



<Choices.B: 'B'>
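
An enum restricts the answer to a fixed set of values. The validation idea behind that can be sketched with the standard library alone. This is not how `ChatNVIDIA` enforces the constraint internally, and `to_choice` is a hypothetical helper written for this illustration.

```python
from enum import Enum
from typing import Optional


class Choices(Enum):
    A = "A"
    B = "B"
    C = "C"


def to_choice(raw: str) -> Optional[Choices]:
    """Map raw model text to a Choices member, or None if it is not a valid value."""
    try:
        return Choices(raw.strip())
    except ValueError:
        return None
```

A helper like this can serve as a defensive check when a model endpoint does not support structured output and only free text is available.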

In [23]:
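# `structured_models` must already exist, for example as the list of entries from
# ChatNVIDIA.get_available_models() whose `supports_structured_output` is True.
model = structured_models[3].id
# The request below still targets the local NIM; `model` is only printed.
llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama-3_1-8b-instruct").with_structured_output(Choices)
print(model)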
response = llm.invoke("""
        What does 1+1 equal?
            A. -100
            B. 2
            C. doorstop
        """
)
response

meta/llama-3.1-405b-instruct




<Choices.B: 'B'>