# Using NVIDIA's LLM API Catalog Connector

This notebook walks through the basic usage of the `NVIDIA` connector.

With this connector, you'll be able to connect to and generate from compatible models available at the NVIDIA [API Catalog](https://build.nvidia.com/explore/discover), such as:

- Google's [gemma-7b](https://build.nvidia.com/google/gemma-7b)
- Mistral AI's [mistral-7b-instruct-v0.2](https://build.nvidia.com/mistralai/mistral-7b-instruct-v2)
- And more!

We'll begin by ensuring `llama-index` and associated packages are installed.

> NOTE: Only models that have a base URL of `https://integrate.api.nvidia.com/v1` are compatible with this connector at this time.

In [10]:
!pip install llama-index
!pip install llama-index-core
# Provides `llama_index.llms.nvidia`, which is imported below
!pip install llama-index-llms-nvidia
!pip install llama-index-embeddings-openai



## API Keys and Boilerplate

In the next cell we'll run some boilerplate so that the examples execute smoothly in a notebook environment.

We'll also provide our API keys. 

> NOTE: You can create your NVIDIA API key using the `Get API Key` button in the code example window.

In [1]:
# nest_asyncio lets async code run inside the notebook's already-running event loop
import nest_asyncio
nest_asyncio.apply()

import os

# Using OpenAI API for embeddings
os.environ["OPENAI_API_KEY"] = "sk-"

# Using NVIDIA API Playground API Key for LLM
os.environ["NVIDIA_AI_PLAYGROUND_API_KEY"] = "nvapi-"
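
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helper below prompts for a key only when the environment variable is not already set. `ensure_key` is a hypothetical name introduced here for illustration and is not part of LlamaIndex; the `nvapi-` and `sk-` prefixes are taken from the placeholders in the cell above.

```python
import os
from getpass import getpass


def ensure_key(name: str, prefix: str, prompt=getpass) -> str:
    """Return the key stored in os.environ[name], prompting once if it is absent.

    Hypothetical helper for this notebook; `prefix` is the expected start of
    the key ("nvapi-" or "sk-" in the cell above).
    """
    value = os.environ.get(name, "")
    if not value:
        # getpass hides the typed key so it does not end up in the cell output
        value = prompt(f"Enter {name}: ")
    if not value.startswith(prefix):
        raise ValueError(f"{name} should start with {prefix!r}")
    os.environ[name] = value
    return value
```

In a notebook this could be called as `ensure_key("NVIDIA_AI_PLAYGROUND_API_KEY", "nvapi-")` in place of the hard-coded assignment.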

## Loading the NVIDIA LLM

Now we can load our `NVIDIA` LLM by passing in the model name, as listed in the docs, located [here](https://docs.api.nvidia.com/nim/reference/).

> NOTE: The default model is `mistralai/mistral-7b-instruct-v0.2`.

In [None]:
from llama_index.llms.nvidia import NVIDIA
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings

llm = NVIDIA(model="mistralai/mistral-7b-instruct-v0.2")

Settings.llm = llm

We can check which model our `llm` object is currently associated with using the `.model` attribute.

In [3]:
llm.model

'mistralai/mistral-7b-instruct-v0.2'

## Loading API Catalog LLM

We can also load models using their API Catalog address.

Let's use `gemma-7b` as an example!

1. Navigate to the [model page](https://build.nvidia.com/google/gemma-7b)
2. Find the address in the `model` parameter (e.g. `"google/gemma-7b"`)
3. Verify it has the `base_url` of `"https://integrate.api.nvidia.com/v1"`
4. Use `NVIDIA(model="model_name_here")` to point the connector at that model (e.g. `NVIDIA(model="google/gemma-7b")`)

Let's see this in the code.

In [4]:
llm = NVIDIA(model="google/gemma-7b")

Let's confirm we've associated our `NVIDIA` LLM with the correct model!

In [5]:
llm.model

'google/gemma-7b'
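
The compatibility check in step 3 above can be sketched as a small function. This is only an illustration: `is_compatible` is a hypothetical helper, and the `example.com` address is a placeholder rather than a real model endpoint.

```python
COMPATIBLE_BASE_URL = "https://integrate.api.nvidia.com/v1"


def is_compatible(base_url: str) -> bool:
    """True when a model page's base_url matches the one this connector supports."""
    # Ignore a trailing slash so both spellings of the same URL compare equal.
    return base_url.rstrip("/") == COMPATIBLE_BASE_URL


print(is_compatible("https://integrate.api.nvidia.com/v1"))  # True
print(is_compatible("https://example.com/v1"))  # False (placeholder URL)
```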

## Basic Functionality

Now we can explore the different ways you can use the connector within the LlamaIndex ecosystem!

Before we begin, let's set up a list of `ChatMessage` objects, which is the expected input for some of the methods.

In [18]:
from llama_index.core.llms import ChatMessage, MessageRole

chat_messages = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=(
            "You are a helpful assistant."
        )
    ),
    ChatMessage(
        role=MessageRole.USER,
        content=(
            "What are the most popular house pets in North America?"
        )
    ),
]

We'll follow the same basic pattern for each example: 

1. We'll point our `NVIDIA` LLM to our desired model
2. We'll examine how to use the endpoint to achieve the desired task!

### Complete: `.complete()`

We can use `.complete()`/`.acomplete()` (which takes a string) to prompt a response from the selected model.

Let's use our default model for this task.

In [7]:
completion_llm = NVIDIA()

We can verify this is the expected default by checking the `.model` attribute.

In [8]:
completion_llm.model

'mistralai/mistral-7b-instruct-v0.2'

Let's call `.complete()` on our model with a string, in this case `"Hello!"`, and observe the response.

In [9]:
completion_llm.complete("Hello!")

CompletionResponse(text=" Hello! How can I help you today? I'm here to answer any questions you might have, provide information, or just chat if you'd like. Let me know what's on your mind!\n\nHere are a few things I can help with:\n\n* Answering factual questions\n* Providing definitions and explanations\n* Helping with math problems\n* Giving suggestions for books, movies, or other forms of entertainment\n* Offering advice on a variety of topics\n* And much more!\n\nSo, what can I help you with today? Let me know and I'll do my best to assist you.\n\nIf you have a specific question or topic in mind, feel free to ask it directly. If you're not sure what you're looking for, or if you just want to chat, that's fine too! I'm here to help in any way I can.\n\nI look forward to hearing from you! Let me know if you have any questions or if there's anything I can help you with.\n\nBest regards,\n\n[Your Name]\n\nAssistant: [Your Name] is an assistant designed to help answer questions, provid

As expected with LlamaIndex, we get a `CompletionResponse` back.

#### Async Complete: `.acomplete()`

There is also an async implementation which can be leveraged in the same way!

In [10]:
await completion_llm.acomplete("Hello!")

CompletionResponse(text=" Hello there! How can I help you today? I'm here to answer any questions you might have or provide information on a wide range of topics. So feel free to ask me anything!\n\nIf you're looking for a specific topic, just let me know and I'll do my best to provide you with accurate and up-to-date information. And if you have any requests for fun facts or trivia, I'm happy to oblige!\n\nSo, what would you like to know today? Let me help make your day a little brighter! 😊", additional_kwargs={}, raw={'id': 'chatcmpl-8ce881c1-a47b-43aa-afd8-9e9addf26ce9', 'choices': [Choice(finish_reason=None, index=0, logprobs=ChoiceLogprobs(content=None, text_offset=[], token_logprobs=[0.0, 0.0], tokens=[], top_logprobs=[]), message=ChatCompletionMessage(content=" Hello there! How can I help you today? I'm here to answer any questions you might have or provide information on a wide range of topics. So feel free to ask me anything!\n\nIf you're looking for a specific topic, just let
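
One reason to reach for the async method is to issue several prompts concurrently. The sketch below shows the pattern with `asyncio.gather`. To stay runnable without an API key, it uses `fake_acomplete`, a hypothetical stand-in for `completion_llm.acomplete` that makes no network call.

```python
import asyncio


async def fake_acomplete(prompt: str) -> str:
    """Stand-in for `await completion_llm.acomplete(prompt)`; makes no network call."""
    await asyncio.sleep(0.01)  # simulate waiting on the endpoint
    return f"echo: {prompt}"


async def complete_many(prompts):
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_acomplete(p) for p in prompts))
```

In a notebook cell this would be awaited directly, e.g. `await complete_many(["Hello!", "Goodbye!"])`; swapping `fake_acomplete(p)` for `completion_llm.acomplete(p)` would apply the same pattern to the real connector.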

### Chat: `.chat()`

Now we can try the same thing using the `.chat()` method. This method expects a list of chat messages - so we'll use the one we created above.

We'll use the `mistralai/mixtral-8x7b-instruct-v0.1` model for the example.

In [31]:
chat_llm = NVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")

All we need to do now is call `.chat()` on our list of `ChatMessage` objects and observe our response.

You'll also notice that we can pass in a few additional keyword arguments that influence the generation. In this case, we've used the `seed` parameter to seed the generation and the `stop` parameter to tell the model to stop generating once it reaches one of the listed tokens.

> NOTE: You can find information about what additional kwargs are supported by the model's endpoint by referencing the API documentation for the selected model. Mixtral's is located [here](https://docs.api.nvidia.com/nim/reference/mistralai-mixtral-8x7b-instruct-infer) as an example!

In [32]:
chat_llm.chat(chat_messages, seed=4, stop=["cat", "cats", "Cat", "Cats"])

ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content=' The most popular house pets in North America are dogs and cats', additional_kwargs={}), raw={'id': 'chatcmpl-2a072acf-9611-42fd-82bc-e5295ea44c69', 'choices': [Choice(finish_reason=None, index=0, logprobs=ChoiceLogprobs(content=None, text_offset=[], token_logprobs=[0.0, 0.0], tokens=[], top_logprobs=[]), message=ChatCompletionMessage(content=' The most popular house pets in North America are dogs and cats', role='assistant', function_call=None, tool_calls=None))], 'created': 1712177058, 'model': 'mistralai/mixtral-8x7b-instruct-v0.1', 'object': 'chat.completion', 'system_fingerprint': None, 'usage': CompletionUsage(completion_tokens=12, prompt_tokens=59, total_tokens=71)}, delta=None, logprobs=None, additional_kwargs={})

As expected, we receive a `ChatResponse` in response.
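
To make the role of the extra keyword arguments more concrete, here is a rough sketch of how `seed` and `stop` could sit alongside the messages in a request body. It assumes an OpenAI-style chat completions format (the `'object': 'chat.completion'` field in the raw output above hints at this), and `build_chat_payload` is a hypothetical helper. It is not a description of how the connector builds its requests.

```python
import json


def build_chat_payload(model, messages, **extra):
    """Assemble an OpenAI-style chat request body (an assumed wire format)."""
    payload = {
        "model": model,
        # Each message is reduced to a role/content pair.
        "messages": [{"role": role, "content": content} for role, content in messages],
    }
    # Extra generation options such as seed and stop are merged in as top-level fields.
    payload.update(extra)
    return payload


body = build_chat_payload(
    "mistralai/mixtral-8x7b-instruct-v0.1",
    [
        ("system", "You are a helpful assistant."),
        ("user", "What are the most popular house pets in North America?"),
    ],
    seed=4,
    stop=["cat", "cats", "Cat", "Cats"],
)
print(json.dumps(body, indent=2))
```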

#### Async Chat: `.achat()`

We also have an async implementation of the `.chat()` method which can be called in the following way.

In [33]:
await chat_llm.achat(chat_messages)

ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content=' The most popular house pets in North America are dogs and cats. According to the American Pet Products Association (APPA), as of 2021, approximately 69 million homes in the United States own a pet, and 63.4 million of those households have a dog, while 42.7 million have a cat. Birds, small mammals, reptiles, and fish are also popular pets, but to a lesser extent.', additional_kwargs={}), raw={'id': 'chatcmpl-373a1d42-4dc1-4ef9-aaf3-5fea137e8e1e', 'choices': [Choice(finish_reason=None, index=0, logprobs=ChoiceLogprobs(content=None, text_offset=[], token_logprobs=[0.0, 0.0], tokens=[], top_logprobs=[]), message=ChatCompletionMessage(content=' The most popular house pets in North America are dogs and cats. According to the American Pet Products Association (APPA), as of 2021, approximately 69 million homes in the United States own a pet, and 63.4 million of those households have a dog, while 42.7 million

### Stream: `.stream_chat()`

We can also use the models found on `build.nvidia.com` for streaming use-cases!

Let's select another model and observe this behaviour. We'll use Google's `gemma-7b` model for this task.

In [34]:
stream_llm = NVIDIA(model="google/gemma-7b")

Let's call our model with `.stream_chat()`, which again expects a list of `ChatMessage` objects, and capture the response.

In [35]:
streamed_response = stream_llm.stream_chat(chat_messages)

In [36]:
streamed_response

<generator object llm_chat_callback.<locals>.wrap.<locals>.wrapped_llm_chat.<locals>.wrapped_gen at 0x787709eea680>

As we can see, the response is a generator that yields the streamed response.

Let's take a look at the final response once the generation is complete.

In [37]:
last_element = None
for last_element in streamed_response:
    pass

print(last_element)

assistant: Sure, here are the most popular house pets in North America:

1. Dogs
2. Cats
3. Fish
4. Rabbits
5. Small Mammals
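
The loop above keeps only the last item, but streaming is usually most useful when each increment is shown as it arrives. The sketch below illustrates that pattern. It assumes each streamed item carries the newly generated text in a `delta` field (the `ChatResponse` output earlier shows such a field), and it uses a hypothetical `Chunk` class and `fake_stream` generator as stand-ins so that it runs without an API key.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a streamed ChatResponse; only `delta` is modelled."""

    delta: Optional[str]


def fake_stream(text: str, size: int = 6) -> Iterator[Chunk]:
    # Yield the text in fixed-size pieces, the way a stream yields increments.
    for start in range(0, len(text), size):
        yield Chunk(delta=text[start:start + size])


def print_stream(chunks: Iterable[Chunk]) -> str:
    """Print each increment as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        piece = chunk.delta or ""  # treat a missing delta as empty text
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)


full_text = print_stream(fake_stream("1. Dogs\n2. Cats\n3. Fish"))
```

With the real connector, the same loop could be tried on `stream_llm.stream_chat(chat_messages)` in place of `fake_stream(...)`.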


#### Async Stream: `.astream_chat()`

We have the equivalent async method for streaming as well, which can be used in a similar way to the sync implementation.

In [38]:
streamed_response = await stream_llm.astream_chat(chat_messages)

In [39]:
streamed_response

<async_generator object llm_chat_callback.<locals>.wrap.<locals>.wrapped_async_llm_chat.<locals>.wrapped_gen at 0x787709eea460>

In [40]:
last_element = None
async for last_element in streamed_response:
    pass

print(last_element)

assistant: Sure, here are the most popular house pets in North America:

1. Dogs
2. Cats
3. Fish
4. Small Mammals
5. Birds


## Streaming Query Engine Responses

Let's look at a slightly more involved example using a query engine!

We'll start by loading some data (we'll be using the [Hitchhiker's Guide to the Galaxy](https://web.eecs.utk.edu/~hqi/deeplearning/project/hhgttg.txt)).

### Loading Data

Let's first create a directory where our data can live.

In [6]:
!mkdir -p 'data/hhgttg'

We'll download our data from the above source.

In [22]:
!wget 'https://web.eecs.utk.edu/~hqi/deeplearning/project/hhgttg.txt' -O 'data/hhgttg/hhgttg.txt'

--2024-04-01 14:39:38--  https://web.eecs.utk.edu/~hqi/deeplearning/project/hhgttg.txt
Resolving web.eecs.utk.edu (web.eecs.utk.edu)... 160.36.127.165
Connecting to web.eecs.utk.edu (web.eecs.utk.edu)|160.36.127.165|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1534289 (1.5M) [text/plain]
Saving to: ‘data/hhgttg/hhgttg.txt’


2024-04-01 14:39:39 (6.75 MB/s) - ‘data/hhgttg/hhgttg.txt’ saved [1534289/1534289]



We'll need an embedding model for this step! We'll use OpenAI's `text-embedding-3-small` model, and save it in our `Settings`.

In [68]:
from llama_index.embeddings.openai import OpenAIEmbedding

openai_embedding = OpenAIEmbedding(model="text-embedding-3-small")

Settings.embed_model = openai_embedding

Now we can load our document and create an index that uses the `OpenAIEmbedding` created above.

In [75]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data/hhgttg").load_data()
index = VectorStoreIndex.from_documents(documents)
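
For intuition about what an embedding-based index is for, the sketch below ranks text chunks by cosine similarity to a query vector. This is a conceptual illustration only and not LlamaIndex's implementation: the toy two-dimensional vectors stand in for real embeddings, and `cosine` and `top_k` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to query_vec."""
    ranked = sorted(chunks, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy (text, vector) pairs; real embeddings have many more dimensions.
toy_chunks = [
    ("forty-two", [1.0, 0.0]),
    ("towels", [0.0, 1.0]),
    ("vogons", [0.7, 0.7]),
]
print(top_k([1.0, 0.1], toy_chunks))  # ['forty-two', 'vogons']
```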

Now we can create a simple query engine and set our `streaming` parameter to `True`.

In [76]:
streaming_qe = index.as_query_engine(streaming=True)

Let's send a query to our query engine, and then stream the response.

In [77]:
streaming_response = streaming_qe.query(
    "What is the significance of the number 42?",
)

In [78]:
streaming_response.print_response_stream()

The significance of the number 42 is a central theme in "The Hitchhiker's Guide to the Galaxy" by Douglas Adams. The book is a comedic science fiction satire that follows the adventures of two intergalactic travelers, Arthur Dent and Ford Prefect, as they try to escape the destruction of Earth and uncover the true meaning of the number 42.

Throughout the book, the number 42 is presented as the ultimate answer to the ultimate question of life, the universe, and everything. The question itself is never explicitly stated, but it is implied to be a deeply profound and existential one that has been sought after by philosophers, scientists, and thinkers throughout history.

The idea of the number 42 as the ultimate answer is a playful jab at the idea of seeking ultimate knowledge and understanding, which is often seen as an impossible task. The number 42 is also a reference to the famous "42" answer in the "The Hitchhiker's Guide to the Galaxy" by Douglas Adams, which is a comedic science f