<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>

<br>

# <font color="#76b900">**Notebook 9:** LangServe and Assessment</font>

<br>

## LangServe Server Setup

This notebook is a playground for those interested in developing interactive web applications using LangChain and [**LangServe**](https://python.langchain.com/docs/langserve). The aim is to provide a minimal-code example to illustrate the potential of LangChain in web application contexts.

This section provides a walkthrough for setting up a simple API server using LangChain's Runnable interfaces with FastAPI. The example demonstrates how to integrate a LangChain model, such as `ChatNVIDIA`, to create and distribute accessible API routes. Using this, you will be able to supply functionality to the frontend service's [**`frontend_server.py`**](frontend/frontend_server.py) session, which strongly expects:
- A simple endpoint named `:9012/basic_chat` for the basic chatbot, exemplified below.
- A pair of endpoints named `:9012/retriever` and `:9012/generator` for the RAG chatbot.
- All three for the **Evaluate** utility, which will be required for the final assessment. *More on that later!*

**IMPORTANT NOTES:**
- Make sure to click the square ( $\square$ ) button twice to shut down an active FastAPI cell. The first time might fall through or trigger a try-catch routine on an asynchronous process.
- If it still doesn't work, do a hard restart on this notebook by using **Kernel -> Restart Kernel**.
- When a FastAPI server is running in your cell, expect the process to block up this notebook. Other notebooks should not be impacted by this. 

<br>

### **Part 1:** Delivering the /basic_chat endpoint

Instructions are provided for launching a `/basic_chat` endpoint both as a standalone Python file. This will be used by the frontend to make basic decision with no internal reasoning.

In [14]:
%%writefile server_app.py
import typing
import os
import random

import tarfile
import logging

from fastapi import FastAPI
import uvicorn

from langserve import add_routes
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain_community.vectorstores import FAISS

from datetime import datetime
from fastapi import FastAPI
from time import sleep

from functools import partial
from operator import itemgetter

from langchain.document_loaders import ArxivLoader
from langchain.document_transformers import LongContextReorder
from langchain.schema import SystemMessage, HumanMessage
from langchain.prompts import ChatPromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.runnables import RunnableMap, RunnableLambda
from langchain_core.runnables.passthrough import RunnableAssign
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnableBranch
from langchain_core.runnables.passthrough import RunnableAssign
from langchain.document_transformers import LongContextReorder
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings
from langchain.pydantic_v1 import BaseModel
from langserve import RemoteRunnable
import gradio as gr

from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
from langchain_community.vectorstores import FAISS

# https://python.langchain.com/docs/langserve#server
from fastapi import FastAPI
from langchain.prompts import ChatPromptTemplate
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langserve import add_routes
import logging
# Configure logging
tls_logger = logging.getLogger("uvicorn.error")
tls_logger.setLevel(logging.INFO)

# 1. Initialize embeddings and load vector store
embedder = NVIDIAEmbeddings(model="nvidia/nv-embed-v1", truncate="END")

# Extract and load the FAISS index
tar_path = "docstore_index.tgz"
if os.path.exists(tar_path):
    with tarfile.open(tar_path) as tar:
        tar.extractall(path=".")

# Ensure directory exists
index_path = "docstore_index"
if not os.path.isdir(index_path):
    raise RuntimeError(f"Missing {index_path}/; please upload your docstore_index files.")

docstore = FAISS.load_local(index_path, embedder, allow_dangerous_deserialization=True)

# 2. Define LLM and prompt
llm_chain = ChatNVIDIA(model="meta/llama-3.1-8b-instruct") | StrOutputParser()

chat_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a document chatbot. Use the provided context to answer user questions. "
     "Cite sources that you use in your answer."
    ),
    ("user", "{input}\n\nContext:\n{context}")
])

# 3. Create FastAPI app and mount routes
app = FastAPI(
    title="LangChain RAG Server",
    version="1.0",
    description="Provides /basic_chat, /retriever, and /generator endpoints for RAG."
)

# Basic chat: single-model endpoint
add_routes(app, llm_chain, path="/basic_chat")
# Retriever: vector store retriever endpoint
add_routes(app, docstore.as_retriever(), path="/retriever")
# Generator: prompt + LLM chain endpoint
add_routes(app, chat_prompt | llm_chain, path="/generator")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=9012)

Overwriting server_app.py


In [None]:
## Works, but will block the notebook.
!python server_app.py  

## Will technically work, but not recommended in a notebook. 
## You may be surprised at the interesting side effects...
# import os
# os.system("python server_app.py &")


>> from langchain.document_loaders import ArxivLoader

with new imports of:

>> from langchain_community.document_loaders import ArxivLoader
You can use the langchain cli to **automatically** upgrade many imports. Please see documentation here <https://python.langchain.com/docs/versions/v0_2/>
  from langchain.document_loaders import ArxivLoader

>> from langchain.document_transformers import LongContextReorder

with new imports of:

>> from langchain_community.document_transformers import LongContextReorder
You can use the langchain cli to **automatically** upgrade many imports. Please see documentation here <https://python.langchain.com/docs/versions/v0_2/>
  from langchain.document_transformers import LongContextReorder

>> from langchain.document_transformers import LongContextReorder

with new imports of:

>> from langchain_community.document_transformers import LongContextReorder
You can use the langchain cli to **automatically** upgrade many imports. Please see documentation he

<br>

### **Part 2:** Using The Server:

While this cannot be easily utilized within Google Colab (or at least not without a lot of special tricks), the above script will keep a running server tied to the notebook process. While the server is running, do not attempt to use this notebook (except to shut down/restart the service).

In another file, however, you should be able to access the `basic_chat` endpoint using the following interface:

```python
from langserve import RemoteRunnable
from langchain_core.output_parsers import StrOutputParser

llm = RemoteRunnable("http://0.0.0.0:9012/basic_chat/") | StrOutputParser()
for token in llm.stream("Hello World! How is it going?"):
    print(token, end='')
```

**Please try it out in a different file and see if it works!**


<br>

### **Part 3: Final Assessment**

**This notebook will be used to completing the final assessment!** When you have otherwise finished the course, we recommend cloning this notebook, getting the frontend open in a new tab, and implement the Evaluate functionality by implementing the `/generator` and `/retriever` endpoints above! For a quick link to the frontend, run the cell below:

In [None]:
%%js
var url = 'http://'+window.location.host+':8090';
element.innerHTML = '<a style="color:#76b900;" target="_blank" href='+url+'><h2>< Link To Gradio Frontend ></h2></a>';

<hr>
<br>

#### **Assessment Hint:** 
Note that the following functionality is already implemented in the frontend microservice. 

```python
## Necessary Endpoints
chains_dict = {
    'basic' : RemoteRunnable("http://lab:9012/basic_chat/"),
    'retriever' : RemoteRunnable("http://lab:9012/retriever/"),  ## For the final assessment
    'generator' : RemoteRunnable("http://lab:9012/generator/"),  ## For the final assessment
}

basic_chain = chains_dict['basic']

## Retrieval-Augmented Generation Chain

retrieval_chain = (
    {'input' : (lambda x: x)}
    | RunnableAssign(
        {'context' : itemgetter('input') 
        | chains_dict['retriever'] 
        | LongContextReorder().transform_documents
        | docs2str
    })
)

output_chain = RunnableAssign({"output" : chains_dict['generator'] }) | output_puller
rag_chain = retrieval_chain | output_chain
```

**To conform to this endpoint ingestion strategy, make sure not to duplicate pipeline functionality and only deploy the features that are missing!**

----

<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>