## Integrating Ray Serve with LangChain and HuggingFace

<img src="ray_langchain.png" width="50%" height="25%">

The process involves two stages.

1. Build the vector embeddings index from the PDF document and store it in a Chroma database.
For long documents that run to hundreds of pages, the work can be split and parallelized with Ray tasks.
2. Once the Chroma vector embeddings database is built, load the embeddings from disk.
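
The splitting mentioned in stage 1 can be sketched in plain Python. This is only an illustration under assumptions: the helper `split_into_batches` and the batch size of 100 are hypothetical and do not come from the original notebook, and the list of page numbers stands in for the pages a PDF loader would return.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], batch_size: int) -> List[Sequence[T]]:
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Stand-in for the pages of a PDF: here just the page numbers 1..351.
pages = list(range(1, 352))

# Hypothetical batch size; tune it to the document and the cluster.
batches = split_into_batches(pages, batch_size=100)

# Each batch could then be handed to its own Ray task, for example a
# function decorated with @ray.remote that embeds the pages in that batch.
print([len(batch) for batch in batches])  # [100, 100, 100, 51]
```

With a document of this size a single batch, and therefore a single task, is usually enough; batching only pays off for much longer inputs.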

<img src="create_index_flow.png" width="70%" height="25%">

Our PDF document is an AI report of 351 pages, small enough for a single remote Ray task.

<img src="ai_report.png" width="20%" height="10%">


In [22]:
import os
import requests
from random import randint
from pathlib import Path

from langchain.document_loaders import PyPDFLoader 
from langchain.embeddings import HuggingFaceEmbeddings
from langchain import PromptTemplate, HuggingFaceHub, LLMChain
from langchain.vectorstores import Chroma 
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI

from ray import serve
from starlette.requests import Request

# simple example inspired by
# https://medium.com/geekculture/automating-pdf-interaction-with-langchain-and-chatgpt-e723337f26a6
# and https://github.com/sophiamyang/tutorials-LangChain/tree/main

In [23]:
HF_TOKEN = "your_token"
vector_db_path = Path(Path.cwd(), "vector_hf_db").as_posix()

In [24]:
@serve.deployment(
    route_prefix="/",
    autoscaling_config={
        "min_replicas": 2,
        "initial_replicas": 2,
        "max_replicas": 4,
    },
)
class AnswerHuggingFacePDFQuestions():
    def __init__(self, vector_db_path: str, hf_ai_key: str, verbose=False):

        # Load the embeddings 
        os.environ["HUGGINGFACEHUB_API_TOKEN"] = hf_ai_key
    
        embeddings = HuggingFaceEmbeddings()
        self._vectordb = Chroma(persist_directory=vector_db_path,
                                embedding_function=embeddings)
        template = """Question: {question}

        Answer: """
        prompt = PromptTemplate(template=template, input_variables=["question"])
        llm_chain = LLMChain(prompt=prompt, llm=HuggingFaceHub(repo_id="google/flan-t5-base",
                                                                model_kwargs={"temperature":0, "max_length":128}))
        self._pdf_qa_chain = llm_chain
        self._chat_history=[]

    async def __call__(self, http_request: Request) -> dict:
        # The request body is a JSON list of {"text": ...} objects
        json_request: list = await http_request.json()
        prompts = []
        for prompt in json_request:
            text = prompt["text"]
            if isinstance(text, list):
                prompts.extend(text)
            else:
                prompts.append(text)
        result = self._pdf_qa_chain({"question": prompts[0], "chat_history": self._chat_history})
        
        return result
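
The payload handling in `__call__` can be tried on its own, without starting Serve. The sketch below copies that loop into a standalone function; the name `extract_prompts` and the sample payloads are hypothetical and are not part of the notebook.

```python
from typing import Any, Dict, List


def extract_prompts(json_request: List[Dict[str, Any]]) -> List[str]:
    """Flatten a list of {"text": str | list[str]} items into a list of prompts."""
    prompts: List[str] = []
    for item in json_request:
        text = item["text"]
        if isinstance(text, list):
            # A list of questions under one "text" key is unpacked in order.
            prompts.extend(text)
        else:
            prompts.append(text)
    return prompts


# Payload shape used by the client loop in this notebook: one question per request.
single = [{"text": "What is the total number of publications?"}]

# Hypothetical payload mixing a list of questions with a single question.
mixed = [{"text": ["first question", "second question"]}, {"text": "third question"}]

print(extract_prompts(single))
print(extract_prompts(mixed))
```

Note that the deployment as written passes only `prompts[0]` to the chain, so any additional questions in the same request are collected but not answered.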

In [25]:
deployment = AnswerHuggingFacePDFQuestions.bind(vector_db_path, HF_TOKEN)
serve.run(deployment)

[2m[36m(ServeController pid=51063)[0m INFO 2023-04-21 11:11:51,810 controller 51063 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-c0e650abbe202d84d1465a53424f085d78e995352624b199b7d10e7a' on node 'c0e650abbe202d84d1465a53424f085d78e995352624b199b7d10e7a' listening on '127.0.0.1:8000'
[2m[36m(ServeController pid=51063)[0m INFO 2023-04-21 11:11:52,575 controller 51063 deployment_state.py:1333 - Adding 2 replicas to deployment 'AnswerHuggingFacePDFQuestions'.
[2m[36m(HTTPProxyActor pid=51065)[0m INFO:     Started server process [51065]
[2m[36m(ServeReplica:AnswerHuggingFacePDFQuestions pid=51068)[0m Using embedded DuckDB with persistence: data will be stored in: /Users/jules/git-repos/misc-code/py/langchain/vector_hf_db
[2m[36m(ServeReplica:AnswerHuggingFacePDFQuestions pid=51069)[0m Using embedded DuckDB with persistence: data will be stored in: /Users/jules/git-repos/misc-code/py/langchain/vector_hf_db


RayServeSyncHandle(deployment='AnswerHuggingFacePDFQuestions')

### Send questions to the LLM served by Ray Serve

<img src="ray_serve_request_response.png" width="50%" height="25%">


In [26]:
# send the request to the deployment
prompts = [ "What is the total number of publications?",
            "What is the percentage increase in the number of AI-related job postings?",
            "Why Chinese citizens are more optimistic about AI than Americans?",
            # "What are the top takeaways from this report?",
            # "List benchmarks are released to evaluate AI workloads?",
            # "Describe and summarize the technical ethical issues raised in the report?",
            # "How many bills containing “artificial intelligence” were passed into law?",
            # "What is the percentage increase in the number of AI-related job postings?"      
]

In [27]:
sample_inputs = [{"text": prompt} for prompt in prompts]
for sample_input in sample_inputs:
    output = requests.post("http://localhost:8000/", json=[sample_input]).json()
    print(f"Question: {output['question']}\nAnswer: {output['text']}\n")

Question: What is the total number of publications?
Answer: 59



[2m[36m(HTTPProxyActor pid=51065)[0m INFO 2023-04-21 11:11:57,471 http_proxy 127.0.0.1 http_proxy.py:373 - POST / 200 845.8ms
[2m[36m(ServeReplica:AnswerHuggingFacePDFQuestions pid=51069)[0m INFO 2023-04-21 11:11:57,469 AnswerHuggingFacePDFQuestions AnswerHuggingFacePDFQuestions#fswBYg replica.py:518 - HANDLE __call__ OK 840.9ms


Question: What is the percentage increase in the number of AI-related job postings?
Answer: a ten percent increase



[2m[36m(HTTPProxyActor pid=51065)[0m INFO 2023-04-21 11:11:57,904 http_proxy 127.0.0.1 http_proxy.py:373 - POST / 200 427.0ms
[2m[36m(ServeReplica:AnswerHuggingFacePDFQuestions pid=51068)[0m INFO 2023-04-21 11:11:57,900 AnswerHuggingFacePDFQuestions AnswerHuggingFacePDFQuestions#GPVTIK replica.py:518 - HANDLE __call__ OK 417.0ms


Question: Why Chinese citizens are more optimistic about AI than Americans?
Answer: China is a country that has a high rate of adoption of AI.



[2m[36m(HTTPProxyActor pid=51065)[0m INFO 2023-04-21 11:11:58,338 http_proxy 127.0.0.1 http_proxy.py:373 - POST / 200 423.4ms
[2m[36m(ServeReplica:AnswerHuggingFacePDFQuestions pid=51069)[0m INFO 2023-04-21 11:11:58,333 AnswerHuggingFacePDFQuestions AnswerHuggingFacePDFQuestions#fswBYg replica.py:518 - HANDLE __call__ OK 417.1ms


In [28]:
serve.shutdown()

[2m[36m(ServeController pid=51063)[0m INFO 2023-04-21 11:12:06,620 controller 51063 deployment_state.py:1359 - Removing 2 replicas from deployment 'AnswerHuggingFacePDFQuestions'.
