# Description

*DocuBot is a specialized chatbot designed for efficiently retrieving case-specific information from a collection of documents. By leveraging a pre-trained **large language model**, users can query DocuBot using natural language and receive relevant information from the document set.*

*Behind the scenes, DocuBot generates **text embeddings** for the input documents, then organizes and stores them in a **PostgreSQL database**. When a user submits a query, DocuBot searches the database for contextually relevant text and uses it to generate a response.*

### Import required libraries

In [1]:
import psycopg2
import pgvector
import vertexai
import tiktoken
import numpy as np
from psycopg2 import pool
from loguru import logger
from itertools import chain
from pydantic import BaseModel
from psycopg2.extras import execute_values
from llama_index.core.schema import Document
from pgvector.psycopg2 import register_vector
from vertexai.language_models import ChatModel
from vertexai.language_models import TextEmbeddingModel
from llama_index.core.text_splitter import SentenceSplitter

# 1. Document Ingestion
   
- Preprocesses raw input, splitting it into logical segments.
- Generates text embeddings for each segment.

## 1.1 Split Text to Chunks

In [22]:
def split_input_to_chunks(input_text: str) -> list[str]:
    """
    Split input text into chunks
    Input:
        input_text : Text to be split
    Output:
        chunks: Segments of text after splitting
    """

    # Parsing text with a preference for complete sentences
    text_splitter = SentenceSplitter(
        separator = " ",
        chunk_size = 300,
        chunk_overlap = 20,
        paragraph_separator = "\n\n",
        secondary_chunking_regex = "[^,.;。]+[,.;。]?",
        tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
    )
    
    txt_doc = Document(text = input_text)    
    # Split the text into chunks
    chunks = text_splitter([txt_doc])

    return [chunk.text for chunk in chunks]
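
To illustrate what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of a sliding-window splitter. It is a deliberate simplification: it counts whitespace-separated words, whereas the `SentenceSplitter` configured above counts tokens and prefers sentence boundaries. The helper name `chunk_words` is hypothetical and not part of any library.

```python
def chunk_words(text: str, chunk_size: int = 300, chunk_overlap: int = 20) -> list[str]:
    """Split text into word-based chunks where consecutive chunks share `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Small values make the overlap easy to see
demo_chunks = chunk_words("a b c d e f g h i j", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

Each chunk repeats the last word of the previous one, which is the role the overlap plays: text that straddles a chunk boundary still appears intact in at least one chunk.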

## 1.2 Generate Text Embeddings

- Uses a pre-trained Vertex AI model to generate embeddings for the input chunks.

In [11]:
class TextEmbedding(BaseModel):
    text : str
    embedding : list[float]

In [12]:
def text_embedding(text: str) -> list[float]:
    """
    Generate embeddings for given text
    Input:
        text : Input text
    Output:
        vector: Embedding of the input text
    """

    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
    # get_embeddings takes a list of texts and returns one embedding per text
    embeddings = model.get_embeddings([text])

    # Only one text was sent, so the first result is the embedding we need
    return embeddings[0].values

## 1.3 Generate (text, embedding) pairs for all chunks in a given text

In [23]:
def get_text_embedding_pairs(text : str) -> list[TextEmbedding]:
    """
    Get all the chunks and corresponding embeddings for a given text
    Input:
        text : Text to be chunked and embedded
    Output:
        chunk_embedding_pairs: Chunks of the given text paired with their embeddings
    """
    
    chunks : list[str] = split_input_to_chunks(text) 
    chunk_embedding_pairs : list[TextEmbedding] = []
    logger.info(f'Number of chunks generated: {len(chunks)}')
    
    for curr_chunk in chunks:        
        curr_embedding = text_embedding(curr_chunk)
        chunk_embedding_pairs.append(TextEmbedding(text = curr_chunk, embedding = curr_embedding))    
    
    return chunk_embedding_pairs
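
The function above makes one embedding request per chunk. Since `get_embeddings` accepts a list of texts, several chunks could be sent per request instead. The sketch below shows one possible way to group chunks into batches. The names `batched`, `embed_in_batches`, and `fake_embed_batch` are hypothetical, and the default batch size of 5 is an assumption: the actual per-request limit depends on the model version and should be checked in the Vertex AI documentation.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    chunks: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 5,  # assumed limit, not confirmed by the notebook
) -> list[list[float]]:
    """Embed all chunks using one call to `embed_batch` per batch, preserving order."""
    vectors: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Stand-in embedder so the sketch runs without credentials:
# it "embeds" each text as a one-dimensional vector holding its length.
def fake_embed_batch(texts: list[str]) -> list[list[float]]:
    return [[float(len(t))] for t in texts]


demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(demo_vectors)
```

With Vertex AI, `embed_batch` could wrap the model call, for example `lambda texts: [e.values for e in model.get_embeddings(texts)]`, assuming `model` is the `TextEmbeddingModel` loaded in section 1.2.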

# 2. Setting up the Cloud Database

- Implementing functions to create and manage a database table for storing text chunks and their embeddings.
- Utilizing connection pooling to efficiently manage database connections and reduce overhead.
- Providing methods for inserting data into the table (ingest) and retrieving data from the table (retrieve).
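
The borrow-and-return pattern behind connection pooling can be sketched as a context manager. This is an illustration under stated assumptions rather than the implementation used in this notebook: `pooled_connection` and `FakePool` are hypothetical names, and `FakePool` only mimics the `getconn`/`putconn` methods of a `psycopg2` pool so the example runs without a database.

```python
from contextlib import contextmanager


@contextmanager
def pooled_connection(conn_pool):
    """Borrow a connection from the pool and always hand it back afterwards."""
    connection = conn_pool.getconn()
    try:
        yield connection
    finally:
        # Runs on success and on error, so connections are never leaked
        conn_pool.putconn(connection)


class FakePool:
    """Hypothetical stand-in exposing the same getconn/putconn interface as a psycopg2 pool."""

    def __init__(self):
        self.checked_out = 0

    def getconn(self):
        self.checked_out += 1
        return object()

    def putconn(self, connection):
        self.checked_out -= 1


fake_pool = FakePool()
with pooled_connection(fake_pool) as conn:
    inside = fake_pool.checked_out  # one connection is borrowed here
after = fake_pool.checked_out       # and returned once the block exits
print(inside, after)
```

Because `getconn` is called before the `try`, a failure to obtain a connection does not trigger a `putconn` call on a connection that was never assigned.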

**Initializing Vertex AI environment**

In [14]:
vertexai.init(project = "inductive-world-416413")

**Initializing DB parameters**

In [15]:
DB_PARAMS = {
    'dbname' : "vectordb",
    'user' : "user",
    'password' : "pwd",
    'host' : "localhost",
    'port' : "5432"
}

**Setting up the Datastore**

In [27]:
class DataStore:
    
    DATABASE_SCHEMA = {
        "text_chunk" : "varchar",
        "embedding" : "vector(768)"
    }
    
    TABLE_NAME = "my_table"
    
    def __init__(self, db_params : dict = DB_PARAMS):
        self.db_params = db_params        
        self.conn_pool = self._get_connection_pool()        
        self._create_table()
    
    def _get_connection_pool(self):
        return psycopg2.pool.SimpleConnectionPool(1, 10, **self.db_params)

    def _create_table(self) -> None:
        col_defs = [f'{col_name} {col_type}' for col_name, col_type in self.DATABASE_SCHEMA.items()]        
        cols = ", ".join(col_defs)        
        table_creation_query = f"""
            CREATE EXTENSION IF NOT EXISTS vector;
            DROP TABLE IF EXISTS {self.TABLE_NAME};
            CREATE TABLE IF NOT EXISTS {self.TABLE_NAME} (
            id SERIAL PRIMARY KEY,
            {cols}
            );
            """   
        logger.info(table_creation_query)
        try:
            connection = self.conn_pool.getconn()
            with connection:
                with connection.cursor() as cursor:
                    cursor.execute(table_creation_query)
        except Exception as e:
            logger.error(f"Error in create table query: {e}")
            raise
        finally:
            self.conn_pool.putconn(connection)
    
    def ingest(self, text: str) -> None:
        text_embedding_pairs : list[TextEmbedding] = get_text_embedding_pairs(text)
        data_list = [(curr.text, curr.embedding) for curr in text_embedding_pairs]
        print(data_list[0][0])
        col_names = ",".join(list(self.DATABASE_SCHEMA.keys()))
        table_update_query = f"""
            INSERT INTO {self.TABLE_NAME} 
            ( {col_names} )
            VALUES %s
            """                    
        try:            
            connection = self.conn_pool.getconn()
            with connection:
                with connection.cursor() as cursor:
                    execute_values(cursor, table_update_query, data_list)
                    logger.info("Updated table with embedding pairs")
        except Exception as e:
            logger.error(f"Error in update table query : {e}")
            raise
        finally:
            self.conn_pool.putconn(connection)

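    def retrieve(self, query: str) -> list[str]:
        query_embedding : list[float] = text_embedding(query)
        # <-> is pgvector's L2 (Euclidean) distance operator; LIMIT 1 keeps only the closest chunk
        retrieval_query = f"""
            SELECT text_chunk FROM {self.TABLE_NAME}
            ORDER BY embedding <-> %s LIMIT 1
            """
        retrieved_chunk : list[str] = []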
        try:            
            connection = self.conn_pool.getconn()
            register_vector(connection)
            with connection:
                with connection.cursor() as cursor:
                    cursor.execute(retrieval_query, (np.array(query_embedding, dtype = np.float64), ))
                    retrieved_chunk = list(chain.from_iterable(cursor.fetchall()))
                    logger.info(f"Retreived {len(retrieved_chunk)} chunk for the given embedding")            
        except Exception as e:
            logger.error(f"Error in retrieval query : {e}")
            raise  
        finally:
            self.conn_pool.putconn(connection)
        return retrieved_chunk
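
The retrieval query orders rows with pgvector's `<->` operator, which measures L2 (Euclidean) distance. pgvector also provides `<#>` (negative inner product) and `<=>` (cosine distance). The NumPy sketch below shows what each measure computes and how `ORDER BY ... LIMIT k` behaves on an in-memory list. The function names and the 2-dimensional toy vectors are hypothetical and only for illustration; real embeddings in this notebook have 768 dimensions.

```python
import numpy as np


def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance, the measure behind pgvector's <-> operator."""
    return float(np.linalg.norm(a - b))


def negative_inner_product(a: np.ndarray, b: np.ndarray) -> float:
    """Negative dot product, the measure behind pgvector's <#> operator."""
    return float(-np.dot(a, b))


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """One minus cosine similarity, the measure behind pgvector's <=> operator."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_k(query, vectors, texts, k=1, distance=l2_distance) -> list[str]:
    """Mimic ORDER BY <distance> LIMIT k: return the k texts closest to the query."""
    order = sorted(range(len(texts)), key=lambda i: distance(query, vectors[i]))
    return [texts[i] for i in order[:k]]


# Toy data: hypothetical 2-dimensional "embeddings"
toy_texts = ["John lives in America.", "I live in India."]
toy_vectors = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
toy_query = np.array([0.9, 0.1])

closest = top_k(toy_query, toy_vectors, toy_texts, k=1)
print(closest)
```

Raising `k` here corresponds to raising the `LIMIT` in the SQL query, which would hand the chatbot more than one chunk of context per question.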

# 3. Ingest Documents to Database

- Creates a `DataStore` instance and ingests two sample texts.

In [28]:
datastore = DataStore()

2024-03-15 02:25:26.186 | INFO     | __main__:_create_table:29 -
            CREATE EXTENSION IF NOT EXISTS vector;
            DROP TABLE IF EXISTS my_table;
            CREATE TABLE IF NOT EXISTS my_table (
            id SERIAL PRIMARY KEY,
            text_chunk varchar, embedding vector(768)
            );


In [29]:
input1 = "John lives in America. John has two kids. " * 100
input2 = "I am Aakansha and I live in India. " * 100

In [30]:
datastore.ingest(input1)

2024-03-15 02:25:27.880 | INFO     | __main__:get_text_embedding_pairs:12 - Number of chunks generated: 5
2024-03-15 02:25:46.883 | INFO     | __main__:ingest:56 - Updated table with embedding pairs


John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has two kids. John lives in America. John has tw

In [31]:
datastore.ingest(input2)

2024-03-15 02:25:46.888 | INFO     | __main__:get_text_embedding_pairs:12 - Number of chunks generated: 5
2024-03-15 02:26:05.733 | INFO     | __main__:ingest:56 - Updated table with embedding pairs


I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India. I am Aakansha and I live in India.


# 4. Setting up the Chatbot

## 4.1 Initialize LLM model

In [32]:
chat_model = ChatModel.from_pretrained("chat-bison@002")
parameters = {
    "candidate_count": 1,
    "max_output_tokens": 1024,
    "temperature": 0.9,
    "top_p": 1
}

In [33]:
template = "Refer to the following context to answer this query: {query}\n\nContext: {context}"
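
As a rough sketch of how this template is filled in, the helper below joins the retrieved chunks into a single context string and substitutes it together with the query. `build_prompt` and `TEMPLATE` are hypothetical names introduced for illustration; the template string itself is copied from the cell above.

```python
TEMPLATE = "Refer to the following context to answer this query: {query}\n\nContext: {context}"


def build_prompt(query: str, chunks: list[str], template: str = TEMPLATE) -> str:
    """Join the retrieved chunks with newlines and fill them into the template."""
    context = "\n".join(chunk.strip() for chunk in chunks)
    return template.format(query=query, context=context)


demo_prompt = build_prompt(
    "Who lives in America?",
    ["John lives in America. ", "John has two kids."],
)
print(demo_prompt)
```

Only the template is parsed for `{...}` placeholders, so braces that happen to appear inside a retrieved chunk are passed through unchanged.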

In [None]:
chat = chat_model.start_chat(context="")
while True:
    query = input("User Query: ")
    if query == "quit":
        break    
    similar_chunks : list[str] = datastore.retrieve(query)
    context : str = '\n'.join(similar_chunks)
    model_input = template.format(query = query, context = context)
    response = chat.send_message(model_input)
    print(f"Model : {response.text.strip()}")

User Query:  Who lives in america?


2024-03-15 02:27:32.738 | INFO     | __main__:retrieve:77 - Retreived 1 chunk for the given embedding


Model : John lives in America.


User Query:  Who lives in India?


2024-03-15 02:27:47.065 | INFO     | __main__:retrieve:77 - Retreived 1 chunk for the given embedding


Model : Aakansha lives in India.


User Query:  Does Aakansha have kids?


2024-03-15 02:28:07.854 | INFO     | __main__:retrieve:77 - Retreived 1 chunk for the given embedding


Model : The provided context does not mention whether Aakansha has kids or not.


User Query:  What can you tell me about Aakansha?


2024-03-15 02:28:25.725 | INFO     | __main__:retrieve:77 - Retreived 1 chunk for the given embedding


Model : Aakansha lives in India.


User Query:  Anything else?


2024-03-15 02:28:36.528 | INFO     | __main__:retrieve:77 - Retreived 1 chunk for the given embedding


Model : The provided context does not mention anything else.
