# Description

*DocuBot is a specialized chatbot designed for efficiently retrieving case-specific information from a collection of documents. By leveraging a pre-trained **large language model**, users can query DocuBot using natural language and receive relevant information from the document set.*

*Behind the scenes, DocuBot generates **word embeddings** for input documents, organizes and stores them in a **PostgreSQL database**. When a user submits a query, DocuBot searches the database to contextually relevant responses.*

### Import required libraries

In [1]:
import psycopg2
import pgvector
import vertexai
import tiktoken
import numpy as np
from psycopg2 import pool
from loguru import logger
from itertools import chain
from pydantic import BaseModel
from psycopg2.extras import execute_values
from llama_index.core.schema import Document
from pgvector.psycopg2 import register_vector
from vertexai.language_models import ChatModel
from vertexai.language_models import TextEmbeddingModel
from llama_index.core.text_splitter import SentenceSplitter

# 1. Document Ingestion
   
- Preprocesses raw input, splitting it into logical segments.
- Generates text embeddings for each segment

## 1.1 Split Text to Chunks

In [2]:
def split_input_to_chunks(input_text: str) -> list[str]:
    """
    Split a sentence into chunks
    Input:
        text : Text to be split
    Output:
        chunks: Segments of text after splitting
    """

    # Parsing text with a preference for complete sentences
    text_splitter = SentenceSplitter(
        separator = " ",
        chunk_size = 300,
        chunk_overlap = 20,
        paragraph_separator = "\n\n",
        secondary_chunking_regex = "[^,.;。]+[,.;。]?",
        tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
    )
    
    txt_doc = Document(text = input_text)    
    # Split the text into chunks
    chunks = text_splitter([txt_doc])

    return [chunk.text for chunk in chunks]

## 1.2 Generate Text Embeddings

\- Using a pre-trained model by Vertex AI to generate embeddings for input chunks

In [3]:
class TextEmbedding(BaseModel):
    text : str
    embedding : list[float]

In [4]:
def text_embedding(text) -> list[float]:    
    """
    Generate embeddings for given text
    Input:
        text : Input text   
    Output:
        vector: Emdedding of the input text
    """
    
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
    embeddings = model.get_embeddings([text])
    
    for embedding in embeddings:
        vector = embedding.values             
        
    return vector

## 1.3 Generate text, embedding pair for all chunks in a given text

In [5]:
def get_text_embedding_pairs(text : str) -> list[TextEmbedding]:
    """
    Get all the chunks and corresponding embeddings for a given text
    Input:
        text : Text whose chunk and embedding is needed
    Output:
        chunk_embedding_pairs: chunk and embedding of given text
    """    
    
    chunks : list[str] = split_input_to_chunks(text) 
    chunk_embedding_pairs : list[TextEmbedding] = []
    logger.info(f'Number of chunks generated: {len(chunks)}')
    
    for curr_chunk in chunks:        
        curr_embedding = text_embedding(curr_chunk)
        chunk_embedding_pairs.append(TextEmbedding(text = curr_chunk, embedding = curr_embedding))    
    
    return chunk_embedding_pairs

# 2. Setting up the Cloud Database

- Implementing functions to create and manage a database table for storing text chunks and their embeddings.
- Utilizing connection pooling to efficiently manage database connections and reduce overhead.
- Providing methods for inserting data into the table (ingest) and retrieving data from the table (retrieve).

**Initializing Vertex AI environment**

In [6]:
vertexai.init(project = "inductive-world-416413")

**Initializing DB parameters**

In [7]:
DB_PARAMS = {
    'dbname' : "vectordb",
    'user' : "user",
    'password' : "pwd",
    'host' : "localhost",
    'port' : "5432"
}

**Setting up the Datastore**

In [8]:
class DataStore:
    
    DATABASE_SCHEMA = {
        "text_chunk" : "varchar",
        "embedding" : "vector(768)"
    }
    
    TABLE_NAME = "my_table"
    
    def __init__(self, db_params : dict = DB_PARAMS):
        self.db_params = db_params        
        self.conn_pool = self._get_connection_pool()        
        self._create_table()
    
    def _get_connection_pool(self):
        return psycopg2.pool.SimpleConnectionPool(1, 10, **self.db_params)

    def _create_table(self) -> None:
        col_defs = [f'{col_name} {col_type}' for col_name, col_type in self.DATABASE_SCHEMA.items()]        
        cols = ", ".join(col_defs)        
        table_creation_query = f"""
            CREATE EXTENSION IF NOT EXISTS vector;
            DROP TABLE IF EXISTS {self.TABLE_NAME};
            CREATE TABLE IF NOT EXISTS {self.TABLE_NAME} (
            id SERIAL PRIMARY KEY,
            {cols}
            );
            """   
        logger.info(table_creation_query)
        try:
            connection = self.conn_pool.getconn()
            with connection:
                with connection.cursor() as cursor:
                    cursor.execute(table_creation_query)
        except Exception as e:
            logger.error(f"Error in create table query: {e}")
            raise
        finally:
            self.conn_pool.putconn(connection)
    
    def ingest(self, text: str) -> None:
        text_embedding_pairs : list[TextChunk] = get_text_embedding_pairs(text)
        data_list = [(curr.text, curr.embedding) for curr in text_embedding_pairs]
        col_names = ",".join(list(self.DATABASE_SCHEMA.keys()))
        table_update_query = f"""
            INSERT INTO {self.TABLE_NAME} 
            ( {col_names} )
            VALUES %s
            """                    
        try:            
            connection = self.conn_pool.getconn()
            with connection:
                with connection.cursor() as cursor:
                    execute_values(cursor, table_update_query, data_list)
                    logger.info("Updated table with embedding pairs")
        except Exception as e:
            logger.error(f"Error in update table query : {e}")
            raise
        finally:
            self.conn_pool.putconn(connection)

    def retrieve(self, query: str) -> str:
        query_embedding : list[float] = text_embedding(query)
        retrieval_query = f"""
            SELECT text_chunk FROM {self.TABLE_NAME}
            ORDER BY embedding <-> %s LIMIT 5
            """
        retrieved_chunk = ""
        try:            
            connection = self.conn_pool.getconn()
            register_vector(connection)
            with connection:
                with connection.cursor() as cursor:
                    cursor.execute(retrieval_query, (np.array(query_embedding, dtype = np.float64), ))
                    retrieved_chunk = list(chain.from_iterable(cursor.fetchall()))
                    #logger.info(f"Retreived {len(retrieved_chunk)} chunk for the given embedding")            
        except Exception as e:
            logger.error(f"Error in retrieval query : {e}")
            raise  
        finally:
            self.conn_pool.putconn(connection)
        return retrieved_chunk

# 3. Ingest Documents to Database

\- Reading text from input pdf and ingesting it into the database

In [9]:
datastore = DataStore()

[32m2024-03-17 18:44:02.767[0m | [1mINFO    [0m | [36m__main__[0m:[36m_create_table[0m:[36m29[0m - [1m
            CREATE EXTENSION IF NOT EXISTS vector;
            DROP TABLE IF EXISTS my_table;
            CREATE TABLE IF NOT EXISTS my_table (
            id SERIAL PRIMARY KEY,
            text_chunk varchar, embedding vector(768)
            );
            [0m


In [10]:
from pypdf import PdfReader

def read_pdf(file_path):    
    text = ""
    with open(file_path, "rb") as file:
        reader = PdfReader(file)
        for page in reader.pages:            
            text += page.extract_text()
    
    return text

In [11]:
pdf_file_path = "/Users/aakanshadalmia/abc.pdf"
text = read_pdf(pdf_file_path)

In [12]:
text_ingested = datastore.ingest(text)

[32m2024-03-17 18:44:03.967[0m | [1mINFO    [0m | [36m__main__[0m:[36mget_text_embedding_pairs[0m:[36m12[0m - [1mNumber of chunks generated: 57[0m
[32m2024-03-17 18:47:33.959[0m | [1mINFO    [0m | [36m__main__[0m:[36mingest[0m:[36m55[0m - [1mUpdated table with embedding pairs[0m


# 4. Seting up the Chatbot 

## 4.1 Initialize LLM model

In [13]:
chat_model = ChatModel.from_pretrained("chat-bison@002")
parameters = {
    "candidate_count": 1,
    "max_output_tokens": 1024,
    "temperature": 0.9,
    "top_p": 1
}

## 4.2 Set up Model Prompt


In [14]:
prompt_template = "Refer to the following context to answer this query: {query}\n\nContext: {context}"

## 4.3 Set up Chat

- Use prompt template to send user query as input to the chatbot
- Store the chat history and update context to include this chat history
- Use updated context to send another input to chatbot and use this answer as final responser sent to user

In [15]:
chat_history = []

In [20]:
chat = chat_model.start_chat(
    context="""""",
)

while True:
    query = input("User Query: ")
    if query == "quit":
        break    
        
    # First API call to send the chat history and get the updated context
    model_input_1 = prompt_template.format(query="", context="\n".join(chat_history))
    response_1 = chat.send_message(model_input_1)
    updated_context = response_1.text.strip()
 
    # Second API call to send the query with the updated context            
    similar_chunks : list[str] = datastore.retrieve(query)
    updated_context += '\n'.join(similar_chunks)
    model_input_2 = prompt_template.format(query = query, context = updated_context)
    response_2 = chat.send_message(model_input_2)
    
    print(f"Model : {response_2.text.strip()}\n\n")

    chat_history.append(query)

User Query:  What is geographical erasure?


Model : **Geographical erasure:** 

Geographical erasure refers to the phenomenon where language models underpredict or overlook certain geographical regions or countries in their generated text. This can occur when the training data used to develop the language model is biased towards certain regions or when the model lacks sufficient information about underrepresented areas.

In the context of the provided text, geographical erasure is studied in the context of large language models (LLMs) and their tendency to capture information about dominant groups disproportionately. The paper investigates instances where LLMs underpredict the likelihood of certain countries appearing in generated text, despite those countries having significant English-speaking populations.

The text highlights the importance of addressing geographical erasure to ensure that language models provide a more balanced and inclusive representation of the world.




User Query:  Describe fairness measures for language generation


Model : **Fairness measures for language generation:** 

Fairness measures for language generation typically focus on identifying and mitigating biases in the generated text. These measures aim to ensure that the language models produce fair and unbiased outputs, free from discriminatory or harmful content.

Common fairness measures for language generation include:

1. **Demographic parity:** This measure assesses whether the generated text exhibits equal representation of different demographic groups, such as gender, race, or ethnicity. It ensures that the model does not favor or disfavor certain groups in its output.

2. **Equality of opportunity:** This measure evaluates whether the generated text provides equal opportunities for different demographic groups. It ensures that the model does not generate text that reinforces existing societal biases or stereotypes.

3. **Counterfactual fairness:** This measure assesses whether the generated text would remain fair if certain attributes

User Query:  What was the goal of this paper?


Model : **Goal of the paper:** 

The goal of the paper is to study and operationalize a form of geographical erasure in language models, where these models underpredict the likelihood of certain countries appearing in generated text. The paper aims to demonstrate the existence of geographical erasure across different language models and investigate its causes and potential mitigation strategies.




User Query:  What was the method used to do this?


Model : **Method used:** 

The paper employs the following methods to study and mitigate geographical erasure in language models:

1. **Erasure Measurement:** The paper defines a metric called "Erasure" (ER) to quantify the extent to which a language model underpredicts the likelihood of certain countries appearing in generated text. ER is calculated by comparing the model's predicted probabilities of countries with a ground truth distribution based on real-world population data.

2. **Prompt Rephrasing:** To obtain a more comprehensive understanding of the model's world knowledge, the paper uses a set of diverse prompt wordings that encode the meaning of "home country." These prompts are generated by paraphrasing a seed prompt using techniques such as prompting a language model (e.g., ChatGPT) and replacing sentence subjects.

3. **Mitigation Strategy:** The paper proposes a mitigation strategy to alleviate geographical erasure by employing supervised fine-tuning. This involves fine-t

User Query:  What conclusion was arrived at?


Model : **Conclusion:** 

The paper concludes that geographical erasure is a significant issue in large language models, leading to the underprediction of certain countries in generated text. The paper's analysis reveals that factors such as training data bias, lack of diversity in training data, model size, and sampling strategies contribute to geographical erasure.

The paper proposes a mitigation strategy based on supervised fine-tuning to address geographical erasure. The fine-tuning process aims to adjust the model's predictions to better align with the actual population distribution of countries. The evaluation results show that the fine-tuning strategy effectively reduces geographical erasure while improving the model's performance on a standard language modeling benchmark.

The paper highlights the importance of addressing geographical erasure to ensure fairness and inclusiveness in language generation systems. It




User Query:  quit
