<a href="https://colab.research.google.com/github/ainulyaqinmhd/Local-Chabot-RAG/blob/main/Chatbot_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## RAG

### Indexing
1. **Load**: First, load the data. This notebook uses a CSV loader (`CSVLoader`).
2. **Split**: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won't fit in a model's finite context window.
3. **Store**: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and Embeddings model.

### Retrieval and generation
4. **Retrieve**: Given a user input, relevant splits are retrieved from storage using a Retriever.
5. **Generate**: A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.
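
The five steps above can be sketched end to end without any external service. This is only a toy illustration: the bag-of-words `embed` function stands in for a real embeddings model, the plain list `index` stands in for a vector store, the sample strings are made up, and the LLM call is left out. None of these names come from the notebook.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real embeddings model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1-2. Load and split: each made-up string is treated as one small chunk
chunks = [
    "GRESB assesses ESG performance of real estate companies.",
    "Text splitters break large documents into smaller chunks.",
    "A vector store indexes embeddings for similarity search.",
]

# 3. Store: keep (vector, chunk) pairs in a list acting as the index
index = [(embed(c), c) for c in chunks]

# 4. Retrieve: rank chunks by similarity to the question and keep the top k
def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

# 5. Generate: build the prompt an LLM would receive (the model call is omitted)
def build_prompt(question):
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What does a vector store do?"))
```

The rest of the notebook replaces each stand-in with a LangChain component.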

## Package Installation

In [None]:
!pip install langchain langchain_community langchain_chroma langchain-openai langchainhub gradio

Collecting langchain
  Downloading langchain-0.2.11-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain_community
  Downloading langchain_community-0.2.10-py3-none-any.whl.metadata (2.7 kB)
Collecting langchain_chroma
  Downloading langchain_chroma-0.1.2-py3-none-any.whl.metadata (1.3 kB)
Collecting langchain-openai
  Downloading langchain_openai-0.1.19-py3-none-any.whl.metadata (2.6 kB)
Collecting langchainhub
  Downloading langchainhub-0.1.20-py3-none-any.whl.metadata (659 bytes)
Collecting gradio
  Downloading gradio-4.39.0-py3-none-any.whl.metadata (15 kB)
Collecting langchain-core<0.3.0,>=0.2.23 (from langchain)
  Downloading langchain_core-0.2.24-py3-none-any.whl.metadata (6.2 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain)
  Downloading langchain_text_splitters-0.2.2-py3-none-any.whl.metadata (2.1 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.93-py3-none-any.whl.metadata (13 kB)
Collecting dataclasses-json<0.7,>=0

## Use OpenAI API Key

In [None]:
import getpass
import os

# Set the OpenAI API key for accessing the OpenAI services
os.environ["OPENAI_API_KEY"] = getpass.getpass()

from langchain_openai import ChatOpenAI

# Initialize the ChatOpenAI object with the GPT-4o model
llm = ChatOpenAI(model="gpt-4o")




## Load data from CSV

In [None]:
import os
import csv
from langchain_community.document_loaders.csv_loader import CSVLoader

# Define the directory containing the database files
database_folder = "Database"

# List all files in the database folder
files = [os.path.join(database_folder, f) for f in os.listdir(database_folder) if os.path.isfile(os.path.join(database_folder, f))]

all_docs = []

# Function to detect the structure of a CSV file
def detect_csv_structure(file_path):
    with open(file_path, newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)
    return header

# Load data from each CSV file in the folder
for file in files:
    if file.endswith(".csv"):
        header = detect_csv_structure(file)
        print(f"Detected columns in {file}: {header}")
        source_column = "URL" if "URL" in header else header[0]  # Adjust source column as needed
        loader = CSVLoader(
            file_path=file,
            source_column=source_column,
            csv_args={
                "delimiter": ",",
                "quotechar": '"',
                "fieldnames": header
            }
        )
        docs = loader.load()
        print(f"Loaded {len(docs)} documents from {file}")
        all_docs.extend(docs)

print(f"Total documents loaded: {len(all_docs)}")


Detected columns in Database/Database with Financial Data and Addresses.csv: ['\ufeffFund Name', 'Fund Manager Name', 'ISIN', 'Region', 'Sector', '2023 RE participants', 'Ticker_B', 'Ticker', '2019-01', '2019-02', '2019-03', '2019-04', '2019-05', '2019-06', '2019-07', '2019-08', '2019-09', '2019-10', '2019-11', '2019-12', '2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06', '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12', '2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06', '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12', '2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06', '2022-07', '2022-08', '2022-09', '2022-10', '2022-11', '2022-12', '2023-01', '2023-02', '2023-03', '2023-04', '2023-05', '2023-06', '2023-07', '2023-08', '2023-09', '2023-10', '2023-11', '2023-12', '2024-01', '2024-02', '2024-03', '2024-04', '2024-05', '2024-06', 'Address', 'Market']
Loaded 270 documents from Database/Database with Financial Da
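
The first detected column name in the output above starts with `\ufeff`, a UTF-8 byte order mark (BOM), and passing `fieldnames` to the reader makes the header row itself count as a data row (which matches the first split printed in the next section). The sketch below reproduces both effects with the standard `csv` module on a small made-up file and shows one possible way around them; the file contents and the helper name `read_rows` are assumptions for illustration.

```python
import csv
import os
import tempfile

# Write a small sample file that starts with a UTF-8 byte order mark (BOM)
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    f.write("Fund Name,Region\nFund A,Europe\n")

def read_rows(file_path, encoding, fieldnames=None):
    """Return the rows csv.DictReader yields for the given encoding and fieldnames."""
    with open(file_path, newline="", encoding=encoding) as csvfile:
        return list(csv.DictReader(csvfile, fieldnames=fieldnames))

# As in the cell above: plain utf-8 plus explicit fieldnames taken from the first row
with open(path, newline="", encoding="utf-8") as f:
    header = next(csv.reader(f))
rows_with_fieldnames = read_rows(path, "utf-8", fieldnames=header)

# Alternative: utf-8-sig strips the BOM, and omitting fieldnames lets
# DictReader consume the first row as the header instead of as data
rows_clean = read_rows(path, "utf-8-sig")

print(repr(header[0]))            # first column name, BOM included
print(len(rows_with_fieldnames))  # header row is counted as a data row
print(rows_clean)
```

`CSVLoader` accepts an `encoding` argument, so passing `encoding="utf-8-sig"` and dropping `"fieldnames"` from `csv_args` should have the same effect there, though that change has not been run against this notebook's data.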

## Split data to chunks

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the text splitter with specific chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(all_docs)

print(f"Total splits created: {len(all_splits)}")
print(all_splits[0])


Total splits created: 4204
page_content='﻿Fund Name: ﻿Fund Name
Fund Manager Name: Fund Manager Name
ISIN: ISIN
Region: Region
Sector: Sector
2023 RE participants: 2023 RE participants
Ticker_B: Ticker_B
Ticker: Ticker
2019-01: 2019-01
2019-02: 2019-02
2019-03: 2019-03
2019-04: 2019-04
2019-05: 2019-05
2019-06: 2019-06
2019-07: 2019-07
2019-08: 2019-08
2019-09: 2019-09
2019-10: 2019-10
2019-11: 2019-11
2019-12: 2019-12
2020-01: 2020-01
2020-02: 2020-02
2020-03: 2020-03
2020-04: 2020-04
2020-05: 2020-05
2020-06: 2020-06
2020-07: 2020-07
2020-08: 2020-08
2020-09: 2020-09
2020-10: 2020-10
2020-11: 2020-11
2020-12: 2020-12
2021-01: 2021-01
2021-02: 2021-02
2021-03: 2021-03
2021-04: 2021-04
2021-05: 2021-05
2021-06: 2021-06
2021-07: 2021-07
2021-08: 2021-08
2021-09: 2021-09
2021-10: 2021-10
2021-11: 2021-11
2021-12: 2021-12
2022-01: 2022-01
2022-02: 2022-02
2022-03: 2022-03
2022-04: 2022-04
2022-05: 2022-05
2022-06: 2022-06
2022-07: 2022-07
2022-08: 2022-08
2022-09: 2022-09
2022-10: 2022-10
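
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as newlines); it only shows how consecutive chunks share characters and how a start index can be tracked. The function name `sliding_chunks` is made up for this example.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        # Keep the start offset alongside the chunk, similar in spirit to add_start_index=True
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

for start, chunk in sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4):
    print(start, chunk)
```

With a window of 10 and an overlap of 4, each chunk starts 6 characters after the previous one, so the last 4 characters of one chunk reappear at the start of the next.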

## Store document and embedding to vector database

In [None]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
import time

# Initialize the embeddings
embedding = OpenAIEmbeddings(model='text-embedding-3-small')

# Initialize the vector store for storing document embeddings
vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embedding,
    persist_directory="./chroma_db"
)

batch_size = 10

# Add document splits to the vector store in batches
for i in range(0, len(all_splits), batch_size):
    batch = all_splits[i: i + batch_size]
    vectorstore.add_documents(batch)
    time.sleep(1)
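
The batching loop above can be factored into a small reusable helper. This is a sketch under assumptions: `batched` and `add_in_batches` are hypothetical names, and `add_fn` stands in for a call such as `vectorstore.add_documents`.

```python
import time

def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def add_in_batches(items, add_fn, batch_size=10, pause_seconds=0.0):
    """Call add_fn once per batch, pausing between calls; return the number of batches sent."""
    count = 0
    for batch in batched(items, batch_size):
        add_fn(batch)
        count += 1
        time.sleep(pause_seconds)  # crude pacing to stay under API rate limits
    return count

# Demo with a list standing in for the vector store
received = []
n = add_in_batches(list(range(25)), received.append, batch_size=10)
print(n, [len(b) for b in received])
```

For the 4204 splits in this notebook, a batch size of 10 means 421 calls, so a one-second pause per batch adds about seven minutes of waiting on top of the embedding requests themselves.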


## Similarity search

In [None]:
from scipy import spatial

# Testing question
question = "What is global investor expand engagement?"

# Create a retriever object to search for similar documents
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})

# Retrieve documents relevant to the question
retrieved_docs = retriever.invoke(question)

print(f"Total documents retrieved: {len(retrieved_docs)}")
# print(retrieved_docs[0])

# Calculate and print similarity scores
question_embedding = embedding.embed_query(question)  # Embed the question
# Embed all retrieved documents in one request instead of one request per document
doc_embeddings = embedding.embed_documents([doc.page_content for doc in retrieved_docs])
for i, (doc, doc_embedding) in enumerate(zip(retrieved_docs, doc_embeddings), start=1):
    similarity = 1 - spatial.distance.cosine(question_embedding, doc_embedding)  # Cosine similarity
    print(f"Document {i} (Similarity: {similarity:.4f}):\n{doc.page_content}\n")

Total documents retrieved: 5
Document 1 (Similarity: 0.5678):
Content: Amsterdam, May 22, 2024 – Global investors are expanding their coordinated efforts in the Asia Pacific region to engage with the real asset industry on material ESG disclosures and net-zero strategies. Building on the success of past years, this year’s collaborative engagement campaign extends its focus to include infrastructure companies for the first time, alongside listed real estate companies and REITs. This year, GRESB collaborated with 16 global investors with more than USD 5 trillion in assets under management to engage a select group of listed real estate and infrastructure companies across the APAC region. Through active stewardship and engagement, investors encourage real asset companies to participate in GRESB. This allows them to assess alignment with best practices on environmental, social, and governance (ESG) issues alongside the GRESB Standards, enabling them to identify leaders while encouraging imp
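
The similarity score printed above is a cosine similarity: the dot product of the two embedding vectors divided by the product of their lengths. A dependency-free sketch of the same calculation, using short made-up vectors rather than real embeddings, might look like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))  # in between
```

Values close to 1 mean the two texts point in nearly the same direction in embedding space, and values near 0 mean they are unrelated.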

## Reply generation

In [None]:
from langchain import hub

# Load a predefined prompt from the LangChain Hub
prompt = hub.pull("rlm/rag-prompt")

# Render the prompt with placeholder strings to inspect its structure
# (this prompt only uses the "context" and "question" variables)
example_messages = prompt.invoke(
    {"context": "{retrieved_docs}", "question": "{your_question}"}
).to_messages()

print(example_messages[0].content)


You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {your_question} 
Context: {retrieved_docs} 
Answer:


## Customize the prompt

In [None]:
from langchain_core.prompts import PromptTemplate

# Define a custom prompt template
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer and followed by the reference contains URL source (if it doesn't contain URL don't put any reference).

{context}

Question: {question}

Helpful Answer:

Reference:
- {source_column}"""

# Create a prompt object from the template
prompt = PromptTemplate.from_template(template)
example_messages = prompt.invoke(
    {
        "context": retrieved_docs[0].page_content,
        "question": question,
        # "source_file": files[0],  # Replace with actual source file
        "source_column": 'URL'  # Replace with actual source column
    }
).to_messages()

print(example_messages[0].content)


Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer and followed by the reference contains URL source (if it doesn't contain URL don't put any reference).

Content: Amsterdam, May 22, 2024 – Global investors are expanding their coordinated efforts in the Asia Pacific region to engage with the real asset industry on material ESG disclosures and net-zero strategies. Building on the success of past years, this year’s collaborative engagement campaign extends its focus to include infrastructure companies for the first time, alongside listed real estate companies and REITs. This year, GRESB collaborated with 16 global investors with more than USD 5 trillion in assets under management to engage a select group of listed real estate and infrastructure compan
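
A prompt template is essentially a string with named placeholders that are filled in at call time. The sketch below shows the idea with Python's built-in string formatting; it is an illustration, not `PromptTemplate`'s implementation, and the shortened template and the helper names `template_variables` and `render` are made up for this example.

```python
from string import Formatter

# Shortened, hypothetical version of the custom template above
template = "Context: {context}\n\nQuestion: {question}\n\nReference:\n- {source_column}"

def template_variables(tmpl):
    """Return the placeholder names in a str.format-style template, in order of first appearance."""
    names = []
    for _, field_name, _, _ in Formatter().parse(tmpl):
        if field_name and field_name not in names:
            names.append(field_name)
    return names

def render(tmpl, **values):
    """Fill in the template, failing early if any placeholder has no value."""
    missing = [name for name in template_variables(tmpl) if name not in values]
    if missing:
        raise KeyError(f"Missing template variables: {missing}")
    return tmpl.format(**values)

print(template_variables(template))
print(render(template, context="Some retrieved text.", question="What is it?", source_column="URL"))
```

Listing the variables before rendering makes it easy to spot a mismatch between the template and the dictionary passed to it.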

## LCEL (LangChain Expression Language)

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Function to format documents for the prompt
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Create a RAG (Retrieval-Augmented Generation) chain
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough(),
        # "source_file": lambda x: files[0],  # Provide source_file
        "source_column": lambda x: "URL"  # Provide source_column
    }
    | prompt
    | llm
    | StrOutputParser()
)

# Stream the response from the RAG chain for the given question
for chunk in rag_chain.stream(question):
    print(chunk, end="", flush=True)


Global investor expand engagement refers to the coordinated efforts by global investors to improve ESG disclosures and net-zero strategies in the Asia Pacific region, now including infrastructure companies alongside listed real estate companies and REITs. This initiative, supported by 16 global investors with over USD 5 trillion in assets, aims to encourage real asset companies to participate in GRESB for better alignment with ESG best practices.

Thanks for asking!

Reference: [Global investors expand engagement on ESG and GRESB participation in APAC region to include infrastructure companies](https://www.gresb.com/nl-en/insights/global-investors-expand-engagement-on-esg-and-gresb-participation-in-apac-region-to-include-infrastructure-companies/)
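
The chain above relies on two ideas: the `|` operator feeds the output of one step into the next, and a dictionary of steps runs every value on the same input. The toy classes below imitate that behavior in plain Python to show the data flow; they are not LangChain's implementation, and `Step`, `fan_out`, and `fake_retriever` are hypothetical names.

```python
class Step:
    """Wrap a function so that steps can be chained with the | operator."""
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, value):
        return self.fn(value)

    def __or__(self, other):
        # The result is a new step: run this step, then pass its output to the next callable
        return Step(lambda value: other(self(value)))

def fan_out(mapping):
    """Run every callable in mapping on the same input and collect the results in a dict."""
    return Step(lambda value: {key: fn(value) for key, fn in mapping.items()})

def fake_retriever(question):
    # Stand-in for a retriever: always returns the same two "documents"
    return ["doc one", "doc two"]

format_docs = "\n\n".join

chain = (
    fan_out({
        "context": Step(fake_retriever) | format_docs,  # retrieve, then join into one string
        "question": lambda q: q,                        # pass the question through unchanged
    })
    | Step(lambda d: f"{d['context']}\n\nQuestion: {d['question']}")  # stand-in for the prompt
    | str.upper                                                       # stand-in for the model
)

print(chain("hi?"))
```

Here the final step just upper-cases the prompt text, which makes it easy to see that the question reached both the retriever branch and the pass-through branch.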

## Gradio interface setup

In [None]:
import gradio as gr
import datetime
import csv
import os

# Function to log interactions to a CSV file.
# With vote_message set, the "Satisfaction" cell of the most recent row is updated;
# otherwise a new row is appended.
def log_interaction_csv(user_message, bot_message, vote_message=None, log_file="chat_log.csv"):
    header = ["Timestamp", "Query", "Answer", "Satisfaction"]
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    try:
        # Read existing rows first so that reads and writes never share one file handle
        rows = []
        if os.path.isfile(log_file):
            with open(log_file, newline="", encoding="utf-8") as file:
                rows = list(csv.reader(file))
        if not rows:
            rows = [header]

        if vote_message:
            # Attach the vote to the most recent logged interaction, if there is one
            if len(rows) > 1:
                rows[-1][-1] = vote_message
        else:
            rows.append([timestamp, user_message, bot_message, ""])

        # Rewrite the whole log; simple and safe for a small chat log
        with open(log_file, "w", newline="", encoding="utf-8") as file:
            csv.writer(file).writerows(rows)

    except Exception as e:
        print(f"Error writing to file: {e}")

# Function to generate a chatbot response
def chatbot_response(question):
    response = ""
    for chunk in rag_chain.stream(question):
        response += chunk
    return response

# Function to handle user votes on responses
def vote(data: gr.LikeData):
    vote_message = "Liked" if data.liked else "Disliked"
    log_interaction_csv("", "", vote_message)
    print(vote_message)

# Gradio interface setup
with gr.Blocks() as demo:
    gr.Markdown("## TABC - ChatBot V.0.1\nThe chatbot (V.0.1) database now contains detailed information about the GRESB foundation, its impact, and its sustainability focus.")
    chatbot = gr.Chatbot(label="Ask me anything!")

    with gr.Row():
        txt = gr.Textbox(show_label=False, placeholder="Enter your question here...")
        submit_btn = gr.Button("Send")
        retry_btn = gr.Button("Regenerate") # Add a retry button

        # Function to handle user message input
        def user_message(message, history):
            history.append((message, None))
            return history, ""

        # Function to handle bot response
        def bot_response(history):
            user_message = history[-1][0]
            bot_message = chatbot_response(user_message)
            history[-1] = (user_message, bot_message)

            # Log the interaction
            log_interaction_csv(user_message, bot_message)

            return history

        # Function to handle retry
        def retry(history):
            if history:
                last_question = history[-1][0]  # Get the last question
                history.pop() # Remove the last interaction
                # Re-run the last question by triggering user_message and bot_response
                history, _ = user_message(last_question, history)
                history = bot_response(history)
            return history, ""

    # Handle the submit, click, and retry events
    txt.submit(user_message, [txt, chatbot], [chatbot, txt], queue=False).then(
        bot_response, chatbot, chatbot
    )
    submit_btn.click(user_message, [txt, chatbot], [chatbot, txt], queue=False).then(
        bot_response, chatbot, chatbot
    )
    retry_btn.click(retry, chatbot, [chatbot, txt])
    # Add the voting functionality to the chatbot
    chatbot.like(vote, None, None)

# Launch the Gradio interface
demo.launch()


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://6a3dcc86e2dd63b325.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




In [None]:
# List of examples
"""
1. What is global investor expand engagement?
2. Give me some information about ticker named "REG"
3. How did [Fund Name X] perform in the first quarter of 2020? (Replace [Fund Name X] with an actual fund from your data)
4. Which fund in the [Region Y] region showed the strongest growth in 2021?
5. What was the average performance of [Sector Z] funds in the second half of 2022?

"""

