# Purpose

The first step in developing the RAG-based QA system is to prepare the data by extracting relevant content from the provided PDF. The document includes a mix of text, charts, and tables, which requires careful extraction and preprocessing to ensure that the information is usable for later stages of the system.

Accurate data extraction is crucial as it directly affects the quality of the information retrieval and answer generation processes. Poor extraction can lead to incomplete or noisy data, which can degrade the system's overall performance.

In [8]:
!pip install pdfplumber pdf2image openai pdfplumber PyMuPDF pdfminer layoutparser -qqq

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.0/42.0 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.2/4.2 MB[0m [31m82.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.5/48.5 kB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.2/42.2 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.2/59.2 kB[0m [31m5.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.6/5.6 MB[0m [31m106.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.5/3.5 MB[0m [31m86.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [9]:
# Load packages

import pdfplumber
from pdf2image import convert_from_path
import pandas as pd
from PIL import Image
import re
import openai


In [10]:
pdf_path= './Investment Case For Disruptive Innovation.pdf'


# 1. Text Extraction

## 1.1 PDF Extraction Python Packages

Use PDF extraction tools to extract the text content. Compare Python packages' performance and choose the one that best preserves the structure and quality of the text.

### 1.1.1 PDFPlumber

In [11]:
import pdfplumber

def extract_text_from_pdf(pdf_path, page_num):
    text = ""

    with pdfplumber.open(pdf_path) as pdf:
        if page_num > 0 and page_num <= len(pdf.pages):
            page = pdf.pages[page_num - 1]
            page_text = page.extract_text()
            text += page_text
        else:
            print(f"Invalid page number: {page_num}. The PDF has {len(pdf.pages)} pages.")
    return text

pdf_text = extract_text_from_pdf(pdf_path, 2)
print(pdf_text)


• 2
DISCLOSURE
Risks of Investing in Innovation
Please note: Companies that ARK believes are capitalizing on disruptive innovation and developing technologies to displace older technologies or create new markets
may not in fact do so. ARK aims to educate investors and seeks to size the potential investment opportunity, noting that risks and uncertainties may impact our
projections and research models. Investors should use the content presented for informational purposes only, and be aware of market risk, disruptive innovation risk,
regulatory risk, and risks related to certain innovation areas.
Please read risk disclosure carefully.
RISK OF INVESTING IN INNOVATION
RAPID PACE OF CHANGE REGULATORY HURDLES
EXPOSURE ACROSS SECTORS AND MARKET CAP DISRUPTIVE POLITICAL OR LEGAL PRESSURE
INNOVATION
UNCERTAINTY AND UNKNOWNS COMPETITIVE LANDSCAPE
à Aim for a cross-sector understanding of technology à Aim to understand the regulatory, market, sector,
and combine top-down and bottom-up research. a

#### Analysis


PDFPlumber often introduces unnecessary line breaks during text extraction. This can combine different pieces of content together, disrupting the logical flow and structure of the text. For example:

PDFPlumber result:
EXPOSURE ACROSS SECTORS AND MARKET CAP DISRUPTIVE POLITICAL OR LEGAL PRESSURE

![image.png](attachment:849c9439-c0bd-494b-a2ce-87d54fd28fa6.png)

### 1.1.2.Fitz

In [12]:
import fitz  # PyMuPDF

doc = fitz.open(pdf_path)
page_num = 11
page = doc.load_page(page_num)
text = page.get_text("text")
print(f"Page {page_num + 1} Text:\n{text[1200:]}\n")






Page 12 Text:

25%
30%
Legacy
Precision
Probability Of Clinical Success
Clinical Success Probability 
Innovative Trial Design
+ Single-Cell Biology
+ Proteomic Techniques
+ Virtual Compound Libraries
+ Biomarker Development
+ Humanized animal models
+ Automated Liquid Handling
+ Automated Invivomics
+ Automated Microsynthesis
+ CRISPR “Perturb-Seq” Screens
+ Organ-on-a-chip Technology
Artificial Intelligence
Automation
Fundamental Biology
+ AI-Enabled Pathway Analysis
+ AI-Enabled Toxicity Prediction
+ In-Silico Molecular Modeling
+ ML-Driven Compound Screens
+ Adaptive Clinical Trial Design
+ Precision Biomarkers
+ Decentralized/Virtual Trials
Efficiency Innovations
-48%
2.1x
WH Y IN VES T  IN  D IS RU P T IVE IN N O VA T IO N ?




#### Analysis

PyMuPDF fitz extract content out of order, particularly for tables and lists. The tool doesn't always maintain the original layout, leading to misaligned rows and columns, which can distort the data's meaning.

Example:

Probability Of Clinical Success

Clinical Success Probability

Innovative Trial Design

+ Single-Cell Biology
+ Proteomic Techniques
+ Virtual Compound Libraries
+ Biomarker Development
+ Humanized animal models
+ Automated Liquid Handling
+ Automated Invivomics
+ Automated Microsynthesis
+ CRISPR “Perturb-Seq” Screens
+ Organ-on-a-chip Technology

Artificial Intelligence

But the original content should be:

![image.png](attachment:2751bba6-dc0f-4dc0-9d65-9996d6aa8b93.png)

### 1.1.3.pdfminer

In [13]:
from pdfminer.high_level import extract_text

text = extract_text(pdf_path)
print(text)


1

•

Why Invest In 

Disruptive Innovation?

Sources: ARK Investment Management LLC, 2024. Forecasts are inherently limited and cannot be relied upon. For informational purposes only and should not be considered investment advice or a recommendation to buy, 
sell, or hold any particular security. Past performance is not indicative of future results.

As of June 30, 2024

•

D I S C L O S U R E

2

Risks of Investing in Innovation

Please note: Companies that ARK believes are capitalizing on disruptive innovation and developing technologies to displace older technologies or create new markets 

may not in fact do so. ARK aims to educate investors and seeks to size the potential investment opportunity, noting that risks and uncertainties may impact our 

projections and research models. Investors should use the content presented for informational purposes only, and be aware of market risk, disruptive innovation risk, 

regulatory risk, and risks related to certain innovation areas. 

P

#### Analysis


PDFMiner preserves the accuracy and order of extracted text, making it a reliable choice for documents with straightforward text layouts.

Issue: PDFMiner is not designed to interpret the content of charts or visual elements. It may attempt to extract text from these areas, resulting in disorganized and nonsensical output.

## 1.2 Post-processing and Noise Removal

Implement a post-processing step to clean up the extracted text, removing unnecessary line breaks, misaligned data, or garbled text from charts

In [14]:
import re

def clean_noise(text):
    text = re.sub(r'^[\s]*[\d\W]+[\s]*$', '', text, flags=re.MULTILINE)

    text = re.sub(r'^\s*[\d\W]*[A-Za-z]{0,2}[\d\W]*\s*$', '', text, flags=re.MULTILINE)

    text = re.sub(r'^\s*\S\s*$', '', text, flags=re.MULTILINE)

    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)

    text = re.sub(r'\n\s*\n', '\n\n', text)

    text = re.sub(r'\n+', '\n', text)

    return text.strip()



In [15]:
clean_text = clean_noise(text)
print(clean_text)

Why Invest In 
Disruptive Innovation?
Sources: ARK Investment Management LLC, 2024. Forecasts are inherently limited and cannot be relied upon. For informational purposes only and should not be considered investment advice or a recommendation to buy,  sell, or hold any particular security. Past performance is not indicative of future results.
As of June 30, 2024
D I S C L O S U R E
Risks of Investing in Innovation
Please note: Companies that ARK believes are capitalizing on disruptive innovation and developing technologies to displace older technologies or create new markets 
may not in fact do so. ARK aims to educate investors and seeks to size the potential investment opportunity, noting that risks and uncertainties may impact our 
projections and research models. Investors should use the content presented for informational purposes only, and be aware of market risk, disruptive innovation risk, 
regulatory risk, and risks related to certain innovation areas. 
Please read risk disclos

## 1.3 Chart and Table Handing

In [11]:
import layoutparser as lp
from PIL import Image
import matplotlib.pyplot as plt


In [None]:
import pdfplumber

with pdfplumber.open(pdf_path) as pdf:
    page = pdf.pages[12]
    image = page.to_image()

image.save('page_image.jpg')


In [None]:
# # # Load the pre-trained model
# # model = lp.Detectron2LayoutModel(
# #     config_path='lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
# #     model_path='lp://PubLayNet/faster_rcnn_R_50_FPN_3x/model',
# #     label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
# # )

# # # Load the image
# # image = Image.open('page_image.jpg')

# # # Detect the layout of the page
# # layout = model.detect(image)

# # # Filter to get only the figures (charts)
# # charts = lp.Layout([b for b in layout if b.type == 'Figure'])

# # # Visualize the result
# # lp.draw_box(image, charts, box_width=3, box_color="red", show_element_type=True)
# # plt.show()
# model = lp.Detectron2LayoutModel('lp://HJDataset/faster_rcnn_R_50_FPN_3x/config')


## 1.4 Load to CSV

In [16]:
def extract_and_clean_page(pdf_path, page_number):
    text = extract_text(pdf_path, page_numbers=[page_number])
    cleaned_text = clean_noise(text)
    return cleaned_text


In [17]:
def save_to_dataframe(pdf_path):
    data = {'pdf_name': [], 'page': [], 'content': []}

    total_pages = len(extract_text(pdf_path).split('\f')) - 1
    for page_number in range(1, total_pages + 1):
        cleaned_text = extract_and_clean_page(pdf_path, page_number)
        data['pdf_name'].append(pdf_path.split('/')[-1])
        data['page'].append(page_number+1)
        data['content'].append(cleaned_text)

    df = pd.DataFrame(data)
    return df



In [18]:
df = save_to_dataframe(pdf_path)

In [19]:
df.head()

Unnamed: 0,pdf_name,page,content
0,Investment Case For Disruptive Innovation.pdf,2,D I S C L O S U R E\nRisks of Investing in Inn...
1,Investment Case For Disruptive Innovation.pdf,3,Five Innovation Platforms Are Converging And...
2,Investment Case For Disruptive Innovation.pdf,4,Converging Technologies Are Generating A Histo...
3,Investment Case For Disruptive Innovation.pdf,5,AI Is Accelerating Faster Than Forecasters Ant...
4,Investment Case For Disruptive Innovation.pdf,6,ChatGPT Delighted Consumers And Amazed Enterpr...


In [20]:
df.to_csv('./dataset/Investment_content.csv', index=False)

In [28]:
# Prepare the Questions

import csv
import re

def read_and_clean_questions(file_path):
    with open(file_path, 'r') as file:
        questions = file.readlines()

    cleaned_questions = []
    for question in questions:
        question = question.strip()
        question = re.sub(r'^Q\d{2}\.\s*', '', question)
        if question:
            cleaned_questions.append(question)

    return cleaned_questions

def save_questions_to_csv(questions, csv_file_path):
    with open(csv_file_path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Question'])
        for question in questions:
            writer.writerow([question])

file_path = './Evaluation_Questions.txt'
csv_file_path = './dataset/cleaned_questions.csv'

questions = read_and_clean_questions(file_path)
save_questions_to_csv(questions, csv_file_path)

print(f"Questions have been saved to {csv_file_path}")


Questions have been saved to cleaned_questions.csv


# 2.Vector_DB

In [2]:
!pip install sentence_transformers llama_index langchain -U langchain-community chromadb -qqq

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/67.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.4/50.4 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m227.1/227.1 kB[0m [31m18.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m52.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.3/2.3 MB[0m [31m83.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m584.3/584.3 kB[0m [31m38.6 MB/s[0m eta [36m0:00:

In [21]:
import pandas as pd
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document
import os


In [22]:

def chunk_text_with_overlap(text, max_length=400, overlap_percent=0.1):
    if pd.isna(text):  # Handle NaN values
        return []

    sentences = text.split('. ')
    chunks = []
    current_chunk = []
    overlap_size = int(max_length * overlap_percent)

    current_length = 0
    for sentence in sentences:
        sentence_length = len(sentence.split())
        if current_length + sentence_length > max_length:
            chunks.append(' '.join(current_chunk))
            overlap_chunk = current_chunk[-overlap_size:]  # Keep the overlap
            current_chunk = overlap_chunk + [sentence]
            current_length = len(' '.join(overlap_chunk).split()) + sentence_length
        else:
            current_chunk.append(sentence)
            current_length += sentence_length

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

# Read the CSV file
df = pd.read_csv('./dataset/Investment_content.csv')

# Process documents with chunking and overlap
documents = []
for index, row in df.iterrows():
    chunks = chunk_text_with_overlap(row['content'])
    for chunk in chunks:
        documents.append(Document(page_content=chunk, metadata={'source': f"PDF source: {row['pdf_name']}. Source from {row['page']} pages."}))

# Embedding and storing in Chroma vector store
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectordb = Chroma.from_documents(documents, embedding_function, persist_directory="./chroma_db")
vectordb.persist()

print("Vector store created and saved locally.")


  embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Vector store created and saved locally.


  vectordb.persist()


In [23]:
# Load the vector store for similarity search
vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)

def similarity_search(query, k=3):
    results = vectordb.similarity_search(query, k=k)
    return results



  vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)


In [24]:
# Test the similarity search
query = "What is the core objective of investing in disruptive innovation according to ARK?"
results = similarity_search(query)

results

[Document(metadata={'source': 'PDF source: Investment Case For Disruptive Innovation.pdf. Source from 17 pages.'}, page_content='ARK Seeks to Capture Disruptive Innovation\nThe ARK Innovation ETF (ARKK) Aims to Offer\n1 Access to Growth  Investors who seek to access companies at the forefront of technology-enabled innovation, in some of the most  promising areas of the economy, with potential for long-term growth.\n2 Portfolio Diversification Potentially Suited for investors who like to diversify their existing portfolio with strategies that offer low correlation to a  number of core asset classes held in most investors’ portfolios.\n3 Moderate-to-High Risk-Reward Profile A constant focus on secular changes and disruptive innovation can compliment traditional strategies and core portfolios  May be suited for investors who have a moderate-to-high risk profile and intend to stay invested for the medium-to-long  term.\nThe information herein is general in nature and should not be consider

# 1.3 RAG_LLM

In [3]:
import pandas as pd
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import time
import torch


In [4]:
!nvidia-smi

Mon Sep  2 19:11:31 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [25]:
# Load chromadb and csv
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)

# Function to perform similarity search
def similarity_search(query, k=5):
    results = vectordb.similarity_search(query, k=k)
    return results




In [5]:
from huggingface_hub import login
login()


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [43]:
# device = "cuda" if torch.cuda.is_available() else "cpu"
device = "cpu"

In [46]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id =  "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32).to(device)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

## Basic llama3

In [None]:
import torch

# Function to generate answers using the basic LLM
def generate_basic_llm_answer(question):
    prompt = (
        f"You are an expert in financial analysis with a deep understanding of investment strategies and ETF funds."
        f"Please provide a detailed and accurate answer to the following question based on your expertise.\n\n"
        f"Question: {question}\n"
        f"Answer:"
    )

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    output = model.generate(input_ids=input_ids, max_new_tokens=500, pad_token_id=tokenizer.pad_token_id, temperature=0.01)

    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

    answer = generated_text.split("Answer:")[1].strip() if "Answer:" in generated_text else generated_text.strip()

    return answer

def log_process_time(start_time):
    end_time = time.time()
    return end_time - start_time

def process_csv(df):

    answers = []
    times = []

    for question in df['Question']:
        start_time = time.time()
        answer = generate_basic_llm_answer(question)
        process_time = log_process_time(start_time)

        answers.append(answer)
        times.append(process_time)

    df['basic_llm'] = answers
    df['basic_time'] = times

    return df

ARK_question = pd.read_csv('./dataset/cleaned_questions.csv')
df_process = process_csv(ARK_question)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


## LLama3 RAG

In [None]:
# LLama3 RAG


# generate answers using RAG
def generate_rag_answer(question, k=3):
    results = similarity_search(question, k)

    context = "\n\n".join([f"Source {i+1}: {result.page_content}" for i, result in enumerate(results)])

    prompt = (
          f"You are an expert in financial analysis, specifically focused on investment strategies and ETF funds. You will be provided with context from the ARK Investment Case for Disruptive Innovation.\n"
          f"Use only the provided information to answer the question accurately and concisely. Do not use any external knowledge."
          f"Base your answer strictly on the context provided.\n\n"
          f"Please follow these instructions carefully to provide an accurate answer:\n"
          f"1. Carefully read each paragraph of the provided content.\n"
          f"2. Identify if the paragraph contains relevant information to answer the question. If a paragraph does not provide relevant information, move to the next paragraph.\n"
          f"3. When you find relevant information, use it to construct your answer. Include as much evidence as possible from the context to support your answer, even if an answer has already been started.\n"
          f"4. Ensure your answer is accurate, concise, and based solely on the provided context.\n\n"
          f"Context:\n"
          f"{context}\n\n"
          f"Question: {question}\n"
          f"Answer:"
    )

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    output = model.generate(input_ids=input_ids, max_new_tokens=500, pad_token_id=tokenizer.pad_token_id, temperature=0)

    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

    answer = generated_text.split("Answer:")[1].strip() if "Answer:" in generated_text else generated_text.strip()

    return answer, results

def log_process_time(start_time):
    end_time = time.time()
    return end_time - start_time

def process_rag_csv(df):

    rag_answers = []
    process_times = []
    similarity_metadata = []

    for question in df['Question']:
        start_time = time.time()
        answer, results = generate_rag_answer(question)
        process_time = log_process_time(start_time)

        similarity_info = "\n\n".join([f"Title: {result.metadata['source_url']}\nContent: {result.page_content}" for result in results])

        rag_answers.append(answer)
        process_times.append(process_time)
        similarity_metadata.append(similarity_info)

    df['rag_llm'] = rag_answers
    df['rag_time'] = process_times
    df['rag_similarity'] = similarity_metadata

    return df

df_process = process_csv(df_process)

## Rag With Reranker

In [None]:
from llama_index.core.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank()

In [None]:
def generate_rag_answer(question, k=5):
    results = similarity_search(question, k)

    reranked_results = reranker.rerank(results, query=question)

    context = "\n\n".join([f"Source {i+1}: {result['page_content']}" for i, result in enumerate(reranked_results)])


    prompt = (
          f"You are an expert in financial analysis, specifically focused on investment strategies and ETF funds. You will be provided with context from the ARK Investment Case for Disruptive Innovation.\n"
          f"Use only the provided information to answer the question accurately and concisely. Do not use any external knowledge."
          f"Base your answer strictly on the context provided.\n\n"
          f"Please follow these instructions carefully to provide an accurate answer:\n"
          f"1. Carefully read each paragraph of the provided content.\n"
          f"2. Identify if the paragraph contains relevant information to answer the question. If a paragraph does not provide relevant information, move to the next paragraph.\n"
          f"3. When you find relevant information, use it to construct your answer. Include as much evidence as possible from the context to support your answer, even if an answer has already been started.\n"
          f"4. Ensure your answer is accurate, concise, and based solely on the provided context.\n\n"
          f"Context:\n"
          f"{context}\n\n"
          f"Question: {question}\n"
          f"Answer:"
    )

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    output = model.generate(input_ids=input_ids, max_new_tokens=500, pad_token_id=tokenizer.pad_token_id)

    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

    answer = generated_text.split("Answer:")[1].strip() if "Answer:" in generated_text else generated_text.strip()

    return answer, reranked_results

In [None]:
def process_rag_rerank_csv(df):

    rag_answers = []
    process_times = []
    similarity_metadata = []

    for question in df['Question']:
        start_time = time.time()
        answer, results = generate_rag_answer(question)
        process_time = log_process_time(start_time)

        similarity_info = "\n\n".join([f"Title: {result.metadata['source_url']}\nContent: {result.page_content}" for result in results])

        rag_answers.append(answer)
        process_times.append(process_time)
        similarity_metadata.append(similarity_info)

    df['rag_rerank_llm'] = rag_answers
    df['rag_rerank_time'] = process_times
    df['rag_rerank_similarity'] = similarity_metadata

    return df

In [None]:
df_rerank = process_rag_rerank_csv(df_process)
