In [None]:
%load_ext autoreload
%autoreload 2

# Retriever
In this notebook we extract content from PDFs, chunk it, index the chunks, and retrieve the most relevant chunks for a given query.

In [None]:
#| default_exp retriever

In [None]:
#|export
import toolslm as tlm
import os
import openai, numpy as np
from wattbot import eda, utils
import fastcore.all as fc
import contextkit.read as rd
from rank_bm25 import BM25Okapi
from dotenv import load_dotenv
from sklearn.metrics.pairwise import cosine_similarity
from langchain_text_splitters import MarkdownTextSplitter

In [None]:
load_dotenv()

True

## Load Data

Load the metadata, train, and test files and preview each one.

In [None]:
md = eda.metadata()
md.head()

Unnamed: 0,id,type,title,year,citation,url
0,amazon2023,report,2023 Amazon Sustainability Report,2023,Amazon Staff. (2023). Amazon Sustainability Re...,https://sustainability.aboutamazon.com/2023-am...
1,chen2024,paper,Efficient Heterogeneous Large Language Model D...,2024,"Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingx...",https://arxiv.org/pdf/2405.01814
2,chung2025,paper,The ML.ENERGY Benchmark: Toward Automated Infe...,2025,"Jae-Won Chung, Jiachen Liu, Jeff J. Ma, Ruofan...",https://arxiv.org/pdf/2505.06371
3,cottier2024,paper,The Rising Costs of Training Frontier AI Models,2024,"Ben Cottier, Robi Rahman, Loredana Fattorini, ...",https://arxiv.org/pdf/2405.21015
4,dodge2022,paper,Measuring the Carbon Intensity of AI in Cloud ...,2022,"Jesse Dodge, Taylor Prewitt, Remi Tachet Des C...",https://arxiv.org/pdf/2206.05229


In [None]:
qa = eda.train()
qa.head()

Unnamed: 0,id,question,answer,answer_value,answer_unit,ref_id,ref_url,supporting_materials,explanation
0,q003,What is the name of the benchmark suite presen...,The ML.ENERGY Benchmark,ML.ENERGY Benchmark,is_blank,['chung2025'],['https://arxiv.org/pdf/2505.06371'],"We present the ML.ENERGY Benchmark, a benchmar...",Quote
1,q009,What were the net CO2e emissions from training...,4.3 tCO2e,4.3,tCO2e,['patterson2021'],['https://arxiv.org/pdf/2104.10350'],"""Training GShard-600B used 24 MWh and produced...",Quote
2,q054,What is the model size in gigabytes (GB) for t...,64.7 GB,64.7,GB,['chen2024'],['https://arxiv.org/pdf/2405.01814'],Table 3: Large language models used for evalua...,Table 3
3,q062,What was the total electricity consumption of ...,Unable to answer with confidence based on the ...,is_blank,MWh,is_blank,is_blank,is_blank,is_blank
4,q075,True or False: Hyperscale data centers in 2020...,TRUE,1,is_blank,"['wu2021b','patterson2021']","['https://arxiv.org/abs/2108.06738','https://a...","Wu 2021, body text near Fig. 1: ""…between trad...",The >40% statement is explicit in Wu. Patterso...


In [None]:
tst = eda.test()
tst.head()

Unnamed: 0,id,question,answer,answer_value,answer_unit,ref_id,ref_url,supporting_materials,explanation
0,q001,What was the average increase in U.S. data cen...,,,percent,,,,
1,q002,"In 2023, what was the estimated amount of cars...",,,cars,,,,
2,q004,How many data centers did AWS begin using recy...,,,data centers,,,,
3,q005,Since NVIDIA doesn't release the embodied carb...,,,kg/GPU,,,,
4,q006,By what factor was the estimated amortized tra...,,,ratio,,,,


We have to fill in the `answer`, `answer_value`, `answer_unit`, `ref_id`, `ref_url`, `supporting_materials` and `explanation` columns for the test set.

The competition expects the following values:

- `answer`: A clear natural-language response (e.g., 1438 lbs, Water consumption, TRUE). If no answer is possible, use "Unable to answer with confidence based on the provided documents."
  
- `answer_value`: The normalized numeric or categorical value (e.g., 1438, Water consumption, 1)
  - If no answer is possible, use is_blank
  - Ranges should be encoded as [low,high]
  - Do not include symbols like <, >, ~ here. Those can be left in the clear natural language column.


- `answer_unit`: Unit of measurement (e.g., lbs, kWh, gCO2, projects, is_blank).

- `ref_id`: One or more document IDs from metadata.csv that support the answer.

- `ref_url`: One or more URL(s) of the cited document(s).

- `supporting_materials`: Verbatim justification from the cited document (quote, table reference, figure reference, etc.).

- `explanation`: Short reasoning describing why the cited material supports the answer. 
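
The `answer_value` rules listed above can be sketched as a small helper. This is an illustrative assumption, not the competition's official normalizer: `normalize_answer_value` is a hypothetical name, and it only handles ranges written as `low-high` or `low to high`.

```python
import re

def normalize_answer_value(raw):
    """Hypothetical helper: map a raw answer string to the `answer_value` format."""
    # No answer possible: use the literal is_blank
    if raw is None or not str(raw).strip():
        return "is_blank"
    # Symbols like <, >, ~ belong in the natural-language `answer` column only
    text = re.sub(r"[<>~]", "", str(raw)).strip()
    # Ranges are encoded as [low,high]
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)", text)
    if m:
        return f"[{m.group(1)},{m.group(2)}]"
    return text

print(normalize_answer_value("~1438"))     # 1438
print(normalize_answer_value("10 to 20"))  # [10,20]
```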

## Read pdf

I have already downloaded all the PDFs; see the `00_eda` notebook.

We will extract the content from the pdfs here using answerdotai's [`contextkit`](https://github.com/AnswerDotAI/ContextKit) library which uses [`pypdf`](https://github.com/py-pdf/pypdf) underneath.

pypdf does a decent job of extracting text from PDFs, but it does not preserve layout, table structure, or reading order.

In [None]:
#|export
def get_metadata(doc_id:str) -> dict:
    """Returns the metadata for a given doc_id"""
    meta = eda.metadata()
    return meta[meta['id'] == doc_id].iloc[0].to_dict()

In [None]:
get_metadata('chen2024')

{'id': 'chen2024',
 'type': 'paper',
 'title': 'Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation',
 'year': 2024,
 'citation': 'Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu. (2024). Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arXiv. https://arxiv.org/pdf/2405.01814',
 'url': 'https://arxiv.org/pdf/2405.01814'}

In [None]:
doc_id = 'chen2024'
fc.test_eq(get_metadata(doc_id)['id'], doc_id)

In [None]:
#|export
def read_doc(doc_id:str) -> fc.NS:
    """Returns the content of the pdf along with its metadata for a given doc_id"""
    meta = get_metadata(doc_id)
    content = rd.read_pdf(fc.Path(eda.data_path)/f'{doc_id}.pdf')
    return fc.NS(content=content, **meta)

In [None]:
doc = read_doc('chen2024')
doc.content[:100], doc.id

('Efficient Heterogeneous Large Language Model Decoding\nwith Model-Attention Disaggregation\nShaoyuan C',
 'chen2024')

In [None]:
doc_id='chen2024'
doc = read_doc(doc_id)
fc.test_ne(len(doc.content), 0)
fc.test_eq(doc.id, doc_id)

## Read Markdown

In [None]:
#|export
def read_markdown(doc_id:str) -> fc.NS:
    """Returns the markdown content given the doc_id"""
    meta = get_metadata(doc_id)
    md = (fc.Path(eda.data_path)/f'markdown/{doc_id}.md').read_text()
    return fc.NS(content=tlm.download.clean_md(md), **meta)

In [None]:
len(read_markdown('chen2024').content), len(read_doc('chen2024').content)

(75025, 69175)

## Total content size

In [None]:
fc.L(eda.metadata()['id'].to_list()).map(lambda x: len(read_doc(x).content)).sum()

2673613

I don't think any open-source model can handle that many characters in its context window as of November 2025.

A RAG-based system is a good fit here: we chunk the content, retrieve the relevant chunks, and generate the answer from those chunks.
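
To put the character count in token terms, here is a rough back-of-the-envelope estimate. The figure of 4 characters per token is an assumption (a common rule of thumb for English text); the real count depends on the tokenizer.

```python
total_chars = 2_673_613      # total content size computed above
chars_per_token = 4          # assumption: rough average, tokenizer-dependent
est_tokens = total_chars // chars_per_token
print(est_tokens)            # 668403
```

Under that assumption the corpus is on the order of several hundred thousand tokens before any prompt or answer is added.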

## Document Chunks

In [None]:
#|export
def get_content_metadata(fn, doc_id):
    doc = fn(doc_id)
    content = doc.__dict__.pop('content')
    return content, doc.__dict__

In [None]:
content, metadata = get_content_metadata(read_doc, 'chen2024')
content[:100], metadata

('Efficient Heterogeneous Large Language Model Decoding\nwith Model-Attention Disaggregation\nShaoyuan C',
 {'id': 'chen2024',
  'type': 'paper',
  'title': 'Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation',
  'year': 2024,
  'citation': 'Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu. (2024). Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arXiv. https://arxiv.org/pdf/2405.01814',
  'url': 'https://arxiv.org/pdf/2405.01814'})

In [None]:
#|export
def chunk_doc(doc_id:str, start_id:int=0, chunk_size:int=1500, step:int=1400) -> list:
    """Chunks the content of a doc given the doc_id"""
    content, metadata = get_content_metadata(read_doc, doc_id)
    def _chunk(x): return fc.NS(text = content[x[-1]: x[-1] + chunk_size],  chunk_id=x[0] + start_id, **metadata)
    return fc.L.range(0, len(content), step).enumerate().map(_chunk)

In [None]:
doc = read_doc('chen2024')
len(doc.content)

69175

In [None]:
chunks = chunk_doc('chen2024')
len(chunks), chunks[0]['text'][-200:]

(50,
 ' performance and cost efficiency. Our com-\nprehensive analysis and experiments confirm the viability\nof splitting the attention computation over multiple devices.\nAlso, the communication bandwidth req')
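
`chunk_doc` slides a 1500-character window in steps of 1400, so consecutive chunks share 100 characters. The arithmetic can be checked with a plain-Python sketch (`chunk_spans` is a hypothetical helper, not part of the exported module):

```python
def chunk_spans(n_chars, chunk_size=1500, step=1400):
    """Return the (start, end) character offsets of each chunk."""
    return [(s, min(s + chunk_size, n_chars)) for s in range(0, n_chars, step)]

spans = chunk_spans(69175)   # length of the chen2024 content
print(len(spans))            # 50
print(spans[:2])             # [(0, 1500), (1400, 2900)]
print(spans[-1])             # (68600, 69175)
```

The last chunk is shorter than `chunk_size` because slicing stops at the end of the content.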

In [None]:
doc_id = 'chen2024'
chunks = chunk_doc(doc_id)
fc.test_ne(len(chunks), 0)
fc.test_eq(chunks[0].id, doc_id)

In [None]:
chunks[0]['text'][:200]

'Efficient Heterogeneous Large Language Model Decoding\nwith Model-Attention Disaggregation\nShaoyuan Chen1 Wencong Xiao2 Yutong Lin1 Mingxing Zhang1 Yingdi Shan1 Jinlei Jiang1\nKang Chen1 Yongwei Wu1\n1Ts'

In [None]:
chunks[0]

namespace(text='Efficient Heterogeneous Large Language Model Decoding\nwith Model-Attention Disaggregation\nShaoyuan Chen1 Wencong Xiao2 Yutong Lin1 Mingxing Zhang1 Yingdi Shan1 Jinlei Jiang1\nKang Chen1 Yongwei Wu1\n1Tsinghua University\n2ByteDance\nAbstract\nTransformer-based large language models (LLMs) exhibit\nimpressive performance in generative tasks but also intro-\nduce significant challenges in real-world serving due to in-\nefficient use of the expensive, computation-optimized accel-\nerators. Although disaggregated serving architectures have\nbeen proposed to split different phases of LLM inference, the\nefficiency of decoding phase is still low. This is caused by\nthe varying resource demands of different operators in the\ntransformer-based LLMs. Specifically, the attention operator\nis memory-intensive, exhibiting a memory access pattern that\nclashes with the strengths of modern accelerators, especially\nfor long context requests.\nTo enhance the efficiency of LLM decodi

In [None]:
chunks[-1]

namespace(text='i Chen, Christopher De-\nwan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mi-\nhaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel\nSimig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang,\nand Luke Zettlemoyer. Opt: Open pre-trained trans-\nformer language models, 2022.\n[59] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu,\nYibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Dist-\nServe: Disaggregating prefill and decoding for goodput-\noptimized large language model serving. In 18th\nUSENIX Symposium on Operating Systems Design and\nImplementation (OSDI 24), pages 193–210, 2024.\n16',
          chunk_id=49,
          id='chen2024',
          type='paper',
          title='Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation',
          year=2024,
          citation='Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu. (2024). Efficient Heterogeneous Large Language Model Decoding wit

## Markdown Chunks

In [None]:
chunk_size = 375
chunk_overlap = 125
md_splitter = MarkdownTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

In [None]:
md_splitter

<langchain_text_splitters.markdown.MarkdownTextSplitter>

In [None]:
doc_id = 'chen2024'
md_content = read_markdown(doc_id).content
chunks = md_splitter.split_text(md_content)

In [None]:
chunks[0][-1200:]

'# Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation\n\nShaoyuan Chen<sup>1</sup> Wencong Xiao<sup>2</sup> Yutong Lin<sup>1</sup> Mingxing Zhang<sup>1</sup> Yingdi Shan<sup>1</sup> Jinlei Jiang<sup>1</sup>  \nKang Chen<sup>1</sup> Yongwei Wu<sup>1</sup>\n\n<sup>1</sup>Tsinghua University\n\n<sup>2</sup>ByteDance\n\n## Abstract\n\nTransformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of moder

In [None]:
chunks[1][:1200]

'Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long context requests.\n\nTo enhance the efficiency of LLM decoding, we introduce model-attention disaggregation. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for other parts of the model. This heterogeneous setup ensures that each component 

In [None]:
#|export
@fc.patch
def chunk_markdown(self:MarkdownTextSplitter, doc_id:str, start_id:int=0):
    """Chunks the markdown of a doc given the doc_id"""
    content, metadata = get_content_metadata(read_markdown, doc_id)
    chunks = fc.L(self.split_text(content))
    return chunks.enumerate().map(lambda x: fc.NS(text = x[1],  chunk_id=x[0] + start_id, **metadata))

In [None]:
chunks = md_splitter.chunk_markdown(doc_id)

In [None]:
chunks[0].text[-900:]

'p> Yutong Lin<sup>1</sup> Mingxing Zhang<sup>1</sup> Yingdi Shan<sup>1</sup> Jinlei Jiang<sup>1</sup>  \nKang Chen<sup>1</sup> Yongwei Wu<sup>1</sup>\n\n<sup>1</sup>Tsinghua University\n\n<sup>2</sup>ByteDance\n\n## Abstract\n\nTransformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long context requests.'

In [None]:
chunks[1].text[:900]

'Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long context requests.\n\nTo enhance the efficiency of LLM decoding, we introduce model-attention disaggregation. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end'

## Human-readable Chunk

This will help later when building the context for the prompt.

In [None]:
#|export
def Nugget(chunk:object, chunk_no:int=0) -> str:
    """Returns the chunk in a readable format"""
    return f"""### Chunk {chunk_no}
            Text: {chunk['text']}
            Chunk Id: {chunk['chunk_id']}
            Doc ID: {chunk['id']}
            Type: {chunk['type']}
            Title: {chunk['title']}
            Year: {chunk['year']}
            Citation: {chunk['citation']}
            URL: {chunk['url']}"""

In [None]:
print(Nugget(chunks[42]))

### Chunk 0
            Text: ![](_page_8_Figure_0.jpeg)

(a) Request-level partition.

(b) Head-level partition.

Figure 9: Work partition methods of the attention operator.

store the KV caches and compute the attention operators. As depicted in Figure 9, the attention operators can be parallelized among memory devices in various ways. One method is to distribute different requests across different devices; an alternative strategy is to partition and distribute the attention heads, which can also be computed independently, to different devices. The head-level partitioning approach ensures a balanced workload distribution, whereas the request-level partitioning may result in load imbalance due to the differences in sequence lengths and therefore the KV cache sizes among requests. However, head-level partitioning has limited flexibility, as it requires the number of memory devices to be divisible by the number of attention heads. We opt for head-level partitioning in Lamina, which offe
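
As a sketch of how several retrieved chunks might be joined into one prompt context, assuming each chunk exposes the same attributes as above. `format_chunk`, `build_context` and `FIELDS` are hypothetical names, only a subset of the fields is shown, and lines are joined without the indentation that the triple-quoted string in `Nugget` carries.

```python
from types import SimpleNamespace

# (label, attribute) pairs to render; a subset of the chunk fields for brevity
FIELDS = [("Text", "text"), ("Chunk Id", "chunk_id"), ("Doc ID", "id"),
          ("Title", "title"), ("URL", "url")]

def format_chunk(chunk, chunk_no=0):
    """Render one chunk as a heading followed by `Label: value` lines."""
    lines = [f"### Chunk {chunk_no}"]
    lines += [f"{label}: {getattr(chunk, attr)}" for label, attr in FIELDS]
    return "\n".join(lines)

def build_context(chunks):
    """Number the chunks in retrieval order and join them into one string."""
    return "\n\n".join(format_chunk(c, i) for i, c in enumerate(chunks))

demo = [SimpleNamespace(text="Example text.", chunk_id=7, id="chen2024",
                        title="Example title", url="https://arxiv.org/pdf/2405.01814")]
print(build_context(demo))
```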

## Chunk all the docs

In [None]:
#|export
def chunk_all(fn) -> list:
    """Chunk contents of all docs"""
    doc_ids = eda.metadata()['id'].tolist()
    all_chunks = fc.L()
    for doc_id in doc_ids:
        chunks = fn(doc_id, start_id=len(all_chunks))
        all_chunks.extend(chunks)
    return all_chunks

In [None]:
all_chunks = chunk_all(chunk_doc)
len(all_chunks)

1927

In [None]:
all_chunks[0]

namespace(text='Amazon \nSustainability \nReport\n2023 Contents\nOverview\n3 Introduction\n4 A Letter from Our Chief \nSustainability Officer\xa0\n5 How We Work\n6 Goals Summary\n7 2023 Year in Review \xa0\nEnvironment\n9 Carbon\n24 Carbon-Free Energy\n29 Packaging \n34 Waste and Circularity\n40 Water\nValue Chain\n45 Human Rights \n50 Responsible Supply Chain\n58 Sustainable Products and \nMaterials \n64 Supplier Diversity \n67 Community Impact\nPeople\n75 Employee Experience\n81 Health and Safety\n86 Inclusive Experiences\nAppendix\n94  Sustainability Reporting Topic \nAssessment\n95  Endnotes\n96 Assurance Statements \n97 Disclaimer and Forward-Looking \nStatements \nOn the cover  \nThe Baldy Mesa Solar and Storage Project (developed \nand operated by AES), located in Adelanto, California. Employees inside one of our newest office buildings in Bellevue, \nWashington.\nIntroduction 2023 Year in ReviewGoals SummaryHow We WorkCSO Letter\nAbout This Report\nThis is our sixth annual repo

In [None]:
all_chunks[-1]

namespace(text='\nWeidinger, L., Mellor, J., et al.: Ethical and social risks of harm from language models.\narXiv preprint arXiv:2112.04359 (2021)\n25',
          chunk_id=1926,
          id='zschache2025',
          type='paper',
          title='Comparing energy consumption and accuracy in text classification inference',
          year=2025,
          citation='Johannes Zschache, & Tilman Hartwig (2025). Comparing energy consumption and accuracy in text classification inference arXiv. https://arxiv.org/pdf/2508.14170',
          url='https://arxiv.org/pdf/2508.14170 ')

## Neighbour Chunks

In [None]:
#|export
class Chunks:
    def __init__(self, all_chunks):
        fc.store_attr()

    def get_chunk(self, chunk_id):
        return self.all_chunks.filter(lambda x: x.chunk_id==chunk_id)[0]

    def get_neighbours(self, chunk_id):
        left_chunk, right_chunk = None, None
        left_chunk_id, right_chunk_id = chunk_id - 1, chunk_id + 1
        if left_chunk_id >= 0: left_chunk = self.get_chunk(left_chunk_id)
        if right_chunk_id < len(self.all_chunks): right_chunk = self.get_chunk(right_chunk_id)
        return left_chunk, right_chunk

    @staticmethod
    def unique(chunks):
        unique_chunk_ids = set()
        ans = fc.L()
        for c in chunks:
            if c.chunk_id not in unique_chunk_ids:
                unique_chunk_ids.add(c.chunk_id)
                ans.append(c)
        return ans

    def include_neighbours(self, chunks):
        ans = fc.L()
        for chunk in chunks:
            left_chunk, right_chunk = self.get_neighbours(chunk.chunk_id)
            if left_chunk: ans.append(left_chunk)
            ans.append(chunk)
            if right_chunk: ans.append(right_chunk)
        return self.unique(ans)

In [None]:
Chunks(all_chunks).get_chunk(1850)

namespace(text='ne-tune the full BlackMamba model (i.e.,\noriginal weight matrices), whereas employed QLoRA [15]\nfor parameter-efficient fine-tuning (PEFT) on Mixtral due to\nGPU memory capacity budget. For QLoRA, we target the\nMoE layers, including the routers, and set the rank of the\nLoRA modules to 16. We enable FlashAttention2 [17] during\nMixtral fine-tuning for enhanced efficiency. Moreover, we use\ngradient checkpointing [18] to save memory usage.\nDatasets. Our fine-tuning process is implemented in Py-\nTorch using the LLaMA-Factory framework [19], with a\nlearning rate of 5e-5 and 10 epochs. Both models were fine-\ntuned on two datasets focused on different tasks: common-\nsense 15k (CS) and Math 14k (MATH), which address com-\nmonsense reasoning and arithmetic reasoning respectively\n(provided by LLM-adapters [20]). The details of datasets\nare used in Table II. For evaluation, we tested the models\non GSM8K [21] for arithmetic reasoning and HE [22] for\ncommonsense reason

In [None]:
left_chunk, right_chunk = Chunks(all_chunks).get_neighbours(1850)
fc.L(left_chunk, right_chunk).attrgot('chunk_id')

(#2) [1849,1851]

In [None]:
some_dup_chunks = fc.L(Chunks(all_chunks).get_chunk(cid) for cid in [1850, 1851, 1852, 1853, 1850, 1851, 1852])

In [None]:
Chunks.unique(some_dup_chunks).attrgot('chunk_id')

(#4) [1850,1851,1852,1853]

In [None]:
ans = Chunks(all_chunks).include_neighbours(some_dup_chunks)
ans.attrgot('chunk_id')

(#6) [1849,1850,1851,1852,1853,1854]
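
`get_chunk` scans the whole list on every lookup. A possible alternative, sketched here under the assumption that `chunk_id` values are unique, is to index the chunks in a dict once. `ChunkIndex` is a hypothetical class, not part of the exported module.

```python
from types import SimpleNamespace

class ChunkIndex:
    """Hypothetical alternative to `Chunks` that looks chunks up through a dict."""
    def __init__(self, all_chunks):
        self.by_id = {c.chunk_id: c for c in all_chunks}

    def with_neighbours(self, chunks):
        """Return each chunk with its left and right neighbour, without duplicates."""
        seen, out = set(), []
        for c in chunks:
            for cid in (c.chunk_id - 1, c.chunk_id, c.chunk_id + 1):
                if cid in self.by_id and cid not in seen:
                    seen.add(cid)
                    out.append(self.by_id[cid])
        return out

demo = [SimpleNamespace(chunk_id=i, text=f"chunk {i}") for i in range(6)]
index = ChunkIndex(demo)
hits = [demo[2], demo[3], demo[2]]
print([c.chunk_id for c in index.with_neighbours(hits)])  # [1, 2, 3, 4]
```

Like the original, this treats the previous and next `chunk_id` as neighbours even when they belong to a different document.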

## Lexical Search
> We will use BM25 here
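
For intuition about what BM25 computes, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the `rank_bm25` implementation: `bm25_scores` is a hypothetical function, `k1=1.5` and `b=0.75` are assumed as common defaults, and the IDF uses the non-negative `log(1 + ...)` variant, so absolute scores will not necessarily match the library's.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus_tokens` against the query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(d) for d in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

corpus = ["gpu energy use in data centers",
          "water consumption of cooling systems",
          "carbon emissions from training"]
tokenized = [doc.lower().split() for doc in corpus]
scores = bm25_scores("cooling water".split(), tokenized)
best = max(range(len(scores)), key=scores.__getitem__)
print(corpus[best])  # water consumption of cooling systems
```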

In [None]:
idx = np.random.randint(0, len(all_chunks))
query = all_chunks[idx].text
all_chunks[idx]

namespace(text='easoning modes\n8 Discussion and Policy Implications\n8.1 The Critical Role of Infrastructure in AI Sustainability\nOur findings indicate that infrastructure is a crucial determinant of AI inference sustainability. While\nmodel design enhances theoretical efficiency, real-world outcomes can substantially diverge based\non deployment conditions and factors such as renewable energy usage and hardware efficiency.\nFor instance, GPT-4o mini, despite its smaller architecture, consumes approximately 20% more\nenergy than GPT-4o on long queries due to reliance on older A100 GPU nodes. Similarly, DeepSeek\nmodels highlight the profound impact of infrastructure: DeepSeek-R1 and DeepSeek-V3 deployed on\nDeepSeek’s own servers exhibit water consumption and carbon emissions nearly six times higher than\ntheir Azure-hosted counterparts. The Azure deployments benefit from better hardware, more efficient\ncooling systems, lower carbon intensity, and tighter PUE control, demonstrating 

In [None]:
def get_random_chunk(chunks): return chunks[np.random.randint(0, len(chunks))]

In [None]:
get_random_chunk(all_chunks).text[:100]

'eter, Compute and Data Trends in Machine Learning.https://epochai.org/data/epochdb/\nvisualization, 2'

In [None]:
#|export
def tokenize(query): return query.lower().split()

In [None]:
tokenized_query = tokenize(query)
tokenized_query[:10]

['easoning',
 'modes',
 '8',
 'discussion',
 'and',
 'policy',
 'implications',
 '8.1',
 'the',
 'critical']
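
Splitting on whitespace keeps punctuation attached to words (for example `implications,` would not match `implications`). One possible alternative, shown as a sketch only, is a regex tokenizer that could be passed as `tokenize_func`. `tokenize_words` is a hypothetical name; note that it also splits `8.1` into `8` and `1`, which may or may not be desirable.

```python
import re

def tokenize_words(text):
    """Lowercase and keep only runs of letters, digits and underscores."""
    return re.findall(r"\w+", text.lower())

print(tokenize_words("Discussion and Policy Implications, 8.1"))
# ['discussion', 'and', 'policy', 'implications', '8', '1']
```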

In [None]:
#|export
def bm25chunks(chunks:object) -> object: 
    """Indexes the chunks with BM25Okapi"""
    return BM25Okapi([tokenize(t) for t in chunks.attrgot('text')])

In [None]:
bm25 = bm25chunks(all_chunks)
bm25.corpus_size

1927

In [None]:
bm25.get_top_n(tokenized_query, all_chunks, n=1) 

[namespace(text='easoning modes\n8 Discussion and Policy Implications\n8.1 The Critical Role of Infrastructure in AI Sustainability\nOur findings indicate that infrastructure is a crucial determinant of AI inference sustainability. While\nmodel design enhances theoretical efficiency, real-world outcomes can substantially diverge based\non deployment conditions and factors such as renewable energy usage and hardware efficiency.\nFor instance, GPT-4o mini, despite its smaller architecture, consumes approximately 20% more\nenergy than GPT-4o on long queries due to reliance on older A100 GPU nodes. Similarly, DeepSeek\nmodels highlight the profound impact of infrastructure: DeepSeek-R1 and DeepSeek-V3 deployed on\nDeepSeek’s own servers exhibit water consumption and carbon emissions nearly six times higher than\ntheir Azure-hosted counterparts. The Azure deployments benefit from better hardware, more efficient\ncooling systems, lower carbon intensity, and tighter PUE control, demonstrating

In [None]:
#|export
class LexicalSearch:
    def __init__(self, chunks, tokenize_func=tokenize, neighbour_chunks=False):
        fc.store_attr()
        self.model = 'BM25Okapi'
        self.bm25 = bm25chunks(chunks)
        
    def search(self, query, n=10):
        ans = fc.L(self.bm25.get_top_n(self.tokenize_func(query), self.chunks, n=n))
        if self.neighbour_chunks: ans = Chunks(self.chunks).include_neighbours(ans)
        return ans

In [None]:
ls = LexicalSearch(all_chunks)
lexical_res = ls.search(query, n=1)
lexical_res

(#1) [NS(text='easoning modes\n8 Discussion and Policy Implications\n8.1 The Critical Role of Infrastructure in AI Sustainability\nOur findings indicate that infrastructure is a crucial determinant of AI inference sustainability. While\nmodel design enhances theoretical efficiency, real-world outcomes can substantially diverge based\non deployment conditions and factors such as renewable energy usage and hardware efficiency.\nFor instance, GPT-4o mini, despite its smaller architecture, consumes approximately 20% more\nenergy than GPT-4o on long queries due to reliance on older A100 GPU nodes. Similarly, DeepSeek\nmodels highlight the profound impact of infrastructure: DeepSeek-R1 and DeepSeek-V3 deployed on\nDeepSeek’s own servers exhibit water consumption and carbon emissions nearly six times higher than\ntheir Azure-hosted counterparts. The Azure deployments benefit from better hardware, more efficient\ncooling systems, lower carbon intensity, and tighter PUE control, demonstrating t

In [None]:
ls = LexicalSearch(all_chunks, neighbour_chunks=True)
ls.search(query, n=1).attrgot('chunk_id')

(#3) [868,869,870]

## Semantic Search

In [None]:
embed_model = 'nomic-ai/nomic-embed-text-v1.5'

In [None]:
#|export
@fc.patch
def embed(self:openai.OpenAI, model, texts, bs=256): 
    if isinstance(texts, str): texts = [texts]
    texts_chunks = fc.chunked(texts, chunk_sz=bs)
    data = fc.mapped(lambda o: self.embeddings.create(input=o, model=model), texts_chunks).attrgot('data')
    return np.array(data.concat().attrgot('embedding'))

In [None]:
utils.fw().embed(embed_model, ['hi', 'anubhav']).shape

(2, 768)

In [None]:
texts = all_chunks.attrgot('text')
len(texts), texts[0][:100]

(1927,
 'Amazon \nSustainability \nReport\n2023 Contents\nOverview\n3 Introduction\n4 A Letter from Our Chief \nSust')

In [None]:
embeddings = utils.fw().embed(embed_model, texts)

In [None]:
embeddings.shape

(1927, 768)

In [None]:
eda.data_path

Path('../data')

In [None]:
#|export
def embed_chunks(chunks, model='nomic-ai/nomic-embed-text-v1.5'):
    texts = chunks.attrgot('text')
    embeddings = utils.fw().embed(model, texts)
    chunks_embeddings = fc.L(chunks, embeddings)
    return chunks_embeddings

In [None]:
chunks_embeddings = embed_chunks(all_chunks)    
len(chunks_embeddings[0]), chunks_embeddings[-1].shape

(1927, (1927, 768))

In [None]:
random_chunk = get_random_chunk(all_chunks)
random_chunk

namespace(text=' Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language\nUnderstanding. arXiv:1810.04805 [cs.CL]\n[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias\nMinderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint\narXiv:2010.11929 (2020).\n[10] Jim Gao. 2014. Machine learning applications for data center optimization. (2014).\n[11] Michael Gillenwater. 2008. Redefining RECs—Part 1: untangling attributes and offsets. Energy Policy 36, 6 (2008), 2109–2119.\n[12] Google. 2021. Carbon free energy for Google Cloud regions. https://cloud.google.com/sustainability/region-carbon\n[13] Google. 2021. Helping you pick the greenest region for your Google Cloud resources. https://cloud.google.com/blog/topics/sustainability/pick-the-\ngoogle-cloud-region-with-the-lowest-co2\n[14] A

In [None]:
query_embedding = utils.fw().embed(embed_model, random_chunk.text)
query_embedding.shape

(1, 768)

In [None]:
k = 10
all_chunks, all_embeddings = chunks_embeddings
scores = cosine_similarity(query_embedding, all_embeddings)
best_k_ind = np.argsort(scores)[0].tolist()[::-1][:k]
top_k_chunks = all_chunks[best_k_ind]

In [None]:
top_k_chunks[0]

namespace(text=' Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language\nUnderstanding. arXiv:1810.04805 [cs.CL]\n[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias\nMinderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint\narXiv:2010.11929 (2020).\n[10] Jim Gao. 2014. Machine learning applications for data center optimization. (2014).\n[11] Michael Gillenwater. 2008. Redefining RECs—Part 1: untangling attributes and offsets. Energy Policy 36, 6 (2008), 2109–2119.\n[12] Google. 2021. Carbon free energy for Google Cloud regions. https://cloud.google.com/sustainability/region-carbon\n[13] Google. 2021. Helping you pick the greenest region for your Google Cloud resources. https://cloud.google.com/blog/topics/sustainability/pick-the-\ngoogle-cloud-region-with-the-lowest-co2\n[14] A

In [None]:
#|export
class SemanticSearch:
    def __init__(self, chunks, model='nomic-ai/nomic-embed-text-v1.5', neighbour_chunks=False):
        fc.store_attr()
        self.chunks_embeddings = embed_chunks(chunks, model)

    def search(self, query, n=1):
        query_embedding = utils.fw().embed(self.model, query)
        all_chunks, all_embeddings = self.chunks_embeddings
        scores = cosine_similarity(query_embedding, all_embeddings)
        best_k_ind = np.argsort(scores)[0].tolist()[::-1][:n]
        ans = all_chunks[best_k_ind]
        if self.neighbour_chunks: ans = Chunks(self.chunks).include_neighbours(ans)
        return ans

In [None]:
ss = SemanticSearch(all_chunks)
semantic_res = ss.search(random_chunk.text, n=1)
len(semantic_res), semantic_res[0]

(1,
 namespace(text=' Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language\nUnderstanding. arXiv:1810.04805 [cs.CL]\n[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias\nMinderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint\narXiv:2010.11929 (2020).\n[10] Jim Gao. 2014. Machine learning applications for data center optimization. (2014).\n[11] Michael Gillenwater. 2008. Redefining RECs—Part 1: untangling attributes and offsets. Energy Policy 36, 6 (2008), 2109–2119.\n[12] Google. 2021. Carbon free energy for Google Cloud regions. https://cloud.google.com/sustainability/region-carbon\n[13] Google. 2021. Helping you pick the greenest region for your Google Cloud resources. https://cloud.google.com/blog/topics/sustainability/pick-the-\ngoogle-cloud-region-with-the-lowest-co2\n[

In [None]:
ss = SemanticSearch(all_chunks, neighbour_chunks=True)
ss.search(random_chunk.text, n=1).attrgot('chunk_id')

(#3) [529,530,531]
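
The ranking step inside `SemanticSearch.search` is cosine similarity followed by a top-`n` selection. A dependency-free sketch of the same idea on toy 2-dimensional vectors (`cosine` and `top_k` are hypothetical helpers; real embeddings here have 768 dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k documents most similar to the query, best first."""
    scores = [cosine(query_vec, v) for v in doc_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_k([1.0, 0.1], doc_vecs))  # [0, 1]
```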

## Hybrid: Rerank

Here we rerank the combined outputs of semantic search and lexical search using a reranker model.

There are other ways to merge the two result lists, such as Reciprocal Rank Fusion (RRF) or a linear combination of scores, which are worth trying later.
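
As a pointer for that later experiment, here is a minimal sketch of Reciprocal Rank Fusion over lists of chunk ids. It is not used in this notebook: `rrf` is a hypothetical function, and `k=60` is assumed as the conventional smoothing constant.

```python
def rrf(rankings, k=60):
    """Fuse ranked lists of ids: each id scores the sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = [868, 12, 530]    # chunk ids, best first (made-up example)
semantic = [530, 868, 77]
print(rrf([lexical, semantic]))  # [868, 530, 12, 77]
```

Ids that appear high in both lists rise to the top, and no score calibration between the two retrievers is needed because only ranks are used.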

In [None]:
#|export
def combine_chunks(chunks1:list, chunks2:list) -> list:
    "Returns unique combination of chunks 1 and chunks 2"
    res = chunks1.copy()
    seen_id = set(res.attrgot('chunk_id'))
    for ele in chunks2:
        if ele.chunk_id not in seen_id:
            res.append(ele)
            seen_id.add(ele.chunk_id)
    return res

In [None]:
combined_res = combine_chunks(semantic_res, lexical_res)
len(combined_res)

2

In [None]:
combined_res.attrgot('chunk_id')

(#2) [530,6]

In [None]:
ranker = utils.Reranker()

In [None]:
#|export
@fc.patch
def rerank_chunks(self:utils.Reranker, query, chunks, n=10):
    docs = chunks.attrgot('text')
    ranked_ids = self.rank(query=query, docs=docs, n=n)
    return chunks[ranked_ids]

In [None]:
query[:100]

'on Sustainability Report Value Chain Introduction 2023 Year in ReviewGoals SummaryHow We WorkCSO Let'

In [None]:
combined_res[-1].text[:100]

'on Sustainability Report Value Chain Introduction 2023 Year in ReviewGoals SummaryHow We WorkCSO Let'

In [None]:
ranker.rerank_chunks(combined_res[-1].text, combined_res)[0].text[:100]

'on Sustainability Report Value Chain Introduction 2023 Year in ReviewGoals SummaryHow We WorkCSO Let'

In [None]:
#|export
class HybridSearch:
    def __init__(self, lexical_search, semantic_search, neighbour_chunks=False):
        fc.store_attr()
        self.model = [lexical_search.model, semantic_search.model]
        self.ranker = utils.Reranker()

    def search(self, query, n=1):
        lexical_res = self.lexical_search.search(query, n=2*n)
        semantic_res = self.semantic_search.search(query, n=2*n)
        combined_chunks = combine_chunks(lexical_res, semantic_res)
        ans = self.ranker.rerank_chunks(query, combined_chunks,  n=n)
        if self.neighbour_chunks: ans = Chunks(self.lexical_search.chunks).include_neighbours(ans)
        return ans

In [None]:
hs = HybridSearch(ls, ss)
chunks_res = hs.search(combined_res[-1].text)
chunks_res[0].text[:100]

'23 Amazon Sustainability Report Value Chain Introduction 2023 Year in ReviewGoals SummaryHow We Work'

In [None]:
hs = HybridSearch(ls, ss, neighbour_chunks=True)
hs.search(combined_res[-1].text, n=1).attrgot('chunk_id')

(#3) [10,11,12]

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()