In [4]:
'''
We now build the index, which serves as our knowledge base.
It is built from papers on a specific topic that are already available in a database or repository (here, the arXiv database).
NB: An index makes it practical to search a large dataset.
'''

from tools import fetch_arxiv_papers

papers = fetch_arxiv_papers('Language Models', 10) # Fetch 10 papers on Language Models

Building an index gives an AI agent a foundation for efficient data storage, retrieval, and processing. The index lets the agent find relevant information quickly and accurately, which is crucial for tasks such as answering questions, making decisions, or generating content.
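To make the idea of retrieval concrete, here is a minimal sketch of what a vector index does: embed every text once, then rank the stored texts by similarity to a query. The `toy_embed` function and `ToyIndex` class are hypothetical stand-ins written for illustration. A real embedding model returns dense vectors rather than word counts, and this is not how LlamaIndex is implemented internally.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: lowercase word counts (an assumption for illustration)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Hypothetical in-memory index: stores (text, vector) pairs."""

    def __init__(self, texts):
        # Embed each text once, when the index is built
        self.entries = [(text, toy_embed(text)) for text in texts]

    def query(self, question, top_k=1):
        # Embed the question, then rank stored texts by similarity to it
        q = toy_embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]


titles = [
    "Domain Adaptation of Foundation LLMs for e-Commerce",
    "A Simple Aerial Detection Baseline of Multimodal Language Models",
]
toy_index = ToyIndex(titles)
best = toy_index.query("aerial detection")
```

The expensive step (embedding) happens once at build time, so each query only needs one new embedding plus a similarity ranking.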

In [5]:
[paper['title'] for paper in papers] # Print the titles of the fetched papers

['Distilling Multi-modal Large Language Models for Autonomous Driving',
 'Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues',
 'OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking',
 'Enhancing Lexicon-Based Text Embeddings with Large Language Models',
 'Suggesting Code Edits in Interactive Machine Learning Notebooks Using Large Language Models',
 'Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps',
 'A Simple Aerial Detection Baseline of Multimodal Language Models',
 'CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education',
 'Domain Adaptation of Foundation LLMs for e-Commerce',
 'Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key']

In [7]:
# Create documents from the fetched papers.
# A Document is a generic container for the data we want to index,
# regardless of which data source it came from.

from llama_index.core  import Document

def create_documents_from_papers(papers):
    documents = []
    for paper in papers:
        content = (
            f'Title: {paper["title"]}\n'
            f'Authors: {", ".join(paper["authors"])}\n'
            f'summary: {paper["summary"]}\n'
            f'Published: {paper["published"]}\n'
            f'journal_ref: {paper["journal_ref"]}\n'
            f'DOI: {paper["doi"]}\n'
            f'Primary Category: {paper["primary_category"]}\n'
            f'categories: {", ".join(paper["categories"])}\n'
            f'PDF url: {paper["pdf_url"]}\n'
            f'arxiv url: {paper["arxiv_url"]}\n'
        )
        documents.append(Document(text=content))
    return documents
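Preprints often lack a journal reference or DOI. Assuming `fetch_arxiv_papers` may return `None` for such fields or omit the key entirely (the source does not confirm this), a small helper can keep the document text readable. The `field` helper and the `sample` dict below are hypothetical and only sketch the idea.

```python
def field(paper, key, default="N/A"):
    """Return paper[key] as text, or `default` when the key is absent or None."""
    value = paper.get(key)
    if value is None:
        return default
    if isinstance(value, (list, tuple)):
        # Lists such as authors or categories are joined with commas
        return ", ".join(str(v) for v in value)
    return str(value)


# Hypothetical paper dict: no journal reference, and no 'doi' key at all
sample = {
    "title": "Example Paper",
    "authors": ["A. Author", "B. Author"],
    "journal_ref": None,
}

content = (
    f"Title: {field(sample, 'title')}\n"
    f"Authors: {field(sample, 'authors')}\n"
    f"journal_ref: {field(sample, 'journal_ref')}\n"
    f"DOI: {field(sample, 'doi')}\n"
)
```

With this approach a missing field is rendered as `N/A` instead of the literal string `None` or a `KeyError`.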
       

In [8]:
# Call the create_documents_from_papers function
documents = create_documents_from_papers(papers)

In [9]:
# Each document holds a string that will be passed through the embedding model
documents

[Document(id_='f085c0bc-6d17-413e-93d6-8719e925ef25', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Title: Distilling Multi-modal Large Language Models for Autonomous Driving\nAuthors: Deepti Hegde, Rajeev Yasarla, Hong Cai, Shizhong Han, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Risheek Garrepalli, Vishal M. Patel, Fatih Porikli\nsummary: Autonomous driving demands safe motion planning, especially in critical\n"long-tail" scenarios. Recent end-to-end autonomous driving systems leverage\nlarge language models (LLMs) as planners to improve generalizability to rare\nevents. However, using LLMs at test time introduces high computational costs.\nTo address this, we propose DiMA, an end-to-end autonomous driving system that\nmaintains the efficiency of an LLM-free (or vision-based) planner while\n

In [16]:
'''
#% ensure we are installing in the correct environment
%pip install constants
'''

'\n#% ensure we are installing in the correct environment\n%pip install constants\n'

In [17]:
'''
#print the content of constant to determine what it contains

import constants
print(dir(constants))
'''


'\n#print the content of constant to determine what it contains\n\nimport constants\nprint(dir(constants))\n'

In [None]:

# This is where we build the index

from llama_index.core import Settings, VectorStoreIndex
from constant import embed_model

Settings.chunk_size = 1024   # split the text into chunks of up to 1024 tokens (longer text is split into multiple chunks)
Settings.chunk_overlap = 50  # each chunk overlaps the previous one by 50 tokens

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
'''
To create the index, we pass the documents through an embed_model.
This snippet actually calls the OpenAI model to embed the documents, that is, to turn them into vectors.
'''

'\nthis is actually calling the openai model to embed the documents.that is to change the documents into vectors\n'
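The effect of a chunk size and a chunk overlap can be sketched with a plain sliding window over a token list. This is an illustration only: `chunk_tokens` is a hypothetical helper, and the splitter LlamaIndex actually uses relies on a real tokenizer and tries to respect sentence boundaries, so its chunk boundaries will differ.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of at most `chunk_size` items,
    where each window repeats the last `chunk_overlap` items of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the text
    return chunks


# Ten stand-in tokens, windows of 4 with an overlap of 1
small = chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1)
# small == [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap means text near a chunk boundary appears in both neighbouring chunks, which reduces the chance that a relevant sentence is cut off from its context.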

In [12]:
# Store the created index in a folder called index/ (local storage)
index.storage_context.persist('index/')

'''
Add the index folder to .gitignore, because its contents change depending on what we search for and which documents we index.
Open .gitignore and add index/ (this is a local index).

We can also use a cloud-based vector store such as Pinecone or Weaviate.

'''
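Once the index has been persisted, a later session can reload it instead of embedding the documents again. The sketch below assumes the same `index/` folder as above. `load_index_or_none` is a hypothetical helper name, while `StorageContext.from_defaults` and `load_index_from_storage` come from `llama_index.core`. The embedding model used at query time should match the one used to build the index; how it is supplied (for example through `Settings.embed_model`) depends on your setup.

```python
import os


def load_index_or_none(persist_dir="index/"):
    """Reload a persisted index, or return None if nothing has been saved yet."""
    if not os.path.isdir(persist_dir) or not os.listdir(persist_dir):
        return None
    # Imported lazily so the existence check works even without llama_index installed
    from llama_index.core import StorageContext, load_index_from_storage

    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    return load_index_from_storage(storage_context)
```

A typical use is `index = load_index_or_none()` at the top of the notebook, falling back to `VectorStoreIndex.from_documents(...)` only when the result is `None`.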
