# Llama-index

In [1]:
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding



In [29]:
# global settings
Settings.text_splitter = SentenceSplitter(chunk_size=512)

# https://huggingface.co/BAAI/bge-large-en-v1.5
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-large-en-v1.5",
)
Settings.llm = None

LLM is explicitly disabled. Using MockLLM.


In [3]:
dataset = SimpleDirectoryReader("documents/").load_data()

In [4]:
dataset



In [5]:
dataset[0].metadata

{'file_path': '/Users/us65093/Documents/scai/documents/alice.txt',
 'file_name': 'alice.txt',
 'file_type': 'text/plain',
 'file_size': 154573,
 'creation_date': '2024-10-30',
 'last_modified_date': '2024-10-30'}

In [6]:
dataset[0]



In [7]:
index = VectorStoreIndex.from_documents(dataset)

In [8]:
top_passages = 5
query = "What is the name of the rabbit in Alice in Wonderland?"


passages = index.as_retriever(similarity_top_k=top_passages).retrieve(
    f"Represent this sentence for searching relevant passages: {query}",
)
text = [p.node.text for p in passages]

In [9]:
text

['“Mary Ann! Mary Ann!” said the voice. “Fetch me my gloves this moment!”\r\nThen came a little pattering of feet on the stairs. Alice knew it was\r\nthe Rabbit coming to look for her, and she trembled till she shook the\r\nhouse, quite forgetting that she was now about a thousand times as\r\nlarge as the Rabbit, and had no reason to be afraid of it.\r\n\r\nPresently the Rabbit came up to the door, and tried to open it; but, as\r\nthe door opened inwards, and Alice’s elbow was pressed hard against it,\r\nthat attempt proved a failure. Alice heard it say to itself “Then I’ll\r\ngo round and get in at the window.”\r\n\r\n“_That_ you won’t!” thought Alice, and, after waiting till she fancied\r\nshe heard the Rabbit just under the window, she suddenly spread out her\r\nhand, and made a snatch in the air. She did not get hold of anything,\r\nbut she heard a little shriek and a fall, and a crash of broken glass,\r\nfrom which she concluded that it was just possible it had fallen into a\r\ncu

## Multiple documents

In [10]:
dataset = SimpleDirectoryReader("documents/").load_data()
index = VectorStoreIndex.from_documents(dataset, show_progress=True)

query = "Where does the story take place?"
top_passages = 10

passages = index.as_retriever(similarity_top_k=top_passages).retrieve(
    f"Represent this sentence for searching relevant passages: {query}"
)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 22.44it/s]
Generating embeddings: 100%|██████████| 127/127 [00:08<00:00, 14.61it/s]


In [11]:
passages

[NodeWithScore(node=TextNode(id_='50a7fa5e-5fbe-4466-9154-12a14f0ba000', embedding=None, metadata={'file_path': '/Users/us65093/Documents/scai/documents/alice.txt', 'file_name': 'alice.txt', 'file_type': 'text/plain', 'file_size': 154573, 'creation_date': '2024-10-30', 'last_modified_date': '2024-10-30'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='556545d4-c499-45d2-97f3-8b5d12580892', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': '/Users/us65093/Documents/scai/documents/alice.txt', 'file_name': 'alice.txt', 'file_type': 'text/plain', 'file_size': 154573, 'creation_date': '2024-10-30', 'last_modified_date': '2024-10-30'}, hash='fa7f5213fea0a4406606b7ce0b8a2286ba9013e47b1fd0ff513f20328da0

## Node Parsers

### Hierarchical Node Parser

In [32]:
from llama_index.core import StorageContext
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.storage.docstore import SimpleDocumentStore

In [31]:
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])

nodes = node_parser.get_nodes_from_documents(dataset, show_progress=True)

Parsing documents into nodes: 100%|██████████| 1/1 [00:00<00:00,  6.91it/s]


We can build a retriever over the leaf nodes, i.e. the nodes with the smallest chunk size.

When passages are found, `AutoMergingRetriever` tries to merge the retrieved chunks with their parent chunks from the higher levels, so that related passages come back as a single, larger context.
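
To make the merging idea concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the function name `merge_into_parents`, the plain-dict hierarchy, and the `ratio_threshold` of 0.5 are assumptions chosen for illustration, and it merges only one level up, whereas the retriever works across the whole hierarchy.

```python
from collections import defaultdict


def merge_into_parents(retrieved_ids, parent_of, children_of, ratio_threshold=0.5):
    """Replace retrieved chunks with their parent when enough siblings were hit.

    Hypothetical helper: a parent replaces its retrieved children if the
    fraction of its children that were retrieved exceeds ratio_threshold.
    """
    # Group the retrieved chunk ids by their parent chunk.
    hits_by_parent = defaultdict(list)
    for node_id in retrieved_ids:
        parent = parent_of.get(node_id)
        if parent is not None:
            hits_by_parent[parent].append(node_id)

    # Decide which children get swallowed by their parent.
    merged_children = set()
    for parent, hits in hits_by_parent.items():
        if len(hits) / len(children_of[parent]) > ratio_threshold:
            merged_children.update(hits)

    # Rebuild the result in retrieval order, emitting each parent only once.
    result, emitted_parents = [], set()
    for node_id in retrieved_ids:
        if node_id in merged_children:
            parent = parent_of[node_id]
            if parent not in emitted_parents:
                result.append(parent)
                emitted_parents.add(parent)
        else:
            result.append(node_id)
    return result


# Toy hierarchy: two parent chunks with three and four leaf chunks each.
children_of = {"P1": ["a", "b", "c"], "P2": ["d", "e", "f", "g"]}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

# "a" and "b" cover 2/3 of P1, so they merge; "d" covers only 1/4 of P2.
print(merge_into_parents(["a", "d", "b"], parent_of, children_of))
```

In this toy run the two hits under `P1` collapse into the parent, while the single hit under `P2` stays a leaf. The point of the design is that one isolated match keeps its small, precise chunk, and several matches in the same region are returned as one larger passage.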

In [18]:
# retrieve nodes from the bottom of the hierarchy
leaf_nodes = get_leaf_nodes(nodes)

# store all nodes (not only the leaves) so parent chunks can be looked up later
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)

# this will allow us to search for the leaf nodes
base_index = VectorStoreIndex(
    leaf_nodes, storage_context=storage_context, show_progress=True
)

Generating embeddings: 100%|██████████| 470/470 [00:08<00:00, 55.24it/s]


In [19]:
from llama_index.core.retrievers import AutoMergingRetriever

# same kind of vector retriever as before, but these nodes keep their parent/child relationships
base_retriever = base_index.as_retriever(similarity_top_k=5)

# this will merge the results from the leaf nodes with the context from the higher levels
retriever = AutoMergingRetriever(base_retriever, storage_context, verbose=True)

In [20]:
query_str = "How did Alice enter the magical world?"

retrieved_base_nodes = base_retriever.retrieve(query_str)
retrieved_nodes = retriever.retrieve(query_str)

> Merging 1 nodes into parent node.
> Parent node id: f6fba588-e982-4998-b66b-345ada701392.
> Parent node text: “How am I to get in?” asked Alice again, in a louder tone.

“_Are_ you to get in at all?” said ...



In [21]:
print(f"Retrieved {len(retrieved_base_nodes)} base nodes")
print(f"Retrieved {len(retrieved_nodes)} nodes")

Retrieved 5 base nodes
Retrieved 5 nodes


In [22]:
retrieved_nodes[0].text

'However, on the second\r\ntime round, she came upon a low curtain she had not noticed before, and\r\nbehind it was a little door about fifteen inches high: she tried the\r\nlittle golden key in the lock, and to her great delight it fitted!\r\n\r\nAlice opened the door and found that it led into a small passage, not\r\nmuch larger than a rat-hole: she knelt down and looked along the\r\npassage into the loveliest garden you ever saw.'

In [23]:
retrieved_nodes[0].node.relationships

{<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='556545d4-c499-45d2-97f3-8b5d12580892', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': '/Users/us65093/Documents/scai/documents/alice.txt', 'file_name': 'alice.txt', 'file_type': 'text/plain', 'file_size': 154573, 'creation_date': '2024-10-30', 'last_modified_date': '2024-10-30'}, hash='fa7f5213fea0a4406606b7ce0b8a2286ba9013e47b1fd0ff513f20328da0e6ae'),
 <NodeRelationship.PARENT: '4'>: RelatedNodeInfo(node_id='14bd6afe-a5d1-4482-a2d8-90a252e46f2f', node_type=<ObjectType.TEXT: '1'>, metadata={'file_path': '/Users/us65093/Documents/scai/documents/alice.txt', 'file_name': 'alice.txt', 'file_type': 'text/plain', 'file_size': 154573, 'creation_date': '2024-10-30', 'last_modified_date': '2024-10-30'}, hash='ae9a7b8d150920c27ba9dcc542c2a4a468edeeea3e4aa7c97cc071a78a9c19e8')}

In [24]:
node_dict = {node.node_id: node for node in nodes}

In [25]:
# look up the parent chunk (NodeRelationship.PARENT) of the top retrieved node
node_dict[retrieved_nodes[0].node.parent_node.node_id].text

'However, on the second\r\ntime round, she came upon a low curtain she had not noticed before, and\r\nbehind it was a little door about fifteen inches high: she tried the\r\nlittle golden key in the lock, and to her great delight it fitted!\r\n\r\nAlice opened the door and found that it led into a small passage, not\r\nmuch larger than a rat-hole: she knelt down and looked along the\r\npassage into the loveliest garden you ever saw. How she longed to get\r\nout of that dark hall, and wander about among those beds of bright\r\nflowers and those cool fountains, but she could not even get her head\r\nthrough the doorway; “and even if my head would go through,” thought\r\npoor Alice, “it would be of very little use without my shoulders. Oh,\r\nhow I wish I could shut up like a telescope! I think I could, if I only\r\nknew how to begin.” For, you see, so many out-of-the-way things had\r\nhappened lately, that Alice had begun to think that very few things\r\nindeed were really impossible.'

## Hybrid Search

In [26]:
from llama_index.core import SimpleKeywordTableIndex

In [27]:
keyword_index = SimpleKeywordTableIndex(
    leaf_nodes,  # using the same leaf_nodes
    storage_context=storage_context,
    show_progress=True,
)

Extracting keywords from nodes: 100%|██████████| 470/470 [00:00<00:00, 16050.90it/s]


In [28]:
# keyword retrievers limit results with num_chunks_per_query, not similarity_top_k
keyword_retriever = keyword_index.as_retriever(num_chunks_per_query=5)
keyword_passages = keyword_retriever.retrieve("White Rabbit")

[p.node.text for p in keyword_passages]

['“It isn’t directed at all,” said the White Rabbit; “in fact, there’s\r\nnothing written on the _outside_.” He unfolded the paper as he spoke,\r\nand added “It isn’t a letter, after all: it’s a set of verses.”\r\n\r\n“Are they in the prisoner’s handwriting?” asked another of the jurymen.\r\n\r\n“No, they’re not,” said the White Rabbit, “and that’s the queerest\r\nthing about it.” (The jury all looked puzzled.)',
 'The White Rabbit put on his spectacles. “Where shall I begin, please\r\nyour Majesty?” he asked.',
 'Imagine her\r\nsurprise, when the White Rabbit read out, at the top of his shrill\r\nlittle voice, the name “Alice!”\r\n\r\n\r\n\r\n\r\nCHAPTER XII.',
 'It was the White\r\nRabbit returning, splendidly dressed, with a pair of white kid gloves\r\nin one hand and a large fan in the other: he came trotting along in a\r\ngreat hurry, muttering to himself as he came, “Oh! the Duchess, the\r\nDuchess! Oh!',
 '“Herald, read the accusation!” said the King.\r\n\r\nOn this the White Ra