# Part 2: Building a custom vector index

## Setup

### Imports

In [None]:
import pprint
import json
import os
import sys
from pathlib import Path
from loguru import logger
import llama_index

### Logging

In [None]:
logger.remove()
logger.add(sys.stdout, format="{time} - {level} - {message}", level="DEBUG")
logger.add("tutorial_part_2.log", level="DEBUG")

4

In [None]:
DATA_PATH = Path("data/julia_evans/blogposts.json")

### Load Julia Evans blogpost data from file

In [None]:
# Read the JSON file directly; an explicit encoding avoids platform-dependent defaults
with open(DATA_PATH, 'r', encoding='utf-8') as infile:
    blogposts = json.load(infile)

In [None]:
logger.info(f"Loaded {len(blogposts)} blogposts from file.")

2023-09-04T16:05:52.946129+0200 - INFO - Loaded 20 blogposts from file.


In [None]:
for idx, post in enumerate(blogposts):
    print(f"{idx+1}: {post['title']}")

1: Notes on using a single-person Mastodon server
2: What helps people get comfortable on the command line?
3: Some tactics for writing in public
4: Behind "Hello World" on Linux
5: Why is DNS still hard to learn?
6: Lima: a nice way to run Linux VMs on Mac
7: Open sourcing the nginx playground
8: New zine: How Integers and Floats Work
9: Some blogging myths
10: New playground: memory spy
11: Introducing "Implement DNS in a Weekend"
12: New talk: Learning DNS in 10 years
13: New playground: integer.exposed
14: A list of programming playgrounds
15: Building a custom site for zine feedback
16: Some possible reasons for 8-bit bytes
17: How do Nix builds work?
18: Some notes on using nix
19: Writing Javascript without a build system
20: Print copies of The Pocket Guide to Debugging have arrived


In [None]:
print("Example blogpost:")
pprint.pprint(blogposts[0])

Example blogpost:
{'author': 'Julia Evans',
 'guidislink': False,
 'id': 'https://jvns.ca/blog/2023/08/11/some-notes-on-mastodon/',
 'link': 'https://jvns.ca/blog/2023/08/11/some-notes-on-mastodon/',
 'text': "I started using Mastodon back in November, and it's the Twitter "
         "alternative where I've been spending most of my time recently, "
         'mostly because the Fediverse is where a lot of the Linux nerds seem '
         'to be right now.\n'
         '\n'
         "I've found Mastodon quite a bit more confusing than Twitter because "
         "it's a distributed system, so here are a few technical things I've "
         "learned about it over the last 10 months. I'll mostly talk about "
         'what using a single-person server has been like for me, as well as a '
         'couple of notes about the API, DMs and ActivityPub.\n'
         '\n'
         "I might have made some mistakes, please let me know if I've gotten "
         'anything wrong!\n'
         '\n'
       

### Create a service context
- No OpenAI API calls
- No large local LLM
- Only one of the smallest sentence-transformers embedding models

- all-minilm-l6-v2 has a maximum sequence length of 256 tokens, which is why `chunk_size` is set to 256 below
- Source: https://www.sbert.net/docs/pretrained_models.html#model-overview

In [None]:
service_context = llama_index.ServiceContext.from_defaults(
    embed_model="local:sentence-transformers/all-minilm-l6-v2",
    chunk_size=256,
    llm=None,
)

LLM is explicitly disabled. Using MockLLM.
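
To get a feel for what `chunk_size=256` means, here is a minimal sketch of fixed-size chunking. It is not how LlamaIndex splits text: the hypothetical helper `chunk_tokens` simply counts whitespace-separated words, whereas a real splitter counts tokens with a tokenizer. The point is only that a long document becomes several chunks, and each chunk gets its own embedding.

```python
from typing import List


def chunk_tokens(text: str, chunk_size: int = 256) -> List[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words.

    Assumption: words stand in for tokens here; a real tokenizer would
    usually produce more tokens than words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


# A made-up 600-word "document" for illustration
sample = " ".join(f"word{i}" for i in range(600))
chunks = chunk_tokens(sample, chunk_size=256)
print([len(chunk.split()) for chunk in chunks])  # [256, 256, 88]
```

This is why the retrieval results further down are chunks of blogposts rather than whole blogposts.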


### Create documents from the blogposts

In [None]:
documents = [llama_index.Document(text=blogpost['text']) for blogpost in blogposts]

In [None]:
len(documents)

20

In [None]:
print("Example document:")
documents[0]

Example document:


Document(id_='bf487c38-61b3-435b-a674-42020f96a23b', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='74bb2056380d3e3871dfa6ae4f2b4d9af8e908962b533e58819a94d497caca5b', text='I started using Mastodon back in November, and it\'s the Twitter alternative where I\'ve been spending most of my time recently, mostly because the Fediverse is where a lot of the Linux nerds seem to be right now.\n\nI\'ve found Mastodon quite a bit more confusing than Twitter because it\'s a distributed system, so here are a few technical things I\'ve learned about it over the last 10 months. I\'ll mostly talk about what using a single-person server has been like for me, as well as a couple of notes about the API, DMs and ActivityPub.\n\nI might have made some mistakes, please let me know if I\'ve gotten anything wrong!\n\n### what\'s a mastodon instance?\n\nFirst: Mastodon is a decentralized collection of independently run servers instead of One

# Exercises

In [None]:
# Create a vector index from `documents` (this could take a minute because we're processing lots of text)

logger.info(f"Building a VectorStoreIndex from {len(documents)} documents...")
index = llama_index.VectorStoreIndex.from_documents(documents, service_context=service_context)
logger.info(f"Built a VectorStoreIndex from {len(documents)} documents")

2023-09-04T16:13:59.257738+0200 - INFO - Building a VectorStoreIndex from 20 documents...
2023-09-04T16:14:17.641477+0200 - INFO - Built a VectorStoreIndex from 20 documents


In [None]:
# Retrieve blogposts about DNS

retriever = index.as_retriever(similarity_top_k=10)
results = retriever.retrieve('DNS')

for result in results:
    print(result.score)
    print(result.node)
    print()

0.5832783075911185
id_='a3213ab0-7f19-4a23-a038-8d3ae8efee0a' embedding=None metadata={} excluded_embed_metadata_keys=[] excluded_llm_metadata_keys=[] relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='f0809c20-2168-4190-a507-957375775e1a', node_type=None, metadata={}, hash='60b34270cb3fa7f0974c09e0296fdf2c4bf98715c0b44a4136444b014ef4e031'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='6f3f343e-d201-4f40-bc0a-83018c288127', node_type=None, metadata={}, hash='96e930b94c401d3dfddc7dc30b4068ca399d7944a226f040b7b30bbcc81aeb16'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='99351f7b-ed28-4d0c-a586-5555996ef029', node_type=None, metadata={}, hash='c37eb86ecec3c89858b619694fd1fdecd2ad53ebcb69a269cdc49d0eb7158cfa')} hash='9c35a586205d34714de2df12ecbe0f75475bf66b1af46d9776c710d3ccb5f0e1' text="Then we'll talk about reading the specification, we'll going to do some experiments, and we're going to implement our own terrible version of DNS.[](https://jv
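
The number printed before each node is a similarity score between the query embedding and the chunk embedding. Assuming the index uses cosine similarity (which I believe is the default for the in-memory vector store), the retrieval step can be sketched in plain Python as below. The vectors and the names `cosine_similarity`, `top_k` and `toy_chunks` are made up for illustration; real embeddings from all-minilm-l6-v2 have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as not similar to anything
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the k (score, name) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" (hypothetical data)
toy_chunks = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.6, 0.8],
    "chunk_c": [0.0, 1.0],
}
toy_results = top_k([1.0, 0.0], toy_chunks, k=2)
for score, name in toy_results:
    print(f"{score:.2f}\t{name}")
```

With `similarity_top_k=10`, the retriever above does the equivalent of `k=10` over all chunk embeddings in the index.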

In [None]:
# It's hard to tell which blogpost each result comes from. Create a new set of Documents with the blogpost title in the metadata.

documents = []
for blogpost in blogposts:
    document = llama_index.Document(text=blogpost['text'], metadata={'title': blogpost['title']})
    documents.append(document)

print("Example of a blogpost Document with metadata:")
pprint.pprint(dict(documents[0]))


Example of a blogpost Document with metadata:
{'embedding': None,
 'end_char_idx': None,
 'excluded_embed_metadata_keys': [],
 'excluded_llm_metadata_keys': [],
 'hash': 'eb42c6e922c903ae6b3bee9b7c6d07107235d2de3f201c00d2cfbac1945bb268',
 'id_': '70bd7a2b-2765-44e6-abb7-9cd23068d893',
 'metadata': {'title': 'Notes on using a single-person Mastodon server'},
 'metadata_seperator': '\n',
 'metadata_template': '{key}: {value}',
 'relationships': {},
 'start_char_idx': None,
 'text': "I started using Mastodon back in November, and it's the Twitter "
         "alternative where I've been spending most of my time recently, "
         'mostly because the Fediverse is where a lot of the Linux nerds seem '
         'to be right now.\n'
         '\n'
         "I've found Mastodon quite a bit more confusing than Twitter because "
         "it's a distributed system, so here are a few technical things I've "
         "learned about it over the last 10 months. I'll mostly talk about "
         'wha

In [None]:
# Build a vector index from these new documents

index = llama_index.VectorStoreIndex.from_documents(documents, service_context=service_context)


In [None]:
# Retrieve blogposts about DNS using the new index. This time, use the metadata to check which blogpost each result comes from.
# Note: without `similarity_top_k`, the retriever only returns the top 2 results by default.

retriever = index.as_retriever()
results = retriever.retrieve('DNS')

for result in results:
    print(result.score)
    print(result.node.metadata)
    print()

0.6294323463788487
{'author': 'Julia Evans', 'title': 'New talk: Learning DNS in 10 years'}

0.5798348508030715
{'author': 'Julia Evans', 'title': 'Why is DNS still hard to learn?'}
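
Because the index stores chunks rather than whole blogposts, a larger `similarity_top_k` can return several chunks from the same post. One possible way to turn chunk hits into a ranking of posts is to keep the best score per title. The sketch below uses made-up scores and titles, and `best_score_per_title` is a hypothetical helper, not part of LlamaIndex.

```python
def best_score_per_title(scored_chunks):
    """Collapse (score, title) chunk hits into one best score per title."""
    best = {}
    for score, title in scored_chunks:
        if title not in best or score > best[title]:
            best[title] = score
    # Highest-scoring post first
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical chunk hits: two chunks from "Post A", one from "Post B"
toy_hits = [(0.63, "Post A"), (0.58, "Post B"), (0.55, "Post A")]
ranked_posts = best_score_per_title(toy_hits)
print(ranked_posts)  # [('Post A', 0.63), ('Post B', 0.58)]
```

With real results, the input could be built as `[(r.score, r.node.metadata['title']) for r in results]`.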



In [None]:
# Store your index to disk

index_path = 'indices/julia_evans_blogposts'
index.storage_context.persist(index_path)
logger.info(f"Saved VectorStoreIndex to {index_path}")

2023-09-04T16:28:36.981963+0200 - INFO - Saved VectorStoreIndex to indices/julia_evans_blogposts


In [None]:
# Load your index from disk into a new index variable. Hint: you need two ingredients: a storage context and a service context.

index_path = 'indices/julia_evans_blogposts'
storage_context = llama_index.StorageContext.from_defaults(persist_dir=index_path)
service_context = llama_index.ServiceContext.from_defaults(embed_model="local:sentence-transformers/all-minilm-l6-v2", chunk_size=256, llm=None)
index2 = llama_index.load_index_from_storage(storage_context, service_context=service_context)

LLM is explicitly disabled. Using MockLLM.


In [None]:
# Expand the documents in your index with metadata such as original URL, author, and update time.

In [None]:
pprint.pprint(blogposts[0])

documents = []
for blogpost in blogposts:
    metadata = {}
    metadata['author'] = blogpost['author']
    metadata['url'] = blogpost['link']
    metadata['updated_at'] = blogpost['updated']
    metadata['title'] = blogpost['title']
    document = llama_index.Document(text=blogpost['text'], metadata=metadata)
    documents.append(document)
    

{'author': 'Julia Evans',
 'guidislink': False,
 'id': 'https://jvns.ca/blog/2023/08/11/some-notes-on-mastodon/',
 'link': 'https://jvns.ca/blog/2023/08/11/some-notes-on-mastodon/',
 'text': "I started using Mastodon back in November, and it's the Twitter "
         "alternative where I've been spending most of my time recently, "
         'mostly because the Fediverse is where a lot of the Linux nerds seem '
         'to be right now.\n'
         '\n'
         "I've found Mastodon quite a bit more confusing than Twitter because "
         "it's a distributed system, so here are a few technical things I've "
         "learned about it over the last 10 months. I'll mostly talk about "
         'what using a single-person server has been like for me, as well as a '
         'couple of notes about the API, DMs and ActivityPub.\n'
         '\n'
         "I might have made some mistakes, please let me know if I've gotten "
         'anything wrong!\n'
         '\n'
         "### what's a ma

In [None]:
# Create an index from these new documents. Retrieve 20 search results about Nix (a package manager).

index = llama_index.VectorStoreIndex.from_documents(documents, service_context=service_context)

retriever = index.as_retriever(similarity_top_k=20)
results = retriever.retrieve("Nix")

In [None]:
# For all 20 chunks, print their score, URL and the date the blogpost was last updated.

for result in results:
    print(f"{result.score:.2f}\t{result.node.metadata['url']}\t{result.node.metadata['updated_at']}")

0.75	https://jvns.ca/blog/2023/02/28/some-notes-on-using-nix/	2023-02-28T23:16:17+00:00
0.70	https://jvns.ca/blog/2023/02/28/some-notes-on-using-nix/	2023-02-28T23:16:17+00:00
0.68	https://jvns.ca/blog/2023/02/28/some-notes-on-using-nix/	2023-02-28T23:16:17+00:00
0.68	https://jvns.ca/blog/2023/02/28/some-notes-on-using-nix/	2023-02-28T23:16:17+00:00
0.67	https://jvns.ca/blog/2023/03/03/how-do-nix-builds-work-/	2023-03-03T11:19:25+00:00
0.67	https://jvns.ca/blog/2023/02/28/some-notes-on-using-nix/	2023-02-28T23:16:17+00:00
0.67	https://jvns.ca/blog/2023/03/03/how-do-nix-builds-work-/	2023-03-03T11:19:25+00:00
0.67	https://jvns.ca/blog/2023/02/28/some-notes-on-using-nix/	2023-02-28T23:16:17+00:00
0.66	https://jvns.ca/blog/2023/03/03/how-do-nix-builds-work-/	2023-03-03T11:19:25+00:00
0.66	https://jvns.ca/blog/2023/02/28/some-notes-on-using-nix/	2023-02-28T23:16:17+00:00
0.66	https://jvns.ca/blog/2023/02/28/some-notes-on-using-nix/	2023-02-28T23:16:17+00:00
0.66	https://jvns.ca/blog/2023/0
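
The same two URLs keep coming back because each hit is a chunk, not a blogpost. A quick way to summarise such a result list is to count hits per URL. The sketch below uses the first eleven rows of the output above, copied by hand into a list; with live results you could build the list as `[(r.score, r.node.metadata['url']) for r in results]` instead.

```python
from collections import Counter

NIX_NOTES = "https://jvns.ca/blog/2023/02/28/some-notes-on-using-nix/"
NIX_BUILDS = "https://jvns.ca/blog/2023/03/03/how-do-nix-builds-work-/"

# (score, url) pairs taken from the first eleven rows of the output above
rows = [
    (0.75, NIX_NOTES),
    (0.70, NIX_NOTES),
    (0.68, NIX_NOTES),
    (0.68, NIX_NOTES),
    (0.67, NIX_BUILDS),
    (0.67, NIX_NOTES),
    (0.67, NIX_BUILDS),
    (0.67, NIX_NOTES),
    (0.66, NIX_BUILDS),
    (0.66, NIX_NOTES),
    (0.66, NIX_NOTES),
]

# Count how many retrieved chunks belong to each blogpost
hits_per_url = Counter(url for _, url in rows)
for url, count in hits_per_url.most_common():
    print(f"{count}\t{url}")
```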