# Part 2: Building a custom vector index

## Setup

### Imports

In [None]:
import pprint
import json
import os
import sys
from pathlib import Path
from loguru import logger
import llama_index

### Logging

In [None]:
logger.remove()
logger.add(sys.stdout, format="{time} - {level} - {message}", level="DEBUG")
logger.add("tutorial_part_2.log", level="DEBUG")

2

In [None]:
DATA_PATH = Path("data/julia_evans/blogposts.json")

### Load Julia Evans blogpost data from file

In [None]:
# Parse the JSON file directly from the file handle.
with open(DATA_PATH, "r", encoding="utf-8") as infile:
    blogposts = json.load(infile)

In [None]:
logger.info(f"Loaded {len(blogposts)} blogposts from file.")

2023-09-04T17:07:40.027379+0200 - INFO - Loaded 20 blogposts from file.


In [None]:
for idx, post in enumerate(blogposts):
    print(f"{idx+1}: {post['title']}")

1: Notes on using a single-person Mastodon server
2: What helps people get comfortable on the command line?
3: Some tactics for writing in public
4: Behind "Hello World" on Linux
5: Why is DNS still hard to learn?
6: Lima: a nice way to run Linux VMs on Mac
7: Open sourcing the nginx playground
8: New zine: How Integers and Floats Work
9: Some blogging myths
10: New playground: memory spy
11: Introducing "Implement DNS in a Weekend"
12: New talk: Learning DNS in 10 years
13: New playground: integer.exposed
14: A list of programming playgrounds
15: Building a custom site for zine feedback
16: Some possible reasons for 8-bit bytes
17: How do Nix builds work?
18: Some notes on using nix
19: Writing Javascript without a build system
20: Print copies of The Pocket Guide to Debugging have arrived


In [None]:
print("Example blogpost:")
pprint.pprint(blogposts[0])

Example blogpost:
{'author': 'Julia Evans',
 'guidislink': False,
 'id': 'https://jvns.ca/blog/2023/08/11/some-notes-on-mastodon/',
 'link': 'https://jvns.ca/blog/2023/08/11/some-notes-on-mastodon/',
 'text': "I started using Mastodon back in November, and it's the Twitter "
         "alternative where I've been spending most of my time recently, "
         'mostly because the Fediverse is where a lot of the Linux nerds seem '
         'to be right now.\n'
         '\n'
         "I've found Mastodon quite a bit more confusing than Twitter because "
         "it's a distributed system, so here are a few technical things I've "
         "learned about it over the last 10 months. I'll mostly talk about "
         'what using a single-person server has been like for me, as well as a '
         'couple of notes about the API, DMs and ActivityPub.\n'
         '\n'
         "I might have made some mistakes, please let me know if I've gotten "
         'anything wrong!\n'
         '\n'
       

### Create a service context

- No OpenAI API calls
- No large local LLM
- Only one of the smallest sentence-transformers embedding models

- all-MiniLM-L6-v2 has a maximum input length of 256 tokens, which is why `chunk_size` is set to 256 below
- Source: https://www.sbert.net/docs/pretrained_models.html#model-overview
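
To make the 256-token limit concrete, here is a minimal sketch of fixed-size chunking. It is an illustration only, not how llama_index splits text: `chunk_tokens` is a hypothetical helper, and whitespace-separated words stand in for the model's real tokens.

```python
def chunk_tokens(text: str, chunk_size: int = 256) -> list[str]:
    """Split text into pieces of at most `chunk_size` whitespace-separated tokens."""
    # Assumption: whitespace tokens stand in for the model's real tokens.
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


sample = "word " * 600  # hypothetical text of 600 tokens
chunks = chunk_tokens(sample, chunk_size=256)
print([len(chunk.split()) for chunk in chunks])  # [256, 256, 88]
```

Every chunk fits within the model's input limit, so no text is lost when a chunk is embedded.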

In [None]:
service_context = llama_index.ServiceContext.from_defaults(
    embed_model="local:sentence-transformers/all-minilm-l6-v2",
    chunk_size=256,
    llm=None,
)

LLM is explicitly disabled. Using MockLLM.


### Create documents from the blogposts

In [None]:
documents = [llama_index.Document(text=blogpost['text']) for blogpost in blogposts]

In [None]:
len(documents)

20

In [None]:
print("Example document:")
documents[0]

Example document:


Document(id_='2f48c07d-395c-46a7-8ceb-2e3eda7e8a99', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='74bb2056380d3e3871dfa6ae4f2b4d9af8e908962b533e58819a94d497caca5b', text='I started using Mastodon back in November, and it\'s the Twitter alternative where I\'ve been spending most of my time recently, mostly because the Fediverse is where a lot of the Linux nerds seem to be right now.\n\nI\'ve found Mastodon quite a bit more confusing than Twitter because it\'s a distributed system, so here are a few technical things I\'ve learned about it over the last 10 months. I\'ll mostly talk about what using a single-person server has been like for me, as well as a couple of notes about the API, DMs and ActivityPub.\n\nI might have made some mistakes, please let me know if I\'ve gotten anything wrong!\n\n### what\'s a mastodon instance?\n\nFirst: Mastodon is a decentralized collection of independently run servers instead of One

## Exercises

### Create a vector index from `documents` (this could take a minute because we're processing lots of text)

### Retrieve blogposts about DNS
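
As background for this exercise, the sketch below shows roughly what a vector retriever does: it embeds the query, scores every stored chunk by similarity, and returns the best matches. This is a toy illustration rather than the llama_index implementation. The helpers `cosine_similarity` and `top_k`, the chunk names, and the 3-dimensional vectors are all made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k=2):
    """Return the k (score, chunk_id) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), chunk_id)
        for chunk_id, vector in chunk_vectors.items()
    ]
    return sorted(scored, reverse=True)[:k]


# Hypothetical 3-dimensional embeddings; real all-MiniLM-L6-v2 vectors have 384 dimensions.
toy_chunks = {
    "dns-chunk": [0.9, 0.1, 0.0],
    "nix-chunk": [0.0, 0.2, 0.9],
    "mastodon-chunk": [0.1, 0.9, 0.1],
}
toy_query = [1.0, 0.0, 0.0]  # stands in for the embedding of the query "DNS"

results = top_k(toy_query, toy_chunks, k=2)
for score, chunk_id in results:
    print(f"{score:.3f}  {chunk_id}")
```

The chunk whose vector points in nearly the same direction as the query ranks first. In the real index, the `similarity_top_k`-style setting of the retriever plays the role of `k` here.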

### It's really hard to tell which blogposts these results come from! Create a set of Documents that use the blogpost title as their id.

### Build a vector index from these new documents

### Retrieve blogposts about DNS using the new index. This time, use the id of each result's source node to check which blogpost it comes from.

### Store your index to disk

### Load your index from disk in a new index variable. Hint: you need two ingredients: a storage context and a service context.

### Expand the documents in your index with metadata such as title, original URL, author, and update time.
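
One possible starting point is to first collect the fields for each document in a plain dictionary. The sketch below assumes the blogpost keys seen earlier (`title`, `link`, `author`, `text`). The `updated` key is a guess, because the example printout is cut off before any date field appears. `to_document_fields` is a hypothetical helper, not part of llama_index.

```python
def to_document_fields(blogpost):
    """Collect the pieces a document needs: text, an id, and a metadata dict."""
    return {
        "text": blogpost["text"],
        "id_": blogpost["title"],
        "metadata": {
            "title": blogpost["title"],
            "url": blogpost["link"],
            "author": blogpost["author"],
            # Assumption: the update time lives under "updated"; .get() yields None if it does not.
            "updated": blogpost.get("updated"),
        },
    }


# Hypothetical, abbreviated record shaped like the example blogpost printed earlier.
example_post = {
    "title": "Notes on using a single-person Mastodon server",
    "link": "https://jvns.ca/blog/2023/08/11/some-notes-on-mastodon/",
    "author": "Julia Evans",
    "text": "I started using Mastodon back in November...",
}
fields = to_document_fields(example_post)
print(fields["metadata"])
```

The field names `text`, `id_` and `metadata` match the attributes shown in the `Document(...)` printout above, so a dictionary like this could plausibly be unpacked into `llama_index.Document(**fields)`. Check the key names in your own data first.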

### Create an index from these new documents. Retrieve 20 search results about Nix (a package manager).

### For all 20 chunks, print their score, url and the date the blogpost was published.