# Part 2: Building a custom vector index

How to use this notebook: 
1. Execute all cells under "Setup"
2. Fill in the exercises below

## Setup

### Imports

In [None]:
import pprint
import json
import os
import sys
from pathlib import Path
from loguru import logger
import llama_index

### Logging

In [None]:
logger.remove()
logger.add(sys.stdout, format="{time} - {level} - {message}", level="DEBUG")
logger.add("tutorial_part_2.log", level="DEBUG")

In [None]:
DATA_PATH = Path("data/julia_evans/blogposts.json")

### Load Julia Evans blogpost data from file

In [None]:
with open(DATA_PATH, 'r', encoding='utf-8') as infile:
    blogposts = json.load(infile)

In [None]:
logger.info(f"Loaded {len(blogposts)} blogposts from file.")

In [None]:
for idx, post in enumerate(blogposts):
    print(f"{idx+1}: {post['title']}")

In [None]:
print("Example blogpost:")
pprint.pprint(blogposts[0])

### Create a service context
- No OpenAI API calls
- No large local LLM
- Just use the smallest sentence-transformers embeddings model

- all-MiniLM-L6-v2 has a maximum sequence length of 256 tokens (word pieces) and truncates longer input, which is why `chunk_size` is set to 256 below
- source: https://www.sbert.net/docs/pretrained_models.html#model-overview

In [None]:
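service_context = llama_index.ServiceContext.from_defaults(
    embed_model="local:sentence-transformers/all-minilm-l6-v2",  # small local embedding model
    chunk_size=256,  # match the model's maximum sequence length
    llm=None,  # no LLM, so no OpenAI API calls
)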
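
To make `chunk_size=256` more concrete, here is a minimal sketch of fixed-size chunking. It is an illustration only: it counts whitespace-separated words, whereas the real splitter counts tokens with a tokenizer, and `split_into_chunks` is a hypothetical helper, not a llama_index function.

```python
def split_into_chunks(text: str, chunk_size: int = 256) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# Stand-in for a long blogpost: 600 identical words (made-up data).
sample_text = "word " * 600
chunks = split_into_chunks(sample_text, chunk_size=256)
chunk_lengths = [len(chunk.split()) for chunk in chunks]
print(chunk_lengths)  # → [256, 256, 88]
```

Every chunk is embedded separately, so a long blogpost ends up as several vectors in the index.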

### Create documents from the blogposts

In [None]:
documents = [llama_index.Document(text=blogpost['text']) for blogpost in blogposts]

In [None]:
len(documents)

In [None]:
print("Example document:")
documents[0]

## Exercises

### Create a vector index from `documents` (this could take a minute because we're processing lots of text)

### Retrieve blogposts about DNS
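
As a hint at what a retriever does conceptually: it embeds the query and ranks the stored chunks by similarity to it. The sketch below is a toy version under loud assumptions: word counts stand in for sentence-transformers embeddings, the chunk texts are made up, and `embed`, `cosine_similarity` and `retrieve` are hypothetical helpers rather than llama_index functions.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real index uses a neural model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the `top_k` chunks most similar to the query, best first."""
    query_vector = embed(query)
    scored = [(cosine_similarity(query_vector, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Made-up chunk texts, not taken from the blogposts.
toy_chunks = [
    "dns resolvers cache dns records",
    "how to debug a python program",
    "a dns query asks a resolver for records",
]
results = retrieve("dns records", toy_chunks, top_k=2)
for score, chunk in results:
    print(f"{score:.2f}  {chunk}")
```

Note that the results contain only a score and the chunk text, which is exactly why it is hard to tell which blogpost a result belongs to.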

### It's really hard to tell which blogposts these results come from! Create a set of Documents that use the blogpost title as their id.

### Build a vector index from these new documents

### Retrieve blogposts about DNS using the new index. This time, use the id of each result's source node to check which blogpost it comes from.
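
The idea behind this exercise, sketched without llama_index: when a document is cut into chunks, each chunk keeps a reference to the id of the document it came from, so any retrieved chunk can be traced back to its blogpost. `Chunk`, `source_id` and `chunk_posts` are hypothetical names for illustration, and the two posts are made up.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source_id: str  # id of the document this chunk was cut from


def chunk_posts(posts: list[dict], words_per_chunk: int = 5) -> list[Chunk]:
    """Cut every post into small chunks that remember the post title as their source id."""
    chunks = []
    for post in posts:
        words = post["text"].split()
        for start in range(0, len(words), words_per_chunk):
            text = " ".join(words[start:start + words_per_chunk])
            chunks.append(Chunk(text=text, source_id=post["title"]))
    return chunks


# Made-up posts with the same "title" and "text" keys used earlier in this notebook.
toy_posts = [
    {"title": "How DNS works", "text": "a resolver looks up records for a domain name"},
    {"title": "Debugging tips", "text": "read the error message first"},
]
toy_post_chunks = chunk_posts(toy_posts)
for chunk in toy_post_chunks:
    print(f"{chunk.source_id!r}: {chunk.text}")
```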

### Store your index to disk

### Load your index from disk in a new index variable. Hint: you need two ingredients: a storage context and a service context.
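
One way to think about the two ingredients: what is written to disk is the data (texts and vectors), not the embedding model, so the model has to be supplied again when loading. The toy round trip below illustrates this with a JSON file. It is a sketch under assumptions, not the llama_index storage format; `toy_embed` is a hypothetical stand-in for an embedding model and the texts are made up.

```python
import json
import tempfile
from pathlib import Path


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: counts of three fixed keywords."""
    words = text.lower().split()
    return [float(words.count(keyword)) for keyword in ("dns", "nix", "python")]


texts = ["dns and more dns", "installing nix", "python packaging"]
toy_index = [{"text": text, "vector": toy_embed(text)} for text in texts]

with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = Path(tmp_dir) / "toy_index.json"
    # Store: only texts and vectors end up on disk.
    index_path.write_text(json.dumps(toy_index), encoding="utf-8")
    # Load: the file gives us the data back, but not the embedding function.
    loaded_index = json.loads(index_path.read_text(encoding="utf-8"))

# Querying the loaded index still needs the same embedding function as before.
query_vector = toy_embed("dns")
best_match = max(
    loaded_index,
    key=lambda entry: sum(q * v for q, v in zip(query_vector, entry["vector"])),
)
print(best_match["text"])  # → dns and more dns
```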

### Expand the documents in your index with metadata such as title, original URL, author, and update time.

### Create an index from these new documents. Retrieve 20 search results about Nix (a package manager).

### For all 20 chunks, print their score, URL, and the date the blogpost was published.
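
A possible shape for the output, sketched on plain dictionaries rather than real search results. The metadata keys `url` and `published`, the URLs, the scores and the dates are all made up for illustration; use whatever keys you chose when adding metadata to your documents.

```python
# Made-up results: each has a similarity score and the metadata of its source document.
toy_results = [
    {"score": 0.42, "metadata": {"url": "https://example.com/post-b", "published": "2022-06-15"}},
    {"score": 0.71, "metadata": {"url": "https://example.com/post-a", "published": "2023-01-01"}},
]


def format_result(result: dict) -> str:
    """Render one result as 'score  url  date'."""
    metadata = result["metadata"]
    return f"{result['score']:.3f}  {metadata['url']}  {metadata['published']}"


# Highest score first.
ranked_results = sorted(toy_results, key=lambda result: result["score"], reverse=True)
report_lines = [format_result(result) for result in ranked_results]
print("\n".join(report_lines))
```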