# Part 1: Querying

PyData Amsterdam 2023

* Tutorial: Building a personal search engine with llama-index
* Speakers: Judith van Stegeren and Yorick van Pelt
* Company: [Datakami](https://www.datakami.nl)

In [64]:
# imports
import pprint # dev
import json
import os
import sys
from pathlib import Path

# loguru: logging for lazy people :)
from loguru import logger

# we're using a local embeddings model
from sentence_transformers import SentenceTransformer

# llama_index: the topic of this tutorial
# we're not importing specific methods or classes so it's clear when we actually call llama_index!
import llama_index

In [65]:
# log to stdout and local file
logger.remove()
logger.add(sys.stdout, format="{time} - {level} - {message}", level="DEBUG")
logger.add("tutorial_part_1.log", level="DEBUG")

4

In [108]:
# constants
DATA_PATH = Path("data/pydata/schedule.json")
INDEX_PATH = Path("indices/pydata_schedule_index/")

## Setup

### Use a local embeddings model

(So no calls to OpenAI APIs :))

In [125]:
# all-MiniLM-L6-v2 has a maximum sequence length of 256 tokens,
# so we set chunk_size to 256 to keep each chunk within that limit
# source: https://www.sbert.net/docs/pretrained_models.html#model-overview

service_context = llama_index.ServiceContext.from_defaults(
  embed_model="local:sentence-transformers/all-minilm-l6-v2", chunk_size=256
)

llama_index.global_service_context = service_context
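
To make the effect of `chunk_size` more concrete, here is a rough sketch of fixed-size chunking. It is not how llama_index splits text: it counts whitespace-separated words as a stand-in for tokens and ignores sentence boundaries and overlap. The helper `chunk_by_words` and the sample text are hypothetical.

```python
def chunk_by_words(text, chunk_size=256):
    """Split text into pieces of at most chunk_size whitespace-separated words.

    Assumption: words stand in for tokens here; a real tokenizer counts differently.
    """
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# Made-up document of 600 "words"
long_text = " ".join(f"word{i}" for i in range(600))
chunks = chunk_by_words(long_text)
print([len(chunk.split()) for chunk in chunks])  # [256, 256, 88]
```

A document longer than one chunk ends up as several pieces in the index, which is why a retrieved result can be just one part of a talk description.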

### Load PyData schedule

**Load JSON file with the PyData Amsterdam 2023 schedule**
* source: https://amsterdam2023.pydata.org/cfp/schedule/export/schedule.json
* retrieved: 2023-08-10

In [126]:
with open(DATA_PATH, 'r') as infile:
    schedule = json.load(infile)
    logger.info(f"Loaded the PyData schedule JSON from file {DATA_PATH}")

2023-08-10T16:08:22.152109+0200 - INFO - Loaded the PyData schedule JSON from file data/pydata/schedule.json


**Extract the talks from the schedule**

In [127]:
talks = []
for day in schedule['schedule']['conference']['days']:
    for room_talks in day['rooms'].values():
        for talk in room_talks:
            # tag each talk with its source file and a category
            talk['filename'] = str(DATA_PATH)
            talk['category'] = "Conference talk at PyData Amsterdam 2023"
            talks.append(talk)

logger.info(f"Loaded {len(talks)} talks from the PyData schedule JSON!")

2023-08-10T16:08:32.275500+0200 - INFO - Loaded 67 talks from the PyData schedule JSON!
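
The nested loop above relies on the export nesting talks as days, then rooms, then a list of talks per room. A minimal sketch with a made-up miniature schedule shows the same flattening in isolation. The room names, titles and the helper `flatten_talks` are hypothetical and not taken from the real file.

```python
# Hypothetical miniature of the schedule layout: days -> rooms -> talks.
# Room names and titles are invented for illustration.
mini_schedule = {
    "schedule": {
        "conference": {
            "days": [
                {"rooms": {"Room A": [{"title": "Talk 1"}, {"title": "Talk 2"}]}},
                {"rooms": {"Room A": [{"title": "Talk 3"}], "Room B": []}},
            ]
        }
    }
}


def flatten_talks(schedule):
    """Collect every talk dict from every room on every day into one list."""
    talks = []
    for day in schedule["schedule"]["conference"]["days"]:
        for room_talks in day["rooms"].values():
            talks.extend(room_talks)
    return talks


mini_talks = flatten_talks(mini_schedule)
print([talk["title"] for talk in mini_talks])  # ['Talk 1', 'Talk 2', 'Talk 3']
```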


**Turn the talks data into llama_index Documents**

In [128]:
documents = []
for talk in talks:
    talk_text = f"{talk['title']}\n\n{talk['abstract']}\n\n{talk['description']}"
    doc = llama_index.Document(text=talk_text)
    #doc.extra_info = talk
    documents.append(doc)
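
The f-string above raises a `KeyError` if a talk lacks one of the three fields, and writes the literal text `None` if a field is null. Whether the real schedule contains such talks is not something this notebook checks, so the following is only a defensive sketch. The helper `talk_to_text` and the example record are hypothetical.

```python
def talk_to_text(talk):
    """Join title, abstract and description, skipping missing or empty fields."""
    fields = (talk.get(key) for key in ("title", "abstract", "description"))
    return "\n\n".join(field.strip() for field in fields if field)


# Hypothetical talk record without a description
example_talk = {"title": "A talk", "abstract": "About things.", "description": None}
print(talk_to_text(example_talk))
```

The result could be passed to `llama_index.Document(text=...)` in place of `talk_text`.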

### Create vector index from PyData schedule

In [129]:
# create a vector store index from the PyData schedule
# llama_index uses the OpenAI API for embeddings by default;
# passing our service_context makes it use the local embeddings model instead
logger.info(f"Building a VectorStoreIndex from {len(documents)} documents")
index = llama_index.VectorStoreIndex.from_documents(documents, service_context=service_context)

# store index to disk
index.storage_context.persist(INDEX_PATH)
logger.info(f"Saved VectorStoreIndex to {INDEX_PATH}")

2023-08-10T16:08:33.453421+0200 - INFO - Building a VectorStoreIndex from 67 documents
2023-08-10T16:08:42.219739+0200 - INFO - Saved VectorStoreIndex to indices/pydata_schedule_index


In [130]:
# load vector index from file
if not os.path.exists(INDEX_PATH):
    logger.error("Index file for part 1 does not exist on disk. :(")
else:
    try:
        # rebuild storage context from disk
        storage_context = llama_index.StorageContext.from_defaults(persist_dir=INDEX_PATH)
        # load index
        index = llama_index.load_index_from_storage(storage_context, service_context=service_context)
        logger.info("Loaded index from local storage")
    except Exception as e:
        logger.error(e)

2023-08-10T16:08:42.589211+0200 - INFO - Loaded index from local storage


In [131]:
retriever = index.as_retriever()
results = retriever.retrieve("llama_index")
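
Conceptually, a vector retriever embeds the query and returns the chunks whose embeddings lie closest to it. The sketch below illustrates that idea with cosine similarity on made-up three-dimensional vectors. It is not llama_index's implementation. The helper names, the toy vectors and `k=2` are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (score, chunk_id) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)
    return scored[:k]


# Made-up embeddings; real ones come from the embeddings model
toy_chunks = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
toy_query = [1.0, 0.1, 0.0]

for score, chunk_id in top_k(toy_query, toy_chunks):
    print(f"{chunk_id}: {score:.3f}")
```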

In [188]:
for result in results:
    print(result.node)
    print("---")
    print()

id_='a64c9da7-86b1-4782-af65-30cd1aa9f5cf' embedding=None metadata={} excluded_embed_metadata_keys=[] excluded_llm_metadata_keys=[] relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='1f00b54b-a024-4493-83ff-c6408f8b498c', node_type=None, metadata={}, hash='06987f3dd83522f1b914565902298eb9eed00f77d5fa47d3d52c5f4301959f54'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='7da999ef-e4a3-4f0f-ae88-0c8a500c5135', node_type=None, metadata={}, hash='e03d84fc9854a169e62cc820db376263518300e815b5b95fe6642903037612e1')} hash='6a94850a7452390034c74ec335f631cc1c25a2753a78508cf6242f98bffe8538' text='Building a personal search engine with llama-index\n\nWouldn’t it be great to have a Google-like search engine, but then for your own text files and completely private? In this tutorial we’ll build a small personal search engine using open source library llama-index.\n\nIn this tutorial we will build a small personal search engine using open source library `llama-index`. Llama

In [199]:
results[0].node.relationships

{<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='1f00b54b-a024-4493-83ff-c6408f8b498c', node_type=None, metadata={}, hash='06987f3dd83522f1b914565902298eb9eed00f77d5fa47d3d52c5f4301959f54'),
 <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='7da999ef-e4a3-4f0f-ae88-0c8a500c5135', node_type=None, metadata={}, hash='e03d84fc9854a169e62cc820db376263518300e815b5b95fe6642903037612e1')}