# Part 1: Querying

PyData Amsterdam 2023

* Tutorial: Building a personal search engine with llama-index
* Speakers: Judith van Stegeren and Yorick van Pelt
* Company: [Datakami](https://www.datakami.nl)

In [None]:
# imports
import pprint # dev
import json
import os
import sys
from pathlib import Path

# loguru: logging for lazy people :)
from loguru import logger

# llama_index: the topic of this tutorial
# we're not importing specific methods or classes so it's clear when we actually call llama_index!
import llama_index

In [None]:
from secret import openai_api_key

In [None]:
# log to stdout and local file
logger.remove()
logger.add(sys.stdout, format="{time} - {level} - {message}", level="DEBUG")
logger.add("tutorial_part_1.log", level="DEBUG")

4

In [None]:
# constants
DATA_PATH = Path("data/pydata/schedule.json")
INDEX_PATH = Path("indices/pydata_schedule_index/")

## Setup

### Use a local embeddings model

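(So computing embeddings makes no calls to OpenAI APIs :) The OpenAI LLM configured below is only needed for the query engine at the end of this notebook.)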

In [None]:
llm = llama_index.llms.OpenAI(model="gpt-3.5-turbo", api_key=openai_api_key) # todo: requires API key

In [None]:
# all-minilm-l6-v2 truncates input after 256 tokens (its maximum sequence length),
# so we also split the documents into chunks of 256 tokens
# source: https://www.sbert.net/docs/pretrained_models.html#model-overview
service_context = llama_index.ServiceContext.from_defaults(
    embed_model="local:sentence-transformers/all-minilm-l6-v2",
    chunk_size=256,
    llm=llm,
)

In [None]:
llama_index.global_service_context = service_context
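
To get a feeling for what `chunk_size=256` means, here is a minimal sketch of chunking. It is not how llama_index splits text internally: we assume whitespace-separated words stand in for tokens, and the helper name `split_into_chunks` is made up for this illustration.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 256) -> List[str]:
    """Split text into chunks of at most `chunk_size` words.

    Assumption: words are a stand-in for model tokens. A real splitter
    counts tokens with a tokenizer and may overlap neighbouring chunks.
    """
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# 600 dummy words end up in three chunks: two full ones and a remainder
example_text = " ".join(f"word{i}" for i in range(600))
chunks = split_into_chunks(example_text, chunk_size=256)
print([len(chunk.split()) for chunk in chunks])  # [256, 256, 88]
```

Each chunk is embedded separately, so a long talk description can end up as several vectors in the index.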

### Load PyData schedule

**Load JSON file with the PyData Amsterdam 2023 schedule**
* source: https://amsterdam2023.pydata.org/cfp/schedule/export/schedule.json

In [None]:
with open(DATA_PATH, 'r') as infile:
    schedule = json.load(infile)
    logger.info(f"Loaded the PyData schedule JSON from file {DATA_PATH}")

2023-08-30T17:19:37.543473+0200 - INFO - Loaded the PyData schedule JSON from file data/pydata/schedule.json


**Extract the talks from the schedule**

In [None]:
talks = {}
for day in schedule['schedule']['conference']['days']:
    for room in day['rooms'].values():
        for talk in room:
            talk['filename'] = str(DATA_PATH)
            talk['category'] = "Conference talk at PyData Amsterdam 2023"
            talks[talk['guid']] = talk

logger.info(f"Loaded {len(talks)} talks from the PyData schedule JSON!")

2023-08-30T17:19:38.431570+0200 - INFO - Loaded 68 talks from the PyData schedule JSON!
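
The nested loop above walks the schedule as days, then rooms, then talks. Here is a small self-contained sketch of the same traversal on a toy schedule. The room names, guids and titles are made up; only the nesting mirrors the real export.

```python
# Toy schedule with the same nesting as the export used above:
# schedule -> conference -> days -> rooms -> list of talks.
# All rooms, guids and titles are invented for this illustration.
toy_schedule = {
    "schedule": {
        "conference": {
            "days": [
                {
                    "rooms": {
                        "Room A": [{"guid": "guid-1", "title": "Talk one"}],
                        "Room B": [{"guid": "guid-2", "title": "Talk two"}],
                    }
                },
                {
                    "rooms": {
                        "Room A": [{"guid": "guid-3", "title": "Talk three"}],
                    }
                },
            ]
        }
    }
}

# Flatten days -> rooms -> talks into one dict keyed by guid
toy_talks = {
    talk["guid"]: talk
    for day in toy_schedule["schedule"]["conference"]["days"]
    for room in day["rooms"].values()
    for talk in room
}
print(sorted(toy_talks))  # ['guid-1', 'guid-2', 'guid-3']
```

Keying the dict by `guid` is what lets us look up the full talk again later from a search result.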


In [None]:
print("Example of a PyData talk:")
pprint.pprint(list(talks.values())[12])

Example of a PyData talk:
{'abstract': 'Working on ML serving for couple of years we learned a lot. I '
             'would like to share a set of best practices / learnings with the '
             'community',
 'answers': [],
 'attachments': [],
 'category': 'Conference talk at PyData Amsterdam 2023',
 'date': '2023-09-14T12:00:00+02:00',
 'description': 'At Adyen we deploy a lot of models for online inference in '
                'the payment flow. Working in the MLOps team to streamline '
                'this process, I learned a lot about best practices / things '
                'to consider before (after) putting a model online. These are '
                'small things but they do contribute to a production and '
                'reliable setup for online inference. Some examples:\r\n'
                '\r\n'
                '- Adding meta data & creating a self contained archive\r\n'
                '- Separating serving sources from training sources\r\n'
                '- Cho

**Turn the talks data into llama_index Documents**

In [None]:
documents = []
for talk in talks.values():
    talk_text = f"{talk['title']}\n\n{talk['abstract']}\n\n{talk['description']}"
    speakers = ", ".join([p['public_name'] for p in talk['persons']])
    doc = llama_index.Document(
        text=talk_text,
        id_=talk["guid"],
        extra_info={'title': talk['title'], 'speakers': speakers},
    )
    documents.append(doc)

In [None]:
print("Example of a PyData talk Document:")
pprint.pprint(dict(documents[12]))

Example of a PyData talk Document:
{'embedding': None,
 'end_char_idx': None,
 'excluded_embed_metadata_keys': [],
 'excluded_llm_metadata_keys': [],
 'hash': 'e1a90272b9b8d29972d9674b7c1ab976aca6fdf7ce73f49b1ca7432c90123c74',
 'id_': '80309313-5f6a-58dd-b81f-730483b2dd6b',
 'metadata': {'speakers': 'Ziad Al Moubayed',
              'title': 'Reliable and Scalable ML Serving: Best Practices for '
                       'Online Model Deployment'},
 'metadata_seperator': '\n',
 'metadata_template': '{key}: {value}',
 'relationships': {},
 'start_char_idx': None,
 'text': 'Reliable and Scalable ML Serving: Best Practices for Online Model '
         'Deployment\n'
         '\n'
         'Working on ML serving for couple of years we learned a lot. I would '
         'like to share a set of best practices / learnings with the '
         'community\n'
         '\n'
         'At Adyen we deploy a lot of models for online inference in the '
         'payment flow. Working in the MLOps team to s
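
The Document output above shows a `metadata_template` of `'{key}: {value}'` and a separator of `'\n'`. The sketch below shows how such a template could turn the metadata into text. It is an illustration only: the helper names are hypothetical, and placing the metadata block before the text with a blank line in between is an assumption, not something the output above confirms.

```python
def format_metadata(metadata: dict,
                    template: str = "{key}: {value}",
                    separator: str = "\n") -> str:
    """Render each metadata entry with the template and join the lines."""
    return separator.join(
        template.format(key=key, value=value) for key, value in metadata.items()
    )


def text_with_metadata(text: str, metadata: dict) -> str:
    # Assumption: metadata first, then a blank line, then the text itself
    return f"{format_metadata(metadata)}\n\n{text}"


example_metadata = {
    "title": "Reliable and Scalable ML Serving: Best Practices for Online Model Deployment",
    "speakers": "Ziad Al Moubayed",
}
print(text_with_metadata("Working on ML serving for couple of years ...", example_metadata))
```

If metadata is included this way, the title and speaker names become searchable along with the talk text.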

### Create vector index from PyData schedule

In [None]:
# create vector index from PyData schedule
logger.info(f"Building a VectorStoreIndex from {len(documents)} documents")
index = llama_index.VectorStoreIndex.from_documents(documents, service_context=service_context)

# store index to disk
index.storage_context.persist(INDEX_PATH)
logger.info(f"Saved VectorStoreIndex to {INDEX_PATH}")

2023-08-30T17:19:43.632450+0200 - INFO - Building a VectorStoreIndex from 68 documents
2023-08-30T17:19:54.067441+0200 - INFO - Saved VectorStoreIndex to indices/pydata_schedule_index


## Load vector index with PyData Amsterdam 2023 schedule

In [None]:
# load vector index from disk
if not os.path.exists(INDEX_PATH):
    logger.error("Index directory for part 1 does not exist on disk. :(")
else:
    try:
        # rebuild storage context from disk
        storage_context = llama_index.StorageContext.from_defaults(persist_dir=INDEX_PATH)
        # load index; no service_context is passed here, so llama_index falls back
        # to the global service context we set above
        # (alternative: pass service_context=service_context explicitly)
        index = llama_index.load_index_from_storage(storage_context)
        logger.info("Loaded index from local storage")
    except Exception as e:
        logger.error(e)

2023-08-23T16:22:05.678105+0200 - INFO - Loaded index from local storage


## Create a search engine from vector index

In [None]:
# create a search engine
retriever = index.as_retriever()

## Query the search engine

In [None]:
# query the search engine
results = retriever.retrieve("llama_index")
for result in results:
    talk = talks[result.node.source_node.node_id]
    print(f"- score: {result.score:.2f} title: {talk['title']}")

- score: 0.35 title: Building a personal search engine with llama-index
- score: 0.24 title: Unconference #1
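
What do these scores mean? Below is a minimal sketch of vector retrieval, assuming the score is the cosine similarity between the query embedding and each chunk embedding, and that only the top 2 matches are returned (which fits the two results above). The vectors and ids are toy values, and `top_k` is a helper made up for this illustration, not a llama_index function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (score, doc_id) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions
toy_vectors = {
    "talk-a": [1.0, 0.0, 0.0],
    "talk-b": [0.6, 0.8, 0.0],
    "talk-c": [0.0, 0.0, 1.0],
}

for score, doc_id in top_k([1.0, 0.0, 0.0], toy_vectors, k=2):
    print(f"- score: {score:.2f} id: {doc_id}")
```

The retriever always returns the closest chunks, even when nothing in the index is a good match, so a low score such as 0.24 is a hint that the result may be only loosely related to the query.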


### Startups

In [None]:
results = retriever.retrieve("startups")
for result in results:
    talk = talks[result.node.source_node.node_id]
    print(f"- score: {result.score:.2f} title: {talk['title']}")

- score: 0.32 title: Kickstart AI sponsored drinks [time & location TBD]
- score: 0.26 title: Power Users, Long Tail Users, and Everything In Between: Choosing Meaningful Metrics and KPIs for Product Strategy


## Querying the vector index with an external LLM

In [None]:
import openai

In [None]:
query_engine = index.as_query_engine()

In [None]:
try:
    response = query_engine.query("Which talks are probably interesting for startup founders?")
except openai.error.AuthenticationError as auth_error:
    logger.error(auth_error)

In [None]:
response.response

'The talks that are probably interesting for startup founders are "Setting The Right KPIs" and "Data-Driven Decision Making." These talks discuss topics such as setting realistic and challenging KPIs and leveraging data for informed decision-making and product strategy adjustments, which are relevant for startup founders involved in shaping product strategy and making data-driven decisions.'

In [None]:
print("Sources:")
for source in response.source_nodes:
    print("-", talks[source.node.source_node.node_id]['title'])

Sources:
- Kickstart AI sponsored drinks [time & location TBD]
- Power Users, Long Tail Users, and Everything In Between: Choosing Meaningful Metrics and KPIs for Product Strategy
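
Conceptually, the query engine first retrieves the most similar chunks and then asks the LLM to answer the question using those chunks as context. Here is a rough sketch of that second step. The prompt wording and the `build_prompt` helper are hypothetical and only illustrate the idea; the prompt llama_index actually sends may look different.

```python
from typing import List


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Combine retrieved chunks and the user question into one LLM prompt.

    The template below is made up for this illustration.
    """
    context = "\n\n".join(context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using only this context, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for the text of the retrieved source nodes
retrieved_chunks = ["<text of first retrieved chunk>", "<text of second retrieved chunk>"]
prompt = build_prompt(
    "Which talks are probably interesting for startup founders?",
    retrieved_chunks,
)
print(prompt)
```

Because the LLM writes free text, the titles it quotes in its answer do not have to match the source titles literally, as the output above shows. Printing the sources next to the answer makes it easy to check.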
