# Part 1: Querying

PyData Amsterdam 2023

* Tutorial: Building a personal search engine with llama-index
* Speakers: Judith van Stegeren and Yorick van Pelt
* Company: [Datakami](https://www.datakami.nl)

## Prep

- Install all the prerequisites using TODO
- If you want to use the OpenAI API during this tutorial, create a file `secret.py` containing `openai_api_key = "YOURKEYHERE"`
- Run all cells under 'Setup'

## Setup

### Imports

In [None]:
import pprint
import os
import sys
from pathlib import Path

# logging for lazy people :)
from loguru import logger

# we're not importing specific methods or classes so it's clear when we actually call llama_index!
import llama_index

### Logging

In [None]:
logger.remove()
logger.add(sys.stdout, format="{time} - {level} - {message}", level="DEBUG")
logger.add("tutorial_part_1.log", level="DEBUG")

2

In [None]:
DATA_PATH = Path("data/pydata/schedule.json")
INDEX_PATH = Path("indices/pydata_schedule_index/")

### Tell `llama-index` to use a local embeddings model for retrieval

In [None]:
# More information about this embeddings model: https://www.sbert.net/docs/pretrained_models.html#model-overview
# all-minilm-l6-v2 has a maximum input length of 256 tokens
embed_model = "local:sentence-transformers/all-minilm-l6-v2"
llm = None

In [None]:
service_context = llama_index.ServiceContext.from_defaults(
  embed_model=embed_model, chunk_size=256, llm=llm
)

LLM is explicitly disabled. Using MockLLM.
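
The `chunk_size=256` setting lines up with the 256-token limit of the embedding model, so each chunk fits into a single embedding. As a rough illustration of what chunking does, here is a minimal sketch that splits on whitespace. `chunk_words` is a hypothetical helper, not llama-index's own text splitter, which counts tokens rather than words.

```python
def chunk_words(text: str, chunk_size: int = 256) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# A 600-word document becomes three chunks: 256 + 256 + 88 words.
chunks = chunk_words("word " * 600, chunk_size=256)
print([len(chunk.split()) for chunk in chunks])
```

Because the split ignores sentence boundaries, a chunk can start or end mid-sentence; treat this only as a mental model of why long talk abstracts end up as several nodes.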


### Load a vector index with the PyData Amsterdam 2023 schedule

In [None]:
# load vector index from file
if not os.path.exists(INDEX_PATH):
    logger.error("Index file for part 1 does not exist on disk. :(")
else:
    try:
        storage_context = llama_index.StorageContext.from_defaults(persist_dir=INDEX_PATH)
        index = llama_index.load_index_from_storage(storage_context, service_context=service_context)
        logger.info("Loaded index from local storage")
    except Exception as e:
        logger.error(e)

2023-08-23T16:19:43.182350+0200 - INFO - Loaded index from local storage


## Exercises

### Create a search engine from the vector index `index`

In [None]:
# create a search engine
retriever = index.as_retriever()

### Retrieve talks that mention llama-index

In [None]:
# query the search engine
results = retriever.retrieve("llama_index")

In [None]:
for result in results:
    print(result)
    print()

node=TextNode(id_='0790cffb-67fa-4be4-b668-61b6f75b195a', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='f71f1eae-c12e-5b4f-8902-b8d0b8a38d72', node_type=None, metadata={}, hash='06987f3dd83522f1b914565902298eb9eed00f77d5fa47d3d52c5f4301959f54'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='7d194c54-99e5-4011-8cd0-51b77b1274e6', node_type=None, metadata={}, hash='3afcb14eb9ca291a920e6ec82ff895103024cb798781880f8f09ee4e80c725c0')}, hash='c852556384af48ec95a06b58d93753903d7a40cf65713b2ce35ef85583de0ac8', text='Building a personal search engine with llama-index\n\nWouldn’t it be great to have a Google-like search engine, but then for your own text files and completely private?In this tutorial we’ll build a small personal search engine using open source library llama-index.In this tutorial we will build a small personal search engine using open source library `lla
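
Printing a whole result dumps the full `TextNode` representation, which is hard to read. A minimal sketch of a more compact display follows, assuming each result exposes a `score` and a `node.text` attribute. `summarize_results` is a hypothetical helper, and the stand-in objects exist only so the snippet runs without an index.

```python
from types import SimpleNamespace


def summarize_results(results, width=60):
    """Return one line per result: similarity score plus the start of the text."""
    lines = []
    for result in results:
        # Only the first line of the chunk is shown, truncated to `width` characters.
        first_line = result.node.text.split("\n", 1)[0]
        score = "n/a" if result.score is None else f"{result.score:.3f}"
        lines.append(f"{score}  {first_line[:width]}")
    return lines


# Stand-in results for illustration; in the notebook, pass `results` instead.
fake_results = [
    SimpleNamespace(score=0.61234, node=SimpleNamespace(text="Talk title A\n\nAbstract A")),
    SimpleNamespace(score=None, node=SimpleNamespace(text="Talk title B\n\nAbstract B")),
]
for line in summarize_results(fake_results):
    print(line)
```

Note that only the first chunk of a talk starts with its title; later chunks of the same talk begin somewhere in the abstract.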

### Retrieve talks about causal machine learning

In [None]:
results = retriever.retrieve("causal")
for result in results:
    print(result)
    print()

node=TextNode(id_='3667cdc1-476c-45bc-b991-af941fb56858', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='e5be9c09-4c7c-5a3f-b1f8-2c7ecec74cd8', node_type=None, metadata={}, hash='5a52c60ee9f4c8bd3f85be362bf7abaf161cf24ca442c226d83ca7ec2f307177'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='468f946b-6bdf-445d-b45b-c342364ac4af', node_type=None, metadata={}, hash='62eacf013c2fd781932f8ab4ae72aabd9021bfd4292803138792b918fdbcc883')}, hash='939bab04cf1ba76e562e5c9f7aa37bc9550dd61f6a7555cbf72554a260ea115b', text="The proof of the pudding is in the (way of) eating: quasi-experimental methods of causal inference and their practical pitfalls\n\nData scientists and analysts are using quasi-experimental methods to make recommendations based on causality instead of randomized control trials.While these methods are easy to use, their assumptions can be complex to explain.

### Create a new retriever that retrieves more than 2 results

In [None]:
# hint: https://gpt-index.readthedocs.io/en/v0.8.5.post2/api_reference/query/retrievers/vector_store.html
retriever = index.as_retriever(similarity_top_k=10)
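
`similarity_top_k` controls how many of the best-matching chunks the retriever returns. Conceptually, the query is embedded, compared against every stored chunk embedding, and the k highest-scoring chunks are returned. Below is a minimal sketch of that idea with toy 3-dimensional vectors, assuming cosine similarity as the measure. `cosine_similarity` and `top_k` are hypothetical helpers, not llama-index functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k=2):
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vectors,
        key=lambda chunk_id: cosine_similarity(query_vector, chunk_vectors[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings; real ones from all-minilm-l6-v2 have 384 dimensions.
chunk_vectors = {
    "causal-talk": [0.9, 0.1, 0.0],
    "search-talk": [0.1, 0.9, 0.0],
    "kpi-talk": [0.5, 0.5, 0.7],
}
query_vector = [1.0, 0.0, 0.0]
print(top_k(query_vector, chunk_vectors, k=2))
```

Asking for more results than there are chunks simply returns all of them, ranked, which is why a large `similarity_top_k` is a cheap way to see everything loosely related to a query.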

### Find all talks about causal inference at PyData

In [None]:
results = retriever.retrieve('causal')
for r in results:
    print(r.node.text)
    print("---------------")

The proof of the pudding is in the (way of) eating: quasi-experimental methods of causal inference and their practical pitfalls

Data scientists and analysts are using quasi-experimental methods to make recommendations based on causality instead of randomized control trials.While these methods are easy to use, their assumptions can be complex to explain.This talk will explain these assumptions for data scientists and analysts without in-depth training of causal inference so they can use and explain these methods more confidently to change people's minds using data.Instead of relying solely on randomized control trials (also known as A/B tests), which are considered the gold standard for inferring causality, data scientists and analysts are increasingly turning to quasi-experimental methods to make recommendations based on causality.These methods, including open-source libraries such as CausalImpact (originally an R package but with numerous Python ports), are easy to use, but their ass

## Querying the vector index with an external LLM

### Set OpenAI API key (optional)

In [None]:
from secret import openai_api_key
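
The import above raises an `ImportError` if `secret.py` is missing. A minimal sketch of a more forgiving loader follows, assuming you are also happy to read the key from an environment variable. `load_api_key` is a hypothetical helper, and the `OPENAI_API_KEY` variable name is an assumption rather than something this tutorial requires.

```python
import importlib
import os


def load_api_key(module_name="secret", attribute="openai_api_key", env_var="OPENAI_API_KEY"):
    """Return the key from `module_name` if importable, else from `env_var`, else None."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attribute)
    except (ImportError, AttributeError):
        # No secret.py (or no key in it): fall back to the environment.
        return os.environ.get(env_var)


openai_api_key = load_api_key()
if openai_api_key is None:
    print("No API key found; skip the cells that use OpenAI.")
```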

### Use OpenAI's gpt-3.5-turbo for querying

In [None]:
llm = llama_index.llms.OpenAI(model="gpt-3.5-turbo", api_key=openai_api_key)
service_context = llama_index.ServiceContext.from_defaults(
  embed_model=embed_model, chunk_size=256, llm=llm
)

In [None]:
# load vector index from file
if not os.path.exists(INDEX_PATH):
    logger.error("Index file for part 1 does not exist on disk. :(")
else:
    try:
        storage_context = llama_index.StorageContext.from_defaults(persist_dir=INDEX_PATH)
        index = llama_index.load_index_from_storage(storage_context, service_context=service_context)
        logger.info("Loaded index from local storage")
    except Exception as e:
        logger.error(e)

2023-08-23T16:20:45.797800+0200 - INFO - Loaded index from local storage


### Create a query engine

In [None]:
query_engine = index.as_query_engine()

### Which talks might be interesting for startup founders?

In [None]:
response = query_engine.query("Which talks are probably interesting for startup founders?")

In [None]:
response.response

'The talks that are probably interesting for startup founders are "Setting The Right KPIs" and "Data-Driven Decision Making." These talks discuss topics such as setting realistic and challenging KPIs and leveraging data for informed decision-making and product strategy adjustments, which are relevant for startup founders involved in shaping product strategy and making data-driven decisions.'
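
To check what an answer is based on, it helps to look at the chunks that were retrieved and handed to the LLM. A minimal sketch follows, assuming the response object exposes them as `source_nodes`, each with a `node.text` attribute like the retriever results. `list_sources` is a hypothetical helper, and the stand-in response exists only so the snippet runs on its own.

```python
from types import SimpleNamespace


def list_sources(response, width=60):
    """Return the first line of every chunk the answer was based on."""
    return [
        source.node.text.split("\n", 1)[0][:width]
        for source in response.source_nodes
    ]


# Stand-in response for illustration; in the notebook, pass `response` instead.
fake_response = SimpleNamespace(
    response="An answer",
    source_nodes=[
        SimpleNamespace(score=0.5, node=SimpleNamespace(text="Talk title A\n\nAbstract A")),
        SimpleNamespace(score=0.4, node=SimpleNamespace(text="Talk title B\n\nAbstract B")),
    ],
)
print(list_sources(fake_response))
```

Comparing these sources with the generated answer is a quick sanity check that the talks named by the LLM really appear in the retrieved chunks.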