# Running this notebook in Google Colab
This notebook is intended to run in Google Colab [here](https://colab.research.google.com/github/datakami/pydata-llama-index-tutorial/blob/main/Colab-Exercises.ipynb).

Execute the code below to install the necessary dependencies and download the data.

In [None]:
%pip install llama-index==0.8.5.post2 sentence-transformers==2.2.2 loguru==0.7.0
# download the data
import zipfile
from urllib.request import urlretrieve

path, _ = urlretrieve("https://pub.yori.cc/lid_data.zip")
with zipfile.ZipFile(path, 'r') as zip_ref:
    zip_ref.extractall(".")
# move the data and indices folders into the working directory
!mv pydata-llama-index-tutorial-main/data pydata-llama-index-tutorial-main/indices .

# Part 1: Querying

If you want to use the OpenAI API during this tutorial, create a file `secret.py` containing `openai_api_key = "YOURKEYHERE"`.

How to use this notebook: 
1. Execute all cells under "Setup"
2. Fill in the exercises below

## Setup

### Imports

In [None]:
import pprint
import os
import sys
from pathlib import Path

# logging for lazy people :)
from loguru import logger

# we're not importing specific methods or classes so it's clear when we actually call llama_index!
import llama_index

### Logging

In [None]:
logger.remove()
logger.add(sys.stdout, format="{time} - {level} - {message}", level="DEBUG")
logger.add("tutorial_part_1.log", level="DEBUG")

### Tell the notebook where files are stored

In [None]:
DATA_PATH = Path("data/pydata/schedule.json")
INDEX_PATH = Path("indices/pydata_schedule_index/")

### Tell `llama-index` to use a local embeddings model for retrieval

More information about this embeddings model: https://www.sbert.net/docs/pretrained_models.html#model-overview

Note that all-minilm-l6-v2 accepts at most 256 tokens per input, which is why `chunk_size` is set to 256 below.

In [None]:
embed_model = "local:sentence-transformers/all-minilm-l6-v2"
llm = None

In [None]:
service_context = llama_index.ServiceContext.from_defaults(
  embed_model=embed_model, chunk_size=256, llm=llm
)
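
To make the relation between `chunk_size` and the 256-token limit concrete, here is a minimal sketch of fixed-size chunking. It is an illustration only: it counts whitespace-separated words as a rough stand-in for the model's real tokens, and the helper name `split_into_chunks` is hypothetical. It is not how llama-index splits text internally.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 256) -> List[str]:
    """Split text into pieces of at most `chunk_size` whitespace-separated words.

    Assumption: words approximate tokens. A real tokenizer usually produces
    more tokens than words, so real chunks would hold fewer words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


example_text = "word " * 600
chunks = split_into_chunks(example_text, chunk_size=256)
print([len(chunk.split()) for chunk in chunks])  # prints [256, 256, 88]
```

Every chunk fits within the limit, so nothing is cut off when a chunk is embedded.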

### Load a vector index with the PyData Amsterdam 2023 schedule

Load the index from file.

In [None]:
if not os.path.exists(INDEX_PATH):
    logger.error("Index file for part 1 does not exist on disk. :(")
else:
    try:
        storage_context = llama_index.StorageContext.from_defaults(persist_dir=INDEX_PATH)
        index = llama_index.load_index_from_storage(storage_context, service_context=service_context)
        logger.info("Loaded index from local storage")
    except Exception as e:
        logger.error(e)

## Exercises

### Create a retriever from the vector index `index`

### Retrieve chunks that mention llama-index

### Retrieve chunks related to causal machine learning

### Print the talk title and speaker for results related to causal machine learning

### Create a new retriever that retrieves more than 2 results. Hint: [llama-index Retriever documentation](https://gpt-index.readthedocs.io/en/v0.8.5.post2/api_reference/query/retrievers/vector_store.html)
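
As background for this exercise, the sketch below shows the general idea behind "top-k" retrieval without using llama-index at all: score every chunk against the query with cosine similarity and keep the `k` best. The toy vectors, chunk ids, and function names are made up for illustration. The actual retriever setting is described in the documentation linked above.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vector: List[float],
    chunk_vectors: Dict[str, List[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [
        (chunk_id, cosine_similarity(query_vector, vector))
        for chunk_id, vector in chunk_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" (hypothetical data).
toy_chunks = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.8, 0.6],
    "chunk-c": [0.0, 1.0],
    "chunk-d": [-1.0, 0.0],
}
query = [1.0, 0.1]

print([chunk_id for chunk_id, _ in top_k(query, toy_chunks, k=2)])
# prints ['chunk-a', 'chunk-b']
print([chunk_id for chunk_id, _ in top_k(query, toy_chunks, k=3)])
# prints ['chunk-a', 'chunk-b', 'chunk-c']
```

Raising `k` returns more results, and each additional result is less similar to the query than the ones before it.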

### Find all talks about causal inference at PyData

## Querying the vector index with an external LLM

### Set OpenAI API key (optional)

In [None]:
from secret import openai_api_key
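
The import above raises an error when `secret.py` does not exist, even though this step is optional. A possible alternative is sketched below. It assumes you are happy to fall back to the `OPENAI_API_KEY` environment variable; the helper name `load_openai_api_key` is hypothetical and not part of the tutorial code.

```python
import importlib
import os
from typing import Optional


def load_openai_api_key(
    module_name: str = "secret",
    env_var: str = "OPENAI_API_KEY",
) -> Optional[str]:
    """Return the key from `secret.py` if present, else from the environment."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # No secret.py on the path: fall through to the environment variable.
        module = None
    key = getattr(module, "openai_api_key", None)
    if key:
        return key
    return os.environ.get(env_var) or None


openai_api_key = load_openai_api_key()
if openai_api_key is None:
    print("No OpenAI API key found; skip the cells that call OpenAI.")
```

With this approach the notebook keeps running without a key, and only the cells that actually call OpenAI need to be skipped.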

### Use OpenAI's gpt-3.5-turbo for querying

In [None]:
llm = llama_index.llms.OpenAI(model="gpt-3.5-turbo", api_key=openai_api_key)
service_context = llama_index.ServiceContext.from_defaults(
  embed_model=embed_model, chunk_size=256, llm=llm
)

In [None]:
if not os.path.exists(INDEX_PATH):
    logger.error("Index file for part 1 does not exist on disk. :(")
else:
    try:
        storage_context = llama_index.StorageContext.from_defaults(persist_dir=INDEX_PATH)
        index = llama_index.load_index_from_storage(storage_context, service_context=service_context)
        logger.info("Loaded index from local storage")
    except Exception as e:
        logger.error(e)

### Create a query engine

### Find talks that might be interesting for startup founders

# Part 2: Building a custom vector index

How to use this notebook: 
1. Execute all cells under "Setup"
2. Fill in the exercises below

## Setup

### Imports

In [None]:
import pprint
import json
import os
import sys
from pathlib import Path
from loguru import logger
import llama_index

### Logging

In [None]:
logger.remove()
logger.add(sys.stdout, format="{time} - {level} - {message}", level="DEBUG")
logger.add("tutorial_part_2.log", level="DEBUG")

In [None]:
DATA_PATH = Path("data/julia_evans/blogposts.json")

### Load Julia Evans blogpost data from file

In [None]:
with open(DATA_PATH, 'r') as infile:
    blogposts = json.load(infile)

In [None]:
logger.info(f"Loaded {len(blogposts)} blogposts from file.")

In [None]:
for idx, post in enumerate(blogposts):
    print(f"{idx+1}: {post['title']}")

In [None]:
print("Example blogpost:")
pprint.pprint(blogposts[0])

### Create a service context
- No OpenAI API calls
- No large local LLM
- Just use the smallest sentence-transformers embeddings model

- all-minilm-l6-v2 accepts at most 256 tokens per input, hence `chunk_size=256`
- source: https://www.sbert.net/docs/pretrained_models.html#model-overview

In [None]:
service_context = llama_index.ServiceContext.from_defaults(
  embed_model="local:sentence-transformers/all-minilm-l6-v2", chunk_size=256, llm=None
)

### Create documents from the blogposts

In [None]:
documents = [llama_index.Document(text=blogpost['text']) for blogpost in blogposts]

In [None]:
len(documents)

In [None]:
print("Example document:")
documents[0]

## Exercises

### Create a vector index from `documents` (this could take a minute because we're processing lots of text)

### Retrieve blogposts about DNS

### It's really hard to figure out from which blogposts these results are! Create a set of Documents that has the blogpost title as the id.
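
Before using titles as ids, it may be worth checking that they are unique, since two documents with the same id would be hard to tell apart. The sketch below is one way to do that. It assumes each blogpost is a dict with a `title` key, as in the cells above; the helper name and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def find_duplicate_titles(posts: List[Dict[str, str]]) -> List[str]:
    """Return the titles that occur more than once, sorted alphabetically."""
    counts = Counter(post["title"] for post in posts)
    return sorted(title for title, count in counts.items() if count > 1)


# Toy data standing in for `blogposts` (hypothetical titles and text).
toy_posts = [
    {"title": "A post about DNS", "text": "first toy text"},
    {"title": "A post about Nix", "text": "second toy text"},
    {"title": "A post about DNS", "text": "third toy text"},
]

print(find_duplicate_titles(toy_posts))  # prints ['A post about DNS']
```

If the list is empty for the real `blogposts`, the titles are safe to use as ids.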

### Build a vector index from these new documents

### Retrieve blogposts about DNS using the new index. This time, use the id of the source node to check from which blogpost they originate.

### Store your index to disk

### Load your index from disk in a new index variable. Hint: you need two ingredients: a storage context and a service context.

### Expand the documents in your index with metadata such as title, original URL, author, and update time.

### Create an index from these new documents. Retrieve 20 search results about Nix (a package manager).

### For all 20 chunks, print their score, url and the date the blogpost was published.