# Deep Memory trained on Synthetic Queries improves recall@10 by 20%

Training Deep Memory requires labelled data: pairs of queries and the documents relevant to them. Such data is often hard to obtain when you are starting from scratch.

In this tutorial we will take an existing dataset and generate queries using GPT to train Deep Memory.
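
To make "labelled data" concrete, here is a minimal sketch of the shape the training call in section 3 receives: a list of query strings and a parallel list of relevance labels, where each label is a list of `(document_id, score)` tuples. The two example queries and the helper name `check_training_pairs` are hypothetical and exist only for illustration; the ids follow the `node_{idx}` convention set in section 1.

```python
from typing import List, Tuple

Relevance = List[Tuple[str, int]]  # [(document_id, relevance_score), ...]

# Hypothetical hand-written examples; the real pairs are generated in section 2.
queries: List[str] = [
    "What did the author work on before college?",
    "Why did the author start painting?",
]
relevances: List[Relevance] = [
    [("node_0", 1)],
    [("node_7", 1)],
]

def check_training_pairs(queries: List[str], relevances: List[Relevance]) -> None:
    """Raise ValueError if the two lists are not well-formed, parallel lists."""
    if len(queries) != len(relevances):
        raise ValueError("queries and relevances must have the same length")
    for query, labels in zip(queries, relevances):
        if not query.strip():
            raise ValueError("empty query")
        if not labels:
            raise ValueError(f"no relevant document for query: {query!r}")
        for doc_id, score in labels:
            if not isinstance(doc_id, str) or score <= 0:
                raise ValueError(f"bad label {(doc_id, score)!r}")

check_training_pairs(queries, relevances)  # passes silently when the shape is right
```

A check like this is cheap to run before submitting a training job, since a mismatch between the two lists is easy to introduce when queries are filtered or truncated.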

## 0. Set up packages and credentials
Install the necessary packages

In [None]:
!pip3 install deeplake langchain openai tiktoken llama-index

Set up Activeloop and OpenAI credentials

In [None]:
import os, getpass
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass()

In [None]:
os.environ['OPENAI_API_KEY'] = getpass.getpass()

## 1. Load the dataset and create a Deep Lake vector store

We are going to use GPT-3.5 to generate questions based on the context provided by a chunk of text.

In [None]:
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75042  100 75042    0     0   149k      0 --:--:-- --:--:-- --:--:--  149k


In [None]:
from llama_index.node_parser import SimpleNodeParser
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)

# By default, node (chunk) ids are random UUIDs. Set them manually so they are the same on every run.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

print(f"Number of Documents: {len(documents)}")
print(f"Number of nodes: {len(nodes)} with the current chunk size of {node_parser.chunk_size}")

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Number of Documents: 1
Number of nodes: 58 with the current chunk size of 512


In [None]:
from llama_index import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores import DeepLakeVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms import OpenAI

# Create a DeepLakeVectorStore locally to store the vectors
dataset_path = "./data/paul_graham/deep_lake_db"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True, exec_option="compute_engine")

# LLM that will answer questions with the retrieved context
llm = OpenAI(model="gpt-3.5-turbo-1106")
embed_model = OpenAIEmbedding()

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

vector_index = VectorStoreIndex(nodes, service_context=service_context, storage_context=storage_context, show_progress=True)



Generating embeddings:   0%|          | 0/58 [00:00<?, ?it/s]

Uploading data to deeplake dataset.


100%|██████████| 58/58 [00:00<00:00, 274.94it/s]

Dataset(path='./data/paul_graham/deep_lake_db', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (58, 1)      str     None   
 metadata     json      (58, 1)      str     None   
 embedding  embedding  (58, 1536)  float32   None   
    id        text      (58, 1)      str     None   





Now let's upload the local vector store to Activeloop's platform and then convert it into a managed database.

In [None]:
import deeplake
local = "./data/paul_graham/deep_lake_db"
hub_path = "hub://genai360/LlamaIndex_paulgraham_essay"
hub_managed_path = "hub://genai360/LlamaIndex_paulgraham_essay_managed"

# First upload our local vector store
deeplake.deepcopy(local, hub_path, overwrite=True)
# Create a managed vector store under a different name
deeplake.deepcopy(hub_path, hub_managed_path, overwrite=True, runtime={"tensor_db": True})

## 2. Generate a dataset of Queries and Documents

In [None]:
# Fetch the docs and ids from the existing dataset (optionally, you can ingest the data instead)
db = DeepLakeVectorStore(dataset_path=hub_managed_path, overwrite=False, exec_option="compute_engine", read_only=True,)
docs = db.vectorstore.dataset.text.data(fetch_chunks=True, aslist=True)['value']
ids = db.vectorstore.dataset.id.data(fetch_chunks=True, aslist=True)['value']
print(len(docs))

Deep Lake Dataset in hub://genai360/LlamaIndex_paulgraham_essay_managed already exists, loading from the storage
58


In [None]:
from openai import OpenAI
client = OpenAI()

def generate_question(text):
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=[
                {"role": "system", "content": "You are a world class expert for generating questions based on provided context. \
                        You make sure the question can be answered by the text."},
                {
                    "role": "user",
                    "content": text,
                },
            ],
        )
        return response.choices[0].message.content
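    except Exception:
        # Sentinel value: generate_queries() skips chunks that return this string.
        # Catching Exception (not a bare except) lets Ctrl+C still interrupt the run.
        question_string = "No question generated"
        return question_string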


In [None]:
import random
from tqdm import tqdm

def generate_queries(docs: list[str], ids: list[str], n: int):

    questions = []
    relevances = []
    pbar = tqdm(total=n)
    while len(questions) < n:
        # 1. randomly draw a piece of text and its id
        r = random.randint(0, len(docs)-1)
        text, label = docs[r], ids[r]

        # 2. generate a query and label it with the id of the chunk it came from
        generated_qs = [generate_question(text)]
        if generated_qs == ["No question generated"]:
            print("No question generated")
            continue

        questions.extend(generated_qs)
        relevances.extend([[(label, 1)] for _ in generated_qs])
        pbar.update(len(generated_qs))

    return questions[:n], relevances[:n]

# Here we choose to generate 40 questions
questions, relevances = generate_queries(docs, ids, n=40)
print(len(questions))
print(questions[0])
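
The pairing logic can be dry-run without spending API calls. The sketch below is a variant of `generate_queries`, not a replacement for it: the name `pair_queries`, the injected `question_fn`, the `seed`, and the `max_attempts` cap are all assumptions added for illustration. The cap is there because the loop above retries indefinitely if every API call fails.

```python
import random
from typing import Callable, List, Optional, Tuple

def pair_queries(
    docs: List[str],
    ids: List[str],
    n: int,
    question_fn: Callable[[str], Optional[str]],
    seed: int = 0,
    max_attempts: int = 1000,
) -> Tuple[List[str], List[List[Tuple[str, int]]]]:
    """Draw chunks at random and pair each generated question with its chunk id."""
    rng = random.Random(seed)  # seeded so repeated runs draw the same chunks
    questions, relevances = [], []
    attempts = 0
    while len(questions) < n and attempts < max_attempts:
        attempts += 1
        r = rng.randrange(len(docs))
        question = question_fn(docs[r])
        if not question:  # generator failed for this chunk; draw again
            continue
        questions.append(question)
        relevances.append([(ids[r], 1)])
    return questions, relevances

# Stub generator (hypothetical): stands in for the GPT call during a dry run.
def fake_question(text: str) -> str:
    return f"What does the passage starting with '{text[:20]}' say?"

toy_docs = ["First chunk of text.", "Second chunk of text.", "Third chunk of text."]
toy_ids = ["node_0", "node_1", "node_2"]
qs, rels = pair_queries(toy_docs, toy_ids, n=5, question_fn=fake_question)
print(len(qs), rels[0])
```

Because chunks are drawn with replacement, the same chunk can be labelled by several different questions, which is also what happens in the cell above when `n` is close to or larger than the number of chunks.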


## 3. Train Deep Memory

In [None]:
job_id = db.vectorstore.deep_memory.train(
    queries=questions,
    relevance=relevances,
)

In [None]:
db.vectorstore.deep_memory.status(job_id)

Wait until the training status becomes completed before moving on to evaluation.
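
If you prefer not to re-run the status cell by hand, a generic polling loop is one option. This is only a sketch under stated assumptions: `get_status` is a hypothetical callable that you would have to supply and that returns the job state as a string, and the state name `"failed"` is a guess (only "completed" is mentioned in this tutorial). Nothing here is part of the Deep Lake API.

```python
import time
from typing import Callable, Iterable

def wait_for_status(
    get_status: Callable[[], str],
    done: Iterable[str] = ("completed",),
    failed: Iterable[str] = ("failed",),   # assumed state name
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() until it reports a terminal state or the timeout expires."""
    done, failed = set(done), set(failed)
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in done:
            return status
        if status in failed:
            raise RuntimeError(f"job ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)

# Simulated job: reports "pending", then "training", then "completed".
states = iter(["pending", "training", "completed"])
print(wait_for_status(lambda: next(states), poll_seconds=0))  # prints: completed
```

Passing the status lookup in as a callable keeps the loop independent of how the state is actually obtained, so it can be tested with a simulated job as shown.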

## 4. Evaluate Deep Memory

### 4.1 Manual

In [None]:
query = ""

In [None]:
db.similarity_search(query=query, deep_memory=False, k=3)

In [None]:
db.similarity_search(query=query, deep_memory=True, k=3)

### 4.2 Quantitative evaluation on synthetically generated queries

In [None]:
validation_questions, validation_relevances = generate_queries(docs, ids, n=40)

In [None]:
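from langchain.embeddings.openai import OpenAIEmbeddings

# Embeds the validation queries (assumes the LangChain OpenAI wrapper installed in step 0)
openai_embeddings = OpenAIEmbeddings()

recalls = db.vectorstore.deep_memory.evaluate(
    queries=validation_questions,
    relevance=validation_relevances,
    embedding_function=openai_embeddings.embed_documents,
)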
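
For intuition about the metric in the title, recall@k can be computed by hand from ranked result ids and the relevance labels. The sketch below uses the standard definition (share of a query's relevant ids found in the top k, averaged over queries); it is not taken from Deep Lake's implementation, which may differ in detail, and the toy ids and rankings are made up. With one relevant chunk per query, as generated in this tutorial, it reduces to the fraction of queries whose labelled chunk appears in the top k.

```python
from typing import List, Sequence, Tuple

def recall_at_k(
    retrieved_ids: Sequence[Sequence[str]],
    relevances: Sequence[Sequence[Tuple[str, int]]],
    k: int,
) -> float:
    """Mean over queries of |relevant ids in top-k| / |relevant ids|."""
    if len(retrieved_ids) != len(relevances):
        raise ValueError("need one ranked list per query")
    per_query: List[float] = []
    for ranked, labels in zip(retrieved_ids, relevances):
        relevant = {doc_id for doc_id, score in labels if score > 0}
        if not relevant:
            continue  # skip queries without a positive label
        found = relevant & set(ranked[:k])
        per_query.append(len(found) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0

# Toy data (hypothetical): four queries, each with one relevant chunk.
toy_relevances = [[("node_3", 1)], [("node_8", 1)], [("node_1", 1)], [("node_5", 1)]]
toy_retrieved = [
    ["node_3", "node_9", "node_2"],  # relevant chunk at rank 1
    ["node_4", "node_8", "node_0"],  # relevant chunk at rank 2
    ["node_6", "node_7", "node_1"],  # relevant chunk at rank 3
    ["node_2", "node_0", "node_9"],  # relevant chunk not retrieved
]
for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(toy_retrieved, toy_relevances, k):.2f}")
# prints recall@1 = 0.25, recall@2 = 0.50, recall@3 = 0.75
```

Running the same function on rankings obtained with and without Deep Memory gives a like-for-like comparison on the validation queries.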