# Issue [#12](https://github.com/ai-cfia/llamaindex-db/issues/12)


In [1]:
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core import Settings
import os
from dotenv import load_dotenv
import psycopg
from psycopg.rows import dict_row
from pprint import pprint

load_dotenv()

True

## Setup LLM and Embed Model


In [2]:
llm = AzureOpenAI(
    model="gpt-4",
    deployment_name="ailab-llm",
    api_key=os.getenv("API_KEY"),
    azure_endpoint=os.getenv("AZURE_ENDPOINT"),
    api_version=os.getenv("API_VERSION"),
)

embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="ada",
    api_key=os.getenv("API_KEY"),
    azure_endpoint=os.getenv("AZURE_ENDPOINT"),
    api_version=os.getenv("API_VERSION"),
)

Settings.llm = llm
Settings.embed_model = embed_model

## Variables

In [30]:
database = os.getenv("DB_NAME")
host = os.getenv("DB_HOST")
password = os.getenv("DB_PASSWORD")
port = os.getenv("DB_PORT")
user = os.getenv("DB_USER")
llamaindex_db = "llamaindex_db_legacy"
llamaindex_schema = "v_0_0_1"

## Observed problem


In [4]:
vector_store = PGVectorStore.from_params(
    database=llamaindex_db,
    host=host,
    password=password,
    port=port,
    user=user,
    embed_dim=1536,
)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
retriever = index.as_retriever(similarity_top_k=50)

In [18]:
query = "what are the fertilizer labelling requirements?"
nodes = retriever.retrieve(query)

In [19]:
for n in nodes:
    pprint({"meta": n.metadata, "score": n.score, "node_id": n.node_id})

{'meta': {'chunk_id': '1854fdc5-af24-41e4-81ef-a742a08c6684',
          'id': '8780a201-a628-44a3-babb-c8830b68de72',
          'last_updated': '2020-11-13',
          'score': 0.534927215363663,
          'subtitle': 'Registered Fertilizer-Pesticides Labelling',
          'title': 'T-4- 102 - Requirements for fertilizer-pesticides under '
                   'the Fertilizers Act - Canadian Food Inspection Agency',
          'tokens_count': 305,
          'url': 'https://inspection.canada.ca/plant-health/fertilizers/trade-memoranda/t-4-102/eng/1307854513877/1307854674148'},
 'node_id': '8780a201-a628-44a3-babb-c8830b68de72',
 'score': 0.9085487454359314}
{'meta': {'chunk_id': 'f57c95ef-1dd7-4d03-886e-d82b7fa22563',
          'id': '8780a201-a628-44a3-babb-c8830b68de72',
          'last_updated': '2020-11-13',
          'score': 0.534927215363663,
          'subtitle': 'Exemptions from Registration;Customer Formula '
                      'Fertilizer-Pesticide Labelling',
          'titl

We can observe that multiple nodes reference the same url (document).
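
To quantify this, the retrieved nodes can be counted per url. The following is a minimal sketch, assuming each retrieved node exposes a `metadata` dict with a `url` key, as in the output above. The helper name `count_nodes_per_url` and the stand-in nodes are hypothetical and only serve the illustration.

```python
from collections import Counter
from types import SimpleNamespace


def count_nodes_per_url(nodes):
    """Return a Counter mapping each url to the number of nodes that reference it."""
    return Counter(n.metadata.get("url") for n in nodes)


# Stand-in objects for illustration (hypothetical data); in the notebook,
# pass the `nodes` returned by `retriever.retrieve(query)` instead.
sample_nodes = [
    SimpleNamespace(metadata={"url": "https://example.com/a"}),
    SimpleNamespace(metadata={"url": "https://example.com/a"}),
    SimpleNamespace(metadata={"url": "https://example.com/b"}),
]

url_counts = count_nodes_per_url(sample_nodes)
# Keep only the urls referenced by more than one node.
duplicated_urls = {url: count for url, count in url_counts.items() if count > 1}
print(duplicated_urls)
```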

## Root cause

A long enough document is split into chunks (HTML sections in our case). A node is a chunk plus all its metadata. A query's vector can be similar to several nodes of the same document at once: for instance, the subject of `fertilizer labelling requirements` might span multiple sections of the original webpage. Even then, we should expect nodes from the same document to have different scores. That is not the case here, which suggests a deeper issue.

I just noticed: nodes from the same document have the same `node_id`. This suggests that the `node_ids` reference the document instead of the chunks. Indeed, the node creation code in [pgvector_ailab_db.ipynb](./pgvector_ailab_db.ipynb) (section "Creating nodes from louis_v005.documents" at the time of writing) confirms it.

```python
query = """
    SELECT id, content, embedding, chunk_id, url, title, subtitle, tokens_count, last_updated, score
    FROM louis_v005.documents
"""
nodes = []
with psycopg.connect(conn_string) as conn:
    with conn.cursor(row_factory=dict_row) as cur:
        results = cur.execute(query).fetchall()
        for r in tqdm(results, desc="Processing records"):
            node = TextNode(
                text=r["content"],
                id_=str(r["id"]), # <---- Here
                embedding=json.loads(r["embedding"]),
            )
```

Something else I noticed: many of the nodes that reference the same document have the same embedding, which is virtually impossible for distinct chunks of text.

In [13]:
conn_string = (
    f"dbname={llamaindex_db} "
    f"user={user} "
    f"password={password} "
    f"host={host} "
    f"port={port}"
)

node_id = nodes[0].node_id
print(node_id)

query = """
    SELECT node_id, embedding
    FROM public.data_llamaindex
    WHERE node_id = %s
"""
with psycopg.connect(conn_string) as conn:
    with conn.cursor(row_factory=dict_row) as cur:
        results = cur.execute(query, (node_id,)).fetchall()

8780a201-a628-44a3-babb-c8830b68de72


In [20]:
pprint(results)

[{'embedding': '[-0.015943341,0.0076930355,-0.0024652374,-0.026246028,0.011077427,0.006799366,-0.00041030656,-0.0014900159,-0.017574372,-0.027727548,0.014176388,0.004257674,-0.0012895348,0.016092852,-0.017615149,0.021611178,0.031098349,-0.043086436,0.013850182,-0.01312981,0.0006783225,-0.022943188,-0.007964875,-0.016120035,0.0036936086,0.0068129576,0.00031813624,-0.042787414,-0.0065852925,0.00020069342,-0.008766798,-0.011240531,0.012973502,-0.028706167,0.013048258,-0.0036562306,-0.0043256334,0.0016590656,0.021162644,-0.021380115,-0.0021356328,-0.00015715676,0.0002457167,0.0014551866,0.022848044,-0.015345295,0.010961896,-0.056270614,-0.013727855,0.04243402,0.014760842,0.031777944,-0.004485339,-0.0294945,0.0131026255,0.02310629,0.005729001,0.023446089,-0.025525656,-0.0036596286,-0.017574372,-0.013340484,-0.008134773,0.00097097387,-0.024451893,-0.0043426235,-0.027754731,0.003618853,0.0076998314,-0.0105609335,0.016690897,0.027279014,0.027768325,0.00022405456,0.010187156,-0.028434329,0.0036

In [21]:
print("results[1] and results[2] are the same:", results[1] == results[2])

results[1] and results[2] are the same: True
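
The comparison above only looks at two rows. To check every row at once, one option is to count the distinct embedding values. This is a minimal sketch, assuming `results` is a list of dicts whose `embedding` key holds the vector in its string form, as printed above. The helper name `count_distinct_embeddings` and the stand-in rows are hypothetical.

```python
def count_distinct_embeddings(rows):
    """Return (number of rows, number of distinct embedding values)."""
    distinct = {str(r["embedding"]) for r in rows}
    return len(rows), len(distinct)


# Stand-in rows for illustration (hypothetical data); in the notebook,
# pass the `results` list fetched from the database instead.
sample_rows = [
    {"node_id": "doc-1", "embedding": "[0.1,0.2,0.3]"},
    {"node_id": "doc-1", "embedding": "[0.1,0.2,0.3]"},
    {"node_id": "doc-1", "embedding": "[0.4,0.5,0.6]"},
]

row_count, distinct_count = count_distinct_embeddings(sample_rows)
print(f"{row_count} rows, {distinct_count} distinct embeddings")
```

If the number of distinct embeddings is lower than the number of rows, some chunks share an embedding.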


For comparison, here are the chunks from the same document in `louis_v005`:

In [22]:
conn_string = (
    f"dbname={database} "
    f"user={user} "
    f"password={password} "
    f"host={host} "
    f"port={port}"
)

node_id = nodes[0].node_id
print(node_id)

query = """
    SELECT id, chunk_id, embedding
    FROM louis_v005.documents
    WHERE id = %s
"""
with psycopg.connect(conn_string) as conn:
    with conn.cursor(row_factory=dict_row) as cur:
        results = cur.execute(query, (node_id,)).fetchall()

8780a201-a628-44a3-babb-c8830b68de72


In [23]:
pprint(results)

[{'chunk_id': UUID('dd4b52f5-8454-4b70-b993-7ed920ce8ade'),
  'embedding': '[-0.015943341,0.0076930355,-0.0024652374,-0.026246028,0.011077427,0.006799366,-0.00041030656,-0.0014900159,-0.017574372,-0.027727548,0.014176388,0.004257674,-0.0012895348,0.016092852,-0.017615149,0.021611178,0.031098349,-0.043086436,0.013850182,-0.01312981,0.0006783225,-0.022943188,-0.007964875,-0.016120035,0.0036936086,0.0068129576,0.00031813624,-0.042787414,-0.0065852925,0.00020069342,-0.008766798,-0.011240531,0.012973502,-0.028706167,0.013048258,-0.0036562306,-0.0043256334,0.0016590656,0.021162644,-0.021380115,-0.0021356328,-0.00015715676,0.0002457167,0.0014551866,0.022848044,-0.015345295,0.010961896,-0.056270614,-0.013727855,0.04243402,0.014760842,0.031777944,-0.004485339,-0.0294945,0.0131026255,0.02310629,0.005729001,0.023446089,-0.025525656,-0.0036596286,-0.017574372,-0.013340484,-0.008134773,0.00097097387,-0.024451893,-0.0043426235,-0.027754731,0.003618853,0.0076998314,-0.0105609335,0.016690897,0.0272790

We can see that, in `louis_v005.documents`, the embeddings of the chunks are never the same. So the duplication probably happened during node creation, because every node of a document was given the same (document) id.

## Fix wrong `node_ids`

To fix this, we have to modify the node creation code and rebuild the index. Fortunately, all the embeddings already exist, so rebuilding should not incur any embedding cost. By my estimate, the only costs would come from database read and write operations.

New node generation code:

```python
#...
            node = TextNode(
                text=r["content"],
                id_=str(r["chunk_id"]), # changed "id" to "chunk_id"
                embedding=json.loads(r["embedding"]),
            )
#...
```

## Testing the `node_id` fix

In [31]:
vector_store = PGVectorStore.from_params(
    database=llamaindex_db,
    host=host,
    password=password,
    port=port,
    user=user,
    schema_name=llamaindex_schema,
    embed_dim=1536,
)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
retriever = index.as_retriever(similarity_top_k=50)

In [32]:
query = "what are the fertilizer labelling requirements?"
nodes = retriever.retrieve(query)

In [33]:
for n in nodes:
    pprint({"meta": n.metadata, "score": n.score, "node_id": n.node_id})

{'meta': {'chunk_id': '1854fdc5-af24-41e4-81ef-a742a08c6684',
          'id': '8780a201-a628-44a3-babb-c8830b68de72',
          'last_updated': '2020-11-13',
          'score': 0.534927215363663,
          'subtitle': 'Registered Fertilizer-Pesticides Labelling',
          'title': 'T-4- 102 - Requirements for fertilizer-pesticides under '
                   'the Fertilizers Act - Canadian Food Inspection Agency',
          'tokens_count': 305,
          'url': 'https://inspection.canada.ca/plant-health/fertilizers/trade-memoranda/t-4-102/eng/1307854513877/1307854674148'},
 'node_id': '1854fdc5-af24-41e4-81ef-a742a08c6684',
 'score': 0.9085487454359314}
{'meta': {'chunk_id': '8144cb04-e745-49a4-b68b-809a700dee90',
          'id': '1ca75f55-e758-4830-9226-0577f9220482',
          'last_updated': '2022-06-08',
          'score': 0.5186214394910862,
          'subtitle': 'IV. Labelling',
          'title': 'T-4- 120 – Regulation of compost under the Fertilizers Act '
                   'a

- ✅ `node_ids` are no longer duplicated
- ✅ `scores` are no longer equal
- ❌ there are still nodes referencing the same url, but far fewer

## Solution to return only the highest score node per document (url)
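
One possible approach is to post-process the retrieved nodes and keep, for each url, only the node with the highest score. The following is a minimal sketch, assuming each node exposes `metadata["url"]`, `score` and `node_id` as in the outputs above. The helper name `best_node_per_url` and the stand-in nodes are hypothetical.

```python
from types import SimpleNamespace


def best_node_per_url(nodes):
    """Keep only the highest-scoring node per url, sorted by descending score."""
    best = {}
    for n in nodes:
        url = n.metadata.get("url")  # nodes without a url all share the key None
        current = best.get(url)
        # A missing score is treated as 0.0 so that comparisons never fail.
        if current is None or (n.score or 0.0) > (current.score or 0.0):
            best[url] = n
    return sorted(best.values(), key=lambda n: n.score or 0.0, reverse=True)


# Stand-in objects for illustration (hypothetical data); in the notebook,
# pass the `nodes` returned by `retriever.retrieve(query)` instead.
sample_nodes = [
    SimpleNamespace(node_id="c1", score=0.91, metadata={"url": "https://example.com/a"}),
    SimpleNamespace(node_id="c2", score=0.85, metadata={"url": "https://example.com/a"}),
    SimpleNamespace(node_id="c3", score=0.88, metadata={"url": "https://example.com/b"}),
]

top_nodes = best_node_per_url(sample_nodes)
for n in top_nodes:
    print(n.node_id, n.score, n.metadata["url"])
```

Because the filtering happens after retrieval, the result can contain fewer than `similarity_top_k` nodes. If a fixed number of distinct documents is needed, `similarity_top_k` may have to be set higher than that number.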

