Skip to content

fix(llm): failed to remove vectors when updating vid embedding#243

Merged
imbajin merged 2 commits intoapache:mainfrom
MrJs133:remove_vector_bug
May 20, 2025
Merged

fix(llm): failed to remove vectors when updating vid embedding#243
imbajin merged 2 commits intoapache:mainfrom
MrJs133:remove_vector_bug

Conversation

@MrJs133
Copy link
Contributor

@MrJs133 MrJs133 commented May 20, 2025

Bug Report

After deleting vertices, running update_vid_embedding does not successfully remove the corresponding vectors.

Initial State

  • 13 vertices
  • 13 vid embeddings

After Clearing Graph Data

  • 0 vertices

After update_vid_embedding

  • Output shows: “Removed 13 vectors”
    image

However, calling get_vector_index_info still shows 13 vid embeddings.
image


Problem Location

Function get_vector_index_info:

def get_vector_index_info():
    chunk_vector_index = VectorIndex.from_index_file(
        os.path.join(resource_path, huge_settings.graph_name, "chunks")
    )
    graph_vid_vector_index = VectorIndex.from_index_file(
        os.path.join(resource_path, huge_settings.graph_name, "graph_vids")
    )
    return json.dumps({
        "embed_dim": chunk_vector_index.index.d,
        "vector_info": {
            "chunk_vector_num": chunk_vector_index.index.ntotal,
            "graph_vid_vector_num": graph_vid_vector_index.index.ntotal,
            "graph_properties_vector_num": len(chunk_vector_index.properties)
        }
    }, ensure_ascii=False, indent=2)

This logic is correct. It reads graph_vid_vector_num from graph_vid_vector_index.index.ntotal.

update_vid_embedding Code Analysis

past_vids = self.vid_index.properties
present_vids = context["vertices"]
removed_vids = set(past_vids) - set(present_vids)
removed_num = self.vid_index.remove(removed_vids)
added_vids = list(set(present_vids) - set(past_vids))

This correctly identifies vectors to be removed and added.
self.vid_index.remove() implementation:

def remove(self, props: Union[Set[Any], List[Any]]) -> int:
    if isinstance(props, list):
        props = set(props)
    indices = []
    remove_num = 0

    for i, p in enumerate(self.properties):
        if p in props:
            indices.append(i)
            remove_num += 1
    self.index.remove_ids(np.array(indices))
    self.properties = [p for i, p in enumerate(self.properties) if i not in indices]
    return remove_num

This also seems correct.


Debug Output

log.debug("before %s", self.vid_index.index.ntotal)
removed_num = self.vid_index.remove(removed_vids)
log.debug("after %s", self.vid_index.index.ntotal)

Output:

[05/20/25 13:51:20] DEBUG before 13
[05/20/25 13:51:20] DEBUG after 0

→ This confirms that in-memory deletion is successful.

However, re-running update_vid_embedding again shows:

[05/20/25 13:53:23] DEBUG before 13
[05/20/25 13:53:23] DEBUG after 0

→ Confirms that the vector index file still contains 13 vectors (i.e., deletion was not persisted).

And this is verified by loading the index from file via:

self.index_dir = os.path.join(resource_path, huge_settings.graph_name, "graph_vids")
self.vid_index = VectorIndex.from_index_file(self.index_dir)

log.debug("after %s", self.vid_index.index.ntotal)

Note: The result of this deletion was not saved.

Root Cause

In the full BuildSemanticIndex.run() implementation:

removed_vids = set(past_vids) - set(present_vids)
removed_num = self.vid_index.remove(removed_vids)
added_vids = list(set(present_vids) - set(past_vids))

if added_vids:
    ...
    self.vid_index.add(...)
    self.vid_index.to_index_file(self.index_dir)
else:
    log.debug("No update vertices to build vector index.")

The call to self.vid_index.to_index_file(self.index_dir) only happens if added_vids is non-empty.

So if you're only removing embeddings (i.e., no new vertices), the deletion is never persisted to disk.


Fix

removed_num = self.vid_index.remove(removed_vids)
self.vid_index.to_index_file(self.index_dir)  # <-- Add this line

Verification

  • Remove only: works ✅
  • Add only: works ✅
  • No change: works ✅

Problem solved.

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label May 20, 2025
@github-actions github-actions bot added the llm label May 20, 2025
@MrJs133 MrJs133 changed the title fix bug: remove failed fix(llm): remove vector failed May 20, 2025
@MrJs133 MrJs133 changed the title fix(llm): remove vector failed fix(llm): remove vector failed when update vid embedding May 20, 2025
@MrJs133 MrJs133 changed the title fix(llm): remove vector failed when update vid embedding fix(llm): failed to remove vectors when updating vid embedding May 20, 2025
@dosubot dosubot bot added the bug Something isn't working label May 20, 2025
@imbajin imbajin requested a review from Copilot May 20, 2025 06:47
imbajin
imbajin previously approved these changes May 20, 2025
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label May 20, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR ensures that vector removals in update_vid_embedding are persisted to disk by unconditionally writing the updated index after removals.

  • Always call to_index_file after removing vids so deletions are saved.
  • Retains existing behavior to write on additions.

@imbajin imbajin self-requested a review May 20, 2025 06:48
…tic_index.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@imbajin imbajin merged commit e1d3e44 into apache:main May 20, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working lgtm This PR has been approved by a maintainer llm size:XS This PR changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants