# Vector DB Search 

This notebook runs a vector DB search over Arrow issues. It enables semantic search: issues are retrieved by meaning rather than by exact keyword matches. GitHub's built-in search, by contrast, is mostly lexical and relies on specific terms, labels, or filters. A vector DB can surface issues with similar intent or topic even when they use different wording, which makes it more useful for detecting duplicates, related bugs, or thematic clusters.

In [1]:
import gzip
import json
import re

import pandas as pd
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer


In [3]:
with gzip.open("../test_data/issues_min.json.gz", "rt", encoding="utf-8") as f:
    df = json.load(f)
    
df = pd.DataFrame(df)
data = df.to_dict('records')

In [11]:
data[2]

{'url': 'https://github.com/apache/arrow/pull/46723',
 'title': 'GH-46214: [C++] Improve S3 client initialization',
 'created_at': '2025-06-05T16:50:31Z',
 'user_login': 'pitrou',
 'labels': ['Component: C++', 'awaiting review'],
 'created_at.1': '2025-06-05T16:50:31Z',
 'closed_at': {},
 'pull_request': 'https://api.github.com/repos/apache/arrow/pulls/46723',
 'body': '### Rationale for this change\r\n\r\nThe default constructor of the `ClientConfiguration` class in the AWS SDK can issue spurious EC2 metadata requests to resolve the current region, even if we would later set the region from our S3 options.\r\n\r\n### What changes are included in this PR?\r\n\r\n1. Avoid spurious EC2 metadata calls by disabling "IMDS" in the `ClientConfiguration` constructor\r\n2. Change the smart defaults from "legacy" to "standard" (see https://docs.aws.amazon.com/sdkref/latest/guide/feature-smart-config-defaults.html)\r\n3. Let the user configure the smart defaults in `S3Options`\r\n\r\n### Are thes

The data cleaning below is pretty messy and really needs proper pre-filtering on the data fields rather than just the text, but it works for this prototype ;)

In [80]:
# Get rid of empty issues and any which are actually PRs
non_empty = [x for x in data if len(x['body']) > 1]
just_issues = [x for x in non_empty if not (len(x['pull_request']) > 0)]

# This is all issues - opened and closed
len(just_issues)

26336
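
The filter above leans on `len()` of each field, which works here because missing values in this export appear to be stored as empty dicts (as `closed_at` is in the record shown earlier). A more explicit check on field types might look like the sketch below. `is_issue` and the sample records are hypothetical, and the sketch assumes a PR always carries a non-empty string in `pull_request`; it also treats any non-blank body as valid, which is slightly looser than the `len(...) > 1` test above.

```python
def is_issue(record):
    """Return True for records that have a text body and are not pull requests."""
    body = record.get("body")
    pull_request = record.get("pull_request")
    has_body = isinstance(body, str) and body.strip() != ""
    is_pr = isinstance(pull_request, str) and pull_request != ""
    return has_body and not is_pr


# Hypothetical records mirroring the shape of the export
sample = [
    {"body": "Crash when reading Parquet", "pull_request": {}},
    {"body": "Fix S3 init", "pull_request": "https://api.github.com/repos/apache/arrow/pulls/46723"},
    {"body": {}, "pull_request": {}},
]
kept = [r for r in sample if is_issue(r)]
print(len(kept))
```

Only the first sample record passes: the second is a PR and the third has no text body.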

Do you want to search only open issues, or closed ones too? Set `just_open` to `False` to search all issues, open and closed.

In [81]:
just_open = True

In [82]:
if just_open:
    just_issues = [x for x in just_issues if x['state'] == "open"]

In [83]:
len(just_issues)

4235

In [84]:
# Functions to clean up the data

# Remove code chunks from issue body
def remove_code_chunks(text):
    # Remove fenced code blocks (```...```)
    text = re.sub(r"```.*?\n.*?```", "", text, flags=re.DOTALL)
    
    # Remove inline code (`...`)
    text = re.sub(r"`[^`]*`", "", text)
    
    # Remove indented code blocks (lines starting with 4+ spaces or a tab)
    text = re.sub(r"^(?: {4,}|\t).*\n?", "", text, flags=re.MULTILINE)

    return text

# Remove URLs from issue body
def remove_urls(text):
    return re.sub(r'(https?://\S+|www\.\S+|ftp://\S+)', '', text)

# Remove boilerplate issue text
def remove_boilerplate(text):

    phrases_to_remove = [
        "### Describe the enhancement requested",
        "### Describe the bug, including details regarding any error messages, version, and platform.",
        "### Component(s)",
        "### Describe the usage question you have. Please include as many useful details as  possible.",
        "**_Overview_**",
        "**_Impact_**",
        "**_Key Features_**"
    ]
    
    for phrase in phrases_to_remove:
        text = text.replace(phrase, '')

    return text

In [85]:
for x in just_issues:
    body = x.get('body', '')
    if not isinstance(body, str):
        continue  # or set x['body'] = "" if you prefer
    x['body'] = remove_boilerplate(x['body'])
    x['body'] = remove_code_chunks(x['body'])
    x['body'] = remove_urls(x['body'])
    x['body'] = "\n".join(line for line in x['body'].splitlines() if line.strip())
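
To see what these steps do to a single body, here is a small self-contained sketch that applies the same regexes in the same order (the boilerplate phrases are left out for brevity). `clean_body` and the sample text are made up for illustration and are not part of the notebook's pipeline.

```python
import re


def clean_body(text):
    # Same regexes as the cells above, applied in the same order
    text = re.sub(r"```.*?\n.*?```", "", text, flags=re.DOTALL)         # fenced code
    text = re.sub(r"`[^`]*`", "", text)                                  # inline code
    text = re.sub(r"^(?: {4,}|\t).*\n?", "", text, flags=re.MULTILINE)   # indented code
    text = re.sub(r"(https?://\S+|www\.\S+|ftp://\S+)", "", text)        # URLs
    # Drop blank lines left behind by the removals
    return "\n".join(line for line in text.splitlines() if line.strip())


sample = (
    "Reading fails with `pa.read_table`.\n"
    "\n"
    "```python\n"
    "import pyarrow as pa\n"
    "```\n"
    "See [the docs](https://arrow.apache.org/docs/) for details.\n"
)
print(clean_body(sample))
```

The result is two lines: `Reading fails with .` and `See [the docs]( for details.` The URL pattern swallows the closing parenthesis of a Markdown link, which is why cleaned bodies can contain dangling `[text](` fragments.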

And now to upload it to a searchable vector DB...

In [89]:
# Set up the vector DB

encoder = SentenceTransformer('all-MiniLM-L6-v2', device='cpu') # Model to create embeddings
qdrant = QdrantClient(":memory:") # Create in-memory Qdrant instance

# Create collection to store issues
qdrant.create_collection(
    collection_name="arrow_issues",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(), # Vector size is defined by used model
        distance=models.Distance.COSINE
    )
)

# Create embeddings and upload to collection
qdrant.upload_points(
    collection_name="arrow_issues",
    points=[
        models.PointStruct(
            id=idx,
            vector=encoder.encode(doc["body"]).tolist(),
            payload=doc
        ) for idx, doc in enumerate(just_issues) 
    ]
)
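
The collection above is configured with cosine distance, so each hit is scored by how closely its vector points in the same direction as the query vector, regardless of length. As a rough illustration of that measure only, the pure-Python sketch below uses toy 3-dimensional vectors; these are not real embeddings, and `cosine_similarity` is a hypothetical helper, not Qdrant's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # longer, but pointing the same way
orthogonal = [0.0, 3.0, 0.0]       # shares no direction with the query

print(cosine_similarity(query, same_direction))  # close to 1
print(cosine_similarity(query, orthogonal))      # 0.0
```

Because the measure ignores vector length, a long issue body and a short query can still score highly if they are about the same thing.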


Choose a term to search for

In [93]:
my_term_to_search = "compatability with pandas"

In [97]:
# Search!
hits = qdrant.search(
    collection_name="arrow_issues",
    query_vector=encoder.encode(my_term_to_search).tolist(),
    limit=10
)

In [98]:
for hit in hits:
    print(f"URL: {hit.payload['url']}\nScore: {hit.score}\n{hit.payload['body']}\n=================================\n")

URL: https://github.com/apache/arrow/issues/29025
Score: 0.6416391935666161
Currently we don't really explicitly define a "minimum supported version" for pandas, but we have (nightly) test builds with pandas 0.23 and 0.24 (in addition to latest and master) as oldest tested versions.
I think we can bump the minimum support version (pandas 0.23 was first released May 15, 2018, so more than three years ago). We could maybe directly bump to pandas 1.0 (released January 29, 2020), or otherwise something in between (eg 0.25, released July 18, 2019).
**Reporter**: [Joris Van den Bossche]( / @jorisvandenbossche
#### Related issues:
- [[Python] Remove backward compatibility hacks from pyarrow.pandas_compat]( (is related to)
<sub>**Note**: *This issue was originally created as [ARROW-13351]( Please see the [migration documentation]( for further details.*</sub>

URL: https://github.com/apache/arrow/issues/44068
Score: 0.6162408186901824
This issue is to discuss the idea of moving a significant pa