---
date:
  created: 2025-04-22
  updated: 2025-04-22

categories:
- Deep learning

tags:
- NLP
- Polars
- Hugging Face
- "Series: GitHub repo issues dataset"

slug: retrieve-info-from-github-repo-issues
---

# Retrieving information from a GitHub repo issues dataset

This is Part III of my adaptation of
:simple-huggingface: Hugging Face NLP Course: [Creating your own dataset][1].
This part consists of four steps:

1. Creating a corpus of issue-comment pairs from the previously prepared dataset.
2. Embedding each issue-comment pair into dense vectors for similarity search.
3. Building a Faiss index of the embeddings to speed up querying.
4. Using a reranker to pick the best entries from the top-k similarity search results.

<a href="https://colab.research.google.com/github/dd-n-kk/notebooks/blob/main/blog/retrieve-info-from-github-repo-issues.ipynb" target="_parent">
    :simple-googlecolab: Colab notebook
</a>

<!-- more -->

## Preparations

In [None]:
!uv pip install -Uq polars
!uv pip install -q datasets faiss-cpu

In [None]:
from collections.abc import Sequence

import faiss
import numpy as np
import polars as pl
import torch as tc
from datasets import load_dataset
from numpy.typing import NDArray
from polars import col
from torch.nn import functional as F
from tqdm.auto import trange
from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification

In [None]:
_ = pl.Config(
    float_precision=3,
    fmt_str_lengths=200,
    fmt_table_cell_list_len=-1,
    tbl_cols=-1,
    tbl_rows=100,
    tbl_width_chars=-1,
)

## Preparing the corpus

We directly download the `dd-n-kk/uv-github-issues` dataset prepared in [Part II][2].

In [None]:
%%capture
issues = load_dataset("dd-n-kk/uv-github-issues", "issues")
comments = load_dataset("dd-n-kk/uv-github-issues", "comments")

The train-test splits are merged because all processed entries will be used for querying.

In [None]:
issues_df = pl.concat([issues["train"].to_polars(), issues["test"].to_polars()])
comments_df = pl.concat([comments["train"].to_polars(), comments["test"].to_polars()])

### Issues

I decide to remove:

- Issues with null bodies.
- Issues created by bots, because they usually contain little info.
- Pull requests not yet merged, for they often contain suggestions not yet adopted.

In [None]:
issues_df = issues_df.filter(
    col("body").is_not_null()
    & (~col("user").str.contains("[bot]", literal=True))
    & (~col("pull_request") | col("merged_at").is_not_null())
)
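
The pull-request condition can be easy to misread: a row survives if it is not a pull request at all, or if it is a pull request that has been merged. As a minimal sketch, here is the same predicate in plain Python over a few hypothetical records. The user names and values are made up for illustration, and `pull_request` is assumed to be a boolean flag, as the filter implies.

```python
def keep(issue: dict) -> bool:
    """Mirror the Polars filter above on a single record."""
    has_body = issue["body"] is not None
    from_bot = "[bot]" in issue["user"]
    unmerged_pr = issue["pull_request"] and issue["merged_at"] is None
    return has_body and not from_bot and not unmerged_pr


# Hypothetical records, one per case the filter distinguishes.
records = [
    {"user": "alice", "body": "Bug report", "pull_request": False, "merged_at": None},
    {"user": "dependabot[bot]", "body": "Bump x", "pull_request": True, "merged_at": "2025-01-01"},
    {"user": "bob", "body": "WIP fix", "pull_request": True, "merged_at": None},
    {"user": "carol", "body": "Fix typo", "pull_request": True, "merged_at": "2025-01-02"},
    {"user": "dave", "body": None, "pull_request": False, "merged_at": None},
]

kept = [r["user"] for r in records if keep(r)]
print(kept)
```

Only the plain issue and the merged pull request pass; the bot entry, the unmerged pull request, and the body-less issue are dropped.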

A crude word count reveals a problem: a small number of issues are extremely long.

In [None]:
q = (
    issues_df.select(
        "html_url", "title", col("body").str.split(" ").list.len().alias("n_words")
    )
    .sort("n_words", descending=True)
)
q.get_column("n_words").describe()

statistic,value
str,f64
"""count""",10313.0
"""null_count""",0.0
"""mean""",152.041
"""std""",424.822
"""min""",1.0
"""25%""",21.0
"""50%""",62.0
"""75%""",151.0
"""max""",12590.0


From the GitHub web pages we can see that they contain long debug outputs,
which are unlikely to help answer user questions.

In [None]:
q.head(5)

html_url,title,n_words
str,str,u32
"""https://github.com/astral-sh/uv/issues/6443""","""`uv sync` freezes infinitely at the container root""",12590
"""https://github.com/astral-sh/uv/issues/5742""","""Allow `uv sync --no-build-isolation`""",11862
"""https://github.com/astral-sh/uv/issues/5046""","""Bad resolver error for `colabfold[alphafold]==1.5.5` on python 3.11""",11764
"""https://github.com/astral-sh/uv/issues/7183""","""Improve Python version resolution UI""",11316
"""https://github.com/astral-sh/uv/issues/2062""","""uv pip and python -m pip resolve different versions of tensorflow in pyhf developer environment""",8819


To shorten these issues, I:

- Replace Markdown fenced code blocks of roughly 768 or more characters with `[CODE]`.
- Replace each HTML `<details>` element with `[DETAILS]`.
- Replace HTML comments `<!-- ... -->` with `[COMMENT]`.
- Remove trailing whitespace and blank lines.
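
To see what these substitutions do on a small input, here is a sketch using Python's `re` module instead of Polars. The function name `shrink`, the sample text, and the lowered threshold of 20 (the pipeline uses 768) are assumptions made for the demo. Polars uses a different regex engine, though these particular patterns should behave the same way in both.

```python
import re

TICKS = "`" * 3  # a Markdown code fence, built here to keep this example's own fence intact
MIN_UNITS = 20   # hypothetical demo threshold; the pipeline uses 768

# Each repetition consumes one non-backtick character, or one or two backticks
# followed by a non-backtick, so it can never swallow three consecutive backticks.
LONG_FENCE = re.compile(TICKS + r"(?:[^`]|`[^`]|``[^`]){%d,}" % MIN_UNITS + TICKS)


def shrink(body: str) -> str:
    body = re.sub(r"\s*[\r\n]", "\n", body)  # drop trailing whitespace and blank lines
    body = re.sub(r"(?s)<details>.*?</details>", "[DETAILS]", body)
    body = re.sub(r"(?s)<!--.*?-->", "[COMMENT]", body)
    return LONG_FENCE.sub("[CODE]", body)


sample = (
    "Intro   \r\n\r\n"
    "<!-- template hint -->\n"
    "<details>\nlong log\n</details>\n"
    + TICKS + "\n" + "x" * 30 + "\n" + TICKS + "\n"
    + "short: " + TICKS + "ok" + TICKS + "\n"
)
print(shrink(sample))
```

The long fenced block collapses to `[CODE]`, while the short inline fence stays untouched because it falls below the threshold.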

In [None]:
issues_df = (
    issues_df.lazy()
    .select(
        "number",
        col("html_url").alias("issue_url"),
        "title",
        (
            col("body").str.replace_all(r"\s*[\r\n]", "\n")
            .str.replace_all(r"(?s)<details>.*?</details>", "[DETAILS]")
            .str.replace_all(r"(?s)<!--.*?-->", "[COMMENT]")
            .str.replace_all(r"```(?:[^`]|`[^`]|``[^`]){768,}```", "[CODE]")
            .str.replace_all(r"~~~(?:[^~]|~[^~]|~~[^~]){768,}~~~", "[CODE]")
        ),
    )
    .collect()
)

An issue body containing long debug outputs now looks like this:

In [None]:
print(issues_df.filter(col("number") == 6443).item(0, "body"))

To reproduce, try building the following Dockerfile (remove `sudo` if your docker is rootless):
```
cat <<EOF | sudo BUILDKIT_PROGRESS=plain docker build -
FROM library/python:3.11
RUN pip install 'uv == 0.3.1' \
    && printf >pyproject.toml '\
      [project]\n\
      dependencies = ["django ~= 4.2"]\n\
      name = "demo"\n\
      version = "0.1.0"\n\
      requires-python = ">=3.11.7"\n\
    '\
    && uv lock
RUN uv sync -vv
EOF
```
This is not specific to Django, according to my experiments.
The build freezes after `uv_build::run_python_script script="get_requires_for_build_editable", python_version=3.11.9` verbose log, keeping a high CPU load for a few minutes.
[DETAILS]
However, this only happens if I build this at the root of filesystem. Adding `WORKDIR /home` before installation recovers everything, the build completes in seconds.
[DETAILS]



The word counts are now substantially reduced.

In [None]:
issues_df.get_column("body").str.split(" ").list.len().describe()

statistic,value
str,f64
"""count""",10313.0
"""null_count""",0.0
"""mean""",81.815
"""std""",105.018
"""min""",1.0
"""25%""",19.0
"""50%""",51.0
"""75%""",110.0
"""max""",3260.0


### Comments

The comments dataset is processed similarly:
Bot comments are removed and long code blocks are snipped.

In [None]:
comments_df = comments_df.filter(~col("user").str.contains("[bot]", literal=True))

In [None]:
comments_df.get_column("body").str.split(" ").list.len().describe()

statistic,value
str,f64
"""count""",33789.0
"""null_count""",0.0
"""mean""",69.231
"""std""",454.41
"""min""",1.0
"""25%""",12.0
"""50%""",25.0
"""75%""",55.0
"""max""",39896.0


In [None]:
comments_df = (
    comments_df.lazy()
    .select(
        "issue_number",
        col("html_url").alias("comment_url"),
        (
            col("body").str.replace_all(r"\s*[\r\n]", "\n")
            .str.replace_all(r"(?s)<details>.*?</details>", "[DETAILS]")
            .str.replace_all(r"(?s)<!--.*?-->", "[COMMENT]")
            .str.replace_all(r"```(?:[^`]|`[^`]|``[^`]){768,}```", "[CODE]")
            .str.replace_all(r"~~~(?:[^~]|~[^~]|~~[^~]){768,}~~~", "[CODE]")
            .alias("comment_body")
        ),
    )
    .collect()
)

In [None]:
comments_df.get_column("comment_body").str.split(" ").list.len().describe()

statistic,value
str,f64
"""count""",33789.0
"""null_count""",0.0
"""mean""",43.568
"""std""",62.287
"""min""",1.0
"""25%""",12.0
"""50%""",24.0
"""75%""",51.0
"""max""",1934.0


### Joining

We are ready to create the corpus.
I considered combining each issue with all of its comments,
but that might require an embedding model with a very large context length.
Therefore, I use a left join to create issue-comment pairs
while preserving issues that have no comments.

URLs are also collected for convenient lookups.
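
The pairing idea in miniature, as a plain-Python sketch rather than the Polars query: every comment yields one entry that repeats its issue's title and body, and an issue without comments still yields one entry on its own. The tables and `example.com` URLs below are hypothetical.

```python
from collections import defaultdict

# Hypothetical miniature tables.
issues = [
    {"number": 1, "issue_url": "https://example.com/issues/1", "title": "Crash", "body": "It crashes."},
    {"number": 2, "issue_url": "https://example.com/issues/2", "title": "Docs", "body": "Typo."},
]
comments = [
    {"issue_number": 1, "comment_url": "https://example.com/issues/1#c1", "comment_body": "Fixed."},
    {"issue_number": 1, "comment_url": "https://example.com/issues/1#c2", "comment_body": "Thanks!"},
]

# Group comments by the issue they belong to.
by_issue = defaultdict(list)
for c in comments:
    by_issue[c["issue_number"]].append(c)

corpus = []
for issue in issues:
    head = f"Issue {issue['number']}: {issue['title']}\n\n{issue['body']}"
    matched = by_issue.get(issue["number"], [])
    if not matched:
        # Left-join behavior: keep the issue even though nothing matched.
        corpus.append({"url": issue["issue_url"], "text": head})
    for c in matched:
        corpus.append(
            {"url": c["comment_url"], "text": f"{head}\n\nComment:\n{c['comment_body']}"}
        )

for row in corpus:
    print(row["url"])
```

Each entry carries the most specific URL available: the comment URL when there is a comment, the issue URL otherwise.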

In [None]:
corpus_df = (
    issues_df.lazy()
    .join(comments_df.lazy(), how="left", left_on="number", right_on="issue_number")
    .select(
        (
            pl.when(col("comment_url").is_null())
            .then(col("issue_url"))
            .otherwise("comment_url")
            .alias("url")
        ),
        (
            pl.when(col("comment_body").is_null())
            .then(
                pl.format("Issue {}: {}\n\n{}", col("number"), col("title"), col("body"))
            )
            .otherwise(
                pl.format(
                    "Issue {}: {}\n\n{}\n\nComment:\n{}",
                    col("number"),
                    col("title"),
                    col("body"),
                    col("comment_body"),
                )
            )
            .alias("text")
        ),
    )
    .sort("url")
    .collect()
)

In [None]:
assert corpus_df.get_column("url").n_unique() == len(corpus_df)

In [None]:
corpus_df.get_column("text").str.split(" ").list.len().describe()

statistic,value
str,f64
"""count""",35278.0
"""null_count""",0.0
"""mean""",162.174
"""std""",149.027
"""min""",3.0
"""25%""",68.0
"""50%""",128.0
"""75%""",213.0
"""max""",3312.0


## Embedding issue-comment pairs

!!! warning "&#8203;"
    Running this section likely requires a GPU.

I pick [`BAAI/bge-m3`][3] as the pretrained embedder.
It is based on [`FacebookAI/xlm-roberta-large`][4].
It is reasonably sized for a Colab T4 GPU, has a long enough context length of 8192,
and is versatile and efficient.

In [None]:
corpus = corpus_df.get_column("text").to_list()

In [None]:
%%capture
EMB_CKPT = "BAAI/bge-m3"
tokenizer = AutoTokenizer.from_pretrained(EMB_CKPT)
embedder = AutoModel.from_pretrained(EMB_CKPT)

To reduce unnecessary padding, the padding length is determined batch by batch.
The encoded entries are also batched in order of decreasing length,
so that each batch's maximum length fits its entries tightly.
Afterwards, the embeddings have to be restored to the original order.

In [None]:
def embed(
    texts: Sequence[str],
    *,
    tokenizer,
    embedder,
    batch_size: int,
    context_len: int,
    device=None,
    use_half: bool = True,
) -> NDArray:
    no_tqdm = len(texts) < batch_size
    if device is None:
        device = tc.device("cuda" if tc.cuda.is_available() else "cpu")

    encodings = []
    for i in trange(0, len(texts), batch_size, desc="Tokenization", disable=no_tqdm):
        # No padding now; pad within each embedder input batch.
        batch = tokenizer(
            texts[i : i + batch_size], truncation=True, max_length=context_len
        )

        # dict[list] -> list[dict]
        ## Just one way to conform to `tokenizer.pad()`.
        encodings.extend(dict(zip(batch, vals)) for vals in zip(*batch.values()))

    # Sort by token count in descending order to reduce padding.
    # Keep the sorted index to restore the original order later.
    ## Reverse view > element-wise negative (https://stackoverflow.com/a/16486305)
    sorted_index = np.argsort([len(x["input_ids"]) for x in encodings])[::-1]
    encodings = [encodings[i] for i in sorted_index]

    embedder = embedder.to(device).eval()
    # Using float16 only on GPU.
    if device.type == "cuda" and use_half:
        embedder = embedder.half()

    embeddings = []
    with tc.inference_mode():
        for i in trange(
            0,
            len(encodings),
            batch_size,
            desc="Embedding",
            disable=no_tqdm,
        ):
            # Within-batch padding
            ## `BatchEncoding` has method `to()`.
            padded = tokenizer.pad(
                encodings[i : i + batch_size],
                padding=True,
                return_tensors="pt",
            ).to(device)

            # [CLS] pooling with normalization
            embeddings.append(
                F.normalize(embedder(**padded).last_hidden_state[:, 0], dim=-1)
                .cpu()
                .numpy()
            )

    # Merge, cast to float32 (for Faiss), and restore original order.
    return np.concatenate(embeddings, 0, dtype=np.float32)[np.argsort(sorted_index)]
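
The sort-then-restore trick in `embed` is compact, so here is a small sketch of it in isolation. The token counts are hypothetical, and `padded_tokens` is a helper invented for this illustration to count how many token slots a padded batch occupies.

```python
import numpy as np


def padded_tokens(lengths, batch_size):
    """Total token slots used when each batch is padded to its longest entry."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i : i + batch_size]
        total += int(max(batch)) * len(batch)
    return total


lengths = np.array([5, 12, 3, 9])  # hypothetical token counts

sorted_index = np.argsort(lengths)[::-1]  # longest first
inverse = np.argsort(sorted_index)        # where each original entry ended up

# Stand-in for embeddings computed in sorted order.
processed = lengths[sorted_index] * 10
restored = processed[inverse]             # back in the original order

print(padded_tokens(lengths, 2), padded_tokens(lengths[sorted_index], 2))
print(restored)
```

Applying `np.argsort` to a permutation gives its inverse, which is why indexing the sorted results with `np.argsort(sorted_index)` puts every row back where it started.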

The embedding process takes about 10 minutes on a Colab T4 GPU runtime.

In [None]:
embeddings = embed(
    corpus,
    tokenizer=tokenizer,
    embedder=embedder,
    batch_size=32,
    context_len=4096,
    use_half=True,
)

Tokenization:   0%|          | 0/1103 [00:00<?, ?it/s]

Embedding:   0%|          | 0/1103 [00:00<?, ?it/s]

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [None]:
embeddings.shape

(35278, 1024)

We can now create some example user questions about `astral-sh/uv` as queries.

In [None]:
queries = [
    "What is the difference between `uv pip install` and `uv add`?",
    "How to update Python in my venv to the latest version?",
    "How to install the CPU version of PyTorch?",
    "Can I add a package dependency without version requirement?",
    "What does the `.python-version` file do?",
]

In [None]:
q_embeddings = embed(
    queries,
    tokenizer=tokenizer,
    embedder=embedder,
    batch_size=8,
    context_len=512,
    use_half=True,
)

In [None]:
%timeit (q_embeddings @ embeddings.T)

55.6 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


The higher the inner product of a query embedding and an issue-comment pair embedding,
the more similar they should be,
and the more likely the issue-comment pair contains an answer to the question.
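
Because `embed` L2-normalizes its outputs, the inner product here is the cosine similarity, so scores lie between -1 and 1. A minimal sketch with hand-made 3-dimensional vectors (all values hypothetical, chosen only to make the ranking easy to follow):

```python
import numpy as np


def normalize(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


docs = normalize(
    np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]], dtype=np.float32)
)
qs = normalize(np.array([[0.9, 0.1, 0.0], [0.6, 0.7, 0.1]], dtype=np.float32))

scores = qs @ docs.T      # shape (n_queries, n_docs); cosine similarities
top1 = scores.argmax(-1)  # best-matching document per query

print(top1)
```

The first query points almost along the first axis and picks document 0; the second sits between the first two axes and picks document 3, the diagonal one.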

In [None]:
result_indexes = (q_embeddings @ embeddings.T).argmax(-1).tolist()

In [None]:
def display_query_results(queries, corpus_df, indexes):
    for query, i in zip(queries, indexes):
        print(f"Query: {query}\n")
        print(f"Result URL: {corpus_df.get_column('url')[i]}\n")
        print(f"Result:\n{corpus_df.get_column('text')[i]}\n")
        print(f"==================================================\n")

Even though only some of the comments directly answer the questions,
all the retrieved issues are indeed highly relevant.

In [None]:
display_query_results(queries, corpus_df, result_indexes)

Query: What is the difference between `uv pip install` and `uv add`?

Result URL: https://github.com/astral-sh/uv/issues/9219#issuecomment-2613016513

Result:
Issue 9219: What's the difference between `uv pip install` and `uv add`

[COMMENT]
I've been using `uv` for a while and I really enjoy it. I keep stumbling onto the same confusion though. I never quite know whether do to:
```sh
uv init
uv venv
uv add polars marimo
uv run hello.py
```
or
```sh
uv init
uv venv
source .venv/bin/activate
pip install polars marimo
python hello.py
```
are these two above equivalent?
---
also are these two equivalent?
```
uv add polars
```
```
uv pip install polars
```

Comment:
1. `pip install` will only install into its own environment, `uv pip install` can target other environments.
2. `uv run script.py` will activate a virtual environment if necessary, and read PEP 723 inline metadata or sync your project if necessary, then call `python script.py` — the latter just uses your current environment as-i

## Building a Faiss index to improve query speed

[Faiss][5] is a library that can create indexes to speed up similarity search
among dense vector embeddings, often at very little cost of accuracy.

I use an inverted file index with inner product as metric.
`nlist` is the number of partitions (stored as inverted lists)
that the embedding space is divided into,
and `nprob` (later assigned to the index's `nprobe` attribute) is the number of partitions examined per query.
They are set roughly according to the [guidelines][6].
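
To make the trade-off between the two parameters concrete, here is a toy NumPy sketch of the inverted-file idea. It is not how Faiss is implemented: the centroids are hand-picked rather than learned during training, and all function names and data are hypothetical.

```python
import numpy as np


def unit(degrees):
    """2-D unit vectors at the given angles."""
    rad = np.deg2rad(np.asarray(degrees, dtype=np.float64))
    return np.stack([np.cos(rad), np.sin(rad)], axis=-1)


def build_lists(vectors, centroids):
    """Assign each vector to the centroid with the largest inner product."""
    assigned = (vectors @ centroids.T).argmax(-1)
    return [np.flatnonzero(assigned == c) for c in range(len(centroids))]


def ivf_search(query, vectors, centroids, lists, nprobe):
    """Scan only the `nprobe` partitions whose centroids best match the query."""
    probed = np.argsort(-(centroids @ query))[:nprobe]
    candidates = np.concatenate([lists[c] for c in probed])
    return int(candidates[(vectors[candidates] @ query).argmax()])


centroids = unit([0, 90, 180, 270])  # 4 partitions, hand-picked for the demo
vectors = unit([10, 30, 46, 100, 170, 200, 280])
lists = build_lists(vectors, centroids)
query = unit(44)

exact = int((vectors @ query).argmax())
print(exact, ivf_search(query, vectors, centroids, lists, nprobe=1))
print(exact, ivf_search(query, vectors, centroids, lists, nprobe=2))
```

With one probed partition the search misses the true nearest neighbor, which sits just across a partition boundary; probing two partitions recovers it at the cost of scanning more vectors.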

In [None]:
D = embeddings.shape[-1]
nlist = 2048
nprob = 16

In [None]:
quantizer = faiss.IndexFlatIP(D)
faiss_index = faiss.IndexIVFFlat(quantizer, D, nlist, faiss.METRIC_INNER_PRODUCT)

Building the index takes less than 1 minute.

In [None]:
faiss_index.train(embeddings)
faiss_index.add(embeddings)

In [None]:
faiss_index.is_trained, faiss_index.ntotal, faiss_index.nprobe

(True, 35278, 1)

In [None]:
faiss_index.nprobe = nprob

The index cuts query time by a factor of $\sim 20$
and happens to return exactly the same results for our example queries.

In [None]:
%timeit faiss_index.search(q_embeddings, k=1)

2.71 ms ± 69 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
metrics, faiss_result_indexes = faiss_index.search(q_embeddings, k=1)
result_indexes, faiss_result_indexes.reshape(-1).tolist()

([25352, 2833, 2230, 18026, 24623], [25352, 2833, 2230, 18026, 24623])
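
One way to quantify "the same results" is recall against the exact search. The sketch below uses a hypothetical helper, `recall_at_k`, applied to the top-1 indexes printed above.

```python
def recall_at_k(exact, approx):
    """Fraction of exact top-k hits that the approximate search also returned."""
    hits = sum(len(set(e) & set(a)) for e, a in zip(exact, approx))
    return hits / sum(len(e) for e in exact)


# Top-1 indexes printed above, one row per query.
exact_top1 = [[25352], [2833], [2230], [18026], [24623]]
faiss_top1 = [[25352], [2833], [2230], [18026], [24623]]

print(recall_at_k(exact_top1, faiss_top1))
```

Five queries are far too few to generalize from; on a larger query set the same helper could guide the choice of `nprobe`.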

## Using a reranker to improve query results

A reranker scores each query-document pair directly from their text,
which tends to be more accurate but much slower than comparing embeddings with a distance metric.

Here I use [`BAAI/bge-reranker-v2-m3`][7], the matching reranker of `BAAI/bge-m3`.

In [None]:
%%capture
RRK_CKPT = "BAAI/bge-reranker-v2-m3"
rerank_tokenizer = AutoTokenizer.from_pretrained(RRK_CKPT)
reranker = AutoModelForSequenceClassification.from_pretrained(RRK_CKPT)

In [None]:
def rerank(
    query: str,
    corpus: Sequence[str],
    indexes: Sequence[int],
    tokenizer,
    reranker,
    context_len: int,
    device=None,
):
    if device is None:
        device = tc.device("cuda" if tc.cuda.is_available() else "cpu")

    pairs = [[query, corpus[i]] for i in indexes]
    inputs = tokenizer(
        pairs, padding=True, truncation=True, max_length=context_len, return_tensors="pt"
    ).to(device)

    reranker = reranker.to(device).eval()
    with tc.inference_mode():
        logits = reranker(**inputs).logits.reshape(-1).cpu().numpy()

    return np.array(indexes)[(-logits).argsort()]

First, the 10 most similar issue-comment pairs for each query are retrieved from the Faiss index.
Then, the reranker reorders them by decreasing relevance score.
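
The reordering at the end of `rerank` can be looked at in isolation. In this sketch the candidate indexes and logits are made up, standing in for the first-stage results and the reranker's outputs, and `order_by_score` is a hypothetical name.

```python
import numpy as np


def order_by_score(indexes, logits):
    """Return candidate indexes sorted from highest to lowest logit."""
    indexes = np.asarray(indexes)
    logits = np.asarray(logits, dtype=np.float32)
    return indexes[(-logits).argsort()]


candidates = [11, 42, 7]   # hypothetical corpus indexes from the first-stage search
logits = [0.3, 2.1, -1.5]  # hypothetical reranker outputs, one per candidate
ranked = order_by_score(candidates, logits)

# A sigmoid maps logits to (0, 1) without changing the order.
probs = 1 / (1 + np.exp(-np.asarray(logits)))

print(ranked)
```

Since only the order matters here, the raw logits are enough; the sigmoid would be useful mainly for thresholding or displaying scores.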

In [None]:
metrics, faiss_result_indexes = faiss_index.search(q_embeddings, k=10)

In [None]:
reranked_indexes = np.array(
    [
        rerank(q, corpus, i, rerank_tokenizer, reranker, context_len=4096)
        for q, i in zip(queries, faiss_result_indexes)
    ]
)

For our examples, the final top-1 results are mostly unchanged,
but the reranked results for questions 1 and 3 are arguably more complete.

In [None]:
faiss_result_indexes

array([[25352, 25339, 25344, 25340, 25351, 25350, 25345, 25343, 25346,
        25341],
       [ 2833,  2832, 22932, 25560, 20466, 25559, 20467, 20468, 20469,
        20464],
       [ 2230,  2231,  6440,  6458,  2229,  6457,  6455, 16575,  6453,
         6437],
       [18026, 18024, 18025, 23750,  4664, 12986, 13010, 16315,  4665,
        13004],
       [24623, 24624, 33204,  8035,  4781, 25094, 20774,  8034,  3762,
        15749]])

In [None]:
reranked_indexes

array([[25339, 25352, 25340, 25344, 25341, 25350, 25351, 25346, 25345,
        25343],
       [ 2833,  2832, 25559, 20466, 25560, 20467, 20464, 22932, 20468,
        20469],
       [ 6458,  6440,  6453,  6457,  6455,  6437,  2229,  2230,  2231,
        16575],
       [18026, 18024, 18025,  4665,  4664, 23750, 16315, 12986, 13010,
        13004],
       [24623, 25094, 15749, 24624,  4781,  8034,  3762,  8035, 33204,
        20774]])
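
A quick way to spot which queries got a different top-1 result after reranking is to compare the first columns of the two arrays. The values below are copied from the outputs above.

```python
# First column of each array printed above, one entry per query.
faiss_top1 = [25352, 2833, 2230, 18026, 24623]
reranked_top1 = [25339, 2833, 6458, 18026, 24623]

# 1-based question numbers whose top-1 result changed.
changed = [
    n
    for n, (before, after) in enumerate(zip(faiss_top1, reranked_top1), start=1)
    if before != after
]
print(changed)
```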

In [None]:
display_query_results(queries, corpus_df, reranked_indexes[:, 0].reshape(-1).tolist())

Query: What is the difference between `uv pip install` and `uv add`?

Result URL: https://github.com/astral-sh/uv/issues/9219#issuecomment-2485573603

Result:
Issue 9219: What's the difference between `uv pip install` and `uv add`

[COMMENT]
I've been using `uv` for a while and I really enjoy it. I keep stumbling onto the same confusion though. I never quite know whether do to:
```sh
uv init
uv venv
uv add polars marimo
uv run hello.py
```
or
```sh
uv init
uv venv
source .venv/bin/activate
pip install polars marimo
python hello.py
```
are these two above equivalent?
---
also are these two equivalent?
```
uv add polars
```
```
uv pip install polars
```

Comment:
``uv add`` choose universal or cross-platform dependencies , and ``uv add`` is a project API.
https://docs.astral.sh/uv/concepts/projects/
This is my understanding, but the more correct interpretation should be based on the documentation and the uv team's explanation.
> Suppose a dependency has versions 1.0.0 and 1.1.0 on Window

[1]: https://huggingface.co/learn/nlp-course/en/chapter5/5
[2]: make-huggingface-dataset-of-github-repo-issues.md
[3]: https://huggingface.co/BAAI/bge-m3
[4]: https://huggingface.co/FacebookAI/xlm-roberta-large
[5]: https://github.com/facebookresearch/faiss
[6]: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#if-below-1m-vectors-ivfk
[7]: https://huggingface.co/BAAI/bge-reranker-v2-m3