In [None]:
import json
import os

from bertopic import BERTopic
from dotenv import load_dotenv
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

from citegeist.utils.azure_client import AzureClient
from citegeist.utils.citations import (
    find_most_relevant_pages,
    get_arxiv_abstract,
    get_arxiv_citation,
    process_arxiv_paper_with_embeddings,
)
from citegeist.utils.helpers import load_api_key
from citegeist.utils.long_to_short import extract_most_relevant_pages
from citegeist.utils.prompts import (
    generate_related_work_prompt,
    generate_summary_prompt_with_page_content,
)

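# Read the AZURE_* and KEY_LOCATION settings from a local .env file.
load_dotenv()

# Topic model and sentence encoder applied to the input abstract.
topic_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")
embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
# Local Milvus database file that is searched for similar abstracts.
client = MilvusClient("./database.db")
# LLM client used for the summary and related-work prompts.
prompting_client = AzureClient(
    endpoint=os.getenv("AZURE_ENDPOINT"),
    deployment_id=os.getenv("AZURE_PROMPTING_MODEL"),
    api_key=load_api_key(os.getenv("KEY_LOCATION")),
)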

In [None]:
abstract = "Large Language Models have shown impressive performance across a wide array of tasks involving both structured and unstructured textual data. More recently, adaptions of these models have drawn attention to their abilities to work with code across different programming languages. On this notion, different benchmarks for code generation, repair, or completion suggest that certain models have programming abilities comparable to or even surpass humans. In this work, we demonstrate that the performance on this benchmark does not translate to the innate ability of humans to appreciate the structural control flow of code. For this purpose, we extract code solutions from the HumanEval benchmark, which the relevant models perform very strongly on, and trace their execution path using function calls sampled from the respective test set. Using this dataset, we investigate the ability of 5 state-of-the-art LLMs to match the execution trace and find that, despite the model’s abilities to generate semantically identical code, they possess only limited ability to trace the execution path, especially for traces with increased length. We find that even the top-performing model, Gemini 1.5 Pro can only fully correctly generate the trace of 47% of HumanEval tasks. In addition, we introduce a specific subset for three key structures not, or only contained to a limited extent in HumanEval: Recursion, Parallel Processing, and Object Oriented Programming principles, including concepts like Inheritance and Polymorphism. Besides OOP, we show that none of the investigated models achieve an average accuracy of over 5% on the relevant traces. Aggregating these specialized parts with the ubiquitous HumanEval tasks, we present the Benchmark CoCoNUT: Code Control Flow for Navigation Understanding and Testing, which measures a models ability to trace the execution of code upon relevant calls, including advanced structural components. We conclude that the current generation LLMs still need to significantly improve to enhance their code reasoning abilities. We hope our dataset can help researchers bridge this gap in the near future."
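# Embed the abstract for the vector search and assign it a BERTopic topic.
embedded_abstract = embedding_model.encode(abstract)
topic = topic_model.transform(abstract)
# transform() returns (topics, probabilities); take the first predicted topic.
# topic_id is only needed if the topic filter in the search call is enabled.
topic_id = topic[0][0]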

res = client.search(
    collection_name="abstracts",
    data=[embedded_abstract],
    limit=30,
    anns_field="embedding",
    # filter = f'topic == {topic_id}',
    search_params={"metric_type": "COSINE", "params": {}},
    # output_fields = []
)
formatted_res = json.dumps(res, indent=4)
print(formatted_res)
print(len(res[0]))
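
For intuition, the `COSINE` metric in the search above ranks stored vectors by their cosine similarity to the query embedding. The sketch below reproduces that ranking by brute force with NumPy on toy data. The function `cosine_top_k` and the toy vectors are hypothetical and not part of citegeist or pymilvus, and Milvus may use an index rather than an exhaustive scan.

```python
import numpy as np


def cosine_top_k(query, vectors, k):
    """Return (index, similarity) pairs for the k vectors most similar to query."""
    query = np.asarray(query, dtype=float)
    vectors = np.asarray(vectors, dtype=float)
    # Normalise both sides so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q
    # Sort by descending similarity and keep the first k indices.
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]


# Toy vectors standing in for stored abstract embeddings (hypothetical values).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = cosine_top_k([1.0, 0.1], toy_vectors, k=2)
print(top)
```

With real data, `query` would play the role of `embedded_abstract` and `k` the role of `limit`.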

In [4]:
# If the input paper is already in the arXiv corpus, the top hit is the paper itself.
# In that case, drop it by using `res = res[0][1:]` instead of the line below.

res = res[0]
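
Slicing off the first hit assumes the input paper is always ranked first. A slightly more defensive sketch filters by ID instead. The helper `drop_self_match`, the argument `input_arxiv_id`, and the example hits are hypothetical; the only assumption about the real results is that each hit carries an `id` key, which the later cells also rely on.

```python
def drop_self_match(hits, input_arxiv_id):
    """Return the hits without the entry whose id equals the input paper's ID."""
    if input_arxiv_id is None:
        # The input paper is not in the corpus, so there is nothing to remove.
        return list(hits)
    return [hit for hit in hits if hit["id"] != input_arxiv_id]


# Hypothetical hits shaped like the search results used in this notebook.
example_hits = [
    {"id": "0000.00001", "distance": 1.0},
    {"id": "0000.00002", "distance": 0.8},
]
filtered = drop_self_match(example_hits, "0000.00001")
print([hit["id"] for hit in filtered])
```

Passing `None` leaves the list unchanged, which matches the case of an abstract that has not yet been published on arXiv.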

In [5]:
paper_embeddings = []
for paper in res:
    arxiv_id = paper["id"]  # arXiv ID of the retrieved paper

    print(f"Processing paper: {arxiv_id}")
    result = process_arxiv_paper_with_embeddings(arxiv_id, topic_model)

    if result:
        paper_embeddings.append(result)
        print(f"Paper {arxiv_id}: Processed successfully.")
    else:
        print(f"Paper {arxiv_id}: No content remains after filtering.")

# Print an example: First page text and embedding of the first processed paper
if paper_embeddings:
    print("First paper, first page text:", paper_embeddings[0][0]["text"])
    print("First paper, first page embedding:", paper_embeddings[0][0]["embedding"])

relevant_pages = extract_most_relevant_pages(
    paper_embeddings, abstract, topic_model, 60
)

Processing paper: 2408.10718
PDF downloaded successfully: 2408.10718.pdf
Paper 2408.10718: Processed successfully.
Processing paper: 2408.13001
PDF downloaded successfully: 2408.13001.pdf
Paper 2408.13001: Processed successfully.
Processing paper: 2410.21647
PDF downloaded successfully: 2410.21647.pdf
Paper 2410.21647: Processed successfully.
Processing paper: 2309.15432
PDF downloaded successfully: 2309.15432.pdf
Paper 2309.15432: Processed successfully.
Processing paper: 2407.11470
PDF downloaded successfully: 2407.11470.pdf
Paper 2407.11470: Processed successfully.
Processing paper: 2407.19055
PDF downloaded successfully: 2407.19055.pdf
Paper 2407.19055: Processed successfully.
Processing paper: 2311.08588
PDF downloaded successfully: 2311.08588.pdf
Paper 2311.08588: Processed successfully.
Processing paper: 2402.08699
PDF downloaded successfully: 2402.08699.pdf
Paper 2402.08699: Processed successfully.
Processing paper: 2406.15877
PDF downloaded successfully: 2406.15877.pdf
Paper 2406.15877: Processed successfully.
Processing paper: 2403.04811
PDF downloaded successfully: 2403.04811.pdf
Paper 2403.04811: Processed successfully.
Processing paper: 2403.16437
PDF downloaded successfully: 2403.16437.pdf
Paper 2403.16437: Processed successfully.
Processing paper: 2305.15507
PDF downloaded successfully: 2305.15507.pdf
Paper 2305.15507: Processed successfully.
Processing paper: 2310.08992
PDF downloaded successfully: 2310.08992.pdf
Paper 2310.08992: Processed successfully.
Processing paper: 2403.13583
PDF downloaded successfully: 2403.13583.pdf
Paper 2403.13583: Processed successfully.
Processing paper: 2305.12138
PDF downloaded successfully: 2305.12138.pdf
Paper 2305.12138: Processed successfully.
Processing paper: 2305.04087
PDF downloaded successfully: 2305.04087.pdf
Paper 2305.04087: Processed successfully.
Processing paper: 2406.08731
PDF downloaded successfully: 2406.08731.pdf
Paper 2406.08731: Processed successfully.
Processing paper: 2403.19114
PDF downloaded successfully: 2403.19114.pdf
Paper 2403.19114: Processed successfully.
Processing paper: 2407.06153
PDF downloaded successfully: 2407.06153.pdf
MuPDF error: syntax error: could not parse color space (311 0 R)
MuPDF error: syntax error: could not parse color space (633 0 R)
Paper 2407.06153: Processed successfully.
Processing paper: 2405.00253
PDF downloaded successfully: 2405.00253.pdf
Paper 2405.00253: Processed successfully.
Processing paper: 2410.13187
PDF downloaded successfully: 2410.13187.pdf
Paper 2410.13187: Processed successfully.
Processing paper: 2306.09896
PDF downloaded successfully: 2306.09896.pdf
Paper 2306.09896: Processed successfully.
Processing paper: 2309.01940
PDF downloaded successfully: 2309.01940.pdf
Paper 2309.01940: Processed successfully.
Processing paper: 2408.15658
PDF downloaded successfully: 2408.15658.pdf
Paper 2408.15658: Processed successfully.
Processing paper: 2307.13383
PDF downloaded successfully: 2307.13383.pdf
Paper 2307.13383: Processed successfully.
Processing paper: 2402.09664
PDF downloaded successfully: 2402.09664.pdf
Paper 2402.09664: Processed successfully.
Processing paper: 2311.09635
PDF downloaded successfully: 2311.09635.pdf
Paper 2311.09635: Processed successfully.
Processing paper: 2405.04520
PDF downloaded successfully: 2405.04520.pdf
Paper 2405.04520: Processed successfully.
Processing paper: 2410.01999
PDF downloaded successfully: 2410.01999.pdf
Paper 2410.01999: Processed successfully.
Processing paper: 2405.11430
PDF downloaded successfully: 2405.11430.pdf
Paper 2405.11430: Processed successfully.
First paper, first page text: CodeJudge-Eval: Can Large Language Models be Good Judges in
Code Understanding?
♢Yuwei Zhao* , ♠Ziyang Luo* , ♡Yuchen Tian , ♠Hongzhan Lin
♣Weixiang Yan , ♢Annan Li , ♠Jing Ma†
♠Hong Kong Baptist University, ♢Beihang University
♡University of Tokyo, ♣Vaneval.AI
{yuweizhao,liannan}@buaa.edu.cn
{cszyluo,majing}@comp.hkbu.edu.hk
Abstract
Recent advancements in large language models
(LLMs) have showcased impressive code gener-
ation capabilities, primarily evaluated through
language-to-code benchmarks. However, these
benchmarks may not fully capture a model’s
code understanding abilities.
We introduce
CodeJudge-Eval (CJ-Eval), a novel bench-
mark designed to assess LLMs’ code under-
standing abilities from the perspective of code
judging rather than code generation. CJ-Eval
challenges models to determine the correctness
of provided code solutions, encompassing var-
ious error types and compilation issues. By
leveraging

In [11]:
relevant_pages[4]["text"]

'CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models\n111:3\nand discuss their potential limitations as follows: (1) Existing benchmarks only focus on a single\nprogramming task for LLMs within uni-lingual test cases (i.e., English), which makes the evaluation\nincomprehensive. (2) Existing benchmarks, excluding DebugBench [50], generally lack fine-grained\ncategorization over the test data and human-expert evaluation, which are crucial to derive deeper\ninsights and analysis for different aspects of programming, as well as the thorough comparison\nbetween LLMs and human-level abilities.\nTo this end, we propose the CodeApex benchmark, which gives a comprehensive code and\nprogramming evaluation of large language models. As shown in Table 1, CodeApex is a pioneering\nbilingual (English and Chinese) programming benchmark over three different code-related tasks\n(i.e., programming comprehension, code generation, and code correction) with fine-grained cate-\ng

In [7]:
abstracts = [get_arxiv_abstract(obj["id"]) for obj in res]
top_relevant_pages = find_most_relevant_pages(relevant_pages, abstracts, 10)

In [25]:
for key, obj in top_relevant_pages.items():
    arxiv_id = res[key]["id"]
    arxiv_abstract = obj["abstract"]
    text_segments = obj["text"]
    response: str = prompting_client.get_completions(
        generate_summary_prompt_with_page_content(
            abstract, arxiv_abstract, text_segments
        ),
        os.getenv("AZURE_PROMPTING_MODEL_VERSION"),
    )
    obj["summary"] = response
    obj["citation"] = get_arxiv_citation(arxiv_id)

In [31]:
data = list(top_relevant_pages.values())
print(generate_related_work_prompt(abstract, data))


    I am working on a research paper, and I need a well-written "Related Work" section. Below I'm providing you with the abstract of the paper I'm writing and a list of summaries of related works I've identified.
    
    Here's the abstract of my paper:
    "Large Language Models have shown impressive performance across a wide array of tasks involving both structured and unstructured textual data. More recently, adaptions of these models have drawn attention to their abilities to work with code across different programming languages. On this notion, different benchmarks for code generation, repair, or completion suggest that certain models have programming abilities comparable to or even surpass humans. In this work, we demonstrate that the performance on this benchmark does not translate to the innate ability of humans to appreciate the structural control flow of code. For this purpose, we extract code solutions from the HumanEval benchmark, which the relevant models perform ver

In [32]:
response: str = prompting_client.get_completions(
    generate_related_work_prompt(abstract, data),
    os.getenv("AZURE_PROMPTING_MODEL_VERSION"),
)
print(response)

In recent years, the evaluation of large language models (LLMs) in the domain of code understanding and generation has garnered significant attention. A common theme across several studies is the critique of existing benchmarks, such as HumanEval, for their inadequacy in capturing the full spectrum of LLMs' code reasoning abilities. For instance, Allamanis et al. (2024) introduce CodeJudge-Eval, a benchmark that assesses LLMs' ability to judge code correctness, highlighting the limitations of traditional benchmarks in evaluating code reasoning capabilities. This aligns with our research, which emphasizes the need for more comprehensive evaluation methods to assess LLMs' understanding of code execution paths, particularly in complex scenarios. The concept of round-trip correctness (RTC) proposed by Allamanis et al. offers an unsupervised evaluation approach that could potentially be adapted to assess execution tracing capabilities, a core focus of our study.

Similarly, Liang et al. (20

In [33]:
print([obj["citation"] for obj in top_relevant_pages.values()])

['Miltiadis Allamanis, Sheena Panthaplackel, Pengcheng Yin (2024). Unsupervised Evaluation of Code LLMs with Round-Trip Correctness. arXiv:2402.08699. https://arxiv.org/abs/2402.08699', "Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan (2024). Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'. arXiv:2410.21647. https://arxiv.org/abs/2410.21647", "Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, Muling Wu, Mingxu Chai, Jessica Fan, Caishuang Huang, Yunbo Tao, Yan Liu, Enyu Zhou, Ming Zhang, Yuhao Zhou, Yueming Wu, Rui Zheng, Ming Wen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang (2024). What's Wrong with Your Code Generated by Large Language Models? An Extensive Study. arXiv:2407.06153. https://arxiv.org/abs/2407.06153", 'Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li, Jing Ma (2024). CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?. arXiv:2408.10718. https:/