The original dataset is multilabel: a paper that belongs to several categories appears as several rows with the same title, one row per category. This script combines those repeated rows into a single entry per title that carries all of its categories.

This notebook reads the dataset, groups papers by exact title, aggregates each title's category values into a deduplicated list, keeps the first occurrence of link, authors, year, and abstract, and writes the result to combined_data_original.csv.


In [1]:
import pandas as pd

In [14]:
df = pd.read_json("../data/updated_papers_data.json")
df.head(2)

Unnamed: 0,title,category,link,authors,year,abstract
0,Llm+ p: Empowering large language models with ...,language-translation,https://arxiv.org/abs/2304.11477,"Liu, Bo and Jiang, Yuqian and Zhang, Xiaohan a...",2023,Large language models (LLMs) have demonstrated...
1,Translating natural language to planning goals...,language-translation,https://arxiv.org/abs/2302.05128,"Xie, Yaqi and Yu, Chen and Zhu, Tongyao and Ba...",2023,Recent large language models (LLMs) have demon...


In [None]:
# Group by 'title' and aggregate the 'category' column, joining multiple categories into a list
combined_df = (
    df.groupby("title")
    .agg(
        {
            "category": lambda x: list(set(x)),  # Deduplicate via a set, then convert back to a list
            "link": "first",  # Keep the first occurrence of link for each title
            "authors": "first",  # Keep the first occurrence of authors for each title
            "year": "first",  # Keep the first occurrence of year for each title
            "abstract": "first",  # Keep the first occurrence of abstract for each title
        }
    )
    .reset_index()
)
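
The aggregation above can be checked on a small example. The following is a minimal sketch on hypothetical toy rows (the titles, links, and the `toy` / `toy_combined` names are made up for illustration and are not from the dataset). It uses `sorted(set(...))` instead of `list(set(...))`, which is a swapped-in variant so that the category order is deterministic.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset
toy = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", "Paper A"],
        "category": ["plan-generation", "tool-integration", "interactive-planning"],
        "link": ["https://example.org/a", "https://example.org/b", "https://example.org/a"],
    }
)

toy_combined = (
    toy.groupby("title")
    .agg(
        {
            # sorted() is an assumption made here for a stable order;
            # the notebook itself uses list(set(x))
            "category": lambda x: sorted(set(x)),
            "link": "first",
        }
    )
    .reset_index()
)

print(toy_combined)
```

"Paper A" appears twice in the input and once in the result, with both of its categories collected into one list.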

In [16]:
combined_df.category.value_counts()

category
[plan-generation]                                                 41
[interactive-planning]                                            15
[language-translation]                                            13
[model-construction]                                              10
[tool-integration]                                                 8
[brain-inspired-planning]                                          5
[multiagent-planning]                                              5
[plan-generation, interactive-planning]                            5
[language-translation, model-construction]                         5
[heuristics-optimization]                                          4
[language-translation, plan-generation]                            3
[plan-generation, heuristics-optimization]                         2
[language-translation, interactive-planning]                       1
[language-translation, heuristics-optimization]                    1
[plan-generation, model-c
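
The counts above are per combination of categories. Because the lists come from `set()`, the same combination may in principle appear in more than one order. If per-category counts are wanted instead, one option is to explode the list column first. A minimal sketch, assuming a hypothetical stand-in frame (`toy_combined`) with the same `category` layout as `combined_df`:

```python
import pandas as pd

# Hypothetical stand-in for combined_df; values are illustrative only
toy_combined = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", "Paper C"],
        "category": [
            ["plan-generation", "interactive-planning"],
            ["plan-generation"],
            ["tool-integration"],
        ],
    }
)

# explode() yields one row per (title, category) pair, so value_counts()
# then counts individual categories rather than list combinations
per_category = toy_combined["category"].explode().value_counts()

print(per_category)
```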

In [None]:
combined_df.to_csv("combined_data_original.csv", index=False)
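
Note that CSV has no list type, so the `category` lists are written as their string representation. A minimal sketch of reading them back as lists, using an in-memory buffer and a hypothetical one-row frame in place of the real file:

```python
import ast
import io

import pandas as pd

# Hypothetical one-row stand-in for combined_df
toy_combined = pd.DataFrame(
    {
        "title": ["Paper A"],
        "category": [["plan-generation", "interactive-planning"]],
    }
)

# Write to an in-memory buffer instead of combined_data_original.csv
buffer = io.StringIO()
toy_combined.to_csv(buffer, index=False)
buffer.seek(0)

# ast.literal_eval parses the stored string "['a', 'b']" back into a list
reloaded = pd.read_csv(buffer, converters={"category": ast.literal_eval})

print(type(reloaded.loc[0, "category"]))
```

Without the converter, `category` would load as plain strings.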