# Multi X-Science Dataset

- Paper: https://arxiv.org/abs/2010.14235
- Repo: https://github.com/yaolu/Multi-XScience

Source format: JSON files

Uncompressed sizes:

```
 28M test.json
166M train.json
 28M val.json

Total: 221M
```


| Property | Value |
| --- | --- |
| # train examples | 30,369 |
| # val examples | 5,066 |
| # test examples | 5,093 |
| Mean document length (words) | 778.08 |
| Mean summary length (words) | 116.44 |
| Mean # refs per document | 4.42 |

In [8]:
import random
from pathlib import Path
import json
from typing import Dict
import re
import pandas as pd

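# Alternative header sets for concatenating the query abstract with the abstracts of the referenced papers.
# In 'reference_abs', the placeholder '@cite' is replaced by the reference id unless anonymize_refs is set.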

section_headers_empty = {
    'query_abs': '',
    'reference_abs': ''
}

section_headers_newline = {
    'query_abs': '',
    'reference_abs': '\n'
}

section_headers_minimal = {
    'query_abs': '',
    'reference_abs': '\n@cite: '
}

section_headers_marker = {
    'query_abs': '### Abstract of query paper ###\n',
    'reference_abs': '\n### Abstract of @cite ###\n'
}

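def convert_dataset(
    fn: Path,
    output_dir: Path,
    output_prefix: str,
    section_headers: Dict[str, str],
    anonymize_refs: bool = False,
    compression: str = "snappy",
):
    """Convert one Multi-XScience JSON split into a parquet file with text, summary and provenance columns."""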
    text, summary, provenance = [], [], []

    missing_ref_abs = 0

    with fn.open('r', encoding='utf-8') as f:
        entries = json.load(f)
        for entry in entries:
            abstract = entry['abstract']
            related_work = entry['related_work']
            refs = entry['ref_abstract']
            
            sb = [section_headers['query_abs'], abstract]
            for rid, ref_data in refs.items():
                ref_abstract = ref_data['abstract']
                if len(ref_abstract.strip()) == 0:
                    missing_ref_abs += 1

                ref_header = section_headers['reference_abs']
                if not anonymize_refs:
                    ref_header = ref_header.replace('@cite', rid)
                sb.append(ref_header)
                sb.append(ref_abstract)

            if anonymize_refs:
                related_work = re.sub(r'@cite_[0-9]+', r'@cite', related_work)

            text.append(''.join(sb))
            summary.append(related_work)

            # aid: arxiv id (e.g. 2010.14235)
            # mid: microsoft academic graph id
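            # use the function argument here; a bare `prefix` would only resolve through a notebook global
            provenance.append(json.dumps({'src': output_prefix, 'aid': entry['aid'], 'mid': entry['mid']}))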

    fn = f'{output_prefix}.{compression}.parquet'
    fn = output_dir / fn
    text_, summary_, provenance_ = map(lambda x: pd.array(x, dtype="string"), (text, summary, provenance))
    df = pd.DataFrame({"text": text_, "summary": summary_, "provenance": provenance_})
    print(f'writing: {fn} (entries: {len(entries)}; missing ref abstracts: {missing_ref_abs})')
    df.to_parquet(
        fn, 
        engine="pyarrow",
        compression=compression
    )

dataset_dir = Path('../Multi-XScience/data')
output_dir = Path('./data/multixscience/')
dataset_file_names = list(dataset_dir.glob('*.json'))
# build a list rather than a one-shot map iterator, so the conversion cell can be re-run
output_prefixes = ['multixscience_' + fn.stem for fn in dataset_file_names]
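
To make the effect of the header options concrete, here is a minimal sketch that mirrors the concatenation loop of `convert_dataset` on a toy entry. The helper `build_text` and the entry `toy_entry` are hypothetical names introduced only for this illustration; the real entries carry much longer abstracts.

```python
from typing import Dict

# Hypothetical toy entry (not taken from the dataset).
toy_entry = {
    "abstract": "Query abstract.",
    "ref_abstract": {
        "@cite_1": {"abstract": "First reference."},
        "@cite_2": {"abstract": "Second reference."},
    },
}


def build_text(entry: dict, section_headers: Dict[str, str], anonymize_refs: bool = False) -> str:
    """Join the query abstract and all reference abstracts, one header per part."""
    parts = [section_headers["query_abs"], entry["abstract"]]
    for rid, ref_data in entry["ref_abstract"].items():
        header = section_headers["reference_abs"]
        if not anonymize_refs:
            # substitute the concrete reference id for the '@cite' placeholder
            header = header.replace("@cite", rid)
        parts.append(header)
        parts.append(ref_data["abstract"])
    return "".join(parts)


# Two of the header sets defined above, repeated here so the sketch is self-contained.
minimal = {"query_abs": "", "reference_abs": "\n@cite: "}
marker = {
    "query_abs": "### Abstract of query paper ###\n",
    "reference_abs": "\n### Abstract of @cite ###\n",
}

text_minimal = build_text(toy_entry, minimal)
text_marker = build_text(toy_entry, marker)
print(text_minimal)
print("---")
print(text_marker)
```

With the minimal headers each reference abstract starts on its own line, prefixed by its citation id, which keeps the link to the `@cite_N` markers in the summary.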


In [9]:
# run conversion:
output_dir.mkdir(parents=True, exist_ok=True)
for fn, prefix in zip(dataset_file_names, output_prefixes):
    convert_dataset(fn, output_dir, prefix, section_headers_minimal, anonymize_refs=False)

writing: data/multixscience/multixscience_train.snappy.parquet (entries: 30369; missing ref abstracts: 20023)
writing: data/multixscience/multixscience_val.snappy.parquet (entries: 5066; missing ref abstracts: 3383)
writing: data/multixscience/multixscience_test.snappy.parquet (entries: 5093; missing ref abstracts: 3403)
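
The run above keeps the original citation markers (`anonymize_refs=False`). As a rough sketch of what the anonymized variant would do to a summary, the snippet below applies the same regular expression used in `convert_dataset` to a shortened sample sentence. The function name `anonymize_citations` and the sample string are assumptions made for this illustration.

```python
import re


def anonymize_citations(related_work: str) -> str:
    """Collapse numbered markers such as '@cite_15' into the generic '@cite'."""
    return re.sub(r"@cite_[0-9]+", "@cite", related_work)


# Shortened sample in the style of a related-work paragraph.
sample = "Some work @cite_15 has focused on agents. For example, @cite_6 and @cite_8 show this."
anonymized = anonymize_citations(sample)
print(anonymized)
```

Note that with `anonymize_refs=True` the reference headers in the text also keep the literal `@cite`, so a summary marker can no longer be matched to a specific reference abstract.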


In [7]:
# check files in output dir

def mean_strlen(col: pd.Series):
    return col.apply(len).mean()

def count_words(s):
    return sum(1 for w in s.split(' ') if len(w) > 0)

def mean_wordcount(col: pd.Series):
    return col.apply(count_words).mean()

show_random_entry = False
for fn in output_dir.glob('*.parquet'):
    df = pd.read_parquet(path=str(fn), engine='pyarrow')
    text, summary = df["text"], df["summary"]
    print(f'file: "{fn.name}"; rows: {len(df)}; mean_wordcount: {{ text: {mean_wordcount(text):.2f}; summary: {mean_wordcount(summary):.2f}; }}; mean_strlen: {{ text: {mean_strlen(text):.2f}; summary: {mean_strlen(summary):.2f} }};')

    if show_random_entry:
        i = random.randint(0, len(df)-1)
        print(f'Random entry #{i}:')
        print('### text:', df.loc[i]['text'])
        print('### summary:', df.loc[i]['summary'])
        print()


file: "multixscience_train.snappy.parquet"; rows: 30369; mean_wordcount: { text: 700.62; summary: 105.89; }; mean_strlen: { text: 4754.78; summary: 699.99 };
file: "multixscience_val.snappy.parquet"; rows: 5066; mean_wordcount: { text: 700.02; summary: 104.43; }; mean_strlen: { text: 4747.70; summary: 690.17 };
file: "multixscience_test.snappy.parquet"; rows: 5093; mean_wordcount: { text: 690.02; summary: 105.77; }; mean_strlen: { text: 4671.70; summary: 697.67 };
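
One detail to keep in mind when reading these word counts: `count_words` splits on the space character only, so two tokens separated by a newline count as one word. Since the reference headers used in the conversion begin with `\n`, the reported text counts can come out lower than a whitespace-based count. The sketch below compares both approaches on a made-up string; `count_words_whitespace` is a hypothetical alternative, not part of the notebook.

```python
def count_words_space(s: str) -> int:
    """Same logic as count_words above: split on single spaces, drop empty tokens."""
    return sum(1 for w in s.split(' ') if len(w) > 0)


def count_words_whitespace(s: str) -> int:
    """Hypothetical alternative: split on any run of whitespace, including newlines."""
    return len(s.split())


# Made-up text with a newline between two abstracts, as produced by the minimal headers.
sample = "First abstract ends here.\n@cite_1: Second abstract starts."

space_count = count_words_space(sample)
whitespace_count = count_words_whitespace(sample)
print(space_count, whitespace_count)
```

Here `here.\n@cite_1:` is a single token for the space-based count but two tokens for the whitespace-based one.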


# Example raw source dataset entry:

```
[
  {
    "aid": "cs9809108",
    "mid": "2949225035",
    "abstract": "We present our approach to the problem of how an agent, within an economic Multi-Agent System, can determine when it should behave strategically (i.e. learn and use models of other agents), and when it should act as a simple price-taker. We provide a framework for the incremental implementation of modeling capabilities in agents, and a description of the forms of knowledge required. The agents were implemented and different populations simulated in order to learn more about their behavior and the merits of using and learning agent models. Our results show, among other lessons, how savvy buyers can avoid being cheated'' by sellers, how price volatility can be used to quantitatively predict the benefits of deeper models, and how specific types of agent populations influence system behavior.",
    "related_work": "Within the MAS community, some work @cite_15 has focused on how artificial AI-based learning agents would fare in communities of similar agents. For example, @cite_6 and @cite_8 show how agents can learn the capabilities of others via repeated interactions, but these agents do not learn to predict what actions other might take. Most of the work in MAS also fails to recognize the possible gains from using explicit agent models to predict agent actions. @cite_9 is an exception and gives another approach for using nested agent models. However, they do not go so far as to try to quantify the advantages of their nested models or show how these could be learned via observations. We believe that our research will bring to the foreground some of the common observations seen in these research areas and help to clarify the implications and utility of learning and using nested agent models.",
    "ref_abstract": {
      "@cite_9": {
        "mid": "1528079221",
        "abstract": "In multi-agent environments, an intelligent agent often needs to interact with other individuals or groups of agents to achieve its goals. Agent tracking is one key capability required for intelligent interaction. It involves monitoring the observable actions of other agents and inferring their unobserved actions, plans, goals and behaviors. This article examines the implications of such an agent tracking capability for agent architectures. It specifically focuses on real-time and dynamic environments, where an intelligent agent is faced with the challenge of tracking the highly flexible mix of goal-driven and reactive behaviors of other agents, in real-time. The key implication is that an agent architecture needs to provide direct support for flexible and efficient reasoning about other agents' models. In this article, such support takes the form of an architectural capability to execute the other agent's models, enabling mental simulation of their behaviors. Other architectural requirements that follow include the capabilities for (pseudo-) simultaneous execution of multiple agent models, dynamic sharing and unsharing of multiple agent models and high bandwidth inter-model communication. We have implemented an agent architecture, an experimental variant of the Soar integrated architecture, that conforms to all of these requirements. Agents based on this architecture have been implemented to execute two different tasks in a real-time, dynamic, multi-agent domain. The article presents experimental results illustrating the agents' dynamic behavior."
      },
      "@cite_15": {
        "mid": "2156109180",
        "abstract": "I. Introduction, 488. \u2014 II. The model with automobiles as an example, 489. \u2014 III. Examples and applications, 492. \u2014 IV. Counteracting institutions, 499. \u2014 V. Conclusion, 500."
      },
      "@cite_6": {
        "mid": "1591263692",
        "abstract": "The long-term goal of our field is the creation and understanding of intelligence. Productive research in AI, both practical and theoretical, benefits from a notion of intelligence that is precise enough to allow the cumulative development of robust systems and general results. This paper outlines a gradual evolution in our formal conception of intelligence that brings it closer to our informal conception and simultaneously reduces the gap between theory and practice."
      },
      "@cite_8": {
        "mid": "",
        "abstract": ""
      }
    }
  }
]
```
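
The entry above contains a reference (`@cite_8`) whose abstract is empty, which is what the conversion counts as a missing reference abstract. The following sketch finds such references in a single entry. It uses a shortened stand-in for the entry shown above (abstracts abbreviated, two of the four references kept), and `missing_reference_ids` is a hypothetical helper name.

```python
import json

# Shortened stand-in for the entry shown above; the abstract texts are abbreviated.
raw = """
[
  {
    "aid": "cs9809108",
    "mid": "2949225035",
    "abstract": "We present our approach ...",
    "related_work": "Within the MAS community, some work @cite_15 ...",
    "ref_abstract": {
      "@cite_9": {"mid": "1528079221", "abstract": "In multi-agent environments ..."},
      "@cite_8": {"mid": "", "abstract": ""}
    }
  }
]
"""


def missing_reference_ids(entry: dict) -> list:
    """Return the ids of all references whose abstract is empty or whitespace only."""
    return [
        rid
        for rid, ref in entry["ref_abstract"].items()
        if len(ref["abstract"].strip()) == 0
    ]


entries = json.loads(raw)
missing = missing_reference_ids(entries[0])
print(missing)
```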