Every reference representation of the data at rest must contain the same set of DOIs, and that set must include every DOI cited by the query dataset.

In [None]:
import pandas as pd

PATH_TO_QUERY_DATASET = "data/dataset/nontrivial_filtered.jsonl"
PATH_TO_CHUNKS = "data/research_chunks.jsonl"
PATH_TO_CONTRIBUTIONS = "data/research_contributions.jsonl"
# If you make additional document expansions, put their paths here


Get the set of DOIs from each representation and make sure the sets are equal.

In [3]:
chunks = pd.read_json(PATH_TO_CHUNKS, lines=True)
chunk_dois = set(chunks['doi'])

print(f"Number of distinct doi's cited by chunk dataset: {len(chunk_dois)}")

Number of distinct doi's cited by chunk dataset: 9989


In [4]:
contributions = pd.read_json(PATH_TO_CONTRIBUTIONS, lines=True)
contribution_dois = set(contributions['doi'])
print(f"Number of distinct doi's cited by contribution dataset: {len(contribution_dois)}")

Number of distinct doi's cited by contribution dataset: 9989


In [5]:
chunk_dois == contribution_dois

True
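
If more document expansions are added later, the pairwise comparison above could be generalized to any number of DOI sets. The following is only a sketch: the helper name `mismatched_doi_sets`, the `"summaries"` representation, and the toy DOIs are hypothetical and not part of this notebook's data.

```python
def mismatched_doi_sets(named_sets):
    """Return the names of all sets that differ from the first one.

    `named_sets` maps a representation name to its set of DOIs.
    An empty result means every representation holds the same DOIs.
    """
    items = list(named_sets.items())
    _, baseline = items[0]
    return [name for name, dois in items[1:] if dois != baseline]


# Toy stand-ins for chunk_dois, contribution_dois, and one extra expansion.
example_sets = {
    "chunks": {"10.1/a", "10.1/b"},
    "contributions": {"10.1/a", "10.1/b"},
    "summaries": {"10.1/a"},  # hypothetical expansion that lost one DOI
}

mismatched = mismatched_doi_sets(example_sets)
print(mismatched)
```

In the notebook itself, the dictionary would hold `chunk_dois`, `contribution_dois`, and one entry per additional expansion.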

In [2]:
queries = pd.read_json(PATH_TO_QUERY_DATASET, lines=True)
# Each row's "citation_dois" holds the DOIs it cites; collect them all into one set
query_dois = set()
for _, row in queries.iterrows():
    query_dois.update(row["citation_dois"])
print(f"Number of distinct doi's cited by query dataset: {len(query_dois)}")

Number of distinct doi's cited by query dataset: 10017


In [None]:
# Drop every query row that cites at least one DOI missing from chunk_dois
# (any other reference DOI set would do, since the reference sets are equal)
queries = queries[queries["citation_dois"].apply(lambda dois: all(doi in chunk_dois for doi in dois))]
print(f"Queries now have {len(queries)} rows")
query_dois = set()
for _, row in queries.iterrows():
    query_dois.update(row["citation_dois"])
print(f"Number of distinct doi's cited by query dataset: {len(query_dois)}")

Queries now have 14735 rows
Number of distinct doi's cited by query dataset: 9980
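
After filtering, the query dataset cites fewer distinct DOIs (9980) than the reference data contains (9989). One effect that can contribute to such a gap is that dropping a row removes every DOI it cites, including valid ones that no surviving row cites. The toy example below sketches this with plain Python lists standing in for the DataFrame; all DOIs and names in it are made up for illustration.

```python
# Hypothetical reference set and query rows, not taken from the real data.
reference_dois = {"10.1/a", "10.1/b", "10.1/c"}
toy_rows = [
    {"citation_dois": ["10.1/a"]},
    {"citation_dois": ["10.1/b", "10.1/x"]},  # 10.1/x is not in the reference set
]

# Same rule as the filter above: keep a row only if all its DOIs are known.
kept_rows = [
    row for row in toy_rows
    if all(doi in reference_dois for doi in row["citation_dois"])
]

# Collect the DOIs still cited after filtering.
remaining_dois = set()
for row in kept_rows:
    remaining_dois.update(row["citation_dois"])

# Reference DOIs that no surviving row cites.
uncited = reference_dois - remaining_dois
print(sorted(remaining_dois), sorted(uncited))
```

Here `10.1/b` is a valid reference DOI, but it ends up uncited because the only row citing it was dropped; `10.1/c` was never cited at all.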


In [8]:
query_dois.issubset(chunk_dois)

True

As long as `query_dois` is a subset of `chunk_dois`, the data is in a good state: every record in the query dataset cites only DOIs that appear in the reference data. Reference DOIs that no query record cites are fine.
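
If the subset check ever fails, it helps to see which DOIs are responsible. A minimal sketch of such a report follows; the function name `coverage_report` and the toy DOIs are assumptions for illustration, not part of the notebook.

```python
def coverage_report(query_dois, reference_dois):
    """Summarize how well the reference data covers the cited DOIs."""
    missing = query_dois - reference_dois  # cited but absent: a real problem
    unused = reference_dois - query_dois   # present but never cited: acceptable
    return {"ok": not missing, "missing": missing, "unused": unused}


# Toy sets standing in for query_dois and chunk_dois.
report = coverage_report(
    query_dois={"10.1/a", "10.1/b"},
    reference_dois={"10.1/a", "10.1/b", "10.1/c"},
)
print(report["ok"], sorted(report["unused"]))
```

Only a non-empty `missing` set indicates a problem; `unused` is informational.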

In [9]:
queries.to_json("data/dataset/nontrivial_checked.jsonl", lines=True, orient="records")