# Dataset Creation
At the top level, we have access to CLEF2019 and the SYNERGY dataset. 

The slight advantage of SYNERGY is that it has a few systematic reviews (SRs) from outside the biomedical domain; CLEF is only within biomedicine.

However, CLEF has 92 reviews in total, which makes it a sizeable collection.

Let's just start by accessing both within this notebook, before looking to combine them in some way.

## We begin with CLEF!

In [8]:
# std lib imports 
import pickle as pkl

# pkg imports
import pandas as pd

In [163]:
# path to the CLEF dataset
clef_path = "./data/Clef/CLEF2019/"

In [164]:
clef19_train = pd.DataFrame(pd.read_pickle(clef_path + "CLEF2019_training_data.pkl"))
clef19_test = pd.DataFrame(pd.read_pickle(clef_path + "CLEF2019_test_data.pkl"))

In [168]:
# combine the train and test sets into one and transpose it
clef19 = pd.concat([clef19_train, clef19_test], axis=1).T
# drop unnecessary columns
clef19 = clef19.drop(['query', 'type', 'terms_meshs'], axis=1)
clef19

Unnamed: 0,title,included,search_results
CD007394,Galactomannan detection for invasive aspergill...,"{'ids_abs': ['24049291', '23644098', '23614624...","{'ids': ['24514094', '24505528', '24495193', '..."
CD007427,Physical tests for shoulder impingements and l...,"{'ids_abs': ['19230608', '19181972', '19059895...","{'ids': ['16396700', '16539812', '17543142', '..."
CD008054,Human papillomavirus testing versus repeat cyt...,"{'ids_abs': ['20856930', '20713299', '20708786...","{'ids': ['20301575', '20301650', '21271014', '..."
CD008081,Optical coherence tomography (OCT) for detecti...,"{'ids_abs': ['21996307', '9479300', '17502499'...","{'ids': ['23793175', '23768471', '23767192', '..."
CD008122,Rapid diagnostic tests for diagnosing uncompli...,"{'ids_abs': ['15361993', '19187517', '10715695...","{'ids': ['19164769', '9557953', '7688346', '18..."
...,...,...,...
CD012669,Point‐of‐care ultrasonography for diagnosing t...,"{'ids_abs': ['24073336', '22307581', '25405557...","{'ids': ['26973754', '23911136', '20622617', '..."
CD012768,Xpert® MTB/RIF assay for extrapulmonary tuberc...,"{'ids_abs': ['21396219', '23827859', '25574911...","{'ids': ['22815718', '27286562', '22381459', '..."
CD012661,Development of type 2 diabetes mellitus in peo...,"{'ids_abs': ['26675051', '24083174', '25093755...","{'ids': ['25954454', '26675051', '27476051', '..."
CD011558,Factors that influence the provision of intrap...,"{'ids_abs': ['23394288', '23234509'], 'ids_con...","{'ids': ['25364879', '24290947', '23461026', '..."


In [169]:
def parse_row(row: pd.Series, citation_id: str) -> pd.DataFrame:
    included_ids_abs = set(row['included']['ids_abs'])
    included_ids_cont = set(row['included']['ids_cont'])
    search_results = row['search_results']

    df = pd.DataFrame()

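    # add the record ids from the search results
    # (in CLEF these appear to be PubMed IDs rather than OpenAlex ids;
    # the 'oa_id' column name is kept as-is)
    df['oa_id'] = search_results['ids']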
    # add original citation's id
    df['citation_id'] = citation_id
    
    # get the titles and abstracts
    titles, abstracts = [], []
    for record in search_results['records']:
        _, title, abstract, _ = record
        titles.append(title)
        abstracts.append(float('nan') if abstract == "?" else abstract)
        
    df['title'] = titles 
    df['abstract'] = abstracts

    # the ids that passed screening at the title+abstract level
    df['label_included'] = df['oa_id'].isin(included_ids_abs).astype(int)

    # the ids that passed screening at the full content level
    df['label_included_content'] = df['oa_id'].isin(included_ids_cont).astype(int)
    
    return df
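
To make the labelling logic above concrete, here is a minimal sketch on a made-up row. The ids, titles, and the first and last elements of each record tuple are placeholders invented for illustration; only the key names (`included`, `ids_abs`, `ids_cont`, `search_results`, `ids`, `records`) and the `"?"` marker for a missing abstract are taken from `parse_row`.

```python
import pandas as pd

# Hypothetical row mirroring the nested structure that parse_row reads.
# All ids and texts are invented; None fills the two unused tuple slots.
row = {
    "included": {"ids_abs": ["111", "222"], "ids_cont": ["222"]},
    "search_results": {
        "ids": ["111", "222", "333"],
        "records": [
            (None, "Title A", "Abstract A", None),
            (None, "Title B", "?", None),  # "?" marks a missing abstract
            (None, "Title C", "Abstract C", None),
        ],
    },
}

records = row["search_results"]["records"]
toy = pd.DataFrame({
    "oa_id": row["search_results"]["ids"],
    "title": [rec[1] for rec in records],
    # replace the "?" placeholder with NaN so pandas treats it as missing
    "abstract": [float("nan") if rec[2] == "?" else rec[2] for rec in records],
})

# 1 if the id passed title+abstract screening, else 0
toy["label_included"] = toy["oa_id"].isin(row["included"]["ids_abs"]).astype(int)
# 1 if the id passed full-content screening, else 0
toy["label_included_content"] = toy["oa_id"].isin(row["included"]["ids_cont"]).astype(int)

print(toy)
```

In this toy row, the second record ends up with a missing abstract and is the only one labelled as included at both screening levels.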

In [172]:
# combine the SRs into a single dataset
cleaned_clef = pd.concat(
    [parse_row(row, citation_id) for citation_id, row in clef19.iterrows()],
    ignore_index=True,
)
cleaned_clef

Unnamed: 0,oa_id,citation_id,title,abstract,label_included,label_included_content
0,24514094,CD007394,Susceptibility breakpoints for amphotericin B ...,Although conventional amphotericin B was for m...,0,0
1,24505528,CD007394,The prevalence of antifungal agents administra...,BACKGROUND: Invasive fungal infections (IFIs) ...,0,0
2,24495193,CD007394,Serum and urine Blastomyces antigen concentrat...,BACKGROUND: Serum and urine Blastomyces antige...,0,0
3,24493565,CD007394,Directed modification of the Aspergillus usami...,beta-Mannanases (EC 3.2.1.78) can catalyze the...,0,0
4,24489964,CD007394,Earlier diagnosis of invasive fusariosis with ...,Cross-reactivity of Fusarium species with seru...,0,0
...,...,...,...,...,...,...
574226,25424785,CD011787,Immunogenicity and safety of the HPV-16/18 AS0...,Immunogenicity and safety of the human papillo...,0,0
574227,9695135,CD011787,Influence of socioeconomic status on behaviora...,A double-blind prospective design was used to ...,0,0
574228,16700379,CD011787,Serosurvey of hepatitis B surface antigen in p...,"In 1990, Saudi Arabia began vaccinating all ch...",0,0
574229,24412301,CD011787,Evaluation of the feasibility of a state-based...,The vaccine safety advice network is a collabo...,0,0


In [173]:
# save the combined + cleaned dataset
cleaned_clef.to_pickle(clef_path + "CLEF2019_combined_data.pkl")
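
A quick sanity check that could be run on the combined frame is a per-review summary of candidates and inclusions. The sketch below uses a small made-up frame with the same column names as `cleaned_clef`; the summary column names (`n_candidates`, `n_included_abs`, `n_included_content`, `prevalence_abs`) are hypothetical and chosen only for this example.

```python
import pandas as pd

# Toy stand-in for cleaned_clef; ids and labels are invented.
toy = pd.DataFrame({
    "citation_id": ["CD1", "CD1", "CD1", "CD2", "CD2"],
    "label_included": [1, 0, 0, 1, 1],
    "label_included_content": [1, 0, 0, 0, 1],
})

# One row per review: number of candidates and inclusions at each level
summary = toy.groupby("citation_id").agg(
    n_candidates=("label_included", "size"),
    n_included_abs=("label_included", "sum"),
    n_included_content=("label_included_content", "sum"),
)

# Share of candidates that passed title+abstract screening
summary["prevalence_abs"] = summary["n_included_abs"] / summary["n_candidates"]

print(summary)
```

Running the same aggregation on the real frame would show how imbalanced each review is, which is worth knowing before the datasets are combined.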

## Now for SYNERGY
We'll open each CSV file and combine them all into a single DataFrame.

In [148]:
# std lib import
from os import listdir

In [158]:
# path to the SYNERGY dataset directory
synergy_path = "./data/synergy_dataset/"

In [191]:
synergy_csvs = [file for file in listdir(synergy_path) if file.endswith('.csv')]

In [194]:
def parse_synergy_dataset(file_path: str, citation_id: str) -> pd.DataFrame:
    df = pd.read_csv(file_path)
    df.insert(1, "citation_id", citation_id)
    return df
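
As an alternative to slicing file names by hand, the same load-and-tag step can be sketched with `pathlib`. This is only an illustration: `load_review_csvs` is a hypothetical helper, and the file names and contents written to the temporary directory are invented so that the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_review_csvs(directory):
    """Read every *.csv in `directory` and tag rows with the file stem."""
    frames = []
    # sorted() makes the row order independent of the filesystem listing order
    for csv_path in sorted(Path(directory).glob("*.csv")):
        df = pd.read_csv(csv_path)
        # the file name without its extension serves as the review identifier
        df.insert(1, "citation_id", csv_path.stem)
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


# Self-contained demo using two tiny made-up CSV files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "Review_A.csv").write_text("doi,title,label_included\nd1,T1,0\nd2,T2,1\n")
    Path(tmp, "Review_B.csv").write_text("doi,title,label_included\nd3,T3,0\n")
    Path(tmp, "notes.txt").write_text("not a csv, so it is skipped")
    demo = load_review_csvs(tmp)

print(demo)
```

Using `Path.stem` avoids relying on the extension being exactly four characters long, which is what the `[:-4]` slice assumes.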

In [201]:
combined_synergy = pd.concat([parse_synergy_dataset(synergy_path + csv, csv[:-4]) for csv in synergy_csvs], ignore_index=True)
combined_synergy

Unnamed: 0,doi,citation_id,title,abstract,label_included
0,https://doi.org/10.1109/indcon.2010.5712716,Hall_2012,Computer vision based offset error computation...,The use of computer vision based approach has ...,0
1,https://doi.org/10.1109/induscon.2010.5740045,Hall_2012,Design and development of a software for fault...,This paper presents an on-line fault diagnosis...,0
2,https://doi.org/10.1109/tpwrd.2005.848672,Hall_2012,Analytical Approach to Internal Fault Simulati...,A new method for simulating faulted transforme...,0
3,https://doi.org/10.1109/icelmach.2008.4799852,Hall_2012,Nonlinear equivalent circuit model of a tracti...,The paper presents the development of an equiv...,0
4,https://doi.org/10.1109/ipdps.2006.1639408,Hall_2012,Fault tolerance with real-time Java,After having drawn up a state of the art on th...,0
...,...,...,...,...,...
169283,https://doi.org/10.1002/jts.21893,van_Dis_2020,Effects of Cognitive-Behavioral Conjoint Thera...,en,0
169284,https://doi.org/10.1037/a0026361,van_Dis_2020,Attendance and substance use outcomes for the ...,This study uses data from the largest effectiv...,0
169285,https://doi.org/10.1097/00005131-199903000-00004,van_Dis_2020,Segment Transport Employing Intramedullary Dev...,To compare two different methods of segment tr...,0
169286,https://doi.org/10.1037//0022-006x.68.1.31,van_Dis_2020,Cognitive-behavioral stress management interve...,The present study tested the effects of a mult...,0


In [200]:
# save the combined + cleaned dataset
combined_synergy.to_pickle(synergy_path + "SYNERGY_combined_data.pkl")
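
One possible way to combine the two datasets later is to keep only the columns they share and record where each row came from. The sketch below is an assumption about how that could look, not a step this notebook performs: the toy frames are invented, and the `source` column is a hypothetical addition.

```python
import pandas as pd

# Toy stand-ins with the same column names as the two combined frames
clef_toy = pd.DataFrame({
    "oa_id": ["111"],
    "citation_id": ["CD000001"],
    "title": ["T1"],
    "abstract": ["A1"],
    "label_included": [1],
    "label_included_content": [0],
})
synergy_toy = pd.DataFrame({
    "doi": ["https://doi.org/10.0000/example"],
    "citation_id": ["Review_A"],
    "title": ["T2"],
    "abstract": ["A2"],
    "label_included": [0],
})

# Columns present in both frames; dataset-specific ids are dropped here
shared = ["citation_id", "title", "abstract", "label_included"]

combined = pd.concat(
    [
        clef_toy[shared].assign(source="CLEF2019"),
        synergy_toy[shared].assign(source="SYNERGY"),
    ],
    ignore_index=True,
)

print(combined)
```

Dropping `oa_id` and `doi` keeps the schema uniform; if record-level ids are needed downstream, they could instead be merged into a single string column.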