# How concepts in SYNERGY are computed

Concepts, also known as topics, are attached to each OpenAlex record. They are generated with a machine learning model. You can identify at least three different types of concepts per dataset in SYNERGY:

- The main (level 0) concept of the publication record
- The average main (level 0) concept of all records in the dataset (the search)
- The average main (level 0) concept of all included records in the dataset

See the table at the bottom of this notebook for the concepts per dataset.
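
As a rough illustration of what "level 0" means here, the sketch below filters a single made-up record down to its level-0 concepts. The record and its scores are invented for illustration; only the keys (`concepts`, `level`, `display_name`, `score`) follow the structure used by the code in this notebook. Treating the highest-scoring level-0 concept as the main one is an assumption of this sketch.

```python
# Hypothetical record (invented values); the keys mirror those read by flatten_concepts
record = {
    "label_included": 1,
    "concepts": [
        {"display_name": "Medicine", "level": 0, "score": 0.82},
        {"display_name": "Internal medicine", "level": 1, "score": 0.61},
        {"display_name": "Psychology", "level": 0, "score": 0.15},
    ],
}

# Keep only the top-level (level 0) concepts and their scores
level0 = {c["display_name"]: c["score"] for c in record["concepts"] if c["level"] == 0}

# Assumption of this sketch: the main concept is the level-0 concept with the highest score
main_concept = max(level0, key=level0.get)

print(level0)
print(main_concept)
```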

In [1]:
import synergy_dataset as sd
import pandas as pd

In [2]:
# function to standardize concept data

def flatten_concepts(ds, level=0):
    result = []
    for openalex_id, record in ds.to_dict(["concepts"]).items():

        d = {"openalex_id": openalex_id,
             "label_included": record["label_included"]}

        for c in record["concepts"]:
            if c["level"] == level:
                d[c["display_name"]] = c["score"]

        result.append(d)
    
    return pd.DataFrame(result)

In [3]:
keys = [d.name for d in sd.iter_datasets()]
df = pd.concat([flatten_concepts(d) for d in sd.iter_datasets()], keys=keys)
df

dataset,row,openalex_id,label_included,Medicine,Chemistry,Physics,Biology,Psychology,Economics,Materials science,Philosophy,...,Political science,History,Computer science,Sociology,Mathematics,Geology,Business,Art,Geography,Environmental science
Appenzeller-Herzog_2019,0,https://openalex.org/W1971651073,0,0.56412965,0.09127462,0.0,,,,,,...,,,,,,,,,,
Appenzeller-Herzog_2019,1,https://openalex.org/W2060266518,0,0.4624714,0.31808412,,0.36406675,,,,,...,,,,,,,,,,
Appenzeller-Herzog_2019,2,https://openalex.org/W1985329347,0,0.0,0.8244176,,0.10137832,,,,,...,,,,,,,,,,
Appenzeller-Herzog_2019,3,https://openalex.org/W2033425122,0,0.5881644,,,,,,,,...,,,,,,,,,,
Appenzeller-Herzog_2019,4,https://openalex.org/W2174281073,0,0.41348755,0.35551745,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wolters_2018,4275,https://openalex.org/W1995096921,0,0.91154534,,,,,,,,...,,,,,,,,,,
Wolters_2018,4276,https://openalex.org/W2114922669,0,0.95886254,,,,,,,,...,,,,,,,,,,
Wolters_2018,4277,https://openalex.org/W1657669688,0,0.96129704,,,,,,,,...,,,,,,,,,,
Wolters_2018,4278,https://openalex.org/W1994490975,0,0.9699614,,,,,,,,...,,,,,,,,,,


The following table shows the average level-0 concept scores over all records in each dataset. A concept that is missing from a record counts as a score of 0.

In [4]:
concepts_per_dataset = df.drop(["label_included", "openalex_id"], axis=1).astype(float).fillna(0).groupby(level=0).mean()
concepts_per_dataset

dataset,Medicine,Chemistry,Physics,Biology,Psychology,Economics,Materials science,Philosophy,Engineering,Political science,History,Computer science,Sociology,Mathematics,Geology,Business,Art,Geography,Environmental science
Appenzeller-Herzog_2019,0.588982,0.103742,0.002754,0.054124,0.030139,0.000246,0.00402,0.006154,0.000616,0.002405,0.002357,0.00677,0.0,0.00037,0.000309,0.001908,0.000138,0.001154,0.001068
Bos_2018,0.565684,0.008519,0.004312,0.028526,0.202546,5.6e-05,0.000249,0.000376,7.2e-05,9.6e-05,0.000118,0.015857,6.9e-05,0.004413,6.8e-05,0.000109,2.1e-05,7.7e-05,6.7e-05
Brouwer_2019,0.33274,0.00693,0.000627,0.011639,0.451345,0.001146,0.000212,0.004143,0.000596,0.002774,0.001253,0.006507,0.004893,0.001216,0.000184,0.001035,0.001908,0.001189,0.000237
Chou_2003,0.865703,0.005159,6.9e-05,0.002794,0.011938,0.000102,0.000122,0.000174,0.000111,0.001076,5.2e-05,0.002323,0.0,0.00033,0.0,0.000125,3.3e-05,9e-05,0.0
Chou_2004,0.760822,0.00978,0.000445,0.010023,0.056741,0.000225,0.00056,9.5e-05,0.000701,0.001352,0.000218,0.004468,0.000153,0.000146,6.8e-05,0.001643,3.3e-05,7.9e-05,0.0
Donners_2021,0.663195,0.05375,0.0,0.02747,0.002286,0.000324,0.001471,0.00227,0.002108,0.009966,0.0,0.016382,0.0,0.001673,0.0,0.005976,0.002458,0.000747,0.0
Hall_2012,0.003415,0.001369,0.032076,0.00122,0.001738,0.000374,0.020137,5.8e-05,0.139658,0.000253,9.1e-05,0.599865,9.6e-05,0.055082,0.012524,0.002743,5.4e-05,0.001584,0.00412
Jeyaraman_2020,0.723767,0.060434,0.000291,0.04108,0.0,0.0,0.055064,0.000452,0.001324,0.0,0.0,0.003419,0.0,0.000392,0.0,0.000179,0.000108,0.0,0.0
Leenaars_2019,0.233326,0.420204,0.000652,0.161313,0.064596,5.3e-05,0.001316,0.000404,0.000146,0.001013,0.000138,0.002701,7.5e-05,0.000171,2.5e-05,8.4e-05,7.7e-05,5.5e-05,5e-05
Leenaars_2020,0.839694,0.017233,0.000281,0.023396,0.000667,0.000301,0.001126,0.00047,0.000206,0.000479,1.4e-05,0.00333,2.9e-05,0.001062,0.0,0.000488,0.000247,9.4e-05,0.0
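
The `fillna(0)` call before averaging changes the result noticeably, so a small sketch may help. The data below is a made-up toy frame with two hypothetical datasets, `A` and `B`; it is not taken from SYNERGY. With `fillna(0)`, a concept absent from a record pulls the dataset average down. Without it, pandas skips the missing values and averages only the records that carry the concept.

```python
import pandas as pd

# Toy scores for two hypothetical datasets; None means the concept was not assigned
toy = pd.DataFrame(
    {"Medicine": [0.8, 0.4, None], "Psychology": [None, 0.6, 0.9]},
    index=pd.Index(["A", "A", "B"], name="dataset"),
)

# As in the notebook: missing concepts count as a score of 0
with_zero = toy.fillna(0).groupby(level=0).mean()

# Alternative for comparison: missing concepts are ignored in the mean
skip_nan = toy.groupby(level=0).mean()

print(with_zero)
print(skip_nan)
```

For dataset `A`, the Psychology average is 0.3 with zeros filled in and 0.6 when missing values are skipped.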


The same averages, computed over only the included records of each dataset:

In [5]:
concepts_per_dataset_incl = df[df["label_included"] == 1].drop(["label_included", "openalex_id"], axis=1).astype(float).fillna(0).groupby(level=0).mean()
concepts_per_dataset_incl

dataset,Medicine,Chemistry,Physics,Biology,Psychology,Economics,Materials science,Philosophy,Engineering,Political science,History,Computer science,Sociology,Mathematics,Geology,Business,Art,Geography,Environmental science
Appenzeller-Herzog_2019,0.791792,0.006277,0.0,0.008704,0.0,0.0,0.0,0.004839,0.0,0.0,0.013664,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bos_2018,0.731475,0.0,0.0,0.0,0.134993,0.0,0.0,0.004561,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Brouwer_2019,0.211561,0.0,0.0,0.0,0.602484,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Chou_2003,0.847146,0.0,0.0,0.0,0.0074,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Chou_2004,0.695533,0.0,0.0,0.0,0.056843,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Donners_2021,0.817031,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hall_2012,0.000729,0.0,0.0,0.000634,0.0,0.0,0.0,0.0,0.061725,0.0,0.0,0.694767,0.0,0.022457,0.014671,0.000869,0.0,0.0,0.0
Jeyaraman_2020,0.906323,0.004976,0.0,0.001607,0.0,0.0,0.004009,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Leenaars_2019,0.243685,0.2458,0.0,0.154579,0.256428,0.0,0.0,0.0,0.0,0.0,0.0,0.003783,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Leenaars_2020,0.865833,0.023698,0.0,0.008905,0.000295,0.0,0.001317,0.000289,0.0,0.0,0.0,0.000511,0.0,0.000449,0.0,9.7e-05,0.0,0.0,0.0


In [6]:
# Highest-scoring average concept per dataset, over all records (the search)
st = concepts_per_dataset.stack()
st = st[st.groupby(level=0).transform("max") == st].reset_index(level=1)
st.columns = ["concept_search", "score_search"]

# Highest-scoring average concept per dataset, over the included records only
st_incl = concepts_per_dataset_incl.stack()
st_incl = st_incl[st_incl.groupby(level=0).transform("max") == st_incl].reset_index(level=1)
st_incl.columns = ["concept_inclusions", "score_inclusions"]

concepts_pub = []
for d in sd.iter_datasets():
    # Take the first level-0 concept listed for the dataset's own publication
    level0 = next(c for c in d.metadata["publication"]["concepts"] if c["level"] == 0)
    r = {"name": d.name, "concept_publication": level0["display_name"]}
    concepts_pub.append(r)
    
df_pub = pd.DataFrame(concepts_pub).set_index("name")
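
The stack-and-compare pattern above keeps every concept that ties for the maximum within a dataset. As a hedged alternative sketch, not the notebook's own method, `idxmax` gives the same top concept in one step but returns only the first concept when scores tie. The two rows below are copied from the per-dataset averages shown earlier, limited to three concepts for brevity.

```python
import pandas as pd

# Two rows copied from the per-dataset averages, restricted to three concepts
avg = pd.DataFrame(
    {
        "Medicine": [0.33274, 0.003415],
        "Psychology": [0.451345, 0.001738],
        "Computer science": [0.006507, 0.599865],
    },
    index=["Brouwer_2019", "Hall_2012"],
)

# Column label of the row-wise maximum, plus the maximum itself
top = pd.DataFrame(
    {"concept_search": avg.idxmax(axis=1), "score_search": avg.max(axis=1)}
)

print(top)
```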

In [7]:
st_0n = concepts_per_dataset.stack()
st_0n[st_0n >= 0.2]


Appenzeller-Herzog_2019  Medicine             0.588982
Bos_2018                 Medicine             0.565684
                         Psychology           0.202546
Brouwer_2019             Medicine             0.332740
                         Psychology           0.451345
Chou_2003                Medicine             0.865703
Chou_2004                Medicine             0.760822
Donners_2021             Medicine             0.663195
Hall_2012                Computer science     0.599865
Jeyaraman_2020           Medicine             0.723767
Leenaars_2019            Medicine             0.233326
                         Chemistry            0.420204
Leenaars_2020            Medicine             0.839694
Meijboom_2021            Medicine             0.751780
Menon_2022               Medicine             0.675263
Moran_2021               Biology              0.360793
Muthu_2021               Medicine             0.919395
Nelson_2002              Medicine             0.794984
Oud_2018  

In [8]:
st_0n_incl = concepts_per_dataset_incl.stack()
st_0n_incl[st_0n_incl >= 0.2]


Appenzeller-Herzog_2019  Medicine            0.791792
Bos_2018                 Medicine            0.731475
Brouwer_2019             Medicine            0.211561
                         Psychology          0.602484
Chou_2003                Medicine            0.847146
Chou_2004                Medicine            0.695533
Donners_2021             Medicine            0.817031
Hall_2012                Computer science    0.694767
Jeyaraman_2020           Medicine            0.906323
Leenaars_2019            Medicine            0.243685
                         Chemistry           0.245800
                         Psychology          0.256428
Leenaars_2020            Medicine            0.865833
Meijboom_2021            Medicine            0.893434
Menon_2022               Medicine            0.605102
Moran_2021               Medicine            0.214867
                         Biology             0.338177
Muthu_2021               Medicine            0.950681
Nelson_2002              Med

In [9]:
df_pub.join(st["concept_search"]).join(st_incl["concept_inclusions"])

name,concept_publication,concept_search,concept_inclusions
Appenzeller-Herzog_2019,Medicine,Medicine,Medicine
Bos_2018,Medicine,Medicine,Medicine
Brouwer_2019,Psychology,Psychology,Psychology
Chou_2003,Medicine,Medicine,Medicine
Chou_2004,Medicine,Medicine,Medicine
Donners_2021,Medicine,Medicine,Medicine
Hall_2012,Computer science,Computer science,Computer science
Jeyaraman_2020,Medicine,Medicine,Medicine
Leenaars_2019,Medicine,Chemistry,Psychology
Leenaars_2020,Medicine,Medicine,Medicine
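
To spot datasets where the three concept types disagree, one possible check is to count the distinct values per row. The sketch below rebuilds three rows of the table above by hand; the variable names `summary` and `disagree` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Three rows copied from the summary table above
summary = pd.DataFrame(
    {
        "concept_publication": ["Medicine", "Psychology", "Medicine"],
        "concept_search": ["Medicine", "Psychology", "Chemistry"],
        "concept_inclusions": ["Medicine", "Psychology", "Psychology"],
    },
    index=pd.Index(["Bos_2018", "Brouwer_2019", "Leenaars_2019"], name="name"),
)

# A row with more than one distinct concept has at least one disagreement
disagree = summary[summary.nunique(axis=1) > 1]

print(disagree)
```

Among these three rows, only Leenaars_2019 is flagged: its publication, search, and inclusions each map to a different concept.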
