# How concepts in SYNERGY are computed

Concepts, also known as topics, are attached to each OpenAlex record. They are generated with a machine learning model. You can identify at least three different types of concepts per dataset in SYNERGY:

- The main (level 0) concept of the publication record
- The average main (level 0) concept of all records in the dataset (the search)
- The average main (level 0) concept of all included records in the dataset

See the table at the bottom of this notebook for the topic per dataset.
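
To make "level 0" concrete, here is a minimal sketch on a single hypothetical record. The concept names and scores are invented for illustration; only the keys `display_name`, `level`, and `score` are the ones read later in this notebook. Taking the highest-scoring level-0 concept as the "main" concept is an assumption of this sketch.

```python
# Hypothetical concept list for one record (values invented for illustration).
concepts = [
    {"display_name": "Medicine", "level": 0, "score": 0.5},
    {"display_name": "Psychology", "level": 0, "score": 0.25},
    {"display_name": "Oncology", "level": 1, "score": 0.75},
]

# Keep only the broadest (level 0) concepts and their scores.
level0 = {c["display_name"]: c["score"] for c in concepts if c["level"] == 0}

# Assumption: the "main" concept is the level-0 concept with the highest score.
main_concept = max(level0, key=level0.get)
```

Note that the level-1 concept is ignored even though its score is the highest in the list.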

In [1]:
import synergy_dataset as sd
import pandas as pd

In [2]:
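# Flatten the concept scores of one dataset into one row per record.

def flatten_concepts(ds, level=0):
    """Return a DataFrame with one column per concept at the given level."""
    result = []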
    for openalex_id, record in ds.to_dict(["concepts"]).items():

        d = {"openalex_id": openalex_id,
             "label_included": record["label_included"]}

        for c in record["concepts"]:
            if c["level"] == level:
                d[c["display_name"]] = c["score"]

        result.append(d)
    
    return pd.DataFrame(result)

# test = sd.Dataset("Brouwer_2019")
# flatten_concepts(test)

In [None]:
keys = [d.name for d in sd.iter_datasets()]
df = pd.concat([flatten_concepts(d) for d in sd.iter_datasets()], keys=keys)
df

The following table shows the average score per concept across all records in each dataset. A concept that is missing from a record counts as a score of 0.

In [None]:
concepts_per_dataset = df.drop(["label_included", "openalex_id"], axis=1).astype(float).fillna(0).groupby(level=0).mean()
concepts_per_dataset
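
The `fillna(0)` step changes the averages noticeably. The sketch below, on invented scores for one hypothetical concept column, compares the mean with missing values skipped (the pandas default) against the mean with missing values counted as 0, which is what the cell above computes.

```python
import pandas as pd

# Hypothetical scores of one concept across four records; NaN means the
# concept was not attached to that record.
scores = pd.Series([0.5, float("nan"), float("nan"), 0.25])

# pandas skips NaN by default, so only two records enter the average.
mean_skipping_missing = scores.mean()

# Treating a missing concept as score 0 averages over all four records.
mean_missing_as_zero = scores.fillna(0).mean()
```

Counting missing concepts as 0 penalises concepts that appear on only a few records, so the resulting average reflects both how often and how strongly a concept occurs.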

The next table shows the same averages, restricted to the included records of each dataset.

In [None]:
concepts_per_dataset_incl = df[df["label_included"] == 1].drop(["label_included", "openalex_id"], axis=1).astype(float).fillna(0).groupby(level=0).mean()
concepts_per_dataset_incl

In [None]:
st = concepts_per_dataset.stack()
st = st[st.groupby(level=0).transform("max") == st].reset_index(level=1)
st.columns = ["concept_search", "score_search"]

st_incl = concepts_per_dataset_incl.stack()
st_incl = st_incl[st_incl.groupby(level=0).transform("max") == st_incl].reset_index(level=1)
st_incl.columns = ["concept_inclusions", "score_inclusions"]

concepts_pub = []
for d in sd.iter_datasets():
    level0 = (c for c in d.metadata["publication"]["concepts"] if c["level"] == 0)
    r = {"name": d.name, "concept_publication": next(level0)["display_name"]}
    concepts_pub.append(r)
    
df_pub = pd.DataFrame(concepts_pub).set_index("name")

In [None]:
df_pub.join(st["concept_search"]).join(st_incl["concept_inclusions"])
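
As an alternative way to pick the top concept per dataset, the sketch below uses `idxmax` on a small table of invented mean scores. The dataset and concept names are hypothetical. Unlike the mask-based selection used in this notebook, which keeps every row tied for the maximum, `idxmax` returns a single concept per dataset (the first one encountered).

```python
import pandas as pd

# Hypothetical mean scores: one row per dataset, one column per concept.
means = pd.DataFrame(
    {"Medicine": [0.5, 0.25], "Psychology": [0.25, 0.75]},
    index=["dataset_a", "dataset_b"],
)

# For each dataset, take the concept with the highest mean score and that score.
top = pd.DataFrame(
    {"concept": means.idxmax(axis=1), "score": means.max(axis=1)}
)
```

Whether dropping ties is acceptable depends on the analysis; with real-valued average scores, exact ties should be rare.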