## Extract Meta-Triples from Biothings SemMedDB API's Predication Data

### Overview

Reference: https://github.com/biothings/pending.api/issues/63#issuecomment-1100469563

A meta-triple is a tuple of `(SUBJECT_SEMTYPE, PREDICATE, OBJECT_SEMTYPE)` values (see table below). For example, `(dsyn, PROCESS_OF, humn)` can be roughly read as "Disease or SYNdrome is a PROCESS OF HUMaN". Statistics on such meta-triples can help determine related x-bte annotations. We later found that the entity types of subjects and objects are also helpful to x-bte developers, so two more fields, `SUBJECT_PREFIX` and `OBJECT_PREFIX`, are included in the statistics. We still call the resulting five-field unit a meta-triple.

The source file (`semmedVER43_2023_R_PREDICATION.116080_clean_pyarrow_snappy.parquet`) we use here is a cleaned and wrangled version of the original SemMedDB predications (`semmedVER43_2023_R_PREDICATION.116080`), stored in [Apache Parquet](https://parquet.apache.org/) format for better I/O performance. It can be found on server `su06` under directory `/data/pending/datasources/semmeddb/43/CACHE`.

The columns of the source file are listed below:

|Column Name     |Remark                                                        |
|----------------|--------------------------------------------------------------|
|`_ID`           | ID of the document parsed from this row                      |
|`PREDICATION_ID` | ID of the predication                                        |
|`PREDICATE`     | Predicate of the predication                                 |
|`PMID`          | PubMed ID of the predication                                 |
|`SUBJECT_CUI`   | Subject's CUI (either UMLS CUI or NCBIGene ID)               |
|`SUBJECT_PREFIX` | Indicator of Subject's CUI type (either `"umls"` or `"ncbigene"`)|
|`SUBJECT_NAME`  | Subject's name                                               |
|`SUBJECT_SEMTYPE` | Subject's semantic type (4-letter abbreviation)              |
|`SUBJECT_NOVELTY` | Subject's novelty score (always 1 currently)                 |
|`OBJECT_CUI`    | Object's CUI (either UMLS CUI or NCBIGene ID)                |
|`OBJECT_PREFIX` | Indicator of Object's CUI type (either `"umls"` or `"ncbigene"`)|
|`OBJECT_NAME`   | Object's name                                                |
|`OBJECT_SEMTYPE` | Object's semantic type (4-letter abbreviation)               |
|`OBJECT_NOVELTY` | Object's novelty score (always 1 currently)                  |

### Loading Data

Apache Arrow is an ideal in-memory transport layer for data that is read from or written to Parquet files. To load the Parquet file, we need to install the `pyarrow` package, e.g. with `pip`:

```bash
pip install pyarrow
```

Read more:

- [Installing PyArrow](https://arrow.apache.org/docs/python/install.html)
- [Reading and Writing the Apache Parquet Format](https://arrow.apache.org/docs/python/parquet.html)

In [None]:
import pandas as pd
from tqdm import tqdm

parquet_file = "semmedVER43_2024_R_PREDICATION_clean_pyarrow_snappy.parquet"

semmed_df = pd.read_parquet(
    parquet_file,
    engine="pyarrow",
    columns=[
        "SUBJECT_PREFIX",
        "SUBJECT_SEMTYPE",
        "PREDICATE",
        "OBJECT_PREFIX",
        "OBJECT_SEMTYPE",
        "PMID",
        "SUBJECT_CUI",
        "OBJECT_CUI"
    ]
)
semmed_df.shape

In [None]:
semmed_df.head()

### Meta-Triples vs No. Predications

Because each row is a predication, calling `.value_counts()` on the combination of `SUBJECT_PREFIX`, `SUBJECT_SEMTYPE`, `PREDICATE`, `OBJECT_PREFIX`, and `OBJECT_SEMTYPE` gives the number of predications for each meta-triple.
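
As a minimal sketch of what this counting does, the toy example below uses three hypothetical rows (not taken from SemMedDB) and produces one row per distinct meta-triple, sorted by count in descending order. The names `toy_df`, `meta_cols`, and `toy_stat` are illustrative only.

```python
import pandas as pd

# Toy predications (hypothetical rows, not taken from SemMedDB)
toy_df = pd.DataFrame({
    "SUBJECT_PREFIX": ["umls", "umls", "ncbigene"],
    "SUBJECT_SEMTYPE": ["dsyn", "dsyn", "gngm"],
    "PREDICATE": ["PROCESS_OF", "PROCESS_OF", "AFFECTS"],
    "OBJECT_PREFIX": ["umls", "umls", "umls"],
    "OBJECT_SEMTYPE": ["humn", "humn", "dsyn"],
})

meta_cols = ["SUBJECT_PREFIX", "SUBJECT_SEMTYPE", "PREDICATE", "OBJECT_PREFIX", "OBJECT_SEMTYPE"]

# One row per distinct meta-triple; the count column is named PREDICATION_N
toy_stat = toy_df.value_counts(subset=meta_cols).reset_index(name="PREDICATION_N")
print(toy_stat)
```

Here the `(umls, dsyn, PROCESS_OF, umls, humn)` meta-triple appears twice and therefore ranks first.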

In [None]:
pred_stat = semmed_df.value_counts(subset=["SUBJECT_PREFIX", "SUBJECT_SEMTYPE", "PREDICATE", "OBJECT_PREFIX", "OBJECT_SEMTYPE"]).reset_index(name="PREDICATION_N")
pred_stat.shape

In [None]:
# List the top 10 meta-triples
pred_stat.head(n=10)

In [None]:
# List the top 10 meta-triples whose subject is an NCBIGene
pred_stat.loc[pred_stat["SUBJECT_PREFIX"].eq("ncbigene")].head(n=10)

In [None]:
# Save the Result
pred_stat.to_csv("meta_triple_predication_stat.tsv", sep="\t", index=False)

### Meta-Triples vs No. Documents (Original Solution)

Each SemMedDB document has an ID made from 3 fields, `SUBJECT_CUI`, `PREDICATE`, and `OBJECT_CUI`. Therefore we group the input data frame by these 3 fields, and each group should contribute exactly 1 document to the stats. However, on the BTE side, documents with a `PMID` count less than or equal to 3 are often excluded due to their low significance, so we also track the valid contribution of each document in the statistics.
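
The threshold rule can be checked on a small scale first. The sketch below uses hypothetical CUIs and PMIDs: one document has 4 distinct PMIDs (one of them repeated) and the other has 2, so only the first should count as valid. The names `toy_df`, `doc_cols`, and `toy_doc` are illustrative only.

```python
import pandas as pd

# Hypothetical predications: the first document has 4 distinct PMIDs
# (104 appears twice), the second document has 2
toy_df = pd.DataFrame({
    "SUBJECT_CUI": ["C1"] * 5 + ["C2"] * 2,
    "PREDICATE": ["TREATS"] * 5 + ["CAUSES"] * 2,
    "OBJECT_CUI": ["C9"] * 7,
    "PMID": [101, 102, 103, 104, 104, 201, 202],
})

doc_cols = ["SUBJECT_CUI", "PREDICATE", "OBJECT_CUI"]

# One row per document, with the number of distinct PMIDs supporting it
toy_doc = toy_df.groupby(doc_cols)["PMID"].nunique().reset_index(name="PMID_N_UNIQUE")
toy_doc["DOC_N"] = 1

# A document is "valid" only when it has more than 3 distinct PMIDs
toy_doc["DOC_N_VALID"] = (toy_doc["PMID_N_UNIQUE"] > 3).astype(int)
print(toy_doc)
```

Using `nunique` rather than a plain row count matters here: the repeated PMID 104 is counted once.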

In [None]:
g = semmed_df.groupby(["SUBJECT_CUI", "PREDICATE", "OBJECT_CUI"])

# Count the number of unique PMIDs for each group
doc_stat = g["PMID"].nunique().reset_index(name="PMID_N_UNIQUE")

# Each group should contribute only 1 document
doc_stat["DOC_N"] = 1

# If a group has 3 or fewer unique PMIDs, its valid contribution is 0
doc_stat["DOC_N_VALID"] = doc_stat["DOC_N"].where(doc_stat["PMID_N_UNIQUE"] > 3, other=0)

doc_stat.head()

Now join the document contribution stats to the original data frame. Note that a subject/object may have multiple semtypes. E.g. in [C0009325-INHIBITS-C0162574](https://biothings.transltr.io/semmeddb/association/C0009325-INHIBITS-C0162574), the subject has two semtypes, `aapp` and `phsu`, so the document is mapped to two meta-triples, `(aapp, INHIBITS, bacs)` and `(phsu, INHIBITS, bacs)`. In such cases, all the meta-triples will receive the contribution of documents from the same ID.
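
The multi-semtype case can be illustrated with a toy example. The sketch below reuses the document ID mentioned above with hypothetical PMIDs and omits the prefix columns for brevity. It assumes the goal is for each document to count once per meta-triple. Note that summing a per-row `DOC_N` of 1 counts predication rows, so this sketch deduplicates on (document, meta-triple) pairs before counting; this is one possible approach, not necessarily the one intended here.

```python
import pandas as pd

# Hypothetical rows for a single document whose subject carries two semtypes;
# the PMIDs are made up for illustration
toy_df = pd.DataFrame({
    "SUBJECT_CUI": ["C0009325"] * 4,
    "SUBJECT_SEMTYPE": ["aapp", "aapp", "phsu", "phsu"],
    "PREDICATE": ["INHIBITS"] * 4,
    "OBJECT_CUI": ["C0162574"] * 4,
    "OBJECT_SEMTYPE": ["bacs"] * 4,
    "PMID": [1, 2, 1, 2],
})

doc_cols = ["SUBJECT_CUI", "PREDICATE", "OBJECT_CUI"]
meta_cols = ["SUBJECT_SEMTYPE", "PREDICATE", "OBJECT_SEMTYPE"]  # prefixes omitted for brevity

# Keep one row per (document, meta-triple) pair so each document counts once;
# PREDICATE appears in both lists, so build the subset without repeating it
pair_cols = doc_cols + [c for c in meta_cols if c not in doc_cols]
pairs = toy_df.drop_duplicates(subset=pair_cols)

# Number of distinct documents contributing to each meta-triple
doc_per_meta = pairs.groupby(meta_cols).size().reset_index(name="DOC_N")
print(doc_per_meta)
```

The single document contributes once to `(aapp, INHIBITS, bacs)` and once to `(phsu, INHIBITS, bacs)`, even though each meta-triple has two predication rows.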

In [None]:
# TODO Can we use transform for all the 3 new columns, without merging? E.g. semmed_df["PMID_N_UNIQUE"] = g["PMID"].transform("nunique")
semmed_df = semmed_df.merge(doc_stat, how="inner", on=["SUBJECT_CUI", "PREDICATE", "OBJECT_CUI"])
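
Regarding the TODO above: on toy data, `transform("nunique")` appears to yield the same per-row values as the merge. The sketch below compares both routes on hypothetical rows; `toy_df`, `merged`, and `transformed` are illustrative names only.

```python
import pandas as pd

# Hypothetical predications: one document with 2 distinct PMIDs, one with 1
toy_df = pd.DataFrame({
    "SUBJECT_CUI": ["C1", "C1", "C1", "C2"],
    "PREDICATE": ["TREATS", "TREATS", "TREATS", "CAUSES"],
    "OBJECT_CUI": ["C9", "C9", "C9", "C9"],
    "PMID": [101, 102, 102, 201],
})
doc_cols = ["SUBJECT_CUI", "PREDICATE", "OBJECT_CUI"]

# Route 1: aggregate per document, then merge the result back onto every row
doc_stat = toy_df.groupby(doc_cols)["PMID"].nunique().reset_index(name="PMID_N_UNIQUE")
merged = toy_df.merge(doc_stat, how="inner", on=doc_cols)

# Route 2: broadcast the per-document value to every row with transform
transformed = toy_df.copy()
transformed["PMID_N_UNIQUE"] = toy_df.groupby(doc_cols)["PMID"].transform("nunique")

print(merged["PMID_N_UNIQUE"].tolist())
print(transformed["PMID_N_UNIQUE"].tolist())
```

The transform route avoids building and joining an intermediate frame, which may matter for memory on the full dataset.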

In [None]:
# Group by the meta-triples
g = semmed_df.groupby(["SUBJECT_PREFIX", "SUBJECT_SEMTYPE", "PREDICATE", "OBJECT_PREFIX", "OBJECT_SEMTYPE"])

"""
Here I encountered a bug in Pandas.GroupBy, whose root cause might be in the pyarrow implementation of GroupBy.

If I call `g["DOC_N"].sum()`, it will exhaust the Cartesian product of all the values from the five columns,
    "SUBJECT_PREFIX", "SUBJECT_SEMTYPE", "PREDICATE", "OBJECT_PREFIX", and "OBJECT_SEMTYPE", leading to numerous empty groups with a sum of "DOC_N" equal to 0,
    which is quite counter-intuitive because each group's DOC_N should be at least 1.

Therefore I take the following workaround to do the aggregation manually.
"""
doc_stat2 = [[*index, data["DOC_N"].sum(), data["DOC_N_VALID"].sum()] for index, data in g]
doc_stat2 = pd.DataFrame(doc_stat2, columns=["SUBJECT_PREFIX", "SUBJECT_SEMTYPE", "PREDICATE", "OBJECT_PREFIX", "OBJECT_SEMTYPE", "DOC_N", "DOC_N_VALID"])
doc_stat2.sort_values("DOC_N_VALID", ascending=False, inplace=True)

In [None]:
# Save the Result
doc_stat2.to_csv("meta_triple_document_stat.tsv", sep="\t", index=False)
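
The empty groups described in the aggregation cell above are consistent with how pandas groups on categorical columns when `observed=False`: every combination of categories is emitted, including unobserved ones. Whether the columns in the Parquet file are actually stored as categoricals is an assumption not confirmed here. Under that assumption, passing `observed=True` to `groupby` may make the manual loop unnecessary. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical frame with categorical grouping columns
toy_df = pd.DataFrame({
    "SUBJECT_SEMTYPE": pd.Categorical(["dsyn", "gngm"]),
    "OBJECT_SEMTYPE": pd.Categorical(["humn", "dsyn"]),
    "DOC_N": [1, 1],
})
group_cols = ["SUBJECT_SEMTYPE", "OBJECT_SEMTYPE"]

# observed=False emits every category combination, including empty ones (sum 0)
all_combos = toy_df.groupby(group_cols, observed=False)["DOC_N"].sum()

# observed=True keeps only the combinations that occur in the data
observed_only = toy_df.groupby(group_cols, observed=True)["DOC_N"].sum()

print(len(all_combos), len(observed_only))
```

Checking `semmed_df.dtypes` would show whether this explanation applies to the real data.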

### Meta-Triples vs No. Documents (New Solution)

This solution is an alternative to the merge used above: the per-document statistics are computed directly on each row with `transform`.

In [None]:
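import os

# Attempt to switch to a NumPy backend to avoid pyarrow groupby issues.
# NOTE: `PANDAS_DATAFRAME_BACKEND` does not appear to be a documented pandas
# setting, so this line may have no effect.
os.environ["PANDAS_DATAFRAME_BACKEND"] = "numpy"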


In [None]:
# Step 1: Compute document contribution stats directly with transform (no merge needed)
group_cols_doc = ["SUBJECT_CUI", "PREDICATE", "OBJECT_CUI"]
g_doc = semmed_df.groupby(group_cols_doc)

In [None]:
# Apply the transform to get unique PMIDs and set DOC_N/DOC_N_VALID
semmed_df["PMID_N_UNIQUE"] = g_doc["PMID"].transform("nunique")
semmed_df["DOC_N"] = 1
semmed_df["DOC_N_VALID"] = semmed_df["PMID_N_UNIQUE"].gt(3).astype(int)


In [None]:
semmed_df.head()

In [None]:
# Step 2: Group by meta-triples
group_cols_meta = ["SUBJECT_PREFIX", "SUBJECT_SEMTYPE", "PREDICATE", "OBJECT_PREFIX", "OBJECT_SEMTYPE"]
g_meta = semmed_df.groupby(group_cols_meta)


In [None]:
# Add progress bar to manual group aggregation (loop workaround due to backend issues)
print("[INFO] Aggregating meta-triple document statistics...")

# Using tqdm to track the progress of the loop
meta_rows = []
for index, data in tqdm(g_meta, total=len(g_meta), desc="Aggregating groups"):
    doc_n = data["DOC_N"].sum()
    doc_n_valid = data["DOC_N_VALID"].sum()
    meta_rows.append([*index, doc_n, doc_n_valid])


In [None]:
# Create the final dataframe
doc_stat2 = pd.DataFrame(meta_rows, columns=group_cols_meta + ["DOC_N", "DOC_N_VALID"])
doc_stat2.sort_values("DOC_N_VALID", ascending=False, inplace=True)

In [None]:

# Define the output file path and make sure its directory exists
outfile = "metatriple_output_files/meta_triple_document_stat.tsv"
os.makedirs(os.path.dirname(outfile), exist_ok=True)
# Save the result
doc_stat2.to_csv(outfile, sep="\t", index=False)
