## Extract Meta-Triples from the BioThings SemMedDB API's Predication Data

### Overview

Reference: https://github.com/biothings/pending.api/issues/63#issuecomment-1100469563

The source file used here (`semmedVER43_2022_R_PREDICATION_clean_pyarrow_snappy.parquet`) is a cleaned and wrangled version of the original SemMedDB predications (`semmedVER43_2022_R_PREDICATION.csv`). It is stored in [Apache Parquet](https://parquet.apache.org/) format for better I/O performance and can be found on server `su06` under the directory `/data/pending/datasources/semmeddb/43/semmeddb_20230112_vrw1vod3`.

The columns of the source file are listed below:

|Column Name       |Remark                                                             |
|------------------|-------------------------------------------------------------------|
|`_ID`             | ID of the document parsed from this row                           |
|`PREDICATION_ID`  | ID of the predication                                             |
|`PREDICATE`       | Predicate of the predication                                      |
|`PMID`            | PubMed ID of the predication                                      |
|`SUBJECT_CUI`     | Subject's CUI (either UMLS CUI or NCBIGene ID)                    |
|`SUBJECT_PREFIX`  | Indicator of Subject's CUI type (either `"umls"` or `"ncbigene"`) |
|`SUBJECT_NAME`    | Subject's name                                                    |
|`SUBJECT_SEMTYPE` | Subject's semantic type (4-letter abbreviation)                   |
|`SUBJECT_NOVELTY` | Subject's novelty score (always 1 currently)                      |
|`OBJECT_CUI`      | Object's CUI (either UMLS CUI or NCBIGene ID)                     |
|`OBJECT_PREFIX`   | Indicator of Object's CUI type (either `"umls"` or `"ncbigene"`)  |
|`OBJECT_NAME`     | Object's name                                                     |
|`OBJECT_SEMTYPE`  | Object's semantic type (4-letter abbreviation)                    |
|`OBJECT_NOVELTY`  | Object's novelty score (always 1 currently)                       |

### Loading Data

Apache Arrow is an ideal in-memory transport layer for data that is read from or written to Parquet files. To load the Parquet file, install the `pyarrow` package, e.g. with `pip`:

```bash
pip install pyarrow
```

Read more:

- [Installing PyArrow](https://arrow.apache.org/docs/python/install.html)
- [Reading and Writing the Apache Parquet Format](https://arrow.apache.org/docs/python/parquet.html)

In [None]:
import pandas as pd

def read_parquet_cache(path: str) -> pd.DataFrame:
    # For option descriptions, see https://pandas.pydata.org/pandas-docs/version/1.1/reference/api/pandas.read_parquet.html
    engine = "pyarrow"
    semmed_data_frame = pd.read_parquet(path=path, engine=engine)

    # Enforce the pyarrow-backed string dtype for better I/O
    string_columns = ["_ID", "PREDICATE", "SUBJECT_CUI", "SUBJECT_NAME", "SUBJECT_SEMTYPE", "OBJECT_CUI", "OBJECT_NAME", "OBJECT_SEMTYPE"]
    existing_string_columns = [col for col in string_columns if col in semmed_data_frame.columns]
    dtype_map = {col: "string[pyarrow]" for col in existing_string_columns}
    semmed_data_frame = semmed_data_frame.astype(dtype=dtype_map, copy=False)

    return semmed_data_frame

semmed_df = read_parquet_cache(path="semmedVER43_2022_R_PREDICATION_clean_pyarrow_snappy.parquet")

In [3]:
semmed_df.shape  # (85247895, 14)

(85247895, 14)

### Finding the Meta-Triples

The meta-triples are the unique combinations of `SUBJECT_PREFIX`, `SUBJECT_SEMTYPE`, `PREDICATE`, `OBJECT_PREFIX`, and `OBJECT_SEMTYPE` values. Rather than only listing these combinations, we prefer to also count how many predications fall under each one.
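
To make the difference between listing unique combinations and counting them concrete, here is a minimal sketch on a made-up four-row frame (the rows are illustrative assumptions, not real SemMedDB data). `drop_duplicates` yields the distinct meta-triples only, while `DataFrame.value_counts` yields the same combinations together with their frequencies, sorted in descending order of count.

```python
import pandas as pd

meta_triple_columns = [
    "SUBJECT_PREFIX", "SUBJECT_SEMTYPE", "PREDICATE", "OBJECT_PREFIX", "OBJECT_SEMTYPE",
]

# Made-up predications; the first two rows share the same meta-triple
toy_df = pd.DataFrame({
    "SUBJECT_PREFIX": ["umls", "umls", "ncbigene", "umls"],
    "SUBJECT_SEMTYPE": ["dsyn", "dsyn", "gngm", "topp"],
    "PREDICATE": ["PROCESS_OF", "PROCESS_OF", "ASSOCIATED_WITH", "TREATS"],
    "OBJECT_PREFIX": ["umls", "umls", "umls", "umls"],
    "OBJECT_SEMTYPE": ["humn", "humn", "neop", "dsyn"],
})

# Option 1: distinct combinations only, with no frequency information
unique_only = toy_df[meta_triple_columns].drop_duplicates()

# Option 2: distinct combinations plus the number of rows for each
with_counts = toy_df.value_counts(subset=meta_triple_columns).reset_index(name="COUNT")

print(len(unique_only), len(with_counts))
print(with_counts)
```

Both approaches find the same set of combinations; the second additionally records that the `dsyn`/`PROCESS_OF`/`humn` pattern occurs twice in the toy data.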

In [5]:
vc = semmed_df.value_counts(subset=["SUBJECT_PREFIX", "SUBJECT_SEMTYPE", "PREDICATE", "OBJECT_PREFIX", "OBJECT_SEMTYPE"])
vc = vc.reset_index(name="COUNT")

In [8]:
vc.shape  # (34368, 6)

(34368, 6)

In [7]:
# List the top 10 meta-triples
vc.head(n=10)

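|     | SUBJECT_PREFIX | SUBJECT_SEMTYPE | PREDICATE   | OBJECT_PREFIX | OBJECT_SEMTYPE | COUNT   |
|-----|----------------|-----------------|-------------|---------------|----------------|---------|
| 0   | umls           | dsyn            | PROCESS_OF  | umls          | humn           | 2223363 |
| 1   | umls           | bpoc            | PART_OF     | umls          | mamm           | 1169500 |
| 2   | umls           | bpoc            | LOCATION_OF | umls          | neop           | 1056360 |
| 3   | umls           | fndg            | PROCESS_OF  | umls          | humn           | 1054339 |
| 4   | umls           | topp            | TREATS      | umls          | dsyn           | 883269  |
| 5   | umls           | bpoc            | LOCATION_OF | umls          | aapp           | 855484  |
| 6   | umls           | bdsu            | LOCATION_OF | umls          | aapp           | 844119  |
| 7   | umls           | bpoc            | LOCATION_OF | umls          | patf           | 816275  |
| 8   | umls           | topp            | TREATS      | umls          | neop           | 738368  |
| 9   | umls           | bpoc            | LOCATION_OF | umls          | dsyn           | 716644  |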


In [11]:
# List the top 10 meta-triples whose subject is an NCBI Gene ID
vc.loc[vc["SUBJECT_PREFIX"].eq("ncbigene")].head(n=10)

|     | SUBJECT_PREFIX | SUBJECT_SEMTYPE | PREDICATE       | OBJECT_PREFIX | OBJECT_SEMTYPE | COUNT  |
|-----|----------------|-----------------|-----------------|---------------|----------------|--------|
| 121 | ncbigene       | gngm            | ASSOCIATED_WITH | umls          | neop           | 125425 |
| 137 | ncbigene       | aapp            | ISA             | umls          | aapp           | 107500 |
| 142 | ncbigene       | gngm            | INTERACTS_WITH  | umls          | aapp           | 106288 |
| 159 | ncbigene       | gngm            | STIMULATES      | umls          | gngm           | 99425  |
| 174 | ncbigene       | gngm            | LOCATION_OF     | umls          | genf           | 92708  |
| 186 | ncbigene       | aapp            | INTERACTS_WITH  | umls          | aapp           | 88544  |
| 195 | ncbigene       | aapp            | STIMULATES      | umls          | gngm           | 85309  |
| 202 | ncbigene       | gngm            | ASSOCIATED_WITH | umls          | dsyn           | 79536  |
| 222 | ncbigene       | gngm            | INTERACTS_WITH  | umls          | cell           | 73579  |
| 248 | ncbigene       | gngm            | INHIBITS        | umls          | gngm           | 63923  |


### Saving the Result

In [9]:
vc.to_csv("semmedVER43_2022_R_PREDICATION_clean_meta_triples.tsv", sep="\t", index=False)
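
The saved file is a plain tab-separated table with a header row and no index column, so it can be loaded back with `pandas.read_csv` and `sep="\t"`. The following sketch demonstrates the round trip on a made-up two-row result, using an in-memory buffer in place of a file on disk (the rows and counts are illustrative assumptions).

```python
import io

import pandas as pd

# Made-up meta-triple counts standing in for the real `vc` frame
toy_vc = pd.DataFrame({
    "SUBJECT_PREFIX": ["umls", "ncbigene"],
    "SUBJECT_SEMTYPE": ["dsyn", "gngm"],
    "PREDICATE": ["PROCESS_OF", "ASSOCIATED_WITH"],
    "OBJECT_PREFIX": ["umls", "umls"],
    "OBJECT_SEMTYPE": ["humn", "neop"],
    "COUNT": [2, 1],
})

# Write with the same options as above, but into an in-memory buffer
buffer = io.StringIO()
toy_vc.to_csv(buffer, sep="\t", index=False)

# Rewind and read the tab-separated text back into a DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")

print(restored)
```

For the real output, the buffer would be replaced by the path of the TSV file written above.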