# Processing CLO Mappings


The [Cell Line Ontology (CLO)](https://bioregistry.io/registry/clo) is a detailed resource. However, it does not follow the standard OBO modeling pattern for cross-references, which uses either `oboInOwl:hasDbXref` or a SKOS predicate pointing to a single CURIE encoded as a string. Instead, it uses `rdfs:seeAlso` with a combination of non-standard CURIEs that are delimited by either commas or semicolons.

This notebook attempts to unpack and operationalize these cross-references.
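As a rough illustration of the delimiter problem described above, the following minimal sketch splits a raw `rdfs:seeAlso` value and extracts prefix-like pieces. It is not the SeMRA implementation: `split_see_also` and `parse_part` are hypothetical helpers, and the lowercasing and space-stripping rules are assumptions made only for this example.

```python
import re
from typing import List, Optional, Tuple


def split_see_also(line: str) -> List[str]:
    """Split a raw rdfs:seeAlso string on commas and semicolons."""
    return [part.strip() for part in re.split(r"[;,]", line) if part.strip()]


def parse_part(part: str) -> Optional[Tuple[str, str]]:
    """Return (prefix, identifier) if the part looks like 'Prefix: identifier'.

    Hypothetical normalization: the prefix is lowercased and spaces are
    removed from the identifier. Parts without a usable prefix return None.
    """
    prefix, sep, identifier = part.partition(":")
    prefix, identifier = prefix.strip(), identifier.strip()
    if not sep or not prefix or not identifier:
        return None
    return prefix.lower(), identifier.replace(" ", "")


raw = "ECACC: 92031916,92031916; COSMIC ID:910555"
for part in split_see_also(raw):
    print(repr(part), "->", parse_part(part))
```

Parts that carry no prefix, such as a bare `92031916`, come back as `None` here, since they cannot be resolved without extra context about which resource they belong to.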

See also:

- https://github.com/CLO-ontology/CLO/issues/103
- https://gist.github.com/cthoyt/a91ae12a94c7e1647e9d9d8fa61e80ce

In [1]:
import pandas as pd
from semra.api import (
    filter_mappings,
    get_many_to_many,
    keep_prefixes,
    summarize_prefixes,
)
from semra.io import get_sssom_df
from semra.sources.clo import get_clo_mappings

from biomappings.resources import (
    PREDICTIONS_HEADER,
    append_prediction_tuples,
    prediction_tuples_from_semra,
)

## Extraction and Processing

The following cell uses [this script](https://github.com/biopragmatics/semra/blob/main/src/semra/sources/clo.py) in [SeMRA](https://github.com/biopragmatics/semra) to extract cross-references from CLO.

In [2]:
mappings = get_clo_mappings()
len(mappings)

  0%|          | 0.00/44.4k [00:00<?, ?node/s]

CLO:0001584 invalid: atcc:COSMICID:910697 from line:
  ATCC: COSMIC ID:910697; ATCC CRL-7905,CRL-7905
CLO:0002336 unparsed: CCL-120 from line:
  CCL-120
CLO:0002406 invalid: dsmz:ACC360 from line:
  DSMZ: ACC 360,COSMIC ID:910568; DSMZ ACC 360
CLO:0002557 invalid: cldb:Cl847 from line:
  HyperCLDB: Cl847
CLO:0002593 unparsed: 92031916 from line:
  ECACC: 92031916,92031916; COSMIC ID:910555
CLO:0002899 invalid: dsmz:ACC67 from line:
  ACC 67
CLO:0002936 unparsed: 96100920 from line:
  96100920
CLO:0003506 invalid: atcc:CRL-2597-Discontinued from line:
  ATCC: CRL-2597 - Discontinued
CLO:0003591 invalid: atcc:CRL-8017A from line:
  ATCC: CRL-8017A
CLO:0003593 unparsed: :90112119 from line:
  : 90112119
CLO:0003627 invalid: dsmz:ACC301 from line:
  ACC 301
CLO:0003671 invalid: dsmz:ACC17 from line:
  ACC 17
CLO:0003672 invalid: dsmz:ACC346 from line:
  ACC 346
CLO:0003682 u

10925

## Prefix Summary

The table below this cell summarizes all of the prefixes appearing in cross-references extracted from CLO.

In [3]:
summarize_prefixes(mappings)

prefix,name,description
atcc,American Type Culture Collection,The American Type Culture Collection (ATCC) is...
bao,BioAssay Ontology,The BioAssay Ontology (BAO) describes chemical...
biosample,BioSample,The BioSample Database stores information abou...
bto,BRENDA Tissue Ontology,The Brenda tissue ontology is a structured con...
cellosaurus,Cellosaurus,The Cellosaurus is a knowledge resource on cel...
chembl.cell,ChEMBL database of bioactive drug-like small m...,Chemistry resources
cldb,Cell Line Database,The Cell Line Data Base (CLDB) is a reference ...
clo,Cell Line Ontology,The Cell Line Ontology is a community-based on...
cosmic.cell,COSMIC Cell Lines,"COSMIC, the Catalogue Of Somatic Mutations In ..."
dsmz,Deutsche Sammlung von Mikroorganismen und Zell...,The Leibniz Institute DSMZ is the most diverse...


Many of the resources cross-referenced by CLO aren't accessible in a structured format. Therefore, we can't programmatically look up names or synonyms. In some (but not all) cases, the resource has a site that can be used to manually examine information about a given record, but this still makes review very difficult.

There might be an automated way to get the list of all resources that can be used with `pyobo.get_name`, but until that's figured out, the following is a shortlist of resources we can follow up on easily.

In [4]:
DESIRED_PREFIXES = {"bto", "efo", "mesh", "cellosaurus", "obi", "clo"}

mappings = keep_prefixes(mappings, prefixes=DESIRED_PREFIXES, progress=False)
len(mappings)

812
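To make the effect of this filter concrete, here is a minimal sketch on plain CURIE pairs rather than SeMRA mapping objects. It assumes that a mapping is kept only when both its subject and object prefixes are in the allowed set, which would explain why `clo` itself appears in `DESIRED_PREFIXES`. `prefix_of` and `keep_pairs` are hypothetical helpers and do not reflect how `keep_prefixes` is actually written.

```python
from typing import Iterable, List, Set, Tuple

DESIRED_PREFIXES = {"bto", "efo", "mesh", "cellosaurus", "obi", "clo"}


def prefix_of(curie: str) -> str:
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def keep_pairs(
    pairs: Iterable[Tuple[str, str]], prefixes: Set[str]
) -> List[Tuple[str, str]]:
    """Keep (subject, object) pairs whose prefixes are both allowed (assumed rule)."""
    return [
        (s, o)
        for s, o in pairs
        if prefix_of(s) in prefixes and prefix_of(o) in prefixes
    ]


pairs = [
    ("clo:0001230", "cellosaurus:0045"),
    ("clo:0001230", "bto:0000007"),
    ("clo:0003591", "atcc:CRL-8017A"),  # dropped: atcc is not in the shortlist
]
print(keep_pairs(pairs, DESIRED_PREFIXES))
```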

## Identify Inconsistencies

The following cell identifies many-to-many mappings, e.g., when a given CLO term has multiple cross-references to entities in another semantic space, or vice versa.

In [5]:
m2m_mappings = get_many_to_many(mappings)
get_sssom_df(m2m_mappings, add_labels=True)

Preparing SSSOM:   0%|          | 0.00/25.0 [00:00<?, ?mapping/s]

Unnamed: 0,subject_id,subject_label,predicate_id,object_id,object_label,mapping_justification,mapping_set,mapping_set_version,mapping_set_license,mapping_set_confidence
0,clo:0001230,HEK293,oboInOwl:hasDbXref,cellosaurus:0045,HEK293,semapv:UnspecifiedMatching,clo,2.1.178,CC-BY-3.0,0.8
1,clo:0037237,293-derived cell,oboInOwl:hasDbXref,cellosaurus:0045,HEK293,semapv:UnspecifiedMatching,clo,2.1.178,CC-BY-3.0,0.8
2,clo:0007050,K 562 cell,oboInOwl:hasDbXref,cellosaurus:0004,K-562,semapv:UnspecifiedMatching,clo,2.1.178,CC-BY-3.0,0.8
3,clo:0007059,K-562 cell,oboInOwl:hasDbXref,cellosaurus:0004,K-562,semapv:UnspecifiedMatching,clo,2.1.178,CC-BY-3.0,0.8
4,clo:0037163,Ishikawa cell,oboInOwl:hasDbXref,cellosaurus:D199,Ishikawa 3-H-12,semapv:UnspecifiedMatching,clo,2.1.178,CC-BY-3.0,0.8
5,clo:0037230,Ishikawa 3-H-12 cell,oboInOwl:hasDbXref,cellosaurus:D199,Ishikawa 3-H-12,semapv:UnspecifiedMatching,clo,2.1.178,CC-BY-3.0,0.8
6,clo:0037300,BALL-1 cell,oboInOwl:hasDbXref,cellosaurus:1075,BALL-1,semapv:UnspecifiedMatching,clo,2.1.178,CC-BY-3.0,0.8
7,clo:0051004,RCB0256 cell,oboInOwl:hasDbXref,cellosaurus:1075,BALL-1,semapv:UnspecifiedMatching,clo,2.1.178,CC-BY-3.0,0.8
8,clo:0051005,RCB1882 cell,oboInOwl:hasDbXref,cellosaurus:1075,BALL-1,semapv:UnspecifiedMatching,clo,2.1.178,CC-BY-3.0,0.8
9,clo:0037372,HEK293T cell,oboInOwl:hasDbXref,cellosaurus:0063,HEK293T,semapv:UnspecifiedMatching,clo,2.1.178,CC-BY-3.0,0.8
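The idea behind this check can be sketched on plain CURIE pairs. The snippet below is an illustration under assumptions, not SeMRA's `get_many_to_many`: it flags a pair when its subject maps to more than one object in the same target prefix, or when its object is the target of more than one subject from the same source prefix. `find_many_to_many` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def find_many_to_many(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the pairs involved in a one-to-many or many-to-one relationship."""
    # (subject, object prefix) -> objects seen for that subject in that prefix
    forward: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    # (object, subject prefix) -> subjects seen for that object from that prefix
    backward: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for s, o in pairs:
        forward[s, o.split(":", 1)[0]].add(o)
        backward[o, s.split(":", 1)[0]].add(s)
    return [
        (s, o)
        for s, o in pairs
        if len(forward[s, o.split(":", 1)[0]]) > 1
        or len(backward[o, s.split(":", 1)[0]]) > 1
    ]


pairs = [
    ("clo:0001230", "cellosaurus:0045"),
    ("clo:0037237", "cellosaurus:0045"),  # second CLO term for the same target
    ("clo:0001008", "cellosaurus:0079"),  # one-to-one, not flagged
]
print(find_many_to_many(pairs))
```

In the table above, for instance, `cellosaurus:0045` is the target of both `clo:0001230` and `clo:0037237`, so both rows would be flagged by this rule.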


Add the short set of many-to-many cross-references, which may not be exact matches, to the predictions for triage.

In [7]:
append_prediction_tuples(
    prediction_tuples_from_semra(m2m_mappings, confidence=0.8), standardize=True
)

Standardizing mappings: 0.00mapping [00:00, ?mapping/s]

Removing curated from predicted:   0%|          | 0.00/40.5k [00:00<?, ?it/s]

Inspect the remaining mappings, i.e., those not in the many-to-many set.

In [8]:
rows = prediction_tuples_from_semra(
    filter_mappings(mappings, m2m_mappings, progress=False), confidence=0.8
)
pd.DataFrame(rows, columns=PREDICTIONS_HEADER)

can not look up name for efo:0002082
can not look up name for efo:0002823
can not look up name for efo:0002080
can not look up name for efo:0002336
can not look up name for efo:0001256
can not look up name for efo:0002387


Unnamed: 0,source prefix,source identifier,source name,relation,target prefix,target identifier,target name,type,confidence,source
0,clo,0001008,697 cell,skos:exactMatch,cellosaurus,0079,697,semapv:UnspecifiedMatching,0.8,clo
1,clo,0001088,143B cell,skos:exactMatch,cellosaurus,2270,143B,semapv:UnspecifiedMatching,0.8,clo
2,clo,0001230,HEK293,skos:exactMatch,bto,0000007,HEK-293 cell,semapv:UnspecifiedMatching,0.8,clo
3,clo,0001230,HEK293,skos:exactMatch,efo,0001182,HEK293,semapv:UnspecifiedMatching,0.8,clo
4,clo,0001234,293/CHE-Fc cell,skos:exactMatch,cellosaurus,6352,293/CHE-Fc,semapv:UnspecifiedMatching,0.8,clo
...,...,...,...,...,...,...,...,...,...,...
776,clo,0051547,RCB2084 cell,skos:exactMatch,cellosaurus,1736,TALL-1 [Human adult T-ALL],semapv:UnspecifiedMatching,0.8,clo
777,clo,0051567,RCB1902 cell,skos:exactMatch,cellosaurus,1289,HSC-4,semapv:UnspecifiedMatching,0.8,clo
778,clo,0051568,RCB1974 cell,skos:exactMatch,cellosaurus,1675,SAS,semapv:UnspecifiedMatching,0.8,clo
779,clo,0051569,RCB1975 cell,skos:exactMatch,cellosaurus,1288,HSC-3,semapv:UnspecifiedMatching,0.8,clo
