In [1]:
import curies
import pandas as pd
import itertools as itt
import pystow

In [2]:
%%time
obo_converter = curies.get_obo_converter()

CPU times: user 30 ms, sys: 4.59 ms, total: 34.6 ms
Wall time: 646 ms


In [3]:
%%time
bioregistry_converter = curies.get_bioregistry_converter()

CPU times: user 7.08 s, sys: 69.5 ms, total: 7.15 s
Wall time: 7.22 s


# Disease Ontology SSSOM Demo

In [4]:
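# Pin to a specific commit so the demo is reproducible
commit = "faca4fc335f9a61902b9c47a1facd52a0d3d2f8b"
url = f"https://raw.githubusercontent.com/mapping-commons/disease-mappings/{commit}/mappings/doid.sssom.tsv"
# SSSOM files begin with a "#"-prefixed metadata header, so skip those lines
df = pystow.ensure_csv("tmp", url=url, read_csv_kwargs=dict(comment="#"))
# Preview the first few mappings
df.head()[["subject_id", "predicate_id", "object_id"]].values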

array([['DOID:8717', 'oboInOwl:hasDbXref', 'NCI:C50706'],
       ['DOID:8717', 'oboInOwl:hasDbXref', 'MESH:D003668'],
       ['DOID:8717', 'oboInOwl:hasDbXref', 'ICD9CM:707.0'],
       ['DOID:8717', 'oboInOwl:hasDbXref',
        'SNOMEDCT_US_2021_09_01:28103007'],
       ['DOID:8717', 'oboInOwl:hasDbXref', 'UMLS_CUI:C0011127']],
      dtype=object)

In [5]:
obo_converter.pd_standardize_curie(df.copy(), column="object_id")

## Summary

Standardization was not necessary for 2 (0.0%), resulted in 0 updates (0.0%), and 34,522 failures (100.0%)  in column `object_id`. Here's a breakdown of the prefixes that weren't possible to standardize:

| prefix                 |   count | examples                                                                                                                                                                |
|:-----------------------|--------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| EFO                    |     131 | EFO:0000195, EFO:0000612, EFO:0000729, EFO:0003914, EFO:0004222                                                                                                         |
| GARD                   |    2030 | GARD:1224, GARD:4771, GARD:6464, GARD:7179, GARD:7475                                                                                                                   |
| ICD10CM                |    3666 | ICD10CM:E75.0, ICD10CM:H35.42, ICD10CM:I75, ICD10CM:K59.39, ICD10CM:Q38.1                                                                                               |
| ICD9CM                 |    2266 | ICD9CM:335.22, ICD9CM:368.51, ICD9CM:375.15, ICD9CM:618.8, ICD9CM:622.2                                                                                                 |
| ICDO                   |     361 | ICDO:8050/3, ICDO:8051/3, ICDO:8290/0, ICDO:8470/3, ICDO:8920/3                                                                                                         |
| KEGG                   |      41 | KEGG:05210, KEGG:05219, KEGG:05221, KEGG:05310, KEGG:H02296                                                                                                             |
| MEDDRA                 |      41 | MEDDRA:10001229, MEDDRA:10036794, MEDDRA:10066387, MEDDRA:10068842                                                                                                      |
| MESH                   |    3847 | MESH:C562745, MESH:D003882, MESH:D008288, MESH:D009072, MESH:D015270                                                                                                    |
| NCI                    |    4788 | NCI:C27472, NCI:C3406, NCI:C39860, NCI:C4296, NCI:C84886                                                                                                                |
| OMIM                   |    5539 | OMIM:154800, OMIM:229050, OMIM:255300, OMIM:614465, OMIM:615725                                                                                                         |
| ORDO                   |    2023 | ORDO:2554, ORDO:295195, ORDO:397593, ORDO:733, ORDO:79257                                                                                                               |
| SNOMEDCT_US_2020_03_01 |       6 | SNOMEDCT_US_2020_03_01:236818008, SNOMEDCT_US_2020_03_01:254828009, SNOMEDCT_US_2020_03_01:52564001                                                                     |
| SNOMEDCT_US_2020_09_01 |       1 | SNOMEDCT_US_2020_09_01:1112003                                                                                                                                          |
| SNOMEDCT_US_2021_07_31 |      10 | SNOMEDCT_US_2021_07_31:205329008, SNOMEDCT_US_2021_07_31:268180007, SNOMEDCT_US_2021_07_31:75931002, SNOMEDCT_US_2021_07_31:785879009, SNOMEDCT_US_2021_07_31:86249007  |
| SNOMEDCT_US_2021_09_01 |    5088 | SNOMEDCT_US_2021_09_01:128925001, SNOMEDCT_US_2021_09_01:254916002, SNOMEDCT_US_2021_09_01:267572005, SNOMEDCT_US_2021_09_01:389261002, SNOMEDCT_US_2021_09_01:94069006 |
| UMLS_CUI               |    6890 | UMLS_CUI:C0085574, UMLS_CUI:C0153212, UMLS_CUI:C0282492, UMLS_CUI:C1332356, UMLS_CUI:C1838329                                                                           |

## Suggestions

- NCI appears in Bioregistry under [`ncit`](https://bioregistry.io/ncit). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- MESH appears in Bioregistry under [`mesh`](https://bioregistry.io/mesh). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- ICD9CM appears in Bioregistry under [`icd9cm`](https://bioregistry.io/icd9cm). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- SNOMEDCT_US_2021_09_01 appears in Bioregistry under [`snomedct`](https://bioregistry.io/snomedct). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- UMLS_CUI appears in Bioregistry under [`umls`](https://bioregistry.io/umls). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- ICD10CM appears in Bioregistry under [`icd10cm`](https://bioregistry.io/icd10cm). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- ORDO appears in Bioregistry under [`orphanet.ordo`](https://bioregistry.io/orphanet.ordo). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- GARD appears in Bioregistry under [`gard`](https://bioregistry.io/gard). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- OMIM appears in Bioregistry under [`omim`](https://bioregistry.io/omim). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- ICDO appears in Bioregistry under [`icdo`](https://bioregistry.io/icdo). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- EFO appears in Bioregistry under [`efo`](https://bioregistry.io/efo). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- MEDDRA appears in Bioregistry under [`meddra`](https://bioregistry.io/meddra). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- KEGG appears in Bioregistry under [`kegg`](https://bioregistry.io/kegg). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- SNOMEDCT_US_2021_07_31 appears in Bioregistry under [`snomedct`](https://bioregistry.io/snomedct). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- SNOMEDCT_US_2020_03_01 appears in Bioregistry under [`snomedct`](https://bioregistry.io/snomedct). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).
- SNOMEDCT_US_2020_09_01 appears in Bioregistry under [`snomedct`](https://bioregistry.io/snomedct). Consider chaining your converter with the Bioregistry using [`curies.chain()`](https://curies.readthedocs.io/en/latest/api/curies.chain.html).


In [6]:
bioregistry_converter

<curies.api.Converter at 0x16b7d8a90>

In [7]:
bioregistry_converter.pd_standardize_curie(df.copy(), column="object_id")

Standardization was successfully applied to all 36,730 CURIEs in column `object_id`.

# Mixed CURIEs and URIs Demo

In [8]:
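mixed_df = pd.DataFrame(
    [
        ("chebi:1",),  # CURIE that already uses the standard prefix
        ("http://purl.obolibrary.org/obo/CHEBI_2",),  # URI, not a CURIE
        ("CHEBI:3",),  # CURIE with a prefix synonym
        ("not_a_curie",),  # plain string without a prefix
        (None,),  # missing value
    ]
)
bioregistry_converter.pd_standardize_curie(mixed_df, column=0)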

## Summary

Standardization was not necessary for 1 (20.0%), resulted in 1 updates (20.0%), and 2 failures (40.0%)  in column `0`. Here's a breakdown of the prefixes that weren't possible to standardize:

| prefix      |   count | examples                               |
|:------------|--------:|:---------------------------------------|
| http        |       1 | http://purl.obolibrary.org/obo/CHEBI_2 |
| not_a_curie |       1 | not_a_curie                            |

## Suggestions

- http entries are not CURIEs; try compressing your data first.
- not_a_curie is not a valid CURIE
