In [1]:
import curies
import pandas as pd
import pystow

In [2]:
%%time
obo_converter = curies.get_obo_converter()

CPU times: user 185 ms, sys: 108 ms, total: 293 ms
Wall time: 917 ms


In [3]:
%%time
bioregistry_converter = curies.get_bioregistry_converter()

CPU times: user 6.73 s, sys: 63 ms, total: 6.79 s
Wall time: 6.8 s


# Disease Ontology SSSOM Demo

In [4]:
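# Pin the mappings file to a specific commit so the demo stays reproducible
commit = "faca4fc335f9a61902b9c47a1facd52a0d3d2f8b"
url = f"https://raw.githubusercontent.com/mapping-commons/disease-mappings/{commit}/mappings/doid.sssom.tsv"
# SSSOM TSV files begin with a "#"-prefixed metadata block, so skip those lines
df = pystow.ensure_csv("tmp", url=url, read_csv_kwargs=dict(comment="#"))
# Preview the subject, predicate, and object of the first five mappings
df.head()[["subject_id", "predicate_id", "object_id"]].values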

array([['DOID:8717', 'oboInOwl:hasDbXref', 'NCI:C50706'],
       ['DOID:8717', 'oboInOwl:hasDbXref', 'MESH:D003668'],
       ['DOID:8717', 'oboInOwl:hasDbXref', 'ICD9CM:707.0'],
       ['DOID:8717', 'oboInOwl:hasDbXref',
        'SNOMEDCT_US_2021_09_01:28103007'],
       ['DOID:8717', 'oboInOwl:hasDbXref', 'UMLS_CUI:C0011127']],
      dtype=object)
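
Before standardizing, it can help to see which prefixes a column actually uses. The following is a minimal sketch in plain Python, not part of `curies`: the helper `get_prefix` is a hypothetical name, and the input list simply repeats the five `object_id` values from the preview above.

```python
from collections import Counter

# The five object_id values shown in the preview above
object_ids = [
    "NCI:C50706",
    "MESH:D003668",
    "ICD9CM:707.0",
    "SNOMEDCT_US_2021_09_01:28103007",
    "UMLS_CUI:C0011127",
]


def get_prefix(curie: str) -> str:
    """Return the part of a CURIE before the first colon (hypothetical helper)."""
    prefix, _, _ = curie.partition(":")
    return prefix


# Tally how often each prefix appears
prefix_counts = Counter(get_prefix(curie) for curie in object_ids)
for prefix, count in sorted(prefix_counts.items()):
    print(prefix, count)
```

On the full DataFrame, the same helper could be applied with `df["object_id"].map(get_prefix).value_counts()`.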

In [5]:
obo_converter.pd_standardize_curie(df.copy(), column="object_id")

## Summary

Standardization was not necessary for 2 (0.0%), resulted in 0 updates (0.0%), and 34,522 failures (100.0%)  in column `object_id`. Here's a breakdown of the prefixes that weren't possible to standardize:

| prefix                 |   count | examples                                                                                                                                                               |
|:-----------------------|--------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| EFO                    |     131 | EFO:0000274, EFO:0001071, EFO:0001075, EFO:0001422, EFO:0004705                                                                                                        |
| GARD                   |    2030 | GARD:2562, GARD:5721, GARD:6291, GARD:7065, GARD:8378                                                                                                                  |
| ICD10CM                |    3666 | ICD10CM:A21.0, ICD10CM:C03, ICD10CM:K72, ICD10CM:K82.4, ICD10CM:N30.0                                                                                                  |
| ICD9CM                 |    2266 | ICD9CM:214.4, ICD9CM:232.4, ICD9CM:377.75, ICD9CM:428.2, ICD9CM:745.6                                                                                                  |
| ICDO                   |     361 | ICDO:8300/0, ICDO:8840/3, ICDO:9442/1, ICDO:9530/0, ICDO:9590/3                                                                                                        |
| KEGG                   |      41 | KEGG:05016, KEGG:05133, KEGG:05142, KEGG:05222, KEGG:05414                                                                                                             |
| MEDDRA                 |      41 | MEDDRA:10001229, MEDDRA:10015487, MEDDRA:10021312, MEDDRA:10059200, MEDDRA:10060740                                                                                    |
| MESH                   |    3847 | MESH:D002128, MESH:D005141, MESH:D009198, MESH:D011040, MESH:D017240                                                                                                   |
| NCI                    |    4788 | NCI:C26913, NCI:C27390, NCI:C27871, NCI:C40284, NCI:C6081                                                                                                              |
| OMIM                   |    5539 | OMIM:209700, OMIM:222300, OMIM:530000, OMIM:613021, OMIM:618224                                                                                                        |
| ORDO                   |    2023 | ORDO:139441, ORDO:2510, ORDO:255229, ORDO:420702, ORDO:48652                                                                                                           |
| SNOMEDCT_US_2020_03_01 |       6 | SNOMEDCT_US_2020_03_01:236818008, SNOMEDCT_US_2020_03_01:778024005, SNOMEDCT_US_2020_03_01:8757006                                                                     |
| SNOMEDCT_US_2020_09_01 |       1 | SNOMEDCT_US_2020_09_01:1112003                                                                                                                                         |
| SNOMEDCT_US_2021_07_31 |      10 | SNOMEDCT_US_2021_07_31:268180007, SNOMEDCT_US_2021_07_31:703536004, SNOMEDCT_US_2021_07_31:721311006, SNOMEDCT_US_2021_07_31:75931002                                  |
| SNOMEDCT_US_2021_09_01 |    5088 | SNOMEDCT_US_2021_09_01:111359004, SNOMEDCT_US_2021_09_01:155748004, SNOMEDCT_US_2021_09_01:238113006, SNOMEDCT_US_2021_09_01:38804009, SNOMEDCT_US_2021_09_01:92585006 |
| UMLS_CUI               |    6890 | UMLS_CUI:C0031347, UMLS_CUI:C0206724, UMLS_CUI:C0276007, UMLS_CUI:C0392492, UMLS_CUI:C1515285                                                                          |

## Suggestions

- NCI Suggestion.x7 - ncit
- MESH Suggestion.x7 - mesh
- ICD9CM Suggestion.x7 - icd9cm
- SNOMEDCT_US_2021_09_01 Suggestion.x7 - snomedct
- UMLS_CUI Suggestion.x7 - umls
- ICD10CM Suggestion.x7 - icd10cm
- ORDO Suggestion.x7 - orphanet.ordo
- GARD Suggestion.x7 - gard
- OMIM Suggestion.x7 - omim
- ICDO Suggestion.x7 - icdo
- EFO Suggestion.x7 - efo
- MEDDRA Suggestion.x7 - meddra
- KEGG Suggestion.x7 - kegg
- SNOMEDCT_US_2021_07_31 Suggestion.x7 - snomedct
- SNOMEDCT_US_2020_03_01 Suggestion.x7 - snomedct
- SNOMEDCT_US_2020_09_01 Suggestion.x7 - snomedct
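
One way to act on the suggestions is to rewrite the prefixes by hand. The sketch below reads each suggestion line as "observed prefix, then suggested prefix", which is an assumption about the report format. `SUGGESTED_PREFIXES` and `apply_suggestion` are hypothetical names, and only a few of the listed pairs are transcribed.

```python
# Hypothetical remapping, transcribed from a few of the suggestion lines above
SUGGESTED_PREFIXES = {
    "NCI": "ncit",
    "MESH": "mesh",
    "ORDO": "orphanet.ordo",
    "UMLS_CUI": "umls",
    "SNOMEDCT_US_2021_09_01": "snomedct",
}


def apply_suggestion(curie: str) -> str:
    """Swap the prefix of a CURIE if a suggestion exists, else return it unchanged."""
    prefix, sep, identifier = curie.partition(":")
    if not sep:
        return curie  # no colon, so there is no prefix to remap
    new_prefix = SUGGESTED_PREFIXES.get(prefix, prefix)
    return f"{new_prefix}:{identifier}"


examples = ["NCI:C50706", "SNOMEDCT_US_2021_09_01:28103007", "DOID:8717"]
remapped = [apply_suggestion(curie) for curie in examples]
print(remapped)
```

Applied to the column, this would look like `df["object_id"].map(apply_suggestion)`. Note that collapsing versioned SNOMED CT prefixes into a single prefix discards the release date encoded in the original prefix.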


In [6]:
bioregistry_converter

<curies.api.Converter at 0x1475df390>

In [7]:
bioregistry_converter.pd_standardize_curie(df.copy(), column="object_id")

Standardization was successfully applied to all 36,730 CURIEs in column `object_id`.
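
The summaries in this notebook sort every value into one of three outcomes: no change needed, updated, or failed. The sketch below shows one way such bookkeeping could be done. It is not the `curies` implementation; the synonym table `PREFERRED` and the function `classify` are hypothetical, and the choice of preferred prefixes is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical synonym table: every known spelling -> preferred prefix
PREFERRED = {"mesh": "mesh", "MESH": "mesh", "ncit": "ncit", "NCI": "ncit"}


def classify(curie: str) -> str:
    """Label a CURIE as 'unchanged', 'updated', or 'failed'."""
    prefix, sep, _ = curie.partition(":")
    if not sep or prefix not in PREFERRED:
        return "failed"  # unparseable, or the prefix is unknown to the table
    return "unchanged" if PREFERRED[prefix] == prefix else "updated"


curies_to_check = ["mesh:D003668", "MESH:D003668", "NCI:C50706", "GARD:2562"]
outcomes = Counter(classify(curie) for curie in curies_to_check)
total = len(curies_to_check)
for label in ("unchanged", "updated", "failed"):
    print(f"{label}: {outcomes[label]} ({outcomes[label] / total:.1%})")
```

In this toy table, a converter that knows more synonyms moves values out of the "failed" bucket, which is one plausible reading of why the two converters above report such different results on the same column.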

# Mixed CURIEs and URIs Demo

In [8]:
mixed_df = pd.DataFrame(
    [
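        ("chebi:1",),  # CURIE with a lowercase prefix
        ("http://purl.obolibrary.org/obo/CHEBI_2",),  # URI rather than a CURIE
        ("CHEBI:3",),  # CURIE with an uppercase prefix
        ("not_a_curie",),  # neither a CURIE nor a URI
        (None,),  # missing value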
    ]
)
bioregistry_converter.pd_standardize_curie(mixed_df, column=0)

## Summary

Standardization was not necessary for 1 (20.0%), resulted in 1 updates (20.0%), and 2 failures (40.0%)  in column `0`. Here's a breakdown of the prefixes that weren't possible to standardize:

| prefix      |   count | examples                               |
|:------------|--------:|:---------------------------------------|
| http        |       1 | http://purl.obolibrary.org/obo/CHEBI_2 |
| not_a_curie |       1 | not_a_curie                            |

## Suggestions

- http Suggestion.x2
- not_a_curie Suggestion.x3
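
In the report above, the URI is listed under the prefix `http`, meaning it was read as a CURIE and failed. When a column mixes both forms, one option is to triage each value before standardizing. The following is a minimal sketch in plain Python with a hypothetical one-entry registry. The names `URI_PREFIX`, `PREFIX_SYNONYMS`, and `to_curie` are illustrative, and treating `CHEBI` as the preferred prefix is an assumption of this sketch.

```python
from typing import Optional

# Hypothetical one-entry registry; the preferred prefix "CHEBI" is an assumption
URI_PREFIX = "http://purl.obolibrary.org/obo/CHEBI_"
PREFIX_SYNONYMS = {"chebi": "CHEBI", "CHEBI": "CHEBI"}


def to_curie(value: Optional[str]) -> Optional[str]:
    """Return a standardized CURIE for a URI or CURIE, or None if neither applies."""
    if value is None:
        return None
    if value.startswith(URI_PREFIX):
        # Compress a URI by swapping its URI prefix for the CURIE prefix
        return "CHEBI:" + value[len(URI_PREFIX):]
    prefix, sep, identifier = value.partition(":")
    if sep and prefix in PREFIX_SYNONYMS:
        # Standardize a CURIE by replacing a synonym with the preferred prefix
        return f"{PREFIX_SYNONYMS[prefix]}:{identifier}"
    return None


values = ["chebi:1", "http://purl.obolibrary.org/obo/CHEBI_2", "CHEBI:3", "not_a_curie", None]
results = [to_curie(value) for value in values]
print(results)
```

Returning `None` for anything unrecognized keeps failures explicit, so they can be counted or filtered afterwards instead of passing through silently.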
