missing ChEBI records? #83

Closed
andrewsu opened this issue Sep 24, 2020 · 12 comments · Fixed by #106

@andrewsu
Member

I can't find macrolide in mychem, and I wonder if it indicates that our ChEBI import is somehow incomplete? Details below (using imatinib as a positive control):

ChEBI records

MyChem queries:

@andrewsu andrewsu added the bug label Sep 24, 2020
@kevinxin90
Contributor

kevinxin90 commented Oct 28, 2020

@andrewsu
For the ChEBI dumper in MyChem, we use the SDF file provided by ChEBI. As stated on the ChEBI website, it contains all the chemical structures and associated information, but it excludes any ontological information.
Macrolide is an ontology term within ChEBI that represents a class of chemicals. That's why it's not included in the SDF file.
I'm fine with including all the ontology info from ChEBI in MyChem. We would just need to ingest a different file from ChEBI.

@andrewsu andrewsu added enhancement and removed bug labels Oct 29, 2020
@andrewsu
Member Author

got it, thanks for looking into this, Kevin. I leave it to you to prioritize, but I do think having the ontological nodes in mychem would be valuable...

@andrewsu
Member Author

Adding a note that mychem also does not include an entry for CHEBI:31859 (which is the entry for the racemic mixture of modafinil). We do have entries for the separate right- and left-handed versions of this molecule, armodafinil (CHEBI:77590) and (S)-modafinil (CHEBI:77591), and also for the nonchiral version, 2-[(diphenylmethyl)sulfinyl]acetamide (CHEBI:77585).

MyChem links:
http://mychem.info/v1/query?q=chebi.id:CHEBI\:31859 (no record)
http://mychem.info/v1/query?q=chebi.id:CHEBI\:77585
http://mychem.info/v1/query?q=chebi.id:CHEBI\:77590
http://mychem.info/v1/query?q=chebi.id:CHEBI\:77591

I would expect MyChem to have records for all four of these entries (we have three out of four), and to capture the ontological relationships in ChEBI (specifically the ones listed in the screenshots below):

[four screenshots of ChEBI ontology relationships]

more info from Chris Bizon at https://ncatstranslator.slack.com/archives/C01LQKY499A/p1623775157021300

@erikyao
Contributor

erikyao commented Jun 29, 2021

Aim

Add another dumper/parser/uploader set to read ChEBI ontology data.

Source Files

As indicated on the ChEBI website, there are 3 versions of the ontology files:

  • LITE - Only ids, name, subsets and relationships are available. Small size if you are interested in our ontology only
  • CORE - As above plus chemical data (mass, charge, formula) and structures (inchis, inchikeys, smiles)
  • FULL - As above plus name synonyms and manually added cross-reference.

each in two formats:

  • obo
  • owl

The LITE version should have everything we need.

Implementation

Refer to the code of the mondo plugin in mydisease.info, which

  • reads obo files into relationship networks (networkx graph objects)
  • uses the obonet library to do so

Another library, pronto, can also read obo or owl files, but it is less convenient than obonet for reading relationships.

@erikyao
Contributor

erikyao commented Jul 7, 2021

How obonet works

Our Mondo parser uses the obonet library, which, on receiving an obo ontology file,

  • parses each entity into a node, and
  • connects the entity nodes into a graph by their relationships

However, each node stores its is_a relationships in a separate field alongside its relationship field. E.g.

'MONDO:0016575': {
    'is_a': ['MONDO:0002254', 'MONDO:0005087', 'MONDO:0005308'],
    'relationship': ['excluded_subClassOf MONDO:0018395', 'has_modifier MONDO:0021136'],
    ...  # other fields omitted
}

'CHEBI:77590': {
    'is_a': ['CHEBI:77585'],
    'relationship': ['is_enantiomer_of CHEBI:77591', 'has_role CHEBI:35337', 'has_role CHEBI:77567'],
    ...  # other fields omitted
}

It looks like the is_a relationship is used more often and was therefore given its own field. Dr. Chris Mungall has a related comment here.

However, we don't have to worry about combining the is_a and relationship fields manually. The graphs returned by obonet treat both as edges. E.g.

import obonet

graph = obonet.read_obo("chebi_lite.obo")  # one node per term, one edge per relationship
print(list(graph.successors("CHEBI:77590")))
# Output: ['CHEBI:77585', 'CHEBI:77591', 'CHEBI:35337', 'CHEBI:77567']

We can see that the successors of CHEBI:77590 are the union of its is_a and relationship entities.

A possible problem in our Mondo parser

As shown in https://github.com/biothings/mydisease.info/blob/master/src/plugins/mondo/parser.py, a Mondo document has:

  • a parents field which includes ONLY its is_a nodes
  • a children field, an ancestors field, and a descendants field, each of which includes the nodes reachable from the UNION of its is_a and relationship nodes

From @andrewsu's comments above, I think all relationships (i.e. the union of is_a and relationship) should be returned in the documents. Nonetheless, if we intend to consider only is_a relationships, we should build an is_a subgraph before computing each entity node's successors/predecessors/descendants/ancestors.
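As a rough illustration of the is_a-only option, here is a minimal standard-library sketch. It assumes node dicts shaped like the representations shown earlier (an optional 'is_a' list, with 'relationship' ignored); the function name `is_a_hierarchy` and the `TOY:` ids are made up for the example, and this is not the code of the Mondo parser.

```python
from collections import defaultdict


def is_a_hierarchy(nodes):
    """Compute parents/children/ancestors/descendants from is_a edges only.

    `nodes` maps an ontology id to a dict with an optional 'is_a' list.
    Any 'relationship' field is deliberately ignored.
    """
    parents = {node_id: set(attrs.get("is_a", [])) for node_id, attrs in nodes.items()}
    children = defaultdict(set)
    for node_id, parent_ids in parents.items():
        for parent_id in parent_ids:
            children[parent_id].add(node_id)

    def reachable(start, neighbors):
        # Iterative depth-first walk; `seen` also guards against cycles.
        seen, stack = set(), list(neighbors.get(start, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(neighbors.get(current, ()))
        return seen

    return {
        node_id: {
            "parents": sorted(parents[node_id]),
            "children": sorted(children.get(node_id, ())),
            "ancestors": sorted(reachable(node_id, parents)),
            "descendants": sorted(reachable(node_id, children)),
        }
        for node_id in nodes
    }


# Made-up toy ontology (hypothetical ids, not real ChEBI or Mondo terms).
toy_nodes = {
    "TOY:root": {},
    "TOY:mid": {"is_a": ["TOY:root"]},
    "TOY:leaf": {"is_a": ["TOY:mid"], "relationship": ["has_role TOY:other"]},
    "TOY:other": {},
}
hierarchy = is_a_hierarchy(toy_nodes)
print(hierarchy["TOY:leaf"]["ancestors"])  # ['TOY:mid', 'TOY:root']
```

Because the has_role edge is never followed, TOY:other stays out of the ancestors of TOY:leaf, which is the behaviour an is_a subgraph would give.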

@andrewsu
Member Author

andrewsu commented Jul 8, 2021

I think parents, children, ancestors, and descendants should be strictly based on is_a nodes. is_a indicates subclass relationships, and those four relationship types (parents, children, ancestors, and descendants) should only be based on subclassing.

Having said that, the other types of relationships would also be very useful to capture. I leave it to you to decide how to model the object, but my guess is that it should look something like this:

{
   "id": "CHEBI:77590",
   "parents": [ "CHEBI:77585" ],
   "children": [ ... ],
   "ancestors": [ "CHEBI:77585", ... ],
   "descendants": [ ...],
   "relationships": {
      "is_enantiomer_of": [ "CHEBI:77591" ],
      "has_role": [ "CHEBI:35337", "CHEBI:77567" ]
   }
}
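One possible way to derive the "relationships" object above from the node representation is sketched below. It assumes each entry in the node's 'relationship' list is a "<type> <target id>" string, as in the CHEBI:77590 example earlier in the thread; the helper name `group_relationships` is hypothetical.

```python
from collections import defaultdict


def group_relationships(relationship_values):
    """Group 'type target' strings into {type: [target, ...]}."""
    grouped = defaultdict(list)
    for value in relationship_values:
        # Split on the first run of whitespace: "<relationship type> <target id>".
        rel_type, target_id = value.split(maxsplit=1)
        grouped[rel_type].append(target_id.strip())
    return dict(grouped)


# Node fields copied from the CHEBI:77590 example earlier in the thread.
node = {
    "is_a": ["CHEBI:77585"],
    "relationship": [
        "is_enantiomer_of CHEBI:77591",
        "has_role CHEBI:35337",
        "has_role CHEBI:77567",
    ],
}
relationships = group_relationships(node["relationship"])
print(relationships)
# {'is_enantiomer_of': ['CHEBI:77591'], 'has_role': ['CHEBI:35337', 'CHEBI:77567']}
```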

@erikyao
Contributor

erikyao commented Jul 8, 2021

> I can't find macrolide in mychem, and I wonder if it indicates that our ChEBI import is somehow incomplete? Details below (using imatinib as a positive control):
>
> ChEBI records
>
> MyChem queries:

@andrewsu CHEBI:25106 (macrolide) is not included as an individual record in the ChEBI SDF file (although it does appear once as the "Secondary ChEBI ID" to CHEBI:3112 (biperiden)).

CHEBI:25106 has a record in the ChEBI ontology file, as below:

{'name': 'macrolide',
 'subset': ['3_STAR'],
 'def': '"A macrocyclic lactone with a ring of twelve or more members derived from a polyketide." []',
 'is_a': ['CHEBI:26188', 'CHEBI:63944']}

> got it, thanks for looking into this, Kevin. I leave it to you to prioritize, but I do think having the ontological nodes in mychem would be valuable...

@andrewsu @newgene We need to discuss in detail whether to include ontological nodes in mychem. Technically it's feasible.

erikyao added a commit to erikyao/mychem.info that referenced this issue Jul 16, 2021
erikyao added a commit to erikyao/mychem.info that referenced this issue Jul 17, 2021
@erikyao
Contributor

erikyao commented Jul 17, 2021

ChEBI ids in rel201 SDF and obo files

The compound file, rel201/SDF/ChEBI_complete.sdf.gz, has 133,779 ChEBI ids (call this set S1), while the ontology file, rel201/ontology/chebi_lite.obo.gz, has 146,183 (S2). However, S1 is not a subset of S2. Their set difference is

S1 \ S2 = {"CHEBI:156068"}

which means "CHEBI:156068" will be the only document that has no ontology fields in our collection.

The set difference S2 \ S1 contains 12,404 ChEBI ids. They will have no chemical/compound structure fields.
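A comparison like this could be reproduced roughly as follows. The sketch assumes that each SDF record carries its id on the line after a `> <ChEBI ID>` tag and that each obo term stanza has an `id: CHEBI:...` line; both patterns, the function names, and the toy inputs are assumptions for illustration, not the script used for the counts above.

```python
import re


def chebi_ids_from_sdf(sdf_text):
    # Assumed layout: a "> <ChEBI ID>" tag line followed by the id on the next line.
    return set(re.findall(r"^> <ChEBI ID>\n(CHEBI:\d+)$", sdf_text, flags=re.MULTILINE))


def chebi_ids_from_obo(obo_text):
    # Only lines starting with "id:" match, so "alt_id:" lines and
    # [Typedef] ids such as "has_role" are skipped.
    return set(re.findall(r"^id: (CHEBI:\d+)$", obo_text, flags=re.MULTILINE))


# Made-up toy inputs (not real ChEBI records).
sdf_text = (
    "> <ChEBI ID>\nCHEBI:1\n\n> <ChEBI Name>\ntoy compound A\n\n$$$$\n"
    "> <ChEBI ID>\nCHEBI:2\n\n$$$$\n"
)
obo_text = (
    "[Term]\nid: CHEBI:2\nname: toy compound B\nalt_id: CHEBI:9\n\n"
    "[Term]\nid: CHEBI:3\nname: toy class\nis_a: CHEBI:2\n\n"
    "[Typedef]\nid: has_role\n"
)

s1 = chebi_ids_from_sdf(sdf_text)
s2 = chebi_ids_from_obo(obo_text)
only_in_sdf = s1 - s2  # documents that would have no ontology fields
only_in_obo = s2 - s1  # documents that would have no structure fields
print(sorted(only_in_sdf), sorted(only_in_obo))  # ['CHEBI:1'] ['CHEBI:3']
```

For the real release files, the same functions would be applied to the decompressed text (for example, read via `gzip.open(path, "rt")`).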

@andrewsu
Member Author

andrewsu commented Aug 4, 2021

@erikyao since http://mychem.info/v1/query?q=chebi.id:CHEBI\:25106 still returns zero hits, I'm guessing this hasn't yet been deployed? What is involved in doing that, and who will handle that? (As a general rule, I think we should merge and deploy before closing an issue...)

@erikyao
Contributor

erikyao commented Aug 5, 2021

The PR to fix this issue was merged to the code base. Not deployed yet.

@erikyao erikyao reopened this Aug 5, 2021
@erikyao
Contributor

erikyao commented Aug 5, 2021

Hi @ravila4, please let me know when you have fixed the PubChem data source. We can deploy the fixes together. Thank you!

@ravila4
Contributor

ravila4 commented Sep 7, 2021

@erikyao Can we close this issue? It looks like the queries are working now.

@ravila4 ravila4 closed this as completed Sep 15, 2021