incorrect separators on water synonyms #29

Open
longemen3000 opened this issue May 3, 2021 · 7 comments

@longemen3000

longemen3000 commented May 3, 2021

What is the search string?
caustic soda liquid;aquafina;distilled water;hydrogen oxide (h2o);ultrexii ultrapure;
Which chemical in the database do you believe should be found?
It's water, but the separators here are wrong.

@CalebBell
Owner

Hi Andrés,
Like all unmaintained software, bits and pieces of the chemical-metadata repository have rotted away. I cannot get the InChI module in RDKit to work for me, and I am having issues building RDKit.
Thanks for letting me know about the issue. I'm afraid we may have to manually patch the file for now.
Sincerely,
Caleb

@CalebBell
Owner

Hi Andrés,
I found a version of RDKit which works on Linux, and it's on PyPI! One step closer to being able to update the database again. I think I actually need to port chemical-metadata to Python 3 as well.

Sincerely,
Caleb

@longemen3000
Author

longemen3000 commented Aug 1, 2021

What do you think of adding ';' as an additional separator? The main problem would be checking whether other names actually have ';' as part of their name.
Maybe adding:

line = line.replace(';','\t')

before this line

values = line.rstrip('\n').split('\t')

could solve the problem temporarily?

Also, I noticed (from a quick look, nothing exhaustive) that the synonyms separated by ';' are always at the end of the list.

Edit: the split on ';' must always be done after parsing the InChI, since InChI strings can themselves contain ';'.
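A minimal sketch of that idea in Python, splitting on ';' only in the synonym fields so the InChI column is left alone. The function name `parse_line` and the `n_fixed` parameter are hypothetical, and the number of leading non-synonym columns is an assumption that would need checking against the real file layout; this is not the library's actual loader.

```python
def parse_line(line, n_fixed=9):
    """Split one tab-separated database line into fixed fields and synonyms.

    n_fixed is an ASSUMED count of leading non-synonym columns (identifiers,
    formula, InChI, ...). Only the fields after it are split on ';'.
    """
    values = line.rstrip('\n').split('\t')
    fixed, raw_synonyms = values[:n_fixed], values[n_fixed:]
    synonyms = []
    for field in raw_synonyms:
        # A field may hold one name or several names joined by ';'.
        for name in field.split(';'):
            name = name.strip()
            if name:  # skip the empty piece left by a trailing ';'
                synonyms.append(name)
    return fixed, synonyms


# Toy layout for illustration only: one fixed column holding an InChI.
fixed, synonyms = parse_line(
    "InChI=1S/CH2.Co/h1H2;/q-1;+1\tname a;name b;\tname c\n", n_fixed=1
)
print(fixed)     # the InChI keeps its own ';' characters
print(synonyms)  # three separate names
```

Compared with replacing ';' by a tab on the whole line, this keeps the fixed columns untouched, but it still cannot tell apart a name that legitimately contains ';'.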

@CalebBell
Owner

Hi Andrés,
I have fixed a lot in the chemical-metadata repository, and generated a new inorganic file without this particular issue. I attached it.

The hard part is that the online data has changed so much that I can't even use a diff program to see what changed. Because of that, it's hard to replace the current file with the new one. Do you want to look at it?

Sincerely,
Caleb

Inorganic db.csv

@longemen3000
Author

longemen3000 commented Aug 2, 2021

Hi Caleb,

Given the old and new versions, I could program a manual diff to see what's changed. I'm going to start with this and let you know what I find.
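One way such a manual diff could be sketched is to compare the two files by a key column (for example InChI or formula) instead of line by line, so reordered or reworded rows do not drown out the real changes. The helper names `load_column` and `diff_by_column` are hypothetical, and the tab-separated layout and column index are assumptions about the files.

```python
def load_column(lines, col):
    """Collect the distinct values found in column `col` of tab-separated lines."""
    values_seen = set()
    for line in lines:
        values = line.rstrip('\n').split('\t')
        if len(values) > col:  # ignore short or malformed rows
            values_seen.add(values[col])
    return values_seen


def diff_by_column(old_lines, new_lines, col):
    """Return (only_in_old, only_in_new) for the chosen key column, sorted."""
    old = load_column(old_lines, col)
    new = load_column(new_lines, col)
    return sorted(old - new), sorted(new - old)
```

With real files, `old_lines` and `new_lines` could simply be open file objects, and `col` the (assumed) index of the InChI or formula column.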

@longemen3000
Author

longemen3000 commented Aug 2, 2021

From a preliminary parsing, there are more synonyms than in the old database:

Old

julia> CC.load_db!(:inorganic_old2)
[ Info: :inorganic_old2 arrow file not generated, processing...
syms_i = 6326 # number of synonyms
syms_unique  = 6325 # unique elements (there is one repeated element that I have yet to find)
(Arrow.Table with 153 rows, 9 columns, and schema:
.....

New

julia> CC.load_db!(:inorganic_new)
[ Info: :inorganic_new database file not found, downloading from https://github.com/CalebBell/chemicals/files/6912649/Inorganic.db.csv       
[ Info: :inorganic_new database file downloaded.
[ Info: :inorganic_new arrow file not generated, processing...
syms_i = 9461
syms_unique = 9438
(Arrow.Table with 164 rows, 9 columns, and schema:
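
The gap between syms_i and syms_unique in both runs means some synonyms occur more than once. A minimal sketch of how the repeated entries could be located, written in Python rather than the Julia used above; `repeated_items` is a hypothetical helper and takes any flat list of synonym strings.

```python
from collections import Counter


def repeated_items(items):
    """Return a dict mapping each item that occurs more than once to its count."""
    counts = Counter(items)
    return {item: n for item, n in counts.items() if n > 1}
```

The number of surplus entries, `sum(n - 1 for n in repeated_items(synonyms).values())`, should then equal the difference between the total and unique counts.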

comparing the differences, by InChI:

InChI contained in the old database, not present in the new database

  "InChI=1S/CH2.Co/h1H2;/q-1;+1"
  "InChI=1S/Cr.2H2Si/h;2*1H2"
  "InChI=1S/H4Si/h1H4"
  "InChI=1S/Na.H3O4P/c;1-5(2,3)4/h;(H3,1,2,3,4…
  "InChI=1S/F6Si.2H3N/c1-7(2,3,4,5)6;;/h;2*1H3…
  "InChI=1S/Bi.2ClH.2H/h;2*1H;;/q+2;;;;/p-2"
  "InChI=1S/Al.Na.2O.2H/q-1;+1;;;;"
  "InChI=1S/BrHO3.Cs/c2-1(3)4;/h(H,2,3,4);/q;+…
  "InChI=1S/2Na.H3O4P/c;;1-5(2,3)4/h;;(H3,1,2,…
  "InChI=1S/2BH2.Ti/h2*1H2;"
  "InChI=1S/F6Si.2Na/c1-7(2,3,4,5)6;;/q-2;2*+1"
  ""
  "InChI=1S/2Na.3H2O4S/c;;3*1-5(2,3)4/h;;3*(H2…

InChI contained in the new database, not present in the old database

  "InChI=1S/Cl2S2/c1-3-4-2"
  "InChI=1S/O.Pr"
  "InChI=1S/Bi.2ClH/h;2*1H/q+2;;/p-2"
  "InChI=1S/Cr.2Si"
  "InChI=1S/C32H16N8.Cu/c1-2-10-18-17(9-1)25-33-26(18)38-28-21-13-5-6-14-22(21)30(35-28)40-32-24-…
  "InChI=1S/Al.Na.2O/q-1;+1;;"
  "InChI=1S/Al.La.O"
  "InChI=1S/2B.Ti"
  "InChI=1S/Na.H3O4P/c;1-5(2,3)4/h;(H3,1,2,3,4)"
  "InChI=1S/C.Co/q-1;+1"
  "InChI=1S/3O.2Yb/q3*-2;2*+3"
  "InChI=1S/2HI.Sm/h2*1H;/q;;+2/p-2"
  "InChI=1S/3ClH.Ru/h3*1H;/q;;;+3/p-3"
  "InChI=1S/H2O/h1H2"
  "InChI=1S/2B.Zr"
  "InChI=1S/10CO.2Re/c10*1-2;;"
  "InChI=1S/H3NO.H2O4S/c1-2;1-5(2,3)4/h2H,1H2;(H2,1,2,3,4)"
  "InChI=1S/Li.H"
  "InChI=1S/Na.H2O4S/c;1-5(2,3)4/h;(H2,1,2,3,4)"
  "InChI=1S/C.2W/q+1;;-1"
  "InChI=1S/6Al.2O2Si.9O/c;;;;;;2*1-3-2;;;;;;;;;"
  "InChI=1S/B.Li.O"
  "InChI=1S/Cd.2FH/h;2*1H/q+2;;/p-2"

@longemen3000
Author

longemen3000 commented Aug 2, 2021

Doing the same thing with the formulas:

julia> setdiff(set_new,set_old)
Set{String} with 21 elements:
  "Cl3Ru"
  "O3Yb2"
  "H2O" # water is in the new inorganics database
  "AlLaO"
  "I2Sm"
  "B2Zr"
  "H3NaO4P"
  "HLi"
  "Al6O13Si2"
  "Cl2S2"
  "As2H12O3"
  "CW2"
  "C32H16CuN8"
  "OPr"
  "ClH2Tl"
  "H5NO5S"
  "C10O10Re2"
  "BLiO"
  "H2NaO4S"
  "BrH2Tl"
  "CdF2"
julia> setdiff(set_old,set_new)
Set{String} with 11 elements:
  "HNa2O4P"
  "ClTl"
  "H4Si"
  "H4Na2O12S3"
  "As2O3"
  "BrCsO3"
  "BrTl"
  "H2NaO4P"
  "F6H8N2Si"
  "F6Na2Si"
  "D2Se"
