Add dm.read_records in io module #130

Closed
MichelML opened this issue Sep 19, 2022 · 8 comments
Labels: enhancement (New feature or request), low-priority

Comments

@MichelML
Contributor

MichelML commented Sep 19, 2022

This would simplify working with REST APIs whose JSON response contains a list of molecules. A few example REST APIs:

  • Molport
  • Chemspace
  • Mcule
  • CDD Vault
  • Dotmatics
  • PubChem
  • Etc...
MichelML added the enhancement and low-priority labels Sep 19, 2022
@zhu0619
Contributor

zhu0619 commented Sep 19, 2022

Similarly, add dm.read_yaml for convenience.

@MichelML
Contributor Author

@zhu0619 can you log this in a separate issue, with some context explaining the use cases for read_yaml?

@hadim
Contributor

hadim commented Sep 26, 2022

Having to maintain those API layers could be quite time-consuming as they tend to change over time.

Also, from my experience working with some of them, it can be tricky to build a unified datamol API given how much the returned data differs between the providers above (a unified API would be nice, but it is not necessarily the important point here).

@MichelML
Contributor Author

Sorry, maybe the examples were overwhelming here. In any case, I think I can make a PR for this myself.

The idea is simply to replicate this case:

>>> data = [{'col_1': 3, 'col_2': 'a'},
...         {'col_1': 2, 'col_2': 'b'},
...         {'col_1': 1, 'col_2': 'c'},
...         {'col_1': 0, 'col_2': 'd'}]
>>> pd.DataFrame.from_records(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Since this case matches what most REST APIs return (a list of dicts), all that remains is to add the logic for a smiles_column param, as in the other io methods.

There is definitely a way to ship a simple v1 while clearly documenting its limitations, imho.

In any case, this is neither urgent nor a priority, and it is something I can add myself when I need it for a production case.

@hadim
Contributor

hadim commented Sep 26, 2022

OK, and sorry, I think I misunderstood here xD

I create DataFrames from lists of dicts all the time, and you can simply do pd.DataFrame(list_of_dict). That said, I am not sure what logic you want to add to datamol for this. Usually a simple workflow is:

df = pd.DataFrame(list_of_dict)
df["mol"] = df[smiles_column].apply(dm.to_mol)

But feel free to post more examples or open a PR if you have something else in mind.

@MichelML
Contributor Author

MichelML commented Sep 26, 2022

That's exactly what I want, as a one-liner @hadim 😄, which is why I say it's not a priority. It's just convenient, and another case datamol can handle!

df = dm.from_records(list_of_dict, smiles_column="smiles")

No more or less complex than dm.read_csv, which is already implemented here: https://github.com/datamol-org/datamol/blob/main/datamol/io.py#L27
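
For illustration, the proposed one-liner could be sketched roughly as follows. This is a minimal sketch, not datamol's actual implementation: the function name `from_records`, the `mol_column` parameter, and the injectable `to_mol` argument are assumptions made for this example. Only `pd.DataFrame.from_records` and `dm.to_mol` come from the discussion above.

```python
import pandas as pd


def from_records(records, smiles_column=None, mol_column="mol", to_mol=None):
    """Hypothetical helper: build a DataFrame from a list of dicts and
    optionally add a column of molecules parsed from a SMILES column.

    `mol_column` and `to_mol` are illustrative parameters, not part of
    any existing datamol API.
    """
    df = pd.DataFrame.from_records(records)

    if smiles_column is not None:
        if to_mol is None:
            # Imported lazily so the helper still works without datamol
            # when a custom parser is supplied.
            import datamol as dm

            to_mol = dm.to_mol
        # Same step as the two-line workflow quoted earlier in the thread.
        df[mol_column] = df[smiles_column].apply(to_mol)

    return df
```

With datamol installed, `from_records(list_of_dict, smiles_column="smiles")` would then behave like the two-line pandas workflow shown earlier, adding a `mol` column next to the original fields.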

@maclandrol
Member

Note that in many cases, you might want to use pd.json_normalize instead.
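
As a rough illustration of why `pd.json_normalize` can be the better fit: API responses often nest the SMILES string inside a sub-object, and `json_normalize` flattens nested dicts into dot-separated column names. The record layout below is made up for this example and does not correspond to any particular provider.

```python
import pandas as pd

# Hypothetical API response: each record nests its fields in sub-dicts.
records = [
    {"id": 1, "structure": {"smiles": "CCO"}, "vendor": {"name": "A"}},
    {"id": 2, "structure": {"smiles": "c1ccccc1"}, "vendor": {"name": "B"}},
]

# Nested keys become flat columns such as "structure.smiles".
df = pd.json_normalize(records)

# The flattened column can then be used as the SMILES column.
smiles = df["structure.smiles"].tolist()
```

By contrast, `pd.DataFrame(records)` would keep each nested dict as a single object in a `structure` column, so there would be no plain SMILES column to parse.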

@hadim
Contributor

hadim commented Apr 17, 2023

Closing here. It's not clear to me whether we need this in datamol.

Please re-open if needed.

hadim closed this as completed Apr 17, 2023