Add dm.read_records in io module #130

MichelML · 2022-09-19T12:54:17Z

This would simplify working with REST APIs whose JSON response contains a list of molecules. A few examples of such APIs:

Molport
Chemspace
Mcule
CDD Vault
Dotmatics
PubChem
Etc...

zhu0619 · 2022-09-19T13:07:30Z

Similarly, add dm.read_yaml for convenience.

MichelML · 2022-09-19T13:38:48Z

@zhu0619 can you log this in a separate issue, with context explaining the use cases for read_yaml?

hadim · 2022-09-26T12:58:33Z

Having to maintain those API layers could be quite time-consuming as they tend to change over time.

Also, from experience working with some of them, it can be tricky to get a unified datamol API given the differences in the data returned by the providers above (while it would be nice, this is not necessarily an important point here).

MichelML · 2022-09-26T13:40:52Z

Sorry, maybe the examples were overwhelming here. In any case, I think I can make a PR myself for this.

The idea is simply to replicate this case:

>>> import pandas as pd
>>> data = [{'col_1': 3, 'col_2': 'a'},
...         {'col_1': 2, 'col_2': 'b'},
...         {'col_1': 1, 'col_2': 'c'},
...         {'col_1': 0, 'col_2': 'd'}]
>>> pd.DataFrame.from_records(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Since this is what most REST APIs usually return (a list of dicts), all that remains is to add the logic for a smiles_column param, like the other io methods.

There is definitely a way to have a simple v1 while making sure to specify its limitations imho.

In any case, this is not urgent or a priority, and it is something I can add myself when I need it for a production case.

hadim · 2022-09-26T13:58:02Z

Ok and sorry I think I misunderstood here xD

I use lists of dicts to create DataFrames all the time, and you can simply do pd.DataFrame(list_of_dict). That being said, I am not sure what logic you want to add to datamol for this. Usually a simple workflow is:

df = pd.DataFrame(list_of_dict)
df["mol"] = df[smiles_column].apply(dm.to_mol)

But feel free to post more examples or open a PR if you have something else in mind.

MichelML · 2022-09-26T16:50:36Z

That's exactly what I want, as a one-liner @hadim 😄, which is why I say it's not a priority, just a convenience and another case datamol can handle!

df = dm.from_records(list_of_dict, smiles_column="smiles")

It is no more or less complex than dm.read_csv, already implemented here: https://github.com/datamol-org/datamol/blob/main/datamol/io.py#L27
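
For illustration only, here is a minimal sketch of what such a one-liner could look like. The function name from_records, the mol_column default, and the to_mol parameter are assumptions made for this example, not an existing datamol API; the only datamol call used is dm.to_mol, which is imported lazily so the sketch runs without datamol when a custom converter is passed.

```python
from typing import Any, Callable, Optional, Sequence

import pandas as pd


def from_records(
    records: Sequence[dict],
    smiles_column: Optional[str] = None,
    mol_column: str = "mol",
    to_mol: Optional[Callable[[str], Any]] = None,
) -> pd.DataFrame:
    """Hypothetical helper (not part of datamol): build a DataFrame from a
    list of dicts and optionally add a molecule column from a SMILES column."""
    df = pd.DataFrame.from_records(records)

    if smiles_column is not None:
        if to_mol is None:
            # Lazy import: datamol is only required when no converter is given.
            import datamol as dm

            to_mol = dm.to_mol
        # Convert each SMILES string and store the result in `mol_column`.
        df[mol_column] = df[smiles_column].apply(to_mol)

    return df
```

With datamol installed, usage would be df = from_records(list_of_dict, smiles_column="smiles"). A known limitation of this sketch: it assumes every record has the SMILES key and does not handle an empty list.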

maclandrol · 2022-09-26T17:07:35Z

Note that in a lot of cases, you might want to use pd.json_normalize instead.
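
As a rough illustration of why pd.json_normalize can be the better fit: API responses are often nested, and json_normalize flattens nested dicts into dotted column names. The record layout below is made up for the example and does not come from any specific provider.

```python
import pandas as pd

# Made-up nested records, shaped like a typical JSON API response.
records = [
    {"id": 1, "structure": {"smiles": "CCO"}, "vendor": {"name": "A"}},
    {"id": 2, "structure": {"smiles": "CCN"}, "vendor": {"name": "B"}},
]

# Nested keys become dotted columns such as "structure.smiles".
flat = pd.json_normalize(records)

# The flattened SMILES column can then be used like any other column.
smiles = flat["structure.smiles"].tolist()
```

By contrast, pd.DataFrame(records) would keep each nested dict as a single object in a "structure" column, so the SMILES strings would not be directly addressable as a column.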

hadim · 2023-04-17T11:34:05Z

Closing here. It's not clear to me whether we need this in datamol.

Please re-open if needed.

MichelML added the enhancement and low-priority labels on Sep 19, 2022

hadim closed this as completed Apr 17, 2023
