Skip to content

Conversation

@rpgv
Copy link
Collaborator

@rpgv rpgv commented Sep 25, 2025

Added: AptaCom loader - Loads AptaCom dataset from hugging face in one of two formats:
pandas dataframe, hugging face dataset;

@fkiraly fkiraly added the enhancement New feature or request label Sep 29, 2025

Parameters
----------
as_df: (bool) Requires pandas compatible format; converts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaults should be stated in the docstring, and please adopt a consistent formatting. Suggestion:

paramname : type/description, default=default_value

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what we use.


Returns
-------
dataset: Hugging face dataset in a pandas compatible format.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to be more precise here, and also take into account the value of as_df, e.g., what is returned if it is False (the default actually)?

as_df: (bool) Requires pandas compatible format; converts
dataset into pandas dataframe

filter(str): Allows filtering dataset by:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parameter is called differently in function signature

Copy link
Contributor

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

I had comments mainly on the docstring - AI is also good (under supervision) in writing clear docstrings.

It would also be great if a description of the dataset, e.g., which columns it has, a reference to a paper, etc, would be part of the docstring.

@fkiraly fkiraly changed the title Integrate aptacom database loader [ENH] aptacom database loader Sep 29, 2025
The test_download_aptacom function
"""
dataset = load_aptacom(as_df, filter_entries)
if not isinstance(dataset, Dataset | DataFrame):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually test the correct type based on as_df.

Also, can you check number rows, expected columns, etc?

Rafael added 2 commits October 6, 2025 18:47
- load_aptacom function renamed to load_complete_aptacom;

- Added 'load_aptacom_for_training' function which returns a dataset or dataframe with aptamer and target sequence, with no metadata;

- Both functions within _aptacom_loader.py allow filtering the AptaCom dataset through target or aptamer chemistry;
@fkiraly
Copy link
Contributor

fkiraly commented Oct 8, 2025

code quality tests are failing - I would recommend to set up pre-commit locally, that way code is always formatted correctly (automatically)

Copy link
Contributor

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now!



def load_complete_aptacom(as_df=False, filter_entries=None):
"""Loads a Hugging Face dataset with customizable
Copy link
Collaborator

@satvshr satvshr Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace "a Hugging Face" dataset everywhere in the docstring with the actual thing it is loading? Something like: "the AptaCom dataset from HuggingFace"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course - I'll modify it then.

]
)
else:
aptacom = aptacom.map(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the only difference between load_complete_aptacom and load_aptacom_for_training seem to be the tagret_id argument, when the tagret_id is false, does load_aptacom_for_training give the same result as load_complete_aptacom?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, 'load_for_training' doesn't contain any 'metadata' relative to the aptamer entries. It only describes aptamer sequence, target sequence (and if selected, target id, which is required for AptaCom model).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If load_aptacom_for_training returns data that is a subset of load_complete_aptacom can we not just add 2 arguments to load_complete_aptacom (something like load_training and target_id) and remove the load_aptacom_for_training function altogether?

Copy link
Contributor

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion some changes:

  • from @NennoMP: for loading, we may be able to reuse the huggingface loader
  • since you are subsetting after loading, in load_for_train", I would instead call the load_complete`, so the loader code is not duplicated
  • I would not merge these into a single function, since these correspond to two "data artefacts" that may be worth keeping separate. Artefacts should map on python objects imo.

minor things:

  • naming; I would say load_aptacom_full, and load_aptacom_Xy - more descriptive of content, not purpose
    • having affinity might be useful, so the second could return an X (aptamer, ligand sequence) and an y (affinity)
    • this could use the return_X_y logic like in sklearn, and X with columns "target" and "ligand"
  • docstrings should be in numpydoc format, not google format

Rafael added 4 commits October 17, 2025 14:25
Renamed load_aptacom_complete -> load_aptacom_full;
Created function load_aptacom_xy -> Following sklearn format;
Novel conditions for filtering load_aptacom_full;
@rpgv rpgv requested a review from fkiraly October 22, 2025 08:17
Copy link
Contributor

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@satvshr
Copy link
Collaborator

satvshr commented Oct 26, 2025

Will have a look at it tomorrow or day after.



def filter_columns(ds, columns=None):
""" " Selects columns to keep on dataset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand the """ " formatting everywhere, seems wrong?

""" " Selects columns to keep on dataset
Parameters:
-----------
ds: pd dataframe, required
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting seems a bit off, can you check this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry for the delayed response @satvshr, thank you for your comments - I'll make the adjustments !

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check #192 :D

return ds


def prepare_xy(ds):
Copy link
Collaborator

@satvshr satvshr Oct 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fkiraly just a question, should we not name all functions with x and y in their name similarly? So this would be prepare_x_y instead.

}


def filter_columns(ds, columns=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this function not be private? Also do you mind changing ds to df (for dataframe) given ds is something I will be using for the Dataset object from HuggingFace?

Copy link
Collaborator

@satvshr satvshr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fkiraly we will be replacing the load_hf_dataset function with the one I am working on in #155, so should this be paused till then so that it can be updated with the Hugging Face loader we will be using?

@fkiraly
Copy link
Contributor

fkiraly commented Oct 28, 2025

so should this be paused till then so that it can be updated with the Hugging Face loader we will be using?

I do not think we should block this. I would instead refactor, keeping the public interfaces the same.

@fkiraly fkiraly merged commit 95ac1b1 into main Oct 28, 2025
15 checks passed
@satvshr
Copy link
Collaborator

satvshr commented Oct 28, 2025

so should this be paused till then so that it can be updated with the Hugging Face loader we will be using?

I do not think we should block this. I would instead refactor, keeping the public interfaces the same.

I left some comments, could you have a look and tell me why they are not something that should be changed so that I do not raise such points again?

@fkiraly
Copy link
Contributor

fkiraly commented Nov 1, 2025

@satvshr, apologies, these were very valid points - I overlooked them somehow. Must have had a cached version of the PR displayed without your comments?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants