-
Notifications
You must be signed in to change notification settings - Fork 5
[ENH] aptacom database loader #158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…emistry. Both applied in a broad manner
|
|
||
| Parameters | ||
| ---------- | ||
| as_df: (bool) Requires pandas compatible format; converts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defaults should be stated in the docstring, and please adopt a consistent formatting. Suggestion:
paramname : type/description, default=default_value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what we use.
|
|
||
| Returns | ||
| ------- | ||
| dataset: Hugging face dataset in a pandas compatible format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to be more precise here, and also take into account the value of as_df, e.g., what is returned if it is False (the default actually)?
| as_df: (bool) Requires pandas compatible format; converts | ||
| dataset into pandas dataframe | ||
|
|
||
| filter(str): Allows filtering dataset by: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parameter is called differently in function signature
fkiraly
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice!
I had comments mainly on the docstring - AI is also good (under supervision) in writing clear docstrings.
It would also be great if a description of the dataset, e.g., which columns it has, a reference to a paper, etc, would be part of the docstring.
| The test_download_aptacom function | ||
| """ | ||
| dataset = load_aptacom(as_df, filter_entries) | ||
| if not isinstance(dataset, Dataset | DataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would actually test the correct type based on as_df.
Also, can you check number rows, expected columns, etc?
- load_aptacom function renamed to load_complete_aptacom; - Added 'load_aptacom_for_training' function which returns a dataset or dataframe with aptamer and target sequence, with no metadata; - Both functions within _aptacom_loader.py allow filtering the AptaCom dataset through target or aptamer chemistry;
|
code quality tests are failing - I would recommend to set up |
fkiraly
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now!
|
|
||
|
|
||
| def load_complete_aptacom(as_df=False, filter_entries=None): | ||
| """Loads a Hugging Face dataset with customizable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we replace "a Hugging Face" dataset everywhere in the docstring with the actual thing it is loading? Something like: "the AptaCom dataset from HuggingFace"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course - I'll modify it then.
| ] | ||
| ) | ||
| else: | ||
| aptacom = aptacom.map( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the only difference between load_complete_aptacom and load_aptacom_for_training seem to be the tagret_id argument, when the tagret_id is false, does load_aptacom_for_training give the same result as load_complete_aptacom?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, 'load_for_training' doesn't contain any 'metadata' relative to the aptamer entries. It only describes aptamer sequence, target sequence (and if selected, target id, which is required for AptaCom model).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If load_aptacom_for_training returns data that is a subset of load_complete_aptacom can we not just add 2 arguments to load_complete_aptacom (something like load_training and target_id) and remove the load_aptacom_for_training function altogether?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussion some changes:
- from @NennoMP: for loading, we may be able to reuse the huggingface loader
- since you are subsetting after loading, in
load_for_train", I would instead call theload_complete`, so the loader code is not duplicated - I would not merge these into a single function, since these correspond to two "data artefacts" that may be worth keeping separate. Artefacts should map on python objects imo.
minor things:
- naming; I would say
load_aptacom_full, andload_aptacom_Xy- more descriptive of content, not purpose- having affinity might be useful, so the second could return an
X(aptamer, ligand sequence) and any(affinity) - this could use the
return_X_ylogic like in sklearn, andXwith columns"target"and"ligand"
- having affinity might be useful, so the second could return an
- docstrings should be in numpydoc format, not google format
Renamed load_aptacom_complete -> load_aptacom_full; Created function load_aptacom_xy -> Following sklearn format; Novel conditions for filtering load_aptacom_full;
fkiraly
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
|
Will have a look at it tomorrow or day after. |
|
|
||
|
|
||
| def filter_columns(ds, columns=None): | ||
| """ " Selects columns to keep on dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand the """ " formatting everywhere, seems wrong?
| """ " Selects columns to keep on dataset | ||
| Parameters: | ||
| ----------- | ||
| ds: pd dataframe, required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting seems a bit off, can you check this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for the delayed response @satvshr, thank you for your comments - I'll make the adjustments !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check #192 :D
| return ds | ||
|
|
||
|
|
||
| def prepare_xy(ds): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fkiraly just a question, should we not name all functions with x and y in their name similarly? So this would be prepare_x_y instead.
| } | ||
|
|
||
|
|
||
| def filter_columns(ds, columns=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this function not be private? Also do you mind changing ds to df (for dataframe) given ds is something I will be using for the Dataset object from HuggingFace?
satvshr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we should block this. I would instead refactor, keeping the public interfaces the same. |
I left some comments, could you have a look and tell me why they are not something that should be changed so that I do not raise such points again? |
|
@satvshr, apologies, these were very valid points - I overlooked them somehow. Must have had a cached version of the PR displayed without your comments? |
Added: AptaCom loader - Loads AptaCom dataset from hugging face in one of two formats:
pandas dataframe, hugging face dataset;