Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python implementation for StatisticsSurface #669

Merged

Conversation

NicolasGensollen
Copy link
Member

@NicolasGensollen NicolasGensollen commented Jun 2, 2022

Fixes #643

Description

This PR proposes to replace the current implementation of the StatisticsSurface pipeline, which relies on the MATLAB SurfStats toolbox, by a pure Python implementation relying on BrainStat.

There are several reasons motivating this migration, the main ones being that SurfStats isn't maintained anymore and that it is written in MATLAB. The current solution implemented in Clinica is to vendor the MATLAB toolbox with a custom wrapper to enable calling it from Python, which is obviously far from ideal...

Test the PR

POC repo

First of all, here is the repo with the proof-of-concept I made before opening this PR: https://github.com/NicolasGensollen/POC_Stat_Pipeline

It is possible to experiment with the code more easily than through Clinica's pipelining architecture.

Run the StatisticsSurface pipeline in pure Python

Obviously, this requires to have Brainstat installed (since it is not a dependency of Clinica yet). This can be done easily with pip:

$ pip install brainstat

For now, I made the minimum amount of work to integrate it into Clinica.

So there is still a decent amount of work to do in order to have a clean integration into Clinica.

Nonetheless, it should be possible to run the pipeline (without the plots which are crashing atm for some reason...):

$ clinica run statistics-surface ./GitRepos/clinica_data_ci/data_ci/StatisticsSurface/in/caps/ UnitTest t1-freesurfer group_comparison ./GitRepos/clinica_data_ci/data_ci/StatisticsSurface/in/subjects.tsv group --covariates age --covariates sex -np 1 -wd $HOME/WD

Feel free to try it and take a look at the code.

Feedbacks are welcome as always! 😃

@pep8speaks
Copy link

pep8speaks commented Jun 2, 2022

Hello @NicolasGensollen! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-09-06 10:01:09 UTC

@ghisvail
Copy link
Contributor

ghisvail commented Jun 3, 2022

I have added brainstat to the project and refreshed the lock file. It brings quite a bit of transitive dependencies as a result. That's something to keep in mind.

It could be an argument for investigating an alternative with just nilear (assuming this is possible), since Clinica already depends on it.

Copy link
Collaborator

@omar-rifai omar-rifai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @NicolasGensollen, Thanks for this work ! LGTM. There are a few suggestions in comments.

clinica/pipelines/statistics_surface/_utils.py Outdated Show resolved Hide resolved
clinica/pipelines/statistics_surface/_outputs.py Outdated Show resolved Hide resolved
clinica/pipelines/statistics_surface/_outputs.py Outdated Show resolved Hide resolved
clinica/pipelines/statistics_surface/_outputs.py Outdated Show resolved Hide resolved
clinica/pipelines/statistics_surface/_outputs.py Outdated Show resolved Hide resolved
clinica/pipelines/statistics_surface/_outputs.py Outdated Show resolved Hide resolved
@NicolasGensollen NicolasGensollen force-pushed the python-implementation-statistics-surface branch 2 times, most recently from cc78ac0 to 9b86481 Compare June 21, 2022 16:04
@NicolasGensollen NicolasGensollen force-pushed the python-implementation-statistics-surface branch 3 times, most recently from faee9b0 to 4d5a797 Compare June 27, 2022 08:55
@omar-rifai
Copy link
Collaborator

@NicolasGensollen, I'd suggest to drop the vendored surfstat in a seperate PR. It will make the review process easier. Also it'll isolate the creation from the deletion in case we need to fall back to the previous version for whichever reason.

df: pd.DataFrame,
feature_label: str,
contrast: str,
**kwargs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably get rid of the kwargs. All the arguments with default values are known a priori, so we would be better off listing them explicitly rather than using kwargs.pop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See a2153e5

Comment on lines +136 to +137
fsaverage_path = freesurfer_home / Path("subjects/fsaverage/surf")
average_surface, average_mesh = _get_average_surface(fsaverage_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently, BrainStat provides its own fetchers. Perhaps they could be useful?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can discuss this next week, I'm not really convinced that using their fetchers would be a better option compared to the solution implemented here.

@ghisvail
Copy link
Contributor

ghisvail commented Aug 24, 2022 via email

@ghisvail
Copy link
Contributor

I fixed the merge conflict and updated the brainspace dependency to use the new published version. It looks like there are still issues with unit tests though.

@NicolasGensollen
Copy link
Member Author

Thanks for fixing the conflict @ghisvail
I hadn't updated the tests (now done in 3cd895f)
I should have addressed all your comments and suggestions except for the sum() and prod() aggregations which are giving unexpected results. I opened MICA-MNI/BrainStat#290 a few days ago and reverted back to lambda aggregation functions in the mean time.

@ghisvail
Copy link
Contributor

Thanks for fixing the conflict @ghisvail I hadn't updated the tests (now done in 3cd895f) I should have addressed all your comments and suggestions except for the sum() and prod() aggregations which are giving unexpected results. I opened MICA-MNI/BrainStat#290 a few days ago and reverted back to lambda aggregation functions in the mean time.

You did well. It's a weird bug indeed.

Copy link
Contributor

@ghisvail ghisvail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another batch of comments and suggestions. I stopped the numpydoc suggestions after I realized most of them are missing typing informations for parameters and returned values.

Please consider making another pass at them with the style guide open for a refresher.

Comment on lines 27 to 40
if not Path(tsv_file).exists():
raise FileNotFoundError(f"File {tsv_file} does not exist.")
tsv_data = pd.read_csv(tsv_file, sep="\t")
if len(tsv_data.columns) < 2:
raise ValueError(f"The TSV data in {tsv_file} should have at least 2 columns.")
if tsv_data.columns[0] != TSV_FIRST_COLUMN:
raise ValueError(
f"The first column in {tsv_file} should always be {TSV_FIRST_COLUMN}."
)
if tsv_data.columns[1] != TSV_SECOND_COLUMN:
raise ValueError(
f"The second column in {tsv_file} should always be {TSV_SECOND_COLUMN}."
)
return tsv_data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether the intent of this code could be simplified with something like:

Suggested change
if not Path(tsv_file).exists():
raise FileNotFoundError(f"File {tsv_file} does not exist.")
tsv_data = pd.read_csv(tsv_file, sep="\t")
if len(tsv_data.columns) < 2:
raise ValueError(f"The TSV data in {tsv_file} should have at least 2 columns.")
if tsv_data.columns[0] != TSV_FIRST_COLUMN:
raise ValueError(
f"The first column in {tsv_file} should always be {TSV_FIRST_COLUMN}."
)
if tsv_data.columns[1] != TSV_SECOND_COLUMN:
raise ValueError(
f"The second column in {tsv_file} should always be {TSV_SECOND_COLUMN}."
)
return tsv_data
try:
return pd.read_tsv(tsv_file, sep="\t").set_index(["participant_id", "session_id"])
except [...]:
# Deal with read_tsv and set_index errors.

I find the intent here clearer: it needs to be a valid dataset read from TSV and contain a set of lines with unique participant_id and session_id columns. This is more in line with the Python way of coding, i.e. better ask for forgiveness (try / except) than permission (if / else).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, implemented in b225b69

clinica/pipelines/statistics_surface/_inputs.py Outdated Show resolved Hide resolved
clinica/pipelines/statistics_surface/_inputs.py Outdated Show resolved Hide resolved
clinica/pipelines/statistics_surface/_inputs.py Outdated Show resolved Hide resolved
clinica/pipelines/statistics_surface/_model.py Outdated Show resolved Hide resolved
)


def _is_categorical(df: pd.DataFrame, column: str) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree for the responsibility issue. I made some changes in 1a1e9bf to improve this.

That's better indeed 👍

DataFrame we are dealing with here comes from a brutal tsv read with no a priori information

So how do you figure whether a column is categorical or not from a raw string format without additional metadata, then? This is a very good question.

The implementation says anything whose dtype name does not start with "float". Which is a proxy for testing whether a column contains values which are non-float. So any column which dtype is either strings, objects or integers would be considered as categorical. Is it really the right logic, though?

Let's take the following example: if I have an age column successfully parsed as integers, then it's categorical. If it's parsed as floats, then it's not. Date-like columns would be considered categorical too. Also, object dtype is attributed when the pandas sniffers fail to detect a precise dtype from a column. Those columns may be categorical or malformed.

Should we decide to keep this logic, then it would be better to rename this filter to is_nonfloat rather than keep is_categorical in my opinion. Another option would be to enhance the reader to analyze columns which have been successfully converted as strings after convert_dtype, compute their histograms, apply a conservative heuristic between the range of different values over the total of lines and automatically convert them to a proper categorical dtype. It would be significantly more work though.

Comment on lines +104 to +106
model_term = reduce(
lambda x, y: x * y, [_build_model_term(_, df) for _ in sub_terms]
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I meant math.prod. Sorry for the hasty copy pasting from the previous comment.

@NicolasGensollen
Copy link
Member Author

For some reason, I thought Numpydoc had converged on the automatic integration of type hints like Napoleon already does...
But you're right, this is still an open issue: numpy/numpydoc#196 so these docstrings aren't compliant. I'll change them.

@NicolasGensollen NicolasGensollen force-pushed the python-implementation-statistics-surface branch from 66bc1c9 to b225b69 Compare September 6, 2022 10:01
@ghisvail ghisvail merged commit 07d045f into aramis-lab:dev Sep 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Find an alternative to ClinicaSurfStat
4 participants