Python implementation for StatisticsSurface #669

NicolasGensollen · 2022-06-02T15:59:28Z

Fixes #643

Description

This PR proposes to replace the current implementation of the StatisticsSurface pipeline, which relies on the MATLAB SurfStats toolbox, with a pure Python implementation relying on BrainStat.

There are several reasons motivating this migration, the main ones being that SurfStats isn't maintained anymore and that it is written in MATLAB. The current solution implemented in Clinica is to vendor the MATLAB toolbox with a custom wrapper to enable calling it from Python, which is obviously far from ideal...

Test the PR

POC repo

First of all, here is the repo with the proof-of-concept I made before opening this PR: https://github.com/NicolasGensollen/POC_Stat_Pipeline

It makes it easier to experiment with the code than going through Clinica's pipelining architecture.

Run the StatisticsSurface pipeline in pure Python

Obviously, this requires BrainStat to be installed (since it is not a dependency of Clinica yet). This can be done easily with pip:

$ pip install brainstat

For now, I did the minimum amount of work needed to integrate it into Clinica, so there is still a decent amount of work to do to get a clean integration.

Nonetheless, it should be possible to run the pipeline (without the plots which are crashing atm for some reason...):

$ clinica run statistics-surface ./GitRepos/clinica_data_ci/data_ci/StatisticsSurface/in/caps/ UnitTest t1-freesurfer group_comparison ./GitRepos/clinica_data_ci/data_ci/StatisticsSurface/in/subjects.tsv group --covariates age --covariates sex -np 1 -wd $HOME/WD

Feel free to try it and take a look at the code.

Feedback is welcome as always! 😃

pep8speaks · 2022-06-02T15:59:32Z

Hello @NicolasGensollen! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-09-06 10:01:09 UTC

ghisvail · 2022-06-03T08:00:43Z

I have added brainstat to the project and refreshed the lock file. It brings in quite a few transitive dependencies as a result. That's something to keep in mind.

It could be an argument for investigating an alternative with just nilearn (assuming this is possible), since Clinica already depends on it.

clinica/pipelines/statistics_surface/clinica_surfstat.py

omar-rifai

Hi @NicolasGensollen, Thanks for this work ! LGTM. There are a few suggestions in comments.

clinica/pipelines/statistics_surface/_utils.py

clinica/pipelines/statistics_surface/_outputs.py

clinica/pipelines/statistics_surface/clinica_surfstat.py

omar-rifai · 2022-07-04T14:09:25Z

@NicolasGensollen, I'd suggest dropping the vendored surfstat in a separate PR. It will make the review process easier. It will also isolate the creation from the deletion in case we need to fall back to the previous version for whatever reason.

Co-authored-by: omar-rifai <omar.void@gmail.com>

ghisvail · 2022-07-16T10:53:22Z

clinica/pipelines/statistics_surface/_model.py

+        df: pd.DataFrame,
+        feature_label: str,
+        contrast: str,
+        **kwargs,


We should probably get rid of the kwargs. All the arguments with default values are known a priori, so we would be better off listing them explicitly rather than using kwargs.pop.

See a2153e5
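To illustrate the point about `kwargs.pop`, here is a minimal sketch, not the code from a2153e5. The function names, parameter names, and default values are hypothetical and only serve to contrast the two styles: with `**kwargs`, a misspelled option is silently ignored, whereas an explicit signature rejects it.

```python
def fit_with_kwargs(feature_label: str, contrast: str, **kwargs) -> dict:
    # Hypothetical example: defaults are hidden in the body, and unknown
    # keys are never checked, so a typo falls back to the default silently.
    fwhm = kwargs.pop("fwhm", 20)
    threshold = kwargs.pop("threshold", 0.001)
    return {"feature_label": feature_label, "contrast": contrast,
            "fwhm": fwhm, "threshold": threshold}


def fit_explicit(feature_label: str, contrast: str, *,
                 fwhm: int = 20, threshold: float = 0.001) -> dict:
    # Hypothetical example: the same options listed explicitly. Defaults are
    # visible in the signature and unknown keywords raise a TypeError.
    return {"feature_label": feature_label, "contrast": contrast,
            "fwhm": fwhm, "threshold": threshold}


# "fwmh" is a deliberate typo for "fwhm".
silently_wrong = fit_with_kwargs("ct", "group", fwmh=10)

try:
    fit_explicit("ct", "group", fwmh=10)
    typo_rejected = False
except TypeError:
    typo_rejected = True
```

With the explicit signature, the options also show up in `help()` and in editor completion, which is harder to get with `kwargs.pop`.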

ghisvail · 2022-07-16T11:07:51Z

clinica/pipelines/statistics_surface/clinica_surfstat.py

+    fsaverage_path = freesurfer_home / Path("subjects/fsaverage/surf")
+    average_surface, average_mesh = _get_average_surface(fsaverage_path)


Apparently, BrainStat provides its own fetchers. Perhaps they could be useful?

We can discuss this next week, I'm not really convinced that using their fetchers would be a better option compared to the solution implemented here.
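For readers unfamiliar with the layout, the path construction in the snippet above can be sketched as follows. This is only an illustration: the helper name `fsaverage_surf_dir` is hypothetical, and reading the FreeSurfer location from the `FREESURFER_HOME` environment variable is an assumption about how `freesurfer_home` is obtained.

```python
import os
from pathlib import Path
from typing import Optional


def fsaverage_surf_dir(freesurfer_home: Optional[str] = None) -> Path:
    """Hypothetical helper: locate the fsaverage surface folder.

    Falls back to the FREESURFER_HOME environment variable (an assumption)
    when no path is given.
    """
    home = freesurfer_home or os.environ.get("FREESURFER_HOME")
    if home is None:
        raise EnvironmentError("FREESURFER_HOME is not set.")
    # Same relative layout as in the diff above.
    return Path(home) / "subjects" / "fsaverage" / "surf"


surf_dir = fsaverage_surf_dir("/opt/freesurfer")
```

The helper only builds the path; it does not check that the folder exists or load any surface.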

Co-authored-by: Ghislain Vaillant <ghisvail@users.noreply.github.com>

ghisvail · 2022-08-24T14:16:53Z

My bad, should be [math.prod](https://docs.python.org/3.8/library/math.html#math.prod) then.

> @NicolasGensollen commented on this pull request, in clinica/pipelines/statistics_surface/_model.py (https://github.com/aramis-lab/clinica/pull/669#discussion_r953850275):
>
> +            model_term = reduce(
> +                lambda x, y: x * y, [_build_model_term(_, df) for _ in sub_terms]
> +            )
>
> The lambda function is the product function, not the sum.

ghisvail · 2022-08-29T14:40:54Z

I fixed the merge conflict and updated the brainspace dependency to use the new published version. It looks like there are still issues with unit tests though.

NicolasGensollen · 2022-08-30T08:29:33Z

Thanks for fixing the conflict @ghisvail
I hadn't updated the tests (now done in 3cd895f)
I should have addressed all your comments and suggestions except for the sum() and prod() aggregations, which are giving unexpected results. I opened MICA-MNI/BrainStat#290 a few days ago and reverted to lambda aggregation functions in the meantime.

ghisvail · 2022-08-30T08:32:39Z

> Thanks for fixing the conflict @ghisvail
> I hadn't updated the tests (now done in 3cd895f)
> I should have addressed all your comments and suggestions except for the sum() and prod() aggregations which are giving unexpected results. I opened MICA-MNI/BrainStat#290 a few days ago and reverted back to lambda aggregation functions in the mean time.

You did well. It's a weird bug indeed.
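The thread does not say what caused the unexpected results, so the following is only a sketch of one general difference between the two approaches, not a diagnosis of the BrainStat issue. `functools.reduce` without an initializer combines the items with each other only, whereas `sum()` and `math.prod()` start from a numeric value (`0` and `1` respectively) that gets combined with the first item. The `Term` class below is a hypothetical stand-in, not BrainStat's API.

```python
import math
from functools import reduce


class Term:
    """Hypothetical model term: only Term + Term and Term * Term are defined."""

    def __init__(self, name: str):
        self.name = name

    def __add__(self, other):
        if not isinstance(other, Term):
            return NotImplemented
        return Term(f"{self.name}+{other.name}")

    def __mul__(self, other):
        if not isinstance(other, Term):
            return NotImplemented
        return Term(f"{self.name}*{other.name}")


terms = [Term("age"), Term("sex"), Term("group")]

# reduce() without an initializer never introduces a number.
product = reduce(lambda x, y: x * y, terms)
total = reduce(lambda x, y: x + y, terms)

# math.prod() first evaluates 1 * Term, and sum() evaluates 0 + Term.
# With this Term class, both fail because int and Term cannot be mixed.
try:
    math.prod(terms)
    prod_failed = False
except TypeError:
    prod_failed = True

try:
    sum(terms)
    sum_failed = False
except TypeError:
    sum_failed = True
```

A class that does accept numbers on the left-hand side would not raise here; it would instead fold the start value into the result, which may or may not be what the caller expects.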

ghisvail

Another batch of comments and suggestions. I stopped the numpydoc suggestions after I realized most of the docstrings are missing type information for parameters and return values.

Please consider making another pass at them with the style guide open for a refresher.

ghisvail · 2022-08-29T15:02:16Z

clinica/pipelines/statistics_surface/_inputs.py

+    if not Path(tsv_file).exists():
+        raise FileNotFoundError(f"File {tsv_file} does not exist.")
+    tsv_data = pd.read_csv(tsv_file, sep="\t")
+    if len(tsv_data.columns) < 2:
+        raise ValueError(f"The TSV data in {tsv_file} should have at least 2 columns.")
+    if tsv_data.columns[0] != TSV_FIRST_COLUMN:
+        raise ValueError(
+            f"The first column in {tsv_file} should always be {TSV_FIRST_COLUMN}."
+        )
+    if tsv_data.columns[1] != TSV_SECOND_COLUMN:
+        raise ValueError(
+            f"The second column in {tsv_file} should always be {TSV_SECOND_COLUMN}."
+        )
+    return tsv_data


I wonder whether the intent of this code could be simplified with something like:

Suggested change

-    if not Path(tsv_file).exists():
-        raise FileNotFoundError(f"File {tsv_file} does not exist.")
-    tsv_data = pd.read_csv(tsv_file, sep="\t")
-    if len(tsv_data.columns) < 2:
-        raise ValueError(f"The TSV data in {tsv_file} should have at least 2 columns.")
-    if tsv_data.columns[0] != TSV_FIRST_COLUMN:
-        raise ValueError(
-            f"The first column in {tsv_file} should always be {TSV_FIRST_COLUMN}."
-        )
-    if tsv_data.columns[1] != TSV_SECOND_COLUMN:
-        raise ValueError(
-            f"The second column in {tsv_file} should always be {TSV_SECOND_COLUMN}."
-        )
-    return tsv_data
+    try:
+        return pd.read_csv(tsv_file, sep="\t").set_index(["participant_id", "session_id"])
+    except [...]:
+        # Deal with read_csv and set_index errors.

I find the intent here clearer: it needs to be a valid dataset read from TSV and contain a set of lines with unique participant_id and session_id columns. This is more in line with the Python way of coding, i.e. better ask for forgiveness (try / except) than permission (if / else).

Good suggestion, implemented in b225b69
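The try / except approach suggested above can be sketched as follows. This is not the code from b225b69: the helper name `read_subjects_tsv` and the error message are hypothetical, and `verify_integrity=True` is added here as one way to enforce the "unique participant_id and session_id" requirement mentioned in the comment.

```python
import io

import pandas as pd


def read_subjects_tsv(source) -> pd.DataFrame:
    """Hypothetical helper: read a TSV and index it by participant and session.

    A missing file raises FileNotFoundError from pandas itself, and duplicated
    (participant_id, session_id) pairs raise ValueError from set_index.
    """
    try:
        return pd.read_csv(source, sep="\t").set_index(
            ["participant_id", "session_id"], verify_integrity=True
        )
    except KeyError as e:
        # set_index raises KeyError when one of the columns is missing.
        raise ValueError(
            "The TSV data should have participant_id and session_id columns."
        ) from e


good = io.StringIO(
    "participant_id\tsession_id\tage\n"
    "sub-01\tses-M00\t71\n"
    "sub-02\tses-M00\t65\n"
)
subjects = read_subjects_tsv(good)

bad = io.StringIO("subject\tage\nsub-01\t71\n")
try:
    read_subjects_tsv(bad)
    missing_columns_rejected = False
except ValueError:
    missing_columns_rejected = True
```

Unlike the original checks, this version does not require the two columns to be in first and second position, only that they exist.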

clinica/pipelines/statistics_surface/_inputs.py

clinica/pipelines/statistics_surface/_model.py

ghisvail · 2022-08-30T09:18:27Z

clinica/pipelines/statistics_surface/_model.py

+    )
+
+
+def _is_categorical(df: pd.DataFrame, column: str) -> bool:


I agree for the responsibility issue. I made some changes in 1a1e9bf to improve this.

That's better indeed 👍

> DataFrame we are dealing with here comes from a brutal tsv read with no a priori information

So how do you figure whether a column is categorical or not from a raw string format without additional metadata, then? This is a very good question.

The implementation says anything whose dtype name does not start with "float", which is a proxy for testing whether a column contains non-float values. So any column whose dtype is strings, objects or integers would be considered categorical. Is it really the right logic, though?

Let's take the following example: if I have an age column successfully parsed as integers, then it's categorical. If it's parsed as floats, then it's not. Date-like columns would be considered categorical too. Also, object dtype is attributed when the pandas sniffers fail to detect a precise dtype from a column. Those columns may be categorical or malformed.

Should we decide to keep this logic, then it would be better to rename this filter to is_nonfloat rather than keep is_categorical, in my opinion. Another option would be to enhance the reader to analyze the columns that are still strings after convert_dtypes, compute their histograms, apply a conservative heuristic on the ratio of distinct values to the total number of lines, and automatically convert them to a proper categorical dtype. It would be significantly more work, though.
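The behaviour described above can be reproduced with a small sketch. The `is_nonfloat` function mirrors the proxy as it is described in this comment, not necessarily the exact Clinica code, and `to_category_if_few_levels` with its `max_ratio` threshold is a hypothetical illustration of the "distinct values over total lines" heuristic.

```python
import io

import pandas as pd

TSV = (
    "participant_id\tsex\tage\tscore\n"
    "sub-01\tF\t71\t27.5\n"
    "sub-02\tM\t65\t29.0\n"
    "sub-03\tF\t80\t24.5\n"
    "sub-04\tM\t77\t30.0\n"
)
df = pd.read_csv(io.StringIO(TSV), sep="\t")


def is_nonfloat(df: pd.DataFrame, column: str) -> bool:
    # The proxy discussed above: anything whose dtype is not float.
    return not df[column].dtype.name.startswith("float")


def to_category_if_few_levels(series: pd.Series, max_ratio: float = 0.5) -> pd.Series:
    # Hypothetical heuristic: convert to a categorical dtype only when the
    # number of distinct values is small relative to the number of rows.
    if len(series) > 0 and series.nunique() / len(series) <= max_ratio:
        return series.astype("category")
    return series


# "age" is parsed as integers, so the proxy flags it like "sex".
flags = {column: is_nonfloat(df, column) for column in ["sex", "age", "score"]}

sex = to_category_if_few_levels(df["sex"])
age = to_category_if_few_levels(df["age"])
```

On this toy table, the ratio heuristic leaves `age` numeric and converts `sex`, but a fixed threshold would still misclassify small datasets or integer-coded groups, so it is not a complete answer to the question raised here.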

ghisvail · 2022-08-30T09:20:57Z

clinica/pipelines/statistics_surface/_model.py

+            model_term = reduce(
+                lambda x, y: x * y, [_build_model_term(_, df) for _ in sub_terms]
+            )


Yeah, I meant math.prod. Sorry for the hasty copy pasting from the previous comment.

NicolasGensollen · 2022-08-30T10:18:15Z

For some reason, I thought Numpydoc had converged on the automatic integration of type hints like Napoleon already does...
But you're right, this is still an open issue: numpy/numpydoc#196 so these docstrings aren't compliant. I'll change them.

omar-rifai reviewed Jun 3, 2022

omar-rifai requested changes Jun 3, 2022

ghisvail reviewed Jun 3, 2022

ghisvail reviewed Jun 3, 2022

NicolasGensollen force-pushed the python-implementation-statistics-surface branch 2 times, most recently from cc78ac0 to 9b86481 Compare June 21, 2022 16:04

NicolasGensollen force-pushed the python-implementation-statistics-surface branch 3 times, most recently from faee9b0 to 4d5a797 Compare June 27, 2022 08:55

NicolasGensollen and others added 18 commits July 5, 2022 15:23

Initial work

c83cb07

Run make format

894fd43

Add brainstat to runtime dependencies

68cce91

Fix typo in docstring

f71bdfa

Co-authored-by: omar-rifai <omar.void@gmail.com>

Fix type mismatch warning

240d250

Fix type mismatch in function signature

0a6a847

Fix unresolved reference error

89afd82

Fix extra hash in comment (PEP8)

e9489ec

Fix typo in docstring

79dd4df

Remove unused import

5db034b

Use set literal

c8d9d99

Use lowercase for inner function variable names

6b6937c

Fix typo in docstring

e451980

Add missing apostrophe

ba2a316

Use lowercase for inner function variable names

32cbce8

Fix type mismatch between plot functions

4fb8acc

Silence broad try except warning

e6237b0

Ensure nilearn is installed with plotting capabilities

3314895

ghisvail reviewed Jul 16, 2022

NicolasGensollen and others added 12 commits August 24, 2022 10:45

Enable passing a surface file to clinica_surfstat

8fe4f4a

use snake case style for attributes

ea38312

Update clinica/pipelines/statistics_surface/_model.py

cfa9dae

Co-authored-by: Ghislain Vaillant <ghisvail@users.noreply.github.com>

remove comment

77aefa2

Turn GLM results classes into dataclasses

b558a64

Refactor Results classes with better serializing API

3931d6e

Update clinica/pipelines/statistics_surface/clinica_surfstat.py

e24028b

Co-authored-by: Ghislain Vaillant <ghisvail@users.noreply.github.com>

Update clinica/pipelines/statistics_surface/_model.py

5dabdc5

Co-authored-by: Ghislain Vaillant <ghisvail@users.noreply.github.com>

Update clinica/pipelines/statistics_surface/_model.py

0db0ebc

Co-authored-by: Ghislain Vaillant <ghisvail@users.noreply.github.com>

Update clinica/pipelines/statistics_surface/_model.py

6b4998f

Co-authored-by: Ghislain Vaillant <ghisvail@users.noreply.github.com>

Refactor _is_categorical for single responsability

1a1e9bf

Some fixes...

f76e585

NicolasGensollen and others added 4 commits August 25, 2022 14:38

Replace GLMFactory class with create_glm_model function

3bb21f7

Remove use of kwargs and pass parameters explicitely

a2153e5

Merge branch 'dev' into python-implementation-statistics-surface

2e07cc4

Update minimum version of brainspace

45f90ef

Fix tests - reuse lambda sum aggregation

3cd895f

ghisvail requested changes Aug 30, 2022

NicolasGensollen added 2 commits August 30, 2022 16:20

Add types to docstrings to comply with Numpydoc specs

4f9828f

Simplify _read_and_check_tsv_file

b225b69

NicolasGensollen force-pushed the python-implementation-statistics-surface branch from 66bc1c9 to b225b69 Compare September 6, 2022 10:01

Merge branch 'dev' into python-implementation-statistics-surface

c86c3cc

ghisvail merged commit 07d045f into aramis-lab:dev Sep 20, 2022
