Example Dataset #685

Merged
merged 36 commits into from
Sep 8, 2022

Contributor

@srivarra srivarra commented Aug 30, 2022

If you haven't already, please read through our contributing guidelines before opening your PR

What is the purpose of this PR?

Closes #657. Adds an example dataset available at Hugging Face.

How did you implement your changes

Added a set of example FOVs in the dataset here.

Remaining issues

  • Add an option to download the example dataset in the jupyter notebook.
  • Adjust the notebook paths to automatically work with the default dataset.
  • In the future, add another version of the dataset with all intermediate data, and a small dataset of a couple of fovs and channels for rapid testing.

@srivarra srivarra added the enhancement New feature or request label Aug 30, 2022
@srivarra srivarra self-assigned this Aug 30, 2022
@alex-l-kong alex-l-kong marked this pull request as ready for review August 31, 2022 16:33
@alex-l-kong alex-l-kong marked this pull request as draft August 31, 2022 16:34

@srivarra
Contributor Author

srivarra commented Sep 1, 2022

Here is the current structure after downloading the example dataset and running through Notebook 1.

data/
└── example_dataset/
    ├── image_data/input_data/
    │   ├── fov0/
    │   │   ├── CD3.tiff
    │   │   ├── CD4.tiff
    │   │   ├── ...
    │   │   └── Vim.tiff
    │   ├── ...
    │   └── fov10/
    │       ├── CD3.tiff
    │       ├── CD4.tiff
    │       ├── ...
    │       └── Vim.tiff
    ├── segmentation/
    │   ├── deepcell_input/
    │   ├── deepcell_output/
    │   ├── deepcell_visualization/
    │   └── cell_table/
    ├── pixie/
    ├── post_clustering/
    │   ├── mantis/
    │   └── masks/
    └── analysis/
        ├── spatial_enrichment/
        └── spatial_lda/
            ├── processed/
            └── visualization/
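A minimal sketch for checking that a local copy matches the layout above. The helper names `make_example_layout` and `missing_subdirs` are hypothetical and not part of ark; the subdirectory list is copied from the tree, and the per-FOV folders are left out.

```python
import tempfile
from pathlib import Path

# Subdirectories copied from the tree above, relative to data/example_dataset/.
EXAMPLE_DATASET_SUBDIRS = [
    "image_data/input_data",
    "segmentation/deepcell_input",
    "segmentation/deepcell_output",
    "segmentation/deepcell_visualization",
    "segmentation/cell_table",
    "pixie",
    "post_clustering/mantis",
    "post_clustering/masks",
    "analysis/spatial_enrichment",
    "analysis/spatial_lda/processed",
    "analysis/spatial_lda/visualization",
]


def make_example_layout(base_dir):
    """Create the empty folder skeleton under base_dir (hypothetical helper)."""
    root = Path(base_dir) / "data" / "example_dataset"
    for subdir in EXAMPLE_DATASET_SUBDIRS:
        (root / subdir).mkdir(parents=True, exist_ok=True)
    return root


def missing_subdirs(root):
    """Return the expected subdirectories that do not exist under root."""
    return [s for s in EXAMPLE_DATASET_SUBDIRS if not (Path(root) / s).is_dir()]


# Build the skeleton in a throwaway directory and confirm nothing is missing.
with tempfile.TemporaryDirectory() as tmp:
    example_root = make_example_layout(tmp)
    print(missing_subdirs(example_root))
```

Running `missing_subdirs` against a real download would list any folder that a notebook expects but that is absent.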

@ngreenwald
Member

Can we have all of the segmentation subfolders at the same level: deepcell_input, deepcell_output, deepcell_visualization, cell_tables? And put all of those in a folder called segmentation instead of processed? And instead of raw, call the folder image_data, so it's exactly the same as toffy: no subfolders, just the FOV folders.

@alex-l-kong
Contributor

alex-l-kong commented Sep 3, 2022 via email

@alex-l-kong
Contributor

alex-l-kong commented Sep 4, 2022

@srivarra I did some testing and read the rankdata documentation: ranks are inherently non-contiguous. Let's say you have a 1024x1024 image that's all 0 except for a single pixel labeled 1. That 1 is going to get assigned a rank of 1048576 (1024 ** 2).

The main question is why _convert_deepcell_seg_masks isn't currently being tested. A test would likely have revealed this source of error beforehand.

Also wanted to re-verify that the raw segmentation output labels in #609 matched up with the previous version. If that's the case, it could also mean the inherent scipy implementation of rankdata changed (the library has gone through multiple updates since _convert_deepcell_seg_masks was implemented, including one just 9 days ago).
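
The 1024x1024 case described above can be reproduced with a short sketch. It assumes NumPy and SciPy are installed; the pixel position (512, 512) is an arbitrary choice for illustration.

```python
import numpy as np
from scipy import stats

# A 1024x1024 image that is all 0 except for a single pixel labeled 1.
mask = np.zeros((1024, 1024))
mask[512, 512] = 1  # arbitrary position, chosen only for illustration

# Rank the flattened image with the "average" tie-breaking method.
ranks = stats.rankdata(mask.ravel(), method="average")

background_rank = ranks.min()  # shared by all tied zero pixels
cell_rank = ranks.max()        # rank of the single labeled pixel

print(background_rank, cell_rank)
```

The labeled pixel ends up with rank 1024 ** 2, and the tied background pixels share the mean of ranks 1 through 1024 ** 2 - 1, so neither value corresponds to the original labels 0 and 1.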

@srivarra
Contributor Author

srivarra commented Sep 4, 2022

@alex-l-kong
There wasn't a test function because the bytes input wasn't formatted exactly the way deepcell returns the segmentation mask. I've just figured it out, however, and will push the test once we figure out which algorithm we want to use.

Here is the function _convert_deepcell_seg_masks:

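from io import BytesIO

import numpy as np
from scipy import stats
from tifffile import imread  # assumed source of imread; not shown in this thread


def _convert_deepcell_seg_masks(seg_mask: bytes) -> np.ndarray:
    # Decode the raw TIFF bytes into an array
    float_mask = imread(BytesIO(seg_mask))

    # The ranked output may come back as a 1D array, so keep the shape to restore it
    shape = float_mask.shape

    # Rank the mask values, then cast to int32 and reshape to the original dims
    ranked_mask_repr: np.ndarray = stats.rankdata(float_mask, method="average")
    ranked_mask: np.ndarray = ranked_mask_repr.astype(dtype="int32").reshape(shape)

    return ranked_mask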

Consider the rudimentary test function below:

def test_convert_deepcell_seg_masks():
    with tempfile.TemporaryDirectory() as temp_dir:
        # 10x10 mask of zeros with three labeled pixels
        test_mask = np.zeros((10, 10))
        test_mask[0, 0] = 1
        test_mask[0, 1] = 2
        test_mask[0, 2] = 2
        tifffile.imwrite(f"{temp_dir}/test_mask.tiff", data=test_mask)

        # Read the file back as raw bytes, the input type the function expects
        with open(f"{temp_dir}/test_mask.tiff", "rb") as test_mask_bytes:
            print(_convert_deepcell_seg_masks(test_mask_bytes.read()))

We can adjust the method parameter in stats.rankdata. In the test function, we start from a matrix of zeros, $\mathbf{A} = \mathbf{0}_{10 \times 10}$, and then set three entries: $\mathbf{A}_{0,0} = 1$, $\mathbf{A}_{0,1} = 2$, $\mathbf{A}_{0,2} = 2$.

For method = "average" we get the following matrix:

[[98 99 99 49 49 49 49 49 49 49]
 [49 49 49 49 49 49 49 49 49 49]
 [49 49 49 49 49 49 49 49 49 49]
 [49 49 49 49 49 49 49 49 49 49]
 [49 49 49 49 49 49 49 49 49 49]
 [49 49 49 49 49 49 49 49 49 49]
 [49 49 49 49 49 49 49 49 49 49]
 [49 49 49 49 49 49 49 49 49 49]
 [49 49 49 49 49 49 49 49 49 49]
 [49 49 49 49 49 49 49 49 49 49]]

For method = "min" we get the following matrix:

[[98 99 99  1  1  1  1  1  1  1]
 [ 1  1  1  1  1  1  1  1  1  1]
 [ 1  1  1  1  1  1  1  1  1  1]
 [ 1  1  1  1  1  1  1  1  1  1]
 [ 1  1  1  1  1  1  1  1  1  1]
 [ 1  1  1  1  1  1  1  1  1  1]
 [ 1  1  1  1  1  1  1  1  1  1]
 [ 1  1  1  1  1  1  1  1  1  1]
 [ 1  1  1  1  1  1  1  1  1  1]
 [ 1  1  1  1  1  1  1  1  1  1]]

For method = "dense" we get the following matrix:

[[2 3 3 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]]

Dense looks like what we want. It behaves like min, except that it guarantees contiguous integer ranks: the next-highest element is assigned the rank immediately after the one assigned to the tied elements.
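
The three matrices above can be compared without writing a TIFF file. This is a sketch under the assumption that only the ranking step matters here; `rank_mask` is a hypothetical stand-in for the ranking part of _convert_deepcell_seg_masks, not the function itself.

```python
import numpy as np
from scipy import stats

# Same test mask as in the thread: zeros with three labeled pixels.
test_mask = np.zeros((10, 10))
test_mask[0, 0] = 1
test_mask[0, 1] = 2
test_mask[0, 2] = 2


def rank_mask(mask, method):
    """Rank the flattened mask, cast to int32, and restore the 2D shape."""
    ranked = stats.rankdata(mask.ravel(), method=method)
    return ranked.astype("int32").reshape(mask.shape)


results = {m: rank_mask(test_mask, m) for m in ("average", "min", "dense")}

for method, ranked in results.items():
    n_labels = np.unique(ranked).size
    # Labels are contiguous only when the largest label equals the label count.
    print(method, ranked[0, :4], "contiguous:", ranked.max() == n_labels)
```

Only "dense" yields a maximum label equal to the number of distinct labels, which is the contiguity property discussed above.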

If we change the test data to be a random matrix of integers like below:

        ....
        # Initialize a new generator - set seed for reproducibility
        rng = np.random.default_rng(12345)

        test_mask = rng.integers(low=0, high=1000, size=(10, 10))
        ....

Then for method = "dense" we get the output below:

[[68 20 77 29 16 78 60 65 93 37]
 [81 33 51 53 18 14 21 64 55 91]
 [70 22 87 92 73 63 12  8 26 38]
 [ 5 84 43 67 17 30  9 74 76 19]
 [71  6 37 13 75 35 42 41 44 26]
 [50 80 46 15  2 11  6  7 10 53]
 [79 82 61 54 32 89 59 72 74 83]
 [69 88 47 48 23 90 49 45 29 27]
 [58 39 52 62 85 31 66 86 40 24]
 [28 34 57 25 81 36  4  1  3 56]]

Tied values share the same rank: for example, the rank $6$ appears twice in the output above.
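
An assertion-based check could replace the print statements. The sketch below is one possible shape for such a test, not the test that was eventually pushed: `dense_relabel` and `check_dense_relabel` are hypothetical names, and the cross-check against `np.unique` is an added technique that does not appear in this thread.

```python
import numpy as np
from scipy import stats


def dense_relabel(mask):
    """Relabel a mask with dense ranks, keeping the original shape."""
    ranked = stats.rankdata(mask.ravel(), method="dense")
    return ranked.astype("int32").reshape(mask.shape)


def check_dense_relabel():
    # Same seed and value range as the random example above.
    rng = np.random.default_rng(12345)
    mask = rng.integers(low=0, high=1000, size=(10, 10))
    relabeled = dense_relabel(mask)

    # Labels must be exactly 1..k, where k is the number of distinct input values.
    n_unique = np.unique(mask).size
    assert np.array_equal(np.unique(relabeled), np.arange(1, n_unique + 1))

    # Cross-check: the inverse indices from np.unique, shifted by 1, are the
    # dense ranks, so equal inputs always map to equal labels.
    _, inverse = np.unique(mask.ravel(), return_inverse=True)
    assert np.array_equal(relabeled.ravel(), inverse + 1)
    return relabeled


relabeled = check_dense_relabel()
```

The same two assertions would apply to a mask decoded from bytes, once the TIFF round trip matches what deepcell returns.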

@alex-l-kong
Contributor

alex-l-kong commented Sep 6, 2022

@srivarra it looks like seaborn has released a new version that is causing the test errors; relplot appears to have changed in this release. Probably worth taking some time to investigate.

@srivarra srivarra marked this pull request as ready for review September 7, 2022 23:31
Member

@ngreenwald ngreenwald left a comment

Looks good, just some minor suggestions. Also looks like some extraneous files got added in spatial_enrichment_input_data

.gitignore (outdated, resolved)
ark/utils/data_utils.py (outdated, resolved)
templates_ark/1_Segment_Image_Data.ipynb (outdated, resolved)
templates_ark/1_Segment_Image_Data.ipynb (outdated, resolved)
Member

@ngreenwald ngreenwald left a comment

Looks good, just some path changes

Member

@ngreenwald ngreenwald left a comment

Great, looks good

@ngreenwald ngreenwald merged commit da41621 into main Sep 8, 2022
@ngreenwald ngreenwald deleted the example_dataset branch September 8, 2022 22:27