# Saxelab Skills Challenge

Hi! These are Cherrie Chang's responses to the skills challenge (excluding the final challenge), submitted as part of the application process for the Saxelab's technical associate role. I had a lot of fun doing this! I hope it is also at least a little bit fun to read :D
[Here is a link to the Colab version of the same notebook](https://colab.research.google.com/drive/1AXpqjMXX_S8dbifNXU5cg2Mn6iyu2Iv1?usp=sharing)

## Balancing Experimental Design

### Description

Consider a study with the following structure:

```
const factorA = [1, 2, 3, 4]; // 4 levels
const factorB = ['X', 'Y'];   // 2 levels
const factorC = ['L', 'R'];   // 2 levels
```
Write code in any language (or detailed pseudocode) to create a trial randomization system where: (1) Factor A is a between-subject factor (each subject experiences only one level) (2) Factor B and Factor C are within-subject factors (each subject experiences all combinations).

Your randomization system should:

1. Assign each subject to exactly one level of Factor A
2. Ensure each subject experiences all combinations of Factor B and Factor C (4 total)
3. Randomize the presentation order of these 4 conditions differently for each subject
4. Track which randomization sequences have been assigned to previous subjects
5. Maintain approximately equal distribution of subjects across the four levels of Factor A

_Tasks (by priority)_
1. A brief explanation of how you would approach this problem
2. Code or pseudocode implementing the mixed design
3. Discussion of potential constraints or considerations for your randomization algorithm
4. How your solution addresses potential confounds like order effects or practice effects



---



### Response

#### Approach

My approach to this problem is as follows:


1.   **Think about what a script/program handling this would look like in a real-life lab workflow:** At its most basic, this program would have to scale to handle any number of subjects. It should output a data format that is lightweight yet understandable without extra context. I also made the (premature) assumption that the assignment of Factor A levels should follow a randomized sequence across subjects as well.
2.   **Decide what the input and output for my solution function would be:** Accordingly, I decided to implement this solution as a Python function that takes in an array of subject IDs and outputs a dictionary mapping each subject ID to an array of 4 dictionaries. Each dictionary contains three labelled key-value pairs: the level value for factor A (`'factorA'`), which is the same in all 4 dictionaries; and the respective level values for factors B and C (`'factorB'` and `'factorC'`), which together cover all 4 possible combinations across the 4 dictionaries. I contemplated whether a Pandas dataframe would be a better container for this information, but decided against it since this is not a data analysis step, and a Pandas-based implementation would be slower and bulkier than NumPy.
3.   **Think about what algorithm would be more time- and space-efficient:** I first considered more naive solutions like using Python loops to generate all the subject ID-condition combinations. I then refined this idea: the process could possibly be made faster by generating independent, right-sized vectors for each type of value (`'subjectId'`, `'factorA'`, etc.) using built-in NumPy vector tools, and then stitching them together.


#### Algorithm Iteration#1: Randomize Factor A assignment sequence across subjects as well

In [None]:
# Imports
from random import choices
import numpy as np
from collections import Counter, defaultdict

factorA = [1, 2, 3, 4]  # 4 levels (between-subject factor)
factorB = ['X', 'Y']    # 2 levels (within-subject factor)
factorC = ['L', 'R']    # 2 levels (within-subject factor)
# Make array of the 4 possible combinations of Factor B and Factor C
factorBC_combos = [[b, c] for b in factorB for c in factorC]

def randomize_trials(subjectIds):
    """
    Implements a mixed design randomization system with:
    - Factor A as between-subject (each subject gets one level)
    - Factors B and C as within-subject (each subject gets all combinations)

    Parameters:
    - subjectIds: array-like, IDs of subjects to assign conditions to

    Returns:
    - Dictionary mapping each subject ID to their condition sequence
    """
    # Step 1: Create balanced assignment of Factor A across all subjects
    # Maximize distribution equality of Factor A levels by dividing length of subjectIds by length of factorA, then tiling factorA by the quotient, finally filling in the remainder factorA-level-to-subjectId assignments by randomly sampling without replacement from factorA.
    facA_quotient, facA_remainder = divmod(len(subjectIds), len(factorA))
    # Create base array with equal distribution
    factorA_expanded = np.tile(factorA, facA_quotient)
    # Randomly assign remaining Factor A levels for perfect equal distribution
    factorA_expanded = np.append(factorA_expanded, np.random.choice(factorA, size=facA_remainder, replace=False))
    # Shuffle to randomize which subject gets which Factor A level
    np.random.shuffle(factorA_expanded)

    # Print distribution of Factor A levels to verify equal distribution
    print("Frequency of each factorA level:")
    print({key: Counter(factorA_expanded)[key] for key in factorA})

    # Step 2: Create condition sequences for each subject
    # For each subject, assign one Factor A level and all Factor B-C combinations
    subjectId_condition_map = defaultdict(list)
    for sid, a in list(zip(subjectIds, factorA_expanded)):
        # Make a copy to prevent modifying the original combinations
        factorBC_copy = np.copy(factorBC_combos)
        # Shuffle Factor B-C combinations to create a unique presentation order for this subject ID (sid)
        np.random.shuffle(factorBC_copy)
        # Assign each B-C combination to this subject while keeping their Factor A level constant
        for b, c in factorBC_copy:
            subjectId_condition_map[sid].append({'factorA': a, 'factorB': b, 'factorC': c})

    return subjectId_condition_map

##### Example Usage

In [None]:
# Generate sample subject IDs
sample_subjects = np.arange(1, 103)  # 102 subjects

# Run the randomization
subject_condition_map = randomize_trials(sample_subjects)

# Display example of the first 5 subjects' assignments
print("\nExample condition assignments:")
for subject_id in list(subject_condition_map.keys())[:5]:
    print(f"\nSubject {subject_id}:")
    for i, condition in enumerate(subject_condition_map[subject_id]):
        print(f"  Trial {i+1}: Factor A = {condition['factorA']}, Factor B = {condition['factorB']}, Factor C = {condition['factorC']}")

Frequency of each factorA level:
{1: 26, 2: 26, 3: 25, 4: 25}

Example condition assignments:

Subject 1:
  Trial 1: Factor A = 4, Factor B = Y, Factor C = L
  Trial 2: Factor A = 4, Factor B = X, Factor C = R
  Trial 3: Factor A = 4, Factor B = Y, Factor C = R
  Trial 4: Factor A = 4, Factor B = X, Factor C = L

Subject 2:
  Trial 1: Factor A = 1, Factor B = Y, Factor C = R
  Trial 2: Factor A = 1, Factor B = X, Factor C = R
  Trial 3: Factor A = 1, Factor B = Y, Factor C = L
  Trial 4: Factor A = 1, Factor B = X, Factor C = L

Subject 3:
  Trial 1: Factor A = 3, Factor B = X, Factor C = L
  Trial 2: Factor A = 3, Factor B = Y, Factor C = R
  Trial 3: Factor A = 3, Factor B = X, Factor C = R
  Trial 4: Factor A = 3, Factor B = Y, Factor C = L

Subject 4:
  Trial 1: Factor A = 1, Factor B = Y, Factor C = L
  Trial 2: Factor A = 1, Factor B = X, Factor C = R
  Trial 3: Factor A = 1, Factor B = X, Factor C = L
  Trial 4: Factor A = 1, Factor B = Y, Factor C = R

Subject 5:
  Trial 1: Fac

##### Discussion on order effects

This algorithm reduces confounds from order effects by randomizing both the factor A assignment and the factor B-C combination sequence across subjects, so that no two subjects receive the same condition sequence (same combination of factors A, B and C) unless by chance. This prevents the formation of group-wide patterns from subjects receiving the same condition sequences. Randomization of factor A also prevents any residual within-subject order effects from being confounded with the between-subjects Factor A, such as all subjects assigned `'factorA' = 1` experiencing the same order effects.
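The task also asks to track which randomization sequences have been assigned. A minimal sketch of one way to do this on the dictionary returned by `randomize_trials` is to tally the B-C presentation orders. The `tally_presentation_orders` and `make_trials` names and the `sub-01` style IDs are hypothetical, introduced only for this illustration. Since four conditions have only 24 possible orders, repeats are expected in any sample larger than 24 subjects, and a tally makes them visible.

```python
from collections import Counter

def tally_presentation_orders(condition_map):
    """Count how many subjects received each B-C presentation order."""
    order_counts = Counter()
    for trials in condition_map.values():
        # Reduce each subject's trial list to a hashable tuple of (B, C) pairs
        order = tuple((trial['factorB'], trial['factorC']) for trial in trials)
        order_counts[order] += 1
    return order_counts

def make_trials(a_level, order):
    """Hypothetical helper: build a trial list in the randomize_trials format."""
    return [{'factorA': a_level, 'factorB': b, 'factorC': c} for b, c in order]

# Small hand-written example (not real output from the notebook)
order_1 = [('X', 'L'), ('Y', 'R'), ('X', 'R'), ('Y', 'L')]
order_2 = [('Y', 'L'), ('X', 'L'), ('Y', 'R'), ('X', 'R')]
example_map = {
    'sub-01': make_trials(1, order_1),
    'sub-02': make_trials(2, order_2),
    'sub-03': make_trials(3, order_1),  # same order as sub-01
}

order_counts = tally_presentation_orders(example_map)
for order, count in order_counts.most_common():
    print(count, order)
```

The same tally could be stored alongside the subject ID-condition map so that later assignments can be checked against earlier ones.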

One improvement to further reduce order effects is to counterbalance the B-C combination assignments. Each B-C combination (e.g. `[X, L]`) would appear in each position (first, second, third, or fourth) in the 4 trial conditions for a subject with equal frequency across subjects.
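The counterbalancing idea above can be sketched with a simple cyclic Latin square, in which each row is the list of B-C combinations rotated by one position. This is only one possible scheme, not the one implemented in this notebook; `latin_square_rows` and `counterbalanced_order` are hypothetical names, and assigning rows by a running subject index is an assumption.

```python
from collections import Counter
from itertools import product

factorB = ['X', 'Y']
factorC = ['L', 'R']
combos = list(product(factorB, factorC))  # the 4 B-C combinations

def latin_square_rows(items):
    """Build a cyclic Latin square: row r is `items` rotated left by r."""
    n = len(items)
    return [[items[(row + pos) % n] for pos in range(n)] for row in range(n)]

def counterbalanced_order(subject_index, items=combos):
    """Pick the Latin square row for the subject at this running index."""
    rows = latin_square_rows(items)
    return rows[subject_index % len(rows)]

# With 8 subjects, each combination lands in each position exactly twice
orders = [counterbalanced_order(i) for i in range(8)]
position_counts = [Counter(order[pos] for order in orders) for pos in range(4)]
print(position_counts[0])
```

A cyclic square balances positions but not which condition follows which, so a design that also balances immediate carryover would be a further refinement.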

##### Potential constraints and considerations

After writing this algorithm, I revisited my thinking about how this program would fit into a real-life experiment, and realized that it would be highly unlikely to have the total set of subjects ready from the start. My motivation for this assumption in the first place stemmed mostly from a misunderstanding that the factor A levels had to be randomly (and equally) assigned as well, an assumption that does not make much sense for a between-subject factor. This is because in most studies, especially those run online, the conditions for one subject typically do not affect those of another. It would therefore be reasonable to modify the function to generate (or fetch) the conditions assigned to a subject while maintaining equal distribution of Factor A levels by keeping a global index that cycles through the `factorA` array:

#### Algorithm Iteration#2: Cycling through `factorA` instead to allow dynamic, real time assignment

In [None]:
# Imports
import numpy as np

factorA = [1, 2, 3, 4]  # 4 levels (between-subject factor)
factorB = ['X', 'Y']    # 2 levels (within-subject factor)
factorC = ['L', 'R']    # 2 levels (within-subject factor)

# Make array of the 4 possible combinations of Factor B and Factor C
factorBC_combos = [[b, c] for b in factorB for c in factorC]
# Index tracking which level in the factorA array to assign to the next subject
factorA_index = 0
# Dictionary keeping track of the condition sequences each subject was assigned
subjectId_condition_map = {}

def modified_randomize_trials(subjectId):
    """
    Implements a mixed design randomization system that outputs the 4 trial conditions for a subject with the following assignment rules for factors A, B and C:
    - Factor A as between-subject (each subject gets one level)
    - Factors B and C as within-subject (each subject experiences all combinations)

    Parameters:
    - subjectId: ID of the subject to assign conditions to

    Returns:
    - List of 4 condition dictionaries for this subject, in presentation order
    """
    # The global declaration must come before the first use of factorA_index in
    # this function; otherwise Python raises a SyntaxError
    global factorA_index
    if subjectId not in subjectId_condition_map:
        # Shuffle the B-C combinations so each subject gets their own presentation order
        shuffled_combos = [factorBC_combos[i] for i in np.random.permutation(len(factorBC_combos))]
        subjectId_condition_map[subjectId] = [{'factorA': factorA[factorA_index], 'factorB': b, 'factorC': c} for b, c in shuffled_combos]
        # Advance to the next Factor A level, wrapping around at the end of the array
        factorA_index = (factorA_index + 1) % len(factorA)

    return subjectId_condition_map[subjectId]

A drawback to this implementation is that group-level order effects are now introduced as a confound of Factor A, since it is cycled in order across subjects. This can be especially problematic in a study where there is possible communication between subjects, e.g. an in-person study in a university, where a subject may gain knowledge of their factor A level assignment prior to their experiment trials; or if the cycling consistently coincides with other systematic factors, like the time and location of the trials.
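One way to soften this drawback without giving up real-time assignment is permuted-block randomization, a technique that is not part of the notebook's implementation. The sketch below assumes blocks the size of `factorA`, and the `BlockRandomizer` class is a hypothetical name. The levels are shuffled within each block of four subjects, so the order is unpredictable inside a block while the level counts never differ by more than one.

```python
import random

factorA = [1, 2, 3, 4]  # 4 levels (between-subject factor)

class BlockRandomizer:
    """Hands out Factor A levels in independently shuffled blocks."""

    def __init__(self, levels, seed=None):
        self.levels = list(levels)
        self.rng = random.Random(seed)  # seed only for reproducible demos
        self.current_block = []

    def next_level(self):
        # Start a freshly shuffled block once the previous one is used up
        if not self.current_block:
            self.current_block = self.levels[:]
            self.rng.shuffle(self.current_block)
        return self.current_block.pop()

randomizer = BlockRandomizer(factorA, seed=0)
assigned = [randomizer.next_level() for _ in range(10)]
print(assigned)
```

In `modified_randomize_trials`, a call like `randomizer.next_level()` could stand in for the cycling index, under the assumption that the randomizer's state is persisted between subjects in the same way as `factorA_index`.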

#### Algorithm Iteration#3: Alternative design when a server is available; achieving both factor A randomization and real time condition assignment

Alternatively, an algorithm design that *does* randomize the sequence of factor A level assignments while ensuring their equal distribution, and that also assigns the conditions dynamically/at runtime for each new subject, can be achieved with the help of a live server. We can first decide on the total number of subjects we wish to recruit (e.g. determined by budget), then call the `randomize_trials` function we defined in the first version of the algorithm to generate a subject ID-condition mapping that randomizes and equally distributes factor A levels. For an experiment that aims to recruit 79 people, for instance, we would run this:

In [None]:
stub_subjectIds = np.arange(1, 80) # Generates 79 stub subject IDs (1 to 79; the stop value is exclusive) for the "bank" subject ID-condition map initialized in the server
server_subjectId_condition_map = randomize_trials(stub_subjectIds)
real_subjectId_condition_map = {} # Initialize an empty dictionary that will hold the actual subject ID-condition mappings assigned in real time as the experiment is run on recruited subjects

And save `server_subjectId_condition_map` on the live server before recruiting for subjects. When a subject loads/signs up for the experiment, we randomly choose a mapping from `server_subjectId_condition_map`, remove it from the dictionary, assign it to the subject and record this assignment in `real_subjectId_condition_map` like so:

In [None]:
example_subjectId = "sub-SAXEEMOfd05" # An example subject ID that is used to identify a subject in the actual experiment trials, e.g. from Prolific; or assigned in person
real_subjectId_condition_map[example_subjectId] = server_subjectId_condition_map.pop(np.random.choice(list(server_subjectId_condition_map.keys()))) # Pops a random subjectId-condition mapping from the starting dictionary in the server and updates a global dictionary to hold the pairing between the actual subject ID and condition

The drawbacks of this iteration include:

*   Large overhead of setting up a live server
*   It requires knowing the total number of subjects to recruit in advance, and caps recruitment at that number

## Understanding Lab Standard Scripts

### Description

We use [heudiconv](https://github.com/nipy/heudiconv) in a singularity container to convert fMRI data to BIDS format based on the experimental design. Please read through the scripts in [our public lab standard script](https://github.com/saxelab-mit/lab/tree/main/fmri_analysis_pipeline/template_project_dir/scripts/1_heudiconv_scripts), and try to understand what this step entails. Then, choose one of the bash scripts and explain what the script does in detail, and how you would run it on command line.



---



### Response

#### Overview of `1_heudiconv_scripts` and what this step entails

**Overview:**

The `1_heudiconv_scripts` directory is the first step in the Saxelab's fMRI analysis pipeline, where fMRI data is converted to BIDS format. Together, the scripts in this directory provide a workflow that runs `heudiconv` in an Apptainer (Singularity) container to convert a folder of fMRI data in the form of DICOM (.dcm) files to a folder holding the same data organized in BIDS format.

The workflow works like this: `heuristic_files/heudi.py` defines naming and organization rules for how the fMRI DICOM files in the `dicoms/` folder are to be mapped to BIDS format. For example, the `infotodict` function distinguishes between runs based on protocol name, image type, the number of time-points and motion correction, and appends each run to the appropriate group accordingly. This heuristic file is used by Heudiconv in `heudiconv_single_subject.sh`, which is best run via the `submit_heudiconv_array.sh` wrapper, to generate a BIDS-compliant, organized directory of NIfTI (.nii.gz) files for one subject or an array of subjects.

**Why this is important:**

fMRI data files tend to be large, complex and numerous, resulting in different naming and organization schemes across labs and even within a lab. For instance, a lab may pull functional MRI data files, which consist of time series of 2D images showing brain activity via BOLD signals (func), from an fMRI scan of a subject and append all of them under a folder labelled with the subject ID in the `dicoms/` directory, with no standardized naming conventions for how the folder name (the subject ID) or the file names are written.

This folder organization would differentiate between subjects, but not between meaningful, distinct modalities of data from different types of MRI scans that may all be present within a subject's data folder. These modalities include:
- **anat**: anatomical image data showing static, structural features of the brain
- **dwi**: diffusion-weighted imaging data that maps water diffusion in white matter
- **perf**: perfusion data that measures blood flow in the brain ([source](https://bids.neuroimaging.io/getting_started/folders_and_files/folders.html))

Forming a full understanding of the brain's activity of a single subject, and how it compares across subjects, depends heavily on post-processing steps that enable structured integration of data from all of these modalities, like aligning the high-temporal resolution functional data to the high-spatial resolution anatomical data to ground observations of dynamic brain activity in detailed descriptions of their spatial location.

For reasons ranging from efficient collaboration in the lab to sharing data with the scientific community, it is important to make these files findable, accessible, interoperable and reusable (FAIR!) by standardizing their naming and organization to differentiate between these standard modalities. This is the purpose of converting to BIDS format, a popular standardization of neuroimaging data that can enforce naming conventions; structurally break down each subject's data folder into anat, func, dwi and perf folders; and include sidecar files like a `.tsv` file detailing protocol-related information and JSON files that store metadata like scan acquisition details ([source](https://brainvoyager.com/bv/doc/UsersGuide/GettingStarted/NIfTIAndJSONSidecarFiles.html)). This directory's work of converting fMRI data to BIDS format is therefore an important and valuable workflow that aligns with the Saxelab's commitment to open science.

#### Explanation of `heudiconv_single_subject.sh`

This script contains the core logic of converting DICOM files to BIDS format using Heudiconv, run in an Apptainer container. It is designed to take in 2 arguments -- the path to a heuristic file and a subject ID -- and process the DICOM files from the `dicoms` directory for that subject. Below, I break down the script into 4 conceptual sections and explain the functionality of each section, with references to the code via line numbers in square brackets.

> **Note:** *To enable line numbers for code cells in a Jupyter notebook (.ipynb), set `{"codeCellConfig": {"lineNumbers": true}}` in settings*

##### Bash script setup and SLURM parameters

In [None]:
#!/bin/bash -l


#SBATCH -J heudiconv
#SBATCH -t 1:00:00
#SBATCH --mem=10GB
#SBATCH --cpus-per-task=1
#SBATCH --partition=saxelab

#purpose: run heudiconv for one subject


# THIS SCRIPT IS INTENDED TO BE RUN THROUGH THE WRAPPER SCRIPT submit_heudiconv_array.sh
# THE JOB ARRAY WRAPPER SCRIPT CAN ALSO BE USED FOR SINGLE SUBJECT

This section starts with a shebang (`#!`) that sets up the script to be executed using a Bash login shell [`1`]. This is followed by `#SBATCH` directives [`4-8`], which are parsed by SLURM, the job scheduler that executes this script, as options for the job when the script is submitted with `sbatch` (here, by the wrapper script). The directives specify the following information:

*   The name to call this SLURM job [`4`]
*   The maximum time allowed for the whole job [`5`]
*   The real memory required for each allocated node (computational unit) in the cluster that executes this job [`6`]
*   How many processors each task, and therefore each node, should have [`7`]
*   The partition (group of nodes) to use to execute the job [`8`]

Next, it includes a documentation comment detailing the script's purpose [`10`] and a note that warns users that it is designed to run via the `submit_heudiconv_array.sh` wrapper [`13-14`].

##### Constants for `sbatch` call

In [None]:
heudi_file=$1
study_root=`cat ../PATHS.txt`


subjs=("${@:2}")

This section assigns variables that are used later in the script:

*   `heudi_file` is assigned the first argument provided in the call to this script. This is intended to be the path to the heuristic file detailing the naming and file organization conventions for mapping DICOM files to BIDS format [`1`]
*   `study_root` reads the contents of `../PATHS.txt`. This file is intended to contain the path to the root directory of the project this script is run on, and should be modified for each new project [`2`]
*   `subjs` includes all arguments after the first provided in the call. It is intended to hold the array of subject IDs to process [`5`].

##### Apptainer environment setup

In [None]:
source /etc/profile.d/modules.sh
module use /cm/shared/modulefiles
module load openmind8/apptainer/1.1.7

This section initializes the Environment Modules system [`1`], which allows dynamic configuration of the shell environment via the `module` interface. The next line adds the `/cm/shared/modulefiles` directory to the `MODULEPATH` environment variable so that the module files in it can be loaded [`2`]. It then loads the Apptainer module (v1.1.7) from the `openmind8` collection on MIT's OpenMind computing cluster, which makes the container runtime available so that heudiconv can be run in a controlled environment [`3`].

##### Executing the script in the Apptainer container

In [None]:
subject=${subjs[${SLURM_ARRAY_TASK_ID}]}

echo "Submitted job for: ${subject}"

singularity exec -B /om3:/om3 -B /cm:/cm -B /om2:/om2 -B /om:/om -B /mindhive:/mindhive -B /nese:/nese $study_root/singularity_images/heudiconv_0.9.0.sif \
/neurodocker/startup.sh heudiconv \
-d $study_root/data/dicoms/{subject}/dicom/*.dcm \
-s $subject -f $study_root/scripts/1_heudiconv_scripts/$heudi_file \
-c dcm2niix -o $study_root/data/BIDS \
-b --minmeta --overwrite

Finally, this section runs heudiconv in the Apptainer container. It fetches the current subject to process and assigns it to `subject` by indexing the `subjs` array of all subject IDs using `SLURM_ARRAY_TASK_ID`, an environment variable that SLURM sets for each task in a job array [`1`]. It then prints a line to the job's output recording which subject this job was submitted for [`3`]. Lastly, it runs heudiconv in the container via `singularity exec` with the following arguments provided [`5-10`]:

*   `-B /om3:/om3 ... -B /nese:/nese`: These are bind mounts that mount the listed external directories onto the container, so that they can be read and modified from within the container [`5`]
*   `$study_root/singularity_images/heudiconv_0.9.0.sif`: This specifies the path to the Apptainer container image to run Heudiconv, including the needed packages, dependencies, etc [`5`]
*   `/neurodocker/startup.sh`: I couldn't find this in the [NeuroDocker GitHub](https://github.com/ReproNim/neurodocker/tree/master), but it is likely a setup script inside the image that configures and initializes the container environment before running the command that follows [`6`]
*   `heudiconv`: Runs heudiconv [`6`]



And the following options for heudiconv:

*   `-d $study_root/data/dicoms/{subject}/dicom/*.dcm`: Template path to the subject's DICOM files in the project, where heudiconv fills in the `{subject}` placeholder with the ID passed via `-s`. This requires the project directory to follow this structure [`7`]
*   `-s $subject`: The subject ID to run heudiconv on [`8`]
*   ` -f $study_root/scripts/1_heudiconv_scripts/$heudi_file`: Path to the heuristic file [`8`]
*   `-c dcm2niix`: Converter to use under the hood [`9`]
*   `-o $study_root/data/BIDS`: Output directory path for the converted files [`9`]
*   `-b`: Specify to use BIDS format [`10`]
*   `--minmeta`: Lessen output file size by excluding dcmstack metadata from accompanying output JSON files [`10`]
*   `--overwrite`: Allow existing content in output files to be overwritten [`10`]
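The way the script picks one subject per array task can be mirrored in a few lines of Python. This is only an illustrative analogue of the Bash logic (`heudi_file=$1`, `subjs=("${@:2}")`, `${subjs[${SLURM_ARRAY_TASK_ID}]}`), not part of the lab's scripts; `select_subject` is a hypothetical name and the argument values are example inputs.

```python
def select_subject(args, task_id):
    """Mimic the script's argument handling for one job-array task.

    args: the arguments after the script name (i.e. "$@" in Bash)
    task_id: the integer value of SLURM_ARRAY_TASK_ID for this task
    """
    heudi_file = args[0]   # heudi_file=$1
    subjs = args[1:]       # subjs=("${@:2}")
    # Bash arrays are 0-indexed, so task 0 gets the first subject ID.
    # Unlike Bash, which expands an out-of-range index to an empty string,
    # Python raises an IndexError here.
    return heudi_file, subjs[task_id]

example_args = ['heuristic_files/heudi.py', 'SAXE_EMOfd_20', 'SAXE_EMOfd_32']
for task_id in range(2):
    heudi_file, subject = select_subject(example_args, task_id)
    print(f"task {task_id}: {subject} using {heudi_file}")
```

Each array task therefore runs the same script with the same arguments and differs only in the index it reads from the environment.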

#### Calling `heudiconv_single_subject.sh` from the command line

This script is intended to be run via the `submit_heudiconv_array.sh` wrapper script. To run it on the command line, I would first need to be in a properly set up SLURM-based HPC environment with access to the `saxelab` partition and write access to the output BIDS directory. Then, I would modify `PATHS.txt` to point to the root of my project, `cd` to the `1_heudiconv_scripts` directory, and run `./submit_heudiconv_array.sh` to process all subjects in the `$study_root/data/dicoms/` directory, or pass a list of subject IDs as arguments to the call, e.g. `./submit_heudiconv_array.sh SAXE_EMOfd_20 SAXE_EMOfd_32`. The `sbatch --array=0-$len $proj/scripts/1_heudiconv_scripts/heudiconv_single_subject.sh $heudifile ${subjs[@]}` call in the `submit_heudiconv_array.sh` wrapper script is able to handle these two options by dynamically determining the indices of the submitted job array.

#### Statement of genAI use

I used genAI in two main ways:

1.   **Understanding the structure of `template_project_dir` and `MAVE`**: I cloned the `lab` repo, opened it in VSCode and typed in "Explain the structure of this repository" in the Copilot chat (Claude 3.7 Sonnet Thinking) for both directories. This was to get a quick high-level understanding of what each directory does and zero in on the relevant information. Reading the generated answers helped me quickly realize that `template_project_dir` is a template directory for analyzing data in a typical fMRI project, and that `MAVE` contains audiovisual stimuli that can be used in an fMRI experiment. Based on this, I was able to understand `1_heudiconv_scripts` as a first step in the pipeline that operates fairly independently, allowing me to then focus on interpreting it without worrying about dependencies on the rest of the directory.
2.   **Modifying the Bash scripts to be run locally**: One of my first instincts for understanding the Bash scripts in `1_heudiconv_scripts` was to run them locally on a set of [example DICOMS](https://github.com/datalad/example-dicom-functional) that I had found. I used Copilot Edits to modify the wrapper so that I could run it locally from my Mac terminal, and it replaced the `sbatch` call in `submit_heudiconv_array.sh` with a for loop that runs the job using `bash`. I repeated the same modifications in the `heudiconv_single_subject.sh` file when running the modified wrapper file in the terminal still resulted in an `sbatch command not found` error. Running the wrapper again after this additional modification, I then got an error saying that the container image to be used to initialize the container could not be found. I made the guess that these scripts, designed to run in an HPC cluster at MIT, would require a lot of files I do not have access to, so I dropped this approach.


## Approaching an Issue

### Description

Say you just ran a script on your computing cluster and it produced this error:

```
FATAL:   container creation failed: mount
/proc/self/fd/14->/rdma/vast-rdma/vast-home/cm/shared/openmind/singularity/singularity-3.4.1/va r/singularity/mnt/session/rootfs
error: can’t mount image /proc/self/fd/14:
kernel reported a bad superblock for squashfs image partition, possible causes are that your kernel doesn’t support the compression algorithm or the image is corrupted
```

What information could you get from the error? How would you approach resolving this issue?

One option you could consider is approaching others for help, e.g. the cluster administrator, or someone you know who has experience with running singularity containers. How would you format your question? What information would you provide?



---



### Response

#### Step 0: Information I get from this error

Based on what I learned from *Understanding Lab Standard Scripts*, I have a vague sense from reading the error that it is related to mounting the Singularity container image, which the error refers to as `/proc/self/fd/14`. The error offers two possible causes: either my "kernel doesn't support the compression algorithm", which may refer to the algorithm used to compress the image and reduce its file size when it was built; or the image is corrupted and therefore cannot be mounted. A container image is a software bundle that encapsulates the components (libraries, dependencies, etc.) needed to initialize a container, so it is reasonable that failing to mount the image causes the container creation process to throw a `FATAL` error and terminate. Since `/proc/self/fd/` is the standard directory in Linux that holds symlinks to the files opened by the current process, my guess is that `/proc/self/fd/14` is the image file that Singularity has already opened, and that the arrow (`->`) points from this mount source to the mount target. The target path, `/rdma/vast-rdma/vast-home/cm/shared/openmind/singularity/singularity-3.4.1/va r/singularity/mnt/session/rootfs`, appears to have been broken by a stray space (`va r` is most likely `var`), so I read it as a `var/singularity/mnt/session/rootfs` directory inside the cluster's Singularity 3.4.1 installation. I am less certain about its exact role, but my guess is that it is where Singularity mounts the image's root filesystem for the current container session.

#### Step 1: Google the terms I don't know

My first step is to Google the term I don't know from this error -- "`squashfs`". I learned that `squashfs` is a "read-only file system for Linux. Squashfs compresses files, inodes and directories, and supports block sizes from 4 KiB up to 1 MiB for greater compression" ([source](https://en.wikipedia.org/wiki/SquashFS)). This gives a clue that perhaps the compression algorithm used by the image, which has been identified as a possible cause of the error, is specified in `squashfs`.

(I also googled "`superblock`", a term I should definitely have remembered from undergrad...)
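To make the "superblock" and "compression algorithm" terms concrete, here is a minimal sketch that reads the compression ID out of a SquashFS superblock. It assumes the SquashFS 4.x on-disk layout (little-endian, magic bytes `hsqs` at offset 0, compressor ID at byte offset 20, version at offsets 28 and 30) and a raw SquashFS file; a Singularity `.sif` image wraps its SquashFS partition at some offset inside the file, so this would not work directly on a `.sif` file. `describe_superblock` is a hypothetical helper name, and the header below is synthetic.

```python
import struct

SQUASHFS_MAGIC = b'hsqs'
# Compressor IDs as defined for SquashFS 4.x
COMPRESSORS = {1: 'gzip', 2: 'lzma', 3: 'lzo', 4: 'xz', 5: 'lz4', 6: 'zstd'}

def describe_superblock(header):
    """Return basic superblock fields, or None if the magic bytes are wrong."""
    if len(header) < 32 or header[:4] != SQUASHFS_MAGIC:
        return None  # not a SquashFS superblock (or a damaged one)
    (block_size,) = struct.unpack_from('<I', header, 12)
    (compressor_id,) = struct.unpack_from('<H', header, 20)
    major, minor = struct.unpack_from('<HH', header, 28)
    return {
        'block_size': block_size,
        'compression': COMPRESSORS.get(compressor_id, f'unknown ({compressor_id})'),
        'version': f'{major}.{minor}',
    }

# Synthetic 32-byte header for illustration: xz compression, version 4.0
fake_header = struct.pack('<4sIIIIHHHHHH', b'hsqs', 10, 0, 131072, 0, 4, 17, 0, 1, 4, 0)
print(describe_superblock(fake_header))
```

A missing magic number would point toward corruption, while a valid superblock with a compressor the running kernel was not built with would point toward the "unsupported compression algorithm" branch of the error.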

#### Step 2: Google the generic line about the error (error heading?)

Next, I Google `container creation failed: mount`. These GitHub issues looked relevant:


*   ~~[Singularity container creation failed](https://github.com/hpcng/singularity/issues/2282)~~: turned out less relevant
*   [error: can't mount image /proc/self/fd/3: failed to mount squashfs filesystem: invalid argument](https://github.com/apptainer/singularity/issues/5408): the error is similar up to that point but ends with `failed to mount squashfs filesystem: invalid argument`; was fixed in v^3.3.0
*   ~~[container creation failed: mount hook function failure](https://github.com/apptainer/singularity/issues/6494)~~: no response

I then went to the [Singularity GitHub issues page](https://github.com/apptainer/singularity/issues?q=is%3Aissue%20%22container%20creation%20failed%3A%20mount%22%20) and typed in "container creation failed: mount", then "container creation failed: mount /proc/" and scanned the results, finding these links:

*   ~~[failed to mount squashfs filesystem](https://github.com/apptainer/singularity/issues/5626)~~: maintainers asked for additional information but did not get response -> issue went stale
*   [Failed to mount squashfs filesystem: invalid argument](https://github.com/apptainer/singularity/issues/5414): solved with v^3.5.3
*   [Latest Ubuntu 18.04 update: kernel doesn't support the compression algorithm](https://github.com/apptainer/singularity/issues/5466): points to [this issue](https://github.com/epi2me-labs/wf-flu/issues/10), which has the exact same error as us!
*   [Unable to mount squashfs via singularity exec with kernel 5.4.0](https://github.com/apptainer/singularity/issues/4801): also same exact error, fixed in [this PR](https://github.com/apptainer/singularity/pull/4802)

#### Step 3: Update Singularity version based on fixes from GitHub issue threads

Since there are closed GitHub issues that address the exact same error as ours (last two links), I think it is reasonable to follow the fixes detailed in these two threads. The `singularity-3.4.1` component of the path in the error makes me think that we are running Singularity v3.4.1, so I would first update Singularity to the newest version (or v^3.5.3, or migrate to its successor Apptainer v^1.4.0, although this may require more adaptation fixes in the code).

#### Step 4: Searching the error message in Singularity source code


From my Googling results, it seems like this issue has historically been a bug for Singularity to fix and not a client-side issue, so if this same error persists after the version update, it feels unlikely that the problem can be solved by me poring over my code for logical errors. One lead I can think of is looking at the Singularity code -- I entered "superblock" into the search bar in the Singularity GitHub repo and found [only two files](https://github.com/search?q=repo%3Aapptainer%2Fsingularity%20superblock&type=code) referring to it:

*   [pkg/image/squashfs.go](https://github.com/apptainer/singularity/blob/9dceb4240c12b4cff1da94630d422a3422b39fcf/pkg/image/squashfs.go#L25)
*   [internal/pkg/runtime/engine/singularity/container_linux.go](https://github.com/apptainer/singularity/blob/9dceb4240c12b4cff1da94630d422a3422b39fcf/internal/pkg/runtime/engine/singularity/container_linux.go#L800)

The second link contains the exact error message, suggesting that it is printed when `c.rpcOps.Mount` returns the `syscall.EINVAL` error code and the `mountType` is `squashfs`. A quick search on Google and in the `squashfs` repo did not pull up issues or fixes on `squashfs`'s end. Because `squashfs` is likely used by the container image under the hood, it would be difficult and inefficient to replace `squashfs` in the lab's workflow entirely, as it would require modifying the container image, which appears to be third-party.

#### Step 5: Seeking help/Reopening GitHub issue


This is when I would seek help from others, starting with people affiliated with the lab and the cluster administrator; but similarly, I would guess the fix is unlikely to be the lab's/cluster administrator's responsibility.

In light of this, and out of respect for their time, I would not start with a full, verbose description of the error and my progress in investigating it. I would probably just write a short message like "Has anyone tried to run \<this script\> and encountered this error \<error message\>?" with a screenshot of the error attached. This covers the possibility that someone in the lab or the cluster administrator has seen the error before and knows how to fix it.

But if that is not the case, I would reopen [the issue that points out this same error](https://github.com/apptainer/singularity/issues/4801), adding a comment that reports my error and how to recreate it by following [their template for issue submission](https://github.com/apptainer/singularity/blob/9dceb4240c12b4cff1da94630d422a3422b39fcf/.github/ISSUE_TEMPLATE.md?plain=1#L26).
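Whether asking a labmate or commenting on the GitHub issue, a few environment details usually make the question easier to answer. The sketch below gathers some of them in Python; which details are actually useful is an assumption, `collect_diagnostics` is a hypothetical helper name, and the image checksum is only meant for comparison against a known-good copy of the image to test the "image is corrupted" hypothesis.

```python
import hashlib
import platform
import shutil
import subprocess

def collect_diagnostics(image_path=None):
    """Gather kernel version, Singularity version and an optional image checksum."""
    info = {
        'kernel': platform.release(),
        'singularity_version': None,
        'image_sha256': None,
    }
    exe = shutil.which('singularity')  # None if the command is not on PATH
    if exe:
        result = subprocess.run([exe, '--version'], capture_output=True, text=True)
        info['singularity_version'] = result.stdout.strip()
    if image_path:
        digest = hashlib.sha256()
        with open(image_path, 'rb') as image_file:
            # Read in 1 MiB chunks so large images do not need to fit in memory
            for chunk in iter(lambda: image_file.read(1 << 20), b''):
                digest.update(chunk)
        info['image_sha256'] = digest.hexdigest()
    return info

report = collect_diagnostics()
for key, value in report.items():
    print(f"{key}: {value}")
```

Pasting this output together with the exact command that was run and the full error text would cover most of what an issue template typically asks for.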