 # Pulling processed data from Gears
Niall Bourke    
14-11-24     

This script is designed to automate pulling processed data within a Flywheel project, providing real-time updates on progress and estimated completion time. It iterates through all subjects and sessions in a Flywheel project, searches for specific analysis files based on user-defined criteria, downloads these files, and aggregates the data into a single CSV. Additionally, it uploads the final results to the project’s information tab for easy access.

To keep users informed, the script counts the total number of steps (one per session) up front and, as it runs, prints the percentage complete and the estimated time remaining. This is most useful for large projects, where a full pass can take a long time. The walkthrough further down explains how the data is located, downloaded, and saved.


In [5]:
!pip install pathvalidate



### Step 1: Import Required Libraries

In [6]:
import os
import flywheel
from pathlib import Path
import pathvalidate as pv
import pandas as pd
from datetime import datetime
import pytz
import time

### Step 2: Define variables and paths for script

In [9]:
project = fw_project  # fw_project: a Flywheel project object, assumed to be defined earlier in the session

# Filters
gear = 'recon-all-clinical' # recode this as variable that user selects in config
gearVersion = '0.4.0' 
timestampFilter = datetime(2025, 1, 10, 0, 0, 0, 0, pytz.UTC) # Cutoff date (UTC): only analyses created after this are pulled
fileName_tag = 'thickness'


# --- prep output --- #

# Create a work directory in our local "home" directory
work_dir = Path.home() / 'Data'
# Create it (and any missing parents) if it doesn't exist
work_dir.mkdir(parents=True, exist_ok=True)
# Create a custom path for our project (we may run this on other projects in the future) and create if it doesn't exist
project_path = pv.sanitize_filepath(work_dir / project.label / gear, platform='auto')
project_path.mkdir(parents=True, exist_ok=True)


# --- Find the results --- #

In [None]:
import time
import pandas as pd
from pathlib import Path

today_date = datetime.today().strftime('%Y%m%d') 

# Start time
st = time.time()

# Preallocate lists
df_list = []

# Calculate total subjects and sessions
total_subjects = sum(1 for _ in project.subjects.iter())
total_sessions = sum(1 for subject in project.subjects.iter() for _ in subject.sessions.iter())
total_steps = total_sessions
completed_steps = 0

# Iterate through subjects and sessions
for subject in project.subjects.iter():
    subject = subject.reload()
    sub_label = subject.label

    for session in subject.sessions.iter():
        session = session.reload()
        ses_label = session.label
        completed_steps += 1

        # Progress and remaining time
        elapsed_time = time.time() - st
        progress = (completed_steps / total_steps) * 100
        estimated_total_time = elapsed_time / (completed_steps / total_steps)
        remaining_time = estimated_total_time - elapsed_time

        print(f"Processing {sub_label}, {ses_label}")
        print(f"Progress: {progress:.2f}% | Estimated Time Remaining: {remaining_time / 60:.2f} minutes")

        for analysis in session.analyses:
            if (
                analysis.gear_info
                and analysis.gear_info.name == gear
                and analysis.gear_info.version == gearVersion
                and analysis.created > timestampFilter
            ):
                print("Pulling:", gear, gearVersion)
                for analysis_file in analysis.files:
                    if fileName_tag in analysis_file.name:
                        file = analysis_file.reload()

                        # Sanitize and prepare download path
                        download_dir = pv.sanitize_filepath(Path(project_path) / sub_label / ses_label, platform="auto")
                        if not download_dir.exists():
                            download_dir.mkdir(parents=True)
                        download_path = download_dir / file.name
                        print(download_path)

                        # Download file
                        print("Downloading file", ses_label, analysis.label)
                        file.download(download_path)

                        # Read CSV and append to list
                        with open(download_path) as csv_file:
                            results = pd.read_csv(csv_file, index_col=None, header=0)
                            df_list.append(results)

# Save output
try:
    outname = project.label.replace(" ", "_").replace("(", "").replace(")", "")
    filename = f"{outname}-{gear}-{gearVersion}-thickness-{today_date}.csv" 

    # Concatenate DataFrame list and save
    df = pd.concat(df_list, axis=0, ignore_index=True)
    outdir = Path(project_path) / filename
    df.to_csv(outdir)

    # Upload to project info tab
    project.upload_file(str(outdir))
except Exception as e:
    print(f"Failed to save or upload results for {project.label} to project info tab")
    print(f"Error: {e}")

# End time
et = time.time()
print("Execution time:", et - st, "seconds")


---

# Walkthrough: Flywheel Data Processing Script


### Step 1: Import Required Libraries

We start by importing necessary libraries:

```python
import time
import pandas as pd
import os
from pathlib import Path
```

### Step 2: Track Start Time

```python
st = time.time()
```
This stores the script's start time in `st`, so the total runtime can be calculated at the end.

### Step 3: Preallocate Lists for Data Storage

```python
df_list = []
```
This list collects one DataFrame per downloaded file. The DataFrames are concatenated into a single table at the end.

### Step 4: Calculate Total Steps
To estimate completion progress, we first calculate the total number of subject-session combinations. This helps us know how many "steps" the script will take to finish.

```python
total_subjects = sum(1 for _ in project.subjects.iter())
total_sessions = sum(1 for subject in project.subjects.iter() for _ in subject.sessions.iter())
total_steps = total_sessions  # total_sessions already counts sessions across all subjects
completed_steps = 0
```
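
As a minimal sketch of the counting step, the snippet below uses a plain dictionary as a hypothetical stand-in for a project's subjects and sessions (the `subjects` mapping, its labels, and the `count_steps` helper are invented for illustration and are not part of the Flywheel SDK). It shows that the session count is already a total across all subjects, so it can serve directly as the number of steps.

```python
# Hypothetical stand-in for a project: subject label -> list of session labels.
subjects = {
    "sub-01": ["ses-01", "ses-02"],
    "sub-02": ["ses-01"],
    "sub-03": [],
}


def count_steps(subjects):
    """Return (n_subjects, n_sessions) in a single pass over the mapping."""
    n_subjects = 0
    n_sessions = 0
    for sessions in subjects.values():
        n_subjects += 1
        n_sessions += len(sessions)
    return n_subjects, n_sessions


total_subjects, total_sessions = count_steps(subjects)
total_steps = total_sessions  # one step per session visited by the main loop
print(total_subjects, total_steps)  # prints "3 3"
```

Counting in a single pass also avoids walking the subject list twice, which may matter when each iteration is a network request.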

### Step 5: Iterate Through Subjects and Sessions
The main part of the script loops through each subject and session, processing data as it goes.

```python
for subject in project.subjects.iter():
    subject = subject.reload()
    sub_label = subject.label
    for session in subject.sessions.iter():
        session = session.reload()
        ses_label = session.label
        completed_steps += 1  # Increment completed steps
```

### Step 6: Calculate Progress and Remaining Time
Within each loop iteration, we calculate the percentage of completion and estimate the remaining time.

```python
elapsed_time = time.time() - st
progress = (completed_steps / total_steps) * 100
estimated_total_time = elapsed_time / (completed_steps / total_steps)
remaining_time = estimated_total_time - elapsed_time

print(f"Processing {sub_label}, {ses_label}")
print(f"Progress: {progress:.2f}% | Estimated Time Remaining: {remaining_time / 60:.2f} minutes")
```
This provides the user with an ongoing update of the script’s progress, helping them understand how much longer the script is likely to take.
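
The arithmetic can be checked in isolation. The sketch below wraps the same formula in a hypothetical helper, `estimate_progress` (not part of the original script), and assumes every step takes roughly the same time. The guard for zero steps is an addition for illustration.

```python
def estimate_progress(completed_steps, total_steps, elapsed_seconds):
    """Return (percent_complete, remaining_seconds) assuming a constant rate."""
    if total_steps <= 0 or completed_steps <= 0:
        # Nothing finished yet, so no rate can be estimated.
        return 0.0, float("inf")
    fraction = completed_steps / total_steps
    estimated_total = elapsed_seconds / fraction
    return fraction * 100, estimated_total - elapsed_seconds


percent, remaining = estimate_progress(completed_steps=5, total_steps=20, elapsed_seconds=60.0)
print(f"Progress: {percent:.2f}% | Estimated Time Remaining: {remaining / 60:.2f} minutes")
# prints "Progress: 25.00% | Estimated Time Remaining: 3.00 minutes"
```

Because the estimate extrapolates from the average time per session so far, it will drift if some sessions have many more files to download than others.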

### Step 7: Download and Process Analysis Files
For each session, the script checks every analysis against the gear name, gear version, and creation date filters. For each matching analysis, it downloads any file whose name contains `fileName_tag` and appends the file's contents to the list `df_list`.

```python
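for analysis in session.analyses:
    gear_info = analysis.gear_info
    if (gear_info is not None and gear_info.name == gear
            and gear_info.version == gearVersion
            and analysis.created > timestampFilter):
        for analysis_file in analysis.files:
            if fileName_tag in analysis_file.name:
                # Download into download_dir (built as in the full script above), then collect the data
                file = analysis_file.reload()
                download_path = download_dir / file.name
                file.download(download_path)
                df_list.append(pd.read_csv(download_path))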
```
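
To make the filter logic easier to test without a Flywheel connection, it can be pulled into a small predicate. The following is a sketch under stated assumptions: `matches_filter` is a hypothetical helper, and the `SimpleNamespace` objects are invented stand-ins that only mimic the `gear_info` and `created` attributes used above.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def matches_filter(analysis, gear, gear_version, created_after):
    """True if the analysis came from the given gear/version after the cutoff."""
    info = analysis.gear_info
    return (
        info is not None
        and info.name == gear
        and info.version == gear_version
        and analysis.created > created_after
    )


# Hypothetical analysis records, for illustration only.
cutoff = datetime(2025, 1, 10, tzinfo=timezone.utc)
new_run = SimpleNamespace(
    gear_info=SimpleNamespace(name="recon-all-clinical", version="0.4.0"),
    created=datetime(2025, 2, 1, tzinfo=timezone.utc),
)
old_run = SimpleNamespace(
    gear_info=new_run.gear_info,
    created=datetime(2024, 12, 1, tzinfo=timezone.utc),
)
no_gear = SimpleNamespace(gear_info=None, created=new_run.created)

results = [
    matches_filter(a, "recon-all-clinical", "0.4.0", cutoff)
    for a in (new_run, old_run, no_gear)
]
print(results)  # prints "[True, False, False]"
```

The cutoff is timezone-aware here because Python raises a `TypeError` when a naive datetime is compared with an aware one.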

### Step 8: Save and Upload Results
Once all files are downloaded and combined into a DataFrame, the script saves the data to a CSV file. It then uploads the file to the project information tab in Flywheel.
    
```python
try:
    outname = project.label.replace(' ', '_').replace('(', '').replace(')', '')
    filename = f"{outname}-{gear}-{gearVersion}-thickness-{today_date}.csv"

    # Concatenate and save DataFrame
    df = pd.concat(df_list, axis=0, ignore_index=True)
    outdir = Path(project_path) / filename
    df.to_csv(outdir)

    # Upload the file to the project information tab
    project.upload_file(str(outdir))
except Exception as e:
    print(f"Failed to save or upload results for {project.label} to project info tab")
    print("Check that the analysis was run and that there are results to upload")
    print(f"Error: {e}")
```
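
The filename construction can be tried on its own. In this sketch, `build_output_name` is a hypothetical helper and the project label is invented. It applies the same character replacements as the script and adds the date stamp in `YYYYMMDD` form.

```python
from datetime import date


def build_output_name(project_label, gear, gear_version, tag, run_date):
    """Build a CSV filename from the project label, gear, version, tag, and date."""
    safe_label = project_label.replace(" ", "_").replace("(", "").replace(")", "")
    return f"{safe_label}-{gear}-{gear_version}-{tag}-{run_date:%Y%m%d}.csv"


# Hypothetical project label, for illustration only.
name = build_output_name(
    "My Study (pilot)", "recon-all-clinical", "0.4.0", "thickness", date(2024, 11, 14)
)
print(name)  # prints "My_Study_pilot-recon-all-clinical-0.4.0-thickness-20241114.csv"
```

Passing the tag as a parameter keeps the filename in step with whatever `fileName_tag` is being pulled, rather than hard-coding it.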

### Step 9: Calculate Total Execution Time
Finally, the script calculates and prints the total time taken to complete all processing.

```python
et = time.time()
total_elapsed_time = et - st
print('Execution time:', total_elapsed_time, 'seconds')
```
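
For long runs, raw seconds are hard to read. One option, shown here as a sketch with a hypothetical `format_duration` helper, is to format the elapsed time with the standard library's `timedelta`.

```python
from datetime import timedelta


def format_duration(seconds):
    """Format a duration in seconds as H:MM:SS, dropping fractional seconds."""
    return str(timedelta(seconds=int(seconds)))


print("Execution time:", format_duration(3725.4))  # prints "Execution time: 1:02:05"
```

Durations of a day or more are rendered by `timedelta` with a leading day count, for example `1 day, 0:00:00`.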

---

# Adjusted to pull segmentations

In [None]:
import time
import pandas as pd
from pathlib import Path

# Start time
st = time.time()

# Preallocate lists
df_list = []

# Calculate total subjects and sessions
total_subjects = sum(1 for _ in project.subjects.iter())
total_sessions = sum(1 for subject in project.subjects.iter() for _ in subject.sessions.iter())
total_steps = total_sessions
completed_steps = 0

# Iterate through subjects and sessions
for subject in project.subjects.iter():
    subject = subject.reload()
    sub_label = subject.label

    for session in subject.sessions.iter():
        session = session.reload()
        ses_label = session.label
        completed_steps += 1

        # Progress and remaining time
        elapsed_time = time.time() - st
        progress = (completed_steps / total_steps) * 100
        estimated_total_time = elapsed_time / (completed_steps / total_steps)
        remaining_time = estimated_total_time - elapsed_time

        print(f"Processing {sub_label}, {ses_label}")
        print(f"Progress: {progress:.2f}% | Estimated Time Remaining: {remaining_time / 60:.2f} minutes")

        for analysis in session.analyses:
            if (
                analysis.gear_info
                and analysis.gear_info.name == gear
                and analysis.gear_info.version == gearVersion
                and analysis.created > timestampFilter
            ):
                print("Pulling:", gear, gearVersion)
                for analysis_file in analysis.files:
                    if 'synthSR' in analysis_file.name or 'aparc+aseg' in analysis_file.name:
                        file = analysis_file.reload()

                        # Sanitize and prepare download path
                        download_dir = pv.sanitize_filepath(Path(project_path) / sub_label / ses_label, platform="auto")
                        if not download_dir.exists():
                            download_dir.mkdir(parents=True)
                        download_path = download_dir / file.name
                        print(download_path)

                        # Download file
                        print("Downloading file", ses_label, analysis.label)
                        file.download(download_path)

# End time
et = time.time()
print("Execution time:", et - st, "seconds")


In [2]:
#clear output before pushing to github
from IPython.display import display, Javascript

def clear_output():
    display(Javascript('IPython.notebook.clear_all_output()'))

clear_output()