<b> Disclaimer:
This Notebook is to be implemented at the user’s discretion. We are not responsible for any unexpected behavior (user error or otherwise). Please ensure that you have saved the files you would like to persist to the Data Model (or a more permanent location) before running this Notebook. </b>

**What is this notebook?**

This notebook offers you a way to delete workflow intermediates generated from a submission launched in a single Terra workspace. 

**What does it do?**
1. (Optional) Lists the bucket size and storage cost for each workspace owned or created by the user
2. Uses an optimized mop function to delete unwanted or unnecessary intermediate files.

**What is the difference between this notebook and Remove_Workflow_Intermediates.ipynb?**

The Remove_Workflow_Intermediates notebook uses the `mop` command from FISS, which times out for users with very large workspaces. This notebook calls an optimized `mop` function from the [broadinstitute/horsefish](https://github.com/broadinstitute/horsefish/blob/main/scripts/mop_workspace/mop_workspace.py) repo. The optimized function uses the Google Cloud Storage API to list and delete files in batches and in parallel.

This notebook also gives users the option to view the storage costs of their workspaces to determine which ones to mop.
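
To picture what deleting "in batches and in parallel" means, here is a minimal sketch. It is not the horsefish implementation: `delete_batch` is a hypothetical stand-in that only counts the paths it receives, and the batch size and worker count are arbitrary illustration values.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def delete_batch(paths):
    """Hypothetical stand-in for a real batch delete; returns how many paths it handled."""
    return len(paths)


def delete_in_parallel(paths, batch_size=100, max_workers=4):
    """Split `paths` into batches and hand each batch to a worker thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(delete_batch, batched(paths, batch_size)))


# Hypothetical paths, for illustration only
example_paths = [f"gs://example-bucket/file_{i}.txt" for i in range(250)]
deleted = delete_in_parallel(example_paths)
```

Grouping files into batches keeps the number of requests small, while the thread pool lets several batches be in flight at once.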

**What gets deleted?**

All workflow output files except logs are deleted, unless they are bound to the Data Model. To bind outputs to the Data Model, select Defaults in the Outputs section of the Workflow configuration before selecting "Launch Analysis."

**What gets left behind?**
1. Files uploaded to the Google bucket that do not live inside a submission “directory” will NOT be deleted.
2. Log files (stderr, stdout, *.log) within a submission “directory” will NOT be deleted.
3. Submission folders/“directories” will NOT be deleted - only the contents.
4. Notebooks in the Google bucket will NOT be deleted.
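
The rules above can be read as a simple predicate over file paths. The sketch below is only an illustration of those rules, not the logic used by the mop script. The function name, the idea of passing in a set of submission IDs, and the sample paths are all assumptions.

```python
from fnmatch import fnmatchcase

# Log file names taken from the list above
LOG_PATTERNS = ("stderr", "stdout", "*.log")


def is_left_behind(path, submission_ids, bound_paths):
    """Return True if `path` (relative to the bucket root) matches a 'left behind' rule."""
    if path in bound_paths:
        return True  # bound to the Data Model
    *directories, file_name = path.split("/")
    if not any(directory in submission_ids for directory in directories):
        return True  # not inside a submission "directory"
    if file_name.endswith(".ipynb"):
        return True  # notebook
    return any(fnmatchcase(file_name, pattern) for pattern in LOG_PATTERNS)


# Hypothetical submission ID and Data Model binding
submission_ids = {"sub-123"}
bound_paths = {"sub-123/wf/call-align/out.bam"}

kept = is_left_behind("sub-123/wf/call-align/stderr", submission_ids, bound_paths)
removed = not is_left_behind("sub-123/wf/call-align/tmp.sam", submission_ids, bound_paths)
```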

**What should you do before using this notebook?**
1. If there are outputs that should not be deleted, they will need to be bound to the Data Model. If a file is NOT bound to the Data Model, it will be removed.
2. Copy any files you want to keep that are not bound to the Data Model to a secondary location.

**What should you do to run this notebook?**
1. Clone this workspace or copy this notebook into the workspace you want to mop.
2. Open the notebook and create a cloud environment. We recommend using the Default application configuration with 4 CPUs and 15 GB.

# Environment setup



In [None]:
# clone the git repo that contains the mop function
branch = "main"
! git clone -b $branch https://github.com/broadinstitute/horsefish.git
! mv horsefish/scripts/mop_workspace/mop_workspace.py . 
! rm -Rf horsefish

In [None]:
%%capture 
!pip install firecloud --upgrade
!pip install hurry.filesize
!pip install toolz

In [None]:
%%capture
import firecloud.api as fapi
import os
import pandas as pd
from hurry.filesize import size
from mop_workspace import mop, mop_files_from_list

In [None]:
# define a function to get the workspace storage cost
def get_workspace_cost(workspace):
    result = fapi.get_storage_cost(workspace[0],workspace[1])
    if not result.ok:
        return 'N/A'
    return result.json()['estimate']

# define a function to get the size of a workspace bucket
def get_bucket_size(workspace):
    result = fapi.get_bucket_usage(workspace[0],workspace[1])
    if not result.ok:
        return 'N/A'
    return size(result.json()['usageInBytes'])

# define a function to get the bucket name of a workspace
def get_bucket_name(workspace):
    result = fapi.get_workspace(workspace[0],workspace[1],'workspace.bucketName').json()['workspace']['bucketName']
    return 'gs://'+result

# (Optional) Get workspace list

This section lists your workspaces along with the bucket size and storage cost of each, so you can decide which workspace(s) to mop. You can list only the workspaces you created, or all workspaces to which you have OWNER or PROJECT_OWNER access (this includes workspaces you created and workspaces created by others and shared with you).

**If you already have a workspace you want to mop, skip to the next section.**

In [None]:
# set environment variables
current_user = os.getenv('OWNER_EMAIL') # get email of user currently running the notebook

# User input required: 
# By default, we are only looking for workspaces that were created by the current user.
# Change the following value to False if you want to get a list of all workspaces you have 'OWNER' or 'PROJECT_OWNER' access to
just_created_by = True

In [None]:
# get all workspaces that user has access to
all_ws = fapi.list_workspaces("workspace.namespace,workspace.name,workspace.bucketName,workspace.googleProject,accessLevel,workspace.createdBy").json()

# uncomment next line to view the list
#all_ws

In [None]:
# Based on value in 'just_created_by', get list of workspaces
if just_created_by:
    my_workspaces = [(workspace['workspace']['namespace'], workspace['workspace']['name']) for workspace in all_ws if workspace['workspace']['createdBy'] == current_user]
else:
    my_workspaces = [(workspace['workspace']['namespace'], workspace['workspace']['name']) for workspace in all_ws if workspace['accessLevel'] in ('OWNER','PROJECT_OWNER')]

# Uncomment next line to view the list
#my_workspaces
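
To see the effect of the two filters without calling the API, the sketch below applies the same conditions to a few hand-written records. The records, email addresses, and the `select_workspaces` helper are hypothetical. Only the field names mirror the ones requested in the cells above.

```python
# Hypothetical records shaped like the fields requested from the workspace list
sample_workspaces = [
    {"accessLevel": "OWNER",
     "workspace": {"namespace": "proj-a", "name": "ws-1", "createdBy": "me@example.org"}},
    {"accessLevel": "PROJECT_OWNER",
     "workspace": {"namespace": "proj-a", "name": "ws-2", "createdBy": "colleague@example.org"}},
    {"accessLevel": "READER",
     "workspace": {"namespace": "proj-b", "name": "ws-3", "createdBy": "colleague@example.org"}},
]


def select_workspaces(workspaces, user_email, just_created_by=True):
    """Return (namespace, name) tuples, filtered by creator or by access level."""
    if just_created_by:
        keep = lambda ws: ws["workspace"]["createdBy"] == user_email
    else:
        keep = lambda ws: ws["accessLevel"] in ("OWNER", "PROJECT_OWNER")
    return [(ws["workspace"]["namespace"], ws["workspace"]["name"])
            for ws in workspaces if keep(ws)]


created_only = select_workspaces(sample_workspaces, "me@example.org", just_created_by=True)
owned = select_workspaces(sample_workspaces, "me@example.org", just_created_by=False)
```

With these sample records, `created_only` holds a single workspace, while `owned` also includes the workspace shared with PROJECT_OWNER access.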

In [None]:
# Find workspace storage usage and costs for each workspace in 'my_workspaces'
# Note: this may take a few minutes to run if you have a lot of workspaces
my_workspaces_costs = [(ws[0], ws[1], get_bucket_size(ws), get_workspace_cost(ws)) for ws in my_workspaces]

**View the list of workspaces and their bucket sizes and storage costs. The list is shown in descending order by storage cost.**

Note that in some cases the API may fail to return the storage cost for a workspace. These workspaces are filtered out of the table view. To see which workspace(s) failed, comment out the last line in the following code block.

In [None]:
# configure dataframe
df = pd.DataFrame(my_workspaces_costs)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
df.columns = ['Billing Project','Workspace Name','Bucket Size','Estimated Storage Cost']
df = df.sort_values('Estimated Storage Cost', ascending=False)
df = df[(df['Estimated Storage Cost']!='N/A')]

In [None]:
# view dataframe
df
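
If the estimate comes back as a string such as `$12.34` (an assumption about the response format, not something this notebook guarantees), the values sort alphabetically rather than by amount. The sketch below shows one way to rank rows numerically in plain Python. The `cost_as_number` helper and the sample rows are hypothetical.

```python
def cost_as_number(estimate):
    """Convert a string such as '$1,234.50' to a float; return None if it cannot be parsed."""
    try:
        return float(str(estimate).replace("$", "").replace(",", ""))
    except ValueError:
        return None


# Hypothetical rows: (billing project, workspace name, bucket size, estimated cost)
rows = [
    ("proj-a", "ws-1", "2G", "$9.50"),
    ("proj-a", "ws-2", "40G", "$120.00"),
    ("proj-b", "ws-3", "N/A", "N/A"),
]

# Drop rows without a usable estimate, then sort by amount, largest first
ranked = sorted(
    (row for row in rows if cost_as_number(row[3]) is not None),
    key=lambda row: cost_as_number(row[3]),
    reverse=True,
)
```

In this sample, `ws-2` ranks first by amount, whereas a plain string sort in descending order would place `$9.50` ahead of `$120.00`.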

# Mop a workspace

In this section we use the mop_workspace.py script to delete all intermediate files in a workspace. You have the option to input the billing project and name of a workspace you want to mop, or mop the current workspace that you are running this notebook in.

**Select a workspace to mop:**

In [None]:
# Set the workspace_project and workspace_name values based on the workspace you want to mop
workspace_project = 'my-billing-project-name'
workspace_name = 'my-workspace-name'

# Uncomment the next two lines if you want to mop the workspace that this notebook is currently being run in.
#workspace_project = os.getenv('WORKSPACE_NAMESPACE')
#workspace_name = os.getenv('WORKSPACE_NAME')

if not fapi.get_workspace(workspace_project,workspace_name).ok:
    print(f"Workspace '{workspace_project}/{workspace_name}' does not exist.")
    print("Check that you've entered the correct billing project and workspace name of an existing workspace before proceeding.")
else:
    print("Workspace to mop:")
    print(f"workspace_project = {workspace_project}")
    print(f"workspace_name = {workspace_name}")

## Dry run
You can try a dry run first, which will write all files to be mopped to a text file that you can copy to a workspace bucket and inspect before deleting them.

If you want to mop the workspace without inspecting the files beforehand, you can skip to the **Just mop** section.

In [None]:
# Get list of files that will be mopped
files_to_mop_path = mop(project=workspace_project, workspace=workspace_name, include=None, exclude=None, dry_run=True, save_dir="mop_files", yes=True, weeks_old=3, verbose=True)

The list of files gets saved to the VM disk, but in order to open and inspect the entire list, it's best to copy it to a workspace bucket. You can either copy it to the bucket of the workspace this notebook is currently running in, or the workspace bucket you are trying to mop (if it's different). By default we are copying the file to the workspace being mopped (i.e. the workspace you set in the beginning of this section).

In [None]:
bucket_name = get_bucket_name([workspace_project,workspace_name]) 
# Uncomment the next line if you want to copy the file to the bucket of the workspace this notebook is currently running in
#bucket_name = os.getenv('WORKSPACE_BUCKET') 
bucket_name

In [None]:
# Copy list of files to mop

if files_to_mop_path:
    ! gsutil cp $files_to_mop_path $bucket_name 
    print(f"List of files to mop in workspace {workspace_project}/{workspace_name} copied to bucket {bucket_name}")
else:
    print("No files to mop!")

You should now be able to download the list of files to mop from the bucket and inspect it.
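
For a long list, a quick summary can be easier to review than the raw file. The sketch below counts listed paths by file extension. It assumes the list is a plain text file with one path per line, which is an assumption about the file format. The sample file and its paths are made up for illustration.

```python
import os
import tempfile
from collections import Counter


def summarize_mop_list(list_path):
    """Count listed paths by file extension ('(none)' when there is no extension)."""
    counts = Counter()
    with open(list_path) as handle:
        for line in handle:
            path = line.strip()
            if not path:
                continue  # skip blank lines
            extension = os.path.splitext(path)[1] or "(none)"
            counts[extension] += 1
    return counts


# Hypothetical sample list, written to a temporary directory for illustration
sample_dir = tempfile.mkdtemp()
sample_list = os.path.join(sample_dir, "files_to_mop.txt")
with open(sample_list, "w") as handle:
    handle.write(
        "gs://example-bucket/sub-1/wf/call-a/out.bam\n"
        "gs://example-bucket/sub-1/wf/call-a/out.bai\n"
        "gs://example-bucket/sub-2/wf/call-b/out.bam\n"
        "gs://example-bucket/sub-2/wf/call-b/script\n"
    )

summary = summarize_mop_list(sample_list)
```

An unexpected extension in the summary (for example, a final output type you meant to keep) is a cue to bind or copy those files before deleting anything.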

If you're ready to delete the files, run the next two cells. If the deletion fails due to an internal error while the mop script is running, it will retry up to 3 times.

In [None]:
# Sanity check
print(f"All intermediate files in workspace \033[1m'{workspace_project}/{workspace_name}'\033[0m will be deleted. Do not proceed if this is incorrect.")

In [None]:
# Delete the files (skipped if the dry run found nothing to mop)
if files_to_mop_path:
    mop_files_from_list(workspace_project, workspace_name, files_to_mop_path, dry_run=False, yes=True, verbose=True)

In [None]:
# Optional check to confirm bucket size is now smaller.
mopped_bucket = get_bucket_name([workspace_project,workspace_name])
!gsutil du -sh $mopped_bucket

## Just mop

If you don't want/need to inspect the files beforehand, run the following cell to just delete the files. If the deletion fails due to an internal error, it will retry up to 3 times.

Note that the function will still write a list of files that were deleted, which you can download from the VM disk and inspect afterwards if desired. See the **Dry run** section for guidance.

In [None]:
files_to_mop_path = mop(project=workspace_project, workspace=workspace_name, include=None, exclude=None, dry_run=False, save_dir="mop_files", yes=True, weeks_old=3, verbose=True)
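
The "retry up to 3 times" behavior mentioned above can be pictured with a generic retry wrapper. This is a sketch of the general pattern only, not the code in the mop script. `call_with_retries` and `flaky_delete` are hypothetical names, and reading "3 times" as three additional attempts after the first failure is an assumption.

```python
import time


def call_with_retries(action, max_retries=3, delay_seconds=0.0):
    """Run `action`; if it raises, retry up to `max_retries` more times."""
    for attempt in range(max_retries + 1):
        try:
            return action()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            time.sleep(delay_seconds)


calls = {"count": 0}


def flaky_delete():
    """Hypothetical action that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated internal error")
    return "deleted"


result = call_with_retries(flaky_delete)
```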