# Treehouse Storage Management

Implement the [Treehouse Storage Management](https://docs.google.com/document/d/1otNDUQIGOY4zqPBAp4OzUnhXAmt1FHrJjqjA2jsUBrI/edit?pli=1#heading=h.ly71etsanuvd) policies with respect to local and S3 storage in the archive.

In [2]:
import os
import subprocess
from datetime import datetime, timedelta
import pandas as pd
import boto3

bucket = "archive-treehouse-ucsc-edu"

In [3]:
# Load the latest s3 inventory file into a pandas dataframe and identify secondary bams to delete
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket=bucket, Prefix="inventory/archive-treehouse-ucsc-edu/all/data/")
files = [f for f in response["Contents"] if f["Size"] > 0]  # skip the zero-size data/ folder key
print(f"Found {len(files)} Inventories")
# Pick the most recently modified inventory once and reuse it below
latest_file = max(files, key=lambda obj: obj["LastModified"])
latest = latest_file["Key"]
inventory = pd.read_csv("s3://{}/{}".format(bucket, latest), compression="gzip",
                        names=["bucket", "key", "version", "latest", "?",
                               "size", "created", "etag", "class", "??", "???", "encryption"],
                        parse_dates=["created"])
print("Using Inventory Dated", latest_file["LastModified"])
inventory.head()

Found 12 Inventories
Using Inventory Dated 2018-06-11 08:53:54+00:00


Unnamed: 0,bucket,key,version,latest,?,size,created,etag,class,??,???,encryption
0,archive-treehouse-ucsc-edu,compendium/pre_v4/README.txt,,True,False,232,2018-05-30 17:30:07,819dab68d80ed9b2f6648a1b51485a03,STANDARD,False,,SSE-S3
1,archive-treehouse-ucsc-edu,compendium/pre_v4/TCGA_mutations/Census_allWed...,,True,False,124659,2018-05-30 17:30:07,cd1f09273d76eea6d77e0ac730d91b30,STANDARD,False,,SSE-S3
2,archive-treehouse-ucsc-edu,compendium/pre_v4/TCGA_mutations/TCGA_Broad_mu...,,True,False,4790,2018-05-30 17:30:07,4ef5284b62a5344119acee834fc9215d,STANDARD,False,,SSE-S3
3,archive-treehouse-ucsc-edu,compendium/pre_v4/TCGA_mutations/TCGA_NonSilen...,,True,False,1158114,2018-05-30 17:30:08,9ef8a17da3ddb54a4f99b4b86853f22c,STANDARD,False,,SSE-S3
4,archive-treehouse-ucsc-edu,compendium/pre_v4/TCGA_mutations/UCSF-RNAPanel...,,True,False,2990,2018-05-30 17:30:11,fb4c7ac59552b24b41343dcf49372c38,STANDARD,False,,SSE-S3
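
The selection of the newest inventory file in the cell above can be tried without AWS access. The following is a minimal sketch, assuming a listing shaped like `response["Contents"]` from `list_objects_v2`; the function name `latest_inventory_key` and the example keys are hypothetical and not part of the notebook.

```python
from datetime import datetime, timezone


def latest_inventory_key(contents):
    """Return the key of the most recently modified non-empty object."""
    # Zero-size entries are folder placeholder keys, not inventory files
    files = [obj for obj in contents if obj["Size"] > 0]
    if not files:
        raise ValueError("no non-empty inventory files found")
    return max(files, key=lambda obj: obj["LastModified"])["Key"]


# Hypothetical listing for illustration only
example_contents = [
    {"Key": "inventory/data/", "Size": 0,
     "LastModified": datetime(2018, 6, 12, tzinfo=timezone.utc)},
    {"Key": "inventory/data/a.csv.gz", "Size": 10,
     "LastModified": datetime(2018, 6, 10, tzinfo=timezone.utc)},
    {"Key": "inventory/data/b.csv.gz", "Size": 12,
     "LastModified": datetime(2018, 6, 11, tzinfo=timezone.utc)},
]

print(latest_inventory_key(example_contents))  # inventory/data/b.csv.gz
```

Note that a single `list_objects_v2` call returns at most 1000 keys, so a longer inventory history would need pagination before this selection step.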


In [5]:
def summarize(label, subset):
    """Print the object count and total size in TB (10**12 bytes) for a slice of the inventory."""
    print("{}: {} {:.3f} TB".format(label, subset.shape[0], subset["size"].sum() / 10**12))

is_glacier = inventory["class"] == "GLACIER"
# Raw string so the regex escapes are not read as Python string escapes
is_secondary_bam = inventory.key.str.contains(r"downstream/.+?\.bam")

summarize("All", inventory)
summarize("Standard", inventory[~is_glacier])
summarize("Glacier", inventory[is_glacier])
summarize("Secondary BAMs", inventory[is_secondary_bam])

# Secondary BAMs created before the cutoff are candidates for deletion
max_age_days = 180
cutoff = datetime.now() - timedelta(days=max_age_days)
secondary_bams = inventory[(inventory.created < cutoff) & is_secondary_bam]
print("Secondary BAMs older than {} days:".format(max_age_days), secondary_bams.shape[0])

All: 144884 22.380 TB
Standard: 143925 13.468 TB
Glacier: 959 8.911 TB
Secondary BAMs: 941 9.000 TB
Secondary BAMs older than 180 days: 0
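
The age filter above can be checked on a small hand-made table with a fixed reference date instead of `datetime.now()`. This is a sketch under assumptions: the helper name `old_secondary_bams`, the toy keys, and the sizes are made up for illustration, and the 180-day default mirrors the `timedelta(days=180)` used in the cell.

```python
from datetime import datetime, timedelta

import pandas as pd

# Same pattern as the notebook cell; unanchored, so it matches anywhere in the key
SECONDARY_BAM_PATTERN = r"downstream/.+?\.bam"


def old_secondary_bams(inventory, now, max_age_days=180):
    """Return inventory rows that are secondary BAMs created before the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    is_secondary_bam = inventory["key"].str.contains(SECONDARY_BAM_PATTERN)
    return inventory[(inventory["created"] < cutoff) & is_secondary_bam]


# Hypothetical inventory rows for illustration only
toy = pd.DataFrame({
    "key": [
        "downstream/S1/secondary/x.bam",   # old secondary BAM: selected
        "downstream/S2/secondary/y.bam",   # recent secondary BAM: kept
        "primary/S1/reads.fastq.gz",       # old, but not a secondary BAM
    ],
    "created": pd.to_datetime(["2017-01-01", "2018-06-01", "2017-01-01"]),
    "size": [100, 200, 300],
})

result = old_secondary_bams(toy, now=datetime(2018, 6, 11))
print(result["key"].tolist())  # ['downstream/S1/secondary/x.bam']
```

Because the pattern is not anchored at the end of the key, it would also match index files such as `.bam.bai`; whether that is intended depends on the storage policy.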
