# Identify Done Groups

The QC is taking a long time, so I want to know which groups are done, both for record keeping & to make sure we don't run out of funds mid-analysis & kill the project before we can get the results out.

# VCF file to group mapping

In [18]:
from collections import defaultdict

group_to_files = defaultdict(set)
with open("../../code/rap_specific/out.list","r") as f:
    for line in f:
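        # Each path has exactly six "/"-separated parts; the last two are the group and the processed file name
        _,_,_,_,group,processed_fname = line.strip().split("/")
        # Drop the "_processed.txt" suffix to recover the original file name
        group_to_files[group].add(processed_fname[:-len("_processed.txt")])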

for group in group_to_files:
    print("'{}' - {} files".format(group,len(group_to_files[group])))

'group_1' - 96 files
'group_2' - 96 files
'group_3' - 96 files
'group_4' - 96 files
'group_5' - 96 files
'group_6' - 96 files
'group_7' - 96 files
'group_8' - 96 files
'group_9' - 96 files
'group_10' - 71 files


In [19]:
file_to_group = {}
for group in group_to_files:
    for fname in group_to_files[group]:
        file_to_group[fname] = group 
len(file_to_group)

935
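
The reverse mapping silently overwrites an entry if the same file name shows up under two groups, so it is worth confirming that the 935 total matches the per-group counts (9 × 96 + 71 = 935). A minimal sketch of a more explicit check, assuming the same `group_to_files` shape as above; `find_shared_files` is a hypothetical helper and the toy data is made up for illustration:

```python
from collections import defaultdict


def find_shared_files(group_to_files):
    """Return {file name: sorted list of groups} for files listed in more than one group."""
    groups_by_file = defaultdict(set)
    for group, fnames in group_to_files.items():
        for fname in fnames:
            groups_by_file[fname].add(group)
    return {f: sorted(g) for f, g in groups_by_file.items() if len(g) > 1}


# Toy data (made up): "b.vcf" is listed under two groups.
toy = {"group_1": {"a.vcf", "b.vcf"}, "group_2": {"b.vcf", "c.vcf"}}
shared = find_shared_files(toy)
print(shared)
```

An empty result means every file belongs to exactly one group, which is what `file_to_group` assumes.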

# "Freeze 1"

This is the output from 17 hours after the job was started. 96 files have finished & I need to verify that these are all the group 1 files.

In [20]:
import numpy as np
# The batch file lists each finished file as a pair of lines:
#   <path>/<filename> done
#   <integer> seconds
#   ...
# so even-numbered lines (0-based) name a file and odd-numbered lines give its run time
def get_done_groups(completed_batches_file = "completed_batches_freeze_1.out"):
    done_files = {}
    group_doneness = defaultdict(int)
    with open(completed_batches_file,"r") as f:
        for i,line in enumerate(f):
            line,_ = line.strip().split()
            if i%2 == 0:
                _,_,_,_,fname = line.split("/")
            else:
                seconds = int(line)
                done_files[fname] = seconds
                group_doneness[file_to_group[fname]] += 1
    for group in group_doneness:
        if group_doneness[group] == len(group_to_files[group]):
            print("{} is done!".format(group))
        else:
            print("{} is {:.2f}% done".format(group,100*group_doneness[group]/len(group_to_files[group])))
    avg_handle_time_secs = np.mean(list(done_files.values()))
    print("average time to complete 1 file is: {:.2f} seconds ({:.2f} hours)".format(avg_handle_time_secs,avg_handle_time_secs/60/60))

get_done_groups()

group_1 is done!
average time to complete 1 file is: 51125.01 seconds (14.20 hours)
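
To turn the printed check into something that can be asserted, the same two-line format can be parsed into per-group counts. The sketch below is a hypothetical variant of `get_done_groups`, not a replacement for it: `count_done_per_group` and the sample paths are assumptions, and it takes the last path component as the file name instead of relying on a fixed directory depth.

```python
from collections import Counter


def count_done_per_group(lines, file_to_group):
    """Count finished files per group from alternating '<path> done' / '<n> seconds' lines."""
    done_per_group = Counter()
    seconds_by_file = {}
    fname = None
    for i, raw in enumerate(lines):
        first_token = raw.strip().split()[0]
        if i % 2 == 0:
            # Even lines: path to the finished file; keep only the file name.
            fname = first_token.rsplit("/", 1)[-1]
        else:
            # Odd lines: run time in seconds for the file named on the previous line.
            seconds_by_file[fname] = int(first_token)
            done_per_group[file_to_group[fname]] += 1
    return done_per_group, seconds_by_file


# Made-up sample in the documented format.
sample = [
    "/a/b/c/file_1 done",
    "100 seconds",
    "/a/b/c/file_2 done",
    "300 seconds",
]
toy_file_to_group = {"file_1": "group_1", "file_2": "group_1"}
counts, seconds = count_done_per_group(sample, toy_file_to_group)
print(counts["group_1"], sum(seconds.values()) / len(seconds))
```

With the real data, comparing `counts[group]` against `len(group_to_files[group])` gives the same "done" test as above, but as a value that can be checked in code.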


# "Freeze 2"

This is the output from 44 hours after the job was started. 300 files have finished & I need to verify that these include all the group 2 & 3 files.

In [21]:
get_done_groups("completed_batches_freeze_2.out")

group_1 is done!
group_2 is done!
group_3 is done!
group_4 is 12.50% done
average time to complete 1 file is: 43820.78 seconds (12.17 hours)
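
The 300-file total is consistent with this report: three full groups of 96 plus 12.5% of group 4. A quick arithmetic check using only the numbers printed above (`files_done` is a hypothetical helper):

```python
def files_done(group_sizes, fraction_done):
    """Total finished files given each group's size and completed fraction."""
    return sum(round(group_sizes[g] * fraction_done[g]) for g in fraction_done)


sizes = {"group_1": 96, "group_2": 96, "group_3": 96, "group_4": 96}
fractions = {"group_1": 1.0, "group_2": 1.0, "group_3": 1.0, "group_4": 0.125}
total = files_done(sizes, fractions)
print(total)  # 300
```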


# "Freeze 3"

This is the output from 65 hours after the job was started. 498 files have finished & I need to verify that these include all the group 4 & 5 files.

In [22]:
get_done_groups("completed_batches_freeze_3.out")

group_1 is done!
group_2 is done!
group_3 is done!
group_4 is done!
group_5 is done!
group_6 is 18.75% done
average time to complete 1 file is: 42131.42 seconds (11.70 hours)


# "Freeze 4"

This is the output from 89 hours after the job was started. 702 files have finished & I need to verify that these include all the group 6 & 7 files.

In [23]:
get_done_groups("completed_batches_freeze_4.out")

group_1 is done!
group_2 is done!
group_3 is done!
group_4 is done!
group_5 is done!
group_6 is done!
group_7 is done!
group_8 is 31.25% done
average time to complete 1 file is: 41454.47 seconds (11.52 hours)
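
Since the point of these freezes is to avoid running out of funds mid-analysis, the freeze counts can also give a rough estimate of the remaining run time. This is a back-of-the-envelope sketch, assuming throughput stays at the rate observed between the two most recent freezes; `hours_remaining` is a hypothetical helper, and the inputs are the file counts and elapsed hours quoted above.

```python
def hours_remaining(prev, latest, total_files):
    """Estimate hours left from two (elapsed_hours, files_done) snapshots.

    Assumes files keep finishing at the rate seen between the snapshots.
    """
    (h0, n0), (h1, n1) = prev, latest
    rate = (n1 - n0) / (h1 - h0)  # files per hour
    return (total_files - n1) / rate


# Freeze 3: 498 files at 65 h; freeze 4: 702 files at 89 h; 935 files in total.
estimate = hours_remaining((65, 498), (89, 702), 935)
print("about {:.1f} more hours".format(estimate))
```

The estimate is only as good as the constant-rate assumption; the smaller final group (71 files) may finish faster or slower than a full group.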


# "Freeze 5"

This is the output from right after the job ended. 935 files have finished & I need to verify that this covers every file in every group.

In [24]:
get_done_groups("completed_batches_freeze_5.out")

group_1 is done!
group_2 is done!
group_3 is done!
group_4 is done!
group_5 is done!
group_6 is done!
group_7 is done!
group_8 is done!
group_9 is done!
group_10 is done!
average time to complete 1 file is: 40930.48 seconds (11.37 hours)
