<a id='top'></a>

# MayoWorkflow.wdl Integration Test output in spreadsheet format

* Spreadsheet output (".tsv" or ".xlsx").
* The Python working model of a spreadsheet is the pandas DataFrame.
* This code parses the contents of one or more results directories into a DataFrame.
* Options to write the DataFrame to a file, display it on the command line, or both.
* Option to combine with PASS | FAIL spreadsheet criteria for use with an automated test suite (Jenkins etc.).
* Demonstrates the use of a Jupyter notebook to develop Python code in situ.
    
[MayomicsVC Research branch (private)](https://git.ncsa.illinois.edu/mayomics/MayomicsVC/tree/master/testing) <br>
[StackOverflow readable time stamp](https://stackoverflow.com/questions/16060899/alphabet-range-python/31888217) <br>

****
## Function usage Examples:
[Function Usage Examples Section](#function_usage) <br>
[time stamp strings](#time_stamp) <br>
[find strings in files in the directory tree](#find_strings) <br>
[test results dataframe](#test_results) <br>
****
### Development Note Jan 18, 2019
* functions not fully tested on server (they write files)
```python
# replace, write and display lines changed
find_replace_display_strs_in_dir_tree(find_fragment, replace_fragment, dir_name)
# just do it
contained_in_files_dict = find_replace_strs_in_dir_tree(find_fragment, replace_fragment, dir_name)
```
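Because the replace functions overwrite files, a dry run can help before running them on a server. The sketch below is a hypothetical helper (`preview_replace_in_dir_tree` and its `file_types` default are assumptions, not part of the module) that reports the lines that would change and writes nothing. It runs against a temporary directory so no project file is touched.

```python
import os
import tempfile


def preview_replace_in_dir_tree(find_fragment, replace_fragment, dir_name,
                                file_types=('.txt', '.sh', '.wdl')):
    """Hypothetical dry-run helper: collect would-be changes, write nothing."""
    preview = {}
    if not find_fragment:
        return preview
    for root, _dirs, files in os.walk(dir_name):
        for file_name in files:
            if os.path.splitext(file_name)[1] not in file_types:
                continue
            full_file_name = os.path.join(root, file_name)
            try:
                with open(full_file_name, 'r') as fh:
                    lines = fh.readlines()
            except (OSError, UnicodeDecodeError):
                continue  # unreadable or binary file: skip it
            # (line number, old text, new text) for every line that would change
            changes = [(n, line.rstrip('\n'),
                        line.rstrip('\n').replace(find_fragment, replace_fragment))
                       for n, line in enumerate(lines, start=1)
                       if find_fragment in line]
            if changes:
                preview[full_file_name] = changes
    return preview


# Demonstrate on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, 'demo.sh')
    with open(demo_file, 'w') as fh:
        fh.write('echo start\nsamtools index sample.bam.bai\n')

    demo_preview = preview_replace_in_dir_tree('bam.bai', 'bai', tmp_dir)
    for path, changes in demo_preview.items():
        for line_number, old, new in changes:
            print('%s:%i\n\t- %s\n\t+ %s' % (os.path.basename(path), line_number, old, new))

    # The file on disk still holds the original text.
    with open(demo_file) as fh:
        demo_file_unchanged = 'bam.bai' in fh.read()
```

Reviewing the preview first keeps the overwrite step a deliberate second action.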

## Module edit cell:
Uncomment the top line of the cell and run it to write the file as a Python module (this OVERWRITES the existing file).
```python
%%writefile ../../MayomicsVC/testing/integration_test.py
```
Comment out the top line and run the cell to quick-test the code in this notebook during development.
```python
# %%writefile ../../MayomicsVC/testing/integration_test.py
```

In [1]:
# %%writefile ../../MayomicsVC/testing/integration_test.py
import os
import sys
import string
import time
import datetime
import pandas as pd
import numpy as np
from collections import OrderedDict

VARIABLE_FILE_TYPES = ['.txt', '.file', '.py', '.wdl', '.sh', '.pl']

def get_time_sequence_string(decimal_shift=3):
    """ Usage: time_sequence_string = get_time_sequence_string(decimal_shift=3) """
    alpha_list = list(string.ascii_uppercase)[0:10]
    time_seq_int = np.int_(list(np.str_(int(time.time() * np.maximum(10**decimal_shift, 1)))))
    time_sequence_string = ''
    for d in time_seq_int:
        time_sequence_string += alpha_list[d]
    return time_sequence_string


def get_readable_time_stamp(n_digits=3):
    """ Usage: time_stamp_string = get_readable_time_stamp() returns a UTC time stamp; n_digits is currently unused """
    return datetime.datetime.now(datetime.timezone.utc).strftime("%H_%M_%S_%f_%Z_%Y_%m_%d")


def find_replace_display_strs_in_dir_tree(find_fragment, replace_fragment, dir_name=None):
    """ Usage: find_replace_display_strs_in_dir_tree(find_fragment, replace_fragment, dir_name)
    Overwrite existing file with replacement strings
    """
    print('input:',find_fragment, replace_fragment)
    # Call the replace function defined below (the previous name was not defined in this module)
    contained_in_files_dict = find_replace_strs_in_dir_tree(find_fragment, replace_fragment, dir_name)
    if isinstance(contained_in_files_dict, dict) and len(contained_in_files_dict) > 0:
        print('%s found in:'%(find_fragment))
        for full_file_name, lines_list in contained_in_files_dict.items():
            print('%s'%(full_file_name))
            if len(lines_list) > 1:
                lines_string = ''
                for line_number in lines_list:
                    lines_string += '%4i '%(line_number)
                print('\tlines: %s'%(lines_string))
            else:
                print('\tline %i'%(lines_list[0]))


def find_replace_strs_in_dir_tree(find_fragment, replace_fragment, dir_name=None):
    """ Usage: contained_in_files_dict = find_replace_strs_in_dir_tree(find_fragment, replace_fragment, dir_name)
    """
    if dir_name is not None and os.path.isdir(dir_name):
        dir_tree_root = dir_name
    else:
        dir_tree_root = os.getcwd()
    
    contained_in_tuples_list = []
    if isinstance(find_fragment, str) and len(find_fragment) > 0 and isinstance(replace_fragment, str):
        obscure_string_fragment = find_fragment 
        for dir_name, dirs_list, files_list in os.walk(dir_tree_root):
            for file_name in files_list:
                full_file_name = os.path.join(dir_name, file_name)
                line_locations_list = []
                with open(full_file_name, 'r') as fh:
                    lines = fh.readlines()
                    
                if len(lines) > 0:
                    for line_number in range(len(lines)):
                        if len(lines[line_number]) > 0 and obscure_string_fragment in lines[line_number]:
                            line_locations_list.append(line_number+1)
                            lines[line_number] = lines[line_number].replace(find_fragment, replace_fragment)
            
                if len(line_locations_list) > 0:
                    contained_in_tuples_list.append((full_file_name, line_locations_list))
                    with open(full_file_name, 'w') as fh:
                        fh.writelines(lines)

    # Build the result outside the input check so the return value is always defined
    if len(contained_in_tuples_list) > 0:
        contained_in_files_dict = OrderedDict(contained_in_tuples_list)
    else:
        contained_in_files_dict = {}

    return contained_in_files_dict


def display_string_found_dict(string_fragment, dir_name=None):
    """ Usage: display_string_found_dict(string_fragment, dir_name)
    call find_string_fragment_in_dir_tree and display dictionary
    """
    contained_in_files_dict = find_string_fragment_in_dir_tree(string_fragment, dir_name)
    if isinstance(contained_in_files_dict, dict) and len(contained_in_files_dict) > 0:
        print('%s found in:'%(string_fragment))
        for full_file_name, lines_list in contained_in_files_dict.items():
            print('%s'%(full_file_name))
            if len(lines_list) > 1:
                lines_string = ''
                for line_number in lines_list:
                    lines_string += '%4i '%(line_number)
                print('\tlines: %s'%(lines_string))
            else:
                print('\tline %i'%(lines_list[0]))


def find_string_fragment_in_dir_tree(string_fragment, dir_name=None):
    """ Usage: files_string_dict = find_string_fragment_in_dir_tree(string_fragment, dir_name)
    """
    if dir_name is not None and os.path.isdir(dir_name):
        dir_tree_root = dir_name
    else:
        dir_tree_root = os.getcwd()
    
    contained_in_tuples_list = []
    if isinstance(string_fragment, str) and len(string_fragment) > 0:
        obscure_string_fragment = string_fragment 
        for dir_name, dirs_list, files_list in os.walk(dir_tree_root):
            for file_name in files_list:
                _, f_ext = os.path.splitext(file_name)
                if f_ext in VARIABLE_FILE_TYPES:
                    full_file_name = os.path.join(dir_name, file_name)
                    line_locations_list = []
                    try:
                        with open(full_file_name, 'r') as fh:
                            lines = fh.readlines()

                        if len(lines) > 0:
                            line_number = 0
                            for line in lines:
                                line_number += 1
                                l = line.strip()
                                if len(l) > 0 and obscure_string_fragment in l:
                                    line_locations_list.append(line_number)
                        if len(line_locations_list) > 0:
                            contained_in_tuples_list.append((full_file_name, line_locations_list))
                    except (OSError, UnicodeDecodeError):
                        # Unreadable or non-text file: report it and move on
                        print('skip:\t',file_name)

    # Build the result outside the input check so the return value is always defined
    if len(contained_in_tuples_list) > 0:
        files_string_dict = OrderedDict(contained_in_tuples_list)
    else:
        files_string_dict = {}

    return files_string_dict


def get_test_results_dataframe(x_dir):
    """ Usage: return_codes_dataframe = get_test_results_dataframe(x_dir)
    args:
        x_dir:         the directory with the "call_..." subdirectories (else you get nothing)
        
    returns:
        rc_df:         pandas dataframe with the return codes and size of various output files
    """
    DATAFRAME_DEFAULT_EMPTY_VALUE = 'unk'
    FAILED_RETURN_CODE_READ = '-1'
    good_return_codes_list = ['0', '0\n']
    check_files_dict = {'stderr':['ERROR', 'error', 'Error'], 'stdout':['START', 'Finished']}
    
    #     This variable might be compared to the "call" entries in the .wdl tree.
    call_dirs = os.listdir(x_dir)
    call_dir_list = []
    call_dir_count = 0
    #     Get the rows list - parse the directories that begin with "call" vs getting them from the .wdl files
    for call_dir in call_dirs:
        # Join parent first, then child; the reversed order only worked when x_dir was an absolute path
        if os.path.isdir(os.path.join(x_dir, call_dir)) and call_dir[0:4] == 'call':
            call_dir_count += 1
            call_dir_list.append(call_dir)
    
    #     Create the list of things that will be reported in each call directory and initialize the dataframe
    cols_list = ['rc', 'bam', 'bam.bai', 'stderr', 'stdout']
    rc_df = pd.DataFrame(index=call_dir_list,columns=cols_list).fillna(DATAFRAME_DEFAULT_EMPTY_VALUE)
    
    #     Check the directories in this tree against the column list
    for dir_name, dir_list, files_list in os.walk(x_dir):
        for filename in files_list:
            full_filename = os.path.join(dir_name, filename)
            if filename in cols_list:
                if filename == 'rc':
                    with open(full_filename, 'r') as fh:
                        lines = fh.readlines()
                        
                    if len(lines) > 0:
                        for call_dir in call_dir_list:
                            if call_dir in dir_name:
                                if lines[0] in good_return_codes_list:
                                    rc_df.loc[call_dir, 'rc'] = str(lines[0]).strip()
                                else:
                                    try:
                                        rc_df.loc[call_dir, 'rc'] = str(lines[0]).strip()
                                    except:
                                        rc_df.loc[call_dir, 'rc'] = FAILED_RETURN_CODE_READ
                                        pass
                                    
            if filename in list(check_files_dict.keys()):
                for call_dir in call_dir_list:
                    if call_dir in dir_name:
                        with open(full_filename, 'r') as fh:
                            lines = fh.readlines()

                        if len(lines) > 0:
                            for line in lines:
                                for check_word in check_files_dict[filename]:
                                    if check_word in line:
                                        rc_df.loc[call_dir, filename] = check_word
                                        continue
                
            # Report the File sizes - code version pending full list of required files
            else:
                for call_dir in call_dir_list:
                    if call_dir in dir_name:
                        fname, fext = os.path.splitext(filename)
                        this_file_data = os.stat(full_filename)
                        if fext[1:] == 'bam':
                            rc_df.loc[call_dir, 'bam'] = str(this_file_data.st_size)
                        elif fext[1:] == 'bai':
                            rc_df.loc[call_dir, 'bam.bai'] = str(this_file_data.st_size)
                            
    return rc_df                 

<a id='function_usage'></a>

## Function usage Examples:
[time stamp strings](#time_stamp) <br>
[find strings in files in the directory tree](#find_strings) <br>
[test results dataframe](#test_results) <br>

<a id='time_stamp'></a>
### Time stamp strings for file names

In [2]:
print('OS sortable sequence string:\t\t', get_time_sequence_string(decimal_shift=4), '\n')
print('Human-readable, UTC time stamp:\t\t', get_readable_time_stamp(n_digits=6))

OS sortable sequence string:		 BFEHIDHFJEEHAJ 

Human-readable, UTC time stamp:		 18_53_14_471255_UTC_2019_01_18

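One way such a stamp might be folded into an output file name is sketched below. The helper name `make_stamped_file_name`, its parameters, and the example base name are hypothetical. Unlike `get_readable_time_stamp`, this sketch puts the year first so that names sort chronologically in a directory listing.

```python
import datetime
import os


def make_stamped_file_name(base_name, extension='.tsv', out_dir='.'):
    """Hypothetical helper: <out_dir>/<base_name>_<UTC stamp><extension>."""
    # Year-first ordering is a choice made for this sketch, not the module's format.
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime('%Y_%m_%d_%H_%M_%S_%f')
    return os.path.join(out_dir, '%s_%s%s' % (base_name, stamp, extension))


stamped_name = make_stamped_file_name('integration_test_results')
print(stamped_name)
```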

<a id='find_strings'></a>

### find string in files in directory tree
```python
VARIABLE_FILE_TYPES = ['.txt', '.file', '.py', '.wdl', '.sh', '.pl']
```

In [3]:
""" Display dictionary with the first function.
    Get the dictionary with the second one. """
string_fragment = 'bam.bai'
dir_name = '../src/shell'

display_string_found_dict(string_fragment, dir_name)
files_string_dict = find_string_fragment_in_dir_tree(string_fragment, dir_name)

bam.bai found in:
../src/shell/deliver_alignment.sh
	lines:  215  241  247 
../src/shell/alignment.sh
	line 276

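The returned dictionary maps each full file name to a list of 1-based line numbers, so it can be summarized with ordinary dictionary operations. The sketch below uses a literal copy of the result displayed above (named `example_dict` here, an assumption made so the snippet runs without the source tree).

```python
from collections import OrderedDict

# Literal copy of the displayed result, so the sketch runs anywhere.
example_dict = OrderedDict([
    ('../src/shell/deliver_alignment.sh', [215, 241, 247]),
    ('../src/shell/alignment.sh', [276]),
])

# Total number of matching lines across all files.
total_hits = sum(len(line_numbers) for line_numbers in example_dict.values())

# File with the most matching lines.
most_hits_file = max(example_dict, key=lambda name: len(example_dict[name]))

# "path:line" strings, a form many editors can jump to.
locations = ['%s:%i' % (name, n)
             for name, line_numbers in example_dict.items()
             for n in line_numbers]

print(total_hits, most_hits_file)
print('\n'.join(locations))
```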

<a id='test_results'></a>

### Parse a test result directory and display as spreadsheet 
* e.g. using cromwell-executions/GermlineMasterWF/5f687ed6-5e53-4864-a526-6e33f56bb4fd
* uploaded locally and not in research repo

In [4]:
x_dir = '/Users/yo/zzIForge/fullyJan10/'
if not os.path.isdir(x_dir):
    print('directory not found\n', x_dir)
    
somedf = get_test_results_dataframe(x_dir)
somedf

| | rc | bam | bam.bai | stderr | stdout |
|---|---|---|---|---|---|
| call-DHVC | 0 | unk | unk | unk | unk |
| call-realign | 0 | 65349394 | 1431360 | unk | START |
| call-align | 0 | 59500466 | 1431328 | unk | Finished |
| call-bqsr | 0 | 67462520 | 1431360 | unk | Finished |
| call-dedup | 0 | 65322724 | 1431360 | unk | Finished |
| call-haplotype | 0 | 67462520 | 1431360 | unk | Finished |
| call-merge | 0 | 59500466 | 1431328 | unk | unk |
| call-vqsr | 0 | unk | unk | unk | Finished |
| call-trimseq | 0 | unk | unk | unk | Finished |
| call-DAB | 0 | 65349394 | 1431360 | unk | unk |

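A minimal sketch of how a results DataFrame like this might be combined with PASS | FAIL criteria and written out as tab-separated text follows. The criteria (`rc` equal to `'0'`, `stderr` left at `'unk'`), the `status` column, and the `call-example-fail` row are assumptions made for illustration; they are not taken from the test suite.

```python
import io

import pandas as pd

# Two rows copied from the table above plus one hypothetical failing row.
results_df = pd.DataFrame(
    {'rc': ['0', '0', '1'],
     'stderr': ['unk', 'unk', 'ERROR'],
     'stdout': ['Finished', 'START', 'unk']},
    index=['call-align', 'call-realign', 'call-example-fail'])

# Assumed criteria: same index, one expected value per checked column.
expected_df = pd.DataFrame({'rc': '0', 'stderr': 'unk'}, index=results_df.index)

# A row passes only if every checked column matches its expected value.
checked_cols = list(expected_df.columns)
matches = results_df[checked_cols].eq(expected_df).all(axis=1)
results_df['status'] = matches.map({True: 'PASS', False: 'FAIL'})

# Write tab-separated text to a buffer; a file path works the same way.
buffer = io.StringIO()
results_df.to_csv(buffer, sep='\t')
tsv_text = buffer.getvalue()
print(tsv_text)

# A non-zero exit code is one way to signal failure to a CI job such as Jenkins.
exit_code = 0 if matches.all() else 1
```

Keeping the criteria in their own DataFrame means they could be loaded from a spreadsheet rather than hard-coded.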

[top](#top) <br>