# Testing Notebook: Building the FPR/TPR matrices for Baum-Welch performance across multiple simulated reps

---
## Purpose of Notebook
With a few sample reps already downloaded locally (done ahead of time), construct two tables for each rep that measure the True Positive Rate (TPR) and False Positive Rate (FPR) across iterations of Baum-Welch, so that the rates can later be averaged across reps.

In [4]:
# Import packages.
import sys
import numpy as np
import gzip

# Print the NumPy version (gzip is part of the standard library and has no __version__ attribute).
print('numpy', np.__version__)

numpy 1.22.3


---
## Function Descriptions

---
### `post_convergence_nan()`
#### Purpose:
`post_convergence_nan()` replaces every value recorded after Baum-Welch converged with `np.nan`, which facilitates averaging later. It starts at the rightmost column of the results file and moves left until it encounters a column with a nonzero value; every column to the right of that one is treated as post-convergence.
#### Input:
- `rep_filepath`: a filepath to a rep that needs to have its post-convergence values converted from 0 to np.nan
#### Output:
- `nan_results`: a NumPy array containing the same results, but with all post-convergence 0 values converted to `np.nan`
- `convergence_index`: the first column in the array containing NaN values (computed internally; the function as currently written returns only the array)


NOTE: The long-term fix is to convert these values to NaN in the HMM function itself.

In [125]:
def post_convergence_nan(rep_filepath):
    # Load the tab-delimited results; np.genfromtxt reads gzip-compressed files directly.
    rep = np.genfromtxt(rep_filepath, delimiter='\t')
    n_cols = rep.shape[1]

    # Default: if every column is zero, the whole array counts as post-convergence.
    convergence_index = 0
    # Scan the columns right to left for the last column holding a nonzero value.
    for c in range(n_cols - 1, -1, -1):
        if not np.all(rep[:, c] == 0):
            # Convergence begins one column to the right of the last nonzero column.
            convergence_index = c + 1
            break

    # Replace every post-convergence column with NaN.
    rep[:, convergence_index:] = np.nan
    return rep
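
The right-to-left scan described above can also be tried on a small in-memory array, without reading a results file. The following is a minimal sketch under stated assumptions: `nan_after_convergence` and the `toy` array are hypothetical names introduced only for illustration, and the helper returns the convergence index alongside the array, which the notebook's function does not.

```python
import numpy as np

def nan_after_convergence(arr):
    """Hypothetical helper: set trailing all-zero columns of arr to NaN."""
    out = arr.astype(float)  # astype copies, so the input array is left untouched
    # Indices of the columns that contain at least one nonzero value.
    nonzero_cols = np.flatnonzero(np.any(out != 0, axis=0))
    # Convergence starts one column past the last nonzero column (0 if there is none).
    convergence_index = int(nonzero_cols[-1]) + 1 if nonzero_cols.size else 0
    out[:, convergence_index:] = np.nan
    return out, convergence_index

# Toy array (illustrative values only): the last two columns are zero padding.
toy = np.array([[0.1, 0.0, 0.3, 0.0, 0.0],
                [0.2, 0.0, 0.4, 0.0, 0.0]])
toy_nan, toy_index = nan_after_convergence(toy)
print(toy_index)  # 3
print(toy_nan)
```

Note that the all-zero column at index 1 is left alone, because only zero columns to the right of the last nonzero column are treated as post-convergence.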

---
### `cross_rep_TPR()`
#### Purpose:
`cross_rep_TPR()` creates a table measuring the total True Positive Rate of a given iteration of Baum-Welch across simulations.
#### Input:
- `rep_array`: an array of NumPy arrays representing the reps (assuming they have already been converted by `post_convergence_nan()`)
#### Output:
- `tpr`: a NumPy array with dimensions `(R x B)`, where `R` is the number of reps being measured and `B` is the Baum-Welch optimization limit set when the HMM was run. The value at position `(r, b)` is the overall True Positive Rate given by $\gamma$ for rep `r` after `b` iterations of Baum-Welch. TPR values in columns after convergence are `np.nan`.
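
As a rough illustration of the `(R x B)` table described above, the sketch below computes one TPR row per rep. Everything about the data layout here is an assumption, not something taken from the results files: `gamma` is taken to be a `(sites x iterations)` array of posterior probabilities, `truth` a boolean vector marking the truly positive sites, and `0.9` a hypothetical calling threshold. It also assumes at least one truly positive site, so the denominator is nonzero.

```python
import numpy as np

def tpr_per_iteration(gamma, truth, threshold=0.9):
    """Hypothetical sketch: TPR of one rep at each Baum-Welch iteration."""
    gamma = np.asarray(gamma, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    with np.errstate(invalid='ignore'):
        called = gamma >= threshold                    # NaN compares as False
    true_pos = (called & truth[:, None]).sum(axis=0)   # TP count per iteration
    tpr = true_pos / truth.sum()                       # TP / (TP + FN)
    # Keep post-convergence iterations as NaN so they drop out of later averages.
    tpr[np.isnan(gamma).all(axis=0)] = np.nan
    return tpr

nan = np.nan
truth = np.array([True, True, False, True])            # illustrative labels only
gamma_a = np.array([[0.95, 0.95, nan],
                    [0.50, 0.92, nan],
                    [0.99, 0.10, nan],
                    [0.20, 0.91, nan]])
gamma_b = np.array([[0.10, nan, nan],
                    [0.95, nan, nan],
                    [0.20, nan, nan],
                    [0.30, nan, nan]])

# One row per rep, one column per iteration: the (R x B) table.
tpr_table = np.vstack([tpr_per_iteration(g, truth) for g in (gamma_a, gamma_b)])
print(tpr_table.shape)  # (2, 3)
print(tpr_table)
```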

---
## Walkthrough

In [118]:
# Load the sample reps from hardcoded local paths; post_convergence_nan() returns each rep as a NumPy array
rep1 = post_convergence_nan('/Users/briankirz/Documents/GitHub/mentee_research/kirz/site_pattern_hmm/results_testing/local_test_reps/prufer_results_rep_id_1.csv.gz')
rep2 = post_convergence_nan('/Users/briankirz/Documents/GitHub/mentee_research/kirz/site_pattern_hmm/results_testing/local_test_reps/prufer_results_rep_id_2.csv.gz')
rep3 = post_convergence_nan('/Users/briankirz/Documents/GitHub/mentee_research/kirz/site_pattern_hmm/results_testing/local_test_reps/prufer_results_rep_id_3.csv.gz')
rep4 = post_convergence_nan('/Users/briankirz/Documents/GitHub/mentee_research/kirz/site_pattern_hmm/results_testing/local_test_reps/prufer_results_rep_id_4.csv.gz')

IndexError: index 104 is out of bounds for axis 1 with size 104

In [74]:
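def cross_rep_TPR(r1, r2, r3, r4):
    # Work in progress: stack the four reps into a single 3-D array.
    # The per-iteration TPR computation described above is not implemented yet.
    reps = np.array([r1, r2, r3, r4])
    return reps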

In [21]:
cross_rep_TPR(rep1, rep2, rep3, rep4).shape

(4, 40000, 104)
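
Because post-convergence entries are stored as `np.nan`, a per-iteration average across reps can skip the reps that have already converged. The sketch below shows one way this might look with `np.nanmean`; the table values are made up for illustration and `toy_tpr` is a hypothetical name, not output from the HMM.

```python
import numpy as np

# Hypothetical (R x B) TPR table: 2 reps, 3 iterations.
# The second rep converged after two iterations, so its last entry is NaN.
toy_tpr = np.array([[0.50, 0.75, 0.80],
                    [0.40, 0.70, np.nan]])

# Average over reps (axis 0), ignoring NaN entries.
mean_tpr = np.nanmean(toy_tpr, axis=0)

# Number of reps that actually contribute to each iteration's average.
reps_per_iteration = (~np.isnan(toy_tpr)).sum(axis=0)

print(mean_tpr)
print(reps_per_iteration)
```

Reporting `reps_per_iteration` next to the mean is a design choice worth considering: late iterations are averaged over fewer reps, so their means are noisier than those of early iterations.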