# Running a Hidden Markov Model with the Baum-Welch Algorithm on Simulated Ancestral Data

---
## Purpose of Guide
Using simulated genomic data of human populations, this Hidden Markov Model infers the likelihood of introgression from an archaic population at each locus on a simulated genome. It then uses the Baum-Welch algorithm to re-estimate the model's parameters and improve this detection.
In this guide, I'll walk through how to run my HMM from start to finish on a single pre-generated replicate ID.

In [8]:
# Import packages.
import sys
import numpy as np
import time

# Print versions of our libraries.
# sys and time are standard-library modules with no __version__ attribute,
# so we report the Python interpreter version instead.
print('python', sys.version)
print('numpy', np.__version__)

## Functions

---
### `genotype_matrix_windows()`
#### Purpose:
`genotype_matrix_windows()` is a helper function that splits a genotype matrix into non-overlapping windows and assigns each position that is variable across populations to the window that contains it. By default, the length of the simulated genome is 20 million base pairs. Each window is 500 base pairs long, with an inclusive start and an exclusive end (e.g. [0, 500)). This results in 40,000 windows covering [0, 20,000,000).
#### Input:
- `variant_positions`: an array of the positions of variant sites across populations. Each index represents a separate variant (ordered by relative position but not evenly spaced), and its corresponding cell value represents the location of that variant position on the genome.
- `polarized_genotype_matrix`: an array of multiple populations. It includes the ancestral population whose introgression is being inferred (Neanderthals), the population in whom introgression is being tested (Europeans), one or more sister populations to the one being tested, as a control (Africans), and an "ancestral state" reference population against which the rest can be polarized (Chimpanzee). Only biallelic sites are included. The ancestral allele and all identical populations are represented by '0' while the derived mutated allele is represented by '1'.
- `window_size`: the size of each window in base pairs. Set to 500 by default. In their 2014 paper, Prufer and colleagues used a genetic map with crossover positions. Here, position is instead measured by relative physical location on the chromosome, \[0 to 20,000,000).
- `sequence_length`: the length of the simulated genome being tested. Set to 20,000,000 by default.

#### Output:
`windows` (dictionary). It has the following key ==> value relationship:

*Window number (from 1 to 40,000) => [window start position, window end position, index of variant position in the input array].*

Types are:
`int => array[int, int, int...]`

If there are multiple variant positions within a window, their indices in the input array of variant positions are appended to the value array.
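
To make the output structure concrete, here is a minimal sketch on a toy 2,000 base pair sequence. The helper name `bin_variants` and the toy positions are hypothetical and chosen for illustration only. It uses integer division to find each variant's window, which is a shortcut and not the loop used by `genotype_matrix_windows()`, but it should produce the same key ==> value layout described above.

```python
import numpy as np


def bin_variants(variant_positions, window_size=500, sequence_length=2_000):
    """Toy sketch (hypothetical helper): window number -> [start, stop, variant indices...]."""
    windows = {}
    # Window numbers start at 1; bounds are [start, stop).
    for number, start in enumerate(range(0, sequence_length, window_size), start=1):
        windows[number] = [start, start + window_size]
    # Integer division maps a position to its window number.
    for idx, pos in enumerate(variant_positions):
        number = int(pos // window_size) + 1
        if number in windows:
            windows[number].append(idx)
    return windows


toy_positions = np.array([10, 499, 500, 1750])
toy_windows = bin_variants(toy_positions)
for number, contents in toy_windows.items():
    print(number, contents)
```

Note that position 499 lands in window 1 and position 500 lands in window 2, because the end of each window is exclusive. Window 3 contains no variants, so its value holds only the two bounds.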

In [31]:
def genotype_matrix_windows(
        variant_positions,
        polarized_genotype_matrix,
        window_size=500,
        sequence_length=20_000_000,
):
    # Initialize a dictionary with the start and stop position for each window.
    windows = {}
    index = 1
    # Create consistent-length windows spanning the length of the sequence
    for window_start in range(0, int(sequence_length), int(window_size)):
        # The index (window number) is set as the key to a value of an array which contains its start and stop position
        windows[index] = [window_start, (window_start + window_size)]
        index += 1
    # Locate and assign each variant position to its respective window
    # keeps track of index number in the variant_position array
    index = 0
    pos = variant_positions[index]
    for key in windows:
        # extract the window bounds (any later elements are variant indices, so slice them off)
        start, stop = windows[key][:2]
        # "bin" the variant into the window if it is within bounds
        while start <= pos < stop:
            # append the index of the variant position to the corresponding value array in windows
            windows[key].append(index)
            index += 1
            if index < len(variant_positions):
                pos = variant_positions[index]
            else: # (all variant positions have been binned)
                break
    # window # (1-40,000) -> [0 (start), 500 (stop), index of local variable positions (if any)]
    return windows

---
### `calc_window_intro_percent()`
#### Purpose:
`calc_window_intro_percent()` stores the locations of genomic regions that are a result of the archaic population introgression that the HMM will infer. Known as "true introgression positions," these segments represent the hidden states of the HMM. In practice, the exact loci in modern human DNA that are a result of Neanderthal introgression cannot be known, so in order to evaluate the model's efficacy and compare its performance, we record the true introgression positions during data simulation to create an "answer key".

To this end, this function creates a dictionary of windows similar to the one created by `genotype_matrix_windows()`, but instead of recording the bounds and variant positions of each window, it represents how much each window is covered by a segment of "true introgression" as a percentage value. The data structure allows the quick identification of areas of true introgression in the genome, which allows the HMM's accuracy to be evaluated.

#### Input:
- `Binned_windows`: a dictionary where the keys represent the window number from 1 to 40,000 and the values are arrays where the first two elements represent positional boundaries, and any following elements represent the indices of variant positions that lie within that window in the `variant_positions` array. `Binned_windows` is the direct output of `genotype_matrix_windows()`.
- `true_introgression_positions`: nparray representing the locations of introgressed loci on the genome. Each row represents a different introgressed segment. The first column represents its starting location, and the second column represents its stopping location.

#### Output:
- `Win_intro_percent`: a dictionary with one entry per 500 base pair window, used to keep track of the true introgression state of each window. The key is the window number and the value is a float between 0 and 1 giving the fraction of the window that is covered by a true introgression segment.
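
As a cross-check on the idea of "fraction of a window covered," the sketch below computes the same quantity a different way: by clipping the overlap between each segment and each window. This is an alternative formulation, not the iteration used in `calc_window_intro_percent()`, and the helper name `window_coverage` and the toy segments are hypothetical. It assumes segments do not overlap one another.

```python
import numpy as np


def window_coverage(segments, window_size=500, sequence_length=2_000):
    """Toy sketch (hypothetical helper): window number -> covered fraction in [0, 1]."""
    n_windows = sequence_length // window_size
    win_starts = np.arange(n_windows) * window_size
    win_stops = win_starts + window_size
    covered = np.zeros(n_windows)
    for seg_start, seg_stop in segments:
        # Overlap of [seg_start, seg_stop) with every window; negative means no overlap.
        overlap = np.minimum(win_stops, seg_stop) - np.maximum(win_starts, seg_start)
        covered += np.clip(overlap, 0, None)
    return {i + 1: covered[i] / window_size for i in range(n_windows)}


# Each row is one introgressed segment: [start, stop].
toy_segments = np.array([[400, 600], [1000, 1750]])
toy_coverage = window_coverage(toy_segments)
for number, fraction in toy_coverage.items():
    print(number, fraction)
```

The first toy segment is shorter than a window but straddles the boundary at 500, so it contributes 0.2 to window 1 and 0.2 to window 2. The second segment fully covers window 3 and half of window 4.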

In [35]:
def calc_window_intro_percent(Binned_windows, true_introgression_positions):

    Windows = Binned_windows
    true_intro_pos = true_introgression_positions
    
    # Initializing dictionary of Window Introgression Percentages
    Win_intro_percent = {}
    # Extract the columns into numpy arrays and round.
    # Sorting makes iterating easier. Not changing any start positions. intro_starts is 'official' starting position
    intro_starts = np.sort(np.round(true_intro_pos[:, 0]))
    intro_stops = np.sort(np.round(true_intro_pos[:, 1]))
    intro_sizes = np.sort(intro_stops - intro_starts)

    # The index of the true introgression segment in start/stop/sizes
    intro_index = 0
    for key in Windows:
        # if intro_index is the same as the number of true introgressed segments, we can end and assign the rest 0
        if intro_index == intro_sizes.shape[0]:
            Win_intro_percent[key] = 0.
        else:
            # Tracking indices
            # integer starting and ending positions of the true introgressed segments
            curr_start = int(intro_starts[intro_index])
            curr_stop = int(intro_stops[intro_index])
            # integer offset of curr_start and curr_stop from most recent window
            curr_start_mod = int(intro_starts[intro_index] % 500)
            curr_stop_mod = int(intro_stops[intro_index] % 500)
            # current window that contains the beginning or end of the current segment
            curr_start_window = int(((curr_start - curr_start_mod) / 500) + 1)
            curr_stop_window = int(((curr_stop - curr_stop_mod) / 500) + 1)
            # boolean that tracks whether the segment falls completely within one window (exception).
            # Comparing window numbers, rather than testing for a length under 500, also handles
            # short segments that straddle a window boundary.
            tiny_intro = curr_start_window == curr_stop_window
            # skips windows that come before the current start window
            if key < curr_start_window:
                Win_intro_percent[key] = 0.
            elif key == curr_start_window:
                # If the segment starts and stops in this window, its length gives the covered fraction
                if tiny_intro:
                    Win_intro_percent[key] = (curr_stop - curr_start) / 500
                    # since this counts as a whole segment, we have to tick the index to search for the next segment
                    intro_index += 1
                else:  # normal case, the true introgressed segment is over 500 base pairs long
                    # calculates the % of the window that is covered by the segment from curr_start to the window's end
                    Win_intro_percent[key] = (Windows[key][1] - curr_start) / 500
            # In the middle of the introgressed segment, so each window is 100% covered
            elif curr_start_window < key < curr_stop_window:
                Win_intro_percent[key] = 1.
            # In the last window containing the segment. It should be partially introgressed.
            elif key == curr_stop_window:
                # calculates the % of the window that is covered by the segment from the window's start to curr_stop
                Win_intro_percent[key] = (curr_stop - Windows[key][0]) / 500
                # since we found the stop window of a large segment, we can move onto the next segment, if any
                intro_index += 1
                # check to make sure that we record the same number of windows as there are segments
                if intro_index > intro_sizes.shape[0]:
                    print("ERROR: Recorded more windows than there are segments")
                    break
            else:  # Error check
                print("----------------------")
                print("ERROR: bug in key iteration for calculation of introgression percentages")
                print("----------------------")
                break

    return Win_intro_percent

### `logsum()`
#### Purpose: 
`logsum()` applies the NumPy function `numpy.logaddexp()` across every element of a multidimensional array. `numpy.logaddexp(x1, x2)` calculates the logarithm of the sum of the exponentiations of its two inputs, element by element. This is useful in statistical methods where the calculated probabilities of events become so small that they underflow the range of floating-point numbers and the computer loses information by rounding them to zero.

In this model, some derived values can be smaller than 1*10^-130.

In cases like these, values are stored as the logarithm of the true probability. `logaddexp()` allows such probabilities to be added without leaving log space, in the form `log(exp(arr1) + exp(arr2))`.

`logsum()` is a more flexible version that accepts multidimensional arrays as valid input by stringing all of the data into a one-dimensional array, appending rows one after another. It converts the input `[[1, 2], [3, 4]]` into `[1, 2, 3, 4]` before calling `logaddexp()`.
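
The sketch below illustrates the idea under stated assumptions; it is not necessarily how `logsum()` is written in this project. The name `logsum_sketch` is hypothetical. It flattens the input with `np.ravel()` and then folds `np.logaddexp` over the flattened values using the ufunc's `reduce` method, and compares the result against the naive approach that exponentiates first.

```python
import numpy as np


def logsum_sketch(log_values):
    """Hypothetical sketch: log(sum(exp(x))) over every element of an array of any shape."""
    # Flatten row by row so a 2-D input becomes one long 1-D array.
    flat = np.ravel(np.asarray(log_values, dtype=float))
    # Fold logaddexp over the flattened values, staying in log space throughout.
    return np.logaddexp.reduce(flat)


# Four log-probabilities far too small to represent once exponentiated.
log_probs = np.array([[-1000.0, -1000.0], [-1000.0, -1000.0]])

# Naive approach: exp(-1000) underflows to 0, so the log of the sum is -inf.
with np.errstate(divide="ignore"):
    naive = np.log(np.sum(np.exp(log_probs)))

# Log-space approach keeps the information: -1000 + log(4).
stable = logsum_sketch(log_probs)
print(naive, stable)
```

The naive result loses all information, while the log-space result stays finite. Flattening first means the reduction yields a single value; without it, `reduce` would collapse only the first axis of a two-dimensional input.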

#### Input:
- `array`: a NumPy array of log-probabilities, which may be multidimensional.
#### Output:

In [45]:
array = np.zeros((2, 2))

array[0][1] = 1
array[1][0] = 2
array[1][1] = 3

# np.product was removed in NumPy 2.0; np.prod is the supported name.
np.reshape(array, np.prod(array.shape))
# np.reshape(array, (np.prod(array.shape),))
# array

array([0., 1., 2., 3.])

### `genotype_matrix_windows`

### `calc_alpha`

### `calc_beta`

### `calc_xi`

In [15]:
### `calc_gamma`

In [16]:
### `update_A`

In [None]:
### `update_B`

In [17]:
### `update_pi`

In [18]:
### `hmm`

In [19]:
### `eval_accuracy`

In [20]:
### `viterbi`

In [21]:
### `compute_logP`

In [22]:
### `print_results`

In [23]:
### `logsum`

In [24]:
### `exp_logsum`

In [25]:
### `compare_3`

In [None]:
### `display_performances`

---
## Example Workflow

1) Explain types of simulated data
2) Load in simulated data
3) 

In [32]:
windows = 24

In [33]:
windows

24

---
### Workflow function

#### Done step by step

#### Once as a large function, compare its output to