# Baymer Tutorial

Baymer can be run either by importing the baymer package or by running each script directly. This tutorial primarily demonstrates how to run Baymer by importing its modules; the commands to run each script individually are also shown.

The first part of the tutorial demonstrates the standard Baymer pipeline; a few downstream applications are described afterwards.

## Standard Baymer Pipeline

Baymer expects JSON files of the following format as input:

{Context: [A polymorphisms, C polymorphisms, G polymorphisms, T polymorphisms, total contexts, context reference index, list reference index], ...}

where the "context reference index" is the 0-indexed position of the reference nucleotide in the context and "list reference index" is the position in the list of the reference nucleotide. Therefore, for the 3-mer context "AAA", the context reference index would be 1 and the list reference index would be 0.

Note that the input JSON should only contain a single context reference index and a single sequence context length. In other words, all contexts should be of uniform length (e.g. all 3-mers) and should have the same nucleotide in scope (e.g. for 3-mers, the central nucleotide is the polymorphic nucleotide for all context counts enumerated).
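The format and constraints described above can be checked with a short script. The following is a minimal sketch, not part of Baymer: the function name `validate_counts` and the counts in `example` are hypothetical and made up for illustration. It assumes the first four list entries follow the order A, C, G, T, as in the format shown above.

```python
import json

NUCLEOTIDES = "ACGT"  # assumed order of the first four list entries


def validate_counts(counts):
    """Check the constraints described above on a context-count dictionary.

    Each value is expected to be:
    [A, C, G, T polymorphisms, total contexts,
     context reference index, list reference index]
    """
    if not counts:
        raise ValueError("no contexts found")
    for context, values in counts.items():
        if len(values) != 7:
            raise ValueError(f"{context}: expected 7 entries, got {len(values)}")
    # All contexts must share one length and one context reference index.
    lengths = {len(context) for context in counts}
    ref_positions = {values[5] for values in counts.values()}
    if len(lengths) != 1 or len(ref_positions) != 1:
        raise ValueError("contexts must share one length and one reference index")
    # The list reference index must point at the reference nucleotide.
    for context, values in counts.items():
        ref_nucleotide = context[values[5]]
        if NUCLEOTIDES.index(ref_nucleotide) != values[6]:
            raise ValueError(f"{context}: list reference index does not match")
    return lengths.pop(), ref_positions.pop()


# Made-up counts for two 3-mers with the central nucleotide in scope.
example = {
    "AAA": [0, 12, 30, 8, 1000, 1, 0],
    "ACA": [9, 0, 4, 25, 800, 1, 1],
}
mer_length, ref_index = validate_counts(example)
print(mer_length, ref_index)
print(json.dumps(example))
```

A file on disk could be checked the same way by loading it with `json.load` and passing the resulting dictionary to the function.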

Generating these files requires counting both the sequence contexts and the polymorphisms. We provide scripts to count these quantities and format the proper JSON if desired.

Note that throughout this tutorial we use 5-mer sequence context windows for speed, but the mer length can be set as desired. The only caveat is that a different mer length requires adjusting the buffer on the BED file to accommodate the given sequence context length (described in further detail below).
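As a rough illustration of the buffering step, the sketch below assembles the two BEDtools commands (`slop` to pad the BED intervals, `getfasta` to extract the sequence). It only builds and prints the commands; it does not run them. All file names are hypothetical, and the buffer of 5 simply mirrors the `buffer_bp` value used later in this tutorial, which is an assumption about how the tutorial data were prepared.

```python
import shlex


def bedtools_commands(bed, genome_sizes, reference_fasta, buffer_bp,
                      padded_bed, out_fasta):
    """Build (but do not run) the BEDtools commands for a buffered FASTA."""
    # Pad every interval by buffer_bp on both sides; slop writes to stdout.
    slop = ["bedtools", "slop", "-i", bed, "-g", genome_sizes,
            "-b", str(buffer_bp)]
    # Extract the padded regions from the reference FASTA.
    getfasta = ["bedtools", "getfasta", "-fi", reference_fasta,
                "-bed", padded_bed, "-fo", out_fasta]
    return slop, getfasta


# Hypothetical file names, for illustration only.
slop, getfasta = bedtools_commands(
    bed="cpg_islands.bed",
    genome_sizes="genome.chrom.sizes",
    reference_fasta="reference.fa",
    buffer_bp=5,
    padded_bed="cpg_islands.padded.bed",
    out_fasta="cpg_islands.padded.fa",
)
print(shlex.join(slop), ">", shlex.quote("cpg_islands.padded.bed"))
print(shlex.join(getfasta))
```

The same buffer value would then be passed to the counting scripts through `-b/buffer_bp`.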

### Counting Contexts

#### Inputs

Note: each input below is prefixed with its standalone-script flag and its named argument for the imported module, in the form **(-a/named argument)**

* **(-c/config_file)** config file
    * The format of this file can be found [here](https://github.com/bvoightlab/Baymer/blob/main/tutorial_data/context_and_mutation_counter_config.yaml)
    * The designated FASTA files must be trimmed to include only the feature region of interest. We recommend using [BEDtools](https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html) to accomplish this. We also recommend adding a buffer to the trimming BED file (for example with [BEDtools slop](https://bedtools.readthedocs.io/en/latest/content/tools/slop.html)) before extracting the FASTAs. The buffer accounts for the overhang of the sequence context, so that the first position in the feature can be counted as the nucleotide in scope; without it, that locus could only appear as a flanking nucleotide in the sequence contexts ultimately counted.
* **(--feature/feature)** feature of interest (must be designated in the config file)
* **(-m/mer_length)** the size of the sequence context desired to be counted
* **(--co/context_output_file)** the desired output file location and name
* **(-b/buffer_bp)** the buffer shift of the fasta (this parameter defaults to the symmetric flank size of the mer size given)
* **(-a/offset)** the offset of the mer from perfect symmetry
    * this allows for counting mers with different nucleotides in scope than the default center nucleotide
    * offsets are designated with the following examples of length 3 and 4 mers:
        - 3-mers: NNN -> -1 0 1
        - 4-mers: NNNN -> -2 -1 1 2
    * by default the script assumes symmetric mers i.e offset = 0 for odd mers and -1 for even mers
* **(-u/unfolded)** whether you want unfolded rather than folded sequence contexts (when folding, contexts with a T nucleotide in scope are folded to A contexts and those with a G to C contexts)
* **(-h/high_confidence)** whether you only want to use high-confidence bases from the FASTAs (this assumes they are designated as such with capital letters)
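The offset convention above can be made concrete with a small sketch. This is one reading of the examples given (positions labeled left to right, with no 0 label for even mers), not Baymer's actual implementation; the function names are hypothetical.

```python
def offset_labels(mer_length):
    """Return the offset label of each position, left to right.

    Odd mers include 0 (e.g. 3-mers: -1 0 1); even mers skip it
    (e.g. 4-mers: -2 -1 1 2), following the examples above.
    """
    half = mer_length // 2
    if mer_length % 2 == 1:
        return list(range(-half, half + 1))
    return [p for p in range(-half, half + 1) if p != 0]


def context_reference_index(mer_length, offset=None):
    """Map an offset to the 0-indexed position of the nucleotide in scope."""
    if offset is None:
        # Default described above: 0 for odd mers, -1 for even mers.
        offset = 0 if mer_length % 2 == 1 else -1
    # Raises ValueError if the offset is not a valid label for this length.
    return offset_labels(mer_length).index(offset)


print(offset_labels(3), context_reference_index(3))
print(offset_labels(4), context_reference_index(4))
print(context_reference_index(5, offset=0))
```

Under this reading, the default offset for a 3-mer gives a context reference index of 1, matching the "AAA" example in the input format description.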

#### Running standalone script

python baymer/context_counter.py \
    **-c** tutorial_data/context_and_mutation_counter_config.yaml \
    **--feature** cpg_islands \
    **-m** 5 \
    **--co** tutorial_data/context_count_out.tsv \
    **-b** 5 \
    **-a** 0 \
    **-h**


#### Running in python script

from baymer import context_counter

config_file = "tutorial_data/context_and_mutation_counter_config.yaml"
context_count_output_file = "tutorial_data/context_count_out.tsv"

# run context counter
context_counter.driver(config_file = config_file, 
                       feature = "cpg_islands", 
                       mer_length = 5, 
                       context_output_file = context_count_output_file,
                       offset = 0,
                       buffer_bp = 5,
                       unfolded = False,
                       high_confidence = True)

TypeError: driver() got an unexpected keyword argument 'config_file'

### Counting Mutations

#### Inputs

Note: each input below is prefixed with its standalone-script flag and its named argument for the imported module, in the form **(-a/named argument)**

* **(-c/config_file)** FASTA feature config file
    * The format of this file can be found [here](https://github.com/bvoightlab/Baymer/blob/main/tutorial_data/context_and_mutation_counter_config.yaml)
    * Note that this is the same config file as used in the context counting script
* **(--feature/feature)** feature of interest (must be designated in the config file)
* **(-p/pop)** the population of interest (must be designated in the config file)
* **(-m/mer_length)** the size of the sequence context desired to be counted
* **(--mo/mutation_output_file)** the desired output file location and name
* **(--ac/allele_count)** specifies the allele count you would like to count
* **(--min/min_bool)** specifies that you would like to count *at least* the number of alleles specified (i.e with this option the script will count the # of alleles specified or greater)
* **(-b/buffer_bp)** the buffer shift of the fasta (this parameter defaults to the symmetric flank size of the mer size given)
* **(-a/offset)** the offset of the mer from perfect symmetry
    * this allows for counting mers with different nucleotides in scope than the default center nucleotide
    * offsets are designated with the following examples of length 3 and 4 mers:
        - 3-mers: NNN -> -1 0 1
        - 4-mers: NNNN -> -2 -1 1 2
    * by default the script assumes symmetric mers i.e offset = 0 for odd mers and -1 for even mers
* **(-u/unfolded)** whether you want unfolded rather than folded sequence contexts (when folding, contexts with a T nucleotide in scope are folded to A contexts and those with a G to C contexts)
* **(-h/high_confidence)** whether you only want to use high-confidence bases from the FASTAs (this assumes they are designated as such with capital letters)
* **(--nygc/nygc_bool)** Boolean to specify whether the VCF of interest uses 1KG or gnomAD population formatting
    * this controls the syntax of population names and QC metrics. If a standard other than these two is used, this may need to be adjusted
    * gnomAD populations are labeled "POP_AC", etc.
    * 1KG populations 
* **(--fasta-consistent/fasta_consistent)** Boolean to specify that you would like to enforce VCF consistency with the FASTA (e.g. the FASTA is ancestral but the VCF is not)
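To illustrate what folding means for the `unfolded` option above, the sketch below folds a context by taking its reverse complement whenever the nucleotide in scope is T or G. Reverse complementing is a common convention and is assumed here; it is not confirmed to be how Baymer folds contexts internally. The function name is hypothetical, and uppercase input is assumed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def fold_context(context, ref_index):
    """Fold a context so the nucleotide in scope is always A or C.

    Returns the (possibly reverse-complemented) context and the position
    of the nucleotide in scope within it.
    """
    if context[ref_index] in "GT":
        folded = context.translate(COMPLEMENT)[::-1]
        # Reversing the string mirrors the position of the nucleotide in scope.
        return folded, len(context) - 1 - ref_index
    return context, ref_index


for context in ["TTG", "ACGTA", "ACA"]:
    ref_index = len(context) // 2  # central nucleotide in scope
    print(context, "->", fold_context(context, ref_index))
```

For symmetric odd mers the position of the nucleotide in scope is unchanged by folding; for asymmetric contexts it is mirrored, which is why the function returns the index alongside the folded context.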

#### Running standalone script

python baymer/mutation_counter.py \
    **-c** tutorial_data/context_and_mutation_counter_config.yaml \
    **--feature** cpg_islands \
    **-p** AFR \
    **-m** 5 \
    **--mo** tutorial_data/AFR_mutation_count_out.tsv \
    **--ac** 1 \
    **--min** \
    **-b** 5 \
    **-a** 0 \
    **-h** \
    **--nygc** \
    **--fasta-consistent**
    

#### Running in python script

from baymer import mutation_counter

mutation_count_output_file = "tutorial_data/AFR_mutation_count_out.tsv"

# run mutation counter
mutation_counter.driver(config_file = config_file, 
                       feature = "cpg_islands",
                       pop = "AFR",
                       mer_length = 5, 
                       mutation_output_file = mutation_count_output_file,
                       allele_count = 1,
                       min_bool = True,
                       offset = 0,
                       buffer_bp = 5,
                       unfolded = False,
                       high_confidence = True,
                       nygc_bool = True,
                       fasta_consistent = True)

chr1
VCF REF matched ancestral:  10815
VCF ALT matched ancestral:  0
Could not find ancestral match:  0
Total size of df before qc:  10815
Output file successfully saved
tutorial_data/AFR_mutation_count_out.tsv
