# OVP tutorial

This tutorial walks through the pipeline, which performs an adaptive permutation test to identify outside variants that significantly alter risk beyond the GWAS SNP. We try to replicate the experience of running the script in a sandbox environment.

In this tutorial, we will run the pipeline on sample gen files and focus on one type of the pipeline: the case-control pipeline (`PIPE_TYPE = CC`). Before continuing, we highly recommend you go through the [Inputs Explanation](#inputs_explanation) section first, exploring the example input files in this directory as you go.

We first run the script with the `-h` option to see its required and optional arguments:

In [3]:
%run python_scripts/OVP_script.py -h

usage: OVP_script.py [-h] [--override] [--debug_mode]
                     [--one_pair [{GWAS}_{Outside}]] [--iter ITER]
                     [--case CASE] [--control CONTROL]
                     input_folder_path case_gen control_gen SNP_pairs
                     init_file output_folder unique_identifier

outside var pipeline

positional arguments:
  input_folder_path     absolute path of input folder
  case_gen              name chromosome-specific case .gen file, with path
                        specified using argument PATH in init file. Example:
                        ALL_MS_impute2_chr20.gen
  control_gen           chromosome-specific control .gen file, with path
                        specified using argument PATH in init_file. Example:
                        ALL_controls_58C_NBS_WTC2_impute2_chr20.gen
  SNP_pairs             file with each line having GWAS_rsid and outside_rsid,
                        separated by a delimiter. Example: SNPpairs_SAMPLE
  init_file    

In [1]:
%run python_scripts/OVP_script.py\
input_data_other CASE_chr22_10SNP_1k.gen\
CONTROL_chr22_10SNP_1k.gen SNPpairs_chr22_n10\
INIT_small_file_10iter_nocheck.txt\
test test --override

INFO:root:Starting outside variant pipeline analysis
INFO:root:Initializing pipeline. This might take a few seconds.
INFO:root:Making output directory
INFO:root:READING PAIRING FILE
INFO:root:READING SAMPLE FILE(S) (if found)
INFO:root:ALL COVARIATES FOUND
INFO:root:ALL COVARIATES FOUND
INFO:root:READING GENOTYPE FILE(S)
INFO:root:Running pipeline...
INFO:root:File for G_O_dict not found in current directory: /lab/corradin_data/FOR_AN/OUTSIDE_VARIANT_PIPELINE/github_repos/outside-variants/test_all_files, creating from scratch
INFO:root:1 out of 1 GWAS rsIDs found in ../input_data_gen/CASE_chr22_10SNP_1k.gen.
INFO:root:10 out of 10 outside rsIDs found in ../input_data_gen/CASE_chr22_10SNP_1k.gen.
INFO:root:1 out of 1 GWAS rsIDs found in ../input_data_gen/CONTROL_chr22_10SNP_1k.gen.
INFO:root:10 out of 10 outside rsIDs found in ../input_data_gen/CONTROL_chr22_10SNP_1k.gen.
INFO:root:File for case_combined and/or control_combined not found, creating and saving both dicts
INFO:root:work be

Running the pipeline again without the `--override` flag causes an error:

In [5]:
%run python_scripts/OVP_script.py\
input_data_other CASE_chr22_10SNP_1k.gen\
CONTROL_chr22_10SNP_1k.gen SNPpairs_chr22_n10\
INIT_small_file_10iter_nocheck.txt\
test test

ERROR:root:File `'python_scripts/super_pipeline_with_timing.py'` not found.


Running in debug mode gives you more information about the pipeline and its runtime, but the output can get cluttered very quickly:

In [6]:
%run python_scripts/OVP_script.py\
input_data_other CASE_chr22_10SNP_1k.gen\
CONTROL_chr22_10SNP_1k.gen SNPpairs_chr22_n10\
INIT_small_file_10iter_nocheck.txt\
test test --override --debug_mode

ERROR:root:File `'python_scripts/super_pipeline_with_timing.py'` not found.


---
<a id='inputs_explanation'></a>

# Inputs explanation:
## Required inputs

### `input_folder_path`
A relative path to a directory where you place all the other inputs (**`init_file, snp_pairs, MTC_file, sample_file`**) except for the genetic input files. Although you can also put the genetic files in the same folder (discussed later), we find that keeping them separate helps organize files: large genetic files are often kept on a separate storage system, so moving them or storing multiple copies is impractical.
   
   

### `SNP_pairs`

A pairing file matching GWAS rsIDs to outside rsIDs, **separated by a space**. See the [example pairing file](./input_data_other/SNPpairs_chr22_n10).

### `case_gen` and `control_gen`

Example files: [Case gen](./input_data_gen/CASE_chr22_10SNP_1k.gen), [Control gen](./input_data_gen/CONTROL_chr22_10SNP_1k.gen)

Expected format: space-separated, with 5 leading columns followed by three columns of genotype probabilities for each subject.

![gen_file_format](images/gen_file_format.png)
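To make the layout concrete, here is a minimal sketch of how one line of such a file could be split into its five leading columns and per-subject probability triplets, and how a triplet could be turned into a single genotype call using the default `CUTOFF` values (`.9,.3`) described in the init file section. The names `GenRecord`, `parse_gen_line`, and `call_genotype` and the sample line are made up for illustration; this is not the pipeline's own parsing code.

```python
from typing import List, NamedTuple, Optional, Tuple


class GenRecord(NamedTuple):
    header: List[str]                          # the five leading columns, kept as raw strings
    probs: List[Tuple[float, float, float]]    # one probability triplet per subject


def parse_gen_line(line: str, n_header: int = 5) -> GenRecord:
    """Split a space-separated .gen line into leading columns and triplets."""
    fields = line.split()
    header, values = fields[:n_header], fields[n_header:]
    if len(values) % 3 != 0:
        raise ValueError(
            f"expected a multiple of 3 probability columns, got {len(values)}"
        )
    probs = [
        tuple(float(v) for v in values[i:i + 3])
        for i in range(0, len(values), 3)
    ]
    return GenRecord(header, probs)


def call_genotype(triplet, high: float = 0.9, low: float = 0.3) -> Optional[int]:
    """Return the index (0, 1 or 2) of the called genotype, or None.

    A call is made only if one probability is >= high and the
    other two are < low, mirroring the CUTOFF rule in the init file.
    """
    for i, p in enumerate(triplet):
        others = [q for j, q in enumerate(triplet) if j != i]
        if p >= high and all(q < low for q in others):
            return i
    return None


# Hypothetical line: five leading columns, then three subjects.
line = "22 rs0001 16050000 A G 1 0 0 0.05 0.95 0 0.4 0.4 0.2"
record = parse_gen_line(line)
calls = [call_genotype(t) for t in record.probs]
print(record.header)
print(calls)
```

The third subject in the sample line has no probability at or above the high cutoff, so no call is made for it; how the pipeline labels such subjects is governed by the `NA` keyword.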


* IF PIPE_TYPE IS "CC":
    - FILE1: file containing case genotype data
    - FILE2: file containing control genotype data
    - PAIRING: file matching GWAS rsIDs to outside rsIDs (two columns: GWAS OUTSIDE)

* IF PIPE_TYPE IS "COMB":
    - FILE1: file containing all genotype data (case and control combined)
    - FILE2: sample file (contains information on which columns in FILE1 are case and which are control)
    - PAIRING: file matching GWAS rsIDs to outside rsIDs (two columns: GWAS OUTSIDE)

* IF PIPE_TYPE IS "TRANS_CC":
    - FILE1: file containing a single column of filenames, each file containing case genotype data
    - FILE2: file containing a single column of filenames, each file containing control genotype data
    - PAIRING: file matching GWAS rsIDs to outside rsIDs (six columns: GWAS CASE1 CONTROL1 OUTSIDE CASE2 CONTROL2)
        * CASE1: case file containing GWAS rsID data
        * CONTROL1: control file containing GWAS rsID data
        * CASE2: case file containing outside rsID data
        * CONTROL2: control file containing outside rsID data

* IF PIPE_TYPE IS "TRANS_COMB":
    - FILE1: file containing a single column of filenames, each file containing combined (case + control) genotype data
    - FILE2: sample file (contains information on which columns are case and which are control)
    - PAIRING: file matching GWAS rsIDs to outside rsIDs (four columns: GWAS CHR1 OUTSIDE CHR2)
        * CHR1: genotype file containing GWAS rsID data
        * CHR2: genotype file containing outside rsID data
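For the `CC` and `COMB` pipelines, the two-column pairing file can be read into a mapping from each GWAS rsID to its outside rsIDs. The sketch below assumes a single-space delimiter and uses invented rsIDs; `read_pairs` is a hypothetical helper, not a function from the pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def read_pairs(lines: Iterable[str], delim: str = " ") -> Dict[str, List[str]]:
    """Group outside rsIDs by GWAS rsID from two-column pairing lines."""
    pairs = defaultdict(list)
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(delim)
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 columns, got {len(fields)}"
            )
        gwas, outside = fields
        pairs[gwas].append(outside)
    return dict(pairs)


# Invented pairing lines: GWAS rsID first, outside rsID second.
sample = ["rs100 rs201", "rs100 rs202", "", "rs300 rs401"]
pairs = read_pairs(sample)
print(pairs)
```

With a real file, the same function can be called on an open file handle, for example `read_pairs(open(path))`.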
   
### `init_file`: file containing additional pipeline parameter values

[Example init file](input_data_other/INIT_small_file_10iter_nocheck.txt)

INIT_FILE format:
   * two columns, tab-delimited (KEYWORD   VALUE)
   * init file keywords:
        * PIPE_TYPE
            - type of pipeline desired
            - possible values:
                * CC:   pipeline with case and control input files
                * COMB: pipeline with genotype and sample input files
                * TRANS_CC:   pipeline with multiple case and control input files
                * TRANS_COMB: pipeline with multiple genotype and sample input files
        * GS
            - column index for start of genotype data (1-indexed)
        * ITER
            - desired number of randomizing iterations to perform
        * TRIPS
            - binary value indicating whether or not the genotype data is in triplicate or letter format
            - 0 indicates letter format, 1 indicates triplicate format
        * RSID
            - column index for rsID info (1-indexed)
        * SNP
            - column index for SNP information (1-indexed)
        * DELIM
            - delimiter for input files (FILE1, FILE2, PAIRING)
            - possible values:
                * TAB: tab-delimited ('\t')
                * SPACE: space-delimited (' ')
        * OR_CALC
            - integer value indicating which odds ratio formula to use
            - possible values:
                * 0:    (# case hits / # total cases) / (# control hits / # total control)
                * 1:    (# case hits / # case non-hits) / (# control hits / # control non-hits)
                * 2:    (# A cases / # A controls) / (# B cases / # B controls)
                        *assuming possible genotypes are AA/AB/BB*
                * 3:    odds ratio calculated using logistic regression (allows for covariates to be included)
        * SKIP
            - string used to replace unknown possible genotype data (optional; default is "<DEL>")
        * PATH
            - path to input files
        * PARTIAL
            - binary value indicating whether or not to create output files as the pipeline runs
            - 0 indicates creating output files at the very end, 1 indicates creating output files on the go
        * CHECK
            - binary value indicating whether or not to check the start indices of the genotype input files
            - 0 indicates no format checking, 1 indicates format checking will be performed
        * SAMPLE
            - only needs to be specified when CC or TRANS_CC pipeline runs logistic regression (OR_CALC = 3), otherwise omit
            - valid filenames for case and control sample files, separated by a comma, e.g., "SAMPLE	fake_case_sample.txt,fake_control_sample.txt"
        * COVS
            - valid path to a file listing the column titles in the sample files that should be used as covariates in logistic regression (OR_CALC = 3) (optional)
            - e.g., a sample file could contain columns for "age", "sex", "height", "weight", etc.
            - a valid COVS file would contain a single column of whichever of these identifiers should be included in the logistic regression
        * CUTOFF
            - high and low cutoffs for converting from triplicate genotype encoding to letter encoding (optional; default is .9, .3)
            - a triplet is valid only if one element is >= high and all others are < low
            - specify the high and low decimal values separated by a comma
                - e.g., "CUTOFF	.9,.3" has a high cutoff of .9 and a low cutoff of .3
        * NA
            - string identifier to be used when genotype for individual is unknown (optional; default is "NA")
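As an illustration of the format above, the following sketch parses tab-delimited `KEYWORD VALUE` lines into a dictionary and then uses the `OR_CALC` value to select between odds ratio formulas 0 and 1. The helper names `parse_init` and `odds_ratio`, the sample init text, and the counts are all invented for this example, and a "hit" is assumed to mean a subject carrying the genotype combination of interest. Formulas 2 and 3 are left out because they need more than the four counts used here.

```python
from typing import Dict


def parse_init(text: str) -> Dict[str, str]:
    """Parse KEYWORD<TAB>VALUE lines into a dict of strings."""
    params = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        key, sep, value = line.partition("\t")
        if not sep:
            raise ValueError(f"line {lineno}: expected KEYWORD<TAB>VALUE")
        params[key.strip()] = value.strip()
    return params


def odds_ratio(case_hits: int, case_total: int,
               control_hits: int, control_total: int,
               or_calc: int) -> float:
    """Compute the ratio for OR_CALC 0 or 1 from hit and total counts."""
    if or_calc == 0:
        # (# case hits / # total cases) / (# control hits / # total control)
        return (case_hits / case_total) / (control_hits / control_total)
    if or_calc == 1:
        # (# case hits / # case non-hits) / (# control hits / # control non-hits)
        case_odds = case_hits / (case_total - case_hits)
        control_odds = control_hits / (control_total - control_hits)
        return case_odds / control_odds
    raise NotImplementedError(f"OR_CALC = {or_calc} is not covered by this sketch")


# Hypothetical init file contents (tab between keyword and value).
init_text = "PIPE_TYPE\tCC\nITER\t10\nDELIM\tSPACE\nOR_CALC\t1\nCUTOFF\t.9,.3\n"
params = parse_init(init_text)

# CUTOFF holds two comma-separated decimals: high first, then low.
high, low = (float(x) for x in params["CUTOFF"].split(","))

# Invented counts: 30 of 100 cases and 10 of 100 controls are hits.
ratio = odds_ratio(30, 100, 10, 100, int(params["OR_CALC"]))
print(params["PIPE_TYPE"], high, low, ratio)
```

All values are kept as strings when parsing and converted only where they are used, so that optional keywords can simply be absent from the dictionary.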
   
   

---
# Output explanations

* **objs/**
 - a folder for caching data structures 