# Genotype PLINK file quality control

This step splits a PLINK genotype file into one file set per chromosome. TensorQTL applications require this per-chromosome layout.

## Input

1. `protocol_example.genotype.chr21_22.bed` (PLINK binary genotype file; the matching `.bim` and `.fam` files must sit alongside it)

## Output

1. `protocol_example.genotype.chr21_22.21.bed`
2. `protocol_example.genotype.chr21_22.22.bed`
3. `protocol_example.genotype.chr21_22.genotype_by_chrom_files.txt`

## Minimal Working Example

The proteomics data used throughout this MWE can be found on [Synapse](https://www.synapse.org/#!Synapse:syn52369482).

### Step 1: Genotype data partition by chromosome
Timing: < 1 minute

In [None]:
sos run pipeline/genotype_formatting.ipynb genotype_by_chrom \
    --genoFile input/protocol_example.genotype.chr21_22.bed \
    --cwd output \
    --chrom `cut -f 1 input/protocol_example.genotype.chr21_22.bim | uniq | sed "s/chr//g"` \
    --container singularity/bioinfo.sif 

```
INFO: Running genotype_by_chrom_1:
INFO: genotype_by_chrom_1 (index=0) is completed.
INFO: genotype_by_chrom_1 (index=1) is completed.
INFO: genotype_by_chrom_1 output:   /Users/alexmccreight/xqtl-pipeline-new/output/protocol_example.genotype.chr21_22.21.bed /Users/alexmccreight/xqtl-pipeline-new/output/protocol_example.genotype.chr21_22.22.bed in 2 groups
INFO: Running genotype_by_chrom_2:
INFO: genotype_by_chrom_2 is completed (pending nested workflow).
INFO: Running write_data_list:
INFO: write_data_list is completed.
INFO: write_data_list output:   /Users/alexmccreight/xqtl-pipeline-new/output/protocol_example.genotype.chr21_22.genotype_by_chrom_files.txt
INFO: genotype_by_chrom_2 output:   /Users/alexmccreight/xqtl-pipeline-new/output/protocol_example.genotype.chr21_22.genotype_by_chrom_files.txt
INFO: Workflow genotype_by_chrom (ID=wee2aaca94d6f6708) is executed successfully with 3 completed steps and 4 completed substeps.
```
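
The `--chrom` argument in the command above is built by taking the first column of the `.bim` file, dropping repeated labels, and removing the `chr` text. The following is a minimal Python sketch of that step, not part of the pipeline. The function name `chromosomes_from_bim` and the example rows are made up for illustration. Unlike `sed "s/chr//g"`, the sketch removes only a leading `chr`.

```python
import io


def chromosomes_from_bim(lines):
    """Return chromosome labels in first-seen order, without a leading 'chr'.

    `lines` is any iterable of .bim rows; the chromosome is the first column.
    """
    seen = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom = fields[0]
        if chrom.startswith("chr"):
            chrom = chrom[len("chr"):]
        if chrom not in seen:
            seen.append(chrom)
    return seen


# Made-up .bim rows (chromosome, variant ID, cM, bp, allele 1, allele 2).
example_bim = io.StringIO(
    "chr21\trs1\t0\t100\tA\tG\n"
    "chr21\trs2\t0\t200\tC\tT\n"
    "chr22\trs3\t0\t150\tG\tA\n"
)
chroms = chromosomes_from_bim(example_bim)
print(chroms)
```

For a real file, pass an open file handle, for example `chromosomes_from_bim(open("input/protocol_example.genotype.chr21_22.bim"))`. The workflow itself de-duplicates the list again with `list(set(chrom))`.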

## Command interface

In [None]:
sos run genotype_formatting.ipynb -h

In [None]:
usage: sos run genotype_formatting.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  plink_to_vcf
  vcf_to_plink
  genotype_by_region
  ld_by_region_plink
  genotype_by_chrom
  write_data_list
  merge_plink
  merge_vcf

Global Workflow Options:
  --cwd output (as path)
                        Work directory & output directory
  --container ''
                        The filename name for containers
  --entrypoint ('micromamba run -a "" -n' + ' ' + container.split('/')[-1][:-4]) if container.endswith('.sif') else ""

  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 3G
                        Memory expected
  --numThreads 20 (as int)
                        Number of threads
  --genoFile  paths

                        the path to a bed file or VCF file, a vector of bed
                        files or VCF files, or a text file listing the bed files
                        or VCF files to process
  --[no-]keep-allele-order (default to False)
                        Do not keep allele order by default -- let PLINK decide
                        about major and minor alleles

Sections
  plink_to_vcf_1:
  vcf_to_plink:
    Workflow Options:
      --[no-]remove-duplicates (default to False)
      --remove-samples . (as path)
                        The path to the file that contains the list of samples
                        to remove (format FID, IID)
      --keep-samples . (as path)
                        The path to the file that contains the list of samples
                        to keep (format FID, IID)
  genotype_by_region_1:
    Workflow Options:
      --window 0 (as int)
                        cis window size
      --region-list VAL (as path, required)
                        Region definition
  ld_by_region_plink_1:
    Workflow Options:
      --region-list VAL (as path, required)
                        Region definition
      --float-type 16 (as int)
  genotype_by_chrom_1:
    Workflow Options:
      --chrom VAL VAL ... (as type, required)
  genotype_by_chrom_2:
  plink_to_vcf_2:
  genotype_by_region_2:
  ld_by_region_*_2:
    Workflow Options:
      --region-list VAL (as path, required)
  write_data_list:
    Workflow Options:
      --out VAL (as path, required)
      --ext VAL (as str, required)
      --data-files  paths

  merge_plink:
    Workflow Options:
      --name VAL (as str, required)
                        File prefix for the analysis output
      --keep-samples . (as path)
                        The path to the file that contains the list of samples
                        to keep (format FID, IID)
  merge_vcf:
    Workflow Options:
      --name VAL (as str, required)
                        File prefix for the analysis output

### Step 1: Genotype data partition by chromosome

In [None]:
[genotype_by_chrom_1]
stop_if(len(paths(genoFile))>1, msg = "This workflow expects one input genotype file.")
parameter: chrom = list
chrom = list(set(chrom))
input: genoFile, for_each = "chrom"
output: f'{cwd}/{_input:bn}.{_chrom}.bed'
# Split the genotype file by chromosome with PLINK
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, volumes = [f'{genoFile:ad}:{genoFile:ad}'], entrypoint=entrypoint
    ##### Extract the genotypes for chromosome $[_chrom]
    plink --bfile $[_input:an] \
    --make-bed \
    --out $[_output[0]:n] \
    --chr $[_chrom] \
    --threads $[numThreads] \
    --memory $[int(expand_size(mem) * 0.9)/1e06] \
    --allow-no-sex  $["--keep-allele-order" if keep_allele_order else ""]
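
To make the substitutions in the `bash` block above easier to follow, here is a rough Python sketch of how the PLINK argument list could be assembled for one chromosome. It is an illustration, not the workflow's implementation. The helper `plink_split_command` is hypothetical. The memory budget is passed directly in bytes (3,000,000,000 is an assumed stand-in for `3G`) rather than going through SoS's `expand_size`, and the result is rounded down to whole megabytes, which is an assumption of this sketch.

```python
def plink_split_command(bfile_prefix, out_prefix, chrom, threads=20,
                        mem_bytes=3_000_000_000, keep_allele_order=False):
    """Build the PLINK argument list that extracts one chromosome.

    Hypothetical helper: mirrors the flags used in the workflow step,
    giving PLINK 90% of the memory budget, expressed in megabytes.
    """
    memory_mb = mem_bytes * 9 // 10 // 1_000_000  # 90% of budget, whole MB
    cmd = [
        "plink",
        "--bfile", bfile_prefix,   # input prefix, without the .bed extension
        "--make-bed",
        "--out", out_prefix,       # output prefix, without the .bed extension
        "--chr", str(chrom),
        "--threads", str(threads),
        "--memory", str(memory_mb),
        "--allow-no-sex",
    ]
    if keep_allele_order:
        cmd.append("--keep-allele-order")
    return cmd


cmd = plink_split_command(
    "input/protocol_example.genotype.chr21_22",
    "output/protocol_example.genotype.chr21_22.21",
    "21",
)
print(" ".join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` when PLINK is on the `PATH`. The workflow instead runs the command inside the container given by `--container`.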

In [None]:
[genotype_by_chrom_2]
input: group_by = "all"
output: f'{_input[0]:nn}.{step_name[:-2]}_files.txt'
sos_run("write_data_list", data_files = _input, out = _output, ext = "bed")
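
The name of the list file follows from the `output:` expression in `genotype_by_chrom_2`. Assuming that `:nn` removes two trailing extensions and that `step_name[:-2]` drops the `_2` suffix, the following Python sketch reproduces that naming. The helpers `strip_extensions` and `list_file_name` are hypothetical and only illustrate the string handling.

```python
from pathlib import PurePosixPath


def strip_extensions(path, count):
    """Remove `count` trailing extensions from `path` (assumed meaning of ':nn' for count=2)."""
    p = PurePosixPath(path)
    for _ in range(count):
        p = p.with_suffix("")
    return str(p)


def list_file_name(first_bed, step_name):
    """Derive the list-file path from the first per-chromosome .bed file."""
    workflow = step_name[:-2]  # "genotype_by_chrom_2" -> "genotype_by_chrom"
    return f"{strip_extensions(first_bed, 2)}.{workflow}_files.txt"


name = list_file_name(
    "output/protocol_example.genotype.chr21_22.21.bed",
    "genotype_by_chrom_2",
)
print(name)
```

With the example inputs this yields `output/protocol_example.genotype.chr21_22.genotype_by_chrom_files.txt`, which matches the file name reported in the MWE log.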