# Individual-level genotype-expression data preprocessing

Drafted by Jiarun Chen (Tsinghua), with input from Fabio and Gao (UChicago).


## Goal
This pipeline aims to extract individual-level genotype-expression data in preparation for cis-eQTL fine-mapping.

## Overview of methods
Specifically, we organize the genotype, expression, and covariate data gene by gene, i.e., one data file per gene. In addition, we compute the part of the expression data that is 'perpendicular' (orthogonal) to the covariates, i.e., the residuals left after regressing out the covariates. These residuals are later regressed on genotype, an idea borrowed from instrumental-variable analysis.

## Details of methods

Genotype (`X`): As preprocessing, obtain the TSS of every gene. When a gene is queried, take its TSS, extract the SNPs within a predefined flanking region around the TSS (set by `cis_window`, 1 Mb on each side by default), and convert their genotypes to additive 0/1/2 coding to form the `X` matrix.
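
To make the additive coding concrete, here is a minimal sketch in plain Python. The helper names `gt_to_dosage` and `in_cis_window` are hypothetical and not part of the pipeline, which reads genotypes with `cyvcf2`; the toy positions are for illustration only.

```python
def gt_to_dosage(gt, alt_index=1):
    """Count copies of one ALT allele in a VCF GT string such as '0|1' or '1/1'.

    Returns None if any allele call is missing ('.').
    """
    alleles = gt.replace("/", "|").split("|")
    if "." in alleles:
        return None
    return sum(int(a) == alt_index for a in alleles)


def in_cis_window(pos, tss, flank=1_000_000):
    """True if a variant position lies within `flank` bp of the TSS."""
    return tss - flank <= pos <= tss + flank


# Toy example: four samples at one bi-allelic SNP.
calls = ["0|0", "0|1", "1|1", "1|0"]
dosages = [gt_to_dosage(gt) for gt in calls]
print(dosages)  # [0, 1, 2, 1]
print(in_cis_window(pos=13526, tss=29553))  # True
```

Each row of `X` in the output file corresponds to one sample and each column to one SNP coded this way.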

Covariates (`Z`): These do not depend on the gene, so they are identical across files and are copied directly from the original covariate data.

Expression (`y`): Because `grep` is too slow, we build a `gene:line_number` index as preprocessing. For each gene, `line_number` records the line on which the gene appears in each of the tissue-specific expression files. Extracting the expression of a gene then only requires reading those lines from the files and combining them. Specifically:

1. extract the `gene_id` column (Ensembl gene IDs) from the 49 expression files with `pandas`
2. from the extracted column, record `gene:line_number` for each gene in each tissue
3. for each gene, read its row of expression values, together with the sample names, in each tissue
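
The three steps above can be sketched as follows. This is an illustrative stand-in that works on a list of text lines rather than on the gzipped files; `build_gene_index` and `fetch_expression` are hypothetical names, and the toy values are made up.

```python
def build_gene_index(lines):
    """Map each gene_id (4th column) to its 1-based data-line number.

    `lines[0]` is the header, so data line n is `lines[n]`.
    """
    return {line.split("\t")[3]: n for n, line in enumerate(lines[1:], start=1)}


def fetch_expression(lines, index, gene):
    """Return {sample: value} for one gene, or None if the gene is absent."""
    n = index.get(gene)
    if n is None:
        return None
    header = lines[0].rstrip("\n").split("\t")
    fields = lines[n].rstrip("\n").split("\t")
    if fields[3] != gene:
        raise ValueError(f"expected {gene}, got {fields[3]}")
    return dict(zip(header[4:], map(float, fields[4:])))


# Toy stand-in for one tissue's expression file (values are made up).
toy = [
    "#chr\tstart\tend\tgene_id\tS1\tS2\n",
    "chr1\t100\t101\tGENE_A\t0.5\t-1.25\n",
    "chr1\t200\t201\tGENE_B\t1.0\t0.0\n",
]
index = build_gene_index(toy)
expr_b = fetch_expression(toy, index, "GENE_B")
```

Building the index once per tissue means each later lookup is a direct jump to a known line instead of a scan of the whole file.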

Expression 'perpendicular' to covariates (`y_res`): The part of the expression data that is orthogonal to the covariates, obtained as the residuals from regressing `y` on `Z`.
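
As an illustration of what 'regressing `Z` out of `y`' means, the sketch below computes least-squares residuals in plain Python using Gram-Schmidt orthogonalization. It is a stand-in for the R call `.lm.fit` used in the pipeline, not the pipeline's own code; the function names and toy numbers are assumptions for the example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def residualize(y, covariates):
    """Return y minus its least-squares projection onto the covariate columns.

    Uses Gram-Schmidt: orthogonalize the covariates, then remove each
    orthogonal component from y. No intercept is added automatically.
    """
    basis = []
    for col in covariates:
        v = list(col)
        for b in basis:
            c = dot(v, b) / dot(b, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        if dot(v, v) > 1e-12:  # skip columns that are linearly dependent
            basis.append(v)
    r = list(y)
    for b in basis:
        c = dot(r, b) / dot(b, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    return r


# Toy data: an intercept column and one covariate (values are made up).
Z = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 2.0, 3.0]]
y = [1.0, 2.0, 2.0, 5.0]
y_res = residualize(y, Z)
```

By construction `y_res` has zero inner product with every column of `Z`, which is the sense in which it is 'perpendicular' to the covariates.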

# Documentation

## Input
Genotype data are in a single file; expression and covariate data are organized by tissue.

- genotype: .vcf.gz

```
    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  GTEX-1117F      GTEX-111CU      GTEX-111FC      GTEX-111VG      GTEX-111YS      GTEX-1122O
    chr1    13526   chr1_13526_C_T_b38      C       T       .       PASS    AN=1676;AF=0.000596659;AC=1     GT      0|0     0|0     0|0     0|0     0|0     0|0
```    

- expression: .bed.gz
```
    #chr    start   end     gene_id GTEX-1117F      GTEX-111CU      GTEX-111FC      GTEX-111VG      GTEX-111YS      GTEX-1122O      GTEX-1128S
    chr1    29552   29553   ENSG00000227232.5       1.3135329173730264      -0.9007944960568992     -0.29268956046586164    -0.7324113622418431     -0.27475411245874887    -0.6990255198601908     0.18188866123299216
```

- covariates: .txt
```
    ID      GTEX-1117F      GTEX-111CU      GTEX-111FC      GTEX-111VG      GTEX-111YS
    PC1     -0.0867  0.0107  0.0099  0.0144  0.0154
    PC2     -0.0132 -0.0026 -0.0050 -0.0081 -0.0093
```    

## Output
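One RDS file per gene, containing the genotypes in the gene's cis window (`X`), the expression of the gene in each tissue (`y`), the covariates for each tissue (`Z`), and the residualized expression (`y_res`).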


In [4]:
dat <- readRDS(file = 'ENSG00000284484.1.GTEx_V8.rds')
str(dat)

List of 4
 $ X    : chr [1:838, 1:40947] "0" "0" "0" "0" ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:838] "GTEX-1117F" "GTEX-111CU" "GTEX-111FC" "GTEX-111VG" ...
  .. ..$ : chr [1:40947] "chr16_74677011_G_C_b38" "chr16_74677177_G_A_b38" "chr16_74677272_C_T_b38" "chr16_74677278_C_G_b38" ...
 $ y    :List of 12
  ..$ Brain_Cerebellar_Hemisphere          : chr [1:175, 1] "0.4099833" "0.7860842" "-0.8056329" "0.2007308" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:175] "GTEX-11DYG" "GTEX-11DZ1" "GTEX-11EI6" "GTEX-11EMC" ...
  .. .. ..$ : NULL
  ..$ Brain_Cerebellum                     : chr [1:209, 1] "-0.08365173" "-1.0467" "0.8247319" "2.18935" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:209] "GTEX-111FC" "GTEX-1128S" "GTEX-117XS" "GTEX-1192X" ...
  .. .. ..$ : NULL
  ..$ Brain_Hypothalamus                   : chr [1:170, 1] "1.003148" "-0.1992013" "0.4960161" "-1.003148" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:

In [2]:
sos run GTEx_V8_preprocessing.ipynb -h

usage: sos run GTEx_V8_preprocessing.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  get_gene_meta
  preprocess
  get_gene_meta_biomaRt
  extract
  dap
  extract_midway

Global Workflow Options:
  --cwd output (as path)
  --expression-data  paths(glob('/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_expression_matrices/*.gz'))

  --genotype-data /project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/genotypes/WGS/variant_calls/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.vcf.gz (as path)
  --covariates-data  paths(glob('/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_An

## Example

**Edit the following bash variable so that it points to your working directory, then run it.**
```
work_dir=/scratch/midway2/gaow/GTExV8
```
Then run as follows:

```
cd $work_dir
sos run GTEx_V8_preprocessing.ipynb preprocess
sos run GTEx_V8_preprocessing.ipynb extract --analysis-ready-dir /project2/compbio/GTEx_eQTL/cis_eqtl_analysis_ready
sos run GTEx_V8_preprocessing.ipynb dap --analysis-ready-dir /project2/compbio/GTEx_eQTL/cis_eqtl_analysis_ready
```

We provide a template to run the extraction code on UChicago RCC midway. It can be adapted similarly for other cluster systems:

```
sos run GTEx_V8_preprocessing.ipynb extract_midway --gene-id-file output/ensembl_gene_id.txt \
                                    --analysis-ready-dir /project2/compbio/GTEx_eQTL/cis_eqtl_analysis_ready \
                                    --cluster-config /project2/compbio/GTEx_eQTL/cis_eqtl_analysis_ready/midway2.yml
```

## Minimal working example


To extract only a user-defined list of genes, e.g., the first 2 genes in the gene ID file:

```
cd $work_dir
sos run GTEx_V8_preprocessing.ipynb preprocess
head -2 output/ensembl_gene_id.txt > output/test.txt
sos run GTEx_V8_preprocessing.ipynb extract --gene-id-file output/test.txt
```

In [2]:
[global]
import os
from glob import glob
parameter: cwd = path('./output')
parameter: expression_data = paths(glob('/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_expression_matrices/*.gz'))
parameter: genotype_data = path('/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/genotypes/WGS/variant_calls/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.vcf.gz')
parameter: covariates_data = paths(glob('/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_covariates/*.txt'))
parameter: gene_id_file = path(f"{cwd:a}/ensembl_gene_id.txt")
parameter: gene_tss_file = path(f"{cwd:a}/ensembl_gene_tss.txt")
parameter: analysis_ready_dir = cwd

fail_if(len(expression_data) == 0, msg = "Cannot find expression data files. Please specify the right path via ``--expression-data``")
fail_if(len(covariates_data) == 0, msg = "Cannot find covariate data files. Please specify the right path via ``--covariates-data``")

## Preprocessing

In [None]:
# get gene-line mapping
[get_gene_meta_1]
input: expression_data, group_by = 1
output: f"{cwd:a}/{_input:bnn}.gene_line.json"
python3: expand = '${ }'
    import pandas as pd
    genes = pd.read_csv(${_input:ar}, sep='\t', compression='gzip', header=0, skip_blank_lines=False, usecols = [0,1,2,3])
    genes.to_csv("${_output:n}.tmp", sep = "\t", header = False, index=False)
    gene_linenum = dict([(x,y+1) for y, x in enumerate(genes['gene_id'].tolist())])
    import json
    with open(${_output:r}, 'w') as outfile:
        json.dump({${_input:r}:gene_linenum}, outfile)

# TSS for genes using provided TSS information
[get_gene_meta_2]
input: group_by = "all"
output: gene_tss_file, gene_id_file
bash: expand = '${ }', workdir = cwd
    cat ${paths([x.with_suffix(".tmp") for x in _input])} | sort -u > ${_output[0]}
    awk '{print $4}' ${_output[0]} > ${_output[1]}
    rm -f ${paths([x.with_suffix(".tmp") for x in _input])}

In [None]:
[preprocess]
sos_run('get_gene_meta')

### The reason why we don't use biomaRt to get TSS for genes

The following section is not used in the current project because biomaRt fails to find some Ensembl gene IDs. For instance, the gene ENSG00000284523.1 is listed as ENSG00000284523.2 in Ensembl, and the gene ENSG00000284552.1 has been retired. Many genes are affected by this problem.

In [None]:
# # TSS for genes by biomaRt
[get_gene_meta_biomaRt]
depends: R_library("biomaRt"), R_library("data.table")
input: gene_id_file
output: gene_tss_file
R: expand = '${ }'
    ensembl_gene_id <- data.table::fread(file = ${_input:r}, sep = "\n", quote = "", header = FALSE)
    mart <- biomaRt::useDataset("hsapiens_gene_ensembl", biomaRt::useMart("ensembl"))
    gene_TSS <- biomaRt::getBM(attributes = c("chromosome_name", "transcript_start", "ensembl_gene_id", "ensembl_gene_id_version"), filters = "ensembl_gene_id_version", values = ensembl_gene_id, mart = mart)
    write.table(x = gene_TSS, file = ${_output:r}, sep = '\t', quote = FALSE, col.names = TRUE, row.names = FALSE)
    # gene_TSS_retrieval <- read.table(file = "gene_TSS.csv", sep = '\t', quote = "", stringsAsFactors = FALSE, header = TRUE)
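
The version mismatch described above can be detected by comparing the stable part of each ID with its version suffix. The sketch below is only illustrative: `split_version` and `version_mismatches` are hypothetical helpers, and the `available` list stands in for whatever a lookup service would return.

```python
def split_version(gene_id):
    """Split 'ENSG00000284523.1' into ('ENSG00000284523', 1)."""
    stable_id, _, version = gene_id.partition(".")
    return stable_id, int(version) if version else None


def version_mismatches(requested, available):
    """List (requested_id, available_id) pairs sharing a stable ID but not a version."""
    current = dict(split_version(g) for g in available)
    out = []
    for gene_id in requested:
        stable_id, version = split_version(gene_id)
        if stable_id in current and current[stable_id] != version:
            out.append((gene_id, f"{stable_id}.{current[stable_id]}"))
    return out


requested = ["ENSG00000284523.1", "ENSG00000227232.5"]
available = ["ENSG00000284523.2", "ENSG00000227232.5"]  # hypothetical lookup result
mismatches = version_mismatches(requested, available)
print(mismatches)  # [('ENSG00000284523.1', 'ENSG00000284523.2')]
```

Retired IDs would simply be absent from `available` and need to be handled separately.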

## Check for tri-allelic variants before genotype data conversion

We check for tri-allelic (more generally, multi-allelic) variants by looking for commas in the `ALT` column of the VCF file:

In [None]:
zcat /project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/genotypes/WGS/variant_calls/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.vcf.gz | cut -f 5 | grep -v "#" | grep "," | wc -l

An output of zero means there are no tri-allelic sites. If there were tri-allelic or other multi-allelic variants, we would normalize the VCF file by splitting them into bi-allelic records, e.g.:

In [None]:
bcftools norm -m -any /project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/genotypes/WGS/variant_calls/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.vcf.gz | bgzip  > GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.left_norm.vcf.gz
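
The same multi-allelic check can be sketched in Python. This is an illustrative equivalent of the shell one-liner above, run here on a few toy lines; for a real file one would iterate over `gzip.open(path, "rt")` instead. The function name and the second record are made up.

```python
def count_multiallelic(vcf_lines):
    """Count VCF records whose ALT field (5th column) lists more than one allele."""
    n = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        alt = line.split("\t", 5)[4]
        if "," in alt:
            n += 1
    return n


toy_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t13526\tchr1_13526_C_T_b38\tC\tT\t.\tPASS\t.\n",
    "chr1\t13600\t.\tG\tA,C\t.\tPASS\t.\n",  # made-up multi-allelic record
]
n_multi = count_multiallelic(toy_vcf)
```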

## Data extraction

In [3]:
[extract]
depends: R_library("data.table"), R_library('rjson'), R_library('tidyverse'), Py_Module('cyvcf2'), Py_Module('pandas')
genes = open(gene_id_file).read().splitlines()
parameter: cis_window = 1000000
input: for_each = 'genes'
output: f"{analysis_ready_dir:a}/{_genes}.GTEx_V8.rds"
# each job uses 10 nodes, each node 4 cores in parallel each core using 2G memory; and jobs are created in batches of 40.
task: trunk_workers = [4] * 10, trunk_size = 40, walltime = '20m', mem = '2G', cores = 1, tags = f'{step_name}_{_output:bn}'
python: expand = '${ }', workdir = cwd
    from cyvcf2 import VCF
    import pandas as pd
    from sos.utils import get_output
    # exact match on the gene ID column; a regex match would treat "." as a wildcard
    cmd = "awk '$4 == \"${_genes}\" {print $0}' ${gene_tss_file:a}"
    coord = get_output(cmd).strip().split()
    vcf = VCF(${genotype_data:ar})
    res = []
    var_ids = []
    offset = 1
    build = "b38"
    for variant in vcf(f'{coord[0]}:{int(coord[1])-${cis_window}}-{int(coord[1])+${cis_window}}'):
        # we have checked that there is no tri-allelic case in this data but still we loop over alternative alleles just in case
        for i in range(len(variant.ALT)):
            var_id = f'{variant.CHROM}_{variant.start + offset}_{variant.REF}_{variant.ALT[i]}_{build}'
            line = [x[:2].count(i+1) for x in variant.genotypes]
            if len(set(line)) == 1:
                # remove non-variant site
                continue
            var_ids.append(var_id)
            res.append(line)
    res = pd.DataFrame(res, columns = vcf.samples, index=var_ids)
    res.to_csv("${_genes}_genotype.txt.gz")

R: expand = "${ }", workdir = cwd
    # X
    suppressMessages(library(tidyverse))
    X = as.matrix(data.table::fread("${_genes}_genotype.txt.gz", sep=',', header=TRUE)  %>% remove_rownames %>% column_to_rownames(var="V1"))
    # Z
    Z <- lapply(c(${covariates_data:ar,}), function(x)  as.matrix(t(data.table::fread(x, header=TRUE)  %>% remove_rownames %>% column_to_rownames(var="ID"))))
    for (i in 1:length(Z)) {
        names(Z)[i] <- strsplit(x = c(${covariates_data:abr,})[i], split = "[.]")[[1]][1] # add tissue names, so as to match with those of y
    }
    # the code below was written by Jiarun Chen
    # y
    filenames_json <- list.files(path = ${cwd:ar}, pattern = "*.json$", full.names = TRUE)
    line_numbers <- lapply(filenames_json, function(x) rjson::fromJSON(file=x))
  
    extract_y <- function(y_file, line, gene = "${_genes}") {
        if (is.null(line)) return(NULL)
        # a sed version here
        # cmd <- paste0("zcat ", y_file, " | sed '", line, "q;d'")
        # yi <- data.table::fread(cmd=cmd)
        yi <- data.table::fread(file = y_file, skip = line, nrows = 1)
        samplenames_yi <- data.table::fread(file = y_file, skip = 0, nrows = 1)
        colnames(yi) <- colnames(samplenames_yi)
        yi <- t(as.matrix(yi))
        if (yi[4,1] != gene) stop(paste("Wrong gene extracted! Expect", gene, "got", yi[4,1]))
        yi <- yi[-1:-4, , drop = FALSE]
        # colnames(yi) <- file_path_sans_ext(file_path_sans_ext(basename(y_file))) # y_file:bnn, add tissue names
        return(yi)
    }
    
    y <- lapply(line_numbers, function(x) extract_y(names(x)[1], x[[1]][["${_genes}"]]))
    for (i in 1:length(y)) {
        names(y)[i] <- strsplit(x = basename(names(line_numbers[[i]])), split = '[.]')[[1]][1] # add tissue names, so as to match with those of Z
    }
    y[sapply(y, is.null)] <- NULL
    
    ## tissue names matching between y and Z (y as reference)
    
    tissuenamesmatching <- function(x, reference) {
        x[match(names(reference), names(x))] # match list names
    }
    
    Z <- tissuenamesmatching(x = Z, reference = y)
    
    ## sample names matching between y and Z (Z as reference)
    
    samplenamesmatching <- function(x, reference) {
        lapply(1:length(reference), function(i) x[[i]][match(rownames(reference[[i]]), rownames(x[[i]])), , drop = FALSE]) # match sample names within each dataframe
    }
    
    y <- samplenamesmatching(x = y, reference = Z)
    for (i in 1:length(y)) {
        names(y)[i] <- names(Z)[i] # the tissue names are lost during samplenamesmatching, it is therefore needed to retrieve the tissue names
        # An alternative version
        # names(y)[i] <- strsplit(x = basename(names(line_numbers[[i]])), split = '[.]')[[1]][1] # the tissue names are lost during samplenamesmatching, it is therefore needed to retrieve the tissue names
    }
    
    # y_res
    y_res <- lapply(1:length(Z), function(i) .lm.fit(x = Z[[i]], y = y[[i]])$residuals)
    for (i in 1:length(y_res)) {
        names(y_res)[i] <- names(Z)[i] # add tissue names of y_res
    }
 
    # create a y_res matrix that matches X
    # the code below was written by Fabio

    options(stringsAsFactors=F)
    ###If the gene has expression values in multiple tissues
    if(length(y_res)>1){
      ###Loop through tissues and recursively join them
      for(i in 2:length(y_res)){
        if(i==2){
          df_a <- data.frame(id=rownames(y_res[[i-1]]), y_res[[i-1]])
          colnames(df_a)[2] <- names(y_res)[[i-1]]
          df_b <- data.frame(id=rownames(y_res[[i]]), y_res[[i]])
          colnames(df_b)[2] <- names(y_res)[[i]]
          Y_df <- dplyr::full_join(df_a, df_b, by="id")
        } else {
          df_b <- data.frame(id=rownames(y_res[[i]]), y_res[[i]])
          colnames(df_b)[2] <- names(y_res)[[i]]
          Y_df <- dplyr::full_join(Y_df, df_b, by="id")
        }
      }

      ###Assign row names as ID to Y_df and drop id column
      rownames(Y_df) <- Y_df[, 1]
      Y_df <- Y_df[, -1]

      ###If the tissue data contains the same individuals as the genotype data
      if(nrow(Y_df)==ncol(X)){
        ###Order the rows (ID) of the joined data according to the columns (ID) of the genotype matrix 
        X_names <- colnames(X)
        Y_mat <- as.matrix(Y_df[X_names, ])
      } else if(nrow(Y_df)<ncol(X)){ ###If the tissue data contains fewer individuals than the genotype data
        ###Compute the individuals in common between the tissue data and the genotype data
        X_names <- colnames(X)
        Y_names <- rownames(Y_df)
        in_common <- base::intersect(X_names, Y_names)

        ###Extract from the genotype data only the individuals in common between tissue data and the genotype data, and order tissue data according to the genotype data
        X <- X[, which(colnames(X) %in% in_common)]
        Y_mat <- as.matrix(Y_df[colnames(X), ])
      } else {
        stop("Error: There is a problem with IDs")
      } 
    } else { ###If the gene has expression values in only one tissue
      ###Extract y_res
      Y_df <- data.frame(y_res[[1]])

      ###If the tissue data contains the same individuals as the genotype data
      if(nrow(Y_df)==ncol(X)){
        ###Order the rows (ID) of the joined data according to the columns (ID) of the genotype matrix 
        X_names <- colnames(X)
        Y_mat <- as.matrix(Y_df[X_names, ])
        rownames(Y_mat) <- X_names
        colnames(Y_mat)[1] <- names(y_res)[1]
      } else if(nrow(Y_df)<ncol(X)){ ###If the tissue data contains fewer individuals than the genotype data
        ###Compute the individuals in common between the tissue data and the genotype data
        X_names <- colnames(X)
        Y_names <- rownames(Y_df)
        in_common <- base::intersect(X_names, Y_names)

        ###Extract from the genotype data only the individuals in common between tissue data and the genotype data, and order tissue data according to the genotype data
        X <- X[, which(colnames(X) %in% in_common)]
        Y_mat <- as.matrix(Y_df[colnames(X), ])
        rownames(Y_mat) <- colnames(X)
        colnames(Y_mat)[1] <- names(y_res)[1]
      } else {
          stop("Error: There is a problem with IDs")
      } 
    }

    # save
    saveRDS(object = list(X = t(X), y = y, Z = Z, y_res = Y_mat), file = ${_output:ar})

bash: expand = '${ }', workdir = cwd
    # remove intermediate files
    rm -f ${_genes}_genotype.txt.gz
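
The final step of the R code above, building a `y_res` matrix whose rows match the genotype samples, amounts to a full join across tissues followed by reordering. A minimal Python sketch of that idea, with a hypothetical `join_tissues` helper and made-up values:

```python
def join_tissues(y_res, genotype_samples):
    """Arrange per-tissue residuals into a sample-by-tissue table.

    Rows follow the genotype sample order; samples absent from every tissue
    are dropped, and a sample missing in one tissue gets None there.
    """
    tissues = list(y_res)
    samples = [s for s in genotype_samples if any(s in y_res[t] for t in tissues)]
    table = [[y_res[t].get(s) for t in tissues] for s in samples]
    return samples, tissues, table


# Toy residuals keyed by tissue, then by sample (values are made up).
y_res = {
    "Brain_Cerebellum": {"S1": 0.2, "S3": -0.4},
    "Brain_Hypothalamus": {"S2": 1.1, "S3": 0.3},
}
samples, tissues, table = join_tissues(y_res, ["S1", "S2", "S3", "S4"])
```

Here `S4` has genotype data but no expression in any tissue, so it is dropped, mirroring how the R code subsets `X` to the individuals in common.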

## Fine-mapping with DAP

Currently the `dap` step loops over all phenotypes (tissues) available for a gene and runs DAP-G once per tissue.

In [None]:
[dap]
# Extra arguments to pass to DAP
parameter: args = ''
depends: executable('dap-g'), Py_Module('pandas'), Py_Module('numpy'), R_library('data.table')
genes = open(gene_id_file).read().splitlines()
input:  [f"{analysis_ready_dir:a}/{g}.GTEx_V8.rds" for g in genes if os.path.isfile(f"{analysis_ready_dir:a}/{g}.GTEx_V8.rds")], group_by = 1
output: f'{cwd:a}/dap_output/{_input:bn}.DAP_G.pkl'

task: trunk_workers = [2] * 10, trunk_size = 40, walltime = '2h', mem = '6G', cores = 1, tags = f'{step_name}_{_output:bn}'

R: expand = '${ }', stderr = f'{_output:n}.stderr', workdir = f'{_output:ad}'

    remove_missing = function(x, y) {
      return(list(x=x[not_missing,],y=y[not_missing]))
    }
  
    write_data = function(x,y,prefix,group) {
      geno = cbind('geno', rownames(x), group, x)
      colnames(geno) = rownames(geno) = NULL
      pheno = c('pheno', ${_input:bnnr}, group, y)
      dat = data.frame(rbind(pheno, geno))
      data.table::fwrite(dat, paste0(prefix, ".", group, ".data"), sep='\t', row.names = F, col.names = F, verbose=F)
    }

    # Prepare DAP-G input
    data = readRDS(${_input:r})
    prefix = "${_output:nn}"
    for (group in colnames(data$y_res)) {
      not_missing = !is.na(data$y_res[,group])
      X = data$X[not_missing,]
      y = data$y_res[,group][not_missing]
      write_data(t(X), y, prefix, group)
    }

python: expand = '${ }', stderr = f'{_output:n}.stderr', workdir = f'{_output:ad}'
    # Run DAP-G and organize output
    import sys
    import subprocess
    import pandas as pd
    import numpy as np

    def run_dap_full(data_file, args):
        cmd = ['dap-g', '-d', data_file, '-o', f'{data_file:n}.result', '--output_all'] + ' '.join(args).split()
        subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()    
        # remove the temp data file but keep the output in text format
        subprocess.Popen(['rm', '-f', data_file])

    def extract_dap_output(data_file):
        out = [x.strip().split() for x in open(f'{data_file:n}.result').readlines()]
        pips = []
        clusters = []
        still_pip = True
        for line in out:
            if len(line) == 0:
                continue
            if len(line) > 2 and line[2] == 'cluster_pip':
                still_pip = False
                continue
            if still_pip and (not line[0].startswith('((')):
                continue
            if still_pip:
                pips.append([line[1], float(line[2]), float(line[3]), int(line[4])])
            else:
                clusters.append([len(clusters) + 1, float(line[2]), float(line[3])])
        pips = pd.DataFrame(pips, columns = ['snp', 'snp_prob', 'snp_log10bf', 'cluster'])
        clusters = pd.DataFrame(clusters, columns = ['cluster', 'cluster_prob', 'cluster_avg_r2'])
        clusters = pd.merge(clusters, pips.groupby(['cluster'])['snp'].apply(','.join).reset_index(), on = 'cluster')
        return {'snp': pips, 'set': clusters}


    def dap_single(data_file, args):
        run_dap_full(data_file,args)
        return extract_dap_output(data_file)

    def dap_batch(files, *args):
        return dict([(f'{x:bn}', dap_single(x, args)) for x in files])

    import glob
    import pickle
    from sos.targets import paths
    data_files = paths(glob.glob("${_output:nn}.*.data"))
    output = dap_batch(data_files, '${args}')

    with open(${_output:r}, 'wb') as handle:
        pickle.dump(output, handle, protocol=pickle.HIGHEST_PROTOCOL)
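
For reference, the per-tissue `.data` file written by the R block above consists of one `pheno` row followed by one `geno` row per SNP, restricted to samples whose phenotype is not missing. The sketch below mirrors that layout as inferred from the R code (it is not taken from the DAP-G documentation); `dap_rows` and the toy inputs are hypothetical.

```python
def dap_rows(gene, group, y, X):
    """Build the rows of a per-tissue data table: one 'pheno' row, then one
    'geno' row per SNP, keeping only samples with a non-missing phenotype.

    y: list of phenotype values (None = missing), one per sample
    X: {snp_id: [genotype per sample]} in the same sample order
    """
    keep = [i for i, v in enumerate(y) if v is not None]
    rows = [["pheno", gene, group] + [y[i] for i in keep]]
    for snp, g in X.items():
        rows.append(["geno", snp, group] + [g[i] for i in keep])
    return rows


rows = dap_rows(
    gene="GENE_A", group="Brain_Cerebellum",
    y=[0.2, None, -0.4],
    X={"chr1_13526_C_T_b38": [0, 1, 2]},
)
text = "\n".join("\t".join(map(str, r)) for r in rows)
```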

## Submit jobs on RCC

In [None]:
[extract_midway]
parameter: wf_file = path("GTEx_V8_preprocessing.ipynb")
parameter: cluster_config = path
input: wf_file, cluster_config

script: interpreter = 'sbatch', expand = True
#!/bin/bash
  
#SBATCH --time=96:00:00
#SBATCH --partition=mstephens
#SBATCH --account=pi-mstephens
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2000
#SBATCH --job-name={step_name}
#SBATCH --mail-type=BEGIN,END,FAIL

sos run {_input[0]} extract --gene-id-file {gene_id_file} \
                                    --analysis-ready-dir {analysis_ready_dir} \
                                    -c {_input[1]} -q midway2