# Molecular QTL workflow

In [6]:
%revisions -s -n 10

| Revision | Author | Date | Message |
|---|---|---|---|
| 63045aa | Gao Wang | 2018-07-05 | Change figure layout |
| 8c6e722 | Gao Wang | 2018-07-05 | Add cluster configurations |
| aeaf3fa | Gao Wang | 2018-07-05 | Fix susie execution pipeline step |
| 1420e4c | Gao Wang | 2018-07-05 | Use chromsomes as batches |
| 044e3a1 | Gao Wang | 2018-07-04 | Add data convertion pipeline for molecular qtls |


## Data

Molecular QTL data are from [Yang et al (2016) Science](http://eqtl.uchicago.edu/jointLCL/). The inputs are genotypes of ~100 YRI samples and molecular phenotypes measured in LCLs from the same samples.

- Alternative splicing (AS) data is of primary interest here.

### Genotypes

[Genotype data for YRI](http://eqtl.uchicago.edu/jointLCL/genotypesYRI.gen.txt.gz) is in conventional VCF format, but genotypes are stored as dosages.

### Phenotypes

Phenotype data has the following format:

```
#Chr	start	end	ID	18486	18487	18488	18489	18498	18499
chr1	880180	880422	chr1:880180:880422:clu_15502	0.201694364955	0.665990212763	-1.21881815589	-0.342480185427	0.165404160483	-1.58524292941
```

The first 4 columns give the genomic coordinates and ID of each unit. The remaining columns hold the molecular phenotype values, one column per sample.
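
The column layout described above can be sketched with the standard library only. This is a minimal illustration, not the parser the workflow uses; the function name `split_phenotype_table` is hypothetical, and treating `NA` or an empty field as missing is an assumption about how missing values are encoded.

```python
import csv
import io
import math

# The two example lines shown above, tab-separated.
EXAMPLE = (
    "#Chr\tstart\tend\tID\t18486\t18487\t18488\t18489\t18498\t18499\n"
    "chr1\t880180\t880422\tchr1:880180:880422:clu_15502\t"
    "0.201694364955\t0.665990212763\t-1.21881815589\t"
    "-0.342480185427\t0.165404160483\t-1.58524292941\n"
)

def to_float(field):
    # Assumption: missing phenotype values appear as "NA" or an empty field.
    return math.nan if field in ("NA", "") else float(field)

def split_phenotype_table(handle):
    """Return sample IDs and a list of (coordinates, per-sample values) per unit."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[4:]          # everything after Chr/start/end/ID
    units = []
    for row in reader:
        coords = {"chr": row[0], "start": int(row[1]),
                  "end": int(row[2]), "ID": row[3]}
        values = dict(zip(sample_ids, (to_float(v) for v in row[4:])))
        units.append((coords, values))
    return sample_ids, units

sample_ids, units = split_phenotype_table(io.StringIO(EXAMPLE))
print(sample_ids)
print(units[0][0])
```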

We analyze:

1. [Alternative splicing](http://eqtl.uchicago.edu/jointLCL/qqnorm_ASintron_RNAseqGeuvadis.txt)

## Analysis plan

- For each analysis unit (gene, or intron cluster for AS), get the variants within 1Mb up/downstream in the genotype data
- Remove the top phenotype PCs from the phenotype data
- Fine-mapping using various methods. SuSiE for starters
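
The variant window in the first step can be sketched as follows. It mirrors the `start = max(site - max_dist, 0)` and `end = site + max_dist` lines of the `preprocess_2` step; the function name `cis_window` is hypothetical and used only for illustration.

```python
def cis_window(site, max_dist=1_000_000):
    """Return (start, end) of the window around `site`, clipped at position 0."""
    start = max(site - max_dist, 0)
    end = site + max_dist
    return start, end

# A unit near the chromosome start gets a truncated upstream window.
print(cis_window(880180))
# A unit far from the start gets the full 2 * max_dist span.
print(cis_window(5_000_000))
```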

## Workflow overview

In [7]:
!sos run 20180704_MolecularQTL_Workflow.ipynb -h

usage: sos run 20180704_MolecularQTL_Workflow.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  preprocess
  index_vcf
  SuSiE
  SuSiE_Summary

Global Workflow Options:
  --cwd /home/gaow/GIT/LargeFiles/AS_output (as path)
                        Specify work directory
  --x-data . (as path)
                        X data, the genotype VCF file path
  --y-data . (as path)
                        Y data, the phenotype file paths
  --max-dist 1000000 (as int)
                        Maximum distance to site of interest, eg. 1Mb
                        up/downstream to TSS for gene level QTL

Sections
  preprocess_1:         PCA on phenotype and remove top PCs
    Work

In [None]:
[global]
# Specify work directory
parameter: cwd = path("~/GIT/LargeFiles/AS_output")
# X data, the genotype VCF file path
parameter: x_data = path()
# Y data, the phenotype file paths
parameter: y_data = path()
# Maximum distance to site of interest, eg. 1Mb up/downstream to TSS for gene level QTL
parameter: max_dist = 1000000
# `h5_data` is not defined as a parameter in this notebook, so only check the files that are
fail_if(not x_data.is_file(), msg = 'Please provide ``--x-data``!')
fail_if(not y_data.is_file(), msg = 'Please provide ``--y-data``!')
pop = 'YRI'

The preprocessing pipeline can be executed locally and takes about 2 hours:

```
sos run analysis/20180704_MolecularQTL_Workflow.ipynb preprocess \
    --x-data ~/GIT/LargeFiles/AS/genotypesYRI.gen.txt.gz \
    --y-data ~/GIT/LargeFiles/AS/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.txt.gz \
    --max-dist 1000000
```

## Regressing out top PCs on phenotype

There may well be better approaches to control for covariates etc., but [here](https://github.com/bmvdgeijn/WASP/blob/master/examples/example_data/H3K27ac/get_PCs.R) is the workflow from the authors, which was deemed sufficient. See the supplemental table of Yang et al 2016 for how many PCs to use for each molecular QTL.

Missing phenotype data needs to be handled here. See the `na.omit` call passed to `prcomp` and the `na.action=na.exclude` argument of `lm`.
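
The effect of `na.action=na.exclude` can be sketched in plain Python: the regression is fitted on the observed values only, and the residual vector is padded with missing values so that it keeps one entry per sample. This is a simplified illustration with a single covariate and an intercept, whereas the workflow regresses on `num_pcs` covariates; the function name `residuals_na_exclude` and the numbers are made up.

```python
import math

def residuals_na_exclude(y, x):
    """Residuals of y ~ 1 + x fitted on observed y; NaN where y is missing."""
    keep = [i for i, v in enumerate(y) if not math.isnan(v)]
    n = len(keep)
    mean_x = sum(x[i] for i in keep) / n
    mean_y = sum(y[i] for i in keep) / n
    sxx = sum((x[i] - mean_x) ** 2 for i in keep)
    sxy = sum((x[i] - mean_x) * (y[i] - mean_y) for i in keep)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    out = [math.nan] * len(y)          # same length as the input
    for i in keep:
        out[i] = y[i] - (intercept + slope * x[i])
    return out

# Hypothetical covariate (e.g. one PC) and a phenotype with one missing sample.
pc1 = [-1.0, 0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, math.nan, 7.0, 9.0]     # equals 3 + 2 * pc1 where observed
res = residuals_na_exclude(y, pc1)
print(res)
```

Keeping the output aligned with the input is what allows the per-unit residuals to be stacked back into a matrix with the original sample columns.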

In [None]:
# PCA on phenotype and remove top PCs
[preprocess_1 (Remove top phenotype PC)]
# Num. PC to remove from phenotype
parameter: num_pcs = 3 # Table S2 of NIHMS835311-supplement-supplement.pdf
# column name pattern for `grep` in R to select phenotype columns
# eg. "^NA[0-9]+" to extract sample names
parameter: colname_pattern = '^[0-9]+'
input: y_data
output: f"{cwd}/{_input:bn}.PC{num_pcs}.removed.gz"
R: expand = "${ }", workdir = cwd, stdout = f"{_output:n}.stdout"
    num_pcs = ${num_pcs}
    dat <- read.table(${_input:r}, header=T, comment.char='', check.names=F)
    phenotype.matrix <- dat[,5:ncol(dat)]
    # extract columns of interest
    phenotype.matrix <- phenotype.matrix[,grep("${colname_pattern}", colnames(phenotype.matrix), value = T)]
    # perform principal component analysis
    pca <- prcomp(na.omit(phenotype.matrix))
    # PCA summary
    print(summary(pca))
    cat("output", num_pcs, "PCs \n")
    # remove top PC from phenotype; takes a while
    cov_pcs <- pca$rotation[, 1:num_pcs]
    new.phenotype.matrix <- do.call(rbind, lapply(1:nrow(phenotype.matrix), function(i) residuals(lm(t(phenotype.matrix[i,]) ~ as.matrix(cov_pcs), na.action=na.exclude))))
    colnames(new.phenotype.matrix) <- colnames(phenotype.matrix)
    new.dat <- cbind(dat[,1:4], new.phenotype.matrix)
    colnames(new.dat)[1] <- 'chr'
    write.table(new.dat, gzfile(${_output:r}), sep="\t", quote=F, col.names=T, row.names=F)

## Extract per unit variables

In [None]:
# this step provides VCF file index
[index_vcf: provides = '{filename}.gz.tbi']
depends: executable('tabix')
input: f"{filename}.gz"
bash: expand=True
   tabix -p vcf {_input}

# Extract cis-SNPs and make fine-mapping datasets
[preprocess_2 (Get per-unit dataset)]
depends: Py_Module('pysam'), Py_Module('pandas'), Py_Module('feather'), Py_Module('rpy2'), f"{x_data}.tbi"
chroms = [f'chr{x+1}' for x in range(22)]
input: for_each = 'chroms', concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/chr*/*.rds'))
python: workdir = cwd, expand = "${ }"
    def read_header(gzfile):
        import gzip
        with gzip.open(gzfile,'r') as f:
            for line in f:
                res = [x.decode() for x in line.split()]
                break
        return res

    chrom = "${_chroms}"
    phenotype_id = [f'${pop}_{x}' for x in read_header(${_input:r})[4:]]
    vcf_id = [f'${pop}_{x}' for x in read_header(${x_data:r})[9:]]
    from pathlib import Path
    import pysam
    tbx = pysam.TabixFile(${x_data:r})    
    import pandas as pd, numpy as np
    from feather import write_dataframe
    qts = pd.read_csv(${_input:r}, sep = '\t')
    qts = qts.loc[qts['chr'] == chrom]
    #
    import os, time, tempfile
    start_time = time.time()
    i = 0
    cmds = ["library(feather)"]
    tf = tempfile.NamedTemporaryFile()
    for site in sorted(set(qts['start'].tolist())):
        if(i % 100 == 0):
            print('[chrom %s percent completed] %.1f (%.1f sec elapsed)' % (chrom, (float(i+1)/qts.shape[0])*100, time.time() - start_time))
            if len(cmds) > 1:
                with open(tf.name, 'w') as f:
                    f.write('\n'.join(cmds))
                os.system(f"Rscript {tf.name}")
                cmds = ["library(feather)"]            
        unit = qts.loc[qts['start'] == site]
        i += unit.shape[0]
        start = max(site - ${max_dist}, 0)
        end = site + ${max_dist}
        genotypes = np.array([row for row in tbx.fetch(chrom, start, end, parser=pysam.asTuple())])
        if len(genotypes) == 0:
            continue
        Y_data = unit.drop(["chr", "start", "end", "ID"], axis=1).T
        Y_data.columns = [x.replace(':', '_') for x in unit['ID']]
        Y_data.index = phenotype_id
        X_data = pd.DataFrame(genotypes[:,9:].T,
                              columns = ['_'.join(x) for x in genotypes[:,[2,0,1,3,4]]], 
                              index = vcf_id)
        merged = Y_data.join(X_data, how='inner').astype(np.float32)
        Path(f'${cwd}/${y_data:bnn}_${int(max_dist/1000)}Kb/{chrom}').mkdir(exist_ok=True, parents=True)
        basename = f'${cwd}/${y_data:bnn}_${int(max_dist/1000)}Kb/{chrom}/{chrom}_{site}_{max(unit["end"].tolist())}'
        write_dataframe(merged, basename + '.feather')
        cmds.append("saveRDS(read_feather('{0}'), '{1}');system('rm -f {0}')".format(basename + '.feather', basename + '.rds'))
    #
    if len(cmds) > 1:
        with open(tf.name, 'w') as f:
            f.write('\n'.join(cmds))
        os.system(f"Rscript {tf.name}")
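
The `Y_data.join(X_data, how='inner')` call in the step above keeps only samples that appear in both the phenotype and the genotype data. A minimal sketch of that inner join using plain dictionaries, assuming sample IDs are prefixed with the population label as in the step; the function name `inner_join_samples` and all values are hypothetical.

```python
def inner_join_samples(y_by_sample, x_by_sample):
    """Keep samples present in both tables; phenotype values first, then genotypes."""
    shared = [s for s in y_by_sample if s in x_by_sample]
    return {s: y_by_sample[s] + x_by_sample[s] for s in shared}

pop = "YRI"
# Made-up phenotype values for three samples (one unit).
y = {f"{pop}_{s}": [v] for s, v in
     [("18486", 0.2), ("18487", 0.67), ("18488", -1.22)]}
# Made-up dosages at two variants for three samples.
x = {f"{pop}_{s}": d for s, d in
     [("18487", [0.0, 1.0]), ("18488", [2.0, 0.1]), ("18489", [1.0, 1.0])]}

merged = inner_join_samples(y, x)
print(sorted(merged))
```

Samples present in only one of the two inputs are dropped silently, so comparing the number of merged rows with the number of phenotype samples is a cheap sanity check.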

## Finemapping with SuSiE

This step is intended to run on the midway cluster (see the commented-out `task:` line in the step definition).

```
sos run analysis/20180704_MolecularQTL_Workflow.ipynb SuSiE \
    --x-data ~/GIT/LargeFiles/AS/genotypesYRI.gen.txt.gz \
    --y-data ~/GIT/LargeFiles/AS/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.txt.gz \
    --max-dist 1000000 \
    -c ~/software/sos.config.yml -J 40
```

In [None]:
# Run finemapping with SuSiE
[SuSiE (SuSiE analysis)]
depends: R_library('susieR')
parameter: maxL = 5
parameter: prior_var = 0.1
input: glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/chr*/*.rds'), group_by = 1, concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_*/*.rds'))
#task: trunk_workers = 1, queue = 'midway2_head', walltime = '10m', trunk_size = 1000, mem = '2G', cores = 1, workdir = cwd, concurrent = True
R: expand = "${ }"
    library(susieR)
    dat = readRDS(${_input:r})
    n_y = length(grep("^chr", colnames(dat), value = T))
    Y = as.matrix(dat[,1:n_y,drop=F])
    X = as.matrix(dat[,(n_y+1):ncol(dat),drop=F])
    storage.mode(X) = 'double'
    storage.mode(Y) = 'double'
    bad = which(sapply(1:ncol(X), function(i) all(is.na(X[,i]))))
    if (length(bad) >= 1) {
      snps = colnames(X)[-bad]
      X = X[,-bad,drop=F]
    } else {
      snps = colnames(X)
    }
    for (r in 1:ncol(Y)) {
      keep_rows = which(!is.na(Y[,r]))
      x = X[keep_rows,,drop=F]
      y = Y[,r][keep_rows]
      z_score = susieR:::calc_z(x,y)
      names(z_score) = snps
      fitted = susie(x, y,
               L=${maxL},
               estimate_residual_variance = TRUE, 
               prior_variance = ${prior_var}, 
               intercept=FALSE,
               tol=1e-3)
      sets = susie_get_CS(fitted,
                    coverage = 0.95,
                    X = x, 
                    min_abs_corr = 0.4)
      pip = susie_get_PIP(fitted, sets$cs_index)
      dirname = paste0('${cwd}/${y_data:bnn}_${int(max_dist/1000)}Kb/SuSiE_CS_', length(sets$cs_index), '/')
      system(paste("mkdir -p", dirname))
      if (length(sets$cs_index) > 0) {
          saveRDS(list(z_score=z_score,fitted=fitted,sets=sets,pip=pip), paste0(dirname, colnames(Y)[r], '.rds'))
      } else {
          saveRDS(list(z_score=z_score,fitted=fitted), paste0(dirname, colnames(Y)[r], '.rds'))          
      }
    }
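
The `z_score` vector saved above holds one univariate association statistic per variant. The sketch below shows one common definition, the slope divided by its standard error from a simple linear regression with intercept. Whether `susieR:::calc_z` computes exactly this quantity is an assumption that the notebook does not confirm; the function name `marginal_z` and the data are made up.

```python
import math

def marginal_z(x, y):
    """Slope over its standard error for the simple regression y ~ 1 + x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    rss = sum((b - mean_y - slope * (a - mean_x)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se

# Hypothetical dosages for two variants (rows) and one phenotype.
X = [[0.0, 1.0, 2.0, 1.0, 0.0, 2.0],
     [1.0, 1.0, 0.0, 2.0, 0.0, 1.0]]
y = [0.1, 1.2, 1.9, 0.8, -0.2, 2.3]

z_score = [marginal_z(x, y) for x in X]
# Index of the variant with the largest absolute z-score.
top = max(range(len(z_score)), key=lambda j: abs(z_score[j]))
print(z_score, top)
```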

In [None]:
# Make SuSiE result plots for significant results
[SuSiE_Summary (plot SuSiE results)]
input: glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_[1-9]/*.rds'), group_by = 1, concurrent = True
output: f'{_input:n}.png'
R: expand = '${ }', stdout = f'{_output:n}.log'
    library(susieR)
    dat = readRDS(${_input:r})
    b = rep(0,length(dat$z_score))
    b[which.max(abs(dat$z_score))] = 1
    png(${_output:r}, 12, 6, units = 'in', res = 500)
    par(mfrow=c(1,2))
    susie_pplot(dat$z_score, dtype='z', b=b)
    susie_pplot(dat$pip, fitted=dat$fitted, dtype='PIP', b=b) 
    dev.off()
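
The `input:` glob of `SuSiE_Summary` selects only result directories named `SuSiE_CS_1` through `SuSiE_CS_9`, that is, units with at least one credible set. A small sketch of how that character class behaves, using `fnmatch` from the standard library; the helper name `has_credible_set` is hypothetical.

```python
import fnmatch

def has_credible_set(dirname):
    """True if a directory name matches the pattern used by SuSiE_Summary."""
    return fnmatch.fnmatchcase(dirname, "SuSiE_CS_[1-9]")

for name in ["SuSiE_CS_0", "SuSiE_CS_3", "SuSiE_CS_10"]:
    print(name, has_credible_set(name))
```

A single-digit class does not match two-digit counts, so the pattern would need extending if `maxL` were ever raised to 10 or more; with the default `maxL = 5` this does not arise.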