# Molecular QTL workflow

In [1]:
%revisions -s -n 10

## Data

Molecular QTL data from [Yang et al (2016) Science](http://eqtl.uchicago.edu/jointLCL/). The input consists of genotypes for ~100 YRI samples, together with molecular QTL phenotypes measured in lymphoblastoid cell lines (LCLs).

- Alternative splicing (AS) data are of primary interest here.

### Genotypes

[Genotype data for YRI](http://eqtl.uchicago.edu/jointLCL/genotypesYRI.gen.txt.gz) are in conventional VCF format, but genotypes are encoded as dosages.

### Phenotypes

Phenotype data have the following format:

```
#Chr	start	end	ID	18486	18487	18488	18489	18498	18499
chr1	880180	880422	chr1:880180:880422:clu_15502	0.201694364955	0.665990212763	-1.21881815589	-0.342480185427	0.165404160483	-1.58524292941
```

The first 4 columns give the genomic coordinates and ID of each unit. The remaining columns hold the molecular phenotype values, one column per sample.
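As a minimal sketch of how one such row can be split into coordinates and per-sample values, the snippet below uses only the Python standard library. The helper name `parse_phenotype_row` and the set of tokens treated as missing (`""`, `NA`, `nan`, `NaN`) are assumptions made for illustration; they are not part of the workflow itself.

```python
# Example header and row, shortened to three samples from the format shown above.
header = "#Chr\tstart\tend\tID\t18486\t18487\t18488"
row = ("chr1\t880180\t880422\tchr1:880180:880422:clu_15502\t"
       "0.201694364955\t0.665990212763\t-1.21881815589")

# Assumption: these tokens mark a missing phenotype value.
MISSING_TOKENS = {"", "NA", "nan", "NaN"}


def parse_phenotype_row(header_line, data_line):
    """Hypothetical helper: return (coordinates, {sample_id: value}) for one row."""
    columns = header_line.lstrip("#").rstrip("\n").split("\t")
    fields = data_line.rstrip("\n").split("\t")
    if len(columns) != len(fields):
        raise ValueError("header and row have different numbers of columns")
    # The first 4 columns describe the unit.
    coords = {
        "chr": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "ID": fields[3],
    }
    # The remaining columns are phenotype values keyed by sample ID.
    values = {}
    for sample, raw in zip(columns[4:], fields[4:]):
        values[sample] = None if raw in MISSING_TOKENS else float(raw)
    return coords, values


coords, values = parse_phenotype_row(header, row)
print(coords)
print(values)
```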

We analyze:

1. [Alternative splicing](http://eqtl.uchicago.edu/jointLCL/qqnorm_ASintron_RNAseqGeuvadis.txt)

## Analysis plan

- For each analysis unit (gene, or intron cluster for AS), extract the variants within 1 Mb up- and downstream from the genotype data
- Remove the top phenotype PCs from the phenotype data
- Fine-map using various methods, starting with SuSiE
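The first step of the plan can be sketched as a simple window computation. The snippet below assumes the window is anchored at the unit's `start` coordinate and clipped at 0, and that window bounds are inclusive when filtering variant positions; the function names `cis_window` and `in_window` are hypothetical and chosen only for this illustration.

```python
def cis_window(start, max_dist=1_000_000):
    """Hypothetical helper: window of max_dist bp on each side of `start`, clipped at 0."""
    lower = max(start - max_dist, 0)
    upper = start + max_dist
    return lower, upper


def in_window(position, window):
    """Assumption: both window bounds are treated as inclusive."""
    lower, upper = window
    return lower <= position <= upper


# A unit near the start of the chromosome: the lower bound is clipped to 0.
window = cis_window(880180)
print(window)

# Keep only the variant positions that fall inside the window.
variant_positions = [500, 1_500_000, 2_000_000]
kept = [p for p in variant_positions if in_window(p, window)]
print(kept)
```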

## Workflow overview

In [1]:
!sos run 20180704_MolecularQTL_Workflow.ipynb -h

usage: sos run 20180704_MolecularQTL_Workflow.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  SuSiE
  index_vcf
  R_loader

Global Workflow Options:
  --cwd /home/gaow/GIT/LargeFiles/finemap_workflow_output (as path)
                        Specify work directory
  --x-data . (as path)
                        X data, the genotype VCF file path
  --y-data . (as path)
                        Y data, the phenotype file paths
  --h5-data . (as path)
                        HDF5 file, merged X and Y data

Sections
  SuSiE_1:              PCA on phenotype and remove top PCs
    Workflow Options:
      --num-pcs 7 (as int)
                        Num. PC to remove from p


In [4]:
[global]
# Specify work directory
parameter: cwd = path("~/GIT/LargeFiles/finemap_workflow_output")
# X data, the genotype VCF file path
parameter: x_data = path()
# Y data, the phenotype file paths
parameter: y_data = path()
# HDF5 file, merged X and Y data
parameter: h5_data = path()
fail_if(not x_data.is_file() and not h5_data.is_file(), msg = 'Please provide ``--x-data`` or ``--h5-data``!')
fail_if(not y_data.is_file() and not h5_data.is_file(), msg = 'Please provide ``--y-data`` or ``--h5-data``!')

## Regressing out top PCs on phenotype

There may well be better approaches to controlling for covariates, but [here](https://github.com/bmvdgeijn/WASP/blob/master/examples/example_data/H3K27ac/get_PCs.R) is the workflow from the authors, which was deemed sufficient. See the supplemental table of Yang et al 2016 for the number of PCs to use for each molecular QTL.

Missing phenotype data must be handled here: see the `na.omit` call passed to `prcomp` and the `na.action=na.exclude` argument to `lm`.
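To illustrate what `na.exclude` does to residuals, here is a minimal pure-Python sketch. It is a simplification and not the workflow's R code: it fits a single covariate with an intercept instead of several PCs, and it represents missing values as `None`. The function name `residuals_na_exclude` is hypothetical. The point is that the fit uses only complete observations, while the returned residuals keep the original length, with missing entries left in place.

```python
def residuals_na_exclude(y, x):
    """Residuals of the least-squares fit y ~ 1 + x.

    Observations where y or x is None are excluded from the fit, but their
    positions are kept as None in the output (similar in spirit to R's
    na.action=na.exclude).
    """
    complete = [(i, yi, xi) for i, (yi, xi) in enumerate(zip(y, x))
                if yi is not None and xi is not None]
    n = len(complete)
    if n < 2:
        raise ValueError("need at least two complete observations")
    mean_x = sum(xi for _, _, xi in complete) / n
    mean_y = sum(yi for _, yi, _ in complete) / n
    sxx = sum((xi - mean_x) ** 2 for _, _, xi in complete)
    if sxx == 0:
        raise ValueError("covariate is constant among complete observations")
    slope = sum((xi - mean_x) * (yi - mean_y) for _, yi, xi in complete) / sxx
    intercept = mean_y - slope * mean_x
    # Pad with None so the output lines up with the input samples.
    residuals = [None] * len(y)
    for i, yi, xi in complete:
        residuals[i] = yi - (intercept + slope * xi)
    return residuals


y = [1.0, None, 3.5, 5.0]   # phenotype of one unit across four samples
x = [0.0, 1.0, 1.0, 2.0]    # one covariate (e.g. one PC loading per sample)
res = residuals_na_exclude(y, x)
print(res)
```

Keeping the padding is what lets the residual rows be stacked back into a matrix with the same sample columns as the input.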

In [None]:
# PCA on phenotype and remove top PCs
[SuSiE_1 (Remove top phenotype PC)]
# Num. PC to remove from phenotype
parameter: num_pcs = 7
# column name pattern for `grep` in R to select phenotype columns
# e.g. "^NA[0-9]+" to extract sample names
parameter: colname_pattern = '^[0-9]+'
input: y_data
output: f"{cwd}/{_input:bn}.PC{num_pcs}.removed.gz"
R: expand = "${ }", workdir = cwd, stdout = f"{_output:n}.stdout"
    num_pcs = ${num_pcs}
    dat <- read.table(${_input:r}, header=T, comment.char='', check.names=F)
    phenotype.matrix <- dat[,5:ncol(dat)]
    # extract columns of interest
    phenotype.matrix <- phenotype.matrix[,grep("${colname_pattern}", colnames(phenotype.matrix), value = T)]
    # perform principal component analysis
    pca <- prcomp(na.omit(phenotype.matrix))
    # PCA summary
    print(summary(pca))
    cat("output", num_pcs, "PCs \n")
    # remove top PC from phenotype; takes a while
    cov_pcs <- pca$rotation[, 1:num_pcs]
    new.phenotype.matrix <- do.call(rbind, lapply(1:nrow(phenotype.matrix), function(i) residuals(lm(t(phenotype.matrix[i,]) ~ as.matrix(cov_pcs), na.action=na.exclude))))
    colnames(new.phenotype.matrix) <- colnames(phenotype.matrix)
    new.dat <- cbind(dat[,1:4], new.phenotype.matrix)
    colnames(new.dat)[1] <- 'chr'
    write.table(new.dat, gzfile(${_output:r}), sep="\t", quote=F, col.names=T, row.names=F)

## Extract per unit variables

In [None]:
# this step provides VCF file index
[index_vcf: provides = '{filename}.gz.tbi']
depends: executable('tabix')
input: f"{filename}.gz"
bash: expand=True
   tabix -p vcf {_input}

# Extract cis-SNPs and make fine-mapping datasets
[SuSiE_2 (Get per-unit dataset)]
depends: Py_Module('pysam'), Py_Module('pandas'), Py_Module('dsc'), Py_Module('rpy2'), f"{x_data}.tbi"
# Maximum distance to site of interest, eg. 1Mb up/downstream to TSS for gene level QTL
parameter: max_dist = 1000000
# Number of batches to perform analysis
parameter: n_batch = 100
batch = [x+1 for x in range(n_batch)]
input: for_each = 'batch', concurrent = True
output: dynamic(glob.glob(f'{cwd}/{y_data:bnn}/chr*/*.rds'))
python: workdir = cwd, expand = "${ }"
    def read_header(gzfile):
        import gzip
        with gzip.open(gzfile,'r') as f:
            for line in f:
                res = [x.decode() for x in line.split()]
                break
        return res

    def chunk_ranges(items, chunks):
        result = []
        if items <= chunks:
            for i in range(0, items):
                result.append((i, i + 1))
            return result
        chunk_size, extras = divmod(items, chunks)
        start = 0
        for i in range(0, chunks):
            if i < extras:
                end = start + chunk_size + 1
            else:
                end = start + chunk_size

            result.append((start, end))
            start = end
        return result
    #
    batch = ${_batch}
    phenotype_id = read_header(${_input:r})[4:]
    vcf_id = read_header(${x_data:r})[9:]
    from pathlib import Path
    import pysam
    tbx = pysam.TabixFile(${x_data:r})    
    import pandas as pd, numpy as np
    from dsc.dsc_io import save_rds
    qts = pd.read_csv(${_input:r}, sep = '\t')
    the_chunk = chunk_ranges(qts.shape[0], ${n_batch})[batch-1]
    qts = qts.iloc[the_chunk[0]:the_chunk[1],:]
    #
    import time
    start_time = time.time()
    i = 0
    for idx, unit in qts.iterrows():
        if(i % 100 == 0):
            print('[batch %s percent completed] %.2f (%s sec elapsed)' % (batch, (float(i+1)/qts.shape[0])*100, np.around(time.time() - start_time, 2)))
        i += 1
        chrom = unit['chr']
        name = unit['ID'].replace(':', '_')
        start = max(unit['start'] - ${max_dist}, 0)
        end = unit['start'] + ${max_dist}
        phenotypes = np.array(unit.tolist())
        genotypes = np.array([row for row in tbx.fetch(chrom, start, end, parser=pysam.asTuple())])
        if len(genotypes) == 0:
            continue
        X_data = pd.DataFrame(genotypes[:,9:].T,
                              columns = ['_'.join(x) for x in genotypes[:,[0,1,3,4,2]]], 
                              index = vcf_id)
        Y_data = pd.DataFrame(np.matrix(phenotypes[4:]).T, columns = [name],
                              index = phenotype_id).dropna()
        merged = Y_data.join(X_data, how='inner').T
        Path(f'${cwd}/${y_data:bnn}/{chrom}').mkdir(exist_ok=True, parents=True)
        save_rds(merged, f'${cwd}/${y_data:bnn}/{chrom}/{name}.rds')
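The batching logic in this step can be checked in isolation. The sketch below repeats `chunk_ranges` from the cell above so that it runs on its own, and prints the half-open `(start, end)` row ranges it yields for two small, made-up inputs.

```python
def chunk_ranges(items, chunks):
    """Split `items` rows into at most `chunks` contiguous (start, end) ranges."""
    result = []
    if items <= chunks:
        # Fewer rows than chunks: one row per range, so fewer ranges than chunks.
        for i in range(0, items):
            result.append((i, i + 1))
        return result
    chunk_size, extras = divmod(items, chunks)
    start = 0
    for i in range(0, chunks):
        # The first `extras` chunks each take one additional row.
        end = start + chunk_size + (1 if i < extras else 0)
        result.append((start, end))
        start = end
    return result


print(chunk_ranges(10, 3))  # 10 rows in 3 batches
print(chunk_ranges(2, 5))   # fewer rows than batches
```

Note that when there are fewer rows than batches, the function returns fewer ranges than `n_batch`, so indexing the result with `[batch-1]` would raise an `IndexError` for the higher batch numbers. With the default `n_batch = 100` this only matters for phenotype files with fewer than 100 rows.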

## Fine-mapping with SuSiE

In [None]:
# Data loader utility function
[R_loader: provides = file_target('.sos/data_loader.R')]
depends: R_library("rhdf5"), R_library("tools")
report: output = f'{_output}'
    load_data = function(fn, name) {
        library(rhdf5)
        dat = h5read(fn, name)
        #
        X = data.frame(dat$X$block0_values)
        colnames(X) = dat$X$axis1
        rownames(X) = dat$X$axis0
        #
        y = data.frame(t(dat$y$block0_values))
        colnames(y) = tools::file_path_sans_ext(name)
        rownames(y) = dat$y$axis1
        y = y[!(rowSums(is.na(y))),,drop=F]
        return(merge(y, X, by=0))
    }
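The loader ends by merging phenotype and genotype tables on their row names, after dropping samples with a missing phenotype. A rough Python analogue of that filter-and-join is sketched below using plain dictionaries. The function name `inner_join_by_sample`, the sample IDs, and the variant labels are all made up for illustration, and the result keeps the phenotype's sample order instead of sorting by row name.

```python
def inner_join_by_sample(y, X):
    """Hypothetical helper: join phenotype and genotype records on sample ID.

    y: {sample_id: phenotype value or None}
    X: {sample_id: {variant_id: dosage}}
    Only samples present in both inputs with a non-missing phenotype are kept.
    """
    merged = {}
    for sample, value in y.items():
        if value is None or sample not in X:
            continue
        merged[sample] = {"y": value, **X[sample]}
    return merged


# Made-up toy data: three phenotyped samples, two of them genotyped.
y = {"18486": 0.20, "18487": None, "18488": -1.22}
X = {
    "18486": {"variant_a": 0.0, "variant_b": 1.0},
    "18487": {"variant_a": 2.0, "variant_b": 1.0},
    "18489": {"variant_a": 1.0, "variant_b": 0.0},
}
merged = inner_join_by_sample(y, X)
print(merged)
```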

In [None]:
# Run finemapping with SuSiE
[SuSiE_3 (SuSiE analysis)]
depends: file_target('.sos/data_loader.R')
input: group_by = 1, concurrent = True
output: f'{_input:n}.SuSiE.complete'
R: expand = True, input='.sos/data_loader.R'
    saveRDS(1, {_output:r})