# m&m ash pipeline execution interface

## Preparation & preprocessing of GTEx data
See [this page](../writeup/GTEx7_Analysis_Plan.html#Preprocessing) and [this meeting note](../writeup/Meetings.html#Project-meeting-20170518) for details. 

In [2]:
%save prep.sos -f -x


[gene_annotation: provides = "${CONFIG['rna_cnts']!n}.annotation"]
input: "${CONFIG['rna_cnts']}"
output: "${CONFIG['rna_cnts']!n}.annotation"
sos_run("MC.ensembl_annotation", workdir = CONFIG['wd'])

[genotype_formatting]
parameter: original_variants = "{}/GTEx7.dbGaP.bed".format(CONFIG['wd'])
parameter: gene_annotation = "${CONFIG['rna_cnts']!n}.annotation"
depends: original_variants
input: "{}/GTEx7.Imputed.bed".format(CONFIG['wd'])
sos_run("DW.variants_filter+DW.plink_to_hdf5_batch", 
        workdir = CONFIG['wd'], 
        include = original_variants,
        ann = gene_annotation)

[covariate_preparation]
# Covariates are: sex, platform, 3 PC and PEER factors
parameter: peer_factors = glob.glob("{}/*_PEER_covariates.txt".format("${CONFIG['wd']!a}"))
parameter: pc_file = "{}/GTEx7.Imputed.prune.pc.ped".format(CONFIG['wd'])
parameter: attr_file = CONFIG['sample_attr']
parameter: covar_file = CONFIG['phenotype']
parameter: expression_file = CONFIG['expression_db']
sos_run("DW.recode_platform + DW.covariates_to_HDF5",
        workdir = CONFIG['wd'],
        peer_factors = peer_factors,
        pc_file = pc_file,
        attr_file = attr_file,
        covar_file = covar_file,
        output_file = "{}/GTEx7.Imputed.covariates.h5".format(CONFIG['wd']))

[make_toy]
# Create a toy example
sos_run("DW.subset_HDF5_data",
        workdir = CONFIG['wd'],
        ann_file = "${CONFIG['rna_cnts']!n}.annotation",
        geno_file = "{}/GTEx7.Imputed.genotyped.filtered.cis.h5".format(CONFIG['wd']),
        expr_file = "{}/${CONFIG['rna_rpkm']!bnn}.qnorm.std.h5".format(CONFIG['wd']),
        toy_file = CONFIG['toy_prefix'],
        gene_list = CONFIG['toy_gene_list'])

In [None]:
!./prep.sos gene_annotation -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 6 -j 1
!./prep.sos genotype_formatting -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 6 -j 1
!./prep.sos genotype_pca_umich_filtered -c conf/20170507.conf -b ~/Documents/GTEx/bin/ -J 6

### Variants annotation, cis-SNP selection and genotype formatting
Genes are annotated with chromosomal positions, and variants are annotated to genes. Then, for each gene, variants within 2 Mb of the gene's TSS are selected. The result is a **single analysis-ready file** in HDF5 format containing ~50K groups of genotype data, one group per gene name.

In [None]:
!./prep.sos gene_annotation -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 6 -j 1
!./prep.sos genotype_formatting -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 6 -j 1
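
The cis-SNP selection described above can be sketched in a few lines of Python. This is only an illustration of the windowing idea, not the `DW.variants_filter` implementation: the function name, the toy gene and variant IDs, their positions, and the inclusive window boundary are all assumptions.

```python
def select_cis_variants(genes, variants, window=2_000_000):
    """Group variant IDs by gene: same chromosome, within `window` bp of the TSS.

    genes:    {gene_id: (chrom, tss_position)}
    variants: {variant_id: (chrom, position)}
    The boundary is treated as inclusive here (an assumption).
    """
    groups = {}
    for gene_id, (chrom, tss) in genes.items():
        groups[gene_id] = [
            var_id
            for var_id, (v_chrom, pos) in variants.items()
            if v_chrom == chrom and abs(pos - tss) <= window
        ]
    return groups


# Toy inputs (hypothetical IDs and coordinates, for illustration only)
genes = {"geneA": ("chr1", 5_000_000), "geneB": ("chr2", 1_000_000)}
variants = {
    "rs1": ("chr1", 3_500_000),   # 1.5 Mb from geneA's TSS: kept
    "rs2": ("chr1", 7_000_000),   # exactly 2 Mb away: kept
    "rs3": ("chr1", 7_000_001),   # just outside the window: dropped
    "rs4": ("chr2", 1_000_500),   # near geneB's TSS: kept
}
groups = select_cis_variants(genes, variants)
print(groups)
```

In the real pipeline each gene's variants end up as one group in the HDF5 file; the dictionary here plays that role.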

### Merge covariates info
The covariates available for analysis so far are sample phenotypes (sex), sample attributes (genotyping platform), the first principal components for population structure, and PEER factors. These are stored in separate files.

This workflow consolidates these files and generates a **single analysis-ready covariate file** in HDF5 format.

In [None]:
!./prep.sos covariate_preparation -c conf/20170507.conf -b ~/Documents/GTEx/bin/ 
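
As a rough picture of what "consolidating" means here, the sketch below merges several per-sample covariate tables into one table keyed by sample ID. It is not the `DW.covariates_to_HDF5` code: the function name, the toy sample IDs and values, and the choice to keep only samples present in every source are assumptions, and the real workflow writes HDF5 rather than a Python dictionary.

```python
def merge_covariates(*sources):
    """Combine per-sample covariate tables into one table.

    Each source maps sample_id -> {covariate_name: value}.
    Only samples present in every source are kept (an assumption).
    """
    shared = set(sources[0])
    for src in sources[1:]:
        shared &= set(src)
    merged = {}
    for sample in sorted(shared):
        row = {}
        for src in sources:
            row.update(src[sample])
        merged[sample] = row
    return merged


# Toy inputs (hypothetical sample IDs and values)
sex = {"S1": {"sex": 1}, "S2": {"sex": 2}, "S3": {"sex": 1}}
platform = {"S1": {"platform": 0}, "S2": {"platform": 1}}
pcs = {"S1": {"PC1": 0.1}, "S2": {"PC1": -0.2}, "S3": {"PC1": 0.0}}

merged = merge_covariates(sex, platform, pcs)
print(merged)  # S3 is dropped because it has no platform entry
```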

### Generate a toy data-set
Finally, a toy data-set is created from the data bundle. This toy can be used for methods / pipeline development. Genes selected for the toy are the same as the [LD show-case in the mash paper](https://stephenslab.github.io/gtexresults_mash/TwoSNP/2SNP.sos) (although the workflow itself takes an arbitrary list of genes). See [this table](https://stephenslab.github.io/gtexresults_mash/TwoSNP/) for why these genes were selected.

In [None]:
!./prep.sos make_toy -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 6 -j 1

## Simulations
Please see [this notebook](../analysis/MR-ASH-Simulation.html) for interactive code that simulates expression data for given genotypes, and [this notebook](../analysis/MR-ASH-Example.html) for a toy analysis. This pipeline is a more formal version of those exploratory analyses.

In [20]:
%save simulation.sos -f -x
#!/usr/bin/env sos-runner
#fileformat=SOS1.0

#%include Misc as MC

[phenotype_original_genotype]
sos_run('MC.genotype_LD', 
        genotype_data = CONFIG['genotype'],
        workdir = "${CONFIG['wd']!a}",
        src = "${CONFIG['src']!a}")

sos_run('MC.simulation',
        genotype_data = CONFIG['genotype'],
        workdir = "${CONFIG['wd']!a}",
        src = "${CONFIG['src']!a}",
        pi0 = CONFIG['pi0'],
        shape = CONFIG['shape'],
        n_rep = CONFIG['n_rep'])

[phenotype_permuted_genotype]
sos_run('MC.genotype_LD',
        genotype_data = CONFIG['genotype'],
        workdir = "${CONFIG['wd']!a}",
        src = "${CONFIG['src']!a}",
        permute_genotype = "True")

sos_run('MC.simulation',
        genotype_data = CONFIG['genotype'],
        workdir = "${CONFIG['wd']!a}",
        src = "${CONFIG['src']!a}",
        permuted_genotype = "True",
        pi0 = CONFIG['pi0'],
        shape = CONFIG['shape'],
        n_rep = CONFIG['n_rep'])

### Generate phenotypes from original and permuted genotypes

One replicate, multiple scenarios:
```bash
./simulation.sos phenotype_original_genotype -J 38 -c conf/simulate-20170630.conf
./simulation.sos phenotype_permuted_genotype -J 38 -c conf/simulate-20170630.conf
```
One scenario, multiple replicates:
```bash
./simulation.sos phenotype_original_genotype -J 38 -c conf/simulate-20170630-reps.conf
./simulation.sos phenotype_permuted_genotype -J 38 -c conf/simulate-20170630-reps.conf
```
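
For orientation, here is a minimal sketch of simulating expression for a fixed genotype matrix. It is not the `MC.simulation` workflow: it assumes `pi0` is the proportion of SNPs with zero effect (the usual meaning in ash-style models), draws non-null effects from a normal distribution, and ignores the `shape` setting entirely. The function name and the `effect_sd`, `noise_sd` and `seed` arguments are hypothetical.

```python
import random


def simulate_expression(genotypes, pi0, effect_sd=1.0, noise_sd=1.0, seed=0):
    """Simulate y = X b + e for a fixed genotype matrix X (list of rows).

    Each SNP effect is 0 with probability pi0 (assumed meaning of pi0),
    otherwise drawn from N(0, effect_sd^2). Noise is N(0, noise_sd^2).
    """
    rng = random.Random(seed)
    n_snps = len(genotypes[0])
    effects = [
        0.0 if rng.random() < pi0 else rng.gauss(0.0, effect_sd)
        for _ in range(n_snps)
    ]
    y = [
        sum(g * b for g, b in zip(row, effects)) + rng.gauss(0.0, noise_sd)
        for row in genotypes
    ]
    return effects, y


# Toy genotype matrix: 4 samples x 3 SNPs, coded 0/1/2 (made-up values)
X = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 0, 0]]
effects, y = simulate_expression(X, pi0=0.5, seed=1)
print(effects, y)
```

Fixing the seed makes each replicate reproducible, which is convenient when the same scenario is rerun many times.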

## Univariate eQTL analysis
### Single SNP
A [fastEQTL](http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/) based pipeline.

### Multi-SNP

In [None]:
[mr_ash]
parameter: cwd = "./"
parameter: genotype_file = ''
parameter: expr_file = ''
parameter: covar_file = ''
parameter: genes = []
parameter: tissues = []
depends: R_library('rhdf5'), R_library('tools'), R_library('varbvs'), R_library('ashr')
input: for_each = ['genes', 'tissues']
output: "${cwd!a}/${_tissues}_${_genes!b}.rds"
task:
R:
load_data = function(genotype_file, expr_file, cov_file, geno_table, expr_table,cov_table) {
  geno = h5read(genotype_file, geno_table)
  gdata = geno$block0_values
  colnames(gdata) = geno$axis1
  rownames(gdata) = geno$axis0
  
  expr = h5read(expr_file, expr_table)
  edata = expr$block0_values
  # colnames(edata) = expr$axis1
  colnames(edata) = tools::file_path_sans_ext(expr$axis1)
  # rownames(edata) = expr$axis0
  rownames(edata) = apply(sapply(strsplit(expr$axis0,"-"), `[`, c(1,2)), 2, function(x) paste(x, collapse = '-'))
  
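  # Drop samples with duplicated IDs. A logical mask is used because
  # edata[-which(...), ] returns zero rows when there are no duplicates.
  keep = !duplicated(row.names(edata))
  edata = edata[keep, , drop = FALSE]
  # edata = data.frame(edata)
  covariate <- h5read(cov_file, cov_table)
  cdata = covariate$block0_values
  colnames(cdata) = apply(sapply(strsplit(covariate$axis1,"-"), `[`, c(1,2)), 2, function(x) paste(x, collapse = '-'))
  rownames(cdata) = covariate$axis0 
  # Assumes covariate samples are stored in the same order as expression samples
  cdata = t(cdata)[keep, , drop = FALSE]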
  gdata = data.frame(gdata)
  # I want to use merge but later
  # index_overlap = which(row.names(gdata) %in% row.names(edata))
  edata = edata[, basename(geno_table)]
  edata = data.frame(edata)
  edata$ID = rownames(edata)
  gdata$ID = rownames(gdata)
  output = merge(x = edata, y = gdata, by = "ID", all.x = TRUE)
  # gdata = gdata[index_overlap,]
  return(list(X=as.matrix(output[,-c(1,2)]), y = as.vector(output$edata), Z = as.matrix(cdata)))
}

autoselect.mixsd = function(betahat,sebetahat,mult = sqrt(2)){
  # Drop entries with a zero standard error (usually a mistake), keeping betahat aligned
  keep = sebetahat != 0
  betahat = betahat[keep]
  sebetahat = sebetahat[keep]
  sigmaamin = min(sebetahat)/10 #so that the minimum is small compared with measurement precision
  if(all(betahat^2<=sebetahat^2)){
    sigmaamax = 8*sigmaamin #to deal with the occasional odd case where this could happen; 8 is arbitrary
  }else{
    sigmaamax = 2*sqrt(max(betahat^2-sebetahat^2)) #this computes a rough largest value you'd want to use, based on idea that sigmaamax^2 + sebetahat^2 should be at least betahat^2   
  }
  if(mult==0){
    return(c(0,sigmaamax/2))
  }else{
    npoint = ceiling(log2(sigmaamax/sigmaamin)/log2(mult))
    return(mult^((-npoint):0) * sigmaamax)
  }
}

initial_step = function(X,y,Z = NULL){
  P = dim(X)[2]
  output = matrix(0,nrow = P,ncol = 2)
  for(i in 1:P){
    if(is.null(Z)){
      g = summary(lm(y~X[,i]))
    } else{
      g = summary(lm(y~X[,i]+Z))
    }
    
    output[i,] = g$coefficients[2,1:2]
  }
  return(list(betahat = output[,1],sebetahat = output[,2]))
}
                                                                  
analyze = function(genename = '/chr4/ENSG00000145214', tissue = '/Lung', out = 'test.rds'){
  library(rhdf5)
  genotype_file = ${genotype_file!ar}
  expr_file = ${expr_file!ar}
  geno_table = genename
  expr_table = tissue 
  gene = basename(geno_table)
  cov_file = ${covar_file!ar}
  cov_table = expr_table
  dat = load_data(genotype_file = genotype_file,
                  expr_file = expr_file,
                  cov_file = cov_file,
                  geno_table = geno_table,
                  expr_table = expr_table,
                  cov_table = cov_table)
  X = as.matrix(dat$X)
  # Remove variants with no variation; drop = FALSE keeps X a matrix
  # even when a single variant remains
  X = X[, which(colSums(X) != 0), drop = FALSE]
  if ((nrow(X) == 0) || (ncol(X) == 0)) {
    # Nothing to analyze: save an empty result so the output file still exists
    saveRDS(list(), out)
  } else {
    storage.mode(X) <- "double"
    y = as.vector(dat$y)
    Z = as.matrix(dat$Z)
    # Univariate regressions give the starting estimates and the prior grid
    initial = initial_step(X,y,Z)
    mixsd = autoselect.mixsd(initial$betahat,initial$sebetahat)
    logdata = capture.output({res = varbvs::varbvsmix(X, Z, y, sa = c(0,mixsd^2)) })
    betahat = rowSums(res$alpha * res$mu)
    names(betahat) = colnames(X)
    mrash_out = list(betahat = betahat, lfsr = res$lfsr)
    ash_out = ashr::ash(initial$betahat,initial$sebetahat,mixcompdist = "normal")
    saveRDS(list(ash = ash_out, uni = initial, mr_ash = mrash_out, logdata = logdata), out)
  }
}

analyze(genename = "/${_genes}", tissue = "/${_tissues}", out = ${_output!r})
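
To see what grid `autoselect.mixsd` produces, the following is a small Python port of the same rule, written for illustration only (the pipeline itself runs the R version above). The snake_case function name and the toy inputs are assumptions; as in the R code after dropping zero standard errors, `betahat` and `sebetahat` are filtered together.

```python
import math


def autoselect_mixsd(betahat, sebetahat, mult=math.sqrt(2)):
    """Grid of mixture standard deviations, ported from the R function above."""
    # Drop entries whose standard error is exactly zero, keeping pairs aligned
    pairs = [(b, s) for b, s in zip(betahat, sebetahat) if s != 0]
    # Smallest sd: small compared with the measurement precision
    sigmaamin = min(s for _, s in pairs) / 10
    if all(b * b <= s * s for b, s in pairs):
        sigmaamax = 8 * sigmaamin  # odd case with no signal; 8 is arbitrary
    else:
        # Rough largest sd: sigmaamax^2 + se^2 should be at least betahat^2
        sigmaamax = 2 * math.sqrt(max(b * b - s * s for b, s in pairs))
    if mult == 0:
        return [0.0, sigmaamax / 2]
    npoint = math.ceil(math.log2(sigmaamax / sigmaamin) / math.log2(mult))
    # Geometric grid ending at sigmaamax, each step a factor of `mult`
    return [mult ** k * sigmaamax for k in range(-npoint, 1)]


# Toy summary statistics (made-up values)
betahat = [0.5, -1.0, 0.1]
sebetahat = [0.2, 0.2, 0.2]
grid = autoselect_mixsd(betahat, sebetahat, mult=2)
print(grid)
```

The grid is geometric: it ends at `sigmaamax` and extends downward until it reaches or passes `sigmaamin`. In `analyze`, the squared grid values, with a leading 0 for the point mass, are passed to `varbvsmix` as `sa`.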