# Multi-Trait Analysis of GWAS (MTAG)

## Aim

Joint analysis of summary statistics from GWAS of different traits, estimated in possibly overlapping samples.

## Method

[MTAG](https://www.nature.com/articles/s41588-017-0009-4) uses bivariate linkage disequilibrium (LD) score regression to account for (possibly unknown) sample overlap between the GWAS results for different traits.

The MTAG estimator is a generalization of inverse-variance-weighted meta-analysis that takes summary statistics from single-trait GWAS and outputs trait-specific association statistics. The resulting P values can be used like P values from a single-trait GWAS, for example, to prioritize SNPs for subsequent analyses such as biological annotation or to construct polygenic scores.
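For reference, the sketch below shows plain fixed-effect inverse-variance-weighted meta-analysis for a single SNP. This is the special case that MTAG generalizes, not the MTAG estimator itself. The function name `ivw_meta` and the input numbers are made up for illustration.

```python
import math


def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis for one SNP."""
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se


# Two hypothetical studies with equal precision: the pooled effect is their mean.
beta, se, z = ivw_meta([0.10, 0.20], [0.05, 0.05])
```

This simple form assumes independent samples measuring the same trait; MTAG relaxes both assumptions by using the residual and genetic covariance matrices estimated from the data.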

## Requirements

MTAG requires Python 2.7. To install it, follow the instructions [here](https://github.com/JonJala/mtag).

## Format for Summary Statistics

```
snpid    chr    bpos    a1    a2    freq    z    pval    n
```

snpid: the LD reference panel for MTAG uses hg19 and rsID identifiers. If you are going to use it for your analyses, make sure your snpid column is in the same format.

chr: chromosome number

bpos: base position

a1: the effect allele

a2: the non-effect allele

freq: allele frequency for a1

z: The Z-scores associated with the SNP effect sizes for the GWAS

pval: p-value

n: the SNP sample sizes
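A minimal sketch of two checks that can be useful before running MTAG: confirming that a header contains the columns listed above, and deriving `z` and a two-sided `pval` from an effect size and its standard error. The helper names `missing_columns` and `z_and_p` are hypothetical, and the normal approximation for the p-value is an assumption.

```python
import math

# Column names as listed in this section.
REQUIRED = ["snpid", "chr", "bpos", "a1", "a2", "freq", "z", "pval", "n"]


def missing_columns(header_line, sep="\t"):
    """Return the required columns that are absent from a header line."""
    present = set(header_line.rstrip("\n").split(sep))
    return [name for name in REQUIRED if name not in present]


def z_and_p(beta, se):
    """Z-score and two-sided p-value under a standard normal null (assumption)."""
    z = beta / se
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


missing = missing_columns("snpid\tchr\tbpos\ta1\ta2\tfreq\tz\tpval")
z, p = z_and_p(0.10, 0.05)
```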


## LD reference panel FIXME

One issue that I encountered is that our summary stats use a snpid in `chr:pos:ref:alt` format with hg38 coordinates. Therefore, I've added the liftover module to deal with this and get the summary stats in hg19.

However, there is still the problem of matching the rsID from the LD reference panel with the snpid column in the GWAS summary statistics.

To try to solve this problem, I downloaded the LDSC-compatible flat files from the open-source [Pan-UKBiobank](https://pan.ukbb.broadinstitute.org/downloads) release (also in hg19). However, these need to be split by chromosome, which I did starting from the file `/mnt/mfs/statgen/data_public/UKBB.ALL.ldscore.hg19/UKBB.ALL.ldscore.hg19/UKBB.EUR.l2.ldscore.gz`. I then found that `mtag` requires each of the `{1..22}.l2.ldscore.gz` files to have two accompanying files, `{1..22}.l2.M` and `{1..22}.l2.M_5_50`, holding the total number of SNPs and the number of SNPs with MAF > 5%, respectively. Pan-UKBB provides these counts only for the whole dataset (chromosomes 1 to 22) and does not provide a MAF, which makes it very difficult to create the `l2.M_5_50` files.
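The per-chromosome split can be sketched as follows. This is an illustration, not the procedure actually used: it assumes a tab-delimited LD score file with `CHR`, `SNP`, `BP` and `L2` columns, holds the whole file in memory, and writes the per-chromosome row count into `.l2.M` as a placeholder, which may differ from the SNP count LDSC expects. It does not create `.l2.M_5_50`, because that needs the MAF information that is missing here. The function name `split_ldscore_by_chrom` and the toy input are hypothetical.

```python
import csv
import gzip
import os
import tempfile
from collections import defaultdict


def split_ldscore_by_chrom(ldscore_gz, out_dir):
    """Write <chr>.l2.ldscore.gz and <chr>.l2.M for every chromosome in the input."""
    rows_by_chrom = defaultdict(list)
    with gzip.open(ldscore_gz, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        fields = reader.fieldnames
        for row in reader:
            rows_by_chrom[row["CHR"]].append(row)

    os.makedirs(out_dir, exist_ok=True)
    counts = {}
    for chrom, rows in rows_by_chrom.items():
        ld_path = os.path.join(out_dir, f"{chrom}.l2.ldscore.gz")
        with gzip.open(ld_path, "wt", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t",
                                    lineterminator="\n")
            writer.writeheader()
            writer.writerows(rows)
        # Placeholder for M: number of rows on this chromosome (an assumption).
        with open(os.path.join(out_dir, f"{chrom}.l2.M"), "w") as m_file:
            m_file.write(f"{len(rows)}\n")
        counts[chrom] = len(rows)
    return counts


# Tiny made-up input to show the call; real files are far larger.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "toy.l2.ldscore.gz")
    with gzip.open(source, "wt") as handle:
        handle.write("CHR\tSNP\tBP\tL2\n1\trs1\t100\t1.5\n1\trs2\t200\t2.0\n2\trs3\t50\t0.9\n")
    counts = split_ldscore_by_chrom(source, os.path.join(tmp, "per_chrom"))
    written = sorted(os.listdir(os.path.join(tmp, "per_chrom")))
```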


## Errors encountered

```
raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of SIGNED_SUMSTAT is -0.38 (should be close to 0.0). This column may be mislabeled
```

This issue is discussed [here](https://github.com/JonJala/mtag/issues/86)
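Before running MTAG, the z-score column can be checked for this condition with a few lines of Python. The sketch below only reproduces the idea of the check (the median of a signed statistic should be near 0); the function name `median_z_check` and the 0.1 tolerance are assumptions for illustration, not values taken from the MTAG source.

```python
import statistics


def median_z_check(z_scores, tolerance=0.1):
    """Return the median z-score and whether it lies within +/- tolerance of 0."""
    median = statistics.median(z_scores)
    return median, abs(median) <= tolerance


# A shifted set of z-scores fails the check; a centered one passes.
shifted = median_z_check([-1.0, -0.38, 0.2])
centered = median_z_check([-0.5, 0.0, 0.5])
```

A median far from 0 often points to a mislabeled column (for example an odds ratio read as a z-score) or to a subset of SNPs selected on effect direction.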

## MTAG defaults

1. Read in the input GWAS summary statistics and filter the SNPs by minor allele frequency (MAF) >= 0.01 and sample size N >= (2/3) * 90th percentile

2. Merge the filtered GWAS summary statistics results together, taking the intersection of available SNPs.

3. Estimate the residual covariance matrix via LD Score regression

4. Estimate the genetic covariance matrix Omega

5. Perform MTAG and output results
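Steps 1 and 2 above can be sketched in plain Python. This is a hedged illustration of the filtering logic only, not MTAG's code: the linear-interpolation percentile, the dictionary-based rows, and the function names are assumptions, and MAF is taken as `min(freq, 1 - freq)`.

```python
def percentile(values, q):
    """Linear-interpolated percentile, with q between 0 and 100."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def default_filter(rows, maf_min=0.01):
    """Keep SNPs with MAF >= maf_min and N >= (2/3) * 90th percentile of N."""
    n_min = (2.0 / 3.0) * percentile([row["n"] for row in rows], 90)
    return [
        row for row in rows
        if min(row["freq"], 1.0 - row["freq"]) >= maf_min and row["n"] >= n_min
    ]


def shared_snps(*traits):
    """Intersection of the SNP identifiers present in every trait."""
    return set.intersection(*(set(row["snpid"] for row in trait) for trait in traits))


# Made-up rows: rs2 fails the MAF filter, rs4 fails the sample-size filter.
trait_a = default_filter([
    {"snpid": "rs1", "freq": 0.30, "n": 1000},
    {"snpid": "rs2", "freq": 0.005, "n": 1000},
    {"snpid": "rs3", "freq": 0.50, "n": 1000},
    {"snpid": "rs4", "freq": 0.20, "n": 500},
])
trait_b = [{"snpid": "rs3", "freq": 0.4, "n": 800}, {"snpid": "rs5", "freq": 0.4, "n": 800}]
common = shared_snps(trait_a, trait_b)
```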

## MTAG special options

1. Assumes no sample overlap between the cohorts in any pair of GWAS studies fed into mtag

`--no_overlap`


2. Assumes the T summary statistics used in MTAG are GWAS estimates for traits that are perfectly correlated with one another, i.e., each GWAS is on a different measure of the same "trait".

`--perfect_gencov`

3. Performs a meta-analysis with mtag, that is, a type of inverse-variance meta-analysis that can handle sample overlap in the GWAS results. Assumes that variation between "traits" is due only to non-genetic factors, and that all summary statistics files passed to MTAG have the same heritability, because they are considered to be results on the same measure of a single trait.

`--equal_h2`
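When the options are driven by boolean workflow parameters, the command-line flags can be assembled as in the sketch below. The helper name `special_case_flags` is hypothetical, and the flag spellings (`--no_overlap`, `--perfect_gencov`, `--equal_h2`) are my reading of the mtag command line; confirm them with `mtag.py -h` before relying on them.

```python
def special_case_flags(no_overlap=False, perfect_gencov=False, equal_h2=False):
    """Translate boolean options into a list of mtag.py command-line flags."""
    flags = []
    if no_overlap:
        flags.append("--no_overlap")
    if perfect_gencov:
        flags.append("--perfect_gencov")
    if equal_h2:
        flags.append("--equal_h2")
    return flags


# Meta-analysis style run: same trait measured in several GWAS.
meta_flags = special_case_flags(perfect_gencov=True, equal_h2=True)
default_flags = special_case_flags()
```

Returning a list rather than a single string keeps the flags easy to pass to `subprocess.run` or to join with spaces in a shell template.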

## Output

* .log: timestamps the different steps taken by mtag.py

* sigma_hat.txt: stores the estimated residual covariance matrix

* omega_hat.txt: stores the estimated genetic covariance matrix

* trait_{n}.txt: tab-delimited results files corresponding to the MTAG-adjusted effect sizes and standard errors for n imputed traits

The first 8 columns are copied from the sumstats and then the results obtained by MTAG are given:

* mtag_beta: unstandardized weights 

* mtag_se: unstandardized standard errors

* mtag_z: z-scores

* mtag_pval: p-values
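A short sketch of reading one of the trait files and keeping the SNPs that pass a significance threshold on `mtag_pval`. The `snpid` identifier column, the example rows, and the conventional 5e-8 threshold are assumptions for illustration; the real file also carries the columns copied from the summary statistics.

```python
import csv
import io


def significant_hits(handle, threshold=5e-8, id_column="snpid"):
    """Return identifiers of SNPs whose mtag_pval is below the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row[id_column] for row in reader if float(row["mtag_pval"]) < threshold]


# In-memory stand-in for a tab-delimited trait file (made-up values).
example = io.StringIO(
    "snpid\tmtag_beta\tmtag_se\tmtag_z\tmtag_pval\n"
    "rs1\t0.02\t0.003\t6.67\t2.6e-11\n"
    "rs2\t0.001\t0.003\t0.33\t0.74\n"
)
hits = significant_hits(example)
```

For a real results file, pass an open file handle instead of the `io.StringIO` object.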

## MWE
```
sumstatsFiles=`echo ~/output/*hg19.snp_stats_original_columns.gz`
sos dryrun ~/project/UKBB_GWAS_dev/workflow/MTAG.ipynb mtag \
--cwd ~/output \
--sumstatsFiles $sumstatsFiles \
--formatFile ~/project/bioworkflows/GWAS/data/mtag_template.yml \
--ld_ref_panel ~/output/ldscores_ukbb_per_chrom/ \
--job_name 'f2247_f2257_combined'
```

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path
# Path to summary stats file
parameter: sumstatsFiles = paths('.')
# snpid column
parameter: snp_name = 'snpid'
# The effect allele
parameter: a1_name = 'a1'
# The non-effect allele
parameter: a2_name = 'a2'
# Frequency of the effect allele
parameter: freq = 'freq'
# Sample size for every SNP
parameter: n_name = 'n'
# Z-scores column
parameter: z_name = 'z'
# chromosome column
parameter: chr_name = 'chr'
# base pair position column
parameter: bpos_name = 'bpos'
# Specific number of threads to use
parameter: numThreads = 2
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# The container with the lmm software. Can be either a dockerhub image or a singularity `sif` file.
parameter: container_lmm = 'statisticalgenetics/lmm:3.0'
# Summary statistics format file path used for unifying input column names. Will not unify names if empty
parameter: formatFile = path('.')
# If the sumstatsfile has logP instead of P-val
parameter: reverse_log_p = True
# If the zscore needs to be calculated
parameter: z_score = True
# If there's no overlap between samples
parameter: no_overlap = False
# If the traits are perfectly correlated
parameter: perfect_gencov = False
# Assume equal heritability of traits
parameter: h2_equal = False
# Directory with the reference LD scores used by ldsc.py; they need to be split by chromosome
parameter: ld_ref_panel = str

In [None]:
[liftover]
# Path to the liftover workflow
parameter: liftover_pipeline = path
parameter: formatFile_liftover = path('.')
parameter: fr = 'hg38'
parameter: to = 'hg19'
parameter: container = ''
input: sumstatsFiles, group_by=1
output: f'{cwd}/{_input:bnn}.{to}.snp_stats.gz'
depends: formatFile_liftover
task: trunk_workers = 1, trunk_size = job_size, walltime = '2h', mem = '10G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "${ }",  stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    sos run ${liftover_pipeline} \
    --cwd ${cwd} \
    --input_file ${_input} \
    --output_file ${_input}.${to}.mtag \
    --fr ${fr} --to ${to} \
    --yaml_file ${formatFile_liftover} \
    --no-rename \
    --container ${container}

In [None]:
[mtag_1]
input: sumstatsFiles, group_by=1
output: f'{cwd}/{_input:bnn}.mtag.snp_stats'
depends: formatFile
task: trunk_workers = 1, trunk_size = job_size, walltime = '2h', mem = '10G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
python: expand = "${ }",  stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    import pandas as pd

    # unify output format
    if ${formatFile.is_file()} or ${reverse_log_p} or ${z_score}:
        sumstats = pd.read_csv(${_input:r}, compression='gzip', header=0, sep='\t', quotechar='"')
        if ${formatFile.is_file()}:
            import yaml
            # formatFile maps unified column names (keys) to input column names (values)
            with open(${formatFile:r}, 'r') as config_file:
                config = yaml.safe_load(config_file)
            try:
                sumstats = sumstats.loc[:, list(config.values())]
            except KeyError:
                raise ValueError(f'According to ${formatFile}, input summary statistics should have the following columns: {list(config.values())}.')
            sumstats.columns = list(config.keys())
        if ${reverse_log_p}:
            # convert -log10(P) back to P
            sumstats['pval'] = 10 ** -sumstats['pval']
        if ${z_score}:
            sumstats['z'] = sumstats['beta'] / sumstats['se']

        # build the snpid as chr:bpos:a2:a1
        sumstats[["chr", "bpos"]] = sumstats[["chr", "bpos"]].astype(str)
        sumstats["snpid"] = sumstats.chr.str.cat(others=[sumstats.bpos, sumstats.a2, sumstats.a1], sep=':')
        sumstats[["chr", "bpos"]] = sumstats[["chr", "bpos"]].astype(int)
        sumstats.to_csv(${_output:r}, sep='\t', header = True, index = False, na_rep='.')

In [None]:
[mtag_2]
parameter: job_name=''
input: group_by='all'
output: f'{cwd}/{job_name}_sigma_hat.mtag.txt',
        f'{cwd}/{job_name}_omega_hat.mtag.txt',
        [f'{cwd}/{job_name}.trait_{x}.mtag.txt' for x in range(len(sumstatsFiles))]
task: trunk_workers = 1, trunk_size = job_size, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output[1]:bn}'
bash: expand = "${ }", stderr = f'{_output[1]:n}.stderr', stdout = f'{_output[1]:n}.log'

    ~/conda/my-envs/py27/bin/python2.7 ~/project/mtag/mtag.py  \
    --sumstats ${','.join(['%s' % x for x in _input if x is not None])} \
    --snp_name ${snp_name} \
    --a1_name ${a1_name} \
    --a2_name ${a2_name} \
    --eaf_name ${freq} \
    --n_name ${n_name} \
    --z_name ${z_name} \
    --chr_name ${chr_name} \
    --bpos_name ${bpos_name} \
    --out ${cwd}/${job_name} \
    --n_min 0.0 \
    ${('--ld_ref_panel ' + ld_ref_panel + '/') } \
    ${'--no_overlap' if no_overlap else ''} \
    ${'--perfect_gencov' if perfect_gencov else ''} \
    ${'--equal_h2' if h2_equal else ''} \
    --force