# Enrichment analysis workflow for molecular QTL results

For the molecular QTL analysis results we obtained, we would like to know whether:

1. PIPs are higher on average in certain annotation groups than in the rest of the genome
2. Variants in a CS are enriched in certain annotation groups
    - Specifically, whether this enrichment also holds for the secondary CS that we capture

We focus only on results that have 1 or more CS identified.

In [None]:
%revisions -s -n 10

In [2]:
! sos run 20180712_Enrichment_Workflow.ipynb -h

usage: sos run 20180712_Enrichment_Workflow.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  extract_sumstats
  zscore2bed
  get_variants
  range2var_annotation
  pip_rank_test
  cs_fisher_test

Global Workflow Options:
  --y-data . (as path)
                        Y data, the phenotype file paths
  --trait VAL (required)
                        Trait name
  --cwd  path(f'{y_data:d}/{trait}_output')

                        Specify work / output directory
  --annotation-dir /home/gaow/Documents/GIT/LargeFiles/he_lab_annotations_bed (as path)
                        Path to directory of annotation
                        files
  --single-annot . (as path)
         

## Annotation input


Annotation files are in bed format (for example: `Coding_UCSC.bed`).

```
chr1    69090   70008
chr1    367658  368597
chr1    621095  622034
......
chr9    141121357       141121553
chr9    141124188       141124276
chr9    141134069       141134172
```

There are many annotations one can use. For this workflow one should prepare a list of annotations for input, like this:

```
# my annotation
E8_TCM_T_48h
E8_TCM_D_48h
NR2F2
PGR_Demayo
DNaseI
H3K27me3
H3K4me3
H3K27ac
FAIRE
```

The comment symbol `#` is allowed.
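As a rough illustration of how such a list maps to BED file paths, here is a minimal Python sketch. The function name `read_annotation_list` and the directory name `annotations` are hypothetical; the only behavior taken from the workflow is that the first field of each non-comment line is the annotation name and that `.bed` is appended to it.

```python
from pathlib import PurePosixPath

def read_annotation_list(list_text, annotation_dir):
    """Return one BED path per annotation named in the list text."""
    bed_files = []
    for line in list_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith("#"):
            continue
        # Only the first whitespace-delimited field is the annotation name
        name = line.split()[0]
        bed_files.append(str(PurePosixPath(annotation_dir) / f"{name}.bed"))
    return bed_files

example = """# my annotation
E8_TCM_T_48h
DNaseI

H3K27ac
"""
beds = read_annotation_list(example, "annotations")
print(beds)
```

Skipping blank lines explicitly avoids an index error when the list file ends with an empty line.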

## Data input

The data need to be formatted with the following columns:

```
Variant_ID PIP z_score CS_ID 
```

Here, `CS_ID` is 0 if the variant is not in any SuSiE CS, 1 if it is in CS 1, 2 if it is in CS 2, etc. `Variant_ID` carries the chromosome and position information, e.g. `rs10131831_chr14_20905250_G_A`.
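To make the format concrete, the sketch below parses one record in Python. It assumes tab-delimited fields (as written by the `extract_sumstats` step) and an ID in which the chromosome and position are the second and third underscore-separated fields. The function name `parse_sumstats_line` and the PIP and z-score values in the example are made up for illustration.

```python
def parse_sumstats_line(line):
    """Parse one 'Variant_ID PIP z_score CS_ID' record (tab-delimited)."""
    variant_id, pip, z_score, cs_id = line.rstrip("\n").split("\t")
    # Chromosome and position are the 2nd and 3rd fields of the variant ID
    parts = variant_id.split("_")
    return {
        "variant_id": variant_id,
        "chrom": parts[1],
        "pos": int(parts[2]),
        "pip": float(pip),
        "z_score": float(z_score),
        "cs_id": int(cs_id),
    }

# Illustrative values only, not taken from real results
record = parse_sumstats_line("rs10131831_chr14_20905250_G_A\t0.25\t4.1\t1\n")
print(record["chrom"], record["pos"], record["cs_id"] > 0)
```

An ID whose first field itself contains an underscore would break this positional split, so the assumption is worth checking on real data.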

In [None]:
[global]
# Y data, the phenotype file paths
parameter: y_data = path()
# Trait name
parameter: trait = None
# Specify work / output directory
parameter: cwd = path(f'{y_data:d}/{trait}_output')
# Path to directory of annotation files
parameter: annotation_dir = path('~/Documents/GIT/LargeFiles/he_lab_annotations_bed')
# Path to list of single annotations to use
parameter: single_annot = path() #parameter: single_annot = path("data/all_annotations.txt")
# Maximum distance to the site of interest, e.g. 100Kb or 1Mb up/downstream of the start site of the analysis unit
parameter: max_dist = 100000
fail_if(not y_data.is_file(), msg = 'Please provide valid ``--y-data``!')
z_score = path(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/enrichment/SuSiE_loci.sumstats.gz')
out_dir = f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/enrichment/{z_score:bn}'.replace('.', '_')
try:
    with open(single_annot) as f:
        # Skip blank lines and comment lines; the first field is the annotation name
        single_anno = [f"{annotation_dir}/{x.split()[0]}.bed" for x in f if x.strip() and not x.startswith('#')]
except (FileNotFoundError, IsADirectoryError):
    single_anno = []

## Prepare summary statistics file

```
sos run analysis/20180712_Enrichment_Workflow.ipynb extract_sumstats \
    --y-data ~/Documents/GIT/LargeFiles/JointLCL/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.txt.gz \
    --trait AS -j 16 
```

In [None]:
# Extract summary stats from RDS files to plain text
[extract_sumstats_1]
input: glob.glob(f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/SuSiE_CS_[1-9]/*.rds'), group_by = 1, concurrent = True
output: f'{_input:dd}/enrichment/{_input:bn}.sumstats.gz'
R: expand = '${ }'
    dat = readRDS(${_input:r})
    pip = dat$pip
    # Read the input data once and reuse it for both the variant names and the z-scores
    zscore = readRDS(dat$input)[[dat$idx]]$z_score
    names = names(zscore)
    cs_id = rep(0, length(pip))
    # seq_along is safe even if no CS is reported
    for (i in seq_along(dat$sets$cs)) {
        cs_id[dat$sets$cs[[i]]] = dat$sets$cs_index[i]
    }
    write.table(cbind(names,pip,zscore,cs_id), gzfile(${_output:r}), quote=FALSE, col.names=FALSE, row.names=FALSE, sep="\t")

# Consolidate results to one file
[extract_sumstats_2]
output: f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/enrichment/SuSiE_loci.sumstats.gz'
bash: expand = True
    zcat {_input} | gzip --best > {_output}
_input.zap()

```
[GW] zcat /home/gaow/GIT/LargeFiles/JointLCL/AS_output/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/SuSiE_loci.sumstats.gz | wc -l
1667666
[GW] zcat /home/gaow/GIT/LargeFiles/JointLCL/AS_output/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/SuSiE_loci.sumstats.gz | cut -f 4 | grep 1 | wc -l
46659
[GW] zcat /home/gaow/GIT/LargeFiles/JointLCL/AS_output/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/SuSiE_loci.sumstats.gz | cut -f 4 | grep 2 | wc -l
2212
[GW] zcat /home/gaow/GIT/LargeFiles/JointLCL/AS_output/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/SuSiE_loci.sumstats.gz | cut -f 4 | grep 3 | wc -l
152
```

So we have a total of 1667666 variants: 46659 in CS 1, 2212 in CS 2, and 152 in CS 3.
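The `grep` counts above match any line containing the digit, so `grep 1` would also count a CS ID such as 10 if one were present. A minimal Python sketch of an exact per-CS tally follows; the function name `count_by_cs` and the toy records are hypothetical, and for the real file the lines could be read with `gzip.open(path, "rt")`.

```python
from collections import Counter

def count_by_cs(lines):
    """Count variants per CS ID, comparing the 4th column as an integer."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        counts[int(fields[3])] += 1
    return counts

# Toy records: Variant_ID, PIP, z_score, CS_ID
toy = [
    "v1\t0.9\t5.1\t1",
    "v2\t0.01\t0.3\t0",
    "v3\t0.4\t3.2\t2",
    "v4\t0.5\t-4.0\t10",
    "v5\t0.6\t4.4\t1",
]
counts = count_by_cs(toy)
print(sorted(counts.items()))
```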

## Convert variants from summary statistics to bed format

In [None]:
# Auxiliary step to get variant in bed format based on variant ID in z-score file
[zscore2bed_1]
parameter: in_file = path()
parameter: chr_prefix = ""
input: in_file
output: f'{_input:n}.bed.unsorted'
R: expand = "${ }", docker_image = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
    library(readr)
    library(stringr)
    library(dplyr)
    var_file <- ${_input:r}
    out_file <- ${_output:r}

    # The sumstats file has no header line (it is written with col.names=FALSE)
    variants <- read_tsv(var_file, col_names=FALSE)
    colnames(variants) = c('variant', 'pip', 'zscore', 'cs')
    var_info <- str_split(variants$variant, "_")
    variants <- mutate(variants, chr = paste0("${chr_prefix}", sapply(var_info, function(x){x[2]})), 
                                 pos = sapply(var_info, function(x){x[3]})) %>%
                mutate(start = as.numeric(pos), stop=as.numeric(pos)  + 1) %>%
                select(chr, start, stop, variant)
    options(scipen=1000) # So that positions are always fully written out
    write.table(variants, file=out_file, quote=FALSE, col.names=FALSE, row.names=FALSE, sep="\t")

[zscore2bed_2]
output: f'{_input:n}'
bash: expand = True, docker_image = 'gaow/atac-gwas', workdir = cwd
     sort-bed {_input} > {_output}
_input.zap()

[get_variants: provides = '{data}.bed']
output: f'{data}.bed'
sos_run('zscore2bed', in_file = f'{_output:n}.gz')

## Apply range-based annotations

```
sos run analysis/20180712_Enrichment_Workflow.ipynb range2var_annotation \
    --y-data ~/Documents/GIT/LargeFiles/JointLCL/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.txt.gz \
    --trait AS -j 16 \
    --single-annot data/annotation.list 
```
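Conceptually, this step marks each variant with 1 if its position falls inside any interval of an annotation BED file and 0 otherwise. The Python sketch below illustrates that idea with a sorted-interval lookup. It is not how `bedops` is implemented, and the function names and toy variants are hypothetical. Variants are treated as the interval `[pos, pos + 1)`, the same convention as the `zscore2bed` step.

```python
from bisect import bisect_right
from collections import defaultdict

def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def annotate(variants, annotation):
    """variants: (chrom, pos, variant_id); annotation: (chrom, start, end).
    Returns {variant_id: 1 if the variant lies in an interval, else 0}."""
    by_chrom = defaultdict(list)
    for chrom, start, end in annotation:
        by_chrom[chrom].append((start, end))
    merged = {c: merge_intervals(iv) for c, iv in by_chrom.items()}
    starts = {c: [s for s, _ in iv] for c, iv in merged.items()}
    result = {}
    for chrom, pos, vid in variants:
        hit = 0
        if chrom in merged:
            # Rightmost merged interval that starts at or before pos
            i = bisect_right(starts[chrom], pos) - 1
            if i >= 0 and merged[chrom][i][1] > pos:
                hit = 1
        result[vid] = hit
    return result

annotation = [("chr1", 69090, 70008), ("chr1", 367658, 368597)]
variants = [("chr1", 69090, "a"), ("chr1", 70008, "b"),
            ("chr1", 70007, "c"), ("chr2", 100, "d")]
flags = annotate(variants, annotation)
print(flags)
```

Merging the intervals first means a single binary search per variant is enough, since after merging at most one interval can contain a given position.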

In [None]:
# Get variants in the data that fall in the target region
[range2var_annotation_1]
depends: f'{z_score:n}.bed'
input: set(paths(single_anno)), group_by = 1, concurrent = True
output: f'{out_dir}/{_input:bn}.{z_score:bn}.bed'
bash: expand = True, docker_image = 'gaow/atac-gwas', workdir = cwd, volumes = f'{annotation_dir}:{annotation_dir}'
    bedops -e {z_score:n}.bed {_input} > {_output}

In [None]:
# Make binary annotation file
[range2var_annotation_2]
depends: z_score
input: group_by = 1, concurrent = True
output: f'{_input:n}.gz'
R: expand = "${ }", docker_image = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
    library(readr)
    library(dplyr)
    library(stringr)

    variant_tsv <- ${z_score:r}
    annotation_var_bed <- ${_input:r}
    annot_name <- ${_input:bnr} %>% str_replace(paste0(".",${z_score:bnr}), "")
    out_name <- ${_output:r}

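    # The sumstats file has no header line, so do not treat its first row as column names
    vars <- read_tsv(variant_tsv, col_names=FALSE)[,1]
    annot_vars = read_tsv(annotation_var_bed, col_names=FALSE)
    names(vars) <- "SNP"
    vars <- vars %>%
            mutate(annot_d = case_when(SNP %in% annot_vars$X4 ~ 1,
                                       TRUE ~ 0))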
    names(vars)[2] <- annot_name
    write.table(vars, file=gzfile(out_name),
                col.names=TRUE, row.names=FALSE, sep="\t", quote=FALSE)

## Enrichment analysis

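Test whether PIPs differ between variants inside an annotation and variants outside it (two-sided Wilcoxon rank-sum test, with two randomly shuffled groupings as negative controls).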

```
sos run analysis/20180712_Enrichment_Workflow.ipynb pip_rank_test \
    --y-data ~/Documents/GIT/LargeFiles/JointLCL/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.txt.gz \
    --trait AS -j 16 \
    --single-annot data/annotation.list 
```
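For intuition about the rank test and its shuffled controls, here is a self-contained Python sketch of a two-sided rank-sum test using the normal approximation with a tie correction. It is an illustration only: it omits the continuity correction and exact small-sample computation that R's `wilcox.test` may apply, so p-values will not match R exactly. The function names and the toy PIP values are hypothetical.

```python
import math
import random

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(pip, group):
    """Two-sided rank-sum p-value (normal approximation, tie-corrected)."""
    ranks = average_ranks(pip)
    n1 = sum(group)
    n0 = len(group) - n1
    n = n0 + n1
    u1 = sum(r for r, g in zip(ranks, group) if g == 1) - n1 * (n1 + 1) / 2
    mean_u = n0 * n1 / 2
    tie_counts = {}
    for v in pip:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    var_u = n0 * n1 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))

# Toy data: 10 variants inside the annotation with high PIP, 10 outside with low PIP
pip = [0.50 + 0.05 * i for i in range(10)] + [0.01 + 0.01 * i for i in range(10)]
group = [1] * 10 + [0] * 10
p_real = rank_sum_test(pip, group)

# Negative control: shuffle the labels, keeping the group sizes fixed
rng = random.Random(1)
shuffled = group[:]
rng.shuffle(shuffled)
p_ctrl = rank_sum_test(pip, shuffled)
print(p_real, p_ctrl)
```

Shuffling the labels preserves the number of annotated variants, so the control p-values show what the test returns when the grouping carries no information.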

In [None]:
# Test if PIP is larger in annotations
[pip_rank_test_1]
depends: z_score
input_files = [f'{out_dir}/{value:bn}.{z_score:bn}.gz' for value in paths(single_anno)]
input: input_files, group_by = 1, concurrent = True
output: f'{_input:n}.{step_name}.csv'
R: expand = '${ }', docker_image = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
    set.seed(1)
    library(readr)
    library(dplyr)
    variants <- read_tsv(${z_score:r}, col_names=FALSE)
    colnames(variants) = c('SNP', 'PIP', 'Z', 'CS')
    annotation <- read_tsv(${_input:r}, col_names=TRUE)
    name = colnames(annotation)[2]
    colnames(annotation) = c('SNP', 'GROUP')
    # add two random groupings
    annotation$RAND_1 = sample(annotation$GROUP)
    annotation$RAND_2 = sample(annotation$GROUP)
    variants = inner_join(variants, annotation, by = "SNP")
    if (length(unique(variants$GROUP)) == 1) {
      write(paste(name, NA, NA, NA, sep=','), file = ${_output:r})
    } else {
    test = wilcox.test(PIP ~ GROUP, data=variants)
    c_1 = wilcox.test(PIP ~ RAND_1, data=variants)
    c_2 = wilcox.test(PIP ~ RAND_2, data=variants)
    write(paste(name, test$p.value, c_1$p.value, c_2$p.value, sep=','), file = ${_output:r})
    }

# Consolidate results to one file
[pip_rank_test_2, cs_fisher_test_2]
output: f'{cwd}/{y_data:bnn}_{int(max_dist/1000)}Kb/enrichment/SuSiE_loci.sumstats.{step_name}.csv'
bash: expand = True
    cat {_input} > {_output}
_input.zap()

In [12]:
dat = read.table('/home/gaow/Documents/GIT/LargeFiles/JointLCL/AS_output/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/enrichment/SuSiE_loci.sumstats.pip_rank_test_2.csv', head=F, sep=',')
colnames(dat) = c('annotation', 'enrichment', 'random_1', "random_2")
dat

| name | p-val |
|---|---|
| 1_GATA2-Interval-Track | 0.0001830602 |
| CN_range_anno | 8.382343e-16 |
| Coding_UCSC | 2.36765e-137 |
| Conserved_LindbladToh | 8.528926e-200 |
| CTCF_Hoffman | 0.0004655771 |
| DGF_ENCODE | 1.86832e-14 |
| DHS_peaks_Trynka | 0.1212949 |
| DNaseI | 3.625202e-190 |
| DN_range_anno | 5.706675e-61 |
| E8_TCM_D_48h | 2.373551e-59 |


Test whether variants in a CS are enriched for an annotation (Fisher's exact test), both for all CS and for non-first CS only.

```
sos run analysis/20180712_Enrichment_Workflow.ipynb cs_fisher_test \
    --y-data ~/Documents/GIT/LargeFiles/JointLCL/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF.txt.gz \
    --trait AS -j 16 \
    --single-annot data/annotation.list 
```
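The test behind this command can be pictured as a 2x2 table of "in a CS or not" against "in the annotation or not", built once for all CS and once after dropping variants in CS 1. The Python sketch below shows this on toy data with a plain two-sided Fisher's exact test. The function names and the toy vectors are hypothetical, and the direct use of `math.comb` is only practical for small counts, not for the full variant set.

```python
from math import comb

def table_counts(cs_ids, groups, drop_first_cs=False):
    """Return (a, b, c, d): in CS & annotated, in CS & not,
    outside CS & annotated, outside CS & not."""
    a = b = c = d = 0
    for cs, g in zip(cs_ids, groups):
        if drop_first_cs and cs == 1:
            continue  # secondary-CS test: ignore variants in CS 1
        in_cs = cs > 0
        if in_cs and g == 1:
            a += 1
        elif in_cs:
            b += 1
        elif g == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))

# Toy data: CS ID and annotation indicator per variant
cs_ids = [1, 1, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0]
groups = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
all_cs = table_counts(cs_ids, groups)
secondary = table_counts(cs_ids, groups, drop_first_cs=True)
print(all_cs, fisher_exact_two_sided(*all_cs))
print(secondary, fisher_exact_two_sided(*secondary))
```

Dropping CS 1 variants entirely, rather than relabeling them as outside a CS, keeps the comparison between secondary-CS variants and variants in no CS at all.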

In [None]:
# Test if CS is enriched with annotations
[cs_fisher_test_1]
depends: z_score
input_files = [f'{out_dir}/{value:bn}.{z_score:bn}.gz' for value in paths(single_anno)]
input: input_files, group_by = 1, concurrent = True
output: f'{_input:n}.{step_name}.csv'
R: expand = '${ }', docker_image = 'gaow/atac-gwas', workdir = cwd, stdout = f'{_output:n}.stdout'
    run_test = function(dat) {
        d1 = dat
        d1$CS[which(d1$CS>0)] = 1
        d1 = table(d1)
        test.d1 = fisher.test(d1)
        # test for non-first CS only
        d2 = dat[which(dat$CS != 1),]
        d2$CS[which(d2$CS>0)] = 1
        d2 = table(d2)
        test.d2 = fisher.test(d2)
        return(c(test.d1$p.value, test.d2$p.value))
    }
    set.seed(1)
    library(readr)
    library(dplyr)
    variants <- read_tsv(${z_score:r}, col_names=FALSE)
    colnames(variants) = c('SNP', 'PIP', 'Z', 'CS')
    annotation <- read_tsv(${_input:r}, col_names=TRUE)
    name = colnames(annotation)[2]
    colnames(annotation) = c('SNP', 'GROUP')
    # add two random groupings
    annotation$RAND_1 = sample(annotation$GROUP)
    annotation$RAND_2 = sample(annotation$GROUP)
    variants = inner_join(variants, annotation, by = "SNP")[,4:7]
    if (length(unique(variants$GROUP)) == 1) {
      write(paste(name, NA, NA, NA, NA, NA, NA, sep=','), file = ${_output:r})
    } else {
    # test against all CS
    test = run_test(variants[,c('CS', 'GROUP')])
    ctrl_1 = run_test(variants[,c('CS', 'RAND_1')])
    ctrl_2 = run_test(variants[,c('CS', 'RAND_2')])
    write(paste(name, test[1], test[2], ctrl_1[1], ctrl_1[2], ctrl_2[1], ctrl_2[2], sep=','), file = ${_output:r})
    }

In [14]:
dat = read.table('/home/gaow/Documents/GIT/LargeFiles/JointLCL/AS_output/fastqtl_qqnorm_ASintron_RNAseqGeuvadis_YangVCF_100Kb/enrichment/SuSiE_loci.sumstats.cs_fisher_test_2.csv', head=F, sep=',')
colnames(dat) = c('annotation', 'enrichment CS>0', 'enrichment CS>1', "random_1 CS>0", "random_1 CS>1", "random_2 CS>0", "random_2 CS>1")
dat

Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double()
)
“number of columns of result is not a multiple of vector length (arg 2)”
“1 parsing failure.
row col   expected  actual    file
 69 <NA>  4 columns 7 columns '/home/gaow/Documents/GIT/LargeFiles/JointLC…
”


X1,X2,X3,X4
1_GATA2-Interval-Track,8.212912e-01,0.38092063,0.61885253
1_GATA2-Interval-Track,1.098493e-01,0.77235991,0.51011592
CN_range_anno,1.228953e-08,0.51158775,0.95074335
CN_range_anno,2.621717e-01,0.50837502,0.54243696
Coding_UCSC,4.079106e-86,0.59766878,0.36280463
Coding_UCSC,1.883235e-14,0.03185144,0.13665816
Conserved_LindbladToh,1.502215e-71,0.10594412,0.24181969
Conserved_LindbladToh,5.028735e-09,0.29427964,0.64958592
CTCF_Hoffman,3.360994e-62,0.07425694,0.21765072
CTCF_Hoffman,1.477895e-04,0.58590582,0.95172204
