# Stratified LD Score Regression 
This notebook implements the [S-LDSC](https://github.com/bulik/ldsc/wiki) pipeline for LD score and functional enrichment analysis. It was written by Anmol Singh (singh.anmol@columbia.edu), with input from Dr. Gao Wang.

**FIXME: the initial draft is complete but pending Gao's review and documentation with minimal working example**

The pipeline integrates GWAS summary statistics, annotation data, and LD reference panel data to compute functional enrichment for each user-provided epigenomic annotation using the S-LDSC model. We start with an introduction, setup instructions, and minimal working examples; the workflow code, which can be run with SoS on any data, comes at the end.

## A brief review of Stratified LD score regression

Here I briefly review LD Score Regression and what it is used for. For more in depth information on LD Score Regression please read the following three papers:

1. "LD Score regression distinguishes confounding from polygenicity in genome-wide association studies" by Bulik-Sullivan et al (2015)

2. "Partitioning heritability by functional annotation using genome-wide association summary statistics" by Finucane et al (2015)

3. "Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection" by Gazal et al (2017)

As stated in Bulik-Sullivan et al (2015), both confounding factors and polygenic effects can inflate test statistics, and other methods cannot distinguish inflation due to confounding bias from a true signal. LD Score Regression (LDSC) is a technique that aims to separate the impact of confounding factors from that of polygenic effects using information from GWAS summary statistics.

This approach uses regression to measure the relationship between linkage disequilibrium (LD) scores and the test statistics of SNPs from GWAS summary statistics. Variants in LD with a "causal" variant show an elevation in test statistics in association analysis proportional to their LD (measured by $r^2$) with the causal variant within a certain window size (for example 1 cM or 1 kb). In contrast, inflation from confounders such as population stratification that arises purely from genetic drift will not correlate with LD. For a polygenic trait, SNPs with a high LD score will have more significant $\chi^2$ statistics on average than SNPs with a low LD score. Thus, if we regress the $\chi^2$ statistics from GWAS against LD score, the intercept minus one is an estimator of the mean contribution of confounding bias to the inflation in the test statistics. This regression model is known as LD Score regression.

### LDSC model

Under a polygenic assumption, in which effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$, where $p$ is the minor allele frequency (MAF), the expected $\chi^2$ statistic of variant $j$ is:

$$E[\chi^2|l_j] = Nh^2l_j/M + Na + 1 \quad (1)$$

where $N$ is the sample size; $M$ is the number of SNPs, such that $h^2/M$ is the average heritability explained per SNP; $a$ measures the contribution of confounding biases, such as cryptic relatedness and population stratification; and $l_j = \sum_k r^2_{jk}$ is the LD Score of variant $j$, which measures the amount of genetic variation tagged by $j$. A full derivation of this equation is provided in the Supplementary Note of Bulik-Sullivan et al (2015). An alternative derivation is provided in the Supplementary Note of Zhu and Stephens (2017) AoAS.
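
To make the definition $l_j = \sum_k r^2_{jk}$ concrete, here is a minimal sketch that computes LD scores from a toy genotype matrix. The function names, the toy data, and the distance-based window are illustrative assumptions; the `ldsc` software itself works on PLINK files and applies a bias correction to $r^2$ that is omitted here.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ld_scores(genotypes, positions, window):
    """LD score of each SNP: sum of r^2 with every SNP within `window`.

    genotypes[j] holds the allele counts of SNP j across individuals and
    positions[j] its map position (hypothetical units, e.g. cM).
    """
    scores = []
    for j, gj in enumerate(genotypes):
        total = 0.0
        for k, gk in enumerate(genotypes):
            if abs(positions[j] - positions[k]) <= window:
                total += pearson_r(gj, gk) ** 2  # includes k == j, where r^2 = 1
        scores.append(total)
    return scores


# Toy data: SNP 0 and SNP 1 are in perfect LD; SNP 2 lies outside the window.
genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 2, 0],
]
positions = [0.0, 0.1, 5.0]
scores = ld_scores(genotypes, positions, window=1.0)
print(scores)
```

In this toy example the first two SNPs tag each other completely, so each gets an LD score of 2, while the isolated third SNP only tags itself.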

From this we can see that LD Score regression can be used to compute SNP-based heritability for a phenotype or trait from GWAS summary statistics alone; unlike methods such as REML, it does not require individual-level genotype data.
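
As an illustration of Equation 1, the sketch below simulates $\chi^2$ statistics from the model and recovers $h^2$ and the confounding term with ordinary least squares. All numbers are made up for the simulation, and plain OLS is a simplification: the `ldsc` software uses a weighted regression with block-jackknife standard errors.

```python
import random


def ols(x, y):
    """Simple linear regression of y on x; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


random.seed(0)
N, M = 50_000, 1_000_000       # sample size and number of SNPs (assumed)
h2_true, a_true = 0.4, 2e-6    # so that N * a = 0.1

# Simulate LD scores and chi-square statistics following Equation 1 plus noise.
ld = [random.uniform(1, 200) for _ in range(5000)]
chi2 = [N * h2_true * l / M + N * a_true + 1 + random.gauss(0, 0.5) for l in ld]

slope, intercept = ols(ld, chi2)
h2_est = slope * M / N         # slope estimates N * h2 / M
confounding = intercept - 1    # intercept minus one estimates N * a

print(f"h2 estimate: {h2_est:.3f}, confounding (N*a): {confounding:.3f}")
```

The slope carries the heritability signal and the intercept carries the confounding signal, which is why the two sources of inflation can be separated.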

### Stratified LDSC

Heritability is the proportion of phenotypic variation ($V_P$) that is due to variation in genetic values ($V_G$), and thus tells us how much of the difference in observed phenotypes in a sample is due to genetic differences in the sample. It can also be partitioned over categories of SNPs for a phenotype or trait.

Partitioned heritability analysis, or Stratified LD Score Regression (S-LDSC), gains power by leveraging LD score information and by using SNPs that have not reached genome-wide significance to partition the heritability of a trait over categories, which many other methods do not do.


S-LDSC relies on the fact that the $\chi^2$ association statistic for a given SNP includes the effects of all SNPs tagged by this SNP. In a region of high LD, a given GWAS SNP therefore represents the effects of a group of SNPs in that region.

S-LDSC determines that a category of SNPs is enriched for heritability if SNPs with high LD to that category have more significant $\chi^2$ statistics than SNPs with low LD to that category.

Here, enrichment of a category is defined as the proportion of SNP heritability in the category divided by the proportion of SNPs in that category.
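
The definition of enrichment can be written as a one-line calculation. The function name and the numbers below are hypothetical and only illustrate the ratio.

```python
def enrichment(h2_category, h2_total, snps_category, snps_total):
    """Proportion of SNP heritability in a category divided by
    the proportion of SNPs in that category."""
    prop_h2 = h2_category / h2_total
    prop_snps = snps_category / snps_total
    return prop_h2 / prop_snps


# A category holding 5% of SNPs that explains 30% of the SNP heritability.
e = enrichment(h2_category=0.12, h2_total=0.40,
               snps_category=50_000, snps_total=1_000_000)
print(round(e, 2))
```

An enrichment above 1 means the category explains more heritability than expected from its share of SNPs.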

More precisely, under a polygenic model, the expected $\chi^2$ statistic of SNP $j$ is

$$E[\chi^2_j] = N\sum_C \tau_C l(j,C) + Na + 1 \quad (2)$$

where $N$ is the sample size, $C$ indexes categories, $l(j, C) = \sum_{k \in C} r^2_{jk}$ is the LD score of SNP $j$ with respect to category $C$, $a$ is a term that measures the contribution of confounding biases, and, if the categories are disjoint, $\tau_C$ is the per-SNP heritability in category $C$; if the categories overlap, then the per-SNP heritability of SNP $j$ is $\sum_{C: j \in C} \tau_C$. Equation 2 allows us to estimate $\tau_C$ via a (computationally simple) multiple regression of $\chi^2_j$ against $l(j, C)$, for either a quantitative or case-control study.
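
The multiple regression in Equation 2 can be sketched on simulated data with two overlapping categories (a base category containing all SNPs and a functional category). Everything here is an illustrative assumption: the simulated values, the helper names, and the use of unweighted least squares in place of the weighted regression that `ldsc` performs.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def least_squares(X, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


random.seed(1)
N = 50_000
tau_true = [4e-7, 2e-6]   # tau for the base and functional categories (assumed)
Na_true = 0.05            # confounding term N * a (assumed)

X, chi2 = [], []
for _ in range(3000):
    l_base = random.uniform(1, 200)              # LD score to all SNPs
    l_func = random.uniform(0, 0.3) * l_base     # LD score to the functional subset
    X.append([l_base, l_func, 1.0])              # last column is the intercept
    expected = N * (tau_true[0] * l_base + tau_true[1] * l_func) + Na_true + 1
    chi2.append(expected + random.gauss(0, 0.2))

coef = least_squares(X, chi2)
tau_est = [coef[0] / N, coef[1] / N]             # regression slopes estimate N * tau_C
print(tau_est, coef[2] - 1)
```

Because the categories overlap, a SNP in the functional category has per-SNP heritability `tau_est[0] + tau_est[1]` under this model.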

To see how these methods have been applied to real world data as well as a further discussion on methods and comparisons to other methods please read the three papers listed at the top of the document.

## Command Interface

In [116]:
!sos run LDSC_Code.ipynb -h

usage: sos run LDSC.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  make_annot
  munge_sumstats_no_sign
  munge_sumstats_sign
  calc_ld_score
  calc_enrichment

Sections
  make_annot:
    Workflow Options:
      --bed VAL (as str, required)
                        path to bed file
      --bim VAL (as str, required)
                        path to bim file
      --annot VAL (as str, required)
                        name of output annotation file
  munge_sumstats_no_sign: This option is for when the summary statistic file
                        does not contain a signed summary statistic (Z or Beta).
                        In this case,the program will calculate Z for you based

In [None]:
[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Name of the annotation; used as the prefix for the analysis output
parameter: annotation_name = str
parameter: annotation_file = path()
parameter: reference_anno_file = path()
# A genotype file in PLINK binary format (bed/bim/fam), or a list of genotype files per chromosome
parameter: genome_ref_file = path() # with .bed 
parameter: ldsc_path = path() #ldsc github
parameter: chromosome = []
parameter: snp_list = path()
parameter: ld_wind_cm = 1.0

parameter: all_traits_file = path()
parameter: brain_traits_file = path()
parameter: blood_traits_file = path()
# Directory containing GWAS summary statistics
parameter: sumstat_dir = path() #/mnt/vast/hpc/csg/xc2270/colocboost/post/SLDSC/sumstat
parameter: target_anno_dir = path()  # Directory containing target annotation files
parameter: baseline_ld_dir = path()  # Directory containing baseline LD score files
parameter: frqfile_dir = path()  # Directory containing allele frequency files
parameter: weights_dir = path()  # Directory containing LD weights

# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
# Container option for software to run the analysis: docker or singularity
parameter: container = ''
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""

import os
import pandas as pd
# Process input files

def adapt_file_path(file_path, reference_file):
    """
    Adapt a single file path based on its existence and a reference file's path.

    Args:
    - file_path (str): The file path to adapt.
    - reference_file (str): File path to use as a reference for adaptation.

    Returns:
    - str: Adapted file path.

    Raises:
    - FileNotFoundError: If no valid file path is found.
    """
    reference_path = os.path.dirname(reference_file)

    # Check if the file exists
    if os.path.isfile(file_path):
        return file_path

    # Check file name without path
    file_name = os.path.basename(file_path)
    if os.path.isfile(file_name):
        return file_name

    # Check file name in reference file's directory
    file_in_ref_dir = os.path.join(reference_path, file_name)
    if os.path.isfile(file_in_ref_dir):
        return file_in_ref_dir

    # Check original file path prefixed with reference file's directory
    file_prefixed = os.path.join(reference_path, file_path)
    if os.path.isfile(file_prefixed):
        return file_prefixed

    # If all checks fail, raise an error
    raise FileNotFoundError(f"No valid path found for file: {file_path}")

def adapt_file_path_all(df, column_name, reference_file):
    return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))
    
# Process input files based on file type
if str(annotation_file).endswith("rds") and str(reference_anno_file).endswith("annot.gz"):
    # Case 1: Direct file paths
    input_files = [[annotation_file, reference_anno_file, genome_ref_file]]
    if len(chromosome) > 0:
        input_chroms = [int(x) for x in chromosome]
    else:
        input_chroms = [0]
else:
    # Case 2: Files with #id and #path columns
    target_files = pd.read_csv(annotation_file, sep="\t")
    reference_files = pd.read_csv(reference_anno_file, sep="\t")
    genome_ref_files = pd.read_csv(genome_ref_file, sep="\t")
    
    # Standardize #id and adapt file paths
    target_files["#id"] = [x.replace("chr", "") for x in target_files["#id"].astype(str)]
    target_files["#path"] = target_files["#path"].apply(lambda x: adapt_file_path(x, annotation_file))
    
    reference_files["#id"] = [x.replace("chr", "") for x in reference_files["#id"].astype(str)]
    reference_files["#path"] = reference_files["#path"].apply(lambda x: adapt_file_path(x, reference_anno_file))
    
    genome_ref_files["#id"] = [x.replace("chr", "") for x in genome_ref_files["#id"].astype(str)]
    genome_ref_files["#path"] = genome_ref_files["#path"].apply(lambda x: adapt_file_path(x, genome_ref_file))
    
    # Merge the files based on #id
    input_files = target_files.merge(reference_files, on="#id").merge(genome_ref_files, on="#id")
    
    # Filter by specified chromosomes, if any
    if len(chromosome) > 0:
        input_files = input_files[input_files['#id'].isin(chromosome)]
    
    # Extract relevant columns as a list of file paths
    input_files = input_files.values.tolist()
    input_chroms = [x[0] for x in input_files]  # Chromosome IDs
    input_files = [x[1:] for x in input_files]  # File paths (annotation, reference, genome_ref)
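
The list-file input mode above expects tab-separated files with `#id` and `#path` columns, one row per chromosome. The sketch below mimics how three such files are joined on `#id`, using only the standard library. The file names are hypothetical, and path adaptation via `adapt_file_path` is left out.

```python
import csv
import io


def read_manifest(text):
    """Map chromosome id (with any 'chr' prefix removed) to its path."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["#id"].replace("chr", ""): row["#path"] for row in reader}


# Hypothetical manifests for the three inputs.
target = read_manifest("#id\t#path\nchr1\ttarget.chr1.rds\nchr2\ttarget.chr2.rds\n")
reference = read_manifest("#id\t#path\n1\tbaselineLD.1.annot.gz\n2\tbaselineLD.2.annot.gz\n")
genome = read_manifest("#id\t#path\nchr1\t1000G.1.bed\nchr3\t1000G.3.bed\n")

# Inner join on chromosome id: only chromosomes present in all three are kept.
input_chroms = [c for c in target if c in reference and c in genome]
input_files = [[target[c], reference[c], genome[c]] for c in input_chroms]

print(input_chroms)
print(input_files)
```

A chromosome missing from any one of the three files is silently dropped by the join, so it is worth checking that the resulting list covers every chromosome you expect.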

## Make Annotation File

In [93]:
[make_annot]

# Make Annotated Bed File

# path to bed file
parameter: bed = str 
#path to bim file
parameter: bim = str
#name of output annotation file
parameter: annot = str
bash: expand = True
    make_annot.py --bed-file {bed} --bimfile {bim} --annot-file {annot}
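
Conceptually, this step marks each SNP in the `.bim` file with 1 if it falls inside an interval of the BED file and 0 otherwise. The following is a minimal sketch of that idea, not the `make_annot.py` implementation. It assumes BED intervals are 0-based and half-open while `.bim` positions are 1-based, and the function name and data are made up.

```python
def annotate(bim_rows, bed_intervals):
    """Return a 0/1 annotation for each (chrom, bp) SNP.

    A SNP at 1-based position bp occupies the 0-based interval [bp - 1, bp),
    so it overlaps a BED interval [start, end) when start < bp <= end.
    """
    annot = []
    for chrom, bp in bim_rows:
        hit = any(c == chrom and start < bp <= end
                  for c, start, end in bed_intervals)
        annot.append(1 if hit else 0)
    return annot


bim_rows = [("1", 100), ("1", 101), ("1", 250), ("2", 150)]
bed_intervals = [("1", 100, 200)]
annot = annotate(bim_rows, bed_intervals)
print(annot)
```

Only the second SNP is annotated: the first sits just before the interval under the coordinate convention assumed here, the third lies beyond it, and the fourth is on a different chromosome.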

In [None]:
[make_annotation_files_ldscore]
# consider joint tau
parameter: chr_column = "CHR"  
parameter: score_column = 3    # Fixed score column for all annotations
parameter: joint_tau = False   # Whether to perform joint tau analysis
parameter: target_files = []   # List of target annotation files (used when joint_tau is True)
input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output: dict([
    ('annot', f'{cwd:a}/{annotation_name}/{annotation_name}.{input_chroms[_index]}.annot.gz'),
    ('ldscore', f'{cwd:a}/{annotation_name}/{annotation_name}.{input_chroms[_index]}.l2.ldscore.gz')
])
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output["annot"]}.stderr', stdout = f'{_output["annot"]}.stdout', container = container, entrypoint = entrypoint
    library(data.table)
    ref_anno <- fread(${_input[1]:ar})
    ref_anno <- ref_anno[,-5]  # Remove the last column
    
    chr_value = unique(ref_anno$CHR)
    
    if(${"TRUE" if joint_tau else "FALSE"}) {
        # For joint tau analysis
        joint_anno <- matrix(0, nrow=nrow(ref_anno), ncol=length(${target_files}))
        colnames(joint_anno) <- paste0("anno", 1:length(${target_files}))
        
        for(i in 1:length(${target_files})) {
            target_anno <- readRDS(${target_files}[i])
            pos <- which(target_anno$chr_num == chr_value)
            pp <- match(target_anno$pos, ref_anno$BP)
            pp1 <- as.numeric(na.omit(pp))
            joint_anno[pp1,i] <- target_anno[[${score_column}]][!is.na(pp)]
        }
        
        result_anno <- cbind(ref_anno, as.data.frame(joint_anno))
    } else {
        # Single annotation analysis
        target_anno <- readRDS(${_input[0]:ar})
        anno_scores <- rep(0, nrow(ref_anno))
        pos <- which(target_anno$chr_num == chr_value)
        pp <- match(target_anno$pos, ref_anno$BP)
        pp1 <- as.numeric(na.omit(pp))
        anno_scores[pp1] <- target_anno[[${score_column}]][!is.na(pp)]
        result_anno <- ref_anno
        result_anno$ANNOT <- anno_scores
    }
    
    fwrite(result_anno, ${_output["annot"]:nr}, 
           quote=FALSE, col.names=TRUE, row.names=FALSE, sep="\t")

bash: expand= "$[ ]", stderr = f'{_output["annot"]:nnn}.stderr', stdout = f'{_output["annot"]:nnn}.stdout', container = container, entrypoint = entrypoint
    gzip -f $[_output["annot"]:n]     

bash: expand="${ }", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout'
    ldsc \
        --print-snps ${snp_list} \
        --ld-wind-cm ${ld_wind_cm} \
        --out ${_output["ldscore"]:nnn} \
        --bfile ${_input[2]:nar} \
        --yes-really \
        --annot ${_output[0]:a} \
        --l2

## Munge Summary Statistics (Option 1: No Signed Summary Statistic)

In [None]:
# This option is for when the summary statistic file does not contain a signed summary statistic (Z or Beta).
# In this case, the program will calculate Z for you based on A1 being the risk allele
[munge_sumstats_no_sign]



#path to summary statistic file
parameter: sumst = str
#path to Hapmap3 SNPs file, keep all columns (SNP, A1, and A2) for the munge_sumstats program
parameter: alleles = "w_hm3.snplist"
#path to output file
parameter: output = str

bash: expand = True
    munge_sumstats.py --sumstats {sumst} --merge-alleles {alleles} --out {output} --a1-inc

## Munge Summary Statistics (Option 2: Signed Summary Statistic)

In [None]:
# This option is for when the summary statistic file does contain a signed summary statistic (Z or Beta)
[munge_sumstats_sign]



#path to summary statistic file
parameter: sumst = str
#path to Hapmap3 SNPs file, keep all columns (SNP, A1, and A2) for the munge_sumstats program
parameter: alleles = "w_hm3.snplist"
#path to output file
parameter: output = str

bash: expand = True
    munge_sumstats.py --sumstats {sumst} --merge-alleles {alleles} --out {output}
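
The difference between the two munging options comes down to where the sign of Z comes from. The sketch below shows the idea with a hypothetical helper: the magnitude of Z is derived from the two-sided P-value, and the sign is either fixed to positive (A1 assumed to be the trait-increasing allele, as with `--a1-inc`) or taken from a signed statistic such as Beta. It illustrates the arithmetic only and is not the `munge_sumstats.py` code.

```python
from statistics import NormalDist


def z_from_p(p, sign=1):
    """Convert a two-sided P-value into a Z score with the given sign."""
    magnitude = -NormalDist().inv_cdf(p / 2)  # lower tail avoids 1 - p/2 rounding to 1
    return sign * magnitude


# Option 1: no signed statistic, A1 assumed to increase the trait, so Z > 0.
z_unsigned = z_from_p(0.05)

# Option 2: the sign is taken from a signed statistic, e.g. a negative Beta.
beta = -0.12
z_signed = z_from_p(0.05, sign=1 if beta >= 0 else -1)

print(round(z_unsigned, 3), round(z_signed, 3))
```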

## Calculate LD Scores

**Make sure to delete the SNP, CHR, and BP columns from annotation files if they are present; otherwise this code will not work. If these columns are present, make sure that the annotation file is sorted before deleting them.**

In [None]:
# Calculate LD Scores
# Make sure to delete the SNP, CHR, and BP columns from annotation files if they are present; otherwise this code will not work.
# If these columns are present, make sure that the annotation file is sorted before deleting them.
[calc_ld_score]

# Path prefix of the PLINK bed/bim/fam fileset (passed to --bfile)
parameter: bim = str
#Path to annotation File. Make sure to remove the SNP, CHR, and BP columns from the annotation file if present before running.
parameter: annot_file = str
#name of output file
parameter: output = str
#path to Hapmap3 SNPs file, remove the A1 and A2 columns for the Calculate LD Scores program 
parameter: snplist = "w_hm3.snplist"

bash: expand = True
    ldsc.py --bfile {bim} --l2 --ld-wind-cm 1 --annot {annot_file} --thin-annot --out {output} --print-snps {snplist}
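
Because the step above runs with `--thin-annot`, the annotation file must contain annotation columns only. Below is a minimal sketch of the preparation described in the warning: sort the rows, then drop the identifier columns. It assumes the `.bim` file is ordered by chromosome and position, treats `CM` as an identifier column as well, and uses made-up rows; adapt the column names to your files.

```python
def to_thin_annot(rows, id_columns=("CHR", "BP", "SNP", "CM")):
    """Sort annotation rows by chromosome and position, then drop id columns."""
    ordered = sorted(rows, key=lambda r: (int(r["CHR"]), int(r["BP"])))
    return [{k: v for k, v in r.items() if k not in id_columns} for r in ordered]


rows = [
    {"CHR": "1", "BP": "200", "SNP": "rs2", "CM": "0.2", "ANNOT": "1"},
    {"CHR": "1", "BP": "100", "SNP": "rs1", "CM": "0.1", "ANNOT": "0"},
]
thin = to_thin_annot(rows)
print(thin)
```

Sorting has to happen before the identifier columns are removed, since afterwards there is nothing left to align the rows with the SNPs in the genotype files.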

## Calculate Functional Enrichment using Annotations

In [None]:
#Calculate Enrichment Scores for Functional Annotations
[calc_enrichment]

#Path to Summary statistics File
parameter: sumstats = str
#Path to Reference LD Scores Files (Base Annotation + Annotation you want to analyze, format like minimal working example)
parameter: ref_ld = str
#Path to LD Weight Files (Format like minimal working example)
parameter: w_ld = str
#path to frequency files (Format like minimal working example)
parameter: frq_file = str
#Output name
parameter: output = str

bash: expand = True
    ldsc.py --h2 {sumstats} --ref-ld-chr {ref_ld} --w-ld-chr {w_ld} --overlap-annot --frqfile-chr {frq_file} --out {output}

In [None]:
[get_heritability]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads

bash: expand = "${ }"
    while read -r trait; do
        ldsc \
            --h2 ${sumstat_dir}/$trait \
            --ref-ld-chr ${target_anno_dir}/${annotation_name}.,${baseline_ld_dir}/baselineLD. \
            --out ${cwd}/$trait \
            --overlap-annot \
            --frqfile-chr ${frqfile_dir}/1000G.EUR.hg38. \
            --w-ld-chr ${weights_dir}/weights.hm3_noMHC. \
            --print-coefficients \
            --print-delete-vals
    done < ${all_traits_file}
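
The `--print-delete-vals` option matters for the meta-analysis step, which turns the per-block jackknife estimates into a standard error and rescales the coefficient by the annotation's standard deviation, `Mref`, and the trait heritability. The sketch below restates those two calculations in Python under the same formulas used in the R code (`sqrt(199^2/200 * var(sc))` generalized to `n` blocks, and `tau * sd_annot * Mref / h2g`). Function names and the toy numbers are assumptions.

```python
from statistics import mean, variance


def jackknife_se(delete_values):
    """Jackknife standard error from leave-one-block-out estimates.

    Equals sqrt((n - 1) / n * sum((x - mean)^2)), written with the
    sample variance as sqrt((n - 1)^2 / n * var).
    """
    n = len(delete_values)
    return ((n - 1) ** 2 / n * variance(delete_values)) ** 0.5


def standardize_tau(tau, sd_annot, m_ref, h2g):
    """Rescale a per-SNP coefficient to the standardized effect size."""
    return tau * sd_annot * m_ref / h2g


# Sanity check on a toy case: the leave-one-out means of [1, 2, 3, 4, 5]
# should give the usual standard error of the mean.
data = [1, 2, 3, 4, 5]
delete_means = [(sum(data) - x) / (len(data) - 1) for x in data]
se = jackknife_se(delete_means)
print(round(mean(delete_means), 3), round(se, 4))

tau_star = standardize_tau(tau=1e-8, sd_annot=0.2, m_ref=5961159, h2g=0.3)
print(tau_star)
```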

In [None]:
[meta_analysis]
parameter: trait_group_paths = []      # List of paths to trait group files
parameter: trait_group_names = []      # List of names for each group
parameter: annot_cell = str           # Root path for annotation files
parameter: results_cell = str         # Root path for results files
parameter: annot_name = str           # Annotation name
parameter: joint_tau = False          # Whether to use joint tau analysis
parameter: annot_index = None         # Index in results (None for auto-detection)
parameter: base_index = None          # Number of baseline annotations (None for auto-detection)
parameter: base_path = path           # Path to baseline files (needed for joint tau)
output: [
    f'{cwd}/{step_name}/single_tau_{annot_cell}_{annot_name}.rds' if not joint_tau else f'{cwd}/{step_name}/{annot_cell}_{annot_name}.rds',
    f'{cwd}/{step_name}/enrichment_{annot_cell}_{annot_name}.rds' if not joint_tau else None
]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads

R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
    library(data.table)
    library(rmeta)
    
    # Function for single tau analysis
    get_sd_annot = function(cell_path, annot_index = 1, flag = 0) {
        if(flag == 0 && file.exists(paste0(cell_path, "/", "sd_annot_", annot_index, ".rda"))) {
            sd_annot = get(load(paste0(cell_path, "/", "sd_annot_", annot_index, ".rda")))
            return(sd_annot)
        }
        
        num = 0
        den = 0
        ll <- list.files(cell_path, pattern = ".annot.gz")
        for(m in 1:length(ll)) {
            dat <- data.frame(fread(paste0(cell_path, "/", ll[m])))
            num = num + (nrow(dat)-1) * var(dat[,4+annot_index])
            den = den + (nrow(dat)-1)
        }
        
        estd_sd_annot = sqrt(num/den)
        save(estd_sd_annot, file = paste0(cell_path, "/", "sd_annot_", annot_index, ".rda"))
        return(estd_sd_annot)
    }
    
    # Function for joint tau analysis
    get_sd_annot_joint = function(cell_path, annot_index = 1, base_path, flag = 0) {
        if(flag == 0) {
            sd_annot = rep(0, length(annot_index))
            for(i in 1:length(annot_index)) {
                if(file.exists(paste0(cell_path, "/", "sd_annot_", annot_index[i], ".rda"))) {
                    sd_annot[i] = as.numeric(get(load(paste0(cell_path, "/", "sd_annot_", annot_index[i], ".rda"))))
                } else {
                    flag = 1
                    break
                }
            }
            if(flag == 0) return(sd_annot)
        }
        
        num = rep(0, length(annot_index))
        den = rep(0, length(annot_index))
        ll <- list.files(cell_path, pattern = ".annot.gz")
        ordering = c(1, 10:19, 2, 20:22, 3:9)
        
        for(m in 1:length(ll)) {
            dat <- data.frame(fread(paste0(cell_path, "/", ll[m])))
            base <- data.frame(fread(paste0(base_path, "/", "baselineLD.", ordering[m], ".annot.gz")))
            pooled_dat <- cbind(dat[,-(1:4)], base[,-(1:4)])
            num = num + (nrow(pooled_dat)-1) * apply(pooled_dat[,annot_index], 2, var)
            den = den + (nrow(pooled_dat)-1)
            rm(pooled_dat)
        }
        
        sd_annot = sqrt(num/den)
        for(i in 1:length(annot_index)) {
            temp = sd_annot[i]
            save(temp, file = paste0(cell_path, "/", "sd_annot_", annot_index[i], ".rda"))
        }
        return(sd_annot)
    }
    
    # Original single tau analysis
    run_single_tau_analysis = function(annot_cell, results_cell, annotations, traits,
                                     index_in_results=1, base_index = NULL, flag = 1) {
        if(is.null(base_index)) base_index = index_in_results
        tau_star_table = matrix(0, length(annotations), 3)
        
        for(annot_id in 1:length(annotations)) {
            cell_path = paste0(annot_cell, "/", annotations[annot_id])
            sd_annot1 = get_sd_annot(cell_path, annot_index=index_in_results, flag = flag)
            Mref = 5961159
            df = c()
            
            for(trait_id in 1:length(traits)) {
                result.file = paste0(results_cell, "/", annotations[annot_id], "/", 
                                   traits[trait_id], ".sumstats.gz.part_delete")
                new_table = read.table(result.file, header=F)
                logfile = paste0(results_cell, "/", annotations[annot_id], "/", 
                               traits[trait_id], ".sumstats.gz.log")
                log = read.table(logfile, h=F, fill=T)
                h2g = as.numeric(as.character(log[which(log$V4=="h2:"), 5]))
                coef1 = sd_annot1 * Mref/h2g
                sc = sapply(1:nrow(new_table), function(i) {
                    tau1 = as.numeric(new_table[i, base_index])
                    tau1 * coef1
                })
                mean_sc = mean(sc)
                se_sc = sqrt(199^2/200 * var(sc))
                df = rbind(df, c(mean_sc, se_sc))
            }
            
            test_tauj = meta.summaries(df[,1], df[,2], method="random")
            tau = test_tauj$summary
            tau_se = test_tauj$se.summary
            z = tau/tau_se
            tau_star_table[annot_id, ] = c(tau, tau_se, 2*pnorm(-abs(z)))
        }
        rownames(tau_star_table) = annotations
        return(tau_star_table)
    }
    
    # Original enrichment analysis
    run_single_enrichment_analysis = function(annot_cell, results_cell, annotation, traits,
                                            index_in_results=1) {
        enrich_table = matrix(0, length(index_in_results), 3)
        cell_path = paste0(annot_cell, "/", annotation)
        res = paste0(results_cell, "/", annotation, "/", traits[1], ".sumstats.gz.results")
        tab2 = read.table(res, header=T)
        annot_names = as.character(tab2$Category[index_in_results])
        Mref = 5961159
        
        for(id in 1:length(index_in_results)) {
            meta_enr = NULL
            meta_enrstat = NULL
            
            for(trait_id in 1:length(traits)) {
                result.file = paste0(results_cell, "/", annotation, "/", 
                                   traits[trait_id], ".sumstats.gz.results")
                res = read.table(result.file, header=T)
                logfile = paste0(results_cell, "/", annotation, "/", 
                               traits[trait_id], ".sumstats.gz.log")
                log = read.table(logfile, h=F, fill=T)
                h2g = as.numeric(as.character(log[which(log$V4=="h2:"), 5]))
                
                myenrstat = (h2g/Mref)*((res[index_in_results[id],3]/res[index_in_results[id],2])-
                                      (1-res[index_in_results[id],3])/(1-res[index_in_results[id],2]))
                myenrstat_z = qnorm(res[index_in_results[id],7]/2)
                myenrstat_sd = myenrstat/myenrstat_z
                meta_enrstat = rbind(meta_enrstat, c(myenrstat, myenrstat_sd))
                meta_enr = rbind(meta_enr, c(res[index_in_results[id],5], 
                                           res[index_in_results[id],6]))
            }
            
            test_eni1 = meta.summaries(meta_enr[,1], meta_enr[,2], method="random")
            test_eni2 = meta.summaries(meta_enrstat[,1], meta_enrstat[,2], method="random")
            
            enrich_table[id, ] = c(test_eni1$summary, test_eni1$se.summary,
                                2*pnorm(-abs(test_eni2$summary/test_eni2$se.summary)))
        }
        rownames(enrich_table) = annot_names
        return(enrich_table)
    }
    
    # Joint tau analysis
    run_many_tau_analysis = function(annot_cell, results_cell, base_path, annotation, traits,
                                   index_in_results=NULL, base_index = NULL, flag = 1) {
        base <- data.frame(fread(paste0(base_path, "/", "baselineLD.", 22, ".annot.gz")))
        if(is.null(base_index)) base_index = ncol(base) - 4
        
        cell_path = paste0(annot_cell, "/", annotation)
        res = paste0(results_cell, "/", annotation, "/", traits[1], ".sumstats.gz.results")
        tab2 = read.table(res, header=T)
        
        if(is.null(index_in_results)) index_in_results = 1:(nrow(tab2) - base_index)
        tau_star_table = matrix(0, length(index_in_results), 3)
        annot_names = as.character(tab2$Category[index_in_results])
        
        sd_annot = get_sd_annot_joint(cell_path, annot_index=index_in_results, 
                                     base_path=base_path, flag=flag)
        
        for(id in 1:length(index_in_results)) {
            sd_annot1 = sd_annot[id]
            Mref = 5961159
            df = c()
            
            for(trait_id in 1:length(traits)) {
                result.file = paste0(results_cell, "/", annotation, "/", 
                                   traits[trait_id], ".sumstats.gz.part_delete")
                new_table = read.table(result.file, header=F)
                logfile = paste(results_cell, "/", annotation, "/", 
                              traits[trait_id], ".sumstats.gz.log", sep="")
                log = read.table(logfile, h=F, fill=T)
                h2g = as.numeric(as.character(log[which(log$V4=="h2:"), 5]))
                
                coef1 = sd_annot1 * Mref/h2g
                sc = sapply(1:dim(new_table)[1], function(i) {
                    tau1 = as.numeric(new_table[i, index_in_results[id]])
                    tau1 * coef1
                })
                
                mean_sc = mean(sc)
                se_sc = sqrt(199^2/200 * var(sc))
                df = rbind(df, c(mean_sc, se_sc))
            }
            
            test_tauj = meta.summaries(df[,1], df[,2], method="random")
            tau = test_tauj$summary
            tau_se = test_tauj$se.summary
            z = tau/tau_se
            tau_star_table[id,] = c(tau, tau_se, 2*pnorm(-abs(z)))
        }
        
        rownames(tau_star_table) = annot_names
        return(tau_star_table)
    }
    
    # Process groups
    group_paths = strsplit("${trait_group_paths}", " ")[[1]]
    group_names = strsplit("${trait_group_names}", " ")[[1]]
    
    if(${"TRUE" if joint_tau else "FALSE"}) {
        # Joint tau analysis
        results = list()
        for(i in 1:length(group_paths)) {
            traits = read.table(group_paths[i])[[1]]
            traits = sapply(traits, function(x) return(strsplit(x, ".sumstats")[[1]][1]))
            
            results[[group_names[i]]] = run_many_tau_analysis(
                annot_cell = "${annot_cell}",
                results_cell = "${results_cell}",
                base_path = "${base_path}",
                annotation = "${annot_name}",
                traits = traits,
                index_in_results = ${annot_index if annot_index is not None else "NULL"},
                base_index = ${base_index if base_index is not None else "NULL"},
                flag = 0
            )
        }
        saveRDS(results, '${_output[0]}', compress='xz')
        
    } else {
        # Single tau and enrichment analysis
        tau_results = list()
        enrichment_results = list()
        
        for(i in 1:length(group_paths)) {
            traits = read.table(group_paths[i])[[1]]
            traits = sapply(traits, function(x) return(strsplit(x, ".sumstats")[[1]][1]))
            
            tau_results[[i]] = run_single_tau_analysis(
                "${annot_cell}", "${results_cell}", 
                "${annot_name}", traits, 
                index_in_results=${annot_index if annot_index is not None else 1}
            )
            
            enrichment_results[[i]] = run_single_enrichment_analysis(
                "${annot_cell}", "${results_cell}", 
                "${annot_name}", traits,
                index_in_results=${annot_index if annot_index is not None else 1}
            )
        }
        
        # Save single tau results
        tau_df = do.call(rbind.data.frame, tau_results)
        rownames(tau_df) = group_names
        colnames(tau_df) = c("Mean", "SD", "P")
        saveRDS(tau_df, '${_output[0]}', compress='xz')
        
        # Save enrichment results
        enrichment_df = do.call(rbind.data.frame, enrichment_results)
        rownames(enrichment_df) = group_names
        colnames(enrichment_df) = c("Mean", "SD", "P")
        saveRDS(enrichment_df, '${_output[1]}', compress='xz')
    }
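
The per-trait estimates are combined across traits with `meta.summaries(..., method="random")`. For readers unfamiliar with random-effects meta-analysis, here is a minimal sketch of a DerSimonian-Laird estimator, which is assumed here to be the kind of calculation that call performs; check the `rmeta` documentation before relying on exact agreement. The function name and inputs are illustrative.

```python
from statistics import NormalDist


def random_effects_meta(estimates, ses):
    """DerSimonian-Laird random-effects summary of per-trait estimates.

    Returns (summary, se_summary, tau2, p_value).
    """
    w = [1 / s ** 2 for s in ses]
    fixed = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))
    df = len(estimates) - 1
    # Between-trait heterogeneity variance, truncated at zero.
    tau2 = max(0.0, (q - df) / (sum(w) - sum(wi ** 2 for wi in w) / sum(w)))
    w_star = [1 / (s ** 2 + tau2) for s in ses]
    summary = sum(wi * y for wi, y in zip(w_star, estimates)) / sum(w_star)
    se_summary = (1 / sum(w_star)) ** 0.5
    z = summary / se_summary
    p_value = 2 * NormalDist().cdf(-abs(z))
    return summary, se_summary, tau2, p_value


# Homogeneous estimates: heterogeneity is estimated as (close to) zero.
homog = random_effects_meta([0.1, 0.2, 0.3], [0.1, 0.1, 0.1])

# Heterogeneous estimates: the summary standard error widens accordingly.
heterog = random_effects_meta([0.0, 1.0], [0.1, 0.1])

print(homog)
print(heterog)
```

When the per-trait estimates disagree by more than their standard errors suggest, the heterogeneity variance grows and the combined estimate becomes less certain, which is the reason to prefer a random-effects summary across traits.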