# Multi-trait colocalization using ColocBoost

## Description

This notebook performs colocalization of multiple xQTL in a given region, with or without GWAS summary statistics. The current protocol (Feb 2025) supports individual-level xQTL data from the same cohort (multiple phenotypes, same genotype), along with summary statistics and companion LD reference data.

## Input

1. A list of regions to be analyzed (optional); the last column of this file should be region name.
2. Either a list of per-chromosome genotype files, or one file with genotype data for the entire genome. Genotype data must be in PLINK `bed` format.
3. A vector of lists of phenotype files per region to be analyzed, in UCSC `bed.gz` format with an index in `bed.gz.tbi` format.
4. A vector of covariate files corresponding to the lists above.
5. A customized association-window file for variants (cis or trans). If it is not provided, a fixed-size window around the region (a cis-window) will be used.
6. Optionally, a vector of names of the phenotypic conditions, separated by whitespace, in the form `cond1 cond2 cond3`.
7. Optionally, a summary statistics metadata file and an LD reference metadata file.

Inputs 2 and 3 should be outputs from the `genotype_per_region` and `annotate_coord` modules in previous preprocessing steps. Input 4 should be the output of the `covariate_preprocessing` pipeline, which contains genotype PCs, phenotypic hidden confounders and fixed covariates.
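
When no customized association-window file is given (input 5), the fixed-size cis-window can be thought of as a symmetric extension of the phenotype region, clipped at position zero. The following is a minimal sketch of that arithmetic, mirroring the `cis_window` handling in the workflow code further down. The helper name `cis_association_window` is hypothetical and is not part of the pipeline.

```python
def cis_association_window(start, end, cis_window):
    """Return (start, end) of the association window around a phenotype region.

    Hypothetical helper for illustration only: it extends the region by
    `cis_window` bp on both sides and never lets the start go below zero.
    """
    if cis_window < 0:
        # A negative value means "rely on a customized window file instead"
        raise ValueError("cis_window must be non-negative when no window file is given")
    return max(start - cis_window, 0), end + cis_window


# A region at chr12:752578-752579 with a 100 kb radius
print(cis_association_window(752578, 752579, 100000))  # (652578, 852579)
# A region close to the chromosome start is clipped at zero
print(cis_association_window(50, 200, 1000))  # (0, 1200)
```

With `cis_window = 0` the window collapses to the phenotype region itself, so only variants inside the region would be considered.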

### Example genotype, phenotype and association analysis windows 

See [this page](mnm_regression) for examples of these inputs.

### Example summary statistics and LD reference

See [this page](rss_analysis) for examples of these inputs.

### About indels

Option `--no-indel` removes indels from the analysis. FIXME: Gao needs to provide more guidelines on how to deal with indels in practice.

## Output

For each analysis region, the output consists of the various fitted ColocBoost models, saved in RDS format.

## Minimal Working Example Steps

Timing [FIXME]

Below we duplicate the examples for phenotype and covariates to demonstrate that when there are multiple phenotypes for the same genotype it is possible to use this pipeline to analyze all of them (more than two is accepted as well).

Here we use `--region-name` to focus the analysis on 3 genes. In practice, if this parameter is dropped, the union of all regions in all phenotype region lists will be analyzed. Some of these regions may have no genotype data, in which case the pipeline will output RDS files with a warning message indicating that there is no genotype data to analyze.
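
The selection rule described above can be sketched in a few lines: an empty request keeps every available region, while a non-empty request keeps only the regions that are actually present. This is an illustration under simplified assumptions (plain lists of IDs instead of the phenotype tables the workflow uses), and `select_regions` is a hypothetical name.

```python
def select_regions(available_ids, requested_ids):
    """Keep the requested regions; with no request, keep every available region.

    Hypothetical helper for illustration: `available_ids` stands in for the
    region IDs found in the phenotype region lists.
    """
    if not requested_ids:
        return list(available_ids)
    requested = set(requested_ids)
    # Preserve the order of the available regions
    return [rid for rid in available_ids if rid in requested]


available = ["gene_A", "gene_B", "gene_C"]
print(select_regions(available, []))                    # ['gene_A', 'gene_B', 'gene_C']
print(select_regions(available, ["gene_B", "gene_X"]))  # ['gene_B']
```

Requested IDs that are absent from the phenotype lists (here `gene_X`) simply drop out of the selection.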

**Note:** The suggested output naming convention is `cohort_modality`, e.g. `ROSMAP_snRNA_pseudobulk`.

In [None]:
sos run pipeline/colocboost.ipynb colocboost  \
    --name protocol_example_protein  \
    --genoFile input/xqtl_association/protocol_example.genotype.chr21_22.bed   \
    --phenoFile output/phenotype/protocol_example.protein.region_list.txt \
                output/phenotype/protocol_example.protein.region_list.txt \
    --covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
              output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz  \
    --customized-association-windows input/xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --region-name ENSG00000241973_P42356 ENSG00000160209_O00764 ENSG00000100412_Q99798 \
    --phenotype-names trait_A trait_B

It is also possible to analyze a selected list of regions using the `--region-list` option. The last column of this file is used as the list of regions to analyze. Here, for example, we use the same list of regions as the one used for the customized association windows:

In [None]:
sos run xqtl-protocol/pipeline/colocboost.ipynb colocboost  \
    --name protocol_example_protein  \
    --genoFile xqtl_association/protocol_example.genotype.chr21_22.bed   \
    --phenoFile output/phenotype/protocol_example.protein.region_list.txt \
                output/phenotype/protocol_example.protein.region_list.txt \
    --covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
              output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz  \
    --customized-association-windows xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --region-list xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --phenotype-names trait_A trait_B

**Note:** When both `--region-name` and `--region-list` are used, the union of regions from these parameters will be analyzed. 
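
As a rough illustration of this union rule, the sketch below collects IDs from the last column of a region list and merges them with explicitly named regions. It assumes a whitespace-delimited list with `#` comment lines. The function name `collect_region_ids` and the example IDs are hypothetical, and the result is sorted only to make the output deterministic.

```python
def collect_region_ids(region_list_lines, region_names):
    """Union of the IDs in the last column of a region list and named regions.

    Hypothetical helper for illustration: `region_list_lines` is an iterable
    of text lines, as would be read from a `--region-list` file.
    """
    ids = set(region_names)
    for line in region_list_lines:
        line = line.strip()
        # Skip empty lines and comment/header lines
        if not line or line.startswith("#"):
            continue
        ids.add(line.split()[-1])
    return sorted(ids)


region_list_lines = [
    "#chr\tstart\tend\tID",
    "chr21\t1000\t2000\tgene_A",
    "chr22\t3000\t4000\tgene_B",
]
print(collect_region_ids(region_list_lines, ["gene_B", "gene_C"]))
# ['gene_A', 'gene_B', 'gene_C']
```

An ID that appears in both sources (here `gene_B`) is analyzed once.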

## Troubleshooting

| Step | Substep | Problem | Possible Reason | Solution |
|------|---------|---------|------------------|---------|
|  |  |  |  |  |




## Command interface

In [None]:
sos run colocboost.ipynb -h

## Workflow implementation

In [1]:
[global]
# It is required to input the name of the analysis
parameter: name = str
parameter: cwd = path("output")
# A list of file paths for genotype data, or the genotype data itself. 
parameter: genoFile = path
# One or multiple lists of file paths for phenotype data.
parameter: phenoFile = paths
# One or multiple lists of file paths for phenotype ID mapping file. The first column should be the original ID, the 2nd column should be the ID to be mapped to.
parameter: phenoIDFile = paths()
# Covariate file path
parameter: covFile = paths
# Summary statistics interface, see `rss_analysis.ipynb` for details
parameter: gwas_meta_data = path()
parameter: ld_meta_data = path()
parameter: gwas_name = []
parameter: gwas_data = []
parameter: column_mapping = []
# Optional: if a region list is provided, the analysis will focus on the provided regions.
# The LAST column of this list will contain the ID of regions to focus on
# Otherwise, all regions with both genotype and phenotype files will be analyzed
parameter: region_list = path()
# Optional: if a region name is provided,
# the analysis will focus on the union of the provided region list and region names
parameter: region_name = []
# Only focus on a subset of samples
parameter: keep_samples = path()
# Only focus on a subset of variants
parameter: keep_variants = path()
# An optional list documenting the custom association window for each region to analyze, with four columns: chr, start, end, region ID (e.g. gene ID).
# If this list is not provided, the `cis_window` parameter (see below) will be used.
parameter: customized_association_windows = path()
# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp
# When this is set to negative, we will rely on using customized_association_windows
parameter: cis_window = -1
# save data object or not
parameter: save_data = False
# Name of phenotypes
parameter: phenotype_names = [f'{x:bn}' for x in phenoFile]
# An indicator of whether this is a trans-analysis, i.e. not using phenotypic coordinate information
parameter: trans_analysis = False
parameter: seed = 999

# association analysis parameters
# initial number of single effects for SuSiE
parameter: init_L = 8
# maximum number of single effects to use for SuSiE
parameter: max_L = 30
# remove a variant if it has more than imiss missing individual level data
parameter: imiss = 1.0
# MAF and variance of X cutoff
parameter: maf = 0.0025
parameter: xvar_cutoff = 0.0
# MAC cutoff, on top of MAF cutoff
parameter: mac = 5
# Remove indels if indel = False
parameter: indel = True
parameter: pip_cutoff = 0.025
parameter: coverage = [0.95, 0.7, 0.5]
# If this value is not 0, then an initial single effect analysis will be performed 
# to determine if follow up analysis will be continued or to simply return NULL
# If this is negative we use a default way to determine this cutoff which is conservative but still useful
parameter: skip_analysis_pip_cutoff = []
# Skip fine-mapping
parameter: skip_fine_mapping = False
# Skip TWAS weights computation
parameter: skip_twas_weights = False
# Perform K-fold cross-validation (CV) for TWAS
# Set it to zero if this is to be skipped
parameter: twas_cv_folds = 5
parameter: twas_cv_threads = twas_cv_folds
# maximum number of variants to consider for CV
# We will randomly pick a subset of it for CV purpose
# We can set it to e.g. 8000 to reduce the computational burden, although this may risk overfitting for method comparison purposes
# When set to -1 we don't use this feature
parameter: max_cv_variants = -1
parameter: ld_reference_meta_file = path()
# Analysis environment settings
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number commands to run per job
parameter: job_size = 200
# Wall clock time expected
parameter: walltime = "1h"
# Memory expected
parameter: mem = "20G"
# Number of threads
parameter: numThreads = 1

if len(phenoFile) != len(covFile):
    raise ValueError("Number of input phenotype files must match the number of covariate files")
if len(phenoFile) != len(phenotype_names):
    raise ValueError("Number of input phenotype files must match the number of phenotype names")
if len(phenoIDFile) > 0 and len(phenoFile) != len(phenoIDFile):
    raise ValueError("Number of input phenotype files must match the number of phenotype ID mapping files")

if len(skip_analysis_pip_cutoff) == 0:
    skip_analysis_pip_cutoff = [0.0] * len(phenoFile)
if len(skip_analysis_pip_cutoff) == 1:
    skip_analysis_pip_cutoff = skip_analysis_pip_cutoff * len(phenoFile)
if len(skip_analysis_pip_cutoff) != len(phenoFile):
    raise ValueError(f"``skip_analysis_pip_cutoff`` should have either length 1 or length the same as phenotype files ({len(phenoFile)} in this case)")

# make it into an R List string
skip_analysis_pip_cutoff = [f"'{y}'={x}" for x,y in zip(skip_analysis_pip_cutoff, phenotype_names)]
    
def group_by_region(lst, partition):
    # from itertools import accumulate
    # partition = [len(x) for x in partition]
    # Compute the cumulative sums once
    # cumsum_vector = list(accumulate(partition))
    # Use slicing based on the cumulative sums
    # return [lst[(cumsum_vector[i-1] if i > 0 else 0):cumsum_vector[i]] for i in range(len(partition))]
    return partition

import os
import pandas as pd

def adapt_file_path(file_path, reference_file):
    """
    Adapt a single file path based on its existence and a reference file's path.

    Args:
    - file_path (str): The file path to adapt.
    - reference_file (str): File path to use as a reference for adaptation.

    Returns:
    - str: Adapted file path.

    Raises:
    - FileNotFoundError: If no valid file path is found.
    """
    reference_path = os.path.dirname(reference_file)

    # Check if the file exists
    if os.path.isfile(file_path):
        return file_path

    # Check file name without path
    file_name = os.path.basename(file_path)
    if os.path.isfile(file_name):
        return file_name

    # Check file name in reference file's directory
    file_in_ref_dir = os.path.join(reference_path, file_name)
    if os.path.isfile(file_in_ref_dir):
        return file_in_ref_dir

    # Check original file path prefixed with reference file's directory
    file_prefixed = os.path.join(reference_path, file_path)
    if os.path.isfile(file_prefixed):
        return file_prefixed

    # If all checks fail, raise an error
    raise FileNotFoundError(f"No valid path found for file: {file_path}")

def adapt_file_path_all(df, column_name, reference_file):
    return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))

In [1]:
[get_analysis_regions: shared = "regional_data"]
# input is genoFile, phenoFile, covFile and optionally region_list. If region_list is present then we only analyze what's contained in the list.
# regional_data should be a dictionary like:
#{'data': [("genotype_1.bed", "phenotype_1.bed.gz", "covariate_1.gz"), ("genotype_2.bed", "phenotype_1.bed.gz", "phenotype_2.bed.gz", "covariate_1.gz", "covariate_2.gz") ... ],
# 'meta_info': [("chr12:752578-752579","chr12:752577-752580", "gene_1", "trait_1"), ("chr13:852580-852581","chr13:852579-852580", "gene_2", "trait_1", "trait_2") ... ]}
import numpy as np
# OrderedDict is used by load_regional_rss_data below
from collections import OrderedDict

def preload_id_map(id_map_files):
    id_maps = {}
    for id_map_file in id_map_files:
        if id_map_file is not None and os.path.isfile(id_map_file):
            df = pd.read_csv(id_map_file, sep=r"\s+", header=None, comment='#', names=['old_ID', 'new_ID'])
            id_maps[id_map_file] = df.set_index('old_ID')['new_ID'].to_dict()
    return id_maps

def load_and_apply_id_map(pheno_path, id_map_path, preloaded_id_maps):
    pheno_df = pd.read_csv(pheno_path, sep=r"\s+", header=0)
    pheno_df['Original_ID'] = pheno_df['ID']
    if id_map_path in preloaded_id_maps:
        id_map = preloaded_id_maps[id_map_path]
        pheno_df['ID'] = pheno_df['ID'].map(id_map).fillna(pheno_df['ID'])
    return pheno_df

def filter_by_region_ids(data, region_ids):
    if region_ids is not None and len(region_ids) > 0:
        data = data[data['ID'].isin(region_ids)]
    return data

def custom_join(series):
    # Initialize an empty list to hold the processed items
    result = []
    for item in series:
        if ',' in item:
            # If the item contains commas, split by comma and convert to tuple
            result.append(tuple(item.split(',')))
        else:
            # If the item does not contain commas, add it directly
            result.append(item)
    # Convert the list of items to a tuple and return
    return tuple(result)

def aggregate_phenotype_data(accumulated_pheno_df):
    if not accumulated_pheno_df.empty:
        accumulated_pheno_df = accumulated_pheno_df.groupby(['#chr','ID','cond','path','cov_path'], as_index=False).agg({
            '#chr': lambda x: np.unique(x).astype(str)[0],
            'ID': lambda x: np.unique(x).astype(str)[0],
            'Original_ID': ','.join,
            'start': 'min',
            'end': 'max'
        }).groupby(['#chr','ID'], as_index=False).agg({
            'cond': ','.join,
            'path': ','.join,
            'Original_ID': custom_join,
            'cov_path': ','.join,
            'start': 'min',
            'end': 'max'
        })
    return accumulated_pheno_df

def process_cis_files(pheno_files, cov_files, phenotype_names, pheno_id_files, region_ids, preloaded_id_maps):
    '''
    Example output:
    #chr    start      end    ID  Original_ID   path     cov_path             cond
    chr12   752578   752579  ENSG00000060237  Q9H4A3,P62873  protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz  covar_1.gz,covar_2.gz  trait_A,trait_B
    '''
    accumulated_pheno_df = pd.DataFrame()
    pheno_id_files = [None] * len(pheno_files) if len(pheno_id_files) == 0 else pheno_id_files

    for pheno_path, cov_path, phenotype_name, id_map_path in zip(pheno_files, cov_files, phenotype_names, pheno_id_files):
        if not os.path.isfile(cov_path):
            raise FileNotFoundError(f"No valid path found for file: {cov_path}")
        pheno_df = load_and_apply_id_map(pheno_path, id_map_path, preloaded_id_maps)
        pheno_df = filter_by_region_ids(pheno_df, region_ids)
        
        if not pheno_df.empty:
            pheno_df.iloc[:, 4] = adapt_file_path_all(pheno_df, pheno_df.columns[4], f"{pheno_path:a}")
            pheno_df = pheno_df.assign(cov_path=str(cov_path), cond=phenotype_name)           
            accumulated_pheno_df = pd.concat([accumulated_pheno_df, pheno_df], ignore_index=True)

    accumulated_pheno_df = aggregate_phenotype_data(accumulated_pheno_df)
    return accumulated_pheno_df

def process_trans_files(pheno_files, cov_files, phenotype_names, pheno_id_files, region_ids, customized_association_windows):
    '''
    Example output:
    #chr    start      end    ID  Original_ID   path     cov_path             cond
    chr21   0   0  chr21_18133254_19330300  carnitine,benzoate,hippurate  metabolon_1.bed.gz,metabolon_2.bed.gz  covar_1.gz,covar_2.gz  trait_A,trait_B
    '''
    
    if not os.path.isfile(customized_association_windows):
        raise ValueError("Customized association analysis window must be specified for trans analysis.")
    accumulated_pheno_df = pd.DataFrame()
    pheno_id_files = [None] * len(pheno_files) if len(pheno_id_files) == 0 else pheno_id_files
    genotype_windows = pd.read_csv(customized_association_windows, comment="#", header=None, names=["#chr","start","end","ID"], sep="\t")
    genotype_windows = filter_by_region_ids(genotype_windows, region_ids)
    if genotype_windows.empty:
        return accumulated_pheno_df
    
    for pheno_path, cov_path, phenotype_name, id_map_path in zip(pheno_files, cov_files, phenotype_names, pheno_id_files):
        if not os.path.isfile(cov_path):
            raise FileNotFoundError(f"No valid path found for file: {cov_path}")
        pheno_df = pd.read_csv(pheno_path, sep=r"\s+", header=0, names=['Original_ID', 'path'])
        if not pheno_df.empty:
            pheno_df.iloc[:, -1] = adapt_file_path_all(pheno_df, pheno_df.columns[-1], f"{pheno_path:a}")
            pheno_df = pheno_df.assign(cov_path=str(cov_path), cond=phenotype_name)
            # Here we combine genotype_windows which contains "#chr" and "ID" to pheno_df by creating a cartesian product
            pheno_df = pd.merge(genotype_windows.assign(key=1), pheno_df.assign(key=1), on='key').drop('key', axis=1)
            # then set start and end columns to zero
            pheno_df['start'] = 0
            pheno_df['end'] = 0
            if id_map_path is not None:
                # Filter pheno_df by specific association-window and phenotype pairs
                association_analysis_pair = pd.read_csv(id_map_path, sep=r"\s+", header=None, comment='#', names=['ID', 'Original_ID'])
                pheno_df = pd.merge(pheno_df, association_analysis_pair, on=['ID', 'Original_ID'])
            accumulated_pheno_df = pd.concat([accumulated_pheno_df, pheno_df], ignore_index=True)

    accumulated_pheno_df = aggregate_phenotype_data(accumulated_pheno_df)
    return accumulated_pheno_df

def load_regional_data(genoFile, phenoFile, covFile, phenotype_names, phenoIDFile, 
                      region_list=None, region_name=None, trans_analysis=False, 
                      customized_association_windows=None, cis_window=-1):   
    # Ensure region_name is a list
    if region_name is None:
        region_name = []
    
    # Load genotype meta data
    if f"{genoFile:x}" == ".bed":
        geno_meta_data = pd.DataFrame([("chr"+str(x), f"{genoFile:a}") for x in range(1,23)] + [("chrX", f"{genoFile:a}")], columns=['#chr', 'geno_path'])
    else:
        geno_meta_data = pd.read_csv(f"{genoFile:a}", sep=r"\s+", header=0)
        geno_meta_data.iloc[:, 1] = adapt_file_path_all(geno_meta_data, geno_meta_data.columns[1], f"{genoFile:a}")
        geno_meta_data.columns = ['#chr', 'geno_path']
        geno_meta_data['#chr'] = geno_meta_data['#chr'].apply(lambda x: str(x) if str(x).startswith('chr') else f'chr{x}')

    # Checking the DataFrame
    valid_chr_values = [f'chr{x}' for x in range(1, 23)] + ['chrX']
    if not all(value in valid_chr_values for value in geno_meta_data['#chr']):
        raise ValueError("Invalid chromosome values found. Allowed values are chr1 to chr22 and chrX.")

    region_ids = []
    # If region_list is provided, read the file and extract IDs
    if region_list is not None and region_list.is_file():
        region_list_df = pd.read_csv(region_list, sep=r"\s+", header=None, comment="#")
        region_ids = region_list_df.iloc[:, -1].unique()  # Extracting the last column for IDs

    # If region_name is provided, include those IDs as well
    # --region-name A B C will result in a list of ["A", "B", "C"] here
    if len(region_name) > 0:
        region_ids = list(set(region_ids).union(set(region_name)))

    if trans_analysis:
        meta_data = process_trans_files(phenoFile, covFile, phenotype_names, phenoIDFile, region_ids, customized_association_windows)
    else:
        meta_data = process_cis_files(phenoFile, covFile, phenotype_names, phenoIDFile, region_ids, preload_id_map(phenoIDFile))
        
    if not meta_data.empty:
        meta_data = meta_data.merge(geno_meta_data, on='#chr', how='inner')
        # Adjust association-window
        if customized_association_windows is not None and os.path.isfile(customized_association_windows):
            print(f"Loading customized association analysis window from {customized_association_windows}")
            association_windows_list = pd.read_csv(customized_association_windows, comment="#", header=None, 
                                                  names=["#chr","start","end","ID"], sep="\t")
            meta_data = pd.merge(meta_data, association_windows_list, on=['#chr', 'ID'], how='left', suffixes=('', '_association'))
            mismatches = meta_data[meta_data['start_association'].isna()]
            if not mismatches.empty:
                print("First 5 mismatches:")
                print(mismatches[['ID']].head())
                raise ValueError(f"{len(mismatches)} regions to analyze cannot be found in ``{customized_association_windows}``. "
                                f"Please check your ``{customized_association_windows}`` database to make sure it contains all "
                                f"association-window definitions. ")
        else:
            if cis_window < 0:
                raise ValueError("Please either input valid path to association-window file via ``--customized-association-windows``, "
                                "or set ``--cis-window`` to a non-negative integer.")
            if cis_window == 0:
                print("Warning: only variants within the range of start and end of molecular phenotype will be considered "
                     "since cis_window is set to zero and no customized association window file was found. "
                     "Please make sure this is by design.")
            meta_data['start_association'] = meta_data['start'].apply(lambda x: max(x - cis_window, 0))
            meta_data['end_association'] = meta_data['end'] + cis_window

        # Example meta_data:
        # #chr    start      end    start_association       end_association           ID  Original_ID   path     cov_path             cond             coordinate     geno_path
        # 0  chr12   752578   752579  652578   852579  ENSG00000060237  Q9H4A3,P62873  protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz  covar_1.gz,covar_2.gz  trait_A,trait_B    chr12:752578-752579  protocol_example.genotype.chr21_22.bed       
        # Create the final dictionary
        regional_data = {
            'data': [(row['geno_path'], *row['path'].split(','), *row['cov_path'].split(',')) for _, row in meta_data.iterrows()],
            'meta_info': [(f"{row['#chr']}:{row['start']}-{row['end']}", # this is the phenotypic region to extract data from
                          f"{row['#chr']}:{row['start_association']}-{row['end_association']}", # this is the association window region
                          row['ID'], row['Original_ID'], *row['cond'].split(',')) for _, row in meta_data.iterrows()]
        }
    else:
        regional_data = {'data': [], 'meta_info': []}
        
    return regional_data

def file_exists(file_path, relative_path=None):
    """Check if a file exists at the given path or relative to a specified path."""
    if os.path.exists(file_path) and os.path.isfile(file_path):
        return True
    elif relative_path:
        relative_file_path = os.path.join(relative_path, file_path)
        return os.path.exists(relative_file_path) and os.path.isfile(relative_file_path)
    return False

def check_required_columns(df, required_columns):
    """Check if the required columns are present in the dataframe."""
    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        raise ValueError(f"Missing required columns: {', '.join(missing_columns)}")

def parse_region(region):
    """Parse a region string in 'chr:start-end' format into a list [chr, start, end]."""
    chrom, rest = region.split(':')
    start, end = rest.split('-')
    # Accept both "21" and "chr21", consistent with how region_list entries are parsed
    return [int(chrom.replace("chr", "")), int(start), int(end)]

def load_regional_rss_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, region_name=None, region_list=None):
    """
    Extracts data from GWAS metadata files and additional GWAS data provided. 
    Optionally filters data based on specified regions.

    Args:
    - gwas_meta_data (str): File path to the GWAS metadata file.
    - gwas_name (list): Vector of GWAS study names.
    - gwas_data (list): Vector of GWAS data.
    - column_mapping (list, optional): Vector of column mapping files.
    - region_name (list, optional): List of region names in 'chr:start-end' format.
    - region_list (str, optional): File path to a file containing regions.

    Returns:
    - GWAS Dictionary: Maps study IDs to a list containing chromosome number, 
      GWAS file path, and optional column mapping file path.
    - Region Dictionary: Maps region names to lists [chr, start, end].

    Raises:
    - FileNotFoundError: If any specified file path does not exist.
    - ValueError: If required columns are missing in the input files or vector lengths mismatch.
    """
    # Check vector lengths
    if len(gwas_name) != len(gwas_data):
        raise ValueError("gwas_name and gwas_data must be of equal length")
    
    if len(column_mapping) > 0 and len(column_mapping) != len(gwas_name):
        raise ValueError("If column_mapping is provided, it must be of the same length as gwas_name and gwas_data")

    # Required columns for GWAS file type
    required_gwas_columns = ['study_id', 'chrom', 'file_path']

    # Base directory of the metadata files
    gwas_base_dir = os.path.dirname(gwas_meta_data)
    
    # Reading the GWAS metadata file
    gwas_df = pd.read_csv(gwas_meta_data, sep="\t")
    check_required_columns(gwas_df, required_gwas_columns)
    gwas_dict = OrderedDict()

    # Process additional GWAS data from vectors
    for name, data, mapping in zip(gwas_name, gwas_data, column_mapping or [None]*len(gwas_name)):
        gwas_dict[name] = {0: [data, mapping]}

    for _, row in gwas_df.iterrows():
        file_path = row['file_path']
        mapping_file = row.get('column_mapping_file')
        n_sample = row.get('n_sample')
        n_case = row.get('n_case')
        n_control = row.get('n_control')

        # Check if the file and optional mapping file exist
        if not file_exists(file_path, gwas_base_dir) or (mapping_file and not file_exists(mapping_file, gwas_base_dir)):
            raise FileNotFoundError(f"File {file_path} not found for {row['study_id']}")
        
        # Adjust paths if necessary
        file_path = file_path if file_exists(file_path) else os.path.join(gwas_base_dir, file_path)
        if mapping_file:
            mapping_file = mapping_file if file_exists(mapping_file) else os.path.join(gwas_base_dir, mapping_file)
        
        # Create or update the entry for the study_id
        if row['study_id'] not in gwas_dict:
            gwas_dict[row['study_id']] = {}

        # Expand chrom 0 to chrom 1-22 or use the specified chrom
        chrom_range = range(1, 23) if row['chrom'] == 0 else [row['chrom']]
        for chrom in chrom_range:
            if chrom in gwas_dict[row['study_id']]:
                existing_entry = gwas_dict[row['study_id']][chrom]
                raise ValueError(f"Duplicate chromosome specification for study_id {row['study_id']}, chrom {chrom}. "
                                 f"Conflicting entries: {existing_entry} and {[file_path, mapping_file]}")
            gwas_dict[row['study_id']][chrom] = [file_path, mapping_file, n_sample, n_case, n_control]

    # Process region_list and region_name
    # (done once at function level, so region_dict is defined even if the GWAS metadata has no rows)
    region_dict = dict()
    if region_list and os.path.isfile(region_list):
        with open(region_list, 'r') as file:
            for line in file:
                # Skip empty lines
                if not line.strip():
                    continue
                # Skip comment lines
                if line.startswith("#"):
                    continue
                parts = line.strip().split()
                if len(parts) == 1:
                    region = parse_region(parts[0])
                elif len(parts) == 3:
                    region = [int(parts[0].replace("chr", "")), int(parts[1]), int(parts[2])]
                elif len(parts) >= 4 and region_list != ld_meta_data: # for eQTL where chr:start:end:gene_id:gene_name, and path if LD_meta are used.
                    region = [int(parts[0].replace("chr", "")), int(parts[1]), int(parts[2]), parts[3]]
                else:
                    raise ValueError("Invalid region format in region_list")

                region_dict[f"{region[0]}:{region[1]}_{region[2]}"] = region
                
    if region_name:
        for region in region_name:
            parsed_region = parse_region(region)
            region_key = f"{parsed_region[0]}:{parsed_region[1]}_{parsed_region[2]}"
            if region_key not in region_dict:
                region_dict[region_key] = parsed_region

    return gwas_dict, region_dict

regional_data = load_regional_data(
     genoFile=genoFile, 
     phenoFile=phenoFile, 
     covFile=covFile, 
     phenotype_names=phenotype_names, 
     phenoIDFile=phenoIDFile,
     region_list=region_list, 
     region_name=region_name, 
     trans_analysis=trans_analysis,
     customized_association_windows=customized_association_windows, 
     cis_window=cis_window
)

if gwas_meta_data.is_file():
    gwas_dict, region_dict = load_regional_rss_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, region_name, region_list)
    regional_rss_data = dict([("GWAS", gwas_dict), ("regions", region_dict)])
else:
    regional_rss_data = dict()

## ColocBoost analysis

In [1]:
[colocboost]
depends: sos_variable("regional_data"), sos_variable("regional_rss_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')
meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"

if skip_fine_mapping and skip_twas_weights:
    save_data = True
    output_files = [f'{cwd:a}/data/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.univariate_data.rds']
elif not skip_fine_mapping and skip_twas_weights:
    output_files = [f'{cwd:a}/fine_mapping/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.univariate_bvsr.rds']
elif skip_fine_mapping and not skip_twas_weights:
    output_files = [f'{cwd:a}/twas_weights/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.univariate_twas_weights.rds']
else:
    output_files = [f'{cwd:a}/fine_mapping/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.univariate_bvsr.rds',
    f'{cwd:a}/twas_weights/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.univariate_twas_weights.rds']
output: output_files
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
R: expand = '${ }', stdout = f"{_output[0]:nn}.susie_twas.stdout", stderr = f"{_output[0]:nn}.susie_twas.stderr", container = container, entrypoint = entrypoint
    options(warn=1)
    library(pecotmr)
    phenotype_files = c(${",".join(['"%s"' % x.absolute() for x in _input[1:len(_input)//2+1]])})
    covariate_files = c(${",".join(['"%s"' % x.absolute() for x in _input[len(_input)//2+1:]])})
    conditions = c(${",".join(['"%s"' % x for x in _meta_info[4:]])})
    pip_cutoff_to_skip = list(${",".join(skip_analysis_pip_cutoff)})
    pip_cutoff_to_skip = unlist(c(pip_cutoff_to_skip[conditions]))
    # extract subset of samples
    keep_samples = NULL
    if (${"TRUE" if keep_samples.is_file() else "FALSE"}) {
      keep_samples = unlist(strsplit(readLines(${keep_samples:ar}), "\\s+"))
      message(paste(length(keep_samples), "samples are selected to be loaded for analysis"))
    }
    # Load regional association data
    tryCatch({
        fdat = load_regional_univariate_data(genotype = ${_input[0]:anr},
                                              phenotype = phenotype_files,
                                              covariate = covariate_files,
                                              region = ${("'%s'" % _meta_info[0]) if int(_meta_info[0].split('-')[-1])>0 else 'NULL'}, # if the end position is zero return NULL
                                              association_window = "${_meta_info[1]}",
                                              conditions = conditions,
                                              maf_cutoff = ${maf},
                                              mac_cutoff = ${mac},
                                              imiss_cutoff = ${imiss},
                                              keep_indel = ${"TRUE" if indel else "FALSE"},
                                              keep_samples = keep_samples,
                                              keep_variants = ${'"%s"' % keep_variants if not keep_variants.is_dir() else "NULL"}, # test "not keep_variants.is_dir()" rather than "is_file()" so that a mistyped file name is passed on and fails loudly instead of being silently ignored
                                              extract_region_name = list(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])}),
                                              phenotype_header = ${"4" if int(_meta_info[0].split('-')[-1])>0 else "1"},
                                              region_name_col = ${"4" if int(_meta_info[0].split('-')[-1])>0 else "1"},
                                              scale_residuals = FALSE)
    }, NoSNPsError = function(e) {
        message("Error: ", paste(e$message, "${_meta_info[2] + '@' + _meta_info[1]}"))
        saveRDS(list("${_meta_info[2]}" = e$message), "${_output[0]:ann}.univariate_data.rds", compress='xz')
        quit(save="no")
    })
    # save data-set
    if (${"TRUE" if save_data else "FALSE"}) {
        saveRDS(list("${_meta_info[2]}" = fdat), "${_output[0]:ann}.univariate_data.rds", compress='xz')
    }
    if (${"FALSE" if skip_fine_mapping and skip_twas_weights else "TRUE"}) {
      if ("${_meta_info[2]}" != "${_meta_info[3]}") {
          region_name = c("${_meta_info[2]}", c(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])}))
      } else {
          region_name = "${_meta_info[2]}"
      }

      region_info = list(region_coord=parse_region("${_meta_info[0]}"), grange=parse_region("${_meta_info[1]}"), region_name=region_name)

      finemapping_result = list()
      preset_variants_result = list()
      condition_names = vector()
      empty_elements_cnt = 0
      r = 1

      while (r<=length(fdat$residual_Y)) {
          # Update condition names
          dropped_samples = list(X=fdat$dropped_sample$dropped_samples_X[[r]], 
                               y=fdat$dropped_sample$dropped_samples_Y[[r]], 
                               covar=fdat$dropped_sample$dropped_samples_covar[[r]])
          new_names = names(fdat$residual_Y)[r]
          ## FIXME: residual Y loses its column name when there is only one column
          # new_col_names = colnames(fdat$residual_Y[[r]])
          new_col_names = list(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])})[[r]]
          if (is.null(new_col_names)) {
            # column names do not exist; create generic names instead
            new_col_names = 1:ncol(fdat$residual_Y[[r]])
          }
          # When the column names differ from the condition name, combine the two to form the new names
          if (!identical(new_names, new_col_names)) {
            new_names = paste(new_names, new_col_names, sep="_") # e.g. DLPFC_iso1, DLPFC_iso2
          }

          out <- list()
          # Run the first round focused on fine-mapping with many variants included
          # fine-mapping with complete data-set
          if (${"FALSE" if skip_fine_mapping else "TRUE"}) {
              out$finemapping <- lapply(1:ncol(fdat$residual_Y[[r]]), function(i) {
                                  set.seed(${seed})
                                  rs <- univariate_analysis_pipeline( X=fdat$residual_X[[r]], 
                                                                      Y=fdat$residual_Y[[r]][,i,drop=FALSE],
                                                                      maf=fdat$maf[[r]], 
                                                                      X_scalar=fdat$residual_X_scalar[[r]], 
                                                                      Y_scalar=if (fdat$residual_Y_scalar[[r]] == 1) 1 else fdat$residual_Y_scalar[[r]][,i,drop=FALSE],
                                                                      X_variance=fdat$X_variance[[r]],
                                                                      other_quantities = list(dropped_samples = dropped_samples),
                                                                      # filters
                                                                      imiss_cutoff = ${imiss},
                                                                      maf_cutoff = NULL,
                                                                      xvar_cutoff = 0,
                                                                      ld_reference_meta_file = NULL,
                                                                      pip_cutoff_to_skip = pip_cutoff_to_skip[r],
                                                                      # methods parameter configuration
                                                                      init_L = ${init_L},
                                                                      max_L = ${max_L},
                                                                      l_step = 5,
                                                                      # fine-mapping results summary
                                                                      signal_cutoff = ${pip_cutoff},
                                                                      coverage = c(${",".join([str(x) for x in coverage])}),
                                                                      # TWAS weights and CV for TWAS weights
                                                                      twas_weights = FALSE, 
                                                                      max_cv_variants=${max_cv_variants},
                                                                      cv_folds=${twas_cv_folds},
                                                                      cv_threads = ${twas_cv_threads}
                                                                  )
                                                                  return(rs)
                                                                  })
          }
          if (${"FALSE" if skip_twas_weights else "TRUE"}) {
              common_cols = intersect(colnames(fdat$X), colnames(fdat$residual_X[[r]]))
              X_r = fdat$X[rownames(fdat$residual_X[[r]]), common_cols, drop=FALSE]
              maf = fdat$maf[[r]][common_cols]
              # Run the second round focused on TWAS using selected variants
              out$twas_models <- lapply(1:ncol(fdat$residual_Y[[r]]), function(i) {
                                  set.seed(${seed})
                                  rs <- univariate_analysis_pipeline(X=X_r, # use original X matrix for TWAS model: 1) to be comparable to multivariate analysis in CV; 2) to avoid overfitting in CV
                                                                  Y=fdat$residual_Y[[r]][,i,drop=FALSE],
                                                                  maf=maf, 
                                                                  X_scalar=fdat$residual_X_scalar[[r]], 
                                                                  Y_scalar=if (fdat$residual_Y_scalar[[r]] == 1) 1 else fdat$residual_Y_scalar[[r]][,i,drop=FALSE],
                                                                  X_variance=fdat$X_variance[[r]],
                                                                  other_quantities = list(dropped_samples = dropped_samples),
                                                                  # filters
                                                                  imiss_cutoff = ${imiss},
                                                                  maf_cutoff = ${min_twas_maf},
                                                                  xvar_cutoff = ${min_twas_xvar},
                                                                  ld_reference_meta_file=${('"%s"' % ld_reference_meta_file) if not ld_reference_meta_file.is_dir() else "NULL"},
                                                                  pip_cutoff_to_skip = pip_cutoff_to_skip[r],
                                                                  # methods parameter configuration
                                                                  init_L = ${init_L},
                                                                  max_L = ${max_L},
                                                                  l_step = 5,
                                                                  # fine-mapping results summary
                                                                  signal_cutoff = ${pip_cutoff},
                                                                  coverage = c(${",".join([str(x) for x in coverage])}),
                                                                  # TWAS weights and CV for TWAS weights
                                                                  twas_weights = TRUE, 
                                                                  max_cv_variants=${max_cv_variants},
                                                                  cv_folds=${twas_cv_folds},
                                                                  cv_threads = ${twas_cv_threads}
                                                                  )
                                                                  return(rs)
                                                                  })
          }

          empty_elements_idx <- unique(do.call(c, lapply(out, function(results) which(sapply(results, function(x) is.list(x) && length(x) == 0)))))
          if (length(empty_elements_idx)>0) {
            empty_elements_cnt <- empty_elements_cnt + length(empty_elements_idx)
            if (!is.null(out$finemapping)) {
                out$finemapping <- out$finemapping[-empty_elements_idx]
            }
            if (!is.null(out$twas_models)) {
                out$twas_models <- out$twas_models[-empty_elements_idx]
            }
            new_names <- new_names[-empty_elements_idx]
          }
          if (!is.null(out$finemapping)) {
              finemapping_result = c(finemapping_result, out$finemapping)
          }
          if (!is.null(out$twas_models)) {
              preset_variants_result = c(preset_variants_result, out$twas_models)
          }
          condition_names = c(condition_names, new_names)
          if (length(new_names)>0) {
            message("Analysis completed for: ", paste(new_names, collapse=","))
          }
          # original data no longer relevant, set to NA to release memory
          fdat$residual_X[[r]] <- NA
          fdat$residual_Y[[r]] <- NA
          r = r + 1
      }
      # Reorganize outputs
      # The BVSR model in this case contains different versions of SuSiE fits
      # preset_variants_result
      # Clean up TWAS weights output
      twas_output <- list()
      if (length(preset_variants_result)>0) {
          names(preset_variants_result) <- condition_names
          for (r in condition_names) {
            twas_output[[r]] <- preset_variants_result[[r]]$twas_weights_result
            preset_variants_result[[r]]$twas_weights_result <- NULL
            twas_output[[r]]$variant_names <- preset_variants_result[[r]]$variant_names
            twas_output[[r]]$region_info <- region_info
            preset_variants_result[[r]]$region_info <- region_info
          }
      }
      # Clean up fine-mapping output
      finemapping_output <- list()
      if (length(finemapping_result)>0) {
          names(finemapping_result) <- condition_names
      }
      for (r in condition_names) {
          if (r %in% names(finemapping_result)) {
              finemapping_output[[r]] <- finemapping_result[[r]]
              finemapping_output[[r]]$region_info <- region_info
              finemapping_output[[r]]$susie_fitted <- NULL
          }
          if (r %in% names(preset_variants_result)) {
              finemapping_output[[r]]$preset_variants_result <- preset_variants_result[[r]]
          }
      }
      saveRDS(list("${_meta_info[2]}" = finemapping_output), "${_output[0]:ann}.univariate_bvsr.rds", compress='xz')
      
      if (length(twas_output) > 0) saveRDS(list("${_meta_info[2]}" = twas_output), "${_output[0]:ann}.univariate_twas_weights.rds", compress='xz')
      if (empty_elements_cnt>0) {
          message(empty_elements_cnt, " analyses skipped for failing to pass the initial screen for potential association signals")
      }
    }