# Fine-mapping with individual level data

This notebook performs statistical fine-mapping using SuSiE, fSuSiE and mvSuSiE for individual level data. 

## Input

1. A list of regions to be analyzed (optional); the last column of this file should be the region name.
2. A list of genotype files per region to be analyzed, in PLINK `bed` format.
3. A vector of lists of phenotype files per region to be analyzed, in UCSC `bed.gz` format with index in `bed.gz.tbi` format.
4. A vector of covariate files corresponding to the lists above.
5. Optionally, a vector of names of the phenotypic conditions in the form of `cond1 cond2 cond3`, separated with whitespace.

Inputs 2 and 3 should be outputs from the `genotype_per_region` and `annotate_coord` modules in previous preprocessing steps. Input 4 should be the output of the `covariate_preprocessing` pipeline, which contains genotype PCs, phenotypic hidden confounders and fixed covariates.

### Example genotype list

```
# region        path
ENSG00000000457 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000000457.bed
ENSG00000000460 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000000460.bed
ENSG00000000938 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000000938.bed
ENSG00000000971 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000000971.bed
ENSG00000001036 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000001036.bed
ENSG00000001084 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000001084.bed
ENSG00000001167 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000001167.bed
ENSG00000001460 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000001460.bed
```

### Example phenotype list

```
#chr    start   end ID  path
chr12   752578  752579  ENSG00000060237  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   990508  990509  ENSG00000082805  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   2794969 2794970 ENSG00000004478  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   4649113 4649114 ENSG00000139180  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   6124769 6124770 ENSG00000110799  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   6534516 6534517 ENSG00000111640  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
```
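
The way the two lists are combined can be sketched in a few lines of pandas. This is a minimal illustration with toy rows and hypothetical file paths, not the workflow itself. It assumes the genotype list carries `#id` and `#path` columns, which is what the workflow code further down expects. Each phenotype row is turned into a `chr:start-end` region string and left-joined onto the genotype list by ID.

```python
import io
import pandas as pd

# Toy genotype list (hypothetical paths) with the "#id" and "#path" columns assumed here
geno = pd.read_csv(io.StringIO(
    "#id\t#path\n"
    "ENSG00000060237\tgeno/ENSG00000060237.bed\n"
    "ENSG00000082805\tgeno/ENSG00000082805.bed\n"
    "ENSG00000000001\tgeno/ENSG00000000001.bed\n"
), sep="\t")

# Toy phenotype list in the format shown above (hypothetical paths)
pheno = pd.read_csv(io.StringIO(
    "#chr\tstart\tend\tID\tpath\n"
    "chr12\t752578\t752579\tENSG00000060237\tpheno/protein.bed.gz\n"
    "chr12\t990508\t990509\tENSG00000082805\tpheno/protein.bed.gz\n"
), sep="\t")

# Collapse the coordinates into a "chr:start-end" region string and rename ID to "#id"
pheno = (
    pheno.assign(pos=pheno["#chr"] + ":" + pheno["start"].astype(str) + "-" + pheno["end"].astype(str))
         .drop(columns=["#chr", "start", "end"])
         .rename(columns={"ID": "#id"})
)

# Left join keeps every genotype region; regions without a phenotype get NaN
merged = geno.merge(pheno, on="#id", how="left")
matched = merged.dropna(subset=["pos"])
print(matched[["#id", "pos"]].to_string(index=False))
```

Regions that appear in the genotype list but in none of the phenotype lists are dropped before analysis.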

## Output

For each analysis region, the output is a fitted SuSiE model saved in RDS format.
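
As a rough guide to where these files end up, the sketch below composes a path following the pattern used by the `output:` statement of the `susie_1` step later in this notebook. The helper name `expected_output` is hypothetical and only for illustration. The workflow itself resolves `cwd` to an absolute path, which this sketch does not do.

```python
from pathlib import Path

def expected_output(cwd, step_name, name, region_id):
    """Hypothetical helper mirroring the susie_1 output pattern."""
    workflow = step_name[:-2]  # strip the step suffix: "susie_1" -> "susie"
    return Path(cwd) / workflow / f"{name}.{region_id}.susie_fitted.rds"

p = expected_output("output", "susie_1", "protocol_example_protein", "ENSG00000060237")
print(p.as_posix())
```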

## Minimal working example

### SuSiE

Below we duplicate the phenotype and covariate inputs to demonstrate that, when there are multiple phenotypes for the same genotype, this pipeline can analyze all of them (more than two are accepted as well).

```
# suggested output naming convention is cohort_modality, eg ROSMAP_snRNA_pseudobulk
sos run pipeline/SuSiE.ipynb susie \
    --name protocol_example_protein \
    --genoFile output/phenotype_by_region/protocol_example.protein.enhanced_cis_chr21_chr22_genotype_by_region/protocol_example.genotype.chr21_22.genotype_by_region_files.txt \
    --phenoFile output/phenotype/protocol_example.protein.region_list.txt \
                output/phenotype/protocol_example.protein.region_list.txt \
    --covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
              output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
    --phenotype-names A B \
    --container oras://ghcr.io/cumc/stephenslab_apptainer:latest
```

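It is also possible to analyze only a selected list of regions by name, using option `--region-list`, option `--region-name`, or both. The example below first writes a region list file and then runs the analysis on the union of the two inputs, which selects 5 unique region IDs here because `ENSG00000159082_O43426` appears in both: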

In [2]:
import pandas as pd

# Define the data
data = {
    "#chr": ["chr1", "chr1", "chr1"],
    "start": [1000000, 1010000, 1020000],
    "end": [1000100, 1010100, 1020100],
    "ID": ["ENSG00000159082_O43426", "ENSG00000159131_P22102", "ENSG00000205726_Q15811"]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a tab-separated file
df.to_csv("output/selected_regions.tsv", sep="\t", index=False)

```
sos run pipeline/SuSiE.ipynb susie \
    --name protocol_example_protein \
    --genoFile output/protocol_example.protein.enhanced_cis_chr21_chr22_genotype_by_region/protocol_example.genotype.chr21_22.genotype_by_region_files.txt \
    --phenoFile output/phenotype/protocol_example.protein.region_list.txt \
                output/phenotype/protocol_example.protein.region_list.txt \
    --covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
              output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
    --phenotype-names A B \
    --cwd output/ \
    --region-list output/selected_regions.tsv \
    --region-name ENSG00000154654_O15394 ENSG00000142192_P05067 ENSG00000159082_O43426 \
    --container oras://ghcr.io/cumc/stephenslab_apptainer:latest
```
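
The selection logic can be checked outside the workflow with a short sketch. It mirrors how the region list is read further down (last column, `#` lines treated as comments) and takes the union with the names given on the command line. The in-memory text stands in for `output/selected_regions.tsv`.

```python
import io
import pandas as pd

# Stand-in for output/selected_regions.tsv written by the cell above
region_list_text = (
    "#chr\tstart\tend\tID\n"
    "chr1\t1000000\t1000100\tENSG00000159082_O43426\n"
    "chr1\t1010000\t1010100\tENSG00000159131_P22102\n"
    "chr1\t1020000\t1020100\tENSG00000205726_Q15811\n"
)
# The header line starts with "#", so it is skipped as a comment
region_list_df = pd.read_csv(io.StringIO(region_list_text), sep="\t", header=None, comment="#")
from_list = list(region_list_df.iloc[:, -1].unique())

# Names passed via --region-name
region_name = ["ENSG00000154654_O15394", "ENSG00000142192_P05067", "ENSG00000159082_O43426"]

# Union of both sources; duplicates collapse to one entry
region_ids = sorted(set(from_list) | set(region_name))
print(len(region_ids))
for rid in region_ids:
    print(rid)
```

Only IDs that are also present in the genotype list will actually be analyzed.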

### fSuSiE


```
# suggested output naming convention is cohort_modality, eg ROSMAP_snRNA_pseudobulk
sos run pipeline/fSuSiE.ipynb susie \
    --name protocol_example_methylation \
    --genoFile output/phenotype_by_region/protocol_example.methylation.enhanced_cis_chr21_chr22_genotype_by_region/protocol_example.genotype.chr21_22.genotype_by_region_files.txt \
    --phenoFile output/phenotype_by_region/protocol_example.methylation.bed.phenotype_by_region_files.txt \
                output/phenotype_by_region/protocol_example.methylation.bed.phenotype_by_region_files.txt  \
    --covFile output/covariate/protocol_example.methylation.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
              output/covariate/protocol_example.methylation.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
    --container oras://ghcr.io/cumc/stephenslab_apptainer:latest
```

In [1]:
[global]
# A list of file paths for genotype data. 
parameter: genoFile = path
# One or multiple lists of file paths for phenotype data.
parameter: phenoFile = paths
# Covariate file path
parameter: covFile = paths
# Optional: if a region list is provided, the analysis will focus on the listed regions.
# The LAST column of this list should contain the IDs of the regions to focus on.
# Otherwise, all regions with both genotype and phenotype files will be analyzed.
parameter: region_list = path()
# Optional: if region names are provided,
# the analysis will focus on the union of the provided region list and region names.
parameter: region_name = []
parameter: cwd = path("output")
# It is required to input the name of the analysis
parameter: name = str
# path to utility script. In the future we will consolidate this into an R package.
parameter: utils_R = path("pipeline/xqtl_utils.R")
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number of commands to run per job
parameter: job_size = 200
# Wall clock time expected
parameter: walltime = "20h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 2
# Name of phenotypes
parameter: phenotype_names = [f'{x:bn}' for x in phenoFile]
utils_R = f"{utils_R:a}"
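
The `entrypoint` default above is derived from the container string with a regular expression. The standalone sketch below reproduces that expression in plain Python to show what it yields for the container used in the examples. The function name `derive_entrypoint` is hypothetical and not part of the workflow.

```python
import re

def derive_entrypoint(container: str) -> str:
    """Hypothetical helper reproducing the default entrypoint expression."""
    if not container:
        return ""
    image = container.split("/")[-1]  # e.g. "stephenslab_apptainer:latest"
    # Strip a trailing "_apptainer:latest", "_docker:latest" or ".sif" to get the environment name
    env_name = re.sub(r"(_apptainer:latest|_docker:latest|\.sif)$", "", image)
    return 'micromamba run -a "" -n ' + env_name

print(derive_entrypoint("oras://ghcr.io/cumc/stephenslab_apptainer:latest"))
print(derive_entrypoint("containers/stephenslab.sif"))
```

Both calls resolve to the same environment name, `stephenslab`, and an empty container string gives an empty entrypoint.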

In [None]:
[get_analysis_regions: shared = "regional_data"]
# Input is genoFile, phenoFile, covFile and optionally region_list. If region_list is present, only the regions it contains are analyzed.
# regional_data should be a dictionary with:
# 1. a list of file lists: {data: [(gene_1.genotype, condition_1, cov_1), (gene_2.genotype, condition_1, cov_1, condition_2, cov_2), ...]}; elements may differ in length
# 2. a list of region meta_info: {meta_info: [("chr:start-end", gene_1, "cond_1"), ("chr:start-end", gene_2, "cond_1','cond_2"), ...]}
import pandas as pd
import os
genoFile = pd.read_csv(genoFile, sep = "\t", header=0)

if len(phenoFile) != len(covFile):
    raise ValueError("Number of input phenotype files must match that of covariate files")
if len(phenoFile) != len(phenotype_names):
    raise ValueError("Number of input phenotype files must match the number of phenotype names")
## pos and covar are condition specific; this way, when a region has no phenotype file for a condition, the corresponding columns are NA.
phenoFile = [pd.read_csv(x, sep = "\t", header=0).assign(pos = lambda y:y['#chr']+':'+y['start'].astype("str")+'-'+
                                              y['end'].astype("str")).assign(cov_path = z, cond = a ).drop(columns = ["#chr","start","end"]).rename(columns = {"ID":"#id"})   
             for x,z,a in zip(phenoFile,covFile,phenotype_names)]
for i in range(len(phenoFile)):
    genoFile = genoFile.merge(phenoFile[i], on='#id', how='left', suffixes = (f'{i}_x', f'{i}_y'))

# remove id that has no phenotype.

genoFile = genoFile[~genoFile.drop(columns=['#id',"#path"]).isna().all(axis=1)]    

if len(genoFile.index) == 0:
    raise ValueError("No region overlap between genotype #id and any of the phenotypes ID")

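# Get position for meta_data: use the first non-missing position among the per-condition pos columns,
# so the index has exactly one entry per region even when several phenotypes cover the same region
pos_col = [col for col in genoFile.columns if col.startswith('pos')]
genoFile.index = genoFile[pos_col].bfill(axis=1).iloc[:, 0].values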
# Get the conditions strings for each ID
cond_col = [col for col in genoFile.columns if col.startswith('cond')]
genoFile["phenotype_names"] = ["','".join(pd.Series((x)).dropna()) for x in genoFile[cond_col].to_dict("split")["data"]]
# Clean up
genoFile = genoFile.drop(columns = cond_col).drop(columns = pos_col)


region_ids = []

# If region_list is provided, read the file and extract IDs
if region_list.is_file():
    region_list_df = pd.read_csv(region_list, sep = "\t", header=None, comment = "#")
    region_ids = list(region_list_df.iloc[:, -1].unique())  # Extract the last column as a plain list of IDs

# If region_name is provided, include those IDs as well
# --region-name A B C will result in a list of ["A", "B", "C"] here
if len(region_name) > 0:
    region_ids = list(set(region_ids).union(set(region_name)))

# If either region_list or region_name is provided, filter the genoFile
if len(region_ids) > 0:
    genoFile = genoFile[genoFile['#id'].isin(region_ids)]

## There will always be a genotype file due to the left join.
## There will always be a covariate file, as len(covFile) must equal len(phenoFile) and the covariate column holds the same string across all rows.
## So a problem arises only if there is no bed.gz.
# Build the per-region file lists from the filtered genoFile, dropping missing entries
file_inv = genoFile.drop(columns=["#id", "phenotype_names"]).to_dict("split")
file_inv['data'] = [[value for value in sublist if not pd.isna(value)] for sublist in file_inv['data']]

# Create regional_data from the filtered data
regional_data = {"data": file_inv["data"],
                 "meta_info": genoFile[["#id", "phenotype_names"]].reset_index().to_dict("split")['data']}

## Univariate SuSiE

In [1]:
[susie_1]
parameter: max_L = 20
# remove a variant if it has more than imiss missing individual level data
parameter: imiss = 1
# MAF cutoff
parameter: maf = 0.0
# MAC cutoff, on top of MAF cutoff
# Here I set default to mac = 5 rather than using an MAF cutoff
parameter: mac = 5
parameter: pip_cutoff = 0.7
parameter: coverage = 0.95
depends: sos_variable("regional_data")

def group_by_region(lst, data):
    vector = [len(x) for x in data]
    return [lst[sum(vector[:i]):sum(vector[:i+1])] for i in range(len(vector))]

meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[1]}.susie_fitted.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint, input = utils_R
    # Load regional association data
    fdat = load_regional_finemapping_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1::2]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[2::2]])}),
                                          region = "${_meta_info[0]}",
                                          conditions = c('${_meta_info[2]}'),
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          imiss_cutoff = ${imiss},
                                          y_as_matrix = FALSE)
    # Fine-mapping with SuSiE
    library(susieR)  
    fitted = list()
    for (r in 1:length(fdat$residual_Y_scaled)) { ## There is no universal way to specify names because of the accommodation of missingness, so use the index instead
        st = proc.time()
        fitted[[r]] <- susie(fdat$residual_X_scaled[[r]],
                             fdat$residual_Y_scaled[[r]],
                             L=${max_L},
                             max_iter=500,
                             estimate_residual_variance=TRUE,
                             estimate_prior_variance=TRUE,
                             refine=TRUE,
                             compute_univariate_zscore=FALSE,
                             coverage=${coverage})
        fitted[[r]]$analysis_time <- proc.time() - st
        fitted[[r]] <- post_process_susie(fitted[[r]], fdat, r, signal_cutoff = ${pip_cutoff})
    }
    names(fitted) <- names(fdat$residual_Y_scaled)
    saveRDS(fitted, ${_output:ar}, compress='xz')
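
The grouping and slicing used in this step can be followed on a toy `regional_data`. The sketch below uses hypothetical file names. It flattens the per-region file lists the way they arrive as step input, regroups them with the same `group_by_region` function, and then splits each group into genotype, phenotype and covariate files with the `[0]`, `[1::2]` and `[2::2]` slices used in the R call.

```python
def group_by_region(lst, data):
    # Cut the flat list back into chunks whose sizes match the per-region lists
    sizes = [len(x) for x in data]
    return [lst[sum(sizes[:i]):sum(sizes[:i + 1])] for i in range(len(sizes))]

# Toy regional_data with hypothetical paths; gene_2 has two conditions
regional_data = {
    "data": [
        ["geno/gene_1.bed", "pheno/A.bed.gz", "cov/A.gz"],
        ["geno/gene_2.bed", "pheno/A.bed.gz", "cov/A.gz", "pheno/B.bed.gz", "cov/B.gz"],
    ],
    "meta_info": [
        ["chr1:100-200", "gene_1", "A"],
        ["chr1:300-400", "gene_2", "A','B"],
    ],
}

# All files as one flat list, then regrouped per region
flat_inputs = [f for region in regional_data["data"] for f in region]
groups = group_by_region(flat_inputs, regional_data["data"])

calls = []
for files, (region, gene, conds) in zip(groups, regional_data["meta_info"]):
    calls.append({
        "genotype": files[0],        # first file is always the genotype
        "phenotype": files[1::2],    # then phenotype and covariate files alternate
        "covariate": files[2::2],
        "region": region,
        "conditions": f"c('{conds}')",  # the "','" separator expands into an R character vector
    })

for call in calls:
    print(call)
```

The `"','"` separator in `phenotype_names` is what lets a single interpolated string become a multi-element R vector such as `c('A','B')`.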

## Multivariate SuSiE

In [None]:
[mvsusie_1]
# Prior model file generated from mashr. 
# Default will be used if it does not exist.
parameter: mixture_prior = path()
parameter: max_L = 20
# remove a variant if it has more than imiss missing individual level data
parameter: imiss = 0.1
# MAF cutoff
parameter: maf = 0.0
# MAC cutoff, on top of MAF cutoff
# Here I set default to mac = 10 rather than using an MAF cutoff
# I don't set it to 5 because I'm not so sure of the performance of SuSiE on somewhat infrequent variants
# MAC = 10 would not be too infrequent for xQTL data where the sample size is about 1,000 at most (as of 2022)
parameter: mac = 10
depends: sos_variable("regional_data")

def group_by_region(lst, data):
    vector = [len(x) for x in data]
    return [lst[sum(vector[:i]):sum(vector[:i+1])] for i in range(len(vector))]

meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[1]}.susie_fitted.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint, input = utils_R
   
    get_prior_indices <- function(Y, U) {
      # make sure the prior col/rows match the colnames of the Y matrix
      y_names = colnames(Y)
      u_names = colnames(U)
      if (is.null(y_names) || is.null(u_names)) {
          return(NULL)
      } else if (identical(y_names, u_names)) {
          return(NULL)
      } else {
          return(match(y_names, u_names))
      }
    }

    # Load regional association data
    fdat = load_regional_association_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1::2]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[2::2]])}),
                                          region = "${_meta_info[0]}",
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          imiss_cutoff = ${imiss},
                                          y_as_matrix = TRUE)

    # univariate summary statistics
    non_missing = lapply(1:ncol(fdat$residual_Y_scaled), function(r) which(!is.na(fdat$residual_Y_scaled[,r])))
    univariate_res = lapply(1:ncol(fdat$residual_Y_scaled), function(r) susieR:::univariate_regression(fdat$X[non_missing[[r]], ], fdat$residual_Y_scaled[non_missing[[r]], r]))
    sumstat = list(bhat=do.call(cbind, lapply(1:ncol(fdat$residual_Y_scaled), function(r) univariate_res[[r]]$betahat)),
                   sbhat=do.call(cbind, lapply(1:ncol(fdat$residual_Y_scaled), function(r) univariate_res[[r]]$sebetahat)))
  
    # Multivariate fine-mapping
    # FIXME: handle it when prior does not exist
    prior = readRDS(${mixture_prior:r})
    print(paste("Number of components in the mixture prior:", length(prior$U)))
    prior = mvsusieR::create_mash_prior(mixture_prior=list(weights=prior$w, matrices=prior$U), include_indices = get_prior_indices(fdat$residual_Y_scaled, prior$U[[1]]), max_mixture_len=-1)   
    resid_Y = compute_cov_flash(fdat$residual_Y_scaled)
    st = proc.time()
    fitted = mvsusieR::mvsusie(fdat$X, 
                               fdat$residual_Y_scaled, 
                               L=${max_L}, 
                               prior_variance=prior, 
                               residual_variance=resid_Y, 
                               precompute_covariances=F, 
                               compute_objective=T, 
                               estimate_residual_variance=F, 
                               estimate_prior_variance=T, 
                               estimate_prior_method='EM',
                               max_iter = 200, 
                               n_thread=${numThreads}, 
                               approximate=F)
    fitted$analysis_time = proc.time() - st
    fitted$cs_corr = susieR::get_cs_correlation(fitted, X=fdat$X)
    fitted$cs_snps = names(fitted$X_column_scale_factors[unlist(fitted$sets$cs)])
    fitted$variable_name = names(fitted$pip)
    fitted$analysis_script = load_script()
    fitted$dropped_samples = fdat$dropped_sample
    fitted$sample_names = colnames(fdat$residual_Y_scaled)
    fitted$residual_y = resid_Y
    saveRDS(fitted, ${_output:ar})
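
The parameter comments in this step weigh a MAC cutoff against an MAF cutoff. The arithmetic behind that comparison is sketched below, assuming diploid samples with no missing genotypes, so that MAC = 2 × N × MAF. The function name `mac_to_maf` is hypothetical.

```python
def mac_to_maf(mac, n_samples):
    """Minor allele frequency implied by a minor allele count (diploid, no missing calls)."""
    return mac / (2 * n_samples)

# With about 1,000 samples, as mentioned in the parameter comments
maf_at_mac10 = mac_to_maf(10, 1000)
maf_at_mac5 = mac_to_maf(5, 1000)
print(maf_at_mac10, maf_at_mac5)
```

Under these assumptions, `mac = 10` at N = 1,000 corresponds to an MAF of 0.005, and `mac = 5` to 0.0025.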

## Univariate fSuSiE

In [None]:
[fsusie_1]
parameter: max_L = 30
# remove a variant if it has more than imiss missing individual level data
parameter: imiss = 0.1
# MAF cutoff
parameter: maf = 0.0
# MAC cutoff, on top of MAF cutoff
# Here I set default to mac = 10 rather than using an MAF cutoff
# I don't set it to 5 because I'm not so sure of the performance of SuSiE on somewhat infrequent variants
# MAC = 10 would not be too infrequent for xQTL data where the sample size is about 1,000 at most (as of 2022)
parameter: mac = 10
# prior can be either of ["mixture_normal", "mixture_normal_per_scale"]
parameter: prior  = "mixture_normal_per_scale"
parameter: max_SNP_EM = 1000

depends: sos_variable("regional_data")

def group_by_region(lst, data):
    vector = [len(x) for x in data]
    return [lst[sum(vector[:i]):sum(vector[:i+1])] for i in range(len(vector))]

meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[1]}.fsusie_{prior}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint, input = utils_R
    # Load regional association data
    fdat = load_regional_association_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1::2]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[2::2]])}),
                                          region = "${_meta_info[0]}",
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          imiss_cutoff = ${imiss},
                                          y_as_matrix = FALSE)
    # Fine-mapping with fSuSiE
    library("susiF.alpha")
    fitted = list()
    for (r in 1:length(fdat$residual_Y_scaled)) {
        st = proc.time()  
        fitted[[r]] <- susiF(fdat$residual_X_scaled[[r]],
                             fdat$residual_Y_scaled[[r]],
                             pos=fdat$phenotype_coordiates[[r]], #FIXME: this needs to be edited and added to load_regional_association_data
                             L=${max_L},
                             prior="${prior}",
                             max_SNP_EM=${max_SNP_EM})
        fitted[[r]]$analysis_time = proc.time() - st
        fitted[[r]]$cs_corr = get_cs_correlation(fitted[[r]], X=fdat$residual_X_scaled[[r]])
        fitted[[r]]$cs_snps = names(fitted[[r]]$X_column_scale_factors[unlist(fitted[[r]]$sets$cs)])
        fitted[[r]]$variable_name = names(fitted[[r]]$pip)
        fitted[[r]]$coef = coef.susie(fitted[[r]])
        fitted[[r]]$analysis_script = load_script()
        fitted[[r]]$analysis_name = fdat$traits[[r]]
        fitted[[r]]$dropped_samples = fdat$dropped_sample[[r]]
        fitted[[r]]$sample_names = colnames(fdat$residual_Y_scaled[[r]])
    }
    saveRDS(fitted, ${_output:ar})

## Multivariate fSuSiE

In [None]:
[mvfsusie_1]
parameter: max_L = 30
# remove a variant if it has more than imiss missing individual level data
parameter: imiss = 0.1
# MAF cutoff
parameter: maf = 0.0
# MAC cutoff, on top of MAF cutoff
# Here I set default to mac = 10 rather than using an MAF cutoff
# I don't set it to 5 because I'm not so sure of the performance of SuSiE on somewhat infrequent variants
# MAC = 10 would not be too infrequent for xQTL data where the sample size is about 1,000 at most (as of 2022)
parameter: mac = 10
# prior can be either of ["mixture_normal", "mixture_normal_per_scale"]
parameter: prior  = "mixture_normal_per_scale"
parameter: max_SNP_EM = 1000

depends: sos_variable("regional_data")

def group_by_region(lst, data):
    vector = [len(x) for x in data]
    return [lst[sum(vector[:i]):sum(vector[:i+1])] for i in range(len(vector))]

meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[1]}.mvfsusie_{prior}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint, input = utils_R
    # Load regional association data
    fdat = load_regional_association_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1::2]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[2::2]])}),
                                          region = "${_meta_info[0]}",
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          imiss_cutoff = ${imiss},
                                          y_as_matrix = FALSE)
    # Fine-mapping with mvfSuSiE
    library("mvf.susie.alpha")
    Y = map(fdat$residual_Y_scaled, ~left_join(fdat$X[,1]%>%as.data.frame%>%rownames_to_column("rowname"), .x%>%t%>%as.data.frame%>%rownames_to_column("rowname") , by = "rowname")%>%select(-2)%>%column_to_rownames("rowname")%>%as.matrix )
    fitted <- multfsusie(Y_f = list(Y[[1]],Y[[3]]), 
                         Y_u = Reduce(cbind, Y[[2]]),
                         pos = list(pos1 =fdat$phenotype_coordiates[[1]], pos2 = fdat$phenotype_coordiates[[3]]),
                         X=fdat$X,
                         L=${max_L},
                         data.format="list_df")
    saveRDS(fitted, ${_output:ar})

In [None]:
[*_2]
input: group_by = "all"
output: f'{cwd}/{name}.susie_output.txt'
python: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    import pandas as pd
    pd.DataFrame({"output" : [$[_input:ar,]]}).to_csv("$[_output]",index = False ,header = False, sep = "\t")
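
The final step collects the per-region result paths into a single one-column, header-less, tab-separated manifest. A minimal sketch of that write and of reading it back follows. The output paths are hypothetical and a temporary directory stands in for `cwd`.

```python
import os
import tempfile
import pandas as pd

# Hypothetical per-region outputs from the previous step
outputs = [
    "/work/output/susie/protocol_example_protein.gene_1.susie_fitted.rds",
    "/work/output/susie/protocol_example_protein.gene_2.susie_fitted.rds",
]

with tempfile.TemporaryDirectory() as tmp:
    manifest = os.path.join(tmp, "protocol_example_protein.susie_output.txt")
    # One path per line, no header and no index column
    pd.DataFrame({"output": outputs}).to_csv(manifest, index=False, header=False, sep="\t")
    # Downstream code can read the manifest back as a single unnamed column
    listed = pd.read_csv(manifest, sep="\t", header=None)[0].tolist()

print(listed)
```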