# Fine-mapping with SuSiE model

This notebook performs statistical fine-mapping using the SuSiE model on individual-level data.

## Input

1. A list of regions to be analyzed (optional); the last column of this file should be the region name.
2. A list of per-region genotype files to be analyzed, in PLINK `bed` format.
3. A vector of lists of per-region phenotype files to be analyzed, in UCSC `bed.gz` format with a `bed.gz.tbi` index.
4. A vector of covariate files corresponding to the phenotype lists above.

Inputs 2 and 3 should be outputs of the `genotype_per_region` and `phenotype_per_region` modules from the previous preprocessing steps. Input 4 should be the output of the `covariate_preprocessing` pipeline, which contains genotype PCs, phenotypic hidden confounders, and fixed covariates.

### Example genotype list

```
# region        dir
ENSG00000000457 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000000457.bed
ENSG00000000460 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000000460.bed
ENSG00000000938 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000000938.bed
ENSG00000000971 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000000971.bed
ENSG00000001036 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000001036.bed
ENSG00000001084 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000001084.bed
ENSG00000001167 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000001167.bed
ENSG00000001460 /mnt/mfs/statgen/xqtl_workflow_testing/genotype_per_region/ENSG00000001460.bed
```

### Example phenotype list

```
# region dir
ENSG00000000457 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_per_region/ENSG00000000457.t1.bed.gz
ENSG00000000460 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_per_region/ENSG00000000460.t1.bed.gz
ENSG00000000938 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_per_region/ENSG00000000938.t1.bed.gz
ENSG00000000971 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_per_region/ENSG00000000971.t1.bed.gz
ENSG00000001036 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_per_region/ENSG00000001036.t1.bed.gz
ENSG00000001084 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_per_region/ENSG00000001084.t1.bed.gz
ENSG00000001167 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_per_region/ENSG00000001167.t1.bed.gz
ENSG00000001460 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_per_region/ENSG00000001460.t1.bed.gz
```
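
Both list files are tab-delimited tables that map a region ID to a file path. The following is a minimal sketch of loading one, assuming the first header field is `#id` (the key that `get_analysis_regions` merges on); the `dir` column name and the paths are placeholders rather than real pipeline outputs.

```python
import csv
import io

# Hypothetical list contents; a real list would be read from a file on disk.
genotype_list = io.StringIO(
    "#id\tdir\n"
    "ENSG00000000457\t/path/to/genotype_per_region/ENSG00000000457.bed\n"
    "ENSG00000000460\t/path/to/genotype_per_region/ENSG00000000460.bed\n"
)

# Map each region ID to the path of its per-region file.
reader = csv.DictReader(genotype_list, delimiter="\t")
region_to_path = {row["#id"]: row["dir"] for row in reader}

for region, path in region_to_path.items():
    print(region, path)
```

A lookup table like this makes it easy to check, before launching the workflow, which regions appear in both the genotype and the phenotype lists.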

## Output

For each analysis region, the output is a fitted SuSiE model saved in RDS format.

## Minimal working example

```
# Suggested output naming convention is cohort_modality, e.g. ROSMAP_snRNA_pseudobulk
sos run SuSiE.ipynb susie \
    --name cohort_modality \
    --genoFile /path/to/genotype_list \
    --phenoFile /path/to/phenotype_list_1 /path/to/phenotype_list_2 /path/to/phenotype_list_3 \
    --covFile phenotype_1_cov phenotype_2_cov phenotype_3_cov \
    --utils-R /path/to/misc/xqtl_utils.R \
    --container /path/to/stephenslab.sif
```


sos run pipeline/SuSiE.ipynb susie \
    --name protocol_example_protein \
    --genoFile output/phenotype_by_region/protocol_example.protein.enhanced_cis_chr21_chr22_genotype_by_region/protocol_example.genotype.chr21_22.genotype_by_region_files.txt \
    --phenoFile output/phenotype_by_region/protocol_example.protein.bed.phenotype_by_region_files.txt \
                output/phenotype_by_region/protocol_example.protein.bed.phenotype_by_region_files.txt  \
    --covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
        output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz  \
    --container containers/stephenslab.sif -n


In [1]:
[global]
import glob
import pandas as pd
# A list of file paths for genotype data. 
parameter: genoFile = path
# One or multiple lists of file paths for phenotype data.
parameter: phenoFile = paths
# Covariate file path
parameter: covFile = paths
# Optional: if a region list is provided, the analysis will focus on the listed regions.
# The LAST column of this list should contain the IDs of the regions to focus on.
# Otherwise, all regions with both genotype and phenotype files will be analyzed.
parameter: region_list = path()
parameter: cwd = path("output")
# It is required to input the name of the analysis
parameter: name = str
# path to utility script. In the future we will consolidate this into an R package.
parameter: utils_R = path("misc/xqtl_utils.R")
parameter: container = ""
parameter: entrypoint= {('micromamba run -a "" -n' + ' ' + container.split('/')[-1][:-4]) if container.endswith('.sif') else f''}
# For cluster jobs, number of commands to run per job
parameter: job_size = 100
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 2


utils_R = f"{utils_R:a}"
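
The default for `entrypoint` is derived from the container path. As a rough illustration, the sketch below reproduces that string manipulation in a standalone function; `derive_entrypoint` is a hypothetical name, and it returns a plain string rather than the braced expression used in the parameter definition.

```python
def derive_entrypoint(container: str) -> str:
    """Build the micromamba prefix from a `.sif` path, or return an empty string."""
    if container.endswith(".sif"):
        # Drop the directory part and the 4-character ".sif" suffix.
        env_name = container.split("/")[-1][:-4]
        return 'micromamba run -a "" -n' + " " + env_name
    return ""


print(derive_entrypoint("containers/stephenslab.sif"))
# micromamba run -a "" -n stephenslab
```

With this default, the environment name is simply the container file name without its extension, so renaming the `.sif` file changes the environment that is activated.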

## Univariate SuSiE

In [None]:
[get_analysis_regions: shared = "regional_data"]
# Input is genoFile, phenoFile, covFile and, optionally, region_list. If region_list is present, we only analyze the regions it contains.
#
# regional_data should be a dictionary with:
# 1. a list of tuples: {data: [(gene_1.genotype, gene_1.condition_1, cov_1, gene_1.condition_2, cov_2), (gene_2.genotype, gene_2.condition_1, cov_1, gene_2.condition_2, cov_2), ...]}
# 2. a list of region ID: {ID: [gene_1, gene_2, ...]}
# FIXME: Hao, please implement the details
import pandas as pd
genoFile = pd.read_csv(genoFile, sep = "\t")
phenoFile = [pd.read_csv(x, sep = "\t") for x in phenoFile]
if len(phenoFile) > len(covFile):
    raise ValueError("There are more phenotype files specified than covariate files")
for i in range(len(phenoFile)):
    genoFile = genoFile.merge(phenoFile[i], on='#id', how='left', suffixes = (f'{i}_x', f'{i}_y')).assign(**{f'{i}_covar': covFile[i]})
file_inv = genoFile.set_index("#id").to_dict("split")

## Each row is laid out as: genotype, then one (phenotype, covariate) pair per condition.
## There will always be a genotype file because of the left join, and always a covariate
## entry because the covariate column holds the same string across all rows.
## Only the phenotype can be missing, so drop each pair whose phenotype is missing
## (keeping phenotypes and covariates aligned for susie_1) and drop regions left with
## no phenotype at all, together with their IDs so that data and ID stay in sync.
data, ID = [], []
for region, row in zip(file_inv["index"], file_inv["data"]):
    pairs = [(p, c) for p, c in zip(row[1::2], row[2::2]) if not pd.isna(p)]
    if pairs:
        data.append([row[0]] + [value for pair in pairs for value in pair])
        ID.append(region)

regional_data = {"data": data, "ID": ID}
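
The row layout built by the merge loop can be illustrated on toy data. This is a minimal sketch, assuming hypothetical file names and a `dir` path column; only the `#id` key and the merge call mirror the step above.

```python
import pandas as pd

# Hypothetical toy lists; file names are placeholders, not real pipeline outputs.
geno = pd.DataFrame({"#id": ["gene_1", "gene_2"],
                     "dir": ["gene_1.bed", "gene_2.bed"]})
pheno = [
    pd.DataFrame({"#id": ["gene_1", "gene_2"],
                  "dir": ["gene_1.t1.bed.gz", "gene_2.t1.bed.gz"]}),
    pd.DataFrame({"#id": ["gene_1"], "dir": ["gene_1.t2.bed.gz"]}),
]
cov = ["cov_1.gz", "cov_2.gz"]

merged = geno
for i, (p, c) in enumerate(zip(pheno, cov)):
    # The left join keeps every genotype region; a missing phenotype becomes NaN.
    merged = merged.merge(p, on="#id", how="left",
                          suffixes=(f"{i}_x", f"{i}_y")).assign(**{f"{i}_covar": c})

rows = merged.set_index("#id").to_dict("split")
gene_1 = rows["data"][0]
# Genotype first, then alternating phenotype and covariate entries.
genotype, phenotypes, covariates = gene_1[0], gene_1[1::2], gene_1[2::2]
print(genotype, phenotypes, covariates)
```

In this toy table `gene_2` has no phenotype in the second condition, so that slot holds `NaN` after the left join. The `[1::2]` and `[2::2]` slices are the same ones `susie_1` uses to separate phenotype files from covariate files.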


In [1]:
[susie_1]
parameter: max_L = 20
# Remove a variant if more than a fraction imiss of its individual-level data is missing
parameter: imiss = 0.1
# MAF cutoff
parameter: maf = 0
# MAC cutoff, applied on top of the MAF cutoff
# Here I set the default to mac = 10 rather than using an MAF cutoff
# I don't set it to 5 because I'm not so sure of the performance of SuSiE on somewhat infrequent variants
# MAC = 10 would not be too infrequent for xQTL data, where the sample size is about 1,000 at most (as of 2022)
parameter: mac = 10
depends: sos_variable("regional_data")

def group_by_region(lst, data):
    vector = [len(x) for x in data]
    return [lst[sum(vector[:i]):sum(vector[:i+1])] for i in range(len(vector))]

input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"])
output: f'{cwd:a}/{step_name[:-2]}/{name}.{regional_data["ID"][_index]}.susie_fitted.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint, input = utils_R
    # Load regional association data
    fdat = load_regional_association_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1::2]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[2::2]])}),
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          imiss_cutoff = ${imiss},
                                          y_as_matrix = FALSE)
    # Fine-mapping with SuSiE
    library("susieR")
    fitted = list()
    for (r in seq_along(fdat$residual_Y_scaled)) { ## Can't have a universal way to specify names due to the accommodation of missingness; use index instead
        st = proc.time()
        fitted[[r]] <- susie(fdat$residual_X_scaled[[r]],
                             fdat$residual_Y_scaled[[r]],
                             L=${max_L},
                             max_iter=1000,
                             estimate_residual_variance=TRUE,
                             estimate_prior_variance=TRUE,
                             refine=TRUE,
                             compute_univariate_zscore=TRUE, 
                             coverage=0.95)
        fitted[[r]]$time = proc.time() - st
        fitted[[r]]$cs_corr = get_cs_correlation(fitted[[r]], X=fdat$residual_X_scaled[[r]])
        fitted[[r]]$cs_snps = names(fitted[[r]]$X_column_scale_factors[unlist(fitted[[r]]$sets$cs)])
        fitted[[r]]$variable_name = names(fitted[[r]]$pip)
        fitted[[r]]$coef = coef.susie(fitted[[r]])
        fitted[[r]]$analysis_script = load_script()
        fitted[[r]]$molecular_trait = fdat$traits[[r]]
        fitted[[r]]$dropped_samples = fdat$dropped_sample[[r]]
        fitted[[r]]$sample_names = colnames(fdat$residual_Y_scaled[[r]])


    }
    saveRDS(fitted, ${_output:ar})
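
The `group_by_region` helper in `susie_1` cuts one flat list of input files back into per-region groups. Below is a small worked example on hypothetical file names, assuming SoS hands the lambda the flattened list of inputs. The `itertools.accumulate` variant is an illustrative alternative, not part of the pipeline.

```python
from itertools import accumulate


def group_by_region(lst, data):
    # Same helper as in `susie_1`: slice the flat list into chunks sized like each region's entry.
    vector = [len(x) for x in data]
    return [lst[sum(vector[:i]):sum(vector[:i + 1])] for i in range(len(vector))]


# Hypothetical regional data: gene_1 has two conditions, gene_2 has one.
data = [
    ["gene_1.bed", "gene_1.t1.bed.gz", "cov_1.gz", "gene_1.t2.bed.gz", "cov_2.gz"],
    ["gene_2.bed", "gene_2.t1.bed.gz", "cov_1.gz"],
]
flat = [path for region in data for path in region]  # all files as one flat list

groups = group_by_region(flat, data)
print([len(g) for g in groups])  # [5, 3]

# Alternative: compute the chunk boundaries once instead of re-summing for every region.
offsets = [0] + list(accumulate(len(x) for x in data))
groups_fast = [flat[a:b] for a, b in zip(offsets, offsets[1:])]
print(groups_fast == groups)  # True
```

The helper re-sums the prefix lengths for every region, which is quadratic in the number of regions. The accumulate variant gives the same groups in a single pass, which may matter when many thousands of regions are analyzed.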

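To put the default `mac = 10` in context, the arithmetic below converts a minor allele count into the frequency it implies. This is a sketch under the usual definition for diploid samples with no missing genotypes; `mac_to_maf` is a hypothetical helper, and how `load_regional_association_data` applies its cutoffs is not shown in this notebook.

```python
def mac_to_maf(mac: int, n_samples: int, ploidy: int = 2) -> float:
    """Minor allele frequency implied by a minor allele count, assuming no missing genotypes."""
    return mac / (ploidy * n_samples)


# With about 1,000 diploid samples there are 2,000 alleles in total.
print(mac_to_maf(10, 1000))  # 0.005
```

At this sample size, the default MAC cutoff corresponds to an MAF of 0.5%, and the implied frequency threshold rises as the sample size shrinks.
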
In [None]:
[susie_2]
input: group_by = "all"
output: f'{cwd}/{name}.susie_output.txt'
python: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    import pandas as pd
    pd.DataFrame({"output" : [$[_input:ar,]]}).to_csv("$[_output]",index = False ,header = False, sep = "\t")
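
The `susie_2` step writes a single-column, headerless, tab-separated manifest of the fitted model files. The sketch below writes such a manifest to a temporary directory and reads it back; the RDS paths are hypothetical stand-ins for the `susie_1` outputs.

```python
import os
import tempfile

import pandas as pd

# Hypothetical fitted-model paths standing in for the `susie_1` outputs.
rds_paths = [
    "/abs/output/susie/cohort_modality.gene_1.susie_fitted.rds",
    "/abs/output/susie/cohort_modality.gene_2.susie_fitted.rds",
]

with tempfile.TemporaryDirectory() as tmp:
    manifest = os.path.join(tmp, "cohort_modality.susie_output.txt")
    # Same call shape as in `susie_2`: one column, no header, no index.
    pd.DataFrame({"output": rds_paths}).to_csv(manifest, index=False, header=False, sep="\t")
    # Read the manifest back: one path per line.
    with open(manifest) as fh:
        listed = [line.rstrip("\n") for line in fh]

print(listed)
```

Because the file has no header, downstream code can treat every line as a path and load each RDS file in turn.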