# mmQTL association testing with individual-level data

This notebook performs association analysis and meta-analysis on individual-level data using mmQTL.

## Input

1. A list of regions to be analyzed (optional); the last column of this file should be the region name.
2. A list of genotype files, one per region to be analyzed, in PLINK `bed` format. The same genotype file can be used for multiple regions.
3. A vector of phenotype list files (one per condition), each listing, per region, a phenotype file in count table format.
4. A vector of covariate files corresponding to the phenotype lists above.
5. Optionally, a vector of names for the phenotypic conditions, separated by whitespace, e.g. `cond1 cond2 cond3`.

Inputs 2 and 3 should be the outputs of the `genotype_per_region` and `annotate_coord` modules from the previous preprocessing steps. Input 4 should be the output of the `covariate_preprocessing` pipeline, which contains genotype PCs, phenotypic hidden confounders and fixed covariates.
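The region selection described for input 1 can be sketched in plain Python. This is only an illustration, assuming a tab-delimited region list whose header line starts with `#`; the helper names `regions_to_keep` and `filter_regions` are hypothetical and not part of the pipeline, which does the same job with pandas.

```python
import csv
import io

def regions_to_keep(region_list_text="", region_names=()):
    """Union of the last column of a region list and explicitly named regions."""
    ids = set()
    for row in csv.reader(io.StringIO(region_list_text), delimiter="\t"):
        # Skip blank lines and header/comment lines starting with '#'
        if row and not row[0].startswith("#"):
            ids.add(row[-1])
    return ids | set(region_names)

def filter_regions(available_ids, keep):
    """Keep everything when no filter is given, otherwise only the requested IDs."""
    return [rid for rid in available_ids if not keep or rid in keep]

# Assumed example: a region list in the same layout as the phenotype list
region_list_text = (
    "#chr\tstart\tend\tID\n"
    "chr12\t752578\t752579\tENSG00000060237\n"
    "chr12\t990508\t990509\tENSG00000082805\n"
)
keep = regions_to_keep(region_list_text, ["ENSG00000111640"])
available = ["ENSG00000060237", "ENSG00000004478", "ENSG00000111640"]
print(filter_regions(available, keep))
```

Regions named in the list or on the command line but absent from the genotype list are silently ignored, and an empty filter keeps every available region.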

### Example genotype list

```
#region        path
ENSG00000000457 xqtl_workflow_testing/genotype_per_region/ENSG00000000457.bed
ENSG00000000460 xqtl_workflow_testing/genotype_per_region/ENSG00000000460.bed
ENSG00000000938 xqtl_workflow_testing/genotype_per_region/ENSG00000000938.bed
ENSG00000000971 xqtl_workflow_testing/genotype_per_region/ENSG00000000971.bed
ENSG00000001036 xqtl_workflow_testing/genotype_per_region/ENSG00000001036.bed
ENSG00000001084 xqtl_workflow_testing/genotype_per_region/ENSG00000001084.bed
ENSG00000001167 xqtl_workflow_testing/genotype_per_region/ENSG00000001167.bed
ENSG00000001460 xqtl_workflow_testing/genotype_per_region/ENSG00000001460.bed
```
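A PLINK `bed` file is only usable together with its `.bim` and `.fam` companions, and the pipeline passes the path without the `.bed` suffix to mmQTL. A quick pre-flight check can therefore be useful before launching jobs. The sketch below assumes a tab-delimited genotype list as shown above; `plink_trio` and `missing_plink_files` are hypothetical helper names.

```python
import csv
import io
from pathlib import Path

def plink_trio(bed_path):
    """Return the .bed, .bim and .fam paths that share one PLINK prefix."""
    bed = Path(bed_path)
    return [bed.with_suffix(ext) for ext in (".bed", ".bim", ".fam")]

def missing_plink_files(genotype_list_text):
    """Map each region to the PLINK files that are absent on disk."""
    missing = {}
    for row in csv.reader(io.StringIO(genotype_list_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip the header and blank lines
        region, bed_path = row[0], row[-1]
        absent = [str(p) for p in plink_trio(bed_path) if not p.exists()]
        if absent:
            missing[region] = absent
    return missing

example = (
    "#region\tpath\n"
    "ENSG00000000457\txqtl_workflow_testing/genotype_per_region/ENSG00000000457.bed\n"
)
for region, files in missing_plink_files(example).items():
    print(region, "is missing", len(files), "file(s)")
```

An empty result means every listed region has a complete fileset.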

### Example phenotype list

```
#chr    start   end ID  path
chr12   752578  752579  ENSG00000060237  MWE/output/phenotype/protocol_example.protein.count_matrix
chr12   990508  990509  ENSG00000082805  MWE/output/phenotype/protocol_example.protein.count_matrix
chr12   2794969 2794970 ENSG00000004478  MWE/output/phenotype/protocol_example.protein.count_matrix
chr12   4649113 4649114 ENSG00000139180  MWE/output/phenotype/protocol_example.protein.count_matrix
chr12   6124769 6124770 ENSG00000110799  MWE/output/phenotype/protocol_example.protein.count_matrix
chr12   6534516 6534517 ENSG00000111640  MWE/output/phenotype/protocol_example.protein.count_matrix
```
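To make the role of these columns concrete, here is a simplified restatement of how the workflow combines the two lists: each phenotype row yields a `chr:start-end` position, and genotype regions are left-joined to every condition by ID, dropping regions that have no phenotype at all. The sketch uses only the standard library, whereas the pipeline itself uses pandas; the function names and the genotype paths are assumptions for illustration.

```python
import csv
import io

def read_phenotype_list(text, condition):
    """Parse a phenotype list into {ID: (position, path, condition)}."""
    table = {}
    for r in csv.DictReader(io.StringIO(text), delimiter="\t"):
        position = f"{r['#chr']}:{r['start']}-{r['end']}"
        table[r["ID"]] = (position, r["path"], condition)
    return table

def join_regions(genotypes, phenotype_tables):
    """Left-join genotype regions with each condition; drop regions without any phenotype."""
    joined = {}
    for region, bed in genotypes.items():
        hits = [t[region] for t in phenotype_tables if region in t]
        if hits:
            joined[region] = {
                "position": hits[0][0],
                "files": [bed] + [h[1] for h in hits],  # genotype first, then phenotypes
                "conditions": [h[2] for h in hits],
            }
    return joined

pheno_text = (
    "#chr\tstart\tend\tID\tpath\n"
    "chr12\t752578\t752579\tENSG00000060237\t"
    "MWE/output/phenotype/protocol_example.protein.count_matrix\n"
)
# Hypothetical genotype paths; the second region has no phenotype and is dropped
genotypes = {
    "ENSG00000060237": "genotype_per_region/ENSG00000060237.bed",
    "ENSG00000000457": "genotype_per_region/ENSG00000000457.bed",
}
tables = [read_phenotype_list(pheno_text, "cond1"),
          read_phenotype_list(pheno_text, "cond2")]
regions = join_regions(genotypes, tables)
print(regions["ENSG00000060237"]["position"], regions["ENSG00000060237"]["conditions"])
```

A region measured in only some conditions simply ends up with a shorter `files` list, which is why the per-region entries need not all have the same length.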

## Output
**TBA**


## Minimal working example

### mmQTL

Below we reuse the example phenotype and covariate files several times to demonstrate that, when multiple phenotypes share the same genotype, this pipeline can analyze all of them together (more than two are accepted as well).

```
# The suggested output naming convention is cohort_modality, e.g. ROSMAP_snRNA_pseudobulk
sos run pipeline/MMQTL.ipynb MMQTL \
     --genoFile /sc/arion/projects/CommonMind/roussp01a/snmulti_QTL/MMQTL/Test_data_for_mmQTL/GenoFile.list \
     --phenoFile /sc/arion/projects/CommonMind/roussp01a/snmulti_QTL/MMQTL/Test_data_for_mmQTL/phenotype_list_1 \
     /sc/arion/projects/CommonMind/roussp01a/snmulti_QTL/MMQTL/Test_data_for_mmQTL/phenotype_list_2 \
    /sc/arion/projects/CommonMind/roussp01a/snmulti_QTL/MMQTL/Test_data_for_mmQTL/phenotype_list_3 \
    /sc/arion/projects/CommonMind/roussp01a/snmulti_QTL/MMQTL/Test_data_for_mmQTL/phenotype_list_4 \
    /sc/arion/projects/CommonMind/roussp01a/snmulti_QTL/MMQTL/Test_data_for_mmQTL/phenotype_list_5 \
  --grm_file /sc/arion/projects/CommonMind/roussp01a/snmulti_QTL/MMQTL/Test_data_for_mmQTL/GRM/simulated_genotype.grm.rel  --name demo --region_list test_region_list  \
  --covFile "/sc/arion/projects/CommonMind/roussp01a/snmulti_QTL/MMQTL/Test_data_for_mmQTL/phenotype_list_2" "/sc/arion/projects/CommonMind/roussp01a/snmulti_QTL/MMQTL/Test_data_for_mmQTL/phenotype_list_2" "/sc/arion/projects/CommonMind/roussp01a/snmulti_QTL/MMQTL/Test_data_for_mmQTL/phenotype_list_2" "/sc/arion/projects/CommonMind/roussp01a/snmulti_QTL/MMQTL/Test_data_for_mmQTL/phenotype_list_2" "/sc/arion/projects/CommonMind/roussp01a/snmulti_QTL/MMQTL/Test_data_for_mmQTL/phenotype_list_2"  # Placeholder only; use the actual covariate files in your analysis.
```
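When no condition names are supplied, the workflow labels each phenotype list by its file name without directory or extension, and it refuses to run if the number of names differs from the number of phenotype lists. The sketch below approximates that behavior with `pathlib`; `default_phenotype_names` and `resolve_phenotype_names` are hypothetical names, and `Path.stem` is used here as a stand-in for the workflow's own path formatting.

```python
from pathlib import Path

def default_phenotype_names(pheno_files):
    """Derive one condition name per phenotype list from its file name."""
    return [Path(p).stem for p in pheno_files]

def resolve_phenotype_names(pheno_files, names=None):
    """Use the supplied names, or fall back to file-name defaults, and check the count."""
    names = list(names) if names else default_phenotype_names(pheno_files)
    if len(names) != len(pheno_files):
        raise ValueError(
            "Number of input phenotype files must match the number of phenotype names"
        )
    return names

# Hypothetical paths for illustration
files = ["data/phenotype_list_1", "data/phenotype_list_2"]
print(resolve_phenotype_names(files))
print(resolve_phenotype_names(files, ["Ast", "Mic"]))
```

With the command above, the five phenotype lists would therefore be labeled `phenotype_list_1` through `phenotype_list_5` unless names are given explicitly.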

It is also possible to only analyze a selected list of regions by name, using either option `--region-list` or option `--region-name` or both. The command below will include 6 regions to analyze:

In [1]:
[global]
# A list of file paths for genotype data. 
parameter: genoFile = path
# One or multiple lists of file paths for phenotype data.
parameter: phenoFile = paths

# Optional: if a region list is provided, the analysis will focus on the listed regions.
# The LAST column of this list must contain the IDs of the regions to focus on.
# Otherwise, all regions with both genotype and phenotype files will be analyzed.
parameter: region_list = path()
# Optional: if region names are provided,
# the analysis will focus on the union of the provided region list and region names.
parameter: region_name = []
parameter: cwd = path("output")
# It is required to input the name of the analysis
parameter: name = str
# path to utility script. In the future we will consolidate this into an R package.
parameter: utils_R = path("pipeline/xqtl_utils.R")
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number of commands to run per job
parameter: job_size = 3
# Wall clock time expected
parameter: walltime = "10h"
# Memory expected
parameter: mem = "90G"
# Number of threads
parameter: numThreads = 2
# Name of phenotypes
parameter: phenotype_names = [f'{x:bn}' for x in phenoFile]
utils_R = f"{utils_R:a}"

In [None]:
[get_analysis_regions: shared = "regional_data"]
# Inputs are genoFile, phenoFile, covFile and optionally region_list. If region_list is present, only the regions it contains are analyzed.
# regional_data should be a dictionary with:
# 1. a list of tuples: {data: [(gene_1.genotype, condition_1, cov_1), (gene_2.genotype, condition_1, cov_1, condition_2, cov_2), ...]}; elements may differ in length
# 2. a list of region meta_info: {meta_info: [("chr:start-end", gene_1, "cond_1"), ("chr:start-end", gene_2, "cond_1','cond_2"), ...]}
import pandas as pd
import os

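# Normalize the header so that both "#region  path" and "#id  #path" layouts work with the code below.
genoFile = pd.read_csv(genoFile, sep = "\t", header=0).rename(columns = {"#region":"#id", "path":"#path"})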

#if len(phenoFile) != len(covFile):
#    raise ValueError("Number of input phenotypes files must match that of covariates files")
if len(phenoFile) != len(phenotype_names):
    raise ValueError("Number of input phenotypes files must match the number of phenotype names")
## pos and covar are condition specific, this way when there is no phenotype file, there is na in the corresponding column.
phenoFile = [pd.read_csv(x, sep = "\t", header=0).assign(pos = lambda y:y['#chr'].astype("str")+':'+y['start'].astype("str")+'-'+y['end'].astype("str")+";"+y["ID"].astype("str") 
                                              ).assign( cond = a ).drop(columns = ["#chr","start","end"]).rename(columns = {"ID":"#id"})   
             for x,a in zip(phenoFile,phenotype_names)]
for i in range(len(phenoFile)):
    genoFile = genoFile.merge(phenoFile[i], on='#id', how='left', suffixes = (f'{i}_x', f'{i}_y'))

# remove id that has no phenotype.
genoFile = genoFile[~genoFile.drop(columns=['#id',"#path"]).isna().all(axis=1)] 
if len(genoFile.index) == 0:
    raise ValueError("No region overlap between genotype #id and any of the phenotypes ID")

# Get position for meta_data
genoFile.to_csv("test_df", sep = "\t", index = False)
pos_col = [col for col in genoFile.columns if col.startswith('pos')]
# Use the first non-missing position of each row as its index, then strip the ";ID" suffix
first_pos = genoFile[pos_col].apply(lambda row: row.dropna().iloc[0], axis=1)
genoFile.index = [x.split(";")[0] for x in first_pos]
# Get the conditions strings for each ID
cond_col = [col for col in genoFile.columns if col.startswith('cond')]
genoFile["phenotype_names"] = ["','".join(pd.Series((x)).dropna()) for x in genoFile[cond_col].to_dict("split")["data"]]
# Clean up
genoFile = genoFile.drop(columns = cond_col).drop(columns = pos_col)

region_ids = []

# If region_list is provided, read the file and extract IDs
if region_list.is_file():
    region_list_df = pd.read_csv(region_list, sep = "\t", header=None, comment = "#")
    region_ids = region_list_df.iloc[:, -1].unique()  # Extracting the last column for IDs

# If region_name is provided, include those IDs as well
# --region-name A B C will result in a list of ["A", "B", "C"] here
if len(region_name) > 0:
    region_ids = list(set(region_ids).union(set(region_name)))

# If either region_list or region_name is provided, filter the genoFile
if len(region_ids) > 0:
    genoFile = genoFile[genoFile['#id'].isin(region_ids)]

# Collect the non-missing file paths of each region from the (already filtered) genoFile.
file_inv = genoFile.drop(columns = ["#id", "phenotype_names"]).to_dict("split")
file_inv['data'] = [[value for value in sublist if not pd.isna(value)] for sublist in file_inv['data']] 


## There will always be a genotype file because of the left join.
## There will always be a covar file as len(covFile) must == len(phenoFile), and the covar column is the same string across all rows.
## So a problem only arises if there is no bed.gz.
regional_data = {"data": file_inv["data"],
                 "meta_info": genoFile[["#id", "phenotype_names"]].reset_index().to_dict("split")['data']}

In [None]:
[MMQTL_1]
## This step further processes the input data into files before running mmQTL on it
depends: sos_variable("regional_data")
parameter: grm_file = path()
meta_info = regional_data['meta_info']
data = regional_data["data"]
print(len(meta_info))
print(len(data))
input: for_each=dict(_data=data, _meta_info=meta_info)
output: f'{cwd:a}//{step_name[:-2]}/{_meta_info[1]}/genotype_input_tmp',
        f'{cwd:a}//{step_name[:-2]}/{_meta_info[1]}/phenotype_input_tmp'
task: trunk_workers = 1, trunk_size = 50, walltime = "1h", mem = "5G", cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: expand = '${ }', stdout = f"{_output[0]}.stdout", stderr = f"{_output[0]}.stderr", container = container, entrypoint = entrypoint
    library("dplyr")
    library("readr")
    genotype = "${_data[0].replace('.bed','')}"
    phenotype = c('${"','".join([x for x in _data[1::1]])}')
    tib = tibble(genotype = genotype, phenotype = phenotype ${f", grm = '{grm_file}'" if grm_file.is_file() else ""} )
    tib%>%select(genotype)%>%write_delim("${_output[0]}",col_names = FALSE)
    tib%>%select(phenotype)%>%write_delim("${_output[1]}",col_names = FALSE)
    ${f'tib%>%select(grm)%>%write_delim("{_output[0]}.grm", col_names = FALSE)' if grm_file.is_file() else ""}

In [None]:
[MMQTL_2]
parameter: grm_file = path()
depends: sos_variable("regional_data")
meta_info = regional_data['meta_info']
input: group_with = "meta_info"
output: f'{cwd:a}//{step_name[:-2]}/{_meta_info[1]}/{name}.{_meta_info[1]}._peak_1_statistical_signal'
task: trunk_workers = 1, trunk_size = 1 , walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand = '${ }', stdout = f"{_output}.stdout", stderr = f"{_output}.stderr", container = container, entrypoint = entrypoint
    cd ${_output:dd}
    MMQTL26 -P ${_input[1]:a} -Z ${_input[0]:a} -a ${region_list:a} -T ${_meta_info[1]}  -o ${_output:bn}. ${f'--grm_file {_input[0]:a}.grm' if grm_file.is_file() else "" }