# Extract genome-wide data for multivariate analysis

## Description

This notebook prepares input data for Ultimate Deconvolution, which generates the mixture prior for mvSuSiE, or for MASH analysis. It outputs three sets of $Z$-scores: $Z_s$ (strong), $Z_n$ (null) and $Z_r$ (random).

* $Z_s$ (strong $Z$-scores): **these are now extracted from genome-wide cis-analysis fine-mapping results**. We extract the top loci data frame for each condition, with the CS threshold set to 0.7, then merge their $Z$-scores into one data frame.
* $Z_n$ (null $Z$-scores): we first extract up to $M$ candidate SNPs from each region that satisfy $|z| \le 2$, then intersect them with the list of independent SNPs to keep only independent variants, and finally take the union of the extracted variants.
* $Z_r$ (random $Z$-scores): we randomly sample variants from the input list of independent variants.
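
The null and random selections described above can be sketched in a few lines. This is an illustrative Python sketch only, not the `pecotmr` implementation: the function name `sample_null_and_random`, its arguments, and the choice to require $|z| \le 2$ in every condition are assumptions made for the example.

```python
import numpy as np

def sample_null_and_random(z, independent, n_null, n_random, seed=999):
    """Pick null and random row indices from a variants-by-conditions z matrix.

    z           : 2-D array, rows are variants, columns are conditions.
    independent : boolean mask marking variants on the independent list.
    All names here are hypothetical and used only for illustration.
    """
    rng = np.random.default_rng(seed)
    candidates = np.flatnonzero(independent)
    # Null pool: independent variants with |z| <= 2 in every condition (assumption).
    is_null = np.all(np.abs(z[candidates]) <= 2, axis=1)
    null_pool = candidates[is_null]
    # Shuffle, then keep at most the requested number of variants.
    null_idx = rng.permutation(null_pool)[:n_null]
    # Random pool: any independent variant, regardless of its z-scores.
    random_idx = rng.permutation(candidates)[:n_random]
    return np.sort(null_idx), np.sort(random_idx)

z = np.array([[0.5, -1.2], [3.1, 0.4], [-0.3, 1.9], [1.0, 1.0], [-2.5, -2.6]])
independent = np.array([True, True, True, False, True])
null_idx, random_idx = sample_null_and_random(z, independent, n_null=2, n_random=3)
```

In this toy input only rows 0 and 2 are both independent and null, so they form the whole null set, while the random set is drawn from the four independent rows.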

**FIXME: We need to apply the independent list of variants Anqi developed and use it here to filter and get $Z_n$ and $Z_r$. This logic should be added to `processing_1`. It might also be a good idea to move some of these utility functions into the `pecotmr` package for better maintenance. For example, the function in `processing_1` that loads regional summary statistics from tensorQTL into a matrix should be packed into `pecotmr`, along with the function `handle_nan_etc`. `processing_2` can stay as is; the `susie_signal` step can also go into `pecotmr` as a way for users to summarize signals from SuSiE for other purposes.**

## Input
1. **Marginal summary statistics files**: Bgzipped summary statistics for chromosomes 1-22, generated by tensorQTL cis-analysis and indexed by `tabix`.
2. **Fine-mapping results file index**: Path to a list of fine-mapped RDS files from the fine-mapping output.
3. **Genome region partition** (optional): Defines the genomic region for each gene, as an enhanced cis region, from which $Z_n$ and $Z_r$ should be extracted. This list is the one used for fine-mapping, so if the complete list of fine-mapping RDS files (rather than a handful of them) is already available (#2 above), there is no need to provide this file. Otherwise, extraction is limited to the listed regions, which is also useful for testing.
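
For orientation, the `susie_to_mash_1` step later in this notebook describes each line of the fine-mapping index as six whitespace-separated columns: chromosome, start, end, region ID, comma-separated file names, and comma-separated condition names. A minimal Python sketch of reading one such line follows; the `RegionRecord` type, the `parse_index_line` helper, and the example file names are hypothetical and only illustrate the layout.

```python
from typing import List, NamedTuple

class RegionRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    region_id: str
    files: List[str]
    conditions: List[str]

def parse_index_line(line: str) -> RegionRecord:
    """Split one index line into its six documented columns."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, found {len(fields)}")
    chrom, start, end, region_id, files, conditions = fields
    return RegionRecord(chrom, int(start), int(end), region_id,
                        files.split(","), conditions.split(","))

# Hypothetical example line; the file names are made up for illustration.
record = parse_index_line("chr1 1000 2000 geneX geneX.A.rds,geneX.B.rds A,B")
```

Checking the column count up front gives a clear error for malformed lines instead of a confusing failure later in the R step.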

## Output
A list of 10 elements:

```
List of 10
 $ random.z: num [1:36, 1:2] -0.785 -0.785 -0.785 -0.785 -0.785 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:36] "1:97960:A:G" "1:138565:G:A" "1:15112:C:T" "1:189947:G:A" ...
  .. ..$ : chr [1:2] "A" "B"
 $ null.z  : num [1:36, 1:2] -0.785 -0.785 -0.785 -0.785 -0.785 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:36] "1:93692:C:T" "1:273645:A:G" "1:10442:CCTA:." "1:198942:A:C" ...
  .. ..$ : chr [1:2] "A" "B"
 $ random.b: num [1:36, 1:2] -0.123 -0.123 -0.123 -0.123 -0.123 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:36] "1:97960:A:G" "1:138565:G:A" "1:15112:C:T" "1:189947:G:A" ...
  .. ..$ : chr [1:2] "A" "B"
 $ null.b  : num [1:36, 1:2] -0.123 -0.123 -0.123 -0.123 -0.123 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:36] "1:93692:C:T" "1:273645:A:G" "1:10442:CCTA:." "1:198942:A:C" ...
  .. ..$ : chr [1:2] "A" "B"
 $ null.s  : num [1:36, 1:2] 0.157 0.157 0.157 0.157 0.157 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:36] "1:93692:C:T" "1:273645:A:G" "1:10442:CCTA:." "1:198942:A:C" ...
  .. ..$ : chr [1:2] "A" "B"
 $ random.s: num [1:36, 1:2] 0.157 0.157 0.157 0.157 0.157 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:36] "1:97960:A:G" "1:138565:G:A" "1:15112:C:T" "1:189947:G:A" ...
  .. ..$ : chr [1:2] "A" "B"
 $ strong.b:Classes ‘data.table’ and 'data.frame':	1 obs. of  2 variables:
  ..$ A: num -0.217
  ..$ B: num -0.217
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ strong.s:Classes ‘data.table’ and 'data.frame':	1 obs. of  2 variables:
  ..$ A: num 0.0481
  ..$ B: num 0.0481
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ strong.z:Classes ‘data.table’ and 'data.frame':	1 obs. of  2 variables:
  ..$ A: num -4.5
  ..$ B: num -4.5
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ XtX     : num [1:2, 1:2] 20.3 20.3 20.3 20.3
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "A" "B"
  .. ..$ : chr [1:2] "A" "B"
```
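
As a rough guide to how the derived entries relate to each other: the `susie_to_mash_1` step forms the strong $Z$-scores as `bhat / sbhat`, and the final merge step computes a scaled cross-product of the strong $Z$-scores, `t(Z) %*% Z / nrow(Z)`. The Python sketch below mirrors those two operations with NumPy; the function names are hypothetical and the single-row input is a toy value, not real output.

```python
import numpy as np

def z_scores(bhat, sbhat):
    """Element-wise z = effect estimate / standard error."""
    return np.asarray(bhat, dtype=float) / np.asarray(sbhat, dtype=float)

def scaled_cross_product(z):
    """Return Z^T Z / n, where n is the number of rows (variants) in Z."""
    z = np.atleast_2d(np.asarray(z, dtype=float))
    return z.T @ z / z.shape[0]

# Toy input: one strong variant measured in two conditions.
strong_z = np.array([[-4.5, -4.5]])
ztz = scaled_cross_product(strong_z)
```

With one variant and $z = -4.5$ in both conditions, every entry of the resulting 2 x 2 matrix is $(-4.5)^2 / 1 = 20.25$.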

### Example

In [None]:
# generate random and null only
sos run pipeline/mash_preprocessing.ipynb processing \
    --name protocol_example_protein \
    --sum_files test_pQTL_asso_list \
               test_pQTL_asso_list \
    --region_file test.region \
    --traits A B 


In [None]:
# generate strong only
sos run pipeline/mash_preprocessing.ipynb susie_signal \
    --name protocol_example_protein \
    --susie_list protocol_example_protein.susie_output.txt \
    --traits A B 


In [None]:
# generate mashr input directly
sos run pipeline/mash_preprocessing.ipynb mash_input \
    --name protocol_example_protein \
    --sum_files test_pQTL_asso_list \
               test_pQTL_asso_list \
    --region_file test.region \
    --susie_list protocol_example_protein.susie_output.txt \
    --traits A B


In [1]:
[global]
parameter: name = str
# Path to the working directory where output is written
parameter: cwd = path("./output")
parameter: seed = 999
parameter: n_random = 10
parameter: n_null = 10
parameter: expected_ncondition = 0
parameter: exclude_condition = []
# Container that provides the necessary packages
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "10min"
# Memory expected
parameter: mem = "8G"
# Number of threads
parameter: numThreads = 8
# This is in principle required; but in practice it can be optional if we are not exactly stringent about getting independent SNPs
parameter: independent_variant_list = path

In [None]:
# extract data for MASH from summary stats
[susie_to_mash_1]
parameter: per_chunk = 100
parameter: susie_list = path
# The first 3 columns are chr, start, end; the 4th column is the region ID; the 5th column lists file names; the 6th column lists all condition names, comma-separated
susie_data = [("c(" + ",".join(f"'{y}'" for y in x.strip().split()) + ")") for x in open(susie_list).readlines()] 
input: for_each = susie_data, group_by = per_chunk
output: f"{cwd}/{name}_cache/{name}_batch{_index+1}.rds"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{_output:bn}'
R: expand = "${ }",stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
    merge_data = function(res, one_data) {
      if (length(res) == 0) {
          return(one_data)
      } else if (is.null(res$random.b) || is.null(res$null.b)) {
          return(one_data)
      } else if (is.null(one_data)) {
          return(res)
      } else {
          for (d in names(one_data)) {
            if (is.null(one_data[[d]])) {
              next
            } else {
                res[[d]] = as.matrix(rbind(res[[d]],as.data.frame(one_data[[d]])))
            }
          }
          return(res)
      }
    }
    exclude_condition = c(${",".join([repr(x) for x in exclude_condition])})
    meta_df = rbind(${",".join(_susie_data)})
    res = list()
    for (i in seq_len(nrow(meta_df))) {
      line = meta_df[i,]
      # Column 5 holds comma-separated file names; column 6 holds comma-separated condition names
      region_files = unlist(strsplit(line[5], split=","))
      traits = unlist(strsplit(line[6], split=","))
      strong <- pecotmr::load_multitrait_R_sumstat(region_files, top_loci=TRUE)
      dat <- pecotmr::load_multitrait_R_sumstat(region_files, filter_file=${independent_variant_list:r})
      if (length(exclude_condition)>0) {
          strong = list(strong.z=strong$bhat[,-exclude_condition]/strong$sbhat[,-exclude_condition], strong.b=strong$bhat[,-exclude_condition], strong.s=strong$sbhat[,-exclude_condition])
      } else {
          strong = list(strong.z=strong$bhat/strong$sbhat, strong.b=strong$bhat, strong.s=strong$sbhat)
      }
      ran_null <- pecotmr::mash_ran_null_sample(dat, ${n_random}, ${n_null}, ${expected_ncondition}, exclude_condition, seed=${seed})
      res <- merge_data(res, c(strong, ran_null))
    }
    
    saveRDS(res, ${_output:r}, compress="xz")

In [None]:
[susie_to_mash_2]
input: group_by = "all"
output: f"{cwd}/{name}.mash_input.rds" 
task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '16G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }", container = container, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint

  # Function to merge data
  merge_data <- function(res, one_data) {
    if (length(res) == 0) {
      return(one_data)
    } else if (is.null(one_data)) {
      return(res)
    } else {
      for (d in names(one_data)) {
        if (is.null(one_data[[d]])) {
          next
        } else {
          res[[d]] <- as.matrix(rbind(res[[d]], as.data.frame(one_data[[d]])))
        }
      }
      return(res)
    }
  }

  dat = list()
  for (f in c(${_input:r,})) {
    dat = merge_data(dat, readRDS(f))
  }
  dat$ZtZ = t(as.matrix(dat$strong.z)) %*% as.matrix(dat$strong.z) / nrow(dat$strong.z)
  saveRDS(dat, ${_output:r}, compress="xz")
 
bash: expand = "${ }", container = container, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    rm -rf ${cwd}/${name}_cache/

## Get the random and null effects per analysis unit

**FIXME: note that we no longer rely on tensorQTL results for MASH analysis; our new protocol uses SuSiE output. We keep this code for extracting data from tensorQTL results because it may still be relevant for trans-analysis, but it is currently limited to extracting random and null variants.**
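
The step below selects summary statistics files for a region by stripping the `chr` prefix from the chromosome ID and searching each listed path for that number surrounded by dots. The following standalone Python sketch shows the same matching idea; the helper name `files_for_chromosome` and the file names are hypothetical, and `re.escape` is an addition of this sketch rather than something the step itself uses.

```python
import re
from typing import List

def files_for_chromosome(chr_id: str, file_list: List[str]) -> List[str]:
    """Return paths containing '.<chromosome number>.' for the given chromosome ID."""
    # Strip a leading "chr" so that "chr1" and "1" are treated alike.
    chrom = chr_id[3:] if chr_id.startswith("chr") else chr_id
    # Literal dots on both sides keep "1" from matching inside "11".
    pattern = re.compile(r"\." + re.escape(chrom) + r"\.")
    return [f for f in file_list if pattern.search(f)]

# Hypothetical file names used only for illustration.
files = ["demo.1.sumstats.gz", "demo.11.sumstats.gz", "demo.chr1.sumstats.gz"]
matched = files_for_chromosome("chr1", files)
```

Here only `demo.1.sumstats.gz` is selected: `demo.11.sumstats.gz` is excluded by the surrounding dots, and a file named with a `chr1` token would also be missed, so this kind of matching assumes file names carry the bare chromosome number.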

In [None]:
[random_null_tensorqtl_1]
parameter: sum_files = paths
parameter: region_file = path
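parameter: traits = paths
# na_remove and z_only are referenced in the R script below; the False defaults here are assumptions
parameter: na_remove = False
parameter: z_only = False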

import re
import pandas as pd
def find_matching_files_for_region(chr_id):
    chr_number = chr_id[3:]  # strip the "chr" prefix, e.g. "chr1" -> "1"
    pattern_str = r"\.{chr_number}\."
    pattern = re.compile(pattern_str.format(chr_number=chr_number))
    paths = []
    for sum_file in sum_files:
        with open(sum_file, 'r') as af:
            for aline in af:
                if pattern.search(aline):
                    paths.append(aline.strip())
    return ",".join(paths)

updated_regions = []
with open(region_file, 'r') as regions:
    header = regions.readline().strip()
    updated_regions.append(header + "\tpath\tregion")
    for line in regions:
        parts = line.strip().split("\t")
        chr_id, start, end, gene_id = parts
        paths = find_matching_files_for_region(chr_id)
        updated_regions.append(f"{chr_id}\t{start}\t{end}\t{gene_id}\t{paths}\t{chr_id}:{start}-{end}")

meta_df = pd.DataFrame([line.split("\t") for line in updated_regions[1:]], columns=updated_regions[0].split("\t"))
meta = meta_df[['gene_id', 'path', 'region']].to_dict(orient='records')

input: for_each='meta'
output: f'{cwd:a}/{name}_cache/{name}.{_meta["gene_id"]}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint=entrypoint
    region <- "${_meta['region']}"
    # FIXME I am sure there is a more elegant way to put together the path, via SoS
    phenotype_path <- unlist(strsplit("${_meta['path']}", ","))
    dat <- tryCatch(
      {
        # Try to run the function
         pecotmr::load_multitrait_tensorqtl_sumstat(phenotype_path = phenotype_path, region = region, 
          trait_names = c(${traits:r,}), filter_file = NULL, remove_any_missing = TRUE, max_rows_selected = 300, na_remove = ${"T" if na_remove else "F"})
      },
      error = function(e) {
        warning("Loading failed; retrying after removing 'chr' from the region ID.")
        # If an error occurs, modify the region and try again
        pecotmr::load_multitrait_tensorqtl_sumstat(phenotype_path = phenotype_path, region =  gsub("chr", "", region), 
          trait_names = c(${traits:r,}), filter_file = NULL, remove_any_missing = TRUE, max_rows_selected = 300, na_remove = ${"T" if na_remove else "F"})
      }
    )
    exclude_condition = c(${",".join([repr(x) for x in exclude_condition])})
    dat <- pecotmr::mash_ran_null_sample(dat, ${n_random}, ${n_null}, ${expected_ncondition}, exclude_condition, z_only = ${"TRUE" if z_only else "FALSE"}, seed=${seed})
    saveRDS(dat, ${_output:r}, compress="xz")