# sQTL summarize

From Chao:     
In my sQTL mapping result, each row has an intron ID, an intron cluster ID, and an intron label (Productive or UnProductive).    
I first created another label that categorizes the kind of sQTL a cluster has, based on the q-values of the introns in the cluster:    
    - Productive clusters: all significant introns are Productive    
    - UnProductive clusters: all significant introns are UnProductive    
    - Productive, UnProductive clusters: some significant introns are Productive and some are UnProductive     
Clusters without any significant intron are not included in the analysis.     
Then, to pick one representative intron per cluster:    
    - Productive clusters: pick the intron with the smallest p-value in the cluster, and label it as Productive     
    - UnProductive clusters: if the cluster doesn't have any Productive intron (regardless of p-value), remove the cluster. Otherwise, pick the UnProductive intron with the smallest p-value, and label it as UnProductive    
    - Productive, UnProductive clusters: since there are clearly both types of introns here, including significant UnProductive ones, I pick the UnProductive intron with the smallest p-value, and label it as UnProductive.    
(cell 171)
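
The selection rules above can be sketched in a few lines of plain Python. This is only an illustration of the strategy as described, not the code used in the workflow: the record keys (`intron`, `cluster`, `label`, `p`) and the toy values are hypothetical, and the default threshold of 0.01 simply mirrors the `p_thres` default of the `pick_top_intron` step.

```python
from collections import defaultdict


def pick_representatives(rows, threshold=0.01):
    """Return {cluster: (intron, category)} following the strategy described above.

    `rows` is a list of dicts with the hypothetical keys
    'intron', 'cluster', 'label' ('Productive' / 'UnProductive') and 'p'.
    """
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["cluster"]].append(row)

    picked = {}
    for cluster_id, introns in clusters.items():
        sig = [r for r in introns if r["p"] < threshold]
        if not sig:
            continue  # no significant intron: cluster is not analysed
        sig_unp = [r for r in sig if r["label"] == "UnProductive"]
        if not sig_unp:
            # Productive cluster: best significant intron overall
            best, category = min(sig, key=lambda r: r["p"]), "Productive"
        elif all(r["label"] == "UnProductive" for r in introns):
            continue  # cluster has no Productive intron at all: remove it
        else:
            # UnProductive or mixed cluster: best significant UnProductive intron
            best, category = min(sig_unp, key=lambda r: r["p"]), "UnProductive"
        picked[cluster_id] = (best["intron"], category)
    return picked


# Toy input covering every branch (values are made up for illustration)
rows = [
    {"intron": "i1", "cluster": "c1", "label": "Productive", "p": 0.001},
    {"intron": "i2", "cluster": "c1", "label": "Productive", "p": 0.005},
    {"intron": "i3", "cluster": "c1", "label": "UnProductive", "p": 0.5},
    {"intron": "i4", "cluster": "c2", "label": "UnProductive", "p": 0.002},
    {"intron": "i5", "cluster": "c2", "label": "UnProductive", "p": 0.2},
    {"intron": "i6", "cluster": "c3", "label": "UnProductive", "p": 0.003},
    {"intron": "i7", "cluster": "c3", "label": "Productive", "p": 0.4},
    {"intron": "i8", "cluster": "c4", "label": "Productive", "p": 0.0001},
    {"intron": "i9", "cluster": "c4", "label": "UnProductive", "p": 0.004},
    {"intron": "i10", "cluster": "c5", "label": "Productive", "p": 0.3},
]
result = pick_representatives(rows)
print(result)
```

In this toy input, `c2` is dropped because it contains only UnProductive introns, `c5` is dropped because nothing in it is significant, and `c4` is labelled UnProductive even though its Productive intron has the smaller p-value.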

In [None]:
#work
sos run /home/rf2872/codes/xqtl-pipeline/pipeline/sQTL_summarize.ipynb pick_top_intron \
    --emp_file /mnt/vast/hpc/csg/rf2872/Work/leaf_cutter2/ROSMAP_DLPFC/output/association_scan/sQTL/TensorQTL_without_phenogroup/ROSMAP_DLPFC_perind.counts.noise.gz_raw_data.qqnorm.formated.bed.per_chrom_ROSMAP_DLPFC_perind.counts.noise.gz_raw_data.qqnorm.formated.cov_pca.resid.Marchenko_pc.emprical.cis_sumstats.txt

In [None]:
[global]
import os
# Work directory & output directory
parameter: cwd = path('./')
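# Number of tasks bundled into one job
parameter: job_size = 1
# Memory requested per task
parameter: mem = '60G'
# Container image for the software environment (empty string for none)
parameter: container = ''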

import pandas as pd
#parameter: analysis_units = path
# handle N = per_chunk data-set in one job
parameter: per_chunk = 1


In [None]:
#[merge_emp]
parameter: emp_files = paths
input: emp_files, group_by = 'all'
output:merged_emp=f'{_input[0]:annnn}.all.emprical.cis_sumstats.txt'
task: trunk_workers = 1, trunk_size = job_size, mem = mem,  walltime = '24h', tags = f'{step_name}_{_output[0]:bn}'
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
      library(tidyverse)
      library(data.table)
      library(qvalue)
      # Read every input file and stack them into one table
      emp_files <- c(${_input:r,})
      emp <- rbindlist(lapply(emp_files, fread))
      # Multiple-testing correction across all tested introns
      emp$q_beta <- qvalue(emp$p_beta)$qvalues
      emp$q_perm <- qvalue(emp$p_perm)$qvalues
      emp$fdr_beta <- p.adjust(emp$p_beta, "fdr")
      emp$fdr_perm <- p.adjust(emp$p_perm, "fdr")
      # cluster_ID = first two "_"-separated parts of the 4th ":"-separated field of molecular_trait_id
      clu <- str_split(emp$molecular_trait_id, ":", simplify = TRUE)[, 4] %>% str_split("_", simplify = TRUE)
      emp$cluster_ID <- paste(clu[, 1], clu[, 2], sep = "_")
      write.table(emp, file="${_output}", col.names=TRUE, row.names=FALSE, quote=FALSE)

In [None]:
[pick_top_intron]
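# Column of emp_file used as the significance measure
parameter: pvalue = "fdr_beta"
# Introns with a value below this threshold in that column count as significant
parameter: p_thres = 0.01
# Summary statistics table; must contain molecular_trait_id, cluster_ID and the column named by pvalue
parameter: emp_file = path
input: emp_file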
output:f'{cwd:a}/cluster_resresentative_StrategyByChao.txt'
task: trunk_workers = 1, trunk_size = job_size, mem = mem,  walltime = '24h', tags = f'{step_name}_{_output[0]:bn}'
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
        library(tidyverse)
        library(data.table)
        emp <- fread(${_input:r})
        p_id <- "${pvalue}"
        # UnProductive introns are marked with "*" in molecular_trait_id
        is_unp <- function(x) grepl("*", x$molecular_trait_id, fixed = TRUE)
        # Keep significant introns only; clusters without any are dropped here
        emp_sig <- emp[emp[[p_id]] < ${p_thres}, ]
        clus_top <- data.table()
        for (clu in unique(emp_sig$cluster_ID)) {
            tmp_raw <- emp[emp$cluster_ID == clu, ]      # every intron in the cluster
            tmp <- emp_sig[emp_sig$cluster_ID == clu, ]  # significant introns only
            n_unp_sig <- sum(is_unp(tmp))
            clus_top_tmp <- NULL
            if (n_unp_sig == 0) {
                # 1. Productive cluster: all significant introns are Productive
                print("#1 P")
                clus_top_tmp <- tmp[which.min(tmp[[p_id]]), ]
                clus_top_tmp$category <- "Productive"
            } else if (n_unp_sig == nrow(tmp)) {
                # 2. UnProductive cluster: all significant introns are UnProductive
                print("#2 Unp")
                # Keep the cluster only if it has at least one Productive intron,
                # significant or not; otherwise clus_top_tmp stays NULL
                if (sum(is_unp(tmp_raw)) < nrow(tmp_raw)) {
                    clus_top_tmp <- tmp[which.min(tmp[[p_id]]), ]
                    clus_top_tmp$category <- "UnProductive"
                }
            } else {
                # 3. Productive and UnProductive cluster: take the best UnProductive intron
                print("#3 P & Unp")
                tmp_unp <- tmp[is_unp(tmp), ]
                clus_top_tmp <- tmp_unp[which.min(tmp_unp[[p_id]]), ]
                clus_top_tmp$category <- "UnProductive"
            }
            clus_top <- rbind(clus_top, clus_top_tmp)
        }
        write.table(clus_top, file="${_output}", col.names=TRUE, row.names=FALSE, quote=FALSE)