## Get a pathway file based on V14 but using the same filtering as scGWAS original authors

The filtering described by the original authors is below with our annotations of how we changed anything, if at all:
* "We collected the gene–gene relationship data from PathwayCommons (v12, data access date: 12/12/2019) to construct the background network. The data downloaded originally had 1,851,006 interaction pairs that were curated and integrated from the public pathway and interaction databases."
    * We downloaded this data from v14 (data access data: 09/04/2024) and it originally had 2,484,222 interaction pairs.
* "We excluded those interactions that were annotated as in-complex-with, because those genes tended to be co-expressed and might inflate the results."
    * We followed the same methodology but also ended up removing any interactions based on non-gene interactions as well.
* "We further excluded 2291 ribosomal genes and housekeeping genes defined by the HSIAO_HOUSEKEEPING_GENES set from MSigDB."
    * We removed the housekeeping genes using the same file they used (from their github)
    * Although unable to get the same ribosomal genes they used, we curated a list of 639 ribosomal genes according to HGNC and hand annotation. We also removed 64 Polymerase genes, totaling to removing 1,112 genes.
    * We also removed the following genes from the final list given they can sometimes be considered housekeeping genes: HSPA1B (Heat shock protein 70 family member 1B), PSMD3 (Proteasome 26S subunit, non-ATPase 3), ACTRT2 (Actin-related protein T2), TST (Thiosulfate sulfurtransferase), HSPA1L (Heat shock protein family A member 1 like), PABPN1L (Poly(A) binding protein nuclear 1 like)

* "In addition, for each gene–gene pair, we examined their genomic locations and excluded those pairs that were located within 50 kb of each other... Furthermore, we excluded all pairs whose interacting genes are located in the MHC region (chr6:26000000_34000000, hg19) due to the complex LD in this region.""
    * We did the same except that we used the hg38 MHC region

We ended 1,124,417 interaction pairs compared to their 805,375. This led to 560 genes being added that were not considered previously, none that clearly show housekeeping or ribosomal annotations.

In [1]:
library(data.table)

In [2]:
pc_hgnc <- fread("~/Downloads/pc-hgnc.txt.gz", fill=TRUE)#, fill=TRUE)
pc_hgnc[1:2,]
# downloaded from https://download.baderlab.org/PathwayCommons/PC2/v14/

PARTICIPANT_A,INTERACTION_TYPE,PARTICIPANT_B,INTERACTION_DATA_SOURCE,INTERACTION_PUBMED_ID,PATHWAY_NAMES,MEDIATOR_IDS
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
A1BG,controls-expression-of,A2M,pid,12456685;7678052;9794795,IL6-mediated signaling events,pid:pid_74043;pid:pid_74038
A1BG,interacts-with,ABCC6,BioGRID,21988832,,biogrid:MolecularInteraction_bebab183-9762-4322-b91f-e9a69a0c9e97___null__162374_


In [3]:
# split according to metadata (the file has the interactions AND metadata on each participant)

pc_hgnc[2484223,]
metadata = pc_hgnc[2484224:2524907,1:5]

metadata[1:2,1:5]
colnames(metadata) = c("PARTICIPANT", "PARTICIPANT_TYPE", "PARTICIPANT_NAME", "UNIFICATION_XREF", "RELATIONSHIP_XREF")
unique(metadata$PARTICIPANT_TYPE)
pc_hgnc = pc_hgnc[1:2484222,]

PARTICIPANT_A,INTERACTION_TYPE,PARTICIPANT_B,INTERACTION_DATA_SOURCE,INTERACTION_PUBMED_ID,PATHWAY_NAMES,MEDIATOR_IDS
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
PARTICIPANT,PARTICIPANT_TYPE,PARTICIPANT_NAME,UNIFICATION_XREF,RELATIONSHIP_XREF,,


PARTICIPANT_A,INTERACTION_TYPE,PARTICIPANT_B,INTERACTION_DATA_SOURCE,INTERACTION_PUBMED_ID
<chr>,<chr>,<chr>,<chr>,<chr>
chebi:31403,SmallMoleculeReference,cinnarizine,chebi:CHEBI:31403,
SNORD121B,RnaReference,SNORD121B mRNA,,


## 1. Remove those in complex with 

In [4]:
sort(unique(pc_hgnc$INTERACTION_TYPE))
old = nrow(pc_hgnc)
old
# remove the in-complex-with
pc_hgnc <- pc_hgnc[pc_hgnc$INTERACTION_TYPE != "in-complex-with",]
nrow(pc_hgnc)
cat("\n# Pairs lost due to in-complex-with:", old-nrow(pc_hgnc))


# Pairs lost due to in-complex-with: 204317

## 1. Remove housekeeping genes
- Use the HSIAO_HOUSEKEEPING_GENES_new.txt found on their github 

In [5]:

# elcude houseeping genes
old = nrow(pc_hgnc)
housekeep <- fread("~/Downloads/HSIAO_HOUSEKEEPING_GENES_new.txt", 
                  header=FALSE)
housekeep = housekeep$V1
cat("\n# Housekeeping Genes:", length(housekeep))
pc_hgnc <- pc_hgnc[!pc_hgnc$PARTICIPANT_A %in% housekeep,]
nrow(pc_hgnc)
pc_hgnc <- pc_hgnc[!pc_hgnc$PARTICIPANT_B %in% housekeep,]
nrow(pc_hgnc)
cat("\n# Pairs lost due to housekeeping:", old-nrow(pc_hgnc))


# Housekeeping Genes: 405


# Pairs lost due to housekeeping: 225978

## 2. Remove ribosomal proteins and RNAs

In [6]:
old = nrow(pc_hgnc)
rib_prots <- fread("~/Downloads/ribosomal_proteins_HGNC_9.4.24.csv", skip=1, header=TRUE)
rib_prots[1,]
rib_prots = rib_prots[["Approved symbol"]]
rib_RNAs <- fread("~/Downloads/ribosomal_RNAs_HGNC_9.4.24.csv", skip=1, header=TRUE)
rib_RNAs[1,]
rib_RNAs = rib_RNAs[["Approved symbol"]]
length(rib_RNAs)
length(rib_prots)
rib_genes = union(rib_RNAs, rib_prots)
length(rib_genes)

pc_hgnc <- pc_hgnc[!pc_hgnc$PARTICIPANT_A %in% rib_genes,]
pc_hgnc <- pc_hgnc[!pc_hgnc$PARTICIPANT_B %in% rib_genes,]
nrow(pc_hgnc)

# hand annotated the remaining ribosomal genes not captured in these files from HGNC
rem_rib_genes = unique(pc_hgnc$PARTICIPANT_A[grep("^RPS|^RPL|^RPP", pc_hgnc$PARTICIPANT_A)])
rem_rib_genes = union(rem_rib_genes, 
                      unique(pc_hgnc$PARTICIPANT_B[grep("^RPS|^RPL|^RPP", pc_hgnc$PARTICIPANT_B)]))
#rem_rib_genes
rib_genes = union(rib_genes, rem_rib_genes)
length(rib_genes)
length(rem_rib_genes)
pc_hgnc <- pc_hgnc[!pc_hgnc$PARTICIPANT_A %in% rem_rib_genes,]
pc_hgnc <- pc_hgnc[!pc_hgnc$PARTICIPANT_B %in% rem_rib_genes,]
nrow(pc_hgnc)
cat("\n# Pairs lost due to ribosomal:", old-nrow(pc_hgnc))


HGNC ID (gene),Approved symbol,Approved name,Previous symbols,Aliases,Chromosome,Group
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
HGNC:14275,MRPL1,mitochondrial ribosomal protein L1,,"BM022,uL1m",4q21.1,Large subunit mitochondrial ribosomal proteins


HGNC ID (gene),Approved symbol,Approved name,Previous symbols,Aliases,Chromosome,Group
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
HGNC:34362,RNA5S1,"RNA, 5S ribosomal 1",RN5S1,,1q42.13,5S ribosomal RNAs



# Pairs lost due to ribosomal: 32548

In [7]:
# Polymerase genes
old = nrow(pc_hgnc)
rnapol_genes = unique(pc_hgnc$PARTICIPANT_A[grep("^RPAP|^POL", pc_hgnc$PARTICIPANT_A)])
rnapol_genes = union(rem_rib_genes, 
                      unique(pc_hgnc$PARTICIPANT_B[grep("^RPAP|^POL", pc_hgnc$PARTICIPANT_B)]))
rnapol_genes
length(rnapol_genes)
pc_hgnc <- pc_hgnc[!pc_hgnc$PARTICIPANT_A %in% rnapol_genes,]
pc_hgnc <- pc_hgnc[!pc_hgnc$PARTICIPANT_B %in% rnapol_genes,]
nrow(pc_hgnc)

cat("\n# Pairs lost due to polymerase genes:", old-nrow(pc_hgnc))


# Pairs lost due to polymerase genes: 13148

## 3. Exclude pairs within 50kb of each other and in MHC complex

### A. Get the hg38 based loci (using same input as MAGMA)

In [8]:
loci <- fread("~/Desktop/SC_GWAS_Bench/data/MAGMA/NCBI38/NCBI38.gene.loc")
loci[1:2,]

V1,V2,V3,V4,V5,V6
<int>,<chr>,<int>,<int>,<chr>,<chr>
79501,1,69091,70008,+,OR4F5
100996442,1,141934,174394,-,LOC100996442


In [9]:
# see which ones can be kept cuz MAGMA equivalent
old = nrow(pc_hgnc)
pc_hgnc_magma <- pc_hgnc[pc_hgnc$PARTICIPANT_A %in% loci$V6 & pc_hgnc$PARTICIPANT_B %in% loci$V6,]
dim(pc_hgnc_magma)

loci_partA = loci[loci$V6 %in% pc_hgnc$PARTICIPANT_A,]
loci_partA = loci_partA[order(loci_partA$V6),]
colnames(loci_partA) = c("Orig_Name", "A_Chrom", "A_Start", "A_End", "Strand", "A_Gene")

loci_partB = loci[loci$V6 %in% pc_hgnc$PARTICIPANT_B,]
loci_partB = loci_partB[order(loci_partB$V6),]
colnames(loci_partB) = c("Orig_Name", "B_Chrom", "B_Start", "B_End", "Strand", "B_Gene")
cat("\n# Pairs lost due to not being in MAGMA loci:", old-nrow(pc_hgnc_magma))


# Pairs lost due to not being in MAGMA loci: 871559

In [10]:
pc_hgnc_magma_new <- merge(pc_hgnc_magma, loci_partA[,c("A_Chrom", "A_Start", "A_End", "A_Gene")], by.x = "PARTICIPANT_A", by.y = "A_Gene", all.x = TRUE)
pc_hgnc_magma_new <- merge(pc_hgnc_magma_new, loci_partB[,c("B_Chrom", "B_Start", "B_End", "B_Gene")], by.x = "PARTICIPANT_B", by.y = "B_Gene", all.x = TRUE)
loci_partA[loci_partA$V6 == "Gene",]
pc_hgnc_magma_new[1:3,]

Orig_Name,A_Chrom,A_Start,A_End,Strand,A_Gene
<int>,<chr>,<int>,<int>,<chr>,<chr>


PARTICIPANT_B,PARTICIPANT_A,INTERACTION_TYPE,INTERACTION_DATA_SOURCE,INTERACTION_PUBMED_ID,PATHWAY_NAMES,MEDIATOR_IDS,A_Chrom,A_Start,A_End,B_Chrom,B_Start,B_End
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<int>,<int>
A1BG,ABCG2,controls-state-change-of,CTD,23896426;37660740,,ctdbase:chemical_mesh_d002104;ctdbase:ABU_MESH_D002104;ctdbase:B_4750700;ctdbase:RXN_8494535,4,88090264,88231417,19,58346806,58353499
A1BG,ATP7A,controls-state-change-of,CTD,11350187;14985339;15467011;15923132;21336677;23345593;23776592;23896426;24522273;25247420;32278528,,ctdbase:REC_6713796;ctdbase:ABU_3770939;ctdbase:ABU_3792340;ctdbase:UPT_6713802;ctdbase:ABU_3770941;ctdbase:chemical_mesh_d003300;ctdbase:ABU_4940882;ctdbase:ABU_MESH_D003300;ctdbase:B_4750526;ctdbase:ABU_3727198,X,77910656,78050395,19,58346806,58353499
A1BG,ATP7B,controls-state-change-of,CTD,16549536;23896426;24368744;24522273;32278528,,ctdbase:ABU_4773668;ctdbase:ABU_4772080;ctdbase:ABU_3767626;ctdbase:ABU_4773673;ctdbase:ABU_4773671;ctdbase:UPT_6713798;ctdbase:chemical_mesh_d003300;ctdbase:ABU_MESH_D003300;ctdbase:REC_6713792;ctdbase:B_4750526,13,51891086,52012099,19,58346806,58353499


In [11]:
sort(unique(pc_hgnc$INTERACTION_TYPE))
sort(unique(pc_hgnc_magma_new$INTERACTION_TYPE))

### B. Remove those gene pairs overlapping and/or within 50kb of each other

In [12]:
# remove those that are within 50kb of one another
# add unique id
pc_hgnc_magma_new$unique_id = seq(1, nrow(pc_hgnc_magma_new))
# get those that should be removed
## see which have same chromosome
remove = pc_hgnc_magma_new[pc_hgnc_magma_new$A_Chrom == pc_hgnc_magma_new$B_Chrom,]
nrow(remove)
## Get the distances between them
remove$DIFF1 = abs(remove$A_Start - remove$B_Start)
remove$DIFF2 = abs(remove$A_Start - remove$B_End)
remove$DIFF3 = abs(remove$A_End - remove$B_End)
remove$DIFF4 = abs(remove$A_End - remove$B_Start)

# get those overlapping
overlap <- remove[(remove$A_Start <= remove$B_End) & (remove$B_Start <= remove$A_End), ]
dim(overlap)
## get those within 50kb
remove = remove[remove$DIFF1 < 50000 | remove$DIFF2 < 50000 | remove$DIFF3 < 50000 | remove$DIFF4 < 50000,]
dim(remove)

remove[1:3,]

remove_ids = union(overlap$unique_id, remove$unique_id)
length(remove_ids)



PARTICIPANT_B,PARTICIPANT_A,INTERACTION_TYPE,INTERACTION_DATA_SOURCE,INTERACTION_PUBMED_ID,PATHWAY_NAMES,MEDIATOR_IDS,A_Chrom,A_Start,A_End,B_Chrom,B_Start,B_End,unique_id,DIFF1,DIFF2,DIFF3,DIFF4
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
ABCA10,ABCA5,catalysis-precedes,Reactome,11178988;11309290;11435397;11478798;11606068;12150964;12183068;12504089;12821155;24831815,ABC transporters in lipid homeostasis,reactome:Catalysis2;http://bioregistry.io/reactome:R-HSA-8848053;reactome:Catalysis6;http://bioregistry.io/reactome:R-HSA-1454928,17,69244435,69327182,17,69148007,69244815,1310,96428,380,82367,179175
ABCA6,ABCA10,interacts-with,BioGRID,33961781,,biogrid:MolecularInteraction_c754f6e2-156a-4282-8a92-8b5716e31917___null__816562_,17,69148007,69244815,17,69078089,69141992,1352,69918,6015,102823,166726
ABCC12,ABCC11,interacts-with,BioGRID,33961781,,biogrid:MolecularInteraction_0814948a-71c8-44f3-96a1-b7676f60558c___null__843848_,16,48166910,48235177,16,48082973,48146770,1709,83937,20140,88407,152204


In [13]:
# only keep those not overlapping significantly
old = nrow(pc_hgnc_magma_new)
pc_hgnc_magma_new <- pc_hgnc_magma_new[!pc_hgnc_magma_new$unique_id %in% remove_ids,]
nrow(pc_hgnc_magma_new)
cat("\n# Pairs lost due to overlap or within 50kb:", old-nrow(pc_hgnc_magma_new))


# Pairs lost due to overlap or within 50kb: 870

### C. Remove those within the MHC region

In [14]:
# remove anything within the MHC region
pc_hgnc_magma_new[1:2,]
old = nrow(pc_hgnc_magma_new)
remove_mhc = pc_hgnc_magma_new[pc_hgnc_magma_new$A_Chrom == "6" & 
                               pc_hgnc_magma_new$A_Start > 29600000 & 
                               pc_hgnc_magma_new$A_End < 33300000,]$unique_id
length(remove_mhc)
remove_mhc=c()
remove_mhc = union(remove_mhc, pc_hgnc_magma_new[pc_hgnc_magma_new$B_Chrom == "6" & 
                               pc_hgnc_magma_new$B_Start > 29600000 & 
                               pc_hgnc_magma_new$B_End < 33300000,]$unique_id)

length(remove_mhc)

pc_hgnc_magma_new = pc_hgnc_magma_new[!pc_hgnc_magma_new$unique_id %in% remove_mhc,]
nrow(pc_hgnc_magma_new)
cat("\n# Pairs lost due to being within MHC region:", old-nrow(pc_hgnc_magma_new))

PARTICIPANT_B,PARTICIPANT_A,INTERACTION_TYPE,INTERACTION_DATA_SOURCE,INTERACTION_PUBMED_ID,PATHWAY_NAMES,MEDIATOR_IDS,A_Chrom,A_Start,A_End,B_Chrom,B_Start,B_End,unique_id
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<int>,<int>,<int>
A1BG,ABCG2,controls-state-change-of,CTD,23896426;37660740,,ctdbase:chemical_mesh_d002104;ctdbase:ABU_MESH_D002104;ctdbase:B_4750700;ctdbase:RXN_8494535,4,88090264,88231417,19,58346806,58353499,1
A1BG,ATP7A,controls-state-change-of,CTD,11350187;14985339;15467011;15923132;21336677;23345593;23776592;23896426;24522273;25247420;32278528,,ctdbase:REC_6713796;ctdbase:ABU_3770939;ctdbase:ABU_3792340;ctdbase:UPT_6713802;ctdbase:ABU_3770941;ctdbase:chemical_mesh_d003300;ctdbase:ABU_4940882;ctdbase:ABU_MESH_D003300;ctdbase:B_4750526;ctdbase:ABU_3727198,X,77910656,78050395,19,58346806,58353499,2



# Pairs lost due to being within MHC region: 9513

* excluded those interactions that were annotated as in-complex-with, because those genes tended to be co-expressed and might inflate the result
* excluded 2291 ribosomal genes and housekeeping genes defined by the HSIAO_HOUSEKEEPING_GENES set from MSigDB (expressed across 19 tissues 
* excluded those pairs that were located within 50 kb of each other (since 50kb window)
* excluded pairs in the MHC complex: MHC region (chr6:26000000_34000000, hg19)

### Final Comparison to original

In [15]:
nrow(pc_hgnc_magma_new)

In [16]:
# original pathway
pathway <- fread("~/Downloads/PathwayCommons12.All.hgnc.exPCDHA.MHC.NCBI37.tsv", 
                header=FALSE)
pathway[1:2,]
dim(pathway)

V1,V2
<chr>,<chr>
A1BG,A2M
A1BG,ABCC6


In [17]:
nrow(pc_hgnc_magma_new) - nrow(pathway)
new_genes = union(pc_hgnc_magma_new$PARTICIPANT_A, pc_hgnc_magma_new$PARTICIPANT_B)
old_genes = union(pathway$V1, pathway$V2)
length(new_genes)
length(old_genes)
length(setdiff(new_genes, old_genes))
setdiff(new_genes, old_genes)
length(setdiff(old_genes, new_genes))
setdiff(old_genes, new_genes)[1:5]

In [18]:
rem_hk_genes = c("HSPA1B", "PSMD3", "ACTRT2", "TST", "HSPA1L", "PABPN1L")
old = nrow(pc_hgnc_magma_new)
pc_hgnc_magma_new = pc_hgnc_magma_new[!pc_hgnc_magma_new$PARTICIPANT_A %in% rem_hk_genes,]
pc_hgnc_magma_new = pc_hgnc_magma_new[!pc_hgnc_magma_new$PARTICIPANT_B %in% rem_hk_genes,]
cat("\n# Pairs lost due to added hand annotated housekeeping genes:", old-nrow(pc_hgnc_magma_new))
nrow(pc_hgnc_magma_new)


# Pairs lost due to added hand annotated housekeeping genes: 1872

In [19]:
# Save file in the same format as the original
write.table(pc_hgnc_magma_new[,c("PARTICIPANT_A", "PARTICIPANT_B")], 
            "~/Desktop/SCRNA-GWAS-Benchmarking/data/PathwayCommons14.All.hgnc.exPCDHA.MHC.NCBI38.tsv", 
            row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")