# Get Hallmark Geneset Descriptions

Download and save the Hallmark gene set .gmt file (v7.5.1) so it can be read into Python.  
Obtain gene set descriptions via msigdbr.

## Set Up

In [1]:
# install.packages("msigdbr")

In [1]:
library(msigdbr)
library(dplyr)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




## Download GMT file

In [2]:
out_gmt <- './data/h.all.v7.5.1.symbols.gmt'

In [3]:
download.file(
    url = 'https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5.1/h.all.v7.5.1.symbols.gmt',
    destfile = out_gmt)   
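
`download.file()` does not create missing directories, so the call above will typically fail if `./data` does not exist yet. A minimal sketch of a guard, assuming a hypothetical helper `ensure_parent_dir()` that is not part of the original notebook; it is demonstrated on a temporary path rather than the real `./data` folder:

```r
# Hypothetical helper: create the parent directory of a file path if needed.
ensure_parent_dir <- function(path) {
    parent <- dirname(path)
    if (!dir.exists(parent)) {
        dir.create(parent, recursive = TRUE)
    }
    invisible(parent)
}

# Demonstrated on a temporary location so nothing in the project is touched.
demo_gmt <- file.path(tempdir(), "demo_data", "h.all.v7.5.1.symbols.gmt")
ensure_parent_dir(demo_gmt)
dir.exists(dirname(demo_gmt))
```

In this notebook the equivalent call would be `ensure_parent_dir(out_gmt)` placed before `download.file()`.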

## Format Pathway Descriptions

In [4]:
hallmark_anno <- msigdbr(species = "Homo sapiens", category = "H")

In [5]:
names(hallmark_anno)

In [6]:
head(hallmark_anno, 1)

gs_cat,gs_subcat,gs_name,gene_symbol,entrez_gene,ensembl_gene,human_gene_symbol,human_entrez_gene,human_ensembl_gene,gs_id,gs_pmid,gs_geoid,gs_exact_source,gs_url,gs_description
<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
H,,HALLMARK_ADIPOGENESIS,ABCA1,19,ENSG00000165029,ABCA1,19,ENSG00000165029,M5905,26771021,,,,Genes up-regulated during adipocyte differentiation (adipogenesis).


In [7]:
hallmark_desc <- hallmark_anno %>%
    select('gs_name', 'gs_description') %>%
    distinct()

In [8]:
dim(hallmark_desc)

In [9]:
hallmark_desc

gs_name,gs_description
<chr>,<chr>
HALLMARK_ADIPOGENESIS,Genes up-regulated during adipocyte differentiation (adipogenesis).
HALLMARK_ALLOGRAFT_REJECTION,Genes up-regulated during transplant rejection.
HALLMARK_ANDROGEN_RESPONSE,Genes defining response to androgens.
HALLMARK_ANGIOGENESIS,Genes up-regulated during formation of blood vessels (angiogenesis).
HALLMARK_APICAL_JUNCTION,Genes encoding components of apical junction complex.
HALLMARK_APICAL_SURFACE,"Genes encoding proteins over-represented on the apical surface of epithelial cells, e.g., important for cell polarity (apical area)."
HALLMARK_APOPTOSIS,Genes mediating programmed cell death (apoptosis) by activation of caspases.
HALLMARK_BILE_ACID_METABOLISM,Genes involve in metabolism of bile acids and salts.
HALLMARK_CHOLESTEROL_HOMEOSTASIS,Genes involved in cholesterol homeostasis.
HALLMARK_COAGULATION,Genes encoding components of blood coagulation system; also up-regulated in platelets.


In [10]:
write.csv(hallmark_desc, "data/hallmark_gs_descriptions.csv")
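
By default `write.csv()` also writes the data frame's row names as an unnamed first column, which appears as an extra index column when the file is read elsewhere. A small sketch on toy data (two rows copied from the output above), assuming that column is not wanted; `row.names = FALSE` drops it:

```r
# Toy stand-in for hallmark_desc.
toy_desc <- data.frame(
    gs_name = c("HALLMARK_ADIPOGENESIS", "HALLMARK_ANGIOGENESIS"),
    gs_description = c(
        "Genes up-regulated during adipocyte differentiation (adipogenesis).",
        "Genes up-regulated during formation of blood vessels (angiogenesis)."
    )
)

with_rn <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")
write.csv(toy_desc, with_rn)                        # default: row names written
write.csv(toy_desc, without_rn, row.names = FALSE)  # row names omitted

names(read.csv(with_rn))      # includes an extra leading column
names(read.csv(without_rn))   # only gs_name and gs_description
```

Whether to keep the row-name column depends on how the downstream reader treats the first column.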

## QC: Check GMT Against msigdbr Query

In [11]:
# Read a GMT file in as a named list of gene sets
# Based on code from GSEABase
readGMT <- function(fp, sep = "\t", ...){
    assertthat::assert_that(file.exists(fp), 
                           msg = sprintf("Could not locate input file %s", fp))
    # Escape the dot so only a literal ".gmt" suffix matches
    assertthat::assert_that(grepl("\\.gmt$", fp),
                            msg = sprintf("Expecting file extension '.gmt'. Input file: %s", fp))
    
    # Each line holds: gene set name, description, then member genes
    lines <- strsplit(readLines(fp, ...), sep)
    # Drop the name and description fields, keeping only the genes
    gene_list <- lapply(lines, function(line) {
        unlist(line[-(1:2)])
    })
    names(gene_list) <- sapply(lines, "[[", 1)
    gene_list
}
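
Each line of a GMT file is tab-separated: the gene set name, a description field (often a URL), then the member genes, which is why `readGMT` drops the first two fields. A toy sketch of that layout, using made-up set names and a placeholder description rather than real Hallmark content:

```r
# Two made-up gene sets written in GMT layout: name, description, genes...
toy_gmt <- tempfile(fileext = ".gmt")
writeLines(c(
    "TOY_SET_A\tplaceholder_description\tABCA1\tTP53",
    "TOY_SET_B\tplaceholder_description\tBRCA1"
), toy_gmt)

fields <- strsplit(readLines(toy_gmt), "\t")
toy_sets <- lapply(fields, function(x) x[-(1:2)])          # keep genes only
names(toy_sets) <- vapply(fields, `[[`, character(1), 1)   # first field is the name

toy_sets$TOY_SET_A
lengths(toy_sets)
```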

In [12]:
# Read the downloaded Hallmark GMT into a named list of gene sets
hallmark_gs <- readGMT(out_gmt)

In [13]:
all(hallmark_desc$gs_name %in% names(hallmark_gs))

In [14]:
all(names(hallmark_gs) %in% hallmark_desc$gs_name)
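
The two `all()` checks only report whether the name sets agree. If either returns `FALSE`, `setdiff()` shows which names are missing on each side. A sketch with toy name vectors standing in for `hallmark_desc$gs_name` and `names(hallmark_gs)` (the mismatch here is contrived for illustration):

```r
# Toy stand-ins; the real vectors come from the msigdbr query and the GMT file.
desc_names <- c("HALLMARK_ADIPOGENESIS", "HALLMARK_APOPTOSIS", "HALLMARK_COAGULATION")
gmt_names  <- c("HALLMARK_ADIPOGENESIS", "HALLMARK_APOPTOSIS", "HALLMARK_ANGIOGENESIS")

only_in_desc <- setdiff(desc_names, gmt_names)  # in the query but not the GMT
only_in_gmt  <- setdiff(gmt_names, desc_names)  # in the GMT but not the query

only_in_desc
only_in_gmt
setequal(desc_names, gmt_names)                 # single check covering both directions
```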

## Session Info

In [15]:
sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.24.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.3   msigdbr_7.5.1

loaded via a namespace (and not attached):
 [1] crayon_1.5.2     vctrs_0.6.4      cli_3.6.1        rlang_1.1.2     
 [5] generics_0.1.3   assertthat_0.2.1 jsonlite_1.8.7   glue_1.6.2      
 [9] htmltools_0.5.7  IRdisplay_1.1    IRkernel_1.3.2   fansi_1.0.5     
[13] babelgene_22.