# MSigDB Abstract Wrangling
---

## Process used to collect raw data

1. Open the gene set page (e.g., https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HALLMARK_APOPTOSIS.html)
2. Under "Related gene sets", click "show {X} founder gene sets for this hallmark gene set"
3. Scroll down to the line "Download founder gene sets as: gmt | gmx | xml"
4. Download the xml version
5. Upload it to a folder with the XML files for the other gene sets


This XML file contains the abstracts of all the founder sets associated with the original gene set.

**Note:** This could be done programmatically by iterating over the gene set pages linked on the MSigDB website, but for the scope of this study (50 gene sets) it was faster to download the files manually.
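
As a starting point for the programmatic route, the sketch below builds the gene set page URL for each name, using the URL pattern from the example in step 1. It only constructs URLs and makes no network requests; locating the founder XML download link on each page is not covered, since that link's address is not given here. The names `GENESET_PAGE` and `geneset_page_urls` are illustrative, not part of the original notebook.

```python
# URL pattern inferred from the example gene set page in step 1.
GENESET_PAGE = "https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/{name}.html"

def geneset_page_urls(names):
    """Map each gene set name to the URL of its MSigDB gene set page."""
    return {name: GENESET_PAGE.format(name=name) for name in names}

urls = geneset_page_urls(["HALLMARK_APOPTOSIS", "HALLMARK_ADIPOGENESIS"])
for name, url in urls.items():
    print(name, url)
```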

## Import necessary packages

In [11]:
import pandas as pd
pd.set_option('max_colwidth', 400)
import numpy as np
import os

import xmltodict
import json

## Create the dataframe and add paths to XML files

In [2]:
def geneset_names_to_df(names, folder):
    '''
    names: path to a text or CSV file with a NAME column listing the gene sets used
    folder: path to the folder containing the founder XML files
    '''
    genesets = pd.read_csv(names)
    # os.path.join works whether or not `folder` ends with a path separator
    genesets['FILENAME'] = genesets['NAME'].apply(
        lambda name: os.path.join(folder, f'{name}_FOUNDERS.v2023.2.Hs.xml'))
    return genesets

genesets = geneset_names_to_df('hallmark-names.txt','hallmark-abstracts/')
genesets.describe()

| | NAME | FILENAME |
|---|---|---|
| count | 50 | 50 |
| unique | 50 | 50 |
| top | HALLMARK_ADIPOGENESIS | hallmark-abstracts/HALLMARK_ADIPOGENESIS_FOUNDERS.v2023.2.Hs.xml |
| freq | 1 | 1 |
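
Because the XML files were downloaded by hand, it may be worth confirming that every expected file is present before parsing. The sketch below is one way to do that with the standard library; `missing_files` is a hypothetical helper, not part of the original notebook, and the example paths simply follow the filename pattern used above.

```python
import os

def missing_files(paths):
    """Return the paths that do not exist on disk, preserving input order."""
    return [path for path in paths if not os.path.exists(path)]

# In the notebook this could be called as missing_files(genesets['FILENAME']).
example_paths = [
    'hallmark-abstracts/HALLMARK_ADIPOGENESIS_FOUNDERS.v2023.2.Hs.xml',
    'hallmark-abstracts/HALLMARK_APOPTOSIS_FOUNDERS.v2023.2.Hs.xml',
]
print(missing_files(example_paths))
```

An empty list means every file was found; any path that is printed still needs to be downloaded or renamed.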


## Extract only the @DESCRIPTION_BRIEF and @DESCRIPTION_FULL from the XML file

In [3]:
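def extract_abstracts_to_dict(filename):
    '''
    Return {founder index: "brief description full description"} for one XML file.
    '''
    with open(filename) as file:
        abstracts = xmltodict.parse(file.read())
    founders = abstracts['MSIGDB']['GENESET']
    # xmltodict returns a single dict (not a list) when there is only one GENESET element
    if isinstance(founders, dict):
        founders = [founders]
    abstract_text = {}
    for counter, founder in enumerate(founders):
        # .get avoids a KeyError when an attribute is absent
        brief = founder.get('@DESCRIPTION_BRIEF') or ''
        full = founder.get('@DESCRIPTION_FULL') or ''
        # Founders with neither description are skipped; their index stays unused
        if brief or full:
            abstract_text[counter] = f'{brief} {full}'
    return abstract_text

genesets['ABSTRACTS'] = genesets['FILENAME'].apply(extract_abstracts_to_dict)
genesets['ABSTRACT_COUNT'] = genesets['ABSTRACTS'].apply(len)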
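
To illustrate what the extraction does, here is a minimal standard-library sketch of the same idea using `xml.etree.ElementTree` instead of `xmltodict`. It is an alternative for illustration, not the notebook's method. The toy XML is made up; only the `MSIGDB`/`GENESET` nesting and the two attribute names come from the code above, and real founder files carry many more attributes.

```python
import xml.etree.ElementTree as ET

# Made-up toy document: three founders, one with no descriptions at all
# and one with the full description missing.
toy_xml = '''<MSIGDB>
  <GENESET DESCRIPTION_BRIEF="Brief A." DESCRIPTION_FULL="Full A."/>
  <GENESET DESCRIPTION_BRIEF="" DESCRIPTION_FULL=""/>
  <GENESET DESCRIPTION_BRIEF="Brief C."/>
</MSIGDB>'''

def abstracts_from_xml(xml_text):
    """Return {founder index: 'brief full'} for founders with any description."""
    root = ET.fromstring(xml_text)
    abstract_text = {}
    for index, founder in enumerate(root.iter('GENESET')):
        # ElementTree exposes attributes without the '@' prefix xmltodict adds
        brief = founder.get('DESCRIPTION_BRIEF', '')
        full = founder.get('DESCRIPTION_FULL', '')
        if brief or full:
            abstract_text[index] = f'{brief} {full}'
    return abstract_text

toy_abstracts = abstracts_from_xml(toy_xml)
print(toy_abstracts)  # {0: 'Brief A. Full A.', 2: 'Brief C. '}
```

The keys record each founder's position in the file, so a skipped founder leaves a gap in the numbering (index 1 here), and the abstract count can be lower than the number of founders.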

In [16]:
# .iloc[0] takes the first matching row by position, whatever its index label
abstracts_adipogenesis = genesets[genesets['NAME'] == 'HALLMARK_ADIPOGENESIS']['ABSTRACTS'].iloc[0]
#abstracts_adipogenesis

In [21]:
pd.set_option('max_colwidth', 400)
genesets[['NAME','ABSTRACTS','ABSTRACT_COUNT']]

| | NAME | ABSTRACTS | ABSTRACT_COUNT |
|---|---|---|---|
| 0 | HALLMARK_ADIPOGENESIS | {0: 'Genes down-regulated in the invasive ductal carcinoma (IDC) compared to the invasive lobular carcinoma (ILC), the two major pathological types of breast cancer. Invasive ductal carcinomas (IDCs) and invasive lobular carcinomas (ILCs) are the two major pathological types of breast cancer. Epidemiological and histoclinical data suggest biological differences, but little is known about the m... | 36 |
| 1 | HALLMARK_ALLOGRAFT_REJECTION | {0: 'Genes down-regulated in cancer stem cells derived from glyoblastoma tumors: CD133+ [GeneID=8842] vs. CD133- cells. Although glioblastomas show the same histologic phenotype, biological hallmarks such as growth and differentiation properties vary considerably between individual cases. To investigate whether different subtypes of glioblastomas might originate from different cells of origin... | 189 |
| 2 | HALLMARK_ANDROGEN_RESPONSE | {0: 'Genes having at least one occurence of the motif ACGCACA in their 3' untranslated region. The motif represents putative target (that is, seed match) of human mature miRNA hsa-miR-210 (v7.1 miRBase). ', 1: 'Genes that physically map to the hematopoietic stem cell (HSC) proliferation QTL (quantitative trait locus) Scp2. We combined large-scale mRNA expression analysis and gene mapping to id... | 8 |
| 3 | HALLMARK_ANGIOGENESIS | {0: 'Binding to a carbohydrate, which includes monosaccharides, oligosaccharides and polysaccharides as well as substances derived from monosaccharides by reduction of the carbonyl group (alditols), by oxidation of one or more hydroxy groups to afford the corresponding aldehydes, ketones, or carboxylic acids, or by replacement of one or more hydroxy group(s) by a hydrogen atom. Cyclitols are g... | 14 |
| 4 | HALLMARK_APICAL_JUNCTION | {0: 'Genes annotated by the GO term GO:0051017. The assembly of actin filament bundles; actin filaments are on the same axis but may be oriented with the same or opposite polarities and may be packed with different levels of tightness. ', 1: 'Genes down-regulated in T98G cells (glioma, express MGMT [GeneID=4255]) by carmustine [PubChem=2578] at 24 h. Chemotherapy with the alkylating agent BCNU... | 37 |
| 5 | HALLMARK_APICAL_SURFACE | {0: 'Genes annotated by the GO term GO:0044463. Any constituent part of a cell projection, a prolongation or process extending from a cell, e.g. a flagellum or axon. ', 1: 'Genes in cytogenetic band chr6q Genes in cytogenetic band chr6q', 2: 'Any process that results in a change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, ... | 12 |
| 6 | HALLMARK_APOPTOSIS | {0: 'Genes up-regulated in immunoglobulin light chain amyloidosis plasma cells (ALPC) compared to multiple myeloma (MM) cells. Immunoglobulin light chain amyloidosis (AL) is characterized by a clonal expansion of plasma cells within the bone marrow. Gene expression analysis was used to identify a unique molecular profile for AL using enriched plasma cells (CD138+) from the bone marrow of 24 pa... | 80 |
| 7 | HALLMARK_BILE_ACID_METABOLISM | {0: 'Genes in cytogenetic band chr11p Genes in cytogenetic band chr11p', 1: 'Genes in cytogenetic band chr15q Genes in cytogenetic band chr15q', 2: 'The chemical reactions and pathways involving bile acids, a group of steroid carboxylic acids occurring in bile, where they are present as the sodium salts of their amides with glycine or taurine. [GOC:go_curators] ', 3: 'The chemical reactions an... | 28 |
| 8 | HALLMARK_CHOLESTEROL_HOMEOSTASIS | {0: 'Genes up-regulated during prostate cancer progression in the JOCK1 model due to inducible activation of FGFR1 [GeneID=2260] gene in prostate. Fibroblast Growth Factor Receptor-1 (FGFR1) is commonly overexpressed in advanced prostate cancer (PCa). To investigate causality, we utilized an inducible FGFR1 (iFGFR1) prostate mouse model. Activation of iFGFR1 with chemical inducers of dimerizat... | 28 |
| 9 | HALLMARK_COAGULATION | {0: 'STAT1 [GeneID=6772] targets in hematopoetic signaling. Hematopoiesis is the cumulative result of intricately regulated signaling pathways that are mediated by cytokines and their receptors. Proper culmination of these diverse pathways forms the basis for an orderly generation of different cell types. Recent studies conducted over the past 10-15 years have revealed that hematopoietic cytok... | 71 |
