Here, we present GlyCompare, a novel method wherein glycans from glycomic data are decomposed to a minimal set of intermediate substructures, thus incorporating shared intermediate glycan substructures into all comparisons of glycans.
The all data and code for paper is provided under the branch https://github.com/LewisLabUCSD/GlyCompare/tree/for_publish
Bokan Bao+, Benjamin P. Kellman+, Austin W. T. Chiang, Austin K. York, Mahmoud A. Mohammad, Morey W. Haymond, Lars Bode, and Nathan E. Lewis. 2019. “Correcting for Sparsity and Non-Independence in Glycomic Data through a System Biology Framework.” bioRxiv. https://doi.org/10.1101/693507.
- Code Ocean Compute Capsule
- Jupyter Notebooks
- Try our new beta webapp
Glycompare was developed and tested in Mac OS X
and Ubuntu 18.04
Requirements and dependencies
- python 3+
- glypy
- anaconda3 (for cython, pandas, numpy, seaborn, networkx==2.1, ndex, xlrd==1.2)
- git-lfs (can be installed with homebrew or
sudo apt install git-lfs
The package can also be installed locally using
# get the repo
git clone https://github.com/LewisLabUCSD/GlyCompare.git
# enter the repo
cd GlyCompare
# get the large files (~150MB including the glycompare db [glytoucan_db_addr])
git lfs pull
# install glycompare
python3 setup.py install
If you don't have sudo
privileges, you can run a local install by adding the --user
tag to the install
python3 setup.py install --user
If git lfs or the install fails, follow these alternative instructions for install: https://github.com/LewisLabUCSD/GlyCompare/wiki#install-glycompare
A quick run through the top-level functions using the CHO-EPO data
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from copy import deepcopy
from scipy.stats import yeojohnson, probplot, wilcoxon, norm
from scipy import stats
from glycompare import *
# parameter setting
# environment parameter setting
glycompare_addr = '<PATH_TO_GLYCOMPARE>/GlyCompare/'
glytoucan_db_addr = os.path.join(glycompare_addr, 'glycompare','database', 'glytoucan_database.json')
linkage_specific = False
num_processors = 8
# project parameter
working_addr = os.path.join(glycompare_addr,'paper_epo/')
project_name = "paper_epo"
costumized_glycan_identifier = True
external_profile_naming= True
rerun = False
# initiator
keywords_dict = pipeline_functions.load_para_keywords(project_name, working_addr, glytoucan_db_addr=glytoucan_db_addr)
keywords_dict
pipeline_functions.check_init_dir(keywords_dict)
##
## initialize glycans
meta_name = pd.read_csv(os.path.join(working_addr,'source_data','glycan_id_list.txt'), sep='\t')
structure_loader = meta_name['glycan_id'].tolist()
data_type = 'mix'
glycan_dict = pipeline_functions.load_glycans_pip(keywords_dict=keywords_dict,
data_type=data_type,
structure_loader=structure_loader)
## extract substructure and generate substructure vector
matched_dict = pipeline_functions.extract_and_merge_substrutures_pip(keywords_dict, num_processors=num_processors,linkage_specific=linkage_specific, forced=rerun)
## calculate substructure abundance -> glycoprofile vector
abd_table = glycan_io.load_table(os.path.join(keywords_dict['source_dir'], 'abundance_table.xls'))
_, glycoprofile_list = pipeline_functions.glycoprofile_pip(keywords_dict,
abd_table,
unique_glycan_identifier_to_structure_id=False,
already_glytoucan_id=False,
external_profile_naming=True,
forced=rerun)
_name_dict = json_utility.load_json(keywords_dict['source_dir']+'external_profile_naming.json')
selected_name_list = ["EPO127.mgat1.","EPO174.mgat2.","EPO266.fut8.","st3gal4.st3gal6.mgat4a.mgat4b.mgat5",
"KI_ST6GalNAc1.st3gal4.st3gal6.mgat4a.mgat4b.mgat5","mgat4A.mgat4B.mgat5","B3gnt2.mgat4a.mgat4b.mgat5",
"st3gal4.st3gal6","B4GalT1","B4GalT2","B4GalT3","WT","B4GalT4","EPO78.mgat4B.","mgat4A.mgat4B","mgat5"]
select_col=[]
_ = {}
for i,j in _name_dict.items():
_[j] = i
for i in selected_name_list:
select_col.append(_[i])
print(select_col)
feature_name = []
profile_name = []
selected_profile = [30, 25, 34, 21, 22, 5, 20, 18, 6, 7, 8, 1, 9, 28, 3, 4]
for i in selected_profile:
profile_name.append(_name_dict[str(i)])
for j in glycoprofile_list[i-1].glycan_id_list:
feature_name.append(j)
feature_name = list(set(feature_name))
## select motif and get motif abundance -> glyco-motif vector
core=select_motifs.nglycan_core
motif_abd_table, motif_lab, merged_weights_dict=pipeline_functions.select_motifs_pip(keywords_dict,
linkage_specific=linkage_specific,
core=core,
only_substructures_start_from_root=True,
select_col= select_col)
## cluster and determine representative motifs
glycoprofile_cluster_dict, glyco_motif_cluster_dict = pipeline_functions.clustering_analysis_pip(keywords_dict=keywords_dict,
motif_abd_table=motif_abd_table,
select_profile_name = selected_name_list)
pipeline_functions.draw_substructure_representative_pip(glyco_motif_cluster=glyco_motif_cluster_dict,
substructure_vec=motif_lab.motif_vec,
motif_weights_dict=merged_weights_dict,
plot_all_substructure=False,
address_dir=keywords_dict['plot_output_dir'],
threshold=0.51,
plot_rep=True)
The GlyCompare framework provides several tools that account for the influence of the glycan substructure network in the analysis of glycomic data. However, for its effective use, there are two primary requirements for the data it processes. First, the tools require that the glycan is stored in a tree-like structure. Thus, neither cyclic nor glycan with undefined topology in glycoCT format can be processed. Second, during the substructure matching, in terms of the linkage specificity, the code can only handle two types of substructure analysis. One has the exact linkage specification, and one ignores all linkage specification and only accounts for topology. Currently, our tool cannot handle partial ambiguity in linkages of a glycan. The code and the manual are freely available and will be continually developed to enable its accessibility to all types of scientists.
Glycompare provides several complete pipelines that several major pre-prescribed functions of Glycompare.
1.1. abundance_table.xls. An abundance_table has column: sample (glycoprofile), row: glycan (the glycan identifier can be glytoucan_id or custimized: m/z, hplc). An abundance table (1.1) is necessary to run the full GlyCompare pipeline, if the glytoucan_id for each glycan is specified. For example, iscience_data. Complex data need more, epo_data.
1.2. (optional) external_profile_naming.json. Glycompare defaults to index names for each sample (e.g. 1,2,3). If you want to specify a name for each sample use. i.e. paper_epo/source_data/external_profile_naming.json
1.3. (optional) glycan_identifier_to_glytoucan_id.json. If the glycan identifiers in abundance_table.xls are glytoucan_id or m/z which map uniquely (1:1) to glytoucan IDs, this file can be ignored. If m/z can map to multiple isoforms, this file must be completed to specify which m/z correspond to which structures paper_epo/source_data/glycan_identifier_to_glytoucan_id.json
1.4. (optinal) source_data/glycoct/. If part of glycan structures are manually curated, we need a source_data/glycoct/ directory to store all of them (i.e. paper_epo/source_data/glycoct)
2.1. working_addr : root working dir
2.2. project_name: usually same as the folder of root
2.3. init.num_processors: number of processes needed
2.4. glytoucan_db_addr: the addr for glytoucan database if needed
- meta_name: a gff type table which includes glycan's naming information
Pipeline functions include: load_glycan_pip
, extract_and_merge_substructures_pip
, glycoprofile_pip
, select_motifs_pip
, clustering_analysis_pip
Most require a set of core inputs we centralize in the keyword_dict
which can be initialized using the load parameter function.
Description: Create a dictionary to maintain consistent working directories, filenames and project names throughout the analysis.
Input:
- project_name: string, a name for the analysis to be performed
- working_addr: string, the working directory where you would like the analysis to be saved
- kwargs: any additional analysis variables you would like to set
Output:
- keywords_dict: dictionary, containing environment variables to be used throughout the analysis
Example: From the isience example
# environment parameter setting
glytoucan_db_addr = os.path.join( 'glycompare','database', 'glytoucan_database.json')
# project parameter
working_addr = os.path.join( 'example_data', 'test_iscience')
project_name = "test_iscience"
keywords_dict = pipeline_functions.load_para_keywords(project_name, working_addr, glytoucan_db_addr=glytoucan_db_addr)
Description: Load glycans from glycoprofile using glytoucan database (data_type='glytoucanid') or local glycoct structure (data_type='local_glycoct') or both (data_type='mix'). Glycoct structures will be read from source_data/glycoct/
folder in the source_data directory (i.e. paper_epo/source_data/glycoct). Returns a glycan_dict object, a dictionary of glypy.Glycan objects with glytoucan and m/z as keys.
Input:
- keywords_dict: dict, keyword pipeline variable dictionary from
load_para_keywords
function - data_type: string, either ['used', 'glycan_dict', 'glytoucanid', 'local_glycoct', 'mix'] indicating which method should be used to load and initilize the glypy.glycan objects. If
data_type=='local_glycoct'
ordata_type=='mix'
, then the user must add glycoct files to thesource_data/glycoct/
folder in the source_data directory (i.e. paper_epo/source_data/glycoct). - structure_loader: a list of glycan name/customized id, or a glycan_dict
Output:
- glycan_dict: dict, glypy.Glycan objects with glytoucan and/or m/z as keys.
Approach:
if data_type=='used': # load from a local glycan_dict json
-> json file already exists in the working directyr
if data_type=='glycan_dict': # load from an existing glycan_dict in the local environment (check the glycan_dict obeject)
-> (already in the enviroment) glycan_dict already exists
if data_type=='glytoucanid': # load from the glytoucan database using glytoucan ids to retrieve the glycoct structure
:param load from glytoucan database (glytoucan_database_addr)
if data_type=='local_glycoct': # load from local glycoct files
:param load from local glycoct files (glycoct_dir)
if data_type=='mix': # load from both local glycoct and glytoucanid
:param glycoct_dir & glytoucan_database_addr (glytoucanid & local_glycoct)
The motif-abundance matrix is the dot product of the glycan abundance matrix (abundance of each glycan in each sample, sample x glycan) and the glycan-motif occurance matrix (number of occurances of each motif in each glycan, glycan x motif)
## comparison across enzyme isoform
# extract motifs
glycanMotifMatrix1 = np.matrix( [mapMotifs( extractMotifs( g1 ) , motifsVector = motifVector ) for g1 in glycanList1 ] )
glycanMotifMatrix2 = np.matrix( [mapMotifs( extractMotifs( g2 ) , motifsVector = motifVector ) for g2 in glycanList2 ] )
# abunance weighted motif vector
glycanMotifVectorAbundance1 = glycanMotifMatrix1%*%abundance1
glycanMotifVectorAbundance2 = glycanMotifMatrix1%*%abundance2
# all motif vector
glycanMotifVector1 = numpy.sum( glycanMotifMatrix1 , axis=1)
glycanMotifVector2 = numpy.sum( glycanMotifMatrix2 , axis=1)
=======
Example:
From the isience example
keywords_dict = pipeline_functions.load_para_keywords("test_iscience", "example_data/test_iscience", glytoucan_db_addr='glycompare/database/glytoucan_database.json')
_table = glycan_io.load_table(os.path.join('example_data/test_iscience/source_data', 'abundace_table.csv'), rep='-')
structure_loader = _table.columns.tolist() # get glytoucan ids
glycan_dict = pipeline_functions.load_structure_pip(keywords_dict=keywords_dict,
data_type='glytoucanid',
structure_loader=structure_loader)
From the cho epo example
keywords_dict = pipeline_functions.load_para_keywords("paper_epo", "example_data/paper_epo", glytoucan_db_addr='glycompare/database/glytoucan_database.json')
meta_name = pd.read_csv(os.path.join('example_data/paper_epo/source_data','glycan_id_list.txt'), sep='\t')
structure_loader = meta_name['glycan_id'].tolist()
glycan_dict = pipeline_functions.load_structure_pip(keywords_dict=keywords_dict,
data_type='mix',
structure_loader=structure_loader)
Description: Loads observed glycans (extract_substructures.extract_substructures_pip
), identifies substructures in each glycan then merges all substructures into a reference substructure vector (merge_substructure_vec.merge_substructure_dict_pip
), and maps all input glycans to the merged substructure vector to determine the number of times each substructure appears in each glycan (merge_substructure_vec.substructure_matching_wrapper
).
Input:
- keywords_dict: dict, keyword pipeline variable dictionary from
load_para_keywords
function.extract_and_merge_substructures_pip
will locate the glycans to process in thesource_data
directory using thekeywords_dict
. - linkage_specific: boolean, if True,
extract_and_merge_substructures_pip
will leverage linkage information. If False,extract_and_merge_substructures_pip
will run without linkage information. Do not setlinkage_specification=True
if this information is not supported by your measurement, this will result in unpredictable behavior. - num_processors: integer that specifies the number of CPU to use.
- forced: boolean, defaults to False. If False,
extract_and_merge_substructures_pip
will not run if the output file already exists. If True,extract_and_merge_substructures_pip
will run and overwrite the existing output file.
Output:
- matched_dict: dict, mapping from observed glycans (keys) to a
substructure_present_absent_vector
(e.g. matched_dict.json). Thesubstructure_present_absent_vector
indicates the number of times each substructure occurs in a glycan. The i-th substructure referenced in thesubstructure_present_absent_vector
matches the i-th substructure in thesubstructure_vec
which contains the glycoct for each substructure. #Thesubstructure_vec
is returned byglycan_io.motif_dict_to_motif_vec(substructure_vector_dict)
.
Example:
From the cho epo example
keywords_dict = pipeline_functions.load_para_keywords("paper_epo", "example_data/paper_epo", glytoucan_db_addr='glycompare/database/glytoucan_database.json')
substructure_vector_dict=pipeline_functions.extract_and_merge_substructures_pip(keywords_dict, linkage_specific=False, forced=False)
# get all the substructures present in the 1st observed glycan
gt0=substructure_vector_dict[substructure_vector_dict.keys()[0]]>0
present_substructures=glycan_io.motif_dict_to_motif_vec(substructure_vector_dict)[gt0]
Description: Loads glycoprofiles (glycan abundance) and calculates the glycoprofile_vector_table (substructure abundance)
Input:
- keywords_dict: dict, keyword pipeline variable dictionary from
load_para_keywords
function - abd_table: string, filename pointing to the glycan abundance table. The loader will handel csv and xls tables.
Row names are m/z, glytoucan ids, or custom glycan ids specified in
glycan_identifier_to_glytoucan_id.json
. in the source_data directory (i.e. paper_epo/source_data/glycan_identifier_to_glytoucan_id.json). Column names will be ignored, to name sample/glycoprofiles the user must specifyexternal_profile_naming.json
in the source_data directory (i.e. paper_epo/source_data/external_profile_naming.json). - simple_profile: boolean, defaults to False. If False,
glycoprofile_pip
will search forsource_data/glycan_identifier_to_glytoucan_id.json
to determine the names of glycans in the glycoprofile. If True, glycan structure will be read from the table. If glycans are identified by m/z in the abundance table and there is not a 1:1 mapping from m/z to glycan structure,simple_profile
must be False. - unique_glycan_identifier_to_structure_id: boolean, defaults to False. If this variable and
already_glytoucan_id
are set to True,glycoprofile_pip
will run using the using the GlyTouCanIDs in the abundance table. Otherwise,glycan_identifier_to_glytoucan_id.json.
will be read to name the glycoprofiles/samples. - already_glytoucan_id: boolean, defaults to False. If False, glycan structure will be read from the glytoucan database using glytoucan IDs from the abundance table. If True, the glycan identifiers specified in the abundance table must be read in separately
- forced: boolean, defaults to False. If False,
glycoprofile_pip
will not run if the output file already exists. If True,glycoprofile_pip
will run and overwrite the existing output file.
Output:
- glycoprofile_vector_table: double matrix, containing substructure abundance. Row names are substructures and column names are glycoprofiles/samples.
Example:
From the cho epo example
keywords_dict = pipeline_functions.load_para_keywords("paper_epo", "example_data/paper_epo", glytoucan_db_addr='glycompare/database/glytoucan_database.json')
abd_table = glycan_io.load_table(os.path.join(keywords_dict['source_dir'], 'abundance_table.xls'))
glycoprofile_vector_table=pipeline_functions.glycoprofile_pip(keywords_dict, abd_table, external_profile_naming=True, forced=False)
Description: Examines the glyan and substructure abundance tables and the substructure network to determine the minimal set of meaningful substructures. We term the minimal set of sufficiently meaningful substructure, "motifs."
Input:
- keywords_dict: dict, keyword pipeline variable dictionary from
load_para_keywords
function - linkage_specific: boolean, if True,
extract_and_merge_substructures_pip
will leverage linkage information. If False,extract_and_merge_substructures_pip
will run without linkage information. Do not setlinkage_specification=True
if this information is not supported by your measurement, this will result in unpredictable behavior. If True,extract_and_merge_pip
will run with all sunstructures including the root/core substructure. If False, all substructures will be considered. - core: glycot string describing the root sunbstructure
- drop_parellel: boolean, defaults to False. If True,
extract_merge_substructures
will run dropping the smaller-degree fewer monosaccharides substructure even if they do not have a parent/child relationship. - drop_diff_abundance: boolean, defaults to True. If True,
extract_merge_substructures
will run dropping the parent if the parent's abundance is close to the children's abundance - select_col: list, defaults to an empty list. It should be the columns passed into the substructure_abd_table for getting substructure weight
- remove_core: boolean, defaults to True. If True,
extract_merge_substructures
will run with the motif set excluding the provided core. If False, the core is included.
Output:
- motif_abd_table: real matrix, a matrix of motif abundances with motifs in the rows and samples in the columns.
- a_node_state: select_motifs.NodesState object, containing the abundance information and substructure connectivity
Example: From the cho epo example
keywords_dict = pipeline_functions.load_para_keywords("paper_epo", "example_data/paper_epo", glytoucan_db_addr='glycompare/database/glytoucan_database.json')
abd_table = glycan_io.load_table(os.path.join(keywords_dict['source_dir'], 'abundance_table.xls'))
glycoprofile_vector_table=pipeline_functions.glycoprofile_pip(keywords_dict, abd_table, external_profile_naming=True, forced=False)
motif_abd_table,a_node_state = pipeline_functions.select_motifs_pip(keywords_dict, linkage_specific=linkage_specific, epitope=True)
Description: Clusters the motif abundance table and extracts representative motifs for each cluster
Input:
- keywords_dict: dict, keyword pipeline variable dictionary from
load_para_keywords
function - motif_abd_table: real matrix, a matrix of motif abundances with motifs in the rows and samples in the columns.
Plot:
- substructure_representative:
- glycoprofile_cluster:
- rep_str: dict, dictionary of representative structures
- raw_abd_zscore_plot: figure, a biscluster of substructure abundance saved at 'raw_abundance_zscore.eps'
Example: From the cho epo example
keywords_dict = pipeline_functions.load_para_keywords("paper_epo", "example_data/paper_epo", glytoucan_db_addr='glycompare/database/glytoucan_database.json')
abd_table = glycan_io.load_table(os.path.join(keywords_dict['source_dir'], 'abundance_table.xls'))
glycoprofile_vector_table=pipeline_functions.glycoprofile_pip(keywords_dict, abd_table, external_profile_naming=True, forced=False)
motif_abd_table = pipeline_functions.select_motifs_pip(keywords_dict, linkage_specific=linkage_specific, epitope=True)
pipeline_functions.profile_cluster_pip(keyworks_dict,motif_abd_table)
Statistical analyses are not yet incorperate in the package. The analysese performed can be found at example_notebook/Fig4_Fig5_hmo_stats.r