fpozoc/trifid

TRIFID is the method described in the manuscript "Assessing the functional relevance of splice isoforms", published in NAR Genomics and Bioinformatics on 22 May 2021.

Citation
@article{10.1093/nargab/lqab044,
    author = {Pozo, Fernando and Martinez-Gomez, Laura and Walsh, Thomas A and Rodriguez, José Manuel and Di Domenico, Tomas and Abascal, Federico and Vazquez, Jesús and Tress, Michael L},
    title = "{Assessing the functional relevance of splice isoforms}",
    journal = {NAR Genomics and Bioinformatics},
    volume = {3},
    number = {2},
    year = {2021},
    month = {05},
    abstract = "{Alternative splicing of messenger RNA can generate an array of mature transcripts, but it is not clear how many go on to produce functionally relevant protein isoforms. There is only limited evidence for alternative proteins in proteomics analyses and data from population genetic variation studies indicate that most alternative exons are evolving neutrally. Determining which transcripts produce biologically important isoforms is key to understanding isoform function and to interpreting the real impact of somatic mutations and germline variations. Here we have developed a method, TRIFID, to classify the functional importance of splice isoforms. TRIFID was trained on isoforms detected in large-scale proteomics analyses and distinguishes these biologically important splice isoforms with high confidence. Isoforms predicted as functionally important by the algorithm had measurable cross species conservation and significantly fewer broken functional domains. Additionally, exons that code for these functionally important protein isoforms are under purifying selection, while exons from low scoring transcripts largely appear to be evolving neutrally. TRIFID has been developed for the human genome, but it could in principle be applied to other well-annotated species. We believe that this method will generate valuable insights into the cellular importance of alternative splicing.}",
    issn = {2631-9268},
    doi = {10.1093/nargab/lqab044},
    url = {https://doi.org/10.1093/nargab/lqab044},
    note = {lqab044},
    eprint = {https://academic.oup.com/nargab/article-pdf/3/2/lqab044/38108084/lqab044.pdf},
}

Introduction

TRIFID is a machine learning-based model that aims to predict the functionality of every isoform in the genome. The model has been designed to be accurate, interpretable and reproducible.

This repository gives bioinformaticians the whole recipe for how this method was created. However, if you are not interested in the complete installation and execution of TRIFID, jump directly to section 4, where the TRIFID predictions are described. If you are interested only in the TRIFID side modules that generate some of the predictive features, go to section 6.

To learn more about any other part of the project, go back to the table of contents above, open an issue in this repository, or contact the main TRIFID developer directly.

Installation instructions

Package installation

pip install git+https://github.com/fpozoc/trifid.git

Package development

Run the silent installation of Miniconda/Anaconda in case you don't have this software in your environment.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
Remember to install mamba!
conda install -c conda-forge mamba

Once you have installed Miniconda/Anaconda with mamba, create a Python environment from environment.yml.

git clone git@github.com:fpozoc/trifid.git
cd trifid
mamba env create -f environment.yml
conda activate trifid
pre-commit install

Run the pre-commit/tests:

pre-commit run --all-files
pytest -v

Update the dependencies

Re-install the project in editable mode:

pip install -e ".[dev]"
# optional
pip install ".[extra]" # to install the visualization dependencies
pip install ".[interactive]" # to install the interactive dependencies

Model reproducibility

Data sources

The TRIFID model was initially trained with 45 predictive features of a subset of the protein isoforms annotated in GENCODE Release 27 (GRCh38.p10). Features have been described here. Two extra features have been added in the second release.

To build these features, we parsed some existing databases and created some specific modules for the task:

  • GENCODE genome annotation statistics for protein-coding transcripts. Data sets are available in the GENCODE ftp server.
  • APPRIS methods to quantify protein structural information, functionally important residues, conservation of functional domains and evidence of cross-species conservation. Data sets are available in the APPRIS http server.
  • PhyloCSF scores as a complementary measure of evolutionary conservation. Pre-computed scores for some genome annotation versions are available in this repository.
  • ALT-Corsair (APPRIS module) to quantify the age of the last common ancestor of the most distant orthologue that fulfills the search criteria. It reports a score representing the age of the oldest species whose orthologue maps to the whole protein sequence. The method is based on the Corsair module in APPRIS. Pre-computed scores for some genome versions are available in the APPRIS webserver.
  • QSplice (TRIFID module) to quantify splice-junction coverage, together with our RNA-seq Snakemake pipeline to perform a comprehensive RNA-seq analysis. Pre-computed scores for GENCODE 27 are available here. More details about this module are given in the section below.
  • Pfam effects (TRIFID module) to quantify the effect of alternative splicing on the Pfam domains of every protein-coding gene in the genome. Pre-computed scores for some genome annotation versions are available here. More details about this module are given in the section below.
  • Fragment labelling (TRIFID module) to label isoforms that are duplications or fragments for a later score-correction step. More details about this module are given below.

The data sources to reproduce our analysis are available for some genome versions through this shared point. In the source folder, the files that a user would need to run TRIFID on GENCODE 27 are freely available. Moreover, the config file contains the source file paths used to create a data set for training TRIFID. The user can modify these paths, but it is recommended to run everything inside the previously downloaded TRIFID folder.
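Before building the data set, it can help to confirm that the source files are where they are expected. The following is a minimal sketch and not part of TRIFID: the helper name `missing_sources` is hypothetical, and the listed paths are simply the GENCODE 27 download locations used in the Preprocessing section, which may differ from the paths in your config file.

```python
from pathlib import Path

# Paths taken from the GENCODE 27 download commands in this README.
# Assumption: adjust them if your config file points elsewhere.
EXPECTED_SOURCES = [
    "data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gtf.gz",
    "data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gff3.gz",
    "data/external/appris/GRCh38/g27/appris_data.principal.txt",
    "data/external/appris/GRCh38/g27/appris_data.appris.txt",
    "data/external/appris/GRCh38/g27/appris_data.transl.fa.gz",
]


def missing_sources(paths, root="."):
    """Return the paths (relative to root) that do not exist as files."""
    root = Path(root)
    return [p for p in paths if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_sources(EXPECTED_SOURCES)
    if missing:
        print("Missing source files:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected source files are present.")
```

Run it from the root of the cloned repository; an empty result means every listed file was found.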

Both predictions and features will be available with the second release of TRIFID here.

Preprocessing

Below is an example of how to reproduce the method from scratch for GENCODE 27. Follow these steps:

  1. Download the annotation files from the GENCODE and APPRIS websites:
cd trifid

# GENCODE data
mkdir -p data/external/genome_annotation/GRCh38/g27
curl ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_27/gencode.v27.annotation.gtf.gz -o data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gtf.gz
curl ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_27/gencode.v27.annotation.gff3.gz -o data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gff3.gz

# APPRIS data
mkdir -p data/external/appris/GRCh38/g27
curl http://apprisws.bioinfo.cnio.es/pub/current_release/datafiles/homo_sapiens/e90v35/appris_data.principal.txt -o data/external/appris/GRCh38/g27/appris_data.principal.txt
curl http://apprisws.bioinfo.cnio.es/pub/current_release/datafiles/homo_sapiens/e90v35/appris_data.appris.txt -o data/external/appris/GRCh38/g27/appris_data.appris.txt
curl http://apprisws.bioinfo.cnio.es/pub/current_release/datafiles/homo_sapiens/e90v35/appris_data.transl.fa.gz -o data/external/appris/GRCh38/g27/appris_data.transl.fa.gz
  2. Compute the splice-junction coverage scores from a complete set of RNA-seq samples covering a wide variety of tissues. Note that these samples have been processed through an extensive computational pipeline. We provide these pre-computed scores for some genome annotation versions here.
python -m trifid.preprocessing.qsplice \
    --gff   data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gff3.gz \
    --outdir data/external/qsplice/GRCh38/g27 \
    --samples out/E-MTAB-2836/GRCh38/STAR/g27 \
    --version g
  3. Compute the Pfam effects of alternative splicing relative to the reference isoform of every protein-coding gene in the genome. We provide these pre-computed scores for some genome annotation versions here.
python -m trifid.preprocessing.pfam_effects \
    --appris data/external/appris/GRCh38/g27/appris_data.appris.txt \
    --jobs 10 \
    --seqs data/external/appris/GRCh38/g27/appris_data.transl.fa.gz \
    --spade data/external/appris/GRCh38/g27/appris_method.spade.gtf.gz \
    --outdir data/external/pfam_effects/GRCh38/g27
  4. Generate a non-redundant set of isoforms by labelling fragments and duplications. We provide these pre-computed scores for some genome annotation versions here.
python -m trifid.preprocessing.label_fragments  \
    --gtf data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gtf.gz \
    --seqs data/external/appris/GRCh38/g27/appris_data.transl.fa.gz \
    --principals data/external/appris/GRCh38/g27/appris_data.principal.txt \
    --outdir data/external/label_fragments/GRCh38/g27
  5. Download the ALT-Corsair and PhyloCSF data sets available in this repository.

  6. Create the complete set of isoforms with their corresponding predictive features by running:

python -m trifid.data.make_dataset

Model training

Once the data set with predictive features for GENCODE 27 has been created, train the machine learning model on the training set derived from proteomics experimental evidence (Kim et al., 2014) by running:

python -m trifid.model.train

Finally, to apply the Machine Learning model previously trained to predict the functional probability of each isoform, the user has to run:

python -m trifid.model.predict

Availability of data

For now, predictions and predictive features are available for the genome annotation versions presented in the table. If you need these data for another genome version or another species, please open an issue in this repository.

TRIFID predictions and predictive features

| Genome assembly | Species | Name | Model | Version | Database | Release - Date | Features - predictions |
|---|---|---|---|---|---|---|---|
| GRCh38 | Homo sapiens | Human | Human | v1 | GENCODE | 27 - 08.2017 | sharepoint |
| GRCh38 | Homo sapiens | Human | Human | v2 | GENCODE | 42 - 04.2022 | sharepoint |
| GRCh38 | Homo sapiens | Human | Human | v2 | GENCODE | 37 - 02.2021 | sharepoint |
| GRCh38 | Homo sapiens | Human | RefSeq | v2 | RefSeq - NCBI | 110 - 02.2020 | sharepoint |
| GRCh37 | Homo sapiens | Human | RefSeq | v2 | RefSeq - NCBI | 105 - 02.2020 | sharepoint |
| GRCh37 | Homo sapiens | Human | Human | v2 | GENCODE | 19 - 12.2013 | sharepoint |
| GRCm39 | Mus musculus | Mouse | Mouse | v2 | GENCODE | 31 - 04.2022 | sharepoint |
| GRCm38 | Mus musculus | Mouse | Mouse | v2 | GENCODE | 25 - 11.2019 | sharepoint |
| mRatBN7.2 | Rattus norvegicus | Rat | Vertebrates | v2 | Ensembl | 105 - 12.2021 | sharepoint |
| GRCz11 | Danio rerio | Zebrafish | Vertebrates | v2 | Ensembl | 104 - 05.2021 | sharepoint |
| GRCg7b | Gallus gallus | Chicken | Vertebrates | v2 | Ensembl | 108 - 10.2022 | sharepoint |
| Pan_tro_3.0 | Pan troglodytes | Chimpanzee | Vertebrates | v2 | Ensembl | 104 - 05.2021 | sharepoint |
| Sscrofa11.1 | Sus scrofa | Pig | Vertebrates | v2 | Ensembl | 108 - 10.2022 | sharepoint |
| ARS-UCD1.2 | Bos taurus | Cow | Vertebrates | v2 | Ensembl | 104 - 05.2021 | sharepoint |
| Mmul_10 | Macaca mulatta | Macaque | Vertebrates | v2 | Ensembl | 105 - 12.2021 | sharepoint |
| BDGP6 | Drosophila melanogaster | Fruitfly | Invertebrates | v2 | Ensembl - Flybase | 107 - 07.2022 | sharepoint |
| WBcel235 | Caenorhabditis elegans | Worm | Invertebrates | v2 | Ensembl - Wormbase | 108 - 10.2022 | sharepoint |

Other useful links

Example: Fibroblast growth factor receptor 1 (FGFR1)

ENSG00000077782 (Ensembl) - P11362 (FGFR1_HUMAN) (UniProt)

Loading the model

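import pandas as pd
# Load the pre-computed TRIFID predictions (gzip-compressed, tab-separated)
predictions = pd.read_csv('data/genomes/GRCh38/g27/trifid_predictions.tsv.gz', compression='gzip', sep='\t')
gene_name = 'FGFR1' # select gene name to explore
# Keep the rows for this gene and only the columns of interest
columns = ['transcript_id', 'gene_name', 'trifid_score', 'appris', 'sequence']
predictions.loc[predictions['gene_name'] == gene_name, columns]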

| Gene name | Transcript id | APPRIS label | Length | TRIFID Score | TRIFID Score (n) |
|---|---|---|---|---|---|
| FGFR1 | ENST00000447712 | PRINCIPAL:3 | 822 | 0.87 | 0.99 |
| FGFR1 | ENST00000356207 | MINOR | 733 | 0.60 | 0.69 |
| FGFR1 | ENST00000397103 | MINOR | 733 | 0.01 | 0.08 |
| FGFR1 | ENST00000619564 | MINOR | 228 | 0.00 | 0.01 |

Loading the SHAP predictions for a single isoform

Our tutorial Jupyter notebook explains in more detail how to load the SHAP local predictions for an isoform of FGFR1:

explain_prediction(df_shap, model, features, 'ENST00000356207')

TRIFID modules

To generate a complete set of predictive features aiming to provide precise predictions, we created some extra predictive scores that intend to represent every single isoform.

QSplice

This TRIFID module quantifies the splice junctions coverage from STAR SJ.out.tab. It maps the unique reads to genome positions using the collapsed coding splice junctions to calculate a score per transcript.

To generate the initial splice-junctions coverage file, we mapped the RNA-seq expression samples of 32 tissues from 122 human individuals stored here, using our RNA-seq Snakemake pipeline.

As mentioned above, this module uses the GENCODE gff3 annotation and a set of SJ.out.tab files generated by a STAR RNA-seq alignment. In our case, these samples are stored in different folders inside the outdir directory, but an option to use a customized SJ.out.tab is also available: replace the --samples flag with --custom SJ.out.customized.tab. To generate the TRIFID RNA-seq predictive features with the E-MTAB-2836 samples, we used this command:

python -m trifid.preprocessing.qsplice \
    --gff   data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gff3.gz \
    --outdir data/external/qsplice/GRCh38/g27 \
    --samples out/E-MTAB-2836/GRCh38/STAR/g27 \
    --version g
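
To make the input format concrete, here is a minimal sketch of reading STAR's SJ.out.tab and keeping, for each junction, the sample with the most uniquely mapped reads. It is not the QSplice implementation: the function names `read_sj` and `max_across_samples` are hypothetical, and apart from the 57 testis reads the toy read counts are illustrative. The nine-column layout follows the STAR manual.

```python
import csv
import io

# Column layout of STAR's SJ.out.tab (tab-separated, no header line).
SJ_COLUMNS = [
    "chrom", "intron_start", "intron_end", "strand", "motif",
    "annotated", "unique_reads", "multi_reads", "max_overhang",
]


def read_sj(handle):
    """Map (chrom, start, end, strand) -> unique read count for one sample."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        rec = dict(zip(SJ_COLUMNS, row))
        key = (rec["chrom"], int(rec["intron_start"]),
               int(rec["intron_end"]), rec["strand"])
        counts[key] = int(rec["unique_reads"])
    return counts


def max_across_samples(samples):
    """For each junction keep the (unique_reads, sample) pair with most reads."""
    best = {}
    for name, counts in samples.items():
        for key, reads in counts.items():
            if key not in best or reads > best[key][0]:
                best[key] = (reads, name)
    return best


# Two toy samples for one junction (strand code 1 means "+" in STAR).
testis = io.StringIO("chr1\t169803310\t169804074\t1\t1\t1\t57\t0\t38\n")
tonsil = io.StringIO("chr1\t169803310\t169804074\t1\t1\t1\t12\t1\t30\n")
samples = {"testis": read_sj(testis), "tonsil": read_sj(tonsil)}
best = max_across_samples(samples)
print(best[("chr1", 169803310, 169804074, "1")])  # (57, 'testis')
```

With real data, each `io.StringIO` would be replaced by an open file handle on a sample's SJ.out.tab.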

The program produces two files:

  • sj_maxp.emtab2836.mapped.tsv.gz representing one row and one score per splice-junction.
    • RNA2sj is the number of unique reads divided by the gene average unique reads of all splice-junctions.
    • RNA2sj_cds is the number of unique reads divided by the gene average unique reads of splice-junctions that are spanning CDS exons.
  • qsplice.emtab2836.g27.tsv.gz (TRIFID input) representing one row and one score per protein-coding transcript.

Let's see an example that represents this more clearly.

Example: Chromosome 1 open reading frame 112 (C1orf112)

ENSG00000000460 (Ensembl) - Q9NSG2 (CA112_HUMAN) (UniProt)

| seqname | type | start | end | strand | gene_id | gene_name | gene_type | transcript_id | cds_coverage | intron_number | nexons | ncds | unique_reads | tissue | gene_mean | gene_mean_cds | RNA2sj | RNA2sj_cds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chr1 | intron | 169794906 | 169798856 | + | ENSG00000000460 | C1orf112 | protein_coding | ENST00000472795 | none | 1 | 6 | 4 | 2 | tonsil | 67.3732 | 73.7826 | 0.0297 | 0.0271 |
| chr1 | intron | 169798959 | 169800882 | + | ENSG00000000460 | C1orf112 | protein_coding | ENST00000472795 | none | 2 | 6 | 4 | 69 | testis | 67.3732 | 73.7826 | 1.0241 | 0.9352 |
| chr1 | intron | 169800972 | 169802620 | + | ENSG00000000460 | C1orf112 | protein_coding | ENST00000472795 | full | 3 | 6 | 4 | 74 | testis | 67.3732 | 73.7826 | 1.0984 | 1.0029 |
| chr1 | intron | 169802726 | 169803168 | + | ENSG00000000460 | C1orf112 | protein_coding | ENST00000472795 | full | 4 | 6 | 4 | 77 | testis | 67.3732 | 73.7826 | 1.1429 | 1.0436 |
| chr1 | intron | 169803310 | 169804074 | + | ENSG00000000460 | C1orf112 | protein_coding | ENST00000472795 | full | 5 | 6 | 4 | 57 | testis | 67.3732 | 73.7826 | 0.846 | 0.7725 |

In the case of C1orf112 in GENCODE 27, QSplice selects splice junction number 5 of ENST00000472795, located between positions 169803310 and 169804074. This splice junction reaches its maximum coverage in testis, with 57 unique reads spanning the junction. Among the junctions of this isoform that are spanned by coding exons (the only ones taken into account), this is the lowest coverage, as the table shows. The final scores RNA2sj and RNA2sj_cds are obtained by dividing this value by the respective gene means.
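
The arithmetic behind this example can be reproduced in a few lines. This is a sketch of the selection rule as described in the paragraph above, not the QSplice source code; it assumes that `cds_coverage == "full"` is what marks a junction as spanned by coding exons, and it takes the read counts and gene means directly from the table.

```python
# Junctions of ENST00000472795 from the table above:
# (intron_number, cds_coverage, unique_reads)
junctions = [
    (1, "none", 2),
    (2, "none", 69),
    (3, "full", 74),
    (4, "full", 77),
    (5, "full", 57),
]
gene_mean = 67.3732       # gene average over all splice junctions
gene_mean_cds = 73.7826   # gene average over CDS-spanning splice junctions

# Assumption: only junctions flagged "full" count as spanned by coding exons.
coding = [j for j in junctions if j[1] == "full"]

# The isoform is represented by its least covered coding junction.
intron_number, _, unique_reads = min(coding, key=lambda j: j[2])

rna2sj = round(unique_reads / gene_mean, 4)
rna2sj_cds = round(unique_reads / gene_mean_cds, 4)
print(intron_number, unique_reads, rna2sj, rna2sj_cds)  # 5 57 0.846 0.7725
```

The printed values match the last row of the table: junction 5, 57 reads, RNA2sj 0.846 and RNA2sj_cds 0.7725.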

  • qsplice.emtab2836.tsv.gz sample output for some isoforms of C1orf112. The isoform ENST00000472795 shown above gets the same RNA2sj (0.846) and RNA2sj_cds (0.7725) scores as before.

| seqname | gene_id | gene_name | gene_type | transcript_id | intron_number | nexons | ncds | unique_reads | tissue | gene_mean | gene_mean_cds | RNA2sj | RNA2sj_cds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chr1 | ENSG00000000460 | C1orf112 | protein_coding | ENST00000286031 | 6 | 24 | 22 | 53 | testis | 67.3732 | 73.7826 | 0.7867 | 0.7183 |
| chr1 | ENSG00000000460 | C1orf112 | protein_coding | ENST00000359326 | 7 | 25 | 22 | 53 | testis | 67.3732 | 73.7826 | 0.7867 | 0.7183 |
| chr1 | ENSG00000000460 | C1orf112 | protein_coding | ENST00000413811 | 20 | 23 | 14 | 62 | testis | 67.3732 | 73.7826 | 0.9202 | 0.8403 |
| chr1 | ENSG00000000460 | C1orf112 | protein_coding | ENST00000459772 | 2 | 23 | 3 | 7 | fallopiantube | 67.3732 | 73.7826 | 0.1039 | 0.0949 |
| chr1 | ENSG00000000460 | C1orf112 | protein_coding | ENST00000466580 | 2 | 8 | 3 | 7 | fallopiantube | 67.3732 | 73.7826 | 0.1039 | 0.0949 |
| chr1 | ENSG00000000460 | C1orf112 | protein_coding | ENST00000472795 | 5 | 6 | 4 | 57 | testis | 67.3732 | 73.7826 | 0.846 | 0.7725 |
| chr1 | ENSG00000000460 | C1orf112 | protein_coding | ENST00000481744 | 2 | 7 | 3 | 7 | fallopiantube | 67.3732 | 73.7826 | 0.1039 | 0.0949 |
| chr1 | ENSG00000000460 | C1orf112 | protein_coding | ENST00000496973 | 5 | 6 | 6 | 8 | tonsil | 67.3732 | 73.7826 | 0.1187 | 0.1084 |
| chr1 | ENSG00000000460 | C1orf112 | protein_coding | ENST00000498289 | 3 | 29 | 0 | 0 | - | 67.3732 | 73.7826 | 0 | 0 |

Figure: ENST00000472795 (C1orf112-206) exon distribution scheme to represent how QSplice scores are generated.

Pfam effects

This TRIFID module quantifies the Pfam effects relative to the reference isoform of every protein-coding gene in the genome. The scores quantify the impact of an alternative splicing event on Pfam domains, and whether a domain would be damaged, lost or intact. To generate the TRIFID Pfam effects predictive features we need the APPRIS scores file, the protein sequences file and the SPADE scores file. To generate the set of predictive features for GENCODE 27, we used this command:

python -m trifid.preprocessing.pfam_effects \
    --appris data/external/appris/GRCh38/g27/appris_data.appris.txt \
    --jobs 10 \
    --seqs data/external/appris/GRCh38/g27/appris_data.transl.fa.gz \
    --spade data/external/appris/GRCh38/g27/appris_method.spade.gtf.gz \
    --outdir data/external/pfam_effects/GRCh38/g27

The program generates:

  • qpfam.tsv.gz representing one row and several scores per transcript. The final scores are:
    • pfam_score shows the direct effect of Alternative Splicing over Pfam domains getting the number of residues conserved after an event.
    • pfam_domains_impact_score represents the percentage of Pfam domains that are intact after an event.
    • perc_Damaged_State represents the percentage of Pfam domains that are damaged after an event.
    • perc_Lost_State represents the percentage of Pfam domains that are lost after an event.
    • Lost_residues_pfam counts the number of residues from Pfam domains lost.
    • Gain_residues_pfam counts the number of residues from Pfam domains added.

Again, let's see an example to understand these scores.

Example: NIPA like domain containing 3 (NIPAL3)

ENSG00000001461 (Ensembl) - Q6P499 (NPAL3_HUMAN) (UniProt).

The following table represents the qpfam.tsv.gz sample output for isoforms of ENSG00000001461.

| transcript_id | pfam_score | pfam_domains_impact_score | perc_Damaged_State | perc_Lost_State | Lost_residues_pfam | Gain_residues_pfam | pfam_effects_msa |
|---|---|---|---|---|---|---|---|
| ENST00000374399 | 1 | 1 | 0 | 0 | 0 | 0 | Reference |
| ENST00000339255 | 1 | 1 | 0 | 0 | 0 | 0 | Transcript |
| ENST00000003912 | 0.83 | 0 | 1 | 0 | 50 | 0 | Transcript |
| ENST00000358028 | 0.62 | 0 | 1 | 0 | 112 | 0 | Transcript |
| ENST00000432012 | 0.35 | 0 | 1 | 0 | 255 | 0 | Transcript |

This gene has one Pfam domain (Mg_trans_NIPA - PF05653), which is shown in green in the figure below.


Figure: Muscle alignment including a fraction of the sequence isoforms of NIPAL3.
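
As an illustration of how these columns can be read together, the sketch below summarises each NIPAL3 transcript by its most severe Pfam domain state. It is only a reading aid under stated assumptions: the helper `domain_status` and its three labels are hypothetical and not part of the module, and the rows are copied from the table above (the pfam_effects_msa column is left out).

```python
import csv
import io

# Rows copied from the NIPAL3 table above, rewritten as tab-separated text.
# With the real file one could pass gzip.open("qpfam.tsv.gz", "rt") instead.
QPFAM_SAMPLE = (
    "transcript_id\tpfam_score\tpfam_domains_impact_score\t"
    "perc_Damaged_State\tperc_Lost_State\tLost_residues_pfam\t"
    "Gain_residues_pfam\n"
    "ENST00000374399\t1\t1\t0\t0\t0\t0\n"
    "ENST00000339255\t1\t1\t0\t0\t0\t0\n"
    "ENST00000003912\t0.83\t0\t1\t0\t50\t0\n"
    "ENST00000358028\t0.62\t0\t1\t0\t112\t0\n"
    "ENST00000432012\t0.35\t0\t1\t0\t255\t0\n"
)


def domain_status(row):
    """Summarise a transcript by its most severe Pfam domain state."""
    if float(row["perc_Lost_State"]) > 0:
        return "lost"
    if float(row["perc_Damaged_State"]) > 0:
        return "damaged"
    return "intact"


rows = list(csv.DictReader(io.StringIO(QPFAM_SAMPLE), delimiter="\t"))
statuses = {row["transcript_id"]: domain_status(row) for row in rows}
for transcript_id, status in statuses.items():
    print(transcript_id, status)
```

For this gene the first two transcripts come out as intact and the remaining three as damaged, in line with their perc_Damaged_State of 1 and the growing number of lost Pfam residues.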

Fragment labelling

The fragment labelling module defines which fraction of the set of genome isoforms is redundant. In the GENCODE genome annotation, some sequences are incomplete (tagged cds_end_NF or cds_start_NF) and must be identified so that their scores can be corrected. This program also identifies protein sequences that are duplicated across the genome. It needs the GENCODE gtf annotation, the protein sequences and the APPRIS labels. With the command below, we tagged the whole set of isoforms as Principal, Alternative, Redundant [Principal|Alternative] or [Principal|Alternative] Duplication:

python -m trifid.preprocessing.label_fragments  \
    --gtf data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gtf.gz \
    --seqs data/external/appris/GRCh38/g27/appris_data.transl.fa.gz \
    --principals data/external/appris/GRCh38/g27/appris_data.principal.txt \
    --outdir data/external/label_fragments/GRCh38/g27
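
The two checks described above (incomplete CDS tags and duplicated protein sequences) can be sketched as follows. This is a simplified illustration and does not reproduce the module's labelling rules or its output labels: the function `flag_isoforms`, its input layout and the toy identifiers and sequences are all hypothetical.

```python
from collections import defaultdict

# GENCODE tags that mark an incomplete coding sequence.
NF_TAGS = {"cds_start_NF", "cds_end_NF"}


def flag_isoforms(isoforms):
    """Flag incomplete CDS and group identical protein sequences.

    isoforms: list of (transcript_id, protein_sequence, tags) tuples,
    where tags is a set of GENCODE tag strings.
    Returns {transcript_id: {"incomplete": bool, "duplicates": [ids]}}.
    """
    by_sequence = defaultdict(list)
    for transcript_id, sequence, _ in isoforms:
        by_sequence[sequence].append(transcript_id)

    flags = {}
    for transcript_id, sequence, tags in isoforms:
        others = [t for t in by_sequence[sequence] if t != transcript_id]
        flags[transcript_id] = {
            "incomplete": bool(NF_TAGS & set(tags)),
            "duplicates": others,
        }
    return flags


# Toy input: identifiers and sequences are made up for illustration.
toy = [
    ("T1", "MKTAYIAKQR", set()),
    ("T2", "MKTAYIAKQR", set()),
    ("T3", "AYIAKQ", {"cds_start_NF", "cds_end_NF"}),
]
flags = flag_isoforms(toy)
for transcript_id, info in flags.items():
    print(transcript_id, info)
```

Here T1 and T2 are reported as duplicates of each other, and T3 is flagged as incomplete because it carries the NF tags.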

Directory structure

Project structure from Cookiecutter Data Science.

+-- .gitignore
+-- LICENSE
+-- README.md                       <- The top-level README for developers using this project
+-- config                          <- YAML files to customize the pipelines
¦   +-- features.yaml               <- Features name, category, description and species support
¦   +-- config.yaml                 <- Customized to create the database
¦
+-- img                             <- Repository image logos
¦
+-- models                          <- Trained model, model selection log and results
¦
+-- notebooks                       <- Jupyter notebooks to reproduce interactively the methods
¦   +-- 01.tutorial.ipynb           <- Tutorial to run an end-to-end TRIFID simulation
¦   +-- 02.figures                  <- Useful figures generated
¦
+-- .editorconfig                   <- Editor configuration file
+-- setup.py                        <- Make this project pip installable 
+-- setup.cfg                       <- Setup configuration file
+-- environment.yml                 <- The requirements file for reproducing the analysis environment
+-- pyproject.toml                  <- Project configuration file
+-- trifid                          <- Source code for use in this project.
¦   +-- __init__.py                 <- Makes trifid a Python module
¦   ¦
¦   +-- preprocessing               <- Scripts to run the TRIFID modules
¦   ¦   +-- __init__.py
¦   ¦   +-- fragment_labeling.py
¦   ¦   +-- pfam_effects.py
¦   ¦   +-- qsplice.py
¦   ¦
¦   +-- data                        <- Scripts to download or generate data and turn raw data into features for modeling
¦   ¦   +-- __init__.py
¦   ¦   +-- loaders.py
¦   ¦   +-- feature_engineering.py
¦   ¦   +-- make_dataset.py
¦   ¦
¦   +-- models                      <- Scripts to train models and then use trained models to make predictions
¦   ¦   +-- __init__.py
¦   ¦   +-- interpret.py
¦   ¦   +-- predict.py
¦   ¦   +-- select.py
¦   ¦   +-- train.py
¦   ¦
¦   +-- utils                      <- Useful functions used in several modules of the package
¦   ¦   +-- __init__.py
¦   ¦   +-- utils.py
¦   ¦   +-- analyse_appris_spade_transcripts_nf.pl
¦   ¦   +-- get_NR_list.pl
¦   ¦   +-- get_seqlen.pl
¦   ¦
¦   +-- visualization               <- Scripts to create exploratory and results-oriented visualizations
¦   ¦   +-- __init__.py
¦       +-- figures.py

Author information

Fernando Pozo (@fpozoca, fpozoc@gmx.com)

Contributors: Daniel Cerdán, Laura Martinez-Gomez, Thomas A. Walsh, Tomas Di Domenico, Jose Manuel Rodriguez, Jesus Vazquez, Federico Abascal, Michael L Tress

Release History

  • TRIFID initial release (March 10, 2021).
  • TRIFID v2.0.0 release (Sep, 2022).

Contributing

Instructions to contribute to this project:

Branching (internal collaboration)

Read CONTRIBUTING.md

Quickstart tips

  • Follow a development workflow structure:
    1. Open an Issue describing your implementation.
    2. Create a Branch called issue-number_developer-name_problem (e.g. 12_fernando_modify-docs) (please don't commit directly to master branch). Commit from here.
    3. Create a Pull Request to main or develop branch when the code is ready.
  • Don't upload big files here (only tests/examples if needed). Instead, use Azure.
  • The Makefile contains some useful commands (e.g. make check). Check it.
  • Read the CONTRIBUTING.md if you are interested in contributing to this repository.

Forking (external collaboration)

  1. Fork it (https://github.com/fpozoc/trifid)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request

NOTE: Several functions or classes inside this repository can be useful for Bioinformatics or Machine Learning developers. However, at the moment, the main objective of TRIFID is not to be a Python package as such. It has only been designed this way to facilitate reproducibility.

License

Distributed under the GNU General Public License.

See LICENSE file.