Algorithm to infer clusters of isoorthologous transcripts using gene-level homology relationships and a Reciprocal Best Hits approach

UdeS-CoBIUS/TranscriptOrthology

🧬 Orthology and Paralogy at Transcript Level 🧬

👥 Authors

  • Wend Yam Donald Davy Ouedraogo & Aida Ouangraoua, CoBIUS LAB, Department of Computer Science, Faculty of Science, Université de Sherbrooke, Sherbrooke, Canada

💡 If you are using our algorithm in your research, please cite our recent paper: Ouedraogo, W. Y. D. D., & Ouangraoua, A. (2023, April). Inferring Clusters of Orthologous and Paralogous Transcripts. In RECOMB International Workshop on Comparative Genomics (pp. 19-34).

📧 Contact: wend.yam.donald.davy.ouedraogo@usherbrooke.ca

📖 Table of Contents

  1. ➤ About the project
    1. ➤ Overview
    2. ➤ Operating System
    3. ➤ Requirements
  2. ➤ Inferring clusters of orthologous and paralogous transcripts
    1. ➤ About the package
    2. ➤ Getting Started
    3. ➤ Project files descriptions
      1. ➤ Inputs description
      2. ➤ Outputs description
      3. ➤ Dataset

-----------------------------------------------------

📝 About The Project

☁️ Overview

We present an algorithm for inferring clusters of orthologous and paralogous transcripts.

👨‍💻 Operating System

The program was developed and tested on Ubuntu 18.04.6 LTS.

⚒️ Requirements

  • python3 (at least Python 3.6)
  • NetworkX
  • Pandas
  • Numpy
  • ETE toolkit

-----------------------------------------------------

Inferring clusters of orthologous and paralogous transcripts

📦 About the package

Install the package:

pip3 install transcriptorthology

Import the package and call the main function:

from transcriptorthology.transcriptOrthology import inferring_transcripts_isoorthology

if __name__ == '__main__':
  gtot_path = './execution/mapping_gene_to_transcripts/ENSGT00390000000080.fasta'
  gt_path = './execution/NHX_trees/ENSGT00390000000080.nwk'
  lower_bound = 0.7
  transcripts_msa_path = './execution/transcripts_alignments/ENSGT00390000000080.alg'
  tsm_conditions = 2
  constraint = 1
  output_folder = './execution/output_folder'
  
  inferring_transcripts_isoorthology(transcripts_msa_path, gtot_path, gt_path, tsm_conditions, lower_bound, constraint, output_folder)

-----------------------------------------------------

🚀 Getting Started

Command

usage: transcriptOrthology.py [-h] -talg TRALIGNMENT
                              -gtot GENETOTRANSCRIPTS -nhxt NHXGENETREE
                              [-lowb LOWERBOUND] [-tsm TSMVALUE]
                              [-const CONSTRAINT] [-outf OUTPUTFOLDER]

program parameters

options:
  -h, --help            show this help message and exit
  -talg TRALIGNMENT, --tralignment TRALIGNMENT
                        Multiple Sequences Alignment of transcripts in FASTA
                        format
  -gtot GENETOTRANSCRIPTS, --genetotranscripts GENETOTRANSCRIPTS
                        mappings transcripts to corresponding genes
  -nhxt NHXGENETREE, --nhxgenetree NHXGENETREE
                        NHX gene tree
  -lowb LOWERBOUND, --lowerbound LOWERBOUND
                        a threshold for the selection of transcripts RBHs
  -tsm TSMVALUE, --tsmvalue TSMVALUE
                        an integer(1|2|3|4|5|6) that refers to the transcript
                        similarity measure
  -const CONSTRAINT, --constraint CONSTRAINT
                        an integer(0|1), constraint for the selection of
                        recent paralogs
  -outf OUTPUTFOLDER, --outputfolder OUTPUTFOLDER
                        the output folder to store the results

Details

| parameter | definition | value format |
|---|---|---|
| -talg, --tralignment | MSA of transcripts | FASTA format: >{id_transcript}\n{sequence} |
| -gtot, --genetotranscripts | mappings g(t) | FASTA format: >{id_transcript}:{id_gene}\n |
| -nhxt, --nhxgenetree | gene tree | NHX format |
| -lowb, --lowerbound | lower bound used to select transcript RBHs (default: 0.5) | float between 0 and 1 |
| -tsm, --tsmvalue | the similarity measure (mean, length, unitary) | integer: 1 (tsm+unitary), 2 (tsm+length), 3 (tsm+mean), 4 (tsm++unitary), 5 (tsm++length), 6 (tsm++mean) |
| -const, --constraint | constraint for the selection of recent paralogs | integer: 0 (not reciprocal), 1 (reciprocal) |
| -outf, --outputfolder | folder in which to save the results (default: the current program folder) | string |
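
The input formats listed above can be checked before launching a run. The following is a minimal sketch, not part of the package: the function names (`parse_gene_to_transcripts`, `parse_alignment`, `check_inputs`) and the toy identifiers are hypothetical, and the only assumptions are the header layouts `>{id_transcript}` and `>{id_transcript}:{id_gene}` given in the table.

```python
from typing import Dict, Iterable


def parse_gene_to_transcripts(lines: Iterable[str]) -> Dict[str, str]:
    """Map each transcript id to its gene id from '>{id_transcript}:{id_gene}' headers."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith(">"):
            continue  # skip blank lines and anything that is not a header
        transcript_id, sep, gene_id = line[1:].partition(":")
        if not sep or not transcript_id or not gene_id:
            raise ValueError(f"malformed mapping header: {line!r}")
        mapping[transcript_id] = gene_id
    return mapping


def parse_alignment(lines: Iterable[str]) -> Dict[str, str]:
    """Read a FASTA alignment into {id_transcript: aligned_sequence}."""
    sequences = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            current = line[1:]
            sequences[current] = ""
        elif line and current is not None:
            sequences[current] += line  # a sequence may span several lines
    return sequences


def check_inputs(alignment: Dict[str, str], mapping: Dict[str, str]) -> None:
    """Raise ValueError if the alignment and the mapping are inconsistent."""
    missing = sorted(set(alignment) - set(mapping))
    if missing:
        raise ValueError(f"transcripts without a gene: {missing}")
    if len({len(seq) for seq in alignment.values()}) > 1:
        raise ValueError("aligned sequences must all have the same length")


# Toy data with made-up identifiers (not taken from the dataset)
toy_mapping = parse_gene_to_transcripts([">t1:gA", ">t2:gA", ">t3:gB"])
toy_alignment = parse_alignment([">t1", "ATG-CC", ">t2", "ATGACC", ">t3", "ATG", "-CC"])
check_inputs(toy_alignment, toy_mapping)
```

In practice the same functions could be fed open file handles for the `-gtot` and `-talg` files instead of in-memory lists.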

Usage example

python3 ./scripts/transcriptOrthology.py -talg ./execution/inputs/transcripts_alignments/ENSGT00390000003967.alg -gtot ./execution/inputs/mapping_gene_to_transcripts/ENSGT00390000003967.fasta -nhxt ./execution/inputs/NHX_trees/ENSGT00390000003967.nhx -lowb 0.7 -outf ./execution/outputs/ -tsm 1 -const 1

OR

sh ./execution_inferring_clusters.sh
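
To launch the same command from Python (for instance, to loop over several gene families), the argument list can be assembled programmatically. This is a hedged sketch: `build_command` and `run` are hypothetical helpers, the script path is the one used in the example above, and the value ranges checked are those given in the parameter details.

```python
import subprocess
from typing import List


def build_command(talg: str, gtot: str, nhxt: str, lowb: float, tsm: int,
                  const: int, outf: str,
                  script: str = "./scripts/transcriptOrthology.py") -> List[str]:
    """Assemble the command line documented above, validating value ranges."""
    if not 0 <= lowb <= 1:
        raise ValueError("lowb must be a float between 0 and 1")
    if tsm not in (1, 2, 3, 4, 5, 6):
        raise ValueError("tsm must be an integer from 1 to 6")
    if const not in (0, 1):
        raise ValueError("const must be 0 or 1")
    return ["python3", script,
            "-talg", talg, "-gtot", gtot, "-nhxt", nhxt,
            "-lowb", str(lowb), "-tsm", str(tsm),
            "-const", str(const), "-outf", outf]


def run(command: List[str]) -> None:
    """Execute the command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)


cmd = build_command(
    talg="./execution/inputs/transcripts_alignments/ENSGT00390000003967.alg",
    gtot="./execution/inputs/mapping_gene_to_transcripts/ENSGT00390000003967.fasta",
    nhxt="./execution/inputs/NHX_trees/ENSGT00390000003967.nhx",
    lowb=0.7, tsm=1, const=1, outf="./execution/outputs/",
)
# run(cmd)  # uncomment to execute from the repository root
```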

Expected output

++++++++++++++++Starting ....
+++++++ All data were retrieved & the representation of subtranscribed sequences of genes into blocks are available.
+++++ Computing matrix ...       in progress
+++++ Computing matrix ...       status: Finished without errors in 0.42296433448791504 seconds
+++++ Searching for recent-paralogs ...         status: processing
+++++ Searching for recent-paralogs ...         status: finished in 0.11350250244140625 seconds
+++++ Searching for RBHs ...    status: processing
+++++ Searching for RBHs ...    status: finished in 0.09129834175109863 seconds
+++++ Construction of the orthology graph (Adding nodes ...) ...        status: processing
+++++ Construction of the orthology graph (Adding nodes ...) ...        status: finished in 0.524106502532959 seconds
+++++ Searching for connected components ...    status: processing
+++++ Searching for connected components ...    status: finished in 0.06076645851135254 seconds
++++++++++++++++Finished 
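
When running many gene families, it can be useful to collect the per-step durations from this log. The sketch below is an illustration only: `step_timings` is a hypothetical helper, and the regular expression assumes the log keeps exactly the line layout shown above.

```python
import re
from typing import Dict

# Matches lines such as:
# "+++++ Searching for RBHs ...    status: finished in 0.09 seconds"
STEP_DONE = re.compile(
    r"^\+{5} (.+) \.\.\.\s+status: [Ff]inished.*? in ([0-9.]+) seconds$"
)


def step_timings(log_text: str) -> Dict[str, float]:
    """Return {step name: duration in seconds} for every finished step."""
    timings = {}
    for line in log_text.splitlines():
        match = STEP_DONE.match(line.strip())
        if match:
            timings[match.group(1)] = float(match.group(2))
    return timings


# Excerpt of the log shown above
sample_log = """\
++++++++++++++++Starting ....
+++++ Computing matrix ...       in progress
+++++ Computing matrix ...       status: Finished without errors in 0.42296433448791504 seconds
+++++ Searching for RBHs ...    status: processing
+++++ Searching for RBHs ...    status: finished in 0.09129834175109863 seconds
+++++ Construction of the orthology graph (Adding nodes ...) ...        status: finished in 0.524106502532959 seconds
++++++++++++++++Finished
"""

timings = step_timings(sample_log)
```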

-----------------------------------------------------

📁 Project Files Description

⌨️ Inputs description

Inputs files

  • 1️⃣ tsmcomputing() ➡️ returns the similarity matrix (tsm+ | tsm) scores depending on the `tsmvalue` for all pairs of homologous transcripts.
    usage: tsmComputing.py [-h] [-talg TRALIGNMENT]
                           [-gtot GENETOTRANSCRIPTS] [-tsm TSMVALUE]
                           [-outf OUTPUTFOLDER]
    

    parsor program parameter

    optional arguments:
      -h, --help            show this help message and exit
      -talg TRALIGNMENT, --tralignment TRALIGNMENT
      -gtot GENETOTRANSCRIPTS, --genetotranscripts GENETOTRANSCRIPTS
      -tsm TSMVALUE, --tsmvalue TSMVALUE
      -outf OUTPUTFOLDER, --outputfolder OUTPUTFOLDER

  • 2️⃣ Tclustering() ➡️ returns the orthology graph of transcripts.
    usage: Tclustering.py [-h] [-m MATRIX] [-gtot GENETOTRANSCRIPTS]
                          [-nhxt NHXGENETREE] [-lowb LOWERBOUND]
                          [-outf OUTPUTFOLDER]
    

    parsor program parameter

    optional arguments:
      -h, --help            show this help message and exit
      -m MATRIX, --matrix MATRIX
      -gtot GENETOTRANSCRIPTS, --genetotranscripts GENETOTRANSCRIPTS
      -nhxt NHXGENETREE, --nhxgenetree NHXGENETREE
      -lowb LOWERBOUND, --lowerbound LOWERBOUND
      -const CONSTRAINT, --constraint CONSTRAINT
      -outf OUTPUTFOLDER, --outputfolder OUTPUTFOLDER

  • 3️⃣ transcriptOrthology() ➡️ returns, for each pair of homologous transcripts, their homology relationship type (recent-paralogs, ortho-paralogs or ortho-orthologs).
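
To give an intuition for the clustering step, the sketch below applies a plain Reciprocal Best Hits rule with a lower bound to a toy similarity matrix, then groups transcripts by connected components. It is a simplified illustration and not the package's implementation: it ignores the gene tree, the recent-paralog step and the `constraint` option, and all names and scores are made up.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple


def reciprocal_best_hits(gene_of: Dict[str, str],
                         sim: Dict[FrozenSet[str], float],
                         lower_bound: float) -> Set[Tuple[str, str]]:
    """Pairs of transcripts from different genes that are each other's best hit."""
    def score(a: str, b: str) -> float:
        return sim.get(frozenset((a, b)), 0.0)

    def best_hit(t: str, gene: str) -> str:
        # Ties are broken by taking the first candidate in sorted order.
        candidates = sorted(u for u, g in gene_of.items() if g == gene)
        return max(candidates, key=lambda u: score(t, u))

    pairs = set()
    for a, b in combinations(sorted(gene_of), 2):
        if gene_of[a] == gene_of[b] or score(a, b) < lower_bound:
            continue
        if best_hit(a, gene_of[b]) == b and best_hit(b, gene_of[a]) == a:
            pairs.add((a, b))
    return pairs


def connected_components(nodes, edges) -> List[Set[str]]:
    """Group nodes linked by edges, using an iterative depth-first search."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy example: two genes with two transcripts each (hypothetical scores)
gene_of = {"a1": "gA", "a2": "gA", "b1": "gB", "b2": "gB"}
sim = {frozenset(("a1", "b1")): 0.9, frozenset(("a1", "b2")): 0.4,
       frozenset(("a2", "b1")): 0.5, frozenset(("a2", "b2")): 0.6}

rbh = reciprocal_best_hits(gene_of, sim, lower_bound=0.7)
clusters = connected_components(gene_of, rbh)
```

With a lower bound of 0.7 only the pair (a1, b1) is kept, so a2 and b2 remain singletons; lowering the bound to 0.5 also links a2 and b2. This shows how the `-lowb` threshold trades cluster size against stringency.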

💽 Outputs description

Outputs files

  • 1️⃣ matrix.csv : similarity matrix giving the tsm+ score for each pair of homologous transcripts.
  • 2️⃣ blocks_transcripts.csv|blocks_genes : CSV files describing the block representation of each transcript (resp. gene).
  • 3️⃣ start_orthology_graph.pdf|end_orthology_graph.pdf : orthology graph at the start (resp. end) of the algorithm, showing only the pairwise relationships between recent-paralogs (resp. all the orthologous clusters). ⚠️ Only generated if the number of transcripts does not exceed 20.
  • 4️⃣ orthologs.csv : CSV file summarizing the results of the isoorthology clustering.

✔️ Dataset

The data folder contains the datasets used in the study, together with the results obtained.

Copyright © 2023 CoBIUS LAB
