The emergence of eukaryotes as an evolutionary algorithmic phase transition

This repository contains the data and code needed to reproduce the results reported in our paper. We also describe how to obtain the annotations from public repositories used in this work.

This README.md guides you through the repository, which is structured as follows:

  • main_tables contains the tables needed to reproduce the main figures.
    - suppl_tables contains the tables for the supplementary material.
    - suppl_tables__extra contains some extra data that can be helpful (e.g. taxonomy ids).

  • main_work contains the code needed to reproduce the main results.
    - suppl_work contains the code for the supplementary material.
    - suppl_work__extra contains some extra code that complements the supplementary material.

  • gl_lib contains libraries of code used in this repository.


Data: the annotations were downloaded from public repositories

Proteins

The reference proteomes were downloaded from the Universal Protein Resource (Uniprot). Each proteome has a unique Uniprot identifier (UPID). A description of the proteomes is also provided: a table with information on every proteome (UPIDs, taxonomy_ids, species names, etc.). All the reference proteomes were downloaded from the Uniprot FTP repository on 28.5.2021. Note that viruses were not downloaded and that Uniprot is updated regularly, every eight weeks.

Then, for each species a fasta file containing its reference proteome was downloaded, preserving the directory structure of the repository. For instance, for Homo sapiens (UPID: UP000005640 and taxonomy id:9606):

UP000005640_9606.fasta.gz @
our_mnt_dir + /data/compressed/ + "ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/"

our_mnt_dir is the local directory where the data were downloaded.
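The path convention above can be sketched in Python. This is a minimal illustration, not code from this repository: the function name `proteome_path` and the example mount point `/mnt/our_mnt_dir` are hypothetical, and only the directory layout is taken from the Homo sapiens example.

```python
from pathlib import Path

# Remote directory mirrored locally (taken from the example above).
UNIPROT_REL = (
    "ftp.uniprot.org/pub/databases/uniprot/current_release/"
    "knowledgebase/reference_proteomes"
)

def proteome_path(mnt_dir, domain, upid, tax_id):
    """Return the expected local path of a compressed reference proteome.

    Hypothetical helper: it only mirrors the directory layout described above.
    """
    file_name = f"{upid}_{tax_id}.fasta.gz"
    return Path(mnt_dir) / "data" / "compressed" / UNIPROT_REL / domain / upid / file_name

# "/mnt/our_mnt_dir" is a placeholder for the local download directory.
path = proteome_path("/mnt/our_mnt_dir", "Eukaryota", "UP000005640", 9606)
print(path.as_posix())
```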

Protein coding genes

The protein coding gene annotations were obtained from different webservers provided by Ensembl: prokaryotes (Archaea, Bacteria), protists, plants, Fungi, invertebrates, vertebrates. This grouping of organisms follows the divisions established by Ensembl.

| Ensembl FTP site by kingdom/division | Release |
|---|---|
| Archaea, Bacteria | ensemblgenomes 49 |
| protists | ensemblgenomes 49 |
| plants | ensemblgenomes 49 |
| Fungi | ensemblgenomes 49 |
| invertebrates | ensemblgenomes 49 |
| vertebrates (Vertebrata) | ensembl 98 |

The gzip compressed *.gtf.gz (General Transfer Format) gene annotation files were downloaded for the different species preserving the structure of the directories (FTP Ensembl repositories). For instance, for Homo sapiens:

Homo_sapiens.GRCh38.98.gtf.gz @
our_mnt_dir + /data/compressed/ + "ftp.ensembl.org/pub/release-98/gtf/homo_sapiens/"

our_mnt_dir is, as above, the local directory where all the data were downloaded.
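As an illustration of how gene lengths could be read from such a GTF file, here is a minimal sketch. It assumes that a gene's length is its genomic span (end - start + 1, since GTF coordinates are 1-based and inclusive) and that protein coding genes are the `gene` records whose `gene_biotype` attribute is `protein_coding`. The notebooks in this repository may compute lengths differently, and the function names and the toy records are hypothetical.

```python
import gzip
import re

def gene_lengths(gtf_lines):
    """Yield (gene_id, length) for protein coding gene records in GTF lines."""
    for line in gtf_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        # The attribute column holds pairs of the form: key "value";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        if attrs.get("gene_biotype") != "protein_coding":
            continue
        start, end = int(fields[3]), int(fields[4])
        yield attrs.get("gene_id"), end - start + 1

def gene_lengths_from_file(path):
    """Read a gzip compressed GTF file and return a list of (gene_id, length)."""
    with gzip.open(path, "rt") as handle:
        return list(gene_lengths(handle))

# Toy records for illustration only (not real annotations).
toy = [
    "#!genome-build toy\n",
    '1\tensembl\tgene\t100\t599\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n',
    '1\tensembl\tgene\t700\t799\t.\t-\t.\tgene_id "G2"; gene_biotype "lncRNA";\n',
    '1\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n',
]
toy_lengths = list(gene_lengths(toy))
print(toy_lengths)
```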

Taxonomy ids of the different species annotated in Ensembl

The taxonomy id of each species was downloaded from the corresponding Ensembl release for each division: Archaea, Bacteria, protists, plants, Fungi, invertebrates, vertebrates.

Genome quality

The genome assembly data were downloaded on 20.6.2022 directly from the NCBI genome reports.


The lengths of protein coding genes and proteins

The lengths of all protein coding genes and proteins, for every species used, can be accessed from our server:
https://genford.uv.es:5001/sharing/P79EcUfhE


main_tables

The files for protein coding genes, proteins, and the intersection set between them (merged) are provided in standard tab-separated values (*.tsv):

  • stat_protCodGenes.tsv (header line + 33627 entries).
  • stat_proteins.tsv (header line + 9913 entries).
  • stat_merged.tsv (header line + 6519 entries).

A further file, for the merged set, gives the mean gene length vs. rho (the fraction of nCDS within the protein coding genes). Its entries are ordered by ascending mean gene length:

  • rho_vs_gene.dat (three header lines + 6519 entries).

Number of entries per taxonomical division:

stat_protCodGenes.tsv (header line + 33627 entries):

| counts | regnum |
|---|---|
| 31943 | prokaryotes* |
| 237 | protists |
| 96 | plants |
| 1014 | Fungi |
| 115 | invertebrates |
| 222 | vertebrates |
| 33627 | entries in total |

*30714 Bacteria and 1229 Archaea.

stat_proteins.tsv (header line + 9913 entries):

| counts | domain |
|---|---|
| 330 | Archaea |
| 7997 | Bacteria |
| 1586 | Eukaryota* |
| 9913 | entries in total |

*In the annotations from Uniprot, Eukaryota includes: protists (156), plants (184), Fungi (772), invertebrates (226), and vertebrates (248). The 1586 Eukaryotes were classified using the hierarchical taxonomic classification (downloaded on 19.11.2021) provided by Uniprot and based on the NCBI taxonomy database (see Lineage).

stat_merged.tsv (header line + 6519 entries):

| counts | regnum |
|---|---|
| 5468 | Bacteria |
| 227 | Archaea |
| 91 | protists |
| 59 | plants |
| 533 | Fungi |
| 49 | invertebrates |
| 92 | vertebrates |
| 6519 | entries in total |
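The per-division counts listed above can be cross-checked with a few lines of Python. The sketch below is only an illustration: the helper `count_by_column` is hypothetical, and the column name `regnum` is a guess based on the header of the counts listing; the actual column names in the *.tsv files may differ.

```python
import csv
import io
from collections import Counter

# Counts stated in this README for stat_merged.tsv.
EXPECTED_MERGED = {
    "Bacteria": 5468, "Archaea": 227, "protists": 91, "plants": 59,
    "Fungi": 533, "invertebrates": 49, "vertebrates": 92,
}

def count_by_column(handle, column):
    """Count the rows of a tab-separated table per value of one column."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[column] for row in reader)

# The stated per-division counts add up to the stated total.
merged_total = sum(EXPECTED_MERGED.values())
print(merged_total)

# Toy in-memory table for illustration; "regnum" is an assumed column name.
toy = io.StringIO("species\tregnum\nsp_a\tFungi\nsp_b\tFungi\nsp_c\tplants\n")
toy_counts = count_by_column(toy, "regnum")
print(dict(toy_counts))
```

With the real file, the same helper could be called as `count_by_column(open("stat_merged.tsv"), "regnum")` and compared against `EXPECTED_MERGED`, provided the column name matches.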

suppl_tables

  • stat_protCodGenes_ncbiGenomeAssemblyStatus.tsv. Assembly status of the genomes associated with the Ensembl protein coding gene entries. The file consists of one header line and 33627 entries (rows) with 3 columns: species, ensembl_assembly_accession, assembly_status.

  • gene_length_vs_divergence_time.tsv. Average of $\langle L \rangle$ and $\langle \log L \rangle$ of each group of organisms and their divergence time (Mya) obtained from Timetree.

  • protCodGenes_averageL_perGoOrg.txt. Groups of organisms with at least 20 species (to compare with proteomes) and the average $\langle L \rangle$ of each group in base pairs.

  • proteins_averageLp_perGoOrg.txt. Groups of organisms with at least 20 species (to compare with proteomes) and the average $\langle L_{p} \rangle$ of each group in amino acids.

suppl_tables__extra

  • species_Ensembl.tsv. The file contains the taxonomy ids of the different species annotated in Ensembl, see above. The files for the different divisions were concatenated into species_Ensembl.tsv, keeping only the first header. Finally, the file was slimmed down by reducing its columns to species, species name, and taxonomy_id.

  • 480lognormal.dat. Initial seed for the gene growth model: 5000 genes, lognormally distributed (mean=480).

  • Homo_sapiens_CDS_nCDS.xlsx. Data needed to compare the length frequency distribution for coding (CDS) and non-coding (nCDS) genetic sequences, see Extended Data Fig. 8.


main_work

suppl_work

  • mean_vs_time.ipynb: for each group of organisms, the average of the mean gene lengths is plotted against the group's divergence time from LUCA, and likewise the average of the mean of the logarithm of the gene lengths. That is, $\overline{\langle L \rangle}$ (nt) and $\overline{\langle \log L \rangle}$ (nt) vs. divergence time from LUCA (My). See Extended Data Fig. 3.

  • protCodGenes__2nd_order_momentum.ipynb: the observed generalized Taylor law for the protein coding gene length distributions of the different genomes: $(\sigma_{g}^{2} + \langle L \rangle^{2})$ vs. $\langle L \rangle$ in $\log_{10}$ representation, where $\sigma_{g}^{2} + \langle L \rangle^{2} = \langle L^{2} \rangle$ is the second-order moment. See Extended Data Fig. 4, which complements the main Fig. 2.

  • proteins__2nd_order_momentum.ipynb: the same for proteins. That is, the observed generalized Taylor law for the protein length distributions of the different species: $(\sigma_{p}^{2} + \langle L_{p} \rangle^{2})$ vs. $\langle L_{p} \rangle$ in $\log_{10}$ representation, where $\sigma_{p}^{2} + \langle L_{p} \rangle^{2} = \langle L_{p}^{2} \rangle$ is the second-order moment. See Extended Data Fig. 4, which complements the main Fig. 2.

  • meanOfLn_lnOfMean.ipynb: for the gene length distributions of the different genomes, compares the mean of the log of the lengths, $\langle \log L \rangle$, with the log of the mean of the lengths, $\log \langle L \rangle$. It corresponds to Extended Data Fig. 5.

  • average_mean_lengths__order.ipynb. Extended Data Fig. 6a: order of the average mean gene lengths for the different groups of organisms. Extended Data Fig 6b: same kind of representation for the average protein lengths.

  • meanL_distribution__perGofOrg.ipynb: distribution of the mean gene lengths (Fungi; 1014 genomes). It corresponds to the Extended Data Fig. 9a. Note that the Extended Data Fig. 9b was calculated using code from the main_work section, see relation_proteins_protCodGenes_lengths.ipynb.

  • reliability_fit.ipynb: calculates the log-likelihood of the fits of the different distributions. See Extended Data Fig. 2.

  • entropy.f: calculates the entropy of the allowed states. See Extended Data Fig. 10.
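The quantities named in the notebooks above (the second-order moment, and the mean of the log versus the log of the mean) can be illustrated on a toy list of lengths. This is a minimal sketch using only the Python standard library. It is not the code of the notebooks: the function `length_moments` and the toy lengths are hypothetical, and it assumes the variance is the population variance of one genome's length distribution.

```python
import math
import statistics

def length_moments(lengths):
    """Return summary statistics of one length distribution."""
    mean = statistics.fmean(lengths)
    variance = statistics.pvariance(lengths, mu=mean)  # population variance
    second_moment = statistics.fmean(x * x for x in lengths)
    mean_log = statistics.fmean(math.log10(x) for x in lengths)
    return {
        "mean": mean,
        "variance": variance,
        "second_moment": second_moment,
        "mean_log10": mean_log,
        "log10_mean": math.log10(mean),
    }

# Toy lengths in nt, for illustration only.
stats = length_moments([300, 900, 1500, 4500])

# Identity behind the Taylor-law axis: sigma^2 + <L>^2 = <L^2>.
print(stats["variance"] + stats["mean"] ** 2, stats["second_moment"])

# By Jensen's inequality, <log L> never exceeds log <L>.
print(stats["mean_log10"], stats["log10_mean"])
```

Plotting `second_moment` against `mean` on log-log axes, one point per genome, would give the kind of representation described for Extended Data Fig. 4.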

suppl_work__extra

  • gene_growth_simulator.f: an example simulator of gene growth driven by a multiplicative stochastic factor.
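To give an idea of what growth by a multiplicative stochastic factor means, here is a minimal Python sketch. It is not a translation of gene_growth_simulator.f. Everything beyond "5000 genes, lognormally distributed, mean 480" is an assumption made for illustration: the log-scale width `sigma`, the reading of "mean" as the arithmetic mean, the Gaussian form of the factor, and the `drift`, `spread` and step-count values.

```python
import math
import random

def lognormal_seed(n_genes, mean, sigma, rng):
    """Draw n_genes lengths from a lognormal with the given arithmetic mean."""
    # For a lognormal, E[L] = exp(mu + sigma^2 / 2), so solve for mu.
    mu = math.log(mean) - sigma ** 2 / 2.0
    return [rng.lognormvariate(mu, sigma) for _ in range(n_genes)]

def grow(lengths, n_steps, drift, spread, rng):
    """Apply n_steps of multiplicative growth: L <- L * exp(eps), eps ~ N(drift, spread)."""
    current = list(lengths)
    for _ in range(n_steps):
        current = [x * math.exp(rng.gauss(drift, spread)) for x in current]
    return current

def mean_log(values):
    """Mean of the natural logarithm of the values."""
    return sum(math.log(v) for v in values) / len(values)

rng = random.Random(0)  # fixed seed for reproducibility
seed_lengths = lognormal_seed(5000, 480.0, 0.5, rng)      # sigma=0.5 is assumed
grown = grow(seed_lengths, 100, 0.01, 0.05, rng)           # assumed parameters

print(sum(seed_lengths) / len(seed_lengths))
print(mean_log(grown) - mean_log(seed_lengths))
```

Because each step multiplies the length by a positive random factor, the logarithm of the length performs a random walk, so an initially lognormal distribution stays lognormal while its mean log length drifts by about `n_steps * drift`.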