This repository contains the data and code needed to reproduce the results reported in our paper. We also describe how to obtain the annotations from public repositories used in this work.
README.md guides you through this repository, which is structured as follows:
- main_tables contains the data needed to reproduce the main figures.
- suppl_tables contains the data for the supplementary material.
- suppl_tables__extra contains some extra data that can be helpful (e.g. taxonomy ids).
- main_work contains the code needed to reproduce the main results.
- suppl_work contains the code for the supplementary material.
- suppl_work__extra contains some extra code that complements the supplementary material.
- gl_lib contains libraries of code used in this repository.
The reference proteomes were downloaded from the Universal Protein Resource (UniProt). Each proteome has a unique UniProt identifier (UPID). A description of the proteomes is also provided as a table with information on every proteome: UPIDs, taxonomy_ids, species names, etc. All the reference proteomes were downloaded from the UniProt FTP repository on 28.5.2021. Note that viruses were not downloaded and that UniProt is updated regularly, every eight weeks.
Then, for each species a fasta file containing its reference proteome was downloaded, preserving the directory structure of the repository. For instance, for Homo sapiens (UPID: UP000005640 and taxonomy id:9606):
UP000005640_9606.fasta.gz @
our_mnt_dir + /data/compressed/ + "ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/"
our_mnt_dir is the local directory where the data were downloaded.
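The path convention above can be sketched in a few lines of Python. This is only an illustration of the layout described here, not code from the repository: the function name `proteome_local_path` and the mount point `/mnt/example` are hypothetical.

```python
from pathlib import PurePosixPath

# FTP directory of the reference proteomes, as quoted in this README.
UNIPROT_FTP_DIR = (
    "ftp.uniprot.org/pub/databases/uniprot/current_release/"
    "knowledgebase/reference_proteomes"
)


def proteome_local_path(our_mnt_dir, domain, upid, taxonomy_id):
    """Return the local path of a proteome, mirroring the FTP directory layout.

    Hypothetical helper: our_mnt_dir is the local download directory.
    """
    file_name = f"{upid}_{taxonomy_id}.fasta.gz"
    return PurePosixPath(
        our_mnt_dir, "data", "compressed", UNIPROT_FTP_DIR, domain, upid, file_name
    )


# Homo sapiens: UPID UP000005640, taxonomy id 9606.
path = proteome_local_path("/mnt/example", "Eukaryota", "UP000005640", 9606)
print(path)
```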
The protein coding gene annotations were obtained from different webservers provided by Ensembl: prokaryotes (Archaea, Bacteria), protists, plants, Fungi, invertebrates, vertebrates. The categorization in groups of organisms is well established by Ensembl.
Ensembl ftp site by Kingdom/division | Release |
---|---|
Archaea, Bacteria | ensemblgenomes 49 |
protists | ensemblgenomes 49 |
plants | ensemblgenomes 49 |
Fungi | ensemblgenomes 49 |
invertebrates | ensemblgenomes 49 |
vertebrates (Vertebrata) | ensembl 98 |
The gzip compressed *.gtf.gz (General Transfer Format) gene annotation files were downloaded for the different species preserving the structure of the directories (FTP Ensembl repositories). For instance, for Homo sapiens:
Homo_sapiens.GRCh38.98.gtf.gz @
our_mnt_dir + data/compressed/ + "ftp.ensembl.org/pub/release-98/gtf/homo_sapiens/"
our_mnt_dir is, as above, the local directory where all the data were downloaded.
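As a rough sketch of how gene lengths can be read from such a GTF file, the snippet below keeps the `gene` features whose `gene_biotype` attribute is `protein_coding` and takes the length as `end - start + 1` (GTF coordinates are 1-based and inclusive). This is an assumption about a reasonable procedure, not necessarily the one used in the notebooks of this repository; the function name and the sample records are hypothetical.

```python
import re


def protein_coding_gene_lengths(gtf_lines):
    """Map gene_id -> length in bp for protein coding 'gene' features."""
    lengths = {}
    for line in gtf_lines:
        # Skip header comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) != 9 or fields[2] != "gene":
            continue
        # Column 9 holds attributes written as: key "value";
        attributes = dict(re.findall(r'(\S+) "([^"]*)";', fields[8]))
        if attributes.get("gene_biotype") != "protein_coding":
            continue
        start, end = int(fields[3]), int(fields[4])
        lengths[attributes["gene_id"]] = end - start + 1
    return lengths


# Made-up records, for illustration only.
sample = [
    "#!genome-build example\n",
    '1\tensembl\tgene\t1000\t1999\t.\t+\t.\tgene_id "GENE_A"; gene_biotype "protein_coding";\n',
    '1\tensembl\texon\t1000\t1200\t.\t+\t.\tgene_id "GENE_A"; gene_biotype "protein_coding";\n',
    '1\tensembl\tgene\t5000\t5099\t.\t-\t.\tgene_id "GENE_B"; gene_biotype "lncRNA";\n',
]
print(protein_coding_gene_lengths(sample))
```

For a real file, the lines could be supplied with `gzip.open(path, "rt")` from the Python standard library.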
The taxonomy id of each species was downloaded from the corresponding Ensembl release for each division: Archaea, Bacteria, protists, plants, Fungi, invertebrates, vertebrates.
The NCBI Genome data were downloaded (20.6.2022) directly from the NCBI genome reports.
The lengths of the protein coding genes and proteins of all the species used can be accessed from our server:
https://genford.uv.es:5001/sharing/P79EcUfhE
The files for protein coding genes, proteins, and the intersection set between them (merged) are provided in standard tab-separated values (*.tsv):
- stat_protCodGenes.tsv (header line + 33627 entries).
- stat_proteins.tsv (header line + 9913 entries).
- stat_merged.tsv (header line + 6519 entries).
A further file for the merged set gives the mean gene length vs. rho (the fraction of nCDS within the protein coding genes). The entries are ordered by ascending mean gene length:
- rho_vs_gene.dat (three header lines + 6519 entries).
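A minimal sketch of how one of the `*.tsv` files could be loaded and its number of entries checked against the counts listed above. It uses only the Python standard library and makes no assumption about the actual column names; the helper `read_tsv` and the sample columns are hypothetical.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated file with one header line.

    Returns (column_names, rows), where each row is a dict keyed by column name.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    return reader.fieldnames, rows


# Made-up content standing in for a real file.
sample = io.StringIO("species\tmean_length\nspecies_a\t1200.5\nspecies_b\t980.0\n")
columns, rows = read_tsv(sample)
print(columns, len(rows))
```

With the downloaded file, `open("stat_merged.tsv", newline="")` would replace the in-memory sample, and `len(rows)` should then equal 6519 according to the list above.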
stat_protCodGenes.tsv (header line + 33627 entries):
counts | regnum |
---|---|
31943 | prokaryotes* |
237 | protists |
96 | plants |
1014 | Fungi |
115 | invertebrates |
222 | vertebrates |
33627 entries in total |
*30714 Bacteria and 1229 Archaea.
stat_proteins.tsv (header line + 9913 entries):
counts | domain |
---|---|
330 | Archaea |
7997 | Bacteria |
1586 | Eukaryota* |
9913 entries in total |
*In the annotations from UniProt, Eukaryota includes: protists (156), plants (184), Fungi (772), invertebrates (226), and vertebrates (248). The 1586 eukaryotes were classified using the hierarchical taxonomic classification (downloaded on 19.11.2021) provided by UniProt and based on the NCBI taxonomy database (see Lineage).
stat_merged.tsv (header line + 6519 entries):
counts | regnum |
---|---|
5468 | Bacteria |
227 | Archaea |
91 | protists |
59 | plants |
533 | Fungi |
49 | invertebrates |
92 | vertebrates |
6519 entries in total |
- stat_protCodGenes_ncbiGenomeAssemblyStatus.tsv. Assembly status of the genomes associated with the Ensembl protein coding gene entries. The file consists of one header line and 33627 entries (rows) with 3 columns: species, ensembl_assembly_accession, assembly_status.
- gene_length_vs_divergence_time.tsv. Average of $\langle L \rangle$ and $\langle \log L \rangle$ of each group of organisms and their divergence time (Mya) obtained from Timetree.
- protCodGenes_averageL_perGoOrg.txt. Groups of organisms with at least 20 species (to compare with proteomes) and the average $\langle L \rangle$ of each group in base pairs.
- proteins_averageLp_perGoOrg.txt. Groups of organisms with at least 20 species (to compare with proteomes) and the average $\langle L_{p} \rangle$ of each group in amino acids.
- species_Ensembl.tsv. The file contains the taxonomy ids of the different species annotated in Ensembl, see above. The files for the different divisions were concatenated into species_Ensembl.tsv, keeping only the first header. Finally, the file was slimmed down by reducing its columns to species, species name and taxonomy_id.
- 480lognormal.dat. Initial seed for the gene growth model: 5000 genes, lognormally distributed (mean=480).
- Homo_sapiens_CDS_nCDS.xlsx. Data needed to compare the length frequency distribution for coding (CDS) and non-coding (nCDS) genetic sequences, see Extended Data Fig. 8.
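For orientation, a seed like the one in 480lognormal.dat (5000 lognormally distributed lengths with mean 480) could be generated along the following lines. This is a sketch under explicit assumptions, not the procedure used to build the file: it reads "mean=480" as the arithmetic mean of the lengths, and the shape parameter `sigma=0.5` is a hypothetical value, since the README does not state it. For a lognormal distribution the mean is $\exp(\mu + \sigma^{2}/2)$, so $\mu = \ln(480) - \sigma^{2}/2$.

```python
import math
import random


def lognormal_seed(n_genes=5000, mean_length=480.0, sigma=0.5, seed=0):
    """Draw n_genes lognormal lengths whose expected mean is mean_length.

    sigma (std of the log-lengths) is a hypothetical choice, not from the README.
    """
    rng = random.Random(seed)
    # Choose mu so that exp(mu + sigma^2 / 2) equals the requested mean.
    mu = math.log(mean_length) - 0.5 * sigma ** 2
    return [rng.lognormvariate(mu, sigma) for _ in range(n_genes)]


seed_lengths = lognormal_seed()
print(len(seed_lengths), sum(seed_lengths) / len(seed_lengths))
```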
- protCodGenes_lognormDist.ipynb and proteins_lognormDist.ipynb: the distributions of the lengths of the protein coding genes (genes hereafter) and proteins, respectively. See Fig. 1, also Extended Data Figs. 1 and 7.
- protCodGenes_taylorLaw.ipynb and proteins_taylorLaw.ipynb: the observed Taylor law in the distributions of the lengths of genes and proteins (variance vs. mean in $\log_{10}$ representation) for the different species. See Fig. 2 and Extended Data Fig. 4.
- relation_proteins_protCodGenes_lengths.ipynb: threshold in the relationship between the mean gene length and the mean protein length for the different species. See Fig. 3 and Extended Data Fig. 9.
- rho_nCDS_within_protCodGenes_lengths.ipynb: second-order phase transition in the fraction ($\rho$) of non-coding sequences within protein coding genes, with the mean gene length as control parameter. See Fig. 4.
- allowed_states.f: calculates the allowed states of Fig. 4.
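The Taylor law listed above relates variance and mean as a power law, which becomes a straight line in the $\log_{10}$ representation. A minimal sketch of such a fit by ordinary least squares follows. The function name and the synthetic input are hypothetical, and the notebooks may use a different fitting procedure.

```python
import math


def fit_taylor_law(means, variances):
    """Least-squares fit of log10(variance) = intercept + slope * log10(mean)."""
    xs = [math.log10(m) for m in means]
    ys = [math.log10(v) for v in variances]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Synthetic per-species values obeying variance = 2 * mean^2 exactly.
example_means = [500.0, 1000.0, 5000.0, 20000.0]
example_variances = [2.0 * m ** 2 for m in example_means]
slope, intercept = fit_taylor_law(example_means, example_variances)
print(slope, intercept)
```

For this synthetic input the fitted slope recovers the exponent 2 and the intercept recovers $\log_{10} 2$, up to floating-point error.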
- mean_vs_time.ipynb: for each group of organisms, the average of the mean gene lengths is plotted against the divergence time from LUCA, and likewise the average of the mean of the logarithm of the gene lengths. That is, $\overline{\langle L \rangle}$ (nt) and $\overline{\langle \log L \rangle}$ (nt) vs. divergence time from LUCA (My). See Extended Data Fig. 3.
- protCodGenes__2nd_order_momentum.ipynb: the observed generalized Taylor law for the protein coding gene length distributions of the different genomes: $(\sigma_{g}^{2} + \langle L \rangle^{2})$ vs. $\langle L \rangle$ in $\log_{10}$ representation, where $\sigma_{g}^{2} + \langle L \rangle^{2} = \langle L^{2} \rangle$ is the second-order moment. See Extended Data Fig. 4, which complements the main Fig. 2.
- proteins__2nd_order_momentum.ipynb: the same for proteins. That is, the observed generalized Taylor law for the protein length distributions of the different species: $(\sigma_{p}^{2} + \langle L_{p} \rangle^{2})$ vs. $\langle L_{p} \rangle$ in $\log_{10}$ representation, where $\sigma_{p}^{2} + \langle L_{p} \rangle^{2} = \langle L_{p}^{2} \rangle$ is the second-order moment. See Extended Data Fig. 4, which complements the main Fig. 2.
- meanOfLn_lnOfMean.ipynb: for the gene length distributions of the different genomes, compares the mean of the log of the lengths, $\langle \log L \rangle$, with the log of the mean of the lengths, $\log \langle L \rangle$. See Extended Data Fig. 5.
- average_mean_lengths__order.ipynb: Extended Data Fig. 6a shows the order of the average mean gene lengths for the different groups of organisms; Extended Data Fig. 6b shows the same kind of representation for the average protein lengths.
- meanL_distribution__perGofOrg.ipynb: distribution of the mean gene lengths (Fungi; 1014 genomes). It corresponds to Extended Data Fig. 9a. Note that Extended Data Fig. 9b was calculated using code from the main_work section, see relation_proteins_protCodGenes_lengths.ipynb.
- reliability_fit.ipynb: calculates the log-likelihood of the fits to the different distributions. See Extended Data Fig. 2.
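To illustrate what a log-likelihood of a fit means for a set of lengths, the sketch below fits a lognormal distribution by maximum likelihood and evaluates its log-likelihood. It is a generic textbook calculation under the assumption of a lognormal model, not a transcription of reliability_fit.ipynb; the function name and the input values are hypothetical.

```python
import math


def lognormal_log_likelihood(lengths):
    """Fit a lognormal by maximum likelihood and return (mu, sigma, log_likelihood).

    mu and sigma are the mean and standard deviation of the natural log-lengths.
    """
    logs = [math.log(x) for x in lengths]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n  # maximum-likelihood variance
    # ln f(x) = -ln x - 0.5 * ln(2 * pi * var) - (ln x - mu)^2 / (2 * var)
    log_lik = sum(
        -v - 0.5 * math.log(2.0 * math.pi * var) - (v - mu) ** 2 / (2.0 * var)
        for v in logs
    )
    return mu, math.sqrt(var), log_lik


mu, sigma, log_lik = lognormal_log_likelihood([100.0, 200.0, 400.0])
print(mu, sigma, log_lik)
```

Comparing this value with the log-likelihood obtained for alternative candidate distributions on the same data is one common way to judge which model fits better.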
- entropy.f: calculates the entropy of the allowed states. See Extended Data Fig. 10.
- gene_growth_simulator.f: example of a gene growth simulator using a multiplicative stochastic factor.
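The idea of growth by a multiplicative stochastic factor can be sketched in a few lines of Python. This is not a port of gene_growth_simulator.f: the lognormal choice for the factor, the value `sigma=0.1`, the number of steps and the function name are all hypothetical. At every step each length is multiplied by a random positive factor, so the log-lengths perform a random walk whose spread grows as `sigma * sqrt(steps)`.

```python
import math
import random


def grow_genes(lengths, steps, sigma=0.1, seed=0):
    """Multiply every length by an independent random factor at each step.

    The factor is lognormal with median 1 (hypothetical choice for illustration).
    """
    rng = random.Random(seed)
    current = list(lengths)
    for _ in range(steps):
        current = [length * rng.lognormvariate(0.0, sigma) for length in current]
    return current


# Start from 1000 identical genes of length 480 and apply 50 growth steps.
grown = grow_genes([480.0] * 1000, steps=50)
log_lengths = [math.log(x) for x in grown]
mean_log = sum(log_lengths) / len(log_lengths)
std_log = math.sqrt(sum((v - mean_log) ** 2 for v in log_lengths) / len(log_lengths))
print(std_log)
```

Because the logarithm of a product is a sum of logarithms, repeated multiplicative steps drive the length distribution toward a lognormal shape, even when all genes start with the same length.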