This notebook contains the details of the tables used for the ontologies in the paper. The ontologies are based on those available from __[URGI](https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Annotations/v1.0/)__.

The gene models that were used for the study correspond to the genes in the following files:

 * __[iwgsc_refseqv1.0_HighConf_UTR_2017May05.gff3.zip](https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Annotations/v1.0/iwgsc_refseqv1.0_HighConf_UTR_2017May05.gff3.zip)__
 * __[iwgsc_refseqv1.0_LowConf_UTR_2017May05.gff3.zip](https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Annotations/v1.0/iwgsc_refseqv1.0_LowConf_UTR_2017May05.gff3.zip)__

After concatenating these files and converting the result to a FASTA file with __[gffread](http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread)__, we obtained the transcriptome reference __[IWGSCv1.0_UTR_ALL.cdnas.fasta.gz](https://opendata.earlham.ac.uk/wheat/under_license/toronto/Ramirez-Gonzalez_etal_2018-06025-Transcriptome-Landscape/expvip/RefSeq_1.0/IWGSCv1.0_UTR_ALL.cdnas.fasta.gz)__.


The functional annotation is based on the following files in __[URGI](https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Annotations/v1.0/)__:

 * ```iwgsc_refseqv1.0_FunctionalAnnotation_v1__LCgenes_v1.0.TAB``` 
 * ```iwgsc_refseqv1.0_FunctionalAnnotation_v1__HCgenes_v1.0.TAB```


On top of those, we added some missing annotations with Blast2GO, as stated in the methods:

> The pipeline also contained a step annotating the domain architectures of the gene family members. The inferred domain architectures were utilized to identify gene families belonging to super-families of transcription factors, transcriptional and post-transcriptional regulators using a HMM-domain rule set established previously (79). The orthologous relationships were utilized to establish Gene Ontology (GO), Plant Ontology (PO) and Plant Trait Ontology (TO) term annotations for bread wheat by homology annotation transfer (10). This pipeline explicitly discarded ontologies related to biotic or abiotic stress. Therefore, to complement the functional annotation, the gene models were aligned to the Arabidopsis proteome (tair10) with blastx. Matches were called with a cut-off e-value `e-10` and GO terms were transferred from the GO assignment of the matching tair10 Arabidopsis annotation. We identified the Arabidopsis proteins with GO terms relating to biotic and abiotic stress, by using the following Plant GO slim (http://geneontology.org/page/go-slim-and-subset-guide) terms: GO:0006950: response to stress; GO:0009607: response to biotic stimulus; and GO:0009628: response to abiotic stimulus. Wheat genes homologous to Arabidopsis proteins with these GO slim terms were extracted from the blastx output and these functional annotations were added to the original IWGSC annotation (10). The GO release was the monthly freeze of 01/01/2017.
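
The filtering and transfer step described in the methods can be sketched in R as below. This is a minimal illustration, not the pipeline used for the paper: the data frames `hits` and `tair_go`, their column names and all their values are hypothetical toy data, the cut-off `e-10` is read as `1e-10` and applied with `<=`, and the toy GO table lists slim-level terms directly instead of mapping each term to its slim ancestor.

```r
# Toy blastx hits (hypothetical IDs and e-values, for illustration only)
hits <- data.frame(
  wheat_gene = c("geneA", "geneA", "geneB", "geneC"),
  tair_id    = c("AT1", "AT2", "AT1", "AT3"),
  evalue     = c(1e-50, 1e-5, 1e-12, 1e-3),
  stringsAsFactors = FALSE
)

# Toy GO assignment for the Arabidopsis proteins (hypothetical)
tair_go <- data.frame(
  tair_id = c("AT1", "AT2", "AT3"),
  go_id   = c("GO:0006950", "GO:0009628", "GO:0009607"),
  stringsAsFactors = FALSE
)

# Stress-related Plant GO slim terms named in the methods
stress_slim <- c("GO:0006950", "GO:0009607", "GO:0009628")

# Keep only the hits at or below the e-value cut-off (assumed to be 1e-10)
evalue_cutoff <- 1e-10
good_hits <- hits[hits$evalue <= evalue_cutoff, ]

# Transfer the GO terms of the matching Arabidopsis protein to the wheat gene,
# keeping only the stress-related terms
transferred <- merge(good_hits, tair_go, by = "tair_id")
transferred <- unique(
  transferred[transferred$go_id %in% stress_slim, c("wheat_gene", "go_id")]
)
print(transferred)
```

In this toy input only `geneA` and `geneB` have a hit that passes the cut-off, so only those two genes receive a transferred term.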



All the tables used in our analysis are available here: https://opendata.earlham.ac.uk/wheat/under_license/toronto/Ramirez-Gonzalez_etal_2018-06025-Transcriptome-Landscape/data/TablesForExploration/


In [1]:
options(gsubfn.engine = "R")
library(ggplot2)
library(reshape2)
library(sqldf)
library(fields)
library(gridExtra)
library(ggtern)
library(clue)
library(geometry)
library(gtable)
library(goseq)
library(plyr)

options(keep.source = TRUE, error = 
  quote({ 
    cat("Environment:\n", file=stderr()); 

    # TODO: setup option for dumping to a file (?)
    # Set `to.file` argument to write this to a file for post-mortem debugging    
    dump.frames();  # writes to last.dump

    #
    # Debugging in R
    #   http://www.stats.uwo.ca/faculty/murdoch/software/debuggingR/index.shtml
    #
    # Post-mortem debugging
    #   http://www.stats.uwo.ca/faculty/murdoch/software/debuggingR/pmd.shtml
    #
    # Relation functions:
    #   dump.frames
    #   recover
    # >>limitedLabels  (formatting of the dump with source/line numbers)
    #   sys.frame (and associated)
    #   traceback
    #   geterrmessage
    #
    # Output based on the debugger function definition.

    n <- length(last.dump)
    calls <- names(last.dump)
    cat(paste("  ", 1L:n, ": ", calls, sep = ""), sep = "\n", file=stderr())
    cat("\n", file=stderr())

    if (!interactive()) {
      q()
    }
  }))

is.error <- function(x) inherits(x, "try-error")

loadGeneInformation<-function(dir="../TablesForExploration"){
    path<-paste0(dir,"/CanonicalTranscript.rds")
    canonicalTranscripts<-readRDS(path)
    canonicalTranscripts$intron_length<- canonicalTranscripts$mrna_length -  canonicalTranscripts$exon_length
    canonicalTranscripts$chr_group <- substr(canonicalTranscripts$Chr,4,4)
    canonicalTranscripts$genome    <- substr(canonicalTranscripts$Chr,5,5)
    
    path<-paste0(dir, "/MeanTpms.rds")
    meanTpms <- readRDS(path)
    expressed_genes<-unique(meanTpms$gene)
    canonicalTranscripts<-canonicalTranscripts[canonicalTranscripts$Gene %in% expressed_genes, ]
    canonicalTranscripts$scaled_5per_position <-   5 * ceiling(canonicalTranscripts$scaled_1per_position / 5)
    canonicalTranscripts$scaled_5per_position <- ifelse(canonicalTranscripts$scaled_5per_position == 0, 
        5, 
        canonicalTranscripts$scaled_5per_position)

    path<-paste0(dir, "/region_partition.csv")
    partition<-read.csv(path, row.names=1)
    
    partition_percentages<-round(100*partition/partition$Length)
    partition_percentages$Chr <- rownames(partition_percentages)
    partition$Chr <- rownames(partition)
    ct<-canonicalTranscripts
    ct_with_partition<-sqldf('SELECT ct.*, CASE 
WHEN scaled_1per_position < R1_R2a THEN "R1"
WHEN scaled_1per_position < R2a_C  THEN "R2A"
WHEN scaled_1per_position < C_R2b  THEN "C"
WHEN scaled_1per_position < R2b_R3  THEN "R2B"
ELSE "R3" END as partition
    
FROM ct LEFT JOIN partition_percentages ON ct.chr = partition_percentages.chr   ')

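    # Order the partitions along the chromosome (R1, R2A, C, R2B, R3)
    # rather than alphabetically, which is the default for factors
    ct_with_partition$partition <- factor(ct_with_partition$partition,
        levels = c("R1", "R2A", "C", "R2B", "R3"))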

    
    canonicalTranscripts<-ct_with_partition

    path<-paste0(dir,"/TriadMovement.rds")
    triadMovement<-readRDS(path)
    
    path<-paste0(dir,"/Triads.rds")
    triads<-readRDS(path)
    
    path<-paste0(dir,"/universe_table.csv")
    gene_universe<-read.csv(path)
    
    path<-paste0(dir, "/OntologiesForGenes.rds")
    ontologies<-readRDS(path)
    
    path<-paste0(dir, "/id_names_merged.txt")
    id_names <- read.csv(path, header=F, sep = "\t")
    
    path<-paste0(dir, "/WGCNA_table.csv")
    WGCNA <-  read.csv(path)
    
    path<-paste0(dir, "/ObservedGOTermsWithSlim.csv")
    go_slim<-read.csv(path, row.names=1)

    path<-paste0(dir, "/motifs.rds")
    motifs <- readRDS(path)
    motifs<-unique(motifs)

    path<-paste0(dir, "/SegmentalTriads.csv")
    allTriads<-read.csv(path, stringsAsFactors=F)
    only_genes<-allTriads[,c("group_id","A", "B", "D")]
    allTriads<-melt(only_genes, id.vars=c("group_id"),
        variable.name = "chr_group",
        value.name ="gene")
    
    list(canonicalTranscripts=canonicalTranscripts, 
       meanTpms=meanTpms,
       triads=triads, 
       triadMovement=triadMovement,
       gene_universe=gene_universe, 
       ontologies=ontologies,
       id_names=id_names,
       WGCNA=WGCNA,
       GOSlim=go_slim,
       partition=partition,
       motifs=motifs,
       allTriads=allTriads
       )
}




In [None]:
geneInfo<-loadGeneInformation(dir="./TablesForExploration")

The ontologies are in the table ```OntologiesForGenes.rds```. The table contains several sets of ontologies, distinguished by the value in the ```ontology``` column:

 * ```IWGSC+Stress``` The merged GO annotation, as described in the methods above.
 * ```GO``` Gene Ontology. From the functional annotation in URGI.
 * ```PO``` Plant Ontology. From the functional annotation in URGI.
 * ```TO``` Plant Trait Ontology. From the functional annotation in URGI.
 * ```andrea_go``` Annotation from Blast2GO against Arabidopsis.
 * ```BUSCO``` GO annotation restricted to the terms present in the plants dataset from [BUSCO](https://busco.ezlab.org).
 * ```slim_IWGSC+Stress``` Only the GO Slim terms of the merged GO annotation described in the methods above.
 * ```slim_GO``` Only the GO Slim terms of the Gene Ontology annotation from URGI.
 * ```slim_andrea_go``` Only the GO Slim terms of the Blast2GO annotation against Arabidopsis.
 * ```slim_BUSCO``` Only the GO Slim terms of the GO annotation restricted to the plants dataset from [BUSCO](https://busco.ezlab.org).



The columns in ```OntologiesForGenes.rds``` are:

 * Gene The gene ID.
 * ID The ID of the ontology term.
 * Ontology The ontology set (one of those listed above) that the annotation belongs to. These are the different subsets that we tried during the study.

This table is named ```ontologies``` in the list returned by the ```loadGeneInformation``` function.

In [None]:
ont<-geneInfo$ontologies
head(ont)
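
As a quick check of what each set contains, the rows and genes per ontology set can be counted. The sketch below runs on a small hypothetical stand-in, `ont_toy`, so it does not need the real files. It assumes the columns are named `Gene`, `ID` and `ontology`; the exact capitalisation in the real table should be checked with `colnames(ont)`, because `$` is case-sensitive in R.

```r
# Toy stand-in for geneInfo$ontologies (hypothetical genes and assignments)
ont_toy <- data.frame(
  Gene     = c("geneA", "geneA", "geneA", "geneB", "geneB"),
  ID       = c("GO:0006950", "GO:0006950", "GO:0009628",
               "GO:0006950", "GO:0009607"),
  ontology = c("GO", "IWGSC+Stress", "IWGSC+Stress",
               "GO", "IWGSC+Stress"),
  stringsAsFactors = FALSE
)

# Number of annotation rows in each ontology set
set_sizes <- table(ont_toy$ontology)

# Number of distinct genes annotated in each ontology set
genes_per_set <- tapply(ont_toy$Gene, ont_toy$ontology,
                        function(g) length(unique(g)))

# Restrict the table to a single set, here the merged GO annotation
stress_only <- ont_toy[ont_toy$ontology == "IWGSC+Stress", ]

print(set_sizes)
print(genes_per_set)
```

With the real table, replacing `ont_toy` by `ont` should give the same summaries for every set listed above, provided the column names match.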

The table with the description of the ontologies and their corresponding closest *[GO Slim](http://www.geneontology.org/page/go-subset-guide)* term is in the file ```ObservedGOTermsWithSlim.csv```, or in the ```GOSlim``` element of the list returned by ```loadGeneInformation```. The columns are:
* ```acc``` The ID of the GO term
* ```term_type``` The type according to GO (```biological_process```, ```cellular_component```, ```molecular_function```)
* ```name``` Description of the GO term
* ```slim_acc``` GO ID of the closest *GO Slim* term
* ```slim_type``` Type of the *GO Slim* term
* ```slim_name``` Description of the *GO Slim* term

The *GO Slim* terms are useful to obtain an overview of the functions of a gene.

In [None]:
ont_desc<-geneInfo$GOSlim
head(ont_desc)
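
One way to get such an overview is to replace each GO term of a gene by its closest *GO Slim* term and count how often each slim term appears. The sketch below uses a hypothetical table, `ont_desc_toy`, with the same columns as ```GOSlim```. The term-to-slim pairs in it are illustrative values and are not taken from ```ObservedGOTermsWithSlim.csv```; `gene_terms` stands for the GO IDs of one hypothetical gene.

```r
# Toy stand-in for geneInfo$GOSlim (illustrative rows only)
ont_desc_toy <- data.frame(
  acc       = c("GO:0009409", "GO:0009414", "GO:0042742"),
  term_type = rep("biological_process", 3),
  name      = c("response to cold",
                "response to water deprivation",
                "defense response to bacterium"),
  slim_acc  = c("GO:0009628", "GO:0009628", "GO:0009607"),
  slim_type = rep("biological_process", 3),
  slim_name = c("response to abiotic stimulus",
                "response to abiotic stimulus",
                "response to biotic stimulus"),
  stringsAsFactors = FALSE
)

# GO terms annotated to one hypothetical gene
gene_terms <- c("GO:0009409", "GO:0009414", "GO:0042742")

# Look up the closest slim term of every GO term of the gene
slim_for_terms <- ont_desc_toy$slim_name[match(gene_terms, ont_desc_toy$acc)]

# Count the terms per slim category, most frequent first
slim_overview <- sort(table(slim_for_terms), decreasing = TRUE)
print(slim_overview)
```

GO terms missing from the description table would come back as `NA` from `match` and are dropped by `table` by default, so they would need separate handling if they matter.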

With those two tables it is possible to obtain the ontologies for any gene. For example, the following query displays the ontologies in the ```IWGSC+Stress``` set for the gene ```TraesCS6A01G207600```.

In [None]:
sqldf("SELECT DISTINCT ont.Gene, acc, term_type, name FROM ont JOIN ont_desc on acc=ID WHERE `ontology`='IWGSC+Stress' AND Gene='TraesCS6A01G207600'")
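
For readers who prefer not to use SQL, the same lookup can be sketched with base R `merge`. The helper `terms_for_gene` below is a hypothetical function written for this illustration and is not part of the notebook's code. It assumes the ontology table has columns named exactly `Gene`, `ID` and `ontology`; unlike the SQLite query, base R matches column names case-sensitively. The toy tables and gene names are made up so that the example runs without the real files.

```r
# Hypothetical helper: GO terms of one gene in one ontology set
terms_for_gene <- function(ont, ont_desc, gene, set = "IWGSC+Stress") {
  # Rows of the requested gene in the requested ontology set
  sel <- ont[ont$Gene == gene & ont$ontology == set, c("Gene", "ID")]
  # Inner join on the term ID, equivalent to JOIN ... ON acc = ID
  joined <- merge(sel, ont_desc, by.x = "ID", by.y = "acc")
  # Same columns as the SQL query, without duplicate rows
  unique(data.frame(Gene = joined$Gene, acc = joined$ID,
                    term_type = joined$term_type, name = joined$name,
                    stringsAsFactors = FALSE))
}

# Toy tables (hypothetical genes and annotations)
ont_toy <- data.frame(
  Gene     = c("geneA", "geneA", "geneA", "geneB"),
  ID       = c("GO:0009409", "GO:0009409", "GO:0042742", "GO:0009414"),
  ontology = c("IWGSC+Stress", "GO", "IWGSC+Stress", "IWGSC+Stress"),
  stringsAsFactors = FALSE
)
ont_desc_toy <- data.frame(
  acc       = c("GO:0009409", "GO:0009414", "GO:0042742"),
  term_type = rep("biological_process", 3),
  name      = c("response to cold",
                "response to water deprivation",
                "defense response to bacterium"),
  stringsAsFactors = FALSE
)

res <- terms_for_gene(ont_toy, ont_desc_toy, "geneA")
print(res)
```

With the real data, the call would be `terms_for_gene(ont, ont_desc, "TraesCS6A01G207600")`, assuming the column names match those used in the helper.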