
Cufflinks rdf

helios edited this page Sep 28, 2012 · 37 revisions

aka RNA-Seq

#Team

#Description

As part of the BioNGS project, which involves many RNA-Seq analyses, the need to collect, organize, query and integrate these data has become more important every day. We want to convert quantification and differential expression data into RDF. For resequencing projects and RNA-Seq analyses, Cufflinks is the de facto standard for quantifying transcript expression. Cufflinks can also be used to perform differential expression analysis.

Each sample is quantified independently and the data are saved in a file called transcripts.gtf. This file is generally around 305 MB. An initial conversion that kept all the information generated a Turtle file of about 1.2 GB per sample. After compacting and removing unneeded information, the Turtle file size drops to around 240 MB per sample, which is reasonable.

##Code

The real code is at https://github.com/helios/bioruby-ngs/blob/master/lib/bio/appl/ngs/cufflinks/gtf/rdf.rb; please contribute.

Using biongs, it is possible to run the converter from the command line:

Usage:
biongs convert:cuff:quant_to_ttl GTF

Options:
   [--output=OUTPUT]       # output file name
   [--renamettl]           # rename the original file name to ttl and use it as output
   [--sample=SAMPLE]       # tag these transcripts with a specific name, sample ?
   [--project=PROJECT]     # attach to these data that they are coming from a specific project
   [--run=RUN]             # attach to these data the run date/illumina name
   [--get-info-from-path]  # try to extract information (run, project, sample) from the current   directory or filename
   [--remove-zero]         # remove transcripts with FPKM == 0.0
                           # Default: true

Convert a Cufflinks GTF quantification file to RDF Turtle format. Data are sent to stdout.

Command line example:

biongs convert:cuff:quant_to_ttl MAPQUANT/110908_H125_0119_AB01W2ABXX_DATA/Project_Naive_T0/Sample_SQ_0080/quantification_denovo/transcripts.gtf \
  --renamettl --get-info-from-path
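
To illustrate what `--get-info-from-path` is meant to achieve, here is a minimal Ruby sketch that recovers run, project and sample from a path shaped like the one in the example. The helper name `info_from_path` and the `_DATA`, `Project_` and `Sample_` markers are assumptions inferred from that example path; the actual biongs implementation may work differently.

```ruby
# Hypothetical helper (not the biongs code): extract run, project and sample
# from a path laid out as .../<RUN>_DATA/Project_<PROJECT>/Sample_<SAMPLE>/...
# The directory markers are assumptions taken from the example path.
def info_from_path(path)
  info = {}
  path.split(File::SEPARATOR).each do |part|
    case part
    when /\A(.+)_DATA\z/    then info[:run]     = Regexp.last_match(1)
    when /\AProject_(.+)\z/ then info[:project] = Regexp.last_match(1)
    when /\ASample_(.+)\z/  then info[:sample]  = Regexp.last_match(1)
    end
  end
  info
end

path = "MAPQUANT/110908_H125_0119_AB01W2ABXX_DATA/Project_Naive_T0/" \
       "Sample_SQ_0080/quantification_denovo/transcripts.gtf"
p info_from_path(path)
```

Path components that match none of the markers are ignored, so a missing directory level simply leaves the corresponding key unset.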

##UseCase

Considering a real use case, one project with 3 biological replicates quantified independently yields a total of 8.682.592 triples. With more than 30 projects sequenced, this grows to at least 260.477.760 triples (30 × 8.682.592). At the time of writing, no differential expression data had been converted to RDF, so we expect the number of triples to increase further.
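
The estimate can be checked with a few lines of Ruby. The per-sample counts are the ones reported in the Numbers section of this page; treating every project as the same size as this one is a simplifying assumption.

```ruby
# Triple counts per sample, as reported in the Numbers section of this page.
triples_per_sample = {
  "Sample_80" => 2_666_254,
  "Sample_81" => 3_002_918,
  "Sample_82" => 3_013_420
}

triples_per_project = triples_per_sample.values.sum  # one project, 3 replicates
projects = 30                                        # lower bound used in the text
estimated_total = triples_per_project * projects     # assumes equally sized projects

puts triples_per_project  # prints 8682592
puts estimated_total      # prints 260477760
```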

###Notes/ToDos

  • Converting a GTF file without providing any metadata about the sample name, project and sequencing run does not make sense. An RDF converter must take care of adding this metadata to the output RDF. The converter from BioNGS does this, following the best practice of convention over configuration.
  • Proposal: remove the genomic location for already annotated data. Conversion can be greatly simplified if we assume that the annotation information for ANNOTATED data is already in the database; in that case we only need to store a reference to the transcript.

#Input

A GTF file produced by Cufflinks as a result of RNA-Seq quantification on human data:

1       Cufflinks       transcript      62948   63887   1       +       .       gene_id "ENSG00000240361"; transcript_id "ENST00000492842"; FPKM "1.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; full_read_support "no";
1       Cufflinks       exon    62948   63887   1       +       .       gene_id "ENSG00000240361"; transcript_id "ENST00000492842"; exon_number "1"; FPKM "1.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1       Cufflinks       transcript      69091   70008   1       +       .       gene_id "ENSG00000186092"; transcript_id "ENST00000335137"; FPKM "2.5000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; full_read_support "no";
1       Cufflinks       exon    69091   70008   1       +       .       gene_id "ENSG00000186092"; transcript_id "ENST00000335137"; exon_number "1"; FPKM "2.5000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1       Cufflinks       transcript      34554   36081   1       -       .       gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; FPKM "1.5000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; full_read_support "no";
1       Cufflinks       exon    34554   35174   1       -       .       gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "1"; FPKM "1.5000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1       Cufflinks       exon    35277   35481   1       -       .       gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "2"; FPKM "1.5000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1       Cufflinks       exon    35721   36081   1       -       .       gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "3"; FPKM "1.5000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1       Cufflinks       transcript      35245   36073   1       -       .       gene_id "ENSG00000237613"; transcript_id "ENST00000461467"; FPKM "4.5000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; full_read_support "no";
1       Cufflinks       exon    35245   35481   1       -       .       gene_id "ENSG00000237613"; transcript_id "ENST00000461467"; exon_number "1"; FPKM "4.5000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1       Cufflinks       exon    35721   36073   1       -       .       gene_id "ENSG00000237613"; transcript_id "ENST00000461467"; exon_number "2"; FPKM "4.5000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";  
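
As a rough illustration of how such lines can be read, the following Ruby sketch splits a GTF line into its nine columns and turns the attribute column into a Hash. The function name `parse_gtf_line` is hypothetical and this is not the parser used by biongs. The last lines show how the effect of `--remove-zero` could be reproduced on the parsed records.

```ruby
# Hypothetical helper: parse one GTF line into a Hash.
# Columns are split on any whitespace so that both tab-separated files and
# the space-aligned rendering shown on this page are accepted.
def parse_gtf_line(line)
  seqname, source, feature, start, stop, score, strand, frame, attributes =
    line.strip.split(/\s+/, 9)
  # The attribute column is a run of `key "value";` pairs.
  attrs = attributes.scan(/(\S+)\s+"([^"]*)";/).to_h
  {
    seqname: seqname, source: source, feature: feature,
    start: Integer(start), stop: Integer(stop),
    score: score, strand: strand, frame: frame,
    attributes: attrs
  }
end

sample = <<~GTF
  1 Cufflinks transcript 34554 36081 1 - . gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; FPKM "1.5000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; full_read_support "no";
  1 Cufflinks exon 34554 35174 1 - . gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "1"; FPKM "1.5000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
GTF

records = sample.each_line.map { |line| parse_gtf_line(line) }

# Same idea as --remove-zero: drop records whose FPKM is exactly 0.0.
expressed = records.reject { |r| r[:attributes]["FPKM"].to_f == 0.0 }
puts expressed.map { |r| [r[:feature], r[:attributes]["transcript_id"]].join(" ") }
```

The sketch assumes every line has all nine columns and well-formed quoting; a line like the ones with a missing opening quote would silently lose that attribute.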

#Results

Locations on the genome are identified by unique URIs; these mostly refer to annotated human data:

http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r
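
The structure of this URI (sequence name, exon ranges joined by `-`, then a strand marker) can be sketched in Ruby as follows. The function name `location_uri` is hypothetical; the `r` suffix for the minus strand comes from the example, while the `f` suffix for the plus strand is only an assumption made for this sketch.

```ruby
COORDS_BASE = "http://genome.db/coords/"

# Hypothetical helper: build a transcript location URI from the sequence name,
# a list of [start, stop] exon pairs and the strand.
# "-" maps to "r" as in the example URI; "f" for "+" is an assumption.
def location_uri(seqname, exons, strand)
  ranges = exons.sort_by(&:first).map { |start, stop| "#{start}_#{stop}" }.join("-")
  suffix = strand == "-" ? "r" : "f"
  "#{COORDS_BASE}#{seqname}:#{ranges}:#{suffix}"
end

puts location_uri("1", [[34554, 35174], [35277, 35481], [35721, 36081]], "-")
# prints http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r
```

Because the URI is derived only from coordinates and strand, two samples that report the same transcript structure end up pointing at the same location resource.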

A GTF quantification must be connected to the source sample, project and NGS run which generated it, so a transcript must be connected to them:

 <http://genome.db/sample/SQ_0081>   ngs:hasSample  gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
 <http://genome.db/project/Naive_T0> ngs:hasProject gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
 <http://genome.db/run/110908_H125_0119_AB01W2ABXX> ngs:hasRun gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .

Example of an RDF (Turtle) file for a quantified transcript:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ns0: <http://purl.obolibrary.org/obo/> .
@prefix gtf: <http://genome.db/gtf/> .
@prefix ngs: <http://genome.db/ngs/> .
<http://genome.db/ensembl/ENST00000417324>  rdf:type    ns0:SO_0000833 .
<http://genome.db/ensembl/ENST00000417324>  rdfs:label  "ENST00000417324" .
<http://genome.db/ensembl/ENST00000417324>  gtf:parent_gene <http://genome.db/ensembl/ENSG00000237613> .
<http://genome.db/ensembl/ENST00000417324>  gtf:uuid    gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
<http://genome.db/sample/SQ_0081>   ngs:hasSample  gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
<http://genome.db/project/Naive_T0> ngs:hasProject gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
<http://genome.db/run/110908_H125_0119_AB01W2ABXX> ngs:hasRun gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   ngs:sample  <http://genome.db/sample/SQ_0081> .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   ngs:project <http://genome.db/project/Naive_T0> .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   ngs:run <http://genome.db/run/110908_H125_0119_AB01W2ABXX> .
<http://genome.db/ensembl/ENST00000417324>  gtf:location    <http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> a   gtf:cufflinks_transcript .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> gtf:seqname "1" .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> gtf:start   34554 .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> gtf:stop    36081 .
<http://genome.db/coords/1:34554_35174-35277_35481-35721_36081:r> gtf:strand  "-" .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   gtf:FPKM    1.5 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   gtf:frac    0.0 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   gtf:conf_lo 0.0 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   gtf:conf_hi 0.0 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   gtf:cov 0.0 .
gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e   gtf:full_read_support   "no" .
gtf:uuid-6c679944-1cb0-4891-905a-fbbf507c2ec9   rdf:type    ns0:SO_0000852 .
gtf:uuid-6c679944-1cb0-4891-905a-fbbf507c2ec9   gtf:parent_transcript   gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
gtf:uuid-6c679944-1cb0-4891-905a-fbbf507c2ec9   gtf:location    <http://genome.db/coords/1:34554_35174> .
<http://genome.db/coords/1:34554_35174> a   gtf:caffulinks_exon .
<http://genome.db/coords/1:34554_35174> gtf:seqname "1" .
<http://genome.db/coords/1:34554_35174> gtf:start   34554 .
<http://genome.db/coords/1:34554_35174> gtf:stop    35174 .
<http://genome.db/coords/1:34554_35174> gtf:strand  "-" .
<http://genome.db/coords/1:34554_35174> gtf:exon_number 1 .
gtf:uuid-74fdb480-0b2a-4439-8050-f8ea508b30bf   rdf:type    ns0:SO_0000852 .
gtf:uuid-74fdb480-0b2a-4439-8050-f8ea508b30bf   gtf:parent_transcript   gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
gtf:uuid-74fdb480-0b2a-4439-8050-f8ea508b30bf   gtf:location    <http://genome.db/coords/1:35277_35481> .
<http://genome.db/coords/1:35277_35481> a   gtf:caffulinks_exon .
<http://genome.db/coords/1:35277_35481> gtf:seqname "1" .
<http://genome.db/coords/1:35277_35481> gtf:start   35277 .
<http://genome.db/coords/1:35277_35481> gtf:stop    35481 .
<http://genome.db/coords/1:35277_35481> gtf:strand  "-" .
<http://genome.db/coords/1:35277_35481> gtf:exon_number 2 .
gtf:uuid-62592261-26e6-4ac5-81b0-b5c4e73d87ad   rdf:type    ns0:SO_0000852 .
gtf:uuid-62592261-26e6-4ac5-81b0-b5c4e73d87ad   gtf:parent_transcript   gtf:uuid-06555be2-3bca-40ff-84ac-ca641f88320e .
gtf:uuid-62592261-26e6-4ac5-81b0-b5c4e73d87ad   gtf:location    <http://genome.db/coords/1:35721_36081> .
<http://genome.db/coords/1:35721_36081> a   gtf:caffulinks_exon .
<http://genome.db/coords/1:35721_36081> gtf:seqname "1" .
<http://genome.db/coords/1:35721_36081> gtf:start   35721 .
<http://genome.db/coords/1:35721_36081> gtf:stop    36081 .
<http://genome.db/coords/1:35721_36081> gtf:strand  "-" .
<http://genome.db/coords/1:35721_36081> gtf:exon_number 3 .
<http://genome.db/ensembl/ENST00000461467>  rdf:type    ns0:SO_0000833 .
<http://genome.db/ensembl/ENST00000461467>  rdfs:label  "ENST00000461467" .
<http://genome.db/ensembl/ENST00000461467>  gtf:parent_gene <http://genome.db/ensembl/ENSG00000237613> .
<http://genome.db/ensembl/ENST00000461467>  gtf:uuid    gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   ngs:sample  <http://genome.db/sample/SQ_0081> .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   ngs:project <http://genome.db/project/Naive_T0> .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   ngs:run <http://genome.db/run/110908_H125_0119_AB01W2ABXX> .
<http://genome.db/ensembl/ENST00000461467>  gtf:location    <http://genome.db/coords/1:35245_35481-35721_36073:r> .
<http://genome.db/coords/1:35245_35481-35721_36073:r>   a   gtf:cufflinks_transcript .
<http://genome.db/coords/1:35245_35481-35721_36073:r>   gtf:seqname "1" .
<http://genome.db/coords/1:35245_35481-35721_36073:r>   gtf:start   35245 .
<http://genome.db/coords/1:35245_35481-35721_36073:r>   gtf:stop    36073 .
<http://genome.db/coords/1:35245_35481-35721_36073:r>   gtf:strand  "-" .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   gtf:FPKM    4.5 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   gtf:frac    0.0 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   gtf:conf_lo 0.0 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   gtf:conf_hi 0.0 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   gtf:cov 0.0 .
gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57   gtf:full_read_support   "no" .
gtf:uuid-01197dff-a1d8-4e61-8202-c27d8047986b   rdf:type    ns0:SO_0000852 .
gtf:uuid-01197dff-a1d8-4e61-8202-c27d8047986b   gtf:parent_transcript   gtf:uuid-2cb16d8c-eac3-4524-8aaf-f78e13602c57 .
gtf:uuid-01197dff-a1d8-4e61-8202-c27d8047986b   gtf:location    <http://genome.db/coords/1:35245_35481> .
<http://genome.db/coords/1:35245_35481> a   gtf:caffulinks_exon .
<http://genome.db/coords/1:35245_35481> gtf:seqname "1" .
<http://genome.db/coords/1:35245_35481> gtf:start   35245 .
<http://genome.db/coords/1:35245_35481> gtf:stop    35481 .
<http://genome.db/coords/1:35245_35481> gtf:strand  "-" .
<http://genome.db/coords/1:35245_35481> gtf:exon_number 1 .
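
The following Ruby sketch shows one way the quantification part of the listing above could be generated. It is not the code from rdf.rb: the function name `quantification_triples` and its parameters are hypothetical, only numeric attributes are handled, and the prefixes `gtf:` and `ngs:` are assumed to be declared as in the listing.

```ruby
require "securerandom"

# Hypothetical helper: emit Turtle lines for one quantified transcript, using
# the predicates from the listing above. A fresh UUID identifies each
# quantification; pass `uuid:` explicitly to get reproducible output.
def quantification_triples(transcript_id, values, sample:, project:, run:,
                           uuid: SecureRandom.uuid)
  node = "gtf:uuid-#{uuid}"
  lines = []
  lines << "<http://genome.db/ensembl/#{transcript_id}> gtf:uuid #{node} ."
  lines << "#{node} ngs:sample <http://genome.db/sample/#{sample}> ."
  lines << "#{node} ngs:project <http://genome.db/project/#{project}> ."
  lines << "#{node} ngs:run <http://genome.db/run/#{run}> ."
  values.each do |key, value|
    # Float() turns "1.5000000000" into 1.5, which keeps the output compact.
    lines << "#{node} gtf:#{key} #{Float(value)} ."
  end
  lines
end

puts quantification_triples(
  "ENST00000417324",
  { "FPKM" => "1.5000000000", "frac" => "0.000000", "cov" => "0.000000" },
  sample: "SQ_0081", project: "Naive_T0", run: "110908_H125_0119_AB01W2ABXX",
  uuid: "06555be2-3bca-40ff-84ac-ca641f88320e"
)
```

Attaching sample, project and run to the UUID node rather than to the transcript URI lets the same Ensembl transcript carry one quantification per sample.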

#Numbers

Loading

Name      : Triples : Time 
Sample_80 : 2666254 :  27s
Sample_81 : 3002918 :  27s
Sample_82 : 3013420 :  33s

Comparing the transcripts.ttl files, 1.107.741 lines are duplicated.

This gives a total of 5.796.335 effective triples, unless something went wrong ^_^
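
One way such a comparison could be made is to count total and distinct statement lines across the Turtle files, as in the Ruby sketch below. The function name `line_stats` is hypothetical, and comparing lines as text is only an approximation of comparing triples: it normalizes whitespace crudely and assumes one triple per line, as in the listing on this page.

```ruby
require "set"

# Hypothetical helper: count statement lines and distinct statement lines
# across several Turtle files. Returns [total, distinct]; the difference is
# the number of repeated lines. Blank lines and @prefix lines are skipped.
def line_stats(paths)
  seen = Set.new
  total = 0
  paths.each do |path|
    File.foreach(path) do |line|
      normalized = line.split.join(" ")  # collapse tabs and repeated spaces
      next if normalized.empty? || normalized.start_with?("@prefix")
      total += 1
      seen << normalized
    end
  end
  [total, seen.size]
end

total, distinct = line_stats(Dir.glob("**/transcripts.ttl"))
puts "lines: #{total}, distinct: #{distinct}, repeated: #{total - distinct}"
```

Since an RDF graph is a set of triples, identical triples coming from different files are stored only once when loaded into the same graph, which is why repeated lines matter for the effective count.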

   Name    :  Raw Triples :  Triples   :  Time     : MaxMemUsed
15 samples :   44.030.870 : 22.606.259 : 7m16.814s :  3.14 GB

The raw and loaded triple counts are very different; I do not know why o_O

#ToDo

#OpenQuestions

  • Do we need to save everything?
  • Which is the best triple store, and how much does it cost?
  • How well does a SPARQL endpoint perform when querying a potentially huge amount of data?

#Technologies

  • Triple-store: owlim-lite
    • brew install tomcat
    • cp openrdf-sesame.war /usr/local/Cellar/tomcat/7.0.29/libexec/webapps/
    • cp openrdf-workbench.war /usr/local/Cellar/tomcat/7.0.29/libexec/webapps/
    • export CATALINA_OPTS="-Xms4096m -Xmx4096m"
    • catalina start
    • open http://localhost:8080/openrdf-workbench/
  • Libraries: owlim-ruby by Toshiaki Katayama
  • OS: OS X 10.8
  • Machine: 2.9 GHz Intel Core i7, 8 GB RAM
  • JAVA: java version "1.6.0_33", Java(TM) SE Runtime Environment (build 1.6.0_33-b03-424-11M3720), Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode)