Rayan Chikhi edited this page Jun 26, 2020 · 5 revisions

Serratus Working Bucket(~): s3://serratus-public/

All data files are stored in our AWS S3 bucket. This is the working directory for the project and contains raw, less organized data.

~/notebook : Experiment-associated data

Data associated with each electronic lab notebook entry can be stored in this directory. Each folder is named with the date (YYMMDD) of the corresponding notebook file. For example:

The data for the experiment serratus/notebook/200411_CoV_Divergence_Simulations.ipynb is found in s3://serratus-public/notebook/200411/.
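The naming convention above can be sketched in Python. This is a minimal sketch, assuming every notebook filename begins with a six-digit YYMMDD date followed by an underscore; the function name `notebook_data_uri` is hypothetical and not part of the Serratus code base.

```python
import re
from pathlib import PurePosixPath

BUCKET = "s3://serratus-public"  # working bucket named on this page


def notebook_data_uri(notebook_path: str) -> str:
    """Return the S3 folder expected to hold data for a dated notebook.

    Assumption: the notebook filename starts with a YYMMDD date and an
    underscore, e.g. 200411_CoV_Divergence_Simulations.ipynb.
    """
    name = PurePosixPath(notebook_path).name
    match = re.match(r"(\d{6})_", name)
    if match is None:
        raise ValueError(f"no YYMMDD prefix in notebook name: {name!r}")
    return f"{BUCKET}/notebook/{match.group(1)}/"
```

Called on the example above, `notebook_data_uri("serratus/notebook/200411_CoV_Divergence_Simulations.ipynb")` yields the folder `s3://serratus-public/notebook/200411/`.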

~/out : Serratus alignment output

  • ~/out/200525_viro/bam : Aligned output files, named by SRA accession
  • ~/out/200525_viro/summary : .summary files for this experiment

~/seq : Reference sequences

Reference sequence sets and their associated index files. Includes pan-genomes, mega-genomes, and nucleotide and protein sequences.

Examples:

  • ~/seq/cov0 : All CoV sequences from NCBI

    • NCBI search: "(Coronaviridae) AND "viruses"[porgn:txid10239]"
    • Date Accessed: 2020/03/30
    • Results: 33296
  • ~/seq/hgr1 : Human rDNA testing sequence

~/sra : SraRunInfo Tables (.csv.gz)

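SRA accession and run information master tables, retrieved from the SRA website using the following basic filter. In the entries below, BASE stands for the shared part of this filter, and each entry supplies its own sub-filter in place of <SUBFILTER>: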

"type_rnaseq"[Filter] AND cluster_public[prop] AND "platform illumina"[Properties] AND "cloud s3"[Properties] NOT "scRNA"[All Fields] AND <SUBFILTER>
  • Test Data Set

    • Mammals and CoV+ swabs for testing pipeline
    • SARS-CoV-2: PRJNA616446
    • Felis catus: PRJNA432069
    • Homo sapiens (HCT116): PRJEB29794
    • Macaca fascicularis: PRJNA553361
    • Mus musculus: PRJNA553361
    • Date Accessed: 2020/04/07
    • Results: 49 libraries
  • Non-Human, Non-Mouse Mammals

    • BASE AND ("Mammalia"[Organism] NOT "Homo sapiens"[Organism]) NOT "Mus musculus"[orgn]
    • Date Accessed: 2020/03/28
    • Results: 66926, 0.15 PB
  • Human

    • BASE AND "Homo sapiens"[Organism]
    • Date Accessed: 2020/03/05
    • Results: 520257, 4.75 PB
  • Mouse

    • BASE AND "Mus musculus"[orgn]
    • Results: 539233
    • Not accessed
  • Vertebrates, Non-mammal

    • BASE NOT "Mammalia"[Organism] NOT "Homo sapiens"[Organism] NOT "Mus musculus"[orgn]
    • Date Accessed: 2020/03/29
    • Results: 74532, 0.115 PB
  • Invertebrates

    • BASE NOT "Vertebrata"[Organism]
    • Date Accessed: 2020/03/30
    • Results: 403639, 0.7 PB
  • HCT116 RNAseq

    • For testing; ca. 1000 entries of human HCT116 cell line
  • CoV Positive Control (known CoV)

    • "platform illumina"[Properties] OR "platform bgiseq"[Properties] AND txid694002[Organism:exp]
    • Date Accessed: 2020/04/27
    • Results: 862 samples
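The way each table's search string is assembled from the shared filter and a per-set sub-filter can be sketched as follows. This is an illustrative sketch only: it assumes BASE is the shared filter text without the trailing "AND <SUBFILTER>", and the dictionary keys and the function name `build_query` are hypothetical labels chosen here, not names used by the project.

```python
# Shared filter from this page, without the trailing "AND <SUBFILTER>".
BASE = (
    '"type_rnaseq"[Filter] AND cluster_public[prop] '
    'AND "platform illumina"[Properties] AND "cloud s3"[Properties] '
    'NOT "scRNA"[All Fields]'
)

# Sub-filters copied from the entries above with the leading "BASE" removed.
# The keys are hypothetical labels used only in this sketch.
SUBFILTERS = {
    "human": 'AND "Homo sapiens"[Organism]',
    "mouse": 'AND "Mus musculus"[orgn]',
    "invertebrates": 'NOT "Vertebrata"[Organism]',
}


def build_query(dataset: str) -> str:
    """Join the shared filter with the sub-filter for one data set."""
    return f"{BASE} {SUBFILTERS[dataset]}"


if __name__ == "__main__":
    for label in SUBFILTERS:
        print(f"{label}: {build_query(label)}")
```

The resulting string is what would be pasted into the SRA search box; the sketch does not query NCBI itself.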

~/test-data : example data for development

Sequence Files

  • ../bam/ : aligned BAM files for breaking into blocks
  • ../bam-block : BAM output of fq-blocks, still requiring merging
  • ../fq/ : sequencing reads of various lengths
  • ../fq-block : fq files broken into 'blocks'
  • ../out : example output data of re-aligned reads

~/var/ : Assorted nuts and bolts

~/assemblies/ : Assemblies of the CoV+ identified datasets

In ~/assemblies/analysis/:

  • catA-v[XXX].txt : list of assemblies of category A (single contig, longer than 25 Kbp)
  • catB-v[XXX].txt : list of assemblies of category B (more than one contig, total length longer than 25 Kbp)
  • cat[A/B]-v[XXX].fa : multi-FASTA files of the lists above
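The two category definitions can be expressed as a small classifier. This is a sketch under stated assumptions, not the script that produced the lists: it assumes "25 Kbp" means 25,000 bp, that "longer than" is a strict comparison, and that the input is the list of contig lengths for one assembly. The name `assembly_category` is hypothetical.

```python
from typing import Optional, Sequence

MIN_LENGTH_BP = 25_000  # "25 Kbp", assumed here to mean 25,000 bp


def assembly_category(contig_lengths: Sequence[int]) -> Optional[str]:
    """Classify one assembly from the lengths (in bp) of its contigs.

    Returns "A" for a single contig longer than 25 Kbp, "B" for more
    than one contig whose total length exceeds 25 Kbp, else None.
    """
    n_contigs = len(contig_lengths)
    if n_contigs == 1 and contig_lengths[0] > MIN_LENGTH_BP:
        return "A"
    if n_contigs > 1 and sum(contig_lengths) > MIN_LENGTH_BP:
        return "B"
    return None
```

For instance, a single 29,903 bp contig falls in category A, while two contigs of 15,000 bp and 12,000 bp fall in category B.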

In ~/assemblies/contigs/:

  • SRRxxx.minia.checkv_filtered.fa : Minia k31 contigs filtered by CheckV, keeping only coronavirus hits
  • SRRxxx.coronaspades.checkv_filtered.fa : coronaSPAdes scaffolds filtered by CheckV, keeping only coronavirus hits
  • SRRxxx.coronaspades.gene_clusters.fa : coronaSPAdes' gene_clusters.fasta (you can think of these as contigs obtained by matching to a coronavirus HMM, although the construction process is more complex than that)
  • SRRxxx.coronaspades.gene_clusters.checkv_filtered.fa : coronaSPAdes gene_clusters.fasta further filtered by CheckV

In ~/assemblies/other/SRRxxx.[assembler]/:

  • SRRxxx.[assembler].contigs.fa.mfc : unfiltered contigs (Minia) or scaffolds (coronaSPAdes), straight from the assembler and compressed with MFCompress. This file contains the whole assembly of the dataset, so it also includes host contigs and other viruses.
  • SRRxxx.inputdata.txt : some statistics about the reads (number of reads, FASTQ file size)
  • SRRxxx.[assembler].checkv.[completeness|contamination|quality_summary].tsv.gz : output of CheckV on the whole assembly file (i.e. contigs.fa.mfc)
  • SRRxxx.coronaspades.gene_clusters.checkv.[completeness|contamination|quality_summary].tsv.gz : output of CheckV on the gene_clusters.fasta file (for coronaSPAdes)
  • SRRxxx.[assembler].txt : output log of the assembler
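To make the file-name patterns above concrete, the sketch below expands them for one accession and one assembler. It assumes the [assembler] placeholder takes the lowercase values "minia" and "coronaspades" seen in the contigs file names; the function name `expected_other_files` and the accession used in the usage note are hypothetical placeholders, and the sketch does not check that any file actually exists in the bucket.

```python
from typing import List

CHECKV_TABLES = ("completeness", "contamination", "quality_summary")


def expected_other_files(accession: str, assembler: str) -> List[str]:
    """List the keys expected under assemblies/other/ for one run.

    Assumption: assembler is "minia" or "coronaspades", matching the
    lowercase names used in the file patterns on this page.
    """
    prefix = f"assemblies/other/{accession}.{assembler}/"
    stem = f"{accession}.{assembler}"
    names = [
        f"{stem}.contigs.fa.mfc",       # whole assembly, MFCompress'd
        f"{accession}.inputdata.txt",   # read statistics
        f"{stem}.txt",                  # assembler log
    ]
    # CheckV output on the whole assembly file
    names += [f"{stem}.checkv.{table}.tsv.gz" for table in CHECKV_TABLES]
    if assembler == "coronaspades":
        # CheckV output on gene_clusters.fasta exists only for coronaSPAdes
        names += [
            f"{accession}.coronaspades.gene_clusters.checkv.{table}.tsv.gz"
            for table in CHECKV_TABLES
        ]
    return [prefix + name for name in names]
```

With a placeholder accession such as "SRR0000000", the function returns six keys for "minia" and nine for "coronaspades", each of which would be appended to s3://serratus-public/ to form a full path.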

The magic that performed this is at https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly
