Artem Babaian edited this page Aug 7, 2020 · 5 revisions

Serratus Working Bucket (~) : s3://serratus-public/

All data files are stored in our AWS S3 bucket. This is the working directory for the project and contains raw, less organized data. If you're interested in data from here, your best bet is to join the Slack and ask; the right person can point you to it.

Accessing Data

The S3 bucket has public read-only permissions. All files can be downloaded via aws cli or wget/curl.

  • aws-cli : aws s3 cp s3://serratus-public/<file_path> .

  • wget/curl : wget https://serratus-public.s3.amazonaws.com/<file_path>
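A minimal shell sketch of the two access routes above. The helper names serratus_url and serratus_get are hypothetical; only the bucket name and the URL pattern come from this page. If no AWS credentials are configured, the AWS CLI typically needs --no-sign-request to read a public bucket.

```shell
# Hypothetical helpers; only the bucket name and URL pattern come from this page.
BUCKET="serratus-public"

# Turn a bucket-relative path (or a full s3:// URI) into its public HTTPS URL.
serratus_url() {
  local path="${1#s3://${BUCKET}/}"   # strip the s3:// prefix if present
  path="${path#/}"                    # strip a leading slash if present
  printf 'https://%s.s3.amazonaws.com/%s\n' "$BUCKET" "$path"
}

# Download one file into the current directory with curl.
serratus_get() {
  curl --fail --silent --show-error --remote-name "$(serratus_url "$1")"
}

# Equivalent anonymous download with the AWS CLI:
#   aws s3 cp --no-sign-request "s3://${BUCKET}/<file_path>" .
```

For example, serratus_url s3://serratus-public/seq/cov0 prints the matching https://serratus-public.s3.amazonaws.com/ URL, which can be passed to wget or curl.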

~/notebook : Experiment associated data

For each electronic lab notebook entry, data associated with that run can be stored in this directory. Each folder is named with the date (YYMMDD) of the corresponding notebook file. For example:

The data for the experiment serratus/notebook/200411_CoV_Divergence_Simulations.ipynb is found in s3://serratus-public/notebook/200411/.
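The naming rule above can be sketched as a small shell function. The name notebook_data_prefix is hypothetical, and the sketch assumes every notebook filename starts with a six-digit YYMMDD date followed by an underscore, as in the example.

```shell
# Hypothetical helper: map a notebook filename to its data folder in the bucket.
# Assumes the filename starts with a six-digit YYMMDD date and an underscore.
notebook_data_prefix() {
  local name date
  name="$(basename "$1")"
  date="${name%%_*}"   # text before the first underscore
  case "$date" in
    [0-9][0-9][0-9][0-9][0-9][0-9])
      printf 's3://serratus-public/notebook/%s/\n' "$date" ;;
    *)
      echo "not a YYMMDD-prefixed notebook name: $name" >&2
      return 1 ;;
  esac
}
```

The resulting prefix can then be listed or copied with the AWS CLI, for instance with aws s3 ls.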

~/out : Serratus alignment output

  • ~/out/200525_viro/bam : Aligned output files, named by SRA accession
  • ~/out/200525_viro/summary : .summary files for this experiment

~/seq : Reference sequences

Reference sequence sets and their associated index files. Includes pan-genomes, mega-genomes, and both nucleotide and protein sequences.

Examples:

  • ~/seq/cov0 : All CoV sequences from NCBI

    • NCBI search: "(Coronaviridae) AND "viruses"[porgn:txid10239]"
    • Date Accessed: 2020/03/30
    • Results: 33296
  • ~/seq/hgr1 : Human rDNA testing sequence

~/sra : SraRunInfo Tables (.csv.gz)

Master tables of SRA accession and run information, accessed via the SRA website. See also SRA-queries.

~/test-data : example data for development

Sequence Files

  • ../bam/ : aligned bam files for breaking into blocks
  • ../bam-block : bam file output of fq-blocks requiring merging
  • ../fq/ : sequencing reads of various lengths
  • ../fq-block : fq files broken into 'blocks'
  • ../out : Example output data of re-aligned reads

~/var/ : Assorted nuts and bolts

~/assemblies/ : Assemblies of the CoV+ identified datasets

in assemblies/analysis/:

  • catA-v[XXX].txt list of assemblies of category A: single contig, longer than 25 Kbp
  • catB-v[XXX].txt list of assemblies of category B: more than one contig, total length longer than 25 Kbp
  • cat[A/B]-v[XXX].fa multifasta files of the lists above
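The category rules above can be illustrated with a short shell and awk sketch that classifies a single FASTA file. This is not the code that produced the lists; the function name classify_assembly is hypothetical, and 25 Kbp is taken to mean 25,000 bp, which is an assumption.

```shell
# Hypothetical helper: classify one assembly FASTA by the category rules above.
# Assumption: "25 Kbp" means 25,000 bp.
classify_assembly() {
  awk -v min_len=25000 '
    /^>/ { n++; next }                           # each header line starts a new contig
    { gsub(/[ \t\r]/, ""); total += length($0) } # sum sequence length over all contigs
    END {
      if (n == 1 && total > min_len)     print "A"   # single contig, long enough
      else if (n > 1 && total > min_len) print "B"   # several contigs, long enough in total
      else                               print "none"
    }' "$1"
}
```

Running classify_assembly on a contigs file prints A, B, or none, depending on the contig count and the summed sequence length.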

in assemblies/contigs/:

  • SRRxxx.minia.checkv_filtered.fa Minia k31 contigs filtered by CheckV, keeping only coronavirus hits
  • SRRxxx.coronaspades.checkv_filtered.fa coronaSPAdes scaffolds filtered by CheckV, keeping only coronavirus hits
  • SRRxxx.coronaspades.gene_clusters.fa coronaSPAdes' gene_clusters.fasta (you can think of them as contigs obtained by matching to a coronavirus HMM but the construction process is more complex than that!)
  • SRRxxx.coronaspades.gene_clusters.checkv_filtered.fa coronaSPAdes gene_clusters.fasta further filtered by CheckV

in assemblies/other/SRRxxx.[assembler]/:

  • SRRxxx.[assembler].contigs.fa.mfc unfiltered, straight out-of-the-assembler, MFCompress'd contigs (Minia) or scaffolds (coronaSPAdes) file. This file contains the whole assembly of the dataset, so it also includes host contigs and other viruses.
  • SRRxxx.inputdata.txt some statistics about the reads (number of reads, FASTQ file size)
  • SRRxxx.[assembler].checkv.[completeness|contamination|quality_summary].tsv.gz output of CheckV on the whole assembly file (i.e. contigs.fa.mfc)
  • SRRxxx.coronaspades.gene_clusters.checkv.[completeness|contamination|quality_summary].tsv.gz output of CheckV on the gene_clusters.fasta file (for coronaSPAdes)
  • SRRxxx.[assembler].txt output log of the assembler
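As a sketch of how the layout above translates into a bucket path, the following shell function builds the S3 URI of the unfiltered assembly for one accession. The function name assembly_contigs_uri is hypothetical, and the assembler labels minia and coronaspades are assumed from the file names listed under assemblies/contigs/.

```shell
# Hypothetical helper: S3 URI of the unfiltered, MFCompress'd assembly file.
# Assumption: [assembler] is either "minia" or "coronaspades".
assembly_contigs_uri() {
  local acc="$1" asm="$2"
  case "$asm" in
    minia|coronaspades) ;;
    *) echo "unknown assembler: $asm" >&2; return 1 ;;
  esac
  printf 's3://serratus-public/assemblies/other/%s.%s/%s.%s.contigs.fa.mfc\n' \
    "$acc" "$asm" "$acc" "$asm"
}
```

The URI can be passed to aws s3 cp to fetch the file. The .mfc file then has to be decompressed with MFCompress's decompressor (MFCompressD); check its help output for the exact options and output naming.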

The magic that performed this is at https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly
