Artem Babaian edited this page Jun 12, 2020 · 5 revisions

Serratus Working Bucket(~): s3://serratus-public/

All data files are stored in our AWS S3 bucket. It is the working directory for the project and contains raw, less-organized data. Paths below that begin with ~ are relative to the bucket root.

~/notebook : Experiment associated data

For each electronic lab notebook entry, the data associated with that run can be stored in this directory. Each folder is named with the date (YYMMDD) of the corresponding notebook file. For example:

The data for the experiment serratus/notebook/200411_CoV_Divergence_Simulations.ipynb is found in s3://serratus-public/notebook/200411/.
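
The naming rule above can be sketched in a few lines of Python. This is only an illustration: the function name `notebook_data_prefix` is hypothetical, and it assumes every notebook filename starts with a six-digit YYMMDD date followed by an underscore, as in the example.

```python
import re
from pathlib import PurePosixPath

BUCKET = "s3://serratus-public"  # working bucket named on this page


def notebook_data_prefix(notebook_path: str) -> str:
    """Return the S3 prefix expected to hold the data for a notebook entry.

    Assumes the filename begins with a YYMMDD date and an underscore.
    """
    name = PurePosixPath(notebook_path).name
    match = re.match(r"(\d{6})_", name)
    if match is None:
        raise ValueError(f"no leading YYMMDD date in {name!r}")
    return f"{BUCKET}/notebook/{match.group(1)}/"


print(notebook_data_prefix("serratus/notebook/200411_CoV_Divergence_Simulations.ipynb"))
# prints: s3://serratus-public/notebook/200411/
```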

~/out : Serratus alignment output

  • ~/out/200525_viro/bam : Aligned output files, named by SRA accession
  • ~/out/200525_viro/summary : .summary files for this experiment

~/seq : Reference sequences

Reference sequence sets and their associated index files. Includes pan-genomes and mega-genomes, as both nucleotide and protein sequences.

Examples:

  • ~/seq/cov0 : All CoV sequences from NCBI

    • NCBI search: "(Coronaviridae) AND "viruses"[porgn:txid10239]"
    • Date Accessed: 2020/03/30
    • Results: 33296
  • ~/seq/hgr1 : Human rDNA testing sequence

~/sra : SraRunInfo Tables (.csv.gz)

SRA Accession and Run Information master tables. Retrieved from the SRA website using the basic filter below. In the entries that follow, BASE stands for this filter without the trailing AND <SUBFILTER>.

"type_rnaseq"[Filter] AND cluster_public[prop] AND "platform illumina"[Properties] AND "cloud s3"[Properties] NOT "scRNA"[All Fields] AND <SUBFILTER>

  • Test Data Set

    • Mammals and CoV+ swabs for testing pipeline
    • SARS-CoV-2: PRJNA616446
    • Felis catus: PRJNA432069
    • Homo sapiens (HCT116): PRJEB29794
    • Macaca fascicularis: PRJNA553361
    • Mus musculus: PRJNA553361
    • Date Accessed: 2020/04/07
    • Results: 49 libraries
  • Non-Human, Non-Mouse Mammals

    • BASE AND "Mammalia"[Organism] NOT "Homo sapiens"[Organism] NOT "Mus musculus"[orgn]
    • Date Accessed: 2020/03/28
    • Results: 66926, 0.15 PB
  • Human

    • BASE AND "Homo sapiens"[Organism]
    • Date Accessed: 2020/03/05
    • Results: 520257, 4.75 PB
  • Mouse

    • BASE AND "Mus musculus"[orgn]
    • Results: 539233
    • Not accessed
  • Vertebrates, Non-mammal

    • BASE NOT "Mammalia"[Organism] NOT "Homo sapiens"[Organism] NOT "Mus musculus"[orgn]
    • Date Accessed: 2020/03/29
    • Results: 74532, 0.115 PB
  • Invertebrates

    • BASE NOT "Vertebrata"[Organism]
    • Date Accessed: 2020/03/30
    • Results: 403639, 0.7 PB
  • HCT116 RNAseq

    • For testing; ca. 1000 entries from the human HCT116 cell line
  • CoV Positive Control (known CoV)

    • "platform illumina"[Properties] OR "platform bgiseq"[Properties] AND txid694002[Organism:exp]
    • Date Accessed: 2020/04/27
    • Results: 862 samples
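
As a sketch of how the queries listed above expand, the Python below substitutes the basic filter for the BASE placeholder. The names `BASE`, `SUBQUERIES` and `build_query` are illustrative and not part of Serratus; only three of the listed data sets are included, copied as written on this page.

```python
# Basic filter from this page, without the trailing "AND <SUBFILTER>".
BASE = (
    '"type_rnaseq"[Filter] AND cluster_public[prop] '
    'AND "platform illumina"[Properties] AND "cloud s3"[Properties] '
    'NOT "scRNA"[All Fields]'
)

# Data-set queries as listed on this page (the dictionary keys are hypothetical labels).
SUBQUERIES = {
    "human": 'BASE AND "Homo sapiens"[Organism]',
    "mouse": 'BASE AND "Mus musculus"[orgn]',
    "invertebrates": 'BASE NOT "Vertebrata"[Organism]',
}


def build_query(name: str) -> str:
    """Expand the BASE placeholder into the full SRA search string."""
    return SUBQUERIES[name].replace("BASE", BASE, 1)


for name in SUBQUERIES:
    print(f"{name}: {build_query(name)}")
```

The resulting strings can be pasted into the SRA search box; result counts will differ from those recorded above because the database has grown since the access dates.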

~/test-data : Example data for development

Sequence Files

  • ../bam/ : Aligned BAM files, to be broken into blocks
  • ../bam-block : BAM files produced from fq-blocks; these require merging
  • ../fq/ : Sequencing reads of various lengths
  • ../fq-block : FASTQ files broken into 'blocks'
  • ../out : Example output data of re-aligned reads
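
To make the idea of 'blocks' concrete, here is a minimal Python sketch of splitting FASTQ reads into fixed-size blocks. It is not the Serratus implementation: the function name `fastq_blocks` and the block size are hypothetical, and it assumes unwrapped FASTQ with exactly four lines per read.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fastq_blocks(lines: Iterable[str], reads_per_block: int) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines, each holding at most reads_per_block reads.

    Assumes each read occupies exactly four lines (header, sequence, '+', qualities).
    """
    if reads_per_block < 1:
        raise ValueError("reads_per_block must be at least 1")
    iterator = iter(lines)
    lines_per_block = 4 * reads_per_block
    while True:
        block = list(islice(iterator, lines_per_block))
        if not block:
            return
        yield block


# Three toy reads split into blocks of two reads each.
reads = []
for i in range(3):
    reads += [f"@read{i}", "ACGT", "+", "IIII"]

blocks = list(fastq_blocks(reads, reads_per_block=2))
print([len(block) // 4 for block in blocks])  # prints: [2, 1]
```

Each block can then be aligned independently, which is why the per-block BAM output needs merging afterwards.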

~/var/ : Assorted nuts and bolts
