This repository is designed to hold crowdsourced genomics data from your new project. It is a generic data structure designed to hold multiple instances of genomics data and links to external sources of data too large to store on GitHub. You can fork it and start your own repository for your own projects.
If you want to create a new crowdsourced project just follow these instructions or get in touch through firstname.lastname@example.org
How to use this repository
Once you've forked your new repository for your new project, you'll need to know how to populate it.
The directory (folder) structure
The folders in this structure are designed to be semantically organised, so that you should be able to follow them and find what you want easily and intuitively. They start off like this:
. |____project_name |____org_1 |____org_2 |____project.info |____README |____org.README |____strain.README |____reads.README |____assembly.README |____alignments.README |____annotations.README |____project.README
The base folder is called
project_name, and it contains the project level information. Immediately inside this folder is a file called
README (which is the file you are reading),
project.info (which contains the project level metadata), a load of files ending in
.README that define the metadata required for datatypes lower in the structure and one folder each for however many organisms you will have in this project (here we have two,
org_2). The names of the folders aren't important, only what they contain. So if you want to work on E.coli then you could rename one of the folders to
The organism folder contains all information about just one organism,
e_coli or whatever. It looks like this:
org_x |____org.info |____strain_1 |____strain_2
Details about the organism are in the org.info metadata file, one of which is required for every organism in the project.
org.README in the project folder contains details on the required metadata for the org.info file.
The organism folder also holds the strain folders, where
strain_x is a particular strain or isolate and could be renamed to 0104_H4, K12 or whatever.
strain.README is a file that specifies the metadata require for each strain.
These folders are where the actual data are. The enclosing folders exist really to help stop all the data sets from getting mixed up with themselves. Once in the strain folder we find specific datasets. Each strain folder contains two metadata files and 4 sub-folders.
strain_x |____alignments |____annotations |____assemblies |____reads |____strain.info
The metadata file
strain.info contains the metadata for the strain/isolate/ecotype analysed in this folder. The metadata are specified in the
strain.README file in the project folder. One
strain.info is required for every strain folder. Each sub-folder holds all the different datasets for the different types of genomic data in the repository. Just four types of dataset are identified, (alignments, annotations, assemblies and reads) but this could be extended by adding a new folder for a new datatype.
###data folders The four different types of data folder are able to hold the majority of datasets we expect a genomics project will generate. There is nothing stopping you from creating new ones of your own.
This folder is designed to contain links to the raw read files, e.g fastq files at other locations around the web. These files are too large in general, to be hosted via GitHub, but text links to them aren't. If you want the actual reads you should follow instructions in the
read_set.info file which should explain how to get them as well as containing the metadata specified in
reads.README in the project folder. Each read set should have its own folder and its own
This folder is designed to contain the results of assemblies carried out with reads in the
reads folder e.g fasta files of contigs or AGP files of assemblies. In most circumstances the fasta files will not be too big for GitHub so can be directly in the repository and not just linked. The assembly could be of genomic or transcriptomic assembly or anything else exotic along those lines. The details of how each of these assemblies was made should go in the
assembly.info according to the specification in
This folder designed to contain the results of alignments of one data set against another. Usually this will be reads in the reads folder to assemblies in the assemblies folder, but can be some other combination as required. It may contain links to large and unwieldy files, such as BAM, or directly contain text based alignments of reasonable size, e.g SAM or even BLAST output files. The details of how each of these alignments was made should go in the
alignment.info according to the specification in
This folder is designed to contain the results of annotations of assemblies in the assemblies folder. Annotations is used in a broad sense and could include gene calls in GFF format, or just SNP calls in a proprietary text format. It may contain links to very large files stored elsewhere on the web as well as those directly in the repository. As in the other folders,
annotation.README in the project folder contains details on the required metadata for the annotation.info file. Every annotation folder should have one
####other data types - extending the directory structure Extending the structure to hold data types we haven't thought of should be dead easy, just add a new folder in the strain folder. Make sure there is a .info file for the new datatype and a metadata definition in a .README file in the project_name folder
With this data structure, each data file has a unique, semantic and (hopefully) logical place and address. For example:
The .info files
The files ending in
.info contain metadata describing the data according to the specification in
.README files. Each
.info should describe one discrete unit of data, so a
read_set.info might describe one sequencing run or set of reads from a few lanes on a sequencer for which all the metadata are the same. Different read sets should be in different folders and therefore need different
.info. For example,
crowd_sourcing/e_coli/k12/reads/bgi_sequenced_100_paired_150_insert/ would be different from
crowd_sourcing/e_coli/k12/reads/tgac_sequenced_150_single/ so would need a different
.info though all the reads of that type would be in the same folder so would be covered by the same
The .README files and metadata standards
.README files are a place to specify what should be captured by the individual
.info files and are very project specific, so are kept blank in this empty repository. You'll need to decide what is appropriate at a project level.
The idea for and layout of this repository borrows very heavily from the repository created for the E.coli 0104:H4 outbreak of 2011.