Skip to content

Latest commit

 

History

History
157 lines (115 loc) · 6.54 KB

solid.rst

File metadata and controls

157 lines (115 loc) · 6.54 KB

SOLiD sequencing data

This section outlines the general structure of the data from SOLiD based sequencers.

Structure of SOLiD run names

For multiplex fragment sequencing the run names will have the form:

<instrument-name>_<date-stamp>_FRAG_BC[_2]

(For example: solid0123_20110315_FRAG_BC).

The components are:

  • <instrument_name>: name of the SOLiD instrument e.g. solid0123
  • <date-stamp>: a date stamp in year-month-day format e.g. 20110315 is 15th March 2011
  • FRAG: indicates a fragment library was used
  • BC: indicates bar-coding was used (note that not all samples in the run might be bar-coded, even if this appears in the name)
  • 2: if this is present then it indicates the data came from flow cell 2; otherwise it's from flow cell 1.

For multiplex paired-end sequencing the run names have the form:

<instrument-name>_<date-stamp>_PE_BC

Here the PE part of the name indicates a paired-end run.

Note

If the run name contains WFA then it's a work-flow analysis and not final sequence data.

See also SOLiD 4 System Instrument Operation Quick Reference (PDF) for more information.

Navigating SOLiD run data directories

Run definition file

Typically the top-level of a SOLiD run data directory should contain the run definition file which has information about the samples and libraries used in the run, including the names that were assigned when the run was set up. For example for a bar-coded sample this might look like:

version    userId  runType isMultiplexing  runName runDesc mask    protocol
v1.3   lab_user    FRAGMENT    TRUE    solid0127_20111013_FRAG_BC      1_spot_mask_sf  SOLiD4 Multiplex
primerSet  baseLength
BC 10
F3 50
sampleName sampleDesc  spotAssignments primarySetting  library application secondaryAnalysis   multiplexingSeries  barcodes
DB_SB_JL_pool      1   default primary DB01    SingleTag   sacCer2 BC Kit Module 1-96  "1"
DB_SB_JL_pool      1   default primary DB02    SingleTag   sacCer2 BC Kit Module 1-96  "2"
...
DB_SB_JL_pool      1   default primary SB_DIMB_2   SingleTag   none    BC Kit Module 1-96  "14"
DB_SB_JL_pool      1   default primary SB_DMTA SingleTag   none    BC Kit Module 1-96  "15"

Essentially the run definition file consists of a three sections, each delimited by a header line. The last section (with the header line beginning sampleName...) has the information on each of the libraries, and can be used to locate the primary data files.

Primary data files (csfasta/qual) for multiplex fragment sequencing

Locating the primary data files within the SOLiD data directories can be quite tedious and confusing. For bar-coded samples the following heuristic can be used:

  1. From the top-level of the SOLiD run directory (e.g. solid0123_20111013_FRAG_BC) move into the subdirectory with the sample name of interest (e.g. DB_SB_JL_pool, from the run definition file in the previous section).
  2. Within the sample subdirectory, look for a directory called results (which will be a link to one of the other results... directories here). Move into the results directory.
  3. Within the results subdirectory, look for a directory called libraries and move into this.
  4. Within libraries you should see subdirectories named for each of the libraries associated with this sample, as they appear in the run definition file (e.g. DB01, DB02, ..., SB_DIMB_2, SB_DMTA). Move into the subdirectory for the library of interest.
  5. Within the directory for a specific library, there should be one or more subdirectories with names of the form primary.20111015000420127 (and possibly also secondary.20111015000420127). Check each of these subdirectories looking for the one which itself contains three subdirectories reads, rejects and reports (the others will only contain reads and reports). Move into this directory, and then into the reads subdirectory. This is the location of the primary data files (csfasta and qual files).

Typically this results in a path of the form:

solid0123_20111013_FRAG_BC/SAMPLE_NAME/results/libraries/LIBRARY_NAME/primary.TIMESTAMP/reads/

As a further check, the primary data file names should include F3 in the name.

Primary data files (csfasta/qual) for multiplex paired-end sequencing

In the case of paired-end sequencing the final data consists of primary data file pairs for both the F3 and F5 reads for each library.

Locating the F3 and F5 reads uses a similar heuristic to that described above for multiplex fragment sequencing:

  1. From the top-level of the SOLiD run directory, move into the subdirectory for the sample name of interest (e.g. DB_SB_JL_pool).
  2. Look for the results directory and move into it.
  3. Look for the libraries directory and move into it.
  4. Within libraries there are subdirectories for each of the libraries associated with this sample (e.g. DB01, DB02, ..., SB_DIMB_2, SB_DMTA) - move into the one for the library of interest.
  5. Here there are one or more subdirectories with names of the form primary.20111015000420127 etc. Check each of these subdirectories looking for those which contain three subdirectories reads, rejects and reports (not just reads and reports). There should be two primary... directories which match this criterion: in the reads directory of one there will be primary data files with F5-BC in the name, and in the other files with F3.

Automatic location of primary data using analyse_solid_run.py

The heuristics described above are also encoded in the reference_analyse_solid_run program, which will identify and report the location of the primary data files when without any other arguments i.e.:

analyse_solid_run.py solid0123_20111101_FRAG_BC

This works for both multiplex fragment and multiplex paired-end sequencing.