YakhiniGroup/SOLQC

SOLQC - Synthetic Oligo Library Quality Control (Tool)

This tool provides an easy-to-use analysis of synthetic oligo libraries.

Overview

SOLQC is a tool for statistical analysis of synthetic oligo libraries.
Given a list of designed sequences and a list of sequenced reads generated from those designed sequences, SOLQC outputs a statistical analysis report of the synthesized sequences.

The tool's pipeline is as follows:

  1. Preprocessing : Iterate over the sequenced reads of the library and filter out reads that do not match certain parameters (prefix, length, etc.).
  2. Matching : Match each read to a designed variant.
  3. Alignment : Align each read to its matched variant.
  4. Analysis : Analyze the alignment and matching results.
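
The matching and alignment steps can be pictured with a minimal sketch. It assumes that matching is an exact lookup of the read's barcode slice in the design, and it uses a plain Levenshtein distance as a stand-in for the alignment. The function names and toy sequences are hypothetical; this is not SOLQC's actual implementation.

```python
def match_read(read, design, barcode_start, barcode_end):
    """Return the designed variant whose barcode appears in the read, or None.

    Assumption: exact lookup of read[barcode_start:barcode_end] in the design.
    """
    return design.get(read[barcode_start:barcode_end])


def edit_distance(a, b):
    """Levenshtein distance between two sequences (stand-in for alignment)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # match or mismatch
        prev = curr
    return prev[-1]


# Toy design: barcode -> full variant sequence (made-up sequences).
design = {"ACGT": "ACGTTTGCA", "TTAA": "TTAAGGCCA"}
read = "ACGTTTGGA"  # differs from the first variant by one substitution

variant = match_read(read, design, barcode_start=0, barcode_end=4)
if variant is not None:
    print(variant, edit_distance(read, variant))
```

A read whose barcode slice is not in the design returns None here, so it would simply be left unmatched.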

Setup

In its current state, the tool assumes the user has some familiarity with Python.
You'll need to run the tool with Python 3.6.5.
Start by cloning the repository to a local directory.
Next, open a command line tool, go to the root folder and run:
pip install -r requirements.txt
This will install all the modules needed to run the tool.

Preparation

In order to use the tool you'll need the following:

  • Design, which can be one of 2 options:
    • A design file, in csv format, containing 2 columns : [barcode, variant]
      • barcode - a sequence identifier for the variant. [Needed for matching between a read and a variant.]
      • variant - the complete variant sequence. [Needed for the alignment, to analyse mismatches and indels.]
    • An IUPAC string
  • A reads text file listing the names of all the fasta/q files of the sequenced reads (one row for each file).
  • A config.json file containing the different possible configurations, see Configuration
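
When the design is given as an IUPAC string, each letter stands for a set of allowed bases (for example, N is any base and R is A or G). The sketch below turns such a string into a regular expression and tests a read against it. The helper names are hypothetical, and how SOLQC itself interprets the string is not shown here.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(iupac_string):
    """Compile an IUPAC string into a regex with one character class per code."""
    parts = []
    for code in iupac_string.upper():
        bases = IUPAC[code]  # raises KeyError on a non-IUPAC letter
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def matches_design(read, iupac_string):
    """True if the whole read is one of the sequences the IUPAC string allows."""
    return iupac_to_regex(iupac_string).fullmatch(read) is not None


print(matches_design("ACGA", "ACNR"))  # last base A is allowed by R
print(matches_design("ACGT", "ACNR"))  # last base T is not allowed by R
```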

Here is an example of a reads file and of a config file:

reads.txt
data/my_data/reads_1.fastq
data/my_data/reads_2.fastq

config.json
{
    "prefix" : "ACAACGCTTTCTGTGTCGTG",
    "suffix" : "",
    "length" : 0,
    "barcode_start" : 20,
    "barcode_end" : 32
}
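
One way to sanity-check these inputs before a run is to load them yourself. The sketch below assumes that the reads file is plain text with one path per row, that the config is strict JSON (no trailing comma after the last entry), and that all five keys from the example are present. `load_inputs` is a hypothetical helper and not part of SOLQC.

```python
import json
import os
import tempfile


def load_inputs(reads_list_path, config_path):
    """Return (list of read file paths, config dict)."""
    with open(reads_list_path) as handle:
        # One path per row; blank rows are skipped.
        read_files = [line.strip() for line in handle if line.strip()]
    with open(config_path) as handle:
        config = json.load(handle)
    # Assumption: every key shown in the example config is expected.
    expected = {"prefix", "suffix", "length", "barcode_start", "barcode_end"}
    missing = expected - config.keys()
    if missing:
        raise ValueError("config is missing keys: " + ", ".join(sorted(missing)))
    return read_files, config


# Demo on temporary copies of the example files.
with tempfile.TemporaryDirectory() as tmp:
    reads_txt = os.path.join(tmp, "reads.txt")
    config_json = os.path.join(tmp, "config.json")
    with open(reads_txt, "w") as handle:
        handle.write("data/my_data/reads_1.fastq\ndata/my_data/reads_2.fastq\n")
    with open(config_json, "w") as handle:
        json.dump({"prefix": "ACAACGCTTTCTGTGTCGTG", "suffix": "", "length": 0,
                   "barcode_start": 20, "barcode_end": 32}, handle)
    read_files, config = load_inputs(reads_txt, config_json)

print(read_files)
print(config["barcode_end"] - config["barcode_start"])  # barcode length
```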

Usage

Open a command line, go to the root folder and run:
python main.py -d <path_to_design>/design.csv -r <path_to_read>/reads.txt -c <path_to_config>/config.json
Or, if you are using an IUPAC string instead of a design file:
python main.py -d "IUPAC_string" -r <path_to_read>/reads.txt -c <path_to_config>/config.json

Additional Parameters

  • --no-edit (flag) : Use this if you don't want to perform alignment between the reads and variants (highly recommended if you don't need any alignment-related analysis, as it saves a lot of running time).
  • --edit (flag) : Use this if you want to perform alignment [Default].
  • -a (str array) : Allows the specification of different matching strategies. Currently only one matching strategy is implemented.
  • -id (str) : Prefix for the output files (relevant if you want to run multiple runs without erasing old output).
  • More parameters will come soon!

Analysis Options

We will soon allow setting the different analyses from the command line, but currently you'll need to go to main.py and choose them yourself.
Go to line 139 and choose the desired analyses (you can see all of them in the analyzer.py file).

analyzers = AnalyzerFactory.create_analyzers([AnalyzersNames.MATCHING_ANALYZER,
                                                  AnalyzersNames.FREQUENCY_ANALYZER
                                                  ])

Output

Once the tool is done you can find the analysis results under a deliverable folder.

Configuration

In order to run the tool you must supply a config file to the program. This should be a json file containing the following parameters:

{
    "prefix" : "ACAACGCTTTCTGTGTCGTG",
    "suffix" : "",
    "length" : 0,
    "barcode_start" : 20,
    "barcode_end" : 32
}
  • prefix : If supplied, removes all reads that do not start with the given sequence.
  • suffix : If supplied, removes all reads that do not end with the given sequence.
  • length : If supplied and set to a value above 0, keeps only reads with
    length - 5 <= len(read) <= length + 5
  • barcode_start : Start position of the barcode.
  • barcode_end : End position of the barcode.
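
The filters described above can be sketched in a few lines. This is an illustration only: it assumes that an empty prefix or suffix disables that filter, and that barcode_start and barcode_end are 0-based with an exclusive end (as in Python slicing), which the source does not state. The function names and toy reads are hypothetical.

```python
def passes_filters(read, config, tolerance=5):
    """Apply the prefix, suffix and length filters from the config to one read."""
    prefix = config.get("prefix", "")
    suffix = config.get("suffix", "")
    length = config.get("length", 0)
    if prefix and not read.startswith(prefix):
        return False
    if suffix and not read.endswith(suffix):
        return False
    # A length of 0 disables the length filter.
    if length > 0 and not (length - tolerance <= len(read) <= length + tolerance):
        return False
    return True


def extract_barcode(read, config):
    """Slice the barcode out of a read (assumed 0-based, end-exclusive)."""
    return read[config["barcode_start"]:config["barcode_end"]]


config = {"prefix": "ACAACGCTTTCTGTGTCGTG", "suffix": "", "length": 40,
          "barcode_start": 20, "barcode_end": 32}

# Made-up reads: 20-base prefix + 12-base barcode + optional tail.
reads = [
    "ACAACGCTTTCTGTGTCGTG" + "AAAACCCCGGGG" + "TTTTTTTT",  # kept (length 40)
    "TTAACGCTTTCTGTGTCGTG" + "AAAACCCCGGGG" + "TTTTTTTT",  # wrong prefix
    "ACAACGCTTTCTGTGTCGTG" + "AAAACCCCGGGG",               # too short (32 < 35)
]

kept = [read for read in reads if passes_filters(read, config)]
for read in kept:
    print(extract_barcode(read, config))
```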

Example

We recommend running the tool with the toy data supplied with this repository.
This will give you a sense of how to use the tool on a relatively small data set; the entire analysis runs in less than 30 seconds.
After you set up the tool, simply run:
python main.py -d data/toy_data/design.csv -r data/toy_data/reads.txt -c data/toy_data/config.json