Implementation

David Soldevila edited this page Jul 10, 2019 · 16 revisions

First of all, I want to say that the program was written somewhat on the go and its current version is a bit messy, but I will try my best to explain how it works.

Secondly, if you have not read the Matching and Simulation wiki pages, you should do so first, as the most important parameters are explained there.

Program structure schematic

Program Structure

Script Summary

Currently the program is composed of the following scripts:

  • common.py: As the name says, contains code and some global variables shared by multiple files.
  • load_data.py: Contains all functions related to loading data.
  • matching.py: Handles everything related to generating the template and saving it.
  • matching_frontend.py: Handles the GUI and the command line (cl) interface of the matching.
  • interface.py: The existence of this file is questionable. Its purpose was to keep the frontend a little cleaner and to separate it from the backend, so that a change in one would not require modifying the other. In practice it only wraps matching.py and load_data.py, because simulation.py bypasses it, which makes things a little messy. It handles exceptions, but lately I've been moving the exception handlers to matching.py and load_data.py.
  • simulation.py: Does the simulation computations and saves the results.
  • simulation_frontend.py: Handles the GUI and the command line (cl) interface of the simulation.
  • QMPrimers.py: Contains the main function. Handles the main GUI that wraps the GUIs from matching_frontend.py and simulation_frontend.py.

Explanation

Let's explain the general flow of the program:

When QMPrimers.py is executed it checks the parameters passed to it.

  • If there are none, it creates an instance of the main GUI. The main GUI then embeds an instance of the matching and simulation GUIs (located in matching_frontend.py and simulation_frontend.py) in the form of tabs and, below them, a terminal. Finally, it redirects stdout to this embedded terminal and initializes the logging module. Both GUIs are partially generated automatically: they use the information in the parameters table to generate the entries. For example, in the matching GUI the gen parameter is categorized as "entry", which is why a text entry is generated for it. If you add another parameter with category "entry", another entry will be generated, although it will do nothing. When pressed, each button calls a GUI function that creates a thread to run the corresponding function (match, load data, etc.).

  • If there are, it initializes the logging module and then calls matching_cl() or sim_cl() (located in matching_frontend.py and simulation_frontend.py), depending on the parameters passed. These functions take care of reading the input parameters and calling the needed functions (load data, match, etc.).
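
The two branches above, and the table-driven widget generation, can be sketched roughly as follows. This is only an illustration: the contents of the parameters table, the "check" category, the helper names and the use of a first argument to select the simulation are all assumptions made for the example, not the actual QMPrimers code.

```python
# Hypothetical parameters table; the real table in the project may differ.
MATCHING_PARAMETERS = {
    "gen": {"category": "entry"},
    "primer_pairs": {"category": "entry"},
    "verbose": {"category": "check"},
}

# Assumed mapping from a parameter category to the kind of widget to build.
WIDGET_BY_CATEGORY = {"entry": "text entry", "check": "checkbox"}


def build_widget_specs(parameters):
    """Return one (name, widget kind) pair per parameter with a known category."""
    specs = []
    for name, info in parameters.items():
        widget = WIDGET_BY_CATEGORY.get(info["category"])
        if widget is not None:
            specs.append((name, widget))
    return specs


def main(argv, start_gui, matching_cl, sim_cl):
    """Choose between the GUI and the command line paths.

    The callables are passed in so the sketch stays self-contained.
    """
    if not argv:
        # No parameters: build the main GUI.
        return start_gui()
    if argv[0] == "simulate":  # assumed way of selecting the simulation
        return sim_cl(argv[1:])
    return matching_cl(argv[1:])
```

Adding one more parameter with category "entry" to the table yields one more text entry in the result of build_widget_specs(), which mirrors the behaviour described for the GUI.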

Now that you have a general idea of the frontend, let's talk about the backend. When the genome sequence and primer files are loaded and the "Compute" button is pressed, the GUI creates a new thread and calls the compute function from interface.py. This function, in turn, calls the functions from load_data.py that load each file and then calls the main matching function from matching.py.
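
A minimal sketch of this call chain is shown below. The loader and matching functions are hypothetical stand-ins for the ones in load_data.py and matching.py, and passing the result back through a queue is an assumption made for the example, not necessarily how the GUI collects it.

```python
import queue
import threading


def load_genomes(path):
    """Stand-in for the genome loader in load_data.py."""
    return ["genome from " + path]


def load_primers(path):
    """Stand-in for the primer loader in load_data.py."""
    return ["primers from " + path]


def match(genomes, primers):
    """Stand-in for the main matching function in matching.py."""
    return {"genomes": genomes, "primers": primers}


def compute(genome_path, primer_path):
    """Facade in the spirit of interface.py: load both files, then match."""
    try:
        genomes = load_genomes(genome_path)
        primers = load_primers(primer_path)
        return match(genomes, primers)
    except (OSError, ValueError) as err:
        # The facade is the place where exceptions were originally handled.
        print("Matching failed:", err)
        return None


def on_compute_pressed(genome_path, primer_path, results):
    """Button callback: run compute() on a worker thread so the GUI stays responsive."""
    def task():
        results.put(compute(genome_path, primer_path))

    worker = threading.Thread(target=task, daemon=True)
    worker.start()
    return worker
```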

The same pattern is repeated to load and/or restore a template and to save the files: the GUI calls a function from interface.py, which calls the respective functions from the other scripts. That is how the matching block works; the simulation block is much simpler. The simulation is implemented as a class (located in simulation.py), whereas the matching is implemented as a set of functions. To handle the simulation, the GUI creates an instance of the Simulation class and then runs its main function on a thread, bypassing the interface.py script.
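
For contrast with the function-based matching block, the class-based pattern might look like the sketch below. The constructor arguments, the simulate() method name and the placeholder computation are hypothetical; only the idea of creating an instance and running its main method on a thread comes from the description above.

```python
import threading


class Simulation:
    """Hypothetical stand-in for the class in simulation.py."""

    def __init__(self, template, n_runs):
        self.template = template
        self.n_runs = n_runs
        self.results = None

    def simulate(self):
        # Placeholder computation; the real class performs the simulation.
        self.results = [len(self.template)] * self.n_runs


def start_simulation(template, n_runs):
    """Create the instance and run its main method on a thread, with no facade in between."""
    sim = Simulation(template, n_runs)
    worker = threading.Thread(target=sim.simulate, daemon=True)
    worker.start()
    return sim, worker
```

Because the state lives on the instance, the caller can read sim.results once the worker has finished.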

Going a bit deeper into the backend, let's look at the general workflow of each block:

Matching:

The main matching function, compute_gen_matching(), gets the primer and genome files from the parameters and transforms the data into numpy arrays to make it compatible with the pandas library (that's a patch, as at the beginning it was not needed). It creates two pandas tables, also known as DataFrames, to store the positive and negative results. Then, for every genome sequence and every primer pair, it calls the compute_primer_pair_best_alignment() function, which updates the positive table with the best alignment found, or the negative table if there is none.
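
The nested loop can be sketched as follows. To stay dependency-free, this sketch collects rows in plain lists instead of pandas DataFrames, and the toy alignment function only looks for exact matches and ignores reverse-complementing; both are simplifications, and all names other than those mentioned above are hypothetical.

```python
def exact_alignment(sequence, forward, reverse):
    """Toy stand-in for the best-alignment search: exact matches only."""
    f_pos = sequence.find(forward)
    if f_pos == -1:
        return None
    # Look for the reverse primer only after the end of the forward match.
    r_pos = sequence.find(reverse, f_pos + len(forward))
    if r_pos == -1:
        return None
    return {"forward_pos": f_pos, "reverse_pos": r_pos}


def compute_gen_matching_sketch(genomes, primer_pairs, best_alignment):
    """Try every primer pair on every genome and sort the outcomes into two tables."""
    positive, negative = [], []
    for genome_id, sequence in genomes.items():
        for pair_id, (forward, reverse) in primer_pairs.items():
            row = {"genome": genome_id, "primer_pair": pair_id}
            alignment = best_alignment(sequence, forward, reverse)
            if alignment is None:
                negative.append(row)
            else:
                positive.append({**row, **alignment})
    return positive, negative
```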

The compute_primer_pair_best_alignment() function first checks that the primer pair can fit in the genome sequence, then gets all the positions where the forward primer matches the genome by using the _compute_primer_matching() function. After this, it calls _compute_primer_matching() again to find where the reverse primer matches, but this time the search is restricted by the matches found for the forward primer. Finally, it finds the best alignment (or alignments), that is to say, the pair of forward and reverse matches that has the best score, and then updates the template. If there are none, it updates the negative table (discarded).
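
The final selection step could look roughly like this. The sketch assumes that each match is a (start, end, score) tuple with an exclusive end, that a higher score is better, that the pair score is the sum of both scores, and that the only restriction is that the reverse match starts at or after the end of the forward match. The real function may apply different rules.

```python
def best_pairs(forward_matches, reverse_matches):
    """Return (best score, list of best pairs); (None, []) if no pair is valid."""
    best_score = None
    best = []
    for f_start, f_end, f_score in forward_matches:
        for r_start, r_end, r_score in reverse_matches:
            if r_start < f_end:
                # The reverse primer must lie after the forward primer.
                continue
            total = f_score + r_score
            pair = ((f_start, f_end), (r_start, r_end))
            if best_score is None or total > best_score:
                best_score, best = total, [pair]
            elif total == best_score:
                # Keep ties, since there may be several best alignments.
                best.append(pair)
    return best_score, best
```

An empty result corresponds to the case where the negative table is updated.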

The _compute_primer_matching() function generates a matrix similar to the SCORE_TABLE found in common.py, but with the genome sequence and a primer as column and row indexes respectively (see the Smith-Waterman algorithm for the general idea). Then it reduces the matrix to get the score of each position and generates a set of tuples containing, for each valid score, the starting and ending points of the primer on the genome.
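
As a rough illustration of the matrix and its reduction, the sketch below uses plain lists rather than numpy, a simple 1/0 match score in place of SCORE_TABLE, and assumes an ungapped alignment, so the score of a position is the sum along one diagonal. The min_score threshold used to decide which scores are valid is also an assumption.

```python
def base_score(primer_base, genome_base):
    """Simplified stand-in for a SCORE_TABLE lookup: 1 for a match, 0 otherwise."""
    return 1 if primer_base == genome_base else 0


def compute_primer_matching_sketch(primer, genome, min_score):
    """Return a set of (start, end, score) tuples, with an exclusive end."""
    # Rows are primer bases, columns are genome bases.
    matrix = [[base_score(p, g) for g in genome] for p in primer]
    matches = set()
    for start in range(len(genome) - len(primer) + 1):
        # Reduce the diagonal that begins at column `start`.
        total = sum(matrix[i][start + i] for i in range(len(primer)))
        if total >= min_score:
            matches.add((start, start + len(primer), total))
    return matches
```

For example, compute_primer_matching_sketch("ACG", "TACGACT", 2) finds a perfect match starting at position 1 and a match with one mismatch starting at position 4.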

Simulation:

This section is not done yet, but the simulation block is easier to understand.