CoronaTracker

Analysis of the spread of Covid-19 by comparing mutations between samples using a modified Fast Optimized Global Sequence Alignment Algorithm (FOGSAA).

Purpose

Each module has a specific function in the overall goal of plotting the spread of coronavirus (or other infectious diseases) by comparing their genomes. The theory is that, given a large enough sample size, the previously observed sample that most closely matches a given sample is most likely its ancestor. The later case can therefore be said to have been infected by the earlier case, or at least the two are near each other in the infection chain. The program consists of four scripts:

CoronaTracker.py - Utilities used by the other modules, mainly GenerateSQL.py and Run.py. Do not execute it on its own.
GenerateSQL.py - Creates a SQLite database along with the necessary tables and populates two of them with the information obtained from the CSV and FASTA files. Must be run first if it has not been run already.
Run.py - Performs the comparisons on all genomes in the database and populates the third table, spreads, with the resulting chains. Should be run prior to ShowSpread.py.
ShowSpread.py - Generates a tkinter-based GUI to observe the spreads found through genome analysis.
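
To make the "closest earlier sample is the likely ancestor" idea concrete, here is a minimal sketch. It is not the code in Run.py: the function names, the sample IDs and locations, and the `similarity` measure (a plain fraction of matching positions, standing in for the FOGSAA score) are all assumptions made for illustration.

```python
from datetime import date


def similarity(a: str, b: str) -> float:
    """Fraction of positions that agree; a crude stand-in for an alignment score."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def infer_spreads(samples):
    """samples: iterable of (accession, location, collection_date, sequence)."""
    ordered = sorted(samples, key=lambda s: s[2])
    spreads = []
    for accession, location, collected, sequence in ordered:
        # Only samples collected strictly earlier can be ancestors.
        earlier = [s for s in ordered if s[2] < collected]
        if not earlier:
            continue  # nothing observed before this sample, so no candidate ancestor
        parent = max(earlier, key=lambda s: similarity(s[3], sequence))
        # (start location, end location, start ID, end ID, strength)
        spreads.append((parent[1], location, parent[0], accession, 1))
    return spreads


# Hypothetical toy data: S2 differs from S1 by one base, S3 from S2 by one more.
samples = [
    ("S1", "A", date(2020, 1, 1), "ACGTACGT"),
    ("S2", "B", date(2020, 1, 10), "ACGTACGA"),
    ("S3", "C", date(2020, 1, 20), "ACGTTCGA"),
]
spreads = infer_spreads(samples)
```

With this toy data, S2 is linked to S1 and S3 is linked to S2, because S3 shares more positions with S2 than with S1. The strength value is fixed at 1, mirroring the description of the spreads table.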

Method

This project uses the Fast Optimized Global Sequence Alignment Algorithm (FOGSAA) with some modifications of my own. Standard FOGSAA sets a percentage match cutoff to decide when the comparison ends; my modified version instead runs until the best fitness score still waiting to be checked is lower than the best match already obtained, which ensures no better match is possible. In addition, my version puts all children into the priority queue and then expands whichever node has the best fitness score, which is not necessarily one of the children just generated. I am unsure whether this is a gain or a loss in efficiency, but I wanted to test the method, and in brief observation its speed seemed comparable to standard FOGSAA.

When the best match is found, it is recorded as a spread and eventually written to the spreads table, which holds the spread's start location, end location and each case's ID. The table also has a strength column, which always contains the value 1 in this version; it is reserved in case I later add the ability to assign a percent chance of origin to multiple cases instead of assuming the best fit is the guaranteed source.

NOTE: The downloaded CSV file must contain only the Accession, Geo_Location and Collection_Date columns. Including any other columns will cause the program to crash; no other data is needed for it to run. The CSV and FASTA files should be saved together in the same directory under the names 'sequences.csv' and 'sequences.fasta'; the database will be created as 'Sequences.db' in that directory.
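
As a rough illustration of the modified search described in this section, the sketch below keeps every generated child in a single priority queue, always expands the node with the best fitness (score so far plus an optimistic bound on what remains), and stops once nothing left in the queue can beat the best complete alignment. It is a simplification, not the code in CoronaTracker.py: the scoring values, the linear gap penalty, and all names are assumptions.

```python
import heapq

MATCH, MISMATCH, GAP = 1, -1, -2  # assumed scoring scheme with a linear gap penalty


def upper_bound(rem_a: int, rem_b: int) -> int:
    """Most the unaligned remainders could still add: all matches plus forced gaps."""
    return min(rem_a, rem_b) * MATCH + abs(rem_a - rem_b) * GAP


def best_first_alignment_score(a: str, b: str):
    n, m = len(a), len(b)
    best_complete = float("-inf")   # best score of a finished alignment so far
    best_at = {(0, 0): 0}           # best score seen for each (i, j) position
    # heapq is a min-heap, so fitness is stored negated.
    heap = [(-upper_bound(n, m), 0, 0, 0)]
    # Stop when the best remaining fitness cannot beat the best complete match.
    while heap and -heap[0][0] > best_complete:
        _, i, j, score = heapq.heappop(heap)
        if best_at[(i, j)] > score:
            continue  # a better path to this position was queued later
        if i == n and j == m:
            best_complete = max(best_complete, score)
            continue
        children = []
        if i < n and j < m:
            step = MATCH if a[i] == b[j] else MISMATCH
            children.append((i + 1, j + 1, score + step))
        if i < n:
            children.append((i + 1, j, score + GAP))  # gap in b
        if j < m:
            children.append((i, j + 1, score + GAP))  # gap in a
        # Every child goes into the shared queue; the next node expanded is
        # the globally best one, not necessarily one of these children.
        for ci, cj, cs in children:
            if cs > best_at.get((ci, cj), float("-inf")):
                best_at[(ci, cj)] = cs
                fitness = cs + upper_bound(n - ci, m - cj)
                heapq.heappush(heap, (-fitness, ci, cj, cs))
    return best_complete
```

Under these assumed scores, `best_first_alignment_score("ACGT", "ACGT")` returns 4 and `best_first_alignment_score("ACGT", "AGT")` returns 1 (three matches and one gap). Because the bound never underestimates what the remaining bases can contribute, stopping at that point cannot miss a better alignment.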
