diff --git a/Columbus_manual.pdf b/Columbus_manual.pdf new file mode 100644 index 0000000..79342ab Binary files /dev/null and b/Columbus_manual.pdf differ diff --git a/Manual.pdf b/Manual.pdf new file mode 100644 index 0000000..64fa260 Binary files /dev/null and b/Manual.pdf differ diff --git a/contrib/MetaVelvet-v0.3.1/LICENSE.txt b/contrib/MetaVelvet-v0.3.1/LICENSE.txt new file mode 100644 index 0000000..08ddefd --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/LICENSE.txt @@ -0,0 +1,340 @@ + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. 
+These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. 
The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. 
+ + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. 
+ +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. 
However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. 
+You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. 
If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 
EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + <one line to give the program's name and a brief idea of what it does.> + Copyright (C) <year> <name of author> + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. 
+ + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License along + with this program; if not, write to the Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) year name of author + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. + + <signature of Ty Coon>, 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Lesser General +Public License instead of this License. 
+ diff --git a/contrib/MetaVelvet-v0.3.1/Makefile b/contrib/MetaVelvet-v0.3.1/Makefile new file mode 100644 index 0000000..7c4ba3b --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/Makefile @@ -0,0 +1,76 @@ +CC = gcc +CFLAGS = -Wall +DEBUG = -g +LDFLAGS = -lm +OPT = -O3 +MAXKMERLENGTH=31 +CATEGORIES=2 +DEF = -D MAXKMERLENGTH=$(MAXKMERLENGTH) -D CATEGORIES=$(CATEGORIES) + +Z_LIB_DIR=../../third-party/zlib-1.2.3 +Z_LIB_FILES=$(Z_LIB_DIR)/*.o + +# Mac OS users: uncomment the following lines +# Z_LIB_FILES= +# LDFLAGS = -lm -lz +# CFLAGS = -Wall -m64 + +# Sparc/Solaris users: uncomment the following line +# CFLAGS = -Wall -m64 + +OBJ = obj/tightString.o obj/run.o obj/splay.o obj/splayTable.o obj/graph.o obj/run2.o obj/fibHeap.o obj/fib.o obj/concatenatedGraph.o obj/passageMarker.o obj/graphStats.o obj/correctedGraph.o obj/dfib.o obj/dfibHeap.o obj/recycleBin.o obj/readSet.o obj/shortReadPairs.o obj/locallyCorrectedGraph.o obj/graphReConstruction.o obj/roadMap.o obj/preGraph.o obj/preGraphConstruction.o obj/concatenatedPreGraph.o obj/readCoherentGraph.o obj/crc.o obj/utility.o obj/kmer.o obj/scaffold.o +OBJDBG = $(subst obj,obj/dbg,$(OBJ)) + +default : cleanobj zlib obj meta-velveth meta-velvetg + +clean : + -rm obj/*.o obj/dbg/*.o ./meta-velvet* + cd $(Z_LIB_DIR) && make clean + +cleanobj: + -rm obj/*.o obj/dbg/*.o + +zlib : + cd $(Z_LIB_DIR); ./configure; make; rm minigzip.o; rm example.o + +meta-velveth : obj + $(CC) $(CFLAGS) $(OPT) $(LDFLAGS) -o meta-velveth obj/tightString.o obj/run.o obj/recycleBin.o obj/splay.o obj/splayTable.o obj/readSet.o obj/crc.o obj/utility.o obj/kmer.o $(Z_LIB_FILES) + + +meta-velvetg : obj + $(CC) $(CFLAGS) $(OPT) $(LDFLAGS) -o meta-velvetg obj/tightString.o obj/graph.o obj/run2.o obj/fibHeap.o obj/fib.o obj/concatenatedGraph.o obj/passageMarker.o obj/graphStats.o obj/correctedGraph.o obj/dfib.o obj/dfibHeap.o obj/recycleBin.o obj/readSet.o obj/shortReadPairs.o obj/scaffold.o obj/locallyCorrectedGraph.o obj/graphReConstruction.o 
obj/roadMap.o obj/preGraph.o obj/preGraphConstruction.o obj/concatenatedPreGraph.o obj/readCoherentGraph.o obj/utility.o obj/kmer.o $(Z_LIB_FILES) + +debug : cleanobj obj/dbg + $(CC) $(CFLAGS) $(LDFLAGS) $(DEBUG) -o meta-velveth obj/dbg/tightString.o obj/dbg/run.o obj/dbg/recycleBin.o obj/dbg/splay.o obj/dbg/splayTable.o obj/dbg/readSet.o obj/dbg/crc.o obj/dbg/utility.o obj/dbg/kmer.o $(Z_LIB_FILES) + $(CC) $(CFLAGS) $(LDFLAGS) $(DEBUG) -o meta-velvetg obj/dbg/tightString.o obj/dbg/graph.o obj/dbg/run2.o obj/dbg/fibHeap.o obj/dbg/fib.o obj/dbg/concatenatedGraph.o obj/dbg/passageMarker.o obj/dbg/graphStats.o obj/dbg/correctedGraph.o obj/dbg/dfib.o obj/dbg/dfibHeap.o obj/dbg/recycleBin.o obj/dbg/readSet.o obj/dbg/shortReadPairs.o obj/dbg/scaffold.o obj/dbg/locallyCorrectedGraph.o obj/dbg/graphReConstruction.o obj/dbg/roadMap.o obj/dbg/preGraph.o obj/dbg/preGraphConstruction.o obj/dbg/concatenatedPreGraph.o obj/dbg/readCoherentGraph.o obj/dbg/utility.o obj/dbg/kmer.o $(Z_LIB_FILES) + +color : cleanobj obj_de + $(CC) $(CFLAGS) $(OPT) $(LDFLAGS) -o meta-velveth_de obj/tightString.o obj/run.o obj/recycleBin.o obj/splay.o obj/splayTable.o obj/readSet.o obj/crc.o obj/utility.o obj/kmer.o $(Z_LIB_FILES) + $(CC) $(CFLAGS) $(OPT) $(LDFLAGS) -o velvetg_de obj/tightString.o obj/graph.o obj/run2.o obj/fibHeap.o obj/fib.o obj/concatenatedGraph.o obj/passageMarker.o obj/graphStats.o obj/correctedGraph.o obj/dfib.o obj/dfibHeap.o obj/recycleBin.o obj/readSet.o obj/shortReadPairs.o obj/scaffold.o obj/locallyCorrectedGraph.o obj/graphReConstruction.o obj/roadMap.o obj/preGraph.o obj/preGraphConstruction.o obj/concatenatedPreGraph.o obj/readCoherentGraph.o obj/utility.o obj/kmer.o $(Z_LIB_FILES) + +colordebug : cleanobj obj/dbg_de + $(CC) $(CFLAGS) $(LDFLAGS) $(DEBUG) -o meta-velveth_de obj/dbg/tightString.o obj/dbg/run.o obj/dbg/recycleBin.o obj/dbg/splay.o obj/dbg/splayTable.o obj/dbg/readSet.o obj/dbg/crc.o obj/dbg/utility.o obj/dbg/kmer.o $(Z_LIB_FILES) + $(CC) $(CFLAGS) 
$(LDFLAGS) $(DEBUG) -o velvetg_de obj/dbg/tightString.o obj/dbg/graph.o obj/dbg/run2.o obj/dbg/fibHeap.o obj/dbg/fib.o obj/dbg/concatenatedGraph.o obj/dbg/passageMarker.o obj/dbg/graphStats.o obj/dbg/correctedGraph.o obj/dbg/dfib.o obj/dbg/dfibHeap.o obj/dbg/recycleBin.o obj/dbg/readSet.o obj/dbg/shortReadPairs.o obj/dbg/scaffold.o obj/dbg/locallyCorrectedGraph.o obj/dbg/graphReConstruction.o obj/dbg/roadMap.o obj/dbg/preGraph.o obj/dbg/preGraphConstruction.o obj/dbg/concatenatedPreGraph.o obj/dbg/readCoherentGraph.o obj/dbg/utility.o obj/dbg/kmer.o $(Z_LIB_FILES) + +objdir: + mkdir -p obj + +obj: zlib objdir $(OBJ) + +obj_de: override DEF := $(DEF) -D COLOR +obj_de: zlib cleanobj objdir $(OBJ) + +obj/dbgdir: + mkdir -p obj/dbg + +obj/dbg: zlib obj/dbgdir $(OBJDBG) + +obj/dbg_de: override DEF := $(DEF) -D COLOR +obj/dbg_de: zlib cleanobj obj/dbgdir $(OBJDBG) + +obj/%.o: src/%.c + $(CC) $(CFLAGS) $(OPT) $(DEF) -c $? -o $@ + +obj/dbg/%.o: src/%.c + $(CC) $(CFLAGS) $(DEBUG) $(DEF) -c $? -o $@ + diff --git a/contrib/MetaVelvet-v0.3.1/README.txt b/contrib/MetaVelvet-v0.3.1/README.txt new file mode 100644 index 0000000..37b978d --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/README.txt @@ -0,0 +1,42 @@ +README.TXT + +VELVET SOURCE +March 28 2008 +Daniel Zerbino + +NOTE: The PDF manual in this directory contains all the information contained +in this text file, plus much more! + +> SUMMARY + * A/ REQUIREMENTS + * B/ COMPILING INSTRUCTIONS + * C/ WHERE IS THE MANUAL? + +---------------------------------------------------------------------------------- +A/ REQUIREMENTS + + Velvet should function on any standard 64-bit Linux environment with +gcc. A good amount of physical memory (12GB to start with, more is no luxury) +is recommended. 
+ +---------------------------------------------------------------------------------- +B/ COMPILING INSTRUCTIONS + +Normally, with a GNU environment, just type: + +> make + +For colorspace Velvet replace that command with + +> make color + +Otherwise compile each *.c file separately, then execute the default +instructions at the top of Makefile. + +---------------------------------------------------------------------------------- +C/ WHERE IS THE MANUAL? + +If you cannot find the PDF manual in the source directory (probably because +you downloaded Velvet through git), you can simply compile it: + +> make doc diff --git a/contrib/MetaVelvet-v0.3.1/scriptEstimatedCovMulti.py b/contrib/MetaVelvet-v0.3.1/scriptEstimatedCovMulti.py new file mode 100755 index 0000000..ab52889 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/scriptEstimatedCovMulti.py @@ -0,0 +1,243 @@ +#!/usr/bin/env python2.5 +# -*- coding: utf-8 -*- + +import sys +import math +import random + + +# Define functions +def importStats(fin_stats): + dicStats = {} + listHeader = [] + + while True: + line = fin_stats.readline() + if not line: + break + + if len(dicStats) == 0: + listHeader = line.rstrip("\n").split("\t") + for header in listHeader: + dicStats[header] = [] + else: + listStats = line.rstrip("\n").split("\t") + for i in range(len(listStats)): + if i in [0, 1, 2, 3, 9, 10, 11]: + stats = int(listStats[i]) + else: + stats = float(listStats[i]) + dicStats[listHeader[i]].append(stats) + + return dicStats + + +def weightedHisto(dicStats, xMin, xMax, binWidth): + dicHisto = {} + listShort1Cov = dicStats["short1_cov"] + listLgth = dicStats["lgth"] + + for x in range(xMin, xMax, binWidth): + dicHisto[x] = 0 + + for i in range(len(listShort1Cov)): + cov = listShort1Cov[i] + if cov < xMin or cov >= xMax: + continue + for x in range(xMin, xMax+binWidth, binWidth): + if (cov >= x and cov < x + binWidth): + dicHisto[x] += listLgth[i] + + return dicHisto + + +def smoothingHisto(dicHisto, xMin, xMax, binWidth, 
widthMovAve): + dicSmoothHisto = {} + listMovAve = [] + + for x in range(xMin, xMax, binWidth): + listMovAve.append(dicHisto[x]) + if len(listMovAve) < widthMovAve: + continue + dicSmoothHisto[x - binWidth * ((widthMovAve - 1) / 2)] \ + = sum(listMovAve) / float(widthMovAve) + listMovAve.pop(0) + + return dicSmoothHisto + + +def printHisto(dicHisto, xMin, xMax, binWidth): + for x in range(xMin, xMax, binWidth): + #print str(x) + " : " + str(int(round(dicHisto[x], 0))) + lenBar = int(round((dicHisto[x] / 20000), 0)) - 1 + print str(x) + "\t", + for i in range(lenBar): + print "=", + print "\n", + print "\n", + + +def setXMax(xMax, binWidth): + return int((math.floor(xMax / binWidth)) * binWidth) + + +def getFirstXMax(dicStats, binWidth, thresConLen): + listLgth = dicStats["lgth"] + listShort1Cov = dicStats["short1_cov"] + maxCov = 0 + subMaxCov = 0 + + for i in range(len(listLgth)): + if listLgth[i] >= thresConLen: + if listShort1Cov[i] > maxCov: + subMaxCov = maxCov + maxCov = listShort1Cov[i] + + xMax = setXMax(subMaxCov, binWidth) + binWidth * 5 + return xMax + + +def getN50(tupleConLen): + listSortedConLen = list(tupleConLen) + listSortedConLen.sort() + listSortedConLen.reverse() + totalLen = sum(listSortedConLen) + sumLen = 0 + + for i in range(len(listSortedConLen)): + sumLen += listSortedConLen[i] + if sumLen >= totalLen / 2: + return listSortedConLen[i] + + return -1 + + +def setWidthByXMax(xMax): + listWidth = [0, 0] # [binWidth, widthMovAve] + + if xMax > 300: + listWidth = [6, 5] + if xMax <= 300: + listWidth = [4, 3] + if xMax <= 120: + listWidth = [2, 3] + if xMax <= 100: + listWidth = [1, 1] + + return listWidth + + +def detectPeakPandS(dicHisto, xMin, xMax, binWidth, + thresHeight, listPeakPandS): + countIncrease = 0; thresIncrease = 3 + countDecrease = 0; thresDecrease = 3 + beforeHeight = -1 + flagPeakStart = False + peakHeight = 0; peakCov = 0 + + for x in range(xMax - binWidth, xMin - binWidth, -1 * binWidth): + if beforeHeight == -1: + 
beforeHeight = dicHisto[x] + continue + + if not flagPeakStart: + if dicHisto[x] >= thresHeight: + if dicHisto[x] >= beforeHeight: + countIncrease += 1 + if countIncrease >= thresIncrease: + countIncrease = 0 + flagPeakStart = True + beforeHeight = dicHisto[x] + + if flagPeakStart: + if dicHisto[x] >= peakHeight: + peakHeight = dicHisto[x] + peakCov = x + else: + countDecrease += 1 + if countDecrease >= thresDecrease: + for i in range(2): + if listPeakPandS[i] == -1: + tmpBias = float(binWidth) / 2 + listPeakPandS[i] = peakCov + tmpBias + peakHeight = 0; peakCov = 0 + break + if listPeakPandS[1] != -1: + return listPeakPandS + countDecrease = 0 + flagPeakStart = False + + return listPeakPandS + + + +# ---------- Main function ---------- + +# Import stats file +fin_stats = open(sys.argv[1], "r") +dicStats = importStats(fin_stats) + +# Make weighted histogram +listPeak = [] +xMin = 0 +xMax = 1000 +binWidth = 4 +widthMovAve = 5 +listPeakPandS = [-1, -1] +N50 = 0 +thresHeight = 0 +thresConLen = 0 + +while True: + # Get N50 + if len(listPeak) == 0: + N50 = getN50(tuple(dicStats["lgth"])) + print "N50 : " + str(N50) + thresConLen = N50 * 5 + + # Get first xMax + if len(listPeak) == 0: + xMax = getFirstXMax(dicStats, binWidth, thresConLen) + print "First xMax : " + str(xMax) + + # Set width and xMax + listWidth = setWidthByXMax(xMax) + binWidth = listWidth[0]; widthMovAve = listWidth[1] + xMax = setXMax(xMax, binWidth) + + # Make weighted and smoothed histogram + xMin = 0 + dicHisto = weightedHisto(dicStats, xMin, xMax, binWidth) + dicSmoothHisto = smoothingHisto(dicHisto, xMin, xMax, + binWidth, widthMovAve) + xMin += binWidth * ((widthMovAve - 1) / 2) + xMax -= binWidth * ((widthMovAve - 1) / 2) + + # Get thresHeight + if len(listPeak) == 0: + thresHeight = dicSmoothHisto[xMax - binWidth] + print "Thres Height : " + str(thresHeight) + + # Print histogram + if len(listPeak) == 0: + printHisto(dicSmoothHisto, xMin, xMax, binWidth) + + # Detect (primary and) secondary peak 
+ listPeakPandS = detectPeakPandS(dicSmoothHisto, xMin, xMax, binWidth, + thresHeight, listPeakPandS) + + # Record peak + if len(listPeak) == 0: + listPeak.append(listPeakPandS[0]) + listPeak.append(listPeakPandS[1]) + + # When couldn't detect secondary peak, break + if listPeakPandS[1] == -1: + listPeak.pop(-1) + print listPeak + break + + # Prepare for next peak + listPeakPandS[0] = listPeakPandS[1] + listPeakPandS[1] = -1 + xMax = listPeakPandS[0] diff --git a/contrib/MetaVelvet-v0.3.1/shuffleAndConvertSequences_fastq2fasta.py b/contrib/MetaVelvet-v0.3.1/shuffleAndConvertSequences_fastq2fasta.py new file mode 100755 index 0000000..318fdbc --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/shuffleAndConvertSequences_fastq2fasta.py @@ -0,0 +1,33 @@ +#!/usr/bin/env python2.5 +# -*- coding: utf-8 -*- + +import sys +import re + +listLine = ["", ""] +fin_fw = open(sys.argv[1], "r") +fin_bw = open(sys.argv[2], "r") +fout = open(sys.argv[3], "w") + +while True: + for i in range(2): + for j in range(4): + if i == 0: + line = fin_fw.readline() + else: + line = fin_bw.readline() + + if not line: + sys.exit() + + if j >= 2: + continue + + if i == 0 and j == 0: + line = ">" + re.sub("^@|\n", "", line) + "/1\n" + elif i == 1 and j == 0: + line = ">" + re.sub("^@|\n", "", line) + "/2\n" + + fout.write(line) + + diff --git a/contrib/MetaVelvet-v0.3.1/shuffleSequences_fasta.pl b/contrib/MetaVelvet-v0.3.1/shuffleSequences_fasta.pl new file mode 100755 index 0000000..d82a527 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/shuffleSequences_fasta.pl @@ -0,0 +1,41 @@ +#!/usr/bin/perl + +if (!@ARGV) { + print "Usage: $0 forward_reads.fa reverse_reads.fa outfile.fa\n"; + print "\tforward_reads.fa / reverse_reads.fa : paired reads to be merged\n"; + print "\toutfile.fa : outfile to be created\n"; + exit(0); +} + +$filenameA = $ARGV[0]; +$filenameB = $ARGV[1]; +$filenameOut = $ARGV[2]; + +die "Could not open $filenameA" unless (-e $filenameA); +die "Could not open $filenameB" unless (-e 
$filenameB); + +open FILEA, "< $filenameA"; +open FILEB, "< $filenameB"; + +open OUTFILE, "> $filenameOut"; + +my ($lineA, $lineB); + +$lineA = <FILEA>; +$lineB = <FILEB>; + +while(defined $lineA) { + print OUTFILE $lineA; + $lineA = <FILEA>; + while (defined $lineA && $lineA !~ m/>/) { + print OUTFILE $lineA; + $lineA = <FILEA>; + } + + print OUTFILE $lineB; + $lineB = <FILEB>; + while (defined $lineB && $lineB !~ m/>/) { + print OUTFILE $lineB; + $lineB = <FILEB>; + } +} diff --git a/contrib/MetaVelvet-v0.3.1/shuffleSequences_fastq.pl b/contrib/MetaVelvet-v0.3.1/shuffleSequences_fastq.pl new file mode 100755 index 0000000..bc9c2b4 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/shuffleSequences_fastq.pl @@ -0,0 +1,29 @@ +#!/usr/bin/perl + +$filenameA = $ARGV[0]; +$filenameB = $ARGV[1]; +$filenameOut = $ARGV[2]; + +open $FILEA, "< $filenameA"; +open $FILEB, "< $filenameB"; + +open $OUTFILE, "> $filenameOut"; + +while(<$FILEA>) { + print $OUTFILE $_; + $_ = <$FILEA>; + print $OUTFILE $_; + $_ = <$FILEA>; + print $OUTFILE $_; + $_ = <$FILEA>; + print $OUTFILE $_; + + $_ = <$FILEB>; + print $OUTFILE $_; + $_ = <$FILEB>; + print $OUTFILE $_; + $_ = <$FILEB>; + print $OUTFILE $_; + $_ = <$FILEB>; + print $OUTFILE $_; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/concatenatedGraph.c b/contrib/MetaVelvet-v0.3.1/src/concatenatedGraph.c new file mode 100644 index 0000000..2cb08d5 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/concatenatedGraph.c @@ -0,0 +1,288 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include <stdlib.h> +#include <stdio.h> + +#include "globals.h" +#include "graph.h" +#include "passageMarker.h" + +void concatenateReadStarts(Node * target, Node * source, Graph * graph) +{ + IDnum sourceLength, targetLength; + ShortReadMarker *sourceArray, *targetArray, *marker; + IDnum index; + Coordinate position, nodeLength; + + if (!readStartsAreActivated(graph)) + return; + + if (target == NULL || source == NULL) + return; + + // Update Coordinates + sourceArray = getNodeReads(source, graph); + sourceLength = getNodeReadCount(source, graph); + + nodeLength = getNodeLength(target); + for (index = 0; index < sourceLength; index++) { + marker = getShortReadMarkerAtIndex(sourceArray, index); + position = getShortReadMarkerPosition(marker); + if (position != -1) { + position += nodeLength; + setShortReadMarkerPosition(marker, position); + } + } + + // Same but for symmetrical reads + targetArray = getNodeReads(getTwinNode(target), graph); + targetLength = getNodeReadCount(getTwinNode(target), graph); + + nodeLength = getNodeLength(source); + for (index = 0; index < targetLength; index++) { + marker = getShortReadMarkerAtIndex(targetArray, index); + position = getShortReadMarkerPosition(marker); + if (position != -1) { + position += nodeLength; + setShortReadMarkerPosition(marker, position); + } + } + + // Merging lists + mergeNodeReads(target, source, graph); + mergeNodeReads(getTwinNode(target), getTwinNode(source), graph); +} + +// Replaces two consecutive nodes into a single equivalent node +// The extra memory is freed +void concatenateNodes(Node * nodeA, Node * nodeB, Graph * graph) +{ + PassageMarker *marker, *tmpMarker; + Node *twinA = getTwinNode(nodeA); + Node *twinB = getTwinNode(nodeB); + Arc *arc; + Category cat; + 
+ // Arc management: + // Freeing useless arcs + while (getArc(nodeA) != NULL) + destroyArc(getArc(nodeA), graph); + + // Correct arcs + for (arc = getArc(nodeB); arc != NULL; arc = getNextArc(arc)) { + if (getDestination(arc) != twinB) + createAnalogousArc(nodeA, getDestination(arc), + arc, graph); + else + createAnalogousArc(nodeA, twinA, arc, graph); + } + + // Passage marker management in node A: + for (marker = getMarker(nodeA); marker != NULL; + marker = getNextInNode(marker)) + if (isTerminal(marker)) + incrementFinishOffset(marker, + getNodeLength(nodeB)); + + // Swapping new born passageMarkers from B to A + for (marker = getMarker(nodeB); marker != NULL; marker = tmpMarker) { + tmpMarker = getNextInNode(marker); + + if (isInitial(marker) + || getNode(getPreviousInSequence(marker)) != nodeA) { + extractPassageMarker(marker); + transposePassageMarker(marker, nodeA); + incrementFinishOffset(getTwinMarker(marker), + getNodeLength(nodeA)); + } else + disconnectNextPassageMarker(getPreviousInSequence + (marker), graph); + } + + // Read starts + concatenateReadStarts(nodeA, nodeB, graph); + + // Gaps + appendNodeGaps(nodeA, nodeB, graph); + + // Descriptor management (node) + appendDescriptors(nodeA, nodeB); + + // Update uniqueness: + setUniqueness(nodeA, getUniqueness(nodeA) || getUniqueness(nodeB)); + + // Update virtual coverage + for (cat = 0; cat < CATEGORIES; cat++) + incrementVirtualCoverage(nodeA, cat, + getVirtualCoverage(nodeB, cat)); + + // Update original virtual coverage + for (cat = 0; cat < CATEGORIES; cat++) + incrementOriginalVirtualCoverage(nodeA, cat, + getOriginalVirtualCoverage + (nodeB, cat)); + + // Freeing gobbled node + destroyNode(nodeB, graph); +} + +// Replaces two consecutive nodes into a single equivalent node +// The extra memory is freed +void concatenateStringOfNodes(Node * nodeA, Graph * graph) +{ + Node *twinA = getTwinNode(nodeA); + Node * nodeB = nodeA; + Node * twinB; + Node *currentNode, *nextNode; + Coordinate totalLength 
= 0; + PassageMarker *marker, *tmpMarker; + Arc *arc; + Category cat; + + while (simpleArcCount(nodeB) == 1 + && + simpleArcCount(getTwinNode + (getDestination(getArc(nodeB)))) == + 1 + && getDestination(getArc(nodeB)) != getTwinNode(nodeB) + && getDestination(getArc(nodeB)) != nodeA) { + totalLength += getNodeLength(nodeB); + nodeB = getDestination(getArc(nodeB)); + } + twinB = getTwinNode(nodeB); + totalLength += getNodeLength(nodeB); + reallocateNodeDescriptor(nodeA, totalLength); + + currentNode = nodeA; + while (currentNode != nodeB) { + currentNode = getDestination(getArc(currentNode)); + + // Passage marker management in node A: + for (marker = getMarker(nodeA); marker != NULL; + marker = getNextInNode(marker)) + if (isTerminal(marker)) + incrementFinishOffset(marker, + getNodeLength(currentNode)); + + // Swapping new born passageMarkers from B to A + for (marker = getMarker(currentNode); marker != NULL; marker = tmpMarker) { + tmpMarker = getNextInNode(marker); + + if (isInitial(marker) + || getNode(getPreviousInSequence(marker)) != nodeA) { + extractPassageMarker(marker); + transposePassageMarker(marker, nodeA); + incrementFinishOffset(getTwinMarker(marker), + getNodeLength(nodeA)); + } else + disconnectNextPassageMarker(getPreviousInSequence + (marker), graph); + } + + // Read starts + concatenateReadStarts(nodeA, currentNode, graph); + + // Gaps + appendNodeGaps(nodeA, currentNode, graph); + + // Update uniqueness: + setUniqueness(nodeA, getUniqueness(nodeA) || getUniqueness(currentNode)); + + // Update virtual coverage + for (cat = 0; cat < CATEGORIES; cat++) + incrementVirtualCoverage(nodeA, cat, + getVirtualCoverage(currentNode, cat)); + + // Update original virtual coverage + for (cat = 0; cat < CATEGORIES; cat++) + incrementOriginalVirtualCoverage(nodeA, cat, + getOriginalVirtualCoverage + (currentNode, cat)); + // Descriptor management (node) + directlyAppendDescriptors(nodeA, currentNode, totalLength); + } + + // Correct arcs + for (arc = 
getArc(nodeB); arc != NULL; arc = getNextArc(arc)) { + if (getDestination(arc) != twinB) + createAnalogousArc(nodeA, getDestination(arc), + arc, graph); + else + createAnalogousArc(nodeA, twinA, arc, graph); + } + + // Freeing gobbled nodes + currentNode = getTwinNode(nodeB); + while (currentNode != getTwinNode(nodeA)) { + arc = getArc(currentNode); + nextNode = getDestination(arc); + destroyNode(currentNode, graph); + currentNode = nextNode; + } +} + +// Detects sequences that could be simplified through concatenation +// Iterates till graph cannot be more simplified +// Useless nodes are freed from memory and remaining ones are renumbered +void concatenateGraph(Graph * graph) +{ + IDnum nodeIndex; + Node *node, *twin; + + puts("Concatenation..."); + + for (nodeIndex = 1; nodeIndex < nodeCount(graph); nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex); + + if (node == NULL) + continue; + + twin = getTwinNode(node); + while (simpleArcCount(node) == 1 + && + simpleArcCount(getTwinNode + (getDestination(getArc(node)))) == + 1) { + if (getDestination(getArc(node)) == twin + || getDestination(getArc(node)) == node) + break; + concatenateStringOfNodes(node, + graph); + } + + while (simpleArcCount(twin) == 1 + && + simpleArcCount(getTwinNode + (getDestination(getArc(twin)))) == + 1) { + if (getDestination(getArc(twin)) == node + || getDestination(getArc(twin)) == twin) + break; + concatenateStringOfNodes(twin, + graph); + } + } + + renumberNodes(graph); + sortGapMarkers(graph); + puts("Concatenation over!"); +} + diff --git a/contrib/MetaVelvet-v0.3.1/src/concatenatedGraph.h b/contrib/MetaVelvet-v0.3.1/src/concatenatedGraph.h new file mode 100644 index 0000000..ce610e3 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/concatenatedGraph.h @@ -0,0 +1,32 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. 
+ + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _CONCATENATEDGRAPH_H_ +#define _CONCATENATEDGRAPH_H_ + +// Concatenates short read markers from node B to the end of nodeA +void concatenateReadStarts(Node * nodeA, Node * nodeB, Graph * graph); + +// Concatenates all pairs of nodes which form a linear structure +void concatenateGraph(Graph * graph); + +// Concatenates two nodes together (supposing that they form a linear structure) +void concatenateNodes(Node * nodeA, Node * nodeB, Graph * graph); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/concatenatedPreGraph.c b/contrib/MetaVelvet-v0.3.1/src/concatenatedPreGraph.c new file mode 100644 index 0000000..133e315 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/concatenatedPreGraph.c @@ -0,0 +1,248 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include <stdlib.h> +#include <stdio.h> + +#include "globals.h" +#include "preGraph.h" +#include "utility.h" + +// Replaces two consecutive preNodes into a single equivalent preNode +// The extra memory is freed +static void concatenatePreNodes(IDnum preNodeAID, PreArc * oldPreArc, + PreGraph * preGraph) +{ + IDnum preNodeBID = preNodeAID; + IDnum currentPreNodeID, nextPreNodeID; + PreArc *preArc = oldPreArc; + Coordinate totalLength = 0; + Coordinate arrayLength; + Descriptor * descr, * ptr; + int writeOffset = 0; + int wordLength = getWordLength_pg(preGraph); + + //printf("Concatenating nodes %li and %li\n", preNodeAID, preNodeBID); + + while(hasSinglePreArc_pg(preNodeBID, preGraph) + && + hasSinglePreArc_pg(getOtherEnd_pg + (preArc, preNodeBID), + preGraph) + && !isLoop_pg(preArc) + && getDestination_pg(preArc, preNodeBID) != preNodeAID) { + + totalLength += getPreNodeLength_pg(preNodeBID, preGraph); + preNodeBID = getDestination_pg(preArc, preNodeBID); + preArc = getPreArc_pg(preNodeBID, preGraph); + } + totalLength += getPreNodeLength_pg(preNodeBID, preGraph); + totalLength += wordLength - 1; + + // Descriptor management (preNode) + arrayLength = totalLength / 4; + if (totalLength % 4) + arrayLength++; + descr = callocOrExit(arrayLength, Descriptor); + ptr = descr; + if (preNodeAID > 0) { + currentPreNodeID = preNodeAID; + appendDescriptors_pg(&ptr, &writeOffset, currentPreNodeID, preGraph, true); + preArc = getPreArc_pg(currentPreNodeID, preGraph); + currentPreNodeID = getDestination_pg(preArc, currentPreNodeID); + while (currentPreNodeID != preNodeBID) { + appendDescriptors_pg(&ptr, &writeOffset, currentPreNodeID, preGraph, false); + preArc = getPreArc_pg(currentPreNodeID, preGraph); + currentPreNodeID = getDestination_pg(preArc, currentPreNodeID); + } + 
appendDescriptors_pg(&ptr, &writeOffset, currentPreNodeID, preGraph, false); + } else { + currentPreNodeID = -preNodeBID; + appendDescriptors_pg(&ptr, &writeOffset ,currentPreNodeID, preGraph, true); + preArc = getPreArc_pg(currentPreNodeID, preGraph); + currentPreNodeID = getDestination_pg(preArc, currentPreNodeID); + while (currentPreNodeID != -preNodeAID) { + appendDescriptors_pg(&ptr, &writeOffset ,currentPreNodeID, preGraph, false); + preArc = getPreArc_pg(currentPreNodeID, preGraph); + currentPreNodeID = getDestination_pg(preArc, currentPreNodeID); + } + appendDescriptors_pg(&ptr, &writeOffset ,currentPreNodeID, preGraph, false); + } + + if (writeOffset != 0) + while (writeOffset++ != 4) + (*ptr) >>= 2; + + setPreNodeDescriptor_pg(descr, totalLength - wordLength + 1, preNodeAID, preGraph); + + // Correct preArcs + for (preArc = getPreArc_pg(preNodeBID, preGraph); preArc != NULL; + preArc = getNextPreArc_pg(preArc, preNodeBID)) { + if (getDestination_pg(preArc, preNodeBID) != -preNodeBID) + createAnalogousPreArc_pg(preNodeAID, + getDestination_pg(preArc, + preNodeBID), + preArc, preGraph); + else + createAnalogousPreArc_pg(preNodeAID, -preNodeAID, + preArc, preGraph); + } + + // Freeing gobbled preNode + currentPreNodeID = -preNodeBID; + while (currentPreNodeID != -preNodeAID) { + preArc = getPreArc_pg(currentPreNodeID, preGraph); + nextPreNodeID = getDestination_pg(preArc, currentPreNodeID); + destroyPreNode_pg(currentPreNodeID, preGraph); + currentPreNodeID = nextPreNodeID; + } +} + +// Detects sequences that could be simplified through concatenation +// Iterates till preGraph cannot be more simplified +// Useless preNodes are freed from memory and remaining ones are renumbered +void concatenatePreGraph_pg(PreGraph * preGraph) +{ + IDnum preNodeIndex; + PreArc *preArc; + PreNode *preNode; + + puts("Concatenation..."); + + for (preNodeIndex = 1; preNodeIndex < preNodeCount_pg(preGraph); + preNodeIndex++) { + preNode = getPreNodeInPreGraph_pg(preGraph, 
preNodeIndex); + + if (preNode == NULL) + continue; + + preArc = getPreArc_pg(preNodeIndex, preGraph); + + while (hasSinglePreArc_pg(preNodeIndex, preGraph) + && + hasSinglePreArc_pg(getOtherEnd_pg + (preArc, preNodeIndex), + preGraph)) { + if (isLoop_pg(preArc)) + break; + concatenatePreNodes(preNodeIndex, preArc, + preGraph); + preArc = getPreArc_pg(preNodeIndex, preGraph); + } + + preArc = getPreArc_pg(-preNodeIndex, preGraph); + + while (hasSinglePreArc_pg(-preNodeIndex, preGraph) + && + hasSinglePreArc_pg(getOtherEnd_pg + (preArc, -preNodeIndex), + preGraph)) { + if (isLoop_pg(preArc)) + break; + concatenatePreNodes(-preNodeIndex, preArc, + preGraph); + preArc = getPreArc_pg(-preNodeIndex, preGraph); + } + } + + renumberPreNodes_pg(preGraph); + puts("Concatenation over!"); +} + +static boolean isEligibleTip(IDnum index, PreGraph * preGraph, Coordinate + cutoffLength) +{ + IDnum currentIndex = -index; + Coordinate totalLength = 0; + PreArc *activeArc = NULL; + PreArc *arc; + IDnum mult = 0; + + if (getPreArc_pg(index, preGraph) != NULL) + return false; + + // Finding first tangle + while (currentIndex != 0 + && simplePreArcCount_pg(-currentIndex, preGraph) < 2 + && simplePreArcCount_pg(currentIndex, preGraph) < 2) { + totalLength += getPreNodeLength_pg(currentIndex, preGraph); + activeArc = getPreArc_pg(currentIndex, preGraph); + currentIndex = getDestination_pg(activeArc, currentIndex); + } + + // If too long + if (totalLength >= cutoffLength) + return false; + + // If isolated snippet: + if (currentIndex == 0) + return true; + + // Joined tips + if (simplePreArcCount_pg(-currentIndex, preGraph) < 2) + return false; + + // If unique event + if (getMultiplicity_pg(activeArc) == 1) + return true; + + // Computing max arc + for (arc = getPreArc_pg(-currentIndex, preGraph); arc != NULL; + arc = getNextPreArc_pg(arc, -currentIndex)) + if (getMultiplicity_pg(arc) > mult) + mult = getMultiplicity_pg(arc); + + // Testing for minority + return mult > 
getMultiplicity_pg(activeArc); +} + +void clipTips_pg(PreGraph * preGraph) +{ + IDnum index; + PreNode *current; + boolean modified = true; + Coordinate cutoffLength = getWordLength_pg(preGraph) * 2; + IDnum counter = 0; + + puts("Clipping short tips off preGraph"); + + while (modified) { + modified = false; + for (index = 1; index <= preNodeCount_pg(preGraph); + index++) { + current = getPreNodeInPreGraph_pg(preGraph, index); + + if (current == NULL) + continue; + + if (isEligibleTip(index, preGraph, cutoffLength) + || isEligibleTip(-index, preGraph, + cutoffLength)) { + counter++; + destroyPreNode_pg(index, preGraph); + modified = true; + } + } + } + + concatenatePreGraph_pg(preGraph); + printf("%d tips cut off\n", counter); + printf("%d nodes left\n", preNodeCount_pg(preGraph)); +} diff --git a/contrib/MetaVelvet-v0.3.1/src/concatenatedPreGraph.h b/contrib/MetaVelvet-v0.3.1/src/concatenatedPreGraph.h new file mode 100644 index 0000000..a41153d --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/concatenatedPreGraph.h @@ -0,0 +1,29 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _CONCATENATEDPREGRAPH_H_ +#define _CONCATENATEDPREGRAPH_H_ + +// Concatenates all pairs of nodes which form a linear structure +void concatenatePreGraph_pg(PreGraph * preGraph); + +void clipTips_pg(PreGraph * preGraph); + +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/correctedGraph.c b/contrib/MetaVelvet-v0.3.1/src/correctedGraph.c new file mode 100644 index 0000000..b291d62 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/correctedGraph.c @@ -0,0 +1,2727 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include + +#include "globals.h" +#include "graph.h" +#include "tightString.h" +#include "dfibHeap.h" +#include "fibHeap.h" +#include "recycleBin.h" +#include "passageMarker.h" +#include "concatenatedGraph.h" +#include "graphStats.h" +#include "utility.h" + +#define TICKET_BLOCK_SIZE 10000 + +static const Time INDEL = 0; +static const Time SIM[4][4] = { + {1, 0, 0, 0}, + {0, 1, 0, 0}, + {0, 0, 1, 0}, + {0, 0, 0, 1} +}; + +typedef struct tkt_st Ticket; + +struct tkt_st { + Ticket *next; + IDnum id_a; + IDnum id_b; +}; + +//Global variables used throughout this procedure(internal use only !) +static int MAXREADLENGTH = 100; +static int MAXNODELENGTH = 200; +static double MAXDIVERGENCE = 0.2; +static int MAXGAPS = 3; + +static Time *times; +static Node **previous; + +static DFibHeapNode **dheapNodes; +static DFibHeap *dheap; + +static TightString *fastSequence; +static TightString *slowSequence; + +static Node *activeNode; +static Node *startingNode; +static int WORDLENGTH; +static Graph *graph; +static IDnum dbgCounter; + +static PassageMarker *fastPath; +static PassageMarker *slowPath; + +static IDnum *eligibleStartingPoints; + +static double **Fmatrix; +static Coordinate *slowToFastMapping; +static Coordinate *fastToSlowMapping; + +static RecycleBin *ticketMemory; +static Ticket *ticketQueue; + +static Ticket **todoLists; +static Ticket **todo; +static Ticket *done; +static boolean *progressStatus; + +static Coordinate *sequenceLengths; + +//End of global variables; + +static void setNodeTime(Node * node, Time time) +{ + times[getNodeID(node) + nodeCount(graph)] = time; +} + +static Time getNodeTime(Node * node) +{ + return times[getNodeID(node) + nodeCount(graph)]; +} + +static void setNodePrevious(Node * previousNode, Node * node) +{ + 
previous[getNodeID(node) + nodeCount(graph)] = previousNode; +} + +static Node *getNodePrevious(Node * node) +{ + return previous[getNodeID(node) + nodeCount(graph)]; +} + +static void setNodeDHeapNode(Node * node, DFibHeapNode * dheapNode) +{ + dheapNodes[getNodeID(node) + nodeCount(graph)] = dheapNode; +} + +static DFibHeapNode *getNodeDHeapNode(Node * node) +{ + return dheapNodes[getNodeID(node) + nodeCount(graph)]; +} + +static Ticket *newTicket() +{ + if (ticketMemory == NULL) + ticketMemory = + newRecycleBin(sizeof(Ticket), TICKET_BLOCK_SIZE); + + return allocatePointer(ticketMemory); +} + +static void newQueueTicket(IDnum id_a, IDnum id_b) +{ + Ticket *tkt = newTicket(); + tkt->id_a = id_a; + tkt->id_b = id_b; + tkt->next = ticketQueue; + ticketQueue = tkt; +} + +static boolean isPreviousToNode(Node * previous, Node * target) +{ + Node *currentNode = target; + Node *previousNode = NULL; + Time targetTime = getNodeTime(target); + + //printf("Testing if %ld is previous to %ld\n", getNodeID(previous), getNodeID(target)); + + while (true) { + if (currentNode == previous) + return true; + + if (currentNode == previousNode) + return false; + + if (getNodeTime(currentNode) != targetTime) + return false; + + previousNode = currentNode; + currentNode = getNodePrevious(currentNode); + } +} + +static void concatenateCommonTodoLists(Node * nodeA, Node * nodeB) +{ + Ticket **listA = &todoLists[getNodeID(nodeA) + nodeCount(graph)]; + Ticket **listB = &todoLists[getNodeID(nodeB) + nodeCount(graph)]; + Ticket *head = NULL; + Ticket *tail = NULL; + Ticket *tmp; + IDnum idA, idB; + IDnum targetID = getNodeID(nodeA); + IDnum indexA, indexB; + IDnum nodes = nodeCount(graph); + + //printf("Merging todo list %ld into %ld\n", getNodeID(nodeB), + // getNodeID(nodeA)); + + if (*listB == NULL) + return; + + if (*listA == NULL) { + *listA = *listB; + *listB = NULL; + return; + } + + while (*listA != NULL && *listB != NULL) { + idA = (*listA)->id_a; + idB = (*listB)->id_a; + indexA = 
idA + nodes; + indexB = idB + nodes; + + if (previous[indexA] == nodeA) { + tmp = *listA; + *listA = (*listA)->next; + deallocatePointer(ticketMemory, tmp); + continue; + } + + if (idB == targetID || previous[indexB] == nodeA) { + tmp = *listB; + *listB = (*listB)->next; + deallocatePointer(ticketMemory, tmp); + continue; + } + + if (idA > idB) { + tmp = *listB; + *listB = (*listB)->next; + } else if (idA < idB) { + tmp = *listA; + *listA = (*listA)->next; + } else { + tmp = *listB; + *listB = (*listB)->next; + deallocatePointer(ticketMemory, tmp); + + tmp = *listA; + *listA = (*listA)->next; + } + + if (tail == NULL) { + tail = tmp; + head = tail; + } else { + tail->next = tmp; + tail = tail->next; + } + } + + while (*listA != NULL) { + idA = (*listA)->id_a; + indexA = idA + nodes; + + if (previous[indexA] == nodeA) { + tmp = *listA; + *listA = (*listA)->next; + deallocatePointer(ticketMemory, tmp); + } else if (tail != NULL) { + tail->next = *listA; + *listA = (*listA)->next; + tail = tail->next; + } else { + head = *listA; + *listA = (*listA)->next; + tail = head; + } + } + + while (*listB != NULL) { + idB = (*listB)->id_a; + indexB = idB + nodes; + + if (idB == targetID || previous[indexB] == nodeA) { + tmp = *listB; + *listB = (*listB)->next; + deallocatePointer(ticketMemory, tmp); + } else if (tail != NULL) { + tail->next = *listB; + *listB = (*listB)->next; + tail = tail->next; + } else { + head = *listB; + *listB = (*listB)->next; + tail = head; + + } + } + + if (tail != NULL) + tail->next = NULL; + + *listA = head; + *listB = NULL; +} + +static void concatenateTodoListIntoActive(Node * nodeB) +{ + Ticket **listB = &todoLists[getNodeID(nodeB) + nodeCount(graph)]; + Ticket *head = NULL; + Ticket *tail = NULL; + Ticket *tmp; + IDnum nodes = nodeCount(graph); + IDnum idA, idB; + IDnum activeID = getNodeID(activeNode); + IDnum indexB, indexA; + + //printf("Merging todo list %ld into active node %ld\n", + // getNodeID(nodeB), getNodeID(activeNode)); + + if 
(*listB == NULL) + return; + + if (*todo == NULL) { + *todo = *listB; + *listB = NULL; + return; + } + + while (*todo != NULL && *listB != NULL) { + idA = (*todo)->id_a; + idB = (*listB)->id_a; + indexA = idA + nodes; + indexB = idB + nodes; + + if (previous[indexA] == activeNode + || progressStatus[indexA]) { + tmp = *todo; + *todo = (*todo)->next; + deallocatePointer(ticketMemory, tmp); + continue; + } + + if (idB == activeID || previous[indexB] == activeNode + || progressStatus[indexB]) { + tmp = *listB; + *listB = (*listB)->next; + deallocatePointer(ticketMemory, tmp); + continue; + } + + if (idA > idB) { + tmp = *listB; + *listB = (*listB)->next; + } else if (idA < idB) { + tmp = *todo; + *todo = (*todo)->next; + } else { + tmp = *listB; + *listB = (*listB)->next; + deallocatePointer(ticketMemory, tmp); + + tmp = *todo; + *todo = (*todo)->next; + } + + if (tail == NULL) { + tail = tmp; + head = tail; + } else { + tail->next = tmp; + tail = tmp; + } + } + + while (*todo != NULL) { + idA = (*todo)->id_a; + indexA = idA + nodes; + + if (previous[indexA] == activeNode + || progressStatus[indexA]) { + tmp = *todo; + *todo = (*todo)->next; + deallocatePointer(ticketMemory, tmp); + } else if (tail != NULL) { + tail->next = *todo; + *todo = (*todo)->next; + tail = tail->next; + } else { + head = *todo; + *todo = (*todo)->next; + tail = head; + } + } + + while (*listB != NULL) { + idB = (*listB)->id_a; + indexB = idB + nodes; + + if (idB == activeID || previous[indexB] == activeNode + || progressStatus[indexB]) { + tmp = *listB; + *listB = (*listB)->next; + deallocatePointer(ticketMemory, tmp); + } else if (tail != NULL) { + tail->next = *listB; + *listB = (*listB)->next; + tail = tail->next; + } else { + head = *listB; + *listB = (*listB)->next; + tail = head; + + } + } + + if (tail != NULL) + tail->next = NULL; + *todo = head; + *listB = NULL; +} + +static void concatenateTodoLists(Node * nodeA, Node * nodeB) +{ + if (nodeA == activeNode) + 
concatenateTodoListIntoActive(nodeB); + else + concatenateCommonTodoLists(nodeA, nodeB); +} + +static IDnum nextTodoTicket() +{ + Ticket *tkt; + IDnum index; + + while (*todo != NULL) { + tkt = *todo; + *todo = tkt->next; + + index = tkt->id_a + nodeCount(graph); + + if (previous[index] == activeNode) { + deallocatePointer(ticketMemory, tkt); + continue; + } + + progressStatus[index] = true; + + tkt->next = done; + done = tkt; + + return tkt->id_a; + } + + return 0; +} + +static void freeDoneTickets() +{ + Ticket *tkt; + IDnum nodes = nodeCount(graph); + + while (done != NULL) { + tkt = done; + done = tkt->next; + progressStatus[tkt->id_a + nodes] = false; + deallocatePointer(ticketMemory, tkt); + } +} + +static void updateNodeStatus(Node * node) +{ + FibHeap *heap = newFibHeap(); + Arc *arc; + Node *currentNode = node; + Node *destination; + + setNodeStatus(currentNode, true); + + while (currentNode != NULL) { + for (arc = getArc(currentNode); arc != NULL; + arc = getNextArc(arc)) { + destination = getDestination(arc); + if (getNodeStatus(destination) > 1) { + setNodeStatus(destination, true); + insertNodeIntoHeap(heap, + getNodeID(destination), + destination); + } + } + + currentNode = removeNextNodeFromHeap(heap); + } + + destroyHeap(heap); +} + +static void determineEligibleStartingPoints() +{ + IDnum nodeIndex; + IDnum maxmult; + Node *node; + Arc *arc; + IDnum counter = 0; + FibHeap *heap = newFibHeap(); + + puts("Determining eligible starting points"); + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex); + if (node == NULL) + continue; + + maxmult = 0; + for (arc = getArc(node); arc != NULL; + arc = getNextArc(arc)) + if (getMultiplicity(arc) > maxmult) + maxmult = getMultiplicity(arc); + + insertNodeIntoHeap(heap, -maxmult, node); + + // Same for twin node + node = getNodeInGraph(graph, -nodeIndex); + maxmult = 0; + for (arc = getArc(node); arc != NULL; + arc = getNextArc(arc)) + if 
(getMultiplicity(arc) > maxmult) + maxmult = getMultiplicity(arc); + + insertNodeIntoHeap(heap, -maxmult, node); + } + + while ((node = removeNextNodeFromHeap(heap)) != NULL) + eligibleStartingPoints[counter++] = getNodeID(node); + + destroyHeap(heap); + puts("Done listing starting nodes"); +} + +static Node *nextStartingPoint() +{ + static IDnum counter = 0; + Node *result = NULL; + + while (result == NULL || getNodeStatus(result) > 0) { + if (counter >= nodeCount(graph) * 2) + return NULL; + + result = + getNodeInGraph(graph, + eligibleStartingPoints[counter++]); + } + + return result; +} + +static boolean +extractSequence(PassageMarker * path, TightString * sequence) +{ + PassageMarker *marker; + Coordinate seqLength = 0; + Coordinate writeIndex = 0; + + //printf("Extracting sequence %ld ... ", pathLength); + + //Measure length + for (marker = getNextInSequence(path); !isTerminal(marker); + marker = getNextInSequence(marker)) + seqLength += getNodeLength(getNode(marker)); + + if (seqLength > MAXREADLENGTH) + return false; + else + setTightStringLength(sequence, seqLength); + + //Copy sequences + for (marker = getNextInSequence(path); !isTerminal(marker); + marker = getNextInSequence(marker)) { + appendNodeSequence(getNode(marker), sequence, writeIndex); + writeIndex += getNodeLength(getNode(marker)); + } + + return true; +} + +static Time max(Time A, Time B, Time C) +{ + if (A >= B && A >= C) + return A; + else if (B >= C) + return B; + else + return C; +} + +static boolean +compareSequences(TightString * sequence1, TightString * sequence2) +{ + Coordinate i, j; + Coordinate length1 = getLength(sequence1); + Coordinate length2 = getLength(sequence2); + Coordinate maxLength; + Time Choice1, Choice2, Choice3; + Time maxScore; + + if (length1 == 0 || length2 == 0) + return false; + + maxLength = (length1 > length2 ? 
length1 : length2); + + if (length1 < WORDLENGTH || length2 < WORDLENGTH) { + if (maxLength - length1 > MAXGAPS + || maxLength - length2 > MAXGAPS) + return false; + if (WORDLENGTH - length1 > MAXGAPS + || WORDLENGTH - length2 > MAXGAPS) + return false; + } + + for (i = 0; i <= length1; i++) + Fmatrix[i][0] = 0; + for (j = 0; j <= length2; j++) + Fmatrix[0][j] = 0; + + for (i = 1; i <= length1; i++) { + for (j = 1; j <= length2; j++) { + Choice1 = + Fmatrix[i - 1][j - 1] + + SIM[(int) getNucleotide(i - 1, sequence1)] + [(int) getNucleotide(j - 1, sequence2)]; + Choice2 = Fmatrix[i - 1][j] + INDEL; + Choice3 = Fmatrix[i][j - 1] + INDEL; + Fmatrix[i][j] = max(Choice1, Choice2, Choice3); + } + } + + maxScore = Fmatrix[length1][length2]; + + if (maxScore < maxLength - MAXGAPS) + return false; + + if ((1 - maxScore / maxLength) > MAXDIVERGENCE) + return false; + + return true; +} + +static void mapSlowOntoFast() +{ + Coordinate slowIndex = getLength(slowSequence); + Coordinate fastIndex = getLength(fastSequence); + int fastn, slown; + + if (slowIndex == 0) { + slowToFastMapping[0] = fastIndex; + + while (fastIndex >= 0) + fastToSlowMapping[fastIndex--] = 0; + + return; + } + + if (fastIndex == 0) { + fastToSlowMapping[0] = slowIndex; + + while (slowIndex >= 0) + slowToFastMapping[slowIndex--] = 0; + + return; + } + + while (slowIndex > 0 && fastIndex > 0) { + fastn = (int) getNucleotide(fastIndex - 1, fastSequence); + slown = (int) getNucleotide(slowIndex - 1, slowSequence); + + if (Fmatrix[fastIndex][slowIndex] == + Fmatrix[fastIndex - 1][slowIndex - 1] + + SIM[fastn][slown]) { + fastToSlowMapping[--fastIndex] = --slowIndex; + slowToFastMapping[slowIndex] = fastIndex; + } else if (Fmatrix[fastIndex][slowIndex] == + Fmatrix[fastIndex - 1][slowIndex] + INDEL) + fastToSlowMapping[--fastIndex] = slowIndex - 1; + + else if (Fmatrix[fastIndex][slowIndex] == + Fmatrix[fastIndex][slowIndex - 1] + INDEL) + slowToFastMapping[--slowIndex] = fastIndex - 1; + + else { + 
puts("Error"); + fflush(stdout); + abort(); + } + } + + while (slowIndex > 0) + slowToFastMapping[--slowIndex] = -1; + while (fastIndex > 0) + fastToSlowMapping[--fastIndex] = -1; + + slowToFastMapping[getLength(slowSequence)] = + getLength(fastSequence); + fastToSlowMapping[getLength(fastSequence)] = + getLength(slowSequence); +} + +static void createAnalogousArcAndVaccinate(Node * nodeA, Node * nodeB, + Arc * arc) +{ + boolean aNull = (getNodeLength(nodeA) == 0); + boolean bNull = (getNodeLength(nodeB) == 0); + + createAnalogousArc(nodeA, nodeB, arc, graph); + + if (aNull && bNull) + newQueueTicket(getNodeID(nodeA), getNodeID(nodeB)); +} + +static void remapNodeArcsOntoTarget(Node * source, Node * target) +{ + Arc *arc; + + if (source == activeNode) { + activeNode = target; + todo = + &todoLists[getNodeID(activeNode) + nodeCount(graph)]; + } + concatenateTodoLists(target, source); + + arc = getArc(source); + while (arc != NULL) { + createAnalogousArcAndVaccinate(target, getDestination(arc), + arc); + destroyArc(arc, graph); + arc = getArc(source); + } +} + +static void remapNodeArcsOntoNeighbour(Node * source, Node * target) +{ + remapNodeArcsOntoTarget(source, target); + remapNodeArcsOntoTarget(getTwinNode(source), getTwinNode(target)); +} + +static void remapNodeMarkersOntoNeighbour(Node * source, + PassageMarker * sourceMarker, + Node * target, + PassageMarker * targetMarker) +{ + PassageMarker *marker; + Coordinate offset; + IDnum sourceLength, index; + ShortReadMarker *sourceArray, *shortMarker; + Coordinate position; + Category cat; + + Coordinate targetStart = getPassageMarkerStart(targetMarker); + Coordinate targetFinish = getPassageMarkerFinish(targetMarker); + Coordinate sourceStart = getPassageMarkerStart(sourceMarker); + Coordinate sourceFinish = getPassageMarkerFinish(sourceMarker); + + Coordinate alignedTargetLength = targetFinish - targetStart; + Coordinate alignedSourceLength = sourceFinish - sourceStart; + + Coordinate realTargetLength = 
getNodeLength(target); + Coordinate realSourceLength = getNodeLength(source); + + while (getMarker(source) != NULL) { + marker = getMarker(source); + extractPassageMarker(marker); + transposePassageMarker(marker, target); + + if (realSourceLength != 0 && alignedTargetLength != 0) { + if (isInitial(marker)) { + offset = getStartOffset(marker); + offset *= alignedSourceLength; + offset /= realSourceLength; + offset += sourceStart; + offset = slowToFastMapping[offset]; + offset -= targetStart; + offset *= realTargetLength; + offset /= alignedTargetLength; + + if (offset < 0) + offset = 0; + if (offset > realTargetLength) + offset = realTargetLength; + } else + offset = 0; + + setStartOffset(marker, offset); + + if (isTerminal(marker)) { + offset = getFinishOffset(marker); + offset *= alignedSourceLength; + offset /= realSourceLength; + offset = sourceFinish - offset; + offset = slowToFastMapping[offset]; + offset = targetFinish - offset; + offset *= realTargetLength; + offset /= alignedTargetLength; + + if (offset < 0) + offset = 0; + if (offset > realTargetLength) + offset = realTargetLength; + } else + offset = 0; + + setFinishOffset(marker, offset); + } else { + setStartOffset(marker, 0); + setFinishOffset(marker, 0); + } + } + + // Short read markers + if (readStartsAreActivated(graph)) { + // Update Coordinates + sourceArray = getNodeReads(source, graph); + sourceLength = getNodeReadCount(source, graph); + + for (index = 0; index < sourceLength; index++) { + shortMarker = + getShortReadMarkerAtIndex(sourceArray, index); + position = getShortReadMarkerPosition(shortMarker); + + if (position > -1) { + if (realSourceLength > 0 + && alignedTargetLength > 0) { + position *= alignedSourceLength; + position /= realSourceLength; + position += sourceStart; + position = + slowToFastMapping[position]; + position -= targetStart; + position *= realTargetLength; + position /= alignedTargetLength; + + if (position < 0) + position = 0; + if (position > realTargetLength) + 
position = + realTargetLength; + } else + position = 0; + } + + setShortReadMarkerPosition(shortMarker, position); + } + mergeNodeReads(target, source, graph); + + // Same but for symmetrical reads + sourceArray = getNodeReads(getTwinNode(source), graph); + sourceLength = + getNodeReadCount(getTwinNode(source), graph); + + for (index = 0; index < sourceLength; index++) { + shortMarker = + getShortReadMarkerAtIndex(sourceArray, index); + position = getShortReadMarkerPosition(shortMarker); + + if (position > -1) { + if (realSourceLength > 0 + && alignedTargetLength > 0) { + position = + realSourceLength - position; + position *= alignedSourceLength; + position /= realSourceLength; + position += sourceStart; + position = + slowToFastMapping[position]; + position -= targetStart; + position *= realTargetLength; + position /= alignedTargetLength; + position = + realTargetLength - position; + + if (position < 0) + position = 0; + if (position > realTargetLength) + position = + realTargetLength; + } else + position = 0; + } + + setShortReadMarkerPosition(shortMarker, position); + } + mergeNodeReads(getTwinNode(target), getTwinNode(source), + graph); + } + // Virtual reads + for (cat = 0; cat < CATEGORIES; cat++) + incrementVirtualCoverage(target, cat, + getVirtualCoverage(source, cat)); +} + +static void remapBackOfNodeArcsOntoNeighbour(Node * source, Node * target) +{ + Arc *arc; + + remapNodeArcsOntoTarget(getTwinNode(source), getTwinNode(target)); + for (arc = getArc(source); arc != NULL; arc = getNextArc(arc)) + createAnalogousArcAndVaccinate(target, source, arc); + +} + +static Coordinate +remapBackOfNodeMarkersOntoNeighbour(Node * source, + PassageMarker * sourceMarker, + Node * target, + PassageMarker * targetMarker, + boolean slowToFast) +{ + PassageMarker *marker, *newMarker, *previousMarker, *nextMarker; + Coordinate halfwayPoint, halfwayPointOffset, breakpoint, + newStartOffset, newFinishOffset; + Category cat; + Coordinate coverage; + Coordinate 
*targetToSourceMapping, *sourceToTargetMapping; + ShortReadMarker *selectedShortReads, *shortRead; + IDnum selectedShortReadCount, shortReadIndex; + Coordinate position; + + Coordinate targetStart = getPassageMarkerStart(targetMarker); + Coordinate targetFinish = getPassageMarkerFinish(targetMarker); + Coordinate sourceStart = getPassageMarkerStart(sourceMarker); + Coordinate sourceFinish = getPassageMarkerFinish(sourceMarker); + + Coordinate alignedTargetLength = targetFinish - targetStart; + Coordinate alignedSourceLength = sourceFinish - sourceStart; + + Coordinate realTargetLength = getNodeLength(target); + Coordinate realSourceLength = getNodeLength(source); + + if (slowToFast) { + sourceToTargetMapping = slowToFastMapping; + targetToSourceMapping = fastToSlowMapping; + } else { + sourceToTargetMapping = fastToSlowMapping; + targetToSourceMapping = slowToFastMapping; + } + + // Calculating source node breakpoint: + if (alignedSourceLength > 0 && targetFinish > 0) { + halfwayPoint = + targetToSourceMapping[targetFinish - 1] - sourceStart + + 1; + halfwayPoint *= realSourceLength; + halfwayPoint /= alignedSourceLength; + } else + halfwayPoint = 0; + + if (halfwayPoint < 0) + halfwayPoint = 0; + if (halfwayPoint > realSourceLength) + halfwayPoint = realSourceLength; + halfwayPointOffset = realSourceLength - halfwayPoint; + + // Complete markers + for (marker = getMarker(source); marker != NULL; + marker = nextMarker) { + nextMarker = getNextInNode(marker); + + // To avoid making loops... 
+ if (getNode(getPreviousInSequence(marker)) == target) + continue; + + // Markers which are downstream of the breakpoint + if (isInitial(marker) + && getStartOffset(marker) > halfwayPoint) { + newStartOffset = + getStartOffset(marker) - halfwayPoint; + setStartOffset(marker, newStartOffset); + continue; + } + // Markers which are upstream of the breakpoint + if (isTerminal(marker) + && getFinishOffset(marker) > halfwayPointOffset) { + if (slowToFast) { + if (realSourceLength > 0 + && alignedTargetLength > 0) { + newFinishOffset = + getFinishOffset(marker) - + halfwayPointOffset; + newFinishOffset *= + alignedSourceLength; + newFinishOffset /= + realSourceLength; + newFinishOffset *= + realTargetLength; + newFinishOffset /= + alignedTargetLength; + if (newFinishOffset < 0) + newFinishOffset = 0; + else if (newFinishOffset > + realTargetLength) + newFinishOffset = + realTargetLength; + } else + newFinishOffset = 0; + } else { + newFinishOffset = + getFinishOffset(marker) - + halfwayPointOffset; + } + setFinishOffset(marker, newFinishOffset); + extractPassageMarker(marker); + transposePassageMarker(marker, target); + continue; + } + // Markers on both sides of the divide + newMarker = + addPassageMarker(getPassageMarkerSequenceID(marker), + getPassageMarkerStart(marker), + target); + + setPassageMarkerStart(newMarker, + getPassageMarkerStart(marker)); + setPassageMarkerStatus(newMarker, + getPassageMarkerStatus(marker)); + + if (realSourceLength - getStartOffset(marker) - + getFinishOffset(marker) > 0) { + breakpoint = halfwayPoint - getStartOffset(marker); + breakpoint *= getPassageMarkerLength(marker); + breakpoint /= realSourceLength - + getStartOffset(marker) - + getFinishOffset(marker); + breakpoint *= passageMarkerDirection(marker); + breakpoint += getPassageMarkerStart(marker); + } else { + breakpoint = getPassageMarkerStart(marker); + } + + setPassageMarkerFinish(newMarker, breakpoint); + setPassageMarkerStart(marker, breakpoint); + + if (slowToFast) { + if 
(realSourceLength != 0 + && alignedTargetLength != 0) { + newStartOffset = getStartOffset(marker); + newStartOffset *= alignedSourceLength; + newStartOffset /= realSourceLength; + newStartOffset *= realTargetLength; + newStartOffset /= alignedTargetLength; + if (newStartOffset < 0) + newStartOffset = 0; + else if (newStartOffset > realTargetLength) + newStartOffset = realTargetLength; + } else { + newStartOffset = 0; + } + } else { + newStartOffset = getStartOffset(marker); + } + + setStartOffset(newMarker, newStartOffset); + setFinishOffset(newMarker, 0); + setStartOffset(marker, 0); + + previousMarker = getPreviousInSequence(marker); + setNextInSequence(previousMarker, newMarker); + setPreviousInSequence(previousMarker, newMarker); + + setPreviousInSequence(newMarker, marker); + setNextInSequence(newMarker, marker); + } + + // Read starts + if (readStartsAreActivated(graph)) { + selectedShortReads = + extractBackOfNodeReads(source, halfwayPoint, graph, + &selectedShortReadCount, + sourceMarker, sequenceLengths); + if (slowToFast) { + if (realSourceLength > 0 + && alignedTargetLength > 0) { + for (shortReadIndex = 0; + shortReadIndex < + selectedShortReadCount; + shortReadIndex++) { + shortRead = + getShortReadMarkerAtIndex + (selectedShortReads, + shortReadIndex); + position = + getShortReadMarkerPosition + (shortRead); + if (position > -1) { + position *= + alignedSourceLength; + position /= + realSourceLength; + position += sourceStart; + position = + sourceToTargetMapping + [position]; + position -= targetStart; + position *= + realTargetLength; + position /= + alignedTargetLength; + if (position < 0) + position = 0; + if (position > + realTargetLength) + position = + realTargetLength; + } + setShortReadMarkerPosition + (shortRead, position); + } + } else { + for (shortReadIndex = 0; + shortReadIndex < + selectedShortReadCount; + shortReadIndex++) { + shortRead = + getShortReadMarkerAtIndex + (selectedShortReads, + shortReadIndex); + position = + 
getShortReadMarkerPosition + (shortRead); + if (position > -1) + setShortReadMarkerPosition + (shortRead, 0); + } + + } + } + injectShortReads(selectedShortReads, + selectedShortReadCount, target, graph); + + selectedShortReads = + extractFrontOfNodeReads(getTwinNode(source), + halfwayPoint, graph, + &selectedShortReadCount, + sourceMarker, sequenceLengths); + if (slowToFast) { + if (realSourceLength > 0 + && alignedTargetLength > 0) { + for (shortReadIndex = 0; + shortReadIndex < + selectedShortReadCount; + shortReadIndex++) { + shortRead = + getShortReadMarkerAtIndex + (selectedShortReads, + shortReadIndex); + position = + getShortReadMarkerPosition + (shortRead); + if (position > -1) { + position = + getShortReadMarkerPosition + (shortRead); + position = + realSourceLength - + position; + position *= + alignedSourceLength; + position /= + realSourceLength; + position += sourceStart; + position = + sourceToTargetMapping + [position]; + position -= targetStart; + position *= + realTargetLength; + position /= + alignedTargetLength; + position = + realTargetLength - + position; + if (position < 0) + position = 0; + if (position > + realTargetLength) + position = + realTargetLength; + } + setShortReadMarkerPosition + (shortRead, position); + } + } else { + for (shortReadIndex = 0; + shortReadIndex < + selectedShortReadCount; + shortReadIndex++) { + shortRead = + getShortReadMarkerAtIndex + (selectedShortReads, + shortReadIndex); + position = + getShortReadMarkerPosition + (shortRead); + if (position > -1) + setShortReadMarkerPosition + (shortRead, 0); + } + + } + } + injectShortReads(selectedShortReads, + selectedShortReadCount, + getTwinNode(target), graph); + } + // Virtual coverage + if (alignedSourceLength != 0) { + for (cat = 0; cat < CATEGORIES; cat++) { + coverage = getVirtualCoverage(source, cat); + coverage *= halfwayPoint; + coverage /= alignedSourceLength; + incrementVirtualCoverage(target, cat, coverage); + incrementVirtualCoverage(source, cat, 
-coverage); + + coverage = getOriginalVirtualCoverage(source, cat); + coverage *= halfwayPoint; + coverage /= alignedSourceLength; + incrementOriginalVirtualCoverage(source, cat, + -coverage); + } + } + + return halfwayPointOffset; +} + +static void remapNodeInwardReferencesOntoNode(Node * source, Node * target) +{ + Arc *arc; + Node *destination; + + for (arc = getArc(source); arc != NULL; arc = getNextArc(arc)) { + destination = getDestination(arc); + if (destination == target || destination == source) + continue; + if (getNodePrevious(destination) == source) + setNodePrevious(target, destination); + } +} + +static void remapNodeTimesOntoTargetNode(Node * source, Node * target) +{ + Time nodeTime = getNodeTime(source); + Node *previous = getNodePrevious(source); + Time targetTime = getNodeTime(target); + + if (nodeTime == -1) + return; + + if (previous == source) { + setNodeTime(target, nodeTime); + setNodePrevious(target, target); + } else if (targetTime == -1 + || targetTime > nodeTime + || (targetTime == nodeTime + && !isPreviousToNode(target, previous))) { + setNodeTime(target, nodeTime); + if (previous != getTwinNode(source)) + setNodePrevious(previous, target); + else + setNodePrevious(getTwinNode(target), target); + } + + remapNodeInwardReferencesOntoNode(source, target); + + setNodePrevious(NULL, source); +} + +static void foldSymmetricalNode(Node * node) +{ + Node *twinNode = getTwinNode(node); + Node *tmp, *destination; + Arc *arc; + PassageMarker *oldMarker = getMarker(node); + PassageMarker *currentMarker, *newMarker, *previousMarker; + Coordinate halfwayPoint; + IDnum totalMult; + + // Reduce time complexity of damn thing + if (simpleArcCount(node) < simpleArcCount(twinNode)) { + tmp = twinNode; + twinNode = node; + node = tmp; + } + // Destroy link to old markers + setMarker(node, NULL); + + // Reinsert markers properly + while (oldMarker != NULL) { + currentMarker = oldMarker; + oldMarker = getNextInNode(currentMarker); + previousMarker = 
getPreviousInSequence(currentMarker); + + if (getNode(previousMarker) != twinNode) { + newMarker = + addUncertainPassageMarker + (getPassageMarkerSequenceID(currentMarker), + twinNode); + setPassageMarkerStatus(newMarker, + getPassageMarkerStatus + (currentMarker)); + + setPassageMarkerStart(newMarker, + getPassageMarkerStart + (currentMarker)); + + // For security issues: + if (currentMarker == slowPath) + slowPath = newMarker; + else if (currentMarker == fastPath) + fastPath = newMarker; + + halfwayPoint = + (getPassageMarkerStart(currentMarker) + + getPassageMarkerFinish(currentMarker)) + / 2; + setPassageMarkerFinish(newMarker, halfwayPoint); + setPassageMarkerStart(currentMarker, halfwayPoint); + + setStartOffset(newMarker, + getStartOffset(currentMarker)); + setFinishOffset(newMarker, 0); + setStartOffset(currentMarker, 0); + + setPreviousInSequence(previousMarker, newMarker); + setNextInSequence(previousMarker, newMarker); + + setPreviousInSequence(newMarker, currentMarker); + setNextInSequence(newMarker, currentMarker); + } + + transposePassageMarker(currentMarker, node); + } + + // Read start info + foldSymmetricalNodeReads(node, graph); + + // Coverage => already balanced! 
+ + // References + if (getNodeTime(node) == -1 && getNodeTime(twinNode) == -1); + else if (getNodeTime(node) == -1) { + setNodeTime(node, getNodeTime(twinNode)); + } else if (getNodeTime(twinNode) == -1) { + setNodeTime(twinNode, getNodeTime(node)); + setNodePrevious(getNodePrevious(node), twinNode); + } else if (getNodePrevious(node) == node) { + setNodeTime(twinNode, getNodeTime(node)); + setNodePrevious(twinNode, twinNode); + } else if (getNodeTime(node) < getNodeTime(twinNode)) { + setNodeTime(twinNode, getNodeTime(node)); + setNodePrevious(getNodePrevious(node), twinNode); + } else if (getNodeTime(node) == getNodeTime(twinNode) + && isPreviousToNode(node, twinNode)) { + setNodePrevious(getNodePrevious(node), twinNode); + } else { + setNodeTime(node, getNodeTime(twinNode)); + } + + setNodePrevious(twinNode, node); + remapNodeInwardReferencesOntoNode(twinNode, node); + + // Active node + if (twinNode == activeNode) { + activeNode = node; + todo = + &todoLists[getNodeID(activeNode) + nodeCount(graph)]; + } + concatenateTodoLists(node, twinNode); + + // Remap arcs properly + arc = getArc(twinNode); + totalMult = 0; + while (arc != NULL) { + destination = getDestination(arc); + if (destination != node) + createAnalogousArc(node, destination, arc, graph); + totalMult += getMultiplicity(arc); + destroyArc(arc, graph); + arc = getArc(twinNode); + } + + arc = createArc(twinNode, node, graph); + setMultiplicity(arc, totalMult); + + // Uniqueness + setUniqueness(node, false); + + // Starting node + if (startingNode == node) + startingNode = twinNode; +} + +static void remapNodeTimesOntoNeighbour(Node * source, Node * target) +{ + remapNodeTimesOntoTargetNode(source, target); + remapNodeTimesOntoTargetNode(getTwinNode(source), + getTwinNode(target)); +} + +static void remapNodeTimesOntoForwardMiddlePath(Node * source, + PassageMarker * path) +{ + PassageMarker *marker; + Node *target; + Time nodeTime = getNodeTime(source); + Node *previousNode = getNodePrevious(source); 
+ Time targetTime; + + //printf("Remapping times from %ld to %ld\n", getNodeID(previousNode), getNodeID(source)); + + for (marker = path; getNode(marker) != source; + marker = getNextInSequence(marker)) { + target = getNode(marker); + targetTime = getNodeTime(target); + + //printf("Through %ld\n", getNodeID(target)); + + if (targetTime == -1 + || targetTime > nodeTime + || (targetTime == nodeTime + && !isPreviousToNode(target, previousNode))) { + setNodeTime(target, nodeTime); + setNodePrevious(previousNode, target); + } + + previousNode = target; + } + + setNodePrevious(previousNode, source); + +} + +static void remapNodeTimesOntoTwinMiddlePath(Node * source, + PassageMarker * path) +{ + PassageMarker *marker; + Node *target; + Node *previousNode = getTwinNode(source); + Time targetTime; + PassageMarker *limit = getTwinMarker(getPreviousInSequence(path)); + Time nodeTime = getNodeTime(getNode(limit)); + + //printf("Remapping times from twins %ld to %ld\n", getNodeID(previousNode), getNodeID(getNode(limit))); + + // Revving up + marker = path; + while (getNode(marker) != source) + marker = getNextInSequence(marker); + marker = getTwinMarker(marker); + + // Going down the path + while (marker != limit) { + marker = getNextInSequence(marker); + target = getNode(marker); + targetTime = getNodeTime(target); + + //printf("Through %ld\n", getNodeID(target)); + + if (targetTime == -1 + || targetTime > nodeTime + || (targetTime == nodeTime + && !isPreviousToNode(target, previousNode))) { + setNodeTime(target, nodeTime); + getNodeTime(target); + setNodePrevious(previousNode, target); + } + + previousNode = target; + } +} + +static void +remapNodeFibHeapReferencesOntoNode(Node * source, Node * target) +{ + DFibHeapNode *sourceDHeapNode = getNodeDHeapNode(source); + DFibHeapNode *targetDHeapNode = getNodeDHeapNode(target); + + if (sourceDHeapNode == NULL) + return; + + if (targetDHeapNode == NULL) { + setNodeDHeapNode(target, sourceDHeapNode); + 
replaceValueInDHeap(sourceDHeapNode, target); + } else if (getKey(targetDHeapNode) > getKey(sourceDHeapNode)) { + setNodeDHeapNode(target, sourceDHeapNode); + replaceValueInDHeap(sourceDHeapNode, target); + destroyNodeInDHeap(targetDHeapNode, dheap); + } else + destroyNodeInDHeap(sourceDHeapNode, dheap); + + setNodeDHeapNode(source, NULL); +} + +static void remapNodeOntoNeighbour(Node * source, + PassageMarker * sourceMarker, + Node * target, + PassageMarker * targetMarker) +{ + //printf("Remapping node %ld onto middle path %ld\n", getNodeID(source), getNodeID(target)); + remapNodeMarkersOntoNeighbour(source, sourceMarker, target, + targetMarker); + + remapNodeTimesOntoNeighbour(source, target); + remapNodeArcsOntoNeighbour(source, target); + + remapNodeFibHeapReferencesOntoNode(getTwinNode(source), + getTwinNode(target)); + remapNodeFibHeapReferencesOntoNode(source, target); + + if (startingNode == source) + startingNode = target; + if (startingNode == getTwinNode(source)) + startingNode = getTwinNode(target); + + destroyNode(source, graph); +} + +static void remapBackOfNodeDescriptorOntoNeighbour(Node * source, + PassageMarker * + sourceMarker, + Node * target, + PassageMarker * + targetMarker, + boolean slowToFast, + Coordinate offset) +{ + //printf("Splitting node descriptor %ld // %ld\n", getNodeLength(source), offset); + + if (slowToFast) + splitNodeDescriptor(source, NULL, offset); + else + splitNodeDescriptor(source, target, offset); +} + +static void remapBackOfNodeTimesOntoNeighbour(Node * source, Node * target) +{ + Time targetTime = getNodeTime(target); + Time nodeTime = getNodeTime(source); + Node *twinTarget = getTwinNode(target); + Node *twinSource = getTwinNode(source); + Node *previous; + + if (nodeTime != -1) { + previous = getNodePrevious(source); + + if (previous == source) { + setNodeTime(target, nodeTime); + setNodePrevious(target, target); + } else if (targetTime == -1 + || targetTime > nodeTime + || (targetTime == nodeTime + && 
!isPreviousToNode(target, previous))) { + setNodeTime(target, nodeTime); + if (previous != getTwinNode(source)) + setNodePrevious(previous, target); + else + setNodePrevious(getTwinNode(target), + target); + } + + setNodePrevious(target, source); + } + + targetTime = getNodeTime(twinTarget); + nodeTime = getNodeTime(twinSource); + + if (nodeTime != -1) { + if (targetTime == -1 + || targetTime > nodeTime + || (targetTime == nodeTime + && !isPreviousToNode(twinTarget, twinSource))) { + setNodeTime(twinTarget, nodeTime); + setNodePrevious(twinSource, twinTarget); + } + } + + remapNodeInwardReferencesOntoNode(twinSource, twinTarget); +} + +static void +remapBackOfNodeOntoNeighbour(Node * source, PassageMarker * sourceMarker, + Node * target, PassageMarker * targetMarker, + boolean slowToFast) +{ + Coordinate offset; + //printf("Remapping node %ld onto middle path\n", getNodeID(node)); + + offset = + remapBackOfNodeMarkersOntoNeighbour(source, sourceMarker, + target, targetMarker, + slowToFast); + remapBackOfNodeDescriptorOntoNeighbour(source, sourceMarker, + target, targetMarker, + slowToFast, offset); + remapBackOfNodeTimesOntoNeighbour(source, target); + remapBackOfNodeArcsOntoNeighbour(source, target); + + remapNodeFibHeapReferencesOntoNode(getTwinNode(source), + getTwinNode(target)); + + if (getTwinNode(source) == startingNode) + startingNode = getTwinNode(target); +} + +static void remapEmptyPathArcsOntoMiddlePathSimple(PassageMarker * + emptyPath, + PassageMarker * + targetPath) +{ + PassageMarker *pathMarker; + Node *start = getNode(getPreviousInSequence(emptyPath)); + Node *finish = getNode(emptyPath); + Node *previousNode = start; + Node *currentNode; + Arc *originalArc = getArcBetweenNodes(start, finish, graph); + + for (pathMarker = targetPath; getNode(pathMarker) != finish; + pathMarker = getNextInSequence(pathMarker)) { + currentNode = getNode(pathMarker); + createAnalogousArcAndVaccinate(previousNode, currentNode, + originalArc); + previousNode = 
currentNode; + } + + createAnalogousArcAndVaccinate(previousNode, finish, originalArc); + + destroyArc(originalArc, graph); +} + +static void remapEmptyPathMarkersOntoMiddlePathSimple(PassageMarker * + emptyPath, + PassageMarker * + targetPath) +{ + PassageMarker *marker, *newMarker, *previousMarker, *pathMarker; + Node *start = getNode(getPreviousInSequence(emptyPath)); + Node *finish = getNode(emptyPath); + PassageMarker *oldMarker = getMarker(finish); + Coordinate markerStart; + IDnum intersectionLength, twinIntersectionLength; + ShortReadMarker *intersectionReads = + commonNodeReads(start, finish, graph, &intersectionLength); + ShortReadMarker *twinIntersectionReads = + commonNodeReads(getTwinNode(start), getTwinNode(finish), graph, + &twinIntersectionLength); + + //printf("SIMPLE %ld\t%ld\t%i\t%i\n", markerCount(finish), + // getNodeID(finish), arcCount(finish), + // arcCount(getTwinNode(finish))); + + // Destroy link to old nodes + setMarker(finish, NULL); + + while (oldMarker != NULL) { + marker = oldMarker; + oldMarker = getNextInNode(marker); + newMarker = getPreviousInSequence(marker); + + if (getNode(newMarker) != start) { + transposePassageMarker(marker, finish); + continue; + } + + markerStart = getPassageMarkerStart(marker); + for (pathMarker = targetPath; + getNode(pathMarker) != finish; + pathMarker = getNextInSequence(pathMarker)) { + previousMarker = newMarker; + + newMarker = + addUncertainPassageMarker + (getPassageMarkerSequenceID(marker), + getNode(pathMarker)); + setPassageMarkerStatus(newMarker, + getPassageMarkerStatus + (marker)); + setPassageMarkerStart(newMarker, markerStart); + setPassageMarkerFinish(newMarker, markerStart); + + setNextInSequence(previousMarker, newMarker); + setPreviousInSequence(previousMarker, newMarker); + + setStartOffset(newMarker, 0); + setFinishOffset(newMarker, 0); + + } + + setNextInSequence(newMarker, marker); + setPreviousInSequence(newMarker, marker); + transposePassageMarker(marker, finish); + } + + if 
(readStartsAreActivated(graph)) { + for (pathMarker = targetPath; + getNode(pathMarker) != finish; + pathMarker = getNextInSequence(pathMarker)) { + // Read starts + spreadReadIDs(intersectionReads, + intersectionLength, + getNode(pathMarker), graph); + spreadReadIDs(twinIntersectionReads, + twinIntersectionLength, + getTwinNode(getNode(pathMarker)), + graph); + } + } + + free(intersectionReads); + free(twinIntersectionReads); +} + +static boolean markerFollowsPath(PassageMarker * marker, + PassageMarker * start, + PassageMarker * finish, Node * stopNode) +{ + PassageMarker *current, *path; + + path = start; + current = marker; + + while (true) { + if (current == NULL || path == finish || path == NULL) + return true; + + if (getNode(current) != getNode(path)) + return false; + + current = getNextInSequence(current); + path = getNextInSequence(path); + } +} + +static PassageMarkerList *getAnchors(PassageMarker * marker, Node * nodeA, + Node * nodeB) +{ + PassageMarker *current, *next; + Node *twinA = getTwinNode(nodeA); + Node *twinB = getTwinNode(nodeB); + PassageMarkerList *result = NULL; + + current = marker; + while (current != NULL) { + next = getNextInSequence(current); + if (getNode(current) == nodeA && getNode(next) == nodeB) { + result = newPassageMarkerList(next, result); + } + if (getNode(current) == twinB && getNode(next) == twinA) { + result = + newPassageMarkerList(getTwinMarker(current), + result); + } + current = next; + } + + return result; +} + +static void destroyPassageMarkerList(PassageMarkerList ** list) +{ + PassageMarkerList *ptr; + + while (*list != NULL) { + ptr = *list; + *list = ptr->next; + deallocatePassageMarkerList(ptr); + } +} + +static void remapEmptyPathMarkersOntoMiddlePathDevious(PassageMarker * + emptyPath, + PassageMarker * + targetPath) +{ + PassageMarker *marker, *newMarker, *previousMarker, *pathMarker; + Node *start = getNode(getPreviousInSequence(emptyPath)); + Node *finish = getNode(emptyPath); + PassageMarkerList 
*anchors = getAnchors(targetPath, start, finish); + PassageMarkerList *currentAnchor; + boolean untouchable = false; + Coordinate markerStart; + + printf("DEVIOUS %d\t%d\t%i\t%i\n", markerCount(finish), + getNodeID(finish), arcCount(finish), + arcCount(getTwinNode(finish))); + + for (marker = getMarker(finish); marker != NULL; + marker = getNextInNode(marker)) { + newMarker = getPreviousInSequence(marker); + + if (getNode(newMarker) != start) + continue; + + + for (currentAnchor = anchors; currentAnchor != NULL; + currentAnchor = currentAnchor->next) + if (markerFollowsPath + (marker, currentAnchor->marker, targetPath, + finish)) { + untouchable = true; + break; + } + + if (untouchable) + continue; + + markerStart = getPassageMarkerStart(marker); + for (pathMarker = targetPath; + getNode(pathMarker) != finish; + pathMarker = getNextInSequence(pathMarker)) { + previousMarker = newMarker; + newMarker = + addUncertainPassageMarker + (getPassageMarkerSequenceID(marker), + getNode(pathMarker)); + setPassageMarkerStatus(newMarker, + getPassageMarkerStatus + (marker)); + setPassageMarkerStart(newMarker, markerStart); + setPassageMarkerFinish(newMarker, markerStart); + + setNextInSequence(previousMarker, newMarker); + setPreviousInSequence(previousMarker, newMarker); + + setStartOffset(newMarker, 0); + setFinishOffset(newMarker, 0); + } + + setNextInSequence(newMarker, marker); + setPreviousInSequence(newMarker, marker); + } + + destroyPassageMarkerList(&anchors); +} + +static boolean markerLeadsToArc(PassageMarker * marker, Node * nodeA, + Node * nodeB) +{ + PassageMarker *current, *next; + Node *twinA = getTwinNode(nodeA); + Node *twinB = getTwinNode(nodeB); + + current = marker; + while (current != NULL) { + next = getNextInSequence(current); + if (getNode(current) == nodeA && getNode(next) == nodeB) + return true; + if (getNode(current) == twinB && getNode(next) == twinA) + return true; + current = next; + } + + return false; +} + +static void 
+remapEmptyPathOntoMiddlePath(PassageMarker * emptyPath, + PassageMarker * targetPath) +{ + Node *start = getNode(getPreviousInSequence(emptyPath)); + Node *finish = getNode(emptyPath); + + // Remapping markers + if (!markerLeadsToArc(targetPath, start, finish)) { + remapEmptyPathArcsOntoMiddlePathSimple(emptyPath, + targetPath); + remapEmptyPathMarkersOntoMiddlePathSimple(emptyPath, + targetPath); + } else { + remapEmptyPathMarkersOntoMiddlePathDevious(emptyPath, + targetPath); + } + + //Remap times and previous(if necessary) + if (getNodePrevious(finish) == start) + remapNodeTimesOntoForwardMiddlePath(finish, targetPath); + + if (getNodePrevious(getTwinNode(start)) == getTwinNode(finish)) + remapNodeTimesOntoTwinMiddlePath(finish, targetPath); +} + +static void reduceSlowNodes(PassageMarker * slowMarker, Node * finish) +{ + PassageMarker *marker; + + for (marker = slowMarker; getNode(marker) != finish; + marker = getNextInSequence(marker)) { + reduceNode(getNode(marker)); + } +} + +static void destroyPaths() +{ + PassageMarker *marker; + + while (slowPath != NULL) { + marker = slowPath; + getNodeTime(getNode(marker)); + getNodeTime(getTwinNode(getNode(marker))); + + slowPath = getNextInSequence(marker); + destroyPassageMarker(marker); + } + + while (fastPath != NULL) { + marker = fastPath; + getNodeTime(getNode(marker)); + getNodeTime(getTwinNode(getNode(marker))); + fastPath = getNextInSequence(marker); + destroyPassageMarker(marker); + } +} + +static Coordinate mapDistancesOntoPaths() +{ + PassageMarker *marker; + Coordinate totalDistance = 0; + + marker = slowPath; + while (!isTerminal(marker)) { + marker = getNextInSequence(marker); + setPassageMarkerStart(marker, totalDistance); + totalDistance += getNodeLength(getNode(marker)); + setPassageMarkerFinish(marker, totalDistance); + } + + totalDistance = 0; + marker = fastPath; + while (!isTerminal(marker)) { + marker = getNextInSequence(marker); + setPassageMarkerStart(marker, totalDistance); + totalDistance += 
getNodeLength(getNode(marker)); + setPassageMarkerFinish(marker, totalDistance); + } + + return totalDistance; +} + +static boolean markerLeadsToNode(PassageMarker * marker, Node * node) +{ + PassageMarker *currentMarker; + + for (currentMarker = marker; currentMarker != NULL; + currentMarker = getNextInSequence(currentMarker)) + if (getNode(currentMarker) == node) + return true; + + return false; +} + +static void transferNodeData(Node * source, Node * target) +{ + Arc *arc; + Node *twinSource = getTwinNode(source); + Node *twinTarget = getTwinNode(target); + Node *destination; + + // Time & Outward references + if (getNodePrevious(source) == source) { + setNodeTime(target, getNodeTime(source)); + setNodePrevious(target, target); + } + + if (getNodeTime(twinSource) == -1); + else if (getNodePrevious(twinSource) == twinSource) { + setNodeTime(twinTarget, getNodeTime(twinSource)); + setNodePrevious(twinTarget, twinTarget); + } else if (getNodeTime(twinTarget) == -1 + || getNodeTime(twinSource) < getNodeTime(twinTarget) + || (getNodeTime(twinSource) == getNodeTime(twinTarget) + && !isPreviousToNode(twinTarget, twinSource))) { + setNodeTime(twinTarget, getNodeTime(twinSource)); + setNodePrevious(getNodePrevious(twinSource), twinTarget); + } + + if (getNodePrevious(twinTarget) == source) + setNodePrevious(target, twinTarget); + + // Inward references: + for (arc = getArc(source); arc != NULL; arc = getNextArc(arc)) { + destination = getDestination(arc); + if (getNodePrevious(destination) == source) + setNodePrevious(target, destination); + } + + // Fib Heap refs + remapNodeFibHeapReferencesOntoNode(source, target); + remapNodeFibHeapReferencesOntoNode(twinSource, twinTarget); + + // Starting point + if (startingNode == source) + startingNode = target; + else if (startingNode == twinSource) + startingNode = twinTarget; + + if (getNode(slowPath) == twinSource) + slowPath = getNextInSequence(slowPath); + if (getNode(fastPath) == twinSource) + fastPath = 
getNextInSequence(fastPath); + + // Next node + if (source == activeNode) { + activeNode = target; + todo = + &todoLists[getNodeID(activeNode) + nodeCount(graph)]; + } + concatenateTodoLists(target, source); + + if (twinSource == activeNode) { + activeNode = twinTarget; + todo = + &todoLists[getNodeID(activeNode) + nodeCount(graph)]; + } +} + +// Replaces two consecutive nodes into a single equivalent node +// The extra memory is freed +static void concatenateNodesAndVaccinate(Node * nodeA, Node * nodeB, + Graph * graph) +{ + PassageMarker *marker, *tmpMarker; + Node *twinA = getTwinNode(nodeA); + Node *twinB = getTwinNode(nodeB); + Arc *arc; + Category cat; + + //printf("Concatenating nodes %ld and %ld\n", getNodeID(nodeA), getNodeID(nodeB)); + // Arc management: + // Freeing useless arcs + while (getArc(nodeA) != NULL) + destroyArc(getArc(nodeA), graph); + + // Correct arcs + for (arc = getArc(nodeB); arc != NULL; arc = getNextArc(arc)) { + if (getDestination(arc) != twinB) + createAnalogousArcAndVaccinate(nodeA, + getDestination(arc), + arc); + else + createAnalogousArcAndVaccinate(nodeA, twinA, arc); + } + + // Passage marker management in node A: + for (marker = getMarker(nodeA); marker != NULL; + marker = getNextInNode(marker)) + if (isTerminal(marker)) + incrementFinishOffset(marker, + getNodeLength(nodeB)); + + // Swapping new born passageMarkers from B to A + for (marker = getMarker(nodeB); marker != NULL; marker = tmpMarker) { + tmpMarker = getNextInNode(marker); + + if (isInitial(marker)) { + extractPassageMarker(marker); + transposePassageMarker(marker, nodeA); + incrementStartOffset(marker, getNodeLength(nodeA)); + } else + disconnectNextPassageMarker(getPreviousInSequence + (marker), graph); + } + + // Read starts + concatenateReadStarts(nodeA, nodeB, graph); + + // Descriptor management + appendDescriptors(nodeA, nodeB); + + // Update uniqueness: + setUniqueness(nodeA, getUniqueness(nodeA) || getUniqueness(nodeB)); + + // Update virtual coverage + 
for (cat = 0; cat < CATEGORIES; cat++) + incrementVirtualCoverage(nodeA, cat, + getVirtualCoverage(nodeB, cat)); + + // Update virtual coverage + for (cat = 0; cat < CATEGORIES; cat++) + incrementOriginalVirtualCoverage(nodeA, cat, + getOriginalVirtualCoverage + (nodeB, cat)); + + // Freeing gobbled node + destroyNode(nodeB, graph); +} + +static void simplifyNode(Node * node) +{ + Node *twin = getTwinNode(node); + Node *destination, *twinDestination; + + if (!hasSingleArc(node)) + return; + + destination = getDestination(getArc(node)); + twinDestination = getTwinNode(destination); + + while (hasSingleArc(node) + && hasSingleArc(twinDestination) + && destination != twin && destination != node) { + transferNodeData(destination, node); + concatenateNodesAndVaccinate(node, destination, graph); + + if (!hasSingleArc(node)) + return; + destination = getDestination(getArc(node)); + twinDestination = getTwinNode(destination); + } + +} + +static void concatenatePathNodes(PassageMarker * pathStart) +{ + PassageMarker *pathMarker; + + //puts("Removing null loops"); + for (pathMarker = pathStart; pathMarker != NULL; + pathMarker = getNextInSequence(pathMarker)) { + simplifyNode(getNode(pathMarker)); + } +} + +#define SLOW_TO_FAST true +#define FAST_TO_SLOW false + +static void cleanUpRedundancy() +{ + PassageMarker *slowMarker = getNextInSequence(slowPath); + PassageMarker *fastMarker = getNextInSequence(fastPath); + Coordinate slowLength, fastLength; + Coordinate fastConstraint = 0; + Coordinate slowConstraint = 0; + Coordinate finalLength; + Node *slowNode, *fastNode; + + //puts("Correcting new redundancy"); + mapSlowOntoFast(); + finalLength = mapDistancesOntoPaths(); + + while (slowMarker != NULL && fastMarker != NULL) { + if (isTerminal(slowMarker)) + slowLength = finalLength; + else { + slowLength = + slowToFastMapping[getPassageMarkerFinish + (slowMarker) - 1]; + if (slowLength < slowConstraint) + slowLength = slowConstraint; + } + + fastLength = 
getPassageMarkerFinish(fastMarker) - 1; + if (fastLength < fastConstraint) + fastLength = fastConstraint; + + slowNode = getNode(slowMarker); + fastNode = getNode(fastMarker); + + if (slowNode == fastNode) { + if (fastLength > slowLength) + slowConstraint = fastLength; + else if (fastLength < slowLength) + fastConstraint = slowLength; + + slowMarker = getNextInSequence(slowMarker); + fastMarker = getNextInSequence(fastMarker); + } else if (slowNode == getTwinNode(fastNode)) { + if (fastLength > slowLength) + slowConstraint = fastLength; + else if (fastLength < slowLength) + fastConstraint = slowLength; + + slowMarker = getNextInSequence(slowMarker); + fastMarker = getNextInSequence(fastMarker); + foldSymmetricalNode(fastNode); + } else if (markerLeadsToNode(slowMarker, fastNode)) { + reduceSlowNodes(slowMarker, fastNode); + remapEmptyPathOntoMiddlePath(fastMarker, + slowMarker); + while (getNode(slowMarker) != fastNode) + slowMarker = getNextInSequence(slowMarker); + } else if (markerLeadsToNode(fastMarker, slowNode)) { + remapEmptyPathOntoMiddlePath(slowMarker, + fastMarker); + while (getNode(fastMarker) != slowNode) + fastMarker = getNextInSequence(fastMarker); + } else if (slowLength == fastLength) { + remapNodeOntoNeighbour(slowNode, slowMarker, + fastNode, fastMarker); + slowMarker = getNextInSequence(slowMarker); + fastMarker = getNextInSequence(fastMarker); + } else if (slowLength < fastLength) { + remapBackOfNodeOntoNeighbour(fastNode, fastMarker, + slowNode, slowMarker, + FAST_TO_SLOW); + slowMarker = getNextInSequence(slowMarker); + } else { + remapBackOfNodeOntoNeighbour(slowNode, slowMarker, + fastNode, fastMarker, + SLOW_TO_FAST); + fastMarker = getNextInSequence(fastMarker); + } + + fflush(stdout); + } + + //puts("Done with path"); + + while (!isInitial(slowPath)) + slowPath = getPreviousInSequence(slowPath); + while (!isInitial(fastPath)) + fastPath = getPreviousInSequence(fastPath); + + //puts("Concatenation"); + + // Freeing up memory + if 
(slowMarker != NULL) + concatenatePathNodes(slowPath); + else + concatenatePathNodes(fastPath); + + //puts("Vaccinating"); + + destroyPaths(); + + // Cleaning up silly structures + //vaccinatePath(&returnValue); + + //puts("Clean up done"); + //fflush(stdout); +} + +static void comparePaths(Node * destination, Node * origin) +{ + IDnum slowLength, fastLength; + Node *fastNode, *slowNode; + IDnum i; + PassageMarker *marker; + + //Measure lengths + slowLength = fastLength = 0; + fastNode = destination; + slowNode = origin; + + while (fastNode != slowNode) { + if (getNodeTime(fastNode) > getNodeTime(slowNode)) { + fastLength++; + fastNode = getNodePrevious(fastNode); + } else if (getNodeTime(fastNode) < getNodeTime(slowNode)) { + slowLength++; + slowNode = getNodePrevious(slowNode); + } else if (isPreviousToNode(slowNode, fastNode)) { + while (fastNode != slowNode) { + fastLength++; + fastNode = getNodePrevious(fastNode); + } + } else if (isPreviousToNode(fastNode, slowNode)) { + while (slowNode != fastNode) { + slowLength++; + slowNode = getNodePrevious(slowNode); + } + } else { + fastLength++; + fastNode = getNodePrevious(fastNode); + slowLength++; + slowNode = getNodePrevious(slowNode); + } + + if (slowLength > MAXNODELENGTH + || fastLength > MAXNODELENGTH) + return; + } + + if (fastLength == 0) + return; + + //Backtracking to record actual paths + fastPath = addUncertainPassageMarker(1, destination); + setPassageMarkerStatus(fastPath, true); + + for (i = 0; i < fastLength; i++) { + marker = + addUncertainPassageMarker(1, + getNodePrevious(getNode + (fastPath))); + setPassageMarkerStatus(marker, true); + connectPassageMarkers(marker, fastPath, graph); + fastPath = marker; + } + + slowPath = addUncertainPassageMarker(2, destination); + setPassageMarkerStatus(slowPath, true); + + marker = addUncertainPassageMarker(2, origin); + setPassageMarkerStatus(marker, true); + connectPassageMarkers(marker, slowPath, graph); + slowPath = marker; + + for (i = 0; i < slowLength; 
i++) { + marker = + addUncertainPassageMarker(2, + getNodePrevious(getNode + (slowPath))); + setPassageMarkerStatus(marker, true); + connectPassageMarkers(marker, slowPath, graph); + slowPath = marker; + } + + //Extract sequences + if (!extractSequence(fastPath, fastSequence) + || !extractSequence(slowPath, slowSequence)) { + destroyPaths(); + return; + } + //Compare sequences + if (compareSequences(fastSequence, slowSequence)) { + cleanUpRedundancy(); + return; + } + //puts("\tFinished comparing paths, changes made"); + destroyPaths(); +} + +static void tourBusArc(Node * origin, Arc * arc, Time originTime) +{ + Node *destination = getDestination(arc); + Time arcTime, totalTime, destinationTime; + IDnum nodeIndex = getNodeID(destination) + nodeCount(graph); + Node *oldPrevious = previous[nodeIndex]; + + if (oldPrevious == origin || getNodeStatus(destination) == 1) + return; + + arcTime = + ((Time) getNodeLength(origin)) / ((Time) getMultiplicity(arc)); + totalTime = originTime + arcTime; + + destinationTime = times[nodeIndex]; + + if (destinationTime == -1) { + setNodeTime(destination, totalTime); + dheapNodes[nodeIndex] = + insertNodeIntoDHeap(dheap, totalTime, destination); + previous[nodeIndex] = origin; + return; + } else if (destinationTime > totalTime) { + if (dheapNodes[nodeIndex] == NULL) { + //puts("Already expanded though"); + return; + } + + setNodeTime(destination, totalTime); + replaceKeyInDHeap(dheap, dheapNodes[nodeIndex], totalTime); + previous[nodeIndex] = origin; + + comparePaths(destination, oldPrevious); + return; + } else { + if (destinationTime == getNodeTime(origin) + && isPreviousToNode(destination, origin)) { + return; + } + + comparePaths(destination, origin); + } +} + +static void initializeTodoLists() +{ + IDnum index; + Node *node; + Arc *arc; + Ticket *tkt; + IDnum nodes = nodeCount(graph); + Ticket **currentList; + Ticket *currentTicket, *tmp; + Node *destination; + + puts("Initializing todo lists"); + + for (index = -nodes; index <= 
nodes; index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + + currentList = &todoLists[index + nodes]; + *currentList = NULL; + + for (arc = getArc(node); arc != NULL; + arc = getNextArc(arc)) { + destination = getDestination(arc); + + if (destination == node) + continue; + + tkt = newTicket(); + tkt->id_a = getNodeID(destination); + + currentTicket = *currentList; + if (currentTicket == NULL + || currentTicket->id_a > tkt->id_a) { + tkt->next = currentTicket; + *currentList = tkt; + continue; + } + + while (currentTicket->next != NULL + && currentTicket->next->id_a < tkt->id_a) + currentTicket = currentTicket->next; + + tmp = currentTicket->next; + currentTicket->next = tkt; + tkt->next = tmp; + } + } + + puts("Done with initialization"); +} + +static void tourBusNode(Node * node) +{ + Arc *arc; + Node *destination; + Time nodeTime = getNodeTime(node); + IDnum id; + + dbgCounter++; + if (dbgCounter % 1000 == 0) { + printf("%d nodes visited\n", dbgCounter); + fflush(stdout); + } + + setSingleNodeStatus(node, 2); + activeNode = node; + todo = &todoLists[getNodeID(activeNode) + nodeCount(graph)]; + done = NULL; + + while ((id = nextTodoTicket()) != 0) { + destination = getNodeInGraph(graph, id); + + // Node doesn't exist anymore + if (destination == NULL) + continue; + + arc = getArcBetweenNodes(activeNode, destination, graph); + + // Arc does not exist for some reason (?) 
+ if (arc == NULL) + continue; + + tourBusArc(activeNode, arc, nodeTime); + } + + freeDoneTickets(); +} + +static Coordinate getTipLength(Node * node) +{ + Node *current = getTwinNode(node); + Coordinate length = 0; + + if (simpleArcCount(current) > 1) + return getNodeLength(node); + + while (current != NULL && simpleArcCount(getTwinNode(current)) < 2 + && simpleArcCount(current) < 2) { + length += getNodeLength(current); + current = getDestination(getArc(current)); + } + + return length; +} + +static boolean isMinorityChoice(Node * node) +{ + Node *current = getTwinNode(node); + Arc *arc; + Arc *activeArc = NULL; + IDnum mult = 0; + + // Finding first tangle + while (current != NULL && simpleArcCount(getTwinNode(current)) < 2 + && simpleArcCount(current) < 2) { + activeArc = getArc(current); + current = getDestination(activeArc); + } + + // If isolated snippet: + if (current == NULL) + return true; + + // Joined tips + if (simpleArcCount(getTwinNode(current)) < 2) + return false; + + // If unique event + if (getMultiplicity(activeArc) == 1) + return true; + + // Computing max arc + for (arc = getArc(getTwinNode(current)); arc != NULL; + arc = getNextArc(arc)) + if (getMultiplicity(arc) > mult) + mult = getMultiplicity(arc); + + // Testing for minority + return mult >= getMultiplicity(activeArc); +} + +void clipTips(Graph * graph) +{ + IDnum index; + Node *current, *twin; + boolean modified = true; + int Wordlength = getWordLength(graph); + PassageMarker *marker; + + puts("Clipping short tips off graph"); + + while (modified) { + modified = false; + for (index = 1; index <= nodeCount(graph); index++) { + current = getNodeInGraph(graph, index); + + if (current == NULL) + continue; + + twin = getTwinNode(current); + + if (getArc(current) == NULL + && getTipLength(current) < 2 * Wordlength + && isMinorityChoice(current)) { + while ((marker = getMarker(current))) { + if (!isInitial(marker) + && !isTerminal(marker)) + disconnectNextPassageMarker + (getPreviousInSequence 
+ (marker), graph); + destroyPassageMarker(marker); + } + destroyNode(current, graph); + modified = true; + } else if (getArc(twin) == NULL + && getTipLength(twin) < 2 * Wordlength + && isMinorityChoice(twin)) { + while ((marker = getMarker(current))) { + if (!isInitial(marker) + && !isTerminal(marker)) + disconnectNextPassageMarker + (getPreviousInSequence + (marker), graph); + destroyPassageMarker(marker); + } + destroyNode(twin, graph); + modified = true; + } + } + } + + concatenateGraph(graph); + printf("%d nodes left\n", nodeCount(graph)); +} + +void clipTipsHard(Graph * graph) +{ + IDnum index; + Node *current, *twin; + boolean modified = true; + int Wordlength = getWordLength(graph); + PassageMarker *marker; + + puts("Clipping short tips off graph, drastic"); + + while (modified) { + modified = false; + for (index = 1; index <= nodeCount(graph); index++) { + current = getNodeInGraph(graph, index); + + if (current == NULL) + continue; + + twin = getTwinNode(current); + + if (getArc(current) == NULL + && getTipLength(current) < 2 * Wordlength) { + while ((marker = getMarker(current))) { + if (!isInitial(marker) + && !isTerminal(marker)) + disconnectNextPassageMarker + (getPreviousInSequence + (marker), graph); + destroyPassageMarker(marker); + } + destroyNode(current, graph); + modified = true; + } else if (getArc(twin) == NULL + && getTipLength(twin) < + 2 * Wordlength) { + while ((marker = getMarker(current))) { + if (!isInitial(marker) + && !isTerminal(marker)) + disconnectNextPassageMarker + (getPreviousInSequence + (marker), graph); + destroyPassageMarker(marker); + } + destroyNode(twin, graph); + modified = true; + } + } + } + + concatenateGraph(graph); + printf("%d nodes left\n", nodeCount(graph)); +} + +static void tourBus(Node * startingPoint) +{ + Node *currentNode = startingPoint; + IDnum nodeID = getNodeID(startingPoint) + nodeCount(graph); + + //printf("Tour bus from node %ld...\n", getNodeID(startingPoint)); + + times[nodeID] = 0; + 
previous[nodeID] = currentNode; + + while (currentNode != NULL) { + dheapNodes[getNodeID(currentNode) + nodeCount(graph)] = + NULL; + tourBusNode(currentNode); + currentNode = removeNextNodeFromDHeap(dheap); + } +} + +void correctGraph(Graph * argGraph, Coordinate * argSequenceLengths) +{ + IDnum nodes; + IDnum index; + + //Setting global params + graph = argGraph; + WORDLENGTH = getWordLength(graph); + sequenceLengths = argSequenceLengths; + dbgCounter = 0; + // Done with global params + + printf("Correcting graph with cutoff %f\n", MAXDIVERGENCE); + + //clipTips(graph); + nodes = nodeCount(graph); + + // Allocating memory + times = mallocOrExit(2 * nodes + 1, Time); + previous = mallocOrExit(2 * nodes + 1, Node *); + dheapNodes = mallocOrExit(2 * nodes + 1, DFibHeapNode *); + + for (index = 0; index < (2 * nodeCount(graph) + 1); index++) { + times[index] = -1; + previous[index] = NULL; + dheapNodes[index] = NULL; + } + + dheap = newDFibHeap(); + + fastSequence = newTightString(MAXREADLENGTH); + slowSequence = newTightString(MAXREADLENGTH); + fastToSlowMapping = callocOrExit(MAXREADLENGTH + 1, Coordinate); + slowToFastMapping = callocOrExit(MAXREADLENGTH + 1, Coordinate); + Fmatrix = callocOrExit(MAXREADLENGTH + 1, double *); + for (index = 0; index < MAXREADLENGTH + 1; index++) + Fmatrix[index] = callocOrExit(MAXREADLENGTH + 1, double); + + eligibleStartingPoints = mallocOrExit(2 * nodes + 1, IDnum); + progressStatus = callocOrExit(2 * nodes + 1, boolean); + todoLists = callocOrExit(2 * nodes + 1, Ticket *); + //Done with memory + + resetNodeStatus(graph); + determineEligibleStartingPoints(); + initializeTodoLists(); + activateArcLookupTable(graph); + + while ((startingNode = nextStartingPoint()) != NULL) { + //puts("Going through the cycle..."); + tourBus(startingNode); + updateNodeStatus(startingNode); + } + + deactivateArcLookupTable(graph); + concatenateGraph(graph); + + clipTipsHard(graph); + + //Deallocating globals + free(times); + free(previous); + 
free(dheapNodes); + destroyDHeap(dheap); + + destroyTightString(fastSequence); + destroyTightString(slowSequence); + free(fastToSlowMapping); + free(slowToFastMapping); + for (index = 0; index < MAXREADLENGTH + 1; index++) + free(Fmatrix[index]); + free(Fmatrix); + + free(eligibleStartingPoints); + free(progressStatus); + free(todoLists); + + if (ticketMemory != NULL) + destroyRecycleBin(ticketMemory); + + free(sequenceLengths); + //Done deallocating +} + +void setMaxReadLength(int value) +{ + if (value < 0) { + printf("Negative branch length %i!\n", value); + puts("Exiting..."); + exit(1); + } + MAXREADLENGTH = value; + MAXNODELENGTH = 2 * value; +} + +void setMaxGaps(int value) +{ + if (value < 0) { + printf("Negative max gap count %i!\n", value); + puts("Exiting..."); + exit(1); + } + MAXGAPS = value; +} + +void setMaxDivergence(double value) +{ + if (value < 0 || value > 1) { + printf("Divergence rate %lf out of bounds [0,1]!\n", + value); + puts("Exiting..."); + exit(1); + } + MAXDIVERGENCE = value; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/correctedGraph.h b/contrib/MetaVelvet-v0.3.1/src/correctedGraph.h new file mode 100644 index 0000000..7102175 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/correctedGraph.h @@ -0,0 +1,34 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _CORRECTEDGRAPH_H_ +#define _CORRECTEDGRAPH_H_ + +void clipTips(Graph * graph); + +void clipTipsHard(Graph * graph); + +void correctGraph(Graph * graph, Coordinate * sequenceLengths); + +// Black arts: +void setMaxReadLength(int value); +void setMaxGaps(int value); +void setMaxDivergence(double value); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/crc.c b/contrib/MetaVelvet-v0.3.1/src/crc.c new file mode 100644 index 0000000..5fb5912 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/crc.c @@ -0,0 +1,99 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include + +#define CRC_HASH_MASK 0x0000000000ffffffL + +#ifndef CRC_HASH_BUCKETS +#define CRC_HASH_BUCKETS 16777216 +#endif + +static int crc_table[256] = { + 0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, + 0x706af48f, 0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, + 0xe0d5e91e, 0x97d2d988, 0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, + 0x90bf1d91, 0x1db71064, 0x6ab020f2, 0xf3b97148, 0x84be41de, + 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7, 0x136c9856, + 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9, + 0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, + 0xa2677172, 0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, + 0x35b5a8fa, 0x42b2986c, 0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, + 0x45df5c75, 0xdcd60dcf, 0xabd13d59, 0x26d930ac, 0x51de003a, + 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423, 0xcfba9599, + 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924, + 0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, + 0x01db7106, 0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, + 0x9fbfe4a5, 0xe8b8d433, 0x7807c9a2, 0x0f00f934, 0x9609a88e, + 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d, 0x91646c97, 0xe6635c01, + 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e, 0x6c0695ed, + 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950, + 0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, + 0xfbd44c65, 0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, + 0x4adfa541, 0x3dd895d7, 0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, + 0x346ed9fc, 0xad678846, 0xda60b8d0, 0x44042d73, 0x33031de5, + 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa, 0xbe0b1010, + 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f, + 0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, + 0x2eb40d81, 0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, + 0x03b6e20c, 
0x74b1d29a, 0xead54739, 0x9dd277af, 0x04db2615, + 0x73dc1683, 0xe3630b12, 0x94643b84, 0x0d6d6a3e, 0x7a6a5aa8, + 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1, 0xf00f9344, + 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb, + 0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, + 0x67dd4acc, 0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, + 0xd6d6a3e8, 0xa1d1937e, 0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, + 0xa6bc5767, 0x3fb506dd, 0x48b2364b, 0xd80d2bda, 0xaf0a1b4c, + 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55, 0x316e8eef, + 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236, + 0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, + 0xb2bd0b28, 0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, + 0x2cd99e8b, 0x5bdeae1d, 0x9b64c2b0, 0xec63f226, 0x756aa39c, + 0x026d930a, 0x9c0906a9, 0xeb0e363f, 0x72076785, 0x05005713, + 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38, 0x92d28e9b, + 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242, + 0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, + 0x18b74777, 0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, + 0x8f659eff, 0xf862ae69, 0x616bffd3, 0x166ccf45, 0xa00ae278, + 0xd70dd2ee, 0x4e048354, 0x3903b3c2, 0xa7672661, 0xd06016f7, + 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc, 0x40df0b66, + 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9, + 0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, + 0xcdd70693, 0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, + 0x5d681b02, 0x2a6f2b94, 0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, + 0x2d02ef8d +}; + +int crc32_v(const char *buf, int len) +{ + int crc; + + if (buf == NULL) + return 0; + + crc = 0xffffffff; + while (len--) { + crc = + crc_table[((int) crc ^ (*buf++)) & 0xff] ^ (crc >> 8); + } + + crc ^= 0xffffffff; + return crc & CRC_HASH_MASK; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/crc.h b/contrib/MetaVelvet-v0.3.1/src/crc.h new file mode 100644 index 0000000..2a37f76 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/crc.h @@ 
-0,0 +1,30 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _CRC_H_ +#define _CRC_H_ + +#ifndef CRC_HASH_BUCKETS +#define CRC_HASH_BUCKETS 16777216 +#endif + +int crc32_v(const char *buf, int len); + +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/dfib.c b/contrib/MetaVelvet-v0.3.1/src/dfib.c new file mode 100644 index 0000000..6c41de6 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/dfib.c @@ -0,0 +1,509 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +/*- + * Copyright 1997-2003 John-Mark Gurney. 
+ * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ * + * $Id: dfib.c,v 1.12 2007/10/19 13:09:26 zerbino Exp $ + * + */ +#include +#include + +#include "recycleBin.h" +#include "dfib.h" + +#include "dfibpriv.h" + +#define BLOCKSIZE 10000 +static DFibHeapNode *allocateDFibHeapNode(DFibHeap * heap) +{ + return allocatePointer(heap->nodeMemory); +}; + +static void deallocateDFibHeapNode(DFibHeapNode * a, DFibHeap * heap) +{ + deallocatePointer(heap->nodeMemory, a); +} + +IDnum dfibheap_getSize(DFibHeap * heap) +{ + return heap->dfh_n; +} + +#define swap(type, a, b) \ + do { \ + type c; \ + c = a; \ + a = b; \ + b = c; \ + } while (0) \ + +#define INT_BITS (sizeof(IDnum) * 8) + +static inline IDnum ceillog2(IDnum a) +{ + IDnum oa; + IDnum i; + IDnum b; + IDnum cons; + + oa = a; + b = INT_BITS / 2; + i = 0; + while (b) { + i = (i << 1); + cons = ((IDnum) 1) << b; + if (a >= cons) { + a /= cons; + i = i | 1; + } else + a &= cons - 1; + b /= 2; + } + if ((((IDnum) 1 << i)) == oa) + return i; + else + return i + 1; +} + +/* + * Public Heap Functions + */ +DFibHeap *dfh_makekeyheap() +{ + DFibHeap *n; + + if ((n = malloc(sizeof *n)) == NULL) + return NULL; + + n->nodeMemory = newRecycleBin(sizeof(DFibHeapNode), BLOCKSIZE); + n->dfh_n = 0; + n->dfh_Dl = -1; + n->dfh_cons = NULL; + n->dfh_min = NULL; + n->dfh_root = NULL; + + return n; +} + +void dfh_deleteheap(DFibHeap * h) +{ + destroyRecycleBin(h->nodeMemory); + if (h->dfh_cons != NULL) + free(h->dfh_cons); + free(h); +} + +/* + * Public Key Heap Functions + */ +DFibHeapNode *dfh_insertkey(DFibHeap * h, Time key, void *data) +{ + DFibHeapNode *x; + + if ((x = dfhe_newelem(h)) == NULL) + return NULL; + + /* just insert on root list, and make sure it's not the new min */ + x->dfhe_data = data; + x->dfhe_key = key; + + dfh_insertel(h, x); + + return x; +} + +Time dfh_replacekey(DFibHeap * h, DFibHeapNode * x, Time key) +{ + Time ret; + + ret = x->dfhe_key; + (void) dfh_replacekeydata(h, x, key, x->dfhe_data); + + return ret; +} + +void *dfh_replacekeydata(DFibHeap * h, 
DFibHeapNode * x, + Time key, void *data) +{ + void *odata; + Time okey; + DFibHeapNode *y; + int r; + + odata = x->dfhe_data; + okey = x->dfhe_key; + + /* + * we can increase a key by deleting and reinserting, that + * requires O(lgn) time. + */ + if ((r = dfh_comparedata(h, key, data, x)) > 0) { + /* XXX - bad code! */ + abort(); + } + + x->dfhe_data = data; + x->dfhe_key = key; + + /* because they are equal, we don't have to do anything */ + if (r == 0) + return odata; + + y = x->dfhe_p; + + if (okey == key) + return odata; + + if (y != NULL && dfh_compare(h, x, y) <= 0) { + dfh_cut(h, x, y); + dfh_cascading_cut(h, y); + } + + /* + * the = is so that the call from dfh_delete will delete the proper + * element. + */ + if (dfh_compare(h, x, h->dfh_min) <= 0) + h->dfh_min = x; + + return odata; +} + +/* + * Public void * Heap Functions + */ +/* + * this will return these values: + * NULL failed for some reason + * ptr token to use for manipulation of data + */ +void *dfh_extractmin(DFibHeap * h) +{ + DFibHeapNode *z; + void *ret; + + ret = NULL; + + if (h->dfh_min != NULL) { + z = dfh_extractminel(h); + ret = z->dfhe_data; + deallocateDFibHeapNode(z, h); + } + + return ret; +} + +void *dfh_replacedata(DFibHeapNode * x, void *data) +{ + void *odata = x->dfhe_data; + x->dfhe_data = data; + return odata; +} + +void *dfh_delete(DFibHeap * h, DFibHeapNode * x) +{ + void *k; + + k = x->dfhe_data; + dfh_replacekey(h, x, INT_MIN); + dfh_extractmin(h); + + return k; +} + +/* + * begin of private element fuctions + */ +static DFibHeapNode *dfh_extractminel(DFibHeap * h) +{ + DFibHeapNode *ret; + DFibHeapNode *x, *y, *orig; + + ret = h->dfh_min; + + orig = NULL; + /* put all the children on the root list */ + /* for true consistancy, we should use dfhe_remove */ + for (x = ret->dfhe_child; x != orig && x != NULL;) { + if (orig == NULL) + orig = x; + y = x->dfhe_right; + x->dfhe_p = NULL; + dfh_insertrootlist(h, x); + x = y; + } + /* remove minimum from root list */ + 
dfh_removerootlist(h, ret); + h->dfh_n--; + + /* if we aren't empty, consolidate the heap */ + if (h->dfh_n == 0) + h->dfh_min = NULL; + else { + h->dfh_min = ret->dfhe_right; + dfh_consolidate(h); + } + + return ret; +} + +static void dfh_insertrootlist(DFibHeap * h, DFibHeapNode * x) +{ + if (h->dfh_root == NULL) { + h->dfh_root = x; + x->dfhe_left = x; + x->dfhe_right = x; + return; + } + + dfhe_insertafter(h->dfh_root, x); +} + +static void dfh_removerootlist(DFibHeap * h, DFibHeapNode * x) +{ + if (x->dfhe_left == x) + h->dfh_root = NULL; + else + h->dfh_root = dfhe_remove(x); +} + +static void dfh_consolidate(DFibHeap * h) +{ + DFibHeapNode **a; + DFibHeapNode *w; + DFibHeapNode *y; + DFibHeapNode *x; + IDnum i; + IDnum d; + IDnum D; + + dfh_checkcons(h); + + /* assign a the value of h->dfh_cons so I don't have to rewrite code */ + D = h->dfh_Dl + 1; + a = h->dfh_cons; + + for (i = 0; i < D; i++) + a[i] = NULL; + + while ((w = h->dfh_root) != NULL) { + x = w; + dfh_removerootlist(h, w); + d = x->dfhe_degree; + /* XXX - assert that d < D */ + while (a[d] != NULL) { + y = a[d]; + if (dfh_compare(h, x, y) > 0) + swap(DFibHeapNode *, x, y); + dfh_heaplink(h, y, x); + a[d] = NULL; + d++; + } + a[d] = x; + } + h->dfh_min = NULL; + for (i = 0; i < D; i++) + if (a[i] != NULL) { + dfh_insertrootlist(h, a[i]); + if (h->dfh_min == NULL + || dfh_compare(h, a[i], h->dfh_min) < 0) + h->dfh_min = a[i]; + } +} + +static void dfh_heaplink(DFibHeap * h, DFibHeapNode * y, DFibHeapNode * x) +{ + /* make y a child of x */ + if (x->dfhe_child == NULL) + x->dfhe_child = y; + else + dfhe_insertbefore(x->dfhe_child, y); + y->dfhe_p = x; + x->dfhe_degree++; + y->dfhe_mark = 0; +} + +static void dfh_cut(DFibHeap * h, DFibHeapNode * x, DFibHeapNode * y) +{ + dfhe_remove(x); + y->dfhe_degree--; + dfh_insertrootlist(h, x); + x->dfhe_p = NULL; + x->dfhe_mark = 0; +} + +static void dfh_cascading_cut(DFibHeap * h, DFibHeapNode * y) +{ + DFibHeapNode *z; + + while ((z = y->dfhe_p) != NULL) { 
+ if (y->dfhe_mark == 0) { + y->dfhe_mark = 1; + return; + } else { + dfh_cut(h, y, z); + y = z; + } + } +} + +/* + * begining of handling elements of dfibheap + */ +static DFibHeapNode *dfhe_newelem(DFibHeap * h) +{ + DFibHeapNode *e; + + if ((e = allocateDFibHeapNode(h)) == NULL) + return NULL; + + e->dfhe_degree = 0; + e->dfhe_mark = 0; + e->dfhe_p = NULL; + e->dfhe_child = NULL; + e->dfhe_left = e; + e->dfhe_right = e; + e->dfhe_data = NULL; + + return e; +} + +static void dfhe_insertafter(DFibHeapNode * a, DFibHeapNode * b) +{ + if (a == a->dfhe_right) { + a->dfhe_right = b; + a->dfhe_left = b; + b->dfhe_right = a; + b->dfhe_left = a; + } else { + b->dfhe_right = a->dfhe_right; + a->dfhe_right->dfhe_left = b; + a->dfhe_right = b; + b->dfhe_left = a; + } +} + +static inline void dfhe_insertbefore(DFibHeapNode * a, DFibHeapNode * b) +{ + dfhe_insertafter(a->dfhe_left, b); +} + +static DFibHeapNode *dfhe_remove(DFibHeapNode * x) +{ + DFibHeapNode *ret; + + if (x == x->dfhe_left) + ret = NULL; + else + ret = x->dfhe_left; + + /* fix the parent pointer */ + if (x->dfhe_p != NULL && x->dfhe_p->dfhe_child == x) + x->dfhe_p->dfhe_child = ret; + + x->dfhe_right->dfhe_left = x->dfhe_left; + x->dfhe_left->dfhe_right = x->dfhe_right; + + /* clear out hanging pointers */ + x->dfhe_p = NULL; + x->dfhe_left = x; + x->dfhe_right = x; + + return ret; +} + +static void dfh_checkcons(DFibHeap * h) +{ + IDnum oDl; + + /* make sure we have enough memory allocated to "reorganize" */ + if (h->dfh_Dl == -1 || h->dfh_n > (1 << h->dfh_Dl)) { + oDl = h->dfh_Dl; + if ((h->dfh_Dl = ceillog2(h->dfh_n) + 1) < 8) + h->dfh_Dl = 8; + if (oDl != h->dfh_Dl) + h->dfh_cons = + (DFibHeapNode **) realloc(h->dfh_cons, + sizeof *h-> + dfh_cons * + (h->dfh_Dl + 1)); + if (h->dfh_cons == NULL) + abort(); + } +} + +static int dfh_compare(DFibHeap * h, DFibHeapNode * a, DFibHeapNode * b) +{ + if (a->dfhe_key < b->dfhe_key) + return -1; + if (a->dfhe_key == b->dfhe_key) + return 0; + return 1; +} + +static 
int +dfh_comparedata(DFibHeap * h, Time key, void *data, DFibHeapNode * b) +{ + DFibHeapNode a; + + a.dfhe_key = key; + a.dfhe_data = data; + + return dfh_compare(h, &a, b); +} + +static void dfh_insertel(DFibHeap * h, DFibHeapNode * x) +{ + dfh_insertrootlist(h, x); + + if (h->dfh_min == NULL || x->dfhe_key < h->dfh_min->dfhe_key) + h->dfh_min = x; + + h->dfh_n++; +} + +Time dfibheap_el_getKey(DFibHeapNode * node) +{ + return node->dfhe_key; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/dfib.h b/contrib/MetaVelvet-v0.3.1/src/dfib.h new file mode 100644 index 0000000..ae88f7a --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/dfib.h @@ -0,0 +1,69 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +/*- + * Copyright 1997, 1998-2003 John-Mark Gurney. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. 
Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * $Id: dfib.h,v 1.8 2007/04/24 12:16:41 zerbino Exp $ + * + */ + +#ifndef _DFIB_H_ +#define _DFIB_H_ + +#include "globals.h" + +/* functions for key heaps */ +DFibHeap *dfh_makekeyheap(void); +DFibHeapNode *dfh_insertkey(DFibHeap *, Time, void *); +Time dfh_replacekey(DFibHeap *, DFibHeapNode *, Time); +void *dfh_replacekeydata(DFibHeap *, DFibHeapNode *, Time, void *); + +void *dfh_extractmin(DFibHeap *); +void *dfh_replacedata(DFibHeapNode *, void *); +void *dfh_delete(DFibHeap *, DFibHeapNode *); +void dfh_deleteheap(DFibHeap *); + +IDnum dfibheap_getSize(DFibHeap *); +Time dfibheap_el_getKey(DFibHeapNode *); + +#endif /* _DFIB_H_ */ diff --git a/contrib/MetaVelvet-v0.3.1/src/dfibHeap.c b/contrib/MetaVelvet-v0.3.1/src/dfibHeap.c new file mode 100644 index 0000000..7434892 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/dfibHeap.c @@ -0,0 +1,94 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. 
+ + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include + +#include "globals.h" +#include "dfib.h" +#include "utility.h" + +// Return number of elements stored in heap +IDnum getDFibHeapSize(DFibHeap * heap) +{ + return dfibheap_getSize(heap); +} + +// Constructor +// Memory allocated +DFibHeap *newDFibHeap() +{ + DFibHeap* dheap = dfh_makekeyheap(); + if (dheap == NULL) + exitErrorf(EXIT_FAILURE, true, "Can't allocate DFibHeap"); + + return dheap; +} + +// Add new node into heap with a key, and a pointer to the specified node +DFibHeapNode *insertNodeIntoDHeap(DFibHeap * heap, Time key, + struct node_st * node) +{ + DFibHeapNode *res; + res = dfh_insertkey(heap, key, node); + + return res; +} + +// Replaces the key for a given node +Time replaceKeyInDHeap(DFibHeap * heap, DFibHeapNode * node, Time newKey) +{ + Time res; + res = dfh_replacekey(heap, node, newKey); + + return res; +} + +// Removes the node with the shortest key, then returns it. 
+Node *removeNextNodeFromDHeap(DFibHeap * heap) +{ + Node *node; + node = (Node *) dfh_extractmin(heap); + + return node; +} + +// Destructor +void destroyDHeap(DFibHeap * heap) +{ + dfh_deleteheap(heap); +} + +// Replace the node pointed to by a heap node +void replaceValueInDHeap(DFibHeapNode * node, Node * newValue) +{ + dfh_replacedata(node, newValue); +} + +// Remove unwanted node +void destroyNodeInDHeap(DFibHeapNode * node, DFibHeap * heap) +{ + dfh_delete(heap, node); +} + +Time getKey(DFibHeapNode * node) +{ + return dfibheap_el_getKey(node); +} diff --git a/contrib/MetaVelvet-v0.3.1/src/dfibHeap.h b/contrib/MetaVelvet-v0.3.1/src/dfibHeap.h new file mode 100644 index 0000000..525c8f9 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/dfibHeap.h @@ -0,0 +1,41 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _DFIBHEAP_H_ +#define _DFIBHEAP_H_ + +DFibHeap *newDFibHeap(); + +DFibHeapNode *insertNodeIntoDHeap(DFibHeap * heap, Time key, Node * node); + +Time replaceKeyInDHeap(DFibHeap * heap, DFibHeapNode * node, Time newKey); + +Node *removeNextNodeFromDHeap(DFibHeap * heap); + +void destroyDHeap(DFibHeap * heap); + +void replaceValueInDHeap(DFibHeapNode * node, Node * newValue); + +void destroyNodeInDHeap(DFibHeapNode * node, DFibHeap * heap); + +IDnum getDFibHeapSize(DFibHeap * heap); + +Time getKey(DFibHeapNode * node); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/dfibpriv.h b/contrib/MetaVelvet-v0.3.1/src/dfibpriv.h new file mode 100644 index 0000000..ff59758 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/dfibpriv.h @@ -0,0 +1,105 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +/*- + * Copyright 1997, 1999-2003 John-Mark Gurney. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. 
Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ * + * $Id: dfibpriv.h,v 1.8 2007/10/09 09:56:46 zerbino Exp $ + * + */ + +#ifndef _DFIBPRIV_H_ +#define _DFIBPRIV_H_ + +#include "globals.h" + +/* + * specific node operations + */ +struct dfibheap_el { + DFibHeapNode *dfhe_p; + DFibHeapNode *dfhe_child; + DFibHeapNode *dfhe_left; + DFibHeapNode *dfhe_right; + void *dfhe_data; + Time dfhe_key; + int dfhe_degree; + boolean dfhe_mark; +}; + +static DFibHeapNode *dfhe_newelem(DFibHeap *); +static void dfhe_insertafter(DFibHeapNode * a, DFibHeapNode * b); +static inline void dfhe_insertbefore(DFibHeapNode * a, DFibHeapNode * b); +static DFibHeapNode *dfhe_remove(DFibHeapNode * a); + +/* + * global heap operations + */ +struct dfibheap { + RecycleBin *nodeMemory; + IDnum dfh_n; + IDnum dfh_Dl; + DFibHeapNode **dfh_cons; + DFibHeapNode *dfh_min; + DFibHeapNode *dfh_root; +}; + +static void dfh_insertrootlist(DFibHeap *, DFibHeapNode *); +static void dfh_removerootlist(DFibHeap *, DFibHeapNode *); +static void dfh_consolidate(DFibHeap *); +static void dfh_heaplink(DFibHeap * h, DFibHeapNode * y, DFibHeapNode * x); +static void dfh_cut(DFibHeap *, DFibHeapNode *, DFibHeapNode *); +static void dfh_cascading_cut(DFibHeap *, DFibHeapNode *); +static DFibHeapNode *dfh_extractminel(DFibHeap *); +static void dfh_checkcons(DFibHeap * h); +static int dfh_compare(DFibHeap * h, DFibHeapNode * a, DFibHeapNode * b); +static int dfh_comparedata(DFibHeap * h, Time key, + void *data, DFibHeapNode * b); +static void dfh_insertel(DFibHeap * h, DFibHeapNode * x); + + +/* + * general functions + */ +static inline IDnum ceillog2(IDnum a); + +#endif /* _DFIBPRIV_H_ */ diff --git a/contrib/MetaVelvet-v0.3.1/src/fib.c b/contrib/MetaVelvet-v0.3.1/src/fib.c new file mode 100644 index 0000000..2a23b5e --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/fib.c @@ -0,0 +1,628 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. 
+ + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +/*- + * Copyright 1997-2003 John-Mark Gurney. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * $Id: fib.c,v 1.10 2007/10/19 13:09:26 zerbino Exp $ + * + */ +#include +#include + +#include "fib.h" +#include "recycleBin.h" + +#include "fibpriv.h" + +#define BLOCKSIZE 10000 + +static FibHeapNode *allocateFibHeapEl(FibHeap * heap) +{ + return allocatePointer(heap->nodeMemory); +}; + +static void deallocateFibHeapEl(FibHeapNode * a, FibHeap * heap) +{ + deallocatePointer(heap->nodeMemory, a); +} + +#define swap(type, a, b) \ + do { \ + type c; \ + c = a; \ + a = b; \ + b = c; \ + } while (0) \ + +#define INT_BITS (sizeof(IDnum) * 8) + +static inline IDnum ceillog2(IDnum a) +{ + IDnum oa; + IDnum i; + IDnum b; + IDnum cons; + + oa = a; + b = INT_BITS / 2; + i = 0; + while (b) { + i = (i << 1); + cons = ((IDnum) 1) << b; + if (a >= cons) { + a /= cons; + i = i | 1; + } else + a &= cons - 1; + b /= 2; + } + if ((((IDnum) 1 << i)) == oa) + return i; + else + return i + 1; +} + +/* + * Private Heap Functions + */ +static void fh_initheap(FibHeap * new) +{ + new->fh_cmp_fnct = NULL; + new->nodeMemory = newRecycleBin(sizeof(FibHeapNode), BLOCKSIZE); + new->fh_neginf = NULL; + new->fh_n = 0; + new->fh_Dl = -1; + new->fh_cons = NULL; + new->fh_min = NULL; + new->fh_root = NULL; + new->fh_keys = 0; +} + +static void fh_destroyheap(FibHeap * h) +{ + h->fh_cmp_fnct = NULL; + h->fh_neginf = NULL; + if (h->fh_cons != NULL) + free(h->fh_cons); + h->fh_cons = NULL; + free(h); +} + +/* + * Public Heap Functions + */ +FibHeap *fh_makekeyheap() +{ + 
FibHeap *n; + + if ((n = malloc(sizeof *n)) == NULL) + return NULL; + + fh_initheap(n); + n->fh_keys = 1; + + return n; +} + +FibHeap *fh_makeheap() +{ + FibHeap *n; + + if ((n = malloc(sizeof *n)) == NULL) + return NULL; + + fh_initheap(n); + + return n; +} + +voidcmp fh_setcmp(FibHeap * h, voidcmp fnct) +{ + voidcmp oldfnct; + + oldfnct = h->fh_cmp_fnct; + h->fh_cmp_fnct = fnct; + + return oldfnct; +} + +void *fh_setneginf(FibHeap * h, void *data) +{ + void *old; + + old = h->fh_neginf; + h->fh_neginf = data; + + return old; +} + +FibHeap *fh_union(FibHeap * ha, FibHeap * hb) +{ + FibHeapNode *x; + + if (ha->fh_root == NULL || hb->fh_root == NULL) { + /* either one or both are empty */ + if (ha->fh_root == NULL) { + fh_destroyheap(ha); + return hb; + } else { + fh_destroyheap(hb); + return ha; + } + } + ha->fh_root->fhe_left->fhe_right = hb->fh_root; + hb->fh_root->fhe_left->fhe_right = ha->fh_root; + x = ha->fh_root->fhe_left; + ha->fh_root->fhe_left = hb->fh_root->fhe_left; + hb->fh_root->fhe_left = x; + ha->fh_n += hb->fh_n; + /* + * we probably should also keep stats on number of unions + */ + + /* set fh_min if necessary */ + if (fh_compare(ha, hb->fh_min, ha->fh_min) < 0) + ha->fh_min = hb->fh_min; + + fh_destroyheap(hb); + return ha; +} + +void fh_deleteheap(FibHeap * h) +{ + destroyRecycleBin(h->nodeMemory); + fh_destroyheap(h); +} + +/* + * Public Key Heap Functions + */ +FibHeapNode *fh_insertkey(FibHeap * h, Coordinate key, void *data) +{ + FibHeapNode *x; + + if ((x = fhe_newelem(h)) == NULL) + return NULL; + + /* just insert on root list, and make sure it's not the new min */ + x->fhe_data = data; + x->fhe_key = key; + + fh_insertel(h, x); + + return x; +} + +Coordinate fh_minkey(FibHeap * h) +{ + if (h->fh_min == NULL) + return (Coordinate) INT_MIN; + return h->fh_min->fhe_key; +} + +Coordinate fh_replacekey(FibHeap * h, FibHeapNode * x, Coordinate key) +{ + Coordinate ret; + + ret = x->fhe_key; + (void) fh_replacekeydata(h, x, key, x->fhe_data); + 
+ return ret; +} + +void *fh_replacekeydata(FibHeap * h, FibHeapNode * x, + Coordinate key, void *data) +{ + void *odata; + Coordinate okey; + FibHeapNode *y; + int r; + + odata = x->fhe_data; + okey = x->fhe_key; + + /* + * we can increase a key by deleting and reinserting, that + * requires O(lgn) time. + */ + if ((r = fh_comparedata(h, key, data, x)) > 0) { + /* XXX - bad code! */ + abort(); + } + + x->fhe_data = data; + x->fhe_key = key; + + /* because they are equal, we don't have to do anything */ + if (r == 0) + return odata; + + y = x->fhe_p; + + if (h->fh_keys && okey == key) + return odata; + + if (y != NULL && fh_compare(h, x, y) <= 0) { + fh_cut(h, x, y); + fh_cascading_cut(h, y); + } + + /* + * the = is so that the call from fh_delete will delete the proper + * element. + */ + if (fh_compare(h, x, h->fh_min) <= 0) + h->fh_min = x; + + return odata; +} + +/* + * Public void * Heap Functions + */ +/* + * this will return these values: + * NULL failed for some reason + * ptr token to use for manipulation of data + */ +FibHeapNode *fh_insert(FibHeap * h, void *data) +{ + FibHeapNode *x; + + if ((x = fhe_newelem(h)) == NULL) + return NULL; + + /* just insert on root list, and make sure it's not the new min */ + x->fhe_data = data; + + fh_insertel(h, x); + + return x; +} + +void *fh_min(FibHeap * h) +{ + if (h->fh_min == NULL) + return NULL; + return h->fh_min->fhe_data; +} + +void *fh_extractmin(FibHeap * h) +{ + FibHeapNode *z; + void *ret; + + ret = NULL; + + if (h->fh_min != NULL) { + z = fh_extractminel(h); + ret = z->fhe_data; +#ifndef NO_FREE + deallocateFibHeapEl(z, h); +#endif + + } + + return ret; +} + +void *fh_replacedata(FibHeapNode * x, void *data) +{ + void *odata = x->fhe_data; + x->fhe_data = data; + return odata; +} + +void *fh_delete(FibHeap * h, FibHeapNode * x) +{ + void *k; + + k = x->fhe_data; + if (!h->fh_keys) + fh_replacedata(x, h->fh_neginf); + else + fh_replacekey(h, x, (Coordinate) INT_MIN); + fh_extractmin(h); + + return k; +} + 
+/* + * begin of private element fuctions + */ +static FibHeapNode *fh_extractminel(FibHeap * h) +{ + FibHeapNode *ret; + FibHeapNode *x, *y, *orig; + + ret = h->fh_min; + + orig = NULL; + /* put all the children on the root list */ + /* for true consistancy, we should use fhe_remove */ + for (x = ret->fhe_child; x != orig && x != NULL;) { + if (orig == NULL) + orig = x; + y = x->fhe_right; + x->fhe_p = NULL; + fh_insertrootlist(h, x); + x = y; + } + /* remove minimum from root list */ + fh_removerootlist(h, ret); + h->fh_n--; + + /* if we aren't empty, consolidate the heap */ + if (h->fh_n == 0) + h->fh_min = NULL; + else { + h->fh_min = ret->fhe_right; + fh_consolidate(h); + } + + return ret; +} + +static void fh_insertrootlist(FibHeap * h, FibHeapNode * x) +{ + if (h->fh_root == NULL) { + h->fh_root = x; + x->fhe_left = x; + x->fhe_right = x; + return; + } + + fhe_insertafter(h->fh_root, x); +} + +static void fh_removerootlist(FibHeap * h, FibHeapNode * x) +{ + if (x->fhe_left == x) + h->fh_root = NULL; + else + h->fh_root = fhe_remove(x); +} + +static void fh_consolidate(FibHeap * h) +{ + FibHeapNode **a; + FibHeapNode *w; + FibHeapNode *y; + FibHeapNode *x; + IDnum i; + IDnum d; + IDnum D; + + fh_checkcons(h); + + /* assign a the value of h->fh_cons so I don't have to rewrite code */ + D = h->fh_Dl + 1; + a = h->fh_cons; + + for (i = 0; i < D; i++) + a[i] = NULL; + + while ((w = h->fh_root) != NULL) { + x = w; + fh_removerootlist(h, w); + d = x->fhe_degree; + /* XXX - assert that d < D */ + while (a[d] != NULL) { + y = a[d]; + if (fh_compare(h, x, y) > 0) + swap(FibHeapNode *, x, y); + fh_heaplink(h, y, x); + a[d] = NULL; + d++; + } + a[d] = x; + } + h->fh_min = NULL; + for (i = 0; i < D; i++) + if (a[i] != NULL) { + fh_insertrootlist(h, a[i]); + if (h->fh_min == NULL + || fh_compare(h, a[i], h->fh_min) < 0) + h->fh_min = a[i]; + } +} + +static void fh_heaplink(FibHeap * h, FibHeapNode * y, FibHeapNode * x) +{ + /* make y a child of x */ + if (x->fhe_child == 
NULL) + x->fhe_child = y; + else + fhe_insertbefore(x->fhe_child, y); + y->fhe_p = x; + x->fhe_degree++; + y->fhe_mark = 0; +} + +static void fh_cut(FibHeap * h, FibHeapNode * x, FibHeapNode * y) +{ + fhe_remove(x); + y->fhe_degree--; + fh_insertrootlist(h, x); + x->fhe_p = NULL; + x->fhe_mark = 0; +} + +static void fh_cascading_cut(FibHeap * h, FibHeapNode * y) +{ + FibHeapNode *z; + + while ((z = y->fhe_p) != NULL) { + if (y->fhe_mark == 0) { + y->fhe_mark = 1; + return; + } else { + fh_cut(h, y, z); + y = z; + } + } +} + +/* + * begining of handling elements of fibheap + */ +static FibHeapNode *fhe_newelem(FibHeap * h) +{ + FibHeapNode *e; + + if ((e = allocateFibHeapEl(h)) == NULL) + return NULL; + + fhe_initelem(e); + + return e; +} + +static void fhe_initelem(FibHeapNode * e) +{ + e->fhe_degree = 0; + e->fhe_mark = 0; + e->fhe_p = NULL; + e->fhe_child = NULL; + e->fhe_left = e; + e->fhe_right = e; + e->fhe_data = NULL; +} + +static void fhe_insertafter(FibHeapNode * a, FibHeapNode * b) +{ + if (a == a->fhe_right) { + a->fhe_right = b; + a->fhe_left = b; + b->fhe_right = a; + b->fhe_left = a; + } else { + b->fhe_right = a->fhe_right; + a->fhe_right->fhe_left = b; + a->fhe_right = b; + b->fhe_left = a; + } +} + +static inline void fhe_insertbefore(FibHeapNode * a, FibHeapNode * b) +{ + fhe_insertafter(a->fhe_left, b); +} + +static FibHeapNode *fhe_remove(FibHeapNode * x) +{ + FibHeapNode *ret; + + if (x == x->fhe_left) + ret = NULL; + else + ret = x->fhe_left; + + /* fix the parent pointer */ + if (x->fhe_p != NULL && x->fhe_p->fhe_child == x) + x->fhe_p->fhe_child = ret; + + x->fhe_right->fhe_left = x->fhe_left; + x->fhe_left->fhe_right = x->fhe_right; + + /* clear out hanging pointers */ + x->fhe_p = NULL; + x->fhe_left = x; + x->fhe_right = x; + + return ret; +} + +static void fh_checkcons(FibHeap * h) +{ + IDnum oDl; + + /* make sure we have enough memory allocated to "reorganize" */ + if (h->fh_Dl == -1 || h->fh_n > (1 << h->fh_Dl)) { + oDl = h->fh_Dl; + 
if ((h->fh_Dl = ceillog2(h->fh_n) + 1) < 8) + h->fh_Dl = 8; + if (oDl != h->fh_Dl) + h->fh_cons = + (FibHeapNode **) realloc(h->fh_cons, + sizeof *h-> + fh_cons * + (h->fh_Dl + 1)); + if (h->fh_cons == NULL) + abort(); + } +} + +static int fh_compare(FibHeap * h, FibHeapNode * a, FibHeapNode * b) +{ + if (a->fhe_key < b->fhe_key) + return -1; + if (a->fhe_key == b->fhe_key) + return 0; + return 1; +} + +static int +fh_comparedata(FibHeap * h, Coordinate key, void *data, FibHeapNode * b) +{ + FibHeapNode a; + + a.fhe_key = key; + a.fhe_data = data; + + return fh_compare(h, &a, b); +} + +static void fh_insertel(FibHeap * h, FibHeapNode * x) +{ + fh_insertrootlist(h, x); + + if (h->fh_min == NULL + || (h->fh_keys ? x->fhe_key < + h->fh_min->fhe_key : h->fh_cmp_fnct(x->fhe_data, + h->fh_min->fhe_data) < + 0)) + h->fh_min = x; + + h->fh_n++; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/fib.h b/contrib/MetaVelvet-v0.3.1/src/fib.h new file mode 100644 index 0000000..b67d3b9 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/fib.h @@ -0,0 +1,78 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +/*- + * Copyright 1997, 1998-2003 John-Mark Gurney. + * All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ * + * $Id: fib.h,v 1.9 2007/04/24 12:16:41 zerbino Exp $ + * + */ + +#ifndef _FIB_H_ +#define _FIB_H_ + +#include "globals.h" + +typedef Coordinate(*voidcmp) (void *, void *); + +/* functions for key heaps */ +FibHeap *fh_makekeyheap(void); +FibHeapNode *fh_insertkey(FibHeap *, Coordinate, void *); +Coordinate fh_minkey(FibHeap *); +Coordinate fh_replacekey(FibHeap *, FibHeapNode *, Coordinate); +void *fh_replacekeydata(FibHeap *, FibHeapNode *, Coordinate, void *); + +/* functions for void * heaps */ +FibHeap *fh_makeheap(void); +voidcmp fh_setcmp(FibHeap *, voidcmp); +void *fh_setneginf(FibHeap *, void *); +FibHeapNode *fh_insert(FibHeap *, void *); + +/* shared functions */ +void *fh_extractmin(FibHeap *); +void *fh_min(FibHeap *); +void *fh_replacedata(FibHeapNode *, void *); +void *fh_delete(FibHeap *, FibHeapNode *); +void fh_deleteheap(FibHeap *); +FibHeap *fh_union(FibHeap *, FibHeap *); + +#endif /* _FIB_H_ */ diff --git a/contrib/MetaVelvet-v0.3.1/src/fibHeap.c b/contrib/MetaVelvet-v0.3.1/src/fibHeap.c new file mode 100644 index 0000000..1d05ae7 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/fibHeap.c @@ -0,0 +1,79 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include <stdlib.h> + +#include "fib.h" +#include "utility.h" + +// Constructor +// Memory allocated +FibHeap *newFibHeap() +{ + FibHeap * heap = fh_makekeyheap(); + if (heap == NULL) + exitErrorf(EXIT_FAILURE, true, "Can't allocate FibHeap"); + + return heap; +} + +// Add new node into heap with a key, and a pointer to the specified node +FibHeapNode *insertNodeIntoHeap(FibHeap * heap, Coordinate key, + struct node_st * node) +{ + return fh_insertkey(heap, key, node); +} + +// Returns smallest key in heap +Coordinate minKeyOfHeap(FibHeap * heap) +{ + return fh_minkey(heap); +} + +// Replaces the key for a given node +Coordinate replaceKeyInHeap(FibHeap * heap, FibHeapNode * node, + Coordinate newKey) +{ + return fh_replacekey(heap, node, newKey); +} + +// Removes the node with the smallest key, then returns it. +struct node_st *removeNextNodeFromHeap(FibHeap * heap) +{ + return (struct node_st *) fh_extractmin(heap); +} + +// Destructor +void destroyHeap(FibHeap * heap) +{ + fh_deleteheap(heap); +} + +// Replace the node pointed to by a heap node +void replaceValueInHeap(FibHeapNode * node, Node * newValue) +{ + fh_replacedata(node, newValue); +} + +// Remove unwanted node +void destroyNodeInHeap(FibHeapNode * node, FibHeap * heap) +{ + fh_delete(heap, node); +} diff --git a/contrib/MetaVelvet-v0.3.1/src/fibHeap.h b/contrib/MetaVelvet-v0.3.1/src/fibHeap.h new file mode 100644 index 0000000..4c8c0e6 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/fibHeap.h @@ -0,0 +1,41 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet.
+ + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _FIBHEAP_H_ +#define _FIBHEAP_H_ + +FibHeap *newFibHeap(); + +FibHeapNode *insertNodeIntoHeap(FibHeap * heap, Coordinate key, + Node * node); + +Coordinate minKeyOfHeap(FibHeap * heap); + +Coordinate replaceKeyInHeap(FibHeap * heap, FibHeapNode * node, + Coordinate newKey); + +void replaceValueInHeap(FibHeapNode * node, Node * newValue); + +Node *removeNextNodeFromHeap(FibHeap * heap); + +void *destroyNodeInHeap(FibHeapNode * node, FibHeap * heap); + +void destroyHeap(FibHeap * heap); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/fibpriv.h b/contrib/MetaVelvet-v0.3.1/src/fibpriv.h new file mode 100644 index 0000000..de8e8ce --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/fibpriv.h @@ -0,0 +1,110 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +/*- + * Copyright 1997, 1999-2003 John-Mark Gurney. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE.
+ * + * $Id: fibpriv.h,v 1.10 2007/10/09 09:56:46 zerbino Exp $ + * + */ + +#ifndef _FIBPRIV_H_ +#define _FIBPRIV_H_ + +#include "globals.h" + +/* + * specific node operations + */ +struct fibheap_el { + FibHeapNode *fhe_p; + FibHeapNode *fhe_child; + FibHeapNode *fhe_left; + FibHeapNode *fhe_right; + void *fhe_data; + Coordinate fhe_key; + int fhe_degree; + boolean fhe_mark; +}; + +static FibHeapNode *fhe_newelem(struct fibheap *); +static void fhe_initelem(FibHeapNode *); +static void fhe_insertafter(FibHeapNode * a, FibHeapNode * b); +static inline void fhe_insertbefore(FibHeapNode * a, FibHeapNode * b); +static FibHeapNode *fhe_remove(FibHeapNode * a); + +/* + * global heap operations + */ +struct fibheap { + Coordinate(*fh_cmp_fnct) (void *, void *); + RecycleBin *nodeMemory; + IDnum fh_n; + IDnum fh_Dl; + FibHeapNode **fh_cons; + FibHeapNode *fh_min; + FibHeapNode *fh_root; + void *fh_neginf; + boolean fh_keys:1; +}; + +static void fh_initheap(FibHeap *); +static void fh_insertrootlist(FibHeap *, FibHeapNode *); +static void fh_removerootlist(FibHeap *, FibHeapNode *); +static void fh_consolidate(FibHeap *); +static void fh_heaplink(FibHeap * h, FibHeapNode * y, FibHeapNode * x); +static void fh_cut(FibHeap *, FibHeapNode *, FibHeapNode *); +static void fh_cascading_cut(FibHeap *, FibHeapNode *); +static FibHeapNode *fh_extractminel(FibHeap *); +static void fh_checkcons(FibHeap * h); +static void fh_destroyheap(FibHeap * h); +static int fh_compare(FibHeap * h, FibHeapNode * a, FibHeapNode * b); +static int fh_comparedata(FibHeap * h, Coordinate key, + void *data, FibHeapNode * b); +static void fh_insertel(FibHeap * h, FibHeapNode * x); + +/* + * general functions + */ +static inline IDnum ceillog2(IDnum a); + +#endif /* _FIBPRIV_H_ */ diff --git a/contrib/MetaVelvet-v0.3.1/src/globals.h b/contrib/MetaVelvet-v0.3.1/src/globals.h new file mode 100644 index 0000000..0d295f9 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/globals.h @@ -0,0 +1,119 @@ +/* 
+Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _GLOBALS_H_ +#define _GLOBALS_H_ + +#ifndef NULL +#define NULL 0 +#endif + +#ifndef true +#define true 1 +#define false 0 +#endif + +#define VERSION_NUMBER 0 +#define RELEASE_NUMBER 7 +#define UPDATE_NUMBER 62 + +#define MAXLINE 5000 + +#if defined(_WIN32) || defined(__WIN32__) || defined(WIN32) +#define inline __inline +extern struct tString_st; +extern struct readSet_st; +extern struct splayTable_st; +extern struct annotation_st; +extern struct roadmap_st; +extern struct insertionMarker_st; +extern struct arc_st; +extern struct node_st; +extern struct graph_st; +extern struct passage_st; +extern struct passageList_st; +extern struct readStart_st; +extern struct preArc_st; +extern struct preNode_st; +extern struct preGraph_st; +extern struct fibheap; +extern struct fibheap_el; +extern struct dfibheap; +extern struct dfibheap_el; +#endif + +// Namespace sizes +#include <stdint.h> +typedef int8_t boolean; +typedef int8_t Nucleotide; +typedef uint8_t Descriptor; +typedef int32_t IDnum; +typedef int64_t Coordinate; +typedef int16_t ShortLength; +typedef double Time; +typedef uint8_t Quality; +typedef double Probability; +typedef int8_t Category; + +// Atomic word +typedef struct kmer_st Kmer; +typedef
int64_t KmerKey; + +// Just a sequence string, but with just two bits per character +typedef struct tString_st TightString; + +// A simple container when reading files +typedef struct readSet_st ReadSet; +typedef struct sequenceReader_st SequenceReader; + +// Hash table structures +typedef struct splayTable_st SplayTable; + +// Graph construction structures +typedef struct annotationList_st AnnotationList; +typedef struct annotation_st Annotation; +typedef struct roadmap_st RoadMap; +typedef struct roadMapArray_st RoadMapArray; +typedef struct insertionMarker_st InsertionMarker; + +// Pre-Graph elements +typedef struct preArc_st PreArc; +typedef struct preNode_st PreNode; +typedef struct preGraph_st PreGraph; + +// Graph elements +typedef struct arc_st Arc; +typedef struct node_st Node; +typedef struct graph_st Graph; +typedef struct shortReadMarker_st ShortReadMarker; +typedef struct passage_st PassageMarker; +typedef struct passageList_st PassageMarkerList; +typedef struct readStart_st ReadStart; +typedef struct gapMarker_st GapMarker; + +// Fibonacci heaps used mainly in Tour Bus +typedef struct fibheap FibHeap; +typedef struct fibheap_el FibHeapNode; +typedef struct dfibheap DFibHeap; +typedef struct dfibheap_el DFibHeapNode; + +typedef struct nodeList_st NodeList; + +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/graph.c b/contrib/MetaVelvet-v0.3.1/src/graph.c new file mode 100644 index 0000000..01f9945 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/graph.c @@ -0,0 +1,4223 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. 
+ + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include <stdlib.h> +#include <stdio.h> +#include <string.h> + +#include "globals.h" +#include "graph.h" +#include "recycleBin.h" +#include "tightString.h" +#include "passageMarker.h" +#include "utility.h" +#include "kmer.h" + +#define ADENINE 0 +#define CYTOSINE 1 +#define GUANINE 2 +#define THYMINE 3 + +struct arc_st { + Arc *twinArc; // 64 + Arc *next; // 64 + Arc *previous; // 64 + Arc *nextInLookupTable; // 64 + Node *destination; // 64 + IDnum multiplicity; // 32 +}; // 352 Total + +struct node_st { + Node *twinNode; // 64 + Arc *arc; // 64 + Descriptor *descriptor; // 64 + PassageMarker *marker; // 64 + Coordinate length; // 32 + Coordinate virtualCoverage[CATEGORIES]; // 32 * 2 + Coordinate originalVirtualCoverage[CATEGORIES]; // 32 * 2 + IDnum ID; // 32 + IDnum arcCount; // 32 + boolean status; // 1 + boolean uniqueness; // 1 +}; // 418 Total + +struct shortReadMarker_st { + Coordinate position; + IDnum readID; + ShortLength offset; +}; + +struct gapMarker_st { + GapMarker *next; + Coordinate position; + ShortLength length; +}; + +struct graph_st { + IDnum sequenceCount; + IDnum nodeCount; + Node **nodes; + Arc **arcLookupTable; + ShortReadMarker **nodeReads; + IDnum *nodeReadCounts; + Coordinate insertLengths[CATEGORIES + 1]; + double insertLengths_var[CATEGORIES + 1]; + int wordLength; + GapMarker **gapMarkers; +}; + +static RecycleBin *arcMemory = NULL; +static RecycleBin *nodeMemory = NULL; +static RecycleBin *gapMarkerMemory = NULL; + +#define BLOCKSIZE 50 +#define GAPBLOCKSIZE 10000 + +Arc *allocateArc() +{ + if (arcMemory == NULL) + arcMemory =
newRecycleBin(sizeof(Arc), BLOCKSIZE); + + return allocatePointer(arcMemory); +} + +void deallocateArc(Arc * arc) +{ + deallocatePointer(arcMemory, arc); +} + +Node *allocateNode() +{ + if (nodeMemory == NULL) + nodeMemory = newRecycleBin(sizeof(Node), BLOCKSIZE); + + return (Node *) allocatePointer(nodeMemory); +} + +void deallocateNode(Node * node) +{ + deallocatePointer(nodeMemory, node); +} + +// Returns the twin node of a given node +Node *getTwinNode(Node * node) +{ + return node->twinNode; +} + +// Inserts new passage marker in the marker list of destination node +void insertPassageMarker(PassageMarker * marker, Node * destination) +{ + setTopOfTheNode(marker); + setNextInNode(marker, destination->marker); + destination->marker = marker; +} + +// Returns the length of the node's descriptor list +Coordinate getNodeLength(Node * node) +{ + return node->length; +} + +// Returns the number of nodes in the graph +IDnum nodeCount(Graph * graph) +{ + return graph->nodeCount; +} + +// Returns the number of sequences used to build the graph +IDnum sequenceCount(Graph * graph) +{ + return graph->sequenceCount; +} + +// Creates an arc from node origin to node destination. +// If this arc already exists, increments its multiplicity by 1.
+Arc *createArc(Node * originNode, Node * destinationNode, Graph * graph) +{ + Arc *arc, *twinArc; + Node *destinationTwin; + IDnum lookupIndex; + + if (originNode == NULL || destinationNode == NULL) + return NULL; + +// printf("Connecting nodes %i -> %i\n", originNode->ID, destinationNode->ID); + + arc = getArcBetweenNodes(originNode, destinationNode, graph); + + if (arc != NULL) { + arc->multiplicity++; + arc->twinArc->multiplicity++; + return arc; + } + // If not found + arc = allocateArc(); + arc->destination = destinationNode; + arc->multiplicity = 1; + arc->previous = NULL; + arc->next = originNode->arc; + if (originNode->arc != NULL) + originNode->arc->previous = arc; + originNode->arc = arc; + originNode->arcCount++; + + destinationTwin = destinationNode->twinNode; + + // Hairpin case + if (destinationTwin == originNode) { + arc->multiplicity++; + arc->twinArc = arc; + if (graph->arcLookupTable != NULL) { + lookupIndex = + 2 * originNode->ID + destinationNode->ID + + 3 * graph->nodeCount; + arc->nextInLookupTable = + graph->arcLookupTable[lookupIndex]; + graph->arcLookupTable[lookupIndex] = arc; + } + return arc; + } + + twinArc = allocateArc(); + twinArc->destination = originNode->twinNode; + twinArc->multiplicity = 1; + twinArc->previous = NULL; + twinArc->next = destinationTwin->arc; + if (destinationTwin->arc != NULL) + destinationTwin->arc->previous = twinArc; + destinationTwin->arc = twinArc; + destinationTwin->arcCount++; + + arc->twinArc = twinArc; + twinArc->twinArc = arc; + + if (graph->arcLookupTable != NULL) { + lookupIndex = + 2 * originNode->ID + destinationNode->ID + + 3 * graph->nodeCount; + arc->nextInLookupTable = + graph->arcLookupTable[lookupIndex]; + graph->arcLookupTable[lookupIndex] = arc; + + lookupIndex = + -2 * destinationNode->ID - originNode->ID + + 3 * graph->nodeCount; + twinArc->nextInLookupTable = + graph->arcLookupTable[lookupIndex]; + graph->arcLookupTable[lookupIndex] = twinArc; + } + return arc; +} + +void 
createAnalogousArc(Node * originNode, Node * destinationNode, + Arc * refArc, Graph * graph) +{ + Arc *arc, *twinArc; + Node *destinationTwin; + IDnum lookupIndex; + + if (originNode == NULL || destinationNode == NULL) + return; + +// printf("Connecting nodes %i -> %i\n", originNode->ID, destinationNode->ID); + + arc = getArcBetweenNodes(originNode, destinationNode, graph); + + if (arc != NULL) { + if (refArc->twinArc != refArc) { + arc->multiplicity += getMultiplicity(refArc); + arc->twinArc->multiplicity += + getMultiplicity(refArc); + } else { + arc->multiplicity += getMultiplicity(refArc) / 2; + arc->twinArc->multiplicity += + getMultiplicity(refArc) / 2; + } + return; + } + // If not found + arc = allocateArc(); + arc->destination = destinationNode; + arc->multiplicity = getMultiplicity(refArc); + arc->previous = NULL; + arc->next = originNode->arc; + if (originNode->arc != NULL) + originNode->arc->previous = arc; + originNode->arc = arc; + originNode->arcCount++; + + destinationTwin = destinationNode->twinNode; + + // Hairpin case + if (destinationTwin == originNode) { + arc->twinArc = arc; + if (refArc->twinArc != refArc) + arc->multiplicity *= 2; + + if (graph->arcLookupTable != NULL) { + lookupIndex = + 2 * originNode->ID + destinationNode->ID + + 3 * graph->nodeCount; + arc->nextInLookupTable = + graph->arcLookupTable[lookupIndex]; + graph->arcLookupTable[lookupIndex] = arc; + } + return; + } + + twinArc = allocateArc(); + twinArc->destination = originNode->twinNode; + twinArc->multiplicity = getMultiplicity(refArc); + twinArc->previous = NULL; + twinArc->next = destinationTwin->arc; + if (destinationTwin->arc != NULL) + destinationTwin->arc->previous = twinArc; + destinationTwin->arc = twinArc; + destinationTwin->arcCount++; + + arc->twinArc = twinArc; + twinArc->twinArc = arc; + + if (graph->arcLookupTable != NULL) { + lookupIndex = + 2 * originNode->ID + destinationNode->ID + + 3 * graph->nodeCount; + arc->nextInLookupTable = + 
graph->arcLookupTable[lookupIndex]; + graph->arcLookupTable[lookupIndex] = arc; + + lookupIndex = + -2 * destinationNode->ID - originNode->ID + + 3 * graph->nodeCount; + twinArc->nextInLookupTable = + graph->arcLookupTable[lookupIndex]; + graph->arcLookupTable[lookupIndex] = twinArc; + } +} + +void changeMultiplicity(Arc * arc, IDnum variation) +{ + if (arc == NULL) + return; + arc->multiplicity += variation; + arc->twinArc->multiplicity += variation; +} + +Arc *getArcBetweenNodes(Node * originNode, Node * destinationNode, + Graph * graph) +{ + Arc *arc; + Node *twinDestination, *twinOrigin; + + if (originNode == NULL || destinationNode == NULL) + return NULL; + + if (graph->arcLookupTable != NULL) { + for (arc = + graph->arcLookupTable[2 * originNode->ID + + destinationNode->ID + + 3 * graph->nodeCount]; + arc != NULL; arc = arc->nextInLookupTable) { + if (arc->destination == destinationNode) { + return arc; + } + } + return NULL; + } + + twinDestination = destinationNode->twinNode; + if (originNode->arcCount <= twinDestination->arcCount) { + for (arc = originNode->arc; arc != NULL; arc = arc->next) + if (arc->destination == destinationNode) + return arc; + return NULL; + } + + twinOrigin = originNode->twinNode; + for (arc = twinDestination->arc; arc != NULL; arc = arc->next) + if (arc->destination == twinOrigin) + return arc->twinArc; + return NULL; +} + +void destroyArc(Arc * arc, Graph * graph) +{ + Node *origin, *destination; + Arc *twinArc; + Arc *currentArc; + IDnum lookupIndex; + + if (arc == NULL) + return; + + twinArc = arc->twinArc; + origin = twinArc->destination->twinNode; + destination = arc->destination->twinNode; + + //printf("Destroying arc %p\n", arc); + + // Removing arc from list + if (origin->arc == arc) { + origin->arc = arc->next; + if (origin->arc != NULL) + origin->arc->previous = NULL; + } else { + arc->previous->next = arc->next; + if (arc->next != NULL) + arc->next->previous = arc->previous; + } + + origin->arcCount--; + + if 
(destination == origin) { + if (graph->arcLookupTable != NULL) { + lookupIndex = + 2 * origin->ID - destination->ID + + 3 * graph->nodeCount; + currentArc = graph->arcLookupTable[lookupIndex]; + if (currentArc == arc) + graph->arcLookupTable[lookupIndex] = + arc->nextInLookupTable; + else { + while (currentArc->nextInLookupTable != + arc) + currentArc = + currentArc->nextInLookupTable; + + currentArc->nextInLookupTable = + twinArc->nextInLookupTable; + } + } + + deallocateArc(arc); + return; + } + // Removing arc's twin from list + if (destination->arc == twinArc) { + destination->arc = twinArc->next; + if (destination->arc != NULL) + destination->arc->previous = NULL; + } else { + twinArc->previous->next = twinArc->next; + if (twinArc->next != NULL) + twinArc->next->previous = twinArc->previous; + } + + destination->arcCount--; + + if (graph->arcLookupTable != NULL) { + lookupIndex = + 2 * origin->ID - destination->ID + + 3 * graph->nodeCount; + currentArc = graph->arcLookupTable[lookupIndex]; + if (currentArc == arc) + graph->arcLookupTable[lookupIndex] = + arc->nextInLookupTable; + else { + while (currentArc->nextInLookupTable != arc) + currentArc = currentArc->nextInLookupTable; + + currentArc->nextInLookupTable = + arc->nextInLookupTable; + } + + lookupIndex = + 2 * destination->ID - origin->ID + + 3 * graph->nodeCount; + currentArc = graph->arcLookupTable[lookupIndex]; + if (currentArc == twinArc) + graph->arcLookupTable[lookupIndex] = + twinArc->nextInLookupTable; + else { + while (currentArc->nextInLookupTable != twinArc) + currentArc = currentArc->nextInLookupTable; + + currentArc->nextInLookupTable = + twinArc->nextInLookupTable; + } + } + // Freeing memory + deallocateArc(arc); + deallocateArc(twinArc); +} + +void destroyNode(Node * node, Graph * graph) +{ + Node *twin = node->twinNode; + IDnum ID = node->ID; + IDnum index; + + //printf("Destroying %d\n and twin %d\n", getNodeID(node), getNodeID(twin)); + + if (ID < 0) + ID = -ID; + + // Node arcs: + 
while (node->arc != NULL) + destroyArc(node->arc, graph); + while (twin->arc != NULL) + destroyArc(twin->arc, graph); + + // Descriptors + free(node->descriptor); + free(twin->descriptor); + + // Passage markers + while (node->marker != NULL) + destroyPassageMarker(node->marker); + + // Reads starts + if (graph->nodeReads != NULL) { + index = ID + graph->nodeCount; + free(graph->nodeReads[index]); + graph->nodeReads[index] = NULL; + graph->nodeReadCounts[index] = 0; + + index = -ID + graph->nodeCount; + free(graph->nodeReads[index]); + graph->nodeReads[index] = NULL; + graph->nodeReadCounts[index] = 0; + } + + graph->nodes[ID] = NULL; + deallocateNode(node); + deallocateNode(twin); +} + +int outDegree(Node * node) +{ + int result = 0; + Arc *arc = node->arc; + while (arc != NULL) { + result += arc->multiplicity; + arc = arc->next; + } + + return result; +} + +int simpleArcCount(Node * node) +{ + return node->arcCount; +} + +int arcCount(Node * node) +{ + int result = 0; + Arc *arc; + + if (node == NULL) + return result; + + arc = node->arc; + while (arc != NULL) { + result++; + if (arc->destination == node->twinNode) + result++; + arc = arc->next; + } + + return result; + +} + +static Nucleotide getNucleotideInDescriptor(Descriptor * descriptor, + Coordinate i) +{ + Descriptor *fourMer = descriptor + i / 4; + + switch (i % 4) { + case 0: + return (*fourMer & 3); + case 1: + return (*fourMer & 12) >> 2; + case 2: + return (*fourMer & 48) >> 4; + case 3: + return (*fourMer & 192) >> 6; + } + return 0; +} + +Nucleotide getNucleotideInNode(Node * node, Coordinate index) { + return getNucleotideInDescriptor(node->descriptor, index); +} + +char *readNode(Node * node) +{ + char *s = callocOrExit(1000000, char); + char tmpString[100000]; + Descriptor *descriptor = node->descriptor; + Nucleotide nucleotide; + Coordinate i; + + sprintf(s, "NODE %d :", node->ID); + + for (i = 0; i < node->length; i++) { + nucleotide = getNucleotideInDescriptor(descriptor, i); + switch 
(nucleotide) { + case ADENINE: + tmpString[i] = 'A'; + break; + case CYTOSINE: + tmpString[i] = 'C'; + break; + case GUANINE: + tmpString[i] = 'G'; + break; + case THYMINE: + tmpString[i] = 'T'; + break; + } + } + + tmpString[i] = '\0'; + strcat(s, tmpString); + + /* + while (arc != NULL) { + sprintf(tmpString, " %d(%dx);", arc->destination->ID, + arc->multiplicity); + strcat(s, tmpString); + arc = arc->next; + } + + sprintf(tmpString, " lgth: %d", node->length); + strcat(s, tmpString); + */ + + return s; +} + +void displayGraph(Graph * graph) +{ + Node *currentNode; + IDnum nodeIndex; + + printf("%d sequences\n", graph->sequenceCount); + printf("%d*2 nodes\n", graph->nodeCount); + + for (nodeIndex = 1; nodeIndex <= graph->nodeCount; nodeIndex++) { + currentNode = graph->nodes[nodeIndex]; + printf("%s\n", readNode(currentNode)); + printf("%s\n", readNode(currentNode->twinNode)); + } +} + +PassageMarker *getMarker(Node * node) +{ + return node->marker; +} + +void setMarker(Node * node, PassageMarker * marker) +{ + if (node == NULL) + return; + + if (marker == NULL) { + node->marker = NULL; + node->twinNode->marker = NULL; + return; + } + + node->marker = marker; + setTopOfTheNode(marker); + node->twinNode->marker = getTwinMarker(marker); + setTopOfTheNode(getTwinMarker(marker)); +} + +void setNodeStatus(Node * node, boolean status) +{ + node->status = status; + node->twinNode->status = status; +} + +void setSingleNodeStatus(Node * node, boolean status) +{ + node->status = status; +} + +boolean getNodeStatus(Node * node) +{ + if (node == NULL) + return false; + return node->status; +} + +IDnum getNodeID(Node * node) +{ + if (node == NULL) + return 0; + + return node->ID; +} + +void resetNodeStatus(Graph * graph) +{ + IDnum nodeIndex; + Node *node; + + for (nodeIndex = 1; nodeIndex <= graph->nodeCount; nodeIndex++) { + node = graph->nodes[nodeIndex]; + if (node == NULL) + continue; + + node->status = false; + node->twinNode->status = false; + } +} + +void 
resetPassageMarkersStatus(Graph * graph) +{ + IDnum nodeIndex; + Node *node; + PassageMarker *marker; + + for (nodeIndex = 1; nodeIndex <= graph->nodeCount; nodeIndex++) { + node = graph->nodes[nodeIndex]; + if (node == NULL) + continue; + + for (marker = node->marker; marker != NULL; + marker = getNextInNode(marker)) + setPassageMarkerStatus(marker, false); + } +} + +Node *getNodeInGraph(Graph * graph, IDnum nodeID) +{ + if (nodeID == 0) + return NULL; + else if (nodeID > 0) + return graph->nodes[nodeID]; + else if (graph->nodes[-nodeID] == NULL) + return NULL; + else + return graph->nodes[-nodeID]->twinNode; +} + +Arc *getArc(Node * node) +{ + return node->arc; +} + +Arc *getNextArc(Arc * arc) +{ + return arc->next; +} + +IDnum getMultiplicity(Arc * arc) +{ + if (arc == NULL) + return 0; + + return arc->multiplicity; +} + +Node *getOrigin(Arc * arc) +{ + if (arc == NULL) + return NULL; + + return arc->twinArc->destination->twinNode; +} + +Node *getDestination(Arc * arc) +{ + if (arc == NULL) + return NULL; + + return arc->destination; +} + +IDnum markerCount(Node * node) +{ + IDnum count = 0; + PassageMarker *marker; + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) + count++; + + return count; +} + +void appendNodeSequence(Node * node, TightString * sequence, + Coordinate writeIndex) +{ + Coordinate i; + Nucleotide nucleotide; + + //printf("Getting sequence from node %d of length %d (%d)\n", getNodeID(node), getNodeLength(node), getLength(nodeLabel)); + + for (i = 0; i < getNodeLength(node); i++) { + nucleotide = + getNucleotideInDescriptor(node->descriptor, i); + writeNucleotideAtPosition(nucleotide, i + writeIndex, + sequence); + } +} + +static void writeNucleotideInDescriptor(Nucleotide nucleotide, + Descriptor * descriptor, + Coordinate i) +{ + Descriptor *fourMer = descriptor + i / 4; + switch (i % 4) { + case 3: + *fourMer &= 63; + *fourMer += nucleotide << 6; + return; + case 2: + *fourMer &= 207; + *fourMer += 
nucleotide << 4; + return; + case 1: + *fourMer &= 243; + *fourMer += nucleotide << 2; + return; + case 0: + *fourMer &= 252; + *fourMer += nucleotide; + } +} + +static inline Descriptor *mergeDescriptors(Descriptor * descr, + Coordinate destinationLength, + Descriptor * copy, + Coordinate sourceLength, + size_t arrayLength) +{ + Descriptor *readPtr, *writePtr; + Descriptor readCopy; + int readOffset, writeOffset; + Descriptor *new = callocOrExit(arrayLength, Descriptor); + Coordinate index; + + readPtr = descr; + readCopy = *readPtr; + writePtr = new; + writeOffset = 0; + for (index = 0; index < destinationLength; index++) { + (*writePtr) >>= 2; + (*writePtr) += (readCopy & 3) << 6; + readCopy >>= 2; + + writeOffset++; + if (writeOffset == 4) { + writePtr++; + readPtr++; + if (index < destinationLength - 1) + readCopy = *readPtr; + writeOffset = 0; + } + } + + readPtr = copy; + readCopy = *readPtr; + readOffset = 0; + for (index = 0; index < sourceLength; index++) { + (*writePtr) >>= 2; + (*writePtr) += (readCopy & 3) << 6; + readCopy >>= 2; + + writeOffset++; + if (writeOffset == 4) { + writePtr++; + writeOffset = 0; + } + + readOffset++; + if (readOffset == 4) { + readPtr++; + if (index < sourceLength - 1) + readCopy = *readPtr; + readOffset = 0; + } + } + + if (writeOffset != 0) { + while (writeOffset != 4) { + (*writePtr) >>= 2; + writeOffset++; + } + } + + return new; +} + +static void addBufferToDescriptor(Node * node, Coordinate length) +{ + Descriptor *descr; + Descriptor *twinDescr; + Coordinate newLength; + size_t arrayLength; + Node *twinNode; + Coordinate index; + Descriptor *old_descriptor; + + if (node == NULL) + return; + + twinNode = node->twinNode; + descr = node->descriptor; + twinDescr = twinNode->descriptor; + + // Amendments for empty descriptors + if (descr == NULL) { + arrayLength = length / 4; + if (length % 4 != 0) + arrayLength++; + + node->descriptor = callocOrExit(arrayLength, Descriptor); + node->length = length; + twinNode->descriptor 
= + callocOrExit(arrayLength, Descriptor); + twinNode->length = length; + return; + } + + newLength = node->length + length; + arrayLength = newLength / 4; + if (newLength % 4 != 0) + arrayLength++; + + // Merging forward descriptors + node->descriptor = + reallocOrExit(node->descriptor, arrayLength, Descriptor); + + for (index = node->length; index < newLength; index++) + writeNucleotideInDescriptor(ADENINE, node->descriptor, + index); + node->length = newLength; + + // Merging reverse descriptors + old_descriptor = twinNode->descriptor; + twinNode->descriptor = callocOrExit(arrayLength, Descriptor); + for (index = 0; index < twinNode->length; index++) + writeNucleotideInDescriptor(getNucleotideInDescriptor + (old_descriptor, index), + twinNode->descriptor, + index + length); + for (index = 0; index < length; index++) + writeNucleotideInDescriptor(THYMINE, twinNode->descriptor, + index); + free(old_descriptor); + twinNode->length = newLength; +} + +void appendDescriptors(Node * destination, Node * source) +{ + Descriptor *copy; + Descriptor *twinCopy; + Descriptor *descr; + Descriptor *twinDescr; + Coordinate newLength, destinationLength, sourceLength; + size_t arrayLength; + Descriptor *new; + Node *twinDestination; + + if (source == NULL || destination == NULL) + return; + + twinDestination = destination->twinNode; + descr = destination->descriptor; + twinDescr = twinDestination->descriptor; + copy = source->descriptor; + twinCopy = source->twinNode->descriptor; + + // Amendments for empty descriptors + if (getNodeLength(source) == 0) + return; + if (getNodeLength(destination) == 0) { + destination->descriptor = copy; + twinDestination->descriptor = twinCopy; + source->descriptor = NULL; + source->twinNode->descriptor = NULL; + destination->length = source->length; + destination->twinNode->length = source->length; + source->length = 0; + source->twinNode->length = 0; + return; + } + + destinationLength = destination->length; + sourceLength = source->length; + 
newLength = destinationLength + sourceLength; + arrayLength = newLength / 4; + if (newLength % 4 != 0) + arrayLength++; + + // Merging forward descriptors + new = + mergeDescriptors(descr, destinationLength, copy, sourceLength, + arrayLength); + free(descr); + destination->descriptor = new; + destination->length = newLength; + + // Merging reverse descriptors + new = + mergeDescriptors(twinCopy, sourceLength, twinDescr, + destinationLength, arrayLength); + free(twinDescr); + twinDestination->descriptor = new; + twinDestination->length = newLength; +} + +static void catDescriptors(Descriptor * descr, Coordinate destinationLength, Descriptor * copy, Coordinate sourceLength) +{ + Coordinate index; + Nucleotide nucleotide; + + for (index = 0; index < sourceLength; index++) { + nucleotide = getNucleotideInDescriptor(copy, index); + writeNucleotideInDescriptor(nucleotide, descr, index + destinationLength); + } +} + +static void reverseCatDescriptors(Descriptor * descr, Coordinate destinationLength, Descriptor * copy, Coordinate sourceLength, Coordinate totalLength) +{ + Coordinate shift = totalLength - destinationLength - sourceLength; + Coordinate index; + Nucleotide nucleotide; + + for (index = 0; index < sourceLength; index++) { + nucleotide = getNucleotideInDescriptor(copy, index); + writeNucleotideInDescriptor(nucleotide, descr, index + shift); + } +} + +void directlyAppendDescriptors(Node * destination, Node * source, Coordinate totalLength) +{ + Descriptor *copy; + Descriptor *twinCopy; + Descriptor *descr; + Descriptor *twinDescr; + Coordinate destinationLength, sourceLength; + + if (source == NULL || destination == NULL) + return; + + descr = destination->descriptor; + twinDescr = destination->twinNode->descriptor; + copy = source->descriptor; + twinCopy = source->twinNode->descriptor; + + // Amendments for empty descriptors + if (getNodeLength(source) == 0) + return; + + destinationLength = destination->length; + sourceLength = source->length; + + // Merging 
forward descriptors + catDescriptors(descr, destinationLength, copy, sourceLength); + + // Merging reverse descriptors + reverseCatDescriptors(twinDescr, destinationLength, twinCopy, sourceLength, totalLength); + + destination->length += source->length; + destination->twinNode->length += source->length; +} + +static void copyDownDescriptor(Descriptor ** writePtr, int *writeOffset, + Descriptor * source, Coordinate length) +{ + Descriptor *readPtr = source; + Descriptor readCopy = *readPtr; + int readOffset = 0; + Coordinate index; + + for (index = 0; index < length; index++) { + (**writePtr) >>= 2; + (**writePtr) += (readCopy & 3) << 6; + readCopy >>= 2; + + (*writeOffset)++; + if (*writeOffset == 4) { + (*writePtr)++; + *writeOffset = 0; + } + + readOffset++; + if (readOffset == 4) { + readPtr++; + if (index < length - 1) + readCopy = *readPtr; + readOffset = 0; + } + } +} + +static void copyDownSequence(Descriptor ** writePtr, int *writeOffset, + TightString * sequence, Coordinate start, + Coordinate finish, int WORDLENGTH) +{ + boolean forward = (start < finish); + Coordinate sourceLength = finish - start; + Coordinate index; + Nucleotide nucleotide; + + if (!forward) + sourceLength *= -1; + + for (index = 0; index < sourceLength; index++) { + if (forward) + nucleotide = + getNucleotide(start + WORDLENGTH - 1 + index, + sequence); + else + nucleotide = +#ifndef COLOR + 3 - getNucleotide(start - index - 1, sequence); +#else + getNucleotide(start - index - 1, sequence); +#endif + + (**writePtr) >>= 2; + (**writePtr) += nucleotide << 6; + + (*writeOffset)++; + if (*writeOffset == 4) { + (*writePtr)++; + *writeOffset = 0; + } + } +} + +static Descriptor *appendSequenceToDescriptor(Descriptor * descr, + Coordinate nodeLength, + PassageMarker * marker, + TightString ** sequences, + int WORDLENGTH, + size_t arrayLength, + boolean downStream) +{ + int writeOffset = 0; + Descriptor *new = callocOrExit(arrayLength, Descriptor); + Descriptor *writePtr = new; + TightString 
*sequence; + IDnum sequenceID = getPassageMarkerSequenceID(marker); + Coordinate start = getPassageMarkerStart(marker); + Coordinate finish = getPassageMarkerFinish(marker); + + if (sequenceID > 0) + sequence = sequences[sequenceID - 1]; + else + sequence = sequences[-sequenceID - 1]; + + if (downStream) + copyDownDescriptor(&writePtr, &writeOffset, descr, + nodeLength); + + copyDownSequence(&writePtr, &writeOffset, sequence, start, finish, + WORDLENGTH); + + if (!downStream) + copyDownDescriptor(&writePtr, &writeOffset, descr, + nodeLength); + + if (writeOffset != 0) { + while (writeOffset != 4) { + (*writePtr) >>= 2; + writeOffset++; + } + } + + return new; +} + +void appendSequence(Node * node, TightString ** reads, + PassageMarker * guide, Graph * graph) +{ + Descriptor *descr; + Descriptor *twinDescr; + Coordinate newLength, nodeLength, sourceLength; + size_t arrayLength; + Descriptor *new; + Node *twinNode; + + if (node == NULL) + return; + + twinNode = node->twinNode; + descr = node->descriptor; + twinDescr = twinNode->descriptor; + nodeLength = node->length; + sourceLength = getPassageMarkerLength(guide); + + // Amendments for empty descriptors + if (sourceLength == 0) + return; + + newLength = nodeLength + sourceLength; + arrayLength = newLength / 4; + if (newLength % 4 != 0) + arrayLength++; + + // Merging forward descriptors + new = + appendSequenceToDescriptor(descr, nodeLength, guide, reads, + getWordLength(graph), arrayLength, + true); + free(descr); + node->descriptor = new; + node->length = newLength; + + // Merging reverse descriptors + new = + appendSequenceToDescriptor(twinDescr, nodeLength, + getTwinMarker(guide), reads, + getWordLength(graph), arrayLength, + false); + free(twinDescr); + twinNode->descriptor = new; + twinNode->length = newLength; +} + +void setMultiplicity(Arc * arc, IDnum mult) +{ + arc->multiplicity = mult; + arc->twinArc->multiplicity = mult; +} + +// Reshuffles the graph->nodes array to remove NULL pointers +// Beware that 
node IDs are accordingly reshuffled (all pointers remain valid though) +void renumberNodes(Graph * graph) +{ + IDnum nodeIndex; + Node *currentNode; + IDnum counter = 0; + IDnum nodes = graph->nodeCount; + IDnum newIndex; + + puts("Renumbering nodes"); + printf("Initial node count %d\n", graph->nodeCount); + + for (nodeIndex = 1; nodeIndex <= nodes; nodeIndex++) { + currentNode = getNodeInGraph(graph, nodeIndex); + + if (currentNode == NULL) + counter++; + else if (counter != 0) { + newIndex = nodeIndex - counter; + currentNode->ID = newIndex; + currentNode->twinNode->ID = -newIndex; + graph->nodes[newIndex] = currentNode; + + if (graph->nodeReads != NULL) { + graph->nodeReads[newIndex + nodes] = + graph->nodeReads[nodeIndex + nodes]; + graph->nodeReadCounts[newIndex + nodes] = + graph->nodeReadCounts[nodeIndex + + nodes]; + + graph->nodeReads[nodeIndex + nodes] = NULL; + graph->nodeReadCounts[nodeIndex + nodes] = + 0; + + graph->nodeReads[-newIndex + nodes] = + graph->nodeReads[-nodeIndex + nodes]; + graph->nodeReadCounts[-newIndex + nodes] = + graph->nodeReadCounts[-nodeIndex + + nodes]; + + graph->nodeReads[-nodeIndex + nodes] = + NULL; + graph->nodeReadCounts[-nodeIndex + nodes] = + 0; + } + + if (graph->gapMarkers != NULL) { + graph->gapMarkers[newIndex] = + graph->gapMarkers[nodeIndex]; + graph->gapMarkers[nodeIndex] = NULL; + } + } + } + + // Shifting array to the left + if (graph->nodeReads != NULL && counter != 0) { + for (nodeIndex = counter; nodeIndex <= 2 * nodes - counter; + nodeIndex++) { + graph->nodeReads[nodeIndex - counter] = + graph->nodeReads[nodeIndex]; + graph->nodeReadCounts[nodeIndex - counter] = + graph->nodeReadCounts[nodeIndex]; + } + } + + // Reallocating node space + graph->nodeCount -= counter; + graph->nodes = + reallocOrExit(graph->nodes, graph->nodeCount + 1, Node *); + + // Reallocating short read marker arrays + if (graph->nodeReads != NULL) { + graph->nodeReads = + reallocOrExit(graph->nodeReads, + 2 * graph->nodeCount + + 1, 
ShortReadMarker *); + graph->nodeReadCounts = + reallocOrExit(graph->nodeReadCounts, + 2 * graph->nodeCount + 1, IDnum); + } + + // Reallocating gap marker table + if (graph->gapMarkers != NULL) + graph->gapMarkers = reallocOrExit(graph->gapMarkers, + graph->nodeCount + + 1, GapMarker *); + + printf("Removed %d null nodes\n", counter); +} + +void splitNodeDescriptor(Node * source, Node * target, Coordinate offset) +{ + Coordinate originalLength = source->length; + Coordinate backLength = originalLength - offset; + Coordinate index; + Descriptor *descriptor, *new; + size_t arrayLength; + Nucleotide nucleotide; + + source->length = offset; + source->twinNode->length = offset; + + if (target != NULL) { + target->length = backLength; + target->twinNode->length = backLength; + free(target->descriptor); + free(target->twinNode->descriptor); + target->descriptor = NULL; + target->twinNode->descriptor = NULL; + } + + if (backLength == 0) + return; + + descriptor = source->descriptor; + + arrayLength = backLength / 4; + if (backLength % 4 > 0) + arrayLength++; + + if (target != NULL) { + // Target node .. 
forwards + new = mallocOrExit(arrayLength, Descriptor); + target->descriptor = new; + for (index = 0; index < backLength; index++) { + nucleotide = + getNucleotideInDescriptor(descriptor, index); + writeNucleotideInDescriptor(nucleotide, new, + index); + } + } + // Source node + for (index = backLength; index < originalLength; index++) { + nucleotide = getNucleotideInDescriptor(descriptor, index); + writeNucleotideInDescriptor(nucleotide, descriptor, + index - backLength); + } + + if (target == NULL) + return; + + // target node other way + descriptor = source->twinNode->descriptor; + new = mallocOrExit(arrayLength, Descriptor); + target->twinNode->descriptor = new; + + for (index = offset; index < originalLength; index++) { + nucleotide = getNucleotideInDescriptor(descriptor, index); + writeNucleotideInDescriptor(nucleotide, new, + index - offset); + } +} + +void reduceNode(Node * node) +{ + free(node->descriptor); + node->descriptor = NULL; + node->length = 0; + + free(node->twinNode->descriptor); + node->twinNode->descriptor = NULL; + node->twinNode->length = 0; +} + +void checkPassageMarkersStatus(Graph * graph) +{ + IDnum nodeIndex; + Node *node; + PassageMarker *marker; + + for (nodeIndex = 1; nodeIndex <= graph->nodeCount; nodeIndex++) { + node = graph->nodes[nodeIndex]; + if (node == NULL) + continue; + + for (marker = node->marker; marker != NULL; + marker = getNextInNode(marker)) { + if (getPassageMarkerStatus(marker)) { + printf("TRUE marker %s\n", + readPassageMarker(marker)); + exit(-1); + } + + if (getNextInSequence(marker) != NULL + && getArcBetweenNodes(node, + getNode(getNextInSequence + (marker)), + graph) == NULL) { + printf + ("Missing arc %d -> %d (for %d)\n", + getNodeID(node), + getNodeID(getNode + (getNextInSequence + (marker))), + getPassageMarkerSequenceID(marker)); + abort(); + } + if (getPreviousInSequence(marker) != NULL + && + getArcBetweenNodes(getNode + (getPreviousInSequence + (marker)), node, + graph) == NULL) { + printf + 
("Missing arc %d -> %d (for %d)\n", + getNodeID(getNode + (getPreviousInSequence + (marker))), + getNodeID(node), + getPassageMarkerSequenceID(marker)); + abort(); + } + } + } +} + +void reassessArcMultiplicities(Graph * graph) +{ + IDnum index; + Node *node, *twin; + Arc *arc; + PassageMarker *marker; + + for (index = 1; index <= graph->nodeCount; index++) { + node = getNodeInGraph(graph, index); + + if (node == NULL) + continue; + + for (arc = getArc(node); arc != NULL; + arc = getNextArc(arc)) + setMultiplicity(arc, 0); + for (arc = getArc(getTwinNode(node)); arc != NULL; + arc = getNextArc(arc)) + setMultiplicity(arc, 0); + } + + for (index = 1; index <= graph->nodeCount; index++) { + node = getNodeInGraph(graph, index); + + if (node == NULL) + continue; + + twin = getTwinNode(node); + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) { + if (getPassageMarkerSequenceID(marker) > 0 + && !isTerminal(marker)) { + arc = getArcBetweenNodes(node, + getNode + (getNextInSequence + (marker)), + graph); + if (arc != NULL) + changeMultiplicity(arc, 1); + } else if (getPassageMarkerSequenceID(marker) < 0 + && !isInitial(marker)) { + arc = getArcBetweenNodes(twin, + getTwinNode + (getNode + (getPreviousInSequence + (marker))), + graph); + if (arc != NULL) + changeMultiplicity(arc, 1); + } + } + + } +} + +// Allocate memory for an empty graph created with sequenceCount different sequences +Graph *emptyGraph(IDnum sequenceCount, int wordLength) +{ + Graph *newGraph = mallocOrExit(1, Graph); + newGraph->sequenceCount = sequenceCount; + newGraph->arcLookupTable = NULL; + newGraph->nodeReads = NULL; + newGraph->nodeReadCounts = NULL; + newGraph->wordLength = wordLength; + newGraph->gapMarkers = NULL; + return newGraph; +} + +static Descriptor *newPositiveDescriptor(IDnum sequenceID, + Coordinate start, + Coordinate finish, + TightString ** sequences, + int WORDLENGTH) +{ + Coordinate index; + Nucleotide nucleotide; + TightString *tString = 
sequences[sequenceID - 1]; + Coordinate length = finish - start; + Descriptor *res; + size_t arrayLength = length / 4; + + if (length % 4 > 0) + arrayLength++; + + res = mallocOrExit(arrayLength, Descriptor); + + for (index = 0; index < length; index++) { + nucleotide = + getNucleotide(start + index + WORDLENGTH - 1, tString); + writeNucleotideInDescriptor(nucleotide, res, index); + } + + return res; + +} + +static Descriptor *newNegativeDescriptor(IDnum sequenceID, + Coordinate start, + Coordinate finish, + TightString ** sequences, + int WORDLENGTH) +{ + Coordinate index; + Nucleotide nucleotide; + TightString *tString = sequences[-sequenceID - 1]; + Coordinate length = start - finish; + Descriptor *res; + size_t arrayLength = length / 4; + + if (length % 4 > 0) + arrayLength++; + + res = mallocOrExit(arrayLength, Descriptor); + + for (index = 0; index < length; index++) { + nucleotide = getNucleotide(start - index, tString); +#ifndef COLOR + writeNucleotideInDescriptor(3 - nucleotide, res, index); +#else + writeNucleotideInDescriptor(nucleotide, res, index); +#endif + } + + return res; + +} + +static Descriptor *newDescriptor(IDnum sequenceID, Coordinate start, + Coordinate finish, + TightString ** sequences, int WORDLENGTH) +{ + if (sequenceID > 0) + return newPositiveDescriptor(sequenceID, start, finish, + sequences, WORDLENGTH); + else + return newNegativeDescriptor(sequenceID, start, finish, + sequences, WORDLENGTH); +} + +// Constructor +// Memory allocated +Node *newNode(IDnum sequenceID, Coordinate start, Coordinate finish, + Coordinate offset, IDnum ID, TightString ** sequences, + int WORDLENGTH) +{ + Node *newnd = allocateNode(); + Node *antiNode = allocateNode(); + Category cat; + + newnd->ID = ID; + newnd->descriptor = + newDescriptor(sequenceID, start + offset, finish + offset, + sequences, WORDLENGTH); + newnd->arc = NULL; + newnd->arcCount = 0; + newnd->marker = NULL; + newnd->status = false; + for (cat = 0; cat < CATEGORIES; cat++) { + 
newnd->virtualCoverage[cat] = 0; + newnd->originalVirtualCoverage[cat] = 0; + } + + antiNode->ID = -ID; + antiNode->descriptor = + newDescriptor(-sequenceID, finish + offset - 1, + start + offset - 1, sequences, WORDLENGTH); + antiNode->arc = NULL; + antiNode->arcCount = 0; + antiNode->marker = NULL; + antiNode->status = false; + for (cat = 0; cat < CATEGORIES; cat++) { + antiNode->virtualCoverage[cat] = 0; + antiNode->originalVirtualCoverage[cat] = 0; + } + + newnd->twinNode = antiNode; + antiNode->twinNode = newnd; + + if (sequenceID > 0) { + newnd->length = finish - start; + antiNode->length = finish - start; + } else { + newnd->length = start - finish; + antiNode->length = start - finish; + } + + return newnd; +} + +void allocateNodeSpace(Graph * graph, IDnum nodeCount) +{ + graph->nodes = callocOrExit(nodeCount + 1, Node *); + graph->nodeCount = nodeCount; +} + +void addNodeToGraph(Graph * graph, Node * node) +{ + graph->nodes[node->ID] = node; +} + +boolean getUniqueness(Node * node) +{ + return node->uniqueness; +} + +void setUniqueness(Node * node, boolean value) +{ + node->uniqueness = value; + node->twinNode->uniqueness = value; +} + +Node *emptyNode() +{ + Node *newnd = allocateNode(); + Node *antiNode = allocateNode(); + Category cat; + + newnd->ID = 0; + newnd->descriptor = NULL; + newnd->arc = NULL; + newnd->arcCount = 0; + newnd->marker = NULL; + newnd->length = 0; + newnd->uniqueness = false; + for (cat = 0; cat < CATEGORIES; cat++) { + newnd->virtualCoverage[cat] = 0; + newnd->originalVirtualCoverage[cat] = 0; + } + + antiNode->ID = 0; + antiNode->descriptor = NULL; + antiNode->arc = NULL; + antiNode->arcCount = 0; + antiNode->marker = NULL; + antiNode->length = 0; + antiNode->uniqueness = false; + for (cat = 0; cat < CATEGORIES; cat++) { + antiNode->virtualCoverage[cat] = 0; + antiNode->originalVirtualCoverage[cat] = 0; + } + + newnd->twinNode = antiNode; + antiNode->twinNode = newnd; + + return newnd; + +} + +Node *addEmptyNodeToGraph(Graph * 
graph, IDnum ID) +{ + Node *newnd = emptyNode(); + + newnd->ID = ID; + newnd->twinNode->ID = -ID; + + graph->nodes[ID] = newnd; + + return newnd; + +} + +void setVirtualCoverage(Node * node, Category category, + Coordinate coverage) +{ + node->virtualCoverage[category] = coverage; + node->twinNode->virtualCoverage[category] = + node->virtualCoverage[category]; +} + +void incrementVirtualCoverage(Node * node, Category category, + Coordinate coverage) +{ + node->virtualCoverage[category] += coverage; + node->twinNode->virtualCoverage[category] = + node->virtualCoverage[category]; + +} + +Coordinate getVirtualCoverage(Node * node, Category category) +{ + return node->virtualCoverage[category]; +} + +void setOriginalVirtualCoverage(Node * node, Category category, + Coordinate coverage) +{ + node->originalVirtualCoverage[category] = coverage; + node->twinNode->originalVirtualCoverage[category] = + node->originalVirtualCoverage[category]; +} + +void incrementOriginalVirtualCoverage(Node * node, Category category, + Coordinate coverage) +{ + node->originalVirtualCoverage[category] += coverage; + node->twinNode->originalVirtualCoverage[category] = + node->originalVirtualCoverage[category]; +} + +Coordinate getOriginalVirtualCoverage(Node * node, Category category) +{ + return node->originalVirtualCoverage[category]; +} + +void clipNodeLength(Node * node, Coordinate startClip, + Coordinate finishClip) +{ + Descriptor *descriptor; + Coordinate finalLength = + getNodeLength(node) - startClip - finishClip; + Coordinate index; + Node *twin = getTwinNode(node); + Nucleotide nucleotide; + + if (finalLength < 0) { + puts("Can't clip node that much!!"); + exit(-1); + } + + if (getNodeLength(node) == 0) { + puts("Short enough as is"); + exit(-1); + } + // One way + descriptor = node->descriptor; + for (index = 0; index < finalLength; index++) { + nucleotide = + getNucleotideInDescriptor(descriptor, + index + startClip); + writeNucleotideInDescriptor(nucleotide, descriptor, index); + 
} + + // Same thing in the other direction + descriptor = twin->descriptor; + for (index = 0; index < finalLength; index++) { + nucleotide = + getNucleotideInDescriptor(descriptor, + index + finishClip); + writeNucleotideInDescriptor(nucleotide, descriptor, index); + } + + // Length + node->length = finalLength; + node->twinNode->length = node->length; +} + +boolean hasSingleArc(Node * node) +{ + return node->arcCount == 1; +} + +void activateArcLookupTable(Graph * graph) +{ + IDnum index; + Node *node; + Arc *arc; + IDnum nodes = graph->nodeCount; + IDnum twinOriginID, destinationID, hash; + Arc **table; + + puts("Activating arc lookup table"); + + graph->arcLookupTable = callocOrExit(6 * nodes + 1, Arc *); + + table = graph->arcLookupTable; + + for (index = -nodes; index <= nodes; index++) { + if (index == 0) + continue; + + node = getNodeInGraph(graph, index); + if (node == 0) + continue; + + for (arc = getArc(node); arc != NULL; + arc = getNextArc(arc)) { + twinOriginID = arc->twinArc->destination->ID; + destinationID = arc->destination->ID; + hash = + 3 * nodes - 2 * twinOriginID + destinationID; + arc->nextInLookupTable = table[hash]; + table[hash] = arc; + } + } + + puts("Done activating arc lookup table"); +} + +void deactivateArcLookupTable(Graph * graph) +{ + free(graph->arcLookupTable); + graph->arcLookupTable = NULL; +} + +static void exportNode(FILE * outfile, Node * node, void *withSequence) +{ + Coordinate index; + Nucleotide nucleotide; + Category cat; + + if (node == NULL) + return; + + fprintf(outfile, "NODE\t%ld\t%lld", (long) node->ID, (long long) node->length); + for (cat = 0; cat < CATEGORIES; cat++) + fprintf(outfile, "\t%lld\t%lld", (long long) node->virtualCoverage[cat], + (long long) node->originalVirtualCoverage[cat]); + fprintf(outfile, "\n"); + + if (withSequence == NULL) + return; + + for (index = 0; index < node->length; index++) { + nucleotide = + getNucleotideInDescriptor(node->descriptor, index); + switch (nucleotide) { + case 
ADENINE: + fprintf(outfile, "A"); + break; + case CYTOSINE: + fprintf(outfile, "C"); + break; + case GUANINE: + fprintf(outfile, "G"); + break; + case THYMINE: + fprintf(outfile, "T"); + break; + } + } + fprintf(outfile, "\n"); + + for (index = 0; index < node->length; index++) { + nucleotide = + getNucleotideInDescriptor(node->twinNode->descriptor, + index); + switch (nucleotide) { + case ADENINE: + fprintf(outfile, "A"); + break; + case CYTOSINE: + fprintf(outfile, "C"); + break; + case GUANINE: + fprintf(outfile, "G"); + break; + case THYMINE: + fprintf(outfile, "T"); + break; + } + } + fprintf(outfile, "\n"); +} + +static void exportArc(FILE * outfile, Arc * arc) +{ + IDnum originID, destinationID; + IDnum absOriginID, absDestinationID; + + if (arc == NULL) + return; + + absOriginID = originID = -arc->twinArc->destination->ID; + absDestinationID = destinationID = arc->destination->ID; + + if (absOriginID < 0) + absOriginID = -absOriginID; + if (absDestinationID < 0) + absDestinationID = -absDestinationID; + + if (absDestinationID < absOriginID) + return; + + if (originID == destinationID && originID < 0) + return; + + fprintf(outfile, "ARC\t%d\t%d\t%d\n", originID, destinationID, + arc->multiplicity); +} + +// Merges two sorted arc lists in order of increasing destination ID (used by arcMergeSort) +static Arc *mergeArcLists(Arc * left, Arc * right) +{ + Arc *mergedList = NULL; + Arc *tail = NULL; + + // Choose first element: + if (left->destination->ID <= right->destination->ID) { + mergedList = left; + tail = left; + left = left->next; + } else { + mergedList = right; + tail = right; + right = right->next; + } + + // Iterate while both lists are still non empty + while (left != NULL && right != NULL) { + if (left->destination->ID <= right->destination->ID) { + tail->next = left; + left->previous = tail; + left = left->next; + } else { + tail->next = right; + right->previous = tail; + right = right->next; + } + + tail = tail->next; + } + + // Concatenate the 
remaining list at the end of the merged list + if (left != NULL) { + tail->next = left; + left->previous = tail; + } + + if (right != NULL) { + tail->next = right; + right->previous = tail; + } + + return mergedList; +} + +static void arcMergeSort(Arc ** arcPtr, IDnum count) +{ + + IDnum half = count / 2; + Arc *left = *arcPtr; + Arc *ptr = left; + Arc *right; + IDnum index; + + if (count == 0 || count == 1) + return; + + if (count == 2) { + if ((*arcPtr)->destination->ID > + (*arcPtr)->next->destination->ID) { + (*arcPtr)->next->next = *arcPtr; + (*arcPtr)->previous = (*arcPtr)->next; + *arcPtr = (*arcPtr)->next; + (*arcPtr)->next->next = NULL; + (*arcPtr)->previous = NULL; + } + return; + } + + for (index = 0; index < half - 1; index++) { + ptr = ptr->next; + if (ptr == NULL) + return; + } + + right = ptr->next; + ptr->next = NULL; + right->previous = NULL; + + arcMergeSort(&left, half); + arcMergeSort(&right, count - half); + *arcPtr = mergeArcLists(left, right); +} + +static void sortNodeArcs(Node * node) +{ + Arc *arc; + IDnum count = 0; + + for (arc = getArc(node); arc != NULL; arc = getNextArc(arc)) + count++; + + if (count == 0) + return; + + arc = getArc(node); + arcMergeSort(&arc, count); + + node->arc = arc; +} + +// Merges two lists of annotations in order of increasing position (used in mergeSort mainly) +static GapMarker *mergeGapMarkerLists(GapMarker * left, GapMarker * right) +{ + GapMarker *mergedList = NULL; + GapMarker *tail = NULL; + + // Choose first element: + if (left->position <= right->position) { + mergedList = left; + tail = left; + left = left->next; + } else { + mergedList = right; + tail = right; + right = right->next; + } + + // Iterate while both lists are still non empty + while (left != NULL && right != NULL) { + if (left->position <= right->position) { + tail->next = left; + left = left->next; + } else { + tail->next = right; + right = right->next; + } + + tail = tail->next; + } + + // Concatenate the remaining list at the end of 
the merged list + if (left != NULL) + tail->next = left; + + if (right != NULL) + tail->next = right; + + return mergedList; +} + +static void gapMergeSort(GapMarker ** gapPtr, IDnum count) +{ + + IDnum half = count / 2; + GapMarker *left = *gapPtr; + GapMarker *ptr = left; + GapMarker *right; + IDnum index; + + if (count == 0 || count == 1) + return; + + if (count == 2) { + if ((*gapPtr)->position > (*gapPtr)->next->position) { + (*gapPtr)->next->next = *gapPtr; + *gapPtr = (*gapPtr)->next; + (*gapPtr)->next->next = NULL; + } + return; + } + + for (index = 0; index < half - 1; index++) { + ptr = ptr->next; + if (ptr == NULL) + return; + } + + right = ptr->next; + ptr->next = NULL; + + gapMergeSort(&left, half); + gapMergeSort(&right, count - half); + *gapPtr = mergeGapMarkerLists(left, right); +} + +static void sortNodeGapMarkers(Node * node, Graph * graph) +{ + GapMarker *gap; + IDnum count = 0; + IDnum ID = getNodeID(node); + + if (ID < 0) + ID = -ID; + + for (gap = graph->gapMarkers[ID]; gap != NULL; gap = gap->next) + count++; + + if (count == 0) + return; + + gap = graph->gapMarkers[ID]; + gapMergeSort(&gap, count); + + graph->gapMarkers[ID] = gap; +} + +void sortGapMarkers(Graph * graph) +{ + IDnum index; + Node *node; + + if (graph->gapMarkers == NULL) + return; + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node) + sortNodeGapMarkers(node, graph); + } +} + +void exportGraph(char *filename, Graph * graph, TightString ** sequences) +{ + IDnum index; + FILE *outfile; + Node *node; + Arc *arc; + PassageMarker *marker; + ShortReadMarker *reads; + IDnum readCount, readIndex; + + if (graph == NULL) { + return; + } + + outfile = fopen(filename, "w"); + if (outfile == NULL) { + puts("Couldn't open file, sorry"); + return; + } else + printf("Writing into graph file %s...\n", filename); + + // General data + fprintf(outfile, "%d\t%d\t%i\n", graph->nodeCount, + graph->sequenceCount, graph->wordLength); + + // 
Node info + for (index = 1; index <= graph->nodeCount; index++) { + node = getNodeInGraph(graph, index); + exportNode(outfile, node, (void *) sequences); + } + + // Arc info + for (index = 1; index <= graph->nodeCount; index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + + sortNodeArcs(node); + sortNodeArcs(getTwinNode(node)); + + for (arc = node->arc; arc != NULL; arc = arc->next) + exportArc(outfile, arc); + for (arc = node->twinNode->arc; arc != NULL; + arc = arc->next) + exportArc(outfile, arc); + } + + // Sequence info + for (index = 1; index <= graph->nodeCount; index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + for (marker = node->marker; marker != NULL; + marker = getNextInNode(marker)) + exportMarker(outfile, marker, sequences, + graph->wordLength); + } + + // Node reads + if (readStartsAreActivated(graph)) { + for (index = 0; index <= graph->nodeCount * 2; index++) { + readCount = graph->nodeReadCounts[index]; + if (readCount == 0) + continue; + + fprintf(outfile, "NR\t%d\t%d\n", + index - graph->nodeCount, readCount); + + reads = graph->nodeReads[index]; + for (readIndex = 0; readIndex < readCount; + readIndex++) + fprintf(outfile, "%ld\t%lld\t%d\n", + (long) reads[readIndex].readID, + (long long) reads[readIndex].position, + (int) reads[readIndex].offset); + } + } + + fclose(outfile); +} + +Graph *importGraph(char *filename) +{ + FILE *file = fopen(filename, "r"); + const int maxline = MAXLINE; + char line[MAXLINE]; + Graph *graph; + Coordinate coverage, originalCoverage; + IDnum nodeCounter, sequenceCount; + Node *node, *twin; + Arc *arc; + IDnum originID, destinationID, multiplicity; + PassageMarker *newMarker, *marker; + IDnum nodeID, seqID; + Coordinate index; + Coordinate start, finish; + Coordinate startOffset, finishOffset; + boolean finished = false; + size_t arrayLength; + IDnum readCount; + ShortReadMarker *array; + int wordLength, sCount; + ShortLength length; + Category cat; + 
long long_var, long_var2, long_var3; + long long longlong_var, longlong_var2, longlong_var3, longlong_var4; + short short_var; + int c; /* int, not char: fgetc() returns int, so EOF compares correctly */ + + if (file == NULL) + exitErrorf(EXIT_FAILURE, true, "Could not open %s", filename); + + printf("Reading graph file %s\n", filename); + + // First line + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "Graph file incomplete"); + sscanf(line, "%ld\t%ld\t%i\n", &long_var, &long_var2, + &wordLength); + nodeCounter = (IDnum) long_var; + sequenceCount = (IDnum) long_var2; + graph = emptyGraph(sequenceCount, wordLength); + resetWordFilter(wordLength); + allocateNodeSpace(graph, nodeCounter); + printf("Graph has %ld nodes and %ld sequences\n", (long) nodeCounter, + (long) sequenceCount); + + // Read nodes + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "Graph file incomplete"); + while (!finished && strncmp(line, "NODE", 4) == 0) { + strtok(line, "\t\n"); + sscanf(strtok(NULL, "\t\n"), "%ld", &long_var); + nodeID = (IDnum) long_var; + node = addEmptyNodeToGraph(graph, nodeID); + sscanf(strtok(NULL, "\t\n"), "%lld", &longlong_var); + node->length = (Coordinate) longlong_var; + for (cat = 0; cat < CATEGORIES; cat++) { + sscanf(strtok(NULL, "\t\n"), "%lld", &longlong_var); + coverage = (Coordinate) longlong_var; + setVirtualCoverage(node, cat, coverage); + sscanf(strtok(NULL, "\t\n"), "%lld", + &longlong_var); + originalCoverage = (Coordinate) longlong_var; + setOriginalVirtualCoverage(node, cat, + originalCoverage); + } + + arrayLength = node->length / 4; + if (node->length % 4 > 0) + arrayLength++; + node->descriptor = + callocOrExit(arrayLength, Descriptor); + + index = 0; + while ((c = fgetc(file)) != '\n' && c != EOF) { + if (c == 'A') + writeNucleotideInDescriptor(ADENINE, + node-> + descriptor, + index++); + else if (c == 'C') + writeNucleotideInDescriptor(CYTOSINE, + node-> + descriptor, + index++); + else if (c == 'G') + writeNucleotideInDescriptor(GUANINE, + node-> + descriptor, + index++); 
+ else if (c == 'T') + writeNucleotideInDescriptor(THYMINE, + node-> + descriptor, + index++); + } + + twin = node->twinNode; + twin->length = node->length; + twin->descriptor = + callocOrExit(arrayLength, Descriptor); + index = 0; + while ((c = fgetc(file)) != '\n' && c != EOF) { + if (c == 'A') + writeNucleotideInDescriptor(ADENINE, + twin-> + descriptor, + index++); + else if (c == 'C') + writeNucleotideInDescriptor(CYTOSINE, + twin-> + descriptor, + index++); + else if (c == 'G') + writeNucleotideInDescriptor(GUANINE, + twin-> + descriptor, + index++); + else if (c == 'T') + writeNucleotideInDescriptor(THYMINE, + twin-> + descriptor, + index++); + } + + if (fgets(line, maxline, file) == NULL) + finished = true; + } + + // Read arcs + while (!finished && line[0] == 'A') { + sscanf(line, "ARC\t%ld\t%ld\t%ld\n", &long_var, + &long_var2, &long_var3); + originID = (IDnum) long_var; + destinationID = (IDnum) long_var2; + multiplicity = (IDnum) long_var3; + arc = + createArc(getNodeInGraph(graph, originID), + getNodeInGraph(graph, destinationID), graph); + setMultiplicity(arc, multiplicity); + if (fgets(line, maxline, file) == NULL) + finished = true; + } + + // Read sequences + while (!finished && line[0] != 'N') { + sscanf(line, "SEQ\t%ld\n", &long_var); + seqID = (IDnum) long_var; + marker = NULL; + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "Graph file incomplete"); + + while (!finished && line[0] != 'N' && line[0] != 'S') { + sCount = + sscanf(line, "%ld\t%lld\t%lld\t%lld\t%lld\n", + &long_var, &longlong_var, &longlong_var2, &longlong_var3, + &longlong_var4); + nodeID = (IDnum) long_var; + startOffset = (Coordinate) longlong_var; + start = (Coordinate) longlong_var2; + finish = (Coordinate) longlong_var3; + finishOffset = (Coordinate) longlong_var4; + if (sCount != 5) { + printf + ("ERROR: reading in graph - only %d items read for line '%s'", + sCount, line); + exit(1); + } + newMarker = + newPassageMarker(seqID, start, finish, + 
startOffset, finishOffset); + transposePassageMarker(newMarker, + getNodeInGraph(graph, + nodeID)); + connectPassageMarkers(marker, newMarker, graph); + marker = newMarker; + if (fgets(line, maxline, file) == NULL) + finished = true; + } + } + + // Node reads + while (!finished) { + sscanf(line, "NR\t%ld\t%ld\n", &long_var, &long_var2); + nodeID = (IDnum) long_var; + readCount = (IDnum) long_var2; + if (!readStartsAreActivated(graph)) + activateReadStarts(graph); + + graph->nodeReadCounts[nodeID + graph->nodeCount] = + readCount; + array = mallocOrExit(readCount, ShortReadMarker); + graph->nodeReads[nodeID + graph->nodeCount] = array; + + readCount = 0; + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "Graph file incomplete"); + while (!finished && line[0] != 'N') { + sscanf(line, "%ld\t%lld\t%hd\n", &long_var, + &longlong_var, &short_var); + seqID = (IDnum) long_var; + startOffset = (Coordinate) longlong_var; + length = (ShortLength) short_var; + array[readCount].readID = seqID; + array[readCount].position = startOffset; + array[readCount].offset = length; + readCount++; + if (fgets(line, maxline, file) == NULL) + finished = true; + } + } + + //printf("New graph has %d nodes\n", graph->nodeCount); + + fclose(file); + //puts("Done, exiting"); + return graph; +} + +Graph *importSimplifiedGraph(char *filename) +{ + FILE *file = fopen(filename, "r"); + const int maxline = MAXLINE; + char line[MAXLINE]; + Graph *graph; + Coordinate coverage, originalCoverage; + IDnum nodeCounter, sequenceCount; + Node *node, *twin; + PassageMarker *newMarker, *marker; + IDnum nodeID, seqID; + Coordinate index; + Coordinate start, finish; + Coordinate startOffset, finishOffset; + boolean finished = false; + size_t arrayLength; + IDnum readCount; + ShortReadMarker *array = NULL; + int wordLength, sCount; + ShortLength length; + Category cat; + long long_var, long_var2; + long long longlong_var, longlong_var2, longlong_var3, longlong_var4; + short short_var; + char c; + 
+ if (file == NULL) + exitErrorf(EXIT_FAILURE, true, "Could not open %s", filename); + + printf("Reading graph file %s\n", filename); + + // First line + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "Graph file incomplete"); + sscanf(line, "%ld\t%ld\t%i\n", &long_var, &long_var2, + &wordLength); + nodeCounter = (IDnum) long_var; + sequenceCount = (IDnum) long_var2; + graph = emptyGraph(sequenceCount, wordLength); + resetWordFilter(wordLength); + allocateNodeSpace(graph, nodeCounter); + printf("Graph has %ld nodes and %ld sequences\n", (long) nodeCounter, + (long) sequenceCount); + + // Read nodes + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "Graph file incomplete"); + while (!finished && strncmp(line, "NODE", 4) == 0) { + strtok(line, "\t\n"); + sscanf(strtok(NULL, "\t\n"), "%ld", &long_var); + nodeID = (IDnum) long_var; + sscanf(strtok(NULL, "\t\n"), "%lld", &longlong_var); + + if (longlong_var < 50) { + if (fgets(line, maxline, file) == NULL) + finished = true; + if (fgets(line, maxline, file) == NULL) + finished = true; + if (fgets(line, maxline, file) == NULL) + finished = true; + continue; + } + + node = addEmptyNodeToGraph(graph, nodeID); + node->length = (Coordinate) longlong_var; + for (cat = 0; cat < CATEGORIES; cat++) { + sscanf(strtok(NULL, "\t\n"), "%lld", &longlong_var); + coverage = (Coordinate) longlong_var; + setVirtualCoverage(node, cat, coverage); + sscanf(strtok(NULL, "\t\n"), "%lld", + &longlong_var); + originalCoverage = (Coordinate) longlong_var; + setOriginalVirtualCoverage(node, cat, + originalCoverage); + } + + arrayLength = node->length / 4; + if (node->length % 4 > 0) + arrayLength++; + node->descriptor = + callocOrExit(arrayLength, Descriptor); + + index = 0; + while ((c = fgetc(file)) != '\n' && c != EOF) { + if (c == 'A') + writeNucleotideInDescriptor(ADENINE, + node-> + descriptor, + index++); + else if (c == 'C') + writeNucleotideInDescriptor(CYTOSINE, + node-> + descriptor, + index++); + else if (c == 
'G') + writeNucleotideInDescriptor(GUANINE, + node-> + descriptor, + index++); + else if (c == 'T') + writeNucleotideInDescriptor(THYMINE, + node-> + descriptor, + index++); + } + + twin = node->twinNode; + twin->length = node->length; + twin->descriptor = + callocOrExit(arrayLength, Descriptor); + index = 0; + while ((c = fgetc(file)) != '\n' && c != EOF) { + if (c == 'A') + writeNucleotideInDescriptor(ADENINE, + twin-> + descriptor, + index++); + else if (c == 'C') + writeNucleotideInDescriptor(CYTOSINE, + twin-> + descriptor, + index++); + else if (c == 'G') + writeNucleotideInDescriptor(GUANINE, + twin-> + descriptor, + index++); + else if (c == 'T') + writeNucleotideInDescriptor(THYMINE, + twin-> + descriptor, + index++); + } + + if (fgets(line, maxline, file) == NULL) + finished = true; + } + + // Read arcs + while (!finished && line[0] == 'A') + if (fgets(line, maxline, file) == NULL) + finished = true; + + // Read sequences + while (!finished && line[0] != 'N') { + sscanf(line, "SEQ\t%ld\n", &long_var); + seqID = (IDnum) long_var; + marker = NULL; + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "Graph file incomplete"); + + while (!finished && line[0] != 'N' && line[0] != 'S') { + sCount = + sscanf(line, "%ld\t%lld\t%lld\t%lld\t%lld\n", + &long_var, &longlong_var, &longlong_var2, &longlong_var3, + &longlong_var4); + nodeID = (IDnum) long_var; + startOffset = (Coordinate) longlong_var; + start = (Coordinate) longlong_var2; + finish = (Coordinate) longlong_var3; + finishOffset = (Coordinate) longlong_var4; + if (sCount != 5) { + printf + ("ERROR: reading in graph - only %d items read for line '%s'", + sCount, line); + abort(); + exit(1); + } + if (getNodeInGraph(graph, nodeID)) { + newMarker = + newPassageMarker(seqID, start, finish, + startOffset, finishOffset); + transposePassageMarker(newMarker, + getNodeInGraph(graph, + nodeID)); + connectPassageMarkers(marker, newMarker, graph); + marker = newMarker; + } + if (fgets(line, maxline, 
file) == NULL) + finished = true; + } + } + + // Node reads + while (!finished) { + sscanf(line, "NR\t%ld\t%ld\n", &long_var, &long_var2); + nodeID = (IDnum) long_var; + readCount = (IDnum) long_var2; + if (!readStartsAreActivated(graph)) + activateReadStarts(graph); + + if (getNodeInGraph(graph, nodeID)) { + graph->nodeReadCounts[nodeID + graph->nodeCount] = + readCount; + array = mallocOrExit(readCount, ShortReadMarker); + graph->nodeReads[nodeID + graph->nodeCount] = array; + } + + readCount = 0; + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "Graph file incomplete"); + while (!finished && line[0] != 'N') { + if (getNodeInGraph(graph, nodeID)) { + sscanf(line, "%ld\t%lld\t%hd\n", &long_var, + &longlong_var, &short_var); + seqID = (IDnum) long_var; + startOffset = (Coordinate) longlong_var; + length = (ShortLength) short_var; + array[readCount].readID = seqID; + array[readCount].position = startOffset; + array[readCount].offset = length; + readCount++; + } + if (fgets(line, maxline, file) == NULL) + finished = true; + } + } + + //printf("New graph has %d nodes\n", graph->nodeCount); + + fclose(file); + //puts("Done, exiting"); + renumberNodes(graph); + return graph; +} + +Graph *readPreGraphFile(char *preGraphFilename, boolean * double_strand) +{ + FILE *file = fopen(preGraphFilename, "r"); + const int maxline = MAXLINE; + char line[MAXLINE]; + + Graph *graph; + IDnum nodeCounter, sequenceCount; + + Node *node, *twin; + IDnum nodeID = 0; + Coordinate index, nodeLength; + char c; + int wordLength, wordShift; + size_t arrayLength; + short short_var; + long long_var, long_var2; + long long longlong_var; + + if (file == NULL) + exitErrorf(EXIT_FAILURE, true, "Could not open %s", preGraphFilename); + + printf("Reading pre-graph file %s\n", preGraphFilename); + + // First line + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "PreGraph file incomplete"); + sscanf(line, "%ld\t%ld\t%i\t%hi\n", &long_var, &long_var2, + &wordLength, 
&short_var); + nodeCounter = (IDnum) long_var; + sequenceCount = (IDnum) long_var2; + *double_strand = (boolean) short_var; + wordShift = wordLength - 1; + graph = emptyGraph(sequenceCount, wordLength); + resetWordFilter(wordLength); + allocateNodeSpace(graph, nodeCounter); + printf("Graph has %ld nodes and %ld sequences\n", (long) nodeCounter, + (long) sequenceCount); + + // Read nodes + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "PreGraph file incomplete"); + while (line[0] == 'N') { + nodeID++; + node = addEmptyNodeToGraph(graph, nodeID); + + sscanf(line, "%*s\t%*i\t%lli\n", &longlong_var); + node->length = (Coordinate) longlong_var; + nodeLength = node->length; + arrayLength = node->length / 4; + if (node->length % 4 > 0) + arrayLength++; + node->descriptor = + callocOrExit(arrayLength, Descriptor); + + twin = node->twinNode; + twin->length = nodeLength; + twin->descriptor = + callocOrExit(arrayLength, Descriptor); + + + index = 0; + while ((c = getc(file)) != '\n') { + if (c == 'A') { + if (index - wordShift >= 0) + writeNucleotideInDescriptor(ADENINE, + node-> + descriptor, + index - wordShift); + if (nodeLength - index - 1 >= 0) { +#ifndef COLOR + writeNucleotideInDescriptor(THYMINE, + twin-> + descriptor, + nodeLength - index - 1); +#else + writeNucleotideInDescriptor(ADENINE, + twin-> + descriptor, + nodeLength - index - 1); +#endif + } + } else if (c == 'C') { + if (index - wordShift >= 0) + writeNucleotideInDescriptor(CYTOSINE, + node-> + descriptor, + index - wordShift); + if (nodeLength - index - 1 >= 0) { +#ifndef COLOR + writeNucleotideInDescriptor(GUANINE, + twin-> + descriptor, + nodeLength - index - 1); +#else + writeNucleotideInDescriptor(CYTOSINE, + twin-> + descriptor, + nodeLength - index - 1); +#endif + } + } else if (c == 'G') { + if (index - wordShift >= 0) + writeNucleotideInDescriptor(GUANINE, + node-> + descriptor, + index - wordShift); + if (nodeLength - index - 1 >= 0) { +#ifndef COLOR + 
writeNucleotideInDescriptor(CYTOSINE, + twin-> + descriptor, + nodeLength - index - 1); +#else + writeNucleotideInDescriptor(GUANINE, + twin-> + descriptor, + nodeLength - index - 1); +#endif + } + } else if (c == 'T') { + if (index - wordShift >= 0) + writeNucleotideInDescriptor(THYMINE, + node-> + descriptor, + index - wordShift); + if (nodeLength - index - 1 >= 0) { +#ifndef COLOR + writeNucleotideInDescriptor(ADENINE, + twin-> + descriptor, + nodeLength - index - 1); +#else + writeNucleotideInDescriptor(THYMINE, + twin-> + descriptor, + nodeLength - index - 1); +#endif + } + } + + index++; + } + + if (fgets(line, maxline, file) == NULL) { + fclose(file); + return graph; + } + } + + fclose(file); + return graph; +} + +// Prints out the information relative to the topology of a node into a new file +// Internal to exportDOTGraph() +void DOTNode(Node * node, FILE * outfile) +{ + IDnum ID; + Arc *arc; + Node *otherNode; + + ID = node->ID; + if (ID < 0) + return; + + fprintf(outfile, "\t%d [label=\"<left>|%d|<right>\"]\n", ID, ID); + + for (arc = node->arc; arc != NULL; arc = arc->next) { + otherNode = arc->destination; + if (!(otherNode->ID >= ID || otherNode->ID <= -ID)) { + continue; + } + + if (otherNode->ID > 0) + fprintf(outfile, "\t%d:right -> %d:left\n", ID, + otherNode->ID); + else + fprintf(outfile, "\t%d:right -> %d:right\n", ID, + -otherNode->ID); + } + + for (arc = node->twinNode->arc; arc != NULL; arc = arc->next) { + otherNode = arc->destination; + if (!(otherNode->ID >= ID || otherNode->ID <= -ID)) { + continue; + } + + if (otherNode->ID > 0) + fprintf(outfile, "\t%d:left -> %d:left\n", ID, + otherNode->ID); + else + fprintf(outfile, "\t%d:left -> %d:right\n", ID, + -otherNode->ID); + } +} + +// Exports the topology of a graph into a new file (designated by its filename) +void exportDOTGraph(char *filename, Graph * graph) +{ + IDnum nodeIndex; + Node *currentNode; + + FILE *outfile = fopen(filename, "w"); + if (outfile == NULL) { + puts("Couldn't open file, 
sorry"); + return; + } else + puts("Writing into file..."); + + fprintf(outfile, "digraph G {\n"); + fprintf(outfile, "\tRANKDIR=LR\n"); + fprintf(outfile, "\tnode [shape=record]\n"); + + for (nodeIndex = 1; nodeIndex <= graph->nodeCount; nodeIndex++) { + currentNode = getNodeInGraph(graph, nodeIndex); + DOTNode(currentNode, outfile); + } + + fprintf(outfile, "}\n"); + fclose(outfile); +} + +TightString *expandNode(Node * node, int WORDLENGTH) +{ + Nucleotide nucleotide; + Coordinate index; + TightString *tString = + newTightString(node->length + WORDLENGTH - 1); + Node *twin = node->twinNode; + Coordinate length = node->length; + + for (index = 0; index < WORDLENGTH; index++) { + nucleotide = + getNucleotideInDescriptor(twin->descriptor, + length - index - 1); +#ifndef COLOR + writeNucleotideAtPosition(3 - nucleotide, index, tString); +#else + writeNucleotideAtPosition(nucleotide, index, tString); +#endif + } + + for (index = 1; index < node->length; index++) { + nucleotide = + getNucleotideInDescriptor(node->descriptor, index); + writeNucleotideAtPosition(nucleotide, + index + WORDLENGTH - 1, tString); + } + + return tString; +} + +char *expandNodeFragment(Node * node, Coordinate contigStart, + Coordinate contigFinish, int wordLength) +{ + Nucleotide nucleotide; + Coordinate index; + Node *twin = node->twinNode; + Coordinate length = contigFinish - contigStart; + int wordShift = wordLength - 1; + char *string; + + if (length >= wordShift) { + string = callocOrExit(length + wordLength, char); + + for (index = 0; index < wordShift; index++) { + nucleotide = + getNucleotideInDescriptor(twin->descriptor, + twin->length - contigStart - + index - 1); + #ifndef COLOR + nucleotide = 3 - nucleotide; + #endif + + switch (nucleotide) { + case ADENINE: + string[index] = 'A'; + break; + case CYTOSINE: + string[index] = 'C'; + break; + case GUANINE: + string[index] = 'G'; + break; + case THYMINE: + string[index] = 'T'; + break; + } + + } + + for (index = 0; index < length; 
index++) { + nucleotide = + getNucleotideInDescriptor(node->descriptor, + contigStart + index); + switch (nucleotide) { + case ADENINE: + string[index + wordShift] = 'A'; + break; + case CYTOSINE: + string[index + wordShift] = 'C'; + break; + case GUANINE: + string[index + wordShift] = 'G'; + break; + case THYMINE: + string[index + wordShift] = 'T'; + break; + } + } + + string[length + wordShift] = '\0'; + } else { + string = callocOrExit(length + 1, char); + + for (index = 0; index < length; index++) { + nucleotide = + getNucleotideInDescriptor(node->descriptor, contigStart + index); + switch (nucleotide) { + case ADENINE: + string[index] = 'A'; + break; + case CYTOSINE: + string[index] = 'C'; + break; + case GUANINE: + string[index] = 'G'; + break; + case THYMINE: + string[index] = 'T'; + break; + } + } + + string[length] = '\0'; + } + + return string; +} + +boolean readStartsAreActivated(Graph * graph) +{ + return graph->nodeReads != NULL; +} + +void activateReadStarts(Graph * graph) +{ + graph->nodeReads = + callocOrExit(2 * graph->nodeCount + 1, ShortReadMarker *); + graph->nodeReadCounts = + callocOrExit(2 * graph->nodeCount + 1, IDnum); +} + +void deactivateReadStarts(Graph * graph) +{ + free(graph->nodeReads); + free(graph->nodeReadCounts); + + graph->nodeReads = NULL; + graph->nodeReadCounts = NULL; +} + +boolean findIDnumInArray(IDnum query, IDnum * array, IDnum arrayLength) +{ + IDnum leftIndex = 0; + IDnum rightIndex = arrayLength; + IDnum middleIndex; + + if (arrayLength == 0) + return false; + + while (true) { + middleIndex = leftIndex + (rightIndex - leftIndex) / 2; + + /* Test for an empty range first so that array[arrayLength] is never read */ + if (leftIndex >= rightIndex) + return false; + else if (array[middleIndex] == query) + return true; + else if (array[middleIndex] > query) + rightIndex = middleIndex; + else if (leftIndex == middleIndex) + leftIndex++; + else + leftIndex = middleIndex; + } +} + +static inline int compareShortReadMarkers(const void *A, const void *B) +{ + IDnum a = ((ShortReadMarker *) A)->readID; + 
IDnum b = ((ShortReadMarker *) B)->readID; + + if (a > b) + return 1; + if (a == b) + return 0; + return -1; +} + +static inline int compareIDnums(const void *A, const void *B) +{ + IDnum a = *((IDnum *) A); + IDnum b = *((IDnum *) B); + + if (a > b) + return 1; + if (a == b) + return 0; + return -1; +} + +void incrementReadStartCount(Node * node, Graph * graph) +{ + graph->nodeReadCounts[node->ID + graph->nodeCount]++; +} + +void createNodeReadStartArrays(Graph * graph) +{ + IDnum index; + + if (graph->nodeReads == NULL) + return; + + for (index = 0; index <= 2 * (graph->nodeCount); index++) { + if (graph->nodeReadCounts[index] != 0) { + graph->nodeReads[index] = + mallocOrExit(graph->nodeReadCounts[index], + ShortReadMarker); + graph->nodeReadCounts[index] = 0; + } else { + graph->nodeReads[index] = NULL; + } + } +} + +void orderNodeReadStartArrays(Graph * graph) +{ + IDnum index; + + if (graph->nodeReads == NULL) + return; + + for (index = 0; index <= 2 * (graph->nodeCount); index++) + if (graph->nodeReadCounts[index] != 0) + qsort(graph->nodeReads[index], + graph->nodeReadCounts[index], + sizeof(ShortReadMarker), + compareShortReadMarkers); +} + +void addReadStart(Node * node, IDnum seqID, Coordinate position, + Graph * graph, Coordinate offset) +{ + IDnum nodeIndex = getNodeID(node) + graph->nodeCount; + + ShortReadMarker *array = graph->nodeReads[nodeIndex]; + IDnum arrayLength = graph->nodeReadCounts[nodeIndex]; + + if (node->status) + return; + node->status = true; + + array[arrayLength].readID = seqID; + array[arrayLength].position = position; + array[arrayLength].offset = (ShortLength) offset; + graph->nodeReadCounts[nodeIndex]++; +} + +void blurLastShortReadMarker(Node * node, Graph * graph) +{ + IDnum nodeIndex = getNodeID(node) + nodeCount(graph); + IDnum index = graph->nodeReadCounts[nodeIndex] - 1; + ShortReadMarker *marker; + + if (index >= 0) + marker = &(graph->nodeReads[nodeIndex][index]); + else + abort(); + + setShortReadMarkerPosition(marker, 
-1); +} + +ShortReadMarker *commonNodeReads(Node * nodeA, Node * nodeB, Graph * graph, + IDnum * length) +{ + IDnum targetID, targetLength, targetIndex, targetVal; + IDnum sourceID, sourceLength, sourceIndex, sourceVal; + IDnum mergeLength; + ShortReadMarker *mergeArray, *targetArray, *sourceArray; + + if (graph->nodeReads == NULL) { + *length = 0; + return NULL; + } + + if (nodeA == NULL || nodeB == NULL) { + *length = 0; + return NULL; + } + + targetID = getNodeID(nodeA) + graph->nodeCount; + targetArray = graph->nodeReads[targetID]; + targetLength = graph->nodeReadCounts[targetID]; + + sourceID = getNodeID(nodeB) + graph->nodeCount; + sourceArray = graph->nodeReads[sourceID]; + sourceLength = graph->nodeReadCounts[sourceID]; + + if (sourceArray == NULL || targetArray == NULL) { + *length = 0; + return NULL; + } + + mergeArray = + mallocOrExit(sourceLength + + targetLength, ShortReadMarker); + + mergeLength = 0; + sourceIndex = 0; + targetIndex = 0; + sourceVal = sourceArray[0].readID; + targetVal = targetArray[0].readID; + + while (sourceIndex < sourceLength && targetIndex < targetLength) { + switch (compareIDnums(&sourceVal, &targetVal)) { + case -1: + mergeArray[mergeLength].readID = sourceVal; + mergeArray[mergeLength].position = -1; + mergeArray[mergeLength].offset = -1; + mergeLength++; + sourceIndex++; + if (sourceIndex < sourceLength) + sourceVal = + sourceArray[sourceIndex].readID; + break; + case 0: + mergeArray[mergeLength].readID = sourceVal; + mergeArray[mergeLength].position = -1; + mergeArray[mergeLength].offset = -1; + mergeLength++; + sourceIndex++; + if (sourceIndex < sourceLength) + sourceVal = + sourceArray[sourceIndex].readID; + targetIndex++; + if (targetIndex < targetLength) + targetVal = + targetArray[targetIndex].readID; + break; + case 1: + mergeArray[mergeLength].readID = targetVal; + mergeArray[mergeLength].position = -1; + mergeArray[mergeLength].offset = -1; + mergeLength++; + targetIndex++; + if (targetIndex < targetLength) + 
targetVal = + targetArray[targetIndex].readID; + } + } + + while (sourceIndex < sourceLength) { + mergeArray[mergeLength].readID = + sourceArray[sourceIndex].readID; + mergeArray[mergeLength].position = -1; + mergeArray[mergeLength].offset = -1; + mergeLength++; + sourceIndex++; + } + + while (targetIndex < targetLength) { + mergeArray[mergeLength].readID = + targetArray[targetIndex].readID; + mergeArray[mergeLength].position = -1; + mergeArray[mergeLength].offset = -1; + mergeLength++; + targetIndex++; + } + + *length = mergeLength; + return mergeArray; +} + +ShortReadMarker *extractFrontOfNodeReads(Node * node, + Coordinate breakpoint, + Graph * graph, IDnum * length, + PassageMarker * sourceMarker, + Coordinate * lengths) +{ + IDnum sourceID; + IDnum mergeLength, newLength, sourceLength; + IDnum sourceIndex; + ShortReadMarker *mergeArray, *sourceArray, *newArray; + ShortReadMarker *mergePtr, *sourcePtr, *newPtr; + Coordinate finish; + Coordinate revBreakpoint; + + if (graph->nodeReads == NULL) { + *length = 0; + return NULL; + } + + if (node == NULL) { + *length = 0; + return NULL; + } + + if (breakpoint == 0) { + return commonNodeReads(node, + getTwinNode(getNode + (getPreviousInSequence + (sourceMarker))), + graph, length); + } + + sourceID = getNodeID(node) + graph->nodeCount; + sourceArray = graph->nodeReads[sourceID]; + sourceLength = graph->nodeReadCounts[sourceID]; + + if (sourceArray == NULL) { + *length = 0; + return NULL; + } + + revBreakpoint = node->length - breakpoint; + + mergeLength = 0; + newLength = 0; + sourcePtr = sourceArray; + for (sourceIndex = 0; sourceIndex < sourceLength; sourceIndex++) { + if (sourcePtr->position == -1) { + newLength++; + mergeLength++; + } else { + finish = + sourcePtr->position - sourcePtr->offset + + lengths[sourcePtr->readID - 1]; + if (sourcePtr->position < revBreakpoint) + newLength++; + if (finish > revBreakpoint) + mergeLength++; + } + sourcePtr++; + } + + newArray = mallocOrExit(newLength, ShortReadMarker); + 
mergeArray = mallocOrExit(mergeLength, ShortReadMarker); + + mergePtr = mergeArray; + newPtr = newArray; + sourcePtr = sourceArray; + mergeLength = 0; + newLength = 0; + for (sourceIndex = 0; sourceIndex < sourceLength; sourceIndex++) { + if (sourcePtr->position == -1) { + mergePtr->readID = sourcePtr->readID; + setShortReadMarkerPosition(mergePtr, -1); + setShortReadMarkerOffset(mergePtr, -1); + mergePtr++; + mergeLength++; + newPtr->readID = sourcePtr->readID; + setShortReadMarkerPosition(newPtr, -1); + setShortReadMarkerOffset(newPtr, -1); + newPtr++; + newLength++; + } else { + finish = + sourcePtr->position - sourcePtr->offset + + lengths[sourcePtr->readID - 1]; + if (sourcePtr->position < revBreakpoint) { + newPtr->readID = sourcePtr->readID; + setShortReadMarkerPosition(newPtr, + sourcePtr-> + position); + setShortReadMarkerOffset(newPtr, + sourcePtr-> + offset); + newPtr++; + newLength++; + + // Saddle back reads: + if (finish > revBreakpoint) { + mergePtr->readID = + sourcePtr->readID; + setShortReadMarkerPosition + (mergePtr, 0); + setShortReadMarkerOffset(mergePtr, + sourcePtr-> + offset + + revBreakpoint + - + sourcePtr-> + position); + mergePtr++; + mergeLength++; + } + } else if (finish > revBreakpoint) { + mergePtr->readID = sourcePtr->readID; + setShortReadMarkerPosition(mergePtr, + sourcePtr-> + position - revBreakpoint); + setShortReadMarkerOffset(mergePtr, + sourcePtr-> + offset); + mergePtr++; + mergeLength++; + } + } + + sourcePtr++; + } + + free(sourceArray); + graph->nodeReads[sourceID] = newArray; + graph->nodeReadCounts[sourceID] = newLength; + + *length = mergeLength; + return mergeArray; +} + +ShortReadMarker *extractBackOfNodeReads(Node * node, Coordinate breakpoint, + Graph * graph, IDnum * length, + PassageMarker * sourceMarker, + Coordinate * lengths) +{ + IDnum sourceID; + IDnum mergeLength, newLength, sourceLength; + IDnum sourceIndex; + ShortReadMarker *mergeArray, *sourceArray, *newArray; + ShortReadMarker *mergePtr, *sourcePtr, *newPtr; + 
Coordinate finish; + + if (graph->nodeReads == NULL) { + *length = 0; + return NULL; + } + + if (node == NULL) { + *length = 0; + return NULL; + } + + if (breakpoint == 0) { + return + commonNodeReads(getNode + (getPreviousInSequence(sourceMarker)), + node, graph, length); + } + + sourceID = getNodeID(node) + graph->nodeCount; + sourceArray = graph->nodeReads[sourceID]; + sourceLength = graph->nodeReadCounts[sourceID]; + + if (sourceArray == NULL) { + *length = 0; + return NULL; + } + + mergeLength = 0; + newLength = 0; + sourcePtr = sourceArray; + for (sourceIndex = 0; sourceIndex < sourceLength; sourceIndex++) { + if (sourcePtr->position == -1) { + mergeLength++; + newLength++; + } else { + finish = + sourcePtr->position - sourcePtr->offset + + lengths[sourcePtr->readID - 1]; + if (sourcePtr->position < breakpoint) + mergeLength++; + if (finish > breakpoint) + newLength++; + } + sourcePtr++; + } + + newArray = mallocOrExit(newLength, ShortReadMarker); + mergeArray = mallocOrExit(mergeLength, ShortReadMarker); + + mergePtr = mergeArray; + newPtr = newArray; + sourcePtr = sourceArray; + for (sourceIndex = 0; sourceIndex < sourceLength; sourceIndex++) { + if (sourcePtr->position == -1) { + mergePtr->readID = sourcePtr->readID; + setShortReadMarkerPosition(mergePtr, -1); + setShortReadMarkerOffset(mergePtr, -1); + mergePtr++; + + newPtr->readID = sourcePtr->readID; + setShortReadMarkerPosition(newPtr, -1); + setShortReadMarkerOffset(newPtr, -1); + newPtr++; + + sourcePtr++; + continue; + } else { + finish = + sourcePtr->position - sourcePtr->offset + + lengths[sourcePtr->readID - 1]; + + if (sourcePtr->position < breakpoint) { + mergePtr->readID = sourcePtr->readID; + setShortReadMarkerPosition(mergePtr, + sourcePtr-> + position); + setShortReadMarkerOffset(mergePtr, + sourcePtr-> + offset); + mergePtr++; + + // Saddle back reads: + if (finish > breakpoint) { + newPtr->readID = sourcePtr->readID; + setShortReadMarkerPosition(newPtr, + 0); + 
setShortReadMarkerOffset(newPtr, + sourcePtr-> + offset + + breakpoint + - + sourcePtr-> + position); + newPtr++; + } + } else if (finish > breakpoint) { + newPtr->readID = sourcePtr->readID; + setShortReadMarkerPosition(newPtr, + sourcePtr-> + position - + breakpoint); + setShortReadMarkerOffset(newPtr, + sourcePtr-> + offset); + newPtr++; + } + } + + sourcePtr++; + } + + free(sourceArray); + graph->nodeReads[sourceID] = newArray; + graph->nodeReadCounts[sourceID] = newLength; + + *length = mergeLength; + return mergeArray; +} + +void spreadReadIDs(ShortReadMarker * reads, IDnum readCount, Node * node, + Graph * graph) +{ + IDnum targetID, targetLength, targetIndex, targetVal; + IDnum sourceLength, sourceIndex, sourceVal; + IDnum mergeLength; + ShortReadMarker *sourceArray, *targetArray, *mergeArray; + ShortReadMarker *sourcePtr, *targetPtr, *mergePtr; + Coordinate targetPosition; + //ShortLength nodeLength = (ShortLength) getNodeLength(node); + ShortLength targetOffset; + + if (graph->nodeReads == NULL || reads == NULL || node == NULL) + return; + + targetID = getNodeID(node) + graph->nodeCount; + targetArray = graph->nodeReads[targetID]; + targetLength = graph->nodeReadCounts[targetID]; + targetPtr = targetArray; + + sourceArray = reads; + sourceLength = readCount; + sourcePtr = sourceArray; + + if (targetArray == NULL) { + mergeArray = + mallocOrExit(sourceLength, ShortReadMarker); + mergePtr = mergeArray; + + sourceIndex = 0; + while (sourceIndex < sourceLength) { + mergePtr->readID = sourcePtr->readID; + setShortReadMarkerPosition(mergePtr, -1); + setShortReadMarkerOffset(mergePtr, -1); + mergePtr++; + sourcePtr++; + sourceIndex++; + } + + graph->nodeReads[targetID] = mergeArray; + graph->nodeReadCounts[targetID] = sourceLength; + return; + } + + mergeArray = + mallocOrExit(sourceLength + + targetLength, ShortReadMarker); + mergePtr = mergeArray; + + mergeLength = 0; + sourceIndex = 0; + targetIndex = 0; + sourceVal = sourcePtr->readID; + targetVal = 
targetPtr->readID; + targetPosition = targetPtr->position; + targetOffset = targetPtr->offset; + + while (sourceIndex < sourceLength && targetIndex < targetLength) { + if (sourceVal < targetVal) { + mergePtr->readID = sourceVal; + setShortReadMarkerPosition(mergePtr, -1); + setShortReadMarkerOffset(mergePtr, -1); + sourceIndex++; + sourcePtr++; + if (sourceIndex < sourceLength) + sourceVal = sourcePtr->readID; + } else if (sourceVal == targetVal) { + mergePtr->readID = sourceVal; + setShortReadMarkerPosition(mergePtr, -1); + setShortReadMarkerOffset(mergePtr, -1); + sourceIndex++; + sourcePtr++; + if (sourceIndex < sourceLength) + sourceVal = sourcePtr->readID; + targetIndex++; + targetPtr++; + if (targetIndex < targetLength) { + targetVal = targetPtr->readID; + targetPosition = targetPtr->position; + targetOffset = targetPtr->offset; + } + } else { + mergePtr->readID = targetVal; + setShortReadMarkerPosition(mergePtr, + targetPosition); + setShortReadMarkerOffset(mergePtr, targetOffset); + targetIndex++; + targetPtr++; + if (targetIndex < targetLength) { + targetVal = targetPtr->readID; + targetPosition = targetPtr->position; + targetOffset = targetPtr->offset; + } + } + + mergeLength++; + mergePtr++; + } + + while (sourceIndex < sourceLength) { + mergePtr->readID = sourcePtr->readID; + setShortReadMarkerPosition(mergePtr, -1); + setShortReadMarkerOffset(mergePtr, -1); + mergeLength++; + mergePtr++; + sourceIndex++; + sourcePtr++; + } + + while (targetIndex < targetLength) { + mergePtr->readID = targetPtr->readID; + setShortReadMarkerPosition(mergePtr, targetPtr->position); + setShortReadMarkerOffset(mergePtr, targetPtr->offset); + mergeLength++; + mergePtr++; + targetIndex++; + targetPtr++; + } + + free(targetArray); + graph->nodeReads[targetID] = mergeArray; + graph->nodeReadCounts[targetID] = mergeLength; +} + +static inline Coordinate min(Coordinate A, Coordinate B) +{ + return A < B ? 
A : B; +} + +static inline ShortLength min_short(ShortLength A, ShortLength B) +{ + return A < B ? A : B; +} + +void injectShortReads(ShortReadMarker * sourceArray, IDnum sourceLength, + Node * target, Graph * graph) +{ + IDnum targetID = getNodeID(target) + graph->nodeCount; + ShortReadMarker *targetArray = graph->nodeReads[targetID]; + IDnum targetLength = graph->nodeReadCounts[targetID]; + ShortReadMarker *targetPtr = targetArray; + ShortReadMarker *sourcePtr = sourceArray; + ShortReadMarker *mergeArray, *mergePtr; + IDnum mergeLength; + Coordinate targetPosition, sourcePosition; + ShortLength targetOffset, sourceOffset; + IDnum targetIndex, targetVal, sourceIndex, sourceVal; + + if (sourceLength == 0) { + free(sourceArray); + return; + } + + if (targetLength == 0) { + free(targetArray); + graph->nodeReads[targetID] = sourceArray; + graph->nodeReadCounts[targetID] = sourceLength; + return; + } + + mergeArray = + mallocOrExit(sourceLength + + targetLength, ShortReadMarker); + mergePtr = mergeArray; + + mergeLength = 0; + sourceIndex = 0; + targetIndex = 0; + targetVal = targetPtr->readID; + targetPosition = targetPtr->position; + targetOffset = targetPtr->offset; + sourceVal = sourcePtr->readID; + sourcePosition = sourcePtr->position; + sourceOffset = sourcePtr->offset; + + while (sourceIndex < sourceLength && targetIndex < targetLength) { + if (sourceVal < targetVal) { + mergePtr->readID = sourceVal; + setShortReadMarkerPosition(mergePtr, + sourcePosition); + setShortReadMarkerOffset(mergePtr, sourceOffset); + sourceIndex++; + if (sourceIndex < sourceLength) { + sourcePtr++; + sourceVal = sourcePtr->readID; + sourcePosition = sourcePtr->position; + sourceOffset = sourcePtr->offset; + } + } else if (sourceVal == targetVal) { + mergePtr->readID = sourceVal; + if (sourcePosition == -1 && targetPosition == -1) { + setShortReadMarkerPosition(mergePtr, -1); + setShortReadMarkerOffset(mergePtr, -1); + } else if (sourcePosition == -1) { + 
setShortReadMarkerPosition(mergePtr, + targetPosition); + setShortReadMarkerOffset(mergePtr, + targetOffset); + } else if (targetPosition == -1) { + setShortReadMarkerPosition(mergePtr, + sourcePosition); + setShortReadMarkerOffset(mergePtr, + sourceOffset); + } else { + setShortReadMarkerPosition(mergePtr, + min + (sourcePosition, + targetPosition)); + setShortReadMarkerOffset(mergePtr, + min_short + (sourceOffset, + targetOffset)); + } + sourceIndex++; + if (sourceIndex < sourceLength) { + sourcePtr++; + sourceVal = sourcePtr->readID; + sourcePosition = sourcePtr->position; + sourceOffset = sourcePtr->offset; + } + targetIndex++; + if (targetIndex < targetLength) { + targetPtr++; + targetVal = targetPtr->readID; + targetPosition = targetPtr->position; + targetOffset = targetPtr->offset; + } + } else { + mergePtr->readID = targetVal; + setShortReadMarkerPosition(mergePtr, + targetPosition); + setShortReadMarkerOffset(mergePtr, targetOffset); + targetIndex++; + if (targetIndex < targetLength) { + targetPtr++; + targetVal = targetPtr->readID; + targetPosition = targetPtr->position; + targetOffset = targetPtr->offset; + } + } + + mergeLength++; + mergePtr++; + } + + while (sourceIndex < sourceLength) { + mergePtr->readID = sourcePtr->readID; + setShortReadMarkerPosition(mergePtr, sourcePtr->position); + setShortReadMarkerOffset(mergePtr, sourcePtr->offset); + mergeLength++; + mergePtr++; + sourceIndex++; + sourcePtr++; + } + + while (targetIndex < targetLength) { + mergePtr->readID = targetPtr->readID; + setShortReadMarkerPosition(mergePtr, targetPtr->position); + setShortReadMarkerOffset(mergePtr, targetPtr->offset); + mergeLength++; + mergePtr++; + targetIndex++; + targetPtr++; + } + + free(targetArray); + graph->nodeReads[targetID] = mergeArray; + graph->nodeReadCounts[targetID] = mergeLength; + + free(sourceArray); +} + +void mergeNodeReads(Node * target, Node * source, Graph * graph) +{ + IDnum sourceID, sourceLength; + ShortReadMarker *sourceArray; + + if 
(graph->nodeReads == NULL || source == NULL || target == NULL) + return; + + sourceID = getNodeID(source) + graph->nodeCount; + sourceArray = graph->nodeReads[sourceID]; + sourceLength = graph->nodeReadCounts[sourceID]; + + if (sourceArray == NULL) + return; + + graph->nodeReads[sourceID] = NULL; + graph->nodeReadCounts[sourceID] = 0; + + injectShortReads(sourceArray, sourceLength, target, graph); +} + +void foldSymmetricalNodeReads(Node * node, Graph * graph) +{ + IDnum targetID, targetLength, targetIndex; + IDnum sourceID, sourceLength, sourceIndex; + IDnum targetVal = 0; + IDnum sourceVal = 0; + IDnum mergeLength; + ShortReadMarker *sourceArray, *targetArray, *mergeArray, + *mergeArray2; + ShortReadMarker *sourcePtr, *targetPtr, *mergePtr, *mergePtr2; + + if (graph->nodeReads == NULL || node == NULL) + return; + + sourceID = getNodeID(node) + graph->nodeCount; + sourceArray = graph->nodeReads[sourceID]; + sourceLength = graph->nodeReadCounts[sourceID]; + sourcePtr = sourceArray; + + targetID = -getNodeID(node) + graph->nodeCount; + targetArray = graph->nodeReads[targetID]; + targetLength = graph->nodeReadCounts[targetID]; + targetPtr = targetArray; + + if (sourceArray == NULL && targetArray == NULL) + return; + + mergeArray = + mallocOrExit(sourceLength + + targetLength, ShortReadMarker); + mergeArray2 = + mallocOrExit(sourceLength + + targetLength, ShortReadMarker); + mergePtr = mergeArray; + mergePtr2 = mergeArray2; + + mergeLength = 0; + sourceIndex = 0; + targetIndex = 0; + if (targetIndex < targetLength) + targetVal = targetPtr->readID; + if (sourceIndex < sourceLength) + sourceVal = sourcePtr->readID; + + while (sourceIndex < sourceLength && targetIndex < targetLength) { + if (sourceVal < targetVal) { + mergePtr->readID = sourceVal; + setShortReadMarkerPosition(mergePtr, -1); + setShortReadMarkerOffset(mergePtr, -1); + mergePtr2->readID = sourceVal; + setShortReadMarkerPosition(mergePtr2, -1); + setShortReadMarkerOffset(mergePtr2, -1); + sourceIndex++; + 
sourcePtr++; + if (sourceIndex < sourceLength) + sourceVal = sourcePtr->readID; + } else if (sourceVal == targetVal) { + mergePtr->readID = sourceVal; + setShortReadMarkerPosition(mergePtr, -1); + setShortReadMarkerOffset(mergePtr, -1); + mergePtr2->readID = sourceVal; + setShortReadMarkerPosition(mergePtr2, -1); + setShortReadMarkerOffset(mergePtr2, -1); + sourceIndex++; + sourcePtr++; + if (sourceIndex < sourceLength) + sourceVal = sourcePtr->readID; + targetIndex++; + targetPtr++; + if (targetIndex < targetLength) + targetVal = targetPtr->readID; + } else { + mergePtr->readID = targetVal; + setShortReadMarkerPosition(mergePtr, -1); + setShortReadMarkerOffset(mergePtr, -1); + mergePtr2->readID = targetVal; + setShortReadMarkerPosition(mergePtr2, -1); + setShortReadMarkerOffset(mergePtr2, -1); + targetIndex++; + targetPtr++; + if (targetIndex < targetLength) + targetVal = targetPtr->readID; + } + + mergeLength++; + mergePtr++; + mergePtr2++; + } + + while (sourceIndex < sourceLength) { + mergePtr->readID = sourcePtr->readID; + setShortReadMarkerPosition(mergePtr, -1); + setShortReadMarkerOffset(mergePtr, -1); + mergePtr2->readID = sourcePtr->readID; + setShortReadMarkerPosition(mergePtr2, -1); + setShortReadMarkerOffset(mergePtr2, -1); + mergeLength++; + mergePtr++; + mergePtr2++; + sourceIndex++; + sourcePtr++; + } + + while (targetIndex < targetLength) { + mergePtr->readID = targetPtr->readID; + setShortReadMarkerPosition(mergePtr, -1); + setShortReadMarkerOffset(mergePtr, -1); + mergePtr2->readID = targetPtr->readID; + setShortReadMarkerPosition(mergePtr2, -1); + setShortReadMarkerOffset(mergePtr2, -1); + mergeLength++; + mergePtr++; + mergePtr2++; + targetIndex++; + targetPtr++; + } + + free(targetArray); + graph->nodeReads[targetID] = mergeArray; + graph->nodeReadCounts[targetID] = mergeLength; + + free(sourceArray); + graph->nodeReads[sourceID] = mergeArray2; + graph->nodeReadCounts[sourceID] = mergeLength; +} + +void shareReadStarts(Node * target, Node * 
source, Graph * graph) +{ + ShortReadMarker *sourceArray; + IDnum sourceLength, sourceID; + + if (graph->nodeReads == NULL) + return; + + if (target == NULL || source == NULL) + return; + + sourceID = source->ID + graph->nodeCount; + sourceArray = graph->nodeReads[sourceID]; + sourceLength = graph->nodeReadCounts[sourceID]; + + if (sourceArray == NULL) + return; + + spreadReadIDs(sourceArray, sourceLength, target, graph); +} + +ShortReadMarker **getNodeToReadMappings(Graph * graph) +{ + return graph->nodeReads; +} + +IDnum getShortReadMarkerID(ShortReadMarker * marker) +{ + return marker->readID; +} + +inline ShortLength getShortReadMarkerOffset(ShortReadMarker * marker) +{ + return marker->offset; +} + +inline void setShortReadMarkerOffset(ShortReadMarker * marker, + ShortLength offset) +{ + marker->offset = offset; +} + +IDnum *getNodeReadCounts(Graph * graph) +{ + return graph->nodeReadCounts; +} + +int getWordLength(Graph * graph) +{ + return graph->wordLength; +} + +void displayArcMemory() +{ + printf("ARC MEMORY %lld allocated %lld free\n", + (long long) RecycleBin_memory_usage(arcMemory), + (long long) recycleBinFreeSpace(arcMemory)); +} + +void displayNodeMemory() +{ + printf("NODE MEMORY %lld allocated %lld free\n", + (long long) RecycleBin_memory_usage(nodeMemory), + (long long) recycleBinFreeSpace(nodeMemory)); +} + +ShortReadMarker *getNodeReads(Node * node, Graph * graph) +{ + IDnum id = node->ID + graph->nodeCount; + return graph->nodeReads[id]; +} + +IDnum getNodeReadCount(Node * node, Graph * graph) +{ + IDnum id = node->ID + graph->nodeCount; + return graph->nodeReadCounts[id]; +} + +inline Coordinate getShortReadMarkerPosition(ShortReadMarker * marker) +{ + return marker->position; +} + +inline void setShortReadMarkerPosition(ShortReadMarker * marker, + Coordinate position) +{ + if (position < -100) + return; + + marker->position = position; +} + +ShortReadMarker *getShortReadMarkerAtIndex(ShortReadMarker * array, + IDnum index) +{ + return 
&(array[index]); +} + +void destroyGraph(Graph * graph) +{ + IDnum index; + Node *node; + for (index = 1; index <= graph->nodeCount; index++) { + node = getNodeInGraph(graph, index); + if (node != NULL) + destroyNode(node, graph); + } + + if (graph->gapMarkers) + deactivateGapMarkers(graph); + + free(graph->nodes); + destroyRecycleBin(nodeMemory); + destroyRecycleBin(arcMemory); + destroyAllPassageMarkers(); + free(graph->arcLookupTable); + free(graph->nodeReads); + free(graph->nodeReadCounts); + free(graph); +} + +void checkNodeReads(IDnum index, Graph * graph) +{ + IDnum ref = index + graph->nodeCount; + IDnum arrayLength = graph->nodeReadCounts[ref]; + ShortReadMarker *array = graph->nodeReads[ref]; + IDnum i; + + //return; + + if (arrayLength > graph->sequenceCount) + abort(); + + //if (arrayLength > 10000) + // printf("Array length %d %d\n", arrayLength, index); + + for (i = 1; i < arrayLength; i++) { + if (array[i].readID <= array[i - 1].readID) + abort(); + if (array[i].position >= 0 && array[i].offset < 0) + abort(); + if (array[i - 1].position >= 0 && array[i - 1].offset < 0) + abort(); + } +} + +void setInsertLengths(Graph * graph, Category cat, Coordinate insertLength, + Coordinate insertLength_std_dev) +{ + graph->insertLengths[cat] = insertLength; + graph->insertLengths_var[cat] = + insertLength_std_dev * insertLength_std_dev; +} + +Coordinate getInsertLength(Graph * graph, Category cat) +{ + return graph->insertLengths[cat / 2]; +} + +double getInsertLength_var(Graph * graph, Category cat) +{ + return graph->insertLengths_var[cat / 2]; +} + +void activateGapMarkers(Graph * graph) +{ + graph->gapMarkers = + callocOrExit(graph->nodeCount + 1, GapMarker *); + gapMarkerMemory = newRecycleBin(sizeof(GapMarker), GAPBLOCKSIZE); +} + +void deactivateGapMarkers(Graph * graph) +{ + free(graph->gapMarkers); + graph->gapMarkers = NULL; + destroyRecycleBin(gapMarkerMemory); + gapMarkerMemory = NULL; +} + +static GapMarker *allocateGapMarker() +{ + return 
(GapMarker *) allocatePointer(gapMarkerMemory); +} + +void appendGap(Node * node, Coordinate length, Graph * graph) +{ + IDnum nodeID = getNodeID(node); + GapMarker *marker = allocateGapMarker(); + GapMarker *tmp; + + marker->length = length; + + if (nodeID > 0) { + marker->position = node->length; + marker->next = graph->gapMarkers[nodeID]; + graph->gapMarkers[nodeID] = marker; + } else { + for (tmp = graph->gapMarkers[-nodeID]; tmp != NULL; + tmp = tmp->next) + tmp->position += length; + + marker->position = 0; + marker->next = graph->gapMarkers[-nodeID]; + graph->gapMarkers[-nodeID] = marker; + } + + addBufferToDescriptor(node, length); +} + +void appendNodeGaps(Node * destination, Node * source, Graph * graph) +{ + IDnum destinationID = getNodeID(destination); + IDnum sourceID = getNodeID(source); + GapMarker *marker; + + if (graph->gapMarkers == NULL) + return; + + if (destinationID > 0 && sourceID > 0) { + for (marker = graph->gapMarkers[sourceID]; marker != NULL; + marker = marker->next) + marker->position += destination->length; + } else if (destinationID > 0 && sourceID < 0) { + sourceID = -sourceID; + for (marker = graph->gapMarkers[sourceID]; marker != NULL; + marker = marker->next) + marker->position = + source->length + destination->length - + marker->position - marker->length; + } else if (destinationID < 0 && sourceID > 0) { + destinationID = -destinationID; + for (marker = graph->gapMarkers[destinationID]; + marker != NULL; marker = marker->next) + marker->position += source->length; + + for (marker = graph->gapMarkers[sourceID]; marker != NULL; + marker = marker->next) + marker->position = + source->length - marker->position - + marker->length; + } else { + destinationID = -destinationID; + sourceID = -sourceID; + for (marker = graph->gapMarkers[destinationID]; + marker != NULL; marker = marker->next) + marker->position += source->length; + } + + if (graph->gapMarkers[destinationID] == NULL) + graph->gapMarkers[destinationID] = + 
graph->gapMarkers[sourceID]; + else { + marker = graph->gapMarkers[destinationID]; + while (marker->next != NULL) + marker = marker->next; + marker->next = graph->gapMarkers[sourceID]; + } + + graph->gapMarkers[sourceID] = NULL; +} + +GapMarker *getGap(Node * node, Graph * graph) +{ + IDnum nodeID = getNodeID(node); + + if (graph->gapMarkers == NULL) + return NULL; + + if (nodeID < 0) + nodeID = -nodeID; + + return graph->gapMarkers[nodeID]; +} + +GapMarker *getNextGap(GapMarker * marker) +{ + return marker->next; +} + +Coordinate getGapStart(GapMarker * marker) +{ + return marker->position; +} + +Coordinate getGapFinish(GapMarker * marker) +{ + return marker->position + marker->length; +} + +void reallocateNodeDescriptor(Node * node, Coordinate length) { + Coordinate arrayLength, index, shift; + Node * twin = node->twinNode; + Descriptor * array; + Nucleotide nucleotide; + + if (length < node->length) + exitErrorf(EXIT_FAILURE, true, "Sum of node lengths smaller than first!"); + + shift = length - node->length; + + arrayLength = length / 4; + if (length % 4) + arrayLength++; + + node->descriptor = reallocOrExit(node->descriptor, arrayLength, Descriptor); + + array = callocOrExit(arrayLength, Descriptor); + for (index = node->length - 1; index >= 0; index--) { + nucleotide = getNucleotideInDescriptor(twin->descriptor, index); + writeNucleotideInDescriptor(nucleotide, array, index + shift); + } + + free(twin->descriptor); + twin->descriptor = array; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/graph.h b/contrib/MetaVelvet-v0.3.1/src/graph.h new file mode 100644 index 0000000..8907f1e --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/graph.h @@ -0,0 +1,236 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. 
+ + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _GRAPH_H_ +#define _GRAPH_H_ + +//////////////////////////////////////////////////////////// +// Node functions +//////////////////////////////////////////////////////////// + +//Creators/destructor +Node *newNode(IDnum sequenceID, Coordinate start, Coordinate finish, + Coordinate offset, IDnum ID, TightString ** sequences, + int WORDLENGTH); +Node *emptyNode(); +void destroyNode(Node * node, Graph * graph); + +// Locator +IDnum getNodeID(Node * node); +Node *getNodeInGraph(Graph * graph, IDnum nodeID); +Node *getTwinNode(Node * node); + +// Arc info +int arcCount(Node * node); +int simpleArcCount(Node * node); +Arc *getArc(Node * node); +boolean hasSingleArc(Node * node); + +// Descriptor +Coordinate getNodeLength(Node * node); +void appendDescriptors(Node * target, Node * source); +void directlyAppendDescriptors(Node * target, Node * sourcei, Coordinate totalLength); +void appendSequence(Node * node, TightString ** reads, + PassageMarker * guide, Graph * graph); +void clipNodeLength(Node * node, Coordinate startClip, + Coordinate finishClip); +void splitNodeDescriptor(Node * source, Node * target, Coordinate offset); +void reduceNode(Node * node); +void reallocateNodeDescriptor(Node * node, Coordinate length); +Nucleotide getNucleotideInNode(Node * node, Coordinate index); + +// 
Passage markers +void insertPassageMarker(PassageMarker * marker, Node * destination); +PassageMarker *getMarker(Node * node); +void setMarker(Node * node, PassageMarker * marker); +IDnum markerCount(Node * node); + +// Short read marker creation +void incrementReadStartCount(Node * node, Graph * graph); +void addReadStart(Node * node, IDnum seqID, Coordinate position, + Graph * graph, Coordinate offset); +void blurLastShortReadMarker(Node * node, Graph * graph); + +// Short read marker handling +ShortReadMarker *getNodeReads(Node * node, Graph * graph); +IDnum getNodeReadCount(Node * node, Graph * graph); +ShortReadMarker *commonNodeReads(Node * nodeA, Node * nodeB, Graph * graph, + IDnum * length); +ShortReadMarker *extractBackOfNodeReads(Node * node, Coordinate breakpoint, + Graph * graph, IDnum * length, + PassageMarker * sourceMarker, + Coordinate * sequenceLengths); +ShortReadMarker *extractFrontOfNodeReads(Node * node, + Coordinate breakpoint, + Graph * graph, IDnum * length, + PassageMarker * sourceMarker, + Coordinate * sequenceLengths); + +// Short read marker moving around +void foldSymmetricalNodeReads(Node * node, Graph * graph); +void spreadReadIDs(ShortReadMarker * reads, IDnum readCount, Node * node, + Graph * graph); +void injectShortReads(ShortReadMarker * sourceArray, IDnum sourceLength, + Node * target, Graph * graph); +void mergeNodeReads(Node * target, Node * source, Graph * graph); +void checkNodeReads(IDnum nodeIndex, Graph * graph); + +// Virtual coverage +void setVirtualCoverage(Node * node, Category category, + Coordinate coverage); +void incrementVirtualCoverage(Node * node, Category category, + Coordinate coverage); +Coordinate getVirtualCoverage(Node * node, Category category); + +// Original virtual coverage +void setOriginalVirtualCoverage(Node * node, Category category, + Coordinate coverage); +void incrementOriginalVirtualCoverage(Node * node, Category category, + Coordinate coverage); +Coordinate getOriginalVirtualCoverage(Node * 
node, Category category); + +// Status +void setNodeStatus(Node * node, boolean status); +void setSingleNodeStatus(Node * node, boolean status); +boolean getNodeStatus(Node * node); + +// Uniqueness +void setUniqueness(Node * node, boolean value); +boolean getUniqueness(Node * node); + +// Gap markers +void appendGap(Node * node, Coordinate length, Graph * graph); +void appendNodeGaps(Node * destination, Node * source, Graph * graph); + +// IO +char *readNode(Node * node); +TightString *expandNode(Node * node, int WORDLENGTH); +void appendNodeSequence(Node * node, TightString * sequence, + Coordinate writeIndex); +char *expandNodeFragment(Node * node, Coordinate contigStart, + Coordinate contigFinish, int WORDLENGTH); + +//////////////////////////////////////////////////////////// +// Arc functions +//////////////////////////////////////////////////////////// + +// Creators/destructor +Arc *createArc(Node * origin, Node * destination, Graph * graph); +void createAnalogousArc(Node * origin, Node * destination, Arc * refArc, + Graph * graph); +void destroyArc(Arc * arc, Graph * graph); + +// Multiplicity +void setMultiplicity(Arc * arc, IDnum mult); +IDnum getMultiplicity(Arc * arc); +void changeMultiplicity(Arc * arc, IDnum variation); + +// Extremities +Node *getOrigin(Arc * arc); +Node *getDestination(Arc * arc); + +// Finding arcs +Arc *getArcBetweenNodes(Node * origin, Node * destination, Graph * graph); +Arc *getNextArc(Arc * arc); + +// Lookup table option +void activateArcLookupTable(Graph * graph); +void deactivateArcLookupTable(Graph * graph); + +//////////////////////////////////////////////////////////// +// Short read marker functions +//////////////////////////////////////////////////////////// + +ShortReadMarker *getShortReadMarkerAtIndex(ShortReadMarker * array, + IDnum index); + +IDnum getShortReadMarkerID(ShortReadMarker * marker); + +extern inline Coordinate getShortReadMarkerPosition(ShortReadMarker * marker); +extern inline void 
setShortReadMarkerPosition(ShortReadMarker * marker, + Coordinate position); + +extern inline ShortLength getShortReadMarkerOffset(ShortReadMarker * marker); +extern inline void setShortReadMarkerOffset(ShortReadMarker * marker, + ShortLength offset); + +//////////////////////////////////////////////////////////// +// Gap marker functions +//////////////////////////////////////////////////////////// + +GapMarker *getGap(Node * node, Graph * graph); +GapMarker *getNextGap(GapMarker * marker); +Coordinate getGapStart(GapMarker * marker); +Coordinate getGapFinish(GapMarker * marker); + +//////////////////////////////////////////////////////////// +// Graph functions +//////////////////////////////////////////////////////////// + +// Memory allocation +Graph *emptyGraph(IDnum sequenceCount, int wordLength); +void allocateNodeSpace(Graph * graph, IDnum nodeCount); +void addNodeToGraph(Graph * graph, Node * node); +Node *addEmptyNodeToGraph(Graph * graph, IDnum nodeID); +void destroyGraph(Graph * graph); + +// Dimensions +IDnum nodeCount(Graph * graph); +IDnum sequenceCount(Graph * graph); +void renumberNodes(Graph * graph); +int getWordLength(Graph * graph); + +// Element status +void resetNodeStatus(Graph * graph); +void resetPassageMarkersStatus(Graph * graph); + +// Arc mults +void reassessArcMultiplicities(Graph * graph); + +// File IO +void displayGraph(Graph * graph); +Graph *importGraph(char *filename); +Graph *importSimplifiedGraph(char *filename); +void exportGraph(char *filename, Graph * graph, TightString ** sequences); +void exportDOTGraph(char *filename, Graph * graph); +Graph *readPreGraphFile(char *preGraphFilename, boolean * double_strand); + +// Read starts +void activateReadStarts(Graph * graph); +boolean readStartsAreActivated(Graph * graph); +void createNodeReadStartArrays(Graph * graph); +void orderNodeReadStartArrays(Graph * graph); + +// Insert lengths +void setInsertLengths(Graph * graph, Category cat, Coordinate insertLength, + Coordinate 
insertLength_std_dev); +Coordinate getInsertLength(Graph * graph, Category cat); +double getInsertLength_var(Graph * graph, Category cat); + +// Gaps markers +void activateGapMarkers(Graph * graph); +void deactivateGapMarkers(Graph * graph); +void sortGapMarkers(Graph * graph); + +void displayArcMemory(); +void displayNodeMemory(); + +void checkPassageMarkersStatus(Graph * graph); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/graphReConstruction.c b/contrib/MetaVelvet-v0.3.1/src/graphReConstruction.c new file mode 100644 index 0000000..2e17065 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/graphReConstruction.c @@ -0,0 +1,719 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include +#include + +#include "globals.h" +#include "graph.h" +#include "passageMarker.h" +#include "readSet.h" +#include "tightString.h" +#include "recycleBin.h" +#include "utility.h" +#include "kmer.h" + +#define ADENINE 0 +#define CYTOSINE 1 +#define GUANINE 2 +#define THYMINE 3 + +typedef struct kmerOccurence_st KmerOccurence; +typedef struct kmerOccurenceTable_st KmerOccurenceTable; +typedef struct smallNodeList_st SmallNodeList; + +// Internal structure used to mark the ends of an Annotation +struct kmerOccurence_st { + Kmer kmer; + Coordinate position; + IDnum nodeID; +}; + +struct kmerOccurenceTable_st { + KmerOccurence *kmerTable; + IDnum *accelerationTable; + IDnum kmerTableSize; + short int accelerationShift; + short int accelerationBits; +}; + +struct smallNodeList_st { + Node *node; + SmallNodeList *next; +}; + +static RecycleBin *smallNodeListMemory = NULL; +static SmallNodeList *nodePile = NULL; + +#define BLOCKSIZE 1000 + +static SmallNodeList *allocateSmallNodeList() +{ + if (smallNodeListMemory == NULL) + smallNodeListMemory = + newRecycleBin(sizeof(SmallNodeList), BLOCKSIZE); + + return allocatePointer(smallNodeListMemory); +} + +static void deallocateSmallNodeList(SmallNodeList * smallNodeList) +{ + deallocatePointer(smallNodeListMemory, smallNodeList); +} + +static void memorizeNode(Node * node) +{ + SmallNodeList *list = allocateSmallNodeList(); + list->node = node; + list->next = nodePile; + nodePile = list; +} + +static void unlockMemorizedNodes() +{ + SmallNodeList *list; + + while (nodePile) { + list = nodePile; + nodePile = list->next; + setSingleNodeStatus(list->node, false); + deallocateSmallNodeList(list); + } +} + +int compareKmerOccurences(void const *A, void const *B) +{ + KmerOccurence *a = (KmerOccurence 
*) A; + KmerOccurence *b = (KmerOccurence *) B; + + if (compareKmers(&(a->kmer), &(b->kmer)) < 0) + return -1; + else if (compareKmers(&(a->kmer), &(b->kmer)) > 0) + return 1; + else + return 0; +} + +static inline KmerKey keyInAccelerationTable(Kmer * kmer, + KmerOccurenceTable * table) +{ + KmerKey key = 0; + Kmer copy; + int i; + + copyKmers(&copy, kmer); + for (i = 0; i < table->accelerationShift; i+= 2) + popNucleotide(&copy); + + for (i = 0; i < table->accelerationBits; i+= 2) { + key += ((KmerKey) popNucleotide(&copy)) << table->accelerationBits; + key >>= 2; + } + + return key; +} + +static KmerOccurenceTable *referenceGraphKmers(char *preGraphFilename, + short int accelerationBits, Graph * graph, boolean double_strand) +{ + FILE *file = fopen(preGraphFilename, "r"); + const int maxline = MAXLINE; + char line[MAXLINE]; + char c; + int wordLength; + Coordinate lineLength, kmerCount; + Kmer word; + Kmer antiWord; + KmerOccurenceTable *kmerTable = NULL; + KmerOccurence *kmerOccurences, *kmerOccurencePtr; + Coordinate kmerOccurenceIndex; + IDnum index; + IDnum nodeID = 0; + IDnum *accelPtr = NULL; + KmerKey lastHeader = 0; + KmerKey header; + Nucleotide nucleotide; + + if (file == NULL) + exitErrorf(EXIT_FAILURE, true, "Could not open %s", preGraphFilename); + + // Count kmers + printf("Scanning pre-graph file %s for k-mers\n", + preGraphFilename); + + // First line + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "PreGraph file incomplete"); + sscanf(line, "%*i\t%*i\t%i\n", &wordLength); + + // Initialize kmer occurence table: + kmerTable = mallocOrExit(1, KmerOccurenceTable); + if (accelerationBits > 2 * wordLength) + accelerationBits = 2 * wordLength; + + if (accelerationBits > 32) + accelerationBits = 32; + + if (accelerationBits > 0) { + kmerTable->accelerationBits = accelerationBits; + kmerTable->accelerationTable = + callocOrExit((((size_t) 1) << accelerationBits) + 1, + IDnum); + accelPtr = kmerTable->accelerationTable; +
kmerTable->accelerationShift = + (short int) 2 *wordLength - accelerationBits; + } else { + kmerTable->accelerationBits = 0; + kmerTable->accelerationTable = NULL; + kmerTable->accelerationShift = 0; + } + + // Read nodes + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "PreGraph file incomplete"); + kmerCount = 0; + while (line[0] == 'N') { + lineLength = 0; + while ((c = getc(file)) != EOF && c != '\n') + lineLength++; + kmerCount += lineLength - wordLength + 1; + if (fgets(line, maxline, file) == NULL) + break; + } + fclose(file); + + // Create table + printf("%li kmers found\n", (long) kmerCount); + kmerOccurences = callocOrExit(kmerCount, KmerOccurence); + kmerOccurencePtr = kmerOccurences; + kmerOccurenceIndex = 0; + kmerTable->kmerTable = kmerOccurences; + kmerTable->kmerTableSize = kmerCount; + + // Fill table + file = fopen(preGraphFilename, "r"); + if (file == NULL) + exitErrorf(EXIT_FAILURE, true, "Could not open %s", preGraphFilename); + + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "PreGraph file incomplete"); + + // Read nodes + if (!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "PreGraph file incomplete"); + while (line[0] == 'N') { + nodeID++; + + // Fill in the initial word : + clearKmer(&word); + clearKmer(&antiWord); + + for (index = 0; index < wordLength - 1; index++) { + c = getc(file); + if (c == 'A') + nucleotide = ADENINE; + else if (c == 'C') + nucleotide = CYTOSINE; + else if (c == 'G') + nucleotide = GUANINE; + else if (c == 'T') + nucleotide = THYMINE; + else if (c == '\n') + exitErrorf(EXIT_FAILURE, true, "PreGraph file incomplete"); + else + nucleotide = ADENINE; + + + pushNucleotide(&word, nucleotide); + if (double_strand) { +#ifdef COLOR + reversePushNucleotide(&antiWord, nucleotide); +#else + reversePushNucleotide(&antiWord, 3 - nucleotide); +#endif + } + } + + // Scan through node + index = 0; + while((c = getc(file)) != '\n' && c != EOF) { + if (c == 'A') + nucleotide = 
ADENINE; + else if (c == 'C') + nucleotide = CYTOSINE; + else if (c == 'G') + nucleotide = GUANINE; + else if (c == 'T') + nucleotide = THYMINE; + else + nucleotide = ADENINE; + + pushNucleotide(&word, nucleotide); + if (double_strand) { +#ifdef COLOR + reversePushNucleotide(&antiWord, nucleotide); +#else + reversePushNucleotide(&antiWord, 3 - nucleotide); +#endif + } + + if (!double_strand || compareKmers(&word, &antiWord) <= 0) { + copyKmers(&kmerOccurencePtr->kmer, &word); + kmerOccurencePtr->nodeID = nodeID; + kmerOccurencePtr->position = + index; + } else { + copyKmers(&kmerOccurencePtr->kmer, &antiWord); + kmerOccurencePtr->nodeID = -nodeID; + kmerOccurencePtr->position = + getNodeLength(getNodeInGraph(graph, nodeID)) - 1 - index; + } + + kmerOccurencePtr++; + kmerOccurenceIndex++; + index++; + } + + if (fgets(line, maxline, file) == NULL) + break; + } + + fclose(file); + + // Sort table + qsort(kmerOccurences, kmerCount, sizeof(KmerOccurence), + compareKmerOccurences); + + // Fill up acceleration table + if (kmerTable->accelerationTable != NULL) { + *accelPtr = (IDnum) 0; + for (kmerOccurenceIndex = 0; + kmerOccurenceIndex < kmerCount; + kmerOccurenceIndex++) { + header = + keyInAccelerationTable(&kmerOccurences + [kmerOccurenceIndex]. 
+ kmer, kmerTable); + while (lastHeader < header) { + lastHeader++; + accelPtr++; + *accelPtr = kmerOccurenceIndex; + } + } + + while (lastHeader < (KmerKey) 1 << accelerationBits) { + lastHeader++; + accelPtr++; + *accelPtr = kmerCount; + } + } + + return kmerTable; +} + +static KmerOccurence *findKmerOccurenceInSortedTable(Kmer * kmer, + KmerOccurenceTable * + table) +{ + KmerOccurence *array = table->kmerTable; + KmerKey key = keyInAccelerationTable(kmer, table); + Coordinate leftIndex, rightIndex, middleIndex; + + if (table->accelerationTable != NULL) { + leftIndex = table->accelerationTable[key]; + rightIndex = table->accelerationTable[key + 1]; + } else { + leftIndex = 0; + rightIndex = table->kmerTableSize; + } + + while (true) { + middleIndex = (rightIndex + leftIndex) / 2; + + if (leftIndex >= rightIndex) + return NULL; + else if (compareKmers(&(array[middleIndex]).kmer, kmer) == 0) + return &(array[middleIndex]); + else if (leftIndex == middleIndex) + return NULL; + else if (compareKmers(&(array[middleIndex]).kmer, kmer) > 0) + rightIndex = middleIndex; + else + leftIndex = middleIndex; + } +} + +static void ghostThreadSequenceThroughGraph(TightString * tString, + KmerOccurenceTable * + kmerOccurences, Graph * graph, + IDnum seqID, Category category, + boolean readTracking, + boolean double_strand, + boolean second_in_pair) +{ + Kmer word; + Kmer antiWord; + Coordinate readNucleotideIndex; + KmerOccurence *kmerOccurence; + int wordLength = getWordLength(graph); + Nucleotide nucleotide; + boolean reversed; + + Node *node; + Node *previousNode = NULL; + + clearKmer(&word); + clearKmer(&antiWord); + + // Neglect any read which will not be short paired + if ((!readTracking && category % 2 == 0) + || category / 2 >= CATEGORIES) + return; + + // Neglect any string shorter than WORDLENGTH : + if (getLength(tString) < wordLength) + return; + + // Verify that all short reads are reasonably short + if (getLength(tString) > USHRT_MAX) { + printf("Short read of
length %lli, longer than limit %i\n", + (long long) getLength(tString), USHRT_MAX); + puts("You should better declare this sequence as long, because it genuinely is!"); + exit(1); + } + // Allocate memory for the read pairs + if (!readStartsAreActivated(graph)) + activateReadStarts(graph); + + // Fill in the initial word : + for (readNucleotideIndex = 0; + readNucleotideIndex < wordLength - 1; readNucleotideIndex++) { + nucleotide = getNucleotide(readNucleotideIndex, tString); + pushNucleotide(&word, nucleotide); + if (double_strand) { +#ifdef COLOR + reversePushNucleotide(&antiWord, nucleotide); +#else + reversePushNucleotide(&antiWord, 3 - nucleotide); +#endif + } + } + + // Go through sequence + while (readNucleotideIndex < getLength(tString)) { + // Shift word: + nucleotide = getNucleotide(readNucleotideIndex++, tString); + pushNucleotide(&word, nucleotide); + if (double_strand) { +#ifdef COLOR + reversePushNucleotide(&antiWord, nucleotide); +#else + reversePushNucleotide(&antiWord, 3 - nucleotide); +#endif + } + + // Search in table + reversed = false; + if (double_strand) { + if (compareKmers(&word, &antiWord) <= 0) { + kmerOccurence = + findKmerOccurenceInSortedTable(&word, + kmerOccurences); + } else { + kmerOccurence = + findKmerOccurenceInSortedTable(&antiWord, + kmerOccurences); + reversed = true; + } + } else { + if (!second_in_pair) { + kmerOccurence = + findKmerOccurenceInSortedTable(&word, + kmerOccurences); + } else { + kmerOccurence = + findKmerOccurenceInSortedTable(&antiWord, + kmerOccurences); + reversed = true; + } + } + + if (kmerOccurence) { + if (!reversed) + node = getNodeInGraph(graph, kmerOccurence->nodeID); + else + node = getNodeInGraph(graph, -kmerOccurence->nodeID); + } else { + node = NULL; + if (previousNode) + break; + } + + previousNode = node; + + // Fill in graph + if (node && !getNodeStatus(node)) { + incrementReadStartCount(node, graph); + setSingleNodeStatus(node, true); + memorizeNode(node); + } + } + + 
unlockMemorizedNodes(); +} + +static void threadSequenceThroughGraph(TightString * tString, + KmerOccurenceTable * kmerOccurences, + Graph * graph, + IDnum seqID, Category category, + boolean readTracking, + boolean double_strand, + boolean second_in_pair) +{ + Kmer word; + Kmer antiWord; + Coordinate readNucleotideIndex; + Coordinate kmerIndex; + KmerOccurence *kmerOccurence; + int wordLength = getWordLength(graph); + + PassageMarker *marker = NULL; + PassageMarker *previousMarker = NULL; + Node *node; + Node *previousNode = NULL; + Coordinate coord; + Coordinate previousCoord = 0; + Nucleotide nucleotide; + boolean reversed; + + clearKmer(&word); + clearKmer(&antiWord); + + // Neglect any string shorter than WORDLENGTH : + if (getLength(tString) < wordLength) + return; + + // Fill in the initial word : + for (readNucleotideIndex = 0; + readNucleotideIndex < wordLength - 1; readNucleotideIndex++) { + nucleotide = getNucleotide(readNucleotideIndex, tString); + pushNucleotide(&word, nucleotide); + if (double_strand) { +#ifdef COLOR + reversePushNucleotide(&antiWord, nucleotide); +#else + reversePushNucleotide(&antiWord, 3 - nucleotide); +#endif + } + } + + // Go through sequence + while (readNucleotideIndex < getLength(tString)) { + nucleotide = getNucleotide(readNucleotideIndex++, tString); + pushNucleotide(&word, nucleotide); + if (double_strand) { +#ifdef COLOR + reversePushNucleotide(&antiWord, nucleotide); +#else + reversePushNucleotide(&antiWord, 3 - nucleotide); +#endif + } + + // Search in table + reversed = false; + if (double_strand) { + if (compareKmers(&word, &antiWord) <= 0) { + kmerOccurence = + findKmerOccurenceInSortedTable(&word, + kmerOccurences); + } else { + kmerOccurence = + findKmerOccurenceInSortedTable(&antiWord, + kmerOccurences); + reversed = true; + } + } else { + if (!second_in_pair) { + kmerOccurence = + findKmerOccurenceInSortedTable(&word, + kmerOccurences); + } else { + kmerOccurence = + findKmerOccurenceInSortedTable(&antiWord, + 
kmerOccurences); + reversed = true; + } + } + + if (kmerOccurence) { + if (!reversed) { + node = getNodeInGraph(graph, kmerOccurence->nodeID); + coord = kmerOccurence->position; + } else { + node = getNodeInGraph(graph, -kmerOccurence->nodeID); + coord = getNodeLength(node) - kmerOccurence->position - 1; + } + } else { + node = NULL; + if (previousNode) + break; + } + + // Fill in graph + if (node) { + kmerIndex = readNucleotideIndex - wordLength; + + if (previousNode == node + && previousCoord == coord - 1) { + if (category / 2 >= CATEGORIES) { + setPassageMarkerFinish(marker, + kmerIndex + + 1); + setFinishOffset(marker, + getNodeLength(node) + - coord - 1); + } else { + incrementVirtualCoverage(node, + category / + 2, 1); + incrementOriginalVirtualCoverage + (node, category / 2, 1); + } + + } else { + if (category / 2 >= CATEGORIES) { + marker = + newPassageMarker(seqID, + kmerIndex, + kmerIndex + 1, + coord, + getNodeLength + (node) - + coord - 1); + transposePassageMarker(marker, + node); + connectPassageMarkers + (previousMarker, marker, + graph); + previousMarker = marker; + } else { + if (readTracking) { + if (!getNodeStatus(node)) { + addReadStart(node, + seqID, + coord, + graph, + kmerIndex); + setSingleNodeStatus + (node, true); + memorizeNode(node); + } else { + blurLastShortReadMarker + (node, graph); + } + } + + incrementVirtualCoverage(node, + category / + 2, 1); + incrementOriginalVirtualCoverage + (node, category / 2, 1); + } + + createArc(previousNode, node, graph); + } + + previousNode = node; + previousCoord = coord; + } + } + + unlockMemorizedNodes(); +} + +static void fillUpGraph(ReadSet * reads, + KmerOccurenceTable * kmerOccurences, Graph * graph, + boolean readTracking, boolean double_strand) +{ + IDnum readIndex; + Category category; + boolean second_in_pair = false; + + resetNodeStatus(graph); + + for (readIndex = 0; readIndex < reads->readCount; readIndex++) { + category = reads->categories[readIndex]; + 
ghostThreadSequenceThroughGraph(reads-> + tSequences[readIndex], + kmerOccurences, + graph, readIndex + 1, + category, + readTracking, double_strand, second_in_pair); + + if (category % 2) + second_in_pair = (second_in_pair? false : true); + else + second_in_pair = false; + } + + createNodeReadStartArrays(graph); + + second_in_pair = false; + for (readIndex = 0; readIndex < reads->readCount; readIndex++) { + category = reads->categories[readIndex]; + + if (readIndex % 100000 == 0) + printf("Threading through reads %d / %d\n", + readIndex, reads->readCount); + + threadSequenceThroughGraph(reads->tSequences[readIndex], + kmerOccurences, + graph, readIndex + 1, category, + readTracking, double_strand, second_in_pair); + + if (category % 2) + second_in_pair = (second_in_pair? false : true); + else + second_in_pair = false; + } + + orderNodeReadStartArrays(graph); + + if (smallNodeListMemory != NULL) + destroyRecycleBin(smallNodeListMemory); + + free(kmerOccurences->kmerTable); + free(kmerOccurences->accelerationTable); + free(kmerOccurences); +} + +Graph *importPreGraph(char *preGraphFilename, ReadSet * reads, + boolean readTracking, short int accelerationBits) +{ + boolean double_strand = false; + Graph *graph = readPreGraphFile(preGraphFilename, &double_strand); + KmerOccurenceTable *kmerOccurences = + referenceGraphKmers(preGraphFilename, accelerationBits, graph, double_strand); + fillUpGraph(reads, kmerOccurences, graph, readTracking, double_strand); + + return graph; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/graphReConstruction.h b/contrib/MetaVelvet-v0.3.1/src/graphReConstruction.h new file mode 100644 index 0000000..bf746a1 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/graphReConstruction.h @@ -0,0 +1,27 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. 
+ + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _GRAPHRECONSTRUCTION_H_ +#define _GRAPHRECONSTRUCTION_H_ + +Graph *importPreGraph(char *preGraphFilename, ReadSet * reads, + boolean readTracking, short int accelerationBits); + +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/graphStats.c b/contrib/MetaVelvet-v0.3.1/src/graphStats.c new file mode 100644 index 0000000..d8f61b2 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/graphStats.c @@ -0,0 +1,3481 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include +#include + +#include "globals.h" +#include "graph.h" +#include "readSet.h" +#include "tightString.h" +#include "passageMarker.h" +#include "concatenatedGraph.h" +#include "readCoherentGraph.h" +#include "fibHeap.h" +#include "utility.h" +// Original +#include "shortReadPairs.h" +// Original + +// Original +#define LEN_HISTO_X 10000 +// Original + +static void surveyPath(PassageMarker * marker) +{ + Coordinate length = 0; + Coordinate realLength = 0; + PassageMarker *current = marker; + + if (passageMarkerDirection(current) < 0) + current = getTwinMarker(current); + + for (; current != NULL; current = getNextInSequence(current)) { + length += + getNodeLength(getNode(current)) - + getStartOffset(current) - getFinishOffset(current); + if (getPassageMarkerFinish(current) > 0) + realLength = getPassageMarkerFinish(current); + } + + printf("SURVEY %ld %lld %lld\n", (long) getAbsolutePassMarkerSeqID(marker), + (long long) realLength, (long long) length); +} + +void surveyPaths(Graph * graph) +{ + IDnum ID; + PassageMarker *marker; + + for (ID = 1; ID <= nodeCount(graph); ID++) + for (marker = getMarker(getNodeInGraph(graph, ID)); + marker != NULL; marker = getNextInNode(marker)) + if ((passageMarkerDirection(marker) > 0 + && isInitial(marker)) + || (passageMarkerDirection(marker) < 0 + && isTerminal(marker))) + surveyPath(marker); +} + +static PassageMarkerList *copyMarkers(Node * node) +{ + PassageMarkerList *list = NULL; + PassageMarkerList *new; + PassageMarker *currentMarker; + + for (currentMarker = getMarker(node); currentMarker != NULL; + currentMarker = getNextInNode(currentMarker)) { + new = newPassageMarkerList(currentMarker, list); + list = new; + } + + return list; +} + +static boolean removeDead(PassageMarkerList ** list) +{ 
+ PassageMarkerList *current, *next; + boolean removed = false; + + if (*list == NULL) + return false; + + current = *list; + + while (current->next != NULL) { + next = current->next; + + if (isTerminal(next->marker)) { + removed = true; + current->next = next->next; + deallocatePassageMarkerList(next); + } else + current = current->next; + } + + current = *list; + if (isTerminal(current->marker)) { + removed = true; + *list = current->next; + deallocatePassageMarkerList(current); + } + + return removed; +} + +static Node *chooseDestination(PassageMarkerList * list) +{ + PassageMarkerList *current = list; + Node *destination; + + destination = getNode(getNextInSequence(current->marker)); + while (current != NULL) { + if (getNode(getNextInSequence(current->marker)) != + destination) + return NULL; + current = current->next; + } + + return destination; +} + +static void destroyPassageMarkerList(PassageMarkerList ** list) +{ + PassageMarkerList *ptr; + + while (*list != NULL) { + ptr = *list; + *list = ptr->next; + deallocatePassageMarkerList(ptr); + } +} + +static void updateMarkers(PassageMarkerList * list) +{ + PassageMarkerList *current; + + for (current = list; current != NULL; current = current->next) + current->marker = getNextInSequence(current->marker); +} + +Coordinate computeSubsequentNodesLength(Node * node) +{ + PassageMarkerList *list; + Node *nextNode; + Coordinate totalLength = 0; + boolean uncertain = false; + + list = copyMarkers(node); + + while (true) { + if (removeDead(&list)) + uncertain = true; + + if (uncertain && simpleArcCount(node) > 1) { + destroyPassageMarkerList(&list); + return totalLength; + } + + if (list == NULL) + return totalLength; + + nextNode = chooseDestination(list); + if (nextNode == NULL) { + destroyPassageMarkerList(&list); + return totalLength; + } + + totalLength += getNodeLength(nextNode); + + updateMarkers(list); + } + + // Impossible instruction + return -1; +} + +Coordinate computeVirtualNodeLength(Node * node) +{ + 
Coordinate virtualLength; + + if (node == NULL) + return 0; + + virtualLength = getNodeLength(node); + + virtualLength += computeSubsequentNodesLength(node); + virtualLength += computeSubsequentNodesLength(getTwinNode(node)); + + return virtualLength; +} + +void testForBizarreMarkers(Graph * graph) +{ + IDnum index; + Node *node; + PassageMarker *marker; + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) { + if (getTwinMarker(marker) == NULL) + exitErrorf(EXIT_FAILURE, false, "Bizarre marker %s", + readPassageMarker(marker)); + } + } +} + +// Counts how many nodes are dead-ends +IDnum countSinksAndSources(Graph * graph) +{ + IDnum nodeIndex; + IDnum result = 0; + Node *node; + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex); + if (arcCount(node) == 0 + || arcCount(getTwinNode(node)) == 0) + result++; + } + + return result; +} + +// Counts how many nodes have several arcs either going in or coming out +IDnum countTangles(Graph * graph) +{ + IDnum nodeIndex; + IDnum result = 0; + Node *node; + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex); + if (arcCount(node) > 1 || arcCount(getTwinNode(node)) > 1) + result++; + } + return result; +} + +// Counts nodes with exactly one incoming and one outgoing arc +IDnum countRepeats(Graph * graph) +{ + IDnum nodeIndex; + IDnum result = 0; + Node *node; + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex); + if (arcCount(node) == 1 + && arcCount(getTwinNode(getDestination(getArc(node)))) + == 1) + if (getNodeID + (getTwinNode(getDestination(getArc(node)))) < 0 + || + getNodeID(getTwinNode + (getDestination(getArc(node)))) > + getNodeID(node)) + result++; + } + return result; + +} + +// 
Counts the number of markers for one node +int nodeGenomicMultiplicity(Node * node, IDnum firstStrain) +{ + int counter = 0; + PassageMarker *marker; + + if (node == NULL) + return 0; + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) + if (getAbsolutePassMarkerSeqID(marker) < firstStrain) + counter++; + + return counter; +} + +// Counts the number of markers for one node +IDnum nodeMultiplicity(Node * node) +{ + int counter = 0; + PassageMarker *marker; + + if (node == NULL) + return 0; + + marker = getMarker(node); + while (marker != NULL) { + counter++; + marker = getNextInNode(marker); + } + return counter; +} + +// Prints out a set of predefined statistics for one node +char *nodeStatistics(Node * node) +{ + char *s = mallocOrExit(100, char); + sprintf(s, "NODE %ld\t%lld\t%i\t%i\t%ld", (long) getNodeID(node), + (long long) getNodeLength(node), simpleArcCount(node), + simpleArcCount(getTwinNode(node)), (long) nodeMultiplicity(node)); + return s; +} + +// Prints out a table of statistics for all the nodes of the graph +void displayGraphStatistics(Graph * graph) +{ + IDnum nodeIndex; + printf("NODE ID\tlgth\tFwd\tBck\tMult\n"); + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) + printf("%s\n", + nodeStatistics(getNodeInGraph(graph, nodeIndex))); +} + +void displayNodeStatisticsSelective(Node * node, IDnum first) +{ + PassageMarker *marker; + boolean originalGenome; + boolean strain; + + if (node == NULL) + return; + + marker = getMarker(node); + + originalGenome = false; + strain = false; + while (marker != NULL) { + if (getAbsolutePassMarkerSeqID(marker) < first) + originalGenome = true; + + if (getAbsolutePassMarkerSeqID(marker) >= first) + strain = true; + + marker = getNextInNode(marker); + } + + printf("%s", nodeStatistics(node)); + if (originalGenome && !strain) + printf("\tTRUE"); + else + printf("\tFALSE"); + + if (originalGenome && strain) + printf("\tTRUE"); + else + printf("\tFALSE"); + + + if (strain 
&& !originalGenome) + puts("\tTRUE"); + else + puts("\tFALSE"); + +} + +void displayGraphStatisticsSelective(Graph * graph, IDnum first) +{ + IDnum index; + + for (index = 1; index <= nodeCount(graph); index++) + displayNodeStatisticsSelective(getNodeInGraph + (graph, index), first); + +} + +boolean isOnlyGenome(Node * node, IDnum firstStrain) +{ + PassageMarker *marker; + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) + if (getAbsolutePassMarkerSeqID(marker) >= firstStrain) + return false; + + return true; +} + +boolean isOnlyStrain(Node * node, IDnum firstStrain) +{ + PassageMarker *marker; + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) + if (getAbsolutePassMarkerSeqID(marker) < firstStrain) + return false; + + return true; +} + +boolean isSNP(Node * node, IDnum firstStrain, int WORDLENGTH) +{ + IDnum sequence; + Coordinate position; + + if (getNodeLength(node) != WORDLENGTH) + return false; + + if (getMarker(node) == NULL) + return false; + + if (getAbsolutePassMarkerSeqID(getMarker(node)) >= firstStrain) + return false; + + if (getNextInNode(getMarker(node)) != NULL) + return false; + + if (arcCount(node) != 1) + return false; + + if (arcCount(getTwinNode(node)) != 1) + return false; + + if (isOnlyGenome(getDestination(getArc(node)), firstStrain)) + return false; + + if (isOnlyGenome + (getDestination(getArc(getTwinNode(node))), firstStrain)) + return false; + + sequence = getPassageMarkerSequenceID(getMarker(node)); + + if (sequence >= 0) + position = getPassageMarkerStart(getMarker(node)); + else { + sequence = -sequence; + position = getPassageMarkerFinish(getMarker(node)); + } + + printf("SNP\t%lld\t%ld\n", (long long) position, (long) sequence); + + return true; +} + +IDnum strainMarkerCount(Node * node, IDnum firstStrain) +{ + PassageMarker *marker; + IDnum counter = 0; + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) + if 
(getAbsolutePassMarkerSeqID(marker) >= firstStrain) + counter++; + + return counter; +} + +boolean isError(Node * node, IDnum firstStrain) +{ + return (strainMarkerCount(node, firstStrain) < 5); +} + +void removeStrainMarkers(Node * node, IDnum firstStrain) +{ + PassageMarker *marker; + PassageMarker *tmp = NULL; + + marker = getMarker(node); + while (marker != NULL) { + tmp = getNextInNode(marker); + + if (getAbsolutePassMarkerSeqID(marker) >= firstStrain) + destroyPassageMarker(marker); + marker = tmp; + } + +} + +void chainSawCorrection(Graph * graph, int minMult) +{ + IDnum nodeIndex; + IDnum removed = 0; + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + if (markerCount(getNodeInGraph(graph, nodeIndex)) < + minMult) { + destroyNode(getNodeInGraph(graph, nodeIndex), + graph); + removed++; + } + } + + printf("%d dubious nodes removed\n", removed); + concatenateGraph(graph); + printf("%d node in the end\n", nodeCount(graph)); +} + +void grossErrorRemoval(Graph * graph, IDnum firstStrain) +{ + IDnum nodeIndex; + IDnum removed = 0; + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + if (isError(getNodeInGraph(graph, nodeIndex), firstStrain)) { + if (isOnlyStrain + (getNodeInGraph(graph, nodeIndex), + firstStrain)) { + destroyNode(getNodeInGraph + (graph, nodeIndex), graph); + removed++; + } else + removeStrainMarkers(getNodeInGraph + (graph, nodeIndex), + firstStrain); + } + } + + printf("%d dubious nodes removed\n", removed); + concatenateGraph(graph); + printf("%d node in the end\n", nodeCount(graph)); +} + +IDnum countSNPs(Graph * graph, IDnum firstStrain, int WORDLENGTH) +{ + IDnum index; + IDnum counter = 0; + + for (index = 1; index < nodeCount(graph); index++) + if (isSNP + (getNodeInGraph(graph, index), firstStrain, + WORDLENGTH)) + counter++; + + return counter; +} + +Coordinate commonLength(Node * node, IDnum firstStrain) +{ + PassageMarker *marker = getMarker(node); + int orig = 0; + int strain = 0; + + while 
(marker != NULL) { + if (getAbsolutePassMarkerSeqID(marker) < firstStrain) + orig++; + else + strain++; + marker = getNextInNode(marker); + } + + if (orig == 0 || strain == 0) + return 0; + + return (Coordinate) orig *getNodeLength(node); +} + +Coordinate countCommonLength(Graph * graph, IDnum firstStrain) +{ + IDnum index; + Coordinate res = 0; + + for (index = 1; index <= nodeCount(graph); index++) + res += + commonLength(getNodeInGraph(graph, index), + firstStrain); + + return res; +} + +boolean isMixed(Node * node, IDnum firstStrain) +{ + return !isOnlyStrain(node, firstStrain) + && !isOnlyGenome(node, firstStrain); +} + +int countLocalBreakpoints(PassageMarker * marker, IDnum firstStrain) +{ + PassageMarker *localMarker; + IDnum sequenceID = getAbsolutePassMarkerSeqID(marker); + IDnum localSeqID; + Coordinate start = getPassageMarkerStart(marker); + Node *localNode = getNode(marker); + Node *destination; + Arc *arc; + int arcCount = 0; + int arcIndex; + boolean *arcStatus; + int counter = 0; + + if (!isMixed(localNode, firstStrain)) + return 0; + + // Count arcs + for (arc = getArc(localNode); arc != NULL; arc = getNextArc(arc)) + arcCount++; + arcStatus = callocOrExit(arcCount, boolean); + // Check for other genomic markers in node + for (localMarker = getMarker(localNode); localMarker != NULL; + localMarker = getNextInNode(localMarker)) { + localSeqID = getAbsolutePassMarkerSeqID(localMarker); + if (localSeqID >= firstStrain) + continue; + + if (localSeqID < sequenceID) + return 0; + + if (localSeqID == sequenceID + && getPassageMarkerStart(localMarker) < start) + return 0; + + destination = getNode(getNextInSequence(localMarker)); + + // Enter into table: + arcIndex = 0; + for (arc = getArc(localNode); + getDestination(arc) != destination; + arc = getNextArc(arc)) + arcIndex++; + arcStatus[arcIndex] = true; + } + + // Check other nodes + arcIndex = 0; + for (arc = getArc(localNode); arc != NULL; arc = getNextArc(arc)) { + if (!arcStatus[arcIndex] + && 
isMixed(getDestination(arc), firstStrain)) + counter++; + arcIndex++; + } + + free(arcStatus); + return counter; +} + +IDnum countBreakpoints(Graph * graph, IDnum firstStrain) +{ + PassageMarker *marker; + IDnum seqIndex; + IDnum total = 0; + + for (seqIndex = 1; seqIndex < firstStrain; seqIndex++) { + marker = getMarker(getNodeInGraph(graph, seqIndex)); + while (marker != NULL) { + total += + countLocalBreakpoints(marker, firstStrain); + marker = getNextInSequence(marker); + } + } + + return total; +} + +IDnum countStrainOnlyNodes(Graph * graph, IDnum firstStrain) +{ + IDnum index; + IDnum total = 0; + + for (index = 1; index <= nodeCount(graph); index++) + if (isOnlyStrain + (getNodeInGraph(graph, index), firstStrain)) + total++; + + return total; +} + +Coordinate countStrainOnlyBp(Graph * graph, IDnum firstStrain) +{ + IDnum index; + Coordinate total = 0; + Node *node; + Arc *arc; + Coordinate local; + + for (index = 1; index <= nodeCount(graph); index++) { + if (isOnlyStrain + (getNodeInGraph(graph, index), firstStrain)) { + node = getNodeInGraph(graph, index); + local = getNodeLength(node); + + for (arc = getArc(node); arc != NULL; + arc = getNextArc(arc)) { + if (!isOnlyStrain + (getDestination(arc), firstStrain)) { + local -= 24; + break; + } + } + + for (arc = getArc(getTwinNode(node)); arc != NULL; + arc = getNextArc(arc)) { + if (!isOnlyStrain + (getDestination(arc), firstStrain)) { + local -= 24; + break; + } + } + + if (local < 0) + local = 1; + + total += local; + } + } + + return total; +} + +IDnum genomeMarkerCount(Node * node, IDnum firstStrain) +{ + PassageMarker *marker; + IDnum counter = 0; + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) + if (getAbsolutePassMarkerSeqID(marker) < firstStrain) + counter++; + + return counter; +} + +Coordinate readCoverage(Node * node) +{ + PassageMarker *marker; + Coordinate sum = 0; + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) { + if 
(getTwinMarker(marker) == NULL) { + printf("Node %d screwed up\n", getNodeID(node)); + printf("Sequence %d\n", + getPassageMarkerSequenceID(marker)); + abort(); + } + sum += getPassageMarkerLength(marker); + } + + return sum; +} + +Coordinate refReadCoverage(Node * node, IDnum firstStrain) +{ + PassageMarker *marker; + Coordinate sum = 0; + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) + if (getAbsolutePassMarkerSeqID(marker) < firstStrain) + sum += getPassageMarkerLength(marker); + + return sum; +} + +Coordinate newReadCoverage(Node * node, IDnum firstStrain) +{ + PassageMarker *marker; + Coordinate sum = 0; + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) + if (getAbsolutePassMarkerSeqID(marker) >= firstStrain) { + sum += getPassageMarkerLength(marker); + if (getPassageMarkerLength(marker) < 0) + printf("Bizarre marker %d at node %d\n", + getPassageMarkerSequenceID(marker), + getNodeID(node)); + } + + return sum; +} + +IDnum readStarts(Node * node) +{ + PassageMarker *marker; + IDnum sum = 0; + + if (node == NULL) + return 0; + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) { + if (getPassageMarkerSequenceID(marker) > 0 + && isInitial(marker)) + sum++; + else if (getPassageMarkerSequenceID(marker) < 0 + && isTerminal(marker)) + sum++; + } + + return sum; +} + +static void printShortCounts(FILE * outfile, Node * node, Graph * graph, ReadSet * reads) { + IDnum counts[CATEGORIES]; + Category cat; + IDnum shortReadIndex; + IDnum readID; + IDnum shortReadCount; + ShortReadMarker *array; + ShortReadMarker *marker; + + if (!readStartsAreActivated(graph)) { + for (cat = 0; cat < CATEGORIES; cat++) + fprintf(outfile, "\tN/A"); + return; + } + + shortReadCount = getNodeReadCount(node, graph); + array = getNodeReads(node, graph); + + for (cat = 0; cat < CATEGORIES; cat++) + counts[cat] = 0; + + for (shortReadIndex = 0; shortReadIndex < shortReadCount; 
shortReadIndex++) { + marker = getShortReadMarkerAtIndex(array, shortReadIndex); + readID = getShortReadMarkerID(marker); + cat = reads->categories[readID - 1] / 2; + counts[cat]++; + } + + for (cat = 0; cat < CATEGORIES; cat++) + fprintf(outfile, "\t%li", (long) counts[cat]); +} + +void displayGeneralStatistics(Graph * graph, char *filename, ReadSet * reads) +{ + IDnum nodeIndex; + Node *node; + Category cat; + FILE *outfile; + + outfile = fopen(filename, "w"); + if (outfile == NULL) { + printf("Couldn't open file %s, sorry\n", filename); + return; + } else + printf("Writing into stats file %s...\n", filename); + + fprintf(outfile, "ID\tlgth\tout\tin\tlong_cov"); + + for (cat = 0; cat < CATEGORIES; cat++) { + fprintf(outfile, "\tshort%i_cov", (int) (cat + 1)); + fprintf(outfile, "\tshort%i_Ocov", (int) (cat + 1)); + } + + fprintf(outfile, "\tlong_nb"); + for (cat = 0; cat < CATEGORIES; cat++) { + fprintf(outfile, "\tshort%i_nb", (int) (cat + 1)); + } + + fprintf(outfile, "\n"); + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex); + if (node == NULL) + continue; + fprintf + (outfile, "%ld\t%lld\t%i\t%i", + (long) getNodeID(node), (long long) getNodeLength(node), arcCount(node), + arcCount(getTwinNode(node))); + + if (getNodeLength(node) > 0) { + fprintf(outfile, "\t%f", + readCoverage(node) / + (double) getNodeLength(node)); + for (cat = 0; cat < CATEGORIES; cat++) { + fprintf(outfile, "\t%f", + getVirtualCoverage(node, + cat) / + (double) getNodeLength(node)); + fprintf(outfile, "\t%f", + getOriginalVirtualCoverage(node, + cat) / + (double) getNodeLength(node)); + } + } else { + fprintf(outfile, "\tInf"); + for (cat = 0; cat < CATEGORIES; cat++) + fprintf(outfile, "\tInf\tInf"); + } + + fprintf(outfile, "\t%li", (long) markerCount(node)); + printShortCounts(outfile, node, graph, reads); + + fprintf(outfile, "\n"); + } + + fclose(outfile); +} + +void destroyStrainSpecificIslands(Graph * graph, IDnum 
firstStrain) +{ + IDnum index; + Arc *arc; + boolean isModified = true; + Node *node; + IDnum counter = 0; + + resetNodeStatus(graph); + + puts("Destroying disconnected strain specific sub-graphs"); + + // Mark all genomic nodes + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (!isOnlyStrain(node, firstStrain)) + setNodeStatus(node, true); + } + + // Mark nodes connected to genomic nodes + while (isModified) { + isModified = false; + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + + if (getNodeStatus(node)) + continue; + + for (arc = getArc(node); arc != NULL; + arc = getNextArc(arc)) { + if (getNodeStatus(getDestination(arc))) { + isModified = true; + setNodeStatus(node, true); + } + } + + for (arc = getArc(getTwinNode(node)); arc != NULL; + arc = getNextArc(arc)) { + if (getNodeStatus(getDestination(arc))) { + isModified = true; + setNodeStatus(node, true); + } + } + } + } + + // Remove all unmarked nodes + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (!getNodeStatus(node)) { + destroyNode(node, graph); + counter++; + } + } + + // Renumber graph nodes + printf("Removed %d nodes \n", counter); + renumberNodes(graph); +} + +void displayStrainOnlySequences(Graph * graph, IDnum firstStrain, + char *inputFilename, char *filename, + int WORDLENGTH) +{ + IDnum nodeIndex; + Node *node; + FILE *outfile = fopen(filename, "w"); + Coordinate start, finish; + char str[100]; + TightString *tString; + IDnum readID; + Coordinate readCoord; + + if (outfile == NULL) { + printf("Could not write into %s, sorry\n", filename); + return; + } + + destroyStrainSpecificIslands(graph, firstStrain); + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex); + if (isOnlyStrain(node, firstStrain) + && getNodeLength(node) > 500) { + tString = expandNode(node, WORDLENGTH); + readID = + 
getPassageMarkerSequenceID(getMarker(node)); + readCoord = getPassageMarkerStart(getMarker(node)); + fprintf(outfile, "> UNIQUE SEQUENCE %ld; %lld\n", + (long) readID, (long long) readCoord); + + start = 0; + while (start <= getLength(tString)) { + finish = start + 60; + readTightStringFragment(tString, start, + finish, str); + fprintf(outfile, "%s\n", str); + start = finish; + } + } + } + + fclose(outfile); +} + +void displayStrainOnlyDescriptors(Graph * graph, IDnum firstStrain) +{ + IDnum nodeIndex; + Node *node; + char *str; + + destroyStrainSpecificIslands(graph, firstStrain); + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + printf("node %d from %d\n", nodeIndex, nodeCount(graph)); + node = getNodeInGraph(graph, nodeIndex); + + if (isOnlyStrain(node, firstStrain)) { + str = readNode(node); + printf("> UNIQUE SEQUENCE %s\n", str); + free(str); + } + } +} + +void displayLocalBreakpoint(PassageMarker * strainMarker, + IDnum firstStrain, + PassageMarker * genomeMarker, + Node ** genomeDestination, + Node ** strainDestination, IDnum * counter, + IDnum nodeCount) +{ + boolean isTranslocation; + PassageMarker *marker; + Node *destination, *destinationA; + Node *destination2, *destination2A; + Node *node1, *node2; + IDnum localID = getNodeID(getNode(strainMarker)); + + // Eliminate genomic markers + if (strainMarker == genomeMarker) + return; + + destinationA = getNode(getNextInSequence(strainMarker)); + + if (destinationA == NULL) + return; + + // Eliminate those that follow some local strain + if (isDestinationToMarker(genomeMarker, destinationA)) { +// puts("Parallel paths"); + return; + } + + destination2A = getNode(getNextInSequence(genomeMarker)); + + if (destination2A == NULL) + return; + + printf("Lengths %lld %lld\n", (long long) getNodeLength(destinationA), + (long long) getNodeLength(destination2A)); + + // Hop to another genomic node +// if (getNodeLength(destinationA) > 24) { + //printf("wrong length %d %d\n", 
getNodeLength(destination) , getNodeID(destination)); +// return; +// } + + destination = + getNode(getNextInSequence(getNextInSequence(strainMarker))); + + if (destination == NULL) + return; + + // Eliminate those that point to uniquely strain sequences + if (nodeGenomicMultiplicity(destination, firstStrain) != 1) { +// puts("Multiple genome reads"); + return; + } + // Hop to another genomic node +// if (getNodeLength(destination2A) != 24) { + //puts("wrong length 2"); +// return; +// } + + destination2 = + getNode(getNextInSequence(getNextInSequence(genomeMarker))); + + if (destination2 == NULL) + return; + + + if (destination == destination2) + return; + + // Eliminate those that point to uniquely strain sequences + if (isOnlyGenome(destination2, firstStrain)) + return; + + setSingleNodeStatus(getNode(strainMarker), true); + strainDestination[localID + nodeCount] = destination; + genomeDestination[localID + nodeCount] = destination2; + +// printf("Assigning %p and %p to %d\n", destination, destination2, localID); + printf("lengths %lld\t%lld\n", (long long) getNodeLength(destinationA), + (long long) getNodeLength(destination2A)); + + // Detect translocation + isTranslocation = true; + for (marker = getMarker(destination); marker != NULL; + marker = getNextInNode(marker)) + if (getAbsolutePassMarkerSeqID(marker) == + getAbsolutePassMarkerSeqID(genomeMarker)) { + isTranslocation = false; + break; + } + + if (isTranslocation) { + printf("BREAK TRANS\t%ld\t%lld\t%lld\t%lld\n", + (long) getAbsolutePassMarkerSeqID(genomeMarker), + (long long) getPassageMarkerStart(genomeMarker), + (long long) getNodeLength(destinationA), + (long long) getNodeLength(destination2A)); + counter[2]++; + return; + } + // Detect breakpoint + printf("BREAK INTRA\t%ld\t%lld\t%lld\t%lld\n", + (long) getAbsolutePassMarkerSeqID(genomeMarker), + (long long) getPassageMarkerStart(genomeMarker), + (long long) getNodeLength(destinationA), (long long) getNodeLength(destination2A)); + counter[1]++; + 
+ // Check for inversion + if (getPassageMarkerSequenceID(marker) != + -getPassageMarkerSequenceID(genomeMarker)) + return; + +// puts("potential!!"); + + node1 = getTwinNode(destination); + + if (getNodeStatus(node1)) { + node2 = + getTwinNode(genomeDestination + [getNodeID(node1) + nodeCount]); + if (getNodeStatus(node2)) + if (strainDestination[getNodeID(node2) + nodeCount] + == destination2) { +// puts("Safe"); + counter[1] -= 4; + counter[0]++; + } else; +// puts("stopped 3"); + else; +// puts("stopped 2"); + } else; +// puts("stopped 1"); +} + +void displayBreakpoints(Graph * graph, IDnum firstStrain) +{ + IDnum nodeIndex; + Node *node; + PassageMarker *strainMarker, *genomeMarker; + Node **genomeDestination = + callocOrExit(2 * nodeCount(graph) + 1, Node *); + Node **strainDestination = + callocOrExit(2 * nodeCount(graph) + 1, Node *); + IDnum counters[3]; + + counters[0] = 0; + counters[1] = 0; + counters[2] = 0; + + resetNodeStatus(graph); + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex); + + if (arcCount(node) <= 1 + && arcCount(getTwinNode(node)) <= 1) { + continue; + } + + if (nodeGenomicMultiplicity(node, firstStrain) != 1) { + continue; + } + + if (isOnlyGenome(node, firstStrain)) { + continue; + } + + for (genomeMarker = getMarker(node); genomeMarker != NULL; + genomeMarker = getNextInNode(genomeMarker)) + if (getAbsolutePassMarkerSeqID(genomeMarker) < + firstStrain) + break; + + // Go through all strain passage marker + for (strainMarker = getMarker(node); strainMarker != NULL; + strainMarker = getNextInNode(strainMarker)) { + displayLocalBreakpoint(strainMarker, firstStrain, + genomeMarker, + genomeDestination, + strainDestination, counters, + nodeCount(graph)); + displayLocalBreakpoint(getTwinMarker(strainMarker), + firstStrain, + getTwinMarker(genomeMarker), + genomeDestination, + strainDestination, counters, + nodeCount(graph)); + } + } + + + printf("%d\t%d\t%d\n", counters[0], 
counters[1], counters[2]); + free(strainDestination); + free(genomeDestination); +} + +PassageMarker *genomeMarker(Node * node, IDnum firstStrain) +{ + PassageMarker *marker; + + if (genomeMarkerCount(node, firstStrain) != 1) + return NULL; + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) + if (getAbsolutePassMarkerSeqID(marker) < firstStrain) + return marker; + + return NULL; +} + +void exportArcSequence(Arc * arc, FILE * outfile, int WORDLENGTH, + TightString ** sequences) +{ + char *str; + TightString *output = + newTightString(getNodeLength(getOrigin(arc)) + + getNodeLength(getDestination(arc))); + appendNodeSequence(getOrigin(arc), output, 0); + appendNodeSequence(getDestination(arc), output, + getNodeLength(getOrigin(arc))); + str = readTightString(output); + // Terminate the header line so the sequence starts on its own line + fprintf(outfile, "> ARC from NODE %d\n", getNodeID(getOrigin(arc))); + fprintf(outfile, "%s\n", str); + destroyTightString(output); + free(str); +} + +// Produce sequences necessary to recreate graph elsewhere... 
+void projectGraphToFile(Graph * graph, char *filename, int WORDLENGTH, + TightString ** sequences) +{ + FILE *outfile = fopen(filename, "w"); + IDnum index; + Node *currentNode; + Arc *arc; + + if (outfile == NULL) { + printf("Could not open %s, sorry\n", filename); + return; + } + + // Node IDs run from 1 to nodeCount(graph) inclusive + for (index = 1; index <= nodeCount(graph); index++) { + currentNode = getNodeInGraph(graph, index); + for (arc = getArc(currentNode); arc != NULL; + arc = getNextArc(arc)) + exportArcSequence(arc, outfile, WORDLENGTH, + sequences); + + for (arc = getArc(getTwinNode(currentNode)); arc != NULL; + arc = getNextArc(arc)) + exportArcSequence(arc, outfile, WORDLENGTH, + sequences); + } + + fclose(outfile); +} + +void removeReferenceMarkers(Graph * graph, IDnum firstStrain) +{ + IDnum ID; + Node *node; + PassageMarker *marker, *oldMarker; + + for (ID = 1; ID <= nodeCount(graph); ID++) { + node = getNodeInGraph(graph, ID); + marker = getMarker(node); + while (marker != NULL) { + if (getAbsolutePassMarkerSeqID(marker) < + firstStrain) { + if (!isInitial(marker)) + changeMultiplicity + (getArcBetweenNodes + (getNode + (getPreviousInSequence + (marker)), node, graph), + -1); + if (!isTerminal(marker)) + changeMultiplicity + (getArcBetweenNodes + (node, + getNode(getNextInSequence + (marker)), graph), + -1); + oldMarker = marker; + marker = getNextInNode(marker); + destroyPassageMarker(oldMarker); + } else + marker = getNextInNode(marker); + } + + if (getMarker(node) == NULL) + destroyNode(node, graph); + } + + concatenateGraph(graph); +} + +void exportLongNodeSequences(char *filename, Graph * graph, + Coordinate minLength) +{ + FILE *outfile = fopen(filename, "w"); + IDnum nodeIndex; + TightString *tString; + Coordinate position; + char nucleotide; + Node *node; + int WORDLENGTH = getWordLength(graph); + GapMarker *gap; + //double sensitivity, specificity; + + if (outfile == NULL) { + printf("Could not write into %s, sorry\n", filename); + return; + } else { + printf("Writing contigs into 
%s...\n", filename); + } + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex); + + if (node == NULL || getNodeLength(node) < minLength) + continue; + + tString = expandNode(node, WORDLENGTH); + fprintf(outfile, ">NODE_%ld_length_%lld_cov_%f\n", + (long) nodeIndex, (long long) getNodeLength(node), + (getVirtualCoverage(node, 0) + + getVirtualCoverage(node, 1) + + readCoverage(node)) / + (float) getNodeLength(node)); + + gap = getGap(node, graph); + for (position = 0; position < WORDLENGTH; position++) { + if (gap && position >= getGapFinish(gap)) + gap = getNextGap(gap); + + if (gap == NULL || position < getGapStart(gap)) { + nucleotide = + getNucleotideChar(position, tString); + fprintf(outfile, "%c", nucleotide); + } else + fprintf(outfile, "N"); + } + + gap = getGap(node, graph); + for (; position < getLength(tString); position++) { + if (position % 60 == 0) + fprintf(outfile, "\n"); + + if (gap + && position - WORDLENGTH + 1 >= + getGapFinish(gap)) + gap = getNextGap(gap); + + if (gap == NULL + || position - WORDLENGTH + 1 < + getGapStart(gap)) { + nucleotide = + getNucleotideChar(position, tString); + fprintf(outfile, "%c", nucleotide); + } else + fprintf(outfile, "N"); + } + fprintf(outfile, "\n"); + + destroyTightString(tString); + } + + fclose(outfile); +} + +/* +void exportMediumNodeSequences(char* filename, Graph * graph, Coordinate minLength) +{ + IDnum dummy; + ReadSet *readSet = readFastAFile(sequenceFile); + char **reads = readSet->sequences; + TightString **sequences = + newTightStringArrayFromStringArray(reads, dummy); + FILE *outfile = fopen(filename, "w"); + char str[100]; + IDnum nodeIndex; + TightString *tString; + Coordinate start, finish; + Node *node; + double sensitivity, specificity; + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex); + if (getNodeLength(node) < minLength + || getNodeLength(node) >= maxLength) + continue; + 
+ tString = expandNode(node); + compareSequences(tString, sequences[0], &sensitivity, + &specificity, WORDLENGTH); + + fprintf(outfile, + "> MEDIUM NODE %d, Sensitivity = %f, Specificity = %f\n", + nodeIndex, sensitivity, specificity); + printf + ("> MEDIUM NODE %d, Sensitivity = %f, Specificity = %f\n", + nodeIndex, sensitivity, specificity); + + start = 0; + while (start <= getLength(tString)) { + finish = start + 60; + readTightStringFragment(tString, start, + finish, str); + fprintf(outfile, "%s\n", str); + start = finish; + } + + destroyTightString(tString); + } +} +*/ + +Coordinate maxLength(Graph * graph) +{ + IDnum index; + Node *node; + Coordinate max = 0; + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node != NULL && getNodeLength(node) > max) + max = getNodeLength(node); + } + + return max; +} + +Coordinate n50(Graph * graph) +{ + FibHeap *heap = newFibHeap(); + IDnum index; + Coordinate totalLength = 0; + Coordinate sumLength = 0; + Node *node; + + if (nodeCount(graph) == 0) { + puts("EMPTY GRAPH"); + return 0; + } + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + insertNodeIntoHeap(heap, getNodeLength(node), node); + totalLength += getNodeLength(node); + } + totalLength /= 2; + + node = removeNextNodeFromHeap(heap); + while (node != NULL) { + sumLength += getNodeLength(node); + if (sumLength >= totalLength) + break; + node = removeNextNodeFromHeap(heap); + } + + destroyHeap(heap); + return getNodeLength(node); +} + +static Coordinate getTotalCoverage(Node * node) +{ + Category cat; + Coordinate coverage = 0; + + for (cat = 0; cat < CATEGORIES; cat++) + coverage += getVirtualCoverage(node, cat); + + return coverage; +} + +int compareNodeCovs(const void * A, const void * B) { + Node * nodeA = *((Node **) A); + Node * nodeB = *((Node **) B); + double covA; + double covB; + + if (getNodeLength(nodeA) == 0) + nodeA = NULL; 
+ + if (getNodeLength(nodeB) == 0) + nodeB = NULL; + + // Null nodes considered to have infinite coverage + if (nodeA == NULL && nodeB == NULL) + return 0; + if (nodeA == NULL) + return 1; + if (nodeB == NULL) + return -1; + + // Deal with real coverage numbers: + covA = getTotalCoverage(nodeA) / (double) getNodeLength(nodeA); + covB = getTotalCoverage(nodeB) / (double) getNodeLength(nodeB); + + if (covA > covB) + return 1; + if (covA == covB) + return 0; + return -1; +} + +double estimated_cov(Graph * graph, char * directory) +{ + Node ** nodeArray = callocOrExit(nodeCount(graph), Node*); + IDnum index; + Coordinate halfTotalLength = 0; + Coordinate sumLength = 0; + Node *node; + char *logFilename = + mallocOrExit(strlen(directory) + 100, char); + char *statsLine = + mallocOrExit(5000, char); + FILE *logFile; + + strcpy(logFilename, directory); + strcat(logFilename, "/Log"); + logFile = fopen(logFilename, "a"); + + if (logFile == NULL) + exitErrorf(EXIT_FAILURE, true, "Could not write to %s", + logFilename); + + puts("Measuring median coverage depth..."); + + if (nodeCount(graph) == 0) { + puts("EMPTY GRAPH"); + // Release everything acquired above before the early return + free(nodeArray); + fclose(logFile); + free(logFilename); + free(statsLine); + return 0; + } + + // Write nodes into array and compute total assembly length + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + nodeArray[index - 1] = node; + if (node == NULL) + continue; + halfTotalLength += getNodeLength(node); + } + halfTotalLength /= 2; + + // Sort nodes + qsort(nodeArray, nodeCount(graph), sizeof(Node *), compareNodeCovs); + + // Compute the length weighted median node coverage + for (index = 0; index < nodeCount(graph); index++) { + node = nodeArray[index]; + sumLength += getNodeLength(node); + if (sumLength >= halfTotalLength) { + printf("Median coverage depth = %f\n", getTotalCoverage(node) / (double) getNodeLength(node)); + fprintf(logFile, "Median coverage depth = %f\n", getTotalCoverage(node) / (double) getNodeLength(node)); + free(nodeArray); + fclose(logFile); + 
free(logFilename); + free(statsLine); + return getTotalCoverage(node) / (double) getNodeLength(node); + } + } + + // In case something went wrong... + free(nodeArray); + fclose(logFile); + free(logFilename); + free(statsLine); + + return -1; +} + +static void destroyMixedNode(Node * node) +{ + PassageMarker *marker = getMarker(node); + PassageMarker *current; + + while (marker != NULL) { + while (!isInitial(marker)) + marker = getPreviousInSequence(marker); + + while (marker != NULL) { + current = marker; + marker = getNextInSequence(marker); + destroyPassageMarker(current); + } + + marker = getMarker(node); + } +} + +void destroyMixedReads(Graph * graph, IDnum minCoverage) +{ + IDnum index; + Node *node; + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + + if (markerCount(node) < minCoverage) + destroyMixedNode(node); + } + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + + if (getMarker(node) == NULL) + destroyNode(node, graph); + } + + concatenateGraph(graph); +} + +static boolean isConnectedRead(PassageMarker * marker) +{ + PassageMarker *current; + + for (current = marker; getNodeStatus(getNode(current)); + current = getNextInSequence(current)) + if (isTerminal(current)) + return false; + + for (current = getTwinMarker(marker); + getNodeStatus(getNode(current)); + current = getNextInSequence(current)) + if (isTerminal(current)) + return false; + + return true; +} + +static void destroyWholeRead(PassageMarker * marker) +{ + PassageMarker *current = marker; + PassageMarker *next; + + while (!isInitial(current)) + current = getPreviousInSequence(current); + + for (; current != NULL; current = next) { + next = getNextInSequence(current); + destroyPassageMarker(current); + } +} + +static void cleanUpNode(Node * node, Graph * graph) +{ + Category cat; + Node *twin = getTwinNode(node); + PassageMarker *marker, 
*twinMarker; + + for (cat = 0; cat < CATEGORIES; cat++) + setVirtualCoverage(node, cat, 0); + + while (getArc(node) != NULL) + destroyArc(getArc(node), graph); + while (getArc(twin) != NULL) + destroyArc(getArc(twin), graph); + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) { + twinMarker = getTwinMarker(marker); + + if (getNode(getNextInSequence(marker)) != twin + || getPassageMarkerSequenceID(marker) > 0) + createArc(node, + getNode(getNextInSequence(marker)), + graph); + if (getNode(getNextInSequence(twinMarker)) != node + || getPassageMarkerSequenceID(twinMarker) > 0) + createArc(twin, + getNode(getNextInSequence(twinMarker)), + graph); + } +} + +void destroySinglePoolNodes(Graph * graph) +{ + IDnum index; + Node *node; + PassageMarker *marker, *next; + + puts("Destroying single pool nodes"); + resetNodeStatus(graph); + + // Remove empty, single pool nodes, mark other single pool nodes + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + + if (getVirtualCoverage(node, 0) != 0 + && getVirtualCoverage(node, 1) != 0) + continue; + + if (getMarker(node) == NULL) + destroyNode(node, graph); + else + setNodeStatus(node, true); + } + + // Remove disconnected reads + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL || !getNodeStatus(node)) + continue; + + for (marker = getMarker(node); marker != NULL; + marker = next) { + if (isConnectedRead(marker)) + next = getNextInNode(marker); + else { + destroyWholeRead(marker); + next = getMarker(node); + } + } + } + + // Remove empty, single pool nodes, review coverage of the others + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL || !getNodeStatus(node)) + continue; + + if (getMarker(node) == NULL) + destroyNode(node, graph); + else + cleanUpNode(node, graph); + } + + puts("Done"); + + 
concatenateGraph(graph); +} + +void destroyShortTips(Graph * graph) +{ + IDnum index; + Node *node; + boolean modified = true; + + puts("Removing short tips"); + + while (modified) { + modified = false; + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + + if (getArc(node) == NULL + || getArc(getTwinNode(node)) == NULL) { + if (getNodeLength(node) < 500) { + modified = true; + destroyNode(node, graph); + } + } + } + } + + puts("Done"); + + concatenateGraph(graph); +} + +static Coordinate connectDomain(Node * node) +{ + Coordinate result = getNodeLength(node); + Arc *arc; + + if (getNodeStatus(node)) + return 0; + setNodeStatus(node, true); + + for (arc = getArc(node); arc != NULL; arc = getNextArc(arc)) + result += connectDomain(getDestination(arc)); + for (arc = getArc(getTwinNode(node)); arc != NULL; + arc = getNextArc(arc)) + result += connectDomain(getDestination(arc)); + + return result; + +} + +static void destroyConnectedDomain(Node * node, Graph * graph) +{ + Arc *arc; + + if (getNodeStatus(node)) + return; + setNodeStatus(node, true); + + for (arc = getArc(node); arc != NULL; arc = getNextArc(arc)) + destroyConnectedDomain(getDestination(arc), graph); + for (arc = getArc(getTwinNode(node)); arc != NULL; + arc = getNextArc(arc)) + destroyConnectedDomain(getDestination(arc), graph); + + destroyNode(node, graph); + +} + +void destroyDisconnectedElements(Graph * graph) +{ + Node *node; + IDnum index; + Coordinate domainSize; + FibHeap *heap = newFibHeap(); + Coordinate *domainSizes = + callocOrExit(1 + nodeCount(graph), Coordinate); + + resetNodeStatus(graph); + + puts("Destroying disconnected domains"); + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL || getNodeStatus(node)) + continue; + domainSize = connectDomain(node); + printf("CONNECT\t%lld\n", (long long) domainSize); + insertNodeIntoHeap(heap, domainSize, 
node); + domainSizes[index] = domainSize; + } + + resetNodeStatus(graph); + + while (true) { + node = removeNextNodeFromHeap(heap); + if (node == NULL || domainSizes[getNodeID(node)] > 1200) + break; + + destroyConnectedDomain(node, graph); + } + + + destroyHeap(heap); + free(domainSizes); + puts("Done"); + + concatenateGraph(graph); +} + +// Count the nodes (not the total length) of the connected domain containing node +static Coordinate connectDomainNodeCount(Node * node) +{ + Coordinate result = 1; + Arc *arc; + + if (getNodeStatus(node)) + return 0; + setNodeStatus(node, true); + + for (arc = getArc(node); arc != NULL; arc = getNextArc(arc)) + result += connectDomainNodeCount(getDestination(arc)); + for (arc = getArc(getTwinNode(node)); arc != NULL; + arc = getNextArc(arc)) + result += connectDomainNodeCount(getDestination(arc)); + + return result; + +} + +void measureTangleSizes(Graph * graph, Coordinate maxLength) +{ + Node *node; + IDnum index; + Coordinate domainSize; + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + if (getNodeLength(node) >= maxLength) + destroyNode(node, graph); + } + + renumberNodes(graph); + resetNodeStatus(graph); + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL || getNodeStatus(node)) + continue; + domainSize = connectDomainNodeCount(node); + printf("CONNECT\t%lld\n", (long long) domainSize); + } + + puts("Done"); +} + +void destroyEmptyNodes(Graph * graph) +{ + IDnum index; + Node *node; + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + + if (getMarker(node) == NULL) + destroyNode(node, graph); + } + + concatenateGraph(graph); +} + +static Coordinate getReadLength(PassageMarker * marker) +{ + PassageMarker *current; + Coordinate totalLength = 0; + + for (current = marker; current != NULL; + current = getNextInSequence(current)) + totalLength += getPassageMarkerLength(current); + + return totalLength; +} + 
+static void destroyRead(PassageMarker * marker) +{ + PassageMarker *current, *next; + + for (current = marker; current != NULL; current = next) { + next = getNextInSequence(current); + destroyPassageMarker(current); + } +} + +void removeShortReads(Graph * graph) +{ + IDnum index; + Node *node; + PassageMarker *marker, *next; + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + + + for (marker = getMarker(node); marker != NULL; + marker = next) { + if (getPassageMarkerSequenceID(marker) > 0 + && isInitial(marker) + && getReadLength(marker) < 400) { + destroyRead(marker); + next = getMarker(node); + } else if (getPassageMarkerSequenceID(marker) < 0 + && isTerminal(marker) + && getReadLength(getTwinMarker(marker)) + < 400) { + destroyRead(getTwinMarker(marker)); + next = getMarker(node); + } else + next = getNextInNode(marker); + + } + } +} + +Coordinate totalGraphLength(Graph * graph) +{ + IDnum index; + Coordinate totalLength = 0; + + for (index = 1; index <= nodeCount(graph); index++) + totalLength += getNodeLength(getNodeInGraph(graph, index)); + + return totalLength; +} + +void destroySinglePoolNodesStrict(Graph * graph) +{ + IDnum index; + Node *node; + + puts("Destroying single pool nodes"); + resetNodeStatus(graph); + + // Remove empty, single pool nodes, mark other single pool nodes + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + + if (getVirtualCoverage(node, 0) != 0 + && getVirtualCoverage(node, 1) != 0) + continue; + + destroyNode(node, graph); + } + + puts("Done"); + + concatenateGraph(graph); +} + +void contigStats(Node ** contigs, IDnum readCount) +{ + FibHeap *heap = newFibHeap(); + IDnum index; + Coordinate totalLength = 0; + Coordinate sumLength = 0; + Node *node; + Coordinate halfLength; + + for (index = 0; index <= readCount; index++) { + if (contigs[index] != NULL) { + node = 
contigs[index]; + printf("CONTIG %lld\n", (long long) getNodeLength(node)); + insertNodeIntoHeap(heap, getNodeLength(node), + node); + totalLength += getNodeLength(node); + } + } + halfLength = totalLength / 2; + + node = removeNextNodeFromHeap(heap); + while (node != NULL) { + sumLength += getNodeLength(node); + if (sumLength >= halfLength) + break; + node = removeNextNodeFromHeap(heap); + } + + destroyHeap(heap); + printf("N50 %lld Total %lld\n", (long long) getNodeLength(node), (long long) totalLength); +} + +void exportContigs(Node ** contigs, ReadSet * reads, char *filename, + int WORDLENGTH, int pairedReadsCount) +{ + TightString **sequences = + mallocOrExit(reads->readCount, TightString *); + IDnum i; + + for (i = 0; i < pairedReadsCount; i++) { + if (contigs[i] == NULL) + sequences[i] = NULL; + else + sequences[i] = expandNode(contigs[i], WORDLENGTH); + } + + exportSequenceArray(filename, sequences, reads->readCount); +} + +boolean *removeLowCoverageNodesAndDenounceDubiousReads(Graph * graph, + double minCov) +{ + IDnum index; + Node *node; + boolean denounceReads = readStartsAreActivated(graph); + boolean *res = NULL; + ShortReadMarker *nodeArray, *shortMarker; + PassageMarker *marker; + IDnum maxIndex; + IDnum readID; + IDnum index2; + // Original + double nodeDensity = 0.0; + int countNodeTotal = 0, countNodeUnderMinCov = 0, countNodeOnLongRead = 0; + int countNodeUMCbutOnLongRead = 0, countNodeSupportedByLongRead = 0; + int countNodeEscapedByLongRead = 0; + boolean escapeByLongRead = false; + // Original + + printf("Removing contigs with coverage < %f...\n", minCov); + + if (denounceReads) + res = callocOrExit(sequenceCount(graph), boolean); + + // Original + countNodeTotal = nodeCount(graph); + // Original + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + + if (getNodeLength(node) == 0) + continue; + + // Original + if ((marker = getMarker(node))){ + countNodeOnLongRead++; + } + // Original + + if 
((double)getTotalCoverage(node) / getNodeLength(node) < minCov) { + // Original + countNodeUnderMinCov++; + if ((marker = getMarker(node))) { + nodeDensity = (double)getTotalCoverage(node) / getNodeLength(node); + // printf("[Under]\tnodeID : %d\tnodeLen : %d\tnodeCov : %f\n", + // getNodeID(node), getNodeLength(node), nodeDensity); + countNodeUMCbutOnLongRead++; + + if (markerCount(node) >= 1) { + countNodeSupportedByLongRead++; + } + if (markerCount(node) >= 1 && !(nodeDensity == 0.0)) { + escapeByLongRead = true; + countNodeEscapedByLongRead++; + } else { + escapeByLongRead = false; + } + } + // Original + + + // Original + //if (! escapeByLongRead) { + // Original + + if (denounceReads) { + nodeArray = getNodeReads(node, graph); + maxIndex = getNodeReadCount(node, graph); + for (index2 = 0; index2 < maxIndex; index2++) { + shortMarker = + getShortReadMarkerAtIndex(nodeArray, + index2); + readID = getShortReadMarkerID(shortMarker); + //printf("Dubious %d\n", readID); + if (readID > 0) + res[readID - 1] = true; + else + res[-readID - 1] = true; + } + + nodeArray = getNodeReads(getTwinNode(node), graph); + maxIndex = + getNodeReadCount(getTwinNode(node), graph); + for (index2 = 0; index2 < maxIndex; index2++) { + shortMarker = + getShortReadMarkerAtIndex(nodeArray, + index2); + readID = getShortReadMarkerID(shortMarker); + //printf("Dubious %d\n", readID); + if (readID > 0) + res[readID - 1] = true; + else + res[-readID - 1] = true; + } + } + + // Original + //} + // Original + + while ((marker = getMarker(node))) { + if (!isInitial(marker) + && !isTerminal(marker)) + disconnectNextPassageMarker + (getPreviousInSequence(marker), + graph); + destroyPassageMarker(marker); + } + destroyNode(node, graph); + + } + // Original + // else { + // nodeDensity = (double)getTotalCoverage(node) / getNodeLength(node); + // printf("[Over]\tnodeID : %d\tnodeLen : %d\tnodeCov : %f\n", + // getNodeID(node), getNodeLength(node), nodeDensity); + //} + // Original + } + + /* + // 
Original + printf("No. of Nodes : %d\n", countNodeTotal); + printf("No. of Nodes Under MinCov : %d\n", countNodeUnderMinCov); + printf("No. of Nodes On Long Reads : %d\n", countNodeOnLongRead); + printf("No. of Nodes Under MinCov but On Long Reads : %d\n", countNodeUMCbutOnLongRead); + printf("No. of Nodes Under MinCov but Supported by Long Reads : %d\n", + countNodeSupportedByLongRead); + printf("No. of Nodes 0 < Cov < minCov but Escaped (Supported) by Long Reads : %d\n", + countNodeEscapedByLongRead); + // Original + */ + + concatenateGraph(graph); + return res; +} + +void removeLowCoverageNodes(Graph * graph, double minCov) +{ + IDnum index; + Node *node; + PassageMarker *marker; + + if (minCov < 0) + return; + + printf("Applying a coverage cutoff of %f...\n", minCov); + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + + if (getNodeLength(node) > 0 + && getTotalCoverage(node) / getNodeLength(node) < + minCov) { + while ((marker = getMarker(node))) { + if (!isInitial(marker) + && !isTerminal(marker)) + disconnectNextPassageMarker + (getPreviousInSequence(marker), + graph); + destroyPassageMarker(marker); + } + destroyNode(node, graph); + } + } + + concatenateGraph(graph); +} + +void removeHighCoverageNodes(Graph * graph, double maxCov) +{ + IDnum index; + Node *node; + PassageMarker *marker; + + if (maxCov < 0) + return; + + printf("Applying an upper coverage cutoff of %f...\n", maxCov); + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + + if (getNodeLength(node) > 0 + && getTotalCoverage(node) / getNodeLength(node) > + maxCov) { + while ((marker = getMarker(node))) { + if (!isInitial(marker) + && !isTerminal(marker)) + disconnectNextPassageMarker + (getPreviousInSequence(marker), + graph); + destroyPassageMarker(marker); + } + destroyNode(node, graph); + } + } + + concatenateGraph(graph); +} + +void removeMissingStrain(Graph * graph, Category cat) +{ + IDnum ID; + Node 
*node; + + for (ID = 1; ID <= nodeCount(graph); ID++) { + node = getNodeInGraph(graph, ID); + + if (node == NULL) + continue; + + if (getVirtualCoverage(node, cat) == 0) + destroyNode(node, graph); + } + + concatenateGraph(graph); +} + +static void exportAMOSLib(FILE * outfile, Graph * graph, Category cat) +{ + Coordinate distance = getInsertLength(graph, cat * 2); + double variance = getInsertLength_var(graph, cat * 2); + + if (distance == -1) + return; + + fprintf(outfile, "{LIB\n"); + fprintf(outfile, "iid:%d\n", (int) (cat + 1)); + fprintf(outfile, "{DST\n"); + fprintf(outfile, "mea:%lld\n", (long long) distance); + fprintf(outfile, "std:%lld\n", (long long) sqrt(variance)); + fprintf(outfile, "}\n"); + fprintf(outfile, "}\n"); +} + +static void exportAMOSMarker(FILE * outfile, PassageMarker * marker, + Coordinate nodeLength, Coordinate start, + Coordinate finish, int wordShift) +{ + Coordinate sequenceStart, sequenceFinish; + + if (getStartOffset(marker) >= finish + || getFinishOffset(marker) > nodeLength - start) + return; + + sequenceStart = getPassageMarkerStart(marker); + if (start > getStartOffset(marker)) { + if (getPassageMarkerSequenceID(marker) > 0) + sequenceStart += start - getStartOffset(marker); + else + sequenceStart -= start - getStartOffset(marker); + } + + sequenceFinish = getPassageMarkerFinish(marker); + if (nodeLength - finish > getFinishOffset(marker)) { + if (getPassageMarkerSequenceID(marker) > 0) + sequenceFinish -= + nodeLength - finish - getFinishOffset(marker); + else + sequenceFinish += + nodeLength - finish - getFinishOffset(marker); + } + + if (getPassageMarkerSequenceID(marker) > 0) + sequenceFinish += wordShift; + else + sequenceStart += wordShift; + + fprintf(outfile, "{TLE\n"); + fprintf(outfile, "src:%d\n", getAbsolutePassMarkerSeqID(marker)); + if (getStartOffset(marker) > start) + fprintf(outfile, "off:%lld\n", + (long long) (getStartOffset(marker) - start)); + else + fprintf(outfile, "off:0\n"); + fprintf(outfile, 
"clr:%lld,%lld\n", (long long) sequenceStart, (long long) sequenceFinish); + fprintf(outfile, "}\n"); +} + +static void exportAMOSShortMarker(FILE * outfile, ShortReadMarker * marker, + ReadSet * reads, Coordinate start, + Coordinate finish) +{ + Coordinate offset = + getShortReadMarkerPosition(marker) - + getShortReadMarkerOffset(marker); + TightString *sequence = + reads->tSequences[getShortReadMarkerID(marker) - 1]; + + if (getShortReadMarkerPosition(marker) == -1) + return; + + if (offset >= finish || offset + getLength(sequence) < start) + return; + + fprintf(outfile, "{TLE\n"); + fprintf(outfile, "src:%d\n", getShortReadMarkerID(marker)); + fprintf(outfile, "off:%lld\n", (long long) (offset - start)); + fprintf(outfile, "clr:0,%lld\n", (long long) getLength(sequence)); + fprintf(outfile, "}\n"); +} + +static void exportAMOSReverseShortMarker(FILE * outfile, + ShortReadMarker * marker, + Coordinate nodeLength, + int wordShift, ReadSet * reads, + Coordinate start, + Coordinate finish) +{ + TightString *sequence = + reads->tSequences[getShortReadMarkerID(marker) - 1]; + + Coordinate offset = + nodeLength - getShortReadMarkerPosition(marker) + + getShortReadMarkerOffset(marker) - getLength(sequence) + + wordShift; + + if (getShortReadMarkerPosition(marker) == -1) + return; + + if (offset >= finish || offset + getLength(sequence) < start) + return; + + fprintf(outfile, "{TLE\n"); + fprintf(outfile, "src:%d\n", getShortReadMarkerID(marker)); + fprintf(outfile, "off:%lld\n", (long long) (offset - start)); + fprintf(outfile, "clr:%lld,0\n", (long long) getLength(sequence)); + fprintf(outfile, "}\n"); +} + +static void exportAMOSContig(FILE * outfile, ReadSet * reads, Node * node, + Graph * graph, Coordinate contigStart, + Coordinate contigFinish, IDnum iid, + IDnum internalIndex) +{ + Coordinate start; + char str[100]; + PassageMarker *marker; + ShortReadMarker *shortMarkerArray, *shortMarker; + Coordinate index, maxIndex; + int wordShift = getWordLength(graph) - 1; 
+ char *string = expandNodeFragment(node, contigStart, contigFinish, + getWordLength(graph)); + Coordinate length = contigFinish - contigStart + wordShift; + + fprintf(outfile, "{CTG\n"); + fprintf(outfile, "iid:%d\n", iid); + fprintf(outfile, "eid:%d-%d\n", getNodeID(node), internalIndex); + + fprintf(outfile, "seq:\n"); + for (start = 0; start <= length; start += 60) { + strncpy(str, &(string[start]), 60); + str[60] = '\0'; + fprintf(outfile, "%s\n", str); + } + fprintf(outfile, ".\n"); + + fprintf(outfile, "qlt:\n"); + for (start = 0; start <= length; start += 60) { + strncpy(str, &(string[start]), 60); + str[60] = '\0'; + fprintf(outfile, "%s\n", str); + } + fprintf(outfile, ".\n"); + + free(string); + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) + exportAMOSMarker(outfile, marker, getNodeLength(node), + contigStart, contigFinish, wordShift); + + if (readStartsAreActivated(graph)) { + shortMarkerArray = getNodeReads(node, graph); + maxIndex = getNodeReadCount(node, graph); + for (index = 0; index < maxIndex; index++) { + shortMarker = + getShortReadMarkerAtIndex(shortMarkerArray, + index); + exportAMOSShortMarker(outfile, shortMarker, reads, + contigStart, contigFinish); + } + + shortMarkerArray = getNodeReads(getTwinNode(node), graph); + maxIndex = getNodeReadCount(getTwinNode(node), graph); + for (index = 0; index < maxIndex; index++) { + shortMarker = + getShortReadMarkerAtIndex(shortMarkerArray, + index); + exportAMOSReverseShortMarker(outfile, shortMarker, + getNodeLength(node), + wordShift, reads, + contigStart, + contigFinish); + } + } + + fprintf(outfile, "}\n"); +} + +static void exportAMOSNode(FILE * outfile, ReadSet * reads, Node * node, + Graph * graph) +{ + Coordinate start = 0; + Coordinate finish; + GapMarker *gap; + IDnum smallIndex = 0; + static IDnum iid = 1; + IDnum contigIndex = iid; + int wordShift = getWordLength(graph) - 1; + + for (gap = getGap(node, graph); gap; gap = getNextGap(gap)) { + finish = 
getGapStart(gap); + exportAMOSContig(outfile, reads, node, graph, start, + finish, iid++, smallIndex++); + start = getGapFinish(gap); + } + + finish = getNodeLength(node); + exportAMOSContig(outfile, reads, node, graph, start, finish, iid++, + smallIndex); + + if (!getGap(node, graph)) + return; + + start = 0; + + fprintf(outfile, "{SCF\n"); + fprintf(outfile, "eid:%d\n", getNodeID(node)); + for (gap = getGap(node, graph); gap; gap = getNextGap(gap)) { + finish = getGapStart(gap); + fprintf(outfile, "{TLE\n"); + fprintf(outfile, "off:%lld\n", (long long) start); + fprintf(outfile, "clr:0,%lld\n", + (long long) (finish - start + (long long) wordShift)); + fprintf(outfile, "src:%d\n", contigIndex++); + fprintf(outfile, "}\n"); + start = getGapFinish(gap); + } + finish = getNodeLength(node); + fprintf(outfile, "{TLE\n"); + fprintf(outfile, "off:%lld\n", (long long) start); + fprintf(outfile, "clr:0,%lld\n", (long long) (finish - start)); + fprintf(outfile, "src:%d\n", contigIndex++); + fprintf(outfile, "}\n"); + + fprintf(outfile, "}\n"); +} + +static void exportAMOSRead(FILE * outfile, TightString * tString, + IDnum index, IDnum frg_index) +{ + Coordinate start, finish; + char str[100]; + + fprintf(outfile, "{RED\n"); + fprintf(outfile, "iid:%d\n", index); + fprintf(outfile, "eid:%d\n", index); + if (frg_index > 0) + fprintf(outfile, "frg:%d\n", frg_index); + + fprintf(outfile, "seq:\n"); + start = 0; + while (start <= getLength(tString)) { + finish = start + 60; + readTightStringFragment(tString, start, finish, str); + fprintf(outfile, "%s\n", str); + start = finish; + } + fprintf(outfile, ".\n"); + + fprintf(outfile, "qlt:\n"); + start = 0; + while (start <= getLength(tString)) { + finish = start + 60; + readTightStringFragment(tString, start, finish, str); + fprintf(outfile, "%s\n", str); + start = finish; + } + fprintf(outfile, ".\n"); + + fprintf(outfile, "}\n"); +} + +void exportAMOSContigs(char *filename, Graph * graph, + Coordinate cutoff_length, ReadSet * 
reads) +{ + IDnum index; + Category cat; + Node *node; + FILE *outfile; + + printf("Writing into AMOS file %s...\n", filename); + outfile = fopen(filename, "w"); + + if (outfile == NULL) + exitErrorf(EXIT_FAILURE, true, "Could not write to AMOS file %s", + filename); + + for (cat = 0; cat <= CATEGORIES; cat++) + exportAMOSLib(outfile, graph, cat); + + for (index = 1; index <= reads->readCount; index++) { + if (reads->categories[index - 1] % 2 != 0 && + getInsertLength(graph, + reads->categories[index - 1]) >= 0) { + fprintf(outfile, "{FRG\n"); + fprintf(outfile, "lib:%d\n", + (int) ((reads->categories[index - 1] / 2) + 1)); + fprintf(outfile, "rds:%d,%d\n", index, + index + 1); + fprintf(outfile, "eid:%d\n", index); + fprintf(outfile, "iid:%d\n", index); + fprintf(outfile, "typ:I\n"); + fprintf(outfile, "}\n"); + index++; + } + } + + for (index = 1; index <= reads->readCount; index++) { + if (reads->categories[index - 1] % 2 != 0 && + getInsertLength(graph, + reads->categories[index - 1]) >= 0) { + exportAMOSRead(outfile, + reads->tSequences[index - 1], index, + index); + index++; + exportAMOSRead(outfile, + reads->tSequences[index - 1], index, + index - 1); + } else { + exportAMOSRead(outfile, + reads->tSequences[index - 1], index, + -1); + } + } + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + + if (node == NULL) + continue; + + if (getNodeLength(node) >= cutoff_length) + exportAMOSNode(outfile, reads, node, graph); + } + + fclose(outfile); + +} + +boolean isNatural(Graph * graph) +{ + Node *node; + IDnum index; + + for (index = 1; index < nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + + if (getNodeLength(node) == 0) + return false; + + if (simpleArcCount(node) > 4) + return false; + + if (simpleArcCount(getTwinNode(node)) > 4) + return false; + } + + return true; +} + +static Node *followShortPath(Arc * arc) +{ + Node *node = getDestination(arc); + if 
(simpleArcCount(node) != 1) + return NULL; + node = getDestination(getArc(node)); + return getTwinNode(node); + +} + +static void checkNodeForHallidayJunction(Node * node, Graph * graph) +{ + Node *nodeA, *nodeB, *nodeC, *nodeD; + Arc *arc1, *arc2; + + if (simpleArcCount(node) != 2) + return; + + arc1 = getArc(node); + arc2 = getNextArc(arc1); + + nodeA = followShortPath(arc1); + if (nodeA == NULL || simpleArcCount(nodeA) != 2 + || !isUniqueBasic(nodeA)) + return; + + nodeB = followShortPath(arc2); + if (nodeB == NULL || simpleArcCount(nodeB) != 2 + || !isUniqueBasic(nodeB)) + return; + + if (nodeA == nodeB) { + return; + } + + arc1 = getArc(nodeA); + arc2 = getNextArc(arc1); + nodeC = followShortPath(arc1); + if (nodeC == NULL) + return; + if (nodeC == node) { + nodeC = followShortPath(arc2); + if (nodeC == NULL || nodeC == node + || simpleArcCount(nodeC) != 2 + || !isUniqueBasic(nodeC)) { + printf("NO %d %d %d %d\n", getNodeID(node), + getNodeID(nodeA), getNodeID(nodeB), + getNodeID(nodeC)); + return; + } + } else { + if (simpleArcCount(nodeC) != 2 || !isUniqueBasic(nodeC)) { + puts("2"); + return; + } + nodeD = followShortPath(arc2); + if (nodeD != node) { + puts("3"); + return; + } + } + + puts("A"); + + arc1 = getArc(nodeB); + arc2 = getNextArc(arc1); + nodeD = followShortPath(arc1); + if (nodeD != nodeC && nodeD != node) + return; + nodeD = followShortPath(arc2); + if (nodeD != nodeC && nodeD != node) + return; + + arc1 = getArc(nodeB); + arc2 = getNextArc(arc1); + nodeD = followShortPath(arc1); + if (nodeD != nodeC && nodeD != node) + return; + nodeD = followShortPath(arc2); + if (nodeD != nodeC && nodeD != node) + return; + + printf("JOHNNY HALLIDAY JUNCTION %d %d %d %d\n", + getNodeID(node), getNodeID(nodeC), getNodeID(nodeA), + getNodeID(nodeB)); +} + +void searchForHallidayJunction(Graph * graph) +{ + IDnum index; + Node *node; + + setBaseCoverage(8); + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + + if 
(isUniqueBasic(node)) { + checkNodeForHallidayJunction(node, graph); + checkNodeForHallidayJunction(getTwinNode(node), + graph); + } + } +} + +Coordinate totalAssemblyLength(Graph * graph) +{ + IDnum index; + Node *node; + Coordinate total = 0; + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node) + total += getNodeLength(node); + } + + return total; +} + +IDnum usedReads(Graph * graph, Coordinate minContigLength) +{ + IDnum res = 0; + boolean * used = callocOrExit(sequenceCount(graph) + 1, boolean); + IDnum nodeID, readID; + Node * node; + PassageMarker * marker; + ShortReadMarker * shortReadArray, * shortReadMarker; + IDnum shortReadCount, shortReadIndex; + + for(nodeID = 1; nodeID <= nodeCount(graph); nodeID++) { + node = getNodeInGraph(graph, nodeID); + if (node == NULL || getNodeLength(node) < minContigLength) + continue; + + // Long reads + for(marker = getMarker(node); marker != NULL; marker = getNextInNode(marker)) { + readID = getPassageMarkerSequenceID(marker); + if (readID < 0) + readID = -readID; + used[readID] = true; + } + + // Short reads + if (!readStartsAreActivated(graph)) + continue; + + shortReadArray = getNodeReads(node, graph); + shortReadCount = getNodeReadCount(node, graph); + for (shortReadIndex = 0; shortReadIndex < shortReadCount; shortReadIndex++) { + shortReadMarker = getShortReadMarkerAtIndex(shortReadArray, shortReadIndex); + readID = getShortReadMarkerID(shortReadMarker); + used[readID] = true; + } + + shortReadArray = getNodeReads(getTwinNode(node), graph); + shortReadCount = getNodeReadCount(getTwinNode(node), graph); + for (shortReadIndex = 0; shortReadIndex < shortReadCount; shortReadIndex++) { + shortReadMarker = getShortReadMarkerAtIndex(shortReadArray, shortReadIndex); + readID = getShortReadMarkerID(shortReadMarker); + used[readID] = true; + } + } + + for (readID = 1; readID <= sequenceCount(graph); readID++) + if (used[readID]) + res++; + + free(used); + + return res; +} 
+ +void logFinalStats(Graph * graph, Coordinate minContigKmerLength, char *directory) +{ + char *logFilename = + mallocOrExit(strlen(directory) + 100, char); + char *statsLine = + mallocOrExit(5000, char); + FILE *logFile; + + strcpy(logFilename, directory); + strcat(logFilename, "/Log"); + logFile = fopen(logFilename, "a"); + + if (logFile == NULL) + exitErrorf(EXIT_FAILURE, true, "Could not write to %s", + logFilename); + + sprintf + (statsLine, "Final graph has %ld nodes and n50 of %lld, max %lld, total %lld, using %ld/%ld reads\n", + (long) nodeCount(graph), (long long) n50(graph), (long long) maxLength(graph), + (long long) totalAssemblyLength(graph), (long) usedReads(graph, minContigKmerLength), + (long) sequenceCount(graph)); + + fprintf(logFile, "%s", statsLine); + fprintf(stdout, "%s", statsLine); + + fclose(logFile); + free(logFilename); + free(statsLine); +} + +void exportUnusedReads(Graph* graph, ReadSet * reads, Coordinate minContigKmerLength, char* directory) { + char *outFilename = + mallocOrExit(strlen(directory) + 100, char); + FILE * outfile; + boolean * used = callocOrExit(sequenceCount(graph) + 1, boolean); + IDnum nodeID, readID; + Node * node; + PassageMarker * marker; + ShortReadMarker * shortReadArray, * shortReadMarker; + IDnum shortReadCount, shortReadIndex; + // Original + IDnum numUnusedReads = 0; + // Original + + strcpy(outFilename, directory); + strcat(outFilename, "/UnusedReads.fa"); + outfile = fopen(outFilename, "w"); + + printf("Printing unused reads into %s\n", outFilename); + + for(nodeID = 1; nodeID <= nodeCount(graph); nodeID++) { + node = getNodeInGraph(graph, nodeID); + if (node == NULL || getNodeLength(node) < minContigKmerLength) + continue; + + // Long reads + for(marker = getMarker(node); marker != NULL; marker = getNextInNode(marker)) { + readID = getPassageMarkerSequenceID(marker); + if (readID < 0) + readID = -readID; + used[readID] = true; + } + + // Short reads + if (!readStartsAreActivated(graph)) + continue; + + 
shortReadArray = getNodeReads(node, graph); + shortReadCount = getNodeReadCount(node, graph); + for (shortReadIndex = 0; shortReadIndex < shortReadCount; shortReadIndex++) { + shortReadMarker = getShortReadMarkerAtIndex(shortReadArray, shortReadIndex); + readID = getShortReadMarkerID(shortReadMarker); + used[readID] = true; + } + + shortReadArray = getNodeReads(getTwinNode(node), graph); + shortReadCount = getNodeReadCount(getTwinNode(node), graph); + for (shortReadIndex = 0; shortReadIndex < shortReadCount; shortReadIndex++) { + shortReadMarker = getShortReadMarkerAtIndex(shortReadArray, shortReadIndex); + readID = getShortReadMarkerID(shortReadMarker); + used[readID] = true; + } + } + + for (readID = 1; readID <= sequenceCount(graph); readID++) + if (!used[readID]) { + exportTightString(outfile, reads->tSequences[readID - 1], readID); + numUnusedReads++; + } + + // Original + //printf("%d\n", numUnusedReads); + // Original + + free(outFilename); + free(used); + fclose(outfile); +} + +// Original +double getNodeDensity(Node * node) +{ + Coordinate nodeLength, nodeCoverage; + + nodeLength = getNodeLength(node); + nodeCoverage = (getVirtualCoverage(node, 0) + + getVirtualCoverage(node, 1)); + + return nodeCoverage /(double) nodeLength; +} + +int * makeDummySubgraphMask(Graph * graph) +{ + int lenSubgraphMask = 2 * nodeCount(graph) + 1; + int *subgraphMask = callocOrExit(lenSubgraphMask, int); + int index; + + for (index = 0; index < lenSubgraphMask; index++) + subgraphMask[index] = 1; + + return subgraphMask; +} + +int estimated_cov_multi(Graph * graph, int * subgraphMask, double expCovMulti[100]) +{ + double histo[LEN_HISTO_X]; + Node *node; + int index, ecmIndex = 0; + double binWidth = 0.2; + int bin = 0; + double peakCov = 0.0, peakHeight = 0.0; + double lastPeakHeight = 0.0; + double SNratio = 10, thresNoiseHeight = 0.0; + int noiseCount = 0, thresNoiseCount = 5; + double thresMinPeak = 2.0; + + puts("Starting peak detection..."); + + // Initialize 
expCovMulti[] and histo[] + for (index = 0; index < 100; index++) + expCovMulti[index] = -1; + for (index = 0; index < LEN_HISTO_X; index++) + histo[index] = 0.0; + + // Make histogram + for (index = 1; index <= nodeCount(graph); index++) { + if (subgraphMask[index + nodeCount(graph)] == 1) { + node = getNodeInGraph(graph, index); + node = getNodeInGraph(graph, index); + if (node == NULL || getNodeLength(node) <= 0) + continue; + bin = (int) floor(getNodeDensity(node) / binWidth); + if (bin >= LEN_HISTO_X - 1) + bin = LEN_HISTO_X - 1; + histo[bin] += getNodeLength(node); + } + } + + // Define length threshold of noise + // Skip index = 0 to avoid the influence of long reads + for (index = LEN_HISTO_X - 2; index >= 1; index--) { + if (histo[index] > peakHeight) + peakHeight = histo[index]; + } + thresNoiseHeight = peakHeight / (double) SNratio; + peakHeight = 0.0; + + // Detect peaks + for (index = LEN_HISTO_X - 2; index >= 1; index--) { + if (histo[index] > thresNoiseHeight) { + if (histo[index] > peakHeight + && histo[index] > lastPeakHeight) { + peakHeight = histo[index]; + peakCov = (double) (index + 0.5) * binWidth; + noiseCount = 0; + continue; + } + else { + noiseCount++; + } + } + else { + lastPeakHeight = 0.0; + noiseCount++; + } + + if (peakHeight > 0.0 && noiseCount >= thresNoiseCount) { + if (peakCov < thresMinPeak) + break; + + expCovMulti[ecmIndex++] = peakCov; + + peakCov = 0.0; + lastPeakHeight = peakHeight; + peakHeight = 0.0; + noiseCount = 0; + } + } + + // Output detected peaks + if (ecmIndex == 0) { + puts("Error!! 
Couldn't detect any peaks"); + exit(1); + } + for (index = 0; index < ecmIndex; index++) + printf("Detected Peak Coverage : %f\n", expCovMulti[index]); + + puts("Peak detection finished"); + + return ecmIndex; +} + +static void eliminateNullNodes(Graph * graph, int * subgraphMask) +{ + Node *node; + int index; + int lenSubgraphMask = 2 * nodeCount(graph) + 1; + + for (index = 0; index < lenSubgraphMask; index++) { + node = getNodeInGraph(graph, index - nodeCount(graph)); + if (node == NULL || getNodeID(node) == 0) + subgraphMask[index] = -2; + } +} + +static boolean checkLongReadExistence(Graph * graph) +{ + int index; + + for (index = 1; index <= nodeCount(graph); index++) { + if (getMarker(getNodeInGraph(graph, index)) != NULL) + return true; + } + + return false; +} + +static void depthFirstSearchSubgraph(int currentIndex, Graph * graph, int * subgraphMask) +{ + Arc *activeArc = NULL; + int nextIndex = 0; + + if (subgraphMask[currentIndex + nodeCount(graph)] == 0) { + // Mark "Visiting" + subgraphMask[currentIndex + nodeCount(graph)] = 1; + + // Find "Unvisited" Node + for (activeArc = getArc(getNodeInGraph(graph, currentIndex)); + activeArc != NULL; activeArc = getNextArc(activeArc)) { + nextIndex = getNodeID(getDestination(activeArc)); + if (subgraphMask[nextIndex + nodeCount(graph)] == 0) { + depthFirstSearchSubgraph(nextIndex, graph, subgraphMask); + depthFirstSearchSubgraph((nextIndex * -1), graph, subgraphMask); + } + } + } +} + +void resetUniqueness(Graph * graph) +{ + Node *node; + int index; + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node == NULL) + continue; + setUniqueness(node, false); + } +} + +static void setSubgraphMask(int * subgraphMask, int lenSubgraphMask, + int before, int after) +{ + int index; + + for (index = 0; index < lenSubgraphMask; index++) { + if (subgraphMask[index] == before) + subgraphMask[index] = after; + } +} + +static int getUnvisitedNodeID(int * subgraphMask, int lenSubgraphMask) +{ + 
int index; + + for (index = 0; index < lenSubgraphMask; index++) { + if (index == (lenSubgraphMask - 1) / 2) + continue; + if (subgraphMask[index] == 0) + return index - (lenSubgraphMask - 1) / 2; + } + + // Visited all nodes + return 0; +} + +static void shelveSubgraphMask(int * subgraphMask, int lenSubgraphMask, + int exception) +{ + int index; + + for (index = 0; index < lenSubgraphMask; index++) { + if (subgraphMask[index] == exception) + subgraphMask[index] = 0; + else + subgraphMask[index] += 100; + } +} + +static void unshelveSubgraphMask(int * subgraphMask, int lenSubgraphMask) +{ + int index; + + for (index = 0; index < lenSubgraphMask; index++) { + if (subgraphMask[index] >= 50) + subgraphMask[index] -= 100; + } +} + +static int estimated_cov_subgraph(Graph * graph, int * subgraphMask, double expCovPandS[2], + double rateChimericSubgraph) +{ + int nodeIndex; + long int sumLenPrimary = 0, sumLenSecondary = 0, sumLenTotal; + double perPrimary, perSecondary; + Node *node; + double cov = 0.0; + double primary = expCovPandS[0], secondary = expCovPandS[1]; + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + if (subgraphMask[nodeIndex + nodeCount(graph)] == 1) { + node = getNodeInGraph(graph, nodeIndex); + if (node == NULL) + continue; + cov = getNodeDensity(node); + if (fabs(cov - primary) <= fabs(cov - secondary)) + sumLenPrimary += getNodeLength(node); + else + sumLenSecondary += getNodeLength(node); + } + } + + sumLenTotal = sumLenPrimary + sumLenSecondary; + perPrimary = (double) sumLenPrimary / sumLenTotal; + perSecondary = (double) sumLenSecondary / sumLenTotal; + if (perSecondary <= rateChimericSubgraph) { + // Non Chimeric Subgraph, belongs to Primary Species + return 1; + } else if (perPrimary <= rateChimericSubgraph) { + // Non Chimeric Subgraph, belongs to Secondary Species + return -1; + } else { + // Chimeric Subgraph + return 2; + } +} + +static void forceSeparateChimericSubgraph(Graph * graph, int * subgraphMask, + double 
expCovPandS[2]) +{ + int maskIndex; + int lenSubgraphMask = nodeCount(graph) * 2 + 1; + Node *node; + double cov, primary = expCovPandS[0], secondary = expCovPandS[1]; + + for (maskIndex = 0; maskIndex < lenSubgraphMask; maskIndex++) { + if (subgraphMask[maskIndex] == 1) { + node = getNodeInGraph(graph, maskIndex - nodeCount(graph)); + if (node == NULL) + continue; + cov = getNodeDensity(node); + if (fabs(cov - primary) <= fabs(cov - secondary)) + subgraphMask[maskIndex] = 2; + else + subgraphMask[maskIndex] = -1; + } + } +} + +static void judgeChimericSubgraph(Graph * graph, int * subgraphMask, double expCovPandS[2], + double rateChimericSubgraph, boolean discardChimericSubgraph) +{ + int nodeIndex; + int judgeResult = 0; + int lenSubgraphMask = nodeCount(graph) * 2 + 1; + boolean flagVisitedAllNodes = false; + int numSubgraph = 0, thresNumSubgraph = lenSubgraphMask; + + // Shelve not "Visiting" subgraphs + shelveSubgraphMask(subgraphMask, lenSubgraphMask, 1); + + while (!flagVisitedAllNodes) { + // Check Infinite Loop + numSubgraph++; + if (numSubgraph >= thresNumSubgraph) { + puts("Resolving Repeat Error!! 
Infinite Loop"); + free(subgraphMask); + exit(1); + } + + // Choose an unvisited starting node + nodeIndex = getUnvisitedNodeID(subgraphMask, lenSubgraphMask); + + printf("nodeIndex = %d\n", nodeIndex); + + // Depth-first search (node & twin) + depthFirstSearchSubgraph(nodeIndex, graph, subgraphMask); + depthFirstSearchSubgraph((nodeIndex * -1), graph, subgraphMask); + + // Estimate exp_cov in the Subgraph + judgeResult = estimated_cov_subgraph(graph, subgraphMask, expCovPandS, + rateChimericSubgraph); + + if (judgeResult == 1) { + printf("NonChimeric Subgraph, belongs to Primary Species\n"); + setSubgraphMask(subgraphMask, lenSubgraphMask, 1, 2); + } + else if (judgeResult == -1) { + printf("NonChimeric Subgraph, belongs to Secondary Species\n"); + setSubgraphMask(subgraphMask, lenSubgraphMask, 1, -1); + } + else { + printf("Chimeric Subgraph!\n"); + if (discardChimericSubgraph) + setSubgraphMask(subgraphMask, lenSubgraphMask, 1, -2); + else + forceSeparateChimericSubgraph(graph, subgraphMask, expCovPandS); + } + + // Judge whether all nodes in the Subgraph have been visited + if (getUnvisitedNodeID(subgraphMask, lenSubgraphMask) == 0) + flagVisitedAllNodes = true; + } + + // Unshelve not "Visiting" subgraphs + unshelveSubgraphMask(subgraphMask, lenSubgraphMask); + + printf("Separating Chimeric Subgraphs Finished!\n"); +} + +static boolean checkPrimaryExpCovExistence(double expCovMulti[2], double expCovPandS[2]) +{ + int index; + double expCov, primary = expCovPandS[0], secondary = expCovPandS[1]; + + if (secondary == -1) + secondary = primary / (double) 2; + + for (index = 0; index < 2; index++) { + expCov = expCovMulti[index]; + if (expCov == -1) + break; + if (fabs(expCov - primary) <= fabs(expCov - secondary)) { + printf("Primary exp_cov (%f <-> %f), Exist\n", + primary, secondary); + return true; + } + } + + printf("Primary exp_cov (%f <-> %f), NOT exist\n", primary, secondary); + return false; +} + +static boolean judgeSkip(Graph * graph, int * subgraphMask) +{ + 
int nodeIndex; + int lenSubgraphMask = nodeCount(graph) * 2 + 1; + int countSkip = 0; + double insertLen = getInsertLength(graph, 1); + double skipCandidateNodeLen = 0.0; + Node *node = NULL; + + for (nodeIndex = 1; nodeIndex < nodeCount(graph); nodeIndex++) { + if (subgraphMask[nodeIndex + nodeCount(graph)] == 1) { + countSkip++; + node = getNodeInGraph(graph, nodeIndex); + skipCandidateNodeLen = getNodeLength(node); + } + } + + if (countSkip <= 1 && skipCandidateNodeLen < insertLen) { + setSubgraphMask(subgraphMask, lenSubgraphMask, 1, -2); + printf("Skipped\n"); + return true; + } + else + return false; +} + +static void printActiveNodes(int * subgraphMask, int lenSubgraphMask) +{ + int index; + + printf("Active Nodes : "); + for (index = 0; index < lenSubgraphMask; index++) { + if (subgraphMask[index] == 2) + printf("%d ", index - (lenSubgraphMask - 1) / 2); + } + + printf("\n"); +} + +void resolveRepeatOfAllSubgraphs(Graph * graph, ReadSet * reads, double expCovMulti[100], + boolean * dubious, boolean force_jumps, int argPebbleRounds, + double rateChimericSubgraph, boolean discardChimericSubgraph, + double repeatNodeCovSD) +{ + int nodeIndex = 1, ecmIndex = 0; + int numSubgraph = 0; + int thresNumSubgraph = nodeCount(graph) * 2; + int lenSubgraphMask = 2 * nodeCount(graph) + 1; + int *subgraphMask = callocOrExit(lenSubgraphMask, int); + double expCovSubgraph = 0.0; + double expCovPandS[2]; + int numPeaks = 0; + int countInterRepeatLoop = 0, thresInterRepeatLoop = 20; + int pebbleRounds = argPebbleRounds; + boolean flagLongRead = false, flagVisitedAllNodes = false; + + puts("\nResolving Repeats for each subgraph\n"); + + // Eliminate NULL nodes + eliminateNullNodes(graph, subgraphMask); + + // Check whether long reads are in the input sequences + flagLongRead = checkLongReadExistence(graph); + + // Print Expected Coverages + for (ecmIndex = 0; expCovMulti[ecmIndex] > 0.001; ecmIndex++) + printf("Expected Coverage %d : %f\n", ecmIndex+1, expCovMulti[ecmIndex]); 
+ ecmIndex = 0; + + /* + // Detect peaks from whole Graph + setSubgraphMask(subgraphMask, lenSubgraphMask, 0, 1); + estimated_cov_multi(graph, subgraphMask, expCovMulti); + setSubgraphMask(subgraphMask, lenSubgraphMask, 1, 0); + */ + + while (!flagVisitedAllNodes) { + // Set expCovPandS + if (expCovMulti[ecmIndex] != -1) { + expCovPandS[0] = expCovMulti[ecmIndex++]; + expCovPandS[1] = expCovMulti[ecmIndex]; + } + printf("\nPrimary exp_cov : %f\n", expCovPandS[0]); + + // Resolve repeats for each Subgraph + while (true) { + // Check Infinite Loop + numSubgraph++; + if (numSubgraph >= thresNumSubgraph) { + puts("Resolving Repeat Error!! Infinite Loop"); + free(subgraphMask); + exit(1); + } + + // Choose an unvisited starting node + nodeIndex = getUnvisitedNodeID(subgraphMask, lenSubgraphMask); + + printf("nodeIndex = %d\n", nodeIndex); + + // Depth-first search (node & twin) + depthFirstSearchSubgraph(nodeIndex, graph, subgraphMask); + depthFirstSearchSubgraph((nodeIndex * -1), graph, subgraphMask); + + // Estimate the number of peaks + numPeaks = estimated_cov_subgraph(graph, subgraphMask, expCovPandS, + rateChimericSubgraph); + + // Judge whether the Subgraph is chimeric or not + if (numPeaks >= 2) { + puts("Multiple Peaks Detected!"); + // Identify and Separate InterRepeats + while (identifyAndSeparateInterRepeats(graph, expCovPandS, + repeatNodeCovSD)) { + // Check Infinite Loop + if (countInterRepeatLoop++ >= thresInterRepeatLoop) { + puts("Force-quitted to Identify InterRepeats"); + eliminateNullNodes(graph, subgraphMask); + break; + //puts("Identifying InterRepeat Error! 
Infinite Loop"); + //free(subgraphMask); + //exit(1); + } + // Eliminate NULL nodes + eliminateNullNodes(graph, subgraphMask); + } + // Judge whether each Subgraph is chimeric or not + judgeChimericSubgraph(graph, subgraphMask, expCovPandS, + rateChimericSubgraph, discardChimericSubgraph); + } + else if (numPeaks == 1) + setSubgraphMask(subgraphMask, lenSubgraphMask, 1, 2); + + // Judge whether all nodes in the Subgraphs have been visited + if (getUnvisitedNodeID(subgraphMask, lenSubgraphMask) != 0) { + printf("Unvisited Node : %d\n", + getUnvisitedNodeID(subgraphMask, lenSubgraphMask)); + continue; + } + else { + printf("\nGo to Assembly!\n"); + //printActiveNodes(subgraphMask, lenSubgraphMask); + expCovSubgraph = expCovPandS[0]; + printf("exp_cov = %f\n", expCovSubgraph); + } + + // -------------------- Assemble in the Subgraph -------------------- + // Judge unique or repeat + identifyUniqueNodesSubgraph(graph, subgraphMask, + isUniqueSolexaSubgraph, expCovSubgraph); + // Rock Band in the Subgraph + if (flagLongRead) + readCoherentSubgraph(graph, expCovSubgraph, reads, subgraphMask); + // Pebble in the Subgraph + for (pebbleRounds = argPebbleRounds; pebbleRounds > 0; pebbleRounds--) + exploitShortReadPairs(graph, reads, dubious, force_jumps); + // Print "Finished" + printf("Subgraph Assembly Finished!\n\n"); + // ------------------------------------------------------------------ + + // Eliminate NULL Nodes + eliminateNullNodes(graph, subgraphMask); + + // Reset uniqueness + resetUniqueness(graph); + + // Set "2" -> "-2", "-1" -> "0" + setSubgraphMask(subgraphMask, lenSubgraphMask, 2, -2); + setSubgraphMask(subgraphMask, lenSubgraphMask, -1, 0); + + // Judge whether all nodes in the Graph have been visited + if (getUnvisitedNodeID(subgraphMask, lenSubgraphMask) == 0) + flagVisitedAllNodes = true; + break; + } + } + + // Resolved Successfully + puts("Resolved Successfully!\n"); + free(subgraphMask); +} +// Original diff --git 
a/contrib/MetaVelvet-v0.3.1/src/graphStats.h b/contrib/MetaVelvet-v0.3.1/src/graphStats.h new file mode 100644 index 0000000..b4b522d --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/graphStats.h @@ -0,0 +1,161 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _GRAPHSTATS_H_ +#define _GRAPHSTATS_H_ + +// This header file and the corresponding code file contain a load of +// miscellaneous functions, many of which coded quickly and used only once +// for reasons long forgotten since. 
+// Sorry for the mess ;-) + +/////////////////////////////////////////////////////////////////// +// Useful functions +/////////////////////////////////////////////////////////////////// + +// Original +double getNodeDensity(Node * node); +int * makeDummySubgraphMask(Graph * graph); +int estimated_cov_multi(Graph * graph, int * subgraphMask, double expCovMulti[100]); +void resolveRepeatOfAllSubgraphs(Graph * graph, ReadSet * reads, double expCovMulti[100], + boolean * dubious, boolean force_jumps, int pebbleRounds, + double rateChimericSubgraph, boolean discardChimericSubgraph, + double repeatNodeCovSD); +void resetUniqueness(Graph * graph); +// Original + +void displayGraphStatistics(Graph * graph); + +void displayGeneralStatistics(Graph * graph, char *filename, ReadSet * reads); + +void exportLongNodeSequences(char *filename, Graph * graph, + Coordinate minLength); + +void exportMediumNodeSequences(char *filename, Graph * graph, + Coordinate minLength); + +IDnum readStarts(Node * node); + +Coordinate readCoverage(Node * node); + +IDnum strainMarkerCount(Node * node, IDnum firstStrain); + +IDnum nodeMultiplicity(Node * node); + +Coordinate n50(Graph * graph); + +double estimated_cov(Graph * graph, char * directory); + +Coordinate maxLength(Graph * graph); + +boolean *removeLowCoverageNodesAndDenounceDubiousReads(Graph * graph, + double minCov); + +void exportAMOSContigs(char *filename, Graph * graph, + Coordinate cutoff_length, ReadSet * reads); + +IDnum usedReads(Graph * graph, Coordinate minContigLength); + +Coordinate totalAssemblyLength(Graph * graph); + +void logFinalStats(Graph * graph, Coordinate minContigKmerLength, char *directory); + +void exportUnusedReads(Graph* graph, ReadSet * reads, Coordinate minContigKmerLength, char* filename); + +/////////////////////////////////////////////////////////////////// +// Dodgy functions +/////////////////////////////////////////////////////////////////// + +IDnum countSinksAndSources(Graph * graph); + +IDnum 
countTangles(Graph * graph); + +IDnum countRepeats(Graph * graph); + +IDnum countSNPs(Graph * graph, IDnum firstStrain, int WORDLENGTH); + +void displayGraphStatisticsSelective(Graph * graph, IDnum first); + +void grossErrorRemoval(Graph * graph, IDnum firstStrain); + +Coordinate countCommonLength(Graph * graph, IDnum firstStrain); + +IDnum countBreakpoints(Graph * graph, IDnum firstStrain); + +IDnum countStrainOnlyNodes(Graph * graph, IDnum firstStrain); + +Coordinate countStrainOnlyBp(Graph * graph, IDnum firstStrain); + +void displayStrainOnlySequences(Graph * graph, IDnum firstStrain, + char *inputFilename, char *filename, + int WORDLENGTH); + +void displayStrainOnlyDescriptors(Graph * graph, IDnum firstStrain); + +void chainSawCorrection(Graph * graph, int minMult); + +void displayBreakpoints(Graph * graph, IDnum firstStrain); + +void destroyStrainSpecificIslands(Graph * graph, IDnum firstStrain); + +void spotIrregularReads(Graph * graph, IDnum firstStrain, + char *sequenceFile, char *outputFile); + +void displayAlignmentToReference(Graph * graph, IDnum seqID, + IDnum firstStrain, + TightString ** sequences, int WORDLENGTH, + char *filename); + +void removeReferenceMarkers(Graph * graph, IDnum firstStrain); + +void testForBizarreMarkers(Graph * graph); + +void surveyPaths(Graph * graph); + +void destroyMixedReads(Graph * graph, IDnum minCoverage); + +void destroySinglePoolNodes(Graph * graph); +void destroySinglePoolNodesStrict(Graph * graph); +void destroyShortTips(Graph * graph); + +void destroyDisconnectedElements(Graph * graph); +void measureTangleSizes(Graph * graph, Coordinate maxLength); + +void destroyEmptyNodes(Graph * graph); + +void removeShortReads(Graph * graph); + +Coordinate totalGraphLength(Graph * graph); + +void contigStats(Node ** node, IDnum readCount); + +void exportContigs(Node ** contigs, ReadSet * reads, char *filename, + int WORDLENGTH, int pairedReadsCount); + +void removeLowCoverageNodes(Graph * graph, double minCov); +void 
removeHighCoverageNodes(Graph * graph, double maxCov); + +void removeMissingStrain(Graph * graph, Category cat); + +boolean isNatural(Graph * graph); + +void searchForHallidayJunction(Graph * graph); + +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/kmer.c b/contrib/MetaVelvet-v0.3.1/src/kmer.c new file mode 100644 index 0000000..77c4cc5 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/kmer.c @@ -0,0 +1,508 @@ +/* +Copyright 2007, 2008, 2009 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include + +#include "globals.h" +#include "kmer.h" +#include "utility.h" + +static const uint64_t longLongLeftFilter = (uint64_t) 3 << 62; +static const uint32_t longLeftFilter = (uint32_t) 3 << 30; +static const uint16_t intLeftFilter = (uint16_t) 3 << 14; +static const uint8_t charLeftFilter = (uint8_t) 3 << 6; + +static uint64_t longLongWordFilter = (uint64_t) ((int64_t) -1); +static uint32_t longWordFilter = (uint32_t) ((int32_t) -1); +static uint16_t intWordFilter = (uint16_t) ((int16_t) -1); +static uint8_t charWordFilter = (uint8_t) ((int8_t) -1); + +#define UNDEFINED 0 +#define CHARS 1 +#define INTS 2 +#define LONGS 3 +#define LONGLONGS 4 +static int kmerFilterIndex = UNDEFINED; +static int kmerFilterOffset = 0; +static int longLongKmerFilterIndex = 
KMER_LONGLONGS; +static uint64_t longLongKmerFilter = (uint64_t) ((int64_t) -1); + +void resetWordFilter(int wordLength) { + int kmer_bit_size = wordLength * 2; + int i; + + if (wordLength > MAXKMERLENGTH) + exitErrorf(EXIT_FAILURE, true, "Word length %i greater than max allowed value (%i).\nRecompile Velvet to deal with this word length.", wordLength, MAXKMERLENGTH); + +#if KMER_LONGLONGS + for (i = 0; i < KMER_LONGLONGS; i++) { + if (kmer_bit_size > 64) { + kmer_bit_size -= 64; + continue; + } else if (kmer_bit_size == 64) { + longLongKmerFilterIndex = i; + longLongKmerFilter = longLongWordFilter; + kmerFilterIndex = LONGLONGS; + kmerFilterOffset = kmer_bit_size - 2; + longWordFilter = 0; + intWordFilter = 0; + charWordFilter = 0; + return; + } else { + longLongKmerFilterIndex = i; + longLongKmerFilter = (((uint64_t) 1) << kmer_bit_size) - 1; + kmerFilterIndex = LONGLONGS; + kmerFilterOffset = kmer_bit_size - 2; + longWordFilter = 0; + intWordFilter = 0; + charWordFilter = 0; + return; + } + } +#endif +#if KMER_LONGS + if (kmer_bit_size > 32) + kmer_bit_size -= 32; + else if (kmer_bit_size == 32) { + kmerFilterIndex = LONGS; + kmerFilterOffset = kmer_bit_size - 2; + intWordFilter = 0; + charWordFilter = 0; + return; + } else { + longWordFilter = (((uint32_t) 1) << kmer_bit_size) - 1; + kmerFilterIndex = LONGS; + kmerFilterOffset = kmer_bit_size - 2; + intWordFilter = 0; + charWordFilter = 0; + return; + } +#endif +#if KMER_INTS + if (kmer_bit_size > 16) + kmer_bit_size -= 16; + else if (kmer_bit_size == 16) { + kmerFilterIndex = INTS; + kmerFilterOffset = kmer_bit_size - 2; + charWordFilter = 0; + return; + } else { + intWordFilter = (((uint16_t) 1) << kmer_bit_size) - 1; + kmerFilterIndex = INTS; + kmerFilterOffset = kmer_bit_size - 2; + charWordFilter = 0; + return; + } + +#endif +#if KMER_CHARS + if (kmer_bit_size < 8) + charWordFilter = (((uint8_t) 1) << kmer_bit_size) - 1; + + kmerFilterIndex = CHARS; + kmerFilterOffset = kmer_bit_size - 2; +#endif + +} + 
+static void shiftRight(Kmer * kmer) { + int i; + uint64_t leftBits = 0; + uint64_t rightBits; + +#if KMER_CHARS + +#if KMER_INTS | KMER_LONGS | KMER_LONGLONGS + rightBits = kmer->chars & 3; +#endif + + kmer->chars >>= 2; + kmer->chars += (uint8_t) leftBits; + +#if KMER_INTS | KMER_LONGS | KMER_LONGLONGS + leftBits = rightBits; +#endif +#endif + +#if KMER_INTS + +#if KMER_LONGS | KMER_LONGLONGS + rightBits = kmer->ints & 3; +#endif + + leftBits <<= 14; + kmer->ints >>= 2; + kmer->ints += (uint16_t) leftBits; + +#if KMER_LONGS | KMER_LONGLONGS + leftBits = rightBits; +#endif +#endif + +#if KMER_LONGS + +#if KMER_LONGLONGS + rightBits = kmer->longs & 3; +#endif + + leftBits <<= 30; + kmer->longs >>= 2; + kmer->longs += (uint32_t) leftBits; + +#if KMER_LONGLONGS + leftBits = rightBits; +#endif +#endif + +#if KMER_LONGLONGS + for (i = KMER_LONGLONGS - 1; i >= 0; i--) { + rightBits = kmer->longlongs[i] & 3; + leftBits <<= 62; + kmer->longlongs[i] >>= 2; + kmer->longlongs[i] += leftBits; + leftBits = rightBits; + } +#endif +} + +void copyKmers(Kmer* k1, Kmer* k2) { + int i; + +#if KMER_LONGLONGS + for (i = 0; i < KMER_LONGLONGS; i++) + k1->longlongs[i] = k2->longlongs[i]; +#endif +#if KMER_LONGS + k1->longs = k2->longs; +#endif +#if KMER_INTS + k1->ints = k2->ints; +#endif +#if KMER_CHARS + k1->chars = k2->chars; +#endif +} + +int compareKmers(Kmer* k1, Kmer* k2) { +#if KMER_LONGLONGS + int i; +#endif + +#if KMER_CHARS + if (k1->chars == k2->chars) + ; + else if (k1->chars > k2->chars) + return 1; + else + return -1; +#endif +#if KMER_INTS + if (k1->ints == k2->ints) + ; + else if (k1->ints > k2->ints) + return 1; + else + return -1; +#endif +#if KMER_LONGS + if (k1->longs == k2->longs) + ; + else if (k1->longs > k2->longs) + return 1; + else + return -1; +#endif +#if KMER_LONGLONGS + for (i = KMER_LONGLONGS - 1; i >= 0; i--) { + if (k1->longlongs[i] == k2->longlongs[i]) + continue; + else if (k1->longlongs[i] > k2->longlongs[i]) + return 1; + else + return -1; + } 
+#endif + + return 0; +} + +void clearKmer(Kmer * kmer) { + int i; + +#if KMER_LONGLONGS + for (i = 0; i < KMER_LONGLONGS; i++) + kmer->longlongs[i] = 0; +#endif +#if KMER_LONGS + kmer->longs = 0; +#endif +#if KMER_INTS + kmer->ints = 0; +#endif +#if KMER_CHARS + kmer->chars = 0; +#endif +} + +void printKmer(Kmer * kmer) { + int i; + +#if KMER_CHARS + printf("%hx\t", kmer->chars); +#endif +#if KMER_INTS + printf("%x\t", kmer->ints); +#endif +#if KMER_LONGS + printf("%x\t", kmer->longs); +#endif +#if KMER_LONGLONGS + for (i = KMER_LONGLONGS - 1; i >= 0; i--) + printf("%llx\t", (long long) kmer->longlongs[i]); +#endif + puts(""); +} + +void testKmers(int argc, char** argv) { + Kmer kmer; + Kmer *k2; + Kmer k4; + + k2 = &k4; + int i; + + printf("FORMATS %u %u %u %u\n", KMER_CHARS, KMER_INTS, KMER_LONGS, KMER_LONGLONGS); + printf("FILTERS %hx %x %lx %llx\n", (short) charLeftFilter, (int) intLeftFilter, (long) longLeftFilter, (long long) longLongLeftFilter); + printf("FILTERS %hx %x %lx %llx\n", (short) charWordFilter, (int) intWordFilter, (long) longWordFilter, (long long) longLongWordFilter); + printKmer(&kmer); + puts("Clear"); + clearKmer(&kmer); + printKmer(&kmer); + + puts("Fill up"); + for (i = 0; i < MAXKMERLENGTH; i++) { + pushNucleotide(&kmer, ((i + 1) % 4)); + printKmer(&kmer); + } + + puts("Shift right"); + for (i = 0; i < MAXKMERLENGTH; i++) { + popNucleotide(&kmer); + printKmer(&kmer); + } + + puts("Reverse complement"); + resetWordFilter(9); + clearKmer(&kmer); + for (i = 0; i < MAXKMERLENGTH; i++) { + reversePushNucleotide(&kmer, ((i + 1) % 4)); + printKmer(&kmer); + } + + puts("Copy"); + copyKmers(k2, &kmer); + printKmer(k2); + printf("%i\n", compareKmers(k2, &kmer)); + +} + +void pushNucleotide(Kmer * kmer, Nucleotide nucleotide) { + register int i; + +#if KMER_LONGLONGS + register uint64_t * ptr; +#endif +#if KMER_LONGLONGS > 1 | KMER_LONGS | KMER_INTS | KMER_CHARS + uint64_t leftBits; +#endif + uint64_t rightBits = 0; + +#if KMER_LONGLONGS + ptr = 
kmer->longlongs; + +#if KMER_LONGLONGS > 1 + for (i = 0; i < longLongKmerFilterIndex; i++) { + leftBits = (*ptr & longLongLeftFilter); + leftBits >>= 62; + *ptr <<= 2; + *ptr += rightBits; + *ptr &= longLongWordFilter; + rightBits = leftBits; + ptr++; + } +#endif + +#if KMER_LONGS | KMER_INTS | KMER_CHARS + leftBits = (*ptr & longLongLeftFilter); + leftBits >>= 62; +#endif + + *ptr <<= 2; + *ptr += rightBits; + *ptr &= longLongKmerFilter; + +#if KMER_LONGS | KMER_INTS | KMER_CHARS + rightBits = leftBits; +#endif +#endif + +#if KMER_LONGS + +#if KMER_INTS | KMER_CHARS + leftBits = kmer->longs & longLeftFilter; + leftBits >>= 30; +#endif + kmer->longs <<= 2; + kmer->longs += rightBits; + kmer->longs &= longWordFilter; + +#if KMER_INTS | KMER_CHARS + rightBits = leftBits; +#endif + +#endif + +#if KMER_INTS + +#if KMER_CHARS + leftBits = kmer->ints & intLeftFilter; + leftBits >>= 14; +#endif + kmer->ints <<= 2; + kmer->ints += rightBits; + kmer->ints &= intWordFilter; + +#if KMER_CHARS + rightBits = leftBits; +#endif + +#endif + +#if KMER_CHARS + kmer->chars <<= 2; + kmer->chars += rightBits; + kmer->chars &= charWordFilter; +#endif + +#if KMER_LONGLONGS + kmer->longlongs[0] += nucleotide; + if (kmer->longlongs[0] >= nucleotide) + return; + + for (i = 1; i < KMER_LONGLONGS; i++) + if (++kmer->longlongs[i]) + return; +#if KMER_LONGS + if (++kmer->longs) + return; +#endif +#if KMER_INTS + if (++kmer->ints) + return; +#endif +#if KMER_CHARS + ++kmer->chars; +#endif + +#else + +#if KMER_LONGS + kmer->longs += nucleotide; + if (kmer->longs >= nucleotide) + return; +#if KMER_INTS + if (++kmer->ints) + return; +#endif +#if KMER_CHARS + ++kmer->chars; +#endif + +#else + +#if KMER_INTS + kmer->ints += nucleotide; + if (kmer->ints >= nucleotide) + return; +#if KMER_CHARS + ++kmer->chars; +#endif + +#else + +#if KMER_CHARS + kmer->chars += nucleotide; +#endif + +#endif +#endif +#endif +} + +Nucleotide popNucleotide(Kmer * kmer) { + Nucleotide nucl; + +#if KMER_LONGLONGS + nucl = 
kmer->longlongs[0] & 3; +#elif KMER_LONGS + nucl = kmer->longs & 3; +#elif KMER_INTS + nucl = kmer->ints & 3; +#elif KMER_CHARS + nucl = kmer->chars & 3; +#endif + + shiftRight(kmer); + return nucl; +} + +void reversePushNucleotide(Kmer * kmer, Nucleotide nucleotide) { + uint64_t templongLong = nucleotide; + + shiftRight(kmer); + + switch(kmerFilterIndex) { + case UNDEFINED: + abort(); +#if KMER_LONGLONGS + case LONGLONGS: + kmer->longlongs[longLongKmerFilterIndex] += templongLong << kmerFilterOffset; + return; +#endif +#if KMER_LONGS + case LONGS: + kmer->longs += templongLong << kmerFilterOffset; + return; +#endif +#if KMER_INTS + case INTS: + kmer->ints += templongLong << kmerFilterOffset; + return; +#endif +#if KMER_CHARS + case CHARS: + kmer->chars += templongLong << kmerFilterOffset; + return; +#endif + } + + exitErrorf(EXIT_FAILURE, true, "Anomaly in k-mer filtering"); +} diff --git a/contrib/MetaVelvet-v0.3.1/src/kmer.h b/contrib/MetaVelvet-v0.3.1/src/kmer.h new file mode 100644 index 0000000..356e3ba --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/kmer.h @@ -0,0 +1,68 @@ +/* +Copyright 2007, 2008, 2009 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _KMER_H_ +#define _KMER_H_ + +#include + +void copyKmers(Kmer* k1, Kmer* k2); + +void pushNucleotide(Kmer * kmer, Nucleotide nucleotide); +Nucleotide popNucleotide(Kmer * kmer); + +int compareKmers(Kmer* k1, Kmer* k2); + +void reversePushNucleotide(Kmer * kmer, Nucleotide nucleotide); + +void printKmer(Kmer * kmer); + +void clearKmer(Kmer * kmer); + +void resetWordFilter(int wordLength); + +#define KMER_QUOTIENT (MAXKMERLENGTH / 4) +#define KMER_REMAINDER (MAXKMERLENGTH % 4) +#if KMER_REMAINDER +#define KMER_BYTE_SIZE (KMER_QUOTIENT + 1) +#else +#define KMER_BYTE_SIZE KMER_QUOTIENT +#endif +#define KMER_LONGLONGS (KMER_BYTE_SIZE / 8) +#define KMER_LONGS ((KMER_BYTE_SIZE % 8) / 4) +#define KMER_INTS ((KMER_BYTE_SIZE % 4) / 2) +#define KMER_CHARS (KMER_BYTE_SIZE % 2) + +struct kmer_st { +#if KMER_LONGLONGS + uint64_t longlongs[KMER_LONGLONGS]; +#endif +#if KMER_LONGS + uint32_t longs; +#endif +#if KMER_INTS + uint16_t ints; +#endif +#if KMER_CHARS + uint8_t chars; +#endif +}; + +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/locallyCorrectedGraph.c b/contrib/MetaVelvet-v0.3.1/src/locallyCorrectedGraph.c new file mode 100644 index 0000000..e32fb79 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/locallyCorrectedGraph.c @@ -0,0 +1,579 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include + +#include "globals.h" +#include "graph.h" +#include "tightString.h" +#include "dfibHeap.h" +#include "recycleBin.h" +#include "passageMarker.h" +#include "shortReadPairs.h" +#include "locallyCorrectedGraph.h" +#include "utility.h" + +static const Time INDEL = 0; +static const Time SIM[4][4] = { + {1, 0, 0, 0}, + {0, 1, 0, 0}, + {0, 0, 1, 0}, + {0, 0, 0, 1} +}; + +//Global variables used throughout this procedure(internal use only !) +static int MAXREADLENGTH = 100; +static int MAXNODELENGTH = 200; +static double MAXDIVERGENCE = 0.2; +static int MAXGAPS = 3; + +static Time *times; +static Node **previous; + +static DFibHeapNode **dheapNodes; +static DFibHeap *dheap; + +static TightString *fastSequence; +static TightString *slowSequence; + +static int WORDLENGTH; +static int SELF_LOOP_CUTOFF; +static Graph *graph; +static Node *start; + +static PassageMarker *fastPath; +static PassageMarker *slowPath; + +static double **Fmatrix; +//End of global variables; + +static void setNodeTime(Node * node, Time time) +{ + times[getNodeID(node) + nodeCount(graph)] = time; +} + +static Time getNodeTime(Node * node) +{ + return times[getNodeID(node) + nodeCount(graph)]; +} + +static Node *getNodePrevious(Node * node) +{ + return previous[getNodeID(node) + nodeCount(graph)]; +} + +static boolean isPreviousToNode(Node * previous, Node * target) +{ + Node *currentNode = target; + Node *previousNode = NULL; + Time targetTime = getNodeTime(target); + + //printf("Testing if %li is previous to %li\n", getNodeID(previous), getNodeID(target)); + + while (true) { + //printf("CCC %li %f\n", getNodeID(currentNode), getNodeTime(currentNode)); + + if (currentNode == previous) + return true; + + if (currentNode == 
previousNode) + return false; + + if (getNodeID(currentNode) > nodeCount(graph) + || getNodeID(currentNode) < -nodeCount(graph)) { + printf("Node ID??? %d %d\n", + getNodeID(currentNode), + getNodeID(previousNode)); + } + + if (getNodeTime(currentNode) != targetTime) + return false; + + previousNode = currentNode; + currentNode = getNodePrevious(currentNode); + } +} + +static boolean +extractSequence(PassageMarker * path, TightString * sequence) +{ + PassageMarker *marker; + Coordinate seqLength = 0; + Coordinate writeIndex = 0; + + //printf("Extracting sequence %li ... ", pathLength); + + //Measure length + for (marker = getNextInSequence(path); !isTerminal(marker); + marker = getNextInSequence(marker)) + seqLength += getNodeLength(getNode(marker)); + + if (seqLength > MAXREADLENGTH) + return false; + else + setTightStringLength(sequence, seqLength); + + //Copy sequences + for (marker = getNextInSequence(path); !isTerminal(marker); + marker = getNextInSequence(marker)) { + appendNodeSequence(getNode(marker), sequence, writeIndex); + writeIndex += getNodeLength(getNode(marker)); + } + + return true; +} + +static Time max(Time A, Time B, Time C) +{ + if (A >= B && A >= C) + return A; + else if (B >= C) + return B; + else + return C; +} + +static boolean +compareSequences(TightString * sequence1, TightString * sequence2) +{ + Coordinate i, j; + Coordinate length1 = getLength(sequence1); + Coordinate length2 = getLength(sequence2); + Coordinate maxLength; + Time Choice1, Choice2, Choice3; + Time maxScore; + + if (length1 == 0 || length2 == 0) + return false; + + maxLength = (length1 > length2 ? 
length1 : length2); + + if (length1 < WORDLENGTH || length2 < WORDLENGTH) + if (maxLength - length1 > MAXGAPS + || maxLength - length2 > MAXGAPS) + return false; + + for (i = 0; i <= length1; i++) + Fmatrix[i][0] = 0; + for (j = 0; j <= length2; j++) + Fmatrix[0][j] = 0; + + for (i = 1; i <= length1; i++) { + for (j = 1; j <= length2; j++) { + Choice1 = + Fmatrix[i - 1][j - 1] + + SIM[(int) getNucleotide(i - 1, sequence1)] + [(int) getNucleotide(j - 1, sequence2)]; + Choice2 = Fmatrix[i - 1][j] + INDEL; + Choice3 = Fmatrix[i][j - 1] + INDEL; + Fmatrix[i][j] = max(Choice1, Choice2, Choice3); + } + } + + maxScore = Fmatrix[length1][length2]; + + if (maxScore < maxLength - MAXGAPS) + return false; + + if ((1 - maxScore / maxLength) > MAXDIVERGENCE) + return false; + + return true; +} + +static void destroyPaths() +{ + PassageMarker *marker; + + while (slowPath != NULL) { + marker = slowPath; + getNodeTime(getNode(marker)); + getNodeTime(getTwinNode(getNode(marker))); + + slowPath = getNextInSequence(marker); + destroyPassageMarker(marker); + } + + while (fastPath != NULL) { + marker = fastPath; + getNodeTime(getNode(marker)); + getNodeTime(getTwinNode(getNode(marker))); + fastPath = getNextInSequence(marker); + destroyPassageMarker(marker); + } +} + +static void cleanUpRedundancy_local() +{ + PassageMarker *current; + + for (current = getNextInSequence(slowPath); !isTerminal(current); + current = getNextInSequence(current)) + handicapNode(getNode(current)); + + destroyPaths(); +} + +static void comparePaths_local(Node * destination, Node * origin) +{ + IDnum slowLength, fastLength; + Node *fastNode, *slowNode; + IDnum i; + PassageMarker *marker; + + //Measure lengths + slowLength = fastLength = 0; + fastNode = destination; + slowNode = origin; + + //puts("Looking into separate paths"); + + while (fastNode != slowNode) { + //printf("Fast node %li Slow node %li\n", getNodeID(fastNode), getNodeID(slowNode)); + + if (getNodeTime(fastNode) > getNodeTime(slowNode)) { + 
fastLength++; + fastNode = getNodePrevious(fastNode); + } else if (getNodeTime(fastNode) < getNodeTime(slowNode)) { + slowLength++; + slowNode = getNodePrevious(slowNode); + } else if (isPreviousToNode(slowNode, fastNode)) { + while (fastNode != slowNode) { + fastLength++; + fastNode = getNodePrevious(fastNode); + } + } else if (isPreviousToNode(fastNode, slowNode)) { + while (slowNode != fastNode) { + slowLength++; + slowNode = getNodePrevious(slowNode); + } + } else { + fastLength++; + fastNode = getNodePrevious(fastNode); + slowLength++; + slowNode = getNodePrevious(slowNode); + } + + if (slowLength > MAXNODELENGTH + || fastLength > MAXNODELENGTH) { + //printf("Paths too fragmented %li %li\n", slowLength, fastLength); + return; + } + } + + if (fastLength == 0) + return; + + //Backtracking to record actual paths + fastPath = addUncertainPassageMarker(1, destination); + setPassageMarkerStatus(fastPath, true); + + for (i = 0; i < fastLength; i++) { + marker = + addUncertainPassageMarker(1, + getNodePrevious(getNode + (fastPath))); + setPassageMarkerStatus(marker, true); + connectPassageMarkers(marker, fastPath, graph); + fastPath = marker; + } + + slowPath = addUncertainPassageMarker(2, destination); + setPassageMarkerStatus(slowPath, true); + + marker = addUncertainPassageMarker(2, origin); + setPassageMarkerStatus(marker, true); + connectPassageMarkers(marker, slowPath, graph); + slowPath = marker; + + for (i = 0; i < slowLength; i++) { + marker = + addUncertainPassageMarker(2, + getNodePrevious(getNode + (slowPath))); + setPassageMarkerStatus(marker, true); + connectPassageMarkers(marker, slowPath, graph); + slowPath = marker; + } + + //Extract sequences + if (!extractSequence(fastPath, fastSequence) + || !extractSequence(slowPath, slowSequence)) { + //puts("Paths too long"); + destroyPaths(); + return; + } + //Compare sequences + if (compareSequences(fastSequence, slowSequence)) { + //puts("Correcting discrepancy"); + cleanUpRedundancy_local(); + return; + } + 
//puts("\tFinished comparing paths, changes made"); + destroyPaths(); +} + +static void tourBusArc_local(Node * origin, Arc * arc, Time originTime) +{ + Node *destination = getDestination(arc); + Time arcTime, totalTime, destinationTime; + IDnum nodeIndex = getNodeID(destination) + nodeCount(graph); + Node *oldPrevious = previous[nodeIndex]; + + //printf("Trying arc from %li -> %li\n", getNodeID(origin), getNodeID(destination)); + + if (oldPrevious == origin) + return; + + arcTime = + ((Time) getNodeLength(origin)) / ((Time) getMultiplicity(arc)); + totalTime = originTime + arcTime; + + destinationTime = times[nodeIndex]; + + if (destinationTime == -1) { + //puts("New destination"); + setNodeTime(destination, totalTime); + dheapNodes[nodeIndex] = + insertNodeIntoDHeap(dheap, totalTime, destination); + previous[nodeIndex] = origin; + return; + } else if (destinationTime > totalTime) { + //printf("Previously visited from slower node %li\n", getNodeID(getNodePrevious(destination))); + if (dheapNodes[nodeIndex] == NULL) { + return; + } + + setNodeTime(destination, totalTime); + replaceKeyInDHeap(dheap, dheapNodes[nodeIndex], totalTime); + previous[nodeIndex] = origin; + + comparePaths_local(destination, oldPrevious); + return; + } else { + //printf("Previously visited by faster node %li\n", getNodeID(getNodePrevious(destination))); + comparePaths_local(destination, origin); + } +} + +static void tourBusNode_local(Node * node) +{ + Arc *arc; + Node *destination; + Time nodeTime = getNodeTime(node); + + //printf("Node %li %f %i %p\n", getNodeID(node), + // times[getNodeID(node) + nodeCount(graph)], simpleArcCount(node), + // node); + + for (arc = getArc(node); arc != NULL; arc = getNextArc(arc)) { + destination = getDestination(arc); + + // Node doesn't belong to the marked node area + if (getNodeStatus(getDestination(arc)) != 1) + continue; + + tourBusArc_local(node, arc, nodeTime); + + if (getNodeStatus(node) != 1) + break; + } +} + +static boolean isLocalDeadEnd(Node 
* node) +{ + Arc *arc; + + for (arc = getArc(node); arc != NULL; arc = getNextArc(arc)) + if (getNodeStatus(getDestination(arc)) == 1) + return false; + + return true; +} + +static boolean isLocalTwinDeadEnd(Node * node) +{ + Arc *arc; + + for (arc = getArc(getTwinNode(node)); arc != NULL; + arc = getNextArc(arc)) + if (getNodeStatus(getTwinNode(getDestination(arc))) == 1) + return false; + + return true; +} + +static void clipTipsVeryHardLocally() +{ + NodeList *nodeList, *next; + Node *current, *twin; + boolean modified = true; + + //puts("Clipping short tips off graph HARD"); + + while (modified) { + modified = false; + + for (nodeList = getMarkedNodeList(); nodeList != NULL; + nodeList = next) { + next = nodeList->next; + current = nodeList->node; + + if (current == NULL || getNodeStatus(current) != 1) + continue; + + if (getUniqueness(current)) + continue; + + //printf("Checking node HARD %li %i\n", getNodeID(current), simpleArcCount(current)); + + twin = getTwinNode(current); + + if (isLocalDeadEnd(current) + || isLocalTwinDeadEnd(current)) { + //printf("Found tip at node %li\n", getNodeID(current)); + handicapNode(current); + modified = true; + } + } + } +} + +static void tourBus_local(Node * startingPoint) +{ + Node *currentNode = startingPoint; + IDnum nodeID = getNodeID(startingPoint) + nodeCount(graph); + + //printf("Tour bus from node %li...\n", getNodeID(startingPoint)); + + times[nodeID] = 0; + previous[nodeID] = currentNode; + + while (currentNode != NULL) { + dheapNodes[getNodeID(currentNode) + nodeCount(graph)] = + NULL; + tourBusNode_local(currentNode); + currentNode = removeNextNodeFromDHeap(dheap); + } +} + +void prepareGraphForLocalCorrections(Graph * argGraph) +{ + IDnum nodes = nodeCount(argGraph); + IDnum index; + + //Setting global params + graph = argGraph; + WORDLENGTH = getWordLength(graph); + SELF_LOOP_CUTOFF = WORDLENGTH; + // Done with global params + + // Original + /* + printf("Preparing to correct graph with cutoff %f\n", + 
MAXDIVERGENCE); + */ + // Original + + // Allocating memory + times = mallocOrExit(2 * nodes + 1, Time); + previous = mallocOrExit(2 * nodes + 1, Node *); + + dheapNodes = mallocOrExit(2 * nodes + 1, DFibHeapNode *); + + dheap = newDFibHeap(); + + fastSequence = newTightString(MAXREADLENGTH); + slowSequence = newTightString(MAXREADLENGTH); + + for (index = 0; index < (2 * nodeCount(graph) + 1); index++) { + times[index] = -1; + dheapNodes[index] = NULL; + previous[index] = NULL; + } + + Fmatrix = callocOrExit(MAXREADLENGTH + 1, double *); + for (index = 0; index < MAXREADLENGTH + 1; index++) + Fmatrix[index] = callocOrExit(MAXREADLENGTH + 1, double); + //Done with memory +} + +void correctGraphLocally(Node * argStart) +{ + IDnum index, nodeIndex; + NodeList *nodeList; + + start = argStart; + //printf("Correcting graph from node %li\n", getNodeID(start)); + + clipTipsVeryHardLocally(); + + index = 0; + for (nodeList = getMarkedNodeList(); nodeList != NULL; + nodeList = nodeList->next) { + nodeIndex = getNodeID(nodeList->node) + nodeCount(graph); + times[nodeIndex] = -1; + dheapNodes[nodeIndex] = NULL; + previous[nodeIndex] = NULL; + } + + tourBus_local(start); +} + +void deactivateLocalCorrectionSettings() +{ + // Original + /* + puts("Deactivating local correction settings"); + */ + // Original + + IDnum index; + + for (index = 0; index <= MAXREADLENGTH; index++) { + free(Fmatrix[index]); + } + free(Fmatrix); + + free(times); + free(previous); + free(dheapNodes); + destroyDHeap(dheap); + + destroyTightString(fastSequence); + destroyTightString(slowSequence); +} + +void setLocalMaxReadLength(int value) +{ + MAXREADLENGTH = value; + MAXNODELENGTH = 2 * value; +} + +void setLocalMaxGaps(int value) +{ + MAXGAPS = value; +} + +void setLocalMaxDivergence(double value) +{ + MAXDIVERGENCE = value; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/locallyCorrectedGraph.h b/contrib/MetaVelvet-v0.3.1/src/locallyCorrectedGraph.h new file mode 100644 index 0000000..2053589 --- 
/dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/locallyCorrectedGraph.h @@ -0,0 +1,40 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _LOCALLYCORRECTEDGRAPH_H_ +#define _LOCALLYCORRECTEDGRAPH_H_ + +struct nodeList_st { + Node *node; + NodeList *previous; + NodeList *next; +}; + +void prepareGraphForLocalCorrections(Graph * graph); + +void correctGraphLocally(Node * start); + +void deactivateLocalCorrectionSettings(); + +// Black arts: +void setLocalMaxReadLength(int value); +void setLocalMaxGaps(int value); +void setLocalMaxDivergence(double value); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/passageMarker.c b/contrib/MetaVelvet-v0.3.1/src/passageMarker.c new file mode 100644 index 0000000..293ab3f --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/passageMarker.c @@ -0,0 +1,639 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. 
+ + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include + +#include "globals.h" +#include "graph.h" +#include "recycleBin.h" +#include "passageMarker.h" +#include "tightString.h" +#include "utility.h" + +struct passage_st { + struct node_st *node; + PassageMarker *nextInNode; + PassageMarker *previousInNode; + PassageMarker *twinMarker; + PassageMarker *nextInSequence; + Coordinate start; + Coordinate finishOffset; + IDnum sequenceID; + boolean status; +}; + +static RecycleBin *markerMemory = NULL; +static RecycleBin *listMemory = NULL; +static const int MARKERBLOCKSIZE = 1000000; +static const int LISTBLOCKSIZE = 10000; + +PassageMarker *allocatePassageMarker() +{ + if (markerMemory == NULL) + markerMemory = + newRecycleBin(sizeof(PassageMarker), MARKERBLOCKSIZE); + + return (PassageMarker *) allocatePointer(markerMemory); +} + +static void deallocatePassageMarker(PassageMarker * marker) +{ + deallocatePointer(markerMemory, marker); +} + +PassageMarkerList *allocatePassageMarkerList() +{ + if (listMemory == NULL) + listMemory = + newRecycleBin(sizeof(PassageMarkerList), + LISTBLOCKSIZE); + + return (PassageMarkerList *) allocatePointer(listMemory); +} + +void deallocatePassageMarkerList(PassageMarkerList * marker) +{ + deallocatePointer(listMemory, marker); +} + +void setNextInSequence(PassageMarker * previous, PassageMarker * next) +{ + if (previous == NULL) + return; + + previous->nextInSequence = next; +} + +void extractPassageMarker(PassageMarker * marker) +{ + PassageMarker *twin; + + if (marker == NULL) + return; + + if (marker->node == NULL) + return; + + 
 if (marker->previousInNode == marker) + setMarker(marker->node, marker->nextInNode); + else + setNextInNode(marker->previousInNode, marker->nextInNode); + + marker->previousInNode = NULL; + marker->nextInNode = NULL; + marker->node = NULL; + + twin = marker->twinMarker; + twin->nextInNode = NULL; + twin->previousInNode = NULL; + twin->node = NULL; +} + +void destroyPassageMarker(PassageMarker * marker) +{ + PassageMarker *twin; + + if (marker == NULL) + return; + twin = marker->twinMarker; // read only after the NULL check + extractPassageMarker(marker); + + if (marker->nextInSequence != NULL + && marker->nextInSequence->twinMarker->nextInSequence == twin) + marker->nextInSequence->twinMarker->nextInSequence = NULL; + + if (twin->nextInSequence != NULL + && twin->nextInSequence->twinMarker->nextInSequence == marker) + twin->nextInSequence->twinMarker->nextInSequence = NULL; + + deallocatePassageMarker(twin); + deallocatePassageMarker(marker); + + //puts("Done destroying passage marker"); +} + +void destroyAllPassageMarkers() +{ + if (markerMemory != NULL) + destroyRecycleBin(markerMemory); + if (listMemory != NULL) + destroyRecycleBin(listMemory); +} + + +void setPreviousInSequence(PassageMarker * previous, PassageMarker * next) +{ + if (next == NULL) + return; + else if (previous == NULL) + next->twinMarker->nextInSequence = NULL; + else + next->twinMarker->nextInSequence = previous->twinMarker; +} + +void disconnectNextPassageMarker(PassageMarker * marker, Graph * graph) +{ + PassageMarker *middle = getNextInSequence(marker); + PassageMarker *next = getNextInSequence(middle); + + setPreviousInSequence(marker, next); + concatenatePassageMarkers(marker, middle); + setNextInSequence(middle, NULL); + setPreviousInSequence(NULL, middle); +} + +PassageMarker *getNextInNode(PassageMarker * marker) +{ + if (marker == NULL) + return NULL; + + return marker->nextInNode; +} + +void setNextInNode(PassageMarker * marker, PassageMarker * next) +{ + if (marker == NULL) + return; + + if (next == NULL) { + 
marker->nextInNode = NULL; + marker->twinMarker->nextInNode = NULL; + } else { + if (marker->twinMarker == NULL) { + printf("Dead marker in node %d %d\n", + getNodeID(getNode(marker)), + getPassageMarkerSequenceID(marker)); + abort(); + } + marker->nextInNode = next; + marker->twinMarker->nextInNode = next->twinMarker; + next->previousInNode = marker; + next->twinMarker->previousInNode = marker->twinMarker; + } +} + +void setTopOfTheNode(PassageMarker * marker) +{ + if (marker == NULL) + return; + + marker->previousInNode = marker; +} + +PassageMarker *getNextInSequence(PassageMarker * marker) +{ + if (marker == NULL || marker->nextInSequence == NULL) + return NULL; + + return marker->nextInSequence; +} + +PassageMarker *getPreviousInSequence(PassageMarker * marker) +{ + if (marker == NULL) + return NULL; + + if (marker->twinMarker->nextInSequence == NULL) + return NULL; + + return marker->twinMarker->nextInSequence->twinMarker; +} + +void +connectPassageMarkers(PassageMarker * previous, PassageMarker * next, + Graph * graph) +{ + if (previous != NULL) + setNextInSequence(previous, next); + + if (next != NULL) + setPreviousInSequence(previous, next); +} + +char *readPassageMarker(PassageMarker * marker) +{ + char *s = mallocOrExit(100, char); + + if (marker == NULL) + return s; + + sprintf(s, "MARKER %ld (%lld -> %lld):", (long) marker->sequenceID, + (long long) marker->start, (long long) getPassageMarkerFinish(marker)); + + if (getPreviousInSequence(marker) == NULL) + sprintf(s, "%s START -> %ld", s, + (long) getNodeID(getNode(marker))); + else + sprintf(s, "%s %ld -> %ld", s, + (long) getNodeID(getNode(getPreviousInSequence(marker))), + (long) getNodeID(getNode(marker))); + + if (getNextInSequence(marker) == NULL) + sprintf(s, "%s -> FINISH", s); + else + sprintf(s, "%s -> %ld ", s, + (long) getNodeID(getNode(getNextInSequence(marker)))); + + return s; +} + +char *readPassageMarkerSequence(PassageMarker * marker, + TightString ** sequences, int WORDLENGTH) +{ + 
 TightString *sequence; + int i; + char *s = NULL; + + if (marker == NULL) + return s; + + sequence = sequences[getAbsolutePassMarkerSeqID(marker) - 1]; + s = mallocOrExit(getPassageMarkerLength(marker) + 1, char); + + if (getPassageMarkerSequenceID(marker) > 0) + for (i = 0; i < getPassageMarkerLength(marker); i++) + s[i] = + getNucleotideChar(getPassageMarkerStart(marker) + + i + WORDLENGTH - 1, + sequence); + else + for (i = 0; i < getPassageMarkerLength(marker); i++) + s[i] = + getInverseNucleotideChar(getPassageMarkerStart + (marker) - i - 1, + sequence); + + s[getPassageMarkerLength(marker)] = '\0'; + + return s; +} + +PassageMarker *addPassageMarker(IDnum sequenceID, Coordinate start, + Node * node) +{ + PassageMarker *marker = allocatePassageMarker(); + PassageMarker *twinMarker = allocatePassageMarker(); + + marker->sequenceID = sequenceID; + marker->start = start; + marker->node = node; + marker->nextInSequence = NULL; + marker->finishOffset = 0; + marker->twinMarker = twinMarker; + marker->status = false; + + twinMarker->sequenceID = -sequenceID; + twinMarker->start = start + getNodeLength(node); + twinMarker->node = getTwinNode(node); + twinMarker->nextInSequence = NULL; + twinMarker->finishOffset = 0; + twinMarker->twinMarker = marker; + twinMarker->status = false; + + setNextInNode(marker, getMarker(node)); + setMarker(node, marker); + + return marker; +} + +PassageMarkerList *copyPassageMarkerList(PassageMarkerList * list) +{ + PassageMarkerList *copy; + PassageMarkerList *result = NULL; + PassageMarkerList *pointer; + + if (list == NULL) + return NULL; + + for (pointer = list; pointer != NULL; pointer = pointer->next) { + copy = allocatePassageMarkerList(); + copy->marker = pointer->marker; + copy->next = result; + result = copy; + } + + return result; +} + +PassageMarker *copyPassageMarker(PassageMarker * marker) +{ + PassageMarker *twin = marker->twinMarker; + PassageMarker *copy = allocatePassageMarker(); + PassageMarker *twinCopy = allocatePassageMarker(); 
+ + copy->sequenceID = marker->sequenceID; + copy->start = marker->start; + copy->nextInNode = NULL; + copy->previousInNode = NULL; + copy->node = NULL; + copy->nextInSequence = marker->nextInSequence; + copy->finishOffset = marker->finishOffset; + copy->status = false; + + twinCopy->sequenceID = twin->sequenceID; + twinCopy->start = twin->start; + twinCopy->nextInNode = NULL; + twinCopy->previousInNode = NULL; + twinCopy->node = NULL; + twinCopy->nextInSequence = twin->nextInSequence; + twinCopy->finishOffset = twin->finishOffset; + twinCopy->status = false; + + copy->twinMarker = twinCopy; + twinCopy->twinMarker = copy; + + return copy; +} + +void incrementFinishOffset(PassageMarker * marker, Coordinate offset) +{ + marker->finishOffset += offset; +} + +void incrementStartOffset(PassageMarker * marker, Coordinate offset) +{ + marker->twinMarker->finishOffset += offset; +} + +Coordinate getFinishOffset(PassageMarker * marker) +{ + return marker->finishOffset; +} + +void setFinishOffset(PassageMarker * marker, Coordinate offset) +{ + marker->finishOffset = offset; +} + +Coordinate getStartOffset(PassageMarker * marker) +{ + return marker->twinMarker->finishOffset; +} + +void setStartOffset(PassageMarker * marker, Coordinate offset) +{ + marker->twinMarker->finishOffset = offset; +} + +void transposePassageMarker(PassageMarker * marker, Node * node) +{ + marker->node = node; + marker->twinMarker->node = getTwinNode(node); + insertPassageMarker(marker, node); + insertPassageMarker(marker->twinMarker, getTwinNode(node)); +} + +PassageMarker *getTwinMarker(PassageMarker * marker) +{ + return marker->twinMarker; +} + +IDnum getPassageMarkerSequenceID(PassageMarker * marker) +{ + return marker->sequenceID; +} + +IDnum getAbsolutePassMarkerSeqID(PassageMarker * marker) +{ + IDnum ID = marker->sequenceID; + + if (ID > 0) + return ID; + else + return -ID; +} + +Node *getNode(PassageMarker * marker) +{ + if (marker == NULL) + return NULL; + + return marker->node; +} + +void 
concatenatePassageMarkers(PassageMarker * marker, + PassageMarker * next) +{ + + if (marker == NULL || next == NULL) + return; + + marker->finishOffset = next->finishOffset; + marker->twinMarker->start = next->twinMarker->start; + marker->nextInSequence = next->nextInSequence; +} + +boolean getPassageMarkerStatus(PassageMarker * marker) +{ + return marker->status; +} + +void setPassageMarkerStatus(PassageMarker * marker, boolean status) +{ + marker->status = status; + marker->twinMarker->status = status; +} + +boolean isDestinationToMarker(PassageMarker * marker, Node * node) +{ + if (marker->nextInSequence == NULL) + return false; + + return marker->nextInSequence->node == node; +} + +boolean isTerminal(PassageMarker * marker) +{ + if (marker == NULL) + return false; + + return marker->nextInSequence == NULL; +} + +boolean isInitial(PassageMarker * marker) +{ + if (marker == NULL) + return false; + + if (marker->twinMarker == NULL) { + printf("Unpaired marker seq %ld start %lld node %ld\n", + (long) marker->sequenceID, (long long) marker->start, + (long) getNodeID(marker->node)); + puts("SNAFU"); + abort(); + } + + return marker->twinMarker->nextInSequence == NULL; +} + +Coordinate getPassageMarkerStart(PassageMarker * marker) +{ + return marker->start; +} + +void setPassageMarkerStart(PassageMarker * marker, Coordinate start) +{ + marker->start = start; +} + +Coordinate getPassageMarkerFinish(PassageMarker * marker) +{ + if (marker->twinMarker->start == -10) + return -10; + + return marker->twinMarker->start; +} + +void setPassageMarkerFinish(PassageMarker * marker, Coordinate finish) +{ + if (finish == -10) + marker->twinMarker->start = -10; + + marker->twinMarker->start = finish; +} + +Coordinate getPassageMarkerLength(PassageMarker * marker) +{ + if (marker->start == -10 || marker->twinMarker->start == -10) + return 0; + + else if (marker->sequenceID > 0) + return marker->twinMarker->start - marker->start; + else + return marker->start - 
marker->twinMarker->start; +} + +int passageMarkerDirection(PassageMarker * marker) +{ + if (marker->sequenceID > 0) + return 1; + else + return -1; +} + +PassageMarker *addUncertainPassageMarker(IDnum sequenceID, Node * node) +{ + PassageMarker *marker = allocatePassageMarker(); + PassageMarker *twinMarker = allocatePassageMarker(); + + marker->sequenceID = sequenceID; + marker->start = -10; + marker->node = node; + marker->nextInSequence = NULL; + marker->finishOffset = 0; + marker->twinMarker = twinMarker; + marker->status = false; + + twinMarker->sequenceID = -sequenceID; + twinMarker->start = -10; + if (node == NULL) + twinMarker->node = NULL; + else + twinMarker->node = getTwinNode(node); + twinMarker->nextInSequence = NULL; + twinMarker->finishOffset = 0; + twinMarker->twinMarker = marker; + twinMarker->status = false; + + if (node != NULL) { + setNextInNode(marker, getMarker(node)); + setMarker(node, marker); + } + + return marker; +} + +PassageMarkerList *newPassageMarkerList(PassageMarker * marker, + PassageMarkerList * next) +{ + PassageMarkerList *list = allocatePassageMarkerList(); + list->marker = marker; + list->next = next; + return list; +} + +PassageMarker *newPassageMarker(IDnum seqID, Coordinate start, + Coordinate finish, Coordinate startOffset, + Coordinate finishOffset) +{ + PassageMarker *marker = allocatePassageMarker(); + PassageMarker *twinMarker = allocatePassageMarker(); + +// printf("Values %d\t%d\t%d\t%d\t%d\n", seqID, start, finish, startOffset, finishOffset); + + marker->sequenceID = seqID; + marker->node = NULL; + marker->nextInSequence = NULL; + marker->twinMarker = twinMarker; + marker->nextInNode = NULL; + marker->status = false; + + twinMarker->sequenceID = -seqID; + twinMarker->node = NULL; + twinMarker->nextInSequence = NULL; + twinMarker->twinMarker = marker; + twinMarker->nextInNode = NULL; + twinMarker->status = false; + + setPassageMarkerStart(marker, start); + setPassageMarkerFinish(marker, finish); + 
setStartOffset(marker, startOffset); + setFinishOffset(marker, finishOffset); + + if (getPassageMarkerLength(marker) < 0) { + printf("Negative marker %ld %lld %lld %lld\n", + (long) getPassageMarkerSequenceID(marker), + (long long) getPassageMarkerStart(marker), + (long long) getPassageMarkerFinish(marker), + (long long) getPassageMarkerLength(marker)); + abort(); + } + + return marker; +} + +void exportMarker(FILE * outfile, PassageMarker * marker, + TightString ** sequences, int WORDLENGTH) +{ + PassageMarker *current; + + if (marker->sequenceID > 0) { + if (!isInitial(marker)) { + return; + } + current = marker; + } else { + if (!isTerminal(marker)) { + return; + } + current = marker->twinMarker; + } + + fprintf(outfile, "SEQ\t%d\n", current->sequenceID); + for (; current != NULL; current = current->nextInSequence) { + fprintf(outfile, "%ld\t%lld\t%lld\t%lld\t%lld", + (long) getNodeID(current->node), (long long) getStartOffset(current), + (long long) getPassageMarkerStart(current), + (long long) getPassageMarkerFinish(current), + (long long) getFinishOffset(current)); + fprintf(outfile, "\n"); + } +} diff --git a/contrib/MetaVelvet-v0.3.1/src/passageMarker.h b/contrib/MetaVelvet-v0.3.1/src/passageMarker.h new file mode 100644 index 0000000..1891408 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/passageMarker.h @@ -0,0 +1,142 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _PASSAGEMARKER_H_ +#define _PASSAGEMARKER_H_ + +struct passageList_st { + PassageMarker *marker; + PassageMarkerList *next; +}; + +/////////////////////////////////////////////////////////////////// +// PassageMarker lists +/////////////////////////////////////////////////////////////////// +// You can always malloc a PassageMarkerList but these routines manage the +// memory for you, thus avoiding fragmentation +PassageMarkerList *newPassageMarkerList(PassageMarker * marker, + PassageMarkerList * next); + +void deallocatePassageMarkerList(PassageMarkerList * list); + +/////////////////////////////////////////////////////////////////// +// Creators/Destructors +/////////////////////////////////////////////////////////////////// +PassageMarker *addPassageMarker(IDnum sequenceID, Coordinate start, + Node * node); + +PassageMarker *copyPassageMarker(PassageMarker * marker); + +PassageMarker *addUncertainPassageMarker(IDnum sequenceID, Node * node); + +PassageMarker *newPassageMarker(IDnum seqID, Coordinate start, + Coordinate finish, Coordinate startOffset, + Coordinate finishOffset); + +// Deallocates but also removes all pointers towards that structure +void destroyPassageMarker(PassageMarker * marker); +void destroyAllPassageMarkers(); + +/////////////////////////////////////////////////////////////////// +// Node +/////////////////////////////////////////////////////////////////// + +// Current node +Node *getNode(PassageMarker * marker); + +// Yank out of current node +void extractPassageMarker(PassageMarker * marker); + +// Insert into a node +void transposePassageMarker(PassageMarker * marker, Node * destination); + +/////////////////////////////////////////////////////////////////// +// General Info 
+/////////////////////////////////////////////////////////////////// +// Export into file +void exportMarker(FILE * outfile, PassageMarker * marker, + TightString ** sequences, int wordLength); + +// General info for debugging +char *readPassageMarker(PassageMarker * marker); + +// String description +char *readPassageMarkerSequence(PassageMarker * marker, + TightString ** sequences, int WORDLENGTH); + +// Sequence ID associated to the passage marker +IDnum getPassageMarkerSequenceID(PassageMarker * marker); +IDnum getAbsolutePassMarkerSeqID(PassageMarker * marker); +int passageMarkerDirection(PassageMarker * marker); + +// Coordinates +Coordinate getPassageMarkerStart(PassageMarker * marker); +void setPassageMarkerStart(PassageMarker * marker, Coordinate start); +Coordinate getPassageMarkerFinish(PassageMarker * marker); +void setPassageMarkerFinish(PassageMarker * marker, Coordinate finish); +Coordinate getPassageMarkerLength(PassageMarker * marker); + +// Offsets +Coordinate getStartOffset(PassageMarker * marker); +void setStartOffset(PassageMarker * marker, Coordinate offset); +void incrementStartOffset(PassageMarker * marker, Coordinate offset); +Coordinate getFinishOffset(PassageMarker * marker); +void setFinishOffset(PassageMarker * marker, Coordinate offset); +void incrementFinishOffset(PassageMarker * marker, Coordinate offset); + +// Status +void setPassageMarkerStatus(PassageMarker * marker, boolean status); +boolean getPassageMarkerStatus(PassageMarker * marker); + +/////////////////////////////////////////////////////////////////// +// Marker Sequences +/////////////////////////////////////////////////////////////////// + +// Corresponding marker of reverse complement sequence +PassageMarker *getTwinMarker(PassageMarker * marker); + +// Within a node +PassageMarker *getNextInNode(PassageMarker * marker); +void setNextInNode(PassageMarker * marker, PassageMarker * next); +void setTopOfTheNode(PassageMarker * marker); + +// Within a sequence 
+PassageMarker *getNextInSequence(PassageMarker * marker); +void setNextInSequence(PassageMarker * previous, PassageMarker * next); +PassageMarker *getPreviousInSequence(PassageMarker * marker); +void setPreviousInSequence(PassageMarker * previous, PassageMarker * next); +void connectPassageMarkers(PassageMarker * previous, PassageMarker * next, + Graph * graph); + +// End of read chains +boolean isTerminal(PassageMarker * marker); +boolean isInitial(PassageMarker * marker); + +// Checks whether the node of the next marker is the one given in parameter +boolean isDestinationToMarker(PassageMarker * marker, Node * node); + +// Bypasses the middle marker +void disconnectNextPassageMarker(PassageMarker * marker, Graph * graph); + +// Merge two markers (cf concatenateGraph()) +void concatenatePassageMarkers(PassageMarker * marker, + PassageMarker * nextMarker); + +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/preGraph.c b/contrib/MetaVelvet-v0.3.1/src/preGraph.c new file mode 100644 index 0000000..bde0433 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/preGraph.c @@ -0,0 +1,1204 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include +#include + +#include "globals.h" +#include "preGraph.h" +#include "recycleBin.h" +#include "tightString.h" +#include "run.h" +#include "utility.h" + +#define ADENINE 0 +#define CYTOSINE 1 +#define GUANINE 2 +#define THYMINE 3 + +struct preArc_st { + PreArc *nextLeft; + PreArc *nextRight; + IDnum multiplicity; + IDnum preNodeIDLeft; + IDnum preNodeIDRight; +}; + +struct preNode_st { + PreArc *preArcLeft; + PreArc *preArcRight; + Descriptor *descriptor; + Coordinate length; +}; + +struct preGraph_st { + PreNode *preNodes; + IDnum sequenceCount; + IDnum preNodeCount; + int wordLength; + boolean double_strand; +}; + +static RecycleBin *preArcMemory = NULL; + +#define BLOCKSIZE 10000 + +PreArc *allocatePreArc_pg() +{ + if (preArcMemory == NULL) + preArcMemory = newRecycleBin(sizeof(PreArc), BLOCKSIZE); + + return allocatePointer(preArcMemory); +} + +void deallocatePreArc_pg(PreArc * preArc) +{ + deallocatePointer(preArcMemory, preArc); +} + +// Returns the length of the preNode's descriptor list +Coordinate getPreNodeLength_pg(IDnum preNodeID, PreGraph * preGraph) +{ + IDnum ID = preNodeID; + + if (ID < 0) + ID = -ID; + + return (preGraph->preNodes[ID]).length; +} + +// Returns the number of preNodes in the preGraph +IDnum preNodeCount_pg(PreGraph * preGraph) +{ + return preGraph->preNodeCount; +} + +// Returns the number of sequences used to build the preGraph +IDnum sequenceCount_pg(PreGraph * preGraph) +{ + return preGraph->sequenceCount; +} + +PreArc *getPreArcBetweenPreNodes_pg(IDnum originPreNodeID, + IDnum destinationPreNodeID, + PreGraph * preGraph) +{ + PreArc *preArc; + + if (originPreNodeID == 0 || destinationPreNodeID == 0) { + return NULL; + } + + for (preArc = getPreArc_pg(originPreNodeID, preGraph); + preArc != NULL; + 
 preArc = getNextPreArc_pg(preArc, originPreNodeID)) { + if (getDestination_pg(preArc, originPreNodeID) == + destinationPreNodeID) { + return preArc; + } + } + + return NULL; +} + +static void addPreArcToPreNode_pg(PreArc * preArc, IDnum preNodeID, + PreGraph * preGraph) +{ + IDnum ID = preNodeID; + PreNode *preNode; + PreArc **preArcPtr; + + if (ID < 0) + ID = -ID; + + preNode = &(preGraph->preNodes[ID]); + + if (preNodeID > 0) + preArcPtr = &(preNode->preArcRight); + else + preArcPtr = &(preNode->preArcLeft); + + if (preNodeID == preArc->preNodeIDLeft) { + preArc->nextLeft = *preArcPtr; + *preArcPtr = preArc; + } + + if (preNodeID == preArc->preNodeIDRight) { + preArc->nextRight = *preArcPtr; + *preArcPtr = preArc; + } +} + +// Creates a preArc from preNode origin to preNode destination. +// If this preArc already exists, increments its multiplicity by 1. +PreArc *createPreArc_pg(IDnum originPreNodeID, IDnum destinationPreNodeID, + PreGraph * preGraph) +{ + PreArc *preArc; + + + if (originPreNodeID == 0 || destinationPreNodeID == 0) + return NULL; + + preArc = + getPreArcBetweenPreNodes_pg(originPreNodeID, + destinationPreNodeID, preGraph); + + if (preArc != NULL) { + preArc->multiplicity++; + return preArc; + } + // If not found + preArc = allocatePreArc_pg(); + preArc->preNodeIDLeft = originPreNodeID; + preArc->preNodeIDRight = -destinationPreNodeID; + preArc->multiplicity = 1; + + addPreArcToPreNode_pg(preArc, originPreNodeID, preGraph); + + // Hairpin case + if (destinationPreNodeID == -originPreNodeID) { + preArc->multiplicity++; + return preArc; + } + + addPreArcToPreNode_pg(preArc, -destinationPreNodeID, preGraph); + + return preArc; +} + +void createAnalogousPreArc_pg(IDnum originPreNodeID, + IDnum destinationPreNodeID, + PreArc * refPreArc, PreGraph * preGraph) +{ + PreArc *preArc; + + if (originPreNodeID == 0 || destinationPreNodeID == 0) + return; + + preArc = + getPreArcBetweenPreNodes_pg(originPreNodeID, + destinationPreNodeID, preGraph); + + if 
(preArc != NULL) { + preArc->multiplicity += refPreArc->multiplicity; + return; + } + // If not found + preArc = allocatePreArc_pg(); + preArc->preNodeIDLeft = originPreNodeID; + preArc->preNodeIDRight = -destinationPreNodeID; + preArc->multiplicity = refPreArc->multiplicity; + + addPreArcToPreNode_pg(preArc, originPreNodeID, preGraph); + + // Hairpin case + if (destinationPreNodeID == -originPreNodeID) { + preArc->multiplicity++; + return; + } + + addPreArcToPreNode_pg(preArc, -destinationPreNodeID, preGraph); +} + +void changeMultiplicity_pg(PreArc * preArc, IDnum variation) +{ + if (preArc == NULL) + return; + preArc->multiplicity += variation; +} + +static void setNextPreArc_pg(PreArc * preArc, IDnum preNodeID, + PreArc * nextPreArc) +{ + if (preNodeID == preArc->preNodeIDLeft) + preArc->nextLeft = nextPreArc; + if (preNodeID == preArc->preNodeIDRight) + preArc->nextRight = nextPreArc; +} + +void removePreArcFromList_pg(PreArc * preArc, IDnum preNodeID, + PreGraph * preGraph) +{ + IDnum ID = preNodeID; + PreNode *preNode; + PreArc **preArcPtr; + PreArc *tempPreArc; + + if (ID < 0) + ID = -ID; + + preNode = &(preGraph->preNodes[ID]); + + if (preNodeID > 0) + preArcPtr = &(preNode->preArcRight); + else + preArcPtr = &(preNode->preArcLeft); + + if (*preArcPtr == preArc) { + *preArcPtr = getNextPreArc_pg(preArc, preNodeID); + return; + } + + for (tempPreArc = *preArcPtr; tempPreArc != NULL; + tempPreArc = getNextPreArc_pg(tempPreArc, preNodeID)) + if (getNextPreArc_pg(tempPreArc, preNodeID) == preArc) + setNextPreArc_pg(tempPreArc, preNodeID, + getNextPreArc_pg(preArc, + preNodeID)); +} + +void destroyPreArc_pg(PreArc * preArc, PreGraph * preGraph) +{ + IDnum leftID, rightID; + + if (preArc == NULL) + return; + + leftID = preArc->preNodeIDLeft; + rightID = preArc->preNodeIDRight; + + // Removing preArc from list + removePreArcFromList_pg(preArc, leftID, preGraph); + + // Removing preArc's twin from list + if (rightID != leftID) + removePreArcFromList_pg(preArc, 
rightID, preGraph); + + deallocatePreArc_pg(preArc); +} + +void destroyPreNode_pg(IDnum preNodeID, PreGraph * preGraph) +{ + PreNode *preNode; + IDnum ID = preNodeID; + + //printf("Destroying %ld\n and twin %ld\n", getPreNodeID(preNode), getPreNodeID(twin)); + + if (ID < 0) + ID = -ID; + + preNode = &(preGraph->preNodes[ID]); + + // PreNode preArcs: + while (preNode->preArcLeft != NULL) + destroyPreArc_pg(preNode->preArcLeft, preGraph); + while (preNode->preArcRight != NULL) + destroyPreArc_pg(preNode->preArcRight, preGraph); + + // Descriptors + free(preNode->descriptor); + + // Flag as destroyed + preNode->descriptor = NULL; +} + +void destroyPreGraph_pg(PreGraph * preGraph) +{ + IDnum index; + PreNode *preNode = &(preGraph->preNodes[1]); + + // Descriptors + for (index = 1; index <= preGraph->preNodeCount; index++) { + free(preNode->descriptor); + preNode++; + } + + // Arcs + destroyRecycleBin(preArcMemory); + + // Nodes + free(preGraph->preNodes); + + // Graph + free(preGraph); + +} + +static Nucleotide getNucleotideInDescriptor_pg(Descriptor * descriptor, + Coordinate i) +{ + Descriptor *fourMer = descriptor + i / 4; + + switch (i % 4) { + case 0: + return (*fourMer & 3); + case 1: + return (*fourMer & 12) >> 2; + case 2: + return (*fourMer & 48) >> 4; + case 3: + return (*fourMer & 192) >> 6; + } + return 0; +} + +PreNode *getPreNodeInPreGraph_pg(PreGraph * preGraph, IDnum preNodeID) +{ + PreNode *preNode; + if (preNodeID <= 0) + abort(); + else { + preNode = &(preGraph->preNodes[preNodeID]); + if (preNode->descriptor != NULL) + return preNode; + else + return NULL; + } + return NULL; +} + +PreArc *getPreArc_pg(IDnum preNodeID, PreGraph * preGraph) +{ + IDnum ID = preNodeID; + PreNode *preNode; + + if (ID < 0) + ID = -ID; + + preNode = &(preGraph->preNodes[ID]); + + if (preNodeID > 0) + return preNode->preArcRight; + else + return preNode->preArcLeft; +} + +PreArc *getNextPreArc_pg(PreArc * preArc, IDnum preNodeID) +{ + if (preNodeID == preArc->preNodeIDLeft) 
+ return preArc->nextLeft; + else + return preArc->nextRight; +} + +IDnum getMultiplicity_pg(PreArc * preArc) +{ + if (preArc == NULL) + return 0; + + return preArc->multiplicity; +} + +IDnum getOtherEnd_pg(PreArc * preArc, IDnum preNodeID) +{ + if (preNodeID == preArc->preNodeIDLeft) + return preArc->preNodeIDRight; + else + return preArc->preNodeIDLeft; +} + +IDnum getDestination_pg(PreArc * preArc, IDnum preNodeID) +{ + if (preArc == NULL) + return 0; + + if (preNodeID == preArc->preNodeIDLeft) + return -preArc->preNodeIDRight; + else + return -preArc->preNodeIDLeft; +} + +static void writeNucleotideInDescriptor_pg(Nucleotide nucleotide, + Descriptor * descriptor, + Coordinate i) +{ + Descriptor *fourMer = descriptor + i / 4; + switch (i % 4) { + case 3: + *fourMer &= 63; + *fourMer += nucleotide << 6; + return; + case 2: + *fourMer &= 207; + *fourMer += nucleotide << 4; + return; + case 1: + *fourMer &= 243; + *fourMer += nucleotide << 2; + return; + case 0: + *fourMer &= 252; + *fourMer += nucleotide; + } +} + +static inline Descriptor *mergeDescriptors_pg(Descriptor * descr, + Coordinate destinationLength, + Descriptor * copy, + Coordinate sourceLength, + int wordLength) +{ + Descriptor *readPtr, *writePtr; + Descriptor readCopy = 0; + int readOffset, writeOffset; + size_t arrayLength; + Coordinate newLength = + destinationLength + sourceLength + wordLength - 1; + Descriptor *new; + Coordinate index; + + // Specify new array + arrayLength = newLength / 4; + if (newLength % 4) + arrayLength++; + new = callocOrExit(arrayLength, Descriptor); + for (index = 0; index < arrayLength; index++) + new[index] = 0; + + // Copying first descriptor + readPtr = descr; + writePtr = new; + writeOffset = 0; + for (index = 0; index < destinationLength + wordLength - 1; + index++) { + (*writePtr) >>= 2; + if (writeOffset == 0) + readCopy = *readPtr; + (*writePtr) += (readCopy & 3) << 6; + + /*switch ((readCopy & 3)) { + case ADENINE: + printf("A%ld", index); + break; + case 
CYTOSINE: + printf("C%ld", index); + break; + case GUANINE: + printf("G%ld", index); + break; + case THYMINE: + printf("T%ld", index); + break; + } */ + readCopy >>= 2; + + writeOffset++; + if (writeOffset == 4) { + writePtr++; + readPtr++; + writeOffset = 0; + } + } + + //puts(""); + + // Skipping initial k-1 letters in second descriptor + readPtr = &(copy[(wordLength - 1) / 4]); + readCopy = *readPtr; + readOffset = (wordLength - 1) % 4; + readCopy >>= (readOffset * 2); + + // Going on copying second descriptor + for (index = 0; index < sourceLength; index++) { + (*writePtr) >>= 2; + if (readOffset == 0) + readCopy = *readPtr; + (*writePtr) += (readCopy & 3) << 6; + /*switch ((readCopy & 3)) { + case ADENINE: + printf("A%ld", index); + break; + case CYTOSINE: + printf("C%ld", index); + break; + case GUANINE: + printf("G%ld", index); + break; + case THYMINE: + printf("T%ld", index); + break; + default: + printf("?%ld;", index); + } */ + readCopy >>= 2; + + writeOffset++; + if (writeOffset == 4) { + writePtr++; + writeOffset = 0; + } + + readOffset++; + if (readOffset == 4) { + readPtr++; + readOffset = 0; + } + } + + //puts(""); + + if (writeOffset != 0) { + while (writeOffset != 4) { + (*writePtr) >>= 2; + writeOffset++; + } + } + + return new; +} + +static inline Descriptor *mergeDescriptorsH2H_pg(Descriptor * descr, + Coordinate + destinationLength, + Descriptor * copy, + Coordinate sourceLength, + int wordLength) +{ + Descriptor *readPtr, *writePtr; + Descriptor readCopy; + int readOffset, writeOffset; + size_t arrayLength; + Coordinate newLength = + destinationLength + sourceLength + wordLength - 1; + Descriptor *new; + Coordinate index; + + // Specify new array + arrayLength = newLength / 4; + if (newLength % 4) + arrayLength++; + new = callocOrExit(arrayLength, Descriptor); + for (index = 0; index < arrayLength; index++) + new[index] = 0; + + // Copying first descriptor (including final (k-1)-mer) + readPtr = descr; + readCopy = *readPtr; + writePtr = new; 
+ writeOffset = 0; + readOffset = 0; + for (index = 0; index < destinationLength + wordLength - 1; + index++) { + (*writePtr) >>= 2; + if (writeOffset == 0) + readCopy = *readPtr; + (*writePtr) += (readCopy & 3) << 6; + /*switch ((readCopy & 3)) { + case ADENINE: + printf("A(%ld %i %i) ", index, writeOffset, readOffset); + break; + case CYTOSINE: + printf("C(%ld %i %i) ", index, writeOffset, readOffset); + break; + case GUANINE: + printf("G(%ld %i %i) ", index, writeOffset, readOffset); + break; + case THYMINE: + printf("T(%ld %i %i) ", index, writeOffset, readOffset); + break; + default: + printf("?(%ld %i %i);", index, writeOffset, readOffset); + } */ + readCopy >>= 2; + + writeOffset++; + if (writeOffset == 4) { + writePtr++; + readPtr++; + writeOffset = 0; + } + } + + //puts(""); + + // Going to end of second descriptor + readPtr = &(copy[(sourceLength - 1) / 4]); + readCopy = *readPtr; + readOffset = (sourceLength - 1) % 4; + readCopy <<= ((3 - readOffset) * 2); + + //printf("Read copy %x\n", readCopy); + + // Going on copying reverse complement of second descriptor + for (index = 0; index < sourceLength; index++) { + (*writePtr) >>= 2; + if (readOffset == 3) + readCopy = *readPtr; +#ifndef COLOR + (*writePtr) += 192 - (readCopy & 192); +#else + (*writePtr) += (readCopy & 192); +#endif + /*switch (3 - ((readCopy & 192) >> 6)) { + case ADENINE: + printf("A(%ld %i %i) ", index, writeOffset, readOffset); + break; + case CYTOSINE: + printf("C(%ld %i %i) ", index, writeOffset, readOffset); + break; + case GUANINE: + printf("G(%ld %i %i) ", index, writeOffset, readOffset); + break; + case THYMINE: + printf("T(%ld %i %i) ", index, writeOffset, readOffset); + break; + default: + printf("?(%ld %i %i);", index, writeOffset, readOffset); + } */ + readCopy <<= 2; + + writeOffset++; + if (writeOffset == 4) { + writePtr++; + writeOffset = 0; + } + + readOffset--; + if (readOffset == -1) { + readPtr--; + readOffset = 3; + } + } + + //puts(""); + + if (writeOffset != 0) { + 
while (writeOffset != 4) { + (*writePtr) >>= 2; + writeOffset++; + } + } + + return new; +} + +static inline Descriptor *mergeDescriptorsF2F_pg(Descriptor * descr, + Coordinate + destinationLength, + Descriptor * copy, + Coordinate sourceLength, + int wordLength) +{ + Descriptor *readPtr, *writePtr; + Descriptor readCopy; + int readOffset, writeOffset; + size_t arrayLength; + Coordinate newLength = + destinationLength + sourceLength + wordLength - 1; + Descriptor *new; + Coordinate index; + + // Specify new array + arrayLength = newLength / 4; + if (newLength % 4) + arrayLength++; + new = callocOrExit(arrayLength, Descriptor); + for (index = 0; index < arrayLength; index++) + new[index] = 0; + + writePtr = new; + writeOffset = 0; + + // Going to end of first descriptor + readPtr = &(copy[(sourceLength + wordLength - 2) / 4]); + readCopy = *readPtr; + readOffset = (sourceLength + wordLength - 2) % 4; + readCopy <<= ((3 - readOffset) * 2); + + // Copying reverse complement of first descriptor (minus final (k-1)-mer) + for (index = 0; index < sourceLength; index++) { + (*writePtr) >>= 2; + if (readOffset == 3) + readCopy = *readPtr; +#ifndef COLOR + (*writePtr) += 192 - (readCopy & 192); +#else + (*writePtr) += (readCopy & 192); +#endif + /*switch (3 - ((readCopy & 192) >> 6)) { + case ADENINE: + printf("A(%ld %i %i) ", index, writeOffset, readOffset); + break; + case CYTOSINE: + printf("C(%ld %i %i) ", index, writeOffset, readOffset); + break; + case GUANINE: + printf("G(%ld %i %i) ", index, writeOffset, readOffset); + break; + case THYMINE: + printf("T(%ld %i %i) ", index, writeOffset, readOffset); + break; + default: + printf("?(%ld %i %i);", index, writeOffset, readOffset); + } */ + readCopy <<= 2; + + writeOffset++; + if (writeOffset == 4) { + writePtr++; + writeOffset = 0; + } + + readOffset--; + if (readOffset == -1) { + readPtr--; + readOffset = 3; + } + } + + //puts(""); + + // Going on copying second descriptor + readPtr = descr; + readCopy = *readPtr; + 
readOffset = 0; + + for (index = 0; index < destinationLength + wordLength - 1; + index++) { + (*writePtr) >>= 2; + if (readOffset == 0) + readCopy = *readPtr; + (*writePtr) += (readCopy & 3) << 6; + /*switch ((readCopy & 3)) { + case ADENINE: + printf("A(%ld %i %i) ", index, writeOffset, readOffset); + break; + case CYTOSINE: + printf("C(%ld %i %i) ", index, writeOffset, readOffset); + break; + case GUANINE: + printf("G(%ld %i %i) ", index, writeOffset, readOffset); + break; + case THYMINE: + printf("T(%ld %i %i) ", index, writeOffset, readOffset); + break; + default: + printf("?(%ld %i %i);", index, writeOffset, readOffset); + } */ + readCopy >>= 2; + + writeOffset++; + if (writeOffset == 4) { + writePtr++; + writeOffset = 0; + } + + readOffset++; + if (readOffset == 4) { + readPtr++; + readOffset = 0; + } + } + + //puts(""); + + if (writeOffset != 0) { + while (writeOffset != 4) { + (*writePtr) >>= 2; + writeOffset++; + } + } + + return new; +} + +void setMultiplicity_pg(PreArc * preArc, IDnum mult) +{ + preArc->multiplicity = mult; +} + +static void updatePreArcData_pg(PreArc * preArc, IDnum oldPreNodeID, + IDnum newPreNodeID) +{ + if (preArc->preNodeIDLeft == oldPreNodeID) + preArc->preNodeIDLeft = newPreNodeID; + if (preArc->preNodeIDRight == oldPreNodeID) + preArc->preNodeIDRight = newPreNodeID; +} + +// Reshuffles the preGraph->preNodes array to remove NULL pointers +// Beware that preNode IDs are accordingly reshuffled (all pointers remain valid though) +void renumberPreNodes_pg(PreGraph * preGraph) +{ + IDnum preNodeIndex; + PreNode *currentPreNode, *destinationPreNode; + IDnum counter = 0; + IDnum preNodes = preGraph->preNodeCount; + IDnum newIndex; + PreArc *preArc; + + puts("Renumbering preNodes"); + printf("Initial preNode count %d\n", preGraph->preNodeCount); + + for (preNodeIndex = 1; preNodeIndex <= preNodes; preNodeIndex++) { + currentPreNode = &(preGraph->preNodes[preNodeIndex]); + + if (currentPreNode->descriptor == NULL) + counter++; + else if 
(counter != 0) { + newIndex = preNodeIndex - counter; + destinationPreNode = + &(preGraph->preNodes[newIndex]); + + destinationPreNode->preArcLeft = + currentPreNode->preArcLeft; + destinationPreNode->preArcRight = + currentPreNode->preArcRight; + destinationPreNode->descriptor = + currentPreNode->descriptor; + destinationPreNode->length = + currentPreNode->length; + + for (preArc = getPreArc_pg(newIndex, preGraph); + preArc != NULL; + preArc = getNextPreArc_pg(preArc, newIndex)) + updatePreArcData_pg(preArc, preNodeIndex, + newIndex); + for (preArc = getPreArc_pg(-newIndex, preGraph); + preArc != NULL; + preArc = getNextPreArc_pg(preArc, -newIndex)) + updatePreArcData_pg(preArc, -preNodeIndex, + -newIndex); + } + } + + preGraph->preNodeCount -= counter; + preGraph->preNodes = reallocOrExit(preGraph->preNodes, + preGraph->preNodeCount + + 1, PreNode); + + printf("Destroyed %d preNodes\n", counter); +} + +// Allocate memory for an empty preGraph created with sequenceCount different sequences +PreGraph *emptyPreGraph_pg(IDnum sequenceCount, int wordLength, boolean double_strand) +{ + PreGraph *newPreGraph = mallocOrExit(1, PreGraph); + newPreGraph->sequenceCount = sequenceCount; + newPreGraph->wordLength = wordLength; + newPreGraph->preNodeCount = 0; + newPreGraph->double_strand = double_strand; + return newPreGraph; +} + +static Descriptor *newDescriptor_pg(Coordinate length, FILE * file, + Kmer * initialKmer, int wordLength) +{ + char letter; + Nucleotide nucleotide; + Coordinate totalLength = length + wordLength - 1; + size_t arrayLength = totalLength / 4; + Descriptor *res; + Coordinate index; + Kmer kmerCopy; + + if (totalLength % 4 > 0) + arrayLength++; + + res = callocOrExit(arrayLength, Descriptor); + + copyKmers(&kmerCopy, initialKmer); + for (index = wordLength - 2; index >= 0; index--) + writeNucleotideInDescriptor_pg(popNucleotide(&kmerCopy), res, + index); + + for (index = wordLength - 1; index < totalLength; index++) { + letter = getc(file); + while 
(!isalpha(letter)) + letter = getc(file); + + //printf("%c", letter); + switch (letter) { + case 'A': + nucleotide = ADENINE; + break; + case 'C': + nucleotide = CYTOSINE; + break; + case 'G': + nucleotide = GUANINE; + break; + case 'T': + nucleotide = THYMINE; + break; + default: + fflush(stdout); + abort(); + } + + writeNucleotideInDescriptor_pg(nucleotide, res, index); + pushNucleotide(initialKmer, nucleotide); + } + + //printf(" "); + + return res; +} + +void allocatePreNodeSpace_pg(PreGraph * preGraph, IDnum preNodeCount) +{ + preGraph->preNodes = callocOrExit(preNodeCount + 1, PreNode); + preGraph->preNodeCount = preNodeCount; +} + +void addPreNodeToPreGraph_pg(PreGraph * preGraph, Coordinate start, + Coordinate finish, FILE * file, + Kmer * initialKmer, IDnum ID) +{ + PreNode *newnd = &(preGraph->preNodes[ID]); + + newnd->preArcLeft = NULL; + newnd->preArcRight = NULL; + + newnd->length = finish - start; + + newnd->descriptor = + newDescriptor_pg(newnd->length, file, initialKmer, + preGraph->wordLength); +} + +static void exportPreNode_pg(FILE * outfile, PreNode * preNode, IDnum ID, + int wordLength) +{ + Coordinate index; + Nucleotide nucleotide; + + if (preNode == NULL) + return; + + fprintf(outfile, "NODE\t%ld\t%lld\n", (long) ID, (long long) preNode->length); + + if (preNode->length == 0) { + fprintf(outfile, "\n"); + return; + } + + for (index = 0; index < preNode->length + wordLength - 1; index++) { + nucleotide = + getNucleotideInDescriptor_pg(preNode->descriptor, + index); + switch (nucleotide) { + case ADENINE: + fprintf(outfile, "A"); + break; + case CYTOSINE: + fprintf(outfile, "C"); + break; + case GUANINE: + fprintf(outfile, "G"); + break; + case THYMINE: + fprintf(outfile, "T"); + break; + } + } + + fprintf(outfile, "\n"); +} + +void exportPreGraph_pg(char *filename, PreGraph * preGraph) +{ + IDnum index; + FILE *outfile; + PreNode *preNode; + int wordLength; + + // Check for NULL before reading any field of preGraph + if (preGraph == NULL) { + return; + } + + wordLength = getWordLength_pg(preGraph); + + outfile = 
fopen(filename, "w"); + if (outfile == NULL) { + puts("Couldn't open file, sorry"); + return; + } else + printf("Writing into pregraph file %s...\n", filename); + + // General data + fprintf(outfile, "%ld\t%ld\t%i\t%hi\n", (long) preGraph->preNodeCount, + (long) preGraph->sequenceCount, preGraph->wordLength, (short) preGraph->double_strand); + + // PreNode info + for (index = 1; index <= preGraph->preNodeCount; index++) { + preNode = getPreNodeInPreGraph_pg(preGraph, index); + exportPreNode_pg(outfile, preNode, index, wordLength); + } + + fclose(outfile); +} + +int getWordLength_pg(PreGraph * preGraph) +{ + return preGraph->wordLength; +} + +void displayPreArcMemory_pg() +{ + if (preArcMemory == NULL) + return; + printf("ARC MEMORY %lld allocated %lld free\n", + (long long) RecycleBin_memory_usage(preArcMemory), + (long long) recycleBinFreeSpace(preArcMemory)); +} + +boolean hasSinglePreArc_pg(IDnum preNodeID, PreGraph * preGraph) +{ + IDnum ID = preNodeID; + PreNode *preNode; + PreArc *preArc; + + if (ID < 0) + ID = -ID; + + preNode = &(preGraph->preNodes[ID]); + + if (preNodeID > 0) + preArc = preNode->preArcRight; + else + preArc = preNode->preArcLeft; + + return (preArc != NULL + && getNextPreArc_pg(preArc, preNodeID) == NULL); +} + +char simplePreArcCount_pg(IDnum preNodeID, PreGraph * preGraph) +{ + PreNode *preNode; + PreArc *preArc; + char count = 0; + IDnum ID = preNodeID; + + if (ID < 0) + ID = -ID; + + preNode = &(preGraph->preNodes[ID]); + + if (preNodeID > 0) + preArc = preNode->preArcRight; + else + preArc = preNode->preArcLeft; + + for (; preArc != NULL; + preArc = getNextPreArc_pg(preArc, preNodeID)) + count++; + + return count; +} + +boolean isLoop_pg(PreArc * preArc) +{ + return (preArc->preNodeIDLeft == preArc->preNodeIDRight + || preArc->preNodeIDLeft == -preArc->preNodeIDRight); +} + +void setPreNodeDescriptor_pg(Descriptor * descr, Coordinate length, IDnum preNodeID, PreGraph * preGraph) { + PreNode * preNode; + + if (preNodeID < 0) + 
preNodeID = -preNodeID; + + preNode = getPreNodeInPreGraph_pg(preGraph, preNodeID); + free(preNode->descriptor); + preNode->descriptor = descr; + preNode->length = length; +} + +static void appendPositiveDescriptor_pg(Descriptor ** writePtr, int * writeOffset, IDnum preNodeID, PreGraph * preGraph, boolean initial) { + PreNode * preNode = getPreNodeInPreGraph_pg(preGraph, preNodeID); + Descriptor * readPtr = preNode->descriptor; + Descriptor readCopy; + int wordLength = getWordLength_pg(preGraph); + Coordinate length = preNode->length; + Coordinate index; + int readOffset = 0; + + if (initial) { + index = 0; + readPtr = preNode->descriptor; + readCopy = *readPtr; + readOffset = 0; + } else { + index = wordLength - 1; + readPtr = &(preNode->descriptor[(wordLength - 1) / 4]); + readCopy = *readPtr; + readOffset = (wordLength - 1) % 4; + readCopy >>= (readOffset * 2); + } + + for (; index < length + wordLength - 1; index++) { + (**writePtr) >>= 2; + if (readOffset == 0) + readCopy = *readPtr; + (**writePtr) += (readCopy & 3) << 6; + readCopy >>= 2; + + if (++(*writeOffset) == 4) { + (*writePtr)++; + *writeOffset = 0; + } + + if (++readOffset == 4) { + readPtr++; + readOffset = 0; + } + } +} + +static void appendNegativeDescriptor_pg(Descriptor ** writePtr, int * writeOffset, IDnum preNodeID, PreGraph * preGraph, boolean initial) { + PreNode * preNode = getPreNodeInPreGraph_pg(preGraph, preNodeID); + Descriptor * readPtr = preNode->descriptor; + Descriptor readCopy; + int wordLength = getWordLength_pg(preGraph); + Coordinate length = preNode->length; + Coordinate index; + int readOffset; + + if (initial) + length += wordLength - 1; + + readPtr = &(preNode->descriptor[(length - 1) / 4]); + readCopy = *readPtr; + readOffset = (length - 1) % 4; + readCopy <<= ((3 - readOffset) * 2); + + for (index = 0; index < length; index++) { + (**writePtr) >>= 2; + if (readOffset == 3) + readCopy = *readPtr; +#ifndef COLOR + (**writePtr) += 192 - (readCopy & 192); +#else + (**writePtr) 
+= (readCopy & 192); +#endif + readCopy <<= 2; + + (*writeOffset)++; + if (*writeOffset == 4) { + (*writePtr)++; + *writeOffset = 0; + } + + readOffset--; + if (readOffset == -1) { + readPtr--; + readOffset = 3; + } + } +} + +void appendDescriptors_pg(Descriptor ** start, int * writeOffset, IDnum preNodeID, PreGraph* preGraph, boolean initial) { + if (preNodeID > 0) + appendPositiveDescriptor_pg(start, writeOffset, preNodeID, preGraph, initial); + else + appendNegativeDescriptor_pg(start, writeOffset, -preNodeID, preGraph, initial); +} diff --git a/contrib/MetaVelvet-v0.3.1/src/preGraph.h b/contrib/MetaVelvet-v0.3.1/src/preGraph.h new file mode 100644 index 0000000..87bad5c --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/preGraph.h @@ -0,0 +1,102 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _PREGRAPH_H_ +#define _PREGRAPH_H_ + +//////////////////////////////////////////////////////////// +// PreNode functions +//////////////////////////////////////////////////////////// + +//Creators/destructor +PreNode *newPreNode_pg(Coordinate start, + Coordinate finish, + FILE * file, Kmer * initialKmer, int wordLength); +void destroyPreNode_pg(IDnum preNode, PreGraph * preGraph); + +// Locator +PreNode *getPreNodeInPreGraph_pg(PreGraph * preGraph, IDnum preNodeID); + +// PreArc info +PreArc *getPreArc_pg(IDnum preNodeID, PreGraph * preGraph); +boolean hasSinglePreArc_pg(IDnum preNodeID, PreGraph * graph); +char simplePreArcCount_pg(IDnum preNodeID, PreGraph * preGraph); + +// Descriptor +Coordinate getPreNodeLength_pg(IDnum preNodeID, PreGraph * preGraph); +void setPreNodeDescriptor_pg(Descriptor * descr, Coordinate length, IDnum preNodeID, PreGraph * preGraph); +void appendDescriptors_pg(Descriptor ** start, int * writeOffset, IDnum preNodeID, PreGraph* preGraph, boolean initial); + +//////////////////////////////////////////////////////////// +// PreArc functions +//////////////////////////////////////////////////////////// + +// Creators/destructor +PreArc *createPreArc_pg(IDnum originID, IDnum destinationID, + PreGraph * preGraph); +void createAnalogousPreArc_pg(IDnum originID, IDnum destinationID, + PreArc * refPreArc, PreGraph * preGraph); +void destroyPreArc_pg(PreArc * preArc, PreGraph * preGraph); + +// Multiplicity +void setMultiplicity_pg(PreArc * preArc, IDnum mult); +IDnum getMultiplicity_pg(PreArc * preArc); +void changeMultiplicity_pg(PreArc * preArc, IDnum variation); + +// Extremities +IDnum getDestination_pg(PreArc * preArc, IDnum nodeID); +IDnum getOtherEnd_pg(PreArc * preArc, IDnum preNodeID); + +// Finding preArcs +PreArc 
*getPreArcBetweenPreNodes_pg(IDnum originID, IDnum destinationID, + PreGraph * preGraph); +PreArc *getNextPreArc_pg(PreArc * preArc, IDnum originPreNodeID); + +// Misc +boolean isLoop_pg(PreArc * preArc); + +//////////////////////////////////////////////////////////// +// PreGraph functions +//////////////////////////////////////////////////////////// + +// Memory allocation +PreGraph *emptyPreGraph_pg(IDnum sequenceCount, int wordLength, boolean double_strand); +void allocatePreNodeSpace_pg(PreGraph * preGraph, IDnum preNodeCount); +void addPreNodeToPreGraph_pg(PreGraph * preGraph, Coordinate start, + Coordinate stop, FILE * file, + Kmer * initialKmer, IDnum ID); + +// Deallocation +void destroyPreGraph_pg(PreGraph * preGraph); + +// Dimensions +IDnum preNodeCount_pg(PreGraph * preGraph); +IDnum sequenceCount_pg(PreGraph * preGraph); +void renumberPreNodes_pg(PreGraph * preGraph); + +// File IO +void exportPreGraph_pg(char *filename, PreGraph * preGraph); + +int getWordLength_pg(PreGraph * preGraph); + +void displayPreArcMemory_pg(); + +int test_preGraph(int argc, char **argv); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/preGraphConstruction.c b/contrib/MetaVelvet-v0.3.1/src/preGraphConstruction.c new file mode 100644 index 0000000..d767dd9 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/preGraphConstruction.c @@ -0,0 +1,691 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include +#include + +#include "globals.h" +#include "preGraph.h" +#include "recycleBin.h" +#include "roadMap.h" +#include "readSet.h" +#include "concatenatedPreGraph.h" +#include "utility.h" +#include "kmer.h" + +#define ADENINE 0 +#define CYTOSINE 1 +#define GUANINE 2 +#define THYMINE 3 + +// Internal structure used to mark the ends of an Annotation +struct insertionMarker_st { + Annotation *annot; + boolean isStart; +}; + +Coordinate getInsertionMarkerPosition(InsertionMarker * marker) +{ + if (marker->isStart) + return getStart(marker->annot); + else + return getFinish(marker->annot); +} + +int compareInsertionMarkers(const void *A, const void *B) +{ + Coordinate Apos = + getInsertionMarkerPosition((InsertionMarker *) A); + Coordinate Bpos = + getInsertionMarkerPosition((InsertionMarker *) B); + + if (Apos < Bpos) + return -1; + else if (Apos == Bpos) + return 0; + else + return 1; +} + +// Applies mergeSort to each insertion marker list (in order of position) +static void +orderInsertionMarkers(InsertionMarker ** insMarkers, + IDnum * markerCounters, RoadMapArray * rdmaps) +{ + IDnum sequenceIndex; + IDnum sequenceCounter = rdmaps->length; + + puts("Ordering insertion markers"); + for (sequenceIndex = 1; sequenceIndex <= sequenceCounter; + sequenceIndex++) { + qsort(insMarkers[sequenceIndex], + markerCounters[sequenceIndex], + sizeof(InsertionMarker), compareInsertionMarkers); + } +} + +// Creates insertion marker lists +static void +setInsertionMarkers(RoadMapArray * rdmaps, + IDnum * markerCounters, + InsertionMarker ** veryLastMarker, + InsertionMarker ** insertionMarkers) +{ + IDnum sequenceCounter = rdmaps->length; + IDnum sequenceIndex, sequenceIndex2; + IDnum totalCount = 0; + RoadMap *rdmap; + Annotation *annot = 
rdmaps->annotations; + InsertionMarker *nextMarker, *newMarker; + IDnum annotIndex, lastAnnotIndex; + InsertionMarker **insMarkers = + callocOrExit(rdmaps->length + 1, InsertionMarker *); + // Counting insertion markers + for (sequenceIndex = 1; sequenceIndex < sequenceCounter + 1; + sequenceIndex++) { + //printf("Going through sequence %d\n", sequenceIndex); + rdmap = getRoadMapInArray(rdmaps, sequenceIndex - 1); + lastAnnotIndex = getAnnotationCount(rdmap); + + // Set insertion markers in previous sequences : + + for (annotIndex = 0; annotIndex < lastAnnotIndex; + annotIndex++) { + if (getAnnotSequenceID(annot) > 0) { + markerCounters[getAnnotSequenceID(annot)] + += 2; + } else { + markerCounters[-getAnnotSequenceID(annot)] + += 2; + } + totalCount += 2; + annot = getNextAnnotation(annot); + } + } + + // Allocating space + *insertionMarkers = callocOrExit(totalCount, InsertionMarker); + *veryLastMarker = *insertionMarkers + totalCount; + + // Pointing each node to its space + nextMarker = *insertionMarkers; + for (sequenceIndex = 1; sequenceIndex < sequenceCounter + 1; + sequenceIndex++) { + insMarkers[sequenceIndex] = nextMarker; + nextMarker = nextMarker + markerCounters[sequenceIndex]; + markerCounters[sequenceIndex] = 0; + } + + // Filling up space with data + annot = rdmaps->annotations; + for (sequenceIndex = 1; sequenceIndex < sequenceCounter + 1; + sequenceIndex++) { + //printf("Going through sequence %d\n", sequenceIndex); + rdmap = getRoadMapInArray(rdmaps, sequenceIndex - 1); + lastAnnotIndex = getAnnotationCount(rdmap); + + // Set insertion markers in previous sequences : + + for (annotIndex = 0; annotIndex < lastAnnotIndex; + annotIndex++) { + sequenceIndex2 = getAnnotSequenceID(annot); + if (sequenceIndex2 > 0) { + newMarker = + insMarkers[sequenceIndex2] + + (markerCounters[sequenceIndex2])++; + newMarker->annot = annot; + newMarker->isStart = true; + + newMarker = + insMarkers[sequenceIndex2] + + (markerCounters[sequenceIndex2])++; + 
newMarker->annot = annot; + newMarker->isStart = false; + } else { + incrementAnnotationCoordinates(annot); + + newMarker = + insMarkers[-sequenceIndex2] + + (markerCounters[-sequenceIndex2])++; + newMarker->annot = annot; + newMarker->isStart = true; + + newMarker = + insMarkers[-sequenceIndex2] + + (markerCounters[-sequenceIndex2])++; + newMarker->annot = annot; + newMarker->isStart = false; + } + annot = getNextAnnotation(annot); + } + } + + orderInsertionMarkers(insMarkers, markerCounters, rdmaps); + free(insMarkers); +} + +// Counts how many preNodes are to be created to allocate appropriate memory +static void +countPreNodes(RoadMapArray * rdmaps, PreGraph * preGraph, + IDnum * markerCounters, InsertionMarker * insertionMarkers, + InsertionMarker * veryLastMarker) +{ + Annotation *annot = rdmaps->annotations; + InsertionMarker *currentMarker = insertionMarkers; + IDnum markerIndex, lastMarkerIndex; + IDnum sequenceIndex; + Coordinate currentPosition, nextStop; + IDnum preNodeCounter = 0; + RoadMap *rdmap; + IDnum annotIndex, lastAnnotIndex; + + // Now that we have read all of the annotations, we go on to create the preNodes and tie them up + for (sequenceIndex = 1; + sequenceIndex <= sequenceCount_pg(preGraph); + sequenceIndex++) { + rdmap = getRoadMapInArray(rdmaps, sequenceIndex - 1); + annotIndex = 0; + lastAnnotIndex = getAnnotationCount(rdmap); + markerIndex = 0; + lastMarkerIndex = markerCounters[sequenceIndex]; + currentPosition = 0; + + + while (annotIndex < lastAnnotIndex) { + if (markerIndex == lastMarkerIndex + || getPosition(annot) <= + getInsertionMarkerPosition(currentMarker)) + nextStop = getPosition(annot); + else + nextStop = + getInsertionMarkerPosition + (currentMarker); + + if (currentPosition != nextStop) { + preNodeCounter++; + currentPosition = nextStop; + } + + while (markerIndex < lastMarkerIndex + && getInsertionMarkerPosition(currentMarker) + == currentPosition) { + currentMarker++; + markerIndex++; + } + + while (annotIndex < 
lastAnnotIndex + && getPosition(annot) == currentPosition) { + annot = getNextAnnotation(annot); + annotIndex++; + } + + } + + while (markerIndex < lastMarkerIndex) { + if (currentPosition == + getInsertionMarkerPosition(currentMarker)) { + currentMarker++; + markerIndex++; + } else { + preNodeCounter++; + currentPosition = + getInsertionMarkerPosition + (currentMarker); + } + } + } + + allocatePreNodeSpace_pg(preGraph, preNodeCounter); +} + +static void convertInsertionMarkers(InsertionMarker * insertionMarkers, + InsertionMarker * veryLastMarker, + IDnum * chains) +{ + InsertionMarker *marker; + Annotation *annot; + + for (marker = insertionMarkers; marker != veryLastMarker; marker++) { + annot = marker->annot; + + if (getAnnotSequenceID(annot) > 0) { + if (marker->isStart) { + if (getStartID(annot) == 0) + setStartID(annot, + chains + [getAnnotSequenceID + (annot)]); + else + setStartID(annot, + getStartID(annot) + 1); + } + } else { + if (marker->isStart) + setStartID(annot, -getStartID(annot)); + else { + if (getFinishID(annot) == 0) + setFinishID(annot, + -chains + [-getAnnotSequenceID + (annot)]); + else + setFinishID(annot, + -getFinishID(annot) - + 1); + } + } + } + + free(insertionMarkers); +} + +static void convertMarker(InsertionMarker * marker, IDnum nodeID) +{ + if (marker->isStart) + setStartID(marker->annot, nodeID); + else + setFinishID(marker->annot, nodeID); +} + +// Creates the preNode using insertion marker and annotation lists for each sequence +static void +createPreNodes(RoadMapArray * rdmaps, PreGraph * preGraph, + IDnum * markerCounters, InsertionMarker * insertionMarkers, + InsertionMarker * veryLastMarker, IDnum * chains, + char *sequenceFilename, int WORDLENGTH) +{ + Annotation *annot = rdmaps->annotations; + IDnum latestPreNodeID; + InsertionMarker *currentMarker = insertionMarkers; + IDnum sequenceIndex; + Coordinate currentPosition, nextStop; + IDnum preNodeCounter = 1; + FILE *file = fopen(sequenceFilename, "r"); + char line[50000]; 
+ int lineLength = 50000; + Coordinate readIndex; + boolean tooShort; + Kmer initialKmer; + char c; + RoadMap *rdmap; + IDnum annotIndex, lastAnnotIndex; + IDnum markerIndex, lastMarkerIndex; + + if (file == NULL) + exitErrorf(EXIT_FAILURE, true, "Could not read %s", sequenceFilename); + // Reading sequence descriptor in first line + if (!fgets(line, lineLength, file)) + exitErrorf(EXIT_FAILURE, true, "%s incomplete.", sequenceFilename); + + // Now that we have read all of the annotations, we go on to create the preNodes and tie them up + for (sequenceIndex = 1; + sequenceIndex <= sequenceCount_pg(preGraph); + sequenceIndex++) { + if (sequenceIndex % 100000 == 0) + printf("Sequence %d / %d\n", sequenceIndex, + sequenceCount_pg(preGraph)); + + while (line[0] != '>') + if (!fgets(line, lineLength, file)) + exitErrorf(EXIT_FAILURE, true, "%s incomplete.", sequenceFilename); + + rdmap = getRoadMapInArray(rdmaps, sequenceIndex - 1); + annotIndex = 0; + lastAnnotIndex = getAnnotationCount(rdmap); + markerIndex = 0; + lastMarkerIndex = markerCounters[sequenceIndex]; + currentPosition = 0; + + // Reading first (k-1) nucleotides + tooShort = false; + clearKmer(&initialKmer); + //printf("Initial kmer: "); + for (readIndex = 0; readIndex < WORDLENGTH - 1; + readIndex++) { + if (!isalpha((c = getc(file)))) { + if (c == '>') { + ungetc(c, file); + tooShort = true; + break; + } else { + continue; + } + } + //printf("%c", c); + switch (c) { + case 'A': + pushNucleotide(&initialKmer, ADENINE); + break; + case 'C': + pushNucleotide(&initialKmer, CYTOSINE); + break; + case 'G': + pushNucleotide(&initialKmer, GUANINE); + break; + case 'T': + pushNucleotide(&initialKmer, THYMINE); + break; + default: + printf + ("Irregular sequence file: are you sure your Sequence and Roadmap file come from the same source?\n"); + fflush(stdout); + abort(); + } + } + //puts(""); + + if (tooShort) { + //printf("Skipping short read.. 
%d\n", sequenceIndex); + chains[sequenceIndex] = preNodeCounter; + if (!fgets(line, lineLength, file)) + exitErrorf(EXIT_FAILURE, true, "%s incomplete.", sequenceFilename); + continue; + } + + latestPreNodeID = 0; + + while (annotIndex < lastAnnotIndex) { + if (markerIndex == lastMarkerIndex + || getPosition(annot) <= + getInsertionMarkerPosition(currentMarker)) + nextStop = getPosition(annot); + else { + nextStop = + getInsertionMarkerPosition + (currentMarker); + } + + if (currentPosition != nextStop) { + addPreNodeToPreGraph_pg(preGraph, + currentPosition, + nextStop, + file, + &initialKmer, + preNodeCounter); + if (latestPreNodeID == 0) { + chains[sequenceIndex] = + preNodeCounter; + } + latestPreNodeID = preNodeCounter++; + currentPosition = nextStop; + } + + while (markerIndex < lastMarkerIndex + && getInsertionMarkerPosition(currentMarker) + == nextStop) { + convertMarker(currentMarker, + latestPreNodeID); + currentMarker++; + markerIndex++; + } + + while (annotIndex < lastAnnotIndex + && getPosition(annot) == nextStop) { + for (readIndex = 0; + readIndex < + getAnnotationLength(annot); + readIndex++) { + c = getc(file); + while (!isalpha(c)) + c = getc(file); + + //printf("(%c)", c); + switch (c) { + case 'A': + pushNucleotide(&initialKmer, ADENINE); + break; + case 'C': + pushNucleotide(&initialKmer, CYTOSINE); + break; + case 'G': + pushNucleotide(&initialKmer, GUANINE); + break; + case 'T': + pushNucleotide(&initialKmer, THYMINE); + break; + default: + printf + ("Irregular sequence file: are you sure your Sequence and Roadmap file come from the same source?\n"); + fflush(stdout); + exit(1); + } + } + + annot = getNextAnnotation(annot); + annotIndex++; + } + + } + + while (markerIndex < lastMarkerIndex) { + if (currentPosition == + getInsertionMarkerPosition(currentMarker)) { + convertMarker(currentMarker, + latestPreNodeID); + currentMarker++; + markerIndex++; + } else { + nextStop = + getInsertionMarkerPosition + (currentMarker); + 
addPreNodeToPreGraph_pg(preGraph, + currentPosition, + nextStop, file, + &initialKmer, + preNodeCounter); + if (latestPreNodeID == 0) + chains[sequenceIndex] = + preNodeCounter; + latestPreNodeID = preNodeCounter++; + currentPosition = + getInsertionMarkerPosition + (currentMarker); + } + } + + // End of sequence + if (!fgets(line, lineLength, file) && sequenceIndex < sequenceCount_pg(preGraph)) + exitErrorf(EXIT_FAILURE, true, "%s incomplete.", sequenceFilename); + //puts(" "); + + if (latestPreNodeID == 0) + chains[sequenceIndex] = preNodeCounter; + } + + free(markerCounters); + fclose(file); + +} + +static void connectPreNodeToTheNext(IDnum * currentPreNodeID, + IDnum nextPreNodeID, + Coordinate * currentPosition, + PassageMarker ** latestPassageMarker, + IDnum sequenceIndex, + PreGraph * preGraph) +{ + if (nextPreNodeID == 0) + return; + + if (*currentPreNodeID != 0) + createPreArc_pg(*currentPreNodeID, nextPreNodeID, + preGraph); + + *currentPreNodeID = nextPreNodeID; + + *currentPosition += + getPreNodeLength_pg(*currentPreNodeID, preGraph); + +} + +static IDnum chooseNextInternalPreNode(IDnum currentPreNodeID, + IDnum sequenceIndex, + PreGraph * preGraph, IDnum * chains) +{ + if (currentPreNodeID >= preNodeCount_pg(preGraph)) + return 0; + if (sequenceIndex >= sequenceCount_pg(preGraph)) + return currentPreNodeID + 1; + if (currentPreNodeID + 1 < chains[sequenceIndex + 1]) + return currentPreNodeID + 1; + return 0; +} + +static void connectAnnotation(IDnum * currentPreNodeID, Annotation * annot, + Coordinate * currentPosition, + PassageMarker ** latestPassageMarker, + IDnum sequenceIndex, PreGraph * preGraph) +{ + IDnum nextPreNodeID = getStartID(annot); + + connectPreNodeToTheNext(currentPreNodeID, nextPreNodeID, + currentPosition, latestPassageMarker, + sequenceIndex, preGraph); + + while (*currentPreNodeID != getFinishID(annot)) { + nextPreNodeID = (*currentPreNodeID) + 1; + + connectPreNodeToTheNext(currentPreNodeID, nextPreNodeID, + currentPosition, + 
latestPassageMarker, sequenceIndex, + preGraph); + } +} + +// Threads each sequence and creates preArcs according to road map indications +static void connectPreNodes(RoadMapArray * rdmaps, PreGraph * preGraph, + IDnum * chains) +{ + Coordinate currentPosition, currentInternalPosition; + IDnum sequenceIndex; + Annotation *annot = rdmaps->annotations; + IDnum currentPreNodeID, nextInternalPreNodeID; + PassageMarker *latestPassageMarker; + RoadMap *rdmap; + IDnum annotIndex, lastAnnotIndex; + + for (sequenceIndex = 1; + sequenceIndex <= sequenceCount_pg(preGraph); + sequenceIndex++) { + + if (sequenceIndex % 100000 == 0) + printf("Connecting %d / %d\n", sequenceIndex, + sequenceCount_pg(preGraph)); + + rdmap = getRoadMapInArray(rdmaps, sequenceIndex - 1); + annotIndex = 0; + lastAnnotIndex = getAnnotationCount(rdmap); + nextInternalPreNodeID = chooseNextInternalPreNode + (chains[sequenceIndex] - 1, sequenceIndex, + preGraph, chains); + + currentPosition = 0; + currentInternalPosition = 0; + currentPreNodeID = 0; + latestPassageMarker = NULL; + // Recursion up to last annotation + while (annotIndex < lastAnnotIndex + || nextInternalPreNodeID != 0) { + if (annotIndex == lastAnnotIndex + || (nextInternalPreNodeID != 0 + && currentInternalPosition < + getPosition(annot))) { + connectPreNodeToTheNext(&currentPreNodeID, + nextInternalPreNodeID, + &currentPosition, + &latestPassageMarker, + sequenceIndex, + preGraph); + nextInternalPreNodeID = + chooseNextInternalPreNode + (currentPreNodeID, sequenceIndex, + preGraph, chains); + currentInternalPosition += + getPreNodeLength_pg(currentPreNodeID, + preGraph); + + } else { + connectAnnotation(&currentPreNodeID, annot, + &currentPosition, + &latestPassageMarker, + sequenceIndex, preGraph); + annot = getNextAnnotation(annot); + annotIndex++; + } + } + } +} + +// Post construction memory deallocation routine (of sorts, could certainly be optimized) +static void +cleanUpMemory(PreGraph * preGraph, RoadMapArray * rdmaps, IDnum * chains) +{ + // Killing 
off roadmaps + free(rdmaps->annotations); + free(rdmaps->array); + free(rdmaps); + + // Finishing off the chain markers + free(chains); +} + +// The full monty, wrapped up in one function +PreGraph *newPreGraph_pg(RoadMapArray * rdmapArray, char *sequenceFilename) +{ + int WORDLENGTH = rdmapArray->WORDLENGTH; + IDnum sequenceCount = rdmapArray->length; + IDnum *markerCounters = callocOrExit(sequenceCount + 1, IDnum); + IDnum *chains = callocOrExit(sequenceCount + 1, IDnum); + InsertionMarker *insertionMarkers; + InsertionMarker *veryLastMarker; + + PreGraph *preGraph = + emptyPreGraph_pg(sequenceCount, rdmapArray->WORDLENGTH, rdmapArray->double_strand); + + puts("Creating insertion markers"); + setInsertionMarkers(rdmapArray, markerCounters, &veryLastMarker, + &insertionMarkers); + + puts("Counting preNodes"); + countPreNodes(rdmapArray, preGraph, markerCounters, + insertionMarkers, veryLastMarker); + + printf("%d preNodes counted, creating them now\n", + preNodeCount_pg(preGraph)); + createPreNodes(rdmapArray, preGraph, markerCounters, + insertionMarkers, veryLastMarker, chains, + sequenceFilename, WORDLENGTH); + + puts("Adjusting marker info..."); + convertInsertionMarkers(insertionMarkers, veryLastMarker, chains); + + puts("Connecting preNodes"); + connectPreNodes(rdmapArray, preGraph, chains); + + puts("Cleaning up memory"); + cleanUpMemory(preGraph, rdmapArray, chains); + puts("Concatenating preGraph"); + concatenatePreGraph_pg(preGraph); + puts("Done creating preGraph"); + + return preGraph; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/preGraphConstruction.h b/contrib/MetaVelvet-v0.3.1/src/preGraphConstruction.h new file mode 100644 index 0000000..41346aa --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/preGraphConstruction.h @@ -0,0 +1,26 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. 
+ + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _PREGRAPHCONSTRUCTION_H_ +#define _PREGRAPHCONSTRUCTION_H_ + +PreGraph *newPreGraph_pg(RoadMapArray * rdmaps, char *filename); + +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/readCoherentGraph.c b/contrib/MetaVelvet-v0.3.1/src/readCoherentGraph.c new file mode 100644 index 0000000..4509385 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/readCoherentGraph.c @@ -0,0 +1,1076 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +// Original +#include +#include +// Original + +#include "globals.h" +#include "graph.h" +#include "recycleBin.h" +#include "passageMarker.h" +#include "graphStats.h" +#include "concatenatedGraph.h" +#include "readSet.h" +// Original +#include "readCoherentGraph.h" +#include "shortReadPairs.h" +// Original + +#define LONG_NODE_CUTOFF 50 +#define LN2 0.693147 +#define PROBABILITY_CUTOFF 5 +#define MAX_READ_COUNT 100 +#define MAX_READ_LENGTH 2000 +// Original +#define LEN_ARRAY_OUTIN 2 +// Original + +static Graph *graph = NULL; +static PassageMarker *path = NULL; +static RecycleBin *listMemory = NULL; +static double expected_coverage = 1; +static TightString **sequences = NULL; +static int MULTIPLICITY_CUTOFF = 2; + +static IDnum multCounter = 0; +static IDnum dbgCounter = 0; +static IDnum nullCounter = 0; + +typedef struct rb_connection_st RBConnection; + +struct rb_connection_st { + Node *node; + PassageMarker *marker; + RBConnection *next; + IDnum multiplicity; +}; + +static RecycleBin *nodeListMemory = NULL; + +#define BLOCKSIZE 1000 + +// Original +static void getStatusOfUniqueness(Node * node, double probability); +// Original + +static RBConnection *allocateRBConnection() +{ + if (nodeListMemory == NULL) + nodeListMemory = + newRecycleBin(sizeof(RBConnection), BLOCKSIZE); + + return allocatePointer(nodeListMemory); +} + +static void deallocateRBConnection(RBConnection * nodeList) +{ + deallocatePointer(nodeListMemory, nodeList); +} + +void setBaseCoverage(double coverage) +{ + expected_coverage = coverage; +} + +boolean isUniqueBasic(Node * node) +{ + if (getNodeLength(node) < LONG_NODE_CUTOFF) { + return false; + } + if (readCoverage(node) / (double) getNodeLength(node) > + 1.5 * expected_coverage) { + return false; + } + + return 
true; +} + +boolean isUniqueSolexa(Node * node) +{ + Coordinate nodeLength = getNodeLength(node); + Coordinate nodeCoverage = + (getVirtualCoverage(node, 0) + getVirtualCoverage(node, 1)); + double nodeDensity, probability; + + if (nodeLength == 0) { + return false; + } + if (nodeLength > LONG_NODE_CUTOFF) { + nodeDensity = nodeCoverage / (double) nodeLength; + + probability = + -1 * LN2 / 2 + + nodeLength / (2 * expected_coverage) * + (expected_coverage * expected_coverage - + nodeDensity * nodeDensity / 2); + + // Original + getStatusOfUniqueness(node, probability); + // Original + + return probability > PROBABILITY_CUTOFF; + } else { + return false; + probability = + expected_coverage * nodeLength - nodeCoverage / LN2; + return probability > 0; + } +} + +static void identifyUniqueNodes(boolean(*isUniqueFunction) (Node *)) +{ + IDnum index; + Node *node; + IDnum counter = 0; + + puts("Identifying unique nodes"); + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + + if (node == NULL) + continue; + + setUniqueness(node, isUniqueFunction(node)); + + if (getUniqueness(node)) + counter++; + } + + printf("Done, %u unique nodes counted\n", counter); +} + +static boolean uniqueNodesConnect(Node * startingNode) +{ + Node *destination = NULL; + PassageMarker *startMarker, *currentMarker; + RBConnection *newList; + RBConnection *list = NULL; + boolean multipleHits = false; + // Original + IDnum outputStartingNodeID = 0; + IDnum outputDestinationNodeID = 0; + // Original + + if (arcCount(startingNode) == 0) + return false; + + if (getMarker(startingNode) == NULL) + return false; + + dbgCounter++; + + // Checking for multiple destinations + for (startMarker = getMarker(startingNode); startMarker != NULL; + startMarker = getNextInNode(startMarker)) { + if (getFinishOffset(startMarker) > + 2 * getWordLength(graph)) + continue; + + for (currentMarker = getNextInSequence(startMarker); + currentMarker != NULL; + currentMarker = 
getNextInSequence(currentMarker)) { + if (!getUniqueness(getNode(currentMarker))) { + continue; + } else if (getNodeStatus(getNode(currentMarker))) { + if (getStartOffset(currentMarker) > + 2 * getWordLength(graph)) + break; + for (newList = list; newList != NULL; + newList = newList->next) { + if (newList->node == + getNode(currentMarker)) { + newList->multiplicity++; + break; + } + } + if (newList == NULL) + abort(); + break; + } else { + if (getStartOffset(currentMarker) > + 2 * getWordLength(graph)) + break; + setSingleNodeStatus(getNode(currentMarker), + true); + newList = allocateRBConnection(); + newList->node = getNode(currentMarker); + newList->multiplicity = 1; + newList->marker = startMarker; + newList->next = list; + list = newList; + break; + } + } + } + + while (list != NULL) { + newList = list; + list = newList->next; + setSingleNodeStatus(newList->node, false); + if (newList->multiplicity >= MULTIPLICITY_CUTOFF) { + if (destination == NULL) { + destination = newList->node; + path = newList->marker; + } else if (destination != newList->node) + multipleHits = true; + } + deallocateRBConnection(newList); + } + + if (multipleHits) { + multCounter++; + setUniqueness(startingNode, false); + return false; + } + + if (destination == NULL || destination == startingNode + || destination == getTwinNode(startingNode)) { + nullCounter++; + return false; + } + + // Original + // Reserving startingNode and destinationNode + outputStartingNodeID = getNodeID(startingNode); + outputDestinationNodeID = getNodeID(destination); + // Original + + // Check for reciprocity + for (startMarker = getMarker(getTwinNode(destination)); + startMarker != NULL; + startMarker = getNextInNode(startMarker)) { + if (getFinishOffset(startMarker) > + 2 * getWordLength(graph)) + continue; + + for (currentMarker = getNextInSequence(startMarker); + currentMarker != NULL; + currentMarker = getNextInSequence(currentMarker)) { + if (!getUniqueness(getNode(currentMarker))) { + continue; + } 
else if (getNodeStatus(getNode(currentMarker))) { + if (getStartOffset(currentMarker) > + 2 * getWordLength(graph)) + break; + for (newList = list; newList != NULL; + newList = newList->next) { + if (newList->node == + getNode(currentMarker)) { + newList->multiplicity++; + break; + } + } + if (newList == NULL) + abort(); + break; + } else { + if (getStartOffset(currentMarker) > + 2 * getWordLength(graph)) + break; + setSingleNodeStatus(getNode(currentMarker), + true); + newList = allocateRBConnection(); + newList->node = getNode(currentMarker); + newList->multiplicity = 1; + newList->next = list; + list = newList; + break; + } + } + } + + while (list != NULL) { + newList = list; + list = newList->next; + setSingleNodeStatus(newList->node, false); + if (newList->multiplicity >= MULTIPLICITY_CUTOFF + && newList->node != getTwinNode(startingNode)) + multipleHits = true; + deallocateRBConnection(newList); + } + + if (multipleHits) { + multCounter++; + setUniqueness(destination, false); + return false; + } + // Aligning long reads to each other: + // TODO + + // Merge pairwise alignments and produce consensus + // TODO + + // Original + if (outputStartingNodeID != 0 && outputDestinationNodeID != 0) { + printf("RBConnection\tStarting : %d\tDestination : %d\n", + outputStartingNodeID, outputDestinationNodeID); + } + // Original + + return true; +} + +static boolean goesToNode(PassageMarker * marker, Node * node) +{ + PassageMarker *current; + + for (current = marker; current != NULL; + current = getNextInSequence(current)) + if (getNode(current) == node) + return true; + + return false; +} + +static void updateMembers(Node * bypass, Node * nextNode) +{ + PassageMarker *marker, *next, *tmp; + Coordinate nextLength = getNodeLength(nextNode); + + // Update marker + arc info + for (marker = getMarker(bypass); marker != NULL; marker = tmp) { + tmp = getNextInNode(marker); + + if (!isTerminal(marker) + && getNode(getNextInSequence(marker)) == nextNode) { + // Marker steps right 
into target + next = getNextInSequence(marker); + disconnectNextPassageMarker(marker, graph); + destroyPassageMarker(next); + } else if (getUniqueness(nextNode) + && goesToNode(marker, nextNode)) { + // Marker goes indirectly to target + while (getNode(getNextInSequence(marker)) != + nextNode) { + next = getNextInSequence(marker); + disconnectNextPassageMarker(marker, graph); + destroyPassageMarker(next); + } + + next = getNextInSequence(marker); + disconnectNextPassageMarker(marker, graph); + destroyPassageMarker(next); + } else if (!isTerminal(marker) + && getFinishOffset(marker) == 0) { + // Marker goes somewhere else than to target + next = getNextInSequence(marker); + incrementFinishOffset(marker, nextLength); + } else { + // Marker goes nowhere + incrementFinishOffset(marker, nextLength); + } + } +} + +static void admitGroupies(Node * source, Node * bypass) +{ + PassageMarker *marker, *tmpMarker; + + for (marker = getMarker(source); marker != NULL; + marker = tmpMarker) { + tmpMarker = getNextInNode(marker); + extractPassageMarker(marker); + transposePassageMarker(marker, bypass); + incrementFinishOffset(getTwinMarker(marker), + getNodeLength(bypass)); + } + +} + +static void adjustShortReads(Node * target, PassageMarker * pathMarker) +{ + ShortReadMarker *targetArray, *marker; + IDnum targetLength, index; + Coordinate position, nodeLength; + + if (!readStartsAreActivated(graph)) + return; + + targetArray = getNodeReads(getTwinNode(target), graph); + targetLength = getNodeReadCount(getTwinNode(target), graph); + + nodeLength = getPassageMarkerLength(pathMarker); + + for (index = 0; index < targetLength; index++) { + marker = getShortReadMarkerAtIndex(targetArray, index); + position = getShortReadMarkerPosition(marker); + position += nodeLength; + setShortReadMarkerPosition(marker, position); + } +} + +static Node *bypass() +{ + Node *bypass = getNode(path); + Node *next = NULL; + Arc *arc; + Category cat; + PassageMarker *nextMarker; + + // Remove unwanted 
arcs + while (getArc(bypass) != NULL) + destroyArc(getArc(bypass), graph); + + // Update extensive variables (length + descriptors + passage markers) + while (!isTerminal(path)) { + nextMarker = getNextInSequence(path); + next = getNode(nextMarker); + while (next == bypass) { + disconnectNextPassageMarker(path, graph); + destroyPassageMarker(nextMarker); + nextMarker = getNextInSequence(path); + next = getNode(nextMarker); + } + + if (next == NULL) + return bypass; + + // Overall node update + if (!getUniqueness(next)) { + adjustShortReads(bypass, getNextInSequence(path)); + appendSequence(bypass, sequences, + getNextInSequence(path), graph); + } else { + concatenateReadStarts(bypass, next, graph); + // Update virtual coverage + for (cat = 0; cat < CATEGORIES; cat++) + incrementVirtualCoverage(bypass, cat, + getVirtualCoverage + (next, cat)); + + // Update original virtual coverage + for (cat = 0; cat < CATEGORIES; cat++) + incrementOriginalVirtualCoverage(bypass, + cat, + getOriginalVirtualCoverage + (next, + cat)); + appendDescriptors(bypass, next); + } + + // Members + updateMembers(bypass, next); + + // Termination + if (isTerminal(path) || getUniqueness(next)) + break; + } + + // Remove unique groupies from arrival + admitGroupies(next, bypass); + + // Copy destination arcs + for (arc = getArc(next); arc != NULL; arc = getNextArc(arc)) { + if (getDestination(arc) == next) + continue; + else if (getDestination(arc) == getTwinNode(next)) + createAnalogousArc(bypass, getTwinNode(bypass), + arc, graph); + else + createAnalogousArc(bypass, getDestination(arc), + arc, graph); + } + + destroyNode(next, graph); + + return bypass; +} + +static void trimLongReadTips() +{ + IDnum index; + Node *node; + PassageMarker *marker, *next; + + printf("Trimming read tips\n"); + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + + if (getUniqueness(node)) + continue; + + for (marker = getMarker(node); marker != NULL; + marker = next) { 
+ next = getNextInNode(marker); + + if (!isInitial(marker) && !isTerminal(marker)) + continue; + + if (isTerminal(marker)) + marker = getTwinMarker(marker); + + while (!getUniqueness(getNode(marker))) { + if (next != NULL + && (marker == next + || marker == getTwinMarker(next))) + next = getNextInNode(next); + if (getNextInSequence(marker) != NULL) { + marker = getNextInSequence(marker); + destroyPassageMarker + (getPreviousInSequence + (marker)); + } else { + destroyPassageMarker(marker); + break; + } + } + } + } +} + +void readCoherentGraph(Graph * inGraph, boolean(*isUnique) (Node * node), + double coverage, ReadSet * reads) +{ + IDnum nodeIndex; + Node *node; + IDnum previousNodeCount = 0; + + graph = inGraph; + listMemory = newRecycleBin(sizeof(PassageMarkerList), 100000); + expected_coverage = coverage; + sequences = reads->tSequences; + + puts("Read coherency..."); + resetNodeStatus(graph); + identifyUniqueNodes(isUnique); + trimLongReadTips(); + + previousNodeCount = 0; + while (previousNodeCount != nodeCount(graph)) { + + previousNodeCount = nodeCount(graph); + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); + nodeIndex++) { + + node = getNodeInGraph(graph, nodeIndex); + + if (node == NULL || !getUniqueness(node)) + continue; + + while (uniqueNodesConnect(node)) + node = bypass(); + + node = getTwinNode(node); + + while (uniqueNodesConnect(node)) + node = bypass(); + + } + + renumberNodes(graph); + } + + destroyRecycleBin(listMemory); + + printf("Confronted to %i multiple hits and %i null over %i\n", + multCounter, nullCounter, dbgCounter); + + puts("Read coherency over!"); +} + +void setMultiplicityCutoff(int value) +{ + if (value < 0) { + printf("Negative long read multiplicity cutoff %i!\n", + value); + puts("Exiting..."); + exit(1); + } + MULTIPLICITY_CUTOFF = value; +} + +// Original +static void getStatusOfUniqueness(Node * node, double probability) +{ + Node *twin = getTwinNode(node); + char *repeatByProb = "Unique", *repeatByArcCount = 
"Unique"; + IDnum index; + Node *arrayNode[] = {node, twin}; + Arc *tmpArc = NULL; + Node *tmpNode = NULL; + char *arcDirection = ""; + + if (probability > PROBABILITY_CUTOFF) { + repeatByProb = "Unique"; + } else { + repeatByProb = "Repeat"; + } + if ( (simpleArcCount(node) >= 2 && simpleArcCount(twin) >= 1) + || (simpleArcCount(twin) >= 2 && simpleArcCount(node) >= 1) ) + repeatByArcCount = "Repeat"; + else + repeatByArcCount = "Unique"; + + printf("Node : %d\tLen : %d\tCov : %6.2f\t", + getNodeID(node), getNodeLength(node), getNodeDensity(node)); + printf("F : %6.2f\tbyF : %s\tbyArc : %s\n", + probability, repeatByProb, repeatByArcCount); + + if ( !(strcmp(repeatByProb, "Repeat") == 0 || + strcmp(repeatByArcCount, "Repeat") == 0) ) + return; + + for (index = 0; index <= 1; index++) { + if (index == 0) + arcDirection = "Out"; + else + arcDirection = "In"; + + tmpArc = getArc(arrayNode[index]); + while (tmpArc != NULL) { + tmpNode = getDestination(tmpArc); + + printf("\t%s\tNode : %d\t", + arcDirection, getNodeID(tmpNode)); + printf("Len : %d\tCov : %6.2f\n", + getNodeLength(tmpNode), getNodeDensity(node)); + + tmpArc = getNextArc(tmpArc); + } + } +} + +boolean isUniqueSolexaSubgraph(Node * node, double expCovSubgraph) +{ + int nodeLength = getNodeLength(node); + double nodeDensity, probability; + + if (nodeLength == 0) { + return false; + } + if (nodeLength > LONG_NODE_CUTOFF) { + nodeDensity = getNodeDensity(node); + + probability = + -1 * LN2 / 2 + + nodeLength / (2 * expCovSubgraph) * + (expCovSubgraph * expCovSubgraph - + nodeDensity * nodeDensity / 2); + + return probability > PROBABILITY_CUTOFF; + } else { + return false; + } +} + +void identifyUniqueNodesSubgraph(Graph * graph, int * subgraphMask, + boolean(*isUniqueSubgraph) (Node *, double), + double expCovSubgraph) +{ + IDnum index; + Node *node; + + for (index = 0; index < nodeCount(graph); index++) { + node = getNodeInGraph(graph, index + 1); + if (node == NULL) + continue; + if (subgraphMask[index + 
1 + nodeCount(graph)] == 2) + setUniqueness(node, + isUniqueSubgraph(node, expCovSubgraph)); + else + setUniqueness(node, false); + } +} + +static double getNearestExpCov(double argCov, double * expCovMulti) +{ + int index; + double difference, minDiff = 1000000.0, resultExpCov = 0.0; + + for (index = 0; index < 2; index++) { + if (expCovMulti[index] == -1) + break; + difference = fabs(argCov - expCovMulti[index]); + if (difference < minDiff) { + minDiff = difference; + resultExpCov = expCovMulti[index]; + } + } + + return resultExpCov; +} + +static boolean compareExpCovOutIn(Node * arrayOutInNode[][LEN_ARRAY_OUTIN], + double * expCovMulti, + double repeatNodeCov, double repeatNodeCovSD) +{ + int outIndex, inIndex, arrayIndex, sortIndex; + double outExpCov, tmpExpCov, checkedOutExpCov[LEN_ARRAY_OUTIN]; + Node *recordedInNode[LEN_ARRAY_OUTIN]; + int checkIndex, checkedCount = 0; + boolean flagChecked = false, flagOutInMatched = true; + double arrayCovOutIn[LEN_ARRAY_OUTIN * 2], tmpCovOutIn; + double aveCovOutInAll = 0.0, aveCovOutInPrimary = 0.0; + + // Initialize recordedInNode + for (inIndex = 0; inIndex < LEN_ARRAY_OUTIN; inIndex++) + recordedInNode[inIndex] = NULL; + + // Check Coverage of Repeat Node + // Prepare arrayCovOutIn + arrayIndex = 0; + for (outIndex = 0; outIndex < 2; outIndex++) { + for (inIndex = 0; inIndex < LEN_ARRAY_OUTIN; inIndex++) { + tmpCovOutIn = getNodeDensity(arrayOutInNode[outIndex][inIndex]); + arrayCovOutIn[arrayIndex++] = tmpCovOutIn; + aveCovOutInAll += tmpCovOutIn; + } + } + aveCovOutInAll /= 2.0; + // Sort + for (sortIndex = 0; sortIndex < LEN_ARRAY_OUTIN * 2; sortIndex++) { + for (arrayIndex = sortIndex + 1; arrayIndex < LEN_ARRAY_OUTIN * 2; + arrayIndex++) { + if (arrayCovOutIn[sortIndex] < arrayCovOutIn[arrayIndex]) { + tmpCovOutIn = arrayCovOutIn[arrayIndex]; + arrayCovOutIn[arrayIndex] = arrayCovOutIn[sortIndex]; + arrayCovOutIn[sortIndex] = tmpCovOutIn; + } + } + } + aveCovOutInPrimary = (arrayCovOutIn[0] + arrayCovOutIn[1]) 
/ (double) 2; + // If Repeat Node Coverage is Too Small, "return false" + //printf("CovRepeat = %f, aveCovAll = %f, aveCovPrimary = %f -> ", + // repeatNodeCov, aveCovOutInAll, aveCovOutInPrimary); + if ((aveCovOutInAll * (1.0 + repeatNodeCovSD) < repeatNodeCov) + || (aveCovOutInAll * (1.0 - repeatNodeCovSD) > repeatNodeCov)) { + //printf("Dubious\n"); + return false; + } + //printf("Valid\n"); + + + for (outIndex = 0; outIndex < LEN_ARRAY_OUTIN; outIndex++) { + if (arrayOutInNode[0][outIndex] == NULL) + break; + + if (!flagOutInMatched) + return false; + + // Judge whether this OutCov has been checked or not + flagChecked = false; + tmpExpCov = getNodeDensity(arrayOutInNode[0][outIndex]); + outExpCov = getNearestExpCov(tmpExpCov, expCovMulti); + for (checkIndex = 0; checkIndex < checkedCount; checkIndex++) { + if (outExpCov == checkedOutExpCov[checkIndex]) { + flagChecked = true; + break; + } + } + if (flagChecked) + continue; + checkedOutExpCov[checkedCount++] = outExpCov; + + // Compare exp_cov between Out- and In-Nodes + flagOutInMatched = false; + for (inIndex = 0; inIndex < LEN_ARRAY_OUTIN; inIndex++) { + if (arrayOutInNode[1][inIndex] == NULL) + break; + + // Record corresponding In-Node + tmpExpCov = getNodeDensity(arrayOutInNode[1][inIndex]); + if (outExpCov == getNearestExpCov(tmpExpCov, expCovMulti)) { + recordedInNode[outIndex] = arrayOutInNode[1][inIndex]; + flagOutInMatched = true; + break; + } + } + } + + for (inIndex = 0; inIndex < LEN_ARRAY_OUTIN; inIndex++) { + if (arrayOutInNode[1][inIndex] == NULL) + break; + arrayOutInNode[1][inIndex] = recordedInNode[inIndex]; + } + + if (flagOutInMatched && checkedCount > 1) + return true; + else + return false; +} + +static boolean isInterRepeat(Node * node, double * expCovMulti, + Node * arrayOutInNode[2][LEN_ARRAY_OUTIN], + double repeatNodeCovSD) +{ + Node *twin = getTwinNode(node); + char *isRepeat = "Unique"; + IDnum nodeIndex, arcIndex; + Node *arrayNode[] = {node, twin}; + Arc *tmpArc = NULL; + Node 
*tmpNode = NULL; + char *arcDirection = ""; + boolean flagOutputNodeInfo = true; + boolean resultCompareECOI = false; + + // Judge Unique or Repeat by ArcCount + if (simpleArcCount(node) == 2 && simpleArcCount(twin) == 2) + isRepeat = "Repeat"; + else + isRepeat = "Unique"; + + // Not InterRepeat but Unique + if (strcmp(isRepeat, "Unique") == 0) + return false; + + // Record Out- and In-Nodes + for (nodeIndex = 0; nodeIndex <= 1; nodeIndex++) { + arcIndex = 0; + tmpArc = getArc(arrayNode[nodeIndex]); + while (tmpArc != NULL) { + tmpNode = getDestination(tmpArc); + + // Record to array + if (nodeIndex == 0) { + arrayOutInNode[nodeIndex][arcIndex++] + = tmpNode; + } + else { + arrayOutInNode[nodeIndex][arcIndex++] + = getTwinNode(tmpNode); + } + + tmpArc = getNextArc(tmpArc); + } + } + + // Judge Intra- or Inter-Repeat by Out- and In-Nodes + resultCompareECOI = compareExpCovOutIn(arrayOutInNode, expCovMulti, + getNodeDensity(node), repeatNodeCovSD); + + if (resultCompareECOI) { + // Output Out- and In-Nodes + for (nodeIndex = 0; nodeIndex <= 1; nodeIndex++) { + arcIndex = 0; + tmpArc = getArc(arrayNode[nodeIndex]); + while (tmpArc != NULL) { + tmpNode = getDestination(tmpArc); + + if (nodeIndex == 0) + arcDirection = "Out"; + else + arcDirection = "In"; + + // Node Information + if (flagOutputNodeInfo) { + printf("Node : %d\tLen : %d \t", + getNodeID(node), getNodeLength(node)); + printf("Cov : %6.2f\tbyArc : %s\n", + getNodeDensity(node), isRepeat); + flagOutputNodeInfo = false; + } + // Connecting Information + printf("\t%s\tNode : %d\t", + arcDirection, getNodeID(tmpNode)); + printf("Len : %d \tCov : %6.2f\n", + getNodeLength(tmpNode), + getNodeDensity(tmpNode)); + + tmpArc = getNextArc(tmpArc); + } + } + } + + return resultCompareECOI; +} + +int identifyAndSeparateInterRepeats(Graph * argGraph, double * expCovMulti, + double repeatNodeCovSD) +{ + int graphIndex, nodeIndex, arcIndex; + Graph *graph = argGraph; + Node *node, *outNode, *inNode; + Node 
*arrayOutInNode[2][LEN_ARRAY_OUTIN]; + int numInterRepeat = 0; + + puts("\nIdentifying and Separating InterRepeats"); + + // Reset NodeStatus and Uniqueness + resetNodeStatus(graph); resetUniqueness(graph); + + for (graphIndex = 0; graphIndex < nodeCount(graph); graphIndex++) { + node = getNodeInGraph(graph, graphIndex + 1); + + if (getNodeID(node) == 0) + continue; + + // Initialize arrayOutIn and arrayOutInNode + for (arcIndex = 0; arcIndex < LEN_ARRAY_OUTIN; arcIndex++) { + for (nodeIndex = 0; nodeIndex <= 1; nodeIndex++) { + arrayOutInNode[nodeIndex][arcIndex] = NULL; + } + } + + // Identify InterRepeats + if (!isInterRepeat(node, expCovMulti, arrayOutInNode, repeatNodeCovSD)) + continue; + printf("Identified InterRepeat Node %d\n", getNodeID(node)); + numInterRepeat++; + + // Separate the InterRepeat + for (arcIndex = 0; arcIndex < LEN_ARRAY_OUTIN; arcIndex++) { + if (arrayOutInNode[1][arcIndex] == NULL) + break; + + inNode = arrayOutInNode[1][arcIndex]; + outNode = arrayOutInNode[0][arcIndex]; + + setNodeStatus(node, true); setUniqueness(node, false); + setNodeStatus(inNode, true); setUniqueness(inNode, true); + setNodeStatus(outNode, true); setUniqueness(outNode, true); + + if (!pushNeighboursInterRepeat(inNode, node, outNode, graph)) { + printf("Error!! 
Separating Failed at Node %d", + getNodeID(node)); + printf(" -- In : %d Out : %d\n\n", + getNodeID(inNode), getNodeID(outNode)); + exit(1); + } + } + } + + // Reset NodeStatus and Uniqueness + resetNodeStatus(graph); resetUniqueness(graph); + + // Return the number of InterRepeats + return numInterRepeat; +} + +static boolean trimLongReadTipsSubgraph(int * subgraphMask) +{ + IDnum index; + Node *node; + PassageMarker *marker, *next; + boolean flagLongRead = false; + + //printf("Trimming read tips\n"); + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + + if (node == NULL || subgraphMask[index + nodeCount(graph)] != 1) + continue; + + if (getUniqueness(node)) + continue; + + for (marker = getMarker(node); marker != NULL; marker = next) { + next = getNextInNode(marker); + + flagLongRead = true; + + if (!isInitial(marker) && !isTerminal(marker)) + continue; + + if (isTerminal(marker)) + marker = getTwinMarker(marker); + + while (!getUniqueness(getNode(marker))) { + if (next != NULL + && (marker == next + || marker == getTwinMarker(next))) + next = getNextInNode(next); + if (getNextInSequence(marker) != NULL) { + marker = getNextInSequence(marker); + destroyPassageMarker + (getPreviousInSequence + (marker)); + } else { + destroyPassageMarker(marker); + break; + } + } + } + } + + return flagLongRead; +} + +static int computeNodeCount(Graph * argGraph) +{ + int index; + int count = 0; + Node *node; + Graph *graph = argGraph; + + for (index = 1; index <= nodeCount(graph); index++) { + node = getNodeInGraph(graph, index); + if (node != NULL) + count++; + } + + return count; +} + +void readCoherentSubgraph(Graph * inGraph, double expCovSubgraph, + ReadSet * reads, int * subgraphMask) +{ + IDnum nodeIndex; + Node *node; + IDnum previousNodeCount = 0; + int checkModified = -1; + + graph = inGraph; + listMemory = newRecycleBin(sizeof(PassageMarkerList), 100000); + expected_coverage = expCovSubgraph; + sequences = 
reads->tSequences; + + //puts("Read coherency..."); + + if (!trimLongReadTipsSubgraph(subgraphMask)) { + destroyRecycleBin(listMemory); + //puts("Read Coherency didn't work. No Long Reads in the Subgraph"); + return; + } + + while (previousNodeCount != computeNodeCount(graph)) { + + previousNodeCount = computeNodeCount(graph); + + for (nodeIndex = 1; nodeIndex <= nodeCount(graph); nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex); + if (node == NULL || !getUniqueness(node)) + continue; + + while (uniqueNodesConnect(node)) + node = bypass(); + + node = getTwinNode(node); + while (uniqueNodesConnect(node)) + node = bypass(); + } + + checkModified++; + } + + destroyRecycleBin(listMemory); + + //printf("readCoherentSubgraph checkModified = %d\n", checkModified); + + //puts("Read coherency over!"); +} +// Original diff --git a/contrib/MetaVelvet-v0.3.1/src/readCoherentGraph.h b/contrib/MetaVelvet-v0.3.1/src/readCoherentGraph.h new file mode 100644 index 0000000..f3f3fa5 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/readCoherentGraph.h @@ -0,0 +1,45 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _READCOHERENTGRAPH_H_ +#define _READCOHERENTGRAPH_H_ + +// Original +boolean isUniqueSolexaSubgraph(Node * node, double expCovSubgraph); +int identifyAndSeparateInterRepeats(Graph * argGraph, double * expCovMulti, + double repeatNodeCovSD); +void identifyUniqueNodesSubgraph(Graph * graph, int * subgraphMask, + boolean(*isUniqueSubgraph) (Node *, double), + double expCovSubgraph); +void readCoherentSubgraph(Graph * inGraph, double expCovSubgraph, + ReadSet * reads, int * subgraphMask); +// Original + +void readCoherentGraph(Graph * graph, boolean(*isUnique) (Node * node), + double coverage, ReadSet * reads); + +boolean isUniqueBasic(Node * node); + +boolean isUniqueSolexa(Node * node); + +void setBaseCoverage(double coverage); + +void setMultiplicityCutoff(int value); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/readSet.c b/contrib/MetaVelvet-v0.3.1/src/readSet.c new file mode 100644 index 0000000..07876bc --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/readSet.c @@ -0,0 +1,1606 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include +#include +#include + +#include "globals.h" +#include "tightString.h" +#include "readSet.h" +#include "utility.h" + +#if defined(_WIN32) || defined(__WIN32__) || defined(WIN32) +#include "../../third-party/zlib-1.2.3/Win32/include/zlib.h" +#else +#include "../../third-party/zlib-1.2.3/zlib.h" +#endif + +#if defined(MSDOS) || defined(OS2) || defined(WIN32) || defined(__CYGWIN__) +# include +# include +# define SET_BINARY_MODE(file) setmode(fileno(file), O_BINARY) +#else +# define SET_BINARY_MODE(file) +#endif + +ReadSet *newReadSet() +{ + ReadSet *rs = callocOrExit(1, ReadSet); + return rs; +} + +static void velvetifySequence(char * str) { + int i = strlen(str) - 1; + char c; + + for (i = strlen(str) - 1; i >= 0; i--) { + c = str[i]; + switch (c) { + case '\n': + case '\r': + case EOF: + str[i] = '\0'; + break; + case 'C': + case 'c': + str[i] = 'C'; + break; + case 'G': + case 'g': + str[i] = 'G'; + break; + case 'T': + case 't': + str[i] = 'T'; + break; + default: + str[i] = 'A'; + } + } +} + +static void reverseComplementSequence(char * str) +{ + size_t length = strlen(str); + size_t i; + + for (i = 0; i < length-1 - i; i++) { + char c = str[i]; + str[i] = str[length-1 - i]; + str[length-1 - i] = c; + } + + for (i = 0; i < length; i++) { + switch (str[i]) { + case 'A': + case 'a': + str[i] = 'T'; + break; + case 'C': + case 'c': + str[i] = 'G'; + break; + case 'G': + case 'g': + str[i] = 'C'; + break; + // As in velvetifySequence(), anything unusual ends up as 'A' + default: + str[i] = 'A'; + break; + } + } +} + +static void writeFastaSequence(FILE * outfile, const char * str) +{ + size_t length = strlen(str); + size_t start; + for (start = 0; start < length; start += 60) + fprintf(outfile, "%.60s\n", &str[start]); +} + +ReadSet 
*newReadSetAroundTightStringArray(TightString ** array, + IDnum length) +{ + ReadSet *rs = newReadSet(); + rs->tSequences = array; + rs->readCount = length; + return rs; +} + +void concatenateReadSets(ReadSet * A, ReadSet * B) +{ + ReadSet tmp; + IDnum index; + + // Read count: + tmp.readCount = A->readCount + B->readCount; + + // Sequences + if (A->sequences != NULL || B->sequences != NULL) { + tmp.sequences = mallocOrExit(tmp.readCount, char *); + if (A->sequences != NULL) { + for (index = 0; index < A->readCount; index++) + tmp.sequences[index] = A->sequences[index]; + free(A->sequences); + } else + for (index = 0; index < A->readCount; index++) + tmp.sequences[index] = NULL; + + if (B->sequences != NULL) { + for (index = 0; index < B->readCount; index++) + tmp.sequences[A->readCount + index] = + B->sequences[index]; + free(B->sequences); + } else + for (index = 0; index < B->readCount; index++) + tmp.sequences[A->readCount + index] = NULL; + } else + tmp.sequences = NULL; + + // tSequences + if (A->tSequences != NULL || B->tSequences != NULL) { + tmp.tSequences = + mallocOrExit(tmp.readCount, TightString *); + + if (A->tSequences != NULL) { + for (index = 0; index < A->readCount; index++) + tmp.tSequences[index] = + A->tSequences[index]; + free(A->tSequences); + } else + for (index = 0; index < A->readCount; index++) + tmp.tSequences[index] = NULL; + + if (B->tSequences != NULL) { + for (index = 0; index < B->readCount; index++) + tmp.tSequences[A->readCount + index] = + B->tSequences[index]; + free(B->tSequences); + } else + for (index = 0; index < B->readCount; index++) + tmp.tSequences[A->readCount + index] = + NULL; + } else + tmp.tSequences = NULL; + + // Labels + if (A->labels != NULL || B->labels != NULL) { + tmp.labels = mallocOrExit(tmp.readCount, char *); + + if (A->labels != NULL) { + for (index = 0; index < A->readCount; index++) + tmp.labels[index] = A->labels[index]; + free(A->labels); + } else + for (index = 0; index < A->readCount; index++) + 
tmp.labels[index] = NULL; + + if (B->labels != NULL) { + for (index = 0; index < B->readCount; index++) + tmp.labels[A->readCount + index] = + B->labels[index]; + free(B->labels); + } else + for (index = 0; index < B->readCount; index++) + tmp.labels[A->readCount + index] = NULL; + } else + tmp.labels = NULL; + + + // Confidence scores + if (A->confidenceScores != NULL || B->confidenceScores != NULL) { + tmp.confidenceScores = + mallocOrExit(tmp.readCount, Quality *); + + if (A->confidenceScores != NULL) { + for (index = 0; index < A->readCount; index++) + tmp.confidenceScores[index] = + A->confidenceScores[index]; + free(A->confidenceScores); + } else + for (index = 0; index < A->readCount; index++) + tmp.confidenceScores[index] = NULL; + + if (B->confidenceScores != NULL) { + for (index = 0; index < B->readCount; index++) + tmp.confidenceScores[A->readCount + + index] = + B->confidenceScores[index]; + free(B->confidenceScores); + } else + for (index = 0; index < B->readCount; index++) + tmp.confidenceScores[A->readCount + + index] = NULL; + } else + tmp.confidenceScores = NULL; + + // Kmer probabilities + if (A->kmerProbabilities != NULL || B->kmerProbabilities != NULL) { + tmp.kmerProbabilities = + mallocOrExit(tmp.readCount, Quality *); + + if (A->kmerProbabilities != NULL) { + for (index = 0; index < A->readCount; index++) + tmp.kmerProbabilities[index] = + A->kmerProbabilities[index]; + free(A->kmerProbabilities); + } else + for (index = 0; index < A->readCount; index++) + tmp.kmerProbabilities[index] = NULL; + + if (B->kmerProbabilities != NULL) { + for (index = 0; index < B->readCount; index++) + tmp.kmerProbabilities[A->readCount + + index] = + B->kmerProbabilities[index]; + free(B->kmerProbabilities); + } else + for (index = 0; index < B->readCount; index++) + tmp.kmerProbabilities[A->readCount + + index] = NULL; + } else + tmp.kmerProbabilities = NULL; + + // Mate reads + if (A->mateReads != NULL || B->mateReads != NULL) { + tmp.mateReads = 
mallocOrExit(tmp.readCount, IDnum); + + if (A->mateReads != NULL) { + for (index = 0; index < A->readCount; index++) + tmp.mateReads[index] = A->mateReads[index]; + free(A->mateReads); + } else + for (index = 0; index < A->readCount; index++) + tmp.mateReads[index] = 0; + + if (B->mateReads != NULL) { + for (index = 0; index < B->readCount; index++) + tmp.mateReads[A->readCount + index] = + B->mateReads[index]; + free(B->mateReads); + } else + for (index = 0; index < B->readCount; index++) + tmp.mateReads[A->readCount + index] = 0; + } else + tmp.mateReads = NULL; + + // Categories + if (A->categories != NULL || B->categories != NULL) { + tmp.categories = mallocOrExit(tmp.readCount, Quality *); + + if (A->categories != NULL) { + for (index = 0; index < A->readCount; index++) + tmp.categories[index] = + A->categories[index]; + free(A->categories); + } else + for (index = 0; index < A->readCount; index++) + tmp.categories[index] = CATEGORIES; + + if (B->categories != NULL) { + for (index = 0; index < B->readCount; index++) + tmp.categories[A->readCount + index] = + B->categories[index]; + free(B->categories); + } else + for (index = 0; index < B->readCount; index++) + tmp.categories[A->readCount + index] = + CATEGORIES; + } else + tmp.categories = NULL; + + // Put everything back into A + A->readCount = tmp.readCount; + A->sequences = tmp.sequences; + A->tSequences = tmp.tSequences; + A->labels = tmp.labels; + A->confidenceScores = tmp.confidenceScores; + A->kmerProbabilities = tmp.kmerProbabilities; + A->mateReads = tmp.mateReads; + A->categories = tmp.categories; + + // Deallocate + free(B); +} + +void convertSequences(ReadSet * rs) +{ + rs->tSequences = + newTightStringArrayFromStringArray(rs->sequences, + rs->readCount); + rs->sequences = NULL; +} + +static Probability convertQualityScore(Quality score) +{ + return (Probability) 1 - pow(10, -score / ((double) 10)); +} + +void convertConfidenceScores(ReadSet * rs, int WORDLENGTH) +{ + Quality *baseCallerScores; + 
Probability *kmerProbabilities; + IDnum index; + Coordinate position; + Probability proba; + + rs->kmerProbabilities = + mallocOrExit(rs->readCount, Probability *); + + for (index = 0; index < rs->readCount; index++) { + rs->kmerProbabilities[index] = + mallocOrExit(getLength(rs->tSequences[index]) - WORDLENGTH + + 1, Probability); + kmerProbabilities = rs->kmerProbabilities[index]; + baseCallerScores = rs->confidenceScores[index]; + + proba = 1; + for (position = 0; + position < getLength(rs->tSequences[index]); + position++) { + proba *= + convertQualityScore(baseCallerScores + [position]); + if (position < WORDLENGTH) + continue; + + proba /= + convertQualityScore(baseCallerScores + [position - WORDLENGTH]); + kmerProbabilities[position - WORDLENGTH + 1] = + proba; + } + + rs->confidenceScores[index] = NULL; + free(baseCallerScores); + } + + free(rs->confidenceScores); + rs->confidenceScores = NULL; +} + +void categorizeReads(ReadSet * readSet, Category category) +{ + IDnum index; + + if (readSet->categories == NULL) + readSet->categories = + mallocOrExit(readSet->readCount, Category); + + for (index = 0; index < readSet->readCount; index++) + readSet->categories[index] = category; +} + +void simplifyReads(ReadSet * readSet) +{ + IDnum index; + + if (readSet->categories == NULL) + readSet->categories = + mallocOrExit(readSet->readCount, Category); + + for (index = 0; index < readSet->readCount; index++) { + if (readSet->categories[index] < CATEGORIES) { + readSet->categories[index] /= 2; + readSet->categories[index] *= 2; + } + } +} + +void exportIDMapping(char *filename, ReadSet * reads) +{ + IDnum index; + FILE *outfile = fopen(filename, "w"); + + if (outfile == NULL) { + printf("Couldn't open %s, sorry\n", filename); + return; + } else + puts("Writing into file..."); + + if (reads->labels == NULL) { + fclose(outfile); + return; + } + + for (index = 0; index < reads->readCount; index++) + if (reads->labels != NULL) + fprintf(outfile, "s/SEQUENCE %ld/%s/\n", 
(long) (index + 1), + reads->labels[index]); + + fclose(outfile); + +} + +// Returns the value of a 32-bit little-endian-stored integer. +static int int32(const unsigned char * ptr) +{ + int x = ptr[3]; + x = (x << 8) | ptr[2]; + x = (x << 8) | ptr[1]; + x = (x << 8) | ptr[0]; + return x; +} + +// Imports sequences from a fastq file +// Memory space allocated within this function. +static void readSolexaFile(FILE* outfile, char *filename, Category cat, IDnum * sequenceIndex) +{ + FILE *file = fopen(filename, "r"); + IDnum counter = 0; + const int maxline = 500; + char line[500]; + char readName[500]; + char readSeq[500]; + char str[100]; + Coordinate start; + + if (strcmp(filename, "-")) + file = fopen(filename, "r"); + else + file = stdin; + + if (file != NULL) + printf("Reading Solexa file %s\n", filename); + else + exitErrorf(EXIT_FAILURE, true, "Could not open %s", filename); + + while (fgets(line, maxline, file) != NULL) + if (strchr(line, '.') == NULL) { + sscanf(line, "%s\t%*i\t%*i\t%*i\t%*c%[^\n]", + readName, readSeq); + fprintf(outfile, ">%s\t%ld\t%d\n", readName, (long) ((*sequenceIndex)++), (int) cat); + velvetifySequence(readSeq); + start = 0; + while (start <= strlen(readSeq)) { + strncpy(str, readSeq + start, 60); + str[60] = '\0'; + fprintf(outfile, "%s\n", str); + start += 60; + } + + counter++; + } + + fclose(file); + + printf("%d sequences found\n", counter); + puts("Done"); +} + +static void readElandFile(FILE* outfile, char *filename, Category cat, IDnum * sequenceIndex) +{ + FILE *file = fopen(filename, "r"); + IDnum counter = 0; + const int maxline = 5000; + char line[5000]; + char readName[5000]; + char readSeq[5000]; + char str[100]; + Coordinate start; + + if (strcmp(filename, "-")) + file = fopen(filename, "r"); + else + file = stdin; + + if (file != NULL) + printf("Reading Solexa file %s\n", filename); + else + exitErrorf(EXIT_FAILURE, true, "Could not open %s", filename); + + // Reopen file and memorize line: + while (fgets(line, 
maxline, file) != NULL) { + sscanf(line, "%[^\t]\t%[^\t\n]", + readName, readSeq); + fprintf(outfile, ">%s\t%ld\t%d\n", readName, (long) ((*sequenceIndex)++), (int) cat); + velvetifySequence(readSeq); + start = 0; + while (start <= strlen(readSeq)) { + strncpy(str, readSeq + start, 60); + str[60] = '\0'; + fprintf(outfile, "%s\n", str); + start += 60; + } + + counter++; + } + + fclose(file); + + printf("%d sequences found\n", counter); + puts("Done"); +} + +void goToEndOfLine(char *line, FILE * file) +{ + size_t length = strlen(line); + char c = line[length - 1]; + + while (c != '\n') + c = fgetc(file); +} + +// Imports sequences from a fastq file +// Memory space allocated within this function. +static void readFastQFile(FILE* outfile, char *filename, Category cat, IDnum * sequenceIndex) +{ + FILE *file; + const int maxline = 5000; + char line[5000]; + char str[100]; + IDnum counter = 0; + Coordinate start, i; + char c; + + if (strcmp(filename, "-")) + file = fopen(filename, "r"); + else + file = stdin; + + if (file != NULL) + printf("Reading FastQ file %s\n", filename); + else + exitErrorf(EXIT_FAILURE, true, "Could not open %s", filename); + + // Checking if FastQ + c = getc(file); + if (c != '@') + exitErrorf(EXIT_FAILURE, false, "%s does not seem to be in FastQ format", filename); + ungetc(c, file); + + while(fgets(line, maxline, file)) { + + for (i = strlen(line) - 1; + i >= 0 && (line[i] == '\n' || line[i] == '\r'); i--) { + line[i] = '\0'; + } + + fprintf(outfile,">%s\t%ld\t%d\n", line + 1, (long) ((*sequenceIndex)++), (int) cat); + counter++; + + if(!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "%s incomplete.", filename); + + velvetifySequence(line); + + start = 0; + while (start <= strlen(line)) { + strncpy(str, line + start, 60); + str[60] = '\0'; + fprintf(outfile, "%s\n", str); + start += 60; + } + + if(!fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "%s incomplete.", filename); + if(!fgets(line, maxline, file)) + 
exitErrorf(EXIT_FAILURE, true, "%s incomplete.", filename); + } + + fclose(file); + printf("%d reads found.\n", counter); + puts("Done"); +} + +// Imports sequences from a zipped rfastq file +// Memory space allocated within this function. +static void readFastQGZFile(FILE * outfile, char *filename, Category cat, IDnum *sequenceIndex) +{ + gzFile file; + const int maxline = 5000; + char line[5000]; + char str[100]; + IDnum counter = 0; + Coordinate start, i; + char c; + + if (strcmp(filename, "-")) + file = gzopen(filename, "rb"); + else { + file = gzdopen(fileno(stdin), "rb"); + SET_BINARY_MODE(stdin); + } + + if (file != NULL) + printf("Reading FastQ file %s\n", filename); + else + exitErrorf(EXIT_FAILURE, true, "Could not open %s", filename); + + // Checking if FastQ + c = gzgetc(file); + if (c != '@') + exitErrorf(EXIT_FAILURE, false, "%s does not seem to be in FastQ format", filename); + gzungetc(c, file); + + while (gzgets(file, line, maxline)) { + for (i = strlen(line) - 1; + i >= 0 && (line[i] == '\n' || line[i] == '\r'); i--) { + line[i] = '\0'; + } + + fprintf(outfile,">%s\t%ld\t%d\n", line + 1, (long) ((*sequenceIndex)++), (int) cat); + counter++; + + gzgets(file, line, maxline); + + velvetifySequence(line); + + start = 0; + while (start <= strlen(line)) { + strncpy(str, line + start, 60); + str[60] = '\0'; + fprintf(outfile, "%s\n", str); + start += 60; + } + + gzgets(file, line, maxline); + gzgets(file, line, maxline); + } + + gzclose(file); + printf("%d reads found.\n", counter); + puts("Done"); +} + +// Imports sequences from a fasta file +// Memory is allocated within the function +static void readFastAFile(FILE* outfile, char *filename, Category cat, IDnum * sequenceIndex) +{ + FILE *file; + const int maxline = 5000; + char line[5000]; + char str[100]; + IDnum counter = 0; + Coordinate i; + char c; + Coordinate start; + int offset = 0; + + if (strcmp(filename, "-")) + file = fopen(filename, "r"); + else + file = stdin; + + if (file != NULL) + 
printf("Reading FastA file %s;\n", filename); + else + exitErrorf(EXIT_FAILURE, true, "Could not open %s", filename); + + // Checking if FastA + c = getc(file); + if (c != '>') + exitErrorf(EXIT_FAILURE, false, "%s does not seem to be in FastA format", filename); + ungetc(c, file); + + while (fgets(line, maxline, file)) { + if (line[0] == '>') { + if (strchr(line,'\t')) { + printf("FastA headers in %s contain tabs, please remove them.\n", filename); + printf("E.g.: %s", line); + puts("Exiting"); + exit(1); + } + + if (offset != 0) { + fprintf(outfile, "\n"); + offset = 0; + } + + for (i = strlen(line) - 1; + i >= 0 && (line[i] == '\n' || line[i] == '\r'); i--) { + line[i] = '\0'; + } + + fprintf(outfile,"%s\t%ld\t%d\n", line, (long) ((*sequenceIndex)++), (int) cat); + counter++; + } else { + velvetifySequence(line); + start = 0; + while (start < strlen(line)) { + strncpy(str, line + start, 60 - offset); + str[60 - offset] = '\0'; + fprintf(outfile, "%s", str); + offset += strlen(str); + if (offset >= 60) { + fprintf(outfile, "\n"); + offset = 0; + } + start += strlen(str); + } + } + } + + if (offset != 0) + fprintf(outfile, "\n"); + fclose(file); + + printf("%d sequences found\n", counter); + puts("Done"); +} + +// Imports sequences from a zipped fasta file +// Memory is allocated within the function +static void readFastAGZFile(FILE* outfile, char *filename, Category cat, IDnum * sequenceIndex) +{ + gzFile file; + const int maxline = 5000; + char line[5000]; + char str[100]; + IDnum counter = 0; + Coordinate i, start; + char c; + int offset = 0; + + if (strcmp(filename, "-")) + file = gzopen(filename, "rb"); + else { + file = gzdopen(fileno(stdin), "rb"); + SET_BINARY_MODE(stdin); + } + + if (file != NULL) + printf("Reading zipped FastA file %s;\n", filename); + else + exitErrorf(EXIT_FAILURE, true, "Could not open %s", filename); + + // Checking if FastA + c = gzgetc(file); + if (c != '>') + exitErrorf(EXIT_FAILURE, false, "%s does not seem to be in FastA 
format", filename); + gzungetc(c, file); + + while (gzgets(file, line, maxline)) { + if (line[0] == '>') { + if (offset != 0) { + fprintf(outfile, "\n"); + offset = 0; + } + + for (i = strlen(line) - 1; + i >= 0 && (line[i] == '\n' || line[i] == '\r'); i--) { + line[i] = '\0'; + } + + fprintf(outfile, "%s\t%ld\t%d\n", line, (long) ((*sequenceIndex)++), (int) cat); + counter++; + } else { + velvetifySequence(line); + + start = 0; + while (start < strlen(line)) { + strncpy(str, line + start, 60 - offset); + str[60 - offset] = '\0'; + fprintf(outfile, "%s", str); + offset += strlen(str); + if (offset >= 60) { + fprintf(outfile, "\n"); + offset = 0; + } + start += strlen(str); + } + } + } + + if (offset != 0) + fprintf(outfile, "\n"); + gzclose(file); + + printf("%d sequences found\n", counter); + puts("Done"); +} + +// Parser for new output +static void readMAQGZFile(FILE* outfile, char *filename, Category cat, IDnum * sequenceIndex) +{ + gzFile file; + const int maxline = 1000; + char line[1000]; + IDnum counter = 0; + char readName[500]; + char readSeq[500]; + char str[100]; + Coordinate start; + + if (strcmp(filename, "-")) + file = gzopen(filename, "rb"); + else { + file = gzdopen(fileno(stdin), "rb"); + SET_BINARY_MODE(stdin); + } + + if (file != NULL) + printf("Reading zipped MAQ file %s\n", filename); + else + exitErrorf(EXIT_FAILURE, true, "Could not open %s", filename); + + // Reopen file and memorize line: + while (gzgets(file, line, maxline)) { + sscanf(line, "%s\t%*i\t%*i\t%*c\t%*i\t%*i\t%*i\t%*i\t%*i\t%*i\t%*i\t%*i\t%*i\t%*i\t%[^\t]", + readName, readSeq); + fprintf(outfile, ">%s\t%ld\t%d\n", readName, (long) ((*sequenceIndex)++), (int) cat); + velvetifySequence(readSeq); + start = 0; + while (start <= strlen(readSeq)) { + strncpy(str, readSeq + start, 60); + str[60] = '\0'; + fprintf(outfile, "%s\n", str); + start += 60; + } + + counter++; + } + + gzclose(file); + + printf("%d sequences found\n", counter); + puts("Done"); +} + +static void 
readSAMFile(FILE *outfile, char *filename, Category cat, IDnum *sequenceIndex) +{ + char line[5000]; + unsigned long lineno, readCount; + char previous_qname_pairing[10]; + char previous_qname[5000]; + char previous_seq[5000]; + boolean previous_paired = false; + + FILE *file = (strcmp(filename, "-") != 0)? fopen(filename, "r") : stdin; + if (file) + printf("Reading SAM file %s\n", filename); + else + exitErrorf(EXIT_FAILURE, true, "Could not open %s", filename); + + readCount = 0; + for (lineno = 1; fgets(line, sizeof(line), file); lineno++) + if (line[0] != '@') { + char *qname, *flag, *seq; + int i; + + qname = strtok(line, "\t"); + flag = strtok(NULL, "\t"); + for (i = 3; i < 10; i++) + (void) strtok(NULL, "\t"); + seq = strtok(NULL, "\t"); + + if (seq == NULL) { + fprintf(stderr, + "Line #%lu: ignoring SAM record with too few fields\n", + lineno); + } + else if (strcmp(seq, "*") == 0) { + fprintf(stderr, + "Line #%lu: ignoring SAM record with omitted SEQ field\n", + lineno); + } + else { + // Accept flags represented in either decimal or hex: + int flagbits = strtol(flag, NULL, 0); + + const char *qname_pairing = ""; + if (flagbits & 0x40) + qname_pairing = "/1"; + else if (flagbits & 0x80) + qname_pairing = "/2"; + + if (flagbits & 0x10) + reverseComplementSequence(seq); + + // Determine if paired to previous read + if (readCount > 0) { + if (cat % 2) { + if (previous_paired) { + // Last read paired to penultimate read + fprintf(outfile, ">%s%s\t%ld\t%d\n", previous_qname, previous_qname_pairing, + (long) ((*sequenceIndex)++), (int) cat); + writeFastaSequence(outfile, previous_seq); + previous_paired = false; + } else if (strcmp(qname, previous_qname) == 0) { + // Last read paired to current reads + fprintf(outfile, ">%s%s\t%ld\t%d\n", previous_qname, previous_qname_pairing, + (long) ((*sequenceIndex)++), (int) cat); + writeFastaSequence(outfile, previous_seq); + previous_paired = true; + } else { + // Last read unpaired + fprintf(outfile, ">%s%s\t%ld\t%d\n", 
previous_qname, previous_qname_pairing, + (long) ((*sequenceIndex)++), (int) cat - 1); + writeFastaSequence(outfile, previous_seq); + previous_paired = false; + } + } else { + // Unpaired dataset + fprintf(outfile, ">%s%s\t%ld\t%d\n", previous_qname, previous_qname_pairing, + (long) ((*sequenceIndex)++), (int) cat); + writeFastaSequence(outfile, previous_seq); + } + } + + strcpy(previous_qname, qname); + strcpy(previous_qname_pairing, qname_pairing); + strcpy(previous_seq, seq); + velvetifySequence(previous_seq); + + readCount++; + } + } + + if (readCount) { + if (cat % 2) { + if (previous_paired) { + // Last read paired to penultimate read + fprintf(outfile, ">%s%s\t%ld\t%d\n", previous_qname, previous_qname_pairing, + (long) ((*sequenceIndex)++), (int) cat); + writeFastaSequence(outfile, previous_seq); + } else { + // Last read unpaired + fprintf(outfile, ">%s%s\t%ld\t%d\n", previous_qname, previous_qname_pairing, + (long) ((*sequenceIndex)++), (int) cat - 1); + writeFastaSequence(outfile, previous_seq); + } + } else { + // Unpaired dataset + fprintf(outfile, ">%s%s\t%ld\t%d\n", previous_qname, previous_qname_pairing, + (long) ((*sequenceIndex)++), (int) cat); + writeFastaSequence(outfile, previous_seq); + } + + } + + fclose(file); + printf("%lu reads found.\nDone\n", readCount); +} + +static int readBAMint32(gzFile file) +{ + unsigned char buffer[4]; + if (gzread(file, buffer, 4) != 4) + exitErrorf(EXIT_FAILURE, false, "BAM file header truncated"); + + return int32(buffer); +} + +static void readBAMFile(FILE *outfile, char *filename, Category cat, IDnum *sequenceIndex) +{ + size_t seqCapacity = 0; + char *seq = NULL; + size_t bufferCapacity = 4; + unsigned char *buffer = mallocOrExit(bufferCapacity, unsigned char); + unsigned long recno, readCount; + int i, refCount; + gzFile file; + char previous_qname_pairing[10]; + char previous_qname[5000]; + char previous_seq[5000]; + boolean previous_paired = false; + + if (strcmp(filename, "-") != 0) + file = 
gzopen(filename, "rb"); + else { + file = gzdopen(fileno(stdin), "rb"); + SET_BINARY_MODE(stdin); + } + + if (file != NULL) + printf("Reading BAM file %s\n", filename); + else + exitErrorf(EXIT_FAILURE, true, "Could not open %s", filename); + + if (! (gzread(file, buffer, 4) == 4 && memcmp(buffer, "BAM\1", 4) == 0)) + exitErrorf(EXIT_FAILURE, false, "%s is not in BAM format", filename); + + // Skip header text + if (gzseek(file, readBAMint32(file), SEEK_CUR) == -1) + exitErrorf(EXIT_FAILURE, false, "gzseek failed"); + + // Skip header reference list + refCount = readBAMint32(file); + for (i = 0; i < refCount; i++) { + if (gzseek(file, readBAMint32(file) + 4, SEEK_CUR) == -1) + exitErrorf(EXIT_FAILURE, false, "gzseek failed"); + } + + readCount = 0; + for (recno = 1; gzread(file, buffer, 4) == 4; recno++) { + int blockSize = int32(buffer); + int readLength; + + if (bufferCapacity < 4 + blockSize) { + bufferCapacity = 4 + blockSize + 4096; + buffer = reallocOrExit(buffer, bufferCapacity, unsigned char); + } + + if (gzread(file, &buffer[4], blockSize) != blockSize) + exitErrorf(EXIT_FAILURE, false, "BAM alignment record truncated"); + + readLength = int32(&buffer[20]); + if (readLength == 0) { + fprintf(stderr, + "Record #%lu: ignoring BAM record with omitted SEQ field\n", + recno); + } + else { + int readNameLength = buffer[12]; + int flag_nc = int32(&buffer[16]); + int flagbits = flag_nc >> 16; + int cigarLength = flag_nc & 0xffff; + char *qname = (char *)&buffer[36]; + unsigned char *rawseq = + &buffer[36 + readNameLength + 4 * cigarLength]; + + const char *qname_pairing = ""; + if (flagbits & 0x40) + qname_pairing = "/1"; + else if (flagbits & 0x80) + qname_pairing = "/2"; + + if (seqCapacity < readLength + 1) { + seqCapacity = readLength * 2 + 1; + seq = reallocOrExit(seq, seqCapacity, char); + } + + for (i = 0; i < readLength; i += 2) { + static const char decode_bases[] = "=ACMGRSVTWYHKDBN"; + unsigned int packed = *rawseq++; + seq[i] = decode_bases[packed >> 
4]; + seq[i+1] = decode_bases[packed & 0xf]; + } + seq[readLength] = '\0'; + + if (flagbits & 0x10) + reverseComplementSequence(seq); + + // Determine if paired to previous read + if (readCount > 0) { + if (cat % 2) { + if (previous_paired) { + // Last read paired to penultimate read + fprintf(outfile, ">%s%s\t%ld\t%d\n", previous_qname, previous_qname_pairing, + (long) ((*sequenceIndex)++), (int) cat); + writeFastaSequence(outfile, previous_seq); + previous_paired = false; + } else if (strcmp(qname, previous_qname) == 0) { + // Last read paired to current reads + fprintf(outfile, ">%s%s\t%ld\t%d\n", previous_qname, previous_qname_pairing, + (long) ((*sequenceIndex)++), (int) cat); + writeFastaSequence(outfile, previous_seq); + previous_paired = true; + } else { + // Last read unpaired + fprintf(outfile, ">%s%s\t%ld\t%d\n", previous_qname, previous_qname_pairing, + (long) ((*sequenceIndex)++), (int) cat - 1); + writeFastaSequence(outfile, previous_seq); + previous_paired = false; + } + } else { + // Unpaired dataset + fprintf(outfile, ">%s%s\t%ld\t%d\n", previous_qname, previous_qname_pairing, + (long) ((*sequenceIndex)++), (int) cat); + writeFastaSequence(outfile, previous_seq); + } + } + + strcpy(previous_qname, qname); + strcpy(previous_qname_pairing, qname_pairing); + strcpy(previous_seq, seq); + velvetifySequence(previous_seq); + + readCount++; + } + } + + if (readCount) { + if (cat % 2) { + if (previous_paired) { + // Last read paired to penultimate read + fprintf(outfile, ">%s%s\t%ld\t%d\n", previous_qname, previous_qname_pairing, + (long) ((*sequenceIndex)++), (int) cat); + writeFastaSequence(outfile, previous_seq); + } else { + // Last read unpaired + fprintf(outfile, ">%s%s\t%ld\t%d\n", previous_qname, previous_qname_pairing, + (long) ((*sequenceIndex)++), (int) cat - 1); + writeFastaSequence(outfile, previous_seq); + } + } else { + // Unpaired dataset + fprintf(outfile, ">%s%s\t%ld\t%d\n", previous_qname, previous_qname_pairing, + (long) 
((*sequenceIndex)++), (int) cat); + writeFastaSequence(outfile, previous_seq); + } + } + + free(seq); + free(buffer); + + gzclose(file); + printf("%lu reads found.\nDone\n", readCount); +} + +#define FASTQ 1 +#define FASTA 2 +#define GERALD 3 +#define ELAND 4 +#define FASTA_GZ 5 +#define FASTQ_GZ 6 +#define MAQ_GZ 7 +#define SAM 8 +#define BAM 9 + +// General argument parser for most functions +// Basically a reused portion of toplevel code dumped into here +void parseDataAndReadFiles(char * filename, int argc, char **argv, boolean * double_strand) +{ + int argIndex = 1; + FILE *outfile = fopen(filename, "w"); + int filetype = FASTA; + Category cat = 0; + IDnum sequenceIndex = 1; + short short_var; + + if (argc < 2) { + puts("Wrong number of arguments!"); + puts("Correct usage:"); + puts("run -<filetype> <list of files> [...] "); + puts("Allowed filetypes:"); + puts("\t-fasta"); + puts("\t-fastq"); + puts("\t-solexa"); + puts("\t-eland"); + puts("If reading exclusively fasta file, the -fasta parameter is not necessary"); + exit(1); + } + + for (argIndex = 1; argIndex < argc; argIndex++) { + if (argv[argIndex][0] == '-' && strlen(argv[argIndex]) > 1) { + + if (strcmp(argv[argIndex], "-fastq") == 0) + filetype = FASTQ; + else if (strcmp(argv[argIndex], "-fasta") == 0) + filetype = FASTA; + else if (strcmp(argv[argIndex], "-gerald") == 0) + filetype = GERALD; + else if (strcmp(argv[argIndex], "-eland") == 0) + filetype = ELAND; + else if (strcmp(argv[argIndex], "-fastq.gz") == 0) + filetype = FASTQ_GZ; + else if (strcmp(argv[argIndex], "-fasta.gz") == 0) + filetype = FASTA_GZ; + else if (strcmp(argv[argIndex], "-sam") == 0) + filetype = SAM; + else if (strcmp(argv[argIndex], "-bam") == 0) + filetype = BAM; + else if (strcmp(argv[argIndex], "-maq.gz") == 0) + filetype = MAQ_GZ; + else if (strcmp(argv[argIndex], "-short") == 0) + cat = 0; + else if (strcmp(argv[argIndex], "-shortPaired") == + 0) + cat = 1; + else if (strncmp + (argv[argIndex], "-shortPaired", + 12) == 0) { + 
sscanf(argv[argIndex], "-shortPaired%hd", &short_var); + cat = (Category) short_var; + if (cat < 1 || cat > CATEGORIES) { + printf("Unknown option: %s\n", + argv[argIndex]); + exit(1); + } + cat--; + cat *= 2; + cat++; + } else if (strncmp(argv[argIndex], "-short", 6) == + 0) { + sscanf(argv[argIndex], "-short%hd", &short_var); + cat = (Category) short_var; + if (cat < 1 || cat > CATEGORIES) { + printf("Unknown option: %s\n", + argv[argIndex]); + exit(1); + } + cat--; + cat *= 2; + } else if (strcmp(argv[argIndex], "-long") == 0) + cat = CATEGORIES * 2; + else if (strcmp(argv[argIndex], "-longPaired") == + 0) + cat = CATEGORIES * 2 + 1; + else if (strcmp(argv[argIndex], "-strand_specific") + == 0) + *double_strand = false; + else { + printf("Unknown option: %s\n", + argv[argIndex]); + exit(1); + } + + continue; + } + + if (cat == -1) + continue; + + switch (filetype) { + case FASTA: + readFastAFile(outfile, argv[argIndex], cat, &sequenceIndex); + break; + case FASTQ: + readFastQFile(outfile, argv[argIndex], cat, &sequenceIndex); + break; + case GERALD: + readSolexaFile(outfile, argv[argIndex], cat, &sequenceIndex); + break; + case ELAND: + readElandFile(outfile, argv[argIndex], cat, &sequenceIndex); + break; + case FASTA_GZ: + readFastAGZFile(outfile, argv[argIndex], cat, &sequenceIndex); + break; + case FASTQ_GZ: + readFastQGZFile(outfile, argv[argIndex], cat, &sequenceIndex); + break; + case SAM: + readSAMFile(outfile, argv[argIndex], cat, &sequenceIndex); + break; + case BAM: + readBAMFile(outfile, argv[argIndex], cat, &sequenceIndex); + break; + case MAQ_GZ: + readMAQGZFile(outfile, argv[argIndex], cat, &sequenceIndex); + break; + default: + puts("Screw up in parser... 
exiting"); + exit(1); + } + } + + fclose(outfile); +} + +void createReadPairingArray(ReadSet* reads) { + IDnum index; + IDnum *mateReads = mallocOrExit(reads->readCount, IDnum); + + for (index = 0; index < reads->readCount; index++) + mateReads[index] = -1; + + reads->mateReads = mateReads; +} + +boolean pairUpReads(ReadSet * reads, Category cat) +{ + int phase = 0; + IDnum index; + boolean found = false; + + for (index = 0; index < reads->readCount; index++) { + if (reads->categories[index] != cat) { + if (phase == 1) { + reads->mateReads[index - 1] = -1; + reads->categories[index - 1]--; + phase = 0; + } + } else if (phase == 0) { + found = true; + reads->mateReads[index] = index + 1; + phase = 1; + } else { + found = true; + reads->mateReads[index] = index - 1; + phase = 0; + } + } + + return found; +} + +void detachDubiousReads(ReadSet * reads, boolean * dubiousReads) +{ + IDnum index; + IDnum pairID; + IDnum sequenceCount = reads->readCount; + IDnum *mateReads = reads->mateReads; + + if (dubiousReads == NULL || mateReads == NULL) + return; + + for (index = 0; index < sequenceCount; index++) { + if (!dubiousReads[index]) + continue; + + pairID = mateReads[index]; + + if (pairID != -1) { + //printf("Separating %d and %d\n", index, pairID); + mateReads[index] = -1; + mateReads[pairID] = -1; + } + } +} + +static void exportRead(FILE * outfile, ReadSet * reads, IDnum index) +{ + Coordinate start, finish; + char str[100]; + TightString *sequence = reads->tSequences[index]; + + if (sequence == NULL) + return; + + fprintf(outfile, ">SEQUENCE_%ld_length_%lld", (long) index, + (long long) getLength(sequence)); + + if (reads->categories != NULL) + fprintf(outfile, "\t%i", (int) reads->categories[index]); + + fprintf(outfile, "\n"); + + start = 0; + while (start <= getLength(sequence)) { + finish = start + 60; + readTightStringFragment(sequence, start, finish, str); + fprintf(outfile, "%s\n", str); + start = finish; + } + + fflush(outfile); +} + +void exportReadSet(char 
*filename, ReadSet * reads) +{ + IDnum index; + FILE *outfile = fopen(filename, "w+"); + + if (outfile == NULL) { + puts("Couldn't open file, sorry"); + return; + } else + printf("Writing into readset file: %s\n", filename); + + for (index = 0; index < reads->readCount; index++) { + exportRead(outfile, reads, index); + } + + fclose(outfile); + + puts("Done"); +} + +ReadSet *importReadSet(char *filename) +{ + FILE *file = fopen(filename, "r"); + char *sequence = NULL; + Coordinate bpCount = 0; + const int maxline = 5000; + char line[5000]; + IDnum sequenceCount, sequenceIndex; + IDnum index; + ReadSet *reads; + short int temp_short; + + if (file != NULL) + printf("Reading read set file %s;\n", filename); + else + exitErrorf(EXIT_FAILURE, true, "Could not open %s", filename); + + reads = newReadSet(); + + // Count number of separate sequences + sequenceCount = 0; + while (fgets(line, maxline, file) != NULL) + if (line[0] == '>') + sequenceCount++; + fclose(file); + printf("%d sequences found\n", sequenceCount); + + reads->readCount = sequenceCount; + + if (reads->readCount == 0) { + reads->sequences = NULL; + reads->categories = NULL; + return reads; + } + + reads->sequences = callocOrExit(sequenceCount, char *); + reads->categories = callocOrExit(sequenceCount, Category); + // Counting base pair length of each sequence: + file = fopen(filename, "r"); + sequenceIndex = -1; + while (fgets(line, maxline, file) != NULL) { + if (line[0] == '>') { + + // Reading category info + sscanf(line, "%*[^\t]\t%*[^\t]\t%hd", + &temp_short); + reads->categories[sequenceIndex + 1] = (Category) temp_short; + + if (sequenceIndex != -1) + reads->sequences[sequenceIndex] = + mallocOrExit(bpCount + 1, char); + sequenceIndex++; + bpCount = 0; + } else { + bpCount += (Coordinate) strlen(line) - 1; + } + } + + //printf("Sequence %d has length %d\n", sequenceIndex, bpCount); + reads->sequences[sequenceIndex] = + mallocOrExit(bpCount + 1, char); + fclose(file); + + // Reopen file and memorize 
line: + file = fopen(filename, "r"); + sequenceIndex = -1; + while (fgets(line, maxline, file)) { + if (line[0] == '>') { + if (sequenceIndex != -1) { + sequence[bpCount] = '\0'; + } + sequenceIndex++; + bpCount = 0; + //printf("Starting to read sequence %d\n", + // sequenceIndex); + sequence = reads->sequences[sequenceIndex]; + } else { + for (index = 0; index < (Coordinate) strlen(line) - 1; + index++) + sequence[bpCount + index] = line[index]; + bpCount += (Coordinate) (strlen(line) - 1); + } + } + + sequence[bpCount] = '\0'; + fclose(file); + + puts("Done"); + return reads; + +} + +void logInstructions(int argc, char **argv, char *directory) +{ + int index; + char *logFilename = + mallocOrExit(strlen(directory) + 100, char); + FILE *logFile; + time_t date; + char *string; + + time(&date); + string = ctime(&date); + + strcpy(logFilename, directory); + strcat(logFilename, "/Log"); + logFile = fopen(logFilename, "a"); + + if (logFile == NULL) + exitErrorf(EXIT_FAILURE, true, "Could not write to %s", logFilename); + + fprintf(logFile, "%s", string); + + for (index = 0; index < argc; index++) + fprintf(logFile, " %s", argv[index]); + + fprintf(logFile, "\n"); + + fclose(logFile); + free(logFilename); +} + +void destroyReadSet(ReadSet * reads) +{ + IDnum index; + + if (reads == NULL) + return; + + if (reads->sequences != NULL) + for (index = 0; index < reads->readCount; index++) + free(reads->sequences[index]); + + if (reads->tSequences != NULL) + for (index = 0; index < reads->readCount; index++) + destroyTightString(reads->tSequences[index]); + + if (reads->labels != NULL) + for (index = 0; index < reads->readCount; index++) + free(reads->labels[index]); + + if (reads->confidenceScores != NULL) + for (index = 0; index < reads->readCount; index++) + free(reads->confidenceScores[index]); + + if (reads->kmerProbabilities != NULL) + for (index = 0; index < reads->readCount; index++) + free(reads->kmerProbabilities[index]); + + free(reads->sequences); + 
free(reads->tSequences); + free(reads->labels); + free(reads->confidenceScores); + free(reads->kmerProbabilities); + free(reads->mateReads); + free(reads->categories); + free(reads); +} + +Coordinate *getSequenceLengths(ReadSet * reads, int wordLength) +{ + Coordinate *lengths = callocOrExit(reads->readCount, Coordinate); + IDnum index; + int lengthOffset = wordLength - 1; + + for (index = 0; index < reads->readCount; index++) + lengths[index] = + getLength(reads->tSequences[index]) - lengthOffset; + + return lengths; +} + +Coordinate *getSequenceLengthsFromFile(char *filename, int wordLength) +{ + Coordinate *lengths; + FILE *file = fopen(filename, "r"); + Coordinate bpCount = 0; + const int maxline = 100; + char line[100]; + IDnum sequenceCount, sequenceIndex; + int lengthOffset = wordLength - 1; + + if (file != NULL) + printf("Reading read set file %s;\n", filename); + else + exitErrorf(EXIT_FAILURE, true, "Could not open %s", filename); + + // Count number of separate sequences + sequenceCount = 0; + while (fgets(line, maxline, file) != NULL) + if (line[0] == '>') + sequenceCount++; + fclose(file); + + lengths = callocOrExit(sequenceCount, Coordinate); + // Counting base pair length of each sequence: + file = fopen(filename, "r"); + sequenceIndex = -1; + while (fgets(line, maxline, file) != NULL) { + if (line[0] == '>') { + if (sequenceIndex != -1) + lengths[sequenceIndex] = + bpCount - lengthOffset; + sequenceIndex++; + bpCount = 0; + } else { + bpCount += (Coordinate) strlen(line) - 1; + } + } + lengths[sequenceIndex] = bpCount - lengthOffset; + fclose(file); + + return lengths; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/readSet.h b/contrib/MetaVelvet-v0.3.1/src/readSet.h new file mode 100644 index 0000000..eff8ef1 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/readSet.h @@ -0,0 +1,68 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. 
+ + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _READSET_H_ +#define _READSET_H_ + +struct readSet_st { + char **sequences; + TightString **tSequences; + char **labels; + Quality **confidenceScores; + Probability **kmerProbabilities; + IDnum *mateReads; + Category *categories; + IDnum readCount; +}; + +ReadSet *newReadSet(); +ReadSet *newReadSetAroundTightStringArray(TightString ** array, + IDnum length); + +Coordinate *getSequenceLengths(ReadSet * reads, int wordLength); +Coordinate *getSequenceLengthsFromFile(char *filename, int wordLength); + +void concatenateReadSets(ReadSet * A, ReadSet * B); + +void convertSequences(ReadSet * rs); + +void convertConfidenceScores(ReadSet * rs, int WORDLENGTH); + +void categorizeReads(ReadSet * reads, Category category); +void simplifyReads(ReadSet * reads); + +// Exports a .sed script allowing to transform internal IDs to the original ones +void exportIDMapping(char *filename, ReadSet * reads); + +ReadSet *importReadSet(char *filename); +void exportReadSet(char *filename, ReadSet * reads); + +// The overall argument parser and file reader for the hash function +void parseDataAndReadFiles(char * filename, int argc, char **argv, boolean * double_strand); + +void logInstructions(int argc, char **argv, char *directory); + +// Read pairing info +void createReadPairingArray(ReadSet* reads); 
+boolean pairUpReads(ReadSet * reads, Category cat); +void detachDubiousReads(ReadSet * reads, boolean * dubiousReads); + +void destroyReadSet(ReadSet * reads); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/recycleBin.c b/contrib/MetaVelvet-v0.3.1/src/recycleBin.c new file mode 100644 index 0000000..7802768 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/recycleBin.c @@ -0,0 +1,178 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +/****************************************************************\ +* * +* Efficient Memory Allocation Routines * +* * +* Guy St.C. Slater.. mailto:guy@ebi.ac.uk * +* Copyright (C) 2000-2005. All Rights Reserved. * +* * +* This source code is distributed under the terms of the * +* GNU Lesser General Public License. See the file COPYING * +* or http://www.fsf.org/copyleft/lesser.html for details * +* * +* If you use this code, please keep this notice intact. 
* +* * +\****************************************************************/ + +#include +#include + +#include "recycleBin.h" + +typedef struct RecycleBin_Node { + struct RecycleBin_Node *next; +} RecycleBin_Node; + +typedef struct chunk_st { + struct chunk_st *next; +} Chunk; + +struct recycleBin_st { + Chunk *chunk_list; + RecycleBin_Node *recycle; + size_t node_size; + int chunk_pos; + int nodes_per_chunk; +}; + +RecycleBin *newRecycleBin(size_t node_size, int nodes_per_chunk) +{ + register RecycleBin *recycle_bin = malloc(sizeof(RecycleBin)); + + if (recycle_bin == NULL) { + puts("Allocation failed!"); + exit(-1); + } + + if (node_size < sizeof(RecycleBin_Node)) { + puts("Too small elements to create a recycle bin!"); + exit(-1); + } + recycle_bin->chunk_list = NULL; + recycle_bin->chunk_pos = nodes_per_chunk; + recycle_bin->nodes_per_chunk = nodes_per_chunk; + recycle_bin->node_size = node_size; + recycle_bin->recycle = NULL; + return recycle_bin; +} + +void destroyRecycleBin(RecycleBin * recycle_bin) +{ + register Chunk *chunk; + + if (recycle_bin == NULL) + return; + + while (recycle_bin->chunk_list != NULL) { + chunk = recycle_bin->chunk_list; + recycle_bin->chunk_list = recycle_bin->chunk_list->next; + free(chunk); + } + free(recycle_bin); + return; +} + +void *allocatePointer(RecycleBin * recycle_bin) +{ + register RecycleBin_Node *node; + register Chunk *chunk; + + if (recycle_bin == NULL) { + puts("Null recycle bin!"); + exit(-1); + } + + if (recycle_bin->recycle != NULL) { + node = recycle_bin->recycle; + recycle_bin->recycle = node->next; + return node; + } + + if (recycle_bin->chunk_pos == recycle_bin->nodes_per_chunk) { + chunk = malloc(sizeof(Chunk) + recycle_bin->nodes_per_chunk + * recycle_bin->node_size); + if (chunk == NULL) { + puts("No more memory for memory chunk!"); + exit(-1); + } + chunk->next = recycle_bin->chunk_list; + recycle_bin->chunk_list = chunk; + recycle_bin->chunk_pos = 1; + return (RecycleBin_Node *) ((size_t) (void *) chunk + + 
sizeof(Chunk)); + } + + chunk = recycle_bin->chunk_list; + return (RecycleBin_Node *) ((size_t) (void *) chunk + sizeof(Chunk) + + + (recycle_bin-> + node_size * + recycle_bin->chunk_pos++)); +} + +void deallocatePointer(RecycleBin * recycle_bin, void *data) +{ + register RecycleBin_Node *node = data; + + node->next = recycle_bin->recycle; + recycle_bin->recycle = node; + + return; +} + +size_t RecycleBin_memory_usage(RecycleBin * recycle_bin) +{ + int chunk_count = 0; + Chunk *chunk; + + for (chunk = recycle_bin->chunk_list; chunk != NULL; + chunk = chunk->next) + chunk_count++; + + return recycle_bin->node_size + * recycle_bin->nodes_per_chunk * chunk_count; +} + +size_t recycleBinFreeSpace(RecycleBin * bin) +{ + RecycleBin_Node *freeNode = bin->recycle; + size_t result = 0; + while (freeNode != NULL) { + freeNode = freeNode->next; + result++; + } + + return bin->node_size * (result + + (bin->nodes_per_chunk - bin->chunk_pos)); +} + +size_t recycleBinAvailablePointers(RecycleBin * bin) +{ + Chunk *chunk = bin->chunk_list; + size_t result = 0; + while (chunk != NULL) { + chunk = chunk->next; + result++; + } + + return result * bin->nodes_per_chunk; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/recycleBin.h b/contrib/MetaVelvet-v0.3.1/src/recycleBin.h new file mode 100644 index 0000000..a2225ba --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/recycleBin.h @@ -0,0 +1,54 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +/****************************************************************\ +* * +* Efficient Memory Allocation Routines * +* * +* Guy St.C. Slater.. mailto:guy@ebi.ac.uk * +* Copyright (C) 2000-2005. All Rights Reserved. * +* * +* This source code is distributed under the terms of the * +* GNU Lesser General Public License. See the file COPYING * +* or http://www.fsf.org/copyleft/lesser.html for details * +* * +* If you use this code, please keep this notice intact. * +* * +\****************************************************************/ + +#ifndef INCLUDED_RECYCLEBIN_H +#define INCLUDED_RECYCLEBIN_H + +typedef struct recycleBin_st RecycleBin; + +// Constructor, Destructor +RecycleBin *newRecycleBin(size_t node_size, int nodes_per_chunk); +void destroyRecycleBin(RecycleBin * recycle_bin); + +// Use +void *allocatePointer(RecycleBin * recycle_bin); +void deallocatePointer(RecycleBin * recycle_bin, void *data); + +// Stats +size_t RecycleBin_memory_usage(RecycleBin * recycle_bin); +size_t recycleBinFreeSpace(RecycleBin * recycle_bin); +size_t recycleBinAvailablePointers(RecycleBin * recycle_bin); + +#endif /* INCLUDED_RECYCLEBIN_H */ diff --git a/contrib/MetaVelvet-v0.3.1/src/roadMap.c b/contrib/MetaVelvet-v0.3.1/src/roadMap.c new file mode 100644 index 0000000..ece4d16 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/roadMap.c @@ -0,0 +1,206 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. 
+ + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include + +#include "globals.h" +#include "graph.h" +#include "recycleBin.h" +#include "tightString.h" +#include "roadMap.h" +#include "utility.h" +#include "kmer.h" + +#ifndef NULL +#define NULL 0 +#endif + +union positionPtr { + Coordinate coord; + IDnum nodeID; +}; + +struct annotation_st { + Coordinate position; // 32 + union positionPtr start; // 32 + union positionPtr finish; // 32 + Coordinate length; // 32 + IDnum sequenceID; // 32 +}; + +struct roadmap_st { + IDnum annotationCount; +}; + +// Creates empty RoadMap +RoadMap *newRoadMap() +{ + return callocOrExit(1, RoadMap); +} + +IDnum getAnnotationCount(RoadMap * rdmap) +{ + return rdmap->annotationCount; +} + +Coordinate getFinish(Annotation * annot) +{ + return annot->finish.coord; +} + +IDnum getAnnotSequenceID(Annotation * annot) +{ + return annot->sequenceID; +} + +Coordinate getStart(Annotation * annot) +{ + return annot->start.coord; +} + +Coordinate getPosition(Annotation * annot) +{ + return annot->position; +} + +Coordinate getAnnotationLength(Annotation * annot) +{ + if (annot == NULL) + return 0; + + return annot->length; +} + +// Imports roadmap from the appropriate file format +// Memory allocated within the function +RoadMapArray *importRoadMapArray(char *filename) +{ + FILE *file; + const int maxline = 100; + char *line = mallocOrExit(maxline, char); + RoadMap *array; + RoadMap *rdmap = NULL; + IDnum rdmapIndex = 0; + IDnum seqID; + Coordinate position, start, finish; + Annotation *nextAnnotation; + RoadMapArray *result = 
mallocOrExit(1, RoadMapArray); + IDnum sequenceCount; + IDnum annotationCount = 0; + short short_var; + long long_var; + long long longlong_var, longlong_var2, longlong_var3; + + printf("Reading roadmap file %s\n", filename); + + file = fopen(filename, "r"); + if (file == NULL || !fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "%s missing or incomplete.", filename); + sscanf(line, "%ld\t%i\t%hi\n", &long_var, &(result->WORDLENGTH), &short_var); + sequenceCount = (IDnum) long_var; + resetWordFilter(result->WORDLENGTH); + result->length = sequenceCount; + array = mallocOrExit(sequenceCount, RoadMap); + result->array = array; + result->double_strand = (boolean) short_var; + + while (fgets(line, maxline, file) != NULL) + if (line[0] != 'R') + annotationCount++; + + result->annotations = callocOrExit(annotationCount, Annotation); + nextAnnotation = result->annotations; + fclose(file); + + file = fopen(filename, "r"); + + if (file == NULL || !fgets(line, maxline, file)) + exitErrorf(EXIT_FAILURE, true, "%s missing or incomplete.", filename); + while (fgets(line, maxline, file) != NULL) { + if (line[0] == 'R') { + rdmap = getRoadMapInArray(result, rdmapIndex++); + rdmap->annotationCount = 0; + } else { + sscanf(line, "%ld\t%lld\t%lld\t%lld\n", &long_var, + &longlong_var, &longlong_var2, &longlong_var3); + seqID = (IDnum) long_var; + position = (Coordinate) longlong_var; + start = (Coordinate) longlong_var2; + finish = (Coordinate) longlong_var3; + nextAnnotation->sequenceID = seqID; + nextAnnotation->position = position; + nextAnnotation->start.coord = start; + nextAnnotation->finish.coord = finish; + + if (seqID > 0) + nextAnnotation->length = finish - start; + else + nextAnnotation->length = start - finish; + + + rdmap->annotationCount++; + nextAnnotation++; + } + } + + printf("%d roadmaps read\n", rdmapIndex); + + fclose(file); + free(line); + return result; +} + +RoadMap *getRoadMapInArray(RoadMapArray * array, IDnum index) +{ + return &(array->array[index]); +} + +void setStartID(Annotation * annot, IDnum 
nodeID) +{ + annot->start.nodeID = nodeID; +} + +void setFinishID(Annotation * annot, IDnum nodeID) +{ + annot->finish.nodeID = nodeID; +} + +IDnum getStartID(Annotation * annot) +{ + return annot->start.nodeID; +} + +IDnum getFinishID(Annotation * annot) +{ + return annot->finish.nodeID; +} + +void incrementAnnotationCoordinates(Annotation * annot) +{ + annot->start.coord++; + annot->finish.coord++; +} + +Annotation *getNextAnnotation(Annotation * annot) +{ + return annot + 1; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/roadMap.h b/contrib/MetaVelvet-v0.3.1/src/roadMap.h new file mode 100644 index 0000000..372fe87 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/roadMap.h @@ -0,0 +1,64 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _ROADMAP_H_ +#define _ROADMAP_H_ + +#include + +struct roadMapArray_st { + RoadMap *array; + Annotation *annotations; + IDnum length; + int WORDLENGTH; + boolean double_strand; +}; + +//////////////////////////////////////////////////////////////////// +// Annotation stuff +//////////////////////////////////////////////////////////////////// +IDnum getAnnotSequenceID(Annotation * annot); +Coordinate getFinish(Annotation * annot); +Coordinate getStart(Annotation * annot); +Coordinate getPosition(Annotation * annot); +Coordinate getAnnotationLength(Annotation * annot); +void incrementAnnotationCoordinates(Annotation * annot); + +void setStartID(Annotation * annot, IDnum nodeID); +IDnum getStartID(Annotation * annot); +void setFinishID(Annotation * annot, IDnum nodeID); +IDnum getFinishID(Annotation * annot); + +char *readAnnotation(Annotation * annot); + +Annotation *getNextAnnotation(Annotation * annot); + +//////////////////////////////////////////////////////////////////// +// RoadMap stuff +//////////////////////////////////////////////////////////////////// +RoadMap *newRoadMap(); + +IDnum getAnnotationCount(RoadMap * rdmap); + +RoadMap *getRoadMapInArray(RoadMapArray * array, IDnum index); + +// Same thing but for the RoadMap file generated by the hash +RoadMapArray *importRoadMapArray(char *filename); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/run.c b/contrib/MetaVelvet-v0.3.1/src/run.c new file mode 100644 index 0000000..f362712 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/run.c @@ -0,0 +1,163 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. 
+ + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include +#if defined(_WIN32) || defined(__WIN32__) || defined(WIN32) +#include +#define Arc v_Arc +#else +#include +#endif + +#include "run.h" + +static void printUsage() +{ + puts("Usage:"); + puts("./velveth directory hash_length {[-file_format][-read_type] filename} [options]"); + puts(""); + puts("\tdirectory\t\t: directory name for output files"); + printf("\thash_length\t\t: odd integer (if even, it will be decremented) <= %i (if above, will be reduced)\n", MAXKMERLENGTH); + puts("\tfilename\t\t: path to sequence file or - for standard input"); + puts(""); + puts("File format options:"); + puts("\t-fasta"); + puts("\t-fastq"); + puts("\t-fasta.gz"); + puts("\t-fastq.gz"); + puts("\t-sam"); + puts("\t-bam"); + puts("\t-eland"); + puts("\t-gerald"); + puts(""); + puts("Read type options:"); + puts("\t-short"); + puts("\t-shortPaired"); + puts("\t-short2"); + puts("\t-shortPaired2"); + puts("\t-long"); + puts("\t-longPaired"); + puts(""); + puts("Options:"); + puts("\t-strand_specific\t: for strand specific transcriptome sequencing data (default: off)"); + puts(""); + puts("Output:"); + puts("\tdirectory/Roadmaps"); + puts("\tdirectory/Sequences"); + puts("\t\t[Both files are picked up by graph, so please leave them there]"); +} + +int main(int argc, char **argv) +{ 
+ ReadSet *allSequences; + SplayTable *splayTable; + int hashLength; + char *directory, *filename, *buf; + boolean double_strand = true; + DIR *dir; + + setProgramName("velveth"); + + if (argc < 4) { + puts("velveth - simple hashing program"); + printf("Version %i.%i.%2.2i\n", VERSION_NUMBER, + RELEASE_NUMBER, UPDATE_NUMBER); + puts("\nCopyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk)"); + puts("This is free software; see the source for copying conditions. There is NO"); + puts("warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n"); + puts("Compilation settings:"); + printf("CATEGORIES = %i\n", CATEGORIES); + printf("MAXKMERLENGTH = %i\n", MAXKMERLENGTH); + puts(""); + printUsage(); + return 0; + } + + directory = argv[1]; + filename = mallocOrExit(strlen(directory) + 100, char); + buf = mallocOrExit(strlen(directory) + 100, char); + + hashLength = atoi(argv[2]); + + if (hashLength > MAXKMERLENGTH) { + printf + ("Velvet can't handle k-mers as long as %i! We'll stick to %i if you don't mind.\n", + hashLength, MAXKMERLENGTH); + hashLength = MAXKMERLENGTH; + } else if (hashLength <= 0) { + printf("Invalid hash length: %s\n", argv[2]); + printUsage(); + return 0; + } + + if (hashLength % 2 == 0) { + printf + ("Velvet can't work with even length k-mers, such as %i. 
We'll use %i instead, if you don't mind.\n", + hashLength, hashLength - 1); + hashLength--; + } + resetWordFilter(hashLength); + + dir = opendir(directory); + + if (dir == NULL) + mkdir(directory, 0777); + else { + sprintf(buf, "%s/PreGraph", directory); + remove(buf); + sprintf(buf, "%s/Graph", directory); + remove(buf); + sprintf(buf, "%s/Graph2", directory); + remove(buf); + sprintf(buf, "%s/Graph3", directory); + remove(buf); + sprintf(buf, "%s/Graph4", directory); + remove(buf); + sprintf(buf, "%s/Log", directory); + remove(buf); + } + + logInstructions(argc, argv, directory); + + splayTable = newSplayTable(hashLength); + + strcpy(filename, directory); + strcat(filename, "/Sequences"); + parseDataAndReadFiles(filename, argc - 2, &(argv[2]), &double_strand); + + allSequences = importReadSet(filename); + printf("%i sequences in total.\n", allSequences->readCount); + + strcpy(filename, directory); + strcat(filename, "/Roadmaps"); + inputSequenceArrayIntoSplayTableAndArchive(allSequences, + splayTable, filename, + double_strand); + + destroySplayTable(splayTable); + if (dir) + closedir(dir); + free(filename); + free(buf); + + return 0; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/run.h b/contrib/MetaVelvet-v0.3.1/src/run.h new file mode 100644 index 0000000..ec8e226 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/run.h @@ -0,0 +1,51 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +// Compilation +#include "globals.h" + +// Utilities +#include "graphStats.h" +#include "utility.h" + +// Datastructures +#include "kmer.h" +#include "readSet.h" +#include "tightString.h" +#include "roadMap.h" +#include "splayTable.h" +#include "graph.h" +#include "scaffold.h" + +// PreGraph operations +#include "preGraph.h" +#include "preGraphConstruction.h" +#include "concatenatedPreGraph.h" + +// Graph operations +#include "graph.h" +#include "graphReConstruction.h" +#include "concatenatedGraph.h" +#include "correctedGraph.h" +#include "locallyCorrectedGraph.h" + +// Repeat resolution +#include "readCoherentGraph.h" +#include "shortReadPairs.h" diff --git a/contrib/MetaVelvet-v0.3.1/src/run2.c b/contrib/MetaVelvet-v0.3.1/src/run2.c new file mode 100644 index 0000000..252cdaf --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/run2.c @@ -0,0 +1,537 @@ +/* +Copyright 2011 Toshiaki Namiki (namiki@dna.bio.keio.ac.jp). + + This file is part of MetaVelvet. + This file is originally distributed as a part of the Velvet software, + and modified to solve de novo metagenome assembly problem. + + MetaVelvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + MetaVelvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with MetaVelvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include + +#include "run.h" + +static void printUsage() +{ + puts("Usage:"); + puts("./velvetg directory [options]"); + puts(""); + puts("\tdirectory\t\t\t: working directory name"); + puts(""); + puts("Standard options:"); + puts("\t-cov_cutoff \t: removal of low coverage nodes AFTER tour bus or allow the system to infer it"); + puts("\t\t(default: no removal)"); + puts("\t-ins_length \t\t: expected distance between two paired end reads (default: no read pairing)"); + puts("\t-read_trkg \t\t: tracking of short read positions in assembly (default: no tracking)"); + puts("\t-min_contig_lgth \t: minimum contig length exported to contigs.fa file (default: hash length * 2)"); + puts("\t-amos_file \t\t: export assembly to AMOS file (default: no export)"); + puts("\t-exp_cov \t: expected coverage of unique regions or allow the system to infer it"); + puts("\t\t(default: no long or paired-end read resolution)"); + puts(""); + puts("Advanced options:"); + puts("\t-ins_length2 \t\t: expected distance between two paired-end reads in the second short-read dataset (default: no read pairing)"); + puts("\t-ins_length_long \t: expected distance between two long paired-end reads (default: no read pairing)"); + puts("\t-ins_length*_sd \t: est. 
standard deviation of respective dataset (default: 10\% of corresponding length)"); + puts("\t\t[replace '*' by nothing, '2' or '_long' as necessary]"); + puts("\t-scaffolding \t\t: scaffolding of contigs used paired end information (default: on)"); + puts("\t-max_branch_length \t: maximum length in base pair of bubble (default: 100)"); + puts("\t-max_divergence : maximum divergence rate between two branches in a bubble (default: 0.2)"); + puts("\t-max_gap_count \t: maximum number of gaps allowed in the alignment of the two branches of a bubble (default: 3)"); + puts("\t-min_pair_count \t: minimum number of paired end connections to justify the scaffolding of two long contigs (default: 10)"); + puts("\t-max_coverage \t: removal of high coverage nodes AFTER tour bus (default: no removal)"); + puts("\t-long_mult_cutoff \t\t: minimum number of long reads required to merge contigs (default: 2)"); + puts("\t-unused_reads \t\t: export unused reads in UnusedReads.fa file (default: no)"); + puts(""); + puts("Output:"); + puts("\tdirectory/contigs.fa\t\t: fasta file of contigs longer than twice hash length"); + puts("\tdirectory/stats.txt\t\t: stats file (tab-spaced) useful for determining appropriate coverage cutoff"); + puts("\tdirectory/LastGraph\t\t: special formatted file with all the information on the final graph"); + puts("\tdirectory/velvet_asm.afg\t: (if requested) AMOS compatible assembly file"); +} + +int main(int argc, char **argv) +{ + ReadSet *sequences = NULL; + RoadMapArray *rdmaps; + PreGraph *preGraph; + Graph *graph; + char *directory, *graphFilename, *preGraphFilename, *seqFilename, + *roadmapFilename; + double coverageCutoff = -1; + double maxCoverageCutoff = -1; + double expectedCoverage = -1; + int longMultCutoff = -1; + Coordinate minContigLength = -1; + Coordinate minContigKmerLength; + boolean *dubious = NULL; + Coordinate insertLength[CATEGORIES]; + Coordinate insertLengthLong = -1; + Coordinate std_dev[CATEGORIES]; + Coordinate std_dev_long = 
-1; + short int accelerationBits = 24; + boolean readTracking = false; + boolean exportAssembly = false; + boolean unusedReads = false; + boolean estimateCoverage = false; + boolean estimateCutoff = false; + FILE *file; + int arg_index, arg_int; + double arg_double; + char *arg; + Coordinate *sequenceLengths = NULL; + Category cat; + boolean scaffolding = true; + int pebbleRounds = 1; + long long longlong_var; + short int short_var; + // Original + double expectedCoverageMulti[100]; + int numECM = 0; + int ecmIndex; + char *tokenPointer, *manualECM[100]; + const char *splitChar = "_"; + boolean flagManualECM = false; + int *dummyMask; + double rateChimericSubgraph = 0.0; + boolean discardChimericSubgraph = false; + double repeatNodeCovSD = 0.10; + // Original + + setProgramName("velvetg"); + + for (cat = 0; cat < CATEGORIES; cat++) { + insertLength[cat] = -1; + std_dev[cat] = -1; + } + + // Error message + if (argc == 1) { + puts("velvetg - de Bruijn graph construction, error removal and repeat resolution"); + printf("Version %i.%i.%2.2i\n", VERSION_NUMBER, + RELEASE_NUMBER, UPDATE_NUMBER); + puts("\nCopyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk)"); + puts("This is free software; see the source for copying conditions. 
There is NO"); + puts("warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n"); + puts("Compilation settings:"); + printf("CATEGORIES = %i\n", CATEGORIES); + printf("MAXKMERLENGTH = %i\n", MAXKMERLENGTH); + puts(""); + printUsage(); + return 1; + } + + if (strcmp(argv[1], "--help") == 0) { + printUsage(); + return 0; + } + + // Memory allocation + directory = argv[1]; + graphFilename = mallocOrExit(strlen(directory) + 100, char); + preGraphFilename = + mallocOrExit(strlen(directory) + 100, char); + roadmapFilename = mallocOrExit(strlen(directory) + 100, char); + seqFilename = mallocOrExit(strlen(directory) + 100, char); + // Argument parsing + for (arg_index = 2; arg_index < argc; arg_index++) { + arg = argv[arg_index++]; + if (arg_index >= argc) { + puts("Unusual number of arguments!"); + printUsage(); + exit(1); + } + + if (strcmp(arg, "-cov_cutoff") == 0) { + if (strcmp(argv[arg_index], "auto") == 0) { + estimateCutoff = true; + } else { + sscanf(argv[arg_index], "%lf", &coverageCutoff); + } + } else if (strcmp(arg, "-exp_cov") == 0) { + if (strcmp(argv[arg_index], "auto") == 0) { + estimateCoverage = true; + readTracking = true; + } else { + sscanf(argv[arg_index], "%lf", &expectedCoverage); + if (expectedCoverage > 0) + readTracking = true; + } + } else if (strcmp(arg, "-ins_length") == 0) { + sscanf(argv[arg_index], "%lli", &longlong_var); + insertLength[0] = (Coordinate) longlong_var; + if (insertLength[0] < 0) { + printf("Invalid insert length: %lli\n", + (long long) insertLength[0]); + exit(1); + } + } else if (strcmp(arg, "-ins_length_sd") == 0) { + sscanf(argv[arg_index], "%lli", &longlong_var); + std_dev[0] = (Coordinate) longlong_var; + if (std_dev[0] < 0) { + printf("Invalid std deviation: %lli\n", + (long long) std_dev[0]); + exit(1); + } + } else if (strcmp(arg, "-ins_length_long") == 0) { + sscanf(argv[arg_index], "%lli", &longlong_var); + insertLengthLong = (Coordinate) longlong_var; + } else if (strcmp(arg, 
"-ins_length_long_sd") == 0) { + sscanf(argv[arg_index], "%lli", &longlong_var); + std_dev_long = (Coordinate) longlong_var; + } else if (strncmp(arg, "-ins_length", 11) == 0 + && strchr(arg, 'd') == NULL) { + sscanf(arg, "-ins_length%hi", &short_var); + cat = (Category) short_var; + if (cat < 1 || cat > CATEGORIES) { + printf("Unknown option: %s\n", arg); + exit(1); + } + sscanf(argv[arg_index], "%lli", &longlong_var); + insertLength[cat - 1] = (Coordinate) longlong_var; + if (insertLength[cat - 1] < 0) { + printf("Invalid insert length: %lli\n", + (long long) insertLength[cat - 1]); + exit(1); + } + } else if (strncmp(arg, "-ins_length", 11) == 0) { + sscanf(arg, "-ins_length%hi_sd", &short_var); + cat = (Category) short_var; + if (cat < 1 || cat > CATEGORIES) { + printf("Unknown option: %s\n", arg); + exit(1); + } + sscanf(argv[arg_index], "%lli", &longlong_var); + std_dev[cat - 1] = (Coordinate) longlong_var; + if (std_dev[cat - 1] < 0) { + printf("Invalid std deviation: %lli\n", + (long long) std_dev[cat - 1]); + exit(1); + } + } else if (strcmp(arg, "-read_trkg") == 0) { + readTracking = + (strcmp(argv[arg_index], "yes") == 0); + } else if (strcmp(arg, "-scaffolding") == 0) { + scaffolding = + (strcmp(argv[arg_index], "yes") == 0); + } else if (strcmp(arg, "-amos_file") == 0) { + exportAssembly = + (strcmp(argv[arg_index], "yes") == 0); + } else if (strcmp(arg, "-min_contig_lgth") == 0) { + sscanf(argv[arg_index], "%lli", &longlong_var); + minContigLength = (Coordinate) longlong_var; + } else if (strcmp(arg, "-accel_bits") == 0) { + sscanf(argv[arg_index], "%hi", &accelerationBits); + if (accelerationBits < 0) { + printf + ("Illegal acceleration parameter: %s\n", + argv[arg_index]); + printUsage(); + return -1; + } + } else if (strcmp(arg, "-max_branch_length") == 0) { + sscanf(argv[arg_index], "%i", &arg_int); + setMaxReadLength(arg_int); + setLocalMaxReadLength(arg_int); + } else if (strcmp(arg, "-max_divergence") == 0) { + sscanf(argv[arg_index], "%lf", 
&arg_double); + setMaxDivergence(arg_double); + setLocalMaxDivergence(arg_double); + } else if (strcmp(arg, "-max_gap_count") == 0) { + sscanf(argv[arg_index], "%i", &arg_int); + setMaxGaps(arg_int); + setLocalMaxGaps(arg_int); + } else if (strcmp(arg, "-min_pair_count") == 0) { + sscanf(argv[arg_index], "%i", &arg_int); + setUnreliableConnectionCutoff(arg_int); + } else if (strcmp(arg, "-max_coverage") == 0) { + sscanf(argv[arg_index], "%lf", &maxCoverageCutoff); + } else if (strcmp(arg, "-long_mult_cutoff") == 0) { + sscanf(argv[arg_index], "%i", &longMultCutoff); + setMultiplicityCutoff(longMultCutoff); + } else if (strcmp(arg, "-unused_reads") == 0) { + unusedReads = + (strcmp(argv[arg_index], "yes") == 0); + if (unusedReads) + readTracking = true; + // Original + } else if (strcmp(arg, "-rate_chimeric_sg") == 0) { + sscanf(argv[arg_index], "%lf", &rateChimericSubgraph); + } else if (strcmp(arg, "-discard_chimeric_sg") == 0) { + discardChimericSubgraph = + (strcmp(argv[arg_index], "yes") == 0); + } else if (strcmp(arg, "-manual_exp_cov_multi") == 0) { + flagManualECM = true; + tokenPointer = argv[arg_index]; + for (numECM = 0; numECM < 100; numECM++) { + if ((manualECM[numECM] + = strtok(tokenPointer, splitChar)) == NULL) + break; + tokenPointer = NULL; + } + for (ecmIndex = 0; ecmIndex < numECM; ecmIndex++) { + sscanf(manualECM[ecmIndex], "%lf", + &expectedCoverageMulti[ecmIndex]); + printf("Manual Expected Coverage : %lf\n", + expectedCoverageMulti[ecmIndex]); + } + printf("numECM : %d\n", numECM); + } else if (strcmp(arg, "-repeat_cov_sd") == 0) { + sscanf(argv[arg_index], "%lf", &repeatNodeCovSD); + // Original + } else if (strcmp(arg, "--help") == 0) { + printUsage(); + return 0; + } else { + printf("Unknown option: %s;\n", arg); + printUsage(); + return 1; + } + } + + // Bookkeeping + logInstructions(argc, argv, directory); + + strcpy(seqFilename, directory); + strcat(seqFilename, "/Sequences"); + + strcpy(roadmapFilename, directory); + 
strcat(roadmapFilename, "/Roadmaps"); + + strcpy(preGraphFilename, directory); + strcat(preGraphFilename, "/PreGraph"); + + if (!readTracking) { + strcpy(graphFilename, directory); + strcat(graphFilename, "/Graph"); + } else { + strcpy(graphFilename, directory); + strcat(graphFilename, "/Graph2"); + } + + // Graph uploading or creation + if ((file = fopen(graphFilename, "r")) != NULL) { + fclose(file); + graph = importGraph(graphFilename); + } else if ((file = fopen(preGraphFilename, "r")) != NULL) { + fclose(file); + sequences = importReadSet(seqFilename); + convertSequences(sequences); + graph = + importPreGraph(preGraphFilename, sequences, + readTracking, accelerationBits); + sequenceLengths = + getSequenceLengths(sequences, getWordLength(graph)); + correctGraph(graph, sequenceLengths); + exportGraph(graphFilename, graph, sequences->tSequences); + } else if ((file = fopen(roadmapFilename, "r")) != NULL) { + fclose(file); + rdmaps = importRoadMapArray(roadmapFilename); + preGraph = newPreGraph_pg(rdmaps, seqFilename); + clipTips_pg(preGraph); + exportPreGraph_pg(preGraphFilename, preGraph); + destroyPreGraph_pg(preGraph); + + sequences = importReadSet(seqFilename); + convertSequences(sequences); + graph = + importPreGraph(preGraphFilename, sequences, + readTracking, accelerationBits); + sequenceLengths = + getSequenceLengths(sequences, getWordLength(graph)); + correctGraph(graph, sequenceLengths); + exportGraph(graphFilename, graph, sequences->tSequences); + } else { + puts("No Roadmap file to build upon! 
Please run velveth (see manual)"); + exit(1); + } + + // Set insert lengths and their standard deviations + for (cat = 0; cat < CATEGORIES; cat++) { + if (insertLength[cat] > -1 && std_dev[cat] < 0) + std_dev[cat] = insertLength[cat] / 10; + setInsertLengths(graph, cat, + insertLength[cat], std_dev[cat]); + } + + if (insertLengthLong > -1 && std_dev_long < 0) + std_dev_long = insertLengthLong / 10; + setInsertLengths(graph, CATEGORIES, + insertLengthLong, std_dev_long); + + // Coverage cutoff + if (expectedCoverage < 0 && estimateCoverage == true) { + expectedCoverage = estimated_cov(graph, directory); + if (coverageCutoff < 0) { + coverageCutoff = expectedCoverage / 2; + estimateCutoff = true; + } + } else { + estimateCoverage = false; + if (coverageCutoff < 0 && estimateCutoff) + coverageCutoff = estimated_cov(graph, directory) / 2; + else + estimateCutoff = false; + } + + if (coverageCutoff < 0) { + puts("WARNING: NO COVERAGE CUTOFF PROVIDED"); + puts("Velvet will probably leave behind many detectable errors"); + puts("See manual for instructions on how to set the coverage cutoff parameter"); + } + + // Original + // Outputting Graph_EstimateCovMulti Status + strcpy(graphFilename, directory); + strcat(graphFilename, "/stats_EstimateCovMulti.txt"); + displayGeneralStatistics(graph, graphFilename, sequences); + // Estimating multiple expected coverages + dummyMask = makeDummySubgraphMask(graph); + if (!flagManualECM) + numECM = estimated_cov_multi(graph, dummyMask, expectedCoverageMulti); + free(dummyMask); + // Define coverageCutoff as half of minimum exp_cov + coverageCutoff = expectedCoverageMulti[numECM - 1] / (double) 2; + // Original + + dubious = + removeLowCoverageNodesAndDenounceDubiousReads(graph, coverageCutoff); + + removeHighCoverageNodes(graph, maxCoverageCutoff); + clipTipsHard(graph); + + if (expectedCoverage > 0) { + if (sequences == NULL) { + sequences = importReadSet(seqFilename); + convertSequences(sequences); + } + + // Original + // 
Outputting Graph Before Rock Band + strcpy(graphFilename, directory); + strcat(graphFilename, "/Graph_BeforeRockBand"); + exportGraph(graphFilename, graph, sequences->tSequences); + // Outputting G_BRB Node Sequences + if (minContigLength < 2 * getWordLength(graph)) + minContigKmerLength = getWordLength(graph); + else + minContigKmerLength = minContigLength - getWordLength(graph) + 1; + strcpy(graphFilename, directory); + strcat(graphFilename, "/Graph_BeforeRockBand_NODE.fasta"); + exportLongNodeSequences(graphFilename, graph, minContigKmerLength); + // Original + + // Original + // Mixed length sequencing + //readCoherentGraph(graph, isUniqueSolexa, expectedCoverage, sequences); + // Original + + // Paired ends module + createReadPairingArray(sequences); + for (cat = 0; cat < CATEGORIES; cat++) + if(pairUpReads(sequences, 2 * cat + 1)) + pebbleRounds++; + + if (pairUpReads(sequences, 2 * CATEGORIES + 1)) + pebbleRounds++; + + detachDubiousReads(sequences, dubious); + activateGapMarkers(graph); + // Original + //for ( ;pebbleRounds > 0; pebbleRounds--) + // exploitShortReadPairs(graph, sequences, dubious, scaffolding); + // Original + + /* + // Original + // Identifying and Separating InterRepeats + identifyAndSeparateInterRepeats(graph, expectedCoverageMulti); + concatenateGraph(graph); + // Original + + // Original + // Outputting Graph After Separating InterRepeats + strcpy(graphFilename, directory); + strcat(graphFilename, "/Graph_AfterSeparation"); + exportGraph(graphFilename, graph, sequences->tSequences); + // Original + */ + + // Original + // Resolving Repeats for each subgraph + resolveRepeatOfAllSubgraphs(graph, sequences, expectedCoverageMulti, + dubious, scaffolding, pebbleRounds, + rateChimericSubgraph, discardChimericSubgraph, + repeatNodeCovSD); + // Original + } else { + puts("WARNING: NO EXPECTED COVERAGE PROVIDED"); + puts("Velvet will be unable to resolve any repeats"); + puts("See manual for instructions on how to set the expected coverage 
parameter"); + } + + free(dubious); + + concatenateGraph(graph); + + if (minContigLength < 2 * getWordLength(graph)) + minContigKmerLength = getWordLength(graph); + else + minContigKmerLength = minContigLength - getWordLength(graph) + 1; + + strcpy(graphFilename, directory); + strcat(graphFilename, "/contigs.fa"); + exportLongNodeSequences(graphFilename, graph, minContigKmerLength); + + if (sequences == NULL) { + sequences = importReadSet(seqFilename); + convertSequences(sequences); + } + + strcpy(graphFilename, directory); + strcat(graphFilename, "/stats.txt"); + displayGeneralStatistics(graph, graphFilename, sequences); + + strcpy(graphFilename, directory); + strcat(graphFilename, "/LastGraph"); + exportGraph(graphFilename, graph, sequences->tSequences); + + if (exportAssembly) { + strcpy(graphFilename, directory); + strcat(graphFilename, "/velvet_asm.afg"); + exportAMOSContigs(graphFilename, graph, minContigKmerLength, sequences); + } + + if (unusedReads) { + exportUnusedReads(graph, sequences, minContigKmerLength, directory); + } + + if (estimateCoverage) + printf("Estimated Coverage = %f\n", expectedCoverage); + if (estimateCutoff) + printf("Estimated Coverage cutoff = %f\n", coverageCutoff); + + logFinalStats(graph, minContigKmerLength, directory); + + destroyGraph(graph); + free(graphFilename); + free(preGraphFilename); + free(seqFilename); + free(roadmapFilename); + destroyReadSet(sequences); + return 0; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/scaffold.c b/contrib/MetaVelvet-v0.3.1/src/scaffold.c new file mode 100644 index 0000000..205906d --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/scaffold.c @@ -0,0 +1,1165 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. 
+ + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include +#include + +#include "globals.h" +#include "graph.h" +#include "concatenatedGraph.h" +#include "recycleBin.h" +#include "locallyCorrectedGraph.h" +#include "passageMarker.h" +#include "readSet.h" +#include "utility.h" +#include "scaffold.h" + +#define BLOCK_SIZE 100000 +#define LN2 1.4 + +typedef struct readOccurence_st ReadOccurence; + +struct connection_st { + Node *destination; + Connection *next; + Connection *previous; + Connection *twin; + double distance; + double variance; + IDnum direct_count; + IDnum paired_count; +}; + +struct readOccurence_st { + Coordinate position; + Coordinate offset; + IDnum nodeID; +}; + +// Global params +static IDnum UNRELIABLE_CONNECTION_CUTOFF = 5; + +// Global pointers +static Graph *graph; +static Connection **scaffold = NULL; +static RecycleBin *connectionMemory = NULL; +static boolean estimated[CATEGORIES + 1]; + +static Connection *allocateConnection() +{ + if (connectionMemory == NULL) + connectionMemory = + newRecycleBin(sizeof(Connection), BLOCK_SIZE); + + return allocatePointer(connectionMemory); +} + +static void deallocateConnection(Connection * connect) +{ + deallocatePointer(connectionMemory, connect); +} + +Node * getConnectionDestination(Connection * connect) { + return connect->destination; +} + +Connection * getNextConnection(Connection * connect) { + return connect->next; +} + +Connection * getTwinConnection(Connection * connect) { + return connect->twin; +} + +Coordinate getConnectionDistance(Connection * connect) { + return (Coordinate) 
connect->distance; +} + +double getConnectionVariance(Connection * connect) { + return connect->variance; +} + +IDnum getConnectionDirectCount(Connection * connect) { + return connect->direct_count; +} + +IDnum getConnectionPairedCount(Connection * connect) { + return connect->paired_count; +} + +Connection * getConnection(Node * node) { + return scaffold[getNodeID(node) + nodeCount(graph)]; +} + +void incrementConnectionDistance(Connection * connect, Coordinate increment) { + connect->distance += increment; +} + +static double norm(double X) +{ + return 0.4 * exp(-X * X / 2); +} + +static double normInt(double X, double Y) +{ + return (erf(0.7 * Y) - erf(0.7 * X)) / 2; +} + +static IDnum expectedNumberOfConnections(IDnum IDA, Connection * connect, + IDnum ** counts, Category cat) +{ + Node *A = getNodeInGraph(graph, IDA); + Node *B = connect->destination; + double left, middle, right; + Coordinate longLength, shortLength, D; + IDnum longCount; + double M, N, O, P; + Coordinate mu = getInsertLength(graph, cat); + double sigma = sqrt(getInsertLength_var(graph, cat)); + double result; + + if (mu <= 0) + return 0; + + if (getNodeLength(A) < getNodeLength(B)) { + longLength = getNodeLength(B); + shortLength = getNodeLength(A); + longCount = counts[cat][getNodeID(B) + nodeCount(graph)]; + } else { + longLength = getNodeLength(A); + shortLength = getNodeLength(B); + longCount = counts[cat][IDA + nodeCount(graph)]; + } + + D = getConnectionDistance(connect) - (longLength + shortLength) / 2; + + M = (D - mu) / sigma; + N = (D + shortLength - mu) / sigma; + O = (D + longLength - mu) / sigma; + P = (D + shortLength + longLength - mu) / sigma; + + left = ((norm(M) - norm(N)) - M * normInt(M, N)) * sigma; + middle = shortLength * normInt(N, O); + right = ((norm(O) - norm(P)) - P * normInt(O, P)) * (-sigma); + + result = (longCount * (left + middle + right)) / longLength; + + if (result > 0) + return (IDnum) result; + else + return 0; +} + +void destroyConnection(Connection * 
connect, IDnum nodeID) +{ + Connection *previous, *next; + + //printf("Destroying connection from %li to %li\n", nodeID, getNodeID(connect->destination)); + + if (connect == NULL) + return; + + previous = connect->previous; + next = connect->next; + + if (previous != NULL) + previous->next = next; + if (next != NULL) + next->previous = previous; + + if (scaffold[nodeID + nodeCount(graph)] == connect) + scaffold[nodeID + nodeCount(graph)] = next; + + if (connect->twin != NULL) { + connect->twin->twin = NULL; + destroyConnection(connect->twin, + getNodeID(connect->destination)); + } + + deallocateConnection(connect); +} + +static boolean testConnection(IDnum IDA, Connection * connect, + IDnum ** counts) +{ + IDnum total = 0; + Category cat; + + // Spare unique -> undetermined node connections + if (!getUniqueness(connect->destination)) + return true; + + // Destroy tenuous connections + if (connect->paired_count + connect->direct_count < + UNRELIABLE_CONNECTION_CUTOFF) + return false; + + for (cat = 0; cat <= CATEGORIES; cat++) + total += + expectedNumberOfConnections(IDA, connect, counts, cat); + + // Remove inconsistent connections + return connect->paired_count >= total / 10; +} + +static IDnum *computeReadToNodeCounts() +{ + IDnum readIndex, nodeIndex; + IDnum maxNodeIndex = 2 * nodeCount(graph) + 1; + IDnum maxReadIndex = sequenceCount(graph) + 1; + IDnum *readNodeCounts = callocOrExit(maxReadIndex, IDnum); + boolean *readMarker = callocOrExit(maxReadIndex, boolean); + ShortReadMarker *nodeArray, *shortMarker; + PassageMarker *marker; + Node *node; + IDnum nodeReadCount; + + // Original + /* + puts("Computing read to node mapping array sizes"); + */ + // Original + + for (nodeIndex = 0; nodeIndex < maxNodeIndex; nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex - nodeCount(graph)); + if (node == NULL) + continue; + nodeArray = getNodeReads(node, graph); + nodeReadCount = getNodeReadCount(node, graph); + + // Short reads + for (readIndex = 0; readIndex < 
nodeReadCount; readIndex++) { + shortMarker = + getShortReadMarkerAtIndex(nodeArray, + readIndex); + readNodeCounts[getShortReadMarkerID + (shortMarker)]++; + } + + // Long reads + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) { + readIndex = getPassageMarkerSequenceID(marker); + if (readIndex < 0) + continue; + + if (readMarker[readIndex]) + continue; + + readNodeCounts[readIndex]++; + readMarker[readIndex] = true; + } + + // Clean up marker array + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) { + readIndex = getPassageMarkerSequenceID(marker); + if (readIndex > 0) + readMarker[readIndex] = false; + } + } + + free(readMarker); + return readNodeCounts; +} + +static ReadOccurence **allocateReadToNodeTables(IDnum * readNodeCounts) +{ + IDnum readIndex; + IDnum maxReadIndex = sequenceCount(graph) + 1; + ReadOccurence **readNodes = + callocOrExit(maxReadIndex, ReadOccurence *); + + for (readIndex = 1; readIndex < maxReadIndex; readIndex++) { + if (readNodeCounts[readIndex] != 0) { + readNodes[readIndex] = + callocOrExit(readNodeCounts[readIndex], + ReadOccurence); + readNodeCounts[readIndex] = 0; + } + } + + return readNodes; +} + +static void computePartialReadToNodeMapping(IDnum nodeID, + ReadOccurence ** readNodes, + IDnum * readNodeCounts, + boolean * readMarker) +{ + ShortReadMarker *shortMarker; + IDnum index, readIndex; + ReadOccurence *readArray, *readOccurence; + Node *node = getNodeInGraph(graph, nodeID); + ShortReadMarker *nodeArray = getNodeReads(node, graph); + IDnum nodeReadCount = getNodeReadCount(node, graph); + PassageMarker *marker; + + for (index = 0; index < nodeReadCount; index++) { + shortMarker = getShortReadMarkerAtIndex(nodeArray, index); + readIndex = getShortReadMarkerID(shortMarker); + readArray = readNodes[readIndex]; + readOccurence = &readArray[readNodeCounts[readIndex]]; + readOccurence->nodeID = nodeID; + readOccurence->position = + 
getShortReadMarkerPosition(shortMarker); + readOccurence->offset = + getShortReadMarkerOffset(shortMarker); + readNodeCounts[readIndex]++; + } + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) { + readIndex = getPassageMarkerSequenceID(marker); + if (readIndex < 0) + continue; + + if (!readMarker[readIndex]) { + readArray = readNodes[readIndex]; + readOccurence = + &readArray[readNodeCounts[readIndex]]; + readOccurence->nodeID = nodeID; + readOccurence->position = getStartOffset(marker); + readOccurence->offset = + getPassageMarkerStart(marker); + readNodeCounts[readIndex]++; + readMarker[readIndex] = true; + } else { + readArray = readNodes[readIndex]; + readOccurence = + &readArray[readNodeCounts[readIndex] - 1]; + readOccurence->position = -1; + readOccurence->offset = -1; + } + } + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) { + readIndex = getPassageMarkerSequenceID(marker); + if (readIndex > 0) + readMarker[readIndex] = false; + } +} + +static ReadOccurence **computeReadToNodeMappings(IDnum * readNodeCounts) +{ + IDnum nodeID; + IDnum nodes = nodeCount(graph); + ReadOccurence **readNodes = + allocateReadToNodeTables(readNodeCounts); + boolean *readMarker = + callocOrExit(sequenceCount(graph) + 1, boolean); + + // Original + /* + puts("Computing read to node mappings"); + */ + // Original + + for (nodeID = -nodes; nodeID <= nodes; nodeID++) + if (nodeID != 0 && getNodeInGraph(graph, nodeID)) + computePartialReadToNodeMapping(nodeID, readNodes, + readNodeCounts, + readMarker); + + free(readMarker); + return readNodes; +} + +static boolean * countCoOccurences(IDnum * coOccurencesCount, ReadOccurence ** readNodes, IDnum * readNodeCounts, IDnum * readPairs, Category * cats) { + IDnum readIndex, readPairIndex; + IDnum readNodeCount; + IDnum readOccurenceIndex, readPairOccurenceIndex; + ReadOccurence * readOccurence, *readPairOccurence; + boolean * interestingReads = 
callocOrExit(sequenceCount(graph), boolean); + Category libID; + + for (libID = 0; libID < CATEGORIES + 1; libID++) + coOccurencesCount[libID] = 0; + + for (readIndex = 0; readIndex < sequenceCount(graph); readIndex++) { + // Eliminating dodgy, unpaired, already counted or user-specified reads + if ( readPairs[readIndex] < readIndex + || getInsertLength(graph, cats[readIndex]) > -1) + continue; + + // Check for co-occurence + // We know that for each read the read occurences are ordered by increasing node ID + // Therefore one list is followed by increasing index, whereas the other is followed + // by decreasing index + libID = cats[readIndex]/2; + readPairIndex = readPairs[readIndex]; + + readOccurenceIndex = 0; + readOccurence = readNodes[readIndex + 1]; + readNodeCount = readNodeCounts[readIndex + 1]; + + readPairOccurenceIndex = readNodeCounts[readPairIndex + 1] - 1; + readPairOccurence = &(readNodes[readPairIndex + 1][readPairOccurenceIndex]); + + while (readOccurenceIndex < readNodeCount && readPairOccurenceIndex >= 0) { + if (readOccurence->nodeID == -readPairOccurence->nodeID) { + if (readOccurence->position > 0 && readPairOccurence->position > 0) { + coOccurencesCount[libID]++; + interestingReads[readIndex] = true; + break; + } else { + readOccurence++; + readOccurenceIndex++; + readPairOccurence--; + readPairOccurenceIndex--; + } + } else if (readOccurence->nodeID < -readPairOccurence->nodeID) { + readOccurence++; + readOccurenceIndex++; + } else { + readPairOccurence--; + readPairOccurenceIndex--; + } + } + } + + return interestingReads; +} + +static void measureCoOccurences(Coordinate ** coOccurences, boolean * interestingReads, ReadOccurence ** readNodes, IDnum * readNodeCounts, IDnum * readPairs, Category * cats) { + IDnum coOccurencesIndex[CATEGORIES + 1]; + IDnum observationIndex; + IDnum readIndex, readPairIndex; + IDnum readNodeCount; + IDnum readOccurenceIndex, readPairOccurenceIndex; + ReadOccurence * readOccurence, *readPairOccurence; + 
Category libID; + + for (libID = 0; libID < CATEGORIES + 1; libID++) + coOccurencesIndex[libID] = 0; + + for (readIndex = 0; readIndex < sequenceCount(graph); readIndex++) { + // Eliminating dodgy, unpaired, already counted or user-specified reads + if (!interestingReads[readIndex]) + continue; + + // Find co-occurence + // We know that for each read the read occurences are ordered by increasing node ID + libID = cats[readIndex]/2; + readPairIndex = readPairs[readIndex]; + observationIndex = coOccurencesIndex[libID]; + + readOccurence = readNodes[readIndex + 1]; + readOccurenceIndex = 0; + readNodeCount = readNodeCounts[readIndex + 1]; + + readPairOccurenceIndex = readNodeCounts[readPairIndex + 1] - 1; + readPairOccurence = &(readNodes[readPairIndex + 1][readPairOccurenceIndex]); + + while (readOccurenceIndex < readNodeCount && readPairOccurenceIndex >= 0) { + if (readOccurence->nodeID == -readPairOccurence->nodeID) { + if (readOccurence->position > 0 && readPairOccurence->position > 0) { + coOccurences[libID][observationIndex] = + getNodeLength(getNodeInGraph(graph, readOccurence->nodeID)) + + getWordLength(graph) - 1 + - (readOccurence->position - readOccurence->offset) + - (readPairOccurence->position - readPairOccurence->offset); + coOccurencesIndex[libID]++; + break; + } else { + readOccurence++; + readOccurenceIndex++; + readPairOccurence--; + readPairOccurenceIndex--; + } + } else if (readOccurence->nodeID < -readPairOccurence->nodeID) { + readOccurence++; + readOccurenceIndex++; + } else { + readPairOccurence--; + readPairOccurenceIndex--; + } + } + } +} + +int compareReadOccurences(const void *A, const void * B) { + Coordinate * cA = (Coordinate *) A; + Coordinate * cB = (Coordinate *) B; + + if (*cA > *cB) + return 1; + if (*cA == *cB) + return 0; + return -1; +} + +static void estimateLibraryInsertLength(Coordinate * coOccurences, IDnum coOccurencesCount, Category libID) { + Coordinate median, variance; + IDnum index; + int counter = 0; + 
qsort(coOccurences, coOccurencesCount, sizeof(Coordinate), compareReadOccurences); + + median = coOccurences[coOccurencesCount / 2]; + + // Modified variance around the median (proxy for expected value) + // interval censoring + variance = 0; + for (index = 0; index < coOccurencesCount; index++) { + if (coOccurences[index] > 0 && coOccurences[index] < 5 * median) { + variance += (coOccurences[index] - median) * (coOccurences[index] - median); + counter++; + } + } + if (counter) + variance /= counter; + else { + variance = 0; + for (index = 0; index < coOccurencesCount; index++) + variance += (coOccurences[index] - median) * (coOccurences[index] - median); + variance /= coOccurencesCount; + } + + // To avoid subsequent divisions by zero + if (variance == 0) + variance = 1; + + printf("Paired-end library %i has length: %lli, sample standard deviation: %lli\n", libID + 1, (long long) median, (long long) sqrt(variance)); + setInsertLengths(graph, libID, median, sqrt(variance)); + estimated[libID] = true; +} + +static void estimateLibraryInsertLengths(Coordinate ** coOccurences, IDnum * coOccurencesCounts) { + Category libID; + + for (libID = 0; libID < CATEGORIES + 1; libID++) + estimated[libID] = false; + + for (libID = 0; libID < CATEGORIES + 1; libID++) + if (coOccurencesCounts[libID] > 0) + estimateLibraryInsertLength(coOccurences[libID], coOccurencesCounts[libID], libID); +} + +static void estimateMissingInsertLengths(ReadOccurence ** readNodes, IDnum * readNodeCounts, IDnum * readPairs, Category * cats) { + Coordinate * coOccurences[CATEGORIES + 1]; + IDnum coOccurencesCounts[CATEGORIES + 1]; + Category libID; + + // Original + /* + puts("Estimating library insert lengths..."); + */ + // Original + + boolean * interestingReads = countCoOccurences(coOccurencesCounts, readNodes, readNodeCounts, readPairs, cats); + + for (libID = 0; libID < CATEGORIES + 1; libID++) + coOccurences[libID] = callocOrExit(coOccurencesCounts[libID], Coordinate); + + 
measureCoOccurences(coOccurences, interestingReads, readNodes, readNodeCounts, readPairs, cats); + estimateLibraryInsertLengths(coOccurences, coOccurencesCounts); + + for (libID = 0; libID < CATEGORIES + 1; libID++) + free(coOccurences[libID]); + + free(interestingReads); + + // Original + /* + puts("Done"); + */ + // Original +} + +static Connection *findConnection(IDnum nodeID, IDnum node2ID) +{ + Node *node2 = getNodeInGraph(graph, node2ID); + Connection *connect; + + if (node2 == NULL) + return NULL; + + for (connect = scaffold[nodeID + nodeCount(graph)]; + connect != NULL; connect = connect->next) + if (connect->destination == node2) + break; + + return connect; +} + +static void createTwinConnection(IDnum nodeID, IDnum node2ID, + Connection * connect) +{ + Connection *newConnection = allocateConnection(); + IDnum nodeIndex = nodeID + nodeCount(graph); + + // Fill in + newConnection->distance = connect->distance; + newConnection->variance = connect->variance; + newConnection->direct_count = connect->direct_count; + newConnection->paired_count = connect->paired_count; + newConnection->destination = getNodeInGraph(graph, node2ID); + + // Batch to twin + newConnection->twin = connect; + connect->twin = newConnection; + + // Insert in scaffold + newConnection->previous = NULL; + newConnection->next = scaffold[nodeIndex]; + if (scaffold[nodeIndex] != NULL) + scaffold[nodeIndex]->previous = newConnection; + scaffold[nodeIndex] = newConnection; +} + +Connection *createNewConnection(IDnum nodeID, IDnum node2ID, + IDnum direct_count, + IDnum paired_count, + Coordinate distance, + double variance) +{ + Node *destination = getNodeInGraph(graph, node2ID); + IDnum nodeIndex = nodeID + nodeCount(graph); + Connection *connect = allocateConnection(); + + // Fill in + connect->destination = destination; + connect->direct_count = direct_count; + connect->paired_count = paired_count; + connect->distance = (double) distance; + connect->variance = variance; + + // Insert in 
scaffold + connect->previous = NULL; + connect->next = scaffold[nodeIndex]; + if (scaffold[nodeIndex] != NULL) + scaffold[nodeIndex]->previous = connect; + scaffold[nodeIndex] = connect; + + // Event. pair up to twin + if (getUniqueness(destination)) + createTwinConnection(node2ID, nodeID, connect); + else + connect->twin = NULL; + + return connect; +} + +void readjustConnection(Connection * connect, Coordinate distance, + double variance, IDnum direct_count, + IDnum paired_count) +{ + connect->direct_count += direct_count; + connect->paired_count += paired_count; + + connect->distance = + (variance * connect->distance + + distance * connect->variance) / (variance + + connect->variance); + connect->variance = + (variance * + connect->variance) / (variance + connect->variance); + + if (connect->twin != NULL) { + connect->twin->distance = connect->distance; + connect->twin->variance = connect->variance; + connect->twin->direct_count = connect->direct_count; + connect->twin->paired_count = connect->paired_count; + } +} + +static void createConnection(IDnum nodeID, IDnum node2ID, + IDnum direct_count, + IDnum paired_count, + Coordinate distance, double variance) +{ + Connection *connect = findConnection(nodeID, node2ID); + + if (connect != NULL) + readjustConnection(connect, distance, variance, + direct_count, paired_count); + else + createNewConnection(nodeID, node2ID, direct_count, + paired_count, distance, variance); +} + +static void projectFromSingleRead(Node * node, + ReadOccurence * readOccurence, + Coordinate position, + Coordinate offset, Coordinate length) +{ + Coordinate distance = 0; + Node *target = getNodeInGraph(graph, -readOccurence->nodeID); + double variance = 1; + + if (target == getTwinNode(node) || target == node) + return; + + if (position < 0) { + variance += getNodeLength(node) * getNodeLength(node) / 16; + // distance += 0; + } else { + // variance += 0; + distance += position - getNodeLength(node) / 2; + } + + if (readOccurence->position < 0) 
{ + variance += + getNodeLength(target) * getNodeLength(target) / 16; + //distance += 0; + } else { + // variance += 0; + distance += + -readOccurence->position + getNodeLength(target) / 2; + } + + if (readOccurence->offset < 0 || offset < 0) { + variance += length * length / 16; + //distance += 0; + } else { + // variance += 0; + distance += readOccurence->offset - offset; + } + + // Relative ordering + if (offset > 0 && readOccurence->offset > 0) { + if (offset < readOccurence->offset) { + if (distance - getNodeLength(node)/2 - getNodeLength(target)/2 < -10) + ; + else if (distance < getNodeLength(node)/2 + getNodeLength(target)/2) + createConnection(getNodeID(node), getNodeID(target), 1, 0, + getNodeLength(node)/2 + getNodeLength(target)/2, variance); + else + createConnection(getNodeID(node), getNodeID(target), 1, 0, + distance, variance); + } else if (offset > readOccurence->offset) { + if (-distance - getNodeLength(node)/2 - getNodeLength(target)/2 < -10) + ; + else if (-distance < getNodeLength(node)/2 + getNodeLength(target)/2) + createConnection(-getNodeID(node), -getNodeID(target), 1, + 0, getNodeLength(node)/2 + getNodeLength(target)/2 , variance); + else + createConnection(-getNodeID(node), -getNodeID(target), 1, + 0, -distance, variance); + } + } else if (offset > 0 && position > 0) { + if (distance - offset > -getNodeLength(node)/2 && distance - offset + length > getNodeLength(node)/2) + createConnection(getNodeID(node), getNodeID(target), 1, 0, + getNodeLength(node)/2 + getNodeLength(target)/2, variance); + else if (distance - offset < -getNodeLength(node)/2 && distance - offset + length < getNodeLength(node)/2) + createConnection(-getNodeID(node), -getNodeID(target), 1, 0, + getNodeLength(node)/2 + getNodeLength(target)/2, variance); + else { + createConnection(getNodeID(node), getNodeID(target), 1, 0, + getNodeLength(node)/2 + getNodeLength(target)/2, variance); + createConnection(-getNodeID(node), -getNodeID(target), 1, 0, + getNodeLength(node)/2 
+ getNodeLength(target)/2, variance); + } + } else if (readOccurence->offset > 0 && readOccurence->position > 0) { + if (-distance - readOccurence->offset > -getNodeLength(target)/2 && -distance - readOccurence->offset + length > getNodeLength(target)/2) + createConnection(-getNodeID(node), -getNodeID(target), 1, 0, + getNodeLength(node)/2 + getNodeLength(target)/2, variance); + else if (-distance - readOccurence->offset < -getNodeLength(target)/2 && -distance - readOccurence->offset + length < getNodeLength(target)/2) + createConnection(getNodeID(node), getNodeID(target), 1, 0, + getNodeLength(node)/2 + getNodeLength(target)/2, variance); + else { + createConnection(getNodeID(node), getNodeID(target), 1, 0, + getNodeLength(node)/2 + getNodeLength(target)/2, variance); + createConnection(-getNodeID(node), -getNodeID(target), 1, 0, + getNodeLength(node)/2 + getNodeLength(target)/2, variance); + } + } else { + createConnection(getNodeID(node), getNodeID(target), 1, 0, + getNodeLength(node)/2 + getNodeLength(target)/2, variance); + createConnection(-getNodeID(node), -getNodeID(target), 1, 0, + getNodeLength(node)/2 + getNodeLength(target)/2, variance); + } +} + +static void projectFromReadPair(Node * node, ReadOccurence * readOccurence, + Coordinate position, Coordinate offset, + Coordinate insertLength, + double insertVariance) +{ + Coordinate distance = insertLength; + Coordinate variance = insertVariance; + Node *target = getNodeInGraph(graph, readOccurence->nodeID); + + if (target == getTwinNode(node) || target == node) + return; + + if (getUniqueness(target) && getNodeID(target) < getNodeID(node)) + return; + + if (position < 0) { + variance += getNodeLength(node) * getNodeLength(node) / 16; + // distance += 0; + } else { + // variance += 0; + distance += position - offset - getNodeLength(node) / 2; + } + + if (readOccurence->position < 0) { + variance += + getNodeLength(target) * getNodeLength(target) / 16; + //distance += 0; + } else { + // variance += 0; + distance 
+= + readOccurence->position - readOccurence->offset - + getNodeLength(target) / 2; + } + + if (distance - getNodeLength(node)/2 - getNodeLength(target)/2 < -6 * sqrt(insertVariance)) + return; + else if (distance < getNodeLength(node)/2 + getNodeLength(target)/2) + distance = getNodeLength(node)/2 + getNodeLength(target)/2; + + createConnection(getNodeID(node), getNodeID(target), 0, 1, + distance, variance); +} + +static void projectFromShortRead(Node * node, + ShortReadMarker * shortMarker, + IDnum * readPairs, Category * cats, + ReadOccurence ** readNodes, + IDnum * readNodeCounts, + Coordinate * lengths) +{ + IDnum index; + IDnum readIndex = getShortReadMarkerID(shortMarker); + ReadOccurence *readArray; + IDnum readPairIndex; + Category cat; + Coordinate position = getShortReadMarkerPosition(shortMarker); + Coordinate offset = getShortReadMarkerOffset(shortMarker); + Coordinate length = lengths[getShortReadMarkerID(shortMarker) - 1]; + Coordinate insertLength; + double insertVariance; + + // Going through single-read information + if (readNodeCounts[readIndex] > 1) { + readArray = readNodes[readIndex]; + for (index = 0; index < readNodeCounts[readIndex]; index++) + projectFromSingleRead(node, &readArray[index], + position, offset, length); + } + // Going through paired read information + if (readPairs == NULL) + return; + + readPairIndex = readPairs[readIndex - 1] + 1; + + if (readPairIndex == 0) + return; + + cat = cats[readIndex - 1]; + insertLength = getInsertLength(graph, cat); + insertVariance = getInsertLength_var(graph, cat); + + readArray = readNodes[readPairIndex]; + for (index = 0; index < readNodeCounts[readPairIndex]; index++) + projectFromReadPair(node, &readArray[index], position, + offset, insertLength, insertVariance); + +} + +static void projectFromLongRead(Node * node, PassageMarker * marker, + IDnum * readPairs, Category * cats, + ReadOccurence ** readNodes, + IDnum * readNodeCounts, + Coordinate * lengths) +{ + IDnum index; + IDnum readIndex 
= getPassageMarkerSequenceID(marker); + ReadOccurence *readArray; + IDnum readPairIndex; + Category cat; + Coordinate position = getStartOffset(marker); + Coordinate offset = getPassageMarkerStart(marker); + Coordinate length = + lengths[getPassageMarkerSequenceID(marker) - 1]; + Coordinate insertLength; + double insertVariance; + + // Going through single-read information + if (readNodeCounts[readIndex] > 1 && position > 0) { + readArray = readNodes[readIndex]; + for (index = 0; index < readNodeCounts[readIndex]; index++) + projectFromSingleRead(node, &readArray[index], + position, offset, length); + } + // Going through paired read information + if (readPairs == NULL) + return; + + readPairIndex = readPairs[readIndex - 1] + 1; + + if (readPairIndex == 0) + return; + + cat = cats[readIndex - 1]; + insertLength = getInsertLength(graph, cat); + insertVariance = getInsertLength_var(graph, cat); + + readArray = readNodes[readPairIndex]; + for (index = 0; index < readNodeCounts[readPairIndex]; index++) + projectFromReadPair(node, &readArray[index], position, + offset, insertLength, insertVariance); + +} + +static void projectFromNode(IDnum nodeID, + ReadOccurence ** readNodes, + IDnum * readNodeCounts, + IDnum * readPairs, Category * cats, + boolean * dubious, Coordinate * lengths) +{ + IDnum index; + ShortReadMarker *nodeArray, *shortMarker; + PassageMarker *marker; + Node *node; + IDnum nodeReadCount; + + node = getNodeInGraph(graph, nodeID); + + if (node == NULL || !getUniqueness(node)) + return; + + nodeArray = getNodeReads(node, graph); + nodeReadCount = getNodeReadCount(node, graph); + for (index = 0; index < nodeReadCount; index++) { + shortMarker = getShortReadMarkerAtIndex(nodeArray, index); + if (dubious[getShortReadMarkerID(shortMarker) - 1]) + continue; + projectFromShortRead(node, shortMarker, readPairs, cats, + readNodes, readNodeCounts, lengths); + } + + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) { + if 
(getPassageMarkerSequenceID(marker) > 0) + projectFromLongRead(node, marker, readPairs, cats, + readNodes, readNodeCounts, + lengths); + } +} + +static Connection **computeNodeToNodeMappings(ReadOccurence ** readNodes, + IDnum * readNodeCounts, + IDnum * readPairs, + Category * cats, + boolean * dubious, + Coordinate * lengths) +{ + IDnum nodeID; + IDnum nodes = nodeCount(graph); + scaffold = callocOrExit(2 * nodes + 1, Connection *); + + // Original + /* + puts("Computing direct node to node mappings"); + */ + // Original + + for (nodeID = -nodes; nodeID <= nodes; nodeID++) { + // Original + /* + if (nodeID % 10000 == 0) + printf("Scaffolding node %d\n", nodeID); + */ + // Original + + projectFromNode(nodeID, readNodes, readNodeCounts, + readPairs, cats, dubious, lengths); + } + + return scaffold; +} + +static IDnum **countShortReads(Graph * graph, ReadSet * reads) +{ + IDnum **counts = callocOrExit(CATEGORIES + 1, IDnum *); + Category cat; + IDnum nodeIndex; + IDnum nodes = nodeCount(graph); + Node *node; + ShortReadMarker *array, *marker; + IDnum readCount, readIndex, readID; + + // Allocate memory where needed + for (cat = 0; cat <= CATEGORIES; cat++) + if (getInsertLength(graph, cat) > 0) + counts[cat] = + callocOrExit(2 * nodeCount(graph) + 1, + IDnum); + + // Start fillin' + for (nodeIndex = 0; nodeIndex < 2 * nodes + 1; nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex - nodes); + + if (node == NULL || !getUniqueness(node)) + continue; + + array = getNodeReads(node, graph); + readCount = getNodeReadCount(node, graph); + for (readIndex = 0; readIndex < readCount; readIndex++) { + marker = + getShortReadMarkerAtIndex(array, readIndex); + readID = getShortReadMarkerID(marker); + cat = reads->categories[readID - 1]; + if (cat % 2 == 1 && counts[cat / 2] != NULL) + counts[cat / 2][nodeIndex]++; + } + } + + return counts; +} + +void printConnections(ReadSet * reads) +{ + IDnum maxNodeIndex = nodeCount(graph) * 2 + 1; + IDnum index; + Connection *connect, 
*next; + Node *node; + IDnum **counts = countShortReads(graph, reads); + IDnum nodes = nodeCount(graph); + + puts("CONNECT IDA IDB dcount pcount dist lengthA lengthB var countA countB coordA coordB real exp distance test"); + + for (index = 0; index < maxNodeIndex; index++) { + node = getNodeInGraph(graph, index - nodeCount(graph)); + for (connect = scaffold[index]; connect != NULL; + connect = next) { + next = connect->next; + if (getUniqueness(connect->destination)) { + printf + ("CONNECT %ld %ld %ld %ld %lld %lld %lld %f %ld %ld", + (long) index - nodeCount(graph), + (long) getNodeID(connect->destination), + (long) connect->direct_count, + (long) connect->paired_count, + (long long) getConnectionDistance(connect), + (long long) getNodeLength(node), + (long long) getNodeLength(connect->destination), + connect->variance, + (long) getNodeReadCount(node, graph), + (long) getNodeReadCount(connect->destination, + graph)); + if (markerCount(node) == 1 + && markerCount(connect->destination) == + 1) + printf(" %lld %lld %lld", + (long long) getPassageMarkerFinish + (getMarker(node)), + (long long) getPassageMarkerFinish + (getMarker + (connect->destination)), + (long long) (getPassageMarkerFinish + (getMarker(node)) - + getPassageMarkerFinish + (getMarker + (connect->destination)))); + else + printf(" ? 
?"); + printf(" %ld", (long) expectedNumberOfConnections(index-nodeCount(graph), connect, counts, 0)); + printf(" %lld", (long long) (getConnectionDistance(connect) - (getNodeLength(node) + getNodeLength(connect->destination))/2)); + if (testConnection + (index - nodes, connect, counts)) + puts(" OK"); + else + puts(" NG"); + } + } + } +} + +static void removeUnreliableConnections(ReadSet * reads) +{ + IDnum maxNodeIndex = nodeCount(graph) * 2 + 1; + IDnum index; + Connection *connect, *next; + Category cat; + IDnum **counts = countShortReads(graph, reads); + IDnum nodes = nodeCount(graph); + + for (index = 0; index < maxNodeIndex; index++) { + for (connect = scaffold[index]; connect != NULL; + connect = next) { + next = connect->next; + if (!testConnection + (index - nodes, connect, counts)) + destroyConnection(connect, index - nodes); + } + } + + // Free memory + for (cat = 0; cat <= CATEGORIES; cat++) + if (counts[cat]) + free(counts[cat]); + free(counts); +} + +void buildScaffold(Graph * argGraph, ReadSet * reads, boolean * dubious) { + IDnum *readPairs; + Category *cats; + IDnum *readNodeCounts; + ReadOccurence **readNodes; + Coordinate *lengths = + getSequenceLengths(reads, getWordLength(argGraph)); + IDnum index; + + graph = argGraph; + readPairs = reads->mateReads; + cats = reads->categories; + + // Prepare primary scaffold + readNodeCounts = computeReadToNodeCounts(); + readNodes = computeReadToNodeMappings(readNodeCounts); + + estimateMissingInsertLengths(readNodes, readNodeCounts, readPairs, cats); + + scaffold = computeNodeToNodeMappings(readNodes, readNodeCounts, + readPairs, cats, dubious, lengths); + removeUnreliableConnections(reads); + + // Clean up memory + for (index = 1; index <= sequenceCount(graph); index++) + free(readNodes[index]); + + free(readNodes); + free(readNodeCounts); + free(lengths); +} + +void setUnreliableConnectionCutoff(int val) +{ + UNRELIABLE_CONNECTION_CUTOFF = (IDnum) val; +} + +void cleanScaffoldMemory() { + Category libID; 
+ + for (libID = 0; libID < CATEGORIES + 1; libID++) + if (estimated[libID]) + setInsertLengths(graph, libID, -1, -1); + + destroyRecycleBin(connectionMemory); + free(scaffold); + connectionMemory = NULL; +} diff --git a/contrib/MetaVelvet-v0.3.1/src/scaffold.h b/contrib/MetaVelvet-v0.3.1/src/scaffold.h new file mode 100644 index 0000000..2cb2bfc --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/scaffold.h @@ -0,0 +1,58 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _SCAFFOLD_H_ +#define _SCAFFOLD_H_ + +typedef struct connection_st Connection; + +//General scaffold function +void buildScaffold(Graph * graph, ReadSet * reads, boolean * dubious); + +Connection *createNewConnection(IDnum nodeID, IDnum node2ID, + IDnum direct_count, + IDnum paired_count, + Coordinate distance, + double variance); +void readjustConnection(Connection * connect, Coordinate distance, + double variance, IDnum direct_count, + IDnum paired_count); +void destroyConnection(Connection * connect, IDnum nodeID); + +void cleanScaffoldMemory(); + +void setUnreliableConnectionCutoff(int val); + +void printConnections(ReadSet * reads); + +// Connection handlers +Connection * getConnection(Node * node); + +Node * getConnectionDestination(Connection * connect); +Coordinate getConnectionDistance(Connection * connect); +Connection * getNextConnection(Connection * connect); +Connection * getTwinConnection(Connection * connect); +double getConnectionVariance(Connection * connect); +IDnum getConnectionDirectCount(Connection * connect); +IDnum getConnectionPairedCount(Connection * connect); + +void incrementConnectionDistance(Connection * connect, Coordinate increment); + +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/sequenceExpander.c b/contrib/MetaVelvet-v0.3.1/src/sequenceExpander.c new file mode 100644 index 0000000..79abf1d --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/sequenceExpander.c @@ -0,0 +1,79 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. 
+ + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include + +#include "globals.h" +#include "tightString.h" +#include "passageMarker.h" + +static TightString *readPositivePassageMarker(PassageMarker * marker, + TightString ** seqs, + int WORDLENGTH) +{ + Coordinate index; + Nucleotide nucleotide; + TightString *tString = + seqs[getPassageMarkerSequenceID(marker) - 1]; + TightString *res = newTightString(getPassageMarkerLength(marker)); + + for (index = 0; index < getLength(tString); index++) { + nucleotide = + getNucleotide(getPassageMarkerStart(marker) + index + + WORDLENGTH - 1, tString); + writeNucleotideAtPosition(nucleotide, index, res); + } + + return res; +} + +static TightString *readNegativePassageMarker(PassageMarker * marker, + TightString ** seqs) +{ + Coordinate index; + Nucleotide nucleotide; + TightString *tString = + seqs[getAbsolutePassMarkerSeqID(marker) - 1]; + TightString *res = newTightString(getPassageMarkerLength(marker)); + + for (index = 0; index < getPassageMarkerLength(marker); index++) { + nucleotide = + getNucleotide(getPassageMarkerStart(marker) - index, + tString); +#ifndef COLOR + writeNucleotideAtPosition(3 - nucleotide, index, res); +#else + writeNucleotideAtPosition(nucleotide, index, res); +#endif + } + + return res; +} + +TightString *expandPassageMarker(PassageMarker * marker, + TightString ** sequences, int WORDLENGTH) +{ + if (getPassageMarkerSequenceID(marker) > 0) + return readPositivePassageMarker(marker, sequences, + WORDLENGTH); + else + return readNegativePassageMarker(marker, sequences); +} diff --git 
a/contrib/MetaVelvet-v0.3.1/src/sequenceExpander.h b/contrib/MetaVelvet-v0.3.1/src/sequenceExpander.h new file mode 100644 index 0000000..62f9449 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/sequenceExpander.h @@ -0,0 +1,26 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _SEQUENCEEXPANDER_H_ +#define _SEQUENCEEXPANDER_H_ + +TightString *expandPassageMarker(PassageMarker * marker, + TightString ** sequences, int WORDLENGTH); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/shortReadPairs.c b/contrib/MetaVelvet-v0.3.1/src/shortReadPairs.c new file mode 100644 index 0000000..547fbe0 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/shortReadPairs.c @@ -0,0 +1,1357 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include +#include + +#include "globals.h" +#include "graph.h" +#include "concatenatedGraph.h" +#include "recycleBin.h" +#include "locallyCorrectedGraph.h" +#include "passageMarker.h" +#include "readSet.h" +#include "utility.h" +#include "scaffold.h" + +#define BLOCK_SIZE 100000 +#define LN2 1.4 +#define BACKTRACK_CUTOFF 100 + +typedef struct miniConnection_st MiniConnection; + +struct miniConnection_st { + Coordinate distance; + double variance; + Connection *frontReference; + Connection *backReference; + NodeList *nodeList; +}; + + +// Global pointers +static Graph *graph; +static NodeList *markedNodes; +static RecycleBin *nodeListMemory = NULL; +static MiniConnection *localScaffold = NULL; + +static NodeList *allocateNodeList() +{ + if (nodeListMemory == NULL) + nodeListMemory = + newRecycleBin(sizeof(NodeList), BLOCK_SIZE); + + return allocatePointer(nodeListMemory); +} + +static void deallocateNodeList(NodeList * nodeList) +{ + deallocatePointer(nodeListMemory, nodeList); +} + +static NodeList *recordNode(Node * node) +{ + NodeList *nodeList = allocateNodeList(); + nodeList->node = node; + nodeList->next = markedNodes; + nodeList->previous = NULL; + + if (markedNodes != NULL) + markedNodes->previous = nodeList; + + markedNodes = nodeList; + + return nodeList; +} + +static void destroyNodeList(NodeList * nodeList) +{ + //printf("Destroy NL %p > %p > %p\n", nodeList->previous, nodeList, nodeList->next); + + if (nodeList->previous != NULL) + nodeList->previous->next = nodeList->next; + else + markedNodes = nodeList->next; + + if (nodeList->next != NULL) + nodeList->next->previous = nodeList->previous; + + nodeList->previous = nodeList->next = NULL; + + deallocateNodeList(nodeList); +} 
+ +static Node *popNodeRecord() +{ + MiniConnection *localConnect; + + NodeList *nodeList = markedNodes; + Node *node; + + if (markedNodes == NULL) + return NULL; + + node = nodeList->node; + markedNodes = nodeList->next; + if (markedNodes != NULL) + markedNodes->previous = NULL; + + localConnect = + &localScaffold[getNodeID(nodeList->node) + nodeCount(graph)]; + localConnect->nodeList = NULL; + + deallocateNodeList(nodeList); + return node; +} + +void detachImprobablePairs(ReadSet * sequences) +{ + IDnum index, nodeIndex; + IDnum maxNodeIndex = 2 * nodeCount(graph) + 1; + ShortReadMarker *nodeArray, *shortMarker; + Node *node; + IDnum nodeReadCount; + IDnum seqID, pairID; + IDnum *mateReads = sequences->mateReads; + Category *cats = sequences->categories; + + for (nodeIndex = 0; nodeIndex < maxNodeIndex; nodeIndex++) { + node = getNodeInGraph(graph, nodeIndex - nodeCount(graph)); + if (node == NULL) + continue; + + nodeArray = getNodeReads(node, graph); + nodeReadCount = getNodeReadCount(node, graph); + + for (index = 0; index < nodeReadCount; index++) { + shortMarker = + getShortReadMarkerAtIndex(nodeArray, index); + + seqID = getShortReadMarkerID(shortMarker); + if (mateReads[seqID] == -1) + continue; + + if (getNodeLength(node) - + getShortReadMarkerPosition(shortMarker) > + 2 * getInsertLength(graph, cats[seqID])) { + pairID = mateReads[seqID]; + + if (pairID != -1) { + mateReads[seqID] = -1; + mateReads[pairID] = -1; + } + } + } + } +} + +static void resetMiniConnection(Node * node, MiniConnection * localConnect, + Coordinate distance, double variance, + Connection * frontReference, + Connection * backReference, boolean status) +{ + setSingleNodeStatus(node, status); + localConnect->distance = distance; + localConnect->variance = variance; + localConnect->frontReference = frontReference; + localConnect->backReference = backReference; + localConnect->nodeList = recordNode(node); +} + +static void setEmptyMiniConnection(Node * node) +{ + MiniConnection 
*localConnect = + &localScaffold[getNodeID(node) + nodeCount(graph)]; + localConnect->distance = 0; + localConnect->variance = 1; + localConnect->frontReference = NULL; + localConnect->backReference = NULL; + localConnect->nodeList = recordNode(node); + setSingleNodeStatus(node, true); +} + +static void readjustMiniConnection(Node * node, + MiniConnection * localConnect, + Coordinate distance, + Coordinate min_distance, + double variance, + Connection * frontReference, + Connection * backReference) +{ + + localConnect->distance = + (variance * localConnect->distance + + distance * localConnect->variance) / (variance + + localConnect->variance); + localConnect->variance = + (variance * + localConnect->variance) / (variance + localConnect->variance); + + if (frontReference != NULL) + localConnect->frontReference = frontReference; + if (backReference != NULL) + localConnect->backReference = backReference; + + if (localConnect->distance > min_distance) + setSingleNodeStatus(node, 1); + else + setSingleNodeStatus(node, -1); +} + +static void integrateDerivativeDistances(Connection * connect, + Coordinate min_distance, + boolean direction) +{ + Node *reference = getConnectionDestination(connect); + Node *destination; + IDnum destinationID; + Coordinate distance, baseDistance; + double variance, baseVariance; + Connection *connect2; + MiniConnection *localConnect; + + // debug + IDnum counter = 0; + + if (!getUniqueness(reference)) + return; + + //printf("Opposite node %li length %li at %li ± %f\n", getNodeID(reference), getNodeLength(reference), getConnectionDistance(connect), getConnectionVariance(connect)); + + baseDistance = getConnectionDistance(connect); + baseVariance = getConnectionVariance(connect); + + for (connect2 = getConnection(reference); + connect2 != NULL; connect2 = getNextConnection(connect2)) { + // Avoid null derivative + if (connect2 == getTwinConnection(connect)) + continue; + + destination = getConnectionDestination(connect2); + + // Beware of 
directionality + if (!direction) + destination = getTwinNode(destination); + + // Derivate values + destinationID = getNodeID(destination); + // Beware of directionality (bis) + if (direction) + distance = baseDistance - getConnectionDistance(connect2); + else + distance = getConnectionDistance(connect2) - baseDistance; + variance = getConnectionVariance(connect2) + baseVariance; + localConnect = + &localScaffold[destinationID + nodeCount(graph)]; + + // Avoid over-projection + if (distance < min_distance) { + //printf("Node %li not at distance %li± %f (min %li)\n", destinationID, distance, variance, min_distance); + continue; + } + + counter++; + + if (getNodeStatus(destination)) { + readjustMiniConnection(destination, localConnect, + distance, min_distance, + variance, NULL, NULL); + } else + resetMiniConnection(destination, localConnect, + distance, variance, NULL, NULL, + true); + + //printf("Node %li now at distance %li\n", destinationID, localConnect->distance); + } + + //printf("%li secondary distances added\n", counter); +} + +static void markInterestingNodes(Node * node) +{ + Connection *connect; + Node *destination; + MiniConnection *localConnect; + Coordinate min_distance = + getNodeLength(node) / 2 - BACKTRACK_CUTOFF; + + // Mark own node + setEmptyMiniConnection(node); + + // Loop thru primary scaffold + for (connect = getConnection(node); connect != NULL; + connect = getNextConnection(connect)) { + destination = getTwinNode(getConnectionDestination(connect)); + + localConnect = + &localScaffold[getNodeID(destination) + + nodeCount(graph)]; + + if (getNodeStatus(destination)) { + readjustMiniConnection(destination, localConnect, + getConnectionDistance(connect), + min_distance, + getConnectionVariance(connect), connect, + NULL); + localConnect->backReference = NULL; + } else { + resetMiniConnection(destination, localConnect, + getConnectionDistance(connect), + getConnectionVariance(connect), connect, + NULL, true); + } + + 
integrateDerivativeDistances(connect, min_distance, true); + } + + // Loop thru twin's primary scaffold + for (connect = getConnection(getTwinNode(node)); connect != NULL; + connect = getNextConnection(connect)) { + destination = getConnectionDestination(connect); + localConnect = + &localScaffold[getNodeID(destination) + + nodeCount(graph)]; + + if (getNodeStatus(destination)) + readjustMiniConnection(destination, localConnect, + -getConnectionDistance(connect), + min_distance, + getConnectionVariance(connect), NULL, + connect); + else + resetMiniConnection(destination, localConnect, + -getConnectionDistance(connect), + getConnectionVariance(connect), NULL, + connect, -1); + + integrateDerivativeDistances(connect, min_distance, false); + } +} + +void unmarkNode(Node * node, MiniConnection * localConnect) +{ + if (localConnect->frontReference != NULL + || localConnect->backReference != NULL) { + if (getNodeStatus(node) > 0) + setSingleNodeStatus(node, 10); + else + setSingleNodeStatus(node, -10); + } else { + setSingleNodeStatus(node, false); + destroyNodeList(localConnect->nodeList); + localConnect->frontReference = NULL; + localConnect->backReference = NULL; + localConnect->nodeList = NULL; + } +} + +void handicapNode(Node * node) +{ + if (getNodeStatus(node) > 0) + setSingleNodeStatus(node, 10); + else + setSingleNodeStatus(node, -10); +} + +static void absorbExtension(Node * node, Node * extension) +{ + Arc *arc; + + appendNodeGaps(node, extension, graph); + appendDescriptors(node, extension); + + // Destroy old nodes + while (getArc(node) != NULL) + destroyArc(getArc(node), graph); + + // Create new + for (arc = getArc(extension); arc != NULL; arc = getNextArc(arc)) + createAnalogousArc(node, getDestination(arc), arc, graph); +} + +NodeList *getMarkedNodeList() +{ + return markedNodes; +} + +static void absorbExtensionInScaffold(Node * node, Node * source) +{ + IDnum nodeID = getNodeID(node); + IDnum sourceID = getNodeID(source); + IDnum sourceIndex = sourceID 
+ nodeCount(graph); + Node *twinSource = getTwinNode(source); + IDnum twinSourceIndex = getNodeID(twinSource) + nodeCount(graph); + Connection *connect, *original; + Node *destination; + IDnum destinationID; + Coordinate distance_shift = + (getNodeLength(node) - getNodeLength(source)) / 2; + Coordinate min_distance = + getNodeLength(node) / 2 - BACKTRACK_CUTOFF; + MiniConnection *localConnect; + Coordinate distance; + double variance; + IDnum direct_count; + IDnum paired_count; + + while ((connect = getConnection(source))) { + destination = getTwinNode(getConnectionDestination(connect)); + + if (destination == getTwinNode(node)) { + localConnect = &localScaffold[twinSourceIndex]; + localConnect->frontReference = NULL; + unmarkNode(twinSource, localConnect); + destroyConnection(connect, sourceID); + continue; + } + if (destination == node) { + localConnect = &localScaffold[sourceIndex]; + localConnect->backReference = NULL; + unmarkNode(source, localConnect); + destroyConnection(connect, sourceID); + continue; + } + + destinationID = getNodeID(destination); + localConnect = + &localScaffold[destinationID + nodeCount(graph)]; + incrementConnectionDistance(connect, distance_shift); + distance = getConnectionDistance(connect); + variance = getConnectionVariance(connect); + direct_count = getConnectionDirectCount(connect); + paired_count = getConnectionPairedCount(connect); + + if (getNodeStatus(destination)) { + readjustMiniConnection(destination, localConnect, + distance, min_distance, + variance, NULL, NULL); + if ((original = localConnect->frontReference)) + readjustConnection(original, distance, + variance, direct_count, + paired_count); + else + localConnect->frontReference = + createNewConnection(nodeID, + -destinationID, + direct_count, + paired_count, + distance, + variance); + } else + resetMiniConnection(destination, localConnect, + distance, variance, + createNewConnection(nodeID, + -destinationID, + direct_count, + paired_count, + distance, + variance), + 
NULL, true); + + integrateDerivativeDistances(connect, min_distance, true); + + destroyConnection(connect, sourceID); + } + + // Loop thru twin's primary scaffold + while ((connect = getConnection(getTwinNode(source)))) { + destination = getConnectionDestination(connect); + + if (destination == node) { + localConnect = &localScaffold[sourceIndex]; + localConnect->frontReference = NULL; + unmarkNode(source, localConnect); + destroyConnection(connect, -sourceID); + continue; + } + if (destination == getTwinNode(node)) { + localConnect = &localScaffold[twinSourceIndex]; + localConnect->backReference = NULL; + unmarkNode(twinSource, localConnect); + destroyConnection(connect, -sourceID); + continue; + } + + destinationID = getNodeID(destination); + + localConnect = + &localScaffold[destinationID + nodeCount(graph)]; + incrementConnectionDistance(connect, -distance_shift); + distance = getConnectionDistance(connect); + variance = getConnectionVariance(connect); + direct_count = getConnectionDirectCount(connect); + paired_count = getConnectionPairedCount(connect); + + if (distance > min_distance && getNodeStatus(destination) < 0) { + readjustMiniConnection(destination, localConnect, + -distance, min_distance, + variance, NULL, NULL); + if ((original = localConnect->backReference)) + readjustConnection(original, distance, + variance, direct_count, + paired_count); + } else if (getNodeStatus(destination) < 0) { + if ((original = localConnect->backReference)) { + destroyConnection(original, -nodeID); + localConnect->backReference = NULL; + } + unmarkNode(destination, localConnect); + } else if (getNodeStatus(destination) > 0) { + if ((original = localConnect->frontReference)) { + destroyConnection(original, nodeID); + localConnect->frontReference = NULL; + } + unmarkNode(destination, localConnect); + } else if (distance > min_distance) { + resetMiniConnection(destination, localConnect, + -distance, variance, NULL, + createNewConnection(-nodeID, + destinationID, + 
direct_count, + paired_count, + distance, + variance), + -1); + integrateDerivativeDistances(connect, min_distance, true); + } + + destroyConnection(connect, -sourceID); + } +} + +static void recenterNode(Node * node, Coordinate oldLength) +{ + IDnum nodeID = getNodeID(node); + Connection *connect, *next; + Coordinate distance_shift = (getNodeLength(node) - oldLength) / 2; + Coordinate min_distance = + getNodeLength(node) / 2 - BACKTRACK_CUTOFF; + MiniConnection *localConnect; + + //puts("Recentering node"); + + for (connect = getConnection(node); connect != NULL; + connect = next) { + next = getNextConnection(connect); + incrementConnectionDistance(connect, -distance_shift); + + if (getConnectionDistance(connect) < min_distance) { + //printf("Unrecording %li\n", + // -getNodeID(getConnectionDestination(connect))); + localConnect = + &localScaffold[-getNodeID(getConnectionDestination(connect)) + + nodeCount(graph)]; + localConnect->frontReference = NULL; + unmarkNode(getTwinNode(getConnectionDestination(connect)), + localConnect); + destroyConnection(connect, nodeID); + } else if (getTwinConnection(connect) != NULL) + incrementConnectionDistance(getTwinConnection(connect), -distance_shift); + } + + for (connect = getConnection(getTwinNode(node)); connect != NULL; + connect = next) { + next = getNextConnection(connect); + incrementConnectionDistance(connect, distance_shift); + + if (getTwinConnection(connect) != NULL) + incrementConnectionDistance(getTwinConnection(connect), distance_shift); + } +} + +static void recenterLocalScaffold(Node * node, Coordinate oldLength) +{ + MiniConnection *localConnect; + Coordinate distance_shift = (getNodeLength(node) - oldLength) / 2; + Coordinate min_distance = + getNodeLength(node) / 2 - BACKTRACK_CUTOFF; + NodeList *nodeList, *next; + IDnum node2ID; + Node *node2; + + for (nodeList = markedNodes; nodeList != NULL; nodeList = next) { + next = nodeList->next; + + node2 = nodeList->node; + + if (node2 == node) { + 
setSingleNodeStatus(node2, 1); + continue; + } + + node2ID = getNodeID(node2); + localConnect = &localScaffold[node2ID + nodeCount(graph)]; + localConnect->distance -= distance_shift; + + if (localConnect->distance < min_distance + && localConnect->backReference == NULL + && localConnect->frontReference == NULL) + unmarkNode(node2, localConnect); + else if (getNodeStatus(node2) > 0) + setSingleNodeStatus(node2, 1); + else if (getNodeStatus(node2) < 0) + setSingleNodeStatus(node2, -1); + } +} + +static void adjustShortReads(Node * target, Node * source) +{ + ShortReadMarker *targetArray, *marker; + IDnum targetLength, index; + Coordinate position, nodeLength; + + if (!readStartsAreActivated(graph)) + return; + + targetArray = getNodeReads(getTwinNode(target), graph); + targetLength = getNodeReadCount(getTwinNode(target), graph); + + nodeLength = getNodeLength(source); + + for (index = 0; index < targetLength; index++) { + marker = getShortReadMarkerAtIndex(targetArray, index); + position = getShortReadMarkerPosition(marker); + position += nodeLength; + setShortReadMarkerPosition(marker, position); + } +} + +static void adjustLongReads(Node * target, Node * source) +{ + PassageMarker *marker; + Coordinate nodeLength = getNodeLength(source); + + for (marker = getMarker(source); marker != NULL; + marker = getNextInNode(marker)) + incrementFinishOffset(marker, nodeLength); +} + +static boolean goesToNode(PassageMarker * marker, Node * node) +{ + PassageMarker *current; + + for (current = marker; current != NULL; + current = getNextInSequence(current)) + if (getNode(current) == node) + return true; + + return false; +} + +static boolean comesFromNode(PassageMarker * marker, Node * node) +{ + Node *target = getTwinNode(node); + PassageMarker *current; + + for (current = getTwinMarker(marker); current != NULL; + current = getNextInSequence(current)) + if (getNode(current) == target) + return true; + + return false; +} + +static void reconnectPassageMarker(PassageMarker * 
marker, Node * node, + PassageMarker ** ptr) +{ + PassageMarker *current; + PassageMarker *next = getNextInSequence(marker); + PassageMarker *tmpMarker; + + for (current = marker; getNode(current) != node; + current = getPreviousInSequence(current)); + + setPreviousInSequence(current, next); + concatenatePassageMarkers(current, marker); + + for (; marker != current; marker = tmpMarker) { + tmpMarker = getPreviousInSequence(marker); + if (*ptr == marker || *ptr == getTwinMarker(marker)) + *ptr = getNextInNode(*ptr); + setNextInSequence(marker, NULL); + setPreviousInSequence(NULL, marker); + destroyPassageMarker(marker); + } +} + +static void concatenateLongReads(Node * node, Node * candidate, + Graph * graph) +{ + PassageMarker *marker, *tmpMarker; + + // Passage marker management in node: + for (marker = getMarker(node); marker != NULL; + marker = getNextInNode(marker)) { + if (!goesToNode(marker, candidate)) + incrementFinishOffset(marker, + getNodeLength(candidate)); + } + + // Swapping new born passageMarkers from candidate to node + for (marker = getMarker(candidate); marker != NULL; + marker = tmpMarker) { + tmpMarker = getNextInNode(marker); + + if (!comesFromNode(marker, node)) { + extractPassageMarker(marker); + transposePassageMarker(marker, node); + incrementFinishOffset(getTwinMarker(marker), + getNodeLength(node)); + } else { + reconnectPassageMarker(marker, node, &tmpMarker); + } + } +} + +static void adjustShortReadsByLength(Node * target, Coordinate nodeLength) +{ + ShortReadMarker *targetArray, *marker; + IDnum targetLength, index; + Coordinate position; + + if (!readStartsAreActivated(graph)) + return; + + targetArray = getNodeReads(getTwinNode(target), graph); + targetLength = getNodeReadCount(getTwinNode(target), graph); + + for (index = 0; index < targetLength; index++) { + marker = getShortReadMarkerAtIndex(targetArray, index); + position = getShortReadMarkerPosition(marker); + position += nodeLength; + setShortReadMarkerPosition(marker, 
position); + } +} + +static boolean abs_bool(boolean val) +{ + return val >= 0 ? val : -val; +} + +static IDnum abs_ID(IDnum val) +{ + return val >= 0 ? val : -val; +} + +static NodeList *pathIsClear(Node * node, Node * oppositeNode, + Coordinate distance) +{ + Arc *arc; + Node *candidate, *dest, *current; + Coordinate extension_distance = 0; + boolean maxRepeat = 1; + Node *repeatEntrance = NULL; + IDnum counter = 0; + NodeList *path = NULL; + NodeList *tail = path; + + setSingleNodeStatus(node, 2); + + current = node; + while (true) { + + ////////////////////////////////// + // Selecting destination // + ////////////////////////////////// + candidate = NULL; + + // First round for priority nodes + for (arc = getArc(current); arc != NULL; + arc = getNextArc(arc)) { + dest = getDestination(arc); + + if (dest == node || dest == getTwinNode(node)) + continue; + + if (getNodeStatus(dest) <= 0) + continue; + + if (candidate == NULL + || getNodeStatus(candidate) > + getNodeStatus(dest) + || (getNodeStatus(candidate) == + getNodeStatus(dest) + && extension_distance > + localScaffold[getNodeID(dest) + + nodeCount(graph)]. + distance - getNodeLength(dest) / 2)) { + extension_distance = + localScaffold[getNodeID(dest) + + nodeCount(graph)]. 
+ distance - getNodeLength(dest) / 2; + candidate = dest; + } + } + + if (candidate != NULL && repeatEntrance) { + for (arc = getArc(node); arc != NULL; + arc = getNextArc(arc)) { + dest = getDestination(arc); + if (dest != candidate + && getNodeStatus(dest)) { + break; + } + } + } + // In case of failure + if (candidate == NULL) { + for (arc = getArc(current); arc != NULL; + arc = getNextArc(arc)) { + dest = getDestination(arc); + + if (getNodeStatus(dest) == 0) + continue; + + if (dest == node + || dest == getTwinNode(node)) + continue; + + if (candidate == NULL + || getNodeStatus(candidate) < + getNodeStatus(dest) + || (getNodeStatus(candidate) == + getNodeStatus(dest) + && extension_distance < + localScaffold[getNodeID(dest) + + nodeCount(graph)]. + distance - + getNodeLength(dest) / 2)) { + extension_distance = + localScaffold[getNodeID(dest) + + nodeCount + (graph)]. + distance - + getNodeLength(dest) / 2; + candidate = dest; + } + } + } + if (candidate == NULL) { + while (path) { + tail = path->next; + deallocateNodeList(path); + path = tail; + } + return false; + } + // Loop detection + if (candidate == repeatEntrance + && abs_bool(getNodeStatus(candidate)) == + maxRepeat + 1) { + while (path) { + tail = path->next; + deallocateNodeList(path); + path = tail; + } + return false; + } else if (abs_bool(getNodeStatus(candidate)) > maxRepeat) { + maxRepeat = abs_bool(getNodeStatus(candidate)); + repeatEntrance = candidate; + } else if (abs_bool(getNodeStatus(candidate)) == 1) { + maxRepeat = 1; + repeatEntrance = NULL; + } + + if (getNodeStatus(candidate) > 0) + setSingleNodeStatus(candidate, + getNodeStatus(candidate) + 1); + else + setSingleNodeStatus(candidate, + getNodeStatus(candidate) - 1); + + + // DEBUG + if (abs_bool(getNodeStatus(candidate)) > 100 + || counter > nodeCount(graph)) { + while (path) { + tail = path->next; + deallocateNodeList(path); + path = tail; + } + return false; + } + + // Missassembly detection + if (getUniqueness(candidate) && 
oppositeNode + && candidate != oppositeNode + && extension_distance > distance) { + while (path) { + tail = path->next; + deallocateNodeList(path); + path = tail; + } + return false; + } + + if (path == NULL) { + path = allocateNodeList(); + path->next = NULL; + path->node = candidate; + tail = path; + } else { + tail->next = allocateNodeList(); + tail = tail->next; + tail->node = candidate; + tail->next = NULL; + } + + if (getUniqueness(candidate)) + return path; + + current = candidate; + } +} + +static boolean pushNeighbours(Node * node, Node * oppositeNode, + Coordinate distance, boolean force_jumps) +{ + Node *candidate; + Node *lastCandidate = NULL; + Coordinate oldLength = getNodeLength(node); + Category cat; + MiniConnection *localConnect; + NodeList *path, *tmp; + + if ((path = pathIsClear(node, oppositeNode, distance))) { + while (path) { + candidate = path->node; + tmp = path->next; + deallocateNodeList(path); + path = tmp; + + /////////////////////////////////////// + // Stepping forward to destination // + /////////////////////////////////////// + + if (getUniqueness(candidate)) { + concatenateReadStarts(node, candidate, graph); + concatenateLongReads(node, candidate, graph); + absorbExtension(node, candidate); + + // Scaffold changes + recenterNode(node, oldLength); + recenterLocalScaffold(node, oldLength); + absorbExtensionInScaffold(node, candidate); + + // Read coverage + for (cat = 0; cat < CATEGORIES; cat++) { + incrementVirtualCoverage(node, cat, + getVirtualCoverage + (candidate, + cat)); + incrementOriginalVirtualCoverage + (node, cat, + getOriginalVirtualCoverage + (candidate, cat)); + } + + if (getNodeStatus(candidate)) { + localConnect = + &localScaffold[getNodeID + (candidate) + + nodeCount + (graph)]; + if (localConnect->frontReference) { + destroyConnection + (localConnect-> + frontReference, + getNodeID(node)); + localConnect-> + frontReference = NULL; + } + if (localConnect->backReference) { + destroyConnection + (localConnect-> + 
backReference, + -getNodeID(node)); + localConnect-> + backReference = NULL; + } + unmarkNode(candidate, + localConnect); + } + if (getNodeStatus(getTwinNode(candidate))) { + localConnect = + &localScaffold[-getNodeID + (candidate) + + nodeCount + (graph)]; + if (localConnect->frontReference) { + destroyConnection + (localConnect-> + frontReference, + getNodeID(node)); + localConnect-> + frontReference = NULL; + } + if (localConnect->backReference) { + destroyConnection + (localConnect-> + backReference, + -getNodeID(node)); + localConnect-> + backReference = NULL; + } + unmarkNode(getTwinNode(candidate), + localConnect); + } + // Original + printf("Pebble Concatenated Node %d -- ", + getNodeID(node)); + printf("Node %d\n", getNodeID(candidate)); + // Original + + destroyNode(candidate, graph); + return true; + } else { + adjustShortReads(node, candidate); + adjustLongReads(node, candidate); + absorbExtension(node, candidate); + lastCandidate = candidate; + } + } + } + + if (force_jumps && oppositeNode + && abs_ID(getNodeID(oppositeNode)) < abs_ID(getNodeID(node))) { + distance -= getNodeLength(node) / 2; + distance -= getNodeLength(oppositeNode) / 2; + if (distance > 10) { + adjustShortReadsByLength(node, distance); + appendGap(node, distance, graph); + } else { + adjustShortReadsByLength(node, 10); + appendGap(node, 10, graph); + } + + concatenateReadStarts(node, oppositeNode, graph); + concatenateLongReads(node, oppositeNode, graph); + absorbExtension(node, oppositeNode); + + // Scaffold changes + recenterNode(node, oldLength); + recenterLocalScaffold(node, oldLength); + absorbExtensionInScaffold(node, oppositeNode); + + // Read coverage + for (cat = 0; cat < CATEGORIES; cat++) + incrementVirtualCoverage(node, cat, + getVirtualCoverage + (oppositeNode, cat)); + + if (getNodeStatus(oppositeNode)) { + localConnect = + &localScaffold[getNodeID(oppositeNode) + + nodeCount(graph)]; + if (localConnect->frontReference) { + destroyConnection(localConnect-> + 
frontReference, + getNodeID(node)); + localConnect->frontReference = NULL; + } + if (localConnect->backReference) { + destroyConnection(localConnect-> + backReference, + -getNodeID(node)); + localConnect->backReference = NULL; + } + unmarkNode(oppositeNode, localConnect); + } + if (getNodeStatus(getTwinNode(oppositeNode))) { + localConnect = + &localScaffold[-getNodeID(oppositeNode) + + nodeCount(graph)]; + if (localConnect->frontReference) { + destroyConnection(localConnect-> + frontReference, + getNodeID(node)); + localConnect->frontReference = NULL; + } + if (localConnect->backReference) { + destroyConnection(localConnect-> + backReference, + -getNodeID(node)); + localConnect->backReference = NULL; + } + unmarkNode(getTwinNode(oppositeNode), + localConnect); + } + // Original + printf("Pebble Scaffolded Node %d -- Node %d\n", + getNodeID(node), getNodeID(oppositeNode)); + // Original + + destroyNode(oppositeNode, graph); + } + + return false; +} + +static void unmarkInterestingNodes() +{ + Node *node; + MiniConnection *localConnect; + + while ((node = popNodeRecord())) { + setSingleNodeStatus(node, false); + localConnect = + &localScaffold[getNodeID(node) + nodeCount(graph)]; + localConnect->frontReference = NULL; + localConnect->backReference = NULL; + localConnect->nodeList = NULL; + } +} + +static void findOppositeNode(Node * node, Node ** oppositeNode, + Coordinate * distance) +{ + NodeList *nodeList; + MiniConnection *localConnect; + Node *node2; + IDnum node2ID; + + *oppositeNode = NULL; + *distance = 0; + + for (nodeList = markedNodes; nodeList != NULL; + nodeList = nodeList->next) { + node2 = nodeList->node; + node2ID = getNodeID(node2); + localConnect = &localScaffold[node2ID + nodeCount(graph)]; + + if (node2 == node) + continue; + + if (!getUniqueness(node2)) + continue; + + if (localConnect->distance < 0) + continue; + + if (*oppositeNode == NULL + || *distance > localConnect->distance) { + *oppositeNode = node2; + *distance = localConnect->distance; 
+ } + } +} + +static boolean expandLongNode(Node * node, boolean force_jumps) +{ + boolean hit = true; + boolean modified = false; + Node *oppositeNode; + Coordinate distance = 0; + + markInterestingNodes(node); + + while (hit) { + correctGraphLocally(node); + findOppositeNode(node, &oppositeNode, &distance); + hit = pushNeighbours(node, oppositeNode, + distance, force_jumps); + modified = modified || hit; + } + + unmarkInterestingNodes(); + + return modified; +} + +static boolean expandLongNodes(boolean force_jumps) +{ + IDnum nodeID; + Node *node; + boolean modified = false; + + for (nodeID = 1; nodeID <= nodeCount(graph); nodeID++) { + node = getNodeInGraph(graph, nodeID); + + if (node != NULL && getUniqueness(node)) { + modified = expandLongNode(node, force_jumps) + || modified; + modified = + expandLongNode(getTwinNode(node), force_jumps) + || modified; + } + } + + return modified; +} + +static void cleanMemory() +{ + // Original + //puts("Cleaning memory"); + // Original + + cleanScaffoldMemory(); + + destroyRecycleBin(nodeListMemory); + nodeListMemory = NULL; + + free(localScaffold); +} + +void exploitShortReadPairs(Graph * argGraph, ReadSet * reads, + boolean * dubious, boolean force_jumps) +{ + boolean modified = true; + + graph = argGraph; + + if (!readStartsAreActivated(graph)) + return; + + // Original + /* + puts("Starting pebble resolution..."); + */ + // Original + + // Prepare graph + resetNodeStatus(graph); + prepareGraphForLocalCorrections(graph); + + // Prepare scaffold + buildScaffold(graph, reads, dubious); + + // Prepare local scaffold + localScaffold = + callocOrExit(2 * nodeCount(graph) + 1, MiniConnection); + + // Loop until convergence + while (modified) + modified = expandLongNodes(force_jumps); + + // Clean up memory + cleanMemory(); + deactivateLocalCorrectionSettings(); + + sortGapMarkers(graph); + + // Original + /* + puts("Pebble done."); + */ + // Original +} + +// Original +static void adjustShortReadsInterRepeat(Node * target, 
Node * source, + Graph * argGraph) +{ + ShortReadMarker *targetArray, *marker; + IDnum targetLength, index; + Coordinate position, nodeLength; + Graph * graph = argGraph; + + if (!readStartsAreActivated(graph)) + return; + + targetArray = getNodeReads(getTwinNode(target), graph); + targetLength = getNodeReadCount(getTwinNode(target), graph); + + nodeLength = getNodeLength(source); + + for (index = 0; index < targetLength; index++) { + marker = getShortReadMarkerAtIndex(targetArray, index); + position = getShortReadMarkerPosition(marker); + position += nodeLength; + setShortReadMarkerPosition(marker, position); + } +} + +static void absorbExtensionInterRepeat(Node * node, Node * extension, + Graph * argGraph) +{ + Arc *arc; + Graph * graph = argGraph; + + appendNodeGaps(node, extension, graph); + appendDescriptors(node, extension); + + // Destroy old nodes + while (getArc(node) != NULL) + destroyArc(getArc(node), graph); + + // Create new + for (arc = getArc(extension); arc != NULL; arc = getNextArc(arc)) + createAnalogousArc(node, getDestination(arc), arc, graph); +} + +boolean pushNeighboursInterRepeat(Node * node, Node * nodeInterRepeat, + Node * oppositeNode, Graph * argGraph) +{ + Node *candidate; + Node *lastCandidate = NULL; + Category cat; + NodeList *path, *tmp; + Graph * graph = argGraph; + + // Make path (= NodeList of node and oppositeNode) + path = allocateNodeList(); + path->node = nodeInterRepeat; + path->next = allocateNodeList(); + path->next->node = oppositeNode; + path->next->next = NULL; + + // Original + while (path) { + candidate = path->node; + tmp = path->next; + deallocateNodeList(path); + path = tmp; + + if (getUniqueness(candidate)) { + concatenateReadStarts(node, candidate, graph); + concatenateLongReads(node, candidate, graph); + absorbExtensionInterRepeat(node, candidate, graph); + // Read coverage + for (cat = 0; cat < CATEGORIES; cat++) { + incrementVirtualCoverage(node, cat, + getVirtualCoverage + (candidate, cat)); + 
incrementOriginalVirtualCoverage + (node, cat, + getOriginalVirtualCoverage + (candidate, cat)); + } + + // Original + printf("\tConcatenated InNode %d -- OutNode %d\n", + getNodeID(node), getNodeID(candidate)); + // Original + + destroyNode(candidate, graph); + return true; + } else { + adjustShortReadsInterRepeat(node, candidate, graph); + adjustLongReads(node, candidate); + absorbExtensionInterRepeat(node, candidate, graph); + lastCandidate = candidate; + } + } + + return false; +} +// Original diff --git a/contrib/MetaVelvet-v0.3.1/src/shortReadPairs.h b/contrib/MetaVelvet-v0.3.1/src/shortReadPairs.h new file mode 100644 index 0000000..b03b9f4 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/shortReadPairs.h @@ -0,0 +1,34 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _SHORTREADPAIRS_H_ +#define _SHORTREADPAIRS_H_ + +// Original +boolean pushNeighboursInterRepeat(Node * node, Node * nodeInterRepeat, + Node * oppositeNode, Graph * argGraph); +// Original +void exploitShortReadPairs(Graph * graph, ReadSet * reads, + boolean * dubious, boolean force_jumps); +void detachImprobablePairs(ReadSet * sequences, Graph * graph); +void handicapNode(Node * node); +NodeList *getMarkedNodeList(); + +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/splay.c b/contrib/MetaVelvet-v0.3.1/src/splay.c new file mode 100644 index 0000000..0d115a4 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/splay.c @@ -0,0 +1,423 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include <stdlib.h> +#include <stdio.h> + +#include "globals.h" +#include "recycleBin.h" +#include "kmer.h" + +#define CHUNKSIZE 10000 + +static RecycleBin *treeMemory = NULL; + +struct splayNode_st { + Kmer kmer; + Coordinate position; + struct splayNode_st *left; + struct splayNode_st *right; + IDnum seqID; +}; + +typedef struct splayNode_st SplayNode; +typedef struct splayNode_st SplayTree; + +static SplayNode *allocateSplayNode() +{ + if (treeMemory == NULL) + treeMemory = newRecycleBin(sizeof(SplayNode), CHUNKSIZE); + + return (SplayNode *) allocatePointer(treeMemory); +} + +static void deallocateSplayNode(SplayNode * node) +{ + deallocatePointer(treeMemory, node); +} + +SplayTree *newSplayTree() +{ + return NULL; +} + +void destroySplayTree(SplayTree * T) +{ + if (T == NULL) + return; + + destroySplayTree(T->left); + destroySplayTree(T->right); + deallocateSplayNode(T); +} + +void destroyAllSplayTrees() +{ + destroyRecycleBin(treeMemory); + treeMemory = NULL; +} + +/* This function can be called only if K2 has a left child */ +/* Perform a rotate between a node (K2) and its left child */ +/* Update heights, then return new root */ + +static SplayNode *SingleRotateWithLeft(SplayNode * K2) +{ + SplayNode *K1; + + K1 = K2->left; + K2->left = K1->right; + K1->right = K2; + + return K1; /* New root */ +} + +/* This function can be called only if K1 has a right child */ +/* Perform a rotate between a node (K1) and its right child */ +/* Update heights, then return new root */ + +static SplayNode *SingleRotateWithRight(SplayNode * K1) +{ + SplayNode *K2; + + K2 = K1->right; + K1->right = K2->left; + K2->left = K1; + + return K2; /* New root */ +} + +/* Top-down splay procedure, */ +/* not requiring kmer to be in tree */ + +static SplayTree *Splay(Kmer * kmer, SplayTree * T) +{ + 
SplayNode Header; + SplayNode *LeftTreeMax, *RightTreeMin; + + if (T == NULL) + return NULL; + + Header.left = Header.right = NULL; + LeftTreeMax = RightTreeMin = &Header; + + while (compareKmers(kmer, &(T->kmer))) { + if (compareKmers(kmer, &(T->kmer)) < 0) { + if (T->left == NULL) + break; + if (compareKmers(kmer, &(T->left->kmer)) < 0) + T = SingleRotateWithLeft(T); + if (T->left == NULL) + break; + /* Link right */ + RightTreeMin->left = T; + RightTreeMin = T; + T = T->left; + } else { + if (T->right == NULL) + break; + if (compareKmers(kmer, &(T->right->kmer)) > 0) + T = SingleRotateWithRight(T); + if (T->right == NULL) + break; + /* Link left */ + LeftTreeMax->right = T; + LeftTreeMax = T; + T = T->right; + } + } /* while kmer != T->kmer */ + + /* Reassemble */ + LeftTreeMax->right = T->left; + RightTreeMin->left = T->right; + T->left = Header.right; + T->right = Header.left; + + return T; +} + +Kmer * findInTree(Kmer * X, SplayTree ** T) +{ + *T = Splay(X, *T); + return &((*T)->kmer); +} + +void insertIntoTree(Kmer * kmer, SplayTree ** T) +{ + SplayNode *newNode; + + if (*T == NULL) { + newNode = allocateSplayNode(); + copyKmers(&(newNode->kmer), kmer); + newNode->left = newNode->right = NULL; + *T = newNode; + return; + } + + *T = Splay(kmer, *T); + if (compareKmers(kmer, &((*T)->kmer)) < 0) { + newNode = allocateSplayNode(); + copyKmers(&(newNode->kmer), kmer); + newNode->left = (*T)->left; + newNode->right = *T; + (*T)->left = NULL; + *T = newNode; + } else if (compareKmers(&((*T)->kmer), kmer) < 0) { + newNode = allocateSplayNode(); + copyKmers(&(newNode->kmer), kmer); + newNode->right = (*T)->right; + newNode->left = *T; + (*T)->right = NULL; + *T = newNode; + } +} + +boolean +findOrInsertOccurenceInSplayTree(Kmer * kmer, IDnum * seqID, + Coordinate * position, SplayTree ** T) +{ + SplayNode *newNode; + + if (*T == NULL) { + newNode = allocateSplayNode(); + copyKmers(&(newNode->kmer), kmer); + newNode->seqID = *seqID; + newNode->position = *position; + 
+ newNode->left = newNode->right = NULL; + + *T = newNode; + + return false; + } + + *T = Splay(kmer, *T); + if (compareKmers(kmer, &((*T)->kmer)) < 0) { + newNode = allocateSplayNode(); + copyKmers(&(newNode->kmer), kmer); + newNode->seqID = *seqID; + newNode->position = *position; + + newNode->left = (*T)->left; + newNode->right = *T; + (*T)->left = NULL; + + *T = newNode; + + return false; + } else if (compareKmers(kmer, &((*T)->kmer)) > 0) { + newNode = allocateSplayNode(); + copyKmers(&(newNode->kmer), kmer); + newNode->seqID = *seqID; + newNode->position = *position; + + newNode->right = (*T)->right; + newNode->left = *T; + (*T)->right = NULL; + + *T = newNode; + + return false; + } else { + *seqID = (*T)->seqID; + *position = (*T)->position; + + return true; + } +} + +boolean +placeOccurenceInSplayTree(Kmer * kmer, IDnum * seqID, + Coordinate * position, SplayTree ** T) +{ + SplayNode *newNode; + IDnum newID = *seqID; + Coordinate newCoord = *position; + + if (*T == NULL) { + newNode = allocateSplayNode(); + copyKmers(&(newNode->kmer), kmer); + newNode->seqID = *seqID; + newNode->position = *position; + + newNode->left = newNode->right = NULL; + + *T = newNode; + + return false; + } + + *T = Splay(kmer, *T); + if (compareKmers(kmer, &((*T)->kmer)) < 0) { + newNode = allocateSplayNode(); + copyKmers(&(newNode->kmer), kmer); + newNode->seqID = *seqID; + newNode->position = *position; + + newNode->left = (*T)->left; + newNode->right = *T; + (*T)->left = NULL; + + *T = newNode; + + return false; + } else if (compareKmers(&((*T)->kmer), kmer) < 0) { + newNode = allocateSplayNode(); + copyKmers(&(newNode->kmer), kmer); + newNode->seqID = *seqID; + newNode->position = *position; + + newNode->right = (*T)->right; + newNode->left = *T; + (*T)->right = NULL; + + *T = newNode; + + return false; + } else { + *seqID = (*T)->seqID; + *position = (*T)->position; + + (*T)->seqID = newID; + (*T)->position = newCoord; + + return true; + } +} + +void printTree(SplayTree * T) 
+{ + if (T == NULL) + return; + + printTree(T->left); + printKmer(&(T->kmer)); + printTree(T->right); +} + + + +int test(int argc, char **argv) +{ + SplayTree *T = newSplayTree(); + puts("Hello, world"); + Kmer k; + + puts("---TREE---"); + printTree(T); + clearKmer(&k); + pushNucleotide(&k, 1); + insertIntoTree(&k, &T); + puts("---TREE---"); + printTree(T); + clearKmer(&k); + pushNucleotide(&k, 3); + insertIntoTree(&k, &T); + puts("---TREE---"); + printTree(T); + clearKmer(&k); + pushNucleotide(&k, 13); + insertIntoTree(&k, &T); + puts("---TREE---"); + printTree(T); + clearKmer(&k); + pushNucleotide(&k, 5); + insertIntoTree(&k, &T); + puts("---TREE---"); + printTree(T); + clearKmer(&k); + pushNucleotide(&k, 7); + insertIntoTree(&k, &T); + puts("---TREE---"); + printTree(T); + clearKmer(&k); + pushNucleotide(&k, 2); + insertIntoTree(&k, &T); + puts("---TREE---"); + printTree(T); + + destroySplayTree(T); + + return 1; +} + + +void countOccurenceInSplayTree(Kmer * kmer, SplayTree ** T, int increment) +{ + SplayNode *newNode; + + if (*T == NULL) { + newNode = allocateSplayNode(); + copyKmers(&(newNode->kmer), kmer); + newNode->position = increment; + + newNode->left = newNode->right = NULL; + + *T = newNode; + + return; + } + + *T = Splay(kmer, *T); + if (compareKmers(kmer, &((*T)->kmer)) < 0) { + newNode = allocateSplayNode(); + copyKmers(&(newNode->kmer), kmer); + newNode->position = increment; + + newNode->left = (*T)->left; + newNode->right = *T; + (*T)->left = NULL; + + *T = newNode; + } else if (compareKmers(kmer, &((*T)->kmer)) > 0) { + newNode = allocateSplayNode(); + copyKmers(&(newNode->kmer), kmer); + newNode->position = increment; + + newNode->right = (*T)->right; + newNode->left = *T; + (*T)->right = NULL; + + *T = newNode; + } else { + ((*T)->position) += increment; + } + +} + +void filterAndExportSplayTree(FILE * file, SplayTree * T, int minCov, + int maxCov) +{ + if (T == NULL) + return; + + filterAndExportSplayTree(file, T->left, minCov, maxCov); + 
filterAndExportSplayTree(file, T->right, minCov, maxCov); + if ((minCov == -1 || T->position >= minCov) + && (maxCov == -1 || T->position <= maxCov)) + printKmer(&(T->kmer)); +} + +void displaySplayTreeMemory() +{ + printf("TREE MEMORY %lli allocated %lli free\n", + (long long) RecycleBin_memory_usage(treeMemory), + (long long) recycleBinFreeSpace(treeMemory)); +} diff --git a/contrib/MetaVelvet-v0.3.1/src/splay.h b/contrib/MetaVelvet-v0.3.1/src/splay.h new file mode 100644 index 0000000..1fcbee3 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/splay.h @@ -0,0 +1,51 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _Splay_H +#define _Splay_H + +#include + +typedef struct splayNode_st SplayTree; + +// Creates an empty tree (=NULL in fact) +SplayTree *newSplayTree(); + +// Deallocates tree memory +void destroySplayTree(SplayTree * T); +void destroyAllSplayTrees(); + +// Finds occurrence of kmer in the tree +// If found, returns TRUE, and seqID and coordinate are accordingly modified +// If not, a new leaf is added to the tree, with the seqID and position data +boolean findOrInsertOccurenceInSplayTree(Kmer * kmer, IDnum * seqID, + Coordinate * position, + SplayTree ** T); + +boolean placeOccurenceInSplayTree(Kmer * kmer, IDnum * seqID, + Coordinate * position, SplayTree ** T); + +void countOccurenceInSplayTree(Kmer * kmer, SplayTree ** T, int increment); + +void filterAndExportSplayTree(FILE * file, SplayTree * T, int minCov, + int maxCov); + +void displaySplayTreeMemory(); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/splayTable.c b/contrib/MetaVelvet-v0.3.1/src/splayTable.c new file mode 100644 index 0000000..3a8b79e --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/splayTable.c @@ -0,0 +1,354 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. 
+ + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include + +#include "globals.h" +#include "readSet.h" +#include "splay.h" +#include "tightString.h" +#include "crc.h" +#include "utility.h" +#include "kmer.h" + +struct splayTable_st { + SplayTree **table; + IDnum lastIndex; + int WORDLENGTH; +}; + +SplayTable *newSplayTable(int WORDLENGTH) +{ + SplayTable *splayTable = mallocOrExit(1, SplayTable); + splayTable->WORDLENGTH = WORDLENGTH; + splayTable->table = callocOrExit(CRC_HASH_BUCKETS, SplayTree *); + splayTable->lastIndex = 0; + return splayTable; +} + +void destroySplayTable(SplayTable * splayTable) +{ + puts("Destroying splay table"); + + destroyAllSplayTrees(); + free(splayTable->table); + free(splayTable); + + puts("Splay table destroyed"); +} + +static int hash_kmer(Kmer * kmer) +{ + return crc32_v((char *) kmer, KMER_BYTE_SIZE); +} + +static boolean findOrInsertOccurenceInSplayTable(Kmer * kmer, IDnum * seqID, + Coordinate * position, + SplayTable * table) +{ + if (table == NULL) { + puts("NULL table!"); + exit(1); + } + + return findOrInsertOccurenceInSplayTree(kmer, seqID, position, + &table-> + table[hash_kmer(kmer)]); +} + +void inputSequenceIntoSplayTable(TightString * tString, + SplayTable * table, FILE * file, boolean double_strand, boolean second_in_pair) +{ + IDnum currentIndex; + Coordinate readNucleotideIndex = 0; + Coordinate writeNucleotideIndex = 0; + Kmer word; + Kmer antiWord; + boolean annotationClosed = true; + IDnum sequenceID; + Coordinate coord; + boolean found; + Coordinate position = 0; + Coordinate start = 0; + Coordinate finish = 0; + IDnum referenceSequenceID = 0; + Nucleotide nucleotide; + + clearKmer(&word); + clearKmer(&antiWord); + + table->lastIndex++; + + currentIndex = table->lastIndex; + fprintf(file, "ROADMAP %d\n", currentIndex); + + // Neglect 
any string shorter than WORDLENGTH : + if (getLength(tString) < table->WORDLENGTH) { + destroyTightString(tString); + return; + } + // Fill in the initial word : + for (readNucleotideIndex = 0; + readNucleotideIndex < table->WORDLENGTH - 1; + readNucleotideIndex++) { + nucleotide = getNucleotide(readNucleotideIndex, tString); + pushNucleotide(&word, nucleotide); + if (double_strand) { +#ifdef COLOR + reversePushNucleotide(&antiWord, nucleotide); +#else + reversePushNucleotide(&antiWord, 3 - nucleotide); +#endif + } + } + + while (readNucleotideIndex < getLength(tString)) { + // Shift word: + nucleotide = getNucleotide(readNucleotideIndex++, tString); + pushNucleotide(&word, nucleotide); + if (double_strand) { +#ifdef COLOR + reversePushNucleotide(&antiWord, nucleotide); +#else + reversePushNucleotide(&antiWord, 3 - nucleotide); +#endif + } + + sequenceID = currentIndex; + coord = writeNucleotideIndex; + + if (double_strand) { + if (compareKmers(&word, &antiWord) <= 0) { + found = + findOrInsertOccurenceInSplayTable(&word, + &sequenceID, + &coord, + table); + } else { + sequenceID = -sequenceID; + found = + findOrInsertOccurenceInSplayTable(&antiWord, + &sequenceID, + &coord, + table); + sequenceID = -sequenceID; + } + } else { + if (!second_in_pair) { + found = + findOrInsertOccurenceInSplayTable(&word, + &sequenceID, + &coord, + table); + } else { + sequenceID = -sequenceID; + found = + findOrInsertOccurenceInSplayTable(&antiWord, + &sequenceID, + &coord, + table); + sequenceID = -sequenceID; + } + } + + if (!found) { + writeNucleotideIndex++; + if (!annotationClosed) + fprintf(file, "%ld\t%lld\t%lld\t%lld\n", + (long) referenceSequenceID, (long long) position, + (long long) start, (long long) finish); + annotationClosed = true; + } + // Otherwise create/complete annotation: + else { + // Forbidden k-mer + if (sequenceID == 0) { + break; + } + // Closed/nonexistent annotation + else if (annotationClosed) { + referenceSequenceID = sequenceID; + position = 
writeNucleotideIndex; + start = finish = coord; + + if (referenceSequenceID > 0) + finish++; + else + finish--; + + annotationClosed = false; + } + // Open annotation + else if (sequenceID == referenceSequenceID + && coord == finish) { + if (referenceSequenceID > 0) + finish++; + else + finish--; + } + // Previous non corresponding annotation + else { + fprintf(file, "%ld\t%lld\t%lld\t%lld\n", + (long) referenceSequenceID, (long long) position, + (long long) start, (long long) finish); + + referenceSequenceID = sequenceID; + position = writeNucleotideIndex; + start = finish = coord; + + if (referenceSequenceID > 0) + finish++; + else + finish--; + } + } + } + + if (!annotationClosed) + fprintf(file, "%ld\t%lld\t%lld\t%lld\n", + (long) referenceSequenceID, (long long) position, + (long long) start, (long long) finish); + + destroyTightString(tString); + return; +} + +void inputSequenceArrayIntoSplayTableAndArchive(ReadSet * reads, + SplayTable * table, + char *filename, + boolean double_strand) +{ + IDnum index; + IDnum sequenceCount = reads->readCount; + TightString **array; + FILE *outfile = fopen(filename, "w"); + boolean second_in_pair = false; + + if (outfile == NULL) + exitErrorf(EXIT_FAILURE, true, "Couldn't write to file %s", filename); + else + printf("Writing into roadmap file %s...\n", filename); + + fprintf(outfile, "%ld\t%i\t%hi\n", (long) sequenceCount, table->WORDLENGTH, (short) double_strand); + + if (reads->tSequences == NULL) + convertSequences(reads); + + array = reads->tSequences; + + puts("Inputting sequences..."); + for (index = 0; index < sequenceCount; index++) { + if (index % 100000 == 0) { + printf("Inputting sequence %d / %d\n", index, + sequenceCount); + fflush(stdout); + } + inputSequenceIntoSplayTable(array[index], table, outfile, double_strand, second_in_pair); + + if (reads->categories[index] % 2) + second_in_pair = (second_in_pair? 
false : true); + else + second_in_pair = false; + } + + fclose(outfile); + + free(reads->tSequences); + reads->tSequences = NULL; + destroyReadSet(reads); + puts("Done inputting sequences"); +} + +void inputMaskIntoSplayTable(TightString * tString, SplayTable * table) +{ + Coordinate readNucleotideIndex = 0; + Kmer word; + Kmer antiWord; + IDnum sequenceID = 0; + Coordinate coord = 0; + Nucleotide nucleotide; + + clearKmer(&word); + clearKmer(&antiWord); + + // Neglect any string shorter than WORDLENGTH : + if (getLength(tString) < table->WORDLENGTH) { + destroyTightString(tString); + return; + } + // Fill in the initial word : + for (readNucleotideIndex = 0; + readNucleotideIndex < table->WORDLENGTH - 1; + readNucleotideIndex++) { + nucleotide = getNucleotide(readNucleotideIndex, tString); + pushNucleotide(&word, nucleotide); +#ifdef COLOR + reversePushNucleotide(&antiWord, nucleotide); +#else + reversePushNucleotide(&antiWord, 3 - nucleotide); +#endif + } + + while (readNucleotideIndex < getLength(tString)) { + // Shift word: + nucleotide = getNucleotide(readNucleotideIndex++, tString); + pushNucleotide(&word, nucleotide); +#ifdef COLOR + reversePushNucleotide(&antiWord, nucleotide); +#else + reversePushNucleotide(&antiWord, 3 - nucleotide); +#endif + + if (compareKmers(&word, &antiWord) <= 0) + findOrInsertOccurenceInSplayTable(&word, + &sequenceID, + &coord, table); + else + findOrInsertOccurenceInSplayTable(&antiWord, + &sequenceID, + &coord, table); + } + + destroyTightString(tString); + return; +} + +void inputMaskArrayIntoSplayTable(ReadSet * reads, SplayTable * table) +{ + IDnum index; + IDnum sequenceCount = reads->readCount; + TightString **array; + + if (reads->tSequences == NULL) + convertSequences(reads); + + array = reads->tSequences; + + puts("Loading masks..."); + for (index = 0; index < sequenceCount; index++) + inputMaskIntoSplayTable(array[index], table); + + free(reads->tSequences); + reads->tSequences = NULL; + destroyReadSet(reads); + 
puts("Done loading masks"); +} diff --git a/contrib/MetaVelvet-v0.3.1/src/splayTable.h b/contrib/MetaVelvet-v0.3.1/src/splayTable.h new file mode 100644 index 0000000..3224277 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/splayTable.h @@ -0,0 +1,32 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _SPLAYTABLE_H_ +#define _SPLAYTABLE_H_ + +SplayTable *newSplayTable(int WORDLENGTH); + +void destroySplayTable(SplayTable * table); + +void inputSequenceArrayIntoSplayTableAndArchive(ReadSet * reads, + SplayTable * table, + char *filename, + boolean double_strand); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/tightString.c b/contrib/MetaVelvet-v0.3.1/src/tightString.c new file mode 100644 index 0000000..25f8e35 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/tightString.c @@ -0,0 +1,539 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. 
+ + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include + +#include "globals.h" +#include "utility.h" + +typedef unsigned char Codon; + +struct tString_st { + Codon *sequence; + Coordinate length; + Coordinate arrayLength; +}; + +static const Nucleotide Adenine = 0; +static const Nucleotide Cytosine = 1; +static const Nucleotide Guanine = 2; +static const Nucleotide Thymine = 3; + +// Binary 11111100 +static const Codon FILTER0 = ~((Codon) 3); +// Binary 11110011 +static const Codon FILTER1 = ~(((Codon) 3) << 2); +// Binary 11001111 +static const Codon FILTER2 = ~(((Codon) 3) << 4); +// Binary 00111111 +static const Codon FILTER3 = (Codon) ~ ((Codon) 3 << 6); + +// +// Adds a number into the Codon pointed to by codonPtr at the desired +// position (0, 1, 2, or 3); +// +void +writeNucleotideNumber(Nucleotide nucleotide, Codon * codonPtr, + Coordinate position) +{ + if (position == 3) { + *codonPtr &= FILTER3; + *codonPtr += nucleotide << 6; + } else if (position == 2) { + *codonPtr &= FILTER2; + *codonPtr += nucleotide << 4; + } else if (position == 1) { + *codonPtr &= FILTER1; + *codonPtr += nucleotide << 2; + } else if (position == 0) { + *codonPtr &= FILTER0; + *codonPtr += nucleotide; + } +} + +// +// Converts a character (A, C, G, T, either case) into its Nucleotide +// value; '\n' is passed through and anything else maps to Adenine +// +Nucleotide charToNucleotide(char c) +{ + switch (c) { + case 'A': + return Adenine; + case 'C': + return Cytosine; + case 'G': + return Guanine; + case 'T': + return Thymine; + case 'a': + return Adenine; + case 'c': + return Cytosine; + case 'g': + 
return Guanine; + case 't': + return Thymine; + case '\n': + return '\n'; + default: + return Adenine; + } + +} + +// +// Adds a nucleotide into the Codon pointed to by codonPtr at the desired +// position (0, 1, 2, or 3); +// +void writeNucleotide(Nucleotide nucleotide, Codon * codonPtr, int position) +{ + int nucleotideNum; + + switch (nucleotide) { + case 'A': + nucleotideNum = Adenine; + break; + case 'C': + nucleotideNum = Cytosine; + break; + case 'G': + nucleotideNum = Guanine; + break; + case 'T': + nucleotideNum = Thymine; + break; + case 'a': + nucleotideNum = Adenine; + break; + case 'c': + nucleotideNum = Cytosine; + break; + case 'g': + nucleotideNum = Guanine; + break; + case 't': + nucleotideNum = Thymine; + break; + default: + nucleotideNum = Adenine; + } + + writeNucleotideNumber(nucleotideNum, codonPtr, position); +} + +// +// Creates a tightString from a traditional string of A, T, G and C (the input string is freed) +// +TightString *newTightStringFromString(char *sequence) +{ + TightString *newTString = mallocOrExit(1, TightString); + + int size = (int) strlen(sequence); + int arrayLength = size / 4; + int index; + + if (size % 4 > 0) + arrayLength++; + + newTString->length = size; + newTString->arrayLength = arrayLength; + newTString->sequence = callocOrExit(arrayLength, Codon); + + for (index = 0; index < arrayLength; index++) + newTString->sequence[index] = 0; + + for (index = 0; index < size; index++) + writeNucleotide(sequence[index], + &(newTString->sequence[index / 4]), + index % 4); + + free(sequence); + return newTString; +} + +// +// Creates a tightString array from an array of normal strings +// +TightString **newTightStringArrayFromStringArray(char **sequences, + IDnum sequenceCount) +{ + IDnum sequenceIndex; + TightString **tStringArray = + mallocOrExit(sequenceCount, TightString *); + + for (sequenceIndex = 0; sequenceIndex < sequenceCount; + sequenceIndex++) + tStringArray[sequenceIndex] = + newTightStringFromString(sequences[sequenceIndex]); + + 
free(sequences); + return tStringArray; +} + +char readNucleotide(Nucleotide nucleotide) +{ + switch (nucleotide) { + case 0: + return 'A'; + case 1: + return 'C'; + case 2: + return 'G'; + case 3: + return 'T'; + } + + return '?'; +} + +char *readTightString(TightString * tString) +{ + Coordinate index, index4; + char *string; + Codon codon; + + if (tString == NULL || tString->length == 0) { + string = callocOrExit(5, char); + strcpy(string, "VOID"); + return string; + } + + string = callocOrExit(tString->length + 1, char); + + for (index = 0; index < tString->length / 4; index++) { + index4 = index << 2; + codon = tString->sequence[index]; + string[index4] = readNucleotide(codon & 3); + string[index4 + 1] = readNucleotide((codon & 12) >> 2); + string[index4 + 2] = readNucleotide((codon & 48) >> 4); + string[index4 + 3] = readNucleotide((codon & 192) >> 6); + } + + index4 = index << 2; + codon = tString->sequence[index]; + + switch (tString->length % 4) { + case 3: + string[index4 + 3] = readNucleotide((codon & 192) >> 6); + case 2: + string[index4 + 2] = readNucleotide((codon & 48) >> 4); + case 1: + string[index4 + 1] = readNucleotide((codon & 12) >> 2); + case 0: + string[index4] = readNucleotide(codon & 3); + } + + string[tString->length] = '\0'; + + return string; +} + +Nucleotide getNucleotide(Coordinate nucleotideIndex, TightString * tString) +{ + Codon codon = tString->sequence[nucleotideIndex / 4]; + + switch (nucleotideIndex % 4) { + case 3: + return (codon & 192) >> 6; + case 2: + return (codon & 48) >> 4; + case 1: + return (codon & 12) >> 2; + case 0: + return (codon & 3); + } + + return '?'; +} + +void readTightStringFragment(TightString * tString, Coordinate start, + Coordinate finish, char *string) +{ + Coordinate index; + Coordinate inFinish = finish; + + if (start >= tString->length) { + string[0] = '\0'; + return; + } + + if (inFinish > tString->length) + inFinish = tString->length; + + for (index = start; index < inFinish; index++) { + 
string[index - start] = + readNucleotide(getNucleotide(index, tString)); + } + + string[inFinish - start] = '\0'; +} + +char getNucleotideChar(Coordinate nucleotideIndex, TightString * tString) +{ + Codon codon; + + codon = tString->sequence[nucleotideIndex / 4]; + + switch (nucleotideIndex % 4) { + case 3: + return readNucleotide((codon & 192) >> 6); + case 2: + return readNucleotide((codon & 48) >> 4); + case 1: + return readNucleotide((codon & 12) >> 2); + case 0: + return readNucleotide((codon & 3)); + } + + return '?'; +} + +char getInverseNucleotideChar(Coordinate nucleotideIndex, + TightString * tString) +{ + Codon codon = tString->sequence[nucleotideIndex / 4]; + + switch (nucleotideIndex % 4) { +#ifndef COLOR + case 3: + return readNucleotide(3 - ((codon & 192) >> 6)); + case 2: + return readNucleotide(3 - ((codon & 48) >> 4)); + case 1: + return readNucleotide(3 - ((codon & 12) >> 2)); + case 0: + return readNucleotide(3 - ((codon & 3))); +#else + case 3: + return readNucleotide(((codon & 192) >> 6)); + case 2: + return readNucleotide(((codon & 48) >> 4)); + case 1: + return readNucleotide(((codon & 12) >> 2)); + case 0: + return readNucleotide(((codon & 3))); +#endif + } + + return '?'; +} + +TightString *newTightString(Coordinate length) +{ + Coordinate arrayLength = length / 4; + Coordinate index; + TightString *newTString = mallocOrExit(1, TightString); + if (length % 4 > 0) + arrayLength++; + + newTString->length = length; + newTString->arrayLength = arrayLength; + newTString->sequence = callocOrExit(arrayLength, Codon); + + for (index = 0; index < arrayLength; index++) + newTString->sequence[index] = 0; + + return newTString; +} + +void +writeNucleotideAtPosition(Nucleotide nucleotide, Coordinate position, + TightString * tString) +{ + if (position >= tString->length) + return; + + writeNucleotideNumber(nucleotide, + &tString->sequence[position / 4], + position % 4); +} + +void trimTightString(TightString * tString, Coordinate length) +{ + 
Coordinate newArrayLength = length / 4; + if (length % 4 > 0 || length == 0) + newArrayLength++; + + tString->length = length; + tString->arrayLength = newArrayLength; + tString->sequence = + reallocOrExit(tString->sequence, newArrayLength, Codon); +} + +Coordinate getLength(TightString * tString) +{ + return tString->length; +} + +TightString **concatenateTightStringArrays(TightString ** array1, + TightString ** array2, + IDnum size1, IDnum size2) +{ + TightString **unionArray; + IDnum index; + + if (array1 == NULL) + return array2; + + if (array2 == NULL) + return array1; + + unionArray = + reallocOrExit(array1, size1 + size2, TightString *); + + for (index = 0; index < size2; index++) + unionArray[size1 + index] = array2[index]; + + free(array2); + + return unionArray; +} + +void destroyTightString(TightString * tString) +{ + free(tString->sequence); + free(tString); +} + +void destroyTightStringArray(TightString ** array, IDnum sequenceCount) +{ + IDnum index; + for (index = 0; index < sequenceCount; index++) + destroyTightString(array[index]); + free(array); +} + +void setTightStringLength(TightString * tString, Coordinate length) +{ + Coordinate newArrayLength = length / 4; + if (length % 4 > 0) + newArrayLength++; + + if (newArrayLength > tString->arrayLength) { + tString->sequence = + reallocOrExit(tString->sequence, + newArrayLength, Codon); + tString->arrayLength = newArrayLength; + } + + tString->length = length; +} + +// Shortens reads to a fixed size (good for Solexa, where errors are concentrated towards the end) +void trimTightStringArray(TightString ** array, IDnum sequenceCount, + Coordinate length) +{ + IDnum index; + + for (index = 0; index < sequenceCount; index++) + trimTightString(array[index], length); +} + +void trimTightStringArraySanger(TightString ** array, IDnum sequenceCount, + Coordinate min, Coordinate max) +{ + IDnum index; + + for (index = 0; index < sequenceCount; index++) { + if (getLength(array[index]) > max) + trimTightString(array[index], max); + 
else if (getLength(array[index]) < min) + trimTightString(array[index], 0); + } +} + +void clipTightString(TightString * tString, Coordinate start, + Coordinate finish) +{ + Coordinate position; + Coordinate newLength = finish - start; + + for (position = 0; position < newLength; position++) + writeNucleotideAtPosition(getNucleotide + (position + start, tString), + position, tString); + + trimTightString(tString, newLength); +} + +Nucleotide getNucleotideFromString(Coordinate nucleotideIndex, + char *string) +{ + char letter = string[nucleotideIndex]; + + switch (letter) { + case 'A': + return Adenine; + case 'C': + return Cytosine; + case 'G': + return Guanine; + case 'T': + return Thymine; + default: + return Adenine; + } +} + +void exportTightString(FILE * outfile, TightString * sequence, IDnum index) +{ + Coordinate start, finish; + char str[100]; + + if (sequence == NULL) + return; + + fprintf(outfile, ">SEQUENCE_%ld_length_%lld\n", (long) index, + (long long) getLength(sequence)); + + start = 0; + while (start <= getLength(sequence)) { + finish = start + 60; + readTightStringFragment(sequence, start, finish, str); + fprintf(outfile, "%s\n", str); + start = finish; + } + + fflush(outfile); +} + +void exportSequenceArray(char *filename, TightString ** array, + IDnum sequenceCount) +{ + IDnum index; + FILE *outfile = fopen(filename, "w+"); + + if (outfile == NULL) { + puts("Couldn't open file, sorry"); + return; + } else + printf("Writing into file: %s\n", filename); + + for (index = 0; index < sequenceCount; index++) { + exportTightString(outfile, array[index], index); + } + + fclose(outfile); + + puts("Done"); +} diff --git a/contrib/MetaVelvet-v0.3.1/src/tightString.h b/contrib/MetaVelvet-v0.3.1/src/tightString.h new file mode 100644 index 0000000..9110884 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/tightString.h @@ -0,0 +1,102 @@ +/* +Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk) + + This file is part of Velvet. 
+ + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef _TIGHTSTRING_H_ +#define _TIGHTSTRING_H_ + +#include + +////////////////////////////////////////////////////////////// +// Creators and destructors +///////////////////////////////////////////////////////////// + +TightString *newTightString(Coordinate size); + +TightString *newTightStringFromString(char *sequence); + +void destroyTightString(TightString * tString); + +/////////////////////////////////////////////////////////////// +// Reading +////////////////////////////////////////////////////////////// + +Coordinate getLength(TightString * tString); + +char *readTightString(TightString * tString); + +Nucleotide getNucleotide(Coordinate nucleotideIndex, + TightString * tString); + +char getNucleotideChar(Coordinate nucleotideIndex, TightString * tString); + +Nucleotide getNucleotideFromString(Coordinate nucleotideIndex, + char *string); + +char getInverseNucleotideChar(Coordinate nucleotideIndex, + TightString * tString); + +void readTightStringFragment(TightString * tString, Coordinate start, + Coordinate finish, char *string); + +Nucleotide charToNucleotide(char c); + +/////////////////////////////////////////////////////////////// +// Writing +/////////////////////////////////////////////////////////////// + +void setTightStringLength(TightString * tString, Coordinate 
length); + +void writeNucleotideAtPosition(Nucleotide nucleotide, Coordinate position, + TightString * tString); + +/////////////////////////////////////////////////////////////// +// Array wide operations +/////////////////////////////////////////////////////////////// + +TightString **newTightStringArrayFromStringArray(char **sequences, + IDnum sequenceCount); + +TightString **concatenateTightStringArrays(TightString ** array1, + TightString ** array2, + IDnum size1, IDnum size2); + +void destroyTightStringArray(TightString ** array, IDnum arrayLength); + +/////////////////////////////////////////////////////////////// +// Misc +/////////////////////////////////////////////////////////////// + +void trimTightString(TightString * tString, Coordinate length); + +void trimTightStringArray(TightString ** tStringArray, IDnum arrayLength, + Coordinate maxLength); + +void trimTightStringArraySanger(TightString ** tStringArray, + IDnum arrayLength, Coordinate minLength, + Coordinate maxLength); + +void clipTightString(TightString * sequence, Coordinate start, + Coordinate finish); + +// Exports an array of sequences under FastA format +void exportSequenceArray(char *filename, TightString ** array, + IDnum sequenceCount); +void exportTightString(FILE * outfile, TightString * sequence, IDnum index); +#endif diff --git a/contrib/MetaVelvet-v0.3.1/src/tmp.ps b/contrib/MetaVelvet-v0.3.1/src/tmp.ps new file mode 100644 index 0000000..16bc837 Binary files /dev/null and b/contrib/MetaVelvet-v0.3.1/src/tmp.ps differ diff --git a/contrib/MetaVelvet-v0.3.1/src/utility.c b/contrib/MetaVelvet-v0.3.1/src/utility.c new file mode 100644 index 0000000..10bea21 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/utility.c @@ -0,0 +1,92 @@ +/* +Copyright 2009 John Marshall (jm18@sanger.ac.uk) + + This file is part of Velvet. 
+ + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#include +#include +#include +#include +#include + +#include "globals.h" +#include "utility.h" + +static void allocExitError(const char *function, unsigned long long count, + unsigned long long size, const char *name) +{ + if (size == 1) + exitErrorf(EXIT_FAILURE, true, + "Can't %s %llu %ss", + function, count, name); + else + exitErrorf(EXIT_FAILURE, true, + "Can't %s %llu %ss totalling %llu bytes", + function, count, name, count * size); +} + +void *mallocOrExit3(size_t count, size_t size, const char *name) +{ + void *p = malloc(count * size); + if (p == NULL && count > 0) + allocExitError("malloc", count, size, name); + + return p; +} + +void *callocOrExit3(size_t count, size_t size, const char *name) +{ + void *p = calloc(count, size); + if (p == NULL && count > 0) + allocExitError("calloc", count, size, name); + + return p; +} + +void *reallocOrExit4(void *ptr, size_t count, size_t size, const char *name) +{ + void *p = realloc(ptr, count * size); + if (p == NULL && count > 0) + allocExitError("realloc", count, size, name); + + return p; +} + + +static const char *programName = NULL; + +void setProgramName(const char *name) +{ + programName = name; +} + +void exitErrorf(int exitStatus, boolean showErrno, const char *format, ...) 
+{ + int savedErrno = errno; + va_list args; + va_start(args, format); + if (programName) + fprintf(stderr, "%s: ", programName); + vfprintf(stderr, format, args); + if (showErrno) + fprintf(stderr, ": %s", strerror(savedErrno)); + fprintf(stderr, "\n"); + va_end(args); + + exit(exitStatus); +} diff --git a/contrib/MetaVelvet-v0.3.1/src/utility.h b/contrib/MetaVelvet-v0.3.1/src/utility.h new file mode 100644 index 0000000..fd7a3d5 --- /dev/null +++ b/contrib/MetaVelvet-v0.3.1/src/utility.h @@ -0,0 +1,54 @@ +/* +Copyright 2009 John Marshall (jm18@sanger.ac.uk) + + This file is part of Velvet. + + Velvet is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + Velvet is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with Velvet; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + +*/ +#ifndef UTILITY_H_ +#define UTILITY_H_ + +#ifdef __GNUC__ +#define ATTRIBUTE(list) __attribute__ (list) +#else +#define ATTRIBUTE(list) +#endif + +// Wrappers for malloc(), calloc(), and realloc() that always succeed. +// These functions print an error message and exit on failure, rather than +// requiring the calling function to check for NULL. The arguments contain +// the type itself -- mallocOrExit(n, Foo) rather than malloc(n * sizeof Foo) +// -- mainly so that it can be shown in error messages. 
+#define mallocOrExit(count, type) (mallocOrExit3((count), sizeof(type), #type)) +#define callocOrExit(count, type) (callocOrExit3((count), sizeof(type), #type)) +#define reallocOrExit(ptr, count, type) \ + (reallocOrExit4((ptr), (count), sizeof(type), #type)) + +// (Implementation functions -- use the macro wrappers above.) +void *mallocOrExit3(size_t count, size_t size, const char *name); +void *callocOrExit3(size_t count, size_t size, const char *name); +void *reallocOrExit4(void *ptr, size_t count, size_t size, const char *name); + +// Sets the program name to be prepended to error messages. +void setProgramName(const char *name); + +// Prints an error message to standard error (with printf-style formatting +// and optionally appending a perror-style description of errno), and calls +// exit() with the specified exit status. +void exitErrorf(int exitStatus, boolean showErrno, const char *format, ...) + ATTRIBUTE((format(printf, 3, 4), noreturn)); + +#endif diff --git a/data/test_reference.fa b/data/test_reference.fa index a22240c..e83e8c8 100644 --- a/data/test_reference.fa +++ b/data/test_reference.fa @@ -6,7 +6,7 @@ TAAAGGAGCATAATTCGCCACTGGAGATGCTGGTGAGGGTGCATCTATACTCACTATTTG CGATGGTACCTTTTGAGGTGTTATATTATTGACGTTAGTCTTCGGTGCGGTCCTCCTACG TTGCCCAGTATGTACCATGGCGTGCAAACCGACACGCGAAGATCAAGATCGCTGTAGTCT CTAGACACTCGCCCGCCCGAGACCGCCCAGACGATGGATCCCTGGCTTGCAAGCCAGTCT -CGATGACAACGCGCTCGGACAACGGTGAGATTGATAAACACTCAAACGGCTGGCNNNTAG +CGATGACAACGCGCTCGGACAACGGTGAGATTGATAAACACTCAAACGGCTGGCAAATAG TACCAACACGAGCCGCCTATCTGTAGTACTGATACTATTCCTATAGACGACTATGACCCT ATTATCTAATCCGCATACAGCCCCATGGATCAGAAGCAGCTTTCCTACACTGGATTTCCT AGTTGACAATCTGAATACAAGTGATGTGTACAGCTCCACCTGACTACTCCTTCGAAATTT