Skip to content

Commit

Permalink
python version first commit
Browse files Browse the repository at this point in the history
  • Loading branch information
ginolhac committed Nov 12, 2012
1 parent 93f8a2d commit eb753a3
Show file tree
Hide file tree
Showing 26 changed files with 1,050 additions and 1,574 deletions.
Binary file removed FragMisincorporation_plot.pdf
Binary file not shown.
18 changes: 18 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,18 @@
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

13 changes: 13 additions & 0 deletions MANIFEST
@@ -0,0 +1,13 @@
# file GENERATED by distutils, do NOT edit
README.txt
setup.py
bin/mapDamage
mapdamage/__init__.py
mapdamage/align.py
mapdamage/composition.py
mapdamage/parseoptions.py
mapdamage/rscript.py
mapdamage/seq.py
mapdamage/tables.py
mapdamage/version.py
mapdamage/Rscripts/mapDamage.R
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
include mapdamage/Rscripts/mapDamage.R
90 changes: 90 additions & 0 deletions README.txt
@@ -0,0 +1,90 @@
Introduction
------------

mapDamage is a Perl script that tracks DNA damage patterns among ancient DNA sequencing reads generated by Next-Generation Sequencing platforms.
In addition to the mapDamage Perl script, SAMtools, BEDtools and R are mandatory and must be present in your $PATH.

http://sourceforge.net/projects/samtools/files/samtools/

http://code.google.com/p/bedtools/downloads/list

http://www.r-project.org/

The original page with examples, datasets and result files is there:

http://geogenetics.ku.dk/publications/mapdamage/

News in version 0.3.6
---------------------

* exchange colors for adenine/cytosine in fragmentation patterns, could be confusing with G>A in blue
* bug fix with BAM files obtained from tmap and bowtie
* hard clipping support

Usage
-----

Three main commands can be used: map, merge and plot. The -c option in map allows to chain all the three steps. A detailed description is presented below:

* 1. map

./mapDamage.pl map -i input_SAM -d directory -r reference_fasta -c -t my_title

REQUIRED
-i : input SAM-file without header. SAM format, version 1.4 is described in this pdf file
-r : input reference fasta file, with headers identical to the ones in the input_SAM file

OPTIONAL
-l : maximal read length, in nucleotides to consider [default: 70]
-a : size, in nucleotides, of the genomic region to be retrieved before and after reads [default: 10]
-d : folder for writing output-files, if not already present, this folder will be automatically created [default name: results_input_SAM.filtered]
-j : second input SAM-file without header; reads present in both SAM files will not be processed
-u : removes all non-unique reads based on the X1 and XT tags from the BWA mapper
-k : keep bed file and fasta file containing all the genomic regions [default: delete these files]
-f : outputs all read aligned against the reference genome, in a fasta file format
-c : complete analysis, map will be automatically followed by both merge and plot steps [default: not active]

when -c is active, the following are available to control the plot step:
-b : the number of reference nucleotides to consider for ploting base composition in the region located upstream and downstream of every read [default: 10]
-y : graphical y-axis limit for nucleotide misincorporation frequencies [default: 0.30]
-m : read length, in nucleotides, considered for plotting nucleotide misincorporations [default: 25]

* 2. merge

./mapDamage.pl merge -d directory

REQUIRED
-d : folder with output-files generated by the map command

OPTIONAL
-c : complete analysis, merge will be automatically followed by the plot step [default: not active]

* 3. plot

./mapDamage.pl plot -d directory

REQUIRED
-d : folder with output-files generated by the merge command

OPTIONAL
-l : read length, in nucleotides, considered for plotting nucleotide misincorporations [default: 25]
-m : graphical y-axis limit for nucleotide misincorporation frequencies [default: 0.30]
-a : the number of reference nucleotides to consider for ploting base composition in the region located upstream and downstream of every read [default: 10]
-t : title used for both graph and filename [default: folder name without extension]

An example of FragMisincorporation_plot.pdf can be downloaded [here](https://github.com/ginolhac/mapDamage/blob/master/FragMisincorporation_plot.pdf?raw=true)


Citation
--------
If you use this script, please cite the following publication: Ginolhac A, Rasmussen M, Gilbert MT, Willerslev E, Orlando L.
mapDamage: testing for damage patterns in ancient DNA sequences. Bioinformatics. 2011 27(15):2153-5
http://bioinformatics.oxfordjournals.org/content/27/15/2153

Limitations
-----------
The positions used in fragmentation files are not compatible with read length > 99999 nucleotides. This is compatible with all sequencing technologies currently available.
Contact
-------
Please report bugs and suggest possible improvements to Aurelien Ginolhac by email.

212 changes: 212 additions & 0 deletions bin/mapDamage
@@ -0,0 +1,212 @@
#!/usr/bin/env python
# -*- coding: ASCII -*-
from __future__ import print_function

import sys
import os

# check if pysam if available
module = "pysam"
url = "http://code.google.com/p/pysam/"
try:
__import__(module)

except ImportError, e:
sys.stderr.write("Error: Could not import required module '%s':\n\t- %s\n" % (module,e))
sys.stderr.write(" If module is not installed, please download from '%s'.\n" % (url,))
sys.stderr.write(" A local install may be performed using the following command:\n")
sys.stderr.write(" $ python setup.py install --user\n\n")
sys.exit(1)

import pysam

import mapdamage
from mapdamage.version import __version__


"""
plot damage patterns from a SAM/BAM file
:Author: Aurelien Ginolhac
:Contact: aginolhac@snm.ku.dk
:Date: November 2012
:Type: tool
:Input: SAM/BAM
:Output: tabulated tables, pdf
"""


def getCoordinates(read):

if read.is_reverse:
fivep = read.aend
threep = read.pos
else:
fivep = read.pos
threep = read.aend
return(fivep, threep)


def getAround(coord, chrom, reflengths, lg, ref):

i = j = 0
before = after = ""
if min(coord) - lg > 0:
i = min(coord) - lg
if max(coord) + lg > reflengths[chrom]:
j = reflengths[chrom]
else:
j = max(coord) + lg

before = ref.fetch(chrom, i, min(coord))
after = ref.fetch(chrom, max(coord), j)

return(before, after)


def writeFasta(read,ref, seq, refseq, start, end, before, after, fout):
if read.is_reverse:
std = '-'
else:
std = '+'
# output coordinate in 1-based offset
fout.write(">%s:%d-%d\n%s\n" % (ref, start-len(before), start, before))
fout.write(">%s:%d-%d\n%s\n>%s_%s\n%s\n" % (ref, start+1, end+1, refseq, read.qname, std, seq))
fout.write(">%s:%d-%d\n%s\n" % (ref, end+2, end+2+len(after), after))



""" main """

def main(argv):

options = mapdamage.parseoptions.options(argv)
if not options:
return 1

# plot using R if results folder already done
if options.plotonly:
mapdamage.rscript.plot(options)
return 0

# open mapped reads and corresponding reference
in_bam = pysam.Samfile(options.filename)
ref = pysam.Fastafile(options.ref)

# for misincorporation patterns, to record mismaches, from 5'-ends
misincorp = {}
# for fragmentation patterns, record base compositions in the reference around
dnacomp = {}

# fetch all references and associated lengths in nucleotides
reflengths = dict(zip(in_bam.references, in_bam.lengths))
# initialize table of misincorporations with zeros
misincorp = mapdamage.tables.initializeMut(misincorp, reflengths, options.length)
# initialise base composition tables with zeros
dnacomp = mapdamage.tables.initializeComp(dnacomp, reflengths, options.around, options.length)


if not options.quiet:
print("\tReading from '%s'" % options.filename)
if options.verbose:
print("\t%d references are assumed in SAM/BAM file, for a total of %d nucleotides" % (len(reflengths), sum(reflengths.values())))
print("\tWriting results to '%s/'" % options.folder)

# reference in BAM must be in the fasta reference file
#if not referenceCheck():
#sys.stderr.write("Error: %s reference was not found in the reference file provided\n" % (chrom))
# return 1


# open file handler to write alignements in fasta format
if options.fasta:
# use name of the sam/bam filename without extension
ffasta = os.path.splitext(os.path.basename(options.filename))[0]+'.fasta'
if not options.quiet:
print("\tWriting alignments in '%s'" % ffasta)
ff = open(options.folder+"/"+ffasta,"w")

counter = filtered = 0

# main loop
for read in in_bam:
counter += 1

if read.is_unmapped:
filtered +=1
continue

# external coordinates 5' and 3' , 0-based offset
coordinate = getCoordinates(read)
# fetch reference name, chromosome or contig names
chrom = in_bam.getrname(read.tid)

(before, after) = getAround(coordinate, chrom, reflengths, options.around, ref)
refseq = ref.fetch(chrom, min(coordinate), max(coordinate))
# read.query contains aligned sequences while read.seq is the read itself
seq = read.query

# add gaps according to the cigar string
(seq, refseq) = mapdamage.align.align(read.cigar, seq, refseq)

# reverse complement read and reference when mapped reverse strand
if read.is_reverse:
refseq = refseq.translate(mapdamage.seq.table)[::-1]
seq = seq.translate(mapdamage.seq.table)[::-1]
beforerev = after.translate(mapdamage.seq.table)[::-1]
after = before.translate(mapdamage.seq.table)[::-1]
before = beforerev

# record soft clippinp when present
if len(mapdamage.align.parseCigar(read.cigar, 4)) > 0:
mapdamage.align.recordSoftClipping(mapdamage.align.parseCigar(read.cigar, 4), read, chrom, misincorp, options.length)

# count misincorparations by comparing read and reference base by base
mapdamage.align.getMis(read, seq, refseq, chrom, options.length, misincorp, '5p')
# do the same with sequences align to 3'-ends
mapdamage.align.getMis(read, seq[::-1], refseq[::-1], chrom, options.length, misincorp, '3p')

# compute base composition for genomic regions
mapdamage.composition.countRefComp(read, chrom, before, after, dnacomp)
# compute base composition for reads
mapdamage.composition.countReadComp(read, chrom, options.length, dnacomp)


if options.fasta:
writeFasta(read, chrom, seq, refseq, min(coordinate), max(coordinate), before, after, ff)

if options.verbose:
if counter % 50000 == 0:
print("\t%10d reads processed (%d unmapped)" % (counter, filtered))

if not options.quiet:
print("\tDone. %d reads read (%d unmapped)" % (counter, filtered))

# close file handlers
in_bam.close()
ref.close()
if options.fasta:
ff.close()

# open file handlers to output results
fmut = open(options.folder+"/"+"misincorporation.txt", 'w')
fcomp = open(options.folder+"/"+"dnacomp.txt", 'w')

# write summary tables to disk
mapdamage.tables.printMut(misincorp, options, fmut)
mapdamage.tables.printComp(dnacomp, options, fcomp)

# close file handlers
fmut.close()
fcomp.close()

# plot using R
if not options.nor:
mapdamage.rscript.plot(options)

return 0

if __name__ == '__main__':
sys.exit(main(sys.argv[1:]))
Binary file added dist/mapdamage-0.3.7.tar.gz
Binary file not shown.

0 comments on commit eb753a3

Please sign in to comment.