version 0.07

Wang-lab-UCSD · Apr 5, 2018 · 70ef1ff · 70ef1ff
commit 70ef1ff
Show file tree

Hide file tree

Showing 60 changed files with 1,850,970 additions and 0 deletions.
diff --git a/LICENSE.txt b/LICENSE.txt
@@ -0,0 +1,19 @@
+This software is Copyright © 2017 The Regents of the University of California. All Rights Reserved.
+
+Permission to copy, modify, and distribute this software and its documentation for educational, research and non-profit purposes, without fee, and without a written agreement is hereby granted, provided that the above copyright notice, this paragraph and the following three paragraphs appear in all copies.
+
+Permission to make commercial use of this software may be obtained by contacting:
+
+Technology Transfer Office
+9500 Gilman Drive, Mail Code 0910
+University of California
+La Jolla, CA 92093-0910
+(858) 534-5815
+invent@ucsd.edu
+
+This software program and documentation are copyrighted by The Regents of the University of California. 
+The software program and documentation are supplied "as is", without any accompanying services from The Regents. 
+The Regents does not warrant that the operation of the program will be uninterrupted or error-free. 
+The end-user understands that the program was developed for research purposes and is advised not to rely exclusively on the program for any reason.
+
+IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
diff --git a/README.md b/README.md
@@ -0,0 +1,98 @@
+# mEpigram  
+
+This tool allows users to find methylated motifs in CpG context and also motifs that contain modified bases. 
+
+Version 0.05
+
+Contact: vqngo@ucsd.edu
+
+## Installation
+
+- The programs were written in Python2.7 and Julia (tested with Julia v0.6.2 )
+
+- mEpigram requires a graph of possible k-mer interactions to function. Download the graphs here: 
+	* CpG mode: http://wanglab.ucsd.edu/star/mepigram/graphE-8mer.tar.gz
+	* non-CpG mode: http://wanglab.ucsd.edu/star/mepigram/graphEF-7mer.tar.gz
+
+- Although the Julia scripts are not required for motif discovery, it's recommended that you installed Julia to use motif scanning and enrichment calculation.
+
+- If you want to generate motif logos, please install WebLOGO on your computer https://pypi.python.org/pypi/weblogo  
+
+
+
+## Usage
+
+To run mEpigram: It's recommended to run mEpigram using the mepigram_wrapper.py script. However, you can run each modules separately as well.
+
+#### mEpigram Pipeline: 
+You can use the included pipeline to run mepigram, it will perform dinucleotide-shulffing, mepigram_typeE, and enrichment calculation (which uses the motif scanning module).
+
+To very quickly test the pipeline: 
+
+TypeE:	
+
+`python mepigram_wrapper.py -f testfiles/test_data_typeE/ENCFF002CQR.narrowPeak.1000.faa -m typeE -b testfiles/test_data_typeE/background_typeE-5.tsv -g testfiles/test_data_typeE/graphE-5mer/`
+
+TypeEF:
+
+`python mepigram_wrapper.py -f testfiles/test_data_typeEF/ENCFF002CQR.narrowPeak.1000.faa -m typeEF -b testfiles/test_data_typeEF/background_typeEF-5.tsv -g testfiles/test_data_typeEF/graphEF-5mer/`
+
+
+If you use k=8 by inputting background_typeE-8.tsv, graphE-8mer (download it from our website) instead, you should be able to find several highly enriched m-motifs. 
+
+*Note: This pipeline must be executed in the mepigram main directory. It also requires Julia installed.
+
+
+#### Running each module separately:
+
+1. Insert methylation information into the genome, the input is assumed to be in BED format by default. WIG format can be used with --wig. In BED format, each line contains chromosome name, start location (0-based index), start location +1. An output directory will be created to contain the new genome with methylation information. The reference genome should be in a directory format, with each chromosomal sequence contained in a separate file, labeled by its chromosome name. 
+
+	`python modifyReference.py -f input.bed -r reference_genome_directory -o methyl_ref_genomeA`
+
+	OR
+
+	`python modifyReference.py --typeEF -f input.bed -r reference_genome_directory -o methyl_ref_genomeA`
+
+2. Make methylated sequences from bed files and the genome above:
+
+	`python bedToFasta.py -f input.bed -r methyl_ref_genomeA -o output.faa`
+
+3. Make background model: Count the number of k-mers in the genome. This might take a while (a few hours on human whole genome) but you only need to do this once per reference genome. Example: Count the number of 5-mers in the sample genome:
+
+	`python bgModel.py -gd testfiles/samplegenome/ -k 5 -m typeE`
+
+	OR 
+
+	`python bgModel.py -gd testfiles/samplegenome/ -k 5 -m typeEF`
+
+4. Shuffle the sequences: this step will di-nucleotide shuffle your FASTA input sequences to be used. 
+
+	`python fasta-dinucleotide-shuffle_typeE.py -f sequences.faa > sequences.DS.faa`
+
+	OR
+
+	`python fasta-dinucleotide-shuffle_typeEF.py -f sequences.faa > sequences.DS.faa`
+
+
+5. Run mEpigram (all parameters must be provided):
+
+	`python mepigram_typeE.py testInput.faa testInputCTCF.DS.faa your_specific_genome/background_met-8.tsv metgraph-8mer/ resultfile.meme max_No_motifs`
+
+	OR
+
+	`python mepigram_typeEF.py testInput.faa testInputCTCF.DS.faa your_specific_genome/background_met-8.tsv metgraph-8mer/ resultfile.meme max_No_motifs`
+
+* max_No_motifs: integer, the maximum number of motifs to be produced.
+
+## Other tools
+
+#### Motif scanning: 
+To identify locations of matches using your motifs, you can use the motif scanning tool. The motif-scanning program is written in julia so it's necessary to have the Julia language installed. The program takes a FASTA file, a motif PWM file, and a background file that states background base composition. Although the background is optional, it is recommended that you use the appropriate background as the program will assume equal nucleotide distribution if the background is not provided.
+
+`python motifscannerA.py [options] -f fastafile -m motiffile -o output_file -b backgroundBaseComposition`
+
+*Note: The motifscannerA.py file should be executed in the main mEpigram directory
+
+#### Making motif LOGOs: 
+You can generate both types of logo using the `makeLOGO.py` script. Use the flag --typeEF to generate typeEF motifs.
+
diff --git a/baseComposition.py b/baseComposition.py
@@ -0,0 +1,49 @@
+
+#!/usr/bin/env python
+
+"""
+Calculate the base composition from the background file
+
+"""
+
+from sys import argv
+
+def main():
+	backgroundfile=argv[1]
+	alphabetfile = argv[2] # alphabet, a squence of alphabets that are used, separated by commas. Example: A,G,C,T
+	output = argv[3]
+
+
+	seqs=open(backgroundfile).read().split('\n')
+	total=float(seqs[0].strip().split('\t')[1])
+	seqs=seqs[1:]
+	#seqs[:10]
+	totalalphacounts={}
+	kmerlen=len(seqs[0].strip().split('\t')[0])
+	for line in seqs:
+		tmp=line.strip().split("\t")
+		if len(tmp)<2:
+			#print tmp
+			continue
+		kmercount=int(tmp[1])
+		alphacounts={}
+		for char in tmp[0]:
+			if char not in alphacounts:
+				alphacounts[char]=1
+			else:
+				alphacounts[char]+=1
+		for char in alphacounts:
+			if char not in totalalphacounts:
+				totalalphacounts[char]=alphacounts[char]*kmercount
+			else:
+				totalalphacounts[char]+=alphacounts[char]*kmercount
+
+	outfile=open(output,'w')
+	for char in totalalphacounts:
+		line=char+'\t'+str(float(totalalphacounts[char])/total/kmerlen)
+		outfile.write(line+'\n')
+	return
+
+
+if __name__=="__main__":
+	main()
diff --git a/basecomposition.H1.met.txt b/basecomposition.H1.met.txt
@@ -0,0 +1,5 @@
+A	0.272169925
+C	0.219596187
+G	0.227665455
+T	0.272197674
+E	0.008370759
diff --git a/bedToFasta.py b/bedToFasta.py
@@ -0,0 +1,147 @@
+#!/usr/bin/env python
+import time
+import sys
+from sys import argv
+import os
+
+'''Usage:
+    python BedtoFasta.py genome bedfile output
+
+    The genome is a single FASTA file
+    '''
+def main():
+    """metgenome=argv[1]
+    bedfile=argv[2]
+    output=argv[3]"""
+
+    metgenome = None
+    bedfile = None
+    output = None
+    ref_is_file = True #Default input reference genome is a FASTA file
+
+    # get command line arguments
+    #
+    usage = """USAGE:
+    %s [options]
+
+        -f <filename>       input file, in BED format
+        -r <ref_file>       the reference genome (default to be in the form of a single FASTA file)
+        -d <ref_directory>  indicates that the reference genome in the form of a directory containing files for chromosome in FASTA format (optional)
+        -o <output file>    name of the output FASTA file to be created
+        -h                  print this usage message
+    """ % (sys.argv[0])
+
+    # parse command line
+    i = 1
+    while i < len(sys.argv):
+        arg = sys.argv[i]
+        if (arg == "-f"):
+            i += 1
+            try: bedfile = sys.argv[i]
+            except:
+                print " Error in -f"
+                sys.exit(1)
+        elif (arg == "-d"):
+            ref_is_file = False
+        elif (arg == "-r"):
+            i += 1
+            try: metgenome = sys.argv[i]
+            except:
+                print "Error in -r, please check the command", sys.argv
+                sys.exit(1)
+        elif (arg == "-o"):
+            i += 1
+            try: output = sys.argv[i]
+            except:
+                print "Error in -o"
+                sys.exit(1)
+        elif (arg == "-h"):
+            print >> sys.stderr, usage; sys.exit(1)
+        else:
+            print >> sys.stderr, "Unknown command line argument: " + arg + " \nPlease check the usage with -h"
+            sys.exit(1)
+        i += 1
+    # check that required arguments given
+    if (bedfile == None):
+        print "No input, exit."
+        #sys.exit(1)
+        print >> sys.stderr, usage; sys.exit(1)
+    elif (output == None):
+        print "No output, exit."
+        print >> sys.stderr, usage; sys.exit(1)
+    elif (metgenome== None):
+        print "No reference genome, exit. See usage with -h"
+        sys.exit(1)
+
+    print "Working with",bedfile, metgenome, output
+
+    #load the peaks bed file
+    #example:
+    print "Reading BED file..."
+    file=open(bedfile)
+    met_reads=[]
+    for line in file:
+        tmp=line.strip().split("\t")
+        chrom=tmp[0]
+        start=int(tmp[1])
+        end=int(tmp[2])
+        met_reads+=[[chrom,start,end]]
+    print "Number of regions",len(met_reads)
+    print "First region in BED file",met_reads[0]
+    #sys.exit(main(0))
+
+    #load metgenome
+    print "Reading the reference genome..."
+    #check if the ref is in the file or directory format
+    genome={}
+    if ref_is_file:
+    	#example: file=open("/Users/vungo/Downloads/Work/genomes/mm10.fa")
+    	file = open(metgenome)
+    	seq = file.read()
+    	seq = seq.split(">")
+    	seq = seq[1:]
+    	#print len(seq)
+    	for chro in range(len(seq)):
+        	temp = seq[chro].strip().split('\n')
+        	chromosome = temp[0] #name of the chromosome
+        	print chromosome
+        	dna = "".join(temp[1:])
+        	genome[chromosome] = dna
+    	print genome.keys()
+    else:
+    	print "Reference is a directory. Reading the files..."
+    	chromfiles=os.listdir(metgenome)
+    	#check if the names of the chr files are chrXX or chrXX.faa, convert the latter to the former
+    	chromnames=[]
+    	for filename in chromfiles:
+    		print filename
+    		chro = filename.strip().split(".")[0]
+    		chromnames += [chro]
+    		seq = open(metgenome+"/"+filename).read().strip().split("\n")
+    		seq = seq[1:] #the first element is the chromosome name, discard it
+    		genome[chro] = "".join(seq)
+    	print genome.keys()
+
+    print "Outputing the FASTA file"
+    target=open(output,"w")
+    #DONT CARE ABOUT DIRECTION! because of the methylation ... its complicated
+    num=0
+    time_start=time.time()
+    for row in range(len(met_reads)):
+        if num%1000==0:
+            print "number of regions converted",num
+        num+=1
+        chrom=met_reads[row][0]
+        start=met_reads[row][1]
+        end=met_reads[row][2]
+        #direction=klf4.loc[row,0]
+        name=">%s%s%s\n" %(chrom,"_",str(start))
+        #name=">"+chrom+str(start)+"\n"
+        seq=genome[chrom][start:end]
+        target.write(name)
+        target.write(seq+"\n")
+    print "Running time is",time.time()-time_start
+    target.close()
+
+if __name__=="__main__":
+    main()