findMotifs

findMotifs is an easy-to-implement R script for finding and scoring motifs in small genomes or metagenomic contigs using position-frequency tables, such as those generated by HT-SELEX experiments. For large genomes, you're better off using motifmatchr or similar software that uses the MOODS C++ library. For the discovery of novel motifs from sequencing data, consider HOMER or rmotifx.

Inputs

fasta file (-f,--fasta)
position-frequency table (-t,--freq_table)
minimum score for matches (-c,--cutoff; default is 90)
output directory (-o,--out_dir; default is getwd())

The position-frequency table must be a concatenated list of comma-delimited position-frequency matrices, with the name of each motif preceding its corresponding matrix. For an example, see the structure of PFMs.txt:

motif_A
A,89,0,41,0,0,28,34,0,23,34,0,90
C,9,94,9,11,95,0,0,92,28,51,100,0
G,2,0,24,0,5,72,47,8,17,15,0,0
T,0,6,26,89,0,0,19,0,32,0,0,10
motif_B
A,0,48,8,0,0,100,37,28,23,0,29,1,0
C,100,0,35,24,0,0,0,9,34,100,71,0,0
G,0,52,41,16,100,0,63,36,35,0,0,0,100
T,0,0,16,60,0,0,0,27,8,0,0,99,0
motif_C
A,0,0,93,0,31,25,0,0,0,29
C,0,71,0,0,0,52,0,0,100,59
G,7,0,0,47,69,23,90,0,0,0
T,93,29,7,53,0,0,10,100,0,12

Trailing commas are OK, as all empty fields get converted to NA and trimmed during import.

Outputs

This script returns a single tab-delimited file named with the following fields:

Field	Description
`motif_name`	the name of the motif as it appears in the position-frequency table
`motif_width`	the number of nucleotides in the motif
`fasta`	the name of the input fasta
`fasta_length`	the number of nucleotides in the input fasta
`strand`	the strand the match is found on (`+` or `-`)
`match`	the nucleotide sequence of the match
`score`	the score of the match as a proportion of the max score (0 to 1)
`concat_start`	the position of the first nucleotide of the match in the `+` concatenated fasta file
`concat_end`	the position of the last nucleotide of the match in the `+` concatenated fasta file. For matches on the `+` strand, `concat_start` < `concat_end`. For matches on the `-` strand, `concat_start` > `concat_end`.
`contig_index`	the contig number on which the match is located
`contig_length`	the number of nucleotides in the contig containing the match
`contig_start`	the position of the first nucleotide in the match relative to the `+` start of the contig
`contig_end`	the position of the last nucleotide in the match relative to the `+` start of the contig. For matches on the `+` strand, `contig_start` < `contig_end`. For matches on the `-` strand, `contig_start` > `contig_end`.
`contig_header`	the defline of the contig

Dependencies

R (version >= 4.0.0) with the following packages:

Biostrings (tested version: 2.56.0)
optparse (tested version: 1.6.6)
stringr (tested version: 1.4.0)

Example

To run findMotifs.R from the command line, ensure R is in your PATH and call the script with Rscript.

export PATH=/programs/R-4.0.0/bin:$PATH  
Rscript findMotifs.R -f EcoliC.fa -t PFMs.txt -c 90

Progress will print to the command line, and results will be written to <fasta>_<cutoff>_findmotifs.txt in the directory specified by --out_dir. To check for correct behavior, run the script using the example files and compare the output to EcoliC.fa_90_findmotifs.txt using md5sum or an equivalent checksum method. Because progress is monitored with txtProgressBar(), jobs submitted to a cluster will produce a jumbled outfile log.

Use -h or --help to print the help message.

Rscript findmotifs.R -h
Options:
    -f CHARACTER, --fasta=CHARACTER
            Required: input directory and filename of fasta sequence

    -t CHARACTER, --freq_table=CHARACTER
            Required: input directory and filename of comma-delimited position frequency table

    -c CHARACTER, --cutoff=CHARACTER
            Input cutoff as integer <= 100 (default: 90)

    -o CHARACTER, --out_dir=CHARACTER
            Output directory without trailing "/" (default: getwd())

    -h, --help
            Show this help message and exit

Details

This script is a wrapper for the PWM() and matchPWM() functions from the Biostrings package. (see RDocumentation)

PWM() is used to convert the input position-frequency matrix into a position-weight matrix using type = "log2probratio". The parameters of the Dirichlet conjugate prior are adjusted to account for the GC content of the input fasta. To save time, the number of matchPWM() calls are reduced to one per motif by first concatenating the input fasta into a single sequence with contigs demarcated by non-IUPAC characters.

There are three parameters that affect the runtime of the script.

The length of the input fasta
The number of motifs represented in the position-frequency table
The information content of each motif. Motifs with lower information content will have lower maximum scores, and therefore more matches above the cutoff threshold.

Future implementations should further increase speed by parallelizing matchPWM() calls.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EcoliC.fa

EcoliC.fa

EcoliC.fa_90_findmotifs.txt

EcoliC.fa_90_findmotifs.txt

PFMs.txt

PFMs.txt

README.md

README.md

findMotifs.R

findMotifs.R

Repository files navigation

findMotifs

Inputs

Outputs

Dependencies

Example

Details

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
EcoliC.fa		EcoliC.fa
EcoliC.fa_90_findmotifs.txt		EcoliC.fa_90_findmotifs.txt
PFMs.txt		PFMs.txt
README.md		README.md
findMotifs.R		findMotifs.R

acvill/findMotifs

Folders and files

Latest commit

History

Repository files navigation

findMotifs

Inputs

Outputs

Dependencies

Example

Details

About

Topics

Resources

Stars

Watchers

Forks

Languages