By Matthew Croop and Albert Sun. Filters spam, but not as well as other filters. Mostly intended as an experiment in the use of Markov chain-like objects for natural language analysis.


++Quick Start++

The spam and ham corpus we used for training the included ham.brain and spam.brain files is a subset of the 2005 TREC Public Spam Corpus (

The following directions show how to train the filter on spam and ham, and then use it to check a large number of messages or one message at a time. In these examples, the index file is expected to follow the syntax of the one included in the corpus we used: a list of references, one per line, to text files that each contain a single email. Each line marks the message as either spam or ham, followed by a space and the relative path to the file.

For example,
ham ../013/032
spam ../014/001
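
As a rough illustration of the index syntax described above, the sketch below splits each line into a label and a path. It is not the parser shipped with this filter; the function name parse_index and the index_dir parameter are hypothetical, and resolving paths relative to the directory holding the index file is an assumption.

```python
import posixpath


def parse_index(lines, index_dir="."):
    """Yield (label, path) pairs from index lines such as 'ham ../013/032'.

    Hypothetical helper: paths are assumed to be relative to index_dir,
    the directory that holds the index file.
    """
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # The label and the relative path are separated by a single space.
        label, rel_path = line.split(" ", 1)
        if label not in ("ham", "spam"):
            raise ValueError("unexpected label: %r" % label)
        yield label, posixpath.normpath(posixpath.join(index_dir, rel_path))
```

Calling list(parse_index(open("/path/to/index"), "/path/to")) would then give one (label, path) pair per message.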

 Load ham into filter
 python -i /path/to/index -t ham -n -l 100
 Load spam into filter
 python -i /path/to/index -t spam -s -l 100
 Test one email. The following commands are equivalent.
 python -f /path/to/email
 cat /path/to/email | python
 Test a large amount of ham
 python -i /path/to/index -t ham -l 1000
 Test spam
 python -i /path/to/index -t spam -l 1000

Upon initial training of the filter, spam.brain and ham.brain files are created in the directory. Both are required for the program to run. If these files are deleted, you will need to generate new ones by loading new ham and spam into the filter. For best results, the filter should be trained on roughly similar quantities of ham and spam.

++Setting up a .forward file++

To use this spam filter from your .forward file on a Unix system, add this line to the file:
 |"/path/to/ -o outputfile"

Then make the file executable by running
 chmod a+x

The output file should point wherever you want your mail to go. Mail flagged as spam and mail flagged as ham are both written to this file, each with a new header, MarkovBrainSpamStatus, set to "Spam", "Not Spam", or "Unknown".
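
For readers who want to see what such a header looks like on a message, here is a minimal sketch using Python's standard email package. It is not the code this filter uses; the function name tag_message is hypothetical, and only the header name and its three values come from the description above.

```python
import email


def tag_message(raw_message, status):
    """Return raw_message with a MarkovBrainSpamStatus header set to status.

    Hypothetical helper for illustration only.
    """
    if status not in ("Spam", "Not Spam", "Unknown"):
        raise ValueError("unexpected status: %r" % status)
    msg = email.message_from_string(raw_message)
    # Remove any copy of the header already present, then add ours.
    del msg["MarkovBrainSpamStatus"]
    msg["MarkovBrainSpamStatus"] = status
    return msg.as_string()
```

A mail client or a later filtering rule could then sort messages by matching on that header.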

WARNING: This has only been tested on the ENIAC system at the University of Pennsylvania's School of Engineering and Applied Sciences.

ADDITIONAL WARNING: This spam filter does not actually work very well compared to most available spam filters. If you rely solely on it, you will likely get spam, and suffer from false positives!

++Code Overview++

There are two primary files: one executes the program, and the other constructs the model. The language model which the Brain class constructs is a dictionary mapping each tuple of n+1 words to the frequency at which it occurs. The additional Python modules which are included are libraries of helper functions.
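
To make the shape of that dictionary concrete, here is a small sketch that counts tuples of n+1 words in a piece of text. It is not the Brain class itself; the function name count_word_tuples is hypothetical, and it assumes that words are split on whitespace and that each tuple is made of consecutive words.

```python
from collections import Counter


def count_word_tuples(text, n=2):
    """Map each tuple of n+1 consecutive words to how often it occurs.

    Hypothetical helper: whitespace tokenization and consecutive-word
    tuples are assumptions, not details taken from the filter's code.
    """
    words = text.split()
    size = n + 1
    # Slide a window of n+1 words across the text and count each window.
    return Counter(
        tuple(words[i:i + size]) for i in range(len(words) - size + 1)
    )
```

With n=1, the text "buy now buy now buy" yields the tuples ("buy", "now") and ("now", "buy"), each counted twice. A spam model and a ham model built this way could then be compared on the tuples found in a new message.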

You don't need any additional Python packages to run the filter.

++Directory Structure++



    Semi-Markov Spam Filter of Doom
    Filters spam, but not as well as other filters. Mostly intended as an experiment in the use of Markov chain-like objects for natural language analysis.
    Copyright (c) 2009 Matthew Croop
    Copyright (c) 2009 Albert Sun

    This program is free software: you can redistribute it and/or
    modify it under the terms of the GNU General Public License version 3 as
    published by the Free Software Foundation.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
    General Public License for more details.

    A copy of the GNU General Public License version 3 is provided along
    with this program in the file named LICENCE. If not, see