Promising spam filtering library making use of combined machine learning algorithms, written in C++

freiz/terminator

Introduction

Terminator is a C++ library for spam filtering, like the well-known SpamBayes and OSBF-Lua. It can be embedded into other spam filtering software or services as a machine learning module. Its advantages are:

  • Very high precision and recall, with the best results on the public spam filtering corpora tested.
  • It is fast and consumes only a few MB of memory.
  • No hyper-parameter tuning is needed.

Terminator can also be used for other binary text classification problems, especially those that need an adaptive model for online learning.

Terminator is not a complete end-to-end spam filtering solution. Instead, it focuses on the machine learning part and does not include blocklists/allowlists or DKIM checks. My paper, "An Adaptive Fusion Algorithm for Spam Detection" (http://csse.szu.edu.cn/staff/panwk/publications/Journal-IEEE-IS-14-AFSD.pdf), describes the implementation in detail.

(Update, Jan 2023: the work on this library dates back to around 2010. It consistently achieved SOTA results on most online learning email filtering corpora: TREC, CEAS, and a private dataset from NetEase. I have not followed this area for a long time, so I may have missed recent research. For the batch learning setting, I think the newest Transformer-based LLMs have great potential.)

Implementation

Terminator uses a fusion model that combines eight machine learning algorithms to boost spam filtering performance. The individual algorithms are described in the paper cited above.

We used a novel adaptive model fusion technique. The weight of every single model is learned during the online learning process.
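
As a rough illustration of the idea, the sketch below combines the scores of several base models using one weight per model and adjusts those weights after each labelled message. It is not the algorithm from the paper: the logistic combination, the gradient-step update, the learning rate, and all names (FusionSketch, Combine, Update) are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical illustration only: a weighted fusion of base-model scores
// whose weights are learned online with a logistic-loss gradient step.
// This is NOT Terminator's actual fusion rule.
class FusionSketch {
 public:
  explicit FusionSketch(std::size_t num_models, double learning_rate = 0.5)
      : weights_(num_models, 1.0 / static_cast<double>(num_models)),
        learning_rate_(learning_rate) {}

  // Combine per-model scores in [0, 1] into one spam probability.
  // scores must hold one entry per model.
  double Combine(const std::vector<double>& scores) const {
    double z = 0.0;
    for (std::size_t i = 0; i < weights_.size(); ++i) {
      // Center each score so that 0.5 (undecided) contributes nothing.
      z += weights_[i] * (scores[i] - 0.5);
    }
    return 1.0 / (1.0 + std::exp(-z));
  }

  // Once the true label is known, move each weight along the negative
  // gradient of the logistic loss.
  void Update(const std::vector<double>& scores, bool is_spam) {
    const double error = (is_spam ? 1.0 : 0.0) - Combine(scores);
    for (std::size_t i = 0; i < weights_.size(); ++i) {
      weights_[i] += learning_rate_ * error * (scores[i] - 0.5);
    }
  }

  const std::vector<double>& weights() const { return weights_; }

 private:
  std::vector<double> weights_;
  double learning_rate_;
};
```

In this sketch a base model that keeps agreeing with the labels gains weight over time, while one that keeps disagreeing loses weight and can even turn negative, so no per-model weight has to be set by hand.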

Installation & Usage

Step 1, Install Dependencies

The only dependency is kyotocabinet (http://fallabs.com/kyotocabinet/), which is used for persistence and must be installed first.

Step 2, Install Terminator and Compile

git clone https://github.com/freiz/terminator.git
cd terminator
make

You can change the compiler in the Makefile; the build output is a static library that you can link against.

Step 3, Write an Example

#include <string>
#include "terminator.h"

int main() {
  // The first parameter is the path of the database file.
  // The second parameter is the cache size in bytes, so 5 << 20 is about 5 MB of cache.
  Terminator* classifier = new Terminator("terminator.kch", 5 << 20);

  // Placeholder message; pass the raw email text here.
  std::string email_content = "Subject: hello\n\nBuy now!";

  // There are two public APIs, Train and Predict.

  // [Predict] takes the email content and returns a score ranging from 0 (100% ham) to 1 (100% spam).
  // Choose your own threshold to make the final decision.
  double score = classifier->Predict(email_content);

  // [Train] takes the email content and a flag:
  // true if the email is spam, false if it is ham.
  bool is_spam = true;
  classifier->Train(email_content, is_spam);

  delete classifier;
  return 0;
}
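
In an online setting, the two calls are typically made in sequence for each incoming message: score it first, then train once the true label is known. The sketch below shows that loop as a template over any class exposing Predict and Train as described above. The LabeledEmail struct, the RunOnline helper, and the default 0.5 threshold are hypothetical names and values chosen for illustration; they are not part of the library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for a message and its true label.
struct LabeledEmail {
  std::string content;
  bool is_spam;
};

// Scores each message before training on it and returns how many
// messages were misclassified at the given threshold.
// Classifier is assumed to expose Predict(content) -> double and
// Train(content, is_spam), as in the example above.
template <typename Classifier>
std::size_t RunOnline(Classifier& classifier,
                      const std::vector<LabeledEmail>& stream,
                      double threshold = 0.5) {
  std::size_t mistakes = 0;
  for (const LabeledEmail& email : stream) {
    const double score = classifier.Predict(email.content);
    const bool predicted_spam = score >= threshold;
    if (predicted_spam != email.is_spam) {
      ++mistakes;
    }
    // Feed back the true label so the model adapts to the stream.
    classifier.Train(email.content, email.is_spam);
  }
  return mistakes;
}
```

Assuming the Terminator methods accept a std::string as shown above, a call would look like RunOnline(*classifier, stream).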

Step 4, Play with Demo (Optional)

make run-demo

It will run a demo application to simulate spam filtering using the SpamAssassin corpus; you can also put another dataset (such as ceas08) under demo/corpus to check the experiment result.

Step 5, Compile and Link Your Code

Do not forget to link against the kyotocabinet library as well as the Terminator static library.

Experiment Result

Here I quote only a sample of the results, on the public corpus Trec05-p1:

Competitor      (1-ROCA)%, the smaller the better
bogofilter      0.048
spamprobe       0.059
spamassassin    0.059
terminator      0.0055

The paper "An Adaptive Fusion Algorithm for Spam Detection" contains a complete set of experiment results.
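
(1-ROCA)% is one minus the area under the ROC curve, expressed as a percentage, so 0 means every spam message was scored above every ham message. The sketch below computes it from scores and labels by direct pair counting, which is quadratic in the number of messages but easy to verify by hand. The function name and the tie handling (a tie counts as half a misordering) are choices made here for illustration and are not taken from the demo code.

```cpp
#include <cstddef>
#include <vector>

// Returns (1 - ROC area) as a percentage. scores[i] is the predicted spam
// score and is_spam[i] the true label of message i. Returns -1.0 if either
// class is absent, since the ROC area is undefined in that case.
double OneMinusRocaPercent(const std::vector<double>& scores,
                           const std::vector<bool>& is_spam) {
  double misordered = 0.0;  // spam/ham pairs where the ham scored higher
  double pairs = 0.0;       // total number of spam/ham pairs
  for (std::size_t i = 0; i < scores.size(); ++i) {
    if (!is_spam[i]) continue;
    for (std::size_t j = 0; j < scores.size(); ++j) {
      if (is_spam[j]) continue;
      pairs += 1.0;
      if (scores[j] > scores[i]) {
        misordered += 1.0;
      } else if (scores[j] == scores[i]) {
        misordered += 0.5;  // a tie counts as half a misordering
      }
    }
  }
  if (pairs == 0.0) return -1.0;
  return 100.0 * misordered / pairs;
}
```

For example, with spam scored 0.9 and 0.3 and ham scored 0.8 and 0.1, one of the four spam/ham pairs is misordered, giving 25%.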
