forked from rspamd/rspamd
Commit c004fac, 1 parent: 6f1c73d
Showing 1 changed file with 2 additions and 76 deletions.
@@ -1,77 +1,3 @@
# Requirements
# Corpus testing and Symbol score generation module

Python Dependencies:
* Python version: 2.7+
* requests

# CORPUS TESTING

Run corpus-test.py to test emails and generate a log.

Example:

    ./corpus-test.py --ham test-ham --spam test-spam -o test.log

Use ./corpus-test.py -h to get more info about usage.

The log file consists of one line per email. Each line contains the filename, actual email type, score, action, and symbols, in that order.

### TESTING SETTINGS

The "test.conf" file is used as the default config file for corpus testing. To use a custom config file, pass the -c option.

NOTE: Enclose the value for settings in triple quotes.

Example 'test.conf' that disables the symbol group "encryption":
```
{
    "Settings" : '''{groups_disabled=["encryption"]}'''
}
```

NOTE: You might occasionally encounter an exception at shutdown. This is a known Python 2.7 issue (http://bugs.python.org/issue14623); it does not affect the results.

# STATISTICS

Use statistics.py to infer useful information from the log file generated in the previous step. Specify the spam threshold score with -t and feed in the log file using input redirection.

### Example:

    ./statistics.py -t 10 < test.log > stats.log

Use ./statistics.py -h to get more info about usage.

Statistics contain two kinds of information: file stats and symbol stats.

### File stats:

* **Number of emails**: Number of emails read from the log
* **Number of spam**: Number of spam emails read from the log
* **Number of ham**: Number of ham emails read from the log
* **Spam percentage**: Percentage of emails read from the log that are spam
* **Ham percentage**: Percentage of emails read from the log that are ham
* **False positive rate**: Percentage of ham emails that were falsely classified as spam
* **False negative rate**: Percentage of spam emails that were falsely classified as ham
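
The file stats above can be sketched in a few lines of Python. This is a minimal illustration, not the code in statistics.py: the function name `file_stats`, the `(is_spam, score)` input shape, and the rule that a score at or above the threshold counts as spam are all assumptions made for the example.

```python
def file_stats(emails, threshold):
    """Compute file-level stats.

    emails: list of (is_spam, score) pairs (hypothetical input shape).
    ASSUMPTION: an email is classified as spam when score >= threshold.
    """
    total = len(emails)
    spam_scores = [score for is_spam, score in emails if is_spam]
    ham_scores = [score for is_spam, score in emails if not is_spam]
    # Ham scored at or above the threshold is a false positive;
    # spam scored below the threshold is a false negative.
    false_pos = sum(1 for s in ham_scores if s >= threshold)
    false_neg = sum(1 for s in spam_scores if s < threshold)

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "emails": total,
        "spam": len(spam_scores),
        "ham": len(ham_scores),
        "spam_pct": pct(len(spam_scores), total),
        "ham_pct": pct(len(ham_scores), total),
        "fp_rate": pct(false_pos, len(ham_scores)),
        "fn_rate": pct(false_neg, len(spam_scores)),
    }


sample = [(True, 15.0), (True, 4.0), (False, 1.0), (False, 12.0), (False, 0.5)]
stats = file_stats(sample, threshold=10)
```

Note that the false positive rate is taken over ham emails only and the false negative rate over spam emails only, matching the definitions in the list.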

### Symbol stats:

Each line presents statistics about one symbol read from the log.

* **Overall**: Percentage of all emails hit by the symbol
* **Spam**: Percentage of spam emails hit by the symbol
* **Ham**: Percentage of ham emails hit by the symbol
* **S/O**: Percentage of the symbol's hits that are on spam emails (i.e., the probability that the message is spam when the symbol fires)
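
As a rough sketch of how these four columns relate, the following Python computes them from an in-memory list. The name `symbol_stats` and the `(is_spam, symbols)` input shape are hypothetical and chosen for the example; statistics.py may compute these differently.

```python
from collections import Counter


def symbol_stats(emails):
    """Compute per-symbol percentages.

    emails: list of (is_spam, symbols) pairs (hypothetical input shape),
    where symbols is an iterable of symbol names that fired on the email.
    """
    total = len(emails)
    n_spam = sum(1 for is_spam, _ in emails if is_spam)
    n_ham = total - n_spam
    spam_hits, ham_hits = Counter(), Counter()
    for is_spam, symbols in emails:
        target = spam_hits if is_spam else ham_hits
        target.update(set(symbols))  # count each symbol once per email

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    result = {}
    for sym in set(spam_hits) | set(ham_hits):
        hits = spam_hits[sym] + ham_hits[sym]
        result[sym] = {
            "overall": pct(hits, total),
            "spam": pct(spam_hits[sym], n_spam),
            "ham": pct(ham_hits[sym], n_ham),
            "s_o": pct(spam_hits[sym], hits),  # share of hits that are spam
        }
    return result


sample = [
    (True, ["SYM_A", "SYM_B"]),
    (True, ["SYM_A"]),
    (False, ["SYM_B"]),
    (False, []),
]
table = symbol_stats(sample)
```

In this sample, `SYM_A` fires only on spam, so its S/O is 100, while `SYM_B` fires equally on spam and ham, so its S/O is 50.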

# Rescoring

Use rescore.py on logs generated by corpus-test.py to find optimal symbol scores using a perceptron.

### Example:

    ./rescore.py -l logs/ -r 0.001 -e 500 -o scores.txt
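
To illustrate the idea of learning symbol scores with a perceptron, here is a minimal pure-Python sketch. It is not the implementation in rescore.py: the function name `train_perceptron`, the `(is_spam, symbols)` input shape, the absence of a bias term, and the rule "predict spam when the summed scores exceed the threshold" are assumptions made for the example. The defaults mirror the values in the command above on the assumption that `-r` is a learning rate and `-e` a number of epochs.

```python
def train_perceptron(emails, learning_rate=0.001, epochs=500, threshold=0.0):
    """Learn one score per symbol with the classic perceptron update.

    emails: list of (is_spam, symbols) pairs (hypothetical input shape).
    ASSUMPTION: an email is predicted spam when the sum of its symbol
    scores exceeds `threshold`.
    """
    weights = {}
    for _ in range(epochs):
        errors = 0
        for is_spam, symbols in emails:
            symbols = set(symbols)
            total = sum(weights.get(s, 0.0) for s in symbols)
            predicted_spam = total > threshold
            if predicted_spam != is_spam:
                errors += 1
                # Raise scores on missed spam, lower them on misjudged ham.
                step = learning_rate if is_spam else -learning_rate
                for s in symbols:
                    weights[s] = weights.get(s, 0.0) + step
        if errors == 0:
            break  # every email is classified correctly
    return weights


sample = [
    (True, ["SPAMMY", "COMMON"]),
    (True, ["SPAMMY"]),
    (False, ["COMMON"]),
    (False, ["HAMMY", "COMMON"]),
]
scores = train_perceptron(sample, learning_rate=0.001, epochs=500)
```

Symbols that fire mostly on spam end up with positive scores, while symbols shared between spam and ham are pushed back toward zero or below.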
Use these scripts to generate the best symbol scores from your spam/ham corpus. The Lua version is written using Lua + torch; the Python version is written using Python + Scikit. You can also use these scripts to generate statistics about the corpus and symbols. See the readme inside the python/lua folders for usage instructions.