Commit

Update README.md
cpragadeesh committed Aug 26, 2017
1 parent 6f1c73d commit c004fac
Showing 1 changed file with 2 additions and 76 deletions.
78 changes: 2 additions & 76 deletions src/rescore/README.md
@@ -1,77 +1,3 @@
# Requirements
# Corpus testing and Symbol score generation module

Python Dependencies:
* Python version: 2.7+
* requests

# CORPUS TESTING

Run corpus-test.py to test emails and generate a log.

Example:

./corpus-test.py --ham test-ham --spam test-spam -o test.log

Use ./corpus-test.py -h to get more info about usage.

The log file contains one line per email, listing the filename, actual email type, score, action, and symbols, in that order.


### TESTING SETTINGS

The "test.conf" file is used as the default config file for corpus testing. To use a custom config file, pass the -c option.

NOTE: Enclose the value for settings in triple quotes.

Example 'test.conf' for disabled symbol group "encryption":
```
{
"Settings" : '''{groups_disabled=["encryption"]}'''
}
```


NOTE: You might occasionally encounter an exception at shutdown. This is a known Python 2.7 error and does not affect the results.
(http://bugs.python.org/issue14623)


# STATISTICS

Use statistics.py to extract useful information from the log file generated in the previous step. To generate statistics, specify the spam threshold score with -t and feed in the log file through input redirection.

### Example:

./statistics.py -t 10 < test.log > stats.log

Use ./statistics.py -h to get more info about usage.

The statistics contain two kinds of information: file stats and symbol stats.

### File stats:

* **Number of emails**: Number of emails read from log
* **Number of spam**: Number of spam emails read from log
* **Number of ham**: Number of ham emails read from log
* **Spam percentage**: Percentage of spam emails read from log
* **Ham percentage**: Percentage of ham emails read from log
* **False positive rate**: Percentage of ham emails that were falsely classified as spam
* **False negative rate**: Percentage of spam emails that were falsely classified as ham
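
The file stats listed above can be sketched in a few lines of Python. This is only an illustration, not the code in statistics.py: the function name `file_stats`, the `(actual_type, score)` record shape, and the rule that an email counts as spam when its score is at or above the threshold are all assumptions.

```python
def file_stats(records, threshold):
    """Compute file stats from (actual_type, score) pairs.

    actual_type is "spam" or "ham" (hypothetical record shape).
    """
    n_spam = n_ham = false_pos = false_neg = 0
    for actual, score in records:
        # Assumed decision rule: spam when score >= threshold.
        predicted_spam = score >= threshold
        if actual == "spam":
            n_spam += 1
            if not predicted_spam:
                false_neg += 1  # spam classified as ham
        else:
            n_ham += 1
            if predicted_spam:
                false_pos += 1  # ham classified as spam

    def pct(part, whole):
        # Percentage that avoids division by zero on empty input.
        return 100.0 * part / whole if whole else 0.0

    total = n_spam + n_ham
    return {
        "emails": total,
        "spam": n_spam,
        "ham": n_ham,
        "spam_pct": pct(n_spam, total),
        "ham_pct": pct(n_ham, total),
        "false_positive_rate": pct(false_pos, n_ham),
        "false_negative_rate": pct(false_neg, n_spam),
    }
```

Note that the false positive rate is taken over ham emails only and the false negative rate over spam emails only, matching the definitions above.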

### Symbol stats:

Each line presents statistics about a symbol read from the log.

* **Overall**: % of emails hit by a symbol
* **Spam**: % of spam emails hit by a symbol
* **Ham**: % of ham emails hit by a symbol
* **S/O**: % of a symbol's hits that were on spam emails (i.e., the probability that a message is spam given that the symbol fired)
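
As a rough illustration of the symbol stats, the sketch below computes the four values per symbol. It is not taken from statistics.py: the function name `symbol_stats` and the `(actual_type, symbols)` record shape are assumptions, and S/O is reported as a percentage here.

```python
from collections import Counter


def symbol_stats(records):
    """Compute per-symbol stats from (actual_type, symbols) pairs.

    actual_type is "spam" or "ham"; symbols is an iterable of symbol
    names that fired on that email (hypothetical record shape).
    """
    n_spam = n_ham = 0
    spam_hits = Counter()
    ham_hits = Counter()
    for actual, symbols in records:
        unique = set(symbols)  # count each symbol once per email
        if actual == "spam":
            n_spam += 1
            spam_hits.update(unique)
        else:
            n_ham += 1
            ham_hits.update(unique)

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    total = n_spam + n_ham
    stats = {}
    for sym in set(spam_hits) | set(ham_hits):
        s, h = spam_hits[sym], ham_hits[sym]
        stats[sym] = {
            "overall": pct(s + h, total),  # share of all emails hit
            "spam": pct(s, n_spam),        # share of spam emails hit
            "ham": pct(h, n_ham),          # share of ham emails hit
            "s/o": pct(s, s + h),          # share of hits that were spam
        }
    return stats
```

A symbol with a high S/O and a meaningful overall hit rate is a strong spam indicator, while an S/O near 50% says little about the message either way.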


# Rescoring

Use rescore.py on logs generated by corpus-test.py to find optimal symbol scores using a perceptron.

### Example:

./rescore.py -l logs/ -r 0.001 -e 500 -o scores.txt
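
To show the general idea of perceptron rescoring, here is a minimal sketch in plain Python. It is not the implementation in rescore.py. The names `train_perceptron` and `predict`, the `(symbols, is_spam)` sample shape, and the `learning_rate` and `epochs` parameters are assumptions made for illustration; the README does not say how the -r and -e options map to them.

```python
def train_perceptron(samples, learning_rate=0.001, epochs=500):
    """Learn one weight (score) per symbol with the classic perceptron rule.

    samples: list of (symbols, is_spam) pairs, where symbols is a set of
    symbol names that fired on the email (hypothetical input shape).
    """
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for symbols, is_spam in samples:
            target = 1 if is_spam else -1
            activation = bias + sum(weights.get(s, 0.0) for s in symbols)
            # Update only on misclassified (or undecided) samples.
            if target * activation <= 0:
                for s in symbols:
                    weights[s] = weights.get(s, 0.0) + learning_rate * target
                bias += learning_rate * target
                mistakes += 1
        if mistakes == 0:
            break  # every sample is classified correctly
    return weights, bias


def predict(weights, bias, symbols):
    """Return True when the summed symbol weights indicate spam."""
    return bias + sum(weights.get(s, 0.0) for s in symbols) > 0


# Tiny made-up corpus with made-up symbol names.
samples = [
    ({"SPAMMY"}, True),
    ({"SPAMMY", "X"}, True),
    ({"HAMMY"}, False),
    ({"HAMMY", "X"}, False),
]
weights, bias = train_perceptron(samples)
```

In this sketch the learned weights play the role of symbol scores: symbols that mostly fire on spam end up positive, and symbols that mostly fire on ham end up negative.
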
Use these scripts to generate the best symbol scores from your spam/ham corpus. The Lua version is written using Lua + Torch, and the Python version is written using Python + scikit-learn. You can also use these scripts to generate statistics for the corpus and its symbols. See the README inside the python and lua folders for usage instructions.
