moved root readme inside python
cpragadeesh committed Aug 24, 2017
1 parent 87e1af2 commit a2bf75c
Showing 1 changed file with 77 additions and 0 deletions.
77 changes: 77 additions & 0 deletions src/rescore/python/README.md
@@ -0,0 +1,77 @@
# Requirements

Python Dependencies:
* Python version: 2.7+
* requests

# CORPUS TESTING

Run corpus-test.py to test a set of emails and generate a log file.

Example:

    ./corpus-test.py --ham test-ham --spam test-spam -o test.log

Use ./corpus-test.py -h to get more info about usage.

The log file contains one line per email. Each line lists the filename, actual email type, score, action, and symbols, in that order.


### TESTING SETTINGS

The file "test.conf" is used as the default config file for corpus testing. To use a custom config file, pass it with the -c option.

NOTE: Enclose the value for settings in triple quotes.

Example 'test.conf' that disables the symbol group "encryption":
```
{
"Settings" : '''{groups_disabled=["encryption"]}'''
}
```
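
The triple quotes let the settings value carry its own double quotes without escaping. The snippet below is a minimal sketch of that idea, assuming the file is read as a Python dictionary literal; the README does not say how corpus-test.py actually parses it, and `load_conf` is a hypothetical helper, not part of the scripts.

```python
import ast

# Text of the example test.conf above, held in a string for illustration.
CONF_TEXT = """{
"Settings" : '''{groups_disabled=["encryption"]}'''
}"""


def load_conf(text):
    """Parse config text as a Python literal (assumed parsing method)."""
    return ast.literal_eval(text)


conf = load_conf(CONF_TEXT)
settings = conf["Settings"]  # the raw settings string, inner quotes intact
print(settings)
```

With single or double quotes around the value, the inner `"encryption"` would need escaping; triple quotes avoid that.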


NOTE: You might occasionally encounter an exception at shutdown. This is a known Python 2.7 bug (http://bugs.python.org/issue14623) and it does not affect the results.


# STATISTICS

Use statistics.py to extract useful information from the log file generated in the previous step. Specify the spam threshold score with -t, and feed in the log file using input redirection.

### Example:

    ./statistics.py -t 10 < test.log > stats.log

Use ./statistics.py -h to get more info about usage.

The statistics output has two parts: file stats and symbol stats.

### File stats:

* **Number of emails**: Number of emails read from log
* **Number of spam**: Number of spam emails read from log
* **Number of ham**: Number of ham emails read from log
* **Spam percentage**: Percentage of spam emails read from log
* **Ham percentage**: Percentage of ham emails read from log
* **False positive rate**: Percentage of ham emails that were falsely classified as spam
* **False negative rate**: Percentage of spam emails that were falsely classified as ham
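
The file stats listed above can be sketched as follows. This is an illustration under stated assumptions, not the code in statistics.py: the records are hypothetical in-memory `(actual_type, score)` pairs rather than parsed log lines, and an email is assumed to be classified as spam when its score is greater than or equal to the threshold.

```python
def file_stats(records, threshold):
    """Compute file stats from (actual_type, score) pairs.

    actual_type is 'spam' or 'ham' (hypothetical labels for this sketch).
    """
    n_spam = n_ham = false_pos = false_neg = 0
    for actual, score in records:
        # Assumption: a score at or above the threshold counts as spam.
        predicted_spam = score >= threshold
        if actual == "spam":
            n_spam += 1
            if not predicted_spam:
                false_neg += 1  # spam that slipped through as ham
        else:
            n_ham += 1
            if predicted_spam:
                false_pos += 1  # ham wrongly flagged as spam
    total = n_spam + n_ham

    def pct(part, whole):
        # Guard against empty groups to avoid division by zero.
        return 100.0 * part / whole if whole else 0.0

    return {
        "emails": total,
        "spam": n_spam,
        "ham": n_ham,
        "spam_percentage": pct(n_spam, total),
        "ham_percentage": pct(n_ham, total),
        "false_positive_rate": pct(false_pos, n_ham),
        "false_negative_rate": pct(false_neg, n_spam),
    }


# Made-up sample data, for illustration only.
sample = [("spam", 15.0), ("spam", 4.0), ("ham", 1.0), ("ham", 12.0), ("ham", 0.5)]
stats = file_stats(sample, threshold=10)
print(stats)
```

Note that the false positive rate is taken over ham emails only and the false negative rate over spam emails only, matching the definitions above.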

### Symbol stats:

Each line presents statistics about a symbol read from the log.

* **Overall**: % of all emails hit by the symbol
* **Spam**: % of spam emails hit by the symbol
* **Ham**: % of ham emails hit by the symbol
* **S/O**: spam hits as a % of all of the symbol's hits (i.e., the probability that a message is spam given that the symbol fired on it)
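
The per-symbol figures above can be sketched like this. It is an illustration under assumptions, not the code in statistics.py: the records are hypothetical `(actual_type, symbols)` pairs, and the symbol names are made up.

```python
from collections import defaultdict


def symbol_stats(records):
    """Compute per-symbol percentages from (actual_type, symbols) pairs."""
    spam_hits = defaultdict(int)
    ham_hits = defaultdict(int)
    n_spam = n_ham = 0
    for actual, symbols in records:
        if actual == "spam":
            n_spam += 1
            for sym in symbols:
                spam_hits[sym] += 1
        else:
            n_ham += 1
            for sym in symbols:
                ham_hits[sym] += 1

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    result = {}
    for sym in set(spam_hits) | set(ham_hits):
        hits = spam_hits[sym] + ham_hits[sym]
        result[sym] = {
            "overall": pct(hits, n_spam + n_ham),
            "spam": pct(spam_hits[sym], n_spam),
            "ham": pct(ham_hits[sym], n_ham),
            # S/O: share of this symbol's hits that landed on spam.
            "s_o": pct(spam_hits[sym], hits),
        }
    return result


# Made-up sample data with hypothetical symbol names.
sample_records = [
    ("spam", ["SYM_A", "SYM_B"]),
    ("spam", ["SYM_A"]),
    ("ham", ["SYM_B"]),
    ("ham", []),
]
per_symbol = symbol_stats(sample_records)
for name in sorted(per_symbol):
    print(name, per_symbol[name])
```

A symbol with a high S/O fires almost only on spam, while one near 50% says little about whether a message is spam or ham.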


# Rescoring

Use rescore.py on the logs generated by corpus-test.py to find optimal symbol scores using a perceptron.

### Example:

    ./rescore.py -l logs/ -r 0.001 -e 500 -o scores.txt
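
To give a rough idea of how a perceptron can learn symbol scores, here is a minimal, generic sketch. It is not the implementation in rescore.py: the data layout, the function names, the `learning_rate` and `epochs` parameters, and the symbol names are all assumptions made for illustration, and the README does not state what the -r and -e flags mean.

```python
def train_perceptron(emails, learning_rate=0.1, epochs=100):
    """Learn one weight per symbol with the classic perceptron rule.

    emails: list of (symbols, is_spam) pairs (hypothetical layout).
    Returns (weights, bias).
    """
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for symbols, is_spam in emails:
            target = 1 if is_spam else -1
            activation = bias + sum(weights.get(s, 0.0) for s in symbols)
            predicted = 1 if activation > 0 else -1
            if predicted != target:
                # On a mistake, nudge every symbol that fired toward the target.
                mistakes += 1
                for s in symbols:
                    weights[s] = weights.get(s, 0.0) + learning_rate * target
                bias += learning_rate * target
        if mistakes == 0:
            break  # every email classified correctly; stop early
    return weights, bias


def classify(symbols, weights, bias):
    """Return True if the learned weights label the email as spam."""
    return bias + sum(weights.get(s, 0.0) for s in symbols) > 0


# Made-up, linearly separable training data with hypothetical symbol names.
training = [
    (["BAD_URL", "FORGED"], True),
    (["BAD_URL"], True),
    (["WHITELIST"], False),
    (["WHITELIST", "FORGED"], False),
]
weights, bias = train_perceptron(training)
print(weights, bias)
```

In this sketch, symbols that mostly fire on spam end up with positive weights and symbols that mostly fire on ham end up with negative ones. The perceptron rule only converges when the data is linearly separable, so the epoch cap bounds the run time on real corpora.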