moved root readme inside python

cpragadeesh · Aug 24, 2017 · a2bf75c
1 parent 87e1af2
commit a2bf75c
Showing 1 changed file with 77 additions and 0 deletions.
diff --git a/src/rescore/python/README.md b/src/rescore/python/README.md
@@ -0,0 +1,77 @@
+# Requirements
+
+Python dependencies:
+* Python 2.7+
+* requests
+
+# Corpus testing
+
+Run corpus-test.py to test emails and generate a log file.
+
+Example:
+
+	./corpus-test.py --ham test-ham --spam test-spam -o test.log
+
+Use ./corpus-test.py -h to get more info about usage.
+
+The log file has one line per email. Each line contains the filename, the actual email type, the score, the action, and the symbols, in that order.
+
+
+### Testing settings
+
+The file "test.conf" is the default config file for corpus testing. To use a custom config file, pass the -c option.
+
+NOTE: Enclose the value of "Settings" in triple quotes (''').
+
+Example 'test.conf' that disables the symbol group "encryption":
+```
+	{
+		"Settings" : '''{groups_disabled=["encryption"]}'''
+	}
+```
+
+
+NOTE: You might occasionally see an exception at shutdown. This is a known Python 2.7 bug and does not affect the results.
+(http://bugs.python.org/issue14623)
+
+
+# Statistics
+
+Use statistics.py to extract useful information from the log file generated in the previous step. Specify the spam threshold score with -t and feed in the log file using input redirection.
+
+### Example:
+
+	./statistics.py -t 10 < test.log > stats.log
+
+Use ./statistics.py -h to get more info about usage.
+
+The statistics contain two kinds of information: file stats and symbol stats.
+
+### File stats:
+
+**Number of emails**: Number of emails read from log  
+**Number of spam**: Number of spam emails read from log  
+**Number of ham**: Number of ham emails read from log  
+**Spam percentage**: Percentage of spam emails read from log  
+**Ham percentage**: Percentage of ham emails read from log  
+**False positive rate**: Percentage of ham emails that were falsely classified as spam  
+**False negative rate**: Percentage of spam emails that were falsely classified as ham  
+
+### Symbol stats:
+
+Each line presents statistics about a symbol read from the log.  
+
+**Overall**: % of emails hit by a symbol  
+**Spam**: % of spam emails hit by a symbol  
+**Ham**: % of ham emails hit by a symbol  
+**S/O**: % of the symbol's hits that are on spam emails  
+	   (i.e. the probability that an email is spam, given that the symbol fired)  
+
+
+# Rescoring
+
+Use rescore.py on the logs generated by corpus-test.py to find optimal symbol scores using a perceptron.
+
+### Example:
+
+	./rescore.py -l logs/ -r 0.001 -e 500 -o scores.txt
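
The README does not show how rescore.py trains its perceptron, so the following is only a minimal sketch of the general technique, not the script's actual code. It assumes that `-r` and `-e` correspond to a learning rate and a number of epochs (the README does not say so), and the symbol names, the `train_perceptron` and `classify` helpers, and the in-memory sample format are all hypothetical.

```python
def train_perceptron(samples, learning_rate=0.001, epochs=500):
    """Learn one weight per symbol with the classic perceptron update rule.

    samples: list of (is_spam, symbols) pairs, where symbols lists the
    names of the symbols that fired on that email (hypothetical format).
    """
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for is_spam, symbols in samples:
            activation = bias + sum(weights.get(s, 0.0) for s in symbols)
            predicted_spam = activation > 0
            if predicted_spam != is_spam:
                # Nudge every symbol that fired towards the correct label.
                step = learning_rate if is_spam else -learning_rate
                for s in symbols:
                    weights[s] = weights.get(s, 0.0) + step
                bias += step
                mistakes += 1
        if mistakes == 0:
            break  # every sample is classified correctly
    return weights, bias


def classify(weights, bias, symbols):
    """Return True if the learned weights mark this email as spam."""
    return bias + sum(weights.get(s, 0.0) for s in symbols) > 0


# Hypothetical corpus: SYMBOL_A fires only on spam, SYMBOL_C only on ham.
samples = [
    (True, ["SYMBOL_A", "SYMBOL_B"]),
    (True, ["SYMBOL_A"]),
    (False, ["SYMBOL_C"]),
    (False, ["SYMBOL_C", "SYMBOL_B"]),
]
weights, bias = train_perceptron(samples, learning_rate=0.001, epochs=500)
```

On a linearly separable corpus like this one the loop stops early, once an epoch passes with no misclassified email; on real logs it would typically run for all the requested epochs.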