<a href="https://colab.research.google.com/github/diwakaryalpi/Coursework/blob/main/Homework9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Filtering

Chapter 6 of *Programming Collective Intelligence*, based on code from:
* https://github.com/arthur-e/Programming-Collective-Intelligence/tree/master/chapter6
* https://go.oreilly.com/old-dominion-university/library/view/programming-collective-intelligence/9780596529321/

**Goal:** Classify email as spam or not spam.

**Implemented Example:** Classify a given document as "ontopic" or "offtopic". The book's own sample data uses "good" and "bad".

## General Functions

In [1]:
import sqlite3 as sqlite   # replaces import stmt from book
import re
import math

`getwords(doc)` - returns a dictionary whose keys are the unique words found in the given document

* breaks up the text into words by splitting on runs of non-word characters (`\W+`), so digits and underscores stay inside words
* keeps only words of 3 to 19 characters, converted to lowercase
* returns each word once, so repeated uses of a word within a document are not counted

Note that lowercasing reduces the number of features because the text becomes case insensitive. However, it also means ALL CAPS can never be detected as a potential feature for spam.


In [2]:
def getwords(doc):
  splitter=re.compile(r'\W+')  # raw string; different than book
  #print (doc)
  # Split the text on runs of non-word characters
  words=[s.lower() for s in splitter.split(doc) 
          if len(s)>2 and len(s)<20]
  
  # Return the unique set of words only
  uniq_words = dict([(w,1) for w in words])

  return uniq_words
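
A quick check of `getwords()` on two made-up strings (the inputs are illustrative assumptions, not taken from the training files). It repeats the function so the block runs on its own, and shows the case folding, the length filter, and the loss of repeat counts described above.

```python
import re

def getwords(doc):
    # Same logic as the cell above: split on non-word characters,
    # keep tokens of 3 to 19 characters, lowercase them, drop repeats
    splitter = re.compile(r'\W+')
    words = [s.lower() for s in splitter.split(doc)
             if len(s) > 2 and len(s) < 20]
    return dict([(w, 1) for w in words])

# Illustrative inputs (assumptions, not from the corpus)
print(sorted(getwords('Nobody owns the water.')))
print(sorted(getwords('FREE free Free offer, it is FREE')))
```

The second call returns only `free` and `offer`: the short words `it` and `is` are filtered out, and the four spellings of "free" collapse into one feature.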

## Basic Classifier

`class basic_classifier` - holds what the classifier has learned so far
* implemented on pgs. 119-127 with in-memory dictionaries, no SQL DB involved (the SQL version is `class classifier` below)

Instance variables:
* `fc` - stores counts for different features in the different classifications, for example `{'python': {'bad': 0, 'good': 6}, 'the': {'bad': 3, 'good': 3}}`
* `cc` - dictionary of how many times every classification has been used; needed for the later probability calculations
* `getfeatures()` - extracts the features from the items being classified, we use `getwords()`
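
A minimal sketch of how `fc` and `cc` grow during training, using plain dictionaries instead of the class. The standalone `train()` helper and the two training items are made up for illustration.

```python
# Plain-dict sketch of the bookkeeping behind fc and cc.
fc = {}   # feature -> {category: count}
cc = {}   # category -> number of items trained

def train(features, cat):
    # One increment per feature/category pair, one per item
    for f in features:
        fc.setdefault(f, {}).setdefault(cat, 0)
        fc[f][cat] += 1
    cc[cat] = cc.get(cat, 0) + 1

train(['python', 'the'], 'good')
train(['casino', 'the'], 'bad')

print(fc)
print(cc)
```

With this bookkeeping a feature only gets an entry for a category it has actually been seen in; a lookup for any other pair has to fall back to zero, which is what `fcount()` does.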

Helper functions - increment and access the counts (so that we can later store the training data in a file or db)
* `incf()` - increase the count of a feature/category pair
* `incc()` - increase the count of a category
* `fcount()` - num times a feature has appeared in a category
* `catcount()` - number of items in a category
* `totalcount()` - total number of items
* `categories()` - list of all categories

Other functions:
* `train()` - processes the training data, extracts words, and updates counts
* `fprob()` - returns Pr(w|c), probability that a word appears in a category, implements the Multiple Bernoulli method
* `weightedprob()` - returns the weighted probability of Pr(w|c), using assumed probabilities
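
The weighted probability is a weighted average of an assumed probability `ap` and the observed `Pr(w|c)`. The sketch below isolates that formula in a hypothetical standalone function, `weighted_prob()`; the defaults `weight=0.75` and `ap=0.5` are taken from the class definition that produces the outputs in this notebook, and the input values are illustrative.

```python
def weighted_prob(basicprob, totals, weight=0.75, ap=0.5):
    # Weighted average of the assumed probability `ap` (counted `weight`
    # times) and the observed probability (counted `totals` times)
    return ((weight * ap) + (totals * basicprob)) / (weight + totals)

# No evidence yet: the result is the assumed probability
print(weighted_prob(basicprob=0.0, totals=0))
# One sighting, always in this category
print(round(weighted_prob(basicprob=1.0, totals=1), 4))
# Many sightings: the result approaches the observed probability
print(round(weighted_prob(basicprob=0.6, totals=12), 4))
```

The more often a feature has been seen overall (`totals`), the less the assumed probability matters.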

In [3]:
class basic_classifier:

  def __init__(self,getfeatures,filename=None):
    # Counts of feature/category combinations
    self.fc={}
    # Counts of documents in each category
    self.cc={}
    self.getfeatures=getfeatures
    
  # Increase the count of a feature/category pair  
  def incf(self,f,cat):
    self.fc.setdefault(f, {})
    self.fc[f].setdefault(cat, 0)
    self.fc[f][cat]+=1
  
  # Increase the count of a category  
  def incc(self,cat):
    self.cc.setdefault(cat, 0)
    self.cc[cat]+=1  

  # The number of times a feature has appeared in a category
  def fcount(self,f,cat):
    if f in self.fc and cat in self.fc[f]:
      return float(self.fc[f][cat])
    return 0.0

  # The number of items in a category
  def catcount(self,cat):
    if cat in self.cc:
        return float(self.cc[cat])
    return 0

  # The total number of items
  def totalcount(self):
    return sum(self.cc.values())

  # The list of all categories
  def categories(self):
    return self.cc.keys()

  def train(self,item,cat):
    features=self.getfeatures(item)
    # Increment the count for every feature with this category
    for f in features:
      self.incf(f,cat)

    # Increment the count for this category
    self.incc(cat)

  def fprob(self,f,cat):
    if self.catcount(cat)==0: return 0

    # The total number of times this feature appeared in this 
    # category divided by the total number of items in this category
    return self.fcount(f,cat)/self.catcount(cat)

  def weightedprob(self,f,cat,prf,weight=0.75,ap=0.5):
    # Calculate current probability
    basicprob=prf(f,cat)

    # Count the number of times this feature has appeared in
    # all categories
    totals=sum([self.fcount(f,c) for c in self.categories()])

    # Calculate the weighted average
    bp=((weight*ap)+(totals*basicprob))/(weight+totals)
    return bp

## Training Examples

The first cell below holds the book's five-sentence sample data and was not executed. The second cell redefines `sampletrain()` to train on 20 on-topic and 20 off-topic text files; this is the version the examples use.

In [None]:
def sampletrain(cl):
  cl.train('Nobody owns the water.','good')
  cl.train('the quick rabbit jumps fences','good')
  cl.train('buy pharmaceuticals now','bad')
  cl.train('make quick money at the online casino','bad')
  cl.train('the quick brown fox jumps','good')

In [4]:
def sampletrain(cl):
  # Train on ontopic1.txt..ontopic20.txt, then offtopic1.txt..offtopic20.txt
  for cat in ('ontopic','offtopic'):
    for i in range(1,21):
      # The with-block closes each file once its text has been read
      with open("%s%d.txt" % (cat,i),"r",encoding="utf8") as f:
        cl.train(f.read().replace("\n", " "),cat)

### Example 1 - simple counts

First, instantiate the basic classifier with `getwords()` as the getfeatures function.

In [5]:
cl = basic_classifier(getwords)

Load the sample training data and print summary counts from the classifier.

In [6]:
sampletrain(cl)
print("")
print("Total items:", cl.totalcount())
print("Categories:", cl.categories())
for cat in cl.categories():
  print(cat, cl.catcount(cat))


Total items: 40
Categories: dict_keys(['ontopic', 'offtopic'])
ontopic 20.0
offtopic 20.0


In [7]:
cl.fcount('loft', 'ontopic')

2.0

In [8]:
cl.fcount('shoedazzle', 'ontopic')

1.0

In [9]:
cl.fcount('justfab', 'ontopic')

2.0

In [10]:
cl.fcount('amazon', 'ontopic')

2.0

In [11]:
cl.fcount('target', 'ontopic')

3.0

In [12]:
cl.fcount('kaggle', 'offtopic')

1.0

In [13]:
cl.fcount('google', 'offtopic')

3.0

In [14]:
cl.fcount('quora', 'offtopic')

1.0

In [15]:
cl.fcount('odu', 'offtopic')

2.0

In [16]:
cl.fcount('coursera', 'offtopic')

1.0

### Example 2 (pg. 122) - simple prob

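First, reset the classifier by re-instantiating it. Each cell below calls `sampletrain(cl)` again, which re-trains on the same files; `fprob()` is unaffected because feature counts and category counts grow by the same factor.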

In [17]:
cl = basic_classifier(getwords)

In [18]:
sampletrain(cl)
cl.fprob('shop', 'ontopic')

0.6

In [19]:
sampletrain(cl)
cl.fprob('kaggle', 'offtopic')

0.05

In [20]:
sampletrain(cl)
cl.fprob('shop', 'offtopic')

0.0

In [21]:
sampletrain(cl)
cl.fprob('buy', 'ontopic')

0.25

In [22]:
sampletrain(cl)
cl.fprob('buy', 'offtopic')

0.05

### Example 3 (pg. 122) - simple weightedprob

The first three cells query a freshly instantiated, untrained classifier, so `weightedprob()` returns the assumed probability of 0.5. The last cell trains on one document first.

In [23]:
cl = basic_classifier(getwords)
cl.weightedprob('shop', 'ontopic', cl.fprob)

0.5

In [24]:
cl = basic_classifier(getwords)
cl.weightedprob('kaggle', 'offtopic', cl.fprob)

0.5

In [25]:
cl = basic_classifier(getwords)
cl.weightedprob('shop', 'offtopic', cl.fprob)

0.5

In [26]:
cl.train("This money is bad.", "bad")
cl.weightedprob('money', 'bad', cl.fprob)

0.7857142857142857

### Example 4 - fprob vs. weightedprob

In [27]:
cl = basic_classifier(getwords)
sampletrain(cl)

In [28]:
cl.fprob('shop', 'ontopic')

0.6

In [29]:
cl.weightedprob('shop', 'ontopic', cl.fprob)

0.5941176470588235

### Example 5 (pg. 123) - adding more training data

In [30]:
cl = basic_classifier(getwords)
sampletrain(cl)

In [31]:
cl.weightedprob('shop', 'ontopic', cl.fprob)

0.5941176470588235

In [32]:
cl.weightedprob('kaggle', 'offtopic', cl.fprob)

0.24285714285714285

In [33]:
sampletrain(cl)
cl.weightedprob('shop', 'ontopic', cl.fprob)

0.5969696969696969

In [34]:
sampletrain(cl)
cl.weightedprob('kaggle', 'offtopic', cl.fprob)

0.14
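
Each extra `sampletrain(cl)` pass over the same files leaves `fprob()` unchanged but raises the total count for every feature, so the weighted estimate moves toward `fprob()`. The sketch below reproduces that arithmetic with hypothetical helpers. The counts are inferred from the outputs above (`shop` in 12 of 20 on-topic documents and in no off-topic ones, `kaggle` in 1 of 20 off-topic documents); that `kaggle` never appears on-topic is an assumption consistent with those outputs.

```python
def weighted_prob(basicprob, totals, weight=0.75, ap=0.5):
    return ((weight * ap) + (totals * basicprob)) / (weight + totals)

def after_passes(count_in_cat, count_elsewhere, docs_in_cat, passes):
    # Every pass over the same files multiplies all counts by `passes`,
    # so the ratio (fprob) stays fixed while totals keeps growing
    basicprob = (count_in_cat * passes) / (docs_in_cat * passes)
    totals = (count_in_cat + count_elsewhere) * passes
    return weighted_prob(basicprob, totals)

for passes in (1, 2, 3):
    shop = after_passes(12, 0, 20, passes)
    kaggle = after_passes(1, 0, 20, passes)
    print(passes, round(shop, 4), round(kaggle, 4))
```

Rare features such as `kaggle` start near the assumed 0.5 and drop quickly toward their observed 0.05, while frequent ones such as `shop` barely move.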

## Naive Bayes Classifier

*To use this with the basic classifier (and to change it back later), make the following changes:*
* `class naivebayes(classifier)` -> `class naivebayes(basic_classifier)`
* `classifier.__init__(self,getfeatures)` -> `basic_classifier.__init__(self,getfeatures)`

In [35]:
class naivebayes(basic_classifier):   # change for basic_classifier

  def __init__(self,getfeatures):   
    basic_classifier.__init__(self,getfeatures)  # change for basic_classifier
    self.thresholds={}
  
  def docprob(self,item,cat):
    features=self.getfeatures(item)   

    # Multiply the probabilities of all the features together
    p=1
    for f in features: p*=self.weightedprob(f,cat,self.fprob)
    return p

  def prob(self,item,cat):
    catprob=self.catcount(cat)/self.totalcount()
    docprob=self.docprob(item,cat)
    return docprob*catprob
  
  def setthreshold(self,cat,t):
    self.thresholds[cat]=t
    
  def getthreshold(self,cat):
    if cat not in self.thresholds: return 1.0
    return self.thresholds[cat]
  
  def classify(self,item,default=None):
    probs={}
    # Find the category with the highest probability
    best=None
    bestprob=0.0
    for cat in self.categories():
      probs[cat]=self.prob(item,cat)
      if probs[cat]>bestprob: 
        bestprob=probs[cat]
        best=cat

    # No category scored above zero (no training data, or underflow)
    if best is None: return default

    # Make sure the probability exceeds threshold*next best
    for cat in probs:
      if cat==best: continue
      if probs[cat]*self.getthreshold(best)>probs[best]: return default
    return best
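
`docprob()` multiplies one probability per feature, so for documents with many distinct words the product can underflow to `0.0` in floating point. A common workaround, which neither the book's code nor this notebook uses, is to sum logarithms instead. The sketch below compares the two on made-up probabilities; `log_docprob()` is a hypothetical helper.

```python
import math

def docprob(probs):
    # Direct product, as in naivebayes.docprob()
    p = 1.0
    for x in probs:
        p *= x
    return p

def log_docprob(probs):
    # Same quantity in log space: log(a*b) = log(a) + log(b).
    # Safe here because every probability is strictly positive.
    return sum(math.log(x) for x in probs)

# 2000 features, each with a made-up weighted probability
cat_a = [0.2] * 2000
cat_b = [0.1] * 2000

print(docprob(cat_a), docprob(cat_b))            # both products underflow
print(log_docprob(cat_a) > log_docprob(cat_b))   # the ordering survives
```

Log scores can be compared directly to pick the best category, but the threshold test in `classify()` would have to be rewritten as a difference of logs.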

## Bayesian Examples

### Example 1 (pg. 125) - prob

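Training dataset: the cells below use the 20 `ontopic*.txt` and 20 `offtopic*.txt` files loaded by `sampletrain()`. The book's version of this example uses the following five sentences instead: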
```
('Nobody owns the water.','good')
('the quick rabbit jumps fences','good')
('buy pharmaceuticals now','bad')
('make quick money at the online casino','bad')
('the quick brown fox jumps','good')
```

In [36]:
cl = naivebayes(getwords)
sampletrain(cl)
cl.prob('shop', 'ontopic')

0.29705882352941176

In [37]:
cl.prob('shop', 'offtopic')

0.014705882352941176
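
The two scores above can be reproduced by hand from earlier outputs, since `prob()` returns the document probability times the category prior. For a one-word document the document probability is just the weighted probability of that word, and each category holds 20 of the 40 training items. The values below are rounded.

$$
\begin{aligned}
\text{prob}(\text{shop}, \text{ontopic}) &= \Pr(\text{shop} \mid \text{ontopic}) \cdot \Pr(\text{ontopic}) \approx 0.5941 \times \tfrac{20}{40} \approx 0.2971 \\
\text{prob}(\text{shop}, \text{offtopic}) &= \Pr(\text{shop} \mid \text{offtopic}) \cdot \Pr(\text{offtopic}) \approx 0.0294 \times \tfrac{20}{40} \approx 0.0147
\end{aligned}
$$

These are unnormalized scores rather than true posteriors (they do not sum to 1), which is enough for comparing categories.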

### Example 2 (pg. 127) - using thresholds

In [38]:
cl = naivebayes(getwords)
sampletrain(cl)
cl.classify(open("ontopic21.txt","r",encoding="utf8").read().replace("\n", " "), default='unknown')

'ontopic'

In [39]:
cl = naivebayes(getwords)
sampletrain(cl)
cl.classify(open("ontopic22.txt","r",encoding="utf8").read().replace("\n", " "), default='unknown')

'ontopic'

In [40]:
cl = naivebayes(getwords)
sampletrain(cl)
cl.classify(open("ontopic23.txt","r",encoding="utf8").read().replace("\n", " "), default='unknown')

'ontopic'

In [41]:
cl = naivebayes(getwords)
sampletrain(cl)
cl.classify(open("ontopic24.txt","r",encoding="utf8").read().replace("\n", " "), default='unknown')

'ontopic'

In [42]:
cl = naivebayes(getwords)
sampletrain(cl)
cl.classify(open("ontopic25.txt","r",encoding="utf8").read().replace("\n", " "), default='unknown')

'ontopic'

In [43]:
cl = naivebayes(getwords)
sampletrain(cl)
cl.classify(open("offtopic21.txt","r",encoding="utf8").read().replace("\n", " "), default='unknown')

'offtopic'

In [44]:
cl = naivebayes(getwords)
sampletrain(cl)
cl.classify(open("offtopic22.txt","r",encoding="utf8").read().replace("\n", " "), default='unknown')

'offtopic'

In [45]:
cl = naivebayes(getwords)
sampletrain(cl)
cl.classify(open("offtopic23.txt","r",encoding="utf8").read().replace("\n", " "), default='unknown')

'offtopic'

In [46]:
cl = naivebayes(getwords)
sampletrain(cl)
cl.classify(open("offtopic24.txt","r",encoding="utf8").read().replace("\n", " "), default='unknown')

'offtopic'

In [48]:
cl = naivebayes(getwords)
sampletrain(cl)
cl.classify(open("offtopic25.txt","r",encoding="utf8").read().replace("\n", " "), default='unknown')

'offtopic'
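
None of the cells above calls `setthreshold()`, so every category uses the default threshold of 1.0 and the highest score always wins. The sketch below isolates the threshold rule from `classify()` in a hypothetical standalone function, `pick_category()`, with made-up scores, to show when `default` is returned.

```python
def pick_category(probs, thresholds, default=None):
    # probs: category -> score from prob(); thresholds: category -> factor
    best = max(probs, key=probs.get)
    for cat, p in probs.items():
        if cat == best:
            continue
        # The winner must be at least threshold times every other score
        if p * thresholds.get(best, 1.0) > probs[best]:
            return default
    return best

# Made-up scores for illustration
probs = {'ontopic': 0.02, 'offtopic': 0.01}
print(pick_category(probs, {}, default='unknown'))
print(pick_category(probs, {'ontopic': 3.0}, default='unknown'))
```

With no thresholds the first call picks `ontopic`. With a threshold of 3.0 the same scores yield `unknown`, because `ontopic` is only twice as likely as `offtopic`.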

## Classifier w/SQL

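`class classifier` stores the feature and category counts in a SQLite database (through `sqlite3`) instead of in-memory dictionaries, so training persists between sessions.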

In [None]:
class classifier:
  def __init__(self,getfeatures,filename=None):
    # Counts of feature/category combinations
    self.fc={}
    # Counts of documents in each category
    self.cc={}
    self.getfeatures=getfeatures
    
  def setdb(self,dbfile):
    self.con=sqlite.connect(dbfile)    
    self.con.execute('create table if not exists fc(feature,category,count)')
    self.con.execute('create table if not exists cc(category,count)')

  def incf(self,f,cat):
    count=self.fcount(f,cat)
    if count==0:
      # ? placeholders let sqlite3 quote the values safely
      self.con.execute("insert into fc values (?,?,1)",(f,cat))
    else:
      self.con.execute(
        "update fc set count=? where feature=? and category=?",
        (int(count)+1,f,cat)) 
  
  def fcount(self,f,cat):
    res=self.con.execute(
      'select count from fc where feature=? and category=?',
      (f,cat)).fetchone()
    if res is None: return 0
    else: return float(res[0])

  def incc(self,cat):
    count=self.catcount(cat)
    if count==0:
      self.con.execute("insert into cc values (?,1)",(cat,))
    else:
      self.con.execute("update cc set count=? where category=?",
                       (int(count)+1,cat))    

  def catcount(self,cat):
    res=self.con.execute('select count from cc where category=?',
                         (cat,)).fetchone()
    if res is None: return 0
    else: return float(res[0])

  def categories(self):
    cur=self.con.execute('select category from cc')
    return [d[0] for d in cur]

  def totalcount(self):
    res=self.con.execute('select sum(count) from cc').fetchone()
    # sum() over an empty table yields a single NULL row
    if res is None or res[0] is None: return 0
    return res[0]

  def train(self,item,cat):
    features=self.getfeatures(item)
    # Increment the count for every feature with this category
    for f in features:
      self.incf(f,cat)

    # Increment the count for this category
    self.incc(cat)
    self.con.commit()

  def fprob(self,f,cat):
    if self.catcount(cat)==0: return 0

    # The total number of times this feature appeared in this 
    # category divided by the total number of items in this category
    return self.fcount(f,cat)/self.catcount(cat)

  def weightedprob(self,f,cat,prf,weight=1.0,ap=0.5):
    # Calculate current probability
    basicprob=prf(f,cat)

    # Count the number of times this feature has appeared in
    # all categories
    totals=sum([self.fcount(f,c) for c in self.categories()])

    # Calculate the weighted average
    bp=((weight*ap)+(totals*basicprob))/(weight+totals)
    return bp
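
A minimal sketch of the `fc` table round trip using `?` placeholders, assuming an in-memory database so nothing is written to disk. The standalone `fcount()` and `incf()` functions mirror the methods above but are illustrative, and the category name is made up to include a quote character.

```python
import sqlite3

con = sqlite3.connect(':memory:')   # in-memory database, discarded on close
con.execute('create table if not exists fc(feature,category,count)')

def fcount(f, cat):
    row = con.execute(
        'select count from fc where feature=? and category=?',
        (f, cat)).fetchone()
    return 0.0 if row is None else float(row[0])

def incf(f, cat):
    count = fcount(f, cat)
    if count == 0:
        con.execute('insert into fc values (?,?,1)', (f, cat))
    else:
        con.execute(
            'update fc set count=? where feature=? and category=?',
            (int(count) + 1, f, cat))

# A value containing a quote would break SQL built by string formatting
incf('cheap', "don't know")
incf('cheap', "don't know")
result = fcount('cheap', "don't know")
print(result)
con.close()
```

Passing the values separately from the SQL text lets the driver handle quoting, which matters as soon as a feature or category name can contain a quote.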


## Examples - Full Bayesian Classifier w/SQL


In [None]:
def spamTrain(cl):
  cl.train('the the', 'not spam')
  cl.train('cheap cheap cheap banking the', 'spam')
  cl.train('the', 'not spam')
  cl.train('cheap cheap banking banking banking the the', 'spam')
  cl.train('cheap cheap cheap cheap cheap buy buy the', 'spam')
  cl.train('banking the', 'not spam')
  cl.train('buy banking the', 'not spam')
  cl.train('the', 'not spam')
  cl.train('the', 'not spam')
  cl.train('cheap buy dinner the the', 'not spam')

*Don't forget to adjust `class naivebayes` to use `classifier`*

In [None]:
cl = naivebayes(getwords)
cl.setdb('test1.db')
spamTrain(cl)
cl.setthreshold('spam', 3.0)
cl.classify('the banking dinner', default='unknown')

In [None]:
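# Note: 'bad' is a category only under the book's five-sentence sampletrain();
# the file-based sampletrain() defined above uses 'ontopic'/'offtopic'
cl2 = naivebayes(getwords)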
cl2.setdb('test2.db')
sampletrain(cl2)
cl2.setthreshold('bad', 3.0)
cl2.classify('quick money', default='unknown')

In [None]:
cl = naivebayes(getwords)
cl.setdb('test1.db')
cl.classify('cheap money', default='unknown')

In [None]:
cl2.classify('online casino now', default='unknown')
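
The cell that reconnects to `test1.db` and classifies without retraining relies on counts committed in an earlier session. The sketch below shows that behavior with `sqlite3` alone, assuming a temporary directory and a hypothetical file name, `counts.db`.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    dbfile = os.path.join(tmp, 'counts.db')   # hypothetical file name

    # First session: create the table, store a count, commit, close
    con = sqlite3.connect(dbfile)
    con.execute('create table if not exists cc(category,count)')
    con.execute('insert into cc values (?,?)', ('spam', 3))
    con.commit()
    con.close()

    # Second session: a fresh connection sees the committed rows
    con = sqlite3.connect(dbfile)
    stored = con.execute('select category, count from cc').fetchall()
    con.close()

print(stored)
```

The same persistence means that rerunning a training cell against an existing database file adds to the stored counts instead of starting over.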