### An improved classifier: Naive Bayes 

Welcome to the new company. Your first task is to build a spam classifier to help with the volume of advertising emails the company has received lately. Remember that we are actually paying you for this (unlike your previous employer, who fired you), so put extra effort into it.

Before you start, take some time to look at this explanation of what a Naive Bayes classifier is:
https://www.geeksforgeeks.org/naive-bayes-classifiers/

In a few lines: a Naive Bayes classifier assigns a "thing" to the class under which its observed features are most probable:
- Thing A (made up of features A1, A2, A3) is classified as B rather than C iff P(A1|B) \* P(A2|B) \* P(A3|B) > P(A1|C) \* P(A2|C) \* P(A3|C), assuming B and C are equally likely a priori (otherwise each side is also multiplied by its prior, P(B) or P(C)).
- You may have noticed that the classifier makes strong assumptions (the features are treated as independent given the class), but we will not discuss that now and leave it to the other course \*cough\* totally not IS.
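The comparison above can be tried out with a few numbers. This is only a minimal sketch: the likelihood values are made up for illustration, and equal priors for B and C are assumed.

```python
from math import prod

# Hypothetical per-feature likelihoods, made up for illustration only.
likelihood_given_B = [0.5, 0.4, 0.2]  # P(A1|B), P(A2|B), P(A3|B)
likelihood_given_C = [0.1, 0.5, 0.3]  # P(A1|C), P(A2|C), P(A3|C)

# Multiply the per-feature likelihoods for each candidate class.
score_B = prod(likelihood_given_B)
score_C = prod(likelihood_given_C)

# With equal priors, the class with the larger product wins.
label = "B" if score_B > score_C else "C"
print(label)
```

Here the product for B (about 0.04) is larger than the product for C (about 0.015), so A is classified as B.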

Your classifier should take a txt file as input and return a boolean: True if the content of the file is spam, and False if it is ham.

The txt files have the following format:

"Train-Spam \n
 Content" if it is to be used for training the classifier (the header is "Train-Ham" for ham examples).
 
 and 
 
 "Classify \n
 Content" if it needs classification.
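A possible way to read the two header formats is sketched below. The helper name `split_header` and the sample sentences are assumptions made for this example; they are not part of the exercise.

```python
def split_header(raw):
    """Split raw file text into (action, label, content). Hypothetical helper."""
    # Everything before the first newline is the header, the rest is content.
    header, _, content = raw.partition("\n")
    # "Train-Spam" -> ("Train", "Spam"); "Classify" -> ("Classify", "").
    action, _, label = header.strip().partition("-")
    return action, (label or None), content.strip()

print(split_header("Train-Spam \n Buy cheap watches now"))
print(split_header("Classify \n Meeting moved to 3pm"))
```

Stripping the header matters because the format shows a space before the newline, which would otherwise end up in the label.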
 
 The classifier should have a memory that is in the form of a nested dictionary: 
 {"Spam": {"word": frequency}, "Ham": {"word": frequency}}
 
The training part of the classifier is relatively easy to implement: go through the words in the file and count them. If the file is labelled Spam, add the words and their occurrences to the Spam sub-dictionary; otherwise add them to the Ham sub-dictionary.
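As a rough sketch of the training step, the snippet below updates the nested dictionary directly. The free function `train` and the word lists are assumptions for illustration; a class-based version would do the same thing through `self.memory`.

```python
memory = {"Spam": {}, "Ham": {}}

def train(memory, label, words):
    """Add one to the count of every word under the given label."""
    counts = memory[label]
    for word in words:
        counts[word] = counts.get(word, 0) + 1

# Made-up training data, already split into words.
train(memory, "Spam", ["win", "money", "win"])
train(memory, "Ham", ["meeting", "money"])
print(memory)
```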
 
Classification should work as follows:
- For each word in the file:
-   Calculate its probability given spam, P(word|spam), and its probability given ham, P(word|ham).
- Take the product of these probabilities over all words, ΠP(word|spam) and ΠP(word|ham), and compare the two: the file is spam if the first is larger.
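The steps above could be sketched as follows. The sketch estimates P(word|class) as the word's count divided by the total number of words seen in that class, and it skips any word that is missing from either sub-dictionary; both choices, as well as the function names and the toy memory, are assumptions of this example.

```python
def word_likelihood(counts, word):
    """Estimate P(word | class) as count(word) / total words seen in the class."""
    total = sum(counts.values())
    return counts.get(word, 0) / total if total else 0.0

def is_spam(memory, words):
    prob_spam = prob_ham = 1.0
    for word in words:
        p_spam = word_likelihood(memory["Spam"], word)
        p_ham = word_likelihood(memory["Ham"], word)
        if p_spam == 0 or p_ham == 0:
            continue  # assumption of this sketch: skip words unseen in a class
        prob_spam *= p_spam
        prob_ham *= p_ham
    # Ties are treated as ham.
    return prob_spam > prob_ham

# Toy memory, made up for illustration.
memory = {
    "Spam": {"win": 3, "money": 2, "hello": 1},
    "Ham": {"meeting": 3, "money": 1, "hello": 2},
}
print(is_spam(memory, ["money", "money", "hello"]))
print(is_spam(memory, ["hello", "hello", "money"]))
```

In the first call "money" dominates and it is more frequent in the spam counts, so the result is True; in the second call "hello" dominates and the result is False.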

#### Sometimes the probability you get for a word will be 0 (the word does not exist in the memory), and multiplying by that value collapses the classifier (the whole product becomes 0). A simple way around this is to ignore the word, i.e. treat both P(word|ham) and P(word|spam) as 1. Although this solves the problem, it is equivalent to assigning probability 1 to the word. A better approach (but let's ignore it for now) is to use Laplace smoothing (https://towardsdatascience.com/introduction-to-na%C3%AFve-bayes-classifier-fa59e3e24aaf)
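For the curious, here is a minimal sketch of what Laplace (add-one) smoothing might look like. It assumes a non-empty memory and equal priors, takes the vocabulary to be every word seen in training, and sums logarithms instead of multiplying probabilities so that long files do not underflow to 0. The function names and the toy memory are made up for this example.

```python
import math

def smoothed_log_likelihood(counts, word, vocab_size, alpha=1):
    """Return log P(word | class) with add-alpha (Laplace) smoothing."""
    total = sum(counts.values())
    # Every word gets alpha extra occurrences, so no probability is ever 0.
    return math.log((counts.get(word, 0) + alpha) / (total + alpha * vocab_size))

def is_spam_smoothed(memory, words):
    """Compare summed log-likelihoods; equal priors are assumed."""
    vocab_size = len(set(memory["Spam"]) | set(memory["Ham"]))
    log_spam = sum(smoothed_log_likelihood(memory["Spam"], w, vocab_size) for w in words)
    log_ham = sum(smoothed_log_likelihood(memory["Ham"], w, vocab_size) for w in words)
    return log_spam > log_ham

# Toy memory, made up for illustration.
memory = {
    "Spam": {"win": 3, "money": 2, "hello": 1},
    "Ham": {"meeting": 3, "money": 1, "hello": 2},
}
print(is_spam_smoothed(memory, ["win", "money"]))
print(is_spam_smoothed(memory, ["meeting", "hello"]))
```

Unlike the "ignore the word" approach, a word such as "win" that was only ever seen in spam now counts as evidence for spam instead of being dropped.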

If you are looking at the answer, it might be helpful to read this description of a static method. It is simply a method defined inside a class that does not receive `self`, so you can call it on the class itself without creating an instance: https://www.digitalocean.com/community/tutorials/python-static-method
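A tiny illustration of a static method, using a made-up class `TextTools` that is not part of the answer:

```python
class TextTools:
    @staticmethod
    def normalise(word):
        """Lower-case a word and drop non-alphanumeric characters."""
        return "".join(ch for ch in word if ch.isalnum()).lower()

# Called on the class itself, no instance needed.
print(TextTools.normalise("Hello!"))
# Calling it through an instance works as well.
print(TextTools().normalise("FREE!!!"))
```

Both calls behave the same way because the method never touches instance state.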

In [15]:
class naiveBayes():
    def __init__(self, memory = None):
        if memory is None:
            self.memory = {"Spam": {}, "Ham": {}}
        else:
            self.memory = memory
        
        self.indicator = {True: "Spam", False: "Ham"}
            
    def train(self, flag, data):
        for word in data:
            occurance = self.memory[flag].get(word, 0)
            occurance += 1
            self.memory[flag][word] = occurance
        
        print(f"trained as {flag}")
    
    def classify(self, data):
        prob_spam = 1
        prob_ham = 1
        for word in data:
            occur_ham = self.memory["Ham"].get(word, None)
            occur_spam = self.memory["Spam"].get(word, None)
            
            # Ignore words missing from either class: both factors become 1.
            if (occur_ham is None) or (occur_spam is None):
                occur_ham, occur_spam = 1, 1
                freq = 1
            else:
                freq = occur_ham + occur_spam
                
            # Note: this is the word's share of occurrences in each class,
            # which stands in for P(word|spam) and P(word|ham) here.
            prob_spam *= (occur_spam/freq)
            prob_ham *= (occur_ham/freq)
        
        # Ties are classified as Ham.
        return prob_spam > prob_ham
    
    def run(self, raw_data):
        train_or_classify, flag, data = naiveBayes.parse_and_clean(raw_data)
        if train_or_classify == "Train":
            self.train(flag, data)
            return None
        else:
            result = self.classify(data)
            print(f"Classified as {self.indicator[result]}")
            return result
        

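    @staticmethod
    def parse_and_clean(raw):
        # Split on the first newline only, so the content may span several lines.
        head, body = raw.split("\n", 1)
        # Lower-case each word and keep only its alphanumeric characters.
        body = ["".join(letter for letter in word if letter.isalnum()).lower()
                for word in body.split()]
        # Drop words that consisted of punctuation only.
        body = [word for word in body if word]
        # Strip the header so "Train-Spam " yields the key "Spam".
        head = head.strip().split("-")
        if len(head) == 1:
            # "Classify" header: there is no label.
            to_do = head[0]
            return to_do, None, body
        else:
            # "Train-Spam" or "Train-Ham" header.
            to_do, flag = head
            return to_do, flag, body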
            

In [18]:
import glob
classifier = naiveBayes()

files = glob.glob('./*.txt')

for file in files:
    with open(file, 'r') as f:
        content = f.read()
        classifier.run(content)

trained as Ham
trained as Spam
trained as Ham
trained as Spam
Classified as Spam
Classified as Spam
