### Machine Learning
In this notebook I implement the machine learning algorithms described in the course videos: Naive Bayes, Naive Bayes with Laplace smoothing, one-dimensional linear regression, logistic regression, regularized regression by gradient descent, the perceptron, K-nearest neighbours, Gaussian mixture models, and possibly some eigenvector methods for dimension reduction.

#### Supervised Learning
##### Naive Bayes - Classification
Here we have a data set $X$ where each $x_i \in X$ has a class $y_i \in Y$. For simplicity we assume $|Y| = 2$, i.e. binary classification. Further, each $x_i$ is a sequence $x_i = \{w_{i,j}\}_{j=1}^{m_i}$, and we make the naive assumption that the $w_{i,j}$ are conditionally independent given the class, so that

$$ P(x_i \mid y_i) = \prod_{j=1}^{m_i} P(w_{i,j} \mid y_i). $$

Then, given a new data point $x = \{w_j\}_{j=1}^m$, we can compute the probability that $x$ belongs to class $y$ by Bayes' rule:

$$ P(y \mid x) = \frac{P(x \mid y)P(y)}{P(x)} = \frac{P(y) \prod_{j=1}^m P(w_j \mid y)}{\sum_{\hat{y} \in Y} P(\hat{y})\prod_{j=1}^m P(w_j \mid \hat{y})} $$

We can estimate the prior $P(y)$ for each $y \in Y$ by the fraction of the $x_i$ that have class $y$. The conditional $P(w|y)$ can be estimated by the relative frequency with which $w$ appears among all the words of the $x_i$ in class $y$. These are the empirical (frequentist) estimates.
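
The notebook's own implementation multiplies the probabilities directly, which is fine for short inputs. For long sequences the product can underflow to zero in floating point, so a common alternative is to evaluate the same Bayes' rule formula with sums of logarithms. The sketch below is only an illustration of that alternative: the function name `posterior_log_space`, the class labels `"a"` and `"b"`, and the probability tables are all hypothetical.

```python
import math

def posterior_log_space(words, priors, conditionals):
    """Return P(y | words) for every class y, computed via log-probabilities."""
    log_scores = {}
    for label, prior in priors.items():
        probs = [prior] + [conditionals[label][w] for w in words]
        if any(p == 0 for p in probs):
            # A single zero factor makes the whole product zero.
            log_scores[label] = -math.inf
        else:
            log_scores[label] = sum(math.log(p) for p in probs)
    top = max(log_scores.values())
    if top == -math.inf:
        raise ValueError("every class has zero probability for these words")
    # Subtract the largest log-score before exponentiating to avoid underflow.
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

# Hypothetical probability tables, chosen only for illustration.
priors = {"a": 0.5, "b": 0.5}
conditionals = {"a": {"x": 0.2, "y": 0.1}, "b": {"x": 0.05, "y": 0.3}}
posterior = posterior_log_space(["x", "y"], priors, conditionals)
print(posterior)
```

The normalisation step is the same denominator as in Bayes' rule above, just rescaled by the largest class score so that at least one weight equals 1.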

The standard example is spam detection for email, so I name things with that in mind; however, the method should work in general.

In [8]:
import string

def strip_punctuation(text):
    # remove every ASCII punctuation character from the text
    punc = set(string.punctuation)
    return ''.join(ch for ch in text if ch not in punc)

def process_text(text):
    return strip_punctuation(text).lower()

class Email:
        
    def __init__(self, text, spam):
        self.text = text
        self.spam = spam
        
def email_to_words(email):
    return process_text(email.text).split()
        
def get_vocabulary(emails):
    vocab = set()
    for email in emails:
        words = email_to_words(email)
        for word in words:
            vocab.add(word)
    return vocab

def get_priors(emails):
    priors = {"spam":0, "ham":0}
    for email in emails:
        if email.spam: priors['spam']+=1
        else : priors['ham']+=1
    for key in priors:
        priors[key] /= len(emails)
    return priors

def get_word_conditionals(emails):
    vocab = get_vocabulary(emails)
    spam_conditionals = {}
    num_spam = 0
    ham_conditionals = {}
    num_ham = 0
    for email in emails:
        words = email_to_words(email)
        for word in words:
            if email.spam:
                num_spam+=1
                if word in spam_conditionals: spam_conditionals[word]+=1
                else : spam_conditionals[word]=1
            else:
                num_ham+=1
                if word in ham_conditionals : ham_conditionals[word]+=1
                else : ham_conditionals[word] = 1
    conditionals = [spam_conditionals, ham_conditionals]
    nums = [num_spam, num_ham]
    for conditional, num in zip(conditionals,nums):
        for word in vocab:
            if word not in conditional: conditional[word]=0
            else : conditional[word] = conditional[word]/num
    return spam_conditionals, ham_conditionals

def naive_bayes_classification(new_email, emails):
    spam_conditionals, ham_conditionals = get_word_conditionals(emails)
    priors = get_priors(emails)
    numerator = 1
    denominator = 1
    words = email_to_words(new_email)
    for word in words:
        numerator*=spam_conditionals[word]
        denominator*=ham_conditionals[word]
    numerator*=priors['spam']
    denominator = denominator*priors['ham']+numerator
    return numerator/denominator

In [9]:
# example given in videos
email_text = ["offer is secret", "click secret link", "secret sports link", 
         "play sports today", "went play sports", "secret sports event", 
         "sports is today", "sports costs money"]
spam_labels = [True,True,True, False,False,False,False,False]
emails = []
for text,label in zip(email_text,spam_labels):
    emails.append(Email(text,label))

priors = get_priors(emails)
print(priors)

{'ham': 0.625, 'spam': 0.375}


In [10]:
spam_conditionals, _ = get_word_conditionals(emails)
print(spam_conditionals)

{'secret': 0.3333333333333333, 'event': 0, 'went': 0, 'link': 0.2222222222222222, 'offer': 0.1111111111111111, 'money': 0, 'today': 0, 'play': 0, 'is': 0.1111111111111111, 'costs': 0, 'click': 0.1111111111111111, 'sports': 0.1111111111111111}


In [12]:
# examples from videos, with added punctuation and case noise
new_email_text = ["spORts", "sec'ret i@s secr~et", "ToDA&y Is .secret"]
for text in new_email_text:
    new_email = Email(text, None)
    print("Probability of \"{}\" being spam: {}".format(text, naive_bayes_classification(new_email, emails)))

Probability of "spORts" being spam: 0.16666666666666669
Probability of "sec'ret i@s secr~et" being spam: 0.9615384615384616
Probability of "ToDA&y Is .secret" being spam: 0.0
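
As a hand check of the first two results, the training emails above contain 3 spam emails with 9 words in total and 5 ham emails with 15 words in total. Plugging the empirical estimates into Bayes' rule gives

$$ P(\text{spam} \mid \text{sports}) = \frac{\frac{3}{8}\cdot\frac{1}{9}}{\frac{3}{8}\cdot\frac{1}{9} + \frac{5}{8}\cdot\frac{5}{15}} = \frac{1/24}{1/24 + 5/24} = \frac{1}{6} \approx 0.1667 $$

$$ P(\text{spam} \mid \text{secret is secret}) = \frac{\frac{3}{8}\cdot\frac{3}{9}\cdot\frac{1}{9}\cdot\frac{3}{9}}{\frac{3}{8}\cdot\frac{3}{9}\cdot\frac{1}{9}\cdot\frac{3}{9} + \frac{5}{8}\cdot\frac{1}{15}\cdot\frac{1}{15}\cdot\frac{1}{15}} = \frac{1/216}{1/216 + 1/5400} = \frac{25}{26} \approx 0.9615 $$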


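The final example shows why we might want to use a regularisation method such as Laplace smoothing. The word "today" never appears in the spam emails, so $P(\text{today} \mid \text{spam}) = 0$ under the empirical estimate. That single zero factor forces the whole spam product to zero, even though "secret" is far more frequent in spam than in ham.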

Laplace smoothing is parameterised by a positive integer $k$. Given a random variable $X$ that takes values $x_1, \ldots, x_m$, suppose $X_1, \ldots, X_n$ are samples of $X$. Then to estimate $P(X=x_i)$ we use the Laplace-smoothed estimate

$$ P(X=x_i) \approx \frac{k + \sum_{j=1} ^n \chi(X_j = x_i)}{km + n} $$

where $\chi$ is the indicator function, so the sum counts how many samples equal $x_i$. In the case of emails, the $x_i$ are all the possible words (the vocabulary), and the $X_j$ are the word occurrences found in the data set for a given class. The smoothing can also be applied to the priors, where $m = 2$ is the number of classes.
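
To make the formula concrete, here is a minimal sketch that evaluates it with exact fractions for the spam class of the example data (9 word occurrences, 12 distinct words in the vocabulary). The helper name `laplace_estimate` and its argument names are my own and are not part of the notebook's code.

```python
from fractions import Fraction

def laplace_estimate(count, n, m, k=1):
    """Smoothed estimate (k + count) / (k*m + n) as an exact fraction."""
    return Fraction(k + count, k * m + n)

# Word counts in the three spam emails of the example (9 occurrences in total).
spam_counts = {"secret": 3, "link": 2, "offer": 1, "is": 1, "click": 1, "sports": 1}
vocab_size = 12                      # distinct words across all eight emails
n_spam_words = sum(spam_counts.values())

p_today = laplace_estimate(0, n_spam_words, vocab_size)    # "today" never occurs in spam
p_secret = laplace_estimate(spam_counts["secret"], n_spam_words, vocab_size)
p_spam = laplace_estimate(3, n=8, m=2)                     # prior: 3 of 8 emails, 2 classes

print(p_today, p_secret, p_spam)
```

With $k = 1$ the unseen word "today" gets probability $1/21$ instead of $0$, and the smoothed estimates over the whole vocabulary still sum to one.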

In [13]:
def get_laplace_priors(emails, k):
    priors = {"spam":0, "ham":0}
    for email in emails:
        if email.spam: priors['spam']+=1
        else : priors['ham']+=1
    for key in priors:
        priors[key]+=k
        priors[key] /= (len(emails)+2*k)
    return priors

def get_laplace_word_conditionals(emails, k):
    vocab = get_vocabulary(emails)
    spam_conditionals = {}
    num_spam = 0
    ham_conditionals = {}
    num_ham = 0
    for email in emails:
        words = email_to_words(email)
        for word in words:
            if email.spam:
                num_spam+=1
                if word in spam_conditionals: spam_conditionals[word]+=1
                else : spam_conditionals[word]=1
            else:
                num_ham+=1
                if word in ham_conditionals : ham_conditionals[word]+=1
                else : ham_conditionals[word] = 1
    conditionals = [spam_conditionals, ham_conditionals]
    nums = [num_spam, num_ham]
    for conditional, num in zip(conditionals,nums):
        for word in vocab:
            # unseen words get the smoothed estimate k / (num + k*|vocab|), not the raw k
            if word not in conditional: conditional[word] = k/(num+k*len(vocab))
            else : conditional[word] = (conditional[word]+k)/(num+k*len(vocab))
    return spam_conditionals, ham_conditionals

def laplace_naive_bayes_classification(new_email, emails, k):
    spam_conditionals, ham_conditionals = get_laplace_word_conditionals(emails,k)
    priors = get_laplace_priors(emails,k)
    numerator = 1
    denominator = 1
    words = email_to_words(new_email)
    for word in words:
        numerator*=spam_conditionals[word]
        denominator*=ham_conditionals[word]
    numerator*=priors['spam']
    denominator = denominator*priors['ham']+numerator
    return numerator/denominator

In [18]:
# examples from videos, with added punctuation and case noise
# probabilities are printed rounded to four decimals
new_email_text = ["spORts", "sec'ret i@s secr~et", "ToDA&y Is .secret"]
for text in new_email_text:
    new_email = Email(text, None)
    print("Probability of \"{}\" being spam: {:.4f}".format(text, laplace_naive_bayes_classification(new_email, emails,1)))

Probability of "spORts" being spam: 0.2222
Probability of "sec'ret i@s secr~et" being spam: 0.8500
Probability of "ToDA&y Is .secret" being spam: 0.4858
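
As an independent cross-check of the smoothed classifier with $k = 1$, the following self-contained sketch recomputes the posteriors with exact fractions. It is not the notebook's implementation: the function name `smoothed_spam_posterior` and the `TRAINING` list are my own, and the input text is assumed to be lower-case and free of punctuation already.

```python
from collections import Counter
from fractions import Fraction

# Training data copied from the example above: (text, is_spam).
TRAINING = [
    ("offer is secret", True), ("click secret link", True),
    ("secret sports link", True), ("play sports today", False),
    ("went play sports", False), ("secret sports event", False),
    ("sports is today", False), ("sports costs money", False),
]

def smoothed_spam_posterior(text, training=TRAINING, k=1):
    """Exact P(spam | text) under naive Bayes with Laplace smoothing."""
    counts = {True: Counter(), False: Counter()}   # word counts per class
    n_emails = Counter()                           # number of emails per class
    for body, is_spam in training:
        counts[is_spam].update(body.split())
        n_emails[is_spam] += 1
    vocab = set(counts[True]) | set(counts[False])

    scores = {}
    for label in (True, False):
        # Smoothed prior: there are two classes, so m = 2.
        score = Fraction(n_emails[label] + k, len(training) + 2 * k)
        n_words = sum(counts[label].values())
        for word in text.split():
            # Smoothed conditional: m is the vocabulary size.
            score *= Fraction(counts[label][word] + k, n_words + k * len(vocab))
        scores[label] = score
    return scores[True] / (scores[True] + scores[False])

for text in ["sports", "secret is secret", "today is secret"]:
    p = smoothed_spam_posterior(text)
    print(text, p, round(float(p), 4))
```

For "today is secret" the exact posterior is $324/667 \approx 0.4858$, and for "sports" it is $2/9 \approx 0.2222$. A word that was never seen in a class therefore lowers that class's score without eliminating it.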


#### One Dimensional Linear Regression