## Chapter 3 - Classification

In [2]:
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import (precision_score, 
                             recall_score, 
                             classification_report, 
                             confusion_matrix, f1_score, 
                             precision_recall_curve, roc_curve, roc_auc_score)

def load(fname):
    """Load the MNIST dataset from a local pickle, downloading it on first use."""
    import pickle
    try:
        with open(fname, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        from sklearn.datasets import fetch_openml
        mnist = fetch_openml('mnist_784', version=1, cache=True)
        with open(fname, 'wb') as f:
            # pickle.dump returns None, so do not assign its result to mnist
            pickle.dump(mnist, f)
        return mnist

### Classifying with k-Nearest Neighbours (kNN)

Given a training set of $n$ observations with $p$ features belonging to one of $c$ classes, we imagine plotting them in $p$-dimensional space. 

To classify a new example, we first choose the number of nearest neighbours, $k$. We then find the $k$ training observations closest to the new example under a distance measure (e.g. Euclidean distance), and take the "majority vote" of the classes amongst these neighbours as the prediction.

The pseudocode is as follows:

```comments
with a new point q
for every observation X in 1...n:
    calculate the distance between X and q
sort the distances in increasing order
take k items with lowest distances to q
find the majority class among these k items
return the majority class as our prediction for the class of q
```
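The pseudocode above can be sketched in NumPy as follows. This is a minimal illustration, not how `KNeighborsClassifier` works internally; the function name `knn_predict` and the toy data are made up for this example, and ties in the vote are broken arbitrarily.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, q, k=3):
    """Predict the class of a single query point q by majority vote."""
    # Euclidean distance from q to every training observation
    distances = np.sqrt(((X_train - q) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority class among the k nearest neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two clusters in 2-D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # query near the first cluster
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3))  # query near the second cluster
```

Every prediction computes a distance to all $n$ training observations, which is why kNN is cheap to train but can be slow to query on large datasets.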

Note that kNN with Euclidean distance is sensitive to feature scale: a feature with a much larger range than the others will dominate the distance calculation. Hence, ideally normalise (or standardise) the features before computing distances.
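To see the effect of scale on Euclidean distance, here is a small sketch using `StandardScaler` on made-up data. The array `X_raw` and the helper `euclidean` are assumptions for illustration only. In a real workflow the scaler would normally be fitted on the training split alone and then applied to the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 is in the thousands, column 1 lies between 0 and 1
X_raw = np.array([[1000.0, 0.1],
                  [1010.0, 0.9],
                  [2000.0, 0.1]])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return np.sqrt(((a - b) ** 2).sum())

# Unscaled: the distance is driven almost entirely by column 0
d_raw_01 = euclidean(X_raw[0], X_raw[1])
d_raw_02 = euclidean(X_raw[0], X_raw[2])

# Standardise each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_raw)
d_scaled_01 = euclidean(X_scaled[0], X_scaled[1])
d_scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print(d_raw_01, d_raw_02)        # row 2 looks far further away than row 1
print(d_scaled_01, d_scaled_02)  # after scaling the two distances are comparable
```

Before scaling, the large difference in column 1 between rows 0 and 1 barely registers; after scaling, it counts about as much as the large difference in column 0 between rows 0 and 2.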

In [17]:
# Ingest
mnist_data = load('mnist.data.pkl')
X, y = mnist_data['data'], mnist_data['target']
y = y.astype(int)

# Limit classification to 1200 samples per class (12,000 in total)
df = pd.DataFrame(X)
df['y'] = y
df_samples = []
for i, j in df.groupby('y'):
    df_samples.append(j.sample(1200).copy())
df_new = pd.concat(df_samples)
X_new = df_new.iloc[:,:784].copy()
y_new = df_new['y'].copy()
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, test_size=0.15, random_state=0)

In this case, we predict the digit (0 to 9) shown in each image from its 784 pixel values.

In [19]:
# Train a kNN classifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [22]:
# Predict
y_predict = knn_clf.predict(X_test[:50])
print(y_test[:25].tolist())
print(y_predict[:25].tolist())

[4, 9, 9, 6, 9, 4, 7, 5, 9, 8, 6, 7, 0, 3, 8, 7, 8, 3, 0, 8, 3, 0, 5, 1, 6]
[4, 9, 9, 6, 9, 4, 7, 5, 9, 8, 6, 7, 0, 3, 8, 7, 8, 3, 0, 8, 3, 0, 5, 1, 6]


In [23]:
print(cross_val_score(knn_clf, X_train, y_train, scoring='accuracy', cv=3))

[0.93647059 0.94088235 0.93852941]


### Classifying with Probability Theory: Naïve Bayes

<b>Conditional Probability</b> - Recall that for two events $X$ and $C$ with probabilities $P(X)$ and $P(C)$, the joint probability of both events occurring is $P(X,C)$. Take note that $P(X,C) = P(C,X)$.

The conditional probability of $C$ occurring given $X$ is written $P(C|X)$ and is evaluated as $P(C|X) = \frac{P(C,X)}{P(X)}$. Since we also know $P(X|C) = \frac{P(C,X)}{P(C)} \rightarrow P(C,X) = P(X|C)\cdot {P(C)}$, we arrive at the conditional probability statement known as Bayes' rule:

$$P(C|X) = \frac{P(X|C)\cdot P(C)}{P(X)}$$
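As a quick numerical check of this statement, the sketch below plugs in made-up probabilities (none of these numbers come from the datasets in this chapter). The evidence $P(X)$ is obtained from the law of total probability over $C$ and its complement.

```python
# Made-up probabilities for illustration
p_c = 0.2                # P(C): prior probability of C
p_x_given_c = 0.5        # P(X|C)
p_x_given_not_c = 0.05   # P(X|not C)

# Law of total probability gives the evidence P(X)
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

# Conditional probability statement: P(C|X) = P(X|C) * P(C) / P(X)
p_c_given_x = p_x_given_c * p_c / p_x
print(round(p_c_given_x, 3))  # 0.714
```

Observing $X$ raises the probability of $C$ from 0.2 to roughly 0.71, because $X$ is ten times more likely under $C$ than under its complement.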

#### Naïve Bayes Model
Naïve Bayes applies Bayes' rule to classification. For a new observation $\mathbf x = (x_1, \cdots, x_p)^T$, what is the probability that it belongs to each of the $K$ classes? Mathematically, we extend the above statement to $P(c_k|\mathbf x)$, which gives:

$$P(c_k|\mathbf x) = \frac{P(\mathbf x | c_k)\cdot P(c_k)}{P(\mathbf x)}$$

<b>Assumptions</b> - Naïve Bayes assumes that the features $x_j$ of $\mathbf x$, for $j \in \{1,\cdots,p\}$, are independent of one another given the class. In the document classification problem, if the vocabulary of the corpus has $p$ words, then statistical independence means that the presence of each word is independent of the presence of any other word. This is unlikely to be true in practice, but the model assumes it anyway (hence "naïve"). The model also assumes that every feature is equally important.

<b>Training</b> - In the document classification problem, there are $N$ documents in the training set, each belonging to one of $K$ classes, and the vocabulary of the corpus has $p$ words. Each document is represented by a vector $\mathbf x_i$, with each element giving the count of one vocabulary word in that document, so $\mathbf x_i = (x_{i1}, \cdots ,x_{ij},\cdots, x_{ip})^T$. If the third word in the vocabulary is `bacon`, and it occurs twice in the first document, then $x_{13}=2$.
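The count-vector representation can be sketched with scikit-learn's `CountVectorizer` on a made-up two-document corpus. The corpus and variable names are assumptions for illustration; note that the cells in this chapter use `TfidfVectorizer`, which reweights these raw counts rather than using them directly.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up two-document corpus
docs = ["bacon eggs bacon", "eggs toast"]

count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs).toarray()

# vocabulary_ maps each word to its column index (columns are in alphabetical order)
vocab = count_vec.vocabulary_
bacon_col = vocab["bacon"]

print(X_counts)                # one row per document, one column per vocabulary word
print(X_counts[0, bacon_col])  # count of "bacon" in the first document
```

Here `bacon` happens to be the first column rather than the third, but the idea is the same: the entry in row $i$, column $j$ is the count $x_{ij}$.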

$P(c_k)$ is simply the number of documents in class $k$, divided by $N$, where $I(\cdot)$ is the indicator function and $c_i$ is the class of document $i$: $$P(c_k) = \frac{\sum_{i=1}^N I(c_i=k)}{N}$$

If there are 100 documents and 15 of them belong to $k=1$, then $P(c_1) = \frac{15}{100}=0.15$
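The same prior can be computed from a label array. The sketch below reproduces the 15-out-of-100 example with made-up labels, assuming the classes are encoded as non-negative integers (which `np.bincount` requires).

```python
import numpy as np

# Made-up labels: 100 documents, 15 in class 1 and 85 in class 0
labels = np.array([1] * 15 + [0] * 85)

# Count documents per class, then divide by N
class_counts = np.bincount(labels)
priors = class_counts / len(labels)

print(priors[1])  # prior for class 1
```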

Consider the term $P(\mathbf x_i | c_k)$. It can be expanded to $P(x_{i1},x_{i2},\cdots,x_{ip} | c_k)$, which is the joint probability of all the words in the document for that class. Since the occurrence of every word is assumed to be independent of the others, we can rewrite the above expression as:
$$P(x_{i1},x_{i2},\cdots,x_{ip} | c_k) = P(x_{i1}|c_k)\cdot P(x_{i2}|c_k)\cdots P(x_{ip}|c_k) = \prod_{j=1}^{p} P(x_{ij}|c_k)$$

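and this helps with our calculations. $P(x_{i1}|c_k)$ is estimated from the relative frequency of the word $w_1$ in class $c_k$: the number of times $w_1$ occurs in documents of class $c_k$, divided by the total number of words in those documents.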

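To classify a new document, we calculate $P(c_k|\mathbf x)$ for each of the $K$ classes and select the class with the largest posterior probability. Since $P(\mathbf x)$ is the same for every class, it is enough to compare the numerators $P(\mathbf x | c_k)\cdot P(c_k)$.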
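The training and classification steps can be sketched from scratch in NumPy. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the function names `fit_nb` and `predict_nb` and the toy counts are made up, probabilities are handled in log space to avoid underflow when many small terms are multiplied, and `alpha` adds a small count to every word so that a word unseen in a class does not force the whole product to zero (the same role as the `alpha` parameter of `MultinomialNB`).

```python
import numpy as np

def fit_nb(X, y, alpha=1.0):
    """Estimate log priors and log word likelihoods from a count matrix."""
    classes = np.unique(y)
    # log P(c_k): fraction of documents in each class
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Total count of each word within each class, plus alpha for smoothing
    word_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    # log P(word | c_k): relative frequency of each word within the class
    log_likelihood = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood

def predict_nb(x, classes, log_prior, log_likelihood):
    """Return the class with the largest unnormalised log posterior."""
    # Each word contributes count * log P(word | class); P(x) is ignored
    scores = log_prior + log_likelihood @ x
    return classes[np.argmax(scores)]

# Made-up counts over a 3-word vocabulary: [bacon, eggs, goal]
X_docs = np.array([[2, 1, 0],
                   [1, 2, 0],
                   [0, 0, 3],
                   [0, 1, 2]])
y_docs = np.array([0, 0, 1, 1])

params = fit_nb(X_docs, y_docs)
print(predict_nb(np.array([1, 1, 0]), *params))  # document about bacon and eggs
print(predict_nb(np.array([0, 0, 2]), *params))  # document about goals
```

Working with sums of logs instead of products of probabilities does not change which class wins, since the logarithm is monotonic.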

In [7]:
# Ingest
ng = fetch_20newsgroups()
Xng, yng = ng['data'], ng['target']
X_train_ng, X_test_ng, y_train_ng, y_test_ng = train_test_split(Xng, yng, test_size=0.20, random_state=0)

In [8]:
# Transform to vector, and train
vec = TfidfVectorizer()
Xng_matrix = vec.fit_transform(X_train_ng)

nb_clf = MultinomialNB()
nb_clf.fit(Xng_matrix, y_train_ng)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [9]:
# Transform to vector, and predict
Xng_test_matrix = vec.transform(X_test_ng)
y_predict_ng = nb_clf.predict(Xng_test_matrix)

In [10]:
# Compare the first 20 true labels with the predicted labels
print(y_test_ng[:20])
print(y_predict_ng[:20])

[ 1 12 13 14  9  9 11  8 14 11 10  9  4  9  0  9 13 14  7  5]
[ 1  1 13 14  9  9 11  8 14 11  9  9  4  9  0  9 13 14 17  5]


In [16]:
print(cross_val_score(nb_clf, Xng_matrix, y_train_ng, scoring='accuracy', cv=4))

[0.83031374 0.82501105 0.84047724 0.83377542]
