# Machine Learning with Python

Speaker: Yingzhi Gou

Decision Systems Lab,

University of Wollongong

**NOTE** This Jupyter notebook is available on GitHub: https://github.com/YingzhiGou/AI-Meetup-Decision-Systems-Lab-UOW


# Acknowledgement

The source code in this tutorial is based on the book **Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems** by *Aurélien Géron*.

# Software Required

* python 3.5
* numpy
* jupyter
* matplotlib
* scikit-learn

# Classification

Classification is the task of identifying which category a new observation (data entry) belongs to, based on the training data.

# Data Set

## MNIST data set

A set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau.

Each image is labelled with the digit it represents (so we can do supervised learning with this data set).

In [None]:
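# fetch data set
# NOTE: fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22;
# on newer versions, fetch_openml('mnist_784') is the replacement (it returns
# string labels in a different row order, so later cells would need adjusting).
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original', data_home="data")  # downloads the data; this may take a while depending on your network
mnist  # this will show the dataset object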

In [None]:
# now let's take a look at the data
X, y = mnist["data"], mnist["target"]
X.shape

In [None]:
y.shape

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

# define a function that displays a digit image
def show_digit(pixels):
    digit_image = pixels.reshape(28,28)
    plt.imshow(digit_image, cmap = matplotlib.cm.binary, interpolation="nearest")
    plt.axis("off")
    plt.show()

In [None]:
some_digit = X[36000] # select some digit data
show_digit(some_digit)

In [None]:
y[36000]

# Prepare Data for Learning

## Training data
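Normally 60%~70% of the labelled data set. MNIST comes with a conventional split: the first 60,000 images are used for training.

## Testing data
The rest of the data set (for MNIST, the last 10,000 images).

**NOTE** There are many ways to sample the training data, e.g. partitioning, random sampling, sampling based on the class distribution of the data set (stratified sampling), etc. Beware of possible bias introduced by sampling.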
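As a small illustration of sampling based on the class distribution, here is a minimal sketch using scikit-learn's `train_test_split` with its `stratify` argument. The toy arrays `X_toy` and `y_toy` are made up for this example and are not part of the MNIST workflow in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 negatives and 10 positives (a skewed data set)
rng = np.random.RandomState(42)
X_toy = rng.rand(100, 4)
y_toy = np.array([0] * 90 + [1] * 10)

# stratify=y_toy keeps the 90/10 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

print("positive rate, train:", y_tr.mean())
print("positive rate, test: ", y_te.mean())
```

Without `stratify`, a purely random split of such a small, skewed set could easily leave the test set with very few positives.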

In [None]:
# create training and testing sets
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

# shuffling the data set
import numpy as np

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

# Binary Classification

For one class, build one classifier.

In this case, let's build a classifier for digit '5'.

In [None]:
y_train_5 = (y_train == 5) # True for all 5s, False for all other digits
y_test_5 = (y_test == 5)

## Stochastic Gradient Descent (SGD)

Now let's pick our first binary classifier.

SGD can handle very large datasets efficiently because it processes training instances independently, one at a time (which also makes it well suited to *online training*).
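To make the "one at a time" idea concrete, here is a rough sketch of online training with `SGDClassifier.partial_fit`, feeding the model one mini-batch at a time. The two-blob data stream (`X_stream`, `y_stream`) and the batch size are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(42)

# Hypothetical data stream: two well-separated Gaussian blobs in 2-D
X_stream = np.vstack([rng.randn(500, 2) - 3, rng.randn(500, 2) + 3])
y_stream = np.array([0] * 500 + [1] * 500)
order = rng.permutation(1000)
X_stream, y_stream = X_stream[order], y_stream[order]

online_clf = SGDClassifier(random_state=42)
all_classes = np.array([0, 1])

# Feed the model 100 instances at a time, as if they arrived in a stream
for start in range(0, 1000, 100):
    batch = slice(start, start + 100)
    # classes is required on the first call so the model knows every label
    online_clf.partial_fit(X_stream[batch], y_stream[batch], classes=all_classes)

print(online_clf.predict([[-3, -3], [3, 3]]))
```

The model never needs the whole data set in memory, which is why this style of training scales to very large data sets.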

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
# SGD relies on randomness during training, so set random_state if you want reproducible results
sgd_clf.fit(X_train, y_train_5)

In [None]:
# let's test the model
sgd_clf.predict([some_digit])

In [None]:
show_digit(some_digit)

In [None]:
sgd_clf.predict([X[23423]])

In [None]:
show_digit(X[23423])

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train, y_train_5)

In [None]:
forest_clf.predict([some_digit])

In [None]:
forest_clf.predict([X[23423]])

# Performance Measures

Evaluating a classifier is often tricky. There are many performance measures, each with its own advantages and disadvantages. We will discuss the related concepts below.

## Cross Validation

We will use the $cross\_val\_score()$ function provided by the $sklearn.model\_selection$ module, which performs K-fold cross-validation.

K-fold cross-validation splits the training set into K folds, then makes predictions on each fold and evaluates them, using a model trained on the remaining K-1 folds.
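The splitting step can be sketched in plain NumPy as follows. The helper `k_fold_indices` is a hypothetical name written for this illustration; scikit-learn's own splitters are more sophisticated (for classifiers they may, for example, stratify the folds by class).

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]  # fold i is held out for evaluation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# 10 samples, 5 folds: every sample is used for testing exactly once
for train_idx, test_idx in k_fold_indices(10, 5):
    print("train:", train_idx, "test:", test_idx)
```

Each of the K models is scored only on data it has not seen during training, which is what makes the resulting scores a fairer estimate.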

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=5, scoring='accuracy')

about 95% accuracy!!!

However ...

In [None]:
from sklearn.base import BaseEstimator

# a dummy classifier that always predicts "not 5"
class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        return self  # nothing to learn
    def predict(self, X):
        return np.zeros(len(X), dtype=bool)  # one False per sample

In [None]:
never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=5, scoring='accuracy')

WHY?

Because only about 10% of the images are 5s, roughly 90% of the data is "not 5": our data set is heavily skewed towards the "not 5" class. By simply guessing "not 5" every time, a classifier achieves about 90% accuracy.
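The arithmetic can be checked on a made-up label vector. This is only a toy illustration with hypothetical numbers (1,000 labels, 10% positives), not the MNIST data.

```python
import numpy as np

# Hypothetical labels: 1,000 images, 10% of them are 5s
y_true = np.zeros(1000, dtype=bool)
y_true[:100] = True

# A "classifier" that always answers "not 5"
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
print(accuracy)  # 900 correct out of 1,000
```

Accuracy alone says nothing here about how well the 5s themselves are detected: none of them is.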

## Confusion Matrix

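A much better way to evaluate the performance of a classifier is the confusion matrix: it counts how many instances of each actual class were assigned to each predicted class.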

<img src="http://3.bp.blogspot.com/_txFWHHNYMJQ/THyADzbutYI/AAAAAAAAAf8/TAXL7lySrko/s1600/Picture+8.png"/>


In [None]:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=5)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)

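Each row represents an actual class and each column represents a predicted class. In the 2×2 output above, the first row is the negative class ("not 5") and the second row is the positive class ("5").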

In [None]:
confusion_matrix(y_train_5, y_train_5)  # if the prediction is 100% accurate
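To see where the four numbers come from, here is a minimal NumPy sketch that counts them by hand for a binary problem. The function name `binary_confusion_matrix` and the toy labels are made up for this illustration; the layout follows the rows-are-actual, columns-are-predicted convention described above.

```python
import numpy as np

def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are actual, columns are predicted."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tn = np.sum(~y_true & ~y_pred)  # actual negative, predicted negative
    fp = np.sum(~y_true & y_pred)   # actual negative, predicted positive
    fn = np.sum(y_true & ~y_pred)   # actual positive, predicted negative
    tp = np.sum(y_true & y_pred)    # actual positive, predicted positive
    return np.array([[tn, fp], [fn, tp]])

# Hypothetical labels and predictions
y_true_toy = [False, False, False, True, True, True, True, False]
y_pred_toy = [False, True, False, True, False, True, True, False]

print(binary_confusion_matrix(y_true_toy, y_pred_toy))
# [[3 1]
#  [1 3]]
```

A perfect classifier would have non-zero values only on the main diagonal.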

## Precision and Recall

$$ precision = \frac{TP}{TP+FP} $$
$$ recall = \frac{TP}{TP+FN} $$

where $TP$, $FP$ and $FN$ are the numbers of true positives, false positives and false negatives.

<img src="http://3.bp.blogspot.com/_txFWHHNYMJQ/THyADzbutYI/AAAAAAAAAf8/TAXL7lySrko/s1600/Picture+8.png"/>

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_train_5, y_train_pred)

In [None]:
recall_score(y_train_5, y_train_pred)
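As a quick worked example of the two formulas, with made-up counts (these numbers are hypothetical and not taken from the MNIST results):

```python
# Hypothetical counts read off a confusion matrix
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # 30 / 40: how many predicted 5s really are 5s
recall = tp / (tp + fn)     # 30 / 50: how many actual 5s were found

print("precision:", precision)  # 0.75
print("recall:   ", recall)     # 0.6
```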

## $F_1$ Score

$F_1$ score is the *harmonic mean* of precision and recall.

$$ F_1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}} = 2\times \frac{precision \times recall}{precision + recall} = \frac{TP}{TP+\frac{FN+FP}{2}} $$

In [None]:
from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)
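The three forms of the $F_1$ formula can be checked numerically. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

# Hypothetical counts
tp, fp, fn = 30, 10, 20
precision = tp / (tp + fp)  # 0.75
recall = tp / (tp + fn)     # 0.6

f1_harmonic = 2 / (1 / precision + 1 / recall)
f1_product = 2 * precision * recall / (precision + recall)
f1_counts = tp / (tp + (fn + fp) / 2)

print(f1_harmonic, f1_product, f1_counts)  # all three are 2/3, up to rounding
assert math.isclose(f1_harmonic, f1_product)
assert math.isclose(f1_product, f1_counts)
```

Because it is a harmonic mean, $F_1$ is only high when precision and recall are both high.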

### Precision/Recall Tradeoff

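A classifier such as SGD computes a score for each instance and predicts the positive class when the score exceeds a threshold. Raising the threshold generally increases precision and reduces recall, and vice versa.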
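The tradeoff can be seen by sweeping a decision threshold over a handful of scores. The `scores` and `labels` arrays below are made up for this sketch; with a real model the scores would come from something like `decision_function`.

```python
import numpy as np

# Hypothetical decision scores and true labels (True = positive class)
scores = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0, 4.0])
labels = np.array([False, False, True, False, True, True, True, True])

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for score > threshold."""
    pred = scores > threshold
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    return tp / (tp + fp), tp / (tp + fn)

for t in (-1.5, 0.0, 1.5):
    p, r = precision_recall_at(t)
    print("threshold=%5.1f  precision=%.3f  recall=%.3f" % (t, p, r))
```

On this toy data, precision rises from about 0.71 to 1.0 as the threshold goes up, while recall falls from 1.0 to 0.6.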

## ROC Curve

The **receiver operating characteristic (ROC)** curve plots the *true positive rate* (another name for recall) against the *false positive rate*.
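To show what goes into the curve, here is a simplified NumPy sketch that computes one (FPR, TPR) point per distinct score threshold. The helper `roc_points` and the toy data are assumptions for this illustration; scikit-learn's `roc_curve` is the function actually used in this notebook and handles more cases.

```python
import numpy as np

def roc_points(y_true, y_score):
    """Return (fpr, tpr) arrays, one point per threshold, starting at (0, 0)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    thresholds = np.unique(y_score)[::-1]  # distinct scores, highest first
    fpr, tpr = [0.0], [0.0]
    for t in thresholds:
        pred = y_score >= t
        tpr.append(np.sum(pred & y_true) / np.sum(y_true))     # recall
        fpr.append(np.sum(pred & ~y_true) / np.sum(~y_true))   # false alarms
    return np.array(fpr), np.array(tpr)

# Hypothetical scores and labels
toy_scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0, 4.0]
toy_labels = [False, False, True, False, True, True, True, True]

toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
print(toy_fpr)
print(toy_tpr)
```

Lowering the threshold moves along the curve from (0, 0) towards (1, 1): more positives are caught, at the cost of more false alarms.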

In [None]:
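from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict

# use out-of-fold decision scores so the curve is not computed on data the model was trained on
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=5, method="decision_function")
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)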

print(fpr)
print(tpr)
print(thresholds)

In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=3, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    
plt.figure(figsize=(8,6))
plot_roc_curve(fpr, tpr)
plt.show()

In [None]:
# compute ROC AUC (ROC Area under the curve)
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
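For intuition, the area under a ROC curve can be approximated with the trapezoidal rule, and it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below checks both on made-up values; the arrays are hypothetical and unrelated to the MNIST scores above.

```python
import numpy as np

# Hypothetical scores for 3 negatives and 5 positives
neg_scores = np.array([-2.0, -1.0, 0.5])
pos_scores = np.array([-0.5, 1.0, 2.0, 3.0, 4.0])

# ROC points for these scores, from the highest threshold to the lowest
toy_fpr = np.array([0, 0, 0, 0, 0, 1/3, 1/3, 2/3, 1])
toy_tpr = np.array([0, 0.2, 0.4, 0.6, 0.8, 0.8, 1, 1, 1])

# Trapezoidal rule: width of each step times the mean height of its two ends
auc_trapezoid = np.sum(np.diff(toy_fpr) * (toy_tpr[1:] + toy_tpr[:-1]) / 2)

# Fraction of (positive, negative) pairs ranked in the right order
auc_pairs = np.mean(pos_scores[:, None] > neg_scores[None, :])

print(auc_trapezoid, auc_pairs)  # both are 14/15, about 0.933
```

A perfect classifier has an AUC of 1, while a purely random one has an AUC of about 0.5 (the dashed diagonal in the plot).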

In [None]:
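# out-of-fold class probabilities; scoring forest_clf on X_train itself would look near-perfect
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=5, method="predict_proba")
y_probas_forest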

In [None]:
y_score_forest = y_probas_forest[:, 1]  # only keep the probability of the positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_score_forest)

plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="best")
plt.show()

In [None]:
roc_auc_score(y_train_5, y_score_forest)

In [None]:
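# out-of-fold predictions, so the scores below are not measured on the forest's own training data
y_pred_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=5)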

# precision
precision_score(y_train_5, y_pred_forest)

In [None]:
# recall
recall_score(y_train_5, y_pred_forest)

In [None]:
# F1
f1_score(y_train_5, y_pred_forest)