# Introduction to Machine Learning

Presented by [Charles Tapley Hoyt](https://github.com/cthoyt) on Tuesday, April 23rd, 2019 for Life Sciences Knowledge Discovery.

![XKCD on Machine Learning](https://imgs.xkcd.com/comics/machine_learning.png)

Reference: https://xkcd.com/1838/

## What is Machine Learning?

- Capture the underlying patterns observed in data in a model
- Generate predictions about new data


### Types

- Supervised
  - Data is labeled
- Unsupervised
  - Data is unlabeled
- Semi-Supervised
  - Some data is labeled
- Reinforcement
  - Learns from feedback (rewards or penalties) on its actions

## Tasks in Machine Learning

1. Density Estimation
   - Identify structure and distribution underlying data
2. Classification
   - Assign labels
3. Anomaly/Novelty Detection
   - Find outliers
4. Dimensionality Reduction
   - Identify structure in data and build simpler representation
5. Clustering
   - Find groups in data
6. Regression
   - Predict values

## Notebook Setup

### Imports

Organize imports by system imports, third party imports, then local imports.

In [None]:
import sys
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats
import seaborn as sns
import sklearn
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.neighbors
import sklearn.neural_network
import sklearn.svm
import sklearn.tree
import sklearn.utils.multiclass
from IPython.display import HTML, Markdown

### Jupyter Notebook Configuration

Tell `matplotlib` to make figures inside the notebook

Tell `matplotlib` to make figures with vector graphics

Set the theme for `matplotlib` and `seaborn`
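A minimal sketch of these three configuration steps. The two IPython magics only work inside a notebook, so they appear as comments here. The `whitegrid` style is an assumption for illustration, not necessarily the theme used in the talk.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Inside a Jupyter notebook, run these two magics first (they are not plain Python):
#   %matplotlib inline
#   %config InlineBackend.figure_format = 'svg'

# Apply a seaborn theme; this also updates matplotlib's rcParams.
# "whitegrid" is an assumed choice for illustration.
sns.set(style="whitegrid")

grid_enabled = plt.rcParams["axes.grid"]
print(grid_enabled)
```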

### Environmental Provenance

What version of Python was used?

When was this notebook last run?
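Both questions can be answered with the standard library. This is a small sketch; the variable names are illustrative.

```python
import sys
import time

# Full version string of the running Python interpreter
python_version = sys.version
# Human-readable timestamp of the moment this cell was run
last_run = time.asctime()

print(python_version)
print(last_run)
```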

## Classification

There are several tasks in machine learning. We'll start with one of the simplest: classification.

Load the [breast cancer](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) data from `sklearn`.

Investigate the structure of the object returned.
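One possible way to load the data and look at what comes back. The loader returns a dictionary-like `Bunch`, so listing its keys is a reasonable first step.

```python
import sklearn.datasets

breast_cancer_data = sklearn.datasets.load_breast_cancer()

# The returned object behaves like a dictionary
print(type(breast_cancer_data))
print(sorted(breast_cancer_data.keys()))

# Feature matrix: one row per sample, one column per feature
print(breast_cancer_data.data.shape)
# One integer label per sample
print(breast_cancer_data.target.shape)
# Human-readable names for the integer labels
print(breast_cancer_data.target_names)
```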

The goal of this data set was to use the shape of the nuclei of cells to predict whether the cells were malignant or benign. We can learn more from `print(breast_cancer_data.DESCR)`.

<img src="nuclei.png" width="350">

Construct a `pd.DataFrame` from the data and its feature names.

Augment the dataframe with the labels for later use in visualization with `matplotlib` and `seaborn`.

Peek at the first few rows of the dataframe.
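The three steps above could look like the following sketch. The column name `label` is an assumption made for illustration.

```python
import pandas as pd
import sklearn.datasets

breast_cancer_data = sklearn.datasets.load_breast_cancer()

# One column per feature, named after the feature
df = pd.DataFrame(
    breast_cancer_data.data,
    columns=breast_cancer_data.feature_names,
)

# Map the integer targets (0/1) to their names; "label" is an assumed column name
df["label"] = breast_cancer_data.target_names[breast_cancer_data.target]

# Peek at the first five rows
print(df.head())
```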

## Understanding the Data

List the types of labels.

Count the frequency of each label.

Plot the frequency of each label.

Show summary statistics for the numeric measurements.
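A sketch of these four steps using `pandas` alone, assuming the labels live in a column named `label` (an illustrative name). `sns.countplot` would be an alternative for the bar chart.

```python
import pandas as pd
import sklearn.datasets

breast_cancer_data = sklearn.datasets.load_breast_cancer()
df = pd.DataFrame(breast_cancer_data.data, columns=breast_cancer_data.feature_names)
df["label"] = breast_cancer_data.target_names[breast_cancer_data.target]

# Types of labels
labels = df["label"].unique()
print(labels)

# Frequency of each label
counts = df["label"].value_counts()
print(counts)

# Bar chart of the label frequencies
ax = counts.plot.bar()
ax.set_ylabel("Number of samples")

# Count, mean, standard deviation, and quartiles of every numeric column
summary = df.describe()
print(summary)
```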

### Univariate Density Estimation

Investigate the univariate distribution of each feature.
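One simple option is a grid of histograms, one per feature. The bin count, grid layout, and figure size below are assumptions chosen so that all 30 features fit in one figure.

```python
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.datasets

breast_cancer_data = sklearn.datasets.load_breast_cancer()
df = pd.DataFrame(breast_cancer_data.data, columns=breast_cancer_data.feature_names)

# 30 features arranged in a 6 x 5 grid of histograms
axes = df.hist(bins=30, layout=(6, 5), figsize=(15, 18))
plt.tight_layout()
```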

### Multivariate Density Estimation

The bivariate investigation of small datasets is already difficult to visualize succinctly. Here, we're going to look at a few specific examples.


Unfortunately, `seaborn` versions before 0.11 don't offer coloring by label in the `sns.jointplot` function. We'll fall back to `sns.scatterplot` to investigate two things:

1. Are features correlated?
2. How good is a feature (or pair of features) at separating the data by label?
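Both questions can be explored with a sketch like the one below. The chosen feature pairs and the `label` column name are assumptions for illustration.

```python
import pandas as pd
import seaborn as sns
import sklearn.datasets

breast_cancer_data = sklearn.datasets.load_breast_cancer()
df = pd.DataFrame(breast_cancer_data.data, columns=breast_cancer_data.feature_names)
df["label"] = breast_cancer_data.target_names[breast_cancer_data.target]

# 1. Are features correlated? Pearson correlation of two size-related features.
corr = df["mean radius"].corr(df["mean perimeter"])
print(f"Correlation of mean radius and mean perimeter: {corr:.3f}")

# 2. How well does a pair of features separate the labels?
#    Color each point by its label and look for overlap between the groups.
ax = sns.scatterplot(x="mean radius", y="mean texture", hue="label", data=df)
```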

## Classification

Prepare the data for classification by splitting randomly into test and training data.
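A sketch of the split, assuming an 80/20 ratio and a fixed `random_state` for reproducibility; neither value comes from the original talk. Passing `stratify=y` keeps the label proportions similar in both parts.

```python
import sklearn.datasets
import sklearn.model_selection

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)

# Hold out 20% of the samples for testing (assumed ratio)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y,
)

print(X_train.shape, X_test.shape)
```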

Build a logistic regression classifier.

Train the classifier.
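Building and training could look like this. The `liblinear` solver is an assumed choice because it converges on unscaled features without extra tuning.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y,
)

# Build the classifier (solver choice is an assumption)
classifier = sklearn.linear_model.LogisticRegression(solver="liblinear")

# Train it on the training data only
classifier.fit(X_train, y_train)

# One learned weight per feature
print(classifier.coef_.shape)
```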

Congratulations, you've done machine learning!

Use the classifier to make predictions on the test data.
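A sketch of the prediction step, retraining the same assumed logistic regression so that the example runs on its own. `predict` returns hard labels, while `predict_proba` returns one probability per class.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y,
)
classifier = sklearn.linear_model.LogisticRegression(solver="liblinear")
classifier.fit(X_train, y_train)

# Hard label (0 or 1) for each test sample
y_pred = classifier.predict(X_test)
# Probability of each class for each test sample
y_proba = classifier.predict_proba(X_test)

# Fraction of test samples predicted correctly
accuracy = (y_pred == y_test).mean()
print(y_pred[:10])
print(f"Accuracy: {accuracy:.3f}")
```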

Calculate a confusion matrix - how good were our predictions?

In [None]:
def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """Plot the confusion matrix for `y_true` versus `y_pred`.

    Each row is normalized to sum to 1 when `normalize=True`.
    Returns the matplotlib axes that hold the plot.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = sklearn.metrics.confusion_matrix(y_true, y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True Label',
           xlabel='Predicted Label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax
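The numbers behind such a plot can also be inspected directly. The sketch below retrains an assumed logistic regression, computes the raw counts, and applies the same row normalization as the plotting function.

```python
import numpy as np
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y,
)
classifier = sklearn.linear_model.LogisticRegression(solver="liblinear")
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Rows are true labels, columns are predicted labels
cm = sklearn.metrics.confusion_matrix(y_test, y_pred)
print(cm)

# Divide each row by its total so that every row sums to 1
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
print(np.round(cm_normalized, 2))
```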

Accuracy measures the fraction of predictions that are correct. It isn't robust when the data are imbalanced.

Precision measures the fraction of positive predictions that are actually positive.

Recall (i.e., sensitivity) measures the fraction of actual positives that the algorithm identifies.

$F_1$ is the harmonic mean of precision and recall. It is less susceptible to imbalanced data sets than accuracy, but it does not take the number of true negatives into account.

The Matthews correlation coefficient takes all four cells of the confusion matrix (true positives, true negatives, false positives, and false negatives) into account. It is also the strictest of these measures.
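To make these definitions concrete, here is a worked example on hypothetical toy data (5 positives, 95 negatives) that is not part of the breast cancer data set. Each metric is computed by hand from the four confusion-matrix counts.

```python
import math

import sklearn.metrics

# Hypothetical imbalanced labels: 5 positives followed by 95 negatives
y_true = [1] * 5 + [0] * 95
# Hypothetical predictions: one positive is found, there are no false alarms
y_pred = [1] + [0] * 4 + [0] * 95

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = sklearn.metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"f1={f1:.3f} mcc={mcc:.3f}")
```

Accuracy looks excellent here even though four of the five positives are missed, while recall, $F_1$, and the Matthews correlation coefficient all expose the problem.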

Print a quick summary
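`sklearn.metrics.classification_report` gives precision, recall, and $F_1$ per class in one call. The sketch below retrains an assumed logistic regression so that it runs on its own.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection

breast_cancer_data = sklearn.datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    breast_cancer_data.data, breast_cancer_data.target,
    test_size=0.2, random_state=0, stratify=breast_cancer_data.target,
)
classifier = sklearn.linear_model.LogisticRegression(solver="liblinear")
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Per-class precision, recall, F1, and support as formatted text
report = sklearn.metrics.classification_report(
    y_test, y_pred, target_names=breast_cancer_data.target_names,
)
print(report)

# A single number that uses all four cells of the confusion matrix
mcc = sklearn.metrics.matthews_corrcoef(y_test, y_pred)
print(f"Matthews correlation coefficient: {mcc:.3f}")
```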

### Application of Other Classifiers

- K-Nearest Neighbors (`sklearn.neighbors.KNeighborsClassifier`)
- Epsilon-Nearest Neighbors (`sklearn.neighbors.RadiusNeighborsClassifier`)
- Linear Support Vector Machine Classifier (`sklearn.svm.LinearSVC`)
- General Support Vector Machine Classifier (`sklearn.svm.SVC`)
- Decision Tree (`sklearn.tree.DecisionTreeClassifier`)
- Single Layer Perceptron (`sklearn.linear_model.Perceptron`)
- Multi-layer Perceptron (`sklearn.neural_network.MLPClassifier`)

Consider that each model has its own specific purpose and not all models are useful for all problems.
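Because all of these classifiers share the same `fit`/`predict` interface, they can be compared in a loop. This is a sketch with assumed settings (`max_iter`, `random_state`, `gamma`). `RadiusNeighborsClassifier` is left out because its radius has to be tuned to the scale of the features before it can label every test sample.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.neighbors
import sklearn.neural_network
import sklearn.svm
import sklearn.tree

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y,
)

# Hyperparameter values here are assumptions for illustration
classifiers = {
    "K-Nearest Neighbors": sklearn.neighbors.KNeighborsClassifier(),
    "Linear SVM": sklearn.svm.LinearSVC(max_iter=10000, random_state=0),
    "General SVM": sklearn.svm.SVC(gamma="scale"),
    "Decision Tree": sklearn.tree.DecisionTreeClassifier(random_state=0),
    "Single Layer Perceptron": sklearn.linear_model.Perceptron(random_state=0),
    "Multi-layer Perceptron": sklearn.neural_network.MLPClassifier(
        max_iter=1000, random_state=0,
    ),
}

# Train each model and score it on the held-out test data
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    results[name] = sklearn.metrics.matthews_corrcoef(y_test, clf.predict(X_test))

# Print the models from best to worst
for name, mcc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:25s} MCC = {mcc:.3f}")
```

Some models may emit convergence warnings on these unscaled features, which is one sign that a model is not well matched to the data as given.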

### Tuning Hyperparameters


Try adjusting the number of neighbors (`n_neighbors`) for K-Nearest Neighbors.
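One way to do this is to sweep over several candidate values and compare cross-validated accuracy on the training data, so the test data stays untouched. The candidate values and the 5-fold setting are assumptions.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.neighbors

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y,
)

# Candidate numbers of neighbors (assumed values)
neighbor_counts = [1, 3, 5, 7, 9, 15]

scores = {}
for k in neighbor_counts:
    clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds of the training data
    cv_scores = sklearn.model_selection.cross_val_score(clf, X_train, y_train, cv=5)
    scores[k] = cv_scores.mean()
    print(f"n_neighbors={k:2d} mean accuracy={scores[k]:.3f}")

# The value with the highest mean cross-validated accuracy
best_k = max(scores, key=scores.get)
print(f"Best n_neighbors: {best_k}")
```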

Try adjusting the hyperparameters (e.g., `C` and `gamma`) of the general Support Vector Machine classifier (`sklearn.svm.SVC`).
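For more than one hyperparameter, `sklearn.model_selection.GridSearchCV` tries every combination in a grid. The grid values below are assumptions chosen to keep the search fast, not recommended settings.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.svm

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y,
)

# Every combination of C and gamma is tried (assumed values)
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 1e-4],
}

search = sklearn.model_selection.GridSearchCV(
    sklearn.svm.SVC(kernel="rbf"),
    param_grid,
    cv=5,  # 5-fold cross-validation on the training data
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
print(f"Test accuracy of best model: {search.score(X_test, y_test):.3f}")
```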

Tomorrow:

- Generalization to multi-class classification
- Feature Engineering
  - Dimensionality Reduction
  - Feature Scaling