# Classification algorithms

The objectives of this notebook are:

* Learn and compare the basic model families in shallow machine learning classification problems. Specifically, we'll look at:

    * Logistic regression,
    
    * Nearest neighbours models, 
    
    * Support vector machines,
    
    * Decision Trees, and Ensemble models.
    
    * Neural Networks (very briefly)

* Understand the key hyperparameters that control how these models learn.

* Use decision boundaries to visualize how models make predictions.

* Discuss the properties of an appropriate predictive model.


## The Dataset

First we'll import some data. I'm using an extract from the Rock Property Catalog, https://subsurfwiki.org/wiki/Rock_Property_Catalog

In [None]:
import pandas as pd

df = pd.read_csv('https://geocomp.s3.amazonaws.com/data/RPC_4_lithologies_original.csv')
df.describe()

In [None]:
df.dropna(inplace=True)
df.head(3)

# Exercise

Remove the units from the column names so we can refer to them using just their abbreviations.

In [None]:
new_names = {'Vp [m/s]': 'Vp', 'Vs [m/s]': 'Vs', 'Rho [g/cm³]': 'Rho'}
df = df.rename(new_names, axis='columns')

We'll start our classification journey by applying the logistic regression algorithm to one variable and two classes.

## Logistic Regression

Logistic regression is similar to linear regression, but instead of predicting a continuous variable, it predicts whether something is true or false. It is a classification algorithm. 

Instead of fitting a line to the data, logistic regression fits a logistic function (a.k.a. sigmoid) to the data. The fitted model is then a probability function used to classify new data.

$$f(x) = \frac{1}{1+e^{-\textbf{x}}}$$

$$f(x) = \frac{1}{1+e^{-(\textbf{wx}+b)}}$$

The logistic function has many uses in data analysis and machine learning, especially in data transformations. The curve goes from zero to one, so its output can be read as the probability that a sample belongs to the class of interest. Instead of minimizing a least-squares loss, logistic regression is fitted by maximizing a likelihood function.
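To make the maximum-likelihood idea a little more concrete, here is a minimal sketch that evaluates the mean negative log-likelihood (often called log loss) for two candidate weights on a tiny made-up dataset. The data values and the helper name `negative_log_likelihood` are assumptions for illustration only; `expit` is SciPy's logistic function.

```python
import numpy as np
from scipy.special import expit  # SciPy's logistic (sigmoid) function

def negative_log_likelihood(w, b, x, y):
    """Mean negative log-likelihood of binary labels y under a logistic model."""
    p = expit(w * x + b)               # predicted probability of class 1
    p = np.clip(p, 1e-12, 1 - 1e-12)   # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up data: negative x values belong to class 0, positive ones to class 1.
x_demo = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y_demo = np.array([0, 0, 0, 1, 1, 1])

nll_good = negative_log_likelihood(w=3.0, b=0.0, x=x_demo, y=y_demo)
nll_bad = negative_log_likelihood(w=-1.0, b=0.0, x=x_demo, y=y_demo)
print(f"w=3:  {nll_good:.3f}")
print(f"w=-1: {nll_bad:.3f}")
```

The weight that separates the classes in the right direction gives the lower value. Fitting a logistic regression amounts to searching for the `w` and `b` that make this quantity as small as possible.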

# EXERCISE: 

- Write a function called `sigmoid` that takes x, w, and b as arguments and returns the value of the logistic function.

- Make a plot of the logistic function from x = -10 to 10.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x, w=1, b=0):
    """Logistic function.
    Args:
        x (array or int): input
        w (float or array): the weights of the logistic
        b (float): the intercept (or bias)
    """
    # Your equation goes here.
    return

In [None]:
x = # your code goes here. 
plt.plot(x, sigmoid(x), 'o-')
plt.title('The logistic function where w=1, b=0')

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x, w=1, b=0):
    """Logistic function.
    Args:
        x (array or int): input
        w (float or array): the weights of the logistic
        b (float): the intercept (or bias)
    """
    term = np.exp(-(w * x + b))
    return 1 / (1 + term)

In [None]:
x = np.arange(-10, 11)  # integers from -10 to 10 inclusive
plt.plot(x, sigmoid(x), 'o-')
plt.title('The logistic function where w=1, b=0')

# Also note we can use the expit function in scipy for this (but we don't have to!)
from scipy.special import expit
y = expit(x)
plt.plot(x, y)
plt.plot(x, sigmoid(x), 'o-')

## Select the data for this problem

In [None]:
features = ['Vp']  # a single feature
classes = ['shale', 'dolomite']  # two classes
df_LR = df.loc[df['Lithology'].isin(classes)]

X = df_LR[features].values
y = df_LR['Lithology'].values

## How are these two classes distributed along the `Vp` dimension?

In [None]:
# Set a custom color palette for seaborn
import seaborn as sns
colors = ['goldenrod', 'darkseagreen', 'cornflowerblue', 'blueviolet']
palette = sns.color_palette(colors)
hue_order = df['Lithology'].unique()
_ = sns.histplot(data=df_LR, x='Vp', kde=True, hue='Lithology', palette=palette, hue_order=hue_order)

In [None]:
import sklearn 

print(sklearn.__version__)

# Make sure we can see all of the model details.
sklearn.set_config(print_changed_only=False)

In [None]:
from sklearn.model_selection import train_test_split

X_train_lr, X_val_lr, y_train_lr, y_val_lr = train_test_split(X, y, test_size=0.2, random_state=32)

X_train_lr.shape, X_val_lr.shape

In [None]:
from sklearn.linear_model import LogisticRegressionCV

model = LogisticRegressionCV()

model.fit(X_train_lr, y_train_lr)

y_pred_lr = model.predict(X_val_lr)

We can now use `model.predict()` to perform our classifications, but we can also take its learned coefficients to study the logistic curve that was fitted to the data. The sigmoid alone does not provide the classification directly (that requires a probability cutoff), but it's illustrative to inspect it relative to the data points.

In [None]:
# Create our own logistic function from the learned model parameters
y_test_lr = sigmoid(X_val_lr * model.coef_ + model.intercept_).ravel()
plt.scatter(X_val_lr, y_test_lr)
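As a cross-check on the idea of rebuilding the curve from the learned parameters, the sketch below compares a hand-computed probability with scikit-learn's `predict_proba`, then applies a cutoff to turn probabilities into labels. It uses synthetic data and a plain `LogisticRegression` rather than the notebook's model; every `*_demo` name is an assumption for illustration.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

# Synthetic 1D data standing in for a single feature such as Vp.
rng = np.random.default_rng(0)
x_a = rng.normal(loc=-1.0, scale=1.0, size=100)
x_b = rng.normal(loc=1.5, scale=1.0, size=100)
X_demo = np.concatenate([x_a, x_b]).reshape(-1, 1)
y_demo = np.array(['a'] * 100 + ['b'] * 100)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# Probability of the second class in demo_model.classes_, computed two ways.
p_manual = expit(X_demo @ demo_model.coef_.T + demo_model.intercept_).ravel()
p_sklearn = demo_model.predict_proba(X_demo)[:, 1]
print(np.allclose(p_manual, p_sklearn))

# A cutoff converts probabilities to labels; 0.5 matches predict().
cutoff = 0.5
labels_demo = np.where(p_manual > cutoff,
                       demo_model.classes_[1],
                       demo_model.classes_[0])
```

Raising or lowering `cutoff` moves the decision point along the feature axis without refitting the model.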

In [None]:
from ipywidgets import interact
from ml_utils import logistic_progression

@interact(cutoff=np.arange(0, 1.0, 0.05))
def logistic_regression_plot(cutoff=0.5):
    logistic_progression(model, X_val_lr, y_val_lr, y_test_lr, cutoff)

# Add more features and more classes

## Make a new X and y

In [None]:
features = ['Vp', 'Rho']  # two features
X = df[features].values
y = df['Lithology']  # four classes

## Split the data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

## Standardize the data

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
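To see what the scaler is doing, here is a small sketch on made-up numbers (the arrays and the `demo_scaler` name are assumptions, not the notebook's data). It shows that `StandardScaler` subtracts the training mean and divides by the training standard deviation, and that the validation data reuses those training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two made-up features on very different scales (loosely like Vp and Rho).
X_tr = np.column_stack([rng.normal(4000, 800, 50), rng.normal(2.5, 0.2, 50)])
X_va = np.column_stack([rng.normal(4000, 800, 10), rng.normal(2.5, 0.2, 10)])

demo_scaler = StandardScaler().fit(X_tr)

# The same transform written out by hand, using training statistics only.
mean, std = X_tr.mean(axis=0), X_tr.std(axis=0)
X_va_manual = (X_va - mean) / std

print(np.allclose(demo_scaler.transform(X_va), X_va_manual))
```

Fitting the scaler on the training split only keeps information about the validation data from leaking into the model.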

In [None]:
# make a scatter plot
scatter = sns.relplot(data=df, x='Vp', y='Rho', hue='Lithology', s=80, alpha=0.5, height=6, 
                      palette=palette, hue_order=hue_order)

## Logistic Regression for 4 classes and 2 features

In [None]:
from sklearn.linear_model import LogisticRegressionCV

model = LogisticRegressionCV()
model.fit(X_train, y_train)
y_pred = model.predict(X_val)

In [None]:
from clf_utils import val_vs_pred_scatter

val_vs_pred_scatter(X_val, y_val, y_pred, palette, hue_order)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_val, y_pred)

## Decision regions

We can think of the cutoffs in higher dimensions as decision regions. We'll use this idea to visualize how each model makes its predictions.

In [None]:
from clf_utils import show_decision_regions

show_decision_regions(model, X_train, y_train, X_val, y_val, palette, hue_order)
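`show_decision_regions` comes from the local `clf_utils` module, whose source isn't shown here. As a rough idea of how such a plot can be built (an assumption, not the actual helper), the sketch below predicts a class at every node of a grid covering the feature space and draws the result as filled contours. The data is synthetic and the `*_demo` names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature, three-class data (a stand-in for scaled Vp and Rho).
X_demo, y_demo = make_blobs(n_samples=150, centers=3, random_state=0)
demo_model = LogisticRegression().fit(X_demo, y_demo)

# Build a grid that covers the data, then predict a class at every grid node.
x0 = np.linspace(X_demo[:, 0].min() - 1, X_demo[:, 0].max() + 1, 100)
x1 = np.linspace(X_demo[:, 1].min() - 1, X_demo[:, 1].max() + 1, 100)
xx, yy = np.meshgrid(x0, x1)
grid = np.column_stack([xx.ravel(), yy.ravel()])
zz = demo_model.predict(grid).reshape(xx.shape)

# Filled contours show the regions; the points show the data.
fig, ax = plt.subplots()
ax.contourf(xx, yy, zz, alpha=0.3)
ax.scatter(X_demo[:, 0], X_demo[:, 1], c=y_demo, edgecolor='k')
```

The same grid-and-predict approach works for any fitted classifier with two input features, which is why it is handy for comparing model families.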

## k-NN 

In [None]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5)  # the default is 5

clf.fit(X_train, y_train)

y_pred = clf.predict(X_val)

accuracy_score(y_val, y_pred)

## What about some different values of k?

In [None]:
from ipywidgets import interact

@interact(n=[1, 3, 5, 10, 20, 30, 50, 100, 150])
def decision_boundaries(n):
    clf = KNeighborsClassifier(n_neighbors=n)
    show_decision_regions(clf, X_train, y_train, X_val, y_val, palette, hue_order)
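Alongside the visual comparison, it can help to print training and validation accuracy for a few values of k. This is a minimal sketch on synthetic data (`make_moons` and all variable names here are assumptions, not the lithology data).

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy synthetic two-class data.
X_demo, y_demo = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

scores = {}
for k in [1, 5, 25, 100]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Store (training accuracy, validation accuracy) for this k.
    scores[k] = (knn.score(X_tr, y_tr), knn.score(X_va, y_va))
    print(f"k={k:>3}  train={scores[k][0]:.2f}  val={scores[k][1]:.2f}")
```

With k=1 every training point is its own nearest neighbour, so the training accuracy is perfect. A large gap between training and validation accuracy is one sign of overfitting.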

# EXERCISE:

- Do high or low values of `n` create a smoothing effect?
- What value of `n_neighbors` gives the highest accuracy?
- How many times is shale being "confused" as sandstone?
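As a hint for the last question, one way to count confusions between classes is a confusion matrix. The sketch below uses short made-up label lists (not the notebook's predictions) to show how to read a single cell.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
labels = ['sandstone', 'shale']
y_true_demo = ['shale', 'shale', 'sandstone', 'shale', 'sandstone', 'sandstone']
y_pred_demo = ['shale', 'sandstone', 'sandstone', 'sandstone', 'sandstone', 'shale']

cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Rows are true classes and columns are predicted classes.
shale_as_sandstone = cm[labels.index('shale'), labels.index('sandstone')]
print(cm)
print(shale_as_sandstone)  # 2
```

Passing `labels` explicitly fixes the row and column order, which makes the cell lookup unambiguous.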

# Support-vector machine (SVM)

## Linear SVM

In [None]:
from sklearn.svm import SVC

svc = SVC(kernel='linear')

svc.fit(X_train, y_train)

y_pred = svc.predict(X_val)

accuracy_score(y_val, y_pred)

In [None]:
@interact(C=[0.01, 0.1, 1.0, 10, 100])
def decision_boundaries(C=1):
    clf = SVC(kernel='linear', C=C)
    show_decision_regions(clf, X_train, y_train, X_val, y_val, palette, hue_order)

## Non-linear SVM 

If we employ the **kernel trick** we can fit a nonlinear model. Scikit-learn's `SVC` actually uses this by default:

In [None]:
from sklearn.svm import SVC

svc = SVC(C=1)  # Default is kernel='rbf'

svc.fit(X_train, y_train)  # fit on the training split, not the validation split

y_pred = svc.predict(X_val)

accuracy_score(y_val, y_pred)
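The kernel trick mentioned above relies on a kernel function that measures similarity between pairs of samples. For the RBF kernel this is $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. The sketch below computes it by hand on a few random points and compares the result with scikit-learn's `rbf_kernel`; the array `A` and the value of `gamma` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))  # five made-up samples with two features
gamma = 0.5                  # arbitrary value for illustration

# Squared Euclidean distance between every pair of rows.
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma * sq_dists)

K_sklearn = rbf_kernel(A, gamma=gamma)
print(np.allclose(K_manual, K_sklearn))
```

Each sample has similarity 1 with itself, and the similarity decays towards 0 as points move apart. Larger `gamma` makes that decay faster, which tends to produce more intricate decision regions.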

In [None]:
@interact(C=[0.001, 0.01, 0.1, 1.0, 10, 100, 1000, 1e4])
def decision_boundaries(C=1):
    clf = SVC(kernel='rbf', C=C)  # default
    show_decision_regions(clf, X_train, y_train, X_val, y_val, palette, hue_order)

# EXERCISE:

- What values of C give the highest accuracy?
- What value of C *looks* to yield the most appropriate model?

## Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=3)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_val)

accuracy_score(y_val, y_pred)

In [None]:
from clf_utils import lithology_tree

lithology_tree(clf, features)
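`lithology_tree` is a helper from the local `clf_utils` module. If it isn't available, scikit-learn's `export_text` gives a plain-text view of the learned splits. This sketch uses the bundled iris dataset rather than the lithology data, so the feature names and thresholds are only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(iris.data, iris.target)

# Each indented line is a threshold test on one feature; leaves show the class.
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

Because every split tests a single feature against a threshold, the decision regions of a tree are made of axis-aligned rectangles.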

In [None]:
@interact(max_depth=[2, 3, 4, 5, 6, 7, 8, 9, 10])
def decision_boundaries(max_depth=4):
    clf = DecisionTreeClassifier(max_depth=max_depth)
    show_decision_regions(clf, X_train, y_train, X_val, y_val, palette, hue_order)

## Random Forests and Ensemble methods

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=3).fit(X_train, y_train)

y_pred = clf.predict(X_val)

accuracy_score(y_val, y_pred)
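A random forest is an ensemble of decision trees, and in scikit-learn its predicted class probabilities are the average of the individual trees' probabilities. The sketch below checks this on synthetic data; the dataset and the names `forest` and `mean_proba` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_blobs(n_samples=200, centers=3,
                            cluster_std=2.0, random_state=0)
forest = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=0)
forest.fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree.
tree_probas = np.stack([tree.predict_proba(X_demo)
                        for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

print(len(forest.estimators_))
print(np.allclose(mean_proba, forest.predict_proba(X_demo)))
```

Averaging many slightly different trees is what smooths the blocky regions produced by a single tree.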

In [None]:
@interact(max_depth=np.arange(2, 10), min_samps_leaf=np.arange(1,6))
def decision_boundaries(max_depth=3, min_samps_leaf=3):
    clf = RandomForestClassifier(max_depth=max_depth, min_samples_leaf=min_samps_leaf)
    show_decision_regions(clf, X_train, y_train, X_val, y_val, palette, hue_order)

## Boosted trees

In [None]:
from sklearn.ensemble import GradientBoostingClassifier, ExtraTreesClassifier

clf = GradientBoostingClassifier(random_state=42)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_val)

accuracy_score(y_val, y_pred)
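Unlike a random forest, gradient boosting adds trees one after another, each correcting the errors of the ensemble so far. A minimal sketch of watching that progression with `staged_predict`, on synthetic data (the dataset and variable names are assumptions):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

gbc = GradientBoostingClassifier(n_estimators=50, random_state=0)
gbc.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after each boosting stage.
staged_acc = [accuracy_score(y_va, y_stage)
              for y_stage in gbc.staged_predict(X_va)]
print(f"after 1 tree: {staged_acc[0]:.2f}, after 50 trees: {staged_acc[-1]:.2f}")
```

Plotting `staged_acc` against the stage number is one way to judge whether more trees are still helping on the validation data.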

In [None]:
@interact(max_depth=np.arange(2, 10), min_samps_leaf=np.arange(1,6))
def decision_boundaries(max_depth=3, min_samps_leaf=3):
    clf = GradientBoostingClassifier(max_depth=max_depth, min_samples_leaf=min_samps_leaf)
    show_decision_regions(clf, X_train, y_train, X_val, y_val, palette, hue_order)

## Neural Networks

In [None]:
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=[10, 10],
                    learning_rate='constant',
                    alpha=0.001,
                    max_iter=5000,
                    solver='adam',
                    random_state=42,
                   )

clf.fit(X_train, y_train)

y_pred = clf.predict(X_val)

accuracy_score(y_val, y_pred)
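To get a feel for what `hidden_layer_sizes` controls, the sketch below fits a small network on synthetic data with two features and three classes, then inspects the shapes of the learned weight matrices. The data and the names `mlp`, `weight_shapes`, and `n_params` are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.neural_network import MLPClassifier

# Two features, three classes.
X_demo, y_demo = make_blobs(n_samples=300, centers=3, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=[10, 10], max_iter=2000, random_state=42)
mlp.fit(X_demo, y_demo)

# One weight matrix per layer: inputs -> hidden 1 -> hidden 2 -> outputs.
weight_shapes = [w.shape for w in mlp.coefs_]
n_params = (sum(w.size for w in mlp.coefs_)
            + sum(b.size for b in mlp.intercepts_))
print(weight_shapes)  # [(2, 10), (10, 10), (10, 3)]
print(n_params)       # 173
```

Even this small network has 173 trainable parameters, which is part of why neural networks have more to tune than the other models here.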

In [None]:
@interact(hl1=np.arange(2, 15, 1))
def decision_boundaries(hl1=3):
    clf = MLPClassifier(hidden_layer_sizes=[hl1],
                    learning_rate='constant',
                    alpha=0.001,
                    max_iter=5000,
                    solver='adam',
                    random_state=42,
                   )
    show_decision_regions(clf, X_train, y_train, X_val, y_val, palette, hue_order)

---

## Choosing the right estimator

Often the hardest part of solving a machine learning problem can be finding the right estimator for the job.

This is a good place to start ([here](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) is a clickable version):

<img src="https://scikit-learn.org/stable/_static/ml_map.png"></img>

---

Different estimators are better suited for different types of data and different problems. For a classifier comparison (below) check the source code [here](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png"></img>

###  [Check out this paper with a comparison of many classifiers](https://arxiv.org/abs/1708.05070)

# Summary

- Choosing a model means you are making an interpretation
- You need to know the key hyperparameters that affect how models learn
- KNN and SVM models have few hyperparameters
- Decision Trees and Neural Networks have more hyperparameters so they can be harder to "tune"
- All models can be underfit and overfit to your data

# Next steps

- Classification Reports and Confusion Matrices. Accuracy is usually not enough. We need more detailed performance metrics.
- Choosing hyperparameters visually is not good enough; we need to get systematic.