# Basics of ML

## Supervised learning

In supervised learning (S/L), we are given outcomes (or labels) $y$ and features (or design matrix) $X$. The task is to predict the outcome $y_i$ given the feature vector $x_i$, the $i$-th row of $X$. If $y$ is categorical, this is known as **classification**. If $y$ is continuous, this is known as **regression**. The features in $X$ can be all continuous, all categorical, or a mixture of continuous and categorical.
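
As a minimal sketch of this distinction, the toy arrays below (made-up values, not part of any dataset used later) show a design matrix with one row per sample and two kinds of outcome vector. The helper `type_of_target` from scikit-learn is used here only to illustrate how the type of $y$ separates classification from regression.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical toy data: 4 samples (rows) and 2 features (columns).
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9],
              [5.9, 3.2]])

y_class = np.array([0, 0, 1, 2])        # categorical labels -> classification
y_reg = np.array([1.4, 1.3, 4.3, 5.1])  # continuous outcomes -> regression

print(X.shape)                  # (4, 2)
print(type_of_target(y_class))  # multiclass
print(type_of_target(y_reg))    # continuous
```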

## Classification example

### What the data in S/L looks like

In [None]:
from sklearn import datasets

In [None]:
iris = datasets.load_iris(as_frame=True)

In [None]:
iris.keys()

#### Features `X`

In [None]:
iris['data'].shape

In [None]:
iris['data'][:3]

#### Outcomes `y`

In [None]:
iris['target'].shape

In [None]:
iris['target'][:3]

#### Target names (y)

In [None]:
iris['target_names']

#### Feature names (X)

In [None]:
iris['feature_names']

### Basic steps in S/L

1. Split into training and test data sets
2. Scale continuous features to have zero mean and unit standard deviation
3. Choose an appropriate ML algorithm
4. Assemble the scaling step and the ML algorithm into a pipeline
5. Train the pipeline on the training data set
6. Evaluate the trained ML algorithm on the test data set
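
The scaling step can be sketched on its own, assuming the iris data and the same split settings used in this notebook. The point it tries to show is that the scaler learns its means and standard deviations from the training data only and then applies them unchanged to the test data. The variable names ending in `_scaled` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Learn per-feature mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train)

# Apply the same training statistics to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training features now have (numerically) zero mean and unit standard deviation.
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print(np.round(train_means, 6), np.round(train_stds, 6))
```

Placing the scaler inside a `Pipeline`, as done later in this notebook, takes care of this fit-on-train, transform-on-test bookkeeping automatically.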

#### Split into training and test data sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = iris['data']
y = iris['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
X.shape, X_train.shape, X_test.shape

In [None]:
y.shape, y_train.shape, y_test.shape
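
One optional variation, sketched here as an assumption rather than something this notebook requires, is to pass `stratify=y` so that each class appears in roughly the same proportion in the training and test sets. The names `train_counts` and `test_counts` are introduced only for this illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris(as_frame=True)
X, y = iris['data'], iris['target']

# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

# Count how many samples of each class ended up in each split.
train_counts = y_train.value_counts().sort_index()
test_counts = y_test.value_counts().sort_index()
print(train_counts)
print(test_counts)
```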

#### Choose an appropriate ML algorithm

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

In [None]:
pipe = Pipeline([
    ('scaler', StandardScaler()),       # standardize each feature
    ('knn', KNeighborsClassifier())     # k-nearest neighbors (default k=5)
])

In [None]:
pipe.fit(X_train, y_train)
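
The last of the basic steps, evaluation on the test data set, can be sketched numerically as follows. This is a self-contained example that rebuilds the same split and pipeline as above. Using `accuracy_score` as the metric is an assumption made here, and other metrics may suit other problems better.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris(as_frame=True)
X, y = iris['data'], iris['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
pipe.fit(X_train, y_train)

# Predict labels for data the pipeline has not seen during training.
y_pred = pipe.predict(X_test)

# Fraction of test samples whose predicted label matches the true label.
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_accuracy:.3f}")
```

For a classifier, `pipe.score(X_test, y_test)` returns the same accuracy in a single call.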

## Machine learning visualization

In [None]:
import warnings
warnings.simplefilter('ignore', UserWarning)

In [None]:
from yellowbrick.features import parallel_coordinates

viz = parallel_coordinates(X, y, classes=iris['target_names'])

In [None]:
from yellowbrick.features.manifold import manifold_embedding

viz = manifold_embedding(X, y, manifold="tsne", n_neighbors=10)

In [None]:
from yellowbrick.features import rank2d

viz = rank2d(X, y, algorithm="pearson")

In [None]:
from yellowbrick.features import joint_plot

viz = joint_plot(X, y, columns=['petal length (cm)', 'petal width (cm)'])

In [None]:
from yellowbrick.classifier import classification_report

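# Fit on the training data and report metrics on the held-out test data
viz = classification_report(
    pipe,
    X_train, y_train,
    X_test, y_test,
    classes=iris['target_names'])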

In [None]:
from yellowbrick.classifier import confusion_matrix

viz = confusion_matrix(
    pipe,
    X_train, y_train, 
    X_test, y_test,
    classes=iris['target_names'])

In [None]:
from yellowbrick.classifier import precision_recall_curve

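# Fit on the training data and score on the held-out test data
viz = precision_recall_curve(pipe, X_train, y_train, X_test=X_test, y_test=y_test)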

In [None]:
from yellowbrick.classifier import roc_auc

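# Fit on the training data and score on the held-out test data
viz = roc_auc(pipe, X_train, y_train, X_test=X_test, y_test=y_test)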

In [None]:
from yellowbrick.classifier import class_prediction_error

viz = class_prediction_error(
    pipe,
    X_train, y_train, 
    X_test, y_test,
    classes=iris['target_names'])

## Exercise

Repeat the visualizations for the mushroom dataset.

In [None]:
from yellowbrick.datasets.loaders import load_mushroom

In [None]:
X, y = load_mushroom()
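
A possible starting point for the exercise: when features are categorical rather than continuous, the `StandardScaler` step no longer applies, and the categories generally need to be encoded as numbers before a model such as k-nearest neighbors can use them. The sketch below uses a small made-up data frame as a stand-in. Its column names and values are hypothetical and are not taken from the mushroom data, and one-hot encoding is just one reasonable choice among several.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical features and labels, for illustration only.
X_cat = pd.DataFrame({
    'shape': ['convex', 'bell', 'convex', 'flat', 'bell', 'flat'],
    'surface': ['smooth', 'smooth', 'scaly', 'scaly', 'smooth', 'scaly'],
    'color': ['brown', 'yellow', 'white', 'white', 'yellow', 'brown'],
})
y_cat = pd.Series(
    ['poisonous', 'edible', 'poisonous', 'edible', 'edible', 'poisonous'])

# Replace the scaler with a one-hot encoder: one 0/1 column per category.
cat_pipe = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ('knn', KNeighborsClassifier(n_neighbors=3))
])
cat_pipe.fit(X_cat, y_cat)

# Inspect the encoded shape: 3 + 2 + 3 distinct categories -> 8 columns.
encoded = cat_pipe.named_steps['onehot'].transform(X_cat)
preds = cat_pipe.predict(X_cat)
print(encoded.shape, preds)
```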