# Basics of ML

## Supervised learning

In supervised learning (S/L), we are given outcomes (or labels) $y$ and features (or a design matrix) $X$. The task is to predict $y_i$ given the feature vector $x_i$. If $y$ is categorical, this is known as **classification**. If $y$ is continuous, this is known as **regression**. The features in $X$ can be all continuous, all categorical, or a mixture of continuous and categorical.
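As a rough illustration of the distinction, the sketch below compares the outcome column of two datasets bundled with `sklearn`. The choice of `load_diabetes` as the regression example is an assumption made here for contrast; it is not used elsewhere in this notebook.

```python
from sklearn import datasets

# Classification: the outcome is a class label (one integer code per species).
iris = datasets.load_iris(as_frame=True)
print(iris['target'].unique())

# Regression: the outcome is a continuous measurement.
# (load_diabetes is chosen here only as a contrasting example.)
diabetes = datasets.load_diabetes(as_frame=True)
print(diabetes['target'].head())
```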

## Classification example

### What the data in S/L looks like

In [None]:
from sklearn import datasets

In [None]:
iris = datasets.load_iris(as_frame=True)

In [None]:
iris.keys()

#### Features `X`

In [None]:
iris['data'].shape

In [None]:
iris['data'][:3]

#### Outcomes `y`

In [None]:
iris['target'].shape

In [None]:
iris['target'][:3]

#### Target names (y)

In [None]:
iris['target_names']

#### Feature names (X)

In [None]:
iris['feature_names']

### Basic steps in S/L

1. Split into training and test data sets
2. Scale continuous features to have zero mean and unit standard deviation
3. Choose an appropriate ML algorithm
4. Train the ML algorithm on the train data set
5. Assemble into pipeline
6. Evaluate the trained ML algorithm on the test data set

#### Split into training and test data sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = iris['data']
y = iris['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
X.shape, X_train.shape, X_test.shape

In [None]:
y.shape, y_train.shape, y_test.shape

#### Scale continuous features to have zero mean and unit standard deviation

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
X_train_scaled = scaler.fit_transform(X_train)

In [None]:
X_train.mean(axis=0)

In [None]:
X_train.std(axis=0)

In [None]:
X_train_scaled.mean(axis=0)

In [None]:
X_train_scaled.std(axis=0)
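The cells above only scale the training set. A minimal sketch of how the test set would typically be handled, assuming the same split as above: the scaler is fitted on the training data only, and the learned mean and standard deviation are then reused on the test data. The names `X_test_scaled` and `manual` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], test_size=0.33, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from the training set
X_test_scaled = scaler.transform(X_test)        # reuse them; do not refit on the test set

# The same arithmetic by hand: subtract the training mean, divide by the training std.
manual = (X_test.to_numpy() - scaler.mean_) / scaler.scale_
print(np.allclose(X_test_scaled, manual))

# The test-set means are close to zero, but not exactly zero.
print(X_test_scaled.mean(axis=0))
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas the pandas `std` method defaults to the sample standard deviation (`ddof=1`), so the two can differ slightly on the unscaled data.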

#### Choose an appropriate ML algorithm

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
clf = KNeighborsClassifier()

#### Train the ML algorithm on the train data set

In [None]:
clf.fit(X_train_scaled, y_train)

#### Assemble into pipeline

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
pipe = Pipeline([
    ('scaler', StandardScaler()), 
    ('knn', KNeighborsClassifier())
])

In [None]:
pipe.fit(X_train, y_train)
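To see what the pipeline does on our behalf, the sketch below compares it with the manual scale-then-fit steps from the earlier cells. It assumes the same split and default settings as above; `manual_pred` and `pipe_pred` are names introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], test_size=0.33, random_state=42
)

# Manual route: scale, fit, then scale the test set with the same scaler.
scaler = StandardScaler()
clf = KNeighborsClassifier()
clf.fit(scaler.fit_transform(X_train), y_train)
manual_pred = clf.predict(scaler.transform(X_test))

# Pipeline route: the same two steps chained into one estimator.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Both routes should give identical predictions.
print(np.array_equal(manual_pred, pipe_pred))
```

The practical benefit is that the pipeline takes raw features for both `fit` and `predict`, so there is no way to forget the scaling step or to accidentally refit the scaler on the test data.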

#### Evaluate the trained ML algorithm on the test data set

##### Evaluate default score (accuracy)

In [None]:
pipe.score(X_test, y_test)
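For a classifier, the default score is the fraction of test samples predicted correctly. A small sketch that recomputes it by hand, assuming the same pipeline and split as above (`manual_accuracy` is a name introduced here):

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], test_size=0.33, random_state=42
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)

# Accuracy = number of correct predictions / number of test samples.
manual_accuracy = (y_test == pred).mean()

# All three values should agree.
print(manual_accuracy, pipe.score(X_test, y_test), accuracy_score(y_test, pred))
```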

##### See all predictions

In [None]:
pred = pipe.predict(X_test)

In [None]:
import pandas as pd

In [None]:
pd.DataFrame(dict(pred=pred, y=y_test))[:5]

##### See confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
cm = confusion_matrix(y_test, pred)
cm
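The raw array is easier to read with class names attached. The sketch below wraps it in a `DataFrame`, assuming the same pipeline and split as above; in `sklearn`, rows of the confusion matrix are the true classes and columns are the predicted classes. The name `cm_df` is introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], test_size=0.33, random_state=42
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)

# Fix the label order so that row/column i corresponds to target_names[i].
cm = confusion_matrix(y_test, pred, labels=[0, 1, 2])

names = list(iris['target_names'])
cm_df = pd.DataFrame(
    cm,
    index=pd.Index(names, name='true'),
    columns=pd.Index(names, name='predicted'),
)
print(cm_df)

# Correct predictions lie on the diagonal, so the trace recovers the accuracy.
print(np.trace(cm) / cm.sum())
```

If `matplotlib` is installed, `ConfusionMatrixDisplay(cm, display_labels=names).plot()` should draw the same table as a heat map.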

## Exercise

Use a classifier to predict if a mushroom is edible or poisonous.

- If you like, try using a different classifier from `sklearn`

In [None]:
from yellowbrick.datasets.loaders import load_mushroom

In [None]:
X, y = load_mushroom()