# Video: Classifying Data with Scikit-Learn

In this video, we will use scikit-learn to build models that classify data into different classes.

## Classifying Data with Scikit-Learn

![Palmer Penguins illustration showing chinstrap, gentoo, and adelie penguins.](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png)
Artwork by @allison_horst



## Our Classification Problem - Predict Penguin Sex

The original Palmer Penguins paper looked at size differences by sex...
* Penguin sex was determined by genetic tests on blood samples.
  * Insufficient blood samples were the reason for most of the missing data.
* We will try to predict sex based on size differences.
  * This is just an example.
  * It would not make sense to study size differences by sex if sex were itself predicted from size...
  * We use just Gentoo penguins to avoid size differences between species.

## Classifying vs Regressing

A scikit-learn classifier shares with regression models
* `fit` method
* `predict` method

And adds
* `predict_proba` method returning class probabilities
* `classes_` attribute storing class identifiers

## Regression vs Classification fit()

What are the inputs to `model.fit(X, y)`?
* Regression: X=features, y=numeric targets
* Classification: X=features, y=class (category) targets


## Regression vs Classification predict()

What are the inputs to `model.predict(X)`?
* Both: X=features

What are the outputs from `model.predict(X)`?
* Regression: predicted numeric value
* Classification: predicted class identifier
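
As a rough illustration of the difference, the sketch below fits a nearest-neighbors regressor and a nearest-neighbors classifier on the same made-up one-feature data. The toy values and variable names are assumptions for illustration only and are not part of the penguin analysis.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Made-up toy data: one feature per row.
X = [[0.0], [1.0], [2.0], [3.0]]
y_numeric = [0.0, 1.0, 2.0, 3.0]                # regression targets are numbers
y_class = ["small", "small", "large", "large"]  # classification targets are categories

# Both kinds of model are fit with the same fit(X, y) call.
regressor = KNeighborsRegressor(n_neighbors=1).fit(X, y_numeric)
classifier = KNeighborsClassifier(n_neighbors=1).fit(X, y_class)

# Both kinds of model take only features in predict(X).
X_new = [[0.2], [2.9]]
regressor_output = regressor.predict(X_new)    # predicted numeric values
classifier_output = classifier.predict(X_new)  # predicted class identifiers

print(regressor_output)
print(classifier_output)
```

With one neighbor, each prediction simply copies the target of the closest training row, which makes the numeric versus categorical outputs easy to compare.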


## Classification predict() vs predict_proba()

What are the inputs to `model.predict(X)` and `model.predict_proba(X)`?
* Both: X=features

What are the outputs from `model.predict(X)` and `model.predict_proba(X)`?
* `predict`: One predicted class per input row.
* `predict_proba`: One column per class with a predicted probability of that class.
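
The output shapes can be checked with a small sketch. The data below is made up for illustration, and the names `model`, `X_new`, `predicted_classes`, and `predicted_probabilities` are assumptions, not names from this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data with two classes, "a" and "b".
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = ["a", "a", "a", "b", "b", "b"]
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

X_new = [[0.4], [2.4], [4.9]]
predicted_classes = model.predict(X_new)
predicted_probabilities = model.predict_proba(X_new)

print(predicted_classes.shape)        # one entry per input row
print(predicted_probabilities.shape)  # one row per input row, one column per class
print(predicted_probabilities)

# The probabilities in each row sum to 1.
row_sums = predicted_probabilities.sum(axis=1)
print(row_sums)
```

The middle input row has two "a" neighbors and one "b" neighbor, so its probabilities are split between the two columns while `predict` still returns a single class.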

## Classification class identifiers

* What is in the `classes_` attribute?
  * A sequence of all the known classes.
  * Roughly `sorted(set(y))` with `y` from `fit(X,y)`
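
A minimal sketch of that relationship, using made-up toy data. The names `known_classes` and `male_column` are assumptions for illustration. It also shows one way to look up the `predict_proba` column for a class instead of hard-coding its position.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data; the labels are deliberately not in sorted order.
X = [[0.0], [1.0], [2.0], [3.0]]
y = ["MALE", "FEMALE", "MALE", "FEMALE"]
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

known_classes = list(model.classes_)
print(known_classes)
print(sorted(set(y)))

# predict_proba columns follow the order of classes_,
# so the column for a class can be looked up by name.
male_column = known_classes.index("MALE")
male_probabilities = model.predict_proba(X)[:, male_column]
print(male_probabilities)
```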

## Classification Algorithm - Nearest Neighbors

* Rough idea: pick the $k$ training rows which are numerically closest to the input row.
* Predict: the most common class among those $k$ rows.
* Probability for class $c$: the number of those $k$ rows with class $c$, divided by $k$.
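
The rough idea above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the function names and toy data are hypothetical, distances are Euclidean, and ties are broken by whichever class sorts first.

```python
import math
from collections import Counter

def nearest_neighbors_proba(train_rows, train_classes, input_row, k):
    """Return {class: fraction of the k closest training rows with that class}."""
    # Order the training row indexes by Euclidean distance to the input row.
    order = sorted(range(len(train_rows)),
                   key=lambda i: math.dist(train_rows[i], input_row))
    # Count the classes among the k closest rows.
    counts = Counter(train_classes[i] for i in order[:k])
    return {c: counts[c] / k for c in sorted(set(train_classes))}

def nearest_neighbors_predict(train_rows, train_classes, input_row, k):
    """Return the most common class among the k closest training rows."""
    proba = nearest_neighbors_proba(train_rows, train_classes, input_row, k)
    return max(proba, key=proba.get)

# Made-up toy data: three rows of class "a" and three of class "b".
train_rows = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_classes = ["a", "a", "a", "b", "b", "b"]

proba = nearest_neighbors_proba(train_rows, train_classes, (2.5, 2.0), k=5)
prediction = nearest_neighbors_predict(train_rows, train_classes, (2.5, 2.0), k=5)
print(proba)
print(prediction)
```

With $k = 5$, the five closest rows are the three "a" rows and two of the "b" rows, so the probabilities are 3/5 and 2/5.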

## Default Distance Metric for Nearest Neighbors

Euclidean distance
* $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2 + \cdots}$
* We use this distance function in up to 3 dimensions in everyday life.
* Tends to be dominated by dimensions with bigger scales, such as body mass in grams versus lengths in millimeters.
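
To make the scale issue concrete, the sketch below computes the Euclidean distance between the penguins with sample numbers 1 and 2 from the data table in this notebook, then checks how much of the squared distance comes from body mass alone. The variable names are assumptions for illustration.

```python
import math

# Sample numbers 1 and 2 from the penguin table in this notebook:
# culmen length (mm), culmen depth (mm), flipper length (mm), body mass (g)
penguin_1 = (46.1, 13.2, 211.0, 4500.0)
penguin_2 = (50.0, 16.3, 230.0, 5700.0)

# Euclidean distance across all four dimensions.
distance = math.dist(penguin_1, penguin_2)

# Contribution of each dimension to the squared distance.
squared_differences = [(a - b) ** 2 for a, b in zip(penguin_1, penguin_2)]
body_mass_share = squared_differences[3] / sum(squared_differences)

print(round(distance, 2))
print(round(body_mass_share, 4))
```

The distance is barely larger than the 1200 g body mass difference on its own, so the three millimeter measurements have almost no influence on which rows count as nearest.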


In [None]:
import numpy as np
import pandas as pd

In [None]:
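# Load the penguin measurements and index the rows by sample number.
penguins = pd.read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.220.3&entityid=e03b43c924f226486f2f0ab6709d2381", index_col="Sample Number")
# Keep only the target column and the size measurements (columns with units of mm or g).
penguins_keep_columns = [c for c in penguins.columns if c == "Sex" or "(mm)" in c or "(g)" in c]
penguins = penguins[penguins_keep_columns]

# Keep only rows with a known sex and no missing measurements.
penguins = penguins.query("Sex in ('FEMALE', 'MALE')")
penguins = penguins.dropna()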
penguins


Sample Number,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
1,46.1,13.2,211.0,4500.0,FEMALE
2,50.0,16.3,230.0,5700.0,MALE
3,48.7,14.1,210.0,4450.0,FEMALE
4,50.0,15.2,218.0,5700.0,MALE
5,47.6,14.5,215.0,5400.0,MALE
...,...,...,...,...,...
119,47.2,13.7,214.0,4925.0,FEMALE
121,46.8,14.3,215.0,4850.0,FEMALE
122,50.4,15.7,222.0,5750.0,MALE
123,45.2,14.8,212.0,5200.0,FEMALE


In [None]:
penguins_features = penguins.drop("Sex", axis=1)
penguins_features

Sample Number,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g)
1,46.1,13.2,211.0,4500.0
2,50.0,16.3,230.0,5700.0
3,48.7,14.1,210.0,4450.0
4,50.0,15.2,218.0,5700.0
5,47.6,14.5,215.0,5400.0
...,...,...,...,...
119,47.2,13.7,214.0,4925.0
121,46.8,14.3,215.0,4850.0
122,50.4,15.7,222.0,5750.0
123,45.2,14.8,212.0,5200.0


In [None]:
penguins_target = penguins["Sex"]
penguins_target

Sample Number
1      FEMALE
2        MALE
3      FEMALE
4        MALE
5        MALE
        ...  
119    FEMALE
121    FEMALE
122      MALE
123    FEMALE
124      MALE
Name: Sex, Length: 119, dtype: object

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Fit a classifier that votes among the 5 nearest training rows.
nearest_neighbors_model = KNeighborsClassifier(n_neighbors=5).fit(penguins_features, penguins_target)
nearest_neighbors_model

In [None]:
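# Predict on the same rows that were used for fitting and compare to the known sex.
penguins_check = penguins.copy()
penguins_check["prediction"] = nearest_neighbors_model.predict(penguins_features)
penguins_check["prediction_check"] = penguins_check["prediction"] == penguins_target
# Count the incorrect (False) and correct (True) predictions.
penguins_check.groupby("prediction_check").size()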

prediction_check
False      9
True     110
dtype: int64

In [None]:
penguins_check.query("prediction_check == False")

Sample Number,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,prediction,prediction_check
8,46.7,15.3,219.0,5200.0,MALE,FEMALE,False
10,46.8,15.4,215.0,5150.0,MALE,FEMALE,False
35,49.1,14.8,220.0,5150.0,FEMALE,MALE,False
44,49.6,15.0,216.0,4750.0,MALE,FEMALE,False
49,44.9,13.3,213.0,5100.0,FEMALE,MALE,False
58,45.5,15.0,220.0,5000.0,MALE,FEMALE,False
74,46.5,14.8,217.0,5200.0,FEMALE,MALE,False
97,49.4,15.8,216.0,4925.0,MALE,FEMALE,False
123,45.2,14.8,212.0,5200.0,FEMALE,MALE,False


In [None]:
nearest_neighbors_model.predict_proba(penguins_features.head())

array([[1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.]])

In [None]:
nearest_neighbors_model.classes_

array(['FEMALE', 'MALE'], dtype=object)

In [None]:
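# Compute the probabilities once; column order follows nearest_neighbors_model.classes_ (FEMALE, then MALE).
penguins_proba = nearest_neighbors_model.predict_proba(penguins_features)
penguins_check["proba_female"] = penguins_proba[:, 0]
penguins_check["proba_male"] = penguins_proba[:, 1]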
penguins_check.query("prediction_check == False")

Sample Number,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,prediction,prediction_check,proba_female,proba_male
8,46.7,15.3,219.0,5200.0,MALE,FEMALE,False,0.6,0.4
10,46.8,15.4,215.0,5150.0,MALE,FEMALE,False,0.6,0.4
35,49.1,14.8,220.0,5150.0,FEMALE,MALE,False,0.4,0.6
44,49.6,15.0,216.0,4750.0,MALE,FEMALE,False,0.8,0.2
49,44.9,13.3,213.0,5100.0,FEMALE,MALE,False,0.2,0.8
58,45.5,15.0,220.0,5000.0,MALE,FEMALE,False,0.6,0.4
74,46.5,14.8,217.0,5200.0,FEMALE,MALE,False,0.4,0.6
97,49.4,15.8,216.0,4925.0,MALE,FEMALE,False,0.8,0.2
123,45.2,14.8,212.0,5200.0,FEMALE,MALE,False,0.4,0.6


## Classifying Data with Scikit-Learn Recap

* Same `fit`/`predict` interface as regression models.
* Plus class probability predictions from `predict_proba`, with columns ordered as in `classes_`.