# Video: Classifying Data with Scikit-Learn

In this video, we will use scikit-learn to build models classifying data into different classes.

## Classifying Data with Scikit-Learn

![Palmer Penguins illustration showing chinstrap, gentoo, and adelie penguins.](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png)
Artwork by @allison_horst



Script:
* In this video, I will show off scikit-learn's classification functionality using the Palmer Penguins data set.


## Our Classification Problem - Predict Penguin Sex

Original Palmer Penguins paper looked at size differences by sex...
* Penguin sex was determined by genetic tests on blood samples.
  * Insufficient blood samples were the reason for most of the missing data.
* Will try to predict based on size differences.
  * Just as an example here.
  * Would not make sense to look at size differences by sex predicted from size...
  * Just Gentoo penguins to avoid differences in species sizes.

Script:
* The original studies behind this penguin data aimed to validate a hypothesis that penguin size differences between the sexes would vary with the availability of food in the surrounding environment.
* If food was abundant, competition for food would be less, and size differences from sex-specific behavior would be smaller.
* They collected data by measuring the penguins' bodies with calipers, weighing them, and taking blood samples.
* They used the blood samples to run genetic tests to determine the penguin sexes, and to infer some of their feeding patterns from isotope ratios.
* Missing or insufficient blood samples were responsible for most of the missing data entries in this data set.
* So the classification problem we will look at now will be predicting penguin sex from the size measurements.
* Note that this would not be appropriate for the original analysis.
* It would be circular reasoning to use the size measurements to predict sex and then ask questions about size differences.

## Classifying vs Regressing

A scikit-learn classifier shares with regression models
* `fit` method
* `predict` method

And adds
* `predict_proba` method returning class probabilities
* `classes_` attribute storing class identifiers

Script:
* In scikit-learn, classification models have a very similar interface to regression models.
* The same fit and predict methods are present, but the numeric targets and outputs are replaced with class targets and outputs.
* I will walk through each of these.

## Regression vs Classification fit()

What are the inputs to `model.fit(X, y)`?
* Regression: X=features, y=numeric targets
* Classification: X=features, y=class (category) targets


Script:
* The fit methods of both regression and classification models take in the same kinds of features, but classification targets are classes instead of numbers.
* In this context, the word class has multiple meanings.
* There are the programming classes that define how these models work, and the category classes that we are trying to distinguish with those models.
* The programming classes are named after the category classes.
* And the targets of the classifier are the category classes.
* For the penguin sex prediction problem, the category classes are female and male.
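
A minimal sketch of the two kinds of `fit` calls, using made-up toy values rather than the penguin data. The variable names, the labels, and the choice of `LinearRegression` as the regression model are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy features: one column, six rows.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Regression: y holds numeric targets.
y_numeric = np.array([1.5, 2.5, 3.5, 10.5, 11.5, 12.5])
regressor = LinearRegression().fit(X, y_numeric)

# Classification: y holds category classes (strings here).
y_class = np.array(["small", "small", "small", "large", "large", "large"])
classifier = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)

# "Class" in the programming sense vs. the category sense.
programming_class = type(classifier).__name__
category_classes = sorted(set(y_class.tolist()))
print(programming_class, category_classes)
```

The features `X` are identical in both calls; only the kind of target changes.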

## Regression vs Classification predict()

What are the inputs to `model.predict(X)`?
* Both: X=features

What are the outputs from `model.predict(X)`?
* Regression: predicted numeric value
* Classification: predicted class identifier


Script:
* Like the fit method, the predict method of both regression and classification models takes in the same kinds of features.
* And the change in targets for the fit method carries over to the output of the predict method.
* The predict method of classification models returns classes instead of numeric predictions.
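
To make the output difference concrete, here is a small sketch that calls `predict` on a regression model and a classification model with the same new rows. The toy values, labels, and the use of `LinearRegression` are assumptions, not part of the penguin example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy training data.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_numeric = np.array([1.5, 2.5, 3.5, 10.5, 11.5, 12.5])
y_class = np.array(["small", "small", "small", "large", "large", "large"])

regressor = LinearRegression().fit(X, y_numeric)
classifier = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)

# The same kind of input goes to both predict methods.
X_new = np.array([[2.5], [10.5]])

numeric_predictions = regressor.predict(X_new)  # one number per input row
class_predictions = classifier.predict(X_new)   # one class label per input row

print(numeric_predictions)
print(class_predictions)
```

Both calls return one entry per input row; the classifier's entries are drawn from the labels it saw in `fit`.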


## Classification predict() vs predict_proba()

What are the inputs to `model.predict(X)` and `model.predict_proba(X)`?
* Both: X=features

What are the outputs from `model.predict(X)` and `model.predict_proba(X)`?
* `predict`: One predicted class per input row.
* `predict_proba`: One column per class with a predicted probability of that class.

Script:
* What is the difference between the predict method and the new predict_proba method of classification models?
* The predict method picks a single class for each input row.
* Essentially an all or nothing choice.
* The predict_proba method gives a probability for each known class.
* The probabilities returned give a more nuanced answer reflecting uncertainty in the choice.
* The model still could predict 100% for a class, but splitting the probabilities allows more flexibility in the predictions.
* It also makes it clear when the prediction is not confident, or another class is almost as likely.
* All of these probabilities should add up to one in each row.
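
The sketch below compares the two methods on hypothetical toy data where the classes overlap a little. The data and names are made up; `KNeighborsClassifier` is used only because it appears later in this video.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: classes "a" and "b" overlap in the middle.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array(["a", "a", "a", "b", "a", "b", "b", "b"])
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

X_new = np.array([[1.4], [4.4], [5.4], [7.6]])

labels = model.predict(X_new)               # one class per row
probabilities = model.predict_proba(X_new)  # one column per class

# Each row of probabilities adds up to one.
row_sums = probabilities.sum(axis=1)

# With no tied votes, the predicted label is the class with the highest probability.
most_likely = model.classes_[probabilities.argmax(axis=1)]

print(labels)
print(probabilities)
```

The first and last rows come back with a probability of one for a single class, while the two middle rows split their probability between the classes.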

## Classification class identifiers

* What is in the `classes_` attribute?
  * A sequence of all the known classes.
  * Roughly `sorted(set(y))` with `y` from `fit(X, y)`

Script:
* The classes underscore attribute lists all the known classes.
* These classes may be strings like in our penguin example.
* Another common type is integers, if there exists an outside mapping of integers to classes.
* Whatever the type, the classes are collected from the y targets in the call to fit, and the distinct values are sorted and saved in this attribute.
* The probabilities in `predict_proba` will match the order of the classes underscore attribute.
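
A short sketch of how `classes_` relates to the targets and to the `predict_proba` columns. The four training rows are hypothetical, and the labels are deliberately given out of sorted order.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data with string labels in unsorted order.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array(["MALE", "FEMALE", "MALE", "FEMALE"])
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# classes_ holds the distinct targets in sorted order.
classes = model.classes_.tolist()
print(classes == sorted(set(y.tolist())))

# Column i of predict_proba corresponds to classes[i].
probabilities = model.predict_proba(np.array([[1.1]]))
by_class = dict(zip(classes, probabilities[0].tolist()))
print(by_class)

# Integer labels work the same way.
y_int = np.array([1, 0, 1, 0])
int_classes = KNeighborsClassifier(n_neighbors=1).fit(X, y_int).classes_.tolist()
print(int_classes)
```

Pairing `classes_` with a row of probabilities, as in `by_class`, avoids having to remember which column belongs to which class.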


## Classification Algorithm - Nearest Neighbors

* Rough idea: pick $k$ rows which are numerically closest to the input row.
* Predict: most common class in the $k$ rows.
* Probability for class $c$: number of rows with class $c$ among the $k$ rows, divided by $k$.
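
The three bullets above can be sketched in plain Python. This is an illustration of the idea, not scikit-learn's implementation: the function names and toy rows are hypothetical, "closest" is assumed to mean straight-line (Euclidean) distance, and ties in distance simply keep training order.

```python
import math
from collections import Counter


def knn_class_fractions(train_rows, train_classes, row, k):
    """Fraction of the k nearest training rows that carry each class."""
    # Rank training rows by Euclidean distance to the input row.
    nearest = sorted(
        range(len(train_rows)),
        key=lambda i: math.dist(train_rows[i], row),
    )[:k]
    votes = Counter(train_classes[i] for i in nearest)
    return {c: votes[c] / k for c in sorted(set(train_classes))}


def knn_predict(train_rows, train_classes, row, k):
    """Most common class among the k nearest training rows."""
    fractions = knn_class_fractions(train_rows, train_classes, row, k)
    return max(fractions, key=fractions.get)


# Hypothetical toy rows with two features each.
train_rows = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_classes = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_rows, train_classes, (2.5, 2.0), k=3))
print(knn_class_fractions(train_rows, train_classes, (2.5, 2.0), k=5))
```

With $k = 5$, the fractions can only be multiples of one fifth, which is why a five-neighbor model reports probabilities such as 0.6 and 0.4.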

Script:
* The nearest neighbor algorithm is one of the simplest machine learning algorithms.
* The basic idea is simple.
* Given an input row, find the closest row in the training data, and return the corresponding answer.
* If you pick k nearest neighbors, find the k closest training rows and pick by a majority vote of their classes.
* What does closest mean here?


## Default Distance Metric for Nearest Neighbors

Euclidean distance
* $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2 + \cdots}$
* We use this distance function up to 3 dimensions in everyday life.
* Tends to be biased towards dimensions with bigger scales.
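
A small worked example of the formula and of the scale bias, using two hypothetical rows with one feature in millimetres and one in grams. The values are made up for illustration.

```python
import math

# Hypothetical rows as (culmen depth in mm, body mass in g).
row_a = (14.0, 5000.0)
row_b = (16.0, 5100.0)

depth_gap = abs(row_a[0] - row_b[0])  # 2.0 mm
mass_gap = abs(row_a[1] - row_b[1])   # 100.0 g

# Euclidean distance: square root of the summed squared gaps.
distance = math.sqrt(depth_gap**2 + mass_gap**2)
assert math.isclose(distance, math.dist(row_a, row_b))

# Share of the squared distance that comes from body mass alone.
mass_share = mass_gap**2 / distance**2
print(round(distance, 2), round(mass_share, 4))
```

Here the distance is almost exactly the body mass gap, so the culmen depth difference has very little influence on which rows count as nearest.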


Script:
* The default distance metric used for nearest neighbors is just the everyday Euclidean distance.
* If I did not say Euclidean, but gave you a ruler and asked you to measure something, that would be what you were measuring.
* Euclidean distance is easy to interpret, but can be dominated by dimensions with a wider range of variation.
* In the penguin data set that we are about to model, the culmen depth has a standard deviation of about one, while the body mass has a standard deviation of about five hundred.
* Managing such bias will be an important part of learning how to configure models over the course of this program.
* For now, let's take a look at the penguin data and start classifying.

In [None]:
import numpy as np
import pandas as pd

Script:
* We will start with our usual imports.
* Then I'll load the data for the Gentoo penguins.

In [None]:
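# Load the Gentoo penguin measurements, indexed by sample number.
penguins = pd.read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.220.3&entityid=e03b43c924f226486f2f0ab6709d2381", index_col="Sample Number")
# Keep the target column plus the size measurements (columns with mm or g units).
penguins_keep_columns = [c for c in penguins.columns if c == "Sex" or "(mm)" in c or "(g)" in c]
penguins = penguins[penguins_keep_columns]

# Keep rows with a recorded sex and no missing measurements.
penguins = penguins.query("Sex in ('FEMALE', 'MALE')")
penguins = penguins.dropna()
penguins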


Sample Number,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
1,46.1,13.2,211.0,4500.0,FEMALE
2,50.0,16.3,230.0,5700.0,MALE
3,48.7,14.1,210.0,4450.0,FEMALE
4,50.0,15.2,218.0,5700.0,MALE
5,47.6,14.5,215.0,5400.0,MALE
...,...,...,...,...,...
119,47.2,13.7,214.0,4925.0,FEMALE
121,46.8,14.3,215.0,4850.0,FEMALE
122,50.4,15.7,222.0,5750.0,MALE
123,45.2,14.8,212.0,5200.0,FEMALE


Script:
* I also filtered the penguin data to the numeric columns, which will be the features, and the sex column, which will be the target.
* Then I filtered out all the rows with missing data.
* Now I will finish preparing the input features.

In [None]:
penguins_features = penguins.drop("Sex", axis=1)
penguins_features

Sample Number,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g)
1,46.1,13.2,211.0,4500.0
2,50.0,16.3,230.0,5700.0
3,48.7,14.1,210.0,4450.0
4,50.0,15.2,218.0,5700.0
5,47.6,14.5,215.0,5400.0
...,...,...,...,...
119,47.2,13.7,214.0,4925.0
121,46.8,14.3,215.0,4850.0
122,50.4,15.7,222.0,5750.0
123,45.2,14.8,212.0,5200.0


Script:
* And prepare the target.

In [None]:
penguins_target = penguins["Sex"]
penguins_target

Sample Number
1      FEMALE
2        MALE
3      FEMALE
4        MALE
5        MALE
        ...  
119    FEMALE
121    FEMALE
122      MALE
123    FEMALE
124      MALE
Name: Sex, Length: 119, dtype: object

Script:
* Now I will create the model.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
nearest_neighbors_model = KNeighborsClassifier(n_neighbors=5).fit(penguins_features, penguins_target)
nearest_neighbors_model

Script:
* I picked 5 neighbors to check in predictions.
* I chose five so that a few sample points weigh in on each vote, but not so many that distant points get included.
* Picking these parameters can be an art on its own.
* One bit which is less of an art is that, with two classes, checking an odd number of neighbors avoids tied votes, since the votes cannot split exactly 50/50.
* Next, I will make the predictions and collect them in a new data frame to check how good they are.

In [None]:
penguins_check = penguins.copy()
penguins_check["prediction"] = nearest_neighbors_model.predict(penguins_features)
penguins_check["prediction_check"] = penguins_check["prediction"] == penguins_target
penguins_check.groupby("prediction_check").size()

prediction_check
False      9
True     110
dtype: int64

Script:
* Take a moment to read that.
* The prediction column is the output of the model.
* The prediction check column is true if the prediction matches the training target.
* Then those checks are grouped to count the number of right and wrong predictions.
* One hundred and ten correct predictions out of one hundred and nineteen rows.
* Not bad, but I should point out that this is the data that we trained the model on.
* So a high accuracy is expected.
* If you test on your training set, the nearest row will always be the same row from training, so one of the five nearest neighbors is guaranteed to be right!
* For each correct prediction, at least two of the other four neighbors agreed, but this is not a good way to assess generalization.
* We will talk about better ways next week.
* Let's look at the mistakes briefly.
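
As an aside on testing with the training rows, the sketch below uses hypothetical toy data with deliberately interleaved classes and a single neighbor. It assumes no two training rows share the same features; the `kneighbors` method reports the nearest training row for each input row.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: classes alternate, so neighbors rarely agree.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array(["a", "b", "a", "b", "a", "b"])
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Ask for the nearest training row of each training row.
distances, indices = model.kneighbors(X)
print(distances[:, 0])  # distance to the nearest neighbor
print(indices[:, 0])    # position of the nearest neighbor in the training data

# Accuracy measured on the same rows the model was fit on.
training_accuracy = (model.predict(X) == y).mean()
print(training_accuracy)
```

Each row finds itself at distance zero, so the accuracy on the training rows is perfect even though the classes are thoroughly mixed.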

In [None]:
penguins_check.query("prediction_check == False")

Sample Number,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,prediction,prediction_check
8,46.7,15.3,219.0,5200.0,MALE,FEMALE,False
10,46.8,15.4,215.0,5150.0,MALE,FEMALE,False
35,49.1,14.8,220.0,5150.0,FEMALE,MALE,False
44,49.6,15.0,216.0,4750.0,MALE,FEMALE,False
49,44.9,13.3,213.0,5100.0,FEMALE,MALE,False
58,45.5,15.0,220.0,5000.0,MALE,FEMALE,False
74,46.5,14.8,217.0,5200.0,FEMALE,MALE,False
97,49.4,15.8,216.0,4925.0,MALE,FEMALE,False
123,45.2,14.8,212.0,5200.0,FEMALE,MALE,False


Script:
* Nothing huge is jumping out at me here, but I do notice that there are a few tied body masses within these rows, so perhaps there are a lot of penguins with similar features.
* Let's take a look at some of the predicted probabilities.

In [None]:
nearest_neighbors_model.predict_proba(penguins_features.head())

array([[1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.]])

Script:
* The probabilities are returned as a two column array.

In [None]:
nearest_neighbors_model.classes_

array(['FEMALE', 'MALE'], dtype=object)

Script:
* Let's add on predicted probabilities for female and male penguins to the data frame.

In [None]:
# Compute the probabilities once; the column order follows classes_ (FEMALE, MALE).
penguins_proba = nearest_neighbors_model.predict_proba(penguins_features)
penguins_check["proba_female"] = penguins_proba[:, 0]
penguins_check["proba_male"] = penguins_proba[:, 1]
penguins_check.query("prediction_check == False")

Sample Number,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,prediction,prediction_check,proba_female,proba_male
8,46.7,15.3,219.0,5200.0,MALE,FEMALE,False,0.6,0.4
10,46.8,15.4,215.0,5150.0,MALE,FEMALE,False,0.6,0.4
35,49.1,14.8,220.0,5150.0,FEMALE,MALE,False,0.4,0.6
44,49.6,15.0,216.0,4750.0,MALE,FEMALE,False,0.8,0.2
49,44.9,13.3,213.0,5100.0,FEMALE,MALE,False,0.2,0.8
58,45.5,15.0,220.0,5000.0,MALE,FEMALE,False,0.6,0.4
74,46.5,14.8,217.0,5200.0,FEMALE,MALE,False,0.4,0.6
97,49.4,15.8,216.0,4925.0,MALE,FEMALE,False,0.8,0.2
123,45.2,14.8,212.0,5200.0,FEMALE,MALE,False,0.4,0.6


Script:
* Most of these prediction mistakes had 60/40 or 40/60 probabilities, so it looks like these penguins are just in an area of the feature space where female and male penguins are mixed up.
* Overall, the model seems reasonable, and we will look into better model evaluations next week.

## Classifying Data with Scikit-Learn Recap

* Same fit/predict interface as regression models,
* Plus class probability predictions.

Script:
* We saw that classifying data with scikit-learn classification models can be very easy, similar to regression models.
* The same init, fit, predict sequence works here too.
* Plus you can get more information about the probabilities of each class.