# Machine learning for biology, part three

How do we score our k-nearest neighbors classifier?

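We split the data into a training set (about 80% of the penguins), which the classifier searches for neighbors, and a testing set (about 20%), which we hold back to check its predictions.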

In [7]:
import pandas as pd
import seaborn as sns
import numpy as np

df = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
df = df.dropna()

We modify our earlier function to take k as a parameter, so we can test different values of k. The function also takes the dataset to use as training data.

In [8]:
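def classify_penguin(bill_length, flipper_length, k, data):
    # Euclidean distance from the new point to every penguin in `data`
    bill_length_difference = data['bill_length_mm'] - bill_length
    flipper_length_difference = data['flipper_length_mm'] - flipper_length
    overall_distance = np.sqrt(flipper_length_difference ** 2 + bill_length_difference ** 2)

    # Attach the distances to a copy so the caller's DataFrame is not modified
    data = data.assign(**{'distance to new point': overall_distance})

    # Most common species among the k nearest penguins
    # (mode() sorts its result, so a tie goes to the alphabetically first species)
    most_common_species_nearby = data.sort_values('distance to new point').head(k)['species'].mode()[0]

    return most_common_species_nearby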
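
To see why k matters, here is a minimal sketch on three made-up penguins (the `toy` table and the `vote` helper are hypothetical and not part of the penguin dataset or the function above). The new point sits closest to a single Adelie, but two Gentoo penguins are only slightly further away, so the answer changes between k = 1 and k = 3.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three made-up penguins, not taken from the real dataset
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo"],
    "bill_length_mm": [40.0, 46.0, 47.0],
    "flipper_length_mm": [190.0, 196.0, 197.0],
})

def vote(bill_length, flipper_length, k, data):
    # Euclidean distance from the new point to every row of `data`
    distance = np.sqrt((data["bill_length_mm"] - bill_length) ** 2
                       + (data["flipper_length_mm"] - flipper_length) ** 2)
    # Species labels of the k closest rows, then the most common label
    nearest = data.loc[distance.nsmallest(k).index, "species"]
    return nearest.mode()[0]

print(vote(42.0, 192.0, 1, toy))  # Adelie: the single closest penguin
print(vote(42.0, 192.0, 3, toy))  # Gentoo: two of the three closest penguins
```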


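We randomly label each penguin as 'training' with probability 0.8 or 'testing' with probability 0.2. Because every penguin is labelled independently, the split is roughly, not exactly, 80/20.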

In [9]:
df['assignment'] = np.random.choice(['training', 'testing'], len(df), p=[0.8, 0.2])
df

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,assignment
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007,testing
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007,training
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007,training
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007,training
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007,training
...,...,...,...,...,...,...,...,...,...
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009,training
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009,training
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009,testing
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009,training


In [10]:
training_data = df[df['assignment'] == 'training']
testing_data = df[df['assignment'] == 'testing']
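
If an exact and repeatable 80/20 split is preferred, one alternative is pandas' `DataFrame.sample`. This is only a sketch on a made-up stand-in table (`demo` is hypothetical, and the seed value 0 is an arbitrary choice); it is not the approach used in the cells above.

```python
import pandas as pd

# Stand-in for the penguin table: 100 made-up rows
demo = pd.DataFrame({"value": range(100)})

# Draw exactly 80% of the rows for training; random_state makes the draw repeatable
training = demo.sample(frac=0.8, random_state=0)

# Every row that was not drawn becomes part of the testing set
testing = demo.drop(training.index)

print(len(training), len(testing))  # 80 20
```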

In [13]:
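# Classify every testing penguin using only the training penguins as neighbors
predictions = testing_data.apply(lambda p: classify_penguin(p.bill_length_mm, p.flipper_length_mm, 10, training_data), axis=1)
# Fraction of correct predictions (the mean of a boolean Series; also works if none are correct)
(predictions == testing_data['species']).mean()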


0.9358974358974359

What if we change the value of k? Test all values of k from 1 to 10.

In [16]:
k_scores = []
for k in range(1, 11):
    predictions = testing_data.apply(lambda p: classify_penguin(p.bill_length_mm, p.flipper_length_mm, k, training_data), axis=1)
    # Fraction of testing penguins classified correctly for this k
    score = (predictions == testing_data['species']).mean()
    k_scores.append([k, score])
pd.DataFrame(data = k_scores, columns=['k', 'score'])

,k,score
0,1,0.948718
1,2,0.948718
2,3,0.948718
3,4,0.948718
4,5,0.948718
5,6,0.948718
6,7,0.948718
7,8,0.935897
8,9,0.935897
9,10,0.935897


Note that these scores come from a single random split of the data into training and testing sets. A different split puts different penguins in each set, so the scores will change.
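
The effect can be sketched by repeating the split with several seeds. The example below uses made-up data (two overlapping clusters standing in for two species) and hypothetical helpers `classify` and `score_one_split`, so the numbers it prints say nothing about the penguins; the point is only that the score depends on the split.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up data: two overlapping clusters standing in for two species
n = 100
demo = pd.DataFrame({
    "species": ["A"] * n + ["B"] * n,
    "x": np.concatenate([rng.normal(0, 1, n), rng.normal(2, 1, n)]),
    "y": np.concatenate([rng.normal(0, 1, n), rng.normal(2, 1, n)]),
})

def classify(x, y, k, data):
    # Most common species among the k rows of `data` closest to (x, y)
    distance = np.sqrt((data["x"] - x) ** 2 + (data["y"] - y) ** 2)
    return data.loc[distance.nsmallest(k).index, "species"].mode()[0]

def score_one_split(data, k, seed):
    # Label each row 'training' or 'testing' with probabilities 0.8 and 0.2,
    # using a seeded generator so that each split can be reproduced
    split_rng = np.random.default_rng(seed)
    assignment = split_rng.choice(["training", "testing"], len(data), p=[0.8, 0.2])
    training = data[assignment == "training"]
    testing = data[assignment == "testing"]
    predictions = testing.apply(lambda p: classify(p.x, p.y, k, training), axis=1)
    # Fraction of testing rows classified correctly
    return (predictions == testing["species"]).mean()

# One score per seed; each seed produces a different split
scores = [score_one_split(demo, k=5, seed=seed) for seed in range(5)]
print(scores)
```

Fixing a seed makes a single split repeatable, but the score still reflects that one particular split.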

Is there a better, deterministic way of evaluating the performance of our function?