# K-Nearest Neighbors

We will now classify our brains with a different machine learning algorithm: k-nearest neighbors, which assigns a new sample the class most common among the k training samples closest to it. This is similar in concept to the function you wrote in the paleo-neuro unit to classify your 3-D printed brain.
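
To make the idea concrete, here is a minimal sketch of k-nearest neighbors written from scratch. The function name `knn_predict` and the six data points are made up for illustration; they are not taken from the dataset or from the paleo-neuro unit.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up (brain/body ratio, cerebrum ratio) pairs, for illustration only
train_points = [(0.010, 0.70), (0.012, 0.75), (0.011, 0.72),
                (0.001, 0.40), (0.002, 0.45), (0.0015, 0.42)]
train_labels = ["Bird", "Bird", "Bird", "Dino", "Dino", "Dino"]

prediction = knn_predict(train_points, train_labels, (0.009, 0.68), k=3)
print(prediction)
```

The `sklearn` classifier used in the rest of this notebook follows the same voting idea, with faster neighbor searches and more options.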

## Import libraries

First, we need to import some libraries. Import `pandas`, `numpy`, and `matplotlib.pyplot`. From `sklearn`, import `neighbors` and `datasets`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors
from sklearn import datasets

## Import data

Now we need to import our bird and dinosaur brain data and calculate the brain to body ratio and cerebrum to whole brain ratio. We also need to convert the bird and dinosaur classifications to 0s and 1s like you did in the decision tree notebook. Finally, we need to convert the data to a numpy array.

In [2]:
df = pd.read_csv("../../data/bird_dino_data.csv")
# Total brain volume is the sum of the four measured regions
df["Total Endocranium (cm3)"] = df["Olfactory bulbs (cm3)"] + df["Cerebrum (cm3)"] + df["Optic Lobes (cm3)"] + df["Cerebellum (cm3)"]
# Body mass is converted from kg to g before taking the ratio
df["Brain Body Ratio"] = df["Total Endocranium (cm3)"]/(df["Body Mass (kg)"]*1000)
df["Cerebrum Ratio"] = df["Cerebrum (cm3)"]/df["Total Endocranium (cm3)"]
# Encode the classes as integers (Bird -> 0, Dino -> 1). Assigning the whole
# column at once avoids the SettingWithCopyWarning raised by chained indexing.
df["Bird vs Dino"] = df["Bird vs Dino"].map({"Bird": 0, "Dino": 1})
data = df.to_numpy()
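
As a check on the arithmetic, the sketch below applies the same ratio calculations to two hypothetical specimens. The measurements are invented for illustration and do not come from `bird_dino_data.csv`; only the column names match the ones used in this notebook.

```python
import pandas as pd

# Made-up measurements for two hypothetical specimens (not from the dataset)
toy = pd.DataFrame({
    "Bird vs Dino": ["Bird", "Dino"],
    "Body Mass (kg)": [0.5, 1000.0],
    "Olfactory bulbs (cm3)": [0.1, 10.0],
    "Cerebrum (cm3)": [3.0, 50.0],
    "Optic Lobes (cm3)": [0.5, 15.0],
    "Cerebellum (cm3)": [0.4, 25.0],
})

regions = ["Olfactory bulbs (cm3)", "Cerebrum (cm3)",
           "Optic Lobes (cm3)", "Cerebellum (cm3)"]
toy["Total Endocranium (cm3)"] = toy[regions].sum(axis=1)
# Body mass is converted from kg to g, so the ratio is in cm3 per gram
toy["Brain Body Ratio"] = toy["Total Endocranium (cm3)"] / (toy["Body Mass (kg)"] * 1000)
toy["Cerebrum Ratio"] = toy["Cerebrum (cm3)"] / toy["Total Endocranium (cm3)"]
# map() replaces every label in one step and produces an integer column
toy["Bird vs Dino"] = toy["Bird vs Dino"].map({"Bird": 0, "Dino": 1})

print(toy[["Bird vs Dino", "Brain Body Ratio", "Cerebrum Ratio"]])
```

For the first specimen the four regions sum to 4.0 cm3, so the brain to body ratio is 4.0 / 500 and the cerebrum ratio is 3.0 / 4.0.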


## Set x and y

Again, just as in the decision tree, we need to set `x` and `y` variables corresponding to our features (brain to body mass ratio and cerebrum to whole brain ratio) and classes (birds and dinosaurs), respectively.

In [3]:
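# Columns 9 onward should be the two ratio columns; cast to float because
# `data` can be an object array when the DataFrame mixes column types
x = data[:,9:].astype(float)
# Column 1 holds the 0/1 class labels; cast to int so they are read as classes
y = data[:,1].astype(int)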
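
Positional slices such as `data[:,9:]` depend on the column order of the CSV file. As an alternative sketch, the features and classes can be selected by column name before converting to numpy. The toy DataFrame below uses hypothetical values, and the `Species` column is an assumption added only to show a non-numeric column being skipped.

```python
import pandas as pd

# Toy frame with hypothetical values; "Species" is an assumed extra column
toy = pd.DataFrame({
    "Species": ["a", "b", "c"],
    "Bird vs Dino": [0, 1, 0],
    "Brain Body Ratio": [0.008, 0.0001, 0.010],
    "Cerebrum Ratio": [0.75, 0.50, 0.70],
})

feature_columns = ["Brain Body Ratio", "Cerebrum Ratio"]
x_named = toy[feature_columns].to_numpy(dtype=float)   # shape (n_samples, 2)
y_named = toy["Bird vs Dino"].to_numpy(dtype=int)      # shape (n_samples,)

print(x_named.shape, y_named.shape)
```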

## Train / test split

We are going to split our dataset into training data, which we will use to train the algorithm, and test data, which we will use to see how well the algorithm performs. From `sklearn.model_selection`, import `train_test_split`. Next, use the documentation for `train_test_split` to divide the data into 80% training and 20% test data:

In [4]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
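
Because `train_test_split` shuffles the rows randomly, each run of the cell above can produce a different split. The sketch below, on synthetic stand-in data, shows two optional parameters of `train_test_split` that this notebook does not use: `random_state` to make the split repeatable and `stratify` to keep the class proportions the same in both sets. The variable names and the value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 samples, 2 features, balanced 0/1 labels
rng = np.random.default_rng(0)
x_demo = rng.random((20, 2))
y_demo = np.array([0, 1] * 10)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo,
    test_size=0.20,      # 20% of the rows go to the test set
    random_state=42,     # fixes the shuffle so reruns give the same split
    stratify=y_demo,     # keeps the 0/1 proportions equal in both sets
)

print(x_tr.shape, x_te.shape)
print(np.bincount(y_te))
```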

## Run k-nearest neighbors

Now it is time to actually run k-nearest neighbors! Look at the documentation for `KNeighborsClassifier`: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html. Let's start with `n_neighbors` set to 5, which is the default.

In [6]:
neigh = neighbors.KNeighborsClassifier(n_neighbors=5)

Now, just as you did for the decision tree, call `fit` on the classifier you just created (`neigh` in the cell above), passing the training features and training labels (`x_train` and `y_train`).

In [8]:
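# fit() trains the classifier in place and returns it, so `fit` and `neigh` refer to the same object
fit = neigh.fit(x_train, y_train)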
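
K-nearest neighbors compares samples by distance, so a feature with a larger numeric range can dominate the result. One common option, not used in this notebook, is to standardize the features before fitting. The sketch below does this with `StandardScaler` and `make_pipeline` on synthetic data whose two features sit on very different scales; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data whose features sit on very different scales
rng = np.random.default_rng(0)
small = np.concatenate([rng.normal(0.010, 0.001, 30), rng.normal(0.001, 0.0005, 30)])
large = np.concatenate([rng.normal(0.70, 0.05, 30), rng.normal(0.45, 0.05, 30)])
x_demo = np.column_stack([small, large])
y_demo = np.array([0] * 30 + [1] * 30)

# Standardize each feature, then classify by the 5 nearest neighbors
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(x_demo, y_demo)

demo_pred = scaled_knn.predict([[0.009, 0.68], [0.0012, 0.44]])
print(demo_pred)
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to `fit`, and the same scaling is then applied automatically in `predict`.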

## Testing

Now we can evaluate our model by running `predict` on the test data. The classifier labels each test sample as a bird (0) or a dinosaur (1) based on the values of its two ratios.

In [11]:
y_pred = neigh.predict(x_test)
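
To see where a prediction comes from, a fitted `KNeighborsClassifier` can report which training samples were the nearest neighbors (`kneighbors`) and what share of the votes each class received (`predict_proba`). The sketch below uses a tiny made-up training set rather than the brain data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up training set: three points per class
x_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(x_demo, y_demo)

query = np.array([[0.05, 0.05]])
distances, indices = knn.kneighbors(query)   # the 3 closest training rows
neighbor_labels = y_demo[indices[0]]         # the labels that cast the votes
vote_share = knn.predict_proba(query)[0]     # fraction of votes per class

print(sorted(indices[0].tolist()), neighbor_labels.tolist(), vote_share.tolist())
```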

We can calculate the accuracy by importing `metrics` from `sklearn`:

In [10]:
from sklearn import metrics

We can use `metrics.accuracy_score` to compare the predicted classifications with the actual identities of the brains in our test data.

In [12]:
metrics.accuracy_score(y_test, y_pred)

1.0
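
A single train/test split gives one accuracy estimate, and that estimate depends on which rows happened to land in the test set. A possible complement, not part of this notebook, is k-fold cross-validation with `cross_val_score`, which repeats the train/test procedure so that every sample is used for testing once. The sketch below runs it on synthetic, well-separated data standing in for the real features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, well-separated two-class data standing in for the real features
rng = np.random.default_rng(0)
x_demo = np.vstack([rng.normal(0.0, 0.5, (25, 2)), rng.normal(3.0, 0.5, (25, 2))])
y_demo = np.array([0] * 25 + [1] * 25)

# 5-fold cross-validation: each fold serves once as the test set
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), x_demo, y_demo, cv=5)

print(scores)          # one accuracy value per fold
print(scores.mean())   # average accuracy across folds
```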