# Lazy learning with k-Nearest Neighbors {#knn}

<iframe width="560" height="315" src="https://www.youtube.com/embed/MDniRwXizWo" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

In [10]:
import pandas as pd
from pandas.api.types import CategoricalDtype
from IPython.display import display, Markdown
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

## Business Case: Diagnosing Breast Cancer

Breast cancer is the most common cancer in women in both the developed and the developing world. In the Netherlands it is the most pervasive form of cancer [@noauthor_who_nodate]. Early detection remains the most important instrument for improving breast cancer outcomes and survival. If machine learning could automate the identification of cancer, it would improve the efficiency of the detection process and might also increase its effectiveness by providing greater detection accuracy.

## Data Understanding
The data we will be using comes from the University of Wisconsin and is available online as an open source dataset [@noauthor_uci_cancer_nodate]. It includes measurements from digitized images of fine-needle aspirates of a breast mass. The values represent features of the cell nuclei.

For convenience, the data is stored in CSV format on GitHub. We can access it directly using the `read_csv()` function from the `pandas` library.

In [5]:
url = "https://raw.githubusercontent.com/businessdatasolutions/courses/main/data%20mining/gitbook/datasets/breastcancer.csv"
rawDF = pd.read_csv(url)

Using the `info()` function we can get some basic information about the dataset.

In [6]:
rawDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 569 non-null    int64  
 1   diagnosis          569 non-null    object 
 2   radius_mean        569 non-null    float64
 3   texture_mean       569 non-null    float64
 4   perimeter_mean     569 non-null    float64
 5   area_mean          569 non-null    float64
 6   smoothness_mean    569 non-null    float64
 7   compactness_mean   569 non-null    float64
 8   concavity_mean     569 non-null    float64
 9   points_mean        569 non-null    float64
 10  symmetry_mean      569 non-null    float64
 11  dimension_mean     569 non-null    float64
 12  radius_se          569 non-null    float64
 13  texture_se         569 non-null    float64
 14  perimeter_se       569 non-null    float64
 15  area_se            569 non-null    float64
 16  smoothness_se      569 non

## Preparation
The first variable, `id`, contains unique patient IDs. The IDs do not possess any relevant information for making predictions, so we will delete it from the dataset.

In [7]:
cleanDF = rawDF.drop(['id'], axis=1)
cleanDF.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,points_worst,symmetry_worst,dimension_worst
0,B,12.32,12.39,78.85,464.1,0.1028,0.06981,0.03987,0.037,0.1959,...,13.5,15.64,86.97,549.1,0.1385,0.1266,0.1242,0.09391,0.2827,0.06771
1,B,10.6,18.95,69.28,346.4,0.09688,0.1147,0.06387,0.02642,0.1922,...,11.88,22.94,78.28,424.8,0.1213,0.2515,0.1916,0.07926,0.294,0.07587
2,B,11.04,16.83,70.92,373.2,0.1077,0.07804,0.03046,0.0248,0.1714,...,12.41,26.44,79.93,471.4,0.1369,0.1482,0.1067,0.07431,0.2998,0.07881
3,B,11.28,13.39,73.0,384.8,0.1164,0.1136,0.04635,0.04796,0.1771,...,11.92,15.77,76.53,434.0,0.1367,0.1822,0.08669,0.08611,0.2102,0.06784
4,B,15.19,13.21,97.65,711.8,0.07963,0.06934,0.03393,0.02657,0.1721,...,16.2,15.73,104.5,819.1,0.1126,0.1737,0.1362,0.08178,0.2487,0.06766


The variable named `diagnosis` contains the outcomes we would like to predict: 'B' for 'Benign' and 'M' for 'Malignant'. The variable we would like to predict is called the 'label'. We can look at the counts for both outcomes using the `value_counts()` function. When we set the `normalize` argument to `True` we get the proportions instead.

In [8]:
cntDiag = cleanDF['diagnosis'].value_counts()
propDiag = cleanDF['diagnosis'].value_counts(normalize=True)
cntDiag
propDiag

diagnosis
B    0.627417
M    0.372583
Name: proportion, dtype: float64

Looking again at the results from the `info()` function, you'll notice that the variable `diagnosis` is coded as text (`object`). Many models require that the label is of type `category`. The `pandas` library can transform an `object` type to `category`.

In [None]:
catType = CategoricalDtype(categories=["B", "M"], ordered=False)
cleanDF['diagnosis'] = cleanDF['diagnosis'].astype(catType)
cleanDF['diagnosis']

The features consist of three different measurements of ten characteristics. We will take three characteristics and have a closer look.

In [None]:
cleanDF[['radius_mean', 'area_mean', 'smoothness_mean']].describe()

You'll notice that the three variables have very different ranges. As a consequence, `area_mean` will have a much larger impact on the distance calculation than `smoothness_mean`, which could cause problems for modeling. To solve this we'll apply normalization to rescale all features to a standard range of values.
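To see why the difference in ranges matters, the sketch below breaks the squared Euclidean distance between two samples into per-feature contributions. The two samples and their values are hypothetical, chosen only to resemble the scale of the three variables above; they are not rows from the dataset.

```python
# Two hypothetical samples (illustrative values, not rows from the dataset).
sample_a = {"radius_mean": 12.0, "area_mean": 450.0, "smoothness_mean": 0.10}
sample_b = {"radius_mean": 15.0, "area_mean": 700.0, "smoothness_mean": 0.08}

# Squared difference per feature: the terms summed inside the Euclidean distance.
squared_diffs = {
    name: (sample_a[name] - sample_b[name]) ** 2 for name in sample_a
}
total = sum(squared_diffs.values())

# Share of the squared distance that each feature is responsible for.
for name, value in squared_diffs.items():
    print(f"{name:16s} contributes {value / total:.6%} of the squared distance")
```

With values on these scales, nearly all of the distance comes from `area_mean`, so the other two features barely influence which neighbors count as nearest.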

We will write our own normalization function,

In [None]:
def normalize(x):
  # Min-max scaling: distance of each value from the minimum, divided by the range of all values
  return (x - x.min()) / (x.max() - x.min())

testSet1 = np.arange(1,6)
testSet2 = np.arange(1,6) * 10

print(f'testSet1: {testSet1}\n')
print(f'testSet2: {testSet2}\n')
print(f'Normalized testSet1: {normalize(testSet1)}\n')
print(f'Normalized testSet2: {normalize(testSet2)}\n')

and apply it to all the numerical variables in the dataframe.

In [None]:
excluded = ['diagnosis'] # list of columns to exclude
X = cleanDF.loc[:, ~cleanDF.columns.isin(excluded)]
X = X.apply(normalize, axis=0)
X[['radius_mean', 'area_mean', 'smoothness_mean']].describe()

When we take the variables we've selected earlier and look at the summary parameters again, we'll see that the normalization was successful.

We can now split our data into training and test sets.


In [None]:
y = cleanDF['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)

Here, `X_train` and `y_train` are the features and labels of the training data, respectively, and `X_test` and `y_test` are the features and labels of the test data.

Now we can train and evaluate our kNN model.

## Modeling and Evaluation
kNN is an instance-based learning algorithm. It stores all of the training data and makes predictions based on the similarity between the input instance and the stored instances. The prediction is the majority class among the K nearest neighbors of the input instance.

The distance between instances is typically measured using the Euclidean distance. However, other distance measures such as the Manhattan distance or the Minkowski distance can also be used.
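As a minimal sketch of these measures, the functions below compute the Minkowski distance in plain Python and derive the Euclidean and Manhattan distances from it as the special cases p = 2 and p = 1. The function names and the two example points are illustrative assumptions, not part of `sklearn`.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("both points need the same number of features")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean_distance(a, b):
    # Straight-line distance: Minkowski with p = 2.
    return minkowski_distance(a, b, p=2)

def manhattan_distance(a, b):
    # Sum of absolute differences: Minkowski with p = 1.
    return minkowski_distance(a, b, p=1)

# Hypothetical points with two features each.
point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

print(euclidean_distance(point_a, point_b))
print(manhattan_distance(point_a, point_b))
```

In `sklearn`, `KNeighborsClassifier` exposes the same choice through its `metric` and `p` parameters.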

The pseudocode for the kNN algorithm is as follows:

<div class='p-2' style='background-color:#f0f3f4;'>
<pre><code class=''>
for each instance in the test set:
    for each instance in the training set:
        calculate the distance between the two instances
    sort the distances in ascending order
    find the K nearest neighbors
    predict the class based on the majority class among the K nearest neighbors
</code></pre>
</div>
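The pseudocode above can be turned into a few lines of plain Python. The version below is a sketch for illustration only: the function name `knn_predict` and the toy data are assumptions, and it is not how `sklearn` implements the algorithm internally.

```python
from collections import Counter
import math

def knn_predict(train_features, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training instance.
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    ]
    # Sort ascending by distance and keep the k closest instances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among the labels of those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical, already-normalized toy data (not taken from the breast cancer set).
train_features = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
                  (0.8, 0.9), (0.9, 0.8), (0.85, 0.75)]
train_labels = ["B", "B", "B", "M", "M", "M"]

print(knn_predict(train_features, train_labels, (0.2, 0.2), k=3))
print(knn_predict(train_features, train_labels, (0.7, 0.8), k=3))
```

Note that there is no separate training step: all the work happens at prediction time, which is why kNN is called a lazy learner.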

To train the kNN model we only need a single class from the `sklearn` library, `KNeighborsClassifier`. Its `fit()` function trains the model on the training data. The trained model is then applied to the test features, and the `predict()` function returns a set of predicted values for y.

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# make predictions on the test set
y_pred = knn.predict(X_test)

Now that we have a set of predicted labels we can compare these with the actual labels. A confusion matrix shows how well the model performed.

```{r difftable-fig, echo=FALSE, fig.align='center', fig.asp=.75, fig.cap='Standard confusion matrix. Taken from: https://emj.bmj.com/content/emermed/36/7/431/F1.large.jpg', message=TRUE, warning=TRUE, out.width='80%'}
knitr::include_graphics(rep('images/diffusion.png'))
```

Here is our own table:

In [None]:
cm = confusion_matrix(y_test, y_pred, labels=knn.classes_)
cm

In [None]:
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=knn.classes_)
disp.plot()
plt.show()
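The counts in a confusion matrix can be condensed into a few summary rates. The sketch below assumes 'M' (malignant) is the positive class and uses hypothetical counts rather than the actual results of the model above; the function name `summarize_confusion` is likewise made up for illustration.

```python
def summarize_confusion(tn, fp, fn, tp):
    """Summary rates from the four cells of a two-class confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,    # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),    # share of malignant cases that were detected
        "specificity": tn / (tn + fp),    # share of benign cases correctly cleared
        "precision": tp / (tp + fp),      # share of malignant predictions that were right
    }

# Hypothetical counts, NOT the output of the model trained above.
metrics = summarize_confusion(tn=105, fp=2, fn=6, tp=58)

for name, value in metrics.items():
    print(f"{name:12s} {value:.3f}")
```

For a two-class matrix from `confusion_matrix()` with labels ordered B, M, rows are actual labels and columns are predicted labels, so `cm.ravel()` yields the counts in the order tn, fp, fn, tp.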

**Questions:** 

1. *How would you assess the overall performance of the model?*
2. *What would you consider as more costly: high false negatives or high false positives levels? Why?*
3. *Try to improve the model by changing some parameters of the `KNeighborsClassifier()` function*
