# **CSI 382 - Data Mining and Knowledge Discovery**

# **Lab 5 - k-Nearest Neighbor Algorithm**

The k-nearest neighbor algorithm is most often used for classification, although
it can also be used for estimation and prediction. k-Nearest neighbor is
an example of instance-based learning, in which the training data set is stored,
so that a classification for a new unclassified record may be found simply by
comparing it to the most similar records in the training set.

# **Dataset for Lab 5**

For a beginner in data mining, this dataset is a good opportunity to try out techniques for predicting which drug is likely to be appropriate for a patient.

The target feature is: **Drug type**

The feature sets are:
* Age
* Sex
* Blood Pressure Levels (BP)
* Cholesterol Levels
* Na to Potassium Ratio

The dataset can be found here in this [URL](https://drive.google.com/file/d/1BWeCLtgyt4B1poaSRwQ-nMLYJfbykYPm/view?usp=sharing)

**Today we need an upgraded matplotlib package, so we run the following code. This might not be needed if you are running the notebook in a local environment. We need at least matplotlib 3.4, which provides the `bar_label` method used in the bar charts later on.**

In [None]:
!pip install matplotlib --upgrade

## **Loading the dataset**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/drug200.csv')

#Check number of rows and columns in the dataset
print("The dataset has %d rows and %d columns." % df.shape)

## **Dataset Preprocessing**

We need to transform all categorical data to numerical ones. That's why we are applying some Lambda functions to our dataset columns.

In [None]:
# 0 = Female, 1 = Male
df["Sex"] = df["Sex"].apply(lambda x: 0 if x=="F" else 1)

In [None]:
# 0 = LOW, 1 = NORMAL and 2 = HIGH
df["BP"] = df["BP"].apply(lambda x: 0 if x=="LOW" else (1 if x=="NORMAL" else 2))

In [None]:
# 0 = LOW, 1 = NORMAL and 2 = HIGH
df["Cholesterol"] = df["Cholesterol"].apply(lambda x: 0 if x=="LOW" else (1 if x=="NORMAL" else 2))

In [None]:
# 0 = drugA, 1 = drugB, 2 = drugC, 3 = drugX, 4 = DrugY
df["Drug"] = df["Drug"].apply(lambda x: 0 if x=="drugA" else (1 if x=="drugB" else (2 if x=="drugC" else (3 if x=="drugX" else 4))))

In [None]:
# checking the columns now

df.columns

In [None]:
df.head(10)
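
The nested lambda expressions above send every unexpected value into the final `else` branch. A possible alternative, sketched here on a hypothetical three-row sample rather than on the real `drug200.csv`, is to use `Series.map` with explicit lookup dictionaries. The names `sample`, `sex_codes`, and `level_codes` are illustrative assumptions and are not used elsewhere in the lab.

```python
import pandas as pd

# Hypothetical sample that mimics the raw categorical columns of drug200.csv.
sample = pd.DataFrame({
    "Sex": ["F", "M", "F"],
    "BP": ["LOW", "NORMAL", "HIGH"],
    "Cholesterol": ["HIGH", "NORMAL", "HIGH"],
})

# Explicit lookup tables; the codes match the comments in the cells above.
sex_codes = {"F": 0, "M": 1}
level_codes = {"LOW": 0, "NORMAL": 1, "HIGH": 2}

encoded = sample.copy()
encoded["Sex"] = sample["Sex"].map(sex_codes)
encoded["BP"] = sample["BP"].map(level_codes)
encoded["Cholesterol"] = sample["Cholesterol"].map(level_codes)

print(encoded)
```

With `map`, a value that is missing from the dictionary becomes `NaN` instead of silently receiving the last code, which makes a misspelled category easier to spot.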

# **k-nearest neighbor**

As introduced at the start of this lab, the k-nearest neighbor algorithm classifies
a new unclassified record by comparing it to the most similar records stored in the
training set.
Let’s consider an example.

In the medical field, suppose that we are interested in classifying
the type of drug a patient should be prescribed, based on certain patient
characteristics, such as the age of the patient and the patient’s sodium/potassium ratio.
The figure below is a scatter plot of patients’ sodium/potassium ratio against patients’
ages for a sample of 200 patients. The particular drug prescribed is indicated
by the color of the points.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
fig.set_size_inches(18.5, 10.5)

ax.set_facecolor('xkcd:grey')
scatter = plt.scatter( df['Age'],df['Na_to_K'], c=df['Drug'])
plt.xlabel('Age')
plt.ylabel('Na/K Ratio')
plt.grid(True)
plt.show()

In [None]:
def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-1.5.1.min.js?noext',
            },
          });
        </script>
        '''))

In [None]:
# plot libraries
import numpy as np
import plotly
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode

configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook



layout = dict(
    yaxis=dict(
        title='Na/K Ratio',
        automargin=True,
    ),
    xaxis=dict(
        title='Age',
        automargin=True,
    ),
)
fig = go.Figure(layout=layout)
# Add traces

fig.add_trace(go.Scatter(x=df['Age'], y=df['Na_to_K'],text= df['Drug'],
                    mode='markers',
                    name='Drug',
                    hovertemplate="%{text}",
                    marker=dict(
                        size=8,
                        color = df['Drug'],
                        colorscale='rainbow', # one of plotly colorscales
                        showscale=True
                    ))
            )
fig.update_layout(
    autosize=False,
    width=800,
    height=800
    )
fig.show()

Now suppose that we have a new patient record, without a drug classification,
and would like to classify which drug should be prescribed for the patient based
on which drug was prescribed for other patients with similar attributes. Identified
as “new patient 1,” this patient is 40 years old and has a Na/K ratio of 29, placing
her at the center of the circle indicated for new patient 1 in Figure 5.6. Which
drug classification should be made for new patient 1? Since her patient profile
places her deep into a section of the scatter plot where all patients are prescribed
drug Y, we would thereby classify new patient 1 as drug Y. All of the points
nearest to this point, that is, all of the patients with a similar profile (with respect
to age and Na/K ratio) have been prescribed the same drug, making this an easy
classification.

In [None]:
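# Three new, unclassified patients. Drug = 6 is a placeholder code (real drugs are 0 to 4)
# so that the new points stand out in the plot.
df2 = pd.DataFrame([[40, 1,1,2,29,6],[17, 0,1,0,12.5,6],[47, 1,0,1,13.5,6]], columns=['Age','Sex','BP','Cholesterol','Na_to_K','Drug'])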

In [None]:
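# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
# pd.concat returns a new DataFrame, so df itself is left unchanged.
new_df = pd.concat([df, df2], ignore_index=True)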

Next, we move to new patient 2, who is 17 years old with a Na/K ratio of 12.5.
Suppose we let k = 1 for our k-nearest neighbor algorithm, so that new patient 2
would be classified according to whichever single (one) observation it was closest to. In this case, new patient 2 would be classified for drugs B and C, since that is the classification of the point closest to the point on the scatter plot for new patient 2.

However, suppose that we now let k = 2 for our k-nearest neighbor algorithm,
so that new patient 2 would be classified according to the classification of the
k = 2 points closest to it. One of these points is dark gray, and one is medium
gray, so that our classifier would be faced with a decision between classifying
new patient 2 for drugs B and C or drugs A and X. How would the classifier
decide between these two classifications? Voting would not help, since there is
one vote for each of two classifications.

Voting would help, however, if we let k = 3 for the algorithm, so that new patient 2 would be classified based on the three points closest to it. Since two of the three closest points have the same classification, a vote would choose drugs A and X as the classification for new patient 2. Note that the classification assigned to new patient 2 differs depending on which value we choose for k.

Finally, consider new patient 3, who is 47 years old and has a Na/K ratio of
13.5. Figure 11 presents a close-up of the three nearest neighbors to new patient 3. For k = 1, the k-nearest neighbor algorithm would choose the dark gray (drugs B and C) classification for new patient 3, based on a distance measure. For k = 2, however, voting would not help. But voting would not help for k = 3 in this case either, since the three nearest neighbors to new patient 3 are of three different classifications.

In [None]:
# plot libraries
import numpy as np
import plotly
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode

configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook



layout = dict(
    yaxis=dict(
        title='Na/K Ratio',
        automargin=True,
    ),
    xaxis=dict(
        title='Age',
        automargin=True,
    ),
)
fig = go.Figure(layout=layout)
# Add traces

fig.add_trace(go.Scatter(x=new_df['Age'], y=new_df['Na_to_K'],text= new_df['Drug'],
                    mode='markers',
                    name='Drug',
                    hovertemplate="%{text}",
                    marker=dict(
                        size=8,
                        color = new_df['Drug'],
                        colorscale='rainbow', # one of plotly colorscales
                        showscale=True
                    ))
            )
fig.update_layout(
    autosize=False,
    width=800,
    height=800
    )
fig.show()

This example has shown us some of the issues involved in building a classifier
using the k-nearest neighbor algorithm. These issues include:
* How many neighbors should we consider? That is, what is k?
* How do we measure distance?
* How do we combine the information from more than one observation?

Later we consider other questions, such as:

* Should all points be weighted equally, or should some points have more
influence than others?
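
For the distance question, the most common choice is Euclidean distance. The snippet below is a minimal sketch of that measure on two attributes (Age, Na/K). The point `neighbor` is a hypothetical record invented for illustration, and `euclidean` is a helper name chosen here, not part of the lab code.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

new_patient_2 = (17, 12.5)   # (Age, Na/K) as given in the text
neighbor = (20, 16.5)        # hypothetical stored record

# sqrt((17 - 20)^2 + (12.5 - 16.5)^2) = sqrt(9 + 16)
print(euclidean(new_patient_2, neighbor))
```

Note that an attribute measured on a larger scale contributes more to the raw distance, so the result depends on the units of each attribute.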

## **Preparing dataset to be fed into Model**

The target/response variable in our dataset is **Drug**. So we are putting the drug labels in our target variable $y$.

The other variables/predictors are the columns **[Age, Sex, BP, Cholesterol, Na_to_K]** and should be put in our training variable $X$.

In [None]:
y = df['Drug']
X = df.drop(columns=['Drug'])

print("Data shape: ", X.shape)
print("Labels shape: ", y.shape)

## **Supervised Learning**

Most data mining methods are supervised methods, meaning that (1)
there is a particular pre-specified target variable, and (2) the algorithm is given many examples where the value of the target variable is provided, so that the algorithm may learn which values of the target variable are associated with which values of the predictor variables.

### **Training Set**

First, the algorithm is provided with a training set of data, which includes the
pre-classified values of the target variable in addition to the predictor variables.

For example, if we are interested in classifying income bracket, based on age,
gender, and occupation, our classification algorithm would need a large pool of
records, containing complete (as complete as possible) information about every
field, including the target field, income bracket. In other words, the records in the training set need to be pre-classified. A provisional data mining model is then constructed using the training samples provided in the training data set.

#### **Training Set - Necessarily Incomplete**

However, the training set is necessarily incomplete; that is, it does not include the “new” or future data that the data modelers are really interested in classifying. Therefore, the algorithm needs to guard against “memorizing” the training set and blindly applying all patterns found in the training set to the future data.

For example, it may happen that all customers named “David” in a training set
may be in the high income bracket. We would presumably not want our final
model, to be applied to new data, to include the pattern “If the customer’s first name is David, the customer has a high income.” Such a pattern is a spurious artifact of the training set and needs to be verified before deployment.

### **Testing Set**

Therefore, the next step in supervised data mining methodology is to examine
how the provisional data mining model performs on a test set of data. In the test set, a holdout data set, the values of the target variable are hidden temporarily from the provisional model, which then performs classification according to the patterns and structure it learned from the training set. The efficacy of the classifications is then evaluated by comparing them against the true values of the target variable. The provisional data mining model is then adjusted to minimize the error rate on the test set.

In [None]:
# Splitting train:test in 90:10 ratio

import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

## **COMBINATION FUNCTION**

### **Simple Unweighted Voting**
1. Before running the algorithm, decide on the value of k, that is, how many
records will have a voice in classifying the new record.
2. Then, compare the new record to the k nearest neighbors, that is, to the k
records that are of minimum distance from the new record in terms of the
Euclidean distance or whichever metric the user prefers.
3. Once the k records have been chosen, then for simple unweighted voting,
their distance from the new record no longer matters. It is simple one
record, one vote.
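
The three steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the records in `train` are hypothetical (Age, Na/K) points rather than rows from `drug200.csv`, `knn_vote` is a helper name invented here, and ties are broken in favor of the nearer record, which is an arbitrary choice.

```python
import math
from collections import Counter

def knn_vote(train, query, k):
    """Classify `query` by simple unweighted voting among its k nearest records."""
    # Step 2: order the stored records by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
    # Step 3: one record, one vote; distances no longer matter once k records are chosen.
    votes = Counter(label for _, label in by_distance[:k])
    # Equal counts keep first-seen order, so a tie goes to the nearer record.
    return votes.most_common(1)[0][0]

# Hypothetical (Age, Na/K) records with their drug labels.
train = [
    ((16.8, 12.4), "drugC"),
    ((17.2, 12.8), "drugX"),
    ((16.5, 12.2), "drugX"),
    ((20.0, 15.0), "drugA"),
    ((40.0, 29.0), "DrugY"),
]
new_patient = (17.0, 12.5)

# Step 1: the value of k is decided before running the algorithm.
for k in (1, 3):
    print(k, knn_vote(train, new_patient, k))
```

With these made-up points, k = 1 returns the label of the single closest record (`drugC`), while k = 3 returns `drugX` because two of the three nearest records carry that label.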

### **Weighted Voting**

One may feel that neighbors that are closer or more similar to the new record
should be weighted more heavily than more distant neighbors.

In weighted voting, the influence of a particular record is inversely proportional to the distance of the record from the new record to be classified.
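
A minimal sketch of this idea, assuming each neighbor votes with weight 1/distance, is shown below. The data points are hypothetical, `knn_weighted_vote` is a name invented for this illustration, and returning the label of an exact match (distance 0) immediately is one possible way to avoid dividing by zero.

```python
import math
from collections import defaultdict

def knn_weighted_vote(train, query, k):
    """Classify `query` with votes weighted by inverse distance."""
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
    weights = defaultdict(float)
    for point, label in by_distance[:k]:
        d = math.dist(point, query)
        if d == 0:
            # An identical stored record decides the class on its own (assumption).
            return label
        weights[label] += 1.0 / d
    # The label with the largest total weight wins.
    return max(weights, key=weights.get)

# Hypothetical (Age, Na/K) records: one very close drugC, two farther drugX.
train = [
    ((17.0, 12.6), "drugC"),  # distance about 0.1 from the query
    ((18.0, 12.5), "drugX"),  # distance 1
    ((17.0, 13.5), "drugX"),  # distance 1
]
new_patient = (17.0, 12.5)

print(knn_weighted_vote(train, new_patient, k=3))
```

Here a simple majority among the three neighbors would favor `drugX`, but the single very close `drugC` record carries a weight of roughly 10 against a combined weight of 2, so the weighted vote returns `drugC`.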

**In our model today, we will calculate results for both the simple unweighted and the weighted voting techniques.**

## **CHOOSING k**

How should one go about choosing the value of k? In fact, there may not be
an obvious best solution. Consider choosing a small value for k. Then it is
possible that the classification or estimation may be unduly affected by outliers or unusual observations (“noise”). With small k (e.g., k = 1), the algorithm will simply return the target value of the nearest observation, a process that may lead the algorithm toward overfitting, tending to memorize the training data set at the expense of generalizability.
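
The effect of a small k can be illustrated with a tiny, hypothetical one-dimensional data set that contains a single noisy record. The helper names `knn_predict` and `accuracy` and all data values are assumptions made for this sketch, not part of the lab's model.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k stored records closest to x."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, records, k):
    """Fraction of `records` whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in records)
    return hits / len(records)

# Hypothetical data: "low" for x <= 5, "high" above, plus one noisy record at x = 3.
train = [(x, "low" if x <= 5 else "high") for x in range(1, 11)]
train[2] = (3, "high")  # the outlier

print(accuracy(train, train, k=1))   # every record is its own nearest neighbor
print(knn_predict(train, 3.1, k=1))  # follows the outlier
print(knn_predict(train, 3.1, k=3))  # the outlier is outvoted
```

With k = 1 the training accuracy is perfect because each record is matched with itself, yet a new point near the outlier inherits its unusual label. With k = 3 the two ordinary neighbors outvote the outlier.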

**In our model today, we will experiment with k = 2 to 20.**

## **Running the model**

We will run our model in different scenarios. The scenarios are as follows:

**Scenario 1** - We will train our model with **all available data** and calculate it for different $k$'s. We will also calculate the accuracy for each configuration and plot it in Matplotlib with **all available data**.

**Scenario 2** - We will train our model with **training dataset** and calculate it for different $k$'s. We will also calculate the accuracy for each configuration and plot it in Matplotlib with our **testing dataset**.

### **Scenario 1 - Training and testing with all available data**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap
from sklearn.neighbors import KNeighborsClassifier

score = []

for n_neighbors in range(2,21):
    for weights in ['uniform', 'distance']:
        # we create an instance of Neighbours Classifier and fit the data.
        clf = KNeighborsClassifier(n_neighbors, weights=weights)
        clf.fit(X, y)
        score.append(clf.score(X,y))

score = np.reshape(score,(-1,2))

In [None]:
score

In [None]:
import matplotlib.pyplot as plt
import numpy as np


labels = np.arange(2, 21, 1)
uniform = score[:, 0::2].reshape(-1).tolist()
distance = score[:, 1::2].reshape(-1).tolist()
x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars



fig, ax = plt.subplots(figsize=(18,8))
rects1 = ax.bar(x - width/2, uniform, width, label='Simple unweighted voting')
rects2 = ax.bar(x + width/2, distance, width, label='Weighted Voting')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xlabel('N_neighbors')
ax.set_ylabel('Scores')
ax.set_title('Accuracy by number of neighbors and voting scheme')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

ax.bar_label(rects1, padding=3)
ax.bar_label(rects2, padding=3)

fig.tight_layout()

plt.show()

### **Scenario 2 - Training with training data($90\%$) and testing with testing data($10\%$)**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap
from sklearn import neighbors

score = []

for n_neighbors in range(2,21):
    for weights in ['uniform', 'distance']:
        # we create an instance of Neighbours Classifier and fit the data.
        clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
        clf.fit(X_train, y_train)
        score.append(clf.score(X_test,y_test))

score = np.reshape(score,(-1,2))

In [None]:
score

In [None]:
import matplotlib.pyplot as plt
import numpy as np


labels = np.arange(2, 21, 1)
uniform = score[:, 0::2].reshape(-1).tolist()
distance = score[:, 1::2].reshape(-1).tolist()
x = np.arange(len(labels))  # the label locations
width = 0.45  # the width of the bars



fig, ax = plt.subplots(figsize=(18,8))
rects1 = ax.bar(x - width/2, uniform, width, label='Simple unweighted voting')
rects2 = ax.bar(x + width/2, distance, width, label='Weighted Voting')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xlabel('N_neighbors')
ax.set_ylabel('Scores')
ax.set_title('Accuracy by number of neighbors and voting scheme')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()


ax.bar_label(rects1, padding=3)
ax.bar_label(rects2, padding=3)

fig.tight_layout()

plt.show()

### **Checking for individual predictions that our model has made**

Let's check which drug our model recommends to the 3 new patients in our dataset.

In [None]:
clf = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance')
clf.fit(X_train, y_train)

In [None]:
# Drop the placeholder Drug column so only the five predictor columns remain
predictor = df2.drop(columns=["Drug"])

pred_labels = clf.predict(predictor)

In [None]:
pred_labels
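
The predictions come back as the numeric codes assigned during preprocessing. A small sketch of translating such codes back into drug names is given below. The list `example_labels` holds hypothetical values chosen for illustration and is not the actual output of the model, and `drug_names` is a helper dictionary introduced here.

```python
# Inverse of the encoding used in the preprocessing step.
drug_names = {0: "drugA", 1: "drugB", 2: "drugC", 3: "drugX", 4: "DrugY"}

# Hypothetical predicted codes; replace with pred_labels in the notebook.
example_labels = [4, 3, 3]

decoded = [drug_names[int(code)] for code in example_labels]
print(decoded)
```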

## **Visualizing the Decision boundaries (Not in syllabus)**

Let's take the first two features in our dataset and plot a decision boundary for 'Age' and 'Sex'.

In [None]:
import plotly.graph_objects as go
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

configure_plotly_browser_state()

init_notebook_mode(connected=False)# to show plots in notebook


mesh_size = .02
margin = 0.25



# Select the first two features (Age and Sex) and the target
X_m, y_m = X.iloc[:,[0,1]],y

# Create a mesh grid on which we will run our model
x_min, x_max = X_m.iloc[:, 0].min() - margin, X_m.iloc[:, 0].max() + margin
y_min, y_max = X_m.iloc[:, 1].min() - margin, X_m.iloc[:, 1].max() + margin
xrange = np.arange(x_min, x_max, mesh_size)
yrange = np.arange(y_min, y_max, mesh_size)
xx, yy = np.meshgrid(xrange, yrange)

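# Create classifier, run predictions on grid
clf = KNeighborsClassifier(15, weights='distance')
# Fit on plain arrays so that predicting on the NumPy grid raises no feature-name warning
clf.fit(X_m.values, y_m)
# Column 1 of predict_proba is the estimated probability of class 1 (drugB)
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)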

layout = dict(
    yaxis=dict(
        title='Sex',
        automargin=True,
    ),
    xaxis=dict(
        title='Age',
        automargin=True,
    ),
)

# Plot the figure
fig = go.Figure(data=[
    go.Contour(
        x=xrange,
        y=yrange,
        z=Z,
        colorscale='RdBu'
    )
],
    layout = layout)
fig.show()


# **That's all for today!**