# kNN Classifier

Import our libraries first.

In [4]:
library(class)
library(gmodels)

Read in our Illness data.

In [5]:
illness <- read.csv("../illness-mapped.csv")

Now we have to normalize our data. Each column is normalized by applying this function to every data point in the column. It subtracts the column's minimum from each data point and divides the result by the difference between the column's maximum and minimum, giving a value between 0 and 1.

In [8]:
normalize <- function(x) {
    num <- x - min(x)
    denom <- max(x) - min(x)
    return (num/denom)
}

illness_norm <- as.data.frame(lapply(illness[1:8], normalize))

# Print out a summary of the data
summary(illness_norm)

 plasma_glucose         bp              age         skin_thickness 
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.3028   1st Qu.:0.4419   1st Qu.:0.2500   1st Qu.:0.100  
 Median :0.4437   Median :0.5349   Median :0.3929   Median :0.200  
 Mean   :0.4664   Mean   :0.5386   Mean   :0.3934   Mean   :0.291  
 3rd Qu.:0.6074   3rd Qu.:0.6279   3rd Qu.:0.5179   3rd Qu.:0.400  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
 num_pregnancies      insulin            bmi             pedigree      
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.07452   1st Qu.:0.2086   1st Qu.:0.07912   1st Qu.:0.03333  
 Median :0.13341   Median :0.3047   Median :0.15418   Median :0.08333  
 Mean   :0.17099   Mean   :0.3030   Mean   :0.18597   Mean   :0.15301  
 3rd Qu.:0.21184   3rd Qu.:0.3829   3rd Qu.:0.25396   3rd Qu.:0.21667  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
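As a quick check of the min-max formula, here is a minimal sketch on a made-up vector. The values `2, 4, 6, 10` are illustrative only and are not taken from the illness data; the function definition is repeated so the snippet runs on its own.

```r
# Same min-max scaling as above, repeated so this snippet is self-contained
normalize <- function(x) {
    num <- x - min(x)
    denom <- max(x) - min(x)
    return (num/denom)
}

# Toy input (hypothetical values): min is 2, max is 10, so the range is 8
toy <- c(2, 4, 6, 10)
toy_norm <- normalize(toy)
print(toy_norm)
# values: 0, 0.25, 0.5, 1
```

One caveat worth keeping in mind: if a column is constant, the denominator is 0 and the function returns `NaN` for every entry, so such columns would need to be handled separately.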

Once the data is normalized, we split it into training and testing data in a ratio of roughly 2:1. To do this, we create a vector the same length as our data and assign each element a value of 1 or 2, where the probability of getting a 1 is 67% and a 2 is 33%.

In [9]:
sampler <- sample(2, nrow(illness_norm), replace=TRUE, prob=c(0.67, 0.33))
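Because `sample` draws each element independently, the split is only approximately 2:1 and changes on every run. The sketch below shows one way to make such a split reproducible and to inspect the proportions. The seed `42`, the row count `300` and the name `sampler_demo` are assumptions for illustration and are not part of the original notebook.

```r
# Fix the random number generator so the same split comes out on every run
# (the seed value is arbitrary)
set.seed(42)

# Hypothetical number of rows, standing in for nrow(illness_norm)
n <- 300

# Each element is 1 with probability 0.67 and 2 with probability 0.33
sampler_demo <- sample(2, n, replace = TRUE, prob = c(0.67, 0.33))

# Count how many rows landed in each group, then show the proportions
print(table(sampler_demo))
print(prop.table(table(sampler_demo)))
```

The observed proportions should be close to 0.67 and 0.33 but will rarely match them exactly.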

We create our training data by selecting only the rows whose corresponding entry in our sampler is 1. We do the same for the test data, except that we select the rows where it is 2. `illness_norm` already contains only the numerical columns, so the classes are left out of the training and testing data for the classifier to predict. Note that the `-1` column index additionally drops the first column (`plasma_glucose`), so the classifier sees the remaining seven features.

In [10]:
train <- illness_norm[sampler==1, -1]
test <- illness_norm[sampler==2, -1]

After splitting our data, we take the class labels (column 9 of the original data) for the training rows and for the test rows.

In [12]:
train_labels <- illness[sampler==1, 9]
test_labels <- illness[sampler==2, 9]

Now we can run the `knn` classifier function from the `class` library, with K set to 3 nearest neighbours. Once the classifier has run, we cross-tabulate the predictions against the true test labels with `CrossTable` from `gmodels` for inspection.
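To make the idea concrete, here is a minimal base-R sketch of what a 3-nearest-neighbour vote does for a single query point. The helper `knn_one` and the toy data are hypothetical and exist only for illustration; this is not how `class::knn` is implemented. In particular, `knn` breaks voting ties at random, while this sketch simply takes the first most frequent label.

```r
# Hypothetical helper: classify one query point by majority vote
# among its k nearest training rows (Euclidean distance)
knn_one <- function(train, labels, query, k = 3) {
    # Subtract the query from every training row, then take row-wise distances
    diffs <- sweep(as.matrix(train), 2, query)
    dists <- sqrt(rowSums(diffs^2))
    # Labels of the k closest training rows
    nearest <- labels[order(dists)[1:k]]
    # Most frequent label among those neighbours
    names(which.max(table(nearest)))
}

# Toy training set with two already-normalized features (made-up values)
toy_train <- data.frame(a = c(0.1, 0.2, 0.9, 0.8),
                        b = c(0.1, 0.3, 0.9, 0.7))
toy_labels <- c("negative", "negative", "positive", "positive")

# Two of the three nearest rows are labelled "positive"
pred_high <- knn_one(toy_train, toy_labels, c(0.85, 0.8), k = 3)
# Two of the three nearest rows are labelled "negative"
pred_low <- knn_one(toy_train, toy_labels, c(0.1, 0.1), k = 3)
print(c(pred_high, pred_low))
```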

In [14]:
illness_predicted <- knn(train, test, train_labels, 3)

CrossTable(x = test_labels, y = illness_predicted, prop.chisq=FALSE)


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  131 

 
             | illness_predicted 
 test_labels |  negative |  positive | Row Total | 
-------------|-----------|-----------|-----------|
    negative |        77 |        16 |        93 | 
             |     0.828 |     0.172 |     0.710 | 
             |     0.748 |     0.571 |           | 
             |     0.588 |     0.122 |           | 
-------------|-----------|-----------|-----------|
    positive |        26 |        12 |        38 | 
             |     0.684 |     0.316 |     0.290 | 
             |     0.252 |     0.429 |           | 
             |     0.198 |     0.092 |           | 
-------------|-----------|-----------|-----------|
Column Total |       103 |        28 |       131 | 
             |     0.786 |     0.214 |           | 
------------
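The counts in the table can be summarized with a few standard metrics. The sketch below re-enters the four cell counts by hand, so it does not depend on the random split. The variable names are illustrative, and treating `positive` as the class of interest is an assumption.

```r
# Confusion matrix copied from the CrossTable output above:
# rows are the true labels, columns are the predicted labels
confusion <- matrix(c(77, 26, 16, 12), nrow = 2,
                    dimnames = list(actual = c("negative", "positive"),
                                    predicted = c("negative", "positive")))

# Share of all test rows that were classified correctly
accuracy <- sum(diag(confusion)) / sum(confusion)
# Share of truly positive rows that were predicted positive
sensitivity <- confusion["positive", "positive"] / sum(confusion["positive", ])
# Share of truly negative rows that were predicted negative
specificity <- confusion["negative", "negative"] / sum(confusion["negative", ])

metrics <- round(c(accuracy = accuracy,
                   sensitivity = sensitivity,
                   specificity = specificity), 3)
print(metrics)
# accuracy 0.679, sensitivity 0.316, specificity 0.828
```

In this run the classifier identifies most negative cases but misses roughly two thirds of the positive ones. That gap is easy to overlook when reading only the overall accuracy.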