# kNN Algorithm
### I am using the "Breast Cancer Wisconsin Diagnostic" dataset from the *UCI Machine Learning Repository*
##### I added the appropriate column headers to the dataset file before loading it

### Step One: Collect Data 

In [1]:
wdbc <- read.csv("wdbc.csv", stringsAsFactors = FALSE)

### Step Two: Explore and Prepare the Data

In [None]:
str(wdbc)

Let's remove $id from the dataset. It is only an identifier with no predictive value, and leaving it in would distort the distances that the kNN algorithm computes.

In [8]:
wdbc <- wdbc[-1]
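
To see why an identifier column can distort the distances, here is a minimal sketch with made-up values. The `toy` data frame, its `id` numbers, and its measurements are hypothetical and are not rows from the dataset.

```r
# Hypothetical records: rows 1 and 2 have similar measurements, row 3 is quite different.
toy <- data.frame(
  id          = c(842302, 91234567, 842517),
  radius_mean = c(17.99, 18.05, 9.50),
  area_mean   = c(1001.0, 1010.0, 270.0)
)

# Euclidean distances with the id column included:
# the gap between id values dominates, so row 3 looks closer to row 1 than row 2 does.
d_with_id <- as.matrix(dist(toy))

# Distances on the measurements only: row 2 is now the closest to row 1.
d_without_id <- as.matrix(dist(toy[-1]))

d_with_id[1, 2] > d_with_id[1, 3]        # TRUE
d_without_id[1, 2] < d_without_id[1, 3]  # TRUE
```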

In [None]:
table(wdbc$diagnosis)

We need to convert $diagnosis to a factor, since many of R's machine learning classifiers expect the target feature to be a factor.

In [2]:
wdbc$diagnosis <- factor(wdbc$diagnosis, levels = c("B", "M"), 
                        labels = c("Benign", "Malignant"))
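
As a small illustration of what `levels` and `labels` do here, the same conversion can be tried on a toy vector. The `codes` vector below is made up and only stands in for `wdbc$diagnosis`.

```r
# Toy character vector standing in for wdbc$diagnosis (values are made up).
codes <- c("B", "M", "B", "B", "M")

# levels names the codes present in the data; labels renames them in the same order.
diagnosis <- factor(codes, levels = c("B", "M"),
                    labels = c("Benign", "Malignant"))

levels(diagnosis)  # "Benign" "Malignant"
table(diagnosis)   # Benign: 3, Malignant: 2
```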

In [None]:
round(prop.table(table(wdbc$diagnosis)) * 100, digits = 1)

The code above confirms that the new labels are in use and shows the percentage of records in each class.

#### Transformation - normalizing numeric data 

Write a min-max normalize function that rescales a numeric vector to the range 0 to 1.

In [None]:
normalize <- function(x) {
    return ((x - min(x)) / (max(x) - min(x)))
}

Now let's test our new function

In [None]:
normalize(c(1,2,3,4,5))
normalize(c(10,20,30,40,50))
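
Because min-max normalization depends only on each value's relative position within the range, both test vectors should map to the same result. A quick check, repeating the function definition so the snippet runs on its own:

```r
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

a <- normalize(c(1, 2, 3, 4, 5))
b <- normalize(c(10, 20, 30, 40, 50))

a                # 0.00 0.25 0.50 0.75 1.00
all.equal(a, b)  # TRUE
```

Note that a column containing a single repeated value would produce NaN, since max(x) - min(x) is 0 in that case.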

Note that wdbc[2:31] below applies normalize() only to the 30 numeric columns and excludes the target variable $diagnosis (column 1).

In [None]:
wdbc_normalize <- as.data.frame(lapply(wdbc[2:31], normalize))
wdbc_n <- wdbc_normalize

In [None]:
summary(wdbc_n$area_mean)

Now the area_mean measurements range from 0 to 1, as expected after normalization.

#### Data Preparation - training and test datasets 

We will use the first 469 records for the training dataset and the remaining 100 for the test dataset. There are 569 records in total.

In [None]:
wdbc_train <- wdbc_n[1:469, ]
wdbc_test <- wdbc_n[470:569, ]

*Keep in mind that if your dataset is not in random order, you will need to use a random sampling method.*
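
One possible way to draw a random split is sketched below on a stand-in data frame. The names `toy`, `n_train`, and `train_idx`, and the seed value, are illustrative choices and not part of the original analysis.

```r
set.seed(123)  # arbitrary seed so the random split is reproducible

# Stand-in for a data frame whose rows are sorted by class.
toy <- data.frame(x = 1:10, label = rep(c("B", "M"), each = 5))

n_train   <- 7
train_idx <- sample(nrow(toy), n_train)  # random row indices, without replacement

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

nrow(toy_train)  # 7
nrow(toy_test)   # 3
```

With the wdbc data, the same index vector would have to be used for both the feature data frame and the label vector so that rows and labels stay aligned.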

Earlier, we excluded the target variable $diagnosis when building the normalized data frame wdbc_n. We will now store those class labels, taken from column 1 of wdbc, in factor vectors.

In [None]:
wdbc_train_labels <- wdbc[1:469, 1]
wdbc_test_labels <- wdbc[470:569, 1]

### Step Three: Train the Model

In [3]:
install.packages("class", repos = "https://cran.r-project.org/")


In [4]:
library(class)

Warning message: package 'class' was built under R version 3.2.5

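We now have access to the knn() function, which is called as knn(train, test, cl, k). *For k we use roughly the square root of 469 (the number of instances in our training dataset), which is about 21.7. Rounding down gives k = 21, an odd number, which reduces the chance of tied votes between the two classes.*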

In [None]:
wdbc_test_predictions <- knn(train = wdbc_train, test = wdbc_test, cl = wdbc_train_labels, k = 21)
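
The rule of thumb for k can be computed instead of hard-coded. A small sketch, where `n_train` is the training-set size stated above and the round-down-to-odd step is an assumption about how 21 was chosen:

```r
n_train <- 469

k <- floor(sqrt(n_train))    # sqrt(469) is about 21.66
if (k %% 2 == 0) k <- k + 1  # prefer an odd k to reduce tied votes in a two-class problem

k  # 21
```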

### Step Four: Evaluate Model Performance

In [5]:
install.packages("gmodels", repos = "https://cran.r-project.org/")


In [6]:
library(gmodels)

Warning message: package 'gmodels' was built under R version 3.2.5

gmodels is needed so that we have access to the CrossTable() function.

In [None]:
CrossTable(x = wdbc_test_labels, y = wdbc_test_predictions, prop.chisq = FALSE)

This cross table compares the actual test labels ("Benign" and "Malignant") with our test predictions. Benign was correctly predicted 77 times and malignant 21 times, and there were 2 false negatives (malignant cases predicted as benign). That is 98 correct out of 100 test records, or 98% accuracy.

### Step Five: Improving Model Performance

#### Transformation - z-score standardization (alternative to normalizing the numeric data)

In [9]:
wdbc_z <- as.data.frame(scale(wdbc[-1]))  # drop $diagnosis (column 1), then z-score each column

In [10]:
summary(wdbc_z$area_mean)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.4530 -0.6666 -0.2949  0.0000  0.3632  5.2460 

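Notice that the mean is 0, which is always the case for a z-score standardized variable. Unlike min-max normalization, the values are not bounded between 0 and 1. Here the maximum is 5.246.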
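
As a quick check of what scale() computes, the sketch below compares it with the z-score formula (x - mean) / sd on a made-up vector. The values in `x` are illustrative only.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values

z_manual <- (x - mean(x)) / sd(x)  # z-score by hand
z_scale  <- as.vector(scale(x))    # scale() returns a matrix, so flatten it

all.equal(z_manual, z_scale)  # TRUE
abs(mean(z_scale)) < 1e-12    # TRUE: the mean is 0 up to floating-point error
all.equal(sd(z_scale), 1)     # TRUE: the standard deviation is 1
```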

In [11]:
wdbc_train <- wdbc_z[1:469, ]

In [12]:
wdbc_test <- wdbc_z[470:569, ]

In [13]:
wdbc_train_labels <- wdbc[1:469, 1]

In [14]:
wdbc_test_labels <- wdbc[470:569, 1]

In [29]:
wdbc_test_pred <- knn(train = wdbc_train, test = wdbc_test, cl = wdbc_train_labels, k = 75)

In [30]:
CrossTable(x = wdbc_test_labels, y = wdbc_test_pred, prop.chisq = FALSE)


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
                 | wdbc_test_pred 
wdbc_test_labels |    Benign | Malignant | Row Total | 
-----------------|-----------|-----------|-----------|
          Benign |        76 |         1 |        77 | 
                 |     0.987 |     0.013 |     0.770 | 
                 |     0.962 |     0.048 |           | 
                 |     0.760 |     0.010 |           | 
-----------------|-----------|-----------|-----------|
       Malignant |         3 |        20 |        23 | 
                 |     0.130 |     0.870 |     0.230 | 
                 |     0.038 |     0.952 |           | 
                 |     0.030 |     0.200 |           | 
-----------------|-----------|-----------|-----------|
    Column Total |        79 |        21 |       100 | 
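
The accuracy for this run can be worked out from the four cell counts in the table above. A minimal sketch, with the counts copied by hand into a matrix (the name `confusion` is just illustrative):

```r
# Counts copied from the CrossTable output above
# (rows = actual labels, columns = predicted labels).
confusion <- matrix(c(76,  1,
                       3, 20),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual    = c("Benign", "Malignant"),
                                    predicted = c("Benign", "Malignant")))

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy  # 0.96

false_negatives <- confusion["Malignant", "Benign"]  # 3
false_positives <- confusion["Benign", "Malignant"]  # 1
```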
           

### k = ?...
Test-set accuracy for different values of k:

- k = 1: 94%
- k = 5: 96%
- k = 10: 97%
- k = 15: 98%
- k = 21: 98%
- k = 25: 98%
- k = 30: 98%
- k = 75: 96%
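
A comparison like this can be automated instead of rerunning knn() by hand for each k. The sketch below defines a hypothetical helper, `accuracy_by_k`, and demonstrates it on R's built-in iris data as a stand-in, since the wdbc file is not bundled with R. The seed, split size, and k values in the demonstration are arbitrary.

```r
library(class)  # provides knn()

# Hypothetical helper: test-set accuracy of knn() for each candidate k.
accuracy_by_k <- function(train, test, train_labels, test_labels, ks) {
  sapply(ks, function(k) {
    pred <- knn(train = train, test = test, cl = train_labels, k = k)
    mean(pred == test_labels)  # share of test records predicted correctly
  })
}

# Demonstration on the built-in iris data (a stand-in, not the wdbc data).
set.seed(1)
idx <- sample(nrow(iris), 100)  # 100 random rows for training, the rest for testing
ks  <- c(1, 5, 11)

acc <- accuracy_by_k(train = iris[idx, 1:4], test = iris[-idx, 1:4],
                     train_labels = iris$Species[idx],
                     test_labels  = iris$Species[-idx],
                     ks = ks)
names(acc) <- ks
acc
```

For the data in this notebook, the call would take wdbc_train, wdbc_test, wdbc_train_labels, and wdbc_test_labels, with ks = c(1, 5, 10, 15, 21, 25, 30, 75). Because knn() breaks tied votes at random, results can vary slightly between runs unless a seed is set.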