# Part 1: Wisconsin Breast Cancer Diagnoses
## Step 1
##### "Collect" the Data

In [1]:
wbcd <- read.csv("wisc_bc_data.csv", header=TRUE)

## Step 2
##### Explore & Prepare the Data

In [2]:
wbcd <- wbcd[-1] # Removing the ID of the patients.
table(wbcd$diagnosis)


  B   M 
357 212 

In [3]:
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)


   Benign Malignant 
     62.7      37.3 

In [4]:
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])

  radius_mean       area_mean      smoothness_mean  
 Min.   : 6.981   Min.   : 143.5   Min.   :0.05263  
 1st Qu.:11.700   1st Qu.: 420.3   1st Qu.:0.08637  
 Median :13.370   Median : 551.1   Median :0.09587  
 Mean   :14.127   Mean   : 654.9   Mean   :0.09636  
 3rd Qu.:15.780   3rd Qu.: 782.7   3rd Qu.:0.10530  
 Max.   :28.110   Max.   :2501.0   Max.   :0.16340  

In [5]:
normalize <- function(x)
{
  return ((x - min(x)) / (max(x) - min(x)))
}
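
As a quick sanity check of the `normalize` function above, the sketch below applies the same formula to a small made-up vector (`x_demo` is illustrative and not taken from the dataset). The smallest value should map to 0 and the largest to 1.

```r
# Min-max rescaling, same definition as the notebook's normalize()
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

x_demo <- c(1, 2, 3, 4, 5)   # hypothetical values, not from wisc_bc_data.csv
x_scaled <- normalize(x_demo)
print(x_scaled)              # 0.00 0.25 0.50 0.75 1.00
```

Note that a column whose values are all identical would give `0 / 0`, which R evaluates to `NaN`, so this formula assumes every feature takes at least two distinct values.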

In [6]:
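# Min-max normalize the 30 numeric features (column 1 is the diagnosis label).
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
# Use the first 469 rows for training and the last 100 rows for testing.
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]
# Keep the matching diagnosis labels from the un-normalized data frame.
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]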

## Step 3
##### Training a Model on the Data

In [7]:
library(class)

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 15)

## Step 4
##### Evaluating Model Performance

In [8]:
library(gmodels)

CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
                 | wbcd_test_pred 
wbcd_test_labels |    Benign | Malignant | Row Total | 
-----------------|-----------|-----------|-----------|
          Benign |        61 |         0 |        61 | 
                 |     1.000 |     0.000 |     0.610 | 
                 |     0.953 |     0.000 |           | 
                 |     0.610 |     0.000 |           | 
-----------------|-----------|-----------|-----------|
       Malignant |         3 |        36 |        39 | 
                 |     0.077 |     0.923 |     0.390 | 
                 |     0.047 |     1.000 |           | 
                 |     0.030 |     0.360 |           | 
-----------------|-----------|-----------|-----------|
    Column Total |        64 |        36 |       100 | 
           

## Step 5
##### Improving Model Performance

In [9]:
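# Z-score standardize every feature (all columns except the diagnosis label).
wbcd_z <- as.data.frame(scale(wbcd[-1]))
wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]
# Same row split and labels as before, so only the scaling changes.
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 15)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)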


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
                 | wbcd_test_pred 
wbcd_test_labels |    Benign | Malignant | Row Total | 
-----------------|-----------|-----------|-----------|
          Benign |        61 |         0 |        61 | 
                 |     1.000 |     0.000 |     0.610 | 
                 |     0.953 |     0.000 |           | 
                 |     0.610 |     0.000 |           | 
-----------------|-----------|-----------|-----------|
       Malignant |         3 |        36 |        39 | 
                 |     0.077 |     0.923 |     0.390 | 
                 |     0.047 |     1.000 |           | 
                 |     0.030 |     0.360 |           | 
-----------------|-----------|-----------|-----------|
    Column Total |        64 |        36 |       100 | 
           

#### Comment: 
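The z-score transformation produces exactly the same cross table as min-max normalization (97 of 100 test cases correct, with 3 malignant cases predicted as benign), so it offers no improvement here. We therefore return to min-max normalization and look for the best value of k.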
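
To make the difference between the two rescalings concrete, here is a minimal sketch on a made-up feature (`x_demo` is hypothetical). Min-max normalization bounds the values to [0, 1], while `scale()` centers them at 0 with unit standard deviation and leaves them unbounded.

```r
x_demo <- c(2, 4, 6, 8, 100)   # hypothetical feature with one large value

# Min-max normalization: (x - min) / (max - min)
minmax <- (x_demo - min(x_demo)) / (max(x_demo) - min(x_demo))

# Z-score standardization: (x - mean) / sd; scale() returns a matrix
zscore <- as.numeric(scale(x_demo))

print(range(minmax))   # 0 1
print(mean(zscore))    # approximately 0
print(sd(zscore))      # 1
```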

In [10]:
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]

In [11]:
# Try several values of k on the min-max normalized data.
# Uncomment the CrossTable line after a knn call to inspect that result.
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 1)
# CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 5)
# CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 11)
# CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 15)
# CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 21)
# CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 27)
# CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
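
The cell above reruns `knn` by hand for each candidate k. The same search could be written as a loop that records test accuracy per k. The sketch below is one possible way to do this, not the notebook's method, and it runs on made-up two-cluster data (`demo_x`, `demo_y`, and the seed are all assumptions) so that it is self-contained.

```r
library(class)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical two-class data: two Gaussian clusters in two features
n <- 100
demo_x <- data.frame(
  f1 = c(rnorm(n, mean = 0), rnorm(n, mean = 2)),
  f2 = c(rnorm(n, mean = 0), rnorm(n, mean = 2))
)
demo_y <- factor(rep(c("Benign", "Malignant"), each = n))

idx <- sample(seq_len(2 * n), 150)    # 150 rows for training, 50 for testing
k_values <- c(1, 5, 11, 15, 21, 27)   # same candidates as the notebook

# Fraction of test rows classified correctly for each k
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = demo_x[idx, ], test = demo_x[-idx, ],
              cl = demo_y[idx], k = k)
  mean(pred == demo_y[-idx])
})

print(data.frame(k = k_values, accuracy = accuracy))
```

With the real data, `demo_x[idx, ]`, `demo_x[-idx, ]`, and the label vectors would be replaced by `wbcd_train`, `wbcd_test`, `wbcd_train_labels`, and `wbcd_test_labels`.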

#### Conclusion: 
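Across the kNN trials with k = 1, 5, 11, 15, 21, and 27 on min-max normalized data, k = 15 performed best. With k = 15, the model correctly classifies 97 of the 100 test cases (97% accuracy) as benign or malignant growths in breast tissue; the 3 errors are all malignant cases predicted as benign.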
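
The 97% figure can be reproduced from the counts in the k = 15 cross table in Step 4. The sketch below copies those four counts into a matrix by hand (the object names are illustrative) and derives accuracy and the share of malignant cases that were caught.

```r
# Counts copied from the k = 15 cross table (rows = actual, columns = predicted)
conf_mat <- matrix(c(61,  0,
                      3, 36),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("Benign", "Malignant"),
                                   predicted = c("Benign", "Malignant")))

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)        # (61 + 36) / 100
false_negatives <- conf_mat["Malignant", "Benign"]     # malignant predicted benign
sensitivity <- conf_mat["Malignant", "Malignant"] / sum(conf_mat["Malignant", ])

print(accuracy)               # 0.97
print(false_negatives)        # 3
print(round(sensitivity, 3))  # 0.923
```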

# Part 2: UCI ML Repository, Zoo Data
## Step 1

#### Comment: 
The Zoo data describe 101 animals, each classified into one of seven types based on 16 characteristics, most of them Boolean.

In [12]:
library(reshape)
library(plyr)

zoo <- read.csv("zoo.csv", header=TRUE)
zoo <- zoo[-1] # Removing individual species name, but keeping the type distinction.


Attaching package: ‘reshape’

The following object is masked from ‘package:class’:

    condense


Attaching package: ‘plyr’

The following objects are masked from ‘package:reshape’:

    rename, round_any



## Step 2

In [13]:
# Renaming the animal types.
labels <- c("Mammal", "Bird", "Reptile", "Marine", "Amphibian", "Insect", "Invertibrate")
zoo$Type <- factor(zoo$Type, levels = 1:7, labels = labels)
# Min-max normalize the 16 features, then randomly pick 70 of the 101 animals
# for training; the remaining 31 form the test set.
zoo_n <- as.data.frame(lapply(zoo[1:16], normalize))
select <- sample(1:101, 70, replace = FALSE)
zoo_train_n <- zoo_n[select, ]
zoo_test_n <- zoo_n[-select, ]
zoo_train_labels <- zoo[select, 17]
zoo_test_labels <- zoo[-select, 17]
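
Because the split above relies on `sample`, it draws different rows each time the cell runs unless a seed is set first. A minimal sketch of making the split repeatable with `set.seed` follows; the seed value 42 is arbitrary and not taken from the notebook.

```r
set.seed(42)  # arbitrary seed, an assumption for this sketch
split_a <- sample(1:101, 70, replace = FALSE)

set.seed(42)  # resetting the same seed replays the same draw
split_b <- sample(1:101, 70, replace = FALSE)

print(identical(split_a, split_b))  # TRUE
```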

## Steps 3 & 4

In [14]:
# Creating a knn prediction vector and generating a cross table.
zoo_test_pred_n <- knn(train = zoo_train_n, test = zoo_test_n, cl = zoo_train_labels, k = 11)
invisible(capture.output(ct <- CrossTable(x = zoo_test_labels, y = zoo_test_pred_n, prop.chisq = FALSE, 
                 missing.include=FALSE)))

In [15]:
# Converting the information from the cross table into a more readable format, a dataframe.
df_n <- as.data.frame(ct)[, 1:3]
colnames(df_n) <- c('Actual', 'Predicted', 'number')
df2_n <- cast(df_n, Actual ~ Predicted)
df2_n <- rename(df2_n, c("Mammal"="Mammal_knn","Bird"="Bird_knn", "Reptile"="Reptile_knn", 
              "Marine"="Marine_knn", "Amphibian"="Amphibian_knn", 
              "Insect"="Insect_knn", "Invertibrate"="Invertibrate_knn"))
df2_n

Using number as value column.  Use the value argument to cast to override this choice
The following `from` values were not present in `x`: Reptile, Amphibian


        Actual Mammal_knn Bird_knn Marine_knn Insect_knn Invertibrate_knn
1       Mammal         14        0          0          0                0
2         Bird          0        4          0          0                0
3      Reptile          0        0          2          0                0
4       Marine          0        0          5          0                0
5    Amphibian          0        1          1          0                0
6       Insect          0        0          0          1                0
7 Invertibrate          0        0          0          1                2


#### Comment: 
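With min-max normalization, the model does fairly well at predicting animal type: 26 of the 31 test animals are classified correctly. The errors fall on the smaller classes in this test set. Both reptiles are labeled Marine, the two amphibians are labeled Bird and Marine, and one invertebrate is labeled Insect; no test animal is ever predicted to be a reptile or an amphibian. Mammals, birds, marine animals, and the single insect are all classified correctly.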
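
The per-class picture can be read off the table above. As a rough sketch, the counts are copied by hand into a matrix below (object names are illustrative) and used to compute, for each actual type, the fraction of test animals that were classified correctly.

```r
# Counts copied from the min-max table (rows = actual, columns = predicted)
classes <- c("Mammal", "Bird", "Reptile", "Marine",
             "Amphibian", "Insect", "Invertibrate")
predicted <- c("Mammal", "Bird", "Marine", "Insect", "Invertibrate")
counts <- matrix(c(14, 0, 0, 0, 0,
                    0, 4, 0, 0, 0,
                    0, 0, 2, 0, 0,
                    0, 0, 5, 0, 0,
                    0, 1, 1, 0, 0,
                    0, 0, 0, 1, 0,
                    0, 0, 0, 1, 2),
                 nrow = 7, byrow = TRUE,
                 dimnames = list(actual = classes, predicted = predicted))

# Correct predictions per class; a class that was never predicted has no
# column in the table, so it contributes 0 correct predictions.
correct <- sapply(classes, function(cl) {
  if (cl %in% predicted) counts[cl, cl] else 0
})
recall <- correct / rowSums(counts)

print(round(recall, 2))
print(sum(correct) / sum(counts))  # overall accuracy: 26 / 31
```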

## Step 5

In [16]:
# Testing z-score standardization, using a fresh random split (select2).
zoo_z <- as.data.frame(scale(zoo[-17]))
select2 <- sample(1:101, 70, replace = FALSE)
zoo_train_z <- zoo_z[select2,]
zoo_test_z <- zoo_z[-select2,]
zoo_train_labels_z <- zoo[select2, 17]
zoo_test_labels_z <- zoo[-select2, 17]

In [17]:
zoo_test_pred_z <- knn(train = zoo_train_z, test = zoo_test_z, cl = zoo_train_labels_z, k = 11)
invisible(capture.output(
    zt <- CrossTable(x = zoo_test_labels_z, y = zoo_test_pred_z, prop.chisq = FALSE)))

In [18]:
df_z <- as.data.frame(zt)[, 1:3]
colnames(df_z) <- c('Actual', 'Predicted', 'number')
df2_z <- cast(df_z, Actual ~ Predicted)
df2_z <- rename(df2_z, c("Mammal"="Mammal_knn","Bird"="Bird_knn", "Reptile"="Reptile_knn", 
                         "Marine"="Marine_knn", "Amphibian"="Amphibian_knn", 
                         "Insect"="Insect_knn", "Invertibrate"="Invertibrate_knn"))
df2_z

Using number as value column.  Use the value argument to cast to override this choice
The following `from` values were not present in `x`: Reptile, Amphibian


        Actual Mammal_knn Bird_knn Marine_knn Insect_knn Invertibrate_knn
1       Mammal         11        0          0          0                0
2         Bird          0        9          0          0                0
3      Reptile          0        0          2          0                0
4       Marine          0        0          5          0                0
5       Insect          0        0          0          2                0
6 Invertibrate          0        0          0          0                2


#### Comment: 
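In the run shown, the z-score model classifies 29 of the 31 test animals correctly; the only errors are the two reptiles, both labeled Marine. That is a higher count than the 26 of 31 obtained with min-max normalization, but the two results are not directly comparable: this run uses a different random train/test split, and its test set contains no amphibians. Results will also change from run to run because no random seed is set.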
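
Steps 2 and 5 draw separate random samples (`select` and `select2`), so the two tables score different test animals. A like-for-like comparison would reuse one split for both scalings. The sketch below illustrates the idea on made-up data; `demo`, `demo_labels`, the seed, and the helper `acc` are all hypothetical.

```r
library(class)

set.seed(7)  # arbitrary seed for the sketch

# Hypothetical data: two features on very different scales, two classes
n <- 60
demo <- data.frame(
  small = c(rnorm(n, 0), rnorm(n, 1.5)),
  large = c(rnorm(n, 100, 50), rnorm(n, 180, 50))
)
demo_labels <- factor(rep(c("A", "B"), each = n))

# The two rescalings used in the notebook
demo_n <- as.data.frame(lapply(demo, function(x) (x - min(x)) / (max(x) - min(x))))
demo_z <- as.data.frame(scale(demo))

train_idx <- sample(seq_len(nrow(demo)), 84)  # one split, reused for both

# Test accuracy of kNN (k = 11) on a given feature table
acc <- function(features) {
  pred <- knn(train = features[train_idx, ], test = features[-train_idx, ],
              cl = demo_labels[train_idx], k = 11)
  mean(pred == demo_labels[-train_idx])
}

results <- c(minmax = acc(demo_n), zscore = acc(demo_z))
print(results)
```

For the zoo data, the equivalent change would be to index both `zoo_n` and `zoo_z` with the same `select` vector.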

#### Conclusion: 
After trying several values of k with min-max normalized data, k = 11 yielded the best results. The kNN model predicts which of seven categories an animal falls into from characteristics such as whether it has hair, teeth, or wings. It makes these predictions well overall, but it fails on some distinctions that a human would find easy, such as telling reptiles from marine animals.