Report on fertility kNN analysis
===
<b>STEP 1: Collecting the data</b>

The data for the analysis was taken from the UCI Machine Learning Repository. The dataset (.txt file) includes 100 observations, each with 10 attributes. The diagnosis is coded as "N" to indicate normal or "O" to indicate altered.

<b>STEP 2: Exploring and preparing the data</b>

In [1]:
# Loading the data
mydata = read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/00244/fertility_Diagnosis.txt", sep = ",", stringsAsFactors = FALSE)

The data was read into R, and its structure was examined. There were no missing values (NAs). The Season attribute (V1) was dropped, since it does not contribute to the prediction. The diagnosis labels were then recoded as “Normal” and “Altered”.

In [2]:
# Looking at the data
str(mydata)
# Checking if there are any missing values
sum(is.na(mydata))

'data.frame':	100 obs. of  10 variables:
 $ V1 : num  -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 1 1 ...
 $ V2 : num  0.69 0.94 0.5 0.75 0.67 0.67 0.67 1 0.64 0.61 ...
 $ V3 : int  0 1 1 0 1 1 0 1 0 1 ...
 $ V4 : int  1 0 0 1 1 0 0 1 0 0 ...
 $ V5 : int  1 1 0 1 0 1 0 1 1 0 ...
 $ V6 : int  0 0 0 0 0 0 -1 0 0 0 ...
 $ V7 : num  0.8 0.8 1 1 0.8 0.8 0.8 0.6 0.8 1 ...
 $ V8 : int  0 1 -1 -1 -1 0 -1 -1 -1 -1 ...
 $ V9 : num  0.88 0.31 0.5 0.38 0.5 0.5 0.44 0.38 0.25 0.25 ...
 $ V10: chr  "N" "O" "N" "N" ...


In [3]:
mydata <- mydata[,-1]
mydata$V10 <- factor(mydata$V10, levels = c("N", "O"),
                         labels = c("Normal", "Altered"))

In [4]:
summary(mydata$V10)

Some of the columns were already on a 0 to 1 scale, but others (V2 (Age), V6 (High fevers), V8 (Smoking habits)) were not. Min-max normalization was therefore applied to all eight predictor columns.

In [5]:
# Creating normalization function
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
# Normalizing the data
mydata_n <- as.data.frame(lapply(mydata[1:8], normalize))
summary(mydata_n$V2)
summary(mydata_n$V6)
summary(mydata_n$V8)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.120   0.340   0.338   0.500   1.000 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.500   0.500   0.595   1.000   1.000 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   0.325   0.500   1.000 
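
As a small worked example of the min-max formula, x' = (x - min(x)) / (max(x) - min(x)), the sketch below applies the same <i>normalize</i> function to a three-level code such as the one V6 uses. The input vector <i>v6_codes</i> is an illustrative assumption, not a slice of the dataset.

```r
# Same min-max function as in the cell above
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical input: a three-level code (-1, 0, 1)
v6_codes <- c(-1, 0, 1)
v6_scaled <- normalize(v6_codes)
print(v6_scaled)  # 0.0 0.5 1.0
```

Note that a constant column would make the denominator zero and yield NaN, so this function assumes every column takes at least two distinct values.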

The next substep is splitting the data into two parts: a training set and a test set. The training set contains the first 80 observations, and the test set contains the remaining 20. The diagnosis labels were stored in the new variables <i>mydata_train_labels</i> and <i>mydata_test_labels</i>.

In [6]:
# Creating training and test data
mydata_train <- mydata_n[1:80, ]
mydata_test <- mydata_n[81:100, ]

# Creating labels for training and test data
mydata_train_labels <- mydata[1:80, 9]
mydata_test_labels <- mydata[81:100, 9]
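
The split above takes rows in file order. If the file happens to be sorted by some attribute, a shuffled split may be more representative. The following is a minimal sketch of that alternative, not the method used in this report. The seed value and the stand-in objects <i>features</i> and <i>labels</i> are assumptions made so the block runs on its own.

```r
# Stand-in data so the sketch runs on its own; in the report these would be
# mydata_n (features) and mydata[, 9] (labels).
set.seed(123)  # arbitrary seed, chosen only for reproducibility
features <- data.frame(a = runif(100), b = runif(100))
labels <- factor(sample(c("Normal", "Altered"), 100, replace = TRUE,
                        prob = c(0.9, 0.1)),
                 levels = c("Normal", "Altered"))

# Shuffle the row indices, then take 80 for training and 20 for testing
idx <- sample(nrow(features))
train_idx <- idx[1:80]
test_idx <- idx[81:100]

train <- features[train_idx, ]
test <- features[test_idx, ]
train_labels <- labels[train_idx]
test_labels <- labels[test_idx]
```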

<b>STEP 3: Training a model on the data</b>

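The <i>class</i> package was installed and loaded into R. We then performed the kNN analysis using the <i>knn</i> function from that package. A k value of 2 was chosen based on the size of the dataset. Note that with an even k the two classes can receive the same number of votes, and <i>knn</i> breaks such ties at random, so results may vary between runs.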

In [7]:
library(class)
mydata_test_pred <- knn(train = mydata_train, test = mydata_test,
                      cl = mydata_train_labels, k = 2)

head(mydata_test)
head(mydata_test_pred)

     V2 V3 V4 V5  V6   V7  V8        V9
81 0.84  1  1  0 1.0  1.0 0.0  0.606383
82 0.62  1  1  1 1.0 0.75 0.5 0.1382979
83 0.84  1  0  0 1.0  0.5 0.0 0.1382979
84 0.72  1  1  1 1.0  1.0 0.0 0.2021277
85 0.56  1  0  0 1.0  1.0 1.0       0.0
86 0.78  1  1  0 0.5  0.5 1.0 0.2659574
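
To make explicit what <i>knn</i> does for a single test row, the sketch below computes the Euclidean distance to every training row, keeps the k nearest, and takes a majority vote. The helper <i>knn_one</i> and the toy data are hypothetical and written only for illustration. Unlike <i>class::knn</i>, the helper does not break ties at random.

```r
# Hypothetical helper: classify one point by majority vote among its k
# nearest training rows (Euclidean distance). On a tied vote it returns the
# first factor level, unlike class::knn, which breaks ties at random.
knn_one <- function(train, labels, point, k) {
  diffs <- sweep(as.matrix(train), 2, as.numeric(point))  # row minus point
  dists <- sqrt(rowSums(diffs^2))
  nearest <- order(dists)[1:k]
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

# Toy training set (illustrative values only)
toy_train <- data.frame(x = c(0.1, 0.2, 0.9, 0.8, 0.85),
                        y = c(0.1, 0.3, 0.9, 0.7, 0.95))
toy_labels <- factor(c("Normal", "Normal", "Altered", "Altered", "Altered"))

pred <- knn_one(toy_train, toy_labels, point = c(0.15, 0.2), k = 3)
print(pred)
```

Here the three nearest rows are two “Normal” points and one “Altered” point, so the vote goes to “Normal”.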


<b>STEP 4: Evaluating model performance</b>

Using the <i>CrossTable</i> function from the <i>gmodels</i> package, we printed the confusion matrix as part of the model evaluation. In the table shown, 18 of the 20 test observations are classified correctly (17 “Normal” and 1 “Altered”), so the accuracy is 90% and the error rate is 10%.

In [10]:
library(gmodels)
# Creating the cross tabulation of predicted vs. actual
CrossTable(x = mydata_test_labels, y = mydata_test_pred,
           prop.chisq = FALSE)


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  20 

 
                   | mydata_test_pred 
mydata_test_labels |    Normal |   Altered | Row Total | 
-------------------|-----------|-----------|-----------|
            Normal |        17 |         1 |        18 | 
                   |     0.944 |     0.056 |     0.900 | 
                   |     0.944 |     0.500 |           | 
                   |     0.850 |     0.050 |           | 
-------------------|-----------|-----------|-----------|
           Altered |         1 |         1 |         2 | 
                   |     0.500 |     0.500 |     0.100 | 
                   |     0.056 |     0.500 |           | 
                   |     0.050 |     0.050 |           | 
-------------------|-----------|-----------|-----------|
      Column Total |        18 |         2 |        20 | 
                   |     0.900 |     0.100 |           | 
-------------------|-----------|-----------|-----------|
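
The evaluation rates can be derived directly from the four cell counts. The following is a minimal sketch, with the counts typed in by hand from the table above. The names <i>conf_mat</i> and <i>sensitivity_altered</i> are introduced here for illustration.

```r
# Counts copied from the CrossTable output above
# (rows = actual label, columns = predicted label)
conf_mat <- matrix(c(17, 1,
                      1, 1),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("Normal", "Altered"),
                                   predicted = c("Normal", "Altered")))

# Accuracy is the share of observations on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
error_rate <- 1 - accuracy

# Share of actual "Altered" cases that were predicted as "Altered"
sensitivity_altered <- conf_mat["Altered", "Altered"] / sum(conf_mat["Altered", ])

print(c(accuracy = accuracy, error = error_rate,
        sensitivity = sensitivity_altered))
```

With only two “Altered” cases in the test set, the sensitivity estimate rests on very few observations and should be read with caution.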

<b>STEP 5: Improving model performance</b>

After trying different k values, we reran the model with k = 3. In the table shown, 17 of the 20 test observations are classified correctly (accuracy of 85%): two “Normal” observations are predicted as “Altered”, and one “Altered” observation is predicted as “Normal”. The latter error, a false negative, is very undesirable in medical diagnosis.

In [11]:
mydata_test_pred <- knn(train = mydata_train, test = mydata_test, cl = mydata_train_labels, k=3)
CrossTable(x = mydata_test_labels, y = mydata_test_pred, prop.chisq=FALSE)


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  20 

 
                   | mydata_test_pred 
mydata_test_labels |    Normal |   Altered | Row Total | 
-------------------|-----------|-----------|-----------|
            Normal |        16 |         2 |        18 | 
                   |     0.889 |     0.111 |     0.900 | 
                   |     0.941 |     0.667 |           | 
                   |     0.800 |     0.100 |           | 
-------------------|-----------|-----------|-----------|
           Altered |         1 |         1 |         2 | 
                   |     0.500 |     0.500 |     0.100 | 
                   |     0.059 |     0.333 |           | 
                   |     0.050 |     0.050 |           | 
-------------------|-----------|-----------|-----------|
      Column Total |        17 |         3 |        20 | 
                   |     0.850 |     0.150 |           | 
-------------------|-----------|-----------|-----------|
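
Comparing several k values by hand is tedious, so one option is to loop over candidates and record both the accuracy and the number of missed “Altered” cases. The sketch below does this on stand-in data so that it runs on its own. The objects <i>features</i>, <i>labels</i> and <i>results</i>, the seed, and the candidate list are all assumptions. Odd k values are used because they avoid tied votes in a two-class problem.

```r
library(class)

set.seed(42)  # arbitrary seed, chosen only for reproducibility
# Stand-in data so the sketch runs on its own; in the report these would be
# mydata_train, mydata_test and their label vectors.
n <- 100
features <- data.frame(a = runif(n), b = runif(n))
labels <- factor(ifelse(features$a + features$b > 1.4, "Altered", "Normal"),
                 levels = c("Normal", "Altered"))
train <- features[1:80, ]
test <- features[81:100, ]
train_labels <- labels[1:80]
test_labels <- labels[81:100]

# For each candidate k, record accuracy and the number of actual "Altered"
# cases predicted as "Normal" (false negatives)
results <- do.call(rbind, lapply(c(1, 3, 5, 7), function(k) {
  pred <- knn(train = train, test = test, cl = train_labels, k = k)
  data.frame(k = k,
             accuracy = mean(pred == test_labels),
             false_negatives = sum(test_labels == "Altered" & pred == "Normal"))
}))
print(results)
```

Choosing k from this table on the same 20 test rows risks tuning to the test set. With so few observations, cross-validation on the training data would be a more cautious basis for the choice.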