# Indoor Positioning Systems
## MSDS 7333 - Section 401
## Case Study Week 6
[Data Science @ Southern Methodist University](https://datascience.smu.edu/)

### Due:
18 June 2018


### Table of Contents
* [Team Members](#Team-Members)
* [Abstract](#Abstract)
* [Introduction](#Introduction)
* [Methods](#Methods)
* [Results](#Results)
* [Conclusion](#Conclusion)
* [References](#References)

### <a name="Team-Members"></a>Team Members
* Kevin Cannon
* Austin Hancock

### <a name="Abstract"></a>Abstract

We analyze real-time location system (RTLS) data to determine the locations of routers within a network. We then use a k-nearest neighbors (kNN) model to predict the location of an observation from weighted signal strengths.

### <a name="Introduction"></a>Introduction

Wireless networking has allowed us to bypass Global Positioning Systems (GPS) in spaces where GPS signals do not work reliably. Indoor Positioning Systems (IPS), which run on wireless local area networks (LANs), use WiFi signals from network access points to determine near real-time information about connected devices. Initial IPS construction requires reference data that contains spatial and signal strength information around a fixed set of access points, also known as routers. Once the training data is analyzed, a model can be created to predict the location of a new, unknown device connected to the IPS as a function of the detected signal strength.

The data for this case study was taken from the companion website for the 'Data Science in R' book written by Nolan and Lang, which is referenced below in the [References](#References) section. To summarize, the raw data was taken from a Community Resource for Archiving Wireless Data At Dartmouth (CRAWDAD) site. The reference data set is termed the "offline" data set and is used to train our prediction model. The data "contains signal strengths measured using a hand-held device on a grid of 166 points spaced 1 meter apart in the hallways of one floor of a building at the University of Mannheim. The floor plan, which measures about 15 meters by 36 meters, is displayed in Figure 1.1" (Nolan & Lang, 4). The grid can be seen below.

<img src='http://www.rdatasciencecases.org/GeoLoc/images/building.png'>

Additionally, x, y, and orientation information about the hand-held device is included in the data set. In total, 110 signal strength measurements were taken at 8 orientations (45 degrees apart) to each of the 6 access points in the figure above. The data also records media access control (MAC) addresses, each of which uniquely identifies a piece of hardware on a network.

The format of the data is as follows:
* t="Timestamp";
* id="MACofScanDevice";
* pos="RealPosition";
* degree="orientation";
* MACofResponse1="SignalStrengthValue,Frequency,Mode"; ...
* MACofResponseN="SignalStrengthValue,Frequency,Mode"
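
As a rough illustration of this format, the sketch below splits one record into its fields. The record is a made-up example that follows the layout above (its values are assumptions, not copied from the data set), and the variable names `device` and `responses` are ours.

```r
# Illustrative record in the format above; the values are made up
line = paste0("t=1139643118358;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0;degree=0.0;",
              "00:14:bf:b1:97:8a=-38,2437000000,3;",
              "00:14:bf:b1:97:90=-56,2427000000,3")

# Split at semicolons, equals signs, and commas
tokens = strsplit(line, "[;=,]")[[1]]

# The first 10 tokens describe the scanning device:
# time, scanning MAC, x, y, z, and orientation (with their labels in between)
device = tokens[c(2, 4, 6:8, 10)]

# The remaining tokens come in groups of 4: responding MAC, signal, frequency, mode
responses = matrix(tokens[-(1:10)], ncol = 4, byrow = TRUE,
                   dimnames = list(NULL, c("mac", "signal", "channel", "type")))
print(device)
print(responses)
```

Each responding MAC becomes one row, which is the shape the processing function in the Methods section builds for every line of the file.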

A real-time location system (RTLS) is a system which uses wireless technologies to determine the location of a target in real-time. With the proliferation of IoT (internet of things) enabled devices in recent years, the ability to gather RTLS data indoors has increased significantly. To correctly gather and use this RTLS data, it is important to both understand the monitoring technology being used and how the location of a target is determined.

In this case study, we will analyze data gathered through an IPS composed of a number of routers on a single floor of a building. We will create a model to predict an observation's location based on the signal strength, in dBm, and the angle of the observation in relation to the 6 access points within the IPS.

### <a name="Methods"></a>Methods

To begin, we create the functions to pull in the RTLS data, clean it, and reshape it so that we can perform our analysis.

In [1]:
# Set rounding to 2 digits
options(digits = 2)

# Process one line of the input file so the same operation can be repeated for every row
# Second iteration of the function, corrected for errors:
# returns NULL for lines that contain no access point responses
processLine = function(x)
{
  tokens = strsplit(x, "[;=,]")[[1]] #splits the lines at semicolons, equals signs, & commas
  
  if (length(tokens) == 10) 
    return(NULL)
 
  tmp = matrix(tokens[ - (1:10) ], , 4, byrow = TRUE)
  #bind columns with the values from the first ten entries
  cbind(matrix(tokens[c(2, 4, 6:8, 10)], nrow(tmp), 6, 
               byrow = TRUE), tmp)
}

# Create function to convert orientation values to proper orientation bin
roundOrientation = function(angles) {
  refs = seq(0, by = 45, length  = 9)
  q = sapply(angles, function(o) which.min(abs(o - refs)))
  c(refs[1:8], 0)[q]
}

# Function to pull in data of MACs we need to address, clean and reshape data
# Returns dataset we will analyze as 'offline'
readData = 
  function(filename = 'http://rdatasciencecases.org/Data/offline.final.trace.txt', 
           subMacs = c("00:0f:a3:39:e1:c0", "00:0f:a3:39:dd:cd", "00:14:bf:b1:97:8a",
                       "00:14:bf:3b:c7:c6", "00:14:bf:b1:97:90", "00:14:bf:b1:97:8d",
                       "00:14:bf:b1:97:81"))
  {
    txt = readLines(filename) # read all lines of the file
    lines = txt[ substr(txt, 1, 1) != "#" ] # drop comment lines, which begin with the # character
    tmp = lapply(lines, processLine) # process through all lines of data
    #stack matrices together
    offline = as.data.frame(do.call("rbind", tmp), 
                            stringsAsFactors= FALSE) 
    
    #Add names to variables
    names(offline) = c("time", "scanMac", 
                       "posX", "posY", "posZ", "orientation", 
                       "mac", "signal", "channel", "type")
    
     # keep only signals from access points
    offline = offline[ offline$type == "3", ]
    
    # drop scanMac, posZ, channel, and type - no info in them
    dropVars = c("scanMac", "posZ", "channel", "type")
    offline = offline[ , !( names(offline) %in% dropVars ) ]
    
    # drop more unwanted access points - some access points are not near the testing area
    #   or were only active for a short period of time
    offline = offline[ offline$mac %in% subMacs, ]
    
    # convert numeric values
    numVars = c("time", "posX", "posY", "orientation", "signal")
    offline[ numVars ] = lapply(offline[ numVars ], as.numeric)

    # convert time to POSIX
    offline$rawTime = offline$time
    offline$time = offline$time/1000
    class(offline$time) = c("POSIXt", "POSIXct")
    
    # round orientations to nearest 45
    offline$angle = roundOrientation(offline$orientation)
      
    return(offline)
  }
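
To make the orientation binning concrete, the short sketch below repeats the `roundOrientation` logic so it runs on its own and applies it to a few made-up readings. The input values are assumptions chosen for illustration.

```r
# Same binning logic as roundOrientation above, repeated so the example is self-contained
roundOrientation = function(angles) {
  refs = seq(0, by = 45, length.out = 9)   # 0, 45, ..., 360
  q = sapply(angles, function(o) which.min(abs(o - refs)))
  c(refs[1:8], 0)[q]                       # the 360 bin wraps back to 0
}

# Made-up orientation readings, in degrees
binned = roundOrientation(c(2, 47, 130.5, 230, 358))
print(binned)   # 0 45 135 225 0
```

Note that a reading near 360, such as 358, lands in the 0 bin rather than in a separate 360 bin, which keeps the number of distinct angles at 8.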

Assign data to variable 'offline'.

In [12]:
# Read in offline data 
offline = readData()

Next we collect summary statistics for all locations, angles, and access points.

In [13]:
# It would take too many graphs to look at all of the locations, so we will use summary statistics
## mean and standard deviation, median and IQR

# Create a special factor that contains all unique (x,y) pairs for the 166 locations
offline$posXY = paste(offline$posX, offline$posY, sep = "-")

# Create a list of data frames for each combination of (x,y), angle, and access point
byLocAngleAP = with(offline, by(offline, list(posXY, angle, mac), function(x) x))

# Calculate summary statistics on each of the dataframes
signalSummary = 
  lapply(byLocAngleAP,            
         function(oneLoc) {
           ans = oneLoc[1, ]
           ans$medSignal = median(oneLoc$signal)
           ans$avgSignal = mean(oneLoc$signal)
           ans$num = length(oneLoc$signal)
           ans$sdSignal = sd(oneLoc$signal)
           ans$iqrSignal = IQR(oneLoc$signal)
           ans
           })
offlineSummary = do.call("rbind", signalSummary)

Create a function to build contour maps and parameterize MAC address, angle, and others, if we desire.

In [14]:
surfaceSS = function(data, mac, angle = 45) {
  require(fields)
  oneAPAngle = data[ data$mac == mac & data$angle == angle, ]
  smoothSS = Tps(oneAPAngle[, c("posX","posY")], 
                 oneAPAngle$avgSignal)
  vizSmooth = predictSurface(smoothSS)
  plot.surface(vizSmooth, type = "C", main=mac,
               xlab = "", ylab = "", xaxt = "n", yaxt = "n")
  points(oneAPAngle$posX, oneAPAngle$posY, pch=19, cex = 0.5) 
}

Keep only MAC addresses that we want to investigate.

In [15]:
subMacs = c("00:0f:a3:39:e1:c0", "00:0f:a3:39:dd:cd", "00:14:bf:b1:97:8a",
            "00:14:bf:3b:c7:c6", "00:14:bf:b1:97:90", "00:14:bf:b1:97:8d",
            "00:14:bf:b1:97:81")

We then look at contour plots of the signal strengths received from different access points. We modify the plotting parameters so we can place four contour plots on one canvas.

In [45]:
# Parameters for plot layout, saved in a new variable
parCur = par(mfrow = c(2, 2), mar = rep(1, 4))

# Create plots by calling the surfaceSS function four times
mapply(surfaceSS, mac = subMacs[rep(c(2, 1), each = 2)], 
      angle = rep(c(0, 135), 2),
      data = list(data = offlineSummary))

# Reset plot parameters
par(parCur)

> From these plots, we can identify the location of the access points as the dark red regions in the contour maps. We know the general locations of the access points based on the floor plan of the building, but we do not have a mapping between the MAC addresses and access points. The two top maps are for the access point "00:0f:a3:39:dd:cd" at angle 0 on the left, and angle 135 on the right. Similarly, the bottom two contour heat maps represent the signal strength for the "00:0f:a3:39:e1:c0" at the same two angles, respectively.

> Two MAC addresses have similar heat maps, which both correspond to the access point centrally located in the left half of the building, based on the building diagram in Figure 1. Additionally, a corridor effect can be seen: the signal is stronger to the north and south, where it is less blocked by walls. We will need to investigate these two MACs to determine which one belongs to the access point at this location.

When determining which MAC address to use, we will be looking at both the contour plots of signal strength and prediction errors from a k-nearest neighbor model. 

From the contour plots above, the "00:0f:a3:39:e1:c0" MAC address records much better signal strength than the "00:0f:a3:39:dd:cd" address, with signal strengths of -48 and -50 dBm compared to -60 and -60 dBm, respectively. This is a strong indicator that we should use the "c0" address for our IPS, but we will also look at how each MAC address affects the predictive capabilities of our kNN model.

Below, we look at the effects of each MAC address separately on our kNN model by first removing the "00:0f:a3:39:e1:c0" MAC address and then removing the "00:0f:a3:39:dd:cd" MAC address and comparing errors in predictions.

In [17]:
# Since these two appear to correspond to the same access point, we need to remove one
## Note: Test effect of removing one MAC or the other, or both MACs
#offlineSummary = subset(offlineSummary, mac != subMacs[1]) # remove "00:0f:a3:39:e1:c0"
## Sum of squared errors in predictions: 250
offlineSummary = subset(offlineSummary, mac != subMacs[2]) # remove "00:0f:a3:39:dd:cd"
## Sum of squared errors in predictions: 276

Now that we have acquired the 6 MAC addresses for the 6 access points, we can create a matrix with the positions of the 6 access points.

In [18]:
# Create a small matrix with the relevant positions for the 6 access points on the floor plan
AP = matrix(c(7.5, 6.3,
              2.5, -0.8, 
              12.8, -2.8,
              1.0, 14.0,
              33.5, 9.3,
              33.5, 2.8),
            ncol = 2, byrow = TRUE,
            dimnames = list(subMacs[-2], c("x", "y"))) 
    # subMacs[-1] to ignore 00:0f:a3:39:e1:c0
    # subMacs[-2] to ignore 00:0f:a3:39:dd:cd

Process raw data for 'online' variable.

In [19]:
macs = unique(offlineSummary$mac)
online = readData("http://rdatasciencecases.org/Data/online.final.trace.txt", subMacs = macs)

Create unique location identifier for online data.

In [20]:
online$posXY = paste(online$posX, online$posY, sep = "-")

Format data into 6 columns of signal strengths (one for each access point).

In [21]:
keepVars = c("posXY", "posX", "posY", "orientation", "angle")
byLoc = with(online,
             by(online, list(posXY),
                function(x){
                    ans = x[1, keepVars]
                    avgSS = tapply(x$signal, x$mac, mean)
                    y = matrix(avgSS, nrow = 1, ncol = 6, 
                               dimnames = list(ans$posXY, names(avgSS)))
                    cbind(ans, y)
                }))
onlineSummary = do.call("rbind", byLoc)
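
The reshaping step can be seen on a toy example. The sketch below uses made-up readings at a single location from two access points (the labels `ap1` and `ap2` are placeholders, not real MAC addresses) and turns the per-access-point averages into one wide row.

```r
# Made-up readings at one location from two access points
x = data.frame(posXY  = "0-0",
               mac    = c("ap1", "ap1", "ap2", "ap2"),
               signal = c(-50, -54, -70, -66),
               stringsAsFactors = FALSE)

# Average the signal per access point, then lay the averages out as one row
avgSS = tapply(x$signal, x$mac, mean)
wide = matrix(avgSS, nrow = 1,
              dimnames = list(x$posXY[1], names(avgSS)))
print(wide)
```

With the real data the row has 6 columns, one per access point, which is why the code above fixes `ncol = 6`.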

Adjust orientation.

In [22]:
m = 3; angleNewObs = 230
refs = seq(0, by = 45, length  = 8)
nearestAngle = roundOrientation(angleNewObs)
  
if (m %% 2 == 1) {
  angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
} else {
  m = m + 1
  angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
  if (sign(angleNewObs - nearestAngle) > -1) 
    angles = angles[ -1 ]
  else 
    angles = angles[ -m ]
}

Map angles to values in refs (e.g. -45 maps to 315, 405 maps to 45).

In [23]:
# Need to adjust angles
angles = angles + nearestAngle
angles[angles < 0] = angles[angles < 0 ] + 360
angles[angles > 360] = angles[angles > 360] - 360
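
The same wrap-around can be sketched with modular arithmetic. This is an alternative formulation, not the code used in this case study; `wrapAngle` is a hypothetical helper name. One difference is worth noting: an angle of exactly 360 is mapped to 0 here, whereas the comparison `angles > 360` leaves it at 360.

```r
# Wrap angles into the range [0, 360) using the modulo operator
wrapAngle = function(angles) angles %% 360

# Neighboring angles around a nearest bin of 0 degrees and of 315 degrees
aroundZero = wrapAngle(0 + c(-45, 0, 45))
around315  = wrapAngle(315 + c(-45, 0, 45))
print(aroundZero)       # 315 0 45
print(around315)        # 270 315 0
print(wrapAngle(405))   # 45
```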

We then select the observations from offlineSummary that we will analyze.

In [24]:
offlineSubset = offlineSummary[offlineSummary$angle %in% angles,]

Below we create a function that aggregates the signal strengths from the selected angles and reshapes them into one row per location.

In [25]:
reshapeSS = function(data, varSignal = "signal", keepVars = c("posXY", "posX", "posY")){
    byLocation = with(data, by(data, list(posXY),
                           function(x){
                               ans = x[1, keepVars]
                               avgSS = tapply(x[, varSignal], x$mac, mean)
                               y = matrix(avgSS, nrow = 1, ncol = 6, 
                                          dimnames = list(ans$posXY,
                                                          names(avgSS)))
                               cbind(ans, y)
                           }))
     newDataSS = do.call("rbind", byLocation)
     return(newDataSS)
    }

Summarize and reshape offlineSubset.

In [26]:
trainSS = reshapeSS(offlineSubset, varSignal = "avgSignal")

Create function to select angles and call to reshapeSS.

In [27]:
selectTrain = function(angleNewObs, signals = NULL, m = 1){
  # angleNewObs is the angle of the new observation
  # signals is the training data (data in the format of offlineSummary)
  # m is the number of angles to keep between 1 and 5 (the angles to include from signals)
  refs = seq(0, by = 45, length  = 8)
  nearestAngle = roundOrientation(angleNewObs)
  
  if (m %% 2 == 1) 
    angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
  else {
    m = m + 1
    angles = seq(-45 * (m - 1) /2, 45 * (m - 1) /2, length = m)
    if (sign(angleNewObs - nearestAngle) > -1) 
      angles = angles[ -1 ]
    else 
      angles = angles[ -m ]
  }
  angles = angles + nearestAngle
  angles[angles < 0] = angles[ angles < 0 ] + 360
  angles[angles > 360] = angles[ angles > 360 ] - 360
  angles = sort(angles) 
  
  offlineSubset = signals[ signals$angle %in% angles, ]
  reshapeSS(offlineSubset, varSignal = "avgSignal")
}

Create functions for our kNN model and error calculator.

In [28]:
# Finding the Nearest Neighbors
findNN = function(newSignal, trainSubset) {
  diffs = apply(trainSubset[ , 4:9], 1, 
                function(x) x - newSignal)
  dists = apply(diffs, 2, function(x) sqrt(sum(x^2)) )
  closest = order(dists)
  return(trainSubset[closest, 1:3 ])
}
                                                         
# Predict the (x, y) position of each new observation from its k nearest neighbors
predXY = function(newSignals, newAngles, trainData, 
                  numAngles = 1, k = 3){
  
  closeXY = vector("list", length = nrow(newSignals))
  
  for (i in 1:nrow(newSignals)) {
    trainSS = selectTrain(newAngles[i], trainData, m = numAngles)
    closeXY[[i]] = findNN(newSignal = as.numeric(newSignals[i, ]),
                           trainSS)
  }

  estXY = lapply(closeXY, function(x)
                            sapply(x[ , 2:3], 
                                    function(x) mean(x[1:k])))
  estXY = do.call("rbind", estXY)
  return(estXY)
}
                                   
# Calculate error
calcError = 
function(estXY, actualXY) 
   sum( rowSums( (estXY - actualXY)^2) )
actualXY = onlineSummary[ , c("posX", "posY")]
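
The nearest-neighbor idea can be followed on a toy example. The sketch below uses made-up training data with only two access points and computes the distances with `sweep` and `rowSums` instead of the nested `apply` calls above, so it is an illustration of the technique rather than a copy of `findNN`.

```r
# Made-up training summary: location id, x, y, and signals from two access points
train = data.frame(posXY = c("0-0", "1-0", "5-5"),
                   posX  = c(0, 1, 5),
                   posY  = c(0, 0, 5),
                   ap1   = c(-40, -45, -70),
                   ap2   = c(-70, -65, -40),
                   stringsAsFactors = FALSE)

newSignal = c(-42, -68)   # signal vector of a new observation

# Euclidean distance in signal space from each training location
dists = sqrt(rowSums(sweep(as.matrix(train[ , 4:5]), 2, newSignal)^2))
closest = order(dists)

# Estimate the position as the mean of the k nearest locations
k = 2
estXY = colMeans(train[closest[1:k], c("posX", "posY")])
print(estXY)

# Squared error against a hypothetical true position
actualXY = c(posX = 1, posY = 0)
sqErr = sum((estXY - actualXY)^2)
print(sqErr)
```

The error here is the squared distance between the estimated and actual positions, which is the quantity `calcError` sums over all test observations.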

In [29]:
# Cross-Validation and Choice of k

v = 11
permuteLocs = sample(unique(offlineSummary$posXY))
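permuteLocs = matrix(permuteLocs, ncol = v, 
                     nrow = floor(length(permuteLocs)/v))

Warning: "data length [166] is not a sub-multiple or multiple of the number of rows [15]"

R raises this warning because the 166 locations do not divide evenly into 11 folds. The matrix holds 11 × 15 = 165 locations, so one randomly chosen location is left out of the folds.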

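The fold construction can be reproduced on stand-in labels. The sketch below first mirrors the `matrix` approach and then shows one possible alternative, assigning fold numbers cyclically with `split`, that keeps every location. The labels and the seed are assumptions for illustration, and the alternative is not what this case study uses.

```r
set.seed(1)                      # only to make the shuffle reproducible
locs = paste0("loc", 1:166)      # stand-ins for the 166 posXY labels
v = 11

# matrix() keeps 11 * 15 = 165 labels and warns about the leftover one
permuted = sample(locs)
foldMat = suppressWarnings(matrix(permuted, ncol = v,
                                  nrow = floor(length(permuted) / v)))
print(length(foldMat))           # 165

# Alternative: assign fold numbers cyclically so every location is used
folds = split(permuted, rep(1:v, length.out = length(permuted)))
print(table(lengths(folds)))     # ten folds of 15 and one fold of 16
```
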
In [30]:
onlineFold = subset(offlineSummary, posXY %in% permuteLocs[,1])

In [31]:
reshapeSS = function(data, varSignal = "signal", 
                     keepVars = c("posXY", "posX","posY"),
                     sampleAngle = FALSE, 
                     refs = seq(0, 315, by = 45)) {
  byLocation =
    with(data, by(data, list(posXY), 
                  function(x) {
                    if (sampleAngle) {
                      x = x[x$angle == sample(refs, size = 1), ]}
                    ans = x[1, keepVars]
                    avgSS = tapply(x[ , varSignal ], x$mac, mean)
                    y = matrix(avgSS, nrow = 1, ncol = 6, 
                               dimnames = list(ans$posXY,
                                               names(avgSS)))
                    cbind(ans, y)
                  }))

  newDataSS = do.call("rbind", byLocation)
  return(newDataSS)
}

Remove extra MAC address and apply reshapeSS function to collect summary data.

In [32]:
offline = offline[ offline$mac != "00:0f:a3:39:dd:cd", ] # Change when testing MACs

keepVars = c("posXY", "posX","posY", "orientation", "angle")

onlineCVSummary = reshapeSS(offline, keepVars = keepVars, 
                            sampleAngle = TRUE)

In [33]:
onlineFold = subset(onlineCVSummary, 
                    posXY %in% permuteLocs[ , 1])

offlineFold = subset(offlineSummary,
                     posXY %in% permuteLocs[ , -1])

estFold = predXY(newSignals = onlineFold[ , 6:11], 
                 newAngles = onlineFold[ , 4], 
                 offlineFold, numAngles = 3, k = 3)

actualFold = onlineFold[ , c("posX", "posY")]

In [34]:
K = 20
err = rep(0, K)

for (j in 1:v) {
  onlineFold = subset(onlineCVSummary, 
                      posXY %in% permuteLocs[ , j])
  offlineFold = subset(offlineSummary,
                       posXY %in% permuteLocs[ , -j])
  actualFold = onlineFold[ , c("posX", "posY")]
  
  for (k in 1:K) {
    estFold = predXY(newSignals = onlineFold[ , 6:11], 
                     newAngles = onlineFold[ , 4], 
                     offlineFold, numAngles = 3, k = k)
    err[k] = err[k] + calcError(estFold, actualFold)
  }
}

In [37]:
oldPar = par(mar = c(4, 3, 1, 1))
plot(y = err, x = (1:K),  type = "l", lwd= 2,
     ylim = c(1100, 2100),
     xlab = "Number of Neighbors",
     ylab = "Sum of Square Errors")

# err holds sums of squared errors, so the minimum is named accordingly
sseMin = min(err)
kMin = which(err == sseMin)[1]
segments(x0 = 0, x1 = kMin, y0 = sseMin, col = gray(0.4), 
         lty = 2, lwd = 2)
segments(x0 = kMin, x1 = kMin, y0 = 1100,  y1 = sseMin, 
         col = grey(0.4), lty = 2, lwd = 2)

text(x = kMin - 2, y = sseMin + 40, 
     label = as.character(round(sseMin)), col = grey(0.4))
par(oldPar)
dev.off()

In [36]:
estXYk5 = predXY(newSignals = onlineSummary[ , 6:11],  
                 newAngles = onlineSummary[ , 4], 
                 offlineSummary, numAngles = 3, k = 5)

In [38]:
# Count errors in our predictions
calcError(estXYk5, actualXY)

After running both MAC addresses through our model, we see that the "00:0f:a3:39:dd:cd" address performed better, but only slightly (a sum of squared errors of 250 against 276). Since the difference in predictive capability between the two MAC addresses is marginal, we will use the "00:0f:a3:39:e1:c0" address due to its greater signal strength and larger number of observations.

Now that we have made our selection of the MAC addresses for each access point, we will next analyze the effect on predictions of weighting our model using signal strength. By adding weight, we are giving more power to closer access points in terms of the location calculation.

To determine the effect of this weighting, we first need to modify our prediction functions. Below, we update the nearest-neighbor search so that the differences in signal strength are scaled by normalized weights built from the inverse of the strongest signals at each training location; the estimate in estXY is still the average position of the k nearest neighbors. The intent is to assign more weight to stronger signals: as signal strength increases, its distance decreases in terms of our location calculation.

In [39]:
# Edit functions to include weights for k=3
findNN_inverse = function(newSignal, trainSubset) {
  # For each training location, subtract the new signal strengths from its
  # signal strengths and scale by the summed normalized inverse weights of
  # its three strongest signals
  diffs = apply(trainSubset[ , 4:9], 1, 
                function(x) {
                  top3 = sort(x, decreasing = TRUE)[1:3]  # three strongest signals
                  w = (1 / top3) / sum(1 / top3)          # normalized inverse weights
                  sum(w) * (x - newSignal)
                })
                
  # For each new strength (diffs), calculate the distance
  dists = apply(diffs, 2, function(x) sqrt(sum(x^2)) )
  closest = order(dists)
  return(trainSubset[closest, 1:3])
}

         
predXY_inverse = function(newSignals, newAngles, trainData, 
                  numAngles = 1, k = 3){
  
  closeXY = vector("list", length = nrow(newSignals))
  
  for (i in 1:nrow(newSignals)) {
    trainSS = selectTrain(newAngles[i], trainData, m = numAngles)
    closeXY[[i]] = findNN_inverse(newSignal = as.numeric(newSignals[i, ]),
                           trainSS)
  }

  estXY = lapply(closeXY, function(x)
                            sapply(x[ , 2:3], 
                                   function(x) mean(x[1:k]))) 

  estXY = do.call("rbind", estXY)
  return(estXY)
}
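
For comparison, one common way to weight a kNN location estimate is to average the coordinates of the k nearest neighbors with weights proportional to the inverse of their distance in signal space. The sketch below illustrates that idea on made-up numbers. `weightedXY` and its arguments are hypothetical names, and this is not the implementation used in this case study.

```r
# Inverse-distance weighted average of the k nearest neighbors' coordinates.
# xy:    matrix of neighbor positions, ordered from nearest to farthest
# dists: matching distances in signal space (assumed to be greater than 0)
weightedXY = function(xy, dists, k = 3) {
  w = 1 / dists[1:k]
  w = w / sum(w)                        # weights sum to 1
  colSums(xy[1:k, , drop = FALSE] * w)  # row i is multiplied by w[i]
}

# Made-up neighbors and distances
xy = cbind(posX = c(0, 2, 0), posY = c(0, 0, 4))
dists = c(1, 2, 4)
est = weightedXY(xy, dists, k = 3)
print(est)
```

With these numbers the weights are 4/7, 2/7, and 1/7, so the nearest neighbor pulls the estimate hardest. A distance of exactly zero would need special handling, which the sketch leaves out.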

In [40]:
# This applies function
estXYk3 = predXY_inverse(newSignals = onlineSummary[ , 6:11],  
                 newAngles = onlineSummary[ , 4], 
                 offlineSummary, numAngles = 3, k = 3)

In [41]:
# Count errors in our predictions
calcError(estXYk3, actualXY)

> When using the weights of the 3 nearest neighbors, the sum of squared errors in our predictions is 307.

To test whether a greater number of neighbors improves our predictive capabilities, we apply the same method with 5 nearest neighbors.

In [42]:
# Edit functions to include weights for k=5
findNN_inverse5 = function(newSignal, trainSubset) {
  # For each MAC, subtract new signal strength from MAC signal strength
  diffs = apply(trainSubset[ , 4:9], 1, 
                function(x) 
                    (
                        (1/(sort(x,partial=length(x))[length(x)]))/
                         (
                             (1/(sort(x,partial=length(x))[length(x)]))+
                             (1/(sort(x,partial=length(x)-1)[length(x)-1])) +
                             (1/(sort(x,partial=length(x)-2)[length(x)-2])) +
                             (1/(sort(x,partial=length(x)-1)[length(x)-3])) +
                             (1/(sort(x,partial=length(x)-2)[length(x)-4])) 
                             
                        ) +
                        ((1/(sort(x,partial=length(x)-1)[length(x)-1])))/
                         (
                             (1/(sort(x,partial=length(x))[length(x)]))+
                             (1/(sort(x,partial=length(x)-1)[length(x)-1])) +
                             (1/(sort(x,partial=length(x)-2)[length(x)-2])) +
                             (1/(sort(x,partial=length(x)-1)[length(x)-3])) +
                             (1/(sort(x,partial=length(x)-2)[length(x)-4])) 
                        ) +
                        ((1/(sort(x,partial=length(x)-2)[length(x)-2])))/
                         (
                             (1/(sort(x,partial=length(x))[length(x)]))+
                             (1/(sort(x,partial=length(x)-1)[length(x)-1])) +
                             (1/(sort(x,partial=length(x)-2)[length(x)-2])) +
                             (1/(sort(x,partial=length(x)-1)[length(x)-3])) +
                             (1/(sort(x,partial=length(x)-2)[length(x)-4])) 
                        ) +
                        ((1/(sort(x,partial=length(x)-3)[length(x)-3])))/
                         (
                             (1/(sort(x,partial=length(x))[length(x)]))+
                             (1/(sort(x,partial=length(x)-1)[length(x)-1])) +
                             (1/(sort(x,partial=length(x)-2)[length(x)-2])) +
                             (1/(sort(x,partial=length(x)-1)[length(x)-3])) +
                             (1/(sort(x,partial=length(x)-2)[length(x)-4])) 
                        ) +
                        ((1/(sort(x,partial=length(x)-4)[length(x)-4])))/
                         (
                             (1/(sort(x,partial=length(x))[length(x)]))+
                             (1/(sort(x,partial=length(x)-1)[length(x)-1])) +
                             (1/(sort(x,partial=length(x)-2)[length(x)-2])) +
                             (1/(sort(x,partial=length(x)-1)[length(x)-3])) +
                             (1/(sort(x,partial=length(x)-2)[length(x)-4])) 
                        )
                    )
                    *(x - newSignal)) 
    # For each new strength (diffs), calculate the distance
  dists = apply(diffs, 2, function(x) sqrt(sum(x^2)) )
  closest = order(dists)
  return(trainSubset[closest, 1:5])
}

         
predXY_inverse5 = function(newSignals, newAngles, trainData, 
                  numAngles = 1, k = 5){
  
  closeXY = vector("list", length = nrow(newSignals))
  
  for (i in 1:nrow(newSignals)) {
    trainSS = selectTrain(newAngles[i], trainData, m = numAngles)
    closeXY[[i]] = findNN_inverse5(newSignal = as.numeric(newSignals[i, ]),
                           trainSS)
  }

  estXY = lapply(closeXY, function(x)
                            sapply(x[ , 2:3], 
                                   function(x) mean(x[1:k]))) 
                                   
  estXY = do.call("rbind", estXY)
  return(estXY)
}

In [43]:
# This applies function
estXYk5 = predXY_inverse5(newSignals = onlineSummary[ , 6:11],  
                 newAngles = onlineSummary[ , 4], 
                 offlineSummary, numAngles = 3, k = 5)

In [44]:
# Count errors in our predictions
calcError(estXYk5, actualXY)

> When using the weights of the 5 nearest neighbors, the sum of squared errors in our predictions is reduced to 291. This confirms what we learned from our sum of squared errors chart, which showed a lower SSE for 5-6 neighbors than for 3 neighbors.

### <a name="Results"></a>Results

When deciding which MAC address to keep, we looked at both the contour plots and the prediction errors. The contour plots showed greater signal strength for the "00:0f:a3:39:e1:c0" MAC address than for the "00:0f:a3:39:dd:cd" address at the angles we tested. The kNN model we used to calculate prediction errors revealed that the "cd" MAC address produced a lower error, but was only marginally better than the "c0" address. Given this slight difference in predictive capability in our first prediction model, we decided to keep the "c0" address due to its greater signal strength and continued the analysis with "00:0f:a3:39:e1:c0" as our access point.

After we had our 6 access points, we added weights based on signal strength to our kNN model to try to improve its predictive capabilities. To compare the accuracy of models based on the number of nearest neighbors used in the location calculation, we created two models: k = 3 and k = 5. The k = 5 model produced better predictions, which agreed with our k-fold cross-validation check showing that 5 or 6 nearest neighbors would be more accurate than 3. While using 5 to 6 neighbors gives us better results than using 3 neighbors in a weighted model, the unweighted model still produced the best results.

### <a name="Conclusion"></a>Conclusion

Our analysis into the two MAC addresses which both had signal strength positions that corresponded to the same access point resulted in the elimination of the same address proposed in the Nolan and Lang text. The additional look into the resulting k-nearest neighbor predictions, as well as analysis of signal strengths through the use of contour plots, allowed us to more confidently state which MAC address was the one mapped to the contested access point.

Next, the use of k-fold cross-validation enabled us to determine the number of nearest neighbors to use for our prediction model. While we were able to improve the predictions of our weighted kNN model by tuning the number of neighbors we applied weights to, we were unable to use this model to produce better predictions than the unweighted model. To improve our predictions, it might be helpful to use different distance calculation methods or a different weighting function.

### <a name="References"></a>References

Nolan, D., and D. Temple Lang. Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press, 2015. books.google.com.sa/books?id=r_0YCwAAQBAJ.