# Avocado Classifier
This Jupyter Notebook contains code that takes a table of information about avocados (average price, total volume, total bags) and classifies each row as either `conventional` or `organic`. The classifier is a $k$-nearest neighbors classifier that uses the Euclidean (Cartesian) distance between the point in question and the points in the training set. The data set is from Kaggle (https://www.kaggle.com/neuromusic/avocado-prices).

In [2]:
library(tidyverse)

avocado <- read.csv('avocado.csv')
head(avocado)

Date,AveragePrice,Total.Volume,X4046,X4225,X4770,Total.Bags,Small.Bags,Large.Bags,XLarge.Bags,type,year,region
2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0,conventional,2015,Albany
2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0,conventional,2015,Albany
2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0,conventional,2015,Albany
2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0,conventional,2015,Albany
2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0,conventional,2015,Albany
2015-11-22,1.26,55979.78,1184.27,48067.99,43.61,6683.91,6556.47,127.44,0,conventional,2015,Albany


## 2. Divide the Kaggle data set into the training and test sets
This cell selects the four columns we will use from the original table (three feature columns and the `type` column), shuffles the rows, and separates them into two sets: a training set, to which the avocado to be classified will be compared, and a test set, used to measure the accuracy of the classifier once it is built. The test set retains its `type` column so that we can compute the proportion of avocados the classifier gets correct. The training set has 18,000 rows and the test set has 249.

In [3]:
av <- avocado %>%
    select(AveragePrice, Total.Volume, Total.Bags, type) %>%
    sample_frac(1)
av_train <- av[1:18000,]
av_test <- av[-(1:18000),]

# ensuring all rows are captured in av_test and av_train
dim(av)[1] == dim(av_train)[1] + dim(av_test)[1]
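
Because the shuffle is random, the split above differs on every run. A minimal sketch of a reproducible shuffle-and-split follows, assuming a fixed seed is acceptable. The toy data frame `toy`, the seed value, and `n_train` are illustrative names and values, not part of this notebook.

```r
# Hypothetical toy data standing in for the avocado table
toy <- data.frame(x = 1:10, type = rep(c("conventional", "organic"), 5))

set.seed(42)                           # assumed seed; any fixed integer works
shuffled <- toy[sample(nrow(toy)), ]   # shuffle all rows

n_train <- 8                           # illustrative split size
toy_train <- shuffled[seq_len(n_train), ]
toy_test  <- shuffled[-seq_len(n_train), ]

# Every row lands in exactly one of the two sets
nrow(toy_train) + nrow(toy_test) == nrow(toy)
```

Setting the seed before shuffling means that rerunning the cell reproduces the same training and test rows, which makes accuracy figures comparable between runs.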

## 3. Define a function to find the Cartesian distances
In this section, I will define a function that finds the 3-dimensional Cartesian distance between two points. This is an application of the Pythagorean Theorem. The distance between two points $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ is

$$d = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2}$$

The function defined below takes two arguments: a table whose first 3 columns are data points, and a vector (or one-row table) containing the corresponding values for the point being compared. It returns the table with a new column holding the distance between each row in the table and that point.
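
As a quick check of the formula, here is a worked example on two made-up points (they are illustrative and do not come from the avocado data):

```r
# Illustrative points, not taken from the avocado data
p1 <- c(1, 2, 3)
p2 <- c(4, 6, 15)

# Differences are 3, 4 and 12, so d = sqrt(9 + 16 + 144) = sqrt(169) = 13
d <- sqrt(sum((p2 - p1)^2))
d
```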

In [4]:
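# Note: this definition masks stats::dist for the rest of the session.
dist <- function (df, vec) {
    # vec may be a numeric vector or a one-row data frame; flatten it to numbers
    vec <- as.numeric(unlist(vec))
    new_df <- data.frame(df)
    # vectorized Euclidean distance from every row of df to vec
    new_df$distances <- sqrt((df[[1]] - vec[1])^2 + (df[[2]] - vec[2])^2 + (df[[3]] - vec[3])^2)
    new_df
}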

In [5]:
head(dist(av_train, av_test[1, 1:3]))

AveragePrice,Total.Volume,Total.Bags,type,distances
0.83,1104682.52,345923.79,conventional,1148933.02
1.3,302596.77,107071.19,conventional,312280.829
0.78,1062071.65,152918.86,conventional,1064769.46
1.91,4366.74,2661.17,organic,3716.762
1.27,8471.24,8052.74,organic,3889.732
1.91,12382.57,6631.11,organic,5229.984


## 4. Define a function to find the majority classification
$k$-NN classifiers work by determining which classification a majority of the $k$ points closest to the point in question have. The function `find_majority` defined below runs the `dist` function on the training table and returns the result sorted by increasing distance. The function `knn`, defined after it, selects the top $k$ rows and returns the majority classification.

In [6]:
find_majority <- function (df, df2, row_index) {
    test <- df2[row_index, 1:3]
    d <- df %>%
        dist(test) %>%
        arrange(distances)
    d
}

In [None]:
head(find_majority(av_train, av_test, 1))

In [None]:
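knn <- function (df, df2, row, k) {
    # neighbors of the test row, nearest first
    sorted <- find_majority(df, df2, row)
    votes <- sorted[1:k,] %>%
        count(type) %>%
        arrange(desc(n))
    # most common type among the k nearest; as.character guards against factor columns
    as.character(votes$type[1])
}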

In [None]:
knn(av_train, av_test, 1, 7)
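
To make the majority-vote step concrete, here is a minimal base R sketch on hand-written labels. The vector `neighbour_types` is hypothetical and simply stands in for the `type` column of the $k = 7$ nearest rows.

```r
# Hypothetical labels of the k = 7 nearest neighbours, closest first
neighbour_types <- c("organic", "organic", "conventional", "organic",
                     "conventional", "organic", "conventional")

votes <- table(neighbour_types)              # count each class
majority <- names(votes)[which.max(votes)]   # class with the most votes
majority
```

Here four of the seven neighbours are `organic`, so the vote returns `organic`. With two classes, an odd $k$ guarantees there is never a tie.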

## 5. Test the accuracy of the 7-NN classifier
As an example, I will test how accurate the 7-nearest neighbors classifier is. The `test_accuracy` function defined below runs the classifier on all rows of the `av_test` table (the entire test set), and then returns the proportion of rows that were correctly classified.

In [None]:
test_accuracy <- function (train, test, k) {
    classed <- c()
    # classify every row of the test set
    for (i in seq_len(nrow(test))) {
        cl <- knn(train, test, i, k)
        classed <- c(classed, cl)
    }
    classed_test <- data.frame(test)
    classed_test$kNN.class <- classed
    # proportion of rows whose predicted class matches the true type
    mean(classed_test$kNN.class == classed_test$type)
}

In [None]:
test_accuracy(av_train, av_test, 7)
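
The accuracy calculation itself is just the share of matching labels. A small sketch with hypothetical predictions (the vectors `predicted` and `actual` are made up for illustration):

```r
# Hypothetical predictions and true labels for five test rows
predicted <- c("organic", "conventional", "organic", "organic", "conventional")
actual    <- c("organic", "conventional", "conventional", "organic", "conventional")

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion correct
accuracy <- mean(predicted == actual)
accuracy
```

Four of the five labels match, so the accuracy is 0.8.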

## 6. Determining the optimal value of $k$
In order to determine how many nearest neighbors would be best to use on a random avocado, this section searches for the optimal value of $k$ by classifying the test set against the training set for each candidate value. It runs the classifier for the odd integer values 1 through 99 and builds a table with the accuracy of each value.

In [None]:
results <- c()
for (i in seq(1, 100, by=2)) {
    result = test_accuracy(av_train, av_test, i)
    results <- c(results, result)
}

optimal_k = data.frame(
    k = seq(1, 100, by=2),
    Accuracy = results
)
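
Once the table is built, the best $k$ is the row with the highest accuracy. A minimal sketch of that lookup, using a toy table `toy_results` whose accuracy values are invented for illustration and are not results from this notebook:

```r
# Hypothetical accuracies, for illustration only (not results from this notebook)
toy_results <- data.frame(
    k = c(1, 3, 5, 7),
    Accuracy = c(0.90, 0.93, 0.95, 0.94)
)

# Row with the highest accuracy; which.max returns the first maximum on ties
best <- toy_results[which.max(toy_results$Accuracy), ]
best$k
```

The same expression applied to `optimal_k` would pick the smallest $k$ among any values tied for the top accuracy.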