# Avocado Classifier
This Jupyter Notebook contains code that takes a table of information about avocados (average price, total volume, total bags) and classifies each one as either `conventional` or `organic`. The classifier is a $k$-nearest neighbors classifier that uses the Cartesian (Euclidean) distance between the point in question and the points in the training set. The data set is from Kaggle (https://www.kaggle.com/neuromusic/avocado-prices).

## 1. Import datascience, numpy, and the table
The cell below imports the `datascience` and `numpy` libraries and reads the csv file into a `datascience` Table object.

In [1]:
from datascience import *
import numpy as np

avocado = Table.read_table('avocado.csv')
avocado

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.6,1036.74,54454.8,48.16,8696.87,8603.62,93.25,0,conventional,2015,Albany
1,2015-12-20,1.35,54877.0,674.28,44638.8,58.33,9505.56,9408.07,97.49,0,conventional,2015,Albany
2,2015-12-13,0.93,118220.0,794.7,109150.0,130.5,8145.35,8042.21,103.14,0,conventional,2015,Albany
3,2015-12-06,1.08,78992.1,1132.0,71976.4,72.58,5811.16,5677.4,133.76,0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.4,75.78,6183.95,5986.26,197.69,0,conventional,2015,Albany
5,2015-11-22,1.26,55979.8,1184.27,48068.0,43.61,6683.91,6556.47,127.44,0,conventional,2015,Albany
6,2015-11-15,0.99,83453.8,1368.92,73672.7,93.26,8318.86,8196.81,122.05,0,conventional,2015,Albany
7,2015-11-08,0.98,109428.0,703.75,101815.0,80.0,6829.22,6266.85,562.37,0,conventional,2015,Albany
8,2015-11-01,1.02,99811.4,1022.15,87315.6,85.34,11388.4,11104.5,283.83,0,conventional,2015,Albany
9,2015-10-25,1.07,74338.8,842.4,64757.4,113.0,8625.92,8061.47,564.45,0,conventional,2015,Albany


## 2. Divide the Kaggle data set into the training and test sets
This cell selects the 4 columns we will use from the original table (three feature columns and the `type` column), shuffles the rows, and separates them into two sets: a training set, to which the avocado being classified will be compared, and a test set, used to measure the accuracy of the classifier once it is built. The test set retains its `type` column so that we know what proportion of avocados the classifier gets correct. The training set has 18,000 rows and the test set has 249.

In [2]:
av = avocado.select('AveragePrice', 'Total Volume', 'Total Bags', 'type')
av = av.sample(with_replacement=False)
av_train = av.take(np.arange(18000))
av_test = av.take(np.arange(18000, 18249))
av

AveragePrice,Total Volume,Total Bags,type
1.06,15669.8,11247.1,organic
0.92,844690.0,409948.0,conventional
1.11,128079.0,80410.8,organic
1.11,140981.0,36725.7,conventional
1.01,3033330.0,1156380.0,conventional
0.96,128078.0,20393.6,conventional
1.55,28095.0,6.18,organic
1.01,171880.0,34792.9,conventional
1.37,178200.0,48896.9,conventional
1.23,413224.0,127142.0,conventional
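
The shuffle-then-split step can also be sketched with plain `numpy` index arrays. This is only an illustration: `shuffle_split` and its `seed` parameter are hypothetical names, not part of the notebook. The cell above does not set a seed, so its split changes on every run; the fixed seed here is an assumption added to make the example repeatable.

```python
import numpy as np

def shuffle_split(n_rows, n_train, seed=0):
    """Return (train_idx, test_idx): a random permutation of the row
    indices, split after the first n_train entries."""
    rng = np.random.default_rng(seed)   # fixed seed makes the split repeatable
    order = rng.permutation(n_rows)     # shuffle without replacement
    return order[:n_train], order[n_train:]

# 18,249 rows in total, as implied by the 18,000 / 249 split above
train_idx, test_idx = shuffle_split(18249, 18000)
print(len(train_idx), len(test_idx))    # 18000 249
```

Index arrays like these could then be passed to `av.take(...)` in the same way the cell above passes `np.arange(18000)`.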


## 3. Define a function to find the Cartesian distances
In this section, I define a function that finds the 3-dimensional Cartesian distance between two points. This is an application of the Pythagorean Theorem. The distance between two points $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ is

$$d = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2}$$

The function defined takes as arguments a table whose first 3 columns are data points and an array containing the corresponding values for the point that is being compared. It returns the table with a new column that has the distance between each row in the table and the point in the array.
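
As a quick check of the formula, here is a minimal sketch that computes the distance from one point to several others using only `numpy`. The name `distances_to_point` and the sample points are made up for illustration; they do not come from the notebook or the avocado data.

```python
import numpy as np

def distances_to_point(points, target):
    """Euclidean distance from each row of `points` (n x 3) to `target` (length 3)."""
    points = np.asarray(points, dtype=float)
    target = np.asarray(target, dtype=float)
    # Broadcasting subtracts `target` from every row; squares are summed per row
    return np.sqrt(((points - target) ** 2).sum(axis=1))

pts = [[0, 0, 0], [1, 2, 2], [3, 4, 0]]
print(distances_to_point(pts, [0, 0, 0]))   # [0. 3. 5.]
```

The second and third points give $\sqrt{1 + 4 + 4} = 3$ and $\sqrt{9 + 16 + 0} = 5$, matching the formula above.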

In [3]:
def dist(t, arr):
    '''Takes in a table whose first 3 columns are the numerical data and an
    array holding the 3 corresponding values for a single point. Returns the
    table with a new 'distances' column containing the Cartesian distance
    from each row to that point'''
    # Column arithmetic is vectorized, so no explicit loop over rows is needed
    dists = np.sqrt(
        (t.column(0) - arr.item(0)) ** 2
        + (t.column(1) - arr.item(1)) ** 2
        + (t.column(2) - arr.item(2)) ** 2
    )
    return t.with_column('distances', dists)

In [4]:
dist(av_train, np.array(av_test.drop('type').row(0)))

AveragePrice,Total Volume,Total Bags,type,distances
1.06,15669.8,11247.1,organic,21378.2
0.92,844690.0,409948.0,conventional,898691.0
1.11,128079.0,80410.8,organic,111238.0
1.11,140981.0,36725.7,conventional,106705.0
1.01,3033330.0,1156380.0,conventional,3206270.0
0.96,128078.0,20393.6,conventional,92197.2
1.55,28095.0,6.18,organic,19739.1
1.01,171880.0,34792.9,conventional,136988.0
1.37,178200.0,48896.9,conventional,145579.0
1.23,413224.0,127142.0,conventional,392747.0
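
In the output above, the distances are on the same scale as the `Total Volume` and `Total Bags` values, so differences in `AveragePrice` (which are around 1 or less) contribute almost nothing to the result. One common option, which this notebook does not use, is to convert each column to standard units before computing distances. A minimal sketch, with `standard_units` as a hypothetical helper:

```python
import numpy as np

def standard_units(values):
    """Rescale an array to mean 0 and standard deviation 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# First three rows of the table output above
price = np.array([1.06, 0.92, 1.11])
volume = np.array([15669.8, 844690.0, 128079.0])

# After rescaling, both columns vary over a comparable range
print(standard_units(price))
print(standard_units(volume))
```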


## 4. Define a function to find the majority classification
$k$-NN classifiers assign a point the classification held by the majority of its $k$ closest training points. The function `find_majority` defined below runs the `dist` function on the training table and returns the result sorted by increasing distance. The function `knn`, defined after it, takes the first $k$ rows of that sorted table and returns the majority classification.

In [5]:
def find_majority(t, t2, row_index):
    '''Takes in training table (t), test table (t2), and row index of test
    table value (row_index), computes the Cartesian distance from that test
    row to every training row, then returns the training table sorted by
    increasing distance'''
    test = np.array(t2.drop('type').row(row_index))
    d = dist(t, test)
    return d.sort('distances')

find_majority(av_train, av_test, 0)

AveragePrice,Total Volume,Total Bags,type,distances
1.93,35248.5,18306.9,organic,682.81
1.67,36469.4,17619.4,organic,760.185
1.75,36529.1,17569.9,organic,837.672
1.43,36082.4,17277.1,organic,873.174
2.41,35543.1,18989.7,organic,931.654
1.44,34888.7,18444.9,organic,1066.79
1.51,36396.9,17144.6,organic,1102.24
1.3,35447.5,16815.3,organic,1395.94
0.98,34903.1,19193.0,organic,1461.26
2.03,36228.4,16701.5,organic,1466.61


In [6]:
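def knn(t, t2, row, k):
    '''Classifies row `row` of test table t2 by majority vote among its
    k nearest neighbors in training table t'''
    nearest = find_majority(t, t2, row).take(np.arange(k))
    # Count each type among the k nearest rows, most frequent first
    counts = nearest.group('type').sort('count', descending=True)
    return counts.column('type').item(0)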

In [7]:
knn(av_train, av_test, 0, 7)

'organic'
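
The majority vote itself can be sketched without the `datascience` library. This is an illustration only: `majority_label` is a hypothetical name, and the label list is made up rather than taken from the data.

```python
from collections import Counter

def majority_label(sorted_labels, k):
    """Return the most common label among the first k entries of
    `sorted_labels`, which is assumed to be ordered nearest-first."""
    votes = Counter(sorted_labels[:k])
    # most_common(1) returns [(label, count)] for the most frequent label
    return votes.most_common(1)[0][0]

labels = ['organic', 'organic', 'conventional', 'organic', 'conventional']
print(majority_label(labels, 3))   # organic
```

With only two classes, an odd $k$ can never produce a tied vote, which is one reason to prefer odd values of $k$.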

## 5. Test the accuracy of the 7-NN classifier
As an example, I will test how accurate the 7-nearest neighbors classifier is. The `test_accuracy` function defined below runs the classifier on all rows of the `av_test` table (the entire test set), and then returns the proportion of rows that were correctly classified.

In [8]:
def test_accuracy(train, test, k):
    '''Returns proportion of correct classifications from avocado classifier'''
    classed = make_array()
    for i in np.arange(test.num_rows):
        cl = knn(train, test, i, k)
        classed = np.append(classed, cl)
    
    classed_test = test.with_column('k-NN Class', classed)
    return np.count_nonzero(classed_test.column('k-NN Class') == classed_test.column('type')) / classed_test.num_rows

In [9]:
test_accuracy(av_train, av_test, 7)

0.9477911646586346
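
The accuracy measure is the fraction of predicted labels that match the true labels. A minimal sketch of that calculation follows; `accuracy` is a hypothetical helper and the labels are made up for illustration.

```python
import numpy as np

def accuracy(predicted, actual):
    """Proportion of positions where the predicted label equals the actual label."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    # Comparing elementwise gives booleans; their mean is the fraction of matches
    return np.mean(predicted == actual)

pred = ['organic', 'organic', 'conventional', 'organic']
true = ['organic', 'conventional', 'conventional', 'organic']
print(accuracy(pred, true))   # 0.75
```

By the same arithmetic, the result above corresponds to 236 correct classifications out of 249 test rows, since 236 / 249 ≈ 0.9478.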

## 6. Determining the optimal value of $k$
To determine how many nearest neighbors work best for classifying an arbitrary avocado, this section estimates the optimal value of $k$ by measuring accuracy on the test set. It runs the classifier for each odd integer value of $k$ from 1 through 99 and returns a table with the accuracy of each value.

In [11]:
results = make_array()
for i in np.arange(1, 100, 2):
    result = test_accuracy(av_train, av_test, i)
    results = np.append(results, result)
    
optimal_k = Table().with_columns(
    'k', np.arange(1, 100, 2),
    'Accuracy', results
)
optimal_k.sort('Accuracy', descending=True)

k,Accuracy
49,0.947791
13,0.947791
7,0.947791
93,0.943775
89,0.943775
87,0.943775
69,0.943775
67,0.943775
65,0.943775
63,0.943775


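Based on the table above, $k$ = 7, 13, and 49 tie for the highest accuracy (about 0.948). The next-best values score about 0.944, which on a 249-row test set is a difference of a single avocado, so the choice among these values of $k$ makes little practical difference.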
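
When several values of $k$ share the top accuracy, they can be listed programmatically rather than read off the sorted table. A minimal sketch, assuming the results are available as plain lists; `best_ks` is a hypothetical helper, and the numbers are the first five rows of the table above.

```python
import numpy as np

def best_ks(ks, accuracies):
    """Return every k whose accuracy equals the maximum accuracy."""
    ks = np.asarray(ks)
    accuracies = np.asarray(accuracies, dtype=float)
    # isclose guards against tiny floating-point differences between equal scores
    return ks[np.isclose(accuracies, accuracies.max())]

ks = [49, 13, 7, 93, 89]
acc = [0.947791, 0.947791, 0.947791, 0.943775, 0.943775]
print(best_ks(ks, acc).tolist())   # [49, 13, 7]
```

Picking the smallest tied value, here 7, is one reasonable tie-break, since it needs the fewest neighbors; that choice is a suggestion rather than something the notebook prescribes.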