# Avocado Classifier
This Jupyter Notebook contains the code that takes in a table with information about avocados (average price, total volumne, total bags) and classifies them as either `conventional` or `organic`. This classifier is a $k$-nearest neighbors classifier using the cartesian distance between the point in question and the points in the training set. The data set is from Kaggle (https://www.kaggle.com/neuromusic/avocado-prices).

## 1. Import datascience, numpy, and the table
The cell below imports the `datascience` and `numpy` libraries of Python, as well as opens the csv file as a `datascience` Table object.

In [47]:
import pandas as pd
import numpy as np

avocado = pd.read_csv('avocado.csv')
avocado.head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


## 2. Divide the Kaggle data set into the training and test sets
This cell selects the 4 columns we will use from the original table (three data point columns and the type column) and shuffles the rows of the csv file and separates them into a training set, to which the avocado to be classified will be compared, and a test set, to test the accuracy of the classifer once it is built. The test set will retain its `type` column so that we know what proportion of avocados the classifier gets correct. The training set has 18,000 rows and the test set has 249.

In [48]:
av = avocado[['AveragePrice', 'Total Volume', 'Total Bags', 'type']]
av = av.sample(n=len(av))
av_train = av.iloc[0:18000,]
av_test = av.iloc[18001:,]
av_train.shape, av_test.shape

((18000, 4), (248, 4))

## 3. Define a function to find the cartesian distances
In this section, I will define a function that finds the 3-dimensional cartesian distant between two points. This is an application of the Pythagorean Theorem. The distance between two points $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ is

$$d = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2}$$

The function defined takes as arguments a table whose first 3 columns are data points and an array containing the corresponding values for the point that is being compared. It returns the table with a new column that has the distance between each row in the table and the point in the array.

In [53]:
def dist(df, arr):
    '''Takes in a table where the 1st 3 columns are the numerical data
    and returns the cartesian distance from an array with coincident values'''
    
    # turn 1st 3 columns of df into arrays
    arr_1 = np.array(df.iloc[:,0])
    arr_2 = np.array(df.iloc[:,1])
    arr_3 = np.array(df.iloc[:,2])
    
    # iterate through 3-tuples of training set data and create array of distances
    dists = make_array()
    for i in zip(arr_1, arr_2, arr_3):
        dist = np.sqrt((arr[0] - i[0])**2 + (arr[1] - i[1])**2 + (arr[2] - i[2])**2)
        dists = np.append(dists, dist)
        
    new_df = df.copy()
    new_df['distances'] = pd.Series(dists)
        
    return df

# demonstrate dist function
dist(av_train, av_test.iloc[0,0:3].values).head()

Unnamed: 0,AveragePrice,Total Volume,Total Bags,type,distances
1009,1.21,151883.21,40114.79,conventional,6943036.0
6240,0.94,1153699.28,314531.08,conventional,269370.0
12918,1.31,11306.09,4808.45,organic,101596.8
12686,1.8,1680.08,366.67,organic,249581.2
12368,2.09,27655.2,1381.67,organic,78592.18


## 4. Define a function to find the majority classification
$k$-NN classifiers work by determining what classification a majority of the $k$ points closest to a point in question have. The function `find_majority` defined below runs the `dist` function on a table and returns that output sorted by increasing distance. The function `knn` below that selects the top $k$ rows and returns the majority classification.

In [55]:
def find_majority(df_train, df_test, row_index):
    '''Takes in training table (t), test table (t2), and row index of test
    table value (row_index) and computes the cartesian distance then
    returns the training table sorted by incrasing distance'''
    test = df_test.iloc[row_index,:3]
    d = dist(df_train, test)
    return d.sort_values('distances')

find_majority(av_train, av_test, 0).head()

Unnamed: 0,AveragePrice,Total Volume,Total Bags,type,distances
7882,1.79,617981.91,88193.35,conventional,1222.150551
9031,1.07,5278752.32,1869766.58,conventional,1721.650699
4903,1.34,432472.11,58195.58,conventional,1803.170747
10941,1.77,1427.11,592.46,organic,2379.394366
14417,1.08,4726.65,2881.15,organic,2381.602972


In [72]:
def knn(df_train, df_test, row_index, k):
    sort = find_majority(df_train, df_test, row_index)
    df = sort.iloc[:k,].groupby('type').count().sort_values('distances', ascending=False)
    return df.index[0]

knn(av_train, av_test, 0, 7)

'conventional'

## 5. Test the accuracy of the 7-NN classifier
For an example, I will text how accurate the 7-nearest neighbors classifer is. The `test_accuracy` function defined below runs the classifier on all rows of the `av_test` table (the entire test set), and then returns the proportion of rows that were correctly classified.

In [8]:
def test_accuracy(train, test, k):
    '''Returns proportion of correct classifications from avocado classifier'''
    classed = make_array()
    for i in np.arange(test.num_rows):
        cl = knn(train, test, i, k)
        classed = np.append(classed, cl)
    
    classed_test = test.with_column('k-NN Class', classed)
    return np.count_nonzero(classed_test.column('k-NN Class') == classed_test.column('type')) / classed_test.num_rows

In [9]:
test_accuracy(av_train, av_test, 7)

0.9477911646586346

## 6. Determining the optimal value of $k$
In order to determine how many nearest neigbors would be best to run on a random avocado, this second determines the optimal value of $k$ based on the training set. It will run through the classifier for odd integer values 1 through 99, and return a table with the accuracy of each value.

In [11]:
results = make_array()
for i in np.arange(1, 100, 2):
    result = test_accuracy(av_train, av_test, i)
    results = np.append(results, result)
    
optimal_k = Table().with_columns(
    'k', np.arange(1, 100, 2),
    'Accuracy', results
)
optimal_k.sort('Accuracy', descending=True)

k,Accuracy
49,0.947791
13,0.947791
7,0.947791
93,0.943775
89,0.943775
87,0.943775
69,0.943775
67,0.943775
65,0.943775
63,0.943775


Based on the table above, it seems that using 7, 13, or 49 for $k$ are all equally as accurate (with minor, neglible differences, presumably). 