# Classifying Fruits with kNN

The objective is to train a k-nearest neighbors (kNN) classifier to identify fruits from their mass, width, height and color score.

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as sps
from sklearn.model_selection import train_test_split
%matplotlib notebook
sns.set(style='whitegrid')

## Loading the Data Set

In [15]:
fruits = pd.read_table('fruit_data_with_colors.txt')
print('Rows,Cols:',fruits.shape)
fruits.head()

Rows,Cols: (59, 7)


,fruit_label,fruit_name,fruit_subtype,mass,width,height,color_score
0,1,apple,granny_smith,192,8.4,7.3,0.55
1,1,apple,granny_smith,180,8.0,6.8,0.59
2,1,apple,granny_smith,176,7.4,7.2,0.6
3,2,mandarin,mandarin,86,6.2,4.7,0.8
4,2,mandarin,mandarin,84,6.0,4.6,0.79


In [3]:
features = fruits.columns[3:7].tolist()
class_name = dict(zip(fruits.fruit_label.unique(),fruits.fruit_name.unique()))
features,class_name

(['mass', 'width', 'height', 'color_score'],
 {1: 'apple', 2: 'mandarin', 3: 'orange', 4: 'lemon'})

## Examining the data

In [16]:
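def corrfunc(x, y, **kws):
    """Annotate the current axes with the Pearson correlation of x and y."""
    r, _ = sps.pearsonr(x, y)  # the p-value is not used
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.1, .9), xycoords=ax.transAxes)

# Pair plot of the four features, colored by fruit name
fig = sns.pairplot(fruits[features + ['fruit_name']],hue='fruit_name',diag_kind='kde')
fig = fig.map(corrfunc)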

[Figure: pair plot of mass, width, height and color_score, colored by fruit_name, with Pearson r annotated in each panel]

<b>Comments:</b> Mass <u>correlates</u> strongly with width. Mass and height are useful for <u>segmentation</u>, i.e. for separating the fruit classes.
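The value annotated in each panel is the Pearson correlation coefficient. As a minimal sketch of what `sps.pearsonr` computes, the snippet below recalculates r for mass and width using only the five rows printed by `fruits.head()` (an assumption made to keep the example self-contained; the value on the full data set will differ) and compares it with the textbook definition.

```python
import numpy as np
import scipy.stats as sps

# Assumption: only the five rows printed by fruits.head(), not all 59 rows
mass = np.array([192, 180, 176, 86, 84], dtype=float)
width = np.array([8.4, 8.0, 7.4, 6.2, 6.0])

# pearsonr returns the correlation coefficient and a p-value
r, p_value = sps.pearsonr(mass, width)

# Same quantity from the definition:
# sum of products of deviations / sqrt(product of summed squared deviations)
dm = mass - mass.mean()
dw = width - width.mean()
r_manual = np.sum(dm * dw) / np.sqrt(np.sum(dm ** 2) * np.sum(dw ** 2))

print("r = {:.2f} (scipy), r = {:.2f} (manual)".format(r, r_manual))
```

A value of r close to 1 means the two features rise together almost linearly, which is what "correlates strongly" refers to.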

<h2>Splitting the Data</h2>

75% of the data will be used for <b>training</b> and 25% for <b>testing</b> the model.

In [17]:
X = fruits[features]
y = fruits['fruit_label']
# train_test_split defaults to a 75% / 25% train/test split
# Keeping 'random_state' fixed reproduces the same split on every run
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)
df_train = pd.concat([X_train, y_train],axis=1)
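To make the default split concrete, here is a small sketch on placeholder data with the same shape as the fruit table (59 rows, 4 features). The arrays `X_demo` and `y_demo` are made up for illustration; only the row counts and the effect of `random_state` are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (made up): 59 rows, 4 features, labels 1..4
X_demo = np.arange(59 * 4).reshape(59, 4)
y_demo = np.arange(59) % 4 + 1

# No test_size given, so the default 75% / 25% split applies
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print("train rows:", X_tr.shape[0], "test rows:", X_te.shape[0])

# Repeating the call with the same random_state gives the identical split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, random_state=0)
same_split = np.array_equal(X_te, X_te2)
print("same split:", same_split)
```

With 59 rows the default split should give 44 training rows and 15 test rows, so each test score below moves in steps of 1/15.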


## Training the Classifier

In [18]:
from sklearn.neighbors import KNeighborsClassifier

In [19]:
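# Try several values of n_neighbors (the k in kNN)
for n in [1,3,5,11]:
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train,y_train) # train the classifier
    train_score = knn.score(X_train, y_train) # accuracy on the training set
    test_score = knn.score(X_test, y_test) # accuracy on the held-out test set
    print('For n={} -> Train Score: {:.2f} Test Score: {:.2f}'.format(n,train_score,test_score))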


For n=1 -> Train Score: 1.00 Test Score: 0.60
For n=3 -> Train Score: 0.82 Test Score: 0.53
For n=5 -> Train Score: 0.80 Test Score: 0.53
For n=11 -> Train Score: 0.59 Test Score: 0.33
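To make the role of `n_neighbors` concrete, the following is a minimal from-scratch sketch of kNN prediction: Euclidean distance followed by a majority vote. It is an illustration, not scikit-learn's implementation; the helper `knn_predict` and the toy arrays are hypothetical, and tie-breaking may differ from `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy 2-D data (made up for illustration): two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

pred_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=1)
print(pred_a, pred_b)
```

With k=1 every training point is its own nearest neighbor, which is why the train score for n=1 is 1.00. Larger k averages over more neighbors, which matters on a data set this small.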


<h2>Classifying a New Fruit</h2>

Using a kNN classifier with 5 neighbors (n=5), we will classify a fruit with the following characteristics:

mass: 20 g

width: 4.3 cm

height: 5.5 cm

color score: 0.92


In [22]:
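knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,y_train) # train the classifier

# Wrap the new sample in a DataFrame with the training column names,
# so predict() sees the same feature names the classifier was fitted on
attributes = pd.DataFrame([[20,4.3,5.5,0.92]], columns=features)

prediction = knn.predict(attributes)

'The fruit is a ' + class_name[prediction[0]]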

'The fruit is a mandarin'
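One way to see why a prediction comes out as it does is to look at the neighbors that voted. The sketch below uses `KNeighborsClassifier.kneighbors` on a tiny training set made only of the five rows printed by `fruits.head()` (an assumption for self-containment, with 3 neighbors instead of 5 because only five rows are available), so it illustrates the mechanism rather than reproducing the result above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Assumption: only the five rows printed by fruits.head(), used as a tiny training set
# Columns: mass, width, height, color_score
X_head = np.array([
    [192, 8.4, 7.3, 0.55],
    [180, 8.0, 6.8, 0.59],
    [176, 7.4, 7.2, 0.60],
    [86, 6.2, 4.7, 0.80],
    [84, 6.0, 4.6, 0.79],
])
y_head = np.array([1, 1, 1, 2, 2])  # 1 = apple, 2 = mandarin

knn_demo = KNeighborsClassifier(n_neighbors=3)
knn_demo.fit(X_head, y_head)

new_fruit = np.array([[20, 4.3, 5.5, 0.92]])

# Distances to, and row indices of, the 3 nearest training points
distances, indices = knn_demo.kneighbors(new_fruit)
pred = knn_demo.predict(new_fruit)

print("neighbor rows:", indices[0])
print("distances:", np.round(distances[0], 1))
print("predicted label:", pred[0])
```

In this small sample the two mandarins are the closest rows, and the distances are almost entirely determined by mass, since mass spans a much larger numeric range than width, height or color score.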