## Classifying plants of the _Iris_ genus using the KNN algorithm
* **Gabriel Nascimento Silva**
* **github: gabrielnsil**

#### Importing Libraries

In [9]:
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np

#### Importing the Iris dataset

In [10]:
iris = datasets.load_iris()  # load_iris() returns a Bunch, a dictionary-like object

In [11]:
type(iris)

sklearn.utils.Bunch

In [12]:
iris_df = pd.DataFrame(iris.data, columns = iris.feature_names)
iris_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


#### The KNN (K-Nearest Neighbors) classifier needs a target to predict, so we add the numeric class labels to the dataframe as a 'target' column.

In [13]:
iris_df["target"] = iris.target
iris_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


#### We can also add a 'target_name' column to the dataframe

In [14]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [15]:
iris_df["target_name"] = iris.target_names[iris_df["target"]]
iris_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,target_name
0,5.1,3.5,1.4,0.2,0,setosa
1,4.9,3.0,1.4,0.2,0,setosa
2,4.7,3.2,1.3,0.2,0,setosa
3,4.6,3.1,1.5,0.2,0,setosa
4,5.0,3.6,1.4,0.2,0,setosa
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2,virginica
146,6.3,2.5,5.0,1.9,2,virginica
147,6.5,3.0,5.2,2.0,2,virginica
148,6.2,3.4,5.4,2.3,2,virginica


#### We also have to specify the characteristics that will be used in this analysis.

In [16]:
iris_features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

#### The variable 'y' will be our target. It is important to note that we use the numeric class codes here; scikit-learn classifiers also accept string labels, but integer codes are the conventional choice.

In [17]:
y = iris_df.target

#### The variable 'X' will hold the feature columns.

In [18]:
X = iris_df[iris_features]

In [19]:
X

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [20]:
y

0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: target, Length: 150, dtype: int32

## How to use scikit-learn

1. Define = Choose the model (and the parameters)
2. Fit = Training
3. Predict = Make a prediction
4. Evaluate = Scoring the performance of the model

In [21]:
knn_model = KNeighborsClassifier(n_neighbors=3)  # Choosing the model (k = 3 neighbors)

In [22]:
knn_model.fit(X, y) ## Training

KNeighborsClassifier(n_neighbors=3)

In [23]:
knn_model.predict(X) ## Predicting

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

### Let's look at the accuracy, keeping in mind that we are evaluating the model on the same data it was trained on

In [24]:
knn_model.score(X,y)

0.96
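
For a classifier, `score` reports accuracy: the fraction of predictions that match the true labels. In the predictions printed above, 6 of the 150 values differ from the true class, which gives 144/150 = 0.96. The sketch below recomputes that fraction by hand; it reloads the data so it runs on its own, and the names `model` and `manual_accuracy` are illustrative rather than part of the notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Reload the data so this cell is self-contained.
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Same setup as above: k = 3, trained and scored on the full dataset.
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)
predictions = model.predict(X)

# Accuracy is the mean of the element-wise "prediction equals label" test.
manual_accuracy = np.mean(predictions == y)
print(manual_accuracy, model.score(X, y))
```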

### Here we obtained a model with 96% accuracy! That looks good, but scoring a model on the same data it was trained on gives an overly optimistic estimate and cannot reveal a problem called *overfitting*. To get an honest estimate, we have to split our dataset into two groups: training (75%) and testing (25%)
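
One way to see why training-set accuracy is optimistic is to set k = 1. Each training point is then its own nearest neighbor, so the model effectively memorizes the data and should score at or very close to 100% on it, which says nothing about how it handles unseen flowers. This is a minimal sketch under that assumption; the name `memorizer` is illustrative.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Reload the data so this cell is self-contained.
iris = datasets.load_iris()
X, y = iris.data, iris.target

# With k = 1, every training sample is matched to itself (distance 0).
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Scoring on the training data rewards memorization, not generalization.
train_accuracy_k1 = memorizer.score(X, y)
print(train_accuracy_k1)
```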

In [25]:
index = np.random.permutation(iris_df.shape[0])  # random permutation of the 150 row indexes of iris_df

In [26]:
div = int(0.75 * len(index))

In [27]:
training_id = index[:div]
test_id = index[div:]

In [28]:
training_set = iris_df.loc[training_id,:]
test_set = iris_df.loc[test_id,:]

In [29]:
training_set.head(5)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,target_name
56,6.3,3.3,4.7,1.6,1,versicolor
109,7.2,3.6,6.1,2.5,2,virginica
16,5.4,3.9,1.3,0.4,0,setosa
91,6.1,3.0,4.6,1.4,1,versicolor
73,6.1,2.8,4.7,1.2,1,versicolor


In [30]:
test_set.head(5)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,target_name
85,6.0,3.4,4.5,1.6,1,versicolor
30,4.8,3.1,1.6,0.2,0,setosa
28,5.2,3.4,1.4,0.2,0,setosa
75,6.6,3.0,4.4,1.4,1,versicolor
86,6.7,3.1,4.7,1.5,1,versicolor


In [31]:
X_training = training_set[iris_features]
X_test = test_set[iris_features]

In [32]:
y_training = training_set.target
y_test = test_set.target
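
The manual shuffle-and-slice split above can also be written with scikit-learn's `train_test_split` helper. The sketch below is an alternative, not the method used in this notebook: `random_state=0` is an arbitrary seed chosen so the split is reproducible, and `stratify=y` is an optional extra that keeps the three species in roughly equal proportion in both parts.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Reload the data so this cell is self-contained.
iris = datasets.load_iris()
X, y = iris.data, iris.target

# 75% training / 25% testing, shuffled with a fixed (arbitrary) seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training part only, then score on the held-out part.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(len(X_train), len(X_test), test_accuracy)
```

The resulting accuracy depends on the seed, so it will not necessarily match the value obtained with the manual split.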

In [33]:
knn_model.fit(X_training, y_training)

KNeighborsClassifier(n_neighbors=3)

In [34]:
knn_model.predict(X_test) ## Predicting

array([1, 0, 0, 1, 1, 2, 2, 2, 0, 2, 2, 2, 2, 1, 0, 2, 2, 2, 2, 0, 0, 1,
       1, 0, 2, 1, 2, 2, 0, 0, 1, 2, 1, 0, 1, 2, 2, 0])

In [35]:
knn_model.score(X_test,y_test)

0.9736842105263158

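### Therefore, following best practices, we obtained a model that reaches about 97.37% accuracy on held-out data, compared with 96% when the model was scored on its own training data. The two figures are not directly comparable: the test score comes from only 38 samples and changes with the random split, but it is the more trustworthy estimate of performance on unseen data. It is important to note that scikit-learn's KNN object (KNeighborsClassifier) uses the *Minkowski distance* by default, which is a generalization of the Euclidean and Manhattan distances; with the default `p=2` it is the Euclidean distance.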
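
To make the relationship between these distances concrete, here is a small sketch of the Minkowski distance, d(a, b) = (sum_i |a_i - b_i|^p)^(1/p), written with NumPy. The function `minkowski_distance` is a hypothetical helper for illustration, not scikit-learn's internal implementation; the two points are the first two rows of the dataset shown earlier.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# First two rows of the Iris feature table.
u = [5.1, 3.5, 1.4, 0.2]
v = [4.9, 3.0, 1.4, 0.2]

manhattan = minkowski_distance(u, v, p=1)  # p = 1: sum of absolute differences
euclidean = minkowski_distance(u, v, p=2)  # p = 2: straight-line distance
print(manhattan, euclidean)
```

For these two rows the differences are 0.2 and 0.5 in the sepal measurements and 0 elsewhere, so p = 1 gives about 0.7 and p = 2 gives the square root of 0.29, about 0.539.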