<a href="https://colab.research.google.com/github/fay421/ML_Projects/blob/main/Iris.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## KNN Classification Project-Iris


## About KNN Model:



K-Nearest Neighbors (KNN) is a simple and versatile machine learning algorithm used for both classification and regression tasks.

*   Classification with KNN:
In classification, KNN assigns a data point to the majority class among its k nearest neighbors. The algorithm computes the distance between the query point and every point in the training set, selects the k closest points under the chosen distance metric (commonly Euclidean distance), and assigns the class label held by the majority of those neighbors.


*   Regression with KNN:
In regression, KNN predicts the target value for a data point by averaging the target values of its k-nearest neighbors. The process is similar to classification, but instead of assigning a class label, KNN calculates the average of the target values.

*   Pros of KNN:

1.   Simple and Intuitive: KNN is easy to understand and implement, making it a good choice for beginners.
2.   Non-parametric: It doesn't make assumptions about the underlying data distribution, making it suitable for various types of datasets.
3.   Adaptability: KNN stores the training data instead of fitting explicit parameters, so new instances can be added as they become available without a separate retraining step.

*   Cons of KNN:

1.   Computational Cost: A brute-force search compares each query point against every stored training point, so the cost of finding the nearest neighbors grows with the dataset and makes KNN less efficient for large datasets.
2.   Sensitive to Noise and Irrelevant Features: KNN is sensitive to outliers, noise, and irrelevant features, because all of them distort the distances on which its predictions depend.
3.   Need for Feature Scaling: Features with larger numeric ranges dominate the distance calculation, so it's often necessary to scale or normalize the data before applying the algorithm.
4.   Curse of Dimensionality: In high-dimensional spaces, distances between points become increasingly similar, so the concept of proximity becomes less meaningful and performance decreases.

In summary, KNN is a simple and effective algorithm that can be used for both classification and regression tasks. However, its performance may be influenced by factors such as dataset size, noise, and feature scaling, and it might not be the best choice for high-dimensional datasets.
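
The classification and regression rules described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later in this notebook; the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for the example.

```python
from collections import Counter

import numpy as np


def k_nearest(X_train, x, k):
    """Return the indices of the k training points closest to x (Euclidean)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return np.argsort(distances)[:k]


def knn_classify(X_train, y_train, x, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    idx = k_nearest(X_train, x, k)
    return Counter(y_train[idx].tolist()).most_common(1)[0][0]


def knn_regress(X_train, y_train, x, k=3):
    """Regression: mean target value of the k nearest neighbors."""
    idx = k_nearest(X_train, x, k)
    return float(np.mean(y_train[idx]))


# Toy data (hypothetical): two well-separated clusters in 2-D.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
targets = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])

query = np.array([1.1, 1.0])
predicted_class = knn_classify(X_toy, labels, query, k=3)
predicted_value = knn_regress(X_toy, targets, query, k=3)
print(predicted_class, predicted_value)
```

Both functions share the same neighbor search; only the final aggregation step (vote versus mean) differs.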








## About Iris Dataset:
The Iris dataset is a popular dataset in the field of machine learning and statistics. It was introduced by the British biologist and statistician Sir Ronald A. Fisher in 1936 as an example of discriminant analysis. The dataset contains 150 samples of iris flowers from three species, with four measurements per sample, all in centimeters:

*   Sepal Length: The length of the iris flower's sepal (the outermost whorl of a flower).
*   Sepal Width: The width of the iris flower's sepal.
*   Petal Length: The length of the iris flower's petal (the inner whorl of a flower).
*   Petal Width: The width of the iris flower's petal.

The three species of iris flowers included in the dataset are:

*   Setosa
*   Versicolor
*   Virginica

The Iris dataset is commonly used for practicing and demonstrating various machine learning algorithms, particularly in the context of classification problems. It's a small and well-understood dataset, making it suitable for educational purposes and benchmarking algorithms. The dataset is often used to demonstrate the effectiveness of algorithms like K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and decision trees for classification tasks.
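
As a quick way to inspect the structure described above without the CSV file, scikit-learn ships its own copy of the dataset. The sketch below assumes that bundled copy is an acceptable stand-in; note that its column names (for example `sepal length (cm)`) differ from those in `Iris.csv`, and its species labels carry no `Iris-` prefix.

```python
from sklearn.datasets import load_iris

# Load the bundled copy as a pandas DataFrame (features plus a 'target' column).
iris = load_iris(as_frame=True)
iris_df = iris.frame

print(iris_df.shape)                                 # rows, columns
print(list(iris.feature_names))                      # the four measurements
print(list(iris.target_names))                       # the three species
print(iris_df["target"].value_counts().sort_index())  # samples per species
```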

## Import Libraries

In [46]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Create Dataframe

In [47]:
# Load the dataset; Iris.csv is expected in the working directory
df = pd.read_csv('Iris.csv')

## EDA

In [48]:
# Drop the row-identifier column, which carries no predictive information
df = df.drop('Id', axis=1)

In [49]:
df.columns

Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [50]:
df

| | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |


## Cast the 'Species' Target Column to a Categorical dtype

In [51]:
y = df['Species'].astype('category')

## Convert Categorical to Numerical in Target Column

In [52]:
y = y.cat.codes
print(y)

0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Length: 150, dtype: int8
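
The integer codes printed above follow the order of the categories, which pandas sorts lexically by default for string labels. A minimal sketch of how to recover the code-to-label mapping, using a small hypothetical Series rather than the full column:

```python
import pandas as pd

# Small stand-in for df['Species'] (values chosen for illustration).
species = pd.Series(
    ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]
).astype("category")

codes = species.cat.codes                                # integer code per row
code_to_label = dict(enumerate(species.cat.categories))  # code -> species name
decoded = codes.map(code_to_label)                       # back to the labels

print(code_to_label)
print(codes.tolist())
```

Keeping `code_to_label` around makes it possible to translate the classes `0`, `1`, `2` in later reports back to species names.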


## Scale Data

In [53]:
from sklearn.preprocessing import StandardScaler

In [54]:
sc=StandardScaler()

In [55]:
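# Keep only the four feature columns
df = df.drop('Species', axis=1)

In [56]:
# Standardize each feature to zero mean and unit variance
df_sc = sc.fit_transform(df)

In [57]:
# fit_transform returns a NumPy array, so restore the column names
X = pd.DataFrame(df_sc, columns=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'])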
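
`StandardScaler` replaces each value x with (x - mean) / std, computed per column with the population standard deviation. The sketch below checks that against a manual NumPy calculation on a few hypothetical rows; the array `sample` is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few illustrative rows with two features.
sample = np.array([[5.1, 3.5],
                   [4.9, 3.0],
                   [6.7, 3.0],
                   [6.2, 3.4]])

scaled = StandardScaler().fit_transform(sample)

# Manual standardization; np.std defaults to the population std (ddof=0).
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```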

## Train Test Split

In [58]:
from sklearn.model_selection import train_test_split

In [59]:
# Hold out 30% of the rows (45 of 150) for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
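
In this notebook the scaler is fitted on all rows before the split, so the test rows influence the scaling statistics. One common alternative, sketched here as an assumption rather than as what the notebook does, is to split first and wrap the scaler and the classifier in a pipeline so the scaler only sees training rows. The example uses scikit-learn's bundled Iris copy, `n_neighbors=5`, and a stratified split, all of which are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_iris(return_X_y=True)

# Split the unscaled data; stratify keeps the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=42, stratify=y_raw
)

# The pipeline fits the scaler on the training rows only, then applies
# the same transformation to whatever is passed to predict/score.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(X_tr.shape, X_te.shape, round(test_accuracy, 3))
```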

In [60]:

# These imports are not used by the classification cells shown in this notebook
from scipy.stats import mode
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score


## Build KNN Model

In [61]:
from sklearn.neighbors import KNeighborsClassifier

## About kd_tree Algorithm
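The k-d tree (short for k-dimensional tree) is a data structure that recursively partitions the space along one coordinate axis at a time. This makes multidimensional searches, such as finding the nearest neighbors of a given point, more efficient than comparing the query against every stored point. In `KNeighborsClassifier`, the `algorithm` parameter only selects how the neighbors are searched for; it does not change how the prediction is made from them.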
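
To make the idea concrete, the sketch below queries scikit-learn's standalone `KDTree` class and compares the result with a brute-force search over all points. The random data and variable names are assumptions for the example; they are not part of this notebook's workflow.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
points = rng.random((200, 4))   # 200 random points in 4 dimensions
query = rng.random((1, 4))      # a single query point

# Tree-based search: returns distances and indices, sorted by distance.
tree = KDTree(points)
tree_dist, tree_idx = tree.query(query, k=5)

# Brute-force search: compute every distance, then sort.
brute_dist = np.linalg.norm(points - query, axis=1)
brute_idx = np.argsort(brute_dist)[:5]

print(np.array_equal(tree_idx[0], brute_idx))
```

Both searches find the same neighbors; the tree avoids visiting every point, which matters more as the number of stored points grows.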

`p`: This parameter is the exponent of the Minkowski distance metric. Setting `p=1` gives the Manhattan distance (sum of absolute differences), `p=2` gives the Euclidean distance, and as `p` goes to infinity the distance approaches the Chebyshev distance (largest absolute difference).
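
The Minkowski distance between two points a and b is d_p(a, b) = (sum_i |a_i - b_i|^p)^(1/p). A minimal NumPy sketch of the three cases follows; `minkowski_distance` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance between two points; p=np.inf gives Chebyshev."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))


a, b = [0.0, 0.0], [3.0, 4.0]
d_manhattan = minkowski_distance(a, b, 1)       # |3| + |4|
d_euclidean = minkowski_distance(a, b, 2)       # sqrt(3^2 + 4^2)
d_chebyshev = minkowski_distance(a, b, np.inf)  # max(|3|, |4|)
print(d_manhattan, d_euclidean, d_chebyshev)
```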

In [87]:
# 25 neighbors, k-d tree neighbor search, Euclidean distance (p=2)
knn = KNeighborsClassifier(n_neighbors=25, algorithm='kd_tree', p=2)

In [88]:
knn.fit(X_train , y_train)

In [89]:
pred = knn.predict(X_test)

## Evaluation

In [90]:
from sklearn.metrics import classification_report

In [91]:
print(classification_report(y_test , pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.92      0.92      0.92        13
           2       0.92      0.92      0.92        13

    accuracy                           0.96        45
   macro avg       0.95      0.95      0.95        45
weighted avg       0.96      0.96      0.96        45
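
The report above is consistent with two misclassified test samples (43 of 45 correct, and 12/13 ≈ 0.92 for classes 1 and 2). Per-class precision and recall can be read directly off a confusion matrix, as the sketch below shows on a small set of hypothetical labels (`y_true_demo` and `y_pred_demo` are made up for the example).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 0, 1, 1, 2, 2, 2, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Precision: correct predictions / all predictions of that class (column sum).
precision = cm.diagonal() / cm.sum(axis=0)
# Recall: correct predictions / all true members of that class (row sum).
recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(precision, recall)
```

These manual values match what `precision_score` and `recall_score` return with `average=None`.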



## About Cross Validation Score
Cross-validation is a statistical technique used in machine learning to assess the performance of a predictive model. The primary goal of cross-validation is to provide a more accurate and reliable estimate of a model's performance by using different subsets of the available data for training and testing.

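*   `cv=5`: This parameter specifies the number of folds in the cross-validation. The training set is divided into 5 folds, and the model is trained and evaluated 5 times, each time holding out a different fold for validation. With the 105 training rows here, each fold holds out 21 samples, which is why the fold accuracies move in steps of 1/21 (for example 20/21 ≈ 0.952).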
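
The fold-by-fold procedure can be written out by hand. For a classifier and an integer `cv`, `cross_val_score` uses stratified folds without shuffling, so the manual loop below should reproduce its scores. The sketch uses scikit-learn's bundled Iris copy and `n_neighbors=5` as illustrative assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    # Fit a fresh copy on four folds, score it on the held-out fold.
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print(np.allclose(manual_scores, library_scores))
```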


In [93]:
from sklearn.model_selection import cross_val_score

In [92]:
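# Candidate neighbor counts (k = 2..10) and Minkowski exponents
neighbors = np.arange(2, 11)
minkowski_p = [1, 2, np.inf]  # Manhattan, Euclidean, Chebyshev
for p in minkowski_p:
    print('p is ', p, '\n -------------------------')
    for k in neighbors:
        knn_val = KNeighborsClassifier(n_neighbors=k, p=p)
        # 5-fold cross-validated accuracy, computed on the training split only
        scores = cross_val_score(knn_val, X_train, y_train,
                                 cv=5, scoring='accuracy')
        print('for k = ', k, 'acc is ', scores, 'mean_acc is ', scores.mean())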

p is  1 
 -------------------------
for k =  2 acc is  [0.95238095 0.85714286 0.85714286 1.         0.9047619 ] mean_acc is  0.9142857142857143
for k =  3 acc is  [0.95238095 0.85714286 0.85714286 1.         0.9047619 ] mean_acc is  0.9142857142857143
for k =  4 acc is  [0.95238095 0.85714286 0.85714286 0.95238095 0.9047619 ] mean_acc is  0.9047619047619048
for k =  5 acc is  [0.95238095 0.85714286 0.85714286 1.         0.95238095] mean_acc is  0.9238095238095237
for k =  6 acc is  [0.95238095 0.9047619  0.85714286 1.         0.95238095] mean_acc is  0.9333333333333333
for k =  7 acc is  [0.95238095 0.9047619  0.9047619  1.         0.95238095] mean_acc is  0.9428571428571428
for k =  8 acc is  [0.95238095 0.9047619  0.9047619  1.         0.95238095] mean_acc is  0.9428571428571428
for k =  9 acc is  [0.95238095 0.9047619  0.85714286 1.         0.95238095] mean_acc is  0.9333333333333333
for k =  10 acc is  [0.95238095 0.9047619  0.85714286 1.         0.9047619 ] mean_acc is  0.92380952

## Result

For p = 1:

Best mean accuracy is achieved for k=7 and k=8, with a mean accuracy of approximately 0.9429.
The mean accuracy rises from about 0.9143 at k=2 to this peak, then falls back to about 0.9238 at k=10.

For p = 2:

Best mean accuracy is achieved for k=7 and k=8, with a mean accuracy of approximately 0.9333.
The mean accuracy follows a similar pattern as with p=1, increasing with k until it stabilizes.

For p = infinity (p = inf):

Best mean accuracy is achieved for k=3, k=5, and k=9, with a mean accuracy of approximately 0.9333.
The mean accuracy varies with k without a clear trend, since the same best value is reached at small and large k.

General observations:

*   The choice of the distance metric parameter (p) affects the performance of the KNN classifier.
*   The optimal number of neighbors (k) varies for different values of p.
*   The mean accuracy tends to increase with k up to a point, after which increasing k further does not improve performance (for p=1 it declines after k=8).

**In summary:** choose the combination of p and k that gives the highest mean cross-validated accuracy for your specific dataset and problem. In this run that is p=1 with k=7 or k=8.
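
The manual search over p and k can also be expressed with `GridSearchCV`, which runs the same cross-validation for every combination and records the best one. This is a sketch under assumptions: it uses scikit-learn's bundled Iris copy, restricts p to 1 and 2, and puts the scaler inside a pipeline, so its scores need not match the numbers reported above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)

pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "kneighborsclassifier__n_neighbors": list(range(2, 11)),
    "kneighborsclassifier__p": [1, 2],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_fit, y_fit)

print(search.best_params_, round(search.best_score_, 4))
holdout_accuracy = search.score(X_hold, y_hold)  # accuracy of the refitted best model
print(round(holdout_accuracy, 4))
```

After the search, `search.best_estimator_` is the pipeline refitted on all of `X_fit` with the winning parameters, so the held-out rows are only used once, for the final check.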