<a href="https://colab.research.google.com/github/abidshafee/DataScienceYouTubeTutorials/blob/master/KNN_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

From my personal observation, the accuracy of a KNN model depends on how much the classes overlap. The more the data points of different classes overlap, the less accurately the KNN model can perform. Overlap can be seen in a PCA scatter plot or a seaborn pair plot. For densely overlapping data points, accuracy increased significantly when I increased the value of *n_neighbors*.
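To make the overlap observation concrete, here is a minimal sketch on synthetic data. It assumes scikit-learn's `make_blobs` as a stand-in dataset (not the Iris data used in this notebook); the helper name `knn_accuracy`, the cluster centers, and the `cluster_std` values are hypothetical choices for illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def knn_accuracy(cluster_std, n_neighbors, seed=0):
    """Train KNN on three synthetic clusters and return the test accuracy."""
    # Hypothetical cluster centers; a larger cluster_std means more overlap.
    centers = [(-5, -5), (0, 5), (5, -5)]
    X, y = make_blobs(n_samples=600, centers=centers,
                      cluster_std=cluster_std, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


separated = knn_accuracy(cluster_std=0.5, n_neighbors=5)
overlapping = knn_accuracy(cluster_std=6.0, n_neighbors=5)
overlapping_large_k = knn_accuracy(cluster_std=6.0, n_neighbors=25)

print("well separated, k=5 :", separated)
print("overlapping,    k=5 :", overlapping)
print("overlapping,    k=25:", overlapping_large_k)
```

Whether a larger `n_neighbors` actually helps on the overlapping clusters depends on the data and the random seed, so it is worth comparing several values rather than assuming a trend.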

In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from matplotlib import style
style.use('ggplot')

In [None]:
from google.colab import files
file = files.upload()

In [None]:
df = pd.read_csv('datasets_Iris.csv', index_col=0)

In [None]:
df.head()

### Defining Input variables 'X', and targeted output 'y' 

### Converting Categorical Species column to numeric values
#### **Not that necessary for this case**

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
encodedOutput_Y = LabelEncoder()
df['Species'] = encodedOutput_Y.fit_transform(df.iloc[:,-1].values)

In [None]:
df.tail()

0 -> Iris-setosa
1 -> Iris-versicolor
2 -> Iris-virginica
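The mapping above does not have to be written down by hand. As a small sketch (using a hypothetical four-item list instead of the notebook's `df`), a fitted `LabelEncoder` stores the original labels in `classes_`, sorted, and `inverse_transform` turns codes back into names:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of species labels, in arbitrary order.
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ is sorted, so the position of a label is its numeric code.
mapping = {code: str(name) for code, name in enumerate(encoder.classes_)}
decoded = encoder.inverse_transform(codes)

print(mapping)
print(codes)
print(decoded)
```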

In [None]:
df.info()

## Distribution of Data in the dataset
The pair plot shows how the data is distributed in the dataset. More pairplot options are listed here:
[sns.pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html)

**hue** is set to `Species`; this also works on the original string labels, without converting them to numerical values.

As the pair plot shows, the classes in this dataset do not overlap much, except for a few points. So we can apply KNN to this dataset and expect good results.

In [None]:
sns.pairplot(df, hue='Species',  markers=["o", "s", "D"], height=3.5) # hue=y when y datatype is boolean

### Slicing Input variables and targeted output

In [None]:
numeric_df = df.iloc[:,0:4]

# drop the categorical Species column
# ty=df.drop('Species',axis=1)

# Targeted Output as pandas.Series
# y = df[[df.columns[-1]]]
y = df[df.columns[-1]]

In [None]:
numeric_df

### Converting y to a 1D numpy array
because KNN expects the target output in this shape

In [None]:
y = y.values

In [None]:
y

### Scaling the data to center the dataset at the origin

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
scaler.fit(numeric_df)

In [None]:
scaled_data = scaler.transform(numeric_df)

In [None]:
scaled_data
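As a quick check of what the scaler does, the sketch below compares `StandardScaler` with the manual formula $z = \frac{x - \mu}{\sigma}$ applied per column. The small `data` array is a made-up example, not rows taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements: 4 samples, 2 features.
data = np.array([[5.1, 3.5],
                 [4.9, 3.0],
                 [6.2, 3.4],
                 [5.9, 3.0]])

# Manual standardization: subtract the column mean, divide by the
# column standard deviation (population std, ddof=0).
manual = (data - data.mean(axis=0)) / data.std(axis=0)

scaled = StandardScaler().fit_transform(data)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1 (up to floating-point error), so no single feature dominates the distance computation in KNN.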

### Converting scaled data to a dataframe

In [None]:
scaled_df = pd.DataFrame(scaled_data, columns=df.columns[0:-1])

In [None]:
X = scaled_df
X.head()

## Test Train Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [None]:
print('Size of X Train >> ' + str(len(X_train)))
print('Size of X Test >> ' + str(len(X_test)))
print('Size of y Train >> ' + str(len(y_train)))
print('Size of y Test >> ' + str(len(y_test)))
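The split above is random, so the sizes stay the same but the rows (and every score computed later) change from run to run. One possible variant, sketched here on scikit-learn's bundled `load_iris` data rather than the uploaded CSV, fixes `random_state` for reproducibility and uses `stratify` to keep the class proportions equal in both parts. The seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for the notebook's CSV.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,  # arbitrary seed, makes the split repeatable
    stratify=y,       # keep the class proportions in train and test
)

# Number of test samples per class.
test_counts = np.bincount(y_test)
print(len(X_train), len(X_test), test_counts)
```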

# **Using KNN Classifier**

In [None]:
knn = KNeighborsClassifier(n_neighbors=13)

In [None]:
knn.fit(X_train, y_train)

In [None]:
prediction = knn.predict(X_test)

In [None]:
prediction

## Accuracy of our Model

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
knn_accuracy = round((accuracy_score(y_test, prediction) * 100), 2)

In [None]:
print(str(knn_accuracy) + '%')

## Testing Accuracy of KNN Model

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

In [None]:
conf_matrix = confusion_matrix(y_test, prediction)
print(conf_matrix)
print(conf_matrix.shape)

`Each row in a confusion matrix represents an actual class, while each column represents a predicted class.`

**From the confusion matrix**
 - First row, first column: 8 values were predicted correctly as category 0, with no false predictions.
 - Second row, second column: 8 values were predicted correctly as category 1 compared to y_test, and 2 were falsely predicted as category 2.
 - Third row, third column: 10 values were predicted correctly as category 2, and 2 were falsely predicted as category 1.
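The same reading can be done in code. This is a small sketch with a hard-coded matrix that matches the counts described above; since the train/test split is random, the matrix from another run will differ.

```python
# Hypothetical confusion matrix matching the counts described above
# (rows = actual class, columns = predicted class).
conf_matrix = [
    [8, 0, 0],
    [0, 8, 2],
    [0, 2, 10],
]

n_classes = len(conf_matrix)

# Diagonal entries are the correct predictions for each actual class.
correct = [conf_matrix[i][i] for i in range(n_classes)]

# Everything else in a row is an actual class predicted as another class.
misclassified = [sum(conf_matrix[i]) - conf_matrix[i][i]
                 for i in range(n_classes)]

total = sum(sum(row) for row in conf_matrix)
overall_accuracy = sum(correct) / total

for i in range(n_classes):
    print(f"class {i}: {correct[i]} correct, {misclassified[i]} misclassified")
print(f"overall accuracy: {overall_accuracy:.4f}")
```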

### Plotting Confusion Matrix

In [None]:
sns.heatmap(conf_matrix, annot=True)

The confusion matrix gives us a lot of information, but sometimes we may prefer a more concise metric.

The classification report also works without converting categorical variables to numerical values.

In [None]:
print(classification_report(y_test, prediction))

 - **Precision**: the ratio of positive predictions that are actually correct. $precision = \frac{TP}{TP + FP}$
 - **Recall**, also called sensitivity or true positive rate (TPR): the ratio of positive instances that are correctly detected by the classifier. It is similar to precision, but the denominator counts false negatives instead of false positives. $recall = \frac{TP}{TP + FN}$
 - **F1-score**: the harmonic mean of precision and recall. A classifier will only get a high F1-score if both recall and precision are high.
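The formulas above can be applied per class directly to a confusion matrix. The sketch below uses a hypothetical 3x3 matrix (rows = actual, columns = predicted) and a helper named `per_class_metrics`, which is not part of scikit-learn; `classification_report` computes the same quantities from the labels.

```python
def per_class_metrics(conf_matrix, cls):
    """Return (precision, recall, f1) for one class of a confusion matrix."""
    tp = conf_matrix[cls][cls]
    # Column sum minus TP: other classes predicted as cls (false positives).
    fp = sum(row[cls] for row in conf_matrix) - tp
    # Row sum minus TP: cls predicted as other classes (false negatives).
    fn = sum(conf_matrix[cls]) - tp

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical confusion matrix for three classes.
example_matrix = [
    [8, 0, 0],
    [0, 8, 2],
    [0, 2, 10],
]

metrics = [per_class_metrics(example_matrix, c) for c in range(3)]
for c, (p, r, f) in enumerate(metrics):
    print(f"class {c}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```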

### Selecting best K value or n_neighbors
using cross validation

In [None]:
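accuracy = []
for num in range(1, 50):
  knn = KNeighborsClassifier(n_neighbors=num)
  # accuracy on each of the 20 cross-validation folds for this n_neighbors
  acc = cross_val_score(knn, X, y, cv=20)
  accuracy.append(acc.mean() * 100)  # use 1 - acc.mean() for the error rate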

In [None]:
accuracy

In [None]:
plt.figure(figsize=(16, 8))
plt.plot(range(1,50), accuracy, marker='o')
plt.ylabel("Accuracy '%'")
plt.xlabel("n_neighbors")
plt.show()
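Instead of reading the best value off the plot, it can be picked programmatically. This is a sketch under a few assumptions: it uses scikit-learn's bundled `load_iris` data with the same kind of scaling, `cv=5` instead of 20 to keep it quick, and names such as `k_values` and `best_k` that do not appear in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data, scaled the same way as in the notebook.
X_raw, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X_raw)

k_values = list(range(1, 50))
mean_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds for this candidate k.
    mean_scores.append(cross_val_score(knn, X, y, cv=5).mean())

# argmax returns the first maximum, so ties go to the smallest k.
best_index = int(np.argmax(mean_scores))
best_k = k_values[best_index]

print("best n_neighbors:", best_k)
print("mean CV accuracy:", round(mean_scores[best_index] * 100, 2), "%")
```

Several neighboring values of `n_neighbors` often score almost the same, so the single best value is less important than the general region where the curve is high.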

# **k fold cross validation**

Just like the `cross_val_score()` function, `cross_val_predict()` performs **K-fold** cross-validation, but instead of returning the evaluation scores, *it returns the predictions made on each test fold*. This means that we get a clean prediction for each instance in the training set.
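Because the return value is an array of predicted labels rather than scores, it has to be compared with the true labels to get an accuracy. A minimal sketch, assuming scikit-learn's bundled `load_iris` data in place of the notebook's `X` and `y`:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled Iris data stands in for the notebook's X and y.
X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=13)
pred = cross_val_predict(knn, X, y, cv=3)

# One out-of-fold prediction per sample, same shape as y.
print(pred.shape, y.shape)

# Averaging the predictions only averages the label values (0, 1, 2).
label_mean = pred.mean()

# Comparing predictions with the true labels gives the accuracy.
accuracy = accuracy_score(y, pred)

print("mean of predicted labels:", label_mean)
print("cross-validated accuracy:", accuracy)
```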

In [None]:
from sklearn.model_selection import cross_val_predict

In [None]:
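kf_accurecy = []
for i in range(1, 50):
  knn = KNeighborsClassifier(n_neighbors=i)
  pred = cross_val_predict(knn, X, y, cv=3)
  # pred holds predicted labels, so compare them with y to get the accuracy
  # (pred.mean() would only average the label values)
  kf_accurecy.append(round(accuracy_score(y, pred) * 100, 2))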

In [None]:
kf_accurecy

In [None]:
plt.figure(figsize=(17, 8))
plt.plot(range(1,50), kf_accurecy, marker='o')
plt.ylabel("Accuracy '%'")
plt.xlabel("n_neighbors")
plt.show()