<a href="https://colab.research.google.com/github/codegitfirst/ML-workshop/blob/main/KNN_Iris.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'flowers-dataset-iris:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F5197119%2F8671682%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240729%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240729T080237Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D233fb04eddee5a63289ce99b15202909a87b07bb4e76c0f3009c40791ef34299e7dc3490bf35a66a05e89c1303ab8bbc37a1a383b975174c59a77f1cb4135b3485ec0e9bfd548abf45b3b87440a765d0b22b78e9bfc5c345219458df0d644c1c2912f871bbe066d5e3aa41f691a6cbc6306bb476f2e1fa2aca4696f83c6d2addf9772218836bf3ad1ccc866d678c71ef749c81b65b474c63b9ece4c3dbfdad060fafe0d9747884dec27adfc416cce7206fb6dd59662de6ada0a3016497917ecb38da992022d7de52f07e19563134e65ea638d177e72c9d12de951c185eb3ec6dce9034b4ca0fb11bfece3fdb4bdd597c8f840c92bbc957aab52e07527ff3e54a'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              # use a separate name so the tarfile module is not shadowed
              with tarfile.open(tfile.name) as tar:
                tar.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


# K Nearest Neighbors - Iris

### In this project, we will make predictions using KNN on a real data set using Scikit-Learn.

## 1- Import Libraries & Dataset

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
dataset = pd.read_csv('/kaggle/input/flowers-dataset-iris/iris.csv')

In [None]:
# number of samples (n -> number of rows) and features (p -> number of variables)
# shape -> (rows, columns); note that it counts every column, including Id and Species
# n = 150
# p = 4

dataset.shape

In [None]:
# first 7 rows
dataset.head(7)

In [None]:
# basic statistics

dataset.describe()

In [None]:
# number of samples in each class
dataset.groupby('Species').size()

## 2- Features and Labels

In [None]:
# feature columns
feature_columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']

In [None]:
# Feature matrix

X = dataset[feature_columns].values
X

In [None]:
# Label vector

y = dataset['Species'].values
y

## 3- Label Encoding

Since the label column (`y`) is categorical (species names stored as strings), we encode it as integers.

After label encoding, the new values will be as follows:
* Iris Setosa: 0
* Iris Versicolour: 1
* Iris Virginica: 2
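
The mapping above follows from how `LabelEncoder` works: it sorts the distinct class names and uses each name's position as its code. A minimal plain-Python sketch of the same idea is shown here; the short `labels` list is made up for illustration and is not read from the data set.

```python
# Plain-Python sketch of integer label encoding (no scikit-learn needed).
# The labels below are hypothetical sample values, not the real data set.
labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

# Sorted distinct class names define the codes: position in the list = code.
classes = sorted(set(labels))
class_to_code = {name: code for code, name in enumerate(classes)}

# Encode every label, then decode again to show the mapping is reversible.
encoded = [class_to_code[name] for name in labels]
decoded = [classes[code] for code in encoded]

print(class_to_code)
print(encoded)
```

Decoding matters later if predictions should be reported as species names rather than integers.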

In [None]:
# scikit-learn -> LabelEncoder
from sklearn.preprocessing import LabelEncoder

In [None]:
# create the label encoder
le = LabelEncoder()

In [None]:
# y before encoding
print('Iris-setosa:\n', y[0:10])
print('\n')
print('Iris-versicolour:\n', y[50:60])
print('\n')
print('Iris-virginica:\n', y[100:110])

In [None]:
# now encode y

y = le.fit_transform(y)

In [None]:
# y after encoding
print('Iris-setosa:\n', y[0:10])
print('\n')
print('Iris-versicolour:\n', y[50:60])
print('\n')
print('Iris-virginica:\n', y[100:110])

## 4- Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# now split the data
# Train - Test  --->  60% - 40%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 4)

In [None]:
# shape of the training data
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)

In [None]:
# shape of the test data

print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

## 5 - Scaling

**IMPORTANT:**

Normally, we apply `scaling` before running a distance-based algorithm such as KNN. That is, we bring the variables to a comparable range (often 0 to 1) so that no single variable dominates the distance.

But since all variables in this data set are measured in cm and are on a similar scale, we do not need to do any scaling here.
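
For data sets where the variables do differ in range, min-max scaling to the 0-1 interval could be sketched as follows. This is a plain-Python illustration only; the helper `min_max_scale` and the sample values are assumptions and are not used elsewhere in this notebook.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical measurements in cm, not taken from the data set.
petal_lengths_cm = [1.0, 2.5, 4.0, 7.0]
scaled = min_max_scale(petal_lengths_cm)
print(scaled)
```

In a real pipeline the minimum and maximum would be computed on the training data only and then reused for the test data.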

## 6 - Data Visualization

Since the data has more than 2 variables, we cannot show all of them on a single two-dimensional scatter plot.

So we will use a pair plot, which draws one scatter plot for every pair of variables:

### Pairplot

In [None]:
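# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(dataset.drop("Id", axis=1),
             hue="Species",
             height=3,  # `size` was renamed to `height` in newer seaborn versions
             markers=["o", "s", "D"])
plt.show()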

### Box-Plot

In [None]:
# boxplot with `by` creates its own figure, so no plt.figure() call is needed
dataset.drop("Id", axis=1).boxplot(by="Species", figsize=(15, 10))
plt.show()

## 7 - Develop the Model

In [None]:
# import the classifier
from sklearn.neighbors import KNeighborsClassifier

In [None]:
def sklearn_knn(train_data, label_data, test_data, k):

    # create the KNN classifier
    knn = KNeighborsClassifier(n_neighbors=k)

    # train on the training data (X_train, y_train)
    knn.fit(train_data, label_data)

    # predict the labels of the test data (X_test)
    predict_label = knn.predict(test_data)

    return predict_label
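
To show roughly what the classifier does internally, here is a minimal from-scratch sketch of KNN prediction for a single point: measure the Euclidean distance to every training point, take the `k` closest, and let them vote. The helper `knn_predict` and the toy points are hypothetical; scikit-learn's implementation is more efficient and may break ties differently.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote of its k nearest points."""
    # Pair each training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and count the votes.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 2-D data: two points of class 0 and three points of class 1.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
labels = [0, 0, 1, 1, 1]

print(knn_predict(points, labels, (1.1, 1.0), k=3))
print(knn_predict(points, labels, (5.0, 5.1), k=3))
```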

### Predict

Now we have everything we need to make predictions.

All we need to do is decide on the value of `K` and call the `sklearn_knn` function.

In [None]:
# call the sklearn_knn function and get the predictions
# K = 3

y_predict = sklearn_knn(X_train, y_train, X_test, 3)
y_predict

In [None]:
y_predict.shape

## 8- Model Accuracy

In [None]:
# Accuracy: found by comparing the true y values (true classes) with the predicted values.
def accuracy(test_labels, pred_labels):

    # count the correct predictions
    correct = np.sum(test_labels == pred_labels)

    # total number of test samples
    n = len(test_labels)

    # accuracy = correct predictions / total test samples
    accur = correct / n

    return accur

In [None]:
# now use the accuracy function to measure the accuracy of our model

accuracy(y_test, y_predict)

Now let's get the accuracy rate with the standard scikit-learn function:

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_sklearn = accuracy_score(y_test, y_predict)*100

In [None]:
print('Model accuracy: ' + str(round(accuracy_sklearn, 2)) + ' %.')

## 9 - K Value

To determine the value of `K`, we need to try different `K` values on the `Train-Test Data` and decide on the most suitable one.

### Rule of Thumb:

The value of K should not be greater than the square root of the number of samples (n).


In [None]:
n = len(dataset)
n

In [None]:
import math
k_max = math.sqrt(n)
k_max

Accordingly, the value of K should be at most 12.

So we will check the values from 1 to 12 one by one.

We will take the one that gives the best accuracy.

In [None]:
# create a list for the accuracy values
normal_accuracy = []

# candidate K values
k_value = range(1, 13)

# loop over the K values one by one
for k in k_value:
    y_predict = sklearn_knn(X_train, y_train, X_test, k)
    accur = accuracy_score(y_test, y_predict)
    normal_accuracy.append(accur)

In [None]:
# now plot the accuracy values we obtained for these K values

plt.xlabel("k")
plt.ylabel("accuracy")

# draw the plot
plt.plot(k_value, normal_accuracy, c='g')

# grid
plt.grid(True)

plt.show()

As the plot shows, accuracy increased up to a point as the `K` value increased.

After `K=6`, accuracy stayed the same for the remaining values.

Therefore, we can say that the best K value for this data set is 6.
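
Instead of reading the best value off the plot, the smallest K that reaches the highest accuracy can also be picked programmatically. The sketch below uses made-up accuracy values purely for illustration; they are not the results of this notebook.

```python
# Hypothetical K values and accuracies, for illustration only.
k_values = list(range(1, 8))
accuracies = [0.90, 0.92, 0.95, 0.95, 0.97, 0.97, 0.96]

# Highest accuracy in the list.
best_accuracy = max(accuracies)

# list.index returns the first match, i.e. the smallest K with that accuracy.
best_k = k_values[accuracies.index(best_accuracy)]

print(best_k, best_accuracy)
```

Preferring the smallest K among ties is one possible design choice; other tie-breaking rules are equally valid.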

In [None]:
# do not limit K to 12; search between 1 and 30 instead
# create a list for the accuracy values
normal_accuracy = []

# candidate K values
k_value = range(1, 31)

# loop over the K values one by one
for k in k_value:
    y_predict = sklearn_knn(X_train, y_train, X_test, k)
    accur = accuracy_score(y_test, y_predict)
    normal_accuracy.append(accur)

In [None]:
# now plot the accuracy values against the K values

plt.xlabel("k")
plt.ylabel("accuracy")

# draw the plot
plt.plot(k_value, normal_accuracy, c='g')

# grid
plt.grid(True)

plt.show()

We could not achieve any improvement beyond K=12.

In fact, the accuracy even dropped.