# Regression with KNN

<ul>
    <li>KNN is a machine learning model that makes predictions based on how close a data point's characteristics are to those of its nearest neighbours.</li>
    <li>It can be applied to both classification and regression tasks.</li>
</ul>

### Sample Dataset

In [98]:
import pandas as pd
# tinggi = height, berat = weight; pria = male, wanita = female
sensus = {
    'tinggi': [150, 170, 183, 191, 155, 163, 180, 158, 170],
    'gender': ["pria", "pria", "pria", "pria", "wanita", "wanita", "wanita", "wanita", "wanita"],
    'berat': [64, 86, 84, 80, 49, 59, 67, 54, 67]
}

sensus_df = pd.DataFrame(sensus)
sensus_df

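| | tinggi | gender | berat |
|---|---|---|---|
| 0 | 150 | pria | 64 |
| 1 | 170 | pria | 86 |
| 2 | 183 | pria | 84 |
| 3 | 191 | pria | 80 |
| 4 | 155 | wanita | 49 |
| 5 | 163 | wanita | 59 |
| 6 | 180 | wanita | 67 |
| 7 | 158 | wanita | 54 |
| 8 | 170 | wanita | 67 |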


#### Features and Target

In [99]:
import numpy as np

x_train = np.array(sensus_df[['tinggi','gender']]) # features
y_train = np.array(sensus_df['berat']) # target

print(f"x_train: \n{x_train}\n")
print(f"y_train: {y_train}")

x_train: 
[[150 'pria']
 [170 'pria']
 [183 'pria']
 [191 'pria']
 [155 'wanita']
 [163 'wanita']
 [180 'wanita']
 [158 'wanita']
 [170 'wanita']]

y_train: [64 86 84 80 49 59 67 54 67]


#### Preprocessing the Dataset: Converting Labels to Binary Numeric Values

In [100]:
# transpose the feature array so each feature becomes one row
x_train_transposed = np.transpose(x_train)

print(f"x_train: \n{x_train}\n")
print(f"x_train_transposed: \n {x_train_transposed}")

x_train: 
[[150 'pria']
 [170 'pria']
 [183 'pria']
 [191 'pria']
 [155 'wanita']
 [163 'wanita']
 [180 'wanita']
 [158 'wanita']
 [170 'wanita']]

x_train_transposed: 
 [[150 170 183 191 155 163 180 158 170]
 ['pria' 'pria' 'pria' 'pria' 'wanita' 'wanita' 'wanita' 'wanita'
  'wanita']]


In [101]:
# convert the gender labels to numeric values
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
gender_binarised = lb.fit_transform(x_train_transposed[1])

print(f"gender: {x_train_transposed[1]}\n")
print(f"gender_binarised:\n {gender_binarised}")

gender: ['pria' 'pria' 'pria' 'pria' 'wanita' 'wanita' 'wanita' 'wanita' 'wanita']

gender_binarised:
 [[0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]]


In [102]:
# flatten to a 1-dimensional array
gender_binarised = gender_binarised.flatten()
gender_binarised

array([0, 0, 0, 0, 1, 1, 1, 1, 1])

In [122]:
# write the binarised gender values back into x_train
x_train_transposed[1] = gender_binarised
x_train = x_train_transposed.transpose()

print(f"x_train_transposed: {x_train_transposed}\n")
print(f"x_train: \n{x_train}")

x_train_transposed: [[150 170 183 191 155 163 180 158 170]
 [0 0 0 0 1 1 1 1 1]]

x_train: 
[[150 0]
 [170 0]
 [183 0]
 [191 0]
 [155 1]
 [163 1]
 [180 1]
 [158 1]
 [170 1]]
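
The transpose round-trip above can be avoided. As a minimal sketch (not the notebook's method), the gender column could be mapped to 0/1 directly in pandas before the feature array is built. The `gender_code` column, the `x_train_alt` / `y_train_alt` names, and the mapping dictionary are assumptions introduced for illustration; the mapping is chosen to match the `LabelBinarizer` output shown above (pria → 0, wanita → 1).

```python
import pandas as pd

sensus_df = pd.DataFrame({
    'tinggi': [150, 170, 183, 191, 155, 163, 180, 158, 170],
    'gender': ['pria', 'pria', 'pria', 'pria',
               'wanita', 'wanita', 'wanita', 'wanita', 'wanita'],
    'berat': [64, 86, 84, 80, 49, 59, 67, 54, 67],
})

# hypothetical helper column: same 0/1 encoding that LabelBinarizer produced
sensus_df['gender_code'] = sensus_df['gender'].map({'pria': 0, 'wanita': 1})

# build features and target without transposing anything
x_train_alt = sensus_df[['tinggi', 'gender_code']].to_numpy()
y_train_alt = sensus_df['berat'].to_numpy()

print(x_train_alt)
```

One side effect of this variant is that the feature array has an integer dtype rather than the object dtype that results from mixing numbers and strings in one NumPy array.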


### Training KNN Regression Model

In [104]:
from sklearn.neighbors import KNeighborsRegressor

k = 3  # number of nearest neighbours to use
model = KNeighborsRegressor(n_neighbors=k)
model.fit(x_train, y_train)

KNeighborsRegressor(n_neighbors=3)

### Predicting Body Weight


In [105]:
# prepare the feature values to predict
x_new = np.array([[155,1]])
x_new

array([[155,   1]])

In [106]:
y_pred = model.predict(x_new)
y_pred

array([55.66666667])
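
To see where this number comes from, the prediction can be reproduced by hand. The sketch below assumes the defaults of `KNeighborsRegressor` (Euclidean distance, uniform weights): it measures the distance from the new point to every training point, picks the three closest, and averages their targets. The names `distances`, `nearest`, and `manual_pred` are illustrative.

```python
import numpy as np

x_train = np.array([[150, 0], [170, 0], [183, 0], [191, 0],
                    [155, 1], [163, 1], [180, 1], [158, 1], [170, 1]], dtype=float)
y_train = np.array([64, 86, 84, 80, 49, 59, 67, 54, 67], dtype=float)
x_new = np.array([155, 1], dtype=float)
k = 3

# Euclidean distance from x_new to every training point
distances = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))

# indices of the k smallest distances
nearest = np.argsort(distances)[:k]

# the prediction is the unweighted mean of the neighbours' targets
manual_pred = y_train[nearest].mean()

print(nearest, y_train[nearest], manual_pred)
```

The three neighbours are the rows with heights 155, 158, and 150, whose weights 49, 54, and 64 average to the value returned by `model.predict`.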

### Evaluasi KNN Regression Model
Evaluation compares the target values with the values predicted by the model.

In [107]:
# prepare the testing set
x_test = np.array([[168,0], [180,0], [160,1],[169,1]]) # features
y_test = np.array([65,96,52,67]) # target

print(f"x_test: \n{x_test}\n")
print(f"y_test: {y_test}")

x_test: 
[[168   0]
 [180   0]
 [160   1]
 [169   1]]

y_test: [65 96 52 67]


In [108]:
# predict the target from the test features
y_pred = model.predict(x_test)
y_pred

array([70.66666667, 79.        , 54.        , 70.66666667])

#### 1. Coefficient of Determination (R²)

In [109]:
from sklearn.metrics import r2_score

r_squared = r2_score(y_test, y_pred)

print(f"R_squared: {r_squared}")

R_squared: 0.6725768321513002
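
As a rough check, the same score can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the targets. The sketch below hard-codes the test targets and the predictions printed above; `ss_res`, `ss_tot`, and `r_squared_manual` are illustrative names.

```python
import numpy as np

y_test = np.array([65, 96, 52, 67], dtype=float)
y_pred = np.array([70.66666667, 79.0, 54.0, 70.66666667])

# residual sum of squares: what the model failed to explain
ss_res = ((y_test - y_pred) ** 2).sum()

# total sum of squares: spread of the targets around their own mean
ss_tot = ((y_test - y_test.mean()) ** 2).sum()

r_squared_manual = 1 - ss_res / ss_tot
print(r_squared_manual)
```

A score of 1 would mean perfect predictions, while 0 would mean the model does no better than always predicting the mean of `y_test`.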


#### 2. Mean Absolute Error (MAE) or Mean Absolute Deviation (MAD)
<img src="mae.png">
MAE is the mean of the absolute prediction errors. <br><br>
yi = target value in the testing set<br>
yi^ = predicted value produced by the model<br>

The raw error yi - yi^ is positive when the prediction is lower than the target and negative when it is higher. MAE takes the absolute value of each error, so the direction of the miss does not affect the result.

In [110]:
from sklearn.metrics import mean_absolute_error

MAE = mean_absolute_error(y_test, y_pred)

print(f"MAE: {MAE}")

MAE: 7.083333333333336
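
The calculation can be sketched with plain NumPy, using the test targets and predictions printed earlier. The names `errors` and `mae_manual` are illustrative. Printing the signed errors first shows which predictions were too low (positive) and which were too high (negative) before the sign is discarded.

```python
import numpy as np

y_test = np.array([65, 96, 52, 67], dtype=float)
y_pred = np.array([70.66666667, 79.0, 54.0, 70.66666667])

# signed errors: positive means the model predicted too low
errors = y_test - y_pred

# MAE discards the sign before averaging
mae_manual = np.abs(errors).mean()

print(errors)
print(mae_manual)
```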


#### 3. Mean Squared Error (MSE) or Mean Squared Deviation (MSD)
<img src="mse.png">
MSE is the mean of the squared prediction errors.<br>
Because it measures error, a smaller value indicates a better model.

In [111]:
from sklearn.metrics import mean_squared_error

MSE = mean_squared_error(y_test, y_pred)

print(f"MSE: {MSE}")

MSE: 84.6388888888889
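
A hand computation makes it visible how squaring weights large misses. In the sketch below (illustrative names `squared_errors`, `mse_manual`, `largest_share`), the single 17-unit miss on the second test sample contributes most of the total squared error.

```python
import numpy as np

y_test = np.array([65, 96, 52, 67], dtype=float)
y_pred = np.array([70.66666667, 79.0, 54.0, 70.66666667])

# square each error so that large misses count disproportionately
squared_errors = (y_test - y_pred) ** 2
mse_manual = squared_errors.mean()

# fraction of the total squared error caused by the single worst prediction
largest_share = squared_errors.max() / squared_errors.sum()

print(squared_errors)
print(mse_manual, largest_share)
```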


### The Feature Scaling Problem
This section checks whether a change in measurement units affects the consistency of Euclidean distance results.

In [112]:
from scipy.spatial.distance import euclidean

# height in millimetres
x_train = np.array([[1700,0], [1600,1]])
x_new = np.array([[1640,0]]) # data point to predict

[euclidean(x_new[0], i) for i in x_train]

[60.0, 40.01249804748511]

In [113]:
# height in metres
x_train = np.array([[1.7,0], [1.6,1]])
x_new = np.array([[1.64,0]]) 

[euclidean(x_new[0], i) for i in x_train]

[0.06000000000000005, 1.0007996802557442]
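
The practical consequence is that the nearest neighbour itself changes with the unit. The sketch below wraps the comparison in a small helper, `nearest_index`, which is a hypothetical name introduced here; it returns the row index of the closest training point for each unit.

```python
import numpy as np

def nearest_index(x_train, x_new):
    """Return the row index of the training point closest to x_new."""
    distances = np.linalg.norm(x_train - x_new, axis=1)
    return int(np.argmin(distances))

# same two people and the same query point, only the height unit differs
idx_mm = nearest_index(np.array([[1700, 0], [1600, 1]]), np.array([1640, 0]))
idx_m = nearest_index(np.array([[1.7, 0], [1.6, 1]]), np.array([1.64, 0]))

print(idx_mm, idx_m)
```

In millimetres the height difference dominates and the second row is closest; in metres the gender column dominates and the first row is closest, even though the underlying data are identical.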

### Solving the Feature Scaling Problem
#### 1. Applying the Standard Scaler (standard score or z-score)
The z-score is the difference between a data value and the mean, divided by the standard deviation.
<img src="z-score.png">
x = feature value<br>
S = standard deviation
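
As a minimal sketch of the definition, the z-score can be computed column by column with NumPy. It assumes the population standard deviation (`ddof=0`, NumPy's default), which is also what scikit-learn's `StandardScaler` uses; the variable names are illustrative.

```python
import numpy as np

# two samples: height in millimetres and binary gender
x = np.array([[1700, 0], [1600, 1]], dtype=float)

mean = x.mean(axis=0)   # per-feature mean
std = x.std(axis=0)     # per-feature population standard deviation (ddof=0)

# subtract the mean, then divide by the standard deviation
z = (x - mean) / std

print(z)
```

After this transformation every column has mean 0 and standard deviation 1, so the original unit of each feature no longer matters.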

In [114]:
from sklearn.preprocessing import  StandardScaler

# create a StandardScaler object
ss = StandardScaler()

In [115]:
# height in millimetres
x_train = np.array([[1700,0], [1600,1]])
x_train_scaled = ss.fit_transform(x_train)
print(f"x_train_scaled: \n{x_train_scaled}\n")


x_new = np.array([[1640, 0]])
x_new_scaled = ss.transform(x_new)
print(f"x_new_scaled: {x_new_scaled}\n")

# distances are now measured in z-score units
jarak = [euclidean(x_new_scaled[0], i) for i in x_train_scaled]
print(f"jarak: {jarak}")

x_train_scaled: 
[[ 1. -1.]
 [-1.  1.]]

x_new_scaled: [[-0.2 -1. ]]

jarak: [1.2, 2.1540659228538015]


In [116]:
# height in metres
x_train = np.array([[1.7,0], [1.6,1]])
x_train_scaled = ss.fit_transform(x_train)
print(f"x_train_scaled: \n{x_train_scaled}\n")


x_new = np.array([[1.640, 0]])
x_new_scaled = ss.transform(x_new)
print(f"x_new_scaled: {x_new_scaled}\n")

# distances are now measured in z-score units
jarak = [euclidean(x_new_scaled[0], i) for i in x_train_scaled]
print(f"jarak: {jarak}")

x_train_scaled: 
[[ 1. -1.]
 [-1.  1.]]

x_new_scaled: [[-0.2 -1. ]]

jarak: [1.2000000000000026, 2.1540659228538006]


### Applying Feature Scaling to KNN
#### Dataset

In [129]:
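# x_train was overwritten by the two-row scaling demo above;
# rebuild the nine-row training set from x_train_transposed first
x_train = x_train_transposed.transpose()

# training set
print(f"x_train: \n{x_train}\n")
print(f"y_train: {y_train}\n")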

# test set
print(f"x_test: \n{x_test}\n")
print(f"y_test: {y_test}")

x_train: 
[[150 0]
 [170 0]
 [183 0]
 [191 0]
 [155 1]
 [163 1]
 [180 1]
 [158 1]
 [170 1]]

y_train: [64 86 84 80 49 59 67 54 67]

x_test: 
[[168   0]
 [180   0]
 [160   1]
 [169   1]]

y_test: [65 96 52 67]


#### Feature Scaling (Standard Scaler)

In [130]:
x_train_scaled = ss.fit_transform(x_train)
x_test_scaled = ss.transform(x_test)

print(f"x_train_scaled: \n{x_train_scaled}\n")
print(f"x_test_scaled: \n{x_test_scaled}\n")

x_train_scaled: 
[[-1.45495909 -1.11803399]
 [ 0.08558583 -1.11803399]
 [ 1.08694002 -1.11803399]
 [ 1.70315799 -1.11803399]
 [-1.06982286  0.89442719]
 [-0.45360489  0.89442719]
 [ 0.85585829  0.89442719]
 [-0.83874112  0.89442719]
 [ 0.08558583  0.89442719]]

x_test_scaled: 
[[-0.06846866 -1.11803399]
 [ 0.85585829 -1.11803399]
 [-0.68468663  0.89442719]
 [ 0.00855858  0.89442719]]



### Training and Evaluating the Model

In [131]:
model.fit(x_train_scaled, y_train)
y_pred = model.predict(x_test_scaled)

MAE = mean_absolute_error(y_test, y_pred)
MSE = mean_squared_error(y_test, y_pred)

print(f"MAE: {MAE}")
print(f"MSE: {MSE}")

MAE: 7.583333333333336
MSE: 85.13888888888893
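
Scaling and fitting can also be bundled into a single estimator. The sketch below is an alternative arrangement, not the notebook's method: it uses scikit-learn's `make_pipeline` so that the scaler is fitted on the training set only and then reused automatically at prediction time. The names `pipe` and `y_pred_pipe` are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x_train = np.array([[150, 0], [170, 0], [183, 0], [191, 0],
                    [155, 1], [163, 1], [180, 1], [158, 1], [170, 1]])
y_train = np.array([64, 86, 84, 80, 49, 59, 67, 54, 67])
x_test = np.array([[168, 0], [180, 0], [160, 1], [169, 1]])

# the pipeline applies StandardScaler first, then KNN with k = 3
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
pipe.fit(x_train, y_train)

# x_test is scaled with the training-set mean and standard deviation
y_pred_pipe = pipe.predict(x_test)
print(y_pred_pipe)
```

Keeping both steps in one object removes the need to remember to call `transform` on new data and reduces the risk of fitting the scaler on the test set by mistake.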
