Supervised Learning (Classification / Regression)

- k-Nearest Neighbors (KNN)  
- Linear Models  
- Decision Trees  
- Random Forests  





k-Nearest Neighbors (KNN)  
    - 새로운 data point에 대한 예측 시, dataset 내에서 가장 가까운 data point를 탐색  
    - KNeighborsClassifier, KNeighborsRegressor

* KNN의 Classification, Regression 차이점  
    1. Dataset load, Data preprocessing은 동일  
    2. Leaning : Classification은 KNeighborsClassifier, Regression은 KNeighborsRegressor  
        - clf = KNeighborsClassifier(n_neighbors = 3)  
        - reg = KNeighborsRegressor(n_neighbors=3)  
  
  
Linear Models  
    - 선형 함수로 값을 예측. 기존의 데이터셋을 기준으로 선하나 긋기  
    1. LinearRegression : cost function을 최소화 하는 최적의 선
    2. Ridge : L2 regularization 사용  
    3. Lasso : L1 regularization 사용  
    4. Logistic Regression : Binary classification에서 사용! class를 구분하는 경계를 긋는 것.  
    - linear model은 binary classification에서만 사용가능하나 multi class classification으로 확장할수 있는 알고리즘도 존재하는듯?  

Decision Trees  
    - 땅 자르듯이, 계속 분할 하는 것
    
Random Forests  


### KNN (Classification)

In [10]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Dataset load - 'Forge Dataset'
data = np.load('./data/forge_dataset.npy', allow_pickle=True)
X = data[:,:-1]
y = data[:,-1]

print('shape of X:', X.shape)
print('shape of y', y.shape)
print('y:', y)

# Data Preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 0)

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Learning
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors = 3)
clf.fit(X_train_scaled, y_train)

# Inference & Evaluation
y_train_hat = clf.predict(X_train_scaled)
print('ground truth of y_train:', y_train)
print('prediction result of y_train:', y_train_hat)

y_test_hat = clf.predict(X_test_scaled)
print('ground truth of y_test:', y_test)
print('prediction result of y_test:', y_test_hat)

from sklearn.metrics import accuracy_score	# classification accuracy
y_train_accuracy = accuracy_score(y_train, y_train_hat)
print('train_accuracy:', y_train_accuracy)

y_test_accuracy = accuracy_score(y_test, y_test_hat)
print('test_accuracy:', y_test_accuracy)


shape of X: (26, 2)
shape of y (26,)
y: [1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0.
 1. 0.]
ground truth of y_train: [0. 0. 1. 1. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0.]
prediction result of y_train: [0. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0.]
ground truth of y_test: [1. 0. 1. 0. 1. 1. 0.]
prediction result of y_test: [1. 0. 1. 0. 1. 0. 0.]
train_accuracy: 0.9473684210526315
test_accuracy: 0.8571428571428571


### Hyperparameter  
n_neighbors : 값이 작을수록 overfitting

In [11]:
# Hyperparameter search (number of neighbors)
# 
from sklearn.datasets import load_breast_cancer
breast_cancer_dataset = load_breast_cancer()
X, y = breast_cancer_dataset.data, breast_cancer_dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
train_accuracy_list = []
test_accuracy_list = []
k_search_list = range(1, 10, 2)

for k in k_search_list:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)

    y_train_hat = clf.predict(X_train)
    train_accuracy = accuracy_score(y_train, y_train_hat)

    y_test_hat = clf.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_hat)

    train_accuracy_list.append(train_accuracy)
    test_accuracy_list.append(test_accuracy)

import pandas as pd

result_df = pd.DataFrame({
    'k': k_search_list,
    'train accuracy': train_accuracy_list,
    'test accuracy': test_accuracy_list
})
display(result_df)  ## k 가 작을수록 overfitting 됨



Unnamed: 0,k,train accuracy,test accuracy
0,1,1.0,0.916084
1,3,0.957746,0.923077
2,5,0.941315,0.937063
3,7,0.938967,0.944056
4,9,0.93662,0.958042


In [12]:
# Hyperparameter Search (Power parameter for Minkowski distance)
train_accuracy_list = []
test_accuracy_list = []

p_search_list = range(1,6)  
# p = 1 : manhatten distance
# p = 2 : euclidean distance
# p >= 2 : minkowski distance

for p in p_search_list:
    clf = KNeighborsClassifier(n_neighbors=5, metric = 'minkowski', p=p)
    clf.fit(X_train, y_train)

    y_train_hat = clf.predict(X_train)
    train_accuracy = accuracy_score(y_train, y_train_hat)

    y_test_hat = clf.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_hat)

    train_accuracy_list.append(train_accuracy)
    test_accuracy_list.append(test_accuracy)

result_df = pd.DataFrame({
    'p': p_search_list,
    'train accuracy': train_accuracy_list,
    'test accuracy': test_accuracy_list
})

display(result_df)

Unnamed: 0,p,train accuracy,test accuracy
0,1,0.957746,0.958042
1,2,0.941315,0.937063
2,3,0.941315,0.93007
3,4,0.938967,0.93007
4,5,0.938967,0.93007


### KNN (Regression)

In [7]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Dataset Load - 'Wave Dataset'
data = np.load('./data/wave_dataset.npy', allow_pickle=True)
X = data[:,:-1]
y = data[:,-1]

print('shape of X:', X.shape)
print('shape of y', y.shape)
print('y:', y)

# Data Preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Learning
from sklearn.neighbors import KNeighborsRegressor   # KNN regression.
reg = KNeighborsRegressor(n_neighbors=3)            # k = 3
reg.fit(X_train_scaled, y_train)                    # fit (training)

# Inference & Evaluation
y_train_hat = reg.predict(X_train_scaled)           # predict train data
print('ground truth of y_train:', y_train)          # GT
print('prediction result of y_train:', y_train_hat) # train result

y_test_hat = reg.predict(X_test_scaled)             # predict test data
print('ground truth of y_test:', y_test)            # GT
print('prediction result of y_test:', y_test_hat)   # test result

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score	# regression accuracy

y_train_mae = mean_absolute_error(y_train, y_train_hat)
y_train_rmse = mean_squared_error(y_train, y_train_hat)**0.5
y_train_r2 = r2_score(y_train, y_train_hat)
print('train_MAE: %.4f'%y_train_mae)
print('train_RMSE: %.4f'%y_train_rmse)
print('train_R_square: %.4f'%y_train_r2)

y_test_mae = mean_absolute_error(y_test, y_test_hat)
y_test_rmse = mean_squared_error(y_test, y_test_hat)**0.5
y_test_r2 = r2_score(y_test, y_test_hat)
print('test_MAE: %.4f'%y_test_mae)
print('test_RMSE: %.4f'%y_test_rmse)
print('test_R_square: %.4f'%y_test_r2)


shape of X: (40, 1)
shape of y (40,)
y: [-0.44822073  0.33122576  0.77932073  0.03497884 -1.38773632 -2.47196233
 -1.52730805  1.49417157  1.00032374  0.22956153 -1.05979555  0.7789638
  0.75418806 -1.51369739 -1.67303415 -0.90496988  0.08448544 -0.52734666
 -0.54114599 -0.3409073   0.21778193 -1.12469096  0.37299129  0.09756349
 -0.98618122  0.96695428 -1.13455014  0.69798591  0.43655826 -0.95652133
  0.03527881 -2.08581717 -0.47411033  1.53708251  0.86893293  1.87664889
  0.0945257  -1.41502356  0.25438895  0.09398858]
ground truth of y_train: [-1.51369739 -2.47196233 -0.52734666 -1.67303415  1.53708251  1.49417157
 -0.47411033  0.33122576 -1.13455014  0.75418806 -2.08581717 -0.98618122
 -1.52730805  0.09756349 -1.12469096 -0.3409073   0.22956153  0.25438895
  0.03497884 -0.44822073]
prediction result of y_train: [-1.44042723 -1.89415682 -0.49284968 -1.63113382  1.12082662  1.26181405
 -1.04203645  1.12082662 -1.44042723  1.26181405 -2.07693788 -0.6539162
 -1.04203645 -0.23052151 -1.

### Linear model

In [14]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Dataset Load 'extended boston dataset'
data = np.load('./data/extended_boston_dataset.npy', allow_pickle=True)
X = data[:,:-1]
y = data[:,-1]

# print('shape of X:', X.shape)
# print('shape of y', y.shape)
# print('y:', y)

# Data preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 0)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Learning
reg_linear = LinearRegression()
reg_linear.fit(X_train_scaled, y_train)

# Inference & Evaluation
y_train_hat = reg_linear.predict(X_train_scaled)
print('train MAE: %.5f'%mean_absolute_error(y_train, y_train_hat))
print('train RMSE: %.5f'%mean_squared_error(y_train, y_train_hat)**0.5)
print('train R_square: %.5f'%r2_score(y_train, y_train_hat))

y_test_hat = reg_linear.predict(X_test_scaled)
print('test MAE: %.5f'%mean_absolute_error(y_test, y_test_hat))
print('test RMSE: %.5f'%mean_squared_error(y_test, y_test_hat)**0.5)
print('test R_square: %.5f'%r2_score(y_test, y_test_hat))


train MAE: 1.56741
train RMSE: 2.02246
train R_square: 0.95205
test MAE: 3.22590
test RMSE: 5.66296
test R_square: 0.60747


### Ridge

In [15]:
from sklearn.linear_model import Ridge

# Data load, data preprocessing은 위에서 처리

# Learning (여기만 다름!!)
reg_ridge = Ridge(alpha=1) # Hyperparameter alpha.
reg_ridge.fit(X_train_scaled, y_train)

# Inference & Evaluation (LinearRegression와 동일)
y_train_hat = reg_ridge.predict(X_train_scaled)
print('train MAE: %.5f'%mean_absolute_error(y_train, y_train_hat))
print('train RMSE: %.5f'%mean_squared_error(y_train, y_train_hat)**0.5)
print('train R_square: %.5f'%r2_score(y_train, y_train_hat))

y_test_hat = reg_ridge.predict(X_test_scaled)
print('test MAE: %.5f'%mean_absolute_error(y_test, y_test_hat))
print('test RMSE: %.5f'%mean_squared_error(y_test, y_test_hat)**0.5)
print('test R_square: %.5f'%r2_score(y_test, y_test_hat))

train MAE: 1.70581
train RMSE: 2.30122
train R_square: 0.93792
test MAE: 2.81779
test RMSE: 4.20133
test R_square: 0.78395


### Lasso Regression

In [16]:
from sklearn.linear_model import Lasso
# Data load, data preprocessing은 위에서 처리

# Learning (여기만 다름!!)
reg_lasso = Lasso(alpha=1)
reg_lasso.fit(X_train_scaled, y_train)

# Inference & Evaluation (LinearRegression와 동일)
y_train_hat = reg_ridge.predict(X_train_scaled)
print('train MAE: %.5f'%mean_absolute_error(y_train, y_train_hat))
print('train RMSE: %.5f'%mean_squared_error(y_train, y_train_hat)**0.5)
print('train R_square: %.5f'%r2_score(y_train, y_train_hat))

y_test_hat = reg_ridge.predict(X_test_scaled)
print('test MAE: %.5f'%mean_absolute_error(y_test, y_test_hat))
print('test RMSE: %.5f'%mean_squared_error(y_test, y_test_hat)**0.5)
print('test R_square: %.5f'%r2_score(y_test, y_test_hat))

train MAE: 1.70581
train RMSE: 2.30122
train R_square: 0.93792
test MAE: 2.81779
test RMSE: 4.20133
test R_square: 0.78395


### LogisticRegression
데이터를 바꿔야되는데 왜?

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Data load
breast_cancer_dataset = load_breast_cancer()
X, y = breast_cancer_dataset.data, breast_cancer_dataset.target

# print('shape of X:', X.shape)
# print('shape of y', y.shape)
# print('y:', y)

# Data preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Learning
clf = LogisticRegression(C=1)
clf.fit(X_train_scaled, y_train)

# Inference & Evaluation
y_train_hat = clf.predict(X_train_scaled)
print('ground truth of y_train:', y_train)
print('prediction result of y_train:', y_train_hat)
y_test_hat = clf.predict(X_test_scaled)
print('ground truth of y_test:', y_test)
print('prediction result of y_test:', y_test_hat)

y_train_accuracy = accuracy_score(y_train, y_train_hat)
print('train_accuracy:', y_train_accuracy)
y_test_accuracy = accuracy_score(y_test, y_test_hat)
print('test_accuracy:', y_test_accuracy)

ground truth of y_train: [1 1 0 1 0 1 1 1 1 1 1 1 0 1 0 1 0 0 1 1 0 1 0 0 0 1 1 1 1 1 1 0 1 1 1 1 0
 0 1 1 0 1 1 1 1 0 1 1 0 0 1 1 0 0 1 1 0 1 1 0 0 0 1 1 1 0 1 1 1 1 1 0 1 0
 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 1 0 1
 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 0 1 1
 0 1 0 0 1 0 0 1 1 0 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 0 1 0 1 1 1 1 0 0 0 0 1 0 1 0 0 1 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 0 0 1 1 0
 1 1 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 1
 1 0 0 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 0 1 0 0 1 1 1
 1 1 0 0 0 1 1 0 0 1 1 0 1 0 0 1 0 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 1 0 1 0
 0 0 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1
 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0
 0 1 1 1 0 1 1 0 1 1 1 1 1 0 0 0 1 1 1]
prediction result of y_train: [1 1 0 1 0 1 1 1 1 1 1 1 0 1 0 1 0 0 1 1 0 1 0 0 0 1 1 1 1 1 1 0 1 1 1 1 0
 0 1 

### Decision Tree

In [20]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Dataset Load
breast_cancer_dataset = load_breast_cancer()
X, y = breast_cancer_dataset.data, breast_cancer_dataset.target

# print('shape of X:', X.shape)
# print('shape of y', y.shape)
# print('y:', y)

# Data preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 0)

# Learning
clf = DecisionTreeClassifier() # Hyperparameter (min_samples_leaf = m)
clf.fit(X_train, y_train)

# Inference & Evaluation
y_train_hat = clf.predict(X_train)
print('train_accuracy: %.5f'%accuracy_score(y_train, y_train_hat))
y_test_hat = clf.predict(X_test)
print('test_accuracy: %.5f'%accuracy_score(y_test, y_test_hat))

# sklearn 내장함수로 tree 시각화 가능

train_accuracy: 1.00000
test_accuracy: 0.86713


### Random forest

In [21]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Dataset Load
breast_cancer_dataset = load_breast_cancer()
X, y = breast_cancer_dataset.data, breast_cancer_dataset.target

# print('shape of X:', X.shape)
# print('shape of y', y.shape)
# print('y:', y)

# Data preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 0)

# Learning
clf = RandomForestClassifier() # Hyperparameter (n_estimators=n)
clf.fit(X_train, y_train)

# Inference & Evaluation
y_train_hat = clf.predict(X_train)
print('train_accuracy: %.5f'%accuracy_score(y_train, y_train_hat))
y_test_hat = clf.predict(X_test)
print('test_accuracy: %.5f'%accuracy_score(y_test, y_test_hat))

train_accuracy: 1.00000
test_accuracy: 0.97203
