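# Support Vector Machines

- A supervised learning method used for regression, classification, and outlier detection.
- The data points that lie on the boundary between classes are called support vectors.
- Training learns how much each support vector contributes to defining the decision boundary between the classes.
- Training searches for the boundary with the largest margin, that is, the largest distance between the boundary and the nearest support vectors of each class.
- Predictions are based on the distance to each support vector and the learned importance of that support vector.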
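
The points above can be inspected on a fitted model. The following is a minimal sketch on toy data generated with `make_blobs` (an assumption, not part of the notebook): it prints how many support vectors were kept and rebuilds the decision values by hand from the support vectors and their learned weights.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two 2-D clusters (toy data chosen for illustration only).
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=0)

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# Only a subset of the training points is kept as support vectors.
print('support vectors per class:', clf.n_support_)
print('support vectors shape:', clf.support_vectors_.shape)

# dual_coef_ stores the signed weight (the "importance") of each support vector.
print('dual coefficients shape:', clf.dual_coef_.shape)

# For a linear kernel, the decision value is a weighted sum of inner products
# with the support vectors, plus the intercept.
manual = clf.dual_coef_ @ (clf.support_vectors_ @ X.T) + clf.intercept_
print('matches decision_function:', np.allclose(manual.ravel(), clf.decision_function(X)))
```

With a nonlinear kernel the inner product `support_vectors_ @ X.T` would be replaced by the kernel function evaluated between each support vector and each input.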

![support vector machine](https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Svm_separating_hyperplanes.png/220px-Svm_separating_hyperplanes.png)

- H3 does not separate the two classes correctly.
- H1 and H2 both separate the two classes, but H2 does so with a larger margin than H1.
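
The difference between H1 and H2 can be reproduced numerically. The sketch below uses six hand-picked points and a helper `margin_width` (both hypothetical, introduced only for illustration) to compare a hand-chosen separating line with the one found by a linear `SVC`. A very large `C` is used to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Six linearly separable points (illustrative data, not from the notebook).
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

def margin_width(w, b, X):
    """Twice the distance from the hyperplane w.x + b = 0 to its nearest point."""
    return 2 * np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

# A hand-chosen separating line x = 2 (plays the role of H1).
hand_margin = margin_width(np.array([1.0, 0.0]), -2.0, X)

# The maximum-margin line found by a linear SVM (plays the role of H2).
clf = SVC(kernel='linear', C=1e6).fit(X, y)
svm_margin = margin_width(clf.coef_[0], clf.intercept_[0], X)

print(f'hand-chosen line margin: {hand_margin:.3f}')
print(f'SVM line margin:         {svm_margin:.3f}')
```

Both lines classify every point correctly, but the SVM line leaves more room on each side: the hand-chosen line has a width of 2, while the SVM line should come out near 5/√2 ≈ 3.54 for this data.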

In [1]:
import multiprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use(['seaborn-whitegrid'])

In [2]:
from sklearn.svm import SVR, SVC  # SVC: classification, SVR: regression
# Note: load_boston was removed in scikit-learn 1.2, so this notebook needs an older release.
from sklearn.datasets import load_boston, load_diabetes
from sklearn.datasets import load_breast_cancer, load_iris, load_wine
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, cross_validate
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.manifold import TSNE  # reduces the data to 2 dimensions; mainly used for visualization
import warnings
warnings.filterwarnings('ignore')

## Regression and Classification Models with SVM

### Regression model with SVM (SVR)

In [10]:
x, y = load_boston(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)

model = SVR()
model.fit(x_train, y_train)

print(f'Train Data Score: {model.score(x_train, y_train)}')
print(f'Test Data Score: {model.score(x_test, y_test)}')

Train Data Score: 0.2177283706374875
Test Data Score: 0.13544178468518164


### Classification model with SVM (SVC)

In [11]:
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)

model = SVC()
model.fit(x_train, y_train)

print(f'Train Data Score: {model.score(x_train, y_train)}')
print(f'Test Data Score: {model.score(x_test, y_test)}')

Train Data Score: 0.9014084507042254
Test Data Score: 0.9230769230769231


## Kernel Methods

- Kernels map the input data into a higher-dimensional space so that the model can learn nonlinear patterns.
- scikit-learn supports several kernels, including linear, polynomial, and RBF (Radial Basis Function).
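
To make the "mapping into a higher-dimensional space" concrete, here is a small sketch for the degree-2 polynomial kernel on 2-D inputs. The feature map `phi` and the sample vectors are assumptions chosen for illustration. The point is that the kernel value computed in the original space equals the inner product in the mapped space, so the mapping never has to be built explicitly.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

def phi(v):
    """Explicit degree-2 feature map for a 2-D input (no bias term)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Inner product after mapping both points into 3-D.
explicit = phi(x) @ phi(z)

# The same quantity computed directly in the original 2-D space.
kernel_value = (x @ z) ** 2

# scikit-learn's polynomial kernel with gamma=1 and coef0=0 is (x . z) ** degree.
sk_value = polynomial_kernel(x.reshape(1, -1), z.reshape(1, -1),
                             degree=2, gamma=1.0, coef0=0.0)[0, 0]

print(explicit, kernel_value, sk_value)
```

`SVC(kernel='poly')` and `SVR(kernel='poly')` rely on the same idea, with `degree`, `gamma`, and `coef0` controlling the implicit feature space.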

In [11]:
# load_boston: Boston housing prices

In [12]:
x, y = load_boston(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)

linear_svr = SVR(kernel = 'linear')
linear_svr.fit(x_train, y_train)

print(f'Linear SVR Train Data Score: {linear_svr.score(x_train, y_train)}')
print(f'Linear SVR Test Data Score: {linear_svr.score(x_test, y_test)}')

polynomial_svr = SVR(kernel = 'poly')
polynomial_svr.fit(x_train, y_train)

print(f'Polynomial SVR Train Data Score: {polynomial_svr.score(x_train, y_train)}')
print(f'Polynomial SVR Test Data Score: {polynomial_svr.score(x_test, y_test)}')

rbf_svr = SVR(kernel = 'rbf')
rbf_svr.fit(x_train, y_train)

print(f'RBF SVR Train Data Score: {rbf_svr.score(x_train, y_train)}')
print(f'RBF SVR Test Data Score: {rbf_svr.score(x_test, y_test)}')

Linear SVR Train Data Score: 0.715506552212211
Linear SVR Test Data Score: 0.6380396318359647
Polynomial SVR Train Data Score: 0.2024454261446288
Polynomial SVR Test Data Score: 0.13366845036746233
RBF SVR Train Data Score: 0.2177283706374875
RBF SVR Test Data Score: 0.13544178468518164


In [13]:
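# Classification with the breast cancer data
# - Try each kernel on the breast cancer data.
# Note: this cell fits SVR (a regressor) to the 0/1 class labels,
# so score() reports R^2 rather than classification accuracy.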

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)

linear_svr = SVR(kernel='linear')
linear_svr.fit(x_train, y_train)

print(f'Linear SVR Train Data Score: {linear_svr.score(x_train, y_train)}')
print(f'Linear SVR Test Data Score: {linear_svr.score(x_test, y_test)}')

polynomial_svr = SVR(kernel = 'poly')
polynomial_svr.fit(x_train, y_train)

print(f'Polynomial SVR Train Data Score: {polynomial_svr.score(x_train, y_train)}')
print(f'Polynomial SVR Test Data Score: {polynomial_svr.score(x_test, y_test)}')

rbf_svr = SVR(kernel = 'rbf')
rbf_svr.fit(x_train, y_train)

print(f'RBF SVR Train Data Score: {rbf_svr.score(x_train, y_train)}')
print(f'RBF SVR Test Data Score: {rbf_svr.score(x_test, y_test)}')

Linear SVR Train Data Score: 0.2587946667423394
Linear SVR Test Data Score: 0.05441336442720124
Polynomial SVR Train Data Score: 0.5492121168675549
Polynomial SVR Test Data Score: 0.3307672139413259
RBF SVR Train Data Score: 0.7141833202421759
RBF SVR Test Data Score: 0.7625209396028007
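
The cell above fits `SVR` models to the class labels, so its scores are R² values. For an actual classification comparison, a sketch with `SVC` might look like the following. Wrapping each model in a `StandardScaler` pipeline is an assumption added here to keep training fast and well-conditioned; it is not part of the original cell, so the scores are not comparable to the ones printed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)

scores = {}
for kernel in ('linear', 'poly', 'rbf'):
    # The scaler is fit on the training split only, inside the pipeline.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    clf.fit(x_train, y_train)
    scores[kernel] = (clf.score(x_train, y_train), clf.score(x_test, y_test))
    print(f'{kernel} SVC Train Data Score: {scores[kernel][0]}')
    print(f'{kernel} SVC Test Data Score: {scores[kernel][1]}')
```

Here `score()` returns mean accuracy, which is the natural metric for a classifier.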


## Parameter Tuning

- SVMs expose different parameters depending on the kernel in use.
- Vary the parameters and observe how performance changes.

In [22]:
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)

In [24]:
polynomial_svc = SVC(kernel = 'poly', degree=2, C=0.1, gamma='auto')
polynomial_svc.fit(x_train, y_train)

print(f'kernel=poly, degree={2}, C={0.1}, gamma={"auto"}')
print(f'Polynomial SVC Train Data Score: {polynomial_svc.score(x_train, y_train)}')
print(f'Polynomial SVC Test Data Score: {polynomial_svc.score(x_test, y_test)}')

kernel=poly, degree=2, C=0.1, gamma=auto
Polynomial SVC Train Data Score: 0.9835680751173709
Polynomial SVC Test Data Score: 0.993006993006993


In [28]:
rbf_svc = SVC(kernel = 'rbf', C=2.0, gamma='scale')
rbf_svc.fit(x_train, y_train)

print(f'kernel=rbf, C={2.0}, gamma={"scale"}')
print(f'RBF SVC Train Data Score: {rbf_svc.score(x_train, y_train)}')
print(f'RBF SVC Test Data Score: {rbf_svc.score(x_test, y_test)}')

kernel=rbf, C=2.0, gamma=scale
RBF SVC Train Data Score: 0.9154929577464789
RBF SVC Test Data Score: 0.9370629370629371
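
Instead of changing one parameter at a time by hand, the search can be automated. The sketch below uses `GridSearchCV` over `C` and `gamma` for an RBF `SVC`. The grid values, the 5-fold setting, and the `StandardScaler` step are all assumptions chosen for illustration rather than settings taken from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('svc', SVC(kernel='rbf'))])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
param_grid = {'svc__C': [0.1, 1.0, 10.0],
              'svc__gamma': ['scale', 0.01, 0.1]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(x_train, y_train)

print(f'Best parameters: {search.best_params_}')
print(f'Best cross-validation score: {search.best_score_}')
print(f'Test Data Score: {search.score(x_test, y_test)}')
```

The held-out test split is only scored once, after the search, so it does not influence which parameters are selected.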


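## Data Preprocessing

- SVMs perform well only when the input features are on comparable scales.
- A common approach is to rescale every feature to the [0, 1] range.
- scikit-learn provides StandardScaler (zero mean, unit variance per feature) and MinMaxScaler ([0, 1] range per feature) for this.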
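
The two scalers behave differently, which is easiest to see on a tiny array. This is a minimal sketch with made-up values: after `StandardScaler` each column has mean 0 and standard deviation 1, while after `MinMaxScaler` each column spans exactly [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (illustrative values).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

standardized = StandardScaler().fit_transform(X)
rescaled = MinMaxScaler().fit_transform(X)

print('StandardScaler mean:', standardized.mean(axis=0))
print('StandardScaler std: ', standardized.std(axis=0))
print('MinMaxScaler min:   ', rescaled.min(axis=0))
print('MinMaxScaler max:   ', rescaled.max(axis=0))
```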

In [29]:
# Example: scale the load_breast_cancer data with StandardScaler and train on it
# - What is the score?

# Create the model
# Compare: no scaling vs. StandardScaler vs. MinMaxScaler before training
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)


In [30]:
model = SVC()
model.fit(x_train, y_train)

print(f'SVC Train Data Score: {model.score(x_train, y_train)}')
print(f'SVC Test Data Score: {model.score(x_test, y_test)}')

SVC Train Data Score: 0.9014084507042254
SVC Test Data Score: 0.9230769230769231


In [35]:
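scaler = StandardScaler()
# Fit the scaler on the training split only, then reuse it for the test split.
# (The scores recorded below came from a run that refit the scaler on the
# test split, so a rerun may differ slightly.)
X_train = scaler.fit_transform(x_train)
X_test = scaler.transform(x_test)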

model = SVC()
model.fit(X_train, y_train)

print(f'SVC Train Data Score: {model.score(X_train, y_train)}')
print(f'SVC Test Data Score: {model.score(X_test, y_test)}')


SVC Train Data Score: 0.9835680751173709
SVC Test Data Score: 0.986013986013986


In [36]:
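scaler = MinMaxScaler()
# Fit the scaler on the training split only, then reuse it for the test split.
# (The scores recorded below came from a run that refit the scaler on the
# test split, so a rerun may differ slightly.)
X_train = scaler.fit_transform(x_train)
X_test = scaler.transform(x_test)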

model = SVC()
model.fit(X_train, y_train)

print(f'SVC Train Data Score: {model.score(X_train, y_train)}')
print(f'SVC Test Data Score: {model.score(X_test, y_test)}')

SVC Train Data Score: 0.9812206572769953
SVC Test Data Score: 0.9300699300699301
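
Scaling and model fitting can also be bundled so that the scaler is never fit on held-out data. The following sketch combines `make_pipeline` with `cross_validate`; the 5-fold setting is an assumption, and the scores it prints are not directly comparable to the single train/test split used above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

x, y = load_breast_cancer(return_X_y=True)

# Within each fold the scaler is fit on the training part only,
# then applied unchanged to the validation part.
pipe = make_pipeline(StandardScaler(), SVC())
results = cross_validate(pipe, x, y, cv=5, return_train_score=True)

print(f'Mean Train Score: {results["train_score"].mean()}')
print(f'Mean Test Score: {results["test_score"].mean()}')
```

Because the pipeline is a single estimator, the same object can be passed to `GridSearchCV` or `cross_val_score` without any manual scaling steps.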
