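- Computational complexity is quite high, so SVMs need relatively large computing resources
- In return, they produce much better results in most cases
- SVMs can be applied to any form of data
- They are particularly advantageous when the data is very high-dimensional
- Example applications
    - Text classification, where the feature space has as many dimensions as the word vectors
    - Quality control of DNA sequencing by accurately analyzing chromatograms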

## How SVMs work
- Maximum margin classifier
- Support vector classifier
- Support vector machine

### Maximum margin classifier
- Chooses the separating hyperplane so that the gap (margin) between the regions it separates is as large as possible

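#### Hyperplane
- In two dimensions: $ \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} = 0 $
- Regions on either side of the hyperplane
    - $ \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} > 0 $
    - $ \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} < 0 $

- The performance of the classifier depends only on the support vectors.
- Changes to observations that are not support vectors have no effect on the performance of the maximum margin classifier.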
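The two regions and the margin can be illustrated with a small sketch. The hyperplane coefficients and the points below are hypothetical values chosen for illustration, and the helper names are not from any library. For a fixed hyperplane, the margin is the smallest perpendicular distance from any observation to it, and the observations at that distance play the role of support vectors.

```python
import math

def decision_value(beta0, beta, x):
    """Evaluate beta0 + beta1*x1 + ... + betap*xp for one observation."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))

def side(beta0, beta, x):
    """Return +1 or -1 for the side of the hyperplane x falls on (0 if on it)."""
    value = decision_value(beta0, beta, x)
    return (value > 0) - (value < 0)

def distance(beta0, beta, x):
    """Perpendicular distance from x to the hyperplane."""
    norm = math.sqrt(sum(b * b for b in beta))
    return abs(decision_value(beta0, beta, x)) / norm

# Hypothetical hyperplane: -4 + 1*X1 + 1*X2 = 0
beta0, beta = -4.0, (1.0, 1.0)
points = [(1.0, 1.0), (3.0, 3.0), (0.0, 1.0), (5.0, 1.0)]

sides = [side(beta0, beta, p) for p in points]
distances = [distance(beta0, beta, p) for p in points]

# Margin of this particular hyperplane: the closest observation decides it
margin = min(distances)
closest = [p for p, d in zip(points, distances) if abs(d - margin) < 1e-12]

print(sides)
print(round(margin, 4), closest)
```

Moving the point `(0.0, 1.0)` further away from the hyperplane leaves `margin` unchanged, which mirrors the statement that only the support vectors matter.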

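### Support vector classifier
- An extension of the maximum margin classifier
- When the classes cannot be separated perfectly, it tolerates errors within an allowed range and looks for the best fit
- The value of C controls how much error is tolerated.
    - High C: makes the model more flexible (errors are penalized more heavily, so the boundary follows the training data more closely)
    - Low C: makes the model more stable (more errors are tolerated, which gives a wider margin)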
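The role of C can be sketched with the usual soft-margin objective, assuming the convention used by scikit-learn's `SVC`, where C multiplies the total hinge loss. The data, the weights, and the function name below are hypothetical; this is an illustration of the trade-off, not a solver.

```python
def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses (labels in y must be +1 or -1)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    hinge = 0.0
    for xi, yi in zip(X, y):
        f = b + sum(wj * xj for wj, xj in zip(w, xi))
        # Zero when the point is on the correct side and outside the margin
        hinge += max(0.0, 1.0 - yi * f)
    return regularizer + C * hinge

# Hypothetical 2-D data; the last point lies on the wrong side of the boundary
X = [(2.0, 2.0), (-2.0, -2.0), (0.5, 0.0)]
y = [1, -1, -1]
w, b = (1.0, 0.0), 0.0

low_c = soft_margin_objective(w, b, X, y, C=0.1)
high_c = soft_margin_objective(w, b, X, y, C=10.0)
print(low_c, high_c)
```

With a small C the single violation barely changes the objective, so a wide margin is preferred. With a large C the same violation dominates, which pushes the optimizer toward a boundary that fits the training points more closely.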
    
### Support vector machine
- The decision boundary is not linear
- Used when the classes cannot be separated by a support vector classifier
- Uses the kernel trick
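The kernel trick can be illustrated with a degree-2 polynomial kernel on 2-D inputs. This is a minimal sketch with hypothetical function names: the kernel value computed in the original space equals a dot product in an explicit 3-D feature space, so the higher-dimensional features never have to be built.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel without a constant term: (x . z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def explicit_features(x):
    """Explicit feature map for 2-D inputs whose dot product equals poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)

# Computed directly in the original 2-D space
via_kernel = poly2_kernel(x, z)

# Computed by mapping both points to 3-D first
phi_x, phi_z = explicit_features(x), explicit_features(z)
via_features = sum(a * b for a, b in zip(phi_x, phi_z))

print(via_kernel, via_features)
```

Both routes give the same number up to floating-point rounding, but the kernel route only needs one dot product in the original space.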

## Kernel functions
#### Polynomial kernel
#### Radial basis function
#### Gaussian kernel
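The kernels listed in this section can be written out as plain functions. The sketch below assumes the parameterization that scikit-learn documents for `SVC` (`degree`, `gamma`, `coef0`); the function names are hypothetical. The Gaussian kernel is usually written as $ \exp(-\lVert x - z \rVert^{2} / (2\sigma^{2})) $, which is the radial basis function kernel with $ \gamma = 1 / (2\sigma^{2}) $.

```python
import math

def dot(x, z):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=0.0):
    """(gamma * <x, z> + coef0) ** degree"""
    return (gamma * dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2), also known as the Gaussian kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

x, z = (1.0, 2.0), (2.0, 0.0)

k_poly = polynomial_kernel(x, z, degree=2, gamma=1.0, coef0=1.0)  # (2 + 1) ** 2
k_rbf_same = rbf_kernel(x, x, gamma=0.1)  # identical points
k_rbf_far = rbf_kernel(x, z, gamma=0.1)   # squared distance is 5

print(k_poly, k_rbf_same, round(k_rbf_far, 4))
```

A larger `gamma` makes the RBF value fall off faster with distance, so each training point influences a smaller neighborhood.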

## Letter recognition example using an SVM multiclass classifier

In [1]:
# Load the data
import pandas as pd
letterdata = pd.read_csv('./Data/letter_data.csv')

In [2]:
# Check the columns and data types
letterdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 17 columns):
letter    20000 non-null object
xbox      20000 non-null int64
ybox      20000 non-null int64
width     20000 non-null int64
height    20000 non-null int64
onpix     20000 non-null int64
xbar      20000 non-null int64
ybar      20000 non-null int64
x2bar     20000 non-null int64
y2var     20000 non-null int64
xybar     20000 non-null int64
x2ybar    20000 non-null int64
xy2bar    20000 non-null int64
xedge     20000 non-null int64
xedgey    20000 non-null int64
yedge     20000 non-null int64
yedgex    20000 non-null int64
dtypes: int64(16), object(1)
memory usage: 2.6+ MB


In [3]:
# Check the shape of the data
letterdata.shape

(20000, 17)

In [4]:
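# Dependent variable (target): letter, a nominal variable
# Independent variables (features): numeric variables
# In general, the dependent variable of an SVM can be continuous or nominal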
letterdata.head()

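|  | letter | xbox | ybox | width | height | onpix | xbar | ybar | x2bar | y2var | xybar | x2ybar | xy2bar | xedge | xedgey | yedge | yedgex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T | 2 | 8 | 3 | 5 | 1 | 8 | 13 | 0 | 6 | 6 | 10 | 8 | 0 | 8 | 0 | 8 |
| 1 | I | 5 | 12 | 3 | 7 | 2 | 10 | 5 | 5 | 4 | 13 | 3 | 9 | 2 | 8 | 4 | 10 |
| 2 | D | 4 | 11 | 6 | 8 | 6 | 10 | 6 | 2 | 6 | 10 | 3 | 7 | 3 | 7 | 3 | 9 |
| 3 | N | 7 | 11 | 6 | 6 | 3 | 5 | 9 | 4 | 6 | 4 | 4 | 10 | 6 | 10 | 2 | 8 |
| 4 | G | 2 | 1 | 3 | 1 | 1 | 8 | 6 | 6 | 6 | 6 | 5 | 9 | 1 | 7 | 5 | 10 |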


In [5]:
# Split the columns into features and target
x_vars = letterdata.drop('letter', axis = 1)
y_vars = letterdata['letter']

In [12]:
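# Map each letter to an integer (A -> 1, ..., Z -> 26).
# scikit-learn classifiers accept string labels directly, and the models below
# are trained on the original letters, so this mapping is for reference only.
letters = sorted(letterdata['letter'].unique())
letters_mapping = {val: idx + 1 for idx, val in enumerate(letters)}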

In [14]:
# Inspect the dictionary
letters_mapping

{'A': 1,
 'B': 2,
 'C': 3,
 'D': 4,
 'E': 5,
 'F': 6,
 'G': 7,
 'H': 8,
 'I': 9,
 'J': 10,
 'K': 11,
 'L': 12,
 'M': 13,
 'N': 14,
 'O': 15,
 'P': 16,
 'Q': 17,
 'R': 18,
 'S': 19,
 'T': 20,
 'U': 21,
 'V': 22,
 'W': 23,
 'X': 24,
 'Y': 25,
 'Z': 26}

In [15]:
# Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_vars, y_vars, random_state = 43, train_size = 0.7)

In [16]:
# Check the shapes after the split
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((14000, 16), (6000, 16), (14000,), (6000,))

### Maximum margin classifier - linear kernel

In [19]:
# Create and fit the model
from sklearn.svm import SVC
svm_fit = SVC(
    kernel = 'linear',
    C = 1.0,
    random_state = 43,
)
svm_fit.fit(x_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=43,
    shrinking=True, tol=0.001, verbose=False)

In [21]:
# Confusion matrix - training set
from sklearn.metrics import accuracy_score, classification_report
pd.crosstab(y_train, svm_fit.predict(x_train), rownames = ['Actual'], colnames = ['Predicted'])

| Actual / Predicted | A | B | C | D | E | F | G | H | I | J | ... | Q | R | S | T | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 540 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | ... | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 4 | 0 |
| B | 1 | 484 | 0 | 2 | 2 | 0 | 8 | 4 | 0 | 1 | ... | 0 | 19 | 6 | 0 | 0 | 3 | 0 | 3 | 0 | 0 |
| C | 1 | 0 | 488 | 0 | 14 | 1 | 8 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| D | 0 | 20 | 0 | 510 | 0 | 1 | 1 | 6 | 0 | 0 | ... | 0 | 5 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| E | 0 | 3 | 8 | 0 | 447 | 3 | 25 | 1 | 1 | 0 | ... | 4 | 4 | 10 | 5 | 0 | 0 | 0 | 5 | 0 | 4 |
| F | 1 | 2 | 0 | 5 | 6 | 503 | 4 | 2 | 0 | 5 | ... | 0 | 0 | 11 | 9 | 0 | 0 | 0 | 1 | 2 | 0 |
| G | 0 | 3 | 22 | 10 | 5 | 4 | 434 | 3 | 0 | 0 | ... | 11 | 7 | 10 | 0 | 0 | 9 | 2 | 1 | 0 | 0 |
| H | 0 | 13 | 9 | 30 | 0 | 7 | 4 | 368 | 1 | 4 | ... | 7 | 24 | 0 | 2 | 5 | 3 | 0 | 4 | 0 | 0 |
| I | 0 | 2 | 1 | 6 | 1 | 12 | 1 | 0 | 471 | 15 | ... | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 7 | 0 | 4 |
| J | 5 | 0 | 0 | 5 | 0 | 1 | 0 | 3 | 21 | 467 | ... | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 3 | 0 | 6 |


In [25]:
# Accuracy - training set
round(accuracy_score(y_train, svm_fit.predict(x_train)), 3)

0.876
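The accuracy above and the per-class precision and recall in a classification report can all be derived from a confusion matrix. The sketch below uses a hypothetical 3-class matrix (rows are actual classes, columns are predicted classes) rather than the letter data, and the helper names are made up for illustration.

```python
def accuracy(matrix):
    """Share of observations on the diagonal (rows: actual, columns: predicted)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def precision_recall(matrix, k):
    """Precision and recall for the class with index k."""
    true_positive = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column total
    actual_k = sum(matrix[k])                    # row total
    return true_positive / predicted_k, true_positive / actual_k

# Hypothetical 3-class confusion matrix, not taken from the letter data
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]

overall = accuracy(confusion)
precision_0, recall_0 = precision_recall(confusion, 0)
print(round(overall, 3), precision_0, recall_0)
```

Precision divides by the column total (everything predicted as the class), while recall divides by the row total (everything that truly belongs to the class).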

In [27]:
# classification report
print(classification_report(y_train, svm_fit.predict(x_train)))

              precision    recall  f1-score   support

           A       0.93      0.97      0.95       557
           B       0.82      0.90      0.86       537
           C       0.89      0.91      0.90       535
           D       0.82      0.92      0.87       555
           E       0.81      0.84      0.83       530
           F       0.84      0.89      0.86       564
           G       0.76      0.80      0.78       543
           H       0.75      0.71      0.73       516
           I       0.92      0.88      0.90       534
           J       0.89      0.90      0.90       519
           K       0.84      0.84      0.84       551
           L       0.91      0.89      0.90       530
           M       0.93      0.92      0.93       540
           N       0.94      0.93      0.94       552
           O       0.89      0.78      0.83       535
           P       0.96      0.89      0.92       555
           Q       0.88      0.84      0.86       530
           R       0.81    

In [33]:
# confusion matrix - test
pd.crosstab(y_test, svm_fit.predict(x_test), rownames = ['Actual'], colnames = ['Predicted'])

| Actual / Predicted | A | B | C | D | E | F | G | H | I | J | ... | Q | R | S | T | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 218 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 5 | ... | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| B | 1 | 198 | 0 | 5 | 1 | 1 | 0 | 6 | 1 | 0 | ... | 0 | 10 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 0 |
| C | 0 | 0 | 172 | 0 | 5 | 0 | 7 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| D | 1 | 10 | 0 | 223 | 0 | 0 | 0 | 2 | 0 | 1 | ... | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| E | 0 | 1 | 4 | 0 | 209 | 0 | 8 | 0 | 0 | 0 | ... | 3 | 1 | 5 | 3 | 1 | 0 | 0 | 0 | 0 | 1 |
| F | 0 | 0 | 1 | 0 | 2 | 187 | 1 | 1 | 3 | 2 | ... | 0 | 0 | 3 | 7 | 0 | 0 | 0 | 0 | 1 | 0 |
| G | 1 | 1 | 11 | 5 | 2 | 1 | 172 | 1 | 0 | 0 | ... | 13 | 2 | 6 | 0 | 0 | 5 | 1 | 0 | 0 | 0 |
| H | 1 | 3 | 2 | 15 | 0 | 3 | 1 | 138 | 0 | 3 | ... | 3 | 21 | 0 | 0 | 3 | 0 | 0 | 2 | 0 | 0 |
| I | 0 | 0 | 0 | 5 | 0 | 3 | 1 | 0 | 188 | 10 | ... | 1 | 0 | 4 | 0 | 0 | 0 | 0 | 4 | 0 | 3 |
| J | 5 | 1 | 0 | 1 | 0 | 4 | 0 | 3 | 10 | 194 | ... | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 1 | 0 | 2 |


In [35]:
# Accuracy - test set
round(accuracy_score(y_test, svm_fit.predict(x_test)),3)

0.85

In [37]:
# classification report
print(classification_report(y_test, svm_fit.predict(x_test)))

              precision    recall  f1-score   support

           A       0.87      0.94      0.90       232
           B       0.80      0.86      0.83       229
           C       0.86      0.86      0.86       201
           D       0.77      0.89      0.83       250
           E       0.81      0.88      0.84       238
           F       0.81      0.89      0.85       211
           G       0.74      0.75      0.74       230
           H       0.70      0.63      0.67       218
           I       0.89      0.85      0.87       221
           J       0.85      0.85      0.85       228
           K       0.77      0.79      0.78       188
           L       0.93      0.88      0.90       231
           M       0.95      0.93      0.94       252
           N       0.91      0.90      0.91       231
           O       0.86      0.79      0.82       218
           P       0.97      0.83      0.89       248
           Q       0.84      0.77      0.81       253
           R       0.74    

### Polynomial kernel

In [38]:
# Create and fit the model - polynomial kernel
svm_poly_fit = SVC(
    kernel = 'poly',
    C = 1,
    degree = 2  # second-degree polynomial
)
svm_poly_fit.fit(x_train, y_train)



SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=2, gamma='auto_deprecated',
    kernel='poly', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [39]:
# Confusion matrix - training set
pd.crosstab(y_train, svm_poly_fit.predict(x_train), rownames = ['Actual'], colnames = ['Predicted'])

| Actual / Predicted | A | B | C | D | E | F | G | H | I | J | ... | Q | R | S | T | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 557 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 0 | 528 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 5 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| C | 0 | 0 | 530 | 0 | 1 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| D | 0 | 1 | 0 | 550 | 0 | 0 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| E | 0 | 0 | 0 | 0 | 520 | 0 | 6 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| F | 0 | 0 | 0 | 0 | 0 | 560 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 2 | 0 | 2 | 0 | 538 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| H | 0 | 1 | 0 | 9 | 0 | 1 | 1 | 488 | 0 | 0 | ... | 0 | 12 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| I | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 525 | 7 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| J | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 509 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [40]:
# Accuracy - training set
round(accuracy_score(y_train, svm_poly_fit.predict(x_train)),4)

0.9892

In [41]:
# Classification report - training set
print(classification_report(y_train, svm_poly_fit.predict(x_train)))

              precision    recall  f1-score   support

           A       1.00      1.00      1.00       557
           B       0.98      0.98      0.98       537
           C       1.00      0.99      0.99       535
           D       0.97      0.99      0.98       555
           E       0.99      0.98      0.98       530
           F       0.98      0.99      0.98       564
           G       0.98      0.99      0.98       543
           H       0.98      0.95      0.96       516
           I       0.98      0.98      0.98       534
           J       0.99      0.98      0.98       519
           K       0.99      0.98      0.98       551
           L       1.00      0.99      1.00       530
           M       1.00      1.00      1.00       540
           N       1.00      0.99      0.99       552
           O       0.99      1.00      0.99       535
           P       0.99      0.97      0.98       555
           Q       1.00      1.00      1.00       530
           R       0.96    

In [42]:
# Confusion matrix - test set
pd.crosstab(y_test, svm_poly_fit.predict(x_test), rownames = ['Actual'], colnames = ['Predicted'])

| Actual / Predicted | A | B | C | D | E | F | G | H | I | J | ... | Q | R | S | T | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 230 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 0 | 219 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 3 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| C | 0 | 0 | 190 | 0 | 2 | 0 | 4 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| D | 0 | 3 | 0 | 236 | 0 | 0 | 0 | 4 | 0 | 1 | ... | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| E | 0 | 1 | 2 | 0 | 228 | 1 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| F | 0 | 0 | 0 | 0 | 2 | 205 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 1 | 1 | 5 | 2 | 0 | 0 | 215 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| H | 0 | 3 | 1 | 3 | 2 | 0 | 1 | 189 | 0 | 0 | ... | 1 | 9 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| I | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 211 | 4 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| J | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 7 | 214 | ... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |


In [43]:
# Accuracy - test set
round(accuracy_score(y_test, svm_poly_fit.predict(x_test)),4)

0.9537

In [44]:
# Classification report - test set
print(classification_report(y_test, svm_poly_fit.predict(x_test)))

              precision    recall  f1-score   support

           A       0.98      0.99      0.99       232
           B       0.92      0.96      0.94       229
           C       0.92      0.95      0.93       201
           D       0.91      0.94      0.93       250
           E       0.93      0.96      0.95       238
           F       0.92      0.97      0.94       211
           G       0.94      0.93      0.94       230
           H       0.92      0.87      0.89       218
           I       0.96      0.95      0.96       221
           J       0.96      0.94      0.95       228
           K       0.90      0.93      0.91       188
           L       0.98      0.95      0.97       231
           M       0.98      0.98      0.98       252
           N       0.96      0.94      0.95       231
           O       0.94      0.95      0.94       218
           P       0.99      0.95      0.97       248
           Q       0.98      0.93      0.95       253
           R       0.91    

### RBF kernel

In [45]:
# Create and fit the model - RBF kernel
svm_rbf_fit = SVC(
    kernel = 'rbf',
    C = 1,
    gamma = 0.1
)
svm_rbf_fit.fit(x_train,y_train)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [46]:
# Confusion matrix - training set
pd.crosstab(y_train, svm_rbf_fit.predict(x_train), rownames = ['Actual'], colnames = ['Predicted'])

| Actual / Predicted | A | B | C | D | E | F | G | H | I | J | ... | Q | R | S | T | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 557 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 0 | 535 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| C | 0 | 0 | 535 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| D | 0 | 0 | 0 | 553 | 0 | 0 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| E | 0 | 0 | 0 | 0 | 527 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| F | 0 | 0 | 0 | 0 | 0 | 563 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 0 | 1 | 1 | 0 | 541 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| H | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 512 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| I | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 530 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| J | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 517 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [47]:
# Accuracy - training set
round(accuracy_score(y_train, svm_rbf_fit.predict(x_train)),3)

0.998

In [48]:
# classification report
print(classification_report(y_train, svm_rbf_fit.predict(x_train)))

              precision    recall  f1-score   support

           A       1.00      1.00      1.00       557
           B       1.00      1.00      1.00       537
           C       1.00      1.00      1.00       535
           D       1.00      1.00      1.00       555
           E       1.00      0.99      1.00       530
           F       0.99      1.00      0.99       564
           G       0.99      1.00      0.99       543
           H       0.99      0.99      0.99       516
           I       1.00      0.99      0.99       534
           J       0.99      1.00      0.99       519
           K       1.00      1.00      1.00       551
           L       1.00      1.00      1.00       530
           M       1.00      1.00      1.00       540
           N       1.00      1.00      1.00       552
           O       1.00      1.00      1.00       535
           P       1.00      0.99      1.00       555
           Q       1.00      1.00      1.00       530
           R       0.99    

In [49]:
# Confusion matrix - test set
pd.crosstab(y_test, svm_rbf_fit.predict(x_test), rownames = ['Actual'], colnames = ['Predicted'])

| Actual / Predicted | A | B | C | D | E | F | G | H | I | J | ... | Q | R | S | T | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 232 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 0 | 223 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| C | 0 | 0 | 187 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 0 |
| D | 0 | 0 | 0 | 242 | 0 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| E | 0 | 0 | 2 | 0 | 233 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| F | 0 | 0 | 0 | 0 | 0 | 206 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| G | 0 | 1 | 0 | 3 | 1 | 0 | 221 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| H | 0 | 5 | 1 | 3 | 0 | 0 | 2 | 190 | 0 | 0 | ... | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| I | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 210 | 10 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| J | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 4 | 220 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |


In [50]:
# Accuracy - test set
round(accuracy_score(y_test, svm_rbf_fit.predict(x_test)),3)

0.969

In [51]:
# classification report
print(classification_report(y_test, svm_rbf_fit.predict(x_test)))

              precision    recall  f1-score   support

           A       0.99      1.00      0.99       232
           B       0.93      0.97      0.95       229
           C       0.98      0.93      0.96       201
           D       0.95      0.97      0.96       250
           E       0.98      0.98      0.98       238
           F       0.96      0.98      0.97       211
           G       0.97      0.96      0.96       230
           H       0.94      0.87      0.90       218
           I       0.98      0.95      0.97       221
           J       0.95      0.96      0.96       228
           K       0.93      0.94      0.93       188
           L       0.99      0.98      0.98       231
           M       0.96      1.00      0.98       252
           N       0.99      0.95      0.97       231
           O       0.94      0.97      0.95       218
           P       0.99      0.96      0.97       248
           Q       0.99      0.97      0.98       253
           R       0.89    

In [54]:
# Grid search - RBF kernel
# Build the pipeline and the parameter grid
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV

pipeline = Pipeline([
       ('clf', SVC(kernel = 'rbf', C = 1, gamma = 0.1))
])
parameters = {
    'clf__C' : (0.1, 0.3, 1, 3, 10, 30),
    'clf__gamma' : (0.001, 0.01, 0.1, 0.3, 1)
}

In [56]:
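# Create and fit the grid search model
# cv is not set, so the scikit-learn default applies (3-fold in the version used here, 5-fold since 0.22)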
grid_search_rbf = GridSearchCV(pipeline, parameters, n_jobs = -1, verbose = 1, scoring = 'accuracy')
grid_search_rbf.fit(x_train, y_train)

Fitting 3 folds for each of 30 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  4.4min finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('clf',
                                        SVC(C=1, cache_size=200,
                                            class_weight=None, coef0=0.0,
                                            decision_function_shape='ovr',
                                            degree=3, gamma=0.1, kernel='rbf',
                                            max_iter=-1, probability=False,
                                            random_state=None, shrinking=True,
                                            tol=0.001, verbose=False))],
                                verbose=False),
             iid='warn', n_jobs=-1,
             param_grid={'clf__C': (0.1, 0.3, 1, 3, 10, 30),
                         'clf__gamma': (0.001, 0.01, 0.1, 0.3, 1)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=
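The counts in the log line ("30 candidates, totalling 90 fits") follow directly from the size of the grid. The sketch below enumerates the same `parameters` grid with `itertools.product`; `n_folds = 3` is read off the log above rather than from the code, and the variable names other than `parameters` are hypothetical.

```python
from itertools import product

# Same grid as the `parameters` dictionary used for the search
parameters = {
    'clf__C': (0.1, 0.3, 1, 3, 10, 30),
    'clf__gamma': (0.001, 0.01, 0.1, 0.3, 1),
}
n_folds = 3  # taken from the log line "Fitting 3 folds ..."

# Every combination of one C value with one gamma value is a candidate
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

n_candidates = len(candidates)
n_fits = n_candidates * n_folds  # each candidate is fitted once per fold
print(n_candidates, n_fits)
print(candidates[0])
```

Because the number of fits grows multiplicatively with each added parameter, widening the grid quickly increases the run time reported in the log.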

In [57]:
# Grid search - best cross-validated score on the training set
grid_search_rbf.best_score_

0.9631428571428572

In [59]:
# Store the parameters of the best estimator
best_parameteres = grid_search_rbf.best_estimator_.get_params()

In [60]:
# Inspect the stored parameters
best_parameteres

{'memory': None,
 'steps': [('clf', SVC(C=30, cache_size=200, class_weight=None, coef0=0.0,
       decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
       max_iter=-1, probability=False, random_state=None, shrinking=True,
       tol=0.001, verbose=False))],
 'verbose': False,
 'clf': SVC(C=30, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False),
 'clf__C': 30,
 'clf__cache_size': 200,
 'clf__class_weight': None,
 'clf__coef0': 0.0,
 'clf__decision_function_shape': 'ovr',
 'clf__degree': 3,
 'clf__gamma': 0.1,
 'clf__kernel': 'rbf',
 'clf__max_iter': -1,
 'clf__probability': False,
 'clf__random_state': None,
 'clf__shrinking': True,
 'clf__tol': 0.001,
 'clf__verbose': False}

In [67]:
# Print the best value of each tuned parameter
for param_name in sorted(parameters.keys()):
    print(param_name, ':' ,  best_parameteres[param_name])

clf__C : 30
clf__gamma : 0.1


In [68]:
# Predict on the test set
predictions = grid_search_rbf.predict(x_test)

In [69]:
# Accuracy - test set
round(accuracy_score(y_test, predictions),3)

0.971

In [70]:
# Confusion matrix - test set
pd.crosstab(y_test, predictions, rownames = ['Actual'], colnames = ['Predicted'])

| Actual / Predicted | A | B | C | D | E | F | G | H | I | J | ... | Q | R | S | T | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 232 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 0 | 224 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| C | 0 | 0 | 191 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 0 |
| D | 0 | 0 | 0 | 245 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| E | 0 | 0 | 2 | 0 | 231 | 1 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| F | 0 | 0 | 0 | 0 | 0 | 205 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| G | 0 | 1 | 0 | 1 | 0 | 0 | 222 | 2 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| H | 0 | 5 | 0 | 1 | 0 | 0 | 1 | 195 | 0 | 0 | ... | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| I | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 209 | 10 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| J | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 4 | 220 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
