# 2022 Challenge Semester "System Robust Design Using Big Data and Artificial Intelligence": Machine Learning Challenge

#### - Machine Learning Challenge Objective
- Evaluate data-processing and AI model training skills based on the lab exercises


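#### - Carry out the machine learning challenge by following the guide, then submit all files in the results folder (Result) as a single compressed file (.zip)
- A minimal guide is provided for the assignment, but you may write the code differently as long as the results are correct
- <font color=red>Per-spot sensor data</font> collected from robot spot welding (same structure as the practice data, but without the time column)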


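#### - Machine Learning Challenge Tasks
- Extract and select features from the collected sensor data
- Select the best model through K-fold cross-validation
- Train the best model on the full dataset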




### Import required libraries
- Load the required libraries such as numpy and pandas

In [1]:
import pandas as pd
import numpy  as np
import matplotlib.pyplot as plt

import scipy.stats as sp
import pywt

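# [Step 1] Extract time/frequency-domain features from each normal/faulty spot record (5 points)
> #### The SpotWeldingData folder contains 120 normal and 120 faulty records
> #### Each record holds three sensor signals (current, voltage, acceleration) sampled at 12800 Hz for about 0.2167 s
> #### Extract 30 time-domain features and 180 frequency-domain features per record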

# Required!
- Variable name of the extracted feature data (DataFrame): <font color=red>FeatureData</font>
- Feature order: Max, Min, Mean, RMS, Variance, Skewness, Kurtosis, Crest factor, Impulse factor, Shape factor
- Wavelet options: mother wavelet 'haar', 6 levels
- Row order: 30 time-domain features, then 180 frequency-domain features
- Column order: Normal_1, Normal_2, ..., Abnormal_1, Abnormal_2, ...
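
The row order can be written out as an explicit index map, which helps when interpreting a row number later. The following is a minimal sketch in plain Python. It assumes the layout used by the extraction code in this notebook: the sensor order current, voltage, acceleration, 10 features per sensor, and for the wavelet features 6 detail bands per sensor counted from d1. The helper names `time_row`, `freq_row`, and `describe_row` are hypothetical.

```python
FEATURES = ["Max", "Min", "Mean", "RMS", "Variance", "Skewness",
            "Kurtosis", "Crest factor", "Impulse factor", "Shape factor"]
SENSORS = ["Current", "Voltage", "Acceleration"]
N_BANDS = 6  # wavelet detail bands d1..d6 (assumed order: d1 first)


def time_row(sensor, feature):
    """Row index of a time-domain feature (all arguments are 0-based)."""
    return len(FEATURES) * sensor + feature


def freq_row(sensor, band, feature):
    """Row index of a wavelet feature; frequency rows follow the 30 time rows."""
    n_time = len(SENSORS) * len(FEATURES)
    return n_time + len(FEATURES) * (N_BANDS * sensor + band) + feature


def describe_row(row):
    """Translate a 0-based row index back into sensor / domain / feature."""
    n_feat = len(FEATURES)
    n_time = len(SENSORS) * n_feat
    if row < n_time:
        sensor, feature = divmod(row, n_feat)
        return f"{SENSORS[sensor]} / time / {FEATURES[feature]}"
    sensor, rest = divmod(row - n_time, N_BANDS * n_feat)
    band, feature = divmod(rest, n_feat)
    return f"{SENSORS[sensor]} / d{band + 1} / {FEATURES[feature]}"
```

Under this assumed layout, `describe_row(144)` returns `'Voltage / d6 / Variance'`; row 144 is the row with the smallest p-value in the Step 2 output of this notebook.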

### Feature extraction

In [2]:
NoOfData    = 120  # 120 normal and 120 faulty spot-welding records
NoOfSensor  = 3    # Current, Voltage, Acceleration
NoOfFeature = 10   # 10 features (order: Max, Min, Mean, RMS, Variance, Skewness, Kurtosis, Crest factor, Impulse factor, Shape factor)

In [3]:
def rms(x):  # root mean square of a signal
    return np.sqrt(np.mean(x**2))
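
The ten features listed above can also be gathered into a single helper. The sketch below is one possible arrangement, not the notebook's own code: the function name `ten_features` is hypothetical, and skewness and kurtosis are written out with NumPy using the biased, excess-kurtosis formulas that correspond to the defaults of `scipy.stats.skew` and `scipy.stats.kurtosis`. The crest, impulse, and shape factors follow this notebook's formulas, which divide by the signed maximum and mean.

```python
import numpy as np


def ten_features(x):
    """Return [Max, Min, Mean, RMS, Variance, Skewness, Kurtosis,
    Crest factor, Impulse factor, Shape factor] for a 1-D signal."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    root_mean_square = np.sqrt(np.mean(x ** 2))
    centered = x - mean
    var = np.mean(centered ** 2)                      # population variance, same as np.var
    skew = np.mean(centered ** 3) / var ** 1.5        # biased sample skewness
    kurt = np.mean(centered ** 4) / var ** 2 - 3.0    # excess (Fisher) kurtosis
    peak = x.max()
    return np.array([
        peak, x.min(), mean, root_mean_square, var, skew, kurt,
        peak / root_mean_square,    # crest factor (signed peak, as in this notebook)
        peak / mean,                # impulse factor (signed mean, as in this notebook)
        root_mean_square / mean,    # shape factor (signed mean, as in this notebook)
    ])
```

With such a helper, one column of the feature matrix could be filled per sensor with a single slice assignment, for example `TimeFeature_Normal[10*j:10*j+10, i] = ten_features(signal)`, where `signal` stands for one sensor column.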

In [4]:
# Extract time-domain features

# Preallocate the feature arrays
TimeFeature_Normal = np.zeros((NoOfSensor*NoOfFeature , NoOfData))
TimeFeature_Abnormal = np.zeros((NoOfSensor*NoOfFeature , NoOfData))

for i in range(NoOfData):
    
    # Load the data
    temp_path1 = './SpotWeldingData/Normal_%d'%(i+1)   # path to the Normal data file
    temp_path2 = './SpotWeldingData/Abnormal_%d'%(i+1) # path to the Abnormal data file
    temp_data1 = pd.read_csv(temp_path1 , sep=',' , header=None).iloc[:,0:] # temporary Normal data
    temp_data2 = pd.read_csv(temp_path2 , sep=',' , header=None).iloc[:,0:] # temporary Abnormal data
    
    # Extract time-domain features
    # Note: the crest, impulse, and shape factors below use the signed max and mean;
    # textbook definitions usually take absolute values, so a near-zero mean can inflate them.
    for j in range(NoOfSensor):
        
        # Normal Time Domain Feature
        TimeFeature_Normal[10*j+0, i] = np.max(temp_data1.iloc[:,j])
        TimeFeature_Normal[10*j+1, i] = np.min(temp_data1.iloc[:,j])
        TimeFeature_Normal[10*j+2, i] = np.mean(temp_data1.iloc[:,j])
        TimeFeature_Normal[10*j+3, i] = rms(temp_data1.iloc[:,j])
        TimeFeature_Normal[10*j+4, i] = np.var(temp_data1.iloc[:,j])
        TimeFeature_Normal[10*j+5, i] = sp.skew(temp_data1.iloc[:,j])
        TimeFeature_Normal[10*j+6, i] = sp.kurtosis(temp_data1.iloc[:,j])
        TimeFeature_Normal[10*j+7, i] = np.max(temp_data1.iloc[:,j])/rms(temp_data1.iloc[:,j])
        TimeFeature_Normal[10*j+8, i] = np.max(temp_data1.iloc[:,j])/np.mean(temp_data1.iloc[:,j])
        TimeFeature_Normal[10*j+9, i] = rms(temp_data1.iloc[:,j])/np.mean(temp_data1.iloc[:,j])
            
        # Abnormal Time Domain Feature
        TimeFeature_Abnormal[10*j+0, i] = np.max(temp_data2.iloc[:,j])
        TimeFeature_Abnormal[10*j+1, i] = np.min(temp_data2.iloc[:,j])
        TimeFeature_Abnormal[10*j+2, i] = np.mean(temp_data2.iloc[:,j])
        TimeFeature_Abnormal[10*j+3, i] = rms(temp_data2.iloc[:,j])
        TimeFeature_Abnormal[10*j+4, i] = np.var(temp_data2.iloc[:,j])
        TimeFeature_Abnormal[10*j+5, i] = sp.skew(temp_data2.iloc[:,j])
        TimeFeature_Abnormal[10*j+6, i] = sp.kurtosis(temp_data2.iloc[:,j])
        TimeFeature_Abnormal[10*j+7, i] = np.max(temp_data2.iloc[:,j])/rms(temp_data2.iloc[:,j])
        TimeFeature_Abnormal[10*j+8, i] = np.max(temp_data2.iloc[:,j])/np.mean(temp_data2.iloc[:,j])
        TimeFeature_Abnormal[10*j+9, i] = rms(temp_data2.iloc[:,j])/np.mean(temp_data2.iloc[:,j])

TimeFeature = np.concatenate([TimeFeature_Normal, TimeFeature_Abnormal] , axis=1)
TimeFeature.shape

(30, 240)

In [5]:
# Extract frequency-domain features

# Wavelet options
MotherWavelet = pywt.Wavelet('haar')   # mother wavelet
Level   = 6                            # wavelet decomposition level
select  = 6                            # number of detail bands to use, counted from the highest-frequency band (d1, d2, ...)

# Extract frequency-domain features (wavelet transform based)
FreqFeature_Normal   = np.zeros(shape=(NoOfSensor*NoOfFeature*select , NoOfData))
FreqFeature_Abnormal = np.zeros(shape=(NoOfSensor*NoOfFeature*select , NoOfData))

for i in range(NoOfData):
    
    # Load the data
    temp_path1 = './SpotWeldingData/Normal_%d'%(i+1)   # path to the Normal data file
    temp_path2 = './SpotWeldingData/Abnormal_%d'%(i+1) # path to the Abnormal data file
    temp_data1 = np.array(pd.read_csv(temp_path1 , sep=',', header=None).iloc[:,0:]) # temporary Normal data
    temp_data2 = np.array(pd.read_csv(temp_path2 , sep=',', header=None).iloc[:,0:]) # temporary Abnormal data
    Coef1      = pywt.wavedec(temp_data1, MotherWavelet, level=Level, axis=0)
    Coef2      = pywt.wavedec(temp_data2, MotherWavelet, level=Level, axis=0)
    
    # Extract frequency-domain features
    for j in range(NoOfSensor):
        
        for k in np.arange(select):
            coef1 = Coef1[Level-k]
            coef2 = Coef2[Level-k]
            
            # Normal Frequency Domain Feature
            FreqFeature_Normal[NoOfFeature*j*select+k*NoOfFeature+0 , i] = np.max(coef1[:,j])
            FreqFeature_Normal[NoOfFeature*j*select+k*NoOfFeature+1 , i] = np.min(coef1[:,j])
            FreqFeature_Normal[NoOfFeature*j*select+k*NoOfFeature+2 , i] = np.mean(coef1[:,j])
            FreqFeature_Normal[NoOfFeature*j*select+k*NoOfFeature+3 , i] = rms(coef1[:,j])
            FreqFeature_Normal[NoOfFeature*j*select+k*NoOfFeature+4 , i] = np.var(coef1[:,j])
            FreqFeature_Normal[NoOfFeature*j*select+k*NoOfFeature+5 , i] = sp.skew(coef1[:,j])
            FreqFeature_Normal[NoOfFeature*j*select+k*NoOfFeature+6 , i] = sp.kurtosis(coef1[:,j])
            FreqFeature_Normal[NoOfFeature*j*select+k*NoOfFeature+7 , i] = np.max(coef1[:,j])/rms(coef1[:,j])
            FreqFeature_Normal[NoOfFeature*j*select+k*NoOfFeature+8 , i] = np.max(coef1[:,j])/np.mean(coef1[:,j])
            FreqFeature_Normal[NoOfFeature*j*select+k*NoOfFeature+9 , i] = rms(coef1[:,j])/np.mean(coef1[:,j])
            
            # Abnormal Frequency Domain Feature
            FreqFeature_Abnormal[NoOfFeature*j*select+k*NoOfFeature+0 , i] = np.max(coef2[:,j])
            FreqFeature_Abnormal[NoOfFeature*j*select+k*NoOfFeature+1 , i] = np.min(coef2[:,j])
            FreqFeature_Abnormal[NoOfFeature*j*select+k*NoOfFeature+2 , i] = np.mean(coef2[:,j])
            FreqFeature_Abnormal[NoOfFeature*j*select+k*NoOfFeature+3 , i] = rms(coef2[:,j])
            FreqFeature_Abnormal[NoOfFeature*j*select+k*NoOfFeature+4 , i] = np.var(coef2[:,j])
            FreqFeature_Abnormal[NoOfFeature*j*select+k*NoOfFeature+5 , i] = sp.skew(coef2[:,j])
            FreqFeature_Abnormal[NoOfFeature*j*select+k*NoOfFeature+6 , i] = sp.kurtosis(coef2[:,j])
            FreqFeature_Abnormal[NoOfFeature*j*select+k*NoOfFeature+7 , i] = np.max(coef2[:,j])/rms(coef2[:,j])
            FreqFeature_Abnormal[NoOfFeature*j*select+k*NoOfFeature+8 , i] = np.max(coef2[:,j])/np.mean(coef2[:,j])
            FreqFeature_Abnormal[NoOfFeature*j*select+k*NoOfFeature+9 , i] = rms(coef2[:,j])/np.mean(coef2[:,j])

FreqFeature = np.concatenate([FreqFeature_Normal, FreqFeature_Abnormal] , axis=1)
FreqFeature.shape

(180, 240)

In [6]:
# Merge the feature data
Features = np.concatenate([TimeFeature,FreqFeature] , axis=0)

FeatureData = pd.DataFrame(Features)
FeatureData

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,230,231,232,233,234,235,236,237,238,239
0,5.291863,5.234436,5.114397,5.078826,5.152517,5.229064,5.253481,5.234687,5.103656,5.363713,...,5.119402,5.148513,5.254651,5.104079,5.196028,5.317705,5.149645,5.256765,5.451178,5.121302
1,-5.532383,-5.337900,-5.152897,-5.122665,-5.190717,-5.266181,-5.279861,-5.267032,-5.135895,-5.525549,...,-5.203542,-5.460530,-5.520525,-5.142374,-5.226539,-5.584296,-5.183268,-5.477434,-5.731463,-5.165231
2,-0.030931,-0.026517,-0.023940,-0.024068,-0.024566,-0.023105,-0.024522,-0.019830,-0.019514,-0.029797,...,-0.024559,-0.023034,-0.027406,-0.022263,-0.021810,-0.026585,-0.022116,-0.025467,-0.028971,-0.021275
3,2.691801,2.694450,2.688186,2.687191,2.677480,2.691660,2.702448,2.683724,2.682554,2.707523,...,2.675390,2.684913,2.683773,2.664949,2.663127,2.679799,2.680464,2.690217,2.688724,2.671162
4,7.244837,7.259358,7.225770,7.220416,7.168297,7.244498,7.302624,7.201982,7.195715,7.329793,...,7.157107,7.208225,7.201886,7.101456,7.091767,7.180615,7.184396,7.236619,7.228395,7.134652
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205,-0.737752,3.259734,1.664979,1.684975,1.581038,2.781113,3.178408,2.107269,1.625931,1.581395,...,1.303904,1.344954,1.368371,1.440955,1.350556,1.424882,1.482014,1.460398,1.346535,1.499261
206,4.924400,12.826006,1.432969,1.492820,1.313720,9.261458,12.341555,5.466513,1.218102,1.145249,...,0.442792,0.442015,0.505664,0.645331,0.473570,0.561481,0.667883,0.537265,0.396698,0.687633
207,2.244769,5.167951,2.943150,2.962990,2.778244,4.806569,5.129116,4.187465,2.654294,2.695495,...,2.362096,2.228747,2.322788,2.267641,2.267871,2.265560,2.471394,2.227753,2.177953,2.334014
208,-15.607825,39.438692,96.630826,97.180605,87.095434,36.985333,39.867799,366.618120,184.095179,123.429948,...,-23.560170,-27.650099,-35.462131,-30.818613,-18.241424,-28.954627,-41.407167,-37.599780,-25.476130,-55.585940


### Save the Step 1 result to a submission data file (do not modify the code other than the student number)

- Double-check that the variable name follows the guide

In [7]:
StudentNo = 136   # enter your student number

# Do not modify below
Path = './Result/ST%d_MC1'%StudentNo
FeatureData.to_csv(Path, sep=',' , header=None , index=None)

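# [Step 2] Select key features based on the t-test (5 points)

# Required!
- Run a t-test between normal and faulty data for each feature, then select and save the top 30 features in ascending order of p-value
- Variable name of the selected feature data (DataFrame): <font color=red>FeatureSelected</font>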
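
Because every feature is tested with the same two group sizes, the degrees of freedom are identical across features, so sorting by ascending two-sided p-value is equivalent to sorting by descending |t| (apart from ties). The sketch below uses that fact to rank features with NumPy only. It assumes the equal-variance (pooled) Student t-test, which is the default of `scipy.stats.ttest_ind`; the function names `t_statistics` and `rank_features` and the small demo arrays are hypothetical.

```python
import numpy as np


def t_statistics(normal, abnormal):
    """Pooled-variance t statistic per feature row.

    normal, abnormal: arrays of shape (n_features, n_samples).
    """
    normal = np.asarray(normal, dtype=float)
    abnormal = np.asarray(abnormal, dtype=float)
    n1, n2 = normal.shape[1], abnormal.shape[1]
    m1, m2 = normal.mean(axis=1), abnormal.mean(axis=1)
    v1 = normal.var(axis=1, ddof=1)    # unbiased sample variances
    v2 = abnormal.var(axis=1, ddof=1)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / np.sqrt(pooled * (1.0 / n1 + 1.0 / n2))


def rank_features(normal, abnormal, top):
    """Indices of the `top` features with the largest |t| (smallest p-value)."""
    t = t_statistics(normal, abnormal)
    order = np.argsort(-np.abs(t), kind="stable")
    return order[:top]


# Hypothetical demo: feature 1 separates the groups most, feature 0 not at all.
demo_normal = np.array([[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]])
demo_abnormal = np.array([[0, 1, 2, 3], [10, 11, 12, 13], [2, 3, 4, 5]])
demo_rank = rank_features(demo_normal, demo_abnormal, top=2)
```

The selected rows can then be taken in one step with positional indexing, for example `FeatureData.iloc[rank, :]`, instead of copying row by row.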

In [8]:
# Split the feature data
NoOfData = int(FeatureData.shape[1]/2)
Normal_FeatureData   = FeatureData.iloc[:,:NoOfData]
Abnormal_FeatureData = FeatureData.iloc[:,NoOfData:]

In [9]:
NoOfFeature = FeatureData.shape[0] # number of extracted features

P_value = np.zeros((NoOfFeature , 2))

# Run a t-test on each feature
for i in np.arange(NoOfFeature):
    T_test       = np.array(sp.ttest_ind(Normal_FeatureData.iloc[i,:] , Abnormal_FeatureData.iloc[i,:]))
    P_value[i,0] = i          # feature index
    P_value[i,1] = T_test[1]  # p-value
    
P_value      = pd.DataFrame(P_value)
P_value_Rank = P_value.sort_values([1],ascending=True)  # sort by p-value in ascending order

P_value_Rank

Unnamed: 0,0,1
144,144.0,5.931804e-120
143,143.0,5.791478e-119
14,14.0,1.726412e-106
13,13.0,4.168146e-106
124,124.0,1.265018e-89
...,...,...
38,38.0,9.026807e-01
175,175.0,9.101920e-01
162,162.0,9.512485e-01
40,40.0,9.569913e-01


In [10]:
Rank = 30

Normal   = np.zeros((Rank,NoOfData))
Abnormal = np.zeros((Rank,NoOfData))

for i in range(Rank):
    index         = int(P_value_Rank.iloc[i,0])
    Normal[i,:]   = Normal_FeatureData.iloc[index,:].values
    Abnormal[i,:] = Abnormal_FeatureData.iloc[index,:].values

# Combine the normal and faulty feature values
FeatureSelected = pd.DataFrame(np.concatenate([Normal, Abnormal] , axis=1))

### Save the Step 2 result to a submission data file (do not modify the code other than the student number)

- Double-check that the variable name follows the guide

In [11]:
StudentNo = 136   # enter your student number

# Do not modify below
Path = './Result/ST%d_MC2'%StudentNo
FeatureSelected.to_csv(Path, sep=',' , header=None , index=None)

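# [Step 3] Process the selected features into 3-fold cross-validation data and labels (10 points)

# Required!
- Fold 1 validation data: features of normal and faulty records 1 to 40; the rest is the Fold 1 training data
- Fold 2 validation data: features of normal and faulty records 41 to 80; the rest is the Fold 2 training data
- Fold 3 validation data: features of normal and faulty records 81 to 120; the rest is the Fold 3 training data
- Labels (label encoding): normal 0, faulty 1
- Training data variable names: <font color=red>Training_Fold1, Training_Fold2, Training_Fold3</font>
- Validation data variable names: <font color=red>Validation_Fold1, Validation_Fold2, Validation_Fold3</font>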
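
The fold rules above can be checked with a small sketch that works on record names only. It is an illustration in plain Python, not the notebook's implementation: the function name `make_fold` is hypothetical, and it assumes that rows are ordered with all normal records first and all faulty records second, matching the label layout (normal 0, faulty 1).

```python
def make_fold(n_per_class, n_folds, fold):
    """Return (train_names, train_labels, valid_names, valid_labels).

    fold is 0-based: fold 0 validates on records 1..size of each class,
    fold 1 on the next block, and so on.
    """
    size = n_per_class // n_folds
    start, stop = size * fold, size * (fold + 1)
    valid_idx = list(range(start, stop))
    train_idx = [i for i in range(n_per_class) if not start <= i < stop]

    def names(indices):
        # Normal records first, then faulty (Abnormal) records.
        return ([f"Normal_{i + 1}" for i in indices]
                + [f"Abnormal_{i + 1}" for i in indices])

    train_labels = [0] * len(train_idx) + [1] * len(train_idx)
    valid_labels = [0] * len(valid_idx) + [1] * len(valid_idx)
    return names(train_idx), train_labels, names(valid_idx), valid_labels
```

For 120 records per class and 3 folds, `make_fold(120, 3, 1)` yields 80 validation names running from `Normal_41` to `Normal_80` followed by `Abnormal_41` to `Abnormal_80`, and 160 training names.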

In [12]:
NoOfData   = int(FeatureSelected.shape[1]/2)   # number of records (per class, normal/faulty)
Fold       = 3

FeatNo     = int(FeatureSelected.shape[0])  # number of features (= data dimension)
FoldDataNo = int(NoOfData/Fold)            # number of (validation) records per class in one fold

# Reshape the data so that it is easy to split by fold
NormalSet   = np.array(FeatureSelected.iloc[: , :NoOfData])
AbnormalSet = np.array(FeatureSelected.iloc[: , NoOfData:])
FeatureSelected_Reshaped = pd.DataFrame(np.concatenate([NormalSet , AbnormalSet] , axis=0))
FeatureSelected_Reshaped.shape

(60, 120)

Split the data by fold

In [13]:
# Validation Data set
for i in range(Fold):
    
    temp_Valid_Normal   = FeatureSelected_Reshaped.iloc[:FeatNo , FoldDataNo*i : FoldDataNo*(i+1)]
    temp_Valid_Abnormal = FeatureSelected_Reshaped.iloc[FeatNo: , FoldDataNo*i : FoldDataNo*(i+1)]
    temp_Valid = pd.DataFrame(np.transpose(np.concatenate([temp_Valid_Normal, temp_Valid_Abnormal] , axis=1)))
    
    s = 'Validation_Fold%d = temp_Valid'%(i+1)
    exec(s)
    
# Training Data set
for i in range(Fold):
    
    temp_Train_Front = FeatureSelected_Reshaped.iloc[:,:FoldDataNo*i]
    temp_Train_Back  = FeatureSelected_Reshaped.iloc[:,FoldDataNo*(i+1):]
    temp_Train_Total = np.concatenate([temp_Train_Front , temp_Train_Back] , axis=1)
    temp_Train_Final = pd.DataFrame(np.transpose(np.concatenate([temp_Train_Total[:FeatNo,:],temp_Train_Total[FeatNo:,:]] , axis=1)))
    
    s ='Training_Fold%d  = temp_Train_Final'%(i+1)
    exec(s)

In [14]:
# Create the labels
NoOfLabel_Train = int(Training_Fold1.shape[0]/2)
NoOfLabel_Valid = int(Validation_Fold1.shape[0]/2)


## KNN & SVM labels (label encoding) - normal: 0 // faulty: 1
TrainingFold_Label   = np.zeros(2*NoOfLabel_Train , dtype=int)
ValidationFold_Label = np.zeros(2*NoOfLabel_Valid , dtype=int)

# Faulty data (training) label = 1
TrainingFold_Label[NoOfLabel_Train:] = 1

# Faulty data (validation) label = 1
ValidationFold_Label[NoOfLabel_Valid:] = 1

TrainingFold_Label   = pd.Series(TrainingFold_Label)
ValidationFold_Label = pd.Series(ValidationFold_Label)

#TrainingFold_Label
ValidationFold_Label

0     0
1     0
2     0
3     0
4     0
     ..
75    1
76    1
77    1
78    1
79    1
Length: 80, dtype: int32

## Save the k-fold data and labels

Save the k-fold data (Training & Validation)

In [15]:
for i in range(Fold):
    path1 = './K_FoldData/Training_Fold%d'  %(i+1)
    path2 = './K_FoldData/Validation_Fold%d'%(i+1)
    
    c1 = 'Training_Fold%d.to_csv(  path1, sep = ",", header = None, index = None)'%(i+1)
    c2 = 'Validation_Fold%d.to_csv(path2, sep = ",", header = None, index = None)'%(i+1)
    exec(c1)
    exec(c2)

Save the labels (Training & Validation)

In [16]:
TrainingFold_Label.to_csv(  './K_FoldData/TrainingFold_Label', header = None, index = None)
ValidationFold_Label.to_csv('./K_FoldData/ValidationFold_Label', header = None, index = None)

## Save the full dataset and labels for training the final AI model

In [17]:
# Save the full dataset (transposed)
Training_All = np.transpose(FeatureSelected)
Training_All.shape


# Labels for the full dataset, in the form expected by SVM and KNN (label encoding)
Training_All_Label = np.zeros(NoOfData*2)

Training_All_Label[NoOfData:] = 1    # faulty data (training) label = 1
Training_All_Label = pd.Series(Training_All_Label)

Training_All_Label.shape


# Save the full dataset and labels
Training_All.to_csv('./K_FoldData/Training_All', sep = ",", header = None, index = None)
Training_All_Label.to_csv('./K_FoldData/Training_All_Label', sep = ",", header = None, index = None)

### Save the Step 3 result to a submission data file (do not modify the code other than the student number)

- Double-check that the variable names follow the guide

In [18]:
StudentNo = 136   # enter your student number

# Do not modify below
Path1 = './Result/ST%d_MC3_1'%StudentNo
Path2 = './Result/ST%d_MC3_2'%StudentNo

Validation_Fold1.to_csv(Path1, sep=',' , header=None , index=None)
Validation_Fold2.to_csv(Path2, sep=',' , header=None , index=None)


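# [Step 4] 3-fold cross-validation of machine learning (KNN/SVM) models and selection of the best model (15 points)
> #### Train KNN/SVM with different hyperparameters and check their performance through cross-validation
> #### From the model list below, train the model with the best parameters on the full dataset and submit it

# Required!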

##### - KNN model list
- KNN Model 1: n_neighbors = 3, metric = euclidean
- KNN Model 2: n_neighbors = 5, metric = euclidean
- KNN Model 3: n_neighbors = 3, metric = manhattan
- KNN Model 4: n_neighbors = 5, metric = manhattan

##### - SVM model list
- SVM Model 1: kernel = rbf, C = 1
- SVM Model 2: kernel = rbf, C = 10
- SVM Model 3: kernel = linear, C = 1
- SVM Model 4: kernel = linear, C = 10


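#### - Do not pass any other parameters to KNeighborsClassifier or SVC
#### - <font color=red>Train the final model on the full dataset using the parameters with the highest average validation accuracy</font> across the 3-fold cross-validation of the 8 models above
#### - Final model variable name: <font color=red>FinalModel</font>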
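
All eight models are evaluated with the same three-fold loop, so the loop can be expressed once as a function that takes a model factory. The sketch below is a hedged illustration: `mean_cv_accuracy` and `TinyKNN` are hypothetical names, and `TinyKNN` is only a small NumPy stand-in for binary 0/1 labels so that the block runs without scikit-learn. In this notebook the factory could instead be something like `lambda: KNeighborsClassifier(n_neighbors=3, metric='manhattan')`, since the scikit-learn classifiers expose the same `fit` and `score` methods.

```python
import numpy as np


class TinyKNN:
    """Minimal k-nearest-neighbors stand-in for binary labels 0/1 (illustration only)."""

    def __init__(self, n_neighbors=3, metric="euclidean"):
        self.n_neighbors = n_neighbors
        self.metric = metric

    def fit(self, X, y):
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y, dtype=int)
        return self

    def predict(self, X):
        diff = np.asarray(X, dtype=float)[:, None, :] - self.X_[None, :, :]
        if self.metric == "manhattan":
            dist = np.abs(diff).sum(axis=2)
        else:  # euclidean
            dist = np.sqrt((diff ** 2).sum(axis=2))
        nearest = np.argsort(dist, axis=1, kind="stable")[:, :self.n_neighbors]
        votes = self.y_[nearest]                        # shape (n_query, k)
        return (votes.mean(axis=1) > 0.5).astype(int)   # majority vote

    def score(self, X, y):
        return float(np.mean(self.predict(X) == np.asarray(y)))


def mean_cv_accuracy(make_model, train_folds, valid_folds, y_train, y_valid):
    """Fit a fresh model per fold and return (average accuracy, per-fold accuracies)."""
    scores = [make_model().fit(X_train, y_train).score(X_valid, y_valid)
              for X_train, X_valid in zip(train_folds, valid_folds)]
    return sum(scores) / len(scores), scores


# Hypothetical toy data: 6 records per class, 2 features, 3 folds.
normal = np.array([[0.1 * i, 0.1 * i] for i in range(6)])
abnormal = normal + 10.0
train_folds, valid_folds = [], []
for f in range(3):
    keep = np.ones(6, dtype=bool)
    keep[2 * f:2 * f + 2] = False                      # hold out 2 records per class
    train_folds.append(np.vstack([normal[keep], abnormal[keep]]))
    valid_folds.append(np.vstack([normal[~keep], abnormal[~keep]]))
y_train = np.array([0] * 4 + [1] * 4)
y_valid = np.array([0] * 2 + [1] * 2)

avg, per_fold = mean_cv_accuracy(lambda: TinyKNN(3, "manhattan"),
                                 train_folds, valid_folds, y_train, y_valid)
```

Passing a factory rather than a fitted model keeps the folds independent, because each fold trains a fresh instance.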

In [19]:
# Import the KNN/SVM libraries
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm,metrics
import joblib

In [20]:
# Load the k-fold data, the full dataset, and their labels
Fold = 3

# k-fold training/validation data
for i in range(Fold):
    
    path1 = './K_FoldData/Training_Fold%d'%(i+1)
    path2 = './K_FoldData/Validation_Fold%d'%(i+1)
    c1 = 'Training_Fold%d   = np.array(pd.read_csv(path1, sep=",", header=None))'%(i+1)
    c2 = 'Validation_Fold%d = np.array(pd.read_csv(path2, sep=",", header=None))'%(i+1)
    exec(c1)
    exec(c2)

# K-fold training/validation labels
TrainingFold_Label   = np.array(pd.read_csv('./K_FoldData/TrainingFold_Label'  , sep=",", header=None).T.squeeze())
ValidationFold_Label = np.array(pd.read_csv('./K_FoldData/ValidationFold_Label', sep=",", header=None).T.squeeze())
    
    
# Full training dataset
Training_All       = np.array(pd.read_csv('./K_FoldData/Training_All', sep = ",", header = None))
Training_All_Label = np.array(pd.read_csv('./K_FoldData/Training_All_Label', sep = ",", header = None).T.squeeze())

print(Training_Fold1.shape)
print(Validation_Fold1.shape)
print(TrainingFold_Label.shape)
print(ValidationFold_Label.shape)
print(Training_All.shape)

(160, 30)
(80, 30)
(160,)
(80,)
(240, 30)


### KNN Model 1: K-fold cross-validation

In [21]:
Add    = 0
Divide = 0

for i in range(Fold):
    c1 = 'Training_CurrentFold = Training_Fold%d'%(i+1)
    exec(c1)
    c2 = 'Validation_CurrentFold = Validation_Fold%d'%(i+1)
    exec(c2)    

    knnModel_1 = KNeighborsClassifier(n_neighbors = 3, metric = 'euclidean').fit(Training_CurrentFold , TrainingFold_Label)
    
    c3 = 'knnscore_Fold%d = knnModel_1.score(Validation_CurrentFold , ValidationFold_Label)'%(i+1)
    exec(c3)
        
    Add += knnModel_1.score(Validation_CurrentFold, ValidationFold_Label)
    Divide += 1
    
Avg_accuracy = Add/Divide

print('[Result of K-fold Cross Validation] \n')
print(' Fold 1: {:.2f}% \n Fold 2: {:.2f}% \n Fold 3: {:.2f}% \n'.
        format(knnscore_Fold1*100, knnscore_Fold2*100, knnscore_Fold3*100))
print(' Average accuracy: {:.2f}%'.format(Avg_accuracy*100))

[Result of K-fold Cross Validation] 

 Fold 1: 100.00% 
 Fold 2: 100.00% 
 Fold 3: 98.75% 

 Average accuracy: 99.58%


### KNN Model 2: K-fold cross-validation

In [22]:
Add    = 0
Divide = 0

for i in range(Fold):
    c1 = 'Training_CurrentFold = Training_Fold%d'%(i+1)
    exec(c1)
    c2 = 'Validation_CurrentFold = Validation_Fold%d'%(i+1)
    exec(c2)    

    knnModel_2 = KNeighborsClassifier(n_neighbors = 5, metric = 'euclidean').fit(Training_CurrentFold , TrainingFold_Label)
    
    c3 = 'knnscore_Fold%d = knnModel_2.score(Validation_CurrentFold , ValidationFold_Label)'%(i+1)
    exec(c3)
        
    Add += knnModel_2.score(Validation_CurrentFold, ValidationFold_Label)
    Divide += 1
    
Avg_accuracy = Add/Divide

print('[Result of K-fold Cross Validation] \n')
print(' Fold 1: {:.2f}% \n Fold 2: {:.2f}% \n Fold 3: {:.2f}% \n'.
        format(knnscore_Fold1*100, knnscore_Fold2*100, knnscore_Fold3*100))
print(' Average accuracy: {:.2f}%'.format(Avg_accuracy*100))

[Result of K-fold Cross Validation] 

 Fold 1: 98.75% 
 Fold 2: 100.00% 
 Fold 3: 100.00% 

 Average accuracy: 99.58%


### KNN Model 3: K-fold cross-validation

In [23]:
Add    = 0
Divide = 0

for i in range(Fold):
    c1 = 'Training_CurrentFold = Training_Fold%d'%(i+1)
    exec(c1)
    c2 = 'Validation_CurrentFold = Validation_Fold%d'%(i+1)
    exec(c2)    

    knnModel_3 = KNeighborsClassifier(n_neighbors = 3, metric = 'manhattan').fit(Training_CurrentFold , TrainingFold_Label)
    
    c3 = 'knnscore_Fold%d = knnModel_3.score(Validation_CurrentFold , ValidationFold_Label)'%(i+1)
    exec(c3)
        
    Add += knnModel_3.score(Validation_CurrentFold, ValidationFold_Label)
    Divide += 1
    
Avg_accuracy = Add/Divide

print('[Result of K-fold Cross Validation] \n')
print(' Fold 1: {:.2f}% \n Fold 2: {:.2f}% \n Fold 3: {:.2f}% \n'.
        format(knnscore_Fold1*100, knnscore_Fold2*100, knnscore_Fold3*100))
print(' Average accuracy: {:.2f}%'.format(Avg_accuracy*100))

[Result of K-fold Cross Validation] 

 Fold 1: 100.00% 
 Fold 2: 100.00% 
 Fold 3: 100.00% 

 Average accuracy: 100.00%


### KNN Model 4: K-fold cross-validation

In [24]:
Add    = 0
Divide = 0

for i in range(Fold):
    c1 = 'Training_CurrentFold = Training_Fold%d'%(i+1)
    exec(c1)
    c2 = 'Validation_CurrentFold = Validation_Fold%d'%(i+1)
    exec(c2)    

    knnModel_4= KNeighborsClassifier(n_neighbors = 5, metric = 'manhattan').fit(Training_CurrentFold , TrainingFold_Label)
    
    c3 = 'knnscore_Fold%d = knnModel_4.score(Validation_CurrentFold , ValidationFold_Label)'%(i+1)
    exec(c3)
        
    Add += knnModel_4.score(Validation_CurrentFold, ValidationFold_Label)
    Divide += 1
    
Avg_accuracy = Add/Divide

print('[Result of K-fold Cross Validation] \n')
print(' Fold 1: {:.2f}% \n Fold 2: {:.2f}% \n Fold 3: {:.2f}% \n'.
        format(knnscore_Fold1*100, knnscore_Fold2*100, knnscore_Fold3*100))
print(' Average accuracy: {:.2f}%'.format(Avg_accuracy*100))

[Result of K-fold Cross Validation] 

 Fold 1: 98.75% 
 Fold 2: 100.00% 
 Fold 3: 100.00% 

 Average accuracy: 99.58%


### SVM Model 1: K-fold cross-validation

In [25]:
Add    = 0
Divide = 0

for i in range(Fold):
    c1 = 'Training_CurrentFold = Training_Fold%d'%(i+1)
    exec(c1)
    c2 = 'Validation_CurrentFold = Validation_Fold%d'%(i+1)
    exec(c2)    
    
    svmModel_1 = svm.SVC(kernel = 'rbf', C = 1)
    svmModel_1.fit(Training_CurrentFold , TrainingFold_Label)
    Predicted = np.array(svmModel_1.predict(Validation_CurrentFold))
    
    c3 = 'svmscore_Fold%d = metrics.accuracy_score(ValidationFold_Label , Predicted)'%(i+1)
    exec(c3)
        
    Add += metrics.accuracy_score(ValidationFold_Label , Predicted)
    Divide += 1
    
Avg_accuracy = Add/Divide

print('[Result of K-fold Cross Validation] \n')
print(' Fold 1: {:.2f}% \n Fold 2: {:.2f}% \n Fold 3: {:.2f}%'.
        format(svmscore_Fold1*100, svmscore_Fold2*100, svmscore_Fold3*100))
print('\n Average accuracy: {:.2f}%'.format(Avg_accuracy*100))

[Result of K-fold Cross Validation] 

 Fold 1: 98.75% 
 Fold 2: 100.00% 
 Fold 3: 98.75%

 Average accuracy: 99.17%


### SVM Model 2: K-fold cross-validation

In [26]:
Add    = 0
Divide = 0

for i in range(Fold):
    c1 = 'Training_CurrentFold = Training_Fold%d'%(i+1)
    exec(c1)
    c2 = 'Validation_CurrentFold = Validation_Fold%d'%(i+1)
    exec(c2)    
    
    svmModel_2 = svm.SVC(kernel = 'rbf', C = 10)
    svmModel_2.fit(Training_CurrentFold , TrainingFold_Label)
    Predicted = np.array(svmModel_2.predict(Validation_CurrentFold))
    
    c3 = 'svmscore_Fold%d = metrics.accuracy_score(ValidationFold_Label , Predicted)'%(i+1)
    exec(c3)
        
    Add += metrics.accuracy_score(ValidationFold_Label , Predicted)
    Divide += 1
    
Avg_accuracy = Add/Divide

print('[Result of K-fold Cross Validation] \n')
print(' Fold 1: {:.2f}% \n Fold 2: {:.2f}% \n Fold 3: {:.2f}%'.
        format(svmscore_Fold1*100, svmscore_Fold2*100, svmscore_Fold3*100))
print('\n Average accuracy: {:.2f}%'.format(Avg_accuracy*100))

[Result of K-fold Cross Validation] 

 Fold 1: 98.75% 
 Fold 2: 100.00% 
 Fold 3: 98.75%

 Average accuracy: 99.17%


### SVM Model 3: K-fold cross-validation

In [27]:
Add    = 0
Divide = 0

for i in range(Fold):
    c1 = 'Training_CurrentFold = Training_Fold%d'%(i+1)
    exec(c1)
    c2 = 'Validation_CurrentFold = Validation_Fold%d'%(i+1)
    exec(c2)    
    
    svmModel_3 = svm.SVC(kernel = 'linear', C = 1)
    svmModel_3.fit(Training_CurrentFold , TrainingFold_Label)
    Predicted = np.array(svmModel_3.predict(Validation_CurrentFold))
    
    c3 = 'svmscore_Fold%d = metrics.accuracy_score(ValidationFold_Label , Predicted)'%(i+1)
    exec(c3)
        
    Add += metrics.accuracy_score(ValidationFold_Label , Predicted)
    Divide += 1
    
Avg_accuracy = Add/Divide

print('[Result of K-fold Cross Validation] \n')
print(' Fold 1: {:.2f}% \n Fold 2: {:.2f}% \n Fold 3: {:.2f}%'.
        format(svmscore_Fold1*100, svmscore_Fold2*100, svmscore_Fold3*100))
print('\n Average accuracy: {:.2f}%'.format(Avg_accuracy*100))

[Result of K-fold Cross Validation] 

 Fold 1: 100.00% 
 Fold 2: 100.00% 
 Fold 3: 98.75%

 Average accuracy: 99.58%


### SVM Model 4: K-fold cross-validation

In [28]:
Add    = 0
Divide = 0

for i in range(Fold):
    c1 = 'Training_CurrentFold = Training_Fold%d'%(i+1)
    exec(c1)
    c2 = 'Validation_CurrentFold = Validation_Fold%d'%(i+1)
    exec(c2)    
    
    svmModel_4 = svm.SVC(kernel = 'linear', C = 10)
    svmModel_4.fit(Training_CurrentFold , TrainingFold_Label)
    Predicted = np.array(svmModel_4.predict(Validation_CurrentFold))
    
    c3 = 'svmscore_Fold%d = metrics.accuracy_score(ValidationFold_Label , Predicted)'%(i+1)
    exec(c3)
        
    Add += metrics.accuracy_score(ValidationFold_Label , Predicted)
    Divide += 1
    
Avg_accuracy = Add/Divide

print('[Result of K-fold Cross Validation] \n')
print(' Fold 1: {:.2f}% \n Fold 2: {:.2f}% \n Fold 3: {:.2f}%'.
        format(svmscore_Fold1*100, svmscore_Fold2*100, svmscore_Fold3*100))
print('\n Average accuracy: {:.2f}%'.format(Avg_accuracy*100))

[Result of K-fold Cross Validation] 

 Fold 1: 100.00% 
 Fold 2: 100.00% 
 Fold 3: 98.75%

 Average accuracy: 99.58%
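
Collecting the average accuracies printed above in one place makes the model choice explicit. The values in this sketch are copied from the outputs in this notebook; the dictionary name `mean_accuracy` is hypothetical.

```python
# Average 3-fold validation accuracy (%) as printed in the cells above.
mean_accuracy = {
    "KNN Model 1": 99.58,
    "KNN Model 2": 99.58,
    "KNN Model 3": 100.00,
    "KNN Model 4": 99.58,
    "SVM Model 1": 99.17,
    "SVM Model 2": 99.17,
    "SVM Model 3": 99.58,
    "SVM Model 4": 99.58,
}

# Pick the model with the highest average accuracy.
best_name = max(mean_accuracy, key=mean_accuracy.get)
print(best_name, mean_accuracy[best_name])
```

KNN Model 3 (n_neighbors = 3, metric = manhattan) is the only model that reaches 100.00%, which is why it is the one trained on the full dataset.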


### Train on the full dataset

In [29]:
FinalModel = KNeighborsClassifier(n_neighbors = 3, metric = 'manhattan').fit(Training_All , Training_All_Label)

### Save the Step 4 result to a submission data file (do not modify the code other than the student number)

- Double-check that the variable name follows the guide

In [30]:
StudentNo = 136   # enter your student number

# Do not modify below
joblib.dump(FinalModel, './Result/ST%d_MC4.plk'%(StudentNo));


# [Final] Convert this code file to a .py file

### 1. Change the end of this file's name to your student number (e.g., for student number 13: MachineLearning_Challenge_ST-13)
### 2. From the menu bar, select File > Download as > Python (.py) to convert the file to .py
### 3. Move the .py file saved in the Download folder into the Result folder

# ● Submit all files in the results folder (Result) as a single zip file
> #### Archive name: ST(student number)_MC.zip (examples: one digit 'ST0_MC', two digits 'ST00_MC', three digits 'ST000_MC')
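
The archive can also be built from Python instead of by hand. This is a minimal sketch using the standard library; the function name `zip_results` and its default paths are hypothetical, and it assumes the files should sit at the top level of the archive rather than inside a `Result` subfolder.

```python
import shutil
from pathlib import Path


def zip_results(student_no, result_dir="./Result", out_dir="."):
    """Create ST<student_no>_MC.zip from the files in result_dir.

    Returns the path of the archive that was written.
    """
    base_name = Path(out_dir) / f"ST{student_no}_MC"   # make_archive appends ".zip"
    return shutil.make_archive(str(base_name), "zip", root_dir=result_dir)


# Example call (commented out so that nothing is written on import):
# zip_results(136)
```

Writing the archive outside the `Result` folder, as the default `out_dir` does, avoids including the zip file in itself.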