# Sklearn-Self-Training-Solution

- Teaching assistants: 배진수 (wlstn215@korea.ac.kr), 안시후 (sihuahn@korea.ac.kr), 김현지 (99ktxx@korea.ac.kr)

## 0. Importing Modules

In [1]:
# Basic and visualization modules
from IPython.display import display
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Data preprocessing modules
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Model training modules
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Evaluation modules
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings("ignore")

# For trainees working with GitHub + Colab
# !git clone https://github.com/bogus215/LG-EDUCATION3.git

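## 1. Dataset: Marketing Campaign Dataset (binary classification)

### Task abstract: predict from a supermarket customer's information whether the customer has dependents (children/teenagers) at home

### Explanatory variables (X): personal information and supermarket purchase information

- Year_Birth (year of birth)
- MntWines (customer's annual spending on wine)
- MntFruits (customer's annual spending on fruit)
- MntMeatProducts (customer's annual spending on meat)
- MntSweetProducts (customer's annual spending on sweets)

### Response variable (Y): whether the customer has dependents

- Dependents_Flag: 0 (no dependents), 1 (has dependents)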

In [2]:
data = pd.read_excel('./data/marketing_campaign.xlsx')

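## Train and evaluate a semi-supervised learning model that satisfies the following conditions
- Drop the variable 'ID' during preprocessing
- Create a column "Dependents_Flag" that indicates whether the customer has dependents, counting both children (Kidhome) and teenagers (Teenhome)
- Split the data into training and test sets at a 0.75 : 0.25 ratio
- Set the proportion of labeled data within the training set to 0.01
- Train an SVM (Support Vector Machine) classifier with supervised learning and check its performance
- Set the base_estimator of SelfTrainingClassifier to an SVM (Support Vector Machine) classifier, train it in a semi-supervised manner, and check its performance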
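The conditions above can be sketched end to end on synthetic data. This is only an illustrative sketch, not the solution itself: `make_classification` stands in for the marketing data, the labeled 1% is drawn with stratification so that both classes are guaranteed to appear (an assumption that the exercise does not require), and the base estimator is passed positionally because its keyword name may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the marketing data (assumption: not the real dataset).
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Keep labels for 1% of the training rows (stratified here so both classes appear).
labeled_idx, _ = train_test_split(np.arange(len(y_train)), train_size=0.01,
                                  stratify=y_train, random_state=0)

# scikit-learn's semi-supervised estimators treat the label -1 as "unlabeled".
y_semi = np.full_like(y_train, -1)
y_semi[labeled_idx] = y_train[labeled_idx]

# Supervised baseline: fit on the labeled 1% only.
baseline = SVC(probability=True, random_state=0)
baseline.fit(X_train[labeled_idx], y_train[labeled_idx])

# Self-training: the same kind of estimator, but fit on labeled and unlabeled rows.
self_training = SelfTrainingClassifier(SVC(probability=True, random_state=0),
                                       threshold=0.75, max_iter=10)
self_training.fit(X_train, y_semi)

print('baseline accuracy     :', baseline.score(X_test, y_test))
print('self-training accuracy:', self_training.score(X_test, y_test))
```

Whether self-training beats the baseline depends on the data and on the random split, so the two printed scores should be read as a comparison for this particular run only.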

In [3]:
data = data.drop(['ID'], axis=1)
data.head()

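| | Year_Birth | MntWines | MntFruits | MntMeatProducts | MntSweetProducts | Kidhome | Teenhome |
|---|---|---|---|---|---|---|---|
| 0 | 1957 | 635 | 88 | 546 | 88 | 0 | 0 |
| 1 | 1954 | 11 | 1 | 6 | 1 | 1 | 1 |
| 2 | 1965 | 426 | 49 | 127 | 21 | 0 | 0 |
| 3 | 1984 | 11 | 4 | 20 | 3 | 1 | 0 |
| 4 | 1981 | 173 | 43 | 118 | 27 | 1 | 0 |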


In [4]:
# Create a flag to denote whether the person has any dependants at home (either kids or teens)
data['Dependents_Flag'] = ((data['Kidhome'] + data['Teenhome']) > 0).astype(int)
data.head()

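| | Year_Birth | MntWines | MntFruits | MntMeatProducts | MntSweetProducts | Kidhome | Teenhome | Dependents_Flag |
|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | 635 | 88 | 546 | 88 | 0 | 0 | 0 |
| 1 | 1954 | 11 | 1 | 6 | 1 | 1 | 1 | 1 |
| 2 | 1965 | 426 | 49 | 127 | 21 | 0 | 0 | 0 |
| 3 | 1984 | 11 | 4 | 20 | 3 | 1 | 0 | 1 |
| 4 | 1981 | 173 | 43 | 118 | 27 | 1 | 0 | 1 |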


In [5]:
data.isnull().sum()

Year_Birth          0
MntWines            0
MntFruits           0
MntMeatProducts     0
MntSweetProducts    0
Kidhome             0
Teenhome            0
Dependents_Flag     0
dtype: int64

In [6]:
df_train, df_test = train_test_split(data, test_size = 0.25, random_state =0)
print('Size of train dataframe: ', df_train.shape[0])
print('Size of test dataframe: ', df_test.shape[0])

Size of train dataframe:  1680
Size of test dataframe:  560


In [7]:
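# Random_Mask is True for rows whose label is hidden (treated as unlabeled)
df_train['Random_Mask'] = True
# Keep the true label for a random 1% of the training rows
df_train.loc[df_train.sample(frac=0.01, random_state=0).index, 'Random_Mask'] = False
# Unlabeled rows get the label -1, the marker SelfTrainingClassifier expects
df_train['New_Target'] = np.where(df_train['Random_Mask'], -1, df_train['Dependents_Flag'])
df_train['New_Target'].value_counts()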

-1    1663
 1      12
 0       5
Name: New_Target, dtype: int64
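The counts above follow from how `DataFrame.sample(frac=...)` sizes its draw: it selects `round(frac * len(df))` rows. A small check on a placeholder frame of the same length (the `toy` frame and its `row` column are hypothetical, used only for counting):

```python
import pandas as pd

n_train = 1680  # training-set size printed by the split cell
toy = pd.DataFrame({'row': range(n_train)})  # placeholder frame, not the real data

# sample(frac=0.01) draws round(0.01 * 1680) = round(16.8) = 17 rows.
labeled = toy.sample(frac=0.01, random_state=0)
n_labeled = len(labeled)
n_unlabeled = n_train - n_labeled

print(n_labeled, n_unlabeled)  # 17 1663
```

The 17 labeled rows match the 12 + 5 labeled samples in the output, and the remaining 1663 rows carry the label -1.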

In [8]:
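# Labeled subset of the training data (rows whose label was not masked)
df_train_labeled = df_train[df_train['New_Target']!=-1]

# Note: Kidhome and Teenhome stay in the feature matrix (7 features in total),
# even though Dependents_Flag is derived from them.
X_baseline = df_train_labeled.drop(['Dependents_Flag', 'Random_Mask', 'New_Target'], axis = 1)
y_baseline = df_train_labeled['New_Target'].values

# The test set keeps its true labels for evaluation
X_test = df_test.drop(['Dependents_Flag'], axis = 1)
y_test = df_test['Dependents_Flag'].values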

In [9]:
model = SVC(kernel='rbf', 
            probability=True, 
            C=1.0, # default = 1.0
            gamma='scale', # default = 'scale'
            random_state=0
           )

clf = model.fit(X_baseline, y_baseline)

print('---------- SVC Baseline Model - Evaluation on Test Data ----------')
accuracy_score_B = clf.score(X_test, y_test)
print('Accuracy Score: ', accuracy_score_B)
print(classification_report(y_test, clf.predict(X_test)))

---------- SVC Baseline Model - Evaluation on Test Data ----------
Accuracy Score:  0.7303571428571428
              precision    recall  f1-score   support

           0       1.00      0.02      0.04       154
           1       0.73      1.00      0.84       406

    accuracy                           0.73       560
   macro avg       0.86      0.51      0.44       560
weighted avg       0.80      0.73      0.62       560
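To read the report above, it can help to reconstruct the confusion matrix behind it. The counts below are inferred from the reported accuracy, precision, recall, and support values; they are an inference, not notebook output.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Inferred counts: 3 of the 154 class-0 samples are predicted as 0,
# and all 406 class-1 samples are predicted as 1.
y_true = np.array([0] * 154 + [1] * 406)
y_pred = np.array([0] * 3 + [1] * 151 + [1] * 406)

cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
recall_0 = recall_score(y_true, y_pred, pos_label=0)
precision_1 = precision_score(y_true, y_pred, pos_label=1)

print(cm)                      # rows: true class, columns: predicted class
print(round(acc, 4))           # 0.7304  (409 / 560)
print(round(recall_0, 2))      # 0.02    (3 / 154)
print(round(precision_1, 2))   # 0.73    (406 / 557)
```

Always predicting class 1 would already score 406 / 560 ≈ 0.725, so with only 17 labeled samples the baseline is barely better than a majority-class predictor.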



In [10]:
X_train = df_train.drop(['Dependents_Flag', 'Random_Mask', 'New_Target'], axis = 1)
y_train = df_train['New_Target'].values

In [11]:
model_svc = SVC(kernel='rbf',
                probability=True,
                C=1.0, # default = 1.0
                gamma='scale', # default = 'scale'
                random_state=0)

self_training_model = SelfTrainingClassifier(base_estimator = model_svc,
                                             threshold = 0.75,
                                             criterion = 'threshold',
                                             max_iter = 10,
                                             verbose = True
                                             )

clf_ST = self_training_model.fit(X_train, y_train)

print('')
print('---------- Self Training Model - Summary ----------')
print('Base Estimator: ', clf_ST.base_estimator_)
print('Dependents_Flag: ', clf_ST.classes_)
print('Transduction Labels: ', clf_ST.transduction_)

#print('Iteration When Sample Was Labeled: ', clf_ST.labeled_iter_)
print('Number of Features: ', clf_ST.n_features_in_)
print('Number of Iterations: ', clf_ST.n_iter_)
print('Termination Condition: ', clf_ST.termination_condition_)
print('')

print('---------- Self Training Model - Evaluation on Test Data ----------')
accuracy_score_ST = clf_ST.score(X_test, y_test)
print('Accuracy Score: ', accuracy_score_ST)
print(classification_report(y_test, clf_ST.predict(X_test)))

End of iteration 1, added 1236 new labels.
End of iteration 2, added 320 new labels.
End of iteration 3, added 49 new labels.
End of iteration 4, added 15 new labels.
End of iteration 5, added 8 new labels.
End of iteration 6, added 2 new labels.
End of iteration 7, added 2 new labels.

---------- Self Training Model - Summary ----------
Base Estimator:  SVC(probability=True, random_state=0)
Dependents_Flag:  [0 1]
Transduction Labels:  [0 1 1 ... 1 1 0]
Number of Features:  7
Number of Iterations:  8
Termination Condition:  no_change

---------- Self Training Model - Evaluation on Test Data ----------
Accuracy Score:  0.8107142857142857
              precision    recall  f1-score   support

           0       0.82      0.40      0.54       154
           1       0.81      0.97      0.88       406

    accuracy                           0.81       560
   macro avg       0.82      0.68      0.71       560
weighted avg       0.81      0.81      0.79       560
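The verbose log above also shows how much of the training set ended up pseudo-labeled. A short tally, using only numbers printed earlier in this notebook:

```python
# Labels added per iteration, copied from the verbose log
added_per_iteration = [1236, 320, 49, 15, 8, 2, 2]
n_train = 1680              # size of the training dataframe
n_initial_labeled = 12 + 5  # labeled samples of class 1 and class 0

n_pseudo = sum(added_per_iteration)
n_labeled_final = n_initial_labeled + n_pseudo
n_still_unlabeled = n_train - n_labeled_final

print(n_pseudo, n_labeled_final, n_still_unlabeled)  # 1632 1649 31
```

So 31 training samples never reached the 0.75 confidence threshold. The eighth iteration added no new labels, which is why the loop stopped with `no_change` before reaching `max_iter = 10`. In a fitted model, those samples should keep the value -1 in `transduction_`, and `labeled_iter_` records the iteration in which each sample received its label.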

