### RAI LAB

The code was written in a Colab environment. 

The overall process is as follows. 


---


1.   Data preprocessing
  
  * Class encoding  

  * Filling missing values

  * Dropping data

      -   Among the rows with label 0,
      -   drop all rows that do not belong to the top 5 IRIs
      -   drop part of the rows that belong to the top 2 IRIs

  * Resolving class imbalance with SMOTE 

---


2. Feature set (27 features)

  * Use the 4 features **method, &nbsp; group, &nbsp; type, &nbsp; state &nbsp;**

  * Cluster the 4 features above into 600 classes and add the cluster ID as a feature (**Cluster1**)

  * Cluster the 7 features below, computed with the peptides package, into 600 classes and add the cluster ID as a feature 
  (**Cluster2**)

    -   hydrophobicity
    -   molecular_weight
    -   aliphatic_index
    -   instability_index
    -   isoelectric_point
    -   boman
    -   charge

  * Use the 20 **AAC** (Amino Acid Composition) features (propy)

  * Add the **length** of each epitope sequence as a feature

  * **20 AAC + 4 feature + 2 cluster feature + 1 len feature** 


---


3. Modeling 

  * Train 5 models: Random Forest, &nbsp; Gradient Boosting, &nbsp; 
  XGB, &nbsp; LightGBM, &nbsp; CatBoost

  * Soft Voting Ensemble that averages the predicted probabilities of the classifiers




### Import

In [1]:
! pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.0.6-cp37-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.0.6


In [2]:
# peptides package for extracting peptide features

! pip install peptides

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting peptides
  Downloading peptides-0.2.1-py3-none-any.whl (102 kB)
Installing collected packages: peptides
Successfully installed peptides-0.2.1


In [3]:
# propy3 package for extracting AAC features

! pip install propy3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting propy3
  Downloading propy3-1.1.1-py3-none-any.whl (290 kB)
Installing collected packages: propy3
Successfully installed propy3-1.1.1


In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import torch
import peptides
import joblib
import propy

from propy import PyPro
from tqdm import tqdm
from google.colab import files
from google.colab import drive
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, precision_score, recall_score, balanced_accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold

from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier


### Data loading

In [5]:
# Mount Google Drive 
drive.mount("/content/drive")

Mounted at /content/drive


In [6]:
# Load the training data 
data = pd.read_csv("/content/drive/MyDrive/첼린지/train.csv")

# Submission file
sub = pd.read_csv("/content/drive/MyDrive/첼린지/sample_submission.csv")

# Test data
test_data = pd.read_csv("/content/drive/MyDrive/첼린지/test.csv")

### PEP feature extraction

In [None]:
# Extract features from the training data 

train_pep = np.zeros((len(data), 7))

for k, seq in enumerate(data["epitope_seq"]):
  peptide = peptides.Peptide(seq)
  train_pep[k][0] = peptide.hydrophobicity()
  train_pep[k][1] = peptide.molecular_weight()
  train_pep[k][2] = peptide.aliphatic_index()
  train_pep[k][3] = peptide.instability_index()
  train_pep[k][4] = peptide.isoelectric_point()
  train_pep[k][5] = peptide.boman()
  train_pep[k][6] = peptide.charge()

train_pep

In [None]:
np.save("/content/drive/MyDrive/첼린지/train_pep.npy", train_pep)

In [None]:
# Extract features from the test data 

test_pep = np.zeros((len(test_data), 7))

for k, seq in enumerate(test_data["epitope_seq"]):
  peptide = peptides.Peptide(seq)
  test_pep[k][0] = peptide.hydrophobicity()
  test_pep[k][1] = peptide.molecular_weight()
  test_pep[k][2] = peptide.aliphatic_index()
  test_pep[k][3] = peptide.instability_index()
  test_pep[k][4] = peptide.isoelectric_point()
  test_pep[k][5] = peptide.boman()
  test_pep[k][6] = peptide.charge()

test_pep

In [None]:
np.save("/content/drive/MyDrive/첼린지/test_pep.npy", test_pep)

### AAC feature extraction

In [None]:
# Extract AAC features with the PyPro module

new_feature = np.zeros((len(data), 20))
for i, seq in enumerate(tqdm(data["epitope_seq"])):
  DesObject = PyPro.GetProDes(seq)
  feature = np.array(list(DesObject.GetAAComp().values())).reshape(1,-1)
  new_feature[i] = feature

new_feature.shape
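
For intuition, amino acid composition is just the share of each of the 20 standard residues in a sequence. The following is a minimal, dependency-free sketch of that idea; `amino_acid_composition` is a hypothetical helper, not part of propy, and the assumption that the values are percentages (and the absence of rounding) may differ from what `GetAAComp()` returns.

```python
# Hypothetical helper: pure-Python amino acid composition (AAC).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Return the percentage of each standard residue in seq."""
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("sequence must not be empty")
    return {aa: 100.0 * seq.count(aa) / length for aa in AMINO_ACIDS}


aac = amino_acid_composition("AACD")
print(aac["A"], aac["C"], aac["W"])
```

Each sequence therefore maps to a fixed-length vector of 20 values regardless of its length, which is why the length is added as a separate feature.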

In [None]:
np.save("/content/drive/MyDrive/첼린지/new_feature_epitope.npy", new_feature)

In [None]:
new_feature_test = np.zeros((len(test_data), 20))
for i, seq in enumerate(tqdm(test_data["epitope_seq"])):
  DesObject = PyPro.GetProDes(seq)
  feature = np.array(list(DesObject.GetAAComp().values())).reshape(1,-1)
  new_feature_test[i] = feature

new_feature_test.shape

In [None]:
np.save("/content/drive/MyDrive/첼린지/new_feature_epitope_test.npy", new_feature_test)

### Preprocessing

In [None]:
# Label distribution

data["label"].value_counts()

In [None]:
# Missing-value check 1

data["assay_method_technique"].isna().value_counts()

In [None]:
# Missing-value check 2

data["assay_group"].isna().value_counts()

In [None]:
# Missing-value check 3

data["disease_type"].isna().value_counts()

In [None]:
# Missing-value check 4

data["disease_state"].isna().value_counts()

In [None]:
# Missing-value check 5

data["qualitative_label"].isna().value_counts()

In [None]:
# Missing-value check 6

data["reference_journal"].isna().value_counts()

In [None]:
# Missing-value check 7

data["reference_title"].isna().value_counts()

In [None]:
# Missing-value check 8

data["reference_IRI"].isna().value_counts()

In [12]:
# Fill missing disease_state values 

"""
Missing values in the disease_state feature need preprocessing.
For each disease_type, find the disease_state that is unique or most frequent and use it as the fill value.
e.g. healthy
"""

nan_disease_states = list(data[ data["disease_state"].isnull() ]["disease_type"].value_counts().keys())

data.loc[data["disease_type"] == nan_disease_states[0], "disease_state"] = data[data["disease_type"] == nan_disease_states[0]]["disease_state"].fillna("healthy")
data.loc[data["disease_type"] == nan_disease_states[1], "disease_state"] = data[data["disease_type"] == nan_disease_states[1]]["disease_state"].fillna("healthy")
data.loc[data["disease_type"] == nan_disease_states[2], "disease_state"] = data[data["disease_type"] == nan_disease_states[2]]["disease_state"].fillna("transplant-related disease and allo-reactivity")
data.loc[data["disease_type"] == nan_disease_states[3], "disease_state"] = data[data["disease_type"] == nan_disease_states[3]]["disease_state"].fillna("healthy")
data.loc[data["disease_type"] == nan_disease_states[4], "disease_state"] = data[data["disease_type"] == nan_disease_states[4]]["disease_state"].fillna("healthy")
data.loc[data["disease_type"] == nan_disease_states[5], "disease_state"] = data[data["disease_type"] == nan_disease_states[5]]["disease_state"].fillna("Chagas disease")
data.loc[data["disease_type"] == nan_disease_states[6], "disease_state"] = data[data["disease_type"] == nan_disease_states[6]]["disease_state"].fillna("healthy")
data.loc[data["disease_type"] == nan_disease_states[7], "disease_state"] = data[data["disease_type"] == nan_disease_states[7]]["disease_state"].fillna("healthy")
data.loc[data["disease_type"] == nan_disease_states[8], "disease_state"] = data[data["disease_type"] == nan_disease_states[8]]["disease_state"].fillna("healthy")
data.loc[data["disease_type"] == nan_disease_states[9], "disease_state"] = data[data["disease_type"] == nan_disease_states[9]]["disease_state"].fillna("healthy")
data.loc[data["disease_type"] == nan_disease_states[10], "disease_state"] = data[data["disease_type"] == nan_disease_states[10]]["disease_state"].fillna("celiac disease")

data["disease_state"].isna().value_counts()

False    190811
Name: disease_state, dtype: int64
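
The fill values above were chosen by hand per `disease_type`. As an illustration of the stated rule (use the most frequent `disease_state` within each `disease_type`), here is a minimal pure-Python sketch on made-up rows; the toy data and the variable names `rows`, `mode_by_type`, and `filled` are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Toy rows (hypothetical values): (disease_type, disease_state or None).
rows = [
    ("exposure", "healthy"),
    ("exposure", "healthy"),
    ("exposure", None),
    ("infection", "Chagas disease"),
    ("infection", None),
]

# Count the observed states per type, ignoring missing entries.
counts = defaultdict(Counter)
for disease_type, state in rows:
    if state is not None:
        counts[disease_type][state] += 1

# Most frequent observed state for each type.
mode_by_type = {t: c.most_common(1)[0][0] for t, c in counts.items()}

# Replace only the missing entries; observed values stay untouched.
filled = [
    (t, s if s is not None else mode_by_type.get(t))
    for t, s in rows
]
print(filled)
```

A type with no observed state at all stays missing in this sketch, which is one reason a hand-picked default such as "healthy" may still be needed.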

In [21]:
data["label"].value_counts()

0    173959
1     16852
Name: label, dtype: int64

In [None]:
# Add the epitope sequence length as a feature

data["len feature"] = data["epitope_seq"].map(len)

In [None]:
# Load the AAC features

new_feature = np.load("/content/drive/MyDrive/첼린지/new_feature_epitope.npy")[:,-20:]
new_feature_test = np.load("/content/drive/MyDrive/첼린지/new_feature_epitope_test.npy")[:,-20:]

print(new_feature.shape)
print(new_feature_test.shape)

(190811, 20)
(120944, 20)


In [None]:
# Load the 7 peptide features

new_feature2 = np.load("/content/drive/MyDrive/첼린지/train_pep.npy")
new_feature_test2 = np.load("/content/drive/MyDrive/첼린지/test_pep.npy")

print(new_feature2.shape)
print(new_feature_test2.shape)

(190811, 7)
(120944, 7)


In [None]:
"""
Among the label-0 data, keep only the top 5 IRIs
"""

label0_data = data[ data["label"] == 0 ]  
iris = label0_data["reference_IRI"].value_counts()

# (IRI, count) pairs outside and inside the 5 most frequent IRIs
notin_iri = list(iris[5:].items())
in_iri = list(iris[:5].items())

len(notin_iri), len(in_iri)

(687, 5)

In [None]:
np.random.seed(3)

delete_ind = []
label1_data = data[ data["label"] == 1 ]  

# Mark every row outside the top 5 IRIs for deletion
for iri, _ in notin_iri:
  ind = list(label0_data[label0_data["reference_IRI"] == iri].index)
  delete_ind = delete_ind + ind

# Mark a random subset of the rows in the top 2 IRIs for deletion
for i, (iri, num) in enumerate(in_iri[:2]):
  ind = list(label0_data[label0_data["reference_IRI"] == iri].index)
  if i == 0:
    ind_half = list(np.random.choice(ind, 88500, replace=False))
  else:
    ind_half = list(np.random.choice(ind, int(num/2) - 6000, replace=False))
  delete_ind = delete_ind + ind_half

len(delete_ind)
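
The drop logic above amounts to group-wise undersampling of the majority class. A small self-contained sketch of the same two steps follows; the IRIs, group sizes, and the names `top_k` and `n_drop_first` are made up for illustration and do not reflect the real data.

```python
import random
from collections import Counter

# Toy label-0 rows: (row_index, reference_IRI). The IRIs are made up.
rows = [(i, "iri_a") for i in range(10)]
rows += [(i, "iri_b") for i in range(10, 16)]
rows += [(i, "iri_c") for i in range(16, 18)]

top_k = 2          # keep only the most frequent IRIs
n_drop_first = 4   # extra rows to drop from the largest IRI

freq = Counter(iri for _, iri in rows)
kept_iris = [iri for iri, _ in freq.most_common(top_k)]

# 1) Drop every row whose IRI is outside the top-k.
delete_ind = {i for i, iri in rows if iri not in kept_iris}

# 2) Randomly thin out the largest IRI with a fixed seed.
rng = random.Random(3)
largest = [i for i, iri in rows if iri == kept_iris[0]]
delete_ind |= set(rng.sample(largest, n_drop_first))

go_ind = [i for i, _ in rows if i not in delete_ind]
print(len(delete_ind), len(go_ind))
```

Sampling without replacement and fixing the seed keeps the set of dropped rows reproducible between runs.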

In [None]:
# Features used for training
data = data[["assay_method_technique", "assay_group",
             "disease_type", "disease_state", "len feature", "label"]]

data_ = data.copy()

In [None]:
# Re-check for missing values

data.isna().sum()

assay_method_technique    0
assay_group               0
disease_type              0
disease_state             0
len feature               0
label                     0
dtype: int64

In [None]:
# Class encoding

col_name1 = "assay_method_technique"
col_name2 = "assay_group"
col_name3 = "disease_type"
col_name4 = "disease_state"

classes1 = list(data[col_name1].value_counts().keys())
classes2 = list(data[col_name2].value_counts().keys())
classes3 = list(data[col_name3].value_counts().keys())
classes4 = list(data[col_name4].value_counts().keys())

for i, clas in enumerate(classes1):
  data.loc[ data[col_name1] == clas, col_name1] = i

for i, clas in enumerate(classes2):
  data.loc[ data[col_name2] == clas, col_name2] = i

for i, clas in enumerate(classes3):
  data.loc[ data[col_name3] == clas, col_name3] = i

for i, clas in enumerate(classes4):
  data.loc[ data[col_name4] == clas, col_name4] = i

data = data.copy().astype(float)
data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


Unnamed: 0,assay_method_technique,assay_group,disease_type,disease_state,len feature,label
0,4.0,1.0,4.0,3.0,6.0,1.0
1,4.0,1.0,4.0,3.0,6.0,1.0
2,4.0,1.0,4.0,3.0,24.0,1.0
3,4.0,1.0,4.0,3.0,8.0,1.0
4,4.0,1.0,4.0,3.0,6.0,1.0
...,...,...,...,...,...,...
190806,2.0,1.0,4.0,8.0,34.0,0.0
190807,4.0,1.0,4.0,8.0,24.0,1.0
190808,4.0,1.0,4.0,8.0,49.0,1.0
190809,4.0,1.0,4.0,8.0,32.0,1.0
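
The encoding above assigns each category its frequency rank (0 for the most frequent class), and the inference section later maps categories that never appeared in training to 0. A compact sketch of both steps, using hypothetical helper names `fit_frequency_codes` and `encode` and made-up category values:

```python
from collections import Counter


def fit_frequency_codes(values):
    """Map each category to its frequency rank (0 = most frequent)."""
    ranked = Counter(values).most_common()
    return {category: rank for rank, (category, _) in enumerate(ranked)}


def encode(values, codes, unknown=0):
    """Encode with fitted codes; unseen categories fall back to `unknown`."""
    return [codes.get(v, unknown) for v in values]


train = ["microarray", "ELISA", "microarray", "microarray", "ELISA", "western blot"]
codes = fit_frequency_codes(train)
print(codes)
print(encode(["ELISA", "unseen assay", "western blot"], codes))
```

Note that with `unknown=0` an unseen category becomes indistinguishable from the most frequent training class; a dedicated code such as `len(codes)` would be an alternative design.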


In [None]:
"""
Cluster the following 4 features into 600 classes
"assay_method_technique", "assay_group", "disease_type", "disease_state"
"""

from sklearn.cluster import KMeans
Kmean = KMeans(n_clusters=600, random_state=0)
Kmean.fit(data.drop(["label", "len feature"], axis=1, inplace=False))

KMeans(n_clusters=600, random_state=0)

In [None]:
"""
Cluster the 7 peptide features into 600 classes
"""
from sklearn.cluster import KMeans
Kmean2 = KMeans(n_clusters=600, random_state=0)
Kmean2.fit(new_feature2)

KMeans(n_clusters=600, random_state=0)

In [None]:
"""
Cluster the 20 AAC features into 600 classes
"""
from sklearn.cluster import KMeans
Kmean3 = KMeans(n_clusters=600, random_state=0)
Kmean3.fit(new_feature)

KMeans(n_clusters=600, random_state=0)

In [None]:
data["cluster1"] = Kmean.predict(data.drop(["label", "len feature"], axis=1, inplace=False))

In [None]:
data["cluster2"] = Kmean2.predict(new_feature2)

In [None]:
data["cluster3"] = Kmean3.predict(new_feature)
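
`KMeans.predict` assigns each row to its nearest fitted centroid, and the notebook uses the resulting cluster index as a categorical-like feature. The assignment step can be sketched without any library as follows; the centroids and points are made up, and `nearest_centroid` is a hypothetical helper.

```python
def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


# Made-up 2-D centroids standing in for the 600 fitted cluster centers.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
points = [(1.0, 0.5), (6.0, 4.0), (9.0, 1.0)]

cluster_ids = [nearest_centroid(p, centroids) for p in points]
print(cluster_ids)
```

Because the distance is computed on raw values, features with large ranges (for example molecular weight) can dominate the assignment unless they are scaled first.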

In [None]:
data

Unnamed: 0,assay_method_technique,assay_group,disease_type,disease_state,len feature,label,cluster1,cluster2
0,4.0,1.0,4.0,3.0,6.0,1.0,191,126
1,4.0,1.0,4.0,3.0,6.0,1.0,191,67
2,4.0,1.0,4.0,3.0,24.0,1.0,191,181
3,4.0,1.0,4.0,3.0,8.0,1.0,191,282
4,4.0,1.0,4.0,3.0,6.0,1.0,191,52
...,...,...,...,...,...,...,...,...
190806,2.0,1.0,4.0,8.0,34.0,0.0,405,527
190807,4.0,1.0,4.0,8.0,24.0,1.0,431,84
190808,4.0,1.0,4.0,8.0,49.0,1.0,431,54
190809,4.0,1.0,4.0,8.0,32.0,1.0,431,151


### Preparing training and validation data

In [None]:
# Indices used for training

all_ind = set(range(0, len(data)))
delete_ind = set(delete_ind)
go_ind = list(all_ind - delete_ind)

In [None]:
# cluster3 is left out so that the 27 training columns match the 27-column inference input
feature_list = ["assay_method_technique", "assay_group", "disease_type", "disease_state", "len feature", "cluster1", "cluster2"]

x_data = np.concatenate([data[feature_list].to_numpy().reshape(-1,len(feature_list)), new_feature[:,:]], axis=1)[go_ind]
y_data = np.array(data["label"])[go_ind]
x_data.shape, y_data.shape

((71490, 7), (71490,))

In [None]:
# Balance the classes with SMOTE

from imblearn.over_sampling import SMOTE
x_data, y_data = SMOTE(random_state=0).fit_resample(x_data, y_data)
x_data.shape, y_data.shape

((109276, 7), (109276,))

In [None]:
list(y_data).count(0), list(y_data).count(1)

(54638, 54638)
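
SMOTE balances the classes by synthesizing minority samples on the line segment between a minority point and one of its minority-class neighbors. The sketch below shows only that interpolation step and is a deliberate simplification: it picks the partner at random rather than among the k nearest neighbors, so it is not equivalent to `imblearn`'s implementation. All data and names are made up.

```python
import random


def interpolate(sample, neighbor, rng):
    """Create one synthetic point on the segment between two minority samples."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(0)
minority = [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]

# Simplification: the partner is chosen at random, not among the k nearest neighbors.
synthetic = []
for _ in range(4):
    a, b = rng.sample(minority, 2)
    synthetic.append(interpolate(a, b, rng))

print(len(minority) + len(synthetic))
```

Note that the notebook resamples before splitting, so the validation split created later also contains synthetic rows.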

In [None]:
# Scaling

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
x_data = scaler.fit_transform(x_data)

In [None]:
np.max(x_data[:,0]), np.min(x_data[:,0])

(1.0, 0.0)
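
Min-max scaling maps each column to [0, 1] using the minimum and maximum seen during fitting, and the same bounds are reused at inference. A dependency-free sketch with made-up numbers (the helpers `fit_min_max` and `transform` are hypothetical, and the handling of constant columns is a choice made for this sketch):

```python
def fit_min_max(rows):
    """Return per-column (min, max) computed from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def transform(rows, bounds):
    """Scale each column with the training bounds; constant columns map to 0."""
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(row, bounds)
        ])
    return scaled


train = [[6.0, 191.0], [24.0, 405.0], [49.0, 431.0]]
test = [[16.0, 7.0]]

bounds = fit_min_max(train)
print(transform(train, bounds)[0])
print(transform(test, bounds)[0])
```

The second test value lies below the training minimum, so its scaled value is negative: inference inputs are not guaranteed to stay inside [0, 1].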

In [None]:
# Split off the validation data 

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.1, stratify=y_data, random_state=777)

print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)

(98348, 7) (10928, 7)
(98348,) (10928,)


### Modeling

In [None]:
# hyper parameter search space

# RF
param_rf = {"max_depth": [50, 30],
            "n_estimators": [500, 1000]} 

# RF2
param_rf2 = {"max_depth": 100,
             "n_estimators": 1000} 

# XGB
param_xgb = {"max_depth": [25, 30],
             "min_child_weight" : [6, 10],
             "n_estimators": [500, 300]}    

# XGB2
param_xgb2 = {"max_depth": 30, 
              "min_child_weight" : 6,
              "n_estimators": 500}   

# LGB                        
param_lgb = {"learning_rate" : [0.1, 0.3], 
             "max_depth" : [50, 100],
             "n_estimators" : [200, 1000]}

# LGB2                        
param_lgb2 = {"learning_rate" : 0.2,
              "max_depth" : 30,
              "n_estimators" : 1000}

# GBM             
param_gb = {"max_depth" : [9, 20],
            "learning_rate" : [0.2, 0.3],
            "n_estimators" : [500, 200]}

# GBM2             
param_gb2 = {"max_depth" : 9,
             "learning_rate" : 0.2,
             "n_estimators" : 300}

# CAT
param_cat = {"depth" : [10, 20],
             "iterations" : [1000],
             "learning_rate" : [0.1, 0.3],
             "l2_leaf_reg" : [2, 5],
             "border_count" : [254]}

In [None]:
# Grid search (run only when searching)

rf = RandomForestClassifier()
xgb = XGBClassifier()
lgb = LGBMClassifier()
gb = GradientBoostingClassifier()
cat = CatBoostClassifier()

gscv_rf = GridSearchCV (estimator = rf, param_grid = param_rf, scoring ='accuracy', cv = 3, refit=True, n_jobs=1, verbose=2)
gscv_xgb = GridSearchCV (estimator = xgb, param_grid = param_xgb, scoring ='accuracy', cv = 3, refit=True, n_jobs=1, verbose=2)
gscv_lgb = GridSearchCV (estimator = lgb, param_grid = param_lgb, scoring ='accuracy', cv = 3, refit=True, n_jobs=1, verbose=2)
gscv_gb = GridSearchCV (estimator = gb, param_grid = param_gb, scoring ='accuracy', cv = 3, refit=True, n_jobs=1, verbose=2)
gscv_cat = GridSearchCV (estimator = cat, param_grid = param_cat, scoring ='accuracy', cv = 3, refit=True, n_jobs=1, verbose=2)

gscv_rf.fit(x_train, y_train)
gscv_xgb.fit(x_train, y_train)
gscv_lgb.fit(x_train, y_train)
gscv_gb.fit(x_train, y_train)
gscv_cat.fit(x_train, y_train)
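
A grid search evaluates the Cartesian product of the listed values, so the cost grows multiplicatively with each parameter. The sketch below enumerates the CatBoost grid from `param_cat` to show how many candidates and cross-validation fits that implies; `expand_grid` is a hypothetical helper, not a scikit-learn function.

```python
from itertools import product

# Same search space as param_cat in the notebook.
param_cat = {
    "depth": [10, 20],
    "iterations": [1000],
    "learning_rate": [0.1, 0.3],
    "l2_leaf_reg": [2, 5],
    "border_count": [254],
}


def expand_grid(grid):
    """Return every parameter combination in the grid as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_cat)
cv = 3
print(len(candidates), len(candidates) * cv)
```

With `refit=True`, one additional fit on the full training split follows for the best candidate.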

In [None]:
# (run only when searching)

print("="*30)
print('RF parameters: ', gscv_rf.best_params_)
print('RF accuracy: {:.4f}'.format(gscv_rf.best_score_))
print("="*30)
print('XGB parameters: ', gscv_xgb.best_params_)
print('XGB accuracy: {:.4f}'.format(gscv_xgb.best_score_))
print("="*30)
print('LGB parameters: ', gscv_lgb.best_params_)
print('LGB accuracy: {:.4f}'.format(gscv_lgb.best_score_))
print("="*30)
print('GB parameters: ', gscv_gb.best_params_)
print('GB accuracy: {:.4f}'.format(gscv_gb.best_score_))
print("="*30)
print('CAT parameters: ', gscv_cat.best_params_)
print('CAT accuracy: {:.4f}'.format(gscv_cat.best_score_))
print("="*30)

In [None]:
# Best hyper parameter 1

# RF
best_param_rf = {"max_depth": 50,
                 "n_estimators": 500} 

# RF2
best_param_rf2 = {"max_depth": 100,
                  "n_estimators": 1000} 

# XGB
best_param_xgb = {"max_depth": 25,
                  "min_child_weight" : 6,
                  "n_estimators": 500}    

# XGB2
best_param_xgb2 = {"max_depth": 30,
                   "min_child_weight" : 6,
                   "n_estimators": 500}   

# LGB                        
best_param_lgb = {"learning_rate" : 0.1,
                  "max_depth" : 50,
                  "n_estimators" : 1000}

# LGB2                        
best_param_lgb2 = {"learning_rate" : 0.2,
                   "max_depth" : 30,
                   "n_estimators" : 1000}

# GBM             
best_param_gb = {"max_depth" : 9,
                 "learning_rate" : 0.2,
                 "n_estimators" : 500}

# GBM2             
best_param_gb2 = {"max_depth" : 9,
                  "learning_rate" : 0.2,
                  "n_estimators" : 300}

# CAT
best_param_cat = {"depth" : 10,
                  "iterations" : 1000,
                  "learning_rate" : 0.1, 
                  "l2_leaf_reg" : 2,
                  "border_count" : 254}

In [None]:
def get_score(pred, target):
  correct = list((pred == target)).count(True)
  accuracy = (correct/len(pred))
  val_f1 = f1_score(target, pred, average='macro')
  precision = precision_score(target, pred)
  recall = recall_score(target, pred)
  return accuracy, val_f1, precision, recall
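
To make the four returned numbers concrete, the sketch below recomputes them from confusion-matrix counts on a tiny made-up example. `binary_scores` is a hypothetical stand-in for `get_score` that assumes binary labels with 1 as the positive class; it is meant to illustrate the definitions rather than replace the scikit-learn metrics.

```python
def binary_scores(pred, target):
    """Accuracy, macro F1, precision and recall (positive class = 1)."""
    pairs = list(zip(pred, target))
    tp = sum(p == 1 and t == 1 for p, t in pairs)
    tn = sum(p == 0 and t == 0 for p, t in pairs)
    fp = sum(p == 1 and t == 0 for p, t in pairs)
    fn = sum(p == 0 and t == 1 for p, t in pairs)

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Macro F1 averages the F1 of class 1 and class 0 with equal weight.
    macro_f1 = (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2
    return accuracy, macro_f1, precision, recall


pred = [1, 0, 1, 1, 0, 0]
target = [1, 0, 0, 1, 1, 0]
scores = binary_scores(pred, target)
print(scores)
```

Precision and recall describe only the positive class, while macro F1 weights both classes equally, which matters when the real label distribution is as skewed as in this dataset.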

In [None]:
# Random forest model

rf_model = RandomForestClassifier(**best_param_rf)
rf_model.fit(x_train, y_train)

# Scores on the held-out validation split
rf_pred = rf_model.predict(x_test)
acc, f1, pre, recall = get_score(rf_pred, y_test)

print(acc, f1, pre, recall)

In [None]:
joblib.dump(rf_model, '/content/drive/MyDrive/첼린지/0.7245/rf_model.pkl')

In [None]:
# Load the saved model

rf_model = joblib.load('/content/drive/MyDrive/첼린지/0.7245/rf_model.pkl')

In [None]:
# XGB model

xgb_model = XGBClassifier(**best_param_xgb)
xgb_model.fit(x_train, y_train)

# Scores on the held-out validation split
xgb_pred = xgb_model.predict(x_test)
acc, f1, pre, recall = get_score(xgb_pred, y_test)

print(acc, f1, pre, recall)

In [None]:
joblib.dump(xgb_model, '/content/drive/MyDrive/첼린지/0.7245/xgb_model.pkl')

In [None]:
# Load the saved model

xgb_model = joblib.load('/content/drive/MyDrive/첼린지/0.7245/xgb_model.pkl')

In [None]:
# Gradient boosting model

gb_model = GradientBoostingClassifier(**best_param_gb)
gb_model.fit(x_train, y_train)

# Scores on the held-out validation split
gb_pred = gb_model.predict(x_test)
acc, f1, pre, recall = get_score(gb_pred, y_test)

print(acc, f1, pre, recall)

In [None]:
joblib.dump(gb_model, '/content/drive/MyDrive/첼린지/0.7245/gb_model.pkl')

In [None]:
# Load the saved model

gb_model = joblib.load('/content/drive/MyDrive/첼린지/0.7245/gb_model.pkl')

In [None]:
# LightGBM model

lgbm_model = LGBMClassifier(**best_param_lgb)
lgbm_model.fit(x_train, y_train)

# Scores on the held-out validation split
lgbm_pred = lgbm_model.predict(x_test)
acc, f1, pre, recall = get_score(lgbm_pred, y_test)

print(acc, f1, pre, recall)

0.91398243045388 0.9137153393757593 0.9658154859967051 0.8583455344070278


In [None]:
joblib.dump(lgbm_model, '/content/drive/MyDrive/첼린지/0.7245/lgbm_model.pkl')

In [None]:
# Load the saved model

lgbm_model = joblib.load('/content/drive/MyDrive/첼린지/0.7245/lgbm_model.pkl')

In [None]:
from lightgbm import plot_importance

fig, ax = plt.subplots(figsize=(10, 12))
plot_importance(lgbm_model, ax=ax)

In [None]:
# CatBoost model

cat_model = CatBoostClassifier(**best_param_cat)
cat_model.fit(x_train, y_train)

# Scores on the held-out validation split
cat_pred = cat_model.predict(x_test)
acc, f1, pre, recall = get_score(cat_pred, y_test)

print(acc, f1, pre, recall)

In [None]:
joblib.dump(cat_model, '/content/drive/MyDrive/첼린지/0.7245/cat_model.pkl')

In [None]:
# Load the saved model

cat_model = joblib.load('/content/drive/MyDrive/첼린지/0.7245/cat_model.pkl')

### Inference

In [None]:
# Add the length feature to the inference data 

test_data["len feature"] = test_data["epitope_seq"].map(len)

In [None]:
# Test data 
test_data = test_data[["number_of_tested", "number_of_responses", 
                       "assay_method_technique", "assay_group",
                       "disease_type", "disease_state", "len feature"]]

test_data

Unnamed: 0,number_of_tested,number_of_responses,assay_method_technique,assay_group,disease_type,disease_state,len feature
0,20.0,1.0,microarray,qualitative binding,Environmental exposure to endemic/ubiquitous a...,healthy,16
1,,,High throughput multiplexed assay,antibody binding,Occurrence of infectious disease,Chagas disease,15
2,,,microarray,qualitative binding,Occurrence of infectious disease,severe acute respiratory syndrome,12
3,,,microarray,qualitative binding,Environmental exposure to endemic/ubiquitous a...,healthy,12
4,,,microarray,qualitative binding,Environmental exposure to endemic/ubiquitous a...,healthy,12
...,...,...,...,...,...,...,...
120939,,,High throughput multiplexed assay,antibody binding,Occurrence of infectious disease,Chagas disease,15
120940,,,microarray,qualitative binding,Environmental exposure to endemic/ubiquitous a...,healthy,12
120941,20.0,0.0,microarray,qualitative binding,Environmental exposure to endemic/ubiquitous a...,healthy,16
120942,,,High throughput multiplexed assay,antibody binding,Occurrence of infectious disease,Chagas disease,15


In [None]:
test_data["number_of_tested"].isna().value_counts()

In [None]:
test_data["number_of_responses"].isna().value_counts()

In [None]:
test_data["assay_method_technique"].isna().value_counts()

In [None]:
test_data["assay_group"].isna().value_counts()

In [None]:
test_data["disease_type"].isna().value_counts()

In [None]:
test_data["disease_state"].isna().value_counts()

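In [None]:
# NOTE: reference_journal was dropped by the column selection above; run this check before that cell.
# test_data["reference_journal"].isna().value_counts()

In [None]:
# NOTE: reference_title was dropped by the column selection above; run this check before that cell.
# test_data["reference_title"].isna().value_counts()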

In [None]:
# Fill missing disease_state values 

nan_disease_states_test = list(test_data[ test_data["disease_state"].isnull() ]["disease_type"].value_counts().keys())

test_data.loc[test_data["disease_type"] == nan_disease_states_test[0], "disease_state"] = test_data[test_data["disease_type"] == nan_disease_states_test[0]]["disease_state"].fillna("healthy")
test_data.loc[test_data["disease_type"] == nan_disease_states_test[1], "disease_state"] = test_data[test_data["disease_type"] == nan_disease_states_test[1]]["disease_state"].fillna("Chagas disease")
test_data.loc[test_data["disease_type"] == nan_disease_states_test[2], "disease_state"] = test_data[test_data["disease_type"] == nan_disease_states_test[2]]["disease_state"].fillna("healthy")
test_data.loc[test_data["disease_type"] == nan_disease_states_test[3], "disease_state"] = test_data[test_data["disease_type"] == nan_disease_states_test[3]]["disease_state"].fillna("healthy")
test_data.loc[test_data["disease_type"] == nan_disease_states_test[4], "disease_state"] = test_data[test_data["disease_type"] == nan_disease_states_test[4]]["disease_state"].fillna("healthy")
test_data.loc[test_data["disease_type"] == nan_disease_states_test[5], "disease_state"] = test_data[test_data["disease_type"] == nan_disease_states_test[5]]["disease_state"].fillna("healthy")
test_data.loc[test_data["disease_type"] == nan_disease_states_test[6], "disease_state"] = test_data[test_data["disease_type"] == nan_disease_states_test[6]]["disease_state"].fillna("healthy")
test_data.loc[test_data["disease_type"] == nan_disease_states_test[7], "disease_state"] = test_data[test_data["disease_type"] == nan_disease_states_test[7]]["disease_state"].fillna("healthy")
test_data.loc[test_data["disease_type"] == nan_disease_states_test[8], "disease_state"] = test_data[test_data["disease_type"] == nan_disease_states_test[8]]["disease_state"].fillna("healthy")
test_data.loc[test_data["disease_type"] == nan_disease_states_test[9], "disease_state"] = test_data[test_data["disease_type"] == nan_disease_states_test[9]]["disease_state"].fillna("healthy")

test_data

Unnamed: 0,number_of_tested,number_of_responses,assay_method_technique,assay_group,disease_type,disease_state,len feature
0,20.0,1.0,microarray,qualitative binding,Environmental exposure to endemic/ubiquitous a...,healthy,16
1,,,High throughput multiplexed assay,antibody binding,Occurrence of infectious disease,Chagas disease,15
2,,,microarray,qualitative binding,Occurrence of infectious disease,severe acute respiratory syndrome,12
3,,,microarray,qualitative binding,Environmental exposure to endemic/ubiquitous a...,healthy,12
4,,,microarray,qualitative binding,Environmental exposure to endemic/ubiquitous a...,healthy,12
...,...,...,...,...,...,...,...
120939,,,High throughput multiplexed assay,antibody binding,Occurrence of infectious disease,Chagas disease,15
120940,,,microarray,qualitative binding,Environmental exposure to endemic/ubiquitous a...,healthy,12
120941,20.0,0.0,microarray,qualitative binding,Environmental exposure to endemic/ubiquitous a...,healthy,16
120942,,,High throughput multiplexed assay,antibody binding,Occurrence of infectious disease,Chagas disease,15


In [None]:
# Inference data 

test_data = test_data[["assay_method_technique", "assay_group", "disease_type", "disease_state", "len feature"]].copy()

In [None]:
# Encode the inference data

col_name1 = "assay_method_technique"
col_name2 = "assay_group"
col_name3 = "disease_type"
col_name4 = "disease_state"

classes1 = list(data_[col_name1].value_counts().keys())
classes2 = list(data_[col_name2].value_counts().keys())
classes3 = list(data_[col_name3].value_counts().keys())
classes4 = list(data_[col_name4].value_counts().keys())


for i, clas in enumerate(classes1):
  test_data.loc[ test_data[col_name1] == clas, col_name1] = i

for i, clas in enumerate(classes2):
  test_data.loc[ test_data[col_name2] == clas, col_name2] = i

for i, clas in enumerate(classes3):
  test_data.loc[ test_data[col_name3] == clas, col_name3] = i

for i, clas in enumerate(classes4):
  test_data.loc[ test_data[col_name4] == clas, col_name4] = i


# Classes that do not appear in the training features are mapped to 0
classes1 = list(test_data[col_name1].value_counts().keys())
classes2 = list(test_data[col_name2].value_counts().keys())
classes3 = list(test_data[col_name3].value_counts().keys())
classes4 = list(test_data[col_name4].value_counts().keys())

for i, clas in enumerate(classes1):
  if type(clas) == str:
    test_data.loc[ test_data[col_name1] == clas, col_name1] = 0

for i, clas in enumerate(classes2):
  if type(clas) == str:
    test_data.loc[ test_data[col_name2] == clas, col_name2] = 0

for i, clas in enumerate(classes3):
  if type(clas) == str:
    test_data.loc[ test_data[col_name3] == clas, col_name3] = 0

for i, clas in enumerate(classes4):
  if type(clas) == str:
    test_data.loc[ test_data[col_name4] == clas, col_name4] = 0

test_data = test_data.astype(float)
test_data

Unnamed: 0,assay_method_technique,assay_group,disease_type,disease_state,len feature
0,1.0,1.0,1.0,1.0,16.0
1,0.0,0.0,0.0,0.0,15.0
2,1.0,1.0,0.0,6.0,12.0
3,1.0,1.0,1.0,1.0,12.0
4,1.0,1.0,1.0,1.0,12.0
...,...,...,...,...,...
120939,0.0,0.0,0.0,0.0,15.0
120940,1.0,1.0,1.0,1.0,12.0
120941,1.0,1.0,1.0,1.0,16.0
120942,0.0,0.0,0.0,0.0,15.0


In [None]:
test_data["cluster1"] = Kmean.predict(test_data.drop(["len feature"], axis=1, inplace=False))
test_data["cluster2"] = Kmean2.predict(new_feature_test2)
test_data

Unnamed: 0,assay_method_technique,assay_group,disease_type,disease_state,len feature,cluster1,cluster2
0,1.0,1.0,1.0,1.0,16.0,7,596
1,0.0,0.0,0.0,0.0,15.0,0,388
2,1.0,1.0,0.0,6.0,12.0,51,529
3,1.0,1.0,1.0,1.0,12.0,7,320
4,1.0,1.0,1.0,1.0,12.0,7,340
...,...,...,...,...,...,...,...
120939,0.0,0.0,0.0,0.0,15.0,0,138
120940,1.0,1.0,1.0,1.0,12.0,7,424
120941,1.0,1.0,1.0,1.0,16.0,7,339
120942,0.0,0.0,0.0,0.0,15.0,0,129


In [None]:
test_data = np.concatenate([test_data[["assay_method_technique", "assay_group", "disease_type", "disease_state", "len feature", "cluster1", "cluster2"]].to_numpy().reshape(-1,7), 
                            new_feature_test[:,:]], axis=1)
test_data.shape

(120944, 27)

In [None]:
test_data = scaler.transform(test_data)
np.max(test_data[:,0]), np.min(test_data[:,0])

In [None]:
#Soft Voting Ensemble

rf_pred_ = rf_model.predict_proba(test_data)

xgb_pred_ = xgb_model.predict_proba(test_data)

lgbm_pred_ = lgbm_model.predict_proba(test_data)

gb_pred_ = gb_model.predict_proba(test_data)

cat_pred_ = cat_model.predict_proba(test_data)

pred = (rf_pred_ + xgb_pred_ + lgbm_pred_ + gb_pred_ + cat_pred_)/5.0
pred = np.argmax(pred, axis=1).reshape(-1)
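
Soft voting averages the class probabilities of all models and then takes the most probable class, so a single confident model can outweigh several uncertain ones. A small pure-Python sketch with made-up probabilities for three hypothetical models (`soft_vote` is an illustrative helper):

```python
def soft_vote(prob_sets):
    """Average per-class probabilities over models, then pick the argmax."""
    n_models = len(prob_sets)
    n_samples = len(prob_sets[0])
    labels = []
    for i in range(n_samples):
        rows = [probs[i] for probs in prob_sets]
        mean = [sum(col) / n_models for col in zip(*rows)]
        labels.append(mean.index(max(mean)))
    return labels


# Each inner list is [P(label=0), P(label=1)] for one sample.
model_a = [[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]]
model_b = [[0.8, 0.2], [0.3, 0.7], [0.55, 0.45]]
model_c = [[0.6, 0.4], [0.6, 0.4], [0.10, 0.90]]

labels = soft_vote([model_a, model_b, model_c])
print(labels)
```

For the third sample two of the three models lean toward class 0, yet the averaged probability favors class 1; a majority (hard) vote would have chosen class 0 there.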

In [None]:
list(np.argmax(rf_pred_, axis=1).reshape(-1)).count(0), list(np.argmax(rf_pred_, axis=1).reshape(-1)).count(1)

In [None]:
list(np.argmax(xgb_pred_, axis=1).reshape(-1)).count(0), list(np.argmax(xgb_pred_, axis=1).reshape(-1)).count(1)

In [None]:
list(np.argmax(lgbm_pred_, axis=1).reshape(-1)).count(0), list(np.argmax(lgbm_pred_, axis=1).reshape(-1)).count(1)

In [None]:
list(np.argmax(gb_pred_, axis=1).reshape(-1)).count(0), list(np.argmax(gb_pred_, axis=1).reshape(-1)).count(1)

In [None]:
list(np.argmax(cat_pred_, axis=1).reshape(-1)).count(0), list(np.argmax(cat_pred_, axis=1).reshape(-1)).count(1)

In [None]:
list(pred).count(0), list(pred).count(1)

In [None]:
sub["label"] = pred
sub

In [None]:
sub["label"].value_counts()

In [None]:
sub.to_csv("/content/제출물60.csv", index=False)

In [None]:
from google.colab import files

files.download("/content/제출물60.csv")