# **미니프로젝트 4차 1대1 문의 내용 유형 분류기**
# 단계3 : Text classification

### 문제 정의
> 1:1 문의 내용 분류 문제<br>
> 1. 문의 내용 분석
> 2. 문의 내용 분류 모델 성능 평가
### 학습 데이터
> * 1:1 문의 내용 데이터 : train.csv

### 변수 소개
> * text : 문의 내용
> * label : 문의 유형

### References
> * Machine Learning
>> * [sklearn-tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
> * Deep Learning
>> * [Google Tutorial](https://developers.google.com/machine-learning/guides/text-classification)
>> * [Tensorflow Tutorial](https://www.tensorflow.org/tutorials/keras/text_classification)
>> * [Keras-tutorial](https://keras.io/examples/nlp/text_classification_from_scratch/)
>> * [BERT-tutorial](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)

## 1. 개발 환경 설정

### 1-1. 라이브러리 설치

In [2]:
# 필요 라이브러리부터 설치할께요.
!pip install konlpy pandas seaborn gensim wordcloud python-mecab-ko wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m19.4/19.4 MB[0m [31m68.4 MB/s[0m eta [36m0:00:00[0m
Collecting python-mecab-ko
  Downloading python_mecab_ko-1.3.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (575 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m575.6/575.6 KB[0m [31m50.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting JPype1>=0.7.0
  Downloading JPype1-1.4.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (465 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m465.3/465.3 KB[0m [31m43.3 MB/s[0m eta [36m0:00:00[0m
Collecting python-mecab-ko-dic
  Downloading python_mecab_ko_dic-2.1.1.post2-py3-none-any.whl (34.5 MB)


### 1-2. 라이브러리 import

In [3]:
from mecab import MeCab
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import wget,os
from IPython.display import display
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.font_manager as fm
import matplotlib.pyplot as plt
import tensorflow as tf
import nltk
import wget,os
import joblib

### 1-3. 한글 글꼴 설정(Windows)

In [None]:
# if not os.path.exists("malgun.ttf"): 
#     wget.download("https://www.wfonts.com/download/data/2016/06/13/malgun-gothic/malgun.ttf")
# if 'malgun' not in fm.fontManager.findfont("Malgun Gothic"):
#     fm.fontManager.addfont("malgun.ttf")
# if plt.rcParams['font.family']!= ["Malgun Gothic"]:
#     plt.rcParams['font.family']= [font for font in fm.fontManager.ttflist if 'malgun.ttf' in font.fname][-1].name
# plt.rcParams['axes.unicode_minus'] = False #한글 폰트 사용시 마이너스 폰트 깨짐 해결
# assert plt.rcParams['font.family'] == ["Malgun Gothic"], "한글 폰트가 설정되지 않았습니다."
# FONT_PATH = "malgun.ttf"

In [None]:
# !sudo apt-get install -y fonts-nanum

### 1-4. 자바 경로 설정(Windows)

In [None]:
# os.environ['JAVA_HOME'] = "C:\Program Files\Java\jdk-19"

### 1-3. 한글 글꼴 설정(Colab)

In [4]:
!sudo apt-get install -y fonts-nanum

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  fonts-nanum
0 upgraded, 1 newly installed, 0 to remove and 24 not upgraded.
Need to get 9,599 kB of archives.
After this operation, 29.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 fonts-nanum all 20180306-3 [9,599 kB]
Fetched 9,599 kB in 1s (9,190 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package fonts-nanum.
(Reading database ... 122349 files and di

In [5]:
FONT_PATH = '/usr/share/fonts/truetype/nanum/NanumGothic.ttf'
font_name = fm.FontProperties(fname=FONT_PATH, size=10).get_name()
fm.fontManager.addfont(FONT_PATH)
print(font_name)
plt.rcParams['font.family']=font_name
assert plt.rcParams['font.family'] == [font_name], "한글 폰트가 설정되지 않았습니다."

NanumGothic


### 1-4. 구글드라이브 연결(Colab)

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2. 전처리한 데이터 불러오기
* 1, 2일차에 전처리한 데이터를 불러옵니다.
* sparse data에 대해서는 scipy.sparse.load_npz 활용

In [102]:
import scipy.sparse

##sequence
x_train_ds = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/sparse_matrix_sequence.npz')
x_val_ds = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/sparse_matrix_sequence_val.npz')
x_train_tk = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/sparse_matrix_sequence_tk.npz')
x_val_tk = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/sparse_matrix_sequence_tk_val.npz')
x_train_tk_1 = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/sparse_matrix_sequence_tk_1.npz')
x_val_tk_1 = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/sparse_matrix_sequence_tk_val_1.npz')

##새로운sequence
x_train_t = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/t.npz')
x_val_t = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/t_val.npz')
x_train_t1 = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/t1.npz')
x_val_t1 = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/t1_val.npz')


#n_gram
x_train_tf = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/sparse_matrix_gram_tf.npz')
x_val_tf = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/sparse_matrix_gram_tf_val.npz')
x_train_n = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/sparse_matrix_gram.npz')
x_val_n = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/sparse_matrix_gram_val.npz')


x_train_tff = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/tf.npz')
x_val_tff = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/tf_val.npz')
x_train_nn = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/n.npz')
x_val_nn = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/n_val.npz')

##w2v
x_w2v_tr = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/x_w2v_tr.npz')
x_w2v_val = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/x_w2v_val.npz')
x_pr_tr = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/x_pr_tr.npz')
x_pr_val = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/x_pr_val.npz')

y_train = pd.read_csv('/content/drive/MyDrive/4mini/y_train.csv')
y_val = pd.read_csv('/content/drive/MyDrive/4mini/y_val.csv')

In [103]:
test = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/test.npz')
test_seq = scipy.sparse.load_npz('/content/drive/MyDrive/4mini/test_seq.npz')

In [104]:
x_train_n=x_train_n.toarray()
x_val_n=x_val_n.toarray()
x_train_ds=x_train_ds.toarray()
x_val_ds=x_val_ds.toarray()
x_train_tk=x_train_tk.toarray()
x_val_tk=x_val_tk.toarray()
x_train_tf=x_train_tf.toarray()
x_val_tf=x_val_tf.toarray()

In [105]:
x_train_tk_1=x_train_tk_1.toarray()
x_val_tk_1=x_val_tk_1.toarray()

x_train_nn=x_train_nn.toarray()
x_val_nn=x_val_nn.toarray()
x_train_tff=x_train_tff.toarray()
x_val_tff=x_val_tff.toarray()

In [106]:
x_train_t = x_train_t.toarray()
x_val_t = x_val_t.toarray()
x_train_t1 = x_train_t1.toarray()
x_val_t1 = x_val_t1.toarray()

In [107]:
x_w2v_tr = x_w2v_tr.toarray()
x_w2v_val = x_w2v_val.toarray()
x_pr_tr = x_pr_tr.toarray()
x_pr_val = x_pr_val.toarray()

In [108]:
test=test.toarray()
test_seq=test_seq.toarray()

In [14]:
test.shape,test_seq.shape

((929, 10201), (929, 600))

In [None]:
# y_train = pd.get_dummies(y_train["label"])
# y_val = pd.get_dummies(y_val["label"])

In [109]:
y_train

Unnamed: 0,label
0,0
1,4
2,2
3,2
4,1
...,...
2959,2
2960,2
2961,3
2962,0


In [110]:
x_train_tk = pd.DataFrame(x_train_tk,columns=[f"word{i}" for i in range(x_train_tk.shape[1])])
x_train_tf = pd.DataFrame(x_train_tf,columns=[f"word{i}" for i in range(x_train_tf.shape[1])])
x_train_tk_1 = pd.DataFrame(x_train_tk_1,columns=[f"word{i}" for i in range(x_train_tk_1.shape[1])])
x_train_tff = pd.DataFrame(x_train_tff,columns=[f"word{i}" for i in range(x_train_tff.shape[1])])

In [111]:
x_val_tk = pd.DataFrame(x_val_tk,columns=[f"word{i}" for i in range(x_val_tk.shape[1])])
x_val_tf = pd.DataFrame(x_val_tf,columns=[f"word{i}" for i in range(x_val_tf.shape[1])])
x_val_tk_1 = pd.DataFrame(x_val_tk_1,columns=[f"word{i}" for i in range(x_val_tk_1.shape[1])])
x_val_tff = pd.DataFrame(x_val_tff,columns=[f"word{i}" for i in range(x_val_tff.shape[1])])

In [112]:
x_train_n = pd.DataFrame(x_train_n,columns=[f"word{i}" for i in range(x_train_n.shape[1])])
x_val_n = pd.DataFrame(x_val_n,columns=[f"word{i}" for i in range(x_val_n.shape[1])])
x_train_nn = pd.DataFrame(x_train_nn,columns=[f"word{i}" for i in range(x_train_nn.shape[1])])
x_val_nn = pd.DataFrame(x_val_nn,columns=[f"word{i}" for i in range(x_val_nn.shape[1])])

In [113]:
x_train_t = pd.DataFrame(x_train_t,columns=[f"word{i}" for i in range(x_train_t.shape[1])])
x_val_t = pd.DataFrame(x_val_t,columns=[f"word{i}" for i in range(x_val_t.shape[1])])
x_train_t1 = pd.DataFrame(x_train_t1,columns=[f"word{i}" for i in range(x_train_t1.shape[1])])
x_val_t1 = pd.DataFrame(x_val_t1,columns=[f"word{i}" for i in range(x_val_t1.shape[1])])

In [114]:
x_w2v_tr = pd.DataFrame(x_w2v_tr,columns=[f"word{i}" for i in range(x_w2v_tr.shape[1])])
x_w2v_val = pd.DataFrame(x_w2v_val,columns=[f"word{i}" for i in range(x_w2v_val.shape[1])])
x_pr_tr= pd.DataFrame(x_pr_tr,columns=[f"word{i}" for i in range(x_pr_tr.shape[1])])
x_pr_val = pd.DataFrame(x_pr_val,columns=[f"word{i}" for i in range(x_pr_val.shape[1])])

In [115]:
test = pd.DataFrame(test,columns=[f"word{i}" for i in range(test.shape[1])])
test_seq = pd.DataFrame(test_seq,columns=[f"word{i}" for i in range(test_seq.shape[1])])

In [146]:
x_train_t

Unnamed: 0,word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,...,word990,word991,word992,word993,word994,word995,word996,word997,word998,word999
0,0,0,0,0,0,0,0,0,0,0,...,37,6,490,13,1870,9,244,19,1499,15
1,0,0,0,0,0,0,0,0,0,0,...,34,1086,68,204,36,79,12,87,23,49
2,0,0,0,0,0,0,0,0,0,0,...,48,61,199,340,29,99,3500,224,104,22
3,0,0,0,0,0,0,0,0,0,0,...,56,183,19,148,237,20,73,101,120,6
4,0,0,0,0,0,0,0,0,0,0,...,350,76,11,326,9,241,101,3066,23,49
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2959,0,0,0,0,0,0,0,0,0,0,...,15,109,194,138,630,17,89,73,101,834
2960,0,0,0,0,0,0,0,0,0,0,...,958,135,32,52,26,9,38,57,23,49
2961,0,0,0,0,0,0,0,0,0,0,...,68,332,43,44,340,358,75,38,57,23
2962,0,0,0,0,0,0,0,0,0,0,...,107,20,36,26,73,311,43,44,86,15


In [121]:
x_train_ = pd.concat([x_train_tff, x_val_tff], axis=0)
x_train_

Unnamed: 0,word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,...,word10191,word10192,word10193,word10194,word10195,word10196,word10197,word10198,word10199,word10200
0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.100502,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.084777,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
737,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
738,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
739,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
740,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [123]:
x_train_=x_train_.reset_index(drop=True)

In [122]:
y_train_ = pd.concat([y_train, y_val], axis=0)
y_train_

Unnamed: 0,label
0,0
1,4
2,2
3,2
4,1
...,...
737,4
738,2
739,0
740,4


In [124]:
y_train_=y_train_.reset_index(drop=True)

In [21]:
from sklearn.model_selection import train_test_split
# x_train,x_val,y_train,y_val = train_test_split(x_,y_train,test_size=0.2,random_state=42)

In [21]:
x_train_tff.shape,x_val_tff.shape

((2964, 10201), (742, 10201))

In [None]:
label_dict = {0: 'a',1: "b",2: "c",3: "d",4: 'e'}

In [None]:
y_train["label"] = y_train["label"].map(label_dict)
y_val['label'] = y_val["label"].map(label_dict)

In [None]:
y_val

Unnamed: 0,label
0,b
1,a
2,c
3,c
4,a
...,...
737,e
738,c
739,a
740,e


In [None]:
# data.to_csv('/content/drive/MyDrive/4mini/datadata.csv', index=False)

In [None]:
# data_val.to_csv('/content/drive/MyDrive/4mini/datadata_val.csv', index=False)

In [None]:
import tensorflow as tf
from tensorflow.data import AUTOTUNE
from tensorflow.data import Dataset

x_train = tf.data.Dataset.from_tensor_slices((x_train,y_train))
x_val = tf.data.Dataset.from_tensor_slices((x_val,y_val))

AUTOTUNE = AUTOTUNE

x_train = x_train.cache().prefetch(buffer_size=AUTOTUNE)
x_val = x_val.cache().prefetch(buffer_size=AUTOTUNE)

TypeError: ignored

In [None]:
x_train

<_PrefetchDataset element_spec=(TensorSpec(shape=(600,), dtype=tf.int32, name=None), TensorSpec(shape=(5,), dtype=tf.uint8, name=None))>

In [None]:
x_val

<_PrefetchDataset element_spec=(TensorSpec(shape=(600,), dtype=tf.int32, name=None), TensorSpec(shape=(5,), dtype=tf.uint8, name=None))>

## 3. Machine Learning(N-grams)
* N-gram으로 전처리한 데이터를 이용하여 3개 이상의 Machine Learning 모델 학습 및 성능 분석
> * [sklearn-tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

### 3-1. Model 1

In [None]:
joblib.dump(model, '/content/drive/MyDrive/4mini/LinearRegression.pkl') 

In [17]:
!pip install optuna

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting optuna
  Downloading optuna-3.1.0-py3-none-any.whl (365 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m365.3/365.3 KB[0m [31m7.5 MB/s[0m eta [36m0:00:00[0m
Collecting cmaes>=0.9.1
  Downloading cmaes-0.9.1-py3-none-any.whl (21 kB)
Collecting alembic>=1.5.0
  Downloading alembic-1.10.3-py3-none-any.whl (212 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m212.3/212.3 KB[0m [31m23.6 MB/s[0m eta [36m0:00:00[0m
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting Mako
  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m78.7/78.7 KB[0m [31m9.8 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: Mako, colorlog, cmaes, alembic, optuna
Successfully installed Mako-1.2.4 alembic-1.10.3 cmaes-0.9.1 colorlog-6.7.0 optuna-3.1.0


In [None]:
y_train = y_train.values.ravel()
y_train

array([0, 4, 2, ..., 3, 0, 3])

In [None]:
from sklearn.svm import SVC
model=SVC()
model.fit(x_train_tff,y_train)

  y = column_or_1d(y, warn=True)


In [None]:
y_pred = model.predict(x_val_tff)

In [None]:
y_pred

array([1, 0, 2, 2, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0, 0, 2, 3, 0, 2, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 1, 1, 2, 0, 3, 0, 0, 0, 0, 0, 3, 2, 1, 0, 2,
       0, 0, 4, 1, 0, 2, 3, 1, 2, 3, 0, 1, 3, 0, 3, 3, 2, 1, 0, 0, 2, 0,
       1, 0, 2, 0, 0, 0, 1, 2, 1, 2, 0, 0, 2, 0, 1, 0, 0, 3, 0, 2, 0, 0,
       0, 0, 1, 0, 3, 0, 1, 1, 3, 0, 0, 4, 0, 0, 0, 0, 2, 0, 2, 0, 2, 3,
       1, 1, 1, 2, 2, 0, 0, 0, 3, 0, 4, 2, 2, 2, 0, 0, 3, 2, 3, 0, 0, 0,
       0, 0, 1, 2, 1, 3, 2, 0, 0, 0, 2, 0, 2, 4, 2, 2, 0, 0, 1, 0, 3, 1,
       0, 2, 3, 0, 2, 2, 3, 0, 1, 2, 1, 4, 3, 1, 3, 3, 0, 0, 2, 0, 2, 0,
       1, 2, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 2, 0, 3, 0, 1, 0, 0, 1, 3,
       2, 2, 0, 3, 0, 1, 0, 3, 0, 2, 0, 2, 1, 0, 0, 0, 0, 0, 2, 2, 0, 3,
       3, 1, 2, 3, 1, 0, 0, 0, 0, 3, 0, 1, 0, 0, 0, 0, 0, 3, 2, 3, 2, 1,
       2, 1, 0, 3, 4, 0, 0, 2, 0, 0, 2, 0, 3, 2, 1, 0, 1, 0, 0, 1, 1, 2,
       1, 2, 2, 4, 1, 0, 0, 0, 0, 3, 0, 4, 2, 0, 2, 0, 0, 0, 1, 2, 0, 2,
       1, 1, 1, 1, 0, 2, 0, 0, 0, 0, 0, 1, 2, 0, 3,

In [None]:
from sklearn.metrics import accuracy_score,f1_score,confusion_matrix,classification_report
print(f1_score(y_val,y_pred,average="macro"))
print(confusion_matrix(y_val,y_pred))
print(classification_report(y_val,y_pred))

0.8230883137815663
[[298   4  21   2   0]
 [ 20 112   7   2   0]
 [ 34   1 114   3   0]
 [  8   5   5  83   0]
 [  1   4   0   3  15]]
              precision    recall  f1-score   support

           0       0.83      0.92      0.87       325
           1       0.89      0.79      0.84       141
           2       0.78      0.75      0.76       152
           3       0.89      0.82      0.86       101
           4       1.00      0.65      0.79        23

    accuracy                           0.84       742
   macro avg       0.88      0.79      0.82       742
weighted avg       0.84      0.84      0.84       742



In [24]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp39-none-manylinux1_x86_64.whl (76.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.6/76.6 MB[0m [31m12.9 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: catboost
Successfully installed catboost-1.1.1


In [47]:
import optuna
from optuna.samplers import TPESampler
from catboost import CatBoostClassifier
def objective(trial):
    param = {
        "random_state":42,
        'learning_rate' : trial.suggest_loguniform('learning_rate', 0.01, 0.1),
        "n_estimators":trial.suggest_int("n_estimators", 100, 600),
        "max_depth":trial.suggest_int("max_depth", 10, 16),
        'random_strength' :trial.suggest_int('random_strength', 25, 50),
        "l2_leaf_reg":trial.suggest_float("l2_leaf_reg",1e-5,3e-5)
        
    }
    x_train, x_valid, y_train, y_valid = train_test_split(x,y,test_size=0.2,random_state=42)
  
    cat = CatBoostClassifier(**param)
    cat.fit(x_train, y_train,eval_set=[(x_train, y_train), (x_valid,y_valid)],early_stopping_rounds=35,verbose=100)
    log_score = log_loss(y_valid, cat_pred)

    return log_score

In [None]:
sampler = TPESampler(seed=42)
study = optuna.create_study(
    study_name = 'cat_parameter_opt',
    direction = 'minimize',
    sampler = sampler,
)
study.optimize(objective, n_trials=5)
print("Best Score:",study.best_value)
print("Best trial",study.best_trial.params)

[32m[I 2023-04-07 00:52:41,118][0m A new study created in memory with name: cat_parameter_opt[0m
  'learning_rate' : trial.suggest_loguniform('learning_rate', 0.01, 0.1),


In [131]:
from catboost import CatBoostClassifier

model = CatBoostClassifier(random_state=42)
model.fit(x_train_, y_train_)

Learning rate set to 0.084594
0:	learn: 1.5317824	total: 838ms	remaining: 13m 57s
1:	learn: 1.4640713	total: 1.39s	remaining: 11m 34s
2:	learn: 1.4138245	total: 1.94s	remaining: 10m 44s
3:	learn: 1.3712734	total: 2.5s	remaining: 10m 22s
4:	learn: 1.3351963	total: 3.07s	remaining: 10m 11s
5:	learn: 1.3020982	total: 3.62s	remaining: 10m
6:	learn: 1.2732796	total: 4.17s	remaining: 9m 52s
7:	learn: 1.2476179	total: 4.71s	remaining: 9m 43s
8:	learn: 1.2240745	total: 5.26s	remaining: 9m 39s
9:	learn: 1.1975615	total: 5.81s	remaining: 9m 35s
10:	learn: 1.1751356	total: 6.38s	remaining: 9m 33s
11:	learn: 1.1601843	total: 6.92s	remaining: 9m 29s
12:	learn: 1.1393196	total: 7.5s	remaining: 9m 29s
13:	learn: 1.1213312	total: 8.05s	remaining: 9m 26s
14:	learn: 1.1056107	total: 8.61s	remaining: 9m 25s
15:	learn: 1.0931971	total: 9.23s	remaining: 9m 27s
16:	learn: 1.0812181	total: 10.3s	remaining: 9m 53s
17:	learn: 1.0698758	total: 11.2s	remaining: 10m 12s
18:	learn: 1.0563254	total: 12.2s	remaining

<catboost.core.CatBoostClassifier at 0x7f4ce04ed9d0>

In [132]:
y_pred = model.predict(test)

In [133]:
y_pred

array([[3],
       [3],
       [0],
       [0],
       [2],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [3],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [3],
       [0],
       [2],
       [0],
       [3],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [3],
       [0],
       [0],
       [2],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [3],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [3],
       [0],
       [0],
       [3],
       [0],
       [0],
       [2],
       [3],
       [0],
       [4],
       [4],
       [0],
       [0],
       [0],
       [0],
       [2],
    

In [134]:
sub=pd.DataFrame(y_pred,columns=['label'])
a=[]
for i in range(929):
    a.append(i)
a=np.array(a)
a=pd.DataFrame(a,columns=["id"])
# result3 = pd.concat([a, sub], axis=1)prediction_label

In [136]:
# result3 = pd.concat([a, sub], axis=1)prediction_label
result3

Unnamed: 0,id,label
0,0,3
1,1,3
2,2,0
3,3,0
4,4,2
...,...,...
924,924,3
925,925,0
926,926,3
927,927,1


In [141]:
# result = pd.concat([a, sub], axis=1)
result3["label"].value_counts()

0    414
2    179
1    179
3    129
4     28
Name: label, dtype: int64

In [190]:
result8.to_csv('/content/drive/MyDrive/4mini/result8.csv', index=False)

In [None]:
from sklearn.metrics import accuracy_score,f1_score,confusion_matrix,classification_report
print(f1_score(y_val,y_pred,average="macro"))
print(confusion_matrix(y_val_pca,y_pred))
print(classification_report(y_val_pca,y_pred))

0.2616651675195788
[[259  22  21  23   0]
 [ 51  43  24  23   0]
 [ 71  31  18  32   0]
 [ 24  42  20  15   0]
 [  3   3  13   4   0]]
              precision    recall  f1-score   support

         0.0       0.63      0.80      0.71       325
         1.0       0.30      0.30      0.30       141
         2.0       0.19      0.12      0.15       152
         3.0       0.15      0.15      0.15       101
         4.0       0.00      0.00      0.00        23

    accuracy                           0.45       742
   macro avg       0.26      0.27      0.26       742
weighted avg       0.40      0.45      0.42       742





### 3-2. Model 2

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)
model.fit(x_train_nn,y_train)

In [None]:
y_pred = model.predict(x_val_tf)



In [None]:
y_val.values

array([[0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       ...,
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0]], dtype=uint8)

In [None]:
from sklearn.metrics import accuracy_score,f1_score,confusion_matrix
print(f1_score(y_val,y_pred,average="macro"))
print(confusion_matrix(y_val.values.argmax(axis=1),y_pred.argmax(axis=1)))
print(classification_report(y_val,y_pred))

0.26819407008086255
[[128  64  71  58   4]
 [ 59  25  31  24   2]
 [ 57  29  32  26   8]
 [ 41  19  27  14   0]
 [ 13   6   1   3   0]]


In [None]:
from xgboost import XGBClassifier
model = XGBClassifier(random_state=42)
model.fit(x_train_tff,y_train)

In [None]:
y_pred = model.predict(x_val_tff)

In [None]:
y_pred

array([1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0, 0, 2, 3, 0, 2, 1, 0, 1,
       2, 0, 0, 2, 0, 1, 0, 1, 1, 2, 0, 3, 1, 0, 2, 0, 0, 3, 2, 1, 0, 2,
       0, 0, 4, 4, 0, 2, 3, 1, 0, 0, 0, 1, 3, 0, 3, 3, 2, 1, 0, 0, 2, 0,
       1, 0, 2, 0, 0, 0, 1, 2, 1, 0, 0, 0, 2, 0, 1, 0, 0, 3, 0, 2, 0, 0,
       0, 0, 1, 0, 3, 0, 1, 3, 3, 0, 0, 4, 0, 0, 0, 3, 2, 0, 2, 0, 2, 3,
       1, 1, 1, 2, 2, 0, 0, 0, 4, 0, 4, 2, 2, 2, 0, 0, 3, 2, 3, 0, 0, 0,
       0, 0, 1, 2, 1, 2, 2, 0, 1, 0, 0, 0, 2, 4, 2, 2, 0, 0, 3, 0, 3, 1,
       0, 1, 3, 0, 2, 2, 3, 0, 1, 2, 2, 4, 0, 1, 3, 3, 0, 0, 2, 0, 2, 0,
       0, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 3, 4, 4, 0, 0, 1, 3,
       2, 2, 0, 3, 0, 1, 0, 3, 0, 2, 1, 2, 0, 0, 0, 0, 2, 0, 2, 2, 0, 3,
       3, 0, 1, 3, 2, 0, 0, 0, 0, 3, 0, 1, 0, 0, 0, 0, 0, 4, 2, 3, 2, 0,
       2, 1, 0, 3, 4, 1, 0, 2, 0, 0, 2, 1, 3, 2, 1, 0, 1, 0, 0, 1, 1, 2,
       0, 2, 2, 4, 1, 0, 0, 0, 0, 3, 0, 4, 2, 0, 2, 0, 0, 0, 1, 2, 0, 2,
       1, 1, 1, 3, 0, 2, 0, 0, 0, 0, 0, 1, 1, 0, 3,

In [None]:
from sklearn.metrics import accuracy_score,f1_score,confusion_matrix
print(f1_score(y_val,y_pred,average="macro"))
print(confusion_matrix(y_val,y_pred))
print(classification_report(y_val,y_pred))

0.8375620043422878
[[288  14  21   2   0]
 [ 20 107  10   3   1]
 [ 31   6 109   6   0]
 [  8   6   6  81   0]
 [  0   0   0   0  23]]
              precision    recall  f1-score   support

           0       0.83      0.89      0.86       325
           1       0.80      0.76      0.78       141
           2       0.75      0.72      0.73       152
           3       0.88      0.80      0.84       101
           4       0.96      1.00      0.98        23

    accuracy                           0.82       742
   macro avg       0.84      0.83      0.84       742
weighted avg       0.82      0.82      0.82       742



### 3-3. Model 3

In [41]:
!pip install pycaret[full]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pycaret[full]
  Downloading pycaret-3.0.0-py3-none-any.whl (481 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m481.8/481.8 KB[0m [31m13.1 MB/s[0m eta [36m0:00:00[0m
Collecting plotly-resampler>=0.8.3.1
  Downloading plotly_resampler-0.8.3.2.tar.gz (46 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m46.4/46.4 KB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting wurlitzer
  Downloading wurlitzer-3.0.3-py3-none-any.whl (7.3 kB)
Collecting pmdarima!=1.8.1,<3.0.0,>=1.8.0
  Downloading pmdarima-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (1.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0

In [47]:
!pip uninstall packaging
!pip install packaging

Found existing installation: packaging 21.3
Uninstalling packaging-21.3:
  Would remove:
    /usr/local/lib/python3.9/dist-packages/packaging-21.3.dist-info/*
    /usr/local/lib/python3.9/dist-packages/packaging/*
Proceed (Y/n)? y
  Successfully uninstalled packaging-21.3
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting packaging
  Downloading packaging-23.0-py3-none-any.whl (42 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.7/42.7 KB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: packaging
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mlflow 1.30.1 requires packaging<22, but you have packaging 23.0 which is incompatible.[0m[31m
[0mSuccessfully installed packaging-23.0


In [125]:
y_trainpp = y_train_.values.ravel()
y_trainpp.shape, x_train_.shape

((3706,), (3706, 10201))

In [126]:
from pycaret.classification import *

exp_clf = setup(data = x_train_, target=y_trainpp, session_id = 42,fold=7)

Unnamed: 0,Description,Value
0,Session id,42
1,Target,target
2,Target type,Multiclass
3,Original data shape,"(3706, 10202)"
4,Transformed data shape,"(3706, 10202)"
5,Transformed train set shape,"(2594, 10202)"
6,Transformed test set shape,"(1112, 10202)"
7,Numeric features,10201
8,Preprocess,True
9,Imputation type,simple


In [None]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
ridge,Ridge Classifier,0.8163,0.0,0.8163,0.8207,0.8152,0.7396,0.7418,3.2214
svm,SVM - Linear Kernel,0.8052,0.0,0.8052,0.8097,0.8037,0.7244,0.7267,4.9914
lr,Logistic Regression,0.7893,0.9421,0.7893,0.7949,0.7843,0.6969,0.7017,18.5186
catboost,CatBoost Classifier,0.7666,0.9325,0.7666,0.771,0.7631,0.6666,0.6706,32.6586
gbc,Gradient Boosting Classifier,0.755,0.9237,0.755,0.7622,0.7488,0.6463,0.654,96.9443
et,Extra Trees Classifier,0.7522,0.9316,0.7522,0.7725,0.7425,0.6359,0.6526,7.6329
knn,K Neighbors Classifier,0.7474,0.9079,0.7474,0.7507,0.741,0.6372,0.6427,4.8371
xgboost,Extreme Gradient Boosting,0.7459,0.9199,0.7459,0.7484,0.7433,0.6391,0.6416,16.3557
lightgbm,Light Gradient Boosting Machine,0.7411,0.9144,0.7411,0.7435,0.7392,0.6335,0.6351,10.7871
rf,Random Forest Classifier,0.7136,0.9091,0.7136,0.7245,0.7015,0.5806,0.5943,5.8729


Processing:   0%|          | 0/69 [00:00<?, ?it/s]



In [151]:
svm = create_model('svm',fold=5)

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.7996,0.0,0.7996,0.8011,0.7998,0.718,0.7183
1,0.8247,0.0,0.8247,0.8235,0.8233,0.7563,0.7567
2,0.7861,0.0,0.7861,0.7881,0.7825,0.6931,0.6967
3,0.7861,0.0,0.7861,0.8011,0.7889,0.7007,0.7028
4,0.8166,0.0,0.8166,0.8208,0.8175,0.7429,0.7436
Mean,0.8026,0.0,0.8026,0.8069,0.8024,0.7222,0.7236
Std,0.0157,0.0,0.0157,0.0134,0.0158,0.0242,0.0232


Processing:   0%|          | 0/4 [00:00<?, ?it/s]

In [152]:
tuned_svm = tune_model(svm,search_library="tune-sklearn",search_algorithm="optuna")

0,1
Current time:,2023-04-07 05:29:24
Running for:,00:53:03.65
Memory:,7.8/12.7 GiB

Trial name,status,loc,actual_estimator__al pha,actual_estimator__et a0,actual_estimator__fi t_intercept,actual_estimator__l1 _ratio,actual_estimator__le arning_rate,actual_estimator__pe nalty,iter,total time (s),split0_test_score,split1_test_score,split2_test_score
_Trainable_04d42315,TERMINATED,172.28.0.12:50645,0.321472,0.0945431,False,0.37454,optimal,elasticnet,1,160.114,0.425876,0.428571,0.428571
_Trainable_e017aa8d,TERMINATED,172.28.0.12:50712,1.32859e-08,0.00309557,True,0.832443,constant,l1,1,1952.28,0.795148,0.819407,0.795148
_Trainable_e26bf1e6,TERMINATED,172.28.0.12:50645,0.0071082,0.00345871,True,0.45607,adaptive,l2,1,225.461,0.67655,0.698113,0.671159
_Trainable_a12ee1ed,TERMINATED,172.28.0.12:50645,9.4781e-10,0.0702626,False,0.304614,invscaling,l1,1,1143.84,0.58221,0.617251,0.552561
_Trainable_8a43caa3,TERMINATED,172.28.0.12:50645,7.05577e-09,0.413885,False,0.54671,optimal,l2,1,88.7425,0.789757,0.816712,0.814016
_Trainable_f4c167f7,TERMINATED,172.28.0.12:50645,5.17e-08,0.17248,False,0.388677,invscaling,l1,1,592.849,0.684636,0.692722,0.663073
_Trainable_4f2ec35f,TERMINATED,172.28.0.12:50712,0.0142763,0.0808699,True,0.00552212,constant,l2,1,72.3963,0.544474,0.663073,0.641509
_Trainable_51e664fe,TERMINATED,172.28.0.12:50712,1.7858e-07,0.0931505,False,0.310982,adaptive,l2,1,225.618,0.797844,0.80593,0.814016
_Trainable_1d417d54,TERMINATED,172.28.0.12:50645,1.88543e-06,0.00117113,False,0.522733,constant,l1,1,964.059,0.768194,0.762803,0.74124
_Trainable_3717abf3,TERMINATED,172.28.0.12:50712,5.88571e-10,0.00605383,False,0.228798,adaptive,l2,1,433.864,0.819407,0.822102,0.819407


2023-04-07 05:29:24,692	INFO tune.py:798 -- Total run time: 3184.00 seconds (3183.62 seconds for the tuning loop).


Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.8194,0.0,0.8194,0.8276,0.8191,0.7442,0.7465
1,0.8221,0.0,0.8221,0.8291,0.8234,0.7487,0.7496
2,0.8194,0.0,0.8194,0.8187,0.8183,0.7457,0.746
3,0.7978,0.0,0.7978,0.8026,0.7922,0.7134,0.7143
4,0.8,0.0,0.8,0.8081,0.7975,0.7148,0.7173
5,0.8027,0.0,0.8027,0.8089,0.8027,0.7223,0.7234
6,0.8108,0.0,0.8108,0.8141,0.812,0.7361,0.7363
Mean,0.8103,0.0,0.8103,0.8156,0.8093,0.7322,0.7333
Std,0.0094,0.0,0.0094,0.0093,0.0111,0.014,0.0138


In [161]:
tuned_svm1 = tune_model(svm,search_library="optuna",search_algorithm="tpe",fold=3)

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.7954,0.0,0.7954,0.8014,0.7929,0.707,0.7112
1,0.8139,0.0,0.8139,0.8165,0.8123,0.735,0.7375
2,0.8183,0.0,0.8183,0.8183,0.8175,0.7444,0.7449
Mean,0.8092,0.0,0.8092,0.8121,0.8076,0.7288,0.7312
Std,0.0099,0.0,0.0099,0.0076,0.0106,0.0159,0.0145


Processing:   0%|          | 0/7 [00:00<?, ?it/s]

[32m[I 2023-04-07 05:37:42,509][0m Searching the best hyperparameters using 2594 samples...[0m
[32m[I 2023-04-07 05:46:33,606][0m Finished hyperparemeter search![0m


Error: Runtime no longer has a reference to this dataframe, please re-run this cell and try again.
Error: Runtime no longer has a reference to this dataframe, please re-run this cell and try again.


In [184]:
cat = tune_model(cat,search_library="optuna",search_algorithm="tpe",fold=3)

Processing:   0%|          | 0/7 [00:00<?, ?it/s]

[32m[I 2023-04-07 06:10:46,316][0m Searching the best hyperparameters using 2594 samples...[0m


KeyboardInterrupt: ignored

In [None]:
tuned_svm

In [None]:
catboost = create_model('ridge',fold=5)

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.8317,0.0,0.8317,0.8351,0.8317,0.7652,0.7656
1,0.8654,0.0,0.8654,0.8725,0.8639,0.8088,0.8129
2,0.8269,0.0,0.8269,0.8342,0.828,0.7576,0.7585
3,0.8317,0.0,0.8317,0.834,0.8322,0.7652,0.7654
4,0.7971,0.0,0.7971,0.797,0.7928,0.7092,0.713
5,0.8116,0.0,0.8116,0.8135,0.8109,0.7328,0.7342
6,0.8357,0.0,0.8357,0.8468,0.8339,0.7653,0.7691
7,0.7971,0.0,0.7971,0.8019,0.7967,0.7113,0.7135
8,0.7923,0.0,0.7923,0.7926,0.7906,0.7058,0.707
9,0.8068,0.0,0.8068,0.8189,0.8061,0.7237,0.7282


Processing:   0%|          | 0/4 [00:00<?, ?it/s]

In [None]:
tuned_grid = tune_model(catboost)

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.8231,0.0,0.8231,0.8236,0.8228,0.7532,0.7534
1,0.8538,0.0,0.8538,0.8593,0.855,0.7945,0.7955
2,0.8224,0.0,0.8224,0.8325,0.8238,0.7525,0.7537
3,0.7992,0.0,0.7992,0.7976,0.7952,0.7139,0.7166
4,0.7992,0.0,0.7992,0.8003,0.7991,0.7178,0.7181
5,0.8571,0.0,0.8571,0.8591,0.8559,0.7973,0.7993
6,0.7722,0.0,0.7722,0.7791,0.7713,0.6775,0.6791
7,0.8147,0.0,0.8147,0.8206,0.8144,0.7367,0.7392
Mean,0.8177,0.0,0.8177,0.8215,0.8172,0.7429,0.7444
Std,0.0266,0.0,0.0266,0.0269,0.0272,0.0381,0.038


Processing:   0%|          | 0/7 [00:00<?, ?it/s]

Fitting 8 folds for each of 10 candidates, totalling 80 fits


                   include=['word0', 'word1', 'word2', 'word3', 'word4',
                            'word5', 'word6', 'word7', 'word8', 'word9',
                            'word10', 'word11', 'word12', 'word13', 'word14',
                            'word15', 'word16', 'word17', 'word18', 'word19',
                            'word20', 'word21', 'word22', 'word23', 'word24',
                            'word25', 'word26', 'word27', 'word28', 'word29', ...],
                   transformer=SimpleImputer(add_indicator=False, copy=True,
                                             fill_value=None,
                                             keep_empty_features=False,
                                             missing_values=nan,
                                             strategy='mean',
                                             verbose='deprecated')), X=      word0  word1  word2     word3     word4  word5  word6  word7  word8  \
2555    0.0    0.0    0.0  0.046939  0.070926    0

                   transformer=SimpleImputer(add_indicator=False, copy=True,
                                             fill_value=None,
                                             keep_empty_features=False,
                                             missing_values=nan,
                                             strategy='most_frequent',
                                             verbose='deprecated')), X=      word0  word1  word2     word3     word4  word5  word6  word7  word8  \
2555    0.0    0.0    0.0  0.046939  0.070926    0.0    0.0    0.0    0.0   
81      0.0    0.0    0.0  0.000000  0.000000    0.0    0.0    0.0    0.0   
707     0.0    0.0    0.0  0.000000  0.000000    0.0    0.0    0.0    0.0   
1081    0.0    0.0    0.0  0.000000  0.000000    0.0    0.0    0.0    0.0   
101     0.0    0.0    0.0  0.000000  0.000000    0.0    0.0    0.0    0.0   
...     ...    ...    ...       ...       ...    ...    ...    ...    ...   
1135    0.0    0.0    0.0  0.000000  0.0000

In [None]:
tuned_svm1,tuned_svm

In [164]:
blender_2 = blend_models(estimator_list = [tuned_svm,tuned_svm,tuned_svm],fold=4)

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.7951,0.0,0.7951,0.8046,0.7926,0.7052,0.7103
1,0.7951,0.0,0.7951,0.7999,0.7902,0.7051,0.7094
2,0.7886,0.0,0.7886,0.7948,0.7849,0.6979,0.7008
3,0.8117,0.0,0.8117,0.8146,0.8123,0.7358,0.7361
Mean,0.7976,0.0,0.7976,0.8035,0.795,0.711,0.7142
Std,0.0086,0.0,0.0086,0.0073,0.0103,0.0146,0.0132


Processing:   0%|          | 0/6 [00:00<?, ?it/s]

In [186]:
stacker = stack_models(estimator_list = [tuned_svm,tuned_svm1], meta_model = tuned_svm,fold=4)

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.8336,0.0,0.8336,0.8404,0.8346,0.7667,0.768
1,0.8136,0.0,0.8136,0.8143,0.8108,0.7362,0.7382
2,0.7994,0.0,0.7994,0.8039,0.7992,0.7169,0.7181
3,0.8194,0.0,0.8194,0.8233,0.8206,0.7499,0.7505
Mean,0.8165,0.0,0.8165,0.8205,0.8163,0.7424,0.7437
Std,0.0123,0.0,0.0123,0.0134,0.013,0.0183,0.0182


Processing:   0%|          | 0/6 [00:00<?, ?it/s]

In [187]:
final_model = finalize_model(stacker)
prediction = predict_model(final_model, data = test)

In [188]:
prediction["prediction_label"].values

array([3, 3, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 2, 3, 0, 0, 0, 0,
       0, 0, 3, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 1, 0, 0, 3,
       0, 3, 0, 0, 3, 0, 0, 2, 3, 0, 4, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0, 3,
       3, 2, 0, 3, 2, 0, 2, 0, 2, 0, 0, 0, 0, 3, 0, 3, 3, 0, 0, 0, 0, 0,
       0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0, 3,
       3, 3, 0, 0, 0, 1, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0, 3,
       0, 2, 0, 2, 2, 2, 0, 0, 0, 2, 2, 0, 2, 0, 2, 2, 0, 0, 0, 0, 3, 2,
       2, 2, 2, 0, 2, 2, 2, 0, 0, 2, 2, 0, 0, 3, 0, 2, 0, 0, 0, 2, 0, 0,
       0, 0, 0, 3, 3, 2, 3, 2, 2, 2, 3, 3, 1, 2, 2, 2, 0, 3, 2, 2, 2, 3,
       2, 0, 2, 3, 0, 0, 2, 0, 0, 0, 4, 0, 2, 2, 0, 2, 0, 0, 0, 0, 0, 0,
       0, 0, 2, 0, 0, 3, 3, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 0, 2, 1, 0,
       4, 2, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 4, 3, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 3, 2, 0, 3, 2, 2, 3, 0, 3, 0, 0,

In [189]:
sub=pd.DataFrame(prediction["prediction_label"].values,columns=['label'])
a=[]
for i in range(929):
    a.append(i)
a=np.array(a)
a=pd.DataFrame(a,columns=["id"])
result8 = pd.concat([a, sub], axis=1)
result8

Unnamed: 0,id,label
0,0,3
1,1,3
2,2,0
3,3,0
4,4,2
...,...,...
924,924,3
925,925,0
926,926,3
927,927,1


In [172]:
for i in range(len(result4["label"].values)):
    if result4["label"].values[i]!=result5["label"].values[i]:
        print(1)

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1


In [None]:
from sklearn.metrics import accuracy_score,f1_score,confusion_matrix
print(f1_score(y_val,prediction["prediction_label"],average="macro"))
print(confusion_matrix(y_val,prediction["prediction_label"]))

0.8458355236566721
[[284  10  27   4   0]
 [ 16 115   7   2   1]
 [ 27   6 113   6   0]
 [  5   4   6  86   0]
 [  0   1   0   1  21]]


In [26]:
!pip install mljar-supervised

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mljar-supervised
  Downloading mljar-supervised-0.11.5.tar.gz (112 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m112.7/112.7 KB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting dtreeviz>=2.0.0
  Downloading dtreeviz-2.2.0-py3-none-any.whl (90 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m90.5/90.5 KB[0m [31m11.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting shap>=0.36.0
  Downloading shap-0.41.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (572 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m572.4/572.4 KB[0m [31m33.2 MB/s[0m eta [36m0:00:00[0m
Collecting category_encoders>=2.2.2
  Downloading category_encoders-2.6.0-py2.py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.2/81.2 KB[0m [31m7.8 M

In [40]:
from supervised.automl import AutoML
automl = AutoML(mode="Perform")
automl.fit(x_train_, y_train_)

Linear algorithm was disabled.
AutoML directory: AutoML_3
The task is multiclass_classification with evaluation metric logloss
AutoML will use algorithms: ['LightGBM', 'Neural Network']
AutoML will ensemble available models


KeyboardInterrupt: ignored

In [None]:
from sklearn.metrics import accuracy_score,f1_score,confusion_matrix
predictions = automl.predict_all(x_val_tff)
print(predictions.head())

   prediction_0  prediction_1  prediction_2  prediction_3  prediction_4  label
0      0.087837      0.818163      0.061953      0.024604      0.007443      1
1      0.590941      0.195559      0.202674      0.007292      0.003534      0
2      0.099783      0.004629      0.890112      0.004420      0.001056      2
3      0.139048      0.002521      0.856525      0.001022      0.000884      2
4      0.201462      0.314277      0.161033      0.303298      0.019929      1


In [None]:
print(f1_score(y_val,predictions["label"],average="macro"))
print(confusion_matrix(y_val,predictions["label"]))
print(classification_report(y_val,predictions["label"]))

0.8356092207870045
[[293   7  21   4   0]
 [ 15 117   8   1   0]
 [ 28   5 114   5   0]
 [  6   5   6  84   0]
 [  0   3   1   2  17]]
              precision    recall  f1-score   support

           0       0.86      0.90      0.88       325
           1       0.85      0.83      0.84       141
           2       0.76      0.75      0.75       152
           3       0.88      0.83      0.85       101
           4       1.00      0.74      0.85        23

    accuracy                           0.84       742
   macro avg       0.87      0.81      0.84       742
weighted avg       0.84      0.84      0.84       742



### 3-4. Hyperparameter Tuning(Optional) 
* Manual Search, Grid search, Bayesian Optimization, TPE...
> * [grid search tutorial sklearn](https://scikit-learn.org/stable/modules/grid_search.html)
> * [optuna tutorial](https://optuna.org/#code_examples)
> * [ray-tune tutorial](https://docs.ray.io/en/latest/tune/examples/tune-sklearn.html)

In [None]:
# from tensorflow.keras.utils import to_categorical
# y_train = to_categorical(y_train)
# y_val = to_categorical(y_val)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
model_rf = RandomForestClassifier(random_state=42)
params={"max_depth":range(30,36),"n_estimators":[300,500]}
model_rf = GridSearchCV(model_rf,params,cv=10,scoring="f1",n_jobs=-1)
model_rf.fit(x_train,y_train)

NameError: ignored

In [None]:
print(model_rf.best_params_)
print(model_rf.best_score_) 

{'max_depth': 5, 'n_estimators': 100}
nan


In [None]:
from xgboost import XGBClassifier
model_xgb = XGBClassifier(random_state=42)
params={"max_depth":range(5,11),"n_estimators":[100]}
model_xgb = GridSearchCV(model_xgb,params,cv=5,scoring="f1",n_jobs=-1)
model_xgb.fit(x_train_tf,y_train)

KeyboardInterrupt: ignored

In [None]:
print(model_xgb.best_params_)
print(model_xgb.best_score_)

## 4. Deep Learning(Sequence)
* Sequence로 전처리한 데이터를 이용하여 DNN, 1-D CNN, LSTM 등 3가지 이상의 deep learning 모델 학습 및 성능 분석
> * [Google Tutorial](https://developers.google.com/machine-learning/guides/text-classification)
> * [Tensorflow Tutorial](https://www.tensorflow.org/tutorials/keras/text_classification)
> * [Keras-tutorial](https://keras.io/examples/nlp/text_classification_from_scratch/)

### 4-1. DNN

In [None]:
# x_train_tk
# x_val_tk

### 4-2. 1-D CNN

In [142]:
!pip install focal_loss

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting focal_loss
  Downloading focal_loss-0.0.7-py3-none-any.whl (19 kB)
Installing collected packages: focal_loss
Successfully installed focal_loss-0.0.7


In [143]:
from focal_loss import *

In [None]:
import tensorflow as tf
from tensorflow import keras
keras.backend.clear_session()

il = keras.layers.Input(shape=[600])
el = keras.layers. Embedding(9955,256,input_length=600)(il)
hl = keras.layers.Conv1D(64,kernel_size=5,activation="swish",padding="same")(el)
hl = keras.layers.Conv1D(64,kernel_size=5,activation="swish",padding="same")(hl)
hl = keras.layers.BatchNormalization()(hl)
hl = keras.layers.Dropout(0.2)(hl)

hl = keras.layers.GRU(64, return_sequences=True)(hl)
hl = keras.layers.BatchNormalization()(hl)
hl = keras.layers.Dropout(0.2)(hl)

hl = keras.layers.Bidirectional(keras.layers.GRU(64, return_sequences=True))(hl)
hl = keras.layers.Bidirectional(keras.layers.GRU(32, return_sequences=True))(hl)
hl = keras.layers.BatchNormalization()(hl)
hl = keras.layers.Dropout(0.2)(hl)


hl = keras.layers.GlobalAveragePooling1D()(hl)

ol = keras.layers.Dense(5,activation="softmax")(hl)

model = keras.models.Model(il,ol)
model.compile(loss=SparseCategoricalFocalLoss(gamma=2),
              optimizer = keras.optimizers.RMSprop(),
              metrics=['accuracy'])
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 600)]             0         
                                                                 
 embedding (Embedding)       (None, 600, 256)          2548480   
                                                                 
 conv1d (Conv1D)             (None, 600, 64)           81984     
                                                                 
 conv1d_1 (Conv1D)           (None, 600, 64)           20544     
                                                                 
 batch_normalization (BatchN  (None, 600, 64)          256       
 ormalization)                                                   
                                                                 
 dropout (Dropout)           (None, 600, 64)           0         
                                                             

In [None]:
from tensorflow.keras.callbacks import EarlyStopping,ReduceLROnPlateau
es = EarlyStopping(monitor="val_loss",
                   min_delta=0.,
                   patience=11,
                   verbose=1,
                   restore_best_weights=True)
lr = ReduceLROnPlateau(monitor="val_loss",
                       patience=4,
                       factor=0.35,
                       verbose=1,
                       min_lr=0.0000001)
model.fit(x_train_tk,y_train,validation_split=0.2,callbacks=[es,lr],epochs=100,verbose=1)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 10: ReduceLROnPlateau reducing learning rate to 0.00035000001662410796.
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 16: ReduceLROnPlateau reducing learning rate to 0.00012250000581843777.
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 20: ReduceLROnPlateau reducing learning rate to 4.287500050850212e-05.
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 23: early stopping


<keras.callbacks.History at 0x7f2097d70280>

In [None]:
model.evaluate(x_val_tk,y_val)



[1.0855400562286377, 0.8059298992156982]

In [None]:
y_pred=model.predict(x_val_tk)



In [None]:
# y_val.values.argmax(axis=1)

In [None]:
# y_pred.argmax(axis=1)

In [None]:
from sklearn.metrics import accuracy_score,f1_score,confusion_matrix,classification_report
print(f1_score(y_val.values.argmax(axis=1),y_pred.argmax(axis=1),average="macro"))
print(classification_report(y_val.values.argmax(axis=1),y_pred.argmax(axis=1)))

0.8033438529407478
              precision    recall  f1-score   support

           0       0.77      0.92      0.84       325
           1       0.88      0.76      0.82       141
           2       0.76      0.63      0.69       152
           3       0.89      0.77      0.83       101
           4       0.86      0.83      0.84        23

    accuracy                           0.81       742
   macro avg       0.83      0.78      0.80       742
weighted avg       0.81      0.81      0.80       742



In [None]:
confusion_matrix(y_val.values.argmax(axis=1),y_pred.argmax(axis=1))

array([[298,   6,  19,   2,   0],
       [ 23, 107,   7,   2,   2],
       [ 48,   3,  96,   5,   0],
       [ 15,   3,   4,  78,   1],
       [  1,   2,   0,   1,  19]])

### 4-3. LSTM

In [None]:
!pip install focal_loss

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting focal_loss
  Downloading focal_loss-0.0.7-py3-none-any.whl (19 kB)
Installing collected packages: focal_loss
Successfully installed focal_loss-0.0.7


In [None]:
from focal_loss import *

In [147]:
import tensorflow as tf
from tensorflow import keras
keras.backend.clear_session()

il = keras.layers.Input(shape=[1000])
el = keras.layers. Embedding(10805,256,input_length=1000)(il)
hl = keras.layers.Conv1D(64,kernel_size=5,activation="swish",padding="same")(el)
hl = keras.layers.Conv1D(64,kernel_size=5,activation="swish",padding="same")(hl)
hl = keras.layers.BatchNormalization()(hl)
hl = keras.layers.Dropout(0.2)(hl)

hl = keras.layers.GRU(64, return_sequences=True)(hl)
hl = keras.layers.BatchNormalization()(hl)
hl = keras.layers.Dropout(0.2)(hl)

hl = keras.layers.Bidirectional(keras.layers.GRU(64, return_sequences=True))(hl)
hl = keras.layers.Bidirectional(keras.layers.GRU(32, return_sequences=True))(hl)
hl = keras.layers.BatchNormalization()(hl)
hl = keras.layers.Dropout(0.2)(hl)


hl = keras.layers.GlobalAveragePooling1D()(hl)

ol = keras.layers.Dense(5,activation="softmax")(hl)

model = keras.models.Model(il,ol)
model.compile(loss=SparseCategoricalFocalLoss(gamma=2),
              optimizer = keras.optimizers.RMSprop(),
              metrics=['accuracy'])
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1000)]            0         
                                                                 
 embedding (Embedding)       (None, 1000, 256)         2766080   
                                                                 
 conv1d (Conv1D)             (None, 1000, 64)          81984     
                                                                 
 conv1d_1 (Conv1D)           (None, 1000, 64)          20544     
                                                                 
 batch_normalization (BatchN  (None, 1000, 64)         256       
 ormalization)                                                   
                                                                 
 dropout (Dropout)           (None, 1000, 64)          0         
                                                             

In [148]:
from tensorflow.keras.callbacks import EarlyStopping,ReduceLROnPlateau
es = EarlyStopping(monitor="val_loss",
                   min_delta=0.001,
                   patience=11,
                   verbose=1,
                   restore_best_weights=True)
lr = ReduceLROnPlateau(monitor="val_loss",
                       patience=4,
                       factor=0.35,
                       verbose=1,
                       min_lr=0.0000001)
model.fit(x_train_t,y_train,validation_split=0.2,callbacks=[es,lr],epochs=100,verbose=1)

Epoch 1/100
 3/75 [>.............................] - ETA: 6:59 - loss: 1.0319 - accuracy: 0.1875

KeyboardInterrupt: ignored

In [None]:
y_pred=model.predict(x_pr_val)



In [None]:
y_pred

array([[0.33457786, 0.2142    , 0.23533179, 0.19832778, 0.01756254],
       [0.33457786, 0.2142    , 0.23533179, 0.19832778, 0.01756254],
       [0.36120293, 0.22973895, 0.24258602, 0.15132841, 0.0151437 ],
       ...,
       [0.30297527, 0.19877382, 0.20034361, 0.27375624, 0.02415107],
       [0.36194006, 0.2316922 , 0.2467193 , 0.14467974, 0.01496877],
       [0.3955371 , 0.24641034, 0.21952716, 0.12580955, 0.01271585]],
      dtype=float32)

In [None]:
from sklearn.metrics import accuracy_score,f1_score,confusion_matrix,classification_report
print(f1_score(y_val,y_pred.argmax(axis=1),average="macro"))
print(classification_report(y_val,y_pred.argmax(axis=1)))

0.1290615118429688
              precision    recall  f1-score   support

           0       0.44      0.99      0.61       325
           1       0.00      0.00      0.00       141
           2       0.00      0.00      0.00       152
           3       0.20      0.02      0.04       101
           4       0.00      0.00      0.00        23

    accuracy                           0.44       742
   macro avg       0.13      0.20      0.13       742
weighted avg       0.22      0.44      0.27       742



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## 5. Using pre-trained model(Optional)
* 한국어 pre-trained model로 fine tuning 및 성능 분석
> * [BERT-tutorial](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)
> * [HuggingFace-Korean](https://huggingface.co/models?language=korean)