# **미니프로젝트 4차 1대1 문의 내용 유형 분류기**
# 단계3 : Text classification

### 문제 정의
> 1:1 문의 내용 분류 문제<br>
> 1. 문의 내용 분석
> 2. 문의 내용 분류 모델 성능 평가
### 학습 데이터
> * 1:1 문의 내용 데이터 : train.csv

### 변수 소개
> * text : 문의 내용
> * label : 문의 유형

### References
> * Machine Learning
>> * [sklearn-tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
> * Deep Learning
>> * [Google Tutorial](https://developers.google.com/machine-learning/guides/text-classification)
>> * [Tensorflow Tutorial](https://www.tensorflow.org/tutorials/keras/text_classification)
>> * [Keras-tutorial](https://keras.io/examples/nlp/text_classification_from_scratch/)
>> * [BERT-tutorial](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)

## 1. 개발 환경 설정

### 1-1. 라이브러리 설치

In [2]:
# 필요 라이브러리부터 설치할께요.
!pip install konlpy pandas seaborn gensim wordcloud python-mecab-ko wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### 1-2. 라이브러리 import

In [3]:
from mecab import MeCab
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import wget,os
from IPython.display import display
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.font_manager as fm
import matplotlib.pyplot as plt
import tensorflow as tf
import nltk
import wget,os

### 1-3. 한글 글꼴 설정(Windows)

In [4]:
if not os.path.exists("malgun.ttf"): 
    wget.download("https://www.wfonts.com/download/data/2016/06/13/malgun-gothic/malgun.ttf")
if 'malgun' not in fm.fontManager.findfont("Malgun Gothic"):
    fm.fontManager.addfont("malgun.ttf")
if plt.rcParams['font.family']!= ["Malgun Gothic"]:
    plt.rcParams['font.family']= [font for font in fm.fontManager.ttflist if 'malgun.ttf' in font.fname][-1].name
plt.rcParams['axes.unicode_minus'] = False #한글 폰트 사용시 마이너스 폰트 깨짐 해결
assert plt.rcParams['font.family'] == ["Malgun Gothic"], "한글 폰트가 설정되지 않았습니다."
FONT_PATH = "malgun.ttf"



In [5]:
!sudo apt-get install -y fonts-nanum

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-525
Use 'sudo apt autoremove' to remove it.
The following NEW packages will be installed:
  fonts-nanum
0 upgraded, 1 newly installed, 0 to remove and 23 not upgraded.
Need to get 9,599 kB of archives.
After this operation, 29.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 fonts-nanum all 20180306-3 [9,599 kB]
Fetched 9,599 kB in 1s (6,887 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype

### 1-4. 자바 경로 설정(Windows)

In [6]:
os.environ['JAVA_HOME'] = "C:\Program Files\Java\jdk-19"

### 1-3. 한글 글꼴 설정(Colab)

In [7]:
!sudo apt-get install -y fonts-nanum

Reading package lists... Done
Building dependency tree       
Reading state information... Done
fonts-nanum is already the newest version (20180306-3).
The following package was automatically installed and is no longer required:
  libnvidia-common-525
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 23 not upgraded.


In [8]:
FONT_PATH = '/usr/share/fonts/truetype/nanum/NanumGothic.ttf'
font_name = fm.FontProperties(fname=FONT_PATH, size=10).get_name()
print(font_name)
plt.rcParams['font.family']=font_name
assert plt.rcParams['font.family'] == [font_name], "한글 폰트가 설정되지 않았습니다."

NanumGothic


### 1-4. 구글드라이브 연결(Colab)

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2. 전처리한 데이터 불러오기
* 1, 2일차에 전처리한 데이터를 불러옵니다.
* sparse data에 대해서는 scipy.sparse.load_npz 활용

In [44]:
x_train = pd.read_csv("/content/drive/MyDrive/2023.04.03_미니프로젝트4차_실습자료/x_train.csv")
x_val = pd.read_csv("/content/drive/MyDrive/2023.04.03_미니프로젝트4차_실습자료/x_val.csv")
x_test = pd.read_csv("/content/drive/MyDrive/2023.04.03_미니프로젝트4차_실습자료/x_test.csv")
y_train = pd.read_csv("/content/drive/MyDrive/2023.04.03_미니프로젝트4차_실습자료/y_train.csv")
y_val = pd.read_csv("/content/drive/MyDrive/2023.04.03_미니프로젝트4차_실습자료/y_val.csv")
y_test = pd.read_csv("/content/drive/MyDrive/2023.04.03_미니프로젝트4차_실습자료/y_test.csv")

train_set = pd.read_csv("/content/drive/MyDrive/2023.04.03_미니프로젝트4차_실습자료/train_set.csv")
test_set = pd.read_csv("/content/drive/MyDrive/2023.04.03_미니프로젝트4차_실습자료/test_set.csv")

In [46]:
test_set

Unnamed: 0.1,Unnamed: 0,0
0,0,self convsnn ModuleList nn Conv2d 1Co K100 fo...
1,1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...
2,2,glob glob PATH를 사용할 때 질문입니다 PATH에 가 포함되면 제대로...
3,3,tmpp tmp groupby by Addr1 as index False C...
4,4,filename TEST IMAGE str round frame sec jpg ...
...,...,...
3701,3701,토큰화 이후 train val 를 분리하고 각 train setval set에 벡터...
3702,3702,올린 값들 중 최고점인 건가요아니면 최근에 올린 파일로 무조건 갱신인가요 최고점보...
3703,3703,수업에서 cacoo랑 packet tracer를 배우는 이유가 1IT 인프라 구조...
3704,3704,inplace True 해도 값이 변경이 안되고 none으로 뜹니다혹시 원격지원 ...


In [47]:
x_train.drop(labels='Unnamed: 0',axis=1)
x_val.drop(labels='Unnamed: 0',axis=1)
x_test.drop(labels='Unnamed: 0',axis=1)
y_train.drop(labels='Unnamed: 0',axis=1)
y_val.drop(labels='Unnamed: 0',axis=1)
y_test.drop(labels='Unnamed: 0',axis=1)
train_set.drop(labels='Unnamed: 0',axis=1)
test_set.drop(labels='Unnamed: 0',axis=1)

Unnamed: 0,0
0,self convsnn ModuleList nn Conv2d 1Co K100 fo...
1,현재 이미지를 여러개 업로드 하기 위해 자바스크립트로 동적으로 폼 여러개 생성하는데...
2,glob glob PATH를 사용할 때 질문입니다 PATH에 가 포함되면 제대로...
3,tmpp tmp groupby by Addr1 as index False C...
4,filename TEST IMAGE str round frame sec jpg ...
...,...
3701,토큰화 이후 train val 를 분리하고 각 train setval set에 벡터...
3702,올린 값들 중 최고점인 건가요아니면 최근에 올린 파일로 무조건 갱신인가요 최고점보...
3703,수업에서 cacoo랑 packet tracer를 배우는 이유가 1IT 인프라 구조...
3704,inplace True 해도 값이 변경이 안되고 none으로 뜹니다혹시 원격지원 ...


In [41]:
y_train

Unnamed: 0.1,Unnamed: 0,label
0,1107,0
1,947,4
2,1225,1
3,1754,0
4,704,0
...,...,...
2959,1953,2
2960,2743,1
2961,2502,1
2962,1561,0


In [17]:
import numpy as np
import scipy.sparse

x_tfidf_train = scipy.sparse.load_npz("/content/drive/MyDrive/2023.04.03_미니프로젝트4차_실습자료/x_tfidf_train.npz")
x_tfidf_val = scipy.sparse.load_npz("/content/drive/MyDrive/2023.04.03_미니프로젝트4차_실습자료/x_tfidf_val.npz")
x_tfidf_test = scipy.sparse.load_npz("/content/drive/MyDrive/2023.04.03_미니프로젝트4차_실습자료/x_tfidf_test.npz")

## 3. Machine Learning(N-grams)
* N-gram으로 전처리한 데이터를 이용하여 3개 이상의 Machine Learning 모델 학습 및 성능 분석
> * [sklearn-tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

### 3-1. Model 1

In [19]:
# 로지스틱
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
model = LogisticRegression()

In [23]:
x_train = np.array(x_train)
x_test = np.array(x_test)


In [43]:
from tensorflow.keras.callbacks import EarlyStopping
es = EarlyStopping(monitor='val_loss',
                   min_delta=0,
                   patience=4,
                   verbose=1,
                   restore_best_weights=True)

model.fit(x_tfidf_train, y_train)

ValueError: ignored

In [None]:
y_pred = model.predict(x_test)
# 5단계 평가하기
print(confusion_matrix(x_test, y_pred))
print(classification_report(x_test, y_pred))

### 3-2. Model 2

### 3-3. Model 3

### 3-4. Hyperparameter Tuning(Optional) 
* Manual Search, Grid search, Bayesian Optimization, TPE...
> * [grid search tutorial sklearn](https://scikit-learn.org/stable/modules/grid_search.html)
> * [optuna tutorial](https://optuna.org/#code_examples)
> * [ray-tune tutorial](https://docs.ray.io/en/latest/tune/examples/tune-sklearn.html)

## 4. Deep Learning(Sequence)
* Sequence로 전처리한 데이터를 이용하여 DNN, 1-D CNN, LSTM 등 3가지 이상의 deep learning 모델 학습 및 성능 분석
> * [Google Tutorial](https://developers.google.com/machine-learning/guides/text-classification)
> * [Tensorflow Tutorial](https://www.tensorflow.org/tutorials/keras/text_classification)
> * [Keras-tutorial](https://keras.io/examples/nlp/text_classification_from_scratch/)

### 4-1. DNN

### 4-2. 1-D CNN

### 4-3. LSTM

## 5. Using pre-trained model(Optional)
* 한국어 pre-trained model로 fine tuning 및 성능 분석
> * [BERT-tutorial](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)
> * [HuggingFace-Korean](https://huggingface.co/models?language=korean)

In [15]:
!pip install mxnet
!pip install gluonnlp pandas tqdm
!pip install sentencepiece
!pip install transformers==3.0.2
!pip install torch

!pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mxnet
  Downloading mxnet-1.9.1-py3-none-manylinux2014_x86_64.whl (49.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.1/49.1 MB[0m [31m12.8 MB/s[0m eta [36m0:00:00[0m
Collecting graphviz<0.9.0,>=0.8.1
  Downloading graphviz-0.8.4-py2.py3-none-any.whl (16 kB)
Installing collected packages: graphviz, mxnet
  Attempting uninstall: graphviz
    Found existing installation: graphviz 0.20.1
    Uninstalling graphviz-0.20.1:
      Successfully uninstalled graphviz-0.20.1
Successfully installed graphviz-0.8.4 mxnet-1.9.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gluonnlp
  Downloading gluonnlp-0.10.0.tar.gz (344 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m344.5/344.5 KB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hd

In [18]:
import numpy as np
import pandas as pd
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
import warnings
warnings.filterwarnings('ignore')
 
#토크나이저 불러오기
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
 
#모델 불러오기 
model = TFBertModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states = True)


Downloading:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.08G [00:00<?, ?B/s]

Some layers from the model checkpoint at bert-base-multilingual-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-multilingual-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [None]:
##GPU 사용 시
device = torch.device("cuda:0")

#bert 모델, vocab 불러오기
bertmodel, vocab = get_pytorch_kobert_model()

In [13]:
class BERTDataset(Dataset):
    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer, max_len,
                 pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, pad=pad, pair=pair)

        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i[label_idx]) for i in dataset]

    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))

NameError: ignored