
In [18]:
!pip install ratsnlp
from google.colab import drive
from Korpora import Korpora
import os
from tokenizers import ByteLevelBPETokenizer, BertWordPieceTokenizer
from transformers import GPT2Tokenizer, BertTokenizer



In [2]:
drive.mount('/gdrive', force_remount=True)

Mounted at /gdrive


In [4]:
# Download and preprocess the corpus
nsmc = Korpora.load("nsmc", force_download=True)

def write_lines(path, lines):
    # Write one text per line so the tokenizer trainers can read plain-text files
    with open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(f'{line}\n')

write_lines("/root/train.txt", nsmc.train.get_all_texts())
write_lines("/root/test.txt", nsmc.test.get_all_texts())


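    Korpora only provides functionality for easily downloading and using
    corpora that others have shared for research purposes.

    We thank everyone who shared these corpora, and we share the description and license of each corpus.
    If you want to know more about a corpus, see the description below;
    if you use a corpus for research or commercial purposes, see the license below.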

    # Description
    Author : e9t@github
    Repository : https://github.com/e9t/nsmc
    References : www.lucypark.kr/docs/2015-pyconkr/#39

    Naver sentiment movie corpus v1.0
    This is a movie review dataset in the Korean language.
    Reviews were scraped from Naver Movies.

    The dataset construction is based on the method noted in
    [Large movie review dataset][^1] from Maas et al., 2011.

    [^1]: http://ai.stanford.edu/~amaas/data/sentiment/

    # License
    CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
    Details in https://creativecommons.org/publicdomain/zero/1.0/



[nsmc] download ratings_train.txt: 14.6MB [00:00, 118MB/s]
[nsmc] download ratings_test.txt: 4.90MB [00:00, 43.9MB/s]


In [11]:
# Build the GPT tokenizer (byte-level BPE)
os.makedirs("/gdrive/MyDrive/nlpbook/bbpe", exist_ok=True)

bytebpe_tokenizer = ByteLevelBPETokenizer()
bytebpe_tokenizer.train(
    files=["/root/train.txt", "/root/test.txt"],
    vocab_size=10000,
    special_tokens=["[PAD]"],  # register the padding token in the vocabulary
)
# Writes vocab.json and merges.txt to the target directory
bytebpe_tokenizer.save_model("/gdrive/MyDrive/nlpbook/bbpe")

['/gdrive/MyDrive/nlpbook/bbpe/vocab.json',
 '/gdrive/MyDrive/nlpbook/bbpe/merges.txt']
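
A byte-level BPE tokenizer starts from UTF-8 bytes rather than characters, and the merges it learns are built on top of those bytes. The sketch below is not part of the original notebook; it uses only the Python standard library to show how one of the review sentences looks as raw bytes before any merge is applied. The variable names are illustrative.

```python
# Illustration only: byte-level BPE begins with one symbol per UTF-8 byte,
# then learns merges over those bytes during training.
text = "별루 였다.."

raw_bytes = text.encode("utf-8")

# A Hangul syllable takes several bytes in UTF-8, while a space or a period takes one,
# so the byte sequence is longer than the character sequence.
num_chars = len(text)
num_bytes = len(raw_bytes)

print(num_chars, num_bytes)
print([len(ch.encode("utf-8")) for ch in text])  # bytes per character
```

Because every string can be written as bytes, a vocabulary built this way can represent any input, which is why the training call above registers only `[PAD]` as a special token.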

In [14]:
# Build the BERT tokenizer (WordPiece)
os.makedirs("/gdrive/MyDrive/nlpbook/wordpiece", exist_ok=True)

# lowercase=False keeps the original casing of the text
wordpiece_tokenizer = BertWordPieceTokenizer(lowercase=False)
wordpiece_tokenizer.train(
    files=["/root/train.txt", "/root/test.txt"],
    vocab_size=10000,
)
# Writes vocab.txt to the target directory
wordpiece_tokenizer.save_model("/gdrive/MyDrive/nlpbook/wordpiece")

['/gdrive/MyDrive/nlpbook/wordpiece/vocab.txt']
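
Once `vocab.txt` exists, a WordPiece tokenizer splits each word by greedy longest-match-first lookup against that vocabulary. The following is a minimal sketch of that lookup, not the library's implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the toy vocabulary is made up for illustration rather than taken from the trained file.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into vocab pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until it is in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matches, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the contents of the trained vocab.txt
toy_vocab = {"짜증", "##나", "##네요", "진짜", "[UNK]"}

pieces = wordpiece_tokenize("짜증나네요", toy_vocab)
unknown = wordpiece_tokenize("목소리", toy_vocab)
print(pieces)
print(unknown)
```

The sketch covers only how an existing vocabulary is applied; how the trainer chooses which pieces enter the vocabulary is a separate step handled by `wordpiece_tokenizer.train`.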

In [22]:
# Build GPT inputs
tokenizer_gpt = GPT2Tokenizer.from_pretrained("/gdrive/MyDrive/nlpbook/bbpe")
tokenizer_gpt.pad_token = "[PAD]"  # GPT2Tokenizer has no pad token by default

sentences = [
    "아 더빙.. 진짜 짜증나네요 목소리",
    "흠.. 포스터보고 초딩영환줄.. 오버연기조차 가볍지 않구나",
    "별루 였다..",
]

# Token strings for each sentence, useful for inspecting the segmentation
tokenized_sentences = [tokenizer_gpt.tokenize(sentence) for sentence in sentences]

# Pad or truncate every sentence to exactly 12 tokens
batch_inputs_gpt = tokenizer_gpt(
    sentences,
    padding="max_length",
    max_length=12,
    truncation=True,
)

file /gdrive/MyDrive/nlpbook/bbpe/config.json not found
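
The `padding="max_length"`, `max_length=12`, and `truncation=True` arguments make every sequence the same length. As a rough illustration of what that means for `input_ids` and `attention_mask`, here is a standard-library sketch. `pad_or_truncate` is a hypothetical helper, the token ids are made up, and `pad_id=0` and right-side padding are assumptions for the example, not values read from the trained tokenizer.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # pad on the right with the pad token id
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up ids; pad_id=0 is an assumption for this example
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=6, pad_id=0)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6, pad_id=0)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

In the notebook, the corresponding values can be inspected directly through `batch_inputs_gpt["input_ids"]` and `batch_inputs_gpt["attention_mask"]`.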


In [23]:
# Build BERT inputs
tokenizer_bert = BertTokenizer.from_pretrained(
    "/gdrive/MyDrive/nlpbook/wordpiece",
    do_lower_case=False,  # match lowercase=False used when training the vocabulary
)

sentences = [
    "아 더빙.. 진짜 짜증나네요 목소리",
    "흠.. 포스터보고 초딩영환줄.. 오버연기조차 가볍지 않구나",
    "별루 였다..",
]

# Token strings for each sentence, useful for inspecting the segmentation
tokenized_sentences = [tokenizer_bert.tokenize(sentence) for sentence in sentences]

# Pad or truncate every sentence to exactly 12 tokens
batch_inputs_bert = tokenizer_bert(
    sentences,
    padding="max_length",
    max_length=12,
    truncation=True,
)

file /gdrive/MyDrive/nlpbook/wordpiece/config.json not found
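
Unlike the GPT inputs, BERT inputs for a single sentence are wrapped in `[CLS]` and `[SEP]` and come with a `token_type_ids` field. The sketch below illustrates that layout with the standard library only. `build_bert_inputs` is a hypothetical helper, and the ids passed for `cls_id`, `sep_id`, and `pad_id` are placeholders chosen for the example, not values read from the trained `vocab.txt`.

```python
def build_bert_inputs(token_ids, max_length, cls_id, sep_id, pad_id):
    """Lay out one sentence as [CLS] tokens [SEP] followed by padding."""
    body = token_ids[: max_length - 2]        # reserve two slots for [CLS] and [SEP]
    input_ids = [cls_id] + body + [sep_id]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,  # 0 marks padding
        "token_type_ids": [0] * max_length,   # a single sentence is all segment 0
    }


# Placeholder ids for the special tokens (assumptions for this example)
short = build_bert_inputs([10, 11, 12], max_length=8, cls_id=2, sep_id=3, pad_id=0)
long = build_bert_inputs(list(range(10, 30)), max_length=8, cls_id=2, sep_id=3, pad_id=0)

print(short)
print(long["input_ids"])
```

In the notebook, the same three fields are available as keys of `batch_inputs_bert`.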
