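**Text Data Augmentation**  
<br>
Common techniques include insertion, deletion, swapping, substitution, generation, antonym replacement, spelling correction, and back-translation.

Installing the NLP data augmentation library (nlpaug)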

In [1]:
!pip install numpy requests nlpaug transformers sacremoses nltk

Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl.metadata (14 kB)
Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Installing collected packages: sacremoses, nlpaug
Successfully installed nlpaug-1.1.11 sacremoses-0.1.1


In [14]:
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.char as nac

texts = [
    "Those who can imagine anything, can create the impossible.",
    "We can only see a short distance ahead, but we can see plenty there that needs to be done.",
    "If a machine is expected to be infallible, it cannot also be intelligent.",
]

# Insertion and Deletion

### Insertion

In [19]:
aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="insert")
augmented_texts = aug.augment(texts)

for text, augmented in zip(texts, augmented_texts):
  print(f"src : {text}")      # original
  print(f"dst : {augmented}") # augmented result
  print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : even those mortals who wonder can imagine so anything, can create the impossible.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : we just can only see them a short sight distance ahead, sure but we can see a plenty there so that needs to work be done.
------------------
src : If a machine is expected to be infallible, it cannot also be intelligent.
dst : if then a computer machine model is expected to be infallible, it cannot presumably also necessarily be inherently intelligent.
------------------


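ContextualWordEmbsAug uses a BERT model to predict words that fit the surrounding context and inserts them into the sentence. The output shows the sentences augmented with extra words while the original meaning is largely preserved.

### Deletion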
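
To make the mechanics concrete, here is a minimal sketch of mask-and-fill insertion. It assumes a `predict` callable standing in for a masked language model such as BERT; `insert_with_predictor`, `toy_predict`, and the `aug_p` default are hypothetical names chosen for illustration and are not nlpaug's implementation.

```python
import random
from typing import Callable, List


def insert_with_predictor(
    text: str,
    predict: Callable[[List[str], int], str],
    aug_p: float = 0.3,
    seed: int = 0,
) -> str:
    """Insert predicted words at random positions (illustrative only)."""
    rng = random.Random(seed)
    tokens = text.split()
    # The number of insertions scales with sentence length, with a minimum of one.
    n_insert = max(1, int(len(tokens) * aug_p))
    for _ in range(n_insert):
        pos = rng.randint(0, len(tokens))        # slot where the mask is placed
        masked = tokens[:pos] + ["[MASK]"] + tokens[pos:]
        masked[pos] = predict(masked, pos)       # fill the mask from its context
        tokens = masked
    return " ".join(tokens)


def toy_predict(masked_tokens: List[str], pos: int) -> str:
    """Hypothetical stand-in for a masked language model; always answers 'really'."""
    return "really"


result = insert_with_predictor("we can see plenty", toy_predict, aug_p=0.5)
print(result)
```

A real predictor would look at `masked_tokens` to choose a word that fits, which is why the inserted words in the notebook output read naturally.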

In [20]:
aug = nac.RandomCharAug(action="delete")
augmented_texts = aug.augment(texts)

for text, augmented in zip(texts, augmented_texts):
  print(f"src : {text}")
  print(f"dst : {augmented}")
  print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : hse who can imagine antng, can reae the iossble.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : We can oy see a hot dince ahe, but we can see lent there tt needs to be do.
------------------
src : If a machine is expected to be infallible, it cannot also be intelligent.
dst : If a mhie is epece to be ifalibl, it caot also be nelliet.
------------------


Characters within each word were deleted at random.

# Swap and Substitution

### Swap
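
The same idea can be sketched in plain Python. This is an illustrative approximation, not nlpaug's code: `delete_random_chars`, `aug_char_p`, and `min_len` are hypothetical names, and the sketch assumes whitespace tokenization.

```python
import random


def delete_random_chars(
    text: str, aug_char_p: float = 0.3, min_len: int = 3, seed: int = 0
) -> str:
    """Drop a fraction of the characters from each sufficiently long word."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if len(word) < min_len:
            out.append(word)  # leave very short words untouched
            continue
        # Delete roughly aug_char_p of the characters, at least one.
        n_delete = max(1, int(len(word) * aug_char_p))
        drop = set(rng.sample(range(len(word)), n_delete))
        out.append("".join(ch for i, ch in enumerate(word) if i not in drop))
    return " ".join(out)


sample = "If a machine is expected to be infallible"
noisy = delete_random_chars(sample)
print(noisy)
```

Character-level noise of this kind is mainly useful for making a model robust to typos, since the result is no longer well-formed text.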

In [21]:
aug = naw.RandomWordAug(action="swap")
augmented_texts = aug.augment(texts)

for text, augmented in zip(texts, augmented_texts):
  print(f"src : {text}")
  print(f"dst : {augmented}")
  print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : Those imagine who can, anything create can the impossible.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : We can only see short a ahead distance, we but see can plenty there needs that be to. done
------------------
src : If a machine is expected to be infallible, it cannot also be intelligent.
dst : A if machine to is expected be infallible, it cannot also be intelligent.
------------------
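
The output above is consistent with neighbouring words trading places. A minimal sketch under that assumption follows; `swap_adjacent_words` and `n_swaps` are hypothetical names, and punctuation stays attached to its word, unlike in the library output.

```python
import random


def swap_adjacent_words(text: str, n_swaps: int = 2, seed: int = 0) -> str:
    """Swap randomly chosen pairs of neighbouring words."""
    rng = random.Random(seed)
    tokens = text.split()
    if len(tokens) < 2:
        return text  # nothing to swap
    for _ in range(n_swaps):
        i = rng.randrange(len(tokens) - 1)  # index of the left word in the pair
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return " ".join(tokens)


sample = "Those who can imagine anything"
swapped = swap_adjacent_words(sample)
print(swapped)
```

Swapping keeps the vocabulary intact and only disturbs word order, so it suits tasks where order matters little, such as bag-of-words style classification.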


### Substitution (1)

In [11]:
# Download the tagging model (averaged_perceptron_tagger_eng) that NLTK uses internally
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [22]:
aug = naw.SynonymAug(aug_src='wordnet')
augmented_texts = aug.augment(texts)

for text, augmented in zip(texts, augmented_texts):
  print(f"src : {text}")
  print(f"dst : {augmented}")
  print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : Those world health organization can guess anything, backside create the out of the question.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : We can entirely see a short distance ahead, simply we buns see plenty there that needs to exist done.
------------------
src : If a machine is expected to be infallible, it cannot also be intelligent.
dst : If a machine exist ask to be infallible, it cannot also be intelligent.
------------------


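The SynonymAug class augments data by substituting words using the WordNet database or the Paraphrase Database (PPDB).  
Passing `wordnet` or `ppdb` as the `aug_src` argument selects the database used to reword the sentence.  
<br>
Note, however, that this feature does not choose synonyms based on context. It simply substitutes synonyms or near-synonyms found in the database, so it can produce sentences that bear no relation to the original context. In the output above, for example, "who" became "world health organization" and "can" became "backside".

### Substitution (2)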
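
The following sketch illustrates why context-blind lookup goes wrong. `TOY_SYNONYMS` is a hypothetical mini thesaurus whose entries echo the notebook output above, and `substitute_synonyms` is an illustrative helper, not nlpaug's implementation.

```python
from typing import Dict, List

# Hypothetical thesaurus: each word maps to candidates with no sense information.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "who": ["world health organization"],  # the acronym WHO, not the pronoun
    "can": ["backside", "buns"],           # the noun sense, not the modal verb
    "imagine": ["guess"],
}


def substitute_synonyms(text: str, synonyms: Dict[str, List[str]]) -> str:
    """Replace every word found in the thesaurus, ignoring context."""
    out = []
    for token in text.split():
        word = token.rstrip(",.")
        tail = token[len(word):]  # trailing punctuation, if any
        candidates = synonyms.get(word.lower())
        # The first candidate is taken regardless of the word's sense here.
        out.append(candidates[0] + tail if candidates else token)
    return " ".join(out)


src = "Those who can imagine anything, can create the impossible."
print(substitute_synonyms(src, TOY_SYNONYMS))
# Those world health organization backside guess anything, backside create the impossible.
```

Because the lookup key is only the surface form, the modal "can" and the noun "can" are indistinguishable, which is exactly the failure visible in the SynonymAug output.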

In [23]:
reserved_tokens = [
    ["can", "can't", "cannot", "could"],
]

reserved_aug = naw.ReservedAug(reserved_tokens=reserved_tokens)
augmented_texts = reserved_aug.augment(texts)

for text, augmented in zip(texts, augmented_texts):
  print(f"src : {text}")
  print(f"dst : {augmented}")
  print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : Those who could imagine anything, could create the impossible.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : We can't only see a short distance ahead, but we could see plenty there that needs to be done.
------------------
src : If a machine is expected to be infallible, it cannot also be intelligent.
dst : If a machine is expected to be infallible, it can also be intelligent.
------------------


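The ReservedAug class replaces words in the input data with specific, predefined words.  
It can generate every possible combination, or replace particular words or characters with the entries declared in `reserved_tokens`. Because members of a group are treated as interchangeable, a substitution can reverse the meaning, as when "cannot" became "can" in the third sentence above.

# Back-Translation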
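
A minimal sketch of group-based substitution, assuming whitespace tokenization and that every matching word is replaced by a different member of its group. `replace_reserved` is a hypothetical helper written for illustration, not the library's code.

```python
import random
from typing import Dict, List


def replace_reserved(text: str, reserved_tokens: List[List[str]], seed: int = 0) -> str:
    """Replace each reserved word with another word from the same group."""
    rng = random.Random(seed)
    # Map every reserved word to the other members of its group.
    alternatives: Dict[str, List[str]] = {
        tok: [alt for alt in group if alt != tok]
        for group in reserved_tokens
        for tok in group
    }
    out = []
    for token in text.split():
        word = token.rstrip(",.")
        tail = token[len(word):]  # trailing punctuation, if any
        options = alternatives.get(word)
        out.append(rng.choice(options) + tail if options else token)
    return " ".join(out)


groups = [["can", "can't", "cannot", "could"]]
sample = "it cannot also be intelligent."
result = replace_reserved(sample, groups)
print(result)
```

Grouping a word with its negation, as the notebook's `reserved_tokens` does, means the label of a training example may no longer match the augmented text, so groups are best limited to words that are truly interchangeable for the task.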

In [24]:
back_translation = naw.BackTranslationAug(
    from_model_name='facebook/wmt19-en-de',
    to_model_name='facebook/wmt19-de-en'
)
augmented_texts = back_translation.augment(texts)

for text, augmented in zip(texts, augmented_texts):
  print(f"src : {text}")
  print(f"dst : {augmented}")
  print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : Anyone who can imagine anything can achieve the impossible.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : We can only take a brief look ahead, but we can see that there is still a lot to be done to be done to be done. We have to be done, and that is a lot of us, and that is a lot of us to be done, and that is a lot of us, and that is a lot of us, and that is to be done, and that is to be done, and that is to be done, and that is a lot of it is to be done, and that is to be done, and that is to be done, and that is to be done, and that is a lot of it, and that is to be done, and that is to be done, and that is to be done, is to be done, and that is to be done, is to be done, and that is to be done, is to be done, is to be done, is to be done, is to be done, is to be done, and that is to be done, and that is to be done, is to be done, is to be

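Back-translation is performed by specifying an input model and an output model.  
The input model (`from_model_name`) translates English into German, and the output model (`to_model_name`) translates the German back into English.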
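
The round trip can be sketched with any pair of translation callables. In the sketch below, `back_translate`, `to_german`, `to_english`, and `looping_decoder` are hypothetical stand-ins for real models, and the `max_len_ratio` guard is an assumption added here because the second output above degenerates into repetition.

```python
from typing import Callable, List

Translator = Callable[[str], str]


def back_translate(
    texts: List[str],
    forward: Translator,
    backward: Translator,
    max_len_ratio: float = 2.0,
) -> List[str]:
    """Translate each text to a pivot language and back again."""
    results = []
    for text in texts:
        candidate = backward(forward(text))
        # Guard (an assumption of this sketch): fall back to the original
        # when the round trip is far longer than the input.
        if len(candidate) > max_len_ratio * len(text):
            candidate = text
        results.append(candidate)
    return results


# Toy lookup tables standing in for English-to-German and German-to-English models.
EN_DE = {"the house is small": "das Haus ist klein"}
DE_EN = {"das Haus ist klein": "the home is small"}


def to_german(text: str) -> str:
    return EN_DE.get(text, text)


def to_english(text: str) -> str:
    return DE_EN.get(text, text)


def looping_decoder(text: str) -> str:
    """Simulates a decoder stuck in a repetition loop."""
    return "to be done " * 20


print(back_translate(["the house is small"], to_german, to_english))
print(back_translate(["the house is small"], to_german, looping_decoder))
```

A length-ratio check is only one possible filter. Whatever the heuristic, inspecting back-translated samples before adding them to a training set is worthwhile.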