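**Text Data Augmentation**  
<br>
Common techniques include insertion, deletion, swapping, substitution, generation, antonym replacement, spelling correction, and back-translation.  
For the theory, see this blog post:
https://choiwonjin.tistory.com/56

Installing the NLP data augmentation library (NLPAUG)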

In [None]:
# !pip install numpy requests nlpaug transformers sacremoses nltk

In [None]:
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.char as nac

texts = [
    "Those who can imagine anything, can create the impossible.",
    "We can only see a short distance ahead, but we can see plenty there that needs to be done.",
    "If a machine is expected to be infallible, it cannot also be intelligent.",
]

# Insertion and Deletion

### Insertion

In [None]:
aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="insert")
augmented_texts = aug.augment(texts)

for text, augmented in zip(texts, augmented_texts):
  print(f"src : {text}")      # original
  print(f"dst : {augmented}") # augmented result
  print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : those who really can imagine anything, can just create not the beautiful impossible.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : we all can only see about a short running distance ahead, but we can see fact plenty there that it needs to also be something done.
------------------
src : If a machine is expected to be infallible, it cannot also be intelligent.
dst : if the a time machine is expected to be highly infallible, it therefore cannot also automatically be deemed intelligent.
------------------


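The sentences are augmented with words that BERT predicts from the surrounding context, so the original meaning is largely preserved, although an individual insertion (such as "not" in the first example) can still shift it. The output is lowercased because `bert-base-uncased` is an uncased model.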
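
The mechanics of contextual insertion can be sketched without loading a model. The example below is a simplified illustration, not nlpaug's implementation: the function name `insert_with_predictor` and its parameters are assumptions, and `predict` is a hypothetical callable that would normally wrap a masked language model such as BERT.

```python
import random

def insert_with_predictor(text, predict, n_insert=1, mask_token="[MASK]", seed=None):
    """Insert `n_insert` words at random positions, each chosen by `predict`.

    `predict` receives the token list containing exactly one mask token and
    returns the word to place there (hypothetical interface for this sketch).
    """
    rng = random.Random(seed)
    tokens = text.split()
    for _ in range(n_insert):
        pos = rng.randint(0, len(tokens))  # any gap, including both ends
        masked = tokens[:pos] + [mask_token] + tokens[pos:]
        tokens = tokens[:pos] + [predict(masked)] + tokens[pos:]
    return " ".join(tokens)

# Dummy predictor for the demo: it always proposes the word "really".
print(insert_with_predictor("Those who can imagine anything",
                            lambda masked: "really", seed=0))
```

Because the predictor sees the whole masked sentence, a real model can choose a word that fits the context, which is what separates this approach from inserting random vocabulary.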

### Deletion

In [None]:
aug = nac.RandomCharAug(action="delete")
augmented_texts = aug.augment(texts)

for text, augmented in zip(texts, augmented_texts):
  print(f"src : {text}")
  print(f"dst : {augmented}")
  print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : hos who can imgn anyig, can crte the impossible.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : We can nl see a sot distance ahe, but we can see pety hre that nee to be oe.
------------------
src : If a machine is expected to be infallible, it cannot also be intelligent.
dst : If a ahne is expected to be ifllibe, it cnot so be telient.
------------------


Characters within the words have been deleted at random.
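
To make the effect concrete, here is a minimal pure-Python sketch of random character deletion. It is not how nlpaug implements `RandomCharAug`; the function name `delete_random_chars` and the parameters `char_rate` and `min_len` are assumptions chosen for illustration.

```python
import random

def delete_random_chars(text, char_rate=0.3, min_len=4, seed=None):
    """Delete a fraction of the characters in each sufficiently long word.

    Illustrative only: nlpaug applies its own sampling rules.
    """
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if len(word) < min_len:
            out.append(word)  # leave short words untouched
            continue
        n_delete = max(1, int(len(word) * char_rate))
        drop = set(rng.sample(range(len(word)), n_delete))
        out.append("".join(c for i, c in enumerate(word) if i not in drop))
    return " ".join(out)

print(delete_random_chars("Those who can imagine anything", seed=0))
```

Skipping short words keeps tokens such as "who" and "can" readable, which matches the pattern visible in the output above.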

# Swapping and Substitution

### Swapping

In [None]:
aug = naw.RandomWordAug(action="swap")
augmented_texts = aug.augment(texts)

for text, augmented in zip(texts, augmented_texts):
  print(f"src : {text}")
  print(f"dst : {augmented}")
  print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : Who those imagine can anything, can create the impossible.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : Can we only see a short ahead distance but, we can see there plenty that needs be to done.
------------------
src : If a machine is expected to be infallible, it cannot also be intelligent.
dst : A if machine is to be expected infallible, it cannot be intelligent also.
------------------

Randomly chosen neighboring words have been swapped, so the word order changes while the vocabulary stays the same.


### Substitution (1)

In [None]:
# Download the tagging model (averaged_perceptron_tagger_eng) that NLTK uses internally
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [None]:
aug = naw.SynonymAug(aug_src='wordnet')
augmented_texts = aug.augment(texts)

for text, augmented in zip(texts, augmented_texts):
  print(f"src : {text}")
  print(f"dst : {augmented}")
  print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : Those world health organization fundament imagine anything, seat make the impossible.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : We can only see a light distance ahead, just we put up ensure plenty at that place that needs to be make.
------------------
src : If a machine is expected to be infallible, it cannot also be intelligent.
dst : If a political machine follow expected to personify infallible, it cannot also be intelligent.
------------------


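The SynonymAug class augments data by substituting words using the WordNet database or the Paraphrase Database (PPDB).  
Pass `wordnet` or `ppdb` as the `aug_src` argument to choose which source is used.  
<br>
However, this feature does not pick synonyms based on context. It simply substitutes a synonym or near-synonym found in the database, so it can produce sentences whose meaning is completely different from the original (for example, "who" became "world health organization" above). Use it with care.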
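
The context-blindness described above is easy to reproduce with a plain dictionary lookup. The sketch below is an illustration under stated assumptions, not nlpaug's code: `TOY_SYNONYMS` is a hypothetical stand-in for WordNet or PPDB, and `substitute_synonyms` and `aug_p` are names invented for this example.

```python
import random

# Hypothetical toy synonym table standing in for WordNet or PPDB.
TOY_SYNONYMS = {
    "machine": ["device", "political machine"],
    "short": ["brief", "light"],
    "create": ["make", "produce"],
}

def substitute_synonyms(text, synonyms, aug_p=0.5, seed=None):
    """Replace words found in `synonyms` with a random entry, ignoring context."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        key = word.lower().strip(".,")
        if key in synonyms and rng.random() < aug_p:
            # Keep trailing punctuation so the sentence shape is preserved
            tail = word[len(word.rstrip(".,")):]
            out.append(rng.choice(synonyms[key]) + tail)
        else:
            out.append(word)
    return " ".join(out)

print(substitute_synonyms("If a machine is expected to be infallible,",
                          TOY_SYNONYMS, aug_p=1.0, seed=0))
```

Nothing in the lookup knows which sense of "machine" the sentence uses, so a sense such as "political machine" is as likely to be chosen as "device".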

### Substitution (2)

In [None]:
reserved_tokens = [
    ["can", "can't", "cannot", "could"],
]

reserved_aug = naw.ReservedAug(reserved_tokens=reserved_tokens)
augmented_texts = reserved_aug.augment(texts)

for text, augmented in zip(texts, augmented_texts):
  print(f"src : {text}")
  print(f"dst : {augmented}")
  print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : Those who could imagine anything, can't create the impossible.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : We could only see a short distance ahead, but we could see plenty there that needs to be done.
------------------
src : If a machine is expected to be infallible, it cannot also be intelligent.
dst : If a machine is expected to be infallible, it could also be intelligent.
------------------


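The ReservedAug class replaces words in the input with designated alternatives.  
Each inner list in `reserved_tokens` is a group of interchangeable tokens. The augmenter either generates all possible combinations or replaces a matching word or character with another entry declared in `reserved_tokens`.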
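
The group-based replacement can be approximated in a few lines of plain Python. This is a simplified sketch rather than nlpaug's implementation; the function name `replace_reserved` is an assumption, matching is case-sensitive, and every group is assumed to contain at least two tokens.

```python
import random
import re

def replace_reserved(text, reserved_groups, seed=None):
    """Replace each reserved token with a different member of its group."""
    rng = random.Random(seed)
    # Map each token to the other members of its group
    alternatives = {}
    for group in reserved_groups:
        for token in group:
            alternatives[token] = [t for t in group if t != token]
    # Try longer tokens first so "cannot" and "can't" are matched before "can"
    ordered = sorted(alternatives, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
    return pattern.sub(lambda m: rng.choice(alternatives[m.group(1)]), text)

groups = [["can", "can't", "cannot", "could"]]
print(replace_reserved("If a machine is infallible, it cannot also be intelligent.",
                       groups, seed=0))
```

As the output above shows for "cannot" becoming "could", members of a group are treated as interchangeable even when they differ in polarity, so groups should be chosen with the downstream task in mind.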

# Back-Translation

In [None]:
back_translation = naw.BackTranslationAug(
    from_model_name='facebook/wmt19-en-de',
    to_model_name='facebook/wmt19-de-en'
)
augmented_texts = back_translation.augment(texts)

for text, augmented in zip(texts, augmented_texts):
  print(f"src : {text}")
  print(f"dst : {augmented}")
  print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : Anyone who can imagine anything can achieve the impossible.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : We can only take a brief look ahead, but we can see that there is still a lot to be done to be done to be done. We have to be done, and that is a lot of us, and that is a lot of us to be done, and that is a lot of us, and that is a lot of us, and that is to be done, and that is to be done, and that is to be done, and that is a lot of it is to be done, and that is to be done, and that is to be done, and that is to be done, and that is a lot of it, and that is to be done, and that is to be done, and that is to be done, is to be done, and that is to be done, is to be done, and that is to be done, is to be done, is to be done, is to be done, is to be done, is to be done, and that is to be done, and that is to be done, is to be done, is to be

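Back-translation is performed by specifying a source model and a target model.  
Here `from_model_name` translates English into German, and `to_model_name` translates the German back into English.  
In the output above, the second sentence degenerated into repeated phrases, so back-translated text should be inspected before it is used as training data.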
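
The round trip itself is just two translation calls composed together. The sketch below shows that structure with a simple length check added; it is an illustration, not part of nlpaug. The name `back_translate`, the `max_len_ratio` threshold, and the toy lookup tables `EN_DE` and `DE_EN` are all assumptions, and the length heuristic is only one cheap way to catch runaway repetition.

```python
def back_translate(text, to_pivot, from_pivot, max_len_ratio=2.0):
    """Round-trip `text` through a pivot language using two callables.

    Returns None when the result is more than `max_len_ratio` times longer
    than the input, an assumed heuristic for spotting runaway output.
    """
    pivot = to_pivot(text)
    result = from_pivot(pivot)
    if len(result) > max_len_ratio * len(text):
        return None
    return result

# Toy lookup tables standing in for real translation models.
EN_DE = {"Good morning.": "Guten Morgen."}
DE_EN = {"Guten Morgen.": "Good morning to you."}

print(back_translate("Good morning.", EN_DE.get, DE_EN.get))
```

In a real pipeline the two callables would wrap translation models, and samples that return `None` could be dropped or regenerated.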