# Data Augmentation
> "NLP"

- toc: true
- branch: master
- badges: true
- comments: true
- author: 이강철, Department of Statistics, Jeonbuk National University
- categories: [python]
- hide: false
- published: true

# Introduction

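* The code in this post is taken from a project that ports EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks to Korean by replacing only the WordNet component.

* The original data was augmented to 10, 5, and 2 times its original size.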

# Function Implementation

In [16]:
import random
import pickle
import re

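# Load the Korean WordNet dictionary used for synonym lookup
wordnet = {}
with open("wordnet.pickle", "rb") as f:
    wordnet = pickle.load(f)


# Keep only Hangul characters and spaces; delete everything else
def get_only_hangul(line):
    # Spaces are kept because EDA later splits the sentence on ' '
    parseText = re.sub(r'[^ㄱ-ㅎㅏ-ㅣ가-힣 ]', '', line)

    return parseText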



########################################################################
# Synonym replacement
# Replace n words in the sentence with synonyms from wordnet
########################################################################
def synonym_replacement(words, n):
    new_words = words.copy()
    random_word_list = list(set(words))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(list(synonyms))
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1
        if num_replaced >= n:
            break

    if len(new_words) != 0:
        sentence = ' '.join(new_words)
        new_words = sentence.split(" ")

    else:
        # Return an empty list (not a string) so the return type is consistent
        new_words = []

    return new_words


def get_synonyms(word):
    synonyms = []

    try:
        for syn in wordnet[word]:
            for s in syn:
                synonyms.append(s)
    except KeyError:
        # The word has no entry in wordnet, so it has no synonyms
        pass

    return synonyms

########################################################################
# Random deletion
# Randomly delete words from the sentence with probability p
########################################################################
def random_deletion(words, p):
    # Nothing to delete from an empty or single-word sentence
    if len(words) <= 1:
        return words

    new_words = []
    for word in words:
        r = random.uniform(0, 1)
        if r > p:
            new_words.append(word)

    if len(new_words) == 0:
        rand_int = random.randint(0, len(words)-1)
        return [words[rand_int]]

    return new_words

########################################################################
# Random swap
# Randomly swap two words in the sentence n times
########################################################################
def random_swap(words, n):
    new_words = words.copy()
    for _ in range(n):
        new_words = swap_word(new_words)

    return new_words

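def swap_word(new_words):
    # Fewer than two words: nothing to swap (random.randint(0, -1) would raise on an empty list)
    if len(new_words) < 2:
        return new_words
    random_idx_1 = random.randint(0, len(new_words)-1)
    random_idx_2 = random_idx_1
    counter = 0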

    while random_idx_2 == random_idx_1:
        random_idx_2 = random.randint(0, len(new_words)-1)
        counter += 1
        if counter > 3:
            return new_words

    new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1]
    return new_words

########################################################################
# Random insertion
# Randomly insert n words into the sentence
########################################################################
def random_insertion(words, n):
    new_words = words.copy()
    for _ in range(n):
        add_word(new_words)

    return new_words


def add_word(new_words):
    # An empty sentence has no word to draw a synonym from;
    # returning early also prevents the loop below from running forever
    if len(new_words) == 0:
        return

    synonyms = []
    counter = 0
    while len(synonyms) < 1:
        random_word = new_words[random.randint(0, len(new_words)-1)]
        synonyms = get_synonyms(random_word)
        counter += 1

        # Give up after 10 attempts
        if counter >= 10:
            return

    # Insert the first synonym found at a random position
    random_synonym = synonyms[0]
    random_idx = random.randint(0, len(new_words)-1)
    new_words.insert(random_idx, random_synonym)



def EDA(sentence, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=1):
    sentence = get_only_hangul(sentence)
    words = sentence.split(' ')
    words = [word for word in words if word != ""]
    num_words = len(words)

    augmented_sentences = []
    num_new_per_technique = int(num_aug/4) + 1

    n_sr = max(1, int(alpha_sr*num_words))
    n_ri = max(1, int(alpha_ri*num_words))
    n_rs = max(1, int(alpha_rs*num_words))

    # sr
    for _ in range(num_new_per_technique):
        a_words = synonym_replacement(words, n_sr)
        augmented_sentences.append(' '.join(a_words))

    # ri
    for _ in range(num_new_per_technique):
        a_words = random_insertion(words, n_ri)
        augmented_sentences.append(' '.join(a_words))

    # rs
    for _ in range(num_new_per_technique):
        a_words = random_swap(words, n_rs)
        augmented_sentences.append(" ".join(a_words))

    # rd
    for _ in range(num_new_per_technique):
        a_words = random_deletion(words, p_rd)
        augmented_sentences.append(" ".join(a_words))

    augmented_sentences = [get_only_hangul(sentence) for sentence in augmented_sentences]
    random.shuffle(augmented_sentences)

    if num_aug >= 1:
        augmented_sentences = augmented_sentences[:num_aug]
    else:
        keep_prob = num_aug / len(augmented_sentences)
        augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob]

    augmented_sentences.append(sentence)

    return augmented_sentences
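
The four operations can be tried without `wordnet.pickle` by swapping in a small synonym table. The block below is a simplified sketch, not the KorEDA code itself: `toy_wordnet` and the `toy_*` functions are hypothetical names introduced here for illustration, and the swap and insertion steps pick positions directly instead of using the retry loops above.

```python
import random

# Hypothetical toy synonym table standing in for wordnet.pickle (assumption)
toy_wordnet = {"주택": ["집"], "업체": ["회사"]}

def toy_synonym_replacement(words, n, rng):
    """Replace up to n distinct words that have an entry in toy_wordnet."""
    new_words = list(words)
    candidates = sorted(set(words))
    rng.shuffle(candidates)
    replaced = 0
    for w in candidates:
        if replaced >= n:
            break
        synonyms = toy_wordnet.get(w, [])
        if synonyms:
            choice = rng.choice(synonyms)
            new_words = [choice if x == w else x for x in new_words]
            replaced += 1
    return new_words

def toy_random_insertion(words, n, rng):
    """Insert up to n synonyms of words already in the sentence."""
    new_words = list(words)
    for _ in range(n):
        with_syn = [w for w in new_words if w in toy_wordnet]
        if not with_syn:
            break
        synonym = rng.choice(toy_wordnet[rng.choice(with_syn)])
        new_words.insert(rng.randint(0, len(new_words)), synonym)
    return new_words

def toy_random_swap(words, n, rng):
    """Swap two distinct positions, n times."""
    new_words = list(words)
    if len(new_words) < 2:
        return new_words
    for _ in range(n):
        i, j = rng.sample(range(len(new_words)), 2)
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words

def toy_random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)  # seeded so repeated runs give the same result
words = "주택 전시관 입점 업체".split()
sr = toy_synonym_replacement(words, 1, rng)
ri = toy_random_insertion(words, 1, rng)
rs = toy_random_swap(words, 1, rng)
rd = toy_random_deletion(words, 0.1, rng)
for name, result in [("sr", sr), ("ri", ri), ("rs", rs), ("rd", rd)]:
    print(name, " ".join(result))
```

Each function copies its input, so the same `words` list can be passed to all four techniques in turn, which is also how `EDA` uses them.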



# Loading the Existing Data

In [17]:
import pandas as pd

In [18]:
data = pd.read_csv("kobert입력데이터.csv")

## Converting the topic and document columns to string

In [19]:
data["topic"] = data["topic"].astype(str)
data["document"] = data["document"].astype(str)

In [20]:
length = len(data)

## Creating an empty DataFrame

In [21]:
total = pd.DataFrame()

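## Running the function and stacking the results per document

* Adding 9 augmented sentences to each document turns 1 document into 10. This process is repeated at 10x, 5x, and 2x to build the datasets on which the model will be fit. The call shown here uses the default `num_aug=1`, which corresponds to the 2x set (one augmented sentence plus the original).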
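
The relation between `num_aug` and the resulting multiple follows from the arithmetic inside `EDA`. The helper below, `eda_output_count`, is a hypothetical name introduced only to trace that arithmetic; it assumes an integer `num_aug >= 1` and mirrors the formulas in the function above.

```python
def eda_output_count(num_aug):
    """Number of sentences EDA returns for an integer num_aug >= 1."""
    per_technique = int(num_aug / 4) + 1  # same formula as num_new_per_technique
    candidates = 4 * per_technique        # sr, ri, rs and rd each contribute this many
    kept = min(candidates, num_aug)       # augmented_sentences[:num_aug]
    return kept + 1                       # the original sentence is appended last

for multiple in (10, 5, 2):
    num_aug = multiple - 1
    print(multiple, num_aug, eda_output_count(num_aug))
```

Under these assumptions, a target multiple of `k` corresponds to calling `EDA(..., num_aug=k-1)`, since the original sentence always accounts for one of the returned rows.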

In [22]:
# Build one small frame per document, then concatenate once at the end
frames = []
for i in range(length):
    text = EDA(data["clean_txt1"][i])
    l = len(text)

    temp = pd.DataFrame()
    temp["topic"] = [data["topic"][i]] * l
    temp["text"] = text
    temp["document"] = [data["document"][i]] * l
    frames.append(temp)

total = pd.concat([total, *frames])
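
The stacking step can be checked on a tiny frame before running it on the full data. This is a minimal sketch under stated assumptions: `fake_augment` is a hypothetical stand-in for `EDA` that returns one reversed-word variant plus the original, and `toy` is made-up data that only mimics the column names used above.

```python
import pandas as pd

def fake_augment(text):
    """Hypothetical stand-in for EDA: one reversed-word variant plus the original."""
    words = text.split(" ")
    return [" ".join(reversed(words)), text]

# Made-up rows using the same column names as the real data
toy = pd.DataFrame({
    "document": ["1", "2"],
    "topic": ["15", "5"],
    "clean_txt1": ["주택 전시관 업체", "엄마 하늘 식구"],
})

frames = []
for row in toy.itertuples(index=False):
    variants = fake_augment(row.clean_txt1)
    # The scalar document/topic values are broadcast to the length of `variants`
    frames.append(pd.DataFrame({
        "document": row.document,
        "topic": row.topic,
        "text": variants,
    }))

# ignore_index=True renumbers the rows 0..n-1 in a single step
stacked = pd.concat(frames, ignore_index=True)
print(stacked)
```

Collecting the per-document frames in a list and concatenating once avoids copying the growing result on every iteration.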

In [23]:
total = total[["document", "topic", "text"]]

In [24]:
total = total.reset_index(drop=True)

In [25]:
total.head()

| | document | topic | text |
|---|---|---|---|
| 0 | 1 | 15 | 존경 지지 주택 전시관 입점 업체 임차인 주택 전시관 업체 심정 호소 강제 철거 업... |
| 1 | 1 | 15 | 존경 지지 주택 전시관 입점 업체 임차인 주택 전시관 업체 심정 호소 강제 철거 업... |
| 2 | 2 | 5 | 올해 여자 작년 치매 고생 사 엄마 하늘 식구 엄마 우울증 무기력 상태 연 풀칠 강... |
| 3 | 2 | 5 | 올해 여자 작년 치매 고생 엄마 하늘 식구 엄마 우울증 무기력 상태 풀칠 강아지 연... |
| 4 | 3 | 20 | 과학 고등학교 직전 학년 고등학교 선행 학습 학원 치열 주위 고등학교 국어 수학 학... |


## Saving the finished data

In [26]:
total.to_csv("kobert데이터증강(n=2).csv")
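
One detail to keep in mind when the saved file is read back: `to_csv` writes the row index by default, and `read_csv` then loads it as an extra column named `Unnamed: 0`. The sketch below illustrates this with made-up data and an in-memory buffer instead of a real file; passing `index=False` is one option if the index is not needed.

```python
import io
import pandas as pd

# Made-up rows; an in-memory buffer stands in for the CSV file
toy = pd.DataFrame({"document": ["1", "1"], "topic": ["15", "15"], "text": ["a b", "b a"]})

with_index = io.StringIO()
toy.to_csv(with_index)                   # default index=True writes the row index too
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

without_index = io.StringIO()
toy.to_csv(without_index, index=False)   # skip the row index
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```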

# References

[1] [Blog review of the data augmentation paper](https://catsirup.github.io/ai/2020/04/21/nlp_data_argumentation.html)

[2] [Data augmentation code (KorEDA)](https://github.com/catSirup/KorEDA/tree/master)