<a href="https://colab.research.google.com/github/dolmani38/Summary/blob/master/korean_QA_from_wiki.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building a QA System with BERT on Wikipedia

https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html

This notebook adapts the tutorial above to Korean question answering.

In [112]:
!pip install transformers==3
!pip install wikipedia
!pip install sentence-transformers



The reader is based on KorQuAD 2.0, the Korean counterpart of SQuAD:
https://korquad.github.io/


In [113]:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

SQUAD_MODEL = "monologg/koelectra-base-v3-finetuned-korquad"

# executing these commands for the first time initiates a download of the 
# model weights to ~/.cache/torch/transformers/
tokenizer = AutoTokenizer.from_pretrained(SQUAD_MODEL) 
model = AutoModelForQuestionAnswering.from_pretrained(SQUAD_MODEL)


In [None]:
question = "마케도니아를 통치 한 왕조는?"  # "Which dynasties ruled Macedonia?"

context = """마케도니아는 고대 그리스와 고전 그리스 주변의 고대 왕국이었습니다.
그리고 나중에 헬레니즘 그리스의 지배 국가. 왕국이 설립되고 처음에 통치
Argead 왕조, Antipatrid 및 Antigonid 왕조가 그 뒤를이었습니다. 고대의 고향
마케도니아인, 그리스 반도의 북동부에서 시작되었습니다. 4 일 이전
기원전 세기, 아테네의 도시 국가가 지배하는 지역 밖의 작은 왕국이었습니다.
스파르타와 테베, 그리고 잠시 아케 메니 드 페르시아에 종속됩니다."""


# 1. TOKENIZE THE INPUT
# note: if you don't include return_tensors='pt' you'll get a list of lists which is easier for 
# exploration but you cannot feed that into a model. 
inputs = tokenizer.encode_plus(question, context, return_tensors="pt") 

# 2. OBTAIN MODEL SCORES
# the AutoModelForQuestionAnswering class includes a span predictor on top of the model. 
# the model returns answer start and end scores for each token in the input (question + context)
answer_start_scores, answer_end_scores = model(**inputs)

answer_start = torch.argmax(answer_start_scores)  # get the most likely beginning of answer with the argmax of the score
answer_end = torch.argmax(answer_end_scores) + 1  # get the most likely end of answer with the argmax of the score

# 3. GET THE ANSWER SPAN
# once we have the most likely start and end tokens, we grab all the tokens between them
# and convert tokens back to words!
tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))
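
Taking the two argmaxes independently, as in the cell above, can produce an end position that falls before the start, which yields an empty answer. One common alternative is to score every valid (start, end) pair jointly and cap the span length. The sketch below does this on plain Python lists; the function name `best_span` and the `max_answer_len` default are illustrative assumptions, not part of this notebook or of `transformers`.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end_exclusive) maximizing start_scores[s] + end_scores[e]
    subject to s <= e < s + max_answer_len."""
    best = (0, 1)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e + 1)  # exclusive end, like `argmax + 1` in the cell above
    return best


# Toy scores: the independent argmaxes would give start=3 and end=1+1=2,
# i.e. the empty slice [3:2].
start = [0.1, 0.2, 0.3, 5.0, 0.1]
end = [0.1, 4.0, 0.2, 0.3, 3.5]
print(best_span(start, end))  # prints (3, 5)
```

With the model outputs above, the per-token score lists would presumably come from something like `answer_start_scores[0].tolist()` and `answer_end_scores[0].tolist()`.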

In [2]:
import torch  # needed by chunkify() and get_answer() below
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import wikipedia as wiki
import pprint as pp
from collections import OrderedDict

class DocumentReader:
    def __init__(self, pretrained_model_name_or_path=''):
        self.READER_PATH = pretrained_model_name_or_path
        self.tokenizer = AutoTokenizer.from_pretrained(self.READER_PATH)
        self.model = AutoModelForQuestionAnswering.from_pretrained(self.READER_PATH)
        self.max_len = self.model.config.max_position_embeddings
        self.chunked = False

    def tokenize(self, question, text):
        # reset the flag so a short text that follows a long one is not treated as chunked
        self.chunked = False
        self.inputs = self.tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
        self.input_ids = self.inputs["input_ids"].tolist()[0]

        if len(self.input_ids) > self.max_len:
            self.inputs = self.chunkify()
            self.chunked = True

    def chunkify(self):
        """ 
        Break up a long article into chunks that fit within the max token
        requirement for that Transformer model. 

        Calls to BERT / RoBERTa / ALBERT require the following format:
        [CLS] question tokens [SEP] context tokens [SEP].
        """

        # create question mask based on token_type_ids
        # value is 0 for question tokens, 1 for context tokens
        qmask = self.inputs['token_type_ids'].lt(1)
        qt = torch.masked_select(self.inputs['input_ids'], qmask)
        chunk_size = self.max_len - qt.size()[0] - 1 # the "-1" accounts for
        # having to add an ending [SEP] token to the end

        # create a dict of dicts; each sub-dict mimics the structure of pre-chunked model input
        chunked_input = OrderedDict()
        for k,v in self.inputs.items():
            q = torch.masked_select(v, qmask)
            c = torch.masked_select(v, ~qmask)
            chunks = torch.split(c, chunk_size)
            
            for i, chunk in enumerate(chunks):
                if i not in chunked_input:
                    chunked_input[i] = {}

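                thing = torch.cat((q, chunk))
                if i != len(chunks)-1:
                    if k == 'input_ids':
                        # close the chunk with this tokenizer's own [SEP] id;
                        # the hard-coded 102 is the [SEP] id of the original BERT
                        # vocabularies and may differ for other models
                        thing = torch.cat((thing, torch.tensor([self.tokenizer.sep_token_id])))
                    else:
                        thing = torch.cat((thing, torch.tensor([1])))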

                chunked_input[i][k] = torch.unsqueeze(thing, dim=0)
        return chunked_input

    def get_answer(self):
        answer = ''
        if self.chunked:
            
            for k, chunk in self.inputs.items():
                answer_start_scores, answer_end_scores = self.model(**chunk)

                answer_start = torch.argmax(answer_start_scores)
                answer_end = torch.argmax(answer_end_scores) + 1

                ans = self.convert_ids_to_string(chunk['input_ids'][0][answer_start:answer_end])
                if ans.startswith(('[CLS]','[SEP]',' ','°')):
                    raise Exception('No Answer')
                else:
                    answer = ans
                    break
        else:
            answer_start_scores, answer_end_scores = self.model(**self.inputs)

            answer_start = torch.argmax(answer_start_scores)  # get the most likely beginning of answer with the argmax of the score
            answer_end = torch.argmax(answer_end_scores) + 1  # get the most likely end of answer with the argmax of the score
        
            answer = self.convert_ids_to_string(self.inputs['input_ids'][0][
                                              answer_start:answer_end])
        if answer in ['',' ','  ']:
          raise Exception('No Answer')                    
        return answer
        
    def convert_ids_to_string(self, input_ids):
        return self.tokenizer.convert_tokens_to_string(self.tokenizer.convert_ids_to_tokens(input_ids))
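
The arithmetic inside `chunkify` is easier to follow on plain lists than on tensors. The sketch below repeats the same idea under simplifying assumptions: the question part already holds `[CLS] question [SEP]`, the context part already ends with the final `[SEP]`, and only `input_ids` are handled. The function name `chunk_inputs` and the token ids are hypothetical.

```python
def chunk_inputs(question_ids, context_ids, max_len, sep_id):
    """Split context_ids so that question_ids + chunk (+ closing sep) fits max_len."""
    chunk_size = max_len - len(question_ids) - 1  # reserve one slot for a closing [SEP]
    if chunk_size <= 0:
        raise ValueError("question is too long for this max_len")
    chunks = []
    for i in range(0, len(context_ids), chunk_size):
        ids = question_ids + context_ids[i:i + chunk_size]
        if i + chunk_size < len(context_ids):
            ids = ids + [sep_id]  # every chunk except the last needs its own [SEP]
        chunks.append(ids)
    return chunks


# Hypothetical ids: 2 stands for [CLS], 3 for [SEP], the rest for word pieces.
question = [2, 11, 12, 3]
context = [21, 22, 23, 24, 25, 26, 27, 3]
for ids in chunk_inputs(question, context, max_len=8, sep_id=3):
    print(ids)
```

Each printed list stays within `max_len`, and the last one is shorter because it carries whatever context is left over.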

In [None]:
questions = [
    '몸에 좋은 콜레스테롤은?'  # "Which cholesterol is good for the body?"
]

reader = DocumentReader(SQUAD_MODEL) 

# to use a locally fine-tuned model instead, pass its directory path, for example:
#reader = DocumentReader("./models/bert/bbu_squad2")
wiki.set_lang('ko')
for question in questions:
    print(f"Question: {question}")
    results = wiki.search(question)
    print(results)
    page = ""
    text = ""
    for result in results:
      try:
        page = wiki.page(result)
        print(f"Top wiki result: {page}")
        text = page.content
        reader.tokenize(question, text)
        print(f"Answer: {reader.get_answer()}")
      except Exception as ex:
        print(ex)


From here on, the reader is combined with document-similarity ranking to find the best answer.
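
The ranking step in the cells below relies on cosine distance between the question embedding and each document embedding, with the closest documents read first. As a rough, dependency-free sketch of that idea (the helper names and the 3-dimensional vectors are made up for illustration; real sentence embeddings have many more dimensions):

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity, the same quantity as scipy's "cosine" metric."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    distances = [cosine_distance(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: distances[i])


# Hypothetical embeddings for one question and three documents.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_by_similarity(query, docs))  # prints [1, 2, 0]
```

Document 1 points almost in the same direction as the query, so it is read first; document 0 is orthogonal to it and comes last.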

In [None]:
from sentence_transformers import SentenceTransformer
# embedder download...
embedder = SentenceTransformer('xlm-r-large-en-ko-nli-ststb')

In [None]:
import scipy.spatial.distance  # a bare `import scipy` does not reliably load the spatial submodule

questions = [
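    '이탈리아 출신의 철학자이자 과학자 중 가장 유명한 사람은 누구인가?',  # "Who is the most famous Italian-born philosopher and scientist?"
    "물의 화학식?",  # "Chemical formula of water?"
    "가장 단단한 광물은 무엇인가?"  # "What is the hardest mineral?"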
]
reader = DocumentReader(SQUAD_MODEL) 

# to use a locally fine-tuned model instead, pass its directory path, for example:
#reader = DocumentReader("./models/bert/bbu_squad2")
wiki.set_lang('ko')

for question in questions:
    print(f"Question: {question}")
    results = wiki.search(question,results=10)
    corpus = []
    pages = []
    #print(results)
    page = ""
    text = ""
    for result in results:
      try:
        page = wiki.page(result)
        #print(f"Top wiki result: {page}")
        text = page.content
        corpus.append(text)
        pages.append(page)
      except Exception as ex:
        print(ex)

    corpus_embeddings = embedder.encode(corpus,show_progress_bar=False) 
    query_embeddings = embedder.encode([question])
    # cosine distance between the question and every page; smaller means more similar
    distances = scipy.spatial.distance.cdist(query_embeddings, corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])
    for idx, distance in results:
        text = corpus[idx]
        try:
            reader.tokenize(question, text)
            print(f"Answer: {reader.get_answer()}", f" from {pages[idx]}")
        except Exception as ex:
            pass


## Wikipedia-Based Korean QA Library

In [1]:
!pip install transformers==3
!pip install wikipedia
!pip install sentence-transformers



Run the `DocumentReader` class definition above before continuing.

In [3]:
from sentence_transformers import SentenceTransformer
SQUAD_MODEL = "monologg/koelectra-base-v3-finetuned-korquad"

reader = DocumentReader(SQUAD_MODEL) 
embedder = SentenceTransformer('xlm-r-large-en-ko-nli-ststb')



In [4]:

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import wikipedia as wiki
import pprint as pp
from collections import OrderedDict
import scipy.spatial.distance  # a bare `import scipy` does not reliably load the spatial submodule
import requests
from bs4 import BeautifulSoup

class Wiki_Based_Korean_QA:
  def __init__(self, document_reader,sentence_embedder):
    self.reader = document_reader
    self.embedder = sentence_embedder
    wiki.set_lang('ko')

  def __search_from_wiki(self,question,max_rank):
    results = wiki.search(question,results=max_rank)
    contents = []
    for result in results:
      try:
        page = wiki.page(result)
        #print(f"Top wiki result: {page}")
        text = page.content
        contents.append((text,page))
      except Exception as ex:
        print(ex)
    return contents

  def __search_from_naver(self,question,max_rank):
    contents = []
    url = 'https://search.naver.com/search.naver'
    params = {'query': question,'where': 'nexearch',}
    response = requests.get(url, params=params)
    html = response.text
    # parse the response HTML with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')
    # split the page into its anchor tags that carry an href
    title_list = soup.find_all('a', href=True)
    #print(title_list)
    tmp = []
    for tag in title_list:
      if (len(tag.text) > 10):
        tmp.append(tag.text)
        if len(tmp) >= 10:
          contents.append((''.join(tmp),url))
          tmp.clear()
    #print(contents)      
    return contents



  def question(self, questions, max_rank = 10):
    answers = {}
    for question in questions:
        print(f"Question: {question}")
        contents = []
        contents.extend(self.__search_from_wiki(question,max_rank))
        contents.extend(self.__search_from_naver(question,max_rank))
        #print(len(contents))
        corpus_embeddings = self.embedder.encode([a for (a,b) in contents],show_progress_bar=False) 
        query_embeddings = self.embedder.encode([question])
        distances = scipy.spatial.distance.cdist(query_embeddings, corpus_embeddings, "cosine")[0]

        results = zip(range(len(distances)), distances)
        results = sorted(results, key=lambda x: x[1])
        answer_list = []
        for idx, distance in results:
            text = contents[idx][0]
            #print(text)
            try:
                self.reader.tokenize(question, text)
                t = (self.reader.get_answer(),contents[idx][1])
                print(f"Answer: {t[0]}", f" from {t[1]}")
                answer_list.append(t)
                
            except Exception as ex:
                pass    

        answers[question] = answer_list
        print(' ')
    return answers
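
`__search_from_naver` builds pseudo-passages by joining link texts longer than 10 characters in groups of 10. As written, a trailing group with fewer than 10 entries is never appended, so those texts are dropped. The sketch below isolates that grouping on plain strings and adds an optional flush of the remainder; the helper name `group_link_texts` and the `keep_remainder` flag are hypothetical and not part of the class above.

```python
def group_link_texts(texts, min_chars=10, group_size=10, keep_remainder=False):
    """Join texts longer than min_chars into passages of group_size texts each."""
    passages, buffer = [], []
    for text in texts:
        if len(text) > min_chars:
            buffer.append(text)
            if len(buffer) >= group_size:
                passages.append("".join(buffer))
                buffer = []
    if keep_remainder and buffer:
        passages.append("".join(buffer))  # leftover texts the original loop would drop
    return passages


links = [
    "short",
    "a link text that is long enough",
    "another sufficiently long title",
    "tiny",
    "a third title with enough characters",
]
print(len(group_link_texts(links, group_size=2)))                       # prints 1
print(len(group_link_texts(links, group_size=2, keep_remainder=True)))  # prints 2
```

Whether the leftover texts are worth keeping depends on how noisy the search page is; the flag only makes the choice explicit.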



In [5]:
wbk_qa = Wiki_Based_Korean_QA(reader,embedder)


In [6]:
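answers = wbk_qa.question(["북한에서 실질적인 권력자는 누구인가?",  # "Who actually holds power in North Korea?"
                           "세계에서 가장 넓은 호수는?",  # "What is the largest lake in the world?"
                           "오로라가 가장 잘 보이는 곳은?",  # "Where are auroras best seen?"
                           "심장이 죄어오듯이 아프면 의심되는 병은 무엇인가?"])  # "What illness is suspected when the heart hurts as if squeezed?"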

Question: 북한에서 실질적인 권력자는 누구인가?
Answer: 스마트폰  from https://search.naver.com/search.naver
Answer: 박정희  from <WikipediaPage '박정희'>
 
Question: 세계에서 가장 넓은 호수는?
Answer: 티베트 자치구 남초 호  from <WikipediaPage '염호'>
Answer: 카스피 해  from <WikipediaPage '호수'>
Answer: 보스토크호  from <WikipediaPage '보스토크호'>
Answer: 바이칼 호  from <WikipediaPage '아시아'>
Answer: 나일강  from <WikipediaPage '나일강'>
Answer: 티티카카 호  from <WikipediaPage '남아메리카'>
 
Question: 오로라가 가장 잘 보이는 곳은?
Answer: 남극및 북극 양극지방  from <WikipediaPage '오로라'>
 
Question: 심장이 죄어오듯이 아프면 의심되는 병은 무엇인가?
Answer: 협심증  from https://search.naver.com/search.naver
Answer: 심장마비  from <WikipediaPage '4.19 혁명'>
 


In [8]:
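answers = wbk_qa.question(["항문에서 피가 나는 병은 무엇인가?",  # "What illness causes bleeding from the anus?"
                           "김재규는 박정희를 왜 죽였는가?",  # "Why did Kim Jae-gyu kill Park Chung-hee?"
                           "케네디를 죽인 암살범은 누구인가?",  # "Who was the assassin who killed Kennedy?"
                           "술 취하지 않는 방법은?"])  # "How can one avoid getting drunk?"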

Question: 항문에서 피가 나는 병은 무엇인가?
Answer: 치질  from https://search.naver.com/search.naver
Answer: 오십견  from https://search.naver.com/search.naver
 
Question: 김재규는 박정희를 왜 죽였는가?
Answer: 민주화에 대한 열망  from <WikipediaPage '10·26 사건'>
Answer: 10 . 26 사태  from <WikipediaPage '김재규'>
 
Question: 케네디를 죽인 암살범은 누구인가?
Answer: 리 하비 오스월드  from <WikipediaPage '링컨과 케네디의 공통점'>
 
Question: 술 취하지 않는 방법은?
Answer: wikiHow  from https://search.naver.com/search.naver
Answer: 에탄올  from <WikipediaPage '에탄올'>
Answer: 필수  from <WikipediaPage '심폐소생술'>
Answer: 법을 행하는 수단  from <WikipediaPage '법가'>
 


In [11]:
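answers = wbk_qa.question(["사람을 사랑해서 생기는 병은?",  # "What illness comes from loving someone?"
                           "부모는 자식을 왜 사랑하는가?",  # "Why do parents love their children?"
                           "나의 와이프는 나를 사랑하는가?",  # "Does my wife love me?"
                           "신은 존재 하는가?"])  # "Does God exist?"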

Question: 사람을 사랑해서 생기는 병은?
Answer: 정신착란증  from <WikipediaPage '로드 넘버원'>
Answer: 트러블  from <WikipediaPage '트레이스 (만화)'>
 
Question: 부모는 자식을 왜 사랑하는가?
Answer: 평범한 인생의 진리  from <WikipediaPage '파도 (드라마)'>
 
Question: 나의 와이프는 나를 사랑하는가?
 
Question: 신은 존재 하는가?
Answer: 자연적 혹은 초자연적 존재  from <WikipediaPage '신'>
Answer: 천계  from <WikipediaPage '힌두교의 신'>
 


In [12]:
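answers = wbk_qa.question(["사람의 인생에서 가장 소중한 것은 무엇인가?",  # "What is the most precious thing in a person's life?"
                           "바람난 여자는 다시 돌아올 수 있는가?",  # "Can a woman who has had an affair come back?"
                           "위가 쓰리고 아플 때 어떤 약을 복용해야 하는가?",  # "What medicine should be taken when the stomach burns and hurts?"
                           "눈알이 빠지면 어떻게 되는가?"])  # "What happens if an eyeball falls out?"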

Question: 사람의 인생에서 가장 소중한 것은 무엇인가?
Answer: 다소  from <WikipediaPage '마법사에게 소중한 것'>
Answer: 한때  from <WikipediaPage '김동길 (1928년)'>
 
Question: 바람난 여자는 다시 돌아올 수 있는가?
Answer: 한때  from <WikipediaPage '유혜정 (배우)'>
 
Question: 위가 쓰리고 아플 때 어떤 약을 복용해야 하는가?
Answer: 제티스정  from https://search.naver.com/search.naver
 
Question: 눈알이 빠지면 어떻게 되는가?
Answer: 매드아이 무디  from <WikipediaPage '해리 포터의 마법 물체 목록'>
Answer: 애꾸눈  from <WikipediaPage '오딘'>
Answer: 도탄  from <WikipediaPage '렐파첸'>
Answer: 종종  from <WikipediaPage '조앤 크로퍼드'>
 
