<a href="https://colab.research.google.com/github/dolmani38/Summary/blob/master/korean_QA_from_wiki_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building a QA System with BERT on Wikipedia

https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html

This notebook adapts the tutorial above to Korean question answering.

In [1]:
!pip install transformers==3
!pip install wikipedia
!pip install sentence-transformers


In [2]:
import torch  # needed by chunkify() and get_answer() below
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import wikipedia as wiki
import pprint as pp
from collections import OrderedDict

class DocumentReader:
    def __init__(self, pretrained_model_name_or_path=''):
        self.READER_PATH = pretrained_model_name_or_path
        self.tokenizer = AutoTokenizer.from_pretrained(self.READER_PATH)
        self.model = AutoModelForQuestionAnswering.from_pretrained(self.READER_PATH)
        self.max_len = self.model.config.max_position_embeddings
        self.chunked = False

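    def tokenize(self, question, text):
        # reset the flag so a short input that follows a long one is not treated as chunked
        self.chunked = False
        self.inputs = self.tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
        self.input_ids = self.inputs["input_ids"].tolist()[0]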

        if len(self.input_ids) > self.max_len:
            self.inputs = self.chunkify()
            self.chunked = True

    def chunkify(self):
        """ 
        Break up a long article into chunks that fit within the max token
        requirement for that Transformer model. 

        Calls to BERT / RoBERTa / ALBERT require the following format:
        [CLS] question tokens [SEP] context tokens [SEP].
        """

        # create question mask based on token_type_ids
        # value is 0 for question tokens, 1 for context tokens
        qmask = self.inputs['token_type_ids'].lt(1)
        qt = torch.masked_select(self.inputs['input_ids'], qmask)
        chunk_size = self.max_len - qt.size()[0] - 1 # the "-1" accounts for
        # having to add an ending [SEP] token to the end

        # create a dict of dicts; each sub-dict mimics the structure of pre-chunked model input
        chunked_input = OrderedDict()
        for k,v in self.inputs.items():
            q = torch.masked_select(v, qmask)
            c = torch.masked_select(v, ~qmask)
            chunks = torch.split(c, chunk_size)
            
            for i, chunk in enumerate(chunks):
                if i not in chunked_input:
                    chunked_input[i] = {}

                thing = torch.cat((q, chunk))
                if i != len(chunks)-1:
                    if k == 'input_ids':
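                        # append the tokenizer's own [SEP] id; a hardcoded 102 only matches BERT-style vocabularies
                        thing = torch.cat((thing, torch.tensor([self.tokenizer.sep_token_id])))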
                    else:
                        thing = torch.cat((thing, torch.tensor([1])))

                chunked_input[i][k] = torch.unsqueeze(thing, dim=0)
        return chunked_input

    def get_answer(self):
        answer = ''
        if self.chunked:
            
            for k, chunk in self.inputs.items():
                answer_start_scores, answer_end_scores = self.model(**chunk)

                answer_start = torch.argmax(answer_start_scores)
                answer_end = torch.argmax(answer_end_scores) + 1

                ans = self.convert_ids_to_string(chunk['input_ids'][0][answer_start:answer_end])
                if ans.startswith(('[CLS]','[SEP]',' ','°')):
                    raise Exception('No Answer')
                else:
                    answer = ans
                    break
        else:
            answer_start_scores, answer_end_scores = self.model(**self.inputs)

            answer_start = torch.argmax(answer_start_scores)  # get the most likely beginning of answer with the argmax of the score
            answer_end = torch.argmax(answer_end_scores) + 1  # get the most likely end of answer with the argmax of the score
        
            answer = self.convert_ids_to_string(self.inputs['input_ids'][0][
                                              answer_start:answer_end])
        if answer in ['', ' ', '  ']:
            raise Exception('No Answer')
        return answer
        
    def convert_ids_to_string(self, input_ids):
        return self.tokenizer.convert_tokens_to_string(self.tokenizer.convert_ids_to_tokens(input_ids))
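
The chunking arithmetic described in the `chunkify` docstring can be sketched without tensors. The following is a minimal illustration on plain Python lists, not the class's actual implementation: the helper name `chunk_inputs` and the toy token ids (2 for `[CLS]`, 3 for `[SEP]`) are assumptions made for this example. It assumes the question part already has the form `[CLS] question [SEP]` and that the context ends with the final `[SEP]`.

```python
def chunk_inputs(question_ids, context_ids, max_len, sep_id):
    """Split context_ids so that question + chunk never exceeds max_len.

    Assumes question_ids holds [CLS] question [SEP] and that
    context_ids already ends with the final [SEP].
    """
    # reserve one slot for the [SEP] appended to every non-final chunk
    chunk_size = max_len - len(question_ids) - 1
    if chunk_size <= 0:
        raise ValueError("question is too long for the model")

    # cut the context into consecutive pieces of at most chunk_size tokens
    pieces = [context_ids[i:i + chunk_size]
              for i in range(0, len(context_ids), chunk_size)]

    chunks = []
    for i, piece in enumerate(pieces):
        ids = question_ids + piece
        if i != len(pieces) - 1:
            ids = ids + [sep_id]  # the last piece already ends with [SEP]
        chunks.append(ids)
    return chunks


# toy ids, hypothetical values chosen only for illustration
question_ids = [2, 11, 12, 3]                  # [CLS] q q [SEP]
context_ids = [21, 22, 23, 24, 25, 26, 27, 3]  # context tokens + [SEP]
chunks = chunk_inputs(question_ids, context_ids, max_len=8, sep_id=3)
for c in chunks:
    print(len(c), c)
```

Every chunk repeats the question prefix, so the reader model sees the question together with each slice of the article.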

In [3]:
from sentence_transformers import SentenceTransformer
SQUAD_MODEL = "monologg/koelectra-base-v3-finetuned-korquad"

reader = DocumentReader(SQUAD_MODEL) 
embedder = SentenceTransformer('xlm-r-large-en-ko-nli-ststb')


In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import wikipedia as wiki
import pprint as pp
from collections import OrderedDict
import scipy.spatial.distance  # load the submodule explicitly; cdist is used in question()
import requests
from bs4 import BeautifulSoup

class Wiki_Based_Korean_QA:
  def __init__(self, document_reader,sentence_embedder):
    self.reader = document_reader
    self.embedder = sentence_embedder
    wiki.set_lang('ko')

  def __search_from_wiki(self,question,max_rank):
    results = wiki.search(question,results=max_rank)
    contents = []
    for result in results:
      try:
        page = wiki.page(result)
        #print(f"Top wiki result: {page}")
        text = page.content
        contents.append((text,page))
      except Exception as ex:
        print(ex)
    return contents

  def __search_from_naver(self,question,max_rank):
    contents = []
    url = 'https://search.naver.com/search.naver'
    params = {'query': question,'where': 'nexearch',}
    response = requests.get(url, params=params)
    html = response.text
    # parse the response HTML with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')
    # collect every anchor tag that has an href
    title_list = soup.find_all('a', href=True)
    #print(title_list)
    tmp = []
    for tag in title_list:
      if (len(tag.text) > 10):
        tmp.append(tag.text)
        if len(tmp) >= 10:
          contents.append((''.join(tmp),url))
          tmp.clear()
    #print(contents)      
    return contents



  def question(self, questions, max_rank = 10):
    answers = {}
    for question in questions:
        print(f"Question: {question}")
        contents = []
        contents.extend(self.__search_from_wiki(question,max_rank))
        contents.extend(self.__search_from_naver(question,max_rank))
        #print(len(contents))
        corpus_embeddings = self.embedder.encode([a for (a,b) in contents],show_progress_bar=False) 
        query_embeddings = self.embedder.encode([question])
        distances = scipy.spatial.distance.cdist(query_embeddings, corpus_embeddings, "cosine")[0]

        results = zip(range(len(distances)), distances)
        results = sorted(results, key=lambda x: x[1])
        answer_list = []
        for idx, distance in results:
            text = contents[idx][0]
            #print(text)
            try:
                self.reader.tokenize(question, text)
                t = (self.reader.get_answer(),contents[idx][1])
                print(f"Answer: {t[0]}", f" from {t[1]}")
                answer_list.append(t)
                
            except Exception as ex:
                pass    

        answers[question] = answer_list
        print(' ')
    return answers
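
The ranking step in `question` orders the retrieved passages by cosine distance to the question embedding before the reader is run on each one. Below is a minimal pure-Python sketch of that idea, assuming toy 2-dimensional vectors in place of real sentence embeddings. The helper names `cosine_distance` and `rank_by_distance` are hypothetical and stand in for the `scipy.spatial.distance.cdist(..., "cosine")` call followed by `sorted`.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v), the quantity cdist computes for metric 'cosine'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_distance(query_vec, passage_vecs):
    """Return (index, distance) pairs, closest passage first."""
    distances = [cosine_distance(query_vec, p) for p in passage_vecs]
    return sorted(enumerate(distances), key=lambda pair: pair[1])


# toy vectors standing in for sentence embeddings (illustrative values only)
query = [1.0, 0.0]
passages = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
ranking = rank_by_distance(query, passages)
print([idx for idx, _ in ranking])
```

A smaller distance means the passage points in nearly the same direction as the question, so the reader tries the most similar passages first.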


In [5]:
wbk_qa = Wiki_Based_Korean_QA(reader,embedder)

In [6]:
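answers = wbk_qa.question(["북한에서 실질적인 권력자는 누구인가?",  # Who holds real power in North Korea?
                           "세계에서 가장 넓은 호수는?",  # What is the largest lake in the world?
                           "오로라가 가장 잘 보이는 곳은?",  # Where is the aurora best seen?
                           "심장이 죄어오듯이 아프면 의심되는 병은 무엇인가?",  # What illness is suspected when the heart hurts as if squeezed?
                           "항문에서 피가 나는 병은 무엇인가?",  # What illness causes bleeding from the anus?
                           "김재규는 박정희를 왜 죽였는가?",  # Why did Kim Jae-gyu kill Park Chung-hee?
                           "케네디를 죽인 암살범은 누구인가?",  # Who assassinated Kennedy?
                           "술 취하지 않는 방법은?",  # How can one avoid getting drunk?
                           "사람을 사랑해서 생기는 병은?",  # What illness comes from loving someone?
                           "부모는 자식을 왜 사랑하는가?",  # Why do parents love their children?
                           "나의 와이프는 나를 사랑하는가?",  # Does my wife love me?
                           "신은 존재 하는가?",  # Does God exist?
                           "사람의 인생에서 가장 소중한 것은 무엇인가?",  # What is the most precious thing in a person's life?
                           "바람난 여자는 다시 돌아올 수 있는가?",  # Can a woman who had an affair come back?
                           "위가 쓰리고 아플 때 어떤 약을 복용해야 하는가?",  # What medicine should one take when the stomach burns and hurts?
                           "눈알이 빠지면 어떻게 되는가?"])  # What happens if an eyeball falls out?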

Question: 북한에서 실질적인 권력자는 누구인가?
Answer: 김정은  from https://search.naver.com/search.naver
Answer: 김정은  from https://search.naver.com/search.naver
Answer: 모란봉 클럽 144회 예고 동영상 바로재생 버튼 02 : 48 [UNK] 이들 중 일인자는 누구일까 ? ! 동영상 바로재생 버튼 06 : 00 북한 , 음주운전이 비일비재하다 ? ! 권력자들의 음주운전 릴레이트럼프 모더  from https://search.naver.com/search.naver
Answer: 박정희  from <WikipediaPage '박정희'>
Answer: 김정은  from https://search.naver.com/search.naver
 
Question: 세계에서 가장 넓은 호수는?
Answer: 티베트 자치구 남초 호  from <WikipediaPage '염호'>
Answer: 카스피 해  from <WikipediaPage '호수'>
Answer: 보스토크호  from <WikipediaPage '보스토크호'>
Answer: 바이칼 호  from <WikipediaPage '아시아'>
Answer: 나일강  from <WikipediaPage '나일강'>
Answer: 티티카카 호  from <WikipediaPage '남아메리카'>
 
Question: 오로라가 가장 잘 보이는 곳은?
Answer: 남극및 북극 양극지방  from <WikipediaPage '오로라'>
 
Question: 심장이 죄어오듯이 아프면 의심되는 병은 무엇인가?
Answer: 내공30  from https://search.naver.com/search.naver
Answer: 심장마비  from <WikipediaPage '4.19 혁명'>
 
Question: 항문에서 피가 나는 병은 무엇인가?
Answer: 치질  from https://search.naver.com/sear