If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to install them.

In [None]:
! pip install datasets transformers sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 8.2 MB/s 
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96


If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First, you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!). Then run the following cell and enter your username and password. This only works on Colab; in a regular notebook, you need to run the command in a terminal:

In [None]:
!huggingface-cli login


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        
Username: kharelarogya@hotmail.com
Password: 
Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-crendential store but this isn't the helper defined on your machine.
You will have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal to set it as the default

git config --global credential.helper

Then you need to install Git-LFS and set up Git if you haven't already. Run the following commands, replacing the name and email with your own:

In [None]:
!apt install git-lfs
!git config --global user.email "kharelarogya@hotmail.com"
!git config --global user.name "Arogya"

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 40 not upgraded.
Need to get 2,129 kB of archives.
After this operation, 7,662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2,129 kB]
Fetched 2,129 kB in 1s (1,592 kB/s)
Selecting previously unselected package git-lfs.
(Reading database ... 148492 files and directories currently installed.)
Preparing to unpack .../git-lfs_2.3.4-1_amd64.deb ...
Unpacking git-lfs (2.3.4-1) ...
Setting up git-lfs (2.3.4-1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...


Make sure your version of Transformers is at least 4.8.1, since the functionality used here was introduced in that version:

In [None]:
import transformers

print(transformers.__version__)

4.10.2
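
A check like this is easy to get wrong if the version strings are compared as plain text, because `"4.10.2"` sorts before `"4.8.1"` alphabetically. The sketch below is only an illustration and not part of the original notebook: the helper name `version_tuple` is hypothetical, and it assumes plain `major.minor.patch` strings without suffixes such as `.dev0`.

```python
def version_tuple(version):
    """Turn a plain 'major.minor.patch' string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


installed = "4.10.2"  # the value printed above
required = "4.8.1"    # the minimum version mentioned in the text

# Comparing the raw strings compares character by character ("1" < "8") ...
print(installed >= required)
# ... while comparing the numeric tuples compares 10 against 8.
print(version_tuple(installed) >= version_tuple(required))
```

For version strings with pre-release or development suffixes, `packaging.version.parse` is the more robust choice.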


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/question-answering).

# Fine-tuning a model on a question-answering task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a question answering task, which is the task of extracting the answer to a question from a given context. We will see how to easily load a dataset for these kinds of tasks and use the `Trainer` API to fine-tune a model on it.

![Widget inference representing the QA task](https://github.com/huggingface/notebooks/blob/master/examples/images/question_answering.png?raw=1)

**Note:** This notebook fine-tunes models that answer questions by taking a substring of a context, not by generating new text.

This notebook is built to run on any question answering task with the same format as SQuAD (version 1 or 2), with any model checkpoint from the [Model Hub](https://huggingface.co/models), as long as that model has a version with a question answering head and a fast tokenizer (check on [this table](https://huggingface.co/transformers/index.html#bigtable) if this is the case). It might need some small adjustments if you decide to use a different dataset than the one used here. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set the three parameters below, and the rest of the notebook should run smoothly:

In [None]:
# This flag is the difference between SQUAD v1 or 2 (if you're using another dataset, it indicates if impossible
# answers are allowed or not).
squad_v2 = False
model_checkpoint = "monologg/koelectra-base-v3-finetuned-korquad"
batch_size = 16

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
from datasets import load_dataset, load_metric

For our example here, we'll use `squad_kor_v1` (KorQuAD 1.0), a Korean dataset in the same format as the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). The notebook should work with any question answering dataset provided by the 🤗 Datasets library. If you're using your own dataset defined from a JSON or CSV file (see the [Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) on how to load them), it might need some adjustments in the names of the columns used.

In [None]:
datasets = load_dataset("squad_kor_v2" if squad_v2 else "squad_kor_v1")
# train_dataset = load_dataset("squad_kor_v2" if squad_v2 else "squad_kor_v1", split='train[:10%]+train[-80%:]')
# validation_dataset = load_dataset("squad_kor_v2" if squad_v2 else "squad_kor_v1", split='validation[:10%]+validation[-80%:]')

Reusing dataset squad_kor_v1 (/root/.cache/huggingface/datasets/squad_kor_v1/squad_kor_v1/1.0.0/18d4f44736b8ee85671f63cb84965bfb583fa0a4ff2df3c2e10eee9693796725)


The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split (here, the training and validation sets).

In [None]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 60407
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 5774
    })
})

We can see the training and validation sets both have a column for the context, the question and the answers to those questions.

To access an actual element, you need to select a split first, then give an index (for example `datasets["train"][0]`). Selecting only the split returns the `Dataset` object itself:

In [None]:
datasets["train"]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 60407
})

The answers are indicated by their start position in the text (the character index stored in `answer_start`) and their full text, which is a substring of the context as we mentioned above.
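
To make the relation between `answer_start` and the answer text concrete, here is a minimal sketch. The record is hypothetical (only the field names `context`, `question`, `answers`, `text` and `answer_start` come from the dataset above), and the helper name `answer_is_aligned` is an assumption made for illustration.

```python
# Hypothetical record in the same layout as the dataset rows shown above.
context = "The notebook fine-tunes a model on a question answering task."
answer_text = "question answering"
record = {
    "context": context,
    "question": "What kind of task is the model fine-tuned on?",
    "answers": {
        "text": [answer_text],
        # Character index at which the answer starts in the context.
        "answer_start": [context.index(answer_text)],
    },
}


def answer_is_aligned(record):
    """Check that every answer text is found in the context at its answer_start."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


print(answer_is_aligned(record))
```

A check of this kind can be useful when adapting the notebook to a custom dataset, since the preprocessing below relies on these character positions being exact.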

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset (automatically decoding the labels in passing).

In [None]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))
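
The loop in `show_random_elements` redraws an index until it finds one that has not been picked yet. As a possible alternative, the standard library's `random.sample` draws without replacement directly. The sketch below is only an illustration: the helper name `pick_unique_indices` and the `seed` parameter are assumptions, not part of the notebook.

```python
import random


def pick_unique_indices(n_items, num_examples, seed=None):
    """Pick num_examples distinct indices in range(n_items)."""
    if num_examples > n_items:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a private generator, so the global state is untouched
    return rng.sample(range(n_items), num_examples)


# 60407 is the number of training rows reported above.
picks = pick_unique_indices(n_items=60407, num_examples=10, seed=0)
print(len(picks), len(set(picks)))
```

The returned list could be used in place of `picks` when building the `DataFrame`.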

In [None]:
show_random_elements(datasets["train"])

Unnamed: 0,id,title,context,question,answers
0,6532852-4-2,스테고사우루스,"스테고사우루스는 공룡 화석 전쟁에서 최초로 수집 및 기재가 이루어진 수 많은 공룡 중 하나로 콜로라도 주 모리슨의 북쪽에서 발견된 표본을 통해 1877년 오트니엘 찰스 마쉬에 의해 최초로 명명되었다. 이 최초의 뼈들이 스테고사우루스 아르마투스(Stegosaurus armatus)의 완모식표본이 되었다. 마쉬는 처음에 이 화석이 수중생활을 하던 거북 비슷한 동물의 것이라고 생각했으며 골판이 마치 지붕의 기와처럼 겹쳐진 상태로 등을 덮고 있는 것으로 생각했기 때문에 학명을 '지붕 도마뱀'이라고 붙였다. 그 후 수 년에 걸쳐 스테고사우루스의 화석들이 많이 발견되어 마쉬는 스테고사우루스속에 대한 여러 편의 논문을 발표했다. 처음에는 여러 종이 기재되었다. 하지만 이 종들 중 많은 수가 부정확하거나 다른 종의 동물이명인 것으로 간주되고 있어서, 이제는 잘 알려진 종 두 개와 덜 알려진 종 하나만이 남아있다. 스테고사우루스의 화석으로 확인된 것은 모리슨층의 층서존 2-6 사이에서 발견되었으며, 스테고사우루스라고 볼 수도 있을 것 같은 화석이 추가로 층서존 1에서 발견되었다.",마쉬가 스테고사우루스의 생김새를 보고 지은 학명은 무슨 뜻인가?,"{'text': ['지붕 도마뱀'], 'answer_start': [266]}"
1,6555914-3-0,주병진,연예계에 데뷔한 지 얼마 안되던 주병진은 서울특별시 방배동의 유명한 해장국집을 보며 동네가 고급스럽고 상권이 좋을거라 생각하여 여기다가 카페를 차리면 잘될 것이라 생각했다. 술에 취했던 중에 해야겠다고 결심을 하게됐다. 그러나 돈은 없었고 돈을 꿀 사람들의 명단을 정리해 돈을 꾸기 시작했다. 그러나 대부분 꾸지 못했으나 별로 친하지 않았던 사람들 사이에서 꾸어주는 사람들이 생겨나기 시작했다. 조금만 돈이 생기면 계약금으로 내고 중도금 치를 때까지 돈을 꾸고 조금 늦어지면 핑계를 대며 미루고 꾸기를 반복했다. 당시 주병진은 24살때였다. 그러다 갑자기 자신이 너무 큰 일을 벌였다는 생각을 갖게됐고 심각해졌으나 우연히 만난 전 여자친구가 응원의 편지를 차에 꽂아놓고 간 것을 보고 용기를 얻어 더 열심히 돈 마련을 위해 노력했다.,주병진이 24살에 카페를 차리면 잘될 것이라고 본 지역은?,"{'text': ['방배동'], 'answer_start': [29]}"
2,6466142-1-1,기관_감사_위원회,"오늘날 IRB 역할은 IRB 업무를 통해 소득을 얻는 상업적 집단이 수행한다. 이들은 독립 IRB(independent IRB) 혹은 상업적 IRB(commercial IRB)라고 불린다. 이러한 IRB들은 검열 대상이 되는 학술 의료 기관과 독립적이어야 하며 종전의 IRB와 같은 행정적 규제를 받아야 한다. 이에 대한 FDA의 요구는 21 CFR 56.107.에 기록되어 있다. (a)1 IRB는 최소 5명 이상으로 구성되어야 한다. (a)2 구성원은 충분한 경험, 전문성, 연구가 윤리적이며 고지에 입각한 동의가 충분한 수준인지, 적절한 보호장치가 있는지에 대한 유효한 결정을 내릴 수 있을 정도의 포괄성을 지녀야 한다. (a)3 만약 IRB가 상처입기 쉬운 집단을 포함하는 연구와 관련해 일하고 있다면, IRB는 이 그룹을 익히 알고 있는 구성원을 반드시 포함해야 한다. 따라서 IRB가 수감자를 대상으로 하는 연구를 실시할 때 수감자들을 대변하는 이를 포함시키는 것이 일반적이다. (b)1IRB는 반드시 성별 때문에 선택된 것이 아닌 한 남자와 여자를 동시에 포함시켜야 한다. (b)2IRB 모든 구성원이 같은 직종에 종사해서는 안된다. (c)IRB는 최소 한 명의 과학자와 한 명의 비과학자를 포함해야 한다. 이러한 용어는 규제목록에 정의되어 있지 않다. (d)IRB는 대상 기관과 관련되지 않은 사람 또는 대상 기관과 관련된 사람의 직계 가족이 아닌 사람을 한 명 이상 포함해야 한다. (e)IRB 구성원은 검열 대상 프로젝트에 지지를 표명할 수 없다. (f)IRB는 전문성과 다양성을 충족시키기 위한 논의에 조언자를 포함시킬 수 있다.",FDA의 요구에 따르면 IRB구성원이 지지를 표명할 수 없는 것은?,"{'text': ['검열 대상 프로젝트'], 'answer_start': [739]}"
3,6540644-35-0,에리히_폰_만슈타인,유명한 변호사 레지널드 토머스 패깃을 비롯한 만슈타인의 변호인단은 파르티잔의 상당수가 유대인이었기에 해당 명령서는 정당화될 수 있다고 주장했다. 만슈타인은 부하들을 파르티잔 공격에서 보호하고자 하는 마음에 모든 유대인을 죽이라고 명령했다는 소리였다. 패깃은 만슈타인이 자신의 주권국 정부의 명령을 설사 그 명령이 불법적인 것이라 해도 불복종할 수 없었음을 주장했다. 만슈타인은 자기변호를 위해 나치의 인종정책이 혐오스러웠다고 말했다. 그러나 열여섯 명의 서로 다른 증인들이 만슈타인이 집단살해를 인지하고 있었고 또 어쩌면 직접 연루되었을 수 있다고 진술했다. 패깃은 러시아인들을 “야만인”이라고 부르면서 만슈타인은 최악의 “끔찍한 야만성”을 지닌 러시아인들을 상대로 싸우면서도 전쟁법을 준수하는 “품격 있는 독일 군인”으로서 절제를 보여줬다고 말했다. 만슈타인이 아인자츠그루펜 D부대의 행위에 책임이 있는지 여부는 재판의 핵심 주제였다. D부대는 만슈타인의 직접 지휘를 받지는 않았지만 만슈타인의 관할구역에서 활동했다. 검사측은 이 부대가 무엇을 하는지 파악하는 것은 사령관으로서 만슈타인의 의무이며 또 집단살해 행위를 멈추도록 압력을 행사하는 것도 그의 의무였다고 주장했다. 베놀트 르메이를 비롯한 최근 학자들은 만슈타인이 본인의 재판과 뉘른베르크 재판에서 한 말들이 거의 다 위증일 것이라는 데 의견을 같이하고 있다.,만슈타인이 자기변호를 위해 나치의 인종정책이 혐오스러웠다고 말한 것에 대해 몇 명의 증인이 반대의 의견을 진술했나?,"{'text': ['열여섯'], 'answer_start': [247]}"
4,6486153-0-0,대한민국_제19대_대통령_선거,"보수 진영의 유력 대권후보로 등극한 반기문 전 유엔 사무총장이 1월 불출마를 선언하고, 이어서 유력 후보였던 황교안 대통령 권한대행도 불출마 선언을 함에 따라 보수 표심이 크게 약화되었다. 때문에 선거 극초반에는 문재인 더불어민주당 후보가 30 ~ 40%에 달하는 지지율로 대세론이 굳혀지면서, 이를 견제하려는 세력 간에 이른바 '문재인 대 비문 연대' 구도가 형성되었다. 이에 따라 비문 연대를 통한 단일화 목소리가 지속적으로 제기되었으나, 각당 후보 당사자들이 단일화 거부 및 대선 완주를 선언하면서 실제 단일화로 이어지지는 않았다. 따라서 본격적인 선거전은 문재인·홍준표·안철수·유승민·심상정의 원내 5대 주요 정당 후보들 간의 5자 대결 구도로 진행되었다.",19대 대권후보 불출마를 선언한 전 유엔 사무총장은 누구인가?,"{'text': ['반기문'], 'answer_start': [20]}"
5,6539049-10-0,안철수,"박원순은 단일화에 대해 “두 사람 모두 시장직 자리를 원한 게 아니다. 진정 새로운 세상을 만드는 데 관심이 있었기 때문에 이렇게 상식적으로 이해 안 되는 결론이 나온 것”라고 말했다. 박원순은 또 안철수에 대해 “아무리 신뢰관계가 있다해도 저보다 10배나 더 되는 지지도를 갖고 있던 분이 정말 아무 조건 없이 ‘더 잘 할 수 있다’고 하는 (내 말) 한마디로 양보한다는 게 사실 또 믿기 어려운 그런 일”이라며 “안 교수가 개인의 이익보다 사회의 어떤 공공적인 이익을 위해서 해왔던 분이었기 때문에 가능했던 태도였다고 본다”라고 말했다. 이후 박원순, 한명숙, 문재인 등은 “선거 승리를 위해서는 범시민 야권 단일후보를 통해 한나라당과 1:1 구도를 만들어야 한다. (서울시장 후보로 거론되는)박원순-한명숙 두 사람은 범시민 야권 단일 후보 선출을 위해 상호 협력하고, 이후엔 선거 승리를 위해 모든 힘을 기울인다”라며 결의를 다졌다.",안철수는 박원순에 비해 몇배나 되는 지지도를 갖고 있었나?,"{'text': ['10배'], 'answer_start': [139]}"
6,6517543-5-2,서예,"문자를 쓸 때에 필요한 8종의 용필법(用筆法)으로서 그것이 영(永)자의 8개의 점획에 맞기 때문에 영자팔법(永字八法)이라 부르고 있다. <서원청화(書苑靑華)>에 ""팔법은 예자(隸字)로부터 생긴다……""하였으며, 오래전부터 그렇게 말해진 듯한데 당시대에 해서의 전형이 확립된 것에 곁들여 영자팔법을 습득하면 모든 문자에 응용된다고 생각했던 것 같다. 그림과 같이 첫째 점을 측(側), 둘째의 횡획(橫劃)을 늑(勒), 셋째의 종획(縱劃)을 노(努), 그 날개를 적, 다섯째의 바른쪽 위로 긋는 선을 책(策), 왼쪽 밑으로 긋는 선을 약(掠), 일곱째의 바른쪽에서 왼쪽으로의 선을 탁(啄), 바른쪽 밑으로 터는 선을 책이라 한다. 초학자를 상대로 하나 그다지 가치있는 기법은 아니다.",영자팔법에서 바른쪽 밑으로 터는 선을 지칭하는 말은?,"{'text': ['책'], 'answer_start': [341]}"
7,6505638-4-1,미싱노,"미싱노에 대한 플레이어들의 반응은 사회학 연구 주제가 되기도 하였다. 사회학자 윌리엄 심스 베인브리지는 게임 프리크가 “게임 역사상 가장 인기있는 글리치”를 만들어냈다고 하면서 플레이어들의 창의성에 대해 언급했다. 《피카츄 글로벌 어드벤처: 포켓몬의 흥망》(Pikachu's Global Adventure: The Rise and Fall of Pokémon)이라는 책의 저자인 교육학자 줄리언 세프턴그린(Julian Sefton-Green) 교수는 자신의 아들이 미싱노를 일종의 ‘치트’로서 활용하고 있으며 미싱노로 인해 게임에 대한 아들의 시각이 극명하게 바뀌었음을 밝히고 있다. 결과적으로 미싱노와 같은 존재는 게임이라는 ‘봉인된 세계’에 대한 환상을 깨트리고, 본질적으로는 그것이 ‘하나의 컴퓨터 프로그램’이라는 사실을 상기시켜 주었다는 것이다. 《비디오 게임 하기》(Playing with Videogames)라는 책에는 미싱노를 조우하게 된 플레이어들의 심리에 대해 상세하게 다룬 심층 연구 내용이 포함되어 있다. 이 책에 따르면 플레이어들은 미싱노 현상을 서로 비교하고, 발견한 사항을 평가하고, 서로 비평하기도 한다고 한다. 또, 포켓몬 커뮤니티를 통해 팬아트나 팬픽션 형태로 미싱노를 게임 카논화하거나, 일종의 결함인데도 불구하고 인기를 끌고 있는 현상 등은 미싱노만의 독특하고 특별한 사례라 할 수 있다고 밝히고 있다.",줄리언 세프턴그린은 무슨 학자인가요?,"{'text': ['교육학자'], 'answer_start': [212]}"
8,6547282-20-0,아마루베_철교,"다리 위에 깔리는 궤도는 장대 경간 구조의 프리스트레스트 콘크리트 구조로 크리프나 건조 수축에 따른 다리의 길이굽음의 변화량이 크기 때문에 안전성을 고려하여 밸러스트 구조를 채용했다. 이 구조를 통해 소음을 줄일 수 있었다. 또한 프리스트레스트 콘크리트 구조는 궤도 사이에 빈 공간이 없기 때문에 궤도 사이로 솟아오르는 바람이 열차에 영향을 끼치지 못할 뿐만 아니라, 열차에서 나온 물체가 다리 밑으로 떨어지는 일도 없게 되었다. 또한 바람에 대한 대책으로 높이 1.7 m의 방풍벽을 설치하여 풍속 30 m/s에도 열차의 운행이 가능하도록 하였고, 조망을 확보하기 위하여 방풍벽을 만드는 데에는 투명 아크릴판을 사용했다. 이 밖에 아마루베 지구는 눈이 많이 내리기 때문에 교량에 눈이 쌓이지 않게 할 대책이 필요했는데 기존의 물을 이용한 융설(融雪) 장치는 각종 설비가 추가로 필요했기 때문에 저설(貯雪) 시설로 대체하였다. 20년 동안을 기준으로 적설량 116 cm을 상정하여 궤도 사이에 저설 공간을 만들어 다리 밑으로 눈이 떨어지는 것을 막았다. 또한 장기적으로 산인 본선의 전철화를 고려하여 전기선이 설치될 공간을 남겨두었다.",아마루베 지구의 교량에 눈이 쌓이지 않게 할 대책으로 설치하기로 한 시설은 무엇인가?,"{'text': ['저설(貯雪) 시설'], 'answer_start': [444]}"
9,6542650-2-0,위키백과,"2014년 1월 20일 수보드 바르마(Subodh Varma)는 《이코노믹 타임즈》에 투고한 글을 통해 위키백과가 2012년 12월에서 2013년 12월 사이에 전체적으로 페이지 뷰가 10퍼센트에 달하는 2억 번 이상의 페이지뷰를 잃었다고 발표하였다. 주요 언어판에 따라 나누면 영어 위키백과의 페이지뷰 감소율은 12%, 독일어가 17%, 일본어는 9% 였다. 바르마는 ""만일 위키백과 운영자들이 통계 집계에 오류가 있다고 주장한다면 지난해 도입된 구글의 지식 그래프가 그 입을 다물게 할 것""이라고 덧붙였다. 뉴욕 대학교의 부교수 클레이 셔키는 지식 그래프가 다른 사이트들의 페이지뷰를 잠식하고 있는 것에 대해 ""검색 페이지에서 당신의 질문에 대한 답을 바로 볼 수 있는데 굳이 그 싸이트를 방문하겠는가?""라고 반문하였다.",수보드 바르마(Subodh Varma)가 글을 투고한 곳은 어디인가?,"{'text': ['이코노믹 타임즈'], 'answer_start': [37]}"


## Preprocessing the training data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
!git clone https://github.com/monologg/KoBERT-Transformers.git

Cloning into 'KoBERT-Transformers'...
remote: Enumerating objects: 75, done.
remote: Counting objects: 100% (75/75), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 75 (delta 27), reused 58 (delta 17), pack-reused 0
Unpacking objects: 100% (75/75), done.


In [None]:
# %cd KoBERT-Transformers/


/content/KoBERT-Transformers


In [None]:
# from kobert_transformers import get_tokenizer

# tokenizer = get_tokenizer()
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/tokenizer_config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpswe34pln


Downloading:   0%|          | 0.00/111 [00:00<?, ?B/s]

storing https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/48eee2d44ad4a70fe71c0a50350afc7ceb24c6017bfa350b88610838848262b5.e8305b64b2031f694f527d11857299760b6a9bbe443c14acd5a70c84241009fa
creating metadata file for /root/.cache/huggingface/transformers/48eee2d44ad4a70fe71c0a50350afc7ceb24c6017bfa350b88610838848262b5.e8305b64b2031f694f527d11857299760b6a9bbe443c14acd5a70c84241009fa
https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpse33i4ul


Downloading:   0%|          | 0.00/591 [00:00<?, ?B/s]

storing https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/d6cd8995aeb671b209d1194330df13141348633b09ed969a2a3ba2d03beb463e.92c05231de100063d029108b46d69c15d74004c7c3b6afadc026e438f2117711
creating metadata file for /root/.cache/huggingface/transformers/d6cd8995aeb671b209d1194330df13141348633b09ed969a2a3ba2d03beb463e.92c05231de100063d029108b46d69c15d74004c7c3b6afadc026e438f2117711
loading configuration file https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/d6cd8995aeb671b209d1194330df13141348633b09ed969a2a3ba2d03beb463e.92c05231de100063d029108b46d69c15d74004c7c3b6afadc026e438f2117711
Model config ElectraConfig {
  "architectures": [
    "ElectraForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 768,
  "hidden_act": "gelu",
  "hidden_dropo

Downloading:   0%|          | 0.00/263k [00:00<?, ?B/s]

storing https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/vocab.txt in cache at /root/.cache/huggingface/transformers/b8bd3288ab00153b48506e82e4501372313432eee1d80f3c51c1b5b84c823411.5c96fdc82531199b4da6fcabb6b273b84bb0f6f10af9211637ccdec0e0ccddda
creating metadata file for /root/.cache/huggingface/transformers/b8bd3288ab00153b48506e82e4501372313432eee1d80f3c51c1b5b84c823411.5c96fdc82531199b4da6fcabb6b273b84bb0f6f10af9211637ccdec0e0ccddda
https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/special_tokens_map.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp0vj2icz2


Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

storing https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/special_tokens_map.json in cache at /root/.cache/huggingface/transformers/6300017d72730ef13c029f55d4f3f3eee5e4934fcdd54ab52070218782a5aa5f.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
creating metadata file for /root/.cache/huggingface/transformers/6300017d72730ef13c029f55d4f3f3eee5e4934fcdd54ab52070218782a5aa5f.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
loading file https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/b8bd3288ab00153b48506e82e4501372313432eee1d80f3c51c1b5b84c823411.5c96fdc82531199b4da6fcabb6b273b84bb0f6f10af9211637ccdec0e0ccddda
loading file https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/tokenizer.json from cache at None
loading file https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/adde

The following assertion ensures that our tokenizer is a fast tokenizer (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, and we will need some of the special features they have for our preprocessing.

In [None]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)
# transformers.PreTrainedTokenizerFast

You can check which type of models have a fast tokenizer available and which don't on the [big table of models](https://huggingface.co/transformers/index.html#bigtable).

You can directly call this tokenizer on two sentences (one for the question, one for the context):

In [None]:
tokenizer("What is your name?", "My name is Sylvain.")

{'input_ids': [2, 32276, 12557, 27264, 12904, 35, 3, 21866, 12904, 12557, 55, 23158, 4020, 11889, 18, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later). You can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

One thing specific to preprocessing for question answering is how to deal with very long documents. In other tasks, we usually truncate them when they are longer than the model's maximum sequence length, but here, removing part of the context might result in losing the answer we are looking for. To deal with this, we will allow one (long) example in our dataset to give several input features, each shorter than the maximum length of the model (or the one we set as a hyper-parameter). Also, in case the answer lies at the point where we split a long context, we allow some overlap between the features we generate, controlled by the hyper-parameter `doc_stride`:

In [None]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.

Let's find one long example in our dataset:

In [None]:
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > max_length:
        break
example = datasets["train"][i]

Without any truncation, we get the following length for the input IDs:

In [None]:
len(tokenizer(example["question"], example["context"])["input_ids"])

385

Now, if we just truncate, we will lose information (and possibly the answer to our question):

In [None]:
len(tokenizer(example["question"], example["context"], max_length=max_length, truncation="only_second")["input_ids"])

384

Note that we never want to truncate the question, only the context, hence the `only_second` truncation strategy. Our tokenizer can automatically return a list of features capped at a certain maximum length, with the overlap we talked about above. We just have to tell it to do so with `return_overflowing_tokens=True` and pass the stride:

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

Now we don't have one list of `input_ids`, but several: 

In [None]:
[len(x) for x in tokenized_example["input_ids"]]
# tokenized_example

[384, 151]

And if we decode them, we can see the overlap:

In [None]:
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))

[CLS] 바그너가 파우스트 서곡을 쓸 때 어떤 곡의 영향을 받았는가? [SEP] 1839년 바그너는 괴테의 파우스트을 처음 읽고 그 내용에 마음이 끌려 이를 소재로 해서 하나의 교향곡을 쓰려는 뜻을 갖는다. 이 시기 바그너는 1838년에 빛 독촉으로 산전수전을 다 [UNK] 상황이라 좌절과 실망에 가득했으며 메피스토펠레스를 만나는 파우스트의 심경에 공감했다고 한다. 또한 파리에서 아브네크의 지휘로 파리 음악원 관현악단이 연주하는 베토벤의 교향곡 9번을 듣고 깊은 감명을 받았는데, 이것이 이듬해 1월에 파우스트의 서곡으로 쓰여진 이 작품에 조금이라도 영향을 끼쳤으리라는 것은 의심할 여지가 없다. 여기의 라단조 조성의 경우에도 그의 전기에 적혀 있는 것처럼 단순한 정신적 피로나 실의가 반영된 것이 아니라 베토벤의 합창교향곡 조성의 영향을 받은 것을 볼 수 있다. 그렇게 교향곡 작곡을 1839년부터 40년에 걸쳐 파리에서 착수했으나 1악장을 쓴 뒤에 중단했다. 또한 작품의 완성과 동시에 그는 이 서곡 ( 1악장 ) 을 파리 음악원의 연주회에서 연주할 파트보까지 준비하였으나, 실제로는 이루어지지는 않았다. 결국 초연은 4년 반이 지난 후에 드레스덴에서 연주되었고 재연도 이루어졌지만, 이후에 그대로 방치되고 말았다. 그 사이에 그는 리엔치와 방황하는 네덜란드인을 완성하고 탄호이저에도 착수하는 등 분주한 시간을 보냈는데, 그런 바쁜 생활이 이 곡을 잊게 한 것이 아닌가 하는 의견도 있다 [SEP]
[CLS] 바그너가 파우스트 서곡을 쓸 때 어떤 곡의 영향을 받았는가? [SEP]과 동시에 그는 이 서곡 ( 1악장 ) 을 파리 음악원의 연주회에서 연주할 파트보까지 준비하였으나, 실제로는 이루어지지는 않았다. 결국 초연은 4년 반이 지난 후에 드레스덴에서 연주되었고 재연도 이루어졌지만, 이후에 그대로 방치되고 말았다. 그 사이에 그는 리엔치와 방황하는 네덜란드인을 완성하고 탄호이저에도 착수하는 등 분주한 시간을 보냈는데, 그런 바쁜 생활이 이 곡을 잊게 한 것이 아닌가 하는
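
The feature lengths `[384, 151]` seen above can be reproduced with a little arithmetic. The sketch below is a simplified model of a sliding window with overlap, not the tokenizer's actual implementation, and the function name `window_lengths` is hypothetical. It assumes that every feature repeats the 19 question tokens plus 3 special tokens (`[CLS]`, `[SEP]`, `[SEP]`), which is what the sequence IDs printed further down in this notebook suggest.

```python
def window_lengths(num_context_tokens, num_fixed_tokens, max_length, stride):
    """Lengths of the features produced by a sliding window over the context.

    num_fixed_tokens counts everything repeated in each feature (question
    tokens plus special tokens); stride is the overlap, in tokens, between
    two consecutive context windows.
    """
    capacity = max_length - num_fixed_tokens  # context tokens per feature
    if not 0 <= stride < capacity:
        raise ValueError("stride must be smaller than the context capacity")
    lengths = []
    start = 0
    while True:
        end = min(start + capacity, num_context_tokens)
        lengths.append(num_fixed_tokens + (end - start))
        if end == num_context_tokens:
            break
        start = end - stride  # step back by the overlap
    return lengths


# 385 tokens in total, 22 of which (19 question + 3 special) are not context.
print(window_lengths(385 - 22, 22, max_length=384, stride=128))
```

The second window starts 128 tokens before the end of the first one, which is why the second decoded feature begins in the middle of the context rather than where the first one stopped.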

Now this will give us some work to properly treat the answers: we need to find in which of those features the answer actually is, and where exactly in that feature. The models we will use require the start and end positions of these answers in the tokens, so we will also need to map parts of the original context to some tokens. Thankfully, the tokenizer we're using can help us with that by returning an `offset_mapping`:

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
print(tokenized_example["offset_mapping"][0][:100])

[(0, 0), (0, 3), (3, 4), (5, 7), (7, 9), (10, 11), (11, 12), (12, 13), (14, 15), (16, 17), (18, 20), (21, 22), (22, 23), (24, 26), (26, 27), (28, 29), (29, 30), (30, 31), (31, 32), (32, 33), (0, 0), (0, 3), (3, 4), (4, 5), (6, 9), (9, 10), (11, 13), (13, 14), (15, 17), (17, 19), (19, 20), (21, 23), (24, 25), (25, 26), (27, 28), (29, 31), (31, 32), (33, 35), (35, 36), (37, 39), (40, 41), (41, 42), (43, 45), (45, 46), (47, 49), (50, 52), (52, 53), (54, 57), (57, 58), (59, 60), (60, 61), (61, 62), (63, 64), (64, 65), (66, 67), (67, 68), (68, 69), (69, 70), (71, 72), (73, 75), (76, 79), (79, 80), (81, 84), (84, 85), (85, 86), (86, 87), (88, 89), (90, 92), (92, 94), (95, 97), (97, 98), (98, 99), (99, 100), (101, 102), (103, 105), (106, 108), (108, 109), (109, 110), (111, 113), (113, 114), (115, 117), (117, 118), (119, 121), (121, 122), (122, 124), (125, 126), (126, 128), (128, 129), (129, 130), (130, 132), (132, 133), (134, 136), (136, 137), (138, 140), (140, 142), (142, 143), (144, 146), (

This gives, for each index of our input IDs, the start and end character in the original text of the corresponding token. The very first token (`[CLS]`) has (0, 0) because it doesn't correspond to any part of the question or context; the second token corresponds to characters 0 to 3 of the question:

In [None]:
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])

바그너 바그너
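
To illustrate what an offset mapping contains, here is a toy example that does not use the real tokenizer. It is a sketch under a strong simplifying assumption: tokens are just whitespace-separated chunks, whereas the actual tokenizer splits text into subwords and adds special tokens. The function name `whitespace_tokenize_with_offsets` is hypothetical.

```python
import re


def whitespace_tokenize_with_offsets(text):
    """Toy tokenizer: split on whitespace and record (start, end) character offsets."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


text = "What is your name?"
for token, (start, end) in whitespace_tokenize_with_offsets(text):
    # Slicing the original text with the offsets gives back the token.
    assert text[start:end] == token
    print(token, (start, end))
```

As with the real `offset_mapping`, the end offset is exclusive, so `text[start:end]` recovers the token's characters.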


So we can use this mapping to find the position of the start and end tokens of our answer in a given feature. We just have to distinguish which parts of the offsets correspond to the question and which parts correspond to the context. This is where the `sequence_ids` method of our `tokenized_example` can be useful:

In [None]:
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

It returns `None` for the special tokens, then 0 or 1 depending on whether the corresponding token comes from the first sequence passed (the question) or the second (the context). With all of this, we can find the first and last token of the answer in one of our input features (or determine that the answer is not in this feature):

In [None]:
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Start token index of the current span in the text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1

# Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
offsets = tokenized_example["offset_mapping"][0]
if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    # Move the token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go after the last offset if the answer is the last word (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")

122 126


And we can double-check that it is indeed the expected answer:

In [None]:
print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
print(answers["text"][0])

베토벤의 교향곡 9번
베토벤의 교향곡 9번
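
The search above can also be written as a small standalone function, which makes it easier to test on hand-made inputs. This is a sketch rather than the notebook's own code: the name `find_answer_token_span` and the `context_id` parameter are assumptions, and the offsets in the usage example are written by hand for a hypothetical question/context pair instead of coming from a tokenizer.

```python
def find_answer_token_span(offsets, sequence_ids, start_char, end_char, context_id=1):
    """Return (start_token, end_token) for the answer, or None if it is not in this feature."""
    # Indices of the tokens that belong to the context.
    context_tokens = [i for i, s in enumerate(sequence_ids) if s == context_id]
    if not context_tokens:
        return None
    first, last = context_tokens[0], context_tokens[-1]
    # The answer must lie entirely inside the characters covered by this feature.
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return None
    # Last context token that starts at or before the answer's first character.
    start_token = max(i for i in context_tokens if offsets[i][0] <= start_char)
    # First context token that ends at or after the answer's last character.
    end_token = min(i for i in context_tokens if offsets[i][1] >= end_char)
    return start_token, end_token


# Hand-written feature for question "Who?" and context "My name is Sylvain."
#           [CLS]   Who     ?       [SEP]   My      name    is       Sylvain   .         [SEP]
offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 2), (3, 7), (8, 10), (11, 18), (18, 19), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, 1, None]

print(find_answer_token_span(offsets, sequence_ids, start_char=11, end_char=18))
print(find_answer_token_span(offsets, sequence_ids, start_char=30, end_char=35))
```

Passing `context_id=0` covers the case where the context is given as the first sequence instead of the second.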


For this notebook to work with any kind of model, we need to account for the special case where the model expects padding on the left (in which case we switch the order of the question and the context):

In [None]:
pad_on_right = tokenizer.padding_side == "right"

Now let's put everything together in one function we will apply to our training set. In the case of impossible answers (the answer is in another feature given by an example with a long context), we set the CLS index for both the start and end position. We could also simply discard those examples from the training set if impossible answers are not allowed (the `squad_v2` flag is `False`). Since the preprocessing is already complex enough as it is, we've kept it simple for this part.

In [None]:
def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
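
The label computation at the end of this function can be hard to follow inside the full loop. Here is a minimal sketch of the same idea on a toy offset list, with no tokenizer involved. The helper name `locate_answer_tokens` and the toy data are made up for illustration, and the extra lower-bound guard on the backward loop is an addition that the function above does not have.

```python
def locate_answer_tokens(offsets, start_char, end_char):
    """Return (start_token, end_token) for an answer, or None if it is not fully covered.

    `offsets` is a list of (char_start, char_end) pairs, one per context token.
    This is a hypothetical helper, not part of the notebook or of any library.
    """
    first, last = 0, len(offsets) - 1
    # The answer must lie entirely inside the characters covered by these tokens.
    if not (offsets[first][0] <= start_char and offsets[last][1] >= end_char):
        return None
    # Walk forward past every token that starts at or before the answer start.
    while first < len(offsets) and offsets[first][0] <= start_char:
        first += 1
    # Walk backward past every token that ends at or after the answer end.
    while last >= 0 and offsets[last][1] >= end_char:
        last -= 1
    return first - 1, last + 1


# Toy context "the cat sat down", split on whitespace.
toy_offsets = [(0, 3), (4, 7), (8, 11), (12, 16)]
print(locate_answer_tokens(toy_offsets, 4, 11))   # answer text "cat sat"
print(locate_answer_tokens(toy_offsets, 20, 25))  # answer outside this feature
```

When the helper returns `None`, the real function labels the feature with the CLS index instead.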

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
features = prepare_train_features(datasets["train"][:5])
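
To see why a handful of examples can produce more features than examples, here is a simplified sketch of splitting with overlap. It assumes that `doc_stride` is the number of tokens shared by consecutive windows, and it ignores the question and the special tokens that the real tokenizer adds to every feature. The function `split_with_overlap` and its parameters `window` and `overlap` are hypothetical names used only for this illustration.

```python
def split_with_overlap(token_lists, window, overlap):
    """Split each token list into windows of at most `window` tokens.

    Consecutive windows of the same example share `overlap` tokens.
    Returns the windows and, for each window, the index of its source example.
    """
    assert 0 <= overlap < window
    step = window - overlap
    windows, sample_mapping = [], []
    for sample_index, tokens in enumerate(token_lists):
        start = 0
        while True:
            windows.append(tokens[start:start + window])
            sample_mapping.append(sample_index)
            # Stop once the current window reaches the end of the example.
            if start + window >= len(tokens):
                break
            start += step
    return windows, sample_mapping


toy_examples = [list(range(10)), list(range(3))]
toy_windows, toy_mapping = split_with_overlap(toy_examples, window=4, overlap=2)
print(len(toy_examples), "examples ->", len(toy_windows), "features")
print(toy_mapping)
```

The second list plays the role of `overflow_to_sample_mapping`: it records which example each feature came from.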

To apply this function to all the examples in our dataset, we just use the `map` method of the `datasets` object we created earlier. This will apply the function to all the elements of all the splits in `datasets`, so every split will be preprocessed in one single command. Since our preprocessing changes the number of samples, we need to remove the old columns when applying it.

In [None]:
tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)

  0%|          | 0/61 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be used). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to process the texts in a batch concurrently.

## Fine-tuning the model

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

loading configuration file https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/d6cd8995aeb671b209d1194330df13141348633b09ed969a2a3ba2d03beb463e.92c05231de100063d029108b46d69c15d74004c7c3b6afadc026e438f2117711
Model config ElectraConfig {
  "architectures": [
    "ElectraForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 768,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "transformers_version": "4.10.2",
  "type_vocab_size": 

Downloading:   0%|          | 0.00/449M [00:00<?, ?B/s]

storing https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/c68e01e42a50fae365d7124ec1f27081bcd613aa40a5323d51a574e98cdea121.922e11c1c9fca41f08877727bf0b286d98a3efaceda242df277da5b99e8d6c47
creating metadata file for /root/.cache/huggingface/transformers/c68e01e42a50fae365d7124ec1f27081bcd613aa40a5323d51a574e98cdea121.922e11c1c9fca41f08877727bf0b286d98a3efaceda242df277da5b99e8d6c47
loading weights file https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/c68e01e42a50fae365d7124ec1f27081bcd613aa40a5323d51a574e98cdea121.922e11c1c9fca41f08877727bf0b286d98a3efaceda242df277da5b99e8d6c47
All model checkpoint weights were used when initializing ElectraForQuestionAnswering.

All the weights of ElectraForQuestionAnswering were initialized from the model checkpoint at monologg/koelectra-base-v3-finetuned

If the checkpoint had only been pretrained on a masked language modeling objective, the library would warn us here that it is throwing away the pretraining head and randomly initializing a new question answering head, and that we should fine-tune the model before using it for inference. In this run the log instead reports that all the checkpoint weights were used, because the checkpoint we loaded (`monologg/koelectra-base-v3-finetuned-korquad`) already has the `ElectraForQuestionAnswering` architecture. We are still going to fine-tune it on our dataset.

To instantiate a `Trainer`, we will need to define three more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    "kobert_squad_kor_v1",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    push_to_hub=True,
    push_to_hub_model_id=f"{model_name}-finetuned-squad_kor_v1",
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay.

The last two arguments set everything up so we can push the model to the [Hub](https://huggingface.co/models) at the end of training. Remove both of them if you didn't follow the installation steps at the top of the notebook; otherwise you can change the value of `push_to_hub_model_id` to something you would prefer.

Then we will need a data collator that will batch our processed examples together, here the default one will work:

In [None]:
from transformers import default_data_collator

data_collator = default_data_collator

We will evaluate our model and compute metrics in the next section (this is a very long operation, so we will only compute the evaluation loss during training).

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Cloning https://huggingface.co/Arogya/koelectra-base-v3-finetuned-korquad-finetuned-squad_kor_v1 into local empty directory.


We can now finetune our model by just calling the `train` method:

In [None]:
trainer.train()

***** Running training *****
  Num examples = 69845
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 4366


Epoch,Training Loss,Validation Loss
1,0.2048,0.999522


Saving model checkpoint to kobert_squad_kor_v1/checkpoint-500
Configuration saved in kobert_squad_kor_v1/checkpoint-500/config.json
Model weights saved in kobert_squad_kor_v1/checkpoint-500/pytorch_model.bin
tokenizer config file saved in kobert_squad_kor_v1/checkpoint-500/tokenizer_config.json
Special tokens file saved in kobert_squad_kor_v1/checkpoint-500/special_tokens_map.json
Saving model checkpoint to kobert_squad_kor_v1/checkpoint-1000
Configuration saved in kobert_squad_kor_v1/checkpoint-1000/config.json
Model weights saved in kobert_squad_kor_v1/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in kobert_squad_kor_v1/checkpoint-1000/tokenizer_config.json
Special tokens file saved in kobert_squad_kor_v1/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to kobert_squad_kor_v1/checkpoint-1500
Configuration saved in kobert_squad_kor_v1/checkpoint-1500/config.json
Model weights saved in kobert_squad_kor_v1/checkpoint-1500/pytorch_model.bin
tokenizer config

TrainOutput(global_step=4366, training_loss=0.20591888043293974, metrics={'train_runtime': 5729.3474, 'train_samples_per_second': 12.191, 'train_steps_per_second': 0.762, 'total_flos': 1.368770398066944e+16, 'train_loss': 0.20591888043293974, 'epoch': 1.0})
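
As a quick sanity check, the number of optimization steps in the log can be reproduced from the other values it reports. This sketch assumes a single device, no gradient accumulation (the log shows 1), and that the last, smaller batch is kept rather than dropped. The variable names are chosen for this illustration only.

```python
import math

logged_num_examples = 69845  # "Num examples" in the log above
logged_batch_size = 16       # "Total train batch size" in the log above
logged_num_epochs = 1        # "Num Epochs" in the log above

# The last batch of the epoch is smaller than the others, hence the ceiling.
steps_per_epoch = math.ceil(logged_num_examples / logged_batch_size)
total_steps = steps_per_epoch * logged_num_epochs
print(total_steps)
```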

Since this training is particularly long, let's save the model just in case we need to restart.

In [None]:
trainer.save_model("finetuned_koelectra_squad_kor_v1")

Saving model checkpoint to kobert_squad_kor_v1
Configuration saved in kobert_squad_kor_v1/config.json
Model weights saved in kobert_squad_kor_v1/pytorch_model.bin
tokenizer config file saved in kobert_squad_kor_v1/tokenizer_config.json
Special tokens file saved in kobert_squad_kor_v1/special_tokens_map.json


## Evaluation

Evaluating our model will require a bit more work, as we will need to map the predictions of our model back to parts of the context. The model itself predicts logits for the start and end position of our answers: if we take a batch from our validation dataloader, here is the output our model gives us:

In [None]:
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

odict_keys(['loss', 'start_logits', 'end_logits'])

The output of the model is a dict-like object that contains the loss (since we provided labels) and the start and end logits. We won't need the loss for our predictions, so let's have a look at the logits:

In [None]:
output.start_logits.shape, output.end_logits.shape

(torch.Size([16, 384]), torch.Size([16, 384]))

We have one logit for each feature and each token. The most obvious way to predict an answer for each feature is to take the index of the maximum of the start logits as the start position and the index of the maximum of the end logits as the end position.

In [None]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

(tensor([ 21,  92,  21, 198, 260,  75,  36,  86,  91, 114,  86, 115, 163,  72,
          16, 113], device='cuda:0'),
 tensor([ 26,  93,  22, 203, 266,  76,  38,  86,  94, 116,  86, 120, 167,  73,
          22, 118], device='cuda:0'))

This will work great in a lot of cases, but what if this prediction gives us something impossible? The start position could be greater than the end position, or point to a span of text in the question instead of the context. In that case, we might want to look at the second best prediction to see if it gives a possible answer and select that instead.

However, picking the second best answer is not as easy as picking the best one: is it the second best index in the start logits with the best index in the end logits? Or the best index in the start logits with the second best index in the end logits? And if that second best answer is not possible either, it gets even trickier for the third best answer.


To rank our answers, we will use the score obtained by adding the start and end logits. We won't try to order all the possible answers; instead we limit ourselves with a hyper-parameter we call `n_best_size`. We'll pick the best `n_best_size` indices in the start and end logits and gather all the answers they predict. After checking that each one is valid, we will sort them by their score and keep the best one. Here is how we would do this on the first feature in the batch:

In [None]:
n_best_size = 20

In [None]:
import numpy as np

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # We need to find a way to get back the original substring corresponding to the answer in the context
                }
            )

And then we can sort the `valid_answers` according to their `score` and only keep the best one. The only point left is how to check a given span is inside the context (and not the question) and how to get back the text inside. To do this, we need to add two things to our validation features:
- the ID of the example that generated the feature (since each example can generate several features, as seen before);
- the offset mapping that will give us a map from token indices to character positions in the context.

That's why we will re-process the validation set with the following function, slightly different from `prepare_train_features`:

In [None]:
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
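
The effect of replacing non-context offsets with `None` is easier to see on a toy feature. The sketch below uses made-up sequence ids and offsets, and the helpers `mask_non_context` and `span_in_context` are hypothetical names for this illustration. It assumes the question comes first, so the context has sequence id 1.

```python
def mask_non_context(offsets, sequence_ids, context_index=1):
    """Keep offsets only for tokens whose sequence id equals `context_index`."""
    return [o if s == context_index else None for o, s in zip(offsets, sequence_ids)]


def span_in_context(masked_offsets, start_index, end_index):
    """True when both ends of the span fall on context tokens."""
    return masked_offsets[start_index] is not None and masked_offsets[end_index] is not None


# Toy feature laid out as [CLS] q q [SEP] c c c [SEP].
# Special tokens have sequence id None, question tokens 0, context tokens 1.
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
toy_feature_offsets = [(0, 0), (0, 3), (4, 8), (0, 0), (0, 2), (3, 6), (7, 9), (0, 0)]

masked = mask_non_context(toy_feature_offsets, toy_sequence_ids)
print(masked)
print(span_in_context(masked, 4, 6))  # both ends are context tokens
print(span_in_context(masked, 1, 5))  # starts inside the question
```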

And like before, we can apply that function to our validation set easily:

In [None]:
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)

  0%|          | 0/6 [00:00<?, ?ba/s]

Now we can grab the predictions for all features by using the `Trainer.predict` method:

In [None]:
raw_predictions = trainer.predict(validation_features)

The following columns in the test set  don't have a corresponding argument in `ElectraForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping.
***** Running Prediction *****
  Num examples = 6892
  Batch size = 16


The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [None]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

We can now refine the test we had before: since we set `None` in the offset mappings when it corresponds to a part of the question, it's easy to check if an answer is fully inside the context. We also eliminate very long answers from our considerations (with a hyper-parameter we can tune):

In [None]:
max_answer_length = 30

In [None]:
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. For the more general case, we will need to match the example_id to
# an example index
context = datasets["validation"][0]["context"]

# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        start_char = offset_mapping[start_index][0]
        end_char = offset_mapping[end_index][1]
        valid_answers.append(
            {
                "score": start_logits[start_index] + end_logits[end_index],
                "text": context[start_char: end_char]
            }
        )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

We can compare to the actual ground-truth answer:

In [None]:
datasets["validation"][0]["answers"]

Our model picked the right answer as the most likely one!

As we mentioned in the code above, this was easy on the first feature because we knew it comes from the first example. For the other features, we will need a map between examples and their corresponding features. Also, since one example can give several features, we will need to gather together all the answers in all the features generated by a given example, then pick the best one. The following code builds a map from each example index to its corresponding feature indices:

In [None]:
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)
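
On toy data the resulting map looks like this. The ids below are made up for illustration; only the grouping logic mirrors the cell above.

```python
import collections

toy_example_ids = ["a", "b", "c"]
# Feature i was produced by the example whose id is toy_feature_ids[i].
toy_feature_ids = ["a", "a", "b", "c", "c", "c"]

toy_id_to_index = {k: i for i, k in enumerate(toy_example_ids)}
toy_features_per_example = collections.defaultdict(list)
for i, example_id in enumerate(toy_feature_ids):
    toy_features_per_example[toy_id_to_index[example_id]].append(i)

# Example 0 owns features 0 and 1, example 1 owns feature 2, and so on.
print(dict(toy_features_per_example))
```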

We're almost ready for our post-processing function. The last bit to deal with is the impossible answer (when `squad_v2 = True`). The code above only keeps answers that are inside the context; we also need to grab the score for the impossible answer (which has start and end indices corresponding to the index of the CLS token). When one example gives several features, we should predict the impossible answer only when all the features give it a high score, since one feature could predict the impossible answer just because the answer isn't in the part of the context it has access to. This is why the score of the impossible answer for one example is the *minimum* of the scores for the impossible answer in each feature generated by the example.

We then predict the impossible answer when that score is greater than the score of the best non-impossible answer. All combined together, this gives us this post-processing function:

In [None]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or feature_null_score < min_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case where we don't have a single non-null prediction, we create a fake
            # prediction to avoid failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions
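
The impossible-answer rule described before the function can be isolated in a few lines. This is a sketch with made-up scores, and `pick_answer` is a hypothetical helper rather than part of the notebook. It only matters when `squad_v2` is `True`.

```python
def pick_answer(best_text, best_score, feature_null_scores):
    """Return the predicted text, or "" when the null answer wins.

    `feature_null_scores` holds, for each feature of one example, the sum of
    the start and end logits at the CLS position.
    """
    # The example-level null score is the minimum over its features.
    min_null_score = min(feature_null_scores)
    return best_text if best_score > min_null_score else ""


# One feature is confident there is no answer (it cannot see it), but another
# feature gives the null answer a low score, so the span is kept.
kept = pick_answer("Beethoven", 7.5, [9.0, 1.2])
# Every feature scores the null answer above the best span, so it is dropped.
dropped = pick_answer("Beethoven", 7.5, [9.0, 8.1])
print(repr(kept), repr(dropped))
```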

And we can apply our post-processing function to our raw predictions:

In [None]:
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

Then we can load the metric from the datasets library.

In [None]:
metric = load_metric("squad_v2" if squad_v2 else "squad")

Then we can call `compute` on it. We just need to format predictions and labels a bit, as it expects a list of dictionaries and not one big dictionary. In the case of `squad_v2`, we also have to set a `no_answer_probability` argument (which we set to 0.0 here as we have already set the answer to empty if we picked it).

In [None]:
if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)
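
The SQuAD metric reports an exact match score and an F1 score. As a rough intuition for what these measure, here is a simplified sketch for a single prediction and a single reference. It is not the library's implementation: it assumes whitespace tokenization and skips the text normalization and the handling of several reference answers that the real metric performs. The function names are hypothetical.

```python
import collections


def exact_match(prediction, reference):
    """1.0 when the two strings are identical after stripping whitespace."""
    return float(prediction.strip() == reference.strip())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("베토벤의 교향곡 9번", "베토벤의 교향곡 9번"))
print(token_f1("교향곡 9번", "베토벤의 교향곡 9번"))
```

A partial answer gets no credit under exact match but still earns a non-zero F1, which is why both numbers are usually reported together.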

You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
trainer.push_to_hub()

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("sgugger/my-awesome-model")
```