## Package Download

If you're opening this Notebook on colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Uncomment the following cell and run it.

In [None]:
! pip install datasets transformers

In [2]:
from huggingface_hub import notebook_login
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-crendential store but this isn't the helper defined on your machine.
You will have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal to set it as the default

git config --global credential.helper store[0m


In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 37 not upgraded.
Need to get 2,129 kB of archives.
After this operation, 7,662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2,129 kB]
Fetched 2,129 kB in 1s (2,230 kB/s)
Selecting previously unselected package git-lfs.
(Reading database ... 155219 files and directories currently installed.)
Preparing to unpack .../git-lfs_2.3.4-1_amd64.deb ...
Unpacking git-lfs (2.3.4-1) ...
Setting up git-lfs (2.3.4-1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...


In [4]:
import transformers

print(transformers.__version__)

4.12.3


# Fine-tuning a model on a question-answering task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) model to a question answering task, which is the task of extracting the answer to a question from a given context. We will see how to easily load a dataset for these kinds of tasks and use the `Trainer` API to fine-tune a model on it.


**Note:** This notebook finetunes models that answer question by taking a substring of a context, not by generating new text.

This notebook is built to run on any question answering task with the same format as SQUAD (version 1 or 2), with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a token classification head and a fast tokenizer (check on [this table](https://huggingface.co/transformers/index.html#bigtable) if this is the case). It might just need some small adjustments if you decide to use a different dataset than the one used here. Depending on you model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:

In [5]:
model_name = "monologg/koelectra-base-v3-finetuned-korquad"

## Loading the dataset -- __failed__

- datasets 의 DatasetDict 클래스 사용하면서, 코드에서 data column 추가해서 datasets 로 만드는 방법 여지 1
- json 불러올 때 튕기는거 원인 알아내는거 여지 2 : 한글이랑 잘 호환이 안되는것같다.

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
# import datasets
# from datasets import load_dataset, load_metric

In [None]:
# squad_v2 = False
# datasets_en = load_dataset("squad_v2" if squad_v2 else "squad")

Reusing dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


  0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


데이터셋 안맞아서, 새로 데이터셋 refine 하는 과정(다시 안해도 된다)

In [None]:
# train_datasets_ko = load_dataset('json', data_files='/content/drive/MyDrive/QA_Project/goormkoreanmrcproject/data/train.json', field='data')
# test_datasets_ko = load_dataset('json', data_files='/content/drive/MyDrive/QA_Project/goormkoreanmrcproject/data/test.json', field='data')

Using custom data configuration default-daa1f6ff47389dee
Reusing dataset json (/root/.cache/huggingface/datasets/json/default-daa1f6ff47389dee/0.0.0/c2d554c3377ea79c7664b93dc65d0803b45e3279000f993c7bfd18937fd7f426)


  0%|          | 0/1 [00:00<?, ?it/s]

Using custom data configuration default-83458224606751ad
Reusing dataset json (/root/.cache/huggingface/datasets/json/default-83458224606751ad/0.0.0/c2d554c3377ea79c7664b93dc65d0803b45e3279000f993c7bfd18937fd7f426)


  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
# # test/train_datasets_ko 를 이 예제에서 다루는 SQUAD Dataset과 같은 형태로 바꾸어준다.
# # 한번 했으면 굳이 다시할 필요 없음!
# # 데이터셋 만들기
# import pandas as pd
# import json

# # train
# train_datasets_df = pd.DataFrame(columns=['id', 'title', 'question', 'answers', 'context'])
# i=0
# for data_mungch in train_datasets_ko['train']: # 여기서 실수.
#     for paragraphs_ in data_mungch['paragraphs']:
#         for qas_ in paragraphs_['qas']:
#             new_data_i = {'id':qas_['guid'],'title': data_mungch['title'], 'question':qas_['question'], 'answers': qas_['answers'][0], 'context':paragraphs_['context']}
#             train_datasets_df = train_datasets_df.append(new_data_i, ignore_index=True) # 여기서 실수 : 대입을 안했다
            
#             if i%1000==0 : 
#                 print('train iter : ', i)
#             i += 1

# # test
# test_datasets_df = pd.DataFrame(columns=['id', 'title', 'question', 'context'])
# i=0
# for data_mungch in test_datasets_ko['train']: # 여기서 실수.
#     for paragraphs_ in data_mungch['paragraphs']:
#         for qas_ in paragraphs_['qas']:
#             new_data_i = {'id':qas_['guid'],'title': data_mungch['title'], 'question':qas_['question'], 'context':paragraphs_['context']}
#             test_datasets_df = test_datasets_df.append(new_data_i, ignore_index=True) # 여기서 실수.
            
#             if i%1000==0 : 
#                 print('test iter : ', i)
#             i += 1

In [None]:
# pandas 이용

# for_test_df = pd.DataFrame(columns=['id', 'man'])
# for i in range(0,10):
#     for_test_df=for_test_df.append({'id':i, 'man':'이재형'}, ignore_index=True)

# for_test_df.to_json(DIR+'ERR_test_sample.json', orient = 'records', force_ascii=False)
# datasets_for_test = load_dataset('json', data_files = DIR + 'ERR_test_sample.json')
# datasets_for_test['train']
# datasets_for_test['train']['id'] -> 0 으로, 처음 id 하나만 담기고, 나머지는 제대로 load 되지 않는다. => records가 아닌 다른 형식을 이용하면 다를까?

In [None]:
# 저장방식 바꿔보기, 기존처럼 version, data 추가하기

# ex_dict={}
# for _ in range(0,10):
#     for_test_df=for_test_df.append({'id':12, 'man':'이재형'}, ignore_index=True)

# for_test_df.to_json(DIR+'ERR_test_sample.json', orient = 'records', force_ascii=False)
# datasets_for_test = load_dataset('json', data_files=DIR+'ERR_test_sample.json')
# datasets_for_test['train']['data']

In [None]:
# # new_datasets_df 를 내보내고, datasets 모듈로 다시 불러온다.
# DIR = '/content/drive/MyDrive/QA_Project/goormkoreanmrcproject/data/'
# train_datasets_df.to_json(DIR + 'train_refined.json', orient = 'records', force_ascii=False)
# test_datasets_df.to_json(DIR + 'test_refined.json', orient = 'records', force_ascii=False)

In [None]:
# # 문제 1 : to_json 으로 저장 하면서 [] 으로 전체 묶이는거 푸니까 잘 로드됨 -> 문제2 발생
# # 문제 3 : {"data" : [{ 내 데이터 },]} 꼴로 하니까 잘 로드됨 하지만, 튕기는문제 발생 -> datasets 모듈 사용하지 않음으로 노선 변경, 시간이 없다.
# datasets_test = load_dataset('json', data_files='/content/drive/MyDrive/QA_Project/goormkoreanmrcproject/data/test_refined.json')
# datasets_train = load_dataset('json', data_files='/content/drive/MyDrive/QA_Project/goormkoreanmrcproject/data/train_refined.json')

Using custom data configuration default-8010e8459ce14cdf


Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-8010e8459ce14cdf/0.0.0/c2d554c3377ea79c7664b93dc65d0803b45e3279000f993c7bfd18937fd7f426...


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
# # 문제 2 : row가 1개만 잡혀온다.
# datasets_train 

The `datasets` object itself is [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for the training, validation and test set.

In [None]:
# datasets_en # 예제 datasets

We can see the training, validation and test sets all have a column for the context, the question and the answers to those questions.

To access an actual element, you need to select a split first, then give an index:

We can see the answers are indicated by their start position in the text (here at character 515) and their full text, which is a substring of the context as we mentioned above.

## Loading the dataset using Pandas

In [7]:
import json

DIR = '/content/drive/MyDrive/QA_Project/goormkoreanmrcproject/data/'

with open( DIR + 'train.json' ) as f:
    train_raw = json.loads(f.read())

with open( DIR + 'test.json' ) as f:
    test_raw = json.loads(f.read())


In [8]:
train_raw['data'][0]

{'news_category': '종합',
 'paragraphs': [{'context': '올여름 장마가 17일 제주도에서 시작됐다. 서울 등 중부지방은 예년보다 사나흘 정도 늦은 이달 말께 장마가 시작될 전망이다.17일 기상청에 따르면 제주도 남쪽 먼바다에 있는 장마전선의 영향으로 이날 제주도 산간 및 내륙지역에 호우주의보가 내려지면서 곳곳에 100㎜에 육박하는 많은 비가 내렸다. 제주의 장마는 평년보다 2~3일, 지난해보다는 하루 일찍 시작됐다. 장마는 고온다습한 북태평양 기단과 한랭 습윤한 오호츠크해 기단이 만나 형성되는 장마전선에서 내리는 비를 뜻한다.장마전선은 18일 제주도 먼 남쪽 해상으로 내려갔다가 20일께 다시 북상해 전남 남해안까지 영향을 줄 것으로 보인다. 이에 따라 20~21일 남부지방에도 예년보다 사흘 정도 장마가 일찍 찾아올 전망이다. 그러나 장마전선을 밀어올리는 북태평양 고기압 세력이 약해 서울 등 중부지방은 평년보다 사나흘가량 늦은 이달 말부터 장마가 시작될 것이라는 게 기상청의 설명이다. 장마전선은 이후 한 달가량 한반도 중남부를 오르내리며 곳곳에 비를 뿌릴 전망이다. 최근 30년간 평균치에 따르면 중부지방의 장마 시작일은 6월24~25일이었으며 장마기간은 32일, 강수일수는 17.2일이었다.기상청은 올해 장마기간의 평균 강수량이 350~400㎜로 평년과 비슷하거나 적을 것으로 내다봤다. 브라질 월드컵 한국과 러시아의 경기가 열리는 18일 오전 서울은 대체로 구름이 많이 끼지만 비는 오지 않을 것으로 예상돼 거리 응원에는 지장이 없을 전망이다.',
   'qas': [{'answers': [{'answer_start': 478, 'text': '한 달가량'},
      {'answer_start': 478, 'text': '한 달'}],
     'guid': '798db07f0b9046759deed9d4a35ce31e',
     'question': '북태평양 기단과 오호츠크해 기단이 만나 국내에 머무르는 기간은?'}]}]

In [9]:
import pandas as pd
import json

# train
train_df = pd.DataFrame(columns=['id', 'title', 'question', 'answers', 'context'])
i=0
for data_ in train_raw['data']:
    for paragraphs_ in data_['paragraphs']:
        for qas_ in paragraphs_['qas']:
            new_data_i = {'id':qas_['guid'],'title': data_['title'], 'question':qas_['question'], 'answers': qas_['answers'][0], 'context':paragraphs_['context']}
            train_df = train_df.append(new_data_i, ignore_index=True)
            
            if i%1000==0 : 
                print('train iter : ', i)
            i += 1

# test
test_df = pd.DataFrame(columns=['id', 'title', 'question', 'context'])
i=0
for data_mungch in test_raw['data']:
    for paragraphs_ in data_mungch['paragraphs']:
        for qas_ in paragraphs_['qas']:
            new_data_i = {'id':qas_['guid'],'title': data_mungch['title'], 'question':qas_['question'], 'context':paragraphs_['context']}
            test_df = test_df.append(new_data_i, ignore_index=True)
            
            if i%1000==0 : 
                print('test iter : ', i)
            i += 1

train iter :  0
train iter :  1000
train iter :  2000
train iter :  3000
train iter :  4000
train iter :  5000
train iter :  6000
train iter :  7000
train iter :  8000
train iter :  9000
train iter :  10000
train iter :  11000
train iter :  12000
test iter :  0
test iter :  1000
test iter :  2000
test iter :  3000
test iter :  4000


In [10]:
# # EDA
# import seaborn as sns
# from pylab import rcParams
# import matplotlib.pyplot as plt
# from matplotlib import rc
# import gc

# text_words_count = []
# for context in train_df['context']:
#     text_words_count.append(len(context))

# len(text_words_count)

In [11]:
# m =0
# for word_len in text_words_count:
#     m += word_len
# m = m/12037
# print(m, max(text_words_count))

# # 평균 1017 words 의 단어수, 최대 2046

In [12]:
# plt.rcParams["figure.figsize"] = (12, 9)
# sns.histplot(text_words_count)

In [13]:
# from transformers import (
#     AutoTokenizer
# )

# tokenizer = AutoTokenizer.from_pretrained(model_name)

In [14]:
# tokenized_words_len=[]
# for i in range(0, len(train['context'])):
#     tokenized_words_len.append(len(tokenizer.encode(train['context'][i])))

# plt.rcParams["figure.figsize"] = (12, 9)
# sns.histplot(tokenized_words_len)

# m = 0
# max_len = 0
# for word_len in tokenized_words_len:
#     m += word_len
#     if word_len>max_len:max_len=word_len
# m = m/12037
# print(m, max_len)

# # tokenized 결과, 평균 536길이 최대 1157 길이. -> max len을 540은 해야겠구나.

## Preprocessing the training data

In [15]:
from transformers import AutoTokenizer
    
model_name = 'monologg/koelectra-base-v3-finetuned-korquad'
tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading:   0%|          | 0.00/111 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/591 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/257k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [16]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

### example로 tokenizer 다뤄보기 -- all 주석

In [17]:
# for i, example in enumerate(train_raw['data']):
#     if len(tokenizer(example['paragraphs'][0]["qas"][0]['question'], example['paragraphs'][0]["context"])["input_ids"]) > 384*2:
#         break
# # example = train_raw["data"][i]
# overflow_example = train_df.loc[i]

In [18]:
# # 983 개의 tokens
# # len(tokenizer(example['paragraphs'][0]["qas"][0]['question'], example['paragraphs'][0]["context"])["input_ids"])
# len(tokenizer(example_df['question'],
#               example_df['context'])['input_ids']
#     )

In [19]:
# # example #
# example_max_length = 384*2
# example_stride = 256

# tokenized_result = tokenizer(example_df['question'],
#                              example_df['context'],
#                              max_length = example_max_length,
#                              truncation='only_second'
#                             )

# tokenized_result.keys() # 모든 transformer attention에 들어갈 수 있는 형식이구나
# len(tokenized_result['input_ids'][1])

In [20]:
# max_length = 384*2 # The maximum length of a feature (question and context)
# doc_stride = 128*2 # The authorized overlap between two part of the context when splitting it is needed.

# tokenized_example = tokenizer(
#     example_df["question"],
#     example_df["context"],
#     max_length=max_length,
#     truncation="only_second",  # 두번째 오는 text만 tuncate 한다는 의미
#     return_overflowing_tokens=True,  # overflowing tokens 
#     stride=doc_stride
# )

Now this will give us some work to properly treat the answers: we need to find in which of those features the answer actually is, and where exactly in that feature. The models we will use require the start and end positions of these answers in the tokens, so we will also need to to map parts of the original context to some tokens. _Thankfully, the tokenizer we're using can help us with that by returning an `offset_mapping`:_

In [21]:
# tokenized_example = tokenizer(
#     example['paragraphs'][0]['qas'][0]["question"],
#     example['paragraphs'][0]["context"],
#     max_length=max_length,
#     truncation="only_second",
#     return_overflowing_tokens=True,
#     return_offsets_mapping=True,
#     stride=doc_stride
# )

# # tokenizer 에서 offset_mapping 지원해준다!
# print(tokenized_example["offset_mapping"][0])
# print(tokenized_example['overflow_to_sample_mapping']) # ?
# tokenized_example['input_ids'][0][1]
# tokenizer.convert_ids_to_tokens(11507)

In [22]:
# tokenized_example.keys()

So we can use this mapping to find the position of the start and end tokens of our answer in a given feature. We just have to distinguish which parts of the offsets correspond to the question and which part correspond to the context, this is where the `sequence_ids` method of our `tokenized_example` can be useful:

---

1. 'offset_mapping + sequence_ids' -> 정답토큰 위치 찾기. 
2. context에 해당하는 token에서만 answer을 찾아야하므로.

In [23]:
# print(tokenized_example.sequence_ids())
# print(tokenized_example['offset_mapping']) # 모든 토큰의 offset 은 (0,0) 으로 매핑된다.

we can find the first and last token of the answer in one of our input feature (or if the answer is not in this feature):

---

first token 은 offset으로 찾고, last token은 [start + len(answer) - 1] 까지다.

In [24]:
# answers = example['paragraphs'][0]['qas'][0]["answers"]
# start_char = answers[0]["answer_start"]
# end_char = start_char + len(answers[0]["text"])

# # Start token index of the current span in the text.
# # Question은 0이므로, Context에 해당하는 start index를 찾는다.
# token_start_index = 0
# while sequence_ids[token_start_index] != 1:
#     token_start_index += 1

# # End token index of the current span in the text.
# # Conext의 [SEP]가 아닌 마지막 token 위치 찾는다
# token_end_index = len(tokenized_example["input_ids"][0]) - 1
# while sequence_ids[token_end_index] != 1:
#     token_end_index -= 1

# # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
# # answer 위치를 token 위치로 바꾸는 중이다.
# offsets = tokenized_example["offset_mapping"][0] # 첫번째 tuncated input
# if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
#     # Move the token_start_index and token_end_index to the two ends of the answer.
#     # Note: we could go after the last offset if the answer is the last word (edge case).
#     while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
#         token_start_index += 1
#     start_position = token_start_index - 1
#     while offsets[token_end_index][1] >= end_char:
#         token_end_index -= 1
#     end_position = token_end_index + 1
#     print(start_position, end_position)
# else:
#     print("The answer is not in this feature.")

For this notebook to work with any kind of models, we need to account for the special case where the model expects padding on the left (in which case we switch the order of the question and the context):

In [25]:
# pad_on_right = tokenizer.padding_side == "right"

### tokenizer 이용, 진짜 사용될 학습 features 구하기
- features : input_

In [26]:
from sklearn.model_selection import train_test_split

train_df_splited, val_df_splited = train_test_split(train_df, test_size=.1, random_state=42)

#### TY flow - all 주석
Now let's put everything together in one function we will apply to our training set. Since the preprocessing is already complex enough as it is, we've kept is simple for this part.

In [27]:
# # Training preprocessing
# max_seq_length = 384*2
# doc_stride = 128*2

# def prepare_train_features(example, tokenizer):
#     example["question"] = example["question"].lstrip()
#     tokenized_example = tokenizer(
#         example["question"],
#         example["context"],
#         truncation="only_second",
#         max_length=max_seq_length,
#         stride=doc_stride,
#         return_overflowing_tokens=True, # 길이를 넘어가는 토큰들을 반환할 것인지
#         return_offsets_mapping=True,  # 각 토큰에 대해 (char_start, char_end) 정보를 반환한 것인지
#         padding="max_length",
#     )

#     sample_mapping = tokenized_example.pop("overflow_to_sample_mapping")
#     offset_mapping = tokenized_example.pop("offset_mapping")

#     features = []
#     for i, offsets in enumerate(offset_mapping):
#         feature = {}

#         input_ids = tokenized_example["input_ids"][i]
#         attention_mask = tokenized_example["attention_mask"][i]

#         feature['input_ids'] = input_ids
#         feature['attention_mask'] = attention_mask
#         feature['offset_mapping'] = offsets

#         cls_index = input_ids.index(tokenizer.cls_token_id)
#         sequence_ids = tokenized_example.sequence_ids(i)

#         sample_index = sample_mapping[i]
#         answers = example["answers"]

#         if answers["answer_start"] == None:
#             feature["start_position"] = cls_index
#             feature["end_position"] = cls_index
#         else:
#             start_char = answers["answer_start"]
#             end_char = start_char + len(answers["text"])

#             token_start_index = 0
#             while sequence_ids[token_start_index] != 1:
#                 token_start_index += 1

#             token_end_index = len(input_ids) - 1
#             while sequence_ids[token_end_index] != 1:
#                 token_end_index -= 1

#             if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
#                 feature["start_position"] = cls_index
#                 feature["end_position"] = cls_index
#             else:
#                 while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
#                     token_start_index += 1
#                 feature["start_position"] = token_start_index - 1
#                 while offsets[token_end_index][1] >= end_char:
#                     token_end_index -= 1
#                 feature["end_position"] = token_end_index + 1

#         features.append(feature)
#     return features

# # features(input_ids, attention_mask, offset_mapping, start_position, end_position) 이렇게 tokenizer을 이용한 5개의 feature을 반환해준다.

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [28]:
# train_features = []
# valid_features = []

# for i, row in train_df_splited.iterrows():
#     train_features.append(prepare_train_features(row, tokenizer)[0])
# for i, row in val_df_splited.iterrows():
#     valid_features.append(prepare_train_features(row, tokenizer)[0])

In [29]:
# tokenized_datasets = {'train':train_features,'validation':valid_features}
# # train에서 쓰일 최종 train용 datasets
# # pipeline 으로 보면 tokenization 까지 마쳤고, embedding 등 model 내부에서 진행되나보다.

#### Because I use hugging fucking face Trainer <커스터마이제이션>
- offset_mapping 뺐다.

In [30]:
# Custom function
max_seq_length = 384*2
doc_stride = 128*2

def prepare_train_features_hf(example, tokenizer):
    example["question"] = example["question"].lstrip()
    tokenized_example = tokenizer(
        example["question"],
        example["context"],
        truncation="only_second",
        max_length=max_seq_length,
        stride=doc_stride,
        return_overflowing_tokens=True, # 길이를 넘어가는 토큰들을 반환할 것인지
        return_offsets_mapping=True,  # 각 토큰에 대해 (char_start, char_end) 정보를 반환한 것인지
        padding="max_length",
    )

    sample_mapping = tokenized_example.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_example.pop("offset_mapping")

    features = []
    for i, offsets in enumerate(offset_mapping):
        feature = {}

        input_ids = tokenized_example["input_ids"][i]
        attention_mask = tokenized_example["attention_mask"][i]

        feature['input_ids'] = input_ids
        feature['attention_mask'] = attention_mask

        cls_index = input_ids.index(tokenizer.cls_token_id)
        sequence_ids = tokenized_example.sequence_ids(i)

        sample_index = sample_mapping[i]
        answers = example["answers"]

        if answers["answer_start"] == None:
            feature["start_position"] = cls_index
            feature["end_position"] = cls_index
        else:
            start_char = answers["answer_start"]
            end_char = start_char + len(answers["text"])

            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                feature["start_position"] = cls_index
                feature["end_position"] = cls_index
            else:
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                feature["start_position"] = token_start_index - 1
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                feature["end_position"] = token_end_index + 1

        features.append(feature)
    return features

# features(input_ids, attention_mask, offset_mapping, start_position, end_position) 이렇게 tokenizer을 이용한 5개의 feature을 반환해준다.

In [31]:
train_features_hf = []
valid_features_hf = []

for i, row in train_df_splited.iterrows():
    train_features_hf.append(prepare_train_features_hf(row, tokenizer)[0])
for i, row in val_df_splited.iterrows():
    valid_features_hf.append(prepare_train_features_hf(row, tokenizer)[0])

# start_position, end_position -> s 붙여주는 DF 작업도 필요

In [None]:
tokenized_datasets = {'train':train_features_hf,'validation':valid_features_hf}
# train에서 쓰일 최종 train용 datasets
# pipeline 으로 보면 tokenization 까지 마쳤고, embedding 등 model 내부에서 진행되나보다.

tokenized_datasets

#### 코드 무덤

In [None]:
# features = prepare_train_features(train_df.loc[1], tokenizer)

# pandas인 train을 dict로 slicing 하기 위해선, pandas를 아예 dataset 객체로 만들어야할것같다.
# train_dict = df_to_dataset(train)

# features = prepare_train_features(train_dict)

In [None]:
# # 이 예제의 dataset 모듈처럼 반환하도록 Dataframe을 제단해준다.
# class Df_to_dataset():
#     def __init__(self, df_dataset):
#         self.dataset_dict = {'answers':[], 'context':[], 'id':[], 'question':[], 'title':[]}
        
#         for i in range(0,len(df_dataset.to_dict()['answers'])):
#             self.dataset_dict['answers'].append({'answer_start' : df_dataset['answers'][i]['answer_start'], 'text' : df_dataset['answers'][i]['text']})
#             self.dataset_dict['context'].append(df_dataset['context'][i])
#             self.dataset_dict['id'].append(df_dataset['guid'][i])
#             self.dataset_dict['question'].append(df_dataset['question'][i])
#             self.dataset_dict['title'].append(df_dataset['title'][i]) 
    
#     def __getitem__(self, index):
#         return self.dataset_dict[index]

#     def len(self):
#         return len(self.dataset_dict)

dataset.map() 등판으로 나의 노력이.. 다시 dataset으로 내 데이터 가져오게 만들었다.
오히려 좋아... 더 유용한 technique

dataset이 오류가 너무 많아서 pandas 로 다시 돌아왔다.
다음 task에서는 다른사람이 한거를 먼저 참고하고 그걸 발판삼아서 해야겠다..
혼자 baseline 해보려다가 시간 너무 낭비했다.

#### 다시 getit

To apply this function on all the sentences (or pairs of sentences) in our dataset, we just use the `map` method of our `dataset` object we created earlier. This will apply the function on all the elements of all the splits in `dataset`, so our training, validation and testing data will be preprocessed in one single command. Since our preprocessing changes the number of samples, we need to remove the old columns when applying it.

In [None]:
# tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)

tokenized_datasets 참고 : (클릭)
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'end_positions', 'input_ids', 'start_positions'],
        num_rows: 88524
    })
    validation: Dataset({
        features: ['attention_mask', 'end_positions', 'input_ids', 'start_positions'],
        num_rows: 10784
    })
})

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to map has changed (and thus requires to not use the cache data). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files, you can pass `load_from_cache_file=False` in the call to `map` to not use the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.

### corpus 전처리 - not yet

- 한자, 일부 특수문자 제거 : 이 task에선 중요하지 않다 판단.
- 한국어 문장 분리기(kss) 사용
- 뉴스관련 문장을 제거 (서울=뉴스1, 무단전재)

## Fine-tuning the model

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_name)

Downloading:   0%|          | 0.00/429M [00:00<?, ?B/s]

The warning is telling us we are throwing away some weights (the `vocab_transform` and `vocab_layer_norm` layers) and randomly initializing some other (the `pre_classifier` and `classifier` layers). This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define three more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

In [None]:
train_features[0][0] # 첫 데이터
train_features[0][0].keys()

dict_keys(['input_ids', 'attention_mask', 'offset_mapping', 'start_position', 'end_position'])

In [None]:
'hel/lo the/re'.split('/')[-1]

're'

In [None]:
model_name_final = model_name.split("/")[-1] 
batch_size = 16

# output_dir : The output directory where the model predictions and checkpoints will be written.
args = TrainingArguments(
    output_dir = f"{model_name_final}-finetuned-korquad", # 모델 predictions, checkpoinsts 저장될 dir 이름이다.
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    push_to_hub=True,
)

In [None]:
from transformers import default_data_collator

data_collator = default_data_collator

We will evaluate our model and compute metrics in the next section (this is a very long operation, so we will only compute the evaluation loss during training).

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
# tokenized_datasets 는 그냥 내 train_features 중 4가지 and train, eval
# tokenzied_datasets 어떻게 train시 호출될까? -> index로 조회시 4가지 features 반환됨. dict형으로.
train_feature

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset = tokenized_datasets["train"], # 
    eval_dataset = tokenized_datasets["validation"],
    data_collator = data_collator, # 개선 여지
    tokenizer = tokenizer,
)

Cloning https://huggingface.co/LJ/koelectra-base-v3-finetuned-korquad-finetuned-korquad into local empty directory.


We can now finetune our model by just calling the `train` method:

In [None]:
trainer.train()

***** Running training *****
  Num examples = 10833
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 678


TypeError: ignored

Since this training is particularly long, let's save the model just in case we need to restart.

In [None]:
trainer.save_model("test-squad-trained")

## Evaluation

Evaluating our model will require a bit more work, as we will need to map the predictions of our model back to parts of the context. The model itself predicts logits for the start and en position of our answers: if we take a batch from our validation datalaoder, here is the output our model gives us:

In [None]:
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

The output of the model is a dict-like object that contains the loss (since we provided labels), the start and end logits. We won't need the loss for our predictions, let's have a look a the logits:

In [None]:
output.start_logits.shape, output.end_logits.shape

We have one logit for each feature and each token. The most obvious thing to predict an answer for each feature is to take the index for the maximum of the start logits as a start position and the index of the maximum of the end logits as an end position.

In [None]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

This will work great in a lot of cases, but what if this prediction gives us something impossible: the start position could be greater than the end position, or point to a span of text in the question instead of the answer. In that case, we might want to look at the second best prediction to see if it gives a possible answer and select that instead.

However, picking the second best answer is not as easy as picking the best one: is it the second best index in the start logits with the best index in the end logits? Or the best index in the start logits with the second best index in the end logits? And if that second best answer is not possible either, it gets even trickier for the third best answer.


To classify our answers, we will use the score obtained by adding the start and end logits. We won't try to order all the possible answers and limit ourselves to with a hyper-parameter we call `n_best_size`. We'll pick the best indices in the start and end logits and gather all the answers this predicts. After checking if each one is valid, we will sort them by their score and keep the best one. Here is how we would do this on the first feature in the batch:

In [None]:
n_best_size = 20

In [None]:
import numpy as np

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
# Gather the indices the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # We need to find a way to get back the original substring corresponding to the answer in the context
                }
            )

And then we can sort the `valid_answers` according to their `score` and only keep the best one. The only point left is how to check a given span is inside the context (and not the question) and how to get back the text inside. To do this, we need to add two things to our validation features:
- the ID of the example that generated the feature (since each example can generate several features, as seen before);
- the offset mapping that will give us a map from token indices to character positions in the context.

That's why we will re-process the validation set with the following function, slightly different from `prepare_train_features`:

In [None]:
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lots of space). So we remove that
    # left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possible giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

And like before, we can apply that function to our validation set easily:

In [None]:
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)

Now we can grab the predictions for all features by using the `Trainer.predict` method:

In [None]:
raw_predictions = trainer.predict(validation_features)

The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [None]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

We can now refine the test we had before: since we set `None` in the offset mappings when it corresponds to a part of the question, it's easy to check if an answer is fully inside the context. We also eliminate very long answers from our considerations (with an hyper-parameter we can tune)

In [None]:
max_answer_length = 30

In [None]:
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. For the more general case, we will need to be match the example_id to
# an example index
context = datasets["validation"][0]["context"]

# Gather the indices the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": context[start_char: end_char]
                }
            )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

We can compare to the actual ground-truth answer:

In [None]:
datasets["validation"][0]["answers"]

Our model picked the right as the most likely answer!

As we mentioned in the code above, this was easy on the first feature because we knew it comes from the first example. For the other features, we will need a map between examples and their corresponding features. Also, since one example can give several features, we will need to gather together all the answers in all the features generated by a given example, then pick the best one. The following code builds a map from example index to its corresponding features indices:

In [None]:
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

We're almost ready for our post-processing function. The last bit to deal with is the impossible answer (when `squad_v2 = True`). The code above only keeps answers that are inside the context, we need to also grab the score for the impossible answer (which has start and end indices corresponding to the index of the CLS token). When one example gives several features, we have to predict the impossible answer when all the features give a high score to the impossible answer (since one feature could predict the impossible answer just because the answer isn't in the part of the context it has access too), which is why the score of the impossible answer for one example is the *minimum* of the scores for the impossible answer in each feature generated by the example.

We then predict the impossible answer when that score is greater than the score of the best non-impossible answer. All combined together, this gives us this post-processing function:

In [None]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some the positions in our logits to span of texts in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions

And we can apply our post-processing function to our raw predictions:

In [None]:
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

Then we can load the metric from the datasets library.

In [None]:
metric = load_metric("squad_v2" if squad_v2 else "squad")

Then we can call compute on it. We just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary. In the case of squad_v2, we also have to set a `no_answer_probability` argument (which we set to 0.0 here as we have already set the answer to empty if we picked it).

In [None]:
if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

You can now upload the result of the training to the Hub, just execute this instruction:

In [None]:
trainer.push_to_hub()

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("sgugger/my-awesome-model")
```