# Test Set Inference
#### This notebook runs inference on the test set with the model trained in "1_model_training.ipynb" and saves the generated summaries to answer.pkl.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pwd

/content


In [None]:
%cd /content/drive/Shareddrives/Capstone/KoBART-summarization

/content/drive/Shareddrives/Capstone/KoBART-summarization


In [None]:
!pip install transformers

In [4]:
import pandas as pd

from transformers import BartForConditionalGeneration
from transformers import AutoTokenizer
import torch

import time
import pickle

In [None]:
!python --version

Python 3.9.16


## Load the trained model and tokenizer

#### Use the gogamza/kobart-summarization tokenizer, the same one used during training.

In [None]:
model = BartForConditionalGeneration.from_pretrained('./kobart_summary')
tokenizer = AutoTokenizer.from_pretrained("gogamza/kobart-summarization")

You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels wil be overwritten to 2.



## Summarization function
#### The generation parameters were tuned experimentally; the values below are the ones that performed best.

In [None]:

def inference_model(text):
    # Tune the generation parameters below as needed.
    input_ids = tokenizer.encode(text)
    input_ids = torch.tensor(input_ids)
    input_ids = input_ids.unsqueeze(0)

    output = model.generate(
        input_ids, 
        eos_token_id=1, 
        min_length=33, 
        max_length=75, 
        num_beams=5,
        #do_sample=True,
        #early_stopping=True, 
        #length_penalty=0.8,
        #no_repeat_ngram_size=5, 
        #temperature = 0.6,#
        num_return_sequences=2,
        repetition_penalty=10.0,
        #top_p=0.92,#
        )

    output = tokenizer.decode(output[0], skip_special_tokens=True)
    return output
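
The call above requests two sequences (`num_return_sequences=2`) but decodes only `output[0]`. The following is a minimal sketch of a variant that decodes every returned candidate, so the alternatives can be inspected side by side. `summarize_candidates` and its `num_candidates` parameter are hypothetical names that do not appear in the original notebook, and the function takes the model and tokenizer as arguments instead of relying on the globals.

```python
def summarize_candidates(model, tokenizer, text, num_candidates=2):
    """Return every beam-search candidate for `text` as decoded strings.

    Hypothetical helper: it reuses the generation settings of
    inference_model but keeps all returned sequences, not just the first.
    """
    # encode(..., return_tensors="pt") yields a (1, seq_len) tensor directly,
    # replacing the manual torch.tensor(...).unsqueeze(0) steps.
    input_ids = tokenizer.encode(text, return_tensors="pt")

    outputs = model.generate(
        input_ids,
        eos_token_id=1,
        min_length=33,
        max_length=75,
        num_beams=5,
        num_return_sequences=num_candidates,  # must not exceed num_beams
        repetition_penalty=10.0,
    )

    # generate() returns one row of token ids per requested sequence.
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]
```

A call such as `summarize_candidates(model, tokenizer, texts[0])` would return a list of two strings. If only the first candidate is ever used, setting `num_return_sequences=1` avoids returning the extra sequence.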

#### Inspect the test set

In [None]:
test_data = pd.read_csv('/content/drive/Shareddrives/Capstone/KoBART-summarization/data/test.tsv', sep='\t')
test_data

| Unnamed: 0 | news | summary |
|---|---|---|
| 0 | You don t need an additional holster when keep... | Works great. Durable, holds extra mags |
| 1 | A great page turner that kept my interest unti... | Great mixture of fantasy and modern times |
| 2 | after seeing the show I thought I would give t... | after seeing the show I thought I would give t... |
| 3 | It was nice to be able to test your tank and r... | It was nice to be able to test your tank and r... |
| 4 | This is where I get my B12 etc love the pro... | Love this product, has a cheesy flavor. |
| ... | ... | ... |
| 1012 | It was just as described on line Arrived as ... | It was just as described on line. Arrived as ... |
| 1013 | Just the graphic art book I was looking for to... | Father in search of graphic art for kids |
| 1014 | Very nice quality and would Highly Recommend t... | Very nice quality and would Highly Recommend t... |
| 1015 | Works just like it s suppose to Construction ... | DVI-D M/F Digital Video Extension Cable-2M |


#### The output is a dictionary that maps each id to its generated summary.

In [None]:
texts = test_data['news'].tolist()
ids = list(range(len(test_data)))

In [None]:
# test
print(texts[0])
print(inference_model(texts[0]))

You don t need an additional holster when keeping the gun in this   Works great   Durable  holds extra mags   Nice and thin so you can pack one gun and it s stuff in each case 
Works great, Durable and holds extra mags. Nice and thin so you can pack one gun and it's stuff in each case


#### Generate the outputs and measure the inference time.

In [None]:
start = time.time()
answer_path = '/content/drive/Shareddrives/Capstone/KoBART-summarization/answer.pkl'

summary = {}
for cnt, t in enumerate(texts):
    # Every 20 examples, report the elapsed time and checkpoint the partial results.
    if cnt % 20 == 0:
        print(cnt, "   time :", time.time() - start)

        with open(answer_path, 'wb') as f:
            pickle.dump(summary, f)

    summary[ids[cnt]] = inference_model(t)

# Save the complete results.
with open(answer_path, 'wb') as f:
    pickle.dump(summary, f)

0    time : 0.00030803680419921875
20    time : 72.1609456539154
40    time : 147.94418287277222


KeyboardInterrupt: ignored
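
The run above was interrupted, and because the loop starts from an empty dictionary, a rerun would repeat every example. A minimal sketch of a resumable version follows, assuming ids are the positional indices as in the cell above. `run_with_resume`, `summarize`, and `every` are hypothetical names introduced for illustration; `summarize` stands for any callable that maps a text to a summary, such as `inference_model`.

```python
import os
import pickle


def run_with_resume(texts, summarize, path, every=20):
    """Summarize `texts`, skipping ids already stored in the pickle at `path`.

    Hypothetical helper: `summarize` is any callable mapping a text to a
    summary; `every` is the number of new summaries between checkpoints.
    """
    # Load earlier partial results if a checkpoint file exists.
    summary = {}
    if os.path.exists(path):
        with open(path, "rb") as f:
            summary = pickle.load(f)

    def save():
        # Write to a temporary file first, then swap it in, so an interrupt
        # during the write cannot leave a truncated checkpoint behind.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(summary, f)
        os.replace(tmp, path)

    done_this_run = 0
    for i, text in enumerate(texts):
        if i in summary:
            continue  # already summarized in a previous run
        summary[i] = summarize(text)
        done_this_run += 1
        if done_this_run % every == 0:
            save()

    save()
    return summary
```

With this shape, calling `run_with_resume(texts, inference_model, path)` a second time after an interrupt would only process the ids missing from the saved file.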

#### Check the generated output (answer.pkl) and compare it with the answer.pkl saved under the BART directory


In [7]:
import pickle
with open("/content/drive/Shareddrives/Capstone/KoBART-summarization/answer.pkl","rb") as fr:
    data = pickle.load(fr)


In [10]:
with open("/content/drive/Shareddrives/Capstone/BART/answer.pkl","rb") as fr:
    data2 = pickle.load(fr)


In [11]:
len(data), len(data2)

(1017, 1017)

In [22]:
data[5]

'Good solid feel to it, and programmable. Decisive yet quiet clicks of the buttons'

In [13]:
data2[:10]

['Great for keeping the gun in this case.',
 'A great page turner that kept my interest...',
 'after seeing the show I thought I would give the books a try',
 'It was nice to be able to test your tank and realize...',
 'This is where I get my B12, etc',
 'Good solid feel to it, and programmable.',
 'Very sharp out of box, and holds its edge for a long...',
 'Plenty meat and beans too. Got another 8 pack in my cart',
 'It is very pretty and unique. I like that it needs minimal care',
 'A must have for those of us who love this book']