<a href="https://colab.research.google.com/github/YoungjaeDev/HuggingFace-Tutorial/blob/master/GPT_Neo_DistilGPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers sentencepiece



## **Writing Text with GPT Neo**

In [3]:
from transformers import GPTNeoForCausalLM, GPT2Tokenizer # causal language model: predicts the next token from the preceding ones

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

In [4]:
input = tokenizer.encode("I evaluated the performance of GPT-Neo developed by OpenAI.", return_tensors='pt')

In [6]:
input[0]

tensor([   40, 16726,   262,  2854,   286,   402, 11571,    12,  8199,    78,
         4166,   416,  4946, 20185,    13])

In [8]:
print(tokenizer.decode(input[0]))

I evaluated the performance of GPT-Neo developed by OpenAI.


In [14]:
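# The GPT-2 tokenizer has no pad token by default, so register one before padding a batch.
# '[PAD]' gets a new ID: call model.resize_token_embeddings(len(tokenizer)) before feeding padded batches to the model.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})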

input = tokenizer.batch_encode_plus(["I evaluated the performance of GPT-Neo developed by OpenAI.", "I evaluated the performance of GPT developed by OpenAI."], padding=True, truncation=True, return_tensors='pt')

In [15]:
print(input['input_ids'])

tensor([[   40, 16726,   262,  2854,   286,   402, 11571,    12,  8199,    78,
          4166,   416,  4946, 20185,    13],
        [   40, 16726,   262,  2854,   286,   402, 11571,  4166,   416,  4946,
         20185,    13, 50257, 50257, 50257]])


In [16]:
for i in range(len(input['input_ids'])):
  print(tokenizer.decode(input['input_ids'][i]))

I evaluated the performance of GPT-Neo developed by OpenAI.
I evaluated the performance of GPT developed by OpenAI.[PAD][PAD][PAD]
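
To make the effect of `padding=True` concrete, here is a minimal pure-Python sketch of what happens to the two ID sequences above: the shorter one is right-padded with the pad ID, and a matching attention mask marks which positions are real tokens. `pad_batch` is a hypothetical helper written for illustration, not part of `transformers`; the ID 50257 is the `[PAD]` ID visible in the output above.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Token IDs copied from the outputs above.
with_neo = [40, 16726, 262, 2854, 286, 402, 11571, 12, 8199, 78,
            4166, 416, 4946, 20185, 13]
without_neo = [40, 16726, 262, 2854, 286, 402, 11571, 4166, 416, 4946,
               20185, 13]

ids, mask = pad_batch([with_neo, without_neo], pad_id=50257)
print(ids[1][-3:])  # [50257, 50257, 50257]
print(mask[1])
```

The real tokenizer returns the same kind of mask under `input['attention_mask']`.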


## **Text Generation Example**

In [17]:
# Truncate every input to at most 5 tokens (tokens, not words)
input = tokenizer.batch_encode_plus(["I evaluated the performance of GPT2 developed by OpenAI.", "Vaccine for new coronavirus in the UK","3.1415926535"], max_length=5, truncation=True, padding=True, return_tensors="pt")

print(input['input_ids'])

tensor([[   40, 16726,   262,  2854,   286],
        [   53,  4134,   500,   329,   649],
        [   18,    13,  1415, 19707, 22980]])


In [21]:
for i in range(len(input['input_ids'])):
  print(tokenizer.decode(input['input_ids'][i]))

I evaluated the performance of
Vaccine for new
3.14159265
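
Note that `max_length=5` counts tokens, not words: the second prompt keeps only three words, which suggests that "Vaccine" is split into several subword tokens. The sketch below illustrates the truncation step on the IDs of the GPT-Neo sentence encoded earlier, whose first five tokens coincide with the first row above. `truncate` is a hypothetical helper, not a `transformers` function, and it assumes right-side truncation.

```python
def truncate(token_ids, max_length):
    """Keep only the first max_length token IDs (right-side truncation)."""
    return token_ids[:max_length]


# IDs of "I evaluated the performance of GPT-Neo developed by OpenAI." from earlier.
full = [40, 16726, 262, 2854, 286, 402, 11571, 12, 8199, 78,
        4166, 416, 4946, 20185, 13]

short = truncate(full, max_length=5)
print(short)  # [40, 16726, 262, 2854, 286]
```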


In [18]:
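# Passing attention_mask=input['attention_mask'] as well should silence the attention-mask warning below.
generated = model.generate(input['input_ids'])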

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [20]:
generated_texts = tokenizer.batch_decode(generated)

for i, sentence in enumerate(generated_texts):
  print(f'No.{i+1}')
  print(sentence)

No.1
I evaluated the performance of the proposed method on the real-world dataset. The results are shown in
No.2
Vaccine for new-borns

The vaccine for new-borns is a vaccine
No.3
3.1415926535897932384626433832795028841971693
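
Called without sampling arguments, `generate` normally decodes greedily: at each step it appends the single most likely next token until a length limit or an end-of-sequence token is reached. The following is a toy sketch of that loop, not the library's implementation. `greedy_decode`, `toy_model`, and the score table are all hypothetical stand-ins for illustration.

```python
def greedy_decode(next_token_scores, prompt, max_length, eos_token=None):
    """Greedy decoding: repeatedly append the highest-scoring next token.

    next_token_scores is a hypothetical stand-in for a language model: it maps
    the sequence so far to a dict {candidate_token: score}.
    """
    tokens = list(prompt)
    while len(tokens) < max_length:  # max_length includes the prompt tokens
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        tokens.append(best)
        if best == eos_token:
            break
    return tokens


# Toy bigram "model": the scores depend only on the last token.
TOY_TABLE = {
    "I": {"like": 0.9, "am": 0.1},
    "like": {"tea": 0.6, "coffee": 0.4},
    "tea": {"<eos>": 0.7, "and": 0.3},
    "and": {"coffee": 1.0},
    "coffee": {"<eos>": 1.0},
}


def toy_model(tokens):
    return TOY_TABLE[tokens[-1]]


print(greedy_decode(toy_model, ["I"], max_length=10, eos_token="<eos>"))
# ['I', 'like', 'tea', '<eos>']
```

Because the choice at every step is deterministic, the same prompt always yields the same continuation, which matches the repeatable outputs above.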


## **DistilGPT2**

In [22]:
from transformers import AutoTokenizer, AutoModelForCausalLM  # AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead

In [24]:
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilbert/distilgpt2')

In [29]:
input_ids = tokenizer.encode("I like gpt because it's", return_tensors='pt')
greedy_output = model.generate(input_ids, max_length=12)  # max_length counts the prompt tokens too

print("Output:\n"+ 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I like gpt because it's a good thing to have


## **DialoGPT-Small**

In [30]:
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
model = AutoModelForCausalLM.from_pretrained('microsoft/DialoGPT-small')

In [31]:
input_ids = tokenizer.encode("I like gpt because it's", return_tensors='pt')
greedy_output = model.generate(input_ids, max_length=30)

print("Output:\n"+ 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I like gpt because it's a good way to get a feel for the game.


In [32]:
input_ids = tokenizer.encode("My company wants me to leave. So", return_tensors='pt')
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n"+ 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
My company wants me to leave. So I'm not sure if this is a good thing or not.
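
The cells above use DialoGPT as a plain text-continuation model. Its model card instead terminates each dialogue turn with the end-of-text token before generating a reply. The sketch below shows only that string-formatting step, under the assumption that the tokenizer's end-of-text token is the GPT-2 string `<|endoftext|>`. `build_dialogue_prompt` is a hypothetical helper, and the example turns are made up for illustration.

```python
EOS_TOKEN = "<|endoftext|>"  # assumed end-of-text token string of the GPT-2 tokenizer family


def build_dialogue_prompt(turns, eos_token=EOS_TOKEN):
    """Join dialogue turns, terminating each one with the end-of-text token."""
    return "".join(turn + eos_token for turn in turns)


# Made-up turns, for illustration only.
turns = ["My company wants me to leave.", "Why do they want that?"]
prompt = build_dialogue_prompt(turns)
print(prompt)
```

The resulting string could then be passed to `tokenizer.encode(prompt, return_tensors='pt')` in place of the raw prompt.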
