<a href="https://colab.research.google.com/github/edwin-19/Text-Generation-Comparison/blob/main/Text_Generation_HF_Comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install libraries

In [1]:
!pip install transformers
!pip install datasets
!pip install tokenizers
!pip install sentencepiece


# Import necessary libs

In [75]:
from transformers import AutoTokenizer, AutoModelForCausalLM, T5ForConditionalGeneration
from datasets import load_metric
from pprint import pprint

import torch
import random
import numpy as np

In [76]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

SEED = 15
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

In [2]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

In [8]:
model.eval();
model.to(device);  # falls back to CPU when no GPU is available

In [90]:
inputs = tokenizer('3 Steps for Secure Financial Growth', return_tensors='pt').to(device)
# num_beams > 1 together with do_sample=True gives beam-search multinomial sampling
outputs = model.generate(
    **inputs,
    do_sample=True,
    num_beams=5,
    no_repeat_ngram_size=2,
    min_length=100,
    max_length=500,
    early_stopping=True,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [91]:
text_generated = tokenizer.batch_decode(outputs, skip_special_tokens=True)

pprint(text_generated)
'Text Length: {}'.format(len(text_generated[0].split(' ')))

['3 Steps for Secure Financial Growth\n'
 '\n'
 'There’s a lot of talk in the financial services industry about the '
 'importance of security, and one of the most important aspects of that '
 'security is the use of encryption. Encryption is a method of concealing '
 'information in a way that makes it impossible to read the information '
 'without the key. The key is used to decrypt the encrypted information, which '
 'allows you to see the original information. This is known as a “one-way” '
 'encryption, meaning that you can only see information that has been '
 'encrypted with the same key that was used for encrypting it. In this '
 'article, we will look at the steps you need to take in order to secure your '
 'finances. We will also talk about how to choose the right encryption '
 'software for your business. If you have any questions, feel free to reach '
 'out to us at [email protected] We look forward to hearing from you soon!']


'Text Length: 158'
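
The length above is computed with `split(' ')`, which splits only on single spaces, so two words separated by newlines (such as the end of the title and the first word of the body) are counted as one. A minimal sketch of a whitespace-aware count follows; `count_words` is a hypothetical helper name that does not appear in the notebook, and the sample string is a shortened stand-in for the generated text.

```python
def count_words(text: str) -> int:
    """Count words separated by any whitespace (spaces, tabs, newlines)."""
    # str.split() with no argument splits on runs of whitespace and
    # drops empty strings, unlike str.split(' ').
    return len(text.split())


# Shortened stand-in for the generated text, including the blank line after the title.
sample = "3 Steps for Secure Financial Growth\n\nThere is a lot of talk"

print(len(sample.split(' ')))  # count as computed in the cell above
print(count_words(sample))     # whitespace-aware count
```

The two counts differ by one here because `'Growth\n\nThere'` is treated as a single item by `split(' ')`.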

# Load T5 model
- Try the T5 model for conditional generation

In [35]:
t5_tokenizer = AutoTokenizer.from_pretrained("t5-base")
t5_model = T5ForConditionalGeneration.from_pretrained("t5-base")


In [38]:
t5_model.eval();
t5_model.to(device);  # falls back to CPU when no GPU is available

In [86]:
inputs = t5_tokenizer(
    'summarize: 3 Steps for Secure Financial Growth', return_tensors='pt'
).to(device)
outputs = t5_model.generate(
    **inputs,
    do_sample=True,
    num_beams=5,
    no_repeat_ngram_size=2,
    min_length=100,
    max_length=300,
    early_stopping=True,
)

In [87]:
text_generated = t5_tokenizer.batch_decode(outputs, skip_special_tokens=True)
pprint(text_generated)
'Text Length: {}'.format(len(text_generated[0].split(' ')))

['3 Steps for Secure Financial Growth: Secure Investing, Secure Investments, '
 'and Secure Capital Growth. Step 1: Learn to manage your risk, protect your '
 'assets and grow your business! Step 2: Make sure you’re able to invest your '
 'money in the right way, invest wisely and make the most of your investment '
 'opportunities by investing in a safe, secure and scalable way. 3: Take '
 'Control of Your Money and Keep Your Assets Safe. c.']


'Text Length: 73'
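
The GPT-Neo and T5 cells above repeat the same three steps: tokenize the prompt, call `generate`, and decode the result. A minimal sketch of a shared helper is shown below. The names `generate_text` and `GEN_KWARGS` are hypothetical and not part of the notebook; the helper only assumes that the tokenizer returns an object with a `.to(device)` method (as Hugging Face tokenizers do with `return_tensors='pt'`) and that the model exposes `generate`.

```python
from typing import Any, List


def generate_text(
    model: Any, tokenizer: Any, prompt: str, device: Any = "cpu", **gen_kwargs: Any
) -> List[str]:
    """Tokenize a prompt, run model.generate, and decode the outputs."""
    # Move every tensor in the tokenizer output to the target device in one call.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, **gen_kwargs)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Generation settings shared by the cells above; max_length is passed per model.
GEN_KWARGS = dict(
    do_sample=True,
    num_beams=5,
    no_repeat_ngram_size=2,
    min_length=100,
    early_stopping=True,
)
```

With the objects defined earlier, a call might look like `generate_text(t5_model, t5_tokenizer, 'summarize: 3 Steps for Secure Financial Growth', device, max_length=300, **GEN_KWARGS)`.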

# Multiple languages
- With mT5 we can generate text in other languages as well (the prompt below is in Malay)
- It still uses fewer resources than GPT-Neo 1.3B

In [80]:
from transformers import MT5ForConditionalGeneration
mt5_tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
# Use the mT5 class: loading this checkpoint with T5ForConditionalGeneration raises a model-type mismatch warning
mt5_model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

In [93]:
mt5_model.eval();

In [100]:
inputs = mt5_tokenizer(
    'ringkasan: 3 Langkah untuk Pertumbuhan Kewangan yang Selamat', return_tensors='pt'
)
outputs = mt5_model.generate(
    **inputs,
    do_sample=True,
    num_beams=5,
    no_repeat_ngram_size=2,
    min_length=100,
    max_length=300,
    early_stopping=True,
)

In [101]:
text_generated = mt5_tokenizer.batch_decode(outputs, skip_special_tokens=True)
pprint(text_generated)
'Text Length: {}'.format(len(text_generated[0].split(' ')))

['<extra_id_0>kasan: 3 Langkah <extra_id_1>: <extra_id_2>ringkasan '
 '<extra_id_3>: <extra_id_4>: 4 Tahun <extra_id_55> Utama <extra_id_56> Untuk '
 'Pertumbuhan Kewangan Selamat - Percutian Rumah Pertama di Dunia. Bagi anda '
 'yang memerlukan. Sebagai tambahan, ia perlu bersedia untuk mengelakkan '
 'daripada kesan-kesan sampingan. Lakukan ini ialah beberapa langkah untuk '
 'mempercepatkan pertumbuhan kewangan: Dalam keadaan tertentu, dan lebih mudah '
 'tanpa sebab-dan akibatnya, adalah apabila ianya telah tiba masanya untuk '
 'bermulanya. Boleh maklum balas ringkas: perkongsian artikel ini diubahsuai '
 'oleh pegawai khidmat nasihat ahli supaya mengambil alih kedudukannya. '
 'Panduan utama! Lihat imej pilihan anda! Tarikh tutup kemaskini mingguan, '
 'khusus']


'Text Length: 90'
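
The `<extra_id_N>` markers in the output above are the sentinel tokens that T5-style models use during span-corruption pretraining. The public mT5 checkpoints were pretrained without any supervised tasks, so a prefix such as `ringkasan:` is unlikely to produce a real summary without fine-tuning. If only the readable text is needed for a rough comparison, the markers can be stripped after decoding. The sketch below uses a plain regular expression; `strip_sentinels` is a hypothetical helper name, not something the notebook or the library provides.

```python
import re

# Matches T5/mT5 sentinel tokens such as <extra_id_0> or <extra_id_55>.
_SENTINEL = re.compile(r"<extra_id_\d+>")


def strip_sentinels(text: str) -> str:
    """Remove sentinel tokens and collapse the whitespace they leave behind."""
    cleaned = _SENTINEL.sub(" ", text)
    # Note: this also collapses newlines and tabs into single spaces.
    return " ".join(cleaned.split())


sample = "<extra_id_0>kasan: 3 Langkah <extra_id_1>: <extra_id_2>ringkasan"
print(strip_sentinels(sample))
```

Stripping the markers only cleans up the display; it does not change the quality of the underlying generation.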