
In [1]:
!pip install transformers



In [2]:
# for reproducibility
SEED = 34

# maximum length of the generated sequence, in tokens (the prompt counts toward it)
MAX_LEN = 70

### I. Intro
A language model is a machine learning model that looks at part of a sentence and predicts the next word or sequence of words. Much like the autocomplete on your iPhone or Android, GPT-2 performs next-word prediction, but on a much larger and more sophisticated scale. For reference, the smallest GPT-2 has 117 million parameters as originally reported (124 million in the released checkpoint), whereas the largest has about 1.5 billion. The 1.5-billion-parameter model was initially withheld from the public and was released in full in November 2019. The gpt2-large model used in this notebook has about 774 million parameters, roughly half the size of the largest model.
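
To make "predict the next word" concrete, here is a minimal sketch of a toy bigram model that counts which word follows which. It is not how GPT-2 works internally (GPT-2 is a neural network that conditions on the whole preceding text), and the corpus and function names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus, invented for this example only.
corpus = "i am very passionate about my work and i am very good at my work"
model = train_bigram(corpus)

print(predict_next(model, "am"))  # very
print(predict_next(model, "my"))  # work
```

A model like this only ever sees one word of context, which is why its predictions are crude. GPT-2 plays the same guessing game with far more context and parameters.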

🤗 Transformers makes it easy to import this model with either PyTorch or TensorFlow. This notebook uses TensorFlow, but the PyTorch version is just as simple. Both the model and its tokenizer can be imported from the transformers library, which is installed with `!pip install transformers`. Let's see how simple it is to generate text with a neural network. We begin with our input sequence:

In [3]:
input_sequence = "I am enthusiastic,diligent and hardowking. I am very passionate about my work and am a very good communicator. I "

In [4]:
#get transformers
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

#get large GPT2 tokenizer and GPT2 model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)
GPT2.summary()

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 774030080 
 r)                                                              
                                                                 
Total params: 774,030,080
Trainable params: 774,030,080
Non-trainable params: 0
_________________________________________________________________


In [5]:
#get deep learning basics
import tensorflow as tf
tf.random.set_seed(SEED)

In [6]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode(input_sequence, return_tensors='tf')
# set num_return_sequences > 1 to get several beams back (it must not exceed num_beams)
beam_outputs = GPT2.generate(
    input_ids, 
    max_length = MAX_LEN, 
    num_beams = 5, 
    no_repeat_ngram_size = 2, 
    num_return_sequences = 5, 
    early_stopping = True
)

print('')
print("Output:\n" + 50 * '-')

# print the returned sequences (num_return_sequences = 5)
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))


Output:
--------------------------------------------------
0: I am enthusiastic,diligent and hardowking. I am very passionate about my work and am a very good communicator. I  have a great sense of humor, and I love to make people laugh. If you have any questions, please feel free to contact me.
1: I am enthusiastic,diligent and hardowking. I am very passionate about my work and am a very good communicator. I  have a great sense of humor, and I love to make people laugh. If you have any questions, please feel free to send me an e-mail.
2: I am enthusiastic,diligent and hardowking. I am very passionate about my work and am a very good communicator. I  have a great sense of humor, and I love to make people laugh. If you have any questions, please feel free to contact me. Thank you for your time.
3: I am enthusiastic,diligent and hardowking. I am very passionate about my work and am a very good communicator. I  have a great sense of humor, and I love to make people laugh. If you have any
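
The beam outputs above are near-duplicates because beam search keeps only the highest-scoring continuations at every step. As a rough sketch of the idea, and not the library's implementation, the toy example below runs beam search over a hand-written table of next-token probabilities. The table `NEXT` and the function `beam_search` are hypothetical, and a real model scores each candidate from the entire prefix rather than from the last token alone.

```python
import math

# Hypothetical next-token probabilities, invented for illustration.
NEXT = {
    "I":    {"have": 0.5, "love": 0.4, "am": 0.1},
    "have": {"a": 0.6, "been": 0.4},
    "love": {"to": 0.9, "my": 0.1},
    "am":   {"a": 1.0},
}

def beam_search(start, steps, num_beams):
    """Keep the `num_beams` highest-scoring sequences after every step."""
    beams = [([start], 0.0)]  # (tokens, summed log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in NEXT.get(tokens[-1], {}).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy = beam_search("I", steps=2, num_beams=1)[0][0]
best = beam_search("I", steps=2, num_beams=2)[0][0]
print(" ".join(greedy))  # I have a
print(" ".join(best))    # I love to
```

With a single beam the search is greedy and commits to "have" because it is the likeliest first step. With two beams it keeps "love" alive long enough to discover the more probable overall sequence. Options such as `no_repeat_ngram_size` and `early_stopping` add further rules on top of this basic loop and are not modeled here.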

In [7]:
# sample with temperature < 1 to sharpen the distribution, so low-probability tokens are picked less often
# (top_k = 0 disables top-k filtering)
sample_output = GPT2.generate(
                             input_ids, 
                             do_sample = True, 
                             max_length = MAX_LEN, 
                             top_k = 0, 
                             temperature = 0.8
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
I am enthusiastic,diligent and hardowking. I am very passionate about my work and am a very good communicator. I _________________________________________" I have boiled my stories down to their essence and the emotion is flowing all around me.


I am a semi-professional photographer and writer. My skills are mostly focused on family
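
To see what the temperature setting does, here is a minimal sketch that applies a temperature to a softmax over a few scores. The logits are invented for illustration and `softmax_with_temperature` is a hypothetical helper, not a function from the transformers library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens, invented for illustration.
logits = [2.0, 1.0, 0.0]

plain = softmax_with_temperature(logits, temperature=1.0)
sharpened = softmax_with_temperature(logits, temperature=0.8)

print([round(p, 3) for p in plain])
print([round(p, 3) for p in sharpened])
```

Dividing by a temperature below 1 widens the gaps between scores, so the most likely token gains probability mass and the least likely one loses it. A temperature above 1 would do the opposite and flatten the distribution.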


In [8]:
# top-p (nucleus) sampling: sample only from the smallest set of tokens whose cumulative probability reaches 80%
sample_output = GPT2.generate(
                             input_ids, 
                             do_sample = True, 
                             max_length = MAX_LEN, 
                             top_p = 0.8, 
                             top_k = 0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:
----------------------------------------------------------------------------------------------------
I am enthusiastic,diligent and hardowking. I am very passionate about my work and am a very good communicator. I ive always done this work with my whole heart and I will always do it and i am no exceptions!I love to do any job that im doing. And it is my passion and desire to make the ...
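
The top-p filter used above can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not the library's code: the distribution is made up, and `top_p_filter` is a hypothetical helper name.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their total probability reaches `top_p`,
    then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Hypothetical next-token distribution, invented for illustration.
probs = {"have": 0.5, "love": 0.25, "am": 0.15, "ive": 0.10}

nucleus = top_p_filter(probs, top_p=0.8)
print(sorted(nucleus))  # ['am', 'have', 'love']
```

The first two tokens only cover 75% of the probability mass, so a third is needed to reach the 80% threshold, and the least likely token is dropped. Unlike top-k, the number of surviving tokens changes from step to step depending on how peaked the distribution is.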


In [9]:
# combine top-k and top-p sampling
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True, 
                              max_length = 2*MAX_LEN,  # test whether longer generations stay coherent
                              #temperature = .7,
                              top_k = 50, 
                              top_p = 0.85, 
                              num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: I am enthusiastic,diligent and hardowking. I am very passionate about my work and am a very good communicator. I  would love to be your leader!...

1: I am enthusiastic,diligent and hardowking. I am very passionate about my work and am a very good communicator. I ive been an active member of the Church for many years and have been a member of the Priesthood for a couple of years. I am also an active member of the Church Youth, the Young Women, the Young Men, the Relief Society and the Young Men's Mutual Improvement Associations. I have a good understanding of the social and cultural needs of the community and have been working with these groups for many years.

I believe that I have a genuine and authentic love for the Church and I am committed to being a dedicated, faithful, and faithful...

2: I am enthusiastic,diligent and hardowking. I am very passionate about my work and