<a href="https://colab.research.google.com/github/bigfacecatxue/Language-generation/blob/main/Using_GPT2_to_generate_text_like_Shakespeare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# download the training set (tiny Shakespeare corpus)
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2021-03-07 05:24:22--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.2’


2021-03-07 05:24:22 (14.9 MB/s) - ‘input.txt.2’ saved [1115394/1115394]



In [2]:
# make a temporary directory to store the fine-tuned model
!mkdir output

mkdir: cannot create directory ‘output’: File exists


In [2]:
# install datasets and transformers
!pip install datasets
!pip install transformers



In [3]:
# download the Hugging Face fine-tuning script for causal language modeling (run_clm.py)
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_clm.py

--2021-03-07 05:24:37--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_clm.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17016 (17K) [text/plain]
Saving to: ‘run_clm.py.2’


2021-03-07 05:24:38 (48.7 MB/s) - ‘run_clm.py.2’ saved [17016/17016]



In [1]:
# Fine-tune the distilgpt2 model. Fine-tuning gpt2 takes almost 6 hours and distilgpt2 takes about 3 hours, so I chose distilgpt2. Feel free to try gpt2 or gpt2-medium.
!python run_clm.py \
    --model_name_or_path distilgpt2 \
    --train_file /content/input.txt \
    --per_device_train_batch_size 2 \
    --do_train \
    --output_dir output

2021-03-07 05:27:44.926379: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
03/07/2021 05:27:48 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=output, overwrite_output_dir=False, do_train=True, do_eval=None, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Mar07_05-27-47_eb1c98354f4e, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=Fa

In [2]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
 
tokenizer = GPT2Tokenizer.from_pretrained('/content/output')
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained('/content/output', pad_token_id=tokenizer.eos_token_id)

In [3]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('[BOS] The King must leave the throne now. [EOS]',
                      return_tensors='pt')

# Greedy Search

In [4]:
# generate text until the output length (which includes the context length) reaches 300
greedy_output = model.generate(input_ids, max_length=300)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))


Output:
----------------------------------------------------------------------------------------------------
[BOS] The King must leave the throne now. [EOS] I'll not be so. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be gone. I'll be


Greedy search always picks the single most probable next word, and the text it generates here repeats a lot.
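To see why greedy search falls into a loop, here is a minimal sketch on a toy bigram model. The table `NEXT_TOKEN_PROBS` and the function `greedy_decode` are hypothetical and made up for illustration; the probabilities are not taken from the fine-tuned model. Because the choice at each step is deterministic, once a context recurs the same continuation follows again.

```python
# Toy next-token table: hypothetical probabilities, for illustration only.
NEXT_TOKEN_PROBS = {
    "I'll": {"be": 0.6, "not": 0.4},
    "be": {"gone.": 0.7, "so.": 0.3},
    "not": {"be": 0.9, "go.": 0.1},
    "gone.": {"I'll": 0.8, "Come,": 0.2},
    "so.": {"I'll": 1.0},
    "go.": {"I'll": 1.0},
    "Come,": {"I'll": 1.0},
}


def greedy_decode(start, steps):
    """Append the single most probable next token, `steps` times."""
    tokens = [start]
    for _ in range(steps):
        candidates = NEXT_TOKEN_PROBS[tokens[-1]]
        tokens.append(max(candidates, key=candidates.get))
    return tokens


print(" ".join(greedy_decode("I'll", 8)))
# prints: I'll be gone. I'll be gone. I'll be gone.
```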

# Beam Search

In [5]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids, 
    max_length=300, 
    num_beams=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
[BOS] The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the throne now. [EOS]The King must leave the 

Text generated by beam search also repeats a lot.

In [6]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids, 
    max_length=300, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
[BOS] The King must leave the throne now. [EOS] Come, my lord, I'll leave you to the king's side.BUCKINGHAM:I'll go, sir.KING RICHARD II:Sir, let's go.BRUTUS HENRY VI:What's the King's name?ROMEO:The King of England, the Duke of York, is the son of Richard III, and the heir of the crown.The king of France, Richard II, was born in France in the year of Henry VI.He is known to be a traitor to France and to his country, as well as an enemy of his own.First Lady Margaret of Gloucester, daughter of Edward IV, died in Paris in December of this year.Second Lady Anne of Warwick, son-in-law of Prince Edward III and his son, Edward VI, were both executed by the French army on the night of December 11th, but were not executed on that night.Third Lady Edward II died at the age of twenty-two years old.Fourth Lady Elizabeth II was a prisoner of war for the purpose of her life.She was executed

Setting no_repeat_ngram_size=2 forbids any 2-gram from appearing twice in the output, which stops the model from generating repeating text, but the results are still not great.
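Conceptually, n-gram blocking works by finding every token that would complete an n-gram already present in the generated text and excluding it from the next step. The sketch below illustrates that idea in plain Python; `banned_next_tokens` is a hypothetical helper written for this explanation, not the function used inside transformers.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would recreate an n-gram already in `tokens`."""
    if len(tokens) < ngram_size:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


history = ["I'll", "be", "gone.", "I'll"]
# "I'll be" already occurred, so "be" may not follow "I'll" again.
print(banned_next_tokens(history, 2))
# prints: {'be'}
```

A decoder would then give the banned tokens zero probability before choosing the next word, which is why the repetition disappears even though the underlying model is unchanged.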

In [7]:
# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids, 
    max_length=300, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=3, 
    early_stopping=True
)

# now we have 3 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: [BOS] The King must leave the throne now. [EOS] Come, my lord, I'll leave you to the king's side.BUCKINGHAM:I'll go, sir.KING RICHARD II:Sir, let's go.BRUTUS HENRY VI:What's the King's name?ROMEO:The King of England, the Duke of York, is the son of Richard III, and the heir of the crown.The king of France, Richard II, was born in France in the year of Henry VI.He is known to be a traitor to France and to his country, as well as an enemy of his own.First Lady Margaret of Gloucester, daughter of Edward IV, died in Paris in December of this year.Second Lady Anne of Warwick, son-in-law of Prince Edward III and his son, Edward VI, were both executed by the French army on the night of December 11th, but were not executed on that night.Third Lady Edward II died at the age of twenty-two years old.Fourth Lady Elizabeth II was a prisoner of war for the purpose of her life.She was execu

Beam search can return several of its top-scoring sequences (num_return_sequences must be no greater than num_beams), so we can manually pick the best one.
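The following is a minimal sketch of beam search on a toy bigram model, to show how keeping several hypotheses can find a sequence that greedy search misses. The table `PROBS` and the function `beam_search` are hypothetical, and the sketch omits the length penalty and early stopping that a full implementation would have.

```python
import math

# Hypothetical toy model: probability of the next token given the previous one.
PROBS = {
    "<s>": {"the": 0.5, "a": 0.4, "king": 0.1},
    "the": {"king": 0.4, "crown": 0.35, "the": 0.25},
    "a": {"king": 0.9, "crown": 0.1},
    "king": {"</s>": 1.0},
    "crown": {"</s>": 1.0},
}


def beam_search(num_beams, steps, num_return_sequences):
    """Keep the `num_beams` best partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                candidates.append((score, tokens))  # finished, carry over
                continue
            for tok, p in PROBS[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams[:num_return_sequences]


for score, tokens in beam_search(num_beams=2, steps=3, num_return_sequences=2):
    print(round(math.exp(score), 2), " ".join(tokens))
```

Greedy search would start with "the" (probability 0.5) and end with a total probability of 0.2, while the beam also keeps "a" alive and finds "a king" with total probability 0.36.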

# Sampling

In [8]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.manual_seed(0)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=300, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
[BOS] The King must leave the throne now. [EOS] Mere invention for thy messengerator.KING OF HENRY VI:There is no preparation.DECUTIO:Who know you now? KING OF HENRY VI:I have his royal catchall and make power official.DECUTIO:My lord, know you would obey the king.KING OF HENRY VI:But make our king's watch.DECUTIO:Servant, what you shall hear?KING OF AUDOLA:The king is not, sir.DAVID:My lord, I do know your prince Tuke,regarde who holds the crown and, from whom he is spoken.KING OF HENRY VI:Who is Lord Duke Tuke? Duke Tuke, name your prince, Duke Tuke, both King Duke Tuke and Duke Kale.ABULT:Father Tuke, you must now see that duty.KING OF HENRY VI:For the king, I call upon you, of my godly purism to speak your mind.ABULT:I'll take the command we need to assist thee. I need to come to your ear, and speak certain words of your tongue.DECUTIO:Ay my lords, speak those words, Duke Fa

Rather than picking the highest-probability word, the sampling method draws the next word at random from the model's probability distribution. The generated text is more varied this way, but it sometimes doesn't make sense.
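As a rough illustration of what sampling means, the sketch below draws next words from a made-up distribution using Python's standard library. The distribution `probs` and the helper `sample_next` are hypothetical; the point is only that unlikely words are still picked occasionally, which is where the odd output comes from.

```python
import random


def sample_next(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-word distribution, for illustration only.
probs = {"king": 0.6, "crown": 0.3, "messengerator": 0.1}
rng = random.Random(0)  # fixed seed so the run is reproducible

counts = {t: 0 for t in probs}
for _ in range(10_000):
    counts[sample_next(probs, rng)] += 1

# Frequencies roughly follow the probabilities, so the rare word still appears.
print(counts)
```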

In [9]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.manual_seed(0)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=300, 
    top_k=0, 
    temperature=0.7
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
[BOS] The King must leave the throne now. [EOS] Merely for thy messenger, and shall not be found;Nor shall he be found. KING RICHARD III:Who shall be found? KING RICHARD III:I think he shall be found. KING RICHARD III:Why, King?KING RICHARD III:My cause, I am on my way. KING RICHARD III:My cause, I am all well. KING RICHARD III:Where should I find my way?Where should I find my way?Where should I find my way?Where should I find my way?Where should I find my way?Where should I find my way?Where should I find my way?Where should I find my way?Pray, kindly king, and welcome you.KING RICHARD III:Haggard, how come I stay here?BOS:I have heard you speak of thy messenger, and he hath not yet been found.KING RICHARD III:Let's go to London.KING RICHARD III:Let's go.KING RICHARD III:Where shall I find my way?BOS:I have heard you speak of thy messenger.KING RICHARD III:I have heard you spea

A temperature below 1 (here 0.7) sharpens the next-word distribution, so low-probability words are less likely to be picked.
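Temperature is usually applied by dividing the logits by the temperature before the softmax. The sketch below shows the effect on three made-up logits; `softmax_with_temperature` is a hypothetical helper written for this explanation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (numerically stabilised)."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate words
for t in (1.0, 0.7):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

With temperature 0.7 the most likely word gains probability and the least likely word loses it, compared with temperature 1.0.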

# Top-K Sampling

In [10]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.manual_seed(0)

# set top_k to 50
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=300, 
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
[BOS] The King must leave the throne now. [EOS]Tis nothing for thy will, and shall not change my view.Nurse, stay with you.CUTUS:As a soldier, but one for whom you, 'tis a soldier and a prophet.SICINIUS:Why, it is a traitor of justice? O, the king's blood on my head.Nurse:Well, 'tis an exile, but one where he is known for fighting.CUTUS:O my oath, sir.ISABELLA:Is the crown a sword, sir?ISABELLA:The crown is a dagger, sir?ISABELLA:No, sir.Nurse:The crown is a dagger that comes with a sword.ISABELLA:Why, it is a traitor to a King.A:Why, the king's blood.ISABELLA:A royal dagger, sir.ISABELLA:So, sir, why the king's blood?ISABELLA:Was the king in distress?ISABELLA:But, no, sir, 'tis false.Is it a royal sword to the king? ISABELLA:No, sir, it is a sword, Sir.ISABELLA:And, 'tis true.ISABELLA:Ay, but I believe you have your brother Edward VI.ISABELLA:To you.IS


Top-K sampling keeps only the k highest-probability words and redistributes the probability mass among them.
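The filtering step can be sketched in a few lines. The distribution and the helper `top_k_filter` below are hypothetical and only illustrate the idea of truncating and renormalising.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise so they sum to 1."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


# Hypothetical next-word distribution, for illustration only.
probs = {"king": 0.5, "crown": 0.3, "sword": 0.15, "catchall": 0.05}

# With k=2 only "king" and "crown" survive, at roughly 0.625 and 0.375.
print(top_k_filter(probs, 2))
```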

# Top-p (nucleus) Sampling


In [11]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.manual_seed(0)

# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=300, 
    top_p=0.92, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
[BOS] The King must leave the throne now. [EOS] Mere invention for thy own good.KING OF HENRY VI:There is no preparation.DEALINGS:Here it is, King Edward, for this power, amending it to his royal masters.First Son:My lord, are you deposed in this matter? EDWARD IV:Farewell, my lord.First Son:Now our king needs an absolute successor.First Son:Doubt not.First Son:And none worse. EDWARD IV:The king is not, my lord, my lord, any more.First Son:Doubt, my lord.Second Son:Take, my lord; consent your will.Second Son:Sir, Richard.First Son:Confess this, my lord, that I have dawd you to a King.A noble lord.KING OF HENRY VI:To your purposes, you must now have a duty.First Son:Sir, pray not.First Son:What? my lord, of my lord?First Son:To your mind, as I did, who gave my hand, we shall see the power of royal succession to come to your end.KING OF HENRY VI:And to our nobles, how are you?Seco

Top-p Sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p.
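A minimal sketch of that selection rule follows. The distributions and the helper `top_p_filter` are hypothetical; the sketch treats reaching p exactly as sufficient, which is an assumption about the boundary case.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {tok: prob / cumulative for tok, prob in kept}


# Hypothetical distributions, for illustration only.
peaked = {"king": 0.5, "crown": 0.3, "sword": 0.15, "catchall": 0.05}
flat = {"king": 0.25, "crown": 0.25, "sword": 0.25, "catchall": 0.25}

print(sorted(top_p_filter(peaked, 0.92)))  # three words are enough
print(sorted(top_p_filter(flat, 0.92)))    # all four words are needed
```

Unlike top-k, the number of candidate words adapts to the shape of the distribution: few words when the model is confident, many when it is not.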

In [13]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.manual_seed(0)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=300, 
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: [BOS] The King must leave the throne now. [EOS]Tis very strange; my father shall not be king.I'll not leave him, if he should.JULIET:I think I should leave him to him: he should stay.WARWICK:What then?EOS:He, you have lost him.WARWICK:Is it true; let's leave me!WARWICK:Is it true?EOS:He'll leave him to him, if he should, which is true!JULIET:Nay! I'll leave him.WARWICK:I'll leave him to the king: I will leave him to him.WARWICK:I am a man to be king, but he shall stay--WARWICK:Yesternock, I shall leave him to me and to the king's son, and the crown shall stay--WARWICK:He shall stay to the king, and not stay: I will stay his and to the king's son;And the crown shall remain.WARWICK:That's very strange; your father cannot stay, you know, my father.JULIET:Nay!Your father is gone;And his sons are gone.WARWICK:He must be gone; and the king shall not stay.WARWICK:Why? You have been 

Combining top-k and top-p gives a much better result. It does look like Shakespeare's style, although on a closer look you'll find it's not written by Shakespeare. Anyway, it's kind of amazing that a language model can produce such good text.

**References**

[How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate)
