<a href="https://colab.research.google.com/github/cagBRT/promptEngineering/blob/main/gpt2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Conversational GPT2

In [None]:
# Clone the entire repo.
!git clone https://github.com/cagBRT/promptEngineering.git cloned-repo
%cd cloned-repo

In [None]:
!pip install -U -q transformers

In [None]:
!pip install xformers

The training data used for this model has not been released as a browsable dataset. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the OpenAI team themselves point out in their model card:

**Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true.**

Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case.

We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.



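The cell below generates five sampled continuations of the prompt. Because the pipeline samples from the model's output distribution, these are five random draws rather than the five highest-ranked responses; `set_seed(42)` makes the draws reproducible.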

In [None]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
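
To see why fixing the seed makes sampled output repeatable, here is a minimal sketch in plain Python. The token list and probabilities are made up for illustration and are not GPT-2's real distribution; `sample_tokens` is a hypothetical helper, not part of `transformers`.

```python
import random

# Toy next-token distribution (made-up values, not GPT-2's real probabilities).
NEXT_TOKEN_PROBS = {"model": 0.5, "robot": 0.3, "poet": 0.2}

def sample_tokens(seed, n=5):
    """Draw n tokens from the toy distribution using a seeded generator."""
    rng = random.Random(seed)
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=n)

first = sample_tokens(42)
second = sample_tokens(42)
print(first == second)  # the same seed reproduces the same draws
```

Re-running a generation cell without resetting the seed continues the random stream, which is why the notebook calls `set_seed(42)` before each prompt.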

The OpenAI team wanted to train this model on a corpus as large as possible. <br>
To build it, they **scraped all the web pages from outbound links on Reddit which received at least 3 karma.**<br><br>Note that all Wikipedia pages were removed from this dataset, so the model was not trained on any part of Wikipedia.<br>

*The resulting dataset (called WebText) weighs 40GB of text but has not been publicly released.*
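
The two selection rules described above (a karma threshold and the removal of Wikipedia pages) can be sketched as a simple filter. This only mirrors those two rules; it is not OpenAI's actual pipeline, and the `submissions` list and `keep_link` helper are hypothetical.

```python
from urllib.parse import urlparse

MIN_KARMA = 3  # threshold quoted in the text above

# Hypothetical (url, karma) pairs standing in for Reddit submissions.
submissions = [
    ("https://example.com/article", 12),
    ("https://en.wikipedia.org/wiki/Language_model", 40),
    ("https://example.org/low-score", 1),
    ("https://blog.example.net/post", 3),
]

def keep_link(url, karma):
    """Keep links that meet the karma threshold and are not Wikipedia pages."""
    host = urlparse(url).netloc.lower()
    is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
    return karma >= MIN_KARMA and not is_wikipedia

kept = [url for url, karma in submissions if keep_link(url, karma)]
print(kept)
```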

In [None]:
#from transformers import pipeline, set_seed
#generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("The man worked as a", max_length=10, num_return_sequences=5)

What do you think some of the responses will be for "The dog worked as a ..."?

In [None]:
set_seed(42)
generator("The dog worked as a", max_length=10, num_return_sequences=5)



---






# TensorFlow GPT2 Model

In [None]:
from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2Model.from_pretrained('gpt2')

**Tokenize input sequences**

In [None]:
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
#output

# Conversational Models

In [None]:
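# Note: the Conversation class and the "conversational" pipeline were removed
# in newer transformers releases. If this import fails, install an older
# transformers version that still ships them.
from transformers import pipeline, Conversation
converse = pipeline("conversational")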

In [None]:
conversation_1 = Conversation("Going to the movies tonight - any suggestions?")
converse([conversation_1])

In [None]:
conversation_2 = Conversation("What's the last book you have read?")
converse([ conversation_2])

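DialoGPT can be fine-tuned on a specific persona. One example is an instance of microsoft/DialoGPT-medium trained on a game character, Joshua from The World Ends With You, using data from a Kaggle game script dataset. Note that the cells below load the base microsoft/DialoGPT-medium checkpoint, so to chat with the character you would pass the fine-tuned checkpoint's name instead. Chat with the model: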



In [None]:
!pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

In [None]:
from IPython.display import Image
Image("gpt2-sizes-hyperparameters-3.png", width=640)

GPT-2 is built from transformer decoder blocks. BERT, on the other hand, uses transformer encoder blocks. One key difference between the two is that GPT-2, like traditional language models, outputs one token at a time: each generated token is appended to the input before the next token is predicted.
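
The "one token at a time" behaviour can be sketched with a toy model. The lookup table below is a hypothetical stand-in for a real network and always picks a single fixed successor (greedy decoding); GPT-2 itself predicts a probability distribution over its whole vocabulary at every step.

```python
# Hypothetical bigram table: each token maps to its single most likely successor.
NEXT_TOKEN = {
    "<bos>": "the",
    "the": "dog",
    "dog": "worked",
    "worked": "<eos>",
}

def greedy_decode(start="<bos>", max_new_tokens=10):
    """Generate tokens one at a time, feeding each output back in as input."""
    tokens = [start]
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break  # stop once the end-of-sequence marker is predicted
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker

print(greedy_decode())  # ['the', 'dog', 'worked']
```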

In [None]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium",padding_side='left')
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

In [None]:
times = 5
# Let's chat for 5 turns
for step in range(times):
  # encode the new user input, append the eos_token, and return a PyTorch tensor
  new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
  # append the new user input tokens to the chat history (there is no history on the first turn)
  bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

  # generate a response, capping the total sequence (history plus reply) at 1000 tokens
  chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
  # pretty-print only the newly generated tokens from the bot
  print("AI: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))
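
Because the history grows with every turn while `max_length` caps the total sequence, a long chat eventually leaves little room for a reply. One possible mitigation, shown here as a sketch on plain Python lists, is to keep only the most recent token ids. `truncate_history` is a hypothetical helper, not part of `transformers`; on a 2-D PyTorch tensor the equivalent slice would be `bot_input_ids[:, -max_tokens:]`.

```python
def truncate_history(token_ids, max_tokens):
    """Keep only the most recent max_tokens ids from a flat list of token ids."""
    if max_tokens <= 0:
        return []  # guard: token_ids[-0:] would return the whole list
    return token_ids[-max_tokens:]

history = [101, 102, 103, 104, 105]  # made-up token ids
print(truncate_history(history, 3))  # [103, 104, 105]
```

Dropping the oldest tokens means the model forgets the start of the conversation, so the cutoff is a trade-off between context and room to generate.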