## AISV_801 Natural Language Processing ##

#### Instructor Joseph Meyer ####

**Lab6:** Load GPT2 from Hugging Face, use it to generate text for prompts of your choice!

Find one more recent large language model, use it to generate text for prompts of your choice!

Bobby Wen

In [None]:
import torch

### Using GPT2 model to generate an output to a prompt ###

#### Loading the model directly ####

GPT-2 is a transformer model pretrained on a very large corpus of English text in a self-supervised fashion. It was pretrained on raw text only, with no human labelling (which is why it can use large amounts of publicly available data), using an automatic process to generate inputs and labels from that text. Concretely, it was trained to predict the next token in a sequence.

The inputs are sequences of continuous text of a fixed length, and the targets are the same sequences shifted one token (a word or piece of a word) to the right. Internally, the model uses a masking mechanism so that the prediction for token i uses only the inputs from 1 to i and never the future tokens.
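The shifted targets and the mask described above can be illustrated with a small pure-Python sketch. The token ids and the helper names `shifted_pairs` and `causal_mask` are made up for illustration; this is not how GPT-2 is implemented internally, only a picture of the idea.

```python
def shifted_pairs(token_ids):
    """Split one sequence into model inputs and next-token targets."""
    # The target at position i is the token that follows input position i.
    return token_ids[:-1], token_ids[1:]


def causal_mask(n):
    """mask[i][j] is True when position i is allowed to look at position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


tokens = [10, 11, 12, 13]  # hypothetical token ids
inputs, targets = shifted_pairs(tokens)
mask = causal_mask(len(inputs))

print("inputs :", inputs)
print("targets:", targets)
for row in mask:
    # "x" marks a visible position, "." marks a hidden (future) position
    print(" ".join("x" if allowed else "." for allowed in row))
```

Each row of the printed mask has one more visible position than the row before it, which is the "inputs from 1 to i" rule in matrix form.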

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. However, the model is best at what it was pretrained for: generating text from a prompt.

The base `gpt2` checkpoint is the smallest version of GPT-2, with 124M parameters. This notebook loads `gpt2-large`, a larger checkpoint with 774M parameters.

In [None]:
!pip install transformers



In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "gpt2-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id


In [None]:
def query(payload):
    # Tokenize the prompt into PyTorch tensors (input_ids and attention_mask)
    inputs = tokenizer(payload, return_tensors="pt")

    # Generate a continuation with the default settings; pass max_new_tokens
    # (for example max_new_tokens=5) to control how many tokens are added.
    model_outputs = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)

    # Take the first (and only) generated sequence and decode it back to text
    generated_tokens_ids = model_outputs.sequences[0]
    response = tokenizer.decode(generated_tokens_ids)
    return response

In [None]:
phrase = "What is the best ice cream?"
query(phrase)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'What is the best ice cream?\n\nThe best ice cream is made with real ice cream.'

### Try the model yourself.  Input a phrase ###

In [None]:
sentence = input("Enter a phrase : ")
sentence_output = query(sentence)
print(sentence_output)

Enter a phrase : What are the best dog breeds?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What are the best dog breeds?

The best dog breeds are those that are adaptable to
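The output above stops mid-sentence, which suggests generation halted at a length limit rather than at a natural end. The sketch below shows that behaviour with a toy greedy decoder. It assumes `generate` ran with its default greedy strategy and a default cap on total length. The `NEXT_TOKEN_SCORES` table and `greedy_generate` are hypothetical stand-ins, not part of `transformers`.

```python
# Hypothetical next-token scores keyed by the previous token only.
# A real language model scores the next token from the whole prefix.
NEXT_TOKEN_SCORES = {
    "the": {"best": 0.6, "dog": 0.4},
    "best": {"dog": 0.7, "the": 0.3},
    "dog": {"breeds": 0.9, "the": 0.1},
    "breeds": {"are": 0.8, "<eos>": 0.2},
    "are": {"the": 0.6, "<eos>": 0.4},
}


def greedy_generate(prompt_tokens, max_length, eos="<eos>"):
    """Append the highest-scoring next token until eos or max_length tokens."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        scores = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not scores:
            break  # no known continuation for this token
        next_token = max(scores, key=scores.get)
        if next_token == eos:
            break  # the toy model chose to stop
        tokens.append(next_token)
    return tokens


short = greedy_generate(["the"], max_length=4)
longer = greedy_generate(["the"], max_length=8)
print(" ".join(short))
print(" ".join(longer))
```

With the smaller cap the toy output is cut off after four tokens, in the same way the GPT-2 answer ends at "adaptable to". Passing a larger `max_new_tokens` to `model.generate` is the usual way to get a longer continuation.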


### Using an API ###

Let's see if another model generates a different result.

The MBZUAI/LaMini-Cerebras-111M model is part of the LaMini-LM model series introduced in the paper "LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions". It is a fine-tuned version of cerebras/Cerebras-GPT-111M, trained on the LaMini-instruction dataset, which contains 2.58M samples for instruction fine-tuning. For more information about the dataset, refer to the authors' project repository.

In [None]:
!pip install -U xformers



In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="MBZUAI/LaMini-Cerebras-111M")


In [None]:
# Load model directly (separate names so the GPT-2 tokenizer and model
# used by query() are not overwritten)
from transformers import AutoTokenizer, AutoModelForCausalLM

lamini_tokenizer = AutoTokenizer.from_pretrained("MBZUAI/LaMini-Cerebras-111M")
lamini_model = AutoModelForCausalLM.from_pretrained("MBZUAI/LaMini-Cerebras-111M")

In [None]:
import requests
API_URL = "https://api-inference.huggingface.co/models/MBZUAI/LaMini-Cerebras-111M"
headers = {"Authorization": "Bearer hf_VRt*********"}

In [None]:
def query_api(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

In [None]:
from time import sleep   ### Pause for model to load
sleep(30)  # Time in seconds
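The fixed 30-second pause works, but the request body can also carry generation settings and a request to wait until the model is loaded. The sketch below only builds such a body and does not call the network. It assumes the hosted Inference API accepts the `inputs` / `parameters` / `options` layout with a `wait_for_model` option. The helper name `build_payload` is hypothetical, and the exact fields should be checked against the current Hugging Face documentation.

```python
import json


def build_payload(prompt, max_new_tokens=20, wait_for_model=True):
    """Build a JSON-serialisable request body for a text-generation call.

    Assumed layout: "inputs" holds the prompt, "parameters" holds generation
    settings, and "options" holds request-level switches.
    """
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
        "options": {"wait_for_model": wait_for_model},
    }


payload = build_payload("What are the best dog breeds?")
print(json.dumps(payload, indent=2))
```

A dictionary like this could be passed to `query_api` in place of the bare string, since `requests.post(..., json=payload)` serialises either form.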

In [None]:
query_api(sentence)

[{'generated_text': "What are the best dog breeds? \n\nI'm sorry, I cannot provide a specific answer"}]

### Let's translate the output to French ###

Helsinki-NLP/opus-mt-en-fr  https://huggingface.co/Helsinki-NLP/opus-mt-en-fr

In [None]:
# Install the tokenizer dependencies before importing the Marian classes
!pip install transformers sentencepiece sacremoses
from transformers import MarianTokenizer, MarianMTModel



In [None]:
# Get the name of the model
model_name_fr = 'Helsinki-NLP/opus-mt-en-fr'

# Get the tokenizer
tokenizer_fr = MarianTokenizer.from_pretrained(model_name_fr)
# Instantiate the model
model_fr = MarianMTModel.from_pretrained(model_name_fr)


In [None]:
def format_batch_texts(language_code, batch_texts):
    # Prefix every text with the target-language token, e.g. ">>fr<< Hello"
    formatted_batch = [">>{}<< {}".format(language_code, text) for text in
                batch_texts]
    return formatted_batch
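To make the prefix format concrete, here is a standalone sketch of the same idea. The name `add_language_prefix` is a hypothetical stand-in for the helper above. As far as I know, the `>>xx<<` token is mainly needed by multilingual Marian checkpoints to select the target language, so it may be optional for a single-pair model such as opus-mt-en-fr.

```python
def add_language_prefix(language_code, texts):
    """Prepend a '>>code<<' target-language token to each text."""
    return [f">>{language_code}<< {text}" for text in texts]


prefixed = add_language_prefix("fr", ["Good morning", "Good evening"])
for line in prefixed:
    print(line)
```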

In [None]:
def perform_translation(batch_texts, model, tokenizer, language="fr"):

  # Prepare the text data into the appropriate format for the model
  formatted_batch_texts = format_batch_texts(language, batch_texts)

  # Tokenize the batch (padded to equal length) and generate translations
  # with the model and tokenizer passed in as arguments
  encoded = tokenizer(formatted_batch_texts, return_tensors="pt", padding=True)
  translated = model.generate(**encoded)

  # Convert the generated token indices back into text
  translated_texts = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

  return translated_texts
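The `padding=True` argument matters once a batch holds texts of different lengths, because a tensor needs rows of equal size. The sketch below shows the general idea with plain lists. The function `pad_batch`, the pad id `0`, and the token ids are illustrative assumptions, and the real tokenizer uses its own pad token.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id lists to equal length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


ids, attn = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(ids)
print(attn)
```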

In [None]:
#english_texts = ["Good morning","good evening"]  ### DEBUG Test input
english_texts = [sentence_output]

In [None]:
# Check the model translation from the original language (English) to French
translated_texts = perform_translation(english_texts, model_fr, tokenizer_fr)

# Print each original text next to its own translation
for original, text in zip(english_texts, translated_texts):
  print("Original text: \n", original)
  print("Translation : \n", text)



Original text: 
 What are the best dog breeds?

The best dog breeds are those that are adaptable to
Translation : 
 Quelles sont les meilleures races de chiens? Les meilleures races de chiens sont celles qui sont adaptables à


@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}