<a href="https://colab.research.google.com/github/anujsaxena/Python/blob/main/Transformers_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Creating A Paragraph Auto Generator Using GPT2 Transformers**

# **What is Natural Language Generation?**
Natural Language Generation (NLG) uses artificial intelligence to produce written or spoken text. It is a subfield of artificial intelligence that automatically transforms input data into plain-English content. What makes NLG fascinating is that it can produce human-like text, writing long sentences and whole paragraphs for you.

NLG is used to generate product or service descriptions, curate content, create portfolio summaries, and handle customer communications through chatbots. Natural language generation can be complicated, because it requires several layers of language knowledge to work. These days, NLG is being integrated into tools that speed up content strategy work and thereby increase productivity.

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  A

# **Hugging Face**

Hugging Face is an NLP-focused startup with a large open-source community that provides an open-source library for Natural Language Processing. Its approach to natural language processing revolves around Transformers. This Python library exposes an API for many well-known architectures that achieve state-of-the-art results on NLP tasks such as text classification, information extraction, question answering, and text generation. Every architecture comes with a set of pre-trained weights, so you can apply it to these tasks without training a model from scratch. The transformer models come in different shapes and sizes, and each has its own way of tokenizing input data. A tokenizer splits the input text into tokens (words or pieces of words) and encodes each token as a number, which is the form the model can process.
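To make the tokenizer idea concrete, here is a minimal sketch of a word-level tokenizer. It is only a toy: GPT-2's real tokenizer uses byte-pair encoding over pieces of words, and the vocabulary and the `encode` and `decode` helper names below are made up for illustration.

```python
# Toy word-level tokenizer (illustrative only; GPT-2 uses byte-pair encoding).
# The vocabulary is hypothetical and tiny on purpose.
vocab = {"<unk>": 0, "you": 1, "will": 2, "always": 3, "succeed": 4}
id_to_token = {idx: tok for tok, idx in vocab.items()}


def encode(text):
    """Map each lowercased, whitespace-separated word to its integer ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(ids):
    """Map integer IDs back to words and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)


ids = encode("You will always succeed")
print(ids)          # [1, 2, 3, 4]
print(decode(ids))  # you will always succeed
```

Words missing from the vocabulary fall back to the `<unk>` ID here. A subword tokenizer avoids most of that loss by splitting unfamiliar words into smaller known pieces.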

In [2]:
import torch  # the cells below request PyTorch tensors (return_tensors='pt')
from transformers import GPT2LMHeadModel, GPT2Tokenizer  # the main model and its tokenizer


We will implement a natural language generator that produces a paragraph from a single line of input text. First we set up our dependencies using Hugging Face Transformers, then we load the GPT2 model. The tokenizer encodes our input, the pre-trained model generates a coherent continuation, and the tokenizer decodes the output back into a paragraph of text.

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")


In [4]:
# Instantiate the pre-trained model. GPT-2 has no padding token of its own,
# so the tokenizer's end-of-sequence token is used as the pad token.

model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)


# **Testing the Model by Tokenizing Our First Sentence**
Now that the model has been created, we will test it by providing our first input sentence to tokenize. 

In [5]:
sentence = 'You will always succeed in Life' #input sentence

In [6]:
#Encode it into a sequence of numbers and return them as PyTorch tensors.
input_ids = tokenizer.encode(sentence, return_tensors='pt')#using pt to return as pytorch tensors

In [7]:
print(input_ids)

tensor([[1639,  481, 1464, 6758,  287, 5155]])


# **Decoding the Text and Generating the Output**
We create a new variable called `output` to hold the generated token IDs and set the generation parameters.

In [10]:
output = model.generate(input_ids, max_length=100, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)

This line passes in the encoded input and caps the total length of the sequence, prompt included, at 100 tokens (not words). Setting `num_beams=5` turns on beam search: instead of committing to the single most likely next token at each step, the model keeps the five most probable partial sequences and returns the best one overall. Setting `no_repeat_ngram_size=2` prevents any pair of consecutive tokens from appearing more than once in the output. Setting `early_stopping=True` ends the beam search as soon as five complete candidate sequences (ones that have reached the end-of-sequence token) are found.
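To show why keeping several beams can beat picking the single best token at each step, here is a minimal sketch of beam search over a hand-written probability table. This is not how `model.generate` is implemented; the `NEXT` table, the token names, and the `beam_search` function with its `max_steps` parameter are all hypothetical.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(num_beams, max_steps):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(["<s>"], 0.0)]  # (tokens, summed log-probability)
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":  # finished sequences are carried over
                candidates.append((tokens, score))
                continue
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Sort by score, best first, and keep only the top num_beams.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams[0][0]


print(beam_search(num_beams=1, max_steps=3))  # ['<s>', 'the', 'cat', '</s>']
print(beam_search(num_beams=2, max_steps=3))  # ['<s>', 'a', 'cat', '</s>']
```

With one beam the search is greedy: it commits to "the" (0.6) and ends with a sequence of probability 0.3. With two beams it also keeps "a" alive and finds "a cat", whose overall probability is 0.36.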

In [11]:
print(tokenizer.decode(output[0], skip_special_tokens=True))#printing results

You will always succeed in Life, but you will never be successful in Death."

"I am not afraid of death, because I know that I am going to be with you when you die. I will be waiting for you in Heaven, and I want you to know how much I love you. You are the most important person in my life. If you are not with me when I die, I don't know what will happen to me. It is better for me to die with


# **Trying the Model on New Prompts**
We can repeat the same steps with a new sentence and adjust the generation parameters to produce a longer paragraph. Be aware that generating more text takes longer.

In [12]:
sentence = 'Artificial intelligence is the key'
input_ids = tokenizer.encode(sentence, return_tensors='pt')
output = model.generate(input_ids, max_length=500, num_beams=5, no_repeat_ngram_size=2, early_stopping=True) #setting length as 500 to generate larger output text
print(tokenizer.decode(output[0], skip_special_tokens=True)) 

Artificial intelligence is the key to unlocking the mysteries of the universe, but it's also the source of a lot of our problems.

In a new paper published in the journal Science Advances, a team of researchers from the University of California, Berkeley, and the National Institute of Standards and Technology (NIST) in Gaithersburg, Maryland, describes a way to create an artificial intelligence (AI) system that can learn from its mistakes and improve its performance over time. The system, which they call a "neural network," is capable of learning to recognize patterns in images, recognize objects in a video, or even learn how to play a musical instrument. In the paper, the researchers describe how they created the neural network and how it can be used to train an AI system to perform a variety of tasks, such as recognizing objects and playing musical instruments.


Neural networks, also known as deep neural networks or deep learning, are a type of machine learning algorithm that is bas
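The `generate` calls above pass `no_repeat_ngram_size=2`, which forbids any pair of consecutive tokens from occurring twice. A rough sketch of what that constraint rules out is a check for repeated n-grams in a token list. The `repeated_ngrams` helper is hypothetical, and whitespace-separated words stand in for the model's real tokens.

```python
def repeated_ngrams(tokens, n):
    """Return the set of n-grams that occur more than once in tokens."""
    seen, repeated = set(), set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            repeated.add(ngram)
        seen.add(ngram)
    return repeated


# Whitespace-separated words stand in for the model's real tokens here.
words = "I love you and I love tea".split()
print(repeated_ngrams(words, 2))  # {('I', 'love')}
print(repeated_ngrams(words, 3))  # set()
```

A sequence generated under the constraint should give an empty set for `n=2` when the check is run on its token IDs. Individual words can still repeat, because the constraint only covers consecutive pairs.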

In [13]:
# Decode the generated token IDs and save the paragraph to a text file.
text = tokenizer.decode(output[0], skip_special_tokens=True)
with open('AIBLOG.txt', 'w') as f:
    f.write(text)
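Because generation stops once the length limit is reached, the generated text can end in the middle of a sentence, as the outputs above do. One possible cleanup before saving is to cut the text back to the last complete sentence. This is a minimal sketch; the `trim_to_last_sentence` helper is hypothetical and not part of the original notebook.

```python
def trim_to_last_sentence(text):
    """Drop any trailing fragment after the last '.', '!' or '?'."""
    last_end = max(text.rfind(mark) for mark in ".!?")
    if last_end == -1:  # no sentence-ending punctuation at all
        return text
    return text[:last_end + 1]


sample = "Neural networks are a type of algorithm. They are bas"
print(trim_to_last_sentence(sample))  # Neural networks are a type of algorithm.
```

The rule is deliberately simple. It treats every period as a sentence end, so abbreviations and closing quotation marks after the final period are not handled.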

In [15]:
sentence = 'Love is worship'
input_ids = tokenizer.encode(sentence, return_tensors='pt')
output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True) #setting length as 50 to generate a shorter output text
print(tokenizer.decode(output[0], skip_special_tokens=True)) 

Love is worship, love is service, and service is love," he said.

"We are all in this together. We all have a responsibility to love and serve one another."


# **Happy Transformer**

In [16]:
!pip install happytransformer

Collecting happytransformer
  Downloading happytransformer-2.4.0-py3-none-any.whl (45 kB)
Collecting datasets>=1.6.0
  Downloading datasets-1.17.0-py3-none-any.whl (306 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.1.0-py3-none-any.whl (133 kB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)

Happy Transformer is a package built on top of Hugging Face Transformers. The `vennify/t5-base-grammar-correction` model used below generates a revised version of the input text with fewer grammatical errors. It was trained with Happy Transformer on a dataset called JFLEG.

In [17]:
from happytransformer import HappyTextToText, TTSettings
happy_tt = HappyTextToText("T5", "vennify/t5-base-grammar-correction")

args = TTSettings(num_beams=5, min_length=1)

# Add the prefix "grammar: " before each input 
result = happy_tt.generate_text("grammar: This sentences has has bads grammar.", args=args)

print(result.text) # This sentence has bad grammar.


01/18/2022 07:31:18 - INFO - happytransformer.happy_transformer -   Using model: cpu


This sentence has bad grammar.


In [18]:
# Add the prefix "grammar: " before each input 
result = happy_tt.generate_text("grammar: Poor internet forces students and teachrs to climb on water tank to submit board examination form", args=args)

print(result.text) # prints the corrected sentence

Poor internet forces students and teachers to climb on water tanks to submit board examination forms.
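When correcting several sentences, the "grammar: " prefix has to be added to each one. A small sketch of a helper for that step follows; `add_grammar_prefix` and `GRAMMAR_PREFIX` are hypothetical names, and the second sample sentence is invented for illustration.

```python
GRAMMAR_PREFIX = "grammar: "


def add_grammar_prefix(sentences):
    """Prepend the task prefix the grammar-correction model expects."""
    return [GRAMMAR_PREFIX + s.strip() for s in sentences]


inputs = add_grammar_prefix([
    "This sentences has has bads grammar.",
    "  She go to school yesterday. ",
])
for prompt in inputs:
    print(prompt)
    # In the notebook, each prompt would then be passed to the model, e.g.:
    # print(happy_tt.generate_text(prompt, args=args).text)
```

The call to `happy_tt.generate_text` is left as a comment so the sketch runs without downloading the model.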
