# Text Generation using HuggingFace

## Installation

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git



## Sample text

The objective is to generate a continuation of a given piece of text. Let me start by taking a sentence from https://techcrunch.com/2022/05/09/hugging-face-reaches-2-billion-valuation-to-build-the-github-of-machine-learning/

# Greedy Search - Part 1

Greedy search is the simplest decoding strategy. At each time step the model computes a probability distribution over the possible next tokens, and greedy search picks the single token with the highest probability.
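To make the selection rule concrete, here is a minimal sketch of a single greedy step. The distribution `next_token_probs` and the helper `greedy_pick` are hypothetical, with made-up values for illustration; they are not GPT-2 output or part of the `transformers` API.

```python
# Toy next-token distribution (hypothetical values, not produced by GPT-2).
next_token_probs = {"learning": 0.55, "models": 0.25, "code": 0.15, "cats": 0.05}


def greedy_pick(probs):
    """Return the token with the highest probability."""
    return max(probs, key=probs.get)


chosen = greedy_pick(next_token_probs)
print(chosen)  # learning
```

A full greedy decoder would repeat this step, appending the chosen token to the input and recomputing the distribution each time.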

In [2]:
input_text_short = "Hugging Face is building the GitHub of machine learning."

In [3]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')


In [4]:
inputs = tokenizer.encode(input_text_short, return_tensors='pt')
outputs = model.generate(inputs, max_length=32)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Hugging Face is building the GitHub of machine learning.

The goal is to create a machine learning framework that can be used to build machine learning applications.


# Greedy Search - Part 2

The output above reads sensibly, and in many cases greedy search works well. For longer sequences, however, it can cause problems. Greedy search commits to the single best candidate at each time step. That choice may be the best one locally, but the resulting full sequence can be sub-optimal, because a slightly less likely token now may lead to much more likely tokens later.
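The following sketch illustrates this with a two-step toy model. All probabilities and the names `first_step`, `second_step`, `greedy_sequence` and `best_sequence` are assumptions made up for the example. Greedy search picks the most likely first word and ends up with a less likely sentence overall.

```python
# Hypothetical conditional probabilities for a two-step toy model.
first_step = {"The": 0.6, "A": 0.4}
second_step = {
    "The": {"dog": 0.4, "cat": 0.35, "car": 0.25},
    "A": {"dog": 0.9, "cat": 0.1},
}


def greedy_sequence():
    """Pick the most likely word at each step, one step at a time."""
    first = max(first_step, key=first_step.get)
    second = max(second_step[first], key=second_step[first].get)
    return (first, second), first_step[first] * second_step[first][second]


def best_sequence():
    """Score every two-word sequence and return the most likely one."""
    candidates = {
        (first, second): p1 * p2
        for first, p1 in first_step.items()
        for second, p2 in second_step[first].items()
    }
    best = max(candidates, key=candidates.get)
    return best, candidates[best]


greedy_words, greedy_prob = greedy_sequence()
best_words, best_prob = best_sequence()
print(greedy_words, round(greedy_prob, 2))  # ('The', 'dog') 0.24
print(best_words, round(best_prob, 2))      # ('A', 'dog') 0.36
```

Exhaustive scoring is only feasible for a toy vocabulary like this one; with a real model the number of candidate sequences grows exponentially with length.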

Link to the long text article - https://en.wikipedia.org/wiki/Geoffrey_Hinton

In [33]:
input_text_long = "Geoffrey E. Hinton is internationally distinguished for his work on artificial neural nets, especially how they can be designed to learn without the aid of a human teacher. This may well be the start of autonomous intelligent brain-like machines. He has compared effects of brain damage with effects of losses in such a net,"

In [34]:
inputs = tokenizer.encode(input_text_long, return_tensors='pt')

In [35]:
outputs = model.generate(inputs, max_length=200)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Geoffrey E. Hinton is internationally distinguished for his work on artificial neural nets, especially how they can be designed to learn without the aid of a human teacher. This may well be the start of autonomous intelligent brain-like machines. He has compared effects of brain damage with effects of losses in such a net, and he has also shown that the loss of a neural network can be reversed by a human teacher.

The paper is available online at: http://www.sciencedirect.com/science/article/pii/S0029105900010011

The paper is available online at: http://www.sciencedirect.com/science/article/pii/S0029105900010011

The paper is available online at: http://www.sciencedirect.com/science/article/pii/S0029105900010011

The paper is available online at: http://


*   **The output repeats itself. This is most likely because greedy decoding gets stuck in a loop: once a phrase has been generated, the same tokens keep receiving the highest probability at every following step.**

# Random Sampling

Sampling means randomly picking the next word according to the conditional probability distribution produced by the language model.
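As a rough illustration of what one sampling step does, the sketch below draws tokens from a toy distribution using the standard library. The values in `next_token_probs` and the helper `sample_pick` are hypothetical and chosen only for the example; `generate` performs the equivalent draw over the model's full vocabulary.

```python
import random

# Toy next-token distribution (hypothetical values, not produced by GPT-2).
next_token_probs = {"learning": 0.55, "models": 0.25, "code": 0.15, "cats": 0.05}


def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_pick(next_token_probs, rng) for _ in range(1000)]
counts = {token: samples.count(token) for token in next_token_probs}
print(counts)
```

Unlike greedy search, low-probability tokens are occasionally chosen, which is where both the variety and the loss of coherence come from.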

In [36]:
outputs = model.generate(inputs, max_length=200, do_sample=True)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Geoffrey E. Hinton is internationally distinguished for his work on artificial neural nets, especially how they can be designed to learn without the aid of a human teacher. This may well be the start of autonomous intelligent brain-like machines. He has compared effects of brain damage with effects of losses in such a net, but seems to have no evidence to back his hypothesis as the net can only teach in its own image. [2]

"You might say that AI can't understand the nature of real things, but it can't tell you whether something is real or not, so when we try to understand that, it takes some effort," he adds, "but this is how robots come into play, and so it's not an easy problem." This is not to say that artificial intelligence can't tell you about something that doesn't exist, but there is some benefit to being able to learn about things if it really is human. And if we have a

*   **The output looks fine at first glance, but on closer reading the generated text is too random and lacks coherence.**

# Beam Search



*   Greedy search is too rigid for generating text and tends to repeat the same output.

*   Random sampling produces text that is too random and often lacks meaning.



Beam search sits between these two extremes. Instead of committing to a single token, it keeps several candidate sequences (beams) alive at each time step and ranks them by their cumulative conditional probability. The number of candidates kept is set by the beam width B (`num_beams` in `generate`): at each time step, beam search expands every candidate and retains the B highest-scoring sequences. Note that the cell below also passes `do_sample=True`, so it runs beam search with sampling rather than purely deterministic beam search.
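The procedure can be sketched on a toy model in which the next-token probabilities depend only on the previous token. `TOY_MODEL`, its probabilities and the `beam_search` helper are hypothetical and exist only for this illustration; this is not how `transformers` implements beam search internally.

```python
import math

# Hypothetical toy model: next-token probabilities depend only on the last token.
TOY_MODEL = {
    "<s>": {"The": 0.6, "A": 0.4},
    "The": {"dog": 0.4, "cat": 0.35, "car": 0.25},
    "A": {"dog": 0.9, "cat": 0.1},
}


def beam_search(model, start, steps, beam_width):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            # Expand each beam with every possible next token.
            for token, prob in model.get(tokens[-1], {}).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break
        # Rank all expansions and keep only the top `beam_width`.
        candidates.sort(key=lambda candidate: candidate[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


greedy = beam_search(TOY_MODEL, "<s>", steps=2, beam_width=1)[0]
beam = beam_search(TOY_MODEL, "<s>", steps=2, beam_width=2)[0]
print(greedy[0][1:], round(math.exp(greedy[1]), 2))  # ['The', 'dog'] 0.24
print(beam[0][1:], round(math.exp(beam[1]), 2))      # ['A', 'dog'] 0.36
```

With a beam width of 1 the search reduces to greedy search. With a width of 2 the less likely first word "A" survives the first step, which lets the search find the more probable sentence.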

In [37]:
outputs = model.generate(inputs, max_length=200, do_sample=True, num_beams=2)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Geoffrey E. Hinton is internationally distinguished for his work on artificial neural nets, especially how they can be designed to learn without the aid of a human teacher. This may well be the start of autonomous intelligent brain-like machines. He has compared effects of brain damage with effects of losses in such a net, and his own research suggests that such systems are capable of learning without the aid of a human teacher.

Hinton's work on artificial neural nets was published in the Proceedings of the National Academy of Sciences.

Source: Wikipedia<|endoftext|>
