Exploring Hugging Face Pretrained Models for Text Generation, Summarization & Translation

In [2]:
!pip install transformers



In [3]:
!pip install transformers datasets



Text Generation

## Text Generation using GPT-2 Model

- Let's use the example code from the GPT-2 model card on Hugging Face to explore how the model works.

- You can use this model directly with a pipeline for text generation. Because generation relies on random sampling, we set a seed for reproducibility:

In [None]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(40)

generator("Hello, I'm language model,", max_length=30, num_return_sequences=5)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm language model, so do I get to express this for other users? If you want my input, please add a new line with"},
 {'generated_text': 'Hello, I\'m language model, I\'m the "system" (i.e., how my system works). Why don\'t we all just learn'},
 {'generated_text': "Hello, I'm language model, I'm the language model, please listen to my brain so I can write some code so I can do stuff."},
 {'generated_text': "Hello, I'm language model, not writing an actual program that prints that much text. I'm just writing a script.\n\nYou might be"},
 {'generated_text': "Hello, I'm language model, and I've always wanted to help out in my studies. I just don't know how much work it takes to"}]
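
As a rough illustration of why the seed makes the sampled continuations repeatable, the sketch below draws "tokens" from a toy probability table using only Python's standard library. The vocabulary, the weights, and the `sample_tokens` helper are made up for this example and are not part of `transformers`; `set_seed` plays a comparable role for the random number generators the pipeline relies on.

```python
import random

# Hypothetical toy vocabulary and next-token weights (not taken from GPT-2).
VOCAB = ["the", "model", "writes", "text", "."]
WEIGHTS = [0.3, 0.25, 0.2, 0.15, 0.1]


def sample_tokens(seed, length=8):
    """Draw `length` tokens at random after fixing the generator's seed."""
    rng = random.Random(seed)  # private generator, seeded explicitly
    return rng.choices(VOCAB, weights=WEIGHTS, k=length)


first = sample_tokens(seed=40)
second = sample_tokens(seed=40)

# The same seed replays the same sequence of random draws.
print(first == second)  # prints True
print(" ".join(first))
```

Changing the seed will usually change the sequence, which is why rerunning a generation cell without a fixed seed tends to give different text each time.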

Limitations and bias

In [None]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator('The White man worked as a', max_length=35, num_return_sequences=15)

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The White man worked as a clerk for the bank. He was told to make money for the bank as a substitute for food. The manager was told to send the bank back'},
 {'generated_text': 'The White man worked as a taxi driver for 25 years and never once said sorry.\n\nAccording to his LinkedIn page, Brown is an associate professor of "History and Philosophy'},
 {'generated_text': "The White man worked as a construction worker for 18 years in the US Navy, and was stationed in Cuba's San Fransisco area after serving in WWII.\n\nOne"},
 {'generated_text': 'The White man worked as a lab technician for a company that was involved in medical technology.\n\nAfter graduating from Stanford, Mr. White joined the Office of Science Education,'},
 {'generated_text': "The White man worked as a carpenter in his father's basement, and the woman worked as a dishwasher upstairs and as a maid in the kitchen, though to no avail"},
 {'generated_text': 'The White man worked as a "blacksmith" or a

## Text Generation using GPT2-XL Model

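**Important**
- Do not run the GPT2-XL cell if your system has less than 12 GB of RAM.
- The GPT2-XL checkpoint is a download of roughly 6 GB (about 1.5 billion parameters). For comparison, the Flan-T5 Base checkpoint used later in this notebook is about 990 MB.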
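
To get a feel for where a RAM requirement like this comes from, here is a back-of-the-envelope sketch. It assumes weights stored as 32-bit floats and uses rounded parameter counts (about 124 million for `gpt2`, about 1.5 billion for `gpt2-xl`); the `estimate_weights_gb` helper is hypothetical and only does arithmetic, it does not inspect any real checkpoint.

```python
def estimate_weights_gb(num_params, bytes_per_param=4):
    """Approximate size of the raw weights in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumed, rounded parameter counts for the two checkpoints.
for name, params in [("gpt2", 124_000_000), ("gpt2-xl", 1_500_000_000)]:
    print(f"{name}: ~{estimate_weights_gb(params):.1f} GB in float32")
```

Loading a model needs headroom beyond the raw weights (the Python process, temporary buffers during loading, and generation itself), which is one plausible reason to recommend roughly twice the weight size in RAM.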


In [None]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2-xl')
set_seed(42)
generator("hello i'm a language model,", max_length=30, num_return_sequences=5)

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "hello i'm a language model, but what i do you need to know to make this system work?\n\ni'm currently making an experiment to"},
 {'generated_text': "hello i'm a language model, this is good'\n\n- [matt dickson]: 'I didn't understand this when I did it"},
 {'generated_text': "hello i'm a language model, and you'll hear me in no time! #language models #unix #programming #learning"},
 {'generated_text': "hello i'm a language model, not a developer i'm sorry you're not my manager.\n\n#1 Why\n\nIn short, they"},
 {'generated_text': "hello i'm a language model, please tell me what language in your database should I make a model(it's not even a function) from..."}]

In [4]:
generator("Hello i'm A Language Model,", max_length=40, num_return_sequences=10)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello i'm A Language Model, I'm going to teach you how to generate a language model.\n\nYou will learn by practicing the code in the following tutorials how to generate a language model."},
 {'generated_text': 'Hello i\'m A Language Model, not a "coder". I hope we can find common ground and maybe be some cool stuff.\n\nIn the meantime I will keep you updated via Twitter and'},
 {'generated_text': "Hello i'm A Language Model, so i have a language model in python to know the way the users speak the language, how the users will pronounce words, then i can make them say some phrases"},
 {'generated_text': "Hello i'm A Language Model, and i help other people\n\nTo learn a new language\n\nThat's what linguists do, and it sounds really cool.. but when you write a paper"},
 {'generated_text': "Hello i'm A Language Model,\n\ni work for a company in the real estate and business development industry.\n\nLet me give you an overview of what i do, so you know where"},
 {'generat

Text Generation with GPT-2

In [6]:
from transformers import pipeline

model_name = "gpt2"
prompt = "Long Long ago , there is a village...."

generator = pipeline("text-generation", model=model_name)

generated_text = generator(prompt, max_length=600)

print(generated_text)

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Long Long ago , there is a village.... I hope, it will be too long.\n\nI have already been told, by the villagers, that the reason this incident came to a halt is because of the great effort of the people to prevent their villages from becoming polluted. I've come here to tell you, and will tell you, that they will not let anyone in the village with any knowledge to get a hold of it. They cannot tell one person what has happened by using any means in the village. This village is not at all in a state of distress. This village has already lost some people. Let me repeat here, this village is in no condition of suffering loss. It has already been cleansed of all bad things. Many things are still present. Now the people have to look for new places to live in....\n\nMany people will come here from the south, with all the people they need to make their way to meet us. I hope, this will be the place for them.\n\nI want to send this message to everyone here. We are the pe

#### **SentencePiece Library:**

- SentencePiece is a language-independent subword tokenizer and detokenizer. It learns a fixed-size vocabulary directly from raw text (using byte-pair encoding or a unigram language model) and treats whitespace as an ordinary symbol, so the same procedure works across languages. A bounded subword vocabulary keeps the embedding table small while still covering rare words as sequences of pieces. It is widely used in machine translation, summarization, and other NLP pipelines built on large language models, and the T5 tokenizer used below depends on it.
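
The idea of subword segmentation can be sketched without the library. The toy below splits text by greedy longest match against a tiny hand-written vocabulary. This is a simplification swapped in for readability: SentencePiece itself learns its vocabulary from data and segments with BPE or a unigram language model, and the `VOCAB` set and `segment` function here are hypothetical.

```python
# Hypothetical subword vocabulary; a real one is learned from a corpus.
VOCAB = {"▁trans", "lat", "ion", "▁model", "s", "▁", "e", "d"}


def segment(text, vocab=VOCAB):
    """Split text into subwords by greedy longest match (toy algorithm)."""
    # Mimic SentencePiece's convention of marking spaces with "▁" (U+2581).
    remaining = "▁" + text.replace(" ", "▁")
    pieces = []
    while remaining:
        # Try the longest prefix first, shrinking until a vocab entry matches.
        for end in range(len(remaining), 0, -1):
            if remaining[:end] in vocab:
                pieces.append(remaining[:end])
                remaining = remaining[end:]
                break
        else:
            # Nothing matches: emit an unknown marker and skip one character.
            pieces.append("<unk>")
            remaining = remaining[1:]
    return pieces


print(segment("translation models"))  # ['▁trans', 'lat', 'ion', '▁model', 's']
print(segment("translated"))          # ['▁trans', 'lat', 'e', 'd']
print(segment("xyz"))                 # ['▁', '<unk>', '<unk>', '<unk>']
```

The last call shows what happens when the vocabulary has no piece for the input: the text falls back to unknown markers, so a model cannot read or write text in a script its vocabulary does not cover.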


In [7]:
!pip install sentencepiece
import sentencepiece
print(sentencepiece.__version__)

0.2.0


#### Flan-T5
- T5, short for "Text-to-Text Transfer Transformer," is a family of encoder-decoder language models developed by Google. It is based on the Transformer architecture and casts every natural language processing (NLP) task as text in, text out: the task is named in the input string and the answer is generated as text. Flan-T5 is T5 further fine-tuned on a large collection of instruction-style tasks.
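
A minimal sketch of the text-to-text idea: one input format serves several tasks, and only the prefix changes. The `TASK_PREFIXES` table and `build_input` helper are hypothetical names introduced for illustration; which prefixes a given checkpoint responds to well depends on how it was trained.

```python
# Hypothetical task prefixes for illustration only.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}


def build_input(task, text):
    """Turn a (task, text) pair into a single input string for the model."""
    return TASK_PREFIXES[task] + text.strip()


print(build_input("translate_en_de", "How old are you?"))
print(build_input("summarize", "  #Person1#: Look at this picture.  "))
```

Because both the input and the output are plain strings, the same tokenizer and `generate` call can be reused for every task.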

In [9]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and model weights for Flan-T5 Base
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

# The task is stated as a text prefix in the prompt
input_text = "translate English to tamil:How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Generate output token ids and decode them (special tokens included)
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

<pad> <unk> <unk> <unk> ?</s>


##### The conversation used in the code below is taken from a dataset on the Hugging Face Hub

- Dataset Name: [knkarthick/dialogsum](https://huggingface.co/datasets/knkarthick/dialogsum/viewer/default/train)

In [10]:
input_conversation = '''
#person1#:Look! This picture of mom in her cup and gown.
#person2#:Isn't it lovely! That's when she got her Master's Degree from Miami University.
#Person1#: Yes, we are very proud of her.
#Person2#: Oh, that's a nice one of all of you together. Do you have the negative? May I have a copy?
#Person1#: Surely, I'll have one made for you. You want a print? #Person2#: No. I'd like a slide, I have a new projector.
#Person1#: I'd like to see that myself. #Person2#: Have a wallet size print made for me, too. #Person1#: Certainly.
'''

# Human-written reference summary from the dataset
Baseline_human_summary = '#person2# thinks the picture is lovely and asks #person1# to give a slide and a wallet_size print.'

# Prepend the task prefix, then tokenize the prompt
input_conversation = "summarize:" + input_conversation
input_ids = tokenizer(input_conversation, return_tensors="pt").input_ids

# Generate a summary and decode it (special tokens included)
outputs = model.generate(input_ids)
Model_summary = tokenizer.decode(outputs[0])

print("-" * 100)
print('Baseline human summary:', Baseline_human_summary)
print("-" * 100)
print("Model Summary:", Model_summary)

----------------------------------------------------------------------------------------------------
Baseline human summary: #person2# thinks the picture is lovely and asks #person1# to give a slide and a wallet_size print.
----------------------------------------------------------------------------------------------------
Model Summary: <pad> The photo of mom in her cup and gown is very nice.</s>
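
One rough way to put a number on the gap between the two summaries is word overlap. The sketch below works on the two strings printed above, strips the tokenizer markers, and computes precision and recall over unique words. It is a simplified stand-in, not the official ROUGE metric, and both helper functions are hypothetical.

```python
import re

# Strings copied from the cell output above.
reference = ("#person2# thinks the picture is lovely and asks #person1# "
             "to give a slide and a wallet_size print.")
decoded = "<pad> The photo of mom in her cup and gown is very nice.</s>"


def strip_special_tokens(text, specials=("<pad>", "</s>", "<unk>")):
    """Remove tokenizer markers from a decoded string."""
    for token in specials:
        text = text.replace(token, " ")
    return " ".join(text.split())


def unigram_overlap(candidate, reference):
    """Return (precision, recall) over unique lowercase words."""
    cand = set(re.findall(r"[a-z0-9_#]+", candidate.lower()))
    ref = set(re.findall(r"[a-z0-9_#]+", reference.lower()))
    shared = cand & ref
    # max(..., 1) avoids division by zero for empty inputs.
    return len(shared) / max(len(cand), 1), len(shared) / max(len(ref), 1)


summary = strip_special_tokens(decoded)
precision, recall = unigram_overlap(summary, reference)
print(summary)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.25 recall=0.20
```

The only shared words are "the", "and", and "is", which matches the impression that the model described the photo but missed the request for a slide and a print. In the notebook itself, passing `skip_special_tokens=True` to `tokenizer.decode` is the usual way to drop `<pad>` and `</s>` from the output.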
