<a href="https://colab.research.google.com/github/adisadhu/DogsManagementSystem/blob/master/03_Text_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="text-align: center;">
<h1>  Text Generation </h1>
</div>

<b> Objective: </b> <p> The goal of this worksheet is to provide learners the ability to generate the text (few words, sentences, even paragraphs) from a given short sentence or a word with a comprehensive understanding of text generation using advanced natural language processing techniques. Participants will explore various models and methodologies that facilitate the automatic generation of coherent and contextually relevant text. This worksheet is designed to equip participants with the skills necessary to implement text generation in applications such as content creation, chatbots, and automated narrative generation, thus enhancing their ability to innovate and streamline communication processes.</p>

<b>Introduction: </b>
<p> Text generation is a process where AI generates text that mimics natural human communication.It finds its use in various applications including chatbots, content creation, automated summarization, language translation, etc.,  </p>

**Ex**:



```
 Once upon a time ...
```

*Generated text*:



```
Once upon a time in a small village nestled between rolling hills and dense forests, there lived a young boy named Eliot. He was known throughout the village for his curious nature and boundless imagination. Eliot loved to explore the woods that bordered the village, always searching for something magical and extraordinary
```





**Quick Overview**: A prompt is provided to a Large Language Model to generate text. LLM *encodes* the prompt into its readble format called *tokens*, *attends* only the required tokens and generates text.

<b> Requirements: </b>
<ol>
<li> <i> Transformers </i> - A versatile library from Hugging Face providing state-of-the-art pre-trained models for natural language processing tasks </li>
<li> <i> Torch (PyTorch) </i> - PyTorch is an open-source machine learning library known for its flexibility and speed, widely used for developing and training deep learning models.
</ol>

<b> Steps: </b>
<ol>
    <li> Install <code> transformers </code>, <code> torch </code> packages.</li>
     <li> Write source code </li>
        <p> 2.1 Import <code> transformers </code>, <code> torch </code> modules <br>
            2.2 Get the Transformer name. <br>
            2.3 Load the corresponding Tokenizer and Deep Learning model <br>
            2.4 Give a prompt <br>
           2.5 Convert that prompt to machine understandle format (<b>Tokens</b>) using <i> Encoder </i>. <br>
            2.6 Extract the meaningful tokens required with the help of <i> attention mask </i>. <br>
            2.7 Generate the output <br>
        </p>
</ol>

<h3>Step 1: Install <code> transformers </code>, <code> torch </code> packages </h3>

**Note**: if the following command fails, execute
`python -m pip install transformers`

In [1]:
pip install transformers



**Note**: if the following command fails, execute
`python -m pip install torch`

In [2]:
pip install torch



<h3> Step 2:  Write source code </h3>

<h4> Step 2.1 : Import the <code> transformers </code>, <code> torch </code> modules from the packages </h4>

In this worksheet, we are going to use **GPT2** transformer. It requires a *tokenizer* **GPT2Tokenizer** and *Deep Learning Model* **GPT2LMHeadModel**.

In [12]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [4]:
import torch

<h4> Step 2.2 : Get the Transformer name </h4>

In [11]:
model_name = "moonshotai/Kimi-K2-Instruct-0905"

<h4> Step 2.3: Load the corresponding Tokenizer </h4>

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

<h4> Step 2.4 : Load the corresponding Deep Learning Model </h4>

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name)

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

<h4> Step 2.5 : Give a prompt </h4>

In [None]:
prompt = "A world where humans are replaced by Robots...."

Convert that prompt to machine understandle format (*Tokens*) using **Encoder**

Here,


*   prompt - Given sentencce
*   return tensors - required format of tokens. Here `pt` indicates pytorch. Use `tf` for tensorflow.
*  add_special_tokens - Extra tokens that make prompt meaning. Ex: `endoftext`. Here `true` indicates to add.



In [None]:
input_ids = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=True)

Extract the **meaningful tokens**  required with the help of **attention mask**

An attention mask differentiates necessary (1) and unnecessary tokens (0), converts to ones and zeroes, keeps only ones.

In [None]:
attention_mask = torch.ones(input_ids.shape, dtype=torch.long)

<h4> Step 2.6: Generate the output </h4>

Here,

*   input_ids - token IDs for input text.
*   attention_mask - necessary tokens.
*   max_length - maximum length of the sequence to be generated.
*   num_return_sequences - number of sequences to generate from the same prompt. `100` indicates to limit the words to 100.
*   no_repeat_ngram_size - repetition of words. `2` indicates that no two words should be repeated.
*   top_k - top (based on *probability*) **k** tokens are considered for the next word. `50` indicates to consider top 50 suitable tokens.
*   top_p - smallest possible set of tokens whose cumulative probability exceeds the threshold **p**.  `0.95` indicates to consider group of tokens whose cumulative probability is 95%
*   temperature - randomness of the output generation.
    * < 1 - consider high probability words.
    * \> 1 - consider low probability words.



In [None]:
output = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=100,  # Maximum length of the generated sequence
    num_return_sequences=1,  # Number of sequences to generate
    no_repeat_ngram_size=2,  # Avoid repetition of n-grams in generated text
    top_k=50,  # Filter out tokens with top-k probabilities
    top_p=0.95,  # Filter out tokens with cumulative probability exceeding this threshold
    temperature=0.7  # Controls randomness in sampling. Higher temperature leads to more randomness.
)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


<h4> Step 2.7: print the generated text </h4>

In [None]:
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

A world where humans are replaced by Robots....

The world of robots is a world that is not only a place where robots are created, but also where they are used to do things that humans cannot do.
...
, and., and, but the world is also a space where the robots that are being created are not just robots, they're also robots. The world in which humans live is the place that the Robots are made. And the robot that makes the World
