# **Story Generation Bot using Fine-tuned Language Models**

This project demonstrates how to fine-tune a large language model on raw text data using the `unsloth` library. We will use a dataset from Hugging Face and fine-tune a model to generate stories. The process includes setting up the environment, specifying the model, fine-tuning it, and generating new stories.

## **Table of Contents**
1. [Setup](#setup)
2. [Model Selection and Configuration](#model-selection-and-configuration)
3. [Loading the Dataset](#loading-the-dataset)
4. [Training the Model](#training-the-model)
5. [Generating Stories](#generating-stories)
6. [Conclusion](#conclusion)

---

## **Setup**

First, we need to set up our environment. We'll be using Google Colab with a T4 GPU and at least 16 GB of RAM.

```python
# Install the Unsloth library (Colab build) from GitHub
!pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"
```


# **Unsloth**

Unsloth AI is a company working to make training large language models (LLMs) much faster and more efficient. LLMs are a type of AI that can understand and generate human-like text. However, training these models can take a long time and require a lot of computing power and memory. Unsloth AI has developed software, called Unsloth, that it reports can speed up LLM training by up to 30 times while reducing the amount of memory needed by up to 60%.

## **How Unsloth Works**

Unsloth’s software uses several advanced techniques to improve LLM training:

- **Manual Autograd**: This is a way of manually calculating gradients, which are used to update the model during training. By doing this manually, Unsloth can optimize the process and make it faster.
- **Chained Matrix Multiplication**: Unsloth optimizes matrix multiplications, which are a key part of LLM training, by chaining them together efficiently.
- **Triton Language Kernels**: Unsloth rewrites key parts of the training code using a special language called Triton, developed by OpenAI, which is designed for high-performance computing.
- **Flash Attention**: This is an exact attention algorithm that computes attention in tiles to reduce GPU memory reads and writes, which makes it faster and more memory-efficient. Unsloth uses implementations from xformers and Tri Dao.


## **Compatibility and Accessibility**

One of the great things about Unsloth is that it works with the hardware you already have. It supports NVIDIA, Intel, and AMD GPUs, which are the main types of processors used for AI training. This means you don’t need to buy new expensive hardware to use Unsloth.

Unsloth also offers a free open-source version of their software, the one present on GitHub, so anyone can try it out and see the benefits of faster training with less memory usage.

## **Supported Language Models**

Unsloth supports a wide range of popular language models, making it easy for users to apply their optimization techniques to their preferred models. The following table lists the language models currently supported by Unsloth, with their performance with the Open Source version:

## **Performance Results**

Unsloth has tested their software on various datasets and hardware setups, and the results are very impressive. With the Unsloth Max version the results are as follows:

- **On the Alpaca dataset**: Unsloth reduced training time from 85 hours to just 3 hours (30 times faster).
- **On the LAION Chip2 dataset**: Using 2 Tesla T4 GPUs, Unsloth went from 164 hours to 5 hours (31 times faster).
- **On the Open Assistant dataset**: Using an A10 GPU, Unsloth reduced peak memory usage from 16.7GB to 6.9GB (59% reduction).
- **On a Tesla T4 GPU**: Unsloth reduced peak memory from 14.6GB to 7.5GB (49% reduction).

These results show that Unsloth can significantly speed up LLM training while also reducing the amount of memory required.





```python
import torch

# Query the GPU's CUDA compute capability (7.5 on a Tesla T4)
major_version, minor_version = torch.cuda.get_device_capability()
!pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"
```



# **Using Unsloth to Load a Quantized Mistral-7B Model for Story Generation**
The following code demonstrates how to load and use a quantized version of the Mistral-7B model with Unsloth. The Mistral-7B model is well-suited for tasks like story generation due to its large size and powerful language modeling capabilities.

1. Import Libraries:
    * `FastLanguageModel` from `unsloth` is used to load the model.
    * `torch` is imported for tensor computations.
2. Define Parameters:
    * `max_seq_length`: Maximum sequence length for the model.
    * `dtype`: Data type for the model parameters (left as `None` here).
    * `load_in_4bit`: Boolean flag to load the model in 4-bit precision.
3. Load Model and Tokenizer:
    * The `FastLanguageModel.from_pretrained` method loads the specified model and tokenizer.
    * The `model_name` parameter specifies the quantized Mistral-7B model.
    * Other parameters configure the model loading process.

### **Why Mistral-7B is a Good Model for Story Generation**

Mistral-7B is a large language model with 7 billion parameters, making it highly capable of understanding and generating coherent, contextually relevant text. Its size allows it to capture complex patterns in language, making it well-suited for creative tasks like story generation. The model's ability to generate human-like text makes it ideal for producing engaging and diverse story content.

### **Quantized Models and Their Advantages**

Quantization is the process of reducing the precision of the model's weights and activations. This technique can significantly reduce the memory footprint and computational requirements of large language models, making them more efficient to run on hardware with limited resources.

Quantization often involves converting weights from high-precision (e.g., 32-bit floating-point) to lower precision (e.g., 8-bit integer or 4-bit precision). This reduction in precision helps in faster computation and lower memory usage without a substantial loss in model accuracy.

### **Quantization Formula**

For a weight $w$ in high precision:
$$w_{quantized} = \text{round}\left(\frac{w - w_{min}}{s}\right)$$

where:
- $w_{min}$ is the minimum weight value.
- $s$ is the scaling factor determined by the range of the weights and the target precision. For $b$-bit quantization a common choice is $s = (w_{max} - w_{min}) / (2^b - 1)$.
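The formula can be tried out with a minimal NumPy sketch. It assumes simple min-max quantization with $s = (w_{max} - w_{min}) / (2^b - 1)$. The function names are illustrative, and this is not the exact scheme used inside the 4-bit checkpoint loaded in this project.

```python
import numpy as np

def quantize(w, bits=4):
    """Map float weights to integers in [0, 2**bits - 1] (min-max quantization)."""
    w_min, w_max = float(w.min()), float(w.max())
    s = (w_max - w_min) / (2**bits - 1)  # scaling factor
    q = np.round((w - w_min) / s).astype(np.uint8)
    return q, s, w_min

def dequantize(q, s, w_min):
    """Approximate reconstruction of the original weights."""
    return q.astype(np.float32) * s + w_min

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, s, w_min = quantize(w, bits=4)
w_hat = dequantize(q, s, w_min)
max_error = float(np.abs(w - w_hat).max())
print(q, max_error)
```

The reconstruction error of each weight is bounded by about half the scaling factor, which is the precision given up in exchange for storing 4 bits instead of 32.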

### **Why Quantized Models are Better for Large Language Models (LLMs)**

1. Efficiency:
    * Reduced memory usage allows models to fit into GPU memory more easily.
    * Lower-precision weights reduce memory traffic, which can lead to quicker inference times.
2. Accessibility:
    * Enables the use of large models on hardware with limited resources, such as consumer-grade GPUs.
3. Cost-Effective:
    * Reduced hardware requirements lower the cost of running and deploying models.

By using quantized models like the Mistral-7B in 4-bit precision, researchers and developers can achieve efficient and effective story generation, making advanced AI capabilities more accessible and practical for various applications.
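To make the memory argument concrete, here is a back-of-the-envelope estimate. It assumes roughly 7.24 billion parameters for Mistral-7B (an approximate figure) and counts the weights only, ignoring activations, gradients, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7.24e9  # assumed approximate parameter count for Mistral-7B

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights need under 4 GB, while 16-bit weights alone would nearly fill a T4's roughly 15 GB of memory.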

​







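```python
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # maximum sequence length used for training and inference
dtype = None           # None lets Unsloth auto-detect float16 or bfloat16 for the GPU
load_in_4bit = True    # load the weights in 4-bit precision to save memory

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
```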



```
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Mistral patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.1.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.22.post7. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
```



# **PEFT**
PEFT stands for Parameter Efficient Fine-Tuning. It's a technique used in machine learning, particularly in the context of neural networks and natural language processing models.

In fine-tuning, a pre-trained model is adapted to perform a specific task or handle specific types of data by adjusting its parameters. PEFT focuses on making this fine-tuning process more efficient in terms of computational resources and memory usage.

Instead of adjusting all parameters of the model during fine-tuning, PEFT methods freeze most of the pre-trained weights and train only a small number of parameters, often newly added ones such as adapter layers. This reduces the computational and memory cost of fine-tuning while still achieving effective results.

In summary, PEFT is a method for fine-tuning pre-trained models that prioritizes efficiency by selectively adjusting model parameters, making it particularly useful for resource-constrained environments or large-scale applications.

# **LoRA**
LoRA stands for Low-Rank Adaptation. It is one of the most widely used PEFT methods for neural network architectures like Transformers, particularly in natural language processing tasks.

In full fine-tuning, every weight matrix of the model is updated, which for a 7-billion-parameter model requires a large amount of GPU memory for gradients and optimizer state. LoRA rests on the observation that the weight change needed to adapt a model to a new task can be approximated well by a low-rank matrix.

LoRA therefore freezes the pre-trained weights and adds a small trainable update to selected layers, so that only this update is learned during fine-tuning.

Here's how LoRA works:

Frozen Weights: The original weight matrix $W$ of a targeted layer is kept fixed and receives no gradient updates.

Low-Rank Update: Two small matrices are trained instead, $B$ (of size $d \times r$) and $A$ (of size $r \times k$), where the rank $r$ is much smaller than $d$ and $k$. The layer computes $Wx + \frac{\alpha}{r} BAx$, where $\alpha$ is a scaling factor.

Efficient Computation: Because only $A$ and $B$ are trained, the number of trainable parameters and the optimizer memory shrink dramatically. After training, $BA$ can be merged into $W$, so inference is no slower than with the original model.

Overall, LoRA makes it practical to adapt large models on limited hardware, which is what allows a 7-billion-parameter model to be fine-tuned for text generation on a single T4 GPU in this project.
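As a rough illustration of the low-rank update that LoRA (Low-Rank Adaptation) adapters add to a frozen weight matrix, here is a small NumPy sketch. The sizes, the initialization scale, and the function name are illustrative assumptions, not values taken from Unsloth.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8

W = rng.normal(size=(d_out, d_in))      # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init so the update starts at 0

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); only A and B would receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the original one, and training only has to learn the small matrices `A` and `B`.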

# **Fine-Tuning with PEFT model and LoRA Adapters**

In this code snippet, we use Unsloth's `FastLanguageModel.get_peft_model` method to attach LoRA (Low-Rank Adaptation) adapters to the model for parameter-efficient fine-tuning (PEFT). Let's break down the key components:

1. **Rank (r):**
    * The `r` parameter is the rank of the low-rank matrices used in LoRA.
    * It sets the size of the bottleneck between the two adapter matrices, and therefore the adapter's capacity to adapt to specific tasks.
    * Higher values increase capacity but also increase the number of trainable parameters and the computational cost.
    * Values like 8, 16, 32, or 64 can be chosen. Lower values may lead to faster completion.
2. **Target Modules:**
    * Target modules are the specific layers within the model that receive LoRA adapters.
    * These often include the attention projections in Transformer models: Q (query), K (key), V (value), and O (output). Here the MLP projections (`gate_proj`, `up_proj`, `down_proj`) are targeted as well.
    * Focusing on these modules adapts the attention and feed-forward computations without touching the rest of the model.
3. **LoRA Alpha:**
    * This parameter is the scaling factor used in LoRA adaptation: the low-rank update is multiplied by `lora_alpha / r`.
    * It controls the magnitude of updates applied to the model's weights, affecting how far fine-tuning can deviate from the original parameters.
4. **LoRA Dropout:**
    * Dropout applied to the input of the LoRA adapters. Setting it to zero means no dropout is applied.
    * Dropout helps prevent overfitting by randomly zeroing parts of the adaptation.
5. **Bias:**
    * The `bias` parameter controls whether biases in the targeted modules are adjusted during fine-tuning. Here, with `bias="none"`, biases are not modified.
6. **Gradient Checkpointing:**
    * This Boolean value indicates whether gradient checkpointing is used.
    * Gradient checkpointing reduces memory usage by recomputing activations during the backward pass instead of storing them, trading extra computation for the ability to train longer sequences or larger models on limited hardware.
7. **Random State:**
    * An integer seed for random number generation, ensuring reproducibility of the fine-tuning process.
    * It affects the initialization of the adapter weights and any stochastic processes in fine-tuning.
8. **Max Sequence Length:**
    * This parameter defines the maximum sequence length the model can process.


This configuration enables efficient fine-tuning of the Fast Language Model with LoRA adapters, tailored for specific tasks while considering computational resources and reproducibility.

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (adapter capacity)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
    random_state=3407,
    max_seq_length=max_seq_length,
)
```

```
Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
```
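The number of parameters these adapters add can be estimated by hand. The sketch below assumes the layer shapes from the published Mistral-7B configuration (hidden size 4096, 8 key/value heads of dimension 128, MLP size 14336, 32 layers). Each adapter contributes `r * (in_features + out_features)` parameters.

```python
# Layer shapes assumed from the published Mistral-7B configuration.
HIDDEN = 4096         # hidden_size
KV_DIM = 1024         # num_key_value_heads (8) * head_dim (128)
INTERMEDIATE = 14336  # intermediate_size of the MLP
NUM_LAYERS = 32
R = 16                # LoRA rank used above

# (in_features, out_features) of each adapted linear layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(shapes, r, num_layers):
    """Each adapter adds A (r x in) and B (out x r): r * (in + out) parameters."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

total = lora_param_count(shapes, R, NUM_LAYERS)
print(f"{total:,}")  # 41,943,040
```

This matches the trainable-parameter count that Unsloth reports when training starts, which is well under 1% of the model's roughly 7 billion weights.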


# **Dataset Description**
The dataset loaded from "roneneldan/TinyStories" is a collection of short stories. Here's a description of the dataset:

- **Name:** Ronen Eldan's TinyStories
- **Source:** The `roneneldan/TinyStories` dataset on Hugging Face, a collection of short, synthetically generated stories written with a simple vocabulary.
- **Size:** The full training split contains 2,119,719 stories. This project uses only the first 30 examples of it.
- **Format:** Each example in the dataset has a single `text` field containing one short story.
- **Content:** Simple children's-style narratives, such as the stories about Lily and Beep printed below.
- **Purpose:** Here the dataset is used for text generation. It could also serve other natural language processing tasks that involve short textual inputs, such as language modeling.

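```python
from datasets import load_dataset

# Load only the first 30 stories of the training split
dataset = load_dataset("roneneldan/TinyStories", split="train[:30]")
EOS_TOKEN = tokenizer.eos_token

def formatting_func(example):
    # Append the end-of-sequence token so the model learns where a story ends
    return example["text"] + EOS_TOKEN
```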


```python
dataset
```

```
Dataset({
    features: ['text'],
    num_rows: 30
})
```

```python
# Print the first two stories
for row in dataset[:2]["text"]:
    print("===================================")
    print(row)
```

One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.
Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun. Beep was a healthy car because he always had good fuel. Good fuel made Beep happy and strong.

One day, Beep was driving in the park when he saw a big tree. The tree had many leaves that were falling. B


# **Initializing SFTTrainer and TrainingArguments**

Here's an explanation of each parameter used in the initialization:

### **SFTTrainer Parameters:**

- `model`: The pre-trained model to be fine-tuned.
- `train_dataset`: The dataset used for training.
- `eval_dataset`: The dataset used for evaluation (not passed in this project, so no evaluation runs during training).
- `dataset_text_field`: The name of the column containing text data in the dataset.
- `tokenizer`: The tokenizer used to convert text data into a format suitable for the model.
- `max_seq_length`: The maximum sequence length allowed for input sequences.
- `packing`: A boolean indicating whether to pack short sequences together in a batch to improve training efficiency.
- `formatting_func`: A function applied to the dataset for any necessary pre-processing or formatting.
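To illustrate the idea behind `packing`, here is a simplified sketch in plain Python: tokenized examples are joined into one stream, separated by an end-of-sequence token, and cut into blocks of a fixed length. The function name and the toy token ids are hypothetical, and the actual TRL implementation may differ in details such as how the leftover tail is handled.

```python
def pack_sequences(token_lists, eos_id, block_size):
    """Concatenate tokenized examples (each followed by EOS) and cut fixed-size blocks."""
    stream = []
    for tokens in token_lists:
        stream.extend(tokens)
        stream.append(eos_id)
    # Drop the incomplete tail so every block has exactly block_size tokens.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]  # toy token ids
blocks = pack_sequences(examples, eos_id=0, block_size=4)
print(blocks)
```

Packing many short stories into long blocks is likely why the training log reports far fewer examples than the 30 stories that were loaded.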

### **TrainingArguments Parameters:**

- `per_device_train_batch_size`: Batch size per device for training.
- `gradient_accumulation_steps`: Number of batches whose gradients are accumulated before each optimizer update.
- `warmup_ratio`: Fraction of the total training steps over which the learning rate ramps up linearly from 0 to `learning_rate`.
- `max_grad_norm`: Maximum norm of the gradient for gradient clipping.
- `num_train_epochs`: Total number of training epochs.
- `learning_rate`: Initial learning rate for the optimizer.
- `fp16`: Whether to train with 16-bit floating-point (mixed) precision. Here it is enabled when the GPU does not support bfloat16.
- `logging_steps`: Interval of steps to log training progress.
- `optim`: Optimizer to use (here `adamw_8bit`, an 8-bit AdamW).
- `weight_decay`: Weight decay coefficient, a form of regularization.
- `lr_scheduler_type`: Type of learning rate scheduler (e.g., linear).
- `seed`: Random seed for reproducibility.
- `output_dir`: Directory where model checkpoints will be saved.

These parameters configure the training process, including batch size, optimization algorithms, learning rate scheduling, and more.
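How `warmup_ratio`, `learning_rate`, and a linear `lr_scheduler_type` interact can be sketched with a small function. It follows the usual linear-warmup, linear-decay rule. The 100-step run is a hypothetical example rather than the run in this project, and the function name is illustrative.

```python
import math

def linear_schedule_lr(step, total_steps, base_lr, warmup_ratio):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Hypothetical 100-step run with the learning rate and warmup ratio used here.
for step in (0, 5, 50, 100):
    print(step, linear_schedule_lr(step, total_steps=100, base_lr=2e-5, warmup_ratio=0.05))
```

The learning rate rises from zero during the warmup steps, peaks at `learning_rate`, and then falls linearly back to zero by the final step.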


```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    packing=True,  # pack short stories together into full-length sequences
    formatting_func=formatting_func,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.05,
        max_grad_norm=1.0,
        num_train_epochs=1,
        learning_rate=2e-5,
        # Use fp16 on GPUs without bfloat16 support (such as the T4)
        fp16=not torch.cuda.is_bf16_supported(),
        # bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.1,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)
```



```python
trainer_stats = trainer.train()
```

```
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 1
 "-____-"     Number of trainable parameters = 41,943,040
```

| Step | Training Loss |
|------|---------------|
| 1    | 0.3023        |
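The batch size and step count in this log can be reproduced with a short calculation. The function below is a rough approximation of how the trainer counts optimizer steps, not its exact implementation, and its name is illustrative.

```python
import math

def training_steps(num_examples, per_device_batch, grad_accum, num_gpus=1, epochs=1):
    """Effective batch size and a rough optimizer-step count for a run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # At least one optimizer step per epoch, even if fewer batches than grad_accum.
    steps_per_epoch = max(1, batches_per_epoch // grad_accum)
    return effective_batch, steps_per_epoch * epochs

print(training_steps(num_examples=2, per_device_batch=2, grad_accum=4))  # (8, 1)
```

With only two packed examples, the whole epoch amounts to a single optimizer step, so this run demonstrates the pipeline rather than a meaningful amount of training.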


## **Text Generation with Transformers**

This code snippet demonstrates how to generate text with the fine-tuned model using the generation API of the Transformers library. Let's break down the code:

### **Import Statements:**

- Imports the `TextIteratorStreamer` class from the Transformers library, which enables streaming of generated text.
-  Imports the `Thread` class from the Python standard library, used to run the text generation process concurrently.

### **Initialization:**

- Initializes a `TextIteratorStreamer` object with a tokenizer, which will be used to stream generated text.
- `max_print_width = 100`: Specifies the maximum width for printing generated text.
- `inputs = tokenizer([input_text]*1, return_tensors="pt").to("cuda")`: Tokenizes the input text using the tokenizer and prepares it for generation on a CUDA-enabled GPU.

### **Generation Parameters:**

- `generation_kwargs`: Defines the generation parameters for the model, including:
    - `inputs`: The tokenized input text.
    - `streamer`: The text streamer object for streaming generated text.
    - `max_new_tokens`: The maximum number of new tokens to generate.
    - `temperature`: The temperature parameter for controlling randomness in sampling.
    - `top_k`: The top-k parameter for top-k sampling.
    - `top_p`: The top-p parameter for nucleus sampling.
    - `repetition_penalty`: The repetition penalty parameter.
    - `use_cache`: Whether to use caching during generation.
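The effect of `temperature`, `top_k`, and `top_p` on the next-token distribution can be sketched in NumPy. This is a simplified illustration with made-up logits and an illustrative function name, not the implementation used inside Transformers.

```python
import numpy as np

def filter_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return a probability vector after temperature, top-k and top-p filtering."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k > 0:
        kth_largest = np.sort(logits)[-top_k]
        logits = np.where(logits < kth_largest, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p < 1.0:
        order = np.argsort(-probs)               # most likely token first
        cumulative = np.cumsum(probs[order])
        # Keep the smallest prefix whose probability mass reaches top_p.
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep = order[:cutoff]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        probs = filtered / filtered.sum()
    return probs

rng = np.random.default_rng(0)
probs = filter_logits([2.0, 1.0, 0.5, -1.0, -3.0], temperature=0.7, top_k=3, top_p=0.9)
next_token = rng.choice(len(probs), p=probs)
print(probs, next_token)
```

A lower temperature sharpens the distribution, `top_k` discards all but the k most likely tokens, and `top_p` keeps only the most likely tokens that together cover the requested probability mass.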

### **Text Generation:**

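- `thread = Thread(target=model.generate, kwargs=generation_kwargs)`: Creates a new thread that runs `model.generate` with the specified generation parameters. `generate` blocks until it finishes, so running it in a separate thread leaves the main thread free to read from the streamer.
- `thread.start()`: Starts the text generation process in the background. The main thread then iterates over the streamer and prints each piece of text as it arrives.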
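The producer and consumer pattern behind this setup can be shown without a model. In the sketch below, `ToyStreamer` and `fake_generate` are hypothetical stand-ins for the streamer and for `model.generate`. They only illustrate how a background thread can hand text to the main thread through a queue.

```python
from queue import Queue
from threading import Thread

class ToyStreamer:
    """Hypothetical stand-in for a streamer: a queue the producer fills and we iterate."""
    _END = object()

    def __init__(self):
        self._queue = Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        while True:
            item = self._queue.get()   # blocks until the producer sends something
            if item is self._END:
                return
            yield item

def fake_generate(streamer, words):
    """Plays the role of model.generate: pushes pieces of text, then signals the end."""
    for word in words:
        streamer.put(word + " ")
    streamer.end()

streamer = ToyStreamer()
thread = Thread(target=fake_generate,
                kwargs=dict(streamer=streamer, words=["Once", "upon", "a", "time"]))
thread.start()
received = "".join(streamer)   # the main thread consumes text as it arrives
thread.join()
print(received)
```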


```python
import textwrap
from threading import Thread

from transformers import TextIteratorStreamer

text_streamer = TextIteratorStreamer(tokenizer)
max_print_width = 100

inputs = tokenizer(
    ["Once upon a time there was a kingdom , far far away ,"],
    return_tensors="pt",
).to("cuda")

generation_kwargs = dict(
    inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    do_sample=True,            # without this, temperature/top_k/top_p are ignored
    temperature=0.7,
    top_k=50,                  # Use top-k sampling
    top_p=0.9,                 # Use nucleus (top-p) sampling
    repetition_penalty=1.2,
    use_cache=True,
)

# Run generation in a background thread so the loop below can print text as it arrives
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        # Wrap the first chunk (the prompt) to the print width
        wrapped_text = textwrap.wrap(new_text, width=max_print_width)
        length = len(wrapped_text[-1])
        print("\n".join(wrapped_text), end="")
    else:
        # Start a new line once the current one reaches the print width
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end="")
thread.join()
```

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> Once upon a time there was a kingdom , far far away, where the people were happy and lived in 
peace.

The king of this land had three daughters, who were very beautiful . The eldest daughter was 
called Princess Amber, she was kind hearted and loved to help others. She would often go out into the 
village and help those less fortunate than herself. Her father was so proud of her that he decided to give 
her a special gift for her birthday. He told his servants to find something that no one else could 
have or own. They searched high and low but couldn’t find anything until they came across an old man 
sitting by the side of the road. “What is it you are looking for?” asked the old man. “My King has asked 
me to find him something that no other person can ever possess” said the servant. “Well I think I may 
be able to help you with that” replied the old man. “I know of a place deep within the forest where 
there grows a tree which bears fruit that will make anyone who eats from its

```python
import textwrap
from threading import Thread

from transformers import TextIteratorStreamer

text_streamer = TextIteratorStreamer(tokenizer)
max_print_width = 100

inputs = tokenizer(
    ["Once upon a time there was a very beautiful peacock "],
    return_tensors="pt",
).to("cuda")

generation_kwargs = dict(
    inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    do_sample=True,            # without this, temperature/top_k/top_p are ignored
    temperature=0.5,
    top_k=50,                  # Use top-k sampling
    top_p=0.9,                 # Use nucleus (top-p) sampling
    repetition_penalty=1.2,
    use_cache=True,
    eos_token_id=tokenizer.eos_token_id,
)

# Run generation in a background thread so the loop below can print text as it arrives
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        # Wrap the first chunk (the prompt) to the print width
        wrapped_text = textwrap.wrap(new_text, width=max_print_width)
        length = len(wrapped_text[-1])
        print("\n".join(wrapped_text), end="")
    else:
        # Start a new line once the current one reaches the print width
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end="")
thread.join()
```

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> Once upon a time there was a very beautiful peacock who lived in the forest. He had many 
friends and he loved to dance with them.

One day, while dancing with his friends, he saw a beautiful 
peahen (female peacock) sitting on a tree. The moment he saw her, he fell in love with her at first 
sight. But she did not like him because of his ugly face. She thought that he is too old for her. So she 
ignored him.

The next morning when the peacock woke up, he found out that he has lost all his feathers 
from his head. His body looked so ugly without any feather. When he went to meet his friend, they 
laughed at him and made fun of him. They said “You are no longer handsome now.”

He felt sad and cried. 
Then he decided to go back home and see if his mother can help him get his feathers back. On reaching 
home, he told his mother about what happened. His mother asked him to wait till evening as it will be 
dark by then. After some time, his mother came back with a pot full of water. She

# **Conclusion**

In conclusion, this project used the 4-bit Mistral-7B model together with parameter-efficient fine-tuning (PEFT) to build a small story generation bot. Unsloth's optimized training code and the quantized base model made it possible to run the whole fine-tuning process on a single Colab T4 GPU.

PEFT with LoRA adapters keeps the fine-tuning process lightweight: only about 42 million adapter parameters are trained, which enables fast iterations and easy deployment. The run shown here is deliberately tiny (30 stories and a single optimizer step), so the generated stories likely reflect the base model for the most part. A longer run on more data would be needed to measure the effect of fine-tuning.

Moreover, working with a quantized model reduces the memory demands associated with large-scale models. This reduction makes the model more accessible and deployable across a wider range of hardware environments.

In summary, combining the Mistral-7B 4-bit model with PEFT techniques offers an efficient and accessible approach to fine-tuning large language models, and the same workflow can be scaled up to larger datasets and longer training runs.

```python
num_words = 10

# `dataset` is a Hugging Face dataset object and 'text' is the field containing the strings
input_text = dataset[:10]['text']  # Accessing the first 10 entries

# Extracting the first few words from each string
prompts = [' '.join(text.split()[:num_words]) for text in input_text]
```


```python
output_list_n = []

for i in range(10):
    input_text = prompts[i]

    text_streamer = TextIteratorStreamer(tokenizer)
    inputs = tokenizer([input_text], return_tensors="pt").to("cuda")
    generation_kwargs = dict(
        inputs,
        streamer=text_streamer,
        max_new_tokens=256,
        do_sample=True,  # without this, temperature/top_k/top_p are ignored
        temperature=0.7,
        top_k=50,  # Use top-k sampling
        top_p=0.9,  # Use nucleus (top-p) sampling
        repetition_penalty=1.2,
        use_cache=True,
    )

    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    # Collect the streamed chunks into one string per prompt
    temp_output = ""
    for new_text in text_streamer:
        temp_output += new_text
    thread.join()

    output_list_n.append(temp_output)

model_generated_story = output_list_n
print(model_generated_story)
```

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


```python
# Pair each prompt with the story generated from it
model_generated_story = list(zip(prompts, output_list_n))
print(len(model_generated_story))
```

```
10
```


```python
!pip install evaluate
!pip install rouge_score
import evaluate
```


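```python
# Assumption: the human-written references are the original TinyStories texts
# that the prompts were taken from.
human_generated_story = dataset[:10]["text"]

predictions = [story for source, story in model_generated_story]
references = human_generated_story[0:len(predictions)]
```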

```python
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=predictions,
    references=references,
    use_aggregator=True,
    use_stemmer=True,
)

print(original_model_results)
```

```
{'rouge1': 0.3355474151779245, 'rouge2': 0.09310850089112759, 'rougeL': 0.19751251834325567, 'rougeLsum': 0.277905054215921}
```
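To give a feel for what the `rouge1` score measures, here is a simplified ROUGE-1 F1 in plain Python. It splits on whitespace and skips stemming, so its numbers will not match the `evaluate` implementation exactly. The function name and the example sentences are illustrative.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped matching unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))
```

`rouge2` applies the same idea to pairs of consecutive words, and `rougeL` is based on the longest common subsequence. This helps explain why a free-form story continuation scores much lower on those metrics than on `rouge1`.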

