## Task 4: Text Summarization

In this task, I fine-tuned a pre-trained language model to perform extractive text summarization on the BBC News dataset. The objective was to generate concise summaries of longer texts and evaluate them against human-written versions. The fine-tuned model showed potential for effective text summarization in this domain. My contribution to this task was a custom preprocessing function that extracts the summary from the raw model output so that it can be evaluated against the human-written summaries using metrics such as BLEU and ROUGE.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Let's read the dataset

In [None]:
import os
import pandas as pd

# Define the path to the dataset folders
articles_folder = '/content/drive/MyDrive/Advanced NLP/Dataset/business_articles'
summaries_folder = '/content/drive/MyDrive/Advanced NLP/Dataset/business_summaries'

In [None]:
# Function to load text files from a folder into a list
def load_text_files(folder):
    texts = []
    filenames = sorted(os.listdir(folder))  # Sort to ensure matching order
    for filename in filenames:
        file_path = os.path.join(folder, filename)
        with open(file_path, 'r', encoding='utf-8') as file:
            texts.append(file.read().strip())
    return texts

In [None]:
# Load articles and summaries
articles = load_text_files(articles_folder)
summaries = load_text_files(summaries_folder)
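
Sorting the filenames only pairs articles with summaries correctly if both folders contain the same set of names. The sketch below is one way to check that assumption before building the DataFrame. The helper `check_paired_files` and the demo file names are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def check_paired_files(articles_dir, summaries_dir):
    """Return the sorted filenames shared by both folders, or raise if they differ."""
    article_names = sorted(os.listdir(articles_dir))
    summary_names = sorted(os.listdir(summaries_dir))
    if article_names != summary_names:
        unpaired = set(article_names) ^ set(summary_names)
        raise ValueError(f"Unpaired files: {sorted(unpaired)}")
    return article_names

# Tiny demo on throwaway folders (hypothetical file names).
with tempfile.TemporaryDirectory() as root:
    a_dir = os.path.join(root, "articles")
    s_dir = os.path.join(root, "summaries")
    os.makedirs(a_dir)
    os.makedirs(s_dir)
    for name in ("001.txt", "002.txt"):
        for folder in (a_dir, s_dir):
            with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
                f.write("placeholder")
    paired_names = check_paired_files(a_dir, s_dir)
```

In the notebook this would be called as `check_paired_files(articles_folder, summaries_folder)` before loading the texts.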

In [None]:
# Create a DataFrame with articles and summaries
df = pd.DataFrame({
    'article': articles,
    'human_summary': summaries
})

In [None]:
# Display the first few rows of the DataFrame
df.head()

Unnamed: 0,article,human_summary
0,Ad sales boost Time Warner profit\n\nQuarterly...,TimeWarner said fourth quarter sales rose 2% t...
1,Dollar gains on Greenspan speech\n\nThe dollar...,The dollar has hit its highest level against t...
2,Yukos unit buyer faces loan claim\n\nThe owner...,Yukos' owner Menatep Group says it will ask Ro...
3,High fuel prices hit BA's profits\n\nBritish A...,"Rod Eddington, BA's chief executive, said the ..."
4,Pernod takeover talk lifts Domecq\n\nShares in...,Pernod has reduced the debt it took on to fund...


In [None]:
instruction_prompt = "Given an article delimited by triple quotes, generate a concise summary of the key points from the article. Answer with the summary without any explanation."

def format_dataset(data):
    """Map each (article, human_summary) row to instruction/input/output columns."""

    def process_row(row):
        # Access columns by name; positional access such as row[0] on a
        # labelled Series is deprecated in recent pandas versions.
        return pd.Series({
            "instruction": instruction_prompt,
            "input": row["article"],
            "output": row["human_summary"],
        })

    return data.apply(process_row, axis=1)

df = format_dataset(df)


In [None]:
df.head()

Unnamed: 0,instruction,input,output
0,"Given an article delimited by triple quotes, g...",Ad sales boost Time Warner profit\n\nQuarterly...,TimeWarner said fourth quarter sales rose 2% t...
1,"Given an article delimited by triple quotes, g...",Dollar gains on Greenspan speech\n\nThe dollar...,The dollar has hit its highest level against t...
2,"Given an article delimited by triple quotes, g...",Yukos unit buyer faces loan claim\n\nThe owner...,Yukos' owner Menatep Group says it will ask Ro...
3,"Given an article delimited by triple quotes, g...",High fuel prices hit BA's profits\n\nBritish A...,"Rod Eddington, BA's chief executive, said the ..."
4,"Given an article delimited by triple quotes, g...",Pernod takeover talk lifts Domecq\n\nShares in...,Pernod has reduced the debt it took on to fund...


In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

train_df = train_df.reset_index()
test_df = test_df.reset_index()

# Unsloth Setup

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.9.post4: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear4bit(
      

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.9.post4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
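
As a rough sanity check on the size of this adapter, the number of LoRA parameters can be worked out by hand: each adapted linear layer gains a matrix A (`in_features x r`) and a matrix B (`r x out_features`). In the sketch below, only the `q_proj` shape is taken from the printed model. The other shapes are assumptions based on the published Llama 3.1 8B configuration.

```python
# LoRA adds r * (in_features + out_features) parameters per adapted linear layer.
r = 16
num_layers = 32

# (in_features, out_features) for each target module.
# q_proj is visible in the printed model; the remaining shapes are assumed
# from the published Llama 3.1 8B configuration.
module_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

params_per_layer = sum(r * (n_in + n_out) for n_in, n_out in module_shapes.values())
total_trainable = params_per_layer * num_layers
print(f"{total_trainable:,}")
```

Under these assumptions the total agrees with the trainable-parameter count that the trainer banner reports during training.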


In [None]:
from datasets import Dataset
train_dataset = Dataset.from_pandas(train_df)
# test_dataset = Dataset.from_pandas(test_df)

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

train_dataset = train_dataset.map(formatting_prompts_func, batched = True,)

In [None]:
train_dataset

Dataset({
    features: ['instruction', 'input', 'output', '__index_level_0__', 'text'],
    num_rows: 408
})

In [None]:
train_dataset["text"][0]

'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nGiven an article delimited by triple quotes, generate a concise summary of the key points from the article. Answer with the summary without any explanation.\n\n### Input:\nStormy year for property insurers\n\nA string of storms, typhoons and earthquakes has made 2004 the most expensive year on record for property insurers, according to Swiss Re.\n\nThe world\'s second biggest insurer said disasters around the globe have seen property claims reach $42bn (£21.5bn). "2004 reinforces the trend towards higher losses," said Swiss Re. Tightly packed populations in the areas involved in natural and man-made disasters were to partly to blame for the rise in claims, it said. Some 95% of insurance claims were for natural catastrophes, with the rest attributed to made-made events.\n\nThe largest claims came from the US, 

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 250,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 100,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

max_steps is given, it will override any value given in num_train_epochs


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 408 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 250
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
100,1.2668
200,0.9605
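
The figures in the training banner follow from the `TrainingArguments` values. The arithmetic below is a small sketch, assuming a single GPU as reported in the banner. It also shows why only two loss values are logged.

```python
import math

num_examples = 408
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 250
logging_steps = 100

# Effective batch size on a single GPU.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Optimizer steps needed to see every training example once.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)

# max_steps overrides num_train_epochs, so the epoch count is derived from it.
epochs_covered = max_steps / steps_per_epoch

# Steps at which the training loss is logged.
logged_steps = list(range(logging_steps, max_steps + 1, logging_steps))

print(effective_batch_size, steps_per_epoch, round(epochs_covered, 2), logged_steps)
```

The run therefore stops partway through its fifth pass over the data, and with `logging_steps = 100` the loss is reported only at steps 100 and 200.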


## We will save the model

In [None]:
# Push the LoRA adapter and the tokenizer to the same Hugging Face repository
model.push_to_hub("hf_id/llama3.1_8b_text_summarization", token = "hf_...") # Online saving
tokenizer.push_to_hub("hf_id/llama3.1_8b_text_summarization", token = "hf_...") # Online saving

Saved model to https://huggingface.co/auragFouad/llama3.1_8b_text_summarization


# Inference and evaluation

For inference, we load the fine-tuned model from Hugging Face and summarize the articles in the test set. A custom preprocessing function then extracts the generated summary from the raw model output, and the result is compared with the human-written summary using metrics such as ROUGE and BLEU.
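
To make the evaluation metric concrete, here is a minimal sketch of ROUGE-1 F1, the unigram-overlap variant of ROUGE. It assumes lowercase whitespace tokenization and is not the metric code used in this notebook. Standard ROUGE implementations typically use different tokenization and optional stemming, so their scores will differ somewhat. The function name `rouge1_f1` and the example sentences are illustrative only.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 with lowercase whitespace tokenization."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each reference token can be matched at most once.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative sentences, not taken from the model's output.
score = rouge1_f1(
    "retail sales fell in december",
    "uk retail sales fell sharply in december",
)
```

Here all five candidate words appear in the seven-word reference, so precision is 1, recall is 5/7, and the F1 score is about 0.83.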

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

if True:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "hf_id/llama3.1_8b_text_summarization", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.9.post4: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.9.post4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [None]:
test_df.head()

Unnamed: 0,index,instruction,input,output
0,480,"Given an article delimited by triple quotes, g...",Christmas sales worst since 1981\n\nUK retail ...,"""The retail sales figures are very weak, but a..."
1,449,"Given an article delimited by triple quotes, g...",US retail sales surge in December\n\nUS retail...,US retail sales ended the year on a high note ...
2,475,"Given an article delimited by triple quotes, g...",Saudi NCCI's shares soar\n\nShares in Saudi Ar...,Shares in Saudi Arabia's National Company for ...
3,434,"Given an article delimited by triple quotes, g...",Fosters buys stake in winemaker\n\nAustralian ...,Australian brewer Fosters has bought a large s...
4,368,"Given an article delimited by triple quotes, g...",Beer giant swallows Russian firm\n\nBrewing gi...,Inbev was formed in August 2004 when Belgium's...


In [None]:
instruction_prompt = "Given an article delimited by triple quotes, generate a concise summary of the key points from the article. Answer with the summary without any explanation."

y_pred = []

for _, test_row in test_df.iterrows():
    # Build the same Alpaca prompt used for training, leaving the response slot empty.
    prompt = alpaca_prompt.format(
        instruction_prompt,
        test_row["input"], # input (article text)
        "", # output - leave this blank for generation!
    )
    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
    # batch_decode returns a one-element list: the prompt followed by the generated text.
    output = tokenizer.batch_decode(outputs)
    y_pred.append(output)

test_df['predictions'] = y_pred
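
Each stored prediction is a one-element list holding the full decoded sequence, that is, the prompt followed by the generated text. Before scoring, the summary has to be separated from the prompt. The sketch below shows one possible way to do that by splitting on the `### Response:` marker. The helper name `extract_summary` is hypothetical, the `<|end_of_text|>` token is an assumption about the tokenizer's end-of-sequence marker, and the example string is made up rather than real model output.

```python
RESPONSE_MARKER = "### Response:"

def extract_summary(decoded, special_tokens=("<|begin_of_text|>", "<|end_of_text|>")):
    """Return the text after the last response marker, with special tokens removed."""
    # Accept either the one-element list from batch_decode or a plain string.
    text = decoded[0] if isinstance(decoded, list) else decoded
    _, found, tail = text.rpartition(RESPONSE_MARKER)
    if not found:
        return ""  # marker missing: nothing that can be safely treated as a summary
    for token in special_tokens:
        tail = tail.replace(token, "")
    return tail.strip()

# Made-up decoded output in the same shape as a stored prediction.
example = [
    "<|begin_of_text|>Below is an instruction...\n\n"
    "### Response:\nHypothetical summary text.<|end_of_text|>"
]
summary = extract_summary(example)
```

Applied to the DataFrame, this could look like `test_df["predictions"].apply(extract_summary)`.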

In [None]:
test_df["predictions"].iloc[0]

['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nGiven an article delimited by triple quotes, generate a concise summary of the key points from the article. Answer with the summary without any explanation.\n\n### Input:\nChristmas sales worst since 1981\n\nUK retail sales fell in December, failing to meet expectations and making it by some counts the worst Christmas since 1981.\n\nRetail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of retailers have already reported poor figures for December. Clothing retailers and non-specialist stores were the worst hit with only internet retailers showing any significant growth, according to the ONS.\n\nThe last time reta