# **Step 1: Set Up Google Colab Environment**


To create a model for summarizing chats through fine-tuning, you'll need a step-by-step approach. We'll use a pre-trained model (like OpenAI's GPT or Hugging Face's transformers) and fine-tune it on your specific dataset to summarize large chats effectively. Here's a complete guide:

Step 1: Set Up Google Colab Environment
Open Google Colab and create a new notebook.
Ensure you have a GPU runtime enabled (Runtime > Change runtime type > Hardware Accelerator > GPU).

# **Step 2: Install Required Libraries**

In [5]:
!pip install transformers datasets accelerate
!pip install wandb  # Optional, for experiment tracking


Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m13.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

# **Step 3: Prepare Your Dataset**

In [6]:
import csv
import random

# Generate a large dataset with dialogue and summary pairs
def generate_large_dataset(num_samples=1000):
    data = []
    for i in range(num_samples):
        dialogue = f"User: Hello, this is message {i}. How are you?\nAssistant: I'm fine, thank you. How can I assist you with message {i}?"
        summary = f"User greeted and inquired about assistance for message {i}."
        data.append({"dialogue": dialogue, "summary": summary})
    return data

# Create the dataset
data = generate_large_dataset(1000)

# Write the dataset to a CSV file
csv_file_path = "large_dialogue_summary_dataset.csv"

with open(csv_file_path, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["dialogue", "summary"])
    writer.writeheader()
    writer.writerows(data)

print(f"CSV file has been created: {csv_file_path}")


CSV file has been created: large_dialogue_summary_dataset.csv


In [7]:
from datasets import load_dataset

# Replace 'your_dataset_path.csv' with your uploaded dataset path
dataset = load_dataset('csv', data_files='large_dialogue_summary_dataset.csv')
print(dataset)


Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['dialogue', 'summary'],
        num_rows: 1000
    })
})


In [9]:
# Split into train and validation datasets
train_dataset = dataset['train'].train_test_split(test_size=0.1)['train']
val_dataset = dataset['train'].train_test_split(test_size=0.1)['test']


# **Step 4: Choose a Pre-trained Model**

In [10]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

# **Step 5: Preprocess the Data**

In [11]:
def preprocess_function(examples):
    inputs = tokenizer(examples['dialogue'], max_length=1024, truncation=True, padding='max_length')
    outputs = tokenizer(examples['summary'], max_length=128, truncation=True, padding='max_length')
    inputs['labels'] = outputs['input_ids']
    return inputs

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_val = val_dataset.map(preprocess_function, batched=True)


Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

# **Step 6: Define the Data Collator**

In [12]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)


# **Step 7: Set Up Training Arguments**

In [13]:
import os
from transformers import Seq2SeqTrainingArguments

# Disable Weights & Biases (wandb)
os.environ["WANDB_DISABLED"] = "true"

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    predict_with_generate=True,
    fp16=True,  # Use mixed precision for faster training on GPU
    logging_dir='./logs',
    logging_steps=10,
)


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


# **Step 8: Train the Model**

In [15]:
from transformers import Seq2SeqTrainer, DataCollatorForSeq2Seq

# Define Data Collator for padding and batch consistency
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Initialize the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,  # Replace with your tokenized training dataset
    eval_dataset=tokenized_val,     # Replace with your tokenized validation dataset
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Start training
trainer.train()

# Save the final model
trainer.save_model("./final_chat_summary_model")
tokenizer.save_pretrained("./final_chat_summary_model")

print("Training completed! The fine-tuned model is saved to './final_chat_summary_model'.")


  trainer = Seq2SeqTrainer(


Epoch,Training Loss,Validation Loss
1,0.0039,0.023974
2,0.0021,0.020394
3,0.0009,0.031836




Training completed! The fine-tuned model is saved to './final_chat_summary_model'.


# **Step 9: Evaluate the Model**

In [14]:
!pip install evaluate


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.3


In [16]:
# Install the required dependency
!pip install rouge_score

# Now import and use the ROUGE metric
import evaluate
from datasets import load_dataset

# Load ROUGE metric
rouge = evaluate.load("rouge")

# Function to compute metrics during evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = tokenizer.batch_decode(logits, skip_special_tokens=True)
    references = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute ROUGE score
    results = rouge.compute(predictions=predictions, references=references)
    return results


Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=8232656f139ecd6024fabd2c9f1dd7d8ab5c5a3bcff6a90de43f183269b791fe
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

# **Step 10: Test the Model**

In [18]:
# Import torch
import torch

# Make sure to move the model and input tensors to the same device (GPU if available, else CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Move model to the correct device
model = model.to(device)
chat = """User1: "Hey! How was your day?"
User2: "It was great, thanks for asking! I had a pretty productive day at work. How about you?"
User1: "Same here! I finished that project I was working on for the past week. Feels good to be done."
User2: "That's awesome! Congratulations. What's next on your to-do list?"
User1: "Thanks! Well, now I need to plan my weekend. I was thinking about going hiking."
User2: "That sounds fun! Where do you plan to go?"
User1: "I was thinking about the national park nearby. It has some really nice trails."
User2: "Nice! I’ve heard great things about it. Are you going with anyone?"
User1: "Probably just going alone this time. I enjoy the peace and quiet."
User2: "I get that. Sometimes it's nice to just be by yourself and enjoy nature."
User1: "Exactly! Do you like hiking?"
User2: "I do, but I haven't gone in a while. I should probably get back into it."
User1: "Yeah, you should! It's a great way to clear your mind."
User2: "For sure. Maybe I'll join you sometime."
User1: "That would be fun! Just let me know when you're free."
"""
# Tokenize the input chat and move inputs to the correct device
inputs = tokenizer(chat, return_tensors="pt", max_length=1024, truncation=True).to(device)

# Generate the summary
summary_ids = model.generate(inputs['input_ids'], max_length=128, num_beams=4, early_stopping=True)

# Decode and print the summary
print("Summary:", tokenizer.decode(summary_ids[0], skip_special_tokens=True))


Summary: User greeted and inquired about assistance for project he'd worked on for past week.User inquired about plans for hiking with co-worker about going to national park near where he was working.User and co-workers expressed interest in taking part in nature excursions for 'clearance and to enjoy nature'


In [21]:
# Import torch
import torch

# Make sure to move the model and input tensors to the same device (GPU if available, else CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Move model to the correct device
model = model.to(device)
chat = """"user1": "Did you catch the game last night?",
        "user2": "Yeah! It was such an exciting match. I can't believe they pulled off that last-minute goal!",
        "user1": "I know, right? I thought they were done for, but then out of nowhere, they scored.",
        "user2": "Exactly! That was insane. Who were you rooting for?",
        "user1": "I was cheering for the underdogs. I always like seeing the smaller teams upset the favorites.",
        "user2": "I get that. It’s always fun when that happens. I think this season is going to be really competitive.",
        "user1": "Totally! Can’t wait for the next game."
"""
# Tokenize the input chat and move inputs to the correct device
inputs = tokenizer(chat, return_tensors="pt", max_length=1024, truncation=True).to(device)

# Generate the summary
summary_ids = model.generate(inputs['input_ids'], max_length=128, num_beams=4, early_stopping=True)

# Decode and print the summary
print("Summary:", tokenizer.decode(summary_ids[0], skip_special_tokens=True))


Summary: User greeted and inquired about assistance for a game last night. "User1 was cheering for the underdogs. I always like seeing the smaller teams upset the favorites!", says "User2"User was impressed by last-minute goal for "User 1" and cheered for "user 2"


In [20]:
# Import torch
import torch

# Make sure to move the model and input tensors to the same device (GPU if available, else CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Move model to the correct device
model = model.to(device)
chat = """ "user1": "Hey, do you have any plans for the weekend?",
        "user2": "Not much, just catching up on some reading. How about you?",
        "user1": "I'm thinking about going to a new café that opened up downtown. Heard the coffee there is amazing!",
        "user2": "That sounds great! I’m always up for good coffee. What’s the name of the place?",
        "user1": "It's called Brewed Awakening. I’ve heard the vibe is really chill, perfect for a weekend hangout.",
        "user2": "Nice! I’ve been looking for a new spot. Maybe I’ll join you.",
        "user1": "You should! I’ll let you know when I’m heading out.",
        "user2": "Sounds like a plan!"
"""
# Tokenize the input chat and move inputs to the correct device
inputs = tokenizer(chat, return_tensors="pt", max_length=1024, truncation=True).to(device)

# Generate the summary
summary_ids = model.generate(inputs['input_ids'], max_length=128, num_beams=4, early_stopping=True)

# Decode and print the summary
print("Summary:", tokenizer.decode(summary_ids[0], skip_special_tokens=True))


Summary: User greeted and inquired about plans for the weekend. "User1" inquired about visiting a new cafe downtown. "Brewed Awakening" sounded "amazing!", inquired about joining "user2 for good coffee. "Sounds like a plan!", says "user1"


In [22]:
# Import torch
import torch

# Make sure to move the model and input tensors to the same device (GPU if available, else CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Move model to the correct device
model = model.to(device)
chat = """"user1": "I’ve been in the mood to bake lately. Do you like baking?",
        "user2": "I love it! I’m not the best at it, but I enjoy making cookies and cakes.",
        "user1": "Same here! I was thinking of trying a new recipe—maybe brownies this time.",
        "user2": "Ooh, brownies are always a win. Do you have a recipe in mind?",
        "user1": "I found one with caramel and sea salt. It looks amazing.",
        "user2": "That sounds delicious! I’ll definitely have to try that sometime. Let me know how they turn out!",
        "user1": "Will do! I’ll bring you some if they turn out good."
"""
# Tokenize the input chat and move inputs to the correct device
inputs = tokenizer(chat, return_tensors="pt", max_length=1024, truncation=True).to(device)

# Generate the summary
summary_ids = model.generate(inputs['input_ids'], max_length=128, num_beams=4, early_stopping=True)

# Decode and print the summary
print("Summary:", tokenizer.decode(summary_ids[0], skip_special_tokens=True))


Summary: User greeted and inquired about assistance for a recipe for brownies for "user1" "User1" and "user2" inquired about making cookies and cakes for "User 1 and "User 2" The pair enjoyed chatting about their interest in baking and looking for new recipes.
