# Cy | gist

## Summarizer model based on Google's Flan-T5

<a href="https://colab.research.google.com/github/cybardev/Sheikhspeare/blob/main/sheikhspeare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [27]:
!pip install --upgrade transformers datasets huggingface_hub



In [28]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Define the model name and load the tokenizer and model
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)



In [29]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Print out which device we're using (GPU or CPU)
print(device)

cuda


In [30]:
def raw_generator(text):
  inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to(device)
  summary_ids = model.generate(inputs["input_ids"], max_length=128, num_beams=4, early_stopping=True)
  return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

In [31]:
# Define a sample dialogue to summarize
sample_text = """
Person A: Hey, did you hear about the new project management software our company is planning to implement?

Person B: Yeah, I heard a bit about it. What’s the deal with it?

Person A: It’s called "TaskFlow." The management thinks it’s going to streamline our workflow, especially with remote teams. It’s supposed to integrate all the tools we use, like Slack, Trello, and Google Drive, into one platform.

Person B: That sounds interesting. But I’m a bit concerned about the learning curve. Is it user-friendly?

Person A: From what I’ve seen, it looks pretty intuitive. They’re also planning to run a couple of training sessions to get everyone up to speed. The first one is next Monday.

Person B: Okay, that helps. I guess I’ll have to attend that session. How does it compare to what we’re using now?

Person A: It’s supposed to be much more efficient. We’ll be able to track project progress more easily and get real-time updates. Plus, it has built-in analytics to help us with performance tracking.

Person B: That sounds promising. I just hope it doesn’t come with too many bugs at launch.

Person A: Yeah, that’s always a concern with new software. But they’ve been testing it for a while now, so fingers crossed it goes smoothly.

Person B: Let’s hope for the best. Thanks for the info!

Person A: No problem. See you at the training!
"""

In [None]:
# Summarize the sample text with the pre-trained model (before fine-tuning)
pre_finetuned_summary = raw_generator(sample_text)
print("Summary before fine-tuning:", pre_finetuned_summary)

Summary before fine-tuning: Thanks for the info!


In [None]:
from datasets import load_dataset

# Load relevant dataset
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")

README.md:   0%|          | 0.00/15.6k [00:00<?, ?B/s]

train-00000-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00001-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00002-of-00003.parquet:   0%|          | 0.00/259M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [None]:
# Split the dataset into training and testing subsets
dataset_split = dataset.train_test_split(test_size=0.1)

# Keep only 1% of the training subset to speed up development runs
small_train_dataset = dataset_split['train'].train_test_split(test_size=0.99)['train']
eval_dataset = dataset_split['test']
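
The subset sizes that appear later in the `Map` progress output (2584 and 28712) can be reproduced by hand. A minimal sketch, assuming `train_test_split` rounds the test portion up and gives the remainder to the training portion; `split_sizes` is a hypothetical helper, not part of `datasets`:

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test portion is rounded up and the rest goes to train.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# 287113 is the size of the CNN/DailyMail train split reported by load_dataset
n_train, n_eval = split_sizes(287113, 0.1)
# Keeping the 'train' side of a 0.99 test split leaves about 1% of the data
n_small_train, _ = split_sizes(n_train, 0.99)

print(n_train, n_eval, n_small_train)
```

Under that assumption the model is fine-tuned on 2584 articles and evaluated on 28712, so the evaluation subset is roughly eleven times larger than the training subset. Passing `seed=` to `train_test_split` would make both subsets reproducible across runs.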

In [None]:
def preprocess_function(examples):
  # Tokenize the articles (inputs), padding and truncating to 512 tokens.
  # Plain lists are returned: Dataset.map stores them as-is, and the Trainer
  # builds tensors and moves each batch to the model's device during training.
  model_inputs = tokenizer(examples['article'], max_length=512, padding="max_length", truncation=True)

  # Tokenize the reference summaries (labels); text_target replaces the
  # deprecated tokenizer.as_target_tokenizer() context manager
  labels = tokenizer(text_target=examples['highlights'], max_length=128, padding="max_length", truncation=True)

  # Attach the tokenized summaries as labels to the model inputs
  model_inputs["labels"] = labels["input_ids"]

  return model_inputs
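
Because the labels are padded to `max_length`, the padding positions count toward the loss unless they are masked. A common convention, not used in this notebook, is to replace pad token IDs in the labels with -100, the index that the cross-entropy loss in `transformers` models ignores. A minimal sketch on plain Python lists; `mask_label_padding` is a hypothetical helper and the token IDs in the toy batch are made up, apart from `0` (T5's pad ID) and `1` (its end-of-sequence ID):

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_label_padding(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Replace padding IDs in a batch of label sequences with ignore_index."""
    return [
        [ignore_index if token_id == pad_token_id else token_id for token_id in seq]
        for seq in label_ids
    ]

# Toy batch: two label sequences padded with T5's pad ID (0) to length 5
batch = [[363, 19, 1, 0, 0], [8774, 1, 0, 0, 0]]
masked = mask_label_padding(batch, pad_token_id=0)
print(masked)
```

In `preprocess_function` this would be applied to the tokenized label IDs (as plain Python lists, with `pad_token_id=tokenizer.pad_token_id`) before they are assigned to `model_inputs["labels"]`. Doing so changes the reported loss values, so they would no longer be comparable to the ones shown in this notebook.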

In [None]:
# Tokenize the small training dataset
tokenized_train_dataset = small_train_dataset.map(preprocess_function, batched=True)

# Tokenize the evaluation dataset
tokenized_eval_dataset = eval_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/2584 [00:00<?, ? examples/s]



Map:   0%|          | 0/28712 [00:00<?, ? examples/s]

In [None]:
from transformers import Seq2SeqTrainingArguments

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',              # Directory to save the model checkpoints
    evaluation_strategy="epoch",         # Evaluate every epoch (newer transformers releases name this eval_strategy)
    learning_rate=2e-5,                  # Learning rate for the optimizer
    per_device_train_batch_size=8,       # Batch size for training
    per_device_eval_batch_size=8,        # Batch size for evaluation
    weight_decay=0.01,                   # Regularization to prevent overfitting
    save_total_limit=3,                  # Only keep the last 3 checkpoints
    num_train_epochs=4,                  # Number of training epochs
    predict_with_generate=True,          # Enable text generation during evaluation
    logging_dir="./logs"                 # Directory for storing training logs
)



In [None]:
from transformers import Seq2SeqTrainer

# Create the trainer object
trainer = Seq2SeqTrainer(
    model=model,                            # The model to be trained
    args=training_args,                     # The training arguments defined earlier
    train_dataset=tokenized_train_dataset,  # The tokenized training dataset
    eval_dataset=tokenized_eval_dataset,    # The tokenized evaluation dataset
    tokenizer=tokenizer                     # The tokenizer to handle input and output
)

In [None]:
# Let's train
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,2.580548
2,5.800900,1.673243
3,5.800900,1.385123
4,1.883300,1.328252


TrainOutput(global_step=1292, training_loss=3.3324478669063224, metrics={'train_runtime': 2474.1052, 'train_samples_per_second': 4.178, 'train_steps_per_second': 0.522, 'total_flos': 1921364256620544.0, 'train_loss': 3.3324478669063224, 'epoch': 4.0})
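
The numbers in this output can be cross-checked with a little arithmetic. A sketch, assuming a single device, the 2584 training examples reported by `Map`, and the `Trainer` default of logging the training loss every 500 steps (`logging_steps=500`); the helper names are hypothetical:

```python
import math

def steps_per_epoch(n_examples, batch_size):
    # The last, partially filled batch still counts as one optimizer step
    return math.ceil(n_examples / batch_size)

def last_logged_step(epoch, per_epoch, logging_steps=500):
    """Most recent logging step at the end of `epoch`, or None if none yet."""
    step = (epoch * per_epoch // logging_steps) * logging_steps
    return step or None

per_epoch = steps_per_epoch(2584, 8)
total_steps = per_epoch * 4
logged = [last_logged_step(e, per_epoch) for e in range(1, 5)]

print(per_epoch, total_steps, logged)
```

This gives 323 steps per epoch and 1292 steps in total, matching `global_step=1292`. It would also explain the table: no training loss has been logged by the end of epoch 1 (step 323), epochs 2 and 3 both show the value logged at step 500, and epoch 4 shows the value from step 1000. A smaller `logging_steps` in `Seq2SeqTrainingArguments` would give a training loss for every epoch.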

In [None]:
# Evaluate the model on the evaluation dataset
metrics = trainer.evaluate()

# Print the evaluation metrics
print(metrics)

{'eval_loss': 1.328251838684082, 'eval_runtime': 479.8354, 'eval_samples_per_second': 59.837, 'eval_steps_per_second': 7.48, 'epoch': 4.0}
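
Two quick sanity checks on these metrics. A sketch, assuming the 28712 evaluation examples reported by `Map` and the batch size of 8 from the training arguments. The perplexity figure is only indicative: the labels were padded without masking, so padding positions are part of `eval_loss`.

```python
import math

metrics = {
    "eval_loss": 1.328251838684082,
    "eval_runtime": 479.8354,
}
n_eval_examples = 28712  # size of the evaluation subset
batch_size = 8           # per_device_eval_batch_size

eval_steps = math.ceil(n_eval_examples / batch_size)
samples_per_second = n_eval_examples / metrics["eval_runtime"]
steps_per_second = eval_steps / metrics["eval_runtime"]
# Perplexity is the exponential of the mean cross-entropy loss
perplexity = math.exp(metrics["eval_loss"])

print(eval_steps, round(samples_per_second, 3), round(steps_per_second, 2), round(perplexity, 2))
```

The recomputed throughput agrees with the reported `eval_samples_per_second` and `eval_steps_per_second`, and the loss corresponds to a perplexity of about 3.77. Neither number says much about summary quality; a metric such as ROUGE on generated summaries would be needed for that.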


In [None]:
def tuned_generator(text):
  # Tokenize the input text and move it to the correct device
  inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to(device)

  # Generate the summary with the fine-tuned model
  summary_ids = model.generate(inputs["input_ids"], max_length=128, num_beams=4, early_stopping=True)

  # Decode the generated summary token IDs back into text and return it
  return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

In [None]:
print(tuned_generator(sample_text))

Project management software is being developed by the company's management team. It's supposed to integrate all the tools we use, like Slack, Trello, and Google Drive into one platform. It's supposed to integrate all the tools we use, like Slack, Trello, and Google Drive, into one platform. It's supposed to integrate all the tools we use, like Slack, Trello, and Google Drive, into one platform. It's supposed to integrate all the tools we use, like Slack, Trello, and Google Drive, into one platform. It'
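
The output above loops on one sentence until it hits `max_length`, a common failure mode of beam search. One possible mitigation, not part of the original notebook, is to pass `no_repeat_ngram_size` to `model.generate`, which prevents any n-gram of that size from being generated twice. The sketch below only builds the keyword arguments and adds a small, hypothetical `has_repeated_ngram` helper for checking outputs, so it runs without loading the model:

```python
def has_repeated_ngram(text, n=3):
    """Return True if any run of n consecutive words occurs more than once."""
    words = text.split()
    seen = set()
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

# Same settings as tuned_generator, plus a ban on repeated 3-grams
generation_kwargs = dict(
    max_length=128,
    num_beams=4,
    early_stopping=True,
    no_repeat_ngram_size=3,
)

looping = (
    "It's supposed to integrate all the tools we use. "
    "It's supposed to integrate all the tools we use."
)
print(has_repeated_ngram(looping))
print(has_repeated_ngram("See you at the training!"))
```

With the model loaded, the call would look like `model.generate(inputs["input_ids"], **generation_kwargs)`. Whether this improves the summary for this particular model is untested here.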


In [32]:
# Publish the model
from google.colab import userdata

REPO_NAME = "cybargist"
HF_TOKEN = userdata.get("HF_TOKEN")

# save model and tokenizer
model.save_pretrained(REPO_NAME)
tokenizer.save_pretrained(REPO_NAME)

# push model and tokenizer to huggingface
model.push_to_hub(REPO_NAME, token=HF_TOKEN)
tokenizer.push_to_hub(REPO_NAME, token=HF_TOKEN)

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/cybardev/cybargist/commit/7b0003d6ad38db3614724b7c437c3de5c197dca6', commit_message='Upload tokenizer', commit_description='', oid='7b0003d6ad38db3614724b7c437c3de5c197dca6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/cybardev/cybargist', endpoint='https://huggingface.co', repo_type='model', repo_id='cybardev/cybargist'), pr_revision=None, pr_num=None)

---

## Testing the published model

In [33]:
# Import the required modules
from transformers import pipeline

# Check if a GPU is available
import torch
device = 0 if torch.cuda.is_available() else -1

# Load the published fine-tuned model (based on flan-t5-small) in a summarization pipeline
model = pipeline("summarization", model="cybardev/cybargist", device=device)

print("Environment set up. Model loaded on:", "GPU" if device == 0 else "CPU")

# Summarize the sample text using an instruction-style prompt
prompt = "Summarize the following text: " + sample_text
response = model(prompt)

print("Zero-shot Summary:", response[0]['summary_text'])

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/20.8k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Environment set up. Model loaded on: GPU
Zero-shot Summary: It's a good idea to get everyone up to speed with the new project management software they're planning to implement. They'll be able to track project progress more easily and get real-time updates.
