In [None]:
!pip install rouge-score

In [1]:
from datasets import load_dataset, load_metric
from transformers import T5Tokenizer, T5ForConditionalGeneration, Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq
import numpy as np
import torch

In [None]:
dataset = load_dataset("multi_news")

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'summary'],
        num_rows: 44972
    })
    validation: Dataset({
        features: ['document', 'summary'],
        num_rows: 5622
    })
    test: Dataset({
        features: ['document', 'summary'],
        num_rows: 5622
    })
})

In [5]:
dataset['train'].features

{'document': Value(dtype='string', id=None),
 'summary': Value(dtype='string', id=None)}

In [6]:
print(dataset['train'][0]['document'])

National Archives 
 
 Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs. 
 
 A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month. Expectations are for 203,000 new jobs to be created, according to economists polled by Dow Jones Newswires, compared to 227,000 jobs added in February. The unemployment rate is expected to hold steady at 8.3%. 
 
 Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires. Feel free to weigh-in yourself, via the comments section. And while you’re here, why don’t you sign up to follow us on Twitter. 
 
 Enjoy the show. ||||| Employers pulled back sharply on hiring last month, a reminder that the U.S. economy may not be growing fas
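
The `|||||` marker in the output above separates the individual source articles that are concatenated into one `document` string. A minimal sketch of splitting a document back into its sources, assuming that separator; the helper name `split_sources` is hypothetical and not part of the notebook:

```python
from typing import List


def split_sources(document: str, separator: str = "|||||") -> List[str]:
    """Split a concatenated multi-document string into its source articles."""
    parts = (part.strip() for part in document.split(separator))
    # Drop empty fragments left by leading or trailing separators
    return [part for part in parts if part]


# Toy input modeled on the printed sample (shortened by hand)
sample = "Enjoy the show. ||||| Employers pulled back sharply on hiring last month."
sources = split_sources(sample)
print(len(sources))
```

Counting the sources per example is one quick way to see how much material has to fit into the 512-token input window used later.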

In [7]:
print(dataset['train'][0]['summary'])

– The unemployment rate dropped to 8.2% last month, but the economy only added 120,000 jobs, when 203,000 new jobs had been predicted, according to today's jobs report. Reaction on the Wall Street Journal's MarketBeat Blog was swift: "Woah!!! Bad number." The unemployment rate, however, is better news; it had been expected to hold steady at 8.3%. But the AP notes that the dip is mostly due to more Americans giving up on seeking employment.


In [8]:
# Use the full train/validation/test splits for fine-tuning (no subsetting is applied here)
train_subset = dataset["train"]
validation_subset = dataset["validation"]
test_subset = dataset["test"]

In [2]:
checkpoint = 't5-small'
tokenizer = T5Tokenizer.from_pretrained(checkpoint)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [10]:
def preprocess_function(examples):
    inputs = ["summarize: " + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    # Tokenize the reference summaries as labels; text_target replaces the
    # deprecated tokenizer.as_target_tokenizer() context manager
    labels = tokenizer(text_target=examples["summary"], max_length=150, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
# Tokenize datasets
tokenized_train = train_subset.map(preprocess_function, batched=True)
tokenized_validation = validation_subset.map(preprocess_function, batched=True)
tokenized_test = test_subset.map(preprocess_function, batched=True)

In [14]:
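# Load the checkpoint to fine-tune (assumption: `model` was otherwise defined by a cell run out of order)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)
# Data collator: pads inputs and labels dynamically per batch
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)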
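
For intuition, the collator pads label sequences with -100 by default, the index that PyTorch's cross-entropy loss ignores, so padded positions do not contribute to the training loss. A minimal pure-Python sketch of that padding step; `pad_labels` and the token ids are made up for illustration and this is not the library's implementation:

```python
from typing import List

LABEL_PAD_ID = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_labels(batch: List[List[int]], pad_id: int = LABEL_PAD_ID) -> List[List[int]]:
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]


toy_labels = [[37, 5096, 1], [37, 1]]  # hypothetical token ids
padded = pad_labels(toy_labels)
print(padded)  # [[37, 5096, 1], [37, 1, -100]]
```

This is also why the metric function has to map -100 back to the pad token id before decoding the labels.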

In [16]:
# Define compute_metrics function
rouge = load_metric("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # Decode the predictions and labels
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
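    # Replace the -100 label padding with the pad token id (without mutating pred.label_ids in place)
    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)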
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    # Compute ROUGE scores
    rouge_output = rouge.compute(predictions=pred_str, references=label_str, use_stemmer=True)

    # Aggregate the ROUGE scores
    result = {key: value.mid.fmeasure * 100 for key, value in rouge_output.items()}
    # Count non-padding tokens per prediction (loop variable renamed so it no longer shadows `pred`)
    prediction_lens = [np.count_nonzero(ids != tokenizer.pad_token_id) for ids in pred_ids]
    result["gen_len"] = np.mean(prediction_lens)

    return result
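
To make the reported numbers easier to read, here is a simplified sketch of what a ROUGE-1 F-measure captures: unigram overlap between prediction and reference. It assumes lowercased whitespace tokenization and no stemming, so it will not reproduce the scores of the `rouge` metric used above; `rouge1_f` is a hypothetical helper.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as in the other text
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("the economy added jobs", "the economy only added 120,000 jobs")
print(round(score * 100, 2))  # 80.0
```

Here precision is 4/4 and recall is 4/6, which gives an F-measure of 0.8; the notebook scales such values by 100 in the same way.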

In [17]:
# Seq2Seq training arguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",             # Directory to save model checkpoints and logs
    evaluation_strategy="epoch",        # Evaluate the model at the end of each epoch
    learning_rate=2e-5,                 # Learning rate for the optimizer
    per_device_train_batch_size=16,     # Batch size for training
    per_device_eval_batch_size=16,      # Batch size for evaluation
    weight_decay=0.01,                  # Weight decay for regularization
    save_total_limit=3,                 # Limit the total number of checkpoints saved
    num_train_epochs=5,                 # Number of training epochs
    predict_with_generate=True,         # Use generation mode for prediction
    generation_max_length=150,          # Maximum length for generated sequences
    generation_num_beams=4,             # Number of beams for beam search during generation
)
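
As a rough sanity check on these settings, the number of optimizer steps can be estimated from the dataset size printed earlier. This sketch assumes a single device and no gradient accumulation, neither of which the notebook states:

```python
import math

train_rows = 44972   # train split size from the DatasetDict printout
batch_size = 16      # per_device_train_batch_size
epochs = 5           # num_train_epochs
num_devices = 1      # assumption: single GPU or CPU
grad_accum = 1       # assumption: default gradient_accumulation_steps

effective_batch = batch_size * num_devices * grad_accum
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 2811 14055
```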



In [18]:
# Build the Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,                       # The model to be trained
    args=training_args,                # Training arguments defined with Seq2SeqTrainingArguments
    train_dataset=tokenized_train,     # The training dataset
    eval_dataset=tokenized_validation, # The evaluation dataset
    data_collator=data_collator,       # The data collator for processing data batches
    tokenizer=tokenizer,               # The tokenizer used for preprocessing
    compute_metrics=compute_metrics,   # The function to compute evaluation metrics
)

In [24]:
# Train the model
trainer.train()
save_directory = './saved_model'
model.save_pretrained(save_directory)

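| Epoch | Training Loss | Validation Loss | Rouge1    | Rouge2    | RougeL  | RougeLsum | Gen Len    |
|-------|---------------|-----------------|-----------|-----------|---------|-----------|------------|
| 1     | 3.092         | 2.878141        | 37.950024 | 12.701885 | 21.8913 | 21.906595 | 144.865884 |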


In [20]:
# Evaluate the model on validation set
trainer.evaluate()

# Evaluate the model on test set
test_results = trainer.evaluate(eval_dataset=tokenized_test)

print(test_results)

{'eval_loss': 2.875044822692871, 'eval_model_preparation_time': 0.0053, 'eval_rouge1': 38.06026813699936, 'eval_rouge2': 12.679145428250266, 'eval_rougeL': 21.8045894617448, 'eval_rougeLsum': 21.80490631855089, 'eval_gen_len': 143.9530416221985, 'eval_runtime': 2107.9188, 'eval_samples_per_second': 2.667, 'eval_steps_per_second': 0.167}
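
The throughput fields in this dictionary can be cross-checked against the test split size and the evaluation batch size. A small sketch, assuming evaluation ran on a single device:

```python
import math

test_rows = 5622          # test split size from the DatasetDict printout
eval_batch_size = 16      # per_device_eval_batch_size
eval_runtime = 2107.9188  # seconds, from eval_runtime above

eval_steps = math.ceil(test_rows / eval_batch_size)
samples_per_second = round(test_rows / eval_runtime, 3)
steps_per_second = round(eval_steps / eval_runtime, 3)
print(eval_steps, samples_per_second, steps_per_second)  # 352 2.667 0.167
```

Both derived rates agree with the reported `eval_samples_per_second` and `eval_steps_per_second`, which suggests the whole test split was evaluated in batches of 16.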


### Test

In [3]:
# Load the fine-tuned model (the path must match the directory the model was saved to)
model = T5ForConditionalGeneration.from_pretrained('model/')

In [44]:
example_text = """NEW YORK (Reuters) -More stocks are participating in the S&P 500’s latest march to record highs, easing concerns over a rally that has been concentrated in a handful of giant technology names for much of 2024.

The S&P 500 gained 5.5% in the third quarter. This time, however, optimism that the Federal Reserve’s rate cuts will boost U.S. growth is pushing investors into shares of regional banks, industrial companies and other beneficiaries of a strong economy and lower rates, in addition to the tech-focused stocks that have already seen massive gains this year.

More than 60% of S&P 500 components outperformed the index this quarter, compared to around 25% in the first half of the year. At the same time, the equal-weight version of the S&P 500 -- a proxy for the average index stock -- gained 9% in the quarter, outperforming the S&P 500, which is more influenced by the heavily weighted shares of megacaps such as Nvidia (NASDAQ:NVDA) and Apple (NASDAQ:AAPL).

The broadening rally is an encouraging sign for stocks, investors said, following concerns that the market could be vulnerable to a reversal if the cluster of tech names propping it up fell out of favor. The “soft-landing” narrative of resilient growth will be tested by employment data at the end of the week and the start of corporate earnings season in October. 

The second half of the year so far is "almost a mirror image of what the first half was," said Kevin Gordon, senior investment strategist at Charles Schwab (NYSE:SCHW). "Even if the megacaps aren't contributing as much, as long as the rest of the market is doing well... I think that's a healthy development.”

The Fed kicked off its first rate cutting cycle in four years earlier this month with a 50-basis point reduction, a move Chairman Jerome Powell said was meant to safeguard a resilient economy. Traders are pricing in some chance of another jumbo-sized reduction when the central bank meets again in November and project about 190 basis points of cuts through the end of 2025, according to LSEG data. 

Various corners of the stock market are benefiting from expectations of lower rates and steady growth.

The S&P 500’s industrial and financials sectors - seen by investors as among the most economically sensitive areas - rose 11% and 10%, respectively, in the third quarter.

Falling rates are also a boon to shares of smaller companies, which disproportionately struggle with elevated borrowing costs. The small-cap focused Russell 2000 climbed about 9% in the quarter.

The market’s bond proxies - stocks with strong dividends - are also attracting investors seeking dividend income as bond yields fall alongside interest rates. Two such sectors, utilities and consumer staples, rose over 18% and 8%, respectively, in the period.

Mark Hackett, chief of investment research at Nationwide, said the broadening builds on a trend that appeared before the September 17-18 Fed meeting.

"We were going to have this greater participation, this leveling of performance among sectors, and then you had the Fed cut more aggressively and that's leading to... an acceleration of that trend," he said.

'QUITE HEALTHY' 

In all, eight of the S&P 500's 11 sectors outperformed the index in the third quarter. By comparison, only technology and the communications sector, which includes Google parent Alphabet (NASDAQ:GOOGL) and Facebook owner Meta Platforms (NASDAQ:META), outperformed the broader index in the first half of the year.

The S&P 500 is up more than 20% year-to-date, at record-high levels.

Meanwhile, the overall influence of the megacaps has moderated. The combined weight in the S&P 500 of the "Magnificent Seven" -- Apple, Microsoft (NASDAQ:MSFT), Nvidia, Amazon (NASDAQ:AMZN), Alphabet, Meta and Tesla (NASDAQ:TSLA) -- has declined to 31% from 34% in mid-July, according to LSEG Datastream.

"I find it to be quite healthy that tech has kind of consolidated," said King Lip, chief strategist at BakerAvenue Wealth Management. "We're not in a bear market for tech by any means. But you've definitely seen some evidence of rotation."

Investors would likely need to see further proof of economic strength for the broadening trend to continue. Jobs data on Oct. 4 will be one test of the soft landing scenario, after the prior two employment reports were weaker than expected.

Market participants will also want to see non-tech firms deliver strong earnings in the months ahead to justify their gains.

Magnificent Seven companies are expected to increase earnings by about 20% in the third quarter, against a profit rise of 2.5% for the rest of the S&P 500, according to Tajinder Dhillon, senior research analyst at LSEG. That gap is expected to shrink in 2025, with the rest of the index expected to increase earnings by 14% for the full year against a 19% rise for the megacap group.

In a soft landing scenario, the Magnificent Seven "should not have to carry the profit rebound alone," Lisa Shalett, chief investment officer at Morgan Stanley Wealth Management, said in a recent report.

"We are in the 'show me' stage for the soft landing," Shalett said.    

"""

In [47]:
# Preprocess the input text
input_text = "summarize: " + example_text
inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=2048, truncation=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)  # keep the model on the same device as the inputs
inputs = inputs.to(device)

# Generate the summary with beam search
summary_ids = model.generate(
    inputs, max_length=1024, min_length=64, length_penalty=4,
    num_beams=16, no_repeat_ngram_size=4, early_stopping=True,
)

# Decode the generated summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# print("Original Text:\n", example_text)
print("\nGenerated Summary:\n", summary)


Generated Summary:
 – More than 60% of S&P 500 components outperformed the index this quarter, compared to around 25% in the first half of the year. At the same time, the equal-weight version of the 500 -- a proxy for the average index stock -- gained 9% in the quarter, outperforming the index, which is more influenced by the heavily weighted shares of megacaps such as Nvidia (NASDAQ:NVDA) and Apple (NASDAQ:AAPL), according to LSEG. "Even if the megacaps aren't contributing as much, as long as the rest of the market is doing well... I think that's a healthy development," says Mark Hackett, senior investment strategist at Nationwide,
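
The long output is partly a consequence of `length_penalty=4` in the generation call. A simplified sketch of length-normalized beam scoring, assuming a hypothesis is ranked by its summed log-probability divided by `length ** length_penalty`; the `beam_score` helper and the numbers are made up for illustration and do not come from the model:

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized score; higher (less negative) is better."""
    return sum_logprobs / (length ** length_penalty)


short_hyp = (-6.0, 10)   # (summed log-probability, length in tokens)
long_hyp = (-20.0, 40)

winners = {}
for penalty in (0.0, 4.0):
    short_score = beam_score(*short_hyp, penalty)
    long_score = beam_score(*long_hyp, penalty)
    winners[penalty] = "long" if long_score > short_score else "short"

print(winners)  # {0.0: 'short', 4.0: 'long'}
```

With no penalty the shorter hypothesis wins on raw log-probability, while a penalty of 4 shrinks the longer hypothesis's negative score far more, so beam search drifts toward longer summaries.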
