<a href="https://colab.research.google.com/github/epadam/Machine-Learning-Tutorial-Demo-Resources/blob/master/notebooks/nlp/News_Summarization_with_T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### News Summary Dataset

In [None]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"ravirajag","key":"<redacted>"}'}

In [None]:
!mkdir -p ~/.kaggle

In [None]:
!cp kaggle.json ~/.kaggle

In [None]:
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d sunnysai12345/news-summary

Downloading news-summary.zip to /content
 46% 9.00M/19.8M [00:00<00:00, 24.4MB/s]
100% 19.8M/19.8M [00:00<00:00, 50.0MB/s]


In [None]:
!ls

kaggle.json  news-summary.zip  sample_data


In [None]:
!unzip news-summary.zip

Archive:  news-summary.zip
  inflating: news_summary.csv        
  inflating: news_summary_more.csv   


In [None]:
!ls

kaggle.json	  news_summary_more.csv  sample_data
news_summary.csv  news-summary.zip


### Installations

In [None]:
!pip install transformers -q


### Imports

In [None]:
import time
import numpy as np
import pandas as pd

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

from transformers import T5Tokenizer, T5ForConditionalGeneration

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [None]:
torch.manual_seed(42)
np.random.seed(42)
torch.backends.cudnn.deterministic = True

### Data Exploration

In [None]:
df = pd.read_csv('news_summary.csv', encoding="latin-1")
df.head()

Unnamed: 0,author,date,headlines,read_more,text,ctext
0,Chhavi Tyagi,"03 Aug 2017,Thursday",Daman & Diu revokes mandatory Rakshabandhan in...,http://www.hindustantimes.com/india-news/raksh...,The Administration of Union Territory Daman an...,The Daman and Diu administration on Wednesday ...
1,Daisy Mowke,"03 Aug 2017,Thursday",Malaika slams user who trolled her for 'divorc...,http://www.hindustantimes.com/bollywood/malaik...,Malaika Arora slammed an Instagram user who tr...,"From her special numbers to TV?appearances, Bo..."
2,Arshiya Chopra,"03 Aug 2017,Thursday",'Virgin' now corrected to 'Unmarried' in IGIMS...,http://www.hindustantimes.com/patna/bihar-igim...,The Indira Gandhi Institute of Medical Science...,The Indira Gandhi Institute of Medical Science...
3,Sumedha Sehra,"03 Aug 2017,Thursday",Aaj aapne pakad liya: LeT man Dujana before be...,http://indiatoday.intoday.in/story/abu-dujana-...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4,Aarushi Maheshwari,"03 Aug 2017,Thursday",Hotel staff to get training to spot signs of s...,http://indiatoday.intoday.in/story/sex-traffic...,Hotels in Maharashtra will train their staff t...,Hotels in Mumbai and other Indian cities are t...


In [None]:
len(df)

4514

In [None]:
print(f"News: {df.iloc[10]['ctext']} \nSummary: {df.iloc[10]['text']} \nHeadline: {df.iloc[10]['headlines']}")

News: The Food Safety and Standards Authority of India (FSSAI) is in the process of creating a network of food banking partners to collect and distribute leftover food from large parties and weddings to the hungry.A notification to create a separate category of food business operators (FBOs), who will be licensed to deal only with leftover food, has been drafted to ensure the quality of food.?We are looking at partnering with NGOs or organisations that collect, store and distribute surplus food to ensure they maintain certain hygiene and health standards when handling food,? said Pawan Agarwal, CEO of FSSAI.?Tonnes of food is wasted annually. We are looking at creating a mechanism through which food can be collected from restaurants, weddings, large-scale parties,?  says Pawan Agarwal, ?All food, whether it is paid for or distributed free, must meet the country?s food safety and hygiene standards,? he said.The organisations in the business of collecting leftover food will now have to w

In [None]:
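df = df[['text', 'ctext']].copy()  # explicit copy avoids pandas' SettingWithCopyWarning
df['ctext'] = "summarize: " + df['ctext']  # T5 task prefix on the full article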
df.head()

Unnamed: 0,text,ctext
0,The Administration of Union Territory Daman an...,summarize: The Daman and Diu administration on...
1,Malaika Arora slammed an Instagram user who tr...,summarize: From her special numbers to TV?appe...
2,The Indira Gandhi Institute of Medical Science...,summarize: The Indira Gandhi Institute of Medi...
3,Lashkar-e-Taiba's Kashmir commander Abu Dujana...,summarize: Lashkar-e-Taiba's Kashmir commander...
4,Hotels in Maharashtra will train their staff t...,summarize: Hotels in Mumbai and other Indian c...


In [None]:
train_size = 0.8

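# Sample the training rows, drop them by their ORIGINAL index, and only then reset indices.
# Resetting before the drop would remove rows 0..N-1 instead of the sampled rows,
# so the validation set would overlap with the training set.
train_df = df.sample(frac=train_size, random_state=42)
valid_df = df.drop(train_df.index).reset_index(drop=True)
train_df = train_df.reset_index(drop=True)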

In [None]:
len(train_df), len(valid_df)

(3611, 903)

### Configurations

In [None]:
MODEL_NAME = "t5-base"
MODEL_PATH = "model.pt"
TOKENIZER = T5Tokenizer.from_pretrained(MODEL_NAME)
MAX_LEN = 512
SUMMARY_LEN = 150
TRAIN_BATCH_SIZE = 2 
VALID_BATCH_SIZE = 2
EPOCHS = 4
LR = 1e-4

### Dataset class

In [None]:
class NewsDataset(Dataset):
    def __init__(self, df, tokenizer, source_len, summary_len):
        super().__init__()

        self.tokenizer = tokenizer
        self.data = df
        self.source_len = source_len
        self.summary_len = summary_len
        self.text = df.text
        self.ctext = df.ctext
    
    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, item):
        # ctext is the full article (model input, already prefixed with "summarize: ");
        # text is the reference summary (target).
        ctext = str(self.ctext[item])
        ctext = " ".join(ctext.split())

        text = str(self.text[item])
        text = " ".join(text.split())

        source = self.tokenizer.encode_plus(
            ctext,
            max_length=self.source_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_attention_mask=True,
            truncation=True,
            return_tensors='pt')

        target = self.tokenizer.encode_plus(
            text,
            max_length=self.summary_len,
            padding='max_length',
            return_attention_mask=True,
            truncation=True,
            return_tensors='pt')

        return {
            "source_ids": source["input_ids"].flatten(),
            "source_mask": source["attention_mask"].flatten(),
            "target_ids": target["input_ids"].flatten(),
            "target_mask": target["attention_mask"].flatten()
        }
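
The tokenizer calls in the dataset class pad or truncate every sequence to a fixed length and return a matching attention mask. The shape logic can be sketched in plain Python as below. `pad_or_truncate` and the token ids are hypothetical and only illustrate the idea; the real tokenizer also handles special tokens such as the end-of-sequence marker when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # cut off anything beyond max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # fill the remainder with the pad id
    mask += [0] * n_pad                  # 0 marks padding
    return ids, mask

# Hypothetical token ids, shorter and longer than max_length
short_ids, short_mask = pad_or_truncate([21, 7, 1], max_length=5)
long_ids, long_mask = pad_or_truncate([21, 7, 9, 4, 12, 1], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

This is why every item in a batch has the same shape regardless of the article length.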

In [None]:
train_dataset = NewsDataset(train_df, TOKENIZER, MAX_LEN, SUMMARY_LEN)
valid_dataset = NewsDataset(valid_df, TOKENIZER, MAX_LEN, SUMMARY_LEN)

### DataLoaders

In [None]:
train_data_loader = DataLoader(train_dataset, batch_size=TRAIN_BATCH_SIZE, shuffle=True)
val_data_loader = DataLoader(valid_dataset, batch_size=VALID_BATCH_SIZE, shuffle=False)

In [None]:
# sample check
sample = next(iter(train_data_loader))
sample['source_ids'].shape, sample['source_mask'].shape, sample['target_ids'].shape

(torch.Size([2, 512]), torch.Size([2, 512]), torch.Size([2, 150]))
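
As a sanity check, the split sizes and the number of batches per epoch can be derived from the numbers reported so far. The sketch below assumes that the sampled fraction is rounded to the nearest row count and that the last, smaller batch is kept (the `DataLoader` default).

```python
import math

n_rows = 4514        # len(df) reported earlier
train_frac = 0.8     # train_size
batch_size = 2       # TRAIN_BATCH_SIZE and VALID_BATCH_SIZE

n_train = round(n_rows * train_frac)   # assumption: fraction rounded to whole rows
n_valid = n_rows - n_train

# The final partial batch is kept, hence the ceiling
train_steps = math.ceil(n_train / batch_size)
valid_steps = math.ceil(n_valid / batch_size)

print(n_train, n_valid, train_steps, valid_steps)  # 3611 903 1806 452
```

The step counts agree with the `Step: .../1806` and `Val Step: .../452` totals in the training log.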

### Model

In [None]:
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
model.to(device)

Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseReluDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dr

### Optimizer

In [None]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=LR)

### Training Method

In [None]:
def train(data_loader, model, tokenizer, optimizer, device):
    model.train()

    total_steps = len(data_loader)
    epoch_loss = 0

    for idx, batch in enumerate(data_loader):
        optimizer.zero_grad()

        ids = batch["source_ids"].to(device)
        mask = batch["source_mask"].to(device)

        target_ids = batch["target_ids"].to(device)
        
        y_ids = target_ids[:, :-1].contiguous()
        lm_labels = target_ids[:, 1:].clone().detach()
        lm_labels[target_ids[:, 1:] == tokenizer.pad_token_id] = -100

        outputs = model(
            input_ids=ids,
            attention_mask=mask,
            decoder_input_ids=y_ids,
            labels=lm_labels  # the `lm_labels` keyword is deprecated in favor of `labels`
        )

        loss = outputs[0]
        epoch_loss += loss.item()

        loss.backward()
        optimizer.step()

        if idx%100 == 0:
            print(f"Step: {idx}/{total_steps} | Loss: {loss.item()}")
    
    return epoch_loss / total_steps
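
The training step shifts the target sequence by one position: the decoder sees all tokens except the last, and the labels are all tokens except the first, with padding positions replaced by `-100` so that the cross-entropy loss ignores them. A minimal sketch of that shift on a plain list, with a hypothetical helper name and made-up token ids:

```python
PAD_ID = 0      # T5's pad token id
IGNORE = -100   # label value skipped by PyTorch's cross-entropy loss

def shift_for_teacher_forcing(target_ids):
    """Mirror the slicing done in train(): inputs drop the last token, labels drop the first."""
    decoder_input = target_ids[:-1]
    labels = [IGNORE if t == PAD_ID else t for t in target_ids[1:]]
    return decoder_input, labels

# Hypothetical ids: three word pieces, end-of-sequence (1), then padding
target = [8, 15, 23, 1, 0, 0]
dec_in, labels = shift_for_teacher_forcing(target)
print(dec_in)   # [8, 15, 23, 1, 0]
print(labels)   # [15, 23, 1, -100, -100]
```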

### Validation Method

In [None]:
def evaluate(data_loader, model, tokenizer, device):
    model.eval()

    total_steps = len(data_loader)
    epoch_loss = 0

    with torch.no_grad():
        for idx, batch in enumerate(data_loader):
            ids = batch["source_ids"].to(device)
            mask = batch["source_mask"].to(device)

            target_ids = batch["target_ids"].to(device)
            
            y_ids = target_ids[:, :-1].contiguous()
            lm_labels = target_ids[:, 1:].clone().detach()
            lm_labels[target_ids[:, 1:] == tokenizer.pad_token_id] = -100

            outputs = model(
                input_ids=ids,
                attention_mask=mask,
                decoder_input_ids=y_ids,
                labels=lm_labels  # the `lm_labels` keyword is deprecated in favor of `labels`
            )

            loss = outputs[0]
            epoch_loss += loss.item()

            if idx%100 == 0:
                print(f"Val Step: {idx}/{total_steps} | Loss: {loss.item()}")
    
    return epoch_loss / total_steps

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs 
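
A quick check of the helper, repeated here so the snippet runs on its own. The duration is made up; for non-negative durations the result matches `divmod` on the whole number of seconds.

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# Hypothetical epoch that took 754.6 seconds
mins, secs = epoch_time(0.0, 754.6)
print(f"{mins}m {secs}s")        # 12m 34s
print(divmod(int(754.6), 60))    # (12, 34)
```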

### Training

In [None]:
best_valid_loss = float('inf')

for epoch in range(EPOCHS):
    start_time = time.time()
    train_loss = train(train_data_loader, model, TOKENIZER, optimizer, device)
    val_loss = evaluate(val_data_loader, model, TOKENIZER, device)
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if val_loss < best_valid_loss:
        best_valid_loss = val_loss
        torch.save(model.state_dict(), MODEL_PATH)
    print(f"Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s")
    print(f"\t Train Loss: {train_loss:.3f} | Train PPL: {np.exp(train_loss):5.4f}")
    print(f"\t Val Loss: {val_loss:.3f} | Val PPL: {np.exp(val_loss):5.4f}")

Step: 0/1806 | Loss: 10.286368370056152
Step: 100/1806 | Loss: 3.7034249305725098
Step: 200/1806 | Loss: 3.340522527694702
Step: 300/1806 | Loss: 3.012753486633301
Step: 400/1806 | Loss: 2.862734079360962
Step: 500/1806 | Loss: 2.7713520526885986
Step: 600/1806 | Loss: 2.6481432914733887
Step: 700/1806 | Loss: 1.7158317565917969
Step: 800/1806 | Loss: 3.0307559967041016
Step: 900/1806 | Loss: 3.471998691558838
Step: 1000/1806 | Loss: 2.417418956756592
Step: 1100/1806 | Loss: 2.720583915710449
Step: 1200/1806 | Loss: 3.022886276245117
Step: 1300/1806 | Loss: 2.584986448287964
Step: 1400/1806 | Loss: 3.4161250591278076
Step: 1500/1806 | Loss: 2.8636693954467773
Step: 1600/1806 | Loss: 2.862813711166382
Step: 1700/1806 | Loss: 2.383406162261963
Step: 1800/1806 | Loss: 2.6561291217803955
Val Step: 0/452 | Loss: 2.0754969120025635
Val Step: 100/452 | Loss: 2.256164312362671
Val Step: 200/452 | Loss: 2.4404520988464355
Val Step: 300/452 | Loss: 1.6999928951263428
Val Step: 400/452 | Loss: 2.
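
The `PPL` values printed per epoch are the exponential of the mean cross-entropy loss. A small sketch of that relationship, using the vocabulary size from the model printout (32128) as a reference point; the `perplexity` helper is only for illustration.

```python
import math

def perplexity(mean_loss):
    """Perplexity is exp(mean per-token cross-entropy, in nats)."""
    return math.exp(mean_loss)

VOCAB_SIZE = 32128  # from the embedding shape in the model printout

# A model that guesses uniformly over the vocabulary has loss log(V) and perplexity V
uniform_loss = math.log(VOCAB_SIZE)
print(round(uniform_loss, 2), round(perplexity(uniform_loss)))

# A perfect model has loss 0 and perplexity 1
print(perplexity(0.0))  # 1.0
```

Read this way, a perplexity of about 12 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 12 tokens at each step.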

In [None]:
model.load_state_dict(torch.load(MODEL_PATH, map_location=device))

<All keys matched successfully>

### Inference

In [None]:
def inference(model, news, tokenizer, device):
    model.eval()

    news = "summarize: " + news

    source = tokenizer.encode_plus(
        news,
        max_length=MAX_LEN,
        padding='max_length',  # replaces the deprecated pad_to_max_length=True
        return_attention_mask=True,
        truncation=True,
        return_tensors='pt')
    
    with torch.no_grad():
        ids = source["input_ids"].to(device)
        mask = source["attention_mask"].to(device)

        generated_ids = model.generate(
            input_ids=ids,
            attention_mask=mask,
            max_length=SUMMARY_LEN,
            num_beams=2,
            repetition_penalty=2.5,
            length_penalty=1.0,
            early_stopping=True
        )

        summary = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]

    return summary[0]
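
The `repetition_penalty=2.5` argument discourages the decoder from repeating tokens it has already produced. The sketch below shows the commonly described scheme, in which the score of each previously generated token is divided by the penalty if positive and multiplied by it if negative. It is an illustration under that assumption, not the library's code; the function name and logits are hypothetical.

```python
def apply_repetition_penalty(scores, generated_ids, penalty):
    """Return a copy of scores with already-generated tokens made less likely."""
    out = list(scores)
    for tok in set(generated_ids):
        s = out[tok]
        # Both branches push the score down: shrink positives, amplify negatives
        out[tok] = s * penalty if s < 0 else s / penalty
    return out

# Hypothetical logits over a 4-token vocabulary; tokens 0 and 1 were already generated
logits = [2.0, -1.0, 0.5, 3.0]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=2.5)
print(penalized)  # [0.8, -2.5, 0.5, 3.0]
```

With a penalty of 1.0 the scores are unchanged; larger values suppress repeats more strongly.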

In [None]:
news = """The Twitter accounts of Joe Biden, Elon Musk, Jeff Bezos, Kanye West were among the tens of high-profile accounts that were hijacked last night. These accounts were then used to spread bitcoin scam and fool people into donating money through a link.

A Twitter employee was seemingly the reason behind the hacking of high profile users that took place on Wednesday. It is still being investigated if the Twitter employee himself hacked the account or gave the tool to the hackers, a Twitter spokesperson told Motherboard.

So how did the hackers get access to the internal tool? "We used a rep that literally did all the work for us," one of the hacker sources told Motherboard. Another source said that they paid the Twitter insider.

An internal tool at Twitter was used to take over the high-profile accounts, screenshots obtained by Motherboard as well as sources revealed.

The hacker used the tool to reset the associated email addresses of affected accounts to make it more difficult for the owner to regain control, TechCrunch noted. The hacker then pushed a cryptocurrency scam that was noticed by everyone on Wednesday.

The tool was used on the Twitter panel to hack OG accounts or accounts that have a handle consisting of only one or two characters. The panel, whose screenshots were widely shared and later taken down by Twitter, showed if the targeted user's account has been suspended, is permanently suspended, or has protected status.

The panel was also used to post tweets related to cryptocurrency scams from the high profile accounts that blasted off on the platform.

Twitter also acknowledged that the hacks were a coordinated social engineering attack by people who successfully “targeted some of our employees with access to internal systems and tools.”

Screenshots of the panel being posted by users are being taken down as a violation of Twitter policies.

"As per our rules, we're taking action on any private, personal information shared in Tweets," said a Twitter spokesperson.

Some leading cryptocurrency sites were also compromised on Wednesday. Cryptocurrency platforms like Coinbase and Gemini falsely “announced” they had partnered up with an organization called CryptoForHealth, through their Twitter accounts. They claimed that the organisation was going to provide people with bitcoin as long as they sent some to an address first.

Other prominent Twitter accounts that were hacked were that of President Barack Obama, Kim Kardashian West, Warren Buffett, Jeff Bezos, and Mike Bloomberg. Official accounts of Uber and Apple tweeted out a post that was a spam message. The spam message directed readers to invest bitcoin in the wallet address that was provided in the tweets and claimed that they would get double the money they spend."""

In [None]:
news = " ".join(news.split()).strip()
summary = inference(model, news, TOKENIZER, device)
print(summary)

a Twitter employee was apparently the reason behind the hacking of high profile accounts that took place on Wednesday night. The hacker then used an internal tool to take over the accounts, according to a report in the New York Times.The account was also used to post tweets related to cryptocurrency scams from the high profile accounts that were hacked by the hacker, who claimed to have been paid by the company?s insiders.Also read: #BitcoinScamTomorrow @twitter_tomorrow@twitter.com/b9f8d0xYZQXyzjqJuJuJuJuJuJ
