# Drive Mount, Installations, Import

## Import Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Installations

In [2]:
!pip install transformers
!git clone https://github.com/chriskhanhtran/bert-extractive-summarization.git
!pip install boto3
!pip install rouge
%cd bert-extractive-summarization

Collecting transformers
  Downloading transformers-4.10.0-py3-none-any.whl (2.8 MB)
[K     |████████████████████████████████| 2.8 MB 5.4 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 35.3 MB/s 
[?25hCollecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
[K     |████████████████████████████████| 636 kB 46.3 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 39.8 MB/s 
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.16-py3-none-any.whl (50 kB)
[K     |████████████████████████████████| 50 kB 6.4 MB/s 
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3

## Imports

In [3]:
import torch
from torch import nn
from torch.optim import Adam
import numpy as np  
import pandas as pd
from sklearn.metrics import jaccard_score
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
from ext_sum import test
import random
from models.encoder import ExtTransformerEncoder
from transformers import BertTokenizer, BertModel, AutoModel
from plotly.figure_factory import create_table
from sklearn.model_selection import train_test_split
from rouge import Rouge

device = 'cuda' if torch.cuda.is_available() else 'cpu'

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.




## Data EDA and category selection

### Data Load

In [4]:
df = pd.read_csv('/content/drive/MyDrive/Masters First Year/NLP/Final Proj/Data/Hebrew articles/hebrew_news.csv')

In [5]:
df.head(1)

Unnamed: 0,articleBody,description,headline,keywords,title,type,category,source
0,זה היה באוויר כבר יותר מחודשיים. העצמאות שבה ב...,העצמאות שבה ניהל שלמה רודב את דירקטוריון בזק ה...,סרצ'לייט רצתה להוכיח שליטה והראתה לרודב את הדלת,"['בזק', 'רודב', ""סרצ'לייט""]",סרצ'לייט רצתה להוכיח שליטה והראתה לרודב את הדלת,article,שוק ההון,1


### Data Random Sample
We see from the sample that the text description provides a better summerization of the texts, while the title mostly acts as a click-bait. Therefore we will use the description as our gold-label.

In [6]:
sample = df.sample(1, random_state=42)
print('Sample Title:', sample['title'].values)
print('Sample Description:', sample['description'].values)
print('Sample Text:' , list(sample['articleBody'].values))

Sample Title: ['הערב זה מתחיל: קווי התחבורה הציבורית שפועלים בשבת בגוש דן']
Sample Description: ['עם כניסת השבת החלו לפעול 6 קווים בתל אביב-יפו, גבעתיים, רמת השרון וקריית אונו. בכל חצי שעה יעברו האוטובוסים במאות תחנות שמוקמו בנקודות מרכזיות בערים המשתתפות במיזם. רבים כבר החלו לעשות שימוש בשירות. מקווים שלא יבוטל כיצד נראית מפת הקווים? כל הפרטים']
Sample Text: ['מיזם התחבורה הציבורית בשבת בגוש דן החל את פעילותו הערב (שישי). במהלך סוף השבוע יצאו לדרכם לראשונה חמישה קווי אוטובוס שייסעו בערים מרכזיות באזור, בהן תל אביב-יפו, גבעתיים, רמת השרון וקריית אונו. האוטובוסים יעברו על פני מאות תחנות שמוקמו בנקודות מרכזיות בערים השונות, ויחלפו על פניהן בכל חצי שעה. הקווים שהחלו לפעול הערב יסיימו את נסיעתם ב-01:00 או 03:00 בלילה, וישובו לפעילות מחר החל מ-09:00 בבוקר ועד לצאת השבת. בשלב זה יינתן השירות בחינם, ובהמשך ייקבע המחיר וכיצד ייגבה.כבר בשעות המוקדמות שלאחר כניסת השבת נראו נוסעים העושים שימוש לראשונה בשירות. בתחנה ברחוב כצנלסון בגבעתיים נצפו כמה צעירים שהחליטו לנצל את האוטובוסים החדש ולהימנע מהו

### Understand the categories in the DataSet
we would like to select a single category from this list, and train our model on it. This is done due to lack of resources, long training time and the hypothesis that focusing on a single category will yield better results.

In [7]:
cate_df = pd.DataFrame(pd.Series(df['category']).value_counts())
print('These are the top labeled categories in our dataset:')
cate_df.head(10)

These are the top labeled categories in our dataset:


Unnamed: 0,category
בארץ,2578
טכנולוגי,1725
שוק ההון,1302
ועידות,940
"נדל""ן",584
Health,577
עולם,541
PnaiPlus,367
Economy,367
פרסום ושיווק,343


### Select the real-estate category

In [8]:
re_df = df[df["category"] == 'נדל"ן']
re_texts = re_df['articleBody']
re_labels = re_df['description']

# Preprocess

## Init Hebrt Model, inc. weights

In [9]:
HeBERT = 'avichr/heBERT'
hebert = AutoModel.from_pretrained(HeBERT, force_download=True)
tokenizer = BertTokenizer.from_pretrained(HeBERT, do_lower_case=True)
## add the 'SEP' and 'CLS' tokens, will later be used in modifiying the texts to the summarzaion model
sep_vid = tokenizer.vocab["[SEP]"]
cls_vid = tokenizer.vocab["[CLS]"]

Downloading:   0%|          | 0.00/505 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/438M [00:00<?, ?B/s]

Some weights of the model checkpoint at avichr/heBERT were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at avichr/heBERT and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should prob

Downloading:   0%|          | 0.00/299k [00:00<?, ?B/s]

In [10]:
MAX_POS = 1024

## Preprocess methods

In [11]:
def preprocess(source, max_pos=MAX_POS):
    """
    a simple preprocess function
    - Remove \n
    - Sentence Tokenize
    - Add [SEP] [CLS] as sentence boundary
    """
    source = source[:max_pos]
    raw_text = str(source).replace("\n", " ")
    sents = sent_tokenize(raw_text)
    processed_text = "[CLS] [SEP]".join(sents)
    return processed_text

def preprocess_with_label(source, label, max_pos=MAX_POS):
    """
    preprocess with a twist - insert gold standatrd summaraztion text into the source
    - Remove \n
    - Sentence Tokenize
    - Add [SEP] [CLS] as sentence boundary
    """
    raw_text = str(source).replace("\n", " ")
    sents = sent_tokenize(raw_text[:min(max_pos-len(label), len(source))])
    idx = 0
    if len(sents) > 0:
      idx = random.choice(range(len(sents)))
      sents = sents[:idx] + [label + '.'] + sents[idx:]
      processed_text = "[CLS] [SEP]".join(sents)
      vec = [0] * len(sents)
      vec[idx] = 1
      assert(len(vec) == len(sents))
    else:
        processed_text, vec = "", []
    return processed_text, torch.Tensor(vec).to(device)

In [12]:
  def load_text(processed_text, max_pos, device):

    def _process_src(raw):
        raw = raw.strip().lower()
        raw = raw.replace("[cls]", "[CLS]").replace("[sep]", "[SEP]")
        src_subtokens = tokenizer.tokenize(raw)
        src_subtokens = ["[CLS]"] + src_subtokens + ["[SEP]"]
        src_subtoken_idxs = tokenizer.convert_tokens_to_ids(src_subtokens)
        src_subtoken_idxs = src_subtoken_idxs[:-1][:max_pos]
        src_subtoken_idxs[-1] = sep_vid
        _segs = [-1] + [i for i, t in enumerate(src_subtoken_idxs) if t == sep_vid]
        segs = [_segs[i] - _segs[i - 1] for i in range(1, len(_segs))]
        
        segments_ids = []
        segs = segs[:max_pos]
        for i, s in enumerate(segs):
            if i % 2 == 0:
                segments_ids += s * [0]
            else:
                segments_ids += s * [1]

        src = torch.tensor(src_subtoken_idxs)[None, :].to(device)
        mask_src = (1 - (src == 0).int()).to(device)
        cls_ids = [[i for i, t in enumerate(src_subtoken_idxs) if t == cls_vid]]
        clss = torch.tensor(cls_ids).to(device)
        mask_cls = (1 -(clss == -1).int()).to(device)
        clss[clss == -1] = 0
        return src, mask_src, torch.Tensor(segments_ids).to(device), clss, mask_cls.to(device)

    src, mask_src, segments_ids, clss, mask_cls = _process_src(processed_text)
    segs = torch.tensor(segments_ids)[None, :].to(device)
    src_text = [[sent.replace("[SEP]", "").strip() for sent in processed_text.split("[CLS]")]]
    return src, mask_src, segs, clss, mask_cls, src_text

## Test Preprocess on a a text

In [13]:
processed_text_sample = preprocess(sample['articleBody'].values[0])
data_sample = load_text(processed_text_sample, MAX_POS, device=device)
print(data_sample)

(tensor([[    2, 17478,  7540,  6762,  3966, 14814,  5066,  2393,  1517, 13720,
          2870,    12,  5710,    13,    18,     2,     3,  2376,  3187,  3319,
          6175,  7439,  1764,  4209,  6514,  7572,  1868,  2308,  1592,  2671,
         12237, 15724,  3104,    16,  3666,  2117,  2130,    17,  3214,    16,
         17924,    16,  3312,  4578, 21431,  1511, 10988,    18,     2,     3,
         26672, 20270,  1532,  2398,  3668,  9128,  2822,  1031,  1539, 20397,
         15724, 12237,  5234,    16,  5137, 19968,  1532, 10723,  1057,  1794,
          3980,  3879,    18,     2,     3, 14064, 25683,  5008,  2870, 26472,
          1896,  1517,  1778,  5093,  1046,   198,    17,  4437,    30,  3164,
          1567,  3988,    30,  3164,  5954,    16,  2553,  8002,  8079,  6640,
          2393,   211,    17,  3906,    30,  3164,  5231,  2705,  3832,  5821,
            18,     2,     3,  3873,  1607, 20875,  3509,  8169,    16, 11506,
          9123,  1648,  5229, 10628, 19973,  1798, 


To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).



## Check and remove nans in our dataset

In [14]:
print('Number of nans in description column: ', len(re_df[re_df['description'].isna() == True]))
print('Number of nans in articleBody column: ', len(re_df[re_df['articleBody'].isna() == True]))

texts = re_df['articleBody']
labels = re_df['description']
### removing nans, if appliable
texts = re_texts.drop(re_texts[re_texts.isna()==True].index, axis=0)
labels = re_labels.drop(re_labels[re_labels.isna()==True].index, axis=0)
assert(len(texts) == len(labels))
print('Total samples :', len(texts))

Number of nans in description column:  0
Number of nans in articleBody column:  0
Total samples : 584


## PreProcess Data - using preprocess with label

In [15]:
pairs_orig = [(load_text(preprocess(text), max_pos=MAX_POS, device=device), label) for text, label in zip(re_texts, re_labels) if len(text) > 5]
processed_texts = [pair[0] for pair in pairs_orig]
labels_text = [pair[1] for pair in pairs_orig]


To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).



In [16]:
pairs_orig_j = [(load_text(preprocess(text), max_pos=MAX_POS, device=device), load_text(preprocess(label), max_pos=MAX_POS, device=device)) for text, label in zip(re_texts, re_labels) if len(text) > 5]
processed_texts_j = [pair[0] for pair in pairs_orig_j]
processed_labels_j = [pair[1] for pair in pairs_orig_j]


To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).



In [17]:
pairs = [preprocess_with_label(text, label) for text, label in zip(re_texts, re_labels)]
labels_vec = [pair[1] for pair in pairs if len(pair[0]) > 5]
processed_texts_with_label = [load_text(pair[0], max_pos=MAX_POS, device=device) for pair in pairs if len(pair[0]) > 5]
assert(len(labels_vec) == len(processed_texts_with_label))


To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).



# HeBERT Summarizer

## HeBert Summarizer obj and functions

In [18]:
class summarizer(nn.Module):
  def __init__(self, bert_model):
    super().__init__()
    self.bert = bert_model
    self.ext_layer = ExtTransformerEncoder(self.bert.config.hidden_size, d_ff=2048, heads=8, dropout=0.2, num_inter_layers=2)

  def forward(self, src, segs, clss, mask_src, mask_cls):
    top_vec = self.bert(src, segs, mask_src).last_hidden_state
    sents_vec = top_vec[torch.arange(top_vec.size(0)).unsqueeze(1), clss]
    sents_vec = sents_vec * mask_cls[:, :, None].float()
    sent_scores = self.ext_layer(sents_vec, mask_cls).squeeze(-1)
    return sent_scores, mask_cls
  

def train(model, optimizer, input_data, labels, epochs=5):
  criterion = torch.nn.BCELoss(reduction='sum')
  model.train()
  for epoch in range(epochs):
    loss = 0
    for i, data in enumerate(input_data):
        try:
          optimizer.zero_grad()
          src, mask, segs, clss, mask_cls, _ = data
          sent_scores, mask = model(src, segs, clss, mask, mask_cls)
          cur_loss = criterion(sent_scores , labels[i].view(1,-1))
          cur_loss.backward()
          optimizer.step()
          loss += cur_loss.item()
        except Exception as e:
          # print(e)
          continue

    print("epoch:", epoch, "\t loss:", loss/len(input_data))
  return loss

def predict(model, input_data, max_length, block_trigram=True):
    def _get_ngrams(n, text):
        ngram_set = set()
        text_length = len(text)
        max_index_ngram_start = text_length - n
        for i in range(max_index_ngram_start + 1):
            ngram_set.add(tuple(text[i : i + n]))
        return ngram_set

    def _block_tri(c, p):
        tri_c = _get_ngrams(3, c.split())
        for s in p:
            tri_s = _get_ngrams(3, s.split())
            if len(tri_c.intersection(tri_s)) > 0:
                return True
        return False

    with torch.no_grad():
        src, mask, segs, clss, mask_cls, src_str = input_data
        sent_scores, mask = model(src, segs, clss, mask, mask_cls)
        sent_scores = sent_scores + mask.float()
        sent_scores = sent_scores.cpu().data.numpy()
        selected_ids = np.argsort(-sent_scores, 1)

        pred = []
        for i, idx in enumerate(selected_ids):
            _pred = []
            if len(src_str[i]) == 0:
                continue
            for j in selected_ids[i][: len(src_str[i])]:
                if j >= len(src_str[i]):
                    continue
                candidate = src_str[i][j].strip()
                if block_trigram:
                    if not _block_tri(candidate, _pred):
                        _pred.append(candidate)
                else:
                    _pred.append(candidate)

                if len(_pred) == max_length:
                    break

            _pred = " ".join(_pred)
            pred.append(_pred)
    ret = ''
    for i in range(len(pred)):
      ret += pred[i].strip() + "\n"
    return ret

def get_num_segs(seg):
  c = 1
  cur = 0
  for i in seg:
    if i != cur:
      c += 1
      cur = i
  return c

def get_label_by_rouge(sentences, title, num_segs):
  rouge_scorer = Rouge()
  scores = []
  for sentence in sentences:
    if len(scores) == num_segs: # in case text exceeded max_pos
      break
    try:
      scores.append(rouge_scorer.get_scores(sentence, title)[0]['rouge-l']['f']) # use f1 score of ROUGE-L
    except Exception as e:
      scores.append(0.0) # sentence is "." or something similar
  
  label = [0] * len(scores)
  label[np.argmax(scores)] = 1
  return torch.FloatTensor(label).to(device)

def get_label_for_sentence_by_jaccard(sentences, desc):
  mask = sentences[2].view(-1)
  tokens = sentences[0].view(-1)
  scores = []
  last_change = 0
  for i in range(len(mask)-1):
    if mask[i] != mask[i+1]:
      scores.append(len(np.intersect1d(tokens[last_change:i+1].cpu().numpy(), desc.cpu().numpy())))
      last_change = i+1
  
  scores.append(len(np.intersect1d(tokens[last_change:].cpu().numpy(), desc.cpu().numpy())))
  label = [0] * len(scores)
  label[np.argmax(scores)] = 1

  return torch.FloatTensor(label).to(device)

def eval(model, data, labels):
  rouge_scorer = Rouge()
  right = 0
  non_accurate = 0
  score = 0
  with torch.no_grad():
    for i, text in enumerate(data):
      try:
        src, mask, segs, clss, mask_cls, orig_text = text
        sent_scores, mask = model(src, segs, clss, mask, mask_cls)
        pred = torch.argmax(sent_scores).item()
        label = torch.argmax(labels[i]).item()
        if pred == label:
          right += 1
        else:
          non_accurate += 1
        score += rouge_scorer.get_scores(orig_text[0][pred], orig_text[0][label])[0]['rouge-l']['f']
      except:
        continue
  return right, non_accurate, score/(right+non_accurate)

def print_eval_results(model, data, labels):
  right , non_accurate, f1rouge = eval(model, data, labels)
  print("Total accuracy: {}, F1-Rouge-L score {}".format(right/ (right + non_accurate), f1rouge))

In [19]:
def load_saved_hebert_model(path, train=True):
  model = summarizer(hebert)
  optimizer = Adam(model.parameters(), lr=0.001, eps=1e-9)
  checkpoint = torch.load(path)
  model.load_state_dict(checkpoint['model_state_dict'])
  optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
  epoch = checkpoint['epoch']
  loss = checkpoint['loss']

  if train:
    model.train()
  else:
    model.eval()

  return model, optimizer

# Experiments

We present 4 Experiments over our dataset, in an attempt to achieve the best summarization.

## Experiment #0 - BaseLine
Testing the HeBERT ability to summarize text following the BERTSUM architecture changes without any fine-tuning on our dataset.<br>
The label for each article, which is the sentence that best summarizes the text, will be the sentence from within the body of the article that is closest to the description of the article using ROUGE-L F1 score.


In [20]:
hebert_sum = summarizer(hebert)
hebert_sum = hebert_sum.to(device)
rouge_labels_all = [get_label_by_rouge(processed_texts[i][-1][0], labels_text[i], get_num_segs(processed_texts[i][2][0])) for i in range(len(processed_texts))]
print_eval_results(hebert_sum, processed_texts, rouge_labels_all)


masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  /pytorch/aten/src/ATen/native/cuda/Indexing.cu:937.)



Total accuracy: 0.11492281303602059, F1-Rouge-L score 0.15921295228656862


## Experiment #1

The label for training is the same as the label for test.<br>
The label for each article, which is the sentence that best summarizes the text, will be the sentence from within the body of the article that is closest to the description of the article using ROUGE-L F1 score.


In [21]:
hebert_sum_rouge = summarizer(hebert)
hebert_sum_rouge = hebert_sum_rouge.to(device)
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(processed_texts, labels_text, range(len(labels_text)), test_size=0.1, random_state=10)
optimizer = Adam(hebert_sum_rouge.parameters(), lr=0.001, eps=1e-9, betas=[0.9, 0.999])
rouge_labels_train = [get_label_by_rouge(X_train[i][-1][0], y_train[i], get_num_segs(X_train[i][2][0])) for i in range(len(X_train))]
epochs = 20
final_loss = train(hebert_sum_rouge, optimizer, X_train, rouge_labels_train, epochs=epochs)
# torch.save({
#             'epoch': epochs,
#             'model_state_dict': hebert_sum_rouge.state_dict(),
#             'optimizer_state_dict': optimizer.state_dict(),
#             'loss': final_loss/len(X_train)
#             },f"/content/drive/MyDrive/Masters First Year/NLP/Final Proj/rouge_model_{epochs}.tar")

print('Train eval results:')
print_eval_results(hebert_sum_rouge, X_train, rouge_labels_train)

print('Test eval results:')
rouge_labels_test = [get_label_by_rouge(X_test[i][-1][0], y_test[i], get_num_segs(X_test[i][2][0])) for i in range(len(X_test))]
print_eval_results(hebert_sum_rouge, X_test, rouge_labels_test)


masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  /pytorch/aten/src/ATen/native/cuda/Indexing.cu:937.)



epoch: 0 	 loss: 3.4168450595768354
epoch: 1 	 loss: 3.1474122282202917
epoch: 2 	 loss: 3.163495206650887
epoch: 3 	 loss: 3.1464247817301567
epoch: 4 	 loss: 3.13647196493076
epoch: 5 	 loss: 3.1456618090622297
epoch: 6 	 loss: 3.1540625286466293
epoch: 7 	 loss: 3.164990616663722
epoch: 8 	 loss: 3.1621402752308447
epoch: 9 	 loss: 3.1437080557109747
epoch: 10 	 loss: 3.1614237031863848
epoch: 11 	 loss: 3.133929811361182
epoch: 12 	 loss: 3.137342304218816
epoch: 13 	 loss: 3.146078703057675
epoch: 14 	 loss: 3.1406930753292928
epoch: 15 	 loss: 3.1375304738073857
epoch: 16 	 loss: 3.122884641166862
epoch: 17 	 loss: 3.113286261340134
epoch: 18 	 loss: 3.118809333511891
epoch: 19 	 loss: 3.1263116229581467
Train eval results:
Total accuracy: 0.14694656488549618, F1-Rouge-L score 0.1954740814258561
Test eval results:
Total accuracy: 0.0847457627118644, F1-Rouge-L score 0.1390711691019701


## Experiment #2

Inserting the description into the article body's text while training, then using the location of the description as the label. <br>
Test label is as in the previous experiment.


In [22]:
hebert_sum_desc_in = summarizer(hebert)
hebert_sum_desc_in = hebert_sum_desc_in.to(device)
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(processed_texts_with_label, labels_vec, range(len(labels_vec)), test_size=0.1, random_state=10)
optimizer = Adam(hebert_sum_desc_in.parameters(), lr=0.001, eps=1e-9, betas=[0.9, 0.999])
epochs = 20
final_loss = train(hebert_sum_desc_in, optimizer, X_train, y_train, epochs=epochs)
# torch.save({
#             'epoch': epochs,
#             'model_state_dict': hebert_sum_desc_in.state_dict(),
#             'optimizer_state_dict': optimizer.state_dict(),
#             'loss': final_loss/len(X_train)
#             }, f"/content/drive/MyDrive/Masters First Year/NLP/Final Proj/rouge_model_with_label_insert_{epochs}.tar")

print('Train eval results:')
print_eval_results(hebert_sum_desc_in, X_train, y_train)
test_texts_label = re_df.iloc[idx_test]
X_test = np.array(processed_texts)[idx_test]
rouge_labels = [get_label_by_rouge(X_test[i][-1][0], test_texts_label.iloc[i], get_num_segs(X_test[i][2][0])) for i in range(len(X_test))]
print('Test eval results:')
print_eval_results(hebert_sum_desc_in, X_test, rouge_labels)


masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  /pytorch/aten/src/ATen/native/cuda/Indexing.cu:937.)



epoch: 0 	 loss: 3.2561467467373566
epoch: 1 	 loss: 3.151883061150558
epoch: 2 	 loss: 3.1402907580819748
epoch: 3 	 loss: 3.128541143340919
epoch: 4 	 loss: 3.152829005745531
epoch: 5 	 loss: 3.1412915598800164
epoch: 6 	 loss: 3.1263063581845234
epoch: 7 	 loss: 3.1197508983029665
epoch: 8 	 loss: 3.121711722084584
epoch: 9 	 loss: 3.119796545450924
epoch: 10 	 loss: 3.106259200864166
epoch: 11 	 loss: 3.0992170481281427
epoch: 12 	 loss: 3.1071566848354486
epoch: 13 	 loss: 3.1218973916905526
epoch: 14 	 loss: 3.1084269214677445
epoch: 15 	 loss: 3.117094032864534
epoch: 16 	 loss: 3.1049534967382444
epoch: 17 	 loss: 3.0968286256298763
epoch: 18 	 loss: 3.0966840622989276
epoch: 19 	 loss: 3.1468508266310655
Train eval results:
Total accuracy: 0.14885496183206107, F1-Rouge-L score 0.22735528639975316



Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray



Test eval results:
Total accuracy: 0.1864406779661017, F1-Rouge-L score 0.23259662468147052


<br>
The label for each article, which is the sentence that best summarizes the text, will be the sentence from within the body of the article that is closest to the description of the article using Jaccard similarity score.

In [23]:
hebert_sum_jaccard = summarizer(hebert)
hebert_sum_jaccard = hebert_sum_jaccard.to(device)
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(processed_texts_j, processed_labels_j, range(len(processed_labels_j)), test_size=0.1, random_state= 10)
optimizer = Adam(hebert_sum_jaccard.parameters(), lr=0.001, eps=1e-9, betas=[0.9, 0.999])
jaccard_labels_train = [get_label_for_sentence_by_jaccard(X_train[i], y_train[i][0]) for i in range(len(y_train))]
epochs = 20
final_loss = train(hebert_sum_jaccard, optimizer, X_train, jaccard_labels_train, epochs=epochs)
# torch.save({
#             'epoch': epochs,
#             'model_state_dict': hebert_sum_jaccard.state_dict(),
#             'optimizer_state_dict': optimizer.state_dict(),
#             'loss': final_loss/len(X_train)
#             },f"/content/drive/MyDrive/Masters First Year/NLP/Final Proj/jaccard_model_{epochs}.tar")

print('Train eval results:')
print_eval_results(hebert_sum_jaccard , X_train, jaccard_labels_train)

print('Test eval results to jaccard labels:')
jaccard_labels_test = [get_label_for_sentence_by_jaccard(X_test[i], y_test[i][0]) for i in range(len(X_test))]
print_eval_results(hebert_sum_jaccard, X_test, jaccard_labels_test)

rouge_labels = [get_label_by_rouge(X_test[i][-1][0], y_test[i], get_num_segs(X_test[i][2][0])) for i in range(len(X_test))]
print('Test eval results to rouge labels:')
print_eval_results(hebert_sum_jaccard, X_test, rouge_labels)


masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  /pytorch/aten/src/ATen/native/cuda/Indexing.cu:937.)



epoch: 0 	 loss: 3.234019856416542
epoch: 1 	 loss: 3.1352593239027127
epoch: 2 	 loss: 3.1588123822940215
epoch: 3 	 loss: 3.1469533675499544
epoch: 4 	 loss: 3.1238897169819313
epoch: 5 	 loss: 3.1157209477351824
epoch: 6 	 loss: 3.104453434243457
epoch: 7 	 loss: 3.142199685555378
epoch: 8 	 loss: 3.1298101093932873
epoch: 9 	 loss: 3.146158279808423
epoch: 10 	 loss: 3.135597749975801
epoch: 11 	 loss: 3.11209923188195
epoch: 12 	 loss: 3.1449480682383966
epoch: 13 	 loss: 3.105233308468156
epoch: 14 	 loss: 3.1291236258645094
epoch: 15 	 loss: 3.1259208990417364
epoch: 16 	 loss: 3.144496167434081
epoch: 17 	 loss: 3.152257397884631
epoch: 18 	 loss: 3.1363814391252647
epoch: 19 	 loss: 3.1192359410169472
Train eval results:
Total accuracy: 0.13358778625954199, F1-Rouge-L score 0.17879564972524917
Test eval results to jaccard labels:
Total accuracy: 0.1864406779661017, F1-Rouge-L score 0.2366187314767476
Test eval results to rouge labels:
Total accuracy: 0.15254237288135594, F1-Ro

In [24]:
print("orig text:\n", "\n".join(processed_texts[0][-1][0]))
label_by_rouge = get_label_by_rouge(processed_texts[0][-1][0], labels_text[0], get_num_segs(processed_texts[0][2][0]))
print("label by rouge to description:\n", processed_texts[0][-1][0][torch.argmax(label_by_rouge).item()])
print("orig label:\n", labels_text[0])
rouge_scorer = Rouge()
rouge_scorer.get_scores(labels_text[0], processed_texts[0][-1][0][torch.argmax(label_by_rouge).item()])[0]['rouge-l']['f']

orig text:
 זוכי תוכנית מחיר למשתכן ברחוב דבורה עומר ברעננה גילו לאחרונה כי בפרויקט שלהם תוצמד רק חניה אחת לדירה, בעוד יתר החניות יישארו בידי הקבלן, שלטענתם מעוניין למכור אותן בשוק החופשי.
הפרויקט נמצא בהליכי קידום וטרם התקבל לו היתר בנייה.
הוא כולל חמישה בניינים בני תשע קומות, ובסך הכל 147 דירות, מהן 27 יימכרו בשוק החופשי והיתר במסגרת מחיר למשתכן.
בנוסף לכך הפרויקט כולל שטחי מסחר.
המחיר לזכאים עומד על כ־14 אלף שקל למ”ר, וטווח המחירים נע בין כ־1.2 מיליון שקל ל־2 מיליון שקל לדירת מחיר למשתכן.
לפי התב"ע, לפרויקט יש 298 חניות תת־קרקעיות.
לכל אחת מהדירות שיימכרו בשוק החופשי יוצמדו שתי חניות, ובסך הכל 54 חניות, בעוד ליתר 120 הדירות תוצמד רק חניה אחת.
כ־20 חניות יוצמדו לשטחי המסחר.
מכל זאת יוצא ש־104 חניות נותרות ללא הצמדה.
לטענת הרוכשים, הקבלן כמיל שגראווי מחברת שגראווי SBC אמר להם כי בכוונתו למכור את החניות בשוק החופשי, אף שהמפרט של תוכנית מחיר למשתכן אוסר על כך.
לפי המפרט, שהופץ על ידי משרד השיכון בדצמבר 2017, "כל החניות המתוכננות בתחום המגרשים יוצמדו לדירות המתוכננות במגרשים.
למען הסר ספ

0.16666666167534736

In [25]:
print(predict(hebert_sum, processed_texts[0], 1, True))
print(labels_text[0])

בנוסף לכך הפרויקט כולל שטחי מסחר.

קבלן ברעננה תכנן לפרויקט 298 חניות, ששליש מהן הוא מתכוון למכור בשוק החופשי. המפרט של משרד הבינוי אוסר זאת, אך העירייה מתירה לו



masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  /pytorch/aten/src/ATen/native/cuda/Indexing.cu:937.)



In [26]:
src, mask, segs, clss, mask_cls, orig_text = processed_texts[0]
sent_scores, mask = hebert_sum_desc_in(src, segs, clss, mask, mask_cls)
pred = torch.argmax(sent_scores).item()
orig_text[0][pred]


masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  /pytorch/aten/src/ATen/native/cuda/Indexing.cu:937.)



'המחיר לזכאים עומד על כ־14 אלף שקל למ”ר, וטווח המחירים נע בין כ־1.2 מיליון שקל ל־2 מיליון שקל לדירת מחיר למשתכן.'

In [27]:
sent_scores, mask

(tensor([[0.1412, 0.1117, 0.1366, 0.1347, 0.2303, 0.1197, 0.1379, 0.1801, 0.1946,
          0.2067, 0.2278, 0.1273]], device='cuda:0', grad_fn=<SqueezeBackward1>),
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0',
        dtype=torch.int32))