# SummerTime Midway Showcase

#### This notebook shows part of the current functionality of SummerTime, a summarization library.

Note: This is not the production version of the library and more modules are being added, including but not limited to pip installation, more models, more datasets, etc.

## Installation
For now, the library is installed by cloning the GitHub repository; `pip install` support is planned.

In [1]:
## Uncomment to clone git repo if not already done so
## Switch to the SummerTime directory
## Switch to the relevant git branch

# !git clone https://github.com/Yale-LILY/SummerTime.git
# %cd SummerTime/
# !git checkout origin/troyfeng116/integration-tests

/data/lily/mmm274/SummerTime/notebook/SummerTime
HEAD is now at 6b48c8b Merge branch 'main' into troyfeng116/integration-tests


In [1]:
!ls

build		  __init__.py		SummerTime.egg-info
dataset		  model			SummerTime_midway_showcase.ipynb
dataset_test.py   pipeline		summertime_pkg
demo.ipynb	  pip_instructions.txt	summertime.py
dependencies.txt  README.md		tests
dist		  requirements.txt
evaluation	  setup.py


### Install dependencies for the library

In [3]:
!pip install -r requirements.txt







In [3]:
## Uncomment to restart runtime if prompted to do so in the previous cell; else ignore
## Restart runtime to install modules
# import os
# os.kill(os.getpid(), 9)

In [None]:
## Uncomment to move back into the Summertime directory if restarted runtime
# %cd SummerTime/

In [4]:
## Finish setup

# Setup ROUGE
# `%env` sets the variable in the kernel process; `!export` only affects a short-lived subshell
%env ROUGE_HOME=/usr/local/lib/python3.7/dist-packages/summ_eval/ROUGE-1.5.5/
!pip install -U  git+https://github.com/bheinzerling/pyrouge.git

Collecting git+https://github.com/bheinzerling/pyrouge.git
  Cloning https://github.com/bheinzerling/pyrouge.git to /tmp/pip-req-build-u3ae59bd
Building wheels for collected packages: pyrouge
  Building wheel for pyrouge (setup.py) ... done
  Created wheel for pyrouge: filename=pyrouge-0.1.3-py3-none-any.whl size=191915 sha256=d1d8f935101ed24af7392cb647bb65198ea94f6265cc007b0e9a483cbd585e80
  Stored in directory: /tmp/pip-ephem-wheel-cache-2ul06_jw/wheels/33/46/ed/a3751151da9865df57f1333c182c8cb2ac2a7a419bbb5f1258
Successfully built pyrouge
Installing collected packages: pyrouge
  Attempting uninstall: pyrouge
    Found existing installation: pyrouge 0.1.3
    Uninstalling pyrouge-0.1.3:
      Successfully uninstalled pyrouge-0.1.3
Successfully installed pyrouge-0.1.3
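
Environment variables set inside a `!` shell command live only in that command's subshell. One way to make `ROUGE_HOME` visible to the kernel and to any process it launches is to set it from Python. The sketch below assumes the same path as the setup cell, which may differ on your machine.

```python
import os
import subprocess
import sys

# Assumed path, copied from the setup cell; adjust to your own installation.
ROUGE_HOME = "/usr/local/lib/python3.7/dist-packages/summ_eval/ROUGE-1.5.5/"

# Setting the variable on os.environ changes the current (kernel) process.
os.environ["ROUGE_HOME"] = ROUGE_HOME

# Child processes started afterwards inherit the value.
inherited = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['ROUGE_HOME'])"],
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()

print("kernel sees:", os.environ["ROUGE_HOME"])
print("child sees: ", inherited)
```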


In [5]:
# import modules for this notebook

from pprint import pprint
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/lily/mmm274/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Models

### Supported Models

SummerTime supports different models (*e.g.,* TextRank, BART, Longformer) as well as model wrappers for more complex summarization tasks (*e.g.,* JointModel for multi-doc summarization, BM25 retrieval for query-based summarization).

In [8]:
from model import SUPPORTED_SUMM_MODELS

pprint(SUPPORTED_SUMM_MODELS)

[<class 'model.single_doc.bart_model.BartModel'>,
 <class 'model.single_doc.lexrank_model.LexRankModel'>,
 <class 'model.single_doc.longformer_model.LongformerModel'>,
 <class 'model.single_doc.pegasus_model.PegasusModel'>,
 <class 'model.single_doc.textrank_model.TextRankModel'>,
 <class 'model.multi_doc.multi_doc_joint_model.MultiDocJointModel'>,
 <class 'model.multi_doc.multi_doc_separate_model.MultiDocSeparateModel'>,
 <class 'model.dialogue.hmnet_model.HMNetModel'>,
 <class 'model.query_based.tf_idf_model.TFIDFSummModel'>,
 <class 'model.query_based.bm25_model.BM25SummModel'>]
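
The module paths in the listing above encode the task each model targets (`single_doc`, `multi_doc`, `dialogue`, `query_based`). A minimal sketch of grouping the models by that path segment follows. It works on dotted names copied from the output rather than importing the library, and `group_by_task` is a hypothetical helper, not part of SummerTime.

```python
from collections import defaultdict

# Dotted class paths copied from the SUPPORTED_SUMM_MODELS output above.
MODEL_PATHS = [
    "model.single_doc.bart_model.BartModel",
    "model.single_doc.lexrank_model.LexRankModel",
    "model.single_doc.longformer_model.LongformerModel",
    "model.single_doc.pegasus_model.PegasusModel",
    "model.single_doc.textrank_model.TextRankModel",
    "model.multi_doc.multi_doc_joint_model.MultiDocJointModel",
    "model.multi_doc.multi_doc_separate_model.MultiDocSeparateModel",
    "model.dialogue.hmnet_model.HMNetModel",
    "model.query_based.tf_idf_model.TFIDFSummModel",
    "model.query_based.bm25_model.BM25SummModel",
]


def group_by_task(paths):
    """Group class names by the second path segment (the task package)."""
    groups = defaultdict(list)
    for path in paths:
        parts = path.split(".")
        task, class_name = parts[1], parts[-1]
        groups[task].append(class_name)
    return dict(groups)


groups = group_by_task(MODEL_PATHS)
for task, names in groups.items():
    print(f"{task}: {', '.join(names)}")
```

With the real classes, the same key could presumably be derived from `cls.__module__.split(".")[1]`.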


### Automatic Pipeline Assembly

### Model selection

In [9]:
import model

# Users can load a default summarization model
sample_model = model.summarizer()

In [10]:
from model import SUPPORTED_SUMM_MODELS, LexRankModel, PegasusModel

# Or a specific model
pegasus = PegasusModel()

In [11]:
# Users can easily access documentation to assist with model selection
sample_model.show_capability()

Pegasus is the default singe-document summarization model.
Pegasus is a abstractive, neural model for summarization. 
 #################### 
 Introduced in 2019, a large neural abstractive summarization model trained on web crawl and news data.
 Strengths: 
 - High accuracy 
 - Performs well on almost all kinds of non-literary written text 
 Weaknesses: 
 - High memory usage 
 Initialization arguments: 
 - `device = 'cpu'` specifies the device the model is stored on and uses for computation. Use `device='gpu'` to run on an Nvidia GPU.


### Inference

In [12]:
documents = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. 
    The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected 
    by the shutoffs which were expected to last through at least midday tomorrow."""
]

sample_model.summarize(documents)

["California's largest electricity provider has turned off power to hundreds of thousands of customers."]

# Datasets

### Datasets supported

SummerTime supports summarization datasets across different domains (*e.g.,* CNNDM - news article corpus, Samsum - dialogue corpus, QM-Sum - query-based dialogue corpus, MultiNews - multi-document corpus, ML-sum - multi-lingual corpus, PubMedQa - medical domain, Arxiv - science papers domain, among others).

In [13]:
from dataset import SUPPORTED_SUMM_DATASETS

pprint(SUPPORTED_SUMM_DATASETS)

[<class 'dataset.huggingface_datasets.CnndmDataset'>,
 <class 'dataset.huggingface_datasets.MultinewsDataset'>,
 <class 'dataset.huggingface_datasets.SamsumDataset'>,
 <class 'dataset.huggingface_datasets.XsumDataset'>,
 <class 'dataset.huggingface_datasets.PubmedqaDataset'>,
 <class 'dataset.huggingface_datasets.MlsumDataset'>,
 <class 'dataset.non_huggingface_datasets.ScisummnetDataset'>,
 <class 'dataset.non_huggingface_datasets.SummscreenDataset'>,
 <class 'dataset.non_huggingface_datasets.QMsumDataset'>,
 <class 'dataset.non_huggingface_datasets.ArxivDataset'>]


### Dataset Initialization

In [14]:
import dataset

cnn_dataset = dataset.CnndmDataset()

Reusing dataset cnn_dailymail (/home/lily/mmm274/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234)


In [15]:
# Data is loaded using a generator to save on space and time

data_instance = next(cnn_dataset.train_set)
print("\n")
pprint(data_instance.__str__())

  0%|          | 0/287113 [00:00<?, ?it/s]



("{'source': 'It\\'s official: U.S. President Barack Obama wants lawmakers to "
 'weigh in on whether to use military force in Syria. Obama sent a letter to '
 'the heads of the House and Senate on Saturday night, hours after announcing '
 'that he believes military action against Syrian targets is the right step to '
 'take over the alleged use of chemical weapons. The proposed legislation from '
 'Obama asks Congress to approve the use of military force "to deter, disrupt, '
 'prevent and degrade the potential for future uses of chemical weapons or '
 'other weapons of mass destruction." It\\\'s a step that is set to turn an '
 'international crisis into a fierce domestic political battle. There are key '
 'questions looming over the debate: What did U.N. weapons inspectors find in '
 'Syria? What happens if Congress votes no? And how will the Syrian government '
 'react? In a televised address from the White House Rose Garden earlier '
 'Saturday, the president said he would take 
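
Because the data is exposed through a generator, items are produced lazily and each item is yielded only once. The sketch below illustrates that general Python behaviour with a stand-in generator (`make_stream` is hypothetical). Whether `cnn_dataset.train_set` hands back the same generator on every access is not shown in this notebook, so treat this as an illustration of generators rather than of the library's internals.

```python
import itertools


def make_stream(n):
    """Stand-in for a dataset split: yields items lazily, one at a time."""
    for i in range(n):
        yield f"instance-{i}"


stream = make_stream(10)

first = next(stream)                            # consumes item 0
next_three = list(itertools.islice(stream, 3))  # consumes items 1, 2, 3
remaining = list(stream)                        # consumes everything left
exhausted = list(stream)                        # nothing left to yield

print(first)
print(next_three)
print(len(remaining), len(exhausted))
```

If a slice must start from the beginning again, a fresh generator is needed.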

### A non-neural model
Below we train an unsupervised non-neural summarizer with a subset of the cnn_dailymail dataset: 

In [16]:
import itertools

# Get a slice of the train set - first 5 instances
train_set = itertools.islice(cnn_dataset.train_set, 5)

corpus = [instance.source for instance in train_set]
pprint(corpus)

trad_model = LexRankModel(corpus)

['(CNN) -- Usain Bolt rounded off the world championships Sunday by claiming '
 "his third gold in Moscow as he anchored Jamaica to victory in the men's "
 '4x100m relay. The fastest man in the world charged clear of United States '
 'rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar '
 'Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. The U.S finished '
 'second in 37.56 seconds with Canada taking the bronze after Britain were '
 'disqualified for a faulty handover. The 26-year-old Bolt has now collected '
 'eight gold medals at world championships, equaling the record held by '
 'American trio Carl Lewis, Michael Johnson and Allyson Felix, not to mention '
 'the small matter of six Olympic titles. The relay triumph followed '
 'individual successes in the 100 and 200 meters in the Russian capital. "I\'m '
 "proud of myself and I'll continue to work to dominate for as long as "
 'possible," Bolt said, having previously expressed his intention to carry on '


In [17]:
# Inference
text = [next(cnn_dataset.test_set).source]
pprint(text)

summary = trad_model.summarize(text)
pprint(summary)


  0%|          | 0/11490 [00:00<?, ?it/s]

['(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. '
 'Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. '
 'He was 88. Best died in hospice in Hickory, North Carolina, of complications '
 'from pneumonia, said Steve Latshaw, a longtime friend and Hollywood '
 "colleague. Although he'd been a busy actor for decades in theater and in "
 'Hollywood, Best didn\'t become famous until 1979, when "The Dukes of '
 'Hazzard\'s" cornpone charms began beaming into millions of American homes '
 "almost every Friday night. For seven seasons, Best's Rosco P. Coltrane "
 'chased the moonshine-running Duke boys back and forth across the back roads '
 'of fictitious Hazzard County, Georgia, although his "hot pursuit" usually '
 'ended with him crashing his patrol car. Although Rosco was slow-witted and '
 'corrupt, Best gave him a childlike enthusiasm that got laughs and made him '
 'endearing. His character became known for his distinctive "kew-kew

In [18]:
# More about LexRank

trad_model.show_capability()

LexRank is a extractive, non-neural model for summarization. 
 #################### 
 Works by using a graph-based method to identify the most salient sentences in the document. 
Strengths: 
 - Fast with low memory usage 
 - Allows for control of summary length 
 Weaknesses: 
 - Not as accurate as neural methods. 
 Initialization arguments: 
 - `corpus`: Unlabelled corpus of documents. ` 
 - `summary_length`: sentence length of summaries 
 - `threshold`: Level of salience required for sentence to be included in summary.
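
To make the "graph-based method" concrete, here is a simplified LexRank-style sketch in plain Python. It is not SummerTime's implementation: it uses raw term-frequency cosine similarity instead of TF-IDF, and `similarity_threshold` here controls which sentence pairs get an edge in the graph, which may differ from how the library's `threshold` argument is used. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def lexrank_scores(sentences, similarity_threshold=0.1, damping=0.85, iterations=50):
    """Score sentences by PageRank-style power iteration over a thresholded similarity graph."""
    n = len(sentences)
    vectors = [Counter(tokenize(s)) for s in sentences]
    # Unweighted edge between two different sentences whose similarity clears the threshold.
    adj = [
        [1.0 if i != j and cosine(vectors[i], vectors[j]) >= similarity_threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    # Row-normalize; a sentence with no neighbours spreads its weight uniformly.
    for row in adj:
        total = sum(row)
        for j in range(n):
            row[j] = row[j] / total if total else 1.0 / n
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(scores[i] * adj[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


def summarize(sentences, summary_length=1, similarity_threshold=0.1):
    """Return the top-scoring sentences, kept in their original order."""
    scores = lexrank_scores(sentences, similarity_threshold)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:summary_length]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "The cat sat on the mat.",
    "The cat lay on the mat.",
    "A cat was on the mat today.",
    "Stock prices fell sharply.",
]
scores = lexrank_scores(sentences)
print([round(s, 3) for s in scores])
print(summarize(sentences, summary_length=2))
```

The unrelated fourth sentence has no edges to the others, so it receives the lowest score and is left out of the summary.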


A spaCy pipeline for TextRank (another non-neural extractive summarization model)

In [19]:
# TextRank model
textrank = model.TextRankModel()

In [20]:
textrank_summary = textrank.summarize(text[0:1])
pprint(textrank_summary)

["For seven seasons, Best's Rosco P. Coltrane chased the moonshine-running "
 'Duke boys back and forth across the back roads of fictitious Hazzard County, '
 'Georgia, although his "hot pursuit" usually ended with him crashing his '
 'patrol car.']


In [21]:
# More about TextRank
textrank.show_capability()

TextRank is a extractive, non-neural model for summarization. 
 #################### 
 A graphbased ranking model for text processing. Extractive sentence summarization. 
 Strengths: 
 - Fast with low memory usage 
 - Allows for control of summary length 
 Weaknesses: 
 - Not as accurate as neural methods.


In [22]:
# Longformer2Roberta
longformer = model.LongformerModel()

You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [23]:
long_article = """(CNN)James Holmes made his introduction to the world in a Colorado cinema filled with spectators watching a midnight showing of the new Batman movie, "The Dark Knight Rises," in June 2012. The moment became one of the deadliest shootings in U.S. history. Holmes is accused of opening fire on the crowd, killing 12 people and injuring or maiming 70 others in Aurora, a suburb of Denver. Holmes appeared like a comic book character: He resembled the Joker, with red-orange hair, similar to the late actor Heath Ledger\'s portrayal of the villain in an earlier Batman movie, authorities said. But Holmes was hardly a cartoon. Authorities said he wore body armor and carried several guns, including an AR-15 rifle, with lots of ammo. He also wore a gas mask. Holmes says he was insane at the time of the shootings, and that is his legal defense and court plea: not guilty by reason of insanity. Prosecutors aren\'t swayed and will seek the death penalty. Opening statements in his trial are scheduled to begin Monday. Holmes admits to the shootings but says he was suffering "a psychotic episode" at the time,  according to court papers filed in July 2013 by the state public defenders, Daniel King and Tamara A. Brady. Evidence "revealed thus far in the case supports the defense\'s position that Mr. Holmes suffers from a severe mental illness and was in the throes of a psychotic episode when he committed the acts that resulted in the tragic loss of life and injuries sustained by moviegoers on July 20, 2012," the public defenders wrote. Holmes no longer looks like a dazed Joker, as he did in his first appearance before a judge in 2012. He appeared dramatically different in January when jury selection began for his trial: 9,000 potential jurors were summoned for duty, described as one of the nation\'s largest jury calls. Holmes now has a cleaner look, with a mustache, button-down shirt and khaki pants. In January, he had a beard and eyeglasses. 
If this new image sounds like one of an academician, it may be because Holmes, now 27, once was one. Just before the shooting, Holmes was a doctoral student in neuroscience, and he was studying how the brain works, with his schooling funded by a U.S. government grant. Yet for all his learning, Holmes apparently lacked the capacity to command his own mind, according to the case against him. A jury will ultimately decide Holmes\' fate. That panel is made up of 12 jurors and 12 alternates. They are 19 women and five men, and almost all are white and middle-aged. The trial could last until autumn. When jury summonses were issued in January, each potential juror stood a 0.2% chance of being selected, District Attorney George Brauchler told the final jury this month. He described the approaching trial as "four to five months of a horrible roller coaster through the worst haunted house you can imagine." The jury will have to render verdicts on each of the 165 counts against Holmes, including murder and attempted murder charges. Meanwhile, victims and their relatives are challenging all media outlets "to stop the gratuitous use of the name and likeness of mass killers, thereby depriving violent individuals the media celebrity and media spotlight they so crave," the No Notoriety group says. They are joined by victims from eight other mass shootings in recent U.S. history. Raised in central coastal California and in San Diego, James Eagan Holmes is the son of a mathematician father noted for his work at the FICO firm that provides credit scores and a registered nurse mother, according to the U-T San Diego newspaper. Holmes also has a sister, Chris, a musician, who\'s five years younger, the newspaper said. His childhood classmates remember him as a clean-cut, bespectacled boy with an "exemplary" character who "never gave any trouble, and never got in trouble himself," The Salinas Californian reported. 
His family then moved down the California coast, where Holmes grew up in the San Diego-area neighborhood of Rancho Peñasquitos, which a neighbor described as "kind of like Mayberry," the San Diego newspaper said. Holmes attended Westview High School, which says its school district sits in "a primarily middle- to upper-middle-income residential community." There, Holmes ran cross-country, played soccer and later worked at a biotechnology internship at the Salk Institute and Miramar College, which attracts academically talented students. By then, his peers described him as standoffish and a bit of a wiseacre, the San Diego newspaper said. Holmes attended college fairly close to home, in a neighboring area known as Southern California\'s "inland empire" because it\'s more than an hour\'s drive from the coast, in a warm, low-desert climate. He entered the University of California, Riverside, in 2006 as a scholarship student. In 2008 he was a summer camp counselor for disadvantaged children, age 7 to 14, at Camp Max Straus, run by Jewish Big Brothers Big Sisters of Los Angeles. He graduated from UC Riverside in 2010 with the highest honors and a bachelor\'s degree in neuroscience. "Academically, he was at the top of the top," Chancellor Timothy P. White said. He seemed destined for even higher achievement. By 2011, he had enrolled as a doctoral student in the neuroscience program at the University of Colorado Anschutz Medical Campus in Aurora, the largest academic health center in the Rocky Mountain region. The doctoral in neuroscience program attended by Holmes focuses on how the brain works, with an emphasis on processing of information, behavior, learning and memory. Holmes was one of six pre-thesis Ph.D. students in the program who were awarded a neuroscience training grant from the National Institutes of Health. The grant rewards outstanding neuroscientists who will make major contributions to neurobiology. 
A syllabus that listed Holmes as a student at the medical school shows he was to have delivered a presentation about microRNA biomarkers. But Holmes struggled, and his own mental health took an ominous turn. In March 2012, he told a classmate he wanted to kill people, and that he would do so "when his life was over," court documents said. Holmes was "denied access to the school after June 12, 2012, after he made threats to a professor," according to court documents. About that time, Holmes was a patient of University of Colorado psychiatrist Lynne Fenton. Fenton was so concerned about Holmes\' behavior that she mentioned it to her colleagues, saying he could be a danger to others, CNN affiliate KMGH-TV reported, citing sources with knowledge of the investigation. Fenton\'s concerns surfaced in early June, sources told the Denver station. Holmes began to fantasize about killing "a lot of people" in early June, nearly six weeks before the shootings, the station reported, citing unidentified sources familiar with the investigation. Holmes\' psychiatrist contacted several members of a "behavioral evaluation and threat assessment" team to say Holmes could be a danger to others, the station reported. At issue was whether to order Holmes held for 72 hours to be evaluated by mental health professionals, the station reported. "Fenton made initial phone calls about engaging the BETA team" in "the first 10 days" of June, but it "never came together" because in the period Fenton was having conversations with team members, Holmes began the process of dropping out of school, a source told KMGH. Defense attorneys have rejected the prosecution\'s assertions that Holmes was barred from campus. Citing statements from the university, Holmes\' attorneys have argued that his access was revoked because that\'s normal procedure when a student drops enrollment. What caused this turn for the worse for Holmes has yet to be clearly detailed. 
In the months before the shooting, he bought four weapons and more than 6,000 rounds of ammunition, authorities said. Police said he also booby-trapped his third-floor apartment with explosives, but police weren\'t fooled. After Holmes was caught in the cinema parking lot immediately after the shooting, bomb technicians went to the apartment and neutralized the explosives. No one was injured at the apartment building. Nine minutes before Holmes went into the movie theater, he called a University of Colorado switchboard, public defender Brady has said in court. The number he called can be used to get in contact with faculty members during off hours, Brady said. Court documents have also revealed that investigators have obtained text messages that Holmes exchanged with someone before the shooting. That person was not named, and the content of the texts has not been made public. According to The New York Times, Holmes sent a text message to a fellow graduate student, a woman, about two weeks before the shooting. She asked if he had left Aurora yet, reported the newspaper, which didn\'t identify her. No, he had two months left on his lease, Holmes wrote back, according to the Times. He asked if she had heard of "dysphoric mania," a form of bipolar disorder marked by the highs of mania and the dark and sometimes paranoid delusions of major depression. The woman asked if the disorder could be managed with treatment. "It was," Holmes wrote her, according to the Times. But he warned she should stay away from him "because I am bad news," the newspaper reported. It was her last contact with Holmes. After the shooting, Holmes\' family issued a brief statement: "Our hearts go out to those who were involved in this tragedy and to the families and friends of those involved," they said, without giving any information about their son. Since then, prosecutors have refused to offer a plea deal to Holmes. For Holmes, "justice is death," said Brauchler, the district attorney. 
In December, Holmes\' parents, who will be attending the trial, issued another statement: They asked that their son\'s life be spared and that he be sent to an institution for mentally ill people for the rest of his life, if he\'s found not guilty by reason of insanity. "He is not a monster," Robert and Arlene Holmes wrote, saying the death penalty is "morally wrong, especially when the condemned is mentally ill." "He is a human being gripped by a severe mental illness," the parents said. The matter will be settled by the jury. CNN\'s Ana Cabrera and Sara Weisfeldt contributed to this report from Denver."""
pprint(long_article)

('(CNN)James Holmes made his introduction to the world in a Colorado cinema '
 'filled with spectators watching a midnight showing of the new Batman movie, '
 '"The Dark Knight Rises," in June 2012. The moment became one of the '
 'deadliest shootings in U.S. history. Holmes is accused of opening fire on '
 'the crowd, killing 12 people and injuring or maiming 70 others in Aurora, a '
 'suburb of Denver. Holmes appeared like a comic book character: He resembled '
 "the Joker, with red-orange hair, similar to the late actor Heath Ledger's "
 'portrayal of the villain in an earlier Batman movie, authorities said. But '
 'Holmes was hardly a cartoon. Authorities said he wore body armor and carried '
 'several guns, including an AR-15 rifle, with lots of ammo. He also wore a '
 'gas mask. Holmes says he was insane at the time of the shootings, and that '
 'is his legal defense and court plea: not guilty by reason of insanity. '
 "Prosecutors aren't swayed and will seek the death penalty. O

In [24]:
longformer_summary = longformer.summarize([long_article])
pprint(longformer_summary)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Longformer model: processing document of tensor([2124]) tokens
['James Holmes, 27, is accused of opening fire on a Colorado theater.\n'
 'He was a doctoral student at University of Colorado.\n'
 'Holmes says he was suffering "a psychotic episode" at the time of the '
 'shooting.\n'
 "Prosecutors won't say whether Holmes was barred from campus."]


In [25]:
longformer.show_capability()

Longformer is a abstractive, neural model for summarization. 
 #################### 
 A Longformer2Roberta model finetuned on CNN-DM dataset for summarization.

Strengths:
 - Correctly handles longer (> 2000 tokens) corpus.

Weaknesses:
 - Less accurate on contexts outside training domain.

Initialization arguments:
  - `corpus`: Unlabelled corpus of documents.
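
Since the listed strength is handling inputs longer than about 2000 tokens, a rough length check can help decide when a long-input model is worth its cost. The sketch below is only a heuristic under stated assumptions: it counts whitespace-separated words, which typically undercounts what a subword tokenizer would produce, and both `pick_model_name` and its labels are hypothetical.

```python
def approx_token_count(text):
    """Rough token estimate from whitespace splitting; real tokenizers usually give more tokens."""
    return len(text.split())


def pick_model_name(text, long_threshold=2000):
    """Return a label for the kind of model to try, based on approximate input length."""
    if approx_token_count(text) > long_threshold:
        return "longformer"
    return "default"


short_text = "PG&E scheduled the blackouts in response to forecasts for high winds."
long_text = "word " * 2500

print(pick_model_name(short_text))
print(pick_model_name(long_text))
```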



# Evaluation

### Supported Evaluation Metrics

SummerTime supports different evaluation metrics (*e.g.,* ROUGE, BLEU, BERTScore, METEOR).

In [26]:
from evaluation import SUPPORTED_EVALUATION_METRICS

pprint(SUPPORTED_EVALUATION_METRICS)

[<class 'evaluation.bertscore_metric.BertScore'>,
 <class 'evaluation.bleu_metric.Bleu'>,
 <class 'evaluation.rouge_metric.Rouge'>,
 <class 'evaluation.rougewe_metric.RougeWe'>,
 <class 'evaluation.meteor_metric.Meteor'>]
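
For intuition about what an overlap-based metric measures, here is a simplified ROUGE-1 computation in plain Python. It is a sketch, not the library's `Rouge` class: it skips stemming and the other preprocessing done by the official ROUGE toolkit, and `rouge_1` is a hypothetical helper.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def rouge_1(prediction, reference):
    """Unigram overlap precision, recall and F1 with clipped counts."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


rouge_scores = rouge_1("the cat sat", "the cat sat on the mat")
print(rouge_scores)
```

Here all three predicted tokens appear in the six-token reference, so precision is 1.0, recall is 0.5, and F1 is 2/3.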


In [27]:
from evaluation.base_metric import SummMetric
from evaluation import Rouge, RougeWe, BertScore

import itertools

# Initializes a bertscore metric object
metric = BertScore()

# Evaluates model on subset of cnn_dailymail
# Get a slice of the train set - first 5 instances
train_set = itertools.islice(cnn_dataset.train_set, 5)

corpus = [instance for instance in train_set]
pprint(corpus)

articles = [instance.source for instance in corpus]

summaries = sample_model.summarize(articles)
targets = [instance.summary for instance in corpus]



  0%|          | 6/287113 [00:55<740:45:34,  9.29s/it]

[<dataset.st_dataset.SummInstance object at 0x7f8f3fc39820>,
 <dataset.st_dataset.SummInstance object at 0x7f90901aec40>,
 <dataset.st_dataset.SummInstance object at 0x7f8f3fc398b0>,
 <dataset.st_dataset.SummInstance object at 0x7f8f3fc39880>,
 <dataset.st_dataset.SummInstance object at 0x7f8f3fc397c0>]


In [28]:
# Calculate ROUGE-WE (reassigns `metric`, replacing the BertScore object created above)
metric = RougeWe()
metric.evaluate(summaries, targets)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/lily/mmm274/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


{'rouge_we_3_f': 0.20183811597317614}

## More Model and dataset tests

In [None]:
## Uncomment if using Ziva Server
## Installs the pytorch version compatible with the CUDA version

# !pip install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cu111/torch_nightly.html -U

In [2]:
import torch
import sys

print('A', sys.version)
print('B', torch.__version__)
print('C', torch.cuda.is_available())
print('D', torch.backends.cudnn.enabled)
device = torch.device('cuda')
print('E', torch.cuda.get_device_properties(device))
print('F', torch.tensor([1.0, 2.0]).cuda())

A 3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0]
B 1.10.0.dev20210818+cu111
C True
D True
E _CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82)
F tensor([1., 2.], device='cuda:0')
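
The check above assumes a CUDA GPU is present; `torch.cuda.get_device_properties` and `.cuda()` fail on a machine without one. A more defensive variant, sketched below, falls back to the CPU and also tolerates PyTorch not being installed. `describe_device` is a hypothetical helper.

```python
def describe_device():
    """Report which compute device is usable without raising on CPU-only machines."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "device": "cpu"}

    cuda = torch.cuda.is_available()
    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": cuda,
        "device": "cuda" if cuda else "cpu",
    }
    if cuda:
        # Only query GPU details when a GPU is actually visible.
        info["gpu_name"] = torch.cuda.get_device_name(0)
    return info


device_info = describe_device()
print(device_info)
```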


In [10]:
import unittest

from model.base_model import SummModel
from model import SUPPORTED_SUMM_MODELS, LexRankModel, PegasusModel

from pipeline import assemble_model_pipeline

from evaluation.base_metric import SummMetric
from evaluation import SUPPORTED_EVALUATION_METRICS, Rouge, RougeWe

from dataset.st_dataset import SummInstance, SummDataset
from dataset import SUPPORTED_SUMM_DATASETS
from dataset.non_huggingface_datasets import ScisummnetDataset, SummscreenDataset, ArxivDataset
from dataset.huggingface_datasets import CnndmDataset, MlsumDataset, SamsumDataset

from tests.helpers import print_with_color, retrieve_random_test_instances

import random
import time
from typing import Dict, List, Union, Tuple
import sys

import nltk
nltk.download('stopwords')


class IntegrationTests(unittest.TestCase):
    
    def get_prediction(self, model: SummModel, dataset: SummDataset, test_instances: List[SummInstance]) -> Tuple[Union[List[str], List[List[str]]], Union[List[str], List[List[str]]]]:
        """
        Get summary prediction given model and dataset instances.

        :param SummModel `model`: Model for summarization task.
        :param SummDataset `dataset`: Dataset for summarization task.
        :param List[SummInstance] `test_instances`: Instances from `dataset` to summarize.
        :returns Tuple of (summary predictions, targets), with one entry per instance in `test_instances`.
        """

        src = [ins.source[0] for ins in test_instances] if isinstance(dataset, ScisummnetDataset) else [ins.source for ins in test_instances]
        tgt = [ins.summary for ins in test_instances]
        query = [ins.query for ins in test_instances] if dataset.is_query_based else None
        prediction = model.summarize(src, query)
        return prediction, tgt
    
    def get_eval_dict(self, metric: SummMetric, prediction: List[str], tgt: List[str]):
        """
        Run evaluation metric on summary prediction.

        :param SummMetric `metric`: Evaluation metric.
        :param List[str] `prediction`: Summary prediction instances.
        :param List[str] `tgt`: Target summaries from dataset.
        :returns Dictionary of scores produced by `metric.evaluate`.
        """
        score_dict = metric.evaluate(prediction, tgt)
        return score_dict

    def test_all(self):
        """
        Runs integration test on all compatible dataset + model + evaluation metric pipelines supported by SummerTime.
        """

        print_with_color("\nInitializing all evaluation metrics...", "35")
        evaluation_metrics = []
        for eval_cls in SUPPORTED_EVALUATION_METRICS:
            # # TODO: Temporarily skipping Rouge/RougeWE metrics to avoid local bug.
            # if eval_cls in [Rouge, RougeWe]:
            #     continue
            print(eval_cls)
            evaluation_metrics.append(eval_cls())

        print_with_color("\n\nBeginning integration tests...", "35")
        for dataset_cls in SUPPORTED_SUMM_DATASETS:
            # TODO: Temporarily skipping MLSumm (Gitlab: server-side login gating), Arxiv (size/time), and Samsum
            if dataset_cls in [MlsumDataset, ArxivDataset, SamsumDataset]:
                continue
            dataset = dataset_cls()
            if dataset.train_set is not None:
                dataset_instances = list(dataset.train_set)
                print(f"\n{dataset.dataset_name} has a training set of {len(dataset_instances)} examples")
                print_with_color(f"Initializing all matching model pipelines for {dataset.dataset_name} dataset...", "35")
                # matching_model_instances = assemble_model_pipeline(dataset_cls, list(filter(lambda m: m != PegasusModel, SUPPORTED_SUMM_MODELS)))
                matching_model_instances = assemble_model_pipeline(dataset_cls, SUPPORTED_SUMM_MODELS)
                for model, model_name in matching_model_instances:
                    test_instances = retrieve_random_test_instances(dataset_instances=dataset_instances, num_instances=1)
                    print_with_color(f"{'#' * 20} Testing: {dataset.dataset_name} dataset, {model_name} model {'#' * 20}", "35")
                    prediction, tgt = self.get_prediction(model, dataset, test_instances)
                    print(f"Prediction: {prediction}\nTarget: {tgt}\n")
                    for metric in evaluation_metrics:
                        print_with_color(f"{metric.metric_name} metric", "35")
                        score_dict = self.get_eval_dict(metric, prediction, tgt)
                        print(score_dict)

                    print_with_color(f"{'#' * 20} Test for {dataset.dataset_name} dataset, {model_name} model COMPLETE {'#' * 20}\n\n", "32")

unittest.main(argv=['first-arg-is-ignored'], exit=False)
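
`unittest.main(argv=[...], exit=False)` runs every `TestCase` it finds in the notebook's `__main__` namespace. To run a single test class and inspect the outcome programmatically, a loader and runner can be used directly. The sketch below uses a trivial stand-in test case (`SmokeTest` is hypothetical) so that it runs without the library installed.

```python
import io
import unittest


class SmokeTest(unittest.TestCase):
    """Stand-in test case; replace with the test class you want to run."""

    def test_truth(self):
        self.assertEqual(1 + 1, 2)


# Load only the tests from one class instead of everything in __main__.
suite = unittest.TestLoader().loadTestsFromTestCase(SmokeTest)

# Capture the runner's report in a buffer so it can be printed or logged.
stream = io.StringIO()
result = unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)

print(stream.getvalue())
print("tests run:", result.testsRun, "successful:", result.wasSuccessful())
```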


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/lily/mmm274/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
2021-08-18 18:39:19,420 [MainThread  ] [INFO ]  Set ROUGE home directory to /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1.5.5/.


[35m
Initializing all evaluation metrics...[0m
<class 'evaluation.bertscore_metric.BertScore'>
<class 'evaluation.bleu_metric.Bleu'>
<class 'evaluation.rouge_metric.Rouge'>
<class 'evaluation.rougewe_metric.RougeWe'>
<class 'evaluation.meteor_metric.Meteor'>
[35m

Beginning integration tests...[0m


[nltk_data] Downloading package wordnet to
[nltk_data]     /home/lily/mmm274/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Reusing dataset cnn_dailymail (/home/lily/mmm274/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234)
100%|██████████| 287113/287113 [00:18<00:00, 15767.79it/s]



cnn_dailymail has a training set of 287113 examples
[35mInitializing all matching model pipelines for cnn_dailymail dataset...[0m


Reusing dataset cnn_dailymail (/home/lily/mmm274/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234)
  logger.warn(
You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassificatio

[35m#################### Testing: cnn_dailymail dataset, BART model ####################[0m
Prediction: ['The 58-year-old man was pinned against a wall after the bin lorry started rolling. It crashed into a Jaguar X-Type before careering through railings. The lorry came to rest hanging over the edge of a pier in South Queensferry. Staff from a nearby pub rushed to help the driver before emergency services reached the scene. But the man, who is yet to be named, later died from his injuries.']
Target: ['The 58-year-old man was pinned against a wall when truck started rolling .\nBiffa lorry mounted a pavement, crashed into a car and through railings .\nIt came to rest hanging over the edge of Hawes Pier in South Queensferry .\nStaff from a nearby pub rushed to help driver but he later died in hospital .']

[35mbert score metric[0m


2021-08-18 18:41:03,499 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:41:03,500 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpl1g226hl/system and model files to /tmp/tmpl1g226hl/model.
2021-08-18 18:41:03,501 [MainThread  ] [INFO ]  Processing files in /tmp/tmpuuhhe111.
2021-08-18 18:41:03,502 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:41:03,503 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpl1g226hl/system.
2021-08-18 18:41:03,504 [MainThread  ] [INFO ]  Processing files in /tmp/tmp7rl6jyiv.
2021-08-18 18:41:03,505 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:41:03,506 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpl1g226hl/model.
2021-08-18 18:41:03,507 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpzn5ugn_i/rouge_conf.xml
2021-08-18 18:41:03,508 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.8048666715621948}
[35mbleu metric[0m
{'bleu': 40.62769602451984}
[35mrouge metric[0m
{'rouge_1_f_score': 0.69291, 'rouge_2_f_score': 0.512, 'rouge_l_f_score': 0.67716}
[35mrougeWE metric[0m
{'rouge_we_3_f': 0.6178861788617885}
[35mmeteor metric[0m
{'meteor': 0.4994627816275194}
[32m#################### Test for cnn_dailymail dataset, BART model COMPLETE ####################

[0m
[35m#################### Testing: cnn_dailymail dataset, LexRank model ####################[0m
Prediction: ['While Ami worked, Tesca played a simple interactive game, appearing to deftly use the computer as she sucked on her pacifier. One scene from a plane, as recalled by Ami Fitzgerald: When Tesca was not yet 2, she and her mother took a flight, each with her own laptop, unusual for a toddler in the 1990s.']
Target: ['Tesca Fitzgerald began to play with computers at before the age of two .\nShe skipped middle

2021-08-18 18:41:23,823 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:41:23,824 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmps7vo_qe5/system and model files to /tmp/tmps7vo_qe5/model.
2021-08-18 18:41:23,825 [MainThread  ] [INFO ]  Processing files in /tmp/tmpr5p_lquj.
2021-08-18 18:41:23,826 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:41:23,827 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmps7vo_qe5/system.
2021-08-18 18:41:23,827 [MainThread  ] [INFO ]  Processing files in /tmp/tmppbq_jdot.
2021-08-18 18:41:23,828 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:41:23,829 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmps7vo_qe5/model.
2021-08-18 18:41:23,830 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmppcmtjg7d/rouge_conf.xml
2021-08-18 18:41:23,831 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5137070417404175}
[35mbleu metric[0m
{'bleu': 1.7783078636476368}
[35mrouge metric[0m
{'rouge_1_f_score': 0.29412, 'rouge_2_f_score': 0.02, 'rouge_l_f_score': 0.2353}
[35mrougeWE metric[0m


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'rouge_we_3_f': 0.02040816326530612}
[35mmeteor metric[0m
{'meteor': 0.16068993180906538}
[32m#################### Test for cnn_dailymail dataset, LexRank model COMPLETE ####################

[0m
[35m#################### Testing: cnn_dailymail dataset, Longformer model ####################[0m
Longformer model: processing document of tensor([584]) tokens
Prediction: ["Josh Jenkins, 19, bought a Bargain Bucket after eating leftovers the following day.\nHe immediately repulsed by the 'grey and wrinkled' organ.\nKFC has since apologised but he insists he will never eat in one of its branches again.\nThe chain has since said it will provide him with a goodwill gift."]
Target: ["John Jenkins found 'white and wrinkled' organ while eating in Dorset branch .\nAfter the student found the offal he sent it back to the fast-food chain .\nKFC studied the piece of chicken and said that it was a kidney ."]

[35mbert score metric[0m


2021-08-18 18:41:47,068 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:41:47,070 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmp61lup_j2/system and model files to /tmp/tmp61lup_j2/model.
2021-08-18 18:41:47,070 [MainThread  ] [INFO ]  Processing files in /tmp/tmpndp7m44l.
2021-08-18 18:41:47,071 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:41:47,072 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp61lup_j2/system.
2021-08-18 18:41:47,073 [MainThread  ] [INFO ]  Processing files in /tmp/tmp4v_nnes8.
2021-08-18 18:41:47,073 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:41:47,075 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp61lup_j2/model.
2021-08-18 18:41:47,075 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpnwgn6mmw/rouge_conf.xml
2021-08-18 18:41:47,076 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5806867480278015}
[35mbleu metric[0m
{'bleu': 4.035316002020003}
[35mrouge metric[0m
{'rouge_1_f_score': 0.3913, 'rouge_2_f_score': 0.06666, 'rouge_l_f_score': 0.36956}
[35mrougeWE metric[0m
{'rouge_we_3_f': 0.1590909090909091}
[35mmeteor metric[0m
{'meteor': 0.16666666666666666}
[32m#################### Test for cnn_dailymail dataset, Longformer model COMPLETE ####################

[0m
[35m#################### Testing: cnn_dailymail dataset, Pegasus model ####################[0m
Prediction: ['A father who lost a leg in a work accident died after taking an accidental overdose of a drug he bought on the internet, an inquest has found.']
Target: ["Daniel Batchelor, 36, fell off a ladder in 2011 and suffered multiple injuries .\nHe had to have his right leg amputated and struggled with the pain .\nHe was unable to take opiate-based painkillers due to an allergy .\nSo, he bought a painkill

2021-08-18 18:42:11,642 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:42:11,644 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpn9l1v2ua/system and model files to /tmp/tmpn9l1v2ua/model.
2021-08-18 18:42:11,644 [MainThread  ] [INFO ]  Processing files in /tmp/tmp_yepz5rf.
2021-08-18 18:42:11,645 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:42:11,647 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpn9l1v2ua/system.
2021-08-18 18:42:11,647 [MainThread  ] [INFO ]  Processing files in /tmp/tmpq4ya46rg.
2021-08-18 18:42:11,648 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:42:11,649 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpn9l1v2ua/model.
2021-08-18 18:42:11,651 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpt4uesv3p/rouge_conf.xml
2021-08-18 18:42:11,651 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.581717312335968}
[35mbleu metric[0m
{'bleu': 0.8410080006327163}
[35mrouge metric[0m
{'rouge_1_f_score': 0.31666, 'rouge_2_f_score': 0.0678, 'rouge_l_f_score': 0.18334}
[35mrougeWE metric[0m
{'rouge_we_3_f': 0.18965517241379312}
[35mmeteor metric[0m
{'meteor': 0.30023659586579854}
[32m#################### Test for cnn_dailymail dataset, Pegasus model COMPLETE ####################

[0m
[35m#################### Testing: cnn_dailymail dataset, TextRank model ####################[0m
Prediction: ['I didn\'t get tired until 1.am -- it was very energizing." \'Touching infinity\' Brooks, who is studying international ocean policy, was one of 30 U.S. scientists monitoring Antarctica\'s unique eco-system, as part of a National Science Foundation research cruise.']
Target: ['Science research ship cruises Antarctica, captures stunning time-lapse video .\nSouth Pole dubbed "Land of the Midnight Sun

2021-08-18 18:42:31,574 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:42:31,575 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpl_fzw7ul/system and model files to /tmp/tmpl_fzw7ul/model.
2021-08-18 18:42:31,576 [MainThread  ] [INFO ]  Processing files in /tmp/tmpic4jkduc.
2021-08-18 18:42:31,577 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:42:31,578 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpl_fzw7ul/system.
2021-08-18 18:42:31,579 [MainThread  ] [INFO ]  Processing files in /tmp/tmpzw5kiuv7.
2021-08-18 18:42:31,580 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:42:31,582 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpl_fzw7ul/model.
2021-08-18 18:42:31,583 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmp1z2l1z5q/rouge_conf.xml
2021-08-18 18:42:31,584 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.4563243091106415}
[35mbleu metric[0m
{'bleu': 1.3302791074921412}
[35mrouge metric[0m
{'rouge_1_f_score': 0.16092, 'rouge_2_f_score': 0.02353, 'rouge_l_f_score': 0.16092}
[35mrougeWE metric[0m
{'rouge_we_3_f': 0.04819277108433735}
[35mmeteor metric[0m
{'meteor': 0.08838383838383838}
[32m#################### Test for cnn_dailymail dataset, TextRank model COMPLETE ####################

[0m


Using custom data configuration default
Reusing dataset multi_news (/home/lily/mmm274/.cache/huggingface/datasets/multi_news/default/1.0.0/2e145a8e21361ba4ee46fef70640ab946a3e8d425002f104d2cda99a9efca376)
100%|██████████| 44972/44972 [00:03<00:00, 13619.86it/s]



multi_news has a training set of 44972 examples
[35mInitializing all matching model pipelines for multi_news dataset...[0m


Using custom data configuration default
Reusing dataset multi_news (/home/lily/mmm274/.cache/huggingface/datasets/multi_news/default/1.0.0/2e145a8e21361ba4ee46fef70640ab946a3e8d425002f104d2cda99a9efca376)
  0%|          | 0/44972 [00:00<?, ?it/s]You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be 

[35m#################### Testing: multi_news dataset, Multi-document joint (BART) model ####################[0m
Prediction: ['2004 BL86 will pass about three times the distance of Earth to the moon on January 26. The flyby will be the closest by any known space rock this large until asteroid 1999 AN10 flies past Earth in 2027. Asteroid is expected to be observable to amateur astronomers with small telescopes and strong binoculars.']
Target: ['– If you\'ve ever wanted to get a good look at an asteroid, Monday night will be your best chance for more than a decade. Asteroid 2004 BL86, a space rock about a third of a mile in diameter, will be 745,000 miles away on Monday, around three times as far away as the moon, reports CNN. Barring cosmic surprises, the next close encounter with something that big will be in 2027, and NASA says that while 2004 BL86 doesn\'t pose any threat, it gives astronomers a "unique opportunity to observe and learn more." Very little is known about this particul

2021-08-18 18:46:15,256 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:46:15,257 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmplgqx5a1m/system and model files to /tmp/tmplgqx5a1m/model.
2021-08-18 18:46:15,258 [MainThread  ] [INFO ]  Processing files in /tmp/tmpbq0tznac.
2021-08-18 18:46:15,259 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:46:15,260 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmplgqx5a1m/system.
2021-08-18 18:46:15,261 [MainThread  ] [INFO ]  Processing files in /tmp/tmpq5mfresd.
2021-08-18 18:46:15,262 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:46:15,262 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmplgqx5a1m/model.
2021-08-18 18:46:15,263 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpjn3iuhf3/rouge_conf.xml
2021-08-18 18:46:15,264 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.534206211566925}
[35mbleu metric[0m
{'bleu': 0.08961889027287664}
[35mrouge metric[0m
{'rouge_1_f_score': 0.2549, 'rouge_2_f_score': 0.04606, 'rouge_l_f_score': 0.13725}
[35mrougeWE metric[0m
{'rouge_we_3_f': 0.1390728476821192}
[35mmeteor metric[0m
{'meteor': 0.3126196990424077}
[32m#################### Test for multi_news dataset, Multi-document joint (BART) model COMPLETE ####################

[0m
[35m#################### Testing: multi_news dataset, Multi-document joint (LexRank) model ####################[0m
Prediction: ["Problem 1: Clinton made herself vulnerable to hackers \n  \n Clinton's private server had several potential points of vulnerability, so it was possible for spies to hack into the system — both to view messages or to reroute messages. We don't know why Clinton used a private server."]
Target: ['– "There\'s no equivalency." That\'s Ivanka Trump\'s official take on 

2021-08-18 18:46:37,600 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:46:37,602 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmp0i22inlz/system and model files to /tmp/tmp0i22inlz/model.
2021-08-18 18:46:37,602 [MainThread  ] [INFO ]  Processing files in /tmp/tmpqv0czt_f.
2021-08-18 18:46:37,603 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:46:37,605 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp0i22inlz/system.
2021-08-18 18:46:37,605 [MainThread  ] [INFO ]  Processing files in /tmp/tmpjruimacm.
2021-08-18 18:46:37,606 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:46:37,607 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp0i22inlz/model.
2021-08-18 18:46:37,608 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpxzy4h0rm/rouge_conf.xml
2021-08-18 18:46:37,608 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.44767850637435913}
[35mbleu metric[0m
{'bleu': 0.00485597570718259}
[35mrouge metric[0m
{'rouge_1_f_score': 0.09523, 'rouge_2_f_score': 0.0, 'rouge_l_f_score': 0.07738}
[35mrougeWE metric[0m


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'rouge_we_3_f': 0.018072289156626505}
[35mmeteor metric[0m
{'meteor': 0.10838150289017343}
[32m#################### Test for multi_news dataset, Multi-document joint (LexRank) model COMPLETE ####################

[0m
[35m#################### Testing: multi_news dataset, Multi-document joint (Longformer) model ####################[0m
Longformer model: processing document of tensor([1510]) tokens
Target: ['– The country\'s biggest theater chain has announced that all moviegoers can expect to have their bags searched on entry—and it isn\'t looking for illicit snacks. Instead, Regal Cinemas says on its website that since "security issues have become a daily part of our lives in America," searches will be carried out to ensure the "safety of our guests and employees," Entertainment Weekly reports. The move at the chain, which has around 7,300 screens nationwide, comes after recent theater attacks in Louisiana and Tennessee, which happened three years after James Holmes massacred 12 p

2021-08-18 18:47:02,828 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:47:02,829 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpl6kxgxpw/system and model files to /tmp/tmpl6kxgxpw/model.
2021-08-18 18:47:02,830 [MainThread  ] [INFO ]  Processing files in /tmp/tmpmnd8c3dg.
2021-08-18 18:47:02,831 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:47:02,832 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpl6kxgxpw/system.
2021-08-18 18:47:02,833 [MainThread  ] [INFO ]  Processing files in /tmp/tmp94hvetj0.
2021-08-18 18:47:02,834 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:47:02,835 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpl6kxgxpw/model.
2021-08-18 18:47:02,836 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmptc1xxnqx/rouge_conf.xml
2021-08-18 18:47:02,837 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5382329821586609}
[35mbleu metric[0m
{'bleu': 0.2804200562123925}
[35mrouge metric[0m
{'rouge_1_f_score': 0.21212, 'rouge_2_f_score': 0.0458, 'rouge_l_f_score': 0.1591}
[35mrougeWE metric[0m
{'rouge_we_3_f': 0.13846153846153847}
[35mmeteor metric[0m
{'meteor': 0.2767118566176471}
[32m#################### Test for multi_news dataset, Multi-document joint (Longformer) model COMPLETE ####################

[0m
[35m#################### Testing: multi_news dataset, Multi-document joint (Pegasus) model ####################[0m
Prediction: ['The CIA\'s findings that Russia intervened in the 2016 election to help Donald Trump win the presidency are both "stunning and not surprising", the next leader of Senate Democrats said.']
Target: ['– Reversing months of vague muttering about Russian influence, the CIA secretly told "key senators" that Russian hackers actively tried to put Donald Trump in th

2021-08-18 18:47:30,921 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:47:30,922 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpdjs_f6n2/system and model files to /tmp/tmpdjs_f6n2/model.
2021-08-18 18:47:30,923 [MainThread  ] [INFO ]  Processing files in /tmp/tmpyypdgvl6.
2021-08-18 18:47:30,924 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:47:30,925 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpdjs_f6n2/system.
2021-08-18 18:47:30,925 [MainThread  ] [INFO ]  Processing files in /tmp/tmp8qx1rpnt.
2021-08-18 18:47:30,926 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:47:30,927 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpdjs_f6n2/model.
2021-08-18 18:47:30,927 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmphxmhiz7p/rouge_conf.xml
2021-08-18 18:47:30,928 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5230874419212341}
[35mbleu metric[0m
{'bleu': 0.000266292702621717}
[35mrouge metric[0m
{'rouge_1_f_score': 0.14959, 'rouge_2_f_score': 0.03343, 'rouge_l_f_score': 0.09419}
[35mrougeWE metric[0m
{'rouge_we_3_f': 0.08403361344537814}
[35mmeteor metric[0m
{'meteor': 0.1851851851851852}
[32m#################### Test for multi_news dataset, Multi-document joint (Pegasus) model COMPLETE ####################

[0m
[35m#################### Testing: multi_news dataset, Multi-document joint (TextRank) model ####################[0m
Prediction: ['\n  \n “It’s profoundly sad,” said neighbor Harriet Allen.']
Target: ['– A 74-year-old Massachusetts woman may have been living with the decomposing body of her sister for up to 18 months, possibly without even realizing her sister was dead, the Brookline TAB reports. According to the Boston Globe, Lynda Waldman lived alone with her 67-year-old sister, Ho

2021-08-18 18:47:54,852 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:47:54,853 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmp1di8llcs/system and model files to /tmp/tmp1di8llcs/model.
2021-08-18 18:47:54,853 [MainThread  ] [INFO ]  Processing files in /tmp/tmp0mh1015s.
2021-08-18 18:47:54,854 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:47:54,855 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp1di8llcs/system.
2021-08-18 18:47:54,855 [MainThread  ] [INFO ]  Processing files in /tmp/tmp83hic5ri.
2021-08-18 18:47:54,856 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:47:54,857 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp1di8llcs/model.
2021-08-18 18:47:54,858 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpskj3db95/rouge_conf.xml
2021-08-18 18:47:54,858 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.3624487519264221}
[35mbleu metric[0m
{'bleu': 3.3506980192796586e-11}
[35mrouge metric[0m
{'rouge_1_f_score': 0.0315, 'rouge_2_f_score': 0.00793, 'rouge_l_f_score': 0.02363}
[35mrougeWE metric[0m
{'rouge_we_3_f': 0.016}
[35mmeteor metric[0m
{'meteor': 0.03355704697986578}
[32m#################### Test for multi_news dataset, Multi-document joint (TextRank) model COMPLETE ####################

[0m
[35m#################### Testing: multi_news dataset, Multi-document separate (BART) model ####################[0m
Prediction: ['Former White Plains Officer Glen Hochman had no known health or psychiatric problems. Interviews with people who knew him found no one had an inkling that "Mr. Hoffman would have committed this heinous crime" Alissa, 17, and Deanna, 13, were apparently both sleeping when they were shot in the head. Glen Hochman, a recently retired White Plains police officer, killed 

2021-08-18 18:48:27,946 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:48:27,947 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmped4gq_94/system and model files to /tmp/tmped4gq_94/model.
2021-08-18 18:48:27,948 [MainThread  ] [INFO ]  Processing files in /tmp/tmp4o46p1kq.
2021-08-18 18:48:27,949 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:48:27,950 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmped4gq_94/system.
2021-08-18 18:48:27,951 [MainThread  ] [INFO ]  Processing files in /tmp/tmpgf3vxkqc.
2021-08-18 18:48:27,952 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:48:27,953 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmped4gq_94/model.
2021-08-18 18:48:27,953 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpjhb4fwre/rouge_conf.xml
2021-08-18 18:48:27,954 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5882028341293335}
[35mbleu metric[0m
{'bleu': 3.7324323174646765}
[35mrouge metric[0m
{'rouge_1_f_score': 0.36828, 'rouge_2_f_score': 0.07407, 'rouge_l_f_score': 0.14164}
[35mrougeWE metric[0m
{'rouge_we_3_f': 0.2292263610315186}
[35mmeteor metric[0m
{'meteor': 0.25565161187295843}
[32m#################### Test for multi_news dataset, Multi-document separate (BART) model COMPLETE ####################

[0m
[35m#################### Testing: multi_news dataset, Multi-document separate (LexRank) model ####################[0m
Prediction: ['“As doctors we were assuming that sex gets worse for women.” \n  \n More comfortable with our bodies \n  \n To get a better sense of the impact of age on sex, Thomas and her colleagues spoke to 39 women whose ages ranged from 46 to 59, either in one on one interviews or in focus groups. The women pointed to several factors leading to better sex: \n  \n In

2021-08-18 18:48:49,832 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:48:49,833 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpg3h6l8nm/system and model files to /tmp/tmpg3h6l8nm/model.
2021-08-18 18:48:49,834 [MainThread  ] [INFO ]  Processing files in /tmp/tmpp0alr1w6.
2021-08-18 18:48:49,835 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:48:49,836 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpg3h6l8nm/system.
2021-08-18 18:48:49,837 [MainThread  ] [INFO ]  Processing files in /tmp/tmpcixmjpb0.
2021-08-18 18:48:49,838 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:48:49,838 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpg3h6l8nm/model.
2021-08-18 18:48:49,839 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmp71yrupev/rouge_conf.xml
2021-08-18 18:48:49,840 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.6463611721992493}
[35mbleu metric[0m
{'bleu': 10.54782677145177}
[35mrouge metric[0m
{'rouge_1_f_score': 0.50722, 'rouge_2_f_score': 0.16977, 'rouge_l_f_score': 0.2268}
[35mrougeWE metric[0m


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'rouge_we_3_f': 0.3575883575883576}
[35mmeteor metric[0m
{'meteor': 0.3566705772985148}
[32m#################### Test for multi_news dataset, Multi-document separate (LexRank) model COMPLETE ####################

[0m
[35m#################### Testing: multi_news dataset, Multi-document separate (Longformer) model ####################[0m
Longformer model: processing document of tensor([441]) tokens


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Longformer model: processing document of tensor([1416]) tokens


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Longformer model: processing document of tensor([2030]) tokens
Prediction: ["Chicago Superintendent Eddie Johnson says he will recommend firing seven officers.\nHe will recommend the firing of seven officers who filed false reports in the fatal shooting of black teen Laquan McDonald.\nThe officer shot McDonald 16 times in October 2014.\nTwo of the officers cited in the report have since retired. The move is a long-delayed official response to a six-month investigation.\nThe department's critics say the department is covering up for one another.\nA video shows the officer opening fire within seconds of the shooting.\nIt's unclear whether the officers will be fired or fired. Chicago Police Supt. Eddie Johnson is seeking to fire seven officers for lying in their accounts of what happened in the shooting of Laquan McDonald.\nThe officer who was shot 16 times by Van Dyke was one of the 7 cops who were fired.\nJohnson has decided not to seek termination because he disagrees with the city's r

2021-08-18 18:49:24,762 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:49:24,763 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmp_ycxlavt/system and model files to /tmp/tmp_ycxlavt/model.
2021-08-18 18:49:24,763 [MainThread  ] [INFO ]  Processing files in /tmp/tmp5qmxqon3.
2021-08-18 18:49:24,764 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:49:24,764 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp_ycxlavt/system.
2021-08-18 18:49:24,765 [MainThread  ] [INFO ]  Processing files in /tmp/tmp51lls709.
2021-08-18 18:49:24,765 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:49:24,766 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp_ycxlavt/model.
2021-08-18 18:49:24,767 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmp9zowatj7/rouge_conf.xml
2021-08-18 18:49:24,767 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.6449233889579773}
bleu metric
{'bleu': 6.361614074750951}
rouge metric
{'rouge_1_f_score': 0.52262, 'rouge_2_f_score': 0.18687, 'rouge_l_f_score': 0.29146}
rougeWE metric
{'rouge_we_3_f': 0.350253807106599}
meteor metric
{'meteor': 0.3261371887185183}
#################### Test for multi_news dataset, Multi-document separate (Longformer) model COMPLETE ####################

#################### Testing: multi_news dataset, Multi-document separate (Pegasus) model ####################
Prediction: ['The periodic table will soon have four new names added to its lower right-hand corner. Four new names have been recommended for the chemical elements ununtrium, ununpentium, ununseptium and ununoctium. The latest additions to the periodic table will now be known as nihonium, moscovium, tennessine and oganesson.']
Target: ['– Sorry, chemistry students, 

2021-08-18 18:50:03,853 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:50:03,854 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmppjt86z54/system and model files to /tmp/tmppjt86z54/model.
2021-08-18 18:50:03,854 [MainThread  ] [INFO ]  Processing files in /tmp/tmpvdga5x5c.
2021-08-18 18:50:03,855 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:50:03,856 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmppjt86z54/system.
2021-08-18 18:50:03,856 [MainThread  ] [INFO ]  Processing files in /tmp/tmp9h40vkkz.
2021-08-18 18:50:03,857 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:50:03,857 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmppjt86z54/model.
2021-08-18 18:50:03,858 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpr_urmv97/rouge_conf.xml
2021-08-18 18:50:03,858 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.6016560196876526}
bleu metric
{'bleu': 0.3600425939675759}
rouge metric
{'rouge_1_f_score': 0.24373, 'rouge_2_f_score': 0.11553, 'rouge_l_f_score': 0.14337}
rougeWE metric
{'rouge_we_3_f': 0.16727272727272727}
meteor metric
{'meteor': 0.3393689285875867}
#################### Test for multi_news dataset, Multi-document separate (Pegasus) model COMPLETE ####################

#################### Testing: multi_news dataset, Multi-document separate (TextRank) model ####################
Prediction: ["Mercury's hot surface also posed problems during low orbits, so additional radiators and heat pipes were used to get rid of this excess heat. Starting in 1996, Alexa Internet has been donating their crawl data to the Internet Archive. The probe, which has been mapping Mercury since 2011, is due to run out of fuel on Thursday and is expected to crash i

2021-08-18 18:50:29,751 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:50:29,752 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmp6rqe1ery/system and model files to /tmp/tmp6rqe1ery/model.
2021-08-18 18:50:29,753 [MainThread  ] [INFO ]  Processing files in /tmp/tmpmq2ks8nx.
2021-08-18 18:50:29,753 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:50:29,754 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp6rqe1ery/system.
2021-08-18 18:50:29,755 [MainThread  ] [INFO ]  Processing files in /tmp/tmpm993uda_.
2021-08-18 18:50:29,755 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:50:29,756 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp6rqe1ery/model.
2021-08-18 18:50:29,757 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpuxkby6ee/rouge_conf.xml
2021-08-18 18:50:29,757 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5458119511604309}
bleu metric
{'bleu': 1.298182715298779}
rouge metric
{'rouge_1_f_score': 0.30199, 'rouge_2_f_score': 0.06304, 'rouge_l_f_score': 0.16524}
rougeWE metric
{'rouge_we_3_f': 0.14409221902017288}
meteor metric
{'meteor': 0.21472392638036808}
#################### Test for multi_news dataset, Multi-document separate (TextRank) model COMPLETE ####################


Using custom data configuration default
Reusing dataset xsum (/home/lily/mmm274/.cache/huggingface/datasets/xsum/default/1.2.0/4957825a982999fbf80bca0b342793b01b2611e021ef589fb7c6250b3577b499)
100%|██████████| 204045/204045 [00:24<00:00, 8461.04it/s] 



xsum has a training set of 204045 examples
Initializing all matching model pipelines for xsum dataset...


Using custom data configuration default
Reusing dataset xsum (/home/lily/mmm274/.cache/huggingface/datasets/xsum/default/1.2.0/4957825a982999fbf80bca0b342793b01b2611e021ef589fb7c6250b3577b499)
  0%|          | 0/204045 [00:00<?, ?it/s]
You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly ide

#################### Testing: xsum dataset, BART model ####################
Prediction: ['Clouds of apple and cherry mist and peach snow will be released into the air. People will also get scratch \'n\' sniff programmes and fruit sweets. The mayor said it was among the world\'s "most dazzling firework displays" BBC London weather forecaster Sara Thornton said there would be scattered showers.']
Target: ['Revellers celebrating the New Year in central London will be able to "taste" the atmosphere with flavoured mist, "snow" and confetti released.']

bert score metric


2021-08-18 18:52:18,778 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:52:18,779 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmp57sdcxr6/system and model files to /tmp/tmp57sdcxr6/model.
2021-08-18 18:52:18,779 [MainThread  ] [INFO ]  Processing files in /tmp/tmp56vc6ddf.
2021-08-18 18:52:18,780 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:52:18,781 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp57sdcxr6/system.
2021-08-18 18:52:18,782 [MainThread  ] [INFO ]  Processing files in /tmp/tmp1x2g3f0x.
2021-08-18 18:52:18,783 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:52:18,784 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp57sdcxr6/model.
2021-08-18 18:52:18,785 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpprwm04f2/rouge_conf.xml
2021-08-18 18:52:18,785 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5275789499282837}
bleu metric
{'bleu': 2.0690162381325976}
rouge metric
{'rouge_1_f_score': 0.24657, 'rouge_2_f_score': 0.02817, 'rouge_l_f_score': 0.10959}
rougeWE metric
{'rouge_we_3_f': 0.14492753623188406}
meteor metric
{'meteor': 0.0635593220338983}
#################### Test for xsum dataset, BART model COMPLETE ####################

#################### Testing: xsum dataset, LexRank model ####################
Prediction: ['Mr Schultz said the sales increase meant it had served 23 million more customers in the quarter compared to the same period last year. Both firms give their customers a chance to earn Starbucks "starts" which can be used in the coffee chain\'s shops.']
Target: ['Starbucks said global sales increased 18% to $4.9bn in the quarter to 28 June - its highest ever quarterly revenue.']

bert score metric


2021-08-18 18:52:40,365 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:52:40,366 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpp03t_i4w/system and model files to /tmp/tmpp03t_i4w/model.
2021-08-18 18:52:40,367 [MainThread  ] [INFO ]  Processing files in /tmp/tmp11ajaonr.
2021-08-18 18:52:40,368 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:52:40,369 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpp03t_i4w/system.
2021-08-18 18:52:40,369 [MainThread  ] [INFO ]  Processing files in /tmp/tmpj9ua_nno.
2021-08-18 18:52:40,370 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:52:40,371 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpp03t_i4w/model.
2021-08-18 18:52:40,372 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpuigl96g0/rouge_conf.xml
2021-08-18 18:52:40,373 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.529384434223175}
bleu metric
{'bleu': 3.725917780842771}
rouge metric
{'rouge_1_f_score': 0.27692, 'rouge_2_f_score': 0.09524, 'rouge_l_f_score': 0.21539}
rougeWE metric


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'rouge_we_3_f': 0.13114754098360654}
meteor metric
{'meteor': 0.19200969485060393}
#################### Test for xsum dataset, LexRank model COMPLETE ####################

#################### Testing: xsum dataset, Longformer model ####################
Longformer model: processing document of tensor([384]) tokens
Prediction: ['Public Health Wales recommends restrictions on advertising e-cigarettes in media.\nPublic Health Service says e-cigs deliver nicotine within an inhalable aerosol.\nComes as health risks are significantly lower than cigarettes.\nHealth officials say e-cigarette sales are not without risk.']
Target: ['Health officials are calling for a ban on the sale of confectionary-like flavours in e-cigarettes over concerns they appeal to children.']

bert score metric


2021-08-18 18:53:03,296 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:53:03,297 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmp0h72qqc6/system and model files to /tmp/tmp0h72qqc6/model.
2021-08-18 18:53:03,298 [MainThread  ] [INFO ]  Processing files in /tmp/tmpx3u0gpab.
2021-08-18 18:53:03,299 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:53:03,300 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp0h72qqc6/system.
2021-08-18 18:53:03,300 [MainThread  ] [INFO ]  Processing files in /tmp/tmp98zsdy29.
2021-08-18 18:53:03,301 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:53:03,302 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp0h72qqc6/model.
2021-08-18 18:53:03,303 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmp0h5gibr_/rouge_conf.xml
2021-08-18 18:53:03,303 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5759812593460083}
bleu metric
{'bleu': 2.331372206682652}
rouge metric
{'rouge_1_f_score': 0.24616, 'rouge_2_f_score': 0.06349, 'rouge_l_f_score': 0.18462}
rougeWE metric
{'rouge_we_3_f': 0.0983606557377049}
meteor metric
{'meteor': 0.12892253675663815}
#################### Test for xsum dataset, Longformer model COMPLETE ####################

#################### Testing: xsum dataset, Pegasus model ####################
Prediction: ['Leicester have made three changes to the side that lost to Saracens at the weekend.']
Target: ['Adam Thompstone returns on the wing for Leicester for the visit of London Irish, while Jono Kitto makes a first start for the club at scrum-half.']

bert score metric


2021-08-18 18:53:30,674 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:53:30,675 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpzg_kclyy/system and model files to /tmp/tmpzg_kclyy/model.
2021-08-18 18:53:30,676 [MainThread  ] [INFO ]  Processing files in /tmp/tmpitmvlg25.
2021-08-18 18:53:30,677 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:53:30,678 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpzg_kclyy/system.
2021-08-18 18:53:30,678 [MainThread  ] [INFO ]  Processing files in /tmp/tmpa5l6chzh.
2021-08-18 18:53:30,679 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:53:30,680 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpzg_kclyy/model.
2021-08-18 18:53:30,681 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmp60j_p7fj/rouge_conf.xml
2021-08-18 18:53:30,682 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.46819964051246643}
bleu metric
{'bleu': 1.7274520003385847}
rouge metric
{'rouge_1_f_score': 0.19048, 'rouge_2_f_score': 0.0, 'rouge_l_f_score': 0.14286}
rougeWE metric
{'rouge_we_3_f': 0.052631578947368425}
meteor metric
{'meteor': 0.15527950310559005}
#################### Test for xsum dataset, Pegasus model COMPLETE ####################

#################### Testing: xsum dataset, TextRank model ####################
Prediction: ['Three of the courses earmarked for the Olympics are in the bay and three are in the Atlantic, with up to 1,400 athletes set to compete in water sports at the Games.']
Target: ["Sailing's governing body has warned that events at the Rio Olympics in 2016 could be moved out of the polluted Guanabara Bay."]

bert score metric


2021-08-18 18:53:51,696 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:53:51,697 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpfc6fazc0/system and model files to /tmp/tmpfc6fazc0/model.
2021-08-18 18:53:51,698 [MainThread  ] [INFO ]  Processing files in /tmp/tmph91p2ray.
2021-08-18 18:53:51,698 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:53:51,699 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpfc6fazc0/system.
2021-08-18 18:53:51,700 [MainThread  ] [INFO ]  Processing files in /tmp/tmpv31ktv6j.
2021-08-18 18:53:51,701 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:53:51,702 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpfc6fazc0/model.
2021-08-18 18:53:51,703 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmp4jvxt4im/rouge_conf.xml
2021-08-18 18:53:51,703 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.49894580245018005}
bleu metric
{'bleu': 3.5410607693940146}
rouge metric
{'rouge_1_f_score': 0.25, 'rouge_2_f_score': 0.07407, 'rouge_l_f_score': 0.17857}
rougeWE metric
{'rouge_we_3_f': 0.23076923076923075}
meteor metric
{'meteor': 0.09677419354838708}
#################### Test for xsum dataset, TextRank model COMPLETE ####################


Reusing dataset pubmed_qa (/home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d)
Loading cached split indices for dataset at /home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d/cache-ea46b0ec6b6151e7.arrow and /home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d/cache-34d92cfafb0bbd63.arrow
100%|██████████| 169226/169226 [01:26<00:00, 1956.70it/s]



pubmed_qa has a training set of 169226 examples
Initializing all matching model pipelines for pubmed_qa dataset...


Reusing dataset pubmed_qa (/home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d)
  0%|          | 0/169226 [00:00<?, ?it/s]
You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a 

#################### Testing: pubmed_qa dataset, TF-IDF (BART) model ####################
Prediction: ['chronic metabolic acidosis ( cma ) normal adults results complex endocrine metabolic alterations including growth hormone ( gh ) insensitivity, hypothyroidism, hyperglucocorticoidism, hypoalbuminaemia loss protein stores. treated 14 chronic haemodialysis patients daily oral na-citrate 4 weeks, yielding steady-state pre-dialytic plasma bicarbonate concentration 26.7 mmol/l.']
Target: ['CMA contributes to the derangements of the growth and thyroid hormone axes and to hypoalbuminaemia, but is not a modulator of systemic inflammation in dialysis patients. Correcting CMA may improve nutritional and metabolic parameters and thus lower morbidity and mortality.']

bert score metric


2021-08-18 18:58:37,664 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:58:37,665 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmp2cn_oar8/system and model files to /tmp/tmp2cn_oar8/model.
2021-08-18 18:58:37,666 [MainThread  ] [INFO ]  Processing files in /tmp/tmpfi8xwe34.
2021-08-18 18:58:37,667 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:58:37,668 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp2cn_oar8/system.
2021-08-18 18:58:37,668 [MainThread  ] [INFO ]  Processing files in /tmp/tmpi71hwkbh.
2021-08-18 18:58:37,669 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:58:37,670 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp2cn_oar8/model.
2021-08-18 18:58:37,671 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpms00qx9_/rouge_conf.xml
2021-08-18 18:58:37,671 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5473907589912415}
bleu metric
{'bleu': 1.177721496744438}
rouge metric
{'rouge_1_f_score': 0.14117, 'rouge_2_f_score': 0.0, 'rouge_l_f_score': 0.11765}
rougeWE metric
{'rouge_we_3_f': 0.07407407407407408}
meteor metric
{'meteor': 0.045871559633027525}
#################### Test for pubmed_qa dataset, TF-IDF (BART) model COMPLETE ####################

#################### Testing: pubmed_qa dataset, TF-IDF (LexRank) model ####################
Prediction: ['treatment epcs gw501516 , ppar-δ agonist , induced specifically matrix metallo-proteinase ( mmp ) -9 direct transcriptional activation . roles peroxisome proliferator-activated receptor ( ppar ) -δ vascular biology mainly unknown .']
Target: ['Our results suggest that PPAR-δ is a crucial modulator of angio-myogenesis via the paracrine effects of EPCs, and its agonist is a good candidate as a t

2021-08-18 18:59:00,887 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:59:00,888 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmp15ipf7jq/system and model files to /tmp/tmp15ipf7jq/model.
2021-08-18 18:59:00,889 [MainThread  ] [INFO ]  Processing files in /tmp/tmpa9omdq6c.
2021-08-18 18:59:00,890 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:59:00,891 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp15ipf7jq/system.
2021-08-18 18:59:00,891 [MainThread  ] [INFO ]  Processing files in /tmp/tmpdfxcqc0b.
2021-08-18 18:59:00,892 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:59:00,893 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp15ipf7jq/model.
2021-08-18 18:59:00,894 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpej9vfzpv/rouge_conf.xml
2021-08-18 18:59:00,895 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5089986324310303}
bleu metric
{'bleu': 1.4476896280149854}
rouge metric
{'rouge_1_f_score': 0.13334, 'rouge_2_f_score': 0.0, 'rouge_l_f_score': 0.1}
rougeWE metric


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'rouge_we_3_f': 0.0}
meteor metric
{'meteor': 0.046583850931677016}
#################### Test for pubmed_qa dataset, TF-IDF (LexRank) model COMPLETE ####################

#################### Testing: pubmed_qa dataset, TF-IDF (Longformer) model ####################
Longformer model: processing document of tensor([185]) tokens
Prediction: ['V-couples were able to extubate higher v-cpf and involuntary cough peak flow.\nV-cp fintillation was a controlled cough peak.\nThe study was compared predictive accuracy voluntary cough peak (v-cpF)\nThe researchers were able extubated higher vibrations 106 times.']
Target: ['V-CPF is noninvasive. It is much more accurate than IV-CPF as a predictor of re-intubation in cooperative patients because the IV-CPF may underestimate cough strength in patients with high V-CPF. However, it is unclear which is optimal for use in uncooperative patients.']

bert score metric


2021-08-18 18:59:27,254 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:59:27,255 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmp3madcj83/system and model files to /tmp/tmp3madcj83/model.
2021-08-18 18:59:27,256 [MainThread  ] [INFO ]  Processing files in /tmp/tmpf0s4razc.
2021-08-18 18:59:27,257 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:59:27,258 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp3madcj83/system.
2021-08-18 18:59:27,258 [MainThread  ] [INFO ]  Processing files in /tmp/tmpnmsu2vga.
2021-08-18 18:59:27,259 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:59:27,260 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp3madcj83/model.
2021-08-18 18:59:27,261 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpee3g4is_/rouge_conf.xml
2021-08-18 18:59:27,261 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5620355010032654}
bleu metric
{'bleu': 1.1885227360790502}
rouge metric
{'rouge_1_f_score': 0.15731, 'rouge_2_f_score': 0.04598, 'rouge_l_f_score': 0.15731}
rougeWE metric
{'rouge_we_3_f': 0.1411764705882353}
meteor metric
{'meteor': 0.052083333333333336}
#################### Test for pubmed_qa dataset, TF-IDF (Longformer) model COMPLETE ####################

#################### Testing: pubmed_qa dataset, TF-IDF (Pegasus) model ####################
Prediction: ['Expression of a novel human pancreatic acinar cell receptor has been reported for the first time.']
Target: ['These results indicate that SPINK1 plays a role as a growth factor, signaling through the EGFR pathway in pancreatic ductal adenocarcinoma and neoplasms, and that the EGFR is involved in the malignant transformation of IPMN.']

bert score metric


2021-08-18 18:59:53,412 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 18:59:53,413 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpr9h6i0dy/system and model files to /tmp/tmpr9h6i0dy/model.
2021-08-18 18:59:53,414 [MainThread  ] [INFO ]  Processing files in /tmp/tmpcihtahsj.
2021-08-18 18:59:53,414 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 18:59:53,416 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpr9h6i0dy/system.
2021-08-18 18:59:53,416 [MainThread  ] [INFO ]  Processing files in /tmp/tmpmpoyj9a3.
2021-08-18 18:59:53,417 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 18:59:53,418 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpr9h6i0dy/model.
2021-08-18 18:59:53,419 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpcsuu8lc6/rouge_conf.xml
2021-08-18 18:59:53,420 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5063368082046509}
bleu metric
{'bleu': 0.9943036623640187}
rouge metric
{'rouge_1_f_score': 0.15687, 'rouge_2_f_score': 0.0, 'rouge_l_f_score': 0.11764}
rougeWE metric
{'rouge_we_3_f': 0.0851063829787234}
meteor metric
{'meteor': 0.11173184357541902}
#################### Test for pubmed_qa dataset, TF-IDF (Pegasus) model COMPLETE ####################

#################### Testing: pubmed_qa dataset, TF-IDF (TextRank) model ####################
Prediction: ['therefore , hypothesized , δpwv strongly associated left ventricular mass index ( lvmi ) apwv cpwvd . δpwv 2.4 ± 1.2 m/s ( mean ± sd ) , ranging 0.8 m/s , indicating almost constant arterial stiffness cardiac cycle , 4.4 m/s , reflecting substantial pressure dependency .']
Target: ['The change in arterial stiffness over the cardiac cycle, rather than diastolic stiffness, is independently as

2021-08-18 19:00:16,911 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 19:00:16,912 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpp5grgucd/system and model files to /tmp/tmpp5grgucd/model.
2021-08-18 19:00:16,912 [MainThread  ] [INFO ]  Processing files in /tmp/tmpxt1b_9y4.
2021-08-18 19:00:16,913 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 19:00:16,914 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpp5grgucd/system.
2021-08-18 19:00:16,914 [MainThread  ] [INFO ]  Processing files in /tmp/tmpbnwred7t.
2021-08-18 19:00:16,915 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 19:00:16,916 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpp5grgucd/model.
2021-08-18 19:00:16,916 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpsxq_qnxy/rouge_conf.xml
2021-08-18 19:00:16,917 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5187576413154602}
bleu metric
{'bleu': 4.2168875803062384}
rouge metric
{'rouge_1_f_score': 0.23077, 'rouge_2_f_score': 0.07895, 'rouge_l_f_score': 0.15384}
rougeWE metric
{'rouge_we_3_f': 0.13513513513513511}
meteor metric
{'meteor': 0.12231815803244374}
#################### Test for pubmed_qa dataset, TF-IDF (TextRank) model COMPLETE ####################

#################### Testing: pubmed_qa dataset, BM25 (BART) model ####################
Prediction: ["parkinson's disease ( pd ) chronic progressive neurologic disorder, affects approximately one million men women us alone. pd represents heterogeneous disorder common clinical manifestations, part, common neuropathological findings. pD represents heterogenous disorder common Clinical manifestations, Part, common Neuropathological Findings."]
Target: ['Strategies aimed at maintaining parkin i

2021-08-18 19:00:46,985 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 19:00:46,986 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpns46jgom/system and model files to /tmp/tmpns46jgom/model.
2021-08-18 19:00:46,986 [MainThread  ] [INFO ]  Processing files in /tmp/tmp4hlg9lyr.
2021-08-18 19:00:46,987 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 19:00:46,988 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpns46jgom/system.
2021-08-18 19:00:46,989 [MainThread  ] [INFO ]  Processing files in /tmp/tmp419ba0_2.
2021-08-18 19:00:46,989 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 19:00:46,990 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpns46jgom/model.
2021-08-18 19:00:46,991 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmptt4ik8ox/rouge_conf.xml
2021-08-18 19:00:46,991 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.3538912534713745}
bleu metric
{'bleu': 1.022951633574269}
rouge metric
{'rouge_1_f_score': 0.0, 'rouge_2_f_score': 0.0, 'rouge_l_f_score': 0.0}
rougeWE metric
{'rouge_we_3_f': 0.0}
meteor metric
{'meteor': 0.0}
#################### Test for pubmed_qa dataset, BM25 (BART) model COMPLETE ####################

#################### Testing: pubmed_qa dataset, BM25 (LexRank) model ####################
Prediction: ['examined whether accounting conscious status administrative data improved mortality prediction among patients moderate severe tbi . patients dichotomized no/brief loss consciousness ( loc ) vs extended loc greater 1 hour using international classification diseases , ninth revision ( icd-9 ) fifth digit modifiers .']
Target: ['Accounting for LOC along with anatomical measures of injury severity improves mortality prediction among patients

2021-08-18 19:01:10,049 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 19:01:10,051 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpkfcy1v4x/system and model files to /tmp/tmpkfcy1v4x/model.
2021-08-18 19:01:10,052 [MainThread  ] [INFO ]  Processing files in /tmp/tmp_3km17ae.
2021-08-18 19:01:10,052 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 19:01:10,053 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpkfcy1v4x/system.
2021-08-18 19:01:10,054 [MainThread  ] [INFO ]  Processing files in /tmp/tmp9p099xf7.
2021-08-18 19:01:10,054 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 19:01:10,055 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpkfcy1v4x/model.
2021-08-18 19:01:10,056 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpn8eu5q7p/rouge_conf.xml
2021-08-18 19:01:10,056 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5667684078216553}
bleu metric
{'bleu': 6.407123783602852}
rouge metric
{'rouge_1_f_score': 0.29629, 'rouge_2_f_score': 0.1519, 'rouge_l_f_score': 0.22222}
rougeWE metric


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'rouge_we_3_f': 0.2077922077922078}
meteor metric
{'meteor': 0.12585812356979406}
#################### Test for pubmed_qa dataset, BM25 (LexRank) model COMPLETE ####################

#################### Testing: pubmed_qa dataset, BM25 (Longformer) model ####################
Longformer model: processing document of tensor([98]) tokens
Prediction: ['58 patients studied obese patients.\nThe obese patients were assessed for low calorie diet.\nIn the obese patients, 58 patients were treated obese.\nDiabetes resistance evaluated.\nLow calorie diet was assessed.\nStress-free diet was also assessed.22 patients.']
Target: ['NK cells are significantly increased in IR severely obese people in respect to IS, suggesting a slightly different immune status in these patients with a probable dietary relationship. Weight loss could reverse this increase either after VLCD or after bariatric surgery.']

bert score metric


2021-08-18 19:01:37,955 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 19:01:37,957 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmp5x_uiotu/system and model files to /tmp/tmp5x_uiotu/model.
2021-08-18 19:01:37,958 [MainThread  ] [INFO ]  Processing files in /tmp/tmpaggurjea.
2021-08-18 19:01:37,958 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 19:01:37,959 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp5x_uiotu/system.
2021-08-18 19:01:37,960 [MainThread  ] [INFO ]  Processing files in /tmp/tmpd_odzko_.
2021-08-18 19:01:37,960 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 19:01:37,961 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp5x_uiotu/model.
2021-08-18 19:01:37,962 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpjknkp3br/rouge_conf.xml
2021-08-18 19:01:37,963 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5035908222198486}
bleu metric
{'bleu': 1.188432823684058}
rouge metric
{'rouge_1_f_score': 0.075, 'rouge_2_f_score': 0.0, 'rouge_l_f_score': 0.075}
rougeWE metric
{'rouge_we_3_f': 0.02631578947368421}
meteor metric
{'meteor': 0.04010695187165775}
#################### Test for pubmed_qa dataset, BM25 (Longformer) model COMPLETE ####################

#################### Testing: pubmed_qa dataset, BM25 (Pegasus) model ####################
Prediction: ['The influence of electromyographic signals on muscle contraction and pain is investigated.']
Target: ['Short-term dynamic reorganization of the spatial distribution of muscle activity occurred in response to nociceptive afferent input.']

bert score metric


2021-08-18 19:02:10,095 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 19:02:10,096 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpkt8o2h3h/system and model files to /tmp/tmpkt8o2h3h/model.
2021-08-18 19:02:10,097 [MainThread  ] [INFO ]  Processing files in /tmp/tmpnrxhb32d.
2021-08-18 19:02:10,098 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 19:02:10,099 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpkt8o2h3h/system.
2021-08-18 19:02:10,099 [MainThread  ] [INFO ]  Processing files in /tmp/tmphu3bvv3k.
2021-08-18 19:02:10,100 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 19:02:10,101 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpkt8o2h3h/model.
2021-08-18 19:02:10,102 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmp2r0cot1c/rouge_conf.xml
2021-08-18 19:02:10,103 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5448035597801208}
bleu metric
{'bleu': 2.7673854938424567}
rouge metric
{'rouge_1_f_score': 0.2, 'rouge_2_f_score': 0.0, 'rouge_l_f_score': 0.2}
rougeWE metric
{'rouge_we_3_f': 0.15384615384615385}
meteor metric
{'meteor': 0.12}
#################### Test for pubmed_qa dataset, BM25 (Pegasus) model COMPLETE ####################

#################### Testing: pubmed_qa dataset, BM25 (TextRank) model ####################
Prediction: ['patients evaluated upon admission , day leg crossing , upon discharge , 1 year discharge .']
Target: ['Leg crossing is an easily obtained clinical sign and is independent of additional technical examinations. Leg crossing within the first 15 days after severe stroke indicates a favorable outcome which includes less neurologic deficits, better independence in daily life, and lower rates of death.']

bert score m

2021-08-18 19:02:34,657 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 19:02:34,658 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpbbo7z1bt/system and model files to /tmp/tmpbbo7z1bt/model.
2021-08-18 19:02:34,658 [MainThread  ] [INFO ]  Processing files in /tmp/tmpcuco0l13.
2021-08-18 19:02:34,658 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 19:02:34,659 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpbbo7z1bt/system.
2021-08-18 19:02:34,660 [MainThread  ] [INFO ]  Processing files in /tmp/tmp1c9ibnqq.
2021-08-18 19:02:34,661 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 19:02:34,661 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpbbo7z1bt/model.
2021-08-18 19:02:34,662 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpoj6kcm8o/rouge_conf.xml
2021-08-18 19:02:34,663 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.4858647882938385}
bleu metric
{'bleu': 0.46806954336200896}
rouge metric
{'rouge_1_f_score': 0.07142, 'rouge_2_f_score': 0.03704, 'rouge_l_f_score': 0.07142}
rougeWE metric


Reusing dataset summertime_scisummnet (/home/lily/mmm274/.cache/huggingface/datasets/summertime_scisummnet/default/0.0.0/2de3b585a09db9aaf7f42cebe7c334a94ebae2d8104a752ce286c8a0d3fadaec)


{'rouge_we_3_f': 0.07692307692307693}
meteor metric
{'meteor': 0.16788563829787237}
#################### Test for pubmed_qa dataset, BM25 (TextRank) model COMPLETE ####################


100%|██████████| 808/808 [00:00<00:00, 5996.24it/s]
Reusing dataset summertime_scisummnet (/home/lily/mmm274/.cache/huggingface/datasets/summertime_scisummnet/default/0.0.0/2de3b585a09db9aaf7f42cebe7c334a94ebae2d8104a752ce286c8a0d3fadaec)



ScisummNet has a training set of 808 examples
Initializing all matching model pipelines for ScisummNet dataset...


  0%|          | 0/808 [00:00<?, ?it/s]You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
 12%|█▏        | 99/808 [01:03<07:37,  1.55it/s]


#################### Testing: ScisummNet dataset, BART model ####################
Prediction: ['For a training corpus with 10,000 sentence pairs we increase the coverage of unique test set unigrams from 48% to 90%. More than half of the newly covered items accurately translated, as opposed to none in current approaches. We show that upon encountering an unknown source phrase, we can substitute a paraphrase for it and then proceed using the translation of that paraphrase.']
Target: ['Improved Statistical Machine Translation Using Paraphrases\nParallel corpora are crucial for training SMT systems.\nHowever, for many language pairs they are available only in very limited quantities.\nFor these language pairs a huge portion of phrases encountered at run-time will be unknown.\nWe show how techniques from paraphrasing can be used to deal with these otherwise unknown source language phrases.\nOur results show that augmenting a state-of-the-art SMT system with paraphrases leads to sig

2021-08-18 19:04:32,961 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 19:04:32,962 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpdommb5zv/system and model files to /tmp/tmpdommb5zv/model.
2021-08-18 19:04:32,963 [MainThread  ] [INFO ]  Processing files in /tmp/tmpyr3v78io.
2021-08-18 19:04:32,963 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 19:04:32,964 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpdommb5zv/system.
2021-08-18 19:04:32,965 [MainThread  ] [INFO ]  Processing files in /tmp/tmpf94is5so.
2021-08-18 19:04:32,966 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 19:04:32,967 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpdommb5zv/model.
2021-08-18 19:04:32,967 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmp6mx5z84v/rouge_conf.xml
2021-08-18 19:04:32,968 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.7273702025413513}
bleu metric
{'bleu': 17.512927724582575}
rouge metric
{'rouge_1_f_score': 0.55111, 'rouge_2_f_score': 0.4574, 'rouge_l_f_score': 0.53333}
rougeWE metric
{'rouge_we_3_f': 0.47963800904977366}
meteor metric
{'meteor': 0.7080104819116995}
#################### Test for ScisummNet dataset, BART model COMPLETE ####################

#################### Testing: ScisummNet dataset, LexRank model ####################
Prediction: ['entropy-based measures, as in (Redlich, 1993).</S>\n    <S sid="44" ssid="38">Word induction from natural language text without word boundaries is also studied in (Deligne and Bimbot, 1997; Hua, 2000), where MDL-based model optimization measures are used.</S>\n    <S sid="45" ssid="39">Viterbi or the forward-backward algorithm (an EM algorithm) is used for improving the segmentation of the corpus2.</S>\n   

2021-08-18 19:04:56,295 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 19:04:56,296 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpmanwigr9/system and model files to /tmp/tmpmanwigr9/model.
2021-08-18 19:04:56,297 [MainThread  ] [INFO ]  Processing files in /tmp/tmpuaji__a0.
2021-08-18 19:04:56,298 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 19:04:56,299 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpmanwigr9/system.
2021-08-18 19:04:56,300 [MainThread  ] [INFO ]  Processing files in /tmp/tmp2pnyusik.
2021-08-18 19:04:56,301 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 19:04:56,302 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpmanwigr9/model.
2021-08-18 19:04:56,303 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpilv6_41y/rouge_conf.xml
2021-08-18 19:04:56,303 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5092178583145142}
bleu metric
{'bleu': 0.8733969433383787}
rouge metric
{'rouge_1_f_score': 0.05796, 'rouge_2_f_score': 0.0, 'rouge_l_f_score': 0.05796}
rougeWE metric
{'rouge_we_3_f': 0.10474631751227495}
meteor metric
{'meteor': 0.04660400472057366}
#################### Test for ScisummNet dataset, LexRank model COMPLETE ####################

#################### Testing: ScisummNet dataset, Longformer model ####################


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Longformer model: processing document of tensor([4096]) tokens
Prediction: ['Theorem 2 states that disagreement upper bounds error.\nTheorem predicts that disagreement between rules and the core of the language.\nIt is not difficult to prove that a classifier has low generalization error.2222;.\nA simple example is the number of examples of the number 1 and 0;.']
Target: ["Bootstrapping\nThis paper refines the analysis of co-training, defines and evaluates a new co-training algorithm that has theoretical justification, gives a theoretical justification for the Yarowsky algorithm, and shows that co-training and the Yarowsky algorithm are based on different independence assumptions.\nWe show that the independence assumption can be relaxed, and co-training is still effective under a weaker independence assumption.\nWe refine Dasgupta et al's result by relaxing the view independence assumption with a new constraint.\nWe propose the Greedy Agreement Algorithm, which, based on two independen

2021-08-18 19:05:28,826 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 19:05:28,827 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpgb4iin0s/system and model files to /tmp/tmpgb4iin0s/model.
2021-08-18 19:05:28,828 [MainThread  ] [INFO ]  Processing files in /tmp/tmp_8go2dm_.
2021-08-18 19:05:28,829 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 19:05:28,830 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpgb4iin0s/system.
2021-08-18 19:05:28,830 [MainThread  ] [INFO ]  Processing files in /tmp/tmpfsj5h24w.
2021-08-18 19:05:28,831 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 19:05:28,832 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpgb4iin0s/model.
2021-08-18 19:05:28,833 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpsz0gevon/rouge_conf.xml
2021-08-18 19:05:28,833 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.5065644383430481}
bleu metric
{'bleu': 0.3717885486559232}
rouge metric
{'rouge_1_f_score': 0.18433, 'rouge_2_f_score': 0.03721, 'rouge_l_f_score': 0.17512}
rougeWE metric
{'rouge_we_3_f': 0.12206572769953052}
meteor metric
{'meteor': 0.23439483259526697}
#################### Test for ScisummNet dataset, Longformer model COMPLETE ####################

#################### Testing: ScisummNet dataset, Pegasus model ####################
Prediction: ['Thesaurus extraction has traditionally been used in retrieval tasks to expand words in queries with synonymous terms.']
Target: ['Improvements In Automatic Thesaurus Extraction\nThe use of semantic resources is common in modern NLP systems, but methods to extract lexical semantics have only recently begun to perform well enough for practical use.\nWe evaluate existing and new similarity metrics for 

2021-08-18 19:05:59,511 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 19:05:59,513 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpqrccxfe7/system and model files to /tmp/tmpqrccxfe7/model.
2021-08-18 19:05:59,513 [MainThread  ] [INFO ]  Processing files in /tmp/tmpsy5j7v6w.
2021-08-18 19:05:59,514 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 19:05:59,515 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpqrccxfe7/system.
2021-08-18 19:05:59,516 [MainThread  ] [INFO ]  Processing files in /tmp/tmp3eo2i986.
2021-08-18 19:05:59,516 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 19:05:59,517 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpqrccxfe7/model.
2021-08-18 19:05:59,518 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmphlpts20i/rouge_conf.xml
2021-08-18 19:05:59,519 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.4973493218421936}
bleu metric
{'bleu': 0.0017876429458869197}
rouge metric
{'rouge_1_f_score': 0.09756, 'rouge_2_f_score': 0.01235, 'rouge_l_f_score': 0.09756}
rougeWE metric
{'rouge_we_3_f': 0.012499999999999999}
meteor metric
{'meteor': 0.13422818791946312}
#################### Test for ScisummNet dataset, Pegasus model COMPLETE ####################

#################### Testing: ScisummNet dataset, TextRank model ####################
Prediction: ['<PAPER>\n  <S sid="0">Accurate Information Extraction From Research Papers Using Conditional Random Fields</S>\n  <ABSTRACT>\n    <S sid="1" ssid="1">With the increasing use of research paper search engines, such as CiteSeer, for both literature search and hiring decisions, the accuracy of such systems is of paramount importance.</S>\n    <S sid="2" ssid="2">This paper employs Conditional Random F

2021-08-18 19:06:24,137 [MainThread  ] [INFO ]  Writing summaries.
2021-08-18 19:06:24,138 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmptz0nywut/system and model files to /tmp/tmptz0nywut/model.
2021-08-18 19:06:24,139 [MainThread  ] [INFO ]  Processing files in /tmp/tmp36dqkwju.
2021-08-18 19:06:24,140 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-08-18 19:06:24,141 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmptz0nywut/system.
2021-08-18 19:06:24,142 [MainThread  ] [INFO ]  Processing files in /tmp/tmpw8ef_y27.
2021-08-18 19:06:24,142 [MainThread  ] [INFO ]  Processing model.A.0.txt.
2021-08-18 19:06:24,143 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmptz0nywut/model.
2021-08-18 19:06:24,144 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpabv30ias/rouge_conf.xml
2021-08-18 19:06:24,144 [MainThread  ] [INFO ]  Running ROUGE with command /home/lily/mmm274/anaconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1

hash_code: bert-base-uncased_L8_no-idf_version=0.3.9(hug_trans=4.5.1)
{'bert_score_f1': 0.68541020154953}
bleu metric
{'bleu': 43.512402237542766}
rouge metric
{'rouge_1_f_score': 0.0, 'rouge_2_f_score': 0.0, 'rouge_l_f_score': 0.0}
rougeWE metric
{'rouge_we_3_f': 0.7002398081534772}
meteor metric
{'meteor': 0.6541982424289587}
#################### Test for ScisummNet dataset, TextRank model COMPLETE ####################


Reusing dataset summertime_summscreen (/home/lily/mmm274/.cache/huggingface/datasets/summertime_summscreen/default/0.0.0/3b7dab2d730657a8545307f56c2e997a817186018ef6141b5699deebdc103573)
100%|██████████| 22588/22588 [00:15<00:00, 1446.36it/s]
Reusing dataset summertime_summscreen (/home/lily/mmm274/.cache/huggingface/datasets/summertime_summscreen/default/0.0.0/3b7dab2d730657a8545307f56c2e997a817186018ef6141b5699deebdc103573)



SummScreen_fd+tms_tokenized has a training set of 22588 examples
Initializing all matching model pipelines for SummScreen_fd+tms_tokenized dataset...


  0%|          | 0/22588 [00:00<?, ?it/s]You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
usage: ipykernel_launcher.py [-h] [--command COMMAND] [--conf_file CONF_FILE] [--PYLEARN_M

<unittest.main.TestProgram at 0x7f68892a2610>