# SummerTime Midway Showcase

#### This notebook shows part of the current functionality of SummerTime, a summarization library.

Note: This is not the production version of the library. More modules are being added, including but not limited to pip installation, more models, and more datasets.

## Installation
For now, install the library by cloning it from GitHub; `pip install` support is planned.

In [1]:
## Uncomment to clone the git repo if you have not already done so
## Switch to the SummerTime directory
## Switch to the relevant git branch

# !git clone https://github.com/Yale-LILY/SummerTime.git
# %cd SummerTime/
# !git checkout origin/troyfeng116/integration-tests

/data/lily/mmm274/SummerTime/notebook/SummerTime
HEAD is now at 6b48c8b Merge branch 'main' into troyfeng116/integration-tests


### Install dependencies for the library

In [7]:
!pip install -r requirements.txt

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
Collecting lexrank==0.1.0
  Using cached lexrank-0.1.0-py3-none-any.whl (69 kB)
Collecting datasets==1.6.2
  Using cached datasets-1.6.2-py3-none-any.whl (221 kB)
Collecting gensim==3.8.3
  Using cached gensim-3.8.3-cp38-cp38-manylinux1_x86_64.whl (24.2 MB)






Installing collected packages: lexrank, gensim, en-core-web-sm, datasets
Successfully installed datasets-1.6.2 en-core-web-sm-3.0.0 gensim-3.8.3 lexrank-0.1.0


In [8]:
## Finish setup

# Setup ROUGE
# Use %env instead of !export: `!` runs in a temporary subshell, so the variable would not persist
%env ROUGE_HOME=/usr/local/lib/python3.7/dist-packages/summ_eval/ROUGE-1.5.5/
!pip install -U  git+https://github.com/bheinzerling/pyrouge.git

Collecting git+https://github.com/bheinzerling/pyrouge.git
  Cloning https://github.com/bheinzerling/pyrouge.git to /tmp/pip-req-build-ant4nznt
  Running command git clone -q https://github.com/bheinzerling/pyrouge.git /tmp/pip-req-build-ant4nznt
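
A note on the `ROUGE_HOME` step: notebook commands prefixed with `!` run in a temporary subshell, so a variable exported there is gone once the command returns. The sketch below illustrates the difference. The variable name `DEMO_ROUGE_HOME` and the path `/tmp/rouge` are hypothetical and chosen only for illustration, and the example assumes a POSIX shell is available.

```python
import os
import subprocess
import sys

# Start from a clean slate (hypothetical variable name, for illustration only)
os.environ.pop("DEMO_ROUGE_HOME", None)

# Exporting inside a child shell does not affect the current Python process
subprocess.run("export DEMO_ROUGE_HOME=/tmp/rouge", shell=True, check=True)
leaked = os.environ.get("DEMO_ROUGE_HOME")
print(leaked)

# Setting the variable on the current process makes it visible here
# and in every child process started afterwards
os.environ["DEMO_ROUGE_HOME"] = "/tmp/rouge"
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['DEMO_ROUGE_HOME'])"],
    capture_output=True,
    text=True,
    check=True,
)
print(child.stdout.strip())
```

In a notebook, the `%env NAME=value` magic has the same effect as assigning to `os.environ`.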


In [None]:
## Uncomment to restart runtime if prompted to do so in either of the two previous cells; else ignore
## Restart runtime to install modules
# import os
# os.kill(os.getpid(), 9)

In [3]:
## Uncomment to move back into the SummerTime directory if you restarted the runtime
## Only needed if you cloned the repo
# %cd SummerTime/

In [5]:
# import modules for this notebook

from pprint import pprint
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/lily/mmm274/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Models

### Supported Models

SummerTime supports different models (*e.g.,* TextRank, BART, Longformer) as well as model wrappers for more complex summarization tasks (*e.g.,* JointModel for multi-doc summarization, BM25 retrieval for query-based summarization).
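
To make the retrieval idea behind query-based summarization more concrete, here is a minimal, self-contained sketch of Okapi BM25 scoring: passages are ranked by relevance to a query before any summarization happens. This is not SummerTime's `BM25SummModel` implementation. The function name `bm25_scores`, the whitespace tokenizer, and the defaults `k1=1.5` and `b=0.75` are assumptions made for this example.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, passages: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each passage against the query with the Okapi BM25 formula."""
    if not passages:
        return []
    tokenized = [p.lower().split() for p in passages]
    n = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n
    # Document frequency: number of passages that contain each term
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalized by passage length via b
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


passages = [
    "the budget meeting covered marketing costs",
    "the team discussed the remote control design",
    "lunch will be served at noon",
]
scores = bm25_scores("remote control design", passages)
best = max(range(len(passages)), key=scores.__getitem__)
print(passages[best])
```

Only the second passage shares terms with the query, so it is the one selected; the other two passages score zero.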

In [8]:
from model import SUPPORTED_SUMM_MODELS

pprint(SUPPORTED_SUMM_MODELS)

[<class 'model.single_doc.bart_model.BartModel'>,
 <class 'model.single_doc.lexrank_model.LexRankModel'>,
 <class 'model.single_doc.longformer_model.LongformerModel'>,
 <class 'model.single_doc.pegasus_model.PegasusModel'>,
 <class 'model.single_doc.textrank_model.TextRankModel'>,
 <class 'model.multi_doc.multi_doc_joint_model.MultiDocJointModel'>,
 <class 'model.multi_doc.multi_doc_separate_model.MultiDocSeparateModel'>,
 <class 'model.dialogue.hmnet_model.HMNetModel'>,
 <class 'model.query_based.tf_idf_model.TFIDFSummModel'>,
 <class 'model.query_based.bm25_model.BM25SummModel'>]


### Automatic Pipeline Assembly

### Model selection

In [9]:
import model

# Users can load a default summarization model
sample_model = model.summarizer()

In [10]:
from model import SUPPORTED_SUMM_MODELS, LexRankModel, PegasusModel

# Or a specific model
pegasus = PegasusModel()

In [11]:
# Users can easily access documentation to assist with model selection
sample_model.show_capability()

Pegasus is the default singe-document summarization model.
Pegasus is a abstractive, neural model for summarization. 
 #################### 
 Introduced in 2019, a large neural abstractive summarization model trained on web crawl and news data.
 Strengths: 
 - High accuracy 
 - Performs well on almost all kinds of non-literary written text 
 Weaknesses: 
 - High memory usage 
 Initialization arguments: 
 - `device = 'cpu'` specifies the device the model is stored on and uses for computation. Use `device='gpu'` to run on an Nvidia GPU.


### Inference

In [12]:
documents = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. 
    The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected 
    by the shutoffs which were expected to last through at least midday tomorrow."""
]

sample_model.summarize(documents)

["California's largest electricity provider has turned off power to hundreds of thousands of customers."]

# Datasets

### Datasets supported

SummerTime supports summarization datasets across different domains (*e.g.,* CNNDM - news article corpus, SAMSum - dialogue corpus, QMSum - query-based dialogue corpus, MultiNews - multi-document corpus, MLSum - multilingual corpus, PubMedQA - medical domain, arXiv - scientific papers domain, among others).

In [13]:
from dataset import SUPPORTED_SUMM_DATASETS

pprint(SUPPORTED_SUMM_DATASETS)

[<class 'dataset.huggingface_datasets.CnndmDataset'>,
 <class 'dataset.huggingface_datasets.MultinewsDataset'>,
 <class 'dataset.huggingface_datasets.SamsumDataset'>,
 <class 'dataset.huggingface_datasets.XsumDataset'>,
 <class 'dataset.huggingface_datasets.PubmedqaDataset'>,
 <class 'dataset.huggingface_datasets.MlsumDataset'>,
 <class 'dataset.non_huggingface_datasets.ScisummnetDataset'>,
 <class 'dataset.non_huggingface_datasets.SummscreenDataset'>,
 <class 'dataset.non_huggingface_datasets.QMsumDataset'>,
 <class 'dataset.non_huggingface_datasets.ArxivDataset'>]


### Dataset Initialization

In [14]:
import dataset

cnn_dataset = dataset.CnndmDataset()

Reusing dataset cnn_dailymail (/home/lily/mmm274/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234)


In [15]:
# Data is loaded using a generator to save on space and time

data_instance = next(cnn_dataset.train_set)
print("\n")
pprint(str(data_instance))

  0%|          | 0/287113 [00:00<?, ?it/s]



("{'source': 'It\\'s official: U.S. President Barack Obama wants lawmakers to "
 'weigh in on whether to use military force in Syria. Obama sent a letter to '
 'the heads of the House and Senate on Saturday night, hours after announcing '
 'that he believes military action against Syrian targets is the right step to '
 'take over the alleged use of chemical weapons. The proposed legislation from '
 'Obama asks Congress to approve the use of military force "to deter, disrupt, '
 'prevent and degrade the potential for future uses of chemical weapons or '
 'other weapons of mass destruction." It\\\'s a step that is set to turn an '
 'international crisis into a fierce domestic political battle. There are key '
 'questions looming over the debate: What did U.N. weapons inspectors find in '
 'Syria? What happens if Congress votes no? And how will the Syrian government '
 'react? In a televised address from the White House Rose Garden earlier '
 'Saturday, the president said he would take 
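
Because each split is exposed as a generator, records are produced one at a time and every call consumes from the same stream. The sketch below shows that behaviour with a hypothetical stand-in generator (`lazy_instances`, not part of SummerTime) and `itertools.islice`.

```python
import itertools
from typing import Dict, Iterator


def lazy_instances(n: int) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a dataset split: yields one record at a time."""
    for i in range(n):
        yield {"source": f"document {i}", "summary": f"summary {i}"}


# Nothing is materialized here; records are created only when requested
train_set = lazy_instances(1_000_000)

# next() consumes the first record
first = next(train_set)

# islice() takes the next five records without loading the rest
batch = list(itertools.islice(train_set, 5))

print(first["source"])
print([record["source"] for record in batch])
```

Note that the slice starts where the previous read stopped: `batch` holds documents 1 through 5, not 0 through 4. Slices taken from a split in later cells should be expected to continue from the current position in the same way, rather than restart at the first record.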

### A non-neural model
Below, we train an unsupervised, non-neural summarizer (LexRank) with a small subset of the cnn_dailymail dataset:

In [16]:
import itertools

# Get a slice of the train set - first 5 instances
train_set = itertools.islice(cnn_dataset.train_set, 5)

corpus = [instance.source for instance in train_set]
pprint(corpus)

trad_model = LexRankModel(corpus)

['(CNN) -- Usain Bolt rounded off the world championships Sunday by claiming '
 "his third gold in Moscow as he anchored Jamaica to victory in the men's "
 '4x100m relay. The fastest man in the world charged clear of United States '
 'rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar '
 'Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. The U.S finished '
 'second in 37.56 seconds with Canada taking the bronze after Britain were '
 'disqualified for a faulty handover. The 26-year-old Bolt has now collected '
 'eight gold medals at world championships, equaling the record held by '
 'American trio Carl Lewis, Michael Johnson and Allyson Felix, not to mention '
 'the small matter of six Olympic titles. The relay triumph followed '
 'individual successes in the 100 and 200 meters in the Russian capital. "I\'m '
 "proud of myself and I'll continue to work to dominate for as long as "
 'possible," Bolt said, having previously expressed his intention to carry on '


In [17]:
# Inference
text = [next(cnn_dataset.test_set).source]
pprint(text)

summary = trad_model.summarize(text)
pprint(summary)


  0%|          | 0/11490 [00:00<?, ?it/s]

['(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. '
 'Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. '
 'He was 88. Best died in hospice in Hickory, North Carolina, of complications '
 'from pneumonia, said Steve Latshaw, a longtime friend and Hollywood '
 "colleague. Although he'd been a busy actor for decades in theater and in "
 'Hollywood, Best didn\'t become famous until 1979, when "The Dukes of '
 'Hazzard\'s" cornpone charms began beaming into millions of American homes '
 "almost every Friday night. For seven seasons, Best's Rosco P. Coltrane "
 'chased the moonshine-running Duke boys back and forth across the back roads '
 'of fictitious Hazzard County, Georgia, although his "hot pursuit" usually '
 'ended with him crashing his patrol car. Although Rosco was slow-witted and '
 'corrupt, Best gave him a childlike enthusiasm that got laughs and made him '
 'endearing. His character became known for his distinctive "kew-kew

In [18]:
# More about lexrank

trad_model.show_capability()

LexRank is a extractive, non-neural model for summarization. 
 #################### 
 Works by using a graph-based method to identify the most salient sentences in the document. 
Strengths: 
 - Fast with low memory usage 
 - Allows for control of summary length 
 Weaknesses: 
 - Not as accurate as neural methods. 
 Initialization arguments: 
 - `corpus`: Unlabelled corpus of documents. ` 
 - `summary_length`: sentence length of summaries 
 - `threshold`: Level of salience required for sentence to be included in summary.
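
The graph-based idea can be sketched in a few lines: sentences are nodes, edges are weighted by similarity, and a PageRank-style iteration scores each sentence by how strongly it is connected to the others. The code below is a simplified illustration, not the `lexrank` package or SummerTime's `LexRankModel`. It uses raw term counts instead of the idf-weighted cosine of the original method, and the function names and defaults (`damping=0.85`, `iterations=50`) are assumptions for this example.

```python
import math
import re
from collections import Counter
from typing import List, Optional


def tokenize(sentence: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def lexrank_scores(
    sentences: List[str],
    threshold: Optional[float] = None,
    damping: float = 0.85,
    iterations: int = 50,
) -> List[float]:
    """Return one salience score per sentence (higher means more central)."""
    n = len(sentences)
    if n == 0:
        return []
    vectors = [Counter(tokenize(s)) for s in sentences]
    # Build the similarity graph; with a threshold, edges become binary
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sim = cosine(vectors[i], vectors[j])
            if threshold is None:
                weights[i][j] = sim
            elif sim >= threshold:
                weights[i][j] = 1.0
    degree = [sum(row) for row in weights]
    # Power iteration: each sentence passes its score to its neighbours
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(weights[j][i] * scores[j] / degree[j] for j in range(n) if degree[j] > 0)
            for i in range(n)
        ]
    return scores


sentences = [
    "The storm knocked out power across the city.",
    "Power was out across the city after the storm.",
    "Crews restored power to the city overnight.",
    "A local bakery won a national award.",
]
scores = lexrank_scores(sentences)
best = max(range(len(sentences)), key=scores.__getitem__)
print(sentences[best])
```

The bakery sentence shares no vocabulary with the rest, so it receives the lowest score. An extractive summary of a given length would then keep the top-scoring sentences.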


A spaCy pipeline for TextRank (another non-neural extractive summarization model):

In [19]:
# TextRank model
textrank = model.TextRankModel()

In [20]:
textrank_summary = textrank.summarize(text[0:1])
pprint(textrank_summary)

["For seven seasons, Best's Rosco P. Coltrane chased the moonshine-running "
 'Duke boys back and forth across the back roads of fictitious Hazzard County, '
 'Georgia, although his "hot pursuit" usually ended with him crashing his '
 'patrol car.']


In [21]:
# More about TextRank
textrank.show_capability()

TextRank is a extractive, non-neural model for summarization. 
 #################### 
 A graphbased ranking model for text processing. Extractive sentence summarization. 
 Strengths: 
 - Fast with low memory usage 
 - Allows for control of summary length 
 Weaknesses: 
 - Not as accurate as neural methods.


In [22]:
# Longformer2Roberta
longformer = model.LongformerModel()

You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [23]:
long_article = """(CNN)James Holmes made his introduction to the world in a Colorado cinema filled with spectators watching a midnight showing of the new Batman movie, "The Dark Knight Rises," in June 2012. The moment became one of the deadliest shootings in U.S. history. Holmes is accused of opening fire on the crowd, killing 12 people and injuring or maiming 70 others in Aurora, a suburb of Denver. Holmes appeared like a comic book character: He resembled the Joker, with red-orange hair, similar to the late actor Heath Ledger\'s portrayal of the villain in an earlier Batman movie, authorities said. But Holmes was hardly a cartoon. Authorities said he wore body armor and carried several guns, including an AR-15 rifle, with lots of ammo. He also wore a gas mask. Holmes says he was insane at the time of the shootings, and that is his legal defense and court plea: not guilty by reason of insanity. Prosecutors aren\'t swayed and will seek the death penalty. Opening statements in his trial are scheduled to begin Monday. Holmes admits to the shootings but says he was suffering "a psychotic episode" at the time,  according to court papers filed in July 2013 by the state public defenders, Daniel King and Tamara A. Brady. Evidence "revealed thus far in the case supports the defense\'s position that Mr. Holmes suffers from a severe mental illness and was in the throes of a psychotic episode when he committed the acts that resulted in the tragic loss of life and injuries sustained by moviegoers on July 20, 2012," the public defenders wrote. Holmes no longer looks like a dazed Joker, as he did in his first appearance before a judge in 2012. He appeared dramatically different in January when jury selection began for his trial: 9,000 potential jurors were summoned for duty, described as one of the nation\'s largest jury calls. Holmes now has a cleaner look, with a mustache, button-down shirt and khaki pants. In January, he had a beard and eyeglasses. 
If this new image sounds like one of an academician, it may be because Holmes, now 27, once was one. Just before the shooting, Holmes was a doctoral student in neuroscience, and he was studying how the brain works, with his schooling funded by a U.S. government grant. Yet for all his learning, Holmes apparently lacked the capacity to command his own mind, according to the case against him. A jury will ultimately decide Holmes\' fate. That panel is made up of 12 jurors and 12 alternates. They are 19 women and five men, and almost all are white and middle-aged. The trial could last until autumn. When jury summonses were issued in January, each potential juror stood a 0.2% chance of being selected, District Attorney George Brauchler told the final jury this month. He described the approaching trial as "four to five months of a horrible roller coaster through the worst haunted house you can imagine." The jury will have to render verdicts on each of the 165 counts against Holmes, including murder and attempted murder charges. Meanwhile, victims and their relatives are challenging all media outlets "to stop the gratuitous use of the name and likeness of mass killers, thereby depriving violent individuals the media celebrity and media spotlight they so crave," the No Notoriety group says. They are joined by victims from eight other mass shootings in recent U.S. history. Raised in central coastal California and in San Diego, James Eagan Holmes is the son of a mathematician father noted for his work at the FICO firm that provides credit scores and a registered nurse mother, according to the U-T San Diego newspaper. Holmes also has a sister, Chris, a musician, who\'s five years younger, the newspaper said. His childhood classmates remember him as a clean-cut, bespectacled boy with an "exemplary" character who "never gave any trouble, and never got in trouble himself," The Salinas Californian reported. 
His family then moved down the California coast, where Holmes grew up in the San Diego-area neighborhood of Rancho Peñasquitos, which a neighbor described as "kind of like Mayberry," the San Diego newspaper said. Holmes attended Westview High School, which says its school district sits in "a primarily middle- to upper-middle-income residential community." There, Holmes ran cross-country, played soccer and later worked at a biotechnology internship at the Salk Institute and Miramar College, which attracts academically talented students. By then, his peers described him as standoffish and a bit of a wiseacre, the San Diego newspaper said. Holmes attended college fairly close to home, in a neighboring area known as Southern California\'s "inland empire" because it\'s more than an hour\'s drive from the coast, in a warm, low-desert climate. He entered the University of California, Riverside, in 2006 as a scholarship student. In 2008 he was a summer camp counselor for disadvantaged children, age 7 to 14, at Camp Max Straus, run by Jewish Big Brothers Big Sisters of Los Angeles. He graduated from UC Riverside in 2010 with the highest honors and a bachelor\'s degree in neuroscience. "Academically, he was at the top of the top," Chancellor Timothy P. White said. He seemed destined for even higher achievement. By 2011, he had enrolled as a doctoral student in the neuroscience program at the University of Colorado Anschutz Medical Campus in Aurora, the largest academic health center in the Rocky Mountain region. The doctoral in neuroscience program attended by Holmes focuses on how the brain works, with an emphasis on processing of information, behavior, learning and memory. Holmes was one of six pre-thesis Ph.D. students in the program who were awarded a neuroscience training grant from the National Institutes of Health. The grant rewards outstanding neuroscientists who will make major contributions to neurobiology. 
A syllabus that listed Holmes as a student at the medical school shows he was to have delivered a presentation about microRNA biomarkers. But Holmes struggled, and his own mental health took an ominous turn. In March 2012, he told a classmate he wanted to kill people, and that he would do so "when his life was over," court documents said. Holmes was "denied access to the school after June 12, 2012, after he made threats to a professor," according to court documents. About that time, Holmes was a patient of University of Colorado psychiatrist Lynne Fenton. Fenton was so concerned about Holmes\' behavior that she mentioned it to her colleagues, saying he could be a danger to others, CNN affiliate KMGH-TV reported, citing sources with knowledge of the investigation. Fenton\'s concerns surfaced in early June, sources told the Denver station. Holmes began to fantasize about killing "a lot of people" in early June, nearly six weeks before the shootings, the station reported, citing unidentified sources familiar with the investigation. Holmes\' psychiatrist contacted several members of a "behavioral evaluation and threat assessment" team to say Holmes could be a danger to others, the station reported. At issue was whether to order Holmes held for 72 hours to be evaluated by mental health professionals, the station reported. "Fenton made initial phone calls about engaging the BETA team" in "the first 10 days" of June, but it "never came together" because in the period Fenton was having conversations with team members, Holmes began the process of dropping out of school, a source told KMGH. Defense attorneys have rejected the prosecution\'s assertions that Holmes was barred from campus. Citing statements from the university, Holmes\' attorneys have argued that his access was revoked because that\'s normal procedure when a student drops enrollment. What caused this turn for the worse for Holmes has yet to be clearly detailed. 
In the months before the shooting, he bought four weapons and more than 6,000 rounds of ammunition, authorities said. Police said he also booby-trapped his third-floor apartment with explosives, but police weren\'t fooled. After Holmes was caught in the cinema parking lot immediately after the shooting, bomb technicians went to the apartment and neutralized the explosives. No one was injured at the apartment building. Nine minutes before Holmes went into the movie theater, he called a University of Colorado switchboard, public defender Brady has said in court. The number he called can be used to get in contact with faculty members during off hours, Brady said. Court documents have also revealed that investigators have obtained text messages that Holmes exchanged with someone before the shooting. That person was not named, and the content of the texts has not been made public. According to The New York Times, Holmes sent a text message to a fellow graduate student, a woman, about two weeks before the shooting. She asked if he had left Aurora yet, reported the newspaper, which didn\'t identify her. No, he had two months left on his lease, Holmes wrote back, according to the Times. He asked if she had heard of "dysphoric mania," a form of bipolar disorder marked by the highs of mania and the dark and sometimes paranoid delusions of major depression. The woman asked if the disorder could be managed with treatment. "It was," Holmes wrote her, according to the Times. But he warned she should stay away from him "because I am bad news," the newspaper reported. It was her last contact with Holmes. After the shooting, Holmes\' family issued a brief statement: "Our hearts go out to those who were involved in this tragedy and to the families and friends of those involved," they said, without giving any information about their son. Since then, prosecutors have refused to offer a plea deal to Holmes. For Holmes, "justice is death," said Brauchler, the district attorney. 
In December, Holmes\' parents, who will be attending the trial, issued another statement: They asked that their son\'s life be spared and that he be sent to an institution for mentally ill people for the rest of his life, if he\'s found not guilty by reason of insanity. "He is not a monster," Robert and Arlene Holmes wrote, saying the death penalty is "morally wrong, especially when the condemned is mentally ill." "He is a human being gripped by a severe mental illness," the parents said. The matter will be settled by the jury. CNN\'s Ana Cabrera and Sara Weisfeldt contributed to this report from Denver."""
pprint(long_article)

('(CNN)James Holmes made his introduction to the world in a Colorado cinema '
 'filled with spectators watching a midnight showing of the new Batman movie, '
 '"The Dark Knight Rises," in June 2012. The moment became one of the '
 'deadliest shootings in U.S. history. Holmes is accused of opening fire on '
 'the crowd, killing 12 people and injuring or maiming 70 others in Aurora, a '
 'suburb of Denver. Holmes appeared like a comic book character: He resembled '
 "the Joker, with red-orange hair, similar to the late actor Heath Ledger's "
 'portrayal of the villain in an earlier Batman movie, authorities said. But '
 'Holmes was hardly a cartoon. Authorities said he wore body armor and carried '
 'several guns, including an AR-15 rifle, with lots of ammo. He also wore a '
 'gas mask. Holmes says he was insane at the time of the shootings, and that '
 'is his legal defense and court plea: not guilty by reason of insanity. '
 "Prosecutors aren't swayed and will seek the death penalty. O

In [24]:
longformer_summary = longformer.summarize([long_article])
pprint(longformer_summary)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Longformer model: processing document of tensor([2124]) tokens
['James Holmes, 27, is accused of opening fire on a Colorado theater.\n'
 'He was a doctoral student at University of Colorado.\n'
 'Holmes says he was suffering "a psychotic episode" at the time of the '
 'shooting.\n'
 "Prosecutors won't say whether Holmes was barred from campus."]


In [25]:
longformer.show_capability()

Longformer is a abstractive, neural model for summarization. 
 #################### 
 A Longformer2Roberta model finetuned on CNN-DM dataset for summarization.

Strengths:
 - Correctly handles longer (> 2000 tokens) corpus.

Weaknesses:
 - Less accurate on contexts outside training domain.

Initialization arguments:
  - `corpus`: Unlabelled corpus of documents.



# Evaluation

### Supported Evaluation Metrics

SummerTime supports different evaluation metrics (*e.g.,* ROUGE, BLEU, BERTScore, METEOR).

In [26]:
from evaluation import SUPPORTED_EVALUATION_METRICS

pprint(SUPPORTED_EVALUATION_METRICS)

[<class 'evaluation.bertscore_metric.BertScore'>,
 <class 'evaluation.bleu_metric.Bleu'>,
 <class 'evaluation.rouge_metric.Rouge'>,
 <class 'evaluation.rougewe_metric.RougeWe'>,
 <class 'evaluation.meteor_metric.Meteor'>]


In [27]:
from evaluation.base_metric import SummMetric
from evaluation import Rouge, RougeWe, BertScore

import itertools

# Initializes a bertscore metric object
metric = BertScore()

# Evaluates model on subset of cnn_dailymail
# Get a slice of the train set - first 5 instances
train_set = itertools.islice(cnn_dataset.train_set, 5)

corpus = [instance for instance in train_set]
pprint(corpus)

articles = [instance.source for instance in corpus]

summaries = sample_model.summarize(articles)
targets = [instance.summary for instance in corpus]



  0%|          | 6/287113 [00:55<740:45:34,  9.29s/it]

[<dataset.st_dataset.SummInstance object at 0x7f8f3fc39820>,
 <dataset.st_dataset.SummInstance object at 0x7f90901aec40>,
 <dataset.st_dataset.SummInstance object at 0x7f8f3fc398b0>,
 <dataset.st_dataset.SummInstance object at 0x7f8f3fc39880>,
 <dataset.st_dataset.SummInstance object at 0x7f8f3fc397c0>]


In [28]:
# Calculate ROUGE-WE (this replaces the BertScore metric initialized in the previous cell)
metric = RougeWe()
metric.evaluate(summaries, targets)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/lily/mmm274/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


{'rouge_we_3_f': 0.20183811597317614}
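
For intuition about what an overlap-based score measures, here is a minimal sketch of exact-match ROUGE-1 F1 (clipped unigram overlap between a candidate and a reference). It is a simplification for illustration only: ROUGE-WE, used above, relaxes exact matching with word embeddings, and the function name `rouge_1_f` and the whitespace tokenization are assumptions rather than the library's implementation.

```python
from collections import Counter


def rouge_1_f(candidate: str, reference: str) -> float:
    """F1 of clipped unigram overlap between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge_1_f("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))
```

Here five of the six tokens match on each side, so precision and recall are both 5/6 and the F1 score is about 0.8333.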

## More Model and Dataset Tests

The cells below demonstrate the features of the SummerTime library more comprehensively. They are a slightly modified version of our unit tests that applies different models on all the datasets and evaluates the results on each metric.

In [12]:
## Uncomment if using Ziva Server
## Installs the pytorch version compatible with the CUDA version

# !pip install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cu111/torch_nightly.html -U
# !pip install tensorboard
    
    
## You might need to restart runtime for the changes to take effect
## Uncomment the two lines below to restart runtime 

# import os
# os.kill(os.getpid(), 9)

Looking in links: https://download.pytorch.org/whl/nightly/cu111/torch_nightly.html
Collecting torchvision
  Using cached https://download.pytorch.org/whl/nightly/cu111/torchvision-0.11.0.dev20210820%2Bcu111-cp38-cp38-linux_x86_64.whl (21.4 MB)
Collecting torchaudio
  Using cached https://download.pytorch.org/whl/nightly/torchaudio-0.10.0.dev20210820-cp38-cp38-linux_x86_64.whl (2.0 MB)
Installing collected packages: torchvision, torchaudio
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.9.1
    Uninstalling torchvision-0.9.1:
      Successfully uninstalled torchvision-0.9.1
  Attempting uninstall: torchaudio
    Found existing installation: torchaudio 0.8.1
    Uninstalling torchaudio-0.8.1:
      Successfully uninstalled torchaudio-0.8.1
Successfully installed torchaudio-0.10.0.dev20210820 torchvision-0.11.0.dev20210820+cu111


In [None]:
# ## Run to check for successful Torch and Cuda compatibility

# import torch
# import sys

# print('A', sys.version)
# print('B', torch.__version__)
# print('C', torch.cuda.is_available())
# print('D', torch.backends.cudnn.enabled)
# device = torch.device('cuda')
# print('E', torch.cuda.get_device_properties(device))
# print('F', torch.tensor([1.0, 2.0]).cuda())

In [8]:
import unittest

from model.base_model import SummModel
from model import SUPPORTED_SUMM_MODELS, LexRankModel, PegasusModel, HMNetModel

from pipeline import assemble_model_pipeline

from evaluation.base_metric import SummMetric
from evaluation import SUPPORTED_EVALUATION_METRICS, Rouge, RougeWe

from dataset.st_dataset import SummInstance, SummDataset
from dataset import SUPPORTED_SUMM_DATASETS
from dataset.non_huggingface_datasets import ScisummnetDataset, SummscreenDataset, ArxivDataset
from dataset.huggingface_datasets import CnndmDataset, MlsumDataset, SamsumDataset

from tests.helpers import print_with_color, retrieve_random_test_instances

import random
import time
from typing import Dict, List, Union, Tuple
import sys

import nltk
nltk.download('stopwords')


class IntegrationTests(unittest.TestCase):
    
    def get_prediction(self, model: SummModel, dataset: SummDataset, test_instances: List[SummInstance]) -> Tuple[Union[List[str], List[List[str]]], Union[List[str], List[List[str]]]]:
        """
        Get summary prediction given model and dataset instances.

        :param SummModel `model`: Model for summarization task.
        :param SummDataset `dataset`: Dataset for summarization task.
        :param List[SummInstance] `test_instances`: Instances from `dataset` to summarize.
        :returns: Tuple of (summary predictions, summary targets), with one entry per instance in `test_instances`.
        """

        src = [ins.source[0] for ins in test_instances] if isinstance(dataset, ScisummnetDataset) else [ins.source for ins in test_instances]
        tgt = [ins.summary for ins in test_instances]
        query = [ins.query for ins in test_instances] if dataset.is_query_based else None
        prediction = model.summarize(src, query)
        return prediction, tgt
    
    def get_eval_dict(self, metric: SummMetric, prediction: List[str], tgt: List[str]):
        """
        Run evaluation metric on summary prediction.

        :param SummMetric `metric`: Evaluation metric.
        :param List[str] `prediction`: Summary prediction instances.
        :param List[str] `tgt`: Target summaries from dataset.
        """
        score_dict = metric.evaluate(prediction, tgt)
        return score_dict

    def test_all(self):
        """
        Runs integration test on all compatible dataset + model + evaluation metric pipelines supported by SummerTime.
        """

        print_with_color("\nInitializing all evaluation metrics...", "35")
        evaluation_metrics = []
        for eval_cls in SUPPORTED_EVALUATION_METRICS:
            # # TODO: Temporarily skipping Rouge/RougeWE metrics to avoid local bug.
            # if eval_cls in [Rouge, RougeWe]:
            #     continue
            print(eval_cls)
            evaluation_metrics.append(eval_cls())

        print_with_color("\n\nBeginning integration tests...", "35")
        for dataset_cls in SUPPORTED_SUMM_DATASETS:
            # TODO: Temporarily skipping MLSum (GitLab: server-side login gating) and Arxiv (size/time)
            if dataset_cls in [MlsumDataset, ArxivDataset]:
                continue
            dataset = dataset_cls()
            if dataset.train_set is not None:
                dataset_instances = list(dataset.train_set)
                print(f"\n{dataset.dataset_name} has a training set of {len(dataset_instances)} examples")
                print_with_color(f"Initializing all matching model pipelines for {dataset.dataset_name} dataset...", "35")
                # TODO: Temporarily skipping HMNetModel to avoid a bug on this branch
                matching_model_instances = assemble_model_pipeline(dataset_cls, list(filter(lambda m: m != HMNetModel, SUPPORTED_SUMM_MODELS)))
                for model, model_name in matching_model_instances:
                    test_instances = retrieve_random_test_instances(dataset_instances=dataset_instances, num_instances=1)
                    print_with_color(f"{'#' * 20} Testing: {dataset.dataset_name} dataset, {model_name} model {'#' * 20}", "35")
                    prediction, tgt = self.get_prediction(model, dataset, test_instances)
                    print(f"Prediction: {prediction}\nTarget: {tgt}\n")
                    for metric in evaluation_metrics:
                        print_with_color(f"{metric.metric_name} metric", "35")
                        score_dict = self.get_eval_dict(metric, prediction, tgt)
                        print(score_dict)

                    print_with_color(f"{'#' * 20} Test for {dataset.dataset_name} dataset, {model_name} model COMPLETE {'#' * 20}\n\n", "32")

unittest.main(argv=['first-arg-is-ignored'], exit=False)


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/lily/mmm274/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[35m
Initializing all evaluation metrics...[0m
<class 'evaluation.bertscore_metric.BertScore'>
<class 'evaluation.bleu_metric.Bleu'>
<class 'evaluation.rouge_metric.Rouge'>
<class 'evaluation.rougewe_metric.RougeWe'>
<class 'evaluation.meteor_metric.Meteor'>


[nltk_data] Downloading package wordnet to
[nltk_data]     /home/lily/mmm274/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


[35m

Beginning integration tests...[0m


Reusing dataset cnn_dailymail (/home/lily/mmm274/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234)
100%|██████████| 287113/287113 [00:44<00:00, 6521.81it/s]



cnn_dailymail has a training set of 287113 examples
[35mInitializing all matching model pipelines for cnn_dailymail dataset...[0m


Reusing dataset cnn_dailymail (/home/lily/mmm274/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234)
  logger.warn(
You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassificatio

[35m#################### Testing: cnn_dailymail dataset, BART model ####################[0m
Prediction: ["Denver police showed up last week at Maryjane's Social Club, one of dozens of private pot-smoking clubs in Colorado. The officers handcuffed smokers, seized drug paraphernalia and ticketed the club's owner for violating state law banning indoor cigarette smoking. Three people were cited for smoking in public. Colorado law prohibits recreational pot consumption 'openly and publicly or in a manner that endangers others'"]
Target: ["Maryjane's Social Club is one of dozens of private pot-smoking clubs in Colorado operating in a legal grey area .\nThree people were cited by police for smoking in public and club owner was ticketed .\nColorado law prohibits recreational pot consumption 'openly and publicly or in a manner that endangers others'\nPlainclothed police officers were posing as new members at the club ."]

[35mbert score metric[0m
hash_code: bert-base-uncased_L8_no-idf_versi

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'rouge_we_3_f': 0.17204301075268819}
[35mmeteor metric[0m
{'meteor': 0.1849241158451685}
[32m#################### Test for cnn_dailymail dataset, LexRank model COMPLETE ####################

[0m
[35m#################### Testing: cnn_dailymail dataset, Longformer model ####################[0m
Longformer model: processing document of tensor([263]) tokens
Prediction: ["NEW: The death toll has risen to 110, Pakistan's prime minister's office says.\nThe flooding has also destroyed 650 homes, officials say.\nFloods have also hit the Indian-administered Kashmir region.\nPakistan's prime ministers are meeting Saturday to discuss the situation."]
Target: ['Flooding caused by monsoon rains has destroyed 650 homes, officials say .\nThe Pakistani Prime Minister will attend a meeting on the floods Saturday .\nIndia has also been hit by flooding, which has killed at least 70 people there .']

[35mbert score metric[0m
hash_code: bert-base-uncased_L8_no-idf_version=0.3.10(hug_trans=4.5.1)
{'b

  return pagerank_scipy(


Prediction: ['Kelley reportedly met Adam Victor at the Republican National Convention in Tampa.']
Target: ['Jill Kelley tried to broker deal with energy mogul Adam Victor .\nDeal collapsed when Kelley demanded $80million commission .\nKelley and twin Natalie Khawam visited White House three times in three months as guests of aide they met at MacDill Air Force Base .']

[35mbert score metric[0m
hash_code: bert-base-uncased_L8_no-idf_version=0.3.10(hug_trans=4.5.1)
{'bert_score_f1': 0.504211962223053}
[35mbleu metric[0m
{'bleu': 0.6674716951409333}
[35mrouge metric[0m
{'rouge_1_f_score': 0.22223, 'rouge_2_f_score': 0.03846, 'rouge_l_f_score': 0.18519}
[35mrougeWE metric[0m
{'rouge_we_3_f': 0.12}
[35mmeteor metric[0m
{'meteor': 0.2786855482933915}
[32m#################### Test for cnn_dailymail dataset, TextRank model COMPLETE ####################

[0m


Using custom data configuration default
Reusing dataset multi_news (/home/lily/mmm274/.cache/huggingface/datasets/multi_news/default/1.0.0/2e145a8e21361ba4ee46fef70640ab946a3e8d425002f104d2cda99a9efca376)
100%|██████████| 44972/44972 [00:09<00:00, 4552.64it/s]



multi_news has a training set of 44972 examples
[35mInitializing all matching model pipelines for multi_news dataset...[0m


Using custom data configuration default
Reusing dataset multi_news (/home/lily/mmm274/.cache/huggingface/datasets/multi_news/default/1.0.0/2e145a8e21361ba4ee46fef70640ab946a3e8d425002f104d2cda99a9efca376)
  0%|          | 0/44972 [00:00<?, ?it/s]You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be 

[35m#################### Testing: multi_news dataset, Multi-document joint (BART) model ####################[0m
Prediction: ['The big question today is whether Boehner’s bill passes in the House. If it passes the House, then it will simply fail in the Senate. The final bill will not pass through the House on a party-line vote. It will be a compromise proposal, and Boehner will lose far more than 23.']
Target: ['– Let the endgame begin: House Republicans say a vote on John Boehner\'s debt plan will still take place tonight, but the House speaker is apparently having trouble rounding up the necessary votes. Boehner predicted victory earlier today, but he halted debate on the measure about 5pm, reports Politico. Assuming he eventually gets to 217, Harry Reid continues to insist the Boehner plan will die a quick death in the Senate, after which Reid would try to force a vote on his own plan. (Eric Cantor is already threatening Reid with being responsible for a default, notes the Hill.) A

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'rouge_we_3_f': 0.18556701030927833}
[35mmeteor metric[0m
{'meteor': 0.22222222222222224}
[32m#################### Test for multi_news dataset, Multi-document joint (LexRank) model COMPLETE ####################

[0m
[35m#################### Testing: multi_news dataset, Multi-document joint (Longformer) model ####################[0m
Longformer model: processing document of tensor([3027]) tokens
Prediction: ['Joe Donnelly concedes, says he will do everything he can to ensure a smooth transition.\nGOP retained control of the Senate on Tuesday.\nRepublicans stood a solid chance of winning the House from Republicans.\nDemocrats stood a strong chance of gaining control of Texas.\nRepublican Rep. Marsha Blackburn defeated former Gov. Phil Bredesen.']
Target: ["– Republicans will keep control of the Senate for at least another two years—both CNN and the AP have called it. The GOP went into Tuesday's midterms with a 51-49 advantage, and the party appears to be on track to actually increa

hash_code: bert-base-uncased_L8_no-idf_version=0.3.10(hug_trans=4.5.1)
{'bert_score_f1': 0.6257061958312988}
[35mbleu metric[0m
{'bleu': 19.512945386665386}
[35mrouge metric[0m
{'rouge_1_f_score': 0.5094, 'rouge_2_f_score': 0.19897, 'rouge_l_f_score': 0.20855}
[35mrougeWE metric[0m
{'rouge_we_3_f': 0.34423407917383825}
[35mmeteor metric[0m
{'meteor': 0.27976639993196833}
[32m#################### Test for multi_news dataset, Multi-document separate (BART) model COMPLETE ####################

[0m
[35m#################### Testing: multi_news dataset, Multi-document separate (LexRank) model ####################[0m
Prediction: ["Former Edwards economic policy adviser Leo Hindery testified Thursday he was an intermediary between Edwards and former Sen. Tom Daschle, who was then with Barack Obama's campaign. Hunter was being closely watched over by Edwards' once-close confidant, Andrew Young, who falsely claimed paternity of boss' baby as the tabloid prepared to expose the affair.

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'rouge_we_3_f': 0.11678832116788321}
[35mmeteor metric[0m
{'meteor': 0.22667826874954944}
[32m#################### Test for multi_news dataset, Multi-document separate (LexRank) model COMPLETE ####################

[0m
[35m#################### Testing: multi_news dataset, Multi-document separate (Longformer) model ####################[0m
Longformer model: processing document of tensor([582]) tokens


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Longformer model: processing document of tensor([259]) tokens
Prediction: ['Harrison Ford was flying his private plane, a single-engine Husky, at John Wayne Airport in Orange County, California.\nThe pilot was told to land on a taxiway instead of the runway.\nFord has been involved in a series of crashes and near-crashes while flying aircraft. Actor Harrison Ford was cleared to land on a runway at John Wayne Airport.\nThe American Airlines 737 was waiting for his plane to land in Dallas.\nFord is known for his decades of experience in several different incidents.\nHe has also used his piloting skills to rescue hikers and rescue hikers.']

[35mbert score metric[0m
hash_code: bert-base-uncased_L8_no-idf_version=0.3.10(hug_trans=4.5.1)
{'bert_score_f1': 0.6042613387107849}
[35mbleu metric[0m
{'bleu': 2.795710660323232}
[35mrouge metric[0m
{'rouge_1_f_score': 0.39118, 'rouge_2_f_score': 0.13851, 'rouge_l_f_score': 0.24793}
[35mrougeWE metric[0m
{'rouge_we_3_f': 0.2841225626740947}


Reusing dataset samsum (/home/lily/mmm274/.cache/huggingface/datasets/samsum/samsum/0.0.0/3f7dba43be72ab10ca66a2e0f8547b3590e96c2bd9f2cbb1f6bb1ec1f1488ba6)
100%|██████████| 14732/14732 [00:01<00:00, 12336.89it/s]



samsum has a training set of 14732 examples
[35mInitializing all matching model pipelines for samsum dataset...[0m


Reusing dataset samsum (/home/lily/mmm274/.cache/huggingface/datasets/samsum/samsum/0.0.0/3f7dba43be72ab10ca66a2e0f8547b3590e96c2bd9f2cbb1f6bb1ec1f1488ba6)
  0%|          | 0/14732 [00:00<?, ?it/s]You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequence


xsum has a training set of 204045 examples
[35mInitializing all matching model pipelines for xsum dataset...[0m


Using custom data configuration default
Reusing dataset xsum (/home/lily/mmm274/.cache/huggingface/datasets/xsum/default/1.2.0/4957825a982999fbf80bca0b342793b01b2611e021ef589fb7c6250b3577b499)
  0%|          | 0/204045 [00:00<?, ?it/s]You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly ide

[35m#################### Testing: xsum dataset, BART model ####################[0m
Prediction: ["White, 54, lost 10-7 to fellow Englishman Jack Lisowski at Ponds Forge. White may need to enter May's Q School to regain a full tour card. Former world champions Steve Davis and Stephen Hendry were offered wildcards after losing their places in 2014."]
Target: ["Jimmy White has lost his tour card after 37 years as a professional after defeat in the first round of qualifying for this month's World Championship."]

[35mbert score metric[0m
hash_code: bert-base-uncased_L8_no-idf_version=0.3.10(hug_trans=4.5.1)
{'bert_score_f1': 0.5201109051704407}
[35mbleu metric[0m
{'bleu': 2.062403823169552}
[35mrouge metric[0m
{'rouge_1_f_score': 0.25, 'rouge_2_f_score': 0.02857, 'rouge_l_f_score': 0.16666}
[35mrougeWE metric[0m
{'rouge_we_3_f': 0.08823529411764704}
[35mmeteor metric[0m
{'meteor': 0.08474576271186442}
[32m#################### Test for xsum dataset, BART model COMPLETE #########

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'rouge_we_3_f': 0.2711864406779661}
[35mmeteor metric[0m
{'meteor': 0.2288711548935565}
[32m#################### Test for xsum dataset, LexRank model COMPLETE ####################

[0m
[35m#################### Testing: xsum dataset, Longformer model ####################[0m
Longformer model: processing document of tensor([420]) tokens
Prediction: ['Police say Stephen Arthuro Solis-Reyes stole 900 social insurance numbers.\nThe Canadian police say he stole 900 Social Insurance numbers last week.\nUK parenting site Mumsnet has provided fresh details about how it fell victim to the bug.\nHackers have been linked to the heart attack of Canadian mother Justine Roberts.']
Target: ['A 19-year-old Canadian became the first person to be arrested in relation to the Heartbleed security breach.']

[35mbert score metric[0m
hash_code: bert-base-uncased_L8_no-idf_version=0.3.10(hug_trans=4.5.1)
{'bert_score_f1': 0.4598506987094879}
[35mbleu metric[0m
{'bleu': 1.7398283377474275}
[35mrouge 

Reusing dataset pubmed_qa (/home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d)
Loading cached split indices for dataset at /home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d/cache-4a7fed3fa9d9c53a.arrow and /home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d/cache-8917325612f1a482.arrow
Loading cached split indices for dataset at /home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d/cache-8e297aa49ae0c600.arrow and /home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d/cache-b3ad2bcb7941385f.arrow
100%|██████████| 169226/169226 [00:52<00:00, 3208.


pubmed_qa has a training set of 169226 examples
[35mInitializing all matching model pipelines for pubmed_qa dataset...[0m


Reusing dataset pubmed_qa (/home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d)
Loading cached split indices for dataset at /home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d/cache-4a7fed3fa9d9c53a.arrow and /home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d/cache-8917325612f1a482.arrow
Loading cached split indices for dataset at /home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d/cache-8e297aa49ae0c600.arrow and /home/lily/mmm274/.cache/huggingface/datasets/pubmed_qa/pqa_artificial/1.0.0/2e65addecca4197502cd10ab8ef1919a47c28672f62d7abac7cc9afdcf24fb2d/cache-b3ad2bcb7941385f.arrow
  0%|          | 0/169226 [00:00<?, ?it/s]You are 

[35m#################### Testing: pubmed_qa dataset, TF-IDF (BART) model ####################[0m
Prediction: ['therefore, studied possible link perioperative aprotinin treatment renal dysfunction patients undergoing first-time coronary surgery high risk bleeding. performed matched cohort study, comparing 200 patients receiving high-dose aProtinin with tranexamic acid primary isolated coronary surgery. Secondary outcomes evaluations postoperative renal function, mortality, stroke, reoperation bleeding, transfusion requirements.']
Target: ['Aprotinin treatment during primary coronary surgery was not associated with impaired postoperative renal function in comparison with patients treated with tranexamic acid.']

[35mbert score metric[0m
hash_code: bert-base-uncased_L8_no-idf_version=0.3.10(hug_trans=4.5.1)
{'bert_score_f1': 0.6612899899482727}
[35mbleu metric[0m
{'bleu': 5.203300368769515}
[35mrouge metric[0m
{'rouge_1_f_score': 0.34286, 'rouge_2_f_score': 0.17647, 'rouge_l_f_sco

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'rouge_we_3_f': 0.12844036697247704}
[35mmeteor metric[0m
{'meteor': 0.22970414201183426}
[32m#################### Test for pubmed_qa dataset, TF-IDF (LexRank) model COMPLETE ####################

[0m
[35m#################### Testing: pubmed_qa dataset, TF-IDF (Longformer) model ####################[0m
Longformer model: processing document of tensor([77]) tokens
Prediction: ['The average household income completed food reinforcement was a bmi.\nThe study was carried out by the University of New South Wales.\nLow socioeconomic status and high education levels were the main factors.\nHigh socioeconomic status was the highest.\nlevel.\nof bmi mediated part increased food reinforcement.']
Target: ['These findings support the hypothesis that deprivation and restricted food choice associated with low SES enhance food reinforcement, increasing the risk for obesity.']

[35mbert score metric[0m
hash_code: bert-base-uncased_L8_no-idf_version=0.3.10(hug_trans=4.5.1)
{'bert_score_f1': 0.4

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'rouge_we_3_f': 0.05714285714285714}
[35mmeteor metric[0m
{'meteor': 0.03424657534246575}
[32m#################### Test for pubmed_qa dataset, BM25 (LexRank) model COMPLETE ####################

[0m
[35m#################### Testing: pubmed_qa dataset, BM25 (Longformer) model ####################[0m
Longformer model: processing document of tensor([139]) tokens
Prediction: ['Genome-wide association studies colorectal cancer (crc)\nGenome studies coloresctal cancers found to be genetic variants.\nGenomics studies colourctal tumours found genetic variants in one-center group.\nStudy carried out by genome-based co-authors.']
Target: ['Our investigation confirms that variants across multiple risk regions of 8q24 are associated with CRC, and that associations at 18q21 differ by tumor site.']

[35mbert score metric[0m
hash_code: bert-base-uncased_L8_no-idf_version=0.3.10(hug_trans=4.5.1)
{'bert_score_f1': 0.46549928188323975}
[35mbleu metric[0m
{'bleu': 1.3494116947566301}
[35mroug

Reusing dataset summertime_scisummnet (/home/lily/mmm274/.cache/huggingface/datasets/summertime_scisummnet/default/0.0.0/2de3b585a09db9aaf7f42cebe7c334a94ebae2d8104a752ce286c8a0d3fadaec)
  0%|          | 0/808 [00:00<?, ?it/s]

{'rouge_we_3_f': 0.0}
[35mmeteor metric[0m
{'meteor': 0.0}
[32m#################### Test for pubmed_qa dataset, BM25 (TextRank) model COMPLETE ####################

[0m


100%|██████████| 808/808 [00:00<00:00, 8074.69it/s]
Reusing dataset summertime_scisummnet (/home/lily/mmm274/.cache/huggingface/datasets/summertime_scisummnet/default/0.0.0/2de3b585a09db9aaf7f42cebe7c334a94ebae2d8104a752ce286c8a0d3fadaec)



ScisummNet has a training set of 808 examples
[35mInitializing all matching model pipelines for ScisummNet dataset...[0m


  0%|          | 0/808 [00:00<?, ?it/s]You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
 12%|█▏        | 99/808 [01:05<07:49,  1.51it/s]


[35m#################### Testing: ScisummNet dataset, BART model ####################[0m
Prediction: ['Parsing algorithms that process the input from left to right and construct a single derivation have often been considered inadequate for natural language parsing. This article presents a general framework for describing and analyzing algorithms for deterministic incremental dependency parsing. We show that all four algorithms give competitive accuracy, although the non-projective list-based algorithm generally outperforms the projective algorithms.']
Target: ['Algorithms for Deterministic Incremental Dependency Parsing\nParsing algorithms that process the input from left to right and construct a single derivation have often been considered inadequate for natural language parsing because of the massive ambiguity typically found in natural language grammars.\nNevertheless, it has been shown that such algorithms, combined with treebank-induced classifiers, can be used to build highly a

hash_code: bert-base-uncased_L8_no-idf_version=0.3.10(hug_trans=4.5.1)
{'bert_score_f1': 0.4792468249797821}
[35mbleu metric[0m
{'bleu': 0.5974775194553664}
[35mrouge metric[0m
{'rouge_1_f_score': 0.0, 'rouge_2_f_score': 0.0, 'rouge_l_f_score': 0.0}
[35mrougeWE metric[0m


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'rouge_we_3_f': 0.09497591190640056}
[35mmeteor metric[0m
{'meteor': 0.04862517593199704}
[32m#################### Test for ScisummNet dataset, LexRank model COMPLETE ####################

[0m
[35m#################### Testing: ScisummNet dataset, Longformer model ####################[0m
Longformer model: processing document of tensor([4096]) tokens
Prediction: ['The task was designed to promote research on semantic inference over texts written in different languages.\nThe results suggest that the most successful systems (including the most accurate) were in the first round of the task.\nResults suggest that systems performed better on the SP-EN dataset than the lowest score on DE-EN (0.25)\nThe final result is a monolingual English translation of the two classes of language pairs.']
Target: ['Semeval-2012 Task 8: Cross-lingual Textual Entailment for Content Synchronization\nThis paper presents the first round of the task on Cross-lingual Textual Entailment for Content Synchroniz

Reusing dataset summertime_summscreen (/home/lily/mmm274/.cache/huggingface/datasets/summertime_summscreen/default/0.0.0/3b7dab2d730657a8545307f56c2e997a817186018ef6141b5699deebdc103573)
  2%|▏         | 402/22588 [00:00<00:10, 2060.20it/s]

{'rouge_we_3_f': 0.2717557251908397}
[35mmeteor metric[0m
{'meteor': 0.1678766200821647}
[32m#################### Test for ScisummNet dataset, TextRank model COMPLETE ####################

[0m


100%|██████████| 22588/22588 [00:10<00:00, 2241.55it/s]
Reusing dataset summertime_summscreen (/home/lily/mmm274/.cache/huggingface/datasets/summertime_summscreen/default/0.0.0/3b7dab2d730657a8545307f56c2e997a817186018ef6141b5699deebdc103573)



SummScreen_fd+tms_tokenized has a training set of 22588 examples
[35mInitializing all matching model pipelines for SummScreen_fd+tms_tokenized dataset...[0m


  0%|          | 0/22588 [00:00<?, ?it/s]You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
  0%|          | 99/22588 [01:06<4:12:17,  1.49it/s]
Reusing dataset summertime_qmsum (/ho


QMsum has a training set of 162 examples
[35mInitializing all matching model pipelines for QMsum dataset...[0m


 55%|█████▍    | 89/162 [00:12<00:00, 212.35it/s]You are using a model of type encoder_decoder to instantiate a model of type encoder-decoder. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 were not used when initializing EncoderDecoderModel: ['decoder.roberta.pooler.dense.weight', 'decoder.roberta.pooler.dense.bias']
- This IS expected if you are initializing EncoderDecoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EncoderDecoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
 61%|██████    | 99/162 [01:21<00:51,  1.22it/s] 
.
------------------------------

<unittest.main.TestProgram at 0x7f605dea9580>
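
The nested loop in `test_all` above (each dataset, then each compatible model, then each metric) can be hard to follow amid the log output. The following is a minimal, self-contained sketch of the same control flow using toy stand-ins. `ToyMetric`, `LeadSentenceModel`, and `run_pipeline` are hypothetical names made up for illustration and are not part of SummerTime; the only assumptions borrowed from the test are that a model exposes `summarize(...)` and a metric exposes `evaluate(prediction, tgt)` returning a dictionary of scores.

```python
from typing import Dict, List, Tuple


class ToyMetric:
    """Hypothetical stand-in for a SummMetric: unigram-overlap F1."""

    metric_name = "toy overlap"

    def evaluate(self, prediction: List[str], tgt: List[str]) -> Dict[str, float]:
        scores = []
        for pred, ref in zip(prediction, tgt):
            pred_tokens = set(pred.lower().split())
            ref_tokens = set(ref.lower().split())
            overlap = len(pred_tokens & ref_tokens)
            if overlap == 0:
                scores.append(0.0)
                continue
            precision = overlap / len(pred_tokens)
            recall = overlap / len(ref_tokens)
            scores.append(2 * precision * recall / (precision + recall))
        # Average over instances; an empty batch scores 0.0
        return {"toy_overlap_f1": sum(scores) / len(scores) if scores else 0.0}


class LeadSentenceModel:
    """Hypothetical stand-in for a summarization model: keeps the first sentence."""

    model_name = "Lead-1"

    def summarize(self, corpus: List[str]) -> List[str]:
        return [doc.split(". ")[0].rstrip(".") + "." for doc in corpus]


def run_pipeline(
    instances: List[Tuple[str, str]],
    models: list,
    metrics: list,
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Summarize every (source, target) pair with each model, then score with each metric."""
    src = [source for source, _ in instances]
    tgt = [target for _, target in instances]
    results = {}
    for model in models:
        prediction = model.summarize(src)
        for metric in metrics:
            results[(model.model_name, metric.metric_name)] = metric.evaluate(prediction, tgt)
    return results


instances = [("The cat sat on the mat. It then fell asleep.", "The cat sat on the mat.")]
results = run_pipeline(instances, [LeadSentenceModel()], [ToyMetric()])
print(results)
```

Keying the results by `(model_name, metric_name)` is one possible design choice: it collects in one dictionary the scores that the integration test prints one by one, which makes them easier to compare across models.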