In [1]:
import model as st_model
import dataset.stdatasets as st_data
import eval as st_eval

# Models

In [2]:
# Users can load a default summarization model
model = st_model.summarizer()

# Or a specific model
pegasus = st_model.pegasus()

### Model selection

In [3]:
# Users can easily access documentation to assist with model selection
model.show_capability()

Pegasus is the default single-document summarization model.
Introduced in 2019, it is a large neural abstractive summarization model trained on web crawl and news data.
 Strengths: 
 - High accuracy 
 - Performs well on almost all kinds of non-literary written text 
 Weaknesses: 
 - High memory usage 
 Initialization arguments: 
 - `device='cpu'` specifies the device on which the model is stored and runs. Use `device='gpu'` to run on an Nvidia GPU.


### Inference

In [4]:
documents = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. 
    The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected 
    by the shutoffs which were expected to last through at least midday tomorrow."""
]

model.summarize(documents)

["California's largest electricity provider has turned off power to hundreds of thousands of customers."]

# Datasets

In [5]:
cnn_dataset = st_data.load_dataset('cnn_dailymail', '3.0.0')

Reusing dataset cnn_dailymail (/home/zhangir/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602)


In [6]:
cnn_dataset['train']['highlights'][0]

'Syrian official: Obama climbed to the top of the tree, "doesn\'t know how to get down"\nObama sends a letter to the heads of the House and Senate .\nObama to seek congressional approval on military action against Syria .\nAim is to determine whether CW were used, not by whom, says U.N. spokesman .'
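The `highlights` value above is one string in which each reference sentence sits on its own line, with a space before the final period. The following is a minimal sketch of how such a string could be split into clean sentences. The helper name `split_highlights` is hypothetical and not part of the library shown in this notebook.

```python
import re

# The reference summary printed above, copied verbatim.
highlights = (
    'Syrian official: Obama climbed to the top of the tree, "doesn\'t know how to get down"\n'
    'Obama sends a letter to the heads of the House and Senate .\n'
    'Obama to seek congressional approval on military action against Syria .\n'
    'Aim is to determine whether CW were used, not by whom, says U.N. spokesman .'
)


def split_highlights(text):
    """Hypothetical helper: split on newlines and tidy the ' .' tokenization artifact."""
    sentences = []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        # Turn a trailing " ." into "." without touching periods inside the sentence.
        sentences.append(re.sub(r"\s+\.$", ".", line))
    return sentences


for sentence in split_highlights(highlights):
    print(sentence)
```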

### A non-neural model
Below we fit an unsupervised, non-neural summarizer (LexRank) on a small subset of the cnn_dailymail training articles:

In [7]:
corpus = cnn_dataset['train']['article'][0:5]

trad_model = st_model.lex_rank(corpus)

In [8]:
# Inference
text = cnn_dataset['test']['article'][0:1]
print(text)

trad_model.summarize(text)

['(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he\'d been a busy actor for decades in theater and in Hollywood, Best didn\'t become famous until 1979, when "The Dukes of Hazzard\'s" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best\'s Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his "hot pursuit" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive "kew-kew-kew" chuckle and for goofy catchphrases such 

[['(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness.',
  'Among the most popular shows on TV in the early \'80s, "The Dukes of Hazzard" ran until 1985 and spawned TV movies, an animated series and video games.']]

In [9]:
# More about LexRank

trad_model.show_capability()

A non-neural model for extractive summarization. 
 Works by using a graph-based method to identify the most salient sentences in the document. 
 Strengths: 
 - Fast with low memory usage 
 - Allows for control of summary length 
 Weaknesses: 
 - Not as accurate as neural methods. 
 Initialization arguments: 
 - `corpus`: Unlabelled corpus of documents. 
 - `summary_length`: number of sentences in each summary 
 - `threshold`: Level of salience required for a sentence to be included in the summary.
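To make the graph-based idea concrete, here is a simplified sketch in the spirit of LexRank. It is not the implementation behind `st_model.lex_rank`. It assumes plain term-frequency cosine similarity (the published method weights terms by IDF estimated from a corpus), a naive regex sentence splitter, and PageRank-style power iteration. All function names and the defaults for `threshold`, `damping`, and `iterations` are illustrative assumptions.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_sentences(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Score sentences by power iteration over a thresholded similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    # Connect two distinct sentences when their similarity exceeds the threshold.
    adj = [[1.0 if i != j and cosine(bags[i], bags[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence receives a share of the score of every neighbour.
        scores = [
            (1 - damping) / n
            + damping * sum(adj[j][i] * scores[j] / degree[j] for j in range(n) if degree[j])
            for i in range(n)
        ]
    return scores


def summarize(text, summary_length=2, threshold=0.1):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    scores = rank_sentences(sentences, threshold)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:summary_length]
    return [sentences[i] for i in sorted(top)]


text = "The cat sat on the mat. The cat sat on the rug. Stock prices fell sharply today."
print(summarize(text, summary_length=2))
```

In this sketch, `summary_length` caps the number of extracted sentences and `threshold` decides which sentence pairs count as connected. That mirrors the roles the two arguments are described as having above, though the library may apply them differently.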


# Evaluation

In [7]:
# Initializes a rouge metric object
metric = st_eval.rouge()

# Evaluate the model on a subset of cnn_dailymail.
# The ROUGE variants are stored on the metric object.
metric.evaluate(model, cnn_dataset['test'][0:5]);

2021-04-21 16:22:37,361 [MainThread  ] [INFO ]  Set ROUGE home directory to /home/zhangir/miniconda3/lib/python3.8/site-packages/summ_eval/ROUGE-1.5.5/.
2021-04-21 16:23:17,360 [MainThread  ] [INFO ]  Writing summaries.
2021-04-21 16:23:17,366 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmp79ubs_mc/system and model files to /tmp/tmp79ubs_mc/model.
2021-04-21 16:23:17,367 [MainThread  ] [INFO ]  Processing files in /tmp/tmprvf8rcma.
2021-04-21 16:23:17,368 [MainThread  ] [INFO ]  Processing system.3.txt.
2021-04-21 16:23:17,369 [MainThread  ] [INFO ]  Processing system.1.txt.
2021-04-21 16:23:17,370 [MainThread  ] [INFO ]  Processing system.0.txt.
2021-04-21 16:23:17,371 [MainThread  ] [INFO ]  Processing system.2.txt.
2021-04-21 16:23:17,372 [MainThread  ] [INFO ]  Processing system.4.txt.
2021-04-21 16:23:17,373 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp79ubs_mc/system.
2021-04-21 16:23:17,373 [MainThread  ] [INFO ]  Processing files in /tm

In [8]:
# Retrieve ROUGE-1 F-scores.
metric.get(['rouge_1_f_score_cb', 'rouge_1_f_score_ce'])

{'rouge_1_f_score_cb': 0.14119, 'rouge_1_f_score_ce': 0.51115}
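For intuition about what a ROUGE-1 F-score measures, the sketch below computes unigram-overlap precision, recall, and F1 for a single candidate and reference pair. It is a simplification, not the ROUGE-1.5.5 toolkit invoked above: it assumes lowercasing and a `\w+` tokenizer, and it skips stemming, stopword handling, and the bootstrap resampling that the toolkit can perform. The `_cb` and `_ce` suffixes in the output above look like the lower and upper bounds of a confidence interval, as in pyrouge-style output; that reading is an assumption, not something this notebook states. The function name `rouge_1` is illustrative.

```python
import re
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 for one candidate/reference pair."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f_score": f_score}


scores = rouge_1("the cat sat", "the cat sat on the mat")
print(scores)
```

Here the candidate reuses three of the six reference words, so precision is high while recall is only one half. The F-score balances the two.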