
<h1 align = "center"> Inverse Hierarchical Multi-Document Summarization</h1>
<h2 align = "center"><em> Brandon Scolieri, Daphne Yang, Frank Bruni</em></h2> 
<h4 align = "center"> W266: Natural Language Processing with Deep Learning </h4>
<h4 align = "center"> April, Spring 2021 </h4> 

In [None]:
%%capture
############
# INSTALLS #
############

#Abstractive Summarizer Installs
!pip install datasets
!pip install transformers
!pip install rouge_score
!pip install sacrebleu

#Extractive Summarizer Installs
!pip install bert-extractive-summarizer
!pip install neuralcoref
!pip install spacy==2.1.3
!python -m spacy download en_core_web_md

In [18]:
###########
# IMPORTS #
###########

#Abstractive Summarizer Imports
import datasets
import transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM  
from transformers import BertTokenizer, EncoderDecoderModel

#Extractive Summarizer Imports
from tqdm import tqdm_pandas
from tqdm import tqdm
from summarizer import Summarizer

#Cosine Similarity and Evaluation Imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from rouge_score import rouge_scorer

#Utility Imports
from functools import reduce
from operator import add
import pandas as pd
import numpy as np
import pprint
import random

#Dataset Library Imports
from sklearn.datasets import fetch_20newsgroups

In [2]:
%%capture
###############
# GLOBAL VARS #
###############

vectorizer = TfidfVectorizer()
tokenizer = AutoTokenizer.from_pretrained("patrickvonplaten/bert2bert_cnn_daily_mail")  
abstractive_summarizer_model = AutoModelForSeq2SeqLM.from_pretrained("patrickvonplaten/bert2bert_cnn_daily_mail")
extractive_summarizer_model = Summarizer()

In [3]:
########
# DATA #
########

#CNN Dailymail dataset, initially used for testing summarizers
test_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="test")
twenty_news_dataset = fetch_20newsgroups()

#Toy data for testing the pipeline
covid_test_data = pd.read_csv("/home/jupyter/266/266_final/nyt_data_collection/toy_data/fp_covid_articles.csv")
golf_test_data = pd.read_csv("/home/jupyter/266/266_final/nyt_data_collection/toy_data/golf_articles.csv")

#Full NYT Data
full_nyt_data = pd.read_csv("/home/jupyter/266_final/full_nyt_dataset.csv")

Reusing dataset cnn_dailymail (/home/jupyter/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602)


In [93]:
####################
# HELPER FUNCTIONS #
####################

# MISC Helpers
divString = lambda size, char = "#": reduce(add, [char for i in range(size)])
flatten = lambda lst: [i for sublst in lst for i in sublst]
print_pipeline = lambda pipeline_output: pprint.pprint(pipeline_output[1])
compute_simple_saturation = lambda prevLst, newLst: len([i for i in newLst if i not in prevLst])

#Batch Summary Generation and Batch Metrics
def generate_summary(batch):
    """This function computes a summary for a given article from the Dataset object
    batch
    Params:
    batch: an article from the given Dataset object."""
    # Tokenizer will automatically set [BOS] <text> [EOS]
    # cut off at BERT max length 512
    inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask
    outputs = model.generate(input_ids, attention_mask=attention_mask)
    # all special tokens including will be removed
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    batch["pred"] = output_str
    return batch


def compute_metrics(batch, batch_size=16, metric_name="rouge"):
    """This function computes the rouge or bleu scores for predicted summaries
    Params:
    batch: A Dataset object which contains the articles at the specified indices
    Use the select method for this function call. 
    Example format: Dataset.select([list of indices to select from the original dataset])
    metric_name: The prefered evaluation metric to use"""
    
    metric = datasets.load_metric(metric_name)
    results = batch.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["article"])
    summary_pred = results["pred"]
    label_ref = results["highlights"]
    if metric_name == "rouge":
        output = metric.compute(predictions=summary_pred, references=label_ref, rouge_types=["rouge2"])["rouge2"].mid
        print("\n" + "ROUGE SCORE:")
        return output
    else:
        # Else compute bleu score with metric name "sacrebleu"
        all_bleu_scores = []
        for i in range(len(batch)):
            output = metric.compute(predictions= [summary_pred[i]], references= [[label_ref[i]]])
            all_bleu_scores.append(output)
            print("\n\n")
            print(divString(100))
            print("\n\n" + "Summary prediction: " + "\n\n", summary_pred[i])
            print("\n\n" + "Reference Label: " + "\n\n", label_ref[i])
            print("\n\n" + "BLEU SCORE:" + "\n\n", output)
            print("\n")
        return all_bleu_scores
    

#Raw Text Summarization
def generate_abstractive_summary(raw_string, model = abstractive_summarizer_model, max_length=512):
    """This function produces an abstractive summary for a given article"
    Params:
    raw_string: an article string.
    model: An abstractive summarizer model"""
    # Tokenizer will automatically set [BOS] <text> [EOS]
    # cut off at BERT max length 512
    inputs = tokenizer(raw_string, padding="max_length", truncation=True, max_length=max_length, return_tensors="pt")
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask
    outputs = model.generate(input_ids, attention_mask=attention_mask)
    # all special tokens including will be removed
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return output_str[0]


def generate_extractive_summary(raw_string, model = extractive_summarizer_model, min_summary_length = 50):
    """This function produces an extractive summary for a given article"
    Params:
    raw_string: an article string.
    model: An extractive summarizer model"""
    output_str = model(raw_string, min_length = min_summary_length)
    return output_str
    
    
#Search and Subset Dataset
def search_and_subset_data(df, keyword, column = "first_paragraph"):
    """
    This function takes in a dataframe, keyword to search, and an optional column to search through and returns a
    subset of the data as a pandas dataframe, with entries that contain the searched keyword.
    Params:
    df: Dataframe
    keyword: A keyword to search
    column: Optional that takes either 'first_paragraph' or 'keywords'
    """
    df = df.sort_values(by='date', ascending = False).reset_index().drop("index", axis = 1) # Ordering all documents chronologically so that indices don't need reodered when combining similar documents
    df = df.dropna(subset=[column])
    subset = df[df[column].str.lower().str.contains(keyword.strip().lower())]
    return subset


def multi_search(df, keywords, column = "first_paragraph"):
    """
    This function takes in a dataframe, keyword to search, and an optional column to search through and returns a
    subset of the data as a pandas dataframe, with entries that contain the searched keyword.
    Params:
    df: Dataframe
    keywords: A list of keywords to search for. It will create a subset that contains all keywords supplied. 
    column: Optional that takes either 'first_paragraph' or 'keywords'
    """
    df = df.sort_values(by='date', ascending = False).reset_index().drop("index", axis = 1) # Ordering all documents chronologically so that indices don't need reodered when combining similar documents
    df = df.dropna(subset=[column])
    num_words = len(keywords)
    if num_words == 1:
        subset = df[df[column].str.lower().str.contains(keywords[0].strip().lower())]
        return subset
    else:
        word = keywords[num_words - 1]
        subset = df[df[column].str.lower().str.contains(word.strip().lower())]
        return multi_search(subset, keywords[num_words - 1], column = column)


def select_random_document(df):
    """
    This function selects a single random row from a dataframe. This is used as a default for initial document selection at the beginning of the pipeline.
    df: A pandas dataframe
    """
    return df.sample()



# Similarity Clustering and Aggregate Document Synthesis
def compute_cosine_similarities(document, corpus, vectorizer = vectorizer):
    """
    This function computes the cosine similarity between a document and a specified corpus of documents.
    document: A string of text.
    corpus: An array of documents.
    vectorizer: A TfidfVectorizer() object. (Default: initialized in the constants cell)
    """
    vectorized_corpus = vectorizer.fit_transform(corpus)
    vectorized_document = vectorizer.transform([document])
    return linear_kernel(vectorized_document, vectorized_corpus).flatten()


def get_related_docs_indices(cos_similarities_array, n_docs=5, type_='top'):
    """
    This function returns the document indices with the highest cosine similarity.
    cos_similarities_array: An array of the computed cosine similarities for the whole corpus.
    n_docs: The number of scored documents to return. (e.g. n_docs=5 returns the top 5 highest scoring documents)
    type_: The order in which similarity scores are selected. (Options: "top", "bottom", "rand") (i.e. "top" will take the top highest cosine similarity scores, "bottom" will pick the lowest, "rand" will pick at random)
    """
    cos_similarities_array = np.array([i for i in cos_similarities_array if i < 0.999999999]) # This eliminates the case where the most similar document is the document itself, which has a similarity of 1.0
    if type_ =='bottom':
        return sorted(cos_similarities_array.argsort()[:(n_docs + 1)])
    if type_ =='rand':
        return sorted(random.sample(cos_similarities_array.argsort().tolist(), n_docs))
    return sorted(cos_similarities_array.argsort()[:-(n_docs + 1):-1])


def get_related_docs(corpus, related_docs_indices):
    return [corpus[i] for i in related_docs_indices]
    

def get_top_similarities(cos_similarities_array, n_docs=5, similarity_selection_rule="top"):
    """
    This function returns the highest document cosine similarity scores.
    cos_similarities_array: An array of the computed cosine similarities for the whole corpus.
    n_docs: The number of highest cosine scores to return. (e.g. n_docs=5 returns the top 5 highest cosine similarity scores)
    similarity_selection_rule: The order in which similarity scores are selected. (Options: "top", "bottom", "rand") (i.e. "top" will take the top highest cosine similarity scores, "bottom" will pick the lowest, "rand" will pick at random)
    """
    related_indices = get_related_docs_indices(cos_similarities_array, n_docs, type_=similarity_selection_rule)
    return cos_similarities_array[related_indices]


def concatenate_related_docs(corpus, related_docs_indices):
    docs = [corpus[i] for i in related_docs_indices]
    return " ".join(docs)
    

# Display Functions
def show_related_docs(document, corpus, related_docs_indices):
    """
    This function displays the seed document, selected similar documents, and the concatenated aggregate of the similar documents.
    """
    aggregate_doc = concatenate_related_docs(corpus, related_docs_indices)
    print("\n" + "SELECTED DOCUMENT: " + "\n")
    print(document)
    print("\n")
    print(divString(100))
    print("\n")
    print("SIMILAR DOCUMENTS: " + "\n")
    for i in related_docs_indices:
        print(corpus[i], "\n")
        print(divString(100, char = "~") + "\n")
    print(divString(100))
    print("\n")
    print("AGGREGATE DOCUMENT: " + "\n")
    print(aggregate_doc + "\n")
    print(divString(100))
    
    
def show_result_seed_comparison(seedDocument, resultDocument):
    print("Seed Document:" + "\n")
    print(seedDocument, "\n")
    print(divString(100) + "\n")
    print("Resulting Document:" + "\n")
    print(resultDocument, "\n")


#Pipeline Iteration
def iterate_pipeline(initial_document, corpus, num_iter = 5, aggregate_doc_variant = "abstractive_secondary"):
    """
    This function begins with an initial document. From there the initial document is clustered with similar documents, 
    """
    document = initial_document
    for i in range(num_iter):
        cosine_similarities = compute_cosine_similarities(document, corpus)
        related_docs_indices = get_related_docs_indices(cosine_similarities)
        aggregate_document = concatenate_related_docs(corpus, related_docs_indices)
        if aggregate_doc_variant == "abstractive_primary":
            document = generate_abstractive_summary(aggregate_document, model = abstractive_summarizer_model)
        if aggregate_doc_variant == "abstractive_secondary":
            extractive_summary_primary = generate_extractive_summary(aggregate_document, min_summary_length=100)
            document = generate_abstractive_summary(extractive_summary_primary, model = abstractive_summarizer_model)
        if aggregate_doc_variant == "extractive_primary":
            document = generate_extractive_summary(aggregate_document, min_summary_length=100)
    print("Seed Document:" + "\n")
    print(initial_document, "\n")
    print(divString(100) + "\n")
    print("Resulting Document:" + "\n")
    print(document, "\n")
    return document


def stack_summaries(pipeline_iteration_results, paramDict = {"aggregate_doc_variant" : "extractive_primary", "model" : abstractive_summarizer_model, "min_summary_length" : 100, "max_length" : 512}):
    """
    This function concatenates the resulting summaries of each iteration into a single composite summary. Then it runs one of the summarizer models on it.
    Params:
    pipeline_iteration_results: pipeline_iteration_results dictionary
    paramDict: An optional dictionary of possible hyperparameters to specify.
    """
    aggregate_doc_variant = paramDict["aggregate_doc_variant"]
    model = paramDict["model"]
    min_summary_length = paramDict["min_summary_length"]
    max_length = paramDict["max_length"]
    all_summaries = [pipeline_iteration_results[i]["summary"] for i in range(len(pipeline_iteration_results.keys()))]
    composite_summary = "\n".join(all_summaries)
    
    if aggregate_doc_variant == "abstractive_primary":
        document = generate_abstractive_summary(composite_summary, model = abstractive_summarizer_model, max_length = max_length)
    if aggregate_doc_variant == "abstractive_secondary":
        extractive_summary_primary = generate_extractive_summary(composite_summary, min_summary_length=100)
        document = generate_abstractive_summary(extractive_summary_primary, model = abstractive_summarizer_model, max_length = max_length)
    if aggregate_doc_variant == "extractive_primary":
        document = generate_extractive_summary(composite_summary, min_summary_length=100)
        
    pipeline_iteration_results["concatenated_summaries"] = all_summaries
    pipeline_iteration_results["composite_summary"] = document

    return pipeline_iteration_results


##############
# EVALUATION #
##############

def compute_rouge_scores(result_summary, reference_document):
    """
    This function computes the rouge scores between two documents.
    Params:
    result_summary: The resulting document we are scoring.
    reference_document: The reference document we are comparing to. 
    """
    scorer = rouge_scorer.RougeScorer(['rouge1','rouge2', 'rougeL','rougeLsum'], use_stemmer=True)
    scores = scorer.score(result_summary, reference_document)
    return scores


def compute_bleu_scores(result_summary, reference_document):
    """
    This function computes the bleu score between two documents.
    Params:
    result_summary: The resulting document we are scoring.
    reference_document: The reference document we are comparing to. 
    """
    metric = datasets.load_metric("sacrebleu")
    score = metric.compute(predictions= [result_summary], references= [[reference_document]])
    return score


def saturation_score(pipeline_iteration_results):
    """
    This function measures the level of information saturation over the course of the pipeline. 
    Params:
    pipeline_iteration_results: A pipeline dictionary representation.
    """
    
def get_saturation_scores(pipeline_iteration_results):
    """
    This function measures the level of information saturation over the course of the pipeline. 
    Params:
    pipeline_iteration_results: A pipeline dictionary representation.
    Returns: A dictionary of the saturations at each iteration excluding iteration 0. 
    """
    saturation_scores = {}
    pipelineDict = pipeline_general_results[1]
    doc_indices = [(i, pipelineDict[i]["related_document_indices"]) for i in pipelineDict.keys()]
    i = 0
    while (i + 1) < len(doc_indices):
        saturation = compute_simple_saturation(doc_indices[i][1], doc_indices[i + 1][1])
        saturation_scores[i] = saturation
        i += 1
    return saturation_scores
    

def compute_all_metrics(result_summary, reference_document):
    evaluation_metrics = {"bleu_score": compute_bleu_scores(result_summary, reference_document),
                          "rouge_score": compute_rouge_scores(result_summary, reference_document)}
    return evaluation_metrics


def get_evaluation_metrics(pipeline_iteration_results, stacked = False):
    if stacked == True:
        return [pipeline_iteration_results[1][i]["evaluation_metrics"] for i in range(len(pipeline_iteration_results[1].keys()) - 2)]
    else :
        return [pipeline_iteration_results[1][i]["evaluation_metrics"] for i in pipeline_iteration_results[1].keys()]

    
def mean_bleu_score(pipeline_iteration_results, stacked = False):
    metrics = get_evaluation_metrics(pipeline_iteration_results, stacked = stacked)
    scores = [i["bleu_score"]["score"] for i in metrics]
    return np.mean(scores)


def mean_rouge_score(pipeline_iteration_results, stacked = False):
    metrics = get_evaluation_metrics(pipeline_iteration_results, stacked = stacked)
    fMeasureScores = [i["rouge_score"]["rouge2"][2] for i in metrics]
    return np.mean(fMeasureScores)


def aggregate_saturation_score(pipeline_iteration_results, stacked = False):
    metrics = get_evaluation_metrics(pipeline_general_results, stacked = stacked)
    saturation_scores = [i["saturation_score"] for i in metrics]
    num_docs = saturation_scores[0] * len(metrics)
    unique_docs = sum(saturation_scores)
    return unique_docs/num_docs


def show_model_scores(pipeline_iteration_results, stacked = False):
    print("Mean Bleu Score: ", mean_bleu_score(pipeline_iteration_results, stacked = stacked))
    print("Mean Rouge2 Score: ", mean_rouge_score(pipeline_iteration_results, stacked = stacked))
    print("Mean Aggregate Saturation Score: ", aggregate_saturation_score(pipeline_iteration_results, stacked = stacked))



In [6]:
# Helper Function Unit Tests

# generate_summary(test_data[0])
# compute_metrics(test_data.select([1,2]), metric_name = "rouge")
# compute_metrics(test_data.select([1, 2]), metric_name = "sacrebleu")
# generate_extractive_summary(test_data[0]["article"])
# generate_extractive_summary(test_data[0]["article"], min_summary_length=10)

### Baseline Algorithm Walkthrough
1. Select a group/subset of articles (Order the subset chronologically so that the dataframe indices are chronological)
2. Select a single article from the subset produced in step 1. (Baseline selection will be random)
3. Perform cosine similarity between the single/selected article and the entire subset of articles produced in step 1.
4. Select the top most similar indices and their associated articles.
5. Summarize them.

In [7]:
###############################
# PIPELINE (SINGLE ITERATION) #
###############################

# Document Selection
search_results_df = search_and_subset_data(covid_test_data, "death") # a df containing entries that have death in the first paragraph
corpus = search_results_df.first_paragraph.to_list()  # Converting to a list to vectorize the entries
document = select_random_document(search_results_df).first_paragraph.values[0] # Selected a random row from the search results dataframe and extracted the first paragraph text to serve as our document

# Similarity Clustering and Aggregate Document Synthesis
cosine_similarities = compute_cosine_similarities(document, corpus) # The cosine similarities between the target document and all documents contained in the corpus
related_docs_indices = get_related_docs_indices(cosine_similarities) # The indices of the most related docs
aggregate_document = concatenate_related_docs(corpus, related_docs_indices) # The resulting document that is produced by concatenating all of the most similar documents

# Summarization
extractive_summary_primary = generate_extractive_summary(aggregate_document, min_summary_length=100)
abstractive_summary_primary = generate_abstractive_summary(aggregate_document, model = abstractive_summarizer_model)
abstractive_summary_secondary = generate_abstractive_summary(extractive_summary_primary, model = abstractive_summarizer_model) #Secondary summary in the hierarchy (i.e. a summary of a summary)

In [8]:
print("Top Cosine Similarity Scores: ", get_top_similarities(cosine_similarities))
show_related_docs(document, corpus, related_docs_indices)

print("\n" + "SUMMARIZATION RESULTS" + "\n")
print(divString(100) + "\n")
print("EXTRACTIVE SUMMARY PRIMARY:" + "\n", extractive_summary_primary)
print("\n")
print("ABSTRACTIVE SUMMARY PRIMARY:" + "\n", abstractive_summary_primary)
print("\n")
print("ABSTRACTIVE SUMMARY SECONDARY:" + "\n", abstractive_summary_secondary)
print(divString(100) + "\n")

Top Cosine Similarity Scores:  [0.14345526 0.14457817 0.04430281 0.13621813 0.08780106]

SELECTED DOCUMENT: 

The death of Herman Cain, attributed to the coronavirus, has made Republicans and President Trump face the reality of the pandemic as it hit closer to home than ever before, claiming a prominent conservative ally whose frequently dismissive attitude about taking the threat seriously reflected the hands-off inconsistency of party leaders.


####################################################################################################


SIMILAR DOCUMENTS: 

WASHINGTON — President Trump on Wednesday rejected the professional scientific conclusions of his own government about the prospects for a widely available coronavirus vaccine and the effectiveness of masks in curbing the spread of the virus as the death toll in the United States from the disease neared 200,000. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Presid

In [9]:
########################################
# ITERATIVE PIPELINE (INITIAL VERSION) #
########################################

# Initial Document Selection
search_results_df = search_and_subset_data(covid_test_data, "death") # a df containing entries that have death in the first paragraph
corpus = search_results_df.first_paragraph.to_list()  # Converting to a list to vectorize the entries
initial_document = select_random_document(search_results_df).first_paragraph.values[0] # Selected a random row from the search results dataframe and extracted the first paragraph text to serve as our document

# Pipeline Iteration
iterate_pipeline(initial_document, corpus, num_iter = 5, aggregate_doc_variant = "abstractive_secondary")

Seed Document:

[Read more on Brazil’s Coronavirus cases and deaths.] 

####################################################################################################

Resulting Document:

the death toll in the u. s. has reached 200, 000. president trump rejected the professional findings of his own government. the public's interest in the case may make this one of the highest - profile trials in recent memory. the trial of derek chauvin will be a one - of - the - kind. 



"the death toll in the u. s. has reached 200, 000. president trump rejected the professional findings of his own government. the public's interest in the case may make this one of the highest - profile trials in recent memory. the trial of derek chauvin will be a one - of - the - kind."

In [10]:
#################################
# ITERATIVE PIPELINE (BASELINE) #
#################################

#Pipeline Iteration
def iterate_pipeline(initial_document, corpus, n_docs = 5, num_iter = 5, aggregate_doc_variant = "abstractive_secondary"):
    """
    This function begins with an initial document. From there the initial document is clustered with similar documents, 
    Params:
    initial_document: Some string.
    corpus: The corpus.
    n_docs: The number of similar documents that are aggregated and summarized at each iteration.
    num_iter: The number of iterations the pipeline is run for.
    aggregate_doc_variant: The specific summarization process to perform. (Options: abstractive_primary, abstractive_secondary, extractive_primary)
    """
    pipeline_iteration_results = {}
    document = initial_document
    
    for i in range(num_iter):
        cosine_similarities = compute_cosine_similarities(document, corpus)
        related_docs_indices = get_related_docs_indices(cosine_similarities, n_docs = n_docs)
        related_docs = get_related_docs(corpus, related_docs_indices)
        aggregate_document = concatenate_related_docs(corpus, related_docs_indices)
        
        pipelineDict = {"seed_document": document,
        "related_documents" : related_docs,
        "cosine_similarity_scores" : get_top_similarities(cosine_similarities),
        "related_document_indices" : related_docs_indices,
        "aggregate_document": aggregate_document,
        "summary_type" : aggregate_doc_variant,
        "summary": ""
       }
        if aggregate_doc_variant == "abstractive_primary":
            document = generate_abstractive_summary(aggregate_document, model = abstractive_summarizer_model)
        if aggregate_doc_variant == "abstractive_secondary":
            extractive_summary_primary = generate_extractive_summary(aggregate_document, min_summary_length=100)
            document = generate_abstractive_summary(extractive_summary_primary, model = abstractive_summarizer_model)
        if aggregate_doc_variant == "extractive_primary":
            document = generate_extractive_summary(aggregate_document, min_summary_length=100)
            
        pipelineDict["summary"] = document
        pipeline_iteration_results[i] = pipelineDict
            
        
    print("Seed Document:" + "\n")
    print(initial_document, "\n")
    print(divString(100) + "\n")
    print("Resulting Document:" + "\n")
    print(document, "\n")
    return document, pipeline_iteration_results

#######
# RUN #
#######

# Initial Document Selection
search_results_df = search_and_subset_data(covid_test_data, "death") # a df containing entries that have death in the first paragraph
corpus = search_results_df.first_paragraph.to_list()  # Converting to a list to vectorize the entries
initial_document = select_random_document(search_results_df).first_paragraph.values[0] # Selected a random row from the search results dataframe and extracted the first paragraph text to serve as our document

# Pipeline Iteration
pipeline_results = iterate_pipeline(initial_document, corpus, num_iter = 5, aggregate_doc_variant = "abstractive_secondary")

print_pipeline = lambda pipeline_output: pprint.pprint(pipeline_output[1])
print_pipeline(pipeline_results)

Seed Document:

WASHINGTON — The numbers the health officials showed President Trump were overwhelming. With the peak of the coronavirus pandemic still weeks away, he was told, hundreds of thousands of Americans could face death if the country reopened too soon. 

####################################################################################################

Resulting Document:


{0: {'aggregate_document': 'As coronavirus cases and deaths were rising in the '
                           'United States last spring, Jared Kushner, '
                           'President Trump’s senior adviser and son-in-law, '
                           'told an interviewer that the president had taken '
                           'the “country back from the doctors.” As '
                           'coronavirus cases and deaths were rising in the '
                           'United States last spring, Jared Kushner, '
                           'President Trump’s senior adviser and son-in-law, '
 

In [11]:
##################################
# ITERATIVE PIPELINE (RECURRENT) #
##################################

test_document = "WASHINGTON — The coronavirus pandemic is sweeping through death row at the federal penitentiary in Terre Haute, Ind., with at least 14 of the roughly 50 men there having tested positive, lawyers for the prisoners and others familiar with their cases said."

#Pipeline Iteration
def iterate_pipeline_recurrent(initial_document, corpus, n_docs = 5, num_iter = 5, max_length = 512, aggregate_doc_variant = "abstractive_secondary"):
    """
    This function begins with an initial document. From there the initial document is clustered with similar documents, 
    Params:
    initial_document: Some string.
    corpus: The corpus.
    n_docs: The number of similar documents that are aggregated and summarized at each iteration.
    num_iter: The number of iterations the pipeline is run for.
    max_length: The maximum number of tokens to cut off at. (Absolute max: 512 (This is due to the limitations of the Bert model we're using))
    aggregate_doc_variant: The specific summarization process to perform. (Options: abstractive_primary, abstractive_secondary, extractive_primary)
    """
    pipeline_iteration_results = {}
    document = initial_document
    
    for i in range(num_iter):
        cosine_similarities = compute_cosine_similarities(document, corpus)
        related_docs_indices = get_related_docs_indices(cosine_similarities, n_docs=n_docs)
        related_docs = get_related_docs(corpus, related_docs_indices)
        aggregate_document = concatenate_related_docs(corpus, related_docs_indices)
        
        pipelineDict = {"seed_document": document,
        "related_documents" : related_docs,
        "cosine_similarity_scores" : get_top_similarities(cosine_similarities),
        "related_document_indices" : related_docs_indices,
        "aggregate_document": aggregate_document,
        "summary_type" : aggregate_doc_variant,
        "summary": ""
       }
        # There is no summary to include on the very first iteration, so we skip it until the first summary is generated.
        # Recurrent Condition
        if i > 0:
            # add the previous summary to the current aggregate doc to be summarized together
            previous_summary = pipeline_iteration_results[i - 1]["summary"]
            aggregate_document = previous_summary + aggregate_document
            pipelineDict.update({"aggregate_document" : aggregate_document})
            
        if aggregate_doc_variant == "abstractive_primary":
            document = generate_abstractive_summary(aggregate_document, model = abstractive_summarizer_model, max_length = max_length)
        if aggregate_doc_variant == "abstractive_secondary":
            extractive_summary_primary = generate_extractive_summary(aggregate_document, min_summary_length=100)
            document = generate_abstractive_summary(extractive_summary_primary, model = abstractive_summarizer_model, max_length = max_length)
        if aggregate_doc_variant == "extractive_primary":
            document = generate_extractive_summary(aggregate_document, min_summary_length=100)
            
        pipelineDict["summary"] = document
        pipeline_iteration_results[i] = pipelineDict
            
    print("Seed Document:" + "\n")
    print(initial_document, "\n")
    print(divString(100) + "\n")
    print("Resulting Document:" + "\n")
    print(document, "\n")
    return document, pipeline_iteration_results

#######
# RUN #
#######

# Initial Document Selection
search_results_df = search_and_subset_data(covid_test_data, "death") # a df containing entries that have death in the first paragraph
corpus = search_results_df.first_paragraph.to_list()  # Converting to a list to vectorize the entries
initial_document = select_random_document(search_results_df).first_paragraph.values[0] # Selected a random row from the search results dataframe and extracted the first paragraph text to serve as our document

# Pipeline Iteration
pipeline_results = iterate_pipeline_recurrent(initial_document, corpus, num_iter = 5, max_length = 512, aggregate_doc_variant = "abstractive_secondary")
print_pipeline(pipeline_results)

Seed Document:


####################################################################################################

Resulting Document:

the world's leading authority on infectious disease expressed hope in april that no more would die from it. there is no evidence that the virus is going away anytime soon, according to a new york times database. a revered research center predicted that the figure would be just over 70, 000 by early august. 

{0: {'aggregate_document': 'The United States’ leading authority on infectious '
                           'disease expressed hope in April that no more than '
                           '60,000 people in the country would die from the '
                           'coronavirus. A revered research center predicted a '
                           'few weeks later that the figure would be just over '
                           '70,000 people by early August. When the number of '
                           'deaths shot up in May, President Trump sa

In [12]:
################################
# ITERATIVE PIPELINE (STACKED) #
################################

test_document = "WASHINGTON — The coronavirus pandemic is sweeping through death row at the federal penitentiary in Terre Haute, Ind., with at least 14 of the roughly 50 men there having tested positive, lawyers for the prisoners and others familiar with their cases said."

#Pipeline Iteration
def iterate_pipeline_stacked(initial_document, corpus, n_docs = 5, num_iter = 5, max_length = 512, aggregate_doc_variant = "abstractive_secondary"):
    """
    This function begins with an initial document. From there the initial document is clustered with similar documents, 
    Params:
    initial_document: Some string.
    corpus: The corpus.
    n_docs: The number of similar documents that are aggregated and summarized at each iteration.
    num_iter: The number of iterations the pipeline is run for.
    max_length: The maximum number of tokens to cut off at. (Absolute max: 512 (This is due to the limitations of the Bert model we're using))
    aggregate_doc_variant: The specific summarization process to perform. (Options: abstractive_primary, abstractive_secondary, extractive_primary)
    """
    pipeline_iteration_results = {}
    document = initial_document
    
    for i in range(num_iter):
        cosine_similarities = compute_cosine_similarities(document, corpus)
        related_docs_indices = get_related_docs_indices(cosine_similarities, n_docs=n_docs)
        related_docs = get_related_docs(corpus, related_docs_indices)
        aggregate_document = concatenate_related_docs(corpus, related_docs_indices)
        
        pipelineDict = {"seed_document": document,
        "related_documents" : related_docs,
        "cosine_similarity_scores" : get_top_similarities(cosine_similarities),
        "related_document_indices" : related_docs_indices,
        "aggregate_document": aggregate_document,
        "summary_type" : aggregate_doc_variant,
        "summary": ""
       }
            
        if aggregate_doc_variant == "abstractive_primary":
            document = generate_abstractive_summary(aggregate_document, model = abstractive_summarizer_model, max_length = max_length)
        if aggregate_doc_variant == "abstractive_secondary":
            extractive_summary_primary = generate_extractive_summary(aggregate_document, min_summary_length=100)
            document = generate_abstractive_summary(extractive_summary_primary, model = abstractive_summarizer_model, max_length = max_length)
        if aggregate_doc_variant == "extractive_primary":
            document = generate_extractive_summary(aggregate_document, min_summary_length=100)
        
        # Update pipeline dictionary
        pipelineDict["summary"] = document
        pipeline_iteration_results[i] = pipelineDict
           
    # Stacking the summaries, summarizing stack, updating pipeline dictionary
    pipeline_iteration_results.update(stack_summaries(pipeline_iteration_results))
    
    #Display Input/Output
    show_result_seed_comparison(initial_document, document)
    return document, pipeline_iteration_results

#######
# RUN #
#######

# Initial Document Selection
search_results_df = search_and_subset_data(full_nyt_data, "death") # a df containing entries that have death in the first paragraph
corpus = search_results_df.first_paragraph.to_list()  # Converting to a list to vectorize the entries
initial_document = select_random_document(search_results_df).first_paragraph.values[0] # Selected a random row from the search results dataframe and extracted the first paragraph text to serve as our document

# Pipeline Iteration
pipeline_results = iterate_pipeline_stacked(initial_document, corpus, num_iter = 3, max_length = 512, aggregate_doc_variant = "abstractive_secondary")
print_pipeline(pipeline_results)

Seed Document:

Approximately 85 percent of children with cancer are cured. However, about 15 percent confront the sort of aggressive disease that cut short the life of Tyler Trent at the age of 20 on Jan. 1, 2019. “One hundred years down the line, maybe my legacy could have an impact”: so Tyler said about his efforts to raise awareness of the need for further research in pediatric oncology. Two years after his death, Tyler’s physicians continue to help incurable as well as cured children lead longer and better lives. 

####################################################################################################

Resulting Document:


{0: {'aggregate_document': 'JERUSALEM —\xa0Fractured by internal political '
                           'conflicts, confusing instructions and a lack of '
                           'public trust in the government, Israel seems to be '
                           'fraying further under a second national lockdown '
                           'as the 

In [25]:
##################
# TEST DOCUMENTS #
##################

test_document = "WASHINGTON — The coronavirus pandemic is sweeping through death row at the federal penitentiary in Terre Haute, Ind., with at least 14 of the roughly 50 men there having tested positive, lawyers for the prisoners and others familiar with their cases said."
test_document_2 = "WASHINGTON — Kirstjen Nielsen, the homeland security secretary, said on Monday that cyberthreats against the United States were a national security crisis that she described as her top priority — not the situation for which President Trump last month declared a national emergency."
test_document_3 = "Los Angeles County could see 'catastrophic suffering and death' in the coming weeks, public health officials warn, as the nation's most populous county reported another record day of new coronavirus cases."
test_document_4 = "Was Ronald Reagan a kindhearted conservative who remade government and merits his standing as a beloved icon of the Republican Party? Or was he a glorified actor who won election with a coded racist appeal to white voters, setting the stage for the rise of President Trump?"

In [24]:
################################
# ITERATIVE PIPELINE (GENERAL) #
################################

# ONE PIPELINE TO RULE THEM ALL

#Pipeline Iteration
def iterate_pipeline_general(initial_document, corpus, n_docs = 5, num_iter = 5, max_length = 512, similarity_selection_rule = "top", aggregate_doc_variant = "abstractive_secondary", summarize_stack = False, use_recurrence = False):
    """
    This function begins with an initial document. From there the initial document is clustered with similar documents, 
    Params:
    initial_document: Some string.
    corpus: The corpus.
    n_docs: The number of similar documents that are aggregated and summarized at each iteration.
    num_iter: The number of iterations the pipeline is run for.
    max_length: The maximum number of tokens to cut off at. (Absolute max: 512 (This is due to the limitations of the Bert model we're using))
    aggregate_doc_variant: The specific summarization process to perform. (Options: abstractive_primary, abstractive_secondary, extractive_primary)
    """
    pipeline_iteration_results = {}
    document = initial_document
    
    for i in tqdm(range(num_iter)):
        cosine_similarities = compute_cosine_similarities(document, corpus)
        related_docs_indices = get_related_docs_indices(cosine_similarities, n_docs=n_docs, type_=similarity_selection_rule)
        related_docs = get_related_docs(corpus, related_docs_indices)
        aggregate_document = concatenate_related_docs(corpus, related_docs_indices)
        
        pipelineDict = {"seed_document": document,
        "related_documents" : related_docs,
        "cosine_similarity_scores" : get_top_similarities(cosine_similarities),
        "related_document_indices" : related_docs_indices,
        "aggregate_document": aggregate_document,
        "summary_type" : aggregate_doc_variant,
        "summary": ""
       }
        
        # There is no summary to include on the very first iteration, so we skip it until the first summary is generated.
        # Recurrent Condition
        if i > 0 and use_recurrence == True:
            # add the previous summary to the current aggregate doc to be summarized together
            previous_summary = pipeline_iteration_results[i - 1]["summary"]
            aggregate_document = previous_summary + aggregate_document
            pipelineDict.update({"aggregate_document" : aggregate_document})
            
        # Summarization
        if aggregate_doc_variant == "abstractive_primary":
            document = generate_abstractive_summary(aggregate_document, model = abstractive_summarizer_model, max_length = max_length)
        if aggregate_doc_variant == "abstractive_secondary":
            extractive_summary_primary = generate_extractive_summary(aggregate_document, min_summary_length=100)
            document = generate_abstractive_summary(extractive_summary_primary, model = abstractive_summarizer_model, max_length = max_length)
        if aggregate_doc_variant == "extractive_primary":
            document = generate_extractive_summary(aggregate_document, min_summary_length=100)
    
        # Update Pipeline Dictionary
        pipelineDict["summary"] = document
        pipelineDict["evaluation_metrics"] = compute_all_metrics(pipelineDict["summary"], pipelineDict["aggregate_document"])
        pipeline_iteration_results[i] = pipelineDict
        
        # Compute and add saturation score to evaluation metrics.
        if i == 0:
            pipeline_iteration_results[i]["evaluation_metrics"]["saturation_score"] = len(pipeline_iteration_results[0]["related_document_indices"])
        if (i > 0) and (i < num_iter):
            prev = pipeline_iteration_results[i - 1]["related_document_indices"]
            curr = pipeline_iteration_results[i]["related_document_indices"]
            saturation = compute_simple_saturation(prev, curr)
            pipeline_iteration_results[i]["evaluation_metrics"]["saturation_score"] = saturation
    
    # Stacking the summaries, summarizing stack, updating pipeline dictionary
    if summarize_stack == True:
        pipeline_iteration_results.update(stack_summaries(pipeline_iteration_results))
        
    # Show Inputs/Outputs
    show_result_seed_comparison(initial_document, document)
    return document, pipeline_iteration_results



#######
# RUN #
#######

# # Initial Document Selection (Random Selection)
# search_results_df = multi_search(full_nyt_data, ["death", "trump"])# a df containing entries that have death and trump in the first paragraph
# corpus = search_results_df.first_paragraph.to_list()  # Converting to a list to vectorize the entries
# initial_document = select_random_document(search_results_df).first_paragraph.values[0] # Selected a random row from the search results dataframe and extracted the first paragraph text to serve as our document

# Initial Document Selection (Using Test Documents Defined At Top Of Cell)
search_results_df = multi_search(covid_test_data, ["death"])
corpus = search_results_df.first_paragraph.to_list()  # Converting to a list to vectorize the entries
initial_document = test_document_3

# Pipeline Iteration
pipeline_general_results = iterate_pipeline_general(initial_document,
                         corpus,
                         n_docs = 5,
                         num_iter = 3,
                         max_length = 512,
                         similarity_selection_rule = "top",
                         aggregate_doc_variant = "abstractive_secondary",
                         summarize_stack = False,
                         use_recurrence = True)

# Print Pipeline
print_pipeline(pipeline_general_results)
print("Saturation Scores: ", get_saturation_scores(pipeline_general_results))

100%|██████████| 3/3 [00:26<00:00,  8.78s/it]

Seed Document:

Los Angeles County could see 'catastrophic suffering and death' in the coming weeks, public health officials warn, as the nation's most populous county reported another record day of new coronavirus cases. 

####################################################################################################

Resulting Document:

los angeles county is one of the hardest - hit areas in the u. s. the number of people with the coronavirus in the united states has passed 300, 000 on monday. as the total number of coronavirus cases reaches 24 million on monday, as the number continues to rise. 

{0: {'aggregate_document': 'As the total number of U.S. coronavirus cases '
                           'surpassed 24 million on Monday, Los Angeles '
                           'County, one of the hardest-hit areas, may face '
                           'even more dire weeks ahead. Deaths in the county '
                           'have continued to climb as the national death toll '





In [46]:
# Initial Document Selection (Using Test Documents Defined At Top Of Cell)
search_results_df = multi_search(covid_test_data, ["death"])
corpus = search_results_df.first_paragraph.to_list()  # Converting to a list to vectorize the entries
initial_document = test_document_3

# Pipeline Iteration
baseline_model = iterate_pipeline_general(initial_document,
                                          corpus,
                                          n_docs = 5,
                                          num_iter = 1,
                                          max_length = 512,
                                          similarity_selection_rule = "top",
                                          aggregate_doc_variant = "abstractive_secondary",
                                          summarize_stack = False,
                                          use_recurrence = False)


stacked_model = iterate_pipeline_general(initial_document,
                                          corpus,
                                          n_docs = 5,
                                          num_iter = 3,
                                          max_length = 512,
                                          similarity_selection_rule = "top",
                                          aggregate_doc_variant = "abstractive_secondary",
                                          summarize_stack = True,
                                          use_recurrence = False)


recurrent_model = iterate_pipeline_general(initial_document,
                                          corpus,
                                          n_docs = 5,
                                          num_iter = 3,
                                          max_length = 512,
                                          similarity_selection_rule = "top",
                                          aggregate_doc_variant = "abstractive_secondary",
                                          summarize_stack = False,
                                          use_recurrence = True)


100%|██████████| 1/1 [00:09<00:00,  9.56s/it]
  0%|          | 0/3 [00:00<?, ?it/s]

Seed Document:

Los Angeles County could see 'catastrophic suffering and death' in the coming weeks, public health officials warn, as the nation's most populous county reported another record day of new coronavirus cases. 

####################################################################################################

Resulting Document:

los angeles county is one of the hardest - hit areas in the u. s. the number of people with the coronavirus in the united states has passed 300, 000 on monday. the total number of coronavirus cases reached a quarter - million on monday, less than four weeks after the nation's death toll reached 24 million. 



100%|██████████| 3/3 [00:27<00:00,  9.14s/it]
  0%|          | 0/3 [00:00<?, ?it/s]

Seed Document:

Los Angeles County could see 'catastrophic suffering and death' in the coming weeks, public health officials warn, as the nation's most populous county reported another record day of new coronavirus cases. 

####################################################################################################

Resulting Document:

los angeles county is one of the hardest - hit areas in the u. s. the number of people with the coronavirus in the united states has passed 300, 000 on monday. the total number of coronavirus cases reached a quarter - million on monday, less than four weeks after the nation's death toll reached 24 million. 



100%|██████████| 3/3 [00:25<00:00,  8.60s/it]

Seed Document:

Los Angeles County could see 'catastrophic suffering and death' in the coming weeks, public health officials warn, as the nation's most populous county reported another record day of new coronavirus cases. 

####################################################################################################

Resulting Document:

los angeles county is one of the hardest - hit areas in the u. s. the number of people with the coronavirus in the united states has passed 300, 000 on monday. as the total number of coronavirus cases reaches 24 million on monday, as the number continues to rise. 






In [94]:
print("BASELINE MODEL")
show_model_scores(baseline_model)
print(divString(100))
print("STACKED MODEL")
show_model_scores(stacked_model, stacked = True)
print(divString(100))
print("RECURRENT MODEL")
show_model_scores(recurrent_model)

BASELINE MODEL
Mean Bleu Score:  2.0961369038010087
Mean Rouge2 Score:  0.30769230769230776
Mean Aggregate Saturation Score:  0.6666666666666666
####################################################################################################
STACKED MODEL
Mean Bleu Score:  1.9178383819283444
Mean Rouge2 Score:  0.3051553864274713
Mean Aggregate Saturation Score:  1.0
####################################################################################################
RECURRENT MODEL
Mean Bleu Score:  1.9859537652485748
Mean Rouge2 Score:  0.3056128567205132
Mean Aggregate Saturation Score:  0.6666666666666666


### PROBLEMS & IDEAS 

1. Problem: Bert max token length is 512. As such, our aggregate summaries can grow to be too long in the recurrent and stacked pipelines. Idea: To combat this what if we further reduce the 
   dimensionality of the resulting summary prior to the final iteration by using the nltk POS parser to filter out tokens that are not nouns, verbs, etc. The idea being
   that tokens of certain parts of speech contain enough of the semantic meaning to not reduce the quality of the final summary, while keeping the dimensionality of
   the intermediate summaries minimal.
   
2. Problem: Not enough of the information from previous iterations of the pipeline is retained accross iterations. This is because there is no memory of previous
   summaries in the baseline pipeline model. Idea: To use an idea similar to the intuition behind a recurrent layer. At each iteartion, we will concatenate the previous 
   summary with the aggregate documents that are about to be summarized. The idea being that by including the summary from the previous iteration at each time step,
   we may be able to persist information accross iterations. I speculate that this may even improve summary quality, by offering an opportunity to correct semantic
   hallucinations at iterative step. (ATTEMPTED AND OPERATIONAL)
   
3. Saturation metric to deal with saturation. Potentially use this to eliminate information that is not new and therefore conense the number of documents to summarize. 
