# Hybrid Summarization Methods: A Comparative Analysis

## Abstract

In the pursuit of efficient summarization techniques, a hybrid approach that combines extractive and abstractive methods has been proposed. This document aims to analyze various combinations of these methods to determine the most effective hybrid approach for summarizing documents while maintaining accuracy, relevance, clarity, and cohesiveness.

## Introduction

Summarization algorithms can be broadly categorized into two types: extractive and abstractive. Extractive summarization methods identify key sentences or fragments in the text and compile them to form a summary. In contrast, abstractive summarization methods generate new sentences, often rephrasing or interpreting the original content to produce a concise version.
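To make the extractive side concrete, here is a minimal sketch that scores each sentence by the average document-wide frequency of its words and keeps the top `k`. The function name `extractive_summary`, the regex-based sentence splitter, and the scoring rule are illustrative assumptions, not the method used later in this notebook.

```python
import re
from collections import Counter

def extractive_summary(text, k=1):
    """Return the k highest-scoring sentences, kept in their original order."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace (an assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically.
        return sum(word_freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

doc = "Cats sleep a lot. Cats chase mice. Dogs bark."
summary = extractive_summary(doc, k=1)
print(summary)
```

An abstractive method would instead generate a new sentence that need not appear anywhere in `doc`.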

## Methodology

This study evaluates and compares hybrid summarization methods that pair the precision of extractive techniques with the narrative quality of abstractive ones. A review of the core principles of each family informed the design of five hybrid models. Each model was tested against four criteria (accuracy, relevance, clarity, and cohesiveness), and the results were then examined critically for practical applicability and validity. Finally, the evidence was synthesized to select the most effective hybrid model for summarization.

In [2]:
import tkinter as tk
from tkinter import filedialog
# Initialize the Tkinter GUI
root = tk.Tk()
root.withdraw()  # Hide the root window

# Choose a file
def choose_file():
    file_path = filedialog.askopenfilename(filetypes=[("PDF files", "*.pdf")])  # Only allow PDFs
    return file_path


In [2]:
#!pip install "unstructured[pdf]"

## Text Extraction and Preprocessing
The first step in text summarization is to extract the text from the source, which in this case is a PDF document. After extraction, the text is preprocessed to facilitate further analysis. This involves tokenizing the text into sentences and words, removing stopwords (common words that do not contribute much meaning), and stemming (reducing words to their root form).
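The preprocessing steps can be sketched without external dependencies as follows. The tiny `STOP_WORDS` set and the `crude_stem` helper are hypothetical stand-ins for NLTK's stopword list and the Porter stemmer, chosen only so the example runs on its own. The point it illustrates is that the kept sentences and their token lists must stay index-aligned.

```python
import re

# Tiny illustrative stopword list; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it"}

def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Return (kept_sentences, token_lists) with matching indices."""
    raw_sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    kept, token_lists = [], []
    for sentence in raw_sentences:
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        tokens = [crude_stem(w) for w in words if w not in STOP_WORDS]
        if tokens:  # drop sentences that contain only stopwords or punctuation
            kept.append(sentence)
            token_lists.append(tokens)
    return kept, token_lists

sentences, tokens = preprocess("The cats are sleeping. It is. Dogs barked loudly!")
print(sentences)
print(tokens)
```

The middle sentence consists only of stopwords, so it is dropped from both lists at once.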

In [23]:
from pdfminer.high_level import extract_text
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Extract text from PDF
def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)

# Preprocess text: tokenize, lowercase, drop stopwords and punctuation, then stem
def preprocess_text(text):
    # Fetch tokenizer and stopword data if missing (newer NLTK releases may also need 'punkt_tab')
    nltk.download('punkt', quiet=True)
    nltk.download('stopwords', quiet=True)

    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()

    kept_sentences = []
    preprocessed_text = []
    for sentence in sent_tokenize(text):
        tokens = [stemmer.stem(word) for word in word_tokenize(sentence.lower())
                  if word not in stop_words and word.isalnum()]
        # Keep a sentence only if tokens survive, so both lists stay index-aligned
        if tokens:
            kept_sentences.append(sentence)
            preprocessed_text.append(tokens)

    return kept_sentences, preprocessed_text


## Building the Similarity Matrix
The similarity matrix is a crucial component in the TextRank algorithm. It represents the similarity between different sentences in the text. A common approach is to use vector representations of sentences and calculate the cosine similarity between these vectors. For this purpose, we train a Word2Vec model on the preprocessed text. The similarity matrix is then constructed by calculating the cosine similarity between the vector representations of all pairs of sentences.
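As a rough illustration of the pairwise computation, the sketch below builds a similarity matrix from plain word-count vectors rather than Word2Vec embeddings. That substitution is an assumption made to keep the example dependency-free; the cosine formula and the matrix layout are the same.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_matrix(token_lists):
    """Symmetric matrix of pairwise sentence similarities with a zero diagonal."""
    vectors = [Counter(tokens) for tokens in token_lists]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # symmetric, so fill both halves at once
            sim[i][j] = sim[j][i] = cosine(vectors[i], vectors[j])
    return sim

sim = similarity_matrix([["cat", "sleep"], ["cat", "chase", "mouse"], ["dog", "bark"]])
for row in sim:
    print([round(value, 3) for value in row])
```

Sentences that share no words get a similarity of zero here, whereas dense embeddings such as Word2Vec can still assign them a nonzero score.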

In [24]:
import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# Build similarity matrix for TextRank
def build_similarity_matrix(sentences, preprocessed_text):
    # Train Word2Vec on the document itself; min_count=1 keeps every token in the vocabulary
    model = Word2Vec(preprocessed_text, vector_size=100, window=2, min_count=1, workers=2)
    model_wv = model.wv

    # Average the word vectors once per sentence instead of once per sentence pair
    sentence_vectors = []
    for tokens in preprocessed_text:
        word_vectors = [model_wv[word] for word in tokens if word in model_wv]
        sentence_vectors.append(np.mean(word_vectors, axis=0) if word_vectors else None)

    sim_matrix = np.zeros((len(sentences), len(sentences)))
    for i, vector_i in enumerate(sentence_vectors):
        for j, vector_j in enumerate(sentence_vectors):
            # Skip self-similarity and sentences without a usable vector
            if i != j and vector_i is not None and vector_j is not None:
                sim_matrix[i][j] = cosine_similarity([vector_i], [vector_j])[0, 0]

    return sim_matrix

## Applying TextRank
TextRank is an algorithm based on PageRank, the algorithm Google used for ranking web pages. In the context of text summarization, it ranks sentences by their importance within the text. The `apply_text_rank` function runs the networkx implementation of PageRank on a graph built from the similarity matrix. If the power iteration fails to converge, the function retries up to five times, loosening the tolerance tenfold and doubling the iteration limit on each attempt; as a last resort, it falls back to degree centrality.
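The ranking step can be sketched as a plain power iteration. This is a simplified illustration rather than the networkx implementation: the damping factor of 0.85, the convergence test, and the example weights are assumptions, and isolated sentences are simply skipped.

```python
def pagerank(sim, damping=0.85, tol=1e-6, max_iter=200):
    """Power iteration over a weighted, undirected sentence graph."""
    n = len(sim)
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on a share of its score proportional to the edge weight.
            incoming = sum(sim[j][i] / row_sums[j] * scores[j]
                           for j in range(n) if row_sums[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            return scores
    raise RuntimeError("power iteration did not converge")

# Hypothetical similarity weights: sentence 0 is the most strongly connected.
sim = [[0.0, 0.5, 0.4],
       [0.5, 0.0, 0.1],
       [0.4, 0.1, 0.0]]
scores = pagerank(sim)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Sentences with stronger links to the rest of the document end up ranked first, which is the property the summarizer relies on when selecting sentences.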

In [None]:
import networkx as nx

# Apply TextRank to the similarity matrix
def apply_text_rank(sim_matrix, sentences):
    normalized_sim_matrix = normalize_similarity_matrix(sim_matrix)
    
    nx_graph = nx.from_numpy_array(normalized_sim_matrix)
    tolerance = 1e-3
    max_iter = 1000
    scores = None

    for attempt in range(5):
        try:
            scores = nx.pagerank(nx_graph, max_iter=max_iter, tol=tolerance)
            break
        except nx.PowerIterationFailedConvergence:
            tolerance *= 10
            max_iter *= 2

    if scores is None:
        print("PageRank failed to converge, using fallback strategy.")
        scores = nx.degree_centrality(nx_graph)

    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    return [s for score, s in ranked_sentences]

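# Normalize the similarity matrix so that each non-empty row sums to 1
def normalize_similarity_matrix(sim_matrix):
    row_sums = sim_matrix.sum(axis=1, keepdims=True)
    # Pass `out` so rows that sum to zero stay zero instead of holding uninitialized values
    normalized_matrix = np.divide(sim_matrix, row_sums, out=np.zeros_like(sim_matrix, dtype=float), where=row_sums != 0)
    return normalized_matrix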


## Integration of T5 for Summarization
The T5 (Text-to-Text Transfer Transformer) model is a powerful NLP tool capable of performing various text-based tasks, including summarization. The code integrates T5 to refine the summary produced by TextRank. It first prepares the text by prefixing it with "summarize:" and then encodes it using the T5 tokenizer. The T5 model then generates a summary that is concise and relevant.
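Because the encoder input is capped at a fixed number of tokens, longer text is cut off unless it is split first. The sketch below shows one possible way to build prefixed chunks. The helper name `chunk_for_t5` and the 300-word budget are hypothetical, and whitespace-separated words are only a rough proxy for the tokenizer's subword tokens.

```python
def chunk_for_t5(text, max_words=300, prefix="summarize: "):
    """Split text into prefixed chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [prefix + " ".join(words[start:start + max_words])
            for start in range(0, len(words), max_words)]

# 700 placeholder words stand in for a long paragraph.
chunks = chunk_for_t5("word " * 700, max_words=300)
print(len(chunks))
print(chunks[0][:30])
```

Each chunk can then be encoded and summarized separately, at the cost of losing context that spans chunk boundaries.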

In [25]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load T5 model and tokenizer
model_name = 't5-base'  
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Prepare text for T5
def prepare_text_for_t5(text):
    preprocessed_text = "summarize: " + text
    return tokenizer.encode(preprocessed_text, return_tensors="pt", max_length=512, truncation=True)

# Summarize text with T5
def summarize_with_t5(encoded_text):
    summary_ids = model.generate(encoded_text, min_length=30, max_length=200, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary




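## Creating the Final Summary
The `final_summary` function extracts the text from the PDF, keeps paragraphs of at least 50 words, summarizes each one with T5, and joins the paragraph summaries into a single document summary.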

In [41]:
# Relies on extract_text_from_pdf, prepare_text_for_t5 and summarize_with_t5 defined above

# Main function to summarize each paragraph
def final_summary(pdf_path):
    print("Starting text extraction from PDF...")
    text = extract_text_from_pdf(pdf_path)
    print("Extraction complete. Preprocessing text...")

    paragraphs = text.split('\n\n')
    filtered_paragraphs = [paragraph for paragraph in paragraphs if len(paragraph.split()) >= 50]

    print(f"Number of paragraphs extracted: {len(filtered_paragraphs)}")
    summaries = []

    for paragraph in filtered_paragraphs:
        print("Preparing paragraph for T5 summarization...")
        encoded_text = prepare_text_for_t5(paragraph)
        print("Paragraph prepared for T5.")

        print("Generating summary for paragraph with T5...")
        summary = summarize_with_t5(encoded_text)
        print("Summary generated for paragraph with T5.")

        summaries.append(summary)

    # Combine all paragraph summaries into one final document summary
    # (named combined_summary so it does not shadow the enclosing function)
    combined_summary = ' '.join(summaries)
    print("All paragraphs summarized. Final summary created.")

    return combined_summary

In [42]:
# Run the hybrid summarizer
T5_summary = final_summary(pdf_path)
print(T5_summary)

Starting text extraction from PDF...
Extraction complete. Preprocessing text...
Number of paragraphs extracted: 13
Preparing paragraph for T5 summarization...
Paragraph prepared for T5.
Generating summary for paragraph with T5...
Summary generated for paragraph with T5.
Preparing paragraph for T5 summarization...
Paragraph prepared for T5.
Generating summary for paragraph with T5...
Summary generated for paragraph with T5.
Preparing paragraph for T5 summarization...
Paragraph prepared for T5.
Generating summary for paragraph with T5...
Summary generated for paragraph with T5.
Preparing paragraph for T5 summarization...
Paragraph prepared for T5.
Generating summary for paragraph with T5...
Summary generated for paragraph with T5.
Preparing paragraph for T5 summarization...
Paragraph prepared for T5.
Generating summary for paragraph with T5...
Summary generated for paragraph with T5.
Preparing paragraph for T5 summarization...
Paragraph prepared for T5.
Generating summary for paragraph w

In [None]:
#!pip install language_tool_python

In [53]:
import language_tool_python
tool = language_tool_python.LanguageTool('en-GB')  # British English; 'en-UK' is not a standard language code
matches = tool.check(T5_summary)
corrected_text = language_tool_python.utils.correct(T5_summary, matches)
print(corrected_text)

The purpose of this brief commentary is to provide a brief overview of Weber’s life, work, and contributions to management thought. The commentary begins with a brief biographical sketch followed by an examination of Weber’s conceptualization of authority, its influence on the field of management and its relevancy in the twenty-first century. Some organizational theorists have questioned the relevancy of Weber’s theories in today’s late-modern knowledge-based information age. A complete examination of the enduring influence of the entire breadth of Weber’s writings in the current context is well beyond the scope of this brief commentary. Max Weber was born in 1864 in Erfurt, Germany, the oldest of eight children. He studied law at the university of Heidelberg before completing his doctoral dissertation. His academic career would span law, history, economics, philosophy, and sociology. In 1904, after a five-year period during which he published virtually nothing, Weber began publishing 

## GPT-4 + K-Means Clustering Approach

In [5]:
import os
from pdfminer.high_level import extract_text
import pandas as pd
from sklearn.cluster import KMeans
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Extract text from PDF and remove short paragraphs
def extract_text_from_pdf(pdf_path):
    text = extract_text(pdf_path)
    paragraphs = text.split('\n\n')
    filtered_paragraphs = [p for p in paragraphs if len(p.split()) >= 50]
    return '\n\n'.join(filtered_paragraphs)

# Pick the PDF through the file dialog defined earlier
pdf_path = choose_file()
pdf_text = extract_text_from_pdf(pdf_path)

# One row per retained paragraph
df = pd.DataFrame({'text': pdf_text.split('\n\n')})

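# Embedding
# The API key is read from the OPENAI_API_KEY environment variable; never hard-code credentials
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this cell.")
embeddings = OpenAIEmbeddings()
df['embedding'] = df['text'].apply(lambda x: embeddings.embed_query(x))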

# Clustering
n_clusters = min(10, len(df))  # K-Means cannot form more clusters than there are paragraphs
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
df['Cluster'] = kmeans.fit_predict(list(df['embedding']))

# Summarization with GPT-4 Turbo
map_template = """The following is a set of documents
{docs}
Based on this list of docs, please identify the key concepts conveyed
Helpful Answer:"""
map_prompt = PromptTemplate.from_template(map_template)
llm = LLMChain(llm=ChatOpenAI(model_name="gpt-4-1106-preview"), prompt=map_prompt)  # Define your prompt template

def summarize_cluster(cluster_texts):
    # Combine texts and pass to GPT-4 for summarization
    combined_text = "\n".join(cluster_texts)
    return llm.run(combined_text)

# Apply summarization to each cluster
cluster_summaries = df.groupby('Cluster')['text'].apply(summarize_cluster)

# Print or process the summaries
# Series.items() replaces the deprecated Series.iteritems()
for cluster_id, summary in cluster_summaries.items():
    print(f"Cluster {cluster_id} Summary:")
    print(summary)
    print("\n" + "="*80 + "\n")

# Aggregate cluster summaries
all_summaries = [summary for cluster_id, summary in cluster_summaries.items()]

# Combine all summaries into one text
combined_summaries = "\n".join(all_summaries)




Cluster 0 Summary:
The key concepts conveyed in this set of documents include:

1. **Charisma in Emergencies**: Bendix (1977) suggests that charisma tends to emerge or become more apparent during emergency situations. This implies that in times of crisis, individuals with charismatic qualities are more likely to be recognized or to assume leadership roles due to their ability to inspire or mobilize people.

2. **Abnormal Conditions and Charisma**: Mommsen (1974) points out that the purest form of charisma is often linked to abnormal situations. This concept aligns with the idea that charisma is not a common feature of everyday life but rather something that stands out or becomes evident in extraordinary circumstances.

3. **Disruption of Everyday Life**: Schluchter (1988) argues that charisma becomes particularly sought after when normalcy is significantly disrupted. This concept suggests that when the routine structure of life is shattered, people tend to look for individuals who exhi



In [37]:
# Summarization template for the combined summaries
summary_template = """The following points summarize the key themes and sub-themes from a set of documents:
{points}
Please create a concise summary that includes all the major themes in a fraction of the original text's length."""

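# Build a prompt template that keeps the {points} placeholder, and a dedicated chain for it
# (`llm` is the LLMChain defined earlier; `llm.llm` is its underlying chat model)
summary_prompt = PromptTemplate.from_template(summary_template)
summary_chain = LLMChain(llm=llm.llm, prompt=summary_prompt)

# Summarize the combined summaries
concise_summary = summary_chain.run(points=combined_summaries)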

print(concise_summary)

The main themes identified from the set of documents are: 

1. Max Weber's influence on management and organizational theory, including his conceptualization of authority and its relevance in the 21st century.
2. The three types of authority outlined by Weber - Traditional, Rational-Legal, and Charismatic, and their contrasting characteristics, roles, and impacts.
3. The process of "routinizing" Charismatic authority, its transient nature in times of crisis, and its role as a form of rebellion.
4. The transformation of authority structures in the industrial and information age sparked by technological advancements.
5. The concept of self-managing organizational forms and the shift in control from traditional management to workers.
6. Weber's Iron Cage concept and its implications in modern organizational practices.
7. The paradox of empowerment potentially leading to stricter control within organizations.
8. The impact of these concepts on contemporary understanding of complex organiza

In [22]:
from transformers import PegasusTokenizer, PegasusForConditionalGeneration
from pdfminer.high_level import extract_text
import language_tool_python

# Load the tokenizer and model
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

def generate_summary(paragraphs):
    summaries = []
    for paragraph in paragraphs:
        tokens = tokenizer(paragraph, truncation=True, padding="longest", max_length=512, return_tensors="pt")
        summary_tokens = model.generate(**tokens)
        summary = tokenizer.decode(summary_tokens[0], skip_special_tokens=True)
        summaries.append(summary)
    return summaries

def extract_and_summarize(pdf_path):
    text = extract_text(pdf_path)
    paragraphs = text.split('\n\n')
    filtered_paragraphs = [p for p in paragraphs if len(p.split()) >= 50]

    print(f"Number of paragraphs extracted: {len(filtered_paragraphs)}")
    print("Extraction complete. Preprocessing text...")

    summaries = generate_summary(filtered_paragraphs)
    combined_summary = ' '.join(summaries)

    return combined_summary



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
# Run the summary function
pdfpath = choose_file()
summary = extract_and_summarize(pdfpath)

print(summary)

Number of paragraphs extracted: 38
Extraction complete. Preprocessing text...
The sociologist Max Weber considered three types of authority: traditional, bureaucratic and charismatic. In our series of letters from African journalists, filmmaker and columnist Ahmedou Ould-Abdallah looks at Max Weber’s interest in bureaucratic authority. There are two types of authority: legal authority and traditional authority. There are three types of charisma: charisma as a force for good, charisma as a force for evil, and charisma as a force for good. Weber was interested in charisma as a type of Herrschaft. This article is the first in a series on Weber’s theories of charisma. Herrschaft can be rendered as ‘rule’, ‘dominion’, ‘control’, ‘power’ or ‘sway’. Weber often refers to Herrschaft as ‘autorit at’ or ‘authority’. In our series of letters from German journalists, film-maker and columnist Paul Honigsheim looks at one of the greatest preoccupations of German intellectuals of the 20th Century, Em

In [25]:
import os

# Must be set as an environment variable before CUDA is initialized;
# assigning a plain Python variable has no effect
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from transformers import PegasusTokenizer, PegasusForConditionalGeneration
from pdfminer.high_level import extract_text
import torch

# Load the tokenizer and model
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

# Enable GPU if it's available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def generate_summary(paragraphs, batch_size=4):
    summaries = []
    for i in range(0, len(paragraphs), batch_size):
        batch = paragraphs[i:i+batch_size]
        tokens = tokenizer(batch, truncation=True, padding=True, max_length=512, return_tensors="pt").to(device)
        summary_tokens = model.generate(**tokens)
        batch_summaries = [tokenizer.decode(t, skip_special_tokens=True) for t in summary_tokens]
        summaries.extend(batch_summaries)
    return summaries

def extract_and_summarize(pdf_path):
    text = extract_text(pdf_path)
    paragraphs = text.split('\n\n')
    filtered_paragraphs = [p for p in paragraphs if len(p.split()) >= 50]

    print(f"Number of paragraphs extracted: {len(filtered_paragraphs)}")
    print("Extraction complete. Preprocessing text...")

    summaries = generate_summary(filtered_paragraphs, batch_size=4)  # Adjust the batch size based on your GPU's memory
    combined_summary = ' '.join(summaries)

    return combined_summary

pdfpath = choose_file()
summary = extract_and_summarize(pdfpath)

print(summary)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


## Conclusion
The combination of these steps forms a pipeline for text summarization. The preprocessed text is used to build a similarity matrix, which is then used by the TextRank algorithm to rank sentences by importance. The most important sentences are selected to form a summary of the original text.

For the TextRank similarity matrix, cosine similarity between averaged Word2Vec sentence embeddings is a reasonable choice because it captures semantic similarity between sentences. Depending on the use case and the nature of the text, TF-IDF vectors or BERT embeddings may perform better.
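As a sketch of the TF-IDF alternative, the example below weights each word by how rare it is across sentences before taking the cosine. The smoothed IDF formula and the helper names are assumptions chosen for illustration; a library vectorizer would normally be used in practice.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Build one sparse TF-IDF vector (dict) per non-empty token list."""
    n = len(token_lists)
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF, so a word present in every sentence still gets a nonzero weight.
        vectors.append({w: (c / len(tokens)) * (math.log((1 + n) / (1 + doc_freq[w])) + 1)
                        for w, c in counts.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vecs = tfidf_vectors([["cat", "sleep"], ["cat", "chase", "mouse"], ["dog", "bark"]])
print(round(cosine(vecs[0], vecs[1]), 3))
print(cosine(vecs[0], vecs[2]))
```

Because the shared word appears in two of the three sentences, it is down-weighted, so the first two sentences score lower than they would with raw word counts.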




## References

- Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Texts. Association for Computational Linguistics.
- Erkan, G., & Radev, D. R. (2004). LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research.
- Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of research and development.
- Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science.
- Radev, D. R., Jing, H., Styś, M., & Tam, D. (2004). Centroid-based summarization of multiple documents. Information Processing & Management.
- Steinberger, J., & Jezek, K. (2004). Using Latent Semantic Analysis in Text Summarization and Summary Evaluation. Proceedings of ISIM.
- Brown, P. F., Della Pietra, V. J., deSouza, P. V., Lai, J. C., & Mercer, R. L. (1992). Class-Based n-gram Models of Natural Language. Computational Linguistics.
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research.
- Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Association for Computational Linguistics.
- Zhang, J., Zhao, Y., Saleh, M., & Liu, P. J. (2020). PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. Association for Computational Linguistics.
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.
