# An Investigation Into Different Approaches For Summerisation Using NLP

## Abstract

Text summarization in NLP is the process of summarising the large body of text into smaller chunk suitable which is more suitable for the readers comprehension. In this article, I will investigate different approaches to summerisation from traditional to advance methods such as generative AI

## Introduction
Summerisation of large body of text has been an essential need, during our daily lives when we are working or reading, we realise a lot of information we are reading can be summerised to a smaller more digestable format, we also have issue of needing to read a document but lacking the time as we are too busy with other matter. This could rnage from newspapers, articles, research papers, books etc. 

What I want to accomplish through this investigation is to discover the most cost-effective approach to summerisation and integrate this feature to our AI learning assistant we are developing. We will be researching and documenting various techniques used for summerisation and choose the most suitable approach after rigourously evaluating the method. 

## Types of Text Summarization
After an extensive investigation finding numerous summerisation techniques we have categorised the two different types of approaches to text summerisation, these are the approaches, Extractive and Abstractive

### Extractive Text Summarization
It is the traditional method developed first. The main objective is to identify the significant sentences of the text and add them to the summary. You need to note that the summary obtained contains exact sentences from the original text.

### Abstractive Text Summarization
It is a more advanced method, with many advancements coming out frequently. The approach is to identify the important sections, interpret the context, and reproduce in a new way. This ensures that the core information is conveyed through the shortest text possible. Here, the sentences in the summary are generated, not just extracted from the original text.


## Extractive Summerisation Method: TextRank with Gensim
TextRank is a graph-based ranking algorithm specifically adapted for text processing. It is similar to LexRank in that it is based on the concept of ranking sentences for importance within the text. The core idea is that sentences "recommend" each other, much like web pages do in Google's PageRank algorithm. Each sentence is treated as a node, and the connections between them are based on their similarity. A voting or recommendation process occurs where the importance of a sentence is determined by the importance of the sentences recommending it.

As a result of this iterative voting or recommendation process, TextRank identifies sentences that are central to the text and thus should be included in the summary. The sentences chosen for the summary are those that are most highly ranked according to this algorithm.

TextRank excels when applied to texts that have a dense web of semantic similarities between sentences, such as scientific articles, technical papers, and legal documents. It is particularly useful for documents where the use of domain-specific vocabulary leads to clear patterns of word use within the text.

### Step 1: Add all neccessary installation and choose a File
Add all neccessary imports and create function to prompt the user to select a PDF file from their system using a file dialog.


In [1]:
#pip install pdfminer.six nltk networkx gensim tk

In [7]:
import tkinter as tk
from tkinter import filedialog

def choose_file():
    # Initialize the Tkinter GUI
    root = tk.Tk()
    root.withdraw()  # Hide the root window
    file_path = filedialog.askopenfilename(filetypes=[("PDF files", "*.pdf")])  # Only allow PDFs
    return file_path


### Step 2: Text Extraction and Preprocessing

The following functions are used to extract text from a PDF and preprocess it for summarization.


In [3]:
from pdfminer.high_level import extract_text

def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)


In [4]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    
    # Split text into sentences
    sentences = sent_tokenize(text)
    
    # Tokenize, stem, and remove stop words
    preprocessed_text = []
    for sentence in sentences:
        tokens = [stemmer.stem(word) for word in word_tokenize(sentence.lower())
                  if word not in stop_words and word.isalnum()]
        if tokens:  # Only add non-empty lists
            preprocessed_text.append(tokens)
    
    # Filter out sentences that correspond to empty preprocessed lists
    sentences = [sentences[i] for i, tokens in enumerate(preprocessed_text) if tokens]
    
    return sentences, preprocessed_text



[nltk_data] Error loading punkt: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


### Step 3: Implement TextRank

We will now define functions to build a similarity matrix using Word2Vec and apply the TextRank algorithm to rank sentences.


In [18]:
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec

def build_similarity_matrix(sentences, preprocessed_text):
    # Train a Word2Vec model
    if preprocessed_text:
        model = Word2Vec(preprocessed_text, vector_size=100, window=2, min_count=1, workers=2)
        model_wv = model.wv
    else:
        raise ValueError("The preprocessed text is empty. Please check your preprocessing steps.")
    
    # Create an empty similarity matrix
    sim_matrix = np.zeros((len(sentences), len(sentences)))
    
    # Build the similarity matrix
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                # Ensure each sentence has at least one word after preprocessing
                if preprocessed_text[i] and preprocessed_text[j]:
                    vector_i = np.mean([model_wv[word] for word in preprocessed_text[i] if word in model_wv], axis=0)
                    vector_j = np.mean([model_wv[word] for word in preprocessed_text[j] if word in model_wv], axis=0)
                    
                    # Check that vectors are valid (not NaN or infinite)
                    if not np.all(np.isfinite(vector_i)) or not np.all(np.isfinite(vector_j)):
                        continue
                    
                    sim_matrix[i][j] = cosine_similarity([vector_i], [vector_j])[0, 0]
    
    return sim_matrix


def apply_text_rank(sim_matrix, sentences):
    nx_graph = nx.from_numpy_array(sim_matrix)
    scores = nx.pagerank(nx_graph, max_iter=10000, tol=1e-3)
    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    return ranked_sentences

def summarize(pdf_path, num_sentences=5):
    text = extract_text_from_pdf(pdf_path)
    sentences, preprocessed_text = preprocess_text(text)
    sim_matrix = build_similarity_matrix(sentences, preprocessed_text)
    ranked_sentences = apply_text_rank(sim_matrix, sentences)
    summary = ' '.join([s[1] for s in ranked_sentences[:num_sentences]])
    return summary


In [19]:
pdf_path = choose_file()
summary = summarize(pdf_path)
print(summary)


5. But
along with honor comes, or must come, the charismatic leader’s recognition of his
‘Eigenverantwortung’  (‘Self-responsibility’)  (1992b:  180). For  him,  both  Stefan  George  and  Tolstoy  were  charismatic  leaders  who  were
‘irrational’. For  Weber,  bureaucratic  authority  has  many  positive  features:  it  is  based
upon reason, it is impartially implemented by paid trained ofﬁcials, and its future

 
is stable. One of the differences between bureaucratic and traditional Herrschaft, if not the key one is that
the former is based upon the concept of ‘competence’, which is lacking in the latter (see Weber,
1988: 478, 482).


## Extractive Summerisation Method: LexRank

LexRank is an unsupervised approach to text summarization based on graph theory. Sentences within a given text are represented as vertices in a graph. Edges between sentences are created based on the similarity between sentences, which can be computed using measures like cosine similarity with TF-IDF weighting. The LexRank algorithm then applies a method similar to Google's PageRank to this graph: a sentence is considered important if it is similar to many other sentences, and those sentences are themselves considered important.

LexRank is an unsupervised approach to text summarization based on graph theory. Sentences within a given text are represented as vertices in a graph. Edges between sentences are created based on the similarity between sentences, which can be computed using measures like cosine similarity with TF-IDF weighting. The LexRank algorithm then applies a method similar to Google's PageRank to this graph: a sentence is considered important if it is similar to many other sentences, and those sentences are themselves considered important.

 LexRank is particularly effective on structured and well-written texts where the salient information is distributed throughout the document. It works well with news articles, research papers, and technical documents, where the recurrence of similar concepts can be used to gauge the importance of sentences.

### Step 1: Import Necessary Libraries

Before we begin, let's import all the necessary libraries. If you haven't installed these libraries, you can do so using `pip`.


In [None]:
# Run this cell to install the necessary libraries
#!pip install pdfminer.six numpy scipy networkx

In [20]:
# Importing necessary libraries
import pdfminer
from pdfminer.high_level import extract_text
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.sparse import csr_matrix
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
nltk.download('punkt')  # Download the tokenizer model if not already downloaded
nltk.download('stopwords')

# Define function to extract text from PDF
def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)

# Define function to preprocess text
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    sentences = sent_tokenize(text)
    preprocessed_sentences = []
    for sentence in sentences:
        words = word_tokenize(sentence)
        filtered_words = [word for word in words if word.lower() not in stop_words and word.isalnum()]
        preprocessed_sentences.append(' '.join(filtered_words))
    return preprocessed_sentences

# Define function to calculate sentence similarity
def sentence_similarity(sent1, sent2):
    vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
    
    # Try-except block to handle the ValueError
    try:
        tfidf = vectorizer.fit_transform([sent1, sent2])
        return ((tfidf * tfidf.T).A)[0, 1]
    except ValueError:
        # In case of an empty vocabulary, return 0 similarity
        return 0.0

# Define function to build the similarity graph
def build_similarity_graph(sentences):
    # Create an empty similarity matrix
    S = np.zeros((len(sentences), len(sentences)))

    # Populate the similarity matrix
    for idx1, sentence1 in enumerate(sentences):
        for idx2, sentence2 in enumerate(sentences):
            if idx1 == idx2:
                continue
            S[idx1][idx2] = sentence_similarity(sentence1, sentence2)

    # Convert the similarity matrix to a graph
    graph = nx.from_numpy_array(S)
    return graph

# Define function to rank sentences using LexRank
def lexrank_summarization(text, num_sentences=5):
    # Preprocess the text
    sentences = preprocess_text(text)

    # Build the graph
    graph = build_similarity_graph(sentences)

    # Use the pagerank algorithm to rank sentences
    scores = nx.pagerank(graph, max_iter=1000, tol=1e-06)
    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

    # Extract top N sentences as the summary
    summary = " ".join([s[1] for s in ranked_sentences[:num_sentences]])
    return summary


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\wasim\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\wasim\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now we will choose a PDF file and apply the summarization algorithm to extract the key points from the text.


In [21]:
# Prompt the user to select a PDF file
pdf_path = choose_file()

# Extract text from the PDF
text = extract_text_from_pdf(pdf_path)

# Generate summary
summary = lexrank_summarization(text)

# Print the summary
print("Summary:\n", summary)


Summary:
 1993 Max Weber conceptualization charismatic authority inﬂuence organizational research Leadership Quarterly Vol Findings Weber writings charismatic authority continue instrumental shaping modern leadership theory charismatic form authority may particularly applicable effective today chaotic rapidly changing environments empowered organizational forms century may represent merely different incarnation Weber iron cage authority short propose Weber writings charismatic authority continue instrumental shaping modern leadership theory charismatic form authority may particularly applicable effective today chaotic rapidly changing environments empowered organizational forms century may simply represent different embodiment Weber iron cage authority Weber forwarded three basic types authority traditional charismatic Weber 1968 Indeed concepts along aspects Weber theory charismatic authority recently prompted lively debate among leadership scholars within pages Leadership Quarterly B

## Extractive Summerisation Method: Latent Semantic Analysis (LSA)

Next we will demostrate how to summarise a PDF document using the Latent Semantic Analysis (LSA) approach. LSA is based on the singular value decomposition (SVD) of a term-sentence matrix to reduce its dimensions and thus identify patterns that represent the underlying "latent" structure of the semantic relationships within the text. By mapping the high-dimensional space of terms into a lower-dimensional space, LSA can infer the importance of sentences based on the concepts they contain, even if they do not share specific keywords.

The summarization effect of LSA is to identify and extract sentences that carry the essence of the topics within the text. These sentences may not necessarily be the most frequently occurring or the most interconnected but rather those that best capture the main themes and variations in topic within the document.

LSA is effective for complex texts with sophisticated structures, such as academic literature, research papers, and technical documents, where simple word frequency is insufficient to understand the importance of sentences. It is particularly useful for summarizing texts that require the identification of thematic importance and where synonymy and polysemy (words that have multiple meanings) are prevalent.

### Step 1: Import Necessary Libraries

Let's begin by importing all necessary libraries. Install them using pip if they are not already installed.


In [None]:
#!pip install pdfminer.six numpy scipy scikit-learn nltk

### Step 2: Extract and Preprocess Text

We need to preprocess the text by tokenizing sentences, removing stop words, and filtering non-alphanumeric characters.


In [22]:
# Importing necessary libraries
import pdfminer
from pdfminer.high_level import extract_text
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt') # Download the tokenizer model if not already downloaded
nltk.download('stopwords')

# Define function to extract text from PDF
def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)

# Define function to preprocess text
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    sentences = sent_tokenize(text)
    preprocessed_sentences = []
    for sentence in sentences:
        words = word_tokenize(sentence)
        filtered_words = [word for word in words if word.lower() not in stop_words and word.isalnum()]
        preprocessed_sentences.append(' '.join(filtered_words))
    return preprocessed_sentences


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\wasim\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\wasim\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Step 3: Implement LSA for Summarization

LSA is used for summarization by applying dimensionality reduction to the term-sentence matrix and then extracting the sentences that contribute most to the resulting components.



In [23]:
# Define function to perform LSA summarization
def lsa_summarization(sentences, num_topics=1, num_sentences=5):
    # Create a TfidfVectorizer for vectorization of the sentences
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(sentences)

    # Perform SVD to reduce dimensionality
    svd_model = TruncatedSVD(n_components=num_topics)
    lsa = make_pipeline(svd_model, Normalizer(copy=False))
    X_lsa = lsa.fit_transform(X)

    # Rank sentences based on the weight in the first topic
    ranked_sentences = [(sentence, X_lsa[index]) for index, sentence in enumerate(sentences)]
    ranked_sentences = sorted(ranked_sentences, key=lambda x: x[1], reverse=True)

    # Extract top N sentences as the summary
    summary = " ".join([sentence for sentence, weight in ranked_sentences[:num_sentences]])
    return summary

### Step 4: Display the Summary

Finally, we display the summarized text.


In [24]:
# Prompt the user to select a PDF file
pdf_path = choose_file()

text = extract_text_from_pdf(pdf_path)

# Preprocess the text to be summarized
preprocessed_text = preprocess_text(text)

# Generate summary
summary = lsa_summarization(preprocessed_text)

# Generate summary
print("LSA Summary:\n", summary)

LSA Summary:
 current issue full text archive journal available Max Weber notion authority still hold century Jeffery Houghton College Business Economics West Virginia University Morgantown West Virginia USA Max Weber notion authority 449 Abstract Purpose purpose brief commentary provide brief overview Max Weber life work contributions management thought addressing question whether notion authority still holds century commentary begins brief biographical sketch followed examination Weber conceptualization authority inﬂuence ﬁeld management relevancy century Findings Weber writings charismatic authority continue instrumental shaping modern leadership theory charismatic form authority may particularly applicable effective today chaotic rapidly changing environments empowered organizational forms century may represent merely different incarnation Weber iron cage authority commentary makes important contribution management history literature examining important aspect Weber inﬂuence manage

## Extractive Summerisation Method: Luhn Algorithm
Luhn Algorithm algorithm is based on the frequency of words within the text; it assumes that words occurring more frequently are more significant. Luhn's insight was that there is a middle ground of word frequency that captures keywords: very common words are uninformative, and very rare words may be irrelevant. Furthermore, the Luhn algorithm pays attention to the proximity of these significant words to each other within a sentence, proposing that clusters of significant words are likely to convey more information.

The summarization effect is that sentences that contain a higher density of these mid-frequency significant words, especially where they occur in close proximity, are selected for the summary. The resulting summary is therefore a collection of sentences that are rich in content-bearing words, which Luhn suggested represent the main points of the text.

 The Luhn algorithm is most effective for texts where the significant content can be identified through keyword frequency and distribution, such as news reports and business and technical papers. It works particularly well with texts that have a good signal-to-noise ratio in terms of keyword frequencies: significant words should stand out from the less important text while still being frequently enough used to indicate central themes.

### Step 1: Import Necessary Libraries

Before we begin, let's import all the necessary libraries. If you haven't installed these libraries, you can do so using pip.


In [None]:
#!pip install pdfminer.six nltk

In [25]:
import pdfminer
from pdfminer.high_level import extract_text
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\wasim\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\wasim\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Step 2: Extract Text from PDF

In [26]:
def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)

### Step 3: Preprocess the Text
We define a function to preprocess text by tokenizing sentences, removing stop words, and filtering non-alphanumeric characters.

In [27]:
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words and word.isalnum()]
    
    return ' '.join(filtered_words)

### Step 4: Implement the Luhn Algorithm for Summarization
The Luhn Algorithm scores sentences based on the frequency of significant words. Sentences with the highest scores are included in the summary.

In [28]:
def luhn_summarization(text, num_sentences=5):
    sentences = sent_tokenize(text)
    preprocessed_text = preprocess_text(text)
    word_frequencies = FreqDist(word_tokenize(preprocessed_text))
    
    # Compute the higher frequency threshold using average frequency
    avg_frequency = sum(word_frequencies.values()) / len(word_frequencies)
    significant_words = {word for word in word_frequencies if word_frequencies[word] > avg_frequency}
    
    ranked_sentences = {}
    for i, sentence in enumerate(sentences):
        words = word_tokenize(sentence.lower())
        word_count = len(words)
        score = sum([word_frequencies[word] for word in words if word in significant_words]) / word_count
        ranked_sentences[i] = score
    
    selected_sentences = sorted(ranked_sentences, key=ranked_sentences.get, reverse=True)[:num_sentences]
    summary = ' '.join([sentences[i] for i in selected_sentences])
    
    return summary


### Step 5: Display the Summary
Finally, we display the summarized text after prompting the user to select a PDF file.

In [30]:
# Prompt the user to select a PDF file
pdf_path = choose_file()

# Extract text from the PDF
text = extract_text_from_pdf(pdf_path)

# Generate summary
summary = luhn_summarization(text)

# Print the summary
print("Luhn Summary:\n", summary)


Luhn Summary:
 The ﬁrst is traditional authority. KEYWORDS authority,  bureaucratic  authority,  charisma,  dominance,  leadership,
traditional authority, Weber

Max Weber’s longstanding interest in the notion of authority is well documented. This authority is based upon strong traditional rules and has
much  in  common  with  legal  authority. The charismatic leader should
and often does have these traits. The charismatic leader is a d¨amonischer type who
appears  only  in  chaotic  times.


## Extractive Summerisation Method: KL-Sum
KL-Sum algorithm is based on minimizing the Kullback-Leibler (KL) divergence, which is a way of measuring the difference between two probability distributions. For text summarization, KL-Sum aims to select sentences that, when taken together, provide a probability distribution of words as close as possible to the distribution in the original document. It iteratively adds sentences to the summary that most decrease the KL divergence.

The effect of the KL-Sum algorithm is to create a summary that maintains the original distribution of words, which presumably represents the topics and the information content of the entire document. This tends to produce summaries that are representative of the original text's thematic structure.

KL-Sum is effective for texts with distinct keyword distributions that are indicative of the text's content, such as scientific articles and technical documents. It can be especially useful in domains where the goal is to capture the key information without imposing much interpretation or paraphrasing, thus maintaining the document's original terminology and meaning.

In [None]:
#!pip install numpy scipy nltk pymupdf nltk

In [None]:
#!pip install --upgrade pymupdf

In [31]:
# Import necessary libraries
import numpy as np
from scipy.special import kl_div
from scipy.stats import entropy
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter
import fitz  

# Ensure that you have the necessary NLTK models downloaded
nltk.download('punkt')

# Function to extract text from a PDF file
def extract_text_from_pdf(pdf_path):
    # Open the PDF file
    with fitz.open(pdf_path) as pdf:
        text = ""
        # Iterate over each page in the PDF
        for page in pdf:
            # Extract text from the page
            text += page.get_text()
        return text
    
# Function to extract sentences from a string of text
def get_sentences(text):
    return sent_tokenize(text)

# Function to create the word frequency distribution
def word_freq_dist(text):
    words = word_tokenize(text.lower())
    return Counter(words)

# Function to calculate the KL divergence
# Function to calculate the KL divergence
def kl_divergence(summary_freq_dist, doc_freq_dist, vocab_size):
    """
    Calculate the KL divergence, adding a small value (1/vocab_size) to zero-counts to avoid infinity.
    """
    P = np.array([summary_freq_dist.get(word, 0) + 1.0/vocab_size for word in doc_freq_dist])
    Q = np.array([doc_freq_dist.get(word, 0) + 1.0/vocab_size for word in doc_freq_dist])
    
    return entropy(P, Q)

# Function to perform KL-Sum summarization
# Function to perform KL-Sum summarization
def kl_sum(text, summary_size=5):
    sentences = get_sentences(text)
    vectorizer = CountVectorizer()
    vectorizer.fit(sentences)
    vocab = vectorizer.get_feature_names_out()
    
    doc_freq_dist = word_freq_dist(text)
    
    summary = []
    summary_freq_dist = Counter()
    remaining_sentences = sentences.copy()
    
    vocab_size = len(vocab)  # V is the size of the vocabulary
    
    while len(summary) < summary_size and remaining_sentences:
        kl_scores = []
        for sentence in remaining_sentences:
            temp_summary = ' '.join(summary + [sentence])
            temp_summary_freq_dist = word_freq_dist(temp_summary)
            kl_score = kl_divergence(temp_summary_freq_dist, doc_freq_dist, vocab_size)
            kl_scores.append((kl_score, sentence))
            
        # Select the sentence that minimizes the KL divergence
        min_kl_sentence = min(kl_scores, key=lambda x: x[0])[1]
        summary.append(min_kl_sentence)
        summary_freq_dist.update(word_freq_dist(min_kl_sentence))
        remaining_sentences.remove(min_kl_sentence)
    
    return ' '.join(summary)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\wasim\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [32]:
# Prompt the user to select a PDF file
pdf_path = choose_file()

# Extract text from the PDF
text = extract_text_from_pdf(pdf_path)

# Generate summary
summary = kl_sum(text)

# Print the summary
print("Luhn Summary:\n", summary)

Luhn Summary:
 408-37. (1993), “Max Weber’s conceptualization of charismatic authority: its inﬂuence
on organizational research”, Leadership Quarterly, Vol. Findings – Weber’s writings on charismatic authority have been and continue to be instrumental
in shaping modern leadership theory, that the charismatic form of authority may be particularly applicable
and effective in today’s chaotic and rapidly changing environments, and that the empowered and
self-managing organizational forms of the twenty-ﬁrst century may represent merely a different incarnation
of Weber’s iron cage of legal/rational authority. Indeed, these concepts along with other
aspects of Weber’s theory of charismatic authority recently prompted a lively debate
among leadership scholars within the pages of Leadership Quarterly (Bass, 1999;
Max Weber’s
notion of
authority
451
Downloaded by Queen Mary University of London At 07:14 02 January 2019 (PT)
Beyer, 1999; House, 1999; Shamir, 1999). Jeffery D. Houghton
College of 

## Extractive Summerisation Method: K-Mean Clustering
This method breaks the document into smaller chunks, such as paragraphs, and then generates vector embeddings for each chunk. These embeddings capture the semantic meaning of the text in a multi-dimensional vector space, enabling the algorithm to measure the "distance" between chunks in terms of content and meaning. Paragraphs that are semantically similar will cluster together in this space. A clustering algorithm like K-means is then used to identify these clusters. The algorithm determines the central points of these clusters, which are the chunks that best represent the "average meaning" of the topics within each cluster.

The approach essentially distills the document down to its key thematic elements. It identifies the main topics discussed throughout and selects the most representative sections for each topic. By combining the central chunks, a summary is created that is not only concise but also rich in context, reflecting the various key topics of the document.

This method is versatile and can be effective across a range of text types. It is particularly useful for long documents such as reports, research papers, and lengthy articles where there are distinct sections or chapters. Since it relies on semantic understanding, it can handle complex materials where themes and ideas are more important than just individual keywords.

In [None]:
pip install numpy scikit-learn gensim pdfplumber "unstructured[pdf]"

In [None]:
#pip install "unstructured[all-docs]"

In [None]:
import pdfplumber

def extract_text_from_pdf(pdf_path):
    text = ''
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text
            current_page_text = page.extract_text()
            if current_page_text:  # Check if the text was extracted
                text += current_page_text + "\n\n"  # Adding a double newline as a paragraph separator
    return text


In [None]:
# Prompt the user to select a PDF file
pdf_path = choose_file()

# After extracting text from the PDF
text = extract_text_from_pdf(pdf_path)

In [None]:
print(text)

In [None]:
from PyPDF2 import PdfReader

# Function to extract text from a PDF
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PdfReader(file)
        text = ''
        for page in reader.pages:
            text += page.extract_text()
        return text

# Summarization function
def KMean_cluster_summerisation(document):
    # Check the length of paragraphs after the split
    paragraphs = document.split('\n\n')  # Adjust this line as needed based on the actual paragraph breaks
    print("Number of paragraphs:", len(paragraphs))

    if len(paragraphs) < 2:
        return "Document is too short to summarize using clustering."

    tagged_data = [TaggedDocument(words=par.split(), tags=[str(i)]) for i, par in enumerate(paragraphs)]
    model = Doc2Vec(vector_size=50, min_count=2, epochs=40)
    model.build_vocab(tagged_data)
    model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
    embeddings = [model.infer_vector(par.split()) for par in paragraphs]
    optimal_k = min(len(paragraphs), 10)
    clustering_model = KMeans(n_clusters=optimal_k)
    clustering_model.fit(embeddings)
    cluster_assignment = clustering_model.labels_
    centroids = clustering_model.cluster_centers_
    representative_paragraphs = []

    for i in range(optimal_k):
        centroid = centroids[i]
        distances = np.linalg.norm(embeddings - centroid, axis=1)
        cluster_paragraphs = [par for par, cluster in zip(paragraphs, cluster_assignment) if cluster == i]
        representative_paragraphs.append(cluster_paragraphs[np.argmin(distances)])
    sorted_representative_paragraphs = sorted(representative_paragraphs, key=lambda par: paragraphs.index(par))
    summary = '\n\n'.join(sorted_representative_paragraphs)
    return summary


In [None]:
# Generate summary
summary = KMean_cluster_summerisation(text)

# Print the summary
print("K-Mean Clustering Summary:\n", summary)

## Abstractive Summerisation
Abstractive summarization is based on the principle of understanding and interpreting the content like a human would, and then expressing that understanding in a new, condensed form. This process involves paraphrasing and rephrasing the essence of the text rather than simply extracting sentences directly from it. Abstractive summarization systems must first comprehend the text on a deeper level than just identifying keywords or phrases. They use natural language processing (NLP) techniques to understand the context, nuances, and the relationships between concepts in the text.

After understanding the text, the system generates new sentences that capture the core meanings and the most important information from the original content. This generation phase often employs advanced NLP techniques such as natural language generation (NLG) and machine learning models, especially sequence-to-sequence models.

The generated sentences are then condensed to form a coherent summary. The summary should convey the main points of the original text but with fewer words and potentially different phrasing. The goal is to produce a shorter version of the text that retains the essential information and is coherent and fluent to read.

The summarization process must ensure that the summary is not only concise but also logically structured and understandable, with transitions that make sense and maintain the flow of information.

Abstractive summarization techniques are particularly useful when dealing with complex texts that require a high level of interpretation, such as news articles, stories, or even conversations where the context and the subtleties are crucial. This approach can also be effective for summarizing content where an extractive summary might be too fragmented or when a paraphrased, reworded summary would be more useful to the reader.

Advanced abstractive summarization systems often leverage deep learning models like transformers, which have the ability to generate human-like text. These systems are trained on large datasets to learn patterns in language usage, enabling them to mimic the way humans summarize information. However, abstractive summarization is computationally intensive and can be prone to inaccuracies or loss of certain nuances, making it an ongoing area of research and development in the field of artificial intelligence.

In [1]:
#!pip install transformers pymupdf




[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


## Abstractive Summerisation Method: Bart

Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT). The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
### Importing Bart Model and Tokenizer


In [1]:
from transformers import BartTokenizer, BartForConditionalGeneration
import fitz  

In [2]:
def extract_text_from_pdf(pdf_path):
    # Open the PDF file
    with fitz.open(pdf_path) as pdf:
        text = ""
        for page in pdf:
            text += page.get_text()
    return text


In [4]:
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

Downloading pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

In [5]:
def summarize_with_bart(text, max_length=1024, min_length=100):
    inputs = tokenizer([text], max_length=max_length, return_tensors='pt', truncation=True)
    
    # Generate Summary
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=max_length, min_length=min_length, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    
    return summary


In [8]:
# Assume pdf_path is the path to your PDF file
pdf_path = choose_file()

# Extract text from the PDF
extracted_text = extract_text_from_pdf(pdf_path)

# Summarize the extracted text
summary = summarize_with_bart(extracted_text)

# Print the summary
print("Summary:\n", summary)


Summary:
 Jeffery D. Houghton: Does Max Weber’s notion of authority still hold in the twenty-ﬁrst century? He says Weber's writings on charismatic authority have been and continue to be instrumental in shaping modern leadership theory. He argues that the empowered and self-managing organizational forms of the 20- ﬁRst century may represent merely a different incarnation of Weber's iron cage of legal/rational authority. The commentary makes an important contribution to the management history and management literature by examining an important aspect of his inﬂuence on management thought.


## Abstractive Summerisation Method: PEGASUS Model

The next summerisation model we will investigate is the PEGASUS model from Hugging Face's Transformers library. PEGASUS is a state-of-the-art model for abstractive text summarization that can generate coherent and concise summaries.

### Step 1: Install Required Libraries

First, we need to install the transformers library and a library to read PDF documents, such as pdfplumber.



In [9]:
#!pip install transformers tokenizers pdfplumber



### Step 2: Import Libraries

Import the necessary libraries for the summarization process.


In [36]:
import pdfplumber
from transformers import AutoTokenizer, PegasusForConditionalGeneration

### Step 3: PDF Text Extraction

Define a function to extract text from a PDF file using pdfplumber, which provides accurate text extraction and keeps the textual flow intact.


In [37]:
def extract_text_from_pdf(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            text += page.extract_text()
    return text


### Step 4: Load PEGASUS Model and Tokenizer

Load the PEGASUS model and its corresponding tokenizer. We will use the 'google/pegasus-xsum' pre-trained model.


In [38]:
model_name = 'google/pegasus-xsum'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Step 5: Summarize Text with PEGASUS

Summarize the text using the PEGASUS model with the encoded text as input. Adjust the parameters like `num_beams` and `length_penalty` for different summarization quality.


In [44]:
# Function to summarize text
def summarize_with_pegasus(text):
    model = PegasusForConditionalGeneration.from_pretrained(model_name)
    batch = tokenizer(text, truncation=True, padding='longest', return_tensors="pt")
    translated = model.generate(**batch)
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

### Step 6: Run the Summary

Specify the path to your PDF document, extract the text, encode it, and then generate the summary using PEGASUS.


In [49]:
# Assume pdf_path is the path to your PDF file
pdf_path = choose_file()

# Extract text from the PDF
extracted_text = extract_text_from_pdf(pdf_path)

# Summarize the extracted text
summary_texts = summarize_with_pegasus(extracted_text)

# Print the summaries
for summary in summary_texts:
    print("Summary:\n", summary)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Summary:
 Weber’s writings on charismatic authority have been and continue to be instrumental in shaping modern leadership.<n>The empowered and self-managing organizationalformsofthe twenty-first centurymayrepresentmerelyadifferentincarnation ofWeber’sironcageoflegal/rationalauthority.


## Abstractive Summerisation Method: T5 Transformer 

Now we will test the summerisation capabilities of T5 model from Hugging Face's Transformers library. T5 stands for "Text-to-Text Transfer Transformer" and is a versatile model that frames all NLP tasks as a text-to-text problem. For summarization, it generates a concise version of a given input text.

### Step 1: Install Required Libraries

Begin by installing the `transformers` library, which provides the T5 model and utilities, and `pdfminer.six` for extracting text from PDFs.



In [None]:
#!pip install transformers pdfminer.six

### Step 2: Import Libraries

Now import the necessary libraries for the summarization process.

In [51]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
from pdfminer.high_level import extract_text

### Step 3: PDF Text Extraction

We'll use `pdfminer.six` to extract text from PDF documents. This library allows us to convert PDF files to text, which can then be summarized.

In [52]:
def extract_text_from_pdf(pdf_path):
    text = extract_text(pdf_path)
    return text


### Step 4: Load T5 Model and Tokenizer

Load the pre-trained T5 model and its corresponding tokenizer. We will use the 't5-small' model for this example, but larger models are available for better performance.



In [53]:
model_name = 't5-small'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

### Step 5: Prepare Text for T5

T5 requires the task to be specified as part of the input text. For summarization, we prepend "summarize: " to the input text.


In [56]:
def prepare_text_for_t5(text, tokenizer, max_length=512):
    preprocessed_text = "summarize: " + text
    return tokenizer.encode(preprocessed_text, return_tensors="pt", max_length=max_length, truncation=True)


### Step 6: Summarize Text with T5

We generate a summary by decoding the tokens produced by the T5 model.



In [57]:
def summarize_with_t5(encoded_text, model):
    summary_ids = model.generate(encoded_text, min_length=30, max_length=200, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

### Step 7: Run the Summary

Finally, we use the functions to extract text from a PDF, prepare it for T5, and generate the summary.



In [58]:
# Example PDF file
pdf_path = choose_file()

# Extract text from the PDF
text = extract_text_from_pdf(pdf_path)

# Encode and prepare the text for T5
encoded_text = prepare_text_for_t5(text, tokenizer)

# Generate the summary
summary = summarize_with_t5(encoded_text, model)

# Print the summary
print("T5 Summary:\n", summary)


T5 Summary:
 the current issue and full text archive of this journal is available at www.emeraldinsight.com/1751-1348.htm Max Weber’s notion of authority 449 Abstract Purpose – The purpose of this brief commentary is to provide a brief overview of Weber’s life, work, and contributions to management thought.


## Abstractive Summerisation Method: OpenAI's GPT-3.5 and Langchain
OpenAI's GPT (generative pre-trained transformer) models have been trained to understand natural language and code. GPTs provide text outputs in response to their inputs. The inputs to GPTs are also referred to as "prompts". Designing a prompt is essentially how you “program” a GPT model, usually by providing instructions or some examples of how to successfully complete a task.

### Step 1: Setup and Authentication


In [66]:
#!pip install PyPDF



In [63]:
!pip install openai langchain langchainhub



In [60]:
import os
import openai
import numpy as np

In [61]:
os.environ['OPENAI_API_KEY'] = ''

In [62]:
openai.api_key = os.getenv('OPENAI_API_KEY')

In [64]:
import os
import ast
import pandas as pd
from langchain import LLMChain
from langchain.chat_models import ChatAnthropic
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.mapreduce import MapReduceChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import (
                StuffDocumentsChain,
                LLMChain,
                ReduceDocumentsChain,
                MapReduceDocumentsChain,
            )

In [83]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 1000,
    chunk_overlap  = 200,
    length_function = len,
    is_separator_regex = False,
)

In [79]:
from langchain.document_loaders import PyPDFLoader
pdf_path = choose_file()
loader = PyPDFLoader(pdf_path)
pages = loader.load()

In [85]:
docs = text_splitter.split_documents(pages)

In [86]:
docs

[Document(page_content='Does Max Weber’s notion\nof authority still hold\nin the twenty-ﬁrst century?\nJeffery D. Houghton\nCollege of Business and Economics, West Virginia University,\nMorgantown, West Virginia, USA\nAbstract\nPurpose – The purpose of this brief commentary is to provide a brief overview of Max Weber’s life,\nwork, and contributions to management thought before addressing the question of whether his notionof authority still holds in the twenty-ﬁrst century.\nDesign/methodology/approach – The commentary begins with a brief biographical sketch followed\nby an examination of Weber’s conceptualization of authority, its inﬂuence on the ﬁeld of management andits relevancy in the twenty-ﬁrst century.\nFindings – Weber’s writings on charismatic authority have been and continue to be instrumental', metadata={'source': 'D:/Downloads/Houghton (2010) Does Max Webers Notion of Authority Still Hold in the Twenty-First Century.pdf', 'page': 0}),
 Document(page_content='Findings – Web

In [97]:
# Reduce
reduce_template = """The following is set of summaries:
{doc_summaries}
Take these and distill it into a final, consolidated summary of the main themes. 
Helpful Answer:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)

In [99]:
!pip install langchainhub

Collecting langchainhub
  Downloading langchainhub-0.1.13-py3-none-any.whl.metadata (478 bytes)
Downloading langchainhub-0.1.13-py3-none-any.whl (3.4 kB)
Installing collected packages: langchainhub
Successfully installed langchainhub-0.1.13


In [100]:
from langchain import hub
# Note we can also get this from the prompt hub, as noted above
reduce_prompt = hub.pull("rlm/map-prompt")

In [101]:
reduce_prompt

ChatPromptTemplate(input_variables=['docs'], output_parser=None, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['docs'], output_parser=None, partial_variables={}, template='The following is a set of documents:\n{docs}\nBased on this list of docs, please identify the main themes \nHelpful Answer:', template_format='f-string', validate_template=True), additional_kwargs={})])

In [102]:
# Run chain
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Takes a list of documents, combines them into a single string, and passes this to an LLMChain
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="docs"
)

# Combines and iteravely reduces the mapped documents
reduce_documents_chain = ReduceDocumentsChain(
    # This is final chain that is called.
    combine_documents_chain=combine_documents_chain,
    # If documents exceed context for `StuffDocumentsChain`
    collapse_documents_chain=combine_documents_chain,
    # The maximum number of tokens to group documents into.
    token_max=4000,
)

In [103]:
# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    # Map chain
    llm_chain=map_chain,
    # Reduce chain
    reduce_documents_chain=reduce_documents_chain,
    # The variable name in the llm_chain to put the documents in
    document_variable_name="docs",
    # Return the results of the map steps in the output
    return_intermediate_steps=False,
)

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=0
)
split_docs = text_splitter.split_documents(docs)

In [104]:
print(map_reduce_chain.run(split_docs))

Based on the list of documents provided, the main themes that can be identified are:

1. The historical context and impact of the 1755 Lisbon earthquake: The documents likely discuss the devastation caused by the earthquake and the subsequent efforts to rebuild and recover.

2. Lessons from history for contemporary leadership: The documents likely explore the concept of charismatic leadership and its relevance in modern times, drawing on Weber's theories and applying them to current leadership practices.

Overall, the main themes revolve around historical events and their implications for leadership and management.
