# SUMMARIZING PAPERS INVOLVES TWO STEPS:
- EXTRACTING TEXT FROM A PDF/WORD FILE
- SUMMARIZING THE TEXT (EXTRACTIVE/ABSTRACTIVE)

# INSTALLING LIBRARIES

In [1]:
!pip install pdfplumber
!pip install python-docx
!pip install rouge

Collecting pdfplumber
  Downloading pdfplumber-0.11.4-py3-none-any.whl.metadata (41 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
  Downloading pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Downloading pdfplumber-0.11.4-py3-none-any.whl (59 kB)
Downloading pdfminer.six-20231228-py3-none-any.whl (5.6 MB)

## EXTRACTING TEXT FROM A WORD/PDF FILE
- There are two common options for extracting text from a PDF: PyPDF2 and pdfplumber.
- We use pdfplumber rather than PyPDF2 because it tends to retain more of the document's original layout, such as tables and bullet points.
- We extract text from Word (.docx) files with the python-docx library (imported as `docx`).

In [22]:
import pdfplumber
from docx import Document
import os

# Function to extract text from PDF
def extract_text_from_pdf(pdf_path):
    print("Provided Document is PDF file\n\n")
    extracted_text = ""
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                # extract_text() can return None for pages without a text layer
                extracted_text += page.extract_text() or ""
    except Exception as e:
        print(f"Error reading PDF file: {e}")
    return extracted_text

# Function to extract text from Word (.docx)
def extract_text_from_docx(docx_path):
    print("Provided Document is Word file\n\n")
    extracted_text = ""
    try:
        doc = Document(docx_path)
        extracted_text = '\n'.join([para.text for para in doc.paragraphs])
    except Exception as e:
        print(f"Error reading DOCX file: {e}")
    return extracted_text

# Function to handle both PDF and Word files
def extract_text_from_file(file_path):
    _, file_extension = os.path.splitext(file_path)
    
    if file_extension.lower() == '.pdf':
        return extract_text_from_pdf(file_path)
    elif file_extension.lower() == '.docx':
        return extract_text_from_docx(file_path)
    else:
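        # Raise instead of returning the message, which would otherwise be summarized as if it were document text
        raise ValueError(f"Unsupported file format: {file_extension}")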


# path
path = '/kaggle/input/papers/SegNet Network Architecture for Deep Learning Image.pdf'

# Extract text based on file type
extracted_text = extract_text_from_file(path)

# Print extracted text
# print("Extracted Text is: \n" )
# print(extracted_text)


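# Rough size estimate: whitespace-separated words, not model tokens
# (each model's own tokenizer reports a larger count in the per-model checks).
tokens = len(extracted_text.split())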
print(f'No of tokens in text: {tokens}')


Provided Document is PDF file


No of tokens in text: 3205


In [23]:
# Evaluation metric: ROUGE

from rouge import Rouge


rouge = Rouge()
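
To make the metric concrete, here is a minimal sketch of ROUGE-1 (unigram overlap) in plain Python. It is a simplified illustration that splits on whitespace, not the `rouge` package's exact implementation, and the helper name `rouge_1` is hypothetical. It also shows the roles of the two inputs: the hypothesis is the generated summary and the reference is the text it is compared against, which is the order `get_scores(hyps, refs)` expects.

```python
from collections import Counter


def rouge_1(hypothesis, reference):
    """Simplified ROUGE-1: precision, recall and F1 over whitespace-separated unigrams."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    hyp_total = sum(hyp_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / hyp_total if hyp_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"p": precision, "r": recall, "f": f1}


scores = rouge_1("the cat sat", "the cat sat on the mat")
print(scores)
```

When a short summary is scored against the full source text, precision is usually high and recall low, because most of the source's words are absent from the summary.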



In [24]:
# # Sample Text for Summary Generation on all models
# extracted_text = """The Role of Renewable Energy in Sustainable Development
# The pursuit of sustainable development has become one of the most pressing challenges of our time. As the global population continues to rise and the demand for energy increases, the reliance on fossil fuels poses significant environmental and economic threats. In this context, renewable energy sources have emerged as a viable alternative, offering a path toward sustainable and environmentally friendly energy solutions.

# 1. Understanding Renewable Energy

# Renewable energy refers to energy derived from resources that are naturally replenished, such as sunlight, wind, rain, tides, waves, and geothermal heat. Unlike fossil fuels, which are finite and contribute to greenhouse gas emissions, renewable energy sources provide a cleaner and more sustainable energy supply. Solar power, for instance, harnesses sunlight using photovoltaic cells to generate electricity, while wind energy converts the kinetic energy of wind into power through turbines.

# 2. Environmental Benefits

# The transition to renewable energy plays a crucial role in mitigating climate change. According to the Intergovernmental Panel on Climate Change (IPCC), adopting renewable energy could significantly reduce global greenhouse gas emissions, which are responsible for global warming and climate instability. By reducing our reliance on coal, oil, and natural gas, we can decrease air pollution, leading to improved public health outcomes and a reduction in healthcare costs associated with pollution-related illnesses.

# 3. Economic Opportunities

# Investing in renewable energy not only benefits the environment but also creates substantial economic opportunities. The renewable energy sector has been a significant source of job creation, often outpacing traditional fossil fuel industries. In 2020, the International Renewable Energy Agency (IRENA) reported that the renewable energy sector employed over 11 million people worldwide, with jobs expected to increase as more countries commit to sustainable energy goals.

# 4. Energy Security and Independence

# Renewable energy also enhances energy security and independence. By diversifying energy sources, countries can reduce their reliance on imported fuels, which can be subject to price volatility and geopolitical tensions. For instance, countries that invest in solar and wind energy can harness their local resources, leading to greater energy autonomy. This not only stabilizes energy prices but also enhances national security by reducing vulnerability to external energy supply disruptions.

# 5. Challenges and Considerations

# Despite the numerous advantages, the transition to renewable energy is not without challenges. Issues such as the intermittent nature of renewable sources—like solar and wind—require the development of energy storage technologies and grid management solutions. Furthermore, the initial investment costs for renewable energy infrastructure can be high, although these costs have been decreasing rapidly due to technological advancements and economies of scale.

# 6. Conclusion

# In conclusion, renewable energy represents a critical component of a sustainable future. By reducing greenhouse gas emissions, creating jobs, enhancing energy security, and providing clean power, renewable energy sources can significantly contribute to sustainable development. Policymakers, businesses, and individuals must work collaboratively to overcome existing challenges and accelerate the transition toward a greener, more sustainable energy landscape. As we move forward, embracing renewable energy will be essential in addressing the global challenges of climate change, economic inequality, and energy insecurity."""

## ABSTRACTIVE SUMMARIZATION

### Abstractive summarization generates new sentences that capture the meaning of the original text, rather than copying exact phrases or sentences.

### The generated summary can then be paraphrased to make it easier to read.

In [25]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cpu"

paraphrase_tokenizer = AutoTokenizer.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")

paraphrase_model = AutoModelForSeq2SeqLM.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base").to(device)

def paraphrase(
    question,
    num_beams=5,
    num_beam_groups=5,
    num_return_sequences=5,
    repetition_penalty=10.0,
    diversity_penalty=3.0,
    no_repeat_ngram_size=2,
    temperature=0.7,
    max_length=128
):
    input_ids = paraphrase_tokenizer(
        f'paraphrase: {question}',
        return_tensors="pt", padding="longest",
        max_length=max_length,
        truncation=True,
    ).input_ids.to(device)
    
    outputs = paraphrase_model.generate(
        input_ids, temperature=temperature, repetition_penalty=repetition_penalty,
        num_return_sequences=num_return_sequences, no_repeat_ngram_size=no_repeat_ngram_size,
        num_beams=num_beams, num_beam_groups=num_beam_groups,
        max_length=max_length, diversity_penalty=diversity_penalty
    )

    # Decode every returned sequence; the result is a list of num_return_sequences paraphrases
    res = paraphrase_tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return res

### Based on the number of tokens in the extracted text, we choose a model whose input limit can hold it.
- up to 1,024 tokens: BART (fine-tuned on CNN/DailyMail) and PEGASUS (fine-tuned on CNN/DailyMail and various other datasets)
- up to 4,096 tokens: LongT5 (pre-trained on C4) and BigBird-Pegasus
- up to 16,384 tokens: LED, the Longformer Encoder-Decoder (fine-tuned on arXiv and PubMed)
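
The selection rule above can be sketched as a small lookup. This is an illustrative helper, not part of the original notebook: the names `MODEL_TOKEN_LIMITS` and `models_that_fit` are hypothetical, and the limits are the ones used in the length checks later in this notebook. Each count is assumed to come from that model's own tokenizer, since different tokenizers produce different lengths for the same text.

```python
# Input limits (in tokens) as used in the notebook's per-model checks
MODEL_TOKEN_LIMITS = {
    "BART": 1024,
    "PEGASUS": 1024,
    "LONGT5": 4096,
    "BIGBIRD": 4096,
    "LED": 16384,
}


def models_that_fit(token_counts, limits=MODEL_TOKEN_LIMITS):
    """Return the models whose tokenized input length is within their limit.

    token_counts maps a model name to the length reported by that model's tokenizer.
    Models without a reported count are skipped.
    """
    return [
        name
        for name, limit in limits.items()
        if name in token_counts and token_counts[name] <= limit
    ]


# Token counts reported for the SegNet paper in this notebook's run
counts = {"BART": 5473, "PEGASUS": 4645, "LONGT5": 5498, "BIGBIRD": 4645, "LED": 5473}
print(models_that_fit(counts))
```

For this paper only LED qualifies, which matches the run recorded in the notebook.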

In [26]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import LEDForConditionalGeneration, LEDTokenizer
from transformers import BigBirdPegasusForConditionalGeneration

In [27]:
summaries = {}

In [28]:
# BART Large
def summarize_with_bart_large(extracted_text):
    bart_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
    bart_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
    summarizer = pipeline("summarization", model=bart_model, tokenizer=bart_tokenizer, clean_up_tokenization_spaces=True)
    
    summarized_text = summarizer(extracted_text)
    
    
    
    paraphrased = paraphrase(summarized_text[0]["summary_text"])
    
    bart_paraphrased = ' '.join(paraphrased)
    
    
    summaries['BART'] = summarized_text[0]["summary_text"]
    
    # get_scores expects (hypothesis, reference): the summary is the hypothesis
    bart_score = rouge.get_scores(summarized_text[0]["summary_text"], extracted_text)
    
    
    
    summaries['BART PARAPHRASED'] = bart_paraphrased
    
    summaries['BART ROUGE SCORES'] = bart_score
    
    
    print("Summary and Rouge scores are saved")
#     print(f'BART Summarized Text:- {summarized_text[0]["summary_text"]}\n\n')

In [29]:
# Pegasus Large
def summarize_with_pegasus_large(extracted_text):
    pegasus_tokenizer = AutoTokenizer.from_pretrained("google/pegasus-large")
    pegasus_model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-large")
    summarizer = pipeline("summarization", model=pegasus_model, tokenizer=pegasus_tokenizer, clean_up_tokenization_spaces=True)
    
    summarized_text = summarizer(extracted_text)
    
    
    paraphrased = paraphrase(summarized_text[0]["summary_text"])
    
    pegasus_paraphrased = ' '.join(paraphrased)
    
    # get_scores expects (hypothesis, reference): the summary is the hypothesis
    pegasus_score = rouge.get_scores(summarized_text[0]["summary_text"], extracted_text)
    
    
    summaries['PEGASUS'] = summarized_text[0]["summary_text"]
    
    summaries['PEGASUS PARAPHRASED'] = pegasus_paraphrased
    
    summaries['PEGASUS ROUGE SCORES'] = pegasus_score
    
    print("Summary and Rouge score are saved")
#     print(f'PEGASUS Summarized Text:- {summarized_text[0]["summary_text"]}\n\n')

In [30]:
# LongT5
def summarize_with_longt5(extracted_text):
    longt5_tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
    longt5_model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-base")
    summarizer = pipeline("summarization", model=longt5_model, tokenizer=longt5_tokenizer, clean_up_tokenization_spaces=True)
    
    summarized_text = summarizer(extracted_text)
    
    
    paraphrased = paraphrase(summarized_text[0]["summary_text"])
    
    longt5_paraphrased = ' '.join(paraphrased)
    
    # get_scores expects (hypothesis, reference): the summary is the hypothesis
    long_t5_score = rouge.get_scores(summarized_text[0]["summary_text"], extracted_text)
    
    
    summaries['LONGT5'] = summarized_text[0]["summary_text"]
    
    summaries['LONGT5 PARAPHRASED'] = longt5_paraphrased
    
    summaries['LONGT5 ROUGE SCORES'] = long_t5_score
    
    print("Summary and Rouge scores are saved")
#     print(f'LongT5 Summarized Text:- {summarized_text[0]["summary_text"]}\n\n')

In [31]:
# BigBird-Pegasus
def summarize_with_bigbird_pegasus(extracted_text):
    tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")

    # The encoder uses `block_sparse` attention (defaults: num_random_blocks=3, block_size=64);
    # `block_size` and `num_random_blocks` are overridden here. Passing
    # attention_type="original_full" instead would switch the encoder to full attention.
    # The decoder attention type cannot be changed and is always "original_full".
    # The model is loaded once rather than three times.
    model = BigBirdPegasusForConditionalGeneration.from_pretrained(
        "google/bigbird-pegasus-large-arxiv", block_size=16, num_random_blocks=2
    )

    inputs = tokenizer(extracted_text, return_tensors='pt', padding=False)
    prediction = model.generate(**inputs)
    # skip_special_tokens drops markers such as </s> from the decoded summary
    prediction = tokenizer.batch_decode(prediction, skip_special_tokens=True)

    paraphrased = paraphrase(prediction[0])
    bigbird_paraphrased = ' '.join(paraphrased)

    # get_scores expects (hypothesis, reference): the summary is the hypothesis
    bigbird_score = rouge.get_scores(prediction[0], extracted_text)

    summaries['BIGBIRD'] = prediction[0]
    summaries['BIGBIRD PARAPHRASED'] = bigbird_paraphrased
    summaries['BIGBIRD ROUGE SCORES'] = bigbird_score

    print("Summary and Rouge scores are saved")
#     print(f"BIGBIRD Summarized Text: {prediction}")
        
        


In [32]:
# LED (Longformer Encoder-Decoder)
def summarize_with_led_large(extracted_text):
    tokenizer = LEDTokenizer.from_pretrained("allenai/led-large-16384-arxiv")

    input_ids = tokenizer(extracted_text, return_tensors="pt" , padding=False).input_ids
    global_attention_mask = torch.zeros_like(input_ids)
    # set global_attention_mask on first token
    global_attention_mask[:, 0] = 1

    model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384-arxiv", return_dict_in_generate=True)

    sequences = model.generate(input_ids, global_attention_mask=global_attention_mask).sequences

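    # skip_special_tokens drops markers such as </s> from the decoded summary
    summary = tokenizer.batch_decode(sequences, skip_special_tokens=True)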
    
    paraphrased = paraphrase(summary[0])
    
    
    led_paraphrased = ' '.join(paraphrased)
    
    
    # get_scores expects (hypothesis, reference): the summary is the hypothesis
    led_score = rouge.get_scores(summary[0], extracted_text)

    summaries["LED"] = summary[0]
    
    summaries['LED PARAPHRASED'] = led_paraphrased
    
    summaries['LED ROUGE SCORES'] = led_score
    
    print("Summary and rouge scores are saved")
    
#     print(f'LED LARGE Summarized Text: {summary}\n')

### Checking whether each model can handle the length of the input

In [33]:
# BART
bart_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
tokenized_output = bart_tokenizer(extracted_text)
bart_tokens = len(tokenized_output[0])

print(f'BART TOKENIZED LENGTH: {bart_tokens}')
if bart_tokens <= 1024:
    print("Generating Summary...")
    summarize_with_bart_large(extracted_text)
else:
    print("BART can not be used here")

# Pegasus
pegasus_tokenizer = AutoTokenizer.from_pretrained("google/pegasus-large")
tokenized_output = pegasus_tokenizer(extracted_text)
pegasus_tokens = len(tokenized_output[0])

print(f'PEGASUS TOKENIZED LENGTH: {pegasus_tokens}')
if pegasus_tokens <= 1024:
    print("Generating Summary...")
    summarize_with_pegasus_large(extracted_text)
else:
    print("PEGASUS can not be used here")


# LongT5
longt5_tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
tokenized_output = longt5_tokenizer(extracted_text)
longt5_tokens = len(tokenized_output[0])

print(f'LONGT5 TOKENIZED LENGTH: {longt5_tokens}')
if longt5_tokens <= 4096:
    print("Generating Summary...")
    summarize_with_longt5(extracted_text)
else:
    print("LONGT5 can not be used here")

# BIGBIRD
bigbird_tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
tokenized_output = bigbird_tokenizer(extracted_text, return_tensors="pt")
bigbird_tokens = tokenized_output['input_ids'].shape[1]


print(f'BIGBIRD TOKENIZED LENGTH: {bigbird_tokens}')
if bigbird_tokens <= 4096:
    print("Generating Summary...")
    summarize_with_bigbird_pegasus(extracted_text)
else:
    print("BIGBIRD cannot be used here")

# LED (Longformer Encoder-Decoder)
led_tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")
tokenized_output = led_tokenizer(extracted_text)
led_tokens = len(tokenized_output[0])

print(f'LED TOKENIZED LENGTH: {led_tokens}')
if led_tokens <= 16384:  # LED accepts the longest inputs of the models tried here
    print("Generating Summary...")
    summarize_with_led_large(extracted_text)
else:
    print("LED can not be used here")


BART TOKENIZED LENGTH: 5473
BART can not be used here


Token indices sequence length is longer than the specified maximum sequence length for this model (4645 > 1024). Running this sequence through the model will result in indexing errors


PEGASUS TOKENIZED LENGTH: 4645
PEGASUS can not be used here
LONGT5 TOKENIZED LENGTH: 5498
LONGT5 can not be used here


Token indices sequence length is longer than the specified maximum sequence length for this model (4645 > 4096). Running this sequence through the model will result in indexing errors


BIGBIRD TOKENIZED LENGTH: 4645
BIGBIRD cannot be used here
LED TOKENIZED LENGTH: 5473
Generating Summary...


Input ids are automatically padded from 5473 to 6144 to be a multiple of `config.attention_window`: 1024


Summary and rouge scores are saved


In [34]:
for model in summaries:
  print(model.upper())
  print(summaries[model])
  print("")

LED
</s> image segmentation is a crucial task in computer vision , with applications ranging from autonomous driving to medical image analysis. in recent years , deep learning has revolutionized this field , leading to the development of various neural network models aimed at improving segmentation accuracy . 
 one such architecture is SegNet , which we explore in this article. 
 SegNet consists of an encoder network , a corresponding decoder network , and a pixel-wise classification layer . 
 it can be trained end-to-end using stochastic gradient descent (SGD ) optimization . 
 the innovation lies in the decoder network s approach to upsampling , utilizing pooled indices from the encoder s maximum pooling step to perform nonlinear up sampling . 
 this eliminates the need for additional learning during up sampling , making SegNet efficient in both storage and computation . 
 furthermore , it can achieve similar segmentation performance to traditional methods while reducing computationa

# EXTRACTIVE SUMMARIZATION

## Extractive summarization selects key sentences or phrases from the original text to form a concise summary without altering the original wording.
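
As a minimal sketch of the idea, the example below scores each sentence, keeps the highest-scoring ones, and emits them in their original order. It is a deliberately simplified stand-in, not the method used in this notebook: it scores by average word frequency instead of TextRank, splits sentences with a naive regular expression instead of `sent_tokenize`, and the function name `simple_extractive_summary` is hypothetical.

```python
import re
from collections import Counter


def simple_extractive_summary(text, num_sentences=2):
    """Pick the top-scoring sentences verbatim and return them in original order."""
    # Naive split on ., ! or ? followed by whitespace (assumption for this sketch)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Word frequencies over the whole text
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average frequency of the sentence's words
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the best, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


text = (
    "SegNet is an encoder decoder network. "
    "The weather was nice. "
    "SegNet uses pooling indices in the decoder network."
)
print(simple_extractive_summary(text, num_sentences=2))
```

Because the selected sentences are copied unchanged, every sentence in the summary can be found verbatim in the source, which is the defining property of the extractive approach.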

In [35]:
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [36]:
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from string import punctuation
import networkx as nx

# Initialize stopwords
stop_words = list(stopwords.words('english'))

# Preprocess text
text = ' '.join(extracted_text.split())  # Remove extra whitespace
original_sentences = sent_tokenize(text)

# Clean sentences
cleaned_sentences = []
for sentence in original_sentences:
    # Convert to lowercase and remove punctuation
    cleaned = sentence.lower()
    cleaned = ''.join(char for char in cleaned if char not in punctuation)
    cleaned_sentences.append(cleaned)

# Calculate the TF-IDF matrix: each row represents a sentence and each column represents a term
tfidf = TfidfVectorizer(stop_words=stop_words)
tfidf_matrix = tfidf.fit_transform(cleaned_sentences)

# Calculate similarity matrix between sentences
similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()

# Calculate TextRank scores
nx_graph = nx.from_numpy_array(similarity_matrix)
textrank_scores = nx.pagerank(nx_graph, alpha=0.85, max_iter=50)

# Calculate position scores, which are higher for sentences near the start and end
num_sentences = len(original_sentences)
positions = np.arange(num_sentences)
middle = max(num_sentences // 2, 1)  # avoid division by zero for a single-sentence text
position_scores = 1 - np.minimum(positions, num_sentences - 1 - positions) / middle

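# Combine scores (0.7 weight for TextRank, 0.3 for position).
# Note: PageRank scores sum to 1 across all sentences while position scores lie in [0, 1],
# so the position term carries most of the weight unless the TextRank scores are rescaled.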
final_scores = {}
for idx in range(num_sentences):
    final_scores[idx] = 0.7 * textrank_scores[idx] + 0.3 * position_scores[idx]

# Select top sentences
min_sentences = 3
ratio = 0.3
num_sentences = max(min_sentences, int(len(original_sentences) * ratio))

# Sort sentences by score and get top ones
selected_indices = sorted(sorted(final_scores.items(), key=lambda x: x[1], reverse=True)[:num_sentences])

# Create summary maintaining original sentence order
summary_sentences = [original_sentences[idx] for idx, _ in selected_indices]
summary = ' '.join(summary_sentences)


# get_scores expects (hypothesis, reference): the summary is the hypothesis
extractive_score = rouge.get_scores(summary, extracted_text)

# Print results
print("Original Text Length:", len(text))
print("Summary Length:", len(summary))
print("\nSummary:\n", summary)


print(f'Rouge Scores are {extractive_score}')

Original Text Length: 22799
Summary Length: 7197

Summary:
 Academic Journal of Science and Technology ISSN: 2771-3032 | Vol. 9, No. 2, 2024 SegNet Network Architecture for Deep Learning Image Segmentation and Its Integrated Applications and Prospects Chenwei Zhang1, *, Wenran Lu2, Jiang Wu3, Chunhe Ni4, Hongbo Wang5 1Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL, USA 2Electrical Engineering, University of Texas at Austin, Austin, TX, USA 3Computer Science, University of Southern California, Los Angeles, CA, USA 4Computer Science, University of Texas at Dallas, Richardson, TX, USA 5Computer Science, University of Southern California, Los Angeles, CA, USA * Corresponding author: zchenwei66@gmail.com Abstract: Semantic image segmentation is a crucial task in computer vision, with applications ranging from autonomous driving to medical image analysis. In recent years, deep learning has revolutionized this field, leading to the development of vari