# ***Text Summarization***

# ***Using a Basic Algorithm***

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

In [1]:
"""
This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks such as BERT, RoBERTa, and XLM-RoBERTa, and achieve state-of-the-art performance on various tasks. Text is embedded in a vector space such that similar texts are close together and can be found efficiently using cosine similarity.
"""
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-4.0.2-py3-none-any.whl.metadata (13 kB)
Downloading sentence_transformers-4.0.2-py3-none-any.whl (340 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-4.0.2
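
To make the cosine-similarity idea from the description above concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the names `cosine`, `query`, and `candidates` are hypothetical stand-ins; real sentence embeddings have hundreds of dimensions and would come from a model rather than being typed by hand.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" used only for illustration
query = [1.0, 0.0, 1.0]
candidates = {
    "similar": [2.0, 0.0, 2.0],    # same direction as the query, different length
    "partial": [1.0, 1.0, 0.0],    # shares one dimension with the query
    "unrelated": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Rank candidates from most to least similar to the query
ranked = sorted(candidates, key=lambda name: cosine(query, candidates[name]), reverse=True)
print(ranked)
```

Cosine similarity ignores vector length, so the "similar" vector scores highest even though it is twice as long as the query.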


In [None]:
# importing the required modules
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import string
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt_tab")
nltk.download("stopwords")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\LOQ\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LOQ\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [47]:
def preprocess(text):
    # Lowercasing
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenization
    words = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]
    # Join words to form the cleaned-up text
    cleaned_text = ' '.join(filtered_words)
    return cleaned_text

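def summarize(text, n):
    # Split into '.'-delimited segments and drop empty ones
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    vectorizer = TfidfVectorizer()
    sentence_vectors = vectorizer.fit_transform(sentences)
    # cosine_similarity returns shape (1, n_segments); flatten it so that
    # each segment gets its own score (similarity to the first segment)
    scores = cosine_similarity(sentence_vectors[0:1], sentence_vectors).ravel()
    # Indices of the n highest-scoring segments
    top_sentence_indices = np.argsort(scores)[::-1][:n]
    summary = '. '.join(np.array(sentences)[top_sentence_indices])
    return summary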

document = """Machines have been an integral part of human progress for centuries. They have significantly transformed industries, economies, and daily life. In this essay, we will explore the definition and evolution of machines, their impact on society, significant contributors to the field, various perspectives on the implications of machine development, and future developments that are likely to shape the modern landscape.

Firstly, machines can be defined as mechanical devices that utilize energy to perform tasks. They vary from simple tools like levers and pulleys to complex systems like computers and robots. Throughout history, machines have evolved from rudimentary designs to sophisticated entities. The Industrial Revolution marked a pivotal moment in machine development. The introduction of steam power revolutionized industries, leading to mass production and efficiency.


One cannot discuss the evolution of machines without recognizing key figures who have contributed to their advancement. For instance, James Watt is well-known for improving the steam engine, which played a crucial role during the Industrial Revolution. His innovations sparked a wave of mechanization, changing the way people worked. Similarly, Nikola Tesla and Thomas Edison were instrumental in the development of electricity and electrical machinery. Their inventions laid the groundwork for modern machines and automation that are prevalent today. In recent years, figures like Elon Musk and pioneers in artificial intelligence have emerged as significant influencers in the realm of machine development. Their work continues to push the boundaries of what machines can achieve.

The impact of machines on society is profound. On one hand, machines have increased productivity, improved efficiency, and reduced manual labor. For instance, agricultural machines have transformed farming, leading to increased crop yields and food production. Similarly, manufacturing machines have enabled higher production rates, reducing costs and enhancing accessibility. These advancements contribute to economic growth and improvement in living standards.

However, the rise of machines has also led to significant societal challenges. One of the most pressing concerns is the impact of automation on employment. With machines taking over repetitive tasks, many fear job displacement. This concern has been echoed in various sectors, including manufacturing, clerical work, and even professional services. The World Economic Forum predicts that automation may displace millions of jobs while creating new ones that require different skill sets. This transition can lead to social upheaval and calls for policies that can address workforce retraining and education.

From an ethical perspective, the reliance on machines raises questions about autonomy, privacy, and security. As machines become more intelligent, their decision-making capabilities have garnered significant attention. The development of artificial intelligence, particularly machine learning, has led to machines that can learn from data and make decisions without human intervention. This raises ethical dilemmas regarding accountability and transparency. If an automated vehicle causes an accident, who is responsible? Such questions challenge the very notion of free will and human oversight in critical decisions.

Moreover, the integration of machines into daily life presents psychological impacts. The dependency on technology can lead to decreased physical activity and social interaction. For example, the prevalence of personal devices like smartphones has altered communication patterns, sometimes fostering isolation rather than connection. Therefore, while machines enhance convenience and efficiency, they can lead to unintended consequences on individual behavior and societal norms.

Despite these challenges, the future of machines is promising. Innovations like robotics, artificial intelligence, and the Internet of Things are pushing machines to operate at unprecedented levels. Industries are increasingly integrating smart machines that can communicate, share data, and optimize performance. This interconnectedness is paving the way for a digital economy where machines can make real-time decisions, improving efficiency across sectors.

Furthermore, as environmental concerns gain prominence, the development of machines aimed at sustainability is on the rise. Green technologies, energy-efficient machines, and renewable energy systems represent a significant shift towards minimizing the ecological footprint. These innovations not only demonstrate the adaptability of machine technology but also its potential to address pressing global challenges like climate change.

In conclusion, machines have fundamentally altered how we work, live, and interact with the world. While they offer remarkable benefits, such as increased productivity and enhanced quality of life, they also present significant ethical, social, and economic challenges. The ongoing dialogue about the implications of machine development is essential in navigating this complex landscape. As we look to the future, the integration of machines must be balanced with considerations of human welfare, societal impact, and ethical responsibility. The discourse surrounding machines will continue to evolve as technological advancements reshape our understanding and interaction with these powerful tools.

References

[1] J. Watt, "Improvements on the steam engine," in Proceedings of the Royal Society, 1776, pp. 23-25.

[2] N. Tesla, "Alternating Current Electricity," in Electrical Engineering, vol. 12, no. 4, pp. 56-60, 1893.

[3] T. Edison, "Electric Light," in Journal of Electrical Engineering, vol. 10, no. 1, pp. 1-12, 1879.

[4] World Economic Forum, "The Future of Jobs Report," 2020.

[5] J. Doe, "Ethics of AI and Autonomous Machines," in AI Ethics Journal, vol. 34, no. 2, pp. 101-115, 2022.

[6] National Renewable Energy Laboratory, "Green Technologies and Sustainable Machines," 2021.

"""

summary = summarize(preprocess(document), 4)
print("Summary:", summary)

Summary: machines integral part human progress centuries significantly transformed industries economies daily life essay explore definition evolution machines impact society significant contributors field various perspectives implications machine development future developments likely shape modern landscape firstly machines defined mechanical devices utilize energy perform tasks vary simple tools like levers pulleys complex systems like computers robots throughout history machines evolved rudimentary designs sophisticated entities industrial revolution marked pivotal moment machine development introduction steam power revolutionized industries leading mass production efficiency one discuss evolution machines without recognizing key figures contributed advancement instance james watt wellknown improving steam engine played crucial role industrial revolution innovations sparked wave mechanization changing way people worked similarly nikola tesla thomas edison instrumental development ele
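
The summary printed above is essentially the whole preprocessed essay: `preprocess` strips every period before `summarize` splits on `'.'`, so only a single segment is ever scored. A minimal sketch of the alternative ordering follows: split into sentences first, clean each sentence only for scoring, and return the original sentences. Everything in it is an assumption made for illustration (the regex splitter standing in for `sent_tokenize`, the tiny `STOP_WORDS` set, the bag-of-words centrality score, and the name `summarize_sentences`); it is not the notebook's method.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; nltk's English list is much longer
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "on"}

def split_sentences(text):
    """Rough sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def bag_of_words(sentence):
    """Lowercased word counts with stopwords removed (used for scoring only)."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def summarize_sentences(text, n):
    sentences = split_sentences(text)
    bags = [bag_of_words(s) for s in sentences]
    # Centrality: total similarity of each sentence to every other sentence
    scores = [
        sum(cosine(bag, other) for j, other in enumerate(bags) if j != i)
        for i, bag in enumerate(bags)
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    # Emit the original (uncleaned) sentences in document order
    return " ".join(sentences[i] for i in sorted(top))

text = ("Machines transform industry. Machines also transform daily life. "
        "Bananas are yellow. Industry depends on machines.")
summary_sketch = summarize_sentences(text, 2)
print(summary_sketch)
```

Because cleaning happens per sentence, the sentence boundaries survive and the returned text keeps its punctuation and stopwords.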

In [4]:
pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1
Note: you may need to restart the kernel to use updated packages.


In [51]:
import time
from rouge import Rouge

def evaluate_summary(generated_summary, reference_summary, original_text):
    # Calculate ROUGE scores of the generated summary against the reference
    rouge = Rouge()
    scores = rouge.get_scores(generated_summary, reference_summary, avg=True)

    # Word count metrics
    original_word_count = len(word_tokenize(original_text))
    summary_word_count = len(word_tokenize(generated_summary))
    reference_word_count = len(word_tokenize(reference_summary))

    # Calculate word count ratios
    compression_ratio = summary_word_count / original_word_count
    relative_length_to_reference = summary_word_count / reference_word_count

    # Print evaluation metrics
    print("ROUGE Scores:", scores)
    print("Original Text Word Count:", original_word_count)
    print("Generated Summary Word Count:", summary_word_count)
    print("Compression Ratio (Summary to Original):", compression_ratio)
    print("Relative Length to Reference:", relative_length_to_reference)


evaluate_summary(summary, document, document)


ROUGE Scores: {'rouge-1': {'r': 0.5906313645621182, 'p': 0.7651715039577837, 'f': 0.666666661749531}, 'rouge-2': {'r': 0.19695044472681067, 'p': 0.29245283018867924, 'f': 0.23538344241894885}, 'rouge-l': {'r': 0.5906313645621182, 'p': 0.7651715039577837, 'f': 0.666666661749531}}
Original Text Word Count: 1013
Generated Summary Word Count: 550
Compression Ratio (Summary to Original): 0.5429417571569596
Relative Length to Reference: 0.5429417571569596


# **Basic Algorithm: Testing PDF Files for Summarization**

In [None]:
import fitz  # PyMuPDF
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from rouge import Rouge


In [59]:

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

def preprocess_text(text):
    sentences = sent_tokenize(text)
    stop_words = set(stopwords.words('english'))
    filtered_sentences = []
    for sentence in sentences:
        words = word_tokenize(sentence)
        filtered_sentence = ' '.join([word.lower() for word in words if word.isalnum() and word.lower() not in stop_words])
        filtered_sentences.append(filtered_sentence)
    return filtered_sentences


def summarize_text(preprocessed_sentences):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(preprocessed_sentences)

    # Sentence importance: sum of the TF-IDF weights in each row (one row per sentence)
    scores = tfidf_matrix.sum(axis=1).flatten().tolist()[0]  # flat list of scores

    # Normalize scores to fall between 0 and 1
    max_score = max(scores) if max(scores) > 0 else 1
    normalized_scores = [score / max_score for score in scores]

    # Dynamic cut-off: mean score + 0.2 * (population) standard deviation of scores
    mean_score = sum(normalized_scores) / len(normalized_scores)
    std_dev_score = (sum([(x - mean_score) ** 2 for x in normalized_scores]) / len(normalized_scores)) ** 0.5
    cut_off_score = mean_score + 0.2 * std_dev_score

    # Collect sentences that meet or exceed the cut-off score
    summarized_sentences = [sentence for score, sentence in zip(normalized_scores, preprocessed_sentences) if score >= cut_off_score]

    summarized = ' '.join(summarized_sentences)

    return summarized

def evaluate_summary(original_text, summary):
    rouge = Rouge()
    scores = rouge.get_scores(summary, original_text)
    original_count = len(word_tokenize(original_text))
    summary_count = len(word_tokenize(summary))
    return scores, original_count, summary_count

pdf_text = extract_text_from_pdf('../Prroject -Text Summarization/PC.pdf')
preprocessed_sentences = preprocess_text(pdf_text)
summary = summarize_text(preprocessed_sentences)
scores, original_count, summary_count = evaluate_summary(pdf_text, summary)

print("ROUGE Scores:", scores)
print("Original Word Count:", original_count)
print("Summary Word Count:", summary_count)



RecursionError: maximum recursion depth exceeded
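
The `RecursionError` above appears when the full PDF text is scored. The traceback is not shown, so the cause is an assumption here: a likely one is that the ROUGE-L computation in the `rouge` package walks the longest common subsequence recursively and exhausts Python's recursion limit on very long inputs. As a hedged illustration of an overlap metric that needs no recursion, the sketch below computes ROUGE-N style recall, precision, and F1 from n-gram counts. The helper names `ngram_counts` and `rouge_n` are hypothetical, tokenization is plain whitespace splitting, and the numbers will not match the `rouge` package exactly.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(hypothesis, reference, n=1):
    """Recall, precision and F1 of clipped n-gram overlap between two strings."""
    hyp = ngram_counts(hypothesis.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((hyp & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(hyp.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"r": recall, "p": precision, "f": f1}

unigram_scores = rouge_n("the cat sat", "the cat sat on the mat", n=1)
bigram_scores = rouge_n("the cat sat", "the cat sat on the mat", n=2)
print(unigram_scores)
print(bigram_scores)
```

The work here is a single pass over each token list, so input length affects only running time and memory, not call-stack depth.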

# Advanced Algorithm

In [None]:
import fitz  # PyMuPDF
import numpy as np
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from rouge import Rouge




def read_pdf(file_path = "../Prroject -Text Summarization/PC.pdf"):
    doc = fitz.open(file_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    print("Extracted text length:", len(text))  # Debugging line
    return text

def preprocess(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]
    cleaned_text = ' '.join(filtered_words)
    return cleaned_text

def summarize2(text, max_words):
    sentences = sent_tokenize(text)
    vectorizer = TfidfVectorizer()
    sentence_vectors = vectorizer.fit_transform(sentences)
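    # cosine_similarity returns shape (1, n_sentences); flatten it so that
    # each sentence gets its own score (similarity to the first sentence)
    scores = cosine_similarity(sentence_vectors[0:1], sentence_vectors).ravel()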

    sorted_indices = np.argsort(scores)[::-1]
    summary_sentences = []
    word_count = 0

    for index in sorted_indices:
        sentence_word_count = len(word_tokenize(sentences[index]))
        if word_count + sentence_word_count > max_words:
            break
        summary_sentences.append(sentences[index])
        word_count += sentence_word_count

    summary = ' '.join(summary_sentences)
    return summary


file_path = "../Prroject -Text Summarization/PC.pdf"
document = read_pdf(file_path)
processed_text = preprocess(document)
Adsummary = summarize2(processed_text, max_words=45505)
print("Summary:", Adsummary)

Extracted text length: 117508
Summary: introduction parallel computing computer software written conventionally serial computing meant solve problem algorithm divides problem smaller instructions discrete instructions executed central processing unit computer one one one instruction finished next one starts reallife example would people standing queue waiting movie ticket cashier cashier giving tickets one one persons complexity situation increases 2 queues one cashier short serial computing following 1 problem statement broken discrete instructions 2 instructions executed one one 3 one instruction executed moment time look point 3 causing huge problem computing industry one instruction getting executed moment time huge waste hardware resources one part hardware running particular instruction time problem statements getting heavier bulkier amount time execution statements examples processors pentium 3 pentium 4 let ’ come back reallife problem could definitely say complexity decrease 2

In [23]:
def evaluate_PDF(original_text, generated_summary):
    original_count = len(word_tokenize(original_text))
    summary_count = len(word_tokenize(generated_summary))
    compression_ratio = summary_count / original_count
    return original_count, summary_count, compression_ratio

original_count, summary_count, compression_ratio = evaluate_PDF(document, Adsummary)
print("Original Word Count:", original_count)
print("Summary Word Count:", summary_count)
print("Compression Ratio:", compression_ratio)


Original Word Count: 18858
Summary Word Count: 11084
Compression Ratio: 0.5877611623714074


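**Hybrid Techniques for Text Summarization**

Hybrid text summarization combines machine learning with optimization techniques. In this section, sentences are turned into TF-IDF vectors, grouped with k-means clustering, and one high-scoring sentence is kept from each cluster.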
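
As a rough illustration of the cluster-then-select idea, the sketch below groups a few hand-made sentence vectors with a very small k-means and keeps, from each cluster, the sentence whose vector has the largest sum. The toy two-dimensional vectors, the deterministic initialization from the first `k` vectors, and the names `kmeans` and `pick_representatives` are assumptions for this example; a real run would use TF-IDF or embedding vectors and a library clusterer such as scikit-learn's `KMeans`.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iterations=10):
    """Tiny k-means; centroids start at the first k vectors so results are deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def pick_representatives(sentences, vectors, k):
    """Keep, per cluster, the sentence whose vector has the largest sum of weights."""
    labels = kmeans(vectors, k)
    best = {}
    for i, label in enumerate(labels):
        if label not in best or sum(vectors[i]) > sum(vectors[best[label]]):
            best[label] = i
    # Return the chosen sentences in their original order
    return [sentences[i] for i in sorted(best.values())]

sentences = [
    "Steam engines powered factories.",
    "Automation may displace jobs.",
    "Steam power enabled mass production in factories.",
    "Workers need retraining.",
]
# Hypothetical 2-D vectors: first axis ~ "industry", second axis ~ "employment"
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.3], [0.2, 0.7]]

representatives = pick_representatives(sentences, vectors, k=2)
print(representatives)
```

The number of clusters directly sets the summary length in sentences, which is why the choice of `k` relative to the document's sentence count matters so much for the compression ratio.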

In [11]:
pip install nltk gensim scikit-learn


Note: you may need to restart the kernel to use updated packages.


In [39]:
pip install python-docx


Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.1.2
Note: you may need to restart the kernel to use updated packages.


In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from sklearn.cluster import KMeans
import numpy as np
from docx import Document
from rouge import Rouge


def read_text_from_pdf(pdf_path):
    from PyPDF2 import PdfReader
    reader = PdfReader(pdf_path)
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text

def preprocess_sentence(sentence):
    # Lowercase the tokens and keep only alphanumeric ones
    return [word.lower() for word in word_tokenize(sentence) if word.isalnum()]

def create_sentence_vectors(sentences, dictionary, tfidf):
    sentence_vectors = []
    for sentence in sentences:
        words = preprocess_sentence(sentence)
        bow = dictionary.doc2bow(words)
        dense_vector = np.zeros(len(dictionary), dtype=float)
        for idx, score in tfidf[bow]:
            dense_vector[idx] = score
        sentence_vectors.append(dense_vector)
    return sentence_vectors

def cluster_sentences(sentence_vectors, num_clusters):
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(sentence_vectors)
    return kmeans.labels_

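def summarize_text(text, num_clusters=520):
    sentences = sent_tokenize(text)
    # KMeans cannot form more clusters than there are sentences
    num_clusters = min(num_clusters, len(sentences))
    processed_sents = [preprocess_sentence(sent) for sent in sentences]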
    
    dictionary = Dictionary(processed_sents)
    bow_corpus = [dictionary.doc2bow(sent) for sent in processed_sents]
    tfidf_model = TfidfModel(bow_corpus)
    
    sentence_vectors = create_sentence_vectors(sentences, dictionary, tfidf_model)
    clusters = cluster_sentences(sentence_vectors, num_clusters)
    
    cluster_to_sentence = {}
    for i, cluster_id in enumerate(clusters):
        if cluster_id not in cluster_to_sentence:
            cluster_to_sentence[cluster_id] = i
        else:
            existing_idx = cluster_to_sentence[cluster_id]
            if sum(sentence_vectors[i]) > sum(sentence_vectors[existing_idx]):
                cluster_to_sentence[cluster_id] = i
    
    selected_sentences = sorted(cluster_to_sentence.values())
    summary = ' '.join([sentences[idx] for idx in selected_sentences])
    
    return summary

def evaluate_summary(hypothesis, reference):
    rouge = Rouge()
    scores = rouge.get_scores(hypothesis, reference)
    return scores

def word_count(text):
    return len(word_tokenize(text))

def save_to_word(document_text, filename):
    doc = Document()
    doc.add_paragraph(document_text)
    doc.save(filename)
    print(f'Word document saved as {filename}')
    
    
pdf_path = '../Prroject -Text Summarization/PC.pdf'
text = read_text_from_pdf(pdf_path)
generated_summary = summarize_text(text)

# Calculate and print word counts and compression ratio
original_word_count = word_count(text)
summary_word_count = word_count(generated_summary)
compression_ratio = summary_word_count / original_word_count

print("Original Text Word Count:", original_word_count)
print("Generated Summary Word Count:", summary_word_count)
print("Compression Ratio:", compression_ratio)

save_to_word(generated_summary, 'summarized_output.docx')




Original Text Word Count: 19158
Generated Summary Word Count: 12229
Compression Ratio: 0.6383234158054076
Word document saved as summarized_output.docx
