Summary

This notebook loads a DataFrame of blog texts and generates one extractive summary per text. For each text it picks one of three summarization algorithms (TextRank, LSA, or LexRank) at random, and independently picks a compression rate at random from 0.1, 0.3, 0.5, and 0.7. The compression rate sets the summary length: the number of sentences in the text is multiplied by the rate and converted to an integer. Each summary is stored in the 'Summary' column of the DataFrame. The notebook then prints the percentage of summaries produced with each approach and with each compression rate, and saves the DataFrame, including the summaries, to a new CSV file.
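The sentence-count step can be sketched in isolation. This is a minimal illustration, not the notebook's exact code: the helper name `summary_length` is hypothetical, and the floor of one sentence is an assumption added here so that very short texts do not end up with an empty summary.

```python
def summary_length(sentence_count: int, compression_rate: float) -> int:
    """Return how many sentences to keep for a summary.

    The product is truncated toward zero by int(). The minimum of one
    sentence is an assumption of this sketch, not part of the notebook.
    """
    return max(1, int(sentence_count * compression_rate))


if __name__ == "__main__":
    # A 9-sentence text at each of the notebook's compression rates.
    for rate in (0.1, 0.3, 0.5, 0.7):
        print(rate, summary_length(9, rate))
```

Because `int()` truncates, a short text at a low rate (for example 5 sentences at 0.1) would otherwise request zero sentences.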

In [1]:
!pip install sumy

import random

import pandas as pd
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.utils import get_stop_words

In [2]:
df = pd.read_csv("../data/Blogs_result/dataset.csv")

In [3]:
text_rank_summarizer = TextRankSummarizer()
lsa_summarizer = LsaSummarizer()
lex_rank_summarizer = LexRankSummarizer()
summarizers = [text_rank_summarizer, lsa_summarizer, lex_rank_summarizer]

# Fraction of each text's sentences to keep in its summary.
compression_rates = [0.1, 0.3, 0.5, 0.7]

# Tally how often each approach and each compression rate is drawn.
approach_counts = {type(s).__name__: 0 for s in summarizers}
compression_rate_counts = {rate: 0 for rate in compression_rates}

# The tokenizer does not depend on the row, so build it once.
tokenizer = Tokenizer("english")

for index, row in df.iterrows():
    # Pick one summarizer and one compression rate at random for this text.
    summarizer = random.choice(summarizers)
    compression_rate = random.choice(compression_rates)

    approach_counts[type(summarizer).__name__] += 1
    compression_rate_counts[compression_rate] += 1

    sentences = tokenizer.to_sentences(row['Text'])

    # Keep at least one sentence so short texts do not get an empty summary.
    num_sentences = max(1, int(len(sentences) * compression_rate))

    parser = PlaintextParser.from_string(row['Text'], tokenizer)
    summary = summarizer(parser.document, num_sentences)

    df.loc[index, 'Summary'] = ' '.join(str(sentence) for sentence in summary)


total_summaries = len(df)

approach_percentages = {approach: (count / total_summaries) * 100 for approach, count in approach_counts.items()}
compression_rate_percentages = {rate: (count / total_summaries) * 100 for rate, count in compression_rate_counts.items()}

for approach, percentage in approach_percentages.items():
    print(f"The approach {approach} was used for {percentage:.2f}% of the summaries.")

for rate, percentage in compression_rate_percentages.items():
    print(f"The compression rate {rate} was used for {percentage:.2f}% of the summaries.")

The approach TextRankSummarizer was used for 33.57% of the summaries.
The approach LsaSummarizer was used for 34.14% of the summaries.
The approach LexRankSummarizer was used for 32.29% of the summaries.
The compression rate 0.1 was used for 24.62% of the summaries.
The compression rate 0.3 was used for 25.13% of the summaries.
The compression rate 0.5 was used for 25.99% of the summaries.
The compression rate 0.7 was used for 24.25% of the summaries.
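The percentages above come from unseeded calls to `random.choice`, so they change from run to run. A minimal sketch of the same tally with a fixed seed is shown here; the helper name `usage_percentages` and the `seed` parameter are assumptions made for illustration and do not appear in the notebook.

```python
import random
from collections import Counter


def usage_percentages(choices, n_draws, seed=0):
    """Draw n_draws items uniformly from choices and return each item's share in percent.

    A dedicated random.Random(seed) instance makes the draws repeatable
    without touching the global random state.
    """
    rng = random.Random(seed)
    counts = Counter(rng.choice(choices) for _ in range(n_draws))
    return {choice: 100 * counts[choice] / n_draws for choice in choices}


if __name__ == "__main__":
    approaches = ["TextRankSummarizer", "LsaSummarizer", "LexRankSummarizer"]
    for name, pct in usage_percentages(approaches, 1000).items():
        print(f"The approach {name} was used for {pct:.2f}% of the draws.")
```

With uniform draws, each of the three approaches should land near one third and each of the four compression rates near one quarter, which is consistent with the output shown.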


In [5]:
df.head()

Unnamed: 0,Text,Class,Summary
0,I have heard nothing from the Ambassador about...,Political speech,"I fully covered, in my conference last week, m..."
1,I think it is in the public interest to procee...,Political speech,I think it is in the public interest to procee...
2,The A-11 aircraft now at Edwards Air force Bas...,Political speech,The development of a supersonic commercial tra...
3,It is one of the most comprehensive bills in t...,Political speech,I hope that we can work toward the goal of som...
4,"So long as there remains a man without a job, ...",Political speech,But while we pursue these unfinished tasks at ...


In [6]:
import os

folder_path = '../data/Blogs_result/'
file_name = 'blogsresult_refsummarys.csv'

# os.path.join inserts a path separator only when one is needed.
file_path = os.path.join(folder_path, file_name)

# index=False keeps the DataFrame index out of the saved file.
df.to_csv(file_path, index=False)

print(f"DataFrame saved as CSV file at: {file_path}")

DataFrame saved as CSV file at: ../data/Blogs_result/blogsresult_refsummarys.csv
