# Text or Document Summarization

## Extraction Based

In the code below, we first load the NLTK library and its English stop-word list. We then define the text that we want to summarize and split it into sentences using the sent_tokenize() function. Next, we build a frequency table for the words in the text using a defaultdict() object from the collections module: we tokenize each sentence into words with word_tokenize() and count every token that is not in the stop-word list.

We then score each sentence by adding up the frequencies of the words it contains, storing the scores in a second defaultdict() object. Finally, we select the N highest-scoring sentences and print them as the summary.

This is a simple frequency-based example of extraction-based summarization. More advanced techniques such as Latent Semantic Analysis (LSA) or TextRank could be used to improve the quality of the summary.

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import defaultdict
from nltk.stem import WordNetLemmatizer
from nltk.probability import FreqDist
import heapq

In [2]:
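nltk.download('stopwords')
nltk.download('wordnet')
# sent_tokenize() and word_tokenize() also need the Punkt tokenizer data
# ('punkt', or 'punkt_tab' on recent NLTK releases) if it is not already installed
from nltk.corpus import wordnet
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()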

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# text = '''
# Back in 2019, a non-profit research group called OpenAI created a software program that could generate paragraphs of coherent text and perform rudimentary reading comprehension and analysis without specific instruction.

# OpenAI initially decided not to make its creation, called GPT-2, fully available to the public out of fear that people with malicious intent could use it to generate massive amounts disinformation and propaganda. In a press release announcing its decision, the group called the program "too dangerous".

# Fast forward three years, and artificial intelligence capabilities have increased exponentially.

# In contrast to that last limited distribution, the next offering, GPT-3, was made readily available in November. The Chatbot-GPT interface derived from that programming was the service that launched a thousand news articles and social media posts, as reporters and experts tested its capabilities - often with eye-popping results.

# Chatbot-GPT scripted stand-up routines in the style of the late comedian George Carlin about the Silicon Valley Bank failure. It opined on Christian theology. It wrote poetry. It explained quantum theory physics to a child as though it were rapper Snoop Dogg. Other AI models, like Dall-E, generated visuals so compelling they have sparked controversy over their inclusion on art websites.

# Machines, at least to the naked eye, have achieved creativity.

# On Tuesday, OpenAI debuted the latest iteration of its program, GPT-4, which it says has robust limits on abusive uses. Early clients include Microsoft, Merrill Lynch and the government of Iceland. And at the South by Southwest Interactive conference in Austin, Texas, this week - a global gathering of tech policymakers, investors and executives - the hottest topic of conversation was the potential, and power, of artificial intelligence programs.

# Arati Prabhakar, director of the White House's Office of Science and Technology Policy, says she is excited about the possibilities of AI, but she also had a warning.

# "What we are all seeing is the emergence of this extremely powerful technology. This is an inflection point," she told a conference panel audience. "All of history shows that these kinds of powerful new technologies can and will be used for good and for ill."

# Her co-panelist, Austin Carson, was a bit more blunt.

# "If in six months you are not completely freaked the (expletive) out, then I will buy you dinner," the founder of SeedAI, an artificial intelligence policy advisory group, told the audience.

# '''

In [4]:
text = '''
Alice was not a bit hurt, and she jumped up on to her feet in a moment:
she looked up, but it was all dark overhead; before her was another long
passage, and the White Rabbit was still in sight, hurrying down it.
There was not a moment to be lost: away went Alice like the wind, and
was just in time to hear it say, as it turned a corner, "Oh my ears and
whiskers, how late it's getting!" She was close behind it when she
turned the corner, but the Rabbit was no longer to be seen: she found
herself in a long, low hall, which was lit up by a row of lamps hanging
from the roof.

There were doors all round the hall, but they were all locked; and when
Alice had been all the way down one side and up the other, trying every
door, she walked sadly down the middle, wondering how she was ever to
get out again.

Suddenly she came upon a little three-legged table, all made of solid
glass; there was nothing on it but a tiny golden key, and Alice's first
idea was that this might belong to one of the doors of the hall; but,
alas! either the locks were too large, or the key was too small, but at
any rate it would not open any of them. However, on the second time
round, she came upon a low curtain she had not noticed before, and
behind it was a little door about fifteen inches high: she tried the
little golden key in the lock, and to her great delight it fitted!
Alice opened the door and found that it led into a small passage, not
much larger than a rat-hole: she knelt down and looked along the passage
into the loveliest garden you ever saw. How she longed to get out of
that dark hall, and wander about among those beds of bright flowers and
those cool fountains, but she could not even get her head through the
doorway; "and even if my head would go through," thought poor Alice, "it
would be of very little use without my shoulders. Oh, how I wish I could
shut up like a telescope! I think I could, if I only knew how to begin."
For, you see, so many out-of-the-way things had happened lately, that
Alice had begun to think that very few things indeed were really
impossible.

There seemed to be no use in waiting by the little door, so she went
back to the table, half hoping she might find another key on it, or at
any rate a book of rules for shutting people up like telescopes: this
time she found a little bottle on it ("which certainly was not here
before," said Alice,) and tied round the neck of the bottle was a paper
label, with the words "DRINK ME" beautifully printed on it in large
letters.

It was all very well to say "Drink me," but the wise little Alice was
not going to do _that_ in a hurry. "No, I'll look first," she said, "and
see whether it's marked '_poison_' or not;" for she had read several
nice little stories about children who had got burnt, and eaten up by
wild beasts, and other unpleasant things, all because they _would_ not
remember the simple rules their friends had taught them: such as, that a
red-hot poker will burn you if you hold it too long; and that, if you
cut your finger _very_ deeply with a knife, it usually bleeds; and she
had never forgotten that, if you drink much from a bottle marked
"poison," it is almost certain to disagree with you, sooner or later.

However, this bottle was _not_ marked "poison," so Alice ventured to
taste it, and finding it very nice (it had, in fact, a sort of mixed
flavour of cherry-tart, custard, pineapple, roast turkey, coffee, and
hot buttered toast,) she very soon finished it off.
'''

In [5]:
sentences = sent_tokenize(text)
sentences[:2]

['\nAlice was not a bit hurt, and she jumped up on to her feet in a moment:\nshe looked up, but it was all dark overhead; before her was another long\npassage, and the White Rabbit was still in sight, hurrying down it.',
 'There was not a moment to be lost: away went Alice like the wind, and\nwas just in time to hear it say, as it turned a corner, "Oh my ears and\nwhiskers, how late it\'s getting!"']

In [6]:
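# Count how often each token that is not a stop word occurs
word_freq = defaultdict(int)
# range(len(sentences)-1) stops one short, so the final sentence is not counted
for i in range(len(sentences)-1):
    words = word_tokenize(sentences[i])

    for word in words:
        # Case-sensitive check: capitalized tokens such as 'There' and
        # punctuation such as ',' are not in stop_words, so they are counted
        if word not in stop_words:
            word_freq[word] += 1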
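
A slightly more defensive variant of the frequency table can be sketched as follows. It visits every sentence, lowercases each token before the stop-word check, and skips tokens that contain no letters or digits, so punctuation does not dominate the counts. The function name `build_word_freq` and the small example inputs are illustrative assumptions, not part of the notebook; in practice the token lists would come from word_tokenize() and the stop words from NLTK.

```python
from collections import defaultdict


def build_word_freq(tokenized_sentences, stop_words):
    """Count lowercased word tokens across all sentences, skipping stop words and punctuation."""
    word_freq = defaultdict(int)
    for tokens in tokenized_sentences:      # every sentence, including the last one
        for token in tokens:
            word = token.lower()            # so 'There' matches the stop word 'there'
            if word in stop_words:
                continue
            if not any(ch.isalnum() for ch in word):
                continue                    # pure punctuation such as ',' or '``'
            word_freq[word] += 1
    return word_freq


# Hypothetical pre-tokenized input standing in for word_tokenize() output.
example_sentences = [
    ["There", "was", "a", "little", "door", "."],
    ["Alice", "opened", "the", "little", "door", ",", "and", "looked", "."],
]
example_stop_words = {"there", "was", "a", "the", "and"}

freq = build_word_freq(example_sentences, example_stop_words)
print(dict(freq))
```

Lowercasing merges capitalized and lowercase forms of the same word, which may or may not be desirable for proper nouns such as "Alice".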


In [7]:
word_freq

defaultdict(int,
            {'Alice': 9,
             'bit': 1,
             'hurt': 1,
             ',': 53,
             'jumped': 1,
             'feet': 1,
             'moment': 2,
             ':': 7,
             'looked': 2,
             'dark': 2,
             'overhead': 1,
             ';': 8,
             'another': 2,
             'long': 3,
             'passage': 3,
             'White': 1,
             'Rabbit': 2,
             'still': 1,
             'sight': 1,
             'hurrying': 1,
             '.': 11,
             'There': 3,
             'lost': 1,
             'away': 1,
             'went': 2,
             'like': 3,
             'wind': 1,
             'time': 3,
             'hear': 1,
             'say': 2,
             'turned': 2,
             'corner': 2,
             '``': 8,
             'Oh': 2,
             'ears': 1,
             'whiskers': 1,
             'late': 1,
             "'s": 3,
             'getting': 1,
             '!': 4,
      

In [8]:
# Calculate the score for each sentence
sentence_scores = defaultdict(int)
for sentence in sentences:
    words = word_tokenize(sentence)
    for word in words:
        if word in word_freq:
            sentence_scores[sentence] += word_freq[word]

In [9]:
sentence_scores

defaultdict(int,
            {'\nAlice was not a bit hurt, and she jumped up on to her feet in a moment:\nshe looked up, but it was all dark overhead; before her was another long\npassage, and the White Rabbit was still in sight, hurrying down it.': 272,
             'There was not a moment to be lost: away went Alice like the wind, and\nwas just in time to hear it say, as it turned a corner, "Oh my ears and\nwhiskers, how late it\'s getting!"': 282,
             'She was close behind it when she\nturned the corner, but the Rabbit was no longer to be seen: she found\nherself in a long, low hall, which was lit up by a row of lamps hanging\nfrom the roof.': 206,
             'There were doors all round the hall, but they were all locked; and when\nAlice had been all the way down one side and up the other, trying every\ndoor, she walked sadly down the middle, wondering how she was ever to\nget out again.': 272,
             "Suddenly she came upon a little three-legged table, all made of 

In [10]:
# Select the top N sentences based on their scores
N = 3
summary_sentences = sorted(
    sentence_scores, 
    key=sentence_scores.get, 
    reverse=True)[:N]
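
Summing raw frequencies tends to favour long sentences, and sorting by score discards the order in which the sentences appear in the text. The sketch below is one possible alternative, not the notebook's method: it averages the frequencies of the counted words in each sentence, picks the top N with heapq.nlargest(), and then restores reading order. The function name `rank_sentences` and the example data are assumptions made for illustration.

```python
import heapq


def rank_sentences(tokenized_sentences, word_freq, n):
    """Return the indices of the n best-scoring sentences, in their original order."""
    scores = {}
    for index, tokens in enumerate(tokenized_sentences):
        counted = [word_freq[t] for t in tokens if t in word_freq]
        # Average rather than sum, so long sentences do not win on length alone.
        scores[index] = sum(counted) / len(counted) if counted else 0.0
    best = heapq.nlargest(n, scores, key=scores.get)
    return sorted(best)  # restore reading order


# Hypothetical token lists and frequency table.
example_tokens = [
    ["alice", "found", "key"],
    ["door", "locked"],
    ["alice", "opened", "door", "key", "garden", "flowers", "fountains"],
]
example_freq = {
    "alice": 2, "key": 2, "door": 2, "found": 1, "locked": 1,
    "opened": 1, "garden": 1, "flowers": 1, "fountains": 1,
}

top = rank_sentences(example_tokens, example_freq, n=2)
print(top)
```

With summed scores the long third sentence would rank first; with averaged scores the two short sentences are selected instead.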

In [11]:
# Print the summary
print("Summary:")
for sentence in summary_sentences:
    print(sentence)

Summary:
"No, I'll look first," she said, "and
see whether it's marked '_poison_' or not;" for she had read several
nice little stories about children who had got burnt, and eaten up by
wild beasts, and other unpleasant things, all because they _would_ not
remember the simple rules their friends had taught them: such as, that a
red-hot poker will burn you if you hold it too long; and that, if you
cut your finger _very_ deeply with a knife, it usually bleeds; and she
had never forgotten that, if you drink much from a bottle marked
"poison," it is almost certain to disagree with you, sooner or later.
However, this bottle was _not_ marked "poison," so Alice ventured to
taste it, and finding it very nice (it had, in fact, a sort of mixed
flavour of cherry-tart, custard, pineapple, roast turkey, coffee, and
hot buttered toast,) she very soon finished it off.
There seemed to be no use in waiting by the little door, so she went
back to the table, half hoping she might find another key on it, 

## Abstraction Based

A TF-IDF approach works as follows. We first load the NLTK library together with the English stop words and lemmatizer. We then define the text that we want to summarize and tokenize it into sentences and words using the sent_tokenize() and word_tokenize() functions. We remove stop words and lemmatize the remaining words using the WordNetLemmatizer.

Next, we calculate the frequency distribution of the remaining words using FreqDist(). We use this to calculate the term frequency (TF) scores for each sentence, which we store in a dictionary called tf_scores.

We then calculate the inverse document frequency (IDF) score for each word from the number of sentences it appears in, treating each sentence as a document. Combining TF and IDF gives the TF-IDF score for each sentence, which we store in a dictionary called tfidf_scores.

Finally, we select the N sentences with the highest TF-IDF scores using the nlargest() function from the heapq module and join them together into a summary.

Note that this is just a basic scheme, and there are many ways to improve the accuracy and efficiency of the summarization algorithm, such as using more advanced natural language processing techniques or incorporating user feedback.

The cells in this section take a different route with spaCy: they filter sentences by length, build a pairwise similarity matrix, and rank sentences by how similar they are to all the others. Like TF-IDF scoring, this selects existing sentences rather than generating new text.
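
The TF-IDF scoring described in this section can be sketched with the standard library alone. This is a minimal illustration under stated assumptions rather than the notebook's implementation: a regular expression stands in for word_tokenize(), collections.Counter stands in for FreqDist(), lemmatization is omitted, and the function name `tfidf_summary` and the example sentences are hypothetical.

```python
import heapq
import math
import re
from collections import Counter


def tfidf_summary(sentences, stop_words, n):
    """Join the n sentences with the highest TF-IDF score, kept in original order."""
    # Lowercased word tokens per sentence; a regex stands in for word_tokenize().
    tokenized = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in stop_words]
        for s in sentences
    ]
    # Document frequency: the number of sentences each word occurs in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    total = len(sentences)
    idf = {word: math.log(total / df) for word, df in doc_freq.items()}

    tfidf_scores = {}
    for index, tokens in enumerate(tokenized):
        if not tokens:
            tfidf_scores[index] = 0.0
            continue
        tf = Counter(tokens)
        # Sum of TF * IDF over distinct words; TF is the word's share of the sentence.
        tfidf_scores[index] = sum(
            (count / len(tokens)) * idf[word] for word, count in tf.items()
        )
    best = sorted(heapq.nlargest(n, tfidf_scores, key=tfidf_scores.get))
    return " ".join(sentences[i] for i in best)


# Hypothetical input sentences and stop words.
example_sentences = [
    "Alice found a golden key.",
    "Alice tried the key in the lock.",
    "The garden had bright flowers and cool fountains.",
]
example_stop_words = {"a", "the", "in", "and", "had"}

print(tfidf_summary(example_sentences, example_stop_words, n=1))
```

Words that appear in every sentence receive an IDF of zero under this formula, so sentences built from distinctive vocabulary score highest.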

In [12]:
import spacy

In [13]:
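# en_core_web_sm has no static word vectors, so the similarity() scores used
# later are less reliable than with en_core_web_md or en_core_web_lg
nlp = spacy.load('en_core_web_sm')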


In [14]:
doc = nlp(text)
print(doc)


Alice was not a bit hurt, and she jumped up on to her feet in a moment:
she looked up, but it was all dark overhead; before her was another long
passage, and the White Rabbit was still in sight, hurrying down it.
There was not a moment to be lost: away went Alice like the wind, and
was just in time to hear it say, as it turned a corner, "Oh my ears and
whiskers, how late it's getting!" She was close behind it when she
turned the corner, but the Rabbit was no longer to be seen: she found
herself in a long, low hall, which was lit up by a row of lamps hanging
from the roof.

There were doors all round the hall, but they were all locked; and when
Alice had been all the way down one side and up the other, trying every
door, she walked sadly down the middle, wondering how she was ever to
get out again.

Suddenly she came upon a little three-legged table, all made of solid
glass; there was nothing on it but a tiny golden key, and Alice's first
idea was that this might belong to one of the d

In [15]:
sentence_lengths = [len(sent) for sent in doc.sents]
print(sentence_lengths)

[54, 48, 49, 55, 54, 27, 54, 41, 68, 14, 13, 39, 104, 30, 152, 68]


In the line `sentence_lengths = [len(sent) for sent in doc.sents]`, we calculate the length of each sentence in the document.

Here's how it works:

- doc.sents is a generator that yields each sentence in the document as a Span object.
- We use a list comprehension to iterate over the sentences in doc.sents and apply the len() function to each one to get its length.
- The resulting list contains the length of each sentence in the order that doc.sents yielded them.

So, sentence_lengths is a list of integers representing the length of each sentence in tokens (len() of a Span counts tokens, not characters). We can use this information to filter out sentences that are too short or too long to be included in the summary, which we do later in the code.

In [16]:
# Calculate the median sentence length
# (with an even number of sentences, this index picks the upper of the two middle values)
median_sentence_length = sorted(sentence_lengths)[len(sentence_lengths) // 2]

In [17]:
print(median_sentence_length)

54
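
To make the median calculation concrete, the sketch below compares the indexing expression used above with the functions in Python's statistics module, using the sentence lengths printed earlier. It also applies the half-to-double length filter to show how many sentences survive. The variable names other than `sentence_lengths` are illustrative.

```python
import statistics

# Token counts per sentence, copied from the output printed earlier.
sentence_lengths = [54, 48, 49, 55, 54, 27, 54, 41, 68, 14, 13, 39, 104, 30, 152, 68]

upper_middle = sorted(sentence_lengths)[len(sentence_lengths) // 2]
true_median = statistics.median(sentence_lengths)        # averages the two middle values
high_median = statistics.median_high(sentence_lengths)   # larger of the two middle values

# Lengths strictly between half and double the chosen median.
kept = [n for n in sentence_lengths if upper_middle / 2 < n < upper_middle * 2]

print(upper_middle, true_median, high_median, len(kept))
```

For this list the indexing expression agrees with statistics.median_high() and gives 54, while statistics.median() gives 51.5, so the choice of median shifts the filter bounds slightly.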


In [18]:
# Lower and upper bounds for acceptable sentence lengths (in tokens)
lower_bound = median_sentence_length / 2
upper_bound = median_sentence_length * 2

# Keep only the sentences whose length falls strictly between the bounds
candidate_sentences = []
for sent in doc.sents:
    if lower_bound < len(sent) < upper_bound:
        candidate_sentences.append(sent)

In [19]:
candidate_sentences

[
 Alice was not a bit hurt, and she jumped up on to her feet in a moment:
 she looked up, but it was all dark overhead; before her was another long
 passage, and the White Rabbit was still in sight, hurrying down it.,
 There was not a moment to be lost: away went Alice like the wind, and
 was just in time to hear it say, as it turned a corner, "Oh my ears and
 whiskers, how late it's getting!",
 She was close behind it when she
 turned the corner, but the Rabbit was no longer to be seen: she found
 herself in a long, low hall, which was lit up by a row of lamps hanging
 from the roof.
 ,
 There were doors all round the hall, but they were all locked; and when
 Alice had been all the way down one side and up the other, trying every
 door, she walked sadly down the middle, wondering how she was ever to
 get out again.
 ,
 Suddenly she came upon a little three-legged table, all made of solid
 glass; there was nothing on it but a tiny golden key, and Alice's first
 idea was that this migh

In [20]:
# filter out sentences that are too short or too long
# candidate_sentences = [sent for sent in doc.sents if len(sent) > median_sentence_length / 2 and len(sent) < median_sentence_length * 2]
# candidate_sentences

In [21]:
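# Pairwise similarity between candidate sentences; the diagonal is set to 0
# so that a sentence's similarity to itself does not add to its score
similarity_matrix = []
for i in range(len(candidate_sentences)):
    row = []
    for j in range(len(candidate_sentences)):
        if i == j:
            row.append(0)
        else:
            row.append(candidate_sentences[i].similarity(candidate_sentences[j]))
    similarity_matrix.append(row)

print(similarity_matrix)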

[[0, 0.6548227667808533, 0.753572404384613, 0.8083294630050659, 0.7180418968200684, 0.8074862360954285, 0.6896841526031494, 0.6557883024215698, 0.5392252802848816, 0.7499995231628418, 0.5550321340560913, 0.7000885605812073], [0.6548227667808533, 0, 0.6230484247207642, 0.6471386551856995, 0.5806734561920166, 0.6194282174110413, 0.5468577146530151, 0.7098353505134583, 0.6016184091567993, 0.6485611200332642, 0.6798654198646545, 0.5928280353546143], [0.753572404384613, 0.6230484247207642, 0, 0.7864801287651062, 0.6730203628540039, 0.7235516309738159, 0.777519941329956, 0.5989688038825989, 0.4803265631198883, 0.7750964760780334, 0.4993300437927246, 0.5884562134742737], [0.8083294630050659, 0.6471386551856995, 0.7864801287651062, 0, 0.6109024882316589, 0.7448523640632629, 0.7179693579673767, 0.6400219202041626, 0.5631218552589417, 0.6638158559799194, 0.5871242880821228, 0.594321072101593], [0.7180418968200684, 0.5806734561920166, 0.6730203628540039, 0.6109024882316589, 0, 0.6922146081924438,

  


In [22]:
# Score each sentence by its centrality in the similarity graph:
# the sum of its similarities to every other candidate sentence
centrality_scores = [sum(row) for row in similarity_matrix]

In [23]:
sentence_indices = sorted(
    range(len(candidate_sentences)), 
    key=lambda i: centrality_scores[i], 
    reverse=True
)

In [24]:
centrality_scores

[7.63207072019577,
 6.90467756986618,
 7.279370993375778,
 7.36407744884491,
 6.94361224770546,
 7.324995189905167,
 6.899665087461472,
 6.948745667934418,
 5.798914074897766,
 7.479759007692337,
 5.996812433004379,
 6.720615983009338]

In [25]:
sentence_indices

[0, 9, 3, 5, 2, 7, 4, 1, 6, 11, 10, 8]
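
The ranking step can be illustrated without a spaCy model. spaCy's similarity() is a cosine similarity over vectors; in the sketch below, plain word-count vectors stand in for those vectors, which is an assumption made so the example runs on its own and will not reproduce the scores above. The function names `cosine` and `rank_by_centrality` and the example token lists are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_centrality(tokenized_sentences):
    """Return sentence indices ordered from most to least central."""
    bags = [Counter(tokens) for tokens in tokenized_sentences]
    size = len(bags)
    # Symmetric similarity matrix with a zero diagonal.
    matrix = [
        [0.0 if i == j else cosine(bags[i], bags[j]) for j in range(size)]
        for i in range(size)
    ]
    # Centrality of a sentence: total similarity to all other sentences.
    centrality = [sum(row) for row in matrix]
    return sorted(range(size), key=centrality.__getitem__, reverse=True)


# Hypothetical token lists: the middle sentence shares words with both others.
example = [
    ["alice", "found", "key"],
    ["alice", "tried", "key", "door"],
    ["door", "garden"],
]
order = rank_by_centrality(example)
print(order)
```

The sentence that overlaps with the most other sentences ranks first, which is the intuition behind summing each row of the similarity matrix.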

In [26]:
# Select the top N sentences with the highest centrality scores
N = 2
summary_sentences = [candidate_sentences[i] for i in sentence_indices[:N]]


In [27]:
summary_sentences

[
 Alice was not a bit hurt, and she jumped up on to her feet in a moment:
 she looked up, but it was all dark overhead; before her was another long
 passage, and the White Rabbit was still in sight, hurrying down it.,
 There seemed to be no use in waiting by the little door, so she went
 back to the table, half hoping she might find another key on it, or at
 any rate a book of rules for shutting people up like telescopes: this
 time she found a little bottle on it ("which certainly was not here
 before," said Alice,) and tied round the neck of the bottle was a paper
 label, with the words "DRINK ME" beautifully printed on it in large
 letters.
 ]

In [28]:
# Print summary
summary = ' '.join([sent.text for sent in summary_sentences])
print(summary)


Alice was not a bit hurt, and she jumped up on to her feet in a moment:
she looked up, but it was all dark overhead; before her was another long
passage, and the White Rabbit was still in sight, hurrying down it.
 There seemed to be no use in waiting by the little door, so she went
back to the table, half hoping she might find another key on it, or at
any rate a book of rules for shutting people up like telescopes: this
time she found a little bottle on it ("which certainly was not here
before," said Alice,) and tied round the neck of the bottle was a paper
label, with the words "DRINK ME" beautifully printed on it in large
letters.


