
## **Text Summarization**

One application of text analysis and NLP is text summarization: the technique of condensing a long piece of text into a short one. The goal is a cohesive, fluent summary that includes only the main points of the document.

Text summarization can be divided into two categories - Extractive Summarization and Abstractive Summarization.

**Extractive Summarization** extracts several parts of a text, such as phrases and sentences, and stacks them together to form a summary. The key step in this method is therefore identifying the most important phrases or sentences in the original text.

**Abstractive Summarization** relies on generating new sentences from the original text. The generated sentences might not appear in the original text at all. These methods usually rely on advanced NLP techniques.

In this project we use both summarization methods. We show three examples of the extractive technique: word-frequency scoring with the spaCy library, a TF-IDF vectorizer implementation, and automatic text summarization with the gensim library. To demonstrate abstractive techniques we use the Hugging Face Transformers library.


In [None]:
pip install transformers

In [None]:
# word frequency 
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

# tfidf 
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# gensim (the summarization module was removed in gensim 4.0, so this requires gensim < 4.0)
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

# transformers
from transformers import pipeline
from transformers import BartTokenizer, BartForConditionalGeneration

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
cd '/content/drive/My Drive/moje pliki/data'

/content/drive/My Drive/moje pliki/data


Reading data:

In [None]:
with open('text1.txt') as file:
  text = file.read()
  print(text)

Scientists say they have made a major step forward in efforts to store information as molecules of DNA, which are more compact and long-lasting than other options.
The magnetic hard drives we currently use to store computer data can take up lots of space.
And they have to be replaced over time.
Using life's preferred storage medium to back up our precious data would allow vast amounts of information to be archived in tiny molecules.
The data would also last thousands of years, according to scientists.
A team in Atlanta, US, has now developed a chip that they say could improve on existing forms of DNA storage by a factor of 100.
"The density of features on our new chip is [approximately] 100x higher than current commercial devices," Nicholas Guise, senior research scientist at Georgia Tech Research Institute (GTRI), told BBC News.
"So once we add all the control electronics - which is what we're doing over the next year of the program - we expect something like a 100x improvement over e

In [None]:
len(text)

4601

### **Extractive Summarization**

**Text summary based on word frequency with spaCy library**

One method of text summarization is to count word frequencies and normalize them by dividing by the maximum frequency. We then score each sentence by the frequencies of its words and take the highest-scoring sentences as the summary.

The algorithm consists of a few steps:
- text cleaning: removing stopwords and converting words to lower case;
- tokenizing each sentence into words;
- calculating word frequencies;
- sentence tokenization;
- creating the summary.
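
The steps above can be sketched end to end in plain Python, without spaCy. This is only an illustration under stated assumptions: the regex-based sentence splitter, the tiny `TOY_STOPWORDS` set and the `frequency_summary` helper are hypothetical stand-ins, not part of the notebook.

```python
import re
from collections import Counter
from heapq import nlargest

# A tiny illustrative stopword list (an assumption; spaCy's STOP_WORDS is far larger).
TOY_STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "on"}


def frequency_summary(text, ratio=0.3, stopwords=TOY_STOPWORDS):
    """Return the top `ratio` share of sentences, ranked by summed word weights."""
    # Sentence tokenization: naive split after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def content_words(sentence):
        # Text cleaning and word tokenization: lower-case, drop stopwords.
        return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in stopwords]

    # Word frequencies, normalized by the count of the most frequent word.
    counts = Counter(w for s in sentences for w in content_words(s))
    if not counts:
        return ""
    max_count = max(counts.values())
    weights = {w: c / max_count for w, c in counts.items()}

    # Sentence score = sum of the weights of its content words.
    scores = {i: sum(weights[w] for w in content_words(s)) for i, s in enumerate(sentences)}

    # Keep the best-scoring sentences (at least one), then restore document order.
    n_keep = max(1, int(len(sentences) * ratio))
    best = sorted(nlargest(n_keep, scores, key=scores.get))
    return " ".join(sentences[i] for i in best)


demo = ("DNA can store data. DNA storage keeps data for thousands of years. "
        "Hard drives are bulky. The weather was nice.")
print(frequency_summary(demo, ratio=0.5))
# DNA can store data. DNA storage keeps data for thousands of years.
```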


First we build an NLP object and a list of stopwords, and tokenize the text.

In [None]:
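# load the small English model (the 'en' shortcut no longer works in spaCy 3)
nlp = spacy.load('en_core_web_sm')

# run the pipeline on the text to get a Doc object
doc = nlp(text)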

In [None]:
stopwords = list(STOP_WORDS)

In [None]:
my_tokens = [token.text for token in doc]

*Word frequency*

Now we calculate the word frequency. It is a dictionary of words and their counts. It counts how many times each word appears in the document after removing stopwords.

Calculating word frequencies from the text after removing stopwords:

In [None]:
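word_frequencies = {}

for word in doc:
  # lower-case every token so that "DNA" and "dna" share one key,
  # matching the lower-cased lookup used later for sentence scoring
  token = word.text.lower()
  # skip stopwords, punctuation and whitespace tokens
  if token not in stopwords and token not in punctuation and not word.is_space:
    word_frequencies[token] = word_frequencies.get(token, 0) + 1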

*Maximum Word Frequency*

Next we compute weighted frequencies: each word's count is divided by the count of the most frequent word, so every value falls between 0 and 1. Because sentence scores are later computed as sums of these weights, longer sentences tend to score higher than shorter ones.

We calculate the maximum frequency and divide every frequency by it to get normalized word frequencies:

In [None]:
max_frequency = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency

*Word Frequency Distribution*

In this step we score each sentence by summing the normalized frequencies of the words it contains. The highest-scoring sentences are treated as the most important ones.

First we create sentence tokens:

In [None]:
sent_tokens = [sent for sent in doc.sents]

In [None]:
sentence_scores = {}  

for sent in sent_tokens:
    for word in sent:
      if word.text.lower() in word_frequencies.keys():
        if sent not in sentence_scores.keys():
          sentence_scores[sent] = word_frequencies[word.text.lower()]
        else:
          sentence_scores[sent] += word_frequencies[word.text.lower()]

In [None]:
print(sentence_scores)

Then we select the 30% of sentences with the highest scores, using the `nlargest` function imported from the `heapq` module.

In [None]:
length = int(len(sentence_scores) *0.3)
print(length)

11


In [None]:
summary = nlargest(length, sentence_scores, key=sentence_scores.get)
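
As a small illustration of this selection step, the sketch below applies `nlargest` to a hypothetical dictionary of scores keyed by sentence position (the values are made up for the example). It also shows that `nlargest` returns items ordered by score, so the selected sentences are not in document order unless they are sorted again.

```python
from heapq import nlargest

# Hypothetical scores keyed by sentence position (illustrative values only).
sentence_scores = {0: 1.2, 1: 2.9, 2: 0.7, 3: 3.4, 4: 2.1}

# 30% of 5 sentences is 1.5, which int() truncates to 1; max() guards against 0.
length = max(1, int(len(sentence_scores) * 0.3))
print(length)             # 1

# Iterating a dict yields its keys; key=sentence_scores.get ranks them by score.
top_two = nlargest(2, sentence_scores, key=sentence_scores.get)
print(top_two)            # [3, 1]  (highest score first)

# Sorting the selected positions restores the order of the original text.
in_document_order = sorted(top_two)
print(in_document_order)  # [1, 3]
```

Whether to keep the score order or restore the document order is a design choice; document order usually reads more naturally.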

Finally we get the summary of text:

In [None]:
# each selected item is a spaCy sentence span; take its text
final_summary = [sent.text for sent in summary]
summary = ''.join(final_summary)
print(summary)

With DNA, however, "as long as you keep the temperature low enough, the data will survive for thousands of years, so the cost of ownership drops to almost zero", Dr Guise explained.
"The density of features on our new chip is [approximately] 100x higher than current commercial devices," Nicholas Guise, senior research scientist at Georgia Tech Research Institute (GTRI), told BBC News.
Because of the time required for reading the sequence, the technique would be most useful for information that must be kept available for a long time, but accessed infrequently.
If we can get the cost of this technology competitive with the cost of writing data magnetically, the cost of storing and maintaining information in DNA over many years should be lower."
The high cost of DNA storage has so far restricted the technology to "boutique customers", such as those seeking to archive information in time capsules.
A team in Atlanta, US, has now developed a chip that they say could improve on existing forms

Below is a function that creates the summary based on word frequency with the spaCy library.

In [None]:
def get_summary(text):
  doc = nlp(text)

  # count lower-cased tokens, skipping stopwords, punctuation and whitespace
  word_frequencies = {}
  for word in doc:
    token = word.text.lower()
    if token not in stopwords and token not in punctuation and not word.is_space:
      word_frequencies[token] = word_frequencies.get(token, 0) + 1

  # normalize by the count of the most frequent word
  max_frequency = max(word_frequencies.values())
  for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency

  # score each sentence by the sum of its word frequencies
  sent_tokens = [sent for sent in doc.sents]
  sentence_scores = {}
  for sent in sent_tokens:
    for word in sent:
      if word.text.lower() in word_frequencies.keys():
        if sent not in sentence_scores.keys():
          sentence_scores[sent] = word_frequencies[word.text.lower()]
        else:
          sentence_scores[sent] += word_frequencies[word.text.lower()]

  # keep the top 30% of sentences
  length = int(len(sentence_scores) *0.3)
  summary = nlargest(length, sentence_scores, key=sentence_scores.get)
  final_summary = [sent.text for sent in summary]
  summary = ''.join(final_summary)
  return summary
                    


**Text Summarization using TF-IDF vectorizer**


In this approach we use TF-IDF for summarization: the TF-IDF vectorizer gives each sentence an importance score, and we return only the most important sentences in the text.

TF-IDF (term frequency-inverse document frequency) is a weighting scheme that converts text into a numeric vector. Its value is the product of two terms: term frequency and inverse document frequency. Term frequency is the number of times a word occurs in a sentence divided by the total number of words in that sentence. Inverse document frequency is the logarithm of the total number of sentences divided by the number of sentences containing the given word.
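
To make the definition concrete, here is a minimal worked example of the textbook formula described above. The toy sentences and the helper names `tf`, `idf` and `tf_idf` are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

# Toy "document": each sentence is already tokenized and stopword-free (an assumption).
sentences = [
    ["dna", "stores", "data"],
    ["dna", "lasts", "thousands", "years"],
    ["magnetic", "tape", "stores", "data"],
]


def tf(word, sentence):
    # Occurrences of the word divided by the number of words in the sentence.
    return sentence.count(word) / len(sentence)


def idf(word, all_sentences):
    # log(number of sentences / number of sentences containing the word).
    # Only call this for words that occur somewhere, otherwise it divides by zero.
    containing = sum(1 for s in all_sentences if word in s)
    return math.log(len(all_sentences) / containing)


def tf_idf(word, sentence, all_sentences):
    return tf(word, sentence) * idf(word, all_sentences)


# "dna" appears in 2 of 3 sentences, "lasts" in only 1, so "lasts" gets the larger weight.
print(round(tf_idf("dna", sentences[1], sentences), 4))    # 0.1014
print(round(tf_idf("lasts", sentences[1], sentences), 4))  # 0.2747
```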

First we divide the text into sentences, using the sentence tokenizer from the nltk library.

In [None]:
tokens = sent_tokenize(text)

Now we create a TF-IDF vectorizer and fit it on the sentences produced by tokenization. Each row of the resulting matrix holds the TF-IDF weights of one sentence, and we will use these rows to score the sentences.

In [None]:
vectorizer = TfidfVectorizer(stop_words='english')
tf_idf = vectorizer.fit_transform(tokens) 

We use the TF-IDF weight of each word to calculate the sentence score, defined here as the mean of the non-zero weights in the sentence's row.

Calculating sentence score:

In [None]:
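sent_score = []
for row in tf_idf:
  # mean of the non-zero TF-IDF weights in this sentence's row;
  # a sentence made up only of stopwords has no weights and scores 0
  score = row.sum()/len(row.data) if len(row.data) else 0.0
  sent_score.append(score)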

In [None]:
len(sent_score)

36

In [None]:
len(tokens)

36

We calculate the average of the sentence scores:

In [None]:
avg_sent = sum(sent_score)/len(sent_score) 
avg_sent

0.32622956634369615

Finally we get the summary by keeping the sentences that score above the average:

In [None]:
# keep the sentences whose score is above the average, in document order
summary1 = [sent for sent, score in zip(tokens, sent_score) if score > avg_sent]
# join with a space so that the sentences do not run together
output_text = ' '.join(summary1)

In [None]:
output_text

'And they have to be replaced over time. The data would also last thousands of years, according to scientists. They are: adenine, cytosine, guanine and thymine. Alternatively, a one and zero could be mapped to just two of the four bases. This will allow larger amounts of DNA to be grown in a shorter space of time. Because it\'s a prototype, not all the microwells are wired up yet. However, Dr Guise explained, when everything\'s up and running, that will change. But the new technology could write 100 times more DNA data in the same amount of time. The team at GTRI believes their work could help reshape the cost curve. This type of data is currently stored on magnetic tapes which should be replaced around every 10 years. "It only costs much money to write the DNA once at the beginning and then to read the DNA at the end. DNA storage has a higher error rate than conventional hard drive storage.'

The complete code is given as follows:

In [None]:
def create_summary(text):
  tokens = sent_tokenize(text)

  vectorizer = TfidfVectorizer(stop_words='english')
  tf_idf = vectorizer.fit_transform(tokens)

  # mean of the non-zero TF-IDF weights per sentence (0 if there are none)
  sent_score = []
  for row in tf_idf:
    score = row.sum()/len(row.data) if len(row.data) else 0.0
    sent_score.append(score)

  avg_sent = sum(sent_score)/len(sent_score)

  # keep above-average sentences in document order, separated by spaces
  summary1 = [sent for sent, score in zip(tokens, sent_score) if score > avg_sent]
  return ' '.join(summary1)

**Text summary using Gensim library**

Another way to get a text summary is to use a library for automatic text summarization, such as gensim. The library provides automatic summarization based on the TextRank algorithm.

We import the appropriate function from the library and pass it the text we want to summarize. The word_count parameter specifies the number of words the summary should contain. We set word_count=100.


In [None]:
summ_text = summarize(text, word_count = 100) 

In [None]:
print(summ_text)

There are different potential ways to store this information in DNA - for example, a zero in binary code could be represented by the bases adenine or cytosine and a one might be represented by guanine or thymine.
The high cost of DNA storage has so far restricted the technology to "boutique customers", such as those seeking to archive information in time capsules.
If we can get the cost of this technology competitive with the cost of writing data magnetically, the cost of storing and maintaining information in DNA over many years should be lower."
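
For readers curious about the idea behind TextRank, the sketch below ranks sentences with a weighted PageRank over a sentence-similarity graph, using the word-overlap similarity from the original TextRank formulation. It is not gensim's implementation: the sentence splitter, the tokenization, and the helper names `split_sentences`, `similarity` and `textrank` are simplifying assumptions.

```python
import math
import re


def split_sentences(text):
    # Naive sentence splitter (an assumption; not gensim's tokenizer).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    # Number of shared words, normalized by the log-lengths of both sentences.
    if not words_a or not words_b:
        return 0.0
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if denominator == 0:
        return 0.0
    return len(set(words_a) & set(words_b)) / denominator


def textrank(text, top_n=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    words = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    n = len(sentences)
    if n == 0:
        return []

    # Weighted, undirected graph: weights[i][j] is the similarity of sentences i and j.
    weights = [[similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_sums = [sum(row) for row in weights]

    # Power iteration of the weighted PageRank update.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]

    # Pick the top_n sentences by score and return them in document order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(ranked)]


demo = ("DNA can store huge amounts of data. "
        "Storing data in DNA could last thousands of years. "
        "The cat sat on the mat. "
        "DNA data storage is still expensive.")
print(textrank(demo, top_n=2))
```

In this toy text the unrelated sentence about the cat shares no words with the others, so it receives only the baseline score and is never selected.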


### **Abstractive Text Summarization** 


**Text Summarization using Hugging Face Transformer**

The Hugging Face Transformers library supports abstractive summarization, in which the model composes new sentences, much as a person would, and produces a separate text that is shorter than the original.

To get the summary we use the `pipeline` function from the transformers library.

In [None]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


In [None]:
text_sum = summarizer(text, min_length=10, max_length=300)

In [None]:
# the pipeline returns a list of dicts; keep only the generated text
summary_text = ' '.join(item['summary_text'].strip() for item in text_sum)

Final summary:

In [None]:
summary_text

'Scientists say they have made a major step forward in efforts to store information as molecules of DNA, which are more compact and long-lasting than other options . A team in Atlanta, US, has now developed a chip that they say could improve on existing forms of DNA storage by a factor of 100 . The technology works by growing unique strands of DNA one building block at a time .'

**Text Summarization using BART model**

BART (Bidirectional and Auto-Regressive Transformers) is a transformer model that is commonly used for sequence-to-sequence problems. Its architecture combines a bidirectional encoder (like BERT) with a left-to-right autoregressive decoder (like GPT), so loosely speaking BART = BERT + GPT. BART is suitable for summarization, machine translation, question answering, and similar tasks.

First we load the pre-trained BART tokenizer and the pre-trained BART model for summarization.

In [None]:
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

Then we encode our text with the BART tokenizer and generate the summary with the BART model.

In [None]:
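# truncation=True makes max_length take effect: the input is cut to the first 512 tokens
# (the variable is named input_ids to avoid shadowing the built-in input())
input_ids = tokenizer.encode(text, return_tensors="pt", max_length=512, truncation=True)

summary_ids = model.generate(input_ids, max_length=160, min_length=12, length_penalty=1.0, num_beams=4, early_stopping=True)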

The output is a tensor of token ids. To get text out of it, we decode it with the same BART tokenizer.


In [None]:
output_summ = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]

Final summary:

In [None]:
print(output_summ)

['Scientists say they have made a major step forward in efforts to store information as molecules of DNA. The technology works by growing unique strands of DNA one building block at a time. The structures on the chip used to grow the DNA are called microwells.']
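
The `max_length=512` argument limits the encoder input to the first 512 tokens, so later parts of a long article never reach the model. One possible workaround, not used in this notebook, is to split the text into sentence-aligned chunks and summarize each chunk separately. The helper below is a hypothetical sketch: `chunk_by_sentences` and its `max_words` budget are assumptions, and a word count is only a rough proxy for the tokenizer's token count.

```python
import re


def chunk_by_sentences(text, max_words=400):
    """Group whole sentences into chunks of at most `max_words` words each."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five six seven. Eight nine. Ten."
print(chunk_by_sentences(demo, max_words=5))
# ['One two three.', 'Four five six seven.', 'Eight nine. Ten.']
```

A single sentence longer than `max_words` still forms its own oversized chunk, so truncation remains a sensible safety net when each chunk is encoded.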


### Summary 

From our analysis we can see that we achieved the best summary with abstractive methods such as the BART model. The short summary it produced reflects the content of the article best, and reading it we can easily understand what the article describes.
The extractive methods we studied did not give results as good as the abstractive ones. The first method, based on word frequency, produced a fairly long summary of the article, but one that is still understandable. The TF-IDF vectorizer gave similar results. We also used an automatic text summarization tool, the gensim library, and with a few simple steps obtained a summary of quite good quality.
