## Import Libraries

In [47]:

from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
import numpy as np
import pandas as pd
import re
import os
import codecs
from sklearn import feature_extraction
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import pairwise_distances_argmin_min
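import nltk
nltk.download('punkt')
# stopwords.words('english') used below also requires the 'stopwords' corpus;
# run nltk.download('stopwords') once if it is not already installed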
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer 


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/hackintosh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Find Similarities Function
This function selects the sentences to extract from the article's text. Each sentence is transformed into a tf-idf vector, and the cosine similarity is computed between every pair of sentences. The sentences with the highest average cosine similarity to the article's sentences are extracted, on the assumption that a sentence resembling many others reflects the article's main content. The number of extracted sentences is the square root of the sentence count, rounded up.

> Purnamasari (2019): cosine similarity models terms as vectors and calculates the cosine value of two tokens. The tokens used can be words, paragraphs, or even the entire source text. The sorted cosine values indicate the similarity among terms; high scores indicate a large degree of similarity.
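
To make the measure concrete, here is a minimal sketch of the cosine calculation in plain Python. The three vectors are hypothetical term weights over an invented three-word vocabulary, not values taken from the article, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Convention for this sketch: an all-zero vector is similar to nothing
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical weights over the vocabulary ["data", "analytics", "weather"]
s1 = [1.0, 1.0, 0.0]  # mentions "data" and "analytics"
s2 = [1.0, 0.0, 0.0]  # mentions only "data"
s3 = [0.0, 0.0, 1.0]  # mentions only "weather"

print(round(cosine(s1, s2), 3))  # 0.707: one shared term
print(cosine(s1, s3))            # 0.0: no shared terms
```

Sentences that share weighted terms score closer to 1, and sentences with no terms in common score 0.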

In [48]:
def find_similarities(text):
    # split the article into sentences
    sentences = sent_tokenize(text)
    # stop words to ignore: English stop words plus punctuation marks
    stops = list(set(stopwords.words('english'))) + list(punctuation)

    # build a tf-idf matrix with one row per sentence, dropping stop words
    vectorizer = TfidfVectorizer(stop_words=stops)
    trsfm = vectorizer.fit_transform(sentences)

    # summary length: square root of the sentence count, rounded up
    # (len(sentences) replaces the DataFrame built with get_feature_names(),
    # which was removed in scikit-learn 1.2)
    num_sentences = len(sentences)
    num_summary_sentences = int(np.ceil(num_sentences ** 0.5))

    # find cosine similarity for all sentence pairs
    similarities = cosine_similarity(trsfm, trsfm)

    # average similarity of each sentence to every sentence (itself included)
    avgs = similarities.mean(axis=1)

    # find index values of the sentences with the highest averages
    top_idx = np.argsort(avgs)[-num_summary_sentences:]

    return top_idx
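
The selection step at the end of `find_similarities` can be checked in isolation. The sketch below uses a made-up array of nine average similarities (hypothetical values, not computed from any article) to show how the square-root rule and `np.argsort` pick the indices.

```python
import numpy as np

# Hypothetical average similarities for a nine-sentence article
avg_similarity = np.array([0.12, 0.40, 0.25, 0.31, 0.08, 0.22, 0.35, 0.10, 0.18])

# Square-root rule: 9 sentences give ceil(sqrt(9)) = 3 summary sentences
num_summary_sentences = int(np.ceil(len(avg_similarity) ** 0.5))

# argsort orders indices from lowest to highest score, so the last
# num_summary_sentences entries are the highest-scoring sentences
top_idx = np.argsort(avg_similarity)[-num_summary_sentences:]

print(num_summary_sentences)     # 3
print(sorted(top_idx.tolist()))  # [1, 3, 6]
```

Sorting the indices afterwards, as `build_summary` does, puts the chosen sentences back in their original order.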

In [49]:

def build_summary(text):
    # indices of the sentences to extract, restored to document order
    selected_idx = sorted(find_similarities(text))

    # re-tokenize so the indices map back to the original sentences
    sent_list = sent_tokenize(text)

    # extract the selected sentences; line breaks and repeated whitespace
    # become single spaces so that words on adjacent lines are not fused
    sents = [' '.join(sent_list[i].split()) for i in selected_idx]

    # join sentences together for final output
    summary = ' '.join(sents)
    return summary

In [51]:

path = '../text_data/summary.txt'
# read the article; the with block closes the file afterwards
with open(path, 'r') as f:
    text = f.read()

print('Cosine similarity: \n')
print(build_summary(text))

Cosine similarity: 

Today organizations require better use of data and analytics to support their business deci- sions. Internet power and business trend changes have provided a broad term for data analytics – Big Data. It helps to provide not only reliable data and good analysis but at the same time to optimize the process and increase added value.
