In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from gensim.models import KeyedVectors
import numpy as np
import spacy
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity
from spacy.lang.en.stop_words import STOP_WORDS

In [3]:
nlp = spacy.load('en_core_web_lg')

At the time of this project, spaCy's English language model (en_core_web_lg) had a problem recognizing stop words, as the following example shows.

In [4]:
doc = nlp('This is a sentence. And the cat jumped over the dog. The cat returned as the prisoner of Azkaban.')
for token in doc:
    print(token.text, token.is_stop)

This False
is False
a False
sentence False
. False
And False
the False
cat False
jumped False
over False
the False
dog False
. False
The False
cat False
returned False
as False
the False
prisoner False
of False
Azkaban False
. False


As the output above shows, the model doesn't recognize 'This', 'is', 'a', 'the', etc. as stop words. We address this by explicitly setting the is_stop flag on each stop word and on its capitalized and upper-case forms.

In [5]:
def add_stop_words(stop_words):
    for stop_word in stop_words:
        for word in (stop_word, stop_word.capitalize(), stop_word.upper()):
            lex = nlp.vocab[word]
            lex.is_stop = True
    

In [6]:
add_stop_words(STOP_WORDS)
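
To make explicit which surface forms the helper above touches, here is a small sketch in plain Python, with no spaCy involved. The names `case_variants` and `flagged_forms` are made up for this illustration. It only mirrors the three forms the helper loops over (as given, capitalized, upper-case), so a mixed-case form such as 'tHe' would not be covered.

```python
def case_variants(word):
    """Return the three surface forms that add_stop_words marks for one word."""
    # Same forms as the notebook helper: as given, Capitalized, UPPER.
    return {word, word.capitalize(), word.upper()}


def flagged_forms(stop_words):
    """Collect every surface form that would get is_stop = True."""
    flagged = set()
    for stop_word in stop_words:
        flagged |= case_variants(stop_word)
    return flagged


forms = flagged_forms(["the", "n't"])
print(sorted(forms))
```

If mixed-case input matters for your articles, lower-casing tokens before the lookup would be one alternative, though that is not what this notebook does.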

Let us rerun the example above and check whether the model now recognizes stop words.

In [7]:
doc = nlp('This is a sentence. And the cat jumped over the dog. The cat returned as the prisoner of Azkaban.')
for token in doc:
    print(token.text, token.is_stop)

This True
is True
a True
sentence False
. False
And True
the True
cat False
jumped False
over True
the True
dog False
. False
The True
cat False
returned False
as True
the True
prisoner False
of True
Azkaban False
. False


Since we will be working with news articles in this project, some additional words that are very common in news writing can also be marked as stop words.

In [8]:
custom_stop_words = ['say', 'says', 'said', 'saying', '\'s', 'n\'t','mr', 'ms', 'mr.', 'ms.', 'people']
add_stop_words(custom_stop_words)

In [9]:
doc2 = nlp('Mr. Washington said that climate change is a serious issue and needs to be addressed.')
for token in doc2:
    print(token.text, token.is_stop)

Mr. True
Washington False
said True
that True
climate False
change False
is True
a True
serious True
issue False
and True
needs False
to True
be True
addressed False
. False


In this project we are going to work on extractive text summarization of news articles. Let us build the functions that will be used in the summarization.

We will be using Google's pre-trained word2vec vectors. They can be downloaded from here: https://code.google.com/archive/p/word2vec/

In [10]:
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [11]:
model.most_similar('brilliant')

[('superb', 0.7657862901687622),
 ('marvelous', 0.7389472723007202),
 ('splendid', 0.7077070474624634),
 ('terrific', 0.6837816834449768),
 ('masterful', 0.6830281615257263),
 ('magnificent', 0.6709308624267578),
 ('dazzling', 0.6706756353378296),
 ('brilliantly', 0.6550824046134949),
 ('brilliance', 0.6550251245498657),
 ('scintillating', 0.6493905782699585)]

In [12]:
model.most_similar('horrendous')

[('terrible', 0.8467271327972412),
 ('horrible', 0.8412425518035889),
 ('dreadful', 0.8110991716384888),
 ('atrocious', 0.8046820163726807),
 ('horrific', 0.7891628742218018),
 ('horrid', 0.7628018856048584),
 ('appalling', 0.7606023550033569),
 ('awful', 0.6970030069351196),
 ('hideous', 0.6831786632537842),
 ('ghastly', 0.6630186438560486)]

In [13]:
cosine_similarity(model['horrendous'].reshape(1,300), model['terrible'].reshape(1,300))[0][0]

0.84672713
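
The score above is the same number that most_similar reported for 'terrible', because both use cosine similarity: the dot product of the two vectors divided by the product of their lengths. As a rough sketch of that formula in plain Python (the function name `cosine` and the toy 2-dimensional vectors are just for illustration, and returning 0.0 for a zero vector is a convention chosen for this sketch):

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # opposite directions
```

Note that the score depends only on direction, not on vector length, which is why the two parallel vectors above count as identical.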

In [14]:
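def read_file(fileName):
    # Read the whole article and flatten it to a single line of text.
    # The encoding is given explicitly because the articles contain curly quotes.
    with open(fileName, 'r', encoding='utf-8') as file:
        text = file.read()

    return text.replace('\n', ' ')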
        

In [25]:
def get_sentences(text):
    # Parse the text once. For every sentence keep the original span and a
    # cleaned copy without stop words, punctuation and whitespace tokens.
    doc = nlp(text)
    original_sentences = []
    cleaned_sentences = []

    for sentence in doc.sents:
        original_sentences.append(sentence)
        words = []
        for token in sentence:
            if not token.is_stop and not token.is_punct and not token.is_space:
                words.append(token.text)
        cleaned_sentences.append(' '.join(words))

    return (original_sentences, cleaned_sentences)

In [23]:
def get_sentence_vector(sentence):
    # The sentence vector is the mean of the word vectors of its words, taken
    # from Google's pre-trained word2vec model. Words missing from the model's
    # vocabulary add nothing to the sum but still count towards num_words.
    words = sentence.split(' ')
    num_words = len(words)
    sentence_vector = np.zeros((300,))
    for word in words:
        try:
            sentence_vector += model[word]
        except KeyError:
            # Out-of-vocabulary word: leave the running sum unchanged.
            pass

    return sentence_vector/num_words
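
To see the averaging on something small, here is a sketch with a made-up 2-dimensional vocabulary instead of the 300-dimensional Google vectors. The names `toy_vectors`, `mean_vector` and the `skip_unknown` switch are hypothetical and exist only in this example. By default, unknown words count in the denominator, as in the function above.

```python
# Made-up 2-dimensional "embeddings", for illustration only.
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}


def mean_vector(sentence, vectors, dim=2, skip_unknown=False):
    """Average the vectors of the words in a sentence."""
    total = [0.0] * dim
    counted = 0
    for word in sentence.split():
        vec = vectors.get(word)
        if vec is None:
            if not skip_unknown:
                counted += 1  # unknown word adds zeros but is still counted
            continue
        total = [t + x for t, x in zip(total, vec)]
        counted += 1
    if counted == 0:
        return total  # nothing to average: return the zero vector
    return [t / counted for t in total]


diluted = mean_vector("cat dog xyz", toy_vectors)
known_only = mean_vector("cat dog xyz", toy_vectors, skip_unknown=True)
```

The two results differ only by a scale factor. Since cosine similarity ignores vector length, counting or skipping unknown words gives the same similarity scores.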

In [17]:
def get_similarity_score(sentence_vector_1, sentence_vector_2):
    # Takes two 300-dimensional sentence vectors and returns their cosine similarity.
    return cosine_similarity(sentence_vector_1.reshape(1,300), sentence_vector_2.reshape(1,300))[0][0]

In [60]:
def create_similarity_matrix(sentences):
    num_of_sentences = len(sentences)
    # Compute each sentence vector once instead of once per pair.
    sentence_vectors = [get_sentence_vector(sentence) for sentence in sentences]
    similarity_matrix = np.zeros((num_of_sentences, num_of_sentences))
    for index1 in range(num_of_sentences):
        for index2 in range(num_of_sentences):
            if index1 == index2:
                # Leave the diagonal at zero: a sentence is not compared with itself.
                continue
            similarity_score = get_similarity_score(sentence_vectors[index1], sentence_vectors[index2])
            similarity_matrix[index1][index2] = similarity_score

    return similarity_matrix
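
Cosine similarity is symmetric, so the matrix can also be filled by visiting each pair only once. The following is a sketch of that idea in plain Python on toy 2-dimensional vectors. The helper names `cosine` and `pairwise_similarity` are illustrative and not part of the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def pairwise_similarity(vectors):
    """Build a symmetric similarity matrix with a zero diagonal."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only
            score = cosine(vectors[i], vectors[j])
            matrix[i][j] = score
            matrix[j][i] = score   # mirror into the lower triangle
    return matrix


toy_matrix = pairwise_similarity([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```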

In [19]:
def get_sentence_rankings(similarity_matrix):
    similarity_graph = nx.from_numpy_array(similarity_matrix)
    return nx.pagerank(similarity_graph)
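
To give an intuition for what the ranking does, here is a rough sketch of weighted PageRank by power iteration in plain Python. It is not the networkx implementation. The function name `pagerank_sketch`, the tolerance and the iteration cap are choices made for this example, and it assumes that all weights are non-negative. A sentence scores high when it is similar to many other sentences that themselves score high.

```python
def pagerank_sketch(weights, damping=0.85, tol=1e-9, max_iter=200):
    """Power-iteration PageRank on a square matrix of non-negative weights."""
    n = len(weights)
    if n == 0:
        return []
    out_strength = [sum(row) for row in weights]
    ranks = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(max_iter):
        # Rank held by nodes without outgoing weight is spread uniformly.
        dangling = sum(ranks[i] for i in range(n) if out_strength[i] == 0)
        new_ranks = []
        for j in range(n):
            incoming = sum(
                ranks[i] * weights[i][j] / out_strength[i]
                for i in range(n)
                if out_strength[i] != 0
            )
            new_ranks.append((1 - damping) / n + damping * (incoming + dangling / n))
        change = sum(abs(a - b) for a, b in zip(new_ranks, ranks))
        ranks = new_ranks
        if change < tol:
            break
    return ranks


# Toy graph: node 0 is connected to nodes 1 and 2, which are not connected to each other.
star = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0]]
star_ranks = pagerank_sketch(star)
```

In the toy graph the central node ends up with the highest score, which is the behaviour we rely on when picking summary sentences.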

In [44]:
def get_text_summarization(fileName, numOfSentencesInSummary=5):
    text = read_file(fileName)
    (original_sentences, cleaned_sentences) = get_sentences(text)
    similarity_matrix = create_similarity_matrix(cleaned_sentences)
    sentence_rankings = get_sentence_rankings(similarity_matrix)
    list_of_sentences_and_scores = sorted([(sentence, sentence_rankings[index]) for index, sentence in enumerate(original_sentences)], key= lambda sent:sent[1], reverse=True)
    num_of_sentences_to_display = min(len(original_sentences), numOfSentencesInSummary)
    sorted_sentences = [sentence[0] for sentence in list_of_sentences_and_scores]
    sorted_sentences_for_user = sorted_sentences[0:num_of_sentences_to_display]
    sorted_sentences_as_paragraph = "".join([str(sentence) for sentence in sorted_sentences_for_user])
    return "Text Summary: " + '\n' + sorted_sentences_as_paragraph
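
The function above returns the selected sentences in order of score. A possible variant, sketched below in plain Python, picks the top k sentences and then puts them back in the order in which they appear in the article, which can make the summary easier to follow. The function name `pick_summary` and the `keep_document_order` flag are hypothetical, and the sentences and scores are toy data.

```python
def pick_summary(sentences, scores, k=5, keep_document_order=True):
    """Select the k best-scoring sentences and join them into one string."""
    k = min(k, len(sentences))
    # Indices sorted from highest to lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:k]
    if keep_document_order:
        chosen = sorted(chosen)  # restore the original article order
    # Joining with a space keeps adjacent sentences from running together.
    return " ".join(sentences[i] for i in chosen)


toy_sentences = ["A.", "B.", "C."]
toy_scores = [0.3, 0.2, 0.5]
by_position = pick_summary(toy_sentences, toy_scores, k=2)
by_score = pick_summary(toy_sentences, toy_scores, k=2, keep_document_order=False)
```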

Let us read one of the Atlantic articles that we obtained by web scraping.

In [67]:
text = read_file('atlantic/science/15.txt')
print(text)

The Massive Mystery of Saturn’s Rings Astronomers have produced the best measure yet of the planet’s signature bands. Saturn has confounded scientists since Galileo, who found that the planet was “not alone,” as he put it. “I do not know what to say in a case so surprising, so unlooked-for, and so novel,” he wrote. He didn’t realize it then, but he had seen the planet’s rings, a cosmic garland of icy material. From Earth, the rings look solid, but up close, they are translucent bands made of countless particles, mostly ice, some rock. Some are no larger than a grain of sugar, others as enormous as mountains. Around and around they go, held in place by a delicate balance between Saturn’s gravity and their orbiting speed, which pulls them out toward space. Scientists got their best look at the planet nearly 400 years after Galileo’s discovery, using a NASA spacecraft called Cassini. Cassini spent 13 years looping around Saturn until, in September 2017, it ran out of fuel and engineers de

Now let us see what the summarized version looks like:

In [68]:
print(get_text_summarization('atlantic/science/15.txt'))

Text Summary: 
Estimates of the mass of the rings have varied wildly for decades, starting with the twin Voyager spacecraft, which whizzed by Saturn in the late 1970s and early 1980s on their way through the solar system.You got a combined mass of Saturn plus the rings, and there was really no way to separate it out,” says Linda Spilker, the lead scientist for the Cassini mission, who was not involved in the latest research.A primordial origin story would have been a very convenient one: The young solar system was a chaotic mess of flying debris, and it would have been possible for Saturn to lasso some of it into a lasting orbit.One believed that the ring system formed when Saturn did, 4.6 billion years ago, when the solar system as we know it emerged from swirling clouds of dust left over from the fiery birth of the sun.Buratti is convinced that someday, with telescope technology powerful enough, we’ll make out the curves of the rings around a distant planet, in another solar system.


Let us read another Atlantic article that we obtained by web scraping.

In [62]:
text2 = read_file('atlantic/science/7.txt')
print(text2)

The Unprecedented Surge in Fear About Climate Change More Americans than ever are worried about climate change, but they’re not willing to pay much to stop it. A surging number of Americans understand that climate change is happening and believe that it could harm their family and the country, according to a new poll from Yale and George Mason University. But at the same time, Americans are not any more willing to pay money to fight climate change than they were three years ago, says another new poll, conducted by the Associated Press and the University of Chicago. The polls suggest that public opinion about climate change is in a state of upheaval. Even as President Donald Trump has cast doubt on climate change, most Americans have rejected his position. Record numbers of Americans describe climate change as a real and present danger. Nearly a quarter of the country says they already see its tidings in their day-to-day life, saying “personal observations of weather” helped convince th

Now let us read the summarized version of the article, this time asking for the top 6 sentences.

In [63]:
print(get_text_summarization('atlantic/science/7.txt', 6))

Text Summary: 
A surging number of Americans understand that climate change is happening and believe that it could harm their family and the country, according to a new poll from Yale and George Mason University.More Americans than ever—29 percent—also say they are “very worried” about climate change, an eight-point increase.It reflects a large shift, as an outright majority of Americans—a record-high number—believe that climate change could endanger their loved ones.The AP poll found Americans were least supportive of this plan: Three out of four said they would oppose a carbon tax that “eases climate-related regulation,” and only half liked the idea of a monthly rebate.The AP survey found that seven out of 10 of Americans understand climate change is happening.But at the same time, Americans are not any more willing to pay money to fight climate change than they were three years ago, says another new poll, conducted by the Associated Press and the University of Chicago.
