## News summarization using nltk and custom functions

In [1]:
import numpy as np
import nltk
import re

In [9]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
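# Helper functions for preprocessing
def casefolding(sentence):
    # Lowercase the sentence so that word counts are case-insensitive
    return sentence.lower()

def cleaning(sentence):
    # Drop typographic apostrophes, then replace every character outside a-z with a space
    # (expects lowercased input, because uppercase letters would be replaced too)
    return re.sub(r'[^a-z]', ' ', re.sub("’", '', sentence))

def tokenization(sentence):
    # Split on whitespace into a list of word tokens
    return sentence.split()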
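
To make the order of the three helpers concrete, the sketch below runs them on a single sentence. The helper definitions are copied from the cell above; the sample string is made up for illustration and is not taken from the article.

```python
import re

def casefolding(sentence):
    return sentence.lower()

def cleaning(sentence):
    return re.sub(r'[^a-z]', ' ', re.sub("’", '', sentence))

def tokenization(sentence):
    return sentence.split()

# Hypothetical sample sentence, not taken from the article
sample = "It’s NOT about IQs!"
tokens = tokenization(cleaning(casefolding(sample)))
print(tokens)  # ['its', 'not', 'about', 'iqs']
```

Casefolding has to run first: `cleaning` keeps only the characters a-z, so any uppercase letter that reaches it is replaced by a space.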

In [3]:
# Helper function to transform the text into a collection of sentences
def sentence_split(paragraph):
    return nltk.sent_tokenize(paragraph)

In [4]:
# Count how often each word occurs in the document
def word_freq(data):
    # Flatten the tokenized sentences into one list of words
    w = []
    for sentence in data:
        for word in sentence:
            w.append(word)
    # Map each distinct word to its number of occurrences
    res = {}
    for word in set(w):
        res[word] = w.count(word)
    return res
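
The function above calls `list.count` once per distinct word, which rescans the whole word list each time. A possible alternative, sketched here with the standard library's `collections.Counter`, does the counting in one pass. The name `word_freq_counter` and the toy input are assumptions made for illustration.

```python
from collections import Counter
from itertools import chain

def word_freq_counter(data):
    # Flatten the list of token lists and count every word in a single pass
    return dict(Counter(chain.from_iterable(data)))

# Toy tokenized document (made up for illustration)
toy = [['the', 'times', 'the'], ['times', 'editors']]
freq = word_freq_counter(toy)
print(freq)  # {'the': 2, 'times': 2, 'editors': 1}
```

Converting back to a plain `dict` keeps the return type the same as in the original function.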

In [6]:
# Next, calculate the weight of each sentence: the sum of the frequencies of its words
# Returns one weight per sentence; relies on the global `wordfreq` dict computed in a later cell
def sentence_weight(data):
    weights = []
    for words in data:
        temp = 0
        for word in words:
            temp += wordfreq[word]
        weights.append(temp)
    return weights
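
As a small worked example of this scoring rule, the sketch below computes the weights for a three-sentence toy document. Unlike the cell above, this variant takes the frequency table as an explicit `freq` argument instead of reading a global; that parameter and the toy data are assumptions for illustration.

```python
def sentence_weight(data, freq):
    # Sum the document-wide frequency of every word in each sentence
    return [sum(freq[word] for word in words) for words in data]

# Toy tokenized document and its word frequencies (made up for illustration)
data = [['the', 'cat', 'sat'], ['the', 'cat'], ['dog']]
freq = {'the': 2, 'cat': 2, 'sat': 1, 'dog': 1}

weights = sentence_weight(data, freq)
print(weights)  # [5, 4, 1]
```

The first sentence scores 2 + 2 + 1 = 5, the second 2 + 2 = 4, and the third 1.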

In [7]:
news = """
IIn a time in which even a virus has become the subject of partisan disinformation and myth-making, it’s essential that mainstream journalistic institutions reaffirm their bona fides as disinterested purveyors of fact and honest brokers of controversy. In this regard, a recent course of action by the New York Times is cause for alarm.On December 27, 2019, the Times published a column by their opinion journalist Bret Stephens, “The Secrets of Jewish Genius,” and the ensuing controversy led to an extraordinary response by the editors.Stephens took up the question of why Ashkenazi Jews are statistically overrepresented in intellectual and creative fields. This disparity has been documented for many years, such as in the 1995 book Jews and the New American Scene by the eminent sociologists Seymour Martin Lipset and Earl Raab. In his Times column, Stephens cited statistics from a more recent peer-reviewed academic paper, coauthored by an elected member of the National Academy of Sciences. Though the authors of that paper advanced a genetic hypothesis for the overrepresentation, arguing that Ashkenazi Jews have the highest average IQ of any ethnic group because of inherited traits, Stephens did not take up that argument. In fact, his essay quickly set it aside and argued that the real roots of Jewish achievement are culturally and historically engendered habits of mind.Nonetheless, the column incited a furious and ad hominem response. 
Detractors discovered that one of the authors of the paper Stephens had cited went on to express racist views, and falsely claimed that Stephens himself had advanced ideas that were “genetic” (he did not), “racist” (he made no remarks about any race) and “eugenicist” (alluding to the discredited political movement to improve the human species by selective breeding, which was not remotely related to anything Stephens wrote).It would have been appropriate for the New York Times to acknowledge the controversy, to publish one or more replies, and to allow Stephens and his critics to clarify the issues. Instead, the editors deleted parts of the column—not because anything in it had been shown to be factually incorrect but because it had become controversial.Worse, the explanation for the deletions in the Editors’ Note was not accurate about the edits the paper made after publication. The editors did not just remove “reference to the study.” They expurgated the article’s original subtitle (which explicitly stated “It’s not about having higher IQs”), two mentions of Jewish IQs, and a list of statistics about Jewish accomplishment: “During the 20th century, [Ashkenazi Jews] made up about 3 percent of the U.S. population but won 27 percent of the U.S. Nobel science prizes and 25 percent of the ACM Turing awards. They account for more than half of world chess champions.” These statistics about Jewish accomplishments were quoted directly from the study, but they originated in other studies. So, even if the Times editors wanted to disavow the paper Stephens referenced, the newspaper could have replaced the passage with quotes from the original sources.The Times’ handling of this column sets three pernicious precedents for American journalism.First, while we cannot know what drove the editors’ decision, the outward appearance is that they surrendered to an outrage mob, in the process giving an imprimatur of legitimacy to the false and ad hominem attacks against Stephens. 
The Editors’ Note explains that Stephens “was not endorsing the study or its authors’ views,” and that it was not his intent to “leave an impression with many readers that [he] was arguing that Jews are genetically superior.” The combination of the explanation and the post-publication revision implied that such an impression was reasonable. It was not.Unless the Times reverses course, we can expect to see more such mobs, more retractions, and also preemptive rejections from editors fearful of having to make such retractions. Newspapers risk forfeiting decisions to air controversial or unorthodox ideas to outrage mobs, which are driven by the passions of their most ideological police rather than the health of the intellectual commons.Second, the Times redacted a published essay based on concerns about retroactive moral pollution, not about accuracy. While it is true that an author of the paper Stephens mentioned, the late anthropologist Henry Harpending, made some deplorable racist remarks, that does not mean that every point in every paper he ever coauthored must be deemed radioactive. Facts and arguments must be evaluated on their content. Will the Times and other newspapers now monitor the speech of scientists and scholars and censor articles that cite any of them who, years later, say something offensive? Will it crowdsource that job to Twitter and then redact its online editions whenever anyone quoted in the Times is later “canceled”?Third, for the Times to “disappear” passages of a published article into an inaccessible memory hole is an Orwellian act that, thanks to the newspaper’s actions, might now be seen as acceptable journalistic practice. It is all the worse when the editors’ published account of what they deleted is itself inaccurate. 
This does a disservice to readers, historians and journalists, who are left unable to determine for themselves what the controversy was about, and to Stephens, who is left unable to defend himself against readers’ worst suspicions.We strongly oppose racism, anti-Semitism and all forms of bigotry. And we believe that the best means of combating them is the open exchange of ideas. The Times’ retroactive censoring of passages of a published article appears to endorse a different view. And in doing so, it hands ammunition to the cynics and obfuscators who claim that every news source is merely an organ for its political coalition."""

In [10]:
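# Process the news by using the helper functions defined above
sentence_list = sentence_split(news)
data = [tokenization(cleaning(casefolding(sentence))) for sentence in sentence_list]
# Drop sentences that produced no tokens, keeping sentence_list aligned with data
# so that positions in data still index the right sentence
pairs = [(s, d) for s, d in zip(sentence_list, data) if d]
sentence_list = [s for s, _ in pairs]
data = [d for _, d in pairs]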

In [17]:
# Count how often each word occurs in the document
wordfreq = word_freq(data)
wordfreq

{'a': 13,
 'about': 9,
 'academic': 1,
 'academy': 1,
 'acceptable': 1,
 'accomplishment': 1,
 'accomplishments': 1,
 'account': 2,
 'accuracy': 1,
 'accurate': 1,
 'achievement': 1,
 'acknowledge': 1,
 'acm': 1,
 'act': 1,
 'action': 1,
 'actions': 1,
 'ad': 2,
 'advanced': 2,
 'after': 1,
 'against': 2,
 'air': 1,
 'alarm': 1,
 'all': 2,
 'allow': 1,
 'alluding': 1,
 'also': 1,
 'american': 2,
 'ammunition': 1,
 'an': 10,
 'and': 30,
 'anthropologist': 1,
 'anti': 1,
 'any': 3,
 'anyone': 1,
 'anything': 2,
 'appearance': 1,
 'appears': 1,
 'appropriate': 1,
 'are': 5,
 'argued': 1,
 'arguing': 2,
 'argument': 1,
 'arguments': 1,
 'article': 2,
 'articles': 2,
 'as': 3,
 'ashkenazi': 3,
 'aside': 1,
 'attacks': 1,
 'author': 1,
 'authors': 3,
 'average': 1,
 'awards': 1,
 'based': 1,
 'be': 4,
 'because': 3,
 'become': 2,
 'been': 3,
 'believe': 1,
 'best': 1,
 'bigotry': 1,
 'bona': 1,
 'book': 1,
 'breeding': 1,
 'bret': 1,
 'brokers': 1,
 'but': 3,
 'by': 7,
 'can': 1,
 'canceled'
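
The counts above are dominated by function words such as 'and' (30) and 'a' (13), which say little about the topic. One common remedy, not used in this notebook, is to drop stopwords before counting. The sketch below uses a tiny hand-picked stopword set, which is an assumption for illustration; a real list would be much longer.

```python
from collections import Counter

# Tiny illustrative stopword set (an assumption, not a complete list)
STOPWORDS = {'a', 'an', 'and', 'the', 'of', 'to', 'in', 'that', 'is', 'by'}

def word_freq_no_stopwords(data, stopwords=STOPWORDS):
    # Count only the words that are not in the stopword set
    return dict(Counter(w for sentence in data for w in sentence
                        if w not in stopwords))

# Toy tokenized document (made up for illustration)
toy = [['the', 'editors', 'and', 'the', 'column'], ['a', 'column']]
filtered = word_freq_no_stopwords(toy)
print(filtered)  # {'editors': 1, 'column': 2}
```

With such a table, a sentence scorer would need to look words up with `filtered.get(word, 0)`, because removed stopwords no longer have an entry.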

In [18]:
# And calculate the weight for each sentence
rank = sentence_weight(data)
rank

[355,
 876,
 347,
 258,
 495,
 412,
 1253,
 750,
 956,
 194,
 1139,
 772,
 293,
 547,
 375,
 48,
 377,
 580,
 245,
 440,
 297,
 230,
 262]
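
Because each weight is a plain sum of positive counts, sentences with more words tend to score higher. A possible variation, not part of this notebook, divides each sum by the sentence length. The function names and toy data below are assumptions for illustration.

```python
from collections import Counter

def raw_weights(data, freq):
    # Plain sum of word frequencies, as in the notebook's scoring rule
    return [sum(freq[w] for w in words) for words in data]

def normalized_weights(data, freq):
    # Average word frequency per sentence; empty sentences score 0.0
    return [sum(freq[w] for w in words) / len(words) if words else 0.0
            for words in data]

# Toy tokenized document (made up for illustration)
data = [['the', 'cat', 'sat', 'on', 'the', 'mat'], ['the', 'cat']]
freq = Counter(w for words in data for w in words)

raw = raw_weights(data, freq)
norm = normalized_weights(data, freq)
print(raw)   # [11, 5]
print(norm)  # the second sentence now scores higher (2.5)
```

In the toy data the long sentence wins on the raw sum (11 against 5) but loses after normalization, which shows how much sentence length alone can drive the ranking.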

In [20]:
# Pick the n highest-weighted sentences as the news summary
n = 2
# argsort is ascending, so reverse it and keep the first n indices
sort_list = np.argsort(rank)[::-1][:n]
result = ' '.join(sentence_list[i] for i in sort_list)
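
The selection above emits sentences in descending order of weight, which can differ from the order in which they appear in the text. A minimal sketch of an alternative that restores document order is shown below; it uses only the standard library, and the function name `top_n_in_order` and the toy inputs are assumptions for illustration.

```python
def top_n_in_order(sentences, weights, n=2):
    # Indices of the n largest weights
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:n]
    # Re-sort the chosen indices so the summary follows the original text order
    return ' '.join(sentences[i] for i in sorted(top))

# Toy sentences and weights (made up for illustration)
sentences = ['A.', 'B.', 'C.']
weights = [3, 1, 5]
summary = top_n_in_order(sentences, weights, n=2)
print(summary)  # A. C.
```

Ranking by weight alone would give 'C. A.' here; sorting the selected indices keeps the summary readable as a condensed version of the source.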

In [21]:
print(result)

Detractors discovered that one of the authors of the paper Stephens had cited went on to express racist views, and falsely claimed that Stephens himself had advanced ideas that were “genetic” (he did not), “racist” (he made no remarks about any race) and “eugenicist” (alluding to the discredited political movement to improve the human species by selective breeding, which was not remotely related to anything Stephens wrote).It would have been appropriate for the New York Times to acknowledge the controversy, to publish one or more replies, and to allow Stephens and his critics to clarify the issues. So, even if the Times editors wanted to disavow the paper Stephens referenced, the newspaper could have replaced the passage with quotes from the original sources.The Times’ handling of this column sets three pernicious precedents for American journalism.First, while we cannot know what drove the editors’ decision, the outward appearance is that they surrendered to an outrage mob, in the pro

## News summarization using the Newspaper3k library

In [23]:
from newspaper import Article

In [25]:
# link to be scraped
article = Article("https://www.nytimes.com/2020/05/10/us/ahmaud-arbery-georgia.html")

In [26]:
# You can also set the article language explicitly via the language keyword
article = Article("https://www.nytimes.com/2020/05/10/us/ahmaud-arbery-georgia.html", language="en")

In [27]:
# Download and parse the article
article.download()
article.parse()

In [28]:
# Everything is set; we can now use different attributes to extract information about the article
article.authors

['Richard Fausset']

In [29]:
article.publish_date

datetime.datetime(2020, 5, 10, 0, 0)

In [30]:
article.text

'Mr. Arbery was a natural mimic, and Mr. Baker remembered laughing at his impressions on weekday mornings while sitting next to him on the bus. Mr. Arbery was also dazzling on the empty lot where they played a one-man-against-the-world football game called “hot ball.”\n\nMr. Arbery preferred to play barefoot, and patterned his moves on Reggie Bush, the fleet and nimble N.F.L. running back. “Quick cuts, spin and juke moves, step backs,” Mr. Baker said. “It just left you in awe.”\n\nMuch of their childhood was spent outside, drinking water from a spigot and playing ball until dark, and eventually Mr. Baker trimmed down and became a good athlete in his own right. Soon they were both linebackers for the Brunswick High School Pirates.\n\nMr. Arbery won accolades for his talent, and dreamed of playing for the N.F.L. Mr. Baker, who loved to read, dreamed of becoming a doctor.\n\nMr. Baker got his driver’s license first. The two men would drive around their little town in an old Buick Century 

In [32]:
# Call nlp() to extract keywords and a summary (it relies on the punkt tokenizer downloaded earlier)
article.nlp()
article.keywords

['moves',
 'ahmaud',
 'ends',
 'text',
 'playing',
 'dreamed',
 'college',
 'good',
 'morehouse',
 'baker',
 'lost',
 'lifetime',
 'maud',
 'running',
 'arbery',
 'mr',
 'nfl']

In [33]:
article.summary

'Mr. Arbery was a natural mimic, and Mr. Baker remembered laughing at his impressions on weekday mornings while sitting next to him on the bus.\nMr. Arbery won accolades for his talent, and dreamed of playing for the N.F.L.\nMr. Baker, partial to the dense, thorny lyrics of the rapper Kendrick Lamar, was the more fluid wordsmith.\nMr. Arbery tended to provide the sounds of encouragement — the “oohs” and “oh, yeahs” — as Mr. Baker freestyled.\nWhile Mr. Baker made plans for college, Mr. Arbery planned to stay home.'