In [16]:
import lxml
import bs4 as bs  
import urllib
import re
import nltk
from nltk.corpus import stopwords

In [26]:
scraped_data = urllib.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')  
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:  
    article_text += p.text

print(article_text)


Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals.  In computer science  AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.[1] Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving".[2]
The scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip, "AI is whatever hasn't been done yet."[3] For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine technology.[4] Modern machine

# Preprocessing
The first preprocessing step is to remove references from the article. Wikipedia, references are enclosed in square brackets. The following script removes the square brackets and replaces the resulting multiple spaces by a single space. Take a look at the script below:.

In [22]:
# Removing Square Brackets and Extra Spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)  
article_text = re.sub(r'\s+', ' ', article_text)
print(article_text)

 Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals. In computer science AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving". The scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip, "AI is whatever hasn't been done yet." For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine technology. Modern machine capabilities 

# Text Cleaning
The article_text object contains text without brackets. However, we do not want to remove anything else from the article since this is the original article. We will not remove other numbers, punctuation marks and special characters from this text since we will use this text to create summaries and weighted word frequencies will be replaced in this article.

To clean the text and calculate weighted frequences, we will create another object. 

In [27]:
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )  
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)

print (formatted_article_text)

 Artificial intelligence AI sometimes called machine intelligence is intelligence demonstrated by machines in contrast to the natural intelligence displayed by humans and other animals In computer science AI research is defined as the study of intelligent agents any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals Colloquially the term artificial intelligence is applied when a machine mimics cognitive functions that humans associate with other human minds such as learning and problem solving The scope of AI is disputed as machines become increasingly capable tasks considered as requiring intelligence are often removed from the definition a phenomenon known as the AI effect leading to the quip AI is whatever hasn t been done yet For instance optical character recognition is frequently excluded from artificial intelligence having become a routine technology Modern machine capabilities generally classified as AI include s

# Converting Text To Sentences
At this point we have preprocessed the data. Next, we need to tokenize the article into sentences. We will use the article_text object for tokenizing the article to sentence since it contains full stops. The formatted_article_text does not contain any punctuation and therefore cannot be converted into sentences using the full stop as a parameter.



In [28]:
sentence_list = nltk.sent_tokenize(article_text)  

print(sentence_list)

[u'\nArtificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals.', u'In computer science  AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.', u'[1] Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving".', u'[2]\nThe scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip, "AI is whatever hasn\'t been done yet.', u'"[3] For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine techn

# Find Weighted Frequency of Occurrence
To find the frequency of occurrence of each word, we use the formatted_article_text variable. We used this variable to find the frequency of occurrence since it doesn't contain punctuation, digits, or other special characters. Take a look at the following script:

In [29]:
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}  
for word in nltk.word_tokenize(formatted_article_text):  
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
            
print(word_frequencies)

{u'limited': 1, u'AGI': 4, u'comparatively': 1, u'dynamic': 4, u'four': 1, u'influenza': 3, u'Sedol': 1, u'Yann': 1, u'poorly': 2, u'relationships': 1, u'whose': 3, u'votes': 1, u'patches': 2, u'electricity': 1, u'Kubrick': 1, u'Jennings': 1, u'unanswered': 1, u'presents': 1, u'worth': 2, u'risk': 11, u'Keyword': 1, u'activation': 3, u'rise': 2, u'pathfinding': 2, u'every': 5, u'affect': 1, u'vast': 1, u'believed': 1, u'exponential': 1, u'Economists': 1, u'skills': 7, u'companies': 5, u'Pitts': 2, u'correct': 1, u'unrelated': 1, u'phase': 1, u'Go': 10, u'enhance': 1, u'Paul': 1, u'unauthorised': 1, u'Ro': 1, u'Lenat': 1, u'force': 2, u'leaders': 1, u'awake': 1, u'warns': 1, u'estimates': 1, u'direct': 1, u'elegant': 1, u'second': 3, u'street': 1, u'merge': 1, u'implemented': 1, u'LSTM': 5, u'machines': 24, u'tactile': 1, u'even': 11, u'established': 3, u'organisms': 1, u'GMDH': 1, u'hyperintelligence': 1, u'beaten': 1, u'revolves': 1, u'children': 1, u'superintelligent': 1, u'Amazon': 

Finally, to find the weighted frequency, we can simply divide the number of occurances of all the words by the frequency of the most occurring word, as shown below:

In [30]:
maximum_frequncy = max(word_frequencies.values())

for word in word_frequencies.keys():  
    word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)

print(maximum_frequncy)

163


# Calculating Sentence Scores
We have now calculated the weighted frequencies for all the words. Now is the time to calculate the scores for each sentence by adding weighted frequencies of the words that occur in that particular sentence.

In [31]:
sentence_scores = {}  
for sent in sentence_list:  
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
                    
print(sentence_scores)

{u'A number of researchers began to look into "sub-symbolic" approaches to specific AI problems.': 0, u'[355] This idea, called transhumanism, which has roots in Aldous Huxley and Robert Ettinger.': 0, u'[174] Alternatively, distributed search processes can coordinate via swarm intelligence algorithms.': 0, u'[14]\nCan intelligent behavior be described using simple, elegant principles (such as logic or optimization)?': 0, u'[352] The new intelligence could thus increase exponentially and dramatically surpass humans.': 0, u'Concern over risk from artificial intelligence has led to some high-profile donations and investments.': 0, u'Propositional logic[180] involves truth functions such as "or" and "not".': 0, u'Once humans develop artificial intelligence, it will take off on its own and redesign itself at an ever-increasing rate.': 0, u'[13]\nEarly researchers developed algorithms that imitated step-by-step reasoning that humans use when they solve puzzles or make logical deductions.': 

# Getting the Summary
Now we have the sentence_scores dictionary that contains sentences with their corresponding score. To summarize the article, we can take top N sentences with the highest scores. The following script retrieves top 7 sentences and prints them on the screen.

In [21]:
import heapq  
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)  
print(summary) 

A number of researchers began to look into "sub-symbolic" approaches to specific AI problems. A very different kind of search came to prominence in the 1990s, based on the mathematical theory of optimization. However, beginning with the collapse of the Lisp Machine market in 1987, AI once again fell into disrepute, and a second, longer-lasting hiatus began. Concern over risk from artificial intelligence has led to some high-profile donations and investments. Recent developments in autonomous automobiles have made the innovation of self-driving trucks possible, though they are still in the testing phase. Deep Blue became the first computer chess-playing system to beat a reigning world chess champion, Garry Kasparov on 11 May 1997. Once humans develop artificial intelligence, it will take off on its own and redesign itself at an ever-increasing rate.
