# Natural Language Processing (NLP)

## Text Summarization 

## Objectives

On completing this assignment, students will be able to write a simple AI application that summarizes a given text by selecting a few of its most relevant sentences.

## Description
 
Write an AI application that scrapes the Wikipedia article on neural networks from the Internet and summarizes it by selecting the three most relevant sentences that are fewer than 20 words long.

### Additionally, do the following:

Vary the maximum length of the sentences allowed in the calculations and see which value produces the best summary.

Max sentence length:
15, 20, 25, 30, or any length

After selecting a suitable length from the list above, try out the following numbers of sentences to include in the final summary.

Number of sentences to be included in the document summary:
1, 2, 3, 4, or 5

Which number seems most suitable?

Write a paragraph that both describes your experience and summarizes the results of carrying out the above experiment.


## Discussion

There are two ways to summarize an article. One way is to fully comprehend the article and then summarize it in your own words. This produces an abstract of the article and is known as abstractive summarization. The second way is to extract a few of the most relevant sentences from the article and use them as the summary. This type of summary is called an extractive summary.

In this assignment, we have chosen the second approach and produce an extractive summary of the article.

## Coding

Follow the steps below.

In [541]:
import bs4 as bs
import urllib.request
raw_data = urllib.request.urlopen ('https://en.wikipedia.org/wiki/Neural_network')
#print (raw_data)

Read the raw page from the opened connection.

In [543]:
document=raw_data.read()
#print (document)
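
Some sites may reject requests that carry Python's default user agent, and a network call can fail for many reasons. The sketch below shows one possible way to send a descriptive `User-Agent` header and handle errors. The helper names `build_request` and `fetch`, the user agent string, and the timeout value are assumptions made for illustration; they are not part of the assignment code.

```python
import urllib.error
import urllib.request

URL = "https://en.wikipedia.org/wiki/Neural_network"


def build_request(url, user_agent="nlp-summarizer-assignment/0.1 (student project)"):
    """Create a request object that carries a descriptive User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return the raw bytes of the page, or None if the download fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
        print(f"Could not fetch {url}: {err}")
        return None


# Building the request does not touch the network; only fetch(URL) would.
request = build_request(URL)
print(request.full_url)
```

Calling `fetch(URL)` would then return the same kind of bytes object that `raw_data.read()` produces in the cell above.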

Parse the page into a clean HTML document.

In [545]:
parsed_document = bs.BeautifulSoup(document, 'lxml')
#print (parsed_document)

Prepare a list of all `<p>` tag objects (the HTML paragraphs: each `<p>` tag together with its enclosed text).

In [547]:
article_paras=parsed_document.find_all ('p')
#print (article_paras)

Iterate over the list, extract the text of each HTML paragraph, and concatenate the pieces into a single string.

In [549]:
scrapped_data=""
for para in article_paras:
    scrapped_data += para.text
#print (scrapped_data)
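
To make the extraction step concrete without a network connection, the sketch below imitates "find every paragraph, then take its text" on a small inline HTML string. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs with no extra packages. The `ParagraphText` class and the sample markup are illustrative assumptions.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements, one string per paragraph."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")  # start a new, empty paragraph

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still belongs to the paragraph.
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Hypothetical sample page; only the two <p> elements should be kept.
html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>Navigation box</div>"
    "<p>Second.</p></body></html>"
)

parser = ParagraphText()
parser.feed(html)

# Joining with a space keeps sentences of adjacent paragraphs apart.
sample_text = " ".join(parser.paragraphs)
print(sample_text)
```

The heading and the `<div>` are dropped because they sit outside any `<p>` element, which is the same reason the notebook keeps only the article's body text.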

The text contains reference markers such as [1]. Clean up the whole text by removing them and collapsing repeated whitespace, as done below.

In [551]:
import re
scrapped_data = re.sub (r'\[[0-9]*\]', ' ', scrapped_data)
scrapped_data = re.sub (r'\s+', ' ', scrapped_data)
#print(scrapped_data)
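
To see what the two substitutions do, here is a minimal sketch on a made-up string. The `sample` text is an assumption for illustration and is not taken from the article.

```python
import re

# Hypothetical text with citation markers, double spaces and a newline.
sample = "Neurons send signals.[1]  They form\nnetworks.[23][4] "

# Replace bracketed reference numbers such as [1] or [23] with a space.
no_refs = re.sub(r'\[[0-9]*\]', ' ', sample)

# Collapse every run of whitespace (spaces, tabs, newlines) into one space.
cleaned = re.sub(r'\s+', ' ', no_refs)

print(repr(cleaned))  # 'Neurons send signals. They form networks. '
```

A single trailing space survives the cleanup; calling `cleaned.strip()` would remove it if that matters for later steps.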

Tokenize the text into sentences, that is, split it into a Python list with one string per sentence. (The single quotes and commas that appear when the list is printed are just Python's way of displaying a list of strings.)

In [553]:
from nltk import sent_tokenize
all_sentences = sent_tokenize (scrapped_data)
#print (all_sentences)
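
The shape of the result can be illustrated with a deliberately naive splitter. The function `naive_sent_split` below is a hypothetical stand-in written for this sketch: it only splits after `.`, `!` or `?` followed by whitespace, so it would break on abbreviations such as "Dr." that a trained sentence tokenizer is designed to handle.

```python
import re


def naive_sent_split(text):
    """Split text after ., ! or ? when followed by whitespace (illustration only)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())


# Hypothetical sample text.
sample = "A neural network is a group of neurons. Neurons send signals! Do they learn?"

sentences = naive_sent_split(sample)
print(sentences)
```

The output is an ordinary list of three strings, which is the same kind of object that `all_sentences` holds.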

Start with the text that has not been sentence tokenized and prepare a word frequency dictionary for the whole document. [A word frequency dictionary maps each word to its frequency, that is, the number of times the word is used in the document.] Do this in the following way:

- clean up the document so that it contains only alphabetic text
- tokenize the text into words (split it into a list with one string per word)
- iterate over the tokenized word list and build the word frequency dictionary, making sure not to include stopwords (common words such as 'to', 'is', etc.) in the frequency count

In [555]:
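import re
from nltk import word_tokenize
from nltk.corpus import stopwords

# Keep letters only, then collapse the resulting runs of spaces.
scrapped_data = re.sub (r'[^a-zA-Z]', ' ', scrapped_data)
formatted_text = re.sub (r'\s+', ' ', scrapped_data)
#print (formatted_text)

# Build the stopword set once; calling stopwords.words() inside the loop
# rebuilds the list on every iteration and is very slow.
stop_words = set (stopwords.words ('english'))

# Note: the NLTK stopwords are lowercase, so capitalized forms such as
# 'The' pass this case-sensitive test and are still counted.
word_freq = {}
for word in word_tokenize (formatted_text):
    if word not in stop_words:
        word_freq [word] = word_freq.get (word, 0) + 1

#print (word_freq)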
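
The counting step can also be written with `collections.Counter`. The sketch below is a possible variant, not the cell's exact behavior: it lowercases every word before filtering and counting, uses `str.split` instead of `word_tokenize`, and relies on a tiny made-up stopword set (`STOP_WORDS`) so that it runs without NLTK.

```python
import re
from collections import Counter

# Tiny hypothetical stopword set; the assignment uses NLTK's English list.
STOP_WORDS = {"a", "is", "of", "the", "to"}

# Hypothetical sample text.
text = "The network is a network of neurons. Neurons send signals to the network."

# Keep letters only, then split on whitespace.
words = re.sub(r'[^a-zA-Z]', ' ', text).split()

# Lowercasing merges "Neurons" with "neurons" and lets "The" match "the".
toy_freq = Counter(w.lower() for w in words if w.lower() not in STOP_WORDS)

print(toy_freq.most_common(2))  # [('network', 3), ('neurons', 2)]
```

Whether to lowercase is a design choice: it gives cleaner counts, but the sentence scoring step would then need to lowercase its words too so that the lookups still match.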

Convert the word frequencies to relative word frequencies by dividing each of them by the maximum frequency.

In [557]:
max_freq=max(word_freq.values())
for word in word_freq.keys():
    word_freq [word]= word_freq [word] / max_freq
#print (word_freq)
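
As a small worked example of this normalization, consider the made-up counts below. The numbers are assumptions chosen to keep the arithmetic easy to follow.

```python
# Hypothetical raw counts for a four-word vocabulary.
raw_counts = {"network": 3, "neurons": 2, "send": 1, "signals": 1}

max_count = max(raw_counts.values())

# Dividing by the largest count maps the most frequent word to 1.0
# and every other word to a value greater than 0 and at most 1.
relative = {word: count / max_count for word, count in raw_counts.items()}

for word, value in relative.items():
    print(f"{word:8s} {value:.2f}")
```

After this step the scores no longer depend on the length of the document, only on how often a word occurs relative to the most frequent one.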

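Score each sentence by adding up the relative frequencies of the words it contains. Only sentences shorter than the maximum sentence length (25 words in the cell below) are scored.

In [558]:
# Sentences with 25 or more words are skipped; vary this limit for the experiment.
sent_scores = {}
for sent in all_sentences:
    if len (sent.split (' ')) < 25:
        for word in word_tokenize (sent):
            if word in word_freq:
                # Add the word's relative frequency to the sentence's running total.
                sent_scores [sent] = sent_scores.get (sent, 0) + word_freq [word]

#print (sent_scores)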
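
The scoring step can be traced by hand on a toy example. Everything in the sketch below (the frequencies, the sentences, and the `MAX_WORDS` limit of 8) is made up for illustration, and `str.split` stands in for `word_tokenize`.

```python
# Hypothetical relative frequencies and sentences.
toy_freq = {"network": 1.0, "neurons": 2 / 3, "send": 1 / 3, "signals": 1 / 3}
toy_sentences = [
    "neurons send signals",
    "the network is a network of neurons",
    "this long sentence about the network is skipped because it has too many words",
]
MAX_WORDS = 8  # assumed length limit for this toy example

toy_scores = {}
for sent in toy_sentences:
    words = sent.split()
    if len(words) < MAX_WORDS:
        # Sum the relative frequency of each word; words without an entry add 0.
        toy_scores[sent] = sum(toy_freq.get(w, 0) for w in words)

for sent, score in toy_scores.items():
    print(f"{score:.2f}  {sent}")
```

The second sentence scores highest because "network" occurs in it twice, and the third sentence receives no score at all because it exceeds the length limit.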
        

From the sentence score dictionary, extract the three sentences with the highest scores and put them in a list.

In [560]:
import heapq
selected_sentences = heapq.nlargest(3, sent_scores, sent_scores.get)
#print (selected_sentences)
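
A short sketch of how `heapq.nlargest` behaves on a dictionary may help here. The labels and scores are made up, and the final re-sorting step is an optional idea rather than something the assignment asks for.

```python
import heapq

# Hypothetical sentences (in article order) and their scores.
toy_sentences = ["A", "B", "C", "D"]
toy_scores = {"A": 0.4, "B": 1.3, "C": 2.7, "D": 0.9}

# Iterating over a dict yields its keys; key=toy_scores.get ranks them by score.
top_two = heapq.nlargest(2, toy_scores, key=toy_scores.get)
print(top_two)  # ['C', 'B']

# Optional: restore the order in which the sentences appear in the article.
in_article_order = sorted(top_two, key=toy_sentences.index)
print(in_article_order)  # ['B', 'C']
```

The result of `nlargest` is ordered from highest to lowest score, so the summary reads in score order unless the sentences are re-sorted by their position in the article.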

Join the selected sentences into a single string, separated by spaces, and print it.

In [562]:
selected_summary = " ".join (selected_sentences)
print (selected_summary)

Artificial neural networks were originally used to model biological neural networks starting in the 1930s under the approach of connectionism. Populations of interconnected neurons that are smaller than neural networks are called neural circuits. A neural network is a group of interconnected units called neurons that send signals to one another.
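
For the experiment with different maximum sentence lengths and summary sizes, it may be convenient to wrap the scoring and selection steps in one function. The sketch below is one possible arrangement: `summarize` is a hypothetical helper, the toy sentences and frequencies are made up, and `str.split` stands in for `word_tokenize`. In the notebook, `all_sentences` and `word_freq` would be passed instead of the toy data.

```python
import heapq


def summarize(sentences, word_freq, max_words, num_sentences):
    """Return the num_sentences best-scoring sentences shorter than max_words words."""
    scores = {}
    for sent in sentences:
        words = sent.split()  # simple stand-in for word_tokenize
        if len(words) < max_words:
            scores[sent] = sum(word_freq.get(w, 0) for w in words)
    return heapq.nlargest(num_sentences, scores, key=scores.get)


# Toy stand-ins for all_sentences and word_freq.
toy_sentences = [
    "neurons send signals",
    "a network is a group of neurons",
    "a deep network stacks many layers of neurons between its input layer "
    "and its output layer today",
]
toy_freq = {"network": 1.0, "neurons": 1.0, "layers": 0.5, "layer": 0.5, "signals": 0.25}

# Step 1: vary the maximum sentence length, keeping three sentences.
# float("inf") plays the role of "any length".
for max_words in (15, 20, 25, 30, float("inf")):
    print(max_words, summarize(toy_sentences, toy_freq, max_words, 3))

# Step 2: fix one length (20 here, purely as an example) and vary the count.
for num_sentences in (1, 2, 3, 4, 5):
    print(num_sentences, summarize(toy_sentences, toy_freq, 20, num_sentences))
```

With the toy data, the 17-word sentence is excluded at a limit of 15 but becomes the top-ranked sentence at a limit of 20, which shows how strongly the length limit can change the summary.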
