# Natural Language Processing (NLP)

## Text Summarization 

## Objectives

On completing this assignment, students will be able to write a simple AI application that summarizes a given text by selecting a few of the most relevant sentences from the text.

## Description
 
Write an AI application that will scrape a Wikipedia article on Neural Networking from the Internet and will summarize it by selecting the three most relevant sentences which are less than 20 words long from the article.

### Additionally, do the following:

Allow sentences of the following maximum length to be included in the calculations and see which one produces a good summary.

Max sentence length:
15, 20, 25, 30, or any length

After selecting a suitable length for the above, try out the following numbers for sentences to be included in the final summary.

Number of sentences to be included in the document summary:
1, 2, 3, 4, or 5

What number seems most suitable?

Write a paragraph that both describes your experience and summarizes the results of carrying out the above experiment.


## Discussion

There are two ways to summarize an article. One way is to fully comprehend the article and then summarize it in your own words. This produces an abstract of the article (abstractive summarization). The second way is to extract a few of the most relevant sentences from the article and use them as the summary. This type of summary is called an extractive summary.

In this assignment, we have chosen the second approach and produce an extractive summary of the article.

## Coding

Follow the steps below.

## Title: NLP Web Scraping and Text Summarization

### Keith Yrisarri Stateson
July 18, 2024. Python 3.11.0

##### Summary
This program is an AI application designed to summarize a given Wikipedia article on Neural Networking by web scraping the content. The program uses Beautiful Soup to scrape and parse the article from the Internet, extracting the three most relevant sentences, each containing fewer than 20 words. 

Additionally, the program explores the impact of different maximum sentence lengths and varying the number of sentences in the summary to identify the best configuration for summarizing the text.

##### Assumptions
The program assumes that the most relevant sentences are those with the highest word frequency scores, and that the paragraphs are in the 'p' tag.

##### Overview
Install, Import, and Raw Data Collection
- Importing Libraries
- Scraping the Wikipedia Article

Data Cleaning
- Parsing HTML Content
- Cleaning and Formatting Text
- Removing Reference Numbers

Text Tokenization and Text Analysis
- Sentence Tokenization
- Text Analysis
- Word Tokenization
- Text Analysis

Word Frequency Calculation
- Creating a Word Frequency Dictionary
- Normalizing Word Frequencies

Sentence Scoring and Selection
- Calculating Sentence Scores
- Filtering Sentences by Length
- Selecting Top Sentences

Variations 1, 2, 3
- Experimenting with Different Configurations

Conclusion

## Install, Import, and Raw Data Collection

In [108]:
pip install lxml


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [109]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/keithstateson/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [110]:
import bs4 as bs
import urllib.request  # for opening and reading URLs
raw_data = urllib.request.urlopen ('https://en.wikipedia.org/wiki/Neural_network')  # open the URL and get the raw data
print (raw_data)

<http.client.HTTPResponse object at 0x1602b5b40>


Read the raw page from the connected website

In [111]:
document=raw_data.read()  # read the raw data
# print (document)
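Some sites, Wikipedia included, ask clients to send a descriptive User-Agent header, and the response body arrives as bytes. The sketch below is one possible way to wrap the download step. The function name `fetch_html`, the User-Agent string, and the timeout value are assumptions made for illustration and are not part of the original notebook.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and return its body as bytes.

    The User-Agent value is a placeholder chosen for this sketch.
    """
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "nlp-summarizer-assignment/0.1 (educational use)"},
    )
    # The context manager closes the connection once the body has been read.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Usage (requires network access):
# document = fetch_html("https://en.wikipedia.org/wiki/Neural_network")
```

The returned bytes can be passed to `bs.BeautifulSoup` in the same way as `document` in the cell above.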

## Data Cleaning

Clean up the page to obtain well-formed HTML by parsing the document with the lxml parser.

In [112]:
parsed_document = bs.BeautifulSoup(document, 'lxml')
print (parsed_document)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Neural network - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-

Prepare a list of all `<p>` tag objects, that is, the HTML paragraphs: each `<p>` tag together with its enclosed text.

In [113]:
# .find_all method is used to find all the tags. 'p' is the tag for paragraph
article_paras=parsed_document.find_all ('p')  # article_paras is a list of all the paragraphs in the article
print (article_paras[0:2])  # print the first 2 paragraphs

[<p>A <b>neural network</b> is a group of interconnected units called <a class="mw-redirect" href="/wiki/Neurons" title="Neurons">neurons</a> that send signals to one another. Neurons can be either <a href="/wiki/Cell_(biology)" title="Cell (biology)">biological cells</a> or <a href="/wiki/Mathematical_model" title="Mathematical model">mathematical models</a>. While individual neurons are simple, many of them together in a network can perform complex tasks. There are two main types of neural network.
</p>, <p>In the context of biology, a neural network is a population of biological <a href="/wiki/Neuron" title="Neuron">neurons</a> chemically connected to each other by <a href="/wiki/Synapse" title="Synapse">synapses</a>. A given neuron can be connected to hundreds of thousands of synapses.<sup class="reference" id="cite_ref-shao_1-0"><a href="#cite_note-shao-1">[1]</a></sup>
Each neuron sends and receives <a href="/wiki/Electrochemistry" title="Electrochemistry">electrochemical</a> sig

In [114]:
# prettify method will print the html in a more readable format
print('\nprettify method:')
for para in article_paras[:2]:
    print(para.prettify())


prettify method:
<p>
 A
 <b>
  neural network
 </b>
 is a group of interconnected units called
 <a class="mw-redirect" href="/wiki/Neurons" title="Neurons">
  neurons
 </a>
 that send signals to one another. Neurons can be either
 <a href="/wiki/Cell_(biology)" title="Cell (biology)">
  biological cells
 </a>
 or
 <a href="/wiki/Mathematical_model" title="Mathematical model">
  mathematical models
 </a>
 . While individual neurons are simple, many of them together in a network can perform complex tasks. There are two main types of neural network.
</p>

<p>
 In the context of biology, a neural network is a population of biological
 <a href="/wiki/Neuron" title="Neuron">
  neurons
 </a>
 chemically connected to each other by
 <a href="/wiki/Synapse" title="Synapse">
  synapses
 </a>
 . A given neuron can be connected to hundreds of thousands of synapses.
 <sup class="reference" id="cite_ref-shao_1-0">
  <a href="#cite_note-shao-1">
   [1]
  </a>
 </sup>
 Each neuron sends and receives
 

In [115]:
# get_text method will print the text in the paragraphs
print('\nget_text method:')
for para in article_paras[:2]:
    print(para.get_text())
    
# # .text method will print the text in the paragraphs, same output as get_text method
# print('\n .text method will print the text in the paragraphs')
# for para in article_paras[:2]:  # print the first 2 paragraphs
#     print(para.text)


get_text method:
A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or mathematical models. While individual neurons are simple, many of them together in a network can perform complex tasks. There are two main types of neural network.

In the context of biology, a neural network is a population of biological neurons chemically connected to each other by synapses. A given neuron can be connected to hundreds of thousands of synapses.[1]
Each neuron sends and receives electrochemical signals called action potentials to its connected neighbors. A neuron can serve an excitatory role, amplifying and propagating signals it receives, or an inhibitory role, suppressing signals instead.[1]
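To make clearer what `find_all('p')` followed by `get_text()` achieves, here is a rough standard-library analogue that collects the text inside each `<p>` element, including text nested in `<a>`, `<b>`, or `<sup>` tags. It is only an illustration and not how Beautiful Soup works internally. The class name `ParagraphText` and the `sample_html` string are made up for this sketch.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the visible text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._parts = None    # text chunks of the open paragraph, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []

    def handle_data(self, data):
        # Text inside nested tags still belongs to the open paragraph.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            self.paragraphs.append("".join(self._parts))
            self._parts = None


sample_html = (
    '<div><p>A <b>neural network</b> is a group of '
    '<a href="/wiki/Neurons">neurons</a>.</p>'
    '<p>Second paragraph.<sup>[1]</sup></p></div>'
)
parser = ParagraphText()
parser.feed(sample_html)
parser.close()
print(parser.paragraphs)
```

Note that the reference marker `[1]` survives in the extracted text, which is why a separate cleanup step is needed.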



Iterate over the list, extract the text of each HTML paragraph, and join the pieces into a single string.

In [116]:
scrapped_data=""
for para in article_paras:
    scrapped_data += para.text
print (scrapped_data[:500], '\n') # print the first 500 characters of the scrapped data

A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or mathematical models. While individual neurons are simple, many of them together in a network can perform complex tasks. There are two main types of neural network.
In the context of biology, a neural network is a population of biological neurons chemically connected to each other by synapses. A given neuron can be connected to hundreds of thousands of syn 



The extracted text contains reference numbers such as [1] at the ends of sentences. Clean up the whole text and remove them as done below.

In [117]:
import re
scrapped_data = re.sub (r'\[[0-9]*\]', ' ', scrapped_data)  # remove reference numbers such as [1]; [0-9]* matches zero or more digits
scrapped_data = re.sub (r'\s+', ' ', scrapped_data)  # collapse runs of whitespace into a single space

print(scrapped_data[:500])

A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or mathematical models. While individual neurons are simple, many of them together in a network can perform complex tasks. There are two main types of neural network. In the context of biology, a neural network is a population of biological neurons chemically connected to each other by synapses. A given neuron can be connected to hundreds of thousands of syn
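The two substitutions can be tried on a short string to see their effect. The sketch below is a small variant rather than the notebook's exact code: it uses `\d+` so that at least one digit is required inside the brackets, and it strips leading and trailing whitespace. The helper name `clean_text` and the sample string are assumptions for illustration.

```python
import re


def clean_text(text):
    """Remove bracketed numeric citations and collapse whitespace."""
    # \[\d+\] requires at least one digit, so an empty pair "[]" is left alone.
    text = re.sub(r"\[\d+\]", " ", text)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "A given neuron can be connected to many synapses.[1]\n"
    "Each neuron sends signals.[12]"
)
print(clean_text(sample))
```

Replacing each marker with a space instead of an empty string keeps a sentence from fusing with the next one when the marker sits directly between them.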


## Text Tokenization and Text Analysis

Tokenize the text into sentences: split it into a list in which each element is one sentence. When printed, the list shows each sentence as a quoted string, separated by commas.

In [118]:
from nltk import sent_tokenize
all_sentences = sent_tokenize (scrapped_data)
print (all_sentences[:500])

['A neural network is a group of interconnected units called neurons that send signals to one another.', 'Neurons can be either biological cells or mathematical models.', 'While individual neurons are simple, many of them together in a network can perform complex tasks.', 'There are two main types of neural network.', 'In the context of biology, a neural network is a population of biological neurons chemically connected to each other by synapses.', 'A given neuron can be connected to hundreds of thousands of synapses.', 'Each neuron sends and receives electrochemical signals called action potentials to its connected neighbors.', 'A neuron can serve an excitatory role, amplifying and propagating signals it receives, or an inhibitory role, suppressing signals instead.', 'Populations of interconnected neurons that are smaller than neural networks are called neural circuits.', 'Very large interconnected networks are called large scale brain networks, and many of these together form brains 

Text Analysis

In [119]:
# Analyze word count, sentence count, average word length, and average sentence length
# all_sentences holds the sentence-tokenized text; scrapped_data has not yet been reduced to alphabetic characters

print(f'Total number of words: {len(scrapped_data.split())}')
print(f'Total number of sentences: {len(all_sentences)}')
print(f'Average length of words: {sum(len(word) for word in scrapped_data.split()) / len(scrapped_data.split()):.1f} characters')
print(f'Average sentence length: {len(scrapped_data) / len(all_sentences):.1f} characters')

sentence_length = [len(sent.split()) for sent in all_sentences]
length_distribution = nltk.FreqDist(sentence_length)

print("\nSentence Length Distribution:")
for length, count in sorted(length_distribution.items()):
    print(f"Length {length}: {count} sentences")

# Print sentences based on length
print("\nSentences shorter than 15 words:")
for sent in all_sentences:
    if len(sent.split(' ')) < 15:
        print(sent)

print("\nSentences between 15 and 30 words:")
for sent in all_sentences:
    if 15 <= len(sent.split(' ')) < 30:
        print(sent)

Total number of words: 494
Total number of sentences: 24
Average length of words: 5.8 characters
Average sentence length: 139.4 characters

Sentence Length Distribution:
Length 8: 1 sentences
Length 9: 1 sentences
Length 12: 1 sentences
Length 14: 2 sentences
Length 15: 1 sentences
Length 16: 5 sentences
Length 17: 1 sentences
Length 20: 4 sentences
Length 21: 2 sentences
Length 24: 1 sentences
Length 28: 1 sentences
Length 29: 1 sentences
Length 33: 1 sentences
Length 37: 1 sentences
Length 52: 1 sentences

Sentences shorter than 15 words:
Neurons can be either biological cells or mathematical models.
There are two main types of neural network.
A given neuron can be connected to hundreds of thousands of synapses.
Each neuron sends and receives electrochemical signals called action potentials to its connected neighbors.
Populations of interconnected neurons that are smaller than neural networks are called neural circuits.

Sentences between 15 and 30 words:
A neural network is a group 

Start from the text that has not yet been sentence tokenized and build a word frequency dictionary for the whole document. A word frequency dictionary maps each word to the number of times it occurs in the document. Do this as follows:

- clean up the document so that it contains only alphabetic text
- tokenize the text into words (split it into a list of individual word strings)
- iterate over the tokenized word list and build the word frequency dictionary, leaving stop words (common words such as 'to', 'is', etc.) out of the count

In [120]:
import re
from nltk import word_tokenize
from nltk.corpus import stopwords  # needs the NLTK 'stopwords' corpus: nltk.download('stopwords')

scrapped_data = re.sub ('[^a-zA-Z]', ' ', scrapped_data)  # replace every non-alphabetic character with a space
formatted_text = re.sub (r'\s+', ' ', scrapped_data)  # collapse runs of whitespace
print(f'formatted_text:\n{formatted_text[:500]}\n')

stop_words = set(stopwords.words('english'))  # build the set once instead of reloading the list for every word

word_freq = {}
for word in word_tokenize (formatted_text):  # word_tokenize splits the text into a list of words
    # The stop-word list is lowercase and this test is case-sensitive, so capitalized stop words such as 'A' are still counted
    if word not in stop_words:
        word_freq [word] = word_freq.get(word, 0) + 1

# print (word_freq)

# Print the first 10 key-value pairs in the word_freq dictionary. Can use itertools, which provides better performance for large dictionaries
print('First 10 key-value pairs in the word_freq dictionary:')
for i, (word, freq) in enumerate(word_freq.items()):
    if i < 10:
        print(f"{word}: {freq}")
    else:
        break

formatted_text:
A neural network is a group of interconnected units called neurons that send signals to one another Neurons can be either biological cells or mathematical models While individual neurons are simple many of them together in a network can perform complex tasks There are two main types of neural network In the context of biology a neural network is a population of biological neurons chemically connected to each other by synapses A given neuron can be connected to hundreds of thousands of synapses E

First 10 key-value pairs in the word_freq dictionary:
A: 4
neural: 15
network: 9
group: 1
interconnected: 3
units: 1
called: 4
neurons: 7
send: 1
signals: 4
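The entry `A: 4` in the output appears because NLTK's English stop-word list is lowercase while the text keeps its capital letters. One way to avoid this is to lowercase the text before counting. The sketch below shows the idea using only the standard library. The small `STOP_WORDS` set is a hypothetical stand-in for NLTK's much longer list, and `word_frequencies` is an illustrative name.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "that", "can", "be", "or", "are", "in"}


def word_frequencies(text):
    """Count lowercase alphabetic words that are not stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


sample = "A neural network is a group of neurons. Neurons can be biological cells."
freq = word_frequencies(sample)
print(freq.most_common(3))
```

Lowercasing also merges "Neurons" and "neurons" into one entry, which changes the scores used later, so the choice should be made consistently across the frequency and scoring steps.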


Text Analysis

In [121]:
# Analyze word count, sentence count, average word length, and average sentence length

print(f'Total number of words: {len(formatted_text.split())}')
print(f'Total number of sentences: {len(all_sentences)}')
print(f'Average length of words: {sum(len(word) for word in formatted_text.split()) / len(formatted_text.split()):.1f} characters')
print(f'Average sentence length: {len(formatted_text) / len(all_sentences):.1f} characters')

sentence_length = [len(sent.split()) for sent in all_sentences]
length_distribution = nltk.FreqDist(sentence_length)

print("\nSentence Length Distribution:")
for length, count in sorted(length_distribution.items()):
    print(f"Length {length}: {count} sentences")

# Print sentences based on length
print("\nSentences shorter than 15 words:")
for sent in all_sentences:
    if len(sent.split(' ')) < 15:
        print(sent)

print("\nSentences between 15 and 30 words:")
for sent in all_sentences:
    if 15 <= len(sent.split(' ')) < 30:
        print(sent)

Total number of words: 489
Total number of sentences: 24
Average length of words: 5.7 characters
Average sentence length: 135.7 characters

Sentence Length Distribution:
Length 8: 1 sentences
Length 9: 1 sentences
Length 12: 1 sentences
Length 14: 2 sentences
Length 15: 1 sentences
Length 16: 5 sentences
Length 17: 1 sentences
Length 20: 4 sentences
Length 21: 2 sentences
Length 24: 1 sentences
Length 28: 1 sentences
Length 29: 1 sentences
Length 33: 1 sentences
Length 37: 1 sentences
Length 52: 1 sentences

Sentences shorter than 15 words:
Neurons can be either biological cells or mathematical models.
There are two main types of neural network.
A given neuron can be connected to hundreds of thousands of synapses.
Each neuron sends and receives electrochemical signals called action potentials to its connected neighbors.
Populations of interconnected neurons that are smaller than neural networks are called neural circuits.

Sentences between 15 and 30 words:
A neural network is a group 

## Word Frequency Calculation

Convert the word frequencies to relative word frequencies by dividing each frequency by the maximum frequency.

This allows the comparison of the frequencies of different words on the same scale.

In [122]:
max_freq=max(word_freq.values())
for word in word_freq.keys():
    word_freq [word]= word_freq [word] / max_freq

# print (word_freq)

print('First 10 key-value pairs in the word_freq dictionary after normalization:')
for i, (word, freq) in enumerate(word_freq.items()):
    if i < 10:
        print(f"{word}: {freq}")
    else:
        break

First 10 key-value pairs in the word_freq dictionary after normalization:
A: 0.26666666666666666
neural: 1.0
network: 0.6
group: 0.06666666666666667
interconnected: 0.2
units: 0.06666666666666667
called: 0.26666666666666666
neurons: 0.4666666666666667
send: 0.06666666666666667
signals: 0.26666666666666666
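As a worked example of the normalization, the snippet below divides a handful of the raw counts printed earlier by the largest count (15 for "neural"). It builds a new dictionary with a comprehension instead of updating values in place. The variable names are illustrative.

```python
# Raw counts taken from the word_freq output shown earlier in this notebook.
raw_counts = {"neural": 15, "network": 9, "neurons": 7, "A": 4, "group": 1}

max_count = max(raw_counts.values())
# Build a new dictionary rather than modifying the one being iterated over.
relative = {word: count / max_count for word, count in raw_counts.items()}

for word, value in relative.items():
    print(f"{word}: {value:.3f}")
```

The most frequent word always receives the value 1.0, and every other word receives a value between 0 and 1.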


## Sentence Scoring and Selection

In [123]:
sent_scores = {}
for sent in all_sentences:
    if len (sent.split(' ')) < 25:
        for word in word_tokenize (sent):
            if word in word_freq.keys():            
                if sent in sent_scores.keys():
                    sent_scores [sent] += word_freq [word]
                else:
                    sent_scores [sent] = word_freq [word]

# print (sent_scores)

print('key-value pairs in the sent_scores dictionary:')
for i, (sent, score) in enumerate(sent_scores.items()):
    if i < 100:
        print(f"{sent}: {score}")
    else:
        break

key-value pairs in the sent_scores dictionary:
A neural network is a group of interconnected units called neurons that send signals to one another.: 3.533333333333334
Neurons can be either biological cells or mathematical models.: 0.7999999999999999
While individual neurons are simple, many of them together in a network can perform complex tasks.: 1.9333333333333333
There are two main types of neural network.: 1.8666666666666667
In the context of biology, a neural network is a population of biological neurons chemically connected to each other by synapses.: 3.2
A given neuron can be connected to hundreds of thousands of synapses.: 1.2
Each neuron sends and receives electrochemical signals called action potentials to its connected neighbors.: 1.6666666666666665
A neuron can serve an excitatory role, amplifying and propagating signals it receives, or an inhibitory role, suppressing signals instead.: 2.0666666666666664
Populations of interconnected neurons that are smaller than neural net
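The score of a sentence is the sum of the relative frequencies of its words. The sketch below recomputes the score of the first sentence using only the ten words whose counts appear in the output above, so the total is lower than the notebook's 3.53, which also includes words not listed there. The function name `score_sentence` and the regex-based word split are assumptions for illustration; the notebook uses `word_tokenize`.

```python
import re


def score_sentence(sentence, word_freq):
    """Sum the relative frequencies of the sentence's words found in word_freq."""
    words = re.findall(r"[A-Za-z]+", sentence)
    # dict.get returns 0.0 for stop words and other words missing from word_freq.
    return sum(word_freq.get(word, 0.0) for word in words)


# Raw counts from the word_freq output; the maximum count is 15.
counts = {
    "A": 4, "neural": 15, "network": 9, "group": 1, "interconnected": 3,
    "units": 1, "called": 4, "neurons": 7, "send": 1, "signals": 4,
}
word_freq = {word: count / 15 for word, count in counts.items()}

sentence = (
    "A neural network is a group of interconnected units called neurons "
    "that send signals to one another."
)
print(round(score_sentence(sentence, word_freq), 3))
```

Because the score is a sum rather than an average, longer sentences tend to score higher simply because they contain more words.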

## Variations

### Variation 1. Max sentence length: 25. Number of sentences: 3

From the sentence score dictionary, extract the three sentences with the highest scores and put them in a list.

In [139]:
import heapq
selected_sentences = heapq.nlargest(3, sent_scores, key=sent_scores.get)

print (selected_sentences[:2])  # preview the first two of the three selected sentences

['Artificial neural networks were originally used to model biological neural networks starting in the 1930s under the approach of connectionism.', 'Populations of interconnected neurons that are smaller than neural networks are called neural circuits.']


Join the selected sentences into a single summary string, then print each sentence on its own line.

In [140]:
selected_summary = " ".join (selected_sentences)
for sent in selected_sentences:
    print (sent)

Artificial neural networks were originally used to model biological neural networks starting in the 1930s under the approach of connectionism.
Populations of interconnected neurons that are smaller than neural networks are called neural circuits.
A neural network is a group of interconnected units called neurons that send signals to one another.
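`heapq.nlargest` returns sentences ordered by score, so the summary above does not follow the order of the article. If document order is preferred, the selected sentences can be sorted by their position afterwards. The sketch below assumes the score dictionary was filled in document order. `select_top` is an illustrative name, and the scores are rounded values from this notebook's outputs.

```python
import heapq


def select_top(sent_scores, n):
    """Return the n highest-scoring sentences in their original document order."""
    top = heapq.nlargest(n, sent_scores, key=sent_scores.get)
    # Dictionaries keep insertion order, so this index reflects document order
    # as long as sentences were scored in the order they appear.
    position = {sentence: i for i, sentence in enumerate(sent_scores)}
    return sorted(top, key=position.get)


first = "A neural network is a group of interconnected units called neurons that send signals to one another."
second = "Neurons can be either biological cells or mathematical models."
third = "Populations of interconnected neurons that are smaller than neural networks are called neural circuits."
scores = {first: 3.53, second: 0.80, third: 3.87}

for sent in select_top(scores, 2):
    print(sent)
```

Here the highest-scoring sentence is the third one, yet it is printed after the first sentence because it appears later in the article.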


### Variation 2. Max sentence length: 30. Number of sentences: 4

In [132]:
sent_scores2 = {}
for sent in all_sentences:
    if len (sent.split(' ')) < 30:
        for word in word_tokenize (sent):
            if word in word_freq.keys():            
                if sent in sent_scores2.keys():
                    sent_scores2 [sent] += word_freq [word]
                else:
                    sent_scores2 [sent] = word_freq [word]

print('Key-value pairs in the sent_scores2 dictionary:')
for i, (sent, score) in enumerate(sent_scores2.items()):
    if i < 100:
        print(f"{sent}: {score}")
    else:
        break

Key-value pairs in the sent_scores2 dictionary:
A neural network is a group of interconnected units called neurons that send signals to one another.: 3.533333333333334
Neurons can be either biological cells or mathematical models.: 0.7999999999999999
While individual neurons are simple, many of them together in a network can perform complex tasks.: 1.9333333333333333
There are two main types of neural network.: 1.8666666666666667
In the context of biology, a neural network is a population of biological neurons chemically connected to each other by synapses.: 3.2
A given neuron can be connected to hundreds of thousands of synapses.: 1.2
Each neuron sends and receives electrochemical signals called action potentials to its connected neighbors.: 1.6666666666666665
A neuron can serve an excitatory role, amplifying and propagating signals it receives, or an inhibitory role, suppressing signals instead.: 2.0666666666666664
Populations of interconnected neurons that are smaller than neural ne

In [133]:
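import heapq
# This call ranks sent_scores (the 25-word scores from Variation 1), so the 30-word limit in sent_scores2 does not affect the result.
# To rank this variation's scores instead: heapq.nlargest(4, sent_scores2, key=sent_scores2.get)
selected_sentences2 = heapq.nlargest(4, sent_scores, key=sent_scores.get)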

selected_summary2 = " ".join (selected_sentences2)

for sent in selected_sentences2:
    print (sent)

Artificial neural networks were originally used to model biological neural networks starting in the 1930s under the approach of connectionism.
Populations of interconnected neurons that are smaller than neural networks are called neural circuits.
A neural network is a group of interconnected units called neurons that send signals to one another.
Very large interconnected networks are called large scale brain networks, and many of these together form brains and nervous systems.


### Variation 3. Max sentence length: 15. Number of sentences: 5

In [136]:
sent_scores3 = {}
for sent in all_sentences:
    if len (sent.split(' ')) < 15:
        for word in word_tokenize (sent):
            if word in word_freq.keys():            
                if sent in sent_scores3.keys():
                    sent_scores3 [sent] += word_freq [word]
                else:
                    sent_scores3 [sent] = word_freq [word]

print('Key-value pairs in the sent_scores3 dictionary:')
for i, (sent, score) in enumerate(sent_scores3.items()):
    if i < 100:
        print(f"{sent}: {score}")
    else:
        break

Key-value pairs in the sent_scores3 dictionary:
Neurons can be either biological cells or mathematical models.: 0.7999999999999999
There are two main types of neural network.: 1.8666666666666667
A given neuron can be connected to hundreds of thousands of synapses.: 1.2
Each neuron sends and receives electrochemical signals called action potentials to its connected neighbors.: 1.6666666666666665
Populations of interconnected neurons that are smaller than neural networks are called neural circuits.: 3.8666666666666667


In [137]:
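import heapq
# This call ranks sent_scores (the 25-word scores from Variation 1), so the 15-word limit in sent_scores3 does not affect the result.
# To rank this variation's scores instead: heapq.nlargest(5, sent_scores3, key=sent_scores3.get)
selected_sentences3 = heapq.nlargest(5, sent_scores, key=sent_scores.get)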

selected_summary3 = " ".join (selected_sentences3)

for sent in selected_sentences3:
    print (sent)

Artificial neural networks were originally used to model biological neural networks starting in the 1930s under the approach of connectionism.
Populations of interconnected neurons that are smaller than neural networks are called neural circuits.
A neural network is a group of interconnected units called neurons that send signals to one another.
Very large interconnected networks are called large scale brain networks, and many of these together form brains and nervous systems.
In machine learning, a neural network is an artificial mathematical model used to approximate nonlinear functions.
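When several configurations are compared, it can help to wrap scoring and selection in one function so that the length limit and the ranking always refer to the same score dictionary. The following is a minimal sketch under stated assumptions: `summarize` is a hypothetical helper, words are split with a simple regex instead of `word_tokenize`, and the sentences and frequencies are toy values made up for illustration.

```python
import heapq
import re


def summarize(sentences, word_freq, max_words, n_sentences):
    """Score sentences shorter than max_words and return the top n_sentences."""
    scores = {}
    for sentence in sentences:
        if len(sentence.split()) < max_words:
            words = re.findall(r"[A-Za-z]+", sentence)
            scores[sentence] = sum(word_freq.get(word, 0.0) for word in words)
    # Rank the dictionary that was built for this configuration.
    return heapq.nlargest(n_sentences, scores, key=scores.get)


# Toy inputs, made up for illustration.
s1 = "Neural networks learn patterns."
s2 = "A neural network is a group of connected units called neurons."
s3 = "Neurons send signals to one another across the whole network every day."
sentences = [s1, s2, s3]
word_freq = {"neural": 1.0, "network": 0.6, "neurons": 0.5, "networks": 0.4, "signals": 0.3}

for max_words in (5, 12, 15):
    for n_sentences in (1, 2):
        summary = summarize(sentences, word_freq, max_words, n_sentences)
        print(f"max_words={max_words}, n={n_sentences}: {summary}")
```

With this structure, changing `max_words` changes which sentences are eligible, and changing `n_sentences` changes how many of them are returned.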


## Conclusion

The summaries are largely identical across the three configurations: each longer summary repeats the sentences of the shorter one and appends the next-highest-scoring sentence.

The main reason is visible in the selection cells. Variations 2 and 3 call `heapq.nlargest` on `sent_scores`, the dictionary built with the 25-word limit in Variation 1, rather than on `sent_scores2` and `sent_scores3`. The sentence length limits of 30 and 15 words therefore never reach the ranking step, and only the number of selected sentences changes between variations. This is why the 15-word variation still returns sentences of 16 to 20 words.

As run here, the experiment shows the effect of the number of sentences but not the effect of the sentence length limit (15, 25, or 30).

A possible contributing factor is that the dataset used in this example is relatively small (24 sentences, about 490 words), and the majority of the sentences are of similar length. As a result, the algorithm may have difficulty distinguishing between the importance of different sentences. In a larger dataset with more diverse sentence lengths and structures, the variations in the summaries may be more pronounced.

Considering the sentence scoring and text analysis, medium-length sentences (15-30 words) seem to contain more substantial information and context, which makes them valuable for summaries. They strike a balance between detail and conciseness.