<b>Name:</b> Daniel Manova<br>
<b>Project:</b> Text Summarization using NLTK<br>

<b>Problem Statement:</b> Text Summarization for the. give file of type .txt,.docx or web url using NLP based technique<br>
<b>Objective:</b> Summarize the text using nltk

## Input the required libraries

In [1]:
# Initialize all necessary packages
import nltk #NLTK is a libraries for Natural Language Processing. It is a platform for building Python programs to process natural language. NLTK is written in the Python programming language.
import re #Built-in package to work with Regular Expressions.
import bs4 as bs #Beautiful Soup is a Python library for pulling data out of HTML and XML files.
import urllib.request  #The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.0.
import docx2txt #library used to convert Microsoft Office Docx documents to equivalent Text documents.
import heapq #Heap queue is a special tree structure in which each parent node is less than or equal to its child node.

## Data Preprocessing

#### Reading text Content
- We can summarise the content which are of file type  MS-Word,text,Weburl
- Library docx2txt is used to read the content from word document and convert to text format
- Module urllib.request to open the web url and with help of Beautiful Soup pull data out of HTML and XML files.
- open method is used to read text file


In [2]:
def read_content(filepath_or_url,filetype):
    if filetype == 'word':
        doc_content = docx2txt.process(filepath_or_url)
    elif filetype == 'url':
        scraped_data = urllib.request.urlopen(filepath_or_url)
        content = scraped_data.read()
        parsed_content = bs.BeautifulSoup(content,'lxml')
        paragraphs = parsed_content.find_all('p')
        doc_content = ""
        for p in paragraphs:
            doc_content += p.text       
    elif filetype == 'text':
        file = open(filepath_or_url, 'r', errors='ignore')
        doc_content=file.read()
    else:
        print("Please makesure the file type is .txt, .docx or url\n")
    return doc_content

#### Text cleaning
- Remove special characters and extra spaces from the content
- Generate the list of sentences from the content
- Define a variable stopwords and store all the English stop words from the nltk library which would be used for checking stop words in the given sentence

In [3]:
def cleanup(data_to_clean):
    # Removing some special characers and extra space
    data_to_clean = re.sub(r'\[[0-9]*\]', ' ', data_to_clean)
    data_to_clean = re.sub(r'\s+', ' ', data_to_clean)

    # Removing special characters and digits
    formated_content = re.sub('[^a-zA-Z]', ' ', data_to_clean )
    formated_content = re.sub(r'\s+', ' ', formated_content)

    # Fetch sentence lenth 
    sentence_list = nltk.sent_tokenize(data_to_clean)

    # Set stopwords using nltk corpus
    stopwords = nltk.corpus.stopwords.words('english')
    
    return formated_content,sentence_list,stopwords

#### Identify frequency of each word by Occurrence

- Lets identify the frequency of each words by using formated_content and stopwords variable extracted from cleanup function.
- Makeuse of stopwords variable.Loopover all the sentences and than respective words and check if they are stopwords.
    - If  present do nothing
    - If not, check whether the words present in word_frequency dictionary.
        - if the word is comeup for the first time, it will added to the word_frequencies dictionary as a key and its value is set to 1
        - If the word previously exists in the dictionary, its value is simply updated by 1.
- Now to find the weighted frequency, lets divide the number of occurances of all the words by the frequency of the most occurring word.

In [4]:
def word_max_freq(formated_content,stopwords):
    word_frequencies = {}
    for word in nltk.word_tokenize(formated_content):
        if word not in stopwords:
            if word not in word_frequencies.keys():
                word_frequencies[word] = 1 
            else:
                word_frequencies[word] += 1
        maximum_frequncy = max(word_frequencies.values())
    return maximum_frequncy,word_frequencies

#### Generate Scores for the Sentences

- Lets calculate the score of each sentence by adding weighted frequencies of the words that occur in that particular sentence.
- Create an empty sentence_scores dictionary and update the Sentences as key and the corresponding score would be defined as value, loop through each sentence in the sentence_list and tokenize the sentence into words.
- Check if the word exists in the word_frequencies dictionary fetched from word_max_freq function.This validation is performed since we created the sentence_list list from the doc_content object; on the other hand, the word frequencies were calculated using the formated_content object, which do not contain any stop words, numbers,..
- Logic defined to calculate scores for the sentences only contains words less than 28 words, as we do not want long content in the summary.This number can been tweeked specific to usecase.
- Check if the sentence available in the sentence_scores dictionary. 
    - If the sentence doesn't exist, we add it to the sentence_scores dictionary as a key and assign it the weighted frequency of the first word in the sentence, as its value.
    - If the sentence exists in the dictionary, add the weighted frequency of the word to the existing value.

In [5]:
def sent_score(word_frequencies,maximum_frequncy):
    for word in word_frequencies.keys():
        word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)
        sentence_scores = {}
    for sent in sentence_list:
        for word in nltk.word_tokenize(sent.lower()):
            if word in word_frequencies.keys():
                if len(sent.split(' ')) < 28:
                    if sent not in sentence_scores.keys():
                        sentence_scores[sent] = word_frequencies[word]
                    else:
                        sentence_scores[sent] += word_frequencies[word]
    return sentence_scores

#### Generate the Summary
- Lets make us of sentence_scores dictionary (contains respective sentence score) to summarize the doc_content
- making use of nlargest function to generate the top n sentences with the highest scores (In this we have used 5 sentences. However we can modify specific to usecase)

In [6]:
# Main
#text_to_clean = read_content("about_corona.docx","word") #Use this for summarizing the word document
#text_to_clean = read_content("about_climat_change.txt","text") #use this for summarizing text document
text_to_clean = read_content("https://en.wikipedia.org/wiki/Artificial_intelligence","url") #use this for summarizing web url document

formated_content,sentence_list,stopwords = cleanup(text_to_clean)
maximum_frequncy,word_frequencies = word_max_freq(formated_content,stopwords)
sentence_scores = sent_score(word_frequencies,maximum_frequncy)

summary_sentences = heapq.nlargest(5, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print("This is the summary of the provided document:\n---------------------------------------------\n", summary)


This is the summary of the provided document:
---------------------------------------------
  Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to natural intelligence displayed by animals including humans. A machine with general intelligence can solve a wide variety of problems with a breadth and versatility similar to human intelligence. A superintelligence, hyperintelligence, or superhuman intelligence, is a hypothetical agent that would possess intelligence far surpassing that of the brightest and most gifted human mind. Deep learning has drastically improved the performance of programs in many important subfields of artificial intelligence, including computer vision, speech recognition, image classification and others. [q] AI founder John McCarthy said: "Artificial intelligence is not, by definition, simulation of human intelligence".
