# Text Summarization on Wikipedia pages using NLP
### Text Summarization?
Text Summarization is one of the techniques used in NLP to create short meaningful collection of text called summaries from text resources like articles, books, research papers or even a webpage.

### Scrapping Wikipedia Article


In [10]:
import bs4 as bs
import urllib.request
import re

# Normally, in Jupyter Notebooks, you may prefer to give a fixed URL, change the URL when you need it
# and not ask for user input.But I wanted to see which articles, 
# I can get a better summary and when the NLTK does "so so":)
userLink = input("Which Wikipedia article would you want me to summarize: ")
# Provide the Wikipedia URL like this: https://
raw_data = urllib.request.urlopen(userLink) 
document = raw_data.read()

parsed_document = bs.BeautifulSoup(document,'lxml')

article_paras = parsed_document.find_all('p')

scrapped_data = ""

for para in article_paras:
    scrapped_data += para.text

Which Wikipedia article would you want me to summarize: https://en.wikipedia.org/wiki/Cannabis


In [11]:
print(scrapped_data[:1500])


Cannabis (/ˈkænəbɪs/)[2] is a genus of flowering plants in the family Cannabaceae. The number of species within the genus is disputed. Three species may be recognized: Cannabis sativa, Cannabis indica, and Cannabis ruderalis; C. ruderalis may be included within C. sativa; all three may be treated as subspecies of a single species, C. sativa;[1][3][4][5] or C. sativa may be accepted as a single undivided species.[6] The genus is widely accepted as being indigenous to and originating from Asia.[7][8][9]
The plant is also known as hemp, although this term is often used to refer only to varieties of Cannabis cultivated for non-drug use. Cannabis has long been used for hemp fibre, hemp seeds and their oils, hemp leaves for use as vegetables and as juice, medicinal purposes, and as a recreational drug. Industrial hemp products are made from cannabis plants selected to produce an abundance of fiber. To satisfy the UN Narcotics Convention, some cannabis strains have been bred to produce minim

### Text Cleaning

In [12]:
scrapped_data = re.sub(r'\[[0-9]*\]', ' ',  scrapped_data)
scrapped_data = re.sub(r'\s+', ' ',  scrapped_data)

In [13]:
formatted_text = re.sub('[^a-zA-Z]', ' ', scrapped_data)
formatted_text = re.sub(r'\s+', ' ', formatted_text)

### Finding Word Frequencies

In [14]:
import nltk #if you don't have it, then>> python3 -m pip install nltk
all_sentences = nltk.sent_tokenize(scrapped_data)

In [15]:
# Stop Words are the words that you will most probably ignore, so we filter them out of the text.
stopwords = nltk.corpus.stopwords.words('english')

word_freq = {}
for word in nltk.word_tokenize(formatted_text):
    if word not in stopwords:
        if word not in word_freq.keys():
            word_freq[word] = 1
        else:
            word_freq[word] += 1

In [16]:
max_freq = max(word_freq.values())

for word in word_freq.keys():
    word_freq[word] = (word_freq[word]/max_freq)

### Finding Sentence Scores

In [17]:
sentence_scores = {}
for sentence in all_sentences:
    for token in nltk.word_tokenize(sentence.lower()):
        if token in word_freq.keys():
            if len(sentence.split(' ')) <25:
                if sentence not in sentence_scores.keys():
                    sentence_scores[sentence] = word_freq[token]
                else:
                    sentence_scores[sentence] += word_freq[token]

### Printing Summaries

In [18]:
import heapq
selected_sentences= heapq.nlargest(5, sentence_scores, key=sentence_scores.get)

text_summary = ' '.join(selected_sentences)
print(text_summary)

The plant is also known as hemp, although this term is often used to refer only to varieties of Cannabis cultivated for non-drug use. Cannabis sativa cultivars are used for fibers due to their long stems; Sativa varieties may grow more than six metres tall. In 1785, noted evolutionary biologist Jean-Baptiste de Lamarck published a description of a second species of Cannabis, which he named Cannabis indica Lam. The name Cannabis indica was listed in various Pharmacopoeias, and was widely used to designate Cannabis suitable for the manufacture of medicinal preparations. This taxonomic interpretation was embraced by Cannabis aficionados who commonly distinguish narrow-leafed "Sativa" strains from wide-leafed "Indica" strains.
