# Scrapping and summarizing my university's website

This code is a full webscrapping process from accessing a website to transforming its data to a summarized text :

1. requests
2. BeautifulSoup
3. nltk

We will work on my universty's "ISTIC" website to scrap some data automatically First install missing libraries :

In [8]:
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [None]:
import bs4 as bs
import urllib.request
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

## Go to website and parse the page with BeautifulSoup

In [17]:
# Get data to summarize
scraped_data = urllib.request.urlopen('http://www.istic.rnu.tn/fr/presentation/presentation.html')
data = scraped_data.read()

parsed_data = bs.BeautifulSoup(data,'lxml')

paragraphs = parsed_data.find_all('p')

# Input text - to summarize
data_text = ""

for p in paragraphs:
    data_text += p.text


## Clean the paragraphs

In [18]:
# Removing numbers 
data_text = re.sub(r'\[[0-9]*\]', ' ', data_text)
data_text = re.sub(r'\s+', ' ', data_text)

In [19]:
# Removing special characters and digits
formatted_data_text = re.sub('[^a-zA-Z]', ' ', data_text )
formatted_data_text = re.sub(r'\s+', ' ', formatted_data_text)

In [20]:
sentence_list = nltk.sent_tokenize(data_text)

## Remove stop words 

In [21]:
stopwords = nltk.corpus.stopwords.words('french')

word_frequencies = {}
for word in nltk.word_tokenize(formatted_data_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

## Create frequency table of words

In [22]:
maximum_frequncy = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)

## Assign score to each sentence depending on the words it contains and the frequency table

In [23]:
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

## Build summary

In [24]:
import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

L’ISTIC est une institution universitaire publique créée par les décrets n° 2011-1010 du 24 août 2011 et n° 1645 de 2012, sous tutelle de l’Université de Carthage.


## Ref : https://dev.to/davidisrawi/build-a-quick-summarizer-with-python-and-nltk