akashp1712/summarize-webpage

Summarize-Webpage, Powered by LessenText.

A Flask application that extracts and summarizes webpages using Natural Language Processing.

screen_webpage_1

Motivation

The motivation for this project is to build a production-ready API for a text summarization algorithm and apply it to real-world use cases. The project implements an NLP algorithm in Python and serves it through a Flask API.

How to start

Clone or download this repository manually. The project is ready to run once the requirements are installed.

Run the app.py file in your development environment.

Open http://127.0.0.1:5000/, customize the project files and have fun.

Requirements

The suggested way is to use a Python virtual environment. The project was developed using Python 3.7.1.

Included modules support

Python

This project uses Flask, a very simple Python web framework that is easy to learn and adopt (and even to scale!).

NLTK, the Natural Language Toolkit, is used to implement the text summarization algorithm.

HTML

The HTML Template used in this project is Stanley - Bootstrap Freelancer Template.

JavaScript

  • Vanilla JavaScript

CSS

  • Vanilla CSS

Installation

Install the required Python packages listed in requirements.txt.

$ pip install -r requirements.txt

Implementation

Project Structure

|───config/
|───framework/
|───implementaion/
|───static/
|───templates/
|───app.py
|───wsgi.py

Framework

├──framework
| |──justext
| |──parser

jusText (the original framework) was developed by miso-belica.

  • justext is code modified from jusText, a heuristic-based boilerplate removal tool. The original code was modified to parse some additional tags (e.g., <P>, <li>, <b>, <H1>...<H6>).

    • Please note that this project only uses the English stopwords from the original project.
  • We use the jusText framework to download the web content and parse it using parser.

    • parser creates a list of Paragraph objects, each with the following properties:

1. is_heading -> boolean
   :: returns true if the paragraph is a heading (<H1>...<H6>)

2. is_list_set -> boolean
   :: returns true if the paragraph is a list item (<li>)

3. is_paragraph -> boolean
   :: returns true if the paragraph is a paragraph tag (<p>)

4. is_first_paragraph -> boolean
   :: returns true if the paragraph is the first paragraph of the content

5. text -> string
   :: returns the text content of the paragraph without any tags
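
To make the shape of these objects concrete, here is a minimal sketch of a Paragraph-like class. It is not the project's actual parser code: the `tag`, `raw_text` and `index` fields and the `HEADING_TAGS` constant are assumptions made for illustration, and only the five property names come from the list above.

```python
from dataclasses import dataclass

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}  # assumed helper constant


@dataclass
class Paragraph:
    """Hypothetical stand-in for the parser's Paragraph object."""
    tag: str        # assumed field: name of the originating HTML tag
    raw_text: str   # assumed field: text as extracted from the page
    index: int = 0  # assumed field: position among the page's <p> blocks

    @property
    def is_heading(self) -> bool:
        return self.tag.lower() in HEADING_TAGS

    @property
    def is_list_set(self) -> bool:
        return self.tag.lower() == "li"

    @property
    def is_paragraph(self) -> bool:
        return self.tag.lower() == "p"

    @property
    def is_first_paragraph(self) -> bool:
        return self.is_paragraph and self.index == 0

    @property
    def text(self) -> str:
        # Collapse runs of whitespace; tags are assumed to be stripped already.
        return " ".join(self.raw_text.split())


paragraphs = [
    Paragraph("h1", "Real  Freedom"),
    Paragraph("p", "First paragraph of the article.", index=0),
    Paragraph("li", "A list item"),
]
headings = [p.text for p in paragraphs if p.is_heading]
```

A summarizer can then branch on these flags (for example `p.is_list_set`) without knowing anything about the underlying HTML.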

Summarization Algorithm

├──implementaion
| |──word_frequency_summarize_parser.py

This is the core module of this project: The implementation of the Summarization Algorithm.

Word_Frequency_Summarization: Summarization implementation using word frequency.

Important: This project implements a slightly modified version of the algorithm: the sentence-scoring method takes web text properties into account, such as whether the text is a header or part of a list.

i.e., it gives more weight to header or bold text than to normal text.

# All weightage for structure doc
# Important: These scores are for the experimenting purpose only

WEIGHT_FOR_LIST = 5
WEIGHT_FOR_HIGHLIGHTED = 10
WEIGHT_FOR_NUMERICAL = 5
WEIGHT_FIRST_PARAGRAPH = 5
WEIGHT_BASIC = 1

...

for word in words:
    if paragraph.is_list_set:
        weight = WEIGHT_FOR_LIST
    else:
        weight = WEIGHT_BASIC

    if word in highlighted_words:
        weight += WEIGHT_FOR_HIGHLIGHTED

    if word.isnumeric() and len(word) >= 2:
        weight += WEIGHT_FOR_NUMERICAL

    if paragraph.is_first_paragraph:
        weight += WEIGHT_FIRST_PARAGRAPH

    word = ps.stem(word)
    if word in stopWords:
        continue

    if word in freqTable:
        freqTable[word] += weight
    else:
        freqTable[word] = weight

This way, words that are part of headers or lists get extra weight and therefore more importance in the summary.

Idea: Play with the weightage and see the difference in the result!
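
To experiment with the weights outside the project, here is a minimal, self-contained sketch of weighted word-frequency summarization. It is not the project's implementation: it skips stemming and the numeric and first-paragraph bonuses, uses a tiny illustrative stop-word set instead of the jusText English list, and the `(text, is_list_item)` input shape, the `tokenize` helper, and the sentence-scoring step (summing word weights per sentence) are assumptions for illustration.

```python
import re
from collections import defaultdict

# Experimental weights, mirroring the values in the excerpt above.
WEIGHT_FOR_LIST = 5
WEIGHT_FOR_HIGHLIGHTED = 10
WEIGHT_BASIC = 1

# Tiny illustrative stop-word set (assumption, not the project's list).
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "to", "in", "on", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_freq_table(blocks, highlighted_words=frozenset()):
    """blocks: iterable of (text, is_list_item) pairs (hypothetical shape)."""
    freq = defaultdict(int)
    for text, is_list_item in blocks:
        for word in tokenize(text):
            if word in STOP_WORDS:
                continue
            weight = WEIGHT_FOR_LIST if is_list_item else WEIGHT_BASIC
            if word in highlighted_words:
                weight += WEIGHT_FOR_HIGHLIGHTED
            freq[word] += weight
    return freq


def summarize(blocks, highlighted_words=frozenset(), max_sentences=1):
    """Return the highest-scoring sentences, in their original order."""
    freq = build_freq_table(blocks, highlighted_words)
    sentences = []
    for text, _ in blocks:
        parts = re.split(r"(?<=[.!?])\s+", text)
        sentences.extend(s.strip() for s in parts if s.strip())
    # Score each sentence as the sum of the weights of its words.
    scored = [
        (sum(freq.get(w, 0) for w in tokenize(s)), i, s)
        for i, s in enumerate(sentences)
    ]
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:max_sentences]
    return [s for _, _, s in sorted(top, key=lambda t: t[1])]


blocks = [
    ("Flask serves the summary API. The weather is nice today.", False),
    ("Flask API", True),  # a list item: its words get WEIGHT_FOR_LIST
]
summary = summarize(blocks)
```

Because "flask" and "api" also occur in a list item, the first sentence outscores the second. Note that summing weights favours long sentences; dividing by sentence length is one possible variation to try.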


Flask service

├──app.py

What if we want to serve our algorithm as an API? (SaaS startup?) Yes, we can do that! app.py is a Flask module that serves an API which summarizes a webpage.

# `summarize` method takes webpage url as argument and returns the summarized text as json object
@app.route('/v1/summarize', methods=['GET'])
def summarize():
    ...

Usage:

This is a GET API which can be queried easily using cURL, Postman or your favourite browser.

e.g., GET /v1/summarize?url=https://medium.com/@bnoll12/real-freedom-539c8e9499bb

OR via browser

http://localhost:5000/v1/summarize?url=https://medium.com/@bnoll12/real-freedom-539c8e9499bb
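
To show how such an endpoint and a client request fit together, here is a minimal sketch of a Flask route with the same path. The body of the real `summarize` handler is not shown in this README, so `summarize_url` is a hypothetical placeholder and the 400 error response is an assumed design. Only the route, the `url` query parameter and the `summary` key in the JSON response come from this document.

```python
from urllib.parse import urlencode

from flask import Flask, jsonify, request

app = Flask(__name__)


def summarize_url(url):
    """Hypothetical placeholder for the download-parse-summarize pipeline."""
    return "summary of " + url


@app.route("/v1/summarize", methods=["GET"])
def summarize():
    url = request.args.get("url", "").strip()
    if not url:
        # Assumed error shape; the original error handling is not shown.
        return jsonify(error="missing 'url' query parameter"), 400
    return jsonify(summary=summarize_url(url))


# Percent-encode the target so it is passed as a single query parameter.
target = "https://medium.com/@bnoll12/real-freedom-539c8e9499bb"
query = urlencode({"url": target})

client = app.test_client()  # in-process client: no server or network needed
response = client.get("/v1/summarize?" + query)
payload = response.get_json()
```

Encoding the target URL matters as soon as it contains its own `?` or `&`, which would otherwise be read as part of the API's query string.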


Let's add some UI

├──templates
├ ├──index.html
├──static
├ ├──assets
├ ├ ├──css
├ ├ ├──js

1. Accept the website URL from the user

The following interface takes the website URL and requests the API we've developed using Ajax.

screen_webpage_1

2. Ajax request using JavaScript: main.js

$.ajax({
    // Encode the page URL so it is passed as a single query parameter
    url: baseUrl + "?url=" + encodeURIComponent(mediumURL)
}).then(function(data) {
    processSummary(mediumURL, data.summary);
});

3. Process API response and display on UI

The API response is displayed on the HTML page using JavaScript.

var summary = document.createElement('p');
summary.innerHTML = "<b>Summary</b>: " + summaryData;
$('#summary').append(summary);

screen_webpage_2

Contribution

Feel free to raise an issue for a bug or feature request, and to open a pull request for any kind of improvement.

Ideas

If you find this project interesting, there is plenty more you can do. The following ideas might help you.

  • Customize the API by adding more options to manipulate the output, e.g., summary length, ignoring list text, etc.
  • Display a list of sentences instead of a paragraph.
  • Create a Chrome plugin that highlights the summary sentences.

Credits

This application uses open source components. You can find the source code of these open source projects along with their license information below.

I acknowledge and am grateful to these developers for their contributions to open source.

jusText used in /framework
Project: Heuristic based boilerplate removal tool https://github.com/miso-belica/jusText
Copyright (c) 2011, Jan Pomikalek <jan.pomikalek@gmail.com>
Copyright (c) 2013, Michal Belica. All Rights Reserved.
License (2-Clause BSD) https://github.com/miso-belica/jusText/blob/dev/LICENSE.rst

HTML template theme
Project: Stanley - HTML theme by TemplateMag (https://templatemag.com)
Copyrights Stanley. All Rights Reserved.
Licensing information: https://templatemag.com/license/

PS: If you like my work, please support it by signing up for the LessenText Beta program.