TextRank Demo

A simple website demonstrating TextRank's extractive summarization capability. Currently supports English and Chinese.

Major updates

September 2019

  • I managed to find the exact setup that makes the Universal Sentence Encoder work and put it in Dockerfile.use_cpu. The main issue was that tf-sentencepiece doesn't work with Python 3.7; downgrading to Python 3.6 solved the problem for me.

June 2019

  • Added the LASER sentence encoder (multilingual). LASER has rather complicated installation steps, so a dedicated Dockerfile (Dockerfile.laser_cpu) is provided.
  • The Xling variant of the Universal Sentence Encoder stopped working due to a problem with the tf-sentencepiece package, despite the same version specification in requirements.txt. This has happened before, and fixing it was really annoying. Since Google has dropped support for this integration, it's unlikely to get better (see the quote below), so I decided to drop official support for Xling. It's still listed on the demo page as an encoder option, but expect it to fail.

We will be no longer supported direct integration with Tensorflow. Tensorflow users are suggested to adopt the new Tokenization ops published as part of TF.Text. Those ops will soon support running pre-trained SentencePiece models. (source)

April 2019

  • Similarity metrics using the Universal Sentence Encoders from Tensorflow Hub have been added. Use the "Similarity Metric" dropdown menu to switch between models.

  • All USE models support English, but only the Xling variant supports Japanese and Chinese.

  • A Dockerfile (Dockerfile.cpu) has been added for easier reproduction. Because the "base" model only supports the CPU version of Tensorflow, we don't provide a GPU version of the Dockerfile at this moment.

  • Use spaCy to segment sentences in English texts (a short sketch follows below).
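
For illustration, here is a minimal sketch of spaCy-based sentence segmentation. It is not the exact code in text_cleaning_en.py, and it assumes the en_core_web_sm model has been installed (python -m spacy download en_core_web_sm):

```python
# Minimal sketch of sentence segmentation with spaCy (illustrative only).
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def split_sentences(text: str) -> list:
    """Return the sentences of an English text as a list of strings."""
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents]

print(split_sentences(
    "TextRank builds a graph of sentences. It then ranks them with PageRank."
))
```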

Usage

WARNING: In its current state, the backend does almost no input validation. Please do not expect it to be of production quality.

Docker (Recommended)

Two options for you:

  1. Dockerfile.cpu: No LASER support. Classic TextRank, USE-base, and USE-large work.
  2. Dockerfile.laser_cpu: Classic TextRank, USE-base, USE-large, and LASER work.

(USE-xling probably won't work in either case, for the reasons described in the June 2019 update log.)

Build the docker image using:

docker build -t <name> -f <Dockerfile.cpu or Dockerfile.laser_cpu> .

Start a container using:

docker run -u 1000:1000 --rm -ti -p 8000:8000 -e TFHUB_CACHE_DIR=/some/path/tf_hub_cache/ -e BAIDU_APP_ID=<ID> -e BAIDU_APP_KEY=<key> -e BAIDU_SECRET_KEY=<secret> -v /some/path:/some/path <name>

If you're not feeding Chinese text to the server, you can skip the BAIDU-related environment variables. Setting TFHUB_CACHE_DIR is recommended so the models don't have to be downloaded every time you start a new container.

Visit http://localhost:8000 in your browser.

Local Python Environment

  • This project uses Starlette (a lightweight ASGI framework/toolkit), so Python 3.6+ is required.

  • Install dependencies by running pip install -r requirements.txt.

  • Start the demo server by running python demo.py, and then visit http://localhost:8000 in your browser.

(Depending on your Python setup, you might need to replace pip with pip3, and python with python3.)

API server

There's a very simple script (api.py) that creates an API server using FastAPI. It might be a good starting point for you to expand upon (that's what I did to create a private TextRank API server).
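
If you'd rather start from scratch, a minimal sketch of such a server could look like the following. This is not the actual interface of api.py; the /summarize route, the request fields, and the use of summa's summarize function are illustrative assumptions:

```python
# Hypothetical minimal FastAPI wrapper around a TextRank summarizer.
# Not the actual api.py; the route and fields below are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel
from summa.summarizer import summarize  # assumes the summa package is installed

app = FastAPI()

class SummarizeRequest(BaseModel):
    text: str
    ratio: float = 0.2  # fraction of top-ranked sentences to keep

class SummarizeResponse(BaseModel):
    summary: str

@app.post("/summarize", response_model=SummarizeResponse)
def summarize_endpoint(req: SummarizeRequest) -> SummarizeResponse:
    # summa returns the top-ranked sentences joined into a single string
    return SummarizeResponse(summary=summarize(req.text, ratio=req.ratio))

# Run with, e.g.: uvicorn api_sketch:app --port 8000
```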

Languages supported

English

Demo: A static snapshot with an example from Wikipedia.

This largely depends on language preprocessing functions and classes from summanlp/textrank. This project just exposes some of their internal data.

According to summanlp/textrank, you can install an extra library to improve keyword extraction:

For a better performance of keyword extraction, install Pattern.

From a quick glance at the source code, it seems to be using Pattern (if available) to do some POS tag filtering.
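
For a rough idea of what that library provides on its own, here is a minimal sketch of calling summa directly (assuming pip install summa; whether Pattern is installed only affects keyword quality, not these calls):

```python
# Minimal sketch of using summanlp/textrank (the summa package) directly.
from summa import keywords
from summa.summarizer import summarize

text = (
    "Automatic summarization is the process of shortening a text document "
    "with software. TextRank is a graph-based ranking model for text "
    "processing. It can be used for keyword and sentence extraction."
)

print(summarize(text, ratio=0.5))        # top-ranked sentences
print(keywords.keywords(text, words=5))  # top-ranked keywords
```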

Chinese

Demo: A static snapshot with an example from a news article.

This project uses Baidu's free NLP API to do word segmentation and POS tagging. You need to create an account there, install the Python SDK, and set the following environment variables:

  • BAIDU_APP_ID
  • BAIDU_APP_KEY
  • BAIDU_SECRET_KEY
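
As a rough sketch (not this project's exact code), the credentials can be read from the environment and passed to the Baidu Python SDK, whose lexer call returns word segmentation plus POS tags; treat the exact calls below as assumptions about the SDK rather than a reference:

```python
# Rough sketch: read Baidu credentials from the environment and call the
# Baidu NLP SDK's lexer for Chinese word segmentation and POS tagging.
# (Assumption: the aip package / bundled baidu_sdk exposes AipNlp.lexer.)
import os
from aip import AipNlp

client = AipNlp(
    os.environ["BAIDU_APP_ID"],
    os.environ["BAIDU_APP_KEY"],
    os.environ["BAIDU_SECRET_KEY"],
)

result = client.lexer("百度是一家高科技公司")
for item in result.get("items", []):
    print(item["item"], item["pos"])
```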

You can of course use other offline NLP tools instead. Please refer to test_text_cleaning_zh.py for information on the data structures expected by the main function.

Traditional Chinese will be converted to Simplified Chinese due to restrictions of the Baidu API.

Japanese

Demo: A static snapshot with an example from a news article.

It uses nagisa to do word segmentation and POS tagging. There are some Japanese peculiarities that make it a bit tricky, and I had to add a few stopwords to get more reasonable results for the demo. Obviously there is much room for improvement here.
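
As a rough illustration (not the exact code in text_cleaning_ja.py), segmentation and POS tagging with nagisa look like this; the noun filtering at the end is just one example of the kind of POS-based filtering you might add:

```python
# Rough illustration of nagisa-based segmentation and POS tagging
# (illustrative only; not the exact code in text_cleaning_ja.py).
import nagisa

text = "自然言語処理の研究は面白いです。"

tokens = nagisa.tagging(text)
print(tokens.words)    # segmented words
print(tokens.postags)  # one POS tag per word

# Keep only nouns, e.g. as keyword candidates.
nouns = nagisa.extract(text, extract_postags=["名詞"])
print(nouns.words)
```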

Snapshots

English

highlights

sentence network

word network

Chinese

highlights

sentence network

word network

Japanese

highlights

sentence network

word network
