Article-Extractor

It is a Go package that finds the main readable content and the metadata of an HTML page. It works by removing clutter such as buttons, ads, background images, and scripts.

This package is based on [Readability.js] by [Mozilla] and on [omnivore].

For some websites, specific configuration templates are used to improve the accuracy of the extractor.

Installation

To install this package, run go get:

go get github.com/beclab/article-extractor

Usage

To get the readable content from a URL, you can use processor.ArticleReadabilityExtractor. It fetches the web page from the specified URL, checks whether it is readable, then parses the response to find the readable content.

Input parameters	Description
rawContent	raw content of the page
entryUrl	URL of the entry
feedUrl	feed URL; it can be "" if no value is available
rules	custom parsing rules
isrecommend	reserved parameter, not used yet

Output parameters	Description
content	content of the page
pureContent	pure content
publishedDate	published date, parsed by readability
image	cover image of the page
title	title of the page
author	author of the page, parsed by templates
byline	byline, parsed by readability
publishedAtTimeStamp	published timestamp, parsed by templates

To get the published date, use the publishedAtTimeStamp field first if its value is not empty; otherwise fall back to publishedDate. To get the author of the article, use the author field first if its value is not empty; otherwise fall back to byline.
