essence

An automatic web page content extractor for Kotlin and Java.

Given an HTML document, essence automatically extracts the main text content (and much more).

Try out the demo - a simple webapp to demonstrate essence.

This library is inspired by node-unfluff and its lineage.

Usage

Java

import io.github.cdimascio.essence.Essence;

EssenceResult data = Essence.extract(html);
System.out.println(data.getText());

Kotlin

val data = Essence.extract(html)
println(data.text)

See Extracted data elements for additional extracted metadata.

Install

Maven

<dependency>
  <groupId>io.github.cdimascio</groupId>
  <artifactId>essence</artifactId>
  <version>0.13.0</version>
  <type>pom</type>
</dependency>

Gradle

implementation 'io.github.cdimascio:essence:0.13.0'

Try the Essence web demo

Essence web is a simple web page that fetches content at a given url and passes the HTML to this essence library.

The essence web project lives here
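
The flow described above (fetch a page, then hand its HTML to an extractor) can be sketched roughly as follows. This is an illustration only, not the demo's actual code: the class `FetchAndExtract` and its methods `fetchHtml` and `pipeline` are hypothetical names, and the extractor is passed in as a plain function so the sketch compiles without essence on the classpath. It assumes Java 11 or later for `java.net.http.HttpClient`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Function;

// Hypothetical helper; not part of essence or the essence web project.
class FetchAndExtract {

    // Downloads the HTML at `url` with the JDK HTTP client, following redirects.
    // Checked exceptions are wrapped so the method can be used as a Function.
    static String fetchHtml(String url) {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        try {
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching " + url, e);
        }
    }

    // Composes "URL -> HTML" with "HTML -> extracted text" into "URL -> extracted text".
    static Function<String, String> pipeline(Function<String, String> fetcher,
                                             Function<String, String> extractor) {
        return fetcher.andThen(extractor);
    }
}
```

With essence installed, the extractor argument would presumably be `html -> Essence.extract(html).getText()`, the call shown under Usage, and the fetcher would be `FetchAndExtract::fetchHtml`.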

Extracted data elements

essence attempts to extract the following content:

title - The document's title
softTitle - A version of title with less truncation
date - The document's publication date
copyright - The document's copyright line, if present
author - The document's author
publisher - The document's publisher (website name)
text - The main text of the document with all the junk thrown away
image - The main image for the document (what's used by Facebook, etc.)
videos (coming soon) - An array of videos that were embedded in the article. Each video has src, width and height.
tags - Any tags or keywords that could be found by checking <rel> tags or by looking at href urls.
canonicalLink - The canonical url of the document, if given.
lang - The language of the document, either detected or supplied by you.
description - The description of the document, from <meta> tags
favicon - The url of the document's favicon.
links - An array of links embedded within the article text. (text and href for each)

Credits

node-unfluff by https://github.com/ageitgey
python-goose by Xavier Grangier
goose by Gravity Labs

License

Apache 2.0

Contributors ✨

Thanks go to these wonderful people (emoji key):

Clément P. 💻

This project follows the all-contributors specification. Contributions of any kind welcome!
