boilerpipe-dotnet-core

boilerpipe-dotnet-core is a .NET Core port of the boilerpipe Java library. It provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example, news article extraction) and can easily be extended for individual problem settings.

Content extraction is very fast (on the order of milliseconds), needs only the input document (no global or site-level information is required), and is usually quite accurate.

The original boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.

The algorithms used by the library are based on, and extend, concepts from the paper "Boilerplate Detection using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010, the Third ACM International Conference on Web Search and Data Mining, in New York City, NY, USA. The paper and the presentation slides are available online. A video of the presentation is freely available on Videolectures.net (turn the speaker balance to the left to improve audio quality).
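To give a feel for what "shallow text features" means, here is a minimal, self-contained sketch that classifies text blocks using only two features: word count and link density (the share of words that sit inside links). It is a simplification for illustration, not the library's implementation. The `TextBlock` and `ShallowFeatureSketch` types and both thresholds are assumptions made up for this example and are not part of boilerpipe-dotnet-core.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration only; not the library's own block model.
public sealed class TextBlock
{
    private static readonly char[] Whitespace = { ' ', '\t', '\n', '\r' };

    public TextBlock(string text, int wordsInLinks)
    {
        Text = text;
        NumWords = text.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries).Length;
        // Clamp so that link density can never exceed 1.0.
        NumWordsInLinks = Math.Min(wordsInLinks, NumWords);
    }

    public string Text { get; }
    public int NumWords { get; }
    public int NumWordsInLinks { get; }

    // Fraction of the block's words that are anchor text.
    public double LinkDensity =>
        NumWords == 0 ? 0.0 : (double)NumWordsInLinks / NumWords;
}

public static class ShallowFeatureSketch
{
    // Illustrative thresholds chosen for this example (assumptions),
    // not values taken from the library or the paper.
    private const int MinWords = 10;
    private const double MaxLinkDensity = 0.33;

    // A block counts as content if it is long enough and not dominated by links.
    public static bool IsContent(TextBlock block) =>
        block.NumWords >= MinWords && block.LinkDensity <= MaxLinkDensity;

    // Keep only the blocks classified as content, one per line.
    public static string ExtractText(IEnumerable<TextBlock> blocks) =>
        string.Join("\n", blocks.Where(IsContent).Select(b => b.Text));
}

public static class Program
{
    public static void Main()
    {
        var blocks = new[]
        {
            // Short and made entirely of link text: typical navigation clutter.
            new TextBlock("Home News Sports Contact", wordsInLinks: 4),
            // Longer running text without links: typical article content.
            new TextBlock(
                "The city council voted on Tuesday to approve the new budget after a long debate.",
                wordsInLinks: 0),
        };

        Console.WriteLine(ShallowFeatureSketch.ExtractText(blocks));
    }
}
```

With these example thresholds, the navigation block is dropped and only the article sentence is printed. A real extractor would first segment the HTML into blocks and would typically also take neighbouring blocks into account.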

Commercial support is available through Kohlschütter Search Intelligence.

Demo

A (commercial) Web API is available here. It is hosted on Google App Engine. The underlying library, boilerpipe, is available under the Apache 2.0 license on GitHub. The web application may use a more recent version than the one released in the GitHub repository, so its results may differ slightly (hopefully for the better).

About the Author

Christian Kohlschütter did his PhD research on boilerplate removal at the L3S Research Center. His main research interests are in the areas of Web Information Retrieval and Quantitative Linguistics.

Credits

This port is based on Charalambos Theodorou's .NET Core port, which in turn is a port of Rasmus John Pedersen's port of the original boilerpipe 1.2.0 code. It is also based on Christian Kohlschütter's boilerpipe GitHub repository, which has a more recent code base (v2.0).
