Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: convert readme to markdown #147

Merged
merged 6 commits into from
Apr 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
182 changes: 182 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,182 @@
# Htmldate: Find the Publication Date of Web Pages

[![Python package](https://img.shields.io/pypi/v/htmldate.svg)](https://pypi.python.org/pypi/htmldate)
[![Python versions](https://img.shields.io/pypi/pyversions/htmldate.svg)](https://pypi.python.org/pypi/htmldate)
[![Documentation Status](https://readthedocs.org/projects/htmldate/badge/?version=latest)](https://htmldate.readthedocs.org/en/latest/?badge=latest)
[![Code Coverage](https://img.shields.io/codecov/c/github/adbar/htmldate.svg)](https://codecov.io/gh/adbar/htmldate)
[![Downloads](https://img.shields.io/pypi/dm/htmldate?color=informational)](https://pepy.tech/project/htmldate)
[![JOSS article reference DOI: 10.21105/joss.02439](https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen)](https://doi.org/10.21105/joss.02439)

<br/>

<img src="https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-logo.png" alt="Logo as PNG image" width="60%"/>

<br/>

Find **original and updated publication dates** of any web page. **On
the command-line or with Python**, all the steps needed from web page
download to HTML parsing, scraping, and text analysis are included. The
package is used in production on millions of documents and integrated by
[multiple
libraries](https://github.com/adbar/htmldate/network/dependents).


## In a nutshell

<br/>

<img src="https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-demo.gif" alt="Demo as GIF image" width="80%"/>

<br/>

### With Python

``` python
>>> from htmldate import find_date
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'2016-12-23'
```

### On the command-line

``` bash
$ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
'2016-12-23'
```

## Features

- Flexible input: URLs, HTML files, or HTML trees can be used as input
(including batch processing).
- Customizable output: Any date format (defaults to [ISO 8601
YMD](https://en.wikipedia.org/wiki/ISO_8601)).
- Detection of both original and updated dates.
- Multilingual.
- Compatible with all recent versions of Python.

### How it works

Htmldate operates by sifting through HTML markup and if necessary text
elements. It features the following heuristics:

1. **Markup in header**: Common patterns are used to identify relevant
elements (e.g. `link` and `meta` elements) including [Open Graph
protocol](http://ogp.me/) attributes.
2. **HTML code**: The whole document is searched for structural markers
like `abbr` or `time` elements and a series of attributes (e.g.
`postmetadata`).
3. **Bare HTML content**: Heuristics are run on text and markup:
- In `fast` mode the HTML page is cleaned and precise patterns are
targeted.
- In `extensive` mode all potential dates are collected and a
disambiguation algorithm determines the best one.

Finally, the output is validated and converted to the chosen format.

## Performance

1000 web pages containing identifiable dates (as of 2023-11-13 on Python 3.10)

| Python Package | Precision | Recall | Accuracy | F-Score | Time |
| -------------- | --------- | ------ | -------- | ------- | ---- |
| articleDateExtractor 0.20 | 0.803 | 0.734 | 0.622 | 0.767 | 5x |
| date_guesser 2.1.4 | 0.781 | 0.600 | 0.514 | 0.679 | 18x |
| goose3 3.1.17 | 0.869 | 0.532 | 0.493 | 0.660 | 15x |
| htmldate\[all\] 1.6.0 (fast) | **0.883** | 0.924 | 0.823 | 0.903 | **1x** |
| htmldate\[all\] 1.6.0 (extensive) | 0.870 | **0.993** | **0.865** | **0.928** | 1.7x |
| newspaper3k 0.2.8 | 0.769 | 0.667 | 0.556 | 0.715 | 15x |
| news-please 1.5.35 | 0.801 | 0.768 | 0.645 | 0.784 | 34x |

For the complete results and explanations see [evaluation
page](https://htmldate.readthedocs.io/en/latest/evaluation.html).

## Installation

Htmldate is tested on Linux, macOS and Windows systems, it is compatible
with Python 3.6 upwards. It can notably be installed with `pip` (`pip3`
where applicable) from the PyPI package repository:

- `pip install htmldate`
- (optionally) `pip install htmldate[speed]`

## Documentation

For more details on installation, Python & CLI usage, **please refer to
the documentation**:
[htmldate.readthedocs.io](https://htmldate.readthedocs.io/)

## License

This package is distributed under the [Apache 2.0
license](https://www.apache.org/licenses/LICENSE-2.0.html).

Versions prior to v1.8.0 are under GPLv3+ license.

## Author

This project is part of methods to derive information from web documents
in order to build [text databases for
research](https://www.dwds.de/d/k-web) (chiefly linguistic analysis and
natural language processing).

Extracting and pre-processing web texts to meet the exacting standards
is a significant challenge. It is often not possible to reliably
determine the date of publication or modification using either the URL
or the server response. For more information:

[![JOSS article reference DOI: 10.21105/joss.02439](https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen)](https://doi.org/10.21105/joss.02439)
[![Zenodo archive DOI: 10.5281/zenodo.3459599](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.3459599-blue)](https://doi.org/10.5281/zenodo.3459599)


``` shell
@article{barbaresi-2020-htmldate,
title = {{htmldate: A Python package to extract publication dates from web pages}},
author = "Barbaresi, Adrien",
journal = "Journal of Open Source Software",
volume = 5,
number = 51,
pages = 2439,
url = {https://doi.org/10.21105/joss.02439},
publisher = {The Open Journal},
year = 2020,
}
```

- Barbaresi, A. \"[htmldate: A Python package to extract publication
dates from web pages](https://doi.org/10.21105/joss.02439)\",
Journal of Open Source Software, 5(51), 2439, 2020. DOI:
10.21105/joss.02439
- Barbaresi, A. \"[Generic Web Content Extraction with Open-Source
Software](https://hal.archives-ouvertes.fr/hal-02447264/document)\",
Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
- Barbaresi, A. \"[Efficient construction of metadata-enhanced web
corpora](https://hal.archives-ouvertes.fr/hal-01371704v2/document)\",
Proceedings of the [10th Web as Corpus Workshop
(WAC-X)](https://www.sigwac.org.uk/wiki/WAC-X), 2016.

You can contact me via my [contact page](https://adrien.barbaresi.eu/)
or [GitHub](https://github.com/adbar).

## Contributing

[Contributions](https://github.com/adbar/htmldate/blob/master/CONTRIBUTING.md)
are welcome as well as issues filed on the [dedicated
page](https://github.com/adbar/htmldate/issues).

Special thanks to the
[contributors](https://github.com/adbar/htmldate/graphs/contributors)
who have submitted features and bugfixes!

## Acknowledgements

Kudos to the following software libraries:

- [lxml](http://lxml.de/),
[dateparser](https://github.com/scrapinghub/dateparser)
- A few patterns are derived from the
[python-goose](https://github.com/grangier/python-goose),
[metascraper](https://github.com/ianstormtaylor/metascraper),
[newspaper](https://github.com/codelucas/newspaper) and
[articleDateExtractor](https://github.com/Webhose/article-date-extractor)
libraries. This module extends their coverage and robustness
significantly.
192 changes: 0 additions & 192 deletions README.rst

This file was deleted.

3 changes: 2 additions & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@

def get_long_description():
"Return the README"
with open("README.rst", "r", encoding="utf-8") as filehandle:
with open("README.md", "r", encoding="utf-8") as filehandle:
long_description = filehandle.read()
# long_description += "\n\n"
# with open("CHANGELOG.md", encoding="utf8") as f:
Expand Down Expand Up @@ -64,6 +64,7 @@ def get_version(package):
version=get_version("htmldate"),
description="Fast and robust extraction of original and updated publication dates from URLs and web pages.",
long_description=get_long_description(),
long_description_content_type="text/markdown",
classifiers=[
# As from http://pypi.python.org/pypi?%3Aaction=list_classifiers
"Development Status :: 5 - Production/Stable",
Expand Down
Loading