@pyup-bot
This PR pins w3lib to the latest release 2.3.1.

Changelog

2.3.1

------------------

- Fix a merge error, no code changes.

2.3.0

------------------

- Dropped Python 3.8 support (232).

- Removed the following functions, deprecated in 2.0.0:

 - ``w3lib.util.str_to_unicode``
 - ``w3lib.util.to_native_str``
 - ``w3lib.util.unicode_to_str``

(235).

- Added Python 3.13 support (232).

- Fixed running tests with newer point releases of Python 3.10 and 3.11 (233).

- Cleanup and CI improvements (232, 234).
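
Code that still calls the removed ``w3lib.util`` helpers needs a replacement before upgrading. The following is a minimal stdlib-only sketch, assuming callers only need plain bytes/str conversion; the names ``to_text`` and ``to_binary`` are hypothetical and are not part of w3lib.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Return ``value`` as str, decoding bytes if needed (hypothetical helper)."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    raise TypeError(f"expected bytes or str, got {type(value).__name__}")


def to_binary(value, encoding="utf-8", errors="strict"):
    """Return ``value`` as bytes, encoding str if needed (hypothetical helper)."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding, errors)
    raise TypeError(f"expected bytes or str, got {type(value).__name__}")


print(to_text(b"caf\xc3\xa9"))
print(to_binary("caf\u00e9"))
```

On Python 3 the native string type is ``str``, so a call site that used ``to_native_str`` would map to ``to_text`` in this sketch.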

2.2.1

------------------

- :func:`~w3lib.url.canonicalize_url` no longer applies lowercase to the
userinfo URL component. (229, 230)
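
The reason this matters is that host names are case-insensitive while user names and passwords are not. A stdlib-only sketch of the idea, not w3lib's implementation; ``lowercase_host_only`` is a hypothetical name:

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_host_only(url):
    """Lowercase the host (and port) but leave the userinfo untouched."""
    parts = urlsplit(url)
    # Split on the last "@" so an "@" inside the userinfo stays with the userinfo.
    userinfo, sep, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + sep + hostport.lower()
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(lowercase_host_only("http://User:PaSS@Example.COM/Path"))
```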

2.2.0

------------------

- Dropped Python 3.7 support (214).

- Added Python 3.12 and PyPy 3.10 support (218).

- Added the description to the package metadata (227).

- Improved type hints (226).

- Added ``.readthedocs.yml`` (219).

- Updated the intersphinx URLs (224).

- Added the ``pre-commit`` configuration, code reformatted with ``black``
(220).

- Updated CI configuration (217, 227).

2.1.2

------------------

- Fix test failures on Python 3.11.4+ (212, 213).
- Fix an incorrect type hint (211).
- Add project URLs to setup.py (215).

2.1.1

------------------

- :func:`~w3lib.url.safe_url_string`, :func:`~w3lib.url.safe_download_url`
and :func:`~w3lib.url.canonicalize_url` now strip whitespace and control
characters from URLs according to the URL living standard.
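
As a rough illustration of that rule, the URL living standard strips leading and trailing C0 control characters and spaces, and removes ASCII tabs and newlines anywhere in the input. The sketch below is stdlib-only and is not w3lib's code; ``strip_url`` is a hypothetical name:

```python
# Characters U+0000 through U+0020 (C0 controls and space).
_C0_CONTROL_OR_SPACE = "".join(chr(i) for i in range(0x21))
# ASCII tab and newline characters are dropped wherever they occur.
_REMOVE_TAB_NEWLINE = str.maketrans("", "", "\t\n\r")


def strip_url(url):
    """Trim the ends of ``url`` and delete embedded tabs and newlines."""
    return url.strip(_C0_CONTROL_OR_SPACE).translate(_REMOVE_TAB_NEWLINE)


print(strip_url("  \x00http://exa\tmple.com/a\nb \r"))
```

Spaces inside the URL are left alone by this step; they are handled later by percent-encoding.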

2.1.0

------------------

-   Dropped Python 3.6 support, and made Python 3.11 support official. (195,
 200)

-   :func:`~w3lib.url.safe_url_string` now generates safer URLs.

 To make URLs safer for the `URL living standard`_:

 .. _URL living standard: https://url.spec.whatwg.org/

 -   ``;=`` are percent-encoded in the URL username.

 -   ``;:=`` are percent-encoded in the URL password.

 -   ``'`` is percent-encoded in the URL query if the URL scheme is `special
     <https://url.spec.whatwg.org/#special-scheme>`__.

 To make URLs safer for `RFC 2396`_ and `RFC 3986`_, ``|[]`` are
 percent-encoded in URL paths, queries, and fragments.

 .. _RFC 2396: https://www.ietf.org/rfc/rfc2396.txt
 .. _RFC 3986: https://www.ietf.org/rfc/rfc3986.txt

 (80, 203)

-   :func:`~w3lib.encoding.html_to_unicode` now checks for the `byte order
 mark`_ before inspecting the ``Content-Type`` header when determining the
 content encoding, in line with the `URL living standard`_. (189, 191)

 .. _byte order mark: https://en.wikipedia.org/wiki/Byte_order_mark

-   :func:`~w3lib.url.canonicalize_url` now strips spaces from the input URL,
 to be more in line with the `URL living standard`_. (132, 136)

-   :func:`~w3lib.html.get_base_url` now ignores HTML comments. (70, 77)

-   Fixed :func:`~w3lib.url.safe_url_string` re-encoding percent signs on
 the URL username and password even when they were being used as part of an
 escape sequence. (187, 196)

-   Fixed :func:`~w3lib.http.basic_auth_header` using the wrong flavor of
 base64 encoding, which could prevent authentication in rare cases. (181,
 192)

-   Fixed :func:`~w3lib.html.replace_entities` raising :exc:`OverflowError` in
 some cases due to `a bug in CPython
 <https://github.com/python/cpython/issues/76763>`__. (199, 202)

-   Improved typing and fixed typing issues. (190, 206)

-   Made CI and test improvements. (197, 198)

-   Adopted a Code of Conduct. (194)
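
The byte order mark precedence described above for :func:`~w3lib.encoding.html_to_unicode` can be sketched as follows. This is a simplified stdlib-only illustration, not the library's implementation; ``pick_encoding`` and its ``default`` parameter are hypothetical:

```python
import codecs
from typing import Optional, Tuple

# UTF-32 BOMs come first because the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def pick_encoding(
    body: bytes, header_encoding: Optional[str], default: str = "utf-8"
) -> Tuple[str, bytes]:
    """Return (encoding, body without BOM); a BOM wins over the header value."""
    for bom, encoding in _BOMS:
        if body.startswith(bom):
            return encoding, body[len(bom):]
    return (header_encoding or default), body


print(pick_encoding(codecs.BOM_UTF8 + b"hi", "latin-1"))
print(pick_encoding(b"hi", "latin-1"))
```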

2.0.1

------------------
Minor documentation fix (release date is set in the changelog).

2.0.0

------------------

Backwards incompatible changes:

- Python 2 is no longer supported; Python 3.6+ is required now (168, 175).
- :func:`w3lib.url.safe_url_string` and :func:`w3lib.url.canonicalize_url`
no longer convert "%23" to "#" when it appears in the URL path. This is a bug
fix. It is listed as a backward-incompatible change because in some cases the
output of :func:`w3lib.url.canonicalize_url` is going to change, and so, if
this output is used to generate URL fingerprints, new fingerprints might be
incompatible with those created with previous w3lib versions
(141).
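
To see why unescaping ``%23`` in a path is a bug rather than a harmless normalization, compare how the two forms parse. This uses only ``urllib.parse`` from the standard library and does not call w3lib:

```python
from urllib.parse import urlsplit

escaped = urlsplit("http://example.com/a%23b")
literal = urlsplit("http://example.com/a#b")

# With %23 the whole "a%23b" segment is part of the path.
print(escaped.path, repr(escaped.fragment))
# With a literal "#" the path ends at "a" and "b" becomes the fragment.
print(literal.path, repr(literal.fragment))
```

The two URLs identify different resources, so fingerprints computed from the old output could collide where they should not.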

Deprecation removals (169):

- The ``w3lib.form`` module is removed.
- The ``w3lib.html.remove_entities`` function is removed.
- The ``w3lib.url.urljoin_rfc`` function is removed.

The following functions are deprecated, and will be removed in future releases
(170):

- ``w3lib.util.str_to_unicode``
- ``w3lib.util.unicode_to_str``
- ``w3lib.util.to_native_str``

Other improvements and bug fixes:

- Type annotations are added (172, 184).
- Added support for Python 3.9 and 3.10 (168, 176).
- Fixed :func:`w3lib.html.get_meta_refresh` for ``<meta>`` tags where
``http-equiv`` is written after ``content`` (179).
- Fixed :func:`w3lib.url.safe_url_string` for IDNA domains with ports (174).
- :func:`w3lib.url.url_query_cleaner` no longer adds an unneeded ``#`` when
``keep_fragments=True`` is passed, and the URL doesn't have a fragment
(159).
- Removed a workaround for an ancient pathname2url bug (142)
- CI is migrated to GitHub Actions (166, 177); other CI improvements (160,
182).
- The code is formatted using black (173).

1.22.0

-------------------

- Python 3.4 is no longer supported (issue 156)
- :func:`w3lib.url.safe_url_string` now supports an optional ``quote_path``
parameter to disable the percent-encoding of the URL path (issue 119)
- :func:`w3lib.url.add_or_replace_parameter` and
:func:`w3lib.url.add_or_replace_parameters` no longer remove duplicate
parameters from the original query string that are not being added or
replaced (issue 126)
- :func:`w3lib.html.remove_tags` now raises a :exc:`ValueError` exception
instead of :exc:`AssertionError` when using both the ``which_ones`` and the
``keep`` parameters (issue 154)
- Test improvements (issues 143, 146, 148, 149)
- Documentation improvements (issues 140, 144, 145, 151, 152, 153)
- Code cleanup (issue 139)
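
The duplicate-preserving behaviour described above for :func:`w3lib.url.add_or_replace_parameter` can be approximated with the standard library. This is a sketch under assumptions, not w3lib's code: ``add_or_replace`` is a hypothetical name, and collapsing repeated occurrences of the replaced parameter into one is a choice made here for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_or_replace(url, name, value):
    """Set ``name=value`` in the query, leaving unrelated duplicates alone."""
    parts = urlsplit(url)
    result, replaced = [], False
    for key, old_value in parse_qsl(parts.query, keep_blank_values=True):
        if key != name:
            result.append((key, old_value))  # untouched, duplicates included
        elif not replaced:
            result.append((key, value))  # replace the first occurrence in place
            replaced = True
    if not replaced:
        result.append((name, value))  # parameter was absent: append it
    return urlunsplit(parts._replace(query=urlencode(result)))


print(add_or_replace("http://example.com/?a=1&a=2&b=3", "b", "9"))
```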

1.21.0

-------------------

- Add the ``encoding`` and ``path_encoding`` parameters to
:func:`w3lib.url.safe_download_url` (issue 118)
- :func:`w3lib.url.safe_url_string` now also removes tabs and new lines
(issue 133)
- :func:`w3lib.html.remove_comments` now also removes truncated comments
(issue 129)
- :func:`w3lib.html.remove_tags_with_content` no longer removes tags which
start with the same text as one of the specified tags (issue 114)
- Recommend pytest instead of nose to run tests (issue 124)

1.20.0

-------------------

- Fix url_query_cleaner to not append "?" to URLs without a query string (issue 109)
- Add support for Python 3.7 and drop Python 3.3 (issue 113)
- Add `w3lib.url.add_or_replace_parameters` helper (issue 117)
- Documentation fixes (issue 115)

1.19.0

-------------------

- Add a workaround for CPython segfault (https://bugs.python.org/issue32583)
which affects w3lib.encoding functions. This is technically **backwards
incompatible** because it changes the way non-decodable bytes are replaced
(in some cases instead of two ``\ufffd`` chars you can get one).
As a side effect, the fix speeds up decoding in Python 3.4+.
- Add 'encoding' parameter for w3lib.http.basic_auth_header.
- Fix pypy testing setup, add pypy3 to CI.
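
For readers unfamiliar with what the ``encoding`` parameter of ``w3lib.http.basic_auth_header`` influences, here is a stdlib-only sketch of building a Basic credentials value. It is not the library's code: ``basic_auth_value`` is a hypothetical name, and the ISO-8859-1 default is an assumption taken from the 1.4 entry in this changelog.

```python
import base64


def basic_auth_value(username, password, encoding="ISO-8859-1"):
    """Build the value of an HTTP ``Authorization: Basic`` header as bytes."""
    credentials = f"{username}:{password}".encode(encoding)
    # Standard base64 ("+" and "/"), not the URL-safe alphabet ("-" and "_").
    return b"Basic " + base64.b64encode(credentials)


print(basic_auth_value("someuser", "somepass"))
```

The choice of alphabet only shows up when the encoded credentials happen to contain ``+`` or ``/``, which is why a wrong base64 flavor can go unnoticed for a long time.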

1.18.0

-------------------

- Include additional assets used for distribution packages in the source tarball
- Consider ``[`` and ``]`` as safe characters in path and query components
of URLs, i.e. they are not escaped anymore
- Disable codecov project coverage check

1.17.0

-------------------

- Add Python 3.5 and 3.6 support
- Add ``w3lib.url.parse_data_uri`` helper for parsing "data:" URIs
- Add ``w3lib.html.strip_html5_whitespace`` function to strip leading and
trailing whitespace as per W3C recommendations, e.g. for cleaning
"href" attribute values
- Fix ``w3lib.http.headers_raw_to_dict`` for multiple headers with same name
- Do not distribute tests/test_*.pyc artifacts
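
The point of a dedicated ``strip_html5_whitespace`` function is that HTML5 defines a narrower whitespace set than Python's ``str.strip()``. A sketch of the idea, with the character set written out as an assumption based on the HTML specification's ASCII whitespace definition; ``strip_html5_ws`` is a hypothetical name:

```python
# Space, tab, line feed, carriage return and form feed.
HTML5_WHITESPACE_CHARS = " \t\n\r\x0c"


def strip_html5_ws(text):
    """Strip only HTML5 whitespace from both ends of ``text``."""
    return text.strip(HTML5_WHITESPACE_CHARS)


href = " \n http://example.com/a b \t"
print(repr(strip_html5_ws(href)))
# A no-break space is kept here, whereas str.strip() would remove it.
print(repr(strip_html5_ws("\xa0x ")), repr("\xa0x ".strip()))
```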

1.16.0

-------------------

- ``canonicalize_url()`` and ``safe_url_string()``:
strip ":" when no port is specified (as per `RFC 3986`_;
see also https://github.com/scrapy/scrapy/issues/2377)
- ``url_query_cleaner()``: support new ``keep_fragments`` argument
(defaulting to ``False``)
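
The effect of ``keep_fragments`` can be illustrated with a small stdlib-only query cleaner. This is a sketch, not ``url_query_cleaner`` itself; ``clean_query`` and its ``keep`` parameter are hypothetical:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def clean_query(url, keep, keep_fragments=False):
    """Keep only the query parameters named in ``keep``; optionally keep the fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keep
    ]
    fragment = parts.fragment if keep_fragments else ""
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), fragment))


url = "http://example.com/p?id=200&foo=bar#frag"
print(clean_query(url, {"id"}))
print(clean_query(url, {"id"}, keep_fragments=True))
```

Because ``urlunsplit`` omits empty components, a URL with no query and no fragment comes back without a trailing ``?`` or ``#``.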

1.15.0

-------------------

- Add ``canonicalize_url()`` to ``w3lib.url``

1.14.3

-------------------

Bugfix release:

- Handle IDNA encoding failures in ``safe_url_string()`` (issue 62)

1.14.2

-------------------

Bugfix release:

- fix function import for (deprecated) ``urljoin_rfc`` (issue 51)
- only expose wanted functions from ``w3lib.url``, via ``__all__``
(see issue 54, https://github.com/scrapy/scrapy/issues/1917)

1.14.1

-------------------

Bugfix release:

- For bytes URLs, when the supplied encoding (or the default, UTF-8) is wrong,
``safe_url_string`` falls back to percent-encoding offending bytes.

1.14.0

-------------------

Changes to safe_url_string:

- proper handling of non-ASCII characters in Python2 and Python3
- support IDNs
- new `path_encoding` to override default UTF-8 when serializing non-ASCII
characters before percent-encoding
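
The two ideas in this list, IDN host names and a configurable path encoding, can be demonstrated with the standard library alone. This sketch does not call ``safe_url_string``; it only shows the underlying transformations, and the example host and path are made up:

```python
from urllib.parse import quote

# Internationalized domain names are converted to an ASCII "xn--" form.
host = "b\u00fccher.example".encode("idna").decode("ascii")
print(host)

# The same non-ASCII path character yields different escapes per encoding.
path = "/caf\u00e9"
utf8_path = quote(path)  # UTF-8 is the default
latin1_path = quote(path, encoding="latin-1")
print(utf8_path, latin1_path)
```

A server that expects Latin-1 paths would not recognize the UTF-8 escapes, which is the situation a ``path_encoding`` override addresses.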

html_body_declared_encoding also detects the encoding when it is not the sole
attribute in ``<meta>``.

Package is now properly marked as ``zip_safe``.

1.13.0

-------------------

- remove_tags removes uppercase tags as well;
- ignore meta-redirects inside script or noscript tags by default,
but add an option to not ignore them;
- replace_entities now handles entities without trailing semicolon;
- fixed uncaught UnicodeDecodeError when decoding entities.

1.12.0

-------------------

- meta_refresh regex now handles leading newlines and whitespace in the url;
- include tests folder in source distribution.

1.11.0

-------------------

- url_query_cleaner now supports str or list parameters;
- add support for resolving base URLs in <base> tags with attributes
before href.

1.10.0

-------------------

- reverted all 1.9.0 changes.

1.9.0

------------------

- all url-related functions accept bytes and unicode and now return bytes.

1.8.1

------------------

- w3lib.http.basic_auth_header now returns bytes

1.8.0

------------------

- add support for big5-hkscs encoding.

1.7.1

------------------

- PY3 fixed headers_raw_to_dict and headers_dict_to_raw;
- documentation improvements;
- provide wheels.

1.6

----------------

- w3lib.form.encode_multipart is deprecated;
- docstrings and docs are improved;
- w3lib.url.add_or_replace_parameter is re-implemented on top of
stdlib functions;
- remove_entities is renamed to replace_entities.

1.5

----------------

- Python 2.6 support is dropped.

1.4

----------------

- Python 3 support;
- get_meta_refresh encoding handling is fixed;
- check for '?' in add_or_replace_parameter;
- ISO-8859-1 is used for HTTP Basic Auth;
- fixed unicode handling in replace_escape_chars;

1.3

----------------

- support non-standard gb_2312_80 encoding;
- drop Python 2.5 support.

1.2

----------------

- Detect encoding for content attr before http-equiv in meta tag.

1.1

----------------

- w3lib.html.remove_comments handles multiline comments;
- Added w3lib.encoding module, containing functions for working with character
encoding, like encoding autodetection from HTML pages.
- w3lib.url.urljoin_rfc is deprecated.

1.0

----------------

First release of w3lib.