fetch_url doesn't return RawResponse and doesn't provide access to response code #322

edkrueger · 2023-04-10T23:22:56Z

The documentation says that fetch_url returns a RawResponse with headers, body and status code. However, when passed decode=True (the default), it actually returns either the HTML as a string or None. What appears to be happening is that both _send_request and _send_pycurl_request return a RawResponse, but _handle_response converts the return to str or None when decode=True.

The reason I think this is important is that I think the automatic decoding is a nice feature, but I'd also like to be able to see the status code saved.

Proposed solution:

I think it would be nice to guarantee that the return type of this function is consistent but also accommodate the decode=True option.

So, I propose that:

RawResponse be replaced with a class Response with fields data, status, url and html fields and default values of None.
When decode=False, Response should have the field html be None and if decode=True the html field should hold the HTML as string.
_handle_response should both take and return a Response returning a copy with Response.html set to the HTML as a string when decode=True
Make a bool method that returns self.data == None to preserve truthy/false checks
For convenience, and easy serialization, Response should have an as_dict method (like in Document)
Update the documentation of fetch_url to say that it returns a Response.

I realize this change could break some downstream code, but I think it would also make fetch_url more useful.

Let me know what you think!

The text was updated successfully, but these errors were encountered:

adbar · 2023-04-11T12:05:25Z

Hi @edkrueger, thanks for your feedback. I agree there is something missing here but in my opinion it's more a documentation issue.

Rather than risking to break existing code I'd prefer to explain how to use RawResponse in order to preserve necessary information. To sum up, we have two different modes:

expert: RawResponse
seamless with decode=True (it may have been a debatable design decision that the return values are not consistent though)

We could break things in expert mode (i.e. change the class and its attributes) and extend the docs, does that work for you and do you see a way to move on?

edkrueger · 2023-04-11T16:16:39Z

Hey @adbar, I think the mixed returns add some complexity to using the tool. For example, consider someone has written code in the "expert mode" but then decides to use decode=True. Changing that flag would break their code.

Also, getting the status code along with the HTML seems essential to most analyses. For example, you want to filter out cases where you get the HTML for a 404 page.

In order to avoid breaking anything in the external interface, here is an idea:

let's leave fetch_url function's external interface alone, document it and add a deprecation warning.
Make a new "easy mode" function that returns the HTML and status code as a tuple, so you can do html, status_code = easy_fetch(url).

adbar · 2023-04-12T10:10:46Z

The problem is that fetch_url() is mentioned in various places on the Internet (not just in the docs) with its current behavior and people are going to wonder why it doesn't work anymore (especially beginners). Even if a deprecation is nice we could maybe work on the more advanced functionality first.

I believe it's easier to make construction work on parts for which developers are likely to read the docs or are ready for a bit of tinkering:

Let's assume your easy_fetch suggestion is actually the current fetch_url(decode=True), albeit without status code
From there we could start to deprecate fetch_url(decode=False) and create a new function like expert_fetch() (there might be a better name)
The latter would offer the functionality you mention: Response object with data, status, url, and html (conditional on value of decode argument), as well as as_dict method or argument, and even more since we're at it
Since _handle_response is clearly designed for internal use it shouldn't be troublesome to update it
Anybody able to catch and understand the deprecation notice in (2) will likely be happy about the new, more configurable function

What do you think?

adbar · 2023-05-03T12:06:14Z

@edkrueger What do you think and do you have time to work on the PR?

robertour · 2024-01-12T07:58:48Z

My grain of sand. I would suggest the names fetch_content(url) and fetch_response(url), and down the line, deprecate fetch_url.

adbar · 2024-01-17T13:43:46Z

To sum up, here is what I'd suggest in order to implement useful changes step-by-step:

Create a new function to replace the decode=False behavior, say fetch_response()
hint: a new function prevents us from breaking the existing one
Add a deprecation warning for fetch_url(decode=False)
hint: just this one first, we can see the rest later if needed
Change RawResponse to Response class with additional fields and update _handle_response() accordingly
hint: we can break things here
Add helpers to new function or class (bool, as_dict, etc.)
Update the docs

If we agree on most of the sequence we could start working on it. Pull requests are welcome.

robertour · 2024-01-17T14:02:00Z

Sounds good, I am only missing the two versions of fetch. I like the idea of separating the simple fetch (just content) from the advanced fetch (full response). Many users would prefer to use the simple fetch.

(not sure about the RawResponse and Response, so I am not commenting on that one)

adbar · 2024-01-17T17:35:48Z

In the end fetch_url() would remain (as easy mode, without args). The advanced mode would be available as fetch_response().

In a second step we could also change that but users proficient enough to be interested in status codes would probably just use fetch_response() and that would be it. So continuity would be ensured.

adbar · 2024-01-18T13:31:20Z

Ongoing work is in #479.

adbar · 2024-01-19T12:50:24Z

Steps 1 to 3 are now implemented. Feel free to provide feedback or additional functionality with a PR.

I'll leave this thread open at least until the next release (and docs update).

adbar added the documentation Docs in need of update or extension label Apr 11, 2023

adbar added the enhancement New feature or request label Apr 12, 2023

adbar mentioned this issue Apr 12, 2023

feat: add basic auth support #323

Closed

adbar added this to the v2.0 milestone Jun 20, 2023

adbar mentioned this issue Jan 11, 2024

fetch_url('spiegel.de/....') returns None #474

Closed

adbar linked a pull request Feb 2, 2024 that will close this issue

Response class: add convenience functions #497

Merged

adbar closed this as completed in #497 Feb 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fetch_url doesn't return RawResponse and doesn't provide access to response code #322

fetch_url doesn't return RawResponse and doesn't provide access to response code #322

edkrueger commented Apr 10, 2023

adbar commented Apr 11, 2023

edkrueger commented Apr 11, 2023 •

edited

Loading

adbar commented Apr 12, 2023

adbar commented May 3, 2023

robertour commented Jan 12, 2024

adbar commented Jan 17, 2024

robertour commented Jan 17, 2024

adbar commented Jan 17, 2024

adbar commented Jan 18, 2024

adbar commented Jan 19, 2024

fetch_url doesn't return RawResponse and doesn't provide access to response code #322

fetch_url doesn't return RawResponse and doesn't provide access to response code #322

Comments

edkrueger commented Apr 10, 2023

adbar commented Apr 11, 2023

edkrueger commented Apr 11, 2023 • edited Loading

adbar commented Apr 12, 2023

adbar commented May 3, 2023

robertour commented Jan 12, 2024

adbar commented Jan 17, 2024

robertour commented Jan 17, 2024

adbar commented Jan 17, 2024

adbar commented Jan 18, 2024

adbar commented Jan 19, 2024

edkrueger commented Apr 11, 2023 •

edited

Loading