Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fetch_url doesn't return RawResponse and doesn't provide access to response code #322

Closed
edkrueger opened this issue Apr 10, 2023 · 10 comments · Fixed by #497
Closed

fetch_url doesn't return RawResponse and doesn't provide access to response code #322

edkrueger opened this issue Apr 10, 2023 · 10 comments · Fixed by #497
Labels
documentation Docs in need of update or extension enhancement New feature or request
Milestone

Comments

@edkrueger
Copy link
Contributor

The documentation says that fetch_url returns a RawResponse with headers, body and status code. However, when passed decode=True (the default), it actually returns either the HTML as a string or None. What appears to be happening is that both _send_request and _send_pycurl_request return a RawResponse, but _handle_response converts the return to str or None when decode=True.

The reason I think this is important is that I think the automatic decoding is a nice feature, but I'd also like to be able to see the status code saved.

Proposed solution:

I think it would be nice to guarantee that the return type of this function is consistent but also accommodate the decode=True option.

So, I propose that:

  • RawResponse be replaced with a class Response with fields data, status, url and html fields and default values of None.
  • When decode=False, Response should have the field html be None and if decode=True the html field should hold the HTML as string.
  • _handle_response should both take and return a Response returning a copy with Response.html set to the HTML as a string when decode=True
  • Make a bool method that returns self.data == None to preserve truthy/false checks
  • For convenience, and easy serialization, Response should have an as_dict method (like in Document)
  • Update the documentation of fetch_url to say that it returns a Response.

I realize this change could break some downstream code, but I think it would also make fetch_url more useful.

Let me know what you think!

@adbar adbar added the documentation Docs in need of update or extension label Apr 11, 2023
@adbar
Copy link
Owner

adbar commented Apr 11, 2023

Hi @edkrueger, thanks for your feedback. I agree there is something missing here but in my opinion it's more a documentation issue.

Rather than risking to break existing code I'd prefer to explain how to use RawResponse in order to preserve necessary information. To sum up, we have two different modes:

  • expert: RawResponse
  • seamless with decode=True (it may have been a debatable design decision that the return values are not consistent though)

We could break things in expert mode (i.e. change the class and its attributes) and extend the docs, does that work for you and do you see a way to move on?

@edkrueger
Copy link
Contributor Author

edkrueger commented Apr 11, 2023

Hey @adbar, I think the mixed returns add some complexity to using the tool. For example, consider someone has written code in the "expert mode" but then decides to use decode=True. Changing that flag would break their code.

Also, getting the status code along with the HTML seems essential to most analyses. For example, you want to filter out cases where you get the HTML for a 404 page.

In order to avoid breaking anything in the external interface, here is an idea:

  • let's leave fetch_url function's external interface alone, document it and add a deprecation warning.
  • Make a new "easy mode" function that returns the HTML and status code as a tuple, so you can do html, status_code = easy_fetch(url).

@adbar
Copy link
Owner

adbar commented Apr 12, 2023

The problem is that fetch_url() is mentioned in various places on the Internet (not just in the docs) with its current behavior and people are going to wonder why it doesn't work anymore (especially beginners). Even if a deprecation is nice we could maybe work on the more advanced functionality first.

I believe it's easier to make construction work on parts for which developers are likely to read the docs or are ready for a bit of tinkering:

  1. Let's assume your easy_fetch suggestion is actually the current fetch_url(decode=True), albeit without status code
  2. From there we could start to deprecate fetch_url(decode=False) and create a new function like expert_fetch() (there might be a better name)
  3. The latter would offer the functionality you mention: Response object with data, status, url, and html (conditional on value of decode argument), as well as as_dict method or argument, and even more since we're at it
  4. Since _handle_response is clearly designed for internal use it shouldn't be troublesome to update it
  5. Anybody able to catch and understand the deprecation notice in (2) will likely be happy about the new, more configurable function

What do you think?

@adbar adbar added the enhancement New feature or request label Apr 12, 2023
@adbar
Copy link
Owner

adbar commented May 3, 2023

@edkrueger What do you think and do you have time to work on the PR?

@adbar adbar added this to the v2.0 milestone Jun 20, 2023
@robertour
Copy link

My grain of sand. I would suggest the names fetch_content(url) and fetch_response(url), and down the line, deprecate fetch_url.

@adbar
Copy link
Owner

adbar commented Jan 17, 2024

To sum up, here is what I'd suggest in order to implement useful changes step-by-step:

  1. Create a new function to replace the decode=False behavior, say fetch_response()
    hint: a new function prevents us from breaking the existing one
  2. Add a deprecation warning for fetch_url(decode=False)
    hint: just this one first, we can see the rest later if needed
  3. Change RawResponse to Response class with additional fields and update _handle_response() accordingly
    hint: we can break things here
  4. Add helpers to new function or class (bool, as_dict, etc.)
  5. Update the docs

If we agree on most of the sequence we could start working on it. Pull requests are welcome.

@robertour
Copy link

Sounds good, I am only missing the two versions of fetch. I like the idea of separating the simple fetch (just content) from the advanced fetch (full response). Many users would prefer to use the simple fetch.

(not sure about the RawResponse and Response, so I am not commenting on that one)

@adbar
Copy link
Owner

adbar commented Jan 17, 2024

In the end fetch_url() would remain (as easy mode, without args). The advanced mode would be available as fetch_response().

In a second step we could also change that but users proficient enough to be interested in status codes would probably just use fetch_response() and that would be it. So continuity would be ensured.

@adbar
Copy link
Owner

adbar commented Jan 18, 2024

Ongoing work is in #479.

@adbar
Copy link
Owner

adbar commented Jan 19, 2024

Steps 1 to 3 are now implemented. Feel free to provide feedback or additional functionality with a PR.

I'll leave this thread open at least until the next release (and docs update).

@adbar adbar linked a pull request Feb 2, 2024 that will close this issue
@adbar adbar closed this as completed in #497 Feb 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Docs in need of update or extension enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants