Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot fetch url with more than 2 redirections #450

Closed
julienlambert42 opened this issue Dec 1, 2023 · 1 comment · Fixed by #461
Closed

Cannot fetch url with more than 2 redirections #450

julienlambert42 opened this issue Dec 1, 2023 · 1 comment · Fixed by #461
Labels
enhancement New feature or request

Comments

@julienlambert42
Copy link

julienlambert42 commented Dec 1, 2023

Hello,

The fetch_url() function returns None when a redirection url is used as argument.

For example :
Redirecting url returns None because this url redirects to Actual url

The requests.get(url).text function from requests library allow redirections.

--> Is this behaviour done on purpose (for security reasons for example) ?
--> If yes, could you add an argument to the function to allow redirections (actually it would even be better to have a num_redirection argument set to 0 by default, so we can set a threshold to avoid scam redirections)

Thanks !

@adbar adbar added the enhancement New feature or request label Dec 1, 2023
@adbar
Copy link
Owner

adbar commented Dec 1, 2023

Hi @julienlambert42, the maximum number of redirects is currently set to 2. This page needs a few more, that's why the download fails.

We could put this number in the settings.cfg file to let the users decide. In the meantime you can always use another download utility and use trafilatura on the HTML.

@adbar adbar changed the title Cannot fetch url redirections Cannot fetch url with more than 2 redirections Dec 1, 2023
vbarbaresi added a commit to vbarbaresi/trafilatura that referenced this issue Dec 30, 2023
…dling

Fixes issue adbar#450

After setting `MAX_REDIRECTS` to 5, I could fetch the original URL from the issue:
`trafilatura -u https://www.hydrogeninsight.com/production/breaking-us-reveals-the-seven-regional-hydrogen-hubs-to-receive-7bn-of-government-funding/2-1-1534596`

I also fixed this old issue: adbar#128
The underlying urllib3 bug has not been fixed: urllib3/urllib3#2475

I had to pass the retry strategy to the actual request method: it doesn't propagate from the pool maanger
@adbar adbar closed this as completed in #461 Jan 4, 2024
adbar added a commit that referenced this issue Jan 4, 2024
…dling (#461)

* introduce `MAX_REDIRECTS` config setting and fix urllib3 redirect handling

Fixes issue #450

After setting `MAX_REDIRECTS` to 5, I could fetch the original URL from the issue:
`trafilatura -u https://www.hydrogeninsight.com/production/breaking-us-reveals-the-seven-regional-hydrogen-hubs-to-receive-7bn-of-government-funding/2-1-1534596`

I also fixed this old issue: #128
The underlying urllib3 bug has not been fixed: urllib3/urllib3#2475

I had to pass the retry strategy to the actual request method: it doesn't propagate from the pool maanger

* use httpbin everywhere instead of httpbun

* restore MAX_REDIRECTS defaut config value to 2

* reset global objects after test_fetch to avoid side-effects

* pin lxml to < 5

* test no_ssl True & False to fix coverage

* move _reset_global_objects() to the test file

* rewrite assignment in multiple lines for readability

---------

Co-authored-by: Adrien Barbaresi <adbar@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants