Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce MAX_REDIRECTS config setting and fix urllib3 redirect handling #461

Merged
merged 9 commits into from Jan 4, 2024

Conversation

vbarbaresi
Copy link
Contributor

@vbarbaresi vbarbaresi commented Dec 30, 2023

Fixes issue #450

After setting MAX_REDIRECTS to 5, I could fetch the original URL from the issue: trafilatura -u https://www.hydrogeninsight.com/production/breaking-us-reveals-the-seven-regional-hydrogen-hubs-to-receive-7bn-of-government-funding/2-1-1534596

I also fixed this old issue: #128
The underlying urllib3 bug has not been fixed: github.com/urllib3/urllib3/issues/2475

I had to pass the retry strategy to the actual request method: it doesn't propagate from the pool maanger

…dling

Fixes issue adbar#450

After setting `MAX_REDIRECTS` to 5, I could fetch the original URL from the issue:
`trafilatura -u https://www.hydrogeninsight.com/production/breaking-us-reveals-the-seven-regional-hydrogen-hubs-to-receive-7bn-of-government-funding/2-1-1534596`

I also fixed this old issue: adbar#128
The underlying urllib3 bug has not been fixed: urllib3/urllib3#2475

I had to pass the retry strategy to the actual request method: it doesn't propagate from the pool maanger
@vbarbaresi
Copy link
Contributor Author

Tests are failing with the following:

   def test_ab_with_p_parent_resolved():
        parser = XMLParser(remove_blank_text=True)
        xml_doc = fromstring("<text><p><head>text1</head></p></text>")
        cleaned = check_tei(xml_doc, "fake_url")
        assert cleaned.find(".//ab") is not None and cleaned.find(".//p") is None
        xml_doc = fromstring("<body><p>text1<head>text2</head></p></body>")
        cleaned = check_tei(xml_doc, "fake_url")
        result = cleaned.find(".//ab")
        assert result.getparent().tag == "body" and result.text == "text2"
        xml_doc = fromstring("<TEI><text><body><p><head>text1</head></p>text2</body></text></TEI>")
        cleaned = check_tei(xml_doc, "fake_url")
>       assert cleaned.find(".//ab").text == "text1" and cleaned.find(".//p").text == "text2"
E       AttributeError: 'NoneType' object has no attribute 'text'

tests/xml_tei_tests.py:285: AttributeError

I'm not reproducing locally and I don't get how my changes could have affected this part 🤔

@vbarbaresi
Copy link
Contributor Author

oh no.. lxml 5.0 just got released yesterday: https://pypi.org/project/lxml/#history
I'll pin lxml to < 5.0 for now.

Copy link

codecov bot commented Dec 30, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (396b991) 96.76% compared to head (3019f9d) 96.91%.
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #461      +/-   ##
==========================================
+ Coverage   96.76%   96.91%   +0.15%     
==========================================
  Files          22       22              
  Lines        3367     3370       +3     
==========================================
+ Hits         3258     3266       +8     
+ Misses        109      104       -5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@adbar adbar linked an issue Jan 2, 2024 that may be closed by this pull request
@adbar
Copy link
Owner

adbar commented Jan 2, 2024

It works, thanks, but does it really fix the issue urllib3/urllib3#2475 ? Merging this PR would close it.

@vbarbaresi
Copy link
Contributor Author

does it really fix the issue urllib3/urllib3#2475 ? Merging this PR would close it.

It doesn't, but will it close it? I wouldn't think merging a PR could close an issue living in another repository

@adbar
Copy link
Owner

adbar commented Jan 2, 2024

You could unlink it to be sure (I cannot).

@vbarbaresi
Copy link
Contributor Author

done, I edited the description to quote the issue URL

trafilatura/downloads.py Outdated Show resolved Hide resolved
@adbar adbar merged commit 6a79511 into adbar:master Jan 4, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot fetch url with more than 2 redirections
2 participants