fix: possible infinity loop in Apify-Scrapy proxy middleware #259

Merged
vdusek merged 1 commit into master from fix-scrapy-proxy-middleware
Sep 4, 2024

@vdusek vdusek commented Sep 3, 2024

According to the template log, it scrapes normally, as it should:

...
2024-09-02T21:34:06.9322150Z [title_spider] [INFO] TitleSpider is parsing <200 https://apify.com/run-scrapy-in-cloud>... ({"spider": "<TitleSpider 'title_spider' at 0x7f70c7ab07f0>"})
2024-09-02T21:34:06.9455173Z [title_spider] [INFO] TitleSpider is parsing <200 https://docs.apify.com/academy/web-scraping-for-beginners>... ({"spider": "<TitleSpider 'title_spider' at 0x7f70c7ab07f0>"})
2024-09-02T21:34:06.9530763Z [title_spider] [INFO] TitleSpider is parsing <200 https://apify.com/success-stories>... ({"spider": "<TitleSpider 'title_spider' at 0x7f70c7ab07f0>"})
2024-09-02T21:34:07.0125374Z [title_spider] [INFO] TitleSpider is parsing <200 https://apify.com/templates/ts-crawlee-playwright-chrome>... ({"spider": "<TitleSpider 'title_spider' at 0x7f70c7ab07f0>"})
...

But when processing https://console.apify.com/robots.txt, it throws an exception in the proxy middleware, which is caught and logged:

2024-09-02T21:34:07.0410494Z [apify] [WARN] ApifyHttpProxyMiddleware: TunnelError occurred for request="<GET https://console.apify.com/robots.txt>", reason="Could not open CONNECT tunnel with proxy proxy.apify.com:8000 [{'status': 403, 'reason': b'Forbidden'}]", skipping...

but then it incorrectly returns the request object here:

if isinstance(exception, TunnelError):
    Actor.log.warning(
        f'ApifyHttpProxyMiddleware: TunnelError occurred for request="{request}", '
        f'reason="{exception}", skipping...'
    )
    # Problematic line: the failed request is handed back to Scrapy.
    return request

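Returning the request from `process_exception` causes Scrapy to reschedule it. The rescheduled request hits the same TunnelError, is returned again, and we're stuck in an infinite loop.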
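To make the failure mode concrete, here is a minimal, self-contained sketch of the loop. It does not use Scrapy itself: `TunnelError`, `run`, `buggy_process_exception`, and `fixed_process_exception` are hypothetical stand-ins written for illustration, and the toy queue only approximates how a returned request gets rescheduled. The "fixed" variant assumes the remedy is to return `None` instead of the request, which Scrapy's downloader middleware contract allows; the actual change in this PR is not shown in the thread.

```python
from collections import deque


class TunnelError(Exception):
    """Stand-in for Scrapy's TunnelError (hypothetical, illustration only)."""


def buggy_process_exception(request, exception):
    # Mirrors the behaviour described above: the failed request is returned.
    if isinstance(exception, TunnelError):
        return request
    return None


def fixed_process_exception(request, exception):
    # Assumed fix: report the error and return None, so nothing is rescheduled.
    return None


def run(process_exception, start_url, max_attempts=5):
    """Toy download loop; returns how many times the URL was attempted."""
    queue = deque([start_url])
    attempts = 0
    # max_attempts is only a safety cap so the buggy variant terminates here.
    while queue and attempts < max_attempts:
        request = queue.popleft()
        attempts += 1
        # Every attempt fails the same way, like the 403 from the proxy.
        retry = process_exception(request, TunnelError("403 Forbidden"))
        if retry is not None:
            queue.append(retry)  # rescheduled, like a returned request
    return attempts


url = "https://console.apify.com/robots.txt"
print("buggy attempts:", run(buggy_process_exception, url))
print("fixed attempts:", run(fixed_process_exception, url))
```

With the buggy handler, the attempt count always reaches whatever cap is passed in, which is the loop described above; with the handler that returns `None`, the URL is attempted once and then dropped.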

Also see apify/actor-templates#288, where the tests are run against an alpha release built from this branch.

@github-actions github-actions bot added this to the 97th sprint - Tooling team milestone Sep 3, 2024
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 3, 2024

@github-actions github-actions bot left a comment

⚠️ Pull Request Toolkit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@github-actions github-actions bot added the tested Temporary label used only programmatically for some analytics. label Sep 3, 2024
@vdusek vdusek force-pushed the fix-scrapy-proxy-middleware branch from ae5640f to 353b22c Compare September 3, 2024 16:23
@vdusek vdusek requested a review from janbuchar September 3, 2024 16:35
@vdusek vdusek added bug Something isn't working. adhoc Ad-hoc unplanned task added during the sprint. labels Sep 3, 2024
@vdusek vdusek marked this pull request as ready for review September 3, 2024 16:47

@janbuchar janbuchar left a comment

LGTM

@vdusek vdusek merged commit 8647a94 into master Sep 4, 2024
38 checks passed
@vdusek vdusek deleted the fix-scrapy-proxy-middleware branch September 4, 2024 13:31
Labels
adhoc Ad-hoc unplanned task added during the sprint. bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics.