Request deduplication does not work in Apify-Scrapy integration #395

@vdusek

Description

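Request deduplication does not always work in the Apify-Scrapy integration. In the run below, several URLs are parsed more than once; for example, https://crawlee.dev/ is parsed five times and https://crawlee.dev/docs/quick-start four times.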

Reproduction

Actor input:

{
  "allowedDomains": [
    "crawlee.dev"
  ],
  "proxyConfiguration": {
    "useApifyProxy": false
  },
  "startUrls": [
    {
      "url": "https://crawlee.dev/",
      "method": "GET"
    }
  ]
}

Observed behavior

Logs:

2025-02-10T17:26:24.729Z ACTOR: Pulling Docker image of build 0VUi8LhZspd5TTEGF from repository.
2025-02-10T17:26:31.307Z ACTOR: Creating Docker container.
2025-02-10T17:26:32.118Z ACTOR: Starting Docker container.
2025-02-10T17:26:35.619Z [apify] INFO  Initializing Actor...
2025-02-10T17:26:35.622Z [apify] INFO  Initializing Actor... ({"message": "Initializing Actor..."})
2025-02-10T17:26:35.625Z [apify] INFO  System info ({"apify_sdk_version": "2.2.2", "apify_client_version": "1.9.1", "crawlee_version": "0.5.4", "python_version": "3.12.8", "os": "linux"})
2025-02-10T17:26:35.628Z [apify] INFO  System info ({"apify_sdk_version": "2.2.2", "apify_client_version": "1.9.1", "crawlee_version": "0.5.4", "python_version": "3.12.8", "os": "linux", "message": "System info"})
2025-02-10T17:26:35.731Z [scrapy.addons] INFO  Enabled addons:
2025-02-10T17:26:35.734Z [] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:35.841Z [scrapy.middleware] INFO  Enabled extensions:
2025-02-10T17:26:35.844Z ['scrapy.extensions.corestats.CoreStats',
2025-02-10T17:26:35.846Z  'scrapy.extensions.memusage.MemoryUsage',
2025-02-10T17:26:35.848Z  'scrapy.extensions.logstats.LogStats'] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:35.851Z [scrapy.crawler] INFO  Overridden settings:
2025-02-10T17:26:35.854Z {'BOT_NAME': 'titlebot',
2025-02-10T17:26:35.856Z  'DEPTH_LIMIT': 1,
2025-02-10T17:26:35.859Z  'LOG_LEVEL': 'INFO',
2025-02-10T17:26:35.862Z  'NEWSPIDER_MODULE': 'src.spiders',
2025-02-10T17:26:35.864Z  'ROBOTSTXT_OBEY': True,
2025-02-10T17:26:35.867Z  'SCHEDULER': 'apify.scrapy.scheduler.ApifyScheduler',
2025-02-10T17:26:35.869Z  'SPIDER_MODULES': ['src.spiders'],
2025-02-10T17:26:35.872Z  'TELNETCONSOLE_ENABLED': False,
2025-02-10T17:26:35.874Z  'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2025-02-10T17:26:36.176Z [apify] INFO  ApifyHttpProxyMiddleware is not going to be used. Actor input field "proxyConfiguration.useApifyProxy" is set to False.
2025-02-10T17:26:36.179Z [apify] INFO  ApifyHttpProxyMiddleware is not going to be used. Actor input field "proxyConfiguration.useApifyProxy" is set to False. ({"message": "ApifyHttpProxyMiddleware is not going to be used. Actor input field \"proxyConfiguration.useApifyProxy\" is set to False."})
2025-02-10T17:26:36.182Z [scrapy.middleware] INFO  Enabled downloader middlewares:
2025-02-10T17:26:36.185Z ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
2025-02-10T17:26:36.188Z  'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
2025-02-10T17:26:36.191Z  'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
2025-02-10T17:26:36.193Z  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
2025-02-10T17:26:36.195Z  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
2025-02-10T17:26:36.198Z  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
2025-02-10T17:26:36.200Z  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
2025-02-10T17:26:36.203Z  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
2025-02-10T17:26:36.205Z  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
2025-02-10T17:26:36.208Z  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
2025-02-10T17:26:36.211Z  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
2025-02-10T17:26:36.213Z  'scrapy.downloadermiddlewares.stats.DownloaderStats'] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:36.215Z [scrapy.middleware] INFO  Enabled spider middlewares:
2025-02-10T17:26:36.218Z ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
2025-02-10T17:26:36.220Z  'scrapy.spidermiddlewares.referer.RefererMiddleware',
2025-02-10T17:26:36.223Z  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
2025-02-10T17:26:36.225Z  'scrapy.spidermiddlewares.depth.DepthMiddleware'] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:36.227Z [scrapy.middleware] INFO  Enabled item pipelines:
2025-02-10T17:26:36.230Z ['apify.scrapy.pipelines.ActorDatasetPushPipeline'] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:36.232Z [scrapy.core.engine] INFO  Spider opened ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:36.343Z [scrapy.extensions.logstats] INFO  Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:36.832Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:38.548Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:39.136Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/examples>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:39.909Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/blog>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:40.332Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:40.527Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:40.539Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/python>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.042Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/next/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.311Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/core/changelog>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.331Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.663Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.676Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.11/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.897Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/core>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.931Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.10/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.207Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.227Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.9/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.456Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.478Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.8/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.762Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.7/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.972Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.6/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:43.216Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.5/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:43.416Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.4/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:43.700Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.3/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:43.923Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.2/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:44.189Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.1/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:44.423Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/3.0/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:44.638Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/introduction>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:44.851Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/javascript-rendering>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.061Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/typescript-project>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.288Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/avoid-blocking>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.504Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/cheerio-crawler-guide>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.710Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/jsdom-crawler-guide>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.936Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/javascript-rendering>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:46.127Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/core/class/AutoscaledPool>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:46.372Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/proxy-management>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:46.618Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/result-storage>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:46.787Z [scrapy.spidermiddlewares.urllength] INFO  Ignoring link (url length > 2083): https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcbiAgICBcImNvZGVcIjogXCJpbXBvcnQgeyBQbGF5d3JpZ2h0Q3Jhd2xlciB9IGZyb20gJ2NyYXdsZWUnO1xcblxcbi8vIEltcG9ydCB0aGUgYEFjdG9yYCBjbGFzcyBmcm9tIHRoZSBBcGlmeSBTREsuXFxuaW1wb3J0IHsgQWN0b3IgfSBmcm9tICdhcGlmeSc7XFxuXFxuLy8gU2V0IHVwIHRoZSBpbnRlZ3JhdGlvbiB0byBBcGlmeS5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuLy8gQ3Jhd2xlciBzZXR1cCBmcm9tIHRoZSBwcmV2aW91cyBleGFtcGxlLlxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUGxheXdyaWdodENyYXdsZXIoe1xcbiAgICAvLyBVc2UgdGhlIHJlcXVlc3RIYW5kbGVyIHRvIHByb2Nlc3MgZWFjaCBvZiB0aGUgY3Jhd2xlZCBwYWdlcy5cXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBwYWdlLCBlbnF1ZXVlTGlua3MsIHB1c2hEYXRhLCBsb2cgfSkge1xcbiAgICAgICAgY29uc3QgdGl0bGUgPSBhd2FpdCBwYWdlLnRpdGxlKCk7XFxuICAgICAgICBsb2cuaW5mbyhgVGl0bGUgb2YgJHtyZXF1ZXN0LmxvYWRlZFVybH0gaXMgJyR7dGl0bGV9J2ApO1xcblxcbiAgICAgICAgLy8gU2F2ZSByZXN1bHRzIGFz... [line-too-long]
2025-02-10T17:26:46.851Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides/request-storage>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.052Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/utils/namespace/social>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.287Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/utils>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.503Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.776Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.970Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/deployment/aws-cheerio>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.029Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/deployment/gcp-cheerio>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.081Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/guides>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.114Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/examples>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.426Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/upgrading/upgrading-to-v3>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.500Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/blog>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.530Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/api/core>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:49.277Z [scrapy.core.engine] INFO  Closing spider (finished) ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:49.281Z [scrapy.statscollectors] INFO  Dumping Scrapy stats:
2025-02-10T17:26:49.283Z {'downloader/request_bytes': 13234,
2025-02-10T17:26:49.286Z  'downloader/request_count': 49,
2025-02-10T17:26:49.288Z  'downloader/request_method_count/GET': 49,
2025-02-10T17:26:49.291Z  'downloader/response_bytes': 1384307,
2025-02-10T17:26:49.293Z  'downloader/response_count': 49,
2025-02-10T17:26:49.296Z  'downloader/response_status_count/200': 49,
2025-02-10T17:26:49.298Z  'elapsed_time_seconds': 12.935337,
2025-02-10T17:26:49.301Z  'finish_reason': 'finished',
2025-02-10T17:26:49.303Z  'finish_time': datetime.datetime(2025, 2, 10, 17, 26, 49, 277550, tzinfo=datetime.timezone.utc),
2025-02-10T17:26:49.306Z  'httpcompression/response_bytes': 8749039,
2025-02-10T17:26:49.308Z  'httpcompression/response_count': 49,
2025-02-10T17:26:49.310Z  'item_scraped_count': 48,
2025-02-10T17:26:49.313Z  'items_per_minute': None,
2025-02-10T17:26:49.316Z  'log_count/INFO': 58,
2025-02-10T17:26:49.318Z  'memusage/max': 105684992,
2025-02-10T17:26:49.320Z  'memusage/startup': 105684992,
2025-02-10T17:26:49.322Z  'offsite/domains': 10,
2025-02-10T17:26:49.324Z  'offsite/filtered': 20,
2025-02-10T17:26:49.327Z  'request_depth_max': 1,
2025-02-10T17:26:49.329Z  'response_received_count': 49,
2025-02-10T17:26:49.332Z  'responses_per_minute': None,
2025-02-10T17:26:49.335Z  'robotstxt/request_count': 1,
2025-02-10T17:26:49.338Z  'robotstxt/response_count': 1,
2025-02-10T17:26:49.340Z  'robotstxt/response_status_count/200': 1,
2025-02-10T17:26:49.343Z  'start_time': datetime.datetime(2025, 2, 10, 17, 26, 36, 342213, tzinfo=datetime.timezone.utc),
2025-02-10T17:26:49.345Z  'urllength/request_ignored_count': 1} ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:49.348Z [scrapy.core.engine] INFO  Spider closed (finished) ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:49.351Z [apify] INFO  Exiting Actor ({"exit_code": 0})
2025-02-10T17:26:49.353Z [apify] INFO  Exiting Actor ({"exit_code": 0, "message": "Exiting Actor"})
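
To make the duplicates easier to spot when triaging, here is a minimal sketch that counts how often each URL appears in the "TitleSpider is parsing" log lines. The function name `find_duplicate_urls` and the regular expression are illustrative assumptions, not part of the Apify SDK or Scrapy; the sample is the first five parsing lines from the log above, with the trailing spider context trimmed.

```python
import re
from collections import Counter

# Matches the URL in lines such as:
#   ... TitleSpider is parsing <200 https://crawlee.dev/>... (...)
# The pattern is an assumption based on the log format shown in this issue.
PARSE_LINE = re.compile(r"is parsing <\d{3} (?P<url>\S+?)>\.\.\.")


def find_duplicate_urls(log_text: str) -> dict[str, int]:
    """Return every URL that was parsed more than once, with its count."""
    counts = Counter(m.group("url") for m in PARSE_LINE.finditer(log_text))
    return {url: n for url, n in counts.items() if n > 1}


# First five "is parsing" lines of the log, trailing context trimmed.
SAMPLE_LOG = """\
2025-02-10T17:26:36.832Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>...
2025-02-10T17:26:38.548Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>...
2025-02-10T17:26:39.136Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/docs/examples>...
2025-02-10T17:26:39.909Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/blog>...
2025-02-10T17:26:40.332Z [title_spider] INFO  TitleSpider is parsing <200 https://crawlee.dev/>...
"""

if __name__ == "__main__":
    # Prints {'https://crawlee.dev/': 3} for the five-line sample.
    print(find_duplicate_urls(SAMPLE_LOG))
```

Running the same function over the full log should report every URL that slipped past deduplication in this run; with working deduplication the result would be an empty dict.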

Labels

bug: Something isn't working.
t-tooling: Issues with this label are in the ownership of the tooling team.
