Conversation

Pijukatel
Collaborator

@Pijukatel Pijukatel commented Oct 15, 2025

Description

  • Ensure that BasicCrawler is persisting statistics by default.
  • Ensure that BasicCrawler is recovering existing statistics by default if Configuration.purge_on_start is False.
  • Let the BasicCrawler emit Event.PERSIST_STATE when finishing.
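The three points above can be illustrated with a minimal toy model. This is only a sketch under stated assumptions: `ToyStatistics`, `ToyCrawler`, `KV_STORE`, and the `toy-stats` key are hypothetical stand-ins and not crawlee's real classes or key names, and the final `persist()` call merely stands in for emitting Event.PERSIST_STATE when the crawler finishes.

```python
import asyncio
from dataclasses import dataclass

# Hypothetical stand-in for a persistent key-value store (not crawlee's API).
KV_STORE: dict[str, dict[str, int]] = {}


@dataclass
class ToyStatistics:
    key: str
    persistence_enabled: bool = False
    requests_finished: int = 0

    def recover(self) -> None:
        # Load previously persisted counters, if there are any.
        if self.persistence_enabled and self.key in KV_STORE:
            self.requests_finished = KV_STORE[self.key]['requests_finished']

    def persist(self) -> None:
        if self.persistence_enabled:
            KV_STORE[self.key] = {'requests_finished': self.requests_finished}


class ToyCrawler:
    def __init__(self, *, purge_on_start: bool) -> None:
        self.purge_on_start = purge_on_start
        # Persistence is switched on by default at the crawler level.
        self.statistics = ToyStatistics(key='toy-stats', persistence_enabled=True)

    async def run(self, urls: list[str]) -> int:
        if self.purge_on_start:
            KV_STORE.clear()  # nothing is left to recover after a purge
        self.statistics.recover()
        for _ in urls:
            await asyncio.sleep(0)  # placeholder for real request handling
            self.statistics.requests_finished += 1
        # Stand-in for emitting a persist-state event when finishing.
        self.statistics.persist()
        return self.statistics.requests_finished


first_run_total = asyncio.run(ToyCrawler(purge_on_start=True).run(['a', 'b']))
second_run_total = asyncio.run(ToyCrawler(purge_on_start=False).run(['c']))
```

In this toy model the second run starts from the two requests recorded by the first run, because nothing was purged in between.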

Issues

Testing

@github-actions github-actions bot added this to the 125th sprint - Tooling team milestone Oct 15, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics. labels Oct 15, 2025
@Pijukatel Pijukatel force-pushed the crawler-persistance branch from 55e7316 to ebc350d Compare October 15, 2025 14:31
@Pijukatel Pijukatel requested review from janbuchar and vdusek October 16, 2025 13:27
@Pijukatel Pijukatel marked this pull request as ready for review October 16, 2025 13:27
Collaborator

@vdusek vdusek left a comment

I'm surprised we use the SDK_CRAWLER_STATISTICS_... key for state persistence. Why is the SDK prefix in Crawlee? Also, since this is internal, we use a double-underscore prefix (__STORAGE_ALIASES_MAPPING, __RQ_STATE_...) for other cases. Could we update the key name, please?

Comment on lines -44 to +47
-    crawler = HttpCrawler(
-        configuration=configuration,
-        storage_client=storage_client,
-    )
+    service_locator.set_configuration(configuration)
+    service_locator.set_storage_client(storage_client)
+
+    crawler = HttpCrawler()
Collaborator

Why?

Collaborator Author

This is because the RecoverableState of the statistics persists to and recovers from the global storage_client. Since statistics are now persisted by default, it will try to persist to the default global storage_client, which is FileSystem..., regardless of the crawler-specific storage_client.

Mentioned here:
#1438 (comment)

I am open to discussion about this.
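The mismatch described in this comment can be reproduced with a small toy model. It is a sketch only: `ToyStorageClient`, `ToyServiceLocator`, `ToyRecoverableState`, `ToyCrawler`, and the `toy-stats` key are hypothetical names chosen for illustration, and the real crawlee classes may resolve their storage differently.

```python
from typing import Optional


class ToyStorageClient:
    """Hypothetical storage client that just records what was saved."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.saved: dict[str, dict] = {}


class ToyServiceLocator:
    """Hypothetical global registry with a file-system-like default."""

    def __init__(self) -> None:
        self._storage_client = ToyStorageClient('file-system')

    def set_storage_client(self, client: ToyStorageClient) -> None:
        self._storage_client = client

    def get_storage_client(self) -> ToyStorageClient:
        return self._storage_client


service_locator = ToyServiceLocator()


class ToyRecoverableState:
    """Always persists through the global locator, never a per-crawler client."""

    def __init__(self, key: str) -> None:
        self.key = key

    def persist(self, value: dict) -> None:
        service_locator.get_storage_client().saved[self.key] = value


class ToyCrawler:
    def __init__(self, storage_client: Optional[ToyStorageClient] = None) -> None:
        self.storage_client = storage_client or service_locator.get_storage_client()
        self.statistics_state = ToyRecoverableState('toy-stats')


# Case 1: a crawler-specific client is passed, but the state ignores it.
memory_client = ToyStorageClient('memory')
default_client = service_locator.get_storage_client()
ToyCrawler(storage_client=memory_client).statistics_state.persist({'finished': 1})
ignored_crawler_client = 'toy-stats' not in memory_client.saved
landed_in_global_default = 'toy-stats' in default_client.saved

# Case 2: the client is registered globally, so the state uses it.
service_locator.set_storage_client(memory_client)
ToyCrawler().statistics_state.persist({'finished': 2})
landed_in_memory_client = memory_client.saved.get('toy-stats') == {'finished': 2}
```

In this model, only registering the client globally makes the persisted state follow it, which mirrors why the test in the diff switched to the service_locator setters.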

Collaborator

Couldn't we use the storage client passed to the crawler?

Collaborator Author

We could, but do we want to? I had an inconclusive discussion about this with @janbuchar
I am still not sure about this.

self._statistics = statistics or cast(
    'Statistics[TStatisticsState]',
    Statistics.with_default_state(
        persistence_enabled=True,
Collaborator

I suppose changing the default values (persistence_enabled: bool = True) is a no-go in patch releases, right?

Collaborator Author

Well, it changes the default value of an internal attribute... Since we consider the previous behavior a bug, this is probably OK?

Collaborator

I meant changing the default value in the Statistics.with_default_state:

Statistics.with_default_state(persistence_enabled: bool = True)

Collaborator Author

I see. Yes, that is why I changed it only in BasicCrawler and not in Statistics.with_default_state
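The distinction discussed in this thread, keeping the factory's public default untouched while the crawler opts in explicitly, can be sketched as follows. `ToyStatistics` and `ToyCrawler` are hypothetical stand-ins, and the `False` default on the toy factory is an assumption inferred from the thread rather than a confirmed signature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyStatistics:
    persistence_enabled: bool

    @classmethod
    def with_default_state(cls, *, persistence_enabled: bool = False) -> 'ToyStatistics':
        # The public default stays as it was (assumed False here), so direct
        # callers of the factory see no behavior change.
        return cls(persistence_enabled=persistence_enabled)


class ToyCrawler:
    def __init__(self, statistics: Optional[ToyStatistics] = None) -> None:
        # Only the crawler's own fallback opts into persistence.
        self.statistics = statistics or ToyStatistics.with_default_state(
            persistence_enabled=True,
        )


direct_default = ToyStatistics.with_default_state().persistence_enabled
crawler_default = ToyCrawler().statistics.persistence_enabled
user_supplied = ToyCrawler(ToyStatistics.with_default_state()).statistics.persistence_enabled
```

Under these assumptions, a statistics object supplied by the user keeps whatever setting the user chose; only the crawler-created fallback has persistence enabled.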

TODO: Figure out reason for stats difference in request_total_finished_duration
await exit_stack.enter_async_context(context) # type: ignore[arg-type]

await self._autoscaled_pool.run()
async with self._crawler_state_rec_task:
Collaborator

Can't self._crawler_state_rec_task be part of contexts_to_enter? Or must its __aexit__ be called before the event_manager.emit?
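The ordering question can be made concrete with the standard library's `contextlib.AsyncExitStack`. The sketch below is not the crawler's code: `tracked`, the context names, and the recorded event strings are hypothetical, and it only shows when each kind of context exits relative to a statement placed after the nested block.

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager

events: list[str] = []


@asynccontextmanager
async def tracked(name: str):
    # Record when the context is entered and when it is exited.
    events.append(f'enter {name}')
    try:
        yield
    finally:
        events.append(f'exit {name}')


async def main() -> None:
    async with AsyncExitStack() as exit_stack:
        # A context entered through the stack exits only when the stack closes.
        await exit_stack.enter_async_context(tracked('stack-managed'))

        # A nested `async with` exits as soon as its own block ends.
        async with tracked('nested'):
            events.append('run pool')

        # Stand-in for emitting a persist-state event after the pool has run.
        events.append('emit persist state')


asyncio.run(main())
```

A context registered on the exit stack is still open when the stand-in emit runs, while the nested one has already exited. So if the task's `__aexit__` has to complete before the event is emitted, adding it to the stack-managed contexts would change the order.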

Development

Successfully merging this pull request may close these issues.

Fix Crawler on migration not remembering statistics

2 participants