fix: Fix BasicCrawler statistics persistence #1490
I'm surprised we use the SDK_CRAWLER_STATISTICS_... key for state persistence. Why is the SDK prefix in Crawlee? Also, since this is internal, we use a double-underscore prefix (__STORAGE_ALIASES_MAPPING, __RQ_STATE_...) for other cases. Could we update the key name, please?
crawler = HttpCrawler(
    configuration=configuration,
    storage_client=storage_client,
)
service_locator.set_configuration(configuration)
service_locator.set_storage_client(storage_client)

crawler = HttpCrawler()
Why?
This is because the RecoverableState of statistics persists to and recovers from the global storage_client. And since statistics are now persisted by default, it will try to persist to the default global storage_client, which is FileSystem..., regardless of the crawler-specific storage_client.
Mentioned here:
#1438 (comment)
I am open to discussion about this.
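The behavior described in this comment can be sketched with a toy model. This is not Crawlee's code: InMemoryStore, ServiceLocatorSketch, RecoverableStateSketch and CrawlerSketch are hypothetical stand-ins, and the only assumption carried over from the thread is that the recoverable state looks up the storage client globally instead of using the one passed to the crawler.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a storage client: a named dict."""
    name: str
    data: dict = field(default_factory=dict)


class ServiceLocatorSketch:
    """Hypothetical global registry, loosely modelled on service_locator in the diff."""

    def __init__(self) -> None:
        self._storage_client = InMemoryStore('global-default')

    def set_storage_client(self, client: InMemoryStore) -> None:
        self._storage_client = client

    def get_storage_client(self) -> InMemoryStore:
        return self._storage_client


locator = ServiceLocatorSketch()


class RecoverableStateSketch:
    """Assumed behavior: always persists via the global locator."""

    def __init__(self, key: str) -> None:
        self._key = key

    def persist(self, value: dict) -> str:
        store = locator.get_storage_client()
        store.data[self._key] = value
        return store.name  # report where the state actually went


class CrawlerSketch:
    def __init__(self, storage_client: Optional[InMemoryStore] = None) -> None:
        # The crawler remembers its own client, but the statistics state does not see it.
        self.storage_client = storage_client or locator.get_storage_client()
        self.statistics_state = RecoverableStateSketch('STATISTICS')


custom = InMemoryStore('crawler-specific')

# Client passed only to the crawler: statistics still land in the global store.
written_to = CrawlerSketch(storage_client=custom).statistics_state.persist({'finished': 1})

# Client registered globally (the approach taken in the test diff): statistics follow it.
locator.set_storage_client(custom)
written_to_after = CrawlerSketch().statistics_state.persist({'finished': 2})
```

In this model, the first call writes to the global default store and the second writes to the custom one, which is the mismatch the test change works around.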
Couldn't we use the storage client passed to the crawler?
We could, but do we want to? I had an inconclusive discussion about this with @janbuchar. I am still not sure about this.
self._statistics = statistics or cast(
    'Statistics[TStatisticsState]',
    Statistics.with_default_state(
        persistence_enabled=True,
I suppose changing the default values (persistence_enabled: bool = True) is a no-go in patch releases, right?
Well, it is changing the default value of the internal attribute... since we consider previous behavior a bug, this is probably ok?
I meant changing the default value in Statistics.with_default_state:
Statistics.with_default_state(persistence_enabled: bool = True)
I see. Yes, that is why I changed it only in BasicCrawler and not in Statistics.with_default_state.
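A minimal sketch of the distinction discussed here, using hypothetical stand-in classes (StatisticsSketch, BasicCrawlerSketch) instead of the real ones. It assumes the factory's own default is False. The thread implies this but does not show the signature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatisticsSketch:
    """Hypothetical stand-in for Statistics."""
    persistence_enabled: bool

    @classmethod
    def with_default_state(cls, *, persistence_enabled: bool = False) -> 'StatisticsSketch':
        # Assumption: the public factory default stays False, so direct callers see no change.
        return cls(persistence_enabled=persistence_enabled)


class BasicCrawlerSketch:
    """Hypothetical stand-in for BasicCrawler."""

    def __init__(self, statistics: Optional[StatisticsSketch] = None) -> None:
        # Only the crawler's internal fallback opts in to persistence.
        self.statistics = statistics or StatisticsSketch.with_default_state(
            persistence_enabled=True,
        )


factory_default = StatisticsSketch.with_default_state()
crawler_default = BasicCrawlerSketch().statistics
user_supplied = BasicCrawlerSketch(statistics=factory_default).statistics
```

Under these assumptions, code that calls the factory directly keeps the old behavior, a crawler built without explicit statistics gets persistence, and explicitly supplied statistics are used unchanged.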
TODO: Figure out reason for stats difference in request_total_finished_duration
await exit_stack.enter_async_context(context)  # type: ignore[arg-type]

await self._autoscaled_pool.run()
async with self._crawler_state_rec_task:
Can't self._crawler_state_rec_task be part of contexts_to_enter? Must its __aexit__ be called before the event_manager.emit?
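The ordering difference behind this question can be illustrated with the standard library alone. In the sketch below, run_sketch, tracked, and the event labels are hypothetical. It does not reproduce the crawler's run logic. It only shows when __aexit__ fires for a context entered on an AsyncExitStack compared with a nested async with block.

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager


async def run_sketch(task_on_stack: bool) -> list:
    """Record the order of enter/exit/emit events for the two layouts."""
    events = []

    @asynccontextmanager
    async def tracked(name: str):
        events.append(f'enter {name}')
        try:
            yield
        finally:
            events.append(f'exit {name}')

    async with AsyncExitStack() as exit_stack:
        await exit_stack.enter_async_context(tracked('event_manager'))
        if task_on_stack:
            # Entered on the stack: exits only when the whole stack unwinds.
            await exit_stack.enter_async_context(tracked('state_task'))
            events.append('pool.run')
        else:
            # Nested block: exits as soon as the pool run finishes.
            async with tracked('state_task'):
                events.append('pool.run')
        events.append('emit')  # stands in for the final event emission
    return events


nested = asyncio.run(run_sketch(task_on_stack=False))
stacked = asyncio.run(run_sketch(task_on_stack=True))
```

With the nested block, the task's exit is recorded before the emit. With the task on the stack, the emit comes first and the exit happens during unwinding, in reverse order of entry.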
Description
- BasicCrawler persists statistics by default.
- BasicCrawler recovers existing statistics by default if Configuration.purge_on_start is False.
- BasicCrawler emits Event.PERSIST_STATE when finishing.
Issues
Testing