Add Snapshotter #20

Merged
merged 29 commits into from
Mar 12, 2024

@vdusek commented Jan 17, 2024

Description

  • add Snapshotter component;
  • adjust SystemStatus, LocalEventManager, and relevant data classes;
  • some functionality for system measurement was moved to crawlee/_utils/system.py;
  • add unit tests to cover new code

Additional notes

  • Removing crawlee.config module for now since we do not need it. Let's add it again in the future and move some fields to it if needed.
  • Regarding the type casts (e.g. cast(list[Snapshot], self._memory_snapshots)) in Snapshotter: they are there because mypy treats list as invariant (see https://mypy.readthedocs.io/en/stable/common_issues.html#variance). I am not sure whether we can address it better.
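
For readers unfamiliar with the variance issue mentioned above, here is a minimal sketch of it. The classes and functions below are hypothetical stand-ins, not the actual crawlee code. It shows why a cast is needed when a function is annotated with list[Snapshot], and one possible alternative (annotating read-only parameters as Sequence[Snapshot]), which only applies if the consumer never mutates the list.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence, cast


@dataclass
class Snapshot:
    """Hypothetical base class, standing in for the PR's Snapshot."""

    is_overloaded: bool


@dataclass
class MemorySnapshot(Snapshot):
    """Hypothetical subclass with one extra field."""

    used_bytes: int


def count_overloaded(snapshots: list[Snapshot]) -> int:
    # list is mutable and therefore invariant: mypy rejects a
    # list[MemorySnapshot] argument here without a cast.
    return sum(1 for snapshot in snapshots if snapshot.is_overloaded)


def count_overloaded_readonly(snapshots: Sequence[Snapshot]) -> int:
    # Sequence is read-only and covariant, so a list[MemorySnapshot]
    # argument type-checks without a cast.
    return sum(1 for snapshot in snapshots if snapshot.is_overloaded)


memory_snapshots: list[MemorySnapshot] = [
    MemorySnapshot(is_overloaded=True, used_bytes=900),
    MemorySnapshot(is_overloaded=False, used_bytes=100),
]

# Option 1 (the approach described in the PR): cast at the call site.
# cast() does nothing at runtime; it only informs the type checker.
via_cast = count_overloaded(cast(list[Snapshot], memory_snapshots))

# Option 2 (an assumption, not taken from the PR): accept a Sequence instead.
via_sequence = count_overloaded_readonly(memory_snapshots)
```

Both calls behave identically at runtime; the difference is only in what the type checker accepts.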

Testing

Unit tests

  • Unit tests should cover new code.
---------- coverage: platform linux, python 3.9.18-final-0 -----------
Name                                         Stmts   Miss  Cover
----------------------------------------------------------------
...
src/crawlee/autoscaling/snapshotter.py         112      2    98%
src/crawlee/autoscaling/system_status.py        55      4    93%
...

Manual testing / execution

  • For manual testing or just execution of the new components, check the following script.
from __future__ import annotations

import asyncio
from datetime import timedelta

from crawlee.autoscaling import Snapshotter, SystemStatus
from crawlee.events import LocalEventManager


async def test_snapshotter() -> None:
    async with LocalEventManager(system_info_interval=timedelta(seconds=1)) as event_manager:
        snapshotter = Snapshotter(
            event_manager=event_manager,
            snapshot_history=timedelta(seconds=5),
            max_event_loop_delay=timedelta(milliseconds=1),  # to simulate overloading
        )

        await snapshotter.start()

        for _ in range(10):
            print('waiting...\n')
            await asyncio.sleep(1)

            cpu_sample = snapshotter.get_cpu_sample()
            print(f'cpu_sample: {cpu_sample}\n')

            mem_sample = snapshotter.get_memory_sample()
            print(f'mem_sample: {mem_sample}\n')

            event_loop_sample = snapshotter.get_event_loop_sample()
            print(f'event_loop_sample: {event_loop_sample}\n')

        system_status = SystemStatus(snapshotter)
        current_status = system_status.get_current_status()
        print(f'status: {current_status}\n')

        historical_status = system_status.get_historical_status()
        print(f'hstatus: {historical_status}')

        await snapshotter.stop()


if __name__ == '__main__':
    asyncio.run(test_snapshotter())

@vdusek force-pushed the add-snapshotter branch 3 times, most recently from a0fbdd2 to 3e703c2 (January 18, 2024 12:15)
@vdusek changed the title from "Add Snapshotter, SystemStatus, MemoryInfo" to "[WIP] Add Snapshotter, SystemStatus, MemoryInfo" (Jan 19, 2024)
@vdusek changed the title from "[WIP] Add Snapshotter, SystemStatus, MemoryInfo" to "[WIP] Add Snapshotter, SystemStatus, MemoryInfo, EventManager" (Jan 22, 2024)
@vdusek force-pushed the add-snapshotter branch 2 times, most recently from 5d6a540 to aa93706 (January 26, 2024 10:29)
@vdusek changed the title from "[WIP] Add Snapshotter, SystemStatus, MemoryInfo, EventManager" to "[WIP] Add Snapshotter" (Jan 29, 2024)
@vdusek force-pushed the add-snapshotter branch 3 times, most recently from 59566e5 to b96b1b0 (January 30, 2024 10:13)
@github-actions bot added this to the "82nd sprint - Tooling team" milestone (Jan 30, 2024)
@github-actions bot added the t-tooling label (Jan 30, 2024)
@vdusek force-pushed the add-snapshotter branch 7 times, most recently from 46f625b to 3519dcb (January 31, 2024 15:56)
@vdusek removed the t-tooling label (Feb 15, 2024)
@github-actions bot added the t-tooling label (Feb 15, 2024)
@vdusek force-pushed the add-snapshotter branch 4 times, most recently from dd3c770 to 87a1daf (February 20, 2024 15:46)
Comment on lines +39 to +41
snapshotter._snapshot_cpu(event_system_data_info)
assert len(snapshotter._cpu_snapshots) == 1
assert snapshotter._cpu_snapshots[0].used_ratio == event_system_data_info.cpu_info.used_ratio
Collaborator

I'm kinda sad to see all those protected member accesses. Is there any way we could refactor the snapshotter to avoid this? Is this test really valuable? From what I see here, we're really just verifying some internal details, not the behavior of the Snapshotter class.

Collaborator Author

@vdusek commented Mar 12, 2024

How else would you test that the snapshotter correctly read and stored the CPU's used ratio? I don't know, it seems valuable to me to have these checks there.

Collaborator

I believe it would be better to set up an EventManager and have it emit a SystemInfo event. Then, instead of looking through the snapshots property, the test could use one of those get_*_system_info methods.

Collaborator Author

Yeah, but we would need to either directly invoke the private method _emit_system_info_event or wait for the system_info_interval to trigger its execution by RecurringTask.

And we would have to update it for all other test_snapshot_*.

Also, isn't this a better unit test approach? I mean, to have it isolated. When a bug in the RT occurs we would see that only unit tests for RT are failing and will be able to easily identify where the problem is. Rather than failing everything and looking for the cause.

I understand your point of touching just the public interface in the tests, but I'd prefer to stay with the current implementation in this case.

Collaborator

> Yeah, but we would need to either directly invoke the private method _emit_system_info_event or wait for the system_info_interval to trigger its execution by RecurringTask.

Or we could make a testing implementation of EventManager where emitting events could be done from the outside (I mean from the test).

> And we would have to update it for all other test_snapshot_*.

Yes, but I'd consider that a benefit 🙂

> Also, isn't this a better unit test approach? I mean, to have it isolated. When a bug in the RT occurs we would see that only unit tests for RT are failing and will be able to easily identify where the problem is. Rather than failing everything and looking for the cause.

Well, it depends on what you consider to be a unit, but for me, this is a sub-unit test. You're right that using a non-trivial event manager would make it closer to an integration test. With a fake (not sure if that's correct TDD terminology) implementation, I don't see an issue.
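
To illustrate the fake event manager idea discussed in this thread, here is a minimal, self-contained sketch. Everything in it is hypothetical: FakeEventManager, ToySnapshotter, and CpuInfo are simplified stand-ins written for this example and do not reflect crawlee's actual classes or signatures. The point is only that the test drives the component by emitting an event from the outside and asserts on a public getter, without touching protected members.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class CpuInfo:
    """Hypothetical event payload carrying a CPU usage ratio."""

    used_ratio: float


class FakeEventManager:
    """Hypothetical test double: events are emitted synchronously by the test."""

    def __init__(self) -> None:
        self._listeners: list[Callable[[CpuInfo], None]] = []

    def on(self, listener: Callable[[CpuInfo], None]) -> None:
        # Register a listener that receives every emitted event.
        self._listeners.append(listener)

    def emit(self, event_data: CpuInfo) -> None:
        # Deliver the event immediately; no timers or recurring tasks involved.
        for listener in self._listeners:
            listener(event_data)


class ToySnapshotter:
    """Hypothetical, heavily simplified stand-in for a snapshotter."""

    def __init__(self, event_manager: FakeEventManager) -> None:
        self._cpu_snapshots: list[float] = []
        event_manager.on(self._snapshot_cpu)

    def _snapshot_cpu(self, event_data: CpuInfo) -> None:
        self._cpu_snapshots.append(event_data.used_ratio)

    def get_cpu_sample(self) -> list[float]:
        # Public read access; returns a copy so callers cannot mutate state.
        return list(self._cpu_snapshots)


def test_snapshot_cpu_via_public_interface() -> None:
    event_manager = FakeEventManager()
    snapshotter = ToySnapshotter(event_manager)

    # The test emits the event itself instead of waiting for an interval.
    event_manager.emit(CpuInfo(used_ratio=0.66))

    assert snapshotter.get_cpu_sample() == [0.66]


test_snapshot_cpu_via_public_interface()
```

Because the fake delivers events synchronously and has no logic of its own, a failure in this test would still point at the snapshotter rather than at the event machinery, which addresses the isolation concern raised above.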

Collaborator Author

Ok, agreed. Taking into account the current situation, I opened a new issue for that - #73. Let's improve the testing of the Snapshotter later and merge this now.

@vdusek merged commit 492ee38 into master (Mar 12, 2024)
19 checks passed
@vdusek deleted the add-snapshotter branch (March 12, 2024 16:13)
@vdusek mentioned this pull request (Mar 12, 2024)
Labels
t-tooling, tested

3 participants