4 changes: 2 additions & 2 deletions CHANGELOG.md
@@ -282,7 +282,7 @@ All notable changes to this project will be documented in this file.

### 🐛 Bug Fixes

- Fix session managment with retire ([#947](https://github.com/apify/crawlee-python/pull/947)) ([caee03f](https://github.com/apify/crawlee-python/commit/caee03fe3a43cc1d7a8d3f9e19b42df1bdb1c0aa)) by [@Mantisus](https://github.com/Mantisus)
- Fix session management with retire ([#947](https://github.com/apify/crawlee-python/pull/947)) ([caee03f](https://github.com/apify/crawlee-python/commit/caee03fe3a43cc1d7a8d3f9e19b42df1bdb1c0aa)) by [@Mantisus](https://github.com/Mantisus)
- Fix templates - poetry-plugin-export version and camoufox template name ([#952](https://github.com/apify/crawlee-python/pull/952)) ([7addea6](https://github.com/apify/crawlee-python/commit/7addea6605359cceba208e16ec9131724bdb3e9b)) by [@Pijukatel](https://github.com/Pijukatel), closes [#951](https://github.com/apify/crawlee-python/issues/951)
- Fix convert relative link to absolute in `enqueue_links` for response with redirect ([#956](https://github.com/apify/crawlee-python/pull/956)) ([694102e](https://github.com/apify/crawlee-python/commit/694102e163bb9021a4830d2545d153f6f8f3de90)) by [@Mantisus](https://github.com/Mantisus), closes [#955](https://github.com/apify/crawlee-python/issues/955)
- Fix `CurlImpersonateHttpClient` cookies handler ([#946](https://github.com/apify/crawlee-python/pull/946)) ([ed415c4](https://github.com/apify/crawlee-python/commit/ed415c433da2a40b0ee62534f0730d0737e991b8)) by [@Mantisus](https://github.com/Mantisus)
@@ -688,4 +688,4 @@ All notable changes to this project will be documented in this file.
- Storage manager & purging the defaults ([#150](https://github.com/apify/crawlee-python/pull/150)) ([851042f](https://github.com/apify/crawlee-python/commit/851042f25ad07e25651768e476f098ef0ed21914)) by [@vdusek](https://github.com/vdusek)


<!-- generated by git-cliff -->
<!-- generated by git-cliff -->
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -103,7 +103,7 @@ make run-docs
Publishing new versions to [PyPI](https://pypi.org/project/crawlee) is automated through GitHub Actions.

- **Beta releases**: On each commit to the master branch, a new beta release is automatically published. The version number is determined based on the latest release and conventional commits. The beta version suffix is incremented by 1 from the last beta release on PyPI.
- **Stable releases**: A stable version release may be created by triggering the `release` GitHub Actions workflow. The version number is determined based on the latest release and conventional commits (`auto` release type), or it may be overriden using the `custom` release type.
- **Stable releases**: A stable version release may be created by triggering the `release` GitHub Actions workflow. The version number is determined based on the latest release and conventional commits (`auto` release type), or it may be overridden using the `custom` release type.

### Publishing to PyPI manually

2 changes: 1 addition & 1 deletion docs/deployment/apify_platform.mdx
@@ -99,7 +99,7 @@ apify run
For running Crawlee code as an Actor on [Apify platform](https://apify.com/actors) you need to wrap the body of the main function of your crawler with `async with Actor`.

:::info NOTE
Adding `async with Actor` is the only important thing needed to run it on Apify platform as an Actor. It is needed to initialize your Actor (e.g. to set the correct storage implementation) and to correctly handle exitting the process.
Adding `async with Actor` is the only important thing needed to run it on Apify platform as an Actor. It is needed to initialize your Actor (e.g. to set the correct storage implementation) and to correctly handle exiting the process.
:::

Let's look at the `BeautifulSoupCrawler` example from the [Quick start](../quick-start) guide:
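The referenced example lies outside this hunk. Purely as a hedged sketch of what a main function wrapped in `async with Actor` can look like (the import paths and the handler body are assumptions for illustration, not the guide's actual example):

```python
import asyncio

from apify import Actor

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # `async with Actor` initializes the Actor (e.g. switches to the platform
    # storage implementation) and takes care of exiting the process cleanly.
    async with Actor:
        crawler = BeautifulSoupCrawler()

        @crawler.router.default_handler
        async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
            context.log.info(f'Processing {context.request.url} ...')
            await context.push_data({'url': context.request.url})

        await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```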
@@ -1,5 +1,5 @@
---
id: playwright-crawler-with-fingeprint-generator
id: playwright-crawler-with-fingerprint-generator
title: Playwright crawler with fingerprint generator
---

2 changes: 1 addition & 1 deletion docs/guides/trace_and_monitor_crawlers.mdx
@@ -45,7 +45,7 @@ You can use different tools to consume the OpenTelemetry data that might better

## Customize the instrumentation

You can customize the <ApiLink to="class/CrawlerInstrumentor">`CrawlerInstrumentor`</ApiLink>. Depending on the arguments used during its initialization, the instrumentation will be applied to different parts ot the Crawlee code. By default, it instruments some functions that can give quite a good picture of each individual request handling. To turn this default instrumentation off, you can pass `request_handling_instrumentation=False` during initialization. You can also extend instrumentation by passing `instrument_classes=[...]` initialization argument that contains classes you want to be auto-instrumented. All their public methods will be automatically instrumented. Bear in mind that instrumentation has some runtime costs as well. The more instrumentation is used, the more overhead it will add to the crawler execution.
You can customize the <ApiLink to="class/CrawlerInstrumentor">`CrawlerInstrumentor`</ApiLink>. Depending on the arguments used during its initialization, the instrumentation will be applied to different parts of the Crawlee code. By default, it instruments some functions that can give quite a good picture of each individual request handling. To turn this default instrumentation off, you can pass `request_handling_instrumentation=False` during initialization. You can also extend instrumentation by passing `instrument_classes=[...]` initialization argument that contains classes you want to be auto-instrumented. All their public methods will be automatically instrumented. Bear in mind that instrumentation has some runtime costs as well. The more instrumentation is used, the more overhead it will add to the crawler execution.

You can also create your instrumentation by selecting only the methods you want to instrument. For more details, see the <ApiLink to="class/CrawlerInstrumentor">`CrawlerInstrumentor`</ApiLink> source code and the [Python documentation for OpenTelemetry](https://opentelemetry.io/docs/languages/python/).
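A hedged sketch of the customization described above, assuming `CrawlerInstrumentor` is importable from `crawlee.otel` and follows the usual OpenTelemetry instrumentor interface (an `instrument()` call), with `SessionPool` picked arbitrarily as the extra class to auto-instrument:

```python
from crawlee.otel import CrawlerInstrumentor  # import path assumed from src/crawlee/otel/
from crawlee.sessions import SessionPool

# Turn off the default per-request instrumentation and instead auto-instrument
# all public methods of the classes passed via `instrument_classes`.
instrumentor = CrawlerInstrumentor(
    request_handling_instrumentation=False,
    instrument_classes=[SessionPool],
)
instrumentor.instrument()  # assumed to follow the standard OpenTelemetry BaseInstrumentor API
```

The trade-off noted above applies here as well: every additional instrumented class adds runtime overhead to the crawler.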

6 changes: 3 additions & 3 deletions src/crawlee/otel/crawler_instrumentor.py
@@ -69,7 +69,7 @@ def _init_wrapper(wrapped: Any, _: Any, args: Any, kwargs: Any) -> None:

if request_handling_instrumentation:

async def middlware_wrapper(wrapped: Any, instance: _Middleware, args: Any, kwargs: Any) -> Any:
async def middleware_wrapper(wrapped: Any, instance: _Middleware, args: Any, kwargs: Any) -> Any:
with self._tracer.start_as_current_span(
name=f'{instance.generator.__name__}, {wrapped.__name__}', # type:ignore[attr-defined] # valid in our context
attributes={
@@ -111,8 +111,8 @@ async def _commit_request_handler_result_wrapper(
# Handpicked interesting methods to instrument
self._instrumented.extend(
[
(_Middleware, 'action', middlware_wrapper),
(_Middleware, 'cleanup', middlware_wrapper),
(_Middleware, 'action', middleware_wrapper),
(_Middleware, 'cleanup', middleware_wrapper),
(ContextPipeline, '__call__', context_pipeline_wrapper),
(BasicCrawler, '_BasicCrawler__run_task_function', self._simple_async_wrapper),
(BasicCrawler, '_commit_request_handler_result', _commit_request_handler_result_wrapper),
2 changes: 1 addition & 1 deletion src/crawlee/sessions/_session_pool.py
@@ -163,7 +163,7 @@ def get_state(self, *, as_dict: bool = False) -> SessionPoolModel | dict:
def add_session(self, session: Session) -> None:
"""Add an externally created session to the pool.

This is intened only for the cases when you want to add a session that was created outside of the pool.
This is intended only for the cases when you want to add a session that was created outside of the pool.
Otherwise, the pool will create new sessions automatically.

Args:
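For context on `add_session`, a hedged usage sketch; the `Session` constructor arguments and the `SessionPool` async-context-manager usage are assumptions based on `crawlee.sessions`, not shown in this diff:

```python
import asyncio

from crawlee.sessions import Session, SessionPool


async def main() -> None:
    async with SessionPool() as pool:
        # A session created outside the pool, e.g. restored from cookies
        # exported elsewhere; normally the pool creates sessions on its own.
        external = Session(id='imported-session')
        pool.add_session(external)

        session = await pool.get_session()
        print(session.id)


asyncio.run(main())
```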
5 changes: 3 additions & 2 deletions tests/e2e/project_template/utils.py
@@ -77,7 +77,8 @@ def _patch_crawlee_version_in_pyproject_toml_based_project(project_path: Path, w
else:
raise RuntimeError('This does not look like a uv or poetry based project.')

# Create lock file that is expected by the docker to exist(Even though it wil be patched in the docker).
# Create lock file that is expected by the docker to exist (even though it will be patched
# in the docker).
subprocess.run(
args=[package_manager, 'lock'],
cwd=str(project_path),
@@ -87,7 +88,7 @@

# Add command to copy .whl to the docker image and update project with it.
# Patching in docker file due to the poetry not properly supporting relative paths for wheel packages
# and so the absolute path(in the container) is generated when running `add` command in the container.
# and so the absolute path (in the container) is generated when running `add` command in the container.
modified_lines.extend(
[
f'COPY {wheel_path.name} ./\n',
@@ -568,7 +568,7 @@ async def test_adaptive_context_query_selector_beautiful_soup(test_urls: list[st
Handler tries to locate two elements h1 and h2.
h1 exists immediately, h2 is created dynamically by inline JS snippet embedded in the html.
Create situation where page is crawled with static sub crawler first.
Static sub crawler should be able to locate only h1. It wil try to wait for h2, trying to wait for h2 will trigger
Static sub crawler should be able to locate only h1. It will try to wait for h2, trying to wait for h2 will trigger
`AdaptiveContextError` which will force the adaptive crawler to try playwright sub crawler instead. Playwright sub
crawler is able to wait for the h2 element."""

@@ -610,7 +610,7 @@ async def test_adaptive_context_query_selector_parsel(test_urls: list[str]) -> N
Handler tries to locate two elements h1 and h2.
h1 exists immediately, h2 is created dynamically by inline JS snippet embedded in the html.
Create situation where page is crawled with static sub crawler first.
Static sub crawler should be able to locate only h1. It wil try to wait for h2, trying to wait for h2 will trigger
Static sub crawler should be able to locate only h1. It will try to wait for h2, trying to wait for h2 will trigger
`AdaptiveContextError` which will force the adaptive crawler to try playwright sub crawler instead. Playwright sub
crawler is able to wait for the h2 element."""

4 changes: 2 additions & 2 deletions tests/unit/crawlers/_basic/test_basic_crawler.py
@@ -228,7 +228,7 @@ async def error_handler(context: BasicCrawlingContext, error: Exception) -> Requ
assert isinstance(error_call.error, RuntimeError)


async def test_calls_error_handler_for_sesion_errors() -> None:
async def test_calls_error_handler_for_session_errors() -> None:
crawler = BasicCrawler(
max_session_rotations=1,
)
@@ -1045,7 +1045,7 @@ async def handler(context: BasicCrawlingContext) -> None:
assert stats.requests_finished == 2


async def test_services_no_side_effet_on_crawler_init() -> None:
async def test_services_no_side_effect_on_crawler_init() -> None:
custom_configuration = Configuration()
custom_event_manager = LocalEventManager.from_config(custom_configuration)
custom_storage_client = MemoryStorageClient()