18 changes: 18 additions & 0 deletions .github/workflows/typos-check.yaml
@@ -0,0 +1,18 @@
name: Typos Check

on:
  pull_request:
    branches: [master]

jobs:
  run:
    name: Spell Check with Typos
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Check spelling
        uses: crate-ci/typos@master
        with:
          files: ./sources
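The workflow above runs the typos checker only against `./sources`, on pull requests targeting `master`. The same check can be reproduced locally before pushing; the sketch below assumes the `typos` CLI is installed via a Rust toolchain and is run from the repository root, where it picks up `_typos.toml` automatically:

```sh
# Install the checker (one option among several; assumes cargo is available).
cargo install typos-cli

# Run the same check as the CI job, restricted to the docs sources.
typos ./sources
```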
8 changes: 8 additions & 0 deletions _typos.toml
@@ -0,0 +1,8 @@
[default]
extend-ignore-re = [
'`[^`\n]+`',
'```[\s\S]*?```',
]

[default.extend-words]
SER = "SER"
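This configuration tells the checker to skip anything matched by the two regular expressions — inline code spans and fenced code blocks — so code identifiers are never flagged, and whitelists the token `SER` as a valid word. Further false positives could be accepted the same way; the commented entry below is a hypothetical illustration, not part of this PR:

```toml
[default.extend-words]
SER = "SER"
# Hypothetical: accept another domain-specific token as correct.
# wrk = "wrk"
```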
2 changes: 1 addition & 1 deletion sources/academy/glossary/tools/switchyomega.md
@@ -13,7 +13,7 @@ slug: /tools/switchyomega

SwitchyOmega is a Chrome extension for managing and switching between proxies which can be added in the [Chrome Webstore](https://chrome.google.com/webstore/detail/padekgcemlokbadohgkifijomclgjgif).

- After adding it to Chrome, you can see the SwitchyOmega icon somewhere amongst all your other Chrome extension icons. Clicking on it will display a menu, where you can select various differnt connection profiles, as well as open the extension's options.
+ After adding it to Chrome, you can see the SwitchyOmega icon somewhere amongst all your other Chrome extension icons. Clicking on it will display a menu, where you can select various different connection profiles, as well as open the extension's options.

![The SwitchyOmega interface](./images/switchyomega.png)

@@ -29,7 +29,7 @@ const crawler = new PuppeteerCrawler({
try {
    buffer = await response.buffer();
} catch (error) {
-   // some responses do not contain buffer and do not need to be catched
+   // some responses do not contain buffer and do not need to be cached
    return;
}

2 changes: 1 addition & 1 deletion sources/academy/webscraping/anti_scraping/index.md
@@ -66,7 +66,7 @@ Anti-scraping protections can work on many different layers and use a large amou

1. **Where you are coming from** - The IP address of the incoming traffic is always available to the website. Proxies are used to emulate a different IP addresses but their quality matters a lot.
2. **How you look** - With each request, the website can analyze its HTTP headers, TLS version, cyphers, and other information. Moreover, if you use a browser, the website can also analyze the whole browser fingerprint and run challenges to classify your hardware (like graphics hardware acceleration).
- 3. **What you are scraping** - The same data can be extracted in many ways from a website. You can just get the inital HTML or you can use a browser to render the full page or you can reverse engineer internal APIs. Each of those endpoints can be protected differently.
+ 3. **What you are scraping** - The same data can be extracted in many ways from a website. You can just get the initial HTML or you can use a browser to render the full page or you can reverse engineer internal APIs. Each of those endpoints can be protected differently.
4. **How you behave** - The website can see patterns in how you are ordering your requests, how fast you are scraping, etc. It can also analyze browser behavior like mouse movement, clicks or key presses.

These are the 4 main principles that anti-scraping protections are based on.
@@ -1,5 +1,5 @@
---
- title: Bypasing Cloudflare browser check
+ title: Bypassing Cloudflare browser check
description: Learn how to bypass Cloudflare browser challenge with Crawlee.
sidebar_position: 3
slug: /anti-scraping/mitigation/cloudflare-challenge.md
@@ -9,7 +9,7 @@ sidebar_position: 4

---

- When developing an [Actor](/sources/platform/actors/index.mdx) on the Apify platform, you can choose from a variety of pre-built Docker iamges to serve as the base for your Actor. These base images come with pre-installed dependencies and tools, making it easier to set up your development envrionment and ensuring consistent behavior across different environments.
+ When developing an [Actor](/sources/platform/actors/index.mdx) on the Apify platform, you can choose from a variety of pre-built Docker images to serve as the base for your Actor. These base images come with pre-installed dependencies and tools, making it easier to set up your development environment and ensuring consistent behavior across different environments.

## Base Docker images

@@ -105,7 +105,7 @@ By default, Apify base Docker images with the Apify SDK and Crawlee start your N
}
```

- This means the system expects the source code to be in `main.js` by default. If you want to override this behavior, ues a custom `package.json` and/or `Dockerfile`.
+ This means the system expects the source code to be in `main.js` by default. If you want to override this behavior, use a custom `package.json` and/or `Dockerfile`.

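As a sketch of the override mentioned in the changed line above: a custom `package.json` can point the start script at a different entry file. The `src/main.js` path here is an assumption chosen for illustration, not a convention prescribed by the docs:

```json
{
    "name": "my-actor",
    "version": "0.0.1",
    "scripts": {
        "start": "node src/main.js"
    }
}
```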
:::tip Optimization tips

@@ -111,7 +111,7 @@ To set up the Actor's output tab UI using a single configuration file, use the f
}
```

- The template above defines the configuration for the default dataset output view. Under the `views` property, there is one view titled _Overview_. The view configuartion consists of two main steps:
+ The template above defines the configuration for the default dataset output view. Under the `views` property, there is one view titled _Overview_. The view configuration consists of two main steps:

1. `transformation` - set up how to fetch the data.
2. `display` - set up how to visually present the fetched data.
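A minimal sketch of how these two steps look in the configuration (the surrounding template is collapsed in this diff, so the exact top-level keys and the `url`/`title` field names are assumptions for illustration):

```json
{
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {
                "fields": ["url", "title"]
            },
            "display": {
                "component": "table",
                "properties": {
                    "url": { "label": "URL" },
                    "title": { "label": "Title" }
                }
            }
        }
    }
}
```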
@@ -124,7 +124,7 @@ The default behavior of the Output tab UI table is to display all fields from `t

Output configuration files need to be located in the `.actor` folder within the Actor's root directory.

- You have two choices of how to organize files withing the `.actor` folder.
+ You have two choices of how to organize files within the `.actor` folder.

### Single configuration file

@@ -70,7 +70,7 @@ async def main() -> None:
</Tabs>

Please make sure to describe your Actors, their endpoints, and the schema for their
- inputs and ouputs in your README.
+ inputs and outputs in your README.

## Can I monetize my Actor in the Standby mode

20 changes: 10 additions & 10 deletions sources/platform/api_v2/api_v2_reference.apib
@@ -989,7 +989,7 @@ received in the response JSON to the [Get items](#reference/datasets/item-collec
otherwise it will have a transitional status (e.g. `RUNNING`).
+ webhooks: `dGhpcyBpcyBqdXN0IGV4YW1wbGUK...` (string, optional) - Specifies optional webhooks associated with the actor run, which can be used to receive a notification
e.g. when the actor finished or failed. The value is a Base64-encoded JSON array of objects defining the webhooks. For more information, see
- [Webhooks documenation](https://docs.apify.com/platform/integrations/webhooks).
+ [Webhooks documentation](https://docs.apify.com/platform/integrations/webhooks).

+ Request

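Several endpoints in this file accept the `webhooks` parameter described above. A hedged sketch of producing such a value on the command line, assuming the webhook object shape from the linked documentation (`eventTypes` plus `requestUrl`):

```sh
# Base64-encode a one-element webhook array for use as the `webhooks` query parameter.
echo -n '[{"eventTypes":["ACTOR.RUN.SUCCEEDED"],"requestUrl":"https://example.com/my-hook"}]' | base64
```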
@@ -1023,7 +1023,7 @@ received in the response JSON to the [Get items](#reference/datasets/item-collec
+ build: `0.1.234` (string, optional) - Specifies the actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the actor (typically `latest`).
+ webhooks: `dGhpcyBpcyBqdXN0IGV4YW1wbGUK...` (string, optional) - Specifies optional webhooks associated with the actor run, which can be used to receive a notification
e.g. when the actor finished or failed. The value is a Base64-encoded JSON array of objects defining the webhooks. For more information, see
- [Webhooks documenation](https://docs.apify.com/platform/integrations/webhooks).
+ [Webhooks documentation](https://docs.apify.com/platform/integrations/webhooks).

### With input [POST]

@@ -1141,7 +1141,7 @@ To run the actor asynchronously, use the [Run actor](#reference/actors/run-colle
+ build: `0.1.234` (string, optional) - Specifies the actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the actor (typically `latest`).
+ webhooks: `dGhpcyBpcyBqdXN0IGV4YW1wbGUK...` (string, optional) - Specifies optional webhooks associated with the actor run, which can be used to receive a notification
e.g. when the actor finished or failed. The value is a Base64-encoded JSON array of objects defining the webhooks. For more information, see
- [Webhooks documenation](https://docs.apify.com/platform/integrations/webhooks).
+ [Webhooks documentation](https://docs.apify.com/platform/integrations/webhooks).
+ format: `json` (string, optional) - Format of the results, possible values are: `json`, `jsonl`, `csv`, `html`, `xlsx`, `xml` and `rss`. The default value is `json`.
+ clean: `false` (boolean, optional) - If `true` or `1` then the API endpoint returns only non-empty items and skips hidden fields
(i.e. fields starting with the # character).
@@ -1758,7 +1758,7 @@ received in the response JSON to the [Get items](#reference/datasets/item-collec
e.g. when the actor finished or failed. The value is a Base64-encoded JSON array of objects defining the webhooks.
**Note**: if you already have a webhook set up for the actor or task, you do not have to add it again here.
For more information, see
- [Webhooks documenation](https://docs.apify.com/platform/integrations/webhooks).
+ [Webhooks documentation](https://docs.apify.com/platform/integrations/webhooks).

+ Request

@@ -1792,7 +1792,7 @@ received in the response JSON to the [Get items](#reference/datasets/item-collec
in the response. By default, it is `OUTPUT`.
+ webhooks: `dGhpcyBpcyBqdXN0IGV4YW1wbGUK...` (string, optional) - Specifies optional webhooks associated with the actor run, which can be used to receive a notification
e.g. when the actor finished or failed. The value is a Base64-encoded JSON array of objects defining the webhooks. For more information, see
- [Webhooks documenation](https://docs.apify.com/platform/integrations/webhooks).
+ [Webhooks documentation](https://docs.apify.com/platform/integrations/webhooks).

### Run task synchronously (POST) [POST]

@@ -1898,7 +1898,7 @@ To run the Task asynchronously, use the [Run task asynchronously](#reference/act
+ build: `0.1.234` (string, optional) - Specifies the actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the task settings (typically `latest`).
+ webhooks: `dGhpcyBpcyBqdXN0IGV4YW1wbGUK...` (string, optional) - Specifies optional webhooks associated with the actor run, which can be used to receive a notification
e.g. when the actor finished or failed. The value is a Base64-encoded JSON array of objects defining the webhooks. For more information, see
- [Webhooks documenation](https://docs.apify.com/platform/integrations/webhooks).
+ [Webhooks documentation](https://docs.apify.com/platform/integrations/webhooks).
+ format: `json` (string, optional) - Format of the results, possible values are: `json`, `jsonl`, `csv`, `html`, `xlsx`, `xml` and `rss`. The default value is `json`.
+ clean: `false` (boolean, optional) - If `true` or `1` then the API endpoint returns only non-empty items and skips hidden fields
(i.e. fields starting with the # character).
@@ -3005,7 +3005,7 @@ The pagination is always performed with the granularity of a single item, regard
By default, the **Items** in the response are sorted by the time they were stored to the database, therefore you can use
pagination to incrementally fetch the items as they are being added.
The maximum number of items that will be returned in a single API call is limited to 250,000. <!-- GET_ITEMS_LIMIT -->
- If you specify `desc=1` query paremeter, the results are returned in the reverse order
+ If you specify `desc=1` query parameter, the results are returned in the reverse order
than they were stored (i.e. from newest to oldest items).
Note that only the order of **Items** is reversed, but not the order of the `unwind` array elements.

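A sketch of the pagination described above against the dataset items endpoint (`DATASET_ID` and `API_TOKEN` are placeholders):

```sh
# Fetch items 1000-1999, newest first; clean=1 skips empty items and #-prefixed hidden fields.
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?offset=1000&limit=1000&desc=1&clean=1&token=API_TOKEN"
```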
@@ -3081,7 +3081,7 @@ The POST payload is a JSON object or a JSON array of objects to save into the da
**IMPORTANT:** The limit of request payload size for the dataset is 5 MB. If the array exceeds the size,
you'll need to split it into a number of smaller arrays.

- If the dataset has fields schema defined, the push request can potentialy fail with `400 Bad Request` if any item does not match the schema.
+ If the dataset has fields schema defined, the push request can potentially fail with `400 Bad Request` if any item does not match the schema.
In such case, nothing will be inserted into the dataset and the response will contain an error message with a list of invalid items and their validation errors.

+ Parameters
@@ -3767,7 +3767,7 @@ parameter.

+ Parameters

- + dispatchId: `Zib4xbZsmvZeK55ua` (string, required) - Webhook dispacth ID.
+ + dispatchId: `Zib4xbZsmvZeK55ua` (string, required) - Webhook dispatch ID.
+ token: `soSkq9ekdmfOslopH` (string, required) - API authentication token.

### Get webhook dispatch [GET]
@@ -4091,7 +4091,7 @@ a summary of your limits, and your current usage.
- taggedBuilds (object, nullable)
- latest (object, nullable)
- buildId: `z2EryhbfhgSyqj6Hn` (string, nullable)
- - buldNumber: `0.0.2` (string, nullable)
+ - buildNumber: `0.0.2` (string, nullable)
- finishedAt: `2019-06-10T11:15:49.286Z` (string, nullable)

## ActCreate (object)
2 changes: 1 addition & 1 deletion sources/platform/proxy/usage.md
@@ -143,7 +143,7 @@ Depending on whether you use a [browser](https://apify.com/apify/web-scraper) or
* Browser—a different IP address is used for each browser.
* HTTP request—a different IP address is used for each request.

- Use [sessions](#sessions) to controll how you rotate and [persist](#session-persistence) IP addresses. See our guide [Anti-scraping techniques](/academy/anti-scraping/techniques) to learn more about IP address rotation and our findings on how blocking works.
+ Use [sessions](#sessions) to control how you rotate and [persist](#session-persistence) IP addresses. See our guide [Anti-scraping techniques](/academy/anti-scraping/techniques) to learn more about IP address rotation and our findings on how blocking works.

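Sessions are typically pinned through the proxy username; a hedged sketch using the documented `proxy.apify.com:8000` endpoint, where the group name, session name, and password are placeholders and `groups-RESIDENTIAL` assumes residential access is enabled on the account:

```sh
# Requests sharing the same session name are routed through the same IP address.
curl --proxy "http://groups-RESIDENTIAL,session-my_session_1:APIFY_PROXY_PASSWORD@proxy.apify.com:8000" "https://api.apify.com/v2/browser-info"
```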
## Sessions {#sessions}

6 changes: 3 additions & 3 deletions sources/platform/storage/dataset.md
@@ -51,7 +51,7 @@ Utilize the **Actions** menu to modify the dataset's name, which also affects it

### Apify API

- The [Apify API](/api/v2#/reference/datasets) enables you progammatic access to your datasets using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods).
+ The [Apify API](/api/v2#/reference/datasets) enables you programmatic access to your datasets using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods).

If you are accessing your datasets using the `username~store-name` [store ID format](./index.md), you will need to use your secret API token. You can find the token (and your user ID) on the [Integrations](https://console.apify.com/account#/integrations)tab of **Settings** page of your Apify account.

@@ -155,7 +155,7 @@ Check out the [Python API client documentation](/api/client/python/reference/cla

When working with a JavaScript [Actor](../actors/index.mdx), the [JavaScript SDK](/sdk/js/docs/guides/result-storage#dataset) is an essential tool, especially for dataset management. It simplifies the tasks of storing and retrieving data, seamlessly integrating with the Actor's workflow. Key features of the SDK include the ability to append data, retrieve what is stored, and manage dataset properties effectively. Central to this functionality is the [`Dataset`](/sdk/js/reference/class/Dataset) class. This class allows you to determine where your data is stored - locally or in the Apify cloud. To add data to your chosen datasets, use the [`pushData()`](/sdk/js/reference/class/Dataset#pushData) method.

- Additionaly the SDK offers other methods like [`getData()`](/sdk/js/reference/class/Dataset#getData), [`map()`](/sdk/js/reference/class/Dataset#map), and [`reduce()`](/sdk/js/reference/class/Dataset#reduce). For practical applications of these methods, refer to the [example](/sdk/js/docs/examples/map-and-reduce) section.
+ Additionally the SDK offers other methods like [`getData()`](/sdk/js/reference/class/Dataset#getData), [`map()`](/sdk/js/reference/class/Dataset#map), and [`reduce()`](/sdk/js/reference/class/Dataset#reduce). For practical applications of these methods, refer to the [example](/sdk/js/docs/examples/map-and-reduce) section.

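A minimal sketch of the methods named above inside an Actor (the record fields are placeholders):

```js
import { Actor } from 'apify';

await Actor.init();

// Open the Actor's default dataset and append one record.
const dataset = await Actor.openDataset();
await dataset.pushData({ url: 'https://example.com', title: 'Example' });

// Read a page of the stored items back.
const { items } = await dataset.getData({ limit: 10 });
console.log(items);

await Actor.exit();
```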
If you have chosen to store your dataset locally, you can find it in the location below.

@@ -284,7 +284,7 @@ For more information, visit our [Python SDK documentation](/sdk/python/docs/conc

Fields in a dataset that begin with a `#` are treated as hidden. You can exclude these fields when downloading data by using either `skipHidden=1` or `clean=1` in your query parameters. This feature is useful for excluding debug information from the final dataset output.

- The following example demonstates a dataset record with hiddent fields, including HTTP response and error details.
+ The following example demonstrates a dataset record with hiddent fields, including HTTP response and error details.

```json
{
2 changes: 1 addition & 1 deletion sources/platform/storage/key_value_store.md
@@ -47,7 +47,7 @@ Click on the **API** button to view and test a store's [API endpoints](/api/v2#/

### Apify API

- The [Apify API](/api/v2#/reference/key-value-stores) enables you programmatic acces to your key-value stores using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods).
+ The [Apify API](/api/v2#/reference/key-value-stores) enables you programmatic access to your key-value stores using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods).

If you are accessing your datasets using the `username~store-name` [store ID format](./index.md), you will need to use your secret API token. You can find the token (and your user ID) on the [Integrations](https://console.apify.com/account#/integrations) tab of **Settings** page of your Apify account.

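A sketch of the corresponding record endpoint (store name, record key, and token are placeholders):

```sh
# Read a single record from a named key-value store.
curl "https://api.apify.com/v2/key-value-stores/username~store-name/records/INPUT?token=API_TOKEN"
```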
2 changes: 1 addition & 1 deletion sources/platform/storage/request_queue.md
@@ -15,7 +15,7 @@ import TabItem from '@theme/TabItem';

Request queues enable you to enqueue and retrieve requests such as URLs with an [HTTP method](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) and other parameters. They prove essential not only in web crawling scenarios but also in any situation requiring the management of a large number of URLs and the addition of new links.

- The storage system for request queues accomoodates both breadth-first and depth-first crawling stategies, along with the inclusion of custom data attributes. This system enables you to check if certain URLs have already been encountered, add new URLs to the queue, and retrieve the next set of URLs fo processing.
+ The storage system for request queues accomoodates both breadth-first and depth-first crawling strategies, along with the inclusion of custom data attributes. This system enables you to check if certain URLs have already been encountered, add new URLs to the queue, and retrieve the next set of URLs for processing.

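A hedged sketch of the queue operations described above, using the JavaScript SDK (the URL is a placeholder):

```js
import { Actor } from 'apify';

await Actor.init();

const queue = await Actor.openRequestQueue();

// Enqueueing is deduplicated by unique key, so an already-seen URL is not added twice.
await queue.addRequest({ url: 'https://example.com/start' });

// Retrieve the next request, process it, then mark it as handled.
const request = await queue.fetchNextRequest();
if (request) {
    console.log(`Processing ${request.url}`);
    await queue.markRequestHandled(request);
}

await Actor.exit();
```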
> Named request queues are retained indefinitely. <br/>
> Unnamed request queues expire after 7 days unless otherwise specified.<br/>
> [Learn more](./index.md#named-and-unnamed-storages)