diff --git a/.github/workflows/typos-check.yaml b/.github/workflows/typos-check.yaml
new file mode 100644
index 0000000000..ab19adf905
--- /dev/null
+++ b/.github/workflows/typos-check.yaml
@@ -0,0 +1,18 @@
+name: Typos Check
+
+on:
+  pull_request:
+    branches: [master]
+
+jobs:
+  run:
+    name: Spell Check with Typos
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Check spelling
+        uses: crate-ci/typos@master
+        with:
+          files: ./sources
diff --git a/_typos.toml b/_typos.toml
new file mode 100644
index 0000000000..66515761dd
--- /dev/null
+++ b/_typos.toml
@@ -0,0 +1,8 @@
+[default]
+extend-ignore-re = [
+    '`[^`\n]+`',
+    '```[\s\S]*?```',
+]
+
+[default.extend-words]
+SER = "SER"
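The same check can be exercised locally before opening a pull request — a quick sketch, assuming the `typos` CLI from crate-ci is installed (for example via `cargo install typos-cli`):

```sh
# Pick up _typos.toml from the repository root and scan the docs sources,
# mirroring what the workflow above runs on pull requests.
typos --config _typos.toml ./sources

# Optionally let the tool write the unambiguous corrections in place.
typos --config _typos.toml --write-changes ./sources
```

The two `extend-ignore-re` patterns skip inline code spans and fenced code blocks, so identifiers inside code samples are not flagged, and `extend-words` pins `SER` as a valid word.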
diff --git a/sources/academy/glossary/tools/switchyomega.md b/sources/academy/glossary/tools/switchyomega.md
index 9a6eec6f5a..8a0eb4b9c5 100644
--- a/sources/academy/glossary/tools/switchyomega.md
+++ b/sources/academy/glossary/tools/switchyomega.md
@@ -13,7 +13,7 @@ slug: /tools/switchyomega
 
 SwitchyOmega is a Chrome extension for managing and switching between proxies which can be added in the [Chrome Webstore](https://chrome.google.com/webstore/detail/padekgcemlokbadohgkifijomclgjgif).
 
-After adding it to Chrome, you can see the SwitchyOmega icon somewhere amongst all your other Chrome extension icons. Clicking on it will display a menu, where you can select various differnt connection profiles, as well as open the extension's options.
+After adding it to Chrome, you can see the SwitchyOmega icon somewhere amongst all your other Chrome extension icons. Clicking on it will display a menu, where you can select various different connection profiles, as well as open the extension's options.
 
 ![The SwitchyOmega interface](./images/switchyomega.png)
diff --git a/sources/academy/tutorials/node_js/caching_responses_in_puppeteer.js b/sources/academy/tutorials/node_js/caching_responses_in_puppeteer.js
index 3c73dcf9ad..32b4e58b8b 100644
--- a/sources/academy/tutorials/node_js/caching_responses_in_puppeteer.js
+++ b/sources/academy/tutorials/node_js/caching_responses_in_puppeteer.js
@@ -29,7 +29,7 @@ const crawler = new PuppeteerCrawler({
         try {
             buffer = await response.buffer();
         } catch (error) {
-            // some responses do not contain buffer and do not need to be catched
+            // some responses do not contain buffer and do not need to be cached
             return;
         }
diff --git a/sources/academy/webscraping/anti_scraping/index.md b/sources/academy/webscraping/anti_scraping/index.md
index 8446d03088..c74bdf8f70 100644
--- a/sources/academy/webscraping/anti_scraping/index.md
+++ b/sources/academy/webscraping/anti_scraping/index.md
@@ -66,7 +66,7 @@ Anti-scraping protections can work on many different layers and use a large amou
 1. **Where you are coming from** - The IP address of the incoming traffic is always available to the website. Proxies are used to emulate a different IP addresses but their quality matters a lot.
 2. **How you look** - With each request, the website can analyze its HTTP headers, TLS version, cyphers, and other information. Moreover, if you use a browser, the website can also analyze the whole browser fingerprint and run challenges to classify your hardware (like graphics hardware acceleration).
-3. **What you are scraping** - The same data can be extracted in many ways from a website. You can just get the inital HTML or you can use a browser to render the full page or you can reverse engineer internal APIs. Each of those endpoints can be protected differently.
+3. **What you are scraping** - The same data can be extracted in many ways from a website. You can just get the initial HTML or you can use a browser to render the full page or you can reverse engineer internal APIs. Each of those endpoints can be protected differently.
 4. **How you behave** - The website can see patterns in how you are ordering your requests, how fast you are scraping, etc. It can also analyze browser behavior like mouse movement, clicks or key presses.
 
 These are the 4 main principles that anti-scraping protections are based on.
diff --git a/sources/academy/webscraping/anti_scraping/mitigation/cloudflare_challenge.md b/sources/academy/webscraping/anti_scraping/mitigation/cloudflare_challenge.md
index fdd8491766..516f1ff3bd 100644
--- a/sources/academy/webscraping/anti_scraping/mitigation/cloudflare_challenge.md
+++ b/sources/academy/webscraping/anti_scraping/mitigation/cloudflare_challenge.md
@@ -1,5 +1,5 @@
 ---
-title: Bypasing Cloudflare browser check
+title: Bypassing Cloudflare browser check
 description: Learn how to bypass Cloudflare browser challenge with Crawlee.
 sidebar_position: 3
 slug: /anti-scraping/mitigation/cloudflare-challenge.md
diff --git a/sources/platform/actors/development/actor_definition/docker.md b/sources/platform/actors/development/actor_definition/docker.md
index 0ffbdeb47f..1f1e1611ac 100644
--- a/sources/platform/actors/development/actor_definition/docker.md
+++ b/sources/platform/actors/development/actor_definition/docker.md
@@ -9,7 +9,7 @@ sidebar_position: 4
 ---
 
-When developing an [Actor](/sources/platform/actors/index.mdx) on the Apify platform, you can choose from a variety of pre-built Docker iamges to serve as the base for your Actor. These base images come with pre-installed dependencies and tools, making it easier to set up your development envrionment and ensuring consistent behavior across different environments.
+When developing an [Actor](/sources/platform/actors/index.mdx) on the Apify platform, you can choose from a variety of pre-built Docker images to serve as the base for your Actor. These base images come with pre-installed dependencies and tools, making it easier to set up your development environment and ensuring consistent behavior across different environments.
 
 ## Base Docker images
@@ -105,7 +105,7 @@ By default, Apify base Docker images with the Apify SDK and Crawlee start your N
 }
 ```
 
-This means the system expects the source code to be in `main.js` by default. If you want to override this behavior, ues a custom `package.json` and/or `Dockerfile`.
+This means the system expects the source code to be in `main.js` by default. If you want to override this behavior, use a custom `package.json` and/or `Dockerfile`.
 
 :::tip Optimization tips
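To make that override concrete — a minimal sketch only, assuming the base image launches the app with `npm start`; the `src/server.js` path and package name are hypothetical — a custom `package.json` could point the start script at a different entry file:

```json
{
    "name": "my-actor",
    "version": "0.0.1",
    "type": "module",
    "scripts": {
        "start": "node src/server.js"
    }
}
```

With this in place, the base image would run `src/server.js` instead of the default `main.js`.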
diff --git a/sources/platform/actors/development/actor_definition/output_schema.md b/sources/platform/actors/development/actor_definition/output_schema.md
index 0020d8eec2..3d3411e9fc 100644
--- a/sources/platform/actors/development/actor_definition/output_schema.md
+++ b/sources/platform/actors/development/actor_definition/output_schema.md
@@ -111,7 +111,7 @@ To set up the Actor's output tab UI using a single configuration file, use the f
 }
 ```
 
-The template above defines the configuration for the default dataset output view. Under the `views` property, there is one view titled _Overview_. The view configuartion consists of two main steps:
+The template above defines the configuration for the default dataset output view. Under the `views` property, there is one view titled _Overview_. The view configuration consists of two main steps:
 
 1. `transformation` - set up how to fetch the data.
 2. `display` - set up how to visually present the fetched data.
@@ -124,7 +124,7 @@ The default behavior of the Output tab UI table is to display all fields from `t
 
 Output configuration files need to be located in the `.actor` folder within the Actor's root directory.
 
-You have two choices of how to organize files withing the `.actor` folder.
+You have two choices of how to organize files within the `.actor` folder.
 
 ### Single configuration file
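As an illustration of those two steps — a sketch only; the field names below are hypothetical, and the full set of supported `display` options is documented in the output schema reference — a single view might look like:

```json
{
    "title": "Overview",
    "transformation": {
        "fields": ["imageUrl", "title", "price"]
    },
    "display": {
        "component": "table",
        "properties": {
            "imageUrl": { "label": "Image", "format": "image" },
            "title": { "label": "Product name" },
            "price": { "label": "Price" }
        }
    }
}
```

Here `transformation.fields` selects which dataset fields to fetch, and `display.properties` maps them onto table columns.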
diff --git a/sources/platform/actors/development/programming_interface/actor_standby.md b/sources/platform/actors/development/programming_interface/actor_standby.md
index 5085ae961c..7cf1515b0b 100644
--- a/sources/platform/actors/development/programming_interface/actor_standby.md
+++ b/sources/platform/actors/development/programming_interface/actor_standby.md
@@ -70,7 +70,7 @@ async def main() -> None:
 
 Please make sure to describe your Actors, their endpoints, and the schema for their
-inputs and ouputs in your README.
+inputs and outputs in your README.
 
 ## Can I monetize my Actor in the Standby mode
diff --git a/sources/platform/api_v2/api_v2_reference.apib b/sources/platform/api_v2/api_v2_reference.apib
index c4e9f1ce27..1858e5cef8 100644
--- a/sources/platform/api_v2/api_v2_reference.apib
+++ b/sources/platform/api_v2/api_v2_reference.apib
@@ -989,7 +989,7 @@ received in the response JSON to the [Get items](#reference/datasets/item-collec
       otherwise it will have a transitional status (e.g. `RUNNING`).
     + webhooks: `dGhpcyBpcyBqdXN0IGV4YW1wbGUK...` (string, optional) - Specifies optional webhooks associated with the actor run, which can be used to receive a notification
       e.g. when the actor finished or failed. The value is a Base64-encoded JSON array of objects defining the webhooks. For more information, see
-      [Webhooks documenation](https://docs.apify.com/platform/integrations/webhooks).
+      [Webhooks documentation](https://docs.apify.com/platform/integrations/webhooks).
 
 + Request
@@ -1023,7 +1023,7 @@ received in the response JSON to the [Get items](#reference/datasets/item-collec
     + build: `0.1.234` (string, optional) - Specifies the actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the actor (typically `latest`).
     + webhooks: `dGhpcyBpcyBqdXN0IGV4YW1wbGUK...` (string, optional) - Specifies optional webhooks associated with the actor run, which can be used to receive a notification
       e.g. when the actor finished or failed. The value is a Base64-encoded JSON array of objects defining the webhooks. For more information, see
-      [Webhooks documenation](https://docs.apify.com/platform/integrations/webhooks).
+      [Webhooks documentation](https://docs.apify.com/platform/integrations/webhooks).
 
 ### With input [POST]
@@ -1141,7 +1141,7 @@ To run the actor asynchronously, use the [Run actor](#reference/actors/run-colle
     + build: `0.1.234` (string, optional) - Specifies the actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the actor (typically `latest`).
     + webhooks: `dGhpcyBpcyBqdXN0IGV4YW1wbGUK...` (string, optional) - Specifies optional webhooks associated with the actor run, which can be used to receive a notification e.g. when the actor finished or failed. The value is a Base64-encoded JSON array of objects defining the webhooks.
       For more information, see
-      [Webhooks documenation](https://docs.apify.com/platform/integrations/webhooks).
+      [Webhooks documentation](https://docs.apify.com/platform/integrations/webhooks).
     + format: `json` (string, optional) - Format of the results, possible values are: `json`, `jsonl`, `csv`, `html`, `xlsx`, `xml` and `rss`. The default value is `json`.
     + clean: `false` (boolean, optional) - If `true` or `1` then the API endpoint returns only non-empty items and skips hidden fields (i.e. fields starting with the # character).
@@ -1758,7 +1758,7 @@ received in the response JSON to the [Get items](#reference/datasets/item-collec
       e.g. when the actor finished or failed. The value is a Base64-encoded JSON array of objects defining the webhooks.
       **Note**: if you already have a webhook set up for the actor or task, you do not have to add it again here. For more information, see
-      [Webhooks documenation](https://docs.apify.com/platform/integrations/webhooks).
+      [Webhooks documentation](https://docs.apify.com/platform/integrations/webhooks).
 
 + Request
@@ -1792,7 +1792,7 @@ received in the response JSON to the [Get items](#reference/datasets/item-collec
       in the response. By default, it is `OUTPUT`.
     + webhooks: `dGhpcyBpcyBqdXN0IGV4YW1wbGUK...` (string, optional) - Specifies optional webhooks associated with the actor run, which can be used to receive a notification
       e.g. when the actor finished or failed. The value is a Base64-encoded JSON array of objects defining the webhooks. For more information, see
-      [Webhooks documenation](https://docs.apify.com/platform/integrations/webhooks).
+      [Webhooks documentation](https://docs.apify.com/platform/integrations/webhooks).
 
 ### Run task synchronously (POST) [POST]
@@ -1898,7 +1898,7 @@ To run the Task asynchronously, use the [Run task asynchronously](#reference/act
     + build: `0.1.234` (string, optional) - Specifies the actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the task settings (typically `latest`).
     + webhooks: `dGhpcyBpcyBqdXN0IGV4YW1wbGUK...` (string, optional) - Specifies optional webhooks associated with the actor run, which can be used to receive a notification
       e.g. when the actor finished or failed. The value is a Base64-encoded JSON array of objects defining the webhooks. For more information, see
-      [Webhooks documenation](https://docs.apify.com/platform/integrations/webhooks).
+      [Webhooks documentation](https://docs.apify.com/platform/integrations/webhooks).
     + format: `json` (string, optional) - Format of the results, possible values are: `json`, `jsonl`, `csv`, `html`, `xlsx`, `xml` and `rss`. The default value is `json`.
     + clean: `false` (boolean, optional) - If `true` or `1` then the API endpoint returns only non-empty items and skips hidden fields (i.e. fields starting with the # character).
@@ -3005,7 +3005,7 @@ The pagination is always performed with the granularity of a single item, regard
 By default, the **Items** in the response are sorted by the time they were stored to the database, therefore you can use pagination to incrementally fetch the items as they are being added.
 The maximum number of items that will be returned in a single API call is limited to 250,000.
-If you specify `desc=1` query paremeter, the results are returned in the reverse order
+If you specify `desc=1` query parameter, the results are returned in the reverse order
 than they were stored (i.e. from newest to oldest items). Note that only the order of **Items** is reversed,
 but not the order of the `unwind` array elements.
@@ -3081,7 +3081,7 @@ The POST payload is a JSON object or a JSON array of objects to save into the da
 **IMPORTANT:** The limit of request payload size for the dataset is 5 MB. If the array exceeds the size, you'll need to split it into a number of smaller arrays.
 
-If the dataset has fields schema defined, the push request can potentialy fail with `400 Bad Request` if any item does not match the schema.
+If the dataset has fields schema defined, the push request can potentially fail with `400 Bad Request` if any item does not match the schema.
 In such case, nothing will be inserted into the dataset and the response will contain an error message with a list of invalid items and their validation errors.
 
 + Parameters
@@ -3767,7 +3767,7 @@ parameter.
 
 + Parameters
 
-    + dispatchId: `Zib4xbZsmvZeK55ua` (string, required) - Webhook dispacth ID.
+    + dispatchId: `Zib4xbZsmvZeK55ua` (string, required) - Webhook dispatch ID.
     + token: `soSkq9ekdmfOslopH` (string, required) - API authentication token.
 
 ### Get webhook dispatch [GET]
@@ -4091,7 +4091,7 @@ a summary of your limits, and your current usage.
     - taggedBuilds (object, nullable)
         - latest (object, nullable)
             - buildId: `z2EryhbfhgSyqj6Hn` (string, nullable)
-            - buldNumber: `0.0.2` (string, nullable)
+            - buildNumber: `0.0.2` (string, nullable)
            - finishedAt: `2019-06-10T11:15:49.286Z` (string, nullable)
 
 ## ActCreate (object)
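Since several endpoints above accept the same Base64-encoded `webhooks` parameter, here is a sketch of how a caller could build it (the handler URL is a placeholder; the `eventTypes` and `requestUrl` field names follow the linked webhooks documentation):

```js
// Define the webhooks as a plain JSON array of objects...
const webhooks = [
    {
        eventTypes: ['ACTOR.RUN.SUCCEEDED', 'ACTOR.RUN.FAILED'],
        requestUrl: 'https://example.com/run-finished', // placeholder endpoint
    },
];

// ...then Base64-encode it into the single query parameter the API expects.
const encoded = Buffer.from(JSON.stringify(webhooks)).toString('base64');
const runUrl = `https://api.apify.com/v2/acts/<ACTOR_ID>/runs?token=<TOKEN>&webhooks=${encoded}`;
```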
diff --git a/sources/platform/proxy/usage.md b/sources/platform/proxy/usage.md
index b7c9ca96ce..536c5ca59c 100644
--- a/sources/platform/proxy/usage.md
+++ b/sources/platform/proxy/usage.md
@@ -143,7 +143,7 @@ Depending on whether you use a [browser](https://apify.com/apify/web-scraper) or
 * Browser—a different IP address is used for each browser.
 * HTTP request—a different IP address is used for each request.
 
-Use [sessions](#sessions) to controll how you rotate and [persist](#session-persistence) IP addresses. See our guide [Anti-scraping techniques](/academy/anti-scraping/techniques) to learn more about IP address rotation and our findings on how blocking works.
+Use [sessions](#sessions) to control how you rotate and [persist](#session-persistence) IP addresses. See our guide [Anti-scraping techniques](/academy/anti-scraping/techniques) to learn more about IP address rotation and our findings on how blocking works.
 
 ## Sessions {#sessions}
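A sketch of what session-based rotation looks like in practice — the session name is arbitrary, the password placeholder must come from the Apify Console, and the URL follows Apify Proxy's documented `username:password@proxy.apify.com:8000` format:

```js
import { gotScraping } from 'got-scraping';

// Requests that reuse the same session name are routed through the same
// IP address, for as long as that IP remains available.
const proxyUrl = 'http://groups-RESIDENTIAL,session-my_session_1:<PROXY_PASSWORD>@proxy.apify.com:8000';

const { body } = await gotScraping({ url: 'https://api.apify.com/v2/browser-info', proxyUrl });
console.log(body);
```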
diff --git a/sources/platform/storage/dataset.md b/sources/platform/storage/dataset.md
index 60de0c375a..53c255f765 100644
--- a/sources/platform/storage/dataset.md
+++ b/sources/platform/storage/dataset.md
@@ -51,7 +51,7 @@ Utilize the **Actions** menu to modify the dataset's name, which also affects it
 
 ### Apify API
 
-The [Apify API](/api/v2#/reference/datasets) enables you progammatic access to your datasets using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods).
+The [Apify API](/api/v2#/reference/datasets) gives you programmatic access to your datasets using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods).
 
 If you are accessing your datasets using the `username~store-name` [store ID format](./index.md), you will need to use your secret API token. You can find the token (and your user ID) on the [Integrations](https://console.apify.com/account#/integrations)tab of **Settings** page of your Apify account.
@@ -155,7 +155,7 @@ Check out the [Python API client documentation](/api/client/python/reference/cla
 When working with a JavaScript [Actor](../actors/index.mdx), the [JavaScript SDK](/sdk/js/docs/guides/result-storage#dataset) is an essential tool, especially for dataset management. It simplifies the tasks of storing and retrieving data, seamlessly integrating with the Actor's workflow. Key features of the SDK include the ability to append data, retrieve what is stored, and manage dataset properties effectively. Central to this functionality is the [`Dataset`](/sdk/js/reference/class/Dataset) class. This class allows you to determine where your data is stored - locally or in the Apify cloud. To add data to your chosen datasets, use the [`pushData()`](/sdk/js/reference/class/Dataset#pushData) method.
-Additionaly the SDK offers other methods like [`getData()`](/sdk/js/reference/class/Dataset#getData), [`map()`](/sdk/js/reference/class/Dataset#map), and [`reduce()`](/sdk/js/reference/class/Dataset#reduce). For practical applications of these methods, refer to the [example](/sdk/js/docs/examples/map-and-reduce) section.
+Additionally, the SDK offers other methods like [`getData()`](/sdk/js/reference/class/Dataset#getData), [`map()`](/sdk/js/reference/class/Dataset#map), and [`reduce()`](/sdk/js/reference/class/Dataset#reduce). For practical applications of these methods, refer to the [example](/sdk/js/docs/examples/map-and-reduce) section.
 
 If you have chosen to store your dataset locally, you can find it in the location below.
@@ -284,7 +284,7 @@ For more information, visit our [Python SDK documentation](/sdk/python/docs/conc
 Fields in a dataset that begin with a `#` are treated as hidden. You can exclude these fields when downloading data by using either `skipHidden=1` or `clean=1` in your query parameters. This feature is useful for excluding debug information from the final dataset output.
 
-The following example demonstates a dataset record with hiddent fields, including HTTP response and error details.
+The following example demonstrates a dataset record with hidden fields, including HTTP response and error details.
 
 ```json
 {
diff --git a/sources/platform/storage/key_value_store.md b/sources/platform/storage/key_value_store.md
index 87209bc049..13d30504f6 100644
--- a/sources/platform/storage/key_value_store.md
+++ b/sources/platform/storage/key_value_store.md
@@ -47,7 +47,7 @@ Click on the **API** button to view and test a store's [API endpoints](/api/v2#/
 
 ### Apify API
 
-The [Apify API](/api/v2#/reference/key-value-stores) enables you programmatic acces to your key-value stores using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods).
+The [Apify API](/api/v2#/reference/key-value-stores) gives you programmatic access to your key-value stores using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods).
 
 If you are accessing your datasets using the `username~store-name` [store ID format](./index.md), you will need to use your secret API token. You can find the token (and your user ID) on the [Integrations](https://console.apify.com/account#/integrations) tab of **Settings** page of your Apify account.
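Tying the two storage sections above together — a minimal sketch, assuming the JavaScript SDK v3 (`apify` package); the record fields and key names are made up:

```js
import { Actor } from 'apify';

await Actor.init();

// Datasets: append structured records to the run's default dataset,
// then read the newest ones back.
const dataset = await Actor.openDataset();
await dataset.pushData({ url: 'https://example.com', title: 'Example Domain' });
const { items } = await dataset.getData({ limit: 10, desc: true });

// Key-value stores: save a single named value (content type is inferred for JSON).
const store = await Actor.openKeyValueStore();
await store.setValue('OUTPUT', { itemCount: items.length });

await Actor.exit();
```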
diff --git a/sources/platform/storage/request_queue.md b/sources/platform/storage/request_queue.md
index 19d34dad3e..b99c3cf739 100644
--- a/sources/platform/storage/request_queue.md
+++ b/sources/platform/storage/request_queue.md
@@ -15,7 +15,7 @@ import TabItem from '@theme/TabItem';
 
 Request queues enable you to enqueue and retrieve requests such as URLs with an [HTTP method](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) and other parameters. They prove essential not only in web crawling scenarios but also in any situation requiring the management of a large number of URLs and the addition of new links.
 
-The storage system for request queues accomoodates both breadth-first and depth-first crawling stategies, along with the inclusion of custom data attributes. This system enables you to check if certain URLs have already been encountered, add new URLs to the queue, and retrieve the next set of URLs fo processing.
+The storage system for request queues accommodates both breadth-first and depth-first crawling strategies, along with the inclusion of custom data attributes. This system enables you to check if certain URLs have already been encountered, add new URLs to the queue, and retrieve the next set of URLs for processing.
 
 > Named request queues are retained indefinitely.
 > Unnamed request queues expire after 7 days unless otherwise specified.
 > [Learn more](./index.md#named-and-unnamed-storages)
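A matching sketch for the queue workflow described above (same assumptions as the earlier storage example — SDK v3, made-up URL):

```js
import { Actor } from 'apify';

await Actor.init();

const queue = await Actor.openRequestQueue();

// Enqueue a URL; repeated additions of the same URL are deduplicated.
await queue.addRequest({ url: 'https://example.com', method: 'GET' });

// Pull the next request, process it, and mark it handled so it is not retried.
const request = await queue.fetchNextRequest();
if (request) {
    // ... fetch and parse request.url here ...
    await queue.markRequestHandled(request);
}

await Actor.exit();
```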