docs: add deployment section to the intro guide (#2070)
B4nan committed Sep 8, 2023
1 parent 201f8fa commit 3b95f46
Showing 11 changed files with 245 additions and 39 deletions.
6 changes: 3 additions & 3 deletions docs/introduction/02-first-crawler.mdx
@@ -40,7 +40,7 @@ We are using a JavaScript feature called [Top level await](https://blog.saeloun.

Earlier you learned that the crawler uses a queue of requests as its source of URLs to crawl. Let's create it and add the first request.

```ts title="src/main.mjs"
```ts title="src/main.js"
import { RequestQueue } from 'crawlee';

// First you create the request queue instance.
@@ -65,7 +65,7 @@ Unless you have a good reason to start with a different one, you should try buil

Let's continue with the earlier `RequestQueue` example.

```ts title="src/main.mjs"
```ts title="src/main.js"
// Add import of CheerioCrawler
import { RequestQueue, CheerioCrawler } from 'crawlee';

@@ -100,7 +100,7 @@ The title of "https://crawlee.dev" is: Crawlee · The scalable web crawling, scr

Earlier we mentioned that you'll learn how to use the `crawler.addRequests()` method to skip the request queue initialization. It's simple. Every crawler has an implicit `RequestQueue` instance, and you can add requests to it with the `crawler.addRequests()` method. In fact, you can go even further and just use the first parameter of `crawler.run()`!

```ts title="src/main.mjs"
```ts title="src/main.js"
// You don't need to import RequestQueue anymore
import { CheerioCrawler } from 'crawlee';

2 changes: 1 addition & 1 deletion docs/introduction/07-saving-data.mdx
@@ -91,4 +91,4 @@ If you would like to store your data in a single big file, instead of many small

## Next lesson

In the next and final lesson, we will show you some improvements that you can add to your crawler code that will make it more readable and maintainable in the long run.
In the next lesson, we will show you some improvements that you can add to your crawler code that will make it more readable and maintainable in the long run.
16 changes: 2 additions & 14 deletions docs/introduction/08-refactoring.mdx
@@ -115,18 +115,6 @@ At first, it might seem more readable using just a simple `if / else` statement

It's good practice in any programming language to split your logic into bite-sized chunks that are easy to read and reason about. Scrolling through a thousand line long `requestHandler()` where everything interacts with everything and variables can be used everywhere is not a beautiful thing to do and a pain to debug. That's why we prefer the separation of routes into their own files.

## Learning more about web scraping
## Next lesson

:::tip

If you want to learn more about web scraping and browser automation, check out the [Apify Academy](https://developers.apify.com/academy). It's full of courses and tutorials on the topic. From beginner to advanced. And the best thing: **It's free and open source** ❤️

:::

## Running your crawler in the Cloud

Now that you have your crawler ready, it's the right time to think about where you want to run it. If you used the CLI to bootstrap your project, you already have a **Dockerfile** ready. To read more about how to run this Dockerfile in the cloud, check out the [Apify Platform guide](../guides/apify-platform).

## Thank you! 🎉

That's it! Thanks for reading the whole introduction and if there's anything wrong, please 🙏 let us know on [GitHub](https://github.com/apify/crawlee) or in our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! 👋
In the next and final lesson, we will show you how you can deploy your Crawlee project to the cloud. If you used the CLI to bootstrap your project, you already have a **Dockerfile** ready, and the next section will show you how to deploy it to the [Apify Platform](../guides/apify-platform) with ease.
114 changes: 114 additions & 0 deletions docs/introduction/09-deployment.mdx
@@ -0,0 +1,114 @@
---
id: deployment
title: "Running your crawler in the Cloud"
sidebar_label: "Deployment"
description: Deploying Crawlee projects to the Apify Platform
---

## Apify Platform

Crawlee is developed by [**Apify**](https://apify.com), the web scraping and automation platform. You could say it is the **home of Crawlee projects**. In this section we will show you how to deploy the crawler there with just a few simple steps. You can deploy a **Crawlee** project wherever you want, but using the [**Apify Platform**](https://console.apify.com) will give you the best experience.

[//]: # (TODO mention the other deployment guides here once they are available)

With a few simple steps, you can convert your Crawlee project into a so-called **Actor**. Actors are serverless micro-apps that are easy to develop, run, share, and integrate. The infra, proxies, and storages are ready to go. [Learn more about Actors](https://apify.com/actors).

:::info

We started this guide by using the Crawlee CLI to bootstrap the project - it offers the basic Crawlee templates, including a ready-made `Dockerfile`. If you know you will be deploying your project to the Apify Platform, you might want to start with the Apify CLI instead. It also offers several project templates, and those are all set up to be used on the Apify Platform right away.

:::

## Dependencies

The first step will be installing two new dependencies:

- Apify SDK, a toolkit for working with the Apify Platform. This will allow us to wire the storages (e.g. `RequestQueue` and `Dataset`) to the Apify cloud products. This will be a dependency of our Node.js project.
```bash
npm install apify
```

- Apify CLI, a command-line tool that will help us with authentication and deployment. This will be a globally installed tool: you install it only once and use it in all your Crawlee/Apify projects.
```bash
npm install -g apify-cli
```

## Logging in to the Apify Platform

The next step will be [creating your Apify account](https://console.apify.com/sign-up). Don't worry, we have a **free tier**, so you can try things out before you buy in! Once you have that, it's time to log in with the just-installed [Apify CLI](https://docs.apify.com/cli/). You will need your personal access token, which you can find at https://console.apify.com/account#/integrations.

```bash
apify login
```

## Adjusting the code

Now that you have your account set up, you will need to adjust the code a tiny bit. We will use the [Apify SDK](https://docs.apify.com/sdk/js/), which will help us to wire the Crawlee storages (like the `RequestQueue`) to their Apify Platform counterparts - otherwise Crawlee would keep things only in memory.

Open your `src/main.js` file (or `src/main.ts` if you used a TypeScript template), and add `Actor.init()` to the beginning of your main script and `Actor.exit()` to the end of it. Don't forget to `await` those calls, as both functions are async. Your code should look like this:

```ts title="src/main.js"
// highlight-next-line
import { Actor } from 'apify';
import { PlaywrightCrawler, log } from 'crawlee';
import { router } from './routes.mjs';

// highlight-next-line
await Actor.init();

// This is better set with CRAWLEE_LOG_LEVEL env var
// or a configuration option. This is just for show 😈
log.setLevel(log.LEVELS.DEBUG);

log.debug('Setting up crawler.');
const crawler = new PlaywrightCrawler({
    // Instead of the long requestHandler with
    // if clauses we provide a router instance.
    requestHandler: router,
});

await crawler.run(['https://apify.com/store']);

// highlight-next-line
await Actor.exit();
```

The `Actor.init()` call will configure Crawlee to use the Apify API instead of its default memory storage interface. It also sets up a few other things, like listening to the platform events via websockets. The `Actor.exit()` call then handles graceful shutdown - it closes the open handles created by the `Actor.init()` call, since otherwise the Node.js process would hang instead of exiting.
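
To see why an open handle keeps the process from exiting, here is a minimal sketch that uses only a plain timer. The `init()` and `exit()` functions and the `Session` type are hypothetical stand-ins for illustration, not the Apify SDK's implementation:

```ts
// Hypothetical stand-ins for illustration only, not the Apify SDK.
interface Session {
    heartbeat: ReturnType<typeof setInterval>;
    open: boolean;
}

// Opens a long-lived handle, similar in spirit to a websocket listener.
function init(): Session {
    const heartbeat = setInterval(() => {
        // A real implementation would react to platform events here.
    }, 1_000);
    return { heartbeat, open: true };
}

// Closes the handle so the Node.js event loop can drain and the process can end.
function exit(session: Session): void {
    clearInterval(session.heartbeat);
    session.open = false;
}

const session = init();
// ... the crawler would run here ...
exit(session);
console.log(`handle still open: ${session.open}`);
```

If you comment out the `exit(session)` call, the interval keeps the event loop alive and the script never terminates on its own. That is the situation the final `await Actor.exit()` is there to avoid.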

:::info

The `Actor.init()` call behaves conditionally based on environment variables, namely the `APIFY_IS_AT_HOME` env var, which is set to `true` on the Apify Platform. This means that your project will keep working the same way locally, but will use the Apify API when deployed to the Apify Platform.

:::
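
The environment-based switch can be pictured with a small sketch. The helper names `isOnApifyPlatform()` and `describeStorage()` are made up for this example, and treating any non-empty value as "on the platform" is an assumption; this guide does not show the exact check the SDK performs:

```ts
// Illustrative helpers, not part of the Apify SDK.
type Env = Record<string, string | undefined>;

function isOnApifyPlatform(env: Env): boolean {
    // Assumption: any non-empty value means we run on the platform.
    const flag = env['APIFY_IS_AT_HOME'];
    return flag !== undefined && flag !== '';
}

function describeStorage(env: Env): string {
    return isOnApifyPlatform(env) ? 'Apify API storage' : 'local storage';
}

// On the platform the variable is set to `true`.
console.log(describeStorage({ APIFY_IS_AT_HOME: 'true' }));
// Locally the variable is normally not set at all.
console.log(describeStorage({}));
```

In a real project you would pass `process.env` instead of the literal objects used here. Taking the environment as a parameter just keeps the sketch easy to try out without changing your shell.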

## Initializing the project

We will also need to initialize the project for Apify. To do that, let's use the Apify CLI again:

```bash
apify init
```

This will create a folder called `.actor`, and an `actor.json` file inside it - this file contains the configuration relevant to the Apify Platform, namely the Actor name, version, build tag, and a few other things. Check out the [relevant documentation](https://docs.apify.com/platform/actors/development/actor-definition/actor-json) to see all the different things you can set up there.
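
As a rough idea of what ends up in that file, the sketch below builds an object with the fields mentioned above and prints it as JSON. The `ActorConfig` interface is a simplification made up for this example, all values are placeholders, and the `actorSpecification` field comes from the linked Apify documentation rather than from this guide. The file generated by `apify init` may contain more fields:

```ts
// Simplified, hypothetical shape covering only a few actor.json fields.
interface ActorConfig {
    actorSpecification: number;
    name: string;
    version: string;
    buildTag: string;
}

const actorConfig: ActorConfig = {
    actorSpecification: 1,
    name: 'my-crawlee-project', // placeholder Actor name
    version: '0.0', // placeholder version
    buildTag: 'latest', // placeholder build tag
};

// Serialize the object the way it could appear in .actor/actor.json.
const actorJson = JSON.stringify(actorConfig, null, 4);
console.log(actorJson);
```

You normally do not write this file by hand. The point is only that it is a small JSON document you can review and adjust before pushing.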

## Ship it!

And that's all. Our project is now ready to be published on the Apify Platform. We will use the Apify CLI once more to do that:

```bash
apify push
```

This command will create an archive from your project, upload it to the Apify Platform and initiate a Docker build. Once finished, you will get a link to your new Actor on the platform.

## Learning more about web scraping

:::tip

If you want to learn more about web scraping and browser automation, check out the [Apify Academy](https://developers.apify.com/academy). It's full of courses and tutorials on the topic. From beginner to advanced. And the best thing: **It's free and open source** ❤️

:::

## Thank you! 🎉

That's it! Thanks for reading the whole introduction and if there's anything wrong, please 🙏 let us know on [GitHub](https://github.com/apify/crawlee) or in our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! 👋
1 change: 1 addition & 0 deletions website/sidebars.js
@@ -18,6 +18,7 @@ module.exports = {
'introduction/scraping',
'introduction/saving-data',
'introduction/refactoring',
'introduction/deployment',
],
},
{
4 changes: 2 additions & 2 deletions website/src/pages/index.js
@@ -181,8 +181,8 @@ function Deployment() {
<p>
Crawlee is developed by <a href="https://apify.com" rel="dofollow" target="_blank"><b>Apify</b></a>, the web scraping and automation platform.
You can deploy a <b>Crawlee</b> project wherever you want, but using the <a href="https://console.apify.com/" target="_blank"><b>Apify
platform</b></a> will give you the best experience. With a few simple steps, you can convert your Crawlee project into a so
called <b>Actor</b>. Actors are serverless micro-apps that are easy to develop, run, share, and integrate. The infra, proxies,
platform</b></a> will give you the best experience. With a few simple steps, you can convert your Crawlee project into a so-called&nbsp;
<b>Actor</b>. Actors are serverless micro-apps that are easy to develop, run, share, and integrate. The infra, proxies,
and storages are ready to go. <a href="https://apify.com/actors" target="_blank">Learn more about Actors</a>.
</p>
<p>
@@ -40,7 +40,7 @@ We are using a JavaScript feature called [Top level await](https://blog.saeloun.

Earlier you learned that the crawler uses a queue of requests as its source of URLs to crawl. Let's create it and add the first request.

```ts title="src/main.mjs"
```ts title="src/main.js"
import { RequestQueue } from 'crawlee';

// First you create the request queue instance.
@@ -65,7 +65,7 @@ Unless you have a good reason to start with a different one, you should try buil

Let's continue with the earlier `RequestQueue` example.

```ts title="src/main.mjs"
```ts title="src/main.js"
// Add import of CheerioCrawler
import { RequestQueue, CheerioCrawler } from 'crawlee';

@@ -100,7 +100,7 @@ The title of "https://crawlee.dev" is: Crawlee · The scalable web crawling, scr

Earlier we mentioned that you'll learn how to use the `crawler.addRequests()` method to skip the request queue initialization. It's simple. Every crawler has an implicit `RequestQueue` instance, and you can add requests to it with the `crawler.addRequests()` method. In fact, you can go even further and just use the first parameter of `crawler.run()`!

```ts title="src/main.mjs"
```ts title="src/main.js"
// You don't need to import RequestQueue anymore
import { CheerioCrawler } from 'crawlee';

@@ -91,4 +91,4 @@ If you would like to store your data in a single big file, instead of many small

## Next lesson

In the next and final lesson, we will show you some improvements that you can add to your crawler code that will make it more readable and maintainable in the long run.
In the next lesson, we will show you some improvements that you can add to your crawler code that will make it more readable and maintainable in the long run.
16 changes: 2 additions & 14 deletions website/versioned_docs/version-3.5/introduction/08-refactoring.mdx
@@ -115,18 +115,6 @@ At first, it might seem more readable using just a simple `if / else` statement

It's good practice in any programming language to split your logic into bite-sized chunks that are easy to read and reason about. Scrolling through a thousand line long `requestHandler()` where everything interacts with everything and variables can be used everywhere is not a beautiful thing to do and a pain to debug. That's why we prefer the separation of routes into their own files.

## Learning more about web scraping
## Next lesson

:::tip

If you want to learn more about web scraping and browser automation, check out the [Apify Academy](https://developers.apify.com/academy). It's full of courses and tutorials on the topic. From beginner to advanced. And the best thing: **It's free and open source** ❤️

:::

## Running your crawler in the Cloud

Now that you have your crawler ready, it's the right time to think about where you want to run it. If you used the CLI to bootstrap your project, you already have a **Dockerfile** ready. To read more about how to run this Dockerfile in the cloud, check out the [Apify Platform guide](../guides/apify-platform).

## Thank you! 🎉

That's it! Thanks for reading the whole introduction and if there's anything wrong, please 🙏 let us know on [GitHub](https://github.com/apify/crawlee) or in our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! 👋
In the next and final lesson, we will show you how you can deploy your Crawlee project to the cloud. If you used the CLI to bootstrap your project, you already have a **Dockerfile** ready, and the next section will show you how to deploy it to the [Apify Platform](../guides/apify-platform) with ease.
114 changes: 114 additions & 0 deletions website/versioned_docs/version-3.5/introduction/09-deployment.mdx
@@ -0,0 +1,114 @@
---
id: deployment
title: "Running your crawler in the Cloud"
sidebar_label: "Deployment"
description: Deploying Crawlee projects to the Apify Platform
---

## Apify Platform

Crawlee is developed by [**Apify**](https://apify.com), the web scraping and automation platform. You could say it is the **home of Crawlee projects**. In this section we will show you how to deploy the crawler there with just a few simple steps. You can deploy a **Crawlee** project wherever you want, but using the [**Apify Platform**](https://console.apify.com) will give you the best experience.

[//]: # (TODO mention the other deployment guides here once they are available)

With a few simple steps, you can convert your Crawlee project into a so-called **Actor**. Actors are serverless micro-apps that are easy to develop, run, share, and integrate. The infra, proxies, and storages are ready to go. [Learn more about Actors](https://apify.com/actors).

:::info

We started this guide by using the Crawlee CLI to bootstrap the project - it offers the basic Crawlee templates, including a ready-made `Dockerfile`. If you know you will be deploying your project to the Apify Platform, you might want to start with the Apify CLI instead. It also offers several project templates, and those are all set up to be used on the Apify Platform right away.

:::

## Dependencies

The first step will be installing two new dependencies:

- Apify SDK, a toolkit for working with the Apify Platform. This will allow us to wire the storages (e.g. `RequestQueue` and `Dataset`) to the Apify cloud products. This will be a dependency of our Node.js project.
```bash
npm install apify
```

- Apify CLI, a command-line tool that will help us with authentication and deployment. This will be a globally installed tool: you install it only once and use it in all your Crawlee/Apify projects.
```bash
npm install -g apify-cli
```

## Logging in to the Apify Platform

The next step will be [creating your Apify account](https://console.apify.com/sign-up). Don't worry, we have a **free tier**, so you can try things out before you buy in! Once you have that, it's time to log in with the just-installed [Apify CLI](https://docs.apify.com/cli/). You will need your personal access token, which you can find at https://console.apify.com/account#/integrations.

```bash
apify login
```

## Adjusting the code

Now that you have your account set up, you will need to adjust the code a tiny bit. We will use the [Apify SDK](https://docs.apify.com/sdk/js/), which will help us to wire the Crawlee storages (like the `RequestQueue`) to their Apify Platform counterparts - otherwise Crawlee would keep things only in memory.

Open your `src/main.js` file (or `src/main.ts` if you used a TypeScript template), and add `Actor.init()` to the beginning of your main script and `Actor.exit()` to the end of it. Don't forget to `await` those calls, as both functions are async. Your code should look like this:

```ts title="src/main.js"
// highlight-next-line
import { Actor } from 'apify';
import { PlaywrightCrawler, log } from 'crawlee';
import { router } from './routes.mjs';

// highlight-next-line
await Actor.init();

// This is better set with CRAWLEE_LOG_LEVEL env var
// or a configuration option. This is just for show 😈
log.setLevel(log.LEVELS.DEBUG);

log.debug('Setting up crawler.');
const crawler = new PlaywrightCrawler({
    // Instead of the long requestHandler with
    // if clauses we provide a router instance.
    requestHandler: router,
});

await crawler.run(['https://apify.com/store']);

// highlight-next-line
await Actor.exit();
```

The `Actor.init()` call will configure Crawlee to use the Apify API instead of its default memory storage interface. It also sets up a few other things, like listening to the platform events via websockets. The `Actor.exit()` call then handles graceful shutdown - it closes the open handles created by the `Actor.init()` call, since otherwise the Node.js process would hang instead of exiting.

:::info

The `Actor.init()` call behaves conditionally based on environment variables, namely the `APIFY_IS_AT_HOME` env var, which is set to `true` on the Apify Platform. This means that your project will keep working the same way locally, but will use the Apify API when deployed to the Apify Platform.

:::

## Initializing the project

We will also need to initialize the project for Apify. To do that, let's use the Apify CLI again:

```bash
apify init
```

This will create a folder called `.actor`, and an `actor.json` file inside it - this file contains the configuration relevant to the Apify Platform, namely the Actor name, version, build tag, and a few other things. Check out the [relevant documentation](https://docs.apify.com/platform/actors/development/actor-definition/actor-json) to see all the different things you can set up there.

## Ship it!

And that's all. Our project is now ready to be published on the Apify Platform. We will use the Apify CLI once more to do that:

```bash
apify push
```

This command will create an archive from your project, upload it to the Apify Platform and initiate a Docker build. Once finished, you will get a link to your new Actor on the platform.

## Learning more about web scraping

:::tip

If you want to learn more about web scraping and browser automation, check out the [Apify Academy](https://developers.apify.com/academy). It's full of courses and tutorials on the topic. From beginner to advanced. And the best thing: **It's free and open source** ❤️

:::

## Thank you! 🎉

That's it! Thanks for reading the whole introduction and if there's anything wrong, please 🙏 let us know on [GitHub](https://github.com/apify/crawlee) or in our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! 👋
