Binary file added docs/actor/images/github-integration.png
Binary file added docs/actor/images/run-console.png
Binary file added docs/actor/images/run-log-2.png
Binary file added docs/actor/images/run-log.png
Binary file added docs/actor/images/source-env-vars.png
4 changes: 2 additions & 2 deletions docs/actor/quick_start.md
@@ -15,7 +15,7 @@ Go to the [Actor](https://my.apify.com/actors) section in the app, create a new

Click **Quick run** to build and run your actor. After the run is finished you should see something like:

-![Apify actor run log](/img/docs/actor/run-log.png)
+![Apify actor run log]({{@asset actor/images/run-log.png}})

Congratulations, you have successfully created and run your first actor!

@@ -40,7 +40,7 @@ Save your actor by clicking **Save** and then rebuild it by clicking **Build**.

Then set **Content type** to `application/json; charset=utf-8` and click **Run**. You will see something like:

-![Apify actor run log](/img/docs/actor/run-log-2.png)
+![Apify actor run log]({{@asset actor/images/run-log-2.png}})

Excellent, you have just created your first actor to accept input and store output! Now you can start adding some magic.

2 changes: 1 addition & 1 deletion docs/actor/run.md
@@ -6,7 +6,7 @@ title: Run

The actor can be invoked in a number of ways. One option is to start the actor manually in **Console** in the app:

-![Apify actor run console](/img/docs/actor/run-console.png)
+![Apify actor run console]({{@asset actor/images/run-console.png}})

The following table describes the run settings:

4 changes: 2 additions & 2 deletions docs/actor/source_code.md
@@ -115,15 +115,15 @@ For example, for repositories on GitHub it can be done using the following steps

Then go to your GitHub repository, click **Settings**, select **Webhooks** tab and click **Add webhook**. Paste the API URL to the **Payload URL** as follows:

-![GitHub integration](/img/docs/actor/github-integration.png)
+![GitHub integration]({{@asset actor/images/github-integration.png}})

And that's it! Now your actor should automatically rebuild on every push to the GitHub repository.

### [](#source-env-vars)Custom environment variables

The actor owner can specify custom environment variables that are set to the actor's process during the run. Sensitive environment variables such as passwords or API tokens can be protected by setting the **Secret** option. With this option enabled, the value of the environment variable is encrypted and it will not be visible in the app or APIs, and the value is redacted from actor logs to avoid the accidental leakage of sensitive data.

-![Custom environment variables](/img/docs/actor/source-env-vars.png)
+![Custom environment variables]({{@asset actor/images/source-env-vars.png}})

Note that the custom environment variables are fixed during the build of the actor and cannot be changed later. See the [Build]({{@link actor/build.md#build}}) section for details.
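
Inside the actor, a custom environment variable can be read like any other variable of the Node.js process. The following is a minimal sketch, assuming a hypothetical variable named `MY_API_TOKEN` has been defined for the actor; the helper name `getRequiredEnv` is also made up for illustration:

```javascript
// Reads a required environment variable and fails fast if it is missing.
// `MY_API_TOKEN` is a hypothetical name, not one defined by the platform.
function getRequiredEnv(name, env = process.env) {
    const value = env[name];
    if (value === undefined || value === '') {
        throw new Error(`Environment variable ${name} is not set`);
    }
    return value;
}

// Example usage with a plain object standing in for process.env.
const token = getRequiredEnv('MY_API_TOKEN', { MY_API_TOKEN: 'secret-value' });
console.log(`Token loaded (${token.length} characters)`);
```

Failing early on a missing value tends to be easier to debug than a run that continues with `undefined` credentials.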

Binary file added docs/proxy/images/proxy-status.png
2 changes: 1 addition & 1 deletion docs/proxy/troubleshooting.md
@@ -10,7 +10,7 @@ To view the status of the connection to Apify Proxy, open the following URL in t

If the proxy connection works well, the web page should look something like this:

-![Apify proxy status page](/img/docs/proxy/proxy-status.png)
+![Apify proxy status page]({{@asset proxy/images/proxy-status.png}})

To test that your requests are proxied and rotate the IP addresses correctly, you can open the following API endpoint via the proxy. It shows information about the client IP address:

14 changes: 7 additions & 7 deletions docs/scraping/cheerio_scraper.md
@@ -29,13 +29,13 @@ Before we start, let's do a quick recap of the data we chose to scrape:
5. **Last run date**- When the actor was last run.
6. **Number of runs** - How many times the actor was run.

-![data to scrape](../img/scraping-practice.jpg "Overview of data to be scraped.")
+![data to scrape]({{@asset scraping/images/scraping-practice.jpg}} "Overview of data to be scraped.")

We've already scraped number 1 and 2 in the [Getting started with Apify scrapers](https://apify.com/docs/scraping/tutorial/introduction) tutorial, so let's get to the next one on the list: Title

### [](#title)Title

-![actor title](../img/title.jpg "Finding actor title in DevTools.")
+![actor title]({{@asset scraping/images/title.jpg}} "Finding actor title in DevTools.")

By using the element selector tool, we find out that the title is there under an `<h1>` tag, as titles should be. Maybe surprisingly, we find that there are actually two `<h1>` tags on the detail page. This should get us thinking. Is there any parent element that includes our `<h1>` tag, but not the other ones? Yes, there is! There is a `<header>` element that we can use to select only the heading we're interested in.

@@ -52,7 +52,7 @@ To get the title we just need to find it using a `header h1` selector, which sel

Getting the actor's description is a little more involved, but still pretty straightforward. We can't just simply search for a `<p>` tag, because there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the actor description is nested within the `<header>` element too, same as the title. Sadly, we're still left with two `<p>` tags. To finally select only the description, we choose the `<p>` tag that has a `class` that starts with `Text__Paragraph`.
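
The `[class^=Text__Paragraph]` part is a standard CSS attribute selector: it matches elements whose whole `class` attribute value begins with the given prefix. A rough sketch of that matching rule in plain JavaScript, using made-up class names (the real generated suffixes on the page are not shown here):

```javascript
// Mimics the CSS `[class^=prefix]` test: the raw attribute value must start
// with the prefix. This is an illustration, not how a selector engine works.
function matchesClassPrefix(classAttr, prefix) {
    return typeof classAttr === 'string' && classAttr.startsWith(prefix);
}

// Hypothetical class attribute values.
const candidates = [
    'Text__Paragraph-abc123',
    'Text__Small-def456',
    'wide Text__Paragraph-abc123',
];

const matching = candidates.filter((c) => matchesClassPrefix(c, 'Text__Paragraph'));
console.log(matching);
```

The third value does not match because the attribute starts with `wide`. If the class order on the page could vary, `[class*=Text__Paragraph]` (substring match) would be a more forgiving choice.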

-![actor description selector](../img/description.jpg "Finding actor description in DevTools.")
+![actor description selector]({{@asset scraping/images/description.jpg}} "Finding actor description in DevTools.")

return {
title: $('header h1').text(),
@@ -63,7 +63,7 @@ Getting the actor's description is a little more involved, but still pretty stra

The DevTools tell us that the `lastRunDate` can be found in the second of the two `<time>` elements in the page.

-![actor last run date selector](../img/last-run-date.jpg "Finding actor last run date in DevTools.")
+![actor last run date selector]({{@asset scraping/images/last-run-date.jpg}} "Finding actor last run date in DevTools.")

return {
title: $('header h1').text(),
@@ -190,23 +190,23 @@ While with `apify/web-scraper` and `apify/puppeteer-scraper`, we could get away

We want to know what happens when we click the **Show more** button, so we open the DevTools Network tab and clear it. Then we click the Show more button and wait for incoming requests to appear in the list.

-![inspect-network](../img/inspect-network.jpg "Inspecting network in DevTools.")
+![inspect-network]({{@asset scraping/images/inspect-network.jpg}} "Inspecting network in DevTools.")

Now, this is interesting. It seems that we've only received two images after clicking the button and no additional data. This means that the data about actors must already be available in the page and the Show more button only displays it. This is good news.

### [](#finding-the-actors)Finding the actors

Now that we know the information we seek is already in the page, we just need to find it. The first actor in the store is `apify/web-scraper` so let's try using the search tool in the Elements tab to find some reference to it. The first few hits do not provide any interesting information, but in the end, we find our goldmine. There is a `<script>` tag, with the ID `__NEXT_DATA__` that seems to hold a lot of information about `apify/web-scraper`. In DevTools, you can right click an element and click **Store as global variable** to make this element available in the Console.

-![find-data](../img/find-data.jpg "Finding the hidden actor data.")
+![find-data]({{@asset scraping/images/find-data.jpg}} "Finding the hidden actor data.")

A `temp1` variable is now added to your console. We're mostly interested in its contents and we can get that using the `temp1.textContent` property. You can see that it's a rather large JSON string. How do we know? The `type` attribute of the `<script>` element says `application/json`. But working with a string would be very cumbersome, so we need to parse it.

const data = JSON.parse(temp1.textContent);

After entering the above command into the console, we can inspect the `data` variable and see that all the information we need is there, in the `data.props.pageProps.items` array. Great!
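
Outside the DevTools console, the same lookup can be sketched in plain JavaScript. The example below pulls the `__NEXT_DATA__` JSON out of an HTML string and reads the `props.pageProps.items` array. The sample HTML and the item fields are invented, and the regular expression is a simplification that assumes the `id` attribute comes first in the tag; in a scraper you would more likely select the `<script>` element with the tools the scraper provides.

```javascript
// Extracts and parses the JSON payload of <script id="__NEXT_DATA__">.
// Returns null when the tag is not found.
function extractNextData(html) {
    const match = html.match(/<script id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/);
    return match ? JSON.parse(match[1]) : null;
}

// Made-up page standing in for the real store HTML.
const sampleHtml = `
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"pageProps":{"items":[{"name":"web-scraper"},{"name":"cheerio-scraper"}]}}}
</script>
</body></html>`;

const nextData = extractNextData(sampleHtml);
const items = nextData.props.pageProps.items;
console.log(items.map((item) => item.name));
```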

-![inspect-data](../img/inspect-data.jpg "Inspecting the hidden actor data.")
+![inspect-data]({{@asset scraping/images/inspect-data.jpg}} "Inspecting the hidden actor data.")

> It's obvious that all the information we set to scrape is available in this one data object, so you might already be wondering, can I just make one request to the store to get this JSON and then parse it out and be done with it in a single request? Yes you can! And that's the power of clever page analysis.

Binary file added docs/scraping/images/actor-selection.jpg
Binary file added docs/scraping/images/description.jpg
Binary file added docs/scraping/images/find-data.jpg
Binary file added docs/scraping/images/inspect-data.jpg
Binary file added docs/scraping/images/inspect-network.jpg
Binary file added docs/scraping/images/last-run-date.jpg
Binary file added docs/scraping/images/making-a-pseudo-url.jpg
Binary file added docs/scraping/images/scraping-practice.jpg
Binary file added docs/scraping/images/the-run-detail.jpg
Binary file added docs/scraping/images/the-start-url.jpg
Binary file added docs/scraping/images/title.jpg
Binary file added docs/scraping/images/using-devtools.jpg
Binary file added docs/scraping/images/waiting-for-the-button.jpg
10 changes: 5 additions & 5 deletions docs/scraping/introduction.md
@@ -20,7 +20,7 @@ Depending on how you arrived at this tutorial, you may already have your first t

> This tutorial covers the use of **Web**, **Cheerio** and **Puppeteer** scrapers, but a lot of the information here can be used with all actors.

-![actor-selection](../img/actor-selection.jpg "Selecting the best actor")
+![actor-selection]({{@asset scraping/images/actor-selection.jpg}} "Selecting the best actor")

### [](#running-a-task)Running a task

@@ -40,7 +40,7 @@ After clicking **Save & Run**, the window will change to the run detail. Here, y

Now that the run has `SUCCEEDED`, click on the rightmost card labeled **Clean items** to see the results of the scrape. This takes you to the DATASET tab, where you can display or download the results in various formats. For now, just click the blue **Preview data** button. Voila, the scraped data.

-![run detail](../img/the-run-detail.jpg "Viewing results in the run detail.")
+![run detail]({{@asset scraping/images/the-run-detail.jpg}} "Viewing results in the run detail.")

Good job! We've run our first task and got some results. Let's learn how to change the default configuration to scrape something more interesting than just the page's `<title>`.

@@ -107,7 +107,7 @@ We also need to somehow distinguish the Start URL from all the other URLs that t
"label": "START"
}

-![start url input](../img/the-start-url.jpg "Adding new Start URL.")
+![start url input]({{@asset scraping/images/the-start-url.jpg}} "Adding new Start URL.")

### [](#crawling-the-website-with-pseudo-urls)Crawling the website with Pseudo URLs

@@ -141,7 +141,7 @@ Let's use the above Pseudo URL in our task. We should also add a label as we did
"label": "DETAIL"
}

-![pseudo url input](../img/making-a-pseudo-url.jpg "Adding new Pseudo URL.")
+![pseudo url input]({{@asset scraping/images/making-a-pseudo-url.jpg}} "Adding new Pseudo URL.")
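
The `START` and `DETAIL` labels only become useful once the page function branches on them. A minimal sketch, assuming the label is exposed as `context.request.userData.label` (worth verifying against the scraper's README); the `context` objects are hand-made stand-ins and the URLs are placeholders:

```javascript
// Decides what to do with a page based on the label attached to its request.
// The property path `request.userData.label` is an assumption.
function routeByLabel(context) {
    const { label } = context.request.userData;
    if (label === 'START') return 'enqueue detail links';
    if (label === 'DETAIL') return 'scrape actor data';
    return 'skip';
}

// Hand-made stand-ins for the context object a scraper would pass in.
const startContext = { request: { url: 'https://example.com/store', userData: { label: 'START' } } };
const detailContext = { request: { url: 'https://example.com/store/item', userData: { label: 'DETAIL' } } };

console.log(routeByLabel(startContext));
console.log(routeByLabel(detailContext));
```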

### [](#filtering-with-a-link-selector)Filtering with a link selector

@@ -175,7 +175,7 @@ The DevTools window will pop up, and display a lot of, perhaps unfamiliar, infor

You'll see that the Element tab jumps to the first `<title>` element of the current page and that the title is `Store`. It's always good practice to do your research using the DevTools before writing the `pageFunction` and running your task.

-![devtools](../img/using-devtools.jpg "Finding title element in DevTools.")
+![devtools]({{@asset scraping/images/using-devtools.jpg}} "Finding title element in DevTools.")

> For the sake of brevity, we won't go into the details of using the DevTools in this tutorial. If you're just starting out with DevTools, this [Google tutorial](https://developers.google.com/web/tools/chrome-devtools/) is a good place to begin.

12 changes: 6 additions & 6 deletions docs/scraping/puppeteer_scraper.md
@@ -35,13 +35,13 @@ Before we start, let's do a quick recap of the data we chose to scrape:
5. **Last run date**- When the actor was last run.
6. **Number of runs** - How many times the actor was run.

-![data to scrape](../img/scraping-practice.jpg "Overview of data to be scraped.")
+![data to scrape]({{@asset scraping/images/scraping-practice.jpg}} "Overview of data to be scraped.")

We've already scraped number 1 and 2 in the [Getting started with Apify scrapers](https://apify.com/docs/scraping/tutorial/introduction) tutorial, so let's get to the next one on the list: Title

### [](#title)Title

-![actor title](../img/title.jpg "Finding actor title in DevTools.")
+![actor title]({{@asset scraping/images/title.jpg}} "Finding actor title in DevTools.")

By using the element selector tool, we find out that the title is there under an `<h1>` tag, as titles should be. Maybe surprisingly, we find that there are actually two `<h1>` tags on the detail page. This should get us thinking. Is there any parent element that includes our `<h1>` tag, but not the other ones? Yes, there is! There is a `<header>` element that we can use to select only the heading we're interested in.

@@ -62,7 +62,7 @@ The [`page.$eval`](https://pptr.dev/#?product=Puppeteer&show=api-elementhandleev

Getting the actor's description is a little more involved, but still pretty straightforward. We can't just simply search for a `<p>` tag, because there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the actor description is nested within the `<header>` element too, same as the title. Sadly, we're still left with two `<p>` tags. To finally select only the description, we choose the `<p>` tag that has a `class` that starts with `Text__Paragraph`.

-![actor description selector](../img/description.jpg "Finding actor description in DevTools.")
+![actor description selector]({{@asset scraping/images/description.jpg}} "Finding actor description in DevTools.")
const title = await page.$eval('header h1', (el => el.textContent));
const description = await page.$eval('header p[class^=Text__Paragraph]', (el => el.textContent));

@@ -75,7 +75,7 @@ Getting the actor's description is a little more involved, but still pretty stra

The DevTools tell us that the `lastRunDate` can be found in the second of the two `<time>` elements in the page.

-![actor last run date selector](../img/last-run-date.jpg "Finding actor last run date in DevTools.")
+![actor last run date selector]({{@asset scraping/images/last-run-date.jpg}} "Finding actor last run date in DevTools.")

const title = await page.$eval('header h1', (el => el.textContent));
const description = await page.$eval('header p[class^=Text__Paragraph]', (el => el.textContent));
@@ -238,7 +238,7 @@ Before we can wait for the button, we need to know its unique selector. A quick

> Don't forget to confirm our assumption in the DevTools finder tool (CTRL/CMD + F).

-![waiting for the button](../img/waiting-for-the-button.jpg "Finding show more button in DevTools.")
+![waiting for the button]({{@asset scraping/images/waiting-for-the-button.jpg}} "Finding show more button in DevTools.")

Now that we know what to wait for, we just plug it into the `waitFor()` function.

@@ -341,7 +341,7 @@ We've got the general algorithm ready, so all that's left is to integrate it int

That's it! You can now remove the **Max pages per run** limit, **Save & Run** your task and watch the scraper paginate through all the actors and then scrape all of their data. After it succeeds, open the Dataset again and see the clean items. You should have a table of all the actor's details in front of you. If you do, great job! You've successfully scraped the Apify Store. And if not, no worries, just go through the code examples again, it's probably just some typo.

-![final results](../img/plugging-it-into-the-pagefunction.jpg "Final results.")
+![final results]({{@asset scraping/images/plugging-it-into-the-pagefunction.jpg}} "Final results.")

## [](#downloading-the-scraped-data)Downloading the scraped data

12 changes: 6 additions & 6 deletions docs/scraping/web_scraper.md
@@ -29,13 +29,13 @@ Before we start, let's do a quick recap of the data we chose to scrape:
5. **Last run date**- When the actor was last run.
6. **Number of runs** - How many times the actor was run.

-![data to scrape](../img/scraping-practice.jpg "Overview of data to be scraped.")
+![data to scrape]({{@asset scraping/images/scraping-practice.jpg}} "Overview of data to be scraped.")

We've already scraped number 1 and 2 in the [Getting started with Apify scrapers](https://apify.com/docs/scraping/tutorial/introduction) tutorial, so let's get to the next one on the list: Title

### [](#title)Title

-![actor title](../img/title.jpg "Finding actor title in DevTools.")
+![actor title]({{@asset scraping/images/title.jpg}} "Finding actor title in DevTools.")

By using the element selector tool, we find out that the title is there under an `<h1>` tag, as titles should be. Maybe surprisingly, we find that there are actually two `<h1>` tags on the detail page. This should get us thinking. Is there any parent element that includes our `<h1>` tag, but not the other ones? Yes, there is! There is a `<header>` element that we can use to select only the heading we're interested in.

@@ -52,7 +52,7 @@ To get the title we just need to find it using a `header h1` selector, which sel

Getting the actor's description is a little more involved, but still pretty straightforward. We can't just simply search for a `<p>` tag, because there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the actor description is nested within the `<header>` element too, same as the title. Sadly, we're still left with two `<p>` tags. To finally select only the description, we choose the `<p>` tag that has a `class` that starts with `Text__Paragraph`.

-![actor description selector](../img/description.jpg "Finding actor description in DevTools.")
+![actor description selector]({{@asset scraping/images/description.jpg}} "Finding actor description in DevTools.")

return {
title: $('header h1').text(),
@@ -63,7 +63,7 @@ Getting the actor's description is a little more involved, but still pretty stra

The DevTools tell us that the `lastRunDate` can be found in the second of the two `<time>` elements in the page.

-![actor last run date selector](../img/last-run-date.jpg "Finding actor last run date in DevTools.")
+![actor last run date selector]({{@asset scraping/images/last-run-date.jpg}} "Finding actor last run date in DevTools.")

return {
title: $('header h1').text(),
@@ -221,7 +221,7 @@ Before we can wait for the button, we need to know its unique selector. A quick

> Don't forget to confirm our assumption in the DevTools finder tool (CTRL/CMD + F).

-![waiting for the button](../img/waiting-for-the-button.jpg "Finding show more button in DevTools.")
+![waiting for the button]({{@asset scraping/images/waiting-for-the-button.jpg}} "Finding show more button in DevTools.")

Now that we know what to wait for, we just plug it into the `waitFor()` function.

@@ -323,7 +323,7 @@ We've got the general algorithm ready, so all that's left is to integrate it int

That's it! You can now remove the **Max pages per run** limit, **Save & Run** your task and watch the scraper paginate through all the actors and then scrape all of their data. After it succeeds, open the Dataset again and see the clean items. You should have a table of all the actor's details in front of you. If you do, great job! You've successfully scraped the Apify Store. And if not, no worries, just go through the code examples again, it's probably just some typo.

-![final results](../img/plugging-it-into-the-pagefunction.jpg "Final results.")
+![final results]({{@asset scraping/images/plugging-it-into-the-pagefunction.jpg}} "Final results.")

## [](#downloading-the-scraped-data)Downloading the scraped data

7 changes: 4 additions & 3 deletions docs/tasks/configure.md
@@ -6,10 +6,11 @@ title: Configure

Once you create the task, you can configure its name and set up options and input for the actor. If you leave the options configuration empty, or partially empty, when you run the task, the missing options configuration will be prefilled with values from the actor's configuration.

-![Apify task options](/img/docs/tasks/options.png) <small>Options configuration</small>
+![Apify task options]({{@asset tasks/images/options.png}}) <small>Options configuration</small>

A Task's input configuration works like an actor's, you can either set up raw input with a configured content type, or, if the actor has a defined input schema, a visual input UI will be visible.

-![Apify task raw input](/img/docs/tasks/raw-input.png) <small>Raw input UI</small>
+![Apify task raw input]({{@asset tasks/images/raw-input.png}}) <small>Raw input UI</small>

+![Apify task visual input]({{@asset tasks/images/visual-input.png}}) <small>Visual input UI</small>

-![Apify task visual input](/img/docs/tasks/visual-input.png) <small>Visual input UI</small>