Releases: apify/crawlee

v1.2.1

14 May 11:35
3cedd4d
  • Fix requestAsBrowser behavior with various combinations of the legacy json and payload options. Closes #1028.

v1.2.0

11 May 06:56

This release brings the long-awaited HTTP2 capabilities to requestAsBrowser. It could make HTTP2 requests before, but they did not resemble requests from a real browser. Browser-like HTTP2 requests are important for disguising the scraper as a browser and for reducing the number of blocked requests. requestAsBrowser now uses got-scraping.

The most important new feature is that the full set of headers requestAsBrowser uses will now be generated using live data about browser headers that we collect. This means that the "header fingerprint" will always match existing browsers and should be indistinguishable from a real browser request. The header sets will be automatically rotated for you to further reduce the chances of blocking.

We also switched the default HTTP version from 1 to 2 in requestAsBrowser. We don't expect this change to be breaking, and we took precautions, but we're aware that there are always some edge cases, so please let us know if it causes trouble for you.

Full list of changes:

  • Replace the underlying HTTP client of utils.requestAsBrowser() with got-scraping.
  • Make useHttp2 true by default with utils.requestAsBrowser().
  • Fix Apify.call() failing with empty OUTPUT.
  • Update puppeteer to 8.0.0 and playwright to 1.10.0 with Chromium 90 in Docker images.
  • Update @apify/ps-tree to support Windows better.
  • Update @apify/storage-local to support Node.js 16 prebuilds.
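
If the new HTTP2 default causes trouble with a particular site, the useHttp2 option listed above can be switched off for individual requests. The following is a minimal sketch, assuming the function is reached as Apify.utils.requestAsBrowser and takes an options object with a url field. The helper name fetchOverHttp1 and the example URL are hypothetical, and the request function is passed in as an argument so the snippet does not depend on the SDK being installed.

```javascript
// Hypothetical helper: forces HTTP/1.1 for a single request.
// `requestAsBrowser` is injected, e.g. Apify.utils.requestAsBrowser.
async function fetchOverHttp1(requestAsBrowser, url) {
    const response = await requestAsBrowser({
        url,
        useHttp2: false, // opt out of the new HTTP2 default
    });
    return response;
}

// Usage inside an async function (assumes the `apify` package, v1.2.0 or later):
// const Apify = require('apify');
// const response = await fetchOverHttp1(Apify.utils.requestAsBrowser, 'https://example.com');
```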

v1.1.2

10 Apr 17:49
058af35
  • DEPRECATED: utils.waitForRunToFinish. Please use the apify-client package and its waitForFinish functions instead. Sorry, we forgot to deprecate this with the v1 release.
  • Fix internal require that broke the SDK with underscore 1.13 release.
  • Update @apify/storage-local to v2 written in TypeScript.
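
For code that still calls utils.waitForRunToFinish, the replacement could look roughly like the sketch below. It assumes the apify-client package exposes client.run(runId).waitForFinish({ waitSecs }) and that the resolved run object has a status field; neither detail is spelled out in these notes, so check the client documentation for your version. The helper name waitForRun is hypothetical, and the client is injected rather than constructed here.

```javascript
// Hypothetical replacement for utils.waitForRunToFinish().
// `client` is an already constructed apify-client instance.
// run(id).waitForFinish() is assumed from the apify-client package.
async function waitForRun(client, runId, waitSecs) {
    const run = await client.run(runId).waitForFinish({ waitSecs });
    return run.status;
}

// Usage inside an async function:
// const status = await waitForRun(client, 'some-run-id', 60);
```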

v1.1.1

23 Mar 10:01
  • Fix SessionPoolOptions not being correctly used in BrowserCrawler.
  • Improve error messages for missing puppeteer or playwright installations.

v1.1.0

19 Mar 17:46
088959f

In this minor release we focused on the SessionPool. Besides fixing a few bugs, we added one important feature: setting and getting of sessions by ID.

// Now you can add specific sessions to the pool,
// instead of relying on random generation.
await sessionPool.addSession({
    id: 'my-session',
    // ... some config
});

// Later, you can retrieve the session. This is useful
// for example when you need a specific login session.
const session = await sessionPool.getSession('my-session');

Full list of changes:

  • Add sessionPool.addSession() function to add a new session to the session pool (possibly with the provided options, e.g. with specific session id).
  • Add optional parameter sessionId to sessionPool.getSession() to be able to retrieve a session from the session pool with the specific session id.
  • Fix SessionPool not working properly in both PuppeteerCrawler and PlaywrightCrawler.
  • Fix Apify.call() and Apify.callTask() output - make it backwards compatible with previous versions of the client.
  • Improve handling of browser executable paths when using the official SDK Docker images.
  • Update browser-pool to fix issues with failing hooks causing browsers to get stuck in limbo.
  • Remove the proxy-chain dependency because it is now covered by browser-pool.

v1.0.2

05 Mar 13:36
  • Add the ability to override ProxyConfiguration status check URL with the APIFY_PROXY_STATUS_URL env var.
  • Fix inconsistencies in cookie handling when SessionPool was used.
  • Fix TS types in multiple places. TS is still not a first-class citizen, but this should improve the experience.

v1.0.1

03 Feb 19:23
  • Fix dataset.pushData() validation, which rejected anything other than plain objects.
  • Fix PuppeteerLaunchContext.stealth throwing when used in PuppeteerCrawler.

v1.0.0

25 Jan 19:25

After 3.5 years of rapid development and a lot of breaking changes and deprecations, here comes the result: Apify SDK v1. There were two goals for this release: stability and support for more browsers, namely Firefox and Webkit (Safari).

The SDK has grown quite popular over the years, powering thousands of web scraping and automation projects. We think our developers deserve a stable environment to work in, and by releasing SDK v1, we commit to making breaking changes only once a year, with a new major release.

We added support for more browsers by replacing PuppeteerPool with browser-pool, a new library that we created specifically for this purpose. It builds on the ideas from PuppeteerPool and extends them to support Playwright. Playwright is a browser automation library similar to Puppeteer. It works with all well-known browsers and uses almost the same interface as Puppeteer, while adding useful features and simplifying common tasks. Don't worry, you can still use Puppeteer with the new BrowserPool.

A large breaking change is that neither puppeteer nor playwright is bundled with SDK v1. To make the choice of a library easier and installs faster, users will have to install the selected modules and versions themselves. This allows us to add support for even more libraries in the future.

Thanks to the addition of Playwright we now have a PlaywrightCrawler. It is very similar to PuppeteerCrawler and you can pick the one you prefer. It also means we needed to make some interface changes. The launchPuppeteerFunction option of PuppeteerCrawler is gone and launchPuppeteerOptions was replaced by launchContext. We also moved things around in the handlePageFunction arguments. See the migration guide for a more detailed explanation and migration examples.

What's in store for SDK v2? We want to split the SDK into smaller libraries, so that everyone can install only the things they need. We plan a TypeScript migration to make crawler development faster and safer. Finally, we will take a good look at the interface of the whole SDK and update it to improve the developer experience. Bug fixes and scraping features will of course keep landing in versions 1.X as well.

Full list of changes:

  • BREAKING: Removed puppeteer from dependencies. If you want to use Puppeteer, you must install it yourself.
  • BREAKING: Removed PuppeteerPool. Use browser-pool.
  • BREAKING: Removed PuppeteerCrawlerOptions.launchPuppeteerOptions. Use launchContext.
  • BREAKING: Removed PuppeteerCrawlerOptions.launchPuppeteerFunction. Use PuppeteerCrawlerOptions.preLaunchHooks and postLaunchHooks.
  • BREAKING: Removed args.autoscaledPool and args.puppeteerPool from handle(Page/Request)Function arguments. Use args.crawler.autoscaledPool and args.crawler.browserPool.
  • BREAKING: The useSessionPool and persistCookiesPerSession options of crawlers are now true by default. Explicitly set them to false to override the behavior.
  • BREAKING: Apify.launchPuppeteer() no longer accepts LaunchPuppeteerOptions. It now accepts PuppeteerLaunchContext.
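
The breaking changes above can be pictured with a small options object for PuppeteerCrawler. This is only a sketch: the helper name buildCrawlerOptions is hypothetical, and nesting Puppeteer's own launch options under launchContext.launchOptions is an assumption that the migration guide should confirm. The function only builds a plain object, so it runs without the SDK installed.

```javascript
// Builds v1-style options for new Apify.PuppeteerCrawler(options).
// `requestQueue` is whatever queue the crawler already uses.
function buildCrawlerOptions(requestQueue) {
    return {
        requestQueue,
        // Replaces launchPuppeteerOptions. Placing Puppeteer's own
        // options under `launchOptions` is an assumption here.
        launchContext: {
            launchOptions: { headless: true },
        },
        // Both options are now true by default; set them to false to opt out.
        useSessionPool: false,
        persistCookiesPerSession: false,
        // autoscaledPool is no longer a direct argument; it lives on args.crawler.
        handlePageFunction: async ({ request, crawler }) => {
            const { autoscaledPool } = crawler;
            console.log(`${request.url}: max concurrency ${autoscaledPool.maxConcurrency}`);
        },
    };
}

// Usage (assumes `apify` v1 and `puppeteer` are installed):
// const crawler = new Apify.PuppeteerCrawler(buildCrawlerOptions(requestQueue));
```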

New deprecations:

  • DEPRECATED: PuppeteerCrawlerOptions.gotoFunction. Use PuppeteerCrawlerOptions.preNavigationHooks and postNavigationHooks.
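
A custom gotoFunction usually did two things: tweak navigation settings and react to the result. With hooks, those two concerns can be split as in the sketch below. The (crawlingContext, gotoOptions) hook signature and the timeout and waitUntil fields are assumptions based on how Puppeteer's page.goto() options work, not something these notes state.

```javascript
// Hooks that stand in for a custom gotoFunction.
// The (crawlingContext, gotoOptions) signature is an assumption.
const preNavigationHooks = [
    async (crawlingContext, gotoOptions) => {
        // Adjust navigation settings before the page is opened.
        gotoOptions.timeout = 60000;
        gotoOptions.waitUntil = 'domcontentloaded';
    },
];

const postNavigationHooks = [
    async ({ request }) => {
        // Runs after navigation has finished.
        console.log(`Navigated to ${request.url}`);
    },
];

// Usage (assumes `apify` v1 and `puppeteer` are installed):
// new Apify.PuppeteerCrawler({ requestQueue, handlePageFunction, preNavigationHooks, postNavigationHooks });
```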

Removals of earlier deprecated functions:

  • BREAKING: Removed Apify.utils.puppeteer.enqueueLinks(). Deprecated in 01/2019. Use Apify.utils.enqueueLinks().
  • BREAKING: Removed autoscaledPool.(set|get)MaxConcurrency(). Deprecated in 2019. Use autoscaledPool.maxConcurrency.
  • BREAKING: Removed CheerioCrawlerOptions.requestOptions. Deprecated in 03/2020. Use CheerioCrawlerOptions.prepareRequestFunction.
  • BREAKING: Removed Launch.requestOptions. Deprecated in 03/2020. Use CheerioCrawlerOptions.prepareRequestFunction.

New features:

  • Added Apify.PlaywrightCrawler which is almost identical to PuppeteerCrawler, but it crawls with the playwright library.
  • Added Apify.launchPlaywright(launchContext) helper function.
  • Added browserPoolOptions to PuppeteerCrawler to configure BrowserPool.
  • Added crawler to handle(Request/Page)Function arguments.
  • Added browserController to handlePageFunction arguments.
  • Added crawler.crawlingContexts Map which includes all running crawlingContexts.
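
As a rough illustration of the new features, the sketch below builds options for PlaywrightCrawler and reads the new crawler argument in handlePageFunction. Selecting the browser through launchContext.launcher is an assumption, and buildPlaywrightOptions is a hypothetical helper. The browser type is passed in so the snippet runs without playwright installed.

```javascript
// Builds options for new Apify.PlaywrightCrawler(options).
// `launcher` is a Playwright browser type, e.g. require('playwright').firefox.
// Passing it as launchContext.launcher is an assumption here.
function buildPlaywrightOptions(requestList, launcher) {
    return {
        requestList,
        launchContext: { launcher },
        // crawler and browserController are the arguments added in v1.
        handlePageFunction: async ({ request, page, crawler, browserController }) => {
            const title = await page.title();
            // crawlingContexts is a Map of all running crawling contexts.
            console.log(`${request.url}: "${title}" (${crawler.crawlingContexts.size} contexts running)`);
            return title;
        },
    };
}

// Usage (assumes `apify` v1 and `playwright` are installed):
// const crawler = new Apify.PlaywrightCrawler(buildPlaywrightOptions(requestList, playwright.firefox));
```

The value returned from handlePageFunction is only there to make the sketch easy to check; the crawler itself is not assumed to use it.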

v0.22.4

10 Jan 15:02
  • Fix issues with Apify.pushData() and keyValueStore.forEachKey() by updating @apify/storage-local to 1.0.2.

v0.22.2

22 Dec 13:24
  • Pinned cheerio to 1.0.0-rc.3 to avoid install problems in some builds.
  • Increased default maxEventLoopOverloadedRatio in SystemStatusOptions to 0.6.
  • Updated packages and improved docs.