
feat: http-crawler #1440

Merged: 31 commits from http-crawler into master on Aug 10, 2022
Conversation

szmarczak (Contributor):

No description provided.

@szmarczak szmarczak marked this pull request as ready for review August 8, 2022 14:37
@B4nan (Member) left a comment:

left some general comments, will review the crawler code more in detail later on

packages/cheerio-crawler/src/internals/cheerio-crawler.ts (outdated, resolved)
packages/cheerio-crawler/package.json (outdated, resolved)
packages/http-crawler/package.json (resolved)
packages/cheerio-crawler/src/internals/cheerio-crawler.ts (outdated, resolved)
@szmarczak szmarczak requested a review from B4nan August 9, 2022 15:01
@szmarczak szmarczak requested a review from B4nan August 9, 2022 17:17
@szmarczak (Contributor, Author):

seems like GitHub Actions is down

@vladfrangu (Member) left a comment:

Just three things from me


:::info

Modern web pages often do not serve all of their content in the first HTML response, but rather the first HTML contains links to other resources such as CSS and JavaScript that get downloaded afterwards, and together they create the final page. To crawl those, see <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">`PuppeteerCrawler`</ApiLink> and <ApiLink to="playwright-crawler/class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.
Member:

Small nit since you need one or the other, not both

Suggested change:
- Modern web pages often do not serve all of their content in the first HTML response, but rather the first HTML contains links to other resources such as CSS and JavaScript that get downloaded afterwards, and together they create the final page. To crawl those, see <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">`PuppeteerCrawler`</ApiLink> and <ApiLink to="playwright-crawler/class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.
+ Modern web pages often do not serve all of their content in the first HTML response, but rather the first HTML contains links to other resources such as CSS and JavaScript that get downloaded afterwards, and together they create the final page. To crawl those, see <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">`PuppeteerCrawler`</ApiLink> or <ApiLink to="playwright-crawler/class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.

Contributor (Author):

This is copied from the cheerio counterpart

docs/guides/http_crawler.mdx (outdated, resolved)
Comment on lines 31 to 34
const requestList = await RequestList.open(null, [
{ url: 'http://www.example.com/page-1' },
{ url: 'http://www.example.com/page-2' },
]);
Member:

Move this to the run method instead 🙏
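For reference, a minimal sketch of what that could look like (assuming an HttpCrawler from this PR's package, published as `@crawlee/http`, and the URLs from the snippet above; not the exact example code):

```ts
import { HttpCrawler } from '@crawlee/http';

const crawler = new HttpCrawler({
    requestHandler: async ({ request, body }) => {
        console.log(`Fetched ${request.url} (${body.length} bytes)`);
    },
});

// Instead of building a RequestList up front, pass the start URLs straight to run().
await crawler.run([
    { url: 'http://www.example.com/page-1' },
    { url: 'http://www.example.com/page-2' },
]);
```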

packages/http-crawler/src/internals/http-crawler.ts (outdated, resolved)
packages/http-crawler/src/internals/http-crawler.ts (outdated, resolved)

export interface HttpCrawlerOptions<Context extends InternalHttpCrawlingContext = InternalHttpCrawlingContext> extends BasicCrawlerOptions<Context> {
/**
* An alias for {@apilink HttpCrawlerOptions.requestHandler}
Member:

This is not an alias, it's deprecated I thought (and pending removal)

Contributor (Author):

this._handlePropertyNameChange({
newName: 'requestHandler',
oldName: 'handlePageFunction',
propertyKey: 'requestHandler',
newProperty: requestHandler,
oldProperty: handlePageFunction,
allowUndefined: true,
});

It looks like an alias to me :P You're right, I forgot to put @deprecated on it
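For the record, a rough sketch of how the alias could be annotated (illustrative only; the exact option type used in the PR may differ):

```ts
export interface HttpCrawlerOptions<Context extends InternalHttpCrawlingContext = InternalHttpCrawlingContext> extends BasicCrawlerOptions<Context> {
    /**
     * An alias for {@apilink HttpCrawlerOptions.requestHandler}.
     *
     * @deprecated Kept only for backwards compatibility, use `requestHandler` instead.
     */
    handlePageFunction?: BasicCrawlerOptions<Context>['requestHandler'];
}
```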

Member:

its there only for back compatibility/easier upgrading/less friction

i guess we need to keep this because it handles the compat for cheerio crawler, right? otherwise, we dont need to care about this for new crawlers.

Contributor (Author):

Yes, it's there for compat.


export type HttpRequestHandler<
UserData extends Dictionary = any, // with default to Dictionary we cant use a typed router in untyped crawler
JSONData extends Dictionary = Dictionary,
Member:

This should probably be = any instead (JSON data can be anything after all)
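i.e. something along these lines (a sketch of the suggested signature; the handler shape on the right-hand side is assumed, not copied from the PR):

```ts
export type HttpRequestHandler<
    UserData extends Dictionary = any, // with default to Dictionary we cant use a typed router in untyped crawler
    JSONData = any, // JSON payloads can be any shape, so an unconstrained `any` default is less restrictive
> = (context: InternalHttpCrawlingContext<UserData, JSONData>) => Awaitable<void>;
```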

Contributor (Author):

I copied this from the cheerio crawler

Comment on lines +622 to +623
// Delete any possible lowercased header for cookie as they are merged in _applyCookies under the uppercase Cookie header
Reflect.deleteProperty(requestOptions.headers!, 'cookie');
Member:

Does this do the same thing as the previous code?

Contributor (Author):

If by the previous code you mean the current CheerioCrawler, then yes. I haven't modified any logic, just extracted the cheerio part into another class.

Member:

yeah this exact line is in master too, just elsewhere, lets keep this as close to the original as possible so we dont introduce some hidden BCs
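For anyone reading along, a small illustration of what that line guards against (not the PR code, just the idea):

```ts
// After _applyCookies() merges all cookies under the canonical 'Cookie' key,
// a leftover lowercase 'cookie' entry would be sent as a second, conflicting header.
const headers: Record<string, string> = {
    cookie: 'a=1',          // user-supplied lowercase header
    Cookie: 'a=1; sid=xyz', // merged result under the uppercase key
};

Reflect.deleteProperty(headers, 'cookie');
console.log(headers); // { Cookie: 'a=1; sid=xyz' }
```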

@@ -53,8 +53,11 @@ describe('BasicCrawler', () => {
await localStorageEmulator.init();
});

afterAll(async () => {
afterEach(async () => {
Member:

Any reason for this change and the others like this? 👀

Contributor (Author):

Yes. The HttpCrawler tests fail otherwise because there are tests that use the same URLs

Contributor (Author):

Each test should be considered a separate environment

Member:

but thats how it was supposed to be working already, destroy call just cleans up the files for all opened storages, this shouldn't be needed as init is what ensures you have distinct storages

(with that said, it shouldnt hurt either if we destroy them early, but it rings a bell if the change was required)

Yes. The HttpCrawler tests fail otherwise because there are tests that use the same URLs

so you have tests that are rather wrong, but instead of changing them you changed how we handle this problem everywhere? :]
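For context, the setup pattern being discussed (a sketch of the jest hooks, assuming the `localStorageEmulator` helper used in these test suites):

```ts
beforeEach(async () => {
    // init() is what gives each test a distinct, clean storage.
    await localStorageEmulator.init();
});

// The PR temporarily changed this hook to afterEach(); destroy() only cleans up
// the files of all opened storages, so running it once at the end should suffice.
afterAll(async () => {
    await localStorageEmulator.destroy();
});
```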

Contributor (Author):

so you have tests that are rather wrong, but instead of changing them you changed how we handle this problem everywhere? :]

Why would my tests be wrong? I don't see how that would be the case. Anyway, I just tested again with afterAll and it seems to... pass? wtf is going on 🤔 I guess I'll revert the changes then

Member:

I mean if your tests required this change, they were probably wrong, as this change should not solve what you were trying to solve :] Or the issue was elsewhere, if they pass now 🤷 :]

Contributor (Author):

No I haven't changed the tests nor the code I swear 😂

szmarczak and others added 3 commits August 10, 2022 02:49
Co-authored-by: Vlad Frangu <kingdgrizzle@gmail.com>
Co-authored-by: Vlad Frangu <kingdgrizzle@gmail.com>
Co-authored-by: Vlad Frangu <kingdgrizzle@gmail.com>
@B4nan (Member) left a comment:

looking good, i was a bit surprised with the new addition to guides, but i kinda like it :]

});

// Run the crawler and wait for it to finish.
await crawler.run();
Member:

lets put some URLs here, the code would otherwise finish immediately
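i.e. something like this (a sketch; the actual URLs used in the example are up to the author):

```ts
// Run the crawler on a couple of start URLs and wait for it to finish.
await crawler.run([
    'https://crawlee.dev',
    'https://crawlee.dev/docs',
]);
```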

docs/examples/http_crawler.ts (resolved)
@@ -104,7 +104,7 @@ When we use only what we need, we'll be rewarded with reasonable build and start

### actor-node

This is the smallest image we have based on Alpine Linux. It does not include any browsers, and it's therefore best used with <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink>. It benefits from lightning fast builds and container startups.
This is the smallest image we have based on Alpine Linux. It does not include any browsers, and it's therefore best used with <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink> or <ApiLink to="http-crawler/class/HttpCrawler">`HttpCrawler`</ApiLink>. It benefits from lightning fast builds and container startups.
Member:

i dont think we need to document it here (or in general, i'd only document its extensions, so once we have jsdom, that would be worthy addition, but no so much for the http crawler itself)

description: Your first steps into the world of scraping with Crawlee
---

import ApiLink from '@site/src/components/ApiLink';
Member:

i am not sure if we want such a guide, i was only asking to add a readme based on jsdoc (so we have something on npm), not for revamping the docs :] but maybe its a good idea, thoughts @vladfrangu @mnmkng?

i thought the class would be abstract and we would document only its children, but maybe we could have some shared guide (this one) about http crawlers in general, talking about all of them in one place, stating the differences between cheerio/jsdom/libdom. but we need to first have those implementations.

Member:

I would start with creating a guide for the high-level ones only, but if the content is already written, I guess it would be a shame not to use it.

Contributor (Author):

The content was written via copy & paste :P I agree with @B4nan - this may be too abstract for novice users, as they probably want something that "just works". HttpCrawler is still visible in the API, so more advanced users can still get to it.

I'll remove this; we can re-add it if there's a need.

i thought the class would be abstract and we would document only its children

The point of making this non-abstract is that people can attach their own layers on top of this. They can use other XML parsers, or image transformation tools and so on.
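For example, a rough sketch of such a layer (illustrative only; `fast-xml-parser` is just one possible parser choice and is not something this PR depends on, and the feed URL is a placeholder):

```ts
import { HttpCrawler } from '@crawlee/http';
import { XMLParser } from 'fast-xml-parser';

const xmlParser = new XMLParser();

// A plain HttpCrawler with a custom parsing layer in the request handler,
// instead of the Cheerio layer provided by CheerioCrawler.
const crawler = new HttpCrawler({
    requestHandler: async ({ request, body }) => {
        const feed = xmlParser.parse(body.toString());
        console.log(`Parsed XML feed from ${request.url}`, feed);
    },
});

await crawler.run(['https://example.com/feed.xml']);
```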


@@ -33,6 +33,7 @@ module.exports = {
'guides/request-storage',
'guides/result-storage',
'guides/configuration',
'guides/http-crawler-guide',
Member:

this should go to a less prominent place, i'd put it after the got scraping guide maybe

Comment on lines 2 to 3
import type { AddressInfo } from 'net';
import http from 'http';
Member:

lets use the node: prefix, ideally we should find some lint rule for it

Contributor (Author):

Yes I'm constantly forgetting about this
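i.e. the suggested form:

```ts
// Built-in modules with the explicit node: protocol prefix.
import type { AddressInfo } from 'node:net';
import http from 'node:http';
```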


use(extension: CrawlerExtension) {
ow(extension, ow.object.instanceOf(CrawlerExtension));

const clazz = this.constructor.name;
Member:

i dont mind this, but className is generally better than coming up with non existing spelling :] or just cls, but that's just me being into shortcuts :D

@szmarczak (Contributor, Author) commented Aug 10, 2022:

className sounds good, clazz should be an actual class
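i.e. (trivial sketch of the rename):

```ts
use(extension: CrawlerExtension) {
    ow(extension, ow.object.instanceOf(CrawlerExtension));

    // `className` instead of the made-up spelling `clazz`.
    const className = this.constructor.name;
    // ...
}
```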

@szmarczak szmarczak merged commit 8c303f7 into master Aug 10, 2022
@szmarczak szmarczak deleted the http-crawler branch August 10, 2022 11:29
@B4nan (Member) commented Aug 10, 2022:

FYI here are e2e tests running on master after the merge, just to be sure

https://github.com/apify/crawlee/actions/runs/2832305419
