
feat: http-crawler #1440

Merged: 31 commits from http-crawler into master on Aug 10, 2022
Conversation

szmarczak (Contributor):

No description provided.

@szmarczak szmarczak marked this pull request as ready for review August 8, 2022 14:37
@B4nan (Member) left a comment:

left some general comments, will review the crawler code more in detail later on

packages/cheerio-crawler/src/internals/cheerio-crawler.ts (outdated, resolved)
packages/cheerio-crawler/package.json (outdated, resolved)
packages/http-crawler/package.json (resolved)
packages/cheerio-crawler/src/internals/cheerio-crawler.ts (outdated, resolved)
@szmarczak szmarczak requested a review from B4nan August 9, 2022 15:01
@szmarczak szmarczak requested a review from B4nan August 9, 2022 17:17
@szmarczak (Contributor, Author):

seems like GitHub Actions is down

@vladfrangu (Member) left a comment:

Just three things from me


:::info

Modern web pages often do not serve all of their content in the first HTML response, but rather the first HTML contains links to other resources such as CSS and JavaScript that get downloaded afterwards, and together they create the final page. To crawl those, see <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">`PuppeteerCrawler`</ApiLink> and <ApiLink to="playwright-crawler/class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.
Member:

Small nit since you need one or the other, not both

Suggested change:
- Modern web pages often do not serve all of their content in the first HTML response, but rather the first HTML contains links to other resources such as CSS and JavaScript that get downloaded afterwards, and together they create the final page. To crawl those, see <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">`PuppeteerCrawler`</ApiLink> and <ApiLink to="playwright-crawler/class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.
+ Modern web pages often do not serve all of their content in the first HTML response, but rather the first HTML contains links to other resources such as CSS and JavaScript that get downloaded afterwards, and together they create the final page. To crawl those, see <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">`PuppeteerCrawler`</ApiLink> or <ApiLink to="playwright-crawler/class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.

Contributor (Author):

This is copied from the cheerio counterpart

docs/guides/http_crawler.mdx (outdated, resolved)
Comment on lines 31 to 34
const requestList = await RequestList.open(null, [
{ url: 'http://www.example.com/page-1' },
{ url: 'http://www.example.com/page-2' },
]);
Member:

Move this to the run method instead 🙏
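For reference, a minimal sketch of what that could look like (assuming an HttpCrawler from this PR's package, published as `@crawlee/http`, and the URLs from the snippet above; not the exact example code):

```ts
import { HttpCrawler } from '@crawlee/http';

const crawler = new HttpCrawler({
    requestHandler: async ({ request, body }) => {
        console.log(`Fetched ${request.url} (${body.length} bytes)`);
    },
});

// Instead of building a RequestList up front, pass the start URLs straight to run().
await crawler.run([
    { url: 'http://www.example.com/page-1' },
    { url: 'http://www.example.com/page-2' },
]);
```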

packages/http-crawler/src/internals/http-crawler.ts (outdated, resolved)
packages/http-crawler/src/internals/http-crawler.ts (outdated, resolved)

export interface HttpCrawlerOptions<Context extends InternalHttpCrawlingContext = InternalHttpCrawlingContext> extends BasicCrawlerOptions<Context> {
/**
* An alias for {@apilink HttpCrawlerOptions.requestHandler}
Member:

This is not an alias, it's deprecated I thought (and pending removal)

Contributor (Author):

this._handlePropertyNameChange({
newName: 'requestHandler',
oldName: 'handlePageFunction',
propertyKey: 'requestHandler',
newProperty: requestHandler,
oldProperty: handlePageFunction,
allowUndefined: true,
});

It looks like an alias to me :P You're right, I forgot to put @deprecated on it
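For the record, a rough sketch of how the alias could be annotated (illustrative only; the exact option type used in the PR may differ):

```ts
export interface HttpCrawlerOptions<Context extends InternalHttpCrawlingContext = InternalHttpCrawlingContext> extends BasicCrawlerOptions<Context> {
    /**
     * An alias for {@apilink HttpCrawlerOptions.requestHandler}.
     *
     * @deprecated Kept only for backwards compatibility, use `requestHandler` instead.
     */
    handlePageFunction?: BasicCrawlerOptions<Context>['requestHandler'];
}
```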

Member:

its there only for back compatibility/easier upgrading/less friction

i guess we need to keep this because it handles the compat for cheerio crawler, right? otherwise, we dont need to care about this for new crawlers.

Contributor (Author):

Yes, it's there for compat.


export type HttpRequestHandler<
UserData extends Dictionary = any, // with default to Dictionary we cant use a typed router in untyped crawler
JSONData extends Dictionary = Dictionary,
Member:

This should probably be = any instead (JSON data can be anything after all)
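i.e. something along these lines (a sketch of the suggested signature; the handler shape on the right-hand side is assumed, not copied from the PR):

```ts
export type HttpRequestHandler<
    UserData extends Dictionary = any, // with default to Dictionary we cant use a typed router in untyped crawler
    JSONData = any, // JSON payloads can be any shape, so an unconstrained `any` default is less restrictive
> = (context: InternalHttpCrawlingContext<UserData, JSONData>) => Awaitable<void>;
```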

Contributor (Author):

I copied this from the cheerio crawler

Comment on lines +622 to +623
// Delete any possible lowercased header for cookie as they are merged in _applyCookies under the uppercase Cookie header
Reflect.deleteProperty(requestOptions.headers!, 'cookie');
Member:

Does this do the same thing as the previous code?

Contributor (Author):

If by the previous code you mean the current CheerioCrawler, then yes. I haven't modified any logic, just extracted the cheerio part into another class.

Member:

yeah this exact line is in master too, just elsewhere, lets keep this as close to the original as possible so we dont introduce some hidden BCs
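For anyone reading along, a small illustration of what that line guards against (not the PR code, just the idea):

```ts
// After _applyCookies() merges all cookies under the canonical 'Cookie' key,
// a leftover lowercase 'cookie' entry would be sent as a second, conflicting header.
const headers: Record<string, string> = {
    cookie: 'a=1',          // user-supplied lowercase header
    Cookie: 'a=1; sid=xyz', // merged result under the uppercase key
};

Reflect.deleteProperty(headers, 'cookie');
console.log(headers); // { Cookie: 'a=1; sid=xyz' }
```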

@@ -53,8 +53,11 @@ describe('BasicCrawler', () => {
await localStorageEmulator.init();
});

afterAll(async () => {
afterEach(async () => {
Member:

Any reason for this change and the others like this? 👀

Contributor (Author):

Yes. The HttpCrawler tests fail otherwise because there are tests that use the same URLs

Contributor (Author):

Each test should be considered a separate environment

Member:

but thats how it was supposed to be working already, destroy call just cleans up the files for all opened storages, this shouldn't be needed as init is what ensures you have distinct storages

(with that said, it shouldnt hurt either if we destroy them early, but it rings a bell if the change was required)

Yes. The HttpCrawler tests fail otherwise because there are tests that use the same URLs

so you have tests that are rather wrong, but instead of changing them you changed how we handle this problem everywhere? :]
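For context, the setup pattern being discussed (a sketch of the jest hooks, assuming the `localStorageEmulator` helper used in these test suites):

```ts
beforeEach(async () => {
    // init() is what gives each test a distinct, clean storage.
    await localStorageEmulator.init();
});

// The PR temporarily changed this hook to afterEach(); destroy() only cleans up
// the files of all opened storages, so running it once at the end should suffice.
afterAll(async () => {
    await localStorageEmulator.destroy();
});
```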

Contributor (Author):

so you have tests that are rather wrong, but instead of changing them you changed how we handle this problem everywhere? :]

Why would my tests be wrong? I don't see how that would be the case. Anyway, I just tested again with afterAll and it seems to... pass? wtf is going on 🤔 I guess I'll revert the changes then

Member:

I mean if your tests required this change, they were probably wrong, as this change should not solve what you were trying to solve :] Or the issue was elsewhere, if they pass now 🤷 :]

Contributor (Author):

No I haven't changed the tests nor the code I swear 😂

szmarczak and others added 3 commits August 10, 2022 02:49
Co-authored-by: Vlad Frangu <kingdgrizzle@gmail.com>
Co-authored-by: Vlad Frangu <kingdgrizzle@gmail.com>
Co-authored-by: Vlad Frangu <kingdgrizzle@gmail.com>
@B4nan (Member) left a comment:

looking good, i was a bit surprised with the new addition to guides, but i kinda like it :]

});

// Run the crawler and wait for it to finish.
await crawler.run();
Member:

lets put some URLs here, the code would otherwise finish immediately
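i.e. something like this (a sketch; the actual URLs used in the example are up to the author):

```ts
// Run the crawler on a couple of start URLs and wait for it to finish.
await crawler.run([
    'https://crawlee.dev',
    'https://crawlee.dev/docs',
]);
```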

docs/examples/http_crawler.ts (resolved)
@@ -104,7 +104,7 @@ When we use only what we need, we'll be rewarded with reasonable build and start

### actor-node

This is the smallest image we have based on Alpine Linux. It does not include any browsers, and it's therefore best used with <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink>. It benefits from lightning fast builds and container startups.
This is the smallest image we have based on Alpine Linux. It does not include any browsers, and it's therefore best used with <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink> or <ApiLink to="http-crawler/class/HttpCrawler">`HttpCrawler`</ApiLink>. It benefits from lightning fast builds and container startups.
Member:

i dont think we need to document it here (or in general, i'd only document its extensions, so once we have jsdom, that would be worthy addition, but no so much for the http crawler itself)

description: Your first steps into the world of scraping with Crawlee
---

import ApiLink from '@site/src/components/ApiLink';
Member:

i am not sure if we want such a guide, i was only asking to add a readme based on jsdoc (so we have something on npm), not for revamping the docs :] but maybe its a good idea, thoughts @vladfrangu @mnmkng?

i thought the class would be abstract and we would document only its children, but maybe we could have some shared guide (this one) about http crawlers in general, talking about all of them in one place, stating the differences between cheerio/jsdom/libdom. but we need to first have those implementations.

Member:

I would start with creating a guide for the high-level ones only, but if the content is already written, I guess it would be a shame not to use it.

Contributor (Author):

The content was written via copy & paste :P I agree with @B4nan - this may be too abstract for novice users, as they probably want something that "just works". HttpCrawler is still visible in the API, so more advanced users can still get to it.

I'll remove this; we can re-add it if there's a need.

i thought the class would be abstract and we would document only its children

The point of making this non-abstract is that people can attach their own layers on top of this. They can use other XML parsers, or image transformation tools and so on.
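For example, a rough sketch of such a layer (illustrative only; `fast-xml-parser` is just one possible parser choice and is not something this PR depends on, and the feed URL is a placeholder):

```ts
import { HttpCrawler } from '@crawlee/http';
import { XMLParser } from 'fast-xml-parser';

const xmlParser = new XMLParser();

// A plain HttpCrawler with a custom parsing layer in the request handler,
// instead of the Cheerio layer provided by CheerioCrawler.
const crawler = new HttpCrawler({
    requestHandler: async ({ request, body }) => {
        const feed = xmlParser.parse(body.toString());
        console.log(`Parsed XML feed from ${request.url}`, feed);
    },
});

await crawler.run(['https://example.com/feed.xml']);
```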


@@ -33,6 +33,7 @@ module.exports = {
'guides/request-storage',
'guides/result-storage',
'guides/configuration',
'guides/http-crawler-guide',
Member:

this should go to a less prominent place, i'd put it after the got scraping guide maybe

Comment on lines 2 to 3
import type { AddressInfo } from 'net';
import http from 'http';
Member:

lets use the node: prefix, ideally we should find some lint rule for it

Contributor (Author):

Yes I'm constantly forgetting about this
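i.e. the suggested form:

```ts
// Built-in modules with the explicit node: protocol prefix.
import type { AddressInfo } from 'node:net';
import http from 'node:http';
```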


use(extension: CrawlerExtension) {
ow(extension, ow.object.instanceOf(CrawlerExtension));

const clazz = this.constructor.name;
Member:

i dont mind this, but className is generally better than coming up with non existing spelling :] or just cls, but that's just me being into shortcuts :D

@szmarczak (Contributor, Author) commented Aug 10, 2022:

className sounds good, clazz should be an actual class
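i.e. (trivial sketch of the rename):

```ts
use(extension: CrawlerExtension) {
    ow(extension, ow.object.instanceOf(CrawlerExtension));

    // `className` instead of the made-up spelling `clazz`.
    const className = this.constructor.name;
    // ...
}
```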

@szmarczak szmarczak merged commit 8c303f7 into master Aug 10, 2022
@szmarczak szmarczak deleted the http-crawler branch August 10, 2022 11:29
@B4nan (Member) commented Aug 10, 2022:

FYI here are e2e tests running on master after the merge, just to be sure

https://github.com/apify/crawlee/actions/runs/2832305419
