feat: http-crawler #1440
Conversation
left some general comments, will review the crawler code more in detail later on
seems like GitHub Actions is down
Just three things from me
docs/guides/http_crawler.mdx (outdated)

:::info
Modern web pages often do not serve all of their content in the first HTML response, but rather the first HTML contains links to other resources such as CSS and JavaScript that get downloaded afterwards, and together they create the final page. To crawl those, see <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">`PuppeteerCrawler`</ApiLink> and <ApiLink to="playwright-crawler/class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.
Small nit since you need one or the other, not both
-Modern web pages often do not serve all of their content in the first HTML response, but rather the first HTML contains links to other resources such as CSS and JavaScript that get downloaded afterwards, and together they create the final page. To crawl those, see <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">`PuppeteerCrawler`</ApiLink> and <ApiLink to="playwright-crawler/class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.
+Modern web pages often do not serve all of their content in the first HTML response, but rather the first HTML contains links to other resources such as CSS and JavaScript that get downloaded afterwards, and together they create the final page. To crawl those, see <ApiLink to="puppeteer-crawler/class/PuppeteerCrawler">`PuppeteerCrawler`</ApiLink> or <ApiLink to="playwright-crawler/class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.
This is copied from the cheerio counterpart
packages/http-crawler/README.md (outdated)

const requestList = await RequestList.open(null, [
    { url: 'http://www.example.com/page-1' },
    { url: 'http://www.example.com/page-2' },
]);
Move this to the run method instead 🙏
export interface HttpCrawlerOptions<Context extends InternalHttpCrawlingContext = InternalHttpCrawlingContext> extends BasicCrawlerOptions<Context> {
    /**
     * An alias for {@apilink HttpCrawlerOptions.requestHandler}
This is not an alias, it's deprecated I thought (and pending removal)
crawlee/packages/http-crawler/src/internals/http-crawler.ts
Lines 337 to 344 in efe9828
this._handlePropertyNameChange({
    newName: 'requestHandler',
    oldName: 'handlePageFunction',
    propertyKey: 'requestHandler',
    newProperty: requestHandler,
    oldProperty: handlePageFunction,
    allowUndefined: true,
});
It looks like an alias to me :P You're right, I forgot to put `@deprecated` on it
it's there only for backward compatibility/easier upgrading/less friction
i guess we need to keep this because it handles the compat for CheerioCrawler, right? otherwise, we don't need to care about this for new crawlers.
Yes, it's there for compat.
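The compat shim discussed here can be sketched as a small helper. This is a hypothetical, simplified version modeled on the `_handlePropertyNameChange` call quoted above; the names and error messages are illustrative, not the actual crawlee implementation:

```typescript
// Hypothetical sketch of a deprecated-alias shim: the old option name is
// transparently mapped to the new one, with a deprecation warning.
interface PropertyNameChange<T> {
    newName: string;
    oldName: string;
    newProperty?: T;
    oldProperty?: T;
    allowUndefined?: boolean;
}

function resolveRenamedProperty<T>(change: PropertyNameChange<T>): T | undefined {
    const { newName, oldName, newProperty, oldProperty, allowUndefined } = change;
    if (newProperty !== undefined && oldProperty !== undefined) {
        throw new Error(`Only one of "${newName}" and the deprecated "${oldName}" may be provided.`);
    }
    if (oldProperty !== undefined) {
        // This warning is why the old name still "works" while being deprecated.
        console.warn(`"${oldName}" is deprecated, use "${newName}" instead.`);
        return oldProperty;
    }
    if (newProperty === undefined && !allowUndefined) {
        throw new Error(`"${newName}" must be provided.`);
    }
    return newProperty;
}

// The deprecated option is accepted and mapped to the new one:
const handler = resolveRenamedProperty({
    newName: 'requestHandler',
    oldName: 'handlePageFunction',
    oldProperty: () => 'handled',
});
console.log(handler?.());
```

Keeping the shim in the shared `HttpCrawler` base means every subclass (like `CheerioCrawler`) inherits the same backward-compatible handling for free.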
export type HttpRequestHandler<
    UserData extends Dictionary = any, // with default to Dictionary we can't use a typed router in untyped crawler
    JSONData extends Dictionary = Dictionary,
This should probably be `= any` instead (JSON data can be anything, after all)
I copied this from the cheerio crawler
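The point about `= any` can be illustrated with a small, self-contained sketch (hypothetical handler types, not the actual crawlee definitions). JSON payloads can be arrays or primitives, and those do not satisfy a `Record<string, unknown>`-style `Dictionary` constraint:

```typescript
// Illustrative only: why a `Dictionary` default on the JSONData generic is
// restrictive, while `= any` accepts any JSON shape.
type Dictionary = Record<string, unknown>;

type StrictHandler<JSONData extends Dictionary = Dictionary> = (json: JSONData) => number;
type LooseHandler<JSONData = any> = (json: JSONData) => number;

// An array payload works fine with the loose default...
const countItems: LooseHandler<number[]> = (json) => json.length;

// ...but `StrictHandler<number[]>` would be a compile error, since
// `number[]` does not extend `Dictionary`.
console.log(countItems([1, 2, 3]));
```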
// Delete any possible lowercased header for cookie as they are merged in _applyCookies under the uppercase Cookie header
Reflect.deleteProperty(requestOptions.headers!, 'cookie');
Does this do the same thing as the previous code?
If by the previous code you mean the current CheerioCrawler, then yes. I haven't modified any logic, except extracting the cheerio thing to another class
yeah this exact line is in master too, just elsewhere, let's keep this as close to the original as possible so we don't introduce some hidden BCs
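The guard the comment describes can be sketched like this (a simplified, hypothetical stand-in, not the actual crawlee code): if both a lowercase `cookie` and a canonical `Cookie` key exist on the header object, they would be sent as two conflicting headers, so the lowercase one is merged in and then removed with the same `Reflect.deleteProperty` call:

```typescript
// Simplified sketch: merge a lowercase `cookie` header into the canonical
// `Cookie` header, then drop the lowercase duplicate.
function mergeCookieHeaders(headers: Record<string, string | undefined>): Record<string, string | undefined> {
    if (headers.cookie !== undefined) {
        headers.Cookie = headers.Cookie
            ? `${headers.Cookie}; ${headers.cookie}`
            : headers.cookie;
        // Same call as in the quoted line: delete the lowercase variant.
        Reflect.deleteProperty(headers, 'cookie');
    }
    return headers;
}

const headers = mergeCookieHeaders({ Cookie: 'a=1', cookie: 'b=2' });
console.log(headers); // only the uppercase `Cookie` key remains
```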
@@ -53,8 +53,11 @@ describe('BasicCrawler', () => {
         await localStorageEmulator.init();
     });

-    afterAll(async () => {
+    afterEach(async () => {
Any reason for this change and the others like this? 👀
Yes. The `HttpCrawler` tests fail otherwise because there are tests that use the same URLs
Each test should be considered a separate environment
but that's how it was supposed to be working already, the `destroy` call just cleans up the files for all opened storages; this shouldn't be needed, as `init` is what ensures you have distinct storages
(with that said, it shouldn't hurt either if we destroy them early, but it rings a bell if the change was required)
Yes. The HttpCrawler tests fail otherwise because there are tests that use the same URLs
so you have tests that are rather wrong, but instead of changing them you changed how we handle this problem everywhere? :]
so you have tests that are rather wrong, but instead of changing them you changed how we handle this problem everywhere? :]
Why would my tests be wrong? I don't see how that would be the case. Anyway, I just tested again with `afterAll` and it seems to... pass? wtf is going on 🤔 I guess I'll revert the changes then
I mean if your tests required this change, they were probably wrong, as this change should not solve what you were trying to solve :] Or the issue was elsewhere, if they pass now 🤷 :]
No I haven't changed the tests nor the code I swear 😂
Co-authored-by: Vlad Frangu <kingdgrizzle@gmail.com>
looking good, i was a bit surprised with the new addition to guides, but i kinda like it :]
docs/examples/http_crawler.ts (outdated)

});

// Run the crawler and wait for it to finish.
await crawler.run();
let's put some URLs here, the code would otherwise finish immediately
docs/guides/docker_images.mdx (outdated)

@@ -104,7 +104,7 @@ When we use only what we need, we'll be rewarded with reasonable build and start

 ### actor-node

-This is the smallest image we have based on Alpine Linux. It does not include any browsers, and it's therefore best used with <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink>. It benefits from lightning fast builds and container startups.
+This is the smallest image we have based on Alpine Linux. It does not include any browsers, and it's therefore best used with <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink> or <ApiLink to="http-crawler/class/HttpCrawler">`HttpCrawler`</ApiLink>. It benefits from lightning fast builds and container startups.
i don't think we need to document it here (or in general, i'd only document its extensions, so once we have jsdom, that would be a worthy addition, but not so much for the http crawler itself)
docs/guides/http_crawler.mdx (outdated)

description: Your first steps into the world of scraping with Crawlee
---

import ApiLink from '@site/src/components/ApiLink';
i am not sure if we want such a guide, i was only asking to add a readme based on jsdoc (so we have something on npm), not for revamping the docs :] but maybe it's a good idea, thoughts @vladfrangu @mnmkng?
i thought the class would be abstract and we would document only its children, but maybe we could have some shared guide (this one) about http crawlers in general, talking about all of them in one place, stating the differences between cheerio/jsdom/libdom. but we need to have those implementations first.
I would start with creating a guide for the high-level ones only, but if the content is already written, I guess it would be a shame not to use it.
The content was written via copy & paste :P I agree with @B4nan - this may be too abstract for novice users, as they probably want something that "just works". `HttpCrawler` is still visible in the API, so more advanced people can still get to it.
I'll remove this, we can re-add it if there's a need to.
i thought the class would be abstract and we would document only its children
The point of making this non-abstract is that people can attach their own layers on top of it. They can use other XML parsers, or image transformation tools, and so on.
website/sidebars.js (outdated)

@@ -33,6 +33,7 @@ module.exports = {
     'guides/request-storage',
     'guides/result-storage',
     'guides/configuration',
+    'guides/http-crawler-guide',
this should go to a less prominent place, i'd put it after the got scraping guide maybe
import type { AddressInfo } from 'net';
import http from 'http';
let's use the `node:` prefix, ideally we should find some lint rule for it
Yes I'm constantly forgetting about this
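One existing lint rule for this is `unicorn/prefer-node-protocol` from eslint-plugin-unicorn. The prefixed form of the imports from the snippet above would look like this (a minimal sketch, with a throwaway server just to exercise `AddressInfo`):

```typescript
// The `node:` prefix makes it explicit that the import resolves to a Node.js
// built-in module and can never be shadowed by an npm package of the same name.
import type { AddressInfo } from 'node:net';
import http from 'node:http';

const server = http.createServer((_req, res) => res.end('ok'));
server.listen(0, () => {
    // `address()` returns an AddressInfo object once the server is listening.
    const { port } = server.address() as AddressInfo;
    console.log(`listening on port ${port}`);
    server.close();
});
```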
use(extension: CrawlerExtension) {
    ow(extension, ow.object.instanceOf(CrawlerExtension));

    const clazz = this.constructor.name;
i don't mind this, but `className` is generally better than coming up with non-existing spelling :] or just `cls`, but that's just me being into shortcuts :D
`className` sounds good, `clazz` should be an actual class
FYI here are e2e tests running on master after the merge, just to be sure