Replies: 5 comments 7 replies
-
If you do
-
The response needs to have the correct content type. We check whether it contains "xml"; that is what drives the option. See crawlee/packages/http-crawler/src/internals/http-crawler.ts, lines 606 to 609 in 79538b8. If this doesn't work for you, you can extend the crawler and reimplement the method to suit your needs.
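To make the check described above concrete, here is a minimal sketch, assuming the option is driven by a plain substring test on the Content-Type value. `isXmlContentType` is a hypothetical helper written for illustration, not Crawlee's actual code (the real logic is in the http-crawler.ts lines referenced above), and the lowercasing step is my own assumption.

```javascript
// Hypothetical helper, NOT Crawlee's actual implementation.
// Assumption: XML mode is enabled whenever the content type contains "xml".
function isXmlContentType(contentType) {
    // A missing or non-string header cannot match.
    if (typeof contentType !== 'string') return false;
    // Lowercasing is an assumption made here so that "XML" also matches.
    return contentType.toLowerCase().includes('xml');
}

console.log(isXmlContentType('application/rss+xml; charset=utf-8')); // true
console.log(isXmlContentType('text/html; charset=utf-8')); // false
console.log(isXmlContentType(undefined)); // false
```

With a check like this, a feed served as text/html would not be treated as XML, which is one way the option could end up disabled for an RSS URL.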
-
async requestHandler({ log, isXml, response, body, contentType, $ }) {
log.info(isXml);
}
-
The error is due to the fact that option
POC:
import { CheerioCrawler } from 'crawlee';
import { DomHandler } from 'htmlparser2';
import { WritableStream } from 'htmlparser2/lib/WritableStream';

// Subclass that always parses the response in XML mode.
class ExtCheerioCrawler extends CheerioCrawler {
    constructor(options, config) {
        super(options, config);
    }

    // Reimplements the DOM parsing step with xmlMode enabled.
    async _parseHtmlToDom(response) {
        return new Promise((resolve, reject) => {
            const domHandler = new DomHandler((err, dom) => {
                if (err) reject(err);
                else resolve(dom);
            }, { xmlMode: true });
            const parser = new WritableStream(domHandler, { xmlMode: true, decodeEntities: true });
            parser.on('error', reject);
            // Stream the response body into the parser.
            response
                .on('error', reject)
                .pipe(parser);
        });
    }
}

const crawler = new ExtCheerioCrawler({
    async requestHandler({ log, response, body, contentType, $ }) {
        // Collect the fields of every <item> in the feed.
        const items = [...$('item')].map(e => {
            const text = s => $(e).find(s).text().trim();
            return {
                title: text('title'),
                link: text('link'),
                published_on: text('pubDate'),
                published_by: text('source'),
                snippet: text('description'),
            };
        });
        log.info(JSON.stringify(items, null, 4));
    },
});

await crawler.run(['https://news.google.com/rss/search?q=test&hl=fr&gl=FR&ceid=FR:fr']);
Result:
@charnould: Note also that I'm using
-
Just checked through this, thanks for reporting! Will hopefully prepare the patch today :D
-
Hello,
First, thanks for Crawlee; it really makes scraping/crawling almost a breeze!
I do have a question, however: how can I pass xmlMode: true to CheerioCrawler? I didn't find anything in the docs.
Below is my issue:
Consider an XML file like this one (a Google News RSS feed) and an item like this:
I wrote the following (mostly working) function:
It might be obvious, but I don't understand why I can't retrieve the link and published_by content, whereas it works for the other fields. It seems to be related to XML/HTML parsing.
Thanks a lot.
Link to my question on Stack Overflow, with some answers: Why Cheerio XML parsing doesn't return text() for some keys?
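A likely explanation for why only those two fields come back empty: in HTML, link and source are void elements (they cannot have children), so an HTML-mode parser closes them immediately and their text ends up outside the element. The sketch below only illustrates that name collision; it does not call Cheerio. The void-element list is taken from the HTML Standard, and `selectors` is a hypothetical mapping that mirrors the field names mentioned above.

```javascript
// Void elements as listed in the HTML Standard.
const HTML_VOID_ELEMENTS = new Set([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical mapping of output fields to the RSS tags they are read from.
const selectors = {
    title: 'title',
    link: 'link',
    published_on: 'pubDate',
    published_by: 'source',
    snippet: 'description',
};

// Fields whose tag name collides with an HTML void element.
const affected = Object.entries(selectors)
    .filter(([, tag]) => HTML_VOID_ELEMENTS.has(tag.toLowerCase()))
    .map(([field]) => field);

console.log(affected); // [ 'link', 'published_by' ]
```

Parsing the feed in XML mode avoids the collision, because XML has no notion of void elements and every tag keeps its own text content.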