Why CheerioCrawler parsing doesn't return text() for some XML keys? #1794

charnould · 2023-02-19T18:25:04Z

Hello,
First, thanks for Crawlee; it indeed makes scraping/crawling almost a breeze!
I do have a question, however: how can I pass xmlMode: true to CheerioCrawler? I didn't find anything in the docs.

Below my issue:

Consider an XML file like this one (a Google News RSS feed) and an item like this:

<item>
  <title>Test Like a Dragon Ishin...</title>
  <link>https://news.google.com/rss/articles/CBMie2....</link>
  <guid isPermaLink="false">CBMie2h0dHB...</guid>
  <pubDate>Fri, 17 Feb 2023 15:00:03 GMT</pubDate>
  <description>Test Like a Dragon Ishin...</description>
  <source url="https://www.jeuxvideo.com">jeuxvideo.com</source>
</item>

I wrote the following (mostly working) function:

const crawler = new CheerioCrawler({
    async requestHandler({ request, response, body, contentType, $ }) {
      $("item").each(function (i, ref) {
        const el = $(ref);
        const title = el.find("title").text();
        const link = el.find('link').text();
        const published_on = el.find('pubdate').text();
        const published_by = $("source").text();
        const snippet = el.find("description").text();

        console.log("TITLE: ", title);
        console.log("LINK: ", link);                 // DOESN'T WORK
        console.log("PUBLISHED_ON: ", published_on);
        console.log("PUBLISHED_BY: ", published_by); // DOESN'T WORK
        console.log("SNIPPET: ", snippet);
      });
    },
  });

It might be obvious, but I do not understand why I can't retrieve the link and published_by content when it works for the other fields. It seems to be related to XML/HTML parsing.

Thanks a lot.

Link to my question on Stack Overflow, with some answers: Why Cheerio XML parsing doesn't return text() for some keys?

LeMoussel · 2023-02-20T09:13:37Z

If you do console.log(body), you can see how the text was parsed.
You'll notice that:

the link tag is parsed as follows
<link/>https://news.google.com/search?q=test&hl=fr&gl=FR&ceid=FR:fr<language> ...
the source tag is parsed as follows
<source url="https://www.jeuxvideo.com"/>...
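
One plausible reading of the output above, offered as a sketch rather than a confirmed diagnosis: in HTML, link and source are void elements (they cannot have content), so a parser running in HTML mode closes them immediately and the text lands outside the tag. The snippet below checks the RSS tag names from the question against the HTML void-element list. HTML_VOID_ELEMENTS, rssTags and affectedTags are illustrative names, not part of Crawlee or cheerio.

```javascript
// Void elements as listed in the HTML Living Standard: they never have content,
// so an HTML-mode parser treats <link> and <source> as self-closing.
const HTML_VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Tag names used inside <item> in the RSS feed from the question.
const rssTags = ['title', 'link', 'guid', 'pubDate', 'description', 'source'];

// HTML tag names are case-insensitive, so compare in lower case.
const affectedTags = rssTags.filter((tag) => HTML_VOID_ELEMENTS.has(tag.toLowerCase()));

console.log(affectedTags); // [ 'link', 'source' ]
```

The two tags it reports are exactly the ones that return empty text in the question, which is consistent with the feed being parsed in HTML mode rather than XML mode.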

B4nan · 2023-02-20T09:16:36Z

Maintainer

The response needs to have the correct content type. We check whether it contains "xml"; that is what drives the xmlMode option.

crawlee/packages/http-crawler/src/internals/http-crawler.ts

Lines 606 to 609 in 79538b8

    
           } else if (HTML_AND_XML_MIME_TYPES.includes(type)) {
               const isXml = type.includes('xml');
               const parsed = await this._parseHTML(response, isXml, crawlingContext);
               return { ...parsed, isXml, response, contentType };

crawlee/packages/cheerio-crawler/src/internals/cheerio-crawler.ts

Line 140 in 79538b8

xmlMode: isXml,

You can extend the crawler and reimplement the method to suit your needs if this doesn't work for you.
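
To make the check quoted above concrete, here is a minimal standalone sketch of the same decision. The MIME type list is an assumption (the real HTML_AND_XML_MIME_TYPES constant lives in Crawlee's http-crawler package and may differ), and wouldUseXmlMode is a hypothetical helper, not a Crawlee API.

```javascript
// Assumed list of accepted MIME types; the real constant in Crawlee may differ.
const ASSUMED_HTML_AND_XML_MIME_TYPES = [
  'text/html',
  'text/xml',
  'application/xhtml+xml',
  'application/xml',
];

// Hypothetical helper mirroring the quoted check: returns null for unsupported
// types, otherwise whether XML mode would be requested for that content type.
function wouldUseXmlMode(type) {
  if (!ASSUMED_HTML_AND_XML_MIME_TYPES.includes(type)) return null;
  return type.includes('xml');
}

console.log(wouldUseXmlMode('application/xml')); // true
console.log(wouldUseXmlMode('text/html'));       // false
console.log(wouldUseXmlMode('image/png'));       // null
```

Under this reading, a feed served as application/xml should already request XML mode, so a wrong result points at a later parsing step rather than at content type detection.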

3 replies

charnould Feb 20, 2023
Author

Indeed, contentType is { type: 'application/xml', encoding: 'utf-8' } and isXml = true for the example file.
Why is it wrongly parsed then? (See parsed body as @LeMoussel suggests.)
Thanks.

B4nan Feb 20, 2023
Maintainer

That's a question for the cheerio maintainers, I guess; we don't parse it ourselves, unless this part somehow breaks it.

LeMoussel Feb 20, 2023

This part does break it in some way.
Using cheerio directly in the following way, I get the expected results:

import * as cheerio from 'cheerio';
import request from 'request';

function gotXML(err, resp, xml) {
  if (err) return console.error(err)

  const $ = cheerio.load(xml, {xmlMode: true})
  $("item").each(function (i, ref) {
    const el = $(ref);
    const title = el.find("title").text();
    const link = el.find('link').text();
    const published_on = el.find('pubDate').text();
    const published_by = el.find("source").text(); // scoped to the current item, not the whole document
    const snippet = el.find("description").text();

    console.log("TITLE: ", title);
    console.log("LINK: ", link);
    console.log("PUBLISHED_ON: ", published_on);
    console.log("PUBLISHED_BY: ", published_by);
    console.log("SNIPPET: ", snippet)
  });
}

request('https://news.google.com/rss/search?q=test&hl=fr&gl=FR&ceid=FR:fr', gotXML);

LeMoussel · 2023-02-20T09:18:36Z

async requestHandler({ log, isXml, response, body, contentType, $ }) {
   log.info(isXml);
}

isXml is set to true.

LeMoussel · 2023-02-20T13:22:35Z

The error occurs because the xmlMode: true option is not passed to the DomHandler and WritableStream that process the streamed XML input.

POC:

import { CheerioCrawler } from 'crawlee';

import { DomHandler } from 'htmlparser2';
import { WritableStream } from 'htmlparser2/lib/WritableStream';

class ExtCheerioCrawler extends CheerioCrawler {
  constructor(options, config) {
    super(options, config);
  }

  // POC only: override the internal parsing hook so htmlparser2 always runs in XML mode.
  async _parseHtmlToDom(response) {
    return new Promise((resolve, reject) => {
      const domHandler = new DomHandler((err, dom) => {
        if (err) reject(err);
        else resolve(dom);
      }, { xmlMode: true });

      const parser = new WritableStream(domHandler, { xmlMode: true, decodeEntities: true });
      parser.on('error', reject);
      response
        .on('error', reject)
        .pipe(parser);
    });
  }
}

const crawler = new ExtCheerioCrawler({
  async requestHandler({ log, response, body, contentType, $ }) {
    const items = [...$("item")].map(e => {
      const text = s => $(e).find(s).text().trim();

      return {
        title: text("title"),
        link: text("link"),
        published_on: text('pubDate'),
        published_by: text("source"),
        snippet:  text("description"),
      };
    });

    log.info(JSON.stringify(items, null, 4));
  },
});

await crawler.run(['https://news.google.com/rss/search?q=test&hl=fr&gl=FR&ceid=FR:fr']);

Result:

INFO  ExtCheerioCrawler: Starting the crawl
INFO  ExtCheerioCrawler: [
    {
        "title": "Test Like a Dragon Ishin : Le remake tant attendu d'un jeu d'action-aventure légendaire ? - jeuxvideo.com",
        "link": "https://news.google.com/rss/articles/CBMie2h0dHBzOi8vd3d3LmpldXh2aWRlby5jb20vdGVzdC8xNzEwMDYwL2xpa2UtYS1kcmFnb24taXNoaW4tbGUtcmVtYWtlLXRhbnQtYXR0ZW5kdS1kLXVuLWpldS1kLWFjdGlvbi1hdmVudHVyZS1sZWdlbmRhaXJlLmh0bdIBf2h0dHBzOi8vd3d3LmpldXh2aWRlby5jb20vYW1wL3Rlc3QvMTcxMDA2MC9saWtlLWEtZHJhZ29uLWlzaGluLWxlLXJlbWFrZS10YW50LWF0dGVuZHUtZC11bi1qZXUtZC1hY3Rpb24tYXZlbnR1cmUtbGVnZW5kYWlyZS5odG0?oc=5",
        "published_on": "Fri, 17 Feb 2023 15:00:03 GMT",
        "published_by": "jeuxvideo.com",
        "snippet": "<a href=\"https://news.google.com/rss/articles/CBMie2h0dHBzOi8vd3d3LmpldXh2aWRlby5jb20vdGVzdC8xNzEwMDYwL2xpa2UtYS1kcmFnb24taXNoaW4tbGUtcmVtYWtlLXRhbnQtYXR0ZW5kdS1kLXVuLWpldS1kLWFjdGlvbi1hdmVudHVyZS1sZWdlbmRhaXJlLmh0bdIBf2h0dHBzOi8vd3d3LmpldXh2aWRlby5jb20vYW1wL3Rlc3QvMTcxMDA2MC9saWtlLWEtZHJhZ29uLWlzaGluLWxlLXJlbWFrZS10YW50LWF0dGVuZHUtZC11bi1qZXUtZC1hY3Rpb24tYXZlbnR1cmUtbGVnZW5kYWlyZS5odG0?oc=5\" target=\"_blank\">Test Like a Dragon Ishin : Le remake tant attendu d'un jeu d'action-aventure légendaire ?</a>&nbsp;&nbsp;<font color=\"#6f6f6f\">jeuxvideo.com</font>"
    },
    {
        "title": "Test Octopath Traveler 2 : le meilleur jeu de rôle de ce début d'année ? - jeuxvideo.com",
        "link": "https://news.google.com/rss/articles/CBMiamh0dHBzOi8vd3d3LmpldXh2aWRlby5jb20vdGVzdC8xNzExNjY4L29jdG9wYXRoLXRyYXZlbGVyLTItbGUtbWVpbGxldXItamV1LWRlLXJvbGUtZGUtY2UtZGVidXQtZC1hbm5lZS5odG3SAW5odHRwczovL3d3dy5qZXV4dmlkZW8uY29tL2FtcC90ZXN0LzE3MTE2Njgvb2N0b3BhdGgtdHJhdmVsZXItMi1sZS1tZWlsbGV1ci1qZXUtZGUtcm9sZS1kZS1jZS1kZWJ1dC1kLWFubmVlLmh0bQ?oc=5",
        "published_on": "Fri, 17 Feb 2023 09:34:42 GMT",
        "published_by": "jeuxvideo.com",
        "snippet": "<a href=\"https://news.google.com/rss/articles/CBMiamh0dHBzOi8vd3d3LmpldXh2aWRlby5jb20vdGVzdC8xNzExNjY4L29jdG9wYXRoLXRyYXZlbGVyLTItbGUtbWVpbGxldXItamV1LWRlLXJvbGUtZGUtY2UtZGVidXQtZC1hbm5lZS5odG3SAW5odHRwczovL3d3dy5qZXV4dmlkZW8uY29tL2FtcC90ZXN0LzE3MTE2Njgvb2N0b3BhdGgtdHJhdmVsZXItMi1sZS1tZWlsbGV1ci1qZXUtZGUtcm9sZS1kZS1jZS1kZWJ1dC1kLWFubmVlLmh0bQ?oc=5\" target=\"_blank\">Test Octopath Traveler 2 : le meilleur jeu de rôle de ce début d'année ?</a>&nbsp;&nbsp;<font color=\"#6f6f6f\">jeuxvideo.com</font>"
    }, ....
INFO  ExtCheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
INFO  ExtCheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":1485,"requestsFinishedPerMinute":31,"requestsFailedPerMinute":0,"requestTotalDurationMillis":1485,"requestsTotal":1,"crawlerRuntimeMillis":1910}

@charnould: Note also that I'm using pubDate rather than pubdate since XML is case-sensitive.

2 replies

charnould Feb 21, 2023
Author

@LeMoussel Thanks a lot for your help. Pretty cool to see how you debugged this (I couldn't have done it myself).
Have a nice day in Haute-Normandie!

LeMoussel Feb 21, 2023

@charnould: Thanks. It's a good day, with magnificent sunshine :)

vladfrangu · 2023-02-20T23:34:39Z

Maintainer

Just checked through this, thanks for reporting! Will hopefully prepare the patch today :D

2 replies

charnould Feb 21, 2023
Author

Please let me/us know when this is fixed; I will update my Stack Overflow question accordingly. Thanks a lot.

LeMoussel Mar 9, 2023

It's fixed in V3.3.0
