Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Malformed Sitemap content when url contains searchParams #2420

Closed
1 task done
austinbuckler opened this issue Apr 16, 2024 · 7 comments · Fixed by #2430
Closed
1 task done

Malformed Sitemap content when url contains searchParams #2420

austinbuckler opened this issue Apr 16, 2024 · 7 comments · Fixed by #2430
Labels
bug Something isn't working.

Comments

@austinbuckler
Copy link
Contributor

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/utils

Issue description

  1. Pass a sitemap url that ends with searchParams
  2. See that a warning is printed: WARN Malformed sitemap content: <url>?from=1234&to=5678

Code sample

import { Sitemap } from "crawlee";

const sitemap = await Sitemap.load("https://www.fashionnova.com/sitemap.xml");

// sample of the sitemap:
// <sitemap>
//   <loc>https://www.fashionnova.com/sitemap_products_1.xml?from=175895313&to=8651136721</loc>
// </sitemap>

// Run the crawler as you do.
// Get the warning message above.

Package version

3.9.1

Node.js version

20

Operating system

Linux

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

N/A

Other context

I see that the Sitemap module was updated recently to support additional content types. I would like to contribute an additional improvement that changes sitemapUrl from a string to a URL object.

something like:

// L150
let sitemapUrl = new URL(parsingState.sitemapUrls.pop()!);

// L156
const request = gotScraping.stream({ url: sitemapUrl.toString(), proxyUrl, method: 'GET' });

// L164
if (sitemapUrl.pathname.endsWith('.gz')) {
 // L166
 sitemapUrl = new URL(sitemapUrl.pathname.substring(0, sitemapUrl.pathname.length - 3), sitemapUrl);
}

// L172
if (['text/xml', 'application/xml'].includes(contentType ?? '') || sitemapUrl.pathname.endsWith('.xml') {}

// L176
if (contentType === 'text/plain' || sitemapUrl.pathname.endsWith('.txt')) { }
@austinbuckler austinbuckler added the bug Something isn't working. label Apr 16, 2024
@janbuchar
Copy link
Contributor

Hello @austinbuckler! Do I understand correctly that the problem is that the sitemap content type is not detected correctly because the extension is followed by a query string?

If that's the case, the solution that you propose should help. If you want to contribute a patch, feel free to do so - we'll be grateful to accept that!

@austinbuckler
Copy link
Contributor Author

Do I understand correctly that the problem is that the sitemap content type is not detected correctly because the extension is followed by a query string?

You are correct!

If that's the case, the solution that you propose should help. If you want to contribute a patch, feel free to do so - we'll be grateful to accept that!

Awesome, will submit a patch this week. Thank you for the prompt response!

@austinbuckler
Copy link
Contributor Author

@janbuchar here is the patch 🥂

From: Austin <austin@lionhat.co>
Date: Thu, 18 Apr 2024 06:59:13 +0000
Subject: [PATCH] fix: malformed sitemap when child loc contains querystrings.

---
 packages/utils/CHANGELOG.md             | 12 ++++++++++++
 packages/utils/src/internals/sitemap.ts | 12 ++++++------
 packages/utils/test/sitemap.test.ts     |  8 +++++++-
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/packages/utils/CHANGELOG.md b/packages/utils/CHANGELOG.md
index 70181f66..dd1ec681 100644
--- a/packages/utils/CHANGELOG.md
+++ b/packages/utils/CHANGELOG.md
@@ -3,6 +3,18 @@
 All notable changes to this project will be documented in this file.
 See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.
 
+
+## [3.9.3](https://github.com/apify/crawlee/compare/v3.9.2...v3.9.3) (2024-04-18)
+
+
+### Features
+
+* **sitemap:** Support for querystrings in sitemap child urls. ([#2420](https://github.com/apify/crawlee/issues/2420))
+
+
+
+
+
 ## [3.9.2](https://github.com/apify/crawlee/compare/v3.9.1...v3.9.2) (2024-04-17)
 
 
diff --git a/packages/utils/src/internals/sitemap.ts b/packages/utils/src/internals/sitemap.ts
index bf3c88e3..c25d891d 100644
--- a/packages/utils/src/internals/sitemap.ts
+++ b/packages/utils/src/internals/sitemap.ts
@@ -149,8 +149,8 @@ export class Sitemap {
         parsingState.sitemapUrls = Array.isArray(urls) ? urls : [urls];
 
         while (parsingState.sitemapUrls.length > 0) {
-            let sitemapUrl = parsingState.sitemapUrls.pop()!;
-            parsingState.visitedSitemapUrls.push(sitemapUrl);
+            let sitemapUrl = new URL(parsingState.sitemapUrls.pop()!);
+            parsingState.visitedSitemapUrls.push(sitemapUrl.toString());
             parsingState.resetContext();
 
             try {
@@ -163,19 +163,19 @@ export class Sitemap {
                 if (sitemapStream.response!.statusCode === 200) {
                     await new Promise((resolve, reject) => {
                         let stream: Duplex = sitemapStream;
-                        if (sitemapUrl.endsWith('.gz')) {
+                        if (sitemapUrl.pathname.endsWith('.gz')) {
                             stream = stream.pipe(createGunzip()).on('error', reject);
-                            sitemapUrl = sitemapUrl.substring(0, sitemapUrl.length - 3);
+                            sitemapUrl.pathname = sitemapUrl.pathname.substring(0, sitemapUrl.pathname.length - 3)
                         }
 
                         const parser = (() => {
                             const contentType = sitemapStream.response!.headers['content-type'];
 
-                            if (['text/xml', 'application/xml'].includes(contentType ?? '') || sitemapUrl.endsWith('.xml')) {
+                            if (['text/xml', 'application/xml'].includes(contentType ?? '') || sitemapUrl.pathname.endsWith('.xml')) {
                                 return Sitemap.createXmlParser(parsingState, () => resolve(undefined), reject);
                             }
 
-                            if (contentType === 'text/plain' || sitemapUrl.endsWith('.txt')) {
+                            if (contentType === 'text/plain' || sitemapUrl.pathname.endsWith('.txt')) {
                                 return new SitemapTxtParser(parsingState, () => resolve(undefined));
                             }
 
diff --git a/packages/utils/test/sitemap.test.ts b/packages/utils/test/sitemap.test.ts
index b3791c52..1b0087ca 100644
--- a/packages/utils/test/sitemap.test.ts
+++ b/packages/utils/test/sitemap.test.ts
@@ -7,7 +7,9 @@ describe('Sitemap', () => {
     beforeEach(() => {
         nock.disableNetConnect();
         nock('http://not-exists.com').persist()
-            .get('/sitemap_child.xml')
+            .get(url => {
+                return url === '/sitemap_child.xml' || url === '/sitemap_child_2.xml'
+            })
             .reply(200, [
                 '<?xml version="1.0" encoding="UTF-8"?>',
                 '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
@@ -59,6 +61,10 @@ describe('Sitemap', () => {
                 '<loc>http://not-exists.com/sitemap_child.xml</loc>',
                 '<lastmod>2004-12-23</lastmod>',
                 '</sitemap>',
+                '<sitemap>',
+                '<loc>http://not-exists.com/sitemap_child_2.xml?from=94937939985&amp;to=1318570721404</loc>',
+                '<lastmod>2004-12-23</lastmod>',
+                '</sitemap>',
                 '</sitemapindex>',
             ].join('\n'))
             .get('/not_actual_xml.xml')
-- 
2.44.0

@janbuchar
Copy link
Contributor

@janbuchar here is the patch 🥂

Thank you very much! I assumed you'd open a pull request so that we 1. see if it passes tests and 2. can discuss it better. Also, if we accept that PR, you'll be listed as a contributor 🙂 Care to do that?

@B4nan
Copy link
Member

B4nan commented Apr 18, 2024

Also please don't update changelogs, they are generated.

@austinbuckler
Copy link
Contributor Author

austinbuckler commented Apr 18, 2024

Thank you very much! I assumed you'd open a pull request so that we 1. see if it passes tests and 2. can discuss it better. Also, if we accept that PR, you'll be listed as a contributor 🙂 Care to do that?

sure, will get that done this evening.

Also please don't update changelogs, they are generated.

good to know, will adjust — the contribution document implies that it should be done manually 😅

@janbuchar
Copy link
Contributor

good to know, will adjust — the contribution document implies that it should be done manually 😅

Ah, it's great that somebody actually reads those 😁 We'll update it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants