CDATA in description not parsed as desired #66

Closed
CoryKniefel opened this issue Dec 4, 2022 · 5 comments


CoryKniefel commented Dec 4, 2022

Hi Team,

Thanks for building this open-source tool. I'm new to dealing with RSS feeds and wanted an easy way to parse the data into typed objects. I'm having an issue with one feed whose description embeds a lot of CDATA: a large block of HTML with styles, links to images, and so on.

Here is an example:
(NOTE: the browser hides some of this when it renders the issue; opening the issue in Edit view may show all of the data. I don't know of a way to stop it from rendering as HTML here.)

<description><![CDATA[<a href="https://someorg.org/blog/meeting-the-obligations-of-the-german-supply-chain-due-diligence-act-faqs/" title="Meeting the Obligations of the German Supply Chain Due Diligence Act: FAQs" rel="nofollow"><img width="300" height="157" src="https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI-300x157.jpg" class="webfeedsFeaturedVisual wp-post-image" alt="German Flag over building" decoding="async" style="float: left; margin-right: 5px;" link_thumbnail="1" loading="lazy" srcset="https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI-300x157.jpg 300w, https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI-1024x536.jpg 1024w, https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI-768x402.jpg 768w, https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI.jpg 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><p>The German Supply Chain <span class="glossaryLink"  aria-describedby="tt"  data-cmtooltip="&#38;lt;!-- wp:paragraph --&#38;gt;Often the second stage in the third-party risk management life cycle. Due diligence involves conducting a review of a potential third party prior to signing a contract. This review should involve developing a deeper understanding of the third party&#8217;s ownership, operations, resources, financial status, relevant employees, risk and control framework, business continuity program, third-party risk management program, and other factors important to the third-party relationship. Due diligence helps ensure the organization selects an appropriate third party to partner with, and that the organization understands both the inherent and residual risks posed by the relationship. 
These residual risks should be within the organization&#8217;s risk appetite.&#38;lt;br/&#38;gt;&#38;lt;!-- /wp:paragraph --&#38;gt;"  data-gt-translate-attributes='[{"attribute":"data-cmtooltip", "format":"html"}]'>Due Diligence</span> Act goes into effect January 2023 and is already making waves within supply chain, risk management, and compliance communities. [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://someorg.org/blog/meeting-the-obligations-of-the-german-supply-chain-due-diligence-act-faqs/">Meeting the Obligations of the German Supply Chain Due Diligence Act: FAQs</a> appeared first on <a rel="nofollow" href="https://someorg.org">Aravo</a>.</p>
]]></description>

Options:

{
  descriptionMaxLen: 20000,
  xmlParserOptions: {
    // I've tried a bunch... nothing "worked"
  }
}

Output:

description:  "The German Supply Chain Due Diligence Act goes into effect January 2023 and is already making waves within supply chain, risk management, and compliance communities. [&#8230;] The post Meeting the Obligations of the German Supply Chain Due Diligence Act: FAQs appeared first on Aravo."
link:  "https://aravo.com/blog/meeting-the-obligations-of-the-german-supply-chain-due-diligence-act-faqs/"
published:  "2022-12-01T14:33:07.000Z"
title:  "Meeting the Obligations of the German Supply Chain Due Diligence Act: FAQs"

Desired output: All contents of the description CDATA

Questions:

  • Is this something that can be supported?
  • How unusual (to you) is this use of the description field (all CDATA of HTML)?

ndaidong (Collaborator) commented Dec 5, 2022

@CoryKniefel could you share a link to that feed source so I can investigate its content structure? (I cannot access someorg.org)

CoryKniefel (Author) commented

> @CoryKniefel could you share a link to that feed source so I can investigate its content structure? (I cannot access someorg.org)

I emailed you at the link. Please let me know if you didn't receive it.

ndaidong (Collaborator) commented Dec 5, 2022

@CoryKniefel I received it. I may be able to find the problem there.

ndaidong (Collaborator) commented Dec 5, 2022

@CoryKniefel you can use getExtraEntryFields. This is a function that lets you customize the output by adding or modifying any part of the feed data.

Please try this and let me know if the result matches your expectation:

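import { read } from '@extractus/feed-extractor'

const YOUR_FEED_URL = 'https://a...'

const feed = await read(YOUR_FEED_URL, {
  // Called for each entry; the returned fields are added to that entry in the output
  getExtraEntryFields: (feedEntry) => {
    // Expose the entry's original description under a separate `content` field
    const { description: content } = feedEntry
    return {
      content,
    }
  }
})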
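
To see what the callback itself returns, it can be run on its own without fetching a feed. The sketch below pulls the same function out of the `read` call and applies it to a hand-written entry. The `sampleEntry` object and its field values are hypothetical and only stand in for whatever the parser actually passes to the callback; real entries may be shaped differently.

```javascript
// The callback from the snippet above, defined standalone so it can be
// exercised without a network request.
const getExtraEntryFields = (feedEntry) => {
  const { description: content } = feedEntry
  return { content }
}

// Hypothetical entry, loosely modeled on the feed item in this issue.
// The real object handed to the callback may have a different shape.
const sampleEntry = {
  title: 'Meeting the Obligations of the German Supply Chain Due Diligence Act: FAQs',
  description:
    '<p>The German Supply Chain <span class="glossaryLink">Due Diligence</span> Act goes into effect January 2023.</p>',
}

const extra = getExtraEntryFields(sampleEntry)

// The HTML is passed through untouched under the new `content` key
console.log(extra.content)
```

Because the callback only copies the field, any markup present in the description it receives is kept as-is, while the default `description` in the output stays the shortened plain-text version.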

CoryKniefel (Author) commented

Yes, matches expectations perfectly. Thanks for showing me how that works.
