# Analyzing Fragment Usage in DocCloud en-US Pages


Analyze the published content to find where fragments are used.

In [1]:
import { xml2js } from "https://deno.land/x/xml2js@1.0.0/mod.ts";
import { DOMParser } from "https://deno.land/x/deno_dom/deno-dom-wasm.ts";

Functions to read robots.txt, sitemap.xml, and plain.html.

In [2]:
function parseRobotsTxt(text) {
  const lines = text.split('\n');
  const sitemapLines = lines.filter((x) => x.startsWith('Sitemap:'));
  return { sitemaps: sitemapLines.map((x) => x.replace('Sitemap:', '').trim()) };
}

async function readRobotsTxt(url) {
  const resp = await fetch(url);
  return parseRobotsTxt(await resp.text());
}

function parseSitemapXml(text) {
  return xml2js(text, { compact: true });
}

async function readSitemapXml(url) {
  const resp = await fetch(url);
  return parseSitemapXml(await resp.text());
}

function parsePlainHtml(text) {
  return new DOMParser().parseFromString(text, 'text/html');
}

async function readPlainHtml(url) {
  const resp = await fetch(url);
  if (!resp.ok) return null;
  return parsePlainHtml(await resp.text());
}
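
As a quick illustration of the robots.txt step, here is a minimal sketch of a more tolerant parser run against an inline sample. `parseRobotsTxtLoose` and the `example.com` URLs are hypothetical and not part of this notebook; the sketch assumes the `Sitemap` field name may appear in any letter case and that lines may end in CRLF.

```javascript
// Hypothetical tolerant variant of parseRobotsTxt (an assumption, not the notebook's function).
function parseRobotsTxtLoose(text) {
  const sitemaps = text
    .split(/\r?\n/)                                   // accept LF and CRLF line endings
    .map((line) => line.trim())                       // drop stray whitespace, including a lone \r
    .filter((line) => /^sitemap\s*:/i.test(line))     // match the field name in any letter case
    .map((line) => line.slice(line.indexOf(':') + 1).trim()); // keep everything after the first colon
  return { sitemaps };
}

// Inline sample so the sketch runs without network access.
const robotsSampleText = [
  'User-agent: *',
  'Disallow: /private/',
  'Sitemap: https://example.com/sitemap-index.xml',
  'sitemap: https://example.com/other.xml\r',
].join('\n');

const robotsSample = parseRobotsTxtLoose(robotsSampleText);
console.log(robotsSample.sitemaps);
```

Slicing at the first colon, rather than replacing the literal `Sitemap:`, keeps the `https://` part of the URL intact however the field name is capitalized.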

Get all `sitemap.xml`s from robots.txt.

In [3]:
const urlRobotsTxt = 'https://www.adobe.com/robots.txt';
const robots = await readRobotsTxt(urlRobotsTxt);

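Use the sitemap index `https://www.adobe.com/dc.milo.sitemap-index.xml`.

Use only the en-US sitemap, `https://www.adobe.com/dc-shared/assets/sitemap.xml`. Locale sitemaps such as `https://www.adobe.com/ae_ar/dc-shared/assets/sitemap.xml` carry a locale prefix, so the `adobe.com/dc-shared` filter skips them.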

In [4]:
const urlSitemapXml = robots.sitemaps.filter((x) => x.includes('dc.milo'))[0];
const urlSitemaps = await readSitemapXml(urlSitemapXml);
const urlLocSitemaps = urlSitemaps.sitemapindex.sitemap.filter((x) => x.loc._text.includes('adobe.com/dc-shared'));
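
The filter above relies on `sitemapindex.sitemap` being an array. In compact mode, a parser of this kind typically returns a plain object rather than a one-element array when an element occurs only once, so a defensive wrapper may be worth having. The sketch below uses hand-written objects that mimic the assumed compact shape; `toArray`, `pickSitemaps`, and the `example.com` URLs are hypothetical.

```javascript
// Hypothetical helper: wrap a single object in an array, pass arrays through unchanged.
function toArray(value) {
  if (value === undefined || value === null) return [];
  return Array.isArray(value) ? value : [value];
}

// Hypothetical helper: list the sitemap URLs in an index whose location contains `needle`.
function pickSitemaps(index, needle) {
  return toArray(index.sitemapindex?.sitemap)
    .map((entry) => entry.loc._text)
    .filter((loc) => loc.includes(needle));
}

// Hand-written stand-ins for compact-mode parse results (assumed shape, not real data).
const indexWithMany = {
  sitemapindex: {
    sitemap: [
      { loc: { _text: 'https://www.example.com/dc-shared/assets/sitemap.xml' } },
      { loc: { _text: 'https://www.example.com/fr/dc-shared/assets/sitemap.xml' } },
    ],
  },
};
const indexWithOne = {
  sitemapindex: {
    sitemap: { loc: { _text: 'https://www.example.com/dc-shared/assets/sitemap.xml' } },
  },
};

const manyResult = pickSitemaps(indexWithMany, 'example.com/dc-shared');
const oneResult = pickSitemaps(indexWithOne, 'example.com/dc-shared');
console.log(manyResult, oneResult);
```

Matching on the host followed directly by `dc-shared` is what excludes the locale-prefixed entry in the first sample. The same wrapper would apply to `urlset.url` when a sitemap lists a single page.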

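`pages`: HTML pages, keyed by URL, each with the fragments it links to

`fragments`: Fragments, keyed by URL, each with the pages that use it

`subFrags`: Fragments referenced from inside another fragment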

In [5]:
const pages = {};
const fragments = {};
const subFrags = {};
const regex = new RegExp('/dc-shared/fragments/');

Read the en-US sitemaps and register every page URL.

In [10]:
for (let i = 0; i < urlLocSitemaps.length; i++) {
  const url = urlLocSitemaps[i].loc._text;
  console.log(url);
  const sitemap = await readSitemapXml(url);
  if (sitemap.urlset.url) {
    console.log(sitemap.urlset.url.length);
    sitemap.urlset.url.map((x) => x.loc._text).forEach(x => pages[x] = {fragments: []});
  }
}

https://www.adobe.com/dc-shared/assets/sitemap.xml
1380


Clear the result objects so the cells below can be re-run from this point.

In [11]:
Object.keys(pages).forEach(key => pages[key].fragments = []);
Object.keys(fragments).forEach(key => delete fragments[key]);
Object.keys(subFrags).forEach(key => delete subFrags[key]);

Search for fragment links in every page listed in the en-US sitemap. Some pages are served by Dexter: a page with no plain.html is treated as a Dexter page and removed from `pages`.

In [12]:
for (const url of Object.keys(pages)) {
  console.log(url);
  
  const urlObj = new URL(url);

  const urlPlainHtml = url.replace('.html', '.plain.html');
  const document = await readPlainHtml(urlPlainHtml);

  if (!document) {
    delete pages[url];
    continue;
  }

  const nodes = Array.from(document.querySelectorAll('a'));
  const links = nodes.map(x => x.attributes.getNamedItem('href')?.value);

  const frags = links.filter((x) => regex.test(x));

  for (let frag of frags) {
    [frag] = frag.split('#');
    urlObj.pathname = frag;
    const urlFrag = urlObj.href;

    if (!fragments[urlFrag]) {
      fragments[urlFrag] = { pages: [] };
    }

    fragments[urlFrag].pages.push(url);
    pages[url].fragments.push(urlFrag);
  }
}

https://www.adobe.com/acrobat/online/ai-chat-pdf.html
https://www.adobe.com/acrobat/online/ocr-pdf.html
https://www.adobe.com/acrobat/online/pdf-to-ppt.html
https://www.adobe.com/acrobat/online/crop-pdf.html
https://www.adobe.com/acrobat/online/pdf-to-jpg.html
https://www.adobe.com/acrobat/online/jpg-to-pdf.html
https://www.adobe.com/acrobat/online/pdf-editor.html
https://www.adobe.com/acrobat/online/pdf-to-word.html
https://www.adobe.com/acrobat/online/pdf-to-excel.html
https://www.adobe.com/acrobat/online/excel-to-pdf.html
https://www.adobe.com/acrobat/online/png-to-pdf.html
https://www.adobe.com/acrobat/online/password-protect-pdf.html
https://www.adobe.com/acrobat/online/extract-pdf-pages.html
https://www.adobe.com/acrobat/online/convert-pdf.html
https://www.adobe.com/acrobat/online/compress-pdf.html
https://www.adobe.com/acrobat/online/ppt-to-pdf.html
https://www.adobe.com/acrobat/online/request-signature.html
https://www.adobe.com/acrobat/online/sign-pdf.html
https://www.adobe.co

2

Check whether any fragment links to other fragments.

In [13]:
for (const url of Object.keys(fragments)) {
  console.log(url);
  
  const urlObj = new URL(url);

  const urlPlainHtml = `${url}.plain.html`;
  const document = await readPlainHtml(urlPlainHtml);

  if (!document) {
    delete fragments[url];
    continue;
  }

  const nodes = Array.from(document.querySelectorAll('a'));
  const links = nodes.map(x => x.attributes.getNamedItem('href')?.value);

  const frags = links.filter((x) => regex.test(x));

  for (let frag of frags) {
    [frag] = frag.split('#');
    urlObj.pathname = frag;
    const urlFrag = urlObj.href;

    if (!subFrags[urlFrag]) {
      subFrags[urlFrag] = { fragments: [] };
    }

    subFrags[urlFrag].fragments.push(url);

    if (!fragments[url].subFrags) {
      fragments[url].subFrags = [];
    }

    fragments[url].subFrags.push(urlFrag);
  }
}

https://www.adobe.com/dc-shared/fragments/seo-articles/seo-caas-collection
https://www.adobe.com/dc-shared/fragments/seo-articles/acrobat-color-blade
https://www.adobe.com/dc-shared/fragments/resources/tax-preparation/marquee/acquisition
https://www.adobe.com/dc-shared/fragments/resources/tax-preparation/sticky-banner/acquisition
https://www.adobe.com/dc-shared/fragments/resources/tax-preparation/qr-code-blade/acquisition
https://www.adobe.com/dc-shared/fragments/modals/videos/resources/combine-and-organize
https://www.adobe.com/dc-shared/fragments/modals/videos/resources/password-protect
https://www.adobe.com/dc-shared/fragments/resources/tax-preparation/bottom-blade/acquisition
https://www.adobe.com/dc-shared/fragments/modals/videos/seo-how-to/holiday-2022-thank-you
https://www.adobe.com/dc-shared/fragments/promo-banners/dc-refresh
https://www.adobe.com/dc-shared/fragments/shared-fragments/pricing-pods/standard-pro-know
https://www.adobe.com/dc-shared/fragments/modals/videos/how-to/2
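
Both loops above turn a fragment href into a full URL by assigning it to `urlObj.pathname`. That works for root-relative hrefs, but an absolute href or one carrying a query string would likely not survive the assignment intact. A minimal sketch of an alternative, assuming the standard `URL` constructor with a base URL, is shown below; `resolveFragment` and the `example.com` addresses are hypothetical.

```javascript
// Hypothetical helper: resolve an href against the page it was found on,
// then drop the hash and query so the same fragment always maps to one key.
function resolveFragment(href, pageUrl) {
  const resolved = new URL(href, pageUrl); // handles root-relative and absolute hrefs
  resolved.hash = '';
  resolved.search = '';
  return resolved.href;
}

const pageUrlSample = 'https://www.example.com/acrobat/online/sample.html';

const fromRelative = resolveFragment('/dc-shared/fragments/x#modal', pageUrlSample);
const fromAbsolute = resolveFragment('https://other.example.com/dc-shared/fragments/y', pageUrlSample);
const fromQuery = resolveFragment('/dc-shared/fragments/z?foo=1#bar', pageUrlSample);

console.log(fromRelative);
console.log(fromAbsolute);
console.log(fromQuery);
```

Collecting the resolved URLs in a `Set` per page would also keep a fragment that is linked twice on the same page from being counted twice, if that matters for the analysis.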

Count how many pages use each number of fragments.

In [14]:
const buckets = {};
for (const url of Object.keys(pages)) {
  const count = pages[url].fragments.length;
  if (!buckets[count]) {
    buckets[count] = []
  }
  buckets[count].push(url);
}
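// Sort counts numerically; the default sort compares strings and would put 10 before 2.
for (const count of Object.keys(buckets).sort((a, b) => Number(a) - Number(b))) {
  console.log(`Fragment Count ${count}: ${buckets[count].length} pages`);
}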

Fragment Count 0: 1023 pages
Fragment Count 1: 183 pages
Fragment Count 2: 147 pages
Fragment Count 3: 18 pages
Fragment Count 4: 5 pages
Fragment Count 6: 1 pages
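
The bucketing step can also be written as a small reusable function. The sketch below is one possible shape, not the notebook's own code: it keeps counts as numbers in a `Map` and sorts them with a numeric comparator, since a comparator-less `sort()` on string keys orders `'10'` before `'2'`. `bucketByFragmentCount` and the sample data are hypothetical.

```javascript
// Hypothetical helper: group page URLs by how many fragments each page uses.
// Returns [count, urls] pairs sorted by count in ascending numeric order.
function bucketByFragmentCount(pagesMap) {
  const buckets = new Map();
  for (const [url, page] of Object.entries(pagesMap)) {
    const count = page.fragments.length;
    if (!buckets.has(count)) buckets.set(count, []);
    buckets.get(count).push(url);
  }
  return [...buckets.entries()].sort(([a], [b]) => a - b);
}

// Made-up sample data in the same shape as the `pages` object.
const pagesSample = {
  'https://www.example.com/a.html': { fragments: [] },
  'https://www.example.com/b.html': { fragments: ['f1', 'f2'] },
  'https://www.example.com/c.html': { fragments: Array.from({ length: 10 }, (_, i) => `f${i}`) },
  'https://www.example.com/d.html': { fragments: [] },
};

const histogram = bucketByFragmentCount(pagesSample);
for (const [count, urls] of histogram) {
  console.log(`Fragment Count ${count}: ${urls.length} pages`);
}
```

With this sample the counts come out in the order 0, 2, 10, which is the ordering a string sort would get wrong.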
