# Analyzing Fragment Usage in DocCloud en-US Pages


Analyze page content to find fragment usage.

In [15]:
import { xml2js } from "https://deno.land/x/xml2js@1.0.0/mod.ts";
import { DOMParser } from "https://deno.land/x/deno_dom/deno-dom-wasm.ts";

Functions to read robots.txt, sitemap.xml, and plain.html.

In [16]:
function parseRobotsTxt(text) {
  const lines = text.split('\n');
  const sitemapLines = lines.filter((x) => x.startsWith('Sitemap:'));
  return { sitemaps: sitemapLines.map((x) => x.replace('Sitemap:', '').trim()) };
}

async function readRobotsTxt(url) {
  const resp = await fetch(url);
  return parseRobotsTxt(await resp.text());
}

function parseSitemapXml(text) {
  return xml2js(text, { compact: true });
}

async function readSitemapXml(url) {
  const resp = await fetch(url);
  return parseSitemapXml(await resp.text());
}

function parsePlainHtml(text) {
  return new DOMParser().parseFromString(text, 'text/html');
}

async function readPlainHtml(url) {
  const resp = await fetch(url);
  if (!resp.ok) return null;
  return parsePlainHtml(await resp.text());
}
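
As a quick offline check of the robots.txt parsing step, the sketch below runs the same idea against an inline sample instead of a live fetch. The sample text and the name `parseRobotsTxtLenient` are assumptions for illustration; this variant also accepts a lower-case `sitemap:` prefix and Windows line endings, which the notebook's `parseRobotsTxt` does not.

```javascript
// Hypothetical lenient variant of parseRobotsTxt (not part of the notebook).
// Matches the "Sitemap:" directive case-insensitively and tolerates \r\n.
function parseRobotsTxtLenient(text) {
  const sitemaps = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => /^sitemap:/i.test(line))
    .map((line) => line.slice('sitemap:'.length).trim());
  return { sitemaps };
}

// Made-up sample content; the example.com URLs are placeholders.
const sampleRobotsTxt = [
  'User-agent: *',
  'Disallow: /private/',
  'Sitemap: https://example.com/sitemap-index.xml',
  'sitemap: https://example.com/other-sitemap.xml',
].join('\r\n');

const sampleRobots = parseRobotsTxtLenient(sampleRobotsTxt);
console.log(sampleRobots.sitemaps);
```

The return shape matches the notebook's `{ sitemaps: [...] }`, so a variant like this could be swapped in without touching later cells.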

Get all `sitemap.xml`s from robots.txt.

In [17]:
const urlRobotsTxt = 'https://www.adobe.com/robots.txt';
const robots = await readRobotsTxt(urlRobotsTxt);

Use the sitemap-index.xml `https://www.adobe.com/dc.milo.sitemap-index.xml`

Use only the en-US sitemap, `https://www.adobe.com/dc-shared/assets/sitemap.xml`.

In [18]:
const urlSitemapXml = robots.sitemaps.filter((x) => x.includes('dc.milo'))[0];
const urlSitemaps = await readSitemapXml(urlSitemapXml);
const urlLocSitemaps = urlSitemaps.sitemapindex.sitemap.filter((x) => x.loc._text.includes('adobe.com/dc-shared'));
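
One thing worth guarding against: XML-to-object converters in compact mode commonly return a single object rather than a one-element array when a tag occurs only once, so calling `.filter` on `sitemapindex.sitemap` may throw for an index with a single entry. Whether this `xml2js` port behaves that way is an assumption to verify. The sketch below uses hand-written objects in place of parsed XML and a hypothetical `asArray` helper.

```javascript
// Hypothetical helper: normalize "one object or an array of objects" to an array.
function asArray(value) {
  if (value === undefined || value === null) return [];
  return Array.isArray(value) ? value : [value];
}

// Hand-written stand-ins for compact-mode results (placeholder URLs).
const singleEntryIndex = {
  sitemapindex: {
    sitemap: { loc: { _text: 'https://example.com/dc-shared/assets/sitemap.xml' } },
  },
};
const multiEntryIndex = {
  sitemapindex: {
    sitemap: [
      { loc: { _text: 'https://example.com/dc-shared/assets/sitemap.xml' } },
      { loc: { _text: 'https://example.com/ae_ar/dc-shared/assets/sitemap.xml' } },
    ],
  },
};

// Same kind of filter as the notebook, applied after normalization.
function pickSitemaps(index, needle) {
  return asArray(index.sitemapindex.sitemap)
    .map((entry) => entry.loc._text)
    .filter((loc) => loc.includes(needle));
}

const fromSingle = pickSitemaps(singleEntryIndex, 'example.com/dc-shared');
const fromMulti = pickSitemaps(multiEntryIndex, 'example.com/dc-shared');
console.log(fromSingle, fromMulti);
```

The second sample also shows why the `includes` filter keeps only the en-US sitemap: locale sitemaps carry a locale segment between the host and `dc-shared`, so they do not match.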

`pages`: HTML pages

`fragments`: Fragments

`subFrags`: Fragments in a fragment

In [19]:
const pages = {};
const fragments = {};
const subFrags = {};
const regex = new RegExp('/dc-shared/fragments/');
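
To illustrate how the pattern separates fragment links from ordinary ones, here is a small sketch over made-up `href` values. `isFragmentLink` is a hypothetical helper; it adds a string check because `RegExp.prototype.test` coerces its argument to a string, so a missing `href` would otherwise be tested as the text "undefined".

```javascript
const fragmentPattern = /\/dc-shared\/fragments\//;

// Hypothetical helper: true only for string hrefs that contain the fragment path.
function isFragmentLink(href) {
  return typeof href === 'string' && fragmentPattern.test(href);
}

// Made-up hrefs, including a missing one (an <a> element without href).
const sampleHrefs = [
  '/dc-shared/fragments/promo-banners/example',
  'https://www.adobe.com/acrobat/online.html',
  undefined,
  '/dc-shared/fragments/modals/example#modal',
];

const fragmentHrefs = sampleHrefs.filter(isFragmentLink);
console.log(fragmentHrefs);
```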

Read the en-US sitemap

In [20]:
for (let i = 0; i < urlLocSitemaps.length; i++) {
  const url = urlLocSitemaps[i].loc._text;
  console.log(url);
  const sitemap = await readSitemapXml(url);
  if (sitemap.urlset.url) {
    console.log(sitemap.urlset.url.length);
    sitemap.urlset.url.map((x) => x.loc._text).forEach(x => pages[x] = {fragments: []});
  }
}

https://www.adobe.com/dc-shared/assets/sitemap.xml
1380


Clear the result objects in case the notebook is re-run from the middle.

In [21]:
Object.keys(pages).forEach(key => pages[key].fragments = []);
Object.keys(fragments).forEach(key => delete fragments[key]);
Object.keys(subFrags).forEach(key => delete subFrags[key]);

Search for fragments in every page listed in the en-US sitemap. Some pages are served by Dexter; a page that has no plain.html is treated as a Dexter page and removed from `pages`.
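
The plain.html lookup relies on a URL convention: the notebook assumes page URLs end in `.html`, while fragment URLs have no extension and take a `.plain.html` suffix. A minimal sketch of one helper covering both cases follows. `toPlainHtmlUrl` is a hypothetical name, and URLs ending in `/` return `null` here because the notebook does not show how they map.

```javascript
// Hypothetical helper: derive the .plain.html URL for a page or fragment URL.
function toPlainHtmlUrl(url) {
  const u = new URL(url);
  if (u.pathname.endsWith('/')) return null; // mapping not covered by the notebook
  if (u.pathname.endsWith('.plain.html')) return u.href;
  u.pathname = u.pathname.endsWith('.html')
    ? u.pathname.replace(/\.html$/, '.plain.html') // page: swap the extension
    : `${u.pathname}.plain.html`; // fragment: append the suffix
  return u.href;
}

// Placeholder paths on the real host, used only to show the mapping.
const pagePlain = toPlainHtmlUrl('https://www.adobe.com/acrobat/example.html');
const fragPlain = toPlainHtmlUrl('https://www.adobe.com/dc-shared/fragments/example');
console.log(pagePlain);
console.log(fragPlain);
```

Working on `pathname` rather than the whole string avoids rewriting a `.html` that happens to appear elsewhere in the URL, such as in a query string.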

In [22]:
for (const url of Object.keys(pages)) {
  //console.log(url);
  
  const urlObj = new URL(url);

  const urlPlainHtml = url.replace('.html', '.plain.html');
  const document = await readPlainHtml(urlPlainHtml);

  if (!document) {
    delete pages[url];
    continue;
  }

  const nodes = Array.from(document.querySelectorAll('a'));
  const links = nodes.map(x => x.attributes.getNamedItem('href')?.value);

  const frags = links.filter((x) => regex.test(x));

  for (let frag of frags) {
    [frag] = frag.split('#');
    urlObj.pathname = frag;
    const urlFrag = urlObj.href;

    if (!fragments[urlFrag]) {
      fragments[urlFrag] = { pages: [] };
    }

    fragments[urlFrag].pages.push(url);
    pages[url].fragments.push(urlFrag);
  }
}

2

Check whether any fragment references other fragments.

In [23]:
for (const url of Object.keys(fragments)) {
  //console.log(url);
  
  const urlObj = new URL(url);

  const urlPlainHtml = `${url}.plain.html`;
  const document = await readPlainHtml(urlPlainHtml);

  if (!document) {
    delete fragments[url];
    continue;
  }

  const nodes = Array.from(document.querySelectorAll('a'));
  const links = nodes.map(x => x.attributes.getNamedItem('href')?.value);

  const frags = links.filter((x) => regex.test(x));

  for (let frag of frags) {
    [frag] = frag.split('#');
    urlObj.pathname = frag;
    const urlFrag = urlObj.href;

    if (!subFrags[urlFrag]) {
      subFrags[urlFrag] = { fragments: [] };
    }

    subFrags[urlFrag].fragments.push(url);

    if (!fragments[url].subFrags) {
      fragments[url].subFrags = [];
    }

    fragments[url].subFrags.push(urlFrag);
  }
}
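
Both loops build the absolute fragment URL by assigning the link to `urlObj.pathname`, which assumes every fragment `href` is a root-relative path. A hedged alternative is to let the `URL` constructor resolve the link against the page URL, which also copes with absolute links. `resolveFragmentUrl` below is a hypothetical helper, not the notebook's code, and the paths are placeholders.

```javascript
// Hypothetical helper: resolve a fragment href against the page it appears on
// and drop any "#..." suffix (for example a modal anchor).
function resolveFragmentUrl(href, baseUrl) {
  const resolved = new URL(href, baseUrl);
  resolved.hash = '';
  return resolved.href;
}

const base = 'https://www.adobe.com/acrobat/example.html';
const fromRelative = resolveFragmentUrl('/dc-shared/fragments/modals/example#modal', base);
const fromAbsolute = resolveFragmentUrl('https://www.adobe.com/dc-shared/fragments/modals/example', base);
console.log(fromRelative, fromAbsolute);
```

Because both forms normalize to the same string, they land on the same key in `fragments` or `subFrags` instead of being counted separately.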

Count how many pages use each number of fragments.

In [24]:
const buckets = {};
for (const url of Object.keys(pages)) {
  const count = pages[url].fragments.length;
  if (!buckets[count]) {
    buckets[count] = []
  }
  buckets[count].push(url);
}
// Sort numerically; the default sort compares keys as strings.
for (const count of Object.keys(buckets).sort((a, b) => a - b)) {
  console.log(`Fragment Count ${count}: ${buckets[count].length} pages`)
}

Fragment Count 0: 1023 pages
Fragment Count 1: 183 pages
Fragment Count 2: 147 pages
Fragment Count 3: 18 pages
Fragment Count 4: 5 pages
Fragment Count 6: 1 pages
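
The same tally can be produced by a small standalone helper. The sketch below uses made-up data and a hypothetical `countByFragmentTotal` function; it sorts the bucket keys numerically, since the default `Array.prototype.sort` compares strings and would place 10 before 2.

```javascript
// Hypothetical helper: count pages by how many fragments each one uses.
function countByFragmentTotal(pageMap) {
  const histogram = new Map();
  for (const { fragments } of Object.values(pageMap)) {
    const count = fragments.length;
    histogram.set(count, (histogram.get(count) ?? 0) + 1);
  }
  // Numeric sort of [count, pages] pairs by count.
  return [...histogram.entries()].sort((a, b) => a[0] - b[0]);
}

// Made-up input in the same shape as the notebook's `pages` object.
const samplePages = {
  'https://example.com/a.html': { fragments: [] },
  'https://example.com/b.html': { fragments: ['f1', 'f2'] },
  'https://example.com/c.html': { fragments: Array.from({ length: 10 }, (_, i) => `f${i}`) },
  'https://example.com/d.html': { fragments: [] },
};

const sampleHistogram = countByFragmentTotal(samplePages);
for (const [count, total] of sampleHistogram) {
  console.log(`Fragment Count ${count}: ${total} pages`);
}
```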


In [42]:
// Rank fragments by the number of pages that use them and print the top 10.
const fragsUsedBy = Object.keys(fragments).map(frag => ({url: frag, usedByCount: fragments[frag].pages.length}));
fragsUsedBy.sort((a, b) => b.usedByCount - a.usedByCount);
for (const { url, usedByCount } of fragsUsedBy.slice(0, 10)) {
  console.log(`${url} is used by ${usedByCount} pages.`);
}

https://www.adobe.com/dc-shared/fragments/seo-articles/acrobat-color-blade is used by 244 pages.
https://www.adobe.com/dc-shared/fragments/seo-articles/seo-caas-collection is used by 113 pages.
https://www.adobe.com/dc-shared/fragments/shared-fragments/pricing-pods/standard-pro-know is used by 26 pages.
https://www.adobe.com/dc-shared/fragments/promo-banners/dc-refresh is used by 25 pages.
https://www.adobe.com/dc-shared/fragments/resources/want-to-know-more is used by 25 pages.
https://www.adobe.com/dc-shared/fragments/resources/assurance-you-need is used by 15 pages.
https://www.adobe.com/dc-shared/fragments/shared-fragments/business/red-acrobat-bg-want-know-more is used by 7 pages.
https://www.adobe.com/dc-shared/fragments/shared-fragments/acrobat-icon-blocks/purple-acrobat-iconblock-know-more is used by 7 pages.
https://www.adobe.com/dc-shared/fragments/modals/free-trial/sign-free-trial is used by 7 pages.
https://www.adobe.com/dc-shared/fragments/shared-fragments/business/black-ge