Crawl seems to be missing some pages #9

Closed
chosak opened this issue Oct 29, 2020 · 3 comments

Comments

chosak (Member) commented Oct 29, 2020

The crawl results seem to miss some pages, even though they should fall within the current specified depth.

Current behavior

Crawl results are incomplete. For example, last night's run used --depth=4. But the results under /owning-a-home/compare/ only include two sub-pages even though the page sidebar has 5 sub-pages:

[Screenshots: crawl results under /owning-a-home/compare/ showing only two sub-pages, and the page sidebar listing five sub-pages]

Starting from the page root, this chain of links puts the sub-pages at only depth 3:

  • Home page (/)
  • Mega menu link to Buying a House (/owning-a-home/)
  • Link to Compare page (/owning-a-home/process/compare/)
  • Sub-pages

Expected behavior

All pages within the specified depth are crawled.

chosak (Member Author) commented Oct 29, 2020

The above example is wrong! Confusingly, those sub-pages of /owning-a-home/compare/ actually live under /owning-a-home/process/compare/, and those are getting crawled properly. Need to review the results of a more thorough crawl.

chosak (Member Author) commented Oct 30, 2020

An example that fails: on the Submit a complaint page, the "Start a new complaint" button points to https://www.consumerfinance.gov/complaint/getting-started. Because we use Django's settings.APPEND_SLASH, this redirects to the actual page at https://www.consumerfinance.gov/complaint/getting-started/. But wget doesn't even try to download this link.
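For reference, the redirect itself is easy to see outside the crawler (just an illustrative check, not part of our crawl command):

```sh
# Django's APPEND_SLASH answers the slash-less URL with a 301 pointing
# at the canonical trailing-slash page.
curl -sI https://www.consumerfinance.gov/complaint/getting-started \
  | grep -iE '^(HTTP|location)'
# Expected output, roughly:
#   HTTP/... 301 ...
#   location: https://www.consumerfinance.gov/complaint/getting-started/
```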

This URL shows up in wget's rejection log as rejected for reason RULES. I think this is because we now use --accept html. That works when wget hits a URL ending in a slash, since it creates an index.html file for it. I suspect that when it sees a URL like "getting-started", it treats the last path segment as a filename whose suffix ("getting-started") doesn't match the accept list, and thus ignores it. We might need to go back to the old version from 2ab81ec where we instead explicitly reject non-html file types.
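For context, the two invocation styles being compared look roughly like this (a sketch only; the real options live in the crawl script, --level stands in for our --depth flag, and the reject suffix list below is illustrative rather than what 2ab81ec actually contained):

```sh
# Current approach: whitelist by suffix with --accept. A link such as
# /complaint/getting-started has no suffix, so the accept rule can drop it
# before it is ever fetched (it then appears in --rejected-log with reason RULES).
wget --recursive --level=4 --accept=html \
     --rejected-log=rejected.csv \
     https://www.consumerfinance.gov/

# Older approach (roughly what 2ab81ec did): blacklist known non-HTML suffixes
# with --reject instead, so suffix-less URLs fall through and still get crawled.
wget --recursive --level=4 \
     --reject=css,js,png,jpg,jpeg,gif,svg,ico,pdf,zip \
     --rejected-log=rejected.csv \
     https://www.consumerfinance.gov/
```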

chosak (Member Author) commented Nov 3, 2020

We might need to go back to the old version from 2ab81ec where we instead explicitly reject non-html file types.

Fixed by #15; see for example that /complaint/getting-started/index.html is properly saved.
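A quick spot check after a crawl (the output directory prefix here is hypothetical; adjust to wherever the crawl output lands):

```sh
# The previously-rejected slash-less link should now resolve and be saved
# as a trailing-slash page with an index.html file.
test -f crawl-output/www.consumerfinance.gov/complaint/getting-started/index.html \
  && echo "getting-started page saved"
```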

chosak closed this as completed Nov 3, 2020