Crawl seems to be missing some pages #9

Closed
chosak opened this issue Oct 29, 2020 · 3 comments

Comments

chosak (Member) commented Oct 29, 2020

The crawl results seem to miss some pages, even though they should fall within the current specified depth.

Current behavior

Crawl results are incomplete. For example, last night's run used --depth=4. But the results under /owning-a-home/compare/ only include two sub-pages even though the page sidebar has 5 sub-pages:

[Screenshots: crawl results under /owning-a-home/compare/ showing only two sub-pages, and the page sidebar listing five sub-pages]

Starting from the page root, this chain of links puts the sub-pages at only depth 3:

  • Home page (/)
  • Mega menu link to Buying a House (/owning-a-home/)
  • Link to Compare page (/owning-a-home/process/compare/)
  • Sub-pages

Expected behavior

All pages within the specified depth are crawled.

chosak (Member Author) commented Oct 29, 2020

The above example is wrong! Confusingly, those sub-pages of /owning-a-home/compare/ actually live under /owning-a-home/process/compare/, and those are getting crawled properly. Need to review the results of a more thorough crawl.

chosak (Member Author) commented Oct 30, 2020

An example that fails: on the Submit a complaint page, the "Start a new complaint" button points to https://www.consumerfinance.gov/complaint/getting-started. Because we use Django's settings.APPEND_SLASH, this redirects to the actual page at https://www.consumerfinance.gov/complaint/getting-started/. But wget doesn't even try to download this link.
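For reference, the redirect itself is easy to see outside the crawler (just an illustrative check, not part of our crawl command):

```sh
# Django's APPEND_SLASH answers the slash-less URL with a 301 pointing
# at the canonical trailing-slash page.
curl -sI https://www.consumerfinance.gov/complaint/getting-started \
  | grep -iE '^(HTTP|location)'
# Expected output, roughly:
#   HTTP/... 301 ...
#   location: https://www.consumerfinance.gov/complaint/getting-started/
```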

This URL shows up in wget's rejection log as rejected for reason RULES. I think this is because we now use --accept html. That works when wget hits a URL ending in a slash, since it creates an index.html file for it. I suspect that when it sees a URL like "getting-started", it treats the last path segment as a filename whose suffix ("getting-started") doesn't match the accept list, and thus ignores it. We might need to go back to the old version from 2ab81ec where we instead explicitly reject non-html file types.
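For context, the two invocation styles being compared look roughly like this (a sketch only; the real options live in the crawl script, --level stands in for our --depth flag, and the reject suffix list below is illustrative rather than what 2ab81ec actually contained):

```sh
# Current approach: whitelist by suffix with --accept. A link such as
# /complaint/getting-started has no suffix, so the accept rule can drop it
# before it is ever fetched (it then appears in --rejected-log with reason RULES).
wget --recursive --level=4 --accept=html \
     --rejected-log=rejected.csv \
     https://www.consumerfinance.gov/

# Older approach (roughly what 2ab81ec did): blacklist known non-HTML suffixes
# with --reject instead, so suffix-less URLs fall through and still get crawled.
wget --recursive --level=4 \
     --reject=css,js,png,jpg,jpeg,gif,svg,ico,pdf,zip \
     --rejected-log=rejected.csv \
     https://www.consumerfinance.gov/
```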

chosak (Member Author) commented Nov 3, 2020

We might need to go back to the old version from 2ab81ec where we instead explicitly reject non-html file types.

Fixed by #15; see for example that /complaint/getting-started/index.html is properly saved.
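A quick spot check after a crawl (the output directory prefix here is hypothetical; adjust to wherever the crawl output lands):

```sh
# The previously-rejected slash-less link should now resolve and be saved
# as a trailing-slash page with an index.html file.
test -f crawl-output/www.consumerfinance.gov/complaint/getting-started/index.html \
  && echo "getting-started page saved"
```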

chosak closed this as completed Nov 3, 2020