URLs could be linked from multiple locations #8

Open
ethanp opened this Issue Jun 19, 2016 · 5 comments

@ethanp
Collaborator

ethanp commented Jun 19, 2016

Suppose we have two valid URLs A & B. Both of them contain a link to a third URL Z, which is a 404.

  • The first one to be crawled is A, so Z's parent points to A.
  • Then B is crawled, and we don't re-add Z to the pages map because it is already in there.
  • Then Z is crawled, and it is marked as a failure, with parent A.

In the report, we don't find out that B had a broken link to Z.

Is that correct? If so, perhaps a WebPage should have a set of parent links, not just one? Then whenever we see a URL that has already been added to pages, the new parent link is added.
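A minimal sketch of that idea, assuming a Python-style model (the `WebPage`, `pages`, and `add_link` names here are illustrative, not the project's actual identifiers):

```python
# Sketch: record every page that links to a URL, not only the first
# crawler to discover it, so the report can list all broken references.

class WebPage:
    def __init__(self, url):
        self.url = url
        self.parents = set()  # every URL that links here
        self.status = None    # e.g. 200 or 404, filled in when crawled

pages = {}  # global map: url -> WebPage

def add_link(source_url, target_url):
    """Record that source_url links to target_url."""
    page = pages.setdefault(target_url, WebPage(target_url))
    # Even if the target was already discovered, record the new parent.
    page.parents.add(source_url)

# A links to Z, then B links to Z: both parents are retained.
add_link("http://example.com/A", "http://example.com/Z")
add_link("http://example.com/B", "http://example.com/Z")
```

With this, when Z is marked as a 404, the report could walk `Z.parents` and flag both A and B.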

@Kyle-Falconer

Member

Kyle-Falconer commented Jun 19, 2016

Oh, I see how this could be useful, so that someone could see each instance of the broken link and go to all those pages to fix it.

The proposed implementation would work as well, but I'm wondering how that might be displayed to the user. Maybe something that represents "link was referenced by these pages" instead of "parent":

referenced_by : []
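A report entry for the broken URL might then look something like this (a hypothetical shape, not the project's actual output format):

```json
{
  "url": "http://example.com/Z",
  "status": 404,
  "referenced_by": [
    "http://example.com/A",
    "http://example.com/B"
  ]
}
```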
@ethanp

Collaborator

ethanp commented Jun 19, 2016

Yeah, sounds good.

@Kyle-Falconer

Member

Kyle-Falconer commented Jun 19, 2016

The other way to look at this is as a one-directional tree. We could remove the parent or referenced_by fields and instead keep track of each link on the page.

I think this model more closely represents a web page, since each web page has zero or more links to other pages, but there's no in-built concept of pages that link to this page.

I guess the question we should ask is which representation is more useful when displaying the results?

Given that we are looking to include this program potentially as a browser plugin or as a web service, where a user can ask to scan a single page or a whole site, I'm leaning towards the latter: representing each webpage as a collection of links.

If we do this, we could potentially present the whole website as a graph, with nodes as WebPage and edges as links. I think it would be neat to look at, at the very least.
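A rough sketch of how that graph view could be produced, assuming each page's outgoing links are available as a simple mapping (the names and Graphviz DOT output here are illustrative, not anything the project currently does):

```python
# Sketch: model the crawl as a directed graph where each page node
# stores its outgoing links, then emit Graphviz DOT so the whole
# site can be rendered visually (nodes = pages, edges = links).

def to_dot(links):
    """links: dict mapping page URL -> list of URLs it links to."""
    lines = ["digraph site {"]
    for src, targets in links.items():
        for dst in targets:
            lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

site = {
    "/index": ["/about", "/contact"],
    "/about": ["/index"],
}
print(to_dot(site))
```

Broken links could be styled differently (e.g. red edges) in the same DOT output.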

@ethanp

Collaborator

ethanp commented Jun 19, 2016

Yeah, though the way it is now, each URL goes into a global map and can only be checked once, so maybe for efficiency we should keep the pages map to prevent duplicate work.

@Kyle-Falconer

Member

Kyle-Falconer commented Jun 19, 2016

Yeah, we should keep the same behavior, just append the current webpage to the target webpage's referenced_by list.
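The agreed behavior might be sketched like this, assuming a dict-based page record (all names here are illustrative, not the project's actual identifiers):

```python
# Sketch: the global pages map still guarantees each URL is crawled
# only once, but every discovery appends the referring page to the
# target's referenced_by list.

pages = {}  # url -> page record

def discover(current_url, target_url):
    record = pages.get(target_url)
    if record is None:
        # First sighting: create the record so it gets crawled once.
        record = {"url": target_url, "referenced_by": [], "crawled": False}
        pages[target_url] = record
    # Always record the referring page, even on repeat sightings.
    record["referenced_by"].append(current_url)
    # Return True if the target still needs to be crawled.
    return not record["crawled"]

discover("/A", "/Z")
needs_crawl = discover("/B", "/Z")
```

Crawling work is not duplicated, but the report can still list every page that referenced the broken URL.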

Kyle-Falconer added a commit that referenced this issue Jun 19, 2016

Fix for issue #8.
All links coming in will now be added to linkedFromPages, which is
represented in the final output by the "referenced_by" value.