
Manually specify paths #93

Open
alexnederlof opened this issue Jan 2, 2013 · 5 comments

@alexnederlof
Contributor

Right now you can only specify elements to click by id or XPath. It would be nice to also allow the crawler to visit certain URLs. This has the advantage that it's easier to configure than XPath and it might be more stable. It also allows the crawler to find places that are not reachable via click paths.

@ghost assigned alexnederlof Jan 2, 2013
@wesleytsai

Hi Alex, my ECE310 group would like to work on this issue as our project. Can you expand on what is meant by certain URLs?

For example, does this imply that you'd like it to scan for direct URLs, say in a blog post or a forum post, and for Crawljax to be able to click them?

Can we implement this by scanning the entire HTML for URLs, say by looking for the token "http:", and adding each one as a new clickable URL object (or simply a string) in Crawljax?

@alexnederlof
Contributor Author

Crawljax starts from one seed URL: the URL you give it in the configuration. You can specify that Crawljax clicks certain elements to get to every state you want. However, it may not reach a certain state that is only reachable via a specific URL. For example, you observe that it doesn't crawl http://mysite.com/someplace/.

It would be nice if you could configure the crawler to also crawl that URL, and not only the seed URL. It would require an extra builder method, something like config.alsoCrawl(theUrl).

Crawljax already extracts all the URLs from the HTML; that is not the issue here. The problem is that the URL we're looking for may only be reachable via a certain state that Crawljax can't access.
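
A minimal sketch of how that could look, assuming the CrawljaxConfiguration builder API; alsoCrawl is the proposed method and does not exist in Crawljax today:

```java
import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;

public class ExtraUrlExample {
    public static void main(String[] args) {
        // Regular seed URL, as today.
        CrawljaxConfigurationBuilder builder =
                CrawljaxConfiguration.builderFor("http://mysite.com/");

        // Proposed addition (hypothetical method): also visit this URL,
        // even if no click path from the seed URL ever leads to it.
        builder.alsoCrawl("http://mysite.com/someplace/");

        new CrawljaxRunner(builder.build()).call();
    }
}
```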

@wesleytsai

We're trying to understand what would be the most beneficial implementation of this issue.

Should the new URL simply act as another seed, where we crawl the same number of states and the same depth as for the initial URL? Or should it be one of the states under the initial URL, e.g. after we've finished crawling 4 states, we end up at the newly given URL?

@thc202
Contributor

thc202 commented Aug 24, 2016

As users of Crawljax, we might have multiple paths/URLs, but they might not be exactly known beforehand (e.g. a regex is provided instead of a well-formed URL). [1]

Would it be possible, as part of this issue, to allow control over which URLs are considered to be in the crawl scope?
For example, config.crawlWithScope(new MyScope());, with MyScope implementing an interface that is used to decide whether a URL/site is valid to keep crawling.

If not as part of this issue, is this something that you think is worth adding? At the moment, we are making the changes directly in the Crawler class to allow any URL to be crawled (which we then pass through or reject based on custom criteria).

[1] zaproxy/zap-extensions#468
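
A rough sketch of what such a scope hook could look like; the CrawlScope interface, isInScope method, and crawlWithScope builder call are all hypothetical names, not part of the current Crawljax API:

```java
import java.net.URI;
import java.util.regex.Pattern;

/** Hypothetical callback that decides whether a discovered URL stays in the crawl scope. */
interface CrawlScope {
    boolean isInScope(URI url);
}

/** Example implementation: only keep URLs whose host matches a regex. */
class MyScope implements CrawlScope {
    private final Pattern allowedHosts = Pattern.compile("(.*\\.)?example\\.com");

    @Override
    public boolean isInScope(URI url) {
        return url.getHost() != null && allowedHosts.matcher(url.getHost()).matches();
    }
}

// Usage with the proposed (hypothetical) builder method:
// config.crawlWithScope(new MyScope());
```

The crawler would call isInScope(url) for every candidate URL before following it, so the pass-through/reject logic currently patched into the Crawler class could live in a pluggable implementation instead.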

@amesbah
Member

amesbah commented Aug 24, 2016

This sounds like a nice addition. Feel free to submit a pull request, which we will include in the next release (if accepted).
