Manually specify paths #93
Hi Alex, my ECE310 group would like to work on this issue as our project. Can you expand on what's meant by certain URLs? For example, does this imply that you'd like it to scan for direct URLs, say in a blog post or a forum post, and for Crawljax to be able to click them? Could we implement this by scanning the entire HTML for URLs, say using the token "http:", and adding each one as a new clickable URL object (or simply a string) in Crawljax?
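For reference, the token-scanning idea described above could look something like the following plain-Java sketch. The class and method names are made up for illustration, and, as the reply below notes, Crawljax already performs its own URL extraction from the DOM, so this is only a sketch of the concept:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pull absolute http(s) URLs out of a raw HTML
// string with a regex, as suggested in the comment above. This is not
// part of Crawljax; it only illustrates the "scan for http:" idea.
public class UrlExtractor {
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    public static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```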
Crawljax starts from one seed URL: the URL you give it in the configuration. You can specify that Crawljax clicks certain elements to reach every state you want. However, it may not reach a certain state that is reachable via a certain URL; for example, you might observe that it doesn't crawl a particular page. It would be nice if you could configure the crawler to also crawl that URL, not only the seed URL. This would require an extra builder parameter. Crawljax already extracts all the URLs from the HTML; that is not the issue here. The issue is that the URL we're looking for may only be reachable via a certain state that Crawljax can't access.
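The behaviour being proposed could be sketched as a crawl frontier initialised with the seed plus any extra URLs the user configures, instead of the seed alone. All names below are hypothetical plain-Java stand-ins, not Crawljax's real builder API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: the crawl frontier starts from the configured
// seed URL plus any extra URLs, so states reachable only via those
// extra URLs are still visited. Not Crawljax's actual implementation.
public class MultiSeedFrontier {
    private final Deque<String> frontier = new ArrayDeque<>();
    private final Set<String> seen = new LinkedHashSet<>();

    public MultiSeedFrontier(String seedUrl, String... extraUrls) {
        enqueue(seedUrl);
        for (String url : extraUrls) {
            enqueue(url);
        }
    }

    // Queue a URL unless it was already queued or visited.
    public void enqueue(String url) {
        if (seen.add(url)) {
            frontier.addLast(url);
        }
    }

    // Next URL the crawler should visit, or null when the frontier is empty.
    public String next() {
        return frontier.pollFirst();
    }
}
```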
We're trying to understand what the most beneficial implementation of this issue would be. Should the new URL simply act as another seed, where we crawl the same number of states and depths as for the initial URL? Or should it be one of the states under the initial URL, say, after we've finished crawling four states and end up at the new given URL?
As users of Crawljax, we might have multiple paths/URLs, but they might not be exactly known beforehand (e.g. a regex is provided instead of a well-formed URL). Would it be possible, as part of this issue, to allow control over which URLs are considered under the crawl scope? If it's not part of this issue, is this something you think is worth adding? At the moment, we are making changes directly in the Crawler class to allow any URL to be crawled (which we then pass through or reject based on custom criteria).
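A regex-based scope check of the kind described above could be sketched like this in plain Java: a URL is accepted only if it matches one of the configured scope patterns. The class and method names are hypothetical, not part of Crawljax:

```java
import java.util.regex.Pattern;

// Hypothetical crawl-scope filter: a URL is crawled only if it matches
// one of the user-supplied regexes, as the comment above proposes.
public class CrawlScope {
    private final Pattern[] patterns;

    public CrawlScope(String... regexes) {
        patterns = new Pattern[regexes.length];
        for (int i = 0; i < regexes.length; i++) {
            patterns[i] = Pattern.compile(regexes[i]);
        }
    }

    // True if the URL falls inside the configured scope.
    public boolean inScope(String url) {
        for (Pattern p : patterns) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }
}
```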
This sounds like a nice addition. Feel free to submit a pull request, which we will include in the next release (if accepted). |
Right now you can only specify elements to click by id or XPath. It would be nice to also allow the crawler to visit certain URLs. This has the advantage that it's easier to configure than XPath, and it might be more stable. It also allows a crawler to find places that are hidden behind click paths.