Implemented issue #93 - "Manually Specify Paths" #197

Closed
wants to merge 16 commits into from

@bobbajs bobbajs commented Mar 27, 2013

EECE310 L2A2 Group, implementing Issue #93

Crawljax can now crawl additional paths specified through the alsoCrawl() function in CrawljaxConfigurationBuilder.

Short description regarding the implementation of this feature:
Instead of storing the seed URL in a single member variable, we replaced that variable with an ArrayList of URLs. Each call to alsoCrawl() adds the URL specified by the user to this ArrayList.
WorkQueue is no longer final, since it needs to crawl another URL once it is done crawling the seed URL.
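
A minimal sketch of the idea described above, not the actual Crawljax code. The class name `SeedConfigBuilder` and its `build()` method are hypothetical stand-ins for illustration; only the `alsoCrawl()` name, the String and URL overloads, and the list of URLs come from this pull request.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the configuration builder (assumption, not Crawljax's class).
final class SeedConfigBuilder {
    // The first entry is the seed URL; later entries are added through alsoCrawl().
    private final List<URL> urls = new ArrayList<>();

    SeedConfigBuilder(String seedUrl) {
        urls.add(toUrl(seedUrl));
    }

    // String overload: parse the text, then delegate to the URL overload.
    SeedConfigBuilder alsoCrawl(String url) {
        return alsoCrawl(toUrl(url));
    }

    // URL overload: append to the list and return this builder for chaining.
    SeedConfigBuilder alsoCrawl(URL url) {
        urls.add(url);
        return this;
    }

    // Returns a read-only copy of the URLs in the order they were added.
    List<URL> build() {
        return Collections.unmodifiableList(new ArrayList<>(urls));
    }

    // Converts text to a URL, rethrowing a malformed URL as an unchecked exception.
    private static URL toUrl(String text) {
        try {
            return URI.create(text).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + text, e);
        }
    }
}
```

With this shape, a caller would write something like `new SeedConfigBuilder("http://my.blog.com").alsoCrawl("http://my.othersite.com").build()` and receive both URLs in insertion order.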

Should there be any questions / concerns, please let us know.

bobbajs and others added 16 commits March 26, 2013 12:43
Added simpleExample
Added ArrayList<URL> urls to CrawljaxConfiguration
Is it because of new feature implemented? or maybe I forgot to port
some code...
crawljax can crawl different URL by calling alsoCrawl().
Needs refactoring.
Before, only strings were accepted for crawling additional sites; now URLs
can be entered too.
Removing print statements that were used for debugging purposes.
Removing unused functions
Checks that it is possible to build the CrawljaxController after adding
a second url as a url and as a string.
Merging diana-new branch with master
@alexnederlof
Contributor

There's one conceptual problem with this solution: you won't see the links between two sites. For example: I specify to crawl my.blog.com, which links to my.othersite.com and vice versa. Then I want the crawler to show the links between the two. Using the solution provided, it will first crawl the first site and then the second, but it will never cross over to the other one, which prevents me from inspecting the relation between the two.
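
To make the concern concrete, here is a rough sketch of the alternative: one shared frontier seeded with every URL, so a link from one site into the other gets recorded instead of dropped. This is only an illustration under simplifying assumptions. `SharedFrontierSketch`, `crossSiteLinks`, and the in-memory link map are hypothetical, each site is reduced to a single page named by its host, and nothing here reflects how Crawljax's WorkQueue actually works.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not part of Crawljax.
final class SharedFrontierSketch {
    // Walks an in-memory link map breadth-first, starting from all seeds at once.
    // Returns each "from -> to" link whose two ends are different seeded sites.
    static List<String> crossSiteLinks(Map<String, List<String>> links, List<String> seeds) {
        Set<String> allowed = new LinkedHashSet<>(seeds);  // sites we are permitted to crawl
        Set<String> visited = new LinkedHashSet<>(seeds);  // sites already queued or processed
        Deque<String> frontier = new ArrayDeque<>(seeds);  // single queue shared by all seeds
        List<String> found = new ArrayList<>();

        while (!frontier.isEmpty()) {
            String from = frontier.poll();
            for (String to : links.getOrDefault(from, Collections.<String>emptyList())) {
                if (!allowed.contains(to) || to.equals(from)) {
                    continue;  // skip links to unknown sites and self-links
                }
                found.add(from + " -> " + to);  // record the cross-site relation
                if (visited.add(to)) {
                    frontier.add(to);  // first time seen: schedule it for crawling
                }
            }
        }
        return found;
    }
}
```

A crawl that handles each seed in isolation would treat the other site as out of scope and drop these links; the shared `allowed` set is what lets the relation between the two sites show up.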
