Wivet patches #72

Merged
merged 3 commits into from Dec 16, 2012

4 participants

@amesbah
Member
amesbah commented Dec 15, 2012
  • cleaned up WIVET patch submitted by @guifre
  • also fixes issue #58
@guifre @alexnederlof guifre WIVET enhancements
-Added support for HTML frame tags.
-Included support for inspecting code within frame tags.
-Added support for crawling meta refresh tags.
-Added a new optional specification to carry out deeper analyses.
-Fixed bug causing crawljax not to update candidates after running the
PreStateCrawling plugin.
-Created a new test class that targets the WIVET benchmark.

For detailed information refer to the patch files.
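
To illustrate the meta refresh item, here is a minimal sketch of how a redirect target might be pulled out of the `content` attribute of a `<meta http-equiv="refresh">` tag. This is not the code from the patch: the class `MetaRefreshParser` and the method `targetUrl` are hypothetical names, and the regular expression only covers the common `"<seconds>; url=<target>"` form.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawljax or the submitted patch.
public class MetaRefreshParser {

    // Matches values such as "5; url=http://example.com/next".
    // Case-insensitive; the target may be wrapped in single or double quotes.
    private static final Pattern CONTENT = Pattern.compile(
            "^\\s*\\d+\\s*[;,]\\s*url\\s*=\\s*['\"]?([^'\"]+?)['\"]?\\s*$",
            Pattern.CASE_INSENSITIVE);

    // Returns the redirect target, or empty if the value only sets a reload delay.
    static Optional<String> targetUrl(String content) {
        if (content == null) {
            return Optional.empty();
        }
        Matcher m = CONTENT.matcher(content);
        return m.matches() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(targetUrl("0; url=next.php"));
        System.out.println(targetUrl("30"));
    }
}
```

A crawler could treat the returned target as one more candidate URL to visit, alongside links found in anchors and frames.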

We used a set of frameworks to validate the results of our
improvements. WIVET is the most notable one; it is widely used to
test crawlers. If you run the WivetTest class, you will see it crawls
up to 74% of the site. The trunk version of crawljax is only
able to crawl between 0% and 10%, depending on the targeted node (due to
the issues fixed in our patches).
256727f
@alexnederlof
Member

I imported these patches from @guifre. However, the main method he provided doesn't seem to work:

Caused by: org.xml.sax.SAXNotRecognizedException: Feature 'http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe' is not recognized.

Member

I confirm Arie's finding. Changing the browser to Firefox resolves the issue.

There is a newer version of WIVET (v3) available: http://code.google.com/p/wivet/

Would it be an idea to include the WIVET page in the embedded jetty server and write a test case with proper assertions (on the number of states, coverage statistics that WIVET provides)?

Member

@alexnederlof and I further discussed wivet yesterday at TU Delft.

  1. To me it seems that the present very nice contribution by @guifre can be merged (after changing to Firefox).
  2. We may want to report the HtmlUnit issue.
  3. The site hosting an instance of WIVET v3 seems unavailable quite often -- embedding it in the Jetty server sounds useful.
  4. I really like the idea of enriching the WIVET v3 test cases with proper assertions.
  5. The WIVET v3 test cases should be part of the 'largetests' -- crawljax would benefit from a good separation between a unit test suite that is instantaneous and a (slower) integration test suite.

If we agree we can turn the above list into separate issues and go for it.

Member

Great. Agreed on all items.

Item 5 is really needed to speed up the regular unit testing time. As a developer, you don't want to wait 5 minutes (give or take) every time the test suite is run.

@avandeursen

Could the SAXNotRecognizedException be an HtmlUnit issue?
If I change the browser type to Firefox, things start working for me.

@avandeursen

Perhaps these dontClicks are a bit outdated: the current WIVET site has URLs like

http://caos.uab.es/~gruiz/test/wivet/offscanpages/statistics.php

that should not be scanned (anything containing offscanpages).
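
The rule itself is simple enough to sketch as a plain URL filter. The class `OffScanFilter` and its methods are hypothetical and do not use the crawljax dontClick API; the second URL in `main` is an assumed example of a page that should still be scanned.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of crawljax.
public class OffScanFilter {

    // Marker taken from the URL quoted above.
    static final String OFF_SCAN_MARKER = "offscanpages";

    // A URL is eligible for scanning only if it does not contain the marker.
    static boolean shouldScan(String url) {
        return url != null && !url.contains(OFF_SCAN_MARKER);
    }

    // Keeps only the URLs that are eligible for scanning, preserving order.
    static List<String> filter(List<String> urls) {
        return urls.stream()
                .filter(OffScanFilter::shouldScan)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
                "http://caos.uab.es/~gruiz/test/wivet/offscanpages/statistics.php",
                // Assumed example of a regular page; the path is illustrative only.
                "http://caos.uab.es/~gruiz/test/wivet/pages/1.php");
        System.out.println(filter(candidates));
    }
}
```

In crawljax the same condition would be expressed through the crawl specification rather than a standalone filter.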

@avandeursen

This is wivet v2; v3 is available too.

@avandeursen avandeursen was assigned Dec 15, 2012
@alexnederlof alexnederlof merged commit 6010409 into master Dec 16, 2012

1 check passed

default The Travis build passed
@amesbah amesbah referenced this pull request Dec 19, 2012
Closed

Frames and Iframes #3
