Links in onClick property not captured in WAT 'Links' metadata #8

Closed
e271828- opened this Issue Feb 16, 2017 · 6 comments

e271828- commented Feb 16, 2017

Some examples below:

div onclick="location.href='webpage.html'"

input type=button onClick="parent.location='index.html'" value='click here'

input type=button onClick="parent.open('http://www.x.com/')" value='new window'

input type=button onClick=window.open("button-child.php","demo","width=550,height=300,left=150,top=200,toolbar=0,status=0,"); value="Open child Window"

input type="button" value="Open" onclick="window.location.href='http://www.y.com/'"

sebastian-nagel Feb 22, 2017

Thanks for reporting this problem. The challenge is to extract the URL from the value of the onclick attribute, especially because quoting in embedded JavaScript isn't trivial, e.g.: onclick="window.open('http://example.com/', &#39;width=500');" We need to find a reliable solution, given that the onclick attribute is frequent and other event-handler attributes (onsubmit etc.) should ideally be covered as well.
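
To make the quoting problem concrete, here is a minimal sketch of a partial, regex-based heuristic. It is not the WAT extractor's actual logic; the function name `extract_onclick_links` and the pattern are assumptions for illustration only. It decodes character references first, so `&#39;` and `'` are treated alike, and then looks for a quoted string right after a `location` assignment or an `open(` call.

```python
import html
import re

# Hypothetical heuristic, not the WAT extractor's implementation:
# find a quoted string that directly follows a navigation-like
# assignment (location = ..., location.href = ...) or an open(...) call.
_JS_LINK = re.compile(
    r"""(?:location(?:\.href)?\s*=|(?:window|parent)\.open\s*\()\s*
        (["'])(?P<url>.+?)\1
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_onclick_links(attr_value: str) -> list[str]:
    """Return URL-like string literals found in an event-handler attribute."""
    # Decode character references so &#39; and ' are handled the same way.
    script = html.unescape(attr_value)
    return [m.group("url") for m in _JS_LINK.finditer(script)]
```

A heuristic like this only covers simple string literals; URLs built by concatenation or stored in variables would still be missed.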

e271828- May 24, 2017

Any further thoughts on this? Seems like a partial solution would still get you pretty far.

sebastian-nagel added a commit that referenced this issue Aug 23, 2017

WAT extractor: get links from onClick attributes, fixes #8
- extract links from JavaScript code snippets
  in onClick attributes of INPUT and DIV elements

sebastian-nagel Aug 24, 2017

Hi @e271828-, a significant portion of JavaScript onclick links (see unit test) will be included in the August crawl (CC-MAIN-2017-34). Thanks!

e271828- Aug 24, 2017

Thanks, @sebastian-nagel! That was my next question :)

Have you by any chance done an analysis of how this change increases URL counts? Quite curious to know the answer.

sebastian-nagel Aug 24, 2017

I've only verified it on a single WARC (CC-MAIN-20170629154125-20170629174125-00719.warc.gz): 3200 more links for 131,000 records (934,000 links before). Here is the overview of link "paths":

7777909 A@/href
1266284 IMG@/src
90022   STYLE/#text
82498   FORM@/action
30165   A@/data-href
29271   IFRAME@/src
12383   DIV@/data-href
9034    TD@/background
8339    AREA@/href
7932    SPAN@/data-href
7595    INPUT@/src
6296    IMG@/longdesc
2710    DIV@/onclick     <<<<<
2524    EMBED@/src
1521    TABLE@/background
1481    BUTTON@/data-href
1125    BLOCKQUOTE@/cite
995     OBJECT@/codebase
860     OBJECT@/data
608     SOURCE@/src
500     INPUT@/onclick      <<<<<
405     LI@/data-href
378     INPUT@/data-href
370     BODY@/background
351     LABEL@/data-href
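
An overview like the one above could be produced by tallying the `path` field of each link in the WAT metadata. The following is a rough sketch, assuming each WAT record's JSON payload has already been parsed into a dict and that links sit under `Envelope` > `Payload-Metadata` > `HTTP-Response-Metadata` > `HTML-Metadata` > `Links`; the function name `count_link_paths` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_link_paths(records: Iterable[Mapping]) -> Counter:
    """Tally link "paths" (e.g. A@/href) over parsed WAT JSON records."""
    counts: Counter = Counter()
    for record in records:
        # Assumed nesting of the WAT JSON; missing levels yield no links.
        links = (
            record.get("Envelope", {})
            .get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {})
            .get("HTML-Metadata", {})
            .get("Links", [])
        )
        counts.update(link["path"] for link in links if "path" in link)
    return counts


def format_overview(counts: Counter) -> str:
    """Render counts in descending order, one "count path" pair per line."""
    return "\n".join(f"{n:<8}{path}" for path, n in counts.most_common())
```

Running this once on the WAT output from before the change and once on the output after it, then comparing the two counters, would show which paths (such as DIV@/onclick) are new.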

e271828- Aug 24, 2017

Interesting, thanks Sebastian.
