made changes for NUTCH-2108 and formatted the previously unformatted … by asitang · Pull Request #62 · apache/nutch

asitang · 2015-09-21T21:34:26Z

…code for this plugin

MJJoyce · 2015-09-22T17:26:42Z

I was thinking about this last night. I think we may have missed a few points when we were talking about this previously. All the driver creation, cleanup, and content pulling is done in lib-selenium so we can use that functionality across plugins. I think we can add this functionality without many (or any) changes, though.

If you want to do multiple content extractions and include that with the body, the handler can do that by incrementally pulling content out of the page and appending it to (or replacing) the body of the fetched page. This effectively allows the handler to return whatever subset of data it wants, and it doesn't require us to make any changes. I think that's probably a reasonably clean way of handling the functionality.

Thoughts?

asitang · 2015-09-23T00:14:58Z

Do you mean we can keep appending the new content to the driver instance and return it??

MJJoyce · 2015-09-23T20:05:48Z

Hey @asitang,

If I'm remembering correctly, we were talking about wanting to pull content out of various parts of the page and append that to the body in the same interaction, correct? So in pseudocode:

public void processDriver(WebDriver driver) {
  String stuffWeCareAbout = ""
  for allInteractionsWeNeedToDo {
    driver.doInteraction()
    stuffWeCareAbout += fetchHTMLFromTheInteractionWeDid()
  }
  driver.appendToBody(stuffWeCareAbout)
}

Wouldn't this cover the use case we were looking to handle sufficiently? Or in other words, if we want to do a bunch of interactions that generate content on a page the workflow per-interaction is:

1. Do the interaction on the driver
2. Grab the content this generates that we care about and save it into a variable
3. Undo the interaction if necessary

Once all the interactions we care about are done, we append this content to the body (or completely replace the body even).

So imagine an example of a paginated table that dynamically loads content. This should handle what we're looking for, I think (again, pseudocode):

public void processDriver(WebDriver driver) {
  String paginatedTableContent = ""
  for tableInteractions {
    if (! onFirstTablePage)
      driver.clickPaginationButton()

    paginatedTableContent += driver.table.innerHTML
  }
  driver.appendToBody(paginatedTableContent)
}

Now when we process all the links coming out of this page they'll all be coming off the page with the table.

asitang · 2015-09-24T16:53:31Z

Yup, I got that part, Mike. But I don't think this is possible in Selenium: driver.appendToBody(stuffWeCareAbout)

MJJoyce · 2015-09-24T22:06:09Z

Hey, have you checked out https://selenium.googlecode.com/git/docs/api/java/org/openqa/selenium/JavascriptExecutor.html?

I think it might do what we're hoping to accomplish.
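
For reference, here is a minimal sketch of how `JavascriptExecutor` could stand in for the hypothetical `driver.appendToBody(...)` from the pseudocode above. Selenium's `JavascriptExecutor.executeScript(String, Object...)` runs a script in the page and exposes the extra arguments to it as `arguments[0]`, `arguments[1]`, and so on. The `ScriptRunner` interface, the `appendToBody` helper, and the wrapper `div` are assumptions made for illustration; the interface only mirrors the `executeScript` signature so the sketch compiles without Selenium on the classpath. None of this is code from the plugin.

```java
import java.util.ArrayList;
import java.util.List;

class AppendToBodySketch {

    /**
     * Hypothetical stand-in that mirrors the signature of
     * org.openqa.selenium.JavascriptExecutor#executeScript.
     */
    interface ScriptRunner {
        Object executeScript(String script, Object... args);
    }

    // JavaScript executed inside the page. arguments[0] is the HTML string
    // passed from Java; it is wrapped in a new <div> appended to <body>.
    static final String APPEND_SCRIPT =
        "var holder = document.createElement('div');"
      + "holder.innerHTML = arguments[0];"
      + "document.body.appendChild(holder);";

    /** Appends the collected HTML to the end of the page body. */
    static void appendToBody(ScriptRunner js, String html) {
        js.executeScript(APPEND_SCRIPT, html);
    }

    public static void main(String[] args) {
        // Fake runner that records calls instead of driving a browser.
        List<String> recorded = new ArrayList<>();
        ScriptRunner fake = (script, scriptArgs) -> {
            recorded.add(String.valueOf(scriptArgs[0]));
            return null;
        };

        // Content gathered across several interactions, as in the pseudocode.
        StringBuilder stuffWeCareAbout = new StringBuilder();
        stuffWeCareAbout.append("<tr><td>page 1</td></tr>");
        stuffWeCareAbout.append("<tr><td>page 2</td></tr>");

        appendToBody(fake, stuffWeCareAbout.toString());
        System.out.println("HTML handed to the page: " + recorded.get(0));
    }
}
```

With a real driver, something like `appendToBody(((JavascriptExecutor) driver)::executeScript, html)` should work, provided the driver supports JavaScript execution. Passing the HTML as a script argument rather than concatenating it into the script avoids having to escape quotes in the collected content.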

made changes for NUTCH-2108 and formatted the previously unformatted …

cd74235

…code for this plugin

asitang closed this Sep 24, 2015

asitang wants to merge 1 commit into apache:trunk from asitang:NUTCH-2108
