
Ability to add metadata to crawl queue #122

Closed
GoogleCodeExporter opened this issue Mar 26, 2015 · 5 comments

Comments

@GoogleCodeExporter

I am currently using Abot to crawl a CMS site. The data from the crawls are
used to monitor site status, as well as to generate a report of failing links
so they can be fixed by the owner of the site.

However, in order to generate this report, I need to know more about the pages
to be crawled (such as the link text of the anchor pointing to the page, and
whether or not the link points to an image). I have modified my own code to
crawl ILinkInfo objects instead of Uris. Would this be something that could be
included in the main source? I can implement it and open a pull request on
GitHub if this would be nice to have in the main branch.
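
The kind of link metadata described here could be modeled roughly as follows. This is a hypothetical sketch only; the actual ILinkInfo implementation is in the files attached to this issue, and the member names below are assumptions for illustration:

```csharp
using System;

// Hypothetical sketch of the link-metadata idea described above.
// The real ILinkInfo used by the reporter may differ; these member
// names are illustrative only.
public interface ILinkInfo
{
    Uri TargetUri { get; }       // the page the link points to
    string AnchorText { get; }   // text of the <a> element, for link reports
    bool IsImageLink { get; }    // true if the anchor wraps an <img>
}

public class LinkInfo : ILinkInfo
{
    public Uri TargetUri { get; set; }
    public string AnchorText { get; set; }
    public bool IsImageLink { get; set; }
}
```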

Original issue reported on code.google.com by d.st...@gmail.com on 18 Dec 2013 at 12:17

@GoogleCodeExporter

[deleted comment]

@GoogleCodeExporter

I love the idea of having more information about the links, but I am hesitant
to add any more parsing than needs to happen, since most people wouldn't need
the link text or need to know whether the link was an image. Can you first
attach the implementation that actually fills/returns the list of ILinkInfo
objects so I can take a quick look?

Thank you for offering your code!

Original comment by sjdir...@gmail.com on 18 Dec 2013 at 9:40

  • Changed state: Accepted

@GoogleCodeExporter

My code is here. I have changed the HyperLinkParser to return a list of
ILinkInfo objects instead of the list of Uris it currently returns. In
addition, I have changed the interface for the PageRequester to crawl
PageToCrawl objects directly instead of the Uri objects it currently accepts.
When I want to crawl extra metadata, I can then implement my own ILinkInfo and
update my own HyperLinkParser accordingly. The only remaining work would be to
implement something like a PageToCrawl.Bag for storing the metadata. I have
done this the ugly way locally (by just modifying the PageToCrawl class), so I
am not sharing that code. Also, I have not updated the CsQueryHyperLinkParser,
as I am using the HAP parser. :)

I don't know if this is the best way of implementing the described
functionality, but I have made an attempt at least, so just let me know if you
like it. :) I haven't tested it, but I assume it will work just fine. :)
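
The interface change described above might look roughly like this (a hedged sketch: Abot's real IHyperLinkParser returns plain Uris from a CrawledPage, and the proposed signature below is an assumption based on this comment, not the attached code):

```csharp
using System;
using System.Collections.Generic;
using Abot.Poco;

// Sketch of the proposed change: instead of extracting bare Uris,
// the parser would return richer ILinkInfo objects. The exact shape
// of the attached implementation may differ.
public interface IHyperLinkParser
{
    // Current shape (roughly): IEnumerable<Uri> GetLinks(CrawledPage crawledPage);
    // Proposed shape: link objects carrying extra metadata such as anchor text.
    IEnumerable<ILinkInfo> GetLinks(CrawledPage crawledPage);
}
```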

Modified files are attached.

Original comment by d.st...@gmail.com on 19 Dec 2013 at 7:53



@GoogleCodeExporter

FYI, v1.2.3 already has a PageToCrawl.PageBag of dynamic expando type.

I'll take a look at your impl and get back to you. Thanks again.
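
Because PageBag is a dynamic expando, custom metadata can be attached without modifying or subclassing PageToCrawl. A minimal usage sketch (the property names LinkText and IsImageLink are made up for illustration, not part of Abot):

```csharp
using System;
using Abot.Poco;

// Minimal sketch of using PageToCrawl.PageBag (a dynamic expando)
// to carry custom metadata through the crawl. Property names are
// illustrative, not part of Abot itself.
PageToCrawl page = new PageToCrawl(new Uri("http://example.com/about"));
page.PageBag.LinkText = "About us";
page.PageBag.IsImageLink = false;

// Later, e.g. when processing the crawled page, the metadata can be read back:
string linkText = page.PageBag.LinkText;
```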

Original comment by sjdir...@gmail.com on 19 Dec 2013 at 6:04


@GoogleCodeExporter

As of right now, I don't think I will pull your changes into the product, for
the reasons I stated above. However, I may change my position in the future.
Thanks for offering your implementation; your time is appreciated.

Original comment by sjdir...@gmail.com on 30 Dec 2013 at 3:12

  • Changed state: WontFix
