Not all links in a page is crawled and indexed #1157

Closed
riceming opened this issue Jul 11, 2017 · 2 comments

@riceming

I have a site page:
http://www.successug.com/en/ir/announcements.php

The page contains a list of PDF links, but only some of them are crawled and indexed. What is the problem?

fess-crawler.log

2017-07-11 12:13:54,443 [WebFsCrawler] INFO Target URL: http://www.successug.com/en/ir/announcements.php
2017-07-11 12:13:54,443 [WebFsCrawler] INFO Included URL: http://www.successug.com/en/ir/announcements.php
2017-07-11 12:13:54,443 [WebFsCrawler] INFO Included URL: http://www.irasia.com/listco/hk/successug/.*
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Included URL: http://203.194.162.10/listco/hk/successug/.*
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Included URL: http://file.irasia.com/listco/hk/successug/.*
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Included URL: http://202.66.146.82/listco/hk/successug/.*
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Included URL: http://doc.irasia.com/irasiafile/pdf/listco/hk/successug/.*
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Included URL: http://47.52.45.56/irasiafile/pdf/listco/hk/successug/.*
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Excluded URL: .*print=Y
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Excluded URL: .*mp3
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Excluded URL: .*jpg
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Excluded URL: .*gif
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Excluded URL: .*png
2017-07-11 12:13:54,445 [WebFsCrawler] INFO Excluded URL: .*mp4
2017-07-11 12:13:54,626 [Crawler-20170711121345-1-1] INFO Crawling URL: http://www.successug.com/en/ir/announcements.php
2017-07-11 12:13:54,689 [Crawler-20170711121345-1-1] INFO Checking URL: http://www.successug.com/robots.txt
2017-07-11 12:14:04,479 [IndexUpdater] INFO Processing 1/1 docs (Doc:{access 5ms}, Mem:{used 99MB, heap 151MB, max 495MB})
2017-07-11 12:14:04,541 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms, cleanup 22ms}, Mem:{used 101MB, heap 151MB, max 495MB})
2017-07-11 12:14:04,662 [IndexUpdater] INFO Sent 1 docs (Doc:{process 36ms, send 121ms, size 12KB}, Mem:{used 105MB, heap 151MB, max 495MB})
2017-07-11 12:14:05,158 [Crawler-20170711121345-1-4] INFO Crawling URL: http://www.irasia.com/listco/hk/successug/announcement/a170707.pdf
2017-07-11 12:14:05,158 [Crawler-20170711121345-1-4] INFO Checking URL: http://www.irasia.com/robots.txt
2017-07-11 12:14:05,158 [Crawler-20170711121345-1-2] INFO Crawling URL: http://file.irasia.com/listco/hk/successug/annual/2016/agm.pdf
2017-07-11 12:14:05,159 [Crawler-20170711121345-1-2] INFO Checking URL: http://file.irasia.com/robots.txt
2017-07-11 12:14:05,159 [Crawler-20170711121345-1-3] INFO Crawling URL: http://file.irasia.com/listco/hk/successug/announcement/a170619.pdf
2017-07-11 12:14:05,159 [Crawler-20170711121345-1-5] INFO Crawling URL: http://www.irasia.com/listco/hk/successug/announcement/a171837-ew_00487ann_20032017.pdf
2017-07-11 12:14:05,783 [Thread-3] WARN Building on-disk font cache, this may take a while
2017-07-11 12:14:05,784 [Thread-3] WARN Finished building on-disk font cache, found 0 fonts
2017-07-11 12:14:05,784 [Thread-3] WARN Using fallback font 'LiberationSans' for 'Times New Roman,Italic'
2017-07-11 12:14:05,822 [Thread-3] WARN Using fallback font 'LiberationSans' for 'Times New Roman'
2017-07-11 12:14:05,826 [Thread-3] WARN Using fallback font 'LiberationSans' for 'Times New Roman,Bold'
2017-07-11 12:14:05,856 [Thread-3] INFO OpenType Layout tables used in font Times New Roman are not implemented in PDFBox and will be ignored
2017-07-11 12:14:14,473 [IndexUpdater] INFO Processing 4/4 docs (Doc:{access 4ms, cleanup 22ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:14:14,533 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 20ms}, Mem:{used 120MB, heap 189MB, max 495MB})
2017-07-11 12:14:14,578 [IndexUpdater] INFO Sent 4 docs (Doc:{process 38ms, send 44ms, size 328KB}, Mem:{used 120MB, heap 189MB, max 495MB})
2017-07-11 12:14:16,478 [Crawler-20170711121345-1-3] INFO Crawling URL: http://www.irasia.com/listco/hk/successug/announcement/a175963-ew_00487ann_votingresults_2017.pdf
2017-07-11 12:14:16,546 [Thread-7] WARN Using fallback font LiberationSans for base font Times-Roman
2017-07-11 12:14:16,547 [Thread-7] WARN Using fallback font LiberationSans for base font Times-Bold
2017-07-11 12:14:16,548 [Thread-7] WARN Using fallback font LiberationSans for base font Times-Italic
2017-07-11 12:14:16,548 [Thread-7] WARN Using fallback font LiberationSans for base font Times-BoldItalic
2017-07-11 12:14:16,548 [Thread-7] WARN Using fallback font LiberationSans for base font Helvetica
2017-07-11 12:14:16,548 [Thread-7] WARN Using fallback font LiberationSans for base font Helvetica-Bold
2017-07-11 12:14:16,549 [Thread-7] WARN Using fallback font LiberationSans for base font Helvetica-Oblique
2017-07-11 12:14:16,549 [Thread-7] WARN Using fallback font LiberationSans for base font Helvetica-BoldOblique
2017-07-11 12:14:16,549 [Thread-7] WARN Using fallback font LiberationSans for base font Courier
2017-07-11 12:14:16,549 [Thread-7] WARN Using fallback font LiberationSans for base font Courier-Bold
2017-07-11 12:14:16,550 [Thread-7] WARN Using fallback font LiberationSans for base font Courier-Oblique
2017-07-11 12:14:16,550 [Thread-7] WARN Using fallback font LiberationSans for base font Courier-BoldOblique
2017-07-11 12:14:16,551 [Thread-7] WARN Using fallback font LiberationSans for base font Symbol
2017-07-11 12:14:16,552 [Thread-7] WARN Using fallback font LiberationSans for base font ZapfDingbats
2017-07-11 12:14:16,552 [Thread-7] WARN Using fallback font LiberationSans for Times-Roman
2017-07-11 12:14:16,554 [Thread-7] WARN Using fallback font LiberationSans for Times-Italic
2017-07-11 12:14:16,557 [Thread-7] WARN Using fallback font LiberationSans for Times-Bold
2017-07-11 12:14:24,473 [IndexUpdater] INFO Processing 1/1 docs (Doc:{access 4ms, cleanup 20ms}, Mem:{used 117MB, heap 189MB, max 495MB})
2017-07-11 12:14:24,495 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 117MB, heap 189MB, max 495MB})
2017-07-11 12:14:24,516 [IndexUpdater] INFO Sent 1 docs (Doc:{process 7ms, send 21ms, size 106KB}, Mem:{used 117MB, heap 189MB, max 495MB})
2017-07-11 12:14:34,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 117MB, heap 189MB, max 495MB})
2017-07-11 12:14:44,472 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:14:54,472 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:15:04,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:15:14,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:15:24,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:15:34,472 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:15:44,473 [IndexUpdater] INFO Processing no docs (Doc:{access 4ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:15:54,472 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:16:04,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:16:14,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:16:24,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:16:34,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:16:44,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:16:54,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:17:04,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:17:14,473 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:17:24,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:17:34,472 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:17:44,473 [IndexUpdater] INFO Processing no docs (Doc:{access 4ms, cleanup 13ms}, Mem:{used 120MB, heap 189MB, max 495MB})
2017-07-11 12:17:54,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 120MB, heap 189MB, max 495MB})
2017-07-11 12:17:56,674 [WebFsCrawler] INFO [EXEC TIME] crawling time: 242399ms
2017-07-11 12:18:04,472 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 120MB, heap 189MB, max 495MB})
2017-07-11 12:18:04,472 [IndexUpdater] INFO [EXEC TIME] index update time: 414ms
2017-07-11 12:18:04,514 [main] INFO Finished Crawler
2017-07-11 12:18:04,547 [main] INFO [CRAWL INFO] DataCrawlEndTime=2017-07-11T12:13:54.271+0800,CrawlerEndTime=2017-07-11T12:18:04.515+0800,WebFsCrawlExecTime=242399,CrawlerStatus=true,CrawlerStartTime=2017-07-11T12:13:54.207+0800,WebFsCrawlEndTime=2017-07-11T12:18:04.514+0800,WebFsIndexExecTime=414,WebFsIndexSize=6,CrawlerExecTime=250308,DataCrawlStartTime=2017-07-11T12:13:54.242+0800,WebFsCrawlStartTime=2017-07-11T12:13:54.241+0800
2017-07-11 12:18:09,574 [main] INFO Disconnected to elasticsearch:localhost:9300
2017-07-11 12:18:11,746 [main] INFO Destroyed LaContainer.

URLs such as http://file.irasia.com/listco/hk/successug/annual/2016/res.pdf appear on the page but are not crawled or indexed.

In fess_config.properties, crawler.document.html.cannonical.xpath is set to empty:
crawler.document.html.cannonical.xpath=
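
One way to rule out the include/exclude filters as the cause is to test the missing URL against the patterns printed in the log. The following is a minimal Python sketch, not Fess code: the function name `passes_url_filters` is hypothetical, and it assumes each pattern must match the whole URL.

```python
import re

# Patterns copied from the "Included URL" / "Excluded URL" lines in the log.
INCLUDES = [
    r"http://www.successug.com/en/ir/announcements.php",
    r"http://www.irasia.com/listco/hk/successug/.*",
    r"http://203.194.162.10/listco/hk/successug/.*",
    r"http://file.irasia.com/listco/hk/successug/.*",
    r"http://202.66.146.82/listco/hk/successug/.*",
    r"http://doc.irasia.com/irasiafile/pdf/listco/hk/successug/.*",
    r"http://47.52.45.56/irasiafile/pdf/listco/hk/successug/.*",
]
EXCLUDES = [r".*print=Y", r".*mp3", r".*jpg", r".*gif", r".*png", r".*mp4"]


def passes_url_filters(url, includes=INCLUDES, excludes=EXCLUDES):
    """Return True if url matches an include pattern and no exclude pattern.

    Assumption: patterns are applied as whole-string matches.
    """
    included = any(re.fullmatch(p, url) for p in includes)
    excluded = any(re.fullmatch(p, url) for p in excludes)
    return included and not excluded


if __name__ == "__main__":
    missing = "http://file.irasia.com/listco/hk/successug/annual/2016/res.pdf"
    print(passes_url_filters(missing))
```

Under that assumption the missing URL passes the filters, which suggests the cause lies elsewhere.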

@marevol
Contributor

marevol commented Jul 11, 2017

http://file.irasia.com/robots.txt disallows all files.
To make the crawler ignore robots.txt, set crawler.ignore.robots.txt to true in fess_config.properties:

crawler.ignore.robots.txt=true
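
To see how a disallow-all robots.txt blocks a URL, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The `ROBOTS_TXT` body is an assumed example of a file that disallows everything, not the actual content served by file.irasia.com, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed example of a robots.txt that disallows all paths for every agent.
# This is NOT the real content of http://file.irasia.com/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://file.irasia.com/listco/hk/successug/annual/2016/res.pdf"
    print(is_allowed(ROBOTS_TXT, url))  # blocked by "Disallow: /"
    print(is_allowed("", url))          # an empty robots.txt allows everything
```

A crawler that honors robots.txt would skip such a URL even though it matches an include pattern, which is why setting crawler.ignore.robots.txt=true changes the result.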

@riceming
Author

Thanks a lot, issue solved.
