fess-crawler.log
2017-07-11 12:13:54,443 [WebFsCrawler] INFO Target URL: http://www.successug.com/en/ir/announcements.php
2017-07-11 12:13:54,443 [WebFsCrawler] INFO Included URL: http://www.successug.com/en/ir/announcements.php
2017-07-11 12:13:54,443 [WebFsCrawler] INFO Included URL: http://www.irasia.com/listco/hk/successug/.*
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Included URL: http://203.194.162.10/listco/hk/successug/.*
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Included URL: http://file.irasia.com/listco/hk/successug/.*
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Included URL: http://202.66.146.82/listco/hk/successug/.*
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Included URL: http://doc.irasia.com/irasiafile/pdf/listco/hk/successug/.*
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Included URL: http://47.52.45.56/irasiafile/pdf/listco/hk/successug/.*
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Excluded URL: .*print=Y
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Excluded URL: .*mp3
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Excluded URL: .*jpg
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Excluded URL: .*gif
2017-07-11 12:13:54,444 [WebFsCrawler] INFO Excluded URL: .*png
2017-07-11 12:13:54,445 [WebFsCrawler] INFO Excluded URL: .*mp4
2017-07-11 12:13:54,626 [Crawler-20170711121345-1-1] INFO Crawling URL: http://www.successug.com/en/ir/announcements.php
2017-07-11 12:13:54,689 [Crawler-20170711121345-1-1] INFO Checking URL: http://www.successug.com/robots.txt
2017-07-11 12:14:04,479 [IndexUpdater] INFO Processing 1/1 docs (Doc:{access 5ms}, Mem:{used 99MB, heap 151MB, max 495MB})
2017-07-11 12:14:04,541 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms, cleanup 22ms}, Mem:{used 101MB, heap 151MB, max 495MB})
2017-07-11 12:14:04,662 [IndexUpdater] INFO Sent 1 docs (Doc:{process 36ms, send 121ms, size 12KB}, Mem:{used 105MB, heap 151MB, max 495MB})
2017-07-11 12:14:05,158 [Crawler-20170711121345-1-4] INFO Crawling URL: http://www.irasia.com/listco/hk/successug/announcement/a170707.pdf
2017-07-11 12:14:05,158 [Crawler-20170711121345-1-4] INFO Checking URL: http://www.irasia.com/robots.txt
2017-07-11 12:14:05,158 [Crawler-20170711121345-1-2] INFO Crawling URL: http://file.irasia.com/listco/hk/successug/annual/2016/agm.pdf
2017-07-11 12:14:05,159 [Crawler-20170711121345-1-2] INFO Checking URL: http://file.irasia.com/robots.txt
2017-07-11 12:14:05,159 [Crawler-20170711121345-1-3] INFO Crawling URL: http://file.irasia.com/listco/hk/successug/announcement/a170619.pdf
2017-07-11 12:14:05,159 [Crawler-20170711121345-1-5] INFO Crawling URL: http://www.irasia.com/listco/hk/successug/announcement/a171837-ew_00487ann_20032017.pdf
2017-07-11 12:14:05,783 [Thread-3] WARN Building on-disk font cache, this may take a while
2017-07-11 12:14:05,784 [Thread-3] WARN Finished building on-disk font cache, found 0 fonts
2017-07-11 12:14:05,784 [Thread-3] WARN Using fallback font 'LiberationSans' for 'Times New Roman,Italic'
2017-07-11 12:14:05,822 [Thread-3] WARN Using fallback font 'LiberationSans' for 'Times New Roman'
2017-07-11 12:14:05,826 [Thread-3] WARN Using fallback font 'LiberationSans' for 'Times New Roman,Bold'
2017-07-11 12:14:05,856 [Thread-3] INFO OpenType Layout tables used in font Times New Roman are not implemented in PDFBox and will be ignored
2017-07-11 12:14:14,473 [IndexUpdater] INFO Processing 4/4 docs (Doc:{access 4ms, cleanup 22ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:14:14,533 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 20ms}, Mem:{used 120MB, heap 189MB, max 495MB})
2017-07-11 12:14:14,578 [IndexUpdater] INFO Sent 4 docs (Doc:{process 38ms, send 44ms, size 328KB}, Mem:{used 120MB, heap 189MB, max 495MB})
2017-07-11 12:14:16,478 [Crawler-20170711121345-1-3] INFO Crawling URL: http://www.irasia.com/listco/hk/successug/announcement/a175963-ew_00487ann_votingresults_2017.pdf
2017-07-11 12:14:16,546 [Thread-7] WARN Using fallback font LiberationSans for base font Times-Roman
2017-07-11 12:14:16,547 [Thread-7] WARN Using fallback font LiberationSans for base font Times-Bold
2017-07-11 12:14:16,548 [Thread-7] WARN Using fallback font LiberationSans for base font Times-Italic
2017-07-11 12:14:16,548 [Thread-7] WARN Using fallback font LiberationSans for base font Times-BoldItalic
2017-07-11 12:14:16,548 [Thread-7] WARN Using fallback font LiberationSans for base font Helvetica
2017-07-11 12:14:16,548 [Thread-7] WARN Using fallback font LiberationSans for base font Helvetica-Bold
2017-07-11 12:14:16,549 [Thread-7] WARN Using fallback font LiberationSans for base font Helvetica-Oblique
2017-07-11 12:14:16,549 [Thread-7] WARN Using fallback font LiberationSans for base font Helvetica-BoldOblique
2017-07-11 12:14:16,549 [Thread-7] WARN Using fallback font LiberationSans for base font Courier
2017-07-11 12:14:16,549 [Thread-7] WARN Using fallback font LiberationSans for base font Courier-Bold
2017-07-11 12:14:16,550 [Thread-7] WARN Using fallback font LiberationSans for base font Courier-Oblique
2017-07-11 12:14:16,550 [Thread-7] WARN Using fallback font LiberationSans for base font Courier-BoldOblique
2017-07-11 12:14:16,551 [Thread-7] WARN Using fallback font LiberationSans for base font Symbol
2017-07-11 12:14:16,552 [Thread-7] WARN Using fallback font LiberationSans for base font ZapfDingbats
2017-07-11 12:14:16,552 [Thread-7] WARN Using fallback font LiberationSans for Times-Roman
2017-07-11 12:14:16,554 [Thread-7] WARN Using fallback font LiberationSans for Times-Italic
2017-07-11 12:14:16,557 [Thread-7] WARN Using fallback font LiberationSans for Times-Bold
2017-07-11 12:14:24,473 [IndexUpdater] INFO Processing 1/1 docs (Doc:{access 4ms, cleanup 20ms}, Mem:{used 117MB, heap 189MB, max 495MB})
2017-07-11 12:14:24,495 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 117MB, heap 189MB, max 495MB})
2017-07-11 12:14:24,516 [IndexUpdater] INFO Sent 1 docs (Doc:{process 7ms, send 21ms, size 106KB}, Mem:{used 117MB, heap 189MB, max 495MB})
2017-07-11 12:14:34,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 117MB, heap 189MB, max 495MB})
2017-07-11 12:14:44,472 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:14:54,472 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:15:04,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:15:14,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:15:24,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:15:34,472 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:15:44,473 [IndexUpdater] INFO Processing no docs (Doc:{access 4ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:15:54,472 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:16:04,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 118MB, heap 189MB, max 495MB})
2017-07-11 12:16:14,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:16:24,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:16:34,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:16:44,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:16:54,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:17:04,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:17:14,473 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:17:24,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:17:34,472 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 119MB, heap 189MB, max 495MB})
2017-07-11 12:17:44,473 [IndexUpdater] INFO Processing no docs (Doc:{access 4ms, cleanup 13ms}, Mem:{used 120MB, heap 189MB, max 495MB})
2017-07-11 12:17:54,471 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 120MB, heap 189MB, max 495MB})
2017-07-11 12:17:56,674 [WebFsCrawler] INFO [EXEC TIME] crawling time: 242399ms
2017-07-11 12:18:04,472 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms, cleanup 13ms}, Mem:{used 120MB, heap 189MB, max 495MB})
2017-07-11 12:18:04,472 [IndexUpdater] INFO [EXEC TIME] index update time: 414ms
2017-07-11 12:18:04,514 [main] INFO Finished Crawler
2017-07-11 12:18:04,547 [main] INFO [CRAWL INFO] DataCrawlEndTime=2017-07-11T12:13:54.271+0800,CrawlerEndTime=2017-07-11T12:18:04.515+0800,WebFsCrawlExecTime=242399,CrawlerStatus=true,CrawlerStartTime=2017-07-11T12:13:54.207+0800,WebFsCrawlEndTime=2017-07-11T12:18:04.514+0800,WebFsIndexExecTime=414,WebFsIndexSize=6,CrawlerExecTime=250308,DataCrawlStartTime=2017-07-11T12:13:54.242+0800,WebFsCrawlStartTime=2017-07-11T12:13:54.241+0800
2017-07-11 12:18:09,574 [main] INFO Disconnected to elasticsearch:localhost:9300
2017-07-11 12:18:11,746 [main] INFO Destroyed LaContainer.
I have a site page:
http://www.successug.com/en/ir/announcements.php
The page contains a list of PDFs, but only some of them are crawled and indexed. What is the problem?
The crawler log (fess-crawler.log) is shown above.
For example, a URL on the page such as http://file.irasia.com/listco/hk/successug/annual/2016/res.pdf is not crawled or indexed.
In fess_config.properties, the following property is set to empty:
crawler.document.html.cannonical.xpath=
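As a sanity check on the URL filters, here is a minimal Python sketch that tests the missing URL against the Included/Excluded patterns listed in the log. It is not Fess's actual implementation: `is_allowed` is a hypothetical helper, and it assumes each pattern must match the whole URL. Under that assumption, the filters do not appear to explain why res.pdf is skipped.

```python
import re

# Patterns copied from the "Included URL" / "Excluded URL" lines in fess-crawler.log.
INCLUDED = [
    r"http://www.successug.com/en/ir/announcements.php",
    r"http://www.irasia.com/listco/hk/successug/.*",
    r"http://203.194.162.10/listco/hk/successug/.*",
    r"http://file.irasia.com/listco/hk/successug/.*",
    r"http://202.66.146.82/listco/hk/successug/.*",
    r"http://doc.irasia.com/irasiafile/pdf/listco/hk/successug/.*",
    r"http://47.52.45.56/irasiafile/pdf/listco/hk/successug/.*",
]
EXCLUDED = [r".*print=Y", r".*mp3", r".*jpg", r".*gif", r".*png", r".*mp4"]


def is_allowed(url: str) -> bool:
    """Hypothetical helper: True if url matches an include pattern and no exclude pattern.

    Assumes full-match semantics; the real crawler may evaluate patterns differently.
    """
    included = any(re.fullmatch(p, url) for p in INCLUDED)
    excluded = any(re.fullmatch(p, url) for p in EXCLUDED)
    return included and not excluded


# The PDF that is linked from the page but never shows up in the log.
missing = "http://file.irasia.com/listco/hk/successug/annual/2016/res.pdf"
print(is_allowed(missing))
```

If the check passes, the next things to look at would be crawl settings other than the include/exclude patterns, since the URL itself is permitted by them.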